# California Housing — Machine Learning Exercise

In this exercise you will work through a basic machine learning workflow using the California Housing dataset.

This dataset is relatively small and easy to work with, which makes it a good way to get familiar with the tools. The goal is to predict median house values in California. Along the way, you will practice splitting data, fitting models, evaluating performance, and comparing approaches.

This notebook gives you the structure. You will fill in the details yourself. Explore and discover various approaches at each step, work together, and discuss. Make it interactive. Try to minimize copy-pasting of sections so that you really understand what's going on.

## 1. Setup

Import the libraries you need for data handling, modeling, and evaluation.

```python
from sklearn.datasets import fetch_california_housing

# Load dataset
data = fetch_california_housing(as_frame=True)
df = data.frame

df.head()
```

## 2. Explore the data

Take a look at the dataset. What features (X) are available? What is the target (y)?

- Inspect the shape of the data
- Check basic statistics and distributions
- Look for correlations and potential relationships
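
As a starting point, the three inspection steps above could look like the sketch below. It runs on a small synthetic stand-in frame so that the pattern is visible without giving away the real answers. The frame `toy`, the helper `explore`, and the made-up relationship between the columns are assumptions for illustration; in the notebook you would pass the `df` loaded in Setup instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real dataset); column names mirror three
# of the columns in the California Housing frame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "MedInc": rng.uniform(1, 10, size=200),
    "HouseAge": rng.uniform(1, 50, size=200),
})
# Made-up relationship: target driven by income plus noise.
toy["MedHouseVal"] = 0.4 * toy["MedInc"] + rng.normal(0, 0.5, size=200)


def explore(frame, target):
    """Return shape, summary statistics, and correlations with the target."""
    shape = frame.shape                       # (rows, columns)
    stats = frame.describe()                  # count, mean, std, quartiles
    corr = frame.corr()[target].sort_values(ascending=False)
    return shape, stats, corr


shape, stats, corr = explore(toy, "MedHouseVal")
print(shape)
print(stats.round(2))
print(corr.round(2))
```

Histograms (`frame.hist()`) are a natural next step for looking at distributions.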

## 3. Split the data

We need separate data for training and testing. Use a train/test split so you can evaluate the model on data it has never seen.

Tasks:
- Define `X` and `y`
- Split into `X_train`, `X_test`, `y_train`, `y_test`
- Keep the test set aside until the very end
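
One possible shape for these tasks is sketched below using scikit-learn's `train_test_split`. The stand-in frame `toy`, the 80/20 ratio, and `random_state=42` are assumptions for illustration, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "MedInc": rng.uniform(1, 10, size=100),
    "HouseAge": rng.uniform(1, 50, size=100),
    "MedHouseVal": rng.uniform(0.5, 5, size=100),
})

# Features are every column except the target.
X = toy.drop(columns="MedHouseVal")
y = toy["MedHouseVal"]

# Hold out 20% of the rows; fixing random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` matters when you compare models later: every model should see exactly the same training and test rows.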

## 4. Naive baseline

Before fitting any real model, start with the simplest possible predictor: always predict the mean of the training target.

This gives you a baseline MAE and RMSE. Any useful model should do better than this.
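
To make the two metrics concrete, here is a minimal sketch that computes them by hand for a mean predictor. The helper name `mean_baseline_errors` and the tiny made-up target lists are assumptions for illustration only.

```python
import math


def mean_baseline_errors(y_train, y_test):
    """Predict the training mean for every test row; return (MAE, RMSE)."""
    prediction = sum(y_train) / len(y_train)
    errors = [actual - prediction for actual in y_test]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Made-up target values, only to show the arithmetic.
toy_train = [1.0, 2.0, 3.0]   # training mean is 2.0
toy_test = [1.0, 4.0]         # errors are -1.0 and 2.0

mae, rmse = mean_baseline_errors(toy_train, toy_test)
print(mae, rmse)
```

Note that the mean comes from the training targets only. scikit-learn also ships `DummyRegressor(strategy="mean")`, which does the same thing behind the usual `fit`/`predict` interface.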

## 5. Linear regression

Fit a linear regression model to the training data.

- Train the model
- Predict on the test set
- Evaluate using MAE and RMSE

Compare these results to the naive baseline.
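
The fit, predict, and evaluate pattern might look like the sketch below. It uses `make_regression` to generate synthetic stand-in data, which is an assumption made so the example runs on its own; in the notebook you would use your own `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real dataset).
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then predict the held-out rows.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Naive baseline for comparison: always predict the training mean.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mae = mean_absolute_error(y_test, baseline_pred)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))

print(f"linear   MAE={mae:.2f}  RMSE={rmse:.2f}")
print(f"baseline MAE={baseline_mae:.2f}  RMSE={baseline_rmse:.2f}")
```

RMSE is computed here as the square root of the mean squared error, which works across scikit-learn versions.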

## 6. Decision tree

Fit a decision tree model.

- Start with default parameters
- Then try tuning at least one parameter (for example `max_depth`)
- Compare results to linear regression and the baseline
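
A hedged sketch of comparing tree depths follows. The synthetic data, the depth values tried, and the choice to compare on a validation split carved out of the training data (so the test set stays untouched while tuning) are all assumptions of this example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with a non-linear signal (NOT the real dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=400)

# Validation split used for tuning, separate from any final test set.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

results = {}
for depth in [None, 2, 4, 8]:   # None means "grow until leaves are pure"
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_fit, y_fit)
    results[depth] = {
        "depth": tree.get_depth(),
        "fit_mae": mean_absolute_error(y_fit, tree.predict(X_fit)),
        "val_mae": mean_absolute_error(y_val, tree.predict(X_val)),
    }

for depth, row in results.items():
    print(f"max_depth={depth}: actual depth={row['depth']}, "
          f"fit MAE={row['fit_mae']:.3f}, val MAE={row['val_mae']:.3f}")
```

Looking at the fit error and the validation error side by side is a simple way to spot overfitting: an unrestricted tree can memorize the data it was fitted on.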

## 7. Random forest and boosting

Train ensemble models.

- `RandomForestRegressor`
- `GradientBoostingRegressor` or XGBoost

Evaluate them in the same way and compare results.
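
Because scikit-learn estimators share the same `fit`/`predict` interface, several models can be evaluated in one loop. The sketch below assumes synthetic stand-in data and default-ish hyperparameters; the dictionary names `models` and `scores` are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.3, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "MAE": float(mean_absolute_error(y_test, y_pred)),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, y_pred))),
    }

for name, row in scores.items():
    print(f"{name}: MAE={row['MAE']:.3f}  RMSE={row['RMSE']:.3f}")
```

The same loop extends to any other regressor with a scikit-learn compatible interface.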

## 8. Evaluate and compare

Now that you have several models:

- Collect MAE and RMSE for each model
- Make a simple comparison table
- Decide which model performs best and why
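
One way to build the comparison table is to keep each model's predictions in a dictionary and compute both metrics in one place. The helper `comparison_table`, the model names, and the tiny made-up numbers below are assumptions for illustration and are not results from the real dataset.

```python
import numpy as np
import pandas as pd


def comparison_table(y_true, predictions):
    """Build a DataFrame of MAE and RMSE per model, best RMSE first."""
    y_true = np.asarray(y_true, dtype=float)
    rows = {}
    for name, y_pred in predictions.items():
        errors = y_true - np.asarray(y_pred, dtype=float)
        rows[name] = {
            "MAE": np.mean(np.abs(errors)),
            "RMSE": np.sqrt(np.mean(errors ** 2)),
        }
    return pd.DataFrame.from_dict(rows, orient="index").sort_values("RMSE")


# Made-up targets and predictions, only to show the mechanics.
y_true = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "mean_baseline": [2.5, 2.5, 2.5, 2.5],
    "model_a": [1.5, 2.0, 2.5, 4.0],   # hypothetical model
}

table = comparison_table(y_true, predictions)
print(table.round(3))
```

Sorting by one metric makes the ranking explicit, but check whether MAE and RMSE agree: RMSE penalizes large errors more heavily, so the two can rank models differently.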

## 9. Model explanation

- Extract and plot the feature importances of one of the tree-based models
- Visualize predicted vs. actual values on a scatter plot
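
A rough sketch of both plots is shown below, assuming a fitted random forest and matplotlib. The synthetic columns `signal` and `noise` are invented so that the expected outcome is obvious: only `signal` influences the target, so it should dominate the importances.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real dataset): only "signal" drives y.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.uniform(0, 10, size=300),
    "noise": rng.uniform(0, 10, size=300),
})
y = 2.0 * X["signal"] + rng.normal(0, 1.0, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(
    forest.feature_importances_, index=X.columns
).sort_values()

fig, (ax_imp, ax_fit) = plt.subplots(1, 2, figsize=(10, 4))
importances.plot.barh(ax=ax_imp, title="Feature importances")

# Predicted vs. actual on held-out rows; points on the diagonal are perfect.
y_pred = forest.predict(X_test)
ax_fit.scatter(y_test, y_pred, s=10, alpha=0.6)
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax_fit.plot(lims, lims, color="black", linewidth=1)
ax_fit.set_xlabel("Actual")
ax_fit.set_ylabel("Predicted")
ax_fit.set_title("Predicted vs. actual")
fig.tight_layout()
```

In a notebook the figure renders at the end of the cell. Impurity-based importances are only one option; scikit-learn also offers `permutation_importance` in `sklearn.inspection` as an alternative.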

## 10. Reflection

Summarize your findings:

- Which model was best?
- Did tuning help?
- What did you learn about the trade-offs between models?
- Which model was easiest to use?

## 11. Extra

- Use cross-validation instead of a single train/test split
- Experiment with Ridge or Lasso regression. Investigate how regularization impacts coefficients and performance. Can you tune the regularization strength (`alpha`) to find the value that performs best?
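
Both extras can be sketched together. The example below assumes synthetic data from `make_regression`, 5-fold cross-validation, feature scaling with `StandardScaler`, and an arbitrary grid of `alpha` values; none of these choices come from the exercise itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real dataset): 3 of 10 features matter.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=15.0, random_state=0
)

# Cross-validated RMSE; scaling sits inside the pipeline so that each fold
# is scaled using only its own training portion.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
neg_rmse = cross_val_score(
    ridge, X, y, cv=5, scoring="neg_root_mean_squared_error"
)
cv_rmse = -neg_rmse.mean()
print(f"Ridge 5-fold RMSE: {cv_rmse:.2f}")

# Tune alpha by grid search over the same cross-validation scheme.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(
    ridge,
    param_grid={"ridge__alpha": alphas},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
best_alpha = search.best_params_["ridge__alpha"]
print(f"Best Ridge alpha: {best_alpha}")

# Lasso: count how many coefficients are driven exactly to zero.
zero_counts = {}
for alpha in [0.1, 1.0, 10.0]:
    lasso = make_pipeline(StandardScaler(), Lasso(alpha=alpha)).fit(X, y)
    coefs = lasso.named_steps["lasso"].coef_
    zero_counts[alpha] = int(np.sum(coefs == 0))
print(zero_counts)
```

scikit-learn scorers follow a "higher is better" convention, which is why the RMSE scorer returns negative values and the sign is flipped above.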