# Machine Learning Intro – Exercise Notebook

This notebook accompanies the *Start-up Seminar* (2025-06-04).  
Work through the problems to reinforce the ideas from the lecture:
1. End-to-end regression workflow with the California-housing data
2. End-to-end classification workflow with the Breast-Cancer data
3. End-to-end clustering analysis with the Iris data (incl. PCA visualization)

You'll need about 2 hours if you code along diligently. Good luck!

## 0 Environment setup
Run the cell below to import common libraries. Feel free to add any others you need.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import set_config
from sklearn.datasets import fetch_california_housing, load_breast_cancer, load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.metrics import silhouette_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

set_config(transform_output='pandas')  # nicer display
np.random.seed(42)

## 1 Regression – California Housing
We will build a model to predict **median house value**.
The section mirrors the project workflow you saw in the slides:
- data acquisition & EDA
- preprocessing (train/test split, scaling, …)
- baseline model → evaluation (RMSE, MAE)
- more powerful model → hyper-parameter tuning


### 1.1 Load the dataset

In [2]:
# ----- TODO: load the California-housing dataset into a DataFrame `df_reg` ----- #

# Load the Dataset
data = fetch_california_housing(as_frame=True)
df_reg = data.frame
display(df_reg.head())

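|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |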


### 1.2 Train/test split
- Create `X_train`, `X_test`, `y_train`, `y_test` using an 80/20 split.
- Create `df_train` by concatenating `X_train` and `y_train`

In [1]:
# ----- TODO: split the data ----- #

In [None]:
# ----- TODO: create df_train ----- #
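
One possible way to do the split is sketched below. To keep the example runnable offline it uses synthetic data from `make_regression` as a stand-in for `df_reg` (an assumption; the frame `df_demo` and its feature names are hypothetical). In the notebook you would apply the same calls to `df_reg`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Stand-in for df_reg so the sketch runs offline (assumption, not the real data).
X_arr, y_arr = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)
df_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(4)])
df_demo["MedHouseVal"] = y_arr

# Separate features and target.
X = df_demo.drop(columns="MedHouseVal")
y = df_demo["MedHouseVal"]

# 80/20 split; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Re-attach the target so that EDA only ever looks at training rows.
df_train = pd.concat([X_train, y_train], axis=1)
print(df_train.shape)
```

Building `df_train` from the training rows only keeps the test set out of any decisions you make during EDA.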

### 1.3 Exploratory Data Analysis (EDA)
- Show basic `.info()` / `.describe()`
- Visualise one numerical feature distribution (e.g., histogram)
- Plot a pairwise scatter matrix or correlation heat-map

In [None]:
# TODO: your EDA code here
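
A minimal sketch of the numeric side of EDA follows. The small frame `df_demo` is a hypothetical stand-in for `df_train`, generated at random so the example runs on its own; the column names only mimic the real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for df_train (random values, not the real data).
df_demo = pd.DataFrame(
    {
        "MedInc": rng.lognormal(mean=1.0, sigma=0.5, size=300),
        "HouseAge": rng.integers(1, 53, size=300).astype(float),
    }
)
df_demo["MedHouseVal"] = 0.4 * df_demo["MedInc"] + rng.normal(0, 0.3, size=300)

# Column types and non-null counts.
df_demo.info()

# Summary statistics per column.
summary = df_demo.describe()
print(summary)

# Pairwise Pearson correlations: the numbers behind a correlation heat-map.
corr = df_demo.corr()
print(corr)

# Bin counts: the numbers behind a histogram of one feature.
counts, edges = np.histogram(df_demo["MedInc"], bins=20)
```

In the notebook, `df_train["MedInc"].hist(bins=20)` and `sns.heatmap(corr, annot=True)` would draw the corresponding plots.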

### 1.4 Pre-processing pipeline
Build a `scikit-learn` pipeline that (optionally) scales features with `StandardScaler` and fits a **LinearRegression** model as a baseline.

In [None]:
# ----- TODO: build and fit the baseline pipeline ----- #
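
The baseline could look roughly like the sketch below. It again uses synthetic `make_regression` data as a stand-in for the housing data (an assumption), so the score it prints says nothing about the real problem.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, inside the pipeline,
# and then reused unchanged when predicting on the test data.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)

# R^2 on the held-out data as a quick sanity check.
r2 = baseline.score(X_test, y_test)
print(f"R^2 on test set: {r2:.3f}")
```

Putting the scaler inside the pipeline avoids leaking test-set statistics into the preprocessing step.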

### 1.5 Evaluate the baseline
Compute **RMSE** and **MAE** on the test set.

In [None]:
# TODO: evaluate baseline
from sklearn.metrics import mean_squared_error, mean_absolute_error
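
A small sketch of the evaluation step: the helper `regression_report` is hypothetical, and the toy arrays stand in for `y_test` and the model's predictions. RMSE is computed as the square root of `mean_squared_error`, which works across scikit-learn versions; recent releases also provide `root_mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def regression_report(y_true, y_pred):
    """Return RMSE and MAE as a dict (hypothetical helper)."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    return {"rmse": rmse, "mae": mae}


# Toy values: the errors are 1, 0, -2, 0.
scores = regression_report([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 7.0])
print(scores)  # rmse = sqrt(1.25), mae = 0.75
```

In the notebook you would call it as `regression_report(y_test, model.predict(X_test))` for whichever fitted model you want to score.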


### 1.6 Decision-Tree Regressor
- Fit a `DecisionTreeRegressor`
- Evaluate the metrics again
- Compare to baseline

In [None]:
# TODO: Decision-Tree experiment
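
The comparison could be organised as in the sketch below, which fits both models on the same split and collects the metrics side by side. The data is synthetic and linear by construction (an assumption), so the linear model wins here; on the housing data the outcome may differ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
    }
    print(name, results[name])
```

An unrestricted tree can memorise the training set, so it is worth comparing its training error with its test error as well.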

### 1.7 Hyper-parameter tuning (Grid Search)
Tune `max_depth` and `min_samples_leaf` using **`GridSearchCV`** with 5-fold CV.

In [None]:
# TODO: Grid search for tree regressor
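
A sketch of the grid search, assuming synthetic stand-in data. The candidate values in `param_grid` are illustrative, not recommendations from the lecture.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative candidate values: 4 x 3 = 12 combinations.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,  # 5-fold cross-validation on the training set
    scoring="neg_root_mean_squared_error",  # scorers are maximised, hence the sign
)
search.fit(X_train, y_train)

best_rmse = -search.best_score_
print(search.best_params_, f"CV RMSE: {best_rmse:.2f}")
```

After fitting, `search.best_estimator_` is already refitted on the full training set and can be scored on the test set directly.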

## 2 Classification – Breast-Cancer
Repeat a similar workflow for a binary classification problem.
Target variable: malignant vs. benign.

### 2.1 Load the dataset

In [None]:
# TODO: load breast-cancer dataset into DataFrame `df_clf`
from sklearn.datasets import load_breast_cancer
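
Loading could follow the same pattern as in section 1.1, as sketched below. The variable name `data_clf` is an assumption. This dataset ships with scikit-learn, so no download is needed.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects; .frame holds features plus target.
data_clf = load_breast_cancer(as_frame=True)
df_clf = data_clf.frame

print(df_clf.shape)           # 569 rows, 30 features + 1 target column
print(data_clf.target_names)  # label 0 is malignant, label 1 is benign
```

Note the label encoding: the positive class (label 1) is *benign*, which matters later when you read precision and recall.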


### 2.2 EDA for classification data

In [None]:
# TODO: your EDA code
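
For classification data, a reasonable first check is the class balance and how a few features differ between classes. The sketch below is one way to do that; the choice of the two features is arbitrary.

```python
from sklearn.datasets import load_breast_cancer

df_clf = load_breast_cancer(as_frame=True).frame

# Class balance: absolute counts and shares (0 = malignant, 1 = benign).
class_counts = df_clf["target"].value_counts()
class_share = df_clf["target"].value_counts(normalize=True)
print(class_counts)
print(class_share.round(3))

# Per-class means of two arbitrarily chosen features.
group_means = df_clf.groupby("target")[["mean radius", "mean texture"]].mean()
print(group_means)
```

A moderately imbalanced target like this one is a reason to look beyond plain accuracy when evaluating.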

### 2.3 Pre-processing & split
Use `StandardScaler` and **LogisticRegression** as a baseline.

In [None]:
# TODO: pipeline with logistic regression
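
A sketch of the classification baseline. The `stratify=y` argument and `max_iter=1000` are choices made here, not requirements from the exercise: stratification keeps the class ratio equal in both splits, and the higher iteration limit is a precaution against convergence warnings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Stratified 80/20 split (assumed split ratio, mirroring section 1.2).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling helps the logistic-regression solver converge.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

test_acc = clf.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
```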

### 2.4 Evaluate baseline
Produce a **confusion matrix**, **accuracy**, and **precision/recall** scores.

In [None]:
# TODO: evaluation code
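
The metric calls are sketched below on a tiny hand-made example, so that each number can be checked by counting. The label lists are made up for illustration.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 2 true negatives, 1 false positive,
# 1 false negative, 4 true positives.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

# Rows are true classes, columns are predicted classes, sorted as 0, 1.
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[2 1]
           #  [1 4]]

acc = accuracy_score(y_true, y_pred)     # (2 + 4) / 8 = 0.75
prec = precision_score(y_true, y_pred)   # 4 / (4 + 1) = 0.8
rec = recall_score(y_true, y_pred)       # 4 / (4 + 1) = 0.8
print(acc, prec, rec)
```

By default precision and recall refer to label 1, which is *benign* in this dataset. Pass `pos_label=0` to score the malignant class instead.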

### 2.5 Random-Forest Classifier experiment

In [None]:
# TODO: RandomForest pipeline & evaluation
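
One way to run the experiment is sketched below. Tree-based models do not depend on feature scale, so this version skips the scaler; wrapping the model in a pipeline as in section 2.3 would work just as well. The hyper-parameter values are illustrative defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

rf_acc = accuracy_score(y_test, y_pred)
# Recall for the malignant class (label 0): the share of tumours that are caught.
malignant_recall = recall_score(y_test, y_pred, pos_label=0)
print(f"accuracy={rf_acc:.3f}, malignant recall={malignant_recall:.3f}")
```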

### 2.6 Hyper-parameter tuning (Randomized Search)
Tune `n_estimators`, `max_depth`, and `max_features`.

In [None]:
# TODO: RandomizedSearchCV for RandomForest
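
A sketch of the randomized search. The sampling ranges, `n_iter=5`, and `cv=3` are small illustrative values chosen to keep the run short, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Distributions are sampled; lists are drawn from uniformly.
param_distributions = {
    "n_estimators": randint(20, 101),  # integers from 20 to 100
    "max_depth": [None, 4, 8, 16],
    "max_features": ["sqrt", "log2", 0.5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,         # number of sampled configurations
    cv=3,
    random_state=42,  # reproducible sampling
)
search.fit(X_train, y_train)

print(search.best_params_, f"CV accuracy: {search.best_score_:.3f}")
```

Unlike a grid search, the cost here is set by `n_iter` rather than by the size of the search space.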

## 3 Custom metric exercise
Implement functions to compute **RMSE** and **MAE** _manually_ (without using `sklearn.metrics`). Verify they match `sklearn`'s implementations.

In [None]:
# TODO: custom metric functions
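
A possible NumPy-only implementation is sketched below. The function names `rmse_manual` and `mae_manual` are hypothetical, and the random test arrays are made up for the comparison with `sklearn`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def rmse_manual(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae_manual(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


# Made-up data for the comparison.
rng = np.random.default_rng(42)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.5, size=100)

# sklearn's RMSE is taken as the square root of its MSE.
sk_rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
sk_mae = float(mean_absolute_error(y_true, y_pred))

print(np.isclose(rmse_manual(y_true, y_pred), sk_rmse))
print(np.isclose(mae_manual(y_true, y_pred), sk_mae))
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE and reacts more strongly to large errors.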

## 4 (★ Bonus) Cross-validation for time-series data
1. Simulate a univariate sine-wave with noise
2. Use `TimeSeriesSplit` to perform walk-forward validation with a `RandomForestRegressor`.
3. Plot validation RMSE vs. split index.

In [None]:
# TODO: bonus exercise
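
The three steps could be sketched as below. The exercise does not say how to turn the series into features, so this version assumes lag features (the previous `n_lags` values predict the next one); the period, noise level, and model size are likewise illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# 1. Simulate a noisy sine wave (period and noise level are assumptions).
rng = np.random.default_rng(42)
t = np.arange(400)
series = np.sin(2 * np.pi * t / 50) + rng.normal(scale=0.1, size=t.size)

# Lag features: row j holds series[j : j + n_lags], the target is series[j + n_lags].
n_lags = 5
X = np.column_stack(
    [series[i : len(series) - n_lags + i] for i in range(n_lags)]
)
y = series[n_lags:]

# 2. Walk-forward validation: each split trains on the past only.
rmse_per_split = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = RandomForestRegressor(n_estimators=50, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[val_idx])
    rmse = float(np.sqrt(mean_squared_error(y[val_idx], pred)))
    rmse_per_split.append(rmse)

# 3. One RMSE per split, in chronological order.
print([round(r, 3) for r in rmse_per_split])
```

For the plot, `plt.plot(range(1, 6), rmse_per_split, marker="o")` with axis labels is enough. A shuffled `KFold` would let the model train on the future, which is why `TimeSeriesSplit` is used here.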

---

### 🎉 Congratulations, you've completed the exercises!
If you still have time:
- experiment with another dataset of your choice
- swap models (e.g., try `GradientBoostingRegressor` / `XGBClassifier`)
- create featureâ€‘importance plots and interpret them

*Happy coding!*