# Machine Learning Intro – Exercise Notebook

This notebook accompanies the *Start‑up Seminar* (2025‑06‑04).  
Work through the problems to reinforce the ideas from the lecture:
1. End‑to‑end regression workflow with the California‑housing data
2. End‑to‑end classification workflow with the Breast‑Cancer data
3. End‑to‑end clustering analysis with the Iris data (incl. PCA visualization)

You’ll need ~2 h if you code along diligently. Good luck!

## 0.  Environment setup
Run the cell below to import common libraries. Feel free to add any others you need.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import set_config
from sklearn.datasets import fetch_california_housing, load_breast_cancer, load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.metrics import silhouette_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

set_config(transform_output='pandas')  # nicer display
np.random.seed(42)

## 1.  Regression – California Housing

**Dataset**  

The California Housing dataset is a classic regression dataset derived from the 1990 U.S. census. Each row describes one district (census block group) in California, and the task is to predict the district's median house value from the features below.

Key Characteristics:

- Target: Median house value (in $100,000s)

- Features (8 total):

    - `MedInc`: Median income in the district

    - `HouseAge`: Median house age

    - `AveRooms`: Average number of rooms per household

    - `AveBedrms`: Average number of bedrooms per household

    - `Population`: Population of the district

    - `AveOccup`: Average number of household members

    - `Latitude`: Latitude of the district

    - `Longitude`: Longitude of the district

**Hands-on Workout**  

We will build a model to predict **median house value**.
The section mirrors the project workflow you saw in the slides:
- data acquisition & EDA
- preprocessing (train/test split, scaling, …)
- baseline model → evaluation (RMSE, MAE)
- more powerful model → hyper‑parameter tuning

### 1.1  Load the dataset

In [2]:
# ----- TODO: load the California‑housing dataset into a DataFrame `df_reg` ----- #

# Load the Dataset
data = fetch_california_housing(as_frame=True)
df_reg = data.frame
display(df_reg.head())

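|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |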


### 1.2  Train/test split
- Create `X_train`, `X_test`, `y_train`, `y_test` using an 80/20 split.
- Create `df_train` by concatenating `X_train` and `y_train`

In [None]:
# ----- TODO: split the data ----- #

In [None]:
# ----- TODO: create df_train ----- #
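One possible way to approach the split and the `df_train` step is sketched below. The helper name `split_and_join` and the tiny `demo` frame are hypothetical stand-ins (not part of the notebook), used so the sketch runs without downloading the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_and_join(df, target, test_size=0.2, seed=42):
    """Split df into train/test parts and rebuild a training frame (hypothetical helper)."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    # Column-wise concat; X_train and y_train share the same index.
    df_train = pd.concat([X_train, y_train], axis=1)
    return X_train, X_test, y_train, y_test, df_train


# Tiny stand-in frame with two of the real column names plus the target.
demo = pd.DataFrame({
    "MedInc": range(10),
    "HouseAge": range(10, 20),
    "MedHouseVal": [v * 0.5 for v in range(10)],
})
X_train, X_test, y_train, y_test, df_train = split_and_join(demo, "MedHouseVal")
print(df_train.shape)
```

In the notebook itself the call would presumably be `split_and_join(df_reg, "MedHouseVal")`. Building `df_train` from the training part only keeps the test set out of the later EDA.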

### 1.3  Exploratory Data Analysis (EDA)
- Show basic `.info()` / `.describe()`
- Visualise one numerical feature distribution (e.g., histogram)
- Plot a pairwise scatter matrix or correlation heat‑map

In [None]:
# ----- TODO: your EDA code here ----- #
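As a numeric companion to the correlation heat-map, the sketch below ranks features by the absolute value of their correlation with the target. `target_correlations` is a hypothetical helper, and the synthetic `demo` frame is an assumption used only to keep the example self-contained.

```python
import numpy as np
import pandas as pd


def target_correlations(df_train, target):
    """Pearson correlation of every feature with the target, strongest first."""
    corr = df_train.corr()[target].drop(target)
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Synthetic stand-in: one informative column, one pure-noise column.
rng = np.random.default_rng(42)
signal = rng.normal(size=200)
demo = pd.DataFrame({
    "signal": signal,
    "noise": rng.normal(size=200),
    "target": 2.0 * signal + rng.normal(scale=0.1, size=200),
})
corr = target_correlations(demo, "target")
print(corr)
```

On the real data, `df_train.hist(bins=50)` and `sns.heatmap(df_train.corr(), annot=True)` would cover the plotting bullets.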

### 1.4  Pre‑processing pipeline
Build a `scikit‑learn` pipeline that (optionally) scales features with `StandardScaler` and fits a **LinearRegression** model as a baseline.

In [None]:
# ----- TODO: build and fit the baseline pipeline ----- #
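A minimal sketch of such a baseline pipeline follows. It uses `make_regression` as a stand-in for the housing data (an assumption, so that nothing needs to be downloaded); in the notebook the `X_train` / `y_train` from section 1.2 would take its place.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 8 features, like the housing data.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then reused at predict time.
baseline = make_pipeline(StandardScaler(), LinearRegression())
baseline.fit(X_train, y_train)

r2 = baseline.score(X_test, y_test)  # R^2 on the held-out part
print(f"test R^2: {r2:.3f}")
```

Scaling does not change the predictions of an ordinary least-squares model, which is why the task calls it optional. It does put the coefficients on a comparable scale.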

### 1.5  Evaluate the baseline
Compute **RMSE** and **MAE** on the test set.

In [None]:
# ----- TODO: evaluate baseline ----- #
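The two metrics can be computed as sketched below. The helper `regression_report` and the toy arrays are hypothetical; RMSE is taken as the square root of `mean_squared_error`, which should behave the same across scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def regression_report(y_true, y_pred):
    """Return RMSE and MAE as plain floats (hypothetical helper)."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    return {"rmse": rmse, "mae": mae}


# Toy values: the errors are 0.5, 0.0, -1.0 and -0.5.
y_true = [3.0, 2.0, 4.0, 1.0]
y_pred = [2.5, 2.0, 5.0, 1.5]
report = regression_report(y_true, y_pred)
print(report)
```

Because the target is expressed in units of $100,000, an RMSE of 0.7 on the real data would correspond to a typical error of about $70,000.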

### 1.6  Decision‑Tree Regressor
- Fit a `DecisionTreeRegressor`
- Evaluate the metrics again
- Compare to baseline
- Repeat the experiment with an `XGBRegressor` in the second cell

In [None]:
# ----- TODO: Decision‑Tree experiment ----- #
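When comparing a tree to the baseline, it can help to look at the training error as well as the test error. The sketch below, again on a synthetic stand-in dataset (an assumption), shows how an unconstrained tree memorizes the training set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No depth limit: the tree keeps splitting until every leaf is pure.
tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)


def rmse(model, X, y):
    """Root mean squared error of model on (X, y)."""
    return float(np.sqrt(mean_squared_error(y, model.predict(X))))


train_rmse = rmse(tree, X_train, y_train)
test_rmse = rmse(tree, X_test, y_test)
print(f"train RMSE: {train_rmse:.3f}  test RMSE: {test_rmse:.3f}")
```

A large gap between the two numbers is a sign of overfitting. Parameters such as `max_depth` and `min_samples_leaf` are the usual levers for reducing it.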

In [None]:
# ----- TODO: XGBoost Regressor experiment ----- #

### 1.7  Hyper‑parameter tuning (Grid Search)
Tune `max_depth` and `min_samples_leaf` using **`GridSearchCV`** with 5‑fold CV.

In [None]:
# ----- TODO: Grid search for Decision-Tree Regressor ----- #
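A hedged sketch of the grid search follows. The candidate values in `param_grid` and the choice of `neg_root_mean_squared_error` as the scoring metric are assumptions, not values prescribed by the exercise. The data is a synthetic stand-in.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X_train, y_train = make_regression(
    n_samples=300, n_features=8, noise=15.0, random_state=42
)

# Assumed candidate values; 4 x 3 = 12 combinations, each fitted 5 times.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the sign
)
search.fit(X_train, y_train)

best_rmse = -search.best_score_  # flip the sign back to a plain RMSE
print(search.best_params_, round(best_rmse, 3))
```

After fitting, `search.best_estimator_` is already refitted on the full training set and can be evaluated on the test set like any other model.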

## 2.  Classification – Breast‑Cancer

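**Dataset**  

The Breast Cancer Wisconsin (Diagnostic) dataset ships with scikit-learn and contains 569 tumor samples, each labeled as malignant or benign.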

Key Characteristics:

- Target

    - 0 = Malignant (cancerous)

    - 1 = Benign (non-cancerous)

- Features (30 total):

    All features are numeric, derived from digitized images of breast fine needle aspirates (FNAs). Each feature describes characteristics of the cell nuclei:

    There are 10 basic measurements (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension), and for each, 3 statistics are recorded:

    - mean – average value

    - error – standard error

    - worst – mean of the three worst (largest) values

    Examples (column names as returned by `load_breast_cancer`):

        mean radius, texture error, worst area, etc.

    Some CSV copies of the same data use names such as `radius_mean`, `texture_se`, and `area_worst` instead.


**Hands-on Workout** 

Repeat a similar workflow for a binary classification problem.

### 2.1  Load the dataset

In [None]:
# ----- TODO: load breast‑cancer dataset into DataFrame `df_clf` ----- #
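Loading works much like in section 1.1. A minimal sketch, assuming the copy of the dataset bundled with scikit-learn (no download needed):

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df_clf = data.frame  # 30 feature columns plus a `target` column

print(df_clf.shape)
print(list(data.target_names))  # index 0 and 1 give the meaning of the target codes
```

`data.target_names` is worth checking before interpreting any metric, because the positive class (1) is benign here, not malignant.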


### 2.2  Train/test split
- Create `X_train`, `X_test`, `y_train`, `y_test` using an 80/20 split.
- Create `df_train` by concatenating `X_train` and `y_train`

In [None]:
# ----- TODO: split the data ----- #

In [None]:
# ----- TODO: create df_train ----- #

### 2.3  EDA for classification data

- Visualize target class distribution
- Plot a correlation heat-map

In [None]:
# ----- TODO: your EDA code ----- #
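For the class distribution, a simple count is often enough before plotting. The sketch below maps the numeric target to readable labels and uses the full dataset for brevity; in the notebook `df_train` would be the natural input.

```python
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame

# Map the numeric codes to readable labels before counting.
labels = df["target"].map({0: "malignant", 1: "benign"})
counts = labels.value_counts()
share = (counts / counts.sum()).round(3)

print(counts)
print(share)
```

The classes are moderately imbalanced, so accuracy alone can be misleading. `counts.plot(kind="bar")` or `sns.countplot(x=labels)` would give the visual version.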

### 2.4  Pre‑processing pipeline
Build a pipeline that scales the features with `StandardScaler` and fits a **LogisticRegression** model as a baseline.

In [None]:
# ----- TODO: pipeline with logistic regression ----- #
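A sketch of the classification baseline is shown below. Two details are assumptions beyond the task text: the split is stratified (`stratify=y`) so both parts keep the same class ratio, and `max_iter` is raised to avoid convergence warnings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Logistic regression is sensitive to feature scale, so scaling comes first.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Unlike in the linear-regression case, scaling matters here because the default `LogisticRegression` applies L2 regularization, which penalizes coefficients of differently scaled features unevenly.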

### 2.5  Evaluate baseline
Produce **confusion‑matrix**, **accuracy**, and **precision/recall** scores.

In [None]:
# ----- TODO: evaluation code ----- #
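The metric calls can be sketched on a small hand-made example, where the numbers are easy to verify by counting. The toy labels below are hypothetical and follow the dataset's coding (0 = malignant, 1 = benign).

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Toy labels: 3 malignant (0) and 5 benign (1) cases.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows = true class, columns = predicted class
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # positive class defaults to 1 (benign)
recall = recall_score(y_true, y_pred)
# Recall for the malignant class: how many cancers were caught.
recall_malignant = recall_score(y_true, y_pred, pos_label=0)

print(cm)
print(accuracy, precision, recall, recall_malignant)
```

Since a missed malignant case is usually the costlier error, reporting recall with `pos_label=0` alongside the defaults may be the more informative choice for this dataset.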

### 2.6  Random‑Forest Classifier experiment

In [None]:
# ----- TODO: RandomForest pipeline & evaluation ----- #

### 2.7  Hyper‑parameter tuning (Randomized Search)
Tune `n_estimators`, `max_depth`, and `max_features` using **`RandomizedSearchCV`**.

In [None]:
# ----- TODO: RandomizedSearchCV for RandomForest ----- #
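A possible shape for the randomized search is sketched below. The candidate values, `n_iter=5`, and `cv=3` are assumptions chosen to keep the run short, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Assumed search space; lists are sampled uniformly.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 3, 5, 10],
    "max_features": ["sqrt", "log2", 0.5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,         # only 5 of the 36 combinations are tried
    cv=3,
    random_state=42,  # makes the sampled combinations reproducible
)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
```

Compared with a grid search, the cost is controlled directly by `n_iter`, which makes this approach practical when the search space is large.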

## 3. Clustering & PCA Visualization - Iris
In this block, we will apply k-means clustering to a real dataset, decide the number of clusters, and visualize the results using PCA.

### 3.1 Load the dataset

In [None]:
# ----- TODO: load the iris dataset into a DataFrame `df_iris` ----- #
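A minimal loading sketch, assuming the copy of Iris bundled with scikit-learn:

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df_iris = iris.frame  # four measurement columns plus a `target` column

print(df_iris.shape)
print(list(iris.target_names))
```

The `target` column holds the species codes. It is not used for clustering and only comes back in section 3.5 for the comparison.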

### 3.2 Preprocessing & Scaling
- Separate features and target
- Scale the feature values with `StandardScaler`

In [None]:
# ----- TODO: preprocess the data ----- #
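The two preprocessing steps might look like the sketch below. Wrapping the result in a `DataFrame` is an optional choice made here to keep the column names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
X = iris.data    # four measurement columns
y = iris.target  # species codes, kept aside and not shown to k-means

scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X), columns=X.columns, index=X.index
)

# After scaling, each column has mean 0 and (population) standard deviation 1.
print(X_scaled.mean().round(6))
print(X_scaled.std(ddof=0).round(6))
```

k-means relies on Euclidean distances, so without scaling the features with the largest numeric range would dominate the clustering.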

### 3.3 Silhouette Score - Evaluating cluster quality
- Compute the silhouette score for k = 2 to 10
- Visualize scores and determine the best k

In [None]:
# ----- TODO: silhouette score ----- #
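The loop over k can be sketched as follows. `n_init=10` and `random_state=42` are assumptions added for reproducible results across scikit-learn versions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

scores = {}
for k in range(2, 11):  # k = 2 .. 10 inclusive
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

scores = pd.Series(scores, name="silhouette")
best_k = int(scores.idxmax())  # k with the highest average silhouette

print(scores.round(3))
print("best k:", best_k)
```

`scores.plot(marker="o")` gives the visual version. The k with the highest silhouette score does not have to match the number of species, which is a useful point to discuss when reaching section 3.5.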

### 3.4 Train k-means and visualize clusters with PCA
- Train k-means with the selected k
- Use PCA to project data to 2D
- Plot the clustered points 

In [None]:
# ----- TODO: train k-means and visualize clusters with PCA ----- #
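A sketch of the clustering and projection steps is given below. It assumes k = 3 purely for illustration; the value chosen in section 3.3 should be used instead. The result is collected in a small frame, `proj` (a hypothetical name), that is ready for plotting.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Cluster in the full 4-dimensional space (k = 3 is an assumption here).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# PCA is used only to draw the result in 2D, not to do the clustering.
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

proj = pd.DataFrame(components, columns=["PC1", "PC2"])
proj["cluster"] = clusters

explained = float(pca.explained_variance_ratio_.sum())
print(proj.head())
print(f"variance explained by 2 components: {explained:.3f}")
```

With the notebook's imports, `sns.scatterplot(data=proj, x="PC1", y="PC2", hue="cluster")` would draw the clustered points. The explained-variance figure indicates how much of the original structure the 2D picture retains.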

### 3.5 Compare with ground-truth labels
- Create a crosstab of predicted clusters vs. true species
- Visualize the correspondence using a heatmap

In [None]:
# ----- TODO: compare with ground-truth labels ----- #
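The comparison table can be built with `pd.crosstab`, as in the sketch below. As in the previous step, k = 3 is an assumption.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
X_scaled = StandardScaler().fit_transform(iris.data)
clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# Replace the numeric species codes with their names for a readable table.
species = iris.target.map(dict(enumerate(iris.target_names))).rename("species")
table = pd.crosstab(species, pd.Series(clusters, name="cluster"))

print(table)
```

Cluster ids are arbitrary, so a good result shows one dominant cell per row, not a filled diagonal. `sns.heatmap(table, annot=True, fmt="d")` turns the table into the requested heat-map.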

---

### 🎉 Congratulations — you’ve completed the exercises!
If you still have time:
- experiment with another dataset of your choice
- swap models (e.g., try `GradientBoostingRegressor` / `LGBMClassifier`)
- create feature‑importance plots and interpret them

*Happy coding!*