# Applied Machine Learning — Course Project 
## Corrupted Occupancy Dataset: Cleaning → Unsupervised → Supervised → Model Selection

**Course:** MT1575 Applied Machine Learning  
**Deliverable:** Submit this notebook (`.ipynb`) with all outputs visible (plots, tables, printed metrics).

---

## What we assess
Your grade is primarily based on:

1. **A correct ML workflow** (cleaning → preprocessing → modelling → evaluation)
2. **Model selection quality** (hyperparameters + comparison across models)
3. **Overfitting/underfitting analysis** (train vs validation behaviour)
4. **Clear communication through visual evidence** (many plots + compact tables)


---

## Tracks
Set `TRACK` in the setup cell:
- `TRACK="3hp"`: baseline requirements
- `TRACK="4.5hp"`: baseline + **Step 10 extension** (required for 4.5 hp)


In [None]:
# ===== Setup =====
TRACK = "3hp"          # "3hp" or "4.5hp"
DATA_FILE = "MT1575_occupancy_corrupted.csv"  # upload to notebook working directory
RANDOM_SEED = 42

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

np.random.seed(RANDOM_SEED)


## 1) Rules: allowed vs required

### You MAY use sklearn for
- splitting: `train_test_split`, `StratifiedKFold`
- preprocessing: `StandardScaler`, `OneHotEncoder`, `ColumnTransformer`
- metrics/plots: `accuracy_score`, `f1_score`, `roc_auc_score`, confusion matrix, ROC/PR displays
- models: `DecisionTreeClassifier`, `RandomForestClassifier`, `MLPClassifier`

### You MUST implement from scratch (functions)
You must implement and **use** these yourself:
- Logistic regression training with gradient descent (functions)
- kNN prediction (functions)
- PCA (functions: SVD or eigen-decomposition)
- k-means clustering (functions)

### You may NOT use (for the from-scratch parts)
- `sklearn.linear_model.LogisticRegression`
- `sklearn.neighbors.KNeighborsClassifier`
- `sklearn.decomposition.PCA`
- `sklearn.cluster.KMeans`

If a forbidden class is used for a from-scratch part, that part is treated as **not submitted**.


## 2) Load dataset

### What we expect to see (outputs)
- Dataset shape printed
- First 5 rows displayed
- Column list printed
- Dtypes displayed


In [None]:
path = Path(DATA_FILE)
assert path.exists(), f"Could not find {DATA_FILE}. Upload it first."

df_raw = pd.read_csv(path)

display(df_raw.head())
print("Shape:", df_raw.shape)
print("Columns:", list(df_raw.columns))
display(df_raw.dtypes)


## 3) Data cleaning (YOU implement)

The dataset is intentionally corrupted (types, missing values, strange strings, outliers, duplicates).

### Hard requirements (we will check these outputs)
#### A) Before/after tables
1. Missing values table **before** and **after** cleaning  
   - show at least the top 10 columns with the most missing values
2. Dtype table **before** and **after** cleaning

#### B) Plots (minimum 4)
You must show at least:
1. One plot showing a data problem **before** cleaning
2. The same (or equivalent) plot **after** cleaning
3. One histogram (ideally shown both before and after cleaning)
4. One target-label plot (bar chart of `Occupancy` values before/after)

#### C) Output contract
By the end of this step, you must have:
- `df` (cleaned dataframe)
- `TARGET_COL = "Occupancy"`, and that column exists in `df`
- `df[TARGET_COL]` is binary integers in `{0,1}`
- Features used for modelling contain **no missing values**

### What you must write (short text)
In a markdown cell below your cleaning code, write **5–10 lines** explaining:
- which issues you found,
- what you changed,
- and why those choices are reasonable.
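
The before/after missing-values table (requirement A) and suggested cleaning steps 2 and 3 could be sketched as follows on a toy frame. This is a minimal sketch, not the expected solution: the sentinel list, the toy column names, and the helpers `missing_table` and `to_numeric_clean` are assumptions made for illustration, and the real dataset may use different markers.

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel markers; inspect the real data to find the actual ones.
SENTINELS = ["?", "NA", "n/a", "missing", ""]

def missing_table(frame, top=10):
    """Count and share of missing values per column, largest first."""
    n_missing = frame.isna().sum().sort_values(ascending=False)
    table = pd.DataFrame({"n_missing": n_missing,
                          "pct": (n_missing / len(frame)).round(3)})
    return table.head(top)

def to_numeric_clean(series):
    """Strip whitespace, map sentinels to NaN, accept decimal commas."""
    s = series.astype(str).str.strip()
    s = s.mask(s.isin(SENTINELS))             # sentinel strings become NaN
    s = s.str.replace(",", ".", regex=False)  # "21,5" -> "21.5"
    return pd.to_numeric(s, errors="coerce")  # anything unparsable becomes NaN

# Toy data with made-up column names and values.
toy_raw = pd.DataFrame({
    "temperature": ["21,5", " 22.0", "?", "23.1"],
    "humidity": ["30", "NA", "31,2", "29"],
})
missing_before = missing_table(toy_raw)
toy_clean = toy_raw.apply(to_numeric_clean)
missing_after = missing_table(toy_clean)
print(missing_before, missing_after, sep="\n\n")
```

Note that the "before" table shows zero missing values here because the sentinels are still ordinary strings. That is itself a data problem worth showing in your evidence.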


In [None]:
TARGET_COL = "Occupancy"
df = df_raw.copy()

# ----------------------------
# TODO: Cleaning pipeline
# ----------------------------
# Suggested steps:
# 1) Standardize column names (strip, lower, underscores)
# 2) Replace sentinel strings/values with NaN
# 3) Convert numeric columns stored as strings (comma->dot)
# 4) Remove duplicates
# 5) Handle missing values (drop or impute)
# 6) Handle outliers / invalid values (clip/remove)
# 7) Fix target labels: ensure exactly {0,1} integers

# ----------------------------
# REQUIRED: Before/after evidence
# ----------------------------
# TODO: missing values before/after tables
# TODO: dtype before/after table
# TODO: minimum 4 plots

print("After cleaning:")
print("Shape:", df.shape)
if TARGET_COL in df.columns:
    print("Target unique:", sorted(df[TARGET_COL].dropna().unique()))
else:
    print("Target unique: TARGET_COL missing")
display(df.head())
display(df.dtypes)


### Cleaning summary (required: 5–10 lines)
- ...


## 4) Exploratory Data Analysis (EDA)

### Hard requirements (we will check these outputs)
#### A) Class balance
- A table with counts and percentages for `Occupancy`.

#### B) Plots (minimum 6)
Include at least:
1. Two histograms (two different numeric features)
2. Two boxplots (two different numeric features)
3. One scatter plot of two features colored by `Occupancy`
4. One correlation visualization (heatmap OR top-10 correlations plot)

#### C) Short EDA text (required)
Write **5–10 lines**:
- what patterns you see,
- what might cause modelling difficulty,
- whether scaling/outliers/class imbalance seems important.
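
The class balance table and the top-correlations option could look roughly like this. It is a sketch on made-up data: `class_balance`, `top_correlations`, and the columns `feat_a` and `feat_b` are hypothetical names, not part of the dataset.

```python
import pandas as pd

def class_balance(y):
    """Counts and percentages per class label, sorted by label."""
    counts = y.value_counts().sort_index()
    pct = (100 * counts / counts.sum()).round(1)
    return pd.DataFrame({"count": counts, "pct": pct})

def top_correlations(frame, target, top=10):
    """Numeric features ranked by absolute Pearson correlation with the target."""
    corr = frame.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(top)

# Toy data; values are invented for illustration only.
toy_eda = pd.DataFrame({
    "feat_a": [0, 0, 0, 1, 0, 1, 0, 0],
    "feat_b": [1, 2, 3, 4, 5, 6, 7, 8],
    "Occupancy": [0, 0, 0, 1, 0, 1, 0, 0],
})
balance = class_balance(toy_eda["Occupancy"])
top_corr = top_correlations(toy_eda, "Occupancy")
print(balance)
print(top_corr)
```

A horizontal bar chart of `top_corr` is one way to satisfy the correlation plot requirement. A feature that correlates almost perfectly with the target, like `feat_a` here, is worth checking for leakage.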


In [None]:
# TODO: class balance table

# TODO: minimum 6 plots (hist, boxplot, scatter, correlation)


### EDA summary (required: 5–10 lines)
- ...


## 5) Train/Validation/Test split + preprocessing (infrastructure cell)

This section sets up a standard ML pipeline. You are not graded on writing the split code itself,
but you are graded on using the resulting arrays correctly in later steps.

### What we expect you to do in the code cell below
- (Optional) edit `DROP_COLS` if you identify ID/leakage columns.
- Run the cell.
- Verify the printed output:
  - shapes are correct,
  - class balance is similar across splits,
  - processed arrays have consistent feature dimensions.

### Output (required variables)
This cell must produce:
- `Xtr_np`, `Xva_np`, `Xte_np`
- `y_train`, `y_val`, `y_test`
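
The three verification points above can also be checked with assertions rather than by eye. The helper below is a hypothetical sketch (`check_splits` and its `tol` parameter are not part of the template) and is demonstrated on random stand-in arrays.

```python
import numpy as np

def check_splits(Xtr, Xva, Xte, ytr, yva, yte, tol=0.05):
    """Raise AssertionError if the splits are inconsistent; return positive rates."""
    parts = [("train", Xtr, ytr), ("val", Xva, yva), ("test", Xte, yte)]
    for name, X_part, y_part in parts:
        assert X_part.shape[0] == len(y_part), f"{name}: rows and labels differ"
        assert X_part.shape[1] == Xtr.shape[1], f"{name}: feature dimension differs"
        assert not np.isnan(X_part).any(), f"{name}: NaN in processed features"
    rates = {name: float(np.mean(y_part)) for name, _, y_part in parts}
    spread = max(rates.values()) - min(rates.values())
    assert spread <= tol, f"class balance differs across splits: {rates}"
    return rates

# Stand-in arrays with a 64/16/20 split and 25% positives in each part.
rng_demo = np.random.default_rng(0)
split_rates = check_splits(
    rng_demo.normal(size=(64, 3)), rng_demo.normal(size=(16, 3)), rng_demo.normal(size=(20, 3)),
    np.array([1] * 16 + [0] * 48), np.array([1] * 4 + [0] * 12), np.array([1] * 5 + [0] * 15),
)
print(split_rates)
```

In the notebook the call would be `check_splits(Xtr_np, Xva_np, Xte_np, y_train, y_val, y_test)`, run after the infrastructure cell.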


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Optional: drop ID/leakage columns
DROP_COLS = []

X = df.drop(columns=[TARGET_COL] + DROP_COLS)
y = df[TARGET_COL].astype(int)

X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=RANDOM_SEED, stratify=y_train_full
)

def balance_table(y_):
    vc = y_.value_counts()
    return pd.DataFrame({"count": vc, "pct": (vc / vc.sum()).round(3)})

print("Shapes:", X_train.shape, X_val.shape, X_test.shape)
display(pd.concat({
    "train": balance_table(y_train),
    "val": balance_table(y_val),
    "test": balance_table(y_test),
}, axis=1))

num_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
cat_cols = [c for c in X_train.columns if c not in num_cols]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop"
)

Xtr = preprocessor.fit_transform(X_train)
Xva = preprocessor.transform(X_val)
Xte = preprocessor.transform(X_test)

def to_dense(A):
    """Return a dense ndarray whether A is a scipy sparse matrix or already dense."""
    return A.toarray() if hasattr(A, "toarray") else np.asarray(A)

Xtr_np, Xva_np, Xte_np = to_dense(Xtr), to_dense(Xva), to_dense(Xte)

print("Processed shapes:", Xtr_np.shape, Xva_np.shape, Xte_np.shape)


## 6) From-scratch functions (required)

You must implement and use the functions below in later sections.

### Logistic Regression (scratch)
You must implement:
- `sigmoid(z)` (numerically stable)
- `logreg_loss_and_grad(X, y, w, l2)` → returns `(loss, grad)`
- `fit_logreg_gd(X, y, lr, n_steps, l2)` → returns `(w, loss_history)`
- `predict_proba_logreg(X, w)` and `predict_logreg(X, w, threshold)`

**What we expect to see**
- A plot of loss vs iteration for at least one run
- A complexity curve: train vs validation F1 vs `l2`

### kNN (scratch)
You must implement:
- `knn_predict(X_train, y_train, X_query, k, weighted=False)`

**What we expect to see**
- A complexity curve: train vs validation F1 vs `k`

### PCA (scratch)
You must implement:
- `pca_fit(X, n_components)` → returns a `pca_model` (e.g., a dict) containing `mean`, `components`, `explained_variance_ratio`
- `pca_transform(X, pca_model)`

**What we expect to see**
- Cumulative explained variance plot
- PCA-2D scatter colored by Occupancy

### k-means (scratch)
You must implement:
- `kmeans_fit(X, k, ...)` → returns `centroids`, `labels`, `inertia_history`
- `kmeans_predict(X, centroids)`

**What we expect to see**
- silhouette score vs k plot
- PCA-2D scatter colored by clusters
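
Before running long sweeps, it can help to verify your analytic gradient against a finite-difference estimate. The sketch below does not implement any of the required functions; `numerical_grad` is a hypothetical helper, shown on a function whose gradient is known.

```python
import numpy as np

def numerical_grad(loss_fn, w, eps=1e-6):
    """Central finite-difference gradient of a scalar function loss_fn at w."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (loss_fn(w + step) - loss_fn(w - step)) / (2 * eps)
    return grad

# Demonstration: f(w) = 0.5 * ||w||^2 has gradient w.
w_demo = np.array([0.5, -1.0, 2.0])
grad_approx = numerical_grad(lambda w: 0.5 * np.sum(w ** 2), w_demo)
grad_max_err = float(np.max(np.abs(grad_approx - w_demo)))
print("max abs error:", grad_max_err)
```

For your own code, compare `numerical_grad(lambda w: logreg_loss_and_grad(X, y, w, l2)[0], w)` with the `grad` your function returns, on a small subset of rows. A large gap usually points to a sign error or a missing regularization term.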


In [None]:
# -----------------------------
# Logistic Regression (scratch)
# -----------------------------
def sigmoid(z):
    # TODO: implement stable sigmoid
    raise NotImplementedError

def logreg_loss_and_grad(X, y, w, l2=0.0):
    # TODO: return (loss, grad)
    raise NotImplementedError

def fit_logreg_gd(X, y, lr=0.1, n_steps=2000, l2=0.0):
    # TODO: return (w, loss_history)
    raise NotImplementedError

def predict_proba_logreg(X, w):
    # TODO
    raise NotImplementedError

def predict_logreg(X, w, threshold=0.5):
    p = predict_proba_logreg(X, w)
    return (p >= threshold).astype(int)

# -----------------------------
# kNN (scratch)
# -----------------------------
def knn_predict(X_train, y_train, X_query, k=5, weighted=False):
    # TODO
    raise NotImplementedError

# -----------------------------
# PCA (scratch)
# -----------------------------
def pca_fit(X, n_components):
    # TODO
    raise NotImplementedError

def pca_transform(X, pca_model):
    # TODO
    raise NotImplementedError

# -----------------------------
# k-means (scratch)
# -----------------------------
def kmeans_fit(X, k, n_init=5, max_iter=200, tol=1e-4, random_state=42):
    # TODO
    raise NotImplementedError

def kmeans_predict(X, centroids):
    # TODO
    raise NotImplementedError


## 7) Unsupervised learning: PCA + k-means 

### Hard requirements (outputs)
#### PCA
1. Plot cumulative explained variance for at least the first 20 components (or all d components if d < 20)
2. Choose `n_components` and justify in **3–6 lines**
3. Create a 2D PCA projection and plot a scatter colored by `Occupancy`

#### k-means
4. Run k-means for at least **4** values of k (e.g., 2–6)
5. Plot silhouette score vs k
6. Plot 2D PCA scatter colored by cluster labels for your chosen k

#### Interpretation (required text)
Write **5–10 lines** explaining:
- whether clusters seem to match labels,
- what this implies for supervised learning difficulty.
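
One common way to choose `n_components` is to pick the smallest number of components that reaches a target share of variance. The sketch below assumes you already have an `explained_variance_ratio` array from your own `pca_fit`. The ratios, the 0.95 target, and the helper name `n_components_for` are invented for illustration, and other justifications (such as an elbow in the curve) are equally valid.

```python
import numpy as np

def n_components_for(evr, target=0.95):
    """Smallest component count whose cumulative explained variance reaches target."""
    cumulative = np.cumsum(evr)
    n_keep = int(np.searchsorted(cumulative, target) + 1)
    return min(n_keep, len(evr)), cumulative

# Hypothetical explained-variance ratios, not taken from the dataset.
evr_demo = np.array([0.55, 0.25, 0.10, 0.06, 0.04])
n_keep_demo, cumulative_demo = n_components_for(evr_demo, target=0.95)
print(n_keep_demo, cumulative_demo.round(2))
```

The `cumulative_demo` array is what the cumulative explained variance plot would show, with a horizontal line at the target making the choice visible.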


In [None]:
from sklearn.metrics import silhouette_score

# TODO: PCA explained variance + PCA2 scatter
# TODO: k-means sweep + silhouette plot + cluster scatter


### Unsupervised interpretation (required: 5–10 lines)
- ...


## 8) Supervised learning: model selection + overfitting evidence 

### Required model families
**From scratch**
- Logistic Regression
- kNN

**sklearn allowed**
- Decision Tree
- Random Forest
- MLPClassifier

### Hard requirements (outputs)
#### A) Comparison table (single table)
Create one table showing the **best validation** performance for each model family.
Table must include:
- Train accuracy, Train F1
- Validation accuracy, Validation F1
- Chosen hyperparameters (as a short text field)

#### B) Overfitting/complexity curves (minimum 4 figures)
You must show train vs validation F1 curves for:
1. Logistic regression vs `l2` (≥ 6 values)
2. kNN vs `k` (≥ 8 values)
3. Decision Tree vs `max_depth` (≥ 8 values)
4. Random Forest vs `n_estimators` (≥ 6 values)

MLP:
- `TRACK="3hp"`: recommended (at least 3 architectures)
- `TRACK="4.5hp"`: required (see Step 10)

#### C) Required text (8–15 lines)
Explain:
- which models overfit and where,
- why you selected your final model family,
- which hyperparameters mattered most.
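
A complexity curve can be built by fitting one model per hyperparameter value and recording train and validation F1. The sketch below does this for `DecisionTreeClassifier` over `max_depth`. It uses synthetic stand-in data from `make_classification`, so the numbers say nothing about the course dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; in the notebook use Xtr_np, y_train, Xva_np, y_val instead.
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=4,
                                     flip_y=0.1, random_state=0)
Xd_tr, Xd_va, yd_tr, yd_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

depth_records = []
for depth in [1, 2, 3, 4, 6, 8, 12, 16]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xd_tr, yd_tr)
    depth_records.append({
        "max_depth": depth,
        "f1_train": f1_score(yd_tr, tree.predict(Xd_tr)),
        "f1_val": f1_score(yd_va, tree.predict(Xd_va)),
    })
depth_curve = pd.DataFrame(depth_records)
best_depth_row = depth_curve.loc[depth_curve["f1_val"].idxmax()]
print(depth_curve.round(3))
```

Plotting `f1_train` and `f1_val` against `max_depth` gives the required figure. A growing gap between the two lines is the overfitting evidence to discuss, and `best_depth_row` can feed `metrics_row` for the comparison table.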


In [None]:
from sklearn.metrics import accuracy_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

def metrics_row(name, config, ytr, yhat_tr, yva, yhat_va):
    return {
        "model": name,
        "config": config,
        "acc_train": accuracy_score(ytr, yhat_tr),
        "f1_train": f1_score(ytr, yhat_tr),
        "acc_val": accuracy_score(yva, yhat_va),
        "f1_val": f1_score(yva, yhat_va),
    }

rows = []

# TODO: implement sweeps and plots for:
# - LogReg: l2 grid (>= 6) -> train/val F1 curve + loss plot at least once
# - kNN: k grid (>= 8) -> train/val F1 curve
# - DT: depth grid (>= 8) -> train/val F1 curve
# - RF: n_estimators grid (>= 6) -> train/val F1 curve
# - MLP: try >= 3 architectures (recommended 3hp, required 4.5hp)

comparison = None  # pd.DataFrame(rows)
# display(comparison.sort_values("f1_val", ascending=False))


### Supervised interpretation (required: 8–15 lines)
- ...


## 9) Final model evaluation on the test set

### Hard requirements (outputs)
1. Select one final model family and hyperparameters (based on validation)
2. Retrain on **train + validation**
3. Evaluate on **test**:
   - Print Accuracy and F1
   - Confusion matrix plot

4. If probability scores are available:
   - ROC curve plot
   - Precision-Recall curve plot

### Required text (5–10 lines)
Explain why you think this test performance is a fair estimate (or not).
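
The evaluation steps above might be sketched as follows. This is an illustration on synthetic data with an arbitrarily chosen Random Forest configuration; it is not a recommendation of model or hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (train + validation) and test; use Xtv_np, ytv, Xte_np, y_test instead.
X_fin, y_fin = make_classification(n_samples=500, n_features=6, random_state=0)
Xf_tv, Xf_te, yf_tv, yf_te = train_test_split(
    X_fin, y_fin, test_size=0.2, stratify=y_fin, random_state=0)

# Hyperparameters here are placeholders; yours should come from validation results.
final_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
final_model.fit(Xf_tv, yf_tv)

yf_pred = final_model.predict(Xf_te)
yf_score = final_model.predict_proba(Xf_te)[:, 1]  # probability of class 1
test_cm = confusion_matrix(yf_te, yf_pred)

print("Accuracy:", round(accuracy_score(yf_te, yf_pred), 3))
print("F1:", round(f1_score(yf_te, yf_pred), 3))
print(test_cm)
```

For the plots, `ConfusionMatrixDisplay(test_cm).plot()`, `RocCurveDisplay.from_predictions(yf_te, yf_score)` and `PrecisionRecallDisplay.from_predictions(yf_te, yf_score)` accept these arrays directly. The test set should be touched only once, after all choices are fixed.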


In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay

# Combine train + val
Xtv_np = np.vstack([Xtr_np, Xva_np])
ytv = np.concatenate([y_train.values, y_val.values])

FINAL_MODEL_NAME = "RF"  # choose: "LogReg", "kNN", "DT", "RF", "MLP"

# TODO: train final model on Xtv_np, ytv
# TODO: evaluate on Xte_np, y_test
# TODO: confusion matrix plot
# TODO: ROC/PR if you have scores


### Test-set discussion (required: 5–10 lines)
- ...


## 10) 4.5 hp Extension — Advanced Model Analysis (required if `TRACK="4.5hp"`)

This step is what justifies the extra 1.5 hp. It focuses on:
- stability of results,
- systematic model selection,
- bias–variance reasoning,
- robustness,
- interpretation.

If `TRACK="3hp"` you should skip this entire step.

---

### 10A) Stratified Cross-Validation (required)

You must perform **Stratified K-Fold Cross-Validation** (k ≥ 5).

**Minimum requirements**
- Evaluate at least **3** model families using cross-validation:
  - one from-scratch model (LogReg or kNN)
  - one tree-based model (DT or RF)
  - MLPClassifier
- For each model family, report:
  - mean F1 across folds
  - std(F1) across folds

**Outputs we expect**
1. A table: `model`, `config`, `mean_f1`, `std_f1`
2. A short text (5–10 lines): which model seems most stable and why?
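
A possible shape for the cross-validation loop is sketched below, assuming preprocessing is refit inside each fold so that validation folds do not influence the scaler. `cv_f1` is a hypothetical helper, the data is synthetic, and a from-scratch model would need a small wrapper exposing `fit` and `predict` (or an equivalent manual call) to slot in.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def cv_f1(make_model, X, y, n_splits=5, seed=42):
    """Mean and std of validation F1 over stratified folds, scaler refit per fold."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for tr_idx, va_idx in skf.split(X, y):
        scaler = StandardScaler().fit(X[tr_idx])
        model = make_model().fit(scaler.transform(X[tr_idx]), y[tr_idx])
        y_hat = model.predict(scaler.transform(X[va_idx]))
        scores.append(f1_score(y[va_idx], y_hat))
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic stand-in data.
X_cv, y_cv = make_classification(n_samples=400, n_features=6, random_state=0)
cv_rows = []
for depth in [2, 5]:
    mean_f1, std_f1 = cv_f1(
        lambda: DecisionTreeClassifier(max_depth=depth, random_state=0), X_cv, y_cv)
    cv_rows.append({"model": "DT", "config": f"max_depth={depth}",
                    "mean_f1": mean_f1, "std_f1": std_f1})
cv_table = pd.DataFrame(cv_rows)
print(cv_table.round(3))
```

A low `std_f1` relative to the differences in `mean_f1` between models suggests the ranking is stable. If the standard deviations overlap heavily, the models may not be meaningfully different.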

---

### 10B) Bias–Variance behaviour (required)

For at least **2 model families**, produce a complexity curve and interpret it.

**Minimum requirements**
- Choose 2 of: LogReg, kNN, DT, RF, MLP
- For each chosen family:
  - plot train vs validation F1 vs a complexity hyperparameter (e.g., l2, k, depth, trees, alpha, architecture)
  - in 8–12 lines, explicitly identify:
    - underfitting (high bias) region
    - overfitting (high variance) region
    - your selected trade-off point

**Outputs we expect**
- 2 figures (or more) with train+val curves
- 2 short texts (8–12 lines each)

---

### 10C) Structured hyperparameter search (required)

You must run a structured hyperparameter search for at least **2** model families.
This should be larger than the sweeps in Step 8.

**Minimum requirements**
- ≥ 20 configurations per model family
- store all results in a DataFrame
- show top 10 configurations sorted by validation F1
- include at least one visualization of the search results:
  - heatmap (recommended for 2D grids) OR
  - line plot OR
  - scatter plot

**Outputs we expect**
- result DataFrames (full + top10)
- at least 1 search visualization per tuned family
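
A two-parameter grid with 20 configurations could be organised as below. The grid values, the choice of `min_samples_leaf` as second axis, and the synthetic data are all assumptions for illustration.

```python
from itertools import product

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; in the notebook use the train/validation arrays from Step 5.
X_gs, y_gs = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
Xg_tr, Xg_va, yg_tr, yg_va = train_test_split(
    X_gs, y_gs, test_size=0.25, stratify=y_gs, random_state=0)

grid = {"max_depth": [2, 3, 4, 6, 8], "min_samples_leaf": [1, 5, 10, 20]}  # 5 x 4 = 20
search_rows = []
for depth, leaf in product(grid["max_depth"], grid["min_samples_leaf"]):
    tree = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=leaf, random_state=0)
    tree.fit(Xg_tr, yg_tr)
    search_rows.append({
        "max_depth": depth,
        "min_samples_leaf": leaf,
        "f1_train": f1_score(yg_tr, tree.predict(Xg_tr)),
        "f1_val": f1_score(yg_va, tree.predict(Xg_va)),
    })

search_df = pd.DataFrame(search_rows)
top10 = search_df.sort_values("f1_val", ascending=False).head(10)
# Rows = max_depth, columns = min_samples_leaf, cells = validation F1.
heat = search_df.pivot(index="max_depth", columns="min_samples_leaf", values="f1_val")
print(top10.round(3))
```

The `heat` table can be passed to `plt.imshow` (with tick labels taken from its index and columns) to produce the recommended heatmap.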

---

### 10D) Threshold optimization (required if final model outputs probabilities)

If your final model can output probabilities (LogReg/MLP/RF usually can), do threshold tuning.

**Minimum requirements**
- sweep thresholds from 0.05 to 0.95
- plot F1 vs threshold
- show confusion matrix at threshold 0.5 and at your chosen threshold
- write 5–10 lines explaining precision/recall trade-off
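
The threshold sweep might look like the sketch below. It assumes the threshold is chosen on validation scores so that the test set stays untouched until the final comparison. `sweep_thresholds` is a hypothetical helper, and the labels and scores are made up.

```python
import numpy as np
from sklearn.metrics import f1_score

def sweep_thresholds(y_true, scores, thresholds=None):
    """F1 at each decision threshold; predictions are 1 where score >= threshold."""
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)  # steps of 0.05
    f1s = np.array([
        f1_score(y_true, (scores >= t).astype(int), zero_division=0)
        for t in thresholds
    ])
    return thresholds, f1s

# Invented validation labels and predicted probabilities.
y_thr = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
p_thr = np.array([0.07, 0.12, 0.22, 0.31, 0.37, 0.42, 0.62, 0.71, 0.47, 0.91])

thr_grid, thr_f1 = sweep_thresholds(y_thr, p_thr)
best_threshold = float(thr_grid[np.argmax(thr_f1)])
print("best threshold:", round(best_threshold, 2), "F1:", round(float(thr_f1.max()), 3))
```

Lowering the threshold trades precision for recall, and raising it does the opposite. Plotting `thr_f1` against `thr_grid` shows where the balance peaks for your model.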

---

### 10E) Robustness analysis (choose ONE, required)

Choose one of the options:

**Option 1: repeated splits**
- repeat the full training procedure for 5 random seeds
- report mean ± std of test F1

**Option 2: noise sensitivity**
- add Gaussian noise to numeric features (several noise levels)
- plot F1 vs noise level
- interpret sensitivity in 5–10 lines
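
For Option 2, one possible setup is to keep the trained model fixed and perturb only the evaluation features. The noise levels below are arbitrary and assume standardized features, so `sigma` is measured in standard deviations. The data and model are synthetic stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

noise_rng = np.random.default_rng(42)

# Synthetic stand-in data and model.
X_ns, y_ns = make_classification(n_samples=500, n_features=6, random_state=0)
Xn_tr, Xn_te, yn_tr, yn_te = train_test_split(
    X_ns, y_ns, test_size=0.2, stratify=y_ns, random_state=0)
noise_model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Xn_tr, yn_tr)

noise_rows = []
for sigma in [0.0, 0.1, 0.25, 0.5, 1.0]:
    # Additive zero-mean Gaussian noise on every feature; sigma = 0 is the clean baseline.
    X_noisy = Xn_te + noise_rng.normal(0.0, sigma, size=Xn_te.shape)
    noise_rows.append({"sigma": sigma, "f1": f1_score(yn_te, noise_model.predict(X_noisy))})
noise_curve = pd.DataFrame(noise_rows)
print(noise_curve.round(3))
```

If your data has one-hot encoded columns, add noise only to the numeric columns, as the requirement states. Averaging over several noise draws per level gives a smoother curve.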

---

### 10F) Feature importance / interpretation (required)

For your final selected model:
- If RF: use `feature_importances_`
- Otherwise: use permutation importance

**Minimum requirements**
- plot top 10 most important features
- write 8–12 lines:
  - do these features align with your EDA?
  - any suspicious/leakage features?
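
Permutation importance with `sklearn.inspection.permutation_importance` could be used as sketched here. The feature names and data are synthetic. The function expects an estimator object with `fit` and `predict`, so a from-scratch model would need a thin wrapper class or a manual shuffle-and-rescore loop.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical feature names.
X_pi, y_pi = make_classification(n_samples=500, n_features=6, n_informative=3, random_state=0)
pi_names = [f"feature_{i}" for i in range(X_pi.shape[1])]
Xp_tr, Xp_va, yp_tr, yp_va = train_test_split(
    X_pi, y_pi, test_size=0.25, stratify=y_pi, random_state=0)

pi_model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(Xp_tr, yp_tr)

# Drop in held-out F1 when each column is shuffled, averaged over 10 repeats.
pi_result = permutation_importance(
    pi_model, Xp_va, yp_va, scoring="f1", n_repeats=10, random_state=0)

importance = (
    pd.DataFrame({"feature": pi_names,
                  "mean_drop": pi_result.importances_mean,
                  "std_drop": pi_result.importances_std})
    .sort_values("mean_drop", ascending=False)
    .head(10)
)
print(importance.round(3))
```

A horizontal bar chart of `mean_drop` with `std_drop` as error bars covers the plot requirement. With one-hot encoded inputs, take the names from `preprocessor.get_feature_names_out()` so the bars are labelled correctly.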


In [None]:
if TRACK != "4.5hp":
    print("TRACK is 3hp — skipping Step 10.")
else:
    from sklearn.model_selection import StratifiedKFold
    from sklearn.inspection import permutation_importance

    # TODO 10A: cross-validation table (mean±std F1)
    # TODO 10B: bias-variance interpretation (2 families)
    # TODO 10C: structured hyperparameter search (>=20 configs per family, >=2 families)
    # TODO 10D: threshold sweep if probabilistic final model
    # TODO 10E: robustness (choose 1)
    # TODO 10F: feature importance plot + interpretation

    pass


## 11) Final discussion (required)

Write concise answers (bullet points are fine), **10–20 lines** in total.

1. Which model did you select and why? (validation performance + overfitting evidence + complexity)
2. What was the biggest source of overfitting, and how did you address it?
3. What did unsupervised analysis (PCA + k-means) tell you about the data?
4. What would you do next with more time or more data?


### Final discussion
- ...


## 12) Submission checklist (what we will verify)

### Cleaning (Step 3)
- [ ] Missing-values table before/after
- [ ] Dtype table before/after
- [ ] ≥ 4 plots showing issues and improvements
- [ ] Target is binary {0,1} and no NaNs in modelling features
- [ ] Cleaning summary (5–10 lines)

### EDA (Step 4)
- [ ] Class balance table
- [ ] ≥ 6 plots, including a correlation visualization
- [ ] EDA summary (5–10 lines)

### Unsupervised (Step 7)
- [ ] PCA explained variance plot
- [ ] PCA-2D scatter colored by Occupancy
- [ ] Silhouette vs k (≥ 4 values)
- [ ] PCA-2D scatter colored by k-means clusters
- [ ] Unsupervised interpretation (5–10 lines)

### Supervised (Step 8)
- [ ] Complexity curves: LogReg, kNN, DT, RF (train vs val shown)
- [ ] Comparison table of best models (train + val metrics)
- [ ] Supervised interpretation (8–15 lines)

### Final evaluation (Step 9)
- [ ] Test metrics printed (Accuracy, F1)
- [ ] Confusion matrix plot
- [ ] ROC + PR curves if probabilities available
- [ ] Test discussion (5–10 lines)

### 4.5hp only (Step 10)
- [ ] 10A: CV table mean±std F1 + short text
- [ ] 10B: bias–variance analysis for 2 families
- [ ] 10C: structured search ≥20 configs for ≥2 families + visualizations
- [ ] 10D: threshold tuning if probabilistic final model
- [ ] 10E: robustness analysis
- [ ] 10F: feature importance plot + interpretation
