# Notebook 9 — Supervised Learning Fundamentals

You’ll implement and practice core supervised methods:

1) Linear Regression (OLS, Regularization, Polynomial features)
2) Logistic Regression (classification & calibration)
3) k-Nearest Neighbors (k-NN)
4) Decision Trees (basics + overfitting control)
5) Bias–Variance, Learning Curves & Evaluation

All datasets are synthetic to keep things reproducible. Exercises are marked with **TODO**.

> Tip: Run top-to-bottom. Use the same random seed for repeatability.

## Imports & Utility Helpers

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dataclasses import dataclass
from typing import Tuple

from sklearn.model_selection import train_test_split, KFold, cross_val_score, learning_curve
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    mean_squared_error, r2_score, accuracy_score, precision_recall_fscore_support,
    roc_auc_score, roc_curve, confusion_matrix, ConfusionMatrixDisplay,
)
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import make_regression, make_classification, make_moons

np.random.seed(42)
plt.rcParams['figure.figsize'] = (6,4)


---
## 1. Linear Regression

### 1.1 Ordinary Least Squares (OLS)
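For the linear model \(y = Xw + \varepsilon\), OLS solves \(\min_w \|y - Xw\|_2^2\). When \(X^\top X\) is invertible, the solution has the closed form \(w = (X^\top X)^{-1} X^\top y\); in practice, libraries use numerically stable solvers rather than an explicit inverse.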
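
As a minimal sketch of the closed form, the snippet below solves the normal equations on a tiny synthetic problem and compares the result with `np.linalg.lstsq`. The data, the weights `w_true`, and the noise level are invented for illustration, and the column of ones that absorbs the intercept is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
X = np.hstack([np.ones((n, 1)), x])      # column of ones models the intercept
w_true = np.array([5.0, 2.0])            # hypothetical ground truth
y = X @ w_true + 0.5 * rng.normal(size=n)

# Normal equations: solve (X^T X) w = X^T y without forming an explicit inverse
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, preferable when X^T X is ill-conditioned
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print('normal equations:', w_normal)
print('lstsq           :', w_lstsq)
```

Both routes should agree closely on a well-conditioned problem like this one; they can diverge when features are nearly collinear.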

We’ll fit OLS to synthetic data and visualize fit & residuals.

In [None]:
# Synthetic linear regression data
X, y = make_regression(n_samples=300, n_features=1, noise=15.0, bias=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

ols = LinearRegression()
ols.fit(X_train, y_train)
y_pred = ols.predict(X_test)
print('OLS coefficients:', ols.coef_, 'intercept:', ols.intercept_)
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))  # avoids the deprecated squared=False argument
print('Test R^2 :', r2_score(y_test, y_pred))

xs = np.linspace(X.min()-1, X.max()+1, 200).reshape(-1,1)
plt.scatter(X_train, y_train, s=12, alpha=0.6, label='train')
plt.scatter(X_test, y_test, s=12, alpha=0.6, label='test')
plt.plot(xs, ols.predict(xs), label='OLS fit')
plt.legend(); plt.title('Linear Regression: OLS fit'); plt.show()

resid = y_test - y_pred
plt.scatter(y_pred, resid, s=14)
plt.axhline(0, linestyle='--')
plt.xlabel('Predicted'); plt.ylabel('Residual'); plt.title('Residual Plot'); plt.show()

### 1.2 Polynomial Features & Regularization (Ridge/Lasso)
Higher-degree polynomials can overfit. **Ridge (L2)** and **Lasso (L1)** regularization control complexity.
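
To make the two penalties concrete, here is a small sketch on invented data. It checks the Ridge closed form \(w = (X^\top X + \alpha I)^{-1} X^\top y\) (no intercept) against scikit-learn, and fits a Lasso to show that an L1 penalty can set coefficients exactly to zero. The sparse `w_true` and both `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # hypothetical sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=100)

alpha = 1.0
# Ridge closed form without intercept: w = (X^T X + alpha * I)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(5), X.T @ y)
w_ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

# Lasso (L1) can drive some coefficients exactly to zero
w_lasso = Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_

print('ridge closed form:', np.round(w_closed, 3))
print('ridge sklearn    :', np.round(w_ridge, 3))
print('lasso            :', np.round(w_lasso, 3))
```

Ridge shrinks all coefficients toward zero without eliminating any, while Lasso tends to drop the uninformative ones entirely.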

We’ll create a noisy nonlinear target and compare OLS vs Ridge vs Lasso across polynomial degrees using RMSE on a held-out test split.

In [None]:
# Nonlinear regression setup
rng = np.random.default_rng(0)
n = 250
x = np.linspace(-3, 3, n)
y_true = np.sin(x) + 0.3*np.cos(3*x)
y_noisy = y_true + 0.4*rng.standard_normal(n)
X = x.reshape(-1,1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_noisy, test_size=0.25, random_state=1)

def model_score(model):
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return np.sqrt(mean_squared_error(y_te, pred))  # RMSE; avoids the deprecated squared=False argument

degrees = [1, 3, 5, 9]
results = []
for d in degrees:
    pipe_ols = Pipeline([
        ('poly', PolynomialFeatures(d, include_bias=False)),
        ('sc', StandardScaler()),
        ('lr', LinearRegression())
    ])
    pipe_ridge = Pipeline([
        ('poly', PolynomialFeatures(d, include_bias=False)),
        ('sc', StandardScaler()),
        ('ridge', Ridge(alpha=1.0))
    ])
    pipe_lasso = Pipeline([
        ('poly', PolynomialFeatures(d, include_bias=False)),
        ('sc', StandardScaler()),
        ('lasso', Lasso(alpha=0.01, max_iter=20000))
    ])
    rmse_ols = model_score(pipe_ols)
    rmse_ridge = model_score(pipe_ridge)
    rmse_lasso = model_score(pipe_lasso)
    results.append((d, rmse_ols, rmse_ridge, rmse_lasso))

df_res = pd.DataFrame(results, columns=['degree','OLS_RMSE','Ridge_RMSE','Lasso_RMSE'])
print(df_res)

plt.plot(df_res['degree'], df_res['OLS_RMSE'], marker='o', label='OLS')
plt.plot(df_res['degree'], df_res['Ridge_RMSE'], marker='o', label='Ridge')
plt.plot(df_res['degree'], df_res['Lasso_RMSE'], marker='o', label='Lasso')
plt.xlabel('Polynomial degree'); plt.ylabel('Test RMSE')
plt.title('Regularization vs Degree'); plt.legend(); plt.show()

# Visualize fits for degree=9
d=9
grid = np.linspace(x.min(), x.max(), 300).reshape(-1,1)
for name, model in [
    ('OLS', Pipeline([('poly', PolynomialFeatures(d, include_bias=False)), ('sc', StandardScaler()), ('lr', LinearRegression())])),
    ('Ridge', Pipeline([('poly', PolynomialFeatures(d, include_bias=False)), ('sc', StandardScaler()), ('ridge', Ridge(alpha=1.0))])),
    ('Lasso', Pipeline([('poly', PolynomialFeatures(d, include_bias=False)), ('sc', StandardScaler()), ('lasso', Lasso(alpha=0.01, max_iter=20000))]))
]:
    model.fit(X_tr, y_tr); yhat = model.predict(grid)
    plt.plot(grid, yhat, label=name)
plt.scatter(X_tr, y_tr, s=8, alpha=0.4, label='train')
plt.plot(x, y_true, 'k--', label='true')
plt.title('Degree 9 fits: OLS vs Ridge vs Lasso'); plt.legend(); plt.show()

**Exercise 1 — Ridge/Lasso model selection (CV):**
Use K-fold CV on the nonlinear problem to pick `alpha` for Ridge and Lasso separately (search over a grid). Report the selected alpha and the test RMSE. **Hint:** use `cross_val_score` with `scoring='neg_mean_squared_error'`.

In [None]:
# TODO Exercise 1
alphas = np.logspace(-3, 2, 10)
degree = 9
# 1) Build pipeline with PolynomialFeatures -> StandardScaler -> Ridge/Lasso
# 2) For each alpha, compute CV score (KFold=5), pick best
# 3) Retrain on train with best alpha, evaluate on test


---
## 2. Logistic Regression (Classification)

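Binary logistic regression models \(P(y=1 \mid x) = \sigma(w^\top x + b)\), where \(\sigma(z) = 1/(1+e^{-z})\) is the sigmoid. Training minimizes log loss; in practice, scale the features and use regularization.

We’ll fit it to a roughly linearly separable dataset and, with polynomial features, to a non-linear one (moons), then evaluate with accuracy, precision/recall/F1, ROC-AUC, a confusion matrix, and the ROC curve.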
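
The model and its loss can be sketched in a few lines of NumPy. The weights, bias, inputs, and labels below are hypothetical values chosen only for illustration; the manual log loss is compared with `sklearn.metrics.log_loss` as a sanity check.

```python
import numpy as np
from sklearn.metrics import log_loss

def sigmoid(z):
    """Logistic function mapping real scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and inputs chosen only for illustration
w = np.array([1.5, -0.5])
b = 0.2
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.3, -1.2], [2.0, 1.0]])
y = np.array([1, 0, 1, 1])

p = sigmoid(X @ w + b)                       # P(y=1 | x) for each row
manual = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print('probabilities:', np.round(p, 3))
print('log loss (manual) :', manual)
print('log loss (sklearn):', log_loss(y, p))
```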

In [None]:
# Two datasets: linearly separable-ish (make_classification) and non-linear (moons)
X_lin, y_lin = make_classification(n_samples=600, n_features=10, n_informative=5, class_sep=1.5, random_state=42)
Xl_tr, Xl_te, yl_tr, yl_te = train_test_split(X_lin, y_lin, test_size=0.25, random_state=42)

logit = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(max_iter=500))
])
logit.fit(Xl_tr, yl_tr)
proba = logit.predict_proba(Xl_te)[:,1]
pred = (proba >= 0.5).astype(int)

acc = accuracy_score(yl_te, pred)
prec, rec, f1, _ = precision_recall_fscore_support(yl_te, pred, average='binary')
auc = roc_auc_score(yl_te, proba)
print(f'Logistic (linear features) — acc={acc:.3f}, prec={prec:.3f}, rec={rec:.3f}, f1={f1:.3f}, auc={auc:.3f}')

cm = confusion_matrix(yl_te, pred)
ConfusionMatrixDisplay(cm).plot(); plt.title('Confusion Matrix — Logistic'); plt.show()

fpr, tpr, _ = roc_curve(yl_te, proba)
plt.plot(fpr, tpr); plt.plot([0,1], [0,1], 'k--'); plt.xlabel('FPR'); plt.ylabel('TPR'); plt.title('ROC Curve'); plt.show()

# Nonlinear moons
Xm, ym = make_moons(n_samples=500, noise=0.25, random_state=42)
Xm_tr, Xm_te, ym_tr, ym_te = train_test_split(Xm, ym, test_size=0.25, random_state=42)
logit_poly = Pipeline([
    ('poly', PolynomialFeatures(3, include_bias=False)),
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(max_iter=1000))
])
logit_poly.fit(Xm_tr, ym_tr)
proba2 = logit_poly.predict_proba(Xm_te)[:,1]
auc2 = roc_auc_score(ym_te, proba2)
print(f'Logistic (moons + poly) — AUC={auc2:.3f}')

# Plot decision boundary on moons (for visualization only)
xx, yy = np.meshgrid(np.linspace(Xm[:,0].min()-0.5, Xm[:,0].max()+0.5, 200),
                     np.linspace(Xm[:,1].min()-0.5, Xm[:,1].max()+0.5, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
zz = logit_poly.predict_proba(grid)[:,1].reshape(xx.shape)
plt.contourf(xx, yy, zz, levels=20, alpha=0.6)
plt.scatter(Xm_tr[:,0], Xm_tr[:,1], c=ym_tr, s=10, edgecolor='k', alpha=0.7)
plt.title('Logistic + Polynomial Features — Decision Surface'); plt.show()

**Exercise 2 — Threshold tuning & class imbalance:**
Create an imbalanced dataset (`weights=[0.9, 0.1]` in `make_classification`). Train logistic regression, then:
1) Plot PR curve & choose a threshold that maximizes F1.  
2) Compare metrics at 0.5 vs chosen threshold.  
3) Repeat with `class_weight='balanced'` and compare.
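
As a hedged starting point for step 1, the sketch below picks the F1-maximizing threshold from `precision_recall_curve` on a toy set of labels and scores. The arrays are invented for illustration and are not the exercise's dataset.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Toy labels and scores, invented for illustration (positives are the minority)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.70, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision/recall have one more entry than thresholds; drop the final (1, 0) point
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)   # guard against 0/0

best = np.argmax(f1)
best_threshold = thresholds[best]
pred = (scores >= best_threshold).astype(int)   # same ">=" convention as the curve
print(f'best threshold={best_threshold:.2f}, F1={f1[best]:.3f}')
```

In the exercise itself, choose the threshold on validation data rather than on the test set, so the reported test metrics stay unbiased.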


In [None]:
# TODO Exercise 2 — imbalance & threshold tuning
from sklearn.metrics import precision_recall_curve

# 1) Build imbalanced data, train pipeline (Scaler+LogReg)
# 2) precision_recall_curve -> find threshold maximizing F1 = 2PR/(P+R)
# 3) Compare metrics at 0.5 vs best threshold; then class_weight='balanced'


---
## 3. k-Nearest Neighbors (k-NN)

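k-NN is an instance-based learner: it stores the training set and defers all computation to prediction time, when each query is compared against the stored points. It is sensitive to feature scaling and to the choice of k.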
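
A from-scratch sketch makes the prediction-time cost visible: every query computes a distance to every stored training point. The helper `knn_predict` and the tiny dataset are hypothetical and written for illustration; scikit-learn's `KNeighborsClassifier` uses more efficient data structures.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each query point by majority vote among its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for each query
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]                     # neighbor labels, shape (n_query, k)
    # Majority vote; labels are assumed to be non-negative integers
    return np.array([np.bincount(row).argmax() for row in votes])

# Tiny made-up dataset: two clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[0.05, 0.05], [1.0, 0.95]])

print(knn_predict(X_train, y_train, X_query, k=3))
```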

We’ll use k-NN on moons and visualize decision boundaries for different k.

In [None]:
def plot_knn_decision(X, y, k):
    model = Pipeline([
        ('sc', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(X[:,0].min()-0.5, X[:,0].max()+0.5, 200),
                         np.linspace(X[:,1].min()-0.5, X[:,1].max()+0.5, 200))
    grid = np.c_[xx.ravel(), yy.ravel()]
    zz = model.predict(grid).reshape(xx.shape)
    plt.contourf(xx, yy, zz, alpha=0.4)
    plt.scatter(X[:,0], X[:,1], c=y, s=10, edgecolor='k', alpha=0.7)
    plt.title(f'k-NN Decision Surface (k={k})')
    plt.show()
    return model

Xm, ym = make_moons(n_samples=400, noise=0.25, random_state=12)
for k in [1, 3, 7, 21]:
    _ = plot_knn_decision(Xm, ym, k)

# Simple CV to pick k by accuracy
scores = []
for k in range(1, 26, 2):
    model = Pipeline([('sc', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=k))])
    acc = cross_val_score(model, Xm, ym, cv=5, scoring='accuracy').mean()
    scores.append((k, acc))
dfk = pd.DataFrame(scores, columns=['k','cv_acc'])
print(dfk.head())
plt.plot(dfk['k'], dfk['cv_acc'], marker='o'); plt.xlabel('k'); plt.ylabel('CV accuracy'); plt.title('k selection via CV'); plt.show()

**Exercise 3 — Distance metrics & scaling:**
Compare k-NN with and without `StandardScaler` on a dataset with heterogeneous feature scales (e.g., concatenate moons with an extra unscaled feature). Try `metric='manhattan'` vs default. Summarize observations.

In [None]:
# TODO Exercise 3
# 1) Create data with mixed scales: Xm plus a large-scale extra feature
# 2) Compare pipelines: with/without StandardScaler; metric euclidean vs manhattan
# 3) Use CV accuracy or ROC-AUC (KNeighborsClassifier.predict_proba returns neighbor vote fractions)


---
## 4. Decision Trees

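A decision tree greedily partitions the feature space, choosing at each node the split that most reduces impurity (Gini or entropy). Trees are high-variance learners; control them by limiting depth or by requiring a minimum number of samples per split or leaf.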
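
The split criterion can be sketched by hand for a single feature. The helpers `gini` and `impurity_reduction` and the toy data are hypothetical and written for illustration; they score every midpoint threshold and keep the one with the largest impurity reduction, which is the greedy step a tree repeats at each node.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over class proportions p_k."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(x, y, threshold):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(y) - child

# Made-up, already sorted 1-D feature with binary labels
x = np.array([0.1, 0.4, 0.5, 0.9, 1.3, 1.7, 2.0, 2.4])
y = np.array([0, 0, 0, 0, 1, 1, 0, 1])

# Candidate thresholds: midpoints between consecutive feature values
candidates = (x[:-1] + x[1:]) / 2
gains = [impurity_reduction(x, y, t) for t in candidates]
best_t = candidates[int(np.argmax(gains))]
print(f'best threshold={best_t:.2f}, gain={max(gains):.3f}')
```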

We’ll train a tree on moons and show how max_depth affects overfitting and training/test accuracy.

In [None]:
Xm, ym = make_moons(n_samples=500, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(Xm, ym, test_size=0.3, random_state=0)

depths = [1, 2, 3, 4, 6, 10, None]
rows = []
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=0)
    clf.fit(X_tr, y_tr)
    tr_acc = clf.score(X_tr, y_tr)
    te_acc = clf.score(X_te, y_te)
    rows.append((str(d), tr_acc, te_acc))
df = pd.DataFrame(rows, columns=['max_depth','train_acc','test_acc'])
print(df)
plt.plot(df['max_depth'], df['train_acc'], marker='o', label='train')
plt.plot(df['max_depth'], df['test_acc'], marker='o', label='test')
plt.xlabel('max_depth'); plt.ylabel('accuracy'); plt.title('Decision Tree — Depth vs Accuracy'); plt.legend(); plt.show()

# Visualize a small tree
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
plt.figure(figsize=(8,5))
plot_tree(small_tree, feature_names=['x1','x2'], class_names=['0','1'], filled=True, impurity=True)
plt.title('Decision Tree (depth=3)'); plt.show()

**Exercise 4 — Tree hyperparameters:**
Grid-search `max_depth`, `min_samples_split`, and `min_samples_leaf` using 5-fold CV on moons. Plot the best validation accuracy vs depth and show the confusion matrix for the best model on the held-out test set.

In [None]:
# TODO Exercise 4
from sklearn.model_selection import GridSearchCV

# 1) Grid over depth, min_samples_split, min_samples_leaf
# 2) Fit on train, choose best by CV
# 3) Evaluate on test, plot CM


---
## 5. Bias–Variance, Learning Curves & Evaluation

**Bias–Variance:**
- High bias (underfit): training & validation error both high; increasing model complexity helps.
- High variance (overfit): training error low, validation error high; more data/regularization helps.
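
A small simulation can illustrate the two regimes. The sketch below refits polynomials of degree 1 and 9 to many noisy resamples of an assumed true function (`np.sin`) and estimates squared bias and variance of the predictions. The function, noise level, and repeat count are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40)
f_true = np.sin(x)          # assumed ground-truth function for this sketch
sigma = 0.4                 # assumed noise level
n_repeats = 200

def bias_variance(degree):
    """Estimate mean squared bias and mean variance of a polynomial fit over resampled noise."""
    preds = np.empty((n_repeats, x.size))
    for i in range(n_repeats):
        y = f_true + sigma * rng.standard_normal(x.size)
        preds[i] = Polynomial.fit(x, y, degree)(x)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for d in (1, 9):
    b2, var = bias_variance(d)
    print(f'degree={d}: bias^2={b2:.4f}, variance={var:.4f}')
```

Under these assumptions the straight line should show the larger squared bias and the degree-9 fit the larger variance, matching the two bullets above.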

We’ll draw **learning curves** for a tree and logistic regression on the moons dataset and compare generalization behavior.

We’ll also recap classification metrics briefly (acc, precision/recall/F1, ROC-AUC).

In [None]:
def plot_learning_curves(estimator, X, y, title='Learning Curve'):
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 8), scoring='accuracy',
        shuffle=True, random_state=0)  # random_state only takes effect when shuffle=True
    train_mean = train_scores.mean(axis=1)
    val_mean = val_scores.mean(axis=1)
    plt.plot(train_sizes, train_mean, marker='o', label='train acc')
    plt.plot(train_sizes, val_mean, marker='o', label='val acc')
    plt.xlabel('Training size'); plt.ylabel('Accuracy'); plt.title(title); plt.legend(); plt.show()

Xm, ym = make_moons(n_samples=800, noise=0.3, random_state=7)

plot_learning_curves(DecisionTreeClassifier(max_depth=None, random_state=0), Xm, ym,
                     title='Learning Curve — Decision Tree (high variance)')

logit_poly = Pipeline([
    ('poly', PolynomialFeatures(3, include_bias=False)),
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(max_iter=1000))
])
plot_learning_curves(logit_poly, Xm, ym, title='Learning Curve — Logistic + Poly (moderate bias/variance)')

**Exercise 5 — Regression learning curves & noise:**
On the nonlinear regression dataset from Section 1.2, draw learning curves for OLS (degree=9) and Ridge (degree=9, tuned alpha). Increase noise level and observe how curves change. Summarize bias/variance observations in markdown after plots.

In [None]:
# TODO Exercise 5
# 1) Recreate nonlinear regression data with adjustable noise
# 2) Plot learning curves (use scoring='neg_root_mean_squared_error' or custom CV)
# 3) Compare OLS vs Ridge (alpha from Exercise 1)


---
## 6. Capstone Mini-Project — Model Comparison Pipeline

Build a compact experiment runner to compare **Logistic (linear & poly)**, **k-NN**, and **Decision Tree** on the moons dataset with 5-fold CV. Report:
- Mean accuracy, ROC-AUC
- Best hyperparameters (k, depth)
- Final test-set metrics and confusion matrix for the best model

You can expand this later into a more general experiment scaffold for future notebooks.
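
One way to report both accuracy and ROC-AUC from the same folds is `cross_validate` with a list of scorers. The sketch below is a minimal example on its own small moons sample; the names `X_demo`, `y_demo`, and `cv_demo` are hypothetical and chosen so they do not clash with the scaffold's variables.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_moons(n_samples=300, noise=0.35, random_state=13)
cv_demo = KFold(n_splits=5, shuffle=True, random_state=0)

model = Pipeline([('sc', StandardScaler()), ('lr', LogisticRegression(max_iter=1000))])
# One pass over the folds yields one score array per metric
res = cross_validate(model, X_demo, y_demo, cv=cv_demo, scoring=['accuracy', 'roc_auc'])

mean_acc = res['test_accuracy'].mean()
mean_auc = res['test_roc_auc'].mean()
print(f'accuracy={mean_acc:.3f}, roc_auc={mean_auc:.3f}')
```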

In [None]:
# Capstone scaffold (feel free to extend)
from sklearn.base import clone

Xm, ym = make_moons(n_samples=1000, noise=0.35, random_state=13)
X_tr, X_te, y_tr, y_te = train_test_split(Xm, ym, test_size=0.25, random_state=13)

candidates = {
    'logistic_linear': Pipeline([('sc', StandardScaler()), ('lr', LogisticRegression(max_iter=1000))]),
    'logistic_poly3': Pipeline([('poly', PolynomialFeatures(3, include_bias=False)), ('sc', StandardScaler()), ('lr', LogisticRegression(max_iter=1000))]),
    'knn_k': None,  # set below per k
    'tree': None    # set below per grid
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)

def eval_model(model, X, y):
    acc = cross_val_score(model, X, y, cv=cv, scoring='accuracy').mean()
    return acc

# Model selection uses only the training split, so the test split stays untouched
# Evaluate logistic models
scores = []
for name in ['logistic_linear','logistic_poly3']:
    acc = eval_model(candidates[name], X_tr, y_tr)
    scores.append((name, acc, None))

# kNN sweep
best_knn = None; best_acc = -np.inf; best_k = None
for k in range(1, 32, 2):
    model = Pipeline([('sc', StandardScaler()), ('knn', KNeighborsClassifier(n_neighbors=k))])
    acc = eval_model(model, X_tr, y_tr)
    if acc > best_acc:
        best_acc, best_knn, best_k = acc, model, k
scores.append((f'knn_k={best_k}', best_acc, {'k':best_k}))

# Tree sweep (depth only for simplicity)
best_tree = None; best_acc_t = -np.inf; best_d = None
for d in [1,2,3,4,6,8,10,None]:
    model = DecisionTreeClassifier(max_depth=d, random_state=0)
    acc = eval_model(model, X_tr, y_tr)
    if acc > best_acc_t:
        best_acc_t, best_tree, best_d = acc, model, d
scores.append((f'tree_depth={best_d}', best_acc_t, {'max_depth':str(best_d)}))

df_scores = pd.DataFrame(scores, columns=['model','cv_acc','params'])
df_scores = df_scores.sort_values('cv_acc', ascending=False)
print(df_scores)

# Train best on train set, evaluate on test with ROC-AUC
best_name = df_scores.iloc[0]['model']
if best_name.startswith('knn'):
    final = best_knn
elif best_name.startswith('tree'):
    final = best_tree
elif best_name == 'logistic_linear':
    final = candidates['logistic_linear']
else:
    final = candidates['logistic_poly3']

final = clone(final)
final.fit(X_tr, y_tr)

# Every candidate here (logistic, k-NN, tree) exposes predict_proba
prob = final.predict_proba(X_te)[:,1]
auc = roc_auc_score(y_te, prob)

pred = final.predict(X_te)
acc = accuracy_score(y_te, pred)
print(f'Best={best_name} | Test acc={acc:.3f}, AUC={auc:.3f}')
ConfusionMatrixDisplay(confusion_matrix(y_te, pred)).plot()
plt.title(f'Confusion Matrix — {best_name}'); plt.show()

## Wrap-up
- **Linear models**: start simple; add polynomial features for nonlinearity; control with **Ridge/Lasso**.
- **Logistic regression**: strong baseline; scale features; tune threshold for business metrics.
- **k-NN**: sensitive to scaling & k; great baseline for small/low-dim problems.
- **Decision trees**: interpretable; high variance; control with depth/min samples; ensembles later.
- **Bias–Variance**: diagnose with **learning curves**; fight overfit with regularization, more data, simpler models.

Next notebooks: **Unsupervised Learning, Ensembles, and Model Evaluation in depth**.