# Week 2: Supervised Learning Algorithms Demo

This notebook demonstrates new algorithms from Chapter 2:
- Linear Models (Linear Regression and Ridge Regression)
- Logistic Regression for Classification
- Decision Trees
- Model complexity and regularization

## Setup: Building a Student Performance Dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Create reproducible random data
rng = np.random.default_rng(42)
n = 100  # more students for better demonstrations

df = pd.DataFrame({
    "attendance": rng.integers(60, 101, n),
    "homework_rate": rng.integers(50, 101, n),
    "quiz_avg": rng.integers(40, 101, n),
    "exam_avg": rng.integers(40, 101, n),
})

# Regression target: final score
df["final_score"] = (0.2*df["homework_rate"] + 0.3*df["quiz_avg"] + 0.5*df["exam_avg"]).round(0)

# Classification target: pass/fail
df["pass_fail"] = np.where(df["final_score"] >= 70, "pass", "fail")

print(f"Dataset size: {len(df)} students")
df.head()

## Part 1: Linear Models for Regression

Linear models predict outputs as weighted sums of features: ŷ = w[0]×x[0] + w[1]×x[1] + ... + b

We'll compare:
- **LinearRegression**: no regularization
- **Ridge**: L2 regularization (shrinks coefficients toward zero)
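
To make the weighted-sum formula concrete, here is a minimal sketch on hypothetical toy data (the names `X_toy`, `y_toy`, and `toy_model` are illustrative and not part of the student dataset). It checks that `predict()` gives the same numbers as computing `w[0]×x[0] + w[1]×x[1] + b` by hand from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (an assumption for illustration): y = 2*x0 + 3*x1 + 1
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(20, 2))
y_toy = 2.0 * X_toy[:, 0] + 3.0 * X_toy[:, 1] + 1.0

toy_model = LinearRegression().fit(X_toy, y_toy)

# Manual weighted sum: w[0]*x[0] + w[1]*x[1] + b
manual_pred = X_toy @ toy_model.coef_ + toy_model.intercept_

print("weights w:", toy_model.coef_.round(3))
print("intercept b:", round(float(toy_model.intercept_), 3))
print("matches predict():", np.allclose(manual_pred, toy_model.predict(X_toy)))
```

Because the toy target has no noise, the learned weights should recover the values used to build it.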

In [None]:
from sklearn.linear_model import LinearRegression, Ridge

# Use final_score as regression target
y_reg = df["final_score"]
X = df[["attendance", "homework_rate", "quiz_avg", "exam_avg"]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y_reg, test_size=0.25, random_state=0
)

print(f"Training set: {len(X_train)} students")
print(f"Test set: {len(X_test)} students")

### Compare Linear Regression vs Ridge

In [None]:
# Linear Regression (no regularization)
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_train = lr.score(X_train, y_train)
lr_test = lr.score(X_test, y_test)

print("Linear Regression:")
print(f"  Training R²: {lr_train:.3f}")
print(f"  Test R²: {lr_test:.3f}")
print()

# Ridge with different alpha values
alphas = [0.1, 1.0, 10.0, 100.0]
ridge_results = []

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    
    train_r2 = ridge.score(X_train, y_train)
    test_r2 = ridge.score(X_test, y_test)
    
    ridge_results.append({
        "alpha": alpha,
        "train_R2": train_r2,
        "test_R2": test_r2
    })
    
    print(f"Ridge (alpha={alpha:5.1f}) → train: {train_r2:.3f}, test: {test_r2:.3f}")

ridge_df = pd.DataFrame(ridge_results)

In [None]:
# Visualize regularization effect
plt.figure(figsize=(8, 5))
plt.plot(ridge_df["alpha"], ridge_df["train_R2"], 'o-', label="Training R²")
plt.plot(ridge_df["alpha"], ridge_df["test_R2"], 's-', label="Test R²")
plt.xscale('log')
plt.xlabel("alpha (regularization strength)")
plt.ylabel("R² score")
plt.title("Ridge Regression: Effect of Regularization")
plt.legend()
plt.grid(alpha=0.3)
plt.show()

**Key insight**: Higher alpha means stronger regularization (a simpler model). This lowers the risk of overfitting, but too much regularization can cause underfitting.

### Compare Model Coefficients

In [None]:
# Compare coefficients for different regularization strengths
fig, ax = plt.subplots(figsize=(10, 5))

# Linear Regression coefficients
feature_names = ["attendance", "homework_rate", "quiz_avg", "exam_avg"]
ax.plot(range(len(lr.coef_)), lr.coef_, 'o', label="LinearRegression", markersize=8)

# Ridge coefficients for different alphas
for alpha in [1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    ax.plot(range(len(ridge.coef_)), ridge.coef_, 's', label=f"Ridge alpha={alpha}", alpha=0.7)

ax.set_xticks(range(len(feature_names)))
ax.set_xticklabels(feature_names, rotation=45)
ax.set_ylabel("Coefficient value")
ax.set_title("Effect of Regularization on Coefficients")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

**Key insight**: Ridge regularization shrinks coefficients toward zero. Stronger regularization (higher alpha) means smaller coefficients.
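
The shrinkage can also be checked numerically rather than read off a plot. The sketch below uses hypothetical synthetic data (`X_toy`, `y_toy`, and the chosen alphas are assumptions for illustration) and prints the L2 norm of the Ridge coefficient vector for increasing alpha.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data: three features with known weights plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=50)

coef_norms = []
for alpha in [0.1, 10.0, 1000.0]:
    coef = Ridge(alpha=alpha).fit(X_toy, y_toy).coef_
    coef_norms.append(float(np.linalg.norm(coef)))  # overall size of the weight vector
    print(f"alpha={alpha:7.1f}  ||w||_2 = {coef_norms[-1]:.3f}")
```

The norm should fall as alpha grows, which is the same effect the coefficient plot shows feature by feature.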

## Part 2: Logistic Regression for Classification

Despite the name, Logistic Regression is a **classification** algorithm. The parameter **C** controls regularization (higher C = less regularization).

In [None]:
from sklearn.linear_model import LogisticRegression

# Use pass/fail classification target
y_class = df["pass_fail"]
X_train_class, X_test_class, y_train_class, y_test_class = train_test_split(
    X, y_class, test_size=0.25, random_state=0
)

C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
logreg_results = []

for C in C_values:
    logreg = LogisticRegression(C=C, max_iter=1000, random_state=0)
    logreg.fit(X_train_class, y_train_class)
    
    train_acc = logreg.score(X_train_class, y_train_class)
    test_acc = logreg.score(X_test_class, y_test_class)
    
    logreg_results.append({
        "C": C,
        "train_accuracy": train_acc,
        "test_accuracy": test_acc
    })
    
    print(f"C={C:6.2f} → train: {train_acc:.3f}, test: {test_acc:.3f}")

logreg_df = pd.DataFrame(logreg_results)

In [None]:
plt.figure(figsize=(8, 5))
plt.plot(logreg_df["C"], logreg_df["train_accuracy"], 'o-', label="Training accuracy")
plt.plot(logreg_df["C"], logreg_df["test_accuracy"], 's-', label="Test accuracy")
plt.xscale('log')
plt.xlabel("C (inverse regularization strength)")
plt.ylabel("Accuracy")
plt.title("Logistic Regression: Effect of C Parameter")
plt.legend()
plt.grid(alpha=0.3)
plt.show()

**Key insight**: For Logistic Regression, **higher C = less regularization = more complex model**. This is opposite to Ridge's alpha parameter!
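
One way to see the direction of `C` is to look at coefficient size. The following sketch is run on hypothetical synthetic labels (`X_toy`, `y_toy`, and the noisy linear rule are assumptions, not the student data) and relies on scikit-learn's default L2 penalty for `LogisticRegression`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: labels follow a noisy linear rule on two features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + noise > 0).astype(int)

logreg_norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_toy, y_toy)
    logreg_norms[C] = float(np.linalg.norm(clf.coef_))  # size of the weight vector
    print(f"C={C:6.2f}  ||w||_2 = {logreg_norms[C]:.3f}")
```

Small `C` should give small weights (strong regularization) and large `C` should let the weights grow, mirroring what small and large alpha do for Ridge in reverse.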

## Part 3: Decision Trees

Decision trees learn hierarchies of if/else questions. The parameter **max_depth** controls how deep the tree can grow.

In [None]:
from sklearn.tree import DecisionTreeClassifier

depths = [1, 2, 3, 5, 10, None]  # None = unlimited depth
tree_results = []

for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train_class, y_train_class)
    
    train_acc = tree.score(X_train_class, y_train_class)
    test_acc = tree.score(X_test_class, y_test_class)
    
    depth_str = "unlimited" if depth is None else str(depth)
    tree_results.append({
        "max_depth": depth_str,
        "train_accuracy": train_acc,
        "test_accuracy": test_acc
    })
    
    print(f"max_depth={depth_str:9s} → train: {train_acc:.3f}, test: {test_acc:.3f}")

### Visualize a Simple Tree

In [None]:
from sklearn.tree import plot_tree

# Train a small tree for visualization
simple_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
simple_tree.fit(X_train_class, y_train_class)

plt.figure(figsize=(14, 8))
plot_tree(simple_tree, 
          feature_names=feature_names, 
          class_names=["fail", "pass"],
          filled=True, 
          rounded=True,
          fontsize=10)
plt.title("Decision Tree (max_depth=3)")
plt.show()

**Key insight**: Each node asks a yes/no question about one feature. The tree splits data until leaves are pure or max_depth is reached.
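
The if/else structure can also be printed as text with `export_text`. The sketch below uses a hypothetical one-feature dataset where the label is "pass" when the score is at least 70 (the data, the `exam_avg` label, and the name `stump` are assumptions for illustration), so a depth-1 tree is enough.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: one score column, pass when score >= 70
rng = np.random.default_rng(0)
scores = rng.integers(40, 101, size=(60, 1))
labels = np.where(scores[:, 0] >= 70, "pass", "fail")

# A depth-1 tree ("stump") asks a single yes/no question
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(scores, labels)

rules = export_text(stump, feature_names=["exam_avg"])
print(rules)
print("training accuracy:", stump.score(scores, labels))
```

The printed rules show one threshold on `exam_avg` with a class at each leaf. Deeper trees print nested rules in the same format.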

### Feature Importance

In [None]:
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train_class, y_train_class)

# Plot feature importances
importance_df = pd.DataFrame({
    "feature": feature_names,
    "importance": tree.feature_importances_
}).sort_values("importance", ascending=False)

plt.figure(figsize=(8, 5))
plt.barh(importance_df["feature"], importance_df["importance"])
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("Feature Importance from Decision Tree")
plt.tight_layout()
plt.show()

print(importance_df.to_string(index=False))

**Key insight**: Feature importance measures how much each feature's splits reduced impurity across the tree, normalized so the values sum to 1. Higher values mean the tree relied on that feature more; a value of 0 means the feature was never used for a split.
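
As a sanity check on how to read these numbers, here is a small sketch on hypothetical data where only one of two features determines the label (`X_toy`, `y_toy`, and `toy_tree` are illustrative names, not part of the notebook's dataset).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: column 0 decides the label, column 1 is pure noise
rng = np.random.default_rng(0)
X_toy = rng.integers(40, 101, size=(200, 2)).astype(float)
y_toy = (X_toy[:, 0] >= 70).astype(int)

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
importances = toy_tree.feature_importances_

for name, value in zip(["signal", "noise"], importances):
    print(f"{name:6s} importance = {value:.3f}")
print("sum of importances:", importances.sum())
```

The noise column should receive essentially no importance because the tree never needs it. On real data the picture is rarely this clean, and correlated features can share or trade importance.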

## Summary: Model Complexity Across Algorithms

Different algorithms have different complexity parameters:

| Algorithm | Parameter | More Complex | Less Complex |
|-----------|-----------|--------------|--------------|
| Ridge | `alpha` | Small alpha | Large alpha |
| Logistic Regression | `C` | Large C | Small C |
| Decision Tree | `max_depth` | Large depth | Small depth |

**General pattern**: 
- More complex models fit training data better but may overfit
- Less complex models are more restricted but may underfit
- The best model complexity depends on your dataset; choose it with a validation set or cross-validation, and keep the test set for the final evaluation only

**Note**: Ridge uses `alpha` where larger = simpler, while Logistic Regression uses `C` where larger = more complex. `C` is the inverse of regularization strength, so it behaves like 1/alpha.
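
One common way to pick a complexity setting is cross-validation on the training data. The sketch below tries it for `max_depth` with `GridSearchCV`, using a hypothetical synthetic dataset from `make_classification` (the data, the grid, and `cv=5` are assumptions for illustration). The test split is touched only once, at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic classification data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# 5-fold cross-validation over candidate depths, using training data only
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 5, 10, None]},
    cv=5,
)
search.fit(X_tr, y_tr)

print("best max_depth:", search.best_params_["max_depth"])
print(f"mean CV accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy: {search.score(X_te, y_te):.3f}")
```

The same pattern should work for `alpha` in Ridge or `C` in Logistic Regression by swapping the estimator and the parameter grid.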