# Session 9-10: Model Selection and Regularization

# Exercise: Model complexity, cross-validation, and regularization

By the end of this lab, you should be able to:

* Generate synthetic data for regression problems

* Identify underfitting and overfitting using model complexity

* Evaluate models using a train–test split and cross-validation

* Apply regularization to control overfitting

# Part 1: Data Generation

## Task 1.1 – Create synthetic nonlinear regression data

In [None]:
# Part 1.1: Data generation

import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Fix random seed for reproducibility
np.random.seed(42)

# Generate input feature x (continuous range between -1 and 1)
n_samples = 30
X = np.random.uniform(-1, 1, size=n_samples)

# Define a nonlinear relationship (true function)
y_true = 2 * X**2 + X

# Add random Gaussian noise
noise = np.random.normal(0, 0.2, size=n_samples)
y = y_true + noise

# Reshape X for sklearn (expects 2D array)
X = X.reshape(-1, 1)

## Task 1.2 – Visualize the dataset

In [None]:
# Part 1.2: Visualization

plt.scatter(X, y, color="blue", label="Observed data")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Synthetic Nonlinear Regression Dataset")
plt.legend()
plt.show()

### Question

#### Is a linear model sufficient to represent this data?

No. The scatter plot shows a curved, roughly parabolic pattern, consistent with the quadratic term in the true function `2 * X**2 + X`. A straight line cannot capture that curvature well.
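The visual impression can be backed with a quick numeric check. The sketch below regenerates the same dataset and compares the training R² of a straight line against a quadratic fit. The use of `make_pipeline` and the `r2` dictionary are choices made for this illustration, not part of the original lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Regenerate the lab's dataset so this cell runs on its own
np.random.seed(42)
X = np.random.uniform(-1, 1, size=30)
y = 2 * X**2 + X + np.random.normal(0, 0.2, size=30)
X = X.reshape(-1, 1)

# Fit a straight line (degree 1) and a quadratic (degree 2) and compare fit quality
r2 = {}
for d in (1, 2):
    model = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),
        LinearRegression(),
    )
    model.fit(X, y)
    r2[d] = model.score(X, y)  # R^2 on the training data
    print(f"Degree {d}: training R^2 = {r2[d]:.3f}")
```

Because the quadratic model contains the linear one as a special case, its training R² can never be lower. The size of the gap is what suggests the line is missing real structure.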

# Part 2: Polynomial Regression & Model Complexity

## Task 2.1 – Train models with different polynomial degrees

In [None]:
# Part 2.1: Polynomial regression models

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

In [None]:
degrees = [1, 3, 10]  # low, medium, high complexity
models = {}

for d in degrees:
    poly = PolynomialFeatures(degree=d, include_bias=False)
    X_poly = poly.fit_transform(X)
    
    model = LinearRegression()
    model.fit(X_poly, y)
    
    models[d] = (poly, model)

## Task 2.2 – Plot prediction curves

In [None]:
# Part 2.2: Plot prediction curves

X_plot = np.linspace(-1, 1, 200).reshape(-1, 1)

plt.scatter(X, y, color="black", label="Data")

for d, (poly, model) in models.items():
    X_plot_poly = poly.transform(X_plot)
    y_plot = model.predict(X_plot_poly)
    plt.plot(X_plot, y_plot, label=f"Degree {d}")

plt.xlabel("x")
plt.ylabel("y")
plt.title("Polynomial Regression with Different Degrees")
plt.legend()
plt.show()

### Questions

#### Which model is underfitting? Which model is overfitting? Which model appears to generalize best? Explain your reasoning based on the plots.

* Underfitting: Degree 1

* Overfitting: Degree 10

* Best generalization: Degree 3

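Reasoning: the degree-1 line cannot bend to follow the curvature, the degree-10 curve is flexible enough to chase individual noisy points, and the degree-3 curve follows the overall trend smoothly without fitting the noise.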

# Part 3: Train–Test Split Evaluation

## Task 3.1 – Split data (70% train, 30% test)

In [None]:
# Part 3.1: Train-test split

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

## Task 3.2 – Compute training and test MSE

In [None]:
# Part 3.2: Evaluate each polynomial degree

for d in degrees:
    poly = PolynomialFeatures(degree=d, include_bias=False)
    
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    
    model = LinearRegression()
    model.fit(X_train_poly, y_train)
    
    train_mse = mean_squared_error(y_train, model.predict(X_train_poly))
    test_mse = mean_squared_error(y_test, model.predict(X_test_poly))
    
    print(f"Degree {d}: Train MSE = {train_mse:.4f}, Test MSE = {test_mse:.4f}")

### Question

#### How do training and test errors differ for underfitting vs overfitting models?

* Underfitting: both training and test error are high

* Overfitting: training error is low, but test error is noticeably higher
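To see this pattern across the full range of complexity rather than at three degrees only, one option is to sweep every degree from 1 to 10 on the same 70/30 split. This is a sketch under the lab's own settings. The `train_mse` and `test_mse` dictionaries and the use of `make_pipeline` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Regenerate the lab's dataset so this cell runs on its own
np.random.seed(42)
X = np.random.uniform(-1, 1, size=30)
y = 2 * X**2 + X + np.random.normal(0, 0.2, size=30)
X = X.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Record training and test MSE for every polynomial degree from 1 to 10
train_mse, test_mse = {}, {}
for d in range(1, 11):
    model = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    train_mse[d] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[d] = mean_squared_error(y_test, model.predict(X_test))
    print(f"Degree {d:2d}: Train MSE = {train_mse[d]:.4f}, Test MSE = {test_mse[d]:.4f}")
```

Training error can only go down as the degree grows, since each model contains the previous one. Test error typically falls at first and then rises again once the model starts fitting noise, although with only nine test points the exact shape depends on the split.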

# Part 4: Cross-Validation

## Task 4.1 – 5-fold cross-validation

In [None]:
# Part 4.1: Cross-validation

from sklearn.model_selection import cross_val_score

In [None]:
for d in degrees:
    poly = PolynomialFeatures(degree=d, include_bias=False)
    X_poly = poly.fit_transform(X)
    
    model = LinearRegression()
    
    # sklearn scorers follow "higher is better", so MSE is returned negated
    cv_scores = cross_val_score(
        model, X_poly, y,
        cv=5,
        scoring="neg_mean_squared_error"
    )
    
    cv_mse = -cv_scores.mean()
    print(f"Degree {d}: CV MSE = {cv_mse:.4f}")

### Question

#### Why is cross-validation more reliable?

It averages performance over several train–validation splits, so the estimate depends less on one particular random train–test split. Every sample is also used for validation exactly once.
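The dependence on a single split can be made visible by repeating the 70/30 split with different seeds and comparing the resulting test errors with a 5-fold estimate. The sketch below does this for the degree-3 model. The seeds 0 to 4, the shuffled `KFold`, and the variable names are arbitrary choices made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Regenerate the lab's dataset so this cell runs on its own
np.random.seed(42)
X = np.random.uniform(-1, 1, size=30)
y = 2 * X**2 + X + np.random.normal(0, 0.2, size=30)
X = X.reshape(-1, 1)

model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression(),
)

# Test MSE from five different single 70/30 splits (seeds are arbitrary)
split_mses = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model.fit(X_tr, y_tr)
    split_mses.append(mean_squared_error(y_te, model.predict(X_te)))

# 5-fold CV: each sample lands in a validation fold exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(
    model, X, y, cv=cv, scoring="neg_mean_squared_error"
)
cv_mse = -fold_scores.mean()

print("Single-split test MSEs:", np.round(split_mses, 4))
print(f"5-fold CV MSE: {cv_mse:.4f}")
```

The spread among the single-split values shows how much one random split can move the estimate. The cross-validated figure averages over that variation.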

# Part 5: Regularization

## Task 5.1 – Ridge Regression (L2)

In [None]:
# Part 5.1: Ridge regression on high-degree model

from sklearn.linear_model import Ridge

In [None]:
degree = 10
lambdas = [0.01, 1, 100]

poly = PolynomialFeatures(degree=degree, include_bias=False)
X_poly = poly.fit_transform(X)

plt.scatter(X, y, color="black", label="Data")

for lam in lambdas:
    ridge = Ridge(alpha=lam)
    ridge.fit(X_poly, y)
    
    X_plot_poly = poly.transform(X_plot)
    y_plot = ridge.predict(X_plot_poly)
    
    plt.plot(X_plot, y_plot, label=f"λ={lam}")

plt.title("Ridge Regression (L2 Regularization)")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()

### Question

#### How does increasing λ affect model complexity, bias, and variance?

Effect of increasing λ:

* Lower complexity

* Higher bias

* Lower variance
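One way to observe the drop in complexity directly is to track the size of the fitted coefficient vector as λ grows. The sketch below refits the degree-10 ridge model for the same three values and prints the L2 norm of the coefficients. The `coef_norms` dictionary is introduced here for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Regenerate the lab's dataset so this cell runs on its own
np.random.seed(42)
X = np.random.uniform(-1, 1, size=30)
y = 2 * X**2 + X + np.random.normal(0, 0.2, size=30)
X = X.reshape(-1, 1)

X_poly = PolynomialFeatures(degree=10, include_bias=False).fit_transform(X)

# Fit ridge for each penalty strength and record the coefficient norm
coef_norms = {}
for lam in [0.01, 1, 100]:
    ridge = Ridge(alpha=lam)  # sklearn calls the penalty strength "alpha"
    ridge.fit(X_poly, y)
    coef_norms[lam] = np.linalg.norm(ridge.coef_)
    print(f"lambda = {lam:>6}: ||coef||_2 = {coef_norms[lam]:.4f}")
```

The norm shrinks as λ increases: the penalty pulls all coefficients toward zero, which flattens the fitted curve (higher bias) and makes it less sensitive to the particular training sample (lower variance).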

## Task 5.3 – Lasso Regression (L1)

In [None]:
# Part 5.3: Lasso regression

from sklearn.linear_model import Lasso

In [None]:
lasso = Lasso(alpha=0.05, max_iter=10000)
lasso.fit(X_poly, y)

# Inspect coefficients
coefficients = lasso.coef_

print("Lasso coefficients:")
print(coefficients)

### Questions

#### Which regularization method sets some coefficients exactly to zero?

Lasso (L1)

#### Why does this imply feature selection?

A feature whose coefficient is exactly zero contributes nothing to the prediction, so it is effectively removed from the model.
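The selection effect can be read off the fitted model by listing which polynomial powers keep a nonzero coefficient. This sketch reuses the lab's `Lasso(alpha=0.05, max_iter=10000)` settings on the degree-10 features. The `powers` and `selected` names are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

# Regenerate the lab's dataset so this cell runs on its own
np.random.seed(42)
X = np.random.uniform(-1, 1, size=30)
y = 2 * X**2 + X + np.random.normal(0, 0.2, size=30)
X = X.reshape(-1, 1)

# With one input and include_bias=False, the columns are x^1, x^2, ..., x^10
X_poly = PolynomialFeatures(degree=10, include_bias=False).fit_transform(X)

lasso = Lasso(alpha=0.05, max_iter=10000)
lasso.fit(X_poly, y)

# Powers of x whose coefficient survived the L1 penalty
powers = np.arange(1, 11)
selected = powers[lasso.coef_ != 0]

print("Powers kept by Lasso:", selected.tolist())
print("Number of coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))
```

Only the listed powers take part in the prediction. The remaining columns could be dropped from the feature matrix without changing the model's output.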

# Part 6: Reflection (Answers)

#### (1) Why does increasing polynomial degree increase the risk of overfitting?

A higher polynomial degree adds more free coefficients, which makes the model flexible enough to fit the noise in the training data rather than just the underlying trend.

#### (2) How does regularization help control model complexity?

Regularization adds a penalty on coefficient magnitude to the training loss. This discourages large coefficients and so limits the effective complexity of the model.
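For reference, the two penalized objectives can be written out as follows. The notation is an assumption made for this note: $\phi(x_i)$ is the vector of polynomial features of sample $i$, $w$ the coefficient vector of length $d$, $b$ the intercept, and $\lambda$ the penalty strength that scikit-learn exposes as `alpha`. The scaling of the squared-error term follows scikit-learn's documented objectives for `Ridge` and `Lasso`.

```latex
\begin{align}
\text{Ridge (L2):}\quad
  & \min_{w,\,b}\ \sum_{i=1}^{n} \bigl(y_i - b - w^\top \phi(x_i)\bigr)^2
    + \lambda \sum_{j=1}^{d} w_j^2 \\
\text{Lasso (L1):}\quad
  & \min_{w,\,b}\ \frac{1}{2n} \sum_{i=1}^{n} \bigl(y_i - b - w^\top \phi(x_i)\bigr)^2
    + \lambda \sum_{j=1}^{d} \lvert w_j \rvert
\end{align}
```

In both cases, setting $\lambda = 0$ recovers ordinary least squares, and the intercept $b$ is not penalized.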

#### (3) When would you prefer L1 over L2 regularization?

L1 is preferred when feature selection or sparsity is desired.

#### (4) Why is cross-validation important when tuning hyperparameters?

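Cross-validation provides a more stable estimate of model performance than a single split, so hyperparameters chosen with it are less likely to be tuned to the quirks of one particular validation set.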
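As a sketch of how this looks in practice, the ridge penalty for the degree-10 model can be chosen by cross-validated grid search. The grid of candidate values, the shuffled 5-fold scheme, and the pipeline layout below are assumptions made for this illustration, not settings from the lab.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Regenerate the lab's dataset so this cell runs on its own
np.random.seed(42)
X = np.random.uniform(-1, 1, size=30)
y = 2 * X**2 + X + np.random.normal(0, 0.2, size=30)
X = X.reshape(-1, 1)

# make_pipeline names each step after its class in lowercase, hence "ridge__alpha"
pipe = make_pipeline(
    PolynomialFeatures(degree=10, include_bias=False),
    Ridge(),
)

alphas = [0.001, 0.01, 0.1, 1, 10, 100]  # hypothetical candidate grid
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": alphas},
    cv=cv,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
print(f"Best alpha: {best_alpha}")
print(f"CV MSE at best alpha: {-search.best_score_:.4f}")
```

Each candidate is scored on the same five folds, so the comparison between values of `alpha` does not hinge on one lucky or unlucky split.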