# Question 4

## Ridge Regression: Effect of λ on Bias–Variance Tradeoff

We estimate the regression coefficients by minimizing:

$$
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

where $ \lambda $ controls the strength of the penalty on coefficient size.
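
As a minimal sketch of this objective (not part of the original solution), the minimizer can be computed in closed form with NumPy once the predictors and response are centered, which leaves the intercept $ \beta_0 $ unpenalized. The function name `ridge_fit` and the synthetic data are assumptions made purely for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2), leaving the intercept unpenalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering removes the intercept from the problem
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta           # recover the intercept afterwards
    return beta0, beta

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 + X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(size=100)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
coef_norms = [float(np.sum(ridge_fit(X, y, lam)[1] ** 2)) for lam in lambdas]
for lam, norm in zip(lambdas, coef_norms):
    print(f"lambda={lam:>7}: sum of squared coefficients = {norm:.4f}")
```

The printed sum of squared coefficients shrinks as the penalty grows, which is the effect the remaining parts reason about.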

---

### (a) Effect of $ \lambda $ on **Training RSS**

**Answer:** iii. Steadily increase.

**Justification:**

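- When $ \lambda = 0 $, we recover **ordinary least squares (OLS)**, which by definition attains the smallest possible training RSS.
- As $ \lambda $ increases, the penalty forces the coefficients $ \beta_j $ to shrink away from the OLS solution.
- Any coefficient vector other than the OLS one cannot fit the training data better, so the **training RSS increases** steadily as the model becomes less flexible.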
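
The argument above can be checked numerically. The following is a small sketch on synthetic data (the data, the helper `training_rss`, and the penalty grid are all assumptions for illustration), using centered variables so that no intercept is needed.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
Xc, yc = X - X.mean(axis=0), y - y.mean()    # centered, so the intercept drops out

def training_rss(lam):
    """Training RSS of the ridge fit with penalty lam (lam = 0 gives OLS)."""
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return float(np.sum((yc - Xc @ beta) ** 2))

lambdas = np.logspace(-3, 4, 8)
rss_ols = training_rss(0.0)
rss_path = [training_rss(lam) for lam in lambdas]

print(f"OLS training RSS: {rss_ols:.4f}")
for lam, rss in zip(lambdas, rss_path):
    print(f"lambda={lam:10.3f}  training RSS={rss:.4f}")
```

Every ridge fit has a training RSS at least as large as the OLS value, and the values do not decrease along the grid.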

---

### (b) Effect of $ \lambda $ on **Test RSS**

**Answer:** ii. Decrease initially, and then eventually start increasing in a U shape.

**Justification:**

- Small $ \lambda $: the fit is close to OLS, which can overfit → high variance → high test error.
- Moderate $ \lambda $: variance falls faster than bias rises → test error decreases.
- Large $ \lambda $: coefficients are shrunk too far → underfitting → high bias → test error increases.
- So, test RSS follows a **U-shape**.

---

### (c) Effect of $ \lambda $ on **Variance**

**Answer:** iv. Steadily decrease.

**Justification:**

- As $ \lambda $ increases, the model becomes less sensitive to training data.
- Coefficients shrink toward zero, making the model more stable.
- Thus, **model variance decreases** steadily.

---

### (d) Effect of $ \lambda $ on **(Squared) Bias**

**Answer:** iii. Steadily increase.

**Justification:**

- As $ \lambda $ increases, the model is less able to capture the true relationship.
- Predictions deviate more from the actual function.
- Therefore, **bias increases** with $ \lambda $.
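
Parts (c) and (d) can be illustrated with the exact formulas for a fixed design. The sketch below assumes a linear truth $ y = X\beta + \varepsilon $ with noise variance `sigma2`, no intercept, and evaluates bias and variance of the fitted values at the training inputs; these settings and all names are assumptions chosen for illustration, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2 = 50, 5, 1.0
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5, 0.0, 1.5])
G = X.T @ X

def bias2_and_variance(lam):
    """Average squared bias and variance of the ridge fitted values at the rows of X."""
    A = np.linalg.inv(G + lam * np.eye(p))       # (X'X + lam I)^{-1}
    mean_beta = A @ G @ beta_true                # E[beta_hat] under the assumed model
    cov_beta = sigma2 * A @ G @ A                # Cov[beta_hat] under the assumed model
    bias2 = np.mean((X @ (mean_beta - beta_true)) ** 2)
    variance = np.mean(np.sum((X @ cov_beta) * X, axis=1))  # mean of diag(X Cov X')
    return float(bias2), float(variance)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
results = [bias2_and_variance(lam) for lam in lambdas]
for lam, (b2, var) in zip(lambdas, results):
    print(f"lambda={lam:>7}: squared bias={b2:.4f}, variance={var:.4f}")
```

Under these assumptions the squared bias rises and the variance falls at every step of the grid, matching the answers to (c) and (d).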

---

### (e) Effect of $ \lambda $ on **Irreducible Error**

**Answer:** v. Remain constant.

**Justification:**

- Irreducible error is due to noise in the data (e.g., measurement error).
- It is **independent** of the model or choice of $ \lambda $.
- Hence, it **remains constant**.
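
The decomposition behind parts (c) to (e) can be written out explicitly. The notation $ f $, $ \hat{f}_\lambda $, $ x_0 $, and $ \varepsilon $ is introduced here for illustration and does not appear above; it assumes a test point $ (x_0, y_0) $ with $ y_0 = f(x_0) + \varepsilon $:

$$
\mathbb{E}\left[ \left( y_0 - \hat{f}_\lambda(x_0) \right)^2 \right] = \mathrm{Var}\left( \hat{f}_\lambda(x_0) \right) + \left[ \mathrm{Bias}\left( \hat{f}_\lambda(x_0) \right) \right]^2 + \mathrm{Var}(\varepsilon)
$$

Only the first two terms depend on $ \lambda $; the last term, $ \mathrm{Var}(\varepsilon) $, is the irreducible error.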

---

### Summary 

| Part | Quantity             | Answer | Description |
|------|----------------------|--------|-------------|
| (a)  | Training RSS         | iii    | Increases as $ \lambda $ increases |
| (b)  | Test RSS             | ii     | U-shaped curve |
| (c)  | Variance             | iv     | Steadily decreases |
| (d)  | Squared Bias         | iii    | Steadily increases |
| (e)  | Irreducible Error    | v      | Remains constant |


# Question 9

```python
# 1. Load libraries
import pandas as pd
import numpy as np
from ISLP import load_data
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline

# 2. Load data
College = load_data('College')
X = College.drop('Apps', axis=1)
X = pd.get_dummies(X, drop_first=True)
y = College['Apps']

# 3. Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
```


### (b) OLS Model

```python
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)
mse_lr
```

```
642753.8976533434
```

### (c) Ridge Regression

```python
alphas = np.logspace(-3, 5, 100)
ridge = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=10)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
mse_ridge
```

```
653969.1227226362
```

### (d) Lasso Regression

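```python
# Predictors are left on their original scales here; the lasso penalty is scale-sensitive.
lasso = LassoCV(cv=10, max_iter=10000)  # default alpha grid, chosen by 10-fold CV
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
n_nonzero = np.sum(lasso.coef_ != 0)  # number of predictors the lasso keeps
mse_lasso, n_nonzero
```

```
(831509.2476266614, 6)
```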
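
Because the lasso penalty depends on the scale of each predictor, a common variant standardizes the predictors inside a pipeline so that the scaler is fit on training data only. The sketch below illustrates that pattern on synthetic data from `make_regression` (the data and variable names are assumptions; it is not a re-run of the College analysis and says nothing about what the College results would be).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 17 predictors, 6 of them informative
X_demo, y_demo = make_regression(n_samples=300, n_features=17, n_informative=6,
                                 noise=10.0, random_state=1)
X_demo[:, 0] *= 1000.0  # put one column on a much larger scale
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

# The scaler is fit on the training data and reused unchanged for the test data
lasso_pipe = make_pipeline(StandardScaler(), LassoCV(cv=10, max_iter=10000))
lasso_pipe.fit(Xd_train, yd_train)

demo_mse = mean_squared_error(yd_test, lasso_pipe.predict(Xd_test))
demo_nonzero = int(np.sum(lasso_pipe.named_steps["lassocv"].coef_ != 0))
print(demo_mse, demo_nonzero)
```

Wrapping the scaler and the model in one pipeline also means cross-validation inside `LassoCV` sees consistently scaled inputs.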

### (e) PCR 

```python
# Standardize + PCA + linear regression for every number of components m
mse_pcr = []
for m in range(1, X_train.shape[1]+1):
    pca = PCA(n_components=m)
    X_train_pca = pca.fit_transform(StandardScaler().fit_transform(X_train))
    # Caveat: the scaler is refit on X_test here, so the test data are standardized
    # with their own mean and variance rather than the training set's.
    X_test_pca = pca.transform(StandardScaler().fit_transform(X_test))

    model = LinearRegression()
    model.fit(X_train_pca, y_train)
    y_pred = model.predict(X_test_pca)
    mse_pcr.append(mean_squared_error(y_test, y_pred))

# Caveat: m is chosen by test MSE here rather than by cross-validation.
best_m_pcr = np.argmin(mse_pcr) + 1
best_m_pcr
```

```
17
```
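
An alternative way to choose the number of components is cross-validation on the training set, with scaling, PCA, and regression combined in one pipeline. The sketch below shows that pattern on synthetic data (the data, step names, and variable names are assumptions for illustration; it does not reproduce the College numbers). It also checks that PCR with every component retained gives the same predictions as OLS.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

pcr = Pipeline([("scale", StandardScaler()), ("pca", PCA()), ("ols", LinearRegression())])
grid = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": list(range(1, Xd_train.shape[1] + 1))},
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=1),
)
grid.fit(Xd_train, yd_train)  # scaler and PCA are refit inside every training fold

best_m_cv = grid.best_params_["pca__n_components"]
pcr_test_mse = mean_squared_error(yd_test, grid.predict(Xd_test))

# With all components kept, PCR only rotates the standardized predictors,
# so its predictions coincide with those of plain OLS.
full_pcr = pcr.set_params(pca__n_components=Xd_train.shape[1]).fit(Xd_train, yd_train)
ols = LinearRegression().fit(Xd_train, yd_train)
same_as_ols = bool(np.allclose(full_pcr.predict(Xd_test), ols.predict(Xd_test)))
print(best_m_cv, pcr_test_mse, same_as_ols)
```

The test set is touched only once, after the number of components has been fixed.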

### (f) PLS

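```python
mse_pls = []
for m in range(1, X_train.shape[1]+1):
    pls = PLSRegression(n_components=m)
    pls.fit(StandardScaler().fit_transform(X_train), y_train)
    # Caveat: the scaler is refit on X_test here instead of reusing the training scaler.
    y_pred = pls.predict(StandardScaler().fit_transform(X_test))
    mse_pls.append(mean_squared_error(y_test, y_pred))

# Caveat: m is chosen by test MSE here rather than by cross-validation.
best_m_pls = np.argmin(mse_pls) + 1
best_m_pls
```

```
9
```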

### (g) Compare and Comment

```python
results = pd.DataFrame({
    'Model': ['OLS', 'Ridge', 'Lasso', 'PCR', 'PLS'],
    'Test MSE': [mse_lr, mse_ridge, mse_lasso, min(mse_pcr), min(mse_pls)],
    'Best Param': ['-', ridge.alpha_, f'{n_nonzero} non-zero', best_m_pcr, best_m_pls]
})

print(results)
```

```
   Model       Test MSE  Best Param
0    OLS  642753.897653           -
1  Ridge  653969.122723   23.101297
2  Lasso  831509.247627  6 non-zero
3    PCR  958162.737647          17
4    PLS  941216.372794           9
```


| Model | Test MSE       | Best Param    |
| ----- | -------------- | ------------- |
| OLS   | **642,753.90** | –             |
| Ridge | 653,969.12     | 23.10         |
| Lasso | 831,509.25     | 6 non-zero    |
| PCR   | 958,162.74     | 17 components |
| PLS   | 941,216.37     | 9 components  |


1. Best Performance: OLS
The ordinary least squares (OLS) model yielded the lowest test MSE (642,754), meaning it had the best out-of-sample predictive accuracy on this particular train/test split.

This suggests that regularization was not essential for this dataset: multicollinearity or overfitting may not have been severe.

2. Ridge Regression
Ridge regression performed nearly as well as OLS, with a test MSE about 1.7% higher (653,969 vs. 642,754).

It selected a penalty of λ ≈ 23.1, which shrinks all coefficients but sets none of them exactly to zero.

It is more robust to multicollinearity, so it might be preferred when predictors are highly correlated, even though it was not optimal here.

3. Lasso Regression
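Lasso produced a noticeably higher test error (about 29% above OLS) but kept only 6 non-zero coefficients, showing that it performs automatic feature selection. The predictors were not standardized before fitting, and the lasso penalty is sensitive to predictor scale, which may account for part of the gap.

It is a good choice if you want a sparse, interpretable model and can accept some loss of predictive accuracy.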

4. PCR & PLS
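Both PCR and PLS performed worse than OLS, ridge, and lasso.

PCR used 17 principal components (unsupervised), while PLS used 9 (supervised). Since the College data have 17 predictors once Apps is removed, PCR with 17 components keeps every component, so no dimension reduction actually takes place. With all components retained, PCR is equivalent to OLS on the standardized predictors, so its test MSE would be expected to match the OLS value.

This might indicate that:

- The outcome (Apps) isn't strongly aligned with the leading principal components.
- Dimensionality reduction discarded some important predictive information.
- The comparison is affected by preprocessing: the test set was standardized with its own mean and variance rather than the training set's, and the number of components was chosen by test MSE rather than by cross-validation.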
