## Chapter 6, Question 9

### In this exercise, we will predict the number of applications received using the other variables in the College data set.

In [1]:
!pip install ISLP


### (a) Split the data set into a training set and a test set.

In [2]:
import pandas as pd
from ISLP import load_data
from sklearn.model_selection import train_test_split

# Load the College dataset
college = load_data('College')

# Define the features (X) and target variable (y)
X = college.drop("Apps", axis=1)
y = college["Apps"]

# One-hot encode categorical predictors (e.g. Private), dropping one level to avoid redundancy
X = pd.get_dummies(X, drop_first=True)

# Split the dataset into a training set and a test set (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

### (b) Fit a linear model using least squares on the training set, and report the test error obtained.

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Initialize and fit the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the test error (mean squared error)
test_error = mean_squared_error(y_test, y_pred)
print(f"Test Error (MSE): {test_error}")

Test Error (MSE): 1492443.379039042


### (c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

In [4]:
from sklearn.linear_model import RidgeCV

# Initialize and fit the ridge regression model with cross-validation
alphas = [0.1, 1, 10, 100, 1000]  # Example alpha values for cross-validation
ridge_cv_model = RidgeCV(alphas=alphas, cv=5) # Use 5-fold cross-validation
ridge_cv_model.fit(X_train, y_train)

# Make predictions on the test set using the best lambda
y_pred_ridge_cv = ridge_cv_model.predict(X_test)

# Calculate the test error (mean squared error) for the ridge regression model
test_error_ridge_cv = mean_squared_error(y_test, y_pred_ridge_cv)
print(f"Test Error (Ridge CV, MSE): {test_error_ridge_cv}")
print(f"Best alpha (lambda): {ridge_cv_model.alpha_}")

Test Error (Ridge CV, MSE): 1478572.8112797008
Best alpha (lambda): 10.0
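
The ridge fit above penalizes unscaled predictors and searches only five alpha values. A minimal sketch of one alternative, assuming standardized predictors and a finer log-spaced grid, is shown below. It runs on synthetic data from `make_regression` as a stand-in for the College data, so the names `X_demo`, `y_demo`, `demo_alphas` and `ridge_pipe` are illustrative and the printed numbers say nothing about the real data set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the College data (assumption: 17 numeric predictors)
X_demo, y_demo = make_regression(n_samples=300, n_features=17, noise=25.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fit on the training set only; RidgeCV then runs
# 5-fold cross-validation on the scaled training data.
demo_alphas = np.logspace(-3, 3, 50)
ridge_pipe = make_pipeline(StandardScaler(), RidgeCV(alphas=demo_alphas, cv=5))
ridge_pipe.fit(Xd_train, yd_train)

chosen_alpha = ridge_pipe.named_steps["ridgecv"].alpha_
demo_mse = mean_squared_error(yd_test, ridge_pipe.predict(Xd_test))
print(f"Chosen alpha: {chosen_alpha:.4g}, test MSE: {demo_mse:.1f}")
```

If the chosen alpha lands on either end of the grid, widening the grid is usually worth a try.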


### (d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

In [5]:
from sklearn.linear_model import LassoCV

# Initialize and fit the lasso model with cross-validation
lasso_cv_model = LassoCV(alphas=alphas, cv=5)  # Use 5-fold cross-validation
lasso_cv_model.fit(X_train, y_train)

# Make predictions on the test set using the best lambda
y_pred_lasso_cv = lasso_cv_model.predict(X_test)

# Calculate the test error (mean squared error) for the lasso model
test_error_lasso_cv = mean_squared_error(y_test, y_pred_lasso_cv)
print(f"Test Error (Lasso CV, MSE): {test_error_lasso_cv}")
print(f"Best alpha (lambda): {lasso_cv_model.alpha_}")

# Count the number of non-zero coefficients
non_zero_coefs = sum(lasso_cv_model.coef_ != 0)
print(f"Number of non-zero coefficients: {non_zero_coefs}")

Test Error (Lasso CV, MSE): 1477248.9589983297
Best alpha (lambda): 10.0
Number of non-zero coefficients: 17
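
To see which predictors the lasso keeps, and not only how many, the coefficients can be paired with column names. The sketch below is an assumption-laden illustration: it uses synthetic data in which only 5 of 17 predictors matter, standardizes the predictors, and lets `LassoCV` build its own grid of candidate alphas. The names `X_lasso_demo`, `lasso_pipe`, `kept` and `dropped` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 17 predictors, only 5 of which truly affect the response
X_arr, y_lasso_demo = make_regression(
    n_samples=300, n_features=17, n_informative=5, noise=10.0, random_state=0
)
X_lasso_demo = pd.DataFrame(X_arr, columns=[f"x{j}" for j in range(17)])

# With alphas left unset, LassoCV generates its own grid of candidate values
lasso_pipe = make_pipeline(StandardScaler(), LassoCV(cv=5))
lasso_pipe.fit(X_lasso_demo, y_lasso_demo)

# Label each coefficient with its column name and split by zero / non-zero
lasso_step = lasso_pipe.named_steps["lassocv"]
coefs = pd.Series(lasso_step.coef_, index=X_lasso_demo.columns)
kept = coefs[coefs != 0]
dropped = coefs[coefs == 0]

print(f"Chosen alpha: {lasso_step.alpha_:.4g}")
print(f"Kept {len(kept)} predictors: {list(kept.index)}")
print(f"Dropped {len(dropped)} predictors: {list(dropped.index)}")
```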


### (e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

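# PCR: fit one model per M and keep the M with the lowest test MSE.
# Note: M is selected on the test set here, not by cross-validation,
# so the reported error is optimistic.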
best_test_error = float('inf')
best_M = 0

for M in range(1, X_train.shape[1] + 1):
    pca = PCA(n_components=M)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)

    pcr_model = LinearRegression()
    pcr_model.fit(X_train_pca, y_train)

    y_pred_pcr = pcr_model.predict(X_test_pca)
    test_error_pcr = mean_squared_error(y_test, y_pred_pcr)

    if test_error_pcr < best_test_error:
        best_test_error = test_error_pcr
        best_M = M

print(f"Test Error (PCR CV, MSE): {best_test_error}")
print(f"Best M (Number of Principal Components): {best_M}")

Test Error (PCR CV, MSE): 1492443.3790390224
Best M (Number of Principal Components): 17
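
The loop above picks M by looking at the test set. A sketch of choosing M by cross-validation on the training set alone could look like the following. It assumes a `Pipeline` of scaling, PCA and least squares tuned with `GridSearchCV`, and it runs on synthetic data, so `X_pcr_demo`, `pcr_pipe` and `pcr_search` are illustrative names and the output does not describe the College data.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the College data (assumption: 17 numeric predictors)
X_pcr_demo, y_pcr_demo = make_regression(
    n_samples=300, n_features=17, noise=25.0, random_state=1
)
Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    X_pcr_demo, y_pcr_demo, test_size=0.2, random_state=42
)

# Scaling and PCA are refit inside every CV fold because they live in the pipeline
pcr_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])
pcr_search = GridSearchCV(
    pcr_pipe,
    param_grid={"pca__n_components": list(range(1, Xp_train.shape[1] + 1))},
    scoring="neg_mean_squared_error",
    cv=5,
)
pcr_search.fit(Xp_train, yp_train)

# The test set is touched only once, after M has been fixed
cv_M = pcr_search.best_params_["pca__n_components"]
pcr_test_mse = mean_squared_error(yp_test, pcr_search.predict(Xp_test))
print(f"M chosen by CV: {cv_M}, test MSE: {pcr_test_mse:.1f}")
```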


### (f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

In [7]:
from sklearn.cross_decomposition import PLSRegression

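# PLS: fit one model per M and keep the M with the lowest test MSE.
# Note: M is selected on the test set here, not by cross-validation,
# so the reported error is optimistic.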
best_test_error_pls = float('inf')
best_M_pls = 0

for M in range(1, min(X_train.shape[1], X_train.shape[0]) + 1):
    pls_model = PLSRegression(n_components=M)
    pls_model.fit(X_train_scaled, y_train)

    y_pred_pls = pls_model.predict(X_test_scaled)
    test_error_pls = mean_squared_error(y_test, y_pred_pls)

    if test_error_pls < best_test_error_pls:
        best_test_error_pls = test_error_pls
        best_M_pls = M

print(f"Test Error (PLS CV, MSE): {best_test_error_pls}")
print(f"Best M (Number of PLS components): {best_M_pls}")

Test Error (PLS CV, MSE): 1448566.3424517359
Best M (Number of PLS components): 7


### (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Analysis of College Applications Prediction:

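The five test MSEs fall in a narrow band, from about 1.45 million (PLS, M = 7) to about 1.49 million (least squares, and PCR with M = 17). Ridge (1.479 million) and lasso (1.477 million) sit in between. PCR with all 17 components reproduces the least-squares fit, which is why those two errors match. Taking square roots, every model has a root-mean-squared error of roughly 1,200 applications.
The gap between the best and worst model is about 3%, so no approach stands out as clearly best.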

Further considerations:
1. Magnitude of Error:
MSE is measured in squared applications, so compare its square root (RMSE, about 1,200 here) with the typical number of applications a college receives. An error of that size is minor for a large university but substantial for a small college.

2. Model Interpretability:
Least-squares regression gives coefficients that are easy to read. Ridge and lasso add regularization, which can improve predictions but shrinks the coefficients, so they no longer estimate the unpenalized effects. PCR and PLS reduce dimensionality, but each component is a linear combination of all predictors, which makes the fitted models harder to interpret.

3. Feature Importance:
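The lasso kept all 17 coefficients non-zero at the selected alpha of 10, so it performs no variable selection here and does not single out a subset of important features. A stronger penalty would be needed before any predictor is dropped.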

4. Cross-Validation Tuning:
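The hyperparameters (alpha for ridge and lasso, M for PCR and PLS) drive the results, and how they are chosen matters. The code tries only five alphas, and both ridge and lasso picked 10, so a finer grid could find a better value. Ridge and lasso are also fit on unstandardized predictors, so the penalty weighs variables unevenly depending on their units. The PCR and PLS loops pick M by minimizing test MSE rather than by cross-validation, which makes their reported test errors optimistic.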

5. Model Comparison:
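The differences in test error are small and come from a single train/test split, so they may not be statistically meaningful. A paired comparison of per-observation squared errors on the test set (for example a paired t-test or a bootstrap), or repeating the split with different seeds, would show whether they are.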
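
To put the reported errors on the scale of the response, the MSE values printed by the cells above can be converted to RMSE. The snippet below copies those five numbers by hand; the dictionary name `reported_mse` is only for illustration.

```python
import math

# Test MSEs copied from the cell outputs above
reported_mse = {
    "Least squares": 1492443.379039042,
    "Ridge": 1478572.8112797008,
    "Lasso": 1477248.9589983297,
    "PCR": 1492443.3790390224,
    "PLS": 1448566.3424517359,
}

# RMSE is in the same units as the response (number of applications)
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}
for name, value in rmse.items():
    print(f"{name:<14} RMSE = {value:,.0f} applications")

# Relative gap between the worst and best test MSE
best, worst = min(reported_mse.values()), max(reported_mse.values())
relative_gap = (worst - best) / worst
print(f"Gap between best and worst MSE: {relative_gap:.1%}")
```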

Overall:
No single model is clearly superior on this split. The choice of model depends on the desired balance between prediction accuracy and interpretability.
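
One way to check whether a gap between two models' test errors is more than noise is a paired bootstrap over test observations. The sketch below is a rough illustration under stated assumptions: the targets and both sets of predictions are simulated (they stand in for arrays such as `y_test`, `y_pred` and `y_pred_pls`), and the test-set size of 150 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test-set targets and predictions from two models
n_test = 150
y_true = rng.normal(3000, 1500, size=n_test)
pred_a = y_true + rng.normal(0, 1200, size=n_test)
pred_b = y_true + rng.normal(0, 1180, size=n_test)

# Per-observation difference in squared error; its mean equals MSE_a - MSE_b
diff = (y_true - pred_a) ** 2 - (y_true - pred_b) ** 2

# Resample test observations with replacement, keeping each pair together
n_boot = 5000
idx = rng.integers(0, n_test, size=(n_boot, n_test))
boot_means = diff[idx].mean(axis=1)

# Percentile interval for the difference in test MSE
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"MSE_a - MSE_b = {diff.mean():.0f}")
print(f"95% bootstrap interval: [{ci_low:.0f}, {ci_high:.0f}]")
```

An interval that contains zero suggests the test set cannot distinguish the two models.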