In [1]:
!pip install ISLP


## (a) Split the data set into a training set and a test set.

In [13]:
import pandas as pd
from ISLP import load_data
from sklearn.model_selection import train_test_split

# Load the College dataset
college = load_data('College')

# Separate predictors (X) and the target variable (y)
X = college.drop(columns=['Apps'])  # Predictors
y = college['Apps']  # Target variable

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Output the sizes of the splits
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

Training set size: (621, 17)
Test set size: (156, 17)


## (b) Fit a linear model using least squares on the training set, and report the test error obtained.

In [14]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Preprocess the data
# Create a column transformer for one-hot encoding of the 'Private' column
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), ['Private'])  # Two dummy columns for 'Private', so 17 predictors become 18 features
    ],
    remainder='passthrough'  # Keep all other columns as they are
)

# Apply the preprocessor to the training and test sets
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Fit a linear model on the training set
linear_model = LinearRegression()
linear_model.fit(X_train_processed, y_train)

# Predict on the test set
y_pred = linear_model.predict(X_test_processed)

# Calculate test error (Mean Squared Error)
test_mse = mean_squared_error(y_test, y_pred)

# Output the test error
print(f"Test Mean Squared Error (MSE): {test_mse:.2f}")


Test Mean Squared Error (MSE): 1492443.38
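A raw MSE is measured in squared applications, which is hard to interpret. As a hedged sketch (the helper name `summarize_errors` and the toy arrays are illustrative assumptions, not part of the original notebook), the root mean squared error and the test $R^2$ put the same error on a more readable scale:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def summarize_errors(y_true, y_pred):
    """Return MSE, RMSE, and R^2 for a set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "r2": float(r2_score(y_true, y_pred)),
    }

# Toy values for illustration only; in the notebook one would pass
# y_test and y_pred from the cell above instead.
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 310.0, 390.0])
summary = summarize_errors(y_true_demo, y_pred_demo)
print(summary)
```

Calling `summarize_errors(y_test, y_pred)` in the notebook would report the RMSE in units of applications and show what share of the test-set variance the model explains.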


## (c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

In [15]:
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

# Define a range of alpha (λ) values for cross-validation
alpha_range = [0.1, 1, 10, 100, 200, 500, 1000]

# Fit a Ridge regression model with cross-validation
ridge_model = RidgeCV(alphas=alpha_range, cv=5)
ridge_model.fit(X_train_processed, y_train)

# Predict on the test set
y_pred_ridge = ridge_model.predict(X_test_processed)

# Calculate test error (Mean Squared Error)
test_mse_ridge = mean_squared_error(y_test, y_pred_ridge)

# Output the chosen λ (alpha) and test error
print(f"Chosen λ (alpha): {ridge_model.alpha_}")
print(f"Test Mean Squared Error (MSE) with Ridge Regression: {test_mse_ridge:.2f}")


Chosen λ (alpha): 10.0
Test Mean Squared Error (MSE) with Ridge Regression: 1484551.98
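The ridge penalty is not scale invariant, and the cell above fits it to unstandardized predictors, so the chosen λ depends on the units of each column. A minimal sketch of one alternative, assuming synthetic stand-in data (`X_demo`, `y_demo`) and an illustrative alpha grid rather than the College data, wraps `StandardScaler` and `RidgeCV` in a pipeline so scaling is learned from the training data only:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: three columns on very different scales.
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10000.0])
y_demo = X_demo @ np.array([2.0, 0.05, 0.0003]) + rng.normal(size=200)

# Illustrative grid; the notebook uses its own alpha_range.
alphas = np.logspace(-3, 3, 25)

# Standardize first so the penalty treats every predictor equally.
ridge_scaled = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
ridge_scaled.fit(X_demo, y_demo)

chosen_alpha = ridge_scaled.named_steps["ridgecv"].alpha_
print(f"Chosen alpha on standardized features: {chosen_alpha}")
```

In the notebook, the same pipeline could be fit on `X_train_processed` and `y_train`; the selected λ and test MSE would then likely differ from the values printed above.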


## (d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

In [16]:
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
import numpy as np

# Define a range of alpha (λ) values for cross-validation
alpha_range = np.logspace(-4, 1, 50)

# Fit a Lasso regression model with cross-validation
lasso_model = LassoCV(alphas=alpha_range, cv=5, random_state=42)
lasso_model.fit(X_train_processed, y_train)

# Predict on the test set
y_pred_lasso = lasso_model.predict(X_test_processed)

# Calculate test error (Mean Squared Error)
test_mse_lasso = mean_squared_error(y_test, y_pred_lasso)

# Count the number of non-zero coefficients
num_nonzero_coefficients = np.sum(lasso_model.coef_ != 0)

# Output the chosen λ (alpha), test error, and the number of non-zero coefficients
print(f"Chosen λ (alpha): {lasso_model.alpha_}")
print(f"Test Mean Squared Error (MSE) with Lasso Regression: {test_mse_lasso:.2f}")
print(f"Number of non-zero coefficients: {num_nonzero_coefficients}")


Chosen λ (alpha): 7.9060432109077015
Test Mean Squared Error (MSE) with Lasso Regression: 1480187.36
Number of non-zero coefficients: 18
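The selected λ (about 7.9) is the second-largest value in `np.logspace(-4, 1, 50)`, and all 18 coefficients remain non-zero, so the grid may end before the lasso begins to set coefficients to zero. The sketch below, which uses synthetic data (`X_demo`, `y_demo`) as an assumption in place of the College data, shows one way to avoid a hand-picked upper bound: standardize the features and let `LassoCV` build its own grid.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 10 predictors, only the first three affect the response.
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0] + [0.0] * 7)
y_demo = X_demo @ true_coef + rng.normal(size=200)

# With no `alphas` argument, LassoCV builds its own grid, whose largest
# value is the smallest alpha that sets every coefficient to zero.
lasso_scaled = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso_scaled.fit(X_demo, y_demo)

lasso_step = lasso_scaled.named_steps["lassocv"]
n_nonzero = int(np.sum(lasso_step.coef_ != 0))
print(f"Chosen alpha: {lasso_step.alpha_:.4f}, non-zero coefficients: {n_nonzero}")
```

Whether the automatic grid would drop any of the College predictors is not something this sketch can show; it only illustrates the setup.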


## (e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

In [17]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import numpy as np

# Function to perform PCR with cross-validation to select M
def pcr_model(X_train, y_train, X_test, y_test, max_components):
    mse_list = []
    for m in range(1, max_components + 1):
        # Project the training data onto the first m principal components.
        # Note: PCA is fit on the full training set before the CV folds are formed.
        pca = PCA(n_components=m)
        X_train_pca = pca.fit_transform(X_train)

        # Fit Linear Regression on PCA-transformed data
        linear_model = LinearRegression()
        mse = -np.mean(cross_val_score(linear_model, X_train_pca, y_train, cv=5, scoring='neg_mean_squared_error'))
        mse_list.append(mse)

    # Select M with the lowest MSE
    best_m = np.argmin(mse_list) + 1
    pca = PCA(n_components=best_m)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)

    # Train final model with selected M
    final_model = LinearRegression()
    final_model.fit(X_train_pca, y_train)

    # Predict on the test set
    y_pred = final_model.predict(X_test_pca)
    test_mse = mean_squared_error(y_test, y_pred)

    return best_m, test_mse

# Determine the maximum number of components
max_components = X_train_processed.shape[1]

# Fit the PCR model
best_m, test_mse_pcr = pcr_model(X_train_processed, y_train, X_test_processed, y_test, max_components)

# Output the results
print(f"Optimal number of principal components (M): {best_m}")
print(f"Test Mean Squared Error (MSE) with PCR: {test_mse_pcr:.2f}")


Optimal number of principal components (M): 18
Test Mean Squared Error (MSE) with PCR: 1492443.38
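Cross-validation selected M = 18, which is every available component, so this PCR fit reproduces the least-squares fit from part (b) and gives the same test MSE. As a hedged alternative, the sketch below puts scaling, PCA, and regression in a single `Pipeline` and tunes M with `GridSearchCV`, so both preprocessing steps are refit inside each fold. The data (`X_demo`, `y_demo`) and the step names are illustrative assumptions, not the notebook's objects.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data with six predictors.
X_demo = rng.normal(size=(150, 6))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=150)

# PCR as a pipeline: standardize, project onto M components, then regress.
pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])

# Try every M from 1 to the number of predictors using 5-fold CV.
search = GridSearchCV(
    pcr,
    param_grid={"pca__n_components": list(range(1, X_demo.shape[1] + 1))},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

best_m_demo = search.best_params_["pca__n_components"]
print(f"Selected M: {best_m_demo}")
```

After fitting, `search.best_estimator_` is the pipeline refit on all the data passed to `fit`, so it can be used directly for test-set predictions.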


## (f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

In [18]:
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import numpy as np

# Function to perform PLS regression with cross-validation to select M
def pls_model(X_train, y_train, X_test, y_test, max_components):
    mse_list = []
    for m in range(1, max_components + 1):
        # Fit a PLS model with m components
        pls = PLSRegression(n_components=m)
        mse = -np.mean(cross_val_score(pls, X_train, y_train, cv=5, scoring='neg_mean_squared_error'))
        mse_list.append(mse)

    # Select M with the lowest MSE
    best_m = np.argmin(mse_list) + 1
    pls = PLSRegression(n_components=best_m)
    pls.fit(X_train, y_train)

    # Predict on the test set
    y_pred = pls.predict(X_test)
    test_mse = mean_squared_error(y_test, y_pred)

    return best_m, test_mse

# Determine the maximum number of components
max_components = X_train_processed.shape[1]

# Fit the PLS model
best_m, test_mse_pls = pls_model(X_train_processed, y_train, X_test_processed, y_test, max_components)

# Output the results
print(f"Optimal number of components (M): {best_m}")
print(f"Test Mean Squared Error (MSE) with PLS Regression: {test_mse_pls:.2f}")


Optimal number of components (M): 13
Test Mean Squared Error (MSE) with PLS Regression: 1483941.11


## (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Accuracy of Prediction:

The test MSEs all lie between about 1.48 and 1.49 million, which corresponds to a root mean squared error of roughly 1,220 applications per college.
Whether that is large depends on the spread of Apps in the test set. The notebook does not compute a test R² or compare the MSE with the variance of y_test, so the MSE alone does not show how much of the variance the predictors explain. Factors absent from the dataset (e.g., geographic location, reputation, or marketing efforts) may also influence applications.
Performance Comparison:

Best Model: Lasso achieves the lowest test MSE (1,480,187.36), about 0.8% below least squares, with PLS (1,483,941.11) and ridge (1,484,551.98) close behind. The lasso kept all 18 coefficients non-zero, so it performed shrinkage but no variable selection.
Least Effective Model: Least squares and PCR share the highest test MSE (1,492,443.38). The match is exact because cross-validation selected M = 18, which is every available component, and PCR with all components is equivalent to least squares.
Difference Among Methods:

The five test MSEs differ by less than 1%, so the choice of method has little effect on prediction accuracy for this split.
Ridge, lasso, and PLS edge out least squares and PCR, which is consistent with a small benefit from shrinkage, but a gap this small on a single train/test split (random_state=42) could also reflect sampling variation. The ridge, lasso, and PCR fits also use unstandardized predictors, so their penalties and components depend on the units of each column.

* Without a scale-free measure such as test R², the MSEs alone do not establish whether additional or alternative predictors are needed; the RMSE of roughly 1,220 applications is the most direct summary of accuracy here.
* Lasso has the lowest test MSE, but only by a margin under 1%, and because it retained every coefficient the gain cannot be attributed to dropping irrelevant predictors.
* Future work could standardize the predictors before ridge, lasso, and PCR, and explore additional predictors, interaction terms, or nonlinear transformations to improve accuracy.
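
To make the size of the gaps concrete, the short calculation below expresses each reported test MSE as a percentage reduction relative to least squares. The numbers are copied from the cell outputs above; the dictionary and variable names are illustrative.

```python
# Test MSEs as reported in the cell outputs above.
test_mse_by_method = {
    "Least squares": 1492443.38,
    "Ridge": 1484551.98,
    "Lasso": 1480187.36,
    "PCR": 1492443.38,
    "PLS": 1483941.11,
}

baseline = test_mse_by_method["Least squares"]

# Percentage reduction in test MSE relative to least squares.
pct_reduction = {
    name: 100.0 * (baseline - mse) / baseline
    for name, mse in test_mse_by_method.items()
}

# Print the methods from largest to smallest improvement.
for name, pct in sorted(pct_reduction.items(), key=lambda kv: -kv[1]):
    print(f"{name:>13}: {pct:.2f}% lower test MSE than least squares")
```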