9. In this exercise, we will predict the number of applications received
using the other variables in the College data set.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import mean_squared_error


(a) Split the data set into a training set and a test set.

In [7]:
# Load the College dataset
data = pd.read_csv('College.csv')

# Convert 'Private' column to binary (1 for 'Yes', 0 for 'No')
data['Private'] = data['Private'].map({'Yes': 1, 'No': 0})

# Ensure all non-numeric columns are dropped or converted
X = data.drop(['Apps'], axis=1)
X = pd.get_dummies(X, drop_first=True)
y = data['Apps']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


(b) Fit a linear model using least squares on the training set, and
report the test error obtained.

In [8]:
# (b) Fit a linear model using least squares on the training set
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred_lin = lin_reg.predict(X_test)
lin_mse = mean_squared_error(y_test, y_pred_lin)
print(f'Linear Model Test Error (MSE): {lin_mse}')

Linear Model Test Error (MSE): 1928713.2923223844


(c) Fit a ridge regression model on the training set, with λ chosen
by cross-validation. Report the test error obtained.

In [9]:
# (c) Fit a ridge regression model on the training set with lambda chosen by cross-validation
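# The per-alpha CV values are never used below, so they are not stored
# (store_cv_values is also deprecated in recent scikit-learn releases)
ridge = RidgeCV(alphas=np.logspace(-6, 6, 13))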
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)
ridge_mse = mean_squared_error(y_test, y_pred_ridge)
print(f'Ridge Regression Test Error (MSE): {ridge_mse}')

Ridge Regression Test Error (MSE): 1926377.3597969557
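
The cell above prints only the test error, so the penalty that cross-validation selected is not visible. A minimal sketch of how it could be inspected through `RidgeCV`'s `alpha_` attribute, including a check for whether the choice sits at an edge of the grid. The data here is a synthetic stand-in from `make_regression` (an assumption, since `College.csv` is not bundled with this text), and the names `X_demo`, `ridge_demo`, and `at_grid_edge` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the College data set)
X_demo, y_demo = make_regression(n_samples=300, n_features=17,
                                 noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

# Same penalty grid as in the exercise cell
alphas = np.logspace(-6, 6, 13)
ridge_demo = RidgeCV(alphas=alphas).fit(X_tr, y_tr)

# alpha_ holds the penalty selected by cross-validation
chosen_alpha = ridge_demo.alpha_
at_grid_edge = bool(np.isclose(chosen_alpha, alphas[0])
                    or np.isclose(chosen_alpha, alphas[-1]))

print(f'Selected alpha: {chosen_alpha:g}')
print(f'Selected alpha is at an edge of the grid: {at_grid_edge}')
```

If the selected value lands on the smallest or largest grid point, the grid is probably too narrow and could be widened before trusting the result.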




(d) Fit a lasso model on the training set, with λ chosen by
cross-validation. Report the test error obtained, along with the
number of non-zero coefficient estimates.

In [10]:
# (d) Fit a lasso model on the training set with lambda chosen by cross-validation
lasso = LassoCV(cv=10, random_state=42)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)
lasso_mse = mean_squared_error(y_test, y_pred_lasso)
non_zero_coef = np.sum(lasso.coef_ != 0)
print(f'Lasso Regression Test Error (MSE): {lasso_mse}')
print(f'Number of non-zero coefficients in Lasso: {non_zero_coef}')


Lasso Regression Test Error (MSE): 2250488.946323313
Number of non-zero coefficients in Lasso: 7
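
The lasso penalty acts on raw coefficient sizes, so predictors measured in different units are not penalized evenly unless they are standardized first. The following is a rough sketch, not the method used in the cell above, of fitting `LassoCV` inside a pipeline with `StandardScaler`. It runs on synthetic stand-in data whose columns are deliberately rescaled (an assumption made for illustration), and `lasso_scaled` and `n_nonzero` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the College data set)
X_demo, y_demo = make_regression(n_samples=300, n_features=17, n_informative=8,
                                 noise=25.0, random_state=0)

# Rescale each column by a different factor to mimic mixed units
rng = np.random.default_rng(0)
X_demo = X_demo * rng.uniform(1, 1000, size=X_demo.shape[1])

# Standardize inside the pipeline so the scaler is fit on training data only
lasso_scaled = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=42))
lasso_scaled.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
lasso_step = lasso_scaled.named_steps['lassocv']
n_nonzero = int(np.sum(lasso_step.coef_ != 0))

print(f'Selected alpha: {lasso_step.alpha_:g}')
print(f'Number of non-zero coefficients: {n_nonzero}')
```

The same pipeline pattern would apply to ridge regression, whose penalty is scale-dependent in the same way.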


(e) Fit a PCR model on the training set, with M chosen by
cross-validation. Report the test error obtained, along with the value
of M selected by cross-validation.

In [11]:
# (e) Fit a PCR model on the training set, with M chosen by cross-validation
pca = PCA()
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

# Cross-validation to choose the best number of components
mse_list = []
for m in range(1, X_train_pca.shape[1] + 1):
    lin_reg_pca = LinearRegression()
    mse = -np.mean(cross_val_score(lin_reg_pca, X_train_pca[:, :m], y_train, cv=10, scoring='neg_mean_squared_error'))
    mse_list.append(mse)

best_m_pcr = np.argmin(mse_list) + 1
print(f'Best number of components for PCR: {best_m_pcr}')

# Fit PCR model with the best number of components
lin_reg_pca = LinearRegression()
lin_reg_pca.fit(X_train_pca[:, :best_m_pcr], y_train)
y_pred_pcr = lin_reg_pca.predict(X_test_pca[:, :best_m_pcr])
pcr_mse = mean_squared_error(y_test, y_pred_pcr)
print(f'PCR Test Error (MSE): {pcr_mse}')

Best number of components for PCR: 17
PCR Test Error (MSE): 1931050.684086367
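
In the cell above, PCA is fit once on the whole training set before cross-validation, so every validation fold has already influenced the components. One possible alternative, sketched here on synthetic stand-in data (an assumption; this is not the College data), wraps scaling, PCA, and least squares in a `Pipeline` and lets `GridSearchCV` choose M, so that each fold refits the scaler and the components. The step names `scale`, `pca`, and `ols` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the College data set)
X_demo, y_demo = make_regression(n_samples=300, n_features=17,
                                 noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

# PCR as a pipeline: standardize, project onto M components, then regress
pcr = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('ols', LinearRegression()),
])

# Try every M from 1 up to the number of predictors
param_grid = {'pca__n_components': list(range(1, X_tr.shape[1] + 1))}
grid = GridSearchCV(pcr, param_grid, cv=10, scoring='neg_mean_squared_error')
grid.fit(X_tr, y_tr)

best_m = grid.best_params_['pca__n_components']
test_mse = mean_squared_error(y_te, grid.predict(X_te))

print(f'Best number of components for PCR: {best_m}')
print(f'PCR Test Error (MSE): {test_mse}')
```

When the selected M equals the number of predictors, the fitted model makes the same predictions as least squares on the same predictors, so no dimension reduction has taken place.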


(f) Fit a PLS model on the training set, with M chosen by
cross-validation. Report the test error obtained, along with the value
of M selected by cross-validation.

In [12]:
# (f) Fit a PLS model on the training set, with M chosen by cross-validation
mse_list_pls = []
for m in range(1, X_train.shape[1] + 1):
    pls = PLSRegression(n_components=m)
    mse = -np.mean(cross_val_score(pls, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    mse_list_pls.append(mse)

best_m_pls = np.argmin(mse_list_pls) + 1
print(f'Best number of components for PLS: {best_m_pls}')

# Fit PLS model with the best number of components
pls = PLSRegression(n_components=best_m_pls)
pls.fit(X_train, y_train)
y_pred_pls = pls.predict(X_test)
pls_mse = mean_squared_error(y_test, y_pred_pls)
print(f'PLS Test Error (MSE): {pls_mse}')



Best number of components for PLS: 2
PLS Test Error (MSE): 2108931.652050856


(g) Comment on the results obtained. How accurately can we predict
the number of college applications received? Is there much
difference among the test errors resulting from these five approaches?

The test errors fall into two groups. Least squares, ridge regression, and PCR are essentially tied, with test MSEs of about 1.93 million (an RMSE of roughly 1,390 applications), while PLS (about 2.11 million) and the lasso (about 2.25 million) are roughly 9% and 17% worse than least squares.

Ridge regression has the lowest test error, but its margin over least squares is only about 0.1%, which is too small to show that regularization meaningfully reduced overfitting on this split.

The lasso kept 7 non-zero coefficients and had the highest test error, which suggests that it shrank or dropped predictors that carry signal. One possible contributor is that the predictors were not standardized before fitting, so the penalty does not treat coefficients on different scales evenly.

PCR selected M = 17 components. The College data has 17 variables besides Apps, so if X contains exactly those columns, M = 17 keeps every component and PCR performs no dimension reduction, which would explain why its test error is nearly identical to that of least squares.

PLS selected M = 2 components and performed worse than PCR and ridge, which suggests that the two components chosen by cross-validation did not capture all of the predictive signal on this test split.

In summary, ridge regression achieved the lowest test error, but least squares and PCR are practically indistinguishable from it; the differences that matter are the higher errors of PLS and the lasso. To judge how accurate the predictions are in absolute terms, the RMSE should be compared with the spread of Apps in the test set (for example through the test R²), which was not computed here.
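
To make the comparison easier to read, the reported test MSEs can be converted to RMSEs (in units of applications) and expressed relative to least squares. The sketch below only does arithmetic on the values printed by the cells above; the dictionary name `reported_mse` and the layout of `summary` are illustrative choices.

```python
import math

# Test MSEs copied from the cell outputs in this exercise
reported_mse = {
    'Least squares': 1928713.2923223844,
    'Ridge': 1926377.3597969557,
    'Lasso': 2250488.946323313,
    'PCR': 1931050.684086367,
    'PLS': 2108931.652050856,
}

baseline = reported_mse['Least squares']
summary = {}
for name, mse in reported_mse.items():
    rmse = math.sqrt(mse)                            # error in applications
    pct_vs_ols = 100 * (mse - baseline) / baseline   # MSE gap relative to least squares
    summary[name] = (rmse, pct_vs_ols)
    print(f'{name:<14} RMSE = {rmse:8.1f}   MSE vs. least squares: {pct_vs_ols:+.1f}%')

# Model with the lowest reported test MSE
best_model = min(reported_mse, key=reported_mse.get)
print(f'Lowest test MSE: {best_model}')
```

An RMSE is only meaningful next to the scale of the response, so computing the standard deviation of `y_test` or the test R² (for example with `sklearn.metrics.r2_score`) in the notebook would complete the picture.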