9. In this exercise, we will predict the number of applications received
using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Step 1: Upload the file
from google.colab import files
uploaded = files.upload()

# Step 2: Load the dataset
file_name = list(uploaded.keys())[0]  # Get the uploaded file name
college_data = pd.read_csv(file_name)

# Display the first few rows of the dataset
print("Dataset Preview:")
print(college_data.head())

# Step 3: Separate target and features
X = college_data.drop(columns=['Unnamed: 0', 'Apps'])  # Drop non-predictive column and target column
y = college_data['Apps']

# Convert categorical variables to dummy variables
X = pd.get_dummies(X, drop_first=True)

# Step 4: Split the data into training and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display dimensions of the splits
print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")
print(f"Target Training set: {y_train.shape}, Target Test set: {y_test.shape}")


Saving College.csv to College.csv
Dataset Preview:
                     Unnamed: 0 Private  Apps  Accept  Enroll  Top10perc  \
0  Abilene Christian University     Yes  1660    1232     721         23   
1            Adelphi University     Yes  2186    1924     512         16   
2                Adrian College     Yes  1428    1097     336         22   
3           Agnes Scott College     Yes   417     349     137         60   
4     Alaska Pacific University     Yes   193     146      55         16   

   Top25perc  F.Undergrad  P.Undergrad  Outstate  Room.Board  Books  Personal  \
0         52         2885          537      7440        3300    450      2200   
1         29         2683         1227     12280        6450    750      1500   
2         50         1036           99     11250        3750    400      1165   
3         89          510           63     12960        5450    450       875   
4         44          249          869      7560        4120    800      1500   

   Ph

The data has been successfully split into a training set and a test set:

Training set: 621 observations with 17 features.

Test set: 156 observations with 17 features.

(b) Fit a linear model using least squares on the training set, and
report the test error obtained.

In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Step 1: Initialize the linear regression model
linear_model = LinearRegression()

# Step 2: Fit the model on the training data
linear_model.fit(X_train, y_train)

# Step 3: Make predictions on the test set
y_pred = linear_model.predict(X_test)

# Step 4: Compute the test error (Mean Squared Error)
test_error = mean_squared_error(y_test, y_pred)

print(f"Test Error (Mean Squared Error): {test_error}")


Test Error (Mean Squared Error): 1492443.379039042
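The MSE above is in squared applications, which is hard to read directly. As a small worked step, the sketch below converts the reported value into a root mean squared error (RMSE), which is in the same units as `Apps`. The variable names `test_mse` and `test_rmse` are introduced here for illustration only; the number is copied from the output above.

```python
import math

# Test MSE copied from the least-squares output above (applications squared).
test_mse = 1492443.379039042

# RMSE is expressed in the same units as the response (number of applications).
test_rmse = math.sqrt(test_mse)
print(f"Test RMSE: {test_rmse:.1f} applications")
```

On this split, a typical prediction error is therefore on the order of 1,200 applications.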


(c) Fit a ridge regression model on the training set, with λ chosen
by cross-validation. Report the test error obtained.

In [3]:
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

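# Step 1: Define the alpha (lambda) values to test
# Note: X_train is not standardized, so the penalty weighs predictors unevenly
# depending on their units; this coarse grid steps by factors of 10
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0]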

# Step 2: Initialize the Ridge regression model with cross-validation
ridge_cv_model = RidgeCV(alphas=alphas, cv=5)  # 5-fold cross-validation

# Step 3: Fit the model on the training data
ridge_cv_model.fit(X_train, y_train)

# Step 4: Best alpha (lambda) value selected by cross-validation
best_alpha = ridge_cv_model.alpha_
print(f"Best alpha (lambda) chosen by cross-validation: {best_alpha}")

# Step 5: Make predictions on the test set
y_pred_ridge = ridge_cv_model.predict(X_test)

# Step 6: Compute the test error (Mean Squared Error)
test_error_ridge = mean_squared_error(y_test, y_pred_ridge)
print(f"Test Error (Mean Squared Error) for Ridge Regression: {test_error_ridge}")


Best alpha (lambda) chosen by cross-validation: 10.0
Test Error (Mean Squared Error) for Ridge Regression: 1478572.8112797008
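Ridge penalizes all coefficients on a common scale, so it is usually fit on standardized predictors and searched over a finer, log-spaced grid. The following is a minimal sketch of that variant, not the code that produced the output above. It runs on synthetic data from `make_regression` as a stand-in for the College data; the `_demo` names, the grid bounds, and the column rescaling are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 777 rows and 17 predictors, like College.
X_demo, y_demo = make_regression(n_samples=777, n_features=17,
                                 noise=25.0, random_state=0)
# Put the columns on very different scales, as in the real data.
X_demo = X_demo * np.logspace(0, 4, 17)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Log-spaced grid of 50 penalties between 1e-3 and 1e3 (illustrative bounds).
alpha_grid = np.logspace(-3, 3, 50)

# The scaler is fit on the training data only, never on the test set.
ridge_scaled = make_pipeline(StandardScaler(),
                             RidgeCV(alphas=alpha_grid, cv=5))
ridge_scaled.fit(X_tr, y_tr)

best_alpha_demo = ridge_scaled.named_steps["ridgecv"].alpha_
mse_demo = mean_squared_error(y_te, ridge_scaled.predict(X_te))
print(f"Selected alpha: {best_alpha_demo:.4g}, test MSE: {mse_demo:.1f}")
```

If the selected alpha lands on either end of the grid, the grid should be widened before the result is trusted.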


(d) Fit a lasso model on the training set, with λ chosen by cross-validation.
Report the test error obtained, along with the number
of non-zero coefficient estimates.

In [4]:
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error

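# Step 1: Define the alpha values to test
# Note: this grid only covers very small penalties (largest value 0.1) and
# X_train is not standardized, so little shrinkage should be expected
alphas = [0.1, 0.01, 0.001, 0.0001, 0.00001]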

# Step 2: Initialize the Lasso regression model with cross-validation
lasso_cv_model = LassoCV(alphas=alphas, cv=5, random_state=42)  # 5-fold cross-validation

# Step 3: Fit the model on the training data
lasso_cv_model.fit(X_train, y_train)

# Step 4: Best alpha (lambda) value selected by cross-validation
best_alpha_lasso = lasso_cv_model.alpha_
print(f"Best alpha (lambda) chosen by cross-validation: {best_alpha_lasso}")

# Step 5: Make predictions on the test set
y_pred_lasso = lasso_cv_model.predict(X_test)

# Step 6: Compute the test error (Mean Squared Error)
test_error_lasso = mean_squared_error(y_test, y_pred_lasso)
print(f"Test Error (Mean Squared Error) for Lasso Regression: {test_error_lasso}")

# Step 7: Count the number of non-zero coefficients
non_zero_coefficients = int((lasso_cv_model.coef_ != 0).sum())
print(f"Number of non-zero coefficients: {non_zero_coefficients}")


Best alpha (lambda) chosen by cross-validation: 0.1
Test Error (Mean Squared Error) for Lasso Regression: 1492276.886370733
Number of non-zero coefficients: 17
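The selected alpha of 0.1 is the largest value in the grid that was searched, which suggests the search range may have been too narrow to show any variable selection. A hedged sketch of an alternative is below: standardize the predictors and let `LassoCV` build its own alpha path, then check whether the chosen value sits on the edge of that path. It uses synthetic data with only 5 informative predictors (an assumption chosen so that some coefficients can shrink to zero); the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 17 predictors, only 5 truly informative.
X_demo, y_demo = make_regression(n_samples=777, n_features=17,
                                 n_informative=5, noise=25.0, random_state=0)

# With no explicit grid, LassoCV generates its own path of alphas from the data.
lasso_scaled = make_pipeline(StandardScaler(),
                             LassoCV(cv=5, random_state=42))
lasso_scaled.fit(X_demo, y_demo)

lasso_step = lasso_scaled.named_steps["lassocv"]
alpha_path_demo = lasso_step.alphas_          # the alphas that were tried
n_nonzero_demo = int(np.count_nonzero(lasso_step.coef_))

# Flag a selection at either end of the path, which would call for a wider grid.
on_edge = bool(np.isclose(lasso_step.alpha_, alpha_path_demo.min())
               or np.isclose(lasso_step.alpha_, alpha_path_demo.max()))

print(f"Selected alpha: {lasso_step.alpha_:.4g}")
print(f"Non-zero coefficients: {n_nonzero_demo} of {lasso_step.coef_.size}")
print(f"Selected alpha on edge of path: {on_edge}")
```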


(e) Fit a PCR model on the training set, with M chosen by cross-validation.
Report the test error obtained, along with the value
of M selected by cross-validation.

In [5]:
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import numpy as np

# Step 1: Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

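# Step 2: Define a function for PCR with cross-validation
# Note: PCA is fit on the full training matrix before the folds are formed
# (and the scaler in Step 1 likewise), so each validation fold has already
# influenced the components. This does not use the response, but the CV
# estimates can be slightly optimistic; a Pipeline would avoid it.
def pcr_model(X, y, max_components):
    """Return the 5-fold CV MSE for PCR with 1..max_components components."""
    mse_list = []
    for m in range(1, max_components + 1):
        # Project onto the first m principal components
        pca = PCA(n_components=m)
        X_pca = pca.fit_transform(X)

        # Regress y on the m component scores and average the fold MSEs
        model = LinearRegression()
        mse = -cross_val_score(model, X_pca, y, cv=5, scoring='neg_mean_squared_error').mean()
        mse_list.append(mse)
    return mse_list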

# Step 3: Determine the optimal number of components using cross-validation
max_components = min(X_train_scaled.shape[1], X_train_scaled.shape[0])  # Limit max components to data dimensions
mse_list = pcr_model(X_train_scaled, y_train, max_components)
optimal_m = np.argmin(mse_list) + 1  # Index + 1 for the component number
print(f"Optimal number of components (M) chosen by cross-validation: {optimal_m}")

# Step 4: Fit the PCR model with the optimal number of components
pca = PCA(n_components=optimal_m)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

model = LinearRegression()
model.fit(X_train_pca, y_train)

# Step 5: Compute the test error
y_pred_pcr = model.predict(X_test_pca)
test_error_pcr = mean_squared_error(y_test, y_pred_pcr)
print(f"Test Error (Mean Squared Error) for PCR: {test_error_pcr}")


Optimal number of components (M) chosen by cross-validation: 17
Test Error (Mean Squared Error) for PCR: 1492443.3790390224
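In the cell above, the scaler and PCA are fit once on the whole training set before cross-validation. One way to keep every preprocessing step inside the folds is to wrap scaling, PCA, and regression in a `Pipeline` and search over M with `GridSearchCV`. The sketch below shows that pattern on synthetic data; the step names (`scale`, `pca`, `ols`) and the `_demo` variables are hypothetical, and the results are not those of the College data.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 17 predictors, as in the College data.
X_demo, y_demo = make_regression(n_samples=300, n_features=17,
                                 noise=25.0, random_state=0)

# Scaling and PCA are re-fit on the training part of every fold.
pcr = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("ols", LinearRegression()),
])

# Try every M from 1 to 17 with 5-fold cross-validation.
param_grid = {"pca__n_components": list(range(1, 18))}
search = GridSearchCV(pcr, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

best_m_demo = search.best_params_["pca__n_components"]
cv_mse_demo = -search.best_score_
print(f"M selected by CV: {best_m_demo}, CV MSE: {cv_mse_demo:.1f}")
```

After fitting, `search.predict(X_test)` applies the refit pipeline, so the test data passes through the same scaler and components learned on the training set.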


(f) Fit a PLS model on the training set, with M chosen by cross-validation.
Report the test error obtained, along with the value
of M selected by cross-validation.

In [6]:
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import numpy as np

# Step 1: Define a function for PLS with cross-validation
def pls_model(X, y, max_components):
    mse_list = []
    for m in range(1, max_components + 1):
        pls = PLSRegression(n_components=m)
        mse = -cross_val_score(pls, X, y, cv=5, scoring='neg_mean_squared_error').mean()
        mse_list.append(mse)
    return mse_list

# Step 2: Determine the optimal number of components using cross-validation
max_components = min(X_train_scaled.shape[1], X_train_scaled.shape[0])  # Limit max components to data dimensions
mse_list_pls = pls_model(X_train_scaled, y_train, max_components)
optimal_m_pls = np.argmin(mse_list_pls) + 1  # Index + 1 for the component number
print(f"Optimal number of components (M) chosen by cross-validation for PLS: {optimal_m_pls}")

# Step 3: Fit the PLS model with the optimal number of components
# (named pls_final so it does not overwrite the pls_model function above)
pls_final = PLSRegression(n_components=optimal_m_pls)
pls_final.fit(X_train_scaled, y_train)

# Step 4: Compute the test error
y_pred_pls = pls_final.predict(X_test_scaled)
test_error_pls = mean_squared_error(y_test, y_pred_pls)
print(f"Test Error (Mean Squared Error) for PLS: {test_error_pls}")


Optimal number of components (M) chosen by cross-validation for PLS: 17
Test Error (Mean Squared Error) for PLS: 1492443.3790390242


(g) Comment on the results obtained. How accurately can we predict
the number of college applications received? Is there much
difference among the test errors resulting from these five approaches?


The five test errors are nearly identical:

Least squares: 1,492,443

Ridge (λ = 10): 1,478,573

Lasso (λ = 0.1): 1,492,277, with all 17 coefficients non-zero

PCR (M = 17): 1,492,443

PLS (M = 17): 1,492,443

Ridge has the lowest test MSE, but its improvement over least squares is under 1%. The lasso selected the largest λ on its grid (0.1) and kept every predictor, so it is effectively least squares as well; a wider grid might select a sparser model. With M = 17, PCR and PLS use every available component, so they reproduce the least-squares fit, which is why their errors agree with it to many digits.

A test MSE of about 1.49 million corresponds to an RMSE of roughly 1,220 applications. Whether that counts as accurate depends on the spread of Apps in the test set, which was not computed here; a test R² would answer this directly.

Since neither regularization nor dimension reduction gives a meaningful improvement, the plain least-squares model is sufficient for this split. This does not by itself show that the predictors are free of multicollinearity; it suggests that, with 621 training observations and 17 predictors, there is little variance for shrinkage to remove.
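The PCR test error reported earlier matches the least-squares error to about ten significant digits. That is expected when M equals the number of predictors: the principal components are then just a rotation of the standardized predictors, so the regression spans the same column space as ordinary least squares. The sketch below checks this on synthetic data (an assumption; it does not use the College file, and the `_demo` and `mse_*` names are illustrative). In exact arithmetic the same argument applies to PLS with all components, although that case is not checked here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 777 rows and 17 predictors.
X_demo, y_demo = make_regression(n_samples=777, n_features=17,
                                 noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Ordinary least squares on the raw predictors.
ols = LinearRegression().fit(X_tr, y_tr)

# PCR that keeps all 17 components: no dimension reduction takes place.
pcr_full = make_pipeline(StandardScaler(),
                         PCA(n_components=17),
                         LinearRegression()).fit(X_tr, y_tr)

mse_ols = mean_squared_error(y_te, ols.predict(X_te))
mse_pcr_full = mean_squared_error(y_te, pcr_full.predict(X_te))

print(f"OLS test MSE:          {mse_ols:.6f}")
print(f"PCR (M = 17) test MSE: {mse_pcr_full:.6f}")
print(f"Equal up to rounding:  {np.isclose(mse_ols, mse_pcr_full)}")
```

Any difference between the two numbers comes from floating-point rounding, not from the models themselves.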