# Topic IV: Shrinkage and Variable Selection

**Information:**  
We are using the book 'An Introduction to Statistical Learning (with Applications in Python)' by G. James et al. A free copy is available [here](https://www.statlearning.com/).

In this exercise, we will predict the number of applications received using the other variables in the `College` data set.

## Import modules, packages and libraries

First, we import some useful modules, packages and libraries. These are needed for carrying out the computations and for plotting the results.

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 

# scikit-learn imports
# We use scikit-learn (sklearn) to fit the ridge regression and lasso models.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error

## Load the `College` data set

In [12]:
college = pd.read_csv('College10.csv', index_col = 0)

# Display information about the data set
# college.info()

# Return summary statistics for each column
# college.describe()

# Return first five rows of the data set
college.head()

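| Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| Alfred University | Yes | 1732 | 1425 | 472 | 37 | 75 | 1830 | 110 | 16548 | 5406 | 500 | 600 | 82 | 88 | 11.3 | 31 | 10932 | 73 |
| Antioch University | Yes | 713 | 661 | 252 | 25 | 44 | 712 | 23 | 15476 | 3336 | 400 | 1100 | 69 | 82 | 11.3 | 35 | 42926 | 48 |
| Augustana College | Yes | 761 | 725 | 306 | 21 | 58 | 1337 | 300 | 10990 | 3244 | 600 | 1021 | 66 | 70 | 10.4 | 30 | 6871 | 69 |
| Beaver College | Yes | 1163 | 850 | 348 | 23 | 56 | 878 | 519 | 12850 | 5400 | 400 | 800 | 78 | 89 | 12.2 | 30 | 8954 | 73 |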


In [4]:
### PREPROCESSING HERE

**(a) Normalize the data and split it into a training set and a test set.**

In [13]:
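### YOUR CODE HERE

# Identify the categorical (string-valued) columns, e.g. `Private`
categorical_columns = college.select_dtypes(include=['object']).columns

# One-hot encode categorical columns, dropping the first level to avoid redundancy
college = pd.get_dummies(college, columns=categorical_columns, drop_first=True)

# Standardize the response `Apps` to zero mean and unit variance.
# Note: only the response is scaled here (the predictors keep their original units),
# and the scaler is fitted on all rows before the split.
scaler = StandardScaler()
college['Apps'] = scaler.fit_transform(college[['Apps']])

# Separate predictors and response, then hold out 20% of the rows as a test set
X = college.drop('Apps', axis=1)
y = college['Apps']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)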
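
Ridge and lasso penalties depend on the scale of the predictors, so it is common to standardize the predictors as well, estimating the means and standard deviations from the training rows only. The following is a minimal sketch of that pattern on synthetic data, not the code used to produce the results in this notebook; the names `X_demo`, `y_demo`, `x_scaler`, `X_tr_s` and `X_te_s` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for predictors on very different scales (not the College data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=[1.0, 10.0, 100.0], size=(200, 3))
y_demo = X_demo @ np.array([2.0, 0.5, 0.01]) + rng.normal(size=200)

# Split first, so the test rows play no part in estimating means and variances
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training predictors only, then apply it to both parts
x_scaler = StandardScaler().fit(X_tr)
X_tr_s = x_scaler.transform(X_tr)
X_te_s = x_scaler.transform(X_te)

# Training columns are centred at 0 with unit standard deviation;
# test columns are only approximately so, which is expected
print(X_tr_s.mean(axis=0).round(6))
print(X_tr_s.std(axis=0).round(6))
```

The test set is transformed with the training statistics, so it behaves like genuinely unseen data when the test error is computed.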

**(b) Fit a linear model using least squares on the training set, and report the test error obtained.**

In [15]:
### YOUR CODE HERE
from sklearn.linear_model import LinearRegression

model=LinearRegression()
model.fit(X_train, y_train)
y_pred=model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)

# Print the test error
print(f'Test Mean Squared Error: {mse}')

Test Mean Squared Error: 0.21963403756920946


**(c) Fit a ridge regression model on the training set, with $ \lambda $ chosen by cross-validation. Report the test error obtained.**

In [16]:
### YOUR CODE HERE
# Candidate values for the regularization strength (called alpha in scikit-learn)
alphas = [0.1, 1.0, 10.0, 100.0]
ridge = Ridge()
param_grid = {'alpha': alphas}

# Initialize GridSearchCV
grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Best alpha chosen by cross-validation
best_alpha = grid_search.best_params_['alpha']
print(f'Best alpha: {best_alpha}')

# Fit the Ridge model with the best alpha
ridge_best = Ridge(alpha=best_alpha)
ridge_best.fit(X_train, y_train)

# Predict on the test set
y_pred = ridge_best.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print(f'Test Mean Squared Error: {mse}')

Best alpha: 10.0
Test Mean Squared Error: 0.2182310855668599
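
A four-value grid is quite coarse. As a hedged alternative, the sketch below searches a log-spaced grid and places the scaler inside a pipeline, so that each cross-validation fold is standardized using its own training part. It runs on synthetic data from `make_regression`, and the grid bounds and the names `pipe`, `grid` and `search` are assumptions chosen for illustration, not values taken from this exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the College data
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# make_pipeline names each step after its class, so Ridge becomes 'ridge'
pipe = make_pipeline(StandardScaler(), Ridge())
grid = {'ridge__alpha': np.logspace(-3, 3, 25)}  # assumed range, 1e-3 to 1e3

search = GridSearchCV(pipe, grid, cv=5, scoring='neg_mean_squared_error')
search.fit(X_tr, y_tr)

# With the default refit=True, `search` is already refitted on the full
# training set using the best alpha, so it can predict directly
test_mse = mean_squared_error(y_te, search.predict(X_te))
print(f"Best alpha: {search.best_params_['ridge__alpha']}")
print(f'Test Mean Squared Error: {test_mse}')
```

Because `GridSearchCV` refits the best model by default, a separate `Ridge(alpha=best_alpha)` fit is optional; it gives the same model when the estimator and data are unchanged.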


**(d) Fit a lasso model on the training set, with $ \lambda $ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.**

In [17]:
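### YOUR CODE HERE
# Cross-validate the lasso over the same alpha grid (param_grid) defined in part (c)
lasso = Lasso()
grid_search = GridSearchCV(lasso, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Best alpha chosen by cross-validation
best_alpha = grid_search.best_params_['alpha']
print(f'Best alpha: {best_alpha}')

# Fit the lasso model with the best alpha and predict on the test set
lasso_best = Lasso(alpha=best_alpha)
lasso_best.fit(X_train, y_train)
y_pred = lasso_best.predict(X_test)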

# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)

# Print the test error
print(f'Test Mean Squared Error: {mse}')

Best alpha: 100.0
Test Mean Squared Error: 0.10223605017095388
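
Part (d) also asks for the number of non-zero coefficient estimates, which can be read from the fitted model's `coef_` attribute. The sketch below shows the idea on synthetic data; `lasso_demo`, `n_nonzero` and the value `alpha=5.0` are illustrative assumptions, and in the notebook the same expression could be applied to `lasso_best.coef_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data in which only 3 of the 10 predictors carry signal
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Fit a lasso with an arbitrary, illustrative penalty strength
lasso_demo = Lasso(alpha=5.0).fit(X_demo, y_demo)

# The lasso sets some coefficients exactly to zero; count the ones it kept
n_nonzero = int(np.sum(lasso_demo.coef_ != 0))
print(f'Non-zero coefficients: {n_nonzero} of {lasso_demo.coef_.size}')
```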


**(g) Comment on the results obtained. How accurately can we predict the number of college applications received?**

In [9]:
### YOUR CODE HERE

\### YOUR COMMENTS HERE

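The lasso model has the lowest test mean squared error (about 0.102), well below both least squares (about 0.220) and ridge regression (about 0.218). Since the ultimate goal is a low test error, the lasso is the preferred model here. Its cross-validated regularization parameter ($ \lambda = 100 $) is also larger than the one selected for ridge ($ \lambda = 10 $), which points to strong shrinkage. However, the two penalties are defined on different scales, so the values are not directly comparable, and 100 is the largest value in the grid, so a wider grid might select an even larger penalty. Because `Apps` was standardized to unit variance, a test MSE of about 0.10 means the prediction errors have roughly a tenth of the variance of the response, so the number of applications can be predicted fairly accurately.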
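
To express such an error in numbers of applications rather than in standardized units, the MSE can be rescaled by the variance that `StandardScaler` removed. The sketch below is only an illustration under stated assumptions: it fits a scaler to the five `Apps` values shown in the table above (not the full data set) and uses a rounded, illustrative MSE of 0.10; `scaler_demo`, `mse_standardized` and `rmse_original` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The five Apps values from the first rows of the data set (illustration only)
apps = np.array([[1660.0], [1732.0], [713.0], [761.0], [1163.0]])
scaler_demo = StandardScaler().fit(apps)

# Illustrative test MSE measured on the standardized response
mse_standardized = 0.10

# StandardScaler divides by the standard deviation stored in scale_,
# so squared errors are multiplied by scale_**2 to return to original units
mse_original = mse_standardized * scaler_demo.scale_[0] ** 2
rmse_original = np.sqrt(mse_original)
print(f'RMSE in applications: {rmse_original:.1f}')
```

In the notebook, the same rescaling would use the `scale_` attribute of the scaler fitted to the full `Apps` column.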