### Ch6_Q09
#### In this exercise, we will predict the number of applications received using the other variables in the College data set.

In [1]:
!pip install ISLP



In [2]:
from ISLP import load_data
df = load_data('College')
print(df.head())

  Private  Apps  Accept  Enroll  Top10perc  Top25perc  F.Undergrad  \
0     Yes  1660    1232     721         23         52         2885   
1     Yes  2186    1924     512         16         29         2683   
2     Yes  1428    1097     336         22         50         1036   
3     Yes   417     349     137         60         89          510   
4     Yes   193     146      55         16         44          249   

   P.Undergrad  Outstate  Room.Board  Books  Personal  PhD  Terminal  \
0          537      7440        3300    450      2200   70        78   
1         1227     12280        6450    750      1500   29        30   
2           99     11250        3750    400      1165   53        66   
3           63     12960        5450    450       875   92        97   
4          869      7560        4120    800      1500   76        72   

   S.F.Ratio  perc.alumni  Expend  Grad.Rate  
0       18.1           12    7041         60  
1       12.2           16   10527         56  
2    

#### (a) Split the data set into a training set and a test set.

In [28]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate the response ('Apps') from the predictors
X = df.drop(columns=['Apps'])
y = df['Apps']

# One-hot encode the categorical 'Private' column, dropping the first level
X = pd.get_dummies(X, drop_first=True)

# Split the data into training and test sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=68)

#### (b) Fit a linear model using least squares on the training set, and report the test error obtained.

In [29]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit a linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Predict on the test set
y_pred_linear = linear_model.predict(X_test)

# Calculate test error (Mean Squared Error)
linear_test_error = mean_squared_error(y_test, y_pred_linear)
print("Linear Regression Test Error:", linear_test_error)

Linear Regression Test Error: 1042199.02298093
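
The MSE above is in squared applications, which is hard to read on its own. The following is a minimal sketch of reporting RMSE and test R² next to the MSE. It runs on synthetic data rather than the College set, and the data-generating step and the `_demo` names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the College data (illustrative assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=300)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
pred_demo = LinearRegression().fit(Xtr, ytr).predict(Xte)

mse_demo = mean_squared_error(yte, pred_demo)
rmse_demo = np.sqrt(mse_demo)        # same units as the response
r2_demo = r2_score(yte, pred_demo)   # share of test-set variance explained
print(f"MSE: {mse_demo:.3f}  RMSE: {rmse_demo:.3f}  R^2: {r2_demo:.3f}")
```

Applied to the notebook's own `y_test` and `y_pred_linear`, the same two calls would express the error in applications and as a fraction of variance explained.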


#### (c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

In [30]:
from sklearn.linear_model import RidgeCV

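# Perform cross-validated ridge regression (scikit-learn calls λ `alpha`)
# Note: the predictors are not standardized here, so the penalty depends on each column's units
ridge_model = RidgeCV(alphas=np.logspace(-6, 6, 13), cv=10)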
ridge_model.fit(X_train, y_train)

# Predict on the test set
y_pred_ridge = ridge_model.predict(X_test)

# Calculate test error
ridge_test_error = mean_squared_error(y_test, y_pred_ridge)
print("Ridge Regression Test Error:", ridge_test_error)
print("Best Lambda (Ridge):", ridge_model.alpha_)

Ridge Regression Test Error: 1036920.5562227552
Best Lambda (Ridge): 10.0
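
Because the ridge penalty acts on coefficient size, it is sensitive to the units of each predictor. One common variant, sketched here on synthetic data, standardizes the predictors inside a pipeline before the cross-validated fit. The data and the `_demo` and `ridge_scaled` names are assumptions for illustration; this is not the fit that produced the numbers above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic predictors on very different scales (illustrative assumption)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = X_demo @ np.array([2.0, 0.3, 0.02, 0.001]) + rng.normal(size=200)

alphas = np.logspace(-6, 6, 13)

# StandardScaler is fit on the data passed to .fit, then RidgeCV picks alpha by 10-fold CV
ridge_scaled = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=10))
ridge_scaled.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
best_alpha = ridge_scaled.named_steps["ridgecv"].alpha_
pred_scaled = ridge_scaled.predict(X_demo)
print("Selected alpha:", best_alpha)
```

With standardized inputs, a given `alpha` shrinks every coefficient on a comparable footing, so the selected value is not directly comparable to one chosen on raw predictors.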


#### (d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

In [31]:
from sklearn.linear_model import LassoCV

# Perform cross-validated lasso regression
lasso_model = LassoCV(alphas=np.logspace(-6, 6, 13), cv=10, random_state=42)
lasso_model.fit(X_train, y_train)

# Predict on the test set
y_pred_lasso = lasso_model.predict(X_test)

# Calculate test error
lasso_test_error = mean_squared_error(y_test, y_pred_lasso)
print("Lasso Regression Test Error:", lasso_test_error)
print("Best Lambda (Lasso):", lasso_model.alpha_)

# Number of non-zero coefficients
print("Number of Non-Zero Coefficients (Lasso):", np.sum(lasso_model.coef_ != 0))


Lasso Regression Test Error: 1035934.2221492694
Best Lambda (Lasso): 10.0
Number of Non-Zero Coefficients (Lasso): 17


#### (e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

In [32]:
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

# Scale the data for PCA
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Perform PCA
pca = PCA()
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Determine optimal number of components using cross-validation
mse = []
for i in range(1, X_train_pca.shape[1] + 1):
    pcr_model = LinearRegression()
    scores = cross_val_score(pcr_model, X_train_pca[:, :i], y_train, cv=10, scoring='neg_mean_squared_error')
    mse.append(-scores.mean())

# Select the number of components with minimum MSE
optimal_components = np.argmin(mse) + 1
print("Optimal Number of Components (PCR):", optimal_components)

# Fit PCR model with optimal components
pcr_model = LinearRegression()
pcr_model.fit(X_train_pca[:, :optimal_components], y_train)

# Predict on the test set
y_pred_pcr = pcr_model.predict(X_test_pca[:, :optimal_components])

# Calculate test error
pcr_test_error = mean_squared_error(y_test, y_pred_pcr)
print("PCR Test Error:", pcr_test_error)


Optimal Number of Components (PCR): 17
PCR Test Error: 1042199.0229809299
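
The PCR test error matches the least squares error because cross-validation selected all 17 components. Regressing on every principal component is only a rotation of the standardized predictors, and least squares predictions (with an intercept) do not change under standardization or rotation. A small sketch on synthetic data, with illustrative `_demo` names that are not from the notebook, shows the equivalence and how it breaks once components are dropped:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(150, 6))
y_demo = X_demo @ rng.normal(size=6) + rng.normal(size=150)

# Ordinary least squares on the standardized predictors
X_std = StandardScaler().fit_transform(X_demo)
ols_pred = LinearRegression().fit(X_std, y_demo).predict(X_std)

# PCR: regress on principal component scores
Z = PCA().fit_transform(X_std)  # keeps all 6 components
pcr_full_pred = LinearRegression().fit(Z, y_demo).predict(Z)
pcr_two_pred = LinearRegression().fit(Z[:, :2], y_demo).predict(Z[:, :2])

same_full = np.allclose(ols_pred, pcr_full_pred)  # all components: same fit as OLS
same_two = np.allclose(ols_pred, pcr_two_pred)    # two components: a different fit
print("All components match OLS:", same_full)
print("Two components match OLS:", same_two)
```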


#### (f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

In [33]:
from sklearn.cross_decomposition import PLSRegression

# Perform cross-validation to determine optimal number of components for PLS
pls_mse = []
for i in range(1, X_train.shape[1] + 1):
    pls_model = PLSRegression(n_components=i)
    scores = cross_val_score(pls_model, X_train, y_train, cv=10, scoring='neg_mean_squared_error')
    pls_mse.append(-scores.mean())

# Select the number of components with minimum MSE
optimal_pls_components = np.argmin(pls_mse) + 1
print("Optimal Number of Components (PLS):", optimal_pls_components)

# Fit PLS model with optimal components
pls_model = PLSRegression(n_components=optimal_pls_components)
pls_model.fit(X_train, y_train)

# Predict on the test set (np.ravel flattens the (n, 1) array that older scikit-learn versions return)
y_pred_pls = np.ravel(pls_model.predict(X_test))

# Calculate test error
pls_test_error = mean_squared_error(y_test, y_pred_pls)
print("PLS Test Error:", pls_test_error)


Optimal Number of Components (PLS): 17
PLS Test Error: 1042199.0229809326


#### (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

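**Performance:** <br>
All five test MSEs fall between roughly 1.036 and 1.042 million, which corresponds to an RMSE of about 1,020 applications, so the methods are practically indistinguishable on this split.<br>
Lasso (1,035,934) and ridge (1,036,921) improve on least squares (1,042,199) by less than 1%, a gap small enough that it could change with a different random split.<br>
Lasso kept all 17 coefficients non-zero, so it performed no variable selection here and offers no interpretability advantage over least squares.<br>
Cross-validation chose M = 17 for both PCR and PLS, that is, every available component, so neither method reduces the dimension and both reproduce the least squares test error.<br>
Ridge and lasso were fit on unstandardized predictors, so their penalties depend on each variable's units; standardizing first might change the selected λ and the resulting error.<br>
**Model Selection:** <br>
If prediction accuracy is the primary concern, lasso has the lowest test error, but only by a negligible margin.<br>
If interpretability matters, plain least squares is a reasonable choice, since none of the alternatives produced a sparser or lower-dimensional model on this split.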
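
To make the comparison concrete, the sketch below takes the five test MSEs printed in parts (b) through (f) and expresses each as an RMSE and as a percentage difference from least squares. The dictionary and variable names are illustrative; only the MSE values come from the outputs above.

```python
import math

# Test MSEs copied from the printed outputs in parts (b)-(f)
test_mse = {
    "Least squares": 1042199.02298093,
    "Ridge": 1036920.5562227552,
    "Lasso": 1035934.2221492694,
    "PCR (M=17)": 1042199.0229809299,
    "PLS (M=17)": 1042199.0229809326,
}

baseline = test_mse["Least squares"]
summary = {}
for name, mse in test_mse.items():
    rmse = math.sqrt(mse)                             # error in applications
    pct_vs_ols = 100.0 * (mse - baseline) / baseline  # negative means better than OLS
    summary[name] = (rmse, pct_vs_ols)
    print(f"{name:<14} RMSE = {rmse:8.1f}   MSE vs. least squares: {pct_vs_ols:+.2f}%")
```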