(9.) In this exercise, we will predict the number of applications received
using the other variables in the College data set.

(a) Split the data set into a training set and a test set.

In [16]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


from google.colab import drive
drive.mount('/content/drive')

# Load the College data (adjust the path to wherever College.csv is stored)
college_df = pd.read_csv('/content/drive/My Drive/College.csv')

# Drop the college names (first column)
college_df = college_df.drop(college_df.columns[0], axis=1)

# Transform 'Private' column using Label Encoding
encoder = LabelEncoder()
college_df['Private'] = encoder.fit_transform(college_df['Private'])

# Define features (X) and target variable (y)
X = college_df.drop('Apps', axis=1) # Assuming 'Apps' is the target variable
y = college_df['Apps']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # You can adjust test_size and random_state

# Now you have:
# X_train, y_train: Training data (features and target)
# X_test, y_test: Testing data (features and target)

print("Training set size:", len(X_train))
print("Testing set size:", len(X_test))

Training set size: 621
Testing set size: 156


(b) Fit a linear model using least squares on the training set, and
report the test error obtained.

In [17]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# Create a linear regression model
model = LinearRegression()

# Fit the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the test error (mean squared error)
test_error = mean_squared_error(y_test, y_pred)

print("Test Error (Mean Squared Error):", test_error)

# R-squared
r_squared = r2_score(y_test, y_pred)
print("R-squared:", r_squared)

Test Error (Mean Squared Error): 1492443.3790390454
R-squared: 0.8877583168400976
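
An MSE in squared applications is hard to read directly. As a small sketch, the root mean squared error (RMSE) puts the reported test error back in units of applications; the only input is the MSE printed above, and the variable names are illustrative.

```python
import math

# Test MSE reported by the least-squares cell above
test_mse = 1492443.3790390454

# RMSE is in the same units as the response (number of applications)
test_rmse = math.sqrt(test_mse)
print(f"Test RMSE: {test_rmse:.0f} applications")
```

So a typical test-set prediction misses by roughly 1,200 applications.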


(c) Fit a ridge regression model on the training set, with λ chosen
by cross-validation. Report the test error obtained.

In [18]:
from sklearn.linear_model import RidgeCV

# Create a Ridge regression model with cross-validation to choose lambda (alpha)
ridge_model = RidgeCV(alphas=[0.1, 1.0, 10.0], cv=5)  # You can adjust the alphas and cv

# Fit the model on the training data
ridge_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred_ridge = ridge_model.predict(X_test)

# Calculate the test error (mean squared error) for ridge regression
test_error_ridge = mean_squared_error(y_test, y_pred_ridge)

print("Test Error (Ridge Regression, Mean Squared Error):", test_error_ridge)

# R-squared for Ridge Regression
r_squared_ridge = r2_score(y_test, y_pred_ridge)
print("R-squared (Ridge Regression):", r_squared_ridge)

# You can also print the chosen alpha (lambda) value:
print("Chosen alpha (lambda) for Ridge Regression:", ridge_model.alpha_)

Test Error (Ridge Regression, Mean Squared Error): 1478572.8112797
R-squared (Ridge Regression): 0.8888014759264375
Chosen alpha (lambda) for Ridge Regression: 10.0
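
The selected value, 10.0, is the largest of the three candidates, so the cross-validated optimum may lie outside the grid. The sketch below shows one way to check for this: standardize inside a pipeline, search a log-spaced grid, and test whether the chosen value sits on a grid boundary. It runs on synthetic data from `make_regression` as a stand-in for the College data (an assumption, as are the grid limits and all `*_demo` names), so its output says nothing about the College results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the College data (assumption: 17 numeric predictors)
X_demo, y_demo = make_regression(n_samples=300, n_features=17, noise=25.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Log-spaced grid spanning several orders of magnitude (limits are illustrative)
alpha_grid = np.logspace(-3, 3, 50)

# Scaling inside the pipeline keeps the penalty comparable across predictors
ridge_scaled = make_pipeline(StandardScaler(), RidgeCV(alphas=alpha_grid, cv=5))
ridge_scaled.fit(Xd_train, yd_train)

chosen_alpha = ridge_scaled.named_steps["ridgecv"].alpha_
on_boundary = bool(
    np.isclose(chosen_alpha, alpha_grid[0]) or np.isclose(chosen_alpha, alpha_grid[-1])
)
print("Chosen alpha:", chosen_alpha)
print("Chosen alpha sits on the grid boundary:", on_boundary)
```

If `on_boundary` is true, widening the grid in that direction is a reasonable next step.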


(d) Fit a lasso model on the training set, with λ chosen by cross-validation.
Report the test error obtained, along with the number
of non-zero coefficient estimates.

In [19]:
from sklearn.linear_model import LassoCV

# Create a Lasso regression model with cross-validation to choose lambda (alpha)
lasso_model = LassoCV(cv=5)  # You can adjust cv

# Fit the model on the training data
lasso_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred_lasso = lasso_model.predict(X_test)

# Calculate the test error (mean squared error) for lasso regression
test_error_lasso = mean_squared_error(y_test, y_pred_lasso)

print("Test Error (Lasso Regression, Mean Squared Error):", test_error_lasso)

# R-squared for Lasso Regression
r_squared_lasso = r2_score(y_test, y_pred_lasso)
print("R-squared (Lasso Regression):", r_squared_lasso)

# Count the number of non-zero coefficients
non_zero_coefficients = sum(lasso_model.coef_ != 0)
print("Number of non-zero coefficients:", non_zero_coefficients)

# You can also print the chosen alpha (lambda) value:
print("Chosen alpha (lambda) for Lasso Regression:", lasso_model.alpha_)

Test Error (Lasso Regression, Mean Squared Error): 1587020.0176529174
R-squared (Lasso Regression): 0.8806455236482635
Number of non-zero coefficients: 7
Chosen alpha (lambda) for Lasso Regression: 14444.597843675856
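
The lasso above was fit on unstandardized predictors, so the size of the chosen alpha depends on the raw scales of the columns. A minimal sketch of the standardized variant follows. It uses synthetic data with deliberately mismatched column scales as a stand-in for the College data (an assumption, like the `*_demo` names), so the counts it prints are not the College results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 17 predictors, of which only 7 carry signal (assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=17, n_informative=7, noise=25.0, random_state=0
)
# Give the columns very different scales to mimic raw, unstandardized data
X_demo = X_demo * np.logspace(0, 4, 17)

# Standardize, then let LassoCV pick alpha from its default path
lasso_scaled = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso_scaled.fit(X_demo, y_demo)

lasso_step = lasso_scaled.named_steps["lassocv"]
n_nonzero = int(np.count_nonzero(lasso_step.coef_))
print("Chosen alpha on standardized predictors:", lasso_step.alpha_)
print("Non-zero coefficients:", n_nonzero, "of", X_demo.shape[1])
```

With standardized predictors, alpha values can be compared across predictors, and the set of retained variables no longer depends on measurement units.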


(e) Fit a PCR model on the training set, with M chosen by cross-validation.
Report the test error obtained, along with the value
of M selected by cross-validation.

In [20]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create a pipeline with PCA and linear regression
pipeline = Pipeline([
    ('pca', PCA()),
    ('linear', LinearRegression())
])

# Define the parameter grid for cross-validation
param_grid = {
    'pca__n_components': list(range(1, X_train.shape[1] + 1))  # Try different numbers of principal components
}

# Create a GridSearchCV object to find the best M (number of principal components)
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')

# Fit the model using cross-validation
grid_search.fit(X_train_scaled, y_train)

# Get the best M and the corresponding test error
best_m = grid_search.best_params_['pca__n_components']
best_model = grid_search.best_estimator_

y_pred_pcr = best_model.predict(X_test_scaled)
test_error_pcr = mean_squared_error(y_test, y_pred_pcr)

print("Best M (number of principal components):", best_m)
print("Test Error (PCR, Mean Squared Error):", test_error_pcr)

# R-squared for PCR
r_squared_pcr = r2_score(y_test, y_pred_pcr)
print("R-squared (PCR):", r_squared_pcr)

Best M (number of principal components): 17
Test Error (PCR, Mean Squared Error): 1492443.379039024
R-squared (PCR): 0.8877583168400992
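
The selected M = 17 equals the number of predictors, and the test MSE matches the least-squares value to many decimal places. That is expected: when every principal component is kept, PCR is just least squares in a rotated coordinate system. The sketch below checks this on a small synthetic data set (an assumption, as are the `*_demo` names), with the scaler placed inside the pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic regression problem (stand-in data, not the College set)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
p = X_demo.shape[1]

# Ordinary least squares on the raw predictors
ols = LinearRegression().fit(X_demo, y_demo)

# PCR keeping all p components; the scaler is a pipeline step
pcr_full = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=p)),
    ("linear", LinearRegression()),
]).fit(X_demo, y_demo)

same_fit = bool(np.allclose(ols.predict(X_demo), pcr_full.predict(X_demo)))
print("PCR with M = p matches OLS:", same_fit)
```

Putting the scaler in the pipeline also means that, under `GridSearchCV`, it would be refit on each training fold rather than on the full training set.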


(f) Fit a PLS model on the training set, with M chosen by cross-validation.
Report the test error obtained, along with the value
of M selected by cross-validation.

In [26]:
from sklearn.cross_decomposition import PLSRegression

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Wrap PLSRegression in a pipeline; it performs the regression itself,
# so no separate LinearRegression step is needed
pipeline_pls = Pipeline([
    ('pls', PLSRegression()),
])

# Define the parameter grid for cross-validation
param_grid_pls = {
    'pls__n_components': list(range(1, min(X_train.shape[0], X_train.shape[1]) + 1))  # Try different numbers of components
}

# Create a GridSearchCV object to find the best M (number of components)
grid_search_pls = GridSearchCV(pipeline_pls, param_grid_pls, cv=5, scoring='neg_mean_squared_error')

# Fit the model using cross-validation
grid_search_pls.fit(X_train_scaled, y_train)

# Get the best M and the corresponding test error
best_m_pls = grid_search_pls.best_params_['pls__n_components']
best_model_pls = grid_search_pls.best_estimator_

y_pred_pls = best_model_pls.predict(X_test_scaled)
test_error_pls = mean_squared_error(y_test, y_pred_pls)

print("Best M (number of components) for PLS:", best_m_pls)
print("Test Error (PLS, Mean Squared Error):", test_error_pls)

# R-squared for PLS
r_squared_pls = r2_score(y_test, y_pred_pls)
print("R-squared (PLS):", r_squared_pls)

Best M (number of components) for PLS: 17
Test Error (PLS, Mean Squared Error): 1492443.379039025
R-squared (PLS): 0.8877583168400992


(g) Comment on the results obtained. How accurately can we predict
the number of college applications received? Is there much
difference among the test errors resulting from these five approaches?

#### 1. Linear Regression
- Test Error (MSE): 1,492,443
- R-squared: 0.888
- Linear regression serves as the baseline model. It explains approximately 88.8% of the variance in the number of applications on the test set. The test MSE corresponds to an RMSE of roughly 1,220 applications, so a typical prediction error is on the order of 1,200 applications.

---

#### 2. Ridge Regression
- Test Error (MSE): 1,478,572
- R-squared: 0.889
- Ridge regression lowers the test MSE slightly relative to linear regression, by about 14,000 (roughly 0.9%).
- The chosen α (regularization parameter) of 10.0 is the largest value in the candidate grid [0.1, 1.0, 10.0], so cross-validation may prefer an even stronger penalty; a wider grid would be needed to confirm this. Because the predictors were not standardized before fitting, the penalty also acts unevenly across coefficients.
- Ridge regression performs marginally better than linear regression, but the improvement is not substantial.

---

#### 3. Lasso Regression
- Test Error (MSE): 1,587,020
- R-squared: 0.881
- Lasso regression performs slightly worse than both linear and ridge regression, with a higher test error and a lower $R^2$.
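- It retains 7 non-zero coefficients out of 17 predictors, a substantial reduction in model size, but this simplification comes at the cost of accuracy.
- The chosen α is large (about 14,444.6). Because the lasso was fit on unstandardized predictors, this value reflects the raw scales of the variables as much as the strength of regularization. It is still large enough to zero out 10 of the 17 coefficients, potentially including useful predictors.
- While the lasso is effective for feature selection, here it gives up some predictive accuracy compared to ridge and linear regression.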

---

#### 4. Principal Component Regression (PCR)
- Test Error (MSE): 1,492,443
- R-squared: 0.888
- Cross-validation selects M = 17, which equals the number of predictors. With all components retained, PCR is mathematically equivalent to least squares, which is why the test error and $R^2$ match linear regression (up to floating-point rounding).
- PCR therefore performs no dimension reduction here and gives no gain in predictive accuracy. It may be more useful in high-dimensional settings or when predictors are strongly collinear.

---

#### 5. Partial Least Squares (PLS)
- Test Error (MSE): 1,492,443
- R-squared: 0.888
- As with PCR, cross-validation selects M = 17, the full number of predictors, so the PLS fit coincides with least squares and its performance metrics are identical to those of linear regression.
- PLS thus provides no advantage on this dataset, though it could help when predictors are highly correlated and fewer components suffice.

---

### Comparative Analysis
1. Predictive Accuracy:
   - All five methods yield similar test errors and $R^2$ values, with differences in MSE of less than 110,000. The models predict the number of college applications received with reasonable accuracy, explaining about 88% of the variance.
   - Ridge regression performs marginally better, indicating that slight regularization improves predictive performance.
   - Lasso performs slightly worse, likely because its aggressive penalization discards some useful predictors.

2. Differences in Test Errors:
   - The differences among test errors are not substantial, with the smallest (Ridge: 1,478,572) and largest (Lasso: 1,587,020) differing by about 7.3%.
   - This suggests the dataset does not strongly benefit from regularization or dimensionality reduction.

3. Feature Selection and Dimensionality Reduction:
   - Ridge retains all predictors, while Lasso reduces the number to 7, offering a simpler model at the cost of accuracy.
   - PCR and PLS both select M = 17, the full set of components, so they perform no dimensionality reduction and simply reproduce the linear regression fit.
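
The comparison above can be reproduced with a few lines of arithmetic. The sketch below uses only the test MSE values printed by the cells in parts (b) through (f); the dictionary and variable names are illustrative.

```python
# Test MSEs as printed by the cells in parts (b)-(f)
reported_mse = {
    "Least squares": 1492443.3790390454,
    "Ridge": 1478572.8112797,
    "Lasso": 1587020.0176529174,
    "PCR": 1492443.379039024,
    "PLS": 1492443.379039025,
}

best = min(reported_mse, key=reported_mse.get)
worst = max(reported_mse, key=reported_mse.get)

# Absolute and relative gap between the best and worst methods
gap = reported_mse[worst] - reported_mse[best]
gap_pct = 100 * gap / reported_mse[best]

for name, mse in reported_mse.items():
    print(f"{name:14s} MSE = {mse:12,.0f}   RMSE = {mse ** 0.5:6,.0f}")
print(f"Best: {best}, worst: {worst}, gap = {gap:,.0f} ({gap_pct:.1f}%)")
```

On the RMSE scale the methods differ by only a few dozen applications, which supports the reading that the five approaches perform about equally well here.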

---

### Conclusion
- Accuracy: The models provide reasonably accurate predictions, explaining about 88% of the variability in college applications.
- Differences Among Methods: The test errors across methods are quite close, with ridge regression slightly outperforming the others.
