Lab 5 - Cross-Validation for Model Selection

Tasks:

1. Use the diabetes dataset from lab 4. Perform cross-validation on nine polynomial models, ranging from degree 0 to degree 8.

2. Construct a table summarizing the cross-validation results. Each model should have a separate row in the table, reporting the mean and standard deviation of the R-Squared, Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) metrics.

3. Identification of the best model: identify the model with the best performance based on the R-Squared, MAE and MAPE metrics, and explain why this specific model was chosen. Run the model on the test set and report the results (R-Squared, MAPE, MAE).

4. Additional analysis and interpretation of the models' performance. Explore findings beyond the required metrics. The analysis should provide at least one relevant insight about the choice of the best model (for example, an analysis of the instances in which it fails), or further insights comparing the models.


In [2]:
import numpy as np
import pandas as pd

from sklearn import datasets

from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error

In [4]:

# Load the diabetes dataset
diabetes = datasets.load_diabetes()
X = diabetes.data  # Features/ input data used to predict
y = diabetes.target  # Target variable

# Prepare cross-validation to split the data into 10 parts for testing and training
kf = KFold(n_splits=10, shuffle=True, random_state=1)

# Initialize an empty list to collect results
results = []

# Loop through polynomial degrees from the simplest (0) to the most complex (8)
for degree in range(9):
    # Create a pipeline: Polynomial feature transformation followed by linear regression
    pipeline = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    
    # Initialize empty lists to collect scores for each fold
    r2_scores = []
    mae_scores = []
    mape_scores = []
    
    # Loop through the cross-validation splits
    for train_index, test_index in kf.split(X):
        # Split data into training and testing sets
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        # Fit the model on the training data
        pipeline.fit(X_train, y_train)
        
        # Predict on the testing data
        y_pred = pipeline.predict(X_test)
        
        # Evaluate the model using R-squared, MAE, and MAPE
        r2_scores.append(r2_score(y_test, y_pred))
        mae_scores.append(mean_absolute_error(y_test, y_pred))
        mape_scores.append(mean_absolute_percentage_error(y_test, y_pred))
    
    # Calculate the mean and standard deviation of the evaluation metrics
    results.append({
        'Degree': degree,
        'R2 Mean': np.mean(r2_scores),
        'R2 Std Dev': np.std(r2_scores),
        'MAE Mean': np.mean(mae_scores),
        'MAE Std Dev': np.std(mae_scores),
        'MAPE Mean': np.mean(mape_scores),
        'MAPE Std Dev': np.std(mape_scores)
    })

# Convert the results list to a DataFrame for easier viewing and analysis
results_df = pd.DataFrame(results)

# Print the results DataFrame
print(results_df)

# Select the best model by the highest mean R2; the lowest mean MAE and
# mean MAPE should be checked to confirm they point to the same degree
best_model_idx = results_df['R2 Mean'].idxmax()
# idxmax() returns an index label, so look the row up with .loc rather than .iloc
best_model = results_df.loc[best_model_idx]
print(f'Best Model: Degree {best_model["Degree"]}')

   Degree    R2 Mean  R2 Std Dev    MAE Mean  MAE Std Dev  MAPE Mean  MAPE Std Dev
0       0  -0.027907    0.026996   65.924330     5.588439   0.622844      0.101546
1       1   0.470314    0.089697   44.450481     2.993516   0.397483      0.072758
2       2   0.396881    0.107551   46.496951     3.632944   0.399535      0.065484
3       3 -12.757003    9.288513  169.420479    49.834302   1.321950      0.384330
4       4 -58.958412   35.641085  340.133115    65.530133   2.729640      0.599754
5       5 -48.620679   30.775408  307.366481    55.054135   2.494358      0.524684
6       6 -48.669635   30.962701  307.211035    54.982721   2.493718      0.525639
7       7 -48.673806   30.973853  307.208098    54.982389   2.493727      0.525721
8       8 -48.674386   30.980423  307.201113    54.973845   2.493718      0.525733
Best Model: Degree 1.0


Analysis:

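Degree 1 achieves the highest mean R-Squared (0.470) and the lowest mean MAE (44.45) and MAPE (0.397). Degree 2 is close behind (R-Squared 0.397, MAE 46.50, MAPE 0.400): the R-Squared gap of about 0.07 is smaller than the fold-to-fold standard deviation of either model, so cross-validation does not separate the two sharply. The degree 0 model, which predicts a constant, has an R-Squared close to zero and serves as the baseline.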

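From degree 3 onwards, performance collapses. The mean R-Squared is strongly negative (about -12.8 at degree 3 and between -49 and -59 for degrees 4 to 8), meaning these models predict worse than simply using the mean of the target, and the mean MAE rises from about 44 to between 169 and 340. This suggests that the higher-degree polynomial models are overfitting the training data, capturing noise rather than the underlying trend. The metrics also barely change from degree 5 to degree 8, so adding further complexity does not help. This emphasizes the importance of choosing a model with the right amount of complexity.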
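
One likely reason for the collapse is the number of polynomial terms relative to the number of training samples. The sketch below counts the terms a full polynomial expansion produces for each degree, assuming the standard diabetes dataset shape of 442 samples and 10 features and the expansion with a bias column that `PolynomialFeatures` uses by default. The function name `n_poly_terms` is introduced here for illustration.

```python
from math import ceil, comb

N_SAMPLES = 442   # rows in the scikit-learn diabetes dataset
N_FEATURES = 10   # input features in the diabetes dataset
N_SPLITS = 10     # folds used in the cross-validation above


def n_poly_terms(n_features, degree):
    """Number of monomials of total degree <= degree, including the bias term."""
    return comb(n_features + degree, degree)


# The smallest training set occurs when the largest fold is held out
min_train_size = N_SAMPLES - ceil(N_SAMPLES / N_SPLITS)

term_counts = {degree: n_poly_terms(N_FEATURES, degree) for degree in range(9)}
for degree, n_terms in term_counts.items():
    flag = "more terms than training samples" if n_terms > min_train_size else ""
    print(f"degree {degree}: {n_terms:>6} terms {flag}")
```

Under these assumptions, degree 3 already has 286 terms for roughly 397 training samples, and from degree 4 (1001 terms) onwards there are more coefficients than samples, so the least-squares fit can reproduce the training data almost exactly while generalizing poorly.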

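In this case, the degree 1 polynomial (ordinary linear regression) performs best on all three metrics and is also the simplest model that clearly beats the baseline, so it is chosen as the best model.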
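
Task 3 also asks for results on a test set. The following is a minimal sketch of that step, not a result reported in this lab: the 80/20 split and `random_state=1` are assumptions made for illustration. Ideally the test set would be held out before cross-validation so that it plays no part in model selection.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = datasets.load_diabetes(return_X_y=True)

# Assumed split: 20% of the data held out as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Refit the selected degree 1 model on the training portion only
best_model = make_pipeline(PolynomialFeatures(1), LinearRegression())
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

test_metrics = {
    "R2": r2_score(y_test, y_pred),
    "MAE": mean_absolute_error(y_test, y_pred),
    "MAPE": mean_absolute_percentage_error(y_test, y_pred),
}
for name, value in test_metrics.items():
    print(f"{name}: {value:.4f}")
```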

