# Periyanayagi Christina - 8938218

# Lab 5 - Cross-Validation for Model Selection

### Utilize the diabetes dataset from lab 4. Perform cross-validation on nine polynomial models, ranging from degree 0 to 8.

In [11]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_absolute_error, mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


### Loading the Diabetes dataset

In [12]:
diabetesDF = load_diabetes()


### Splitting the data into training sets and testing sets

In [13]:
X, y = diabetesDF.data, diabetesDF.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Perform cross-validation on nine polynomial models, ranging from degree 0 to 8

In [14]:
degrees = range(9)
results = []

for degree in degrees:
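    # Build a polynomial regression pipeline for this degree
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # Fit once on the training split and score on the held-out test split
    # (a single hold-out evaluation; cross_val_score is imported but not used here)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)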
    
    # Calculate the R-squared score
    r2 = r2_score(y_test, y_pred)
    
    # Calculate the Mean Absolute Error (MAE)
    mae = mean_absolute_error(y_test, y_pred)
    
    # Calculate the Mean Absolute Percentage Error (MAPE)
    mape = mean_absolute_percentage_error(y_test, y_pred) 
    
    results.append((degree, r2, mae, mape))

### Creating Dataframe to store the result

In [16]:
results_df = pd.DataFrame(results, columns=['Model Degree', 'R-Squared', 'MAE', 'MAPE'])
print(results_df)

   Model Degree  R-Squared         MAE      MAPE
0             0  -0.011963   64.006461  0.627918
1             1   0.452603   42.794095  0.374998
2             2   0.415640   43.581693  0.382857
3             3 -15.733467  178.966292  1.634635
4             4 -26.728083  261.667144  2.300991
5             5 -25.992920  255.968358  2.270202
6             6 -25.975743  255.908618  2.269658
7             7 -25.975483  255.906857  2.269649
8             8 -25.975483  255.906885  2.269649


### Calculate the mean and standard deviation of R-Squared, MAE, and MAPE

In [17]:
mean_r2 = np.mean(results_df['R-Squared'])
std_r2 = np.std(results_df['R-Squared'])

mean_mae = np.mean(results_df['MAE'])
std_mae = np.std(results_df['MAE'])

mean_mape = np.mean(results_df['MAPE'])
std_mape = np.std(results_df['MAPE'])

# Print the mean and standard deviation of each metric across the nine models
print(f"Mean R-Squared: {mean_r2}, Standard Deviation R-Squared: {std_r2}")
print(f"Mean MAE: {mean_mae}, Standard Deviation MAE: {std_mae}")
print(f"Mean MAPE: {mean_mape}, Standard Deviation MAPE: {std_mape}")


Mean R-Squared: -16.169433276824602, Standard Deviation R-Squared: 12.060392260551627
Mean MAE: 179.41182245848267, Standard Deviation MAE: 94.64225556808549
Mean MAPE: 1.6000618397239887, Standard Deviation MAPE: 0.830934520362114


In [18]:
# Find the model with the highest R-Squared
best_r2_model = results_df[results_df['R-Squared'] == results_df['R-Squared'].max()]

# Find the model with the lowest MAE
best_mae_model = results_df[results_df['MAE'] == results_df['MAE'].min()]

# Find the model with the lowest MAPE
best_mape_model = results_df[results_df['MAPE'] == results_df['MAPE'].min()]
print("Best Model based on R-Squared:")
print(best_r2_model)
print("\nBest Model based on MAE:")
print(best_mae_model)
print("\nBest Model based on MAPE:")
print(best_mape_model)


Best Model based on R-Squared:
   Model Degree  R-Squared        MAE      MAPE
1             1   0.452603  42.794095  0.374998

Best Model based on MAE:
   Model Degree  R-Squared        MAE      MAPE
1             1   0.452603  42.794095  0.374998

Best Model based on MAPE:
   Model Degree  R-Squared        MAE      MAPE
1             1   0.452603  42.794095  0.374998


### Identification of the Best Model: Identify the model that exhibits the highest performance based on the R-Squared, MAE and MAPE metrics. Provide an explanation for choosing this specific model. 

Across all three metrics, the degree 1 model performs best on the held-out test set. It has the highest R-Squared (0.4526), meaning it explains about 45% of the variance in the test targets. This is a moderate fit rather than a strong one, but it is better than every other degree. It also has the lowest Mean Absolute Error (42.79) and the lowest Mean Absolute Percentage Error (0.375, i.e. about 37.5%).
Because MAE is the average absolute difference between predicted and actual values, the lowest MAE means the degree 1 model's predictions are, on average, the closest to the true values. The degree 2 model is a close second (R-Squared 0.4156, MAE 43.58), so the extra quadratic terms add complexity without improving test performance.
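
To make the three metrics concrete, here is a small worked example that computes them by hand and checks the results against scikit-learn. The four target and prediction values are made up for illustration and are not taken from the diabetes dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, r2_score

# Made-up values, chosen only to keep the arithmetic easy to follow
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 230.0])

abs_err = np.abs(y_true - y_hat)                # [10, 10, 10, 20]
mae = abs_err.mean()                            # 12.5
mape = (abs_err / np.abs(y_true)).mean()        # about 0.0742 (7.42%)

ss_res = np.sum((y_true - y_hat) ** 2)          # 700: squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 12500: squared error of predicting the mean
r2 = 1.0 - ss_res / ss_tot                      # 0.944

print(mae, mape, r2)
```

Since R-Squared compares the model's squared error with that of always predicting the mean, it becomes negative whenever the model does worse than that baseline.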

### Additional analysis and interpretation of the models' performances. You may explore further insights beyond the required metrics. The analysis should provide at least one relevant insight about the choice of the best model, or about characteristics of the chosen one (for example - an analysis of in which instances does it fail)

Beyond the required metrics, a closer look at the models' performance gives further insight:
For the higher-degree models (degrees 3 to 8), the R-Squared values are strongly negative, between about -15.7 and -26.7.
A negative R-Squared means the model predicts worse than simply predicting the mean of the targets, which is roughly what the degree 0 model does (R-Squared of -0.012). These models are overly complex and overfit the training data, fitting noise rather than signal.
In practical terms, a polynomial of degree 3 or higher is not suitable for this dataset.
Its predictions deviate substantially from the actual values (MAE of about 179 to 262, against 42.79 for degree 1), showing that it does not generalize to unseen data.
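
One likely reason for the collapse at degree 3 and above is the number of generated features. With its default settings, `PolynomialFeatures` produces C(n + d, d) columns for n input features and degree d. The sketch below compares that count with the number of training rows, assuming the 10 diabetes features and the 353 training rows left by the 80/20 split of 442 samples.

```python
from math import comb

n_features = 10  # input features in the diabetes dataset
n_train = 353    # training rows after the 80/20 split of 442 samples

# Number of polynomial terms (bias column included) for each degree
terms = {degree: comb(n_features + degree, degree) for degree in range(9)}

for degree, n_terms in terms.items():
    note = "more columns than training rows" if n_terms > n_train else ""
    print(f"degree {degree}: {n_terms:>6} columns  {note}")
```

At degree 3 there are already 286 columns for 353 rows, and from degree 4 onward (1001 columns and up) there are more columns than rows, so ordinary least squares can fit the training data almost exactly while generalizing poorly.
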
In conclusion, the degree 1 model (an ordinary linear regression, not a quadratic polynomial) is the best choice for making predictions.
Among the nine models, it offers the best balance between model complexity and predictive accuracy. As with any model selection, other factors such as interpretability and computational efficiency should also be considered, and both favor the degree 1 model here. For further analysis, it may be worthwhile to explore regularized regression (for example ridge or lasso) to see whether a higher-degree model can be made to generalize.
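
As a starting point for that follow-up, the sketch below fits a ridge-regularized degree 2 model on the same split. It is an illustration only, not part of the lab's results: the degree, the feature scaling step, and the alpha grid are assumptions chosen for the example, and no claim is made that it beats the degree 1 model.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ridge_model = make_pipeline(
    # include_bias=False because RidgeCV fits its own intercept
    PolynomialFeatures(degree=2, include_bias=False),
    # Put all polynomial terms on a common scale so the penalty treats them evenly
    StandardScaler(),
    # Assumed alpha grid; RidgeCV picks the best value by internal cross-validation
    RidgeCV(alphas=np.logspace(-3, 3, 13)),
)
ridge_model.fit(X_train, y_train)

ridge_r2 = r2_score(y_test, ridge_model.predict(X_test))
print(f"Selected alpha: {ridge_model[-1].alpha_}, test R-Squared: {ridge_r2}")
```

The test R-Squared printed here can be compared directly with the degree 1 and degree 2 rows of the results table, since the split and random seed are the same.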




