# Lab 5 - Cross-Validation for Model Selection

#### Cross-validation on nine polynomial models, ranging from degree 0 to 8.

In [49]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

In [50]:
# Load the diabetes dataset
diabetes = load_diabetes()

# Get features and target
X = diabetes.data
y = diabetes.target

In [51]:
results = []

for degree in range(9):
    # Create polynomial features up to the specified degree
    polynomial_features = PolynomialFeatures(degree=degree)
    X_poly = polynomial_features.fit_transform(X)

    model = LinearRegression()

    # Perform cross-validation and calculate the mean absolute error
    scores = cross_val_score(model, X_poly, y, scoring='neg_mean_absolute_error', cv=5)
    results.append((degree, -1 * scores.mean()))

#### Construct a table summarizing the cross-validation results.

In [52]:
df = pd.DataFrame(results, columns=['Degree', 'MAE'])
df['R-Squared'] = np.nan
df['MAPE'] = np.nan

for index, row in df.iterrows():
    degree = int(row['Degree'])

    # Create polynomial features based on the degree
    polynomial_features = PolynomialFeatures(degree=degree)
    X_poly = polynomial_features.fit_transform(X)

    model = LinearRegression()

    # Perform cross-validation and calculate R-squared and MAPE scores
    r2_scores = cross_val_score(model, X_poly, y, scoring='r2', cv=5)
    mape_scores = cross_val_score(model, X_poly, y, scoring='neg_mean_absolute_percentage_error', cv=5)

    # Update the dataframe with R-squared and MAPE values
    df.loc[index, 'R-Squared'] = r2_scores.mean()
    df.loc[index, 'MAPE'] = -1 * mape_scores.mean()

In [53]:
mean_values = df.mean()
std_values = df.std()

# Add rows for mean and standard deviation values in the dataframe
df.loc['Mean'] = mean_values
df.loc['Std'] = std_values

# Print the table summarizing the cross-validation results
print(df)

        Degree         MAE   R-Squared      MAPE
0     0.000000   66.045624   -0.027506  0.623622
1     1.000000   44.276499    0.482316  0.394860
2     2.000000   46.612882    0.391502  0.402669
3     3.000000  342.632418 -182.365458  2.324375
4     4.000000  303.158461  -70.667516  2.453685
5     5.000000  295.686026  -67.387407  2.405233
6     6.000000  295.631865  -67.447482  2.404954
7     7.000000  295.630403  -67.448529  2.404952
8     8.000000  295.579342  -67.442147  2.404576
Mean  4.000000  220.583724  -57.990247  1.757658
Std   2.738613  127.218077   57.198774  0.965706


#### Identification of the Best Model

The results show that the degree 1 model is the best: it has the top score on all three metrics.
<br><br>
R-squared measures the proportion of the variance in the response variable that the model explains, so higher values indicate a better fit. Under cross-validation it can be negative, which means the model predicts held-out data worse than a constant prediction of the mean would. The degree 1 model has the highest R-squared (0.482316) of all nine models, indicating that it explains the most variance in the target.
<br><br>
MAE (Mean Absolute Error) measures the average magnitude of the prediction errors, ignoring their direction. It is the average of the absolute differences between predicted and actual values over the test sample, with every difference weighted equally. The degree 1 model has the lowest MAE (44.276499), which means it has the smallest prediction errors on average.
<br><br>
MAPE (Mean Absolute Percentage Error) measures the size of the error relative to the actual value. It is the average of the absolute errors divided by the absolute actual values; scikit-learn reports it as a fraction, so 0.394860 corresponds to about 39.5%. As with MAE, lower values are better. The degree 1 model has the lowest MAPE (0.394860), so it also has the smallest relative prediction errors.
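
To make the three definitions concrete, here is a small sketch that computes them by hand with NumPy. The four target values and predictions are made-up illustration data, not taken from the diabetes dataset, and the function names are our own.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return float(np.mean(np.abs(y_true - y_pred)))


def mape(y_true, y_pred):
    """Mean of the absolute errors relative to the actual values (as a fraction)."""
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))


def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up example values for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

print(mae(y_true, y_pred))        # 22.5
print(mape(y_true, y_pred))       # 0.0875, i.e. 8.75%
print(r_squared(y_true, y_pred))  # approximately 0.946

# A prediction far worse than the mean gives a negative R-squared.
print(r_squared(y_true, np.zeros(4)))  # -5.0
```

The last line shows why the table can contain large negative R-squared values: nothing bounds the residual sum of squares from above.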

#### Additional analysis and interpretation of the models' performances.

As the degree increases beyond 1, all three metrics worsen: slightly at degree 2 (MAE 46.61, R-squared 0.39), then drastically from degree 3 onward, where R-squared falls below -67 and MAE exceeds 295. This indicates overfitting: the higher-degree models are too complex and fit the noise along with the underlying pattern, which hurts their generalization to unseen data.
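
One likely reason for the sharp degradation is the number of polynomial terms. With its default settings, `PolynomialFeatures` generates every monomial up to the given degree, including the bias column, which for n input features is the binomial coefficient C(n + degree, degree). The sketch below counts those terms for the 10 diabetes features and compares them with the approximate size of a training fold (442 samples with 5-fold cross-validation). The constant and function names are our own.

```python
from math import comb

N_FEATURES = 10              # columns in the diabetes feature matrix
N_SAMPLES = 442              # rows in the diabetes dataset
N_TRAIN = N_SAMPLES * 4 // 5  # about 353 training rows per fold with cv=5


def n_poly_terms(n_features, degree):
    """Number of monomials of total degree <= degree, bias term included."""
    return comb(n_features + degree, degree)


for degree in range(9):
    terms = n_poly_terms(N_FEATURES, degree)
    note = "more terms than training rows" if terms > N_TRAIN else ""
    print(f"degree {degree}: {terms:>6} terms {note}")

# degree 0:      1 terms
# degree 1:     11 terms
# degree 2:     66 terms
# degree 3:    286 terms
# degree 4:   1001 terms more terms than training rows
# ...
# degree 8:  43758 terms more terms than training rows
```

At degree 3 the model already has nearly as many coefficients as training rows, and from degree 4 onward it has more, so an unregularized least-squares fit can reproduce the training data almost exactly while generalizing poorly.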
<br><br>
At degree 0 the model underfits: with only a constant term, it is too simple to capture any relationship between the features and the target, which is why its R-squared is close to zero (-0.027506).
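
A degree 0 model has only a constant term, so it should behave like a baseline that always predicts the training mean. As a rough check, and assuming that equivalence holds, the sketch below scores scikit-learn's `DummyRegressor` with the same 5-fold setup. `DummyRegressor` is not used in the original lab; it is brought in here only for comparison.

```python
from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Baseline that ignores the features and predicts the mean of the training targets.
baseline = DummyRegressor(strategy="mean")

baseline_mae = -cross_val_score(
    baseline, X, y, scoring="neg_mean_absolute_error", cv=5
).mean()
baseline_r2 = cross_val_score(baseline, X, y, scoring="r2", cv=5).mean()

print(f"baseline MAE: {baseline_mae:.6f}")
print(f"baseline R-squared: {baseline_r2:.6f}")
```

If the assumption is right, these values should be close to the degree 0 row of the table, which gives a reference point for judging how much the degree 1 model actually gains from the features.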
<br><br>
The degree 1 model offers the best balance of complexity: it explains the most variance without overfitting. However, although it is the best of the models examined, an R-squared of approximately 0.48 is not especially high, so there may be room for improvement.