Utilize the diabetes dataset from lab 4. Perform cross-validation on nine polynomial models, ranging from degree 0 to 8. (2 points)

In [1]:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.datasets import load_diabetes

# Load the diabetes dataset bundled with scikit-learn
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Create an empty list to store the cross-validation scores for each degree
cross_val_scores = []

# Perform cross-validation for polynomial models of degrees 0 to 8
for degree in range(9):
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)
    
    # Create a linear regression model
    model = LinearRegression()
    
    # Perform 5-fold cross-validation; the scorer returns the negated MSE per fold
    scores = cross_val_score(model, X_poly, y, cv=5, scoring='neg_mean_squared_error')
    mean_score = -scores.mean()  # Flip the sign to recover the (positive) mean MSE
    
    cross_val_scores.append(mean_score)

# Print the cross-validation scores for each degree
for degree, score in enumerate(cross_val_scores):
    print(f'Degree {degree}: Cross-Validation MSE = {score}')


Degree 0: Cross-Validation MSE = 5982.413413836098
Degree 1: Cross-Validation MSE = 2993.081310469331
Degree 2: Cross-Validation MSE = 3495.263074264313
Degree 3: Cross-Validation MSE = 1028102.0051035562
Degree 4: Cross-Validation MSE = 431051.29189899063
Degree 5: Cross-Validation MSE = 411422.3322432812
Degree 6: Cross-Validation MSE = 411811.1242879463
Degree 7: Cross-Validation MSE = 411818.04323928355
Degree 8: Cross-Validation MSE = 411780.2801941051


Construct a table summarizing the cross-validation results. Each model should have a separate row in the table. Include the R-Squared, Mean Absolute Error (MAE) and MAPE metrics for each model. Calculate the mean value and standard deviation of these metrics from the cross-validation. Include both values. (2 points)

In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

# Load the diabetes dataset bundled with scikit-learn
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Collect one row of metrics per polynomial degree
results = []

# Perform cross-validation for polynomial models of degrees 0 to 8
for degree in range(9):
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X)

    # Create a linear regression model
    model = LinearRegression()

    # Per-fold MSE (the scorer returns the negated value, so flip the sign)
    scores = cross_val_score(model, X_poly, y, cv=5, scoring='neg_mean_squared_error')
    mean_squared_errors = -scores

    # Mean R-squared over the 5 folds
    r2 = cross_val_score(model, X_poly, y, cv=5, scoring='r2').mean()
    # Mean of the negated MAE over the 5 folds; the sign is not flipped,
    # so the MAE column in the table is negative
    mae = cross_val_score(model, X_poly, y, cv=5, scoring='neg_mean_absolute_error').mean()
    # MAPE in percent. Note: this is computed in-sample (fit and predict on
    # the full dataset), not on held-out folds
    ape = np.abs(y - model.fit(X_poly, y).predict(X_poly)) / y
    mape = (ape * 100).mean()

    results.append({
        "Degree": degree,
        "R-Squared": r2,
        "MAE": mae,
        "MAPE": mape,
        "MSE": mean_squared_errors
    })

# Build the results table; mean and standard deviation are taken over the per-fold MSE
result_df = pd.DataFrame(results)
result_df["MSE Mean"] = result_df["MSE"].apply(np.mean)
result_df["MSE Std Dev"] = result_df["MSE"].apply(np.std)

# Display the table
result_df


Unnamed: 0,Degree,R-Squared,MAE,MAPE,MSE,MSE Mean,MSE Std Dev
0,0,-0.027506,-66.045624,62.12156,"[5353.025537859954, 6521.235997165425, 6261.92...",5982.413,547.2524
1,1,0.482316,-44.276499,38.78618,"[2779.923449211685, 3028.836338828592, 3237.68...",2993.081,150.771
2,2,0.391502,-46.612882,34.59813,"[3087.145849417225, 3157.60027532677, 3462.536...",3495.263,457.1435
3,3,-182.365458,-342.632418,23.52158,"[59830.97752808989, 171373.31460674157, 299053...",1028102.0,1204616.0
4,4,-70.667516,-303.158461,1.101501e-10,"[181538.0268880956, 1122459.8212536694, 292085...",431051.3,348430.8
5,5,-67.387407,-295.686026,1.139069e-10,"[177085.50676944156, 1072737.3239454448, 29041...",411422.3,333161.3
6,6,-67.447482,-295.631865,1.147348e-10,"[176969.03887648133, 1075155.2321600588, 29024...",411811.1,334169.6
7,7,-67.448529,-295.630403,1.179825e-10,"[176964.8553048894, 1075190.780404038, 290253....",411818.0,334184.2
8,8,-67.442147,-295.579342,1.348009e-10,"[176912.31940224464, 1075211.6541500413, 29023...",411780.3,334216.2


Identification of the Best Model: Identify the model that exhibits the highest performance based on the R-Squared, MAE and MAPE metrics. Provide an explanation for choosing this specific model. (1 point)

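The degree-1 (plain linear) model performs best among the polynomial models of degrees 0 to 8. It has the highest mean cross-validated R-squared (0.482316), so it explains roughly 48% of the variance in the target on held-out folds. It also has the smallest MAE: scikit-learn reports the negated score, -44.276499, so the error itself is about 44.28. Its mean cross-validation MSE (2993.08) is the lowest as well. The MAPE column does not support the same conclusion directly. Degree 1 has a MAPE of 38.79%, while higher degrees show smaller values (34.60% for degree 2, 23.52% for degree 3, and about 1e-10 for degrees 4 to 8). That column was computed on the training data rather than on held-out folds, so the decreasing values reflect overfitting rather than better predictions. Based on the cross-validated metrics, degree 1 is the best choice for this dataset.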
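
As a quick check of this selection, the sketch below picks the best degree per metric from the values in the results table. The arrays are copied by hand from that table; the variable names are illustrative, and the MAE values are negated because the table stores scikit-learn's `neg_mean_absolute_error` score.

```python
import numpy as np

degrees = np.arange(9)

# Mean cross-validated values copied from the results table
r2_mean = np.array([-0.027506, 0.482316, 0.391502, -182.365458, -70.667516,
                    -67.387407, -67.447482, -67.448529, -67.442147])
neg_mae_mean = np.array([-66.045624, -44.276499, -46.612882, -342.632418,
                         -303.158461, -295.686026, -295.631865, -295.630403,
                         -295.579342])
mse_mean = np.array([5982.413, 2993.081, 3495.263, 1028102.0, 431051.3,
                     411422.3, 411811.1, 411818.0, 411780.3])

mae_mean = -neg_mae_mean  # flip sign: positive error

best_by_r2 = int(degrees[np.argmax(r2_mean)])    # higher is better
best_by_mae = int(degrees[np.argmin(mae_mean)])  # lower is better
best_by_mse = int(degrees[np.argmin(mse_mean)])  # lower is better

print(f"Best degree by R-squared: {best_by_r2}")
print(f"Best degree by MAE:       {best_by_mae}")
print(f"Best degree by MSE:       {best_by_mse}")
```

The in-sample MAPE column is left out on purpose, since it is not a held-out metric and would favor the overfitted high-degree models.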

Additional analysis and interpretation of the models' performances. You may explore further insights beyond the required metrics. The analysis should provide at least one relevant insight about the choice of the best model, or about the characteristics of the chosen one (for example, an analysis of the instances in which it fails).

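The best model is the degree-1 linear model. Its R-squared of about 0.48 is moderate rather than high: it captures the linear part of the relationship but leaves roughly half of the variance unexplained. Higher-degree models do not recover that remainder; they fail badly on held-out data. From degree 3 onward the cross-validated R-squared is strongly negative (-182.37 at degree 3, around -67 for degrees 4 to 8) and the MSE rises by two orders of magnitude, while the in-sample MAPE falls to about 1e-10. This is the signature of overfitting. The dataset has 442 samples and 10 features, and a full degree-4 expansion already produces 1001 terms, more than the number of samples, so the model can fit the training data almost exactly and generalizes poorly. The linear model will still miss non-linear patterns and feature interactions. Capturing those likely calls for regularization or a more constrained non-linear model rather than a higher polynomial degree.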
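
To make the overfitting argument concrete, the sketch below counts the columns that a full polynomial expansion produces for each degree. It assumes the defaults of `PolynomialFeatures` (bias column included, all interaction terms kept), for which the number of output terms is the binomial coefficient C(n_features + degree, degree). The helper name `n_poly_terms` is illustrative.

```python
from math import comb

# Shape of scikit-learn's diabetes dataset
n_samples, n_features = 442, 10


def n_poly_terms(n_features, degree):
    """Number of columns in a full polynomial expansion, bias term included."""
    return comb(n_features + degree, degree)


for degree in range(9):
    terms = n_poly_terms(n_features, degree)
    note = "  <- more terms than samples" if terms > n_samples else ""
    print(f"Degree {degree}: {terms} terms{note}")
```

With 5-fold cross-validation each training fold holds only about 353 samples, so even the 286 terms of degree 3 leave very little data per coefficient.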