# Lab5

1. Utilize the diabetes dataset from scikit-learn. Perform cross-validation on nine polynomial models, ranging from degree 0 to 8, with 75% training and 25% testing data.

In [1]:
# Importing necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

# Load the diabetes dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Create a DataFrame to store the cross-validation results
results = pd.DataFrame(columns=['Degree', 'R-Squared', 'MAE'])

# Perform cross-validation on polynomial models of degrees 0 to 8
for degree in range(9):
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree)
    # degree=degree means that the degree of the polynomial will be the same as the loop variable
    # For example, when degree=0, the polynomial will be a constant term
    # When degree=1, the polynomial will be a linear function
    # When degree=2, the polynomial will be a quadratic function
    # And so on
    X_poly = poly.fit_transform(X_train)

    # Fit a linear regression model
    model = LinearRegression()
    scores = cross_val_score(model, X_poly, y_train, cv=None, scoring='neg_mean_absolute_error')
    # cv=None uses the default 5-fold cross-validation:
    # the training data is split into 5 parts, and each part is held out once for scoring while the model is fit on the other 4

    mae = -scores.mean()
    # .mean() averages the scores over the 5 folds
    # scoring='neg_mean_absolute_error' returns negated MAE values (so that higher is always better),
    # so the mean is negated again to report a positive MAE

    r2 = cross_val_score(model, X_poly, y_train, cv=None, scoring='r2').mean()
    # .mean() averages the R-squared scores over the 5 folds
    # R-squared measures how well the model fits the data: 1 means the target is predicted perfectly,
    # 0 matches a model that always predicts the mean, and negative values are worse than that


    # Append the results using concat
    new_row = pd.DataFrame({'Degree': [degree], 'R-Squared': [r2], 'MAE': [mae]})
    results = pd.concat([results, new_row], ignore_index=True)
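
The comments in the loop describe what each degree means; it can also help to see how many columns each degree produces. The sketch below is an illustration rather than part of the lab code. It assumes the 10 input features of the diabetes dataset and the default `include_bias=True`, and uses the standard count of polynomial terms, C(n + d, d).

```python
from math import comb

n_features = 10  # the diabetes dataset has 10 input features

# The number of polynomial terms of total degree <= d in n variables is C(n + d, d).
# This matches the column count PolynomialFeatures produces with include_bias=True.
feature_counts = {d: comb(n_features + d, d) for d in range(9)}

for d, count in feature_counts.items():
    print(f"degree {d}: {count} columns")
```

With 442 samples, the 75% split leaves 331 training rows, and each 5-fold training subset has roughly 265 rows. From degree 3 onward there are more columns than rows, so the least-squares fit is underdetermined, which is a likely reason the higher-degree scores degrade so sharply.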




2. Construct a table summarizing the cross-validation results. Each model should have a separate row in the table. Include the R-Squared and Mean Absolute Error (MAE) metrics for each model. Calculate the mean value and standard deviation of these metrics from the cross-validation. Include both values.

In [2]:
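# Mean and standard deviation of the per-model R-Squared and MAE values, taken across the nine models
# (each value in `results` is already the average over the 5 cross-validation folds)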
mean_r2 = results['R-Squared'].mean()
std_r2 = results['R-Squared'].std()
mean_mae = results['MAE'].mean()
std_mae = results['MAE'].std()

# Print the results
print("Cross-Validation Results:")
print(results)

# Print the mean and standard deviation of R-Squared and MAE
print("\nMean R-Squared:", mean_r2)
print("Standard Deviation of R-Squared:", std_r2)
print("Mean MAE:", mean_mae)
print("Standard Deviation of MAE:", std_mae)

Cross-Validation Results:
  Degree    R-Squared          MAE
0      0    -0.006681    68.274930
1      1     0.512232    44.917726
2      2     0.370773    48.694205
3      3 -2387.549996  1869.326642
4      4   -18.045974   215.564530
5      5   -17.483505   213.475218
6      6   -17.485083   213.460474
7      7   -17.484977   213.460399
8      8   -17.500454   213.488239

Mean R-Squared: -274.9637406376293
Standard Deviation of R-Squared: 792.2672295759245
Mean MAE: 344.5180402859339
Standard Deviation of MAE: 577.0556566674474


3. Identify the model that exhibits the highest performance based on the R-Squared and MAE metrics. Provide an explanation for choosing this specific model.

In [3]:
# Identify the model with the highest performance based on R-Squared and MAE
best_model_r2 = results.loc[results['R-Squared'].idxmax()]
best_model_mae = results.loc[results['MAE'].idxmin()]

print("\nBest Model based on R-Squared:")
print(best_model_r2)
print("\nBest Model based on MAE:")
print(best_model_mae)



Best Model based on R-Squared:
Degree               1
R-Squared     0.512232
MAE          44.917726
Name: 1, dtype: object

Best Model based on MAE:
Degree               1
R-Squared     0.512232
MAE          44.917726
Name: 1, dtype: object


Degree 1 gives the best R-Squared: its cross-validated value of 0.512 is the highest of the nine models and the closest to 1.
Degree 1 also gives the best MAE: 44.92 is the lowest average absolute difference between the predicted and observed values.

Therefore, the degree 1 (linear) model is chosen, as it ranks first on both prediction accuracy (MAE) and goodness of fit (R-Squared). Degree 2 scores worse on both metrics, and degrees 3 and above have strongly negative R-Squared values, meaning they predict worse than a model that always outputs the mean.
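
The 25% test split is not used in the cells above. A minimal sketch of how the chosen degree 1 model could be checked on it follows, assuming the same split (`test_size=0.25`, `random_state=0`); the names `final_model`, `test_r2`, and `test_mae` are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Degree 1 is ordinary linear regression, so no polynomial expansion is needed
final_model = LinearRegression().fit(X_train, y_train)
y_pred = final_model.predict(X_test)

# Score the held-out data that played no part in model selection
test_r2 = r2_score(y_test, y_pred)
test_mae = mean_absolute_error(y_test, y_pred)
print(f"Test R-Squared: {test_r2:.3f}")
print(f"Test MAE: {test_mae:.3f}")
```

Test metrics that land close to the cross-validated values would suggest the cross-validation estimate was a reasonable guide to performance on unseen data.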