''' Get the data and run a train-validation-test split. A description of each column can be found in the sklearn documentation. See the documentation for the load_diabetes method to learn what the as_frame and scaled arguments are for.'''

In [1]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Load the diabetes dataset as pandas objects
diabetes = load_diabetes(as_frame=True)

# Extract the feature matrix and the target
X = diabetes.data
y = diabetes.target

# Split the data: 70% train, then split the remaining 30% evenly into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
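
The two-stage split above can be checked by hand. The following is a minimal sketch using only the standard library; it assumes the dataset has 442 rows and that a fractional test_size is rounded up, which is how train_test_split is understood to behave. The helper name split_sizes is hypothetical.

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_kept, n_held_out) for a fractional split.

    Assumption: the held-out part is rounded up, and the kept part
    is whatever remains.
    """
    n_held_out = math.ceil(test_fraction * n_samples)
    return n_samples - n_held_out, n_held_out


n_total = 442  # assumed number of rows in the diabetes dataset

# First split: 70% train, 30% temporary hold-out
n_train, n_temp = split_sizes(n_total, 0.3)

# Second split: the hold-out is divided evenly into validation and test
n_val, n_test = split_sizes(n_temp, 0.5)

print(n_train, n_val, n_test)
```

Under these assumptions the split is roughly 70/15/15, with the validation and test sets differing by one row.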


Run a multivariate linear regression on all variables (1 point)

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# Create and train the multivariate linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Make predictions on the validation set
y_val_pred_linear = linear_model.predict(X_val)

# Calculate R-squared and MAE for the linear model on the validation set
r2_linear = r2_score(y_val, y_val_pred_linear)
mae_linear = mean_absolute_error(y_val, y_val_pred_linear)
print("***Multivariate linear regression***")
print("R-squared is ",r2_linear)
print("MAE is ",mae_linear)

R-squared is  0.5112619269090262
MAE is  38.21668137234904


Run a polynomial regression of the 2nd degree on the BMI feature alone (0.5 point)

In [7]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Extracting BMI feature
X_train_bmi = X_train[['bmi']]
X_val_bmi = X_val[['bmi']]

# Create and train the polynomial regression model on BMI
poly_bmi_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly_bmi_model.fit(X_train_bmi, y_train)

# Make predictions on the validation set
y_val_pred_poly_bmi = poly_bmi_model.predict(X_val_bmi)

# Calculate R-squared and MAE for the polynomial model on BMI
r2_poly_bmi = r2_score(y_val, y_val_pred_poly_bmi)
mae_poly_bmi = mean_absolute_error(y_val, y_val_pred_poly_bmi)
print("***Polynomial regression model***")
print("R-squared is ",r2_poly_bmi)
print("MAE is ",mae_poly_bmi)


***Polynomial regression model***
R-squared is  0.296223055272985
MAE is  48.27302777867063


Run a multivariate polynomial regression of the 2nd degree on all variables (Hint: set include_bias=False in PolynomialFeatures) (0.5 points)

In [8]:
# Create and train the multivariate polynomial regression model
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly_model.fit(X_train, y_train)

# Make predictions on the validation set
y_val_pred_poly = poly_model.predict(X_val)

# Calculate R-squared and MAE for the multivariate polynomial model on validation set
r2_poly = r2_score(y_val, y_val_pred_poly)
mae_poly = mean_absolute_error(y_val, y_val_pred_poly)
print("***Multivariate polynomial model***")
print("R-squared is ",r2_poly)
print("MAE is ",mae_poly)

***Multivariate polynomial model***
R-squared is  0.36717480117280155
MAE is  42.47137889140918


Compare the three models by looking at R-squared, MAPE and MAE. Explain what the values mean for a non-expert and add your insight about the values of each model. Note: You can add any further comparisons and code (this is not necessary for a perfect score, but will be reviewed and evaluated) (2 points)

In [9]:
from sklearn.metrics import mean_absolute_percentage_error

# Calculate MAPE for all models on validation set
mape_linear = mean_absolute_percentage_error(y_val, y_val_pred_linear)
mape_poly_bmi = mean_absolute_percentage_error(y_val, y_val_pred_poly_bmi)
mape_poly = mean_absolute_percentage_error(y_val, y_val_pred_poly)

# Display results
print("Multivariate Linear Regression:")
print(f"R-squared: {r2_linear:.4f}, MAE: {mae_linear:.4f}, MAPE: {mape_linear:.4f}")

print("\nPolynomial Regression on BMI:")
print(f"R-squared: {r2_poly_bmi:.4f}, MAE: {mae_poly_bmi:.4f}, MAPE: {mape_poly_bmi:.4f}")

print("\nMultivariate Polynomial Regression:")
print(f"R-squared: {r2_poly:.4f}, MAE: {mae_poly:.4f}, MAPE: {mape_poly:.4f}")


Multivariate Linear Regression:
R-squared: 0.5113, MAE: 38.2167, MAPE: 0.3462

Polynomial Regression on BMI:
R-squared: 0.2962, MAE: 48.2730, MAPE: 0.4190

Multivariate Polynomial Regression:
R-squared: 0.3672, MAE: 42.4714, MAPE: 0.3809
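
To make the three metrics concrete, here is a minimal sketch that computes them by hand on a toy example. The numbers in y_true and y_pred are hypothetical and are not taken from the diabetes data. R-squared is the share of the target's variance that the model explains (1 is perfect, 0 is no better than predicting the mean). MAE is the average size of the error in the target's own units. MAPE is the average error relative to the true value, reported as a fraction, so 0.3462 corresponds to about 34.6%.

```python
def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mean_abs_error(y_true, y_pred):
    """Average absolute difference, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_abs_pct_error(y_true, y_pred):
    """Average absolute difference relative to the true value (a fraction)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy values, chosen only to keep the arithmetic easy to follow
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 140.0, 190.0]

print(f"R-squared: {r_squared(y_true, y_pred):.4f}")
print(f"MAE: {mean_abs_error(y_true, y_pred):.4f}")
print(f"MAPE: {mean_abs_pct_error(y_true, y_pred):.4f}")
```

Read this way, the multivariate linear regression has the highest R-squared and the lowest MAE and MAPE of the three models on the validation set.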


Please answer the following questions:

1. How many parameters are we fitting for each of the three models? Explain these values. Hint: for explaining the parameters of the polynomial regression, you can use poly.get_feature_names_out() (1 point)

ans.

Multivariate Linear Regression:
The number of parameters equals the number of features plus one for the intercept term. The diabetes dataset has 10 features, so the model fits 10 coefficients and 1 intercept, 11 parameters in total.

Polynomial Regression on BMI:
The polynomial regression on BMI has three parameters:
the coefficient for BMI, the coefficient for BMI squared, and the intercept term.

Multivariate Polynomial Regression:
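As in multivariate linear regression, the number of parameters is determined by the number of features, but here it grows because of the polynomial terms. With 10 features and degree 2, PolynomialFeatures(include_bias=False) produces 65 columns (10 linear terms, 10 squared terms, and 45 pairwise interaction terms), so the model fits 65 coefficients plus 1 intercept, 66 parameters in total.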
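
The parameter counts can be verified without fitting anything. The sketch below uses only the standard library and enumerates the degree-1 and degree-2 terms the way a polynomial expansion without a bias column would. The helper names polynomial_terms and n_parameters are hypothetical, and the term labels are simplified: a squared term is written as "bmi bmi" here, whereas get_feature_names_out() would label it differently.

```python
from itertools import combinations_with_replacement
from math import comb

# Column names of the diabetes dataset
feature_names = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]


def polynomial_terms(names, degree):
    """List every monomial of degree 1..degree (no bias column)."""
    terms = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append(" ".join(combo))
    return terms


def n_parameters(n_features, degree):
    """Number of coefficients for all polynomial terms, plus one intercept."""
    n_coefficients = comb(n_features + degree, degree) - 1
    return n_coefficients + 1


print("Multivariate linear:", n_parameters(len(feature_names), 1))
print("Polynomial on BMI:", n_parameters(1, 2))
print("Multivariate polynomial:", n_parameters(len(feature_names), 2))
print("Expanded columns:", len(polynomial_terms(feature_names, 2)))
```

The closed-form count and the explicit enumeration agree, which is a quick sanity check before comparing against the length of the fitted model's coefficient array.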

2. Which model would you choose for deployment, and why? (1 point)

ans.

The choice of model for deployment depends on several factors, including performance metrics, computational complexity, and interpretability. Here are a few considerations:


Multivariate Linear Regression:
Simple, interpretable, and computationally efficient. It may perform well if the relationship between the features and the target variable is roughly linear. However, it may struggle to capture non-linear relationships.

Polynomial Regression on BMI:
If there is evidence that the relationship between BMI and the target variable is non-linear, this model can be a good choice. It has a modest number of parameters and may capture more complex patterns.

Multivariate Polynomial Regression:
This model captures non-linear relationships among all features. However, it introduces more parameters, increasing the risk of overfitting. It can be more computationally expensive and might require more data.
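
One way to turn these considerations into a decision is to rank the models on the validation metrics reported earlier. The sketch below is only an illustration: the dictionary validation_scores is a hypothetical container holding the rounded values printed above, and selecting on validation R-squared or MAE is an assumed criterion, not the only reasonable one.

```python
# Validation metrics copied from the comparison output above (rounded)
validation_scores = {
    "Multivariate Linear Regression": {"r2": 0.5113, "mae": 38.2167, "mape": 0.3462},
    "Polynomial Regression on BMI": {"r2": 0.2962, "mae": 48.2730, "mape": 0.4190},
    "Multivariate Polynomial Regression": {"r2": 0.3672, "mae": 42.4714, "mape": 0.3809},
}

# Higher R-squared is better; lower MAE and MAPE are better
best_by_r2 = max(validation_scores, key=lambda name: validation_scores[name]["r2"])
best_by_mae = min(validation_scores, key=lambda name: validation_scores[name]["mae"])
best_by_mape = min(validation_scores, key=lambda name: validation_scores[name]["mape"])

print("Best by R-squared:", best_by_r2)
print("Best by MAE:", best_by_mae)
print("Best by MAPE:", best_by_mape)
```

On these numbers all three criteria point to the multivariate linear regression, which is also the simplest of the models that use all features. The held-out test split would then be used once, on the selected model only, to estimate how it performs on unseen data.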