## Practical Lab 4 - Multivariate Linear and Polynomial Regression, and Evaluation using R-Squared, MAPE and MAE

#### Get the data, and run a train-validation-test split

In [10]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Load diabetes dataset
diabetes = datasets.load_diabetes(as_frame=True)

# Get features and target
X = diabetes.data
y = diabetes.target

# Split into training set and temp 
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)

# Split temp into validation and testing
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
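
As a quick sanity check on the split above, the following minimal sketch repeats the two `train_test_split` calls and reports the size of each subset. The `sizes` dictionary is a helper introduced here for illustration only; it is not part of the lab code.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

# Reload the data so this check is self-contained
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

# Same two-step split as in the lab: 60% train, then the rest halved
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Hypothetical helper dict: row counts per subset
sizes = {"train": len(X_train), "valid": len(X_valid), "test": len(X_test)}
for name, n in sizes.items():
    print(f"{name}: {n} rows ({n / len(X):.1%})")
```

The proportions should come out close to 60/20/20, with every row assigned to exactly one subset.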

#### Run a multivariate linear regression on all variables

In [11]:
from sklearn.linear_model import LinearRegression

# Create a linear regression object
lr = LinearRegression()

# Train the model using the training sets
lr.fit(X_train, y_train)

#### Run a polynomial regression of the 2nd degree on the BMI feature alone

In [12]:
from sklearn.preprocessing import PolynomialFeatures

# Create PolynomialFeatures object of degree 2
poly = PolynomialFeatures(degree=2)

# Transform the BMI data
X_train_bmi_poly = poly.fit_transform(X_train[['bmi']])
X_valid_bmi_poly = poly.transform(X_valid[['bmi']])

# Create a new linear regression object
lr_poly_bmi = LinearRegression()

# Fit the model using transformed BMI training data
lr_poly_bmi.fit(X_train_bmi_poly, y_train)

#### Run a multivariate polynomial regression of the 2nd degree on all variables

In [13]:
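# Transform all features into polynomial features.
# A separate transformer is used so that `poly` stays fitted on BMI only.
poly_all = PolynomialFeatures(degree=2)
X_train_poly = poly_all.fit_transform(X_train)
X_valid_poly = poly_all.transform(X_valid)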

# Create a linear regression object
lr_poly = LinearRegression()

# Train the model using the transformed training data
lr_poly.fit(X_train_poly, y_train)
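
One possible alternative to transforming the data by hand is to chain the feature expansion and the regression in a scikit-learn pipeline, so the same transformation is applied automatically at prediction time. The sketch below is an assumption about how that could look, not the lab's method; `poly_model`, `n_terms`, and `valid_r2` are illustrative names, and `include_bias=False` is chosen here because `LinearRegression` already fits its own intercept.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same data and split as in the lab
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Degree-2 expansion followed by ordinary least squares.
# include_bias=False drops the constant column; the intercept covers it.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_model.fit(X_train, y_train)

# make_pipeline names each step after its lowercased class name
n_terms = poly_model.named_steps["polynomialfeatures"].n_output_features_
valid_r2 = poly_model.score(X_valid, y_valid)  # R-squared on the validation set

print("Polynomial terms (excluding intercept):", n_terms)
print("Validation R-squared:", valid_r2)
```

Because the pipeline owns its transformer, there is no shared `poly` object to refit between models.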

#### Compare the three models by looking at R-squared, MAPE and MAE.

In [14]:
from sklearn.metrics import r2_score, mean_absolute_error
import numpy as np

# Predict values for each set of features
y_pred_lr = lr.predict(X_valid)
y_pred_lr_bmi = lr_poly_bmi.predict(X_valid_bmi_poly)
y_pred_lr_poly = lr_poly.predict(X_valid_poly)

# Calculate metrics 
r2_lr = r2_score(y_valid, y_pred_lr)
r2_lr_bmi = r2_score(y_valid, y_pred_lr_bmi)
r2_lr_poly = r2_score(y_valid, y_pred_lr_poly)

mae_lr = mean_absolute_error(y_valid, y_pred_lr)
mae_lr_bmi = mean_absolute_error(y_valid, y_pred_lr_bmi)
mae_lr_poly = mean_absolute_error(y_valid, y_pred_lr_poly)

mape_lr = np.mean(np.abs((y_valid - y_pred_lr) / y_valid)) * 100
mape_lr_bmi = np.mean(np.abs((y_valid - y_pred_lr_bmi) / y_valid)) * 100
mape_lr_poly = np.mean(np.abs((y_valid - y_pred_lr_poly) / y_valid)) * 100

print("Model: Linear Regression (Multivariate)")
print("R-squared:", r2_lr)
print("Mean Absolute Error (MAE):", mae_lr)
print("Mean Absolute Percentage Error (MAPE):", mape_lr)

print("\nModel: Polynomial Regression (BMI only)")
print("R-squared:", r2_lr_bmi)
print("Mean Absolute Error (MAE):", mae_lr_bmi)
print("Mean Absolute Percentage Error (MAPE):", mape_lr_bmi)

print("\nModel: Polynomial Regression (Multivariate)")
print("R-squared:", r2_lr_poly)
print("Mean Absolute Error (MAE):", mae_lr_poly)
print("Mean Absolute Percentage Error (MAPE):", mape_lr_poly)

Model: Linear Regression (Multivariate)
R-squared: 0.4425508528863389
Mean Absolute Error (MAE): 48.200060108841676
Mean Absolute Percentage Error (MAPE): 42.214707962597004

Model: Polynomial Regression (BMI only)
R-squared: 0.3501135470369522
Mean Absolute Error (MAE): 52.02035930655676
Mean Absolute Percentage Error (MAPE): 48.60063435782185

Model: Polynomial Regression (Multivariate)
R-squared: -3.671071530949913
Mean Absolute Error (MAE): 134.51136363636363
Mean Absolute Percentage Error (MAPE): 104.56284794650321


R-squared (the coefficient of determination) measures the proportion of the variance in the dependent variable that the regression model explains. A value of 1 means a perfect fit, and 0 means the model does no better than always predicting the mean of the observed values. On held-out data the value can also be negative, which means the model predicts worse than that constant mean.
<br><br>
Mean Absolute Error (MAE) measures errors between paired observations expressing the same phenomenon. It's the average over the absolute differences between prediction and actual observation, where all individual differences have equal weight.
<br><br>
Mean Absolute Percentage Error (MAPE) expresses the prediction error relative to the actual values. It is the average of the absolute errors, each divided by the corresponding actual value, multiplied by 100. It is undefined when an actual value is 0.
<br><br>
- For R-squared, values closer to 1 mean a better model.
<br>
- For both MAE and MAPE, lower values are better. An MAE of 0 means a perfect prediction, and MAE is in the same units as the target. MAPE is expressed as a percentage; for example, a MAPE of 20 means that predictions are off by 20% of the actual value on average.
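
To make the three definitions concrete, here is a small worked sketch on made-up numbers (the arrays below are illustrative and not taken from the diabetes data). The `metrics` helper is a hypothetical function written for this example; it follows the same formulas used in the lab and shows how R-squared becomes negative when predictions are worse than the mean.

```python
import numpy as np

def metrics(y_true, y_pred):
    """Return (R-squared, MAE, MAPE in percent) for two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    mae = np.mean(np.abs(y_true - y_pred))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    return r2, mae, mape

# Made-up targets and two sets of predictions
y_true = [100, 150, 200, 250]
good_pred = [110, 140, 210, 240]  # each prediction is off by 10
bad_pred = [250, 200, 150, 100]   # predictions in reversed order

r2_good, mae_good, mape_good = metrics(y_true, good_pred)
r2_bad, mae_bad, mape_bad = metrics(y_true, bad_pred)

print("good:", r2_good, mae_good, mape_good)
print("bad: ", r2_bad, mae_bad, mape_bad)
```

The first set of predictions gives an R-squared of about 0.968 with an MAE of 10, while the reversed predictions give an R-squared of -3, well below the 0 that a constant mean prediction would score.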

#### How many parameters are we fitting for each of the three models? Explain these values.<br>
1. Multivariate Linear Regression: Since we are using all ten features and one bias term, we are fitting 11 parameters.<br>
2. Polynomial Regression (BMI): Here, we use a polynomial of degree 2 of a single feature, BMI, and a bias term, resulting in 3 parameters. These parameters correspond to the bias term, the coefficient of the BMI feature (linear term), and the coefficient of the squared BMI feature (quadratic term).<br>
3. Multivariate Polynomial Regression: 66 parameters are being fitted in this model: the ten original features, their ten squares, the 45 interactions between each pair of features, and the bias term (10 + 10 + 45 + 1 = 66).<br>
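
The counts above can be checked with a short calculation. A full polynomial of degree d in n features has C(n + d, d) terms, including the constant term. The helper `poly_term_count` below is a hypothetical name used only for this sketch. Note that `PolynomialFeatures` adds a constant column by default while `LinearRegression` also fits an intercept, so the two overlap; the counts here treat them as a single bias parameter.

```python
from math import comb

def poly_term_count(n_features: int, degree: int) -> int:
    """Number of monomials of total degree <= `degree` in `n_features`
    variables, including the constant (bias) term."""
    return comb(n_features + degree, degree)

# Model 1: plain linear regression = degree-1 polynomial in 10 features
linear_params = poly_term_count(10, 1)

# Model 2: degree-2 polynomial in BMI only
bmi_poly_params = poly_term_count(1, 2)

# Model 3: degree-2 polynomial in all 10 features
full_poly_params = poly_term_count(10, 2)

print(linear_params, bmi_poly_params, full_poly_params)
```

This reproduces the 11, 3, and 66 parameters listed above.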
#### Which model would you choose for deployment, and why?<br>
Even though the Multivariate Linear Regression model's performance is modest (validation R-squared of about 0.44 and MAPE of about 42%), it outperforms the other two models on all three metrics, which suggests that adding polynomial terms does not improve the model in this case.<br>
Moreover, the Multivariate Polynomial Regression model is substantially overfitting the data, as evidenced by its poor performance on the validation dataset (R-squared of about -3.67 and MAE and MAPE more than double those of the linear model). With 66 parameters fitted on only about 60% of the 442 samples, it has enough flexibility to fit noise in the training set. Therefore, I would choose the Multivariate Linear Regression model for deployment in this particular case.
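
Since the validation set was used to choose between models, a final estimate of the chosen model's performance would normally come from the untouched test set. The sketch below shows one way this could be done with the same split and metric formulas as the lab; the `test_*` variable names are introduced here for illustration, and no results are quoted because this step was not run in the original notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Same data and split as in the lab
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Refit the selected model (multivariate linear regression) on the training set
lr = LinearRegression().fit(X_train, y_train)

# Evaluate once on the held-out test set
y_pred_test = lr.predict(X_test)
test_r2 = r2_score(y_test, y_pred_test)
test_mae = mean_absolute_error(y_test, y_pred_test)
test_mape = np.mean(np.abs((y_test - y_pred_test) / y_test)) * 100

print("Test R-squared:", test_r2)
print("Test MAE:", test_mae)
print("Test MAPE:", test_mape)
```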