## Lab 4 ##

In [114]:
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, r2_score

X, y = datasets.load_diabetes(as_frame=True, scaled=False, return_X_y=True)

# Approximately 70-15-15 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_vali, y_train, y_vali = train_test_split(X_train, y_train, test_size= 0.12, random_state= 42)

# printing the shape to check dimensions
print(X_train.shape)
print(X_vali.shape)
print(X_test.shape)

(349, 10)
(48, 10)
(45, 10)


In [115]:
model = LinearRegression()

def model_fit(model, X_train, y_train, \
              X_vali, y_vali, X_test, y_test, poly=False, uni= False):
      
        if uni == True:
            X_train = X_train['bmi'].values.reshape(-1,1)
            X_vali = X_vali['bmi'].values.reshape(-1,1)
            X_test = X_test['bmi'].values.reshape(-1,1)

        if poly == True: 
           poly_transform = PolynomialFeatures(include_bias= False)
           X_train = poly_transform.fit_transform(X_train)
           X_vali = poly_transform.fit_transform(X_vali)
           X_test = poly_transform.fit_transform(X_test)
           print(len(poly_transform.get_feature_names_out()))
           
        model.fit(X_train, y_train)
        y_train_pred = model.predict(X_train)
        y_vali_pred = model.predict(X_vali)
        y_test_pred = model.predict(X_test)

        print('Train MAE: ' + str(mean_absolute_error(y_train, y_train_pred)))
        print('Validation MAE: ' + str(mean_absolute_error(y_vali, y_vali_pred)))
        print('Train MAPE: ' + \
              str(mean_absolute_percentage_error(y_train, y_train_pred)))
        print('Validation MAPE: ' + \
              str(mean_absolute_percentage_error(y_vali, y_vali_pred)))
        print('Train R2: ' + str(r2_score(y_vali, y_vali_pred)))
        print('Validation R2: ' + str(r2_score(y_vali, y_vali_pred)))
        
      

In [116]:
print("MLR")
model_fit(LinearRegression(), X_train, y_train,\
           X_vali, y_vali, X_test, y_test)
print("")
print("Univariate Polynomial")
model_fit(LinearRegression(), X_train, y_train,\
           X_vali, y_vali, X_test, y_test, poly= True, uni= True)
print("")
print("Mulivariate Polynomial")
model_fit(LinearRegression(), X_train, y_train,\
           X_vali, y_vali, X_test, y_test, poly= True)

MLR
Train MAE: 43.90172568436122
Validation MAE: 40.96787164357992
Train MAPE: 0.394805661417686
Validation MAPE: 0.37875534231323277
Train R2: 0.5684398040231259
Validation R2: 0.5684398040231259

Univariate Polynomial
2
Train MAE: 52.40807850282762
Validation MAE: 47.479107713163216
Train MAPE: 0.4859676450692311
Validation MAPE: 0.45573459983931813
Train R2: 0.4731868903291634
Validation R2: 0.4731868903291634

Mulivariate Polynomial
65
Train MAE: 39.371332676193305
Validation MAE: 42.23439821183154
Train MAPE: 0.3529149387623396
Validation MAPE: 0.36758510948350337
Train R2: 0.5513908106420733
Validation R2: 0.5513908106420733


### Q5 ###
Each metric is applied to both the training data, and the validation data. This is a way to check for overfitting. If the performance metric for the training data is much lower than the performance metric for the validation data, this is a sign that overfitting is occuring. In our case for the 3 models, there isn't a significant difference between the performance metrics for training and validation, and thus I do not think that overfitting is occuring. 

The MAE is the Mean Absolute Error. This can be interpreted as the the mean distance between the predicted results and the true values of the results. It has a minimum value of 0, but it has no maximum value. In the case of our 3 models here, we can see that the Multivariate Polynomial model has the lowest MAE. That means that, on average, the distance between the predicted value and the true value is the closest. It is difficult to interpret the performance of the model with this value in isolation, and as such different metrics are also usually used as performance metrics. 

The MAPE is the Mean Average Performance Error. This can be interpreted as the mean percentage difference between the predicted results and the true values. As it is a percentage, the value ranges between 0 and 100. Compared to MAE, this gives us some context to how our model is performing as MAPE as it is a percentage error instead of simply the mean distance. In the case of our 3 models, the MLR model and the Multivariate Polynomial models perform similarly (with Train/Validation MAPE of 0.39/0.38 and 0.35/0.37 respectively), while the Univariate Polynomial model performs poorly incomparison (0.49/0.46).

The $R^2$ value can be interpreted as the proportion of the variance in the data that can be explained by the model. It is a value with a max value of 1, but no minimum value. A maximum value of 1 would mean that our independent variables perfectly explains the dependent variables. The results of our 3 models is similar to the MAPE, where the linear model and the multivariate polynomial model perform similarly, while the univariate model does not perform as well. However, even our best performing model, thelinear model has an $R^2$ value of 0.57, which is not a very high value.

### Q6 ###
i) In the MLR model, we are fitting 11 parameters (one for each feature and one for the intercept). In the Univariate Polynomial model, we are fitting 3 parameters (1 for bmi, 1 for bmi squared, and the intercept). In the Multivariate Polynomial model, we are fitting 66 parameters (one for each feature, and one for each pairwise multiplication of the features, and the intercept).

In the linear model, each coefficient represents the effect of a unit change in that independent variable on the independent variable. (e.g. if a coefficient is 0.5, that means for 1 unit increase in its associated independent variable, it increases the dependent variable by 0.5). 

In the polynomial models, the coefficients can be interpreted in the same manner, but the distinct pairwise multiplication of the features are now interaction terms between the two independent variables, and multiplication of the same independent variable on itself is the the variables squared.

ii) I would choose between the linear model and the multivariate polynomial model. These two have very similar performance metrics, with the linear model having a slightly better $R^2$ value, and the polynomial model a slightly better MAPE value. Overall, I would pick the linear model, as it is a little more readily interpretable. 