<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2
## Part 3: Model Benchmarks

In [1]:
import pandas as pd
import numpy as np
import math
from tabulate import tabulate
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, RidgeCV, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectFromModel

In [2]:
%store -r X_scaled
%store -r y_train
%store -r X_test
%store -r y_test

### Baseline Model

The baseline (null) model predicts the mean of the training target for every observation.

In [3]:
y_predict=y_train.mean()

In [4]:
Null_MSE=np.mean((y_test-y_predict)**2)

In [5]:
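# Root mean squared error of the null model on the test dataset
RMSE=math.sqrt(Null_MSE)
# R^2 is 0 by construction: every R_2 computed below is measured relative to Null_MSE
R_2=0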

In [6]:
print(f'\nThe metrics for the Baseline Null Model are:\n')
print(tabulate([['RMSE', RMSE], ['R^2 (test)', round(R_2,3)]], headers=['Metric', 'Value'], tablefmt='orgtbl'))


The metrics for the Baseline Null Model are:

| Metric     |   Value |
|------------+---------|
| RMSE       | 87288.1 |
| R^2 (test) |     0   |
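As a rough illustration of what the null model does, the sketch below repeats the same calculation on a small made-up target vector. The arrays `y_train_demo` and `y_test_demo` are assumptions for illustration only, not the project data.

```python
import numpy as np

# Hypothetical target values, for illustration only (not the project data).
y_train_demo = np.array([100_000.0, 150_000.0, 200_000.0, 250_000.0])
y_test_demo = np.array([120_000.0, 180_000.0, 240_000.0])

# The null model predicts the training mean for every test observation.
null_prediction = y_train_demo.mean()

# MSE and RMSE of that constant prediction, evaluated on the test targets.
null_mse_demo = np.mean((y_test_demo - null_prediction) ** 2)
null_rmse_demo = np.sqrt(null_mse_demo)

print(f"Null prediction: {null_prediction:.1f}")
print(f"Null RMSE: {null_rmse_demo:.1f}")
```

Any candidate model should reach a lower test RMSE than this constant prediction to be worth keeping.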


### OLS Model

In [7]:
lr=LinearRegression()

In [8]:
ols=lr.fit(X_scaled,y_train)

In [9]:
ols.score(X_scaled,y_train)

0.9492799834139457

In [10]:
ols.score(X_test,y_test)

-66.80513366416575

In [11]:
y_predict=ols.predict(X_test)

In [12]:
RMSE=math.sqrt(np.mean((y_test-y_predict)**2))
R_2=1-RMSE**2/Null_MSE

In [13]:
print(f'\nThe metrics for the Linear Regression Model are:\n')
print(tabulate([['RMSE', RMSE], ['R^2 (test)', round(R_2,3)], ['Score (train)',round(ols.score(X_scaled,y_train),3)]], headers=['Metric', 'Value'], tablefmt='orgtbl'))


The metrics for the Linear Regression Model are:

| Metric        |      Value |
|---------------+------------|
| RMSE          | 717971     |
| R^2 (test)    |    -66.656 |
| Score (train) |      0.949 |


A training R^2 of 0.949 against a test R^2 of about -66.7 shows that the OLS model is severely overfit: on the test dataset it performs far worse than simply predicting the mean. Hence it will not replace the null model as the baseline model.
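The test R^2 returned by `ols.score` (-66.805) differs slightly from the `R_2` in the table (-66.656). A likely explanation is the reference model: scikit-learn's R^2 compares against the mean of the test targets, whereas `R_2` here compares against `Null_MSE`, which uses the training mean. The sketch below shows the two definitions side by side on made-up numbers; the function names and demo values are hypothetical.

```python
import numpy as np


def r2_vs_test_mean(y_true, y_pred):
    """Standard R^2: the reference model predicts the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot


def r2_vs_train_mean(y_true, y_pred, train_mean):
    """Variant matching R_2 above: the reference model predicts the training mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_null = np.sum((y_true - train_mean) ** 2)
    return 1 - ss_res / ss_null


# Made-up values, for illustration only (not the project data).
y_test_demo = np.array([10.0, 20.0, 30.0])
y_pred_demo = np.array([12.0, 18.0, 33.0])
train_mean_demo = 17.0  # hypothetical training mean, different from the test mean of 20

r2_standard = r2_vs_test_mean(y_test_demo, y_pred_demo)
r2_null_based = r2_vs_train_mean(y_test_demo, y_pred_demo, train_mean_demo)

print(f"R^2 vs test mean:  {r2_standard:.4f}")
print(f"R^2 vs train mean: {r2_null_based:.4f}")
```

Because the test mean minimizes the squared deviation of the test targets, the training-mean variant can never be lower than the standard R^2. That is consistent with the small upward shift seen in each table in this notebook.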

### RidgeCV Model

The following range of alphas was chosen through parametric experiments so that the Ridge model is guaranteed to converge and the range covers the optimal alpha value.

In [14]:
r_alphas = np.logspace(1, 4, 1000)

In [15]:
# Cross-validate over our list of ridge alphas.
# alphas: array of candidate alpha values (regularization strengths) to try.
ridge_cv = RidgeCV(alphas=r_alphas, scoring='r2', cv=3)

In [16]:
ridge_cv.fit(X_scaled,y_train)

In [17]:
ridge_cv.score(X_scaled,y_train)

0.9152627156130047

In [18]:
ridge_cv.score(X_test,y_test)

0.7740652557376815

In [19]:
ridge_cv.alpha_

335.3710152002929
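One way to sanity-check that the grid really covers the optimum is to confirm that the selected alpha does not sit on either end of the grid. The helper below is a minimal sketch; `alpha_on_grid_edge` is a hypothetical name, not part of scikit-learn.

```python
import numpy as np


def alpha_on_grid_edge(alphas, best_alpha, rtol=1e-9):
    """Return True if best_alpha equals the smallest or largest value in the grid."""
    alphas = np.asarray(alphas, dtype=float)
    at_min = np.isclose(best_alpha, alphas.min(), rtol=rtol)
    at_max = np.isclose(best_alpha, alphas.max(), rtol=rtol)
    return bool(at_min or at_max)


# Same grid as r_alphas above; 335.3710152002929 is the value reported by ridge_cv.alpha_.
r_alphas_demo = np.logspace(1, 4, 1000)
ridge_on_edge = alpha_on_grid_edge(r_alphas_demo, 335.3710152002929)
print(ridge_on_edge)
```

If the check returned `True`, the optimum might lie outside the grid, and widening the range in that direction would be worth trying.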

In [20]:
y_predict=ridge_cv.predict(X_test)

In [21]:
RMSE=math.sqrt(np.mean((y_test-y_predict)**2))
R_2=1-RMSE**2/Null_MSE
metrics={'RMSE':round(RMSE),'R2':round(R_2,3)}

In [22]:
print(f'\nThe metrics for the RidgeCV Model are:\n')
print(tabulate([['RMSE', RMSE], ['R^2 (test)', round(R_2,3)], ['Score (train)',round(ridge_cv.score(X_scaled,y_train),3)]], headers=['Metric', 'Value'], tablefmt='orgtbl'))


The metrics for the RidgeCV Model are:

| Metric        |     Value |
|---------------+-----------|
| RMSE          | 41444.5   |
| R^2 (test)    |     0.775 |
| Score (train) |     0.915 |


Although the R^2 on the training dataset is high (0.915), the test R^2 drops to about 0.77. This gap of roughly 0.14 suggests that the model is somewhat overfit and may not generalize well to new data.

### LassoCV Model

From parametric experiments, alpha=10^2.1 is the smallest alpha value that guarantees convergence. Hence the list of alphas passed in here starts at that value and was narrowed over successive runs so that it covers the optimal alpha and approximates it more and more closely.
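A convergence threshold like this can be probed by fitting a plain `Lasso` at each candidate alpha and recording whether scikit-learn emits a `ConvergenceWarning`. The sketch below does this on synthetic data; the helper `lasso_converges`, the demo data, and the candidate grid are assumptions for illustration, not the experiments actually run for this project.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso


def lasso_converges(X, y, alpha, max_iter=50000):
    """Return True if a Lasso fit at this alpha finishes without a ConvergenceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure convergence warnings are always recorded inside this block.
        warnings.simplefilter("always", ConvergenceWarning)
        Lasso(alpha=alpha, max_iter=max_iter).fit(X, y)
    return not any(issubclass(w.category, ConvergenceWarning) for w in caught)


# Synthetic data, for illustration only (not the project data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=60)

# Scan a hypothetical grid and keep the alphas whose fit converges.
candidate_alphas = np.logspace(-3, 1, 9)
converged = [a for a in candidate_alphas if lasso_converges(X_demo, y_demo, a)]
print(f"Smallest converging alpha on this grid: {min(converged):.4g}")
```

On the real data, the same scan over a wider grid would show where the warnings stop, which is one way to arrive at a lower bound such as 10^2.1.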

In [23]:
l_alphas = np.logspace(2.1, 3, 1000)

In [24]:
lasso_cv = LassoCV(alphas=l_alphas, cv=3,max_iter=50000)

In [25]:
reg=lasso_cv.fit(X_scaled,y_train)

In [26]:
reg.score(X_scaled,y_train)

0.9045225189055355

In [27]:
reg.score(X_test,y_test)

0.8835005799832365

In [28]:
reg.alpha_

920.3731996618221

In [29]:
y_predict=reg.predict(X_test)

In [30]:
RMSE=math.sqrt(np.mean((y_test-y_predict)**2))
R_2=1-RMSE**2/Null_MSE
metrics={'RMSE':round(RMSE),'R2':round(R_2,3)}

In [31]:
print(f'\nThe metrics for the LassoCV Model are:\n')
print(tabulate([['RMSE', RMSE], ['R^2 (test)', round(R_2,3)], ['Score (train)',round(reg.score(X_scaled,y_train),3)]], headers=['Metric', 'Value'], tablefmt='orgtbl'))


The metrics for the LassoCV Model are:

| Metric        |     Value |
|---------------+-----------|
| RMSE          | 29760.3   |
| R^2 (test)    |     0.884 |
| Score (train) |     0.905 |


In [32]:
%store reg

Stored 'reg' (LassoCV)


The R^2 on the training dataset (0.905) and on the test dataset (0.884) are close, which suggests that this model is well balanced and generalizes well to new data.
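To put the three fitted models side by side, the gap between training and test R^2 can be computed directly from the tables above. This is a minimal sketch: the scores are copied from this notebook's printed output, while the dictionary layout and the helper name `generalization_gap` are assumptions for illustration.

```python
# Train and test R^2 values copied from the metric tables in this notebook.
scores = {
    "OLS": {"train": 0.949, "test": -66.656},
    "RidgeCV": {"train": 0.915, "test": 0.775},
    "LassoCV": {"train": 0.905, "test": 0.884},
}


def generalization_gap(train_r2, test_r2):
    """Difference between training and test R^2; smaller means less overfitting."""
    return round(train_r2 - test_r2, 3)


gaps = {name: generalization_gap(s["train"], s["test"]) for name, s in scores.items()}

# List the models from the smallest gap to the largest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:8s} gap = {gap}")
```

A small gap alone does not make a model good (the null model has no gap at all), so it should be read together with the test RMSE, where LassoCV also has the lowest value of the models compared here.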