# Week 10: Regularized Linear Models

## On Noise

$$Y \approx \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_NX_N$$

Linear Regression models the input-output relationship as a weighted sum of the predictors.  
However, the data is not perfect.  
There is necessarily error/noise present.  


**A Multiple Linear Regression Phenomenon**  
For a given training dataset, as more features are added to a model the training $R^2$ never decreases, and typically increases, even if the added feature is uninformative.  
At a certain point, adding new parameters fits the model to the noise inherent in the data.  
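
The effect can be reproduced on synthetic data. The sketch below is an illustration only: the data, the seed, and the helper `training_r2` are made up for this example and are not part of the meatspec analysis. It fits OLS with one informative predictor and then keeps appending columns of pure noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x = rng.normal(size=(n, 1))                 # one informative predictor
y = 2.0 * x[:, 0] + rng.normal(size=n)      # linear signal plus noise
noise = rng.normal(size=(n, 30))            # 30 predictors unrelated to y

def training_r2(X, y):
    """Fit OLS with an intercept and return R^2 on the same data."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

r2_by_k = []
for k in (0, 5, 10, 20, 30):
    X = np.column_stack([x, noise[:, :k]])  # informative column + k noise columns
    r2_by_k.append(training_r2(X, y))
    print(f"{k:2d} noise features -> training R^2 = {r2_by_k[-1]:.3f}")
```

The training $R^2$ climbs with every batch of noise columns, even though none of them carries information about `y`.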

## The Bias Variance Trade-off

<img src="https://miro.medium.com/max/1838/1*1BGl9kfU6nwO2QQ0-fWHcg.png" width="60%" style="margin-left:auto; margin-right:auto">



## Generalization Error

**Generalization Error** - a measure of how accurately a model can predict previously unseen data  

Comparing generalization error across models of different complexity indicates the optimal model complexity.

<img src="https://i.stack.imgur.com/0NbOY.png" width="80%" style="margin-left:auto; margin-right:auto">


<img src="https://miro.medium.com/max/875/0*XCe3mlLeGiUW3xfh" width="60%" style="margin-left:auto; margin-right:auto">

## Regularization: bringing to uniformity

**Regularized Linear Models**  

* Regularize a model to reduce overfitting: constrain it somehow
* For Linear Regression this means: constrain the weights (parameters) of the model. 
* This is usually implemented by adding a regularization term to the cost function

Today we will survey two regularization methods for linear models:  

1. Ridge Regression
2. Lasso Regression

## Meatspec Dataset

Since determining the fat content via analytical chemistry is time consuming (and expensive), a company would like to build a model to predict the fat content of new samples using absorbance spectra data (which can be measured more easily and cheaply).

**The Predictors** - 100 channels, each measuring a different near-infrared absorbance.  
**The Target** - a measure of fat content  
**The Shape** - (215, 101): 215 records from meats with known fat content, each with data from 100 channels plus the fat target

In [None]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
# Bring the data into the environment
url = 'https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/datasets/meatspec.csv'
df = pd.read_csv( url, index_col=0 )
print( df.shape )

X = df.drop(['fat'], axis=1)
y = df['fat']

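# standardize the predictors (zero mean, unit variance)
# note: the scaler is fit on all rows before the split, so test-set statistics influence the scaling
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
X_scaled = pd.DataFrame(data = X_scaled, columns = X.columns)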

X_scaled_train, X_scaled_test, y_train, y_test = train_test_split( X_scaled, y, test_size=0.20, random_state=42)

## Use `sklearn` to build a 'kitchen sink' MLR

We will use this both to see how MLR is done with `sklearn` and as a baseline to compare against the regularized models.

`LinearRegression()` implements ordinary least squares (OLS). OLS is ideal when the underlying relationship is linear and we have n>>p (n observations, p predictors). But if n is not much larger than p, or if p>n (where OLS has no unique solution), there can be a lot of variability in the fit, which can result in overfitting and very poor predictive ability.
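
To make the p>n case concrete, here is a small sketch on synthetic data. The sizes, the seed, and the helper `make_data` are assumptions for illustration. With more predictors than training rows, least squares can reproduce the training targets exactly while predicting new rows poorly.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, p = 20, 200, 40            # more predictors than training rows
true_beta = np.zeros(p)
true_beta[:3] = [1.5, -2.0, 0.5]            # only three predictors matter

def make_data(n):
    """Draw n rows of Gaussian predictors and a noisy linear target."""
    X = rng.normal(size=(n, p))
    y = X @ true_beta + rng.normal(size=n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

# lstsq returns the minimum-norm solution when the system is underdetermined
beta_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

train_mse = np.mean((X_tr @ beta_hat - y_tr) ** 2)
test_mse = np.mean((X_te @ beta_hat - y_te) ** 2)
print(f"train MSE = {train_mse:.2e}, test MSE = {test_mse:.2f}")
```

The training error is zero up to floating-point precision, which says nothing about how the model behaves on unseen rows.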

In [None]:
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# instantiate a Linear Regression Model
lin_mod = LinearRegression()
# fit the model to the training data
lin_mod.fit( X_scaled_train, y_train )
# print the model intercept & coefficients
print( lin_mod.intercept_, lin_mod.coef_ )
# print the training R2 score
score=r2_score(y_train,lin_mod.predict(X_scaled_train))
print( 'r2 Training score is ', score )


<img src="https://raw.githubusercontent.com/SmilodonCub/DS4VS/master/Week10/overfitting.jpeg" width="60%" style="margin-left:auto; margin-right:auto">

### Evaluate the Model Performance on unseen data...

In [None]:
# use the model to make predictions on the test dataset
y_prediction = lin_mod.predict(X_scaled_test)
# compute the R2 score on the test set
score=r2_score(y_test,y_prediction)
print( 'Test data')
print('r2 score is ',score)
print('mean squared error is ', mean_squared_error(y_test,y_prediction))
print('root mean squared error is ', np.sqrt(mean_squared_error(y_test,y_prediction)))

## 1) Ridge Regression

**Ridge Regression**  

- add a term to the cost function that forces the model to keep the model weights as small as possible. 
- **Cost Function** $J(\theta) = \mbox{MSE}(\theta) + \alpha \frac{1}{2}\sum_{i=1}^n \theta_i^2$
- the penalty is half the square of the $l_2$ norm of the weight vector
- **$\alpha$** - a hyperparameter that controls the strength of the regularization
    * $\alpha = 0$: equivalent to plain MLR
    * large $\alpha$: the weights are pushed close to zero, so predictions flatten toward the bias (intercept) term 
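
The role of $\alpha$ can be seen with a closed-form sketch in NumPy. It rests on two assumptions: the data are synthetic, and the penalty is scaled as in the objective documented for `sklearn`'s `Ridge` (sum of squared errors plus $\alpha$ times the squared $l_2$ norm), so the $\alpha$ values are not directly comparable to the MSE-based formula above. The helper `ridge_weights` is a made-up name.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(size=n)

# center predictors and target so the intercept (bias) is left unpenalized
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_weights(Xc, yc, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for the ridge weights."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)

alphas = [0.0, 1.0, 10.0, 100.0, 1e6]
norms = [np.linalg.norm(ridge_weights(Xc, yc, a)) for a in alphas]
for a, nrm in zip(alphas, norms):
    print(f"alpha = {a:>9.1f} -> ||w||_2 = {nrm:.4f}")
```

With `alpha = 0` this is the ordinary least squares solution; as `alpha` grows, the weight vector shrinks toward zero and the fitted values approach the mean of `y`.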


In [None]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X_scaled_train, y_train)
print( ridge_reg.intercept_, ridge_reg.coef_ )

In [None]:
# use the model to make predictions on the test dataset
ridgey_prediction = ridge_reg.predict( X_scaled_test )
# compute the R2 score on the test set
score=r2_score(y_test,ridgey_prediction)
print('r2 score is ',score)
print('mean squared error is ', mean_squared_error(y_test,ridgey_prediction))
print('root mean squared error is ', np.sqrt(mean_squared_error(y_test,ridgey_prediction)))

### HHhhmmmmmm

That $R^2$ score is an improvement. However, we picked $\alpha$ arbitrarily.

### Hyperparameter $\alpha$

We could just try $\alpha$ values by hand until we get a good result, but that would be inefficient and prone to bias.  
`scikit-learn` comes to the rescue with `GridSearchCV`.  

[`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) will perform a cross-validated sweep of a parameter space to find the best value for $\alpha$  
How convenient!
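
To show what a cross-validated sweep does under the hood, here is a hand-rolled sketch in NumPy. It is not how `GridSearchCV` is implemented: the synthetic data, the shuffled folds, the candidate grid, and the helper `ridge_fit` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 100, 8, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge_fit(X, y, alpha):
    """Ridge weights without an intercept (the toy data are roughly centered)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

folds = np.array_split(rng.permutation(n), k)    # k disjoint validation index sets
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_mse = {}
for alpha in alphas:
    fold_mse = []
    for i in range(k):
        val = folds[i]                           # held-out fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], alpha)
        fold_mse.append(np.mean((X[val] @ w - y[val]) ** 2))
    cv_mse[alpha] = np.mean(fold_mse)            # average over the k folds

best_alpha = min(cv_mse, key=cv_mse.get)
print(cv_mse, "best alpha:", best_alpha)
```

Each candidate $\alpha$ is scored only on rows the model did not see during fitting, so the test set stays untouched until the final evaluation.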

In [None]:
from sklearn.model_selection import GridSearchCV

model = Ridge()
# define the parameter space
#parameters = {'alpha':[1, 2, 3, 4, 5, 10, 15, 20, 25, 50, 75, 100, 250, 500, 1000]}
parameters = {'alpha':list(np.linspace(0.00001,.001, 1001))}
# define the grid search
Gridge_reg= GridSearchCV(model, parameters, scoring='neg_mean_squared_error',cv=5)
#fit the grid search
Gridge_reg.fit(X_scaled_train,y_train)
# best estimator
print(Gridge_reg.best_estimator_)

### Ridge Regression: Best Gridsearch Model

Let's use our best $\alpha$

In [None]:
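# best model; with the default refit=True, GridSearchCV has already fit it on the
# full training set, so the fit below is redundant but harmless
best_Gridge_mod = Gridge_reg.best_estimator_
best_Gridge_mod.fit(X_scaled_train,y_train)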
print( best_Gridge_mod.intercept_, best_Gridge_mod.coef_ )

In [None]:
best_Gridge_prediction = best_Gridge_mod.predict( X_scaled_test )
score=r2_score(y_test,best_Gridge_prediction)
print('r2 score is ',score)
print('mean squared error is ', mean_squared_error(y_test,best_Gridge_prediction))
print('root mean squared error is ', np.sqrt(mean_squared_error(y_test,best_Gridge_prediction)))

## 2) Lasso Regression

**Least Absolute Shrinkage and Selection Operator Regression** - Similar to Ridge, but adds the $l_1$ norm to the cost function  

- **Cost Function** $J(\theta) = \mbox{MSE}(\theta) + \alpha \sum_{i=1}^n |\theta_i|$
- tends to set the weights of the least important features to exactly zero, which acts as automatic feature selection
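
Why the $l_1$ penalty produces exact zeros is easiest to see in a simplified setting. The sketch below assumes an orthonormal design ($X^TX = I$) and writes the squared error as $\frac{1}{2}\|y - X\theta\|^2$ rather than the MSE. Under those assumptions the Lasso solution is a soft-thresholding of the OLS weights, while Ridge (with the $\frac{\alpha}{2}$ penalty) only rescales them. The weight values and function names are hypothetical.

```python
import numpy as np

def soft_threshold(w, alpha):
    """Lasso solution for an orthonormal design: shrink by alpha, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - alpha, 0.0)

def ridge_shrink(w, alpha):
    """Ridge solution for an orthonormal design: uniform rescaling."""
    return w / (1.0 + alpha)

w_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])   # hypothetical OLS weights
alpha = 0.5
w_lasso = soft_threshold(w_ols, alpha)
w_ridge = ridge_shrink(w_ols, alpha)
print("lasso:", w_lasso)
print("ridge:", w_ridge)
```

Every weight whose magnitude is below `alpha` is set to zero by the Lasso rule, whereas the Ridge rule leaves all five weights nonzero.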

In [None]:
from sklearn.linear_model import Lasso

lasso_reg = Lasso(tol=1e-2, max_iter=100000 )
parameters = {'alpha':[0.0005, 0.001, 0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 1, 2, 3, 4, 5, 10, 15, 20]}
# define the grid search
Glasso_reg= GridSearchCV(lasso_reg, parameters, scoring='neg_mean_squared_error',cv=5)
#fit the grid search
Glasso_reg.fit(X_scaled_train,y_train)
# best estimator
print(Glasso_reg.best_estimator_)

### Lasso Regression: Best Gridsearch Model

Let's use our best $\alpha$

In [None]:
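# best model; with the default refit=True, GridSearchCV has already fit it on the
# full training set, so the fit below is redundant but harmless
best_Lasso_mod = Glasso_reg.best_estimator_
best_Lasso_mod.fit(X_scaled_train,y_train)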
print( best_Lasso_mod.intercept_, best_Lasso_mod.coef_ )

In [None]:
best_Lasso_prediction = best_Lasso_mod.predict( X_scaled_test )
score=r2_score(y_test,best_Lasso_prediction)
print('r2 score is ',score)
print('mean squared error is ', mean_squared_error(y_test,best_Lasso_prediction))
print('root mean squared error is ', np.sqrt(mean_squared_error(y_test,best_Lasso_prediction)))

## Next week: Supervised Learning techniques for Categorical Target Variables
<img src="https://content.techgig.com/photo/80071467/pros-and-cons-of-python-programming-language-that-every-learner-must-know.jpg?132269" width="100%" style="margin-left:auto; margin-right:auto">