In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LinearRegression

#### Imports for Regularization

In [None]:
from sklearn.linear_model import Ridge, RidgeCV 
from sklearn.linear_model import Lasso,LassoCV
from sklearn.linear_model import ElasticNet,ElasticNetCV

## Regularization

* Regularization methods are intended to prevent or reduce overfitting thus improving generalization (i.e. improve performance on test set)

* The methods described here penalize the model by adding additional constraints to the parameter optimization:
    - Ridge Regression
    - Lasso Regression
    - Elastic Net Regression

* These regularization methods add a penalty term to the Ordinary Least Squares linear regression cost function that constrains (i.e. shrinks) the coefficients.
    -  Called Shrinkage Methods

#### Why favor smaller coefficients

* The larger the magnitude of a coefficient, the greater the change in the response for a given change in that predictor (i.e. greater variability)
* Do not want a lot of weight on any single predictor
    - With different training data, the response could vary greatly with just a slight change in this one predictor.
        - The response would be overly sensitive to this predictor
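
The sensitivity argument can be illustrated with a small simulation. This is a minimal sketch on synthetic data (not the Hitters data): two nearly collinear predictors are redrawn many times, and the spread of the fitted coefficients is compared for OLS and Ridge. The data-generating function `make_data` and the choice `alpha=1.0` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

def make_data(n=60):
    # Two nearly collinear predictors; the response depends on their sum
    x1 = rng.normal(size=n)
    x2 = x1 + 0.05 * rng.normal(size=n)
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(size=n)
    return X, y

ols_coefs, ridge_coefs = [], []
for _ in range(200):
    # Each iteration plays the role of "a different training set"
    X_sim, y_sim = make_data()
    ols_coefs.append(LinearRegression().fit(X_sim, y_sim).coef_)
    ridge_coefs.append(Ridge(alpha=1.0).fit(X_sim, y_sim).coef_)

# Standard deviation of each coefficient across the simulated training sets
ols_sd = np.std(ols_coefs, axis=0)
ridge_sd = np.std(ridge_coefs, axis=0)
print("OLS coefficient std:  ", ols_sd)
print("Ridge coefficient std:", ridge_sd)
```

In this setup the OLS coefficients swing widely from one training set to the next, while the penalized coefficients stay much more stable.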

### Ridge Regression

* In Ordinary Least Squares regression we estimate the coefficients $\beta$:

<div style="font-size: 110%;">
$$ \hat{\beta} =   \underset{\beta}{\mathrm{argmin}}\sum_i{(y_i - X_i^T\beta)^2}$$
</div>

* Under the Gauss-Markov assumptions, this produces the vector of coefficients $\hat{\beta}$ that is the **Best Linear Unbiased Estimator (BLUE)**
    - Of all the linear unbiased estimators (that is, the ones for which $E[\hat{\beta}] = \beta$, the population coefficients) it has the lowest variance
        - This is a theorem: Gauss-Markov theorem
    - But this may not have the best test set performance because of the variance

* Ridge Regression adds a penalty proportional to the sum of the squared coefficients ($\sum_j\beta_j^2$) to the OLS error
    - L2 regularization
    - This shrinks the coefficients toward zero but not to zero
    
<div style="font-size: 110%;">
$$\hat{\beta} =  \underset{\beta}{\mathrm{argmin}}\sum_i{(y_i - X_i^T\beta)^2} + \lambda \sum_{j=1}^p\beta_j^2$$
</div>

* By shrinking the coefficients, Ridge Regression constrains the model to fewer choices thus increasing the bias by a little, but the variance is greatly reduced due to the decrease in complexity

#### $\lambda$ Parameter

* Controls amount of regularization

* $\lambda$ = 0: same as OLS Regression

* $\lambda$ >> 0: drives all the coefficients to very close to zero 
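
The two limiting cases can be checked numerically. The sketch below uses synthetic data and a hypothetical helper, `ridge_closed_form`, that solves the penalized problem without an intercept; it is an illustration under those assumptions, not part of the Hitters analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 5
X_sim = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 1.5, 0.5])
y_sim = X_sim @ beta_true + rng.normal(size=n)

def ridge_closed_form(X, y, lam):
    # Minimizer of ||y - X b||^2 + lam * ||b||^2 (no intercept term)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Ordinary least squares solution for comparison
beta_ols = np.linalg.lstsq(X_sim, y_sim, rcond=None)[0]

lambdas = [0.0, 1.0, 10.0, 100.0, 1e6]
norms = [np.linalg.norm(ridge_closed_form(X_sim, y_sim, lam)) for lam in lambdas]
for lam, nm in zip(lambdas, norms):
    print(f"lambda={lam:>9}: ||beta||_2 = {nm:.4f}")
```

With $\lambda = 0$ the helper reproduces the OLS coefficients, and the coefficient norm decreases steadily as $\lambda$ grows.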



### Ridge Regression in sklearn

#### Ridge Class

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

* **The alpha keyword parameter is the $\lambda$ hyperparameter**
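
As a quick sanity check of this correspondence, the sketch below compares `Ridge(alpha=lam)` with a hand-computed penalized solution on synthetic data. It assumes the default `fit_intercept=True`, in which case the intercept is not penalized and the coefficients match the closed form computed on centered data.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X_sim = rng.normal(size=(100, 4))
y_sim = X_sim @ np.array([2.0, -1.0, 0.5, 0.0]) + 3.0 + rng.normal(size=100)

lam = 10.0
ridge_model = Ridge(alpha=lam).fit(X_sim, y_sim)

# Center X and y, which is equivalent to fitting an unpenalized intercept
Xc = X_sim - X_sim.mean(axis=0)
yc = y_sim - y_sim.mean()
beta_manual = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X_sim.shape[1]), Xc.T @ yc)

print("sklearn:", ridge_model.coef_)
print("manual: ", beta_manual)
```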


### Model for Hitters data

* Dependent variable: Salary
* 16 numerical, 3 categorical predictors
* This dataset has null values

In [None]:
df = pd.read_csv("Hitters.csv")
df.tail()

In [None]:
# Count missing values per column
df.isnull().sum()

#### Drop missing values and use only numerical values

In [None]:
df = df.dropna()
df = df.drop(['League','Division','NewLeague'],axis = 1)
df.tail()

#### Create arrays and split data

In [None]:
X = df.iloc[:,0:16].values
y = df.iloc[:,16].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y,  test_size = 0.25, random_state = 1234)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
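
The penalty treats every coefficient the same way, so the predictors need to be on comparable scales before fitting; that is why the scaler is fit on the training rows only and then applied to the test rows. One way to bundle these two steps is a scikit-learn pipeline. The sketch below uses synthetic data as a stand-in for the real predictors, and the names `X_demo`, `y_demo`, and `pipe` are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic predictors on very different scales
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([1.0, 0.02, 50.0]) + rng.normal(size=200)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1234)

# The scaler is fit inside the pipeline, using only the rows passed to fit()
pipe = make_pipeline(StandardScaler(), Ridge(alpha=10))
pipe.fit(Xtr, ytr)
print("Test R-squared:", pipe.score(Xte, yte))

scaler = pipe.named_steps["standardscaler"]
print("Means learned from training rows only:", scaler.mean_)
```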

#### Coefficients, $R^2$, and Mean Square Error of Ridge regression model

In [None]:
model_r = Ridge(alpha = 10)
model_r.fit(X_train,y_train)
coefs_r = model_r.coef_
print('R-Squared: ',model_r.score(X_train,y_train))
yhat = model_r.predict(X_test)
mse = np.mean((y_test - yhat)**2)
print('MSE: ',mse)
print(coefs_r)


#### Coefficients, $R^2$, and Mean Square Error of OLS regression model

In [None]:
model_l = LinearRegression()
model_l.fit(X_train,y_train)
coefs_l = model_l.coef_
print('R-Squared: ',model_l.score(X_train,y_train))
yhat = model_l.predict(X_test)
mse = np.mean((y_test - yhat)**2)
print('MSE: ',mse)
print(coefs_l)
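
The two cells above print $R^2$ on the training data and MSE on the test data. For a like-for-like comparison it can help to report both metrics on the same split. The helper `report` below is hypothetical and the data are synthetic; it is only a sketch of one way to organize the comparison.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def report(name, model, X_tr, y_tr, X_te, y_te):
    """Hypothetical helper: fit a model and report train/test metrics."""
    model.fit(X_tr, y_tr)
    yhat_te = model.predict(X_te)
    out = {
        "train_r2": r2_score(y_tr, model.predict(X_tr)),
        "test_r2": r2_score(y_te, yhat_te),
        "test_mse": mean_squared_error(y_te, yhat_te),
    }
    print(name, {k: round(v, 3) for k, v in out.items()})
    return out

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(150, 6))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0, 0.0, 0.5]) + rng.normal(size=150)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

ols_out = report("OLS  ", LinearRegression(), X_tr, y_tr, X_te, y_te)
ridge_out = report("Ridge", Ridge(alpha=10), X_tr, y_tr, X_te, y_te)
```

OLS always has the higher training $R^2$ because it minimizes the training error directly; the test metrics are where the effect of the penalty shows up.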

#### Show shrinkage of coefficients

In [None]:
p1 = plt.plot(range(len(coefs_l)),coefs_l,'r') 
p2 = plt.plot(range(len(coefs_r)),coefs_r,'b') 
plt.title("Shrinkage")
plt.ylabel("Coefficient")
plt.legend((p1[0],p2[0]),("Linear","Ridge"));

#### Check that no coefficients shrunk to zero

In [None]:
print(f'Number of coefficients ridge regression shrunk to 0: {np.sum(coefs_r == 0)}')

#### Ridge Cross Validation: RidgeCV Class

* Cross Validation for alpha

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html

In [None]:
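# Note: newer scikit-learn releases rename store_cv_values to store_cv_results
model = RidgeCV(store_cv_values = True)
model.fit(X_train,y_train)
coefs_r = model.coef_
print('R-Squared: ',model.score(X_train,y_train))
print('Alpha: ', model.alpha_)

In [None]:
# Leave-one-out squared errors of the first training sample, one per candidate alpha
# (the attribute is named cv_results_ in newer scikit-learn releases)
print(model.cv_values_[0])
model.get_params()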
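
By default `RidgeCV` only tries the alphas `(0.1, 1.0, 10.0)`, so the selected value is limited to that short list. A wider grid can be passed through the `alphas` parameter. The sketch below uses synthetic data, and the particular log-spaced grid is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(5)
X_demo = rng.normal(size=(120, 6))
y_demo = X_demo @ np.array([4.0, 0.0, -3.0, 0.0, 2.0, 0.0]) + rng.normal(size=120)

# Candidate penalties from 0.001 to 1000, evenly spaced on a log scale
alphas = np.logspace(-3, 3, 13)
model_cv = RidgeCV(alphas=alphas).fit(X_demo, y_demo)
print("Chosen alpha:", model_cv.alpha_)
```

If the chosen alpha sits at either end of the grid, that is usually a hint to extend the grid in that direction.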

### Lasso Regression

* Similar to Ridge Regression

* It adds a penalty proportional to the sum of the absolute values of the coefficients to the OLS error
    - L1 regularization

* This can shrink the coefficients to zero 

* Like Ridge Regression, it increases bias a little, but it decreases variance by decreasing the number of predictors

* A form of model selection because it eliminates predictors

<div style="font-size: 110%;">
$$\hat{\beta} =  \underset{\beta}{\mathrm{argmin}}\sum_i{(y_i - X_i^T\beta)^2} + \lambda \sum_{j=1}^p|\beta_j|$$
</div>

#### Lasso Regression in sklearn

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
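
One detail worth keeping in mind: scikit-learn's `Lasso` minimizes $\frac{1}{2n}\sum_i{(y_i - X_i^T\beta)^2} + \alpha\sum_j|\beta_j|$, so its `alpha` corresponds to the $\lambda$ in the formula above only up to the factor $2n$. The sketch below, on synthetic data, shows how the number of exactly-zero coefficients changes with `alpha`. The quantity `alpha_max` is an illustrative name for the smallest penalty at which every coefficient is zero under this objective.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 100, 8
X_demo = rng.normal(size=(n, p))
beta_true = np.array([5.0, -4.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ beta_true + rng.normal(size=n)

# With an intercept, the fit is equivalent to working with centered X and y
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / n

zero_counts = {}
for alpha in [0.01, 0.5, 1.01 * alpha_max]:
    coefs = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_
    zero_counts[alpha] = int(np.sum(coefs == 0))
    print(f"alpha={alpha:.3f}: {zero_counts[alpha]} of {p} coefficients are exactly zero")
```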

#### Coefficients, $R^2$, and Mean Square Error of Lasso regression model

In [None]:
model = Lasso(alpha = 10.0)
model.fit(X_train,y_train)
coefs_lasso = model.coef_
print('R-Squared: ',model.score(X_train,y_train))
yhat = model.predict(X_test)
mse = np.mean((y_test - yhat)**2)
print('MSE: ',mse)
plt.plot(range(len(coefs_lasso)),coefs_lasso,'b')   
print(f'Number of coefficients lasso regression shrunk to 0: {np.sum(coefs_lasso == 0)}')

In [None]:
for n,v in zip(df.columns,coefs_lasso):
    if v != 0: print(n,round(v,3))

#### Lasso Cross Validation, sklearn LassoCV class

In [None]:
from sklearn.linear_model import LassoCV
model = LassoCV()
# Fit on the standardized training split so the score below uses matching data
model.fit(X_train,y_train)
coefs_lasso = model.coef_
print('R-Squared: ',model.score(X_train,y_train))
plt.plot(range(len(coefs_lasso)),coefs_lasso,'b')   
print(f'Number of coefficients lasso regression shrunk to 0: {np.sum(coefs_lasso == 0)}')
print('Alpha: ', model.alpha_)

In [None]:
for n,v in zip(df.columns,coefs_lasso):
    if v != 0: print(n,round(v,3))

In [None]:
np.corrcoef(df.loc[:,'Years'],df.loc[:,'Salary'])[0,1]

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=df.loc[:,'Years'], y=df.loc[:,'Salary']);

### Ridge versus Lasso

#### When to use

* In general you use the shrinkage methods when you have many predictors
    - At least 3
* When you know that each predictor contributes something then use Ridge Regression
* When you suspect that some of the predictors are useless then use Lasso Regression

#### Constraint formulation

* Ridge
<div style="font-size: 110%;">
$$\underset{\beta}{\mathrm{min}}\sum_i{(y_i - X_i^T\beta)^2} \text{ subject to } ||\beta||_2^2 \le c^2$$
</div>
* Lasso
<div style="font-size: 110%;">
$$\underset{\beta}{\mathrm{min}}\sum_i{(y_i - X_i^T\beta)^2} \text{ subject to } ||\beta||_1 \le c$$
</div>

#### Constrained Optimization: Lagrange Multipliers
* Ridge
<div style="font-size: 110%;">
$$ L(\beta,\lambda) =  \sum_i{(y_i - X_i^T\beta)^2} + \lambda(||\beta||_2^2 - c^2)$$
$$ = ||y-X\beta||_2^2 + \lambda||\beta||_2^2$$
</div>
* Lasso
<div style="font-size: 110%;">
$$ L(\beta,\lambda) =  \sum_i{(y_i - X_i^T\beta)^2} + \lambda(||\beta||_1 - c)$$
$$ = ||y-X\beta||_2^2 + \lambda||\beta||_1$$
</div>

#### Geometric Interpretation

![](GeoIntrp.png)
$$\text{Figure 1. Left: Ridge, Right: Lasso}$$

Figure source: http://www.astroml.org/book_figures/chapter8/fig_lasso_ridge.html

#### Solution
* OLS solution to unconstrained problem: $(X^TX)^{-1}X^Ty$

* Ridge solution to constrained problem:
$$(X^TX + \lambda I)^{-1}X^Ty$$

* Lasso solution to constrained problem: no closed-form solution; it is solved numerically (scikit-learn uses coordinate descent)
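
To make the numerical solution concrete, here is a minimal coordinate descent sketch for the Lasso. It assumes scikit-learn's scaling of the objective, $\frac{1}{2n}||y-X\beta||_2^2 + \alpha||\beta||_1$, and no intercept. The functions `soft_threshold` and `lasso_cd` are illustrative helpers written for this example, not library functions, and a fixed number of sweeps stands in for a proper convergence check.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, t):
    # Shrink z toward zero by t, and set it to zero if |z| <= t
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=500):
    # Minimizes (1/(2n)) * ||y - X w||^2 + alpha * ||w||_1 (no intercept)
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution removed
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j / n
            # One-dimensional lasso problem in w[j] has this closed form
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

rng = np.random.default_rng(7)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(size=100)

w_cd = lasso_cd(X_demo, y_demo, alpha=0.5)
w_sk = Lasso(alpha=0.5, fit_intercept=False, tol=1e-10, max_iter=100000).fit(X_demo, y_demo).coef_
print("coordinate descent:", w_cd)
print("sklearn Lasso:     ", w_sk)
```

The soft-thresholding step is what produces exact zeros, which is the algebraic counterpart of the corners in the geometric picture.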




### Elastic Net Regression

* Combines Ridge and Lasso regression
* This combination allows:
    - For learning a sparse model where few of the weights are non-zero (like Lasso)
    - While still maintaining the regularization properties of Ridge. 
* Use when you have a lot of variables and you don't know how each contributes or even if they are useful.

* Two $\lambda$s, $\lambda_1$ for Lasso and $\lambda_2$ for Ridge

<div style="font-size: 110%;">
$$\hat{\beta} =  \underset{\beta}{\mathrm{argmin}}\sum_i{(y_i - X_i^T\beta)^2} + \lambda_1 \sum_{j=1}^p|\beta_j| + \lambda_2 \sum_{j=1}^p\beta_j^2$$
</div>

* Use Cross Validation to determine the best $\lambda_1$ and $\lambda_2$

#### sklearn Elastic Net Regression

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html

* $\alpha$: the penalty weighting (i.e. $\lambda$)

* l1_ratio: the ElasticNet mixing parameter. $ 0 \le \text{l1_ratio} \le 1$
    - It determines the scaling of $\lambda_1$ and $\lambda_2$
    - l1_ratio = 0: the penalty is a L2 penalty
    - l1_ratio = 1: the penalty is a L1 penalty
    - For 0 < l1_ratio < 1 it is a combination of L1 and L2
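
The role of `l1_ratio` can be checked directly. The sketch below, on synthetic data, confirms that `l1_ratio = 1` reproduces the Lasso fit and prints the L1 and L2 weights that scikit-learn documents for intermediate values (`alpha * l1_ratio` and `0.5 * alpha * (1 - l1_ratio)`). The value `alpha = 0.3` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(8)
X_demo = rng.normal(size=(100, 6))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.0]) + rng.normal(size=100)

alpha = 0.3
lasso_fit = Lasso(alpha=alpha).fit(X_demo, y_demo)
enet_l1 = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X_demo, y_demo)
print("l1_ratio=1 matches Lasso:", np.allclose(lasso_fit.coef_, enet_l1.coef_))

for l1_ratio in [0.1, 0.5, 0.9]:
    l1_weight = alpha * l1_ratio
    l2_weight = 0.5 * alpha * (1 - l1_ratio)
    coefs = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X_demo, y_demo).coef_
    print(f"l1_ratio={l1_ratio}: L1 weight={l1_weight:.3f}, "
          f"L2 weight={l2_weight:.3f}, zeros={np.sum(coefs == 0)}")
```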

In [None]:
model = ElasticNet(l1_ratio = 0.5)
model.fit(X_train,y_train)
coefs_EN = model.coef_
print('R-Squared: ',model.score(X_train,y_train))
yhat = model.predict(X_test)
mse = np.mean((y_test - yhat)**2)
print('MSE: ',mse)
plt.plot(range(len(coefs_EN)),coefs_EN,'b')   
print(f'Number of coefficients ElasticNet regression shrunk to 0: {np.sum(coefs_EN == 0)}')

In [None]:
for n,v in zip(df.columns,coefs_EN):
    if v != 0: print(n,round(v,3))

#### sklearn Elastic Net Regression Cross Validation

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNetCV.html

* l1_ratio:
    - When a number:
        * Weight for L1 constraint is alpha*l1_ratio and weight for L2 term is 0.5*alpha*(1 - l1_ratio)
    - When a list, the different values are tested by cross-validation
        - The scikit-learn documentation notes that a good choice of values for l1_ratio is often to put more values close to 1 (i.e. Lasso) and fewer close to 0 (i.e. Ridge)
* $\alpha$: the penalty weighting

In [None]:
from sklearn.linear_model import ElasticNetCV
model = ElasticNetCV(l1_ratio = [.1, .5, .7, .9, .95, .99, 1])
model.fit(X_train,y_train)
coefs_EN = model.coef_
plt.plot(range(len(coefs_EN)),coefs_EN,'b')   
print(f'Number of coefficients ElasticNet regression shrunk to 0: {np.sum(coefs_EN == 0)}')

In [None]:
model.alpha_

In [None]:
model.l1_ratio_

In [None]:
for n,v in zip(df.columns,coefs_EN):
    if v != 0: print(n,round(v,3))

### Exercises

In [None]:
college = pd.read_csv("College2.csv",index_col = 0)
college.head()

Using the College data (loaded above from College2.csv), which contains:

A data frame with 777 observations on the following 18 variables.  

Private: No or Yes indicating private or public university  
Apps: Number of applications received  
Accept: Number of applications accepted  
Enroll: Number of new students enrolled  
Top10perc: Pct. new students from top 10% of H.S. class  
Top25perc: Pct. new students from top 25% of H.S. class  
F.Undergrad: Number of fulltime undergraduates  
P.Undergrad: Number of parttime undergraduates  
Outstate: Out-of-state tuition  
Room.Board: Room and board costs  
Books: Estimated book costs  
Personal: Estimated personal spending  
PhD: Pct. of faculty with Ph.D.'s  
Terminal: Pct. of faculty with terminal degree  
S.F.Ratio: Student/faculty ratio  
perc.alumni: Pct. alumni who donate  
Expend: Instructional expenditure per student  
Grad.Rate: Graduation rate  

In [None]:
idx = list(range(len(college.columns)))
idx.remove(1)
X = college.iloc[:, idx].values
y = college.iloc[:, 1].values # number of applications

labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)


sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


In [None]:
college.loc['Rensselaer Polytechnic Institute']

### 1. Ridge Regression

Predict Apps (number of applications received) with a Ridge regression model using RidgeCV.  
Output $R^2$ and MSE for the test data and the best alpha.

In [None]:
# Your code here

### 2. Lasso Regression

Predict Apps (number of applications received) with a Lasso regression model using LassoCV. Output $R^2$ and MSE for the test data and the best alpha.

In [None]:
# Your code here

### 3. Recursive Feature Elimination

Use RFECV on a Lasso model instance to determine the features to eliminate. 

In [None]:
# Your code here

### 4. Use RFECV support_

Predict Apps with a Linear regression model using the variables selected by the RFECV object. Output $R^2$ and MSE for the test data. 


In [None]:
# Your code here

### 5. Elastic Net Regression

Fit an ElasticNet Cross-Validated model to the training data.  
Plot the coefficients.  
Print the number of coefficients shrunk to zero.

In [None]:
# Your code here