# Regularization
Regularization is a technique for reducing overfitting in a machine learning model.

If a model learns too much while being fit, it also learns the "background noise" (such as outliers) in the training data. The model then overfits to these unnecessary data points, which hurts generalization.

Overfitting often arises from having too few data points or from highly correlated independent variables (high multicollinearity).

We can reduce this type of problem by adding a penalty to the model, whose strength is controlled by a regularization parameter.

# Ridge Regression
If we use a simple linear regression model, the model uses the Ordinary Least Squares method to determine the line of best fit.

However, what if the data contain some background noise, or there are too few data points, so that the model fits the training data too closely?

### Simple Linear vs Ridge Regression
The Simple Linear Regression's formula is: ```y = b0 + b1*x```.

Simple Linear Regression chooses the values of b0 and b1 that give the smallest sum of ```squared residuals```.

Ridge Regression instead chooses the values of b0 and b1 that give the smallest value of:  
```squared residuals + (lambda * b1^2)```
- b1^2 adds a penalty to the traditional Least Squares method
- lambda determines how severe the b1^2 penalty is
    - The higher the lambda, the higher the penalty for a given b1, so the model prefers a smaller b1 even if the squared residuals grow slightly
    
Lambda is considered a "regularization" parameter, which reduces overfitting.

The ```lambda * b1^2``` penalty is called the L2 (Ridge) penalty.
- The L2 penalty uses the sum of squared coefficients
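As a minimal sketch of the formula above, the hypothetical helper `ridge_cost` below computes the squared residuals plus the L2 penalty for a given line. The two data points and lambda = 1 are assumptions chosen only for illustration, and the intercept b0 is left unpenalized.

```python
def ridge_cost(xs, ys, b0, b1, lam):
    """Squared residuals plus the L2 penalty lam * b1^2 (b0 is not penalized)."""
    squared_residuals = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return squared_residuals + lam * b1 ** 2

# Two illustrative training points: (0, 1) and (5, 2).
xs = [0.0, 5.0]
ys = [1.0, 2.0]

# The Least Squares line y = 1 + 0.2x passes through both points exactly,
# so its cost consists of the penalty alone.
exact_fit_cost = ridge_cost(xs, ys, b0=1.0, b1=0.2, lam=1.0)

# A slightly flatter line has nonzero residuals but a smaller penalty.
flatter_cost = ridge_cost(xs, ys, b0=28 / 27, b1=5 / 27, lam=1.0)

print(exact_fit_cost, flatter_cost)
```

With lambda = 1, the flatter line has the lower total cost, so Ridge Regression would prefer it over the exact fit.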

#### Ridge Regression for Multi-Variate Linear Regression
If we were using a multi-variate linear regression model, then each b-coefficient would be used in the Ridge Regression's penalized squared residuals formula:  
```squared residuals + [lambda * (b1^2 + b2^2 + b3^2 + ... bn^2)]```

When independent variables are highly correlated with one another, Least Squares can assign them large b-coefficients that partly cancel each other out.

Ridge Regression is very helpful for handling highly correlated independent variables because the model won't choose an equation with large b-coefficients for them.
- This is because the larger the b-coefficients, the greater the L2 penalty
- As a result, the highly correlated independent variables don't contribute as heavily to the model
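The effect on correlated variables can be sketched with NumPy. This example is an illustration under stated assumptions, not part of the original notebook: the data is synthetic, the seed and lambda = 1 are arbitrary, and `fit_coefficients` is a hypothetical helper that solves the penalized problem in closed form on centered data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Two nearly identical (highly correlated) independent variables.
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(size=n)

# Center the data so the intercept drops out and is not penalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def fit_coefficients(Xc, yc, lam):
    """Minimize squared residuals + lam * (b1^2 + b2^2) in closed form."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)

ols_coefs = fit_coefficients(Xc, yc, lam=0.0)    # plain Least Squares
ridge_coefs = fit_coefficients(Xc, yc, lam=1.0)  # with the L2 penalty

print("least squares:", ols_coefs, "sum of squares:", np.sum(ols_coefs ** 2))
print("ridge:        ", ridge_coefs, "sum of squares:", np.sum(ridge_coefs ** 2))
```

The sum of squared b-coefficients is smaller for the ridge fit than for the Least Squares fit, which is the shrinkage the L2 penalty is meant to produce.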

### Example of Overfit Simple Linear Regression
<img src="images/rr/overfit_regression_example.png" height="35%" width="35%"></img>

The overfitting here is caused by the small number of training data points.

The red line (regression line) is overfit to the red training data set. Therefore, the predictions made on the green testing data set would be inaccurate.

Instead, we can address this problem by using Ridge Regression, which adds some bias.
<img src="images/rr/ridge_regression_example.png" height="35%" width="35%"></img>

The blue line (ridge regression line) adds a small amount of bias, but in exchange its predictions have less variance.

# Lasso Regression
Lasso Regression uses the formula with the smallest value of:  
```squared residuals + (lambda * |b1|)```
- |b1| adds a penalty to the traditional Least Squares method
- lambda determines how severe the |b1| penalty is
    - The higher the lambda, the higher the penalty for a given b1, so the model prefers a smaller b1 even if the squared residuals grow slightly
    
The ```lambda * |b1|``` penalty is called the L1 (Lasso) penalty.
- The L1 penalty uses the absolute value of the coefficients
    
The multi-variate concept for Lasso Regression is the same as Ridge Regression, but it uses the L1 penalty.
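A minimal sketch of the multi-variate Lasso formula, assuming the penalty is `lambda * (|b1| + |b2| + ... + |bn|)` with an unpenalized intercept. The helper name `lasso_cost`, the two data points, and lambda = 1 are assumptions made for illustration.

```python
def lasso_cost(rows, ys, b0, coefs, lam):
    """Squared residuals plus the L1 penalty lam * (|b1| + ... + |bn|)."""
    squared_residuals = 0.0
    for row, y in zip(rows, ys):
        prediction = b0 + sum(b * x for b, x in zip(coefs, row))
        squared_residuals += (y - prediction) ** 2
    l1_penalty = lam * sum(abs(b) for b in coefs)
    return squared_residuals + l1_penalty

# One independent variable per row; the same code handles more columns.
rows = [[0.0], [5.0]]
ys = [1.0, 2.0]

# The exact Least Squares fit (b0 = 1, b1 = 0.2) pays only the penalty.
exact_fit_cost = lasso_cost(rows, ys, b0=1.0, coefs=[0.2], lam=1.0)

# A flatter line trades a small residual for a smaller penalty.
flatter_cost = lasso_cost(rows, ys, b0=1.1, coefs=[0.16], lam=1.0)

print(exact_fit_cost, flatter_cost)
```

As with the L2 penalty, the flatter line ends up with the lower total cost for this choice of lambda.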

# Lasso versus Ridge Regression?
Ridge Regression uses the L2 (square of the coefficients) penalty.  
Lasso Regression uses the L1 (absolute value of the coefficients) penalty.

Lasso Regression can actually choose an equation that sets some b-coefficients to exactly zero, which removes those independent variables from the model completely.

In contrast, Ridge Regression only shrinks b-coefficients toward zero: it can make them very small but does not remove them.
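For a single independent variable, both penalized problems can be solved by hand, which makes the difference easy to see. The sketch below assumes the cost formulas exactly as written in these notes (squared residuals plus `lambda * b1^2` or `lambda * |b1|`, intercept unpenalized). The helper names and data points are hypothetical.

```python
import math

def centered_sums(xs, ys):
    """Return Sxx and Sxy, the centered sums used by the slope formulas."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return sxx, sxy

def ridge_slope(xs, ys, lam):
    """Slope minimizing squared residuals + lam * b1^2."""
    sxx, sxy = centered_sums(xs, ys)
    return sxy / (sxx + lam)

def lasso_slope(xs, ys, lam):
    """Slope minimizing squared residuals + lam * |b1| (soft-thresholding)."""
    sxx, sxy = centered_sums(xs, ys)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx

xs = [0.0, 5.0]
ys = [1.0, 2.0]

for lam in [0.0, 1.0, 5.0, 100.0]:
    print(lam, ridge_slope(xs, ys, lam), lasso_slope(xs, ys, lam))
```

On this data the lasso slope reaches exactly zero once lambda is 5 or larger, while the ridge slope keeps shrinking but stays positive.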

In [25]:
# import libraries
import numpy as np
import pandas as pd

In [40]:
# import the regression models
from sklearn.linear_model import LinearRegression, Ridge, Lasso

# create linear data
x = np.array([[0], [5]])
y = np.array([1, 2])

In [41]:
# create a simple linear regressor, then fit it to the training data
simple_regressor = LinearRegression()
simple_regressor.fit(x, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [42]:
# create a ridge regressor with lambda = 1, then fit it to the training data
ridge_regressor = Ridge(alpha=1)
ridge_regressor.fit(x, y)

Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [58]:
# create a lasso regressor with lambda = 0.10, then fit it to the training data
lasso_regressor = Lasso(alpha=0.10)
lasso_regressor.fit(x, y)

Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [59]:
# simple regressor prediction of 10
print(simple_regressor.predict([[10]]))

# ridge regressor prediction of 10
print(ridge_regressor.predict([[10]]))

# lasso regressor prediction of 10
print(lasso_regressor.predict([[10]]))

[3.]
[2.88888889]
[2.88]
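
The three predictions can be checked by hand for this two-point data set. The sketch below assumes scikit-learn's documented objectives: `Ridge` minimizes the squared residuals plus `alpha * b1^2`, while `Lasso` minimizes the squared residuals divided by `2 * n_samples` plus `alpha * |b1|`, so its alpha is scaled differently from the lambda in the formulas above. The variable names are illustrative.

```python
xs = [0.0, 5.0]
ys = [1.0, 2.0]
n = len(xs)

x_mean = sum(xs) / n
y_mean = sum(ys) / n
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))

def predict(slope, x_new):
    """Predict with an unpenalized intercept, so the line passes through the means."""
    return y_mean + slope * (x_new - x_mean)

ols_slope = sxy / sxx            # ordinary least squares
ridge_slope = sxy / (sxx + 1.0)  # Ridge with alpha = 1
# Lasso with alpha = 0.10; this form is valid here because sxy is positive.
lasso_slope = max(sxy / n - 0.10, 0.0) / (sxx / n)

simple_prediction = predict(ols_slope, 10.0)
ridge_prediction = predict(ridge_slope, 10.0)
lasso_prediction = predict(lasso_slope, 10.0)

# Rounded values: 3.0, 2.8889, 2.88
print(round(simple_prediction, 4), round(ridge_prediction, 4), round(lasso_prediction, 4))
```

Both penalized models predict a slightly lower value at x = 10 than the simple regressor because their slopes are shrunk toward zero.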
