# Introduction

Regularization is the process of simplifying a model. Simplifying a model helps us overcome high-dimensional issues because it allows us to discard features that don't contribute to the model's predictive ability.

# Shrinkage

In linear regression, we assume that the outcome ($Y$) is a linear combination of $p$ predictors plus some noise ($\epsilon$) to account for the difference:

$\displaystyle Y = \beta_0 + \sum\limits_{j=1}^p \beta_j X_j + \epsilon$

The more predictors we include in the model, the more complex it gets. A more complex model can fit the training data better, but it will **not necessarily predict better on the test data**. We must always be mindful of overfitting, so we should consider eliminating features that don't add predictive power to the model. We don't know beforehand which features are valuable, so we need an approach that eliminates them for us: regularization.

One way to simplify the model above is to force some of the coefficients to be close to zero or equal to zero. **Forcing a coefficient to be close to zero** (or equal to zero) essentially removes the associated feature from the model, which ultimately simplifies it. Regularization is also called **shrinkage** because the coefficients are shrunk to zero or close to it.
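The overfitting risk described above can be seen on a small synthetic example. The sketch below is only an illustration under assumed data: one informative predictor plus 30 pure-noise predictors, all invented here. It fits ordinary least squares with NumPy using more and more columns. The training MSE cannot increase as predictors are added, while the test MSE typically gets worse, although that part depends on the random draw.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 40, 200

def make_data(n):
    # One informative predictor plus 30 pure-noise predictors (synthetic, for illustration only).
    X = rng.normal(size=(n, 31))
    y = 2.0 * X[:, 0] + rng.normal(size=n)
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

def fit_and_score(n_features):
    # Ordinary least squares with an intercept, using only the first n_features columns.
    A_train = np.column_stack([np.ones(n_train), X_train[:, :n_features]])
    A_test = np.column_stack([np.ones(n_test), X_test[:, :n_features]])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    train_mse = np.mean((y_train - A_train @ beta) ** 2)
    test_mse = np.mean((y_test - A_test @ beta) ** 2)
    return train_mse, test_mse

results = {k: fit_and_score(k) for k in (1, 10, 31)}
for k, (train_mse, test_mse) in results.items():
    print(f"{k:>2} features: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Comparing the two columns of the printout shows why a low training error alone is not a good reason to keep a feature.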

## Regularization Through Penalty

A ridge regression model has the same form as a linear regression model, but its coefficient values are **regularized closer to 0.**

The difference comes in the loss function.


In standard linear regression, the coefficients are estimated by minimizing the mean squared error (MSE). Recall that the MSE is the **loss function** for linear regression and is defined as follows:


$\displaystyle L(\beta) = \frac{1}{n} \sum^n_{i=1} \left(y_i - \beta_0 - \sum^p_{j=1}\beta_j X_j\right)^2$

The loss function for ridge regression is similar, but it adds an extra **penalty term** to the MSE. The ridge regression coefficients minimize:

$\displaystyle L(\beta) = \frac{1}{n} \sum^n_{i=1} \left(y_i - \beta_0 - \sum^p_{j=1}\beta_j X_j\right)^2  + \alpha \sum^p_{j=1} \beta_j^2$

This second penalty term, $\alpha \sum^p_{j=1} \beta_j^2$, is how ridge regression regularizes its coefficients.

When the coefficients get large, this penalty term also gets larger. Unless a larger coefficient significantly reduces the MSE, the coefficients will be shrunk towards zero. The penalty term also includes a **tuning parameter**, denoted by $\alpha$. High values of $\alpha$ give the penalty term more weight, which encourages smaller regression coefficients, while smaller values of $\alpha$ do the opposite.

$\alpha$ is a hyperparameter for ridge regression, so it should be chosen through cross-validation as opposed to being hand-picked. Notice that when $\alpha=0$, the penalty vanishes and the loss reduces to the MSE, returning us to standard linear regression.
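To make the effect of $\alpha$ concrete, here is a minimal NumPy sketch of the penalized loss written above. It assumes synthetic data and centered variables (so the intercept drops out and is not penalized), and it uses the closed-form minimizer obtained by setting the gradient of the loss to zero. The helper names `ridge_loss` and `ridge_coefficients` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=n)

# Center X and y so the intercept drops out and is not penalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_loss(beta, alpha):
    # MSE plus the ridge penalty: (1/n) * sum of squared errors + alpha * sum(beta_j^2).
    return np.mean((yc - Xc @ beta) ** 2) + alpha * np.sum(beta ** 2)

def ridge_coefficients(alpha):
    # Setting the gradient of the loss to zero gives (X'X + n * alpha * I) beta = X'y.
    return np.linalg.solve(Xc.T @ Xc + n * alpha * np.eye(p), Xc.T @ yc)

# Plain least squares on the same centered data, for comparison with alpha = 0.
ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]

for alpha in (0.0, 0.1, 1.0, 10.0):
    beta = ridge_coefficients(alpha)
    print(f"alpha={alpha:>5}: ||beta|| = {np.linalg.norm(beta):.3f}, "
          f"loss = {ridge_loss(beta, alpha):.3f}")
```

At $\alpha=0$ the result matches ordinary least squares, and the coefficient norm falls as $\alpha$ grows. Note that scikit-learn's `Ridge` penalizes the sum of squared errors rather than the mean, so its `alpha` corresponds to $n\alpha$ in this formulation.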


# Ridge Regression

`scikit-learn` has a class dedicated to ridge regression: `RidgeCV`. It lives in the `linear_model` module, the same one as `LinearRegression`. There's also a `Ridge` class that implements ridge regression for a fixed $\alpha$, but `RidgeCV` has cross-validation built in, so we prefer it here.

    from sklearn.linear_model import RidgeCV
    model = RidgeCV()
    model.fit(X, y)

With `RidgeCV`, there's a `coef_` attribute that allows us to examine the estimated values of the coefficients of the model. Each feature we use in the model will have a coefficient, and the coefficient values in `coef_` will appear in the same order in which the features are used in the model. Our example code doesn't show any arguments when creating the `RidgeCV()` object, but there are two that are worth introducing:

  * `alphas`: an array of positive values to try during cross-validation. A single value also works, for example as a one-element list. `RidgeCV` has default values here, but we may want a different range or magnitude for the $\alpha$ values we validate.

  * `cv`: the cross-validation strategy, most simply an integer giving **how many folds** to use. By default (`cv=None`), `RidgeCV` uses an efficient form of leave-one-out cross-validation (LOOCV) to pick the best value among the `alphas` we provide.
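The two arguments can be sketched on synthetic data as follows. The data, the candidate `alphas`, and the choice of 5 folds are all assumptions made for illustration, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.5]) + rng.normal(size=120)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

# cv=None (the default) uses the efficient leave-one-out scheme.
loo_model = RidgeCV(alphas=alphas).fit(X, y)

# cv=5 switches to 5-fold cross-validation instead.
kfold_model = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("LOOCV alpha: ", loo_model.alpha_)
print("5-fold alpha:", kfold_model.alpha_)
print("coefficients:", loo_model.coef_)  # one value per column of X, in column order
```

The two strategies may or may not agree on the same candidate; both can only return a value from the list they were given.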

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import RidgeCV

    # Load the data and drop rows with missing values
    housing = pd.read_csv("housing.csv").dropna()

    # Predict median_income from the remaining numeric columns
    X = housing.drop(["ocean_proximity", "median_income"], axis=1)
    y = housing["median_income"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=762)

    # Train two models, each with a single candidate value for alpha
    ridge_1 = RidgeCV(alphas=[1])
    ridge_1000 = RidgeCV(alphas=[1000])

    ridge_1.fit(X_train, y_train)
    ridge_1000.fit(X_train, y_train)

    # Coefficients, in the same order as the columns of X
    ridge_1_coef = ridge_1.coef_
    ridge_1000_coef = ridge_1000.coef_
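One way to inspect coefficient arrays like these is to print them side by side with their feature names. The sketch below does this on synthetic data with hypothetical column names (`x1` to `x4`), using the plain `Ridge` class with a fixed `alpha` so that it runs without the housing file.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
feature_names = ["x1", "x2", "x3", "x4"]  # hypothetical column names
X = rng.normal(size=(200, 4))
y = X @ np.array([4.0, -3.0, 2.0, 0.0]) + rng.normal(size=200)

weak = Ridge(alpha=1).fit(X, y)       # light penalty
strong = Ridge(alpha=1000).fit(X, y)  # heavy penalty

# coef_ follows the column order of X, so names and values can be paired directly.
for name, c_weak, c_strong in zip(feature_names, weak.coef_, strong.coef_):
    print(f"{name}: alpha=1 -> {c_weak:+.3f}, alpha=1000 -> {c_strong:+.3f}")

print("norm at alpha=1:   ", np.linalg.norm(weak.coef_))
print("norm at alpha=1000:", np.linalg.norm(strong.coef_))
```

The overall size of the coefficient vector is smaller under the heavier penalty, which is the shrinkage effect in action.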



# Optimizing the Tuning Parameter

Higher values of $\alpha$ shrink the coefficients more because they give the penalty term more weight. The penalty forces the model to trade off two goals: larger coefficients may reduce the MSE, but they also raise the penalty.

The `RidgeCV` class implements efficient LOOCV to find the best value for $\alpha$, but it only searches the candidates we give it. By default, the `alphas` argument is `(0.1, 1.0, 10.0)`. With only three candidates, the best value for $\alpha$ may well lie between or outside them, so we want to outline an approach for finding it.

The process for finding an optimal tuning parameter is an iterative one. It requires us to pass different lists to the `alphas` parameter with increasing precision until we find an adequate answer.

First, we can run `RidgeCV` with an `alphas` argument that contains more candidate values. We can build such a grid with NumPy's `linspace()` function.
    
    ridge = RidgeCV(alphas = np.linspace(0.1, 10, num=100))

This gives us more $\alpha$ values to cross-validate over. Increasing `num` is useful when the optimal $\alpha$ lies strictly between the minimum and maximum we provided and we want a finer grid to home in on it.
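A common variant of this idea, sketched below on synthetic data, is to start with a coarse grid that spans several orders of magnitude and then refine around the winner. The use of `np.logspace()` for the first pass and the factor of 10 on either side are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 6))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0, 1.0]) + 2.0 * rng.normal(size=150)

# Pass 1: a coarse grid spanning several orders of magnitude.
coarse_grid = np.logspace(-3, 3, num=7)
coarse = RidgeCV(alphas=coarse_grid).fit(X, y)

# Pass 2: a finer, evenly spaced grid around the coarse winner.
fine_grid = np.linspace(coarse.alpha_ / 10, coarse.alpha_ * 10, num=100)
fine = RidgeCV(alphas=fine_grid).fit(X, y)

print("coarse alpha:", coarse.alpha_)
print("fine alpha:  ", fine.alpha_)
```

The second pass can be repeated with a narrower window until the chosen value stops changing in any way that matters for the model.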



## Iterative $\alpha$

To see the value of $\alpha$ chosen by a fitted `RidgeCV` model, we can use the `alpha_` attribute (notice the trailing underscore).

There's also an edge case we need to consider. In the above code, we assume the correct value for $\alpha$ is contained within the range of **0.1 and 10**, but what if it isn't? **One way we'll be able to notice this is if one of these two extremes ends up being the value of $\alpha$ chosen**. This result signals that we should change the bounds passed to `linspace()`.

**Ideally, the optimal value will fall in the middle of the bounds you choose, so you might need to experiment a bit.**
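The boundary check can be automated. The following is only a sketch on synthetic data: the tenfold shift of the window and the cap of six attempts are arbitrary choices made for this example, not part of `RidgeCV` itself.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + 3.0 * rng.normal(size=100)

low, high = 0.1, 10.0
for attempt in range(6):  # cap the number of window shifts (arbitrary choice)
    grid = np.linspace(low, high, num=100)
    model = RidgeCV(alphas=grid).fit(X, y)
    best = model.alpha_
    print(f"range [{low:g}, {high:g}] -> alpha_ = {best:.4g}")

    if np.isclose(best, high):
        # Winner sits on the upper bound: shift the window upward.
        low, high = high, high * 10
    elif np.isclose(best, low):
        # Winner sits on the lower bound: shift the window downward.
        low, high = low / 10, low
    else:
        break  # the winner is strictly inside the range
```

If the loop ends because the cap was reached rather than through `break`, the last `alpha_` is still on a bound and the search range deserves a closer look by hand.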

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import RidgeCV

    housing = pd.read_csv("housing.csv").dropna()

    X = housing.drop(["ocean_proximity", "median_income"], axis=1)
    y = housing["median_income"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=762)

    # Search a grid of 100 candidate alphas between 10 and 100
    ridge_initial = RidgeCV(alphas=np.linspace(10, 100, num=100))
    ridge_initial.fit(X_train, y_train)

    # The alpha selected by cross-validation
    alpha_initial = ridge_initial.alpha_
    print(alpha_initial)

Output:

    100.0

The selected value, 100.0, is the upper bound of the grid, so the search range should be extended upward.


# Magnitude Problem

Recall that the coefficients of linear regression are interpreted as average changes to the outcome for unit changes in the features. Therefore, the magnitude of these coefficients depends heavily on the magnitude of the feature.

This can become a problem if we have features of different magnitudes. We can see this in the `housing` dataset. The `housing_median_age` column is on a scale of 10s, but the `population` column ranges into the 1000s. This difference in magnitudes can have adverse effects on regularized models like ridge regression.

The penalty term looks at the magnitude of the coefficients, and that magnitude depends on each feature's units. A feature measured in the 1000s typically needs only a small coefficient to have an effect, so it is barely penalized, while a feature measured in the 10s needs a larger coefficient for a comparable effect and is penalized more. This is undesirable because the coefficients are penalized according to their scale rather than their usefulness.

We prevent this issue by putting all of our features on the same scale.
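The dependence of a coefficient on its feature's units can be checked directly. The sketch below uses made-up `age` and `population` values (not the housing data) and ordinary least squares: expressing `population` in thousands multiplies its coefficient by 1000 without changing the fitted model.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
age = rng.uniform(1, 50, size=n)             # a feature on a scale of 10s
population = rng.uniform(500, 5000, size=n)  # a feature on a scale of 1000s
y = 0.05 * age + 0.001 * population + rng.normal(scale=0.5, size=n)

def ols_coefficients(features):
    # Least squares with an intercept; return only the slopes.
    A = np.column_stack([np.ones(n), features])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

original = ols_coefficients(np.column_stack([age, population]))
# Same information, but population is now expressed in thousands of people.
rescaled = ols_coefficients(np.column_stack([age, population / 1000]))

print("population coefficient (people):   ", original[1])
print("population coefficient (thousands):", rescaled[1])
```

A penalty on squared coefficients would treat these two equivalent models very differently, which is why the features need a common scale before regularizing.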

# Standardization

Standardization is the process of transforming the features so that each one has a mean of 0 and a standard deviation of 1. Transforming a variable's **average to be 0** is called centering, and transforming its **standard deviation to be 1** is called scaling.

$\displaystyle z = \frac{x - \mu}{\sigma}$
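As a quick check of the formula, the z-scores can be computed by hand with NumPy. The numbers below are hypothetical values for two features on very different scales.

```python
import numpy as np

# Hypothetical values for two features on very different scales.
X = np.array([
    [10.0, 1200.0],
    [25.0, 3400.0],
    [40.0,  800.0],
    [15.0, 2600.0],
])

mu = X.mean(axis=0)    # per-column mean
sigma = X.std(axis=0)  # per-column standard deviation (population form, ddof=0)
Z = (X - mu) / sigma

print(Z.mean(axis=0))  # approximately 0 for every column
print(Z.std(axis=0))   # approximately 1 for every column
```

The population form of the standard deviation (`ddof=0`) is used here because that is also what `StandardScaler` computes.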

In order to standardize the features, we just need to instantiate a `StandardScaler()` object and create a new standardized dataset using the `fit_transform()` method. We pass an unstandardized dataset X to the method and reassign the returned standardized dataset to a new variable, `standardized_X`.

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    standardized_X = scaler.fit_transform(X)

Before we fit a model to our data, we should **standardize our data first** to make sure that the coefficients are penalized similarly. Standardization hurts the interpretability of the model because it redefines what a "unit increase" is for the variable. Before standardization, a unit increase can be phrased in terms of the actual variable; afterward, it means an increase of one standard deviation. Given that better predictive ability is what we want from machine learning models, this can be an acceptable compromise.
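When a test set is involved, the scaler is usually fitted on the training data only, and the same training means and standard deviations are then applied to new data. One way to arrange this, sketched here on synthetic data with arbitrary feature scales, is scikit-learn's `make_pipeline()` helper, which chains the scaler and the model.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Synthetic features on deliberately different scales (multipliers are arbitrary).
X = rng.normal(size=(300, 3)) * np.array([1.0, 50.0, 2000.0])
y = X @ np.array([1.0, 0.02, 0.0005]) + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training data only, then reuses
# those training means and standard deviations when predicting on new data.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))
model.fit(X_train, y_train)

test_mse = np.mean((y_test - model.predict(X_test)) ** 2)
print("chosen alpha:", model.named_steps["ridgecv"].alpha_)
print("test MSE:", test_mse)
```

Keeping the scaler inside the pipeline means the test data never influences the statistics used for standardization.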

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import RidgeCV
    from sklearn.preprocessing import StandardScaler

    housing = pd.read_csv("housing.csv").dropna()

    X = housing.drop(["ocean_proximity", "median_income"], axis=1)
    y = housing["median_income"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=762)

    # Fit the scaler on the training features and standardize them
    scaler = StandardScaler()
    standardized_X_train = scaler.fit_transform(X_train)

    # Reuse the training means and standard deviations for the test features
    standardized_X_test = scaler.transform(X_test)

    # RidgeCV with a grid of 100 candidate alphas between 1 and 10
    ridge = RidgeCV(alphas=np.linspace(1, 10, num=100))

    # Fit ridge regression on the standardized training data
    ridge.fit(standardized_X_train, y_train)

    # Coefficients, now on a comparable scale across features
    ridge_coefs = ridge.coef_



# The LASSO Model

The acronym LASSO stands for **"Least Absolute Shrinkage and Selection Operator."**

The model's name states precisely what it does: shrinkage and selection. We've learned both terms in this course, so we'll see how LASSO accomplishes them here.

Like ridge regression, LASSO is a regularized model. This regularization comes from an additional penalty term in the loss function. But unlike ridge regression, LASSO **punishes high coefficient values with a different function**, as shown below:

$\displaystyle L(\beta) = \frac{1}{n} \sum^n_{i=1} \left(y_i - \beta_0 - \sum^p_{j=1}\beta_j X_j\right)^2  + \alpha \sum^p_{j=1} \vert\beta_j\vert$

*LASSO penalizes the loss function using the absolute values of the coefficients instead of their squared values.*

While this change seems small, it gives LASSO a valuable characteristic that ridge regression lacks: **feature selection**. In ridge regression, if we feed $p$ features into the model, then we'll still retain all of them after the regularization, because their coefficients are shrunk but generally not set exactly to zero. In LASSO, some of the coefficients might be forced to exactly zero, effectively removing the associated features from the model.
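The difference is easy to see by counting exact zeros. This sketch uses synthetic data in which only the first three of ten features matter, and it fits scikit-learn's `Ridge` and `Lasso` with fixed, arbitrarily chosen `alpha` values. The two values are not directly comparable, since the two classes scale their loss functions differently.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(8)
n, p = 200, 10
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:3] = [3.0, -2.0, 1.5]  # only the first three features matter
y = X @ true_coef + rng.normal(size=n)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
print("exact zeros, ridge:", int(np.sum(ridge.coef_ == 0)))
print("exact zeros, lasso:", int(np.sum(lasso.coef_ == 0)))
```

Ridge leaves small but nonzero coefficients on the irrelevant features, while LASSO removes features outright.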

## LASSO in scikit-learn

`scikit-learn` implements LASSO in the `LassoCV` class in the `linear_model` module. There's also a `Lasso` class, but we're introducing `LassoCV` since it optimizes the tuning parameter as well. A schematic example, assuming `X` and `y` are already defined:

    import numpy as np
    from sklearn.linear_model import LassoCV

    lasso = LassoCV(alphas=np.linspace(0.1, 10, num=100))
    lasso.fit(X, y)
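After fitting, the `alpha_` and `coef_` attributes show which $\alpha$ was chosen and which features survived. The sketch below assumes synthetic data with hypothetical feature names (`x0` to `x7`), a log-spaced grid, and 5-fold cross-validation, all chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(9)
feature_names = np.array([f"x{j}" for j in range(8)])  # hypothetical names
X = rng.normal(size=(250, 8))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=250)

alphas = np.logspace(-3, 1, num=30)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)

# Features whose coefficients were not forced to zero.
kept = feature_names[lasso.coef_ != 0]
print("chosen alpha:", lasso.alpha_)
print("features kept:", list(kept))
```

How many of the irrelevant features are dropped depends on the $\alpha$ that cross-validation selects, so the list of kept features can vary from dataset to dataset.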

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import RidgeCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LassoCV
    from sklearn.metrics import mean_squared_error

    housing = pd.read_csv("housing.csv").dropna()

    X = housing.drop(["ocean_proximity", "median_income"], axis=1)
    y = housing["median_income"]

    # Split first so the scaler only sees the training data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=762)

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Ridge regression with a cross-validated alpha
    ridge = RidgeCV(alphas=np.linspace(1, 10, num=100))
    ridge.fit(X_train, y_train)
    ridge_coef = ridge.coef_

    # LASSO with a cross-validated alpha over the same grid
    lasso = LassoCV(alphas=np.linspace(1, 10, num=100))
    lasso.fit(X_train, y_train)
    lasso_coef = lasso.coef_

    # Test MSE for both models
    ridge_test_mse = mean_squared_error(y_test, ridge.predict(X_test))
    lasso_test_mse = mean_squared_error(y_test, lasso.predict(X_test))

    print(ridge_test_mse, lasso_test_mse)

# Conclusion

When comparing the two models above, we found that **ridge regression performed better than LASSO**. LASSO tends to perform better when we suspect that many of the features we include in the model aren't helpful. In this case, the features appear to contribute to predictive ability, so removing them was actually harmful.