# Supervised Machine Learning by @attzulkafli.

## Lasso, Ridge, and Elastic Net Regression: Regularization


**Lasso Regression:**

**LASSO** stands for ``Least Absolute Shrinkage and Selection Operator``, with the emphasis on two key words: 'absolute' and 'selection'.

Lasso regression performs **L1 regularization**, i.e. it adds a penalty proportional to the sum of the absolute values of the coefficients to the optimization objective.

- Minimize Error: $$\sum_{i=1}^{m} {(y - \hat{y})^2} + \alpha\sum_{i=1}^{n} {|\beta_i|} = \sum_{i=1}^{m} {(y -  \beta_0 - \beta_1  x_1 - \beta_2  x_2 - ... - \beta_n  x_n )^2} + \alpha\sum_{i=1}^{n} {|\beta_i|}$$

Thus, lasso regression optimizes the following:

**Objective = RSS + α * (sum of absolute values of coefficients)**

Here, α (alpha) plays the same role as in ridge regression (described in the next section): it controls the trade-off between minimizing RSS and keeping the magnitude of the coefficients small. As with ridge, α can take various values:

1. α = 0: Same coefficients as simple linear regression.
2. α = ∞: All coefficients are zero, because any nonzero coefficient would make the penalty term infinite.
3. 0 < α < ∞: Coefficients lie between 0 and those of simple linear regression, and some of them can be exactly zero (the 'selection' part).
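
The effect of α on the lasso coefficients can be sketched on a small synthetic dataset. The data, the variable names (`X_demo`, `y_demo`, `nonzero_counts`) and the α values below are assumptions chosen for illustration, not part of the original notebook. Note also that scikit-learn's `Lasso` scales the squared-error term by 1/(2 × number of samples), so its `alpha` is not numerically identical to the α in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (illustrative assumption): only the first 3 of 10 features matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# Fit lasso for increasing alpha and count the surviving (nonzero) coefficients.
nonzero_counts = {}
for alpha in [0.001, 0.1, 1.0, 100.0]:
    lasso = Lasso(alpha=alpha, max_iter=100000).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))
    print(f"alpha={alpha}: nonzero coefficients = {nonzero_counts[alpha]}")
```

A small α leaves the fit close to ordinary least squares, while a sufficiently large α drives every coefficient to exactly zero.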

**Ridge Regression:**


``Ridge regression`` performs **L2 regularization**, i.e. it adds a penalty proportional to the sum of the squares of the coefficients to the optimization objective.

- Minimize Error: $$\sum_{i=1}^{m} {(y - \hat{y})^2}+ \alpha \sum_{i=1}^{n} {\beta_i^2} = \sum_{i=1}^{m} {(y -  \beta_0 - \beta_1  x_1 - \beta_2  x_2 - ... - \beta_n  x_n )^2} + \alpha \sum_{i=1}^{n} {\beta_i^2}$$

Thus, ridge regression optimizes the following:

**Objective = RSS + α * (sum of squares of coefficients)**

Here, α (alpha) is the parameter that balances the emphasis given to minimizing RSS against minimizing the sum of squares of the coefficients. α can take various values:

1. α = 0:
    - The objective becomes same as simple linear regression.
    - We’ll get the same coefficients as simple linear regression.
2. α = ∞:
    - The coefficients will be zero. Why? Because of the infinite weight on the squares of the coefficients, any coefficient other than zero would make the objective infinite.
3. 0 < α < ∞:
    - The magnitude of α decides how much weight is given to each part of the objective.
    - The coefficients will be somewhere between 0 and those of simple linear regression.
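
The three cases above can be checked numerically with the closed-form ridge solution. This is a minimal sketch on synthetic data; the helper name `ridge_coefficients`, the data and the α values are assumptions for illustration. The data are centered first so that the intercept is not penalized.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge: solve (Xc^T Xc + alpha * I) beta = Xc^T yc on centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Synthetic data (illustrative assumption).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.3, size=100)

# alpha = 0 reproduces least squares; larger alpha shrinks the coefficients towards zero.
norms = []
for alpha in [0.0, 1.0, 100.0, 1e6]:
    beta = ridge_coefficients(X_demo, y_demo, alpha)
    norms.append(np.linalg.norm(beta))
    print(f"alpha={alpha:g}: coefficient norm = {norms[-1]:.4f}")
```

The coefficient norm decreases as α grows, approaching zero for very large α without any individual coefficient being set exactly to zero.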

**Elastic Net Regression:**

- Minimize Error: $$\sum_{i=1}^{m} {(y - \hat{y})^2}+ \lambda_1 \sum_{i=1}^{n} {|\beta_i|} + \lambda_2 \sum_{i=1}^{n} {\beta_i^2} = \sum_{i=1}^{m} {(y -  \beta_0 - \beta_1  x_1 - \beta_2  x_2 - ... - \beta_n  x_n )^2} + \lambda_1 \sum_{i=1}^{n} {|\beta_i|} + \lambda_2 \sum_{i=1}^{n} {\beta_i^2}$$

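In `sklearn`, the relationship between $\lambda_1$ and $\lambda_2$ is set by two parameters of the `ElasticNet` class, `alpha` and `l1_ratio`, where:
$$\alpha = \lambda_1 + \lambda_2$$
and $$\mathrm{l1\_ratio} = \frac{\lambda_1}{\lambda_1 + \lambda_2}$$
Note that scikit-learn's objective additionally scales the squared-error term by $1/(2m)$ and the L2 penalty by $1/2$, so the correspondence with the formula above holds only up to those constants.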

For example, if $\alpha = 1$ and `l1_ratio` = 0.3, then:
$$\begin{aligned} \lambda_1 + \lambda_2 &= 1 \\ \frac{\lambda_1}{\lambda_1 + \lambda_2} &= 0.3 \end{aligned}$$
Therefore:
$$\begin{aligned} \lambda_1 &= 0.3 \\ \lambda_2 &= 0.7 \end{aligned}$$
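
The conversion between the two parameterizations can be written as a pair of small helpers. The function names `to_sklearn_params` and `from_sklearn_params` are hypothetical, introduced here only for illustration; they follow the definitions of `alpha` and `l1_ratio` given above and ignore any constant scaling factors in scikit-learn's objective.

```python
import math

def to_sklearn_params(lambda_1, lambda_2):
    """Map (lambda_1, lambda_2) to (alpha, l1_ratio) using the definitions above."""
    alpha = lambda_1 + lambda_2
    l1_ratio = lambda_1 / alpha
    return alpha, l1_ratio

def from_sklearn_params(alpha, l1_ratio):
    """Inverse mapping: recover (lambda_1, lambda_2) from (alpha, l1_ratio)."""
    return alpha * l1_ratio, alpha * (1 - l1_ratio)

# Worked example from the text: alpha = 1, l1_ratio = 0.3.
lambda_1, lambda_2 = from_sklearn_params(1.0, 0.3)
print("lambda_1 =", lambda_1, "lambda_2 =", lambda_2)

# Round trip back to (alpha, l1_ratio), up to floating-point rounding.
alpha_back, ratio_back = to_sklearn_params(lambda_1, lambda_2)
print("alpha =", alpha_back, "l1_ratio =", ratio_back)
```

Setting `l1_ratio=1` gives a pure L1 (lasso-style) penalty, and `l1_ratio=0` gives a pure L2 (ridge-style) penalty.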

In [18]:
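# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this notebook assumes an earlier scikit-learn version.
from sklearn.datasets import load_boston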
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


In [5]:
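def PolynomialLasso(degree=2, **kwargs):
    """Pipeline: polynomial feature expansion followed by Lasso (kwargs go to Lasso)."""
    return make_pipeline(PolynomialFeatures(degree),
                         Lasso(**kwargs))

In [17]:
def PolynomialRidge(degree=2, **kwargs):
    """Pipeline: polynomial feature expansion followed by Ridge (kwargs go to Ridge)."""
    return make_pipeline(PolynomialFeatures(degree),
                         Ridge(**kwargs))


In [19]:
def PolynomialElastic(degree=2, **kwargs):
    """Pipeline: polynomial feature expansion followed by ElasticNet (kwargs go to ElasticNet)."""
    return make_pipeline(PolynomialFeatures(degree),
                         ElasticNet(**kwargs))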

**Example - Lasso on Boston with Polynomials**

- Load the Boston dataset from sklearn again.
- Divide the data into 30% test and 70% train set.
- Fit a Polynomial Lasso Regression on the train data
    - Use degree=2, alpha=0.1, max_iter=100000
- What is the R2 for train and test? How many features were selected?
- Now try:
    - change PolynomialLasso to set interaction_only=True in PolynomialFeatures
    - degree=3, alpha=1, max_iter=100000
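
To see what `interaction_only=True` changes, it helps to count the columns that `PolynomialFeatures` generates. The sketch below assumes 13 input columns, as in the Boston data; the helper `n_poly_features` and the dummy array `X_dummy` are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Dummy input with 13 columns, matching the number of Boston features.
X_dummy = np.zeros((1, 13))

def n_poly_features(degree, interaction_only=False):
    """Number of output columns (including the bias column) for the given settings."""
    poly = PolynomialFeatures(degree, interaction_only=interaction_only)
    return poly.fit_transform(X_dummy).shape[1]

for degree in (2, 3):
    full = n_poly_features(degree)
    inter = n_poly_features(degree, interaction_only=True)
    print(f"degree={degree}: all terms = {full}, interaction only = {inter}")
```

With `interaction_only=True`, pure powers such as `x0**2` are dropped and only products of distinct features are kept, so the model has fewer candidate features to regularize.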

In [2]:
boston = load_boston()
X = boston.data
y = boston.target

In [3]:
# 70% train (X1, y1) and 30% test (X2, y2)
X1, X2, y1, y2 = train_test_split(X, y, random_state=0, test_size=0.3)

In [16]:
# Note: this run uses alpha=0.005 rather than the alpha=0.1 suggested above.
# max_iter is passed as an integer instead of the float 1e5.
Lasso_Poly2_boston = PolynomialLasso(2, alpha=0.005, max_iter=100000)
Lasso_Poly2_boston.fit(X1, y1)
print("Train score", Lasso_Poly2_boston.score(X1, y1))
print("Test score", Lasso_Poly2_boston.score(X2, y2))
k = Lasso_Poly2_boston.steps[1][1].coef_  # coefficients of the Lasso step
print("Features all", len(k))
print("Features used", sum(k != 0))
print("Features NOT used", sum(k == 0))


Train score 0.947634400969777
Test score 0.6444508360125614
Features all 105
Features used 91
Features NOT used 14




In [14]:
# Note: interaction_only is left at its default (False) here,
# so all degree-3 polynomial terms are generated.
Lasso_Poly3_boston = PolynomialLasso(3, alpha=1, max_iter=100000)
Lasso_Poly3_boston.fit(X1, y1)

print("Train score", Lasso_Poly3_boston.score(X1, y1))
print("Test score", Lasso_Poly3_boston.score(X2, y2))
k = Lasso_Poly3_boston.steps[1][1].coef_  # coefficients of the Lasso step
print("Features all", len(k))
print("Features used", sum(k != 0))
print("Features NOT used", sum(k == 0))

Train score 0.983151786505286
Test score 0.1371925206481145
Features all 560
Features used 238
Features NOT used 322




**Exercise: Ridge & ElasticNet on Boston with Polynomials**
- Fit a Polynomial Ridge Regression on the train data
    - Use degree=2, alpha=0.1, max_iter=100000
- What is the R2 for train and test? How many features were selected?
- Now try:
    - change PolynomialRidge to set interaction_only=True in PolynomialFeatures
    - degree=3, alpha=1, max_iter=100000
    


In [25]:
# max_iter is passed as an integer instead of the float 1e5.
Ridge_Poly2_boston = PolynomialRidge(2, alpha=0.1, max_iter=100000)
Ridge_Poly2_boston.fit(X1, y1)
print("Train score", Ridge_Poly2_boston.score(X1, y1))
print("Test score", Ridge_Poly2_boston.score(X2, y2))
k = Ridge_Poly2_boston.steps[1][1].coef_  # coefficients of the Ridge step
print("Features all", len(k))
print("Features used", sum(k != 0))
print("Features NOT used", sum(k == 0))

Train score 0.9506157246830967
Test score 0.614488986368475
Features all 105
Features used 104
Features NOT used 1


In [26]:
# max_iter is passed as an integer instead of the float 1e5.
Ridge_Poly3_boston = PolynomialRidge(3, alpha=1, max_iter=100000)
Ridge_Poly3_boston.fit(X1, y1)
print("Train score", Ridge_Poly3_boston.score(X1, y1))
print("Test score", Ridge_Poly3_boston.score(X2, y2))
k = Ridge_Poly3_boston.steps[1][1].coef_  # coefficients of the Ridge step
print("Features all", len(k))
print("Features used", sum(k != 0))
print("Features NOT used", sum(k == 0))

Train score 0.9455122611062422
Test score -3.855119853801522
Features all 560
Features used 559
Features NOT used 1




- Fit a Polynomial Elastic Net Regression on the train data
    - Use degree=2, alpha=0.1, max_iter=100000
- What is the R2 for train and test? How many features were selected?
- Now try:
    - change PolynomialElastic to set interaction_only=True in PolynomialFeatures
    - degree=3, alpha = 1, l1_ratio=0.5, max_iter=100000

In [22]:
# Note: this run uses alpha=1, l1_ratio=0.5 rather than the alpha=0.1 suggested above.
# max_iter is passed as an integer instead of the float 1e5.
Elastic_Poly2_boston = PolynomialElastic(2, alpha=1, l1_ratio=0.5, max_iter=100000)
Elastic_Poly2_boston.fit(X1, y1)
print("Train score", Elastic_Poly2_boston.score(X1, y1))
print("Test score", Elastic_Poly2_boston.score(X2, y2))
k = Elastic_Poly2_boston.steps[1][1].coef_  # coefficients of the ElasticNet step
print("Features all", len(k))
print("Features used", sum(k != 0))
print("Features NOT used", sum(k == 0))

Train score 0.9163405765640357
Test score 0.7664562575950314
Features all 105
Features used 54
Features NOT used 51


In [24]:
# max_iter is passed as an integer instead of the float 1e5.
Elastic_Poly3_boston = PolynomialElastic(3, alpha=1, l1_ratio=0.5, max_iter=100000)
Elastic_Poly3_boston.fit(X1, y1)
print("Train score", Elastic_Poly3_boston.score(X1, y1))
print("Test score", Elastic_Poly3_boston.score(X2, y2))
k = Elastic_Poly3_boston.steps[1][1].coef_  # coefficients of the ElasticNet step
print("Features all", len(k))
print("Features used", sum(k != 0))
print("Features NOT used", sum(k == 0))

Train score 0.9858405664684327
Test score -0.04198637963927765
Features all 560
Features used 266
Features NOT used 294


