# Regularization Exercise

Regularization is meant to "penalize" a model for being too complex. The regularization term is typically a sum over the model's coefficients, and there are two common types: L1 uses the sum of the absolute values of the coefficients, whereas L2 uses the sum of their squares. The strength of the penalty is often called lambda (sklearn calls it `alpha`). A small lambda penalizes complexity only lightly, whereas a large lambda penalizes it heavily.
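
To make the two penalty terms concrete, here is a minimal sketch that computes them for a made-up coefficient vector. The values of `coef` and `lam` are assumptions chosen only for illustration; they do not come from the exercise data.

```python
import numpy as np

# Hypothetical coefficient vector, chosen only for illustration.
coef = np.array([0.0, 2.0, -3.0, 0.5])

# Hypothetical regularization strength (lambda; sklearn calls this `alpha`).
lam = 0.1

# L1 penalty: lambda times the sum of the absolute values of the coefficients.
l1_penalty = lam * np.sum(np.abs(coef))

# L2 penalty: lambda times the sum of the squared coefficients.
l2_penalty = lam * np.sum(coef ** 2)

print(f"L1 penalty: {l1_penalty:.3f}")
print(f"L2 penalty: {l2_penalty:.3f}")
```

Note how the large coefficient (-3.0) dominates the L2 penalty much more than the L1 penalty, because it is squared.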

|L1 Regularization           |L2 Regularization         |
|----------------------------|--------------------------|
|Computationally inefficient | Computationally efficient|
|Sparse Output               | Non-Sparse Output        |
|Feature selection           | No feature selection     |
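
The "sparse output" and "feature selection" rows can be illustrated with a small sketch on synthetic data. Everything here is an assumption made for the demonstration: the random data, the choice of which two columns drive `y`, and `alpha=0.5`. sklearn's `Ridge` class stands in for L2 regularization.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for this sketch): six features,
# but only columns 1 and 4 actually influence the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=200)

# Fit an L1-regularized and an L2-regularized linear model
# with the same (arbitrarily chosen) regularization strength.
lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count how many coefficients each model set exactly to zero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Exact zeros (Lasso, Ridge):", n_zero_lasso, n_zero_ridge)
```

In this setup, Lasso drops the irrelevant columns entirely, while Ridge keeps small nonzero weights on all of them.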

Perhaps it's not too surprising at this point, but there are classes in sklearn that help you perform regularization with linear regression. You'll practice using one of them in this exercise. This assignment's data.csv contains a set of points with six predictor variables and one outcome variable. Use sklearn's `Lasso` class to fit a linear regression model to the data, using L1 regularization to control model complexity.

For reference, see "Intro to ML with Python", page 55.

Perform the following steps:

## 1. Load in the data

The data is in the file called 'data.csv' (loaded in this notebook as 'regularization_data.csv'). Note that this file has no header row.
Split the data so that the six predictor features (first six columns) are stored in X, and the outcome feature (last column) is stored in y.
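
Because the file has no header row, how it is read matters. The sketch below uses a tiny inline CSV (hypothetical values, not the exercise data) to show the difference between pandas' default behavior and `header=None`, followed by the X/y split described above.

```python
import io
import pandas as pd

# Hypothetical three-row, seven-column CSV with no header row.
csv_text = (
    "1.0,2.0,3.0,4.0,5.0,6.0,10.0\n"
    "1.5,2.5,3.5,4.5,5.5,6.5,11.0\n"
    "2.0,3.0,4.0,5.0,6.0,7.0,12.0\n"
)

# Default: pandas treats the first line as column names, so one row of data is lost.
with_default = pd.read_csv(io.StringIO(csv_text))

# header=None: every line is kept as data and columns are numbered 0..6.
with_header_none = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_default.shape)      # (2, 7)
print(with_header_none.shape)  # (3, 7)

# First six columns are the predictors, the last column is the outcome.
X = with_header_none.iloc[:, :-1]
y = with_header_none.iloc[:, -1]
print(X.shape, y.shape)        # (3, 6) (3,)
```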


In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

In [22]:
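# Load the data, then inspect its shape and first rows.
# NOTE: the file has no header row, so read_csv with its default settings uses
# the first data row as column names; pass header=None to keep that row as data.
data = pd.read_csv('regularization_data.csv')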
print(data.shape)
data.head()

(99, 7)


Unnamed: 0,1.25664,2.04978,-6.23640,4.71926,-4.26931,0.20590,12.31798
0,-3.89012,-0.37511,6.14979,4.94585,-3.57844,0.0064,23.67628
1,5.09784,0.9812,-0.29939,5.85805,0.28297,-0.20626,-1.53459
2,0.39034,-3.06861,-5.63488,6.43941,0.39256,-0.07084,-24.6867
3,5.84727,-0.15922,11.41246,7.52165,1.69886,0.29022,17.54122
4,-2.86202,-0.84337,-1.08165,0.67115,-2.48911,0.52328,9.39789


In [23]:
# Assign the data to predictor (first six columns) and outcome (last column) variables
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
print(type(X), X.shape)
print(type(y), y.shape)

<class 'pandas.core.frame.DataFrame'> (99, 6)
<class 'pandas.core.series.Series'> (99,)


## 2. Fit data using linear regression with Lasso regularization

Create an instance of sklearn's `Lasso` class and assign it to the variable `lasso_reg`. You don't need to set any parameter values: use the default values for the quiz.
Use the Lasso object's `.fit()` method to fit the regression model to the data.


In [12]:
# Create the linear regression model with lasso regularization and fit it to the data
lasso_reg = Lasso().fit(X, y)
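
Since the exercise relies on the default parameter values, it can help to see what they are. This small sketch reads them from an unfitted estimator using `get_params()`; the variable name `default_params` is just an illustrative choice.

```python
from sklearn.linear_model import Lasso

# Inspect the default hyperparameters of an unfitted Lasso estimator.
default_params = Lasso().get_params()

# `alpha` is the regularization strength (the "lambda" from the introduction).
print(default_params["alpha"])

# `fit_intercept` controls whether an intercept term is estimated.
print(default_params["fit_intercept"])
```

The default `alpha` is 1.0, which is the regularization strength used in the fit above.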

## 3. Inspect the coefficients of the regression model

Obtain the coefficients of the fitted regression model using the `.coef_` attribute of the Lasso object. Store them in the `reg_coef` variable: the coefficients will be printed out, and you will use your observations to answer the question at the bottom of the page.

In [20]:
# Retrieve and print out the coefficients from the regression model.
reg_coef = lasso_reg.coef_
print(reg_coef)

[ 0.          2.33659619  2.0140086  -0.05753445 -3.91583673  0.        ]


Lasso regularization has set the coefficients for the first and sixth columns to 0. You might try fitting a standard `LinearRegression` model to the same data to see what those coefficients are without regularization!
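
A comparison along those lines might look like the sketch below. It uses synthetic data rather than the exercise file so that it runs on its own; the random data, `true_coef`, and the `_demo` variable names are all assumptions made for illustration, so the printed numbers will not match the output above.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic stand-in for the exercise data (an assumption for this sketch):
# columns 0 and 5 have only a tiny effect, column 3 has none.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 6))
true_coef = np.array([0.05, 2.0, 2.0, 0.0, -4.0, 0.05])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=100)

# Ordinary least squares: no penalty on the coefficients.
ols = LinearRegression().fit(X_demo, y_demo)

# Lasso with default settings: L1 penalty on the coefficients.
lasso_demo = Lasso().fit(X_demo, y_demo)

for name, model in [("LinearRegression", ols), ("Lasso", lasso_demo)]:
    print(f"{name:>16}: {np.round(model.coef_, 3)}")
```

On data like this, the unregularized model assigns small nonzero weights to the weak features, while Lasso sets them to exactly zero and shrinks the remaining coefficients toward zero.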

In [19]:
# score() returns R^2; here it is computed on the same data the model was fitted on
print(f"Model Score: {lasso_reg.score(X, y):.2f}")
print(f"Number of features used: {np.sum(lasso_reg.coef_ != 0)}")

Model Score: 0.99
Number of features used: 4


## 4. Redo with Train and Test sets

In [21]:
from sklearn.model_selection import train_test_split

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [25]:
print(type(X_train), X_train.shape)
print(type(y_train), y_train.shape)
print(type(X_test), X_test.shape)
print(type(y_test), y_test.shape)

<class 'pandas.core.frame.DataFrame'> (74, 6)
<class 'pandas.core.series.Series'> (74,)
<class 'pandas.core.frame.DataFrame'> (25, 6)
<class 'pandas.core.series.Series'> (25,)


In [26]:
lasso_reg2 = Lasso().fit(X_train, y_train)
print(lasso_reg2.coef_)
print(f"Train Number of features used: {np.sum(lasso_reg2.coef_ != 0)}")
print(f"Train Set Score: {lasso_reg2.score(X_train, y_train):.2f}")
print(f"Test Set Score: {lasso_reg2.score(X_test, y_test):.2f}")

[ 0.          2.37063394  2.06704239 -0.03967225 -3.91014885  0.        ]
Train Number of features used: 4
Train Set Score: 0.99
Test Set Score: 0.98
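
A natural follow-up is to vary the regularization strength and watch how the number of features used and the train/test scores change. The sketch below does this on synthetic data; the data, the coefficient values, the list of `alpha` values, and the `_syn`/`_tr`/`_te` names are assumptions for illustration only, not part of the original exercise.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for this sketch): three informative features out of six.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(100, 6))
y_syn = X_syn @ np.array([0.0, 2.0, 2.0, 0.0, -4.0, 0.0]) + rng.normal(scale=0.5, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=42)

# Fit one Lasso model per alpha and record features used plus R^2 on each split.
results = []
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X_tr, y_tr)
    n_used = int(np.sum(model.coef_ != 0))
    train_r2 = model.score(X_tr, y_tr)
    test_r2 = model.score(X_te, y_te)
    results.append((alpha, n_used, train_r2, test_r2))
    print(f"alpha={alpha:<5} features used: {n_used}  "
          f"train R^2: {train_r2:.2f}  test R^2: {test_r2:.2f}")
```

With a very large `alpha` the penalty dominates and every coefficient is driven to zero, so the model predicts a constant and the scores collapse; with a small `alpha` the fit approaches ordinary least squares.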
