This is the lecture note for **regularized linear models**.

<p class = "alert alert-info" role="alert"><b>Note</b> that this lecture note gives a brief introduction to regularized linear models. I encourage you to read further about regularized linear models. </p>

Read more:

- [Regularized linear models medium](https://medium.com/analytics-vidhya/regularized-linear-models-in-machine-learning-d2a01a26a46)
- [Ridge regression wikipedia](https://en.wikipedia.org/wiki/Ridge_regression)
- [Tikhonov regularization wikipedia](https://en.wikipedia.org/wiki/Tikhonov_regularization)
- [Lasso regression wikipedia](https://en.wikipedia.org/wiki/Lasso_(statistics))
- [Cross-validation (Swedish Wikipedia)](https://sv.wikipedia.org/wiki/Korsvalidering)
- [Cross validation](https://machinelearningmastery.com/k-fold-cross-validation/)
- [Scoring parameter sklearn](https://scikit-learn.org/stable/modules/model_evaluation.html)
- [ISLRv2 pp 198-205](https://www.statlearning.com/)

In [5]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

df = pd.read_csv("../data/Advertising.csv", index_col=0)
df.head()


Unnamed: 0,TV,Radio,Newspaper,Sales
1,230.1,37.8,69.2,22.1
2,44.5,39.3,45.1,10.4
3,17.2,45.9,69.3,9.3
4,151.5,41.3,58.5,18.5
5,180.8,10.8,58.4,12.9


In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

X, y = df.drop('Sales', axis= "columns"), df["Sales"]

# Feature engineering: a degree-3 polynomial expansion turns the 3 original features into 19
model_polynomial = PolynomialFeatures(3, include_bias=False)
# We pick degree 3 on purpose: with this many features we can see whether regularization shrinks the coefficients
polynomial_features = model_polynomial.fit_transform(X)


# We split polynomial_features, not the original X
# The features are x1, x1^2, x1^3, x1*x2, x1^2*x2, x1*x2*x3, ...
X_train, X_test, y_train, y_test = train_test_split(polynomial_features, y, test_size=.33, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((134, 19), (66, 19), (134,), (66,))
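The 19 columns can be checked by counting monomials. The sketch below assumes only what the cell above states (3 input features, degree 3, no bias column); the names `x1`, `x2`, `x3` are placeholders for TV, Radio and Newspaper.

```python
from itertools import combinations_with_replacement
from math import comb

n_features = 3  # TV, Radio, Newspaper
degree = 3

# Number of monomials of total degree <= 3 in 3 variables, minus the
# constant term (include_bias=False drops the column of ones).
expected_columns = comb(n_features + degree, degree) - 1

# Enumerate the monomials explicitly as a cross-check.
names = ["x1", "x2", "x3"]
terms = [
    "*".join(combo)
    for d in range(1, degree + 1)
    for combo in combinations_with_replacement(names, d)
]

print(expected_columns, len(terms))  # prints: 19 19
```

There are 3 terms of degree 1, 6 of degree 2 and 10 of degree 3, which matches the second dimension of `X_train.shape`.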

## Feature Scaling
- In this case we choose "standardization"

Remove the sample mean and divide by the sample standard deviation:

$X' = \frac{X-\mu}{\sigma}$

LASSO, ridge and elastic net regression, which we'll use later, penalize the size of the coefficients, so they require the features to be on the same scale.
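As a minimal sketch of the formula above, standardization can be done by hand with NumPy. The data here is synthetic and the `_demo` names are made up for illustration. Note that the standard deviation is computed with `ddof=0`, which is also what `StandardScaler` does.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(40, 3))

# mu and sigma are computed per column, from the training data only
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)  # ddof=0

# the same mu and sigma are applied to both sets
scaled_train_demo = (X_train_demo - mu) / sigma
scaled_test_demo = (X_test_demo - mu) / sigma

print(scaled_train_demo.mean(axis=0), scaled_train_demo.std(axis=0))
print(scaled_test_demo.mean(axis=0), scaled_test_demo.std(axis=0))
```

The training columns end up with mean 0 and standard deviation 1 (up to rounding), while the test columns are only close to that, because they are scaled with the training parameters.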


In [20]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit_transform computes mu and sigma from X_train and then transforms X_train
scaled_X_train = scaler.fit_transform(X_train)


# transform reuses the mu and sigma computed from X_train to scale X_test
scaled_X_test = scaler.transform(X_test)

scaled_X_test.mean(), scaled_X_test.std(), scaled_X_train.mean(), scaled_X_train.std()

# The test set does not get mean exactly 0, since it is scaled with the training set's parameters

(-0.11982457640326809, 1.1245966534380971, -3.34898382919136e-17, 1.0)

## Regularizations 

### Ridge regression (Tikhonov regularization) (L2 regularization)

$C(\vec{\theta}) = MSE(\vec{\theta}) + \lambda \frac{1}{2}\sum_{i=1}^n \theta_i^2$
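The cost function above can be written out directly in NumPy. This is a sketch for illustration only: `ridge_cost` is a made-up helper, it ignores the intercept, and it follows the formula in this note rather than the exact scaling scikit-learn uses internally.

```python
import numpy as np

def ridge_cost(theta, X, y, lam):
    """MSE(theta) + lam * 1/2 * sum(theta_i^2), no intercept term."""
    residuals = X @ theta - y
    mse = np.mean(residuals ** 2)
    penalty = lam * 0.5 * np.sum(theta ** 2)
    return mse + penalty

X_demo = np.array([[1.0, 0.0], [0.0, 1.0]])
y_demo = np.array([1.0, 2.0])
theta_demo = np.array([1.0, 2.0])  # fits the data perfectly, so MSE = 0

print(ridge_cost(theta_demo, X_demo, y_demo, lam=0.0))  # prints: 0.0
print(ridge_cost(theta_demo, X_demo, y_demo, lam=2.0))  # prints: 5.0
```

With a perfect fit the MSE term is 0, so the whole cost comes from the penalty: 2 · ½ · (1² + 2²) = 5. Large coefficients are punished even when they fit the training data well.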

In [29]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error


# Fit ridge regression on the scaled training data and predict on X
def ridge_regression(X, penalty=0):
    # alpha in Ridge is the same as lambda in the theory/formula
    model_ridge = Ridge(alpha=penalty)
    model_ridge.fit(scaled_X_train, y_train)
    # predict on the X that was passed in, not on a hard-coded global
    y_pred = model_ridge.predict(X)
    return y_pred


# ridge regression with penalty = 0 is polynomial regression
y_pred = ridge_regression(scaled_X_test, penalty=0) # same as using ordinary linear regression
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)
MAE = mean_absolute_error(y_test, y_pred)

MAE, MSE, RMSE





(0.37485164412180333, 0.2650465950553843, 0.5148267621786812)
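For reference, the three metrics used above can be computed by hand. The numbers below are made up for illustration and are not taken from the Advertising data.

```python
import numpy as np

y_true_demo = np.array([3.0, 5.0, 2.0])
y_pred_demo = np.array([2.0, 5.0, 4.0])

errors = y_pred_demo - y_true_demo   # [-1, 0, 2]

mae_demo = np.mean(np.abs(errors))   # mean absolute error
mse_demo = np.mean(errors ** 2)      # mean squared error
rmse_demo = np.sqrt(mse_demo)        # root mean squared error, same unit as y

print(mae_demo, mse_demo, rmse_demo)
```

Here MAE = (1 + 0 + 2) / 3 = 1 and MSE = (1 + 0 + 4) / 3 = 5/3. RMSE is always at least as large as MAE, and the gap grows when a few errors are much larger than the rest.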

In [30]:
from sklearn.linear_model import LinearRegression

# So what is the difference? Compare with plain linear regression on the same features

# polynomial regression
model_linear = LinearRegression()
model_linear.fit(scaled_X_train, y_train)
y_pred = model_linear.predict(scaled_X_test)


MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)
MAE = mean_absolute_error(y_test, y_pred)

MAE, MSE, RMSE



(0.3748516441217811, 0.26504659505536016, 0.5148267621786576)
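The two results agree up to floating-point rounding. One way to see why is the closed-form ridge solution, sketched below on synthetic data. `ridge_closed_form` is a made-up helper. It leaves out the intercept and minimizes the sum of squared errors plus `lam` times the squared norm, so `lam` is scaled differently from λ in the formula above.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y, i.e. ridge without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=50)

theta_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]   # ordinary least squares
theta_ridge_0 = ridge_closed_form(X_demo, y_demo, lam=0.0)   # no penalty
theta_ridge_10 = ridge_closed_form(X_demo, y_demo, lam=10.0) # some penalty

print(np.allclose(theta_ols, theta_ridge_0))
print(np.linalg.norm(theta_ols), np.linalg.norm(theta_ridge_10))
```

With `lam=0` the penalty term vanishes and the system reduces to the normal equations of least squares. With `lam>0` the coefficient vector is pulled toward zero.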

In [55]:
# ridge regression with a small penalty (alpha = 0.01)
y_pred = ridge_regression(scaled_X_test, penalty=0.01)
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)
MAE = mean_absolute_error(y_test, y_pred)

MAE, MSE, RMSE

(0.37168406150022937, 0.26202411194690434, 0.5118829084340523)

## Lasso
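For comparison with the ridge cost function, lasso (L1 regularization) penalizes the absolute values of the coefficients instead of their squares. Written in the same notation as above, and leaving aside how a particular library scales the terms, the cost function is:

$C(\vec{\theta}) = MSE(\vec{\theta}) + \lambda \sum_{i=1}^n |\theta_i|$

Unlike the squared penalty, the absolute-value penalty can push coefficients to exactly 0, so lasso also acts as feature selection.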

In [56]:
from sklearn.linear_model import LassoCV # CV stands for cross-validation

# n_alphas is the number of alpha values to search over; specific alphas to search could be passed in instead
# cv is k in k-fold cross-validation
# the number of iterations could be increased with max_iter
model_lassoCV = LassoCV(n_alphas=200, cv=5)
model_lassoCV.fit(scaled_X_train, y_train)



In [57]:
model_lassoCV.alpha_ # best alpha, i.e. the penalty term, found through 5-fold cross-validation

0.004968802520343366
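What the cross-validated search does can be sketched by hand: for each candidate alpha, fit on k-1 folds, validate on the remaining fold, and keep the alpha with the lowest mean validation error. The sketch below uses synthetic data from `make_regression` and a short, arbitrary list of candidate alphas. It illustrates the idea and is not a reproduction of `LassoCV` internals.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X_demo, y_demo = make_regression(
    n_samples=120, n_features=10, n_informative=4, noise=5.0, random_state=0
)

candidate_alphas = [0.01, 0.1, 1.0, 10.0]  # arbitrary grid for illustration
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

mean_cv_mse = {}
for alpha in candidate_alphas:
    fold_mse = []
    for train_idx, val_idx in kfold.split(X_demo):
        model = Lasso(alpha=alpha, max_iter=10_000)
        model.fit(X_demo[train_idx], y_demo[train_idx])
        y_val_pred = model.predict(X_demo[val_idx])
        fold_mse.append(mean_squared_error(y_demo[val_idx], y_val_pred))
    mean_cv_mse[alpha] = np.mean(fold_mse)

# the alpha with the lowest mean validation MSE wins
best_alpha = min(mean_cv_mse, key=mean_cv_mse.get)
print(mean_cv_mse, best_alpha)
```

The test set is never touched during this search, which is why it can still be used afterwards for an honest evaluation.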

In [62]:
model_lassoCV.coef_ # shows how many coefficients are 0

# Lasso shrinks several coefficients all the way to 0

array([ 5.11468536,  0.42127203,  0.28896055, -4.63391705,  3.48972093,
       -0.390611  ,  0.        ,  0.        ,  0.        ,  1.24969939,
       -0.        ,  0.        ,  0.13795383, -0.01666923,  0.        ,
        0.        ,  0.10974819,  0.        ,  0.0458376 ])
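The zeros can be counted instead of read off by eye. The snippet below copies the coefficient array printed above into a literal so that it runs on its own; in the notebook one would use `model_lassoCV.coef_` directly.

```python
import numpy as np

# coefficients copied from the output above
coef_demo = np.array([
    5.11468536, 0.42127203, 0.28896055, -4.63391705, 3.48972093,
    -0.390611, 0.0, 0.0, 0.0, 1.24969939,
    -0.0, 0.0, 0.13795383, -0.01666923, 0.0,
    0.0, 0.10974819, 0.0, 0.0458376,
])

n_nonzero = np.count_nonzero(coef_demo)  # -0.0 counts as zero
n_zero = coef_demo.size - n_nonzero

print(n_zero, n_nonzero)  # prints: 8 11
```

So lasso has dropped 8 of the 19 polynomial features and kept 11.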

In [65]:
y_pred = model_lassoCV.predict(scaled_X_test)

# Compare with the ridge results above: here the cross-validated lasso scores worse on the test set than ridge with a hand-picked alpha
MSE = mean_squared_error(y_test, y_pred)
RMSE = np.sqrt(MSE)
MAE = mean_absolute_error(y_test, y_pred)

MAE, MSE, RMSE

(0.46802072322691207, 0.3410150044071009, 0.5839648999786724)

## ElasticNet

In [68]:
from sklearn.linear_model import ElasticNetCV

model_elastic = ElasticNetCV(l1_ratio=[.1,.5,.7,.9,.95,1])
model_elastic.fit(scaled_X_train, y_train)




In [69]:
model_elastic.l1_ratio_

# l1_ratio = 1 means a pure L1 penalty, so it wants us to do lasso (mom wants us to run lasso :D)

1.0

In [70]:
model_elastic.alpha_

0.004968802520343366
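An `l1_ratio_` of 1.0 explains why `alpha_` is identical to the one `LassoCV` found: elastic net mixes the L1 and L2 penalties, and at `l1_ratio = 1` only the L1 part remains. The sketch below writes out the penalty term in the form given in the scikit-learn documentation for `ElasticNet`. `elastic_net_penalty` is a made-up helper and the numbers are for illustration only.

```python
import numpy as np

def elastic_net_penalty(theta, alpha, l1_ratio):
    """alpha * l1_ratio * ||theta||_1 + 0.5 * alpha * (1 - l1_ratio) * ||theta||_2^2"""
    l1_norm = np.sum(np.abs(theta))
    l2_norm_squared = np.sum(theta ** 2)
    return alpha * l1_ratio * l1_norm + 0.5 * alpha * (1 - l1_ratio) * l2_norm_squared

theta_demo = np.array([1.0, -2.0, 0.0])  # L1 norm 3, squared L2 norm 5

print(elastic_net_penalty(theta_demo, alpha=0.5, l1_ratio=1.0))  # pure L1: 1.5
print(elastic_net_penalty(theta_demo, alpha=0.5, l1_ratio=0.0))  # pure L2: 1.25
print(elastic_net_penalty(theta_demo, alpha=0.5, l1_ratio=0.5))  # mix: 1.375
```

Values of `l1_ratio` between 0 and 1 trade off lasso's ability to zero out coefficients against ridge's smoother shrinkage.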