# Regularized linear regression : RIDGE 👮👮

0. Import the usual libraries

In [1]:
import numpy as np
import pandas as pd
# Force to display all columns in the notebook
pd.set_option('display.max_columns', None)

from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import  OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import r2_score

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# setting Jedha color palette as default
pio.templates["jedha"] = go.layout.Template(
    layout_colorway=["#4B9AC7", "#4BE8E0", "#9DD4F3", "#97FBF6", "#2A7FAF", "#23B1AB", "#0E3449", "#015955"]
)
pio.templates.default = "jedha"
pio.renderers.default = "iframe_connected" # to be replaced by "iframe" if working on JULIE

1. Load the Online news dataset from the src folder, and use this command to clean out column names :
```python
data.columns = [name.strip() for name in data.columns]
```

The description of this dataset is contained in the .txt file present in the same folder.

Use the following command to extract a sample of 1000 observations:
```python
data = data.sample(1000, random_state = 0)
```

Take a moment to display the data info in order to check for missing values.
We won't use the "url" column: drop it.
Judging from the variable names alone, some variables are bound to be collinear (for example "weekday_is_sunday" and "is_weekend", which are fully determined by the other weekday indicators): remove them. Also remove "LDA_00", "rate_positive_words" and "n_non_stop_words", which are nearly collinear with other variables on a small sample of data.


In [2]:
data = pd.read_csv("s3://full-stack-bigdata-datasets/Machine Learning Supervisé/Régression régularisées/news/OnlineNewsPopularity.csv")
data.columns = [name.strip() for name in data.columns]
data = data.sample(1000, random_state = 0)
data.head()

Unnamed: 0,url,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,average_token_length,num_keywords,data_channel_is_lifestyle,data_channel_is_entertainment,data_channel_is_bus,data_channel_is_socmed,data_channel_is_tech,data_channel_is_world,kw_min_min,kw_max_min,kw_avg_min,kw_min_max,kw_max_max,kw_avg_max,kw_min_avg,kw_max_avg,kw_avg_avg,self_reference_min_shares,self_reference_max_shares,self_reference_avg_sharess,weekday_is_monday,weekday_is_tuesday,weekday_is_wednesday,weekday_is_thursday,weekday_is_friday,weekday_is_saturday,weekday_is_sunday,is_weekend,LDA_00,LDA_01,LDA_02,LDA_03,LDA_04,global_subjectivity,global_sentiment_polarity,global_rate_positive_words,global_rate_negative_words,rate_positive_words,rate_negative_words,avg_positive_polarity,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,shares
18765,http://mashable.com/2014/01/13/nokia-first-and...,360.0,8.0,810.0,0.455696,1.0,0.62395,16.0,7.0,1.0,0.0,4.94321,7.0,0.0,0.0,0.0,0.0,0.0,1.0,-1.0,746.0,142.285714,12400.0,843300.0,241971.428571,1366.39726,3535.05551,2336.220331,488.0,4500.0,2247.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028572,0.028572,0.743268,0.028572,0.171017,0.511575,0.072003,0.019753,0.014815,0.571429,0.428571,0.355966,0.1,0.6,-0.194444,-0.6,-0.05,0.0,0.0,0.5,0.0,919
16349,http://mashable.com/2013/11/19/slow-motion-wed...,415.0,12.0,122.0,0.678571,1.0,0.783333,7.0,2.0,1.0,0.0,4.557377,7.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,633.0,211.5,0.0,843300.0,189985.714286,0.0,3396.488751,2510.601498,4300.0,4300.0,4300.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028908,0.028576,0.028606,0.028576,0.885334,0.451923,0.069231,0.02459,0.040984,0.375,0.625,0.666667,0.5,1.0,-0.22,-0.5,-0.15,0.433333,0.066667,0.066667,0.066667,1600
27703,http://mashable.com/2014/06/25/conan-obrien-wo...,197.0,12.0,891.0,0.391455,1.0,0.483649,6.0,3.0,22.0,2.0,4.712682,9.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.0,164.0,58.857143,0.0,843300.0,398911.111111,0.0,3799.224242,2395.346813,3700.0,3700.0,3700.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.022289,0.133647,0.022253,0.022224,0.799587,0.422562,0.219951,0.038159,0.005612,0.871795,0.128205,0.328018,0.1,1.0,-0.108333,-0.166667,-0.05,1.0,-0.25,0.5,0.25,11700
32947,http://mashable.com/2014/09/17/ios-8-without-d...,113.0,9.0,1323.0,0.380952,1.0,0.53074,31.0,11.0,13.0,0.0,4.561602,7.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.0,810.0,158.571429,12900.0,843300.0,418728.571429,2486.579592,3481.800852,2931.054867,810.0,48000.0,11356.666667,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.028615,0.028572,0.028585,0.028572,0.885657,0.393692,0.149161,0.037793,0.015117,0.714286,0.285714,0.359793,0.0625,0.8,-0.144266,-0.5,-0.05,0.0,0.0,0.5,0.0,18000
35434,http://mashable.com/2014/10/24/ebikes-commute-...,75.0,8.0,261.0,0.596154,1.0,0.721212,8.0,3.0,4.0,0.0,4.601533,5.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,529.0,115.0,42300.0,843300.0,607440.0,2494.426728,5880.397106,4186.229243,823.0,1200.0,1011.5,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.043388,0.04075,0.435827,0.440031,0.040003,0.278509,0.041667,0.022989,0.022989,0.5,0.5,0.330556,0.033333,1.0,-0.198611,-0.3,-0.125,0.344444,-0.227778,0.155556,0.227778,5800


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 18765 to 937
Data columns (total 61 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   url                            1000 non-null   object 
 1   timedelta                      1000 non-null   float64
 2   n_tokens_title                 1000 non-null   float64
 3   n_tokens_content               1000 non-null   float64
 4   n_unique_tokens                1000 non-null   float64
 5   n_non_stop_words               1000 non-null   float64
 6   n_non_stop_unique_tokens       1000 non-null   float64
 7   num_hrefs                      1000 non-null   float64
 8   num_self_hrefs                 1000 non-null   float64
 9   num_imgs                       1000 non-null   float64
 10  num_videos                     1000 non-null   float64
 11  average_token_length           1000 non-null   float64
 12  num_keywords                   1000 non-null 

There are no missing values in this dataset 😌😌

In [4]:
variables_to_keep = [col for col in data.columns if col not in ["url", "weekday_is_sunday","is_weekend", "LDA_00", "rate_positive_words", "n_non_stop_words"]]
data = data.loc[:,variables_to_keep]

1bis. Display a graph with the distribution of the variable shares, what can you conclude from this graph?

In [15]:
fig = px.histogram(x = data['shares'], nbins = 120, title = "Distribution of target variable")
fig.show()

In [7]:
fig = px.histogram(x = data['shares'], nbins = 120, log_y = True, title = 'Distribution of target variable (logarithmic scale)')
fig.show()

The distribution of the target variable is extremely skewed: a handful of very high values are present, which would make the data very hard to model. We could exclude the rows where Y takes extremely high values, but in this type of situation it is more common to convert the target variable to a logarithmic scale, which is what we do next.
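To make the effect of the logarithm concrete, here is a minimal sketch on synthetic data. The `shares_demo` array is a hypothetical log-normal sample standing in for a share count (it is not the notebook's data), and `skewness` is a small helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed target (assumption: log-normal, not the real "shares" column)
shares_demo = rng.lognormal(mean=7.0, sigma=1.0, size=10_000)

def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

skew_raw = skewness(shares_demo)
skew_log = skewness(np.log10(shares_demo))
print("skewness of raw values  :", skew_raw)
print("skewness of log10 values:", skew_log)
```

On this toy sample the raw values are strongly right-skewed while the log10 values are close to symmetric, which is the behaviour the histograms suggest for the real target.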

2. Create a dataframe containing the explanatory variables and another one containing only the target variable, which is the number of shares. Convert y to logarithmic scale using np.log10

In [8]:
y = data.iloc[:,-1]
y = np.log10(y)
X = data.iloc[:,:-1]

In [9]:
y.head()

18765    2.963316
16349    3.204120
27703    4.068186
32947    4.255273
35434    3.763428
Name: shares, dtype: float64

In [10]:
fig = px.histogram(x = y, nbins = 120, title = "Distribution of target variable (log10)")
fig.show()

In [11]:
X.head()

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,average_token_length,num_keywords,data_channel_is_lifestyle,data_channel_is_entertainment,data_channel_is_bus,data_channel_is_socmed,data_channel_is_tech,data_channel_is_world,kw_min_min,kw_max_min,kw_avg_min,kw_min_max,kw_max_max,kw_avg_max,kw_min_avg,kw_max_avg,kw_avg_avg,self_reference_min_shares,self_reference_max_shares,self_reference_avg_sharess,weekday_is_monday,weekday_is_tuesday,weekday_is_wednesday,weekday_is_thursday,weekday_is_friday,weekday_is_saturday,LDA_01,LDA_02,LDA_03,LDA_04,global_subjectivity,global_sentiment_polarity,global_rate_positive_words,global_rate_negative_words,rate_negative_words,avg_positive_polarity,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity
18765,360.0,8.0,810.0,0.455696,0.62395,16.0,7.0,1.0,0.0,4.94321,7.0,0.0,0.0,0.0,0.0,0.0,1.0,-1.0,746.0,142.285714,12400.0,843300.0,241971.428571,1366.39726,3535.05551,2336.220331,488.0,4500.0,2247.0,1.0,0.0,0.0,0.0,0.0,0.0,0.028572,0.743268,0.028572,0.171017,0.511575,0.072003,0.019753,0.014815,0.428571,0.355966,0.1,0.6,-0.194444,-0.6,-0.05,0.0,0.0,0.5,0.0
16349,415.0,12.0,122.0,0.678571,0.783333,7.0,2.0,1.0,0.0,4.557377,7.0,0.0,0.0,0.0,0.0,1.0,0.0,4.0,633.0,211.5,0.0,843300.0,189985.714286,0.0,3396.488751,2510.601498,4300.0,4300.0,4300.0,0.0,1.0,0.0,0.0,0.0,0.0,0.028576,0.028606,0.028576,0.885334,0.451923,0.069231,0.02459,0.040984,0.625,0.666667,0.5,1.0,-0.22,-0.5,-0.15,0.433333,0.066667,0.066667,0.066667
27703,197.0,12.0,891.0,0.391455,0.483649,6.0,3.0,22.0,2.0,4.712682,9.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.0,164.0,58.857143,0.0,843300.0,398911.111111,0.0,3799.224242,2395.346813,3700.0,3700.0,3700.0,0.0,0.0,1.0,0.0,0.0,0.0,0.133647,0.022253,0.022224,0.799587,0.422562,0.219951,0.038159,0.005612,0.128205,0.328018,0.1,1.0,-0.108333,-0.166667,-0.05,1.0,-0.25,0.5,0.25
32947,113.0,9.0,1323.0,0.380952,0.53074,31.0,11.0,13.0,0.0,4.561602,7.0,0.0,0.0,0.0,0.0,1.0,0.0,-1.0,810.0,158.571429,12900.0,843300.0,418728.571429,2486.579592,3481.800852,2931.054867,810.0,48000.0,11356.666667,0.0,0.0,1.0,0.0,0.0,0.0,0.028572,0.028585,0.028572,0.885657,0.393692,0.149161,0.037793,0.015117,0.285714,0.359793,0.0625,0.8,-0.144266,-0.5,-0.05,0.0,0.0,0.5,0.0
35434,75.0,8.0,261.0,0.596154,0.721212,8.0,3.0,4.0,0.0,4.601533,5.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,529.0,115.0,42300.0,843300.0,607440.0,2494.426728,5880.397106,4186.229243,823.0,1200.0,1011.5,0.0,0.0,0.0,0.0,0.0,1.0,0.04075,0.435827,0.440031,0.040003,0.278509,0.041667,0.022989,0.022989,0.5,0.330556,0.033333,1.0,-0.198611,-0.3,-0.125,0.344444,-0.227778,0.155556,0.227778


2bis. Produce a list giving the indices of all pairs of variables whose correlation is above 0.90

In [16]:
corr = X.corr()
high_corr = corr > 0.90
high_corr_list = [(i,j) for i in range(corr.shape[0]) for j in range(corr.shape[0]) if i != j and high_corr.iloc[i,j]]
high_corr_list

[(3, 4), (4, 3), (18, 19), (19, 18), (26, 28), (28, 26)]

2ter. Remove from X all variables that are correlated above 90%. Create an object X_clean that only contains the variables you would like to keep. If the list is empty, proceed as if it were not because we will need it later on.

In [17]:
no_keep = set([couple[0] for couple in high_corr_list])
keep = [i for i in range(X.shape[1]) if i not in no_keep]

X_clean = X.iloc[:,keep]
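Because the list of pairs is symmetric, taking `couple[0]` removes both members of every correlated pair. If the goal is to keep one variable per pair, a possible alternative is to look only at the upper triangle of the correlation matrix. The sketch below is an assumption-laden variant, not the notebook's method: `drop_one_of_each_pair` is a hypothetical helper, it uses the absolute correlation (so strong negative correlations count too), and it runs on a toy dataframe.

```python
import numpy as np
import pandas as pd

def drop_one_of_each_pair(df, threshold=0.90):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only entries strictly above the diagonal so each pair is seen once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data (hypothetical): "a_copy" is almost identical to "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

toy_clean, dropped = drop_one_of_each_pair(toy)
print(dropped)
print(list(toy_clean.columns))
```

Selecting by column name rather than by position also avoids mixing integer indices with labels when building the list of columns to keep.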

In [14]:
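# Caution: no_keep holds integer positions, not column names, so this
# name-based filter removes nothing. The positional selection of the
# previous cell (X.iloc[:, keep]) is the one that actually drops columns.
columns_to_keep = [c for c in X.columns if c not in no_keep]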

X_clean = X.loc[:, columns_to_keep]
X_clean.columns

Index(['timedelta', 'n_tokens_title', 'n_tokens_content', 'n_unique_tokens',
       'num_hrefs', 'num_self_hrefs', 'num_imgs', 'num_videos',
       'average_token_length', 'num_keywords', 'data_channel_is_lifestyle',
       'data_channel_is_entertainment', 'data_channel_is_bus',
       'data_channel_is_socmed', 'data_channel_is_tech',
       'data_channel_is_world', 'kw_min_min', 'kw_max_min', 'kw_min_max',
       'kw_max_max', 'kw_avg_max', 'kw_min_avg', 'kw_max_avg', 'kw_avg_avg',
       'self_reference_min_shares', 'self_reference_max_shares',
       'weekday_is_monday', 'weekday_is_tuesday', 'weekday_is_wednesday',
       'weekday_is_thursday', 'weekday_is_friday', 'weekday_is_saturday',
       'LDA_01', 'LDA_02', 'LDA_03', 'LDA_04', 'global_subjectivity',
       'global_sentiment_polarity', 'global_rate_positive_words',
       'global_rate_negative_words', 'rate_negative_words',
       'avg_positive_polarity', 'min_positive_polarity',
       'max_positive_polarity', 'avg_negative_

3. Are the different variables in your dataset on the same scale ? Verify this by using the describe method.

In [18]:
X_clean.describe()

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,num_hrefs,num_self_hrefs,num_imgs,num_videos,average_token_length,num_keywords,data_channel_is_lifestyle,data_channel_is_entertainment,data_channel_is_bus,data_channel_is_socmed,data_channel_is_tech,data_channel_is_world,kw_min_min,kw_min_max,kw_max_max,kw_avg_max,kw_min_avg,kw_max_avg,kw_avg_avg,self_reference_max_shares,weekday_is_monday,weekday_is_tuesday,weekday_is_wednesday,weekday_is_thursday,weekday_is_friday,weekday_is_saturday,LDA_01,LDA_02,LDA_03,LDA_04,global_subjectivity,global_sentiment_polarity,global_rate_positive_words,global_rate_negative_words,rate_negative_words,avg_positive_polarity,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,347.879,10.45,544.41,11.495,3.33,4.364,1.374,4.566265,7.19,0.071,0.155,0.179,0.047,0.181,0.22,23.396,14001.259,764439.0,267915.591185,1051.344541,5837.981819,3167.298728,10329.515,0.171,0.198,0.186,0.188,0.149,0.05,0.133513,0.216622,0.210093,0.238889,0.448885,0.118456,0.039912,0.016804,0.289137,0.35602,0.094289,0.766214,-0.263951,-0.534681,-0.1077,0.288709,0.057511,0.339552,0.157284
std,213.608869,2.122264,441.576115,12.462315,4.018862,7.560467,4.497594,0.766976,1.92083,0.256953,0.362086,0.383544,0.211745,0.385211,0.414454,66.383642,58446.111617,191381.569447,134936.208217,1114.166303,7022.567331,1620.230431,35368.240283,0.376697,0.398692,0.389301,0.390908,0.356267,0.218054,0.214749,0.278962,0.287863,0.289179,0.110099,0.092484,0.017494,0.010756,0.149337,0.10203,0.06925,0.244911,0.125986,0.292579,0.096243,0.325868,0.266199,0.188327,0.222283
min,9.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018227,0.02,0.02,0.018187,0.0,-0.267949,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,0.0
25%,161.0,9.0,248.0,4.0,1.0,1.0,0.0,4.477636,6.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,843300.0,172823.9,0.0,3566.297886,2422.624166,1200.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025036,0.028572,0.028571,0.028587,0.399413,0.060225,0.028442,0.009738,0.192308,0.306071,0.05,0.6,-0.333333,-0.75,-0.125,0.0,0.0,0.166667,0.0
50%,331.5,10.0,411.5,8.0,2.0,1.0,0.0,4.668207,7.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,1250.0,843300.0,248185.714286,927.75,4396.245773,2865.209621,3000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033344,0.040015,0.04,0.05,0.458629,0.118323,0.038613,0.015511,0.285714,0.361991,0.1,0.8,-0.258333,-0.5,-0.1,0.2,0.0,0.5,0.033333
75%,539.0,12.0,748.0,14.0,4.0,4.0,1.0,4.841608,9.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,6625.0,843300.0,344100.35,1966.086364,5935.906533,3601.081724,8200.0,0.0,0.0,0.0,0.0,0.0,0.0,0.129532,0.334695,0.310754,0.417068,0.511995,0.177887,0.050689,0.021576,0.37931,0.41249,0.1,1.0,-0.190989,-0.3,-0.05,0.5,0.136364,0.5,0.25
max,731.0,18.0,4089.0,120.0,63.0,70.0,51.0,5.847262,10.0,1.0,1.0,1.0,1.0,1.0,1.0,217.0,690400.0,843300.0,798220.0,3609.718376,138700.0,36023.424516,663600.0,1.0,1.0,1.0,1.0,1.0,1.0,0.919755,0.919999,0.919767,0.919983,0.775,0.475435,0.136986,0.080645,1.0,0.8,0.6,1.0,0.0,0.0,0.0,1.0,1.0,0.5,1.0


3bis. Use the `train_test_split` command from the `sklearn.model_selection` package to create a training sample containing 70% of the observations and a test sample containing 30% of the observations.

In [19]:
# Note: the split is done on X (not X_clean), so the models below use all the features of X
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

3ter. Is it important to normalize data before training a penalized model? If yes normalize your data.

In [20]:
# It is essential to standardize the data when using a penalized model because the penalty
# applies to the size of the coefficients, which depends directly on the scale of the variables.
# The scaler is fitted on the training set only, then applied to the test set.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
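The same fit-on-train, transform-on-test logic can be bundled into a single estimator with `make_pipeline`, which becomes convenient once cross-validation is involved because the scaler is then refitted inside each fold. This is a minimal sketch on synthetic data: `X_demo`, `y_demo` and the coefficient values are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features living on very different scales
X_demo = rng.normal(size=(300, 5)) * np.array([1, 10, 100, 1000, 10000])
y_demo = X_demo @ np.array([1.0, 0.1, 0.01, 0.001, 0.0001]) + rng.normal(size=300)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# The scaler learns its mean and standard deviation from the training data only
model = make_pipeline(StandardScaler(), Ridge(alpha=10))
model.fit(Xd_train, yd_train)
demo_test_score = model.score(Xd_test, yd_test)
print("R2 on the test set:", demo_test_score)
```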

4. Generate a classical linear regression model, a ridge model where alpha is 10 and a ridge model where alpha is 10000.

In [21]:
linear_regressor = LinearRegression()
ridge_regressor_small_alpha = Ridge(alpha = 10)
ridge_regressor_large_alpha = Ridge(alpha = 10000)

5. Train these models on the train data

In [22]:
linear_regressor.fit(X_train, y_train)
ridge_regressor_small_alpha.fit(X_train, y_train)
ridge_regressor_large_alpha.fit(X_train, y_train)

Ridge(alpha=10000)

6. Compute performance scores for the three models on the training and test samples using the .score method.
What can you conclude from the scores obtained on the training sample?
What can you conclude from the scores obtained on the test sample?

In [23]:
print("Score on training: ")
print("Linear Regression score : {}".format(linear_regressor.score(X_train, y_train)))
print("Ridge with small Alpha score : {}".format(ridge_regressor_small_alpha.score(X_train, y_train)))
print("Ridge with large Alpha score : {}".format(ridge_regressor_large_alpha.score(X_train,y_train)))

Score on training: 
Linear Regression score : 0.20309718532561072
Ridge with small Alpha score : 0.20148323455755235
Ridge with large Alpha score : 0.03703249451445478


The score returned by sklearn is $R^2$, and it decreases on the training set as the penalization parameter alpha increases. This is fully in line with the theory: ordinary least squares minimizes the residual sum of squares on the training data, so any penalized fit can only have an equal or larger training residual sum of squares, and therefore a lower $R^2$. In other words, increasing alpha increases the bias of the model (it pulls the coefficients away from the least-squares solution) in exchange for a lower variance.
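For reference, the argument can be written out as follows. The notation is an assumption of this note: $X$ is the standardized training matrix, $y$ the centered target, $\beta$ the coefficient vector, and the intercept is left out of the penalty.

$$
\hat{\beta}_{\alpha} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2
\qquad\Longrightarrow\qquad
\hat{\beta}_{\alpha} = (X^\top X + \alpha I)^{-1} X^\top y
$$

$$
\lVert y - X\hat{\beta}_{\alpha} \rVert_2^2 \;\ge\; \lVert y - X\hat{\beta}_{0} \rVert_2^2
\qquad\Longrightarrow\qquad
R^2_{train}(\alpha) \;\le\; R^2_{train}(0)
$$

The inequality holds because $\hat{\beta}_{0}$ is by definition the minimizer of the residual sum of squares on the training data.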

In [24]:
print("Score on test: ")
print("Linear Regression score: {}".format(linear_regressor.score(X_test, y_test)))
print("Ridge with small Alpha score: {}".format(ridge_regressor_small_alpha.score(X_test, y_test)))
print("Ridge with large Alpha score: {}".format(ridge_regressor_large_alpha.score(X_test,y_test)))

Score on test: 
Linear Regression score: -0.2641530446855114
Ridge with small Alpha score: -0.1731083700517222
Ridge with large Alpha score: 0.009529922078841513


The test scores are very interesting. First of all, they are much lower than those obtained on the training sample. In addition, one can notice the following:
- $\alpha = 0 \implies R^2_{test} \ll R^2_{train}$: without any regularization, the model is overfitting
- $\alpha = 10 \implies R^2_{test} \ll R^2_{train}$: with a small $\alpha$, the model is still overfitting
- $\alpha = 10000 \implies R^2_{test} \sim R^2_{train}$ but both low: with a high $\alpha$, the model is underfitting (the score deteriorated on the train set)

So there seems to be a middle ground: some intermediate value of $\alpha$ should give the best results on the test set. This shows that Ridge can help us find the best compromise between bias and variance for a linear regression model.

7. Compare the coefficients of the three models using a table, what do you notice?

In [25]:
coef = pd.DataFrame()
coef['features'] = X.columns
coef['coef_linear_regressor'] = linear_regressor.coef_
coef['coef_ridge_small_alpha'] = ridge_regressor_small_alpha.coef_
coef['coef_ridge_large_alpha'] = ridge_regressor_large_alpha.coef_
coef

Unnamed: 0,features,coef_linear_regressor,coef_ridge_small_alpha,coef_ridge_large_alpha
0,timedelta,-0.005779,-0.003683,0.000685
1,n_tokens_title,0.010425,0.009204,-5.4e-05
2,n_tokens_content,0.064608,0.052034,0.001789
3,n_unique_tokens,0.130484,0.081234,-0.001164
4,n_non_stop_unique_tokens,-0.11348,-0.075011,-0.002259
5,num_hrefs,-0.011959,-0.009116,0.00254
6,num_self_hrefs,-0.007465,-0.007948,0.00069
7,num_imgs,0.030227,0.032087,0.003977
8,num_videos,-0.00735,-0.008363,0.000231
9,average_token_length,-0.014288,-0.009037,-0.001326


We notice that the higher the value of alpha, the more the coefficients shrink toward zero.

In [26]:
perf_lin = pd.DataFrame({"params": linear_regressor.coef_,
                         "model": "linear_regressor",
                         "index": range(0, len(X.columns))})

# Labels match the alpha values actually used to build the models (10000 and 10)
perf_ridge_large_alpha = pd.DataFrame({"params": ridge_regressor_large_alpha.coef_,
                                       "model": "ridge Alpha = 10000",
                                       "index": range(0, len(X.columns))})

perf_ridge_small_alpha = pd.DataFrame({"params": ridge_regressor_small_alpha.coef_,
                                       "model": "ridge Alpha = 10",
                                       "index": range(0, len(X.columns))})

perf_compar = pd.concat([perf_ridge_large_alpha,perf_ridge_small_alpha,perf_lin])

px.line(perf_compar, x = 'index', y = 'params', color = 'model')

The shrinkage is even more visible in the figure.
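The shrinkage can also be summarized with a single number per model, the Euclidean norm of the coefficient vector. The sketch below uses synthetic data (`X_demo`, `y_demo` and the list of alphas are assumptions made for the illustration) and records the training $R^2$ alongside the norm.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Hypothetical data: 20 features, only the first two carry signal
X_demo = rng.normal(size=(100, 20))
y_demo = X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(size=100)

alphas = [0.1, 10, 1000, 10000]
coef_norms, train_scores = [], []
for alpha in alphas:
    ridge_demo = Ridge(alpha=alpha).fit(X_demo, y_demo)
    coef_norms.append(np.linalg.norm(ridge_demo.coef_))       # size of the coefficient vector
    train_scores.append(ridge_demo.score(X_demo, y_demo))      # R2 on the training data

for alpha, norm, score in zip(alphas, coef_norms, train_scores):
    print(f"alpha={alpha:>7}: ||coef||={norm:.4f}, train R2={score:.4f}")
```

Both the norm and the training score go down as alpha grows, which is the pattern seen in the table and the figure.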

8. Find the optimal value for the hyper-parameter alpha using the sklearn class GridSearchCV. Try values from 0 to 9900 with a step of 100, use a value of 10 as the "cv" parameter and a value of 1 for the "verbose" parameter.

In [27]:
params = {'alpha': np.arange(0,10000,100)} # determine the range of parameters to try
ridge = Ridge() # create an instance of the model

grid = GridSearchCV(ridge, params, cv = 10, verbose = 1)
grid_fit = grid.fit(X_train, y_train)

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


In [28]:
print("Optimal value for alpha : ", grid_fit.best_params_)

Optimal value for alpha :  {'alpha': 700}
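A linear grid spends most of its candidates on large values of alpha. A common alternative, sketched here on synthetic data, is to search on a logarithmic scale with `np.logspace`. The data, the grid bounds and the names `X_demo`, `y_demo`, `log_grid` are assumptions chosen for the illustration, not values taken from this exercise.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Hypothetical data: 30 features, only the first one carries signal
X_demo = rng.normal(size=(200, 30))
y_demo = X_demo[:, 0] + rng.normal(size=200)

# 13 candidates from 0.01 to 10000, evenly spaced on a log10 scale
log_params = {"alpha": np.logspace(-2, 4, 13)}

log_grid = GridSearchCV(Ridge(), log_params, cv=10)
log_grid.fit(X_demo, y_demo)

print("Best alpha          :", log_grid.best_params_["alpha"])
print("Best mean CV R2     :", log_grid.best_score_)
```

Once the best order of magnitude is known, a finer grid around it can refine the choice if needed.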


9. What is the score on the test set obtained using this optimal alpha parameter? You might find a score that seems lower than the ones obtained before grid search. Can you explain why?

In [29]:
print('Test score for the best model : ', grid_fit.best_estimator_.score(X_test,y_test))

Test score for the best model :  0.011984750234474428


In [30]:
scores = cross_val_score(grid_fit.best_estimator_, X_train, y_train, cv = 10)

print('The cross-validated R2-score is : ', scores.mean())
print('The standard deviation is : ', scores.std())

The cross-validated R2-score is :  0.06550028972850007
The standard deviation is :  0.04813737501083121


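Actually, the test score of the best model is not **significantly different** from the one obtained earlier with $\alpha = 10000$ (about 0.012 versus 0.010): the cross-validated $R^2$ has a standard deviation of about 0.05 across folds, so a difference of a few thousandths is well within the noise.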

10. **Bonus question** Try going back to the beginning of the exercise and running it without extracting a small sample of data from the original dataset, do you get the same types of results? What does it tell you about ridge regression?

When the number of samples increases, ridge no longer seems to do better than plain linear regression: the penalization brings no benefit. What does this tell us? With a relatively small sample, ridge was better than linear regression, meaning that linear regression's variance was too high (and its bias low) to give good results on the test set, so the penalized version, ridge, gave us better results. This is linked to the fact that with few observations relative to the number of variables, the estimated coefficients are very sensitive to the particular sample drawn, so a model with lower variance is needed.

When we go back to the full dataset, the coefficients of the linear regression are estimated much more precisely, so its variance drops sharply. There is then little variance left to trade against extra bias, which explains why, when all samples are used, the results of ridge are not as convincing anymore.
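The effect of sample size can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: 50 features with small true coefficients, Gaussian noise, a fixed `alpha=100`, and hypothetical helper names (`make_data`, `compare_models`). It compares the $R^2$ of linear regression and ridge on a large held-out set when training on 100 and on 20000 observations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n_features = 50
true_coef = rng.normal(scale=0.1, size=n_features)  # weak signal spread over all features

def make_data(n_samples):
    """Draw a synthetic regression sample with unit-variance noise."""
    X = rng.normal(size=(n_samples, n_features))
    y = X @ true_coef + rng.normal(size=n_samples)
    return X, y

X_eval, y_eval = make_data(5000)  # large held-out set used for scoring

def compare_models(n_train, alpha=100):
    """Return held-out R2 for linear regression and ridge trained on n_train rows."""
    X_tr, y_tr = make_data(n_train)
    ols = LinearRegression().fit(X_tr, y_tr)
    ridge_demo = Ridge(alpha=alpha).fit(X_tr, y_tr)
    return ols.score(X_eval, y_eval), ridge_demo.score(X_eval, y_eval)

results = {n_train: compare_models(n_train) for n_train in (100, 20000)}
for n_train, (ols_r2, ridge_r2) in results.items():
    print(f"n_train={n_train:>6}: linear R2={ols_r2:.3f}, ridge R2={ridge_r2:.3f}")
```

With few training rows the unpenalized model overfits and ridge scores higher on held-out data, while with many rows the two models become almost indistinguishable.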