/
linear_models_sol_04.py
132 lines (107 loc) · 4.01 KB
/
linear_models_sol_04.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
# %% [markdown]
# # 📃 Solution for Exercise M4.04
#
# In the previous notebook, we saw the effect of applying some regularization
# on the coefficient of a linear model.
#
# In this exercise, we will study the advantage of using some regularization
# when dealing with correlated features.
#
# We will first create a regression dataset. This dataset will contain 2,000
# samples and 5 features from which only 2 features will be informative.
# %%
from sklearn.datasets import make_regression
data, target, coef = make_regression(
n_samples=2_000,
n_features=5,
n_informative=2,
shuffle=False,
coef=True,
random_state=0,
noise=30,
)
# %% [markdown]
# When creating the dataset, `make_regression` returns the true coefficient
# used to generate the dataset. Let's plot this information.
# %%
import pandas as pd
feature_names = [f"Features {i}" for i in range(data.shape[1])]
coef = pd.Series(coef, index=feature_names)
coef.plot.barh()
coef
# %% [markdown]
# Create a `LinearRegression` regressor and fit on the entire dataset and
# check the value of the coefficients. Are the coefficients of the linear
# regressor close to the coefficients used to generate the dataset?
# %%
from sklearn.linear_model import LinearRegression
linear_regression = LinearRegression()
linear_regression.fit(data, target)
linear_regression.coef_
# %%
feature_names = [f"Features {i}" for i in range(data.shape[1])]
coef = pd.Series(linear_regression.coef_, index=feature_names)
_ = coef.plot.barh()
# %% [markdown]
# We see that the coefficients are close to the coefficients used to generate
# the dataset. The dispersion is indeed cause by the noise injected during the
# dataset generation.
# %% [markdown]
# Now, create a new dataset that will be the same as `data` with 4 additional
# columns that will repeat twice features 0 and 1. This procedure will create
# perfectly correlated features.
# %%
import numpy as np
data = np.concatenate([data, data[:, [0, 1]], data[:, [0, 1]]], axis=1)
# %% [markdown]
# Fit again the linear regressor on this new dataset and check the
# coefficients. What do you observe?
# %%
linear_regression = LinearRegression()
linear_regression.fit(data, target)
linear_regression.coef_
# %%
feature_names = [f"Features {i}" for i in range(data.shape[1])]
coef = pd.Series(linear_regression.coef_, index=feature_names)
_ = coef.plot.barh()
# %% [markdown]
# We see that the coefficient values are far from what one could expect.
# By repeating the informative features, one would have expected these
# coefficients to be similarly informative.
#
# Instead, we see that some coefficients have a huge norm ~1e14. It indeed
# means that we try to solve an mathematical ill-posed problem. Indeed, finding
# coefficients in a linear regression involves inverting the matrix
# `np.dot(data.T, data)` which is not possible (or lead to high numerical
# errors).
# %% [markdown]
# Create a ridge regressor and fit on the same dataset. Check the coefficients.
# What do you observe?
# %%
from sklearn.linear_model import Ridge
ridge = Ridge()
ridge.fit(data, target)
ridge.coef_
# %%
coef = pd.Series(ridge.coef_, index=feature_names)
_ = coef.plot.barh()
# %% [markdown]
# We see that the penalty applied on the weights give a better results: the
# values of the coefficients do not suffer from numerical issues. Indeed, the
# matrix to be inverted internally is `np.dot(data.T, data) + alpha * I`.
# Adding this penalty `alpha` allow the inversion without numerical issue.
# %% [markdown]
# Can you find the relationship between the ridge coefficients and the original
# coefficients?
# %%
ridge.coef_[:5] * 3
# %% [markdown]
# Repeating three times each informative features induced to divide the
# ridge coefficients by three.
# %% [markdown]
# ```{tip}
# We always advise to use l2-penalized model instead of non-penalized model
# in practice. In scikit-learn, `LogisticRegression` applies such penalty
# by default. However, one needs to use `Ridge` (and even `RidgeCV` to tune
# the parameter `alpha`) instead of `LinearRegression`.
# ```