# ---
# jupyter:
# kernelspec:
# display_name: Python 3
# name: python3
# ---
# %% [markdown]
# # Set and get hyperparameters in scikit-learn
#
# Recall that hyperparameters refer to the parameters that control the learning
# process of a predictive model and are specific to each family of models. In
# addition, the optimal set of hyperparameters is specific to each dataset and
# thus always needs to be tuned.
#
# This notebook shows how one can get and set the value of a hyperparameter in a
# scikit-learn estimator.
#
# They should not be confused with the fitted parameters, resulting from the
# training. These fitted parameters are recognizable in scikit-learn because
# they are spelled with a final underscore `_`, for instance `model.coef_`.
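#
# For instance (a quick illustration on a tiny made-up dataset, not the census
# data used below), fitting a `LogisticRegression` creates such fitted
# attributes, e.g. `coef_` and `intercept_`:
# %%
import numpy as np
from sklearn.linear_model import LogisticRegression

demo_model = LogisticRegression()
# fit on 4 one-feature samples; the fitted attributes appear after `.fit`
demo_model.fit(np.array([[0.0], [1.0], [2.0], [3.0]]), np.array([0, 0, 1, 1]))
print(demo_model.coef_, demo_model.intercept_)
# %% [markdown]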
#
# We start by loading the adult census dataset and only use the numerical
# features.
# %%
import pandas as pd
adult_census = pd.read_csv("../datasets/adult-census.csv")
target_name = "class"
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
target = adult_census[target_name]
data = adult_census[numerical_columns]
# %% [markdown]
# Our data is only numerical.
# %%
data
# %% [markdown]
# Let's create a simple predictive model made of a scaler followed by a logistic
# regression classifier.
#
# As mentioned in previous notebooks, many models, including linear ones, work
# better if all features have a similar scaling. For this purpose, we use a
# `StandardScaler`, which transforms the data by rescaling features.
# %%
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
model = Pipeline(
    steps=[
        ("preprocessor", StandardScaler()),
        ("classifier", LogisticRegression()),
    ]
)
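# %% [markdown]
# As a small check (illustrative only), the step names we chose above are
# exposed by the pipeline's `named_steps` attribute:
# %%
print(list(model.named_steps))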
# %% [markdown]
# We can evaluate the generalization performance of the model via
# cross-validation.
# %%
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print(
    "Accuracy score via cross-validation:\n"
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)
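# %% [markdown]
# Besides `test_score`, the dictionary returned by `cross_validate` also
# records the fit and score times of each cross-validation split:
# %%
print(sorted(cv_results))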
# %% [markdown]
# We created the model with the default value of `C`, which is equal to 1. If
# we wanted a different `C` hyperparameter, we could have passed it when
# creating the `LogisticRegression` object, e.g. `LogisticRegression(C=1e-3)`.
#
# ```{note}
# For more information on the model hyperparameter `C`, refer to the
# [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
# Be aware that we will focus on linear models in an upcoming module.
# ```
#
# We can also change the hyperparameter of a model after it has been created
# with the `set_params` method, which is available for all scikit-learn
# estimators. For example, we can set `C=1e-3`, fit and evaluate the model:
# %%
model.set_params(classifier__C=1e-3)
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print(
    "Accuracy score via cross-validation:\n"
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)
# %% [markdown]
# When the model of interest is a `Pipeline`, the hyperparameter names are of
# the form `<model_name>__<hyperparameter_name>` (note the double underscore in
# the middle). In our case, `classifier` comes from the `Pipeline` definition
# and `C` is the hyperparameter name of `LogisticRegression`.
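#
# As a quick illustration, `set_params` accepts several of these
# pipeline-prefixed names at once. Here we set `C` back to 1 and pass the
# scaler's `with_mean` flag (both are the defaults, so this does not change
# the pipeline's behavior):
# %%
model.set_params(classifier__C=1.0, preprocessor__with_mean=True)
# %% [markdown]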
#
# In general, you can use the `get_params` method on scikit-learn models to list
# all the hyperparameters with their values. For example, if you want to get all
# the hyperparameter names, you can use:
# %%
for parameter in model.get_params():
    print(parameter)
# %% [markdown]
# `.get_params()` returns a `dict` whose keys are the hyperparameter names and
# whose values are the hyperparameter values. If you want to get the value of a
# single hyperparameter, for example `classifier__C`, you can use:
# %%
model.get_params()["classifier__C"]
# %% [markdown]
# We can systematically vary the value of `C` to see if there is an optimal
# value.
# %%
for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    model.set_params(classifier__C=C)
    cv_results = cross_validate(model, data, target)
    scores = cv_results["test_score"]
    print(
        f"Accuracy score via cross-validation with C={C}:\n"
        f"{scores.mean():.3f} ± {scores.std():.3f}"
    )
# %% [markdown]
# We can see that as long as `C` is high enough, the model seems to perform
# well.
#
# What we did here is very manual: it involves scanning the values for C and
# picking the best one manually. In the next lesson, we will see how to do this
# automatically.
#
# ```{warning}
# When we evaluate a family of models on test data and pick the best performer,
# we cannot trust the corresponding prediction accuracy, and we need to apply
# the selected model to new data. Indeed, the test data has been used to select
# the model, and it is thus no longer independent from this model.
# ```
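# %% [markdown]
# As a sketch only (this is still the manual approach, not the automated
# search of the next lesson), the scan above can be made programmatic by
# collecting the mean scores and taking the best value of `C`:
# %%
mean_scores = {}
for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    model.set_params(classifier__C=C)
    cv_results = cross_validate(model, data, target)
    # store the mean test score for this candidate value of C
    mean_scores[C] = cv_results["test_score"].mean()
best_C = max(mean_scores, key=mean_scores.get)
print(f"Best C among the candidates: {best_C}")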
# %% [markdown]
# In this notebook we have seen:
#
# - how to use `get_params` and `set_params` to get and set the
# hyperparameters of a model.