# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.15.2
#   kernelspec:
#     display_name: Python 3
#     name: python3
# ---
# %% [markdown]
# # 📝 Exercise M1.05
#
# The goal of this exercise is to evaluate the impact of feature preprocessing
# on a pipeline that uses a decision-tree-based classifier instead of a logistic
# regression.
#
# - The first question is to empirically evaluate whether scaling numerical
#   features is helpful or not;
# - The second question is to evaluate whether it is empirically better (both
#   from a computational and a statistical perspective) to use integer-coded
#   or one-hot encoded categories.
# %%
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
# %%
target_name = "class"
target = adult_census[target_name]
# "education-num" is redundant with "education", so we drop it
data = adult_census.drop(columns=[target_name, "education-num"])
# %% [markdown]
# As in the previous notebooks, we use the utility `make_column_selector` to
# select only columns with a specific data type. Here, we use it to split the
# columns into numerical and categorical subsets.
# %%
from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)
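# %% [markdown]
# As a quick sanity check (not part of the original exercise), we can print
# the names of the selected columns to confirm the split:
# %%
print(f"Numerical columns: {numerical_columns}")
print(f"Categorical columns: {categorical_columns}")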
# %% [markdown]
# ## Reference pipeline (no numerical scaling and integer-coded categories)
#
# First, let's time the pipeline we used in the main notebook to serve as a
# reference:
# %%
import time

from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import HistGradientBoostingClassifier

# Encode categories as integers; unseen categories at predict time are mapped
# to -1 instead of raising an error.
categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = ColumnTransformer(
    [("categorical", categorical_preprocessor, categorical_columns)],
    remainder="passthrough",
)
model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

start = time.time()
cv_results = cross_validate(model, data, target)
elapsed_time = time.time() - start

scores = cv_results["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f} "
    f"with a fitting time of {elapsed_time:.3f} seconds"
)
# %% [markdown]
# ## Scaling numerical features
#
# Let's write a similar pipeline that also scales the numerical features using
# `StandardScaler` (or similar):
# %%
# Write your code here.
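# %% [markdown]
# One possible sketch (an illustration, not the official solution; the
# variable names `scaled_preprocessor` and `scaled_model` are our own): reuse
# the reference pipeline but route the numerical columns through a
# `StandardScaler` in the `ColumnTransformer`. This relies on the
# `categorical_preprocessor`, `data` and `target` variables defined above.
# %%
from sklearn.preprocessing import StandardScaler

# Scale the numerical columns and ordinal-encode the categorical ones; the
# two transformers together cover all remaining columns.
scaled_preprocessor = ColumnTransformer(
    [
        ("numerical", StandardScaler(), numerical_columns),
        ("categorical", categorical_preprocessor, categorical_columns),
    ]
)
scaled_model = make_pipeline(scaled_preprocessor, HistGradientBoostingClassifier())

start = time.time()
cv_results = cross_validate(scaled_model, data, target)
elapsed_time = time.time() - start

scores = cv_results["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f} "
    f"with a fitting time of {elapsed_time:.3f} seconds"
)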
# %% [markdown]
# ## One-hot encoding of categorical variables
#
# We observed that integer coding of categorical variables can be very
# detrimental for linear models. However, this does not seem to be the case
# for `HistGradientBoostingClassifier` models, as the cross-validation score
# of the reference pipeline with `OrdinalEncoder` is reasonably good.
#
# Let's see if we can get an even better accuracy with `OneHotEncoder`.
#
# Hint: `HistGradientBoostingClassifier` does not yet support sparse input data.
# You might want to use `OneHotEncoder(handle_unknown="ignore",
# sparse_output=False)` to force the use of a dense representation as a
# workaround.
# %%
# Write your code here.
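# %% [markdown]
# One possible sketch (an illustration, not the official solution; the
# variable names `one_hot_preprocessor` and `one_hot_model` are our own):
# replace the `OrdinalEncoder` with a dense `OneHotEncoder` as suggested in
# the hint above, and time the pipeline again. Note that the `sparse_output`
# parameter requires scikit-learn >= 1.2 (it was named `sparse` before).
# %%
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns with a dense output, passing the
# numerical columns through unchanged.
one_hot_preprocessor = ColumnTransformer(
    [
        (
            "categorical",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False),
            categorical_columns,
        )
    ],
    remainder="passthrough",
)
one_hot_model = make_pipeline(one_hot_preprocessor, HistGradientBoostingClassifier())

start = time.time()
cv_results = cross_validate(one_hot_model, data, target)
elapsed_time = time.time() - start

scores = cv_results["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f} "
    f"with a fitting time of {elapsed_time:.3f} seconds"
)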