In [1]:
import pandas as pd
from pathlib import Path

data_folder = Path("/home/luba/Documents/DS/projects-courses-ongoing/sklearn-course-inria-[doing]/datasets")
figure_folders = Path("/home/luba/Documents/DS/projects-courses-ongoing/sklearn-course-inria-[doing]/figures")

adult_census = pd.read_csv(data_folder.joinpath("adult-census.csv"))
adult_census = adult_census.drop(columns="education-num")
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name,])

In [5]:
from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)

<div class="admonition caution alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Caution!</p>
<p>Here, we know that the <tt class="docutils literal">object</tt> data type is used to represent strings and thus
categorical features. Be aware that this is not always the case. Sometimes
an <tt class="docutils literal">object</tt> column contains other types of information, such as dates that
were not properly parsed (and are therefore stored as strings) and yet relate to a quantity of
elapsed time.</p>
<p class="last">In a more general scenario, you should manually inspect the content of your
dataframe so that you do not misuse <tt class="docutils literal">make_column_selector</tt>.</p>
</div>
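
As a minimal sketch of the pitfall described in the caution above, the toy dataframe below (hypothetical data, not part of the adult census dataset) contains a date column stored as strings. The columns are explicitly cast to `object` so that the example does not depend on the default string dtype of the installed pandas version. The selector picks up the date column as if it were categorical until the column is parsed with `pd.to_datetime`.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy data: "signup_date" holds dates stored as plain strings.
toy = pd.DataFrame({
    "age": [25, 38, 52],
    "workclass": ["Private", "State-gov", "Private"],
    "signup_date": ["2021-01-05", "2021-03-17", "2021-07-30"],
})
# Force the object dtype so the behaviour is the same across pandas versions.
toy = toy.astype({"workclass": object, "signup_date": object})

object_selector = make_column_selector(dtype_include=object)

# Before parsing, the date column is selected together with the real category.
object_columns = object_selector(toy)

# After parsing the dates, only the genuinely categorical column remains.
toy["signup_date"] = pd.to_datetime(toy["signup_date"])
object_columns_after = object_selector(toy)

print(object_columns)
print(object_columns_after)
```

Checking `toy.dtypes` and a few rows of each column before relying on a dtype-based selector is usually enough to catch this kind of issue.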

## Dispatch columns to a specific processor

In the previous sections, we saw that we need to treat data differently
depending on their nature (i.e. numerical or categorical).

Scikit-learn provides a `ColumnTransformer` class which will send specific
columns to a specific transformer, making it easy to fit a single predictive
model on a dataset that combines both kinds of variables together
(heterogeneously typed tabular data).

We first define the preprocessing to apply to the columns depending on their data type:

* **one-hot encoding** will be applied to categorical columns. In addition, we
  use `handle_unknown="ignore"` so that categories not seen during `fit`
  (typically rare ones) do not raise an error at prediction time.
* **numerical scaling** will be applied to numerical columns, which will be
  standardized.

Now, we create our `ColumnTransformer` by specifying three values for each
preprocessor: a name, the transformer, and the columns it applies to.
First, let's create the preprocessors for the numerical and categorical
parts.

In [2]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()
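
To make the effect of `handle_unknown="ignore"` concrete, here is a small sketch on a hypothetical single-column array (the category names are illustrative only). A category that was not present during `fit` is encoded as a row of zeros instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column: two categories are seen during fit.
train = np.array([["Private"], ["State-gov"], ["Private"]], dtype=object)
# "Never-worked" does not appear in the training data.
unseen = np.array([["Never-worked"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore").fit(train)

# A known category gets a single 1 in the column of that category.
known_row = encoder.transform(train[:1]).toarray()
# An unknown category is encoded as all zeros.
unseen_row = encoder.transform(unseen).toarray()

print(known_row)
print(unseen_row)
```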

In [6]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('one-hot-encoder', categorical_preprocessor, categorical_columns),
    ('standard-scaler', numerical_preprocessor, numerical_columns)
])

We can take a minute to represent graphically the structure of a
`ColumnTransformer`:

![columntransformer diagram](/home/luba/Documents/DS/projects-courses-ongoing/sklearn-course-inria-[doing]/figures/api_diagram-columntransformer.svg)

A `ColumnTransformer` does the following:

* It **splits the columns** of the original dataset based on the column names
  or indices provided. We will obtain as many subsets as the number of
  transformers passed into the `ColumnTransformer`.
* It **transforms each subset**. A specific transformer is applied to
  each subset: it will internally call `fit_transform` or `transform`. The
  output of this step is a set of transformed datasets.
* It then **concatenates the transformed datasets** into a single dataset.
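
The three steps above can be sketched by hand on a hypothetical toy dataframe and compared with the output of a `ColumnTransformer`. This is only an illustration of the idea, not a description of the internal implementation. `sparse_threshold=0` is passed so that the `ColumnTransformer` returns a dense array that is easy to compare.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data mixing numerical and categorical columns.
toy = pd.DataFrame({
    "age": [25.0, 38.0, 52.0, 41.0],
    "hours": [40.0, 50.0, 35.0, 45.0],
    "workclass": pd.Series(
        ["Private", "State-gov", "Private", "Self-emp"], dtype=object
    ),
})
cat_cols = ["workclass"]
num_cols = ["age", "hours"]

# Steps 1 and 2: split the columns and transform each subset separately.
encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(toy[cat_cols])
scaled = StandardScaler().fit_transform(toy[num_cols])

# Step 3: concatenate the transformed subsets column-wise.
manual = np.hstack([encoded.toarray(), scaled])

# The same processing expressed with a ColumnTransformer.
column_transformer = ColumnTransformer(
    [
        ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("standard-scaler", StandardScaler(), num_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
automatic = column_transformer.fit_transform(toy)

print(manual.shape, np.allclose(manual, automatic))
```

The output columns follow the order in which the transformers are listed, which is why the one-hot encoded columns come first in both arrays.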

The important thing is that `ColumnTransformer` is like any other
scikit-learn transformer. In particular it can be combined with a classifier
in a `Pipeline`:

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

In [8]:
from sklearn import set_config
set_config(display='diagram')
model

In [9]:
# Let's split the data into a training set and a test set
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data,
    target,
    test_size=0.25,
    random_state=42
)
_ = model.fit(data_train, target_train)
model.predict(data_test)[:5]

array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)

In [10]:
model.score(data_test, target_test)

0.8575874211776268

In [11]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data_train, target_train)
cv_results

{'fit_time': array([0.60776925, 0.52251768, 0.49279571, 0.50765657, 0.45256114]),
 'score_time': array([0.02112317, 0.02017379, 0.02133369, 0.02067327, 0.02001166]),
 'test_score': array([0.84632182, 0.85476385, 0.85107835, 0.84821185, 0.84957685])}

In [13]:
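# mean and standard deviation of the cross-validation scores
# (cross-validation was run on the training set only)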
cv_results["test_score"].mean(), cv_results["test_score"].std()

(0.8499905454047667, 0.0028552489405761113)

## Fitting a more powerful model

**Linear models** are nice because they are usually cheap to train,
**small** to deploy, **fast** to predict and give a **good baseline**.

However, it is often useful to check whether more complex models such as an
ensemble of decision trees can lead to higher predictive performance. In this
section we will use such a model called **gradient-boosting trees** and
evaluate its generalization performance. More precisely, the scikit-learn model
we will use is called `HistGradientBoostingClassifier`. Note that boosting
models will be covered in more detail in a future module.

For tree-based models, the handling of numerical and categorical variables is
simpler than for linear models:

* we do **not need to scale the numerical features**;
* using an **ordinal encoding for the categorical variables** is fine even if
  the encoding results in an arbitrary ordering.

Therefore, for `HistGradientBoostingClassifier`, the preprocessing pipeline
is slightly simpler than the one we saw earlier for the `LogisticRegression`:

In [18]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns)
], remainder='passthrough')

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

model
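
The `OrdinalEncoder` settings used in the cell above can be illustrated on a hypothetical toy column (the category names are illustrative only). Known categories are mapped to integer codes following their sorted order, and a category not seen during `fit` is mapped to the chosen `unknown_value`, here `-1`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column used to fit the encoder.
train = np.array([["Private"], ["State-gov"], ["Private"]], dtype=object)
# "Never-worked" was not seen during fit.
test = np.array([["State-gov"], ["Never-worked"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# Categories are sorted, so "Private" -> 0 and "State-gov" -> 1;
# the unseen category falls back to unknown_value.
codes = encoder.transform(test)

print(encoder.categories_)
print(codes)
```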

In [19]:
_ = model.fit(data_train, target_train)
model.score(data_test, target_test)

0.8804356727540742

In [21]:
cv_results = cross_validate(model, data, target)
cv_results["test_score"].mean(), cv_results["test_score"].std()

(0.8730191984388933, 0.0021218543390928457)