# Using numerical and categorical variables together

In [1]:
import pandas as pd

adult_census = pd.read_csv("./csv_result-phpMawTba.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")

target_name = "class"
target = adult_census[target_name]

data = adult_census.drop(columns=[target_name])

## Selection based on data types

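Use the `make_column_selector` helper to select columns based on their data
type: columns with the `object` dtype are treated as categorical, and all
other columns are treated as numerical.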

In [2]:
from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)
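
To make the behavior of the selectors concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` frame and its values are invented for illustration and are not part of the census data). The string column is created with an explicit `object` dtype so that the sketch does not depend on how pandas infers string columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame: two numerical columns and one string column.
toy = pd.DataFrame(
    {
        "age": [25, 38, 52],
        "hours-per-week": [40, 50, 30],
        # Explicit object dtype (an assumption made for this sketch).
        "workclass": pd.Series(["Private", "State-gov", "Private"], dtype=object),
    }
)

# Calling a selector on a dataframe returns a list of column names.
toy_numerical = selector(dtype_exclude=object)(toy)
toy_categorical = selector(dtype_include=object)(toy)

print(toy_numerical)    # ['age', 'hours-per-week']
print(toy_categorical)  # ['workclass']
```

Each selector is a callable, so the same object can be reused on any dataframe with a compatible layout.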

## Dispatch columns to a specific processor

Scikit-learn provides a `ColumnTransformer` class which sends specific
columns to a specific transformer, making it easy to fit a single predictive
model on a dataset that combines both kinds of variables together
(heterogeneously typed tabular data).

`ColumnTransformer` takes a list of tuples, each made of three values: a name
for the preprocessor, the transformer, and the columns it applies to.

In [4]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

In [22]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    [
        ("one-hot-encoder", categorical_preprocessor, categorical_columns),
        ("standard_scaler", numerical_preprocessor, numerical_columns),
    ]
)
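
As a rough illustration of what such a preprocessor produces, the sketch below applies the same two transformers to a hypothetical four-row frame (the `toy` frame, its values, and the `toy_preprocessor` name are assumptions made for this example). Each categorical column is expanded into one column per category, while each numerical column stays a single scaled column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one numerical and one categorical column.
toy = pd.DataFrame(
    {
        "age": [20, 30, 40, 50],
        "workclass": pd.Series(
            ["Private", "State-gov", "Private", "Self-emp"], dtype=object
        ),
    }
)

# Each tuple is (name, transformer, columns), as described above.
toy_preprocessor = ColumnTransformer(
    [
        ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
        ("standard_scaler", StandardScaler(), ["age"]),
    ]
)

toy_transformed = toy_preprocessor.fit_transform(toy)

# 3 one-hot columns (one per distinct workclass) + 1 scaled column.
print(toy_transformed.shape)  # (4, 4)
```

The output columns follow the order of the transformers in the list: one-hot encoded columns first, then the scaled column.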

Using `ColumnTransformer` in a pipeline:

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))
model

Split the data and train the model:

In [12]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

In [11]:
_ = model.fit(data_train, target_train)

In [13]:
data_test.head()

| | id | age | workclass | fnlwgt | education | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7762 | 7763 | 56 | Private | 33115 | HS-grad | Divorced | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States |
| 23881 | 23882 | 25 | Private | 112847 | HS-grad | Married-civ-spouse | Transport-moving | Own-child | Other | Male | 0 | 0 | 40 | United-States |
| 30507 | 30508 | 43 | Private | 170525 | Bachelors | Divorced | Prof-specialty | Not-in-family | White | Female | 14344 | 0 | 40 | United-States |
| 28911 | 28912 | 32 | Private | 186788 | HS-grad | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 40 | United-States |
| 19484 | 19485 | 39 | Private | 277886 | Bachelors | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 30 | United-States |


In [14]:
model.predict(data_test)[:5]

array(['<=50K', '<=50K', '>50K', '<=50K', '>50K'], dtype=object)

In [15]:
target_test[:5]

7762     <=50K
23881    <=50K
30507     >50K
28911    <=50K
19484    <=50K
Name: class, dtype: object

In [16]:
model.score(data_test, target_test)

0.8588977151748424

## Evaluation of the model with cross-validation

In [17]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=5)
cv_results

{'fit_time': array([0.24032378, 0.22969675, 0.21816421, 0.22101712, 0.23146725]),
 'score_time': array([0.02236319, 0.02382541, 0.02334356, 0.02461004, 0.02358699]),
 'test_score': array([0.85105947, 0.85105947, 0.8495086 , 0.85298935, 0.85657248])}

In [18]:
scores = cv_results["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)

The mean cross-validation accuracy is: 0.852 ± 0.002


## Fitting a more powerful model


We now look at a more complex model called **gradient-boosting trees** and
evaluate its generalization performance. More precisely, we use the
scikit-learn model called `HistGradientBoostingClassifier`.

For tree-based models, the handling of numerical and categorical variables is
simpler than for linear models:
* we do **not need to scale the numerical features**
* using an **ordinal encoding for the categorical variables** is fine even if
  the encoding results in an arbitrary ordering

In [19]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)

preprocessor = ColumnTransformer(
    [("categorical", categorical_preprocessor, categorical_columns)],
    remainder="passthrough",
)

model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
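
The `OrdinalEncoder` settings used in this pipeline can be illustrated on a single hypothetical column (the category values and variable names below are invented for this sketch). Categories seen during `fit` are mapped to integer codes in sorted order, and any category not seen during `fit` is mapped to `unknown_value`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Hypothetical training column with three distinct categories.
train = np.array([["Private"], ["State-gov"], ["Self-emp"]], dtype=object)
encoder.fit(train)

# Categories are sorted: Private -> 0, Self-emp -> 1, State-gov -> 2.
seen = encoder.transform(np.array([["State-gov"], ["Private"]], dtype=object))

# A category absent from the training column is encoded as -1.
unseen = encoder.transform(np.array([["Never-worked"]], dtype=object))

print(seen.ravel())    # [2. 0.]
print(unseen.ravel())  # [-1.]
```

The integer codes carry an arbitrary order, which is acceptable here because, as noted above, tree-based models can work with such an encoding.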

In [20]:
%%time
_ = model.fit(data_train, target_train)

CPU times: user 1.79 s, sys: 0 ns, total: 1.79 s
Wall time: 376 ms


In [21]:
model.score(data_test, target_test)

0.8793710588813365

The gradient-boosting model reaches a noticeably higher test accuracy (about
0.879) than the logistic regression pipeline (about 0.859) on the same split.
This is often what we observe whenever the dataset has a large number of
samples and a limited number of informative features (e.g. fewer than 1000)
with a mix of numerical and categorical variables.