# 📝 Exercise M1.05

The goal of this exercise is to evaluate the impact of feature preprocessing
on a pipeline that uses a decision-tree-based classifier instead of a logistic
regression.

- The first question is to empirically evaluate whether scaling numerical
  features is helpful or not;
- The second question is to evaluate whether it is empirically better (both
  from a computational and a statistical perspective) to use integer coded or
  one-hot encoded categories.

In [4]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

In [5]:
adult_census.head()

   age   workclass      education  education-num       marital-status  \
0   25     Private           11th              7        Never-married   
1   38     Private        HS-grad              9   Married-civ-spouse   
2   28   Local-gov     Assoc-acdm             12   Married-civ-spouse   
3   44     Private   Some-college             10   Married-civ-spouse   
4   18           ?   Some-college             10        Never-married   

In [6]:
adult_census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   education       48842 non-null  object
 3   education-num   48842 non-null  int64 
 4   marital-status  48842 non-null  object
 5   occupation      48842 non-null  object
 6   relationship    48842 non-null  object
 7   race            48842 non-null  object
 8   sex             48842 non-null  object
 9   capital-gain    48842 non-null  int64 
 10  capital-loss    48842 non-null  int64 
 11  hours-per-week  48842 non-null  int64 
 12  native-country  48842 non-null  object
 13  class           48842 non-null  object
dtypes: int64(5), object(9)
memory usage: 5.2+ MB


In [7]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num", "race"])

In [8]:
data.head()

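|   | age | workclass | education | marital-status | occupation | relationship | sex | capital-gain | capital-loss | hours-per-week | native-country |
| - | --- | --------- | --------- | -------------- | ---------- | ------------ | --- | ------------ | ------------ | -------------- | -------------- |
| 0 | 25 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | ? | Some-college | Never-married | ? | Own-child | Female | 0 | 0 | 30 | United-States |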


As in the previous notebooks, we use the utility `make_column_selector` to
select only columns with a specific data type. Calling each selector on `data`
returns the list of matching column names.

In [9]:
from sklearn.compose import make_column_selector as selector

numerical_columns_selector = selector(dtype_exclude=object)
categorical_columns_selector = selector(dtype_include=object)
numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)
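
To make the selector behaviour concrete, here is a minimal sketch on a tiny, hypothetical DataFrame (the `toy` frame and its values are illustrative assumptions, not the census data). It shows that calling a selector on a DataFrame returns a plain list of column names.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame (an assumption for illustration only).
toy = pd.DataFrame(
    {
        "age": [25, 38],
        # Force the object dtype so that the selectors below match it.
        "workclass": pd.Series(["Private", "Local-gov"], dtype=object),
        "hours-per-week": [40, 50],
    }
)

# A selector is a callable: applied to a DataFrame, it returns column names.
numerical = selector(dtype_exclude=object)(toy)
categorical = selector(dtype_include=object)(toy)

print(numerical)    # ['age', 'hours-per-week']
print(categorical)  # ['workclass']
```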

## Reference pipeline (no numerical scaling and integer-coded categories)

First, let's time the pipeline we used in the main notebook to serve as a
reference:

In [10]:
import time

from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import HistGradientBoostingClassifier

categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = make_column_transformer(
    (categorical_preprocessor, categorical_columns),
    remainder="passthrough",  # keep every column not listed above unchanged
)


model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

start = time.time()
cv_results = cross_validate(model, data, target)
elapsed_time = time.time() - start

scores = cv_results["test_score"]

print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f} "
    f"with a fitting time of {elapsed_time:.3f} seconds"
)

The mean cross-validation accuracy is: 0.873 ± 0.002 with a fitting time of 10.549 seconds


## Scaling numerical features

Let's write a similar pipeline that also scales the numerical features using
`StandardScaler` (or similar):

In [11]:
# Write your code here.
import time
from sklearn.preprocessing import StandardScaler

preprocessor = make_column_transformer(
    (StandardScaler(), numerical_columns),
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        categorical_columns,
    ),
)
model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

start = time.time()
cv_results = cross_validate(model, data, target)
elapsed_time = time.time() - start

scores = cv_results["test_score"]

print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f} "
    f"with a fitting time of {elapsed_time:.3f} seconds"
)


The mean cross-validation accuracy is: 0.874 ± 0.003 with a fitting time of 9.136 seconds
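
The near-identical scores with and without scaling are consistent with how tree splits work: a split only depends on the ordering of a feature's values, which standardization preserves. The following sketch illustrates this on synthetic data with a single `DecisionTreeClassifier`, which is swapped in here for `HistGradientBoostingClassifier` to keep the example small. The data, feature scales, and `max_depth` are assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic features with very different scales (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 10_000.0])
y = (X[:, 0] + X[:, 1] / 100.0 > 0).astype(int)

# Same tree settings, fitted once on raw and once on standardized features.
raw_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
scaled_tree = make_pipeline(
    StandardScaler(), DecisionTreeClassifier(max_depth=3, random_state=0)
).fit(X, y)

# Compare the predictions of both models on the training samples.
same_predictions = np.array_equal(raw_tree.predict(X), scaled_tree.predict(X))
print(f"Identical predictions: {same_predictions}")
```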


## One-hot encoding of categorical variables

We observed that integer coding of categorical variables can be very
detrimental for linear models. However, it does not seem to be the case for
`HistGradientBoostingClassifier` models, as the cross-validation score of the
reference pipeline with `OrdinalEncoder` is reasonably good.

Let's see if we can get an even better accuracy with `OneHotEncoder`.

Hint: `HistGradientBoostingClassifier` does not yet support sparse input data.
You might want to use `OneHotEncoder(handle_unknown="ignore",
sparse_output=False)` to force the use of a dense representation as a
workaround.

In [13]:
# Write your code here.
import time
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

preprocessor = make_column_transformer(
    (StandardScaler(), numerical_columns),
    (
        OneHotEncoder(handle_unknown="ignore", sparse_output=False),
        categorical_columns,
    ),
)
model = make_pipeline(preprocessor, HistGradientBoostingClassifier())

start = time.time()
cv_results = cross_validate(model, data, target)
elapsed_time = time.time() - start

scores = cv_results["test_score"]

print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f} "
    f"with a fitting time of {elapsed_time:.3f} seconds"
)


The mean cross-validation accuracy is: 0.873 ± 0.002 with a fitting time of 26.411 seconds
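
One plausible reason for the longer run time is that one-hot encoding creates one column per category, so the model has more features to process. The sketch below compares the output shapes of the two encoders on a tiny, hypothetical frame (the `toy` data is an assumption, not the census data).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy frame: 3 distinct workclass values, 4 distinct education values.
toy = pd.DataFrame(
    {
        "workclass": ["Private", "Local-gov", "?", "Private"],
        "education": ["11th", "HS-grad", "Assoc-acdm", "Some-college"],
    }
)

# Ordinal encoding keeps one column per original feature.
ordinal_shape = OrdinalEncoder().fit_transform(toy).shape
# One-hot encoding creates one column per category (3 + 4 here).
one_hot_shape = OneHotEncoder().fit_transform(toy).shape

print(ordinal_shape)  # (4, 2)
print(one_hot_shape)  # (4, 7)
```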


## Which encoder should I use?

|                  | Meaningful order              | Non-meaningful order |
| ---------------- | ----------------------------- | -------------------- |
| Tree-based model | `OrdinalEncoder`              | `OrdinalEncoder` with reasonable depth    |
| Linear model     | `OrdinalEncoder` with caution | `OneHotEncoder`      |

<div class="admonition important alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Important</p>
<ul class="last simple">
<li><tt class="docutils literal">OneHotEncoder</tt>: always does something meaningful, but can be unnecessarily
slow with trees.</li>
<li><tt class="docutils literal">OrdinalEncoder</tt>: can be detrimental for linear models unless your category
has a meaningful order and you make sure that <tt class="docutils literal">OrdinalEncoder</tt> respects this
order. Trees can deal with <tt class="docutils literal">OrdinalEncoder</tt> fine as long as they are deep
enough. However, when you allow the decision tree to grow very deep, it might
overfit on other features.</li>
</ul>
</div>
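
To illustrate what "respecting the order" can look like in practice, the sketch below passes an explicit `categories` list to `OrdinalEncoder`. The `size` column and its values are a hypothetical example, not a column of the census data. Without `categories`, the encoder sorts the categories (alphabetically for strings), which rarely matches a meaningful order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature (assumption for illustration only).
sizes = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Default: categories are sorted, so large=0, medium=1, small=2.
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# Explicit order: small=0, medium=1, large=2.
ordered_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered_codes = ordered_encoder.fit_transform(sizes).ravel()

print(default_codes)  # [1. 2. 0. 2.]
print(ordered_codes)  # [1. 0. 2. 0.]
```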

In addition to one-hot encoding and ordinal encoding of categorical features,
scikit-learn offers the [`TargetEncoder`](https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder).
This encoder is well suited for nominal categorical features with high
cardinality. This encoding strategy is beyond the scope of this course,
but the interested reader is encouraged to explore it.