# 📝 Exercise M1.04

The goal of this exercise is to evaluate the impact of using an arbitrary
integer encoding for categorical variables along with a linear classification
model such as Logistic Regression.

To do so, let's try to use `OrdinalEncoder` to preprocess the categorical
variables. This preprocessor is assembled in a pipeline with
`LogisticRegression`. The generalization performance of the pipeline can be
evaluated by cross-validation and then compared to the score obtained when
using `OneHotEncoder` or to some other baseline score.

First, we load the dataset.

In [23]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

In [24]:
adult_census.head()

| | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [25]:
adult_census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       48842 non-null  object
 2   education       48842 non-null  object
 3   education-num   48842 non-null  int64 
 4   marital-status  48842 non-null  object
 5   occupation      48842 non-null  object
 6   relationship    48842 non-null  object
 7   race            48842 non-null  object
 8   sex             48842 non-null  object
 9   capital-gain    48842 non-null  int64 
 10  capital-loss    48842 non-null  int64 
 11  hours-per-week  48842 non-null  int64 
 12  native-country  48842 non-null  object
 13  class           48842 non-null  object
dtypes: int64(5), object(9)
memory usage: 5.2+ MB


In [26]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num", "race"])

In [27]:
# data.info()
data.head(20)

| | age | workclass | education | marital-status | occupation | relationship | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Male | 0 | 0 | 40 | United-States |
| 1 | 38 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | Male | 0 | 0 | 50 | United-States |
| 2 | 28 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | Male | 0 | 0 | 40 | United-States |
| 3 | 44 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Male | 7688 | 0 | 40 | United-States |
| 4 | 18 | ? | Some-college | Never-married | ? | Own-child | Female | 0 | 0 | 30 | United-States |
| 5 | 34 | Private | 10th | Never-married | Other-service | Not-in-family | Male | 0 | 0 | 30 | United-States |
| 6 | 29 | ? | HS-grad | Never-married | ? | Unmarried | Male | 0 | 0 | 40 | United-States |
| 7 | 63 | Self-emp-not-inc | Prof-school | Married-civ-spouse | Prof-specialty | Husband | Male | 3103 | 0 | 32 | United-States |
| 8 | 24 | Private | Some-college | Never-married | Other-service | Unmarried | Female | 0 | 0 | 40 | United-States |
| 9 | 55 | Private | 7th-8th | Married-civ-spouse | Craft-repair | Husband | Male | 0 | 0 | 10 | United-States |


In the previous notebook, we used `sklearn.compose.make_column_selector` to
automatically select columns with a specific data type (also called `dtype`).
Here, we use this selector to get only the columns containing strings (columns
with `object` dtype), which correspond to the categorical features in our dataset.

In [28]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
data_categorical = data[categorical_columns]
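
To see what the selector returns, here is a minimal sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical, not taken from the census file). The string columns are cast to `object` explicitly so that the result does not depend on how the installed pandas version infers string columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame mixing numerical and string columns.
toy = pd.DataFrame(
    {
        "age": [25, 38, 28],
        "workclass": ["Private", "Private", "Local-gov"],
        "hours-per-week": [40, 50, 40],
        "sex": ["Male", "Male", "Female"],
    }
).astype({"workclass": object, "sex": object})

# Calling the selector on a frame returns the matching column names.
select_strings = make_column_selector(dtype_include=object)
string_columns = select_strings(toy)
print(string_columns)
```

The selector only returns names; indexing the frame with them (as done with `data[categorical_columns]`) is what produces the categorical subset.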

Define a scikit-learn pipeline composed of an `OrdinalEncoder` and a
`LogisticRegression` classifier.

Because `OrdinalEncoder` can raise errors if it sees an unknown category at
prediction time, you can set the `handle_unknown="use_encoded_value"` and
`unknown_value` parameters. You can refer to the [scikit-learn
documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)
for more details regarding these parameters.
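
As a small illustration of these two parameters, the sketch below fits an encoder on two made-up categories and then transforms a value it has never seen. The arrays `train` and `new` are hypothetical; `-1` is just one possible choice for `unknown_value`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training and prediction-time values for a single column.
train = np.array([["Private"], ["Local-gov"], ["Private"]], dtype=object)
new = np.array([["Local-gov"], ["Never-worked"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# Known categories get their learned integer code (sorted order);
# the unseen "Never-worked" is mapped to unknown_value instead of raising.
encoded = encoder.transform(new)
print(encoder.categories_)
print(encoded)
```

With the default `handle_unknown="error"`, the same `transform` call would raise a `ValueError` instead.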

In [29]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

# Write your code here.
model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    LogisticRegression(),
)

In [30]:
model

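Pipeline

| Parameter | Value |
|---|---|
| steps | [('ordinalencoder', ...), ('logisticregression', ...)] |
| transform_input | |
| memory | |
| verbose | False |

OrdinalEncoder

| Parameter | Value |
|---|---|
| categories | 'auto' |
| dtype | <class 'numpy.float64'> |
| handle_unknown | 'use_encoded_value' |
| unknown_value | -1 |
| encoded_missing_value | |
| min_frequency | |
| max_categories | |

LogisticRegression

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | |
| random_state | |
| solver | 'lbfgs' |
| max_iter | 100 |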


Your model is now defined. Evaluate it by cross-validation with
`sklearn.model_selection.cross_validate`.

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">Be aware that if an error occurs during cross-validation,
<tt class="docutils literal">cross_validate</tt> raises a warning and returns NaN (Not a Number) as scores.
To make it raise a standard Python exception with a traceback, you can pass
the <tt class="docutils literal"><span class="pre">error_score="raise"</span></tt> argument in the call to <tt class="docutils literal">cross_validate</tt>. An
exception is then raised instead of a warning at the first problem encountered,
and <tt class="docutils literal">cross_validate</tt> stops right away instead of returning NaN values.
This is particularly handy when developing complex machine learning pipelines.</p>
</div>
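
The behaviour described in the note can be reproduced on a tiny made-up dataset. In this sketch (all data is hypothetical), the category `"Never-worked"` only appears in the last, unshuffled fold, so an `OrdinalEncoder` with its default settings fails when that fold is scored. Passing `error_score="raise"` surfaces the underlying exception.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: the last two rows form the third test fold,
# and "Never-worked" never appears in the matching training fold.
X = np.array(
    [["Private"], ["Local-gov"], ["Private"],
     ["Local-gov"], ["Private"], ["Never-worked"]],
    dtype=object,
)
y = np.array([0, 1, 0, 1, 0, 1])

# Default OrdinalEncoder: unknown categories raise at transform time.
strict_model = make_pipeline(OrdinalEncoder(), LogisticRegression())

try:
    cross_validate(strict_model, X, y, cv=KFold(n_splits=3), error_score="raise")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Setting `handle_unknown="use_encoded_value"` on the encoder, as in the pipeline above, avoids this failure altogether.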

In [34]:
from sklearn.model_selection import cross_validate

# Write your code here.
results = cross_validate(model, data_categorical, target, cv=5)
scores = results["test_score"]
# import numpy as np
# np.mean(results["test_score"])
print(scores)
print(scores.mean())
print(scores.std())




[0.75586037 0.75616747 0.75593776 0.754914   0.75788288]
0.7561524973824083
0.0009653777795031363
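
The introduction mentions comparing against "some other baseline score". One common choice is a `DummyClassifier` that always predicts the most frequent class. The sketch below uses made-up labels (`X_toy` and `y_toy` are hypothetical, not the census data) only to show the mechanics; on the exercise data you would pass `data_categorical` and `target` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Hypothetical imbalanced target: 75 samples of one class, 25 of the other.
y_toy = np.array(["<=50K"] * 75 + [">50K"] * 25)
# The dummy model ignores the features, so a constant column is enough.
X_toy = np.zeros((100, 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_validate(baseline, X_toy, y_toy, cv=5)["test_score"]

# Each stratified test fold holds 15 majority and 5 minority samples,
# so always predicting the majority class scores 0.75 in every fold.
print(baseline_scores.mean())
```

Comparing the ordinal-encoded model against such a baseline indicates whether it learns anything beyond the class frequencies.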


Now, we would like to compare the generalization performance of our previous
model with a new model that uses a `OneHotEncoder` instead of an
`OrdinalEncoder`. Repeat the model evaluation using cross-validation. Compare
the scores of both models and draw a conclusion about the impact of choosing a
specific encoding strategy when using a linear model.

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Write your code here.
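
One possible way to set up this comparison is sketched below on synthetic data, not on the census dataset. The categories `"a"`, `"b"`, `"c"` and the rule "positive only for `"b"`" are assumptions chosen so that the target is not monotonic in the arbitrary integer codes. To apply the same pattern to the exercise, replace `X_toy` and `y_toy` with `data_categorical` and `target`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic single-column data: only the middle category "b" is positive.
X_toy = np.array(["a", "b", "c"] * 30, dtype=object).reshape(-1, 1)
y_toy = (X_toy.ravel() == "b").astype(int)

ordinal_model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    LogisticRegression(),
)
# handle_unknown="ignore" encodes unseen categories as all zeros.
onehot_model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(),
)

ordinal_scores = cross_validate(ordinal_model, X_toy, y_toy, cv=5)["test_score"]
onehot_scores = cross_validate(onehot_model, X_toy, y_toy, cv=5)["test_score"]

print(f"ordinal: {ordinal_scores.mean():.3f} ± {ordinal_scores.std():.3f}")
print(f"one-hot: {onehot_scores.mean():.3f} ± {onehot_scores.std():.3f}")
```

In this toy setting the integer codes 0, 1, 2 impose an order that a single linear coefficient cannot use to isolate the middle category, whereas one-hot encoding gives each category its own coefficient. Whether the gap is as large on real data has to be checked by running the comparison.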