# 📝 Exercise M1.04

The goal of this exercise is to evaluate the impact of using an arbitrary
integer encoding for categorical variables along with a linear classification
model such as Logistic Regression.

To do so, let's try to use `OrdinalEncoder` to preprocess the categorical
variables. This preprocessor is assembled in a pipeline with
`LogisticRegression`. The generalization performance of the pipeline can be
evaluated by cross-validation and then compared to the score obtained when
using `OneHotEncoder` or to some other baseline score.

First, we load the dataset.

In [1]:
import pandas as pd

adult_census = pd.read_csv("dataset/adult-census.csv")

In [2]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

In the previous notebook, we used `sklearn.compose.make_column_selector` to
automatically select columns with a specific data type (also called `dtype`).
Here, we use this selector to get only the columns containing strings (columns
with `object` dtype), which correspond to the categorical features in our dataset.

In [4]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
data_categorical = data[categorical_columns]
print(data_categorical.head())

    workclass      education       marital-status          occupation  \
0     Private           11th        Never-married   Machine-op-inspct   
1     Private        HS-grad   Married-civ-spouse     Farming-fishing   
2   Local-gov     Assoc-acdm   Married-civ-spouse     Protective-serv   
3     Private   Some-college   Married-civ-spouse   Machine-op-inspct   
4           ?   Some-college        Never-married                   ?   

  relationship    race      sex  native-country  
0    Own-child   Black     Male   United-States  
1      Husband   White     Male   United-States  
2      Husband   White     Male   United-States  
3      Husband   Black     Male   United-States  
4    Own-child   White   Female   United-States  
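As a minimal sketch of how the selector behaves, the snippet below applies it to a small hand-made frame. The frame `toy` and its values are hypothetical and only serve as an illustration; the string columns are created with an explicit `object` dtype so that the selection does not depend on the pandas version.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame: two string columns and one numeric column.
toy = pd.DataFrame(
    {
        "workclass": pd.Series(["Private", "Local-gov", "?"], dtype=object),
        "age": [25, 38, 28],
        "sex": pd.Series(["Male", "Male", "Female"], dtype=object),
    }
)

# Calling the selector on a frame returns the names of the matching columns.
select_strings = make_column_selector(dtype_include=object)
string_columns = select_strings(toy)
print(string_columns)
```

The numeric `age` column is left out, so indexing the frame with the returned list keeps only the categorical columns.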


Define a scikit-learn pipeline composed of an `OrdinalEncoder` and a
`LogisticRegression` classifier.

Because `OrdinalEncoder` can raise errors if it sees an unknown category at
prediction time, you can set the `handle_unknown="use_encoded_value"` and
`unknown_value` parameters. You can refer to the [scikit-learn
documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)
for more details regarding these parameters.

In [10]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    LogisticRegression(max_iter=1000),
)

In [11]:
print(type(model))
print(hasattr(model, "fit"))


<class 'sklearn.pipeline.Pipeline'>
True
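To see what `handle_unknown="use_encoded_value"` does in isolation, here is a small sketch on made-up data. The arrays `train` and `new` are hypothetical; the category `"Self-emp"` is simply chosen to be absent from the data used to fit the encoder.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "Self-emp" is never seen when fitting the encoder.
train = np.array([["Private"], ["Local-gov"], ["Private"]], dtype=object)
new = np.array([["Private"], ["Self-emp"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# Known categories get their learned integer code; unknown ones get -1.
encoded = encoder.transform(new)
print(encoder.categories_)
print(encoded)
```

Without these two parameters, the call to `transform` would raise an error on the unseen category, which is what would otherwise interrupt the cross-validation.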


Your model is now defined. Evaluate it by cross-validation with
`sklearn.model_selection.cross_validate`.

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">Be aware that if an error happened during the cross-validation,
<tt class="docutils literal">cross_validate</tt> would raise a warning and return NaN (Not a Number) as scores.
To make it raise a standard Python exception with a traceback, you can pass
the <tt class="docutils literal"><span class="pre">error_score="raise"</span></tt> argument in the call to <tt class="docutils literal">cross_validate</tt>. An
exception would be raised instead of a warning at the first encountered problem
and <tt class="docutils literal">cross_validate</tt> would stop right away instead of returning NaN values.
This is particularly handy when developing complex machine learning pipelines.</p>
</div>
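The effect of `error_score="raise"` can be sketched on a deliberately broken setup. In the hypothetical example below, a `LogisticRegression` is fitted directly on string data, without any encoder, so every fit fails; the toy arrays `X` and `y` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Hypothetical toy data: raw strings that a linear model cannot handle.
X = np.array([["a"], ["b"], ["a"], ["b"], ["a"], ["b"]], dtype=object)
y = np.array([0, 1, 0, 1, 0, 1])

raised = False
try:
    # The fit fails because the strings cannot be converted to floats.
    cross_validate(LogisticRegression(), X, y, cv=2, error_score="raise")
except ValueError as exc:
    raised = True
    print(f"{type(exc).__name__}: {exc}")
```

The exception points at the failing step, which is usually easier to debug than a table of NaN scores.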

In [12]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(
    model, data_categorical, target, cv=5, return_train_score=True, error_score='raise'
)

cv_results_df = pd.DataFrame(cv_results)
print(cv_results_df.describe())

       fit_time  score_time  test_score  train_score
count  5.000000    5.000000    5.000000     5.000000
mean   0.335544    0.023603    0.755477     0.755487
std    0.054917    0.007289    0.001715     0.000639
min    0.286339    0.016664    0.753071     0.754722
25%    0.301442    0.019783    0.755144     0.754920
50%    0.322890    0.022826    0.755553     0.755662
75%    0.340425    0.022943    0.755733     0.755950
max    0.426624    0.035799    0.757883     0.756181


Now, we would like to compare the generalization performance of our previous
model with a new model where instead of using an `OrdinalEncoder`, we use a
`OneHotEncoder`. Repeat the model evaluation using cross-validation. Compare
the score of both models and conclude on the impact of choosing a specific
encoding strategy when using a linear model.

In [13]:
from sklearn.preprocessing import OneHotEncoder

model2 = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)

In [16]:
# Use a distinct name so the result is not confused with the cross_validate function.
cv_results2 = cross_validate(
    model2, data_categorical, target, cv=5, return_train_score=True, error_score="raise"
)
cv_results2_df = pd.DataFrame(cv_results2)
print(cv_results2_df.describe())

       fit_time  score_time  test_score  train_score
count  5.000000    5.000000    5.000000     5.000000
mean   0.182960    0.017830    0.832849     0.833975
std    0.017900    0.002754    0.002893     0.000672
min    0.169916    0.014215    0.828317     0.833521
25%    0.170455    0.015849    0.832327     0.833568
50%    0.178491    0.018523    0.832924     0.833598
75%    0.182406    0.019629    0.834971     0.834080
max    0.213536    0.020933    0.835705     0.835108


Compare the results of both models and conclude on the impact of choosing a specific
encoding strategy when using a linear model.

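Accuracy: the one-hot encoded model reaches a mean test accuracy of about 0.833, against about 0.755 for the ordinal encoded model. These categorical features have no natural order, so the arbitrary integers produced by `OrdinalEncoder` impose an ordering that misleads the logistic regression model.

Speed: the one-hot encoded model was also faster to fit in this run, with a mean fit time of about 0.18 s against about 0.34 s for the ordinal encoded model.

Overfitting: neither model shows signs of overfitting, since the train and test scores are nearly identical in both cases.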
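The difference between the two encodings can be illustrated on a single column. The sketch below uses three `workclass` values taken from the table printed earlier; the array `column` itself is a hypothetical toy input.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column built from three workclass values seen in the dataset preview.
column = np.array([["Private"], ["Local-gov"], ["?"]], dtype=object)

# Ordinal encoding: one integer per category, assigned in sorted order.
ordinal = OrdinalEncoder().fit_transform(column)
print(ordinal)

# One-hot encoding: one indicator column per category, with no implied order.
one_hot = OneHotEncoder().fit_transform(column).toarray()
print(one_hot)
```

With the ordinal codes, a linear model has a single coefficient for the column and must treat "Private" as twice "Local-gov". With the indicator columns, it can learn a separate coefficient for each category.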
