## Exercise M1.04

The goal of this exercise is to evaluate the impact of using an arbitrary integer encoding for categorical variables along with a linear classification model such as Logistic Regression.

To do so, let's try to use OrdinalEncoder to preprocess the categorical variables. This preprocessor is assembled in a pipeline with LogisticRegression. The generalization performance of the pipeline can be evaluated by cross-validation and then compared to the score obtained when using OneHotEncoder or to some other baseline score.

First, we load the dataset.

In [1]:
import pandas as pd

adult_census = pd.read_csv("/content/drive/MyDrive/DataSets/Adult_Census/adult-census(1).csv")

In [2]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

In the previous notebook, we used sklearn.compose.make_column_selector to automatically select columns with a specific data type (also called dtype). Here, we use this selector to get only the columns containing strings (columns with object dtype), which correspond to the categorical features in our dataset.

In [3]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
data_categorical = data[categorical_columns]
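
To make the selection step concrete, here is a minimal sketch on a small hand-made DataFrame. The DataFrame `toy` and its values are invented for illustration and are not taken from the census data. The string columns are cast to object explicitly so that the selector behaves the same regardless of how pandas infers string dtypes.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy data mixing numerical and string columns.
toy = pd.DataFrame(
    {
        "age": [25, 38, 52],
        "workclass": ["Private", "State-gov", "Private"],
        "hours-per-week": [40, 50, 30],
        "education": ["HS-grad", "Bachelors", "Masters"],
    }
).astype({"workclass": object, "education": object})  # force object dtype

# The selector is a callable: given a DataFrame, it returns the names of
# the columns whose dtype matches dtype_include.
toy_categorical_columns = selector(dtype_include=object)(toy)
print(toy_categorical_columns)
```

Only the two string columns are returned, in the order in which they appear in the DataFrame.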

Define a scikit-learn pipeline composed of an OrdinalEncoder and a LogisticRegression classifier.

Because OrdinalEncoder can raise errors if it sees an unknown category at prediction time, you can set the handle_unknown="use_encoded_value" and unknown_value parameters. You can refer to the scikit-learn documentation for more details regarding these parameters.


In [4]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

# Write your code here.

# Configure the OrdinalEncoder to handle unknown categories during prediction
ordinal_encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Create the LogisticRegression classifier
logistic_regression = LogisticRegression()

# Build the pipeline
pipeline = make_pipeline(ordinal_encoder, logistic_regression)
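
To see what handle_unknown="use_encoded_value" does in isolation, the following sketch fits an encoder on a tiny made-up column and then transforms a value it has never seen. The arrays `train_column` and `new_column` are hypothetical and only serve as an illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column data: two known categories at fit time.
train_column = np.array([["Private"], ["State-gov"], ["Private"]], dtype=object)
# At prediction time, "Never-worked" was not seen during fit.
new_column = np.array([["State-gov"], ["Never-worked"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_column)

# Known categories are mapped to their (sorted) index, unknown ones to -1.
encoded = encoder.transform(new_column)
print(encoder.categories_)
print(encoded)
```

Without these two parameters, the call to transform would raise an error on the unseen category instead of mapping it to -1.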

Your model is now defined. Evaluate it with cross-validation using sklearn.model_selection.cross_validate.

Note

Be aware that if an error happened during the cross-validation, cross_validate would raise a warning and return NaN (Not a Number) as scores. To make it raise a standard Python exception with a traceback, you can pass the error_score="raise" argument in the call to cross_validate. An exception would be raised instead of a warning at the first encountered problem and cross_validate would stop right away instead of returning NaN values. This is particularly handy when developing complex machine learning pipelines.
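
The difference between the two behaviours can be sketched on a toy problem. The data below are invented so that one training fold contains a single class, which makes LogisticRegression fail during fit on that fold only. This is just one convenient way to trigger a failure, not the kind of error you would necessarily meet in this exercise.

```python
import warnings

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Hypothetical toy data: with unshuffled 3-fold splitting, the first
# training set (rows 2 to 5) contains only class 1, so fitting fails there.
X_toy = np.arange(6, dtype=float).reshape(-1, 1)
y_toy = np.array([0, 0, 1, 1, 1, 1])
cv = KFold(n_splits=3)

# Default behaviour: the failing fold gets a NaN score and a warning is
# emitted (silenced here to keep the output short).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    lenient = cross_validate(LogisticRegression(), X_toy, y_toy, cv=cv)
print(lenient["test_score"])

# With error_score="raise", the underlying exception propagates instead.
try:
    cross_validate(
        LogisticRegression(), X_toy, y_toy, cv=cv, error_score="raise"
    )
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```

In the first call the problem is easy to miss because the other folds still return a score. In the second call the traceback points directly at the failing step.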


In [5]:
from sklearn.model_selection import cross_validate

# Write your code here.

cv_results = cross_validate(pipeline, data, target, cv=5, error_score='raise')
print(cv_results)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

{'fit_time': array([1.32727695, 0.69999719, 0.58214378, 0.53853273, 0.55896878]), 'score_time': array([0.15865707, 0.07513952, 0.07521367, 0.07845163, 0.07632399]), 'test_score': array([0.79895588, 0.80171973, 0.79791155, 0.80077805, 0.80528256])}




In [6]:
mean_test_score = cv_results['test_score'].mean()
print(f"Mean cross-validation test score: {mean_test_score}")

Mean cross-validation test score: 0.8009295520965087


Now, we would like to compare the generalization performance of our previous model with a new model that uses a OneHotEncoder instead of an OrdinalEncoder. Repeat the model evaluation using cross-validation. Compare the scores of both models and draw a conclusion about the impact of choosing a specific encoding strategy when using a linear model.
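
Before evaluating the models, it can help to look at what each encoder produces. The sketch below encodes a small made-up column with both strategies. The array `toy_column` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical single categorical column with three distinct categories.
toy_column = np.array(
    [["Private"], ["State-gov"], ["Self-emp"], ["Private"]], dtype=object
)

# Ordinal encoding: one integer per category, following the sorted order
# of the category names (an arbitrary order for the model).
ordinal_encoded = OrdinalEncoder().fit_transform(toy_column)

# One-hot encoding: one binary column per category. The encoder returns a
# sparse matrix by default, converted here to a dense array for display.
one_hot_encoded = OneHotEncoder().fit_transform(toy_column).toarray()

print(ordinal_encoded.ravel())
print(one_hot_encoded)
```

The ordinal encoding keeps a single column but imposes an order on the categories, while the one-hot encoding creates three columns and imposes no order.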


In [7]:
from sklearn.preprocessing import OneHotEncoder

# Write your code here.

# Define the new pipeline with OneHotEncoder
one_hot_encoder = OneHotEncoder(handle_unknown='ignore')
logistic_regression = LogisticRegression()
one_hot_pipeline = make_pipeline(one_hot_encoder, logistic_regression)

# Perform cross-validation on the new pipeline
one_hot_cv_results = cross_validate(one_hot_pipeline, data, target, cv=5, error_score="raise")

# Calculate the mean test score for the new pipeline
one_hot_mean_test_score = one_hot_cv_results['test_score'].mean()
print(f"Mean cross-validation test score with OrdinalEncoder: {mean_test_score}")

# Conclude on the impact of the encoding strategy
if one_hot_mean_test_score > mean_test_score:
    print("OneHotEncoder provided a better mean test score for the linear model.")
elif one_hot_mean_test_score < mean_test_score:
    print("OrdinalEncoder provided a better mean test score for the linear model.")
else:
    print("Both encoding strategies provided the same mean test score for the linear model.")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Mean cross-validation test score with OrdinalEncoder: 0.8009295520965087
OneHotEncoder provided a better mean test score for the linear model.
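
One possible intuition for this kind of result can be sketched on a toy problem where the target is positive only for the "middle" category in the integer order. The data below are invented, and the accuracies are computed on the training set, so they illustrate what each model can represent rather than how well it generalizes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: three categories, the positive class is "b" only.
categories = np.array(["a", "b", "c"] * 20, dtype=object).reshape(-1, 1)
labels = (categories.ravel() == "b").astype(int)

ordinal_model = make_pipeline(OrdinalEncoder(), LogisticRegression())
one_hot_model = make_pipeline(OneHotEncoder(), LogisticRegression())

# With the ordinal codes a=0, b=1, c=2, a linear decision function of the
# single feature cannot isolate the middle value.
ordinal_accuracy = ordinal_model.fit(categories, labels).score(categories, labels)
# With one binary column per category, each category gets its own weight.
one_hot_accuracy = one_hot_model.fit(categories, labels).score(categories, labels)

print(f"Ordinal encoding training accuracy: {ordinal_accuracy:.3f}")
print(f"One-hot encoding training accuracy: {one_hot_accuracy:.3f}")
```

In this toy setting, the linear model cannot do better than predicting the majority class with the ordinal encoding, because the integer codes suggest an ordering that has nothing to do with the target. The one-hot encoding avoids this by giving each category its own coefficient.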


