# Encoding of categorical variables (v2)
> Handling categorical variables with two encodings: ordinal encoding and one-hot encoding
- toc: true
- badges: false
- comments: true
- author: Cécile Gallioz
- categories: [sklearn, v2]

# Loading

In [1]:
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
import time

In [2]:
myData = pd.read_csv("../../scikit-learn-mooc/datasets/adult-census.csv")

In [3]:
myData = myData.drop(columns="education-num")

In [4]:
print(f"The dataset contains {myData.shape[0]} samples and {myData.shape[1]} columns")

The dataset contains 48842 samples and 13 columns


In [5]:
target_column = 'class'
target = myData[target_column]
data = myData.drop(columns=target_column)

In [6]:
from sklearn.compose import make_column_selector as selector
# Split the columns by dtype: object columns are treated as categorical
numerical_columns = selector(dtype_exclude=object)(data)
categorical_columns = selector(dtype_include=object)(data)
all_columns = numerical_columns + categorical_columns
data = data[all_columns]

In [7]:
data_numerical = data[numerical_columns]
data_categorical = data[categorical_columns]

> Here, we know that the object dtype holds strings, and therefore categorical features. This is not always the case: an object column can hold other kinds of information, such as dates that were left as unparsed strings even though they represent a quantity of elapsed time.
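
The caveat above can be illustrated with a minimal sketch. The toy frame and its column names (`city`, `signup_date`) are hypothetical and not part of the census data; the only point is that the selector looks at dtypes, not at what the strings mean.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame: both string columns are stored with object dtype.
toy = pd.DataFrame({
    "age": [25, 38, 52],
    "city": pd.Series(["Paris", "Lyon", "Paris"], dtype=object),
    "signup_date": pd.Series(["2021-01-05", "2021-03-17", "2022-07-30"], dtype=object),
})

# The selector only looks at dtypes, so the date strings count as categorical.
object_columns = selector(dtype_include=object)(toy)
print(object_columns)

# Parsing the dates changes the dtype, and the column is no longer selected.
toy["signup_date"] = pd.to_datetime(toy["signup_date"])
object_columns_after = selector(dtype_include=object)(toy)
print(object_columns_after)
```

Checking `data.dtypes` and a few rows of each object column before encoding is a cheap way to catch this kind of column.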



# Processing categorical and numerical features together

## Dispatch columns to a specific processor

In [8]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# One-hot encode the categorical columns; standardize the numerical columns
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

In [12]:
from sklearn.compose import ColumnTransformer
# Route each group of columns to its own preprocessor
preprocessor = ColumnTransformer([
    ('categorical', categorical_preprocessor, categorical_columns),
    ('numerical', numerical_preprocessor, numerical_columns)])
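
To see what such a `ColumnTransformer` produces, here is a small sketch on hypothetical toy data (two columns borrowed by name from the census set, with made-up values). It assumes the same two preprocessors as above and only inspects the output shape and the handling of an unseen category.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical column, one numerical column.
train = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Local-gov"],
    "age": [25, 38, 52, 41],
})

toy_preprocessor = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ("numerical", StandardScaler(), ["age"]),
])

encoded = toy_preprocessor.fit_transform(train)
# 3 one-hot columns (one per category) + 1 scaled numerical column
print(encoded.shape)

# A category unseen during fit is encoded as all zeros instead of raising.
new_row = pd.DataFrame({"workclass": ["Never-worked"], "age": [30]})
encoded_new = toy_preprocessor.transform(new_row)
if hasattr(encoded_new, "toarray"):  # the output may be a sparse matrix
    encoded_new = encoded_new.toarray()
print(encoded_new[0, :3])
```

The transformed columns follow the order of the transformer list, so the one-hot columns come first here.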

## LogisticRegression on all data

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
# Chain the preprocessing and the classifier, then evaluate with 10-fold cross-validation
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))

cv_results = cross_validate(model, data, target, cv=10)
scores = cv_results["test_score"]
fit_time = cv_results["fit_time"]
print("The accuracy is "
      f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")

The accuracy is 0.851 +/- 0.003, for 0.972 seconds


## OrdinalEncoder + LogisticRegression = not so good

In [14]:
from sklearn.preprocessing import OrdinalEncoder
# Encode each category as an integer; categories unseen during fit are encoded as 100
model = make_pipeline(OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=100), 
                      LogisticRegression(max_iter=500))

cv_results = cross_validate(model, data_categorical, target, cv=10)
scores = cv_results["test_score"]
fit_time = cv_results["fit_time"]
print("The accuracy is "
      f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")

The accuracy is 0.755 +/- 0.002, for 0.336 seconds
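
One likely reason for the lower score is that `OrdinalEncoder` maps categories to integers in sorted order, and a linear model then treats those integers as a meaningful scale. The sketch below uses a hypothetical three-value column to compare the two encodings; it is an illustration, not a reproduction of the experiment above, which also drops the numerical columns.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical column with three unordered categories.
workclass = np.array([["Private"], ["State-gov"], ["Local-gov"]], dtype=object)

# OrdinalEncoder sorts the categories, so the integers follow alphabetical order:
# Local-gov -> 0, Private -> 1, State-gov -> 2
ordinal = OrdinalEncoder().fit_transform(workclass)
print(ordinal.ravel())

# OneHotEncoder gives each category its own 0/1 column, with no implied order.
one_hot = OneHotEncoder().fit_transform(workclass).toarray()
print(one_hot)
```

With the ordinal encoding, the model can only learn one coefficient per column, which forces "State-gov" to have twice the effect of "Private". One-hot encoding lets each category get its own coefficient.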
