# Encoding of categorical variables
In this notebook, we will present typical ways of dealing with categorical variables by encoding them, namely ordinal encoding and one-hot encoding.

Let’s first load the entire adult dataset containing both numerical and categorical data.

In [110]:
import pandas as pd
import numpy as np

adult_census = pd.read_csv("/Users/russconte/Adult_Census.csv")
# drop the "Education-num" column, which duplicates "Education", as stated in the first notebook
adult_census = adult_census.drop(columns="Education-num")

target_name = "Class"
target = adult_census[target_name]

data = adult_census.drop(columns=[target_name])

In [111]:
data.head()

Unnamed: 0,Age,Workclass,fnlwgt,Education,Marital-Status,Occupation,Relationship,Race,Sex,Capital-gain,Capital-loss,Hours-per-week,Native-Country
0,25,Private,226802,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,89814,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,336951,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,160323,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,103497,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


Let's look at all the categorical variables:

In [112]:
adult_census.select_dtypes(include=object) # selects just the categorical columns

Unnamed: 0,Workclass,Education,Marital-Status,Occupation,Relationship,Race,Sex,Native-Country,Class
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States,<=50K
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States,<=50K
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States,>50K
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States,>50K
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States,<=50K
...,...,...,...,...,...,...,...,...,...
48837,Private,Assoc-acdm,Married-civ-spouse,Tech-support,Wife,White,Female,United-States,<=50K
48838,Private,HS-grad,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,United-States,>50K
48839,Private,HS-grad,Widowed,Adm-clerical,Unmarried,White,Female,United-States,<=50K
48840,Private,HS-grad,Never-married,Adm-clerical,Own-child,White,Male,United-States,<=50K


We can get a summary of a column by calling .value_counts() on it. For example:

In [113]:
adult_census["Relationship"].value_counts()

 Husband           19716
 Not-in-family     12583
 Own-child          7581
 Unmarried          5125
 Wife               2331
 Other-relative     1506
Name: Relationship, dtype: int64

# Select features based on their data type

In the previous notebook, we manually defined the numerical columns. We could do the same here, but instead we will use the scikit-learn helper function make_column_selector, which selects columns based on their data type. Here we select only the columns with the object dtype:

In [114]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
categorical_columns

['Workclass',
 'Education',
 'Marital-Status',
 'Occupation',
 'Relationship',
 'Race',
 'Sex',
 'Native-Country']
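
The same helper can also exclude a data type, which gives the complementary set of columns. The following is a minimal sketch on a hypothetical toy frame (`toy` and its column names are made up for illustration and are not part of the census data):

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame; the string column is stored with the object dtype
toy = pd.DataFrame(
    {
        "age": [25, 38, 28],
        "workclass": pd.Series(["Private", "Private", "Local-gov"], dtype=object),
        "hours": [40.0, 50.0, 40.0],
    }
)

# A selector is a callable: given a dataframe, it returns a list of column names
categorical_cols = selector(dtype_include=object)(toy)
numerical_cols = selector(dtype_exclude=object)(toy)

print(categorical_cols)
print(numerical_cols)
```

Building both lists from the dtypes avoids typing column names by hand, at the cost of relying on pandas having inferred the dtypes as expected.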

We can use this list to keep only the categorical columns:

In [115]:
data_categorical = data[categorical_columns]
data_categorical.head()

Unnamed: 0,Workclass,Education,Marital-Status,Occupation,Relationship,Race,Sex,Native-Country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States

As a way to double check the number of features, have Python do the checking! :)

In [116]:
print(f"The data set has {data_categorical.shape[1]} features")

The data set has 8 features


# Strategies to encode categories

### Encoding ordinal categories

The most intuitive strategy is to encode each category with a different number. The OrdinalEncoder transforms the data in this manner. We will start by encoding a single column to understand how the encoding works.

In [117]:
from sklearn.preprocessing import OrdinalEncoder

Education_column = data_categorical[["Education"]]

encoder = OrdinalEncoder()
education_encoded = encoder.fit_transform(Education_column)
education_encoded

array([[ 1.],
       [11.],
       [ 7.],
       ...,
       [11.],
       [11.],
       [11.]])

Each encoded value is the position of the category in the fitted categories_ attribute (for example, 1 stands for 11th and 11 for HS-grad):

In [118]:
encoder.categories_

[array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object)]

Now, we can check the encoding applied on all categorical features.

In [119]:
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

array([[ 4.,  1.,  4.,  7.,  3.,  2.,  1., 39.],
       [ 4., 11.,  2.,  5.,  0.,  4.,  1., 39.],
       [ 2.,  7.,  2., 11.,  0.,  4.,  1., 39.],
       [ 4., 15.,  2.,  7.,  0.,  2.,  1., 39.],
       [ 0., 15.,  4.,  0.,  3.,  4.,  0., 39.]])

In [120]:
print(f"The dataset encoded contains {data_encoded.shape[1]} features")

The dataset encoded contains 8 features


By default, OrdinalEncoder uses a lexicographical strategy to map string category labels to integers. This strategy is arbitrary and often meaningless. For instance, suppose the dataset has a categorical variable named "size" with categories such as “S”, “M”, “L”, “XL”. We would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as 0, 1, 2, 3. However, the lexicographical strategy used by default would map the labels “S”, “M”, “L”, “XL” to 2, 1, 0, 3, by following the alphabetical order.

The OrdinalEncoder class accepts a categories constructor argument to pass categories in the expected ordering explicitly. You can find more information in the scikit-learn documentation if needed.
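
As an illustration of the two paragraphs above, here is a minimal sketch that compares the default lexicographic mapping with an explicit ordering. The `sizes` frame is a hypothetical toy example built from the "size" categories mentioned in the text:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical "size" column, matching the example in the text
sizes = pd.DataFrame({"size": ["S", "M", "L", "XL", "M"]}, dtype=object)

# Default behaviour: categories are sorted lexicographically (L, M, S, XL)
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes).ravel()

# Explicit ordering: one list of categories per column, in the desired order
ordered_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
ordered_codes = ordered_encoder.fit_transform(sizes).ravel()

print(default_codes)  # S, M, L, XL, M -> 2, 1, 0, 3, 1
print(ordered_codes)  # S, M, L, XL, M -> 0, 1, 2, 3, 1
```

Note that `categories` takes a list of lists, with one inner list per encoded column.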

If a categorical variable does not carry any meaningful order information then this encoding might be misleading to downstream statistical models and you might consider using one-hot encoding instead (see below).

### Encoding nominal categories (without assuming any order)

OneHotEncoder is an alternative encoder that prevents the downstream models from making a false assumption about the ordering of categories. For a given feature, it will create as many new columns as there are possible categories. For a given sample, the value of the column corresponding to the category will be set to 1 while all the columns of the other categories will be set to 0.
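
To make the description above concrete, here is a small sketch on a hypothetical toy feature (the `toy_colors` frame and its values are invented for illustration). It relies on the encoder's default sparse output and converts it to a dense array only for display:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-feature frame with three distinct categories
toy_colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]}, dtype=object)

toy_encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array to display it
one_hot = toy_encoder.fit_transform(toy_colors).toarray()

print(toy_encoder.categories_[0])  # column order: blue, green, red
print(one_hot)  # each row contains a single 1, in the column of its category
```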

We will start by encoding a single feature (e.g. "Education") to illustrate how the encoding works.

In [121]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)  # this parameter is named `sparse` in scikit-learn < 1.2
education_encoded = encoder.fit_transform(Education_column)
education_encoded

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

We see that each category in "Education" now has its own column, filled with zeros and ones. We can check which category each column corresponds to through the fitted attribute categories_.

In [122]:
encoder.categories_

[array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object)]

We can check if the encoding was applied to all columns:

In [123]:
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.

In [124]:
print(f"The dataset encoded contains {data_encoded.shape[1]} features")

The dataset encoded contains 102 features


We see that encoding a single feature will give a NumPy array full of zeros and ones. We can get a better understanding using the associated feature names resulting from the transformation.

In [125]:
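# `encoder` was refit on all columns above, so fit one on "Education" alone
education_encoder = OneHotEncoder().fit(Education_column)
feature_names = education_encoder.get_feature_names_out(input_features=["Education"])
education_encoded = pd.DataFrame(education_encoder.transform(Education_column).toarray(), columns=feature_names)
education_encoded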

Let’s wrap this NumPy array in a dataframe with informative column names as provided by the encoder object:

In [None]:
columns_encoded = encoder.get_feature_names_out(data_categorical.columns)
pd.DataFrame(data_encoded, columns=columns_encoded).head()

The number of features after the encoding (102) is more than 10 times larger than in the original data (8) because some variables such as Occupation and Native-Country have many possible categories.

### Choosing an encoding strategy

Choosing an encoding strategy will depend on the underlying models and the type of categories (i.e. ordinal vs. nominal).

Note: In general OneHotEncoder is the encoding strategy used when the downstream models are linear models while OrdinalEncoder is often a good strategy with tree-based models.

Using an OrdinalEncoder will output ordinal categories. This means that there is an order in the resulting categories (e.g. 0 < 1 < 2). The impact of violating this ordering assumption is really dependent on the downstream models. Linear models will be impacted by misordered categories while tree-based models will not.

You can still use an OrdinalEncoder with linear models but you need to be sure that:

- the original categories (before encoding) have an ordering;
- the encoded categories follow the same ordering as the original categories.

The next exercise highlights the issue of misusing OrdinalEncoder with a linear model.

One-hot encoding categorical variables with high cardinality can cause computational inefficiency in tree-based models. Because of this, it is not recommended to use OneHotEncoder in such cases even if the original categories do not have a given order. We will show this in the final exercise of this sequence.
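
The difference between linear and tree-based models on an arbitrary ordinal encoding can be sketched on a hypothetical toy problem. Everything below (the `colors` feature, the target rule, the choice of models) is an assumption made for illustration, and the scores are training accuracies, not cross-validated ones:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical nominal feature: the target is 1 only for "green"
colors = np.array([["blue"], ["green"], ["red"]] * 20, dtype=object)
y = (colors.ravel() == "green").astype(int)

# Lexicographic encoding: blue -> 0, green -> 1, red -> 2 (an arbitrary order)
X = OrdinalEncoder().fit_transform(colors)

# A linear model can only place one threshold along the encoded axis,
# so it cannot isolate the middle value; a tree can split twice
linear_accuracy = LogisticRegression().fit(X, y).score(X, y)
tree_accuracy = DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)

print(f"Logistic regression: {linear_accuracy:.2f}")
print(f"Decision tree:       {tree_accuracy:.2f}")
```

The tree separates the three encoded values perfectly, while the logistic regression cannot, because "green" sits between the two other codes.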

# Evaluate our predictive pipeline

We can now integrate this encoder inside a machine learning pipeline like we did with numerical data: let’s train a linear classifier on the encoded data and check the generalization performance of this machine learning pipeline using cross-validation.

Before we create the pipeline, we have to take a closer look at the Native-Country column. Let’s recall some statistics regarding this column.

In [None]:
data["Native-Country"].value_counts()

We see that the Holand-Netherlands category occurs rarely. This will be a problem during cross-validation: if every sample of this category ends up in the test set during splitting, the classifier will not have seen the category during training and the encoder will not be able to encode it.

In scikit-learn, there are two solutions to bypass this issue:

- list all the possible categories and provide them to the encoder via the keyword argument categories;
- set the parameter handle_unknown to "ignore": if an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros.

Here, we will use the latter solution for simplicity.
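
Both solutions listed above can be sketched on a hypothetical toy split in which one category appears only in the test data. The `train` and `test` frames are invented for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test split where one category appears only in the test set
train = pd.DataFrame({"country": ["United-States", "Mexico", "United-States"]}, dtype=object)
test = pd.DataFrame({"country": ["Mexico", "Holand-Netherlands"]}, dtype=object)

# Ignore categories that were not seen during fit
ignoring_encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
ignored_test = ignoring_encoder.transform(test).toarray()
print(ignored_test)  # the unseen category becomes a row of zeros

# Alternatively, declare every possible category up front
all_countries = [["Holand-Netherlands", "Mexico", "United-States"]]
listing_encoder = OneHotEncoder(categories=all_countries).fit(train)
listed_test = listing_encoder.transform(test).toarray()
print(listed_test)  # the rare category gets its own column
```

Listing the categories keeps a dedicated column for the rare category, but requires knowing every possible value in advance.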

We can now create our machine learning pipeline:

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), LogisticRegression(max_iter=500)
)

Note: Here, we need to increase the maximum number of iterations to obtain a fully converged LogisticRegression and silence a ConvergenceWarning. Contrary to the numerical features, the one-hot encoded categorical features are all on the same scale (values are 0 or 1), so they would not benefit from scaling. In this case, increasing max_iter is the right thing to do.

Finally, we can check the model’s generalization performance only using the categorical columns.

In [None]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, data_categorical, target)
cv_results

In [None]:
scores = cv_results["test_score"]
print(f"The accuracy is: {scores.mean():.3f} ± {scores.std():.3f}")

As you can see, this representation of the categorical variables is slightly more predictive of the revenue than the numerical variables that we used previously.

In this notebook we have:

- seen two common strategies for encoding categorical features: ordinal encoding and one-hot encoding;
- used a pipeline to apply a one-hot encoder before fitting a logistic regression.

# Check the same process using the diamonds data set

In [169]:
diamonds = pd.read_csv('/Users/russconte/diamonds.csv')

In [170]:
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


Look at the data types:

In [171]:
diamonds.dtypes

carat      float64
cut         object
color       object
clarity     object
depth      float64
table      float64
price        int64
x          float64
y          float64
z          float64
dtype: object

The Diamonds data set has three object columns (cut, color and clarity); the rest of the columns are numerical (either float64 or int64).

In [172]:
target = diamonds["cut"]
diamonds = diamonds.drop(columns = "cut")
diamonds.head()

Unnamed: 0,carat,color,clarity,depth,table,price,x,y,z
0,0.23,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [173]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(diamonds)
categorical_columns

['color', 'clarity']

In [174]:
data_categorical = diamonds[categorical_columns]

In [175]:
data_categorical

Unnamed: 0,color,clarity
0,E,SI2
1,E,SI1
2,E,VS1
3,I,VS2
4,J,SI2
...,...,...
53935,D,SI1
53936,D,SI1
53937,D,SI1
53938,H,SI2


In [176]:
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.]])

In [177]:
print(f"The dataset encoded contains {data_encoded.shape[1]} features")

The dataset encoded contains 15 features


In [178]:
columns_encoded = encoder.get_feature_names_out(data_categorical.columns)
pd.DataFrame(data_encoded, columns=columns_encoded).head()

Unnamed: 0,color_D,color_E,color_F,color_G,color_H,color_I,color_J,clarity_I1,clarity_IF,clarity_SI1,clarity_SI2,clarity_VS1,clarity_VS2,clarity_VVS1,clarity_VVS2
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [179]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"), LogisticRegression(max_iter=500)
)

In [180]:
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, data_categorical, target)
cv_results

{'fit_time': array([0.69825315, 0.68562508, 0.61841106, 0.65547967, 0.6298511 ]),
 'score_time': array([0.01315594, 0.01029181, 0.0105269 , 0.01066923, 0.01029205]),
 'test_score': array([0.38598443, 0.39618094, 0.42259918, 0.39108268, 0.40674824])}

In [181]:
data_categorical.describe()

Unnamed: 0,color,clarity
count,53940,53940
unique,7,8
top,G,SI1
freq,11292,13065


In [182]:
target.describe()

count     53940
unique        5
top       Ideal
freq      21551
Name: cut, dtype: object

In [183]:
scores = cv_results["test_score"]
print(f"The accuracy is: {scores.mean():.3f} ± {scores.std():.3f}")

The accuracy is: 0.401 ± 0.013
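
To put this accuracy in context, it can help to compare it with a classifier that always predicts the most frequent class. According to the target.describe() output above, the most frequent cut (Ideal) covers 21551 of the 53940 rows, about 0.400, so the model trained on color and clarity alone is barely better than that baseline. A minimal sketch of such a comparison, using scikit-learn's DummyClassifier on a hypothetical toy target (in the notebook, data_categorical and target would be passed instead):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical toy data: 60 "Ideal" and 40 "Good" samples
toy_target = np.array(["Ideal"] * 60 + ["Good"] * 40)
toy_features = np.zeros((len(toy_target), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, toy_features, toy_target)

print(f"Baseline accuracy: {baseline_scores.mean():.3f} ± {baseline_scores.std():.3f}")
```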

