[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/Tim-Abwao/machine-learning-with-scikit-learn/HEAD?labpath=3.%20Preprocessing%20Categorical%20Features.ipynb)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Tim-Abwao/machine-learning-with-scikit-learn/blob/main/3.%20Preprocessing%20Categorical%20Features.ipynb)

# Introduction

Since most machine learning algorithms expect numeric input, categorical features have to be encoded into some numeric representation. This can be achieved with:

- `OneHotEncoder`
- `OrdinalEncoder`
- `TargetEncoder`

---

The *scikit-learn Transformer API* provides the following methods:

- `fit`: learn parameters from training data (e.g. $\mu$ and $\sigma$ for `StandardScaler`, or the set of categories for an encoder).
- `transform`: convert data using the learned parameters.
- `fit_transform`: apply `fit` and then `transform` to the same data in a single call.

To avoid data leakage, call `fit` / `fit_transform` only on training data, and use `transform` on test data. Better yet, add the *Transformer* to a preprocessing pipeline, so that it is fitted only on the data passed to the pipeline's `fit`.

In the examples below, we'll use `fit_transform` since the goal is solely to demonstrate the effect of the transformations.

In [1]:
import pandas as pd
import seaborn as sbn
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, TargetEncoder

# 1. OneHotEncoder

Adds a new column ([dummy variable](https://en.wikipedia.org/wiki/Dummy_variable_(statistics))) for each level in a categorical feature. In each row, to indicate presence of a particular category, its respective column is set to 1 while all other related columns are set to 0.

Suitable for nominal data (names or labels) whose levels have no ranking, e.g. hair color or weather (sunny, rainy, cloudy, ...).

>**Caution:** Can lead to very sparse data with lots of columns if features have high cardinality ([curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality#Machine_learning)).

In [2]:
titanic_data = sbn.load_dataset("titanic").select_dtypes("O").dropna()
titanic_data.head()

| | sex | embarked | who | embark_town | alive |
|---|---|---|---|---|---|
| 0 | male | S | man | Southampton | no |
| 1 | female | C | woman | Cherbourg | yes |
| 2 | female | S | woman | Southampton | yes |
| 3 | female | S | woman | Southampton | yes |
| 4 | male | S | man | Southampton | no |


For memory efficiency, the default output is in [Compressed Sparse Row matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) format:

In [3]:
OneHotEncoder().fit_transform(titanic_data)

<889x13 sparse matrix of type '<class 'numpy.float64'>'
	with 4445 stored elements in Compressed Sparse Row format>
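
As a small illustration of the sparse output, the sketch below encodes a made-up three-row frame, confirms that the result is a SciPy sparse matrix, and converts it to a dense array with `.toarray()`. The toy data is an assumption for the example.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Made-up data for illustration.
toy = pd.DataFrame({"sex": ["male", "female", "female"], "embarked": ["S", "C", "S"]})
encoded = OneHotEncoder().fit_transform(toy)

print(sparse.issparse(encoded))
# Only the 1s are stored: one per row per feature.
print(encoded.nnz)

# Convert to a regular NumPy array when a dense result is needed.
dense = encoded.toarray()
print(dense.shape)
```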

In [4]:
one_hot_encoder = OneHotEncoder(sparse_output=False)
ohe_data = pd.DataFrame(one_hot_encoder.fit_transform(titanic_data),
                        columns=one_hot_encoder.get_feature_names_out())
ohe_data.head()

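| | sex_female | sex_male | embarked_C | embarked_Q | embarked_S | who_child | who_man | who_woman | embark_town_Cherbourg | embark_town_Queenstown | embark_town_Southampton | alive_no | alive_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |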


For features with just 2 levels, only one dummy variable is necessary (absence of one level equates to presence of the other). You can achieve this by specifying `drop="if_binary"`:

In [5]:
ohe_drop_binary = OneHotEncoder(sparse_output=False, drop="if_binary")
ohe_data_drop_binary = pd.DataFrame(ohe_drop_binary.fit_transform(titanic_data),
                        columns=ohe_drop_binary.get_feature_names_out())
ohe_data_drop_binary.head()

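| | sex_male | embarked_C | embarked_Q | embarked_S | who_child | who_man | who_woman | embark_town_Cherbourg | embark_town_Queenstown | embark_town_Southampton | alive_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 4 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |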


In [6]:
set(ohe_data) - set(ohe_data_drop_binary)

{'alive_no', 'sex_female'}

Alternatively, the [pandas.get_dummies()](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html#pandas.get_dummies) function can one-hot encode a *pandas* DataFrame directly:

In [7]:
pd.get_dummies(titanic_data).head()

| | sex_female | sex_male | embarked_C | embarked_Q | embarked_S | who_child | who_man | who_woman | embark_town_Cherbourg | embark_town_Queenstown | embark_town_Southampton | alive_no | alive_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | True | False | False | True | False | True | False | False | False | True | True | False |
| 1 | True | False | True | False | False | False | False | True | True | False | False | False | True |
| 2 | True | False | False | False | True | False | False | True | False | False | True | False | True |
| 3 | True | False | False | False | True | False | False | True | False | False | True | False | True |
| 4 | False | True | False | False | True | False | True | False | False | False | True | True | False |


# 2. OrdinalEncoder

Replaces the values in each categorical feature with integers ranging from 0 to *n_categories - 1*.

Suitable for ordinal data which contains a natural order / ranking e.g. grade, level-of-satisfaction (very bad, bad, fair, good, very good).

In [8]:
diamond_data = sbn.load_dataset("diamonds").loc[:, ["cut", "clarity"]]
diamond_data.head()

| | cut | clarity |
|---|---|---|
| 0 | Ideal | SI2 |
| 1 | Premium | SI1 |
| 2 | Good | VS1 |
| 3 | Premium | VS2 |
| 4 | Good | SI2 |


In [9]:
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit_transform(diamond_data)

array([[2., 3.],
       [3., 2.],
       [1., 4.],
       ...,
       [4., 2.],
       [3., 3.],
       [2., 3.]])

In [10]:
ordinal_encoder.categories_

[array(['Fair', 'Good', 'Ideal', 'Premium', 'Very Good'], dtype=object),
 array(['I1', 'IF', 'SI1', 'SI2', 'VS1', 'VS2', 'VVS1', 'VVS2'],
       dtype=object)]

> By default, categories are ranked alphabetically, which will likely be wrong. In the above arrays, "Ideal" and "IF" should be the best (last / rightmost).
>
> You can specify the proper order using the `categories` argument.

In [11]:
cut_levels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
# Clarity grades from worst to best: within each pair, 1 is better than 2 (e.g. SI1 > SI2).
clarity_levels = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
ordinal_encoder = OrdinalEncoder(categories=[cut_levels, clarity_levels])
ordinal_encoder.fit_transform(diamond_data)

array([[4., 1.],
       [3., 2.],
       [1., 4.],
       ...,
       [2., 2.],
       [3., 1.],
       [4., 1.]])

In [12]:
ordinal_encoder.categories_

[array(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], dtype=object),
 array(['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'],
       dtype=object)]
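
When the order is given explicitly, it may also be worth deciding what happens to values that are not in the list. The sketch below is one option, using `handle_unknown="use_encoded_value"` with `unknown_value=-1`. The sentinel `-1` and the unseen level `"Superb"` are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

cut_levels = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
encoder = OrdinalEncoder(
    categories=[cut_levels],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # hypothetical sentinel for levels outside `cut_levels`
)
train = np.array([["Good"], ["Ideal"], ["Fair"]], dtype=object)
encoder.fit(train)

# "Premium" is in the declared order even though it was absent from `train`;
# "Superb" is not in the list at all, so it maps to the sentinel.
new = np.array([["Premium"], ["Superb"]], dtype=object)
codes = encoder.transform(new)
print(codes)
```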

# 3. TargetEncoder

Replaces each category with the mean of the target for that category, shrunk towards the overall target mean. Suitable for nominal (unordered) features with high cardinality, e.g. zip code or nationality.


In [13]:
mpg_data = sbn.load_dataset("mpg")
X = mpg_data[["origin", "name"]]
y = mpg_data["mpg"]
X.head()

| | origin | name |
|---|---|---|
| 0 | usa | chevrolet chevelle malibu |
| 1 | usa | buick skylark 320 |
| 2 | usa | plymouth satellite |
| 3 | usa | amc rebel sst |
| 4 | usa | ford torino |


The *name* feature has high cardinality:

In [14]:
X.nunique()

origin      3
name      305
dtype: int64

In [15]:
target_encoder = TargetEncoder()
target_encoder.fit_transform(X, y)[:5]

array([[20.04553145, 17.        ],
       [20.2560156 , 23.8169279 ],
       [20.2560156 , 23.8169279 ],
       [20.2560156 , 23.8169279 ],
       [20.04553145, 23.50660377]])

> Compared to the 2 columns produced by `TargetEncoder`, one-hot encoding `X` would result in 308 columns (3 for *origin* plus 305 for *name*).

In [16]:
OneHotEncoder().fit_transform(X)

<398x308 sparse matrix of type '<class 'numpy.float64'>'
	with 796 stored elements in Compressed Sparse Row format>

# 4. Category Encoders Package

The [category_encoders][ce] package has a host of additional encoders compatible with *scikit-learn*.

[ce]: https://contrib.scikit-learn.org/category_encoders/