# Encoding of categorical variables

In this notebook, we will present typical ways of dealing with
**categorical variables**, namely **ordinal encoding** and
**one-hot encoding**.

Let's first load the entire adult dataset containing both numerical and
categorical data.

In [None]:
import pandas as pd

df = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = df[target_name]

data = df.drop(columns=[target_name, "fnlwgt"])


## Identify categorical variables

As we saw in the previous section, a numerical variable is a
quantity represented by a real or integer number. These variables can be
naturally handled by machine learning algorithms that are typically composed
of a sequence of arithmetic instructions such as additions and
multiplications.

In contrast, categorical variables have discrete values, typically
represented by string labels taken from a finite list of possible choices.
For instance, the variable `native-country` in our dataset is a categorical
variable because it encodes the data using a finite list of possible
countries (along with the `?` symbol when this information is missing):

In [None]:
data["native-country"].value_counts().sort_index()

Now the question is: how can we easily recognize categorical columns
in the dataset? Part of the answer lies in the columns' data type:

In [None]:
data.dtypes

If we look at the "native-country" column,
we observe its data type is `object`, meaning it contains string values.

Sometimes, categorical columns are also encoded with integers. In such a
case, looking at the data type is not enough. In a previous notebook,
we saw that this is the case for the column `"education-num"`.

In [None]:
data["education-num"].value_counts()

When considering categorical columns, we should also include such integer-coded columns.
However, we saw earlier that `"education-num"` and `"education"` represent
the exact same information. Therefore, we can get rid of one of the two.
Because in this notebook we want to work with categorical data,
we will use `"education"`, which is of `object` dtype.
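As a quick sanity check, the claim that two columns carry the same information can be tested by verifying that each label maps to exactly one code and vice versa. The sketch below uses a small hypothetical frame rather than the real dataset; the values and the `is_one_to_one` name are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the two columns of the real dataset.
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "HS-grad", "Masters", "Bachelors"],
    "education-num": [9, 13, 9, 14, 13],
})

# Number of distinct codes seen for each label, and distinct labels per code.
codes_per_label = toy.groupby("education")["education-num"].nunique()
labels_per_code = toy.groupby("education-num")["education"].nunique()

# The mapping is one-to-one when every group contains a single value.
is_one_to_one = bool(
    (codes_per_label == 1).all() and (labels_per_code == 1).all()
)
print(is_one_to_one)
```

The same two `groupby` calls could be run on `data` itself to confirm the redundancy before dropping one of the columns.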

## Select features based on their data type

In the previous notebook, we manually defined the numerical columns.
We could take the same manual approach here. Instead, we will use the
scikit-learn helper function `make_column_selector`, which allows us to
select columns based on their data type. We will illustrate how to use this
helper.

In [None]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
categorical_columns

Here, we created the selector by passing the data type to include;
we then passed the input dataset to the selector object,
which returned a list of column names.
We can now keep only the categorical columns:

In [None]:
data_categorical = data[categorical_columns]
data_categorical.head()

In [None]:
print(f"The dataset is composed of {data_categorical.shape[1]} features")

In the remainder of this section, we will present different strategies to
encode categorical data into numerical data which can be used by a
machine-learning algorithm.

## Encoding ordinal categories

The most intuitive strategy is to encode each category with a different
number. The `OrdinalEncoder` transforms the data in this way.
We will start by encoding a single column to understand how the encoding
works.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

education_column = data_categorical[["education"]]

encoder = OrdinalEncoder()
education_encoded = encoder.fit_transform(education_column)

We can visually check the encoding obtained.

In [None]:
import seaborn as sns
sns.set_context("talk")

# Use a new name so the originally loaded `df` is not overwritten.
df_encoded = pd.DataFrame(
    education_encoded[:10], columns=education_column.columns)
ax = sns.heatmap(df_encoded, annot=True, cmap="tab20", cbar=False)
ax.set_ylabel("Sample index")
_ = ax.set_title("Ordinal encoding of 'education' column")

We see that each category in `"education"` has been replaced by a numeric
value. Now, we can check the encoding applied on all categorical features.

In [None]:
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

In [None]:
print(
    f"The dataset encoded contains {data_encoded.shape[1]} features")

We can see that the categories have been encoded for each feature (column)
independently. We can also note that the number of features before and after
the encoding is the same.
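To see which integer was assigned to which label, the fitted encoder exposes a `categories_` attribute holding one array of categories per column. The sketch below uses a hypothetical two-column toy array (not the census data); the position of a label in its array is the code it receives.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: two categorical columns, three samples.
toy = np.array(
    [["red", "small"], ["blue", "large"], ["red", "large"]], dtype=object
)

toy_encoder = OrdinalEncoder()
toy_encoded = toy_encoder.fit_transform(toy)

# One array per column; the index of a label is its integer code.
learned_categories = [list(cats) for cats in toy_encoder.categories_]
print(learned_categories)
print(toy_encoded)

# inverse_transform maps the integer codes back to the original labels.
toy_decoded = toy_encoder.inverse_transform(toy_encoded)
```

Each column is handled independently, which is why the code `0` means `"blue"` in the first column and `"large"` in the second.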

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">This encoding was used by the dataset's publishers on the <tt class="docutils literal">"education"</tt>
feature, which gave the feature <tt class="docutils literal"><span class="pre">"education-num"</span></tt>.</p>
</div>

However, be careful when applying this encoding strategy:
using this integer representation can lead downstream models
to assume that the values are ordered (0 < 1 < 2 < 3... for instance).

By default, `OrdinalEncoder` uses a lexicographical strategy to map string
category labels to integers. This strategy is arbitrary and often meaningless.
For instance, suppose the dataset has a categorical variable
named `"size"` with categories such as "S", "M", "L", "XL". We would like the
integer representation to respect the meaning of the sizes by mapping them to
increasing integers such as 0, 1, 2, 3.
However, the lexicographical strategy used by default would map the labels
"S", "M", "L", "XL" to 2, 1, 0, 3, following the alphabetical order.

The `OrdinalEncoder` class accepts a `categories` constructor argument to
pass in the correct ordering explicitly.
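The difference between the default ordering and an explicit one can be sketched on the `"size"` example. The toy array below is hypothetical (there is no such column in the census data); `categories` takes one list per column, ordered from the smallest to the largest category.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical "size" column, one sample per row.
sizes = np.array([["S"], ["M"], ["L"], ["XL"], ["M"]], dtype=object)

# Default: categories are sorted, so "L" < "M" < "S" < "XL".
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes).ravel()

# Explicit ordering that respects the meaning of the sizes.
ordered_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
ordered_codes = ordered_encoder.fit_transform(sizes).ravel()

print(default_codes)  # codes following the alphabetical order
print(ordered_codes)  # codes following the size order
```

With the explicit list, larger sizes receive larger integers, so a model that treats the codes as ordered values is no longer misled.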

If a categorical variable does not carry any meaningful order information,
then this encoding might be misleading to downstream statistical models, and
you might consider using one-hot encoding instead (see below).

<div class="admonition important alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Important</p>
<p class="last">Note however that the impact of violating this ordering assumption is really
dependent on the downstream models (for instance linear models are much more
sensitive than models built from an ensemble of decision trees).</p>
</div>

## Encoding nominal categories (without assuming any order)

`OneHotEncoder` is an alternative encoder that prevents the downstream
models from making a false assumption about the ordering of categories. For a
given feature, it will create as many new columns as there are possible
categories. For a given sample, the value of the column corresponding to the
category will be set to `1` while all the columns of the other categories
will be set to `0`.
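Before turning to scikit-learn, the idea can be sketched by hand with NumPy. This is only an illustration of the principle on a few hypothetical labels, not how `OneHotEncoder` is implemented.

```python
import numpy as np

# Hypothetical labels for a single categorical feature.
labels = np.array(["Bachelors", "HS-grad", "Masters", "HS-grad"])

# Sorted unique categories: one output column per category.
unique_categories = np.unique(labels)

# Compare every sample with every category: the matching column is set
# to 1, all the others to 0.
one_hot = (labels[:, None] == unique_categories[None, :]).astype(int)

print(unique_categories)
print(one_hot)
```

Every row contains exactly one `1`, and no numerical order between the categories is implied.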

We will start by encoding a single feature (e.g. `"education"`) to illustrate
how the encoding works.

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">We will pass the argument <tt class="docutils literal">sparse=False</tt> to the <tt class="docutils literal">OneHotEncoder</tt> to
obtain a dense array instead of a sparse matrix. This is less efficient but makes the
results easier to inspect for didactic purposes. In scikit-learn 1.2 and later, this
parameter is named <tt class="docutils literal">sparse_output</tt>.</p>
</div>

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False)
education_encoded = encoder.fit_transform(education_column)

As in the previous section, we will visually check the encoding.

In [None]:
# `get_feature_names_out` requires scikit-learn >= 1.0.
df_encoded = pd.DataFrame(
    education_encoded[:10],
    columns=encoder.get_feature_names_out(education_column.columns))
ax = sns.heatmap(df_encoded, annot=True, cmap="RdBu", cbar=False)
ax.set_ylabel("Sample index")
_ = ax.set_title("One-hot encoding of 'education' column")

As we can see, each category (unique value) became a column;
for each sample, the encoding returned a 1 in the column of the category it belongs to.

Let's apply this encoding on the full dataset.

In [None]:
print(
    f"The dataset is composed of {data_categorical.shape[1]} features")
data_categorical.head()

In [None]:
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

In [None]:
print(
    f"The dataset encoded contains {data_encoded.shape[1]} features")

Let's wrap this NumPy array in a dataframe with informative column names as
provided by the encoder object:

In [None]:
columns_encoded = encoder.get_feature_names_out(data_categorical.columns)
pd.DataFrame(data_encoded, columns=columns_encoded).head()

Look at how the "workclass" variable of the first 3 records has been encoded
and compare this to the original string representation.

The number of features after the encoding is more than 10 times larger than
in the original data because some variables such as `occupation` and
`native-country` have many possible categories.
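The number of one-hot columns can be predicted before encoding: it is the sum of the number of distinct categories of each column. The sketch below checks this on a hypothetical three-column toy frame; the same `nunique().sum()` expression could be applied to `data_categorical`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with 2 + 2 + 3 distinct categories.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "workclass": ["Private", "State-gov", "Private"],
    "education": ["Bachelors", "HS-grad", "Masters"],
})

# Expected number of one-hot columns: one per distinct category.
n_expected = int(toy.nunique().sum())

# Actual number of columns produced by the encoder.
toy_one_hot = OneHotEncoder().fit_transform(toy.to_numpy(dtype=object))
n_actual = toy_one_hot.shape[1]

print(n_expected, n_actual)
```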

## Evaluate our predictive pipeline

We can now integrate this encoder inside a machine learning pipeline like we
did with numerical data: let's train a linear classifier on the encoded data
and check the performance of this machine learning pipeline using
cross-validation.

Before we create the pipeline, we have to take a closer look at the
`native-country` column. Let's recall some statistics regarding this column.

In [None]:
data["native-country"].value_counts()

We see that the `Holand-Netherlands` category occurs rarely. This will
be a problem during cross-validation: if the corresponding sample ends up in
the test set during splitting, then the encoder will not have seen the
category during training and will not be able to encode it.

In scikit-learn, there are two solutions to bypass this issue:

* list all the possible categories and provide them to the encoder via the
  keyword argument `categories`;
* set the parameter `handle_unknown="ignore"`.

Here, we will use the former strategy because we are also going to use it for
the ordinal encoder later on.

In [None]:
categories = [data_categorical[column].unique()
              for column in data_categorical]
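To make the two options concrete, the sketch below reproduces the problem on a hypothetical toy split in which the rare category only appears at test time. The arrays and variable names are illustrative and not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: the rare category is absent from the training part.
train = np.array([["United-States"], ["Mexico"], ["United-States"]],
                 dtype=object)
test = np.array([["Holand-Netherlands"]], dtype=object)

# Default behaviour: an unknown category raises an error at transform time.
strict_encoder = OneHotEncoder().fit(train)
try:
    strict_encoder.transform(test)
    raised = False
except ValueError:
    raised = True

# Option 1: declare all possible categories up front.
all_categories = [["Holand-Netherlands", "Mexico", "United-States"]]
explicit_encoder = OneHotEncoder(categories=all_categories).fit(train)
explicit_row = explicit_encoder.transform(test).toarray()

# Option 2: ignore unknown categories, which yields a row of zeros.
tolerant_encoder = OneHotEncoder(handle_unknown="ignore").fit(train)
ignored_row = tolerant_encoder.transform(test).toarray()

print(raised, explicit_row, ignored_row)
```

With explicit categories, the rare category keeps its own column; with `handle_unknown="ignore"`, the unseen sample is encoded with all columns set to zero.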

We can now create our machine learning pipeline.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(
    OneHotEncoder(categories=categories, drop="if_binary"),
    LogisticRegression(max_iter=500))

Finally, we can check the model's performance using only the categorical
columns.

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, data_categorical, target)
scores

In [None]:
print(f"The accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")

As you can see, this representation of the categorical variables is
slightly more predictive of the revenue than the numerical variables
that we used previously.


In this notebook, we have:

* seen two common strategies for encoding categorical features: **ordinal
  encoding** and **one-hot encoding**;
* used a **pipeline** to use a **one-hot encoder** before fitting a logistic
  regression.