# Encoding Categorical Variables 
In this notebook, we will explore the two main techniques for handling categorical variables by encoding them: ordinal encoding and one-hot encoding. We will look at their different use cases and how to work with them using the scikit-learn API.

We will load our dataset, Adult Census, which contains both categorical and numerical features.

In [8]:
import pandas as pd 
adult_census = pd.read_csv("adult_census.csv")

# Drop education-num, which duplicates the education column, and the fnlwgt column.
duplicated_columns = ["education-num", "fnlwgt"]
adult_census = adult_census.drop(columns=duplicated_columns)

# Identify the target class column and separate it from the input data.
target_name = "class"
target = adult_census[target_name]

data = adult_census.drop(columns=target_name)

In [9]:
adult_census.sample(3)

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
23697,46,Private,HS-grad,Married-civ-spouse,Transport-moving,Husband,White,Male,15024,0,44,United-States,>50K
28261,31,Private,Assoc-voc,Never-married,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K
22019,23,Private,HS-grad,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K


### Identifying Categorical variables. 


In [13]:
data["marital-status"].value_counts().sort_index()

 Divorced                  6633
 Married-AF-spouse           37
 Married-civ-spouse       22379
 Married-spouse-absent      628
 Never-married            16117
 Separated                 1530
 Widowed                   1518
Name: marital-status, dtype: int64

In [14]:
data["native-country"].value_counts().sort_index()

 ?                               857
 Cambodia                         28
 Canada                          182
 China                           122
 Columbia                         85
 Cuba                            138
 Dominican-Republic              103
 Ecuador                          45
 El-Salvador                     155
 England                         127
 France                           38
 Germany                         206
 Greece                           49
 Guatemala                        88
 Haiti                            75
 Holand-Netherlands                1
 Honduras                         20
 Hong                             30
 Hungary                          19
 India                           151
 Iran                             59
 Ireland                          37
 Italy                           105
 Jamaica                         106
 Japan                            92
 Laos                             23
 Mexico                          951
 

An easy way to recognize the categorical columns in the dataset is to check each column's data type.

In [15]:
data.dtypes

age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

We can observe that the columns containing string values have the `object` data type.

### Feature Selection based on data type
Instead of manually selecting the categorical columns, we can use a scikit-learn helper function called `make_column_selector` to select columns based on their data type. Let's see how this works.

In [17]:
from sklearn.compose import make_column_selector as selector 

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
categorical_columns

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
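The same helper can also select the complement. The sketch below uses a small hypothetical frame (`toy`, not the census data) and assumes that string columns are stored with the `object` dtype, as in the output above; it picks the categorical columns with `dtype_include` and the remaining numerical ones with `dtype_exclude`.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame standing in for the census data.
# String columns are cast to object explicitly so the dtype matches the notebook.
toy = pd.DataFrame(
    {
        "age": [25, 38, 52],
        "workclass": ["Private", "Local-gov", "Private"],
        "hours-per-week": [40, 50, 45],
        "sex": ["Male", "Female", "Male"],
    }
).astype({"workclass": object, "sex": object})

# A selector is a callable: give it a DataFrame, get back a list of column names.
categorical_selector = selector(dtype_include=object)
numerical_selector = selector(dtype_exclude=object)

toy_categorical_columns = categorical_selector(toy)
toy_numerical_columns = numerical_selector(toy)

print(toy_categorical_columns)
print(toy_numerical_columns)
```

Because the selector returns plain column names, the result can be used directly to index the DataFrame.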

Since we now have a list of the column names that contain categorical data, we can keep only the categorical columns of our dataset.

In [18]:
data_categorical = data[categorical_columns]
data_categorical.head()

Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States


In [19]:
print(f"The dataset is composed of {data_categorical.shape[1]} features")

The dataset is composed of 8 features


Now that we have successfully identified and filtered out the categorical features, we can look at the different strategies for encoding categorical variables into numerical data that a machine learning algorithm can use.

## Categorical Data Encoding strategies.

### Encoding ordinal categories
The `OrdinalEncoder` will transform the data by encoding each category with a different number. 

In [23]:
## A first look at how OrdinalEncoder works with a single column. 
from sklearn.preprocessing import OrdinalEncoder

education_column = data_categorical[["education"]]

encoder = OrdinalEncoder()
education_encoded = encoder.fit_transform(education_column)
education_encoded

array([[ 1.],
       [11.],
       [ 7.],
       ...,
       [11.],
       [11.],
       [11.]])

We can check the mapping between the categories and the numerical values assigned by the encoder by inspecting the fitted attribute `categories_`. The position of each category in the array is the integer it is encoded as.

In [25]:
encoder.categories_

[array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object)]
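To make the mapping explicit, the position of each category in `categories_` can be turned into a dictionary. This is a minimal sketch on a hypothetical three-category column (`toy_education`), not the full census data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column frame with three education levels
toy_education = pd.DataFrame(
    {"education": ["HS-grad", "Bachelors", "10th", "HS-grad"]}, dtype=object
)

toy_encoder = OrdinalEncoder()
toy_codes = toy_encoder.fit_transform(toy_education)

# categories_ holds one array per input column;
# the index of a category in that array is its integer code.
category_to_code = {
    str(category): index
    for index, category in enumerate(toy_encoder.categories_[0])
}

print(category_to_code)
print(toy_codes.ravel())
```

Note that `HS-grad` receives the highest code here only because it sorts last alphabetically, not because it is the highest level of education.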

In [28]:
# and then check the encoding applied on all categorical features
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

array([[ 4.,  1.,  4.,  7.,  3.,  2.,  1., 39.],
       [ 4., 11.,  2.,  5.,  0.,  4.,  1., 39.],
       [ 2.,  7.,  2., 11.,  0.,  4.,  1., 39.],
       [ 4., 15.,  2.,  7.,  0.,  2.,  1., 39.],
       [ 0., 15.,  4.,  0.,  3.,  4.,  0., 39.]])

In [30]:
encoder.categories_

[array([' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private',
        ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
       dtype=object),
 array([' 10th', ' 11th', ' 12th', ' 1st-4th', ' 5th-6th', ' 7th-8th',
        ' 9th', ' Assoc-acdm', ' Assoc-voc', ' Bachelors', ' Doctorate',
        ' HS-grad', ' Masters', ' Preschool', ' Prof-school',
        ' Some-college'], dtype=object),
 array([' Divorced', ' Married-AF-spouse', ' Married-civ-spouse',
        ' Married-spouse-absent', ' Never-married', ' Separated',
        ' Widowed'], dtype=object),
 array([' ?', ' Adm-clerical', ' Armed-Forces', ' Craft-repair',
        ' Exec-managerial', ' Farming-fishing', ' Handlers-cleaners',
        ' Machine-op-inspct', ' Other-service', ' Priv-house-serv',
        ' Prof-specialty', ' Protective-serv', ' Sales', ' Tech-support',
        ' Transport-moving'], dtype=object),
 array([' Husband', ' Not-in-family', ' Other-relative', ' Own-child',
        ' Unmarried'

In [31]:
print(f"The dataset encoded has {data_encoded.shape[1]} features")

The dataset encoded has 8 features


It is important to note that `OrdinalEncoder` uses a lexicographical strategy to map string category labels to integers. The resulting order is often arbitrary and meaningless, and it could even mislead downstream predictive models. One way to address this is to pass the desired order explicitly through the `categories` constructor parameter of `OrdinalEncoder`.
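As an illustration of the `categories` parameter, the sketch below encodes a hypothetical clothing-size column (not part of the census data) twice: once with the default lexicographical order and once with an explicit order. The column and its ordering are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature: clothing sizes
toy_sizes = pd.DataFrame({"size": ["M", "S", "XL", "L", "S"]}, dtype=object)

# Default behaviour: categories are sorted lexicographically (L, M, S, XL)
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(toy_sizes).ravel()

# Explicit behaviour: one list of categories per column, in the intended order
ordered_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
ordered_codes = ordered_encoder.fit_transform(toy_sizes).ravel()

print(default_codes)  # codes follow alphabetical order
print(ordered_codes)  # codes follow S < M < L < XL
```

With the explicit list, the integer codes increase with the actual size, which is the property a linear model would rely on.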

If the categorical variable does not carry any meaningful order information, the integers produced by `OrdinalEncoder` might mislead downstream statistical and machine learning models. In that case, an alternative is the one-hot encoding strategy.

### Encoding nominal categories (without assuming any order)
`OneHotEncoder` is an alternative encoder that prevents downstream models from making a false assumption about the ordering of categories. For any given feature, `OneHotEncoder` creates as many new columns as there are possible categories. For a given sample, the column corresponding to its category is set to 1 and all the other columns of that feature are set to 0.

Here, we will start by encoding a single feature to show how this encoding works using the scikit-learn API.

In [34]:
from sklearn.preprocessing import OneHotEncoder

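encoder = OneHotEncoder(sparse=False)
# Note: in scikit-learn 1.2 and later this parameter is named `sparse_output`
# (`sparse` was removed in 1.4), so use OneHotEncoder(sparse_output=False) there.
# Take a closer look at the concept of sparse matrices for compressing almost-empty matrices
# https://scipy-lectures.org/advanced/scipy_sparse/introduction.html#why-sparse-matrices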
education_encoded = encoder.fit_transform(education_column)
education_encoded

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Encoding a single feature returns a NumPy array full of zeros and ones. To understand this transformation more clearly, we can associate the feature names with the encoded columns.

In [36]:
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
feature_names = encoder.get_feature_names_out(input_features=['education'])
education_encoded = pd.DataFrame(education_encoded, columns=feature_names)
education_encoded.sample(10)

Unnamed: 0,education_ 10th,education_ 11th,education_ 12th,education_ 1st-4th,education_ 5th-6th,education_ 7th-8th,education_ 9th,education_ Assoc-acdm,education_ Assoc-voc,education_ Bachelors,education_ Doctorate,education_ HS-grad,education_ Masters,education_ Preschool,education_ Prof-school,education_ Some-college
530,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
34052,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
43021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
40083,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
43655,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
22832,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
41197,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28088,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
33016,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
44355,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [37]:
#We will now apply this encoding to the full dataset
print(f"The dataset is composed of {data_categorical.shape[1]} features")
data_categorical.head()

The dataset is composed of 8 features


Unnamed: 0,workclass,education,marital-status,occupation,relationship,race,sex,native-country
0,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,United-States
1,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,United-States
2,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,United-States
3,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,United-States
4,?,Some-college,Never-married,?,Own-child,White,Female,United-States


In [38]:
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]

array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.

In [39]:
print(f"The encoded data has {data_encoded.shape[1]} features")

The encoded data has 102 features


In [40]:
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
columns_encoded = encoder.get_feature_names_out(data_categorical.columns)
pd.DataFrame(data_encoded, columns=columns_encoded).head()

Unnamed: 0,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,education_ 10th,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


The number of features after the encoding is more than 10 times larger than in the original data because some variables such as `occupation` and `native-country` have many possible categories. 
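The number of one-hot columns can be predicted before encoding: it is the sum of the number of distinct categories in each input column. A small sketch on a hypothetical two-column frame (`toy_frame`) checks this relationship.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame: "sex" has 2 categories, "race" has 3
toy_frame = pd.DataFrame(
    {
        "sex": ["Male", "Female", "Male", "Female"],
        "race": ["White", "Black", "Other", "White"],
    },
    dtype=object,
)

toy_onehot = OneHotEncoder()
toy_matrix = toy_onehot.fit_transform(toy_frame)  # sparse matrix by default

# One output column per distinct category, summed over the input columns
expected_columns = int(toy_frame.nunique().sum())

print(toy_matrix.shape[1], expected_columns)
```

Applying `data_categorical.nunique().sum()` to the census data should give the same 102 columns reported above.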

### Choosing an encoding strategy
The choice of encoding strategy depends on the underlying model and on the type of categories (ordinal or nominal). Linear models will be impacted by misordered categories, while tree-based models will not be.
In general, `OneHotEncoder` is the encoding strategy used when the downstream models are **linear models**, while `OrdinalEncoder` is used with **tree-based models**.

We can still use `OrdinalEncoder` with linear models, but we need to be sure that:
- the original categories (before encoding) have an ordering,
- the encoded categories follow the same ordering as the original categories.
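For the tree-based case, a pipeline could look like the sketch below. The toy data, the choice of `DecisionTreeClassifier`, and the `unknown_value=-1` setting are all assumptions made for illustration. Note that `HS-grad` receives the middle code (1) under the default alphabetical order, yet the tree can still isolate it with two splits.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: only "HS-grad" rows belong to the "<=50K" class
toy_X = pd.DataFrame(
    {
        "workclass": ["Private", "Local-gov", "Private",
                      "State-gov", "Private", "Local-gov"],
        "education": ["HS-grad", "Bachelors", "Masters",
                      "HS-grad", "Bachelors", "Masters"],
    },
    dtype=object,
)
toy_y = ["<=50K", ">50K", ">50K", "<=50K", ">50K", ">50K"]

tree_model = make_pipeline(
    # Categories unseen during fit are encoded as -1 instead of raising an error
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    DecisionTreeClassifier(random_state=0),
)
tree_model.fit(toy_X, toy_y)
toy_predictions = tree_model.predict(toy_X)

print(list(toy_predictions))
```

A linear model fitted on the same integer codes would have to treat `HS-grad` as lying "between" `Bachelors` and `Masters`, which is why the ordering conditions above matter for linear models.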


## Evaluate our predictive pipeline
At this point, we can integrate the encoder into a machine learning pipeline, like the one we used with numerical data. Here, we will train a linear classifier on the encoded data and compute the statistical performance of this pipeline using cross-validation.
 

In [45]:
# Before building the ML pipeline, let's have a look at some statistics on the native-country column. 

data['native-country'].value_counts()

 United-States                 43832
 Mexico                          951
 ?                               857
 Philippines                     295
 Germany                         206
 Puerto-Rico                     184
 Canada                          182
 El-Salvador                     155
 India                           151
 Cuba                            138
 England                         127
 China                           122
 South                           115
 Jamaica                         106
 Italy                           105
 Dominican-Republic              103
 Japan                            92
 Guatemala                        88
 Poland                           87
 Vietnam                          86
 Columbia                         85
 Haiti                            75
 Portugal                         67
 Taiwan                           65
 Iran                             59
 Greece                           49
 Nicaragua                        49
 

Notice that the `Holand-Netherlands` category occurs very rarely (a single sample in the sorted counts shown earlier). This will eventually cause trouble during cross-validation: if that sample ends up in the test set during splitting, the pipeline will not have seen the category during training and the encoder will not be able to encode it.
To work around this issue, scikit-learn provides two possible solutions:
- list all the possible categories and provide them to the encoder via the keyword argument `categories`;
- use the parameter `handle_unknown`; with `handle_unknown='ignore'`, an unknown category is encoded as a row of zeros for that feature.
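Both options can be sketched on a hypothetical train/test split in which `Holand-Netherlands` appears only in the test part. The frames and the category list below are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: the rare category is absent from the training part
toy_train = pd.DataFrame(
    {"native-country": ["United-States", "Mexico", "United-States"]}, dtype=object
)
toy_test = pd.DataFrame(
    {"native-country": ["Holand-Netherlands", "Mexico"]}, dtype=object
)

# Option 1: list every possible category up front (one list per column)
all_countries = [["Holand-Netherlands", "Mexico", "United-States"]]
listed_encoder = OneHotEncoder(categories=all_countries)
listed_encoder.fit(toy_train)
listed_rows = listed_encoder.transform(toy_test).toarray()

# Option 2: learn the categories from the training data and
# encode anything unknown at transform time as all zeros
ignoring_encoder = OneHotEncoder(handle_unknown="ignore")
ignoring_encoder.fit(toy_train)
ignored_rows = ignoring_encoder.transform(toy_test).toarray()

print(listed_rows)   # three columns, the rare category has its own column
print(ignored_rows)  # two columns, the rare category becomes a row of zeros
```

The first option keeps a dedicated column for the rare category, while the second is more convenient when the full list of categories is not known in advance.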

Now, let's create our ML pipeline:

In [47]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(OneHotEncoder(handle_unknown='ignore'), LogisticRegression(max_iter=500))

Here, we increase the maximum number of iterations so that `LogisticRegression` fully converges and does not raise a `ConvergenceWarning`. Unlike numerical features, the one-hot encoded categorical features are all on the same scale (0 or 1), so they would not benefit from scaling, which is why increasing `max_iter` is the right thing to do here.

In [49]:
# Check Model's Statistical Performance. 
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, data_categorical, target)
cv_results

{'fit_time': array([0.80627871, 0.82595897, 0.90604973, 0.94556165, 0.71154451]),
 'score_time': array([0.02186966, 0.02361679, 0.02296591, 0.02792406, 0.0222857 ]),
 'test_score': array([0.83222438, 0.83560242, 0.82872645, 0.83312858, 0.83466421])}

In [50]:
scores = cv_results["test_score"]
print(f"The accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")

The accuracy is: 0.833 +/- 0.002


The accuracy shows that this representation of the categorical variables is slightly more predictive of the revenue than the numerical variables we used earlier.