## One Hot Encoding

One hot encoding consists of encoding each categorical variable with a set of boolean variables (also called dummy variables), which take the values 0 or 1 to indicate whether a category is present in an observation.

For example, for the categorical variable "Gender", with labels 'female' and 'male', we can generate the boolean variable "female", which takes 1 if the person is 'female' or 0 otherwise, or we can generate the variable "male", which takes 1 if the person is 'male' and 0 otherwise.

For the categorical variable "colour" with values 'red', 'blue' and 'green', we can create 3 new variables called "red", "blue" and "green". Each of these variables takes the value 1 if the observation is of that colour and 0 otherwise.
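As a minimal sketch of the "colour" example, the snippet below builds the 3 dummy variables with pandas' `get_dummies`. The toy dataframe is made up for illustration and is not part of the car dataset used later.

```python
import pandas as pd

# toy data, made up for illustration
df = pd.DataFrame({"colour": ["red", "blue", "green", "red"]})

# one binary column per category: k = 3 columns
dummies = pd.get_dummies(df["colour"], dtype=int)

print(dummies)
```

Each row contains exactly one 1, in the column that matches the colour of the observation.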


### Encoding into k-1 dummy variables

Note, however, that for the variable "colour", creating only 2 binary variables, say "red" and "blue", already encodes **ALL** the information:

- if the observation is red, it will be captured by the variable "red" (red = 1, blue = 0)
- if the observation is blue, it will be captured by the variable "blue" (red = 0, blue = 1)
- if the observation is green, it will be captured by the combination of "red" and "blue" (red = 0, blue = 0)

We do not need to add a third variable "green" to capture that the observation is green.

More generally, a categorical variable should be encoded by creating k-1 binary variables, where k is the number of distinct categories. In the case of gender, k=2 (male / female), therefore we need to create only 1 (k - 1 = 1) binary variable. In the case of colour, which has 3 different categories (k=3), we need to create 2 (k - 1 = 2) binary variables to capture all the information.

One hot encoding into k-1 binary variables takes into account that we can use 1 less dimension and still represent the whole information: if the observation is 0 in all the binary variables, then it must be 1 in the final (not present) binary variable.
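The following sketch, on made-up toy data, illustrates this point: after keeping only "red" and "blue", the dropped "green" variable can be rebuilt exactly from the remaining ones.

```python
import pandas as pd

# toy data, made up for illustration
df = pd.DataFrame({"colour": ["red", "blue", "green", "red"]})

# full set of k = 3 dummy variables
k_dummies = pd.get_dummies(df["colour"], dtype=int)

# keep k - 1 = 2 of them, dropping "green" as in the text
k_minus_1 = k_dummies[["red", "blue"]]

# the dropped variable is fully determined by the remaining ones:
# it is 1 only when all the kept variables are 0
recovered_green = 1 - k_minus_1.sum(axis=1)

print((recovered_green == k_dummies["green"]).all())
```

Note that `get_dummies` also offers a `drop_first` argument, which drops the first category of each variable rather than a category of our choice.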

**When one hot encoding categorical variables, we create k - 1 binary variables**


Most machine learning algorithms consider the entire data set while being fit. Therefore, encoding categorical variables into k - 1 binary variables is better, as it avoids introducing redundant information.
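One way to see the redundancy is to look at the rank of the feature matrix. The sketch below assumes a linear model with an intercept column (an assumption added here for illustration): with k dummies the columns are linearly dependent, because the dummies always add up to the intercept, whereas with k - 1 dummies they are not.

```python
import numpy as np
import pandas as pd

# toy data, made up for illustration
df = pd.DataFrame({"colour": ["red", "blue", "green", "red", "green"]})
k_dummies = pd.get_dummies(df["colour"], dtype=float).to_numpy()

# intercept column, as used by a typical linear model (assumption)
intercept = np.ones((len(df), 1))

X_k = np.hstack([intercept, k_dummies])                 # intercept + k dummies
X_k_minus_1 = np.hstack([intercept, k_dummies[:, :-1]])  # intercept + k-1 dummies

rank_k = np.linalg.matrix_rank(X_k)
rank_k_minus_1 = np.linalg.matrix_rank(X_k_minus_1)

print(X_k.shape[1], rank_k)                  # more columns than rank: redundant
print(X_k_minus_1.shape[1], rank_k_minus_1)  # full column rank
```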


### Exception: One hot encoding into k dummy variables

There are a few occasions when it is better to encode variables into k dummy variables:

- when building tree based algorithms
- when doing feature selection by recursive algorithms
- when we are interested in determining the importance of each individual category

Tree-based algorithms, as opposed to the majority of machine learning algorithms, **do not** evaluate all the features at once while being trained. Ensembles such as random forests randomly extract a subset of features at each node for each tree. Therefore, if we want a tree-based algorithm to consider **all** the categories, we need to encode categorical variables into **k binary variables**.

If we are planning to do feature selection by recursive elimination (or addition), or if we want to evaluate the importance of each single category of the categorical variable, then we will also need the entire set of binary variables (k) to let the machine learning model select which ones have the most predictive power.
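The sketch below illustrates why the full set of k variables is needed to score every category. The data are made up, and the absolute correlation with the target is only a crude stand-in for whatever importance measure or selection procedure is actually used.

```python
import pandas as pd

# toy data, made up for illustration
df = pd.DataFrame({
    "colour": ["red", "blue", "green", "red", "green", "blue"],
    "target": [10.0, 2.0, 5.0, 12.0, 6.0, 1.0],
})

k_dummies = pd.get_dummies(df["colour"], dtype=int)
k_minus_1 = k_dummies.iloc[:, :-1]  # drops the last column ("red" here)

# crude per-category score: absolute correlation with the target
scores_k = k_dummies.corrwith(df["target"]).abs()
scores_k_minus_1 = k_minus_1.corrwith(df["target"]).abs()

print(scores_k)
print(scores_k_minus_1)
```

With k - 1 variables, the dropped category ("red" in this toy example) has no column of its own, so it cannot be scored or selected.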


### Advantages of one hot encoding

- Straightforward to implement
- Makes no assumption about the distribution or categories of the categorical variable
- Keeps all the information of the categorical variable
- Suitable for linear models

### Limitations

- Expands the feature space
- Does not add extra information while encoding
- Many dummy variables may be identical, introducing redundant information


### Notes

If our datasets contain a few high-cardinality variables (variables with many distinct categories), we will very soon end up with datasets with thousands of columns, which may make training our algorithms slow and model interpretation hard.
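Before encoding, it can help to estimate how many columns one hot encoding would create. A minimal sketch, on a made-up dataframe with hypothetical column names:

```python
import pandas as pd

# toy data, made up for illustration
df = pd.DataFrame({
    "city": ["London", "Paris", "Rome", "Paris", "Madrid"],
    "colour": ["red", "blue", "green", "red", "red"],
    "size": [1, 2, 3, 4, 5],
})

categorical = ["city", "colour"]  # variables we plan to encode

# number of distinct categories per variable
cardinality = df[categorical].nunique()

n_dummies_k = int(cardinality.sum())                # encoding into k
n_dummies_k_minus_1 = int((cardinality - 1).sum())  # encoding into k - 1

print(cardinality)
print(n_dummies_k, n_dummies_k_minus_1)
```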

In addition, many of these dummy variables may be similar to each other, since it is not unusual for 2 or more variables to share the same combination of 1s and 0s. Therefore, one hot encoding may introduce redundant or duplicated information even if we encode into k-1.
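A minimal sketch of how such duplicated dummy variables can arise and be detected, using made-up data in which every brand comes from exactly one country:

```python
import pandas as pd

# toy data, made up for illustration: brand and country always co-occur
df = pd.DataFrame({
    "brand": ["A", "A", "B", "C", "B"],
    "country": ["x", "x", "y", "z", "y"],
})

# encode into k - 1 dummies per variable
dummies = pd.get_dummies(
    df, columns=["brand", "country"], drop_first=True, dtype=int
)

# a column is flagged if an identical column appears earlier in the frame
is_duplicate = dummies.T.duplicated().to_numpy()

deduplicated = dummies.loc[:, ~is_duplicate]

print(list(dummies.columns))
print(list(deduplicated.columns))
```

Here the "country" dummies carry exactly the same 1s and 0s as the "brand" dummies, even after encoding into k - 1.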


## In this demo:

We will see how to perform one hot encoding with:
- Feature-Engine

The exercises are based on the training notes:
Feature Engineering for Machine Learning
by Soledad Galli


## One hot encoding with Feature-Engine

### Advantages
- quick
- returns dataframe
- returns feature names
- allows selecting which features to encode

## I do not use one hot encoding with sklearn because of the following limitations:

- it returns a NumPy array instead of a pandas dataframe
- it does not return the variable names, and is therefore inconvenient for variable exploration

## I do not use one hot encoding with pandas because of the following limitations:

- it does not preserve information from train data to propagate to test data
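The limitation can be seen in a small sketch with made-up train and test sets: `get_dummies` only knows the categories in the dataframe it is given, so encoding each set separately produces different columns. Re-aligning the test set with `reindex` is one possible workaround, shown here as an illustration rather than a recommendation.

```python
import pandas as pd

# toy data, made up for illustration
train = pd.DataFrame({"colour": ["red", "blue", "green"]})
test = pd.DataFrame({"colour": ["red", "blue", "yellow"]})  # no green, new yellow

train_dummies = pd.get_dummies(train, columns=["colour"], dtype=int)
test_dummies = pd.get_dummies(test, columns=["colour"], dtype=int)

# the two encoded sets do not share the same columns
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# workaround: force the test set to have the columns seen in train
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned)
```

After alignment, the unseen category 'yellow' is encoded as all zeros and the missing 'green' column is filled with zeros.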



In [1]:
import pandas as pd
import numpy as np

# to split the datasets
from sklearn.model_selection import train_test_split

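# for one hot encoding with feature-engine
# (this import path is for feature-engine versions before 1.0; later
# versions expose the encoder as feature_engine.encoding.OneHotEncoder)
from feature_engine.categorical_encoders import OneHotCategoricalEncoder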

In [2]:
# load car example dataset
columns = ['Brand', 'Price', 'Body', 'Mileage', 'EngineV', 'Engine Type',
       'Registration', 'Year']
data = pd.read_csv('C:\\Users\\gusal\\machine learning\\Feature engineering\\car example.csv', usecols = columns)


In [3]:
data.head()

|   | Brand | Price | Body | Mileage | EngineV | Engine Type | Registration | Year |
|---|---|---|---|---|---|---|---|---|
| 0 | BMW | 4200.0 | sedan | 277 | 2.0 | Petrol | yes | 1991 |
| 1 | Mercedes-Benz | 7900.0 | van | 427 | 2.9 | Diesel | yes | 1999 |
| 2 | Mercedes-Benz | 13300.0 | sedan | 358 | 5.0 | Gas | yes | 2003 |
| 3 | Audi | 23000.0 | crossover | 240 | 4.2 | Petrol | yes | 2007 |
| 4 | Toyota | 18300.0 | crossover | 120 | 2.0 | Petrol | yes | 2011 |


In [4]:
data.columns


Index(['Brand', 'Price', 'Body', 'Mileage', 'EngineV', 'Engine Type',
       'Registration', 'Year'],
      dtype='object')

In [5]:
inputs = data.drop(['Price'], axis = 1)
target = data.Price

In [6]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    inputs,  # predictors
    target,  # target
    test_size=0.3,  # proportion of observations in the test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((3041, 7), (1304, 7))

In [7]:
### Let's explore the cardinality

In [8]:
# Brand has 7 labels

len(X_train['Brand'].unique())

7

In [9]:
# Body has 6 labels

X_train['Body'].unique()

array(['other', 'sedan', 'crossover', 'hatch', 'vagon', 'van'],
      dtype=object)

In [10]:
# Engine Type has 4 labels

X_train['Engine Type'].unique()

array(['Petrol', 'Diesel', 'Gas', 'Other'], dtype=object)

In [11]:
# Registration has 2 labels

X_train['Registration'].unique()

array(['yes', 'no'], dtype=object)

### Important note on encoding

Just like imputation, all methods of categorical encoding should be fitted on the training set, and then propagated to the test set.

Why?

Because these methods "learn" patterns from the train data, and we want to avoid leaking information and overfitting. More importantly, we do not know whether future / live data will contain all the categories present in the train data, or whether there will be more or fewer categories. We want to anticipate this uncertainty by setting up the right processes from the start: we create transformers that learn the categories from the train set, and use those learned categories to create the dummy variables in both the train and test sets.
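To make the idea concrete, here is a minimal sketch of such a transformer written with pandas only. `SimpleOneHotEncoder` is a hypothetical class made up for illustration; it is not Feature-Engine's implementation, and the toy data are invented.

```python
import pandas as pd


class SimpleOneHotEncoder:
    """Toy encoder: learns the categories from train, reuses them on any data."""

    def __init__(self, variables, drop_last=False):
        self.variables = variables
        self.drop_last = drop_last

    def fit(self, X):
        # store the categories seen in the train set, per variable
        self.categories_ = {}
        for var in self.variables:
            cats = list(X[var].unique())
            self.categories_[var] = cats[:-1] if self.drop_last else cats
        return self

    def transform(self, X):
        X = X.copy()
        # create one dummy per learned category, whatever X contains
        for var, cats in self.categories_.items():
            for cat in cats:
                X[f"{var}_{cat}"] = (X[var] == cat).astype(int)
        return X.drop(columns=self.variables)


# toy data, made up for illustration
train = pd.DataFrame({"colour": ["red", "blue", "green"]})
test = pd.DataFrame({"colour": ["yellow", "red"]})  # 'yellow' not seen in train

enc = SimpleOneHotEncoder(variables=["colour"]).fit(train)
train_t = enc.transform(train)
test_t = enc.transform(test)

print(list(train_t.columns) == list(test_t.columns))
print(test_t)
```

Because the categories are learned once in `fit`, the train and test sets always end up with the same dummy variables, and a category that was not seen in the train set is encoded as all zeros in this sketch.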

### Applying one hot encoding with Feature-Engine

In [12]:
ohe_enc = OneHotCategoricalEncoder(
    top_categories=None,
    variables=['Brand','Body', 'Engine Type','Registration'], # we can select which variables to encode
    drop_last=True) # True returns k-1 dummy variables, False returns k


ohe_enc.fit(X_train)

OneHotCategoricalEncoder(drop_last=True, top_categories=None,
                         variables=['Brand', 'Body', 'Engine Type',
                                    'Registration'])

In [13]:
tmp = ohe_enc.transform(X_train)

tmp.head()

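|   | Mileage | EngineV | Year | Brand_Mercedes-Benz | Brand_Volkswagen | Brand_Toyota | Brand_Audi | Brand_BMW | Brand_Mitsubishi | Body_other | Body_sedan | Body_crossover | Body_hatch | Body_vagon | Engine Type_Petrol | Engine Type_Diesel | Engine Type_Gas | Registration_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 628 | 12 | 1.8 | 2012 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 422 | 39 | 2.2 | 2015 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 330 | 3 | 1.8 | 1991 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3838 | 84 | 3.0 | 2014 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2369 | 144 | 4.0 | 2008 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |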


Note how feature-engine returns the dummy variables with their names, and drops the original variable, leaving the dataset ready for further exploration or building machine learning models.

In [14]:
# Feature-Engine's one hot encoder also selects
# all categorical variables automatically
# when the variables argument is not specified

ohe_enc = OneHotCategoricalEncoder(
    top_categories=None,
    drop_last=True) # True returns k-1 dummy variables, False returns k


ohe_enc.fit(X_train)

OneHotCategoricalEncoder(drop_last=True, top_categories=None,
                         variables=['Brand', 'Body', 'Engine Type',
                                    'Registration'])

In [15]:
ohe_enc.variables

['Brand', 'Body', 'Engine Type', 'Registration']

In [24]:
tmp = ohe_enc.transform(X_train)

tmp.head()

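|   | Mileage | EngineV | Year | Brand_Mercedes-Benz | Brand_Volkswagen | Brand_Toyota | Brand_Audi | Brand_BMW | Brand_Mitsubishi | Body_other | Body_sedan | Body_crossover | Body_hatch | Body_vagon | Engine Type_Petrol | Engine Type_Diesel | Engine Type_Gas | Registration_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 628 | 12 | 1.8 | 2012 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 422 | 39 | 2.2 | 2015 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 330 | 3 | 1.8 | 1991 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3838 | 84 | 3.0 | 2014 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2369 | 144 | 4.0 | 2008 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |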


In [25]:
tmp2 = ohe_enc.transform(X_test)

tmp2.head()

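|   | Mileage | EngineV | Year | Brand_Mercedes-Benz | Brand_Volkswagen | Brand_Toyota | Brand_Audi | Brand_BMW | Brand_Mitsubishi | Body_other | Body_sedan | Body_crossover | Body_hatch | Body_vagon | Engine Type_Petrol | Engine Type_Diesel | Engine Type_Gas | Registration_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3228 | 242 | 3.0 | 2001 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1505 | 263 | 3.2 | 2001 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3222 | 650 | 2.2 | 2005 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 556 | 194 | 2.0 | 2012 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1958 | 279 | 1.9 | 2000 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |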


In [26]:
tmp.columns

Index(['Mileage', 'EngineV', 'Year', 'Brand_Mercedes-Benz', 'Brand_Volkswagen',
       'Brand_Toyota', 'Brand_Audi', 'Brand_BMW', 'Brand_Mitsubishi',
       'Body_other', 'Body_sedan', 'Body_crossover', 'Body_hatch',
       'Body_vagon', 'Engine Type_Petrol', 'Engine Type_Diesel',
       'Engine Type_Gas', 'Registration_yes'],
      dtype='object')

In [27]:
tmp2.columns

Index(['Mileage', 'EngineV', 'Year', 'Brand_Mercedes-Benz', 'Brand_Volkswagen',
       'Brand_Toyota', 'Brand_Audi', 'Brand_BMW', 'Brand_Mitsubishi',
       'Body_other', 'Body_sedan', 'Body_crossover', 'Body_hatch',
       'Body_vagon', 'Engine Type_Petrol', 'Engine Type_Diesel',
       'Engine Type_Gas', 'Registration_yes'],
      dtype='object')

In [None]:
## We can see that train and test contain the same features.
## This is a big advantage of feature-engine over one hot encoding with pandas,
## which does not learn the categories from the train set.
