## One Hot Encoding

One hot encoding consists of encoding each categorical variable with a set of boolean variables (also called dummy variables), each of which takes the value 0 or 1 to indicate whether a category is present in an observation.

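For example, for the categorical variable "Gender", with labels 'female' and 'male', we can generate the boolean variable "female", which takes 1 if the person is 'female' and 0 otherwise, or we can generate the variable "male", which takes 1 if the person is 'male' and 0 otherwise. Because each of these two variables fully determines the other, one of them is enough: a variable with k categories can be encoded with k-1 dummy variables.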
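The "Gender" example above can be sketched with `pandas.get_dummies`. The toy DataFrame and the names `toy`, `k_dummies` and `k_minus_1` are made up for illustration and are not part of the Titanic data used later.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"Gender": ["female", "male", "male", "female"]})

# k dummies: one 0/1 column per category
k_dummies = pd.get_dummies(toy["Gender"], dtype=int)

# k-1 dummies: the first category is dropped, since it is implied
# whenever all the remaining dummies are 0
k_minus_1 = pd.get_dummies(toy["Gender"], drop_first=True, dtype=int)

print(k_dummies)
print(k_minus_1)
```

With two categories, the k-1 version keeps only the "male" column, and a 0 in that column stands for 'female'.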

In [None]:
import pandas as pd

Limitations of pandas:

In [None]:
df = pd.read_csv('Titanic.csv')
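The cells further down use a training set called `xtrain`, which this section does not define. A minimal sketch of how it might be built is shown here, assuming a plain train/test split over the 'Sex' and 'Embarked' columns. The helper `split_categoricals`, the split ratio, the random seed and the toy data are all assumptions, not the notebook's actual preparation step.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_categoricals(data, columns=("Sex", "Embarked"), test_size=0.3, random_state=0):
    """Hypothetical helper: split the chosen categorical columns into train and test sets."""
    features = data[list(columns)]
    return train_test_split(features, test_size=test_size, random_state=random_state)


# Toy stand-in for the Titanic data, invented for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male",
            "female", "male", "female", "male", "female"],
    "Embarked": ["S", "C", "S", "Q", None, "S", "C", "S", "S", "Q"],
})

toy_train, toy_test = split_categoricals(toy)
print(toy_train.shape, toy_test.shape)
```

In this notebook the equivalent call would be something like `xtrain, xtest = split_categoricals(df)`.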

## Label Encoder

In [None]:
# for label encoding with sklearn
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [None]:
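# fill missing values first, as in the cells below, so that every label is a string
dfle = le.fit_transform(df['Embarked'].fillna('Missing'))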

In [None]:
dfle
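To read the integer codes returned above, it can help to look at the fitted encoder's `classes_` attribute. The sketch below uses a small made-up Series rather than the Titanic column, and the names `ports`, `toy_le`, `codes` and `mapping` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy ports of embarkation, invented for illustration only
ports = pd.Series(["S", "C", "Q", "S", None]).fillna("Missing")

toy_le = LabelEncoder()
codes = toy_le.fit_transform(ports)

# classes_ is sorted; the integer code of a label is its position in classes_
mapping = {label: code for code, label in enumerate(toy_le.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels
print(toy_le.inverse_transform(codes))
```

Note that the codes follow the sorted order of the labels, so they suggest an ordering that the categories do not really have. The scikit-learn documentation describes `LabelEncoder` as intended for target values rather than input features.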

## One hot encoding with Scikit-learn

### Advantages

- quick
- creates the same number of features in the train and test sets

### Limitations

- it returns a NumPy array instead of a pandas DataFrame
- it does not return the variable names, which makes variable exploration inconvenient
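The point about train and test sets having the same number of features can be illustrated with a small sketch. The toy data below is invented so that one category ('Q') appears only in the training set, and `pd.get_dummies` is used purely as a point of comparison.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy train and test sets, invented for illustration; 'Q' never appears in the test set
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
toy_test = pd.DataFrame({"Embarked": ["S", "C"]})

# pandas derives the dummy columns from whatever data it is given,
# so the two sets end up with different numbers of columns
print(pd.get_dummies(toy_train).shape[1], pd.get_dummies(toy_test).shape[1])

# a fitted encoder remembers the training categories and reuses them
toy_encoder = OneHotEncoder(categories="auto", drop="first")
toy_encoder.fit(toy_train)

# transform returns a sparse matrix by default; toarray() makes it dense
train_encoded = toy_encoder.transform(toy_train).toarray()
test_encoded = toy_encoder.transform(toy_test).toarray()
print(train_encoded.shape[1], test_encoded.shape[1])
```

Calling `.toarray()` on the default sparse output sidesteps the argument that controls dense output, whose name differs between scikit-learn versions.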

In [None]:
# for one hot encoding with sklearn
from sklearn.preprocessing import OneHotEncoder

In [None]:
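# we create the encoder (it is trained in the next cell)

encoder = OneHotEncoder(categories='auto',
                       drop='first', # returns k-1 dummies; use drop=None to return k dummies
                       sparse=False, # return a dense array; this argument is named sparse_output from scikit-learn 1.2
                       handle_unknown='error') # raise an error if transform meets a category not seen during fit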

In [None]:
encoder.fit(xtrain.fillna('Missing'))

In [None]:
encoder.categories_

In [None]:
# transform the xtrain set; the result is a NumPy array, which we wrap in a DataFrame below

df1 = encoder.transform(xtrain.fillna("Missing"))

In [None]:
df1 = pd.DataFrame(df1)

In [None]:
df1.head(10)


## One hot encoding with Feature-Engine

### Advantages
- quick
- returns a dataframe
- returns feature names
- lets you select which features to encode

### Limitations
- Not sure yet.

In [None]:
#!pip install feature_engine
#conda install -c conda-forge feature_engine 

In [None]:
# for one hot encoding with feature-engine
from feature_engine.encoding import OneHotEncoder as fe_OneHotEncoder

In [None]:
ohe_enc = fe_OneHotEncoder(
    top_categories=None,
    variables=['Sex', 'Embarked'],  # we can select which variables to encode
    drop_last=False)  # True returns k-1 dummies, False returns k


ohe_enc.fit(xtrain.fillna('Missing'))

In [None]:
ohe_enc.variables_

In [None]:
tmp = ohe_enc.transform(xtrain.fillna('Missing'))

tmp.head()