## One Hot Encoding

One-hot encoding consists of encoding each categorical variable with a set of boolean variables (also called dummy variables) that take the value 0 or 1, indicating whether a category is present in an observation.

For example, for the categorical variable "Gender", with labels 'female' and 'male', we can generate the boolean variable "female", which takes 1 if the person is 'female' and 0 otherwise, or the variable "male", which takes 1 if the person is 'male' and 0 otherwise. With only two labels, either variable alone carries all the information.
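As a minimal sketch of the "Gender" example, the snippet below builds both the k-dummy and the k-1-dummy encodings with `pd.get_dummies`. The `people` frame is invented for illustration, and `dtype=int` is passed explicitly because recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy data, invented purely for illustration
people = pd.DataFrame({"Gender": ["female", "male", "male", "female"]})

# k dummies: one 0/1 column per label
k_dummies = pd.get_dummies(people["Gender"], dtype=int)

# k-1 dummies: the first label in sorted order ('female') is dropped,
# since it is implied whenever 'male' is 0
k_minus_1 = pd.get_dummies(people["Gender"], drop_first=True, dtype=int)

print(k_dummies)
print(k_minus_1)
```

With two labels the "female" column is simply `1 - male`, which is why the k-1 form loses no information.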

In [1]:
import pandas as pd

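Limitations of pandas: `pd.get_dummies` does not store the categories it has seen, so it cannot guarantee the same dummy columns for a train set and a test set.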

In [2]:
df = pd.read_csv('Titanic.csv')

In [3]:
df.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [5]:
df.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [6]:
df1 = df[['Sex','Embarked']]
df1.head()

| | Sex | Embarked |
|---|---|---|
| 0 | male | Q |
| 1 | female | S |
| 2 | male | Q |
| 3 | male | S |
| 4 | female | S |


In [9]:
df1.describe()
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Sex       418 non-null    object
 1   Embarked  418 non-null    object
dtypes: object(2)
memory usage: 6.7+ KB


In [10]:
dummies = pd.get_dummies(df1)

In [11]:
dummies.head()

| | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 | 0 | 1 |
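One thing to keep in mind with `pd.get_dummies` is that it only creates columns for the labels present in the frame it is given. The sketch below uses two small invented frames (`train` and `test` are hypothetical, not the Titanic data) to show how the column sets can differ, and one possible workaround using `reindex`.

```python
import pandas as pd

# Hypothetical splits: the label 'C' never appears in the test split
train = pd.DataFrame({"Embarked": ["C", "Q", "S", "S"]})
test = pd.DataFrame({"Embarked": ["S", "S", "Q"]})

train_d = pd.get_dummies(train, dtype=int)
test_d = pd.get_dummies(test, dtype=int)

print(train_d.columns.tolist())  # includes Embarked_C
print(test_d.columns.tolist())   # Embarked_C is missing

# One possible workaround: align the test columns to the training columns,
# filling dummies that were never seen in the test split with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```

Note that `reindex` also silently drops any test-only labels, so this is a patch rather than a substitute for an encoder that is fitted on the training data.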


## Label Encoder

In [12]:
# for label encoding with sklearn
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [13]:
dfle = le.fit_transform(df.Embarked)

In [15]:
type(dfle)

numpy.ndarray

In [16]:
dfle

array([1, 2, 1, 2, 2, 2, 1, 2, 0, 2, 2, 2, 2, 2, 2, 0, 1, 0, 2, 0, 0, 2,
       2, 0, 0, 2, 0, 0, 2, 0, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 0, 2, 2, 2,
       2, 2, 0, 1, 0, 2, 2, 0, 2, 2, 0, 1, 2, 2, 2, 0, 2, 2, 2, 1, 0, 2,
       1, 2, 0, 2, 1, 2, 2, 0, 0, 0, 2, 2, 2, 1, 0, 2, 2, 2, 1, 0, 1, 2,
       1, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 1, 2, 0, 2, 1, 1, 2, 2,
       0, 1, 0, 1, 2, 0, 0, 2, 0, 2, 2, 1, 0, 2, 1, 2, 2, 1, 2, 2, 2, 0,
       2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2,
       2, 2, 2, 2, 2, 2, 1, 0, 2, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 2,
       2, 0, 2, 0, 2, 0, 2, 1, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 0, 2,
       2, 2, 1, 2, 0, 2, 2, 0, 1, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1, 2, 0, 2,
       0, 2, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 2, 1, 0, 2, 0, 0, 2, 0, 0, 2,
       0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2,
       0, 2, 2, 2, 2, 2, 0, 1, 0, 1, 0, 2, 2, 2, 2, 2, 2, 2, 1, 0, 2, 2,
       2, 2, 0, 2, 2, 1, 0, 2, 2, 2, 0, 0, 2, 2, 2,
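To make the mapping behind these integer codes explicit, here is a small sketch on invented values (the `ports` list is hypothetical). `LabelEncoder` sorts the labels and uses each label's position in `classes_` as its code; `inverse_transform` maps the codes back.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values, invented for illustration
ports = ["Q", "S", "Q", "S", "C"]

le = LabelEncoder()
codes = le.fit_transform(ports)

# classes_ holds the sorted labels; a label's index in it is its code
print(le.classes_)
print(codes)

# inverse_transform recovers the original labels from the codes
print(le.inverse_transform(codes))
```

The integers impose an order (C < Q < S) that has no meaning for a nominal variable, which is one reason to prefer one-hot encoding for input features. The scikit-learn documentation describes `LabelEncoder` as intended for target values rather than input features.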

## One hot encoding with Scikit-learn

### Advantages

- quick
- creates the same number of features in the train and test sets

### Limitations

- it returns a NumPy array instead of a pandas DataFrame
- the returned array carries no variable names, which makes variable exploration inconvenient
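The first advantage listed above can be sketched as follows: the encoder learns its categories once in `fit`, so any later `transform` yields the same columns. The `train` and `test` frames are invented for illustration; the default sparse output is converted with `.toarray()` so the snippet does not depend on the name of the dense-output parameter, which changed between scikit-learn versions.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical splits: the label 'C' never appears in the test split
train = pd.DataFrame({"Embarked": ["C", "Q", "S", "S"]})
test = pd.DataFrame({"Embarked": ["S", "Q", "S"]})

enc = OneHotEncoder()  # default output is a SciPy sparse matrix
enc.fit(train)         # categories are learned from the train split only

train_enc = enc.transform(train).toarray()
test_enc = enc.transform(test).toarray()

# Both arrays have one column per category learned during fit
print(train_enc.shape, test_enc.shape)
```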

In [17]:
# for one hot encoding with sklearn
from sklearn.preprocessing import OneHotEncoder

In [18]:
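# we create the encoder; it is trained with fit() further down

encoder = OneHotEncoder(categories='auto',
                       drop='first', # to return k-1 dummies; use drop=None to return k
                       sparse=False, # dense output; renamed to sparse_output in scikit-learn 1.2
                       handle_unknown='error') # raise on unseen labels; 'ignore' encodes them as all zeros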

In [19]:
xtrain = df[['Sex','Embarked']]

In [20]:
encoder.fit(xtrain.fillna('Missing'))

In [21]:
encoder.categories_

[array(['female', 'male'], dtype=object), array(['C', 'Q', 'S'], dtype=object)]

In [22]:
# transform the xtrain set; the result is a NumPy array, converted to a DataFrame below

df1 = encoder.transform(xtrain.fillna("Missing"))

In [23]:
df1 = pd.DataFrame(df1)

In [24]:
df1.head(10)

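| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 1.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 1.0 |
| 6 | 0.0 | 1.0 | 0.0 |
| 7 | 1.0 | 0.0 | 1.0 |
| 8 | 0.0 | 0.0 | 0.0 |
| 9 | 1.0 | 0.0 | 1.0 |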



## One hot encoding with Feature-Engine

### Advantages
- quick
- returns a DataFrame
- returns feature names
- allows selecting which features to encode

### Limitations
- Not sure yet.

In [None]:
#!pip install feature_engine
#conda install -c conda-forge feature_engine 

In [25]:
# for one hot encoding with feature-engine
from feature_engine.encoding import OneHotEncoder as fe_OneHotEncoder

In [29]:
ohe_enc = fe_OneHotEncoder(
    top_categories=None,
    variables=['Sex', 'Embarked'],  # we can select which variables to encode
    drop_last=True)  # True returns k-1 dummies; False returns k


ohe_enc.fit(xtrain.fillna('Missing'))

In [30]:
ohe_enc.variables_

['Sex', 'Embarked']

In [32]:
tmp = ohe_enc.transform(xtrain.fillna('Missing'))

tmp.head(10)

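| | Sex_male | Embarked_Q | Embarked_S |
|---|---|---|---|
| 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 1 | 1 | 0 |
| 3 | 1 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| 5 | 1 | 0 | 1 |
| 6 | 0 | 1 | 0 |
| 7 | 1 | 0 | 1 |
| 8 | 0 | 0 | 0 |
| 9 | 1 | 0 | 1 |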
