### Mean Encoding with Feature Engine

In [39]:
!pip install feature_engine

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting feature_engine
  Downloading feature_engine-1.4.0-py2.py3-none-any.whl (276 kB)
Installing collected packages: feature-engine
Successfully installed feature-engine-1.4.0


In [40]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import MeanEncoder

In [41]:
ds = pd.read_csv('train.csv', usecols=['Sex', 'Embarked', 'Survived'])

In [42]:
ds.head()

Unnamed: 0,Survived,Sex,Embarked
0,0,male,S
1,1,female,C
2,1,female,S
3,1,female,S
4,0,male,S


In [43]:
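# Fill NaN in the Embarked column with an explicit 'Missing' label.
# Assigning the result back avoids the chained inplace call, which recent pandas versions warn about.

ds['Embarked'] = ds['Embarked'].fillna('Missing')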

In [44]:
# Let's look at the number of unique values in each column
for column in ds.columns:
  print(f"column {column} has {len(ds[column].unique())} unique categories")

column Survived has 2 unique categories
column Sex has 2 unique categories
column Embarked has 4 unique categories


In [45]:
# let's have a look at unique labels
ds['Sex'].unique()

array(['male', 'female'], dtype=object)

In [46]:
ds['Embarked'].unique()

array(['S', 'C', 'Q', 'Missing'], dtype=object)

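### Note:
We calculate the target mean per category using the train set only, and then apply those same mappings to the test set.

With Feature-engine's `MeanEncoder`, the target is passed to `fit()` as a separate argument, so it does not need to be a column of the training DataFrame. The split that follows still keeps `Survived` in `X_train`, but only so that it shows up in the previews.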
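
The idea in the note can be sketched with plain pandas. This is only an illustration on a made-up toy frame (the `colour` and `target` columns and their values are invented, not taken from the Titanic data), and it is not meant to reproduce what `MeanEncoder` does internally:

```python
import pandas as pd

# Toy data, invented for illustration only
train = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue", "blue"],
    "target": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"colour": ["blue", "red"]})

# Learn the category -> target mean mapping on the train set only
mapping = train.groupby("colour")["target"].mean().to_dict()

# Apply the same mapping to both sets
train_encoded = train["colour"].map(mapping)
test_encoded = test["colour"].map(mapping)

print(mapping)
print(test_encoded.tolist())
```

Because the mapping is learned from the train rows alone, the test set never contributes to the encoded values.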



In [47]:
# Split into train and test sets.
# Survived is kept among the columns of X only for inspection; the encoder reads the target from y.
X_train, X_test, y_train, y_test = train_test_split(
    ds[['Sex', 'Embarked', 'Survived']],  # predictors (plus Survived for display)
    ds['Survived'],  # target
    test_size=0.3,  # fraction of observations in the test set
    random_state=10)  # seed to ensure reproducibility

# let's print the shape
X_train.shape, X_test.shape

((623, 3), (268, 3))

In [48]:
# Create the mean encoder
mean_enc = MeanEncoder(
    variables=['Sex', 'Embarked'])

In [49]:
# Fit the model

mean_enc.fit(X_train, y_train)

MeanEncoder(variables=['Sex', 'Embarked'])

In [50]:
# let's observe the mean target value assigned to each category
mean_enc.encoder_dict_

{'Sex': {'female': 0.7433628318584071, 'male': 0.20151133501259447},
 'Embarked': {'C': 0.5614035087719298,
  'Missing': 1.0,
  'Q': 0.46153846153846156,
  'S': 0.34725274725274724}}
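
A target mean of exactly 1.0, as for `'Missing'` here, is worth a second look, because a category with very few rows gives an unreliable estimate. One way to check is to compute the row count next to the mean. The helper `category_summary` below is a hypothetical name introduced for this sketch, and the toy frame stands in for the notebook's data:

```python
import pandas as pd

def category_summary(X, y, column):
    """Return the target mean and the row count per category of `column`."""
    return (
        y.groupby(X[column])
        .agg(["mean", "count"])
        .rename(columns={"mean": "target_mean", "count": "n_rows"})
    )

# Toy stand-in for X_train / y_train (values invented for illustration)
X_toy = pd.DataFrame({"Embarked": ["S", "S", "S", "C", "C", "Missing"]})
y_toy = pd.Series([0, 1, 0, 1, 1, 1], name="Survived")

summary = category_summary(X_toy, y_toy, "Embarked")
print(summary)
```

On the notebook's data this would be called as `category_summary(X_train, y_train, 'Embarked')`, before the transform step replaces the labels with numbers.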

In [51]:
# print the variables that the encoder will transform
mean_enc.variables_

['Sex', 'Embarked']

In [52]:
# Transform the train and test sets with the mappings learned from the train set
X_train = mean_enc.transform(X_train)
X_test = mean_enc.transform(X_test)

In [53]:
X_train.head()

Unnamed: 0,Sex,Embarked,Survived
7,0.201511,0.347253,0
765,0.743363,0.347253,1
339,0.201511,0.347253,0
374,0.743363,0.347253,0
183,0.201511,0.347253,1


In [54]:
X_test.head()

Unnamed: 0,Sex,Embarked,Survived
590,0.201511,0.347253,0
131,0.201511,0.347253,0
628,0.201511,0.347253,0
195,0.743363,0.561404,1
230,0.743363,0.347253,1


### Note
1. If the argument `variables` is left as `None`, the encoder will automatically identify all categorical variables.

2. The encoder will not encode numerical variables. So if some of your numerical variables are in fact categories, you will need to re-cast them as object before using the encoder.

3. If there is a label in the test set that was not present in the train set, the encoder has no mapping for it. Depending on the Feature-engine version and how the encoder is configured, it will either throw an error or encode the label as NaN, to alert you to this behaviour.

4. With this kind of encoding, categories that appear in the test set but were not seen in the training set end up as missing values unless they are handled. It is therefore extremely important to handle rare labels beforehand.
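
The last two points can be illustrated with plain pandas. The sketch below uses invented toy data, a hypothetical helper `group_rare`, and an arbitrary 20% frequency threshold. It first shows an unseen label turning into NaN, then groups infrequent and unseen labels under a single `'Rare'` label before learning the means. Feature-engine also provides a `RareLabelEncoder` for this purpose; the code here is a hand-rolled stand-in, not that class.

```python
import pandas as pd

# Toy data, invented for illustration only
train = pd.DataFrame({
    "port": ["S", "S", "S", "S", "C", "C", "C", "Q"],
    "target": [0, 1, 0, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({"port": ["S", "C", "X"]})  # "X" never appears in train

# Naive mapping: the unseen label "X" has no entry and becomes NaN
naive_mapping = train.groupby("port")["target"].mean()
naive = test["port"].map(naive_mapping)

def group_rare(series, frequent, other="Rare"):
    """Replace every label not in `frequent` with `other`."""
    return series.where(series.isin(frequent), other)

# Labels covering at least 20% of the train rows count as frequent
# (the threshold is an arbitrary choice for this sketch)
freq = train["port"].value_counts(normalize=True)
frequent = freq[freq >= 0.2].index

train["port"] = group_rare(train["port"], frequent)
test["port"] = group_rare(test["port"], frequent)

# Learn the means after grouping; "Rare" now has its own mean
mapping = train.groupby("port")["target"].mean()
test_encoded = test["port"].map(mapping)

print(naive.isna().tolist())
print(test_encoded.isna().any())
```

Grouping is learned from the train set only, so the test set still plays no part in deciding which labels are frequent.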