### Probability Ratio Encoding With Feature Engine

In [31]:
!pip install feature_engine

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting feature_engine
  Downloading feature_engine-1.4.0-py2.py3-none-any.whl (276 kB)
Installing collected packages: feature-engine
Successfully installed feature-engine-1.4.0


In [32]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import PRatioEncoder

In [63]:
ds = pd.read_csv('train.csv', usecols=['Sex', 'Cabin', 'Embarked', 'Survived'])

In [64]:
ds.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked
0,0,male,,S
1,1,female,C85,C
2,1,female,,S
3,1,female,C123,S
4,0,male,,S


In [65]:
# let's replace NaN values in Cabin and Embarked with the label 'Missing'
ds['Cabin'] = ds['Cabin'].fillna('Missing')
ds['Embarked'] = ds['Embarked'].fillna('Missing')

In [66]:
ds.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked
0,0,male,Missing,S
1,1,female,C85,C
2,1,female,Missing,S
3,1,female,C123,S
4,0,male,Missing,S


In [67]:
# Now we extract the first letter of the cabin
# to create a simpler variable for practice

ds['Cabin'] = ds['Cabin'].astype(str).str[0]

In [68]:
# let's remove the observations where Cabin == 'T', as there are very few of them

ds = ds[ds['Cabin'] != 'T']
ds.shape

(890, 4)

In [69]:
# Let's have a look at the number of unique categories for each feature
for column in ds.columns:
    print(f"column {column} has {len(ds[column].unique())} unique categories")

column Survived has 2 unique categories
column Sex has 2 unique categories
column Cabin has 8 unique categories
column Embarked has 4 unique categories


In [70]:
# let's have a look at unique labels
ds['Sex'].unique()

array(['male', 'female'], dtype=object)

In [71]:
ds['Embarked'].unique()

array(['S', 'C', 'Q', 'Missing'], dtype=object)

In [72]:
ds['Cabin'].unique()
# note that M is for Missing

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F'], dtype=object)

### Note
We calculate the ratio P(1)/P(0) for each category using the train set only, and then apply those mappings to both the train set and the test set.

With Feature-engine, the target does not need to be a column of the training DataFrame: it is passed separately as `y` when calling `fit()`.
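To make the mapping concrete, here is a minimal pandas sketch of the same idea, computed by hand. The toy data and the names `toy_X`, `toy_y`, `toy_test`, and `ratio_map` are hypothetical and exist only for illustration; this is not how the encoder is implemented internally.

```python
import pandas as pd

# Hypothetical toy training data (not the Titanic set), for illustration only
toy_X = pd.DataFrame(
    {"Sex": ["male", "male", "male", "female", "female", "female", "female"]}
)
toy_y = pd.Series([0, 0, 1, 1, 1, 0, 1], name="Survived")

# P(1) per category is the mean of the binary target within that category
p1 = toy_y.groupby(toy_X["Sex"]).mean()
p0 = 1 - p1

# Probability ratio P(1)/P(0), learned from the training data only
ratio_map = (p1 / p0).to_dict()

# Apply the learned mapping to unseen data (a hypothetical test set)
toy_test = pd.DataFrame({"Sex": ["female", "male"]})
toy_test_encoded = toy_test["Sex"].map(ratio_map)
print(ratio_map)
```

Learning the mapping on the train set and reusing it on the test set avoids leaking target information from the test set into the features.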



In [73]:
# Splitting the train and test set
X_train, X_test, y_train, y_test = train_test_split(
    ds[['Cabin', 'Sex', 'Embarked']],  
    ds['Survived'],  
    test_size=0.3,  
    random_state=0) 

# print the shape
X_train.shape, X_test.shape

((623, 3), (267, 3))

In [77]:
# Create the probability ratio encoder model
ratio_enc = PRatioEncoder(
    encoding_method='ratio',
    variables=['Cabin', 'Sex'])

In [78]:
# fitting the model
ratio_enc.fit(X_train, y_train)

PRatioEncoder(variables=['Cabin', 'Sex'])

In [80]:
# let's inspect the probability ratio P(1)/P(0) assigned to each category
ratio_enc.encoder_dict_

{'Cabin': {'A': 0.8571428571428572,
  'B': 2.8571428571428563,
  'C': 1.388888888888889,
  'D': 2.571428571428571,
  'E': 2.8571428571428563,
  'F': 1.75,
  'G': 1.0,
  'M': 0.4660493827160494},
 'Sex': {'female': 3.245283018867925, 'male': 0.23602484472049687}}
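As a sanity check on how to read these numbers: with a binary target, P(0) = 1 - P(1), so a ratio r = P(1)/P(0) corresponds to P(1) = r / (1 + r). The short sketch below converts the `Sex` ratios from the output above back into survival probabilities. The names `sex_ratios` and `sex_p1` are hypothetical.

```python
# Ratios copied from the encoder_dict_ output above for 'Sex'
sex_ratios = {"female": 3.245283018867925, "male": 0.23602484472049687}

# With a binary target, P(0) = 1 - P(1), so r = P(1) / (1 - P(1))
# and therefore P(1) = r / (1 + r)
sex_p1 = {label: r / (1 + r) for label, r in sex_ratios.items()}
print(sex_p1)
```

This gives roughly 0.76 for female and 0.19 for male. A ratio above 1 means survival was more likely than not for that category in the train set, and a ratio below 1 means the opposite.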

In [81]:
# print the variables that the encoder will transform
ratio_enc.variables_

['Cabin', 'Sex']

In [82]:
# Transform the train and test sets
X_train = ratio_enc.transform(X_train)
X_test = ratio_enc.transform(X_test)

In [83]:
X_train.head()

Unnamed: 0,Cabin,Sex,Embarked
64,0.466049,0.236025,C
709,0.466049,0.236025,C
52,2.571429,3.245283,C
387,0.466049,3.245283,S
124,2.571429,0.236025,S


### Note
1. If the `variables` argument is left as None, the encoder automatically identifies all categorical variables.

2. The encoder does not encode numerical variables. If some of your numerical variables are in fact categories, you need to re-cast them as object before using the encoder.

3. If there is a label in the test set that was not present in the train set, the encoder will throw an error to alert you of this.

4. If the probability of target = 0 is zero for any category, the encoder will raise an error, because division by zero is not defined.
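The last two conditions can be checked with plain pandas before fitting the encoder. The sketch below uses hypothetical toy data, and the names `train`, `test`, `unseen`, and `zero_p0` are assumptions for illustration; it does not call Feature-engine.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
train = pd.DataFrame(
    {"Cabin": ["A", "A", "B", "B", "C"], "Survived": [1, 0, 1, 1, 0]}
)
test = pd.DataFrame({"Cabin": ["A", "D"]})

# Labels present in the test set but never seen in the train set
unseen = sorted(set(test["Cabin"]) - set(train["Cabin"]))

# Categories where every training observation has target = 1,
# so P(0) = 0 and the ratio P(1)/P(0) is undefined
p1 = train.groupby("Cabin")["Survived"].mean()
zero_p0 = sorted(p1[p1 == 1].index)

print("unseen labels:", unseen)
print("categories with P(0) = 0:", zero_p0)
```

If either list is non-empty, one option is to group the offending labels into a broader category before encoding, much as the single `T` cabin was dropped earlier in this notebook.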



