# RareLabelEncoder
The RareLabelEncoder() groups labels that appear in only a small fraction of the observations into a new category called 'Rare'. Pooling these sparsely populated labels can help to reduce overfitting.

The argument tol indicates the minimum fraction of observations (for example, 0.05 for 5%) that a label needs to have in order not to be re-grouped into the 'Rare' label.
The argument n_categories indicates the minimum number of distinct categories that a variable needs to have for any of its labels to be re-grouped into 'Rare'.


**Note**
If the number of labels is smaller than n_categories, then the encoder will not group the labels for that variable.
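
The rule described above can be sketched in plain Python. This is only an illustration of the logic, not feature_engine's implementation: the helper name find_frequent_labels is hypothetical, and the sketch assumes that a label is kept when its relative frequency is at least tol and that grouping applies only when the variable has more than n_categories distinct labels.

```python
from collections import Counter


def find_frequent_labels(values, tol=0.05, n_categories=10):
    """Return the labels that would be kept (hypothetical helper, not the library's code)."""
    counts = Counter(values)
    # Too few distinct labels: nothing is grouped, so every label is kept.
    if len(counts) <= n_categories:
        return list(counts)
    n_obs = len(values)
    # Keep the labels whose share of the observations reaches the tolerance.
    return [label for label, count in counts.items() if count / n_obs >= tol]


# Toy data (made up): 'a' and 'b' dominate, 'c' and 'd' each cover 2% of 100 rows.
values = ['a'] * 60 + ['b'] * 36 + ['c'] * 2 + ['d'] * 2

# 4 distinct labels > n_categories=3, so the tolerance is applied.
kept = find_frequent_labels(values, tol=0.05, n_categories=3)

# 4 distinct labels <= n_categories=10, so no label is grouped.
untouched = find_frequent_labels(values, tol=0.05, n_categories=10)
```

In the first call only 'a' and 'b' survive; in the second call all four labels are returned because the variable has too few categories for grouping to apply.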

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from feature_engine.encoding import RareLabelEncoder

In [2]:
def load_titanic():
    # Load dataset from OpenML
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    
    # Replace '?' with NaN
    data = data.replace('?', np.nan)
    
    # Extract first letter of cabin
    data['cabin'] = data['cabin'].astype(str).str[0]
    
    # Convert 'pclass' to categorical
    data['pclass'] = data['pclass'].astype('O')
    
    # Ensure 'age' contains only numeric data by coercing errors to NaN
    data['age'] = pd.to_numeric(data['age'], errors='coerce')
    
    # Fill missing values in 'age' with the median age
    data['age'] = data['age'].fillna(data['age'].median())
    
    # Ensure 'fare' contains only numeric data by coercing errors to NaN
    data['fare'] = pd.to_numeric(data['fare'], errors='coerce')
    
    # Fill missing values in 'fare' with the median fare
    data['fare'] = data['fare'].fillna(data['fare'].median())
    
    # Fill missing values in 'embarked' with 'C'
    data['embarked'] = data['embarked'].fillna('C')
    
    # Drop irrelevant columns
    data.drop(labels=['boat', 'body', 'home.dest'], axis=1, inplace=True)
    
    return data

# Load the data
data = load_titanic()

# Display the first few rows of the cleaned data
data.head()


| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B | S |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C | S |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C | S |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C | S |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C | S |


In [3]:
X = data.drop(['survived', 'name', 'ticket'], axis=1)
y = data.survived

# The variables we will encode below have no missing values
X[['cabin', 'pclass', 'embarked']].isnull().sum()

cabin       0
pclass      0
embarked    0
dtype: int64

In [4]:
# let's separate the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((916, 8), (393, 8))

The RareLabelEncoder() groups rare / infrequent categories in a new category called "Rare", or any other name entered by the user.

For example, if in the variable colour the categories magenta, cyan and burgundy each account for less than 5% of the observations, all of those categories will be replaced by the new label "Rare".

Note that infrequent labels can also be grouped under a user-defined name, for example 'Other'. The name that replaces infrequent categories is set with the parameter replace_with.

The encoder will encode only categorical variables (type 'object'). A list of variables can be passed as an argument. If no variables are passed as argument, the encoder will find and encode all categorical variables (object type).
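
The colour example can be worked through in plain Python. The data below is made up for illustration, the variable names are hypothetical, and the sketch assumes a label counts as frequent when its share of the observations is at least tol; it mimics the idea rather than the encoder's actual code.

```python
from collections import Counter

# Toy data (made up for illustration): three colours each cover 4% of 100 rows.
colour = (['blue'] * 50 + ['red'] * 38
          + ['magenta'] * 4 + ['cyan'] * 4 + ['burgundy'] * 4)

tol = 0.05
replace_with = 'Other'  # stands in for the replace_with parameter

# Learn the frequent labels from the data.
counts = Counter(colour)
frequent = {label for label, count in counts.items() if count / len(colour) >= tol}

# Replace every label that is not frequent with the chosen name.
encoded = [label if label in frequent else replace_with for label in colour]
encoded_counts = Counter(encoded)
```

After the replacement, magenta, cyan and burgundy no longer appear on their own: their 12 observations are pooled under 'Other', while blue and red are left as they were.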

In [5]:
## Rare value encoder
'''
Parameters
----------

tol: float, default=0.05
    the minimum frequency a label should have to be considered frequent.
    Categories with frequencies lower than tol will be grouped.

n_categories: int, default=10
    the minimum number of categories a variable should have for the encoder
    to find frequent labels. If the variable contains fewer categories, all
    of them will be considered frequent.

max_n_categories: int, default=None
    the maximum number of categories that should be considered frequent.
    If None, all categories with frequency above the tolerance (tol) will be
    considered.

variables : list, default=None
    The list of categorical variables that will be encoded. If None, the 
    encoder will find and select all object type variables.

replace_with : string, default='Rare'
    The category name that will be used to replace infrequent categories.
'''

rare_encoder = RareLabelEncoder(tol=0.05, 
                                n_categories=5,
                                variables=['cabin', 'pclass', 'embarked'])
rare_encoder.fit(X_train)



In [6]:
rare_encoder.encoder_dict_

{'cabin': ['n', 'C'], 'pclass': [2, 3, 1], 'embarked': ['S', 'C', 'Q']}

In [7]:
train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)

train_t.head()

| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 501 | 2 | female | 13.0 | 0 | 1 | 19.5 | n | S |
| 588 | 2 | female | 4.0 | 1 | 1 | 23.0 | n | S |
| 402 | 2 | female | 30.0 | 1 | 0 | 13.8583 | n | C |
| 1193 | 3 | male | 28.0 | 0 | 0 | 7.725 | n | Q |
| 686 | 3 | female | 22.0 | 0 | 0 | 7.725 | n | Q |


In [8]:
train_t.cabin.value_counts()

cabin
n       702
Rare    143
C        71
Name: count, dtype: int64

#### The user can change the string from 'Rare' to something else.

In [9]:
## Rare value encoder

rare_encoder = RareLabelEncoder(tol=0.03,
                                replace_with='Other',  # use 'Other' instead of 'Rare'
                                variables=['cabin', 'pclass', 'embarked'],
                                n_categories=2)

rare_encoder.fit(X_train)

train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)

train_t.sample(5)

| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 586 | 2 | female | 29.0 | 1 | 0 | 26.0 | n | S |
| 1061 | 3 | female | 26.0 | 0 | 0 | 7.8542 | n | S |
| 1261 | 3 | female | 63.0 | 0 | 0 | 9.5875 | n | S |
| 272 | 1 | female | 23.0 | 1 | 0 | 82.2667 | B | S |
| 450 | 2 | male | 50.0 | 0 | 0 | 13.0 | n | S |


In [10]:
rare_encoder.encoder_dict_

{'cabin': ['n', 'C', 'B', 'E', 'D'],
 'pclass': [3, 1, 2],
 'embarked': ['S', 'C', 'Q']}

In [11]:
train_t.cabin.value_counts()

cabin
n        702
C         71
B         42
Other     37
E         32
D         32
Name: count, dtype: int64
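
The two cabin outputs can be cross-checked with a little arithmetic. The counts below are copied from the output above (tol=0.03) and the training set has 916 rows; the sketch assumes a label is kept when its count is at least tol times the number of rows, and the variable names are only illustrative.

```python
# Counts copied from the value_counts() output for tol=0.03.
# 'Other' is the pooled group of infrequent labels, not a single label.
n_rows = 916
counts = {'n': 702, 'C': 71, 'B': 42, 'Other': 37, 'E': 32, 'D': 32}
assert sum(counts.values()) == n_rows

# A label needs at least tol * n_rows observations to be kept.
min_count_003 = 0.03 * n_rows  # about 27.5 rows
min_count_005 = 0.05 * n_rows  # about 45.8 rows

# With tol=0.03 every individually listed label clears the threshold.
kept_003 = [label for label, c in counts.items()
            if label != 'Other' and c >= min_count_003]

# With tol=0.05 only 'n' and 'C' do; everything else is pooled together.
kept_005 = [label for label, c in counts.items()
            if label != 'Other' and c >= min_count_005]
pooled_005 = sum(c for label, c in counts.items() if label not in kept_005)
```

The pooled total for tol=0.05 comes to 143, which agrees with the 'Rare' count reported by the first encoder, and the two kept lists agree with the cabin entries of the corresponding encoder_dict_ outputs.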

#### The user can choose to retain only the most popular categories with the argument max_n_categories.

In [12]:
## Rare value encoder

rare_encoder = RareLabelEncoder(tol=0.03,
                                variables=['cabin', 'pclass', 'embarked'],
                                n_categories=2,
                                max_n_categories=3)  # keep at most the 3 most frequent categories of each variable

rare_encoder.fit(X_train)

train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)

train_t.sample(5)

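| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 931 | 3 | male | 28.0 | 0 | 0 | 7.7375 | n | Q |
| 1162 | 3 | male | 28.0 | 0 | 0 | 7.75 | n | Q |
| 417 | 2 | male | 26.0 | 0 | 0 | 10.5 | n | S |
| 686 | 3 | female | 22.0 | 0 | 0 | 7.725 | n | Q |
| 72 | 1 | female | 26.0 | 1 | 0 | 136.7792 | C | C |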


In [13]:
rare_encoder.encoder_dict_

{'cabin': ['n', 'C', 'B'], 'pclass': [3, 1, 2], 'embarked': ['S', 'C', 'Q']}
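
The effect of max_n_categories on cabin can be reproduced by hand. The counts below are those of the labels that pass tol=0.03, copied from the earlier value_counts() output; the sketch assumes the cap simply keeps the highest-count labels among the frequent ones, and the variable names are only illustrative.

```python
# Counts of the cabin labels that pass tol=0.03 (copied from the earlier output).
frequent_counts = {'n': 702, 'C': 71, 'B': 42, 'E': 32, 'D': 32}
max_n_categories = 3

# Rank the frequent labels by count, highest first, and keep only the top ones.
ranked = sorted(frequent_counts, key=frequent_counts.get, reverse=True)
kept = ranked[:max_n_categories]

# Labels that were frequent but fall outside the cap get pooled as well.
dropped = ranked[max_n_categories:]
```

Here kept matches the cabin entry of encoder_dict_ shown above, while 'E' and 'D' are pushed into the pooled category even though they pass the tolerance.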

## Automatically select all categorical variables
If no list of variables is passed as an argument, the encoder selects all the categorical variables.

In [14]:
## Rare value encoder

rare_encoder = RareLabelEncoder(tol=0.03, n_categories=3)

rare_encoder.fit(X_train)

rare_encoder.encoder_dict_



{'pclass': [2, 3, 1],
 'sex': ['female', 'male'],
 'cabin': ['n', 'C', 'B', 'E', 'D'],
 'embarked': ['S', 'C', 'Q']}

In [15]:
train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)

train_t.sample(5)

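| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 403 | 2 | male | 23.0 | 0 | 0 | 13.0 | n | S |
| 339 | 2 | male | 1.0 | 2 | 1 | 39.0 | Rare | S |
| 1169 | 3 | male | 38.5 | 0 | 0 | 7.25 | n | S |
| 448 | 2 | male | 36.0 | 0 | 0 | 13.0 | n | S |
| 225 | 1 | male | 23.0 | 0 | 0 | 93.5 | B | S |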
