## Techniques for Encoding Categorical Features

1. One-Hot Encoding (for nominal categorical features)
2. One-Hot Encoding using top k most frequent categories (for nominal categorical features)
3. Ordinal Encoding (for ordinal categorical features)
4. Count Or Frequency Encoding (for nominal and ordinal categorical features)
5. Target Guided Ordinal Encoding (for nominal and ordinal categorical features)
6. Mean Encoding (for nominal and ordinal categorical features)
7. Probability Ratio Encoding (for nominal and ordinal categorical features)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 1. One-Hot Encoding

**How it works?**<br>
For each category of a feature we create a new binary feature that is 1 if the record belongs to that category and 0 otherwise. We therefore create as many one-hot encoded features as the feature has categories.

**Note:**<br>
If we split the data into train and test sets before one-hot encoding, a categorical feature in the test set may contain fewer categories than in the training set, so encoding the two sets separately produces fewer one-hot features for the test set. Train and test data must have the same features, and the same number of them. In this case it is better to use the OneHotEncoder class from sklearn, because it applies the encoded features learned from the training set to the test set. A category that occurs in the training set but not in the test set then becomes a test feature containing only 0s.<br>
We use drop_first=True to drop one category per feature. That column is redundant because it can be inferred from the remaining ones, and dropping it reduces the number of features.<br>
We use dummy_na=True to add a column that captures missing values (1 if the value is missing, otherwise 0).
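
The train/test mismatch described in the note can be reproduced on toy data. The sketch below uses made-up frames (`train`, `test` and the column `color` are hypothetical, not part of the Titanic data) and shows one possible way, assuming you stay with `pd.get_dummies`, to align the test columns to the training columns with `reindex`.

```python
import pandas as pd

# Hypothetical toy data: 'green' occurs only in the training set.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "blue"]})

# Encoding each set separately yields different columns.
train_enc = pd.get_dummies(train, columns=["color"])
test_enc = pd.get_dummies(test, columns=["color"])

# Columns present in the training set but missing in the test set.
missing = sorted(set(train_enc.columns) - set(test_enc.columns))
print(missing)

# Align the test columns to the training columns; absent categories become 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

This only covers categories missing from the test set. Categories that appear in the test set but not in the training set are silently dropped by `reindex`, which is one more reason to prefer a fitted encoder.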

In [5]:
df=pd.read_csv('titanic.csv',usecols=['Age', 'Sex', 'Embarked'])
df.head()

Unnamed: 0,Sex,Age,Embarked
0,male,22.0,S
1,female,38.0,C
2,female,26.0,S
3,female,35.0,S
4,male,35.0,S


**Way 1 of performing One-Hot Encoding (using pd.get_dummies)**

In [8]:
# with the columns argument, the original columns are dropped automatically after one-hot encoding
new_df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True, dummy_na=True)
new_df

Unnamed: 0,Age,Sex_male,Sex_nan,Embarked_Q,Embarked_S,Embarked_nan
0,22.0,1,0,0,1,0
1,38.0,0,0,0,0,0
2,26.0,0,0,0,1,0
3,35.0,0,0,0,1,0
4,35.0,1,0,0,1,0
...,...,...,...,...,...,...
886,27.0,1,0,0,1,0
887,19.0,0,0,0,1,0
888,,0,0,0,1,0
889,26.0,1,0,0,0,0


**Way 2 of performing One-Hot Encoding (using pd.get_dummies)**

In [13]:
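# when encoding a single column, the original column is kept, so we must drop it ourselves
onehot = pd.get_dummies(df['Sex'], drop_first=True, dummy_na=True)
new_df = pd.concat([df, onehot], axis=1)
# drop returns a new DataFrame, so the result has to be assigned
new_df = new_df.drop(columns=['Sex'])
new_df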

Unnamed: 0,Sex,Age,Embarked,male,NaN
0,male,22.0,S,1,0
1,female,38.0,C,0,0
2,female,26.0,S,0,0
3,female,35.0,S,0,0
4,male,35.0,S,1,0
...,...,...,...,...,...
886,male,27.0,S,1,0
887,female,19.0,S,0,0
888,female,,S,0,0
889,male,26.0,C,1,0


**Way 3 of performing One-Hot Encoding (using OneHotEncoder class from sklearn)**

In [5]:
import pandas as pd
import numpy as np

X_train = pd.DataFrame()
X_train['A'] = ['Benz', 'Audi', 'BMW', 'Porche']
X_train['B'] = ['Cat1', 'Cat2', 'Cat3', 'Cat4']
X_train['C'] = [1, 2, 3, 4]
X_test = pd.DataFrame()
X_test['A'] = ['Benz', 'Audi', 'BMW', 'Benz']
X_test['B'] = ['Cat1', 'Cat2', 'Cat3', 'Cat1']
X_test['C'] = [1, 2, 5, 6]

In [6]:
# DYNAMIC: works for any list of categorical columns

from sklearn.preprocessing import OneHotEncoder

def encode(X_train, X_test, cols_for_one_hot):
    # learn the categories from the training set only
    one_hot_encoder = OneHotEncoder()
    new_X_train = one_hot_encoder.fit_transform(X_train[cols_for_one_hot])
    # reuse the fitted encoder so the test set gets exactly the same columns
    new_X_test = one_hot_encoder.transform(X_test[cols_for_one_hot])

    # name the new columns after the categories; keep the original index so concat aligns the rows
    new_cols = np.concatenate(one_hot_encoder.categories_)
    onehot_X_train = pd.DataFrame(new_X_train.toarray(), columns=new_cols, index=X_train.index)
    onehot_X_test = pd.DataFrame(new_X_test.toarray(), columns=new_cols, index=X_test.index)

    # replace the original categorical columns with the encoded ones
    encoded_X_train = pd.concat([X_train.drop(columns=cols_for_one_hot), onehot_X_train], axis=1)
    encoded_X_test = pd.concat([X_test.drop(columns=cols_for_one_hot), onehot_X_test], axis=1)

    return encoded_X_train, encoded_X_test

In [7]:
cols_for_one_hot = ['A', 'B']
X_train, X_test = encode(X_train, X_test, cols_for_one_hot)

In [8]:
X_train

Unnamed: 0,C,Audi,BMW,Benz,Porche,Cat1,Cat2,Cat3,Cat4
0,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,2,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,3,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


In [9]:
X_test

Unnamed: 0,C,Audi,BMW,Benz,Porche,Cat1,Cat2,Cat3,Cat4
0,1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,2,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,5,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,6,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0


### 2. One-Hot Encoding using top k most frequent categories (category reduction technique)

**How it works?**<br>
We select the top k most frequent categories of a categorical feature and relabel all other (low-frequency) categories with a single new category called 'other'. Then we apply one-hot encoding to the feature.

**When to use?**<br>
When a categorical feature has high cardinality, meaning it has many different categories. Plain one-hot encoding would drastically increase the number of features, so we first reduce the number of categories to limit the number of newly created one-hot features.

In [14]:
df=pd.read_csv('mercedes.csv',usecols=["X0","X1","X2","X3","X4","X5","X6"])
df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6
0,k,v,at,a,d,u,j
1,k,t,av,e,d,y,l
2,az,w,n,c,d,x,j
3,az,t,n,f,d,x,l
4,az,v,n,f,d,h,d


In [15]:
def encode(dataset, feature, top_k):
    
    top_k_frequencies = dataset[feature].value_counts(ascending=False)[:top_k] 
    top_k_categories = top_k_frequencies.index
    total_nr_categories = len(dataset[feature].unique())
    nr_categories_other = len([cat for cat in dataset[feature].unique() if cat not in top_k_categories])
    print(f"Nr of categories (low frequency categories) labeled with 'other' : {nr_categories_other} out of total {total_nr_categories}")
    print(f"Nr of categories not labeled with 'other' : {top_k}")
    dataset[feature] = dataset[feature].apply(lambda x: x if x in top_k_categories else 'other')

    dataset = pd.get_dummies(dataset, columns=[feature], drop_first=True)
            
    return dataset

In [18]:
df = encode(df, 'X1', 10)
df.head()

Nr of categories (low frequency categories) labeled with 'other' : 17 out of total 27
Nr of categories not labeled with 'other' : 10


Unnamed: 0,X0,X2,X3,X4,X5,X6,X1_aa,X1_b,X1_c,X1_i,X1_l,X1_o,X1_other,X1_r,X1_s,X1_v
0,k,at,a,d,u,j,0,0,0,0,0,0,0,0,0,1
1,k,av,e,d,y,l,0,0,0,0,0,0,1,0,0,0
2,az,n,c,d,x,j,0,0,0,0,0,0,1,0,0,0
3,az,n,f,d,x,l,0,0,0,0,0,0,1,0,0,0
4,az,n,f,d,h,d,0,0,0,0,0,0,0,0,0,1


In [None]:
def encode(dataset, features, top_k):
    
    for feature in features:
        top_k_frequencies = dataset[feature].value_counts(ascending=False)[:top_k] 
        top_k_categories = top_k_frequencies.index
        total_nr_categories = len(dataset[feature].unique())
        nr_categories_other = len([cat for cat in dataset[feature].unique() if cat not in top_k_categories])
        print(f"Nr of categories (low frequency categories) labeled with 'other' : {nr_categories_other} out of total {total_nr_categories}")
        print(f"Nr of categories not labeled with 'other' : {top_k}")
        dataset[feature] = dataset[feature].apply(lambda x: x if x in top_k_categories else 'other')

    # encode all requested features, not only the last one visited by the loop
    dataset = pd.get_dummies(dataset, columns=features, drop_first=True)
            
    return dataset

### 3. Ordinal Encoding

In [34]:
# suppose Embarked is an ordinal categorical feature with the order S < C < Q
df=pd.read_csv('titanic.csv')
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [35]:
df['Embarked'] = df['Embarked'].map({'S' : 1, 'C' : 2, 'Q' : 3})
df['Embarked'].unique()

array([ 1.,  2.,  3., nan])

* We can also use the OrdinalEncoder class from sklearn.
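
A minimal sketch of the sklearn route, using `OrdinalEncoder` from `sklearn.preprocessing`. The frame `toy` is made up for illustration and contains no missing values; the order S < C < Q is the same assumption as in the cell above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column without missing values.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "C"]})

# Passing `categories` fixes the order S < C < Q.
# Without it, the categories are ordered alphabetically.
encoder = OrdinalEncoder(categories=[["S", "C", "Q"]])
toy["Embarked_encoded"] = encoder.fit_transform(toy[["Embarked"]]).ravel()
print(toy)
```

Note that the codes start at 0 here (S=0, C=1, Q=2), whereas the manual `map` above starts at 1.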

### 4. Count Or Frequency Encoding

**How it works?** <br>
We replace each category of a feature with its count (or relative frequency). More frequent categories get higher values.

**Advantages**<br>
1. Easy to use
2. Does not increase the feature space <br>

**Disadvantages**
1. Categories with the same frequency receive the same value, so they can no longer be distinguished
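
The disadvantage is easy to see on a small made-up example. In the sketch below (the frame `toy` and the column `city` are hypothetical), categories 'A' and 'B' occur equally often and therefore end up with the same encoded value. The relative-frequency variant is shown alongside the raw count.

```python
import pandas as pd

# Hypothetical toy column: 'A' and 'B' both occur twice, 'C' once.
toy = pd.DataFrame({"city": ["A", "A", "B", "B", "C"]})

# Count encoding: replace each category with its number of occurrences.
counts = toy["city"].value_counts().to_dict()
toy["city_count"] = toy["city"].map(counts)

# Frequency encoding: replace each category with its share of the rows.
freqs = toy["city"].value_counts(normalize=True).to_dict()
toy["city_freq"] = toy["city"].map(freqs)

print(toy)
```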

In [46]:
df=pd.read_csv('titanic.csv')
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [42]:
def encode(data, features):
    for feature in features:
        # map each category to its number of occurrences (NaN stays NaN)
        dic = data[feature].value_counts().to_dict()
        data[feature] = data[feature].map(dic)
        
    return data

In [47]:
df = encode(df, ['Embarked'])
df['Embarked'].unique()

array([644., 168.,  77.,  nan])

### 5. Target Guided Ordinal Encoding

**How it works?** <br>
We group the records by category and compute the mean of the target feature for each category. We then rank the categories by that mean and assign integer labels from 0 to n-1, where n is the number of categories: the category with the lowest mean gets 0, the next one gets 1, and so on.

In [2]:
import pandas as pd
df=pd.read_csv('titanic.csv', usecols=['Cabin', 'Pclass', 'Survived'])
df.head()

Unnamed: 0,Survived,Pclass,Cabin
0,0,3,
1,1,1,C85
2,1,3,
3,1,1,C123
4,0,3,


In [3]:
def encode(df, features, target_feature):
    for feature in features:
        df[feature]=df[feature].map({k:i for i,k in enumerate(df.groupby([feature])[target_feature].mean().sort_values().index,0)})        
    return df

In [6]:
def encode(df, features, target_feature):
    for feature in features:
        ordinal_labels = df.groupby([feature])[target_feature].mean().sort_values().index
        ordinal_labels2 = {k:i for i,k in enumerate(ordinal_labels,0)}
        df[feature]=df[feature].map(ordinal_labels2)
        
    return df

In [7]:
df = encode(df, ['Cabin'], 'Survived')
df

Unnamed: 0,Survived,Cabin
0,0,
1,1,124.0
2,1,
3,1,47.0
4,0,
...,...,...
886,0,
887,1,128.0
888,0,
889,1,103.0
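
The Cabin codes are hard to check by eye because there are so many cabins. The same steps on a small made-up frame (`toy`, `deck` and `survived` are hypothetical names) make the ranking visible.

```python
import pandas as pd

# Hypothetical toy data with three categories.
toy = pd.DataFrame({
    "deck": ["A", "A", "B", "B", "C", "C"],
    "survived": [0, 0, 1, 1, 0, 1],
})

# Mean of the target per category: A=0.0, B=1.0, C=0.5, sorted ascending.
means = toy.groupby("deck")["survived"].mean().sort_values()

# Rank the categories by their mean: the lowest mean gets 0.
mapping = {cat: rank for rank, cat in enumerate(means.index)}
toy["deck_encoded"] = toy["deck"].map(mapping)

print(mapping)
```

Here 'A' has the lowest survival mean and gets 0, 'C' gets 1 and 'B' gets 2.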


### 6. Mean Encoding

In [7]:
df=pd.read_csv('titanic.csv', usecols=['Cabin', 'Pclass', 'Survived'])
df.head()

Unnamed: 0,Survived,Pclass,Cabin
0,0,3,
1,1,1,C85
2,1,3,
3,1,1,C123
4,0,3,


In [8]:
def encode(df, features, target_feature):
    for variable in features:
        mean_ordinal=df.groupby([variable])[target_feature].mean().to_dict()
        df[variable]=df[variable].map(mean_ordinal)
        
    return df

In [9]:
df = encode(df, ['Cabin', 'Pclass'], 'Survived')
df

Unnamed: 0,Survived,Pclass,Cabin
0,0,0.242363,
1,1,0.629630,1.0
2,1,0.242363,
3,1,0.629630,0.5
4,0,0.242363,
...,...,...,...
886,0,0.472826,
887,1,0.629630,1.0
888,0,0.242363,
889,1,0.629630,1.0
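
The function above learns the means from the same rows it encodes. With a train/test split, one option is to learn the means on the training rows only and reuse them for the test rows. The sketch below assumes this setup; the frames `train` and `test` are made up, and filling unseen categories with the overall training mean is an assumption for illustration, not something the cells above do.

```python
import pandas as pd

# Hypothetical toy split; class 4 never occurs in the training set.
train = pd.DataFrame({"pclass": [1, 1, 2, 2, 3, 3],
                      "survived": [1, 1, 1, 0, 0, 0]})
test = pd.DataFrame({"pclass": [1, 3, 4]})

# Learn the mean of the target per category on the training rows only.
means = train.groupby("pclass")["survived"].mean()
global_mean = train["survived"].mean()

# Apply the learned means to both sets.
train["pclass_mean"] = train["pclass"].map(means)
# Assumed fallback: unseen categories get the overall training mean.
test["pclass_mean"] = test["pclass"].map(means).fillna(global_mean)

print(test)
```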


### 7. Probability Ratio Encoding

**How it works?**<br>
We first calculate the probability of survival for each category of a categorical feature (in our case Cabin): pr(Survived). The probability of not surviving is 1 - pr(Survived).
Then we calculate the ratio pr(Survived) / pr(Not Survived) and replace each category with it. A value greater than 1 means that passengers in that Cabin category were more likely to survive than not; a value lower than 1 means they were more likely not to survive. If the ratio equals 1, the passenger had a 50% chance of surviving. If every passenger in a category survived, pr(Not Survived) is 0 and the ratio is inf.
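
The arithmetic can be followed on a small made-up frame (`toy`, `cabin` and `survived` are hypothetical names). The three categories are chosen to give a ratio above 1, a ratio of exactly 1, and the division-by-zero case.

```python
import pandas as pd

# Hypothetical toy data: A has 3 of 4 survivors, B has 1 of 2, C has 2 of 2.
toy = pd.DataFrame({
    "cabin": ["A", "A", "A", "A", "B", "B", "C", "C"],
    "survived": [1, 1, 1, 0, 1, 0, 1, 1],
})

# pr(Survived) per category: A=0.75, B=0.5, C=1.0
p_survived = toy.groupby("cabin")["survived"].mean()

# pr(Survived) / pr(Not Survived): A=3.0, B=1.0, C=inf (division by zero)
ratio = p_survived / (1 - p_survived)

toy["cabin_encoded"] = toy["cabin"].map(ratio)
print(ratio)
```

Infinite values usually need extra handling before modelling, for example by replacing a zero denominator with a small constant; which treatment fits is a modelling decision.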

In [68]:
df=pd.read_csv('titanic.csv',usecols=['Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [70]:
def encode(df, features, target_feature):
    for var in features:
        prob_df=df.groupby([var])[target_feature].mean()
        prob_df=pd.DataFrame(prob_df)
        prob_df['Died']=1-prob_df[target_feature]
        prob_df['Probability_ratio']=prob_df[target_feature]/prob_df['Died']
        probability_encoded=prob_df['Probability_ratio'].to_dict()
        df[var]=df[var].map(probability_encoded)
    
    return df

In [71]:
df = encode(df, ['Cabin'], 'Survived')
df.head()


Unnamed: 0,Survived,Cabin
0,0,
1,1,inf
2,1,
3,1,1.0
4,0,
