In [33]:
# define data
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from numpy import asarray

data = asarray([['red'], ['green'], ['blue']])
data

array([['red'],
       ['green'],
       ['blue']], dtype='<U5')

1- Label encoding

Label Encoding: Assign each categorical value an integer value based on <b>alphabetical order</b>.

In machine learning, datasets often contain labels in one or more columns. These labels can be words or numbers; to keep the data human-readable, training data is often labelled in words.

Label encoding converts those labels into numeric form so that they are machine-readable, since most machine learning algorithms operate on numbers rather than strings. It is an important pre-processing step for structured datasets in supervised learning.

In [34]:
# Import label encoder
from sklearn import preprocessing
print(data)
# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()

# Encode the color labels. LabelEncoder expects a 1-D array, so flatten the column first.
label_encoder_data = label_encoder.fit_transform(data.ravel())
label_encoder_data

[['red']
 ['green']
 ['blue']]


array([2, 1, 0])
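
To check which integer was assigned to each label, the fitted encoder can be inspected. The following is a minimal sketch, assuming the same three colors as above; the names `labels`, `mapping`, and `decoded` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(['red', 'green', 'blue'])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded)
```

Because the classes are sorted, blue gets 0, green gets 1, and red gets 2, which matches the array printed above.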

2- Ordinal encoding

In [35]:
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define ordinal encoding
OrdinalCode = OrdinalEncoder()
# transform data
result = OrdinalCode.fit_transform(data)
print(result)

[['red']
 ['green']
 ['blue']]
[[2.]
 [1.]
 [0.]]


3- One hot encoding

One Hot Encoding: Create new variables that take on values 0 and 1 to represent the original categorical values.

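For example, if we had a variable such as color with the labels “red,” “green,” and “blue,” we could encode each label as a three-element binary vector: Red: [1, 0, 0], Green: [0, 1, 0], Blue: [0, 0, 1]. Which position belongs to which label is a convention. scikit-learn's OneHotEncoder orders the columns alphabetically (blue, green, red), so in the output below red is [0, 0, 1].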

In [36]:
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder

# define one hot encoding
# (scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False instead)
encoder = OneHotEncoder(sparse_output=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
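
To make the column order explicit, the same encoding can be sketched with pandas, which labels each indicator column with its category. This is an alternative to the OneHotEncoder call above rather than a replacement for it; the DataFrame `colors` and its column name `color` are illustrative.

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'green', 'blue']})

# get_dummies creates one 0/1 column per category, with columns in sorted order.
one_hot = pd.get_dummies(colors['color'], dtype=int)
print(one_hot)
```

Each row contains exactly one 1, in the column named after that row's color.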


4- Target mean encoding

Target encoding replaces a categorical feature with the average target value of all data points belonging to that category.
For instance, Riyadh can be replaced with the average salary (the target variable) of all data points where the city is Riyadh.

In [37]:
data = [['Riyadh', 120], ['Abha',120], ['Jeddah',140], 
        ['Abha', 100], ['Abha',70], ['Jeddah',100],['Riyadh',60], 
        ['Jeddah', 110], ['Abha',100],['Riyadh',70] ]
df = pd.DataFrame(data, columns = ['City','Salary'])
df

,City,Salary
0,Riyadh,120
1,Abha,120
2,Jeddah,140
3,Abha,100
4,Abha,70
5,Jeddah,100
6,Riyadh,60
7,Jeddah,110
8,Abha,100
9,Riyadh,70


In [38]:
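import category_encoders as ce
# Fit the encoder on City, using Salary as the target
target_encoder = ce.TargetEncoder()
df_city = target_encoder.fit_transform(df['City'], df['Salary'])

# Put the encoded City column back next to Salary
df_new = df_city.join(df.drop('City', axis=1))
df_new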

,City,Salary
0,85.200846,120
1,97.571139,120
2,114.560748,140
3,97.571139,100
4,97.571139,70
5,114.560748,100
6,85.200846,60
7,114.560748,110
8,97.571139,100
9,85.200846,70
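
The values in the output above are not the raw per-city means (for Riyadh, (120 + 60 + 70) / 3 ≈ 83.33). This is consistent with the library smoothing each category mean toward the overall mean. For comparison, an unsmoothed target mean encoding can be sketched with pandas alone; the column name `City_mean` is illustrative.

```python
import pandas as pd

rows = [['Riyadh', 120], ['Abha', 120], ['Jeddah', 140],
        ['Abha', 100], ['Abha', 70], ['Jeddah', 100], ['Riyadh', 60],
        ['Jeddah', 110], ['Abha', 100], ['Riyadh', 70]]
df = pd.DataFrame(rows, columns=['City', 'Salary'])

# Plain (unsmoothed) mean of the target for each category.
city_means = df.groupby('City')['Salary'].mean()

# Replace each city by the mean salary of its group.
df['City_mean'] = df['City'].map(city_means)
print(city_means.round(2))
```

With so few rows per category, the raw means follow the training targets closely, which is one reason smoothing is commonly applied.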


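5- Frequency encoding

Frequency encoding is a normalized version of count encoding: each category is replaced by the fraction of rows in which it appears, rather than by its raw count.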

In [39]:
Frequency = (df.groupby('City').size()) / len(df)
Frequency

City
Abha      0.4
Jeddah    0.3
Riyadh    0.3
dtype: float64

In [51]:
# Look up each city's frequency in the Frequency series
df['Frequency'] = df['City'].map(Frequency)
df

,City,Salary,Frequency
0,Riyadh,120,0.3
1,Abha,120,0.4
2,Jeddah,140,0.3
3,Abha,100,0.4
4,Abha,70,0.4
5,Jeddah,100,0.3
6,Riyadh,60,0.3
7,Jeddah,110,0.3
8,Abha,100,0.4
9,Riyadh,70,0.3
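
The relationship between count encoding and frequency encoding can be sketched as follows, using the same cities. The names `count_encoded` and `freq_encoded` are illustrative.

```python
import pandas as pd

cities = pd.Series(['Riyadh', 'Abha', 'Jeddah', 'Abha', 'Abha',
                    'Jeddah', 'Riyadh', 'Jeddah', 'Abha', 'Riyadh'],
                   name='City')

# Count encoding: number of rows in which each category appears.
counts = cities.value_counts()
count_encoded = cities.map(counts)

# Frequency encoding: the same counts divided by the number of rows.
freqs = cities.value_counts(normalize=True)
freq_encoded = cities.map(freqs)

print(pd.DataFrame({'City': cities,
                    'Count': count_encoded,
                    'Frequency': freq_encoded}))
```

Dividing the count column by `len(cities)` gives the frequency column, so the two encodings carry the same information on a fixed dataset.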


6- Binary encoding

Binary encoding first assigns each category an integer code and then writes that code in binary, using one column per binary digit. Because n columns can represent many more than n categories, it needs fewer columns than one-hot encoding when a variable has many categories.

In [60]:
data = [['Riyadh', 120], ['Abha',120], ['Jeddah',140], 
        ['Abha', 100], ['Abha',70], ['Jeddah',100],['Riyadh',60], 
        ['Jeddah', 110], ['Abha',100],['Riyadh',70] ]
df = pd.DataFrame(data, columns = ['City','Salary'])

ce_bin = ce.BinaryEncoder(cols=['City'])
df_binary = ce_bin.fit_transform(df)
df_binary
# Resulting codes: Riyadh [01], Abha [10], Jeddah [11]

,City_0,City_1,Salary
0,0,1,120
1,1,0,120
2,1,1,140
3,1,0,100
4,1,0,70
5,1,1,100
6,0,1,60
7,1,1,110
8,1,0,100
9,0,1,70
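
The two steps behind this output can be sketched by hand. The sketch assumes that integer codes start at 1 and follow the order of first appearance, which reproduces the columns above; the library's internals may differ, and the names `codes`, `n_bits`, and `binary_df` are illustrative.

```python
import pandas as pd

cities = pd.Series(['Riyadh', 'Abha', 'Jeddah', 'Abha', 'Abha',
                    'Jeddah', 'Riyadh', 'Jeddah', 'Abha', 'Riyadh'])

# Step 1: integer code per category, in order of first appearance.
# factorize starts at 0, so add 1 (assumption: codes start at 1).
codes, uniques = pd.factorize(cities)
codes = codes + 1

# Step 2: write each code in binary, one column per bit, most significant bit first.
n_bits = int(codes.max()).bit_length()
bits = {f'City_{i}': (codes >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
binary_df = pd.DataFrame(bits)
print(binary_df)
```

Here Riyadh is code 1 (01), Abha is code 2 (10), and Jeddah is code 3 (11).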


<b>Conclusion</b>
So, which one should you use?

It depends on the dataset, the model, and the performance metric you are trying to optimize.

In general, one-hot encoding is the most commonly used method for nominal variables. It is simple to understand and implement, and it works well with most machine learning models.

To fight the curse of dimensionality, binary encoding might be a good alternative to one-hot encoding because it creates fewer columns when encoding categorical variables.
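
As a rough illustration of the difference in column counts, the following sketch compares the two encodings for a few category counts. It assumes binary codes run from 1 to the number of categories, as in the binary encoding output earlier; the function names are hypothetical.

```python
def one_hot_columns(n_categories: int) -> int:
    # One-hot encoding uses one column per category.
    return n_categories


def binary_columns(n_categories: int) -> int:
    # Assumption: codes run from 1 to n_categories, so the number of columns
    # is the number of bits needed to write the largest code.
    return n_categories.bit_length()


for k in (3, 10, 100, 1000):
    print(k, one_hot_columns(k), binary_columns(k))
```

The gap widens quickly: under this assumption, a variable with 1000 categories needs 1000 one-hot columns but only 10 binary columns.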

Ordinal encoding is a good choice if the order of the categorical variables matters. For example, if we were predicting the price of a house, the labels “small”, “medium”, and “large” would imply that a small house is cheaper than a medium house, which is cheaper than a large house.
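
By default, OrdinalEncoder sorts categories alphabetically, which would put “large” before “medium” and “small”. A minimal sketch of passing the intended order explicitly through the `categories` parameter follows; the sample array `sizes` is illustrative.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = np.array([['medium'], ['small'], ['large'], ['small']])

# List the categories in their meaningful order: small < medium < large.
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(sizes)
print(encoded.ravel())
```

With the order given explicitly, small maps to 0, medium to 1, and large to 2, so the integer codes reflect the size ordering.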

Target encoding, on the other hand, is a supervised encoder that captures information about the label and potentially predictive features. This encoder does not increase the dimensionality of the feature space, but it can lead to overfitting and is prone to target leakage.

Frequency and count encoders also leave the dimensionality of the feature space unchanged. Unlike target encoding, they are unsupervised: they use only how often each category occurs, not the target. They are therefore only useful when a category's frequency is related to the target, and categories that occur equally often (such as Jeddah and Riyadh above, both 0.3) receive the same value and can no longer be told apart.