# What is Label Encoding?

Label Encoding is a popular technique for handling categorical variables: each distinct category is replaced by a unique integer.

Let’s see how to implement label encoding in Python using the scikit-learn library and also understand the challenges with label encoding.

First import the required libraries and dataset:

In [2]:
#importing the libraries
import pandas as pd
import numpy as np

In [4]:
#reading the dataset
data = pd.read_csv("Salary.csv")

In [None]:
#printing a summary of the dataset
data.info()

Now, let us implement label encoding in Python:

In [None]:
# Import label encoder
from sklearn import preprocessing
# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'Country'.
data['Country'] = label_encoder.fit_transform(data['Country'])
print(data.head())

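As you can see here, label encoding uses alphabetical ordering. Hence, India has been encoded with 0, Japan with 1, and the US with 2. The challenge is that these integers imply an order (India < Japan < US) that does not exist in the data, and a model may misread it as a ranking.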
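The alphabetical assignment can be checked on a small example. The sketch below uses a hypothetical toy DataFrame (not the Salary dataset) and reads the mapping back from the encoder's `classes_` attribute:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the 'Country' column of the dataset.
toy = pd.DataFrame({"Country": ["India", "US", "Japan", "US", "Japan"]})

encoder = LabelEncoder()
toy["Country_code"] = encoder.fit_transform(toy["Country"])

# classes_ holds the sorted unique labels; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(toy)
```

Printing `mapping` makes the implied order explicit, which is useful when checking whether a model might treat the codes as a ranking.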

# What is One Hot Encoding?

One-Hot Encoding is another popular technique for treating categorical variables. It simply creates additional features based on the number of unique values in the categorical feature. Every unique value in the category will be added as a feature. One-Hot Encoding is the process of creating dummy variables.

In this encoding technique, each category is represented as a one-hot vector. Let’s see how to implement one-hot encoding in Python:

In [None]:
# importing one hot encoder
from sklearn.preprocessing import OneHotEncoder
# creating one hot encoder object
onehotencoder = OneHotEncoder()
#reshape the 1-D country array to 2-D as fit_transform expects 2-D and finally fit the object
X = onehotencoder.fit_transform(data.Country.values.reshape(-1,1)).toarray()
#To add this back into the original dataframe (one new column per category)
dfOneHot = pd.DataFrame(X, columns = ["Country_"+str(i) for i in range(X.shape[1])])
df = pd.concat([data, dfOneHot], axis=1)
#dropping the country column
df= df.drop(['Country'], axis=1)
#printing to verify
print(df.head())

As you can see here, three new features are added because the Country column contains three unique values: India, Japan, and the US. This technique solves the ranking problem, as each category is represented by a binary vector rather than by an integer that implies an order.
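As a minimal sketch of the same idea, pandas' `get_dummies` builds the dummy variables directly from the string column. The toy DataFrame and its `Age` values below are hypothetical and only stand in for the dataset used in this article:

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    "Country": ["India", "US", "Japan", "US", "Japan"],
    "Age": [44, 34, 46, 35, 23],
})

# dtype=int gives 0/1 columns instead of booleans.
encoded = pd.get_dummies(toy, columns=["Country"], prefix="Country", dtype=int)
print(encoded)
```

Here the new columns are named after the categories themselves (`Country_India`, `Country_Japan`, `Country_US`), which can be easier to read than numeric suffixes.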

# Challenges of One-Hot Encoding: Dummy Variable Trap

One-Hot Encoding results in a Dummy Variable Trap, because the value of one dummy variable can be predicted from the remaining ones. The Dummy Variable Trap is a scenario in which variables are highly correlated with each other, which is known as multicollinearity.
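The redundancy can be seen on a small, hypothetical example: every row has exactly one dummy set to 1, so any one dummy column can be rebuilt from the others. This is a sketch of the idea, not part of the article's dataset:

```python
import pandas as pd

# Hypothetical toy column, used only to illustrate the redundancy.
countries = pd.Series(["India", "US", "Japan", "US", "Japan"], name="Country")
dummies = pd.get_dummies(countries, dtype=int)

# Every row has exactly one 1, so the dummy columns always sum to 1.
row_sums = dummies.sum(axis=1)

# Any one column can therefore be rebuilt from the others.
rebuilt_us = 1 - dummies["India"] - dummies["Japan"]
print(row_sums.tolist())
print(bool((rebuilt_us == dummies["US"]).all()))
```

In a model that also contains an intercept, this exact linear relationship is what makes the full set of dummy columns redundant.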

One of the common ways to check for multicollinearity is the Variance Inflation Factor (VIF):

    VIF = 1: very little multicollinearity
    VIF < 5: moderate multicollinearity
    VIF > 5: extreme multicollinearity (this is what we have to avoid)
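For reference, the VIF of feature $i$ is usually defined from the $R_i^2$ obtained by regressing that feature on all the other features. The symbols here are notation introduced for this note:

```latex
\mathrm{VIF}_i = \frac{1}{1 - R_i^2}
```

For example, $R_i^2 = 0.8$ gives $\mathrm{VIF}_i = 1 / (1 - 0.8) = 5$, which is the threshold in the list above; as $R_i^2$ approaches 1, the VIF grows without bound.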

# Compute the VIF scores:

In [None]:
# statsmodels provides the OLS regression used below
import statsmodels.api as sm

# Function to calculate VIF
def calculate_vif(data):
    vif_df = pd.DataFrame(columns = ['Var', 'Vif'])
    x_var_names = data.columns
    for i in range(0, x_var_names.shape[0]):
        # regress each feature on all the other features
        y = data[x_var_names[i]]
        x = data[x_var_names.drop([x_var_names[i]])]
        r_squared = sm.OLS(y,x).fit().rsquared
        # VIF = 1 / (1 - R^2)
        vif = round(1/(1-r_squared),2)
        vif_df.loc[i] = [x_var_names[i], vif]
    return vif_df.sort_values(by = 'Vif', axis = 0, ascending=False, inplace=False)

X=df.drop(['Salary'],axis=1)
calculate_vif(X)

From the output, we can see that the dummy variables which are created using one-hot encoding have VIF above 5. We have a multicollinearity problem.

Now, let us drop one of the dummy variables to solve the multicollinearity issue:

In [None]:
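#dropping one dummy variable by name (the name follows the "Country_" naming used above)
df = df.drop(['Country_0'], axis=1)
X = df.drop(['Salary'], axis=1)
calculate_vif(X)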

Wow! VIF has decreased. We solved the problem of multicollinearity. Now, the dataset is ready for building the model.
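If the dummy variables are built with pandas instead, the same "drop one dummy" step can be done at encoding time with `drop_first=True`. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data standing in for the 'Country' column.
toy = pd.DataFrame({"Country": ["India", "US", "Japan", "US", "Japan"]})

# drop_first=True omits the first category's column (here 'India').
reduced = pd.get_dummies(toy, columns=["Country"], drop_first=True, dtype=int)
print(reduced.columns.tolist())
print(reduced)
```

The dropped category is not lost: a row with 0 in every remaining dummy column is an "India" row.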

# When to Use Label Encoding vs. One Hot Encoding

The choice generally depends on your dataset and the model you wish to apply. Still, here are a few points to note before choosing an encoding technique for your model:

We apply One-Hot Encoding when:

    The categorical feature is not ordinal (like the countries above)
    The number of categories is small, so one-hot encoding can be applied effectively

We apply Label Encoding when:

    The categorical feature is ordinal (like Jr. kg, Sr. kg, Primary school, High school) and the integer codes follow that order
    The number of categories is quite large, as one-hot encoding can lead to high memory consumption
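One quick way to judge the second point in each list is to count the unique values per column, since one-hot encoding adds one column per unique value. The sketch below uses a hypothetical toy DataFrame:

```python
import pandas as pd

# Hypothetical toy data; column names and values are for illustration only.
toy = pd.DataFrame({
    "Country": ["India", "US", "Japan", "US", "Japan"],
    "Grade": ["Primary school", "High school", "Jr. kg", "Sr. kg", "Jr. kg"],
})

# One-hot encoding adds one column per unique value, so nunique() gives the width.
n_new_columns = toy.nunique()
print(n_new_columns)
```

A column with thousands of unique values would add thousands of one-hot columns, which is where memory consumption becomes a concern.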

# Label Encoding vs One Hot Encoding vs Ordinal Encoding


    Label Encoding: Label encoding assigns a unique integer to each category in a categorical variable. The integers are arbitrary codes (scikit-learn's LabelEncoder assigns them in sorted order), so they reflect an ordinal relationship only when that order happens to match it. For example, “Red” may be encoded as 1, “Green” as 2, and “Blue” as 3.
    One-Hot Encoding: One-hot encoding converts each category in a categorical variable into a binary vector. It creates new binary columns for each category, representing the presence or absence of the category. Each category is mutually exclusive. For example, “Red” may be encoded as [1, 0, 0], “Green” as [0, 1, 0], and “Blue” as [0, 0, 1].
    Ordinal Encoding: Ordinal encoding is similar to label encoding but considers the order or rank of categories. It assigns unique numerical labels to each category, preserving the ordinal relationship between categories. For example, “Cold” may be encoded as 1, “Warm” as 2, and “Hot” as 3.
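The difference from label encoding can be sketched with scikit-learn's `OrdinalEncoder`, where the category order is passed in explicitly. The toy data is hypothetical; note that scikit-learn's codes start at 0 rather than 1 and are returned as floats:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data for the Cold / Warm / Hot example.
toy = pd.DataFrame({"Temperature": ["Warm", "Cold", "Hot", "Cold"]})

# The category order is passed explicitly; it is not inferred from the data.
encoder = OrdinalEncoder(categories=[["Cold", "Warm", "Hot"]])
toy["Temperature_code"] = encoder.fit_transform(toy[["Temperature"]]).ravel()
print(toy)
```

Sorted alphabetically, the same labels would come out as Cold, Hot, Warm, which places "Hot" below "Warm"; supplying the order avoids that.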