# Categorical Encoding

Machine learning models do not understand text; they only understand numbers. Sometimes the datasets we want to use to train a model contain text, so how do we go about changing that text into a form that machine learning algorithms can understand?


This is where **Label Encoding** and **One Hot Encoding** come in. Both convert text data into a numeric format that machine learning algorithms can work with.

In most cases, we talk about label encoding when we have **categorical columns**, for example **male** or **female** in a gender feature column.

Very few machine learning models can handle the categorical conversion on their own (for example **CatBoost**); most cannot deal with categorical features directly.

## Label Encoding

This is the simplest way to encode categorical values. Each category is converted to a number: the first category gets a starting value, and every category after it gets the previous value incremented by one (this is sometimes called a **running sequence**).

In [25]:
import pandas as pd
import numpy as np

In [26]:
df = pd.DataFrame({"Fruits": ["Mango", "Apple", "Orange", "Pineapple", "Grape"],
                      "Wieghts": [20, 10, 15, 500, 4]})
df

Unnamed: 0,Fruits,Wieghts
0,Mango,20
1,Apple,10
2,Orange,15
3,Pineapple,500
4,Grape,4


From the data above, you can see that the **Fruits** column is **categorical**. But how sure are we that its type is actually categorical? Let's confirm this in Python.

In [27]:
df.dtypes

Fruits     object
Wieghts     int64
dtype: object

You can clearly see that the column's type is `object`, not **categorical**, even though it looks categorical. Let's go ahead and convert it to a categorical data type.

In [28]:
df['Fruits'] = df['Fruits'].astype("category")

In [29]:
df.dtypes

Fruits     category
Wieghts       int64
dtype: object

Now we can clearly see that it is a categorical column. Let's use **Label Encoding** to convert it into a numeric array.

There are multiple ways to do this. First, let's see how to do it using pandas.

In [30]:
df["Encoding_with_Pandas"] = df["Fruits"].cat.codes

In [31]:
df

Unnamed: 0,Fruits,Wieghts,Encoding_with_Pandas
0,Mango,20,2
1,Apple,10,0
2,Orange,15,3
3,Pineapple,500,4
4,Grape,4,1
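
To see which fruit each code stands for, one option is to read the mapping back from the categorical column. This is a small sketch that rebuilds the same fruit list as a standalone Series; `code_to_fruit` is a name introduced here only for illustration.

```python
import pandas as pd

fruits = pd.Series(
    ["Mango", "Apple", "Orange", "Pineapple", "Grape"], dtype="category"
)

# .cat.categories holds the distinct values; a value's position there is its code.
code_to_fruit = dict(enumerate(fruits.cat.categories))
print(code_to_fruit)
```

The categories are stored in sorted order, which is why **Apple** gets code 0 and **Pineapple** gets code 4.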


We can also do this using **scikit-learn**.

In [32]:
from sklearn.preprocessing import LabelEncoder

In [33]:
label_encoder = LabelEncoder()

In [34]:
label_encoder.fit(df["Fruits"])

df["encoded_with_sklearn"] = label_encoder.transform(df["Fruits"])

In [35]:
df

Unnamed: 0,Fruits,Wieghts,Encoding_with_Pandas,encoded_with_sklearn
0,Mango,20,2,2
1,Apple,10,0,0
2,Orange,15,3,3
3,Pineapple,500,4,4
4,Grape,4,1,1


A shorter approach is to fit and transform in one line.

In [36]:
df["encoded_with_sklearn"] = label_encoder.fit_transform(df["Fruits"])

In [37]:
df

Unnamed: 0,Fruits,Wieghts,Encoding_with_Pandas,encoded_with_sklearn
0,Mango,20,2,2
1,Apple,10,0,0
2,Orange,15,3,3
3,Pineapple,500,4,4
4,Grape,4,1,1
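
A fitted `LabelEncoder` also remembers the mapping it learned. The sketch below, which uses a plain list instead of the DataFrame column to stay self-contained, shows how to inspect that mapping with `classes_` and how to undo the encoding with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

fruits = ["Mango", "Apple", "Orange", "Pineapple", "Grape"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(fruits)

# classes_ lists the learned categories; index i corresponds to code i.
print(label_encoder.classes_)

# inverse_transform maps the codes back to the original labels.
decoded = label_encoder.inverse_transform(codes)
print(list(decoded))
```

Note that the scikit-learn documentation describes `LabelEncoder` as intended for target labels; for input feature columns it provides `OrdinalEncoder`, which applies the same idea to 2D data.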


Label Encoding is a simple approach, but it has one shortcoming: it unintentionally gives weights to the different values of the categorical feature. Since models only see numbers, **Grape** is given less weight than **Pineapple** because **1** < **4**. Models can then misinterpret the feature as a hierarchy or ordering. This is not what we intended; we simply wanted to convert the data into numeric columns. This is the main downside of label encoding. Another approach to categorical encoding, **One Hot Encoding**, avoids this problem. Let's take a look at it.
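
The unintended ordering can be made concrete by comparing the gaps between codes. The sketch below rebuilds the fruit codes and measures two gaps; the `gap_*` names are introduced here only for illustration.

```python
import pandas as pd

fruits = pd.Series(
    ["Mango", "Apple", "Orange", "Pineapple", "Grape"], dtype="category"
)
codes = {fruit: int(code) for fruit, code in zip(fruits, fruits.cat.codes)}

# Differences between codes look like meaningful distances, but they only
# reflect where each name falls in alphabetical order.
gap_apple_grape = abs(codes["Apple"] - codes["Grape"])
gap_apple_pineapple = abs(codes["Apple"] - codes["Pineapple"])

print(gap_apple_grape, gap_apple_pineapple)
```

A model that treats the column as numeric would see **Apple** as four times closer to **Grape** than to **Pineapple**, although nothing about the fruits justifies that.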

## One Hot Encoding

In One Hot Encoding, we convert the values into multiple columns, one per category, and each column is assigned a 1 or a 0: 1 if the row belongs to that category and 0 otherwise. This avoids the problem of an implied hierarchy or ordering.

Let's take a look at how we can accomplish this in Python.

In [38]:
from sklearn.preprocessing import OneHotEncoder

In [39]:
one_hot_encoding = OneHotEncoder()

In [46]:
transformed_one_hot_encoding = one_hot_encoding.fit_transform(df[["Fruits"]])

In [47]:
transformed_one_hot_encoding

<5x5 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

In [63]:
ohe_df = pd.DataFrame(transformed_one_hot_encoding.toarray(), dtype="int")

In [64]:
ohe_df

Unnamed: 0,0,1,2,3,4
0,0,0,1,0,0
1,1,0,0,0,0
2,0,0,0,1,0
3,0,0,0,0,1
4,0,1,0,0,0


In [52]:
ohe_final_df = df.join(ohe_df)

In [53]:
ohe_final_df

Unnamed: 0,Fruits,Wieghts,Encoding_with_Pandas,encoded_with_sklearn,0,1,2,3,4
0,Mango,20,2,2,0,0,1,0,0
1,Apple,10,0,0,1,0,0,0,0
2,Orange,15,3,3,0,0,0,1,0
3,Pineapple,500,4,4,0,0,0,0,1
4,Grape,4,1,1,0,1,0,0,0


Another way to do this is the **dummy values** approach in pandas.

In [58]:
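# `columns` only applies when a DataFrame is passed, so it is omitted for a single column.
# dtype=int keeps the 0/1 output on pandas versions that default to booleans.
dummy_df = pd.get_dummies(df["Fruits"], prefix="Fruit_type", dtype=int)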
dummy_df

Unnamed: 0,Fruit_type_Apple,Fruit_type_Grape,Fruit_type_Mango,Fruit_type_Orange,Fruit_type_Pineapple
0,0,0,1,0,0
1,1,0,0,0,0
2,0,0,0,1,0
3,0,0,0,0,1
4,0,1,0,0,0


In [62]:
dummy_df.style.applymap(lambda x: "background-color: yellow" if x>0 else "background-color: blue")

Unnamed: 0,Fruit_type_Apple,Fruit_type_Grape,Fruit_type_Mango,Fruit_type_Orange,Fruit_type_Pineapple
0,0,0,1,0,0
1,1,0,0,0,0
2,0,0,0,1,0
3,0,0,0,0,1
4,0,1,0,0,0


## When To Apply Which Method

Apply Label Encoding when there are at most two categorical values to encode. With only two values, encoded as 0 and 1, no misleading hierarchy is introduced.

On the other hand, use One Hot Encoding when there are more than two values; this avoids implying a hierarchical ordering among them.
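
The rule of thumb above could be sketched as a small helper. `encode_column` and the `people` DataFrame are hypothetical, introduced only for this illustration, and the threshold of two categories simply mirrors the guideline in this section.

```python
import pandas as pd


def encode_column(df, column):
    """Hypothetical helper: label-encode a column with at most two
    categories, one-hot encode it otherwise."""
    if df[column].nunique() <= 2:
        # Two categories or fewer: codes 0 and 1 imply no misleading order.
        encoded = df[column].astype("category").cat.codes.rename(column)
        return df.drop(columns=[column]).join(encoded)
    # More than two categories: one 0/1 column per category.
    dummies = pd.get_dummies(df[column], prefix=column, dtype=int)
    return df.drop(columns=[column]).join(dummies)


people = pd.DataFrame({
    "Gender": ["male", "female", "female", "male"],
    "Fruits": ["Mango", "Apple", "Orange", "Apple"],
})

result = encode_column(encode_column(people, "Gender"), "Fruits")
print(result)
```

Here **Gender** stays a single 0/1 column, while **Fruits** is expanded into one column per fruit.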