# Encoding

Encoding is the process of converting categorical data into a numerical format. Categorical data represents qualitative variables with discrete categories or levels, such as colors, types of cars, or product categories. Because most machine learning algorithms and statistical models require numerical input, categorical features must be encoded before they can be used.

## Label Encoding


Description: Assigns a unique integer to each category.

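Limitations: Imposes an arbitrary numeric order on categories that have none, which can mislead models that treat the codes as magnitudes (for example, linear models or distance-based methods).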

In [1]:
from sklearn.preprocessing import LabelEncoder

data = ['cat', 'dog', 'bird', 'cat', 'dog']
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(data)
print(encoded_data)  # Output: [1 2 0 1 2] (classes are sorted alphabetically: bird=0, cat=1, dog=2)


[1 2 0 1 2]
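The integer codes follow the sorted order of the labels, not their order of appearance. The following is a minimal sketch, reusing the same data, of how the learned mapping can be inspected through `classes_` and reversed with `inverse_transform`; the variable names `mapping` and `decoded` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

data = ['cat', 'dog', 'bird', 'cat', 'dog']
encoder = LabelEncoder()
codes = encoder.fit_transform(data)

# classes_ holds the sorted unique labels; a label's position is its code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
decoded = [str(label) for label in encoder.inverse_transform(codes)]
print(decoded)
```

Checking the mapping this way makes it clear that the codes carry no meaning beyond identity.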


## One-Hot Encoding

Description: Converts a categorical variable into binary indicator columns, one per category, so that each row has exactly one column set.

Limitations: Can lead to a high-dimensional and sparse representation, which might be computationally expensive and prone to the curse of dimensionality.

In [1]:
import pandas as pd

# Sample categorical data

data = {'fruit': ['apple', 'orange', 'banana', 'apple']}

df = pd.DataFrame(data)

# One Hot Encoding using Pandas get_dummies

encoded_df = pd.get_dummies(df, columns=['fruit'], drop_first=False)

print(encoded_df)

   fruit_apple  fruit_banana  fruit_orange
0         True         False         False
1        False         False          True
2        False          True         False
3         True         False         False
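`pd.get_dummies` builds columns from whatever categories are present in the frame it receives. When the same encoding has to be applied to new data, a fitted encoder is one option. The sketch below assumes scikit-learn's `OneHotEncoder` with `handle_unknown='ignore'`; the `new` rows, including the unseen value `'kiwi'`, are made up for illustration.

```python
from sklearn.preprocessing import OneHotEncoder

train = [['apple'], ['orange'], ['banana'], ['apple']]
new = [['banana'], ['kiwi']]  # 'kiwi' was not seen during fit (illustrative)

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

# The default output is a SciPy sparse matrix; convert it for display.
encoded_new = encoder.transform(new).toarray()

print(encoder.categories_[0])
print(encoded_new)
```

The sparse default output is also one way to limit the memory cost of a high-dimensional encoding.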


## Dummy Encoding

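Description: Similar to one-hot encoding, but drops one of the k indicator columns so that k-1 remain. This avoids the perfect multicollinearity (the "dummy variable trap") that occurs when all k columns are used alongside an intercept.

Limitations: The dropped category becomes an implicit baseline encoded as all zeros, so it has no column of its own and the other categories are interpreted relative to it.

In [15]:
data = ['red', 'blue', 'green', 'red', 'blue', 'vol']
# drop_first=True drops the first category ('blue', alphabetically), which becomes the all-zero baseline
encoded_data = pd.get_dummies(data, prefix='color', drop_first=True)
print(encoded_data)


   color_green  color_red  color_vol
0        False       True      False
1        False      False      False
2         True      False      False
3        False       True      False
4        False      False      False
5        False      False       True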
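The multicollinearity that dummy encoding avoids can be checked numerically. The sketch below is an illustration rather than part of any library workflow. It builds a design matrix with an intercept column and compares its rank with its column count, once for full one-hot columns and once with `drop_first=True`. The helper `design_matrix` is hypothetical.

```python
import numpy as np
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'red', 'blue', 'vol'])

def design_matrix(dummies):
    """Prepend an intercept column of ones to the indicator columns."""
    X = dummies.to_numpy(dtype=float)
    return np.column_stack([np.ones(len(X)), X])

one_hot = design_matrix(pd.get_dummies(colors))                 # intercept + 4 indicators
dummy = design_matrix(pd.get_dummies(colors, drop_first=True))  # intercept + 3 indicators

# With full one-hot columns the indicators sum to the intercept column,
# so the matrix is rank deficient; dropping one column restores full rank.
rank_one_hot = np.linalg.matrix_rank(one_hot)
rank_dummy = np.linalg.matrix_rank(dummy)

print(one_hot.shape[1], rank_one_hot)  # 5 columns, rank 4
print(dummy.shape[1], rank_dummy)      # 4 columns, rank 4
```

A rank below the column count means a linear model's coefficients are not uniquely determined, which is the problem dropping one column solves.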


## Ordinal Encoding

Description: Assigns integers based on an ordered rank.

Limitations: Assumes a fixed and known order, which might not always be applicable.

In [16]:
data = ['small', 'medium', 'large', 'small', 'medium']
mapping = {'small': 1, 'medium': 2, 'large': 3}
encoded_data = [mapping[val] for val in data]
print(encoded_data)


[1, 2, 3, 1, 2]
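As an alternative to a hand-written dictionary, the order can be declared once with an ordered pandas `Categorical`. This is a minimal sketch under the assumption that the order small < medium < large is known in advance. Note that the resulting codes are zero-based, unlike the mapping above.

```python
import pandas as pd

data = ['small', 'medium', 'large', 'small', 'medium']

# The order is stated explicitly; without it pandas would sort the labels alphabetically.
sizes = pd.Categorical(data, categories=['small', 'medium', 'large'], ordered=True)

codes = sizes.codes.tolist()  # zero-based ranks
print(codes)  # [0, 1, 2, 0, 1]

# Ordered categoricals also support rank comparisons.
larger_than_small = (sizes > 'small').tolist()
print(larger_than_small)  # [False, True, True, False, True]
```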


## Frequency Encoding

Description: Encodes categories based on their frequency of occurrence.

Limitations: Categories with the same frequency are indistinguishable.

In [17]:
data = ['cat', 'dog', 'bird', 'cat', 'dog']
freq_encoding = pd.Series(data).value_counts(normalize=True)
encoded_data = [freq_encoding[val] for val in data]
print(encoded_data)


[0.4, 0.4, 0.2, 0.4, 0.4]
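In practice the frequencies are usually learned from training data and then applied to new data, which raises the question of categories that were never seen. The sketch below shows one possible approach with `Series.map`. The `test` frame, the `'fish'` value, and the 0.0 fallback are assumptions made for illustration.

```python
import pandas as pd

train = pd.DataFrame({'animal': ['cat', 'dog', 'bird', 'cat', 'dog']})
test = pd.DataFrame({'animal': ['dog', 'fish']})  # 'fish' never appears in train

# Learn relative frequencies from the training data only.
freq = train['animal'].value_counts(normalize=True)

train['animal_freq'] = train['animal'].map(freq)

# Unseen categories map to NaN; 0.0 is one possible fallback value.
test['animal_freq'] = test['animal'].map(freq).fillna(0.0)

print(train)
print(test)
```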


## Target Encoding


Description: Encodes categories based on the mean or median of the target variable within each category.

Limitations: Prone to overfitting, especially with smaller datasets.

In [21]:
# Install the library first if it is not available:
# pip install category_encoders

In [1]:
import category_encoders as ce

data = ['cat', 'dog', 'bird', 'cat', 'dog']
target = [1, 0, 1, 0, 1]
encoder = ce.TargetEncoder()
encoded_data = encoder.fit_transform(data, target)
print(encoded_data)


          0
0  0.585815
1  0.585815
2  0.652043
3  0.585815
4  0.585815
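One common way to reduce the overfitting risk is to shrink each category mean toward the global mean, with less shrinkage for categories that have more rows. The sketch below implements a simple additive-smoothing variant by hand in pandas. It is not the exact formula used by `category_encoders`, so its values differ from the output above, and the smoothing strength `m` is a hypothetical parameter chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'animal': ['cat', 'dog', 'bird', 'cat', 'dog'],
    'target': [1, 0, 1, 0, 1],
})

m = 2.0  # hypothetical smoothing strength: larger values pull harder toward the global mean
global_mean = df['target'].mean()

# Per-category mean and row count of the target.
stats = df.groupby('animal')['target'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean.
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['animal_encoded'] = df['animal'].map(smoothed)
print(df)
```

Here `bird`, which has a single row, moves from a raw mean of 1.0 to about 0.73. For model evaluation, the statistics should be computed on training folds only so that target information does not leak into validation data.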

