# Data Encoding (categorical -> numerical value)

1. Nominal/OHE encoding
2. Label and ordinal encoding
3. Target-guided ordinal encoding

## Nominal/OHE Encoding

One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, the form most machine learning algorithms require. Each category becomes a binary vector in which exactly one element is "hot" (1) and all other elements are "cold" (0). Because no vector is larger or smaller than another, the model does not assume any ordinal relationship between the categories.

1. Red: [1, 0, 0]
2. Blue: [0, 1, 0]
3. Green: [0, 0, 1]

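That is, a categorical feature with three categories (Red, Blue, Green) becomes three binary features, and a data point with the category "Blue" is represented as [0, 1, 0]. The column order shown here is only illustrative: scikit-learn's OneHotEncoder sorts categories alphabetically, so the code in the following cells produces the columns in the order Blue, Green, Red.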
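The mapping above can be sketched in plain Python before reaching for a library. This is a minimal illustration only: `one_hot` is a hypothetical helper written for this example, not a pandas or scikit-learn function, and the category order is chosen by hand to match the list above.

```python
# Hypothetical helper for illustration; not part of pandas or scikit-learn.
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Column order chosen by hand to match the Red/Blue/Green list above.
categories = ["Red", "Blue", "Green"]
vectors = {category: one_hot(category, categories) for category in categories}

print(vectors["Blue"])  # [0, 1, 0]
```

Each vector has as many elements as there are categories, and exactly one of them is 1.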

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

In [3]:
df.head()

|   | Color |
|---|-------|
| 0 | Red   |
| 1 | Blue  |
| 2 | Green |
| 3 | Blue  |
| 4 | Red   |


In [24]:
# Create a OneHotEncoder; sparse_output=False returns a dense array (requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)

In [25]:
encoded = encoder.fit_transform(df[['Color']])

In [26]:
encoder_df = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(['Color'])
)

In [27]:
encoder_df

|   | Color_Blue | Color_Green | Color_Red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 1.0        | 0.0         | 0.0       |
| 2 | 0.0        | 1.0         | 0.0       |
| 3 | 1.0        | 0.0         | 0.0       |
| 4 | 0.0        | 0.0         | 1.0       |


In [None]:
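# Transform new data with the already-fitted encoder.
# Note: an unseen value such as 'yellow' would raise an error because it was not seen during fit.
# .toarray() is not needed because sparse_output=False was used.
# Passing a DataFrame with the same column name avoids scikit-learn's feature-name warning.
encoder.transform(pd.DataFrame({'Color': ['Blue']}))
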
array([[1., 0., 0.]])

In [30]:
pd.concat([df, encoder_df], axis=1)

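|   | Color | Color_Blue | Color_Green | Color_Red |
|---|-------|------------|-------------|-----------|
| 0 | Red   | 0.0        | 0.0         | 1.0       |
| 1 | Blue  | 1.0        | 0.0         | 0.0       |
| 2 | Green | 0.0        | 1.0         | 0.0       |
| 3 | Blue  | 1.0        | 0.0         | 0.0       |
| 4 | Red   | 0.0        | 0.0         | 1.0       |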
