#### Data Encoding

Data encoding is the process of converting categorical variables into numerical values so that machine learning algorithms can work with them.

There are various encoding techniques:

1. Nominal/OHE Encoding

2. Label Encoding

3. Ordinal Encoding

4. Target Guided Ordinal Encoding

---

##### 1. Nominal/OHE Encoding

One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, which is more suitable for machine learning algorithms. Each category is represented as a binary vector in which each position corresponds to one unique category: the position for that category is 1 and every other position is 0.

For example, if we have a categorical variable "color" with three possible values (red, green, blue), we can represent it with OHE as follows:

1. Red: [1, 0, 0]

2. Green: [0, 1, 0]

3. Blue: [0, 0, 1]

In [26]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [27]:
# Create a simple dataframe
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red', 'green', 'red', 'blue', 'green']
})

In [28]:
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | blue  |
| 4 | red   |


In [29]:
# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [30]:
# Perform fit and transform
encoded_values = encoder.fit_transform(df[['color']]).toarray()

In [31]:
encoder_df = pd.DataFrame(encoded_values, columns=encoder.get_feature_names_out())

In [32]:
encoder_df

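|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 0.0 | 1.0 | 0.0 |
| 6 | 0.0 | 0.0 | 1.0 |
| 7 | 1.0 | 0.0 | 0.0 |
| 8 | 0.0 | 1.0 | 0.0 |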


In [33]:
OHE_df = pd.concat([df, encoder_df], axis=1)

In [34]:
OHE_df

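|   | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red   | 0.0 | 0.0 | 1.0 |
| 1 | blue  | 1.0 | 0.0 | 0.0 |
| 2 | green | 0.0 | 1.0 | 0.0 |
| 3 | blue  | 1.0 | 0.0 | 0.0 |
| 4 | red   | 0.0 | 0.0 | 1.0 |
| 5 | green | 0.0 | 1.0 | 0.0 |
| 6 | red   | 0.0 | 0.0 | 1.0 |
| 7 | blue  | 1.0 | 0.0 | 0.0 |
| 8 | green | 0.0 | 1.0 | 0.0 |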
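
As a rough alternative sketch, pandas' `get_dummies` can build the same indicator columns without scikit-learn. The names `dummies` and `ohe_df` are chosen here for illustration and are not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red', 'green', 'red', 'blue', 'green']
})

# One indicator column per category, named <prefix>_<category>.
# dtype=float mirrors the 0.0/1.0 values produced by OneHotEncoder above.
dummies = pd.get_dummies(df['color'], prefix='color', dtype=float)

# Attach the indicator columns to the original frame
ohe_df = pd.concat([df, dummies], axis=1)

print(list(dummies.columns))  # ['color_blue', 'color_green', 'color_red']
print(ohe_df.head())
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so the encoder object is usually the safer choice when the same transformation must later be applied to new data.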


---

##### 2. Label Encoding

Label encoding involves assigning a unique numerical label to each category in the variable. The labels are usually assigned in alphabetical order or based on the frequency of the categories.

For example, if we have a categorical variable "color" with three possible values (red, green, blue), we can represent it using label encoding as follows:

1. Red: 1

2. Green: 2

3. Blue: 3

In [36]:
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | blue  |
| 4 | red   |


In [37]:
from sklearn.preprocessing import LabelEncoder

lbl_encoder = LabelEncoder()

In [41]:
# LabelEncoder expects a 1-D input, so pass the column as a Series
lbl_encoder.fit_transform(df['color'])

array([2, 0, 1, 0, 2, 1, 2, 0, 1])

In [42]:
lbl_encoder.transform(['red'])

array([2])

In [43]:
lbl_encoder.transform(['green'])

array([1])

In [45]:
lbl_encoder.transform(['blue'])

array([0])

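The problem with label encoding is that it assigns integer codes in sorted order: here, 0 for blue, 1 for green, and 2 for red. In this scenario the model might treat red as greater than green, and green as greater than blue, simply because their codes are larger, even though colors have no natural order. Note also that scikit-learn's `LabelEncoder` is intended for encoding target labels rather than input features.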
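
The sorted-order behaviour described above can be inspected directly. The following is a minimal sketch that assumes the same color values as the earlier cells; the `colors`, `codes`, and `mapping` names are introduced here for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'blue', 'red', 'green', 'red', 'blue', 'green']

lbl_encoder = LabelEncoder()
codes = lbl_encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; the index is the assigned code
mapping = {label: code for code, label in enumerate(lbl_encoder.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# inverse_transform recovers the original labels from the codes
print(lbl_encoder.inverse_transform([2, 0, 1]))  # ['red' 'blue' 'green']
```

The codes reflect alphabetical position only, which is why they should not be read as a ranking of the colors.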

---

##### 3. Ordinal Encoding

Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order.

For example, if we have a categorical variable "education level" with four possible values (high_school, college, graduate, post_graduate), we can represent it using ordinal encoding as follows:

1. High school: 1

2. College: 2

3. Graduate: 3

4. Post-graduate: 4

In [47]:
from sklearn.preprocessing import OrdinalEncoder

In [48]:
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'large', 'small', 'large', 'medium', 'medium']
})

In [49]:
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

In [50]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [2.],
       [0.],
       [2.],
       [1.],
       [1.]])

In [51]:
encoder.transform([['small']])



array([[0.]])

In [52]:
encoder.transform([['medium']])



array([[1.]])

In [53]:
encoder.transform([['large']])



array([[2.]])
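
The education-level example from the start of this section can be sketched the same way. The sample rows in `edu_df` are made up for illustration, and the `+ 1` is an assumption added only to match the 1 to 4 numbering in the list above, since `OrdinalEncoder` itself starts counting at 0.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data, not from the original notebook
edu_df = pd.DataFrame({
    'education': ['college', 'high_school', 'post_graduate', 'graduate', 'college']
})

# The order of this list defines the ranking, lowest to highest
levels = ['high_school', 'college', 'graduate', 'post_graduate']
edu_encoder = OrdinalEncoder(categories=[levels])

# fit_transform returns a 2-D array of zero-based codes; flatten and shift to 1..4
edu_df['education_encoded'] = edu_encoder.fit_transform(edu_df[['education']]).ravel() + 1

print(edu_df)
```

Passing `categories` explicitly matters here: without it, the encoder would fall back to sorted order, which does not match the intended ranking of education levels.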

---

##### 4. Target Guided Ordinal Encoding

Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in our machine learning model.

In target guided ordinal encoding, we replace each category with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the encoded feature and the target variable, which can improve the predictive power of the model.

In [1]:
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris', 'London'],
    'price': [200, 150, 300, 250, 100, 320, 310]
})

In [2]:
df

|   | city     | price |
|---|----------|-------|
| 0 | New York | 200 |
| 1 | London   | 150 |
| 2 | Paris    | 300 |
| 3 | Tokyo    | 250 |
| 4 | New York | 100 |
| 5 | Paris    | 320 |
| 6 | London   | 310 |


In [4]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [7]:
mean_price

{'London': 230.0, 'New York': 150.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [5]:
df['city_encoded'] = df['city'].map(mean_price)

In [8]:
df[['price', 'city_encoded']]

|   | price | city_encoded |
|---|-------|--------------|
| 0 | 200 | 150.0 |
| 1 | 150 | 230.0 |
| 2 | 300 | 310.0 |
| 3 | 250 | 250.0 |
| 4 | 100 | 150.0 |
| 5 | 320 | 310.0 |
| 6 | 310 | 230.0 |
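
The cells above map each city to its mean price. One possible variant, sketched here as an assumption rather than as the notebook's own method, turns those means into ordinal ranks, so that the city with the lowest mean price gets 1 and the highest gets 4. The `rank_map` and `city_rank` names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris', 'London'],
    'price': [200, 150, 300, 250, 100, 320, 310]
})

# Mean target value per category
mean_price = df.groupby('city')['price'].mean()

# Order the categories by mean price and assign ranks 1, 2, 3, ...
ordered_cities = mean_price.sort_values().index
rank_map = {city: rank for rank, city in enumerate(ordered_cities, start=1)}
print(rank_map)  # {'New York': 1, 'London': 2, 'Tokyo': 3, 'Paris': 4}

df['city_rank'] = df['city'].map(rank_map)
print(df)
```

In practice the per-category statistics would normally be computed on the training split only and then applied to validation and test data, because the encoding uses the target and can otherwise leak it into the features.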
