## Data Encoding
- Nominal / One-Hot Encoding (OHE)
- Label and Ordinal Encoding
- Target Guided Ordinal Encoding

### Nominal/OHE (One-Hot Encoding)
One-hot encoding, also known as nominal encoding, is a technique for representing categorical data in a form better suited to machine learning algorithms. Each category is represented as a binary vector in which each bit corresponds to a unique category. For example, if we have a categorical variable 'color' with three possible values (red, green, blue), we can represent them in one-hot encoding as follows:

1. **Red**: [1, 0, 0]
2. **Green**: [0, 1, 0]
3. **Blue**: [0, 0, 1]
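As a quick illustration of the idea, the same vectors can be produced with `pandas.get_dummies`. This is a minimal sketch, not the approach the cells below use; the variable names `colors` and `one_hot` are made up for this example. Note that pandas orders the indicator columns alphabetically (blue, green, red), so the bit positions differ from the listing above.

```python
import pandas as pd

# Hypothetical sample data: one value per category
colors = pd.Series(['red', 'green', 'blue'], name='color')

# get_dummies creates one indicator column per category, in sorted order
one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```

Each row contains exactly one 1, in the column that matches its category.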

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [3]:
# Create a simple DataFrame
df = pd.DataFrame({
    'color':['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [4]:
df

| | color |
|---|---|
| 0 | red |
| 1 | blue |
| 2 | green |
| 3 | green |
| 4 | red |
| 5 | blue |


In [5]:
# Create an instance of the OneHotEncoder
encoder = OneHotEncoder()

In [6]:
# Fit and transform:
# fit() learns the categories present in the column,
# transform() converts those categories into a sparse matrix.
# fit_transform() does both in one call;
# toarray() then converts the sparse matrix to a dense NumPy array.
encoded_values = encoder.fit_transform(df[['color']]).toarray()

In [7]:
encoded_df = pd.DataFrame(encoded_values, columns=encoder.get_feature_names_out())

In [8]:
encoded_df.head()

| | color_blue | color_green | color_red |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |


In [9]:
# Encode new data with the already-fitted encoder
encoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])
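The call above works because 'blue' was seen during fitting. With the default `handle_unknown='error'`, a category that was not seen during `fit` raises an error at transform time. The sketch below shows one way to tolerate such values using `handle_unknown='ignore'`; the names `train`, `new`, and `safe_encoder` and the value 'purple' are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with three known categories
train = pd.DataFrame({'color': ['red', 'blue', 'green']})

# handle_unknown='ignore' encodes unseen categories as all zeros
safe_encoder = OneHotEncoder(handle_unknown='ignore')
safe_encoder.fit(train[['color']])

# 'purple' never appeared during fit
new = pd.DataFrame({'color': ['blue', 'purple']})
encoded_new = safe_encoder.transform(new).toarray()
print(encoded_new)
```

The known value 'blue' gets its usual vector, while the unseen 'purple' becomes a row of zeros.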

In [10]:
pd.concat([df, encoded_df], axis=1)

| | color | color_blue | color_green | color_red |
|---|---|---|---|---|
| 0 | red | 0.0 | 0.0 | 1.0 |
| 1 | blue | 1.0 | 0.0 | 0.0 |
| 2 | green | 0.0 | 1.0 | 0.0 |
| 3 | green | 0.0 | 1.0 | 0.0 |
| 4 | red | 0.0 | 0.0 | 1.0 |
| 5 | blue | 1.0 | 0.0 | 0.0 |


In [11]:
import seaborn as sns
tips = sns.load_dataset('tips')

In [12]:
tips.head()

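| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |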


In [13]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

encoded_values = encoder.fit_transform(tips[['time']]).toarray()

encoded_df = pd.DataFrame(encoded_values, columns=encoder.get_feature_names_out())

encoded_df

pd.concat([tips, encoded_df], axis=1)


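| | total_bill | tip | sex | smoker | day | time | size | time_Dinner | time_Lunch |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 1.0 | 0.0 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 1.0 | 0.0 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 1.0 | 0.0 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 1.0 | 0.0 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 1.0 | 0.0 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 1.0 | 0.0 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 1.0 | 0.0 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 1.0 | 0.0 |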


### Label Encoding
Label encoding and ordinal encoding are two techniques for encoding categorical data as numerical data.

Label encoding assigns a unique integer label to each category in the variable. The labels are usually assigned in alphabetical order or by category frequency; scikit-learn's `LabelEncoder` sorts the categories and numbers them from 0. For example, a categorical variable 'color' with three categories (red, blue, green) is label-encoded in alphabetical order as follows:
- Blue: 0
- Green: 1
- Red: 2

In [14]:
df.head()

| | color |
|---|---|
| 0 | red |
| 1 | blue |
| 2 | green |
| 3 | green |
| 4 | red |


In [15]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [16]:
# LabelEncoder expects 1-D input, so pass the column as a Series
lbl_encoder.fit_transform(df['color'])



array([2, 0, 1, 1, 2, 0])

In [17]:
lbl_encoder.transform(['red'])



array([2])

In [18]:
lbl_encoder.transform(['blue'])



array([0])

In [19]:
lbl_encoder.transform(['green'])



array([1])
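Rather than probing the encoder one value at a time, the learned mapping can be read from the fitted encoder's `classes_` attribute, and codes can be turned back into labels with `inverse_transform`. The following is a small self-contained sketch; the names `colors`, `le`, `mapping`, and `decoded` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same values as the 'color' column used above
colors = pd.Series(['red', 'blue', 'green', 'green', 'red', 'blue'])

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the sorted categories; the position of each is its code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = le.inverse_transform(codes)
print(decoded)
```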

### Ordinal Encoding
Ordinal encoding is used to encode categorical data that has an intrinsic order or ranking. In this technique, each category is assigned a numerical value based on its position in the order. For example, if we have a categorical variable 'education level' with four possible values (high school, college, graduate, post-graduate), we can represent it using ordinal encoding as follows:
- High School: 1
- College: 2
- Graduate: 3
- Post Graduate: 4
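The mapping listed above can be sketched directly with a dictionary and `Series.map`, which makes the "position in the order" idea explicit. The `people` frame and the names `education_order` and `rank` are hypothetical and only serve this illustration.

```python
import pandas as pd

# Categories listed from lowest to highest rank
education_order = ['High School', 'College', 'Graduate', 'Post Graduate']

# Number each category by its position, starting at 1
rank = {level: i for i, level in enumerate(education_order, start=1)}

# Hypothetical sample data
people = pd.DataFrame({
    'education': ['College', 'High School', 'Post Graduate', 'Graduate']
})
people['education_encoded'] = people['education'].map(rank)
print(people)
```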

In [20]:
from sklearn.preprocessing import OrdinalEncoder

In [21]:
# Create a simple DataFrame
df = pd.DataFrame({
    'size':['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [22]:
df

| | size |
|---|---|
| 0 | small |
| 1 | medium |
| 2 | large |
| 3 | medium |
| 4 | small |
| 5 | large |


In [24]:
# Create an OrdinalEncoder, passing the categories explicitly in rank order (smallest to largest)
ord_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ord_encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [26]:
ord_encoder.transform([['small']])



array([[0.]])
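As with one-hot encoding, a value outside the declared categories raises an error by default. One option, sketched below, is to set `handle_unknown='use_encoded_value'` together with `unknown_value`, so unseen values receive a sentinel code. The sentinel `-1` and the value 'extra large' are assumptions chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium']})

# Unseen categories are encoded as -1 instead of raising an error
tolerant_encoder = OrdinalEncoder(
    categories=[['small', 'medium', 'large']],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)
tolerant_encoder.fit(sizes[['size']])

# 'extra large' is not among the declared categories
new_sizes = pd.DataFrame({'size': ['large', 'extra large']})
result = tolerant_encoder.transform(new_sizes)
print(result)
```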

### Target Guided Ordinal Encoding
Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in our machine learning model.

In Target Guided Ordinal Encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the categorical variable and the target variable, which can improve the predictive power of our model.

In [28]:
## Create a DataFrame with feature and target
df = pd.DataFrame({
    'city':['New York', 'London', 'Paris', 'New York', 'Tokyo', 'Paris'],
    'price':[200, 150, 300, 250, 100, 320]
})
df

| | city | price |
|---|---|---|
| 0 | New York | 200 |
| 1 | London | 150 |
| 2 | Paris | 300 |
| 3 | New York | 250 |
| 4 | Tokyo | 100 |
| 5 | Paris | 320 |


In [29]:
# In target guided ordinal encoding, we assign each category in the city column the mean of the price column (the target)
mean_price = df.groupby('city')['price'].mean().to_dict()
mean_price 

{'London': 150.0, 'New York': 225.0, 'Paris': 310.0, 'Tokyo': 100.0}

In [30]:
df['city_encoded'] = df['city'].map(mean_price)

In [None]:
# These are the columns we use for model training: city has been converted to city_encoded by replacing each category with the mean price for that city
df[['price', 'city_encoded']]

| | price | city_encoded |
|---|---|---|
| 0 | 200 | 225.0 |
| 1 | 150 | 150.0 |
| 2 | 300 | 310.0 |
| 3 | 250 | 225.0 |
| 4 | 100 | 100.0 |
| 5 | 320 | 310.0 |
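The cells above substitute the category mean itself. A common variant, which matches the "ordinal" in the name more literally, ranks the categories by their target mean and assigns integer ranks instead. The sketch below applies that variant to the same city/price data; the names `df_rank`, `mean_by_city`, and `rank_map` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same data as the city/price example above
df_rank = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'New York', 'Tokyo', 'Paris'],
    'price': [200, 150, 300, 250, 100, 320],
})

# Mean target per category, sorted from lowest to highest
mean_by_city = df_rank.groupby('city')['price'].mean().sort_values()

# Assign rank 1 to the category with the lowest mean, 2 to the next, and so on
rank_map = {city: rank for rank, city in enumerate(mean_by_city.index, start=1)}

df_rank['city_rank'] = df_rank['city'].map(rank_map)
print(df_rank)
```

The ranks keep the ordering of the category means while discarding their magnitudes.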


In [34]:
dataset = sns.load_dataset('tips')
dataset

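| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |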


In [44]:
# Let's convert the time column to time_encoder using target guided ordinal encoding, replacing each category with the mean total_bill
mean_value = dataset.groupby('time', observed=False)['total_bill'].mean().to_dict()
mean_value

{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [46]:
dataset['time_encoder'] = dataset['time'].map(mean_value)

In [48]:
dataset

# From here on, time_encoder can be used as a feature in place of the time column

| | total_bill | tip | sex | smoker | day | time | size | time_encoder |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 20.797159 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 20.797159 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 20.797159 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 20.797159 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 20.797159 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 20.797159 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 20.797159 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 20.797159 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 20.797159 |
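Because the encoding is computed from the target, it is usually safer to learn the category means on training rows only and then apply that mapping to new rows, so that information from the rows being predicted does not leak into the feature. The sketch below illustrates this under stated assumptions: the tiny `train` and `test` frames are made-up data, and falling back to the overall training mean for an unseen category ('Brunch') is one reasonable choice among several.

```python
import pandas as pd

# Hypothetical training rows (feature and target)
train = pd.DataFrame({
    'time': ['Lunch', 'Dinner', 'Dinner', 'Lunch'],
    'total_bill': [10.0, 20.0, 30.0, 14.0],
})

# Hypothetical new rows; 'Brunch' never appears in the training data
test = pd.DataFrame({'time': ['Dinner', 'Brunch']})

# Learn the per-category means from the training rows only
means = train.groupby('time')['total_bill'].mean()
global_mean = train['total_bill'].mean()

# Apply the learned mapping; unseen categories fall back to the overall mean
test['time_encoded'] = test['time'].map(means).fillna(global_mean)
print(test)
```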
