# Data Encoding
1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

### 1. Nominal/One-Hot Encoding
One-hot encoding, also known as nominal encoding, represents categorical data numerically so that machine learning algorithms can work with it. Each category becomes a binary vector with one position per unique category: the position belonging to that category is 1 and every other position is 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

Red: [1, 0, 0]
Green: [0, 1, 0]
Blue: [0, 0, 1] 
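
To make the mapping above concrete, here is a minimal from-scratch sketch in plain Python. The helper name `one_hot` and the explicit `categories` list are assumptions made for illustration; they are not part of any library.

```python
def one_hot(values, categories):
    """Return one binary vector per value, with a 1 at that value's category index."""
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)   # start with all zeros
        row[index[value]] = 1         # switch on the position of this category
        vectors.append(row)
    return vectors

# Column order follows the order given here: red, green, blue
categories = ["red", "green", "blue"]
encoded_colors = one_hot(["red", "blue", "green"], categories)
print(encoded_colors)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

The column order is whatever order the categories are listed in. scikit-learn's `OneHotEncoder` sorts categories by default, which is why its output columns come out as blue, green, red.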

In [2]:
import numpy as np 
import pandas as pd 

In [3]:
df=pd.DataFrame({
    'color': ['red','blue','green','green','blue','red']
})

In [4]:
df

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,blue
5,red


In [5]:
from sklearn.preprocessing import OneHotEncoder

In [8]:
Encoder=OneHotEncoder()

In [13]:
encoded=Encoder.fit_transform(df[['color']]).toarray()

In [14]:
encoded

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])

In [22]:
df2=pd.DataFrame(encoded,
                 columns=Encoder.get_feature_names_out())

In [23]:
df2

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,1.0,0.0,0.0
5,0.0,0.0,1.0


In [29]:
df_n=pd.concat([df,df2],axis=1)
df_n

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,blue,1.0,0.0,0.0
5,red,0.0,0.0,1.0


In [31]:
# for any new data 
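# the encoder was fitted on a DataFrame, so pass new data with the same column name
Encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()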



array([[1., 0., 0.]])
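
New data may contain a category the encoder never saw during fitting, which raises an error with the default settings. The sketch below shows one way to handle that, using the `handle_unknown="ignore"` option of `OneHotEncoder`. The variable names (`train`, `safe_encoder`, `new_data`) and the unseen value `"purple"` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", "green", "green", "blue", "red"]})

# handle_unknown="ignore" encodes categories not seen during fit as all zeros
# instead of raising an error at transform time.
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(train[["color"]])

# "purple" is a hypothetical category that was not present in the training data
new_data = pd.DataFrame({"color": ["blue", "purple"]})
new_encoded = safe_encoder.transform(new_data[["color"]]).toarray()

print(safe_encoder.categories_[0])  # learned categories, in sorted order
print(new_encoded)                  # the row for "purple" is all zeros
```

Whether an all-zero row is acceptable depends on the model; some pipelines prefer to map unseen values to an explicit "other" category instead.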

# 2. Label Encoding
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer to each category in the variable. Labels are typically assigned in alphabetical order (as scikit-learn's `LabelEncoder` does) or by category frequency, and the integers imply no ranking among the categories. For example, a categorical variable "color" with three possible values (red, green, blue), labeled alphabetically, is encoded as follows:

Blue: 0
Green: 1
Red: 2

In [32]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,blue


In [33]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder=LabelEncoder()

In [34]:
# pass a 1-D Series; LabelEncoder expects a 1-D array of labels
lbl_encoder.fit_transform(df['color'])


array([2, 0, 1, 1, 0, 2])

In [35]:
lbl_encoder.transform(['red'])


array([2])

In [36]:
lbl_encoder.transform(['blue'])


array([0])

In [37]:
lbl_encoder.transform(['green'])


array([1])
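
The mapping that the encoder learned can be inspected directly, and the integer codes can be turned back into labels. The following is a minimal sketch; the names `colors`, `label_encoder`, `mapping`, and `decoded` are chosen for illustration. Note that the scikit-learn documentation describes `LabelEncoder` as intended for target labels (`y`); for input features, `OrdinalEncoder` or `OneHotEncoder` is usually the better fit.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "green", "blue", "red"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)

# classes_ holds the learned labels in sorted order; the position is the code
mapping = {label: code for code, label in enumerate(label_encoder.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# inverse_transform recovers the original labels from the integer codes
decoded = label_encoder.inverse_transform(codes)
print(list(decoded))
```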

# Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be ordinal encoded as follows:

High school: 1
College: 2
Graduate: 3
Post-graduate: 4

In [43]:
from sklearn.preprocessing import OrdinalEncoder

In [39]:
# create a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [40]:
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [41]:
# create an OrdinalEncoder with an explicit category order, then fit_transform
encoder=OrdinalEncoder(categories=[['small','medium','large']])

In [42]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [44]:
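# the encoder was fitted on a DataFrame, so pass new data with the same column name
encoder.transform(pd.DataFrame({'size': ['small']}))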



array([[0.]])
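
The same ordered mapping can also be sketched with pandas alone, using an ordered `Categorical`. This is an alternative to `OrdinalEncoder`, not a replacement for it; the names `sizes`, `size_order`, and `size_code` are assumptions made for illustration.

```python
import pandas as pd

sizes = pd.DataFrame({
    "size": ["small", "medium", "large", "medium", "small", "large"]
})

# The list order defines the ranking: small < medium < large
size_order = ["small", "medium", "large"]
ordered = pd.Categorical(sizes["size"], categories=size_order, ordered=True)

# .codes gives each value's position in size_order
# (values not listed in size_order would get the code -1)
sizes["size_code"] = ordered.codes
print(sizes)
```

Spelling out the order explicitly matters in both approaches: without it, categories are sorted alphabetically, which would rank "large" below "medium" and "small".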

# 3. Target Guided Ordinal Encoding
Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we still want to use it as a feature in a machine learning model.

In Target Guided Ordinal Encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the categorical variable and the target variable, which can improve the predictive power of our model.

In [45]:
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [46]:
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [49]:
# mean price per city, as a {city: mean_price} dictionary
dict_y=df.groupby('city')['price'].mean().to_dict()

In [51]:
# replace each city with the mean price observed for that city
df['city_encoded']=df['city'].map(dict_y)

In [52]:
df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0
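
The cells above replace each city with its mean price directly. A common variant, sketched below, goes one step further and converts those means into integer ranks, so that the encoded value is strictly ordinal. The names `houses`, `city_means`, `city_rank`, and the choice to start ranks at 1 are assumptions made for illustration.

```python
import pandas as pd

houses = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Mean target value per category, sorted from lowest to highest
city_means = houses.groupby("city")["price"].mean().sort_values()

# Assign rank 1 to the city with the lowest mean price, 2 to the next, and so on
city_rank = {city: rank for rank, city in enumerate(city_means.index, start=1)}
print(city_rank)  # {'London': 1, 'New York': 2, 'Tokyo': 3, 'Paris': 4}

houses["city_rank"] = houses["city"].map(city_rank)
print(houses)
```

In practice the means (or ranks) should be computed on the training split only and then mapped onto validation and test data; otherwise the encoding leaks target information into the features.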
