### Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

#### Nominal / OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data in a form more suitable for machine learning algorithms. Each category is represented as a binary vector with one position per unique category: the position belonging to that category is 1 and every other position is 0. For example, a categorical variable 'color' with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
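
To make the mapping above concrete, here is a minimal sketch in plain Python. The helper `one_hot` and the fixed category order are assumptions made for illustration; they are not part of any library.

```python
# Hypothetical helper for illustration only.
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

# The category order is an assumption; it decides which position each color gets.
categories = ['red', 'green', 'blue']

vectors = {color: one_hot(color, categories) for color in categories}
print(vectors)
```

Each vector has exactly one 1, and its position depends only on the chosen category order.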

In [3]:
import pandas as pd 
from sklearn.preprocessing import OneHotEncoder

In [4]:
# Create a simple dataframe
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [3]:
df.head()

,color
0,red
1,blue
2,green
3,green
4,red


In [4]:
# create an instance of the OneHotEncoder class
encoder = OneHotEncoder()

In [9]:
# Perform Fit & Transform
encoded = encoder.fit_transform(df[['color']]).toarray()

In [10]:
# Wrap the encoded array in a DataFrame, using the generated feature names as column headers
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [11]:
encoder_df

,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [12]:
# Encode a single new value with the already fitted encoder
encoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])

In [13]:
pd.concat([df,encoder_df],axis=1)

,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0
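
For quick exploration, a similar table can be produced with pandas alone. This is a sketch of an alternative, not the approach used above; the names `colors` and `dummies` are assumptions for the example.

```python
import pandas as pd

colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red', 'blue']})

# get_dummies replaces the listed columns with one indicator column per category;
# dtype=int gives 0/1 integers instead of booleans
dummies = pd.get_dummies(colors, columns=['color'], dtype=int)
print(dummies)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so it is less convenient when the same encoding has to be applied to new data later.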


In [14]:
import seaborn as sns
sns.load_dataset('tips')

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


#### Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer label to each category of the variable. The labels are usually assigned in alphabetical order or based on the frequency of the categories. For example, a categorical variable 'color' with three possible values (red, green, blue) can be label encoded as follows:
1. Red: 1
2. Green: 2
3. Blue: 3

The exact numbers depend on the tool. scikit-learn's LabelEncoder numbers the categories from 0 in sorted order, so for this variable it produces blue = 0, green = 1, red = 2.

In [5]:
df.head()

,color
0,red
1,blue
2,green
3,green
4,red


In [6]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [7]:
lbl_encoder.fit_transform(df['color'])

array([2, 0, 1, 1, 2, 0])

In [8]:
lbl_encoder.transform(['red'])

array([2])
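
To see which integer belongs to which category, and to map codes back to labels, the fitted encoder can be inspected. The following is a small sketch using the `classes_` attribute and `inverse_transform`; the variable names are assumptions for the example.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['red', 'blue', 'green', 'green', 'red', 'blue']

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the categories in sorted order; the position is the assigned code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform converts the integer codes back to the original labels
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Because the codes follow sorted order, they carry no meaning about rank; a model may still read them as ordered numbers.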

##### Ordinal Encoding

Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be ordinal encoded as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
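
The mapping above can be sketched with an ordered pandas categorical. The sample values in `education` are made up for illustration, and the `+ 1` only serves to match the 1-based numbering in the list (pandas codes start at 0).

```python
import pandas as pd

# Levels listed from lowest to highest; this order defines the encoding
levels = ['high school', 'college', 'graduate', 'post-graduate']

# Made-up sample data for illustration
education = pd.Series(['college', 'high school', 'post-graduate', 'graduate', 'college'])

# An ordered categorical stores each value as the position of its level
ordered = pd.Categorical(education, categories=levels, ordered=True)

# codes are 0-based, so add 1 to get the 1..4 numbering used above
education_codes = pd.Series(ordered.codes) + 1
print(education_codes.tolist())
```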

In [9]:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

In [10]:
df = pd.DataFrame({
    'size': ['XL', 'L', 'M', 'M', 'L', 'XL', 'S']
})

In [11]:
df

,size
0,XL
1,L
2,M
3,M
4,L
5,XL
6,S


In [12]:
# Create an ordinal encoder; categories are listed from smallest to largest, so S=0, M=1, L=2, XL=3
encoder = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])

In [13]:
encoder.fit_transform(df[['size']])

array([[3.],
       [2.],
       [1.],
       [1.],
       [2.],
       [3.],
       [0.]])

In [14]:
encoder.transform([['XL']])



array([[3.]])
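
If new data can contain a size that is not in the category list, `OrdinalEncoder` can be told to map it to a sentinel value. This is a sketch under assumptions: the value 'XXL', the sentinel `-1`, and the names `sizes`, `size_encoder`, and `new_sizes` are chosen for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['XL', 'L', 'M', 'M', 'L', 'XL', 'S']})

# Unknown categories are encoded as unknown_value instead of raising an error
size_encoder = OrdinalEncoder(
    categories=[['S', 'M', 'L', 'XL']],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)
size_encoder.fit(sizes[['size']])

# 'XXL' is a made-up value that is not in the category list
new_sizes = pd.DataFrame({'size': ['M', 'XXL']})
encoded_sizes = size_encoder.transform(new_sizes)
print(encoded_sizes)
```

The sentinel must differ from the regular codes (0 to 3 here) so that unknown values stay distinguishable.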

#### Target Guided Ordinal Encoding
Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we still want to use it as a feature in a machine learning model.

In this encoding, each category is replaced with a numerical value derived from the mean or median of the target variable for that category. The encoded values then increase with the typical target value of each category, which creates a monotonic relationship between the encoded feature and the target and can improve the predictive power of the model.

In [16]:
import pandas as pd
df = pd.DataFrame({
    'city': ['London', 'Paris', 'Berlin', 'New York', 'London', 'Berlin'],
    'price': [100, 200, 300, 400, 500, 600]
})

In [17]:
df

,city,price
0,London,100
1,Paris,200
2,Berlin,300
3,New York,400
4,London,500
5,Berlin,600


In [20]:
# Mean of the target (price) for each city
mean_price = df.groupby('city')['price'].mean()

In [21]:
mean_price

city
Berlin      450.0
London      300.0
New York    400.0
Paris       200.0
Name: price, dtype: float64

In [22]:
# Replace each city with the mean price of that city
df['city_encoded'] = df['city'].map(mean_price)

In [27]:
df[['city','city_encoded']]

,city,city_encoded
0,London,300.0
1,Paris,200.0
2,Berlin,450.0
3,New York,400.0
4,London,300.0
5,Berlin,450.0
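
The example above uses the mean itself as the encoded value. A common variant, sketched here as an assumption about how the "ordinal" part might be done, ranks the categories by their mean target and uses the rank instead. The names `data`, `city_means`, `city_rank`, and the column `city_rank` are introduced for this example.

```python
import pandas as pd

data = pd.DataFrame({
    'city': ['London', 'Paris', 'Berlin', 'New York', 'London', 'Berlin'],
    'price': [100, 200, 300, 400, 500, 600],
})

# Mean price per city, sorted from lowest to highest
city_means = data.groupby('city')['price'].mean().sort_values()

# Assign rank 1 to the city with the lowest mean price, 2 to the next, and so on
city_rank = {city: rank for rank, city in enumerate(city_means.index, start=1)}

data['city_rank'] = data['city'].map(city_rank)
print(data[['city', 'city_rank']])
```

Since the encoding is derived from the target, the per-category statistics would typically be computed on the training data only and then applied to validation or test data, to avoid leaking target information.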


In [29]:
import seaborn as sns
sns.load_dataset('tips')

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2
