# <font color = 'orange'> Data Encoding

### Data encoding is the process of converting categorical data into numerical data so that mathematical operations can be performed on it, because most machine learning models cannot work with categorical values directly.  

There are 4 types of data encoding:
1. Nominal or OHE(OneHotEncoder) Encoding. 
2. Label Encoding.
3. Ordinal Encoding.
4. Target Guided Ordinal Encoding. 

---

## <font color='blue'> 1.Nominal or OHE(OneHotEncoder) Encoding
One hot encoding, also known as nominal encoding, is a technique used to represent categorical data as numerical data, which is more suitable for machine learning algorithms.   
    
In this technique, each category is represented as a **binary vector** where each bit corresponds to a unique category.
    
**For example**, if we have a categorical variable "color" with three possible values (red, green, blue),  
    we can represent it using one hot encoding as follows:
1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
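
The mapping above can be sketched in plain Python, without any library. This is only an illustration of the idea: the helper name `one_hot` and the fixed category order are assumptions made for this example, and scikit-learn's `OneHotEncoder` orders the categories alphabetically instead.

```python
# Illustrative sketch: one-hot encode a value against a fixed category list.
# The order below is an assumption chosen to match the list above.
categories = ['red', 'green', 'blue']

def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# Encode every category once to reproduce the table above
encoded_colors = {category: one_hot(category, categories) for category in categories}
print(encoded_colors)
```

Each vector contains exactly one `1`, which is where the name "one hot" comes from.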

In [1]:
import pandas as pd 

df = pd.DataFrame({'color':['red', 'blue', 'green', 'green', 'red', 'blue']})

df

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red
5,blue


In [2]:
from sklearn.preprocessing import OneHotEncoder

# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [3]:
# Fit the encoder and transform the column in one step

encoder.fit_transform(df[['color']])
# The result is a sparse matrix

<6x3 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [4]:
encoded = encoder.fit_transform(df[['color']]).toarray()
encoded

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [5]:
encoded_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

encoded_df

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [6]:
# For any new data, the fitted encoder can transform it into the corresponding binary vector
encoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])

In [7]:
final_df = pd.concat([df,encoded_df],axis=1)
final_df.set_index('color',inplace=True)
final_df

Unnamed: 0_level_0,color_blue,color_green,color_red
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
red,0.0,0.0,1.0
blue,1.0,0.0,0.0
green,0.0,1.0,0.0
green,0.0,1.0,0.0
red,0.0,0.0,1.0
blue,1.0,0.0,0.0
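
As a possible alternative for quick exploration, pandas can produce the same kind of columns directly with `pd.get_dummies`. This is a minimal sketch, not a replacement for the encoder above; the variable names `df_colors` and `dummies` are chosen only for this example.

```python
import pandas as pd

df_colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red', 'blue']})

# get_dummies creates one 0/1 column per category;
# dtype=int requests integers instead of booleans
dummies = pd.get_dummies(df_colors['color'], prefix='color', dtype=int)
print(dummies)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so new data containing different categories may produce different columns.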


---

### <font color = 'green'> Internal assignment:  
Convert all the categorical features of the **tips** dataset into their corresponding binary vectors using OneHotEncoder.

In [8]:
import seaborn as sns

df = sns.load_dataset('tips')
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [9]:
# The categorical features are: sex, smoker, day and time
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

In [10]:
encoded = encoder.fit_transform(df[['sex','smoker','day','time']]).toarray()

encoded

array([[1., 0., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       ...,
       [0., 1., 0., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [1., 0., 1., ..., 1., 1., 0.]])

In [11]:
import pandas as pd 

encoded_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

encoded_df

Unnamed: 0,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...
239,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
240,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
241,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
242,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [12]:
final_df = pd.concat([df,encoded_df],axis=1)

final_df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,16.99,1.01,Female,No,Sun,Dinner,2,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,21.01,3.50,Male,No,Sun,Dinner,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
240,27.18,2.00,Female,Yes,Sat,Dinner,2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
242,17.82,1.75,Male,No,Sat,Dinner,2,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [13]:
# for any new data 

encoder.transform([['Female','Yes','Thur','Lunch']]).toarray()



array([[1., 0., 0., 1., 0., 0., 0., 1., 0., 1.]])

---

## <font color='blue'> 2.Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.  
  
Label encoding involves assigning a **unique numerical label to each category** in the variable.  
    
**Note**: scikit-learn's `LabelEncoder` assigns labels in **sorted (alphabetical) order**, starting from 0. Other implementations may instead order labels by **the frequency of the categories**.  
    
**For example**, if we have a categorical variable "color" with three possible values (red, green, blue),   
    we can represent it using label encoding as follows:

1. Red: 1
2. Green: 2
3. Blue: 3
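
The alphabetical-order rule can be sketched in plain Python. This is an illustration under the assumption that labels start at 0 and follow sorted order, which is how scikit-learn's `LabelEncoder` behaves; the 1, 2, 3 numbering in the list above is just one possible labelling. The names `classes` and `label_of` are made up for this example.

```python
colors = ['red', 'blue', 'green', 'green', 'red', 'blue']

# Sort the unique categories, then number them starting from 0
classes = sorted(set(colors))
label_of = {category: index for index, category in enumerate(classes)}

# Replace each value with its label
labels = [label_of[color] for color in colors]
print(label_of)
print(labels)
```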

In [14]:
import pandas as pd

df = pd.DataFrame({'color':['red', 'blue', 'green', 'green', 'red', 'blue']})

df

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red
5,blue


In [15]:
from sklearn.preprocessing import LabelEncoder

lbl_encoder = LabelEncoder()

In [16]:
# LabelEncoder expects a 1-D array, so pass the Series rather than a one-column DataFrame
encoded = lbl_encoder.fit_transform(df['color'])

encoded



array([2, 0, 1, 1, 2, 0])

In [17]:
# for any new data
# we are giving multiple values for a single feature, as a flat (1-D) list
lbl_encoder.transform(['red', 'blue', 'green'])



array([2, 0, 1])

#### Here, color is nominal data with no inherent rank, so label encoding is not a good choice: the model may interpret a larger value as better and a smaller value as worse (for example, red = 2 as good and blue = 0 as bad), an ordering that has no real meaning for this feature.

---

## <font color='blue'> 3.Ordinal Encoding
It is used to **encode categorical data that have an intrinsic order or ranking**.  
    
In this technique, each category is assigned a numerical value based on its **position in the order**.  
    
**For example**, if we have a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate),  
    we can represent it using ordinal encoding as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4  

Post-graduate gets the highest value because it is the highest degree in the given data.
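
The idea of "position in the order" can be sketched with a plain dictionary. This is only an illustration; the names `education_order` and `rank_of` are assumptions for this example, and ranks start at 1 to match the list above, whereas scikit-learn's `OrdinalEncoder` starts at 0.

```python
# Categories listed from lowest to highest rank (order taken from the example above)
education_order = ['high school', 'college', 'graduate', 'post-graduate']

# Map each category to its 1-based position in the order
rank_of = {level: rank for rank, level in enumerate(education_order, start=1)}

samples = ['college', 'post-graduate', 'high school']
encoded_samples = [rank_of[level] for level in samples]
print(encoded_samples)  # [2, 4, 1]
```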

In [18]:
import pandas as pd 

df = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small', 'large']})

In [19]:
from sklearn.preprocessing import OrdinalEncoder

# Create an instance of OrdinalEncoder with the category order given explicitly (lowest to highest)
ord_encoder = OrdinalEncoder(categories = [['small','medium','large']])

In [20]:
ord_encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [21]:
# for any new data 

ord_encoder.transform([['small']])



array([[0.]])

In [22]:
# giving multiple values for a single feature 
# for any new data transformation can be done
ord_encoder.transform([['large'],['medium'],['small']])



array([[2.],
       [1.],
       [0.]])
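
A similar result can be obtained in pandas alone with an ordered categorical. This is a minimal sketch of an alternative, not the method used above; the variable names are chosen for this example.

```python
import pandas as pd

sizes = pd.Series(['small', 'medium', 'large', 'medium', 'small', 'large'])

# Declare the order explicitly; .codes gives each value's position in that order
size_cat = pd.Categorical(sizes, categories=['small', 'medium', 'large'], ordered=True)
size_codes = size_cat.codes
print(list(size_codes))
```

Values that are not in the declared categories receive the code `-1`, so it is worth checking for that value before training.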

---

### <font color = 'green'> Internal assignment:  
Convert the **day** categorical feature of the **tips** dataset into its corresponding rank using OrdinalEncoder.

In [23]:
import seaborn as sns

df = sns.load_dataset('tips')

df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [24]:
df['day'].value_counts()

Sat     87
Sun     76
Thur    62
Fri     19
Name: day, dtype: int64

In [25]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(categories=[['Thur','Fri','Sat','Sun']])

In [26]:
# We give the highest priority to Sunday, then Saturday, then Friday, then Thursday
encoded = encoder.fit_transform(df[['day']])

encoded

array([[3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [3.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],
       [2.],

In [27]:
encoded_df = pd.DataFrame(encoded,columns =['day_encoded'])

encoded_df

Unnamed: 0,day_encoded
0,3.0
1,3.0
2,3.0
3,3.0
4,3.0
...,...
239,2.0
240,2.0
241,2.0
242,2.0


In [28]:
import pandas as pd

final_df = pd.concat([df['day'],encoded_df],axis=1)
final_df

Unnamed: 0,day,day_encoded
0,Sun,3.0
1,Sun,3.0
2,Sun,3.0
3,Sun,3.0
4,Sun,3.0
...,...,...
239,Sat,2.0
240,Sat,2.0
241,Sat,2.0
242,Sat,2.0


In [29]:
# for any new data 

encoder.transform([['Sat']])



array([[2.]])

---

## <font color='blue'> 4.Target Guided Ordinal Encoding 
It is a technique used to encode categorical variables **based on their relationship with the target variable**.  
    
This encoding technique is useful when we have a categorical variable with a **large number of unique categories**, and we want to use this variable as a feature in our machine learning model.

In Target Guided Ordinal Encoding, we **replace** each **category in the categorical variable** with a **numerical value based on the mean or median of the target variable** for that category.  
This creates a **monotonic relationship** between the categorical variable and the target variable, which can **improve the predictive power** of our model.

In [30]:
import pandas as pd 

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


#### We will convert the categorical variable city into numerical values based on the target variable (price).

In [31]:
mean_price = df.groupby('city')['price'].mean().to_dict()
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [32]:
df['city_encoded'] = df['city'].map(mean_price)

df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0


In [33]:
# So only city_encoded (the feature) and price (the target) are needed for model training

df[['price','city_encoded']]

Unnamed: 0,price,city_encoded
0,200,190.0
1,150,150.0
2,300,310.0
3,250,250.0
4,180,190.0
5,320,310.0
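
A common variant of this technique keeps only the ordering implied by the means: categories are sorted by their mean target value and replaced with integer ranks. The sketch below shows that variant on the same data; the names `df_city`, `mean_by_city`, `rank_by_city` and `city_rank` are assumptions for this example.

```python
import pandas as pd

df_city = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

# Mean price per city, sorted from lowest to highest
mean_by_city = df_city.groupby('city')['price'].mean().sort_values()

# Rank 0 goes to the city with the lowest mean price
rank_by_city = {city: rank for rank, city in enumerate(mean_by_city.index)}

df_city['city_rank'] = df_city['city'].map(rank_by_city)
print(df_city)
```

With ranks, the encoded feature depends only on the order of the category means, not on their exact values.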


---

### <font color = 'green'> Internal assignment:  
Convert the **time** categorical feature of the **tips** dataset into numbers using target guided ordinal encoding with respect to total bill.

In [34]:
import seaborn as sns

df = sns.load_dataset('tips')

df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [35]:
df['time'].value_counts()

Dinner    176
Lunch      68
Name: time, dtype: int64

In [36]:
import pandas as pd

# group by time then take mean of total bill of each group 
# convert the value into dictionary 
# encode the time categorical value
mean_totalbill = df.groupby('time')['total_bill'].mean().to_dict()
mean_totalbill

{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [37]:
# now we will encode the values 

df['time_encoded'] = df['time'].map(mean_totalbill)

df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,time_encoded
0,16.99,1.01,Female,No,Sun,Dinner,2,20.797159
1,10.34,1.66,Male,No,Sun,Dinner,3,20.797159
2,21.01,3.50,Male,No,Sun,Dinner,3,20.797159
3,23.68,3.31,Male,No,Sun,Dinner,2,20.797159
4,24.59,3.61,Female,No,Sun,Dinner,4,20.797159
...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,20.797159
240,27.18,2.00,Female,Yes,Sat,Dinner,2,20.797159
241,22.67,2.00,Male,Yes,Sat,Dinner,2,20.797159
242,17.82,1.75,Male,No,Sat,Dinner,2,20.797159
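
When this mapping is applied to new data, a category that was not present during fitting has no entry in the dictionary, and `map` returns `NaN` for it. One possible way to handle this, sketched below, is to fall back to the overall mean. The toy values and the unseen category `'Brunch'` are made up for this illustration and do not come from the tips dataset.

```python
import pandas as pd

# Toy training data (made-up values, for illustration only)
train = pd.DataFrame({
    'time': ['Lunch', 'Dinner', 'Dinner', 'Lunch'],
    'total_bill': [10.0, 20.0, 30.0, 14.0],
})

mean_bill = train.groupby('time')['total_bill'].mean().to_dict()
global_mean = train['total_bill'].mean()

# 'Brunch' was never seen, so map() yields NaN for it; fill with the overall mean
new_times = pd.Series(['Dinner', 'Brunch'])
encoded_new = new_times.map(mean_bill).fillna(global_mean)
print(encoded_new.tolist())  # [25.0, 18.5]
```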


---