#### ✅ What is Data Encoding?
* It’s the process of converting categorical features (like "Male"/"Female" or "Red"/"Green"/"Blue") into numeric values.
---
#### 🎯 Why?
* Most ML algorithms (Linear Regression, SVM, etc.) operate on numeric arrays and cannot take strings as input.
---

#### Types of Data Encoding 😎
1. Nominal / One-Hot Encoding (OHE)
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

#### ONE-HOT ENCODING 🔥🔥
🧠 Concept Recap:
One-Hot Encoding creates a new column for each unique category in your selected column and assigns:

* 1 where the category is present

* 0 otherwise
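
The 1/0 rule can be written out by hand before reaching for a library. This is a minimal sketch in plain pandas, assuming the same six-value `color` column used in this notebook; the names `categories` and `one_hot` are illustrative only.

```python
import pandas as pd

# Same toy values as the notebook's color example
colors = pd.Series(['red', 'blue', 'green', 'green', 'red', 'blue'], name='color')

# One new column per unique category, in sorted order
categories = sorted(colors.unique())

# 1 where the row holds that category, 0 otherwise
one_hot = pd.DataFrame(
    {f"color_{category}": (colors == category).astype(int) for category in categories}
)

print(one_hot)
```

Every row has exactly one 1, which is where the name "one-hot" comes from.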

In [3]:
import pandas as pd
colors=pd.DataFrame({'color':['red','blue','green','green','red','blue']})
colors


Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red
5,blue


In [7]:
from sklearn.preprocessing import OneHotEncoder
encoder=OneHotEncoder()
encoded=encoder.fit_transform(colors).toarray()

In [8]:
df_encoded=pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
df_encoded

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [10]:
final_df=pd.concat([colors,df_encoded], axis=1)
final_df

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0
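
A fitted encoder only knows the categories it saw during `fit`. The sketch below assumes a hypothetical unseen value, `'purple'`, and uses scikit-learn's `handle_unknown='ignore'` option so that the unseen value becomes an all-zero row instead of raising an error. The frames `train` and `new` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'color': ['red', 'blue', 'green']})
# 'purple' is a hypothetical category that was absent when fitting
new = pd.DataFrame({'color': ['green', 'purple']})

# With the default handle_unknown='error', transform() would raise on 'purple'
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train)

encoded_new = encoder.transform(new).toarray()

# Learned column order is blue, green, red
print(encoder.categories_[0])
print(encoded_new)
```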


In [None]:
import seaborn as sns
data=sns.load_dataset('tips')
encoded_data=encoder.fit_transform(data[['day']]).toarray()
encoded_df=pd.DataFrame(encoded_data,columns=encoder.get_feature_names_out())
date_df=pd.DataFrame(data['day'])
df=pd.concat([date_df,encoded_df],axis=1)
df

Unnamed: 0,day,day_Fri,day_Sat,day_Sun,day_Thur
0,Sun,0.0,0.0,1.0,0.0
1,Sun,0.0,0.0,1.0,0.0
2,Sun,0.0,0.0,1.0,0.0
3,Sun,0.0,0.0,1.0,0.0
4,Sun,0.0,0.0,1.0,0.0
...,...,...,...,...,...
239,Sat,0.0,1.0,0.0,0.0
240,Sat,0.0,1.0,0.0,0.0
241,Sat,0.0,1.0,0.0,0.0
242,Sat,0.0,1.0,0.0,0.0
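
One thing to watch with `pd.concat(..., axis=1)`: it lines rows up by index label, not by position. The `tips` example works because both frames carry the default 0..n-1 index. The sketch below uses a made-up three-row frame with a non-default index (as you would get after filtering rows) to show what goes wrong and one possible fix; the encoded values are hard-coded for brevity.

```python
import pandas as pd

# Hypothetical frame whose index is no longer 0..n-1
data = pd.DataFrame({'day': ['Sun', 'Sat', 'Sun']}, index=[10, 11, 12])
values = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
columns = ['day_Sat', 'day_Sun']

# The encoded frame gets a fresh 0..2 index, so no labels match
misaligned = pd.concat([data, pd.DataFrame(values, columns=columns)], axis=1)

# Reusing the original index keeps each encoded row next to its source row
aligned = pd.concat(
    [data, pd.DataFrame(values, columns=columns, index=data.index)], axis=1
)

print(misaligned.shape)  # six rows, padded with NaN
print(aligned.shape)     # three rows, no NaN
```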


#### 🔤 What is Label Encoding?
* It converts categorical values into integer labels.

In [None]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
encoded=encoder.fit_transform(data['day'])
encoded_date=pd.DataFrame({'Original':data['day'], 'Encoded':encoded})
encoded_date

Unnamed: 0,Original,Encoded
0,Sun,2
1,Sun,2
2,Sun,2
3,Sun,2
4,Sun,2
...,...,...
239,Sat,1
240,Sat,1
241,Sat,1
242,Sat,1


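#### 🏷️ Label Encoding
* Converts categorical text data → integers.

* Example: ['sun', 'mon', 'tue'] → [1, 0, 2], because scikit-learn's `LabelEncoder` numbers the classes in sorted order (mon=0, sun=1, tue=2).
---
* No inherent order is intended: the integers follow the sorted order of the labels, not any meaning in the data.

* ⚠️ Risk: Algorithms might interpret the numeric labels as ordinal or ordered (which they’re not!).

* `LabelEncoder` is meant for the target column (y). For feature columns, `OrdinalEncoder` does the same job on 2D input and is used in the cells that follow.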
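
How the integers get assigned can be checked directly on the fitted encoder. This is a small sketch using the three-day example; `classes_` and `inverse_transform` are part of scikit-learn's `LabelEncoder`, while the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

days = ['sun', 'mon', 'tue']

encoder = LabelEncoder()
codes = encoder.fit_transform(days)

# classes_ holds the labels in sorted order; a label's position is its code
print(list(encoder.classes_))

# Codes for the input, in the original row order
print(list(codes))

# inverse_transform maps the integers back to the original labels
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

Since the codes come from alphabetical sorting, `tue` gets a larger number than `mon` only because of spelling, which is exactly why a model should not read them as a ranking.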

In [2]:
import pandas as pd
df=pd.DataFrame({"Breathing":['water','air','air','water','fire','wind','fire', 'air','wind']})
df

Unnamed: 0,Breathing
0,water
1,air
2,air
3,water
4,fire
5,wind
6,fire
7,air
8,wind


In [3]:
from sklearn.preprocessing import OrdinalEncoder
encoder=OrdinalEncoder()
encoded=encoder.fit_transform(df)
encoded

array([[2.],
       [0.],
       [0.],
       [2.],
       [1.],
       [3.],
       [1.],
       [0.],
       [3.]])

In [4]:
encoded_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())
encoded_df

Unnamed: 0,Breathing
0,2.0
1,0.0
2,0.0
3,2.0
4,1.0
5,3.0
6,1.0
7,0.0
8,3.0


In [6]:
final=pd.concat([df,encoded_df],axis=1)
final

Unnamed: 0,Breathing,Breathing.1
0,water,2.0
1,air,0.0
2,air,0.0
3,water,2.0
4,fire,1.0
5,wind,3.0
6,fire,1.0
7,air,0.0
8,wind,3.0
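
Without arguments, `OrdinalEncoder` numbers categories in sorted order, just like the output above (air=0, fire=1, water=2, wind=3). When a feature does have a natural order, the `categories` parameter lets you state that order yourself. The sketch below assumes a hypothetical `size` column that is not part of this notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature with a real order: small < medium < large
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# One list per column, written from lowest to highest
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoded = encoder.fit_transform(sizes)

print(encoded.ravel())
```

With the default settings, alphabetical sorting would have placed `large` before `medium`, which is not the order the data means.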


#### 🎯 What is Target Guided Ordinal Encoding?
* It assigns ordinal labels to a categorical feature by ranking the categories on a statistic of the target variable (here, the mean of the target within each category).

* "Don't just number them… rank them based on their impact on the target."
---

In [8]:
df=pd.DataFrame({"Cities":['Mumbai', 'Pune', 'Delhi', 'Boston', 'Pune', 'Mumbai', 'Boston', 'Delhi', 'Chennai', 'Boston'], 'Sales':[120000, 170000, 180000, 190000, 10000, 200000, 200001, 340500, 90000, 800000]})
df

Unnamed: 0,Cities,Sales
0,Mumbai,120000
1,Pune,170000
2,Delhi,180000
3,Boston,190000
4,Pune,10000
5,Mumbai,200000
6,Boston,200001
7,Delhi,340500
8,Chennai,90000
9,Boston,800000


In [11]:
mean_sales=df.groupby('Cities')['Sales'].mean().sort_values()

In [12]:
mean_sales.index

Index(['Chennai', 'Pune', 'Mumbai', 'Delhi', 'Boston'], dtype='object', name='Cities')

In [23]:
ordinal_map={key:rank for rank,key in enumerate(mean_sales.index)}

In [24]:
ordinal_map

{'Chennai': 0, 'Pune': 1, 'Mumbai': 2, 'Delhi': 3, 'Boston': 4}

In [25]:
df['city_encoded']=df['Cities'].map(ordinal_map)

In [26]:
df

Unnamed: 0,Cities,Sales,city_encoded
0,Mumbai,120000,2
1,Pune,170000,1
2,Delhi,180000,3
3,Boston,190000,4
4,Pune,10000,1
5,Mumbai,200000,2
6,Boston,200001,4
7,Delhi,340500,3
8,Chennai,90000,0
9,Boston,800000,4
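
The steps above can be wrapped into a small helper so the same mapping is reused on new rows. This is a sketch, not a fixed recipe: the function name `target_guided_map`, the hypothetical unseen city `'Tokyo'`, and the choice of `-1` for unseen categories are all assumptions made for illustration. Note that Chennai and Pune both average 90,000 in this data, so their relative rank comes from tie-breaking rather than from the target; a stable sort keeps that tie-breaking deterministic.

```python
import pandas as pd

def target_guided_map(frame, feature, target):
    """Rank the categories of `feature` by the mean of `target` (lowest mean gets 0)."""
    means = frame.groupby(feature)[target].mean()
    # mergesort is stable, so tied means keep their alphabetical group order
    ordered = means.sort_values(kind='mergesort')
    return {category: rank for rank, category in enumerate(ordered.index)}

train = pd.DataFrame({
    'Cities': ['Mumbai', 'Pune', 'Delhi', 'Boston', 'Pune',
               'Mumbai', 'Boston', 'Delhi', 'Chennai', 'Boston'],
    'Sales': [120000, 170000, 180000, 190000, 10000,
              200000, 200001, 340500, 90000, 800000],
})

ordinal_map = target_guided_map(train, 'Cities', 'Sales')
print(ordinal_map)

# 'Tokyo' is a hypothetical city that never appeared in the training data
new_rows = pd.Series(['Pune', 'Tokyo'], name='Cities')

# Unseen categories map to NaN; -1 is one possible placeholder for them
encoded_new = new_rows.map(ordinal_map).fillna(-1).astype(int)
print(encoded_new.tolist())
```

Because the mapping is built from the target, it is usually computed on training rows only and then applied to validation or test rows, so that target values from held-out data do not leak into the feature.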


In [22]:
ordinal_map.keys()

dict_keys(['Chennai', 'Pune', 'Mumbai', 'Delhi', 'Boston'])