# 🔷 What is Data Encoding?
Data Encoding is the process of converting data into a specific format so that it can be efficiently and accurately stored, processed, or transmitted.

In machine learning and data analysis, encoding usually means converting categorical data (like country, gender, or color) into a numerical format, because most algorithms operate on numbers rather than text labels.

# 🔷 Why Is Encoding Important?
Many machine learning algorithms can't work with categorical data directly. For example:

| Gender |
| ------ |
| Male   |
| Female |

You cannot pass the strings "Male" or "Female" to most ML models; they must first be encoded as numbers.
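As a minimal sketch of what "encoding as numbers" can look like, the snippet below maps the two values from the table to integers with pandas. The names `df_gender` and `gender_map` and the choice of 0 and 1 are assumptions made for illustration; the numbers themselves carry no meaning.

```python
import pandas as pd

# Toy column matching the table above.
df_gender = pd.DataFrame({"Gender": ["Male", "Female"]})

# Hypothetical mapping chosen for illustration only.
gender_map = {"Male": 0, "Female": 1}
df_gender["Gender_encoded"] = df_gender["Gender"].map(gender_map)

print(df_gender)
```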



# Nominal or One Hot Encoding
Consider a categorical feature called "color" with values such as red, green, and blue. This feature is categorical because it consists of distinct categories.

To enable the model to understand this feature, one hot encoding converts it into numerical features by creating new binary features for each category. For example, the categories red, green, and blue become three new features: red, green, and blue.

For each data point, the feature corresponding to its category is set to 1, and the others are set to 0. For instance, if the color is red, the red feature is 1, and green and blue are 0.
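The description above can be sketched with `pandas.get_dummies`, which builds one 0/1 column per category. The frame `colors` and its four rows are made-up illustration data, not part of the original notebook.

```python
import pandas as pd

# Illustrative data: four rows, three distinct colors.
colors = pd.DataFrame({"color": ["red", "green", "blue", "red"]})

# One new 0/1 column per category; columns come out in sorted order.
one_hot = pd.get_dummies(colors["color"], dtype=int)

print(one_hot)
```

Each row has exactly one 1, in the column that matches its color.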

# Disadvantages of One Hot Encoding
If a categorical feature has many categories, such as 100, one hot encoding will create 100 new features, increasing dimensionality.
This leads to sparse matrices, where most values are zeros and only one value is one per row.
High-dimensional, sparse inputs like these can contribute to overfitting, where the model fits the training data too well but performs poorly on new data.
Therefore, one hot encoding is not recommended for features with many categories.
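To make the dimensionality point concrete, here is a small sketch with a synthetic feature that has 100 categories. The labels `cat_0` to `cat_99`, the row count of 1000, and the random seed are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Synthetic feature with 100 distinct (made-up) categories.
rng = np.random.default_rng(0)
labels = [f"cat_{i}" for i in range(100)]
# Put every label in once, then add 900 random draws, for 1000 rows in total.
values = np.concatenate([labels, rng.choice(labels, size=900)])
high_card = pd.DataFrame({"category": values})

# By default OneHotEncoder returns a SciPy sparse matrix.
sparse_encoded = OneHotEncoder().fit_transform(high_card[["category"]])

n_rows, n_cols = sparse_encoded.shape
density = sparse_encoded.nnz / (n_rows * n_cols)  # share of non-zero cells
print(n_rows, n_cols, density)
```

One column becomes 100 columns, and only one cell per row is non-zero, so 1% of the matrix holds a 1.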

In [11]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import OneHotEncoder

In [12]:
df =pd.DataFrame({
    'Color' : ['Red','Yellow','Green','Blue','Pink']
})

In [13]:
df

Unnamed: 0,Color
0,Red
1,Yellow
2,Green
3,Blue
4,Pink


In [14]:
encoder = OneHotEncoder()
encoded=encoder.fit_transform(df[['Color']]).toarray()

In [15]:
encoded_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())
encoded_df 


Unnamed: 0,Color_Blue,Color_Green,Color_Pink,Color_Red,Color_Yellow
0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0,0.0


In [16]:
df1=pd.concat([df,encoded_df],axis=1)
df1

Unnamed: 0,Color,Color_Blue,Color_Green,Color_Pink,Color_Red,Color_Yellow
0,Red,0.0,0.0,0.0,1.0,0.0
1,Yellow,0.0,0.0,0.0,0.0,1.0
2,Green,0.0,1.0,0.0,0.0,0.0
3,Blue,1.0,0.0,0.0,0.0,0.0
4,Pink,0.0,0.0,1.0,0.0,0.0


In [17]:
import seaborn as sns
df2=sns.load_dataset('tips')

In [18]:
df2.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [19]:
df2['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [20]:
encoded1 = encoder.fit_transform(df2[['day']]).toarray()
encoded1

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       ...


In [21]:
df3 = pd.DataFrame(encoded1,columns=[name.split('_')[1] for name in encoder.get_feature_names_out()])
df3

Unnamed: 0,Fri,Sat,Sun,Thur
0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0
...,...,...,...,...
239,0.0,1.0,0.0,0.0
240,0.0,1.0,0.0,0.0
241,0.0,1.0,0.0,0.0
242,0.0,1.0,0.0,0.0


In [22]:
df_4 = pd.concat([df2,df3],axis=1)
df_4

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Fri,Sat,Sun,Thur
0,16.99,1.01,Female,No,Sun,Dinner,2,0.0,0.0,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,0.0,1.0,0.0
2,21.01,3.50,Male,No,Sun,Dinner,3,0.0,0.0,1.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,0.0,1.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,0.0,1.0,0.0,0.0
240,27.18,2.00,Female,Yes,Sat,Dinner,2,0.0,1.0,0.0,0.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2,0.0,1.0,0.0,0.0
242,17.82,1.75,Male,No,Sat,Dinner,2,0.0,1.0,0.0,0.0


In [23]:
# `columns=` already names the axis, so `axis=1` is not needed
df_4.drop(columns=['day'], inplace=True)

In [24]:
df_4.head()

Unnamed: 0,total_bill,tip,sex,smoker,time,size,Fri,Sat,Sun,Thur
0,16.99,1.01,Female,No,Dinner,2,0.0,0.0,1.0,0.0
1,10.34,1.66,Male,No,Dinner,3,0.0,0.0,1.0,0.0
2,21.01,3.5,Male,No,Dinner,3,0.0,0.0,1.0,0.0
3,23.68,3.31,Male,No,Dinner,2,0.0,0.0,1.0,0.0
4,24.59,3.61,Female,No,Dinner,4,0.0,0.0,1.0,0.0
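
As an alternative sketch, the encode, concat, and drop steps above can be collapsed into one `pandas.get_dummies` call. The frame `mini_tips` is a small hand-made stand-in, not the real tips data, so the example runs without downloading anything.

```python
import pandas as pd

# Hand-made stand-in for the tips data (rows are made up for illustration).
mini_tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "day": ["Sun", "Sat", "Thur", "Fri"],
})

# Encodes `day`, drops the original column, and leaves other columns untouched.
# Empty prefix and separator keep the bare category names as column names.
mini_encoded = pd.get_dummies(
    mini_tips, columns=["day"], prefix="", prefix_sep="", dtype=float
)

print(mini_encoded)
```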


# Label and Ordinal Encoding
# Introduction to Label Encoding and Ordinal Encoding
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data. Label encoding involves assigning a unique numerical label to each category in the variable.

For example, if you have three categories like red, green, and blue, you will assign labels like one, two, and three. Each category receives a unique numerical value.

# Limitations of Label Encoding
One problem with label encoding is that the model may interpret the numerical labels as having an order or ranking. For example, if red is assigned two, green is one, and blue is zero, the model may think red is greater than green and blue, which is not true for nominal data.
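
A minimal sketch of where the codes come from: scikit-learn's `LabelEncoder` sorts the categories and uses each category's position as its code, so the numbers reflect alphabetical order rather than any property of the colors. The list `colors` is illustration data.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the categories in sorted order; the index is the code.
print(le.classes_.tolist())                    # ['blue', 'green', 'red']
print(codes.tolist())                          # [2, 1, 0, 1]
print(le.inverse_transform([0, 2]).tolist())   # ['blue', 'red']
```

A model that treats these codes as magnitudes would see red (2) as "greater than" blue (0), which is exactly the spurious ordering described above.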

In [25]:
df

Unnamed: 0,Color
0,Red
1,Yellow
2,Green
3,Blue
4,Pink


In [26]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

In [27]:
# LabelEncoder expects a 1-D input, so pass the column as a Series
label_encoded = label_encoder.fit_transform(df['Color'])
label_encoded

array([3, 4, 1, 0, 2])

In [28]:
label_encoder.transform(['Red'])

array([3])

In [29]:
df1 =pd.DataFrame(label_encoded,columns=['Colours_encoded'])

In [30]:
df1

Unnamed: 0,Colours_encoded
0,3
1,4
2,1
3,0
4,2


In [31]:
df2 = pd.concat([df,df1],axis=1)
df2

Unnamed: 0,Color,Colours_encoded
0,Red,3
1,Yellow,4
2,Green,1
3,Blue,0
4,Pink,2


# Introduction to Ordinal Encoding
If you have a use case where you need to assign ranks to categories, you can use ordinal encoding. Ordinal encoding is used to encode categorical data that have an intrinsic order or ranking. Each category is assigned a numerical value based on its position in the order.

**An intrinsic order or ranking means that the categories have a natural sequence (like 1st < 2nd < 3rd or Low < Medium < High).**

For example, a categorical variable 'education level' with the values high school, college, graduate, and post graduate can be represented with ordinal encoding: high school is rank one, college two, graduate three, and post graduate four.
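
The education-level example can be sketched with `OrdinalEncoder` by passing the order explicitly. The frame `edu` and its rows are made up for illustration. Because `OrdinalEncoder` numbers categories from 0, the sketch adds 1 to get ranks one through four.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data, deliberately not in rank order.
edu = pd.DataFrame({
    "education": ["college", "high school", "post graduate", "graduate", "college"]
})

# Explicit order, lowest to highest.
order = ["high school", "college", "graduate", "post graduate"]
edu_encoder = OrdinalEncoder(categories=[order])

# fit_transform returns a 2-D float array; flatten it and shift from 0-based to 1-based.
encoded = edu_encoder.fit_transform(edu[["education"]])
edu["education_rank"] = encoded.ravel().astype(int) + 1

print(edu)
```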

In [32]:
df = pd.DataFrame({
    'Size' : ['Small','Medium','Large','Medium','Large','Small']
})

In [33]:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])

In [34]:
df['size_encoded'] = encoder.fit_transform(df[['Size']])

In [35]:
df

Unnamed: 0,Size,size_encoded
0,Small,0.0
1,Medium,1.0
2,Large,2.0
3,Medium,1.0
4,Large,2.0
5,Small,0.0


In [36]:
encoder.transform([['Small']])

array([[0.]])

# **Target Guided Ordinal Encoding**
# Introduction to Target Guided Ordinal Encoding
In this lecture, we continue our discussion on feature engineering. We will explore a new encoding technique called target guided ordinal encoding. This method is used to encode categorical variables based on their relationship with the target variable.

Target guided ordinal encoding is particularly useful when dealing with categorical variables that have a large number of unique categories. The core idea is to replace each category in the categorical variable with a numerical value derived from the mean or median of the target variable for that category.

**Example Dataset**

Consider a simple dataset with two features: City and Price. Here, Price is the target variable, and we want to convert the categorical feature City into a numerical form based on this target variable.

For instance, the category "Lahore" appears three times with prices 200, 220, and 210. To encode "Lahore", we calculate the mean of these three prices, which is 210, and replace the category with this value.

In [37]:
data = {
    'City': ['Lahore', 'Karachi', 'Lahore', 'Islamabad', 'Karachi', 'Islamabad', 'Lahore', 'Karachi', 'Islamabad'],
    'Price': [200, 250, 220, 300, 230, 310, 210, 240, 290]
}

df = pd.DataFrame(data)

In [38]:
mean_price = df.groupby('City')['Price'].mean().to_dict()

In [39]:
mean_price

{'Islamabad': 300.0, 'Karachi': 240.0, 'Lahore': 210.0}

In [40]:
df['City_Encoded'] = df['City'].map(mean_price)

In [41]:
df

Unnamed: 0,City,Price,City_Encoded
0,Lahore,200,210.0
1,Karachi,250,240.0
2,Lahore,220,210.0
3,Islamabad,300,300.0
4,Karachi,230,240.0
5,Islamabad,310,300.0
6,Lahore,210,210.0
7,Karachi,240,240.0
8,Islamabad,290,300.0
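
The cells above replace each city with its raw mean price. A common variant, sketched here as an assumption rather than as the method used in this notebook, ranks the cities by that mean and stores the rank instead. The names `prices`, `city_means`, and `city_rank` are introduced only for this sketch.

```python
import pandas as pd

# Same City/Price values as the example dataset.
prices = pd.DataFrame({
    "City": ["Lahore", "Karachi", "Lahore", "Islamabad", "Karachi",
             "Islamabad", "Lahore", "Karachi", "Islamabad"],
    "Price": [200, 250, 220, 300, 230, 310, 210, 240, 290],
})

# Mean target per city, sorted from lowest to highest.
city_means = prices.groupby("City")["Price"].mean().sort_values()

# Rank 1 goes to the city with the lowest mean price.
city_rank = {city: rank for rank, city in enumerate(city_means.index, start=1)}
prices["City_Rank"] = prices["City"].map(city_rank)

print(city_rank)  # {'Lahore': 1, 'Karachi': 2, 'Islamabad': 3}
```

Swapping `.mean()` for `.median()` gives the median-based version of the same idea.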


| Encoding Technique    | Can the Model Read Larger Values as "More"? | Why?                                        |
| --------------------- | ------------------------------------------- | ------------------------------------------- |
| Label Encoding        | ✅ Yes                                       | Integer codes imply an order                |
| Ordinal Encoding      | ✅ Yes                                       | Codes follow the stated category order      |
| Target Guided Ordinal | ✅ Yes                                       | Values ranked by target average             |
| OneHot Encoding       | ❌ No                                        | Separate 0/1 columns, no order between them |
               


In [42]:
df =sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [43]:
grp = df.groupby('time')['total_bill'].mean().to_dict()
grp

{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [44]:
df['Encoded-time'] = df['time'].map(grp)

In [45]:
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,Encoded-time
0,16.99,1.01,Female,No,Sun,Dinner,2,20.797159
1,10.34,1.66,Male,No,Sun,Dinner,3,20.797159
2,21.01,3.50,Male,No,Sun,Dinner,3,20.797159
3,23.68,3.31,Male,No,Sun,Dinner,2,20.797159
4,24.59,3.61,Female,No,Sun,Dinner,4,20.797159
...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,20.797159
240,27.18,2.00,Female,Yes,Sat,Dinner,2,20.797159
241,22.67,2.00,Male,Yes,Sat,Dinner,2,20.797159
242,17.82,1.75,Male,No,Sat,Dinner,2,20.797159


In [46]:
# `columns=` already names the axis, so `axis=1` is not needed
df.drop(columns=['time'], inplace=True)

In [47]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,size,Encoded-time
0,16.99,1.01,Female,No,Sun,2,20.797159
1,10.34,1.66,Male,No,Sun,3,20.797159
2,21.01,3.5,Male,No,Sun,3,20.797159
3,23.68,3.31,Male,No,Sun,2,20.797159
4,24.59,3.61,Female,No,Sun,4,20.797159
