Label Encoding
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer to each category in the variable. The integers are usually assigned in alphabetical order or based on the frequency of the categories; scikit-learn's LabelEncoder, used in the cells that follow, sorts the categories and numbers them from 0. The numbers carry no meaning beyond identifying the category. For example, if we have a categorical variable "color" with three possible values (red, green, blue), one possible label encoding is:

Red: 1
Green: 2
Blue: 3

In [5]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [4]:
## Create a simple dataframe 
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [6]:
df.head()

index,color
0,red
1,blue
2,green
3,green
4,red


In [7]:
lbl_encoder = LabelEncoder()

In [8]:
# LabelEncoder expects a 1-D array, so pass the column as a Series
lbl_encoder.fit_transform(df['color'])

array([2, 0, 1, 1, 2, 0])

In [9]:
lbl_encoder.transform(['red'])

array([2])

In [10]:
lbl_encoder.transform(['blue'])

array([0])

In [11]:
lbl_encoder.transform(['green'])

array([1])

In [12]:
lbl_encoder.transform(['red'])

array([2])
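The mapping that the encoder learned can be read back from the fitted object instead of probing one category at a time. The following is a minimal sketch, assuming the same six color values as above; the variable names `colors`, `mapping` and `decoded` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "green", "red", "blue"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; a category's position is its code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because the categories are sorted before numbering, the codes follow alphabetical order (blue, green, red) rather than the order in which the values first appear.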

Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, if we have a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate), we can represent it using ordinal encoding as follows:

High school: 1
College: 2
Graduate: 3
Post-graduate: 4

In [13]:
## Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

In [18]:
# create a sample DataFrame with an ordinal variable
df = pd.DataFrame({
    'education': ['High school', 'college', 'Graduate', 'Post-graduate',
                  'High school', 'college', 'Graduate', 'Post-graduate',
                  'college', 'college']
})

In [19]:
df

index,education
0,High school
1,college
2,Graduate
3,Post-graduate
4,High school
5,college
6,Graduate
7,Post-graduate
8,college
9,college


In [20]:
# create an instance of OrdinalEncoder, listing the categories from lowest to highest rank
encoder = OrdinalEncoder(categories=[['High school', 'college', 'Graduate', 'Post-graduate']])

In [21]:
encoder.fit_transform(df[['education']])

array([[0.],
       [1.],
       [2.],
       [3.],
       [0.],
       [1.],
       [2.],
       [3.],
       [1.],
       [1.]])

In [23]:
encoder.transform([['college']])



array([[1.]])
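The `categories` argument is what makes the codes follow the ranking. The sketch below, which assumes the same four education levels, compares the default behavior with the explicit ordering; `levels`, `default_codes` and `ranked_codes` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

levels = pd.DataFrame(
    {"education": ["High school", "college", "Graduate", "Post-graduate"]}
)

# Default: categories are sorted as strings, so the codes ignore the real ranking.
default_codes = OrdinalEncoder().fit_transform(levels).ravel().tolist()

# Explicit order: codes follow the ranking passed in `categories`.
ranked = OrdinalEncoder(
    categories=[["High school", "college", "Graduate", "Post-graduate"]]
)
ranked_codes = ranked.fit_transform(levels).ravel().tolist()

print(default_codes)
print(ranked_codes)
```

With the default, uppercase strings sort before lowercase ones, so "college" receives the highest code even though it is second in the ranking.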

Target Guided Ordinal Encoding
Target guided ordinal encoding is a technique that encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in our machine learning model.

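In Target Guided Ordinal Encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. The value can be the statistic itself, as in the examples that follow, or the category's rank when the categories are sorted by that statistic. Either way, the encoded values increase with the category's typical target value, which creates a monotonic relationship between the encoded feature and the target and can improve the predictive power of our model.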

In [24]:
# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [25]:
df

index,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [28]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [29]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [32]:
df['city_encoded'] = df['city'].map(mean_price)

In [33]:
df[['price','city_encoded']]

index,price,city_encoded
0,200,190.0
1,150,150.0
2,300,310.0
3,250,250.0
4,180,190.0
5,320,310.0
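The cells above store the mean price itself as the encoded value. An ordinal variant ranks the cities by mean price and stores the rank instead. This is a minimal sketch under that assumption, rebuilding the same city and price data; `listings`, `ordered_cities`, `city_rank` and the `city_rank` column are hypothetical names.

```python
import pandas as pd

listings = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Sort the cities by mean price, then number them from lowest to highest.
ordered_cities = listings.groupby("city")["price"].mean().sort_values().index
city_rank = {city: rank for rank, city in enumerate(ordered_cities)}

listings["city_rank"] = listings["city"].map(city_rank)
print(city_rank)
print(listings)
```

The ranks keep the ordering of the means but discard their spacing, which may be preferable when only the ordering is trusted.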


In [34]:
import seaborn as sns

In [36]:
tips=sns.load_dataset('tips')
tips.head()

index,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [38]:
tips[['time','total_bill']]

index,time,total_bill
0,Dinner,16.99
1,Dinner,10.34
2,Dinner,21.01
3,Dinner,23.68
4,Dinner,24.59
...,...,...
239,Dinner,29.03
240,Dinner,27.18
241,Dinner,22.67
242,Dinner,17.82


In [40]:
mean_bill = tips.groupby('time')['total_bill'].mean().to_dict()

In [41]:
mean_bill

{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [46]:
tips['time_encoded'] = tips['time'].map(mean_bill)

In [47]:
tips['time_encoded']

0      20.797159
1      20.797159
2      20.797159
3      20.797159
4      20.797159
         ...    
239    20.797159
240    20.797159
241    20.797159
242    20.797159
243    20.797159
Name: time_encoded, Length: 244, dtype: category
Categories (2, float64): [17.168676, 20.797159]
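The output above has dtype `category` because `time` is a categorical column and the mapping is one-to-one, so the mapped result stays categorical. Most models expect a plain numeric feature, so a cast to float may be needed. The sketch below uses a small hypothetical stand-in (`bills`) rather than the real tips data, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the tips data: 'time' is stored as a categorical column.
bills = pd.DataFrame({
    "time": pd.Categorical(["Dinner", "Dinner", "Lunch", "Lunch"]),
    "total_bill": [20.0, 30.0, 10.0, 14.0],
})

# observed=True groups only by the categories that actually occur in the data.
mean_bill = bills.groupby("time", observed=True)["total_bill"].mean().to_dict()

# Casting to float yields a plain numeric column whatever dtype map() returns.
bills["time_encoded"] = bills["time"].map(mean_bill).astype(float)
print(bills.dtypes)
```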

In [48]:
tips[['total_bill','time_encoded']]

index,total_bill,time_encoded
0,16.99,20.797159
1,10.34,20.797159
2,21.01,20.797159
3,23.68,20.797159
4,24.59,20.797159
...,...,...
239,29.03,20.797159
240,27.18,20.797159
241,22.67,20.797159
242,17.82,20.797159
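Because the encoding is computed from the target, it is usually learned from the training rows only and then applied to new rows, so that the new rows' targets do not leak into the feature. The sketch below assumes a hypothetical split into `train` and `holdout` frames and falls back to the overall training mean for a city never seen in training; that fallback is one common choice, not the only one.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["New York", "London", "Paris", "New York", "Paris"],
    "price": [200, 150, 300, 180, 320],
})
holdout = pd.DataFrame({"city": ["London", "Tokyo"]})  # 'Tokyo' is absent from train

# Learn the per-city means and the fallback value from the training rows only.
city_means = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

# Unseen categories map to NaN, which is then replaced by the fallback.
holdout["city_encoded"] = holdout["city"].map(city_means).fillna(global_mean)
print(holdout)
```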
