
In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import OrdinalEncoder

# Ordinal Labelling of Data

In [2]:
df=pd.read_csv("/content/drive/MyDrive/DataSets/Advance house pricing/train.csv",usecols=['Id','LandContour','LotShape','SalePrice'])

In [3]:
df.head()
df.shape

(1460, 4)

In [4]:
df_train=df.iloc[:1300,:].copy()
df_test1=df.iloc[1300:1380,:].copy()
df_test2=df.iloc[1380:,:].copy()

In [5]:
df_train.LandContour.value_counts()

Lvl    1169
Bnk      53
HLS      45
Low      33
Name: LandContour, dtype: int64

### **The `OrdinalEncoder` class of sklearn does not assign values at random: it sorts the categories of each column (alphabetically for strings) and numbers them from 0 in that order.**

In [6]:
ordinal_encoder=OrdinalEncoder()
label_df_train=df_train.copy()
label_df_train[['LotShape','LandContour']] = ordinal_encoder.fit_transform(df_train[['LotShape','LandContour']])

In [7]:
label_df_train.head()

Unnamed: 0,Id,LotShape,LandContour,SalePrice
0,1,3.0,3.0,208500
1,2,3.0,3.0,181500
2,3,0.0,3.0,223500
3,4,0.0,3.0,140000
4,5,0.0,3.0,250000


In [8]:
label_df_train.LandContour.value_counts()

3.0    1169
0.0      53
1.0      45
2.0      33
Name: LandContour, dtype: int64
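The codes above (Bnk 0, HLS 1, Low 2, Lvl 3) follow the sorted order of the labels. A minimal sketch of how to inspect the learned mapping through the encoder's `categories_` attribute; the `toy` frame is made-up illustrative data, not the housing file:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column with the same four LandContour labels (illustrative data only)
toy = pd.DataFrame({"LandContour": ["Lvl", "Bnk", "Low", "HLS", "Lvl"]})

enc = OrdinalEncoder()
codes = enc.fit_transform(toy[["LandContour"]])

# categories_ holds the learned labels per column; a label's position is its code
learned = list(enc.categories_[0])
mapping = {label: float(i) for i, label in enumerate(learned)}
print(mapping)
```

Because the order is alphabetical, the codes carry no meaning about which contour is "better".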

## Understanding the importance of fit and fit_transform

In [9]:
label_df_test1=df_test1.copy()
label_df_test2=df_test2.copy()
print("For Test data 1")
print(label_df_test1.LandContour.value_counts())
print("For test data 2")
print(label_df_test2.LandContour.value_counts())
# Test data 2: reuse the mapping learned on the training data (transform only)
label_df_test2[['LotShape','LandContour']]=ordinal_encoder.transform(df_test2[['LotShape','LandContour']])
# Test data 1: refit the encoder on the test data itself (fit_transform)
label_df_test1[['LotShape','LandContour']]=ordinal_encoder.fit_transform(df_test1[['LotShape','LandContour']])
print("For label data 1 (using fit_transform)")
print(label_df_test1.LandContour.value_counts())
print("For label data 2 (using transform)")
print(label_df_test2.LandContour.value_counts())

For Test data 1
Lvl    69
Bnk     6
HLS     3
Low     2
Name: LandContour, dtype: int64
For test data 2
Lvl    73
Bnk     4
HLS     2
Low     1
Name: LandContour, dtype: int64
For label data 1 (using fit_transform)
3.0    69
0.0     6
1.0     3
2.0     2
Name: LandContour, dtype: int64
For label data 2 (using transform)
3.0    73
0.0     4
1.0     2
2.0     1
Name: LandContour, dtype: int64


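No difference is observed here because both test sets contain all four LandContour categories, so refitting produces the same mapping. In general, `transform` labels the data according to the mapping learned by the earlier `fit_transform` on the training data, whereas calling `fit_transform` on test data relearns the mapping and can assign different codes to the same label.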
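The two calls diverge as soon as the test data lacks a category that was present in training. A minimal sketch on made-up toy data (the `train` and `test` frames are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"LandContour": ["Bnk", "HLS", "Low", "Lvl"]})
test = pd.DataFrame({"LandContour": ["Lvl", "Bnk", "Lvl"]})  # no HLS or Low rows

# Fit once on the training data, then only transform the test data
enc = OrdinalEncoder().fit(train)
reused = enc.transform(test).ravel().tolist()

# Refitting on the test data relearns the mapping from two categories only
refit = OrdinalEncoder().fit_transform(test).ravel().tolist()

print(reused)  # Lvl keeps the code it had in training
print(refit)   # Lvl gets a different code
```

For this reason the encoder is normally fitted on the training data only and then applied to every other split with `transform`.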

# One Hot Encoding

In [10]:
df.head()

Unnamed: 0,Id,LotShape,LandContour,SalePrice
0,1,Reg,Lvl,208500
1,2,Reg,Lvl,181500
2,3,IR1,Lvl,223500
3,4,IR1,Lvl,140000
4,5,IR1,Lvl,250000


In [11]:
df_oh=df.copy()

In [12]:
df_oh.head()

Unnamed: 0,Id,LotShape,LandContour,SalePrice
0,1,Reg,Lvl,208500
1,2,Reg,Lvl,181500
2,3,IR1,Lvl,223500
3,4,IR1,Lvl,140000
4,5,IR1,Lvl,250000


In [13]:
oh=pd.get_dummies(df_oh.LotShape)
oh.columns=["LotShape_"+i for i in oh.columns]

In [14]:
df_oh=pd.concat([df_oh,oh],axis=1)
df_oh.head()

Unnamed: 0,Id,LotShape,LandContour,SalePrice,LotShape_IR1,LotShape_IR2,LotShape_IR3,LotShape_Reg
0,1,Reg,Lvl,208500,0,0,0,1
1,2,Reg,Lvl,181500,0,0,0,1
2,3,IR1,Lvl,223500,1,0,0,0
3,4,IR1,Lvl,140000,1,0,0,0
4,5,IR1,Lvl,250000,1,0,0,0


In [15]:
# Drop the LotShape column, since it is no longer needed
df_oh.drop(columns=['LotShape'],inplace=True)
df_oh.head()

Unnamed: 0,Id,LandContour,SalePrice,LotShape_IR1,LotShape_IR2,LotShape_IR3,LotShape_Reg
0,1,Lvl,208500,0,0,0,1
1,2,Lvl,181500,0,0,0,1
2,3,Lvl,223500,1,0,0,0
3,4,Lvl,140000,1,0,0,0
4,5,Lvl,250000,1,0,0,0


In [16]:
# To reduce the dimensionality further, we can also drop the first dummy column: it is fully determined by the others (it is 1 exactly when all the remaining dummies are 0).
# So we can remove it without losing any information.
df_oh.drop(columns=['LotShape_IR1'],inplace=True)
df_oh.head()

Unnamed: 0,Id,LandContour,SalePrice,LotShape_IR2,LotShape_IR3,LotShape_Reg
0,1,Lvl,208500,0,0,1
1,2,Lvl,181500,0,0,1
2,3,Lvl,223500,0,0,0
3,4,Lvl,140000,0,0,0
4,5,Lvl,250000,0,0,0
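The same steps (create dummies, prefix them, drop the original column, drop the first dummy) can probably be collapsed into one `pd.get_dummies` call. A minimal sketch on a made-up `toy` frame, assuming a pandas version that supports the `dtype` argument:

```python
import pandas as pd

# Toy frame with the four LotShape labels (illustrative data only)
toy = pd.DataFrame({"Id": [1, 2, 3, 4],
                    "LotShape": ["Reg", "IR1", "IR2", "IR3"]})

# columns=... encodes and removes the original column, prefixes the dummies
# with "LotShape_", and drop_first=True drops the first (sorted) category
encoded = pd.get_dummies(toy, columns=["LotShape"], drop_first=True, dtype=int)
print(encoded)
```

The IR1 row ends up with zeros in every remaining dummy column, which is how the dropped category is still represented.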


## **Count Frequency Encoding**

An encoding technique that replaces each label with its count (frequency) in the data.

In [17]:
df_cf=df.copy()
df_cf.head()

Unnamed: 0,Id,LotShape,LandContour,SalePrice
0,1,Reg,Lvl,208500
1,2,Reg,Lvl,181500
2,3,IR1,Lvl,223500
3,4,IR1,Lvl,140000
4,5,IR1,Lvl,250000


In [18]:
count_frequency=dict(df_cf.LandContour.value_counts())
count_frequency

{'Bnk': 63, 'HLS': 50, 'Low': 36, 'Lvl': 1311}

In [19]:
df_cf['LandContour_cf']=df_cf.LandContour.map(count_frequency)
df_cf.iloc[1027:1037,:]

Unnamed: 0,Id,LotShape,LandContour,SalePrice,LandContour_cf
1027,1028,IR1,HLS,293077,50
1028,1029,Reg,Lvl,105000,1311
1029,1030,Reg,Lvl,118000,1311
1030,1031,Reg,Lvl,160000,1311
1031,1032,Reg,Lvl,197000,1311
1032,1033,IR1,Lvl,310000,1311
1033,1034,Reg,Lvl,230000,1311
1034,1035,Reg,Bnk,119750,63
1035,1036,IR1,Lvl,84000,1311
1036,1037,IR1,HLS,315500,50


In [20]:
# Disadvantage: labels that occur with the same frequency receive the same value and can no longer be told apart.
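A minimal sketch of that collision, using a made-up series in which two labels happen to occur equally often (this is not the case in the housing data above):

```python
import pandas as pd

# Bnk and HLS both occur twice in this toy series (illustrative data only)
toy = pd.Series(["Lvl", "Lvl", "Lvl", "Bnk", "Bnk", "HLS", "HLS"],
                name="LandContour")

counts = toy.value_counts().to_dict()
encoded = toy.map(counts)

print(counts)
print(encoded.tolist())  # Bnk and HLS rows receive the same value
```

Three distinct labels collapse into two distinct encoded values, so a model can no longer separate Bnk from HLS using this column alone.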

## **Ordinal Number Encoding**
We define our own labelling, ordering the categories in whatever way makes sense for the problem.

In [21]:
df_oe=df.copy()
df_oe.head()

Unnamed: 0,Id,LotShape,LandContour,SalePrice
0,1,Reg,Lvl,208500
1,2,Reg,Lvl,181500
2,3,IR1,Lvl,223500
3,4,IR1,Lvl,140000
4,5,IR1,Lvl,250000


In [22]:
label_name=df_oe.LandContour.unique()

# Lvl: Near Flat/Level
# Bnk: Banked - Quick and significant rise from street grade to building
# HLS: Hillside - Significant slope from side to side
# Low: Depression

# Judging by these descriptions, Lvl is preferable to Bnk, then HLS, then Low.
# So we define the ordinal encoding as Lvl:3, Bnk:2, HLS:1, Low:0
label_name

array(['Lvl', 'Bnk', 'Low', 'HLS'], dtype=object)

In [23]:
# Use the hand-picked ranks; enumerating label_name would only follow the order of appearance
ord_lbl={'Lvl':3,'Bnk':2,'HLS':1,'Low':0}

In [24]:
ord_lbl

{'Bnk': 2, 'HLS': 1, 'Low': 0, 'Lvl': 3}

In [25]:
df_oe['LandContour_ordinal encoding']=df_oe.LandContour.map(ord_lbl)
df_oe.head()

Unnamed: 0,Id,LotShape,LandContour,SalePrice,LandContour_ordinal encoding
0,1,Reg,Lvl,208500,3
1,2,Reg,Lvl,181500,3
2,3,IR1,Lvl,223500,3
3,4,IR1,Lvl,140000,3
4,5,IR1,Lvl,250000,3
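A hand-picked order such as Lvl:3, Bnk:2, HLS:1, Low:0 can also be handed to sklearn's `OrdinalEncoder` through its `categories` parameter instead of a dictionary. A minimal sketch on a made-up `toy` frame; the ranking itself is the subjective choice described above and the column name `LandContour_ord` is hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"LandContour": ["Lvl", "Bnk", "HLS", "Low"]})

# Listed from lowest to highest rank, so a label's position becomes its code
order = ["Low", "HLS", "Bnk", "Lvl"]

enc = OrdinalEncoder(categories=[order])
toy["LandContour_ord"] = enc.fit_transform(toy[["LandContour"]]).ravel()
print(toy)
```

One practical difference from `map`: by default the encoder raises an error on a label that is not in `order`, whereas `map` silently produces NaN.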


## **Target Guided Ordinal Encoding**

Labels are ranked by taking the target variable into consideration.

In [36]:
df_te=df.copy()
df_te.head()

Unnamed: 0,Id,LotShape,LandContour,SalePrice
0,1,Reg,Lvl,208500
1,2,Reg,Lvl,181500
2,3,IR1,Lvl,223500
3,4,IR1,Lvl,140000
4,5,IR1,Lvl,250000


In [37]:
df_te.groupby(['LandContour'])['SalePrice'].mean().sort_values()

LandContour
Bnk    143104.079365
Lvl    180183.746758
Low    203661.111111
HLS    231533.940000
Name: SalePrice, dtype: float64

In [38]:
# In this type of encoding, each label is replaced by its rank, where the ranking is decided by the mean of the target column for that label.
label_name=df_te.groupby(['LandContour'])['SalePrice'].mean().sort_values().index
label_name

Index(['Bnk', 'Lvl', 'Low', 'HLS'], dtype='object', name='LandContour')

In [39]:
label_name={k:i for i,k in enumerate(label_name,0)}
label_name

{'Bnk': 0, 'HLS': 3, 'Low': 2, 'Lvl': 1}

In [40]:
df_te["Land_contour Target Encoded"]=df_te.LandContour.map(label_name)

In [41]:
df_te.head()

Unnamed: 0,Id,LotShape,LandContour,SalePrice,Land_contour Target Encoded
0,1,Reg,Lvl,208500,1
1,2,Reg,Lvl,181500,1
2,3,IR1,Lvl,223500,1
3,4,IR1,Lvl,140000,1
4,5,IR1,Lvl,250000,1
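The cells above compute the ranking on the full dataset. When a separate test split exists, one common precaution is to learn the ranking from the training rows only and then map it onto the test rows, so that test-set prices do not influence the encoding. A minimal sketch with made-up `train` and `test` frames and a hypothetical column name `LandContour_rank`:

```python
import pandas as pd

# Illustrative data only, not the housing file
train = pd.DataFrame({
    "LandContour": ["Bnk", "Bnk", "Lvl", "Lvl", "HLS"],
    "SalePrice": [100, 120, 150, 170, 300],
})
test = pd.DataFrame({"LandContour": ["HLS", "Bnk", "Lvl"]})

# Order the labels by mean SalePrice in the training rows, lowest first
ordered = train.groupby("LandContour")["SalePrice"].mean().sort_values().index
rank = {label: i for i, label in enumerate(ordered)}

# Apply the training-derived ranks to the test rows
test["LandContour_rank"] = test["LandContour"].map(rank)
print(rank)
print(test)
```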


## **Mean Guided Ordinal Encoding**

Similar to target guided encoding; the only difference is that each label is replaced with the mean of the target variable for that label instead of its rank.

In [42]:
df_me=df.copy()
df_me.head()

Unnamed: 0,Id,LotShape,LandContour,SalePrice
0,1,Reg,Lvl,208500
1,2,Reg,Lvl,181500
2,3,IR1,Lvl,223500
3,4,IR1,Lvl,140000
4,5,IR1,Lvl,250000


In [43]:
df_me.groupby(['LotShape'])['SalePrice'].mean().sort_values()

LotShape
Reg    164754.818378
IR1    206101.665289
IR3    216036.500000
IR2    239833.365854
Name: SalePrice, dtype: float64

In [44]:
df_me["LotShape Mean Encoded"]=df_me.LotShape.map(dict(df_me.groupby(['LotShape'])['SalePrice'].mean().sort_values()))

In [45]:
df_me.head()

Unnamed: 0,Id,LotShape,LandContour,SalePrice,LotShape Mean Encoded
0,1,Reg,Lvl,208500,164754.818378
1,2,Reg,Lvl,181500,164754.818378
2,3,IR1,Lvl,223500,206101.665289
3,4,IR1,Lvl,140000,206101.665289
4,5,IR1,Lvl,250000,206101.665289
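When the means are learned on one split and applied to another, a label that never appeared in the first split maps to NaN. A minimal sketch of one possible fallback, filling such labels with the overall training mean; the `train` and `test` frames and the column name `LotShape_mean` are made up for illustration:

```python
import pandas as pd

# Illustrative data only, not the housing file
train = pd.DataFrame({
    "LotShape": ["Reg", "Reg", "IR1", "IR1"],
    "SalePrice": [100.0, 200.0, 300.0, 500.0],
})
test = pd.DataFrame({"LotShape": ["IR1", "Reg", "IR3"]})  # IR3 is unseen in train

# Per-label mean of the target, learned from the training rows
means = train.groupby("LotShape")["SalePrice"].mean()
global_mean = train["SalePrice"].mean()

# Unseen labels map to NaN, which is then replaced by the overall mean
test["LotShape_mean"] = test["LotShape"].map(means).fillna(global_mean)
print(test)
```

The overall mean is only one choice of fallback; any other default would be applied the same way.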


## **After encoding the categorical variables, we use the encoded columns for further model building and drop the original category columns. In the previous case, for example, we drop the LotShape column and keep the LotShape Mean Encoded column.**