### Nominal/OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, a form most machine learning algorithms require. It suits nominal variables, whose categories have no inherent order. Each value is represented as a binary vector with one position per unique category: the position of the observed category is set to 1 and all other positions are set to 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
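
The mapping above can be reproduced by hand before reaching for a library. The following is a minimal sketch in plain Python; the helper name `one_hot` and the fixed category order are assumptions made for illustration, not part of scikit-learn.

```python
# Fixed category order, chosen to match the list above (an assumption for illustration)
categories = ["red", "green", "blue"]


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 everywhere else."""
    return [1 if value == category else 0 for category in categories]


# Build the vector for every category and print the mapping
vectors = {color: one_hot(color, categories) for color in categories}
for color, vector in vectors.items():
    print(color, vector)
```

Each vector has exactly one 1, which is where the name "one-hot" comes from.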

In [7]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


In [8]:
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'blue', 'red', 'green', 'violet', 'red', 'violet'],
})

In [9]:
df.head(10)

,color
0,red
1,blue
2,green
3,blue
4,red
5,green
6,violet
7,red
8,violet


In [10]:
# Create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [11]:
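# Fit the encoder on the 'color' column and transform it;
# fit_transform returns a sparse matrix, so convert it to a dense array
encoded = encoder.fit_transform(df[['color']]).toarray()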

In [12]:
encoder_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [14]:
encoder_df

,color_blue,color_green,color_red,color_violet
0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0
5,0.0,1.0,0.0,0.0
6,0.0,0.0,0.0,1.0
7,0.0,0.0,1.0,0.0
8,0.0,0.0,0.0,1.0


In [15]:
pd.concat([df,encoder_df],axis=1)

,color,color_blue,color_green,color_red,color_violet
0,red,0.0,0.0,1.0,0.0
1,blue,1.0,0.0,0.0,0.0
2,green,0.0,1.0,0.0,0.0
3,blue,1.0,0.0,0.0,0.0
4,red,0.0,0.0,1.0,0.0
5,green,0.0,1.0,0.0,0.0
6,violet,0.0,0.0,0.0,1.0
7,red,0.0,0.0,1.0,0.0
8,violet,0.0,0.0,0.0,1.0
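
The encoded columns above appear in alphabetical order (blue, green, red, violet) rather than in order of first appearance. A minimal sketch, assuming a shortened version of the `color` column, shows where that order comes from and how the original labels can be recovered with `inverse_transform`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Shortened version of the color column, used here only for illustration
colors = pd.DataFrame({"color": ["red", "blue", "green", "violet"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors[["color"]]).toarray()

# Categories learned during fit; for string input they are sorted alphabetically
learned = list(encoder.categories_[0])
print(learned)

# Map each one-hot row back to its original label
decoded = encoder.inverse_transform(encoded).ravel().tolist()
print(decoded)
```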


In [18]:
import seaborn as sns
tips_dataset = sns.load_dataset('tips')
tips_dataset.head(10)

,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
5,25.29,4.71,Male,No,Sun,Dinner,4
6,8.77,2.0,Male,No,Sun,Dinner,2
7,26.88,3.12,Male,No,Sun,Dinner,4
8,15.04,1.96,Male,No,Sun,Dinner,2
9,14.78,3.23,Male,No,Sun,Dinner,2


In [20]:
tips_dataset.shape

(244, 7)

In [21]:
tips_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [22]:
tips_dataset['day'].unique()

['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Thur', 'Fri', 'Sat', 'Sun']

In [23]:
encoded=encoder.fit_transform(tips_dataset[['day']]).toarray()

In [24]:
encoder_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [27]:
encoded_df=pd.concat([tips_dataset,encoder_df],axis=1)

In [28]:
encoded_df

,total_bill,tip,sex,smoker,day,time,size,day_Fri,day_Sat,day_Sun,day_Thur
0,16.99,1.01,Female,No,Sun,Dinner,2,0.0,0.0,1.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,0.0,1.0,0.0
2,21.01,3.50,Male,No,Sun,Dinner,3,0.0,0.0,1.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,0.0,1.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,0.0,1.0,0.0,0.0
240,27.18,2.00,Female,Yes,Sat,Dinner,2,0.0,1.0,0.0,0.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2,0.0,1.0,0.0,0.0
242,17.82,1.75,Male,No,Sat,Dinner,2,0.0,1.0,0.0,0.0


In [40]:
encoded_df[encoded_df['day_Fri'] == 1]


,total_bill,tip,sex,smoker,day,time,size,day_Fri,day_Sat,day_Sun,day_Thur
90,28.97,3.0,Male,Yes,Fri,Dinner,2,1.0,0.0,0.0,0.0
91,22.49,3.5,Male,No,Fri,Dinner,2,1.0,0.0,0.0,0.0
92,5.75,1.0,Female,Yes,Fri,Dinner,2,1.0,0.0,0.0,0.0
93,16.32,4.3,Female,Yes,Fri,Dinner,2,1.0,0.0,0.0,0.0
94,22.75,3.25,Female,No,Fri,Dinner,2,1.0,0.0,0.0,0.0
95,40.17,4.73,Male,Yes,Fri,Dinner,4,1.0,0.0,0.0,0.0
96,27.28,4.0,Male,Yes,Fri,Dinner,2,1.0,0.0,0.0,0.0
97,12.03,1.5,Male,Yes,Fri,Dinner,2,1.0,0.0,0.0,0.0
98,21.01,3.0,Male,Yes,Fri,Dinner,2,1.0,0.0,0.0,0.0
99,12.46,1.5,Male,No,Fri,Dinner,2,1.0,0.0,0.0,0.0


### Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned an integer based on its position in that order, so the numeric values preserve the ranking. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be represented with ordinal encoding as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
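
The mapping above can be written out directly as a dictionary. This is a minimal sketch using pandas `Series.map`; the sample values in `levels` are made up for illustration, and the codes start at 1 to match the list above.

```python
import pandas as pd

# Explicit rank for each education level, lowest to highest
education_order = {
    "high school": 1,
    "college": 2,
    "graduate": 3,
    "post-graduate": 4,
}

# Illustrative sample values (not from any dataset)
levels = pd.Series(["college", "high school", "post-graduate", "graduate"])

# Replace each label with its rank
encoded_levels = levels.map(education_order)
print(encoded_levels.tolist())
```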

In [41]:
from sklearn.preprocessing import OrdinalEncoder

In [42]:
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'huge', 'small', 'large', 'medium', 'small', 'large', 'huge', 'huge'],
})

In [43]:
df

,size
0,small
1,medium
2,large
3,medium
4,huge
5,small
6,large
7,medium
8,small
9,large
10,huge
11,huge


In [44]:
encoder=OrdinalEncoder(categories=[['small','medium','large','huge']])

In [45]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [3.],
       [0.],
       [2.],
       [1.],
       [0.],
       [2.],
       [3.],
       [3.]])

In [46]:
encoder.transform([['small']])



array([[0.]])

In [47]:
encoder.transform([['huge']])



array([[3.]])
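
The `categories` argument is what gives the encoder the intended ranking. As a sketch of why it matters, the comparison below fits one encoder without it and one with it on the same four labels; without an explicit order, scikit-learn sorts string categories alphabetically, which does not match the size ranking.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large", "huge"]})

# No explicit order: categories are sorted alphabetically
# (huge, large, medium, small)
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes[["size"]]).ravel().tolist()

# Explicit order: codes follow the intended ranking
ordered_encoder = OrdinalEncoder(categories=[["small", "medium", "large", "huge"]])
ordered_codes = ordered_encoder.fit_transform(sizes[["size"]]).ravel().tolist()

print(default_codes)
print(ordered_codes)
```

With the default order, "small" receives the largest code and "huge" the smallest, so a model would see the ranking reversed.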