# One-Hot Encoding




One-hot encoding is a technique used in machine learning and data preprocessing to convert categorical variables into a binary matrix format, where each category is represented by a unique binary vector.

It's often used with algorithms that require numerical input, such as:
- neural networks
- decision trees

First, the categorical values are `integer mapped`; then each integer value is converted into a binary vector of fixed length, with one position per category.

Each position in the vector corresponds to one category.
A position is 1 if its category matches the original value and 0 otherwise, so every vector contains exactly one 1.

The resulting binary vectors can be used as input features in machine learning models.
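The two steps described above (integer mapping, then conversion to a binary vector) can be sketched with NumPy alone. This is a minimal illustration, not the method used later in this notebook; the colour values are made up for the example.

```python
import numpy as np

# Hypothetical example values, not taken from the dataset used later.
values = np.array(["red", "green", "blue", "green"])

# Step 1: integer mapping. np.unique returns the sorted categories and,
# with return_inverse=True, the integer code of each original value.
categories, codes = np.unique(values, return_inverse=True)

# Step 2: each integer code selects one row of the identity matrix,
# giving a vector with a single 1 at the position of its category.
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)  # sorted category labels
print(codes)       # integer code per value
print(one_hot)     # one row per value, one column per category
```

Each row of `one_hot` sums to 1, and the column order follows the sorted category labels.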

## Advantages

- allows use of categorical variables in models that require numerical input
- can improve model performance by giving the model more information about the categorical variable
- suitable for categorical variables with a small number of categories
- avoids implying any ordinal relationship between categories
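To make the last point concrete, the sketch below compares distances between categories under plain integer coding and under one-hot coding. It is only an illustration: the three labels are treated as unordered, and the distance measures are chosen for the example.

```python
import numpy as np

# Three labels, treated here as unordered categories.
categories = ["Good", "Great", "Nice"]

# Integer coding: codes 0, 1, 2 make "Good" look closer to "Great" than to "Nice".
codes = np.arange(len(categories))
int_dist = np.abs(codes[:, None] - codes[None, :])

# One-hot coding: every pair of distinct categories is the same distance apart.
one_hot = np.eye(len(categories))
oh_dist = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=-1)

print(int_dist)
print(oh_dist.round(3))
```

With integer codes the pairwise distances differ (1 or 2), while with one-hot vectors every off-diagonal distance is the square root of 2, so no ordering is suggested to the model.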

## Limitations

- can lead to increased dimensionality
- can lead to sparse data
- can lead to overfitting
- doesn't capture relationships between categories, as it treats all categories as independent
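The dimensionality and sparsity points can be seen in a small experiment. The sketch below uses a synthetic column with 50 distinct labels (an assumption for illustration, not data from this notebook) and counts the resulting columns and the share of zeros.

```python
import pandas as pd

# Synthetic data: 50 hypothetical city labels, each repeated 20 times (1000 rows).
cities = [f"city_{i}" for i in range(50)]
df = pd.DataFrame({"city": cities * 20})

# One column is created per distinct label.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

n_cols = encoded.shape[1]
zero_fraction = (encoded.to_numpy() == 0).mean()

print(n_cols)         # number of one-hot columns
print(zero_fraction)  # share of entries that are 0
```

A single column becomes 50 columns, and because each row has exactly one 1, 49 out of every 50 entries are zero.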

In [162]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


In [163]:
data = {"Employee_ID": [45, 78, 56, 12, 7, 68, 23, 45, 89, 75, 47, 62],
        "Gender": ["Male", "Female", "Female", "Male", "Female", "Female", "Male", "Female", "Male", "Female", "Female", "Male"],
        "Remarks": ["Nice", "Good", "Great", "Great", "Nice", "Great", "Good", "Nice", "Great", "Nice", "Good", "Nice"]
        }

data = pd.DataFrame(data)
data.head()

|   | Employee_ID | Gender | Remarks |
|---|---|---|---|
| 0 | 45 | Male | Nice |
| 1 | 78 | Female | Good |
| 2 | 56 | Female | Great |
| 3 | 12 | Male | Great |
| 4 | 7 | Female | Nice |


In [164]:
# using the pandas get_dummies function
# dtype=int gives 0/1 columns; pandas 2.0 and later default to True/False

one_hot_encoded_data = pd.get_dummies(data, columns=['Remarks', 'Gender'], dtype=int)
one_hot_encoded_data

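|   | Employee_ID | Remarks_Good | Remarks_Great | Remarks_Nice | Gender_Female | Gender_Male |
|---|---|---|---|---|---|---|
| 0 | 45 | 0 | 0 | 1 | 0 | 1 |
| 1 | 78 | 1 | 0 | 0 | 1 | 0 |
| 2 | 56 | 0 | 1 | 0 | 1 | 0 |
| 3 | 12 | 0 | 1 | 0 | 0 | 1 |
| 4 | 7 | 0 | 0 | 1 | 1 | 0 |
| 5 | 68 | 0 | 1 | 0 | 1 | 0 |
| 6 | 23 | 1 | 0 | 0 | 0 | 1 |
| 7 | 45 | 0 | 0 | 1 | 1 | 0 |
| 8 | 89 | 0 | 1 | 0 | 0 | 1 |
| 9 | 75 | 0 | 0 | 1 | 1 | 0 |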


### Using the scikit-learn Library

In [165]:
data

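|   | Employee_ID | Gender | Remarks |
|---|---|---|---|
| 0 | 45 | Male | Nice |
| 1 | 78 | Female | Good |
| 2 | 56 | Female | Great |
| 3 | 12 | Male | Great |
| 4 | 7 | Female | Nice |
| 5 | 68 | Female | Great |
| 6 | 23 | Male | Good |
| 7 | 45 | Female | Nice |
| 8 | 89 | Male | Great |
| 9 | 75 | Female | Nice |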


In [166]:
data['Gender'] = data['Gender'].astype('category')
data['Remarks'] = data['Remarks'].astype('category')

In [167]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   Employee_ID  12 non-null     int64   
 1   Gender       12 non-null     category
 2   Remarks      12 non-null     category
dtypes: category(2), int64(1)
memory usage: 504.0 bytes


In [168]:
data['Gen_new'] = data['Gender'].cat.codes
data['Rem_new'] = data['Remarks'].cat.codes
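The integer codes produced by `.cat.codes` are positions in the column's category index, which pandas sorts by default when a column is converted with `astype('category')`. The sketch below, using a shortened copy of the `Remarks` values, shows one way to recover the code-to-label mapping; the variable names are chosen for the example.

```python
import pandas as pd

# A shortened copy of the Remarks column, converted to the category dtype.
remarks = pd.Series(["Nice", "Good", "Great", "Great", "Nice"]).astype("category")

# Each code is the position of the label in remarks.cat.categories.
code_to_label = dict(enumerate(remarks.cat.categories))

print(code_to_label)
print(remarks.cat.codes.tolist())
```

Here `Good`, `Great`, and `Nice` map to 0, 1, and 2, which matches the `Rem_new` values in this notebook.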


In [169]:
data['Remarks'].value_counts()

Nice     5
Great    4
Good     3
Name: Remarks, dtype: int64

In [170]:
data

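|   | Employee_ID | Gender | Remarks | Gen_new | Rem_new |
|---|---|---|---|---|---|
| 0 | 45 | Male | Nice | 1 | 2 |
| 1 | 78 | Female | Good | 0 | 0 |
| 2 | 56 | Female | Great | 0 | 1 |
| 3 | 12 | Male | Great | 1 | 1 |
| 4 | 7 | Female | Nice | 0 | 2 |
| 5 | 68 | Female | Great | 0 | 1 |
| 6 | 23 | Male | Good | 1 | 0 |
| 7 | 45 | Female | Nice | 0 | 2 |
| 8 | 89 | Male | Great | 1 | 1 |
| 9 | 75 | Female | Nice | 0 | 2 |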


In [171]:

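# fit the encoder on the integer-coded columns; fit_transform returns a sparse
# matrix by default, so convert it to a dense array
enc = OneHotEncoder()

enc_data = enc.fit_transform(data[['Gen_new', 'Rem_new']]).toarray()

# name each output column after its input column and category code
enc_df = pd.DataFrame(enc_data, columns=enc.get_feature_names_out(['Gen_new', 'Rem_new']))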

In [172]:
enc_df

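|   | Gen_new_0 | Gen_new_1 | Rem_new_0 | Rem_new_1 | Rem_new_2 |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 7 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 9 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |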


In [173]:
new_df = pd.concat([data, enc_df], axis=1)
new_df

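|   | Employee_ID | Gender | Remarks | Gen_new | Rem_new | Gen_new_0 | Gen_new_1 | Rem_new_0 | Rem_new_1 | Rem_new_2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | Male | Nice | 1 | 2 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 78 | Female | Good | 0 | 0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 56 | Female | Great | 0 | 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 12 | Male | Great | 1 | 1 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | 7 | Female | Nice | 0 | 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 5 | 68 | Female | Great | 0 | 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 6 | 23 | Male | Good | 1 | 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 7 | 45 | Female | Nice | 0 | 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 8 | 89 | Male | Great | 1 | 1 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 9 | 75 | Female | Nice | 0 | 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |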
