### 🔠 Encoding

**Encoding** is the process of converting categorical data (like labels or text) into a numerical format so that it can be used in machine learning models, which typically require numeric inputs.

---

### 🧊 One-Hot Encoding

**One-Hot Encoding** is a technique used to represent categorical variables as binary vectors. Each unique category is transformed into a new column, and a value of 1 is placed in the column corresponding to the category for a given observation, with 0s in all other columns.
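As a minimal sketch of the mechanism described above (not how any particular library implements it), the mapping from categories to binary vectors can be written with NumPy alone. The variable names `labels`, `categories`, and `one_hot` are illustrative.

```python
import numpy as np

# Illustrative labels; any 1D array of category names works the same way.
labels = np.array(["Male", "Female", "Female", "Male"])

# np.unique returns the sorted unique categories and, for every label,
# the index of its category in that sorted array.
categories, inverse = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the binary vector for category i,
# so indexing it with `inverse` yields one row per observation.
one_hot = np.eye(len(categories), dtype=int)[inverse]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, located in the column of that observation's category.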

---

### 👥 Example: Gender

Suppose we have the following data:

| Person | Gender |
|--------|--------|
| A      | Male   |
| B      | Female |
| C      | Female |
| D      | Male   |

Using One-Hot Encoding, it becomes:

| Person | Gender_Female | Gender_Male |
|--------|----------------|--------------|
| A      | 0              | 1            |
| B      | 1              | 0            |
| C      | 1              | 0            |
| D      | 0              | 1            |

This allows categorical variables to be used in ML models without introducing unintended ordinal relationships.
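To see what an unintended ordinal relationship looks like, the sketch below compares plain integer codes with one-hot columns for the Gender values in the table above. It assumes pandas is available; the names `codes` and `one_hot` are illustrative.

```python
import pandas as pd

gender = pd.Series(["Male", "Female", "Female", "Male"], name="Gender")

# Integer codes assign Female=0 and Male=1 (alphabetical order), so a model
# may read this as "Male > Female", an ordering that means nothing here.
codes = gender.astype("category").cat.codes

# One-hot columns give every category its own 0/1 indicator instead,
# so no category is numerically larger than another.
one_hot = pd.get_dummies(gender, prefix="Gender", dtype=int)

print(codes.tolist())
print(one_hot)
```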


In [15]:
# import the required packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

In [None]:
# prepare the data; person E has a missing Gender value (NaN)
data = {
    'Person': ['A', 'B', 'C', 'D', 'E'],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan]
}
dataset = pd.DataFrame(data)
dataset

|   | Person | Gender |
|---|--------|--------|
| 0 | A      | Male   |
| 1 | B      | Female |
| 2 | C      | Female |
| 3 | D      | Male   |
| 4 | E      | NaN    |


In [29]:
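# count the missing values per column, then replace the missing Gender
# with an explicit 'Unknown' category so it can be encoded
print(dataset.isnull().sum())
dataset['Gender'] = dataset['Gender'].fillna('Unknown')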

In [30]:
# select the column to encode
en_data = dataset['Gender']

In [31]:
# one-hot encoding using pandas (one boolean column per category)
pd.get_dummies(en_data)

|   | Female | Male  | Unknown |
|---|--------|-------|---------|
| 0 | False  | True  | False   |
| 1 | True   | False | False   |
| 2 | True   | False | False   |
| 3 | False  | True  | False   |
| 4 | False  | False | True    |
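The call above encodes a single Series and returns boolean columns. As a hedged sketch of a common variation, `pd.get_dummies` can also be applied to the whole DataFrame with `columns=` so the other columns (here `Person`) are kept, and `dtype=int` gives 0/1 values instead of booleans. The data is recreated here so the example runs on its own.

```python
import pandas as pd

# Same data as above, with the missing value already filled as 'Unknown'.
dataset = pd.DataFrame({
    'Person': ['A', 'B', 'C', 'D', 'E'],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Unknown'],
})

# Encode only the 'Gender' column; 'Person' is passed through unchanged.
# The new column names are prefixed with the original column name.
encoded = pd.get_dummies(dataset, columns=['Gender'], dtype=int)
print(encoded)
```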


In [35]:
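# one-hot encoding using scikit-learn
en_data = dataset[['Gender']]  # double brackets keep a 2D DataFrame, which OneHotEncoder expects
ohe = OneHotEncoder()
arr = ohe.fit_transform(en_data).toarray()  # fit_transform returns a sparse matrix; convert it to a dense array
feature_names = ohe.get_feature_names_out(['Gender'])  # column names such as 'Gender_Female'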

In [36]:
pd.DataFrame(arr, columns=feature_names)

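|   | Gender_Female | Gender_Male | Gender_Unknown |
|---|---------------|-------------|----------------|
| 0 | 0.0           | 1.0         | 0.0            |
| 1 | 1.0           | 0.0         | 0.0            |
| 2 | 1.0           | 0.0         | 0.0            |
| 3 | 0.0           | 1.0         | 0.0            |
| 4 | 0.0           | 0.0         | 1.0            |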


In [37]:
# drop='first' removes the first category of each feature (here Gender_Female);
# it is then represented by all zeros, which avoids a redundant, perfectly collinear column
ohe = OneHotEncoder(drop='first')
arr = ohe.fit_transform(en_data).toarray()
feature_names = ohe.get_feature_names_out(['Gender'])
pd.DataFrame(arr, columns=feature_names)

|   | Gender_Male | Gender_Unknown |
|---|-------------|----------------|
| 0 | 1.0         | 0.0            |
| 1 | 0.0         | 0.0            |
| 2 | 0.0         | 0.0            |
| 3 | 1.0         | 0.0            |
| 4 | 0.0         | 1.0            |
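Dropping a column loses no information: a row of all zeros stands for the dropped category. As a small check under the same setup, the sketch below uses the encoder's `drop_idx_` attribute and `inverse_transform` to recover the original labels. The data is recreated so the block runs on its own, and the names `dropped` and `recovered` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

en_data = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male', 'Unknown']})

ohe = OneHotEncoder(drop='first')
arr = ohe.fit_transform(en_data).toarray()

# drop_idx_ holds, per feature, the index of the dropped category.
dropped = ohe.categories_[0][ohe.drop_idx_[0]]
print("Dropped category:", dropped)

# inverse_transform maps all-zero rows back to the dropped category.
recovered = ohe.inverse_transform(arr)
print(recovered[:, 0].tolist())
```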
