# One-Hot Encoding - Jupyter Notebook

# Introduction

This notebook demonstrates **One-Hot Encoding** in two ways:

1. **A simple example** for conceptual understanding.
2. **Applying One-Hot Encoding** to a real-world dataset loaded from a CSV file.


# Importing Required Libraries

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Simple Example for Understanding

In [11]:
# Creating a small dataset with categorical values
data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange']}
df_simple = pd.DataFrame(data)

# Applying One-Hot Encoding
encoder = OneHotEncoder(sparse_output = False)
df_encoded = pd.DataFrame(encoder.fit_transform(df_simple['Fruit'].values.reshape(-1,1)), columns = encoder.get_feature_names_out())

# Display the results
print("Simple Example - One Hot Encoding:\n\n",df_encoded)

Simple Example - One Hot Encoding:

    x0_Apple  x0_Banana  x0_Orange
0       1.0        0.0        0.0
1       0.0        1.0        0.0
2       0.0        0.0        1.0
3       0.0        1.0        0.0
4       1.0        0.0        0.0
5       0.0        0.0        1.0


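# Another Way: Passing a DataFrame Instead of a NumPy Array

Selecting the column as a one-column DataFrame (`df_simple[['Fruit']]`) lets the encoder see the column name, so the output columns are named `Fruit_*` instead of the generic `x0_*`.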

In [13]:
# Creating a small dataset with categorical values
data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange']}
df_simple = pd.DataFrame(data)

# Applying One-Hot Encoding
encoder = OneHotEncoder(sparse_output = False)
df_encoded = pd.DataFrame(encoder.fit_transform(df_simple[['Fruit']]), columns = encoder.get_feature_names_out())

# Display the results
print("Simple Example - One Hot Encoding:\n\n",df_encoded)

Simple Example - One Hot Encoding:

    Fruit_Apple  Fruit_Banana  Fruit_Orange
0          1.0           0.0           0.0
1          0.0           1.0           0.0
2          0.0           0.0           1.0
3          0.0           1.0           0.0
4          1.0           0.0           0.0
5          0.0           0.0           1.0


# Real-World Example - Applying One-Hot Encoding on a CSV Dataset

In [33]:
# Load dataset
df_real = pd.read_csv("sample_data.csv")

# Display first rows
print("\nReal-World Dataset (Before Encoding):\n")
df_real.head()


Real-World Dataset (Before Encoding):



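|  | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |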


In [34]:
# Apply One-Hot Encoding to the Geography column (this re-fits the encoder on Geography)
df_geo_encoded = pd.DataFrame(
    encoder.fit_transform(df_real[["Geography"]]),
    columns=encoder.get_feature_names_out(),
    index=df_real.index,  # keep the original index so a later concat aligns row by row
)

# Display the encoded columns
print("\nGeography column (After Encoding):\n")
df_geo_encoded


Geography column (After Encoding):



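|  | Geography_France | Geography_Germany | Geography_Spain |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 1.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... |
| 9995 | 1.0 | 0.0 | 0.0 |
| 9996 | 1.0 | 0.0 | 0.0 |
| 9997 | 1.0 | 0.0 | 0.0 |
| 9998 | 0.0 | 1.0 | 0.0 |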


In [29]:
# Merge the encoded columns with the dataset and drop the original Geography column
print("\nReal-World Dataset (After Encoding):\n")
df_real = pd.concat([df_geo_encoded, df_real], axis=1).drop(columns="Geography")
df_real


Real-World Dataset (After Encoding):



|  | Geography_France | Geography_Germany | Geography_Spain | RowNumber | CustomerId | Surname | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1 | 15634602 | Hargrave | 619 | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 0.0 | 0.0 | 1.0 | 2 | 15647311 | Hill | 608 | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 1.0 | 0.0 | 0.0 | 3 | 15619304 | Onio | 502 | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 1.0 | 0.0 | 0.0 | 4 | 15701354 | Boni | 699 | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 0.0 | 0.0 | 1.0 | 5 | 15737888 | Mitchell | 850 | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 1.0 | 0.0 | 0.0 | 9996 | 15606229 | Obijiaku | 771 | Male | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 |
| 9996 | 1.0 | 0.0 | 0.0 | 9997 | 15569892 | Johnstone | 516 | Male | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 |
| 9997 | 1.0 | 0.0 | 0.0 | 9998 | 15584532 | Liu | 709 | Female | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
| 9998 | 0.0 | 1.0 | 0.0 | 9999 | 15682355 | Sabbatini | 772 | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |


# Difference Between fit(), fit_transform(), and transform()

| Method           | Description |
|-----------------|-------------|
| **fit()**       | Learns the unique categories in each column (stored in `categories_`) but does **not** transform the data. |
| **fit_transform()** | Learns the categories and transforms the same data in one step. |
| **transform()**  | Encodes data using the categories already learned, without re-learning them. |


In [36]:
# Fitting the encoder without transforming
encoder.fit(df_simple[['Fruit']])
print("\nClasses learned:", encoder.get_feature_names_out(['Fruit']))

# Transforming separately (a DataFrame with the same column name avoids a feature-name warning)
encoded_values = encoder.transform(pd.DataFrame({'Fruit': ['Apple', 'Orange']}))
print("\nTransforming ['Apple', 'Orange']:\n", encoded_values)


Classes learned: ['Fruit_Apple' 'Fruit_Banana' 'Fruit_Orange']

Transforming ['Apple', 'Orange']:
 [[1. 0. 0.]
 [0. 0. 1.]]




# Reversing Encoding (Decoding back to original)

In [37]:
decoded_values = encoder.inverse_transform(encoded_values)
print("\nDecoded back to original:", decoded_values)


Decoded back to original: [['Apple']
 ['Orange']]


# Important Notes About One-Hot Encoding

### ⚠️ Important Tips for One-Hot Encoding  

1️⃣ **If you call `fit()` again on new data, it will overwrite the previous categories.**  
   - Example: If you fit on `['Apple', 'Banana', 'Orange']`, then later fit on `['Grapes', 'Mango']`,  
     the original categories will be lost, and new ones will be learned.  

2️⃣ **Always use `transform()` on new data** to maintain consistency with previously learned categories.  

3️⃣ **If you encode training data and test data separately, there might be inconsistencies.**  
   - Example: `'Apple'` might be `[1,0,0]` in training but `[0,1,0]` in testing if categories differ.  
   - **Solution:** Always `fit()` on training data and only `transform()` on test data to ensure category order remains the same.  

✅ **Best Practice:** Always `fit()` once on a reference dataset and `transform()` for new data to maintain consistency! 🚀  
