# Label Encoding - Jupyter Notebook

# Introduction

This notebook demonstrates **Label Encoding** in two parts:

1. **A simple example** for conceptual understanding.
2. **Applying Label Encoding** to a real-world dataset loaded from a CSV file.


# Importing Required Libraries

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Simple Example for Understanding

In [15]:
# Creating a small dataset with categorical values
data = {'Fruit': ['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange']}
df_simple = pd.DataFrame(data)

# Applying Label Encoding
encoder = LabelEncoder()
df_simple['Encoded_Fruit'] = encoder.fit_transform(df_simple['Fruit'])

# Display the results
print("Simple Example - Label Encoding:\n\n", df_simple)

Simple Example - Label Encoding:

     Fruit  Encoded_Fruit
0   Apple              0
1  Banana              1
2  Orange              2
3  Banana              1
4   Apple              0
5  Orange              2
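
The codes above are not arbitrary: `LabelEncoder` stores the unique labels in sorted order in `classes_`, and each label's position in that array is its code. The sketch below makes the mapping explicit; the names `fruit_encoder` and `mapping` are illustrative and not part of the notebook's earlier cells.

```python
from sklearn.preprocessing import LabelEncoder

fruits = ['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange']

# A separate encoder, so the notebook's `encoder` variable is left untouched
fruit_encoder = LabelEncoder()
codes = fruit_encoder.fit_transform(fruits)

# classes_ holds the sorted unique labels; the index of a label is its code
mapping = {str(label): code for code, label in enumerate(fruit_encoder.classes_)}
print(mapping)  # {'Apple': 0, 'Banana': 1, 'Orange': 2}
```

Building the dictionary from `classes_` is a convenient way to document or log which number stands for which category.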


# Real-World Example - Applying Label Encoding on a CSV Dataset

In [19]:
# Load dataset
df_real = pd.read_csv("sample_data.csv")

# Display first few rows
print("\nReal-World Dataset (Before Encoding):\n", df_real.head())


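# Apply Label Encoding on the Gender column.
# A dedicated encoder is used so the fruit labels learned earlier are not overwritten.
gender_encoder = LabelEncoder()
df_real["Gender"] = gender_encoder.fit_transform(df_real["Gender"])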

# Display first few rows after encoding
print("\nReal-World Dataset (After Encoding):\n", df_real.head())


Real-World Dataset (Before Encoding):
     User ID  Gender  Age  EstimatedSalary  Purchased
0  15624510    Male   19            19000          0
1  15810944    Male   35            20000          0
2  15668575  Female   26            43000          0
3  15603246  Female   27            57000          0
4  15804002    Male   19            76000          0

Real-World Dataset (After Encoding):
     User ID  Gender  Age  EstimatedSalary  Purchased
0  15624510       1   19            19000          0
1  15810944       1   35            20000          0
2  15668575       0   26            43000          0
3  15603246       0   27            57000          0
4  15804002       1   19            76000          0
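
When a dataset has several categorical columns, one option is to keep one fitted encoder per column so that each mapping can be reused or reversed later. The following is a minimal sketch under stated assumptions: the small DataFrame stands in for `sample_data.csv`, and the `City` column and the `encoders` dictionary are hypothetical, added only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the CSV; the City column is invented for this sketch
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male"],
    "City": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})

# Keep one fitted encoder per column so mappings never overwrite each other
encoders = {}
for column in ["Gender", "City"]:
    column_encoder = LabelEncoder()
    df[column] = column_encoder.fit_transform(df[column])
    encoders[column] = column_encoder

print(df)
print({column: list(enc.classes_) for column, enc in encoders.items()})
```

With this layout, `encoders["Gender"].inverse_transform(...)` can recover the original labels for that column without affecting any other column.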


# Difference Between fit(), fit_transform(), and transform()

In [12]:
"""
fit(): Learns the unique labels (stored in sorted order in classes_) but does NOT transform any data.
fit_transform(): Learns the labels and transforms the same data in one step.
transform(): Encodes data using the labels already learned, without re-learning them;
             raises a ValueError for any label that was not seen during fit().
"""

# Fitting the encoder without transforming
encoder.fit(df_simple['Fruit'])
print("\nClasses learned:", encoder.classes_)

# Transforming separately
encoded_values = encoder.transform(['Apple', 'Orange'])
print("\nTransforming ['Apple', 'Orange']:", encoded_values)


Classes learned: ['Apple' 'Banana' 'Orange']

Transforming ['Apple', 'Orange']: [0 2]
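
To check the relationship between the three methods, the sketch below compares `fit_transform()` against a separate `fit()` followed by `transform()` on the same fruit list. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

fruits = ['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange']

# One step: learn the labels and encode the data together
one_step = LabelEncoder().fit_transform(fruits)

# Two steps: learn the labels first, then encode
two_step_encoder = LabelEncoder()
two_step_encoder.fit(fruits)
two_step = two_step_encoder.transform(fruits)

# Both routes produce the same codes
print(np.array_equal(one_step, two_step))  # True
```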


# Reversing Encoding (Decoding back to original)

In [16]:
decoded_values = encoder.inverse_transform(encoded_values)
print("\nDecoded back to original:", decoded_values)


Decoded back to original: ['Apple' 'Orange']
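
Decoding only works for codes the encoder actually knows. As a small sketch (with illustrative names such as `roundtrip_encoder`), the example below confirms that encoding followed by decoding restores the original list, and that a code outside the learned range is rejected with a `ValueError`.

```python
from sklearn.preprocessing import LabelEncoder

fruits = ['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange']

roundtrip_encoder = LabelEncoder()
codes = roundtrip_encoder.fit_transform(fruits)

# Encoding and then decoding returns the original labels
restored = roundtrip_encoder.inverse_transform(codes).tolist()
print(restored == fruits)  # True

# Only codes 0, 1 and 2 were learned, so 3 cannot be decoded
decode_failed = False
try:
    roundtrip_encoder.inverse_transform([3])
except ValueError as err:
    decode_failed = True
    print("Cannot decode:", err)
```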


# Important Notes About Label Encoding

### ⚠️ Important Tips for Label Encoding

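1️⃣ **Calling `fit()` again on new data replaces the previously learned labels.**  
   - Example: If you fit on `['Apple', 'Banana', 'Orange']` and later fit on `['Grapes', 'Mango']`,  
     the encoder only knows `'Grapes'` and `'Mango'`; the old mapping is lost.

2️⃣ **Always use `transform()` on new data** to stay consistent with the previously learned labels.  
   - Note: `transform()` raises a `ValueError` if the new data contains a label that was not seen during `fit()`.

3️⃣ **Encoding training data and test data with separate fits can produce inconsistent codes.**  
   - Codes follow the sorted order of the labels seen during `fit()`, so they depend on which labels are present.  
   - Example: `'Orange'` is `2` when fitted on `['Apple', 'Banana', 'Orange']` but `1` when fitted on `['Apple', 'Orange']`.  
   - **Solution:** Always `fit()` on training data and only `transform()` on test data.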

✅ **Best Practice:** Always `fit()` once on a reference dataset and `transform()` for new data!
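
The tips above can be reproduced in a few lines. This is a minimal sketch, assuming small hand-made `train` and `test` lists rather than a real dataset; all variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

train = ['Apple', 'Banana', 'Orange']
test = ['Orange', 'Apple']

# Refitting replaces the learned labels
refit_encoder = LabelEncoder()
refit_encoder.fit(train)
refit_encoder.fit(['Grapes', 'Mango'])
classes_after_refit = refit_encoder.classes_.tolist()
print(classes_after_refit)  # ['Grapes', 'Mango']

# Fitting separately on train and test gives inconsistent codes
train_codes = LabelEncoder().fit_transform(train).tolist()          # Orange -> 2
separate_test_codes = LabelEncoder().fit_transform(test).tolist()   # Orange -> 1
print(train_codes, separate_test_codes)

# Fit once on the training data, then only transform the test data
shared_encoder = LabelEncoder().fit(train)
consistent_test_codes = shared_encoder.transform(test).tolist()     # Orange -> 2
print(consistent_test_codes)

# transform() rejects labels that were not seen during fit()
unseen_rejected = False
try:
    shared_encoder.transform(['Mango'])
except ValueError as err:
    unseen_rejected = True
    print("Unseen label:", err)
```

If unseen categories are expected in new data, they need to be handled before calling `transform()`, for example by mapping them to a placeholder label that was included when fitting.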
