# 🔤 What is Encoding?

Encoding means converting something into a new format (usually numbers) so that a computer can understand it.

Machine learning models work with numbers, not with raw text in Hindi, English, or any other language.
So, when our data contains text or categories (like "male", "female", or "red", "blue"), we must convert them into numbers. This process is called encoding.

# 🔧 Types of Encoding (In Simple English):

# 1. Label Encoding:
Each category is given a unique number.

Example:

In [None]:
# red    → 0  
# green  → 1  
# blue   → 2

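# 🧠 When to use? Mainly for the target column (the labels you want to predict). sklearn's LabelEncoder numbers categories in sorted (alphabetical) order, so it does not preserve a natural order like "low", "medium", "high"; use Ordinal Encoding for that.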

# ✅ 1. Label Encoding using sklearn

In [4]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

# Label Encoding
le = LabelEncoder()
data['Color_Label'] = le.fit_transform(data['Color'])

print(data)


   Color  Color_Label
0    Red            2
1   Blue            0
2  Green            1
3   Blue            0
4    Red            2


# 🔍 Explanation:
First, we created a LabelEncoder() object.

Then, we used fit_transform() to convert text into numbers.

Each unique category was assigned a unique number, in alphabetical order: Blue → 0, Green → 1, Red → 2.

📌 Note: The model might assume 2 > 1 > 0 (i.e., Red > Green > Blue), which may not be true. If order doesn't matter, prefer One-Hot Encoding.
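
To check which number went to which category, the fitted encoder can be inspected. The sketch below assumes the same colors as above; the names `colors`, `mapping`, and `decoded` are only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Blue', 'Green', 'Blue', 'Red']

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ lists the categories in sorted order; the position is the code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)  # {'Blue': 0, 'Green': 1, 'Red': 2}

# inverse_transform turns the numbers back into the original text
decoded = le.inverse_transform(codes)
print(decoded.tolist())  # ['Red', 'Blue', 'Green', 'Blue', 'Red']
```

This shows that the numbers come from sorting the category names, not from any real ranking.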

# 2. One-Hot Encoding:

For each category, a separate binary column (1 or 0) is created.

Example:

In [2]:
# red    → [1, 0, 0]  
# green  → [0, 1, 0]  
# blue   → [0, 0, 1]

# 🧠 When to use? When there is no order among categories (like colors, cities, or countries).

# ✅ 2. One-Hot Encoding using pandas

In [5]:
import pandas as pd

# Sample data
data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

# One-Hot Encoding
data_encoded = pd.get_dummies(data, columns=['Color'])

print(data_encoded)


   Color_Blue  Color_Green  Color_Red
0       False        False       True
1        True        False      False
2       False         True      False
3        True        False      False
4       False        False       True


# 🔍 Explanation:
We used pd.get_dummies() to create a separate column for each category.

For each row, the column corresponding to its category is marked True and the others are False (recent pandas versions use booleans here instead of 1 and 0).
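
If 1 and 0 are preferred over True and False, `pd.get_dummies()` accepts a `dtype` argument. A minimal sketch on the same sample data:

```python
import pandas as pd

data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

# dtype=int asks for 1/0 columns instead of True/False
data_encoded = pd.get_dummies(data, columns=['Color'], dtype=int)

print(data_encoded)
```

Each row then contains exactly one 1, in the column of its own category.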

# 3. Ordinal Encoding:
Similar to label encoding, but you specify the order of the categories yourself, so the numbers preserve their ranking.

Example:

In [6]:
# Low     → 1  
# Medium  → 2  
# High    → 3

# 🧠 When to use? When the categories have a meaningful sequence or ranking.
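
One possible way to do this is sklearn's `OrdinalEncoder` with an explicit category order. This is a sketch, not the only approach; the column name `Level` and the sample values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Sample data (hypothetical column)
data = pd.DataFrame({
    'Level': ['Low', 'High', 'Medium', 'Low', 'High']
})

# Give the order explicitly; otherwise categories are sorted alphabetically
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])

# OrdinalEncoder starts counting at 0, so add 1 to match Low → 1, Medium → 2, High → 3
data['Level_Ordinal'] = oe.fit_transform(data[['Level']]).ravel().astype(int) + 1

print(data)
```

Without the `categories` argument, the alphabetical order would be High, Low, Medium, which would not reflect the real ranking.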

# ✅ 3. One-Hot Encoding using sklearn (alternative to pd.get_dummies)

In [8]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data
data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

# OneHotEncoder needs a 2D array
ohe = OneHotEncoder(sparse_output=False)
encoded_array = ohe.fit_transform(data[['Color']])

# Convert array to DataFrame
encoded_df = pd.DataFrame(encoded_array, columns=ohe.get_feature_names_out(['Color']))

# Combine with original data (optional)
final_data = pd.concat([data, encoded_df], axis=1)

print(final_data)


   Color  Color_Blue  Color_Green  Color_Red
0    Red         0.0          0.0        1.0
1   Blue         1.0          0.0        0.0
2  Green         0.0          1.0        0.0
3   Blue         1.0          0.0        0.0
4    Red         0.0          0.0        1.0


# 🔍 Explanation:

We created a OneHotEncoder() object.

fit_transform() was used to convert categories into binary vectors.

sparse_output=False was used to return a regular NumPy array instead of a sparse matrix (in scikit-learn versions before 1.2 this parameter was named sparse).

Column names were automatically generated using get_feature_names_out().

# 💡 Real-Life Example:

Suppose you have a column with gender values:
["Male", "Female", "Female", "Male"]

Label Encoding: ["Male", "Female", "Female", "Male"] → [1, 0, 0, 1] (alphabetical order: Female → 0, Male → 1)

One-Hot Encoding (columns in the order [Male, Female]):
Male   → [1, 0]  
Female → [0, 1]
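
The gender example can be reproduced in code roughly like this. It is a sketch using the same tools as the sections above; the column name `Gender` is an assumption, and the one-hot columns are reordered to [Male, Female] only to match the listing.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male']
})

# Label Encoding: Female → 0, Male → 1 (sorted order)
data['Gender_Label'] = LabelEncoder().fit_transform(data['Gender'])

# One-Hot Encoding: one 1/0 column per category, reordered to [Male, Female]
one_hot = pd.get_dummies(data['Gender'], dtype=int)[['Male', 'Female']]

result = pd.concat([data, one_hot], axis=1)
print(result)
```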
