###### Most machine learning algorithms require numerical input. This section covers techniques for converting categorical features (text labels such as 'Red', 'Blue', 'High', 'Low') into a numerical format that algorithms can work with. The right encoding method depends on whether the categorical data is nominal or ordinal.
---

## Encoding Categorical Data in Python

This document covers:

* **Types:** Differentiates between Nominal and Ordinal categorical data.
* **Ordinal Encoding:** Demonstrates `OrdinalEncoder` for features with inherent order, emphasizing the need to specify the category sequence.
* **One-Hot Encoding (OHE):** Shows how to use `OneHotEncoder` (recommended for pipelines) and `pd.get_dummies` (for convenience) for nominal features, explaining parameters like `drop` and `handle_unknown`. Discusses the dimensionality increase.
* **Label Encoding:** Explains `LabelEncoder` and why it's typically used only for the target variable `y`.
* **Considerations:** Summarizes key points about choosing the right encoder and applying it correctly within an ML workflow.
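
The workflow point in the last bullet can be sketched as follows. This is a minimal example under stated assumptions, not the notebook's own code: the `toy` DataFrame, the `preprocess` name, and the chosen category order are hypothetical and only serve to show the split-first, fit-on-training-only pattern with a `ColumnTransformer`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one nominal and one ordinal feature.
toy = pd.DataFrame({
    "Color": ["Red", "Green", "Blue", "Green", "Red", "Blue", "Green", "Red"],
    "Size": ["M", "L", "S", "M", "L", "M", "S", "S"],
})

# Split BEFORE fitting any encoder so test rows cannot influence it.
train, test = train_test_split(toy, test_size=0.25, random_state=0)

preprocess = ColumnTransformer([
    # Nominal feature: one binary column per category seen in training.
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["Color"]),
    # Ordinal feature: integer codes follow the explicitly stated order.
    ("ordinal", OrdinalEncoder(categories=[["S", "M", "L"]]), ["Size"]),
])

X_train = preprocess.fit_transform(train)  # learn categories from training rows only
X_test = preprocess.transform(test)        # reuse the categories learned above

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
```

Because the encoders are fitted once and then reused, the test matrix always has the same columns as the training matrix, even if a color is missing from one of the splits.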

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder

# --- 1. Understanding Categorical Data Types ---
# - Nominal Data: Categories with NO intrinsic order or ranking (e.g., Country, Color, Gender).
# - Ordinal Data: Categories with a meaningful order or ranking (e.g., Size [Small, Medium, Large],
#                 Education Level [High School, Bachelor, Master], Quality [Fair, Good, Great]).

# --- 2. Create Sample Data ---
data = {
    'ID': [1, 2, 3, 4, 5, 6, 7],
    'Color': ['Red', 'Green', 'Blue', 'Green', 'Red', 'Blue', 'Green'], # Nominal
    'Size': ['M', 'L', 'S', 'M', 'L', 'M', 'S'], # Ordinal
    'Quality': ['Good', 'Great', 'Fair', 'Good', 'Great', 'Good', 'Fair'], # Ordinal
    'Target': ['ClassA', 'ClassB', 'ClassA', 'ClassA', 'ClassB', 'ClassA', 'ClassB'] # Target variable
}
df = pd.DataFrame(data)

print("--- Original DataFrame with Categorical Features ---")
print(df)
df.info()
print("-" * 30)


# --- 3. Technique 1: Ordinal Encoding ---
# Assigns a unique integer to each category based on a specified order.
# Use Case: Suitable ONLY for ORDINAL features where the numerical order is meaningful.
#           Using it on nominal data can mislead models by implying a false order.

print("--- Ordinal Encoding (for Ordinal Features) ---")
df_ordinal = df.copy() # Work on a copy

# Define the explicit order for each ordinal feature
size_categories = ['S', 'M', 'L']
quality_categories = ['Fair', 'Good', 'Great']

# Initialize OrdinalEncoder with the specified categories
# The order of lists in 'categories' must match the order of columns passed to fit_transform
ordinal_encoder = OrdinalEncoder(categories=[size_categories, quality_categories])

# Apply to the ordinal columns (the encoder returns float codes by default, e.g. 0.0, 1.0, 2.0)
ordinal_cols = ['Size', 'Quality']
df_ordinal[ordinal_cols] = ordinal_encoder.fit_transform(df_ordinal[ordinal_cols])

print("DataFrame after Ordinal Encoding ('Size', 'Quality'):\n", df_ordinal)
print("\nLearned categories for OrdinalEncoder:\n", ordinal_encoder.categories_)
print("-" * 30)
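
# Optional check (a minimal sketch, not part of the original walkthrough):
# by default OrdinalEncoder raises an error on a category it did not see
# during fit. Assuming scikit-learn >= 0.24, handle_unknown='use_encoded_value'
# maps such values to a sentinel instead. 'demo_size_encoder' and the unseen
# 'XL' value are hypothetical and used only for illustration.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

demo_size_encoder = OrdinalEncoder(
    categories=[['S', 'M', 'L']],
    handle_unknown='use_encoded_value',
    unknown_value=-1,  # sentinel code for categories not listed above
)
demo_size_encoder.fit(pd.DataFrame({'Size': ['S', 'M', 'L']}))
demo_sizes = demo_size_encoder.transform(pd.DataFrame({'Size': ['M', 'XL']}))
print("Encoded 'M' and unseen 'XL':", demo_sizes.ravel())
# inverse_transform maps a known code back to its label
demo_decoded = demo_size_encoder.inverse_transform([[1.0]])
print("Decoded code 1.0:", demo_decoded.ravel())
print("-" * 30)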


# --- 4. Technique 2: One-Hot Encoding (OHE) ---
# Creates one new binary (0/1) column for each unique category in the original feature.
# Use Case: Standard approach for NOMINAL features. Avoids implying order.

print("--- One-Hot Encoding (for Nominal Features) ---")
df_ohe = df.copy() # Work on a copy

# a) Using sklearn.preprocessing.OneHotEncoder (Recommended for pipelines)
# Initialize OneHotEncoder
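# sparse_output=False: Returns a dense NumPy array (easier to view). Default is True (sparse matrix).
#                      (Requires scikit-learn >= 1.2; earlier versions call this parameter 'sparse'.)
# drop='first': Drops the first category in each feature to avoid multicollinearity (dummy variable trap).
#              Can also use drop='if_binary' to drop only for binary features. Default is None.
# handle_unknown='ignore': Assigns all zeros if an unknown category is encountered during transform (e.g., in test data).
#                          'error' (default) would raise an error.
#                          Caveat: combined with drop='first', an all-zero row is indistinguishable from the
#                          dropped category, and recent scikit-learn versions warn when this happens.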
ohe = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')

# Apply to the nominal column 'Color'
nominal_cols = ['Color']
# Fit the encoder on the training data (here, df_ohe) and transform
encoded_data = ohe.fit_transform(df_ohe[nominal_cols])

# Get the names of the new features created
# (e.g., 'Color_Green', 'Color_Red' - 'Color_Blue' was dropped due to drop='first')
feature_names_ohe = ohe.get_feature_names_out(nominal_cols)
print(f"\nGenerated One-Hot Feature Names: {feature_names_ohe}")

# Create a DataFrame with the new encoded columns
encoded_df = pd.DataFrame(encoded_data, columns=feature_names_ohe, index=df_ohe.index)

# Concatenate with the original DataFrame and drop the original nominal column
df_ohe = pd.concat([df_ohe.drop(nominal_cols, axis=1), encoded_df], axis=1)
print("\nDataFrame after One-Hot Encoding ('Color') using OneHotEncoder:\n", df_ohe)
print("-" * 20)
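
# Optional check (a minimal sketch, not part of the original walkthrough):
# how handle_unknown='ignore' behaves at transform time. 'demo_ohe' and the
# unseen 'Purple' value are hypothetical. No category is dropped here, so an
# all-zero row unambiguously means "unknown category".
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

demo_ohe = OneHotEncoder(handle_unknown='ignore')
demo_ohe.fit(pd.DataFrame({'Color': ['Red', 'Green', 'Blue']}))
# The default output is a sparse matrix; convert to dense for printing.
demo_rows = demo_ohe.transform(pd.DataFrame({'Color': ['Green', 'Purple']})).toarray()
print("Categories learned:", demo_ohe.categories_[0])
print("Encoded 'Green' and unseen 'Purple':\n", demo_rows)
print("-" * 20)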

# b) Using pandas.get_dummies() (Convenient for quick analysis)
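# Simpler syntax, but it has no fit/transform split: it does not remember the categories seen in
# training data, which makes it less suitable for use within Scikit-learn pipelines.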
df_dummies = df.copy()
dummies_df = pd.get_dummies(df_dummies, columns=['Color', 'Size'], drop_first=True, prefix=['Color', 'Size'])
# drop_first=True avoids multicollinearity
# prefix adds context to new column names
print("\nDataFrame after One-Hot Encoding using pd.get_dummies():\n", dummies_df)
# Note: get_dummies also encoded 'Size' here, which might not be desired if it's truly ordinal.
print("-" * 30)
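
# Optional check (a minimal sketch, not part of the original walkthrough):
# pd.get_dummies only encodes the categories present in the data it is given,
# so new data can produce a different set of columns. One common workaround,
# shown here with hypothetical 'demo_*' frames, is to reindex the new frame
# to the training columns and fill the missing ones with 0.
import pandas as pd

demo_train = pd.get_dummies(pd.DataFrame({'Color': ['Red', 'Green', 'Blue']}), columns=['Color'])
demo_new = pd.get_dummies(pd.DataFrame({'Color': ['Green']}), columns=['Color'])
demo_new_aligned = demo_new.reindex(columns=demo_train.columns, fill_value=0)
print("Training columns:", list(demo_train.columns))
print("New-data columns before aligning:", list(demo_new.columns))
print("New-data columns after aligning:", list(demo_new_aligned.columns))
print("-" * 30)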


# --- 5. Technique 3: Label Encoding (for Target Variable y) ---
# Assigns a unique integer (0 to n_classes-1) to each class label in the target variable.
# Use Case: Specifically for encoding the TARGET variable (y), not features (X).
#           Most Scikit-learn classifiers handle string labels directly for y,
#           but explicit encoding is sometimes needed or preferred.

print("--- Label Encoding (for Target Variable) ---")
df_label = df.copy()
target_col = 'Target'

le = LabelEncoder()

# Fit on the target column and transform it
df_label[target_col + '_Encoded'] = le.fit_transform(df_label[target_col])

print("DataFrame after Label Encoding the 'Target' column:\n", df_label[['Target', 'Target_Encoded']])
print(f"\nMapping learned by LabelEncoder ({target_col}):")
# Show the mapping from class name to encoded integer
for i, class_name in enumerate(le.classes_):
    print(f"  '{class_name}' -> {i}")
print("-" * 30)
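
# Optional check (a minimal sketch, not part of the original walkthrough):
# integer predictions can be mapped back to the original class names with
# inverse_transform. 'demo_le' and the sample codes are hypothetical.
from sklearn.preprocessing import LabelEncoder

demo_le = LabelEncoder()
demo_codes = demo_le.fit_transform(['ClassA', 'ClassB', 'ClassA'])
demo_labels = demo_le.inverse_transform([1, 0])
print("Encoded labels:", demo_codes)
print("Codes [1, 0] decoded back to:", demo_labels)
print("-" * 30)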


# --- 6. Considerations ---
print("--- Considerations ---")
print("- Choose encoding based on data type (Nominal vs. Ordinal).")
print("- One-Hot Encoding is standard for nominal data but increases dimensionality.")
print("- Ordinal Encoding implies order; use only for truly ordinal features.")
print("- Label Encoding is generally reserved for the target variable 'y'.")
print("- For high cardinality features (many unique categories), consider other techniques")
print("  like Binary Encoding, Feature Hashing, or Target Encoding (use with care).")
print("- Apply encoding *after* train-test split, fitting ONLY on training data.")
print("- Integrate encoders into Pipelines (Section VIII) for robust workflows.")
print("-" * 30)

--- Original DataFrame with Categorical Features ---
   ID  Color Size Quality  Target
0   1    Red    M    Good  ClassA
1   2  Green    L   Great  ClassB
2   3   Blue    S    Fair  ClassA
3   4  Green    M    Good  ClassA
4   5    Red    L   Great  ClassB
5   6   Blue    M    Good  ClassA
6   7  Green    S    Fair  ClassB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ID       7 non-null      int64 
 1   Color    7 non-null      object
 2   Size     7 non-null      object
 3   Quality  7 non-null      object
 4   Target   7 non-null      object
dtypes: int64(1), object(4)
memory usage: 412.0+ bytes
------------------------------
--- Ordinal Encoding (for Ordinal Features) ---
DataFrame after Ordinal Encoding ('Size', 'Quality'):
    ID  Color  Size  Quality  Target
0   1    Red   1.0      1.0  ClassA
1   2  Green   2.0      2.0  ClassB
2   3   Blue   0.0 