# Session 3: Categorical Variables and Encoding

**Data Science with Python - 2025 Edition**

---

## 🎯 Learning Objectives
By the end of this session, you will:
- Understand different types of categorical variables
- Learn when to use Label Encoding vs One-Hot Encoding
- Apply encoding techniques to real datasets
- Prepare categorical data for machine learning models

---

## 📚 Topics Covered
1. **Types of Categorical Variables**
   - Nominal vs Ordinal
   - Examples and identification

2. **Label Encoding**
   - When to use it
   - Implementation with examples

3. **One-Hot Encoding**
   - When to use it
   - Implementation with examples

**Let's start exploring categorical data! 📊**

## 🛠️ Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import matplotlib.pyplot as plt
import seaborn as sns

"Libraries loaded! Ready to work with categorical variables! 📊"

## 📊 What are Categorical Variables?

**Categorical variables** represent data that can be divided into groups or categories. Unlike numerical variables, their values are labels, so arithmetic on them (sums, averages) has no meaning.

### Examples:
- **Colors**: Red, Blue, Green
- **Gender**: Male, Female
- **Education**: High School, Bachelor's, Master's
- **Ratings**: Poor, Good, Excellent

---

## 🏷️ Types of Categorical Variables

There are **two main types** of categorical variables:

### 1. 🔢 Nominal Categorical Variables

**Definition**: Categories with **NO natural order** or ranking.

**Characteristics**:
- Categories are just different labels
- No category is "higher" or "better" than another
- Order doesn't matter

**Examples**:
- **Colors**: Red, Blue, Green, Yellow
- **Car Brands**: Toyota, BMW, Honda, Ford
- **Countries**: USA, India, Japan, Germany
- **Blood Types**: A, B, AB, O

**Key Point**: Red is not "greater than" Blue - they're just different!

In [None]:
# Example of Nominal Categorical Data
colors = ['Red', 'Blue', 'Green', 'Red', 'Yellow', 'Blue', 'Green']
car_brands = ['Toyota', 'BMW', 'Honda', 'Toyota', 'Ford', 'BMW', 'Honda']  # 7 values, same length as colors

nominal_data = pd.DataFrame({
    'Color': colors,
    'Car_Brand': car_brands  # every column must have the same number of rows
})

nominal_data

### 2. 📈 Ordinal Categorical Variables

**Definition**: Categories with a **clear natural order** or ranking.

**Characteristics**:
- Categories can be ranked from low to high
- There's a meaningful sequence
- Order matters!

**Examples**:
- **Education Level**: High School < Bachelor's < Master's < PhD
- **Ratings**: Poor < Fair < Good < Excellent
- **Income Level**: Low < Medium < High
- **T-shirt Sizes**: Small < Medium < Large < XL

**Key Point**: "Excellent" is definitely better than "Poor" - there's a clear order!

In [None]:
# Example of Ordinal Categorical Data
education = ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'High School']
ratings = ['Poor', 'Good', 'Excellent', 'Fair', 'Good', 'Excellent']
sizes = ['Small', 'Medium', 'Large', 'XL', 'Medium', 'Small']

ordinal_data = pd.DataFrame({
    'Education': education,
    'Rating': ratings,
    'Size': sizes
})

ordinal_data

## 🔍 Quick Quiz: Identify the Type!

Look at these variables and think about whether they are **Nominal** or **Ordinal**:

1. **Movie Genres**: Action, Comedy, Drama, Horror
2. **Grade Levels**: A, B, C, D, F
3. **Cities**: New York, London, Tokyo, Paris
4. **Temperature**: Cold, Warm, Hot
5. **Marital Status**: Single, Married, Divorced

**Answers**:
1. **Nominal** - No natural order between genres
2. **Ordinal** - A > B > C > D > F (clear ranking)
3. **Nominal** - Cities have no inherent order
4. **Ordinal** - Cold < Warm < Hot (temperature order)
5. **Nominal** - No natural ranking between marital statuses

## 🤖 Why Do We Need Encoding?

**Problem**: Most machine learning algorithms work with numbers, not text!

**Solution**: Convert categorical variables to numerical format through **encoding**.

### Two Main Encoding Techniques:
1. **Label Encoding** - For ordinal variables
2. **One-Hot Encoding** - For nominal variables

---

## 🏷️ Label Encoding

**What is Label Encoding?**
- Converts categories to numbers (0, 1, 2, 3, ...)
- Each unique category gets a unique number
- **Best for ordinal variables** where order matters

**When to Use Label Encoding:**
- ✅ Ordinal categorical variables (education, ratings, sizes)
- ✅ When the order of categories is meaningful
- ✅ Target variables in classification problems

**When NOT to Use:**
- ❌ Nominal variables (the algorithm might think there's an order)

### 📝 Example 1: Education Level (Ordinal)

In [None]:
# Create sample education data
education_data = pd.DataFrame({
    'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'High School', 'Master']
})

education_data

In [None]:
# Method 1: Manual Label Encoding (when you want control over the order)
education_mapping = {
    'High School': 0,
    'Bachelor': 1,
    'Master': 2,
    'PhD': 3
}

education_data['Education_Encoded'] = education_data['Education'].map(education_mapping)
education_data

In [None]:
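# Method 2: Using sklearn LabelEncoder
# Note: LabelEncoder assigns codes in sorted (alphabetical) order of the labels,
# so here Bachelor=0, High School=1, Master=2, PhD=3. That is NOT the education
# ranking; use the manual mapping above when the order must be preserved.
label_encoder = LabelEncoder()
education_data['Education_LabelEncoder'] = label_encoder.fit_transform(education_data['Education'])

education_data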

In [None]:
# See the mapping created by LabelEncoder
mapping_df = pd.DataFrame({
    'Original': label_encoder.classes_,
    'Encoded': range(len(label_encoder.classes_))
})

mapping_df
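
Because `LabelEncoder` sorts labels alphabetically, the codes above do not follow the education ranking. One way to keep the ranking without hand-writing a dictionary is an ordered pandas categorical. The sketch below is a minimal illustration, assuming the four education levels used in this session; the names `education_levels` and `edu` are made up for the example.

```python
import pandas as pd

# Ranking from lowest to highest (assumed order for this example)
education_levels = ['High School', 'Bachelor', 'Master', 'PhD']

edu = pd.DataFrame({
    'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor']
})

# An ordered categorical stores the ranking; .cat.codes returns 0..n-1 in that order
edu['Education'] = pd.Categorical(edu['Education'], categories=education_levels, ordered=True)
edu['Education_Code'] = edu['Education'].cat.codes

# Alphabetical sorting (the order LabelEncoder uses) puts Bachelor before High School
alphabetical = sorted(education_levels)

print(edu)
print(alphabetical)
```

The codes now increase with the education level, whereas the alphabetical order would rank Bachelor below High School.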

### 📝 Example 2: Customer Ratings (Ordinal)

In [None]:
# Create customer ratings data
ratings_data = pd.DataFrame({
    'Customer_ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Rating': ['Poor', 'Good', 'Excellent', 'Fair', 'Good', 'Poor', 'Excellent', 'Fair']
})

ratings_data

In [None]:
# Manual encoding with meaningful order
rating_mapping = {
    'Poor': 1,
    'Fair': 2, 
    'Good': 3,
    'Excellent': 4
}

ratings_data['Rating_Encoded'] = ratings_data['Rating'].map(rating_mapping)
ratings_data

### 🎯 Key Benefits of Label Encoding:
- **Preserves order** for ordinal variables
- **Memory efficient** (only one column)
- **Simple to implement**
- **Reversible** (can decode back to original)

In [None]:
# Decoding back to original values
ratings_data['Rating_Decoded'] = ratings_data['Rating_Encoded'].map({v: k for k, v in rating_mapping.items()})
ratings_data

---

## 🎭 One-Hot Encoding

**What is One-Hot Encoding?**
- Creates **separate binary columns** for each category
- Each row has 1 in one column and 0 in all others
- **Best for nominal variables** where no order exists

**When to Use One-Hot Encoding:**
- ✅ Nominal categorical variables (colors, countries, brands)
- ✅ When categories have no meaningful order
- ✅ Most machine learning algorithms prefer this for nominal data

**When NOT to Use:**
- ❌ High cardinality variables (too many categories = too many columns)
- ❌ Ordinal variables (loses the order information)
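
A quick way to spot high cardinality before encoding is to count the distinct values per column, since that count is the number of dummy columns one-hot encoding would create. The following is a small sketch on made-up data; the `Zip_Code` column and the `max_categories` cutoff are assumptions for illustration, not recommended values.

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green'],
    'Zip_Code': ['10001', '94105', '60601', '73301', '98101', '33101'],
})

# Distinct categories per column = dummy columns one-hot encoding would add
cardinality = df.nunique()

max_categories = 5  # hypothetical cutoff; choose one that suits your data and model
too_many = cardinality[cardinality > max_categories].index.tolist()

print(cardinality)
print("Columns to reconsider before one-hot encoding:", too_many)
```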

### 📝 Example 1: Car Colors (Nominal)

In [None]:
# Create car color data
car_data = pd.DataFrame({
    'Car_ID': [1, 2, 3, 4, 5, 6],
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Yellow', 'Blue']
})

car_data

In [None]:
# Method 1: Using pandas get_dummies (most common)
# dtype=int gives 0/1 columns; pandas 2.x returns True/False by default
car_encoded = pd.get_dummies(car_data, columns=['Color'], prefix='Color', dtype=int)
car_encoded

In [None]:
# Method 2: Using sklearn OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder(sparse_output=False)
color_encoded = onehot_encoder.fit_transform(car_data[['Color']])

# Create DataFrame with proper column names
color_columns = [f'Color_{cat}' for cat in onehot_encoder.categories_[0]]
color_encoded_df = pd.DataFrame(color_encoded, columns=color_columns)

# Combine with original data
car_sklearn = pd.concat([car_data[['Car_ID']], color_encoded_df], axis=1)
car_sklearn

### 📝 Example 2: Customer Countries (Nominal)

In [None]:
# Create customer country data
customer_data = pd.DataFrame({
    'Customer_ID': [101, 102, 103, 104, 105, 106, 107],
    'Country': ['USA', 'India', 'Germany', 'USA', 'Japan', 'India', 'Germany'],
    'Age': [25, 30, 35, 28, 45, 32, 38]
})

customer_data

In [None]:
# One-hot encode only the Country column
customer_encoded = pd.get_dummies(customer_data, columns=['Country'], prefix='Country')
customer_encoded

### 🎯 Key Benefits of One-Hot Encoding:
- **No false ordering** imposed on nominal variables
- **Clear representation** - easy to interpret
- **Works well** with most ML algorithms
- **Prevents bias** from arbitrary numerical assignments

## ⚖️ Label Encoding vs One-Hot Encoding

### 📊 Comparison Table

| Aspect | Label Encoding | One-Hot Encoding |
|--------|----------------|------------------|
| **Best For** | Ordinal Variables | Nominal Variables |
| **Output** | Single column with numbers | Multiple binary columns |
| **Memory Usage** | Low | Higher |
| **Preserves Order** | Yes | No |
| **Risk of False Order** | No (when used correctly) | No |
| **Algorithm Preference** | Tree-based models | Linear models |

### 🎯 Decision Guide:

**Use Label Encoding when:**
- Variable is ordinal (has natural order)
- You want to preserve the ranking
- Memory efficiency is important

**Use One-Hot Encoding when:**
- Variable is nominal (no natural order)
- You want to avoid false ordering
- Working with linear algorithms

## 🚗 Practical Example: Complete Car Dataset

In [None]:
# Create a comprehensive car dataset
car_dataset = pd.DataFrame({
    'Brand': ['Toyota', 'BMW', 'Honda', 'Ford', 'Toyota', 'BMW', 'Honda'],
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Black', 'White', 'Blue'],
    'Size': ['Small', 'Large', 'Medium', 'Large', 'Medium', 'Large', 'Small'],
    'Condition': ['Poor', 'Excellent', 'Good', 'Fair', 'Good', 'Excellent', 'Fair'],
    'Price': [15000, 45000, 25000, 35000, 22000, 50000, 18000]
})

car_dataset

In [None]:
# Identify variable types
variable_types = {
    'Brand': 'Nominal',
    'Color': 'Nominal', 
    'Size': 'Ordinal',
    'Condition': 'Ordinal',
    'Price': 'Numerical'
}

for var, var_type in variable_types.items():
    print(f"{var}: {var_type}")

In [None]:
# Apply appropriate encoding
car_processed = car_dataset.copy()

# 1. One-hot encode nominal variables
car_processed = pd.get_dummies(car_processed, columns=['Brand', 'Color'], prefix=['Brand', 'Color'])

# 2. Label encode ordinal variables
size_mapping = {'Small': 1, 'Medium': 2, 'Large': 3}
condition_mapping = {'Poor': 1, 'Fair': 2, 'Good': 3, 'Excellent': 4}

car_processed['Size_Encoded'] = car_processed['Size'].map(size_mapping)
car_processed['Condition_Encoded'] = car_processed['Condition'].map(condition_mapping)

# Drop original categorical columns
car_processed = car_processed.drop(['Size', 'Condition'], axis=1)

car_processed

## 🎪 Hands-On Exercise: Employee Dataset

In [None]:
# Create employee dataset for practice
employee_data = pd.DataFrame({
    'Employee_ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Department': ['HR', 'IT', 'Finance', 'IT', 'Marketing', 'HR', 'Finance', 'Marketing'],
    'Education': ['Bachelor', 'Master', 'PhD', 'Bachelor', 'Master', 'High School', 'PhD', 'Bachelor'],
    'Performance': ['Good', 'Excellent', 'Poor', 'Good', 'Fair', 'Excellent', 'Good', 'Fair'],
    'Salary': [50000, 80000, 35000, 55000, 60000, 45000, 90000, 52000]
})

employee_data

In [None]:
# Identify which encoding to use for each variable
print("Variable Type Analysis:")
print("Department: Nominal (no ranking between departments)")
print("Education: Ordinal (High School < Bachelor < Master < PhD)")
print("Performance: Ordinal (Poor < Fair < Good < Excellent)")

In [None]:
# Apply encodings
employee_processed = employee_data.copy()

# One-hot encode Department (nominal)
employee_processed = pd.get_dummies(employee_processed, columns=['Department'], prefix='Dept')

# Label encode Education (ordinal)
education_order = {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}
employee_processed['Education_Encoded'] = employee_processed['Education'].map(education_order)

# Label encode Performance (ordinal)
performance_order = {'Poor': 1, 'Fair': 2, 'Good': 3, 'Excellent': 4}
employee_processed['Performance_Encoded'] = employee_processed['Performance'].map(performance_order)

# Drop original categorical columns
employee_processed = employee_processed.drop(['Education', 'Performance'], axis=1)

employee_processed

## 📊 Visualizing the Impact of Encoding

In [None]:
# Compare original vs encoded data
plt.figure(figsize=(15, 5))

# Original Performance distribution
plt.subplot(1, 3, 1)
employee_data['Performance'].value_counts().plot(kind='bar', color='lightblue')
plt.title('Original Performance Categories')
plt.xticks(rotation=45)

# Encoded Performance distribution  
plt.subplot(1, 3, 2)
employee_processed['Performance_Encoded'].value_counts().sort_index().plot(kind='bar', color='lightgreen')
plt.title('Label Encoded Performance')
plt.xlabel('Encoded Values')

# Department one-hot encoding visualization
plt.subplot(1, 3, 3)
dept_cols = [col for col in employee_processed.columns if col.startswith('Dept_')]
employee_processed[dept_cols].sum().plot(kind='bar', color='orange')
plt.title('One-Hot Encoded Departments')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## 🎯 Best Practices and Tips

### ✅ Do's:
1. **Always identify** whether your categorical variable is nominal or ordinal first
2. **Use Label Encoding** for ordinal variables to preserve order
3. **Use One-Hot Encoding** for nominal variables to avoid false ordering
4. **Check for high cardinality** before one-hot encoding (too many categories)
5. **Save your mappings** for future use (encoding/decoding)

### ❌ Don'ts:
1. **Don't use Label Encoding** on nominal input features (the numbers imply an order that doesn't exist)
2. **Don't use One-Hot Encoding** on high cardinality variables (>10-15 categories)
3. **Don't forget** to apply the same encoding to test/new data
4. **Don't lose** the original categorical data (keep backups)

### 🚨 Common Mistakes:
- Using Label Encoding on nominal variables (creates false order)
- One-hot encoding ordinal variables (loses order information)
- Not handling unseen categories in new data
- Creating too many dummy variables
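
The "same encoding for new data" and "unseen categories" points can be sketched with `pd.get_dummies` plus `reindex`. This is one possible approach under simple assumptions, not the only one: the `train` and `new` frames are invented, and the unseen category `'Purple'` is simply encoded as all zeros.

```python
import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
new = pd.DataFrame({'Color': ['Blue', 'Purple']})  # 'Purple' never appeared in training

# Encode the training data and remember its column layout
train_encoded = pd.get_dummies(train, columns=['Color'], dtype=int)
train_columns = train_encoded.columns

# Encode new data, then force it into the training layout:
# missing training columns are added as 0, unseen columns (Color_Purple) are dropped
new_encoded = pd.get_dummies(new, columns=['Color'], dtype=int)
new_aligned = new_encoded.reindex(columns=train_columns, fill_value=0)

print(new_aligned)
```

Whether an unseen category should become an all-zero row, raise an error, or be mapped to an "other" bucket is a design choice that depends on the model and the data.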

---

# 🔧 Feature Scaling and Preprocessing

Now that we've mastered categorical encoding, let's explore **numerical feature preprocessing**!

**Why Scale Features?**
- Different features have different scales (age: 20-80, salary: 20,000-100,000)
- Machine learning algorithms can be sensitive to feature scales
- Scaling puts features on comparable ranges, so no feature dominates simply because of its units

## 📊 Types of Scaling We'll Cover:
1. **Binarization** - Convert numerical to binary (0/1)
2. **Standardization (Z-score)** - Mean=0, Std=1
3. **Normalization (Min-Max)** - Scale to [0,1] range  
4. **Robust Scaling** - Uses median and IQR (less sensitive to outliers)

In [None]:
# Create sample dataset for scaling examples
np.random.seed(42)

scaling_data = pd.DataFrame({
    'Age': [25, 35, 28, 45, 32, 55, 29, 38, 42, 31],
    'Salary': [35000, 85000, 45000, 120000, 55000, 150000, 40000, 95000, 110000, 60000], 
    'Experience': [1, 8, 3, 15, 5, 20, 2, 10, 12, 4],
    'Score': [65, 88, 72, 95, 78, 92, 68, 85, 90, 75]
})

scaling_data

## 🔄 Binarization of Numerical Features

**What is Binarization?**
- Converts numerical values to binary (0 or 1) based on a threshold
- Values above the threshold become 1; all other values become 0 (decide explicitly whether a value exactly at the threshold counts as 1)
- Useful for creating yes/no features from continuous data

**When to Use Binarization:**
- ✅ When you only care about "above/below" a certain value
- ✅ Creating flags or indicators (high/low income, pass/fail scores)
- ✅ Simplifying complex numerical relationships
- ✅ When the exact value matters less than the category

**Examples:**
- Age → Senior (1 if age ≥ 60, else 0)
- Score → Pass (1 if score ≥ 70, else 0)
- Income → High Income (1 if income ≥ 80000, else 0)

In [None]:
# Method 1: Manual Binarization
binarized_data = scaling_data.copy()

# Create binary features based on thresholds (chosen to suit this small sample, e.g. 40 rather than 60 for Is_Senior)
binarized_data['Is_Senior'] = (binarized_data['Age'] >= 40).astype(int)
binarized_data['High_Salary'] = (binarized_data['Salary'] >= 80000).astype(int)
binarized_data['Experienced'] = (binarized_data['Experience'] >= 10).astype(int)
binarized_data['High_Score'] = (binarized_data['Score'] >= 80).astype(int)

binarized_data

In [None]:
# Method 2: Using sklearn Binarizer
from sklearn.preprocessing import Binarizer

# Binarize Age with threshold 40
# Note: Binarizer maps values strictly greater than the threshold to 1,
# so an age of exactly 40 would become 0 (unlike the >= used in Method 1)
age_binarizer = Binarizer(threshold=40)
scaling_data['Age_Binary'] = age_binarizer.fit_transform(scaling_data[['Age']]).ravel()

# Binarize Score with threshold 80
score_binarizer = Binarizer(threshold=80)
scaling_data['Score_Binary'] = score_binarizer.fit_transform(scaling_data[['Score']]).ravel()

scaling_data[['Age', 'Age_Binary', 'Score', 'Score_Binary']]
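
The manual method uses `>=`, while scikit-learn's `Binarizer` treats only values strictly greater than the threshold as 1. The two agree on the sample data because no value sits exactly on a threshold. The short sketch below, using made-up ages, shows where they would differ.

```python
import numpy as np

ages = np.array([35, 40, 45])
threshold = 40

# ">=" counts the boundary value as 1
at_or_above = (ages >= threshold).astype(int)

# ">" counts the boundary value as 0, which is the rule Binarizer follows
strictly_above = (ages > threshold).astype(int)

print(at_or_above)
print(strictly_above)
```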

---

## 📏 Standardization using Z-score Scaling

**What is Standardization (Z-score Scaling)?**
- Transforms features to have **mean = 0** and **standard deviation = 1**
- Formula: **z = (x - μ) / σ**
  - z = standardized value
  - x = original value
  - μ = mean of the feature
  - σ = standard deviation of the feature

**When to Use Standardization:**
- ✅ Features have different scales (age vs salary)
- ✅ Algorithm assumes normally distributed data
- ✅ Using algorithms sensitive to scale (SVM, Neural Networks, PCA)
- ✅ When you want to preserve the shape of the distribution

**Key Benefits:**
- **Preserves outliers** (doesn't compress them)
- **Centers data around 0**
- **Works well** with algorithms that assume normal distribution

In [None]:
# Method 1: Manual Z-score Standardization
standardized_data = scaling_data.copy()

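# Calculate mean and standard deviation
# ddof=0 gives the population standard deviation, which is what StandardScaler uses;
# pandas defaults to the sample standard deviation (ddof=1)
salary_mean = standardized_data['Salary'].mean()
salary_std = standardized_data['Salary'].std(ddof=0)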

# Apply Z-score formula manually
standardized_data['Salary_Standardized_Manual'] = (standardized_data['Salary'] - salary_mean) / salary_std

# Show the calculation
print(f"Salary Mean: {salary_mean:,.2f}")
print(f"Salary Std: {salary_std:,.2f}")
print("\nOriginal vs Standardized:")
standardized_data[['Salary', 'Salary_Standardized_Manual']].head()

In [None]:
# Method 2: Using sklearn StandardScaler
from sklearn.preprocessing import StandardScaler

# Create and fit the scaler
scaler = StandardScaler()
numerical_columns = ['Age', 'Salary', 'Experience', 'Score']

# Fit and transform the data
scaled_data = scaler.fit_transform(standardized_data[numerical_columns])

# Create DataFrame with scaled data
scaled_df = pd.DataFrame(scaled_data, columns=[f'{col}_Standardized' for col in numerical_columns])

# Combine with original data
result = pd.concat([standardized_data[numerical_columns], scaled_df], axis=1)
result

In [None]:
# Verify standardization properties (mean ≈ 0, std ≈ 1)
print("Standardization Verification:")
print("Mean of standardized features (should be ≈ 0):")
print(scaled_df.mean().round(10))
print("\nStandard deviation of standardized features (should be ≈ 1):")
# StandardScaler divides by the population std, so check with ddof=0
# (the pandas default ddof=1 would report ≈ 1.054 for 10 rows)
print(scaled_df.std(ddof=0).round(6))
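
One detail is easy to miss when comparing a manual z-score with `StandardScaler`: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below uses a small made-up series, chosen so the numbers come out round, to show the difference.

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

pop_std = s.std(ddof=0)      # population std: divides by n
sample_std = s.std(ddof=1)   # sample std (pandas default): divides by n - 1

# z-scores computed with the population std, as StandardScaler does
z = (s - s.mean()) / pop_std

print(f"Population std: {pop_std:.4f}")
print(f"Sample std:     {sample_std:.4f}")
print(f"Std of z (ddof=0): {z.std(ddof=0):.4f}")
print(f"Std of z (ddof=1): {z.std(ddof=1):.4f}")
```

With many rows the two versions are nearly identical; with a handful of rows, as in this session's examples, the gap is visible.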

---

## 📐 Normalization using Min-Max Scaling

**What is Min-Max Normalization?**
- Scales features to a **fixed range [0, 1]**
- Formula: **x_norm = (x - min) / (max - min)**
  - x_norm = normalized value
  - x = original value  
  - min = minimum value in the feature
  - max = maximum value in the feature

**When to Use Min-Max Scaling:**
- ✅ You want features in a specific range [0, 1]
- ✅ Data is uniformly distributed
- ✅ Using algorithms that work better with bounded features (Neural Networks)
- ✅ When you know the approximate range of future data

**Key Benefits:**
- **Preserves relationships** between data points
- **Bounded output** (between 0 and 1 for the data the scaler was fitted on)
- **No assumptions** about data distribution

**Key Limitation:**
- **Sensitive to outliers** (a single extreme value can compress most of the data into a narrow band)

In [None]:
# Method 1: Manual Min-Max Normalization
normalized_data = scaling_data.copy()

# Apply Min-Max formula manually for Salary
salary_min = normalized_data['Salary'].min()
salary_max = normalized_data['Salary'].max()

normalized_data['Salary_Normalized_Manual'] = (normalized_data['Salary'] - salary_min) / (salary_max - salary_min)

# Show the calculation
print(f"Salary Min: {salary_min:,}")
print(f"Salary Max: {salary_max:,}")
print(f"Range: {salary_max - salary_min:,}")
print("\nOriginal vs Normalized:")
normalized_data[['Salary', 'Salary_Normalized_Manual']].head()

In [None]:
# Method 2: Using sklearn MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Create and fit the scaler
minmax_scaler = MinMaxScaler()
numerical_columns = ['Age', 'Salary', 'Experience', 'Score']

# Fit and transform the data
normalized_data_sklearn = minmax_scaler.fit_transform(normalized_data[numerical_columns])

# Create DataFrame with normalized data
normalized_df = pd.DataFrame(normalized_data_sklearn, columns=[f'{col}_Normalized' for col in numerical_columns])

# Combine with original data
result_normalized = pd.concat([normalized_data[numerical_columns], normalized_df], axis=1)
result_normalized

In [None]:
# Verify Min-Max normalization properties (range [0, 1])
print("Min-Max Normalization Verification:")
print("Minimum values (should be 0):")
print(normalized_df.min())
print("\nMaximum values (should be 1):")
print(normalized_df.max())
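
The [0, 1] guarantee only holds for the data used to compute `min` and `max`. When the same minimum and maximum are reused on new data, values outside the original range land outside [0, 1]. The following is a minimal sketch with made-up numbers; clipping is shown as one possible way to handle it, not a required step.

```python
import numpy as np

train = np.array([20.0, 40.0, 60.0])
new = np.array([10.0, 50.0, 80.0])   # 10 and 80 fall outside the training range

# Learn min and max from the training data only
train_min, train_max = train.min(), train.max()

# Reuse the training statistics on the new data
new_scaled = (new - train_min) / (train_max - train_min)

# Optional: clip to [0, 1] if the model needs strictly bounded inputs
new_clipped = np.clip(new_scaled, 0.0, 1.0)

print(new_scaled)
print(new_clipped)
```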

---

## 🛡️ Robust Scaling using Median and IQR

**What is Robust Scaling?**
- Uses **median** and **Interquartile Range (IQR)** instead of mean and std
- Formula: **x_robust = (x - median) / IQR**
  - x_robust = robust scaled value
  - x = original value
  - median = middle value (50th percentile)
  - IQR = Q3 - Q1 (75th percentile - 25th percentile)

**When to Use Robust Scaling:**
- ✅ Data contains **outliers**
- ✅ Data is **not normally distributed**
- ✅ You want scaling **resistant to extreme values**
- ✅ When median is more representative than mean

**Key Benefits:**
- **Less sensitive to outliers** than StandardScaler
- **Uses robust statistics** (median, IQR)
- **Preserves outliers** but doesn't let them dominate
- **Works well** with skewed distributions

In [None]:
# Create data with outliers to demonstrate robust scaling
data_with_outliers = pd.DataFrame({
    'Normal_Income': [45000, 50000, 48000, 52000, 47000, 49000, 51000, 46000],
    'Income_With_Outlier': [45000, 50000, 48000, 52000, 47000, 49000, 51000, 500000],  # Last value is outlier
    'Normal_Age': [25, 30, 28, 32, 27, 29, 31, 26],
    'Age_With_Outlier': [25, 30, 28, 32, 27, 29, 31, 95]  # Last value is outlier
})

data_with_outliers

In [None]:
# Method 1: Manual Robust Scaling
robust_data = data_with_outliers.copy()

# Calculate robust statistics for Income_With_Outlier
income_median = robust_data['Income_With_Outlier'].median()
income_q1 = robust_data['Income_With_Outlier'].quantile(0.25)
income_q3 = robust_data['Income_With_Outlier'].quantile(0.75)
income_iqr = income_q3 - income_q1

# Apply robust scaling formula
robust_data['Income_Robust_Manual'] = (robust_data['Income_With_Outlier'] - income_median) / income_iqr

# Show the statistics
print(f"Income Median: {income_median:,.2f}")
print(f"Income Q1: {income_q1:,.2f}")
print(f"Income Q3: {income_q3:,.2f}")
print(f"Income IQR: {income_iqr:,.2f}")
print("\nOriginal vs Robust Scaled:")
robust_data[['Income_With_Outlier', 'Income_Robust_Manual']]

In [None]:
# Method 2: Using sklearn RobustScaler
from sklearn.preprocessing import RobustScaler

# Create and fit the robust scaler
robust_scaler = RobustScaler()

# Apply to all numerical columns
robust_scaled_data = robust_scaler.fit_transform(data_with_outliers)

# Create DataFrame with robust scaled data
robust_df = pd.DataFrame(robust_scaled_data, columns=[f'{col}_Robust' for col in data_with_outliers.columns])

# Combine with original data
result_robust = pd.concat([data_with_outliers, robust_df], axis=1)
result_robust

### 📊 Comparing Scaling Methods with Outliers

Let's see how different scaling methods handle the outlier (500,000 income):

In [None]:
# Compare all scaling methods on data with outliers
comparison_data = data_with_outliers[['Income_With_Outlier']].copy()

# Apply all scaling methods
standard_scaler = StandardScaler()
minmax_scaler = MinMaxScaler()
robust_scaler = RobustScaler()

comparison_data['Standardized'] = standard_scaler.fit_transform(comparison_data[['Income_With_Outlier']])
comparison_data['Normalized'] = minmax_scaler.fit_transform(comparison_data[['Income_With_Outlier']])
comparison_data['Robust_Scaled'] = robust_scaler.fit_transform(comparison_data[['Income_With_Outlier']])

comparison_data
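
To put a number on the comparison, the sketch below recomputes Min-Max and robust scaling by hand with NumPy on the same eight income values and measures how much room the seven ordinary incomes are left with. It assumes linear-interpolation quartiles (the `np.percentile` default), which should agree with the pandas `quantile` calls used earlier.

```python
import numpy as np

income = np.array([45000, 50000, 48000, 52000, 47000, 49000, 51000, 500000], dtype=float)

# Min-Max scaling: the outlier defines the maximum
minmax = (income - income.min()) / (income.max() - income.min())

# Robust scaling: median and IQR barely notice the outlier
median = np.median(income)
q1, q3 = np.percentile(income, [25, 75])
robust = (income - median) / (q3 - q1)

# Spread of the seven ordinary incomes (everything except the last value)
minmax_spread = minmax[:-1].max() - minmax[:-1].min()
robust_spread = robust[:-1].max() - robust[:-1].min()

print(f"Min-Max spread of ordinary incomes: {minmax_spread:.4f}")
print(f"Robust spread of ordinary incomes:  {robust_spread:.4f}")
```

Under Min-Max scaling the ordinary incomes are squeezed into less than 2% of the [0, 1] range, while robust scaling keeps them spread over about two IQR units.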

---

## 📊 Complete Comparison: All Scaling Methods

### 📋 Scaling Methods Comparison Table

| Method | Output Range | Best For | Sensitive to Outliers | Preserves Distribution |
|--------|--------------|----------|---------------------|----------------------|
| **Binarization** | {0, 1} | Binary flags, threshold-based features | No | No |
| **Standardization** | Unbounded (mean=0, std=1) | Normal distributions, SVM, Neural Networks | Yes | Yes |
| **Min-Max** | [0, 1] | Bounded features, uniform distributions | Very sensitive | Yes |
| **Robust** | Unbounded (median-centered) | Data with outliers, skewed distributions | No | Yes |

### 🎯 Decision Guide:

**Use Binarization when:**
- You only care about above/below threshold
- Creating binary flags or indicators
- Simplifying numerical relationships

**Use Standardization when:**
- Data is approximately normal
- Using algorithms that assume normal distribution
- Features have very different scales

**Use Min-Max when:**
- You need bounded output [0,1]
- Data is uniformly distributed
- No significant outliers present

**Use Robust Scaling when:**
- Data contains outliers
- Distribution is skewed
- You want outlier-resistant scaling

## 🎪 Hands-On Exercise: Complete Data Preprocessing

In [None]:
# Create comprehensive dataset for practice
complete_dataset = pd.DataFrame({
    'Customer_ID': range(1, 11),
    'Age': [25, 35, 28, 45, 32, 55, 29, 38, 42, 31],
    'Income': [35000, 85000, 45000, 120000, 55000, 250000, 40000, 95000, 110000, 60000],  # Contains outlier
    'Education': ['Bachelor', 'Master', 'High School', 'PhD', 'Bachelor', 'Master', 'High School', 'PhD', 'Master', 'Bachelor'],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'New York', 'London', 'Berlin', 'Tokyo', 'Paris', 'Berlin'],
    'Satisfaction': ['Good', 'Excellent', 'Poor', 'Good', 'Fair', 'Excellent', 'Poor', 'Good', 'Fair', 'Excellent'],
    'Years_Experience': [2, 12, 5, 20, 8, 25, 3, 15, 18, 6],
    'Premium_Customer': [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]  # Already binary
})

complete_dataset

In [None]:
# Apply complete preprocessing pipeline
processed_dataset = complete_dataset.copy()

# Step 1: Identify variable types
print("Variable Types:")
print("Age: Numerical (for scaling)")
print("Income: Numerical with outlier (for robust scaling)")
print("Education: Ordinal (High School < Bachelor < Master < PhD)")
print("City: Nominal (no natural order)")
print("Satisfaction: Ordinal (Poor < Fair < Good < Excellent)")
print("Years_Experience: Numerical (for binarization and scaling)")
print("Premium_Customer: Already binary")

# Step 2: One-hot encode nominal variables
processed_dataset = pd.get_dummies(processed_dataset, columns=['City'], prefix='City')

# Step 3: Label encode ordinal variables
education_mapping = {'High School': 1, 'Bachelor': 2, 'Master': 3, 'PhD': 4}
satisfaction_mapping = {'Poor': 1, 'Fair': 2, 'Good': 3, 'Excellent': 4}

processed_dataset['Education_Encoded'] = processed_dataset['Education'].map(education_mapping)
processed_dataset['Satisfaction_Encoded'] = processed_dataset['Satisfaction'].map(satisfaction_mapping)

# Step 4: Binarize Years_Experience (experienced = 10+ years)
processed_dataset['Is_Experienced'] = (processed_dataset['Years_Experience'] >= 10).astype(int)

# Step 5: Scale numerical features
# Use robust scaling for Income (has outlier), standard scaling for Age
robust_scaler = RobustScaler()
standard_scaler = StandardScaler()

processed_dataset['Income_Robust_Scaled'] = robust_scaler.fit_transform(processed_dataset[['Income']])
processed_dataset['Age_Standardized'] = standard_scaler.fit_transform(processed_dataset[['Age']])

# Drop original categorical columns
processed_dataset = processed_dataset.drop(['Education', 'Satisfaction'], axis=1)

processed_dataset

In [None]:
# Visualize the preprocessing results
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Original Income distribution
axes[0, 0].hist(complete_dataset['Income'], bins=5, alpha=0.7, color='red')
axes[0, 0].set_title('Original Income (with outlier)')
axes[0, 0].set_xlabel('Income')

# Robust scaled Income
axes[0, 1].hist(processed_dataset['Income_Robust_Scaled'], bins=5, alpha=0.7, color='green')
axes[0, 1].set_title('Robust Scaled Income')
axes[0, 1].set_xlabel('Scaled Income')

# Original Age distribution
axes[0, 2].hist(complete_dataset['Age'], bins=5, alpha=0.7, color='blue')
axes[0, 2].set_title('Original Age')
axes[0, 2].set_xlabel('Age')

# Standardized Age
axes[1, 0].hist(processed_dataset['Age_Standardized'], bins=5, alpha=0.7, color='orange')
axes[1, 0].set_title('Standardized Age')
axes[1, 0].set_xlabel('Standardized Age')

# Education encoding
education_counts = processed_dataset['Education_Encoded'].value_counts().sort_index()
axes[1, 1].bar(education_counts.index, education_counts.values, color='purple')
axes[1, 1].set_title('Education Level Encoding')
axes[1, 1].set_xlabel('Encoded Education')
axes[1, 1].set_xticks([1, 2, 3, 4])
axes[1, 1].set_xticklabels(['HS', 'Bach', 'Mast', 'PhD'])

# City one-hot encoding
city_cols = [col for col in processed_dataset.columns if col.startswith('City_')]
city_sums = processed_dataset[city_cols].sum()
axes[1, 2].bar(range(len(city_sums)), city_sums.values, color='teal')
axes[1, 2].set_title('One-Hot Encoded Cities')
axes[1, 2].set_xlabel('Cities')
axes[1, 2].set_xticks(range(len(city_sums)))
axes[1, 2].set_xticklabels([col.replace('City_', '') for col in city_cols], rotation=45)

plt.tight_layout()
plt.show()

---

# 🚨 Outlier Detection and Removal

**What are Outliers?**
- Data points that are significantly different from other observations
- Can be caused by measurement errors, data entry mistakes, or genuine extreme values
- Can negatively impact machine learning model performance

**Why Remove Outliers?**
- Improve model accuracy and stability
- Prevent skewed statistics (mean, standard deviation)
- Reduce noise in the dataset
- Better visualization and interpretation

## 🔍 Methods for Outlier Detection and Removal:
1. **IQR Method** - Using Interquartile Range
2. **Z-score Method** - Using standard deviations from mean
3. **Percentile Thresholds** - Using custom percentile cutoffs

In [None]:
# Create dataset with clear outliers for demonstration
np.random.seed(42)

# Generate normal data
normal_data = np.random.normal(50, 10, 100)

# Add some obvious outliers
outliers = [150, 200, -20, -50, 180]

# Combine normal data with outliers
# (named values_with_outliers so it does not overwrite the data_with_outliers DataFrame used earlier)
values_with_outliers = np.concatenate([normal_data, outliers])

outlier_dataset = pd.DataFrame({
    'ID': range(1, len(values_with_outliers) + 1),
    'Value': values_with_outliers,
    'Category': ['Normal'] * len(normal_data) + ['Outlier'] * len(outliers)
})

# Show some statistics
print("Dataset Statistics:")
print(f"Count: {len(outlier_dataset)}")
print(f"Mean: {outlier_dataset['Value'].mean():.2f}")
print(f"Median: {outlier_dataset['Value'].median():.2f}")
print(f"Std: {outlier_dataset['Value'].std():.2f}")
print(f"Min: {outlier_dataset['Value'].min():.2f}")
print(f"Max: {outlier_dataset['Value'].max():.2f}")

outlier_dataset.tail(10)  # Show last 10 rows to see outliers

In [None]:
# Visualize the data with outliers
plt.figure(figsize=(15, 5))

# Box plot
plt.subplot(1, 3, 1)
plt.boxplot(outlier_dataset['Value'])
plt.title('Box Plot - Outliers Visible')
plt.ylabel('Value')

# Histogram
plt.subplot(1, 3, 2)
plt.hist(outlier_dataset['Value'], bins=20, alpha=0.7, color='skyblue')
plt.title('Histogram - Distribution with Outliers')
plt.xlabel('Value')
plt.ylabel('Frequency')

# Scatter plot with color coding
plt.subplot(1, 3, 3)
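# Use a separate name so the `colors` list from the nominal-data example is not overwritten
point_colors = ['red' if cat == 'Outlier' else 'blue' for cat in outlier_dataset['Category']]
plt.scatter(outlier_dataset['ID'], outlier_dataset['Value'], c=point_colors, alpha=0.6)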
plt.title('Scatter Plot - Red = Outliers')
plt.xlabel('ID')
plt.ylabel('Value')

plt.tight_layout()
plt.show()

## 📊 Method 1: IQR (Interquartile Range) Method

**What is IQR Method?**
- Uses the concept of quartiles (Q1, Q3) to identify outliers
- **IQR = Q3 - Q1** (75th percentile - 25th percentile)
- **Lower Bound = Q1 - 1.5 × IQR**
- **Upper Bound = Q3 + 1.5 × IQR**
- Any value outside these bounds is considered an outlier

**When to Use IQR Method:**
- ✅ Data doesn't need to be normally distributed
- ✅ Robust to extreme values (uses quartiles)
- ✅ Most commonly used method
- ✅ Works well with skewed distributions

**Advantages:**
- **Non-parametric** (no distribution assumptions)
- **Robust** to extreme outliers
- **Easy to understand** and implement
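
The bound formulas above can be checked by hand on a very small sample. The sketch below uses a made-up array (`values` is a hypothetical toy sample, not part of the notebook's dataset) and NumPy's default linear interpolation for the quartiles.

```python
import numpy as np

# Hypothetical toy sample: seven ordinary readings and one extreme value
values = np.array([10, 12, 13, 14, 15, 16, 18, 100])

# Quartiles using NumPy's default linear interpolation
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Bounds from the formulas above
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside the bounds is flagged as an outlier
flagged = values[(values < lower_bound) | (values > upper_bound)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}]")
print(f"Flagged: {flagged}")
```

For this sample the bounds work out to 7.125 and 22.125, so only the value 100 is flagged.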

In [None]:
# IQR Method Implementation
def remove_outliers_iqr(data, column_name):
    """
    Remove outliers using IQR method
    """
    df = data.copy()
    
    # Calculate Q1, Q3, and IQR
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1
    
    # Calculate bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Identify outliers
    outliers = df[(df[column_name] < lower_bound) | (df[column_name] > upper_bound)]
    
    # Remove outliers
    cleaned_data = df[(df[column_name] >= lower_bound) & (df[column_name] <= upper_bound)]
    
    print("IQR Method Results:")
    print(f"Q1: {Q1:.2f}")
    print(f"Q3: {Q3:.2f}")
    print(f"IQR: {IQR:.2f}")
    print(f"Lower Bound: {lower_bound:.2f}")
    print(f"Upper Bound: {upper_bound:.2f}")
    print(f"Original Data Points: {len(df)}")
    print(f"Outliers Detected: {len(outliers)}")
    print(f"Data Points After Cleaning: {len(cleaned_data)}")
    
    return cleaned_data, outliers

# Apply IQR method
cleaned_iqr, outliers_iqr = remove_outliers_iqr(outlier_dataset, 'Value')
cleaned_iqr.head()

---

## 📏 Method 2: Z-score Method

**What is Z-score Method?**
- Uses standard deviations from the mean to identify outliers
- **Z-score = (x - μ) / σ**
- Typically, **|Z-score| > 3** indicates an outlier
- Sometimes **|Z-score| > 2** is used for more aggressive outlier removal

**When to Use Z-score Method:**
- ✅ Data is approximately normally distributed
- ✅ You want a cutoff expressed in standard deviations from the mean
- ✅ Working with standardized data
- ✅ Need precise control over outlier sensitivity

**Advantages:**
- **Statistical basis** (based on normal distribution)
- **Adjustable threshold** (2, 2.5, 3 standard deviations)
- **Works well** with normally distributed data

**Disadvantages:**
- **Assumes normal distribution**
- **Sensitive to extreme outliers** (they affect mean and std)
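
The last disadvantage is easy to see on a toy sample. In the sketch below (the helper `zscore_outliers` and the sample values are hypothetical, chosen only for illustration), a moderate outlier is flagged on its own but slips through once a much larger outlier inflates the mean and standard deviation.

```python
import numpy as np

def zscore_outliers(values, threshold=3):
    """Return the values whose |z-score| exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    # ddof=1 gives the sample standard deviation, matching pandas' .std()
    z = (values - values.mean()) / values.std(ddof=1)
    return values[np.abs(z) > threshold]

# Hypothetical sample: 21 evenly spaced values from 40 to 60
base = np.linspace(40, 60, 21)

with_one = np.append(base, 90)           # one moderate outlier
with_two = np.append(base, [90, 1000])   # the same data plus one extreme outlier

flagged_one = zscore_outliers(with_one)
flagged_two = zscore_outliers(with_two)

print(f"Flagged with 90 only: {flagged_one}")
print(f"Flagged with 90 and 1000: {flagged_two}")
```

With only 90 added, it is flagged. Once 1000 is added, the mean and standard deviation shift so far that 90 no longer exceeds the threshold and only 1000 is reported.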

In [None]:
# Z-score Method Implementation
def remove_outliers_zscore(data, column_name, threshold=3):
    """
    Remove outliers using Z-score method
    """
    df = data.copy()
    
    # Calculate mean and standard deviation
    mean = df[column_name].mean()
    std = df[column_name].std()
    
    # Calculate Z-scores
    df['Z_Score'] = np.abs((df[column_name] - mean) / std)
    
    # Identify outliers
    outliers = df[df['Z_Score'] > threshold]
    
    # Remove outliers
    cleaned_data = df[df['Z_Score'] <= threshold].drop('Z_Score', axis=1)
    
    print(f"Z-score Method Results (threshold = {threshold}):")
    print(f"Mean: {mean:.2f}")
    print(f"Standard Deviation: {std:.2f}")
    print(f"Z-score Threshold: {threshold}")
    print(f"Original Data Points: {len(df)}")
    print(f"Outliers Detected: {len(outliers)}")
    print(f"Data Points After Cleaning: {len(cleaned_data)}")
    
    return cleaned_data, outliers.drop('Z_Score', axis=1)

# Apply Z-score method with different thresholds
print("=== Z-score Method with Threshold = 3 ===")
cleaned_z3, outliers_z3 = remove_outliers_zscore(outlier_dataset, 'Value', threshold=3)

print("\n=== Z-score Method with Threshold = 2 ===")
cleaned_z2, outliers_z2 = remove_outliers_zscore(outlier_dataset, 'Value', threshold=2)

---

## 📐 Method 3: Percentile Thresholds

**What is Percentile Threshold Method?**
- Uses custom percentile cutoffs to define outlier boundaries
- Common approaches:
  - **Remove top/bottom 5%** (5th and 95th percentiles)
  - **Remove top/bottom 1%** (1st and 99th percentiles)
  - **Custom thresholds** based on domain knowledge

**When to Use Percentile Method:**
- ✅ You want to remove a specific percentage of extreme values
- ✅ Domain knowledge suggests certain cutoffs
- ✅ Simple and interpretable approach
- ✅ Works with any distribution

**Advantages:**
- **Flexible** (you control the percentage)
- **Simple to understand**
- **No distribution assumptions**
- **Predictable results** (roughly the chosen percentage is removed from each tail)

**Disadvantages:**
- **Arbitrary cutoffs** (might remove valid data)
- **Fixed percentage** (might not reflect actual outliers)
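
To make these disadvantages concrete, here is a minimal sketch on a made-up sample with no unusual values at all (`clean` is hypothetical). The percentile method still trims both tails, because it removes a share of the data rather than testing whether any point is actually extreme.

```python
import pandas as pd

# Hypothetical clean sample: the numbers 1..100, with nothing unusual in it
clean = pd.Series(range(1, 101), dtype=float)

# 5th and 95th percentile cutoffs
lower = clean.quantile(0.05)
upper = clean.quantile(0.95)

# Keep only the values between the two cutoffs
trimmed = clean[(clean >= lower) & (clean <= upper)]
removed = len(clean) - len(trimmed)

print(f"Cutoffs: [{lower:.2f}, {upper:.2f}]")
print(f"Removed {removed} of {len(clean)} points from outlier-free data")
```

Ten perfectly ordinary points are dropped here, which is why the cutoffs should come from domain knowledge rather than habit.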

In [None]:
# Percentile Threshold Method Implementation
def remove_outliers_percentile(data, column_name, lower_percentile=5, upper_percentile=95):
    """
    Remove outliers using percentile thresholds
    """
    df = data.copy()
    
    # Calculate percentile thresholds
    lower_threshold = df[column_name].quantile(lower_percentile / 100)
    upper_threshold = df[column_name].quantile(upper_percentile / 100)
    
    # Identify outliers
    outliers = df[(df[column_name] < lower_threshold) | (df[column_name] > upper_threshold)]
    
    # Remove outliers
    cleaned_data = df[(df[column_name] >= lower_threshold) & (df[column_name] <= upper_threshold)]
    
    print(f"Percentile Method Results ({lower_percentile}th - {upper_percentile}th percentile):")
    print(f"Lower Threshold ({lower_percentile}th percentile): {lower_threshold:.2f}")
    print(f"Upper Threshold ({upper_percentile}th percentile): {upper_threshold:.2f}")
    print(f"Original Data Points: {len(df)}")
    print(f"Outliers Detected: {len(outliers)}")
    print(f"Data Points After Cleaning: {len(cleaned_data)}")
    print(f"Percentage Removed: {(len(outliers)/len(df)*100):.1f}%")
    
    return cleaned_data, outliers

# Apply percentile method with different thresholds
print("=== Percentile Method: Remove top/bottom 5% ===")
cleaned_p5, outliers_p5 = remove_outliers_percentile(outlier_dataset, 'Value', 5, 95)

print("\n=== Percentile Method: Remove top/bottom 1% ===")
cleaned_p1, outliers_p1 = remove_outliers_percentile(outlier_dataset, 'Value', 1, 99)

In [None]:
# Compare all outlier removal methods
comparison_results = pd.DataFrame({
    'Method': ['Original', 'IQR', 'Z-score (3)', 'Z-score (2)', 'Percentile (5-95)', 'Percentile (1-99)'],
    'Data_Points': [
        len(outlier_dataset),
        len(cleaned_iqr),
        len(cleaned_z3),
        len(cleaned_z2),
        len(cleaned_p5),
        len(cleaned_p1)
    ],
    'Outliers_Removed': [
        0,
        len(outliers_iqr),
        len(outliers_z3),
        len(outliers_z2),
        len(outliers_p5),
        len(outliers_p1)
    ]
})

comparison_results['Percentage_Removed'] = (comparison_results['Outliers_Removed'] / len(outlier_dataset) * 100).round(1)

comparison_results

In [None]:
# Visualize the comparison of all methods
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

datasets = [
    (outlier_dataset, 'Original Data'),
    (cleaned_iqr, 'IQR Method'),
    (cleaned_z3, 'Z-score (threshold=3)'),
    (cleaned_z2, 'Z-score (threshold=2)'),
    (cleaned_p5, 'Percentile (5-95%)'),
    (cleaned_p1, 'Percentile (1-99%)')
]

for i, (data, title) in enumerate(datasets):
    row = i // 3
    col = i % 3
    
    # Box plot for each method
    axes[row, col].boxplot(data['Value'])
    axes[row, col].set_title(f'{title}\n({len(data)} points)')
    axes[row, col].set_ylabel('Value')
    
    # Add statistics
    mean_val = data['Value'].mean()
    median_val = data['Value'].median()
    std_val = data['Value'].std()
    axes[row, col].text(0.02, 0.98, f'Mean: {mean_val:.1f}\nMedian: {median_val:.1f}\nStd: {std_val:.1f}',
                       transform=axes[row, col].transAxes, fontsize=9,
                       verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

---

# 🔧 Building Preprocessing Pipelines using Scikit-learn

**What are Preprocessing Pipelines?**
- A sequence of data transformation steps applied in order
- Ensures consistent preprocessing across training and test data
- Prevents data leakage and makes preprocessing reproducible
- Combines multiple preprocessing steps into a single object

**Benefits of Pipelines:**
- **Reproducibility** - Same transformations applied consistently
- **No Data Leakage** - Fit only on training data, transform on test data
- **Clean Code** - All preprocessing in one place
- **Easy Deployment** - Save and load entire pipeline
- **Parameter Tuning** - Can tune preprocessing parameters with models
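
As a minimal sketch of the "fit on training data, transform on test data" idea, the example below builds a one-step pipeline on a made-up single-feature array (`X_demo`, `scaling_pipeline` and the split sizes are assumptions for illustration only).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature dataset
X_demo = np.arange(20, dtype=float).reshape(-1, 1)

X_train, X_test = train_test_split(X_demo, test_size=0.25, random_state=42)

scaling_pipeline = Pipeline([('scaler', StandardScaler())])

# fit() learns the mean and standard deviation from the training rows only
scaling_pipeline.fit(X_train)

# transform() reuses those training statistics on the unseen test rows
X_test_scaled = scaling_pipeline.transform(X_test)

learned_mean = scaling_pipeline.named_steps['scaler'].mean_[0]
print(f"Mean learned by the scaler: {learned_mean:.2f}")
print(f"Mean of the training rows:  {X_train.mean():.2f}")
print(f"Mean of all rows:           {X_demo.mean():.2f}")
```

The scaler's learned mean matches the training rows, not the full dataset, so no information from the test rows enters the preprocessing step.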

## 🏗️ Components We'll Use:
1. **ColumnTransformer** - Apply different transformations to different columns
2. **Pipeline** - Chain multiple steps together
3. **Custom Transformers** - Create our own transformation steps
4. **StandardScaler, OneHotEncoder, etc.** - Built-in transformers

In [None]:
# Create a comprehensive dataset for pipeline demonstration
np.random.seed(42)

pipeline_data = pd.DataFrame({
    'Age': [25, 35, 28, 45, 32, 55, 29, 38, 42, 31, 150, 22],  # Contains outlier (150)
    'Income': [35000, 85000, 45000, 120000, 55000, 250000, 40000, 95000, 110000, 60000, 50000, 30000],  # Contains outlier
    'Education': ['Bachelor', 'Master', 'High School', 'PhD', 'Bachelor', 'Master', 'High School', 'PhD', 'Master', 'Bachelor', 'Bachelor', 'High School'],
    'City': ['New York', 'London', 'Tokyo', 'Paris', 'New York', 'London', 'Berlin', 'Tokyo', 'Paris', 'Berlin', 'London', 'Tokyo'],
    'Satisfaction': ['Good', 'Excellent', 'Poor', 'Good', 'Fair', 'Excellent', 'Poor', 'Good', 'Fair', 'Excellent', 'Good', 'Fair'],
    'Years_Experience': [2, 12, 5, 20, 8, 25, 3, 15, 18, 6, 4, 1],
    'Premium_Customer': [0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0],  # Target variable
})

print("Pipeline Dataset:")
print(f"Shape: {pipeline_data.shape}")
pipeline_data.head()

In [None]:
# Import required libraries for pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, RobustScaler, OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.base import BaseEstimator, TransformerMixin

# Create custom transformer for outlier handling
class IQROutlierRemover(BaseEstimator, TransformerMixin):
    """
    Custom transformer that handles outliers using the IQR method.

    Out-of-range values are clipped to the IQR bounds instead of dropping rows.
    A transformer inside a Pipeline or ColumnTransformer must return one output
    row per input row; otherwise the other columns and y no longer line up and
    ColumnTransformer fails when it stacks the results.
    """
    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Resolve the columns without modifying the constructor parameter
        if self.columns is None:
            self.columns_ = list(X.select_dtypes(include=[np.number]).columns)
        else:
            self.columns_ = list(self.columns)

        # Learned attributes are created in fit (scikit-learn convention)
        self.bounds_ = {}
        for col in self.columns_:
            Q1 = X[col].quantile(0.25)
            Q3 = X[col].quantile(0.75)
            IQR = Q3 - Q1
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            self.bounds_[col] = (lower_bound, upper_bound)

        return self

    def transform(self, X):
        X_transformed = X.copy()

        for col in self.columns_:
            lower_bound, upper_bound = self.bounds_[col]
            # Cap outliers at the learned bounds (the row count stays the same)
            X_transformed[col] = X_transformed[col].clip(lower=lower_bound, upper=upper_bound)

        return X_transformed

# Test the custom transformer
outlier_remover = IQROutlierRemover(columns=['Age', 'Income'])
outlier_remover.fit(pipeline_data)

print("Outlier bounds calculated:")
for col, (lower, upper) in outlier_remover.bounds_.items():
    print(f"{col}: [{lower:.2f}, {upper:.2f}]")

In [None]:
# Build comprehensive preprocessing pipeline

# Step 1: Separate features and target
X = pipeline_data.drop('Premium_Customer', axis=1)
y = pipeline_data['Premium_Customer']

# Step 2: Define column types
numerical_columns = ['Age', 'Income', 'Years_Experience']
ordinal_columns = ['Education', 'Satisfaction']  
nominal_columns = ['City']

# Step 3: Define ordinal mappings
education_categories = ['High School', 'Bachelor', 'Master', 'PhD']
satisfaction_categories = ['Poor', 'Fair', 'Good', 'Excellent']

# Step 4: Create preprocessing pipeline
preprocessing_pipeline = ColumnTransformer(
    transformers=[
        # Numerical features: IQR outlier handling + robust scaling
        ('numerical', Pipeline([
            ('outlier_removal', IQROutlierRemover(columns=['Age', 'Income'])),
            ('scaler', RobustScaler())
        ]), numerical_columns),
        
        # Ordinal features: ordinal encoding + standard scaling
        ('ordinal', Pipeline([
            ('ordinal_encoder', OrdinalEncoder(categories=[education_categories, satisfaction_categories])),
            ('scaler', StandardScaler())
        ]), ordinal_columns),
        
        # Nominal features: one-hot encoding
        ('nominal', OneHotEncoder(drop='first', sparse_output=False), nominal_columns)
    ],
    remainder='passthrough'  # Keep other columns as-is
)

print("Preprocessing Pipeline Created!")
print("\nPipeline Steps:")
print("1. Numerical columns: IQR outlier handling + Robust scaling")
print("2. Ordinal columns: Ordinal encoding + Standard scaling") 
print("3. Nominal columns: One-hot encoding")
print("\nColumn assignments:")
print(f"Numerical: {numerical_columns}")
print(f"Ordinal: {ordinal_columns}")
print(f"Nominal: {nominal_columns}")

In [None]:
# Apply the preprocessing pipeline
print("Original data shape:", X.shape)

# Fit and transform the data
X_preprocessed = preprocessing_pipeline.fit_transform(X)
print("Preprocessed data shape:", X_preprocessed.shape)

# Get feature names after preprocessing
feature_names = []

# Numerical features (IQR outlier step, then robust scaling)
feature_names.extend([f'{col}_scaled' for col in numerical_columns])

# Ordinal features  
feature_names.extend([f'{col}_encoded' for col in ordinal_columns])

# Nominal features (one-hot encoded, drop first)
onehot_encoder = preprocessing_pipeline.named_transformers_['nominal']
city_features = onehot_encoder.get_feature_names_out(nominal_columns)
feature_names.extend(city_features)

# Create DataFrame with preprocessed data
X_preprocessed_df = pd.DataFrame(X_preprocessed, columns=feature_names)

print("\nPreprocessed Features:")
X_preprocessed_df.head()

### 🎯 Pipeline Benefits Demonstrated

**What our pipeline accomplished:**
1. **Consistent Processing** - Same transformations applied to all data
2. **No Data Leakage** - Parameters are learned in `fit` and reused in `transform` (in a real project, call `fit` on the training split only)
3. **Outlier Handling** - Applied IQR bounds learned during `fit` to `Age` and `Income`
4. **Appropriate Encoding** - Different methods for different variable types
5. **Feature Scaling** - Normalized features for better model performance
6. **Reproducibility** - Can apply same transformations to new data

**Key Advantages:**
- **Fit once, transform many** - Pipeline remembers all parameters
- **Clean separation** - Training vs test data handled properly
- **Easy deployment** - Single object contains entire preprocessing
- **Parameter tuning** - Can optimize preprocessing with model parameters
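
The parameter-tuning point can be sketched with `GridSearchCV`, which addresses each pipeline step's parameters as `<step name>__<parameter>`. The data, step names and grid values below are assumptions picked for illustration, not part of the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy classification data
X_toy, y_toy = make_classification(n_samples=60, n_features=4, random_state=0)

model_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=1000)),
])

# One preprocessing parameter and one model parameter are tuned together
param_grid = {
    'scaler__with_mean': [True, False],
    'model__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(model_pipeline, param_grid, cv=3)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
```

Because the scaler sits inside the pipeline, it is refit on the training folds of every cross-validation split, so the validation folds never influence the scaling statistics.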

In [None]:
# Demonstrate pipeline with new data (simulating test data)
new_data = pd.DataFrame({
    'Age': [30, 50, 200],  # Third value is an outlier
    'Income': [60000, 100000, 1000000],  # Third value is an outlier
    'Education': ['Master', 'PhD', 'Bachelor'],
    'City': ['London', 'Berlin', 'New York'],
    'Satisfaction': ['Good', 'Excellent', 'Fair'],
    'Years_Experience': [7, 18, 3]
})

print("New data to transform:")
print(new_data)

# Transform new data using fitted pipeline (no fitting again!)
new_data_preprocessed = preprocessing_pipeline.transform(new_data)

print(f"\nOriginal new data shape: {new_data.shape}")
print(f"Preprocessed new data shape: {new_data_preprocessed.shape}")
print("\nNote: the row count is unchanged, because a transformer inside a pipeline must return one row per input row.")

# Show preprocessed new data
new_data_preprocessed_df = pd.DataFrame(new_data_preprocessed, columns=feature_names)
new_data_preprocessed_df

---

## 📊 Complete Method Comparison Guide

### 🚨 Outlier Removal Methods Comparison

| Method | Best For | Pros | Cons | When to Use |
|--------|----------|------|------|-------------|
| **IQR** | Any distribution | Robust, non-parametric | May remove valid extreme values | General purpose, skewed data |
| **Z-score** | Normal distribution | Statistical basis, adjustable | Assumes normality, sensitive to outliers | Normally distributed data |
| **Percentile** | Custom control | Simple, predictable | Arbitrary cutoffs | When domain knowledge suggests cutoffs |

### 🔧 Preprocessing Pipeline Benefits

| Aspect | Without Pipeline | With Pipeline |
|--------|------------------|---------------|
| **Consistency** | Manual, error-prone | Automatic, consistent |
| **Data Leakage** | Risk of leakage | Avoided when the pipeline is fit on training data only |
| **Reproducibility** | Difficult to reproduce | Easy to reproduce |
| **Deployment** | Complex setup | Single object |
| **Maintenance** | Scattered code | Centralized preprocessing |

### 🎯 Decision Framework:

**Choose IQR Method when:**
- Data is skewed or non-normal
- You want robust outlier detection
- Working with diverse datasets

**Choose Z-score Method when:**
- Data is approximately normal
- You need statistical justification
- Want adjustable sensitivity

**Choose Percentile Method when:**
- You know the desired percentage to remove
- Domain expertise suggests specific cutoffs
- Simple interpretability is important

**Use Preprocessing Pipelines when:**
- Building production ML systems
- Need consistent preprocessing
- Working with train/test splits
- Deploying models to production
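
For the deployment case, one common approach is to persist the fitted pipeline with `joblib` and reload it where predictions are served. The sketch below assumes a tiny one-step pipeline and a hypothetical file name; it is an illustration, not a full deployment recipe.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit a small pipeline on hypothetical training values
train_values = np.array([[1.0], [2.0], [3.0], [4.0]])
fitted_pipeline = Pipeline([('scaler', StandardScaler())]).fit(train_values)

# Save the fitted pipeline to disk, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'preprocessing_pipeline.joblib')  # hypothetical file name
    joblib.dump(fitted_pipeline, path)
    restored_pipeline = joblib.load(path)

# The restored pipeline applies exactly the same learned parameters
new_values = np.array([[2.5], [10.0]])
same_output = np.allclose(
    fitted_pipeline.transform(new_values),
    restored_pipeline.transform(new_values),
)
print(f"Restored pipeline gives identical output: {same_output}")
```

If the pipeline contains a custom transformer class, that class has to be importable in the environment that loads the file.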

## 🎓 Summary

### 📋 Key Takeaways:

1. **Categorical Variables** come in two types:
   - **Nominal**: No natural order (colors, brands, countries)
   - **Ordinal**: Clear natural order (education, ratings, sizes)

2. **Label Encoding**:
   - Converts categories to numbers (0, 1, 2, ...)
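   - Suited to ordinal variables when the integers follow the real ranking
   - scikit-learn's `LabelEncoder` assigns codes alphabetically, so use `OrdinalEncoder` with an explicit `categories` list to preserve order; either way, a single integer column saves memory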

3. **One-Hot Encoding**:
   - Creates binary columns for each category
   - Perfect for nominal variables
   - Prevents false ordering

4. **Binarization**:
   - Converts numerical to binary (0/1) based on threshold
   - Perfect for creating flags and indicators
   - Simplifies complex relationships

5. **Standardization (Z-score)**:
   - Rescales each feature to mean 0 and standard deviation 1
   - Perfect for normally distributed data
   - Works well with SVM, Neural Networks

6. **Min-Max Normalization**:
   - Scales features to [0,1] range
   - Perfect for bounded features
   - Sensitive to outliers

7. **Robust Scaling**:
   - Uses median and IQR instead of mean and std
   - Perfect for data with outliers
   - Less sensitive to extreme values

8. **Outlier Removal Methods**:
   - **IQR Method**: Robust, works with any distribution
   - **Z-score Method**: Statistical, assumes normal distribution
   - **Percentile Method**: Simple, predictable percentage removal

9. **Preprocessing Pipelines**:
   - Combine multiple preprocessing steps
   - Ensure consistency and prevent data leakage
   - Enable easy deployment and reproducibility

### 🎯 Decision Rules:
- **Nominal variables** → One-Hot Encoding
- **Ordinal variables** → Label/Ordinal Encoding with an explicit category order
- **Threshold-based features** → Binarization
- **Normal distribution** → Standardization
- **Need bounded output** → Min-Max Scaling
- **Data with outliers** → Robust Scaling
- **Outlier removal** → IQR (general), Z-score (normal data), Percentile (custom)
- **Production systems** → Always use Preprocessing Pipelines
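
One caveat on the ordinal rule: the integer codes only carry meaning if they follow the real ranking. The sketch below (the `education` values and variable names are illustrative) compares scikit-learn's `LabelEncoder`, which sorts classes alphabetically, with `OrdinalEncoder` given an explicit order.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Education levels listed from lowest to highest
education = np.array(['High School', 'Bachelor', 'Master', 'PhD'])

# LabelEncoder sorts the classes alphabetically before assigning codes
label_codes = LabelEncoder().fit_transform(education)

# OrdinalEncoder with an explicit category list keeps the intended ranking
ranked_levels = ['High School', 'Bachelor', 'Master', 'PhD']
ordinal_encoder = OrdinalEncoder(categories=[ranked_levels])
ordinal_codes = ordinal_encoder.fit_transform(education.reshape(-1, 1)).ravel()

for level, label_code, ordinal_code in zip(education, label_codes, ordinal_codes):
    print(f"{level:<12} LabelEncoder={label_code}  OrdinalEncoder={int(ordinal_code)}")
```

`LabelEncoder` places `Bachelor` (0) below `High School` (1) because "B" sorts before "H", while the explicit category list yields 0, 1, 2, 3 in the intended order.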

### 🚀 What's Next:
- Feature selection techniques
- Advanced encoding methods (Target encoding, Binary encoding)
- Cross-validation with preprocessing pipelines
- Building complete ML workflows
- Model deployment with preprocessing

**Congratulations! You have now covered a comprehensive set of data preprocessing techniques! 🎉**

---

*Remember: Proper preprocessing is the foundation of successful machine learning. Choose techniques based on your data characteristics and business requirements!*