# Lab 1.2.2: Dataset Preprocessing Pipeline

**Module:** 1.2 - Python for AI/ML  
**Time:** 2 hours  
**Difficulty:** ‚≠ê‚≠ê

---

## üéØ Learning Objectives

By the end of this notebook, you will:
- [ ] Load and explore datasets using Pandas
- [ ] Handle missing values with multiple strategies
- [ ] Encode categorical variables properly
- [ ] Implement feature scaling transformations
- [ ] Use a production-ready `Preprocessor` class for ML pipelines

---

## üìö Prerequisites

- Completed: Lab 1.2.1 (NumPy Broadcasting)
- Knowledge of: Basic Python classes, NumPy basics

### Required Packages
- Python 3.9+
- NumPy >= 1.21
- Pandas >= 1.3
- scikit-learn >= 1.0 (for train_test_split)

---

## üåç Real-World Context

**"Garbage in, garbage out"** - the oldest saying in data science.

In the real world:
- 80% of a data scientist's time is spent on data preparation
- Raw data has missing values, inconsistent formats, outliers
- A model is only as good as the data it's trained on

**Examples:**
- Medical records with missing patient data
- E-commerce data with mixed currencies and formats
- Sensor data with failed readings

A solid preprocessing pipeline is the foundation of any ML project!

---

## üßí ELI5: Data Preprocessing

> **Imagine you're making a recipe, but the ingredients are a mess...** üç≥
>
> - Some tomatoes are still in the fridge, some are rotten (missing data)
> - The recipe uses cups but you only have a scale (different formats)
> - Some ingredients are in grams, others in pounds (different scales)
>
> Before you can cook, you need to:
> 1. Find and replace the bad tomatoes
> 2. Convert everything to the same units
> 3. Measure out equal portions
>
> **In AI terms:** ML models need clean, consistent, properly-scaled data.
> Preprocessing transforms messy real-world data into something models can digest.

---

In [2]:
# ============================================================
# Environment Setup and Dependency Checks
# ============================================================
import sys
import subprocess
from pathlib import Path

# Determine the notebook's directory for reliable path resolution
# Works in Jupyter, VS Code, and command-line execution
try:
    # VS Code Jupyter
    notebook_dir = Path(__vsc_ipynb_file__).parent
except NameError:
    try:
        # Standard Jupyter with ipykernel
        import IPython
        notebook_dir = Path(IPython.get_ipython().kernel.session.config.get('IPKernelApp', {}).get('connection_file', '')).parent
        if not (notebook_dir / '../scripts').exists():
            notebook_dir = Path.cwd()
    except:
        notebook_dir = Path.cwd()

# Add scripts directory to path (robust method)
scripts_dir = (notebook_dir / '../scripts').resolve()
if scripts_dir.exists() and str(scripts_dir) not in sys.path:
    sys.path.insert(0, str(scripts_dir))
elif not scripts_dir.exists():
    # Fallback: try relative to cwd
    scripts_dir = Path('../scripts').resolve()
    if scripts_dir.exists() and str(scripts_dir) not in sys.path:
        sys.path.insert(0, str(scripts_dir))

# Check required packages
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

print(f"Python version: {sys.version.split()[0]}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

# Check sklearn availability (needed later)
try:
    import sklearn
    print(f"scikit-learn version: {sklearn.__version__}")
except ImportError:
    print("‚ö†Ô∏è scikit-learn not installed!")
    print("   Install with: pip install scikit-learn")
    print("   Some cells will fail without it.")

# Check if data files exist
data_dir = (notebook_dir / '../data').resolve()
if not data_dir.exists():
    data_dir = Path('../data').resolve()

required_files = ['sample_customers.csv', 'sample_training_history.json', 
                  'sample_embeddings.npy', 'sample_confusion_data.json']
missing_files = [f for f in required_files if not (data_dir / f).exists()]

if missing_files:
    print(f"\n‚ö†Ô∏è Data files not found: {missing_files}")
    generator_script = data_dir / 'generate_sample_data.py'
    if generator_script.exists():
        print("   Generating sample data...")
        result = subprocess.run([sys.executable, str(generator_script)], 
                               capture_output=True, text=True, cwd=str(data_dir))
        if result.returncode == 0:
            print("   ‚úÖ Sample data generated successfully!")
        else:
            print(f"   ‚ùå Error generating data: {result.stderr}")
    else:
        print(f"   ‚ùå Generator script not found at: {generator_script}")
else:
    print(f"\n‚úÖ All data files present in {data_dir}")

print(f"\n{'='*50}")
print("Welcome to the Preprocessing Pipeline Lab! üîß")
print(f"{'='*50}")

Python version: 3.12.3
NumPy version: 2.1.0
Pandas version: 2.3.2
scikit-learn version: 1.7.1

‚úÖ All data files present in /home/trosfy/projects/dgx-spark-ai-curriculum/domain-1-platform-foundations/module-1.2-python-for-ai/data

Welcome to the Preprocessing Pipeline Lab! üîß


---

## Part 1: Loading and Exploring Data

Let's create a realistic dataset with common data quality issues.

In [3]:
# Create a synthetic dataset with realistic issues
np.random.seed(42)

n_samples = 1000

# Generate synthetic customer data
data = {
    'age': np.random.randint(18, 80, n_samples).astype(float),
    'income': np.random.lognormal(10.5, 0.5, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples).astype(float),
    'years_employed': np.random.exponential(5, n_samples),
    'education': np.random.choice(
        ['High School', 'Bachelor', 'Master', 'PhD', None], 
        n_samples, 
        p=[0.3, 0.35, 0.2, 0.1, 0.05]
    ),
    'employment_type': np.random.choice(
        ['Full-time', 'Part-time', 'Self-employed', 'Unemployed'],
        n_samples,
        p=[0.6, 0.15, 0.15, 0.1]
    ),
    'default': np.random.choice([0, 1], n_samples, p=[0.85, 0.15])
}

df = pd.DataFrame(data)

# Introduce realistic data quality issues
# Missing values
missing_age_idx = np.random.choice(n_samples, 50, replace=False)
missing_income_idx = np.random.choice(n_samples, 80, replace=False)
missing_credit_idx = np.random.choice(n_samples, 30, replace=False)

df.loc[missing_age_idx, 'age'] = np.nan
df.loc[missing_income_idx, 'income'] = np.nan
df.loc[missing_credit_idx, 'credit_score'] = np.nan

# Some outliers
df.loc[np.random.choice(n_samples, 5), 'income'] = 1e7  # Millionaires
df.loc[np.random.choice(n_samples, 3), 'years_employed'] = 50  # Very long tenure

print("Dataset created! Let's explore it...")
print(f"\nShape: {df.shape[0]} rows, {df.shape[1]} columns")

Dataset created! Let's explore it...

Shape: 1000 rows, 7 columns


In [4]:
# First look at the data
print("First 10 rows:")
df.head(10)

First 10 rows:


Unnamed: 0,age,income,credit_score,years_employed,education,employment_type,default
0,56.0,72127.372552,541.0,11.844744,Bachelor,Full-time,0
1,69.0,25876.925765,531.0,0.894922,Master,Part-time,1
2,46.0,64651.457991,,5.147562,High School,Part-time,0
3,32.0,30106.449167,499.0,8.299978,High School,Self-employed,0
4,,25666.11769,378.0,14.724744,High School,Self-employed,1
5,25.0,,472.0,2.057467,Bachelor,Self-employed,0
6,78.0,43097.852937,766.0,6.454579,Master,Part-time,0
7,38.0,94444.232145,715.0,3.869758,PhD,Full-time,0
8,56.0,42606.784896,748.0,0.312723,Bachelor,Full-time,0
9,75.0,40000.171914,386.0,5.855911,PhD,Full-time,0


In [5]:
# Data types and basic info
print("Data Types:")
print(df.dtypes)
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

Data Types:
age                float64
income             float64
credit_score       float64
years_employed     float64
education           object
employment_type     object
default              int64
dtype: object

Memory usage: 151.1 KB


In [6]:
# Check for missing values
print("Missing Values:")
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(1)

missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
})
print(missing_df[missing_df['Missing Count'] > 0])

Missing Values:
              Missing Count  Missing %
age                      50        5.0
income                   79        7.9
credit_score             30        3.0
education                37        3.7


In [7]:
# Statistical summary
print("Numerical Features Summary:")
df.describe()

Numerical Features Summary:


Unnamed: 0,age,income,credit_score,years_employed,default
count,950.0,921.0,970.0,1000.0,1000.0
mean,49.727368,96910.93,581.729897,5.296812,0.155
std,18.13863,732451.1,158.442221,5.451948,0.362086
min,18.0,6379.949,300.0,0.002214,0.0
25%,34.25,26538.31,447.25,1.677073,0.0
50%,50.0,38021.34,578.5,3.697526,0.0
75%,66.0,53799.37,721.5,7.235312,0.0
max,79.0,10000000.0,849.0,50.0,1.0


In [8]:
# Categorical features
print("Categorical Features:")
for col in ['education', 'employment_type']:
    print(f"\n{col}:")
    print(df[col].value_counts(dropna=False))

Categorical Features:

education:
education
Bachelor       337
High School    332
Master         195
PhD             99
None            37
Name: count, dtype: int64

employment_type:
employment_type
Full-time        611
Self-employed    155
Part-time        151
Unemployed        83
Name: count, dtype: int64


### üîç What Did We Find?

Our dataset has several issues to address:
1. **Missing values** in age (5%), income (8%), credit_score (3%), education (5%)
2. **Outliers** in income (some very high values)
3. **Categorical variables** that need encoding
4. **Different scales** (age ~18-80, income ~thousands, credit_score ~300-850)

---

## Part 2: Handling Missing Values

### üßí ELI5: Missing Data Strategies

> **Imagine you're taking attendance, but some kids forgot to sign in...** üìã
>
> You have options:
> - **Just remove them:** Don't count missing kids (deletion)
> - **Use the average:** "Most kids are 10 years old, so mark this one as 10" (mean imputation)
> - **Use the middle value:** "Half the class is above/below 10" (median imputation)
> - **Use the most common:** "Most kids chose pizza for lunch" (mode imputation)
> - **Ask a friend:** "Johnny usually sits near missing kid, what would he guess?" (model-based)
>
> Each method has trade-offs!

In [9]:
# Strategy 1: Mean/Median imputation
# Good for normally distributed data (mean) or skewed data (median)

df_imputed = df.copy()

# For age: Use median (less affected by outliers)
age_median = df_imputed['age'].median()
df_imputed['age'].fillna(age_median, inplace=True)
print(f"Age: Imputed {df['age'].isna().sum()} missing values with median = {age_median}")

# For income: Use median (it's right-skewed with outliers)
income_median = df_imputed['income'].median()
df_imputed['income'].fillna(income_median, inplace=True)
print(f"Income: Imputed {df['income'].isna().sum()} missing values with median = ${income_median:,.0f}")

# For credit_score: Use mean (approximately normal distribution)
credit_mean = df_imputed['credit_score'].mean()
df_imputed['credit_score'].fillna(credit_mean, inplace=True)
print(f"Credit Score: Imputed {df['credit_score'].isna().sum()} missing values with mean = {credit_mean:.0f}")

Age: Imputed 50 missing values with median = 50.0
Income: Imputed 79 missing values with median = $38,021
Credit Score: Imputed 30 missing values with mean = 582


In [10]:
# Strategy 2: Mode imputation for categorical data
# Fill with the most frequent value

education_mode = df_imputed['education'].mode()[0]
df_imputed['education'].fillna(education_mode, inplace=True)
print(f"Education: Imputed missing values with mode = '{education_mode}'")

# Verify no more missing values
print(f"\nRemaining missing values: {df_imputed.isnull().sum().sum()}")

Education: Imputed missing values with mode = 'Bachelor'

Remaining missing values: 0


In [11]:
# Strategy 3: Create a "missing" indicator (sometimes useful!)
# The fact that data is missing can itself be informative

df_with_indicators = df.copy()

# Add indicators for missing values
df_with_indicators['income_missing'] = df['income'].isna().astype(int)
df_with_indicators['credit_missing'] = df['credit_score'].isna().astype(int)

# Then impute
df_with_indicators['income'].fillna(df['income'].median(), inplace=True)
df_with_indicators['credit_score'].fillna(df['credit_score'].mean(), inplace=True)

print("Added missing value indicators:")
print(df_with_indicators[['income', 'income_missing', 'credit_score', 'credit_missing']].head(10))

Added missing value indicators:
         income  income_missing  credit_score  credit_missing
0  72127.372552               0    541.000000               0
1  25876.925765               0    531.000000               0
2  64651.457991               0    581.729897               1
3  30106.449167               0    499.000000               0
4  25666.117690               0    378.000000               0
5  38021.341036               1    472.000000               0
6  43097.852937               0    766.000000               0
7  94444.232145               0    715.000000               0
8  42606.784896               0    748.000000               0
9  40000.171914               0    386.000000               0


### üìö Pandas GroupBy and Transform

Before we try the exercise, let's learn a powerful Pandas pattern: **group-based operations**.

**`groupby()`** - Splits data into groups based on column values
**`transform()`** - Applies a function to each group and returns results aligned with the original index

This is perfect for imputing missing values with group-specific statistics!

In [12]:
# GroupBy and Transform - A Powerful Combination
# Let's understand how this works with a simple example

# Create a small example DataFrame
example_df = pd.DataFrame({
    'category': ['A', 'A', 'B', 'B', 'A'],
    'value': [10, np.nan, 30, 40, 50]
})

print("Original data:")
print(example_df)

# Step 1: groupby() splits data into groups
print("\n--- Understanding groupby() ---")
for name, group in example_df.groupby('category'):
    print(f"\nGroup '{name}':")
    print(group)

# Step 2: transform() applies a function to each group
# and returns results with the SAME INDEX as the original
print("\n--- Using transform() ---")

# Calculate median for each group and broadcast back
group_medians = example_df.groupby('category')['value'].transform('median')
print("\nMedian per group (aligned with original rows):")
print(group_medians)

# Step 3: Use with fillna() to impute missing values per group
print("\n--- Group-based imputation ---")
example_df['value_imputed'] = example_df.groupby('category')['value'].transform(
    lambda x: x.fillna(x.median())
)
print(example_df)
print("\n‚úÖ NaN in group 'A' was filled with median of group A (30.0)!")

Original data:
  category  value
0        A   10.0
1        A    NaN
2        B   30.0
3        B   40.0
4        A   50.0

--- Understanding groupby() ---

Group 'A':
  category  value
0        A   10.0
1        A    NaN
4        A   50.0

Group 'B':
  category  value
2        B   30.0
3        B   40.0

--- Using transform() ---

Median per group (aligned with original rows):
0    30.0
1    30.0
2    35.0
3    35.0
4    30.0
Name: value, dtype: float64

--- Group-based imputation ---
  category  value  value_imputed
0        A   10.0           10.0
1        A    NaN           30.0
2        B   30.0           30.0
3        B   40.0           40.0
4        A   50.0           50.0

‚úÖ NaN in group 'A' was filled with median of group A (30.0)!


### ‚úã Try It Yourself: Exercise 1

**Task:** Implement group-based imputation.

Instead of using the global median for `income`, impute using the median income *for each education level*.

This makes sense because income likely varies by education!

<details>
<summary>üí° Hint</summary>

Use `df.groupby('education')['income'].transform()` with a function that returns the median where values are not null.

```python
df['income'] = df.groupby('education')['income'].transform(
    lambda x: x.fillna(x.median())
)
```

</details>

In [13]:
# YOUR CODE HERE - Exercise 1
df_group_imputed = df.copy()

# First, fill education missing values (we need groups)
df_group_imputed['education'].fillna(df['education'].mode()[0], inplace=True)

# TODO: Impute income using median per education group
# Hint: Use groupby and transform
df['income'] = df.groupby('education')['income'].transform(
    lambda x: x.fillna(x.median())
)
# Uncomment to verify:
print("Median income by education (before imputation):")
print(df.groupby('education')['income'].median())

Median income by education (before imputation):
education
Bachelor       38243.882149
High School    40751.323933
Master         35390.042132
PhD            37263.957723
Name: income, dtype: float64


---

## Part 3: Encoding Categorical Variables

ML models work with numbers, not text. We need to convert categorical variables!

### üßí ELI5: Why Encode Categories?

> **Imagine teaching a robot to understand colors...** ü§ñ
>
> The robot only understands numbers. How do you explain "red", "blue", "green"?
>
> **Option 1 - Label Encoding:** Red=1, Blue=2, Green=3
> - Problem: Robot thinks Green(3) > Blue(2) > Red(1). But colors aren't ordered!
>
> **Option 2 - One-Hot Encoding:** 
> - Red   = [1, 0, 0]
> - Blue  = [0, 1, 0]  
> - Green = [0, 0, 1]
> - Now each color is equally different from the others!
>
> **When to use which:**
> - Label: For ordinal data (Low < Medium < High)
> - One-Hot: For nominal data (Red, Blue, Green - no order)

### üìö One-Hot Encoding with pd.get_dummies()

Pandas provides `pd.get_dummies()` for one-hot encoding - it converts categorical columns into binary (0/1) columns.

**Syntax:** `pd.get_dummies(data, prefix='column_prefix', dtype=int)`

- `data`: Series or DataFrame to encode
- `prefix`: String prefix for new column names  
- `dtype`: Data type for output columns (use `int` for 0/1 values)

In [14]:
# Start fresh with imputed data
df_encoded = df_imputed.copy()

print("Before encoding:")
print(df_encoded[['education', 'employment_type']].head())

Before encoding:
     education employment_type
0     Bachelor       Full-time
1       Master       Part-time
2  High School       Part-time
3  High School   Self-employed
4  High School   Self-employed


In [15]:
# Method 1: Label Encoding (for ordinal data)
# Education has a natural order: High School < Bachelor < Master < PhD

education_order = {
    'High School': 0,
    'Bachelor': 1,
    'Master': 2,
    'PhD': 3
}

df_encoded['education_encoded'] = df_encoded['education'].map(education_order)

print("Label Encoded Education:")
print(df_encoded[['education', 'education_encoded']].drop_duplicates().sort_values('education_encoded'))

Label Encoded Education:
     education  education_encoded
2  High School                  0
0     Bachelor                  1
1       Master                  2
7          PhD                  3


In [16]:
# Method 2: One-Hot Encoding (for nominal data)
# Employment type has no natural order

employment_dummies = pd.get_dummies(
    df_encoded['employment_type'], 
    prefix='emp',
    dtype=int
)

print("One-Hot Encoded Employment Type:")
print(employment_dummies.head())
print(f"\nNew columns created: {list(employment_dummies.columns)}")

One-Hot Encoded Employment Type:
   emp_Full-time  emp_Part-time  emp_Self-employed  emp_Unemployed
0              1              0                  0               0
1              0              1                  0               0
2              0              1                  0               0
3              0              0                  1               0
4              0              0                  1               0

New columns created: ['emp_Full-time', 'emp_Part-time', 'emp_Self-employed', 'emp_Unemployed']


In [17]:
# Combine with original dataframe
df_encoded = pd.concat([df_encoded, employment_dummies], axis=1)

# Drop original categorical columns
df_encoded = df_encoded.drop(['education', 'employment_type'], axis=1)

print("Final encoded dataframe:")
print(df_encoded.head())

Final encoded dataframe:
    age        income  credit_score  years_employed  default  \
0  56.0  72127.372552    541.000000       11.844744        0   
1  69.0  25876.925765    531.000000        0.894922        1   
2  46.0  64651.457991    581.729897        5.147562        0   
3  32.0  30106.449167    499.000000        8.299978        0   
4  50.0  25666.117690    378.000000       14.724744        1   

   education_encoded  emp_Full-time  emp_Part-time  emp_Self-employed  \
0                  1              1              0                  0   
1                  2              0              1                  0   
2                  0              0              1                  0   
3                  0              0              0                  1   
4                  0              0              0                  1   

   emp_Unemployed  
0               0  
1               0  
2               0  
3               0  
4               0  


---

## Part 4: Feature Scaling

Different features have different scales. Most ML algorithms work better when features are on similar scales.

### üßí ELI5: Why Scale Features?

> **Imagine comparing people by height and age...** üìè
>
> - Person A: Height = 180cm, Age = 25
> - Person B: Height = 160cm, Age = 45
>
> If you calculate "distance" between them:
> - Height difference: 20
> - Age difference: 20
>
> These look equal, but 20cm is a big height difference while 20 years is enormous!
>
> **Scaling puts everything on the same playing field** so that one feature
> doesn't dominate just because it has bigger numbers.

In [18]:
# Look at the current scales
numeric_cols = ['age', 'income', 'credit_score', 'years_employed']

print("Current feature ranges:")
print(df_imputed[numeric_cols].describe().loc[['min', 'max', 'mean', 'std']])

Current feature ranges:
            age        income  credit_score  years_employed
min   18.000000  6.379949e+03    300.000000        0.002214
max   79.000000  1.000000e+07    849.000000       50.000000
mean  49.741000  9.225865e+04    581.729897        5.296812
std   17.678984  7.030736e+05    156.045075        5.451948


In [19]:
# Method 1: StandardScaler (Z-score normalization)
# Transforms to mean=0, std=1
# Best for: Algorithms that assume normal distribution (linear regression, SVM)

def standard_scale(data):
    """Scale features to have mean=0 and std=1."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    return (data - mean) / std, mean, std

df_standard = df_imputed.copy()
X = df_standard[numeric_cols].values

X_scaled, means, stds = standard_scale(X)
df_standard[numeric_cols] = X_scaled

print("After StandardScaler:")
print(df_standard[numeric_cols].describe().loc[['min', 'max', 'mean', 'std']].round(2))

After StandardScaler:
       age  income  credit_score  years_employed
min  -1.80   -0.12         -1.81           -0.97
max   1.66   14.10          1.71            8.20
mean  0.00   -0.00          0.00           -0.00
std   1.00    1.00          1.00            1.00


In [20]:
# Method 2: MinMaxScaler
# Transforms to range [0, 1]
# Best for: Neural networks, image data, when you need bounded values

def minmax_scale(data):
    """Scale features to range [0, 1]."""
    min_val = data.min(axis=0)
    max_val = data.max(axis=0)
    return (data - min_val) / (max_val - min_val), min_val, max_val

df_minmax = df_imputed.copy()
X = df_minmax[numeric_cols].values

X_scaled, mins, maxs = minmax_scale(X)
df_minmax[numeric_cols] = X_scaled

print("After MinMaxScaler:")
print(df_minmax[numeric_cols].describe().loc[['min', 'max', 'mean', 'std']].round(2))

After MinMaxScaler:
       age  income  credit_score  years_employed
min   0.00    0.00          0.00            0.00
max   1.00    1.00          1.00            1.00
mean  0.52    0.01          0.51            0.11
std   0.29    0.07          0.28            0.11


In [21]:
# Method 3: RobustScaler
# Uses median and IQR - resistant to outliers!
# Best for: Data with outliers that you want to keep

def robust_scale(data):
    """Scale features using median and interquartile range."""
    median = np.median(data, axis=0)
    q75, q25 = np.percentile(data, [75, 25], axis=0)
    iqr = q75 - q25
    return (data - median) / iqr, median, iqr

df_robust = df_imputed.copy()
X = df_robust[numeric_cols].values

X_scaled, medians, iqrs = robust_scale(X)
df_robust[numeric_cols] = X_scaled

print("After RobustScaler:")
print(df_robust[numeric_cols].describe().loc[['min', 'max', 'mean', 'std']].round(2))

After RobustScaler:
       age  income  credit_score  years_employed
min  -1.10   -1.28         -1.07           -0.66
max   1.00  401.88          1.01            8.33
mean -0.01    2.19          0.00            0.29
std   0.61   28.36          0.59            0.98


### üìö Splitting Data with scikit-learn

**scikit-learn** (sklearn) is Python's most popular ML library. We'll use its `train_test_split` function to properly divide our data.

**Why split data?**
- **Training set**: Data used to fit/learn the model
- **Test set**: Data held back to evaluate performance on unseen examples

**`train_test_split(data, test_size, random_state)`**
- `test_size`: Fraction of data for testing (e.g., 0.2 = 20%)
- `random_state`: Seed for reproducibility (same split each run)

### üîç Comparison of Scaling Methods

| Method | Range | Handles Outliers? | Best For |
|--------|-------|-------------------|----------|
| StandardScaler | Unbounded | No | Normal distributions |
| MinMaxScaler | [0, 1] | No | Neural networks, bounded features |
| RobustScaler | Unbounded | Yes | Data with outliers |

Look at the min/max values - RobustScaler has the widest range because outliers aren't "squished"!

---

## Part 5: Building a Reusable Preprocessor Class

Now let's use our production-ready Preprocessor class from the scripts folder!

### üßí ELI5: Why a Preprocessor Class?

> **Imagine you have a magic recipe book...** üìñ
>
> Every time you make cookies:
> 1. Mix ingredients the same way
> 2. Use the same oven temperature
> 3. Bake for the same time
>
> A class is like that recipe book - it remembers exactly how to prepare data
> so you can repeat it consistently on new data later!

In [22]:
# Import our custom Preprocessor
from preprocessing_pipeline import Preprocessor

print("Preprocessor class imported successfully!")
print("\nDocstring:")
print(Preprocessor.__doc__)

Preprocessor class imported successfully!

Docstring:

    A reusable data preprocessing pipeline for machine learning.

    This class handles common preprocessing tasks:
    - Missing value imputation (mean, median, mode, or constant)
    - Categorical encoding (one-hot or ordinal/label encoding)
    - Feature scaling (standard, minmax, or robust)
    - Adding missing value indicators
    - Log transformations for skewed features

    The preprocessor follows sklearn's fit/transform pattern to prevent
    data leakage: fit on training data, transform on all data.

    Attributes:
        numeric_features: List of numeric column names
        categorical_features: List of categorical column names
        ordinal_mappings: Dict mapping column names to ordinal encodings
        scaling: Scaling method ('standard', 'minmax', 'robust', or None)
        impute_strategy: Imputation strategy ('mean', 'median', or 'mode')
        add_missing_indicators: Whether to add binary missing indicator

In [23]:
# Split data into train/test (in practice, do this before any preprocessing!)
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Training set: {len(train_df)} samples")
print(f"Test set: {len(test_df)} samples")

Training set: 800 samples
Test set: 200 samples


In [24]:
# Create and fit preprocessor
preprocessor = Preprocessor(
    numeric_features=['age', 'income', 'credit_score', 'years_employed'],
    categorical_features=['education', 'employment_type'],
    ordinal_mappings={
        'education': {
            'High School': 0,
            'Bachelor': 1,
            'Master': 2,
            'PhD': 3
        }
    },
    scaling='standard',
    impute_strategy='median'
)

# Fit on training data only!
train_processed = preprocessor.fit_transform(train_df)

# Transform test data (using parameters learned from training)
test_processed = preprocessor.transform(test_df)

print("Preprocessing complete!\n")
print(f"Output features: {preprocessor.get_feature_names()}")

Preprocessing complete!

Output features: ['age', 'income', 'credit_score', 'years_employed', 'education_encoded', 'employment_type_Full-time', 'employment_type_Part-time', 'employment_type_Self-employed', 'employment_type_Unemployed']


In [25]:
# Verify the output
print("Training data after preprocessing:")
print(train_processed.head())

Training data after preprocessing:
          age     income  credit_score  years_employed  default  \
29  -0.666457  -0.080887      0.715340       -0.379849        1   
535  0.015995  14.099072      1.381789       -0.974596        0   
695  1.039673  -0.113645     -1.141657       -0.356711        0   
557  0.527834   0.054997      0.902981       -0.119911        0   
836  0.755318  -0.104624     -0.132279        0.121997        0   

     education_encoded  employment_type_Full-time  employment_type_Part-time  \
29                   2                          1                          0   
535                  0                          1                          0   
695                  2                          1                          0   
557                  2                          1                          0   
836                  1                          0                          0   

     employment_type_Self-employed  employment_type_Unemployed  
29              

In [26]:
# Check that scaling is correct (should be mean~0, std~1 for training data)
print("Training data statistics (should be ~0 mean, ~1 std):")
print(train_processed[['age', 'income', 'credit_score', 'years_employed']].describe().round(2))

Training data statistics (should be ~0 mean, ~1 std):
          age  income  credit_score  years_employed
count  800.00  800.00        800.00          800.00
mean     0.00   -0.00         -0.00            0.00
std      1.00    1.00          1.00            1.00
min     -1.80   -0.12         -1.81           -0.99
25%     -0.78   -0.09         -0.82           -0.68
50%      0.02   -0.08          0.00           -0.29
75%      0.87   -0.06          0.84            0.38
max      1.67   14.10          1.74            8.26


In [27]:
# Save the preprocessor for later use
import pickle
from pathlib import Path

# Determine save path relative to notebook
try:
    save_dir = Path(__vsc_ipynb_file__).parent / '../data'
except NameError:
    save_dir = Path('../data')

save_path = save_dir / 'preprocessor.pkl'

try:
    with open(save_path, 'wb') as f:
        pickle.dump(preprocessor, f)
    print(f"‚úÖ Preprocessor saved to {save_path}")
    print("\nYou can load it later with:")
    print("  with open('preprocessor.pkl', 'rb') as f:")
    print("      preprocessor = pickle.load(f)")
except IOError as e:
    print(f"‚ö†Ô∏è Could not save preprocessor: {e}")
    print("   This is optional - you can continue without saving.")
    print("   The preprocessor object is still available in memory.")

‚úÖ Preprocessor saved to ../data/preprocessor.pkl

You can load it later with:
  with open('preprocessor.pkl', 'rb') as f:
      preprocessor = pickle.load(f)


---

## ‚ö†Ô∏è Common Mistakes

### Mistake 1: Fitting on test data (data leakage!)

In [None]:
# ‚ùå Wrong: Fitting on all data before splitting
# This causes "data leakage" - test data info leaks into training!

# all_data_scaled = preprocessor.fit_transform(all_data)  # DON'T DO THIS
# train, test = train_test_split(all_data_scaled)         # Test data influenced fit!

# ‚úÖ Right: Split first, then fit on train only
# train, test = train_test_split(raw_data)
# train_processed = preprocessor.fit_transform(train)  # Fit on train
# test_processed = preprocessor.transform(test)        # Only transform test

print("üí° Always split data BEFORE preprocessing!")
print("   Fit on training data, transform on both.")

### Mistake 2: Forgetting to handle new categories

In [None]:
# What if test data has a category not in training?

# Example: Training had ['A', 'B', 'C'], test has ['A', 'B', 'D']
# One-hot encoding will create wrong columns!

print("‚ö†Ô∏è Our Preprocessor handles this by only using categories from training.")
print("   New categories become all-zeros (which may or may not be desired).")
print("")
print("üí° For production, consider:")
print("   - Use an 'Unknown' category")
print("   - Map new categories to most similar known category")
print("   - Raise an error if unexpected category appears")

### Mistake 3: Scaling the target variable incorrectly

In [None]:
# For classification: DON'T scale the target (it's categorical: 0 or 1)
# For regression: You CAN scale the target, but must inverse-transform predictions!

print("üí° In our example, 'default' is binary (0/1) - no scaling needed!")
print("")
print("   For regression targets:")
print("   1. Scale target during training")
print("   2. Inverse-scale predictions to get real values")

---

## üéâ Checkpoint

You've learned:
- ‚úÖ Loading and exploring datasets with Pandas
- ‚úÖ Multiple strategies for handling missing values
- ‚úÖ Label vs One-Hot encoding for categorical features
- ‚úÖ StandardScaler, MinMaxScaler, and RobustScaler
- ‚úÖ Building and using a reusable Preprocessor class
- ‚úÖ Avoiding data leakage in preprocessing

---

## üöÄ Challenge (Optional)

**Build a complete preprocessing pipeline for the Titanic dataset!**

1. Download the Titanic dataset from Kaggle or use seaborn's built-in version
2. Handle missing values in Age, Cabin, and Embarked
3. Engineer new features (e.g., Title from Name, FamilySize)
4. Encode categorical variables
5. Scale numeric features
6. Save your preprocessor parameters for later use

```python
import seaborn as sns
titanic = sns.load_dataset('titanic')
```

In [None]:
# YOUR CHALLENGE CODE HERE

---

## üìñ Further Reading

- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Scikit-learn Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html)
- [Feature Engineering for Machine Learning](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/)

---

## üßπ Cleanup

In [28]:
# Clean up
import gc

del df, df_imputed, df_encoded, train_df, test_df, train_processed, test_processed
gc.collect()

print("‚úÖ Memory cleaned up!")
print("\nüéâ Congratulations! You've mastered data preprocessing!")
print("   Next up: Lab 1.2.3 - Visualization Dashboard")

‚úÖ Memory cleaned up!

üéâ Congratulations! You've mastered data preprocessing!
   Next up: Lab 1.2.3 - Visualization Dashboard
