# Module 3: Handling Missing Data (Imputation)

**Training Objective:** Master techniques for handling missing values in ML pipelines, from simple deletion to advanced statistical imputation.

**Scope:**
- Dropping values: When to use `dropna`
- Constant filling: Using `fillna` for categorical defaults
- Statistical imputation: Using `Imputer` (Mean/Median) for numerical data
- Missing flags: Capturing "missingness" as a feature

## Context and Requirements

- **Training day:** Day 1 - Data Preparation Fundamentals
- **Notebook type:** Demo
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, SELECT, MODIFY
- **Dependencies:** `02_Data_Splitting.ipynb` (creates `customer_train` table)
- **Execution time:** ~20 minutes

> **Critical:** Always fit imputers on TRAINING data only to prevent data leakage!

## Theoretical Introduction

**Why handle missing data?**

Most ML algorithms cannot handle `NULL` values directly. We must decide how to handle them before training.

**Imputation Strategies Comparison:**

| Strategy | When to Use | Pros | Cons |
|----------|-------------|------|------|
| **Drop (dropna)** | <5% missing, random | Simple | Loses data, may introduce bias |
| **Constant (fillna)** | Categorical data | Simple, interpretable | May not reflect reality |
| **Mean** | Normally distributed | Uses all data | Sensitive to outliers |
| **Median** | Skewed data, outliers | Robust | Ignores distribution shape |
| **Mode** | Categorical | Keeps values within existing categories | Inflates the majority class |

**Data Leakage Warning:**
> ⚠️ **Critical Rule:** Always calculate imputation statistics (mean, median) on TRAINING data only! Statistics computed with test rows would leak information about unseen data into the model.
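
The leakage rule above can be illustrated with a minimal, framework-free sketch. It uses only Python's standard library rather than Spark, and the salary lists and the helper names `fit_median` and `impute` are hypothetical, introduced purely for illustration.

```python
from statistics import median

# Hypothetical toy data; None marks a missing salary.
train_salaries = [30_000, None, 42_000, 55_000, None, 48_000]
test_salaries = [None, 250_000, 61_000]


def fit_median(values):
    """Return the median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    return median(observed)


def impute(values, fill_value):
    """Replace missing entries with a previously fitted fill value."""
    return [fill_value if v is None else v for v in values]


# Correct: fit on the training split only ...
train_median = fit_median(train_salaries)

# ... then reuse that single value for every split.
train_filled = impute(train_salaries, train_median)
test_filled = impute(test_salaries, train_median)

# Incorrect (shown for contrast): test rows influence the statistic.
leaky_median = fit_median(train_salaries + test_salaries)

print(f"train-only median: {train_median}")
print(f"leaky median:      {leaky_median}")
```

In this toy data the two medians differ, which is exactly the information that would leak from the test split if the statistic were computed on all rows.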

**Informative Missingness:**
Sometimes, the fact that data is missing is a signal itself:
- A sensor returning `NULL` might indicate equipment failure
- A customer not providing phone number might indicate privacy concerns
- Creating a "missing flag" allows the model to learn these patterns

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [None]:
%run ./00_Setup

**Load Training Data:**

Remember: We fit imputers on TRAIN data only!

In [None]:
# Load Training Data
df = spark.table("customer_train")
display(df.limit(5))

## Section 1: Dropping Missing Values (`dropna`)

**When to drop?**
Dropping data is the easiest but most dangerous method.
- **Pros:** Simple; keeps only fully observed records.
- **Cons:** You lose data (reduced sample size). If the values are not missing at random (e.g., high earners declining to share their salary), dropping those rows introduces **bias**.

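Drop data only when:
1.  Very few rows are affected (< 5% missing) and the values appear to be missing at random: remove those rows with `dropna`.
2.  A column is mostly empty (> 90% missing) and carries no useful signal: remove the column itself with `drop`, since `dropna` only removes rows.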
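
The two thresholds above can be turned into a quick check of how much of each column is missing. The following is a plain-Python sketch, not Spark code: the generated rows and the helpers `missing_fraction` and `suggest_action` are hypothetical, and the cut-offs of 5% and 90% are taken from the list above. It does not test whether the values are missing at random, which still needs human judgment.

```python
# Hypothetical sample of 100 rows; None marks a missing value.
rows = [
    {
        "age": None if i < 2 else 30 + i % 40,              # 2% missing
        "salary": None if i % 5 == 0 else 40_000 + 100 * i,  # 20% missing
        "fax": "555-0100" if i % 20 == 0 else None,          # 95% missing
    }
    for i in range(100)
]


def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def suggest_action(fraction, row_threshold=0.05, column_threshold=0.90):
    """Map a missing fraction to a rough suggestion (thresholds are assumptions)."""
    if fraction == 0:
        return "nothing to do"
    if fraction < row_threshold:
        return "drop rows"
    if fraction > column_threshold:
        return "drop column"
    return "impute"


fractions = {c: missing_fraction(rows, c) for c in ("age", "salary", "fax")}
suggestions = {c: suggest_action(f) for c, f in fractions.items()}

for column, fraction in fractions.items():
    print(f"{column}: {fraction:.0%} missing, suggestion: {suggestions[column]}")
```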

In [None]:
# Drop rows where ANY column is null
df_drop_any = df.dropna(how="any")

# Drop rows where ALL columns are null
df_drop_all = df.dropna(how="all")

# Drop rows where specific columns are null (e.g., 'age')
df_drop_subset = df.dropna(subset=["age"])

# Compare row counts (each count() triggers a Spark job)
print(f"Original: {df.count()}")
print(f"Drop Any: {df_drop_any.count()}")
print(f"Drop All: {df_drop_all.count()}")
print(f"Drop Age: {df_drop_subset.count()}")

## Section 2: Filling with Constants (`fillna`)

Useful for categorical data (e.g., filling missing Country with "Unknown").

In [None]:
# Fill missing Country with 'Unknown' and missing Age with -1
df_filled = df.fillna({
    "country": "Unknown",
    "age": -1
})

display(df_filled.filter(df_filled.age == -1).limit(5))

## Section 3: MLlib Imputer (Mean vs Median)

For numerical data, we can fill gaps with a statistical summary.

**Mean vs. Median:**
- **Mean:** Good for normally distributed data. **Bad** if there are outliers (one billionaire pulls the mean up).
- **Median:** Robust to outliers. Usually the safer default choice for things like Salary or House Prices.

*Note: We always calculate the Mean/Median on the TRAINING set and apply that value to the Test set to avoid data leakage.*
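
To make the billionaire effect concrete, here is a small worked example using Python's standard `statistics` module. The salary figures are invented for illustration only.

```python
from statistics import mean, median

# Hypothetical salaries without and with one extreme outlier.
salaries = [40_000, 45_000, 50_000, 55_000, 60_000]
with_outlier = salaries + [1_000_000_000]

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"without outlier: mean={mean_before:,.0f}  median={median_before:,.0f}")
print(f"with outlier:    mean={mean_after:,.0f}  median={median_after:,.0f}")
```

A single extreme value moves the mean from 50,000 to more than 166 million, while the median only shifts from 50,000 to 52,500. A missing salary filled with that mean would be far from any typical value in the data.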

In [None]:
from pyspark.ml.feature import Imputer

# Define Imputer
# inputCols: columns to fix
# outputCols: new columns with fixed values
imputer = Imputer(
    inputCols=["age", "salary"],
    outputCols=["age_imputed", "salary_imputed"]
)

# Strategy: 'mean' (default), 'median', or 'mode' (Spark 3.1+)
imputer.setStrategy("median")

# Fit on the TRAINING data only (computes one median per input column)
imputer_model = imputer.fit(df)

# Transform Data (Apply the median)
df_imputed = imputer_model.transform(df)

display(df_imputed.select("age", "age_imputed", "salary", "salary_imputed").filter("age IS NULL"))

## Section 4: Creating Missing Flags (Informative Missingness)

Sometimes, the fact that data is missing is a signal in itself.
- *Example:* A user who doesn't fill in "Phone Number" might be less likely to convert than one who does.
- *Example:* A sensor returning `NULL` might mean it's broken, which predicts failure.

By creating a binary flag (`is_missing`), we allow the model to learn this pattern instead of just hiding it with imputation.

In [None]:
from pyspark.sql.functions import when, col

# Create a flag: 1 if Age was missing, 0 otherwise
df_flagged = df_imputed.withColumn("age_missing_flag", when(col("age").isNull(), 1).otherwise(0))

display(df_flagged.select("age", "age_imputed", "age_missing_flag").filter("age_missing_flag == 1").limit(5))

In [None]:
# Save the imputed data for the next module (Feature Transformation)
# We save the version with imputed values and the missing flags
df_flagged.write.mode("overwrite").saveAsTable(f"{catalog_name}.{schema_name}.customer_train_imputed")
print("✅ Saved 'customer_train_imputed' table.")


## Best Practices

### 🎯 Imputation Strategy Guide:

| Data Type | Missing % | Recommended Approach |
|-----------|-----------|---------------------|
| Numerical, normal distribution | <30% | Mean imputation |
| Numerical, skewed/outliers | <30% | Median imputation |
| Categorical | <30% | Mode or "Unknown" constant |
| Any | >50% | Consider dropping column |
| Any | <5% | Can drop rows (if random) |

### ⚠️ Common Mistakes to Avoid:

1. **Imputing before splitting** → Data leakage (test statistics leak into train)
2. **Using global mean/median** → Should be train-only statistics
3. **Ignoring informative missingness** → Miss valuable signal
4. **Always using mean** → Sensitive to outliers, use median
5. **Not documenting imputation** → Reproducibility issues

### 💡 Pro Tips:

- Always create missing flags for important features
- Use `Imputer` from MLlib for Spark-native imputation
- Save the imputer model for applying to test/production data
- Consider domain knowledge (e.g., `age = 0` may be a placeholder for "unknown" and should be treated as missing, not as a real value)
- For time-series: use forward-fill or backward-fill instead of mean
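
The forward-fill and backward-fill idea from the last tip can be sketched in plain Python. This is an illustrative, framework-free version, not a Spark implementation: the helper names `forward_fill` and `backward_fill` and the sensor readings are hypothetical, and the values are assumed to be already sorted by time.

```python
def forward_fill(values):
    """Replace each None with the most recent non-missing value.

    Leading gaps stay None because there is nothing earlier to carry forward.
    """
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


def backward_fill(values):
    """Replace each None with the next non-missing value.

    Trailing gaps stay None because there is nothing later to carry back.
    """
    return forward_fill(values[::-1])[::-1]


# Hypothetical sensor readings ordered by time; None marks a gap.
readings = [None, 20.5, None, None, 21.0, None]

ffilled = forward_fill(readings)
bfilled = backward_fill(readings)

print(ffilled)
print(bfilled)
```

Unlike a mean, both fills respect the ordering of the series. Forward-fill only uses past values, which makes it the safer choice when the filled series feeds a model that must not see the future.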

## Summary

### What we achieved:

- **dropna**: Learned when to safely drop missing data
- **fillna**: Used constants for categorical imputation
- **Imputer**: Applied median imputation for numerical data
- **Missing Flags**: Created binary flags to capture missingness signal

### Key Takeaways:

| # | Principle |
|---|-----------|
| 1 | **Fit on train, transform on all** - prevent data leakage |
| 2 | **Median is safer than mean** - robust to outliers |
| 3 | **Missingness can be informative** - create flags |
| 4 | **Document your strategy** - reproducibility matters |
| 5 | **Domain knowledge helps** - understand why data is missing |

### Data Pipeline Status:

| Table | Created | Used By |
|-------|---------|---------|
| `customer_train` | Module 2 | This module |
| `customer_train_imputed` | ✅ This module | Modules 4-7 |

### Next Steps:

📚 **Next Module:** Module 4 - Feature Transformation (encoding, scaling)

## Cleanup

Optionally remove demo tables created during exercises:

In [None]:
# Cleanup - optionally remove the table created in this notebook

# ⚠️ WARNING: customer_train_imputed is needed for subsequent modules!
# Drop it only after you have finished the whole training.

# Uncomment the lines below to remove the table:

# spark.sql(f"DROP TABLE IF EXISTS {catalog_name}.{schema_name}.customer_train_imputed")

# print("✅ Demo table removed")

print("ℹ️ Cleanup disabled (uncomment code to remove the demo table)")