# Module 3: Handling Missing Data (Imputation)

## Business Context: TechCorp HR Analytics (continued)

**Where We Are:**
We have multiple data quality issues to fix:

| Column | Issue | % Missing | Strategy |
|--------|-------|-----------|----------|
| `age` | Missing values | ~5% | Imputer (median) |
| `salary` | Missing values | ~3% | Imputer (median) |
| `source` | Missing values | ~4% | `fillna("UNKNOWN")` |

---

**Scope:**
- dropna: When to drop missing data
- fillna: Categorical defaults (source)
- Imputer: Numerical imputation (age, salary)
- Missing flags: Capture missingness as feature

## Context and Requirements

- **Training day:** Day 1 - Data Preparation Fundamentals
- **Notebook type:** Demo
- **Technical requirements:**
  - Databricks Runtime 14.x LTS or newer
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, SELECT, MODIFY
- **Dependencies:** `02_Data_Splitting.ipynb` (creates `customer_train` table)
- **Execution time:** ~20 minutes

> **Critical:** Always fit imputers on TRAINING data only to prevent data leakage!

## Theoretical Introduction

**Why handle missing data?**

Most ML algorithms cannot handle `NULL` values directly. We must decide how to handle them before training.

**Imputation Strategies Comparison:**

| Strategy | When to Use | Pros | Cons |
|----------|-------------|------|------|
| **Drop (dropna)** | <5% missing, random | Simple | Loses data, may introduce bias |
| **Constant (fillna)** | Categorical data | Simple, interpretable | May not reflect reality |
| **Mean** | Normally distributed | Uses all data | Sensitive to outliers |
| **Median** | Skewed data, outliers | Robust | Ignores distribution shape |
| **Mode** | Categorical | Keeps values within existing categories | Inflates the most frequent category |

**Data Leakage Warning:**
> ⚠️ **Critical Rule:** Always calculate imputation statistics (mean, median) on TRAINING data only! Statistics computed with test rows would leak information about the held-out data into the model.
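
The rule above can be illustrated with a minimal, pure-Python sketch. The salary lists and the helper names `fit_median` and `impute` are hypothetical and exist only for this illustration; the notebook itself uses the MLlib `Imputer` later on.

```python
from statistics import median

# Hypothetical toy salaries; None marks a missing value.
train_salaries = [52_000, 58_000, None, 61_000, 64_000]
test_salaries = [250_000, None, 300_000]


def fit_median(values):
    """Learn the imputation value from the observed (non-missing) entries."""
    return median(v for v in values if v is not None)


def impute(values, fill_value):
    """Replace missing entries with a previously learned value."""
    return [fill_value if v is None else v for v in values]


# Correct: the statistic is learned from the training rows only ...
train_median = fit_median(train_salaries)
# ... and reused unchanged on the test rows.
test_imputed = impute(test_salaries, train_median)

# Leaky: the statistic also sees the test rows, so it drifts toward them.
leaky_median = fit_median(train_salaries + test_salaries)

print(f"train-only median: {train_median}")
print(f"leaky median:      {leaky_median}")
print(f"imputed test rows: {test_imputed}")
```

In this toy data the leaky median is higher than the train-only median because it has already "seen" the large test salaries, which is exactly the information a model should not have at training time.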

**Informative Missingness:**
Sometimes, the fact that data is missing is a signal itself:
- A sensor returning `NULL` might indicate equipment failure
- A customer not providing phone number might indicate privacy concerns
- Creating a "missing flag" allows the model to learn these patterns

## Per-User Isolation

Run the initialization script for per-user catalog and schema isolation:

In [0]:
%run ./00_Setup

**Load Training Data:**

Remember: We fit imputers on TRAIN data only!

In [0]:
# Load Training Data
df = spark.table("customer_train")
display(df)

## Section 1: Dropping Missing Values (`dropna`)

**When to drop?**
Dropping data is the easiest but most dangerous method.
- **Pros:** Simple, removes noise.
- **Cons:** You lose data (reduced sample size). If the missingness is not random (e.g., rich people refusing to share salary), dropping them introduces **Bias**.

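Dropping is reasonable only when:
1.  Very few rows are affected (< 5% missing) and the missingness looks random: drop those rows with `dropna`.
2.  A column is mostly empty (> 90% missing) and carries no signal: drop the column itself with `df.drop(...)`, because `dropna` removes rows, not columns.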

In [0]:
# Drop rows where ANY column is null
df_drop_any = df.dropna(how="any")

In [0]:
display(df_drop_any)

In [0]:
# Drop rows where ALL columns are null
df_drop_all = df.dropna(how="all")

In [0]:
display(df_drop_all)

In [0]:

# Drop rows where specific columns are null (e.g., 'age')
df_drop_subset = df.dropna(subset=["age"])

In [0]:
display(df_drop_subset)

In [0]:
print(f"Original: {df.count()}")
print(f"Drop Any: {df_drop_any.count()}")
print(f"Drop Age: {df_drop_subset.count()}")

## Section 2: Filling with Constants (`fillna`)

**When to use fillna?**
Best for **categorical data** where a constant like "Unknown" or "Other" makes business sense.

In our TechCorp data:
- `source` has ~4% missing values (unknown recruitment channel)
- We'll fill with "UNKNOWN" - this becomes a valid category for the model to learn from!

In [0]:
# Check missing source values before filling
print(f"Missing source values: {df.filter(df.source.isNull()).count()}")

# Fill missing source with 'UNKNOWN' (categorical imputation)
df_source_filled = df.fillna({"source": "UNKNOWN"})

# Verify the fill worked
print(f"After fillna - Missing source: {df_source_filled.filter(df_source_filled.source.isNull()).count()}")
print(f"UNKNOWN source count: {df_source_filled.filter(df_source_filled.source == 'UNKNOWN').count()}")

In [0]:
# Show the distribution of source values (including UNKNOWN)
display(df_source_filled.groupBy("source").count().orderBy("count", ascending=False))

## Section 3: MLlib Imputer (Mean vs Median)

For numerical data, we can fill gaps with a statistical summary.

**Mean vs. Median:**
- **Mean:** Good for normally distributed data. **Bad** if there are outliers (one billionaire pulls the mean up).
- **Median:** Robust to outliers. Usually the safer default choice for things like Salary or House Prices.
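
A quick way to see the difference is to add one extreme value to a small list. The salaries below are made up for illustration.

```python
from statistics import mean, median

# Hypothetical salaries without and with a single extreme value.
salaries = [48_000, 52_000, 55_000, 58_000, 61_000]
with_outlier = salaries + [5_000_000]

print(f"no outlier:   mean={mean(salaries)}, median={median(salaries)}")
print(f"with outlier: mean={mean(with_outlier)}, median={median(with_outlier)}")
```

The single outlier moves the mean from 54,800 to 879,000, while the median only moves from 55,000 to 56,500.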

*Note: We always calculate the Mean/Median on the TRAINING set and apply that value to the Test set to avoid data leakage.*

### PySpark MLlib `Imputer` Options

The `pyspark.ml.feature.Imputer` is used to fill missing values in numerical columns. Key options:

- **inputCols**: List of input column names with missing values.
- **outputCols**: List of output column names for imputed values.
- **strategy**: Imputation method. Options:
  - `"mean"`: Replace missing values with the mean (default).
  - `"median"`: Replace missing values with the median.
  - `"mode"`: Replace missing values with the most frequent value.
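- **missingValue**: Placeholder value to treat as missing (default: `float('nan')`). `NULL` values in the input columns are always treated as missing as well.
- **relativeError**: Relative target precision of the approximate quantile algorithm used for `"median"` (default: `0.001`).
- **Missing indicators**: `Imputer` has no built-in option for indicator columns (unlike scikit-learn's `add_indicator`). Create missing flags manually with `when(col(...).isNull(), 1)`, as shown in Section 4.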

**Example:**
```python
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=["age", "salary"],
    outputCols=["age_imputed", "salary_imputed"],
    strategy="median"
)
model = imputer.fit(df)
df_imputed = model.transform(df)
```

In [0]:
from pyspark.ml.feature import Imputer

# Check missing values before imputation
print("Missing values BEFORE imputation:")
print(f"  - age: {df_source_filled.filter(df_source_filled.age.isNull()).count()}")
print(f"  - salary: {df_source_filled.filter(df_source_filled.salary.isNull()).count()}")

# Define Imputer for NUMERICAL columns
# inputCols: columns to fix
# outputCols: new columns with fixed values
imputer = Imputer(
    inputCols=["age", "salary"],
    outputCols=["age_imputed", "salary_imputed"]
)

# Strategy: 'mean' or 'median' - we use median because salary has outliers (C-level execs)
imputer.setStrategy("median")

# Fit on Data (Calculate the median from TRAINING data only!)
imputer_model = imputer.fit(df_source_filled)

# Print the learned medians (surrogateDF holds a single row, one column per input column)
surrogates = imputer_model.surrogateDF.first()
print("\nLearned imputation values (medians):")
print(f"  - age median: {surrogates['age']}")
print(f"  - salary median: {surrogates['salary']}")

# Transform Data (Apply the median)
df_imputed = imputer_model.transform(df_source_filled)

In [0]:
# Show original and imputed values for rows where age or salary was NULL
print("Imputation examples (rows where the original age or salary was NULL):")
display(df_imputed.select("age", "age_imputed", "salary", "salary_imputed").filter("age IS NULL OR salary IS NULL"))

## Section 4: Creating Missing Flags (Informative Missingness)

Sometimes, the fact that data is missing is a signal in itself.
- *Example:* A user who doesn't fill in "Phone Number" might be less likely to convert than one who does.
- *Example:* A sensor returning `NULL` might mean it's broken, which predicts failure.

By creating a binary flag (`is_missing`), we allow the model to learn this pattern instead of just hiding it with imputation.

In [0]:
from pyspark.sql.functions import when, col

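# Create missing flags for age and salary, plus a flag for source values filled with 'UNKNOWN'
# This allows the model to learn if "missingness" itself is predictive
# Note: the source flag assumes 'UNKNOWN' never occurs as a genuine value in the raw data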

df_flagged = df_imputed \
    .withColumn("age_missing_flag", when(col("age").isNull(), 1).otherwise(0)) \
    .withColumn("salary_missing_flag", when(col("salary").isNull(), 1).otherwise(0)) \
    .withColumn("source_was_unknown", when(col("source") == "UNKNOWN", 1).otherwise(0))

# Summary of missing flags
print("Missing flags summary:")
print(f"  - Records with missing age: {df_flagged.filter(col('age_missing_flag') == 1).count()}")
print(f"  - Records with missing salary: {df_flagged.filter(col('salary_missing_flag') == 1).count()}")
print(f"  - Records with unknown source: {df_flagged.filter(col('source_was_unknown') == 1).count()}")

In [0]:
# Show examples of flagged records
display(df_flagged.select(
    "age", "age_imputed", "age_missing_flag",
    "salary", "salary_imputed", "salary_missing_flag",
    "source", "source_was_unknown"
).filter("age_missing_flag == 1 OR salary_missing_flag == 1").limit(10))

In [0]:
# Save the imputed data for the next module (Feature Transformation)
# We save the version with:
# - source filled with "UNKNOWN"
# - age_imputed and salary_imputed (median values)
# - missing flags (age_missing_flag, salary_missing_flag, source_was_unknown)

df_flagged.write.mode("overwrite").saveAsTable(f"{catalog_name}.{schema_name}.customer_train_imputed")

print("✅ Saved 'customer_train_imputed' table with:")
print("   - Categorical imputation: source → 'UNKNOWN'")
print("   - Numerical imputation: age_imputed, salary_imputed (median)")
print("   - Missing flags: age_missing_flag, salary_missing_flag, source_was_unknown")

## Best Practices

### 🎯 Imputation Strategy Guide:

| Data Type | Missing % | Recommended Approach |
|-----------|-----------|---------------------|
| Numerical, normal distribution | <30% | Mean imputation |
| Numerical, skewed/outliers | <30% | Median imputation |
| Categorical | <30% | Mode or "Unknown" constant |
| Any | >50% | Consider dropping column |
| Any | <5% | Can drop rows (if random) |

### ⚠️ Common Mistakes to Avoid:

1. **Imputing before splitting** → Data leakage (test statistics leak into train)
2. **Using global mean/median** → Should be train-only statistics
3. **Ignoring informative missingness** → Miss valuable signal
4. **Always using mean** → Sensitive to outliers, use median
5. **Not documenting imputation** → Reproducibility issues

### 💡 Pro Tips:

- Always create missing flags for important features
- Use `Imputer` from MLlib for Spark-native imputation
- Save the imputer model for applying to test/production data
- Consider domain knowledge (e.g., `age = 0` may be a placeholder for "unknown"; treat such values as missing instead of keeping them as real data)
- For time-series: use forward-fill or backward-fill instead of mean

## Summary

### What we achieved:

- **dropna**: Learned when to safely drop missing data
- **fillna**: Used constants for categorical imputation
- **Imputer**: Applied median imputation for numerical data
- **Missing Flags**: Created binary flags to capture missingness signal

### Key Takeaways:

| # | Principle |
|---|-----------|
| 1 | **Fit on train, transform on all** - prevent data leakage |
| 2 | **Median is safer than mean** - robust to outliers |
| 3 | **Missingness can be informative** - create flags |
| 4 | **Document your strategy** - reproducibility matters |
| 5 | **Domain knowledge helps** - understand why data is missing |

### Data Pipeline Status:

| Table | Created | Used By |
|-------|---------|---------|
| `customer_train` | Module 2 | This module |
| `customer_train_imputed` | ✅ This module | Modules 4-7 |

### Next Steps:

📚 **Next Module:** Module 4 - Feature Transformation (encoding, scaling)

## Cleanup

Optionally remove demo tables created during exercises:

In [0]:
# Cleanup - optionally remove the table created in this notebook

# ⚠️ WARNING: customer_train_imputed is needed for subsequent modules!
# Drop it only after you have finished the modules that depend on it.

# Uncomment the lines below to remove the table:

# spark.sql(f"DROP TABLE IF EXISTS {catalog_name}.{schema_name}.customer_train_imputed")

# print("✅ customer_train_imputed removed")

print("ℹ️ Cleanup disabled (uncomment code to remove demo tables)")