# Databricks Data Preparation in ML - Notebook 03
## Data Imputation Fundamentals

**Part of the Databricks Data Preparation in ML Training Series**

---

## Objectives

This notebook covers essential missing data handling techniques required for Databricks ML Associate Certification:

- **Missing Data Mechanisms** - Understanding MCAR, MAR, and MNAR patterns
- **Deletion Methods** - Listwise and pairwise deletion strategies
- **Simple Imputation** - Mean, median, mode, and constant value imputation
- **Advanced Imputation** - KNN, regression-based, and iterative methods
- **Evaluation Strategies** - Assessing imputation quality and impact

## Duration: ~45 minutes
## Level: Fundamental → Intermediate

---

## Why is Proper Missing Data Handling Critical?

Missing data is a **common challenge in real-world datasets** that can significantly impact ML model performance:
- **Bias Reduction** - Avoiding systematic errors in model training
- **Statistical Power** - Preserving the reliability of statistical inference
- **Model Performance** - Ensuring optimal algorithm performance
- **Production Readiness** - Maintaining stability in production environments

---

## Theory: Missing Data Mechanisms

Understanding the mechanism behind missing data is crucial for choosing the appropriate imputation strategy.

### Missing Completely at Random (MCAR)
```
Missingness is completely random and independent of observed/unobserved data
P(Missing | Observed, Unobserved) = P(Missing)
Example: Sensor malfunction due to random hardware failures
```
- **Characteristics**: Easiest to handle, least biased
- **Detection**: Little's MCAR test
- **Strategy**: Deletion and most imputation methods leave means unbiased, although single-value imputation still shrinks variance

### Missing at Random (MAR)
```
Missingness depends on observed data but not on unobserved values
P(Missing | Observed, Unobserved) = P(Missing | Observed)
Example: Older customers less likely to provide income information
```
- **Characteristics**: Can be predicted from other variables
- **Detection**: Analysis of missing patterns vs observed variables
- **Strategy**: Use observed data to inform imputation
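
One way to run the "missing patterns vs observed variables" check is to compare an observed variable between rows where the target is missing and rows where it is present. The sketch below uses pandas on synthetic data rather than Spark; the column names, the missingness rates, and the Welch t-statistic as the comparison are illustrative assumptions, not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(45, 15, n).clip(18, 80)
income = rng.lognormal(10, 0.8, n)

# MAR by construction: the chance that income is missing depends only on age
p_missing = np.where(age > 60, 0.30, 0.05)
income_observed = np.where(rng.random(n) < p_missing, np.nan, income)

df = pd.DataFrame({"age": age, "income": income_observed})
df["income_status"] = np.where(df["income"].isna(), "missing", "observed")

# Compare the observed variable (age) between the two groups
groups = df.groupby("income_status")["age"].agg(["mean", "var", "count"])
miss, obs = groups.loc["missing"], groups.loc["observed"]

# Welch t-statistic for the difference in mean age
t_stat = (miss["mean"] - obs["mean"]) / np.sqrt(
    miss["var"] / miss["count"] + obs["var"] / obs["count"]
)

print(groups)
print(f"Welch t-statistic: {t_stat:.1f}")
```

A large statistic indicates that missingness is related to age, which rules out MCAR. It cannot distinguish MAR from MNAR, because that would require the unobserved values.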

### Missing Not at Random (MNAR)
```
Missingness depends on the unobserved values themselves
P(Missing | Observed, Unobserved) ≠ P(Missing | Observed)
Example: High earners deliberately withholding income information
```
- **Characteristics**: Most challenging, requires domain knowledge
- **Detection**: Often requires subject matter expertise
- **Strategy**: Model the missingness mechanism explicitly

## Environment Setup

In [0]:
# Basic imports for Databricks ML
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, isnan, isnull, rand, randn, lit, mean, stddev
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.ml.feature import Imputer, VectorAssembler
from pyspark.ml.stat import Correlation
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import KNNImputer, IterativeImputer

In [0]:
# Creating demonstration dataset with missing data
np.random.seed(42)

# Generating patient data with different types of missing values
n_patients = 1000
ages = np.random.normal(45, 15, n_patients).clip(18, 80)
weights = np.random.normal(70, 12, n_patients).clip(40, 120)
heights = np.random.normal(170, 10, n_patients).clip(150, 200)
blood_pressure = np.random.normal(120, 20, n_patients).clip(80, 180)
incomes = np.random.lognormal(10, 0.8, n_patients)

# Schema and data
schema = StructType([
    StructField("patient_id", IntegerType(), True),
    StructField("age", DoubleType(), True),
    StructField("weight", DoubleType(), True),
    StructField("height", DoubleType(), True),
    StructField("blood_pressure", DoubleType(), True),
    StructField("income", DoubleType(), True)
])

data = [(i, float(ages[i]), float(weights[i]), float(heights[i]), 
         float(blood_pressure[i]), float(incomes[i])) for i in range(n_patients)]

df_complete = spark.createDataFrame(data, schema)
display(df_complete.limit(10))

# Missing Data Mechanisms

## Theory
**Missing Data Mechanisms** define why data is missing. Understanding the mechanism is crucial for choosing the right imputation strategy.

### Types of Missing Data:
- **MCAR** (Missing Completely at Random): Missing data is independent of observed and unobserved values
- **MAR** (Missing at Random): Missing data depends only on observed values
- **MNAR** (Missing Not at Random): Missing data depends on unobserved values

### Identification methods:
- **MCAR**: Statistical tests (Little's MCAR test)
- **MAR**: Analysis of missing patterns vs observed variables  
- **MNAR**: Domain knowledge and business logic

### Impact on strategy:
- **MCAR**: Can safely delete or impute
- **MAR**: Imputation using observed variables
- **MNAR**: Advanced methods or mechanism modeling needed

## MCAR: Random Missing Values

Missing Completely at Random (MCAR) occurs when the probability of missing data on a variable is unrelated to any other measured or unmeasured variable. In other words, the missingness is entirely random and does not depend on observed or unobserved data. This mechanism is the easiest to handle: listwise deletion loses statistical power but does not bias estimates, and simple imputation leaves the mean unbiased, although it still shrinks the variance and weakens correlations.

In [0]:
# Simulation of different missing data mechanisms

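# MCAR: Random missing values in weight (10%)
# A fixed seed keeps the missing pattern reproducible across repeated actions
# on the lazily evaluated DataFrame (given unchanged partitioning)
df_mcar = df_complete.withColumn(
    "weight",
    when(rand(seed=42) < 0.1, None).otherwise(col("weight"))
)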

print("MCAR - Random missing values in weight:")
missing_weight = df_mcar.filter(col("weight").isNull()).count()
print(f"Missing weight values: {missing_weight} ({missing_weight/n_patients:.1%})")

## MAR: Missing at Random

Missing at Random (MAR) occurs when the probability of missing data on a variable is related to other observed variables, but not to the value of the variable itself. In this case, the missingness can be explained by information available in the dataset. Imputation methods that use observed data, such as regression or MICE, are appropriate for handling MAR.

**Example:** Income data is more likely to be missing for older individuals, but among people of the same age, missingness is random.

In [0]:
# MAR: Missing income dependent on age (older people are less likely to provide income)
# Seeded rand() keeps the missing pattern reproducible across repeated actions
df_mar = df_mcar.withColumn(
    "income",
    when((col("age") > 60) & (rand(seed=43) < 0.3), None)
    .when((col("age") <= 60) & (rand(seed=44) < 0.05), None)
    .otherwise(col("income"))
)

print("MAR - Missing income dependent on age:")
missing_income = df_mar.filter(col("income").isNull()).count()
print(f"Missing income values: {missing_income} ({missing_income/n_patients:.1%})")

# Check dependency
print("Distribution of missing income by age:")
df_mar.select(
    when(col("age") <= 60, "Young").otherwise("Old").alias("age_group"),
    col("income").isNull().alias("income_missing")
).groupBy("age_group", "income_missing").count().show()

## MNAR: Missing Not at Random

Missing Not at Random (MNAR) occurs when the probability of missing data on a variable is related to the unobserved value itself. In this case, the reason for missingness is directly tied to the missing data, making it the most challenging mechanism to address. Handling MNAR often requires explicit modeling of the missingness process or incorporating domain expertise.

**Example:** High-income individuals are less likely to report their income, so missingness depends on the (unseen) income value.

In [0]:
# MNAR: Missing blood pressure - people with high blood pressure avoid tests
# Seeded rand() keeps the missing pattern reproducible across repeated actions
df_mnar = df_mar.withColumn(
    "blood_pressure_original", col("blood_pressure")
).withColumn(
    "blood_pressure",
    when((col("blood_pressure") > 140) & (rand(seed=45) < 0.4), None)
    .when((col("blood_pressure") <= 140) & (rand(seed=46) < 0.05), None)
    .otherwise(col("blood_pressure"))
)

print("MNAR - Missing blood pressure (high blood pressure = more missing):")
missing_bp = df_mnar.filter(col("blood_pressure").isNull()).count()
print(f"Missing blood pressure values: {missing_bp} ({missing_bp/n_patients:.1%})")

In [0]:
# Check bias
print("\nMean blood pressure before and after introducing missing values:")
orig_mean = df_mnar.agg(mean("blood_pressure_original")).collect()[0][0]
observed_mean = df_mnar.agg(mean("blood_pressure")).collect()[0][0]
print(f"Original mean: {orig_mean:.2f}")
print(f"Observed mean: {observed_mean:.2f}")
# Bias = estimate - true value; negative here because high readings go missing more often
print(f"Bias: {observed_mean - orig_mean:.2f}")

In [0]:
df_missing_data = df_mnar

In [0]:
display(df_missing_data)

# Deletion Methods

## Theory

**Deletion Methods** remove observations or variables with missing data. This is the simplest approach but can lead to loss of information.

### Listwise Deletion (Complete Case Analysis)
- **Removes entire rows** with any missing values
- **Advantages**: Simple, preserves relationships between variables
- **Disadvantages**: Can remove a lot of data; biased if missing data is not MCAR

### Pairwise Deletion
- **Uses available data** for each analysis
- **Advantages**: Preserves more data
- **Disadvantages**: Different sample sizes per statistic; the resulting correlation matrix may be inconsistent (not positive semi-definite)
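
The difference between listwise and pairwise deletion can be seen when computing a correlation matrix. The sketch below uses pandas on synthetic data (the columns `x`, `y`, `z` and the 20% missing rate are assumptions for illustration). `DataFrame.corr()` excludes missing values pair by pair, so it behaves as pairwise deletion, while calling `dropna()` first gives the listwise result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)
z = rng.normal(size=n)
df = pd.DataFrame({"x": x, "y": y, "z": z})

# Knock out roughly 20% of each column independently (MCAR)
for name in df.columns:
    df.loc[rng.random(n) < 0.2, name] = np.nan

# Listwise: keep only rows with no missing value, then correlate
listwise = df.dropna()
corr_listwise = listwise.corr()

# Pairwise: corr() drops missing values separately for each pair of columns
corr_pairwise = df.corr()

n_listwise = len(listwise)
n_pair_xy = int(df[["x", "y"]].notna().all(axis=1).sum())
print(f"Rows used by listwise deletion:        {n_listwise}")
print(f"Rows used for corr(x, y) with pairwise: {n_pair_xy}")
```

Each pairwise coefficient is based on more rows than the listwise one, but every cell of the matrix may rest on a different subset of the data.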

### Variable Deletion
- **Removes variables** with high proportion of missing values
- **Advantages**: Preserves observations
- **Disadvantages**: Loss of potentially important variables

## Missing Data Pattern Analysis - Drop Not Needed Column

Before analyzing missing data patterns, it's important to remove columns that are irrelevant or not needed for your analysis. Dropping unnecessary columns helps focus on meaningful variables, reduces noise, and improves the clarity of missing data visualizations and statistics.

In [0]:
df_missing_data.printSchema()

In [0]:
# Missing data pattern analysis
df_missing = df_missing_data.drop("blood_pressure_original")

In [0]:
display(df_missing)

In [0]:
for col_name in df_missing.columns:
    if col_name != "patient_id":
        missing_count = df_missing.filter(col(col_name).isNull()).count()
        missing_pct = missing_count / n_patients * 100
        print(f"{col_name:15s}: {missing_count:4d} ({missing_pct:5.1f}%)")

# Complete Case Analysis (Listwise Deletion)
df_complete_cases = df_missing.dropna()

In [0]:
print("Listwise Deletion:")
original_count = df_missing.count()
complete_count = df_complete_cases.count()
removed_count = original_count - complete_count

print(f"Original observations: {original_count}")
print(f"Complete cases: {complete_count}")
print(f"Removed observations: {removed_count} ({removed_count/original_count:.1%})")

print(f"Retained data: {complete_count/original_count:.1%}")

In [0]:
display(df_complete_cases)

## Variable Deletion - remove variables with >10% missing values

In [0]:
# Variable Deletion - remove variables whose share of missing values exceeds the threshold
threshold = 0.10
cols_to_keep = ["patient_id"]

for col_name in df_missing.columns:
    if col_name != "patient_id":
        missing_pct = df_missing.filter(col(col_name).isNull()).count() / n_patients
        if missing_pct <= threshold:
            cols_to_keep.append(col_name)
        else:
            print(f"Removing {col_name}: {missing_pct:.1%} missing")

df_var_deleted = df_missing.select(*cols_to_keep)

print(f"Retained columns: {cols_to_keep[1:]}")
print(f"Retained variables: {len(cols_to_keep)-1}/{len(df_missing.columns)-1}")

In [0]:
display(df_var_deleted)

# Simple Imputation Methods

## Theory

**Simple Imputation** replaces missing values with single values based on simple statistics.

### Main Strategies:
- **Mean/Median**: For numerical variables
- **Mode**: For categorical variables  
- **Constant**: Fixed value (0, "Unknown", etc.)
- **Forward/Backward Fill**: Use previous/next values

### Advantages:
- Fast and simple
- Preserves all observations
- Easy to implement

### Disadvantages:
- Reduces variability 
- Can introduce bias
- Doesn't account for relationships between variables
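
The strategies not covered by the numeric imputers (mode, constant, and forward/backward fill) can be sketched in pandas. The tiny DataFrame and its column names are hypothetical; forward and backward fill only make sense when the rows have a meaningful order, such as time-ordered readings.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "temperature" is assumed to be ordered in time
df = pd.DataFrame({
    "blood_type": ["A", "O", None, "O", "B", None],
    "visits": [2.0, np.nan, 1.0, np.nan, 4.0, 3.0],
    "temperature": [36.6, np.nan, np.nan, 37.1, np.nan, 36.9],
})

# Mode for a categorical column (mode() can return ties, so take the first)
mode_value = df["blood_type"].mode().iloc[0]
df["blood_type_mode"] = df["blood_type"].fillna(mode_value)

# Constant: an explicit "Unknown" category, or 0 where zero is meaningful
df["blood_type_const"] = df["blood_type"].fillna("Unknown")
df["visits_const"] = df["visits"].fillna(0)

# Forward fill carries the last observed value forward;
# backward fill pulls the next observed value back
df["temperature_ffill"] = df["temperature"].ffill()
df["temperature_bfill"] = df["temperature"].bfill()

print(df)
```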

## Imputer Overview

An **Imputer** is a tool or class used to fill in missing values in a dataset. It automates the process of replacing missing data with estimated values based on a chosen strategy (e.g., mean, median, mode, or a constant). This ensures that the dataset is complete and suitable for further analysis or modeling.

The code below demonstrates how to use an Imputer to handle missing values. It typically involves:
1. **Selecting a strategy** (e.g., mean, median, most frequent).
2. **Fitting the imputer** to the data to learn the replacement values.
3. **Transforming the dataset** by replacing missing values with the learned values.

This process helps maintain the integrity of the dataset and allows machine learning algorithms to work without errors due to missing data.
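
The three steps can be sketched with scikit-learn's `SimpleImputer`, used here only as a compact stand-in with the same fit/transform pattern. The arrays are made-up values. The point of the sketch is that the replacement values are learned from the training data and then reused unchanged on new data, which avoids leaking information from the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical columns: weight (kg), height (m)
X_train = np.array([[50.0, 1.60], [70.0, np.nan], [np.nan, 1.80], [90.0, 1.70]])
X_test = np.array([[np.nan, 1.75], [65.0, np.nan]])

imputer = SimpleImputer(strategy="mean")      # 1. select a strategy
imputer.fit(X_train)                          # 2. learn replacement values from training data only
X_train_filled = imputer.transform(X_train)   # 3. replace the missing values
X_test_filled = imputer.transform(X_test)     #    and reuse the learned values on new data

print("Learned replacement values:", imputer.statistics_)
print(X_test_filled)
```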

## Mean Imputation 

**Mean Imputation** replaces missing values in a numerical feature with the mean (average) of the observed values in that feature. The value used for imputation is calculated from the available (non-missing) data. This approach is simple and quick, but it can reduce variability and may introduce bias if the data is not missing completely at random.
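
The loss of variability can be quantified. Every imputed point sits exactly on the observed mean and adds nothing to the sum of squared deviations, so the sample standard deviation shrinks by a factor of sqrt((n_obs - 1) / (n - 1)). The NumPy sketch below checks this on synthetic weights; the 30% missing rate is an assumption chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(42)
weights = rng.normal(70, 12, 1000)

# Remove about 30% of the values completely at random (MCAR)
missing = rng.random(weights.size) < 0.30
observed = weights[~missing]

# Mean imputation: every gap receives the mean of the observed values
imputed = np.where(missing, observed.mean(), weights)

std_observed = observed.std(ddof=1)
std_imputed = imputed.std(ddof=1)

# Shrinkage factor predicted by the formula above
expected_ratio = np.sqrt((observed.size - 1) / (weights.size - 1))

print(f"Std of observed values:   {std_observed:.2f}")
print(f"Std after mean imputation: {std_imputed:.2f}")
print(f"Ratio: {std_imputed / std_observed:.3f} (predicted {expected_ratio:.3f})")
```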

In [0]:
# Mean Imputation with Spark ML Imputer
from pyspark.ml.feature import Imputer

# Select numerical columns for imputation
numeric_cols = ["age", "weight", "height", "blood_pressure", "income"]
# Use `c` as the loop variable so it is not confused with the imported pyspark `col` function
output_cols = [f"{c}_imputed" for c in numeric_cols]

# Mean Imputer
mean_imputer = Imputer(
    inputCols=numeric_cols,
    outputCols=output_cols,
    strategy="mean"
)

# Fit and transform
mean_imputer_model = mean_imputer.fit(df_missing)
df_mean_imputed = mean_imputer_model.transform(df_missing)


In [0]:
df_mean_imputed.createOrReplaceTempView("mean_imputed")

In [0]:
%sql
create or replace table silver_patinet_data_imputed
as
select 
patient_id,
cast(age_imputed as int) as age,
case when age is null then 1 else 0 end age_imputed,
cast(weight_imputed as decimal(6,2)) as weight,
case when weight is null then 1 else 0 end weight_imputed,
cast(height_imputed as decimal(6,2)) as height,
case when height is null then 1 else 0 end height_imputed,
cast(blood_pressure_imputed as decimal(6,2)) blood_pressure,
case when blood_pressure is null then 1 else 0 end blood_pressure_imputed,
cast(income_imputed as decimal(12,2)) as income,
case when income is null then 1 else 0 end income_imputed
from mean_imputed

In [0]:
df_silver_imputed = spark.table("silver_patinet_data_imputed")

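### Comparing Base Table with Mean-Imputed Table

To evaluate the impact of mean imputation, compare the summary statistics of the original table (with missing values) to those of the table after mean imputation. Expect the `count` of each imputed column to rise to the full row count, the mean to stay roughly the same, and the standard deviation to fall.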


In [0]:
display(df_missing.describe())

In [0]:
display(df_silver_imputed.describe())

## Median Imputation

**Median Imputation** replaces missing values in a numerical feature with the median of the observed values in that feature. This method is robust to outliers and is often preferred over mean imputation when the data is skewed.
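
A small made-up example shows why the median is the safer fill value for skewed data: a single extreme income pulls the mean far above what most rows look like, while the median barely moves.

```python
import numpy as np

# Hypothetical incomes with one extreme value and one missing entry
incomes = np.array([28_000, 31_000, 35_000, 40_000, 42_000, 1_500_000, np.nan])

mean_fill = np.nanmean(incomes)      # pulled upward by the outlier
median_fill = np.nanmedian(incomes)  # middle of the observed values

print(f"Mean fill value:   {mean_fill:,.0f}")
print(f"Median fill value: {median_fill:,.0f}")
```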



In [0]:
# Median Imputation
median_imputer = Imputer(
    inputCols=numeric_cols,
    outputCols=[f"{c}_median" for c in numeric_cols],
    strategy="median"
)

median_imputer_model = median_imputer.fit(df_missing)
df_median_imputed = median_imputer_model.transform(df_missing)


In [0]:
display(df_median_imputed)

In [0]:
# Original statistics (without missing values)
orig_stats = df_complete.select("weight").describe()
print("\nOriginal statistics:")
orig_stats.display()

# Statistics after mean imputation
mean_stats = df_mean_imputed.select("weight_imputed").describe()
print("After mean imputation:")
mean_stats.display()

# Statistics after median imputation
median_stats = df_median_imputed.select("weight_median").describe()
print("After median imputation:")
median_stats.display()

# Advanced Imputation Methods

## Theory

**Advanced Imputation** uses relationships between variables to better predict missing values.

### 🔄 K-Nearest Neighbors (KNN) Imputation
- **Finds K nearest neighbors** for observations with missing values
- **Imputes mean/median** from neighbor values
- ✅ **Advantages**: Considers local patterns, preserves relationships
- ❌ **Disadvantages**: Computationally expensive, sensitive to the curse of dimensionality and to feature scaling

### 🔄 Regression Imputation
- **Predicts missing values** using other variables
- **Trains regression model** for each variable with missing values
- ✅ **Advantages**: Uses all available information
- ❌ **Disadvantages**: Imputed values fall exactly on the regression line, which understates variability and ignores uncertainty

### 🔄 MICE (Multiple Imputation by Chained Equations)
- **Iterative process** - imputes one variable at a time
- **Uses all other variables** as predictors
- **Multiple imputation** - generates several complete datasets
- ✅ **Advantages**: Accounts for uncertainty, very effective
- ❌ **Disadvantages**: Complex, time-consuming
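
Regression imputation and its main weakness can be sketched with NumPy. The synthetic height/weight relationship and the 30% missing rate are assumptions. The deterministic variant fills each gap with the fitted value. The stochastic variant adds noise drawn from the residual distribution, which is the idea MICE builds on to reflect uncertainty.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
height = rng.normal(170, 10, n)
weight = 0.9 * (height - 170) + 70 + rng.normal(0, 8, n)

# weight is missing completely at random for about 30% of the rows
missing = rng.random(n) < 0.30

# Fit weight ~ height on the complete rows (ordinary least squares)
slope, intercept = np.polyfit(height[~missing], weight[~missing], deg=1)
predicted = intercept + slope * height
residual_std = np.std(weight[~missing] - predicted[~missing], ddof=2)

# Deterministic regression imputation: use the prediction as-is
weight_det = np.where(missing, predicted, weight)

# Stochastic regression imputation: add noise with the residual spread
noise = rng.normal(0, residual_std, n)
weight_sto = np.where(missing, predicted + noise, weight)

print(f"Std of true weights:      {weight.std(ddof=1):.2f}")
print(f"Std, deterministic fill:  {weight_det.std(ddof=1):.2f}")
print(f"Std, stochastic fill:     {weight_sto.std(ddof=1):.2f}")
```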

In [0]:
# Data preparation for sklearn (KNN, MICE)
# Convert to pandas for advanced methods
df_pandas = df_missing.select(*numeric_cols).toPandas()

print("Data preparation for advanced methods:")
print(f"Shape: {df_pandas.shape}")
print(f"Missing values per column:")
print(df_pandas.isnull().sum())

## KNN Imputation

This section applies K-Nearest Neighbors imputation to fill missing values in the numeric columns of the dataset using sklearn's KNNImputer. It uses 5 nearest neighbors to estimate and replace missing values based on the similarity to other rows.
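
`KNNImputer` measures similarity with a NaN-aware Euclidean distance on the raw feature values, so a column with a large numeric range (such as income) can dominate the neighbor search. A common remedy is to standardize before imputing and map the result back afterwards. This is a minimal sketch on synthetic data with assumed column names, not a change to the notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 300
demo = pd.DataFrame({
    "age": rng.normal(45, 15, n).clip(18, 80),
    "weight": rng.normal(70, 12, n).clip(40, 120),
    "income": rng.lognormal(10, 0.8, n),
})
weight_missing = rng.random(n) < 0.1
demo.loc[weight_missing, "weight"] = np.nan

# StandardScaler ignores NaN when fitting and keeps them in place when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(demo)

# Impute in the scaled space so no single column dominates the distance
knn = KNNImputer(n_neighbors=5)
scaled_filled = knn.fit_transform(scaled)

# Map the completed data back to the original units
demo_filled = pd.DataFrame(scaler.inverse_transform(scaled_filled), columns=demo.columns)

print(f"Missing before: {int(demo.isna().sum().sum())}")
print(f"Missing after:  {int(demo_filled.isna().sum().sum())}")
```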

In [0]:
# KNN Imputation
from sklearn.impute import KNNImputer

# KNN Imputer with K=5 neighbors
knn_imputer = KNNImputer(n_neighbors=5)
df_knn_imputed = knn_imputer.fit_transform(df_pandas)

# Convert back to DataFrame
df_knn_pandas = pd.DataFrame(df_knn_imputed, columns=numeric_cols)

print("KNN Imputation (K=5) completed")
print(f"Missing values after KNN: {df_knn_pandas.isnull().sum().sum()}")

## MICE (Multiple Imputation by Chained Equations)

**MICE** is an advanced imputation technique that fills in missing values by modeling each variable with missing data as a function of other variables in a round-robin fashion. It performs multiple rounds of imputation, creating several complete datasets to account for the uncertainty of missing values.

### How MICE Works:
1. **Initial Imputation:** Fill missing values with simple methods (mean, median, etc.).
2. **Iterative Modeling:** For each variable with missing data, regress it on the other variables and update the missing values with predictions.
3. **Repeat:** Cycle through all variables multiple times to refine imputations.
4. **Multiple Datasets:** Generate several imputed datasets to reflect uncertainty.

### Advantages:
- Accounts for relationships between variables
- Provides more accurate and robust imputations
- Quantifies uncertainty by creating multiple datasets

### Disadvantages:
- Computationally intensive
- More complex to implement and interpret
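
With default settings, scikit-learn's `IterativeImputer` returns a single completed dataset. One way to approximate the "multiple datasets" step is to set `sample_posterior=True` and repeat the imputation with different seeds, then pool the estimates. The sketch below does this on synthetic data; the number of imputations `m = 5` and the pooled statistic (mean weight) are illustrative choices.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(11)
n = 500
height = rng.normal(170, 10, n)
weight = 0.9 * (height - 170) + 70 + rng.normal(0, 8, n)
X = np.column_stack([height, weight])
X[rng.random(n) < 0.25, 1] = np.nan  # about 25% of weight missing (MCAR)

m = 5  # number of imputed datasets
estimates = []
for seed in range(m):
    # sample_posterior=True draws imputations from the predictive distribution
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed = imputer.fit_transform(X)
    estimates.append(completed[:, 1].mean())

estimates = np.array(estimates)
pooled_mean = estimates.mean()        # pooled point estimate
between_var = estimates.var(ddof=1)   # spread caused by imputation uncertainty

print(f"Mean weight per imputed dataset: {np.round(estimates, 2)}")
print(f"Pooled mean: {pooled_mean:.2f}, between-imputation variance: {between_var:.4f}")
```

The variation between the imputed datasets is what a single imputation hides; it is one component of the total variance when results are pooled.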

In [0]:
# MICE (Multiple Imputation by Chained Equations)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

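# IterativeImputer is scikit-learn's MICE-style imputer; with these settings it returns a single completed dataset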
mice_imputer = IterativeImputer(random_state=42, max_iter=10)
df_mice_imputed = mice_imputer.fit_transform(df_pandas)

# Convert back to DataFrame
df_mice_pandas = pd.DataFrame(df_mice_imputed, columns=numeric_cols)

print("MICE Imputation completed")
print(f"Missing values after MICE: {df_mice_pandas.isnull().sum().sum()}")


# Comparison of All Methods

## Evaluation Framework

We'll compare all imputation methods in terms of:
- **Distribution preservation** (mean, standard deviation)
- **Correlation preservation** between variables
- **Bias introduced** by each method
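
Correlation preservation can be checked by comparing the correlation of the complete data with the correlation after imputation. The pandas sketch below does this for mean imputation on synthetic data (the variables and the 30% missing rate are assumptions); the same comparison works for any completed DataFrame.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 2000
height = rng.normal(170, 10, n)
weight = 0.9 * (height - 170) + 70 + rng.normal(0, 8, n)
complete = pd.DataFrame({"height": height, "weight": weight})

# Remove about 30% of the weights completely at random
with_gaps = complete.copy()
with_gaps.loc[rng.random(n) < 0.3, "weight"] = np.nan

# Mean imputation with pandas
mean_imputed = with_gaps.fillna(with_gaps.mean())

corr_true = complete.corr().loc["height", "weight"]
corr_imputed = mean_imputed.corr().loc["height", "weight"]

print(f"Correlation in complete data:   {corr_true:.3f}")
print(f"Correlation after mean imputing: {corr_imputed:.3f}")
```

Mean imputation places the filled values on a horizontal line, which pulls the correlation toward zero.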

In [0]:
# Comparison of all methods - preservation of means
print("Comparison of means after different imputation methods:")
print(f"{'Method':<15} {'Age':<8} {'Weight':<8} {'Height':<8} {'BP':<8} {'Income':<10}")
print("-" * 70)

# Original means (without missing values)
orig_means = [df_complete.agg(mean(col)).collect()[0][0] for col in numeric_cols]
print(f"{'Original':<15} {orig_means[0]:<8.1f} {orig_means[1]:<8.1f} {orig_means[2]:<8.1f} {orig_means[3]:<8.1f} {orig_means[4]:<10.0f}")

# Observed (with missing values)
obs_means = [df_missing.agg(mean(col)).collect()[0][0] for col in numeric_cols]
print(f"{'Observed':<15} {obs_means[0]:<8.1f} {obs_means[1]:<8.1f} {obs_means[2]:<8.1f} {obs_means[3]:<8.1f} {obs_means[4]:<10.0f}")

# Mean Imputation
mean_means = [df_mean_imputed.agg(mean(f"{col}_imputed")).collect()[0][0] for col in numeric_cols]
print(f"{'Mean Imputed':<15} {mean_means[0]:<8.1f} {mean_means[1]:<8.1f} {mean_means[2]:<8.1f} {mean_means[3]:<8.1f} {mean_means[4]:<10.0f}")

# KNN
knn_means = df_knn_pandas.mean().values
print(f"{'KNN':<15} {knn_means[0]:<8.1f} {knn_means[1]:<8.1f} {knn_means[2]:<8.1f} {knn_means[3]:<8.1f} {knn_means[4]:<10.0f}")

# MICE  
mice_means = df_mice_pandas.mean().values
print(f"{'MICE':<15} {mice_means[0]:<8.1f} {mice_means[1]:<8.1f} {mice_means[2]:<8.1f} {mice_means[3]:<8.1f} {mice_means[4]:<10.0f}")

In [0]:
# Comparison of standard deviations
print("\n📊 Comparison of standard deviations:")
print(f"{'Method':<15} {'Age':<8} {'Weight':<8} {'Height':<8} {'BP':<8} {'Income':<10}")
print("-" * 70)

# Original std
orig_stds = [df_complete.agg(stddev(col)).collect()[0][0] for col in numeric_cols]
print(f"{'Original':<15} {orig_stds[0]:<8.1f} {orig_stds[1]:<8.1f} {orig_stds[2]:<8.1f} {orig_stds[3]:<8.1f} {orig_stds[4]:<10.0f}")

# Mean Imputation std
mean_stds = [df_mean_imputed.agg(stddev(f"{col}_imputed")).collect()[0][0] for col in numeric_cols]
print(f"{'Mean Imputed':<15} {mean_stds[0]:<8.1f} {mean_stds[1]:<8.1f} {mean_stds[2]:<8.1f} {mean_stds[3]:<8.1f} {mean_stds[4]:<10.0f}")

# KNN std
knn_stds = df_knn_pandas.std().values
print(f"{'KNN':<15} {knn_stds[0]:<8.1f} {knn_stds[1]:<8.1f} {knn_stds[2]:<8.1f} {knn_stds[3]:<8.1f} {knn_stds[4]:<10.0f}")

# MICE std
mice_stds = df_mice_pandas.std().values  
print(f"{'MICE':<15} {mice_stds[0]:<8.1f} {mice_stds[1]:<8.1f} {mice_stds[2]:<8.1f} {mice_stds[3]:<8.1f} {mice_stds[4]:<10.0f}")

print("\n💡 Observations:")
print("- Mean imputation often reduces variance")
print("- KNN and MICE better preserve distributions")
print("- MICE usually closest to original")

# Decision Framework - Choosing Imputation Method

## When to use which method?

### 1️⃣ **Deletion Methods**
✅ **Use when:**
- **Few missing values** (<5%) and MCAR
- **Large dataset** - you can afford to lose data
- **Complete case analysis** is required

### 2️⃣ **Simple Imputation** 
✅ **Use when:**
- **Quick prototyping** - you need a baseline
- **Missing data is MCAR** - simple methods suffice
- **Large datasets** - advanced methods are expensive

### 3️⃣ **KNN Imputation**
✅ **Use when:**
- **Local patterns** are important
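- **Mixed data types** (numerical + categorical) - only with a suitable distance metric; sklearn's `KNNImputer` handles numeric features only, so encode categoricals first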
- **Medium-sized datasets** (not too large due to cost)

### 4️⃣ **MICE**
✅ **Use when:**
- **MAR mechanism** - missingness can be explained by observed variables (MNAR still requires modeling the missingness mechanism)
- **Research/analysis** - you want highest quality
- **Multiple variables** have missing values
- **Uncertainty quantification** is important

## ⚠️ Key Principles

1. **Always analyze missing data mechanism before choosing method**
2. **Validate imputation** - compare distributions before/after
3. **Don't impute target variable** in supervised learning
4. **Consider domain knowledge** - does 0 make sense?
5. **Test multiple methods** and choose best for specific problem
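
Principles 2 and 5 can be combined in a mask-and-score check: hide some values that are actually known, impute them with each candidate method, and measure the error on the hidden cells. The sketch below compares mean and KNN imputation on synthetic data; the 20% masking rate, `n_neighbors=15`, and RMSE as the score are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

rng = np.random.default_rng(9)
n = 2000
height = rng.normal(170, 10, n)
weight = 0.9 * (height - 170) + 70 + rng.normal(0, 8, n)
X_true = np.column_stack([height, weight])

# Hide 20% of the known weight values so the imputations can be scored
hidden = rng.random(n) < 0.20
X_masked = X_true.copy()
X_masked[hidden, 1] = np.nan

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=15),
}

rmse = {}
for name, imputer in candidates.items():
    X_filled = imputer.fit_transform(X_masked)
    errors = X_filled[hidden, 1] - X_true[hidden, 1]
    rmse[name] = float(np.sqrt(np.mean(errors ** 2)))

for name, score in rmse.items():
    print(f"{name:<5} RMSE on hidden cells: {score:.2f}")
```

On real data, mask only cells that were observed, and choose a masking pattern that resembles the suspected missingness mechanism.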