# Module 01: Handling Missing Data

**Difficulty**: ⭐ Beginner  
**Estimated Time**: 60 minutes  
**Prerequisites**: [Module 00: Introduction to Feature Engineering](00_introduction_to_feature_engineering.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:

1. Identify different types of missing data (MCAR, MAR, MNAR)
2. Visualize and analyze missing data patterns
3. Apply simple imputation strategies (mean, median, mode, constant)
4. Use advanced imputation techniques (KNN, iterative imputation)
5. Decide when to delete vs. impute missing values
6. Avoid data leakage when handling missing data

## 1. Why Missing Data Matters

Real-world datasets are **messy**. Missing data is one of the most common issues:

- Survey respondents skip questions
- Sensors malfunction and fail to record
- Data entry errors create gaps
- Privacy restrictions remove sensitive information

**Impact if not handled properly**:
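- Many ML algorithms (including most scikit-learn estimators) raise an error when the input contains missing values
- Dropping every row with a missing value can discard a large share of the dataset when gaps are spread across many columns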
- Poor imputation can introduce bias and reduce model accuracy

### Types of Missingness

1. **MCAR (Missing Completely at Random)**
   - Missing values have no relationship to any data
   - Example: A sensor randomly fails
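   - **Safest to handle** - deletion does not bias estimates such as the mean, though it still costs sample size, and mean imputation still shrinks variance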

2. **MAR (Missing at Random)**
   - Missingness depends on other observed variables
   - Example: Older people less likely to report income
   - **Common in real data** - need careful imputation

3. **MNAR (Missing Not at Random)**
   - Missingness depends on the missing value itself
   - Example: High earners don't report income
   - **Hardest to handle** - need domain expertise
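
The three mechanisms can be simulated to see how each one distorts a simple statistic. The sketch below is illustrative only: the population, the missingness probabilities, and all variable names are assumptions made up for this example. It compares the mean of the observed incomes with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical population: income rises with age
age = rng.integers(18, 70, n)
income = 20_000 + 800 * age + rng.normal(0, 10_000, n)
true_mean = income.mean()

# MCAR: every value has the same 30% chance of being missing
mcar_missing = rng.random(n) < 0.30

# MAR: missingness depends only on age (observed), not on income itself
mar_missing = rng.random(n) < np.where(age > 50, 0.60, 0.10)

# MNAR: missingness depends on the income value itself
mnar_missing = rng.random(n) < np.where(income > np.median(income), 0.60, 0.10)

observed_means = pd.Series({
    "true": true_mean,
    "MCAR": income[~mcar_missing].mean(),
    "MAR": income[~mar_missing].mean(),
    "MNAR": income[~mnar_missing].mean(),
})
print(observed_means.round(0))
```

In this setup the MCAR mean should stay close to the true mean, while the MAR and MNAR means are pulled downward. The MAR bias can in principle be corrected by conditioning on age. The MNAR bias cannot be detected from the observed data alone.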

## 2. Setup

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Missing data visualization
import missingno as msno

# Imputation methods
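from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must run before importing IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer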
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)
pd.set_option('display.max_columns', None)

print("✓ Libraries imported successfully!")

## 3. Create Dataset with Missing Values

Let's create a realistic dataset to practice handling missing data.

In [None]:
# Create customer dataset for loan approval prediction
np.random.seed(42)
n_samples = 500

# Generate complete data first
data = pd.DataFrame({
    'age': np.random.randint(18, 70, n_samples),
    'income': np.random.normal(50000, 20000, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples),
    'employment_years': np.random.randint(0, 40, n_samples),
    'loan_amount': np.random.normal(100000, 50000, n_samples),
    'debt_to_income': np.random.uniform(0, 0.5, n_samples)
})

# Ensure positive values
data['income'] = data['income'].clip(lower=15000)
data['loan_amount'] = data['loan_amount'].clip(lower=5000)

# Create target variable (loan approved)
data['approved'] = (
    (data['credit_score'] > 650) & 
    (data['debt_to_income'] < 0.4) &
    (data['income'] > 30000)
).astype(int)

print(f"Created dataset with {len(data)} samples and {len(data.columns)} columns (including the target)")
print("\nFirst few rows:")
data.head()

In [None]:
# Introduce missing values with different patterns

# MCAR: Randomly remove 10% of income values
mcar_mask = np.random.rand(len(data)) < 0.10
data.loc[mcar_mask, 'income'] = np.nan

# MAR: Older people less likely to report employment years
# Missing depends on age (observed variable)
mar_mask = (data['age'] > 50) & (np.random.rand(len(data)) < 0.25)
data.loc[mar_mask, 'employment_years'] = np.nan

# MNAR: High debt-to-income ratios less likely to be reported
# Missing depends on the value itself
mnar_mask = (data['debt_to_income'] > 0.4) & (np.random.rand(len(data)) < 0.30)
data.loc[mnar_mask, 'debt_to_income'] = np.nan

# Completely random missing in credit_score
random_mask = np.random.rand(len(data)) < 0.05
data.loc[random_mask, 'credit_score'] = np.nan

print("Missing data summary:")
print(data.isnull().sum())
print(f"\nTotal missing values: {data.isnull().sum().sum()}")
print(f"Percentage of data missing: {data.isnull().sum().sum() / (len(data) * len(data.columns)) * 100:.1f}%")

## 4. Visualizing Missing Data

Before handling missing data, **always visualize patterns**.

In [None]:
# Basic missing data summary
missing_summary = pd.DataFrame({
    'Column': data.columns,
    'Missing_Count': data.isnull().sum(),
    'Missing_Percent': (data.isnull().sum() / len(data) * 100).round(2)
}).sort_values('Missing_Count', ascending=False)

print("Missing Data Summary:")
print(missing_summary)

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
ax.barh(missing_summary['Column'], missing_summary['Missing_Percent'])
ax.set_xlabel('Percentage Missing (%)')
ax.set_title('Missing Data by Feature')
plt.tight_layout()
plt.show()

In [None]:
# Using missingno library for advanced visualization

# Matrix visualization shows patterns
print("Missing Data Matrix:")
print("White lines indicate missing values\n")
msno.matrix(data, figsize=(12, 5))
plt.show()

# Bar chart
print("\nMissing Data Bar Chart:")
msno.bar(data, figsize=(12, 5))
plt.show()
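
Plots show where values are missing. A quick numeric check can hint at why. The sketch below builds a hypothetical frame with the same MAR pattern used in this notebook and compares an observed feature (`age`) between rows with and without a gap. A clear difference suggests the missingness depends on that feature. This kind of check cannot reveal MNAR, because that would require the unobserved values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical frame mirroring the MAR pattern used in this notebook
demo = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "employment_years": rng.integers(0, 40, n).astype(float),
})
mar_mask = (demo["age"] > 50) & (rng.random(n) < 0.25)
demo.loc[mar_mask, "employment_years"] = np.nan

# Compare an observed feature between rows with and without the gap
is_missing = demo["employment_years"].isnull()
mean_age_missing = demo.loc[is_missing, "age"].mean()
mean_age_present = demo.loc[~is_missing, "age"].mean()
print(f"Mean age where employment_years is missing: {mean_age_missing:.1f}")
print(f"Mean age where employment_years is present: {mean_age_present:.1f}")

# Correlation between the missingness flag and the observed feature
corr = is_missing.astype(int).corr(demo["age"])
print(f"Correlation(missing flag, age): {corr:.2f}")
```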

## 5. Strategy 1: Deletion

**When to use**: 
- Very small percentage missing (<5%)
- Data is MCAR
- You have abundant data

**Risks**:
- Loss of information
- Introduces bias if not MCAR
- Can lose most of your dataset!

In [None]:
# Option 1: Drop rows with ANY missing values
data_dropna = data.dropna()

print(f"Original dataset: {len(data)} rows")
print(f"After dropping rows with ANY missing: {len(data_dropna)} rows")
print(f"Lost {len(data) - len(data_dropna)} rows ({(len(data) - len(data_dropna))/len(data)*100:.1f}%)")
print("\n⚠️ Listwise deletion removed roughly a quarter of our data!")

In [None]:
# Option 2: Drop columns with >30% missing
threshold = 0.30
cols_to_drop = data.columns[data.isnull().mean() > threshold]

print(f"Columns with >{threshold:.0%} missing:")
print(cols_to_drop.tolist() if len(cols_to_drop) > 0 else "None")

# Option 3: Drop rows where specific critical columns are missing
critical_columns = ['credit_score', 'income']
data_critical = data.dropna(subset=critical_columns)

print(f"\nAfter dropping rows missing critical columns {critical_columns}:")
print(f"Remaining rows: {len(data_critical)} ({len(data_critical)/len(data)*100:.1f}%)")
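
A middle ground between dropping any incomplete row and dropping on specific columns is to remove only the rows that are mostly empty. The sketch below uses the `thresh` argument of pandas `DataFrame.dropna`. The toy frame and the threshold of 3 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 2 is mostly empty, rows 1 and 3 have one gap each
toy = pd.DataFrame({
    "age": [25, 40, np.nan, 58],
    "income": [52_000, np.nan, np.nan, 61_000],
    "credit_score": [700, 640, np.nan, np.nan],
    "loan_amount": [90_000, 120_000, 75_000, 150_000],
})

# Keep rows that have at least 3 non-missing values out of 4 columns
min_present = 3
toy_thresh = toy.dropna(thresh=min_present)

print(f"Rows before: {len(toy)}, after: {len(toy_thresh)}")
print(toy_thresh)
```

The rows that survive still contain gaps, so this step is usually followed by imputation.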

## 6. Strategy 2: Simple Imputation

Replace missing values with statistical measures.

### Common Strategies:
- **Mean**: Good for normally distributed data, sensitive to outliers
- **Median**: Robust to outliers, good for skewed data
- **Mode**: For categorical data
- **Constant**: Domain-specific value (e.g., 0, -1, "Unknown")
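
A small sketch of why the choice matters, using invented values: one extreme salary drags the mean far from the typical value, while the median stays put. The second half shows mode imputation for a categorical column, which `SimpleImputer` exposes as `strategy="most_frequent"`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical skewed numeric column: one extreme salary pulls the mean up
salaries = pd.DataFrame({"salary": [30_000, 32_000, 35_000, 38_000, 1_000_000, np.nan]})

mean_fill = SimpleImputer(strategy="mean").fit(salaries).statistics_[0]
median_fill = SimpleImputer(strategy="median").fit(salaries).statistics_[0]
print(f"Mean fill:   {mean_fill:,.0f}")
print(f"Median fill: {median_fill:,.0f}")

# Hypothetical categorical column: mode imputation via strategy="most_frequent"
cities = pd.DataFrame({"city": ["Paris", "Lyon", "Paris", np.nan, "Paris"]})
mode_imputer = SimpleImputer(strategy="most_frequent")
cities_filled = mode_imputer.fit_transform(cities)
print(cities_filled.ravel())
```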

In [None]:
# IMPORTANT: Split data BEFORE imputation to avoid data leakage!
X = data.drop('approved', axis=1)
y = data['approved']

# The target was computed from the complete data before any values were removed,
# so every row still has a valid label and no rows need to be dropped here.

# Split first!
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

In [None]:
# Mean Imputation
mean_imputer = SimpleImputer(strategy='mean')

# Fit on training data only!
X_train_mean = mean_imputer.fit_transform(X_train)

# Apply same transformation to test data
X_test_mean = mean_imputer.transform(X_test)

# Convert back to DataFrame for readability (keep the original row index)
X_train_mean_df = pd.DataFrame(X_train_mean, columns=X.columns, index=X_train.index)
X_test_mean_df = pd.DataFrame(X_test_mean, columns=X.columns, index=X_test.index)

print("Mean Imputation completed!")
print(f"Missing values in training set: {X_train_mean_df.isnull().sum().sum()}")
print(f"Missing values in test set: {X_test_mean_df.isnull().sum().sum()}")

print("\nExample: Income column")
print(f"Original mean (ignoring NaN): ${X_train['income'].mean():.2f}")
print(f"Imputed value: ${mean_imputer.statistics_[X.columns.get_loc('income')]:.2f}")

In [None]:
# Median Imputation (better for skewed data)
median_imputer = SimpleImputer(strategy='median')

X_train_median = median_imputer.fit_transform(X_train)
X_test_median = median_imputer.transform(X_test)

print("Median Imputation completed!")
print("\nComparison for 'income' column:")
print(f"Mean imputation value: ${mean_imputer.statistics_[X.columns.get_loc('income')]:.2f}")
print(f"Median imputation value: ${median_imputer.statistics_[X.columns.get_loc('income')]:.2f}")

In [None]:
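# Constant Imputation (domain-specific)
# A constant only makes sense where it has a meaning, e.g. 0 for missing employment_years.
# NOTE: this imputer fills EVERY column with 0, including income and credit_score,
# which is unrealistic; it is kept here as a naive baseline for the comparison in Section 8.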
constant_imputer = SimpleImputer(strategy='constant', fill_value=0)

X_train_constant = constant_imputer.fit_transform(X_train)
X_test_constant = constant_imputer.transform(X_test)

print("Constant Imputation completed!")
print("All missing values filled with 0")

## 7. Strategy 3: Advanced Imputation

Use relationships between features to make better predictions of missing values.

In [None]:
# KNN Imputation
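# Finds the k nearest rows (by distance on the observed features) and averages their values
# Distances use raw feature values here, so large-scale columns such as income dominate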

knn_imputer = KNNImputer(n_neighbors=5)

X_train_knn = knn_imputer.fit_transform(X_train)
X_test_knn = knn_imputer.transform(X_test)

X_train_knn_df = pd.DataFrame(X_train_knn, columns=X.columns)

print("KNN Imputation completed!")
print("Uses 5 nearest neighbors to predict missing values")
print(f"Missing values: {X_train_knn_df.isnull().sum().sum()}")
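
`KNNImputer` measures distance on the raw feature values, so a column in dollars can drown out a column that ranges from 0 to 0.5. The sketch below uses made-up rows to show the effect of standardizing first. `StandardScaler` ignores `NaN` when fitting and keeps it when transforming, so it can run before the imputer.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age, income, debt_to_income]; the last row is missing age
X_demo = np.array([
    [25.0, 50_000.0, 0.05],
    [27.0, 51_000.0, 0.10],
    [60.0, 58_000.0, 0.45],
    [62.0, 59_000.0, 0.44],
    [40.0, 90_000.0, 0.25],
    [np.nan, 50_500.0, 0.45],
])

# Without scaling, income dominates the distance: neighbors are the first two rows
raw_filled = KNNImputer(n_neighbors=2).fit_transform(X_demo)

# With scaling, debt_to_income also counts: neighbors become rows three and four
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)
scaled_filled = scaler.inverse_transform(
    KNNImputer(n_neighbors=2).fit_transform(X_scaled)
)

print(f"Imputed age without scaling: {raw_filled[-1, 0]:.1f}")
print(f"Imputed age with scaling:    {scaled_filled[-1, 0]:.1f}")
```

Neither answer is automatically right. The point is that the choice of scale decides which rows count as similar.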

In [None]:
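# Iterative Imputation (inspired by MICE - Multiple Imputation by Chained Equations)
# Models each feature with missing values as a function of other features
# Note: IterativeImputer returns a single imputed dataset by default, not multiple imputations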

iterative_imputer = IterativeImputer(random_state=42, max_iter=10)

X_train_iter = iterative_imputer.fit_transform(X_train)
X_test_iter = iterative_imputer.transform(X_test)

X_train_iter_df = pd.DataFrame(X_train_iter, columns=X.columns)

print("Iterative Imputation completed!")
print("Uses regression models (BayesianRidge by default) to predict missing values")
print(f"Missing values: {X_train_iter_df.isnull().sum().sum()}")
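
Iterative imputation pays off when features are related to each other. The sketch below uses synthetic data in which the second feature is nearly a linear function of the first, an assumption chosen to make the effect obvious. It compares the imputation error against mean imputation on values that were hidden on purpose. scikit-learn marks `IterativeImputer` as experimental, so the `enable_iterative_imputer` import has to come first.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: the second feature is nearly a linear function of the first
x1 = rng.normal(0, 1, n)
x2 = 3 * x1 + rng.normal(0, 0.3, n)
X_full = np.column_stack([x1, x2])

# Hide 20% of x2 completely at random
X_holes = X_full.copy()
hidden = rng.random(n) < 0.20
X_holes[hidden, 1] = np.nan

mean_filled = SimpleImputer(strategy="mean").fit_transform(X_holes)
iter_filled = IterativeImputer(random_state=0).fit_transform(X_holes)

# Mean absolute error on the hidden entries only
mae_mean = np.abs(mean_filled[hidden, 1] - X_full[hidden, 1]).mean()
mae_iter = np.abs(iter_filled[hidden, 1] - X_full[hidden, 1]).mean()
print(f"MAE mean imputation:      {mae_mean:.2f}")
print(f"MAE iterative imputation: {mae_iter:.2f}")
```

On the loan dataset in this notebook the features were generated independently, so the gain from iterative imputation there may be much smaller.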

## 8. Comparing Imputation Methods

Let's compare how different imputation methods affect model performance.

In [None]:
# Train models with different imputation methods

imputation_methods = {
    'Mean': (X_train_mean, X_test_mean),
    'Median': (X_train_median, X_test_median),
    'Constant (0)': (X_train_constant, X_test_constant),
    'KNN': (X_train_knn, X_test_knn),
    'Iterative': (X_train_iter, X_test_iter)
}

results = []

for method_name, (X_tr, X_te) in imputation_methods.items():
    # Train model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_tr, y_train)
    
    # Evaluate
    y_pred = model.predict(X_te)
    accuracy = accuracy_score(y_test, y_pred)
    
    results.append({
        'Method': method_name,
        'Accuracy': accuracy
    })

# Display results
results_df = pd.DataFrame(results).sort_values('Accuracy', ascending=False)
print("Model Performance by Imputation Method:\n")
print(results_df.to_string(index=False))

# Visualize
plt.figure(figsize=(10, 5))
plt.barh(results_df['Method'], results_df['Accuracy'])
plt.xlabel('Accuracy')
plt.title('Model Performance by Imputation Method')
plt.xlim([0.5, 1.05])  # leave room for the value labels next to the bars
for i, v in enumerate(results_df['Accuracy']):
    plt.text(v + 0.01, i, f'{v:.3f}', va='center')
plt.tight_layout()
plt.show()

## 9. Best Practices

### ✅ DO:

1. **Always split data before imputation**
   - Fit imputer on training data
   - Transform both train and test

2. **Visualize missing patterns first**
   - Understand WHY data is missing
   - Check if missing is random

3. **Consider adding indicator features**
   - `was_missing` boolean column
   - Sometimes missingness itself is predictive!

4. **Try multiple methods**
   - Compare performance
   - Different methods work for different data

5. **Document your decisions**
   - Why this imputation method?
   - What assumptions are you making?

### ❌ DON'T:

1. **Don't impute before splitting** (data leakage!)
2. **Don't always use mean** (consider data distribution)
3. **Don't ignore why data is missing**
4. **Don't delete data unless necessary**
5. **Don't forget to handle missing values in production**
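
One way to follow several of these rules at once is to put the imputer inside a scikit-learn `Pipeline`. The sketch below uses synthetic data and an arbitrary choice of model, both assumptions for illustration. The imputer is refit on the training portion of every cross-validation fold, and the same fitted pipeline can then handle new rows with gaps at prediction time.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400

# Hypothetical data: two features, label driven by the first one
X_demo = rng.normal(0, 1, size=(n, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Knock out about 15% of the entries completely at random
X_demo[rng.random(X_demo.shape) < 0.15] = np.nan

# The imputer is refit inside every CV fold, using that fold's training rows only
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# The same fitted object can then impute and predict on new rows with gaps
pipeline.fit(X_demo, y_demo)
new_rows = np.array([[np.nan, 0.3], [1.2, np.nan]])
print(pipeline.predict(new_rows))
```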

## 10. Exercise Section

### Exercise 1: Identify Missing Data Type

For each scenario, identify if it's MCAR, MAR, or MNAR.

In [None]:
# Exercise 1: Classify these missing data scenarios

scenarios = {
    'A': 'Survey responses about salary are missing for unemployed people',
    'B': 'Temperature sensor randomly fails 1% of the time',
    'C': 'People with depression less likely to report mental health status',
    'D': 'Male respondents less likely to answer question about pregnancy',
    'E': 'Very wealthy individuals don\'t report their net worth'
}

print("Classify each scenario as MCAR, MAR, or MNAR:\n")
for key, scenario in scenarios.items():
    print(f"{key}. {scenario}")

print("\nYour answers (write as comment):")
# A: ???
# B: ???
# C: ???
# D: ???
# E: ???

In [None]:
# Solution to Exercise 1

print("Solutions:\n")
print("A: MAR - Missing depends on employment status (observed variable)")
print("B: MCAR - Completely random failure")
print("C: MNAR - Missing depends on the value itself (depression level)")
print("D: MAR - Missing depends on gender (observed variable)")
print("E: MNAR - Missing depends on the value itself (wealth level)")

### Exercise 2: Implement Missing Value Indicator

Create indicator features that show where data was missing.

In [None]:
# Exercise 2: Create missing value indicators

# Create a sample dataset
sample_data = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, 40, np.nan],
    'C': [100, 200, 300, 400, 500]
})

print("Original data:")
print(sample_data)

# TODO: Create indicator features for columns A and B
# Indicator should be 1 if value was missing, 0 otherwise

# Your code here:
# sample_data['A_was_missing'] = ???
# sample_data['B_was_missing'] = ???

# Then impute missing values with median
# Your code here:

print("\nData with indicators and imputation:")
# print(sample_data)

In [None]:
# Solution to Exercise 2

sample_data = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, 40, np.nan],
    'C': [100, 200, 300, 400, 500]
})

# Create indicators BEFORE imputing
sample_data['A_was_missing'] = sample_data['A'].isnull().astype(int)
sample_data['B_was_missing'] = sample_data['B'].isnull().astype(int)

# Now impute
sample_data['A'] = sample_data['A'].fillna(sample_data['A'].median())
sample_data['B'] = sample_data['B'].fillna(sample_data['B'].median())

print("Solution:")
print(sample_data)
print("\nNotice: Indicator columns preserve information about WHERE data was missing!")
print("This can be valuable for the model.")

### Exercise 3: Spot the Data Leakage

Identify what's wrong with this code.

In [None]:
# Exercise 3: What's wrong with this code?

print("Code snippet:")
print('''
# Load data
X = data.drop('target', axis=1)
y = data['target']

# Impute missing values
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y)

# Train model
model.fit(X_train, y_train)
''')

print("\nWhat's wrong? (Write answer as comment)")
# Your answer:

In [None]:
# Solution to Exercise 3

print("Problem: DATA LEAKAGE!")
print("\nThe imputer is fit on ALL data (including test set) before splitting.")
print("This means the test set statistics leak into the training process.")
print("\nCorrect approach:")
print('''
# 1. Split FIRST
X_train, X_test, y_train, y_test = train_test_split(X, y)

# 2. Fit imputer on training data only
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)

# 3. Apply to test data
X_test_imputed = imputer.transform(X_test)
''')

## 11. Summary

### Key Takeaways

1. **Missing data is common** in real-world datasets
   - Understand WHY data is missing (MCAR, MAR, MNAR)
   - Visualize patterns before handling

2. **Multiple strategies exist**:
   - **Deletion**: Only when <5% missing and MCAR
   - **Simple imputation**: Mean, median, mode, constant
   - **Advanced imputation**: KNN, iterative (MICE)

3. **Always avoid data leakage**:
   - Split data FIRST
   - Fit imputer on training data only
   - Transform both train and test

4. **Consider missing indicators**:
   - Sometimes missingness is informative
   - Create `was_missing` features

5. **No single best method**:
   - Try multiple approaches
   - Compare model performance
   - Consider computational cost

### What's Next?

**Module 02**: Encoding Categorical Variables - Learn how to convert categories to numbers for ML models

### Additional Resources

- [Sklearn Imputation Documentation](https://scikit-learn.org/stable/modules/impute.html)
- [Missingno Library](https://github.com/ResidentMario/missingno)
- "Flexible Imputation of Missing Data" by Stef van Buuren

---

**Congratulations!** You've completed Module 01. You now know how to:
- Identify and visualize missing data patterns
- Apply appropriate imputation strategies
- Avoid data leakage when handling missing values
- Compare imputation methods systematically

Ready for the next challenge? Move to **Module 02: Encoding Categorical Variables**!