# Module 3: Data Preparation & Cleaning

**Course**: End-to-End Machine Learning (Datacamp)  
**Case Study**: CardioCare Heart Disease Prediction  
**Author**: Seif

---

## Overview

After conducting EDA, we now move to **data preparation** — cleaning and transforming data for modeling.

In this module, we cover:
1. Handling null/empty values
2. Imputation strategies (mean, median, constant, KNN)
3. Dropping duplicates
4. Best practices for data cleaning

**Critical**: Data preparation sets the stage for all subsequent ML steps!

---

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer, KNNImputer
from imblearn.over_sampling import SMOTE

%matplotlib inline

## Data Preparation Steps

From EDA, we identified issues that need addressing:
- ❌ **Missing values** — can cause model failures
- ❌ **Outliers** — may skew model performance
- ❌ **Class imbalances** — can bias predictions
- ❌ **Empty columns** — provide no predictive value
- ❌ **Duplicates** — bias standard errors and confidence intervals

**Goal**: Clean the dataset to ensure reliable model training and evaluation.

## 1. Handling Null / Empty Values

**Missing values can cause model failures.** Two main strategies:

### Strategy 1: Drop rows/columns
- Use when data is sparse or empty
- Appropriate when losing the data doesn't impact model quality

### Strategy 2: Impute (fill) values
- Use when only a few values are missing
- Preserves valuable data
- Requires choosing an appropriate fill strategy

### Method 1: Dropping Rows or Columns

Use **`.drop()`** or **`.dropna()`** to remove sparse data.

In [None]:
# Drop a specific column (e.g., oldpeak if it has too many missing values)
# df = df.drop('oldpeak', axis=1)

# axis=1 means drop a column
# axis=0 means drop a row

In [None]:
# Drop all rows that are completely empty
# df = df.dropna(how='all')

# how='all' means drop only if ALL values in the row are null
# how='any' means drop if ANY value in the row is null

### When to Drop Values?

**Drop columns** when:
- Column has many missing values (e.g., > 50%)
- Column is mostly empty and provides little predictive value
- Example: If `oldpeak` (ECG measure) has 80% missing values, drop it

**Drop rows** when:
- Target column has missing values (can't train without labels)
- Only a few rows are affected
- Alternative: Treat a missing target as its own category, if that makes sense for the problem
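
The rules above can be sketched on a toy DataFrame. This is a minimal illustration, not the course's code: the values are made up, the `target` column name is an assumption, and the 50% threshold is just the example figure from the list.

```python
import numpy as np
import pandas as pd

# Toy data: column names echo the case study, values are invented for illustration
toy = pd.DataFrame({
    "age":     [63, 54, np.nan, 41, 58],
    "chol":    [233, np.nan, 250, 204, 236],
    "oldpeak": [np.nan, np.nan, np.nan, 1.4, np.nan],  # 80% missing
    "target":  [1, 0, 1, np.nan, 0],                   # one missing label
})

# Fraction of missing values in each column
missing_frac = toy.isna().mean()

# Drop columns whose missing fraction exceeds the threshold (0.5 is an assumed cut-off)
threshold = 0.5
sparse_cols = missing_frac[missing_frac > threshold].index.tolist()
cleaned = toy.drop(columns=sparse_cols)

# Drop rows that have no label, since they cannot be used for supervised training
cleaned = cleaned.dropna(subset=["target"])

print(missing_frac)
print("Dropped columns:", sparse_cols)
print("Shape after cleaning:", cleaned.shape)
```

Inspecting `missing_frac` before dropping anything makes the decision visible and easy to document.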

## 2. Imputation

**Imputation** = filling missing values with substitutes.

### Why Impute?
- You don't want to drop an entire patient record just because the age was not recorded
- Can't drop essential columns (like features needed for prediction)
- Preserves valuable data

### Common Imputation Strategies:
1. **Mean** — fill with average value (good for normally distributed data)
2. **Median** — fill with middle value (robust to outliers)
3. **Constant** — fill with a specific value (e.g., 0, -1, "Unknown")
4. **Forward/Backward fill** — use previous/next value (timeseries data)
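
To make the difference between these strategies concrete, here is a small sketch on made-up cholesterol readings with one gap and one extreme value. The numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up readings: one missing value and one extreme value (600)
chol = pd.Series([200.0, 210.0, 190.0, np.nan, 600.0])

mean_filled = chol.fillna(chol.mean())      # mean of the observed values is 300.0
median_filled = chol.fillna(chol.median())  # median of the observed values is 205.0
constant_filled = chol.fillna(-1)           # sentinel value marking "was missing"
ffilled = chol.ffill()                      # carries the previous value (190.0) forward

print(pd.DataFrame({
    "mean": mean_filled,
    "median": median_filled,
    "constant": constant_filled,
    "ffill": ffilled,
}))
```

The single extreme reading pulls the mean fill up to 300, while the median fill stays at 205, which is why the median is usually the safer choice for skewed columns.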

### Basic Imputation with `.fillna()`

In [None]:
# Example: Fill missing cholesterol values with the mean
# mean_chol = df['chol'].mean()
# df['chol'] = df['chol'].fillna(mean_chol)

# Assign the result back to the column.
# Avoid df['chol'].fillna(mean_chol, inplace=True): it operates on a selected
# column (chained assignment), is deprecated in pandas 2.x, and does not
# update df when Copy-on-Write is enabled.

In [None]:
# Fill with median (more robust to outliers)
# median_age = df['age'].median()
# df['age'] = df['age'].fillna(median_age)

In [None]:
# Fill with a constant value
# df['some_column'] = df['some_column'].fillna(0)

# For categorical data, fill with a placeholder
# df['category'] = df['category'].fillna('Unknown')
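
The same fills can be done with scikit-learn's `SimpleImputer`, which learns the fill value once and reuses it. The sketch below assumes a train/test split already exists; the two tiny DataFrames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up train and test sets with gaps in both columns
train = pd.DataFrame({
    "age":  [40.0, 50.0, np.nan, 60.0],
    "chol": [200.0, np.nan, 220.0, 240.0],
})
test = pd.DataFrame({
    "age":  [np.nan, 45.0],
    "chol": [210.0, np.nan],
})

# Learn the per-column medians from the training data only
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)

# Reuse the training medians on the test data (no refitting)
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print("Learned medians:", imputer.statistics_)
print(test_filled)
```

Fitting on the training data and only transforming the test data keeps test-set information out of the fill values. Other supported strategies include `"mean"`, `"most_frequent"`, and `"constant"` (with `fill_value`).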

## 3. Advanced Imputation

Sometimes, simple summary statistics (mean/median) don't capture the nuance required for successful modeling.

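### Advanced Techniques:
- **K-Nearest Neighbors (KNN)**: predict missing values based on similar patients
- **Iterative Imputer**: use ML models to predict each feature's missing values from the other features
- **SMOTE** (Synthetic Minority Over-sampling Technique): a related technique, but not an imputer. It creates synthetic minority-class rows to address class imbalance and expects data with no missing values, so apply it after imputation

KNN and iterative imputation work well when the missing values can be predicted from other features in the dataset.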
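
As a rough sketch of model-based imputation, the example below uses scikit-learn's `IterativeImputer` on synthetic data where `chol` depends on `age`. The data, the relationship between the columns, and the parameter values are all assumptions made for illustration. Note that `IterativeImputer` is still marked experimental and needs an explicit enabling import.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required before the next import
from sklearn.impute import IterativeImputer

# Synthetic data: chol is roughly a linear function of age plus noise
rng = np.random.default_rng(0)
age = rng.uniform(30, 70, size=50)
chol = 150 + 2 * age + rng.normal(0, 5, size=50)
toy = pd.DataFrame({"age": age, "chol": chol})

# Remove a few cholesterol values to simulate missing data
toy.loc[[3, 17, 42], "chol"] = np.nan

# Each feature with gaps is modeled from the other features
# (the default estimator is a Bayesian ridge regression)
imputer = IterativeImputer(max_iter=10, random_state=0)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled.loc[[3, 17, 42]])
```

Because the imputed values follow the age/cholesterol relationship, they are usually closer to plausible readings than a single column mean would be.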

In [None]:
# KNN Imputation - predict missing values based on k nearest neighbors
# from sklearn.impute import KNNImputer

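# imputer = KNNImputer(n_neighbors=5)
# num_cols = ['age', 'chol', 'oldpeak']  # numeric features used to measure similarity
# df[num_cols] = imputer.fit_transform(df[num_cols])

# This finds the 5 most similar patients (based on the other numeric features)
# and averages their values to fill each missing entry.
# Passing only df[['chol']] gives the imputer nothing to compare patients on,
# so missing entries would simply fall back to the column mean.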
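
A runnable version on toy data shows the neighbor logic at work. The values are invented, and `trestbps` (resting blood pressure) is an assumed column name used only to give the imputer a second feature to compare on.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Two clusters of made-up patients: three younger, two older
toy = pd.DataFrame({
    "age":      [40.0, 42.0, 41.0, 70.0, 72.0],
    "trestbps": [120.0, 122.0, 121.0, 160.0, 158.0],
    "chol":     [200.0, 210.0, np.nan, 300.0, 310.0],
})

# Distances are computed on the features both rows have observed
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# The patient in row 2 is closest to rows 0 and 1, so the fill is their average
print(filled.loc[2, "chol"])
```

The missing cholesterol value is filled with 205.0, the average of the two nearest patients (200 and 210), rather than the overall column mean of 255. Since the method relies on distances, features on very different scales may need scaling first.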

## 4. Dropping Duplicates

**Duplicates can bias model performance** by affecting standard errors and confidence intervals.

### Why Drop Duplicates?
- Each row should represent a **unique patient**
- Duplicates inflate certain patterns artificially
- Can lead to data leakage between train/test sets

### When to Check for Duplicates:
- **All columns** — exact duplicates across entire row
- **Subset of columns** — e.g., patient ID (same patient recorded twice)
- **Timeseries** — check both ID and timestamp

In [None]:
# Drop duplicate rows across all columns
# df = df.drop_duplicates()

# This removes rows that are completely identical

In [None]:
# Drop duplicates based on specific columns (e.g., patient_id)
# df = df.drop_duplicates(subset=['patient_id'])

# This removes rows where patient_id is duplicated
# (assumes same patient recorded multiple times)

In [None]:
# For timeseries: check both ID and timestamp
# df = df.drop_duplicates(subset=['patient_id', 'timestamp'])

# Keep the first occurrence, drop subsequent ones
# df = df.drop_duplicates(keep='first')  # default
# df = df.drop_duplicates(keep='last')   # keep last occurrence
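
It often helps to count duplicates with `.duplicated()` before removing anything. The sketch below uses a made-up visits table; `patient_id` and `timestamp` are the illustrative column names from the cells above.

```python
import pandas as pd

# Made-up visit records: patient 1 is entered twice, patient 2 has two visits
visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3],
    "timestamp":  ["2024-01-01", "2024-01-01", "2024-01-01", "2024-02-01", "2024-01-01"],
    "chol":       [200, 200, 210, 215, 200],
})

# Count duplicates before dropping
n_exact = int(visits.duplicated().sum())                       # fully identical rows
n_by_id = int(visits.duplicated(subset=["patient_id"]).sum())  # repeated patient IDs

# Remove exact duplicates only
exact = visits.drop_duplicates()

# Remove repeats of the same patient at the same time
by_id_time = visits.drop_duplicates(subset=["patient_id", "timestamp"])

# Keep only the most recent record per patient
latest = (
    visits.sort_values("timestamp", kind="stable")
          .drop_duplicates(subset=["patient_id"], keep="last")
)

print(n_exact, n_by_id)
print(latest)
```

Here the second visit of patient 2 is a legitimate record, so deduplicating on `patient_id` alone would discard real data unless only the latest record per patient is wanted.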

### ⚠️ Important: Don't Drop Expected Duplicates!

Not all duplicates should be removed:
- **Age** — many patients can have the same age
- **Cholesterol** — different patients can have same values
- **Other features** — duplicates in individual columns are normal

Only drop duplicates when the **entire row** or **key identifiers** (like patient ID) are duplicated.

---

## Key Takeaways

1. **Data Preparation is Iterative**: May need to repeat as you progress through modeling
2. **Choose Strategy Carefully**: Dropping vs. imputation depends on EDA findings
3. **Imputation Matters**: Mean/median for simple cases, KNN or iterative imputation for complex scenarios
4. **Duplicates Bias Results**: Always check for and remove duplicate records
5. **Context is Key**: Not all duplicates should be removed (e.g., same age is normal)
6. **Sets the Stage**: Clean data is critical for reliable model performance

---

## Data Cleaning Checklist

Before moving to modeling:
- ✅ Check for and handle missing values (drop or impute)
- ✅ Remove or impute outliers (based on EDA findings)
- ✅ Drop duplicate records (check patient IDs)
- ✅ Verify data types are correct
- ✅ Confirm no completely empty rows/columns remain
- ✅ Document all cleaning decisions for reproducibility
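
Most of the checklist can be turned into a quick automated report. The helper below is a hypothetical sketch: `cleaning_report` is not part of pandas or the course material, and the toy data is made up.

```python
import numpy as np
import pandas as pd

def cleaning_report(df, id_col=None):
    """Summarize checklist items for a DataFrame (hypothetical helper)."""
    return {
        "missing_values": int(df.isna().sum().sum()),
        "empty_columns": df.columns[df.isna().all()].tolist(),
        "empty_rows": int(df.isna().all(axis=1).sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_ids": int(df.duplicated(subset=[id_col]).sum()) if id_col else None,
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

# Made-up data with a repeated patient, a missing age, and an empty column
toy = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "age": [63.0, 54.0, 54.0, np.nan],
    "notes": [np.nan, np.nan, np.nan, np.nan],
})

report = cleaning_report(toy, id_col="patient_id")
for item, value in report.items():
    print(f"{item}: {value}")
```

Running a report like this before and after cleaning gives a simple record of what changed, which helps with the documentation item on the checklist.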

---

## References
- Datacamp: End-to-End Machine Learning Course
- Video 3: Data Preparation
- [Scikit-learn Imputation Documentation](https://scikit-learn.org/stable/modules/impute.html)
- [Pandas Data Cleaning Guide](https://pandas.pydata.org/docs/user_guide/missing_data.html)