# Handling Missing Values – Theory & Practical Techniques

Missing data is common in real-world datasets. Handled poorly, it causes:
- Errors in model training (many algorithms reject `NaN` inputs)
- Biased estimates
- Reduced accuracy and statistical power

We must **detect**, **understand**, and **handle** missing values correctly, by deletion or imputation.

## 1. Types of Missing Data (Rubin’s Taxonomy)

Let’s denote:
- $Y_{obs}$ = observed values
- $Y_{miss}$ = missing values
- $M$ = missingness indicator (1 if missing, 0 otherwise)

---

### 1. **MCAR** – Missing Completely at Random

**Definition**: Probability of missingness does **not** depend on observed **or** missing data.

$$
P(M | Y_{obs}, Y_{miss}) = P(M)
$$

**Example**:
> A spreadsheet is accidentally deleted for 10 random rows. The missingness has **nothing to do** with age, income, or survival.

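→ Deleting incomplete rows gives unbiased estimates (at the cost of a smaller sample), and simple imputation is also reasonable.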

---

### 2. **MAR** – Missing at Random

**Definition**: Missingness depends **only on observed data**, not the missing value itself.

$$
P(M | Y_{obs}, Y_{miss}) = P(M | Y_{obs})
$$

**Example**:
> Passengers in older age groups are less likely to report their exact age. If the **age group is recorded** for everyone, then within each group the missingness is random.

→ Can be handled with imputation **if we use the observed variables** (like age group).

---

### 3. **MNAR** – Missing Not at Random

**Definition**: Missingness depends on the **missing value itself**, even after conditioning on the observed data.

$$
P(M | Y_{obs}, Y_{miss}) \neq P(M | Y_{obs})
$$

**Example**:
> High-income people refuse to disclose income. The **higher the income, the more likely to skip** the question.

→ **Cannot fix with simple imputation**. Needs domain modeling or sensitivity analysis.
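
To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the synthetic `age` and `income` variables, the logistic missingness rules, and the 30% MCAR rate are assumptions chosen for illustration, not part of the Titanic data. It compares the mean of the observed incomes against the true mean under each mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: income depends on age, and age is always observed.
age = rng.uniform(20, 70, size=n)
income = 1_000 * age + rng.normal(0, 5_000, size=n)


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# MCAR: every income value has the same 30% chance of being missing.
m_mcar = rng.random(n) < 0.30
# MAR: missingness depends only on the observed covariate (age).
m_mar = rng.random(n) < sigmoid((age - 45) / 10)
# MNAR: missingness depends on the income value itself.
m_mnar = rng.random(n) < sigmoid((income - 45_000) / 10_000)

true_mean = income.mean()
bias = {
    name: income[~mask].mean() - true_mean
    for name, mask in [("MCAR", m_mcar), ("MAR", m_mar), ("MNAR", m_mnar)]
}
for name, b in bias.items():
    print(f"{name}: bias of the observed-only mean = {b:,.0f}")
```

In this setup the observed-only mean stays close to the truth under MCAR but is pulled downward under MAR and MNAR. The MAR bias can in principle be corrected by using `age`; the MNAR bias cannot be corrected from the observed data alone.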

## 2. Load Dataset & Check Missing Values

In [2]:
import pandas as pd
import seaborn as sns
import numpy as np

# Load Titanic dataset (has real missing values)
df = sns.load_dataset('titanic')

# Show first 5 rows
df.head()

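|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|----------|--------|-----|-----|-------|-------|------|----------|-------|-----|------------|------|-------------|-------|-------|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |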


In [3]:
# Count missing values per column
# Why? To know which columns need attention
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
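
Raw counts are easier to judge as percentages: on the Titanic frame above, 177 of 891 is about 19.9% for `age` and 688 of 891 is about 77.2% for `deck`. The sketch below builds such a summary on a small hypothetical frame (`toy` is an illustrative stand-in, not the Titanic data) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull().mean() is the fraction of missing entries per column.
summary = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(1),
}).sort_values("pct_missing", ascending=False)

print(summary)
```

The same two lines applied to `df` would give the per-column percentages for the Titanic data.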

## 3. Deletion Methods (Use Only if <5% Data is Missing & MCAR)

In [None]:
# Row-wise deletion: remove any row with at least one missing value
# Use when data is MCAR and the loss is minimal; not recommended if too many rows are lost
df_row_dropped = df.dropna()
print(f"Original: {len(df)} rows → After row drop: {len(df_row_dropped)} rows")

# Column-wise deletion: Remove columns with too many missing values
# Use when a column has >50% missing (e.g., 'deck')
df_col_dropped = df.drop(columns=['deck'])  # 'deck' has 688 missing out of 891
print(f"Columns reduced from {df.shape[1]} to {df_col_dropped.shape[1]}")

Original: 891 rows → After row drop: 182 rows
Columns reduced from 15 to 14
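
Dropping every incomplete row is costly here (891 rows down to 182), mostly because of `deck`. A gentler approach, sketched below on a hypothetical toy frame, is to drop sparse columns first and then drop rows only where selected columns are missing. The 50% cutoff (`max_missing`) and the choice of `embarked` as the required column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embarked": ["S", "C", None, "S"],
})

# Step 1: keep only columns whose missing fraction is at or below the cutoff.
max_missing = 0.5  # assumed cutoff, tune per project
keep_cols = toy.columns[toy.isnull().mean() <= max_missing]
reduced = toy[keep_cols]

# Step 2: drop rows only where the columns we require are missing.
cleaned = reduced.dropna(subset=["embarked"])

print(f"Columns kept: {list(reduced.columns)}")
print(f"Rows kept: {len(cleaned)} of {len(toy)}")
```

Rows that are still missing `age` survive this step and can be imputed later instead of being discarded.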


## 4. Imputation Techniques

### 1. Mean Imputation (Use only for roughly symmetric data without outliers)

In [None]:
# Fill missing 'age' with mean
# Why? Keeps the column mean unchanged
# Warning: Reduces variance → underestimates uncertainty
# The mean is sensitive to outliers, so use it only for roughly symmetric data
mean_age = df['age'].mean()
df['age_mean'] = df['age'].fillna(mean_age)

print(f"Mean age: {mean_age:.2f}")
df[['age', 'age_mean']].head(10)

Mean age: 29.70


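|   | age | age_mean |
|---|-----|----------|
| 0 | 22.0 | 22.0 |
| 1 | 38.0 | 38.0 |
| 2 | 26.0 | 26.0 |
| 3 | 35.0 | 35.0 |
| 4 | 35.0 | 35.0 |
| 5 | NaN | 29.699118 |
| 6 | 54.0 | 54.0 |
| 7 | 2.0 | 2.0 |
| 8 | 27.0 | 27.0 |
| 9 | 14.0 | 14.0 |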
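
The variance warning can be checked numerically. The sketch below uses synthetic ages (the normal distribution, its parameters, and the 20% MCAR rate are assumptions for illustration) and compares the standard deviation before and after mean imputation. Because every filled value sits exactly at the mean, the sum of squared deviations is unchanged while the row count grows, so the sample standard deviation shrinks by a factor of $\sqrt{(n_{obs}-1)/(n-1)}$.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000

# Synthetic ages with about 20% of values removed completely at random.
age = pd.Series(rng.normal(30, 14, size=n))
missing_mask = rng.random(n) < 0.2
age_missing = age.mask(missing_mask)

filled = age_missing.fillna(age_missing.mean())

n_obs = int(age_missing.notnull().sum())
std_before = age_missing.std()   # computed on observed values only
std_after = filled.std()         # computed after mean imputation

print(f"std before: {std_before:.2f}, std after: {std_after:.2f}")
print(f"shrink factor: {np.sqrt((n_obs - 1) / (n - 1)):.3f}")
```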


### 2. Median Imputation (Best for skewed data or outliers)

In [6]:
# Fill with median → robust to outliers
median_age = df['age'].median()
df['age_median'] = df['age'].fillna(median_age)

print(f"Median age: {median_age}")
df[['age', 'age_median']].head(10)

Median age: 28.0


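|   | age | age_median |
|---|-----|------------|
| 0 | 22.0 | 22.0 |
| 1 | 38.0 | 38.0 |
| 2 | 26.0 | 26.0 |
| 3 | 35.0 | 35.0 |
| 4 | 35.0 | 35.0 |
| 5 | NaN | 28.0 |
| 6 | 54.0 | 54.0 |
| 7 | 2.0 | 2.0 |
| 8 | 27.0 | 27.0 |
| 9 | 14.0 | 14.0 |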
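
To see why the median is described as robust, consider a small hypothetical series with one extreme value. The numbers below are made up for illustration and only loosely resemble Titanic fares.

```python
import pandas as pd

# Hypothetical fares: four ordinary values, one extreme value, one gap.
fares = pd.Series([7.25, 7.93, 8.05, 8.46, 512.33, None])

mean_fill = fares.mean()      # pulled far upward by the single outlier
median_fill = fares.median()  # stays with the bulk of the data

print(f"Mean fill value:   {mean_fill:.2f}")
print(f"Median fill value: {median_fill:.2f}")
```

Filling the gap with the mean would insert a value larger than four of the five observed fares, while the median fill stays typical of the column.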


### 3. Mode Imputation (For categorical data)

In [None]:
# 'embarked' has 2 missing values
# Mode = most frequent port: 'S'
mode_embarked = df['embarked'].mode()[0]
df['embarked_mode'] = df['embarked'].fillna(mode_embarked)

print(f"Mode of embarked: {mode_embarked}")
df[['embarked', 'embarked_mode']].loc[df['embarked'].isnull()]

Mode of embarked: S


|     | embarked | embarked_mode |
|-----|----------|---------------|
| 61  | NaN | S |
| 829 | NaN | S |
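
One detail worth knowing about `mode()[0]`: `Series.mode()` returns every tied value in sorted order, so with a tie the first element is simply the one that sorts first. The sketch below uses a hypothetical series with a deliberate tie to show this.

```python
import pandas as pd

# Hypothetical port column where 'S' and 'C' are equally frequent.
ports = pd.Series(["S", "C", "S", "C", None])

modes = ports.mode()         # missing values are ignored by default
fill_value = modes.iloc[0]   # first of the tied values after sorting
filled = ports.fillna(fill_value)

print(f"Tied modes: {list(modes)}, chosen fill value: {fill_value}")
```

When ties are plausible, it may be worth checking `len(modes)` and choosing the fill value deliberately rather than relying on sort order.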


### 4. **Random Sampling Imputation** (Best for preserving distribution)

In [None]:
# Step 1: Get all non-missing age values
observed_ages = df['age'].dropna()

# Step 2: Randomly sample from them to fill missing spots
# replace=True allows reuse (needed when there are more gaps than observations)
# Note: no seed is set, so the sampled values differ from run to run
n_missing = df['age'].isnull().sum()
random_samples = np.random.choice(observed_ages, size=n_missing, replace=True)

# Step 3: Start from a copy of 'age' so observed values are kept,
# then overwrite only the missing rows (df.loc selects rows by boolean mask and a column by label)
df['age_random'] = df['age']
df.loc[df['age'].isnull(), 'age_random'] = random_samples

print(f"Filled {n_missing} missing ages with random sampling")
df[['age', 'age_random']].head(10)

Filled 177 missing ages with random sampling


|   | age | age_random |
|---|-----|------------|
| 0 | 22.0 | 22.0 |
| 1 | 38.0 | 38.0 |
| 2 | 26.0 | 26.0 |
| 3 | 35.0 | 35.0 |
| 4 | 35.0 | 35.0 |
| 5 | NaN | 22.0 |
| 6 | 54.0 | 54.0 |
| 7 | 2.0 | 2.0 |
| 8 | 27.0 | 27.0 |
| 9 | 14.0 | 14.0 |


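> **Why Random Sampling?**
> - Approximately preserves the **mean**, **variance**, and **shape** of the observed distribution
> - Distorts the column's distribution less than mean/median imputation for **MCAR** data
> - Simulates real variability, though it ignores relationships with other columns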
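
For reproducible results, the same idea can be wrapped in a small helper that takes a seeded generator. This is a minimal sketch: the name `random_sample_impute` and its signature are hypothetical, not a pandas or scikit-learn API.

```python
import numpy as np
import pandas as pd


def random_sample_impute(s: pd.Series, rng: np.random.Generator) -> pd.Series:
    """Fill missing entries of s with values drawn from its observed entries."""
    out = s.copy()                      # keep the original series untouched
    missing = out.isnull()
    observed = out[~missing].to_numpy()
    # Draw one observed value per gap, with replacement.
    out.loc[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return out


# Hypothetical ages with two gaps.
ages = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0])
filled = random_sample_impute(ages, np.random.default_rng(0))
print(filled)
```

Passing the same seed reproduces the same fills, which makes downstream results easier to compare across runs.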

## 5. Scikit-learn Imputer (Production Ready)

In [9]:
from sklearn.impute import SimpleImputer

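# Median imputer. In a real workflow, fit on the training split only and
# reuse the fitted imputer on the test split; here it is fit on the full frame for demonstration.
imputer = SimpleImputer(strategy='median')
df['age_sklearn'] = imputer.fit_transform(df[['age']]).ravel()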

print("Scikit-learn imputation done. Use in pipelines!")

Scikit-learn imputation done. Use in pipelines!
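
The "fit on train, transform on test" pattern looks roughly like the sketch below. The toy frame, the 25% test size, and the `random_state` are assumptions for illustration; the point is only that the median is learned from the training rows and then reused unchanged on the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical ages with two gaps.
toy = pd.DataFrame({"age": [22.0, np.nan, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0]})
train, test = train_test_split(toy, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train[["age"]])  # learns the median from train only
test_filled = imputer.transform(test[["age"]])        # reuses that median on test

print(f"Median learned from train: {imputer.statistics_[0]}")
```

Placing the imputer inside a `sklearn.pipeline.Pipeline` applies the same rule automatically during cross-validation.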


## 6. Summary Table: When to Use What?

| Method | Best For | Preserves Distribution? | Safe for MNAR? |
|-------|----------|--------------------------|----------------|
| Delete rows | <5% missing, MCAR | Yes if MCAR (smaller sample) | No |
| Mean | Roughly symmetric numeric | No (↓ variance) | No |
| Median | Skewed/outliers | No (↓ variance; robust center) | No |
| Mode | Categorical | No (inflates most frequent category) | No |
| **Random Sample** | **MCAR numeric** | **Yes (marginal only)** | **No** |
| Model-based (KNN, MICE) | MAR | Yes (approximately) | No (not without an explicit missingness model) |

> **Golden Rule**: **Never impute test set using test data** — fit imputer on **train only**.
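
As an illustration of the model-based row in the table, scikit-learn's `KNNImputer` fills a gap using the rows that are closest on the features that are present. The tiny array and `n_neighbors=2` below are assumptions chosen so the result can be checked by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical two-feature data; the third row is missing its second feature.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
    [9.0, 90.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two nearest rows on the first feature are [2, 20] and [1, 10],
# so the gap is filled with their average.
print(X_filled[2])
```

Unlike mean or median imputation, the filled value here depends on the other column, which is what makes this family of methods suitable for MAR data.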

## 7. Libraries Used

| Library | Purpose |
|--------|--------|
| `pandas` | Load, inspect, fill missing values |
| `numpy` | Random sampling (`np.random.choice`) |
| `seaborn` | Load example dataset |
| `sklearn.impute` | Production-ready imputers |

## Key Takeaways

1. **Always check** `df.isnull().sum()`
2. **Diagnose**: Is it MCAR, MAR, or MNAR?
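3. **Prefer** (for simple imputation of a numeric column): Random sampling or median over mean; consider model-based methods when data is MAR
4. **Never**: Fit the imputer on test data; fit on train only, then apply it to both splits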
5. **Use**: `SimpleImputer` in ML pipelines

---
**End of Notebook**