# <div align=center> Multivariate Imputation </div>
<hr>

When a dataset contains missing values across multiple features, we can often use the **other features** to estimate and fill them.  
This process is known as **Multivariate Imputation**: each missing value in a column is predicted using the **relationships among other columns**.

---

## What is Multivariate Imputation?

In **multivariate imputation**, missing values are estimated not just from the same column (as in univariate methods), but using **information from multiple other features** in the dataset.

For example:  
If `Age` is missing, we can estimate it using related features like `Fare`, `Pclass`, or `SibSp` (in the Titanic dataset).  
This allows the imputer to leverage **inter-feature correlations**.

---

## When to Use Multivariate Imputation

- When **data is Missing At Random (MAR)** — i.e., the probability of a value being missing depends on **other observed variables**.  
  > Example: Suppose income (`Salary`) is missing more often for younger people (`Age < 25`). Here, `Salary` is MAR because its missingness depends on another feature (`Age`).

- When features have **strong relationships or dependencies** — allowing one feature to help infer another.

- When simple univariate imputers (like mean or median) fail to capture **data relationships**.
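
The MAR example above (`Salary` missing more often when `Age < 25`) can be simulated to see what such a pattern looks like. This is only a sketch on synthetic data: the salary formula and the 50% / 10% missing rates are made-up assumptions, not values from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Synthetic, illustrative data only.
age = rng.integers(18, 65, size=n)
salary = 20_000 + 1_000 * (age - 18) + rng.normal(0, 5_000, size=n)

# MAR mechanism: the chance that Salary is missing depends only on Age,
# which is always observed (assumed rates: 50% under 25, 10% otherwise).
p_missing = np.where(age < 25, 0.5, 0.1)
salary_observed = np.where(rng.random(n) < p_missing, np.nan, salary)

young = age < 25
rate_young = np.isnan(salary_observed[young]).mean()
rate_older = np.isnan(salary_observed[~young]).mean()
print(f"missing rate, Age < 25:  {rate_young:.2f}")
print(f"missing rate, Age >= 25: {rate_older:.2f}")
```

Because the missingness is explained by an observed column, an imputer that looks at `Age` has a chance of correcting for it, while a column-wise mean does not.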

---

## Why Not Use Univariate Methods Here?

Univariate imputers (mean, median, mode) ignore relationships between features.  
They assume each column is independent — which can bias results if features are correlated.

For instance, imputing `Income` without considering `Education` or `Age` may distort the relationship between income and qualification levels.

Multivariate imputers fix this by considering **feature interactions**.
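
As a rough illustration of this distortion, the sketch below builds two correlated synthetic columns (the names `education` and `income` and all numbers are invented for the demo), hides 40% of `income` at random, fills the gaps with the column mean, and compares the correlation before and after.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic, purely illustrative data: income rises with education plus noise.
education = rng.normal(12, 3, size=n)
income = 2.0 * education + rng.normal(0, 2, size=n)

# Hide 40% of the income values at random.
missing = rng.random(n) < 0.4
income_missing = income.copy()
income_missing[missing] = np.nan

# Univariate (mean) imputation: every gap receives the same value.
income_mean_filled = np.where(missing, np.nanmean(income_missing), income_missing)

corr_full = np.corrcoef(education, income)[0, 1]
corr_mean = np.corrcoef(education, income_mean_filled)[0, 1]
print(f"correlation, complete data: {corr_full:.3f}")
print(f"correlation, mean-imputed:  {corr_mean:.3f}")
```

The mean-imputed column shows a weaker correlation with `education`, because the filled rows carry no information about the other feature.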

---

## Common Multivariate Imputation Techniques

There are several ways to perform multivariate imputation, but two widely used approaches are:

1. **K-Nearest Neighbors (KNN) Imputer**  
2. **Iterative Imputer (Regression-based)**

We will study both in detail.

---

## Understanding NaN Euclidean Distance (Used in KNN Imputer)

KNN Imputer works by finding “nearest neighbors” of a data point (row) based on a **distance metric**.  
Typically, we use **Euclidean Distance**, but what happens when there are missing values?

If a feature has missing values (`NaN`), the traditional Euclidean formula breaks — because we cannot compute distance with `NaN`.

To handle this, `sklearn` introduces a **NaN-aware Euclidean Distance** — a modified distance metric that ignores missing pairs and rescales for fairness.

---

### Example — NaN Euclidean Distance

Suppose we have a **4-dimensional** space (4 features), and two data points:

$$
A = (3, \text{NaN}, \text{NaN}, 6)
$$
$$
B = (1, \text{NaN}, 4, 5)
$$

We want to compute the Euclidean distance between `A` and `B`.

Normally, Euclidean Distance is:

$$
d(A, B) = \sqrt{(3-1)^2 + (\text{NaN-NaN})^2 + (\text{NaN-4})^2 + (6-5)^2}
$$

But since some values are missing (`NaN`), we can’t directly compute it.

So we **ignore missing pairs** and use only the pairs where both values are **non-missing**.

Here, only the following pairs are valid:
- (3, 1)
- (6, 5)

We compute distance using only these pairs:

$$
S = (3-1)^2 + (6-5)^2 = 4 + 1 = 5
$$

Now, since we’re using only 2 valid dimensions (\(M=2\)) out of 4 total (\(D=4\)),  
we rescale to maintain comparability with full-dimensional distances:

$$
d_{NaN}(A, B) = \sqrt{\frac{D}{M} \times S}
$$

$$
d_{NaN}(A, B) = \sqrt{\frac{4}{2} \times 5} = \sqrt{10} \approx 3.16
$$

Thus, **NaN Euclidean Distance** uses only valid pairs and rescales by \(D/M\) so that distances remain consistent across partially observed samples.
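
The arithmetic above can be checked in code. The sketch below computes the distance by hand with NumPy and compares it with scikit-learn's `nan_euclidean_distances`; the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

A = np.array([[3.0, np.nan, np.nan, 6.0]])
B = np.array([[1.0, np.nan, 4.0, 5.0]])

# Manual computation: keep only coordinates present in both points.
valid = ~np.isnan(A[0]) & ~np.isnan(B[0])
S = np.sum((A[0, valid] - B[0, valid]) ** 2)  # (3-1)^2 + (6-5)^2 = 5
D, M = A.shape[1], valid.sum()                # D = 4, M = 2
d_manual = np.sqrt(D / M * S)

# scikit-learn's built-in NaN-aware distance.
d_sklearn = nan_euclidean_distances(A, B)[0, 0]

print(d_manual, d_sklearn)  # both equal sqrt(10), about 3.162
```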

---

### When NaN Euclidean Distance is Undefined

If **no valid pairs exist** (i.e., all compared features contain NaN values), distance cannot be calculated.

Examples:
1. `(NaN, NaN)` vs `(NaN, NaN)`
2. `(1, NaN)` vs `(NaN, 2)`

In such cases, the distance is **undefined**, as there are no overlapping non-missing values.

> **Note:** In practice, such pairs are either skipped or handled by a fallback strategy. scikit-learn's `nan_euclidean_distances` returns `NaN` for them.
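
A quick check of both undefined cases with scikit-learn's `nan_euclidean_distances` (a minimal sketch; the variable names are illustrative):

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

nan = np.nan

# No coordinate is observed in both points, so no distance can be formed.
d1 = nan_euclidean_distances([[nan, nan]], [[nan, nan]])[0, 0]
d2 = nan_euclidean_distances([[1.0, nan]], [[nan, 2.0]])[0, 0]

print(d1, d2)  # nan nan
```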

---

## Why NaN Euclidean Distance Matters

In KNN imputation, finding “nearest neighbors” requires a distance measure.  
When missing values exist, ignoring them completely would exclude valuable rows.

NaN Euclidean Distance ensures:
- We still compute distance using **available information**.  
- The imputer can **use partial similarity** instead of discarding incomplete data.  
- This method remains **robust** even when many features contain missing values.

---

- Multivariate Imputation leverages **correlations among features**.  
- Best suited to **MAR** data; under MCAR simpler methods may be sufficient, and MNAR requires methods that model the missingness mechanism.  
- **NaN Euclidean Distance** allows handling missing values while finding neighbors.  
- KNN and Iterative Imputer are **powerful alternatives** to simple mean/median methods.  
- Always validate imputed results — ensure relationships and distributions are preserved.  
- Scaling by \( D/M \) is essential to make distances **comparable** across samples with different missingness patterns.

---


# Step-by-step KNN Imputer Example

This note explains how K-Nearest Neighbors (KNN) imputation can be applied **row-wise** to fill missing values across multiple features. It uses a small example dataset with four features and demonstrates the decision rules, the NaN-aware Euclidean distance formula (with \(D/M\) scaling), and the sequential imputation logic.

---

## Dataset

| S No | Feature 1 | Feature 2 | Feature 3 | Feature 4 |
|------|-----------|-----------|-----------|-----------|
| 1    | NaN       | 45        | 67        | 21        |
| 2    | 33        | NaN       | 68        | 12        |
| 3    | 23        | 51        | 71        | 18        |
| 4    | 40        | NaN       | 81        | NaN       |
| 5    | NaN       | 60        | 79        | NaN       |

There are missing values in many cells. The goal is to fill all missing cells using KNN-based imputation.

---

## General procedure (per missing cell)

For each missing value (treating them one at a time):

1. **Select the target feature** (the column with the missing value).  
2. **Form predictors** as the other features (columns).  
3. **Designate the row with the missing cell** as the test sample.  
4. **Choose training rows**: all rows except the test row and any row that has a missing value in the target feature (because such rows cannot provide a target value).  
5. **Compute distances** between the test sample and each training row using only overlapping (non-missing) predictor coordinates and the NaN-aware distance formula (see next section).  
6. **Select `k` nearest neighbors** (lowest distances).  
7. **Aggregate** the neighbors’ target values (mean or median) to produce the imputed value.  
8. **Insert** the imputed value into the dataset and proceed to the next missing cell.

> Note: this describes a row-wise, sequential imputation procedure. Practical implementations (libraries) may vectorize or iterate differently, but the conceptual logic is the same.

---

## NaN-aware Euclidean distance

Let:
- \(D\) = total number of predictor dimensions (here \(D=3\), because one feature is the target and the other three are predictors; in the general vector form, \(D\) is the total feature count),
- \(V\) = set of predictor indices where **both** the test row and a training row have non-missing values,
- \(M = |V|\) = number of valid (overlapping) coordinates,
- \(S = \sum_{i \in V} (x_i - y_i)^2\) = sum of squared differences over valid coordinates.

The NaN-aware Euclidean distance that keeps partial distances comparable to full-dimensional distances is:

$$
d_{\text{nan}}(x,y) \;=\; \sqrt{\frac{D}{M} \; \sum_{i \in V} (x_i - y_i)^2}
$$

If \(M=0\) (no overlapping non-missing coordinates) the distance is undefined for that pair and should be skipped or handled separately. If \(M\) is very small, the scaled distance may be noisy.

---

## Step-by-step application (example walkthrough)

### Step A — Impute Feature 1 for Row 1

- **Target:** Feature 1 (row 1).  
- **Predictors (X):** Features 2, 3, 4. Test vector (Row 1 predictors): \((45,\,67,\,21)\).  
- **Training candidates:** Rows that have Feature 1 present: Row 2, Row 3, Row 4. (Row 5 excluded because its Feature 1 is missing.)

Compute NaN-aware distances between Row 1 and Rows 2,3,4 using available predictor coordinates. (If some training rows have NaNs in predictors, distance uses only overlapping coordinates and scales by \(D/M\).)

- Suppose computed distances (illustrative) are:
  - Row 1 vs Row 2 → 10  
  - Row 1 vs Row 3 → 20  
  - Row 1 vs Row 4 → 30

With \(k=2\), choose the two nearest neighbors: Row 2 and Row 3. Their Feature 1 values are 33 and 23.

Impute Feature 1 (Row 1) by averaging:
$$
\text{Feature1}_{\text{Row1}} = \frac{33 + 23}{2} = 28.
$$

Update the dataset (Row 1 Feature 1 = 28).
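
The distances in Step A are illustrative. The sketch below computes them from the dataset with the formula above, using \(D=3\) predictors; the helper `nan_euclidean` and the variable names are hypothetical, written only for this walkthrough. The true distances differ from the illustrative ones (about 11.09, 7.81, and 24.25), but the same two neighbors are selected, so the imputed value is still 28.

```python
import numpy as np

def nan_euclidean(x, y):
    """NaN-aware Euclidean distance with D/M rescaling; NaN if no overlap."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    valid = ~np.isnan(x) & ~np.isnan(y)
    M = valid.sum()
    if M == 0:
        return np.nan
    return np.sqrt(len(x) / M * np.sum((x[valid] - y[valid]) ** 2))

data = np.array([
    [np.nan, 45, 67, 21],
    [33, np.nan, 68, 12],
    [23, 51, 71, 18],
    [40, np.nan, 81, np.nan],
    [np.nan, 60, 79, np.nan],
])

target_col, test_row, k = 0, 0, 2                 # Feature 1, Row 1, k = 2
predictors = np.delete(data, target_col, axis=1)  # Features 2, 3, 4 (D = 3)

# Training pool: other rows whose target value is present (Rows 2, 3, 4).
pool = [i for i in range(len(data))
        if i != test_row and not np.isnan(data[i, target_col])]
dists = {i: nan_euclidean(predictors[test_row], predictors[i]) for i in pool}

# Pick the k closest rows and average their Feature 1 values.
nearest = sorted(pool, key=dists.get)[:k]
imputed = data[nearest, target_col].mean()

for i in pool:
    print(f"Row 1 vs Row {i + 1}: {dists[i]:.2f}")
print("neighbors:", [i + 1 for i in nearest], "imputed Feature 1:", imputed)
```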

---

### Step B — Impute Feature 2 for Row 2

- **Target:** Feature 2 (row 2).  
- **Predictors (X):** Features 1, 3, 4. After the previous step, Row 1 Feature 1 = 28, so Row 1 predictors for this step are \((28,\,67,\,21)\).  
- **Training candidates:** rows with Feature 2 present. Exclude rows where Feature 2 is missing (Row 4 is excluded). Use Rows 1, 3, 5 (if their Feature 2 is present).

Compute NaN-aware distances between Row 2 (test) and Rows 1, 3, 5 using overlapping predictor coordinates.

- Suppose computed distances (illustrative) are:
  - Row 2 vs Row 1 → 8  
  - Row 2 vs Row 3 → 16  
  - Row 2 vs Row 5 → 32

With \(k=2\), choose Row 1 and Row 3 (distances 8 and 16). Their Feature 2 values are 45 and 51.

Impute Feature 2 (Row 2) by averaging:
$$
\text{Feature2}_{\text{Row2}} = \frac{45 + 51}{2} = 48.
$$

Update the dataset (Row 2 Feature 2 = 48).

---

### Step C — Continue iteratively

Repeat the same logic for the remaining missing cells:

- Row 4: Feature 2 and Feature 4 (treat each missing cell as a separate target and select training rows that have that target present).  
- Row 5: Feature 1 and Feature 4.

At every step:
- Use current dataset values (including prior imputations) as predictors.  
- Exclude rows missing the current target from the training pool.  
- Compute NaN-aware distances with proper scaling.  
- Select `k` nearest neighbors and aggregate their target values.
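
For comparison, the whole table can be filled in one call with scikit-learn's `KNNImputer`. This is a sketch, not a reproduction of the sequential procedure above: as far as the library's documented behavior goes, it measures distances on the original data (earlier imputations are not reused as predictors) and rescales with the total feature count, so intermediate distances can differ from a hand calculation with \(D=3\).

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [np.nan, 45, 67, 21],
    [33, np.nan, 68, 12],
    [23, 51, 71, 18],
    [40, np.nan, 81, np.nan],
    [np.nan, 60, 79, np.nan],
])

imputer = KNNImputer(n_neighbors=2)   # k = 2, uniform weights, nan_euclidean
X_filled = imputer.fit_transform(X)

print(X_filled)
```

For Row 1, Feature 1 the two nearest donors are again Rows 3 and 2, so the imputed value matches the hand-computed 28.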

---

## Important practical notes and choices

- **Choice of `k`:** Small `k` can be noisy; large `k` may oversmooth. Tune `k` with validation when possible.  
- **Distance weighting:** You can weight neighbor contributions by inverse distance (closer neighbors count more) instead of a simple mean.  
- **Feature scaling:** Always scale/standardize predictors before computing distances (KNN is sensitive to scale).  
- **Handling undefined distances:** If a training row has no overlapping predictors with the test row (`M=0`), skip that row. If most rows have small overlaps, consider alternative methods.  
- **Order effects:** Sequential row-wise imputation depends on the order in which missing cells are filled. Some implementations (vectorized KNN imputers) compute imputations based on the original data or use iterative strategies; be aware of the chosen algorithm’s behavior.  
- **Aggregation function:** Mean is common for numeric targets; median is more robust to outliers. For categorical targets, consider plurality (mode).  
- **Minimum overlap threshold:** It is advisable to require a minimum number of overlapping features \(M_{\min}\) to consider a neighbor valid; ignore neighbors with \(M < M_{\min}\).  
- **Computational cost:** KNN imputation can be expensive for large datasets because it requires many distance calculations. Use optimized libraries or approximate nearest neighbor methods for large-scale data.

---

- For each missing cell, treat the cell’s column as the target and other columns as predictors; the missing row becomes the test sample, and rows with non-missing target values form the training pool.  
- Distances are computed using the NaN-aware Euclidean formula with scaling:
  $$
  d_{\text{nan}}(x,y) = \sqrt{\frac{D}{M}\sum_{i\in V}(x_i-y_i)^2}.
  $$
- Select `k` nearest neighbors, aggregate their target values, and impute. Repeat until all missing values are filled.  
- Validate the imputed dataset (distributional checks, model performance) and tune parameters (`k`, scaling, aggregation) to avoid bias and overfitting.

---

###  Advantages
- Uses correlations among features.
- Considers **similarity between rows**.
- Dynamic — different imputations for different samples.

###  Limitations
- Computationally expensive for large datasets.
- Sensitive to scaling and outliers.
- May produce bias if features are weakly correlated.

---


# Hyperparameters of KNNImputer

The **KNNImputer** class in scikit-learn provides several hyperparameters to control how missing values are handled, how neighbors are selected, and how imputations are performed.

---

## 1. missing_values

Specifies the placeholder used to represent missing data in your dataset.

- **Default:** np.nan (NumPy’s representation of “Not a Number”)  
- **Type:** int, float, str, `np.nan`, or `None`

If your dataset encodes missing entries with a different sentinel value, such as `-999`, set this parameter to that value so the imputer treats those entries as missing.  
Be careful with sentinels such as `0` that can also be legitimate values.

---

## 2. n_neighbors

Defines the number of nearest neighbors (k) to consider when imputing a missing value.

- **Default:** 5  
- **Type:** int

For each missing entry, the imputer:
1. Finds the n_neighbors closest samples in the dataset based on the chosen distance metric.  
2. Computes the mean (or weighted mean) of their values for that feature.  
3. Uses this mean as the imputed value.

- Larger k → smoother imputations but may blur local structure.  
- Smaller k → more localized imputations but more sensitive to noise.

---

## 3. weights

Determines how each neighbor contributes to the imputed value.

- **Options:**
  - **'uniform':** All neighbors are given equal weight (default).  
  - **'distance':** Neighbors closer to the sample have more influence (weight inversely proportional to distance).  
  - **callable:** A user-defined function can be passed that accepts an array of distances and returns corresponding weights.

When `'distance'` is used, missing values are filled using a weighted average, where closer points contribute more strongly to the imputed value.
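
The difference between the two built-in schemes shows up on a tiny example. The data below is made up for illustration: the last row is much closer to the first row than to the second, so distance weighting pulls the imputed value toward the first row's value.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (illustrative): the second column is missing in the last row.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [4.0, 40.0],
    [1.2, np.nan],
])

uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

# Neighbors are rows 1 and 2 (values 10 and 20).
print(uniform[3, 1])   # plain mean of 10 and 20
print(distance[3, 1])  # weighted toward 10, the closer neighbor
```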

---

## 4. metric

Specifies the distance function used to compute similarity between samples.

- **Default:** 'nan_euclidean'  
- **Type:** string or callable

By default, it uses **NaN-aware Euclidean distance**, which ignores missing dimensions when computing distance and scales distance by the ratio of total features to valid features.

Mathematically:

$$
d_{NaN}(x, y) = \sqrt{\frac{D}{M} \sum_{i \in V} (x_i - y_i)^2}
$$

Where:  
- **D:** Total number of features  
- **M:** Number of features observed (non-missing) in both samples  
- **V:** Indices of valid pairs

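`'nan_euclidean'` is the only built-in string option. Standard metrics such as `'euclidean'` or `'manhattan'` are not accepted; for a different distance, pass a callable that handles missing values itself.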

---

## 5. copy

Controls whether the imputer modifies the original data or creates a copy before imputing.

- **Default:** True  
- **Type:** bool

- If **True**, a copy of the input array is created before imputation.  
- If **False**, imputation is performed in-place whenever possible, which can modify the original dataset directly.

---

## 6. add_indicator

Determines whether to add missing value indicators (extra binary columns).

- **Default:** False  
- **Type:** bool

If set to **True**, the imputer appends a binary indicator matrix to the output dataset.  
Each new column corresponds to a feature that contained missing values during fitting:
- **1:** Value was missing and imputed  
- **0:** Value was originally present

This helps downstream models to recognize which values were imputed.
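
A small sketch of the resulting output shape, on made-up data in which only the first column has a missing value. Indicator columns are added for the features that had missing values when the imputer was fitted, so here a single extra column appears.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (illustrative): only column 0 contains a missing value.
X = np.array([
    [np.nan, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [1.0, 1.0, 1.0],
])

imputer = KNNImputer(n_neighbors=2, add_indicator=True)
out = imputer.fit_transform(X)

print(out.shape)   # (4, 4): three imputed features plus one indicator column
print(out[:, -1])  # [1. 0. 0. 0.] marks the row that was imputed
```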

---

## 7. keep_empty_features

Specifies whether to keep entirely missing features (columns) in the output after imputation.

- **Default:** False  
- **Type:** bool

Behavior:

- If **True:**  
  - Columns that were entirely missing during fitting are retained in the output.  
  - These columns are filled with 0.  
  - This ensures the original dataset structure is preserved.

- If **False:**  
  - Columns that were completely missing during fitting are dropped.  
  - This helps reduce dimensionality if such features provide no useful information.

---

## Table

| Hyperparameter | Description | Default | Example Value |
|----------------|--------------|----------|----------------|
| `missing_values` | Placeholder for missing entries | `np.nan` | `-999` |
| `n_neighbors` | Number of neighbors (k) | `5` | `3` |
| `weights` | Neighbor weighting scheme | `'uniform'` | `'distance'` |
| `metric` | Distance function | `'nan_euclidean'` | a custom callable |
| `copy` | Copy the input before imputing (`False` imputes in place where possible) | `True` | `False` |
| `add_indicator` | Add binary missing indicators | `False` | `True` |
| `keep_empty_features` | Retain columns that were fully missing | `False` | `True` |

---

## Practical Tips

- Always scale data before applying KNN Imputation to prevent large-scale features from dominating the distance computation.  
- Use **weights = 'distance'** when closer neighbors should contribute more to the imputed value than distant ones.  
- Tune **n_neighbors** using cross-validation; common values range from 3 to 10.  
- For datasets with many missing values, consider **Iterative Imputer** as an alternative for improved stability and consistency.
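
One way to follow the scaling tip is to put a scaler in front of the imputer in a pipeline. The sketch below uses made-up data and relies on `StandardScaler` ignoring `NaN` when fitting and passing it through when transforming; the step name `standardscaler` is the one `make_pipeline` generates automatically.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 is on a small scale, column 1 on a much larger one.
X = np.array([
    [1.0, 1000.0],
    [2.0, np.nan],
    [3.0, 3000.0],
    [np.nan, 4000.0],
    [5.0, 5000.0],
])

scale_then_impute = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
X_scaled_filled = scale_then_impute.fit_transform(X)

# Map the result back to the original units if needed.
scaler = scale_then_impute.named_steps["standardscaler"]
X_filled = scaler.inverse_transform(X_scaled_filled)
print(X_filled)
```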

---


```python
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score
```

```python
# Load the Titanic dataset and keep three features plus the target
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')[['Age','Pclass','Fare','Survived']]
df.head()
```

|   | Age  | Pclass | Fare    | Survived |
|---|------|--------|---------|----------|
| 0 | 22.0 | 3      | 7.25    | 0        |
| 1 | 38.0 | 1      | 71.2833 | 1        |
| 2 | 26.0 | 3      | 7.925   | 1        |
| 3 | 35.0 | 1      | 53.1    | 1        |
| 4 | 35.0 | 3      | 8.05    | 0        |

```python
# Percentage of missing values per column
df.isnull().mean() * 100
```

```
Age         19.86532
Pclass       0.00000
Fare         0.00000
Survived     0.00000
dtype: float64
```

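```python
# Separate features and target, then split
X = df.drop(columns=['Survived'])
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
X_train.head()
```

|     | Age  | Pclass | Fare    |
|-----|------|--------|---------|
| 30  | 40.0 | 1      | 27.7208 |
| 10  | 4.0  | 3      | 16.7    |
| 873 | 47.0 | 3      | 9.0     |
| 182 | 9.0  | 3      | 31.3875 |
| 876 | 20.0 | 3      | 9.8458  |

```python
# Fit the KNN imputer on the training set only, then apply it to both sets
knn = KNNImputer(n_neighbors=3)

X_train_trf = knn.fit_transform(X_train)
X_test_trf = knn.transform(X_test)
```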

```python
# Train and evaluate a classifier on the KNN-imputed data
lr = LogisticRegression()

lr.fit(X_train_trf, y_train)

y_pred = lr.predict(X_test_trf)

accuracy_score(y_test, y_pred)
```

```
0.7039106145251397
```

```python
# Comparison with SimpleImputer (default strategy: mean)
si = SimpleImputer()

X_train_trf2 = si.fit_transform(X_train)
X_test_trf2 = si.transform(X_test)
```

```python
lr = LogisticRegression()

lr.fit(X_train_trf2, y_train)

y_pred2 = lr.predict(X_test_trf2)

accuracy_score(y_test, y_pred2)
```

```
0.6927374301675978
```


# KNN Imputer — When to Use, When Not to Use, Advantages, and Disadvantages

The **K-Nearest Neighbors (KNN) Imputer** estimates missing values based on the similarity between samples. It replaces each missing entry with an average (or weighted average) of its nearest neighbors. While powerful in certain contexts, it must be applied carefully depending on data characteristics and the pattern of missingness.

---

## When to Use KNN Imputer

1. **Missing at Random (MAR):**  
   Best suited when the probability of a value being missing depends on other observed variables but not on the missing variable itself.

2. **Small to Moderate Amount of Missing Data:**  
   Effective when only a limited portion of the dataset is missing (typically less than 20–30%). Excessive missingness can distort distance calculations and lead to unreliable imputations.

3. **Data with Natural Groupings (Homogeneous Subgroups):**  
   Works well when samples form meaningful clusters or subgroups (e.g., customer segments, patient profiles). In such cases, nearby points tend to have similar values, improving imputation quality.

---

## When Not to Use KNN Imputer

1. **Very Large Datasets:**  
   KNN imputation involves computing pairwise distances for each missing entry. This becomes computationally expensive and memory-intensive as data size grows.

2. **High-Dimensional Data:**  
   Suffers from the **curse of dimensionality** — as the number of features increases, all points become nearly equidistant, reducing the reliability of neighbor-based predictions.

3. **Missing Not at Random (MNAR):**  
   If the missingness depends on the missing value itself, neighbor-based methods cannot correct this bias and will likely produce misleading imputations.

4. **Datasets with Many Categorical Variables:**  
   KNN Imputer primarily works on continuous data. Distance metrics are not meaningful for categorical variables unless transformed or encoded properly.

---

## Advantages of KNN Imputer

1. **Simple and Intuitive:**  
   Easy to understand, interpret, and implement without complex modeling assumptions.

2. **No Distributional Assumptions:**  
   KNN imputation is non-parametric: it does not assume linear relationships, normality, or homoscedasticity.

3. **Preserves Data Patterns:**  
   Imputed values are derived from existing samples, helping maintain the original data structure and variability.

---

## Disadvantages of KNN Imputer

1. **Computationally Expensive:**  
   Requires calculating distances for each missing entry, which becomes slow and memory-heavy for large datasets.

2. **Affected by the Curse of Dimensionality:**  
   In high-dimensional feature spaces, distances lose meaning, leading to poor imputation accuracy.

3. **Choosing Optimal k:**  
   Selecting the right number of neighbors (k) often requires tuning or validation, which can be time-consuming.

4. **Requires Full Dataset in Memory:**  
   Since the method relies on all available data for each prediction, it cannot be easily applied in streaming or distributed settings.

---


| Aspect | Description |
|---------|--------------|
| **Best Suited For** | MAR data, low missingness, continuous features, and naturally grouped samples |
| **Avoid When** | MNAR data, very large or high-dimensional datasets, or mostly categorical features |
| **Key Strengths** | Non-parametric, simple to apply, preserves original distribution |
| **Main Limitations** | Computational cost, dimensionality issues, storage requirements |

---


## <div> Iterative Imputer </div>
<hr>

# Iterative Imputer — Working, When to Use, and Key Insights

The **Iterative Imputer** in scikit-learn is a **multivariate imputation technique** inspired by the **MICE (Multiple Imputation by Chained Equations)** algorithm; unlike full MICE, it returns a single imputed dataset rather than multiple imputations.  
It models each feature with missing values as a function of other features and iteratively refines these imputations.

---

## How It Works (MICE Algorithm Overview)

1. **Initial Imputation:**  
   Missing values are first filled with initial estimates (such as mean, median, or any placeholder).

2. **Iterative Modeling:**  
   Each feature with missing values is treated as a **target variable**, and the remaining features are treated as **predictors**.  
   A regression model is fitted on the rows where the target feature is observed and then used to predict that feature's missing values.

3. **Chained Equations:**  
   The process is repeated **sequentially for each feature** with missing data.  
   After one full round (covering all features), the imputed values are updated, and the cycle repeats for several iterations.

4. **Convergence:**  
   The iteration continues until imputed values stabilize — i.e., changes between iterations become minimal.

5. **Estimator Flexibility:**  
   The Iterative Imputer allows using **any regression estimator** (e.g., Linear Regression, Decision Tree, Random Forest); scikit-learn's default is `BayesianRidge`.  
   This flexibility lets it capture both linear and non-linear relationships depending on the dataset.
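
The steps above can be sketched from scratch in a few lines. This is a simplified illustration, not scikit-learn's implementation: the function `chained_imputation` is hypothetical, it uses plain `LinearRegression`, runs a fixed number of rounds instead of checking convergence, and is demonstrated on synthetic data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def chained_imputation(X, n_rounds=10):
    """Simplified chained-equations imputation (illustrative sketch only)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: initial imputation with column means.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    # Steps 2-4: cycle through the features for a fixed number of rounds.
    for _ in range(n_rounds):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            others = np.delete(filled, j, axis=1)  # current predictor values
            observed = ~missing[:, j]
            model = LinearRegression().fit(others[observed], filled[observed, j])
            filled[missing[:, j], j] = model.predict(others[missing[:, j]])
    return filled

# Synthetic demo: b is roughly 2 * a, with gaps in both columns.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 2 * a + rng.normal(scale=0.1, size=200)
X_true = np.column_stack([a, b])
X = X_true.copy()
X[rng.random(200) < 0.2, 0] = np.nan
X[rng.random(200) < 0.2, 1] = np.nan

missing = np.isnan(X)
X_chained = chained_imputation(X)
X_mean = np.where(missing, np.nanmean(X, axis=0), X)

rmse_chained = np.sqrt(np.mean((X_chained[missing] - X_true[missing]) ** 2))
rmse_mean = np.sqrt(np.mean((X_mean[missing] - X_true[missing]) ** 2))
print(f"RMSE on missing cells, chained: {rmse_chained:.3f}")
print(f"RMSE on missing cells, mean:    {rmse_mean:.3f}")
```

Swapping `LinearRegression` for a tree-based regressor is the from-scratch analogue of passing a different `estimator` to the Iterative Imputer.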

---

## When to Use Iterative Imputer

1. **When Data Has Missing Values Across Multiple Features:**  
   Iterative Imputer is suitable when multiple columns have missing values, and these missing values are **interdependent**.

2. **For Datasets with Complex, Non-Linear Relationships:**  
   It works well when relationships among variables are not strictly linear — especially if you choose flexible estimators like **Random Forest** or **Gradient Boosting**.

3. **When Data is Missing at Random (MAR):**  
   The method performs best when missingness is dependent on other observed features rather than the missing feature itself.

4. **When You Want Flexible Model-Based Imputation:**  
   Unlike simple or KNN imputation, this technique can model complex dependencies, giving more accurate estimates.

---

## When Not to Use Iterative Imputer

1. **When More Than 50% of Data is Missing:**  
   If the majority of values are missing, iterative methods struggle to learn accurate models and may overfit or produce unstable imputations.

2. **For MCAR or MNAR Data:**  
   - **MCAR (Missing Completely at Random):** Simpler methods (like mean or median) may be sufficient.  
   - **MNAR (Missing Not at Random):** Iterative Imputer does not model the missingness mechanism; specialized methods are needed.

3. **For Primarily Categorical Data:**  
   The algorithm is designed for continuous (numerical) variables.  
   Using it on categorical variables can produce unrealistic results, as regression models assume numerical relationships.

4. **For Very Large Datasets:**  
   Due to iterative modeling, it can be computationally expensive and memory-intensive on high-dimensional or massive datasets.

---

## Advantages

1. **Flexible Estimator Choice:**  
   You can choose different estimators (linear, tree-based, etc.) depending on your data pattern — allowing the imputer to adapt to complex relationships.

2. **Captures Feature Relationships:**  
   It uses correlations among multiple features, making imputations more data-driven and accurate than univariate methods.

3. **Works Well Under MAR Assumption:**  
   It performs better than simple methods when missingness depends on other features.

4. **Produces Statistically Sound Results:**  
   By modeling each variable conditionally on others, it better preserves covariance structures in the dataset.

---

## Disadvantages

1. **High Computational Cost:**  
   Iterative modeling for each feature over multiple rounds is computationally expensive and time-consuming.

2. **Risk of Overfitting:**  
   If too many iterations are used or the model is too complex, it may overfit imputed values to training data.

3. **Convergence Issues:**  
   In some datasets, imputations may not stabilize, especially with noisy data or complex estimators.

4. **Sensitive to Initialization:**  
   Initial imputed values can affect convergence and final results. Poor initialization may lead to suboptimal imputations.

---

| Aspect | Description |
|---------|--------------|
| **Algorithm** | MICE (Multiple Imputation by Chained Equations) |
| **Type** | Multivariate |
| **Best For** | MAR data, moderate missingness, continuous features |
| **Estimator** | Flexible (Linear, Tree-based, etc.) |
| **Strengths** | Captures feature interactions, flexible modeling |
| **Limitations** | Computational cost, convergence issues, not for categorical or MNAR data |

---

In summary, the **Iterative Imputer** is one of the most powerful imputation techniques available, especially when data exhibits **inter-feature correlations** and **MAR missingness**.  
It provides flexibility and accuracy but requires computational resources and careful parameter tuning to ensure convergence and reliability.


```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer  # must precede the IterativeImputer import
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Titanic dataset
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')[['Survived','Pclass','Age','Fare']]

df.head()
```

|   | Survived | Pclass | Age  | Fare    |
|---|----------|--------|------|---------|
| 0 | 0        | 3      | 22.0 | 7.25    |
| 1 | 1        | 1      | 38.0 | 71.2833 |
| 2 | 1        | 3      | 26.0 | 7.925   |
| 3 | 1        | 1      | 35.0 | 53.1    |
| 4 | 0        | 3      | 35.0 | 8.05    |



```python
# Separate features and target
X = df.drop('Survived', axis=1)
y = df['Survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```

```python
# Imputation: fit on the training set, then apply to both sets
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=10, random_state=0), max_iter=10, random_state=0)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Convert imputed data back to DataFrame (optional, for clarity)
X_train_imputed = pd.DataFrame(X_train_imputed, columns=X_train.columns)
X_test_imputed = pd.DataFrame(X_test_imputed, columns=X_test.columns)

X_train_imputed
```

|     | Pclass | Age       | Fare    |
|-----|--------|-----------|---------|
| 0   | 3.0    | 26.133333 | 15.2458 |
| 1   | 2.0    | 31.000000 | 10.5000 |
| 2   | 2.0    | 31.000000 | 37.0042 |
| 3   | 3.0    | 20.000000 | 4.0125  |
| 4   | 3.0    | 21.000000 | 7.2500  |
| ... | ...    | ...       | ...     |
| 707 | 1.0    | 39.000000 | 83.1583 |
| 708 | 3.0    | 19.000000 | 7.8542  |
| 709 | 3.0    | 20.025000 | 7.7333  |
| 710 | 3.0    | 36.000000 | 17.4000 |



```python
# Train a machine learning model
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train_imputed, y_train)

# Predict on the test set
y_pred = model.predict(X_test_imputed)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```

```
Accuracy: 0.74
```


## Coding framework to compare different techniques

```python
import pandas as pd

# URL of the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

# Load the dataset
df = pd.read_csv(url)

# Display the first few rows of the DataFrame to verify it loaded correctly
df.head()
```

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |



```python
from sklearn.model_selection import train_test_split

# Separate the features and the target variable
X = df.drop('Survived', axis=1)  # Features (all columns except 'Survived')
y = df['Survived']  # Target variable

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify the split
X_train.head()
```

|     | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 331 | 332 | 1 | Partner, Mr. Austen | male | 45.5 | 0 | 0 | 113043 | 28.5 | C124 | S |
| 733 | 734 | 2 | Berriman, Mr. William John | male | 23.0 | 0 | 0 | 28425 | 13.0 | | S |
| 382 | 383 | 3 | Tikkanen, Mr. Juho | male | 32.0 | 0 | 0 | STON/O 2. 3101293 | 7.925 | | S |
| 704 | 705 | 3 | Hansen, Mr. Henrik Juul | male | 26.0 | 1 | 0 | 350025 | 7.8542 | | S |
| 813 | 814 | 3 | Andersson, Miss. Ebba Iris Alfrida | female | 6.0 | 4 | 2 | 347082 | 31.275 | | S |



In [40]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

# Since 'Embarked' is categorical and we're filling missing values with the most frequent category,
# it's useful to apply OneHotEncoding to it as well after imputation.
# We can achieve this by setting up a pipeline for 'Embarked'.
embarked_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder())
])

# Update the transformations to use the pipeline for 'Embarked'
transformations = ColumnTransformer(transformers=[
    ('ohe_sex', OneHotEncoder(), ['Sex']),
    ('impute_age', SimpleImputer(strategy='mean'), ['Age']),
    ('missing_indicator', MissingIndicator(), ['Cabin']),
    ('embarked_pipeline', embarked_pipeline, ['Embarked'])
])
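
To see what this `ColumnTransformer` actually hands to the model, it can help to run it on a few rows and inspect the result. The sketch below rebuilds the same transformer and applies it to a small, hypothetical four-row frame (`toy` is made up for illustration and is not part of the Titanic data).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows that mimic the Titanic columns used above
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Same structure as the transformer defined above
embarked_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder()),
])
transformations = ColumnTransformer(transformers=[
    ("ohe_sex", OneHotEncoder(), ["Sex"]),
    ("impute_age", SimpleImputer(strategy="mean"), ["Age"]),
    ("missing_indicator", MissingIndicator(), ["Cabin"]),
    ("embarked_pipeline", embarked_pipeline, ["Embarked"]),
])

out = transformations.fit_transform(toy)

# 2 Sex columns + 1 Age column + 1 Cabin flag + 2 Embarked columns
print(out.shape)
# The missing Age in row 1 is replaced by the mean of the observed ages
print(out[1, 2])
```

Columns that are not listed in `transformers` (for example `Fare`) are dropped here, because `remainder` defaults to `'drop'`.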

In [41]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Create the pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', transformations),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

In [42]:
from sklearn.metrics import accuracy_score

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Now you can use the pipeline to make predictions on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model, e.g., by calculating the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.4f}")

Model accuracy: 0.7765


In [43]:
from sklearn.model_selection import train_test_split, GridSearchCV

# Parameters of the pipeline to tune
param_grid = {
    'preprocessor__impute_age': [SimpleImputer(strategy='mean'), SimpleImputer(strategy='median'), SimpleImputer(strategy='constant', fill_value=0)],
    'preprocessor__embarked_pipeline': [Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')), ('ohe', OneHotEncoder())]),
                                        Pipeline(steps=[('impute', SimpleImputer(strategy='constant', fill_value='S')), ('ohe', OneHotEncoder())])]
}



In [44]:
# Set up the GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', verbose=1)

# Perform the grid search on the training data
grid_search.fit(X_train, y_train)



Fitting 5 folds for each of 6 candidates, totalling 30 fits



In [45]:
# Best parameter set found
print("Best parameters found:\n", grid_search.best_params_)

# Evaluate the best model found by GridSearchCV on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy with best parameters: {accuracy:.4f}")

Best parameters found:
 {'preprocessor__embarked_pipeline': Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                ('ohe', OneHotEncoder())]), 'preprocessor__impute_age': SimpleImputer(strategy='median')}
Model accuracy with best parameters: 0.7765
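
`best_params_` only reports the winner. To see how close the other candidates were, the fitted search also exposes `cv_results_`. The following is a minimal sketch on synthetic data (the `X_demo`/`y_demo` arrays, the `LogisticRegression` classifier, and the two-strategy grid are stand-ins chosen for brevity, not the Titanic setup above).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data with roughly 20% of the first column set to NaN
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
rng = np.random.default_rng(0)
X_demo[rng.random(200) < 0.2, 0] = np.nan

demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])
demo_grid = {"imputer__strategy": ["mean", "median"]}

demo_search = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring="accuracy")
demo_search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the cross-validated score
results = pd.DataFrame(demo_search.cv_results_)
summary = results[[
    "param_imputer__strategy",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]]
print(summary.sort_values("rank_test_score"))
```

When the `mean_test_score` values differ by less than their `std_test_score`, the choice between candidates is unlikely to matter much on held-out data.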


### Comparing Multiple Imputation Approaches

In [46]:
import pandas as pd

# URL of the Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

# Load the dataset
df = pd.read_csv(url).drop(columns=['PassengerId', 'Name', 'Ticket', 'Embarked'])

from sklearn.model_selection import train_test_split

# Separate the features and the target variable
X = df.drop('Survived', axis=1)  # Features (all columns except 'Survived')
y = df['Survived']  # Target variable

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Verify the split
X_train.head()


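|     | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin |
|-----|--------|-----|-----|-------|-------|------|-------|
| 331 | 1 | male | 45.5 | 0 | 0 | 28.5 | C124 |
| 733 | 2 | male | 23.0 | 0 | 0 | 13.0 | NaN |
| 382 | 3 | male | 32.0 | 0 | 0 | 7.925 | NaN |
| 704 | 3 | male | 26.0 | 1 | 0 | 7.8542 | NaN |
| 813 | 3 | female | 6.0 | 4 | 2 | 31.275 | NaN |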


In [47]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import KNNImputer, IterativeImputer
from sklearn.compose import ColumnTransformer


# Note: in all three approaches the imputer receives only ['Age'], so
# KNNImputer and IterativeImputer have no other features to draw on and
# fall back to the column mean. To make them truly multivariate, pass
# several numeric columns (for example ['Age', 'Fare', 'SibSp', 'Parch']).

# Approach 1: SimpleImputer (mean) for Age, OHE for Sex, Missing Indicator for Cabin
approach_1_preprocessor = ColumnTransformer(transformers=[
    ('ohe_sex', OneHotEncoder(), ['Sex']),
    ('impute_age', SimpleImputer(strategy='mean'), ['Age']),
    ('missing_indicator', MissingIndicator(), ['Cabin'])
], remainder='passthrough')


# Approach 2: KNNImputer for Age, OHE for Sex, Missing Indicator for Cabin
approach_2_preprocessor = ColumnTransformer(transformers=[
    ('ohe_sex', OneHotEncoder(), ['Sex']),
    ('knn_impute', KNNImputer(), ['Age']),
    ('missing_indicator', MissingIndicator(), ['Cabin'])
], remainder='passthrough')


# Approach 3: IterativeImputer for Age, OHE for Sex, Missing Indicator for Cabin
approach_3_preprocessor = ColumnTransformer(transformers=[
    ('ohe_sex', OneHotEncoder(), ['Sex']),
    ('iterative_impute', IterativeImputer(random_state=0), ['Age']),
    ('missing_indicator', MissingIndicator(), ['Cabin'])
], remainder='passthrough')
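
How many columns an imputer is given matters a great deal. The sketch below uses a small, hypothetical two-column array (column 0 plays the role of `Age`, column 1 the role of `Fare`) to compare `KNNImputer` and `IterativeImputer` on a single column against `KNNImputer` on both columns. The numbers are invented for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

# Hypothetical toy data: column 0 ~ Age (with gaps), column 1 ~ Fare (complete)
X = np.array([
    [22.0, 7.25],
    [38.0, 71.28],
    [np.nan, 8.05],
    [35.0, 53.10],
    [np.nan, 69.00],
    [26.0, 7.92],
])
age_only = X[:, [0]]

mean_fill = SimpleImputer(strategy="mean").fit_transform(age_only)
knn_single = KNNImputer(n_neighbors=2).fit_transform(age_only)
iter_single = IterativeImputer(random_state=0).fit_transform(age_only)

# With one column there is nothing to measure distance on or regress against,
# so both "multivariate" imputers produce the same values as mean imputation.
print(np.allclose(knn_single, mean_fill))   # True
print(np.allclose(iter_single, mean_fill))  # True

# Given both columns, KNNImputer fills each gap from rows with a similar Fare.
knn_multi = KNNImputer(n_neighbors=2).fit_transform(X)
print(knn_multi[[2, 4], 0])
```

In a `ColumnTransformer`, the equivalent change would be to list several numeric columns for the imputer step instead of `['Age']` alone. Which columns to include is a modelling choice that should be validated with cross-validation.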

In [48]:
pipeline = Pipeline(steps=[
    ('preprocessor', None),  # Placeholder
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

In [49]:
# Parameters of the pipeline to tune, including the entire preprocessor component
param_grid = {
    'preprocessor': [approach_3_preprocessor, approach_2_preprocessor, approach_1_preprocessor]
}

In [50]:
# Set up the GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=10, scoring='accuracy', verbose=1)

# Perform the grid search on the training data
grid_search.fit(X_train, y_train)

Fitting 10 folds for each of 3 candidates, totalling 30 fits



In [51]:
# Best parameter set found
print("Best parameters found:\n", grid_search.best_params_)

# Evaluate the best model found by GridSearchCV on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy with best parameters: {accuracy:.4f}")

Best parameters found:
 {'preprocessor': ColumnTransformer(remainder='passthrough',
                  transformers=[('ohe_sex', OneHotEncoder(), ['Sex']),
                                ('iterative_impute',
                                 IterativeImputer(random_state=0), ['Age']),
                                ('missing_indicator', MissingIndicator(),
                                 ['Cabin'])])}
Model accuracy with best parameters: 0.7989


In [52]:
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import RandomForestRegressor


param_grid = [
    {
        'preprocessor': [approach_1_preprocessor],  # Approach 1
        'preprocessor__impute_age__strategy': ['mean', 'median', 'constant']  # Tuning SimpleImputer within Approach 1
    },
    {
        'preprocessor': [approach_2_preprocessor],  # Approach 2
        'preprocessor__knn_impute__n_neighbors': [3, 5, 7],  # Tuning KNNImputer within Approach 2
        'preprocessor__knn_impute__weights': ['uniform', 'distance']  # Additional KNNImputer parameter
    },
    {
        'preprocessor': [approach_3_preprocessor],  # Approach 3
        'preprocessor__iterative_impute__max_iter': [10, 20],  # Tuning IterativeImputer within Approach 3
        'preprocessor__iterative_impute__imputation_order': ['ascending', 'descending', 'roman', 'arabic'],  # Additional IterativeImputer parameter
        'preprocessor__iterative_impute__estimator': [BayesianRidge(), ExtraTreesRegressor(n_estimators=50, random_state=0), RandomForestRegressor(random_state=0)]
    }
]
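
A list of dictionaries is searched as a union of independent sub-grids, so the number of candidates is the sum of the per-dictionary products. The sketch below counts them with `ParameterGrid`. Placeholder strings stand in for the preprocessor and estimator objects, since only the number of options per key affects the count.

```python
from sklearn.model_selection import ParameterGrid

# Placeholder strings replace the real estimator objects; only the counts matter
param_grid_sketch = [
    {
        "preprocessor": ["approach_1"],
        "preprocessor__impute_age__strategy": ["mean", "median", "constant"],
    },
    {
        "preprocessor": ["approach_2"],
        "preprocessor__knn_impute__n_neighbors": [3, 5, 7],
        "preprocessor__knn_impute__weights": ["uniform", "distance"],
    },
    {
        "preprocessor": ["approach_3"],
        "preprocessor__iterative_impute__max_iter": [10, 20],
        "preprocessor__iterative_impute__imputation_order": [
            "ascending", "descending", "roman", "arabic",
        ],
        "preprocessor__iterative_impute__estimator": [
            "BayesianRidge", "ExtraTreesRegressor", "RandomForestRegressor",
        ],
    },
]

n_candidates = len(ParameterGrid(param_grid_sketch))  # 3 + 3*2 + 2*4*3
cv_folds = 10
print(n_candidates, n_candidates * cv_folds)  # 33 330
```

Counting candidates before calling `fit` is a cheap way to estimate run time, since every candidate is refit once per fold.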

In [53]:
# Set up the GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=10, scoring='accuracy', verbose=1)

# Perform the grid search on the training data
grid_search.fit(X_train, y_train)

Fitting 10 folds for each of 33 candidates, totalling 330 fits



In [54]:
# Best parameter set found
print("Best parameters found:\n", grid_search.best_params_)

# Evaluate the best model found by GridSearchCV on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy with best parameters: {accuracy:.4f}")

Best parameters found:
 {'preprocessor': ColumnTransformer(remainder='passthrough',
                  transformers=[('ohe_sex', OneHotEncoder(), ['Sex']),
                                ('impute_age', SimpleImputer(), ['Age']),
                                ('missing_indicator', MissingIndicator(),
                                 ['Cabin'])]), 'preprocessor__impute_age__strategy': 'median'}
Model accuracy with best parameters: 0.7933


## Comparison

| Feature/Method             | Mean Imputation | Median Imputation | Most Frequent Imputation | Constant Value Imputation | KNN Imputer | Missing Indicator | Iterative Imputer |
|----------------------------|-----------------|-------------------|--------------------------|---------------------------|-------------|-------------------|-------------------|
| **Suitable Data Types**    | Numeric only    | Numeric only      | Numeric and Categorical  | Numeric and Categorical   | Numeric     | Numeric and Categorical | Numeric primarily; Categorical with preprocessing |
| **Use Case**               | Simple cases, quick baseline | Skewed data or data with outliers | Categorical or when a mode is clear | When a placeholder is needed | Data with meaningful neighbor relationships | To flag missingness as a feature | Complex relationships, multiple variables with missing data |
| **Advantages**             | Easy to implement, quick | Robust to outliers | Good for categorical data | Flexibility in handling missing data | Captures local data structure | Directly models the impact of missingness | Utilizes inter-feature relationships, flexible estimator choice |
| **Disadvantages**          | Can distort distribution, reduce variance | Still reduces variance and ignores relationships between features | May not reflect underlying data complexity | Creates an artificial spike at the constant value | Computationally intensive, sensitive to outliers and feature scale | Increases feature space | Computationally expensive, risk of overfitting |
| **Assumed Missingness Mechanism** | MCAR     | MCAR              | MCAR                     | MNAR             | MAR    | MNAR     | MAR |
| **Complexity**             | Low             | Low               | Low                      | Low                       | High        | Low               | High |
| **Handling Missingness**   | Directly fills missing values | Directly fills missing values | Directly fills missing values | Directly fills missing values with a constant | Fills based on nearest neighbors | Creates binary indicators for missingness | Models each feature with missing values as a function of others |
| **Impact on Distribution** | Can distort the original distribution by reducing variance and concentrating values at the mean | Less impact on distribution for skewed data but can still alter original distribution | May not reflect the true distribution, especially if the mode is not representative of the data | Leaves observed values unchanged but adds a spike at the constant (or a distinct category for categorical data) | Tries to maintain the local structure of the data, less distortion if neighbors are representative | No direct impact on the original distribution of the variable, but adds new binary features | Attempts to preserve relationships and distributions by using other features, but effectiveness varies with the underlying estimator |
| **Model Performance**      | May decrease performance if mean is not representative | Better for skewed data, but similar issues as mean imputation | Good for nominal categorical data with a clear mode | Useful when a distinct category for missing values is meaningful | Can improve performance if the dataset has a meaningful structure that neighbors can capture | Useful for models that can leverage the presence of missingness as an informative signal | Can improve model performance by leveraging inter-feature correlations, but depends on estimator selection |
| **Computational Cost**     | Low             | Low               | Low                      | Low                       | High        | Low               | High (multiple iterations and model fitting involved) |
| **Best Use Cases**         | Quick baseline models or when data is normally distributed and missing completely at random (MCAR) | Data with outliers or non-normal distribution, MCAR | Categorical data or when a clear majority category exists, MCAR | Situations where missingness might represent a distinct category itself | Data with rich feature interactions or when the local neighborhood can accurately predict missing values | Models where missingness itself is informative, regardless of the missing values | Complex datasets with multiple features having interdependencies, especially when data is missing at random (MAR) rather than completely at random (MCAR) |
| **Scalability**            | Highly scalable | Highly scalable   | Highly scalable           | Highly scalable            | Scalability issues with large datasets due to the need to compute distances between points | Highly scalable | Less scalable due to iterative nature and the need for multiple model fittings |
| **Risk of Bias**           | Introduces bias if the mean is not representative of the missing values | Lower risk of bias compared to mean imputation but still present | Risk of bias if the mode does not represent missing values well | Risk of bias if the constant is far from plausible values or does not represent the missing context well | Lower risk of bias if KNN can accurately capture the data structure, but sensitive to outliers | Minimal direct bias in imputation, but model interpretation complexity increases | Risk of overfitting or underfitting depending on the complexity of the estimator and the accuracy of the initial imputation |
| **Special Considerations** | Simple and fast, but may not be suitable for datasets with complex relationships or non-random missingness patterns | Similar to mean imputation but more robust to outliers | Useful for nominal data; requires careful consideration for ordinal or interval data | Flexible in handling different missingness contexts but requires thoughtful choice of the constant value | Requires careful tuning of `n_neighbors` and distance metric; performance may vary with data dimensionality | Can significantly increase the feature space; effectiveness depends on the downstream model's capacity | Requires selection of an appropriate estimator; computational demand and convergence criteria need careful management |
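
Two of the table's claims are easy to check numerically: mean imputation shrinks the spread of a column, and a missing indicator adds one extra binary feature. The following is a small sketch on synthetic data (the normal distribution, the 30% missing rate, and the seed are arbitrary choices for illustration).

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Synthetic column with 30% of values removed completely at random (MCAR)
x = rng.normal(loc=30.0, scale=10.0, size=(1000, 1))
x_missing = x.copy()
x_missing[rng.random(1000) < 0.3, 0] = np.nan

# Spread of the observed values versus spread after mean imputation
observed_std = np.nanstd(x_missing)
mean_filled = SimpleImputer(strategy="mean").fit_transform(x_missing)
imputed_std = mean_filled.std()
print(imputed_std < observed_std)  # True: filled values all sit at the mean

# add_indicator=True appends a binary "was missing" column to the output
with_flag = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(x_missing)
print(with_flag.shape)  # (1000, 2)
```

Every imputed value lands exactly on the observed mean, which is why the standard deviation drops. The indicator column keeps the fact of missingness available to the downstream model.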
