# **MICE (Multivariate Imputation by Chained Equations)**

MICE is a sophisticated method for filling missing values. Unlike simple imputation (mean/median) which looks at one column at a time, MICE models each feature with missing values as a function of other features.
It works by:
1.  Filling missing data with a placeholder (e.g., mean).
2.  Iteratively training a model (like Linear Regression or Bayesian Ridge) to predict the missing values of one column based on the others.
3.  Repeating this process multiple times until the values stabilize (converge).
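
The three steps above can be sketched by hand. This is a toy illustration, not how any particular library implements MICE: the function name `chained_imputation`, the `tol` stopping rule, and the choice of plain `LinearRegression` are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def chained_imputation(X, n_rounds=10, tol=1e-3):
    """Toy chained-equations imputer for a 2-D float array containing NaNs."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: placeholder fill with each column's observed mean.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_rounds):
        previous = filled.copy()
        # Step 2: for each column with gaps, fit a model on the rows where
        # that column was observed, then predict the rows where it was not.
        for j in np.where(missing.any(axis=0))[0]:
            rows_missing = missing[:, j]
            others = np.delete(filled, j, axis=1)
            model = LinearRegression()
            model.fit(others[~rows_missing], X[~rows_missing, j])
            filled[rows_missing, j] = model.predict(others[rows_missing])
        # Step 3: stop once the imputed values barely change between rounds.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled


X = np.array([
    [25.0, 10.0, 1.0],
    [30.0, 20.0, 0.0],
    [np.nan, 30.0, 1.0],
    [40.0, np.nan, 2.0],
    [22.0, 15.0, 0.0],
    [np.nan, 50.0, 1.0],
    [35.0, 60.0, 2.0],
])
X_filled = chained_imputation(X)
print(X_filled.round(2))
```

Observed entries are never overwritten; only the cells that started as NaN are updated in each round.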


### Code Implementation
Scikit-Learn provides a MICE-inspired implementation called `IterativeImputer`. Note that it is still experimental, so we must enable it explicitly.
```python
import pandas as pd
import numpy as np
# 1. Enable experimental IterativeImputer (this import must run before importing IterativeImputer)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# 2. Create sample data with missing values
data = {
    'Age': [25, 30, np.nan, 40, 22, np.nan, 35],
    'Fare': [10, 20, 30, np.nan, 15, 50, 60],
    'SibSp': [1, 0, 1, 2, 0, 1, 2]
}
df = pd.DataFrame(data)
print("Original Data (with NaNs):")
print(df)
# 3. Initialize MICE Imputer
# estimator: The model used to predict missing values (default is BayesianRidge)
# max_iter: Maximum number of imputation rounds
mice_imputer = IterativeImputer(max_iter=10, random_state=0)
# 4. Fit and Transform
# Returns a numpy array, so we convert it back to a DataFrame
df_imputed_array = mice_imputer.fit_transform(df)
df_imputed = pd.DataFrame(df_imputed_array, columns=df.columns)
print("\nImputed Data (MICE):")
print(df_imputed)
```

#### Pros
1.  **High Accuracy**: When features are correlated, it usually gives more accurate imputations than simple mean/median filling, and often than KNN, because it models the *relationships* between features.
2.  **Preserves Variance**: It maintains the statistical distribution and variability of the data better than mean imputation (which shrinks variance).
3.  **Flexible**: You can choose different estimators (e.g., Linear Regression for simple data, Decision Trees for complex non-linear relationships) to predict the missing values.
4.  **Handles Multiple Variables**: It can handle datasets where multiple columns have missing values simultaneously by cycling through them.
#### Cons
1.  **Computationally Expensive**: It is slower than simple imputation (and typically KNN) because it fits a regression model for every feature with missing data, and repeats this in every iteration.
2.  **Assumes MAR**: It assumes that the data is *Missing At Random (MAR)*, meaning the missingness can be explained by other observed variables. If the data is *Missing Not At Random (MNAR)*, MICE can produce biased imputations.
3.  **Complexity**: It is more complex to implement and tune compared to filling with the mean or median.
4.  **Sensitive to Outliers**: If the underlying estimator (like Linear Regression) is sensitive to outliers, the imputed values can be skewed.
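
As a rough illustration of the "Flexible" and "Preserves Variance" points, the sketch below swaps in a tree-based estimator and then draws several imputations with `sample_posterior=True`. The small DataFrame, the `max_depth=3` setting, and the choice of five seeds are assumptions for the example, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "Age": [25, 30, np.nan, 40, 22, np.nan, 35],
    "Fare": [10, 20, 30, np.nan, 15, 50, 60],
    "SibSp": [1, 0, 1, 2, 0, 1, 2],
})

# Swap the default BayesianRidge for a tree-based estimator, which can
# capture non-linear relationships between columns.
tree_imputer = IterativeImputer(
    estimator=DecisionTreeRegressor(max_depth=3, random_state=0),
    max_iter=10,
    random_state=0,
)
df_tree = pd.DataFrame(tree_imputer.fit_transform(df), columns=df.columns)
print(df_tree)

# Draw several imputations instead of one. sample_posterior=True samples each
# imputed value from the estimator's predictive distribution (the default
# BayesianRidge supports this), so different seeds give different fills.
draws = []
for seed in range(5):
    sampler = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    draws.append(sampler.fit_transform(df))

# Imputed Age for the row at index 2 across the five draws.
age_draws = np.array([d[2, 0] for d in draws])
print("Imputed Age (row 2) across draws:", age_draws.round(1))
```

The spread across draws gives a sense of how uncertain each imputed value is, which a single filled-in table hides. On a dataset this small, scikit-learn may emit a convergence warning; that is expected for a toy example.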

# Types of Missing Data

Understanding *why* data is missing is crucial because it determines how you should handle it.
### 1. MCAR (Missing Completely At Random)
The missingness has **no relationship** with any values, observed or missing. It is purely random.
*   **Definition**: The probability of a value being missing is the same for all observations.
*   **Example**: A lab technician accidentally drops a test tube and loses the blood sample. It could have happened to anyone, regardless of their age, gender, or health status.
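*   **Implication**: This is the best-case scenario. Dropping these rows does not bias your estimates (it only reduces the sample size), and simple imputation (mean/median) leaves the column average unbiased, although it still shrinks the variance.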

### 2. MAR (Missing At Random)
The missingness is related to **observed data** but not the missing data itself.
*   **Definition**: The probability of a value being missing depends on other available information in the dataset.
*   **Example**: In a survey, **Men** (observed variable) are more likely to refuse to answer questions about **Depression** (missing variable). The missingness is not purely random, but it can be explained by Gender: within each gender, whether a respondent answers does not depend on how depressed they actually are.
*   **Implication**: You cannot just drop these rows, or you will bias your data (e.g., men would be under-represented). Techniques like **MICE** or **KNN Imputation** work well here because they use the observed data (Gender) to predict the missing values.

### 3. MNAR (Missing Not At Random)
The missingness is related to the **value itself**. This is the hardest type to handle.
*   **Definition**: The value is missing *because* of what the value would have been.
*   **Example**: People with **very high incomes** refuse to disclose their income. The fact that the data is missing tells you something about the value (it's likely high).
*   **Implication**: Standard imputation methods (Mean, Median, MICE) are likely to introduce bias, because no observed column fully explains the missingness. You often need to model the missingness explicitly (e.g., add an indicator column such as `Income_Missing` that is 1 where the value is absent) or consult a domain expert.
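
A small simulation can make the three mechanisms concrete. Everything here is synthetic and assumed for illustration: the `age` and `income` variables, their linear relationship, and the logistic missingness probabilities. The sketch compares the naive mean of the observed incomes with a regression-adjusted mean that uses `age` (a stand-in for what model-based imputation does).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical population: income depends partly on age.
age = rng.normal(40, 10, n)
income = 1_000 * age + rng.normal(0, 10_000, n)
true_mean = income.mean()


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# Probability that income is missing under each mechanism.
missing_prob = {
    "MCAR": np.full(n, 0.3),                                    # same for everyone
    "MAR": sigmoid((age - age.mean()) / age.std()),             # depends on observed age
    "MNAR": sigmoid((income - income.mean()) / income.std()),   # depends on income itself
}

naive, adjusted = {}, {}
for name, p in missing_prob.items():
    observed = rng.random(n) >= p
    # Naive estimate: average only the incomes we can see.
    naive[name] = income[observed].mean()
    # Adjusted estimate: fit income ~ age on observed rows, predict for everyone.
    slope, intercept = np.polyfit(age[observed], income[observed], 1)
    adjusted[name] = (intercept + slope * age).mean()

print(f"True mean income: {true_mean:,.0f}")
for name in missing_prob:
    print(f"{name}: naive = {naive[name]:,.0f}, age-adjusted = {adjusted[name]:,.0f}")
```

In this setup, the naive mean should stay close to the truth only under MCAR; adjusting for age should repair the MAR case, while the MNAR case should remain biased even after the adjustment.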



# Difference Between MAR and MNAR
The distinction lies in the **cause** of the missingness.
### 1. MAR (Missing At Random)
*   **Definition**: The missingness is related to **other observed variables** in the dataset, but not the missing value itself.
*   **Logic**: *"I can explain the missing data by looking at other columns."*
*   **Example**: **Men** (observed) are less likely to report their **Depression Score** (missing).
    *   The missingness is driven by *Gender*, which we know.
*   **Solution**: Techniques like **MICE** or **KNN** work well because they use the observed correlations (e.g., between Gender and Depression) to fill the gaps.
### 2. MNAR (Missing Not At Random)
*   **Definition**: The missingness is related to the **value itself**.
*   **Logic**: *"The value is missing precisely because of what the value is."*
*   **Example**: People with **Severe Depression** (unobserved) are too unmotivated to fill out the **Depression Score** survey.
    *   The missingness is driven by the *Depression Score itself*, which we don't know.
*   **Solution**: Standard imputation is likely to be biased here. You often need to create a binary flag (`Is_Missing`) or consult a domain expert, as the missingness itself contains critical information.
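
A minimal sketch of the binary-flag idea, first with plain pandas and then with scikit-learn's `add_indicator` option. The `Age`/`Income` data and the `Income_Missing` column name are made up for illustration, and median filling is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

raw = pd.DataFrame({
    "Age": [25, 30, 41, 40, 22],
    "Income": [30_000, np.nan, 52_000, np.nan, 28_000],
})

# Option 1: pandas. Record the missingness BEFORE filling, otherwise the
# information is lost once the NaNs are replaced.
flagged = raw.copy()
flagged["Income_Missing"] = flagged["Income"].isna().astype(int)
flagged["Income"] = flagged["Income"].fillna(flagged["Income"].median())
print(flagged)

# Option 2: scikit-learn. add_indicator=True appends one 0/1 column for each
# feature that had missing values during fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
with_flags = imputer.fit_transform(raw)
print(with_flags)
```

The flag does not remove the bias in the filled values, but it lets a downstream model learn that "missing" is itself a signal.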
### Summary Comparison
| Feature | MAR (Missing At Random) | MNAR (Missing Not At Random) |
| :--- | :--- | :--- |
| **Cause** | Caused by *other* observed data. | Caused by the *missing value itself*. |
| **Predictability** | Predictable using other columns. | Not predictable from data alone. |
| **Example** | Men are less likely to report their salary, whatever it is. | High earners are less likely to report their salary. |
| **Best Fix** | MICE, KNN Imputation. | Flag as missing, Domain Expert. |