### Handling Missing Numerical Data

1) Univariate Imputation
2) Multivariate Imputation

## 1. Univariate Imputation

Univariate Imputation is a missing value handling technique in which **each feature is imputed independently**, without using information from other variables in the dataset.

The replacement value is computed using **only the non-missing values of that same feature**.

**Examples:**
- Numerical: mean, median, random sample, end-of-distribution
- Categorical: mode, `"Missing"` or `"Unknown"`

**Key characteristics:**
- Simple and fast
- Does not consider relationships between features
- Suitable when missingness is low or features are weakly correlated
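
As a minimal sketch of univariate imputation, the snippet below fills each column using only that column's own observed values. The toy DataFrame and its column names (`age`, `city`) are made up for illustration; `SimpleImputer` is scikit-learn's univariate imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numerical and one categorical feature.
df = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 41.0, 30.0],
    "city": ["Pune", np.nan, "Delhi", "Pune", np.nan],
})

# Numerical column: replace NaN with the median of the observed ages.
num_imputer = SimpleImputer(strategy="median")
df["age_imputed"] = num_imputer.fit_transform(df[["age"]]).ravel()

# Categorical column: replace NaN with a constant "Missing" label.
df["city_imputed"] = df["city"].fillna("Missing")

print(df)
```

Neither fill looks at the other column, which is exactly what makes the approach univariate.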

---

## 2. Multivariate Imputation

Multivariate Imputation is a technique where **missing values are imputed using information from multiple features**, capturing relationships and dependencies within the dataset.

Each feature with missing values is modeled as a function of the other features.

**Examples:**
- KNN Imputer
- Iterative Imputer (MICE)

**Key characteristics:**
- Preserves feature relationships
- Often more accurate than univariate methods, especially when features are correlated
- Computationally expensive
- Requires careful scaling and preprocessing
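
The two listed imputers can be sketched with scikit-learn as follows. The array is invented so that the second column is roughly twice the first; `n_neighbors=2` and `random_state=0` are arbitrary choices for the illustration, not recommended settings.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer

# Hypothetical data: the second column is roughly twice the first.
X = np.array([
    [1.0, 2.1],
    [2.0, 3.9],
    [3.0, np.nan],
    [4.0, 8.2],
    [5.0, 9.8],
])

# KNN: average the missing feature over the 2 nearest rows,
# where distance is computed on the columns both rows have observed.
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# Iterative (MICE-style): regress the incomplete feature on the other feature.
mice_filled = IterativeImputer(random_state=0).fit_transform(X)

print("KNN fill:      ", knn_filled[2, 1])
print("Iterative fill:", mice_filled[2, 1])
```

Both fills use the first column to estimate the gap in the second. A univariate mean fill would use only the second column's own observed values.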



### Mean/Median Imputation

| Aspect                  | Mean           | Median      |
| ----------------------- | -------------- | ----------- |
| Sensitivity to outliers | High           | Low         |
| Best for                | Symmetric data | Skewed data |
| Variance impact         | High           | Moderate    |
| Distribution distortion | High           | Lower       |


Mean and Median Imputation are **univariate numerical imputation techniques** used to handle missing values by replacing them with a single summary statistic computed from the observed values of the same feature.

### Mean Imputation
Missing values are replaced with the **mean (average)** of the non-missing observations.  
It is suitable when the feature follows a **roughly symmetric distribution** and does not contain significant outliers.  
Mean imputation is simple and fast but is **highly sensitive to outliers** and reduces the variance of the feature, which can distort the original distribution.

### Median Imputation
Missing values are replaced with the **median (middle value)** of the non-missing observations.  
It is preferred when the feature is **skewed** or contains **outliers**, as the median is more robust than the mean.  
Although median imputation preserves the central tendency of skewed data better than the mean does, it still reduces variance and ignores relationships with other features.

**Key Note:**  
Both methods preserve dataset size but assume that missing values follow the same distribution as the observed data, which may introduce bias if this assumption is violated.
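
To make the outlier sensitivity concrete, here is a small sketch on invented salary figures (in thousands) that contain one extreme value. The numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one extreme outlier and two missing values.
salary = pd.Series([30, 32, 35, 38, 40, 1000, np.nan, np.nan], dtype="float64")

mean_value = salary.mean()      # pulled upward by the outlier
median_value = salary.median()  # barely affected by the outlier

mean_filled = salary.fillna(mean_value)
median_filled = salary.fillna(median_value)

print(f"mean used for filling:   {mean_value:.2f}")
print(f"median used for filling: {median_value:.2f}")
```

The mean-based fill lands far above every typical salary in the sample, while the median-based fill stays among them.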


## Mean and Median Imputation (Univariate – Numerical)

Mean and Median Imputation are **univariate numerical imputation techniques** where missing values are replaced using a single summary statistic computed from the available values of the same feature.

---

## Mean Imputation

Mean imputation replaces missing values with the **average (mean)** of the non-missing observations.

### Definition
For a numerical feature \( X \), missing values are replaced with:
\[
\text{Mean}(X)
\]

### When to Use
- Data is **approximately normally distributed**
- No significant outliers
- Missingness is plausibly MCAR (missing completely at random) and affects only a small share of values

### Advantages
- Simple and fast
- Preserves sample size
- Works well for symmetric distributions

### Disadvantages
- Sensitive to outliers
- Reduces variance
- Distorts feature distribution
- Weakens correlations with other variables
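
The variance and correlation effects listed above can be checked on synthetic data. This is a sketch under assumed settings: the linear relationship, the 30% missing rate, and the random seed are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: y depends linearly on x, then about 30% of x is removed at random.
x = rng.normal(loc=50, scale=10, size=1000)
y = 2 * x + rng.normal(scale=5, size=1000)
df = pd.DataFrame({"x": x, "y": y})
mask = rng.random(1000) < 0.3
df.loc[mask, "x"] = np.nan

x_filled = df["x"].fillna(df["x"].mean())

# Variance of the observed values vs. variance after mean imputation.
var_observed = df["x"].var()
var_imputed = x_filled.var()

# Correlation with y on complete pairs vs. after mean imputation.
corr_observed = df["x"].corr(df["y"])
corr_imputed = x_filled.corr(df["y"])

print(f"variance:    {var_observed:.1f} -> {var_imputed:.1f}")
print(f"correlation: {corr_observed:.3f} -> {corr_imputed:.3f}")
```

Every imputed row sits exactly at the mean, so it adds nothing to the spread of `x` and nothing to its covariance with `y`. Both numbers shrink as a result.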

### Example
```python
# Fill missing ages with the mean of the observed ages (assumes a DataFrame `df` with an 'age' column).
df['age'] = df['age'].fillna(df['age'].mean())
```

---

## Median Imputation

Median imputation replaces missing values with the **median (middle value)** of the non-missing observations.

### Definition
For a numerical feature \( X \), missing values are replaced with:
\[
\text{Median}(X)
\]

### When to Use
- Data is **skewed**
- Presence of **outliers**
- Missingness is plausibly MCAR (missing completely at random) and affects only a small share of values

### Advantages
- Robust to outliers
- Preserves central tendency better for skewed data
- Simple and fast to implement

### Disadvantages
- Reduces variance
- Distorts feature distribution
- Ignores relationships with other features

### Example
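```python
# Fill missing ages with the median of the observed ages (assumes a DataFrame `df` with an 'age' column).
df['age'] = df['age'].fillna(df['age'].median())
```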


## End-of-Distribution Imputation

End-of-distribution imputation replaces missing values with an **extreme value** taken from the tail of the distribution (for example, the 99th percentile, 1st percentile, or a very large/small constant).
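
A minimal sketch of the percentile variant, using a made-up `income` series. The choice of the 99th percentile follows the example in the text and is not the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical feature with two missing entries.
income = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, np.nan, 45.0, np.nan])

# Use the 99th percentile of the observed values as the fill value.
end_value = income.quantile(0.99)
income_filled = income.fillna(end_value)

print(end_value)
print(income_filled.tolist())
```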

---

## Use End-of-Distribution Imputation When

### 1. Missingness is Informative
Use it when the fact that a value is missing itself carries information.

**Example:**
- Missing income → likely very high or very low
- Missing transaction amount → unusual behavior

In such cases, placing missing values at the distribution tail helps the model identify them as a special group.

---

### 2. You Want the Model to Detect Missing Values Explicitly
Extreme values act as a **flag**, allowing models (tree-based ones most reliably) to learn:
> “This observation was missing.”

This is useful when you do **not want to add a separate missing indicator**.

---

### 3. Data is Skewed or Non-Normal
For skewed features, mean or median imputation piles the imputed values up near the center of the distribution.  
End-of-distribution imputation avoids that, at the cost of a spike in the tail.

---

### 4. Tree-Based Models Are Used
Models such as:
- Decision Trees
- Random Forests
- Gradient Boosting

handle extreme values well and can split cleanly on them.
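
The "split cleanly" behaviour can be illustrated with a deliberately extreme synthetic case. All of it is an assumption made for the sketch: the feature is pure noise, the target equals 1 exactly where the feature is missing, and the fill value is the observed maximum plus an arbitrary 1.0.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical setup: the target is 1 exactly where the feature is missing
# (fully informative missingness); the observed values carry no signal.
n = 500
x = rng.normal(size=n)
missing = rng.random(n) < 0.3
y = missing.astype(int)
x_obs = np.where(missing, np.nan, x)

# End-of-distribution fill: a value beyond the observed maximum.
end_value = np.nanmax(x_obs) + 1.0
x_end = np.where(missing, end_value, x_obs).reshape(-1, 1)

# Median fill for comparison: imputed rows land in the middle of the data.
x_med = np.where(missing, np.nanmedian(x_obs), x_obs).reshape(-1, 1)

# A single split (depth 1) is enough to isolate the tail, but not the center.
tree_end = DecisionTreeClassifier(max_depth=1, random_state=0).fit(x_end, y)
tree_med = DecisionTreeClassifier(max_depth=1, random_state=0).fit(x_med, y)

acc_end = tree_end.score(x_end, y)  # training accuracy
acc_med = tree_med.score(x_med, y)  # training accuracy

print(f"end-of-distribution fill: {acc_end:.2f}")
print(f"median fill:              {acc_med:.2f}")
```

Real data will rarely be this clean. The point is only that a tail value can be separated by one threshold, whereas a central value cannot.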

---

### 5. Missing Values Are Not MCAR
When missingness is MAR or MNAR and related to the target, end-of-distribution imputation can preserve predictive signal.

---

## When NOT to Use It

- When missingness is truly random (MCAR)
- When using distance-based models (KNN, K-Means)
- When features are tightly bounded
- When interpretability is critical
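
The warning about distance-based models can be seen in a tiny numeric sketch. The three rows and the fill values are invented; row `c` is assumed to have had its second feature missing.

```python
import numpy as np

# Hypothetical 2-feature rows; the second feature of row c was missing.
a = np.array([1.0, 10.0])
b = np.array([1.2, 12.0])
c_central = np.array([1.1, 11.0])   # gap filled with a central value
c_extreme = np.array([1.1, 100.0])  # gap filled with an extreme value

dist_ab = np.linalg.norm(a - b)
dist_central = np.linalg.norm(a - c_central)
dist_extreme = np.linalg.norm(a - c_extreme)

print(f"a to b:                {dist_ab:.2f}")
print(f"a to c (central fill): {dist_central:.2f}")
print(f"a to c (extreme fill): {dist_extreme:.2f}")
```

With the extreme fill, row `c` is far from every other row, so a KNN or K-Means model would treat it as an outlier purely because of the imputation.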

---

## Common Choices for End Values

- 99th percentile or 1st percentile
- Max + constant
- Min − constant
- Domain-specific extreme values
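
The data-driven choices in this list can be computed with a small helper. `end_value_candidates` and its `offset` parameter are hypothetical names introduced for this sketch; domain-specific values are left out because they cannot be derived from the data.

```python
import numpy as np
import pandas as pd

def end_value_candidates(s: pd.Series, offset: float = 1.0) -> dict:
    """Return candidate fill values computed from the observed entries of s.

    `offset` is a hypothetical constant placed beyond the observed range.
    """
    observed = s.dropna()
    return {
        "p99": observed.quantile(0.99),
        "p01": observed.quantile(0.01),
        "max_plus_offset": observed.max() + offset,
        "min_minus_offset": observed.min() - offset,
    }

s = pd.Series([10.0, 12.0, np.nan, 15.0, 18.0, np.nan, 20.0])
candidates = end_value_candidates(s, offset=5.0)

# Pick one candidate and apply it.
filled = s.fillna(candidates["max_plus_offset"])

print(candidates)
print(filled.tolist())
```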

---

## Key Drawbacks

- Artificially inflates variance
- Creates spikes in the distribution
- Can mislead linear models if overused
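
The first two drawbacks can be measured on synthetic data. The normal distribution, the roughly 20% missing rate, and the "max + 10" fill are assumptions chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical feature with roughly 20% of values removed at random.
values = pd.Series(rng.normal(loc=100, scale=15, size=1000))
values[rng.random(1000) < 0.2] = np.nan

fill_value = values.max() + 10  # "max + constant" with an arbitrary constant
median_filled = values.fillna(values.median())
end_filled = values.fillna(fill_value)

var_observed = values.var()
var_median = median_filled.var()
var_end = end_filled.var()

# Share of rows sitting exactly at the fill value (the spike in the tail).
spike_share = (end_filled == fill_value).mean()

print(f"variance observed:   {var_observed:.1f}")
print(f"variance median:     {var_median:.1f}")
print(f"variance end-of-tail:{var_end:.1f}")
print(f"spike share:         {spike_share:.2f}")
```

Median filling shrinks the variance, while the tail fill inflates it and concentrates about a fifth of the rows at a single value.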

---
> End-of-distribution imputation is used when missingness is informative: placing missing values at the distribution tail lets models, especially tree-based ones, treat them as a distinct signal.
