## Day 35 – Handling Missing Data (Part 1)

Real-world data is rarely clean.  
Handling missing values correctly is a core Machine Learning skill.

---

## What is Missing Data?
Missing data refers to values that are:
- Not recorded
- Lost
- Corrupted
- Not applicable

Ignoring missing data can lead to biased models and incorrect predictions.
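Before choosing a strategy, it helps to measure how much is actually missing. The sketch below uses pandas on a small made-up DataFrame; the column names and values are illustrative assumptions, not taken from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; pandas treats both np.nan and None as missing
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "city": ["Pune", "Delhi", None, "Mumbai", "Delhi"],
    "salary": [50000, 62000, 58000, np.nan, 71000],
})

missing_count = df.isna().sum()        # number of missing values per column
missing_pct = df.isna().mean() * 100   # percentage of missing values per column
rows_with_any_missing = int(df.isna().any(axis=1).sum())

print(missing_count)
print(missing_pct.round(1))
print(rows_with_any_missing)
```

Per-column percentages and the count of affected rows answer different questions: each column can look nearly complete while most rows still have at least one gap.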

---

## Types of Missing Data

### 1. MCAR – Missing Completely At Random
Missingness has no relationship with any variable.

**Example:**  
A sensor randomly fails to record temperature.

**Notes:**
- Safe to drop rows if the missing percentage is small.
- Dropping rows does not introduce bias, but it does reduce the sample size.

---

### 2. MAR – Missing At Random
Missingness depends on other observed features, not the missing value itself.

**Example:**  
Salary is missing more often for younger individuals.

**Notes:**
- Imputation methods generally work well.
- Most common case in real datasets.

---

### 3. MNAR – Missing Not At Random
Missingness depends on the value that is missing.

**Example:**  
High-income users choose not to disclose salary.

**Notes:**
- Hardest type to handle.
- Requires strong domain knowledge.
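To make the three mechanisms concrete, here is a rough simulation. Everything in it is an assumption chosen for illustration: the synthetic `age` and `salary` columns, the missingness probabilities, and the fixed random seed. It compares the true mean salary with the mean of the values that remain observed under each mechanism.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(20, 60, size=n)
# Synthetic salary that grows with age, plus noise
salary = 20_000 + 1_000 * (age - 20) + rng.normal(0, 5_000, size=n)

# MCAR: every row has the same 20% chance of losing its salary
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on an observed column (age)
mar_mask = rng.random(n) < np.where(age < 30, 0.4, 0.1)

# MNAR: missingness depends on the hidden value itself (high salaries withheld)
high_salary = salary > np.quantile(salary, 0.75)
mnar_mask = rng.random(n) < np.where(high_salary, 0.6, 0.05)

true_mean = salary.mean()
observed_means = pd.Series({
    "MCAR": salary[~mcar_mask].mean(),
    "MAR": salary[~mar_mask].mean(),
    "MNAR": salary[~mnar_mask].mean(),
})
print(round(true_mean))
print(observed_means.round())
```

In this setup the MCAR mean should stay close to the true mean, while the MAR and MNAR means drift away from it. The difference is that the MAR drift can be corrected using the observed `age` column, whereas the MNAR drift cannot be recovered from the observed data alone.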

---

## Methods to Handle Missing Data (Part 1)

### 1. Deletion Methods

- Row-wise deletion
- Column-wise deletion

**Use when:**
- Less than 5% of values are missing (a common rule of thumb)
- Dataset size is large

**Risk:**
- Loss of information
- Possible bias if the data is not actually MCAR
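Both deletion styles can be sketched with pandas `dropna` and `drop`. The toy DataFrame and the 50% sparsity cut-off below are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one mostly empty column
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, 52],
    "salary": [50000, 62000, np.nan, 45000, 71000],
    "notes": [np.nan, np.nan, np.nan, "vip", np.nan],
})

# Row-wise deletion: keep only rows with no missing values at all
rows_dropped = df.dropna(axis=0)

# Column-wise deletion: drop columns where more than half the values are missing
# (the 50% cut-off is an arbitrary choice for this sketch)
too_sparse = df.columns[df.isna().mean() > 0.5]
cols_dropped = df.drop(columns=too_sparse)

# Dropping the sparse column first leaves more complete rows to keep
both = cols_dropped.dropna(axis=0)

print(rows_dropped.shape, cols_dropped.shape, both.shape)
```

Row-wise deletion on the raw frame keeps very few rows here because one sparse column invalidates almost every row, which is why it can be worth checking column-level sparsity first.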

---

### 2. Simple Imputation

#### Numerical Features
- Mean (sensitive to outliers)
- Median (robust to outliers; usually preferred for skewed data)
- Constant value (e.g., 0 or -1)

#### Categorical Features
- Mode
- "Unknown" category

**Pros:**
- Easy and fast to implement

**Cons:**
- Can distort data distribution
- Ignores feature relationships
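The difference between mean and median imputation shows up clearly when a column contains an outlier. A minimal sketch, assuming a made-up `income` column with one extreme value and a made-up `city` column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 1000 is a deliberate outlier
df = pd.DataFrame({
    "income": [30, 32, 35, 31, 1000, np.nan],
    "city": ["Pune", "Delhi", "Delhi", None, "Mumbai", "Delhi"],
})

mean_fill = df["income"].mean()      # pulled upward by the outlier
median_fill = df["income"].median()  # not affected by how extreme the outlier is
mode_fill = df["city"].mode()[0]     # most frequent category

# Build an imputed copy and leave the original frame untouched
imputed = df.assign(
    income=df["income"].fillna(median_fill),
    city=df["city"].fillna(mode_fill),
)
print(mean_fill, median_fill, mode_fill)
```

Here the mean lands far above every typical value, so filling with it would insert an unrealistic income, while the median stays within the bulk of the data.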

---

### 3. When Not to Impute
- Target variable is missing
- MNAR data without domain understanding
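For the first case, a common approach is to drop rows with a missing target instead of imputing a label. A short sketch, where the `churned` target column is a hypothetical name chosen for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "churned" plays the role of the target variable
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "churned": [0, 1, np.nan, 1],
})

# Drop rows whose label is unknown; gaps in features (age) are kept for later imputation
labelled = df.dropna(subset=["churned"])
print(len(df), len(labelled))
```

Restricting `dropna` with `subset` removes only the unlabelled rows, so rows with gaps in feature columns are still available for imputation.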

---

## Important Insight
Missing values can sometimes carry useful information.

**Example:**  
A missing credit score may indicate a risky customer.

In such cases, create a binary feature indicating whether the value was missing.
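One way to build such an indicator with pandas is sketched below. The `credit_score` values and the `credit_score_missing` column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"credit_score": [720, np.nan, 650, np.nan, 800]})

# Record missingness BEFORE imputing, otherwise the signal is lost
df["credit_score_missing"] = df["credit_score"].isna().astype(int)

# Then fill the original column (median used here as one possible choice)
df["credit_score"] = df["credit_score"].fillna(df["credit_score"].median())
print(df)
```

The order matters: once the column has been filled, the imputed rows can no longer be told apart from genuinely observed ones.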

---

## Simple Example (Python)

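```python
# Assign the result back: calling fillna(..., inplace=True) on a selected
# column is deprecated in recent pandas versions and may not modify df.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna("Unknown")
```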


---

## Complete Case Analysis (CCA)

Complete case analysis (CCA), also known as list-wise deletion, discards every observation that has a missing value in any variable. In other words, only observations with information in all variables of the dataset are analyzed.

## Assumption for CCA

- The data is missing completely at random (MCAR).

**Advantages:**
- Easy to implement, since no data manipulation is required.
- Preserves variable distributions: if the data is MCAR, the distribution of each variable in the reduced dataset should match its distribution in the original dataset.

**Disadvantages:**
- It can exclude a large fraction of the original dataset if missing data is abundant.
- The excluded observations could be informative for the analysis if the data is not MCAR.
- A model built only on complete cases will not know how to handle missing values in production.

## When to use CCA 

- The data is MCAR.
- Less than 5% of the data is missing.