# **7 Days Data Cleaning Course**

# **Course** : Machine Learning 

# **Day 2:** Missing Data Handling 

# **Student**: Muhammad Shafiq

-----------------------------------

## **Types of Missing Data** 

  1. **MCAR – Missing Completely At Random**

     - No pattern. Just bad luck: missingness is unrelated to any value in the data.
     - → Safe to impute or drop.

     Example: 10% of the Age column is blank at random.

  2. **MAR – Missing At Random**

     - Missingness depends on other observed features, not on the missing value itself.
     - → Use ML-based imputation.

     Example: Women are more likely to have a missing “Fare” value.

  3. **MNAR – Missing Not At Random**

     - Missingness depends on the missing value itself.
     - → Dangerous to impute blindly.

     Example: Rich people don’t enter their salary → missing salary = rich? 🤯
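The three mechanisms above can be sketched on simulated data. This is only an illustration: the `toy` frame, its `sex` and `salary` columns, and all the probabilities are made-up assumptions, not values from the Titanic dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical data: one categorical feature and one numeric feature.
toy = pd.DataFrame({
    "sex": rng.choice(["female", "male"], size=n),
    "salary": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every row has the same 10% chance of losing its salary.
mcar_mask = rng.random(n) < 0.10

# MAR: the chance depends on another observed column (sex), not on salary.
mar_mask = rng.random(n) < np.where(toy["sex"] == "female", 0.30, 0.05)

# MNAR: the chance depends on the salary value itself (high earners hide it).
mnar_mask = rng.random(n) < np.where(toy["salary"] > 65_000, 0.60, 0.02)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = toy.loc[~mask, "salary"].mean()
    print(f"{name}: {mask.mean():.1%} missing, observed mean = {observed_mean:,.0f}")
```

Under the MNAR mask the observed mean falls below the true mean, because the hidden values are mostly the large ones. That is why blind imputation is risky in that case.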

### **Exploring Missing Values**

In [1]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# Missing count
missing = df.isnull().sum()
missing[missing > 0].sort_values(ascending=False)


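Raw counts are easier to act on next to percentages. Here is a minimal sketch on a small hand-made frame, which also runs when the dataset URL is unreachable. The `sample` frame and its values are hypothetical stand-ins for a few Titanic columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few Titanic columns (values are invented).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull().mean() gives the fraction of missing values per column.
summary = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": sample.isnull().mean() * 100,
})

# Keep only columns that have missing values, worst first.
summary = summary[summary["missing_count"] > 0].sort_values("missing_pct", ascending=False)
print(summary)
```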

### **Drop or Filter**

| Case                                                | Strategy                  |
| --------------------------------------------------- | ------------------------- |
| Missing > 60%                                       | Drop column (`df.drop()`) |
| Missing < 10%                                       | Impute (mean/median/mode) |
| Critical feature (e.g., target)                     | Drop row                  |
| Has signal (e.g., "Cabin" = missing → lower class?) | Create `missing_flag`     |
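The thresholds in the table can be turned into a rough first-pass helper. `suggest_strategy` is a hypothetical function written for this note, not a pandas API. The toy data is invented, and the "has signal" row still needs a human decision.

```python
import numpy as np
import pandas as pd

def suggest_strategy(frame, target, drop_above=0.60, impute_below=0.10):
    """Map each column's missing fraction to a rule of thumb from the table."""
    suggestions = {}
    for column, fraction in frame.isnull().mean().items():
        if fraction == 0:
            continue  # nothing missing, nothing to decide
        if column == target:
            suggestions[column] = "drop rows"
        elif fraction > drop_above:
            suggestions[column] = "drop column (or keep a missing flag)"
        elif fraction < impute_below:
            suggestions[column] = "impute (mean/median/mode)"
        else:
            suggestions[column] = "review manually"
    return suggestions

# Invented 20-row example: 5% / 75% / 5% / 20% missing.
toy = pd.DataFrame({
    "Survived": [1, 0] * 9 + [np.nan, 1],
    "Cabin": ["C85"] * 5 + [np.nan] * 15,
    "Embarked": ["S"] * 19 + [np.nan],
    "Age": [30.0] * 16 + [np.nan] * 4,
})

for column, action in suggest_strategy(toy, target="Survived").items():
    print(f"{column}: {action}")
```

Columns between the two thresholds are marked for manual review because the table gives no rule for that range.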


### **Drop Cabin: Too Much Missing**

In [3]:
# Drop the column once; running the same drop again raises a KeyError
df.drop('Cabin', axis=1, inplace=True)


### **Simple Imputation**

In [None]:
# Numeric -----> mean or median
df['Age'] = df['Age'].fillna(df['Age'].median())

# Categorical --------> mostly mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
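To see why the median is often preferred over the mean for a skewed numeric column, here is a small sketch on invented values. The `toy` frame is hypothetical, and the 80 plays the role of an outlier.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 80.0],
    "Embarked": ["S", "C", "S", np.nan, "S"],
})

age_median = toy["Age"].median()           # 36.5, barely affected by the outlier
age_mean = toy["Age"].mean()               # 43.75, pulled upward by the outlier
embarked_mode = toy["Embarked"].mode()[0]  # "S", the most frequent category

# fillna returns a new Series, so assign the result back.
filled = toy.assign(
    Age=toy["Age"].fillna(age_median),
    Embarked=toy["Embarked"].fillna(embarked_mode),
)
print(filled)
```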

### **ML-Based Imputation: KNN Imputer**

In [None]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Select relevant numeric features (Age must still contain NaN for KNN to fill it)
features = ['Age', 'Fare', 'Pclass', 'SibSp', 'Parch']
data = df[features]

# Normalize (KNN is sensitive to scale)
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Apply KNN
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data_scaled)

# Replace back in df (undo the scaling first)
df[features] = scaler.inverse_transform(data_imputed)

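KNN fills each missing value with the average of that feature across the `n_neighbors` most similar rows.

It is most useful when features are correlated, so that similar rows tend to have similar values.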
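A tiny worked case makes the neighbor averaging visible. The array below is invented and left unscaled so the arithmetic is easy to follow; with real data, scale first as in the cell above.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Invented data: the second column is roughly 10x the first.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],   # value to fill
    [4.0, 40.0],
    [100.0, 1000.0],
])

# Distances are computed on the observed column only.
# The two closest rows to 3.0 are 2.0 and 4.0, so the fill is mean(20, 40).
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[2])
```

The distant row `[100, 1000]` does not influence the result, which is the main difference from filling with a column-wide mean.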

### **When Not to Impute**

| Scenario                | Action                                                |
| ----------------------- | ----------------------------------------------------- |
| Target variable missing | Drop row                                              |
| MNAR pattern suspected  | Avoid simple imputation                               |
| Business logic needed   | Ask stakeholders (e.g., missing salary = unemployed?) |


### **Adding Missing-Value Flags**

In [None]:
# Create the flag before imputing Age; after fillna() it would be all zeros
df['Age Missing'] = df['Age'].isnull().astype(int)
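A flag is only worth keeping if missingness carries signal. One rough check is to compare group averages by flag, sketched here on invented rows. The `toy` frame and the `Cabin_Missing` column name are hypothetical, and the numbers are not Titanic results.

```python
import numpy as np
import pandas as pd

# Invented rows: cabin is recorded mostly for first-class passengers.
toy = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, "E46", np.nan, np.nan],
    "Pclass": [1, 3, 3, 1, 2, 3],
    "Survived": [1, 0, 0, 1, 1, 0],
})

# 1 = Cabin was missing, 0 = Cabin was recorded.
toy["Cabin_Missing"] = toy["Cabin"].isnull().astype(int)

# Compare average class and survival rate between the two groups.
signal = toy.groupby("Cabin_Missing")[["Pclass", "Survived"]].mean()
print(signal)
```

A clear gap between the two rows suggests the flag may help a model. Similar averages suggest the flag adds little.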