## Types of Missing Data

#### Missing Completely at Random (MCAR)
- The missingness is purely accidental and unrelated to any data, observed or missing (e.g., a lab sample was dropped).
#### Missing at Random (MAR)
- The missingness depends on other observed data but not on the missing value itself (e.g., men being less likely to answer a survey question about emotions).
#### Missing Not at Random (MNAR)
- The missingness is related to the value of the missing data itself (e.g., high-income individuals being less likely to report their income).
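
The three mechanisms can be simulated to see how they differ. The sketch below is illustrative only: the `demo` frame, its `sex` and `income` columns, and all the probabilities are made-up assumptions, not part of the Titanic data used later.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical survey data (assumed values, for illustration only)
demo = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every row has the same 20% chance of losing its income value
mcar_mask = rng.random(n) < 0.20

# MAR: the chance depends on an observed column (sex), not on income itself
p_mar = np.where(demo["sex"] == "male", 0.35, 0.05)
mar_mask = rng.random(n) < p_mar

# MNAR: the chance depends on the value that goes missing (high incomes)
p_mnar = np.where(demo["income"] > 60_000, 0.50, 0.05)
mnar_mask = rng.random(n) < p_mnar

true_mean = demo["income"].mean()
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = demo.loc[~mask, "income"].mean()
    print(f"{name}: true mean = {true_mean:,.0f}, observed mean = {observed_mean:,.0f}")
```

Under MNAR the mean of the values that remain is visibly lower than the true mean, because high incomes are the ones that disappear. In this particular simulation `sex` and `income` are generated independently, so the MAR case does not bias the mean; with real data, where the observed column is usually related to the missing one, it can.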

In [21]:
import seaborn as sns
import pandas as pd
import numpy as np

In [4]:
df = sns.load_dataset('titanic')

In [6]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [7]:
# check missing values
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [8]:
# shape before dropping anything: (rows, columns)
df.shape

(891, 15)

In [9]:
# dropna() removes every row that has at least one missing value
df.dropna().shape

(182, 15)
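
Dropping every incomplete row can discard most of the data. A less drastic option is to restrict which columns or how many missing values trigger a drop. The sketch below uses a small hypothetical frame named `people`; its column names mirror the Titanic data, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset (assumed values, for illustration only)
people = pd.DataFrame({
    "age":  [22, np.nan, 26, 35, np.nan],
    "deck": [np.nan, "C", np.nan, "C", np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

# Keep only rows with no missing values at all
all_complete = people.dropna()

# Only require 'age' to be present; missing 'deck' is tolerated
age_known = people.dropna(subset=["age"])

# Keep rows that have at least 2 non-missing values
mostly_full = people.dropna(thresh=2)

print(all_complete.shape, age_known.shape, mostly_full.shape)
```

`subset` and `thresh` are both parameters of `DataFrame.dropna`; which one fits depends on which columns the later analysis actually needs.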

In [10]:
# axis=1 removes every column that has at least one missing value
df.dropna(axis=1)

Unnamed: 0,survived,pclass,sex,sibsp,parch,fare,class,who,adult_male,alive,alone
0,0,3,male,1,0,7.2500,Third,man,True,no,False
1,1,1,female,1,0,71.2833,First,woman,False,yes,False
2,1,3,female,0,0,7.9250,Third,woman,False,yes,True
3,1,1,female,1,0,53.1000,First,woman,False,yes,False
4,0,3,male,0,0,8.0500,Third,man,True,no,True
...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,0,0,13.0000,Second,man,True,no,True
887,1,1,female,0,0,30.0000,First,woman,False,yes,True
888,0,3,female,1,2,23.4500,Third,woman,False,no,False
889,1,1,male,0,0,30.0000,First,man,True,yes,True


### Imputing missing values
- A statistical technique used in data preprocessing to handle missing data by replacing absent values with estimated, predicted, or aggregated proxies.

- Mean, Median, or Mode Imputation: Replaces missing values with the average (mean), middle value (median), or most frequent value (mode) of the observed data.

## Mean value imputation
- Numeric columns only
- Works well when the data is roughly symmetric (e.g., normally distributed), so the mean is a representative value

In [None]:
# Simple column fill
df['Age_mean'] = df['age'].fillna(df['age'].mean())

In [17]:
df[['Age_mean', 'age']]

Unnamed: 0,Age_mean,age
0,22.000000,22.0
1,38.000000,38.0
2,26.000000,26.0
3,35.000000,35.0
4,35.000000,35.0
...,...,...
886,27.000000,27.0
887,19.000000,19.0
888,29.699118,
889,26.000000,26.0


In [28]:
sf = pd.DataFrame({
    "Age":    [22, np.nan, 25, np.nan, 30, np.nan, 28, np.nan, 35, np.nan],
    "Salary": [30000, 32000, np.nan, 40000, np.nan, 28000, np.nan, 35000, np.nan, 50000],
    "Score":  [85, np.nan, np.nan, 90, 88, np.nan, 75, 80, np.nan, np.nan],
    "City":   ["Delhi", np.nan, "Chennai", "Mumbai", np.nan, "Delhi", np.nan, "Mumbai", "Chennai", np.nan]
})

In [29]:
# filling only selected columns
cols = ['Age', 'Score']

sf[cols] = sf[cols].fillna(sf[cols].mean())

In [30]:
sf[['Age', 'Score']]

Unnamed: 0,Age,Score
0,22.0,85.0
1,28.0,83.6
2,25.0,83.6
3,28.0,90.0
4,30.0,88.0
5,28.0,83.6
6,28.0,75.0
7,28.0,80.0
8,35.0,83.6
9,28.0,83.6


In [32]:
# fill all numeric columns at once; non-numeric columns such as City are left untouched
sf = sf.fillna(sf.mean(numeric_only=True))
sf.head()

Unnamed: 0,Age,Salary,Score,City
0,22.0,30000.0,85.0,Delhi
1,28.0,32000.0,83.6,
2,25.0,35833.333333,83.6,Chennai
3,28.0,40000.0,90.0,Mumbai
4,30.0,35833.333333,88.0,


In [38]:
# Mean imputation with loc
df1 = pd.DataFrame({
    "Height": [170, np.nan, 160, 180, 158, 161, 159, np.nan],
    "Weight": [65, 70, np.nan, 80, np.nan, 44, 59, np.nan]
})


df1.loc[:, 'Height'] = df1['Height'].fillna(df1['Height'].mean())
df1.loc[:, 'Weight'] = df1['Weight'].fillna(df1['Weight'].mean())
df1

Unnamed: 0,Height,Weight
0,170.0,65.0
1,164.666667,70.0
2,160.0,63.6
3,180.0,80.0
4,158.0,63.6
5,161.0,44.0
6,159.0,59.0
7,164.666667,63.6


In [None]:
# Group-wise Mean Imputation
df2 = pd.DataFrame({
    "Dept": ["IT", "IT", "HR", "HR", "IT", "HR"],
    "Salary": [50000, np.nan, 30000, np.nan, 52000, 31000]
})

df2["Salary"] = df2["Salary"].fillna(df2.groupby("Dept")["Salary"].transform("mean"))
# there are 3 IT rows and 3 HR rows; each missing salary is filled with the mean of its own department
df2

Unnamed: 0,Dept,Salary
0,IT,50000.0
1,IT,51000.0
2,HR,30000.0
3,HR,30500.0
4,IT,52000.0
5,HR,31000.0


In [6]:
# Mean imputation using sklearn
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

df3 = pd.DataFrame({
    "Age": [20, np.nan, 25, 30],
    "Income": [20000, 25000, np.nan, 40000]
})

imputer = SimpleImputer(strategy='mean')
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
df_imputed = pd.DataFrame(imputer.fit_transform(df3), columns=df3.columns)

df_imputed

Unnamed: 0,Age,Income
0,20.0,20000.0
1,25.0,25000.0
2,25.0,28333.333333
3,30.0,40000.0


In [11]:
# Mean Imputation + Missing Indicator

df4 = pd.DataFrame({
    "Age": [20, np.nan, 25, 30, np.nan, 17],
    "Income": [20000, 25000, np.nan, 40000, 58000, np.nan]
})

# flag the missing rows first, before filling, so the information is not lost
df4['Age_missing'] = df4["Age"].isna().astype(int)
df4['Age'] = df4["Age"].fillna(df4['Age'].mean())
df4

Unnamed: 0,Age,Income,Age_missing
0,20.0,20000.0,0
1,23.0,25000.0,1
2,25.0,,0
3,30.0,40000.0,0
4,23.0,58000.0,1
5,17.0,,0


## Median value imputation
- Numeric columns only
- Preferred when the data contains outliers or is skewed, because the median is not pulled toward extreme values
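
A minimal sketch of median imputation, following the same `fillna` pattern as the mean examples. The `pay` frame and its values are hypothetical, chosen so that one extreme salary makes the difference between mean and median obvious.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one extreme outlier (500000)
pay = pd.DataFrame({
    "Salary": [30000, 32000, np.nan, 35000, 500000, np.nan, 31000]
})

mean_value = pay["Salary"].mean()      # pulled upward by the outlier
median_value = pay["Salary"].median()  # middle of the observed values

# Fill into new columns so both results can be compared side by side
pay["Salary_mean"] = pay["Salary"].fillna(mean_value)
pay["Salary_median"] = pay["Salary"].fillna(median_value)

print(mean_value, median_value)
print(pay)
```

Here the mean of the observed salaries is 125600, far from what most rows look like, while the median is 32000. The same result should be obtainable with `SimpleImputer(strategy='median')` from scikit-learn.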

## Mode imputation technique
- Categorical or numeric columns (mostly used for categorical)
- Useful for numeric columns only when values repeat (e.g., discrete counts or ratings)
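
A minimal sketch of mode imputation on a categorical column. The `towns` frame is hypothetical and reuses city names from the earlier `sf` example.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
towns = pd.DataFrame({
    "City": ["Delhi", np.nan, "Chennai", "Mumbai", "Delhi", np.nan]
})

# mode() ignores NaN and returns a Series, because ties are possible;
# [0] picks the first of the most frequent values
most_common = towns["City"].mode()[0]

towns["City"] = towns["City"].fillna(most_common)
print(towns)
```

When several values are tied for most frequent, `mode()` returns all of them in sorted order, so `[0]` is an arbitrary but reproducible choice. For a scikit-learn pipeline, `SimpleImputer(strategy='most_frequent')` plays the same role.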