##### Different techniques for handling missing data apply depending on the missingness mechanism and the dataset. This notebook covers three:
##### 1. Complete Case Analysis (CCA) / Listwise Deletion
##### 2. Univariate Imputation
##### 3. Multivariate Imputation

### Complete Case Analysis (CCA) / Listwise Deletion

##### Complete Case Analysis (CCA), also known as listwise deletion, is one of the simplest techniques for handling missing data. 
##### It involves removing all rows that contain at least one missing value. This means that only the fully observed (complete) cases remain in the dataset.

#### When to use CCA?

##### The proportion of missing values is small (a common rule of thumb is under 5% of rows).
##### The data is missing completely at random (MCAR), meaning the missingness depends neither on the observed features nor on the missing values themselves.
##### The dataset is large enough that dropping some rows does not affect the analysis significantly.
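
##### These conditions can be checked before dropping anything. The sketch below uses a small hypothetical DataFrame (`df` and its column names are illustrative, not from this notebook) to report the share of missing values per column and the share of rows that `dropna()` would remove.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0, 175.0, 160.0, 172.0, 168.0, 177.0, 169.0],
    "weight": [70.0, 80.0, 60.0, np.nan, 75.0, 55.0, 68.0, 64.0, 82.0, 66.0],
    "age":    [30, 41, 25, 37, 52, 29, 33, 45, 38, 27],
})

# Share of missing values per column, in percent
missing_pct = df.isnull().mean() * 100

# Share of rows with at least one missing value (what dropna() would remove)
rows_lost_pct = df.isnull().any(axis=1).mean() * 100

print(missing_pct)
print(f"Rows removed by CCA: {rows_lost_pct:.1f}%")
```

##### Note that the row loss can exceed every per-column percentage, because missing values in different columns usually fall on different rows.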

#### Real-World Scenario of When CCA is Used
##### ✅ Example Where CCA is a Good Choice
##### A medical study collects patient information, but 2% of height values are missing randomly. 
##### Since this is a small percentage and does not depend on other features, CCA is safe to use.

##### ❌ Example Where CCA is a Bad Choice
##### A bank dataset has missing values in income and credit score, where missingness is more common for low-income groups. 
##### If we drop those rows, we may bias results by keeping only high-income individuals.
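
##### The effect can be illustrated with a small simulation. This is a sketch on synthetic data, not a real bank dataset: the income distribution, the 60% / 5% missingness rates, and the column names are all assumptions chosen for illustration. Income is kept fully observed here so the complete-case mean can be compared with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical synthetic incomes (log-normal, so right-skewed)
n = 10_000
df = pd.DataFrame({"income": rng.lognormal(mean=10.5, sigma=0.5, size=n)})

# Assumed missingness mechanism: credit_score is missing for 60% of
# below-median incomes but only 5% of above-median incomes
low_income = (df["income"] < df["income"].median()).to_numpy()
p_missing = np.where(low_income, 0.60, 0.05)
df["credit_score"] = rng.normal(650, 50, size=n)
df.loc[rng.random(n) < p_missing, "credit_score"] = np.nan

# Complete case analysis
complete = df.dropna()

print(f"Rows kept:                   {len(complete)} of {len(df)}")
print(f"Mean income, all rows:       {df['income'].mean():,.0f}")
print(f"Mean income, complete cases: {complete['income'].mean():,.0f}")
```

##### Because low-income rows are dropped far more often, the complete cases over-represent high incomes and their mean income is higher than the mean over all rows.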

#### Advantages of CCA
##### ✅ Simple to Implement – Only requires dropna().
##### ✅ No Bias if MCAR – Works well when missingness is random.
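##### ✅ No Artificial Values – Every remaining value was actually observed; under MCAR the complete cases are a random subsample, so the original distributions are preserved on average.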

#### Disadvantages of CCA
##### ❌ Loss of Data – If missing values are common, a large portion of data gets deleted.
##### ❌ Biased Results if MAR or MNAR – If missingness depends on other features or on the missing value itself, the remaining rows are no longer representative of the full dataset.
##### ❌ Not Suitable for Small Datasets – If dataset is small, removing rows can significantly reduce sample size.

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

titanic = pd.DataFrame(sns.load_dataset("titanic"))
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [8]:
titanic.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [10]:
temp_titanic = titanic.dropna()
print(f"Original shape: {titanic.shape}")
print(f"Shape after CCA: {temp_titanic.shape}")

Original shape: (891, 15)
Shape after CCA: (182, 15)


In [12]:
# Calculate percentage of data lost
data_loss = ((titanic.shape[0] - temp_titanic.shape[0]) / titanic.shape[0]) * 100
print(f"Percentage of data lost due to CCA: {data_loss:.2f}%")

Percentage of data lost due to CCA: 79.57%
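
##### A loss this large is typically driven by one sparse column (here `deck`, with 688 missing values). Two common ways to limit the damage are to drop the sparse column before applying CCA, or to require only the columns needed for the analysis. The sketch below uses a hypothetical toy frame with the same pattern rather than the Titanic data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'deck' is mostly missing, 'age' only occasionally
df = pd.DataFrame({
    "age":  [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, 14.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 11.13, 30.07],
    "deck": ["C", np.nan, np.nan, np.nan, "E", np.nan, np.nan, np.nan],
})

# Plain CCA: any missing value removes the row
cca_all = df.dropna()

# Option 1: drop the sparse column, then apply CCA to what is left
cca_no_deck = df.drop(columns=["deck"]).dropna()

# Option 2: keep every column but only require the analysis columns to be present
cca_subset = df.dropna(subset=["age", "fare"])

print(len(df), len(cca_all), len(cca_no_deck), len(cca_subset))
```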


### Univariate Imputation
##### Univariate Imputation is a method to fill missing values based only on one column (feature) at a time, without considering relationships with other features. It is one of the simplest and most commonly used techniques for handling missing values.
##### It is useful when missing values are random and when we want to preserve the dataset size instead of dropping rows using Complete Case Analysis (CCA).

### Types of Univariate Imputation
##### 1️⃣ Mean Imputation
##### Replaces missing values with the mean (average) of the column.
##### Best for roughly symmetric numerical data (e.g., approximately normal).
##### Not ideal for skewed data, because outliers pull the mean away from typical values.
##### 2️⃣ Median Imputation
##### Replaces missing values with the median (middle value).
##### Works well for skewed distributions (e.g., income, house prices).
##### Less affected by outliers than mean imputation.
##### 3️⃣ Mode Imputation
##### Replaces missing values with the most frequent value (mode).
##### Suitable for categorical and ordinal data.
##### Example: Filling missing "Gender" values with "Male" if it is the most common.
##### 4️⃣ Constant (Fixed-Value) Imputation
##### Replaces missing values with a specific value (e.g., 0, "Unknown", -999).
##### Used when missing data represents a special category (e.g., missing salary = "Unemployed").
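
##### The four strategies above map onto the `strategy` parameter of scikit-learn's `SimpleImputer`. The sketch below uses hypothetical toy data (the `train` frame and its columns are illustrative) and prints the fill value each strategy learns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: a skewed numeric column and an ordinal rating
train = pd.DataFrame({
    "income": [30.0, 35.0, np.nan, 40.0, 400.0],
    "rating": [3.0, np.nan, 5.0, 3.0, 4.0],
})

# Each imputer learns its fill value from the data passed to fit()
mean_imp = SimpleImputer(strategy="mean").fit(train[["income"]])
median_imp = SimpleImputer(strategy="median").fit(train[["income"]])
mode_imp = SimpleImputer(strategy="most_frequent").fit(train[["rating"]])
const_imp = SimpleImputer(strategy="constant", fill_value=-999).fit(train[["income"]])

print("mean fill:  ", mean_imp.statistics_[0])    # pulled up by the 400 outlier
print("median fill:", median_imp.statistics_[0])
print("mode fill:  ", mode_imp.statistics_[0])
print("constant:   ", const_imp.transform(train[["income"]]).ravel())
```

##### Fitting on the training data and calling `transform` on new data reuses the same fill value, which avoids recomputing statistics on a test set.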

### Real-World Use Scenarios of Univariate Imputation
##### ✅ Example Where Mean Imputation is a Good Choice
##### A hospital dataset has missing values in patient weight. If weight is roughly normally distributed in this population, replacing missing values with the mean is reasonable.

##### ✅ Example Where Median Imputation is Better
##### A real estate dataset has missing values in house prices. Since prices are skewed (some very expensive houses), using median prevents distortion.

##### ✅ Example Where Mode Imputation is Used
##### A survey dataset has missing values in preferred mode of transport. The most common response (e.g., "Car") is used to fill missing values.

### Advantages of Univariate Imputation
##### ✅ Preserves dataset size – No data is lost (unlike CCA).
##### ✅ Easy to implement – Just one function fillna().
##### ✅ Reasonable baseline – Works well when missingness does not depend on other features.

### Disadvantages of Univariate Imputation
##### ❌ Ignores relationships between features – Only considers one column at a time.
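##### ❌ Can introduce bias – If data is not MCAR (i.e., MAR or MNAR), the filled-in values may distort results; even under MCAR, filling with a single value shrinks the column's variance.
##### ❌ Not suitable for strongly correlated data – Filling with a single value weakens the correlations between the imputed column and other features.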
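
##### Two of these effects are easy to see on synthetic data. The sketch below is illustrative only (the features `x` and `y`, their relationship, and the 40% missing rate are assumptions): it removes values completely at random, mean-imputes them, and compares the spread of `y` and its correlation with `x` before and after.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical correlated features
n = 2_000
x = rng.normal(0, 1, n)
y = 2 * x + rng.normal(0, 1, n)
df = pd.DataFrame({"x": x, "y": y})

# Remove 40% of y completely at random, then mean-impute
mask = rng.random(n) < 0.4
df_missing = df.copy()
df_missing.loc[mask, "y"] = np.nan
df_imputed = df_missing.fillna({"y": df_missing["y"].mean()})

std_before, std_after = df["y"].std(), df_imputed["y"].std()
corr_before = df["x"].corr(df["y"])
corr_after = df_imputed["x"].corr(df_imputed["y"])

print(f"std of y:   original {std_before:.2f}, imputed {std_after:.2f}")
print(f"corr(x, y): original {corr_before:.2f}, imputed {corr_after:.2f}")
```

##### Both numbers drop after imputation, even though the values were removed completely at random.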

In [17]:
# mean imputation
titanic['age'] = titanic['age'].fillna(titanic['age'].mean())
print(titanic.isnull().sum())

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64


In [18]:
print(titanic.sample(5))

     survived  pclass     sex        age  sibsp  parch     fare embarked  \
46          0       3    male  29.699118      1      0  15.5000        Q   
136         1       1  female  19.000000      0      2  26.2833        S   
755         1       2    male   0.670000      1      1  14.5000        S   
667         0       3    male  29.699118      0      0   7.7750        S   
119         0       3  female   2.000000      4      2  31.2750        S   

      class    who  adult_male deck  embark_town alive  alone  
46    Third    man        True  NaN   Queenstown    no  False  
136   First  woman       False    D  Southampton   yes  False  
755  Second  child       False  NaN  Southampton   yes  False  
667   Third    man        True  NaN  Southampton    no   True  
119   Third  child       False  NaN  Southampton    no  False  


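In [20]:
# median imputation
# 'age' in `titanic` was already mean-filled above, so filling it again would be a no-op;
# reload the raw data to show median imputation on the original missing values
titanic_median = sns.load_dataset("titanic")
titanic_median['age'] = titanic_median['age'].fillna(titanic_median['age'].median())
print(titanic_median.head())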

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  


In [22]:
# mode imputation
titanic['embark_town'] = titanic['embark_town'].fillna(titanic['embark_town'].mode()[0])
print(titanic.isnull().sum())

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      0
alive            0
alone            0
dtype: int64


In [32]:
# Constant (Fixed-Value) Imputation
# 'deck' is a pandas Categorical, so filling with a value that is not yet a category raises
# "Cannot setitem on a Categorical with a new category (Missing), set the categories first".
# Register 'Missing' as a category first, then fill.
titanic['deck'] = titanic['deck'].cat.add_categories('Missing')
titanic['deck'] = titanic['deck'].fillna('Missing')
print(titanic.isnull().sum())

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       2
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64


In [33]:
titanic.sample(5)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
684,0,2,male,60.0,1,1,39.0,S,Second,man,True,Missing,Southampton,no,False
627,1,1,female,21.0,0,0,77.9583,S,First,woman,False,Missing,Southampton,yes,True
850,0,3,male,4.0,4,2,31.275,S,Third,child,False,Missing,Southampton,no,False
374,0,3,female,3.0,3,1,21.075,S,Third,child,False,Missing,Southampton,no,False
344,0,2,male,36.0,0,0,13.0,S,Second,man,True,Missing,Southampton,no,True


### Multivariate Imputation
##### Multivariate Imputation is a technique for handling missing data by considering multiple columns (features) simultaneously instead of imputing values independently for each column.
##### Unlike Univariate Imputation, which only looks at a single column, Multivariate Imputation leverages relationships between multiple variables to make more informed imputations.

### Types of Multivariate Imputation
##### 1️⃣ k-Nearest Neighbors (KNN) Imputation
##### Uses k similar rows (neighbors) to estimate missing values.
##### Finds the k closest data points (based on distance) and averages their values.
##### Works well when data has patterns or clusters.
##### 2️⃣ Multivariate Imputation by Chained Equations (MICE)
##### Also called Iterative Imputation.
##### Creates a regression model for each column with missing values and predicts missing values iteratively.
##### Suitable when features have strong relationships.
##### 3️⃣ Regression Imputation
##### Uses linear regression or other models to predict missing values.
##### Best when the missing column has a linear relationship with other features.
##### 4️⃣ Expectation-Maximization (EM) Imputation
##### Uses probabilistic modeling to estimate missing values.
##### Works well when data follows a known distribution.
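
##### To make the KNN description concrete, the sketch below runs scikit-learn's `KNNImputer` on a tiny hypothetical array. With `n_neighbors=2`, each missing entry is replaced by the average of that column over the two nearest rows that have the column observed; distances are computed only over the coordinates both rows share.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical 4x3 array with two missing entries
X = np.array([
    [1.0,    2.0, np.nan],
    [3.0,    4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0,    8.0, 7.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
# Row 0, column 2: nearest rows are 1 and 2, so (3 + 5) / 2 = 4.0
# Row 2, column 0: nearest rows are 1 and 3, so (3 + 8) / 2 = 5.5
```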

### Real-World Use Cases of Multivariate Imputation
##### ✅ Example Where KNN is Useful
##### A customer dataset with missing income values. People with similar spending habits likely have similar incomes.

##### ✅ Example Where MICE is Ideal
##### A medical dataset with missing blood pressure, cholesterol, and age. These values are interrelated and require an iterative approach.

##### ✅ Example Where Regression Works Best
##### A real estate dataset where missing house price values can be predicted using area and number of rooms.

### Advantages of Multivariate Imputation
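##### ✅ Considers relationships between features – Often more accurate than univariate methods when features are correlated.
##### ✅ Works for complex missing data patterns – Handles MAR (Missing At Random), where missingness depends on observed features. MNAR (Missing Not At Random) still requires modeling the missingness mechanism itself.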
##### ✅ More flexible – Different models work for different types of data.

### Disadvantages of Multivariate Imputation
##### ❌ Computationally expensive – Slower than univariate methods.
##### ❌ Requires careful parameter tuning – Incorrect models can introduce bias.
##### ❌ Not always interpretable – KNN and MICE imputation can be hard to explain.


In [36]:
from sklearn.impute import KNNImputer

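# KNN Imputation
# Note: 'age' was already mean-filled above, so these columns contain no NaNs at this
# point and the imputer returns them unchanged (apart from casting to float)
numeric_cols = ['age', 'fare', 'pclass']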

# Initialize KNN Imputer (k=5)
knn_imputer = KNNImputer(n_neighbors=5)

# Apply KNN Imputation
titanic_knn = titanic.copy()
titanic_knn[numeric_cols] = knn_imputer.fit_transform(titanic[numeric_cols])

# Preview the result; pclass prints as float because the imputer returns a float array
print(titanic_knn.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0     3.0    male  22.0      1      0   7.2500        S  Third   
1         1     1.0  female  38.0      1      0  71.2833        C  First   
2         1     3.0  female  26.0      0      0   7.9250        S  Third   
3         1     1.0  female  35.0      1      0  53.1000        S  First   
4         0     3.0    male  35.0      0      0   8.0500        S  Third   

     who  adult_male     deck  embark_town alive  alone  
0    man        True  Missing  Southampton    no  False  
1  woman       False  Missing    Cherbourg   yes  False  
2  woman       False  Missing  Southampton   yes   True  
3  woman       False  Missing  Southampton   yes  False  
4    man        True  Missing  Southampton    no   True  
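
##### `KNNImputer` measures distance on the raw values, so a wide-range column such as `fare` can dominate the neighbor search. One common remedy is to standardize before imputing and invert the scaling afterwards. The sketch below uses a hypothetical toy frame (values and `n_neighbors=3` are illustrative) that still contains missing ages.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two missing ages
df = pd.DataFrame({
    "age":    [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, np.nan, 14.0],
    "fare":   [7.25, 71.28, 7.92, 53.10, 51.86, 21.07, 8.46, 30.07],
    "pclass": [3.0, 1.0, 3.0, 1.0, 1.0, 3.0, 3.0, 2.0],
})
cols = ["age", "fare", "pclass"]

# StandardScaler ignores NaNs when fitting and keeps them when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(df[cols])

# Impute in the standardized space so every column contributes comparably
imputed_scaled = KNNImputer(n_neighbors=3).fit_transform(scaled)

# Map the result back to the original units
df_knn = df.copy()
df_knn[cols] = scaler.inverse_transform(imputed_scaled)

print(df_knn)
```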


In [42]:
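# MICE-style imputation using scikit-learn's IterativeImputer
# Note: 'age' was already mean-filled above, so there is nothing left to impute here;
# the 29.699118 values in the output are that earlier mean, not model predictions
from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
from sklearn.impute import IterativeImputer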

mice_imputer = IterativeImputer()

titanic_mice = titanic.copy()
titanic_mice[['age', 'fare']] = mice_imputer.fit_transform(titanic[['age', 'fare']])

print(titanic_mice.sample(5))

     survived  pclass     sex        age  sibsp  parch     fare embarked  \
270         0       1    male  29.699118      0      0  31.0000        S   
411         0       3    male  29.699118      0      0   6.8583        Q   
415         0       3  female  29.699118      0      0   8.0500        S   
420         0       3    male  29.699118      0      0   7.8958        C   
224         1       1    male  38.000000      1      0  90.0000        S   

     class    who  adult_male     deck  embark_town alive  alone  
270  First    man        True  Missing  Southampton    no   True  
411  Third    man        True  Missing   Queenstown    no   True  
415  Third  woman       False  Missing  Southampton    no   True  
420  Third    man        True  Missing    Cherbourg    no   True  
224  First    man        True  Missing  Southampton   yes  False  
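
##### To see the imputer estimate values from other columns, the input has to contain NaNs. The sketch below runs `IterativeImputer` on hypothetical synthetic data in which `age` loosely depends on `fare` and `pclass`; the data-generating formula, the 20% missing rate, and `max_iter=10` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Hypothetical data: 'age' loosely depends on 'fare' and 'pclass'
n = 500
pclass = rng.integers(1, 4, size=n)
fare = rng.gamma(shape=2.0, scale=40.0 / pclass)
age = 45 - 6 * pclass + 0.05 * fare + rng.normal(0, 8, n)
df = pd.DataFrame({"age": age, "fare": fare, "pclass": pclass.astype(float)})

# Hide 20% of ages completely at random
mask = rng.random(n) < 0.2
df.loc[mask, "age"] = np.nan

# Each column with NaNs is regressed on the others, repeated for up to max_iter rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
df_mice = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_mice.loc[mask, "age"].describe())
```

##### Unlike mean imputation, the filled-in ages differ from row to row because they are predicted from each row's `fare` and `pclass`. This produces a single imputed dataset; classical MICE generates several and pools the results.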


In [48]:
from sklearn.linear_model import LinearRegression

titanic_reg = pd.DataFrame(sns.load_dataset("titanic"))
train_data = titanic_reg.dropna(subset=['age'])  # Drop rows where 'age' is missing
test_data = titanic_reg[titanic_reg['age'].isnull()]  # Separate missing age values

# Train regression model
reg_model = LinearRegression()
reg_model.fit(train_data[['fare', 'pclass']], train_data['age'])

# Predict missing 'age' values
titanic_reg.loc[titanic_reg['age'].isnull(), 'age'] = reg_model.predict(test_data[['fare', 'pclass']])
print(titanic_reg.isnull().sum())

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
