# 🧪 Part 6: Advanced Missing Data Strategies (Transform, Fill, Map)

**Goal:** To master advanced, conditional techniques for handling missing data, moving beyond simple global median imputation to methods that preserve the underlying statistical relationships in the data.

---
### Key Learning Objectives
1.  Analyze **missing data patterns** by group (e.g., missing Age by Passenger Class).
2.  Implement **Group-Based Imputation** using the powerful `.transform()` method.
3.  Use time-series-style imputation: **Forward Fill (`ffill`)** and **Backward Fill (`bfill`)**.
4.  Apply custom, **conditional filling logic** using `.map()` and `.apply()`.

In [1]:
import pandas as pd
import numpy as np

# Set pandas options for better output visualization
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("=== ADVANCED MISSING VALUE STRATEGIES ===")

# Load the Titanic dataset: try the local file first, then fall back to the public URL
try:
    titanic_df = pd.read_csv('titanic_data.csv')
    print("✅ Loaded titanic_data.csv")
except FileNotFoundError:
    url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
    titanic_df = pd.read_csv(url)
    print("📥 Loaded fresh Titanic data from web")


print("✅ Dataset loaded successfully!")
print(f"Shape: {titanic_df.shape[0]} rows × {titanic_df.shape[1]} columns")


# First look at the data
print("\n📋 First 5 rows:")
print(titanic_df.head())


print("\n📋 Column info:")
titanic_df.info()

=== ADVANCED MISSING VALUE STRATEGIES ===
📥 Loaded fresh Titanic data from web
✅ Dataset loaded successfully!
Shape: 891 rows × 12 columns

📋 First 5 rows:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket   Fare Cabin Embarked  
0      0         A/5 21171   7.25   NaN        S  
1      0          PC 17599  71.28   C85        C  
2      0  S

## 2. Missing Data Pattern Analysis

The first step is to quantify the missing data and determine whether the missingness is **random** or **conditional** on other variables. A conditional pattern (e.g., `Age` missing far more often for 3rd Class than for 2nd Class) suggests imputing within groups rather than with a single global statistic.

In [2]:
print("\n📊 2. Missing Data Pattern Analysis")


print("1. Missing Values Count and Percentage:")
missing_summary = titanic_df.isnull().sum()
missing_percentage = (titanic_df.isnull().sum() / len(titanic_df) * 100).round(2)

missing_analysis = pd.DataFrame({
    'Missing_Count': missing_summary,
    'Missing_Percentage': missing_percentage
}).sort_values('Missing_Percentage', ascending=False)


print(missing_analysis[missing_analysis['Missing_Count'] > 0])


print("\n2. Rows with Missing Values Analysis:")
# Count rows with any missing values
rows_with_missing = titanic_df.isnull().any(axis=1).sum()
rows_complete = len(titanic_df) - rows_with_missing


print(f"Rows with missing values: {rows_with_missing} ({rows_with_missing/len(titanic_df)*100:.1f}%)")
print(f"Complete rows: {rows_complete} ({rows_complete/len(titanic_df)*100:.1f}%)")


📊 2. Missing Data Pattern Analysis
1. Missing Values Count and Percentage:
          Missing_Count  Missing_Percentage
Cabin               687               77.10
Age                 177               19.87
Embarked              2                0.22

2. Rows with Missing Values Analysis:
Rows with missing values: 708 (79.5%)
Complete rows: 183 (20.5%)
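
The count-and-percentage summary above can be wrapped in a small reusable function. The following is a minimal sketch on a toy DataFrame; the helper name `summarize_missing` and the toy values are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd


def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column (hypothetical helper)."""
    mask = df.isna()
    summary = pd.DataFrame({
        "Missing_Count": mask.sum(),
        # The mean of a boolean mask is the fraction of missing cells
        "Missing_Percentage": (mask.mean() * 100).round(2),
    })
    # Keep only columns that actually have gaps, worst first
    return summary[summary["Missing_Count"] > 0].sort_values(
        "Missing_Count", ascending=False
    )


# Toy frame standing in for titanic_df (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
toy_summary = summarize_missing(toy)
print(toy_summary)
```

Calling `summarize_missing(titanic_df)` should give the same table as the cell above, since it uses the same `isna()` counts.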


## 3. Missing Data Patterns by Passenger Class

We use **`groupby()`** to see whether the proportion of missing values differs across categories (`Pclass`). If it does, a single global median ignores that structure and can bias the imputed values.

In [3]:
print("\n📊 3. Missing Data Patterns by Passenger Class")
print("Missing data by Passenger Class:")

# Define columns to check for missingness patterns
cols_to_check = ['Age', 'Cabin', 'Embarked']

for col in cols_to_check:
    # 🚨 Robust Check: Ensure column exists before accessing it
    if col not in titanic_df.columns:
        print(f"\n⚠️ Warning: Column '{col}' not found in DataFrame. Skipping pattern analysis.")
        continue

    if titanic_df[col].isnull().sum() > 0:
        print(f"\n{col} missing by class:")
        
        missing_by_class = titanic_df.groupby('Pclass')[col].apply(lambda x: x.isnull().sum())
        missing_pct_by_class = titanic_df.groupby('Pclass')[col].apply(lambda x: x.isnull().mean() * 100)
        
        class_analysis = pd.DataFrame({
            'Missing_Count': missing_by_class,
            'Missing_Percentage': missing_pct_by_class.round(1)
        })
        print(class_analysis)


print("\n🔍 Key Insight: Different missing patterns by passenger class reveal potential data collection biases!")


📊 3. Missing Data Patterns by Passenger Class
Missing data by Passenger Class:

Age missing by class:
        Missing_Count  Missing_Percentage
Pclass                                   
1                  30                13.9
2                  11                 6.0
3                 136                27.7

Cabin missing by class:
        Missing_Count  Missing_Percentage
Pclass                                   
1                  40                18.5
2                 168                91.3
3                 479                97.6

Embarked missing by class:
        Missing_Count  Missing_Percentage
Pclass                                   
1                   2                 0.9
2                   0                 0.0
3                   0                 0.0

🔍 Key Insight: Different missing patterns by passenger class reveal potential data collection biases!
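
The per-column loop above can also be written as one vectorized expression: build the boolean missing mask once, then group it by class. This is a sketch on a toy DataFrame with made-up values; the column names mirror the Titanic data, but the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for titanic_df (illustrative values)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Age": [38.0, np.nan, 30.0, 27.0, np.nan, np.nan, 22.0, np.nan],
    "Cabin": ["C85", "E46", np.nan, np.nan, np.nan, np.nan, np.nan, "G6"],
})

# True where a cell is missing; grouping the mask by class and taking
# the mean gives the fraction missing per class and column
missing_mask = toy[["Age", "Cabin"]].isna()
missing_pct_by_class = (missing_mask.groupby(toy["Pclass"]).mean() * 100).round(1)
print(missing_pct_by_class)
```

The result has one row per class and one column per checked feature, which makes differences between classes easy to compare side by side.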


## 4. Group-Based Imputation using `.transform()`

The **`.transform()`** method is the key to **conditional imputation**. It calculates a group-specific statistic (like the median) and returns a Series of the *same length* as the original DataFrame, aligned by index. This allows us to fill missing values with the median of their specific subgroup (e.g., the median age of 1st Class women).

🎯 **New Syntax:** `df.groupby(['col1', 'col2'])['target'].transform('median')`

In [4]:
print("\n📊 4. Group-Based Imputation using .transform()")

# Strategy: Fill missing ages with median age by passenger class and gender
print("\nStep 1: Analyzing age patterns by class and gender")
age_stats = titanic_df.groupby(['Pclass', 'Sex'])['Age'].agg(['count', 'mean', 'median']).round(1)
print("Age statistics by Pclass and Sex:")
print(age_stats)


print("\nStep 2: Applying group-based imputation")
# Create group-based median ages
age_by_group = titanic_df.groupby(['Pclass', 'Sex'])['Age'].transform('median')


# Show what transform does
print("\nBefore imputation (sample missing):")
sample_missing = titanic_df[titanic_df['Age'].isnull()].head(3)[['PassengerId', 'Pclass', 'Sex', 'Age']]
print(sample_missing)


# Apply the imputation
titanic_df['Age_group_filled'] = titanic_df['Age'].fillna(age_by_group)


print("\n✅ Imputation complete!")
print(f"Missing ages before: {titanic_df['Age'].isnull().sum()}")
print(f"Missing ages after: {titanic_df['Age_group_filled'].isnull().sum()}")


📊 4. Group-Based Imputation using .transform()

Step 1: Analyzing age patterns by class and gender
Age statistics by Pclass and Sex:
               count  mean  median
Pclass Sex                        
1      female     85  34.6    35.0
       male      101  41.3    40.0
2      female     74  28.7    28.0
       male       99  30.7    30.0
3      female    102  21.8    21.5
       male      253  26.5    25.0

Step 2: Applying group-based imputation

Before imputation (sample missing):
    PassengerId  Pclass     Sex  Age
5             6       3    male  NaN
17           18       2    male  NaN
19           20       3  female  NaN

✅ Imputation complete!
Missing ages before: 177
Missing ages after: 0
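
To make the difference between aggregating and transforming concrete, the sketch below compares the two on a toy DataFrame (the values are made up for illustration). An aggregation returns one value per group, while `.transform()` broadcasts that value back to every row, which is what lets `fillna()` align it with the original column.

```python
import numpy as np
import pandas as pd

# Toy data: two classes, one missing age in each (illustrative values)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 36.0, 20.0, 24.0, np.nan],
})

grouped = toy.groupby("Pclass")["Age"]
per_group = grouped.median()            # one value per group (length 2)
per_row = grouped.transform("median")   # one value per original row (length 6)

# fillna aligns on the index, so each gap receives its own group's median
toy["Age_filled"] = toy["Age"].fillna(per_row)
print(toy)
```

If a group contains no observed values at all, its median is NaN and the gap survives, so it is worth checking `isnull().sum()` afterwards, as the cell above does.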


## 5. Forward Fill and Backward Fill Methods

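**Forward Fill (`ffill`)** and **Backward Fill (`bfill`)** fill missing values based on adjacent non-null values. They are most appropriate when the row order matters (as in time-series or other sequential data). Titanic passengers have no meaningful row order, so the example below sorts by `PassengerId` only to demonstrate the mechanics.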

In [10]:

print("\n📊 5. Forward Fill and Backward Fill Methods")

print("🎯 New Syntax: df['column'].ffill() and df['column'].bfill()")
print("These methods fill missing values based on adjacent non-null values")


# Work with Embarked column (only 2 missing values)
print(f"\nEmbarked column missing values: {titanic_df['Embarked'].isnull().sum()}")


# Sort by PassengerId to simulate ordered data
titanic_sorted = titanic_df.sort_values('PassengerId').copy()
embarked_missing_idx = titanic_sorted[titanic_sorted['Embarked'].isnull()].index


# Apply forward fill: carry the previous non-null value down
titanic_sorted['Embarked_ffill'] = titanic_sorted['Embarked'].ffill()


# Apply backward fill: pull the next non-null value up
titanic_sorted['Embarked_bfill'] = titanic_sorted['Embarked'].bfill()


print("\n📋 Forward/Backward Fill Results:")
fill_comparison = pd.DataFrame({
    'Original_Missing': titanic_sorted['Embarked'].isnull().sum(),
    'After_Forward_Fill': titanic_sorted['Embarked_ffill'].isnull().sum(),
    'After_Backward_Fill': titanic_sorted['Embarked_bfill'].isnull().sum()
}, index=[0])
print(fill_comparison)


# Show what happened to the missing values
print("\nWhat values were filled:")
for idx in embarked_missing_idx:
    original = titanic_sorted.loc[idx, 'Embarked']
    ffill = titanic_sorted.loc[idx, 'Embarked_ffill'] 
    bfill = titanic_sorted.loc[idx, 'Embarked_bfill']
    pid = titanic_sorted.loc[idx, 'PassengerId']
    print(f"PassengerId {pid}: Original={original}, FFill={ffill}, BFill={bfill}")


📊 5. Forward Fill and Backward Fill Methods
🎯 New Syntax: df['column'].ffill() and df['column'].bfill()
These methods fill missing values based on adjacent non-null values

Embarked column missing values: 2

📋 Forward/Backward Fill Results:
   Original_Missing  After_Forward_Fill  After_Backward_Fill
0                 2                   0                    0

What values were filled:
PassengerId 62: Original=nan, FFill=C, BFill=S
PassengerId 830: Original=nan, FFill=Q, BFill=C
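
Forward and backward fill are easier to interpret on data that really is ordered. The sketch below uses a made-up daily temperature series (an assumption for illustration, not part of the Titanic data) and also shows the `limit` parameter, which caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a three-day gap
readings = pd.Series(
    [10.0, np.nan, np.nan, np.nan, 14.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
    name="temperature",
)

filled_forward = readings.ffill()          # carry the last observation forward
filled_backward = readings.bfill()         # pull the next observation backward
filled_limited = readings.ffill(limit=1)   # fill at most one consecutive gap

print(pd.DataFrame({
    "original": readings,
    "ffill": filled_forward,
    "bfill": filled_backward,
    "ffill_limit_1": filled_limited,
}))
```

Setting `limit` is a cautious default for long gaps, since a value carried across many rows becomes less and less representative.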


## 6. Conditional Filling Strategies

For categorical features like `Embarked` (port of embarkation), we fill missing values with the most common port within a related group (`Pclass`). Mapping `Pclass` through a dictionary of group modes with `.map()` is vectorized, and it is generally faster and more idiomatic in pandas than a row-wise `.apply()`.

In [8]:
print("\n📊 6. Conditional Filling Strategies")


# Strategy: Fill missing Embarked based on most common port for each class
print("\nStep 1: Find most common embarkation port by passenger class")
embarked_mode_by_class = titanic_df.groupby('Pclass')['Embarked'].apply(
    lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else 'S'
)
print("Most common Embarked port by class:")
print(embarked_mode_by_class)


# Vectorized approach (efficient): using pandas map()
class_to_embarked = embarked_mode_by_class.to_dict()
print(f"\nClass to Embarked mapping: {class_to_embarked}")

# Fill missing values using the map() result
titanic_df['Embarked_filled_map'] = titanic_df['Embarked'].fillna(
    titanic_df['Pclass'].map(class_to_embarked)
)


# Row-wise alternative (less efficient): using apply() with a custom function
def fill_embarked_by_class(row):
    if pd.isna(row['Embarked']):
        return embarked_mode_by_class[row['Pclass']]
    return row['Embarked']

titanic_df['Embarked_filled_apply'] = titanic_df.apply(fill_embarked_by_class, axis=1)

# Verify both methods give same result
methods_match = (titanic_df['Embarked_filled_map'] == titanic_df['Embarked_filled_apply']).all()
print(f"\n✅ Both filling methods match: {methods_match}")
print(f"Missing Embarked values after conditional filling: {titanic_df['Embarked_filled_map'].isnull().sum()}")


📊 6. Conditional Filling Strategies

Step 1: Find most common embarkation port by passenger class
Most common Embarked port by class:
Pclass
1    S
2    S
3    S
Name: Embarked, dtype: object

Class to Embarked mapping: {1: 'S', 2: 'S', 3: 'S'}

✅ Both filling methods match: True
Missing Embarked values after conditional filling: 0
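
Because every class in the Titanic data has the same modal port (`S`), the output above does not show the fill varying by group. The sketch below repeats the map-based fill on a toy DataFrame whose classes have different modes; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: class 1 mostly embarks at "C", class 3 mostly at "S"
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 3],
    "Embarked": ["C", "C", np.nan, "S", "S", "Q", np.nan],
})

# Most frequent port per class; mode() ignores NaN and may return
# several values on a tie, so take the first one
mode_by_class = toy.groupby("Pclass")["Embarked"].agg(lambda s: s.mode().iloc[0])

# Look up each row's class in the mapping, then use it only where
# Embarked is missing
fallback = toy["Pclass"].map(mode_by_class.to_dict())
toy["Embarked_filled"] = toy["Embarked"].fillna(fallback)
print(toy)
```

Note that `.map()` returns NaN for any key absent from the dictionary, so a class missing from the mapping would leave its gaps unfilled. Checking `isnull().sum()` after the fill catches that case.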


In [9]:
print("\n" + "="*60)
print("📚 SUMMARY: Advanced Missing Value Strategies")
print("="*60)

print("\n✅ SKILLS MASTERED TODAY:")
print("1. Missing data pattern analysis with pandas")
print("2. Group-based imputation using .groupby().transform()")
print("3. Forward fill and backward fill methods")
print("4. Conditional filling with .map() and .fillna()")

print("\n🎯 NEW PANDAS SYNTAX LEARNED:")
print("• df.groupby(['col1', 'col2'])['target'].transform('median')")
print("• df['column'].ffill()  # Forward fill (use .bfill() for backward fill)")
print("• df['col'].map(dictionary)  # Value mapping for filling")

print("\n🔥 POWER TECHNIQUE OF THE DAY:")
print("GROUP-BASED IMPUTATION with .transform()")
print("→ Fills missing values with group-specific statistics")

print("\n" + "✓ Session 6 completed! Sophisticated data cleaning mastered." + "\n")


📚 SUMMARY: Advanced Missing Value Strategies

✅ SKILLS MASTERED TODAY:
1. Missing data pattern analysis with pandas
2. Group-based imputation using .groupby().transform()
3. Forward fill and backward fill methods
4. Conditional filling with .map() and .fillna()

🎯 NEW PANDAS SYNTAX LEARNED:
• df.groupby(['col1', 'col2'])['target'].transform('median')
• df['column'].ffill()  # Forward fill (use .bfill() for backward fill)
• df['col'].map(dictionary)  # Value mapping for filling

🔥 POWER TECHNIQUE OF THE DAY:
GROUP-BASED IMPUTATION with .transform()
→ Fills missing values with group-specific statistics

✓ Session 6 completed! Sophisticated data cleaning mastered.

