# 📊 Part 9: Pure Pandas Outlier Detection and Treatment

**Goal:** Master statistical and context-aware outlier detection using **100% native Pandas methods** (IQR and percentile), and implement robust outlier treatment strategies (capping, transformation, flagging).

---
### Key Learning Objectives
1.  Implement the **IQR method** using `series.quantile()` and boolean masking.
2.  Perform **Percentile-based** outlier detection.
3.  Design **Context-Aware** outlier analysis using `groupby()`.
4.  Apply treatment strategies: `.clip()` (Capping), `.apply()` (Transformation), and boolean-mask flagging.

In [1]:
import pandas as pd

# Set pandas display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

print("=== PURE PANDAS OUTLIER DETECTION ===")
print("\n🔧 Environment setup complete! (100% Pandas)")

# Load and prepare our dataset with all enhancements from this week
titanic_url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
titanic_df = pd.read_csv(titanic_url)

# Apply all prior enhancements (Imputation, Dtypes, Feature Engineering)
print("🔧 Applying Week 11 enhancements using pandas-only...")

# Missing value fixes (pandas-only)
age_by_group = titanic_df.groupby(['Pclass', 'Sex'])['Age'].transform('median')
titanic_df['Age'] = titanic_df['Age'].fillna(age_by_group)
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])

# Data type optimizations (pandas-only)
titanic_df['Pclass'] = titanic_df['Pclass'].astype('category')
titanic_df['Sex'] = titanic_df['Sex'].astype('category')
titanic_df['Embarked'] = titanic_df['Embarked'].astype('category')
titanic_df['Survived'] = titanic_df['Survived'].astype('bool')

# Key feature engineering (pandas-only)
titanic_df['Family_Size'] = titanic_df['SibSp'] + titanic_df['Parch'] + 1
titanic_df['Fare_Per_Person'] = titanic_df['Fare'] / titanic_df['Family_Size']

print("✅ Dataset enhanced with prior improvements using pandas-only!")
print(f"Dataset shape: {titanic_df.shape}")

# Focus on numerical columns for outlier detection
numerical_cols = ['Age', 'Fare', 'SibSp', 'Parch', 'Family_Size', 'Fare_Per_Person']
print(f"\n📊 Numerical columns for outlier analysis: {len(numerical_cols)}")

print(f"\n📋 Basic Statistical Summary:")
print(titanic_df[numerical_cols].describe())

=== PURE PANDAS OUTLIER DETECTION ===

🔧 Environment setup complete! (100% Pandas)
🔧 Applying Week 11 enhancements using pandas-only...
✅ Dataset enhanced with prior improvements using pandas-only!
Dataset shape: (891, 14)

📊 Numerical columns for outlier analysis: 6

📋 Basic Statistical Summary:
          Age    Fare   SibSp   Parch  Family_Size  Fare_Per_Person
count  891.00  891.00  891.00  891.00       891.00           891.00
mean    29.11   32.20    0.52    0.38         1.90            19.92
std     13.30   49.69    1.10    0.81         1.61            35.84
min      0.42    0.00    0.00    0.00         1.00             0.00
25%     21.50    7.91    0.00    0.00         1.00             7.25
50%     26.00   14.45    0.00    0.00         1.00             8.30
75%     36.00   31.00    1.00    0.00         2.00            23.67
max     80.00  512.33    8.00    6.00        11.00           512.33


## 2. IQR (Interquartile Range) Outlier Detection

The **IQR method** is a standard statistical technique for outlier detection. With $IQR = Q3 - Q1$, a point is flagged as an outlier if it lies more than $1.5 \times IQR$ above the 75th percentile (Q3) or more than $1.5 \times IQR$ below the 25th percentile (Q1).

🎯 **Pure pandas Syntax:** `.quantile(0.25)` and boolean masking.
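
Before applying the rule to the Titanic data, it may help to trace it on a handful of numbers. The sketch below uses a hypothetical toy Series (`toy` and its values are made up for illustration) and relies on the default linear interpolation of `Series.quantile()`.

```python
import pandas as pd

# Hypothetical toy data: seven similar values and one extreme value
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

# Quartiles and IQR via pandas quantile (default linear interpolation)
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Boolean mask: True where a value falls outside the fences
toy_mask = (toy < lower_fence) | (toy > upper_fence)
toy_outliers = toy[toy_mask]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {toy_outliers.tolist()}")
```

Only the value 100 lies beyond the upper fence here; nothing falls below the lower fence.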

In [2]:
def detect_outliers_iqr_pandas_only(series, multiplier=1.5):
    """Detect outliers using IQR method with pure pandas"""
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    
    # Calculate bounds using pandas arithmetic
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR
    
    # Create outlier mask using pandas boolean operations
    outlier_mask = (series < lower_bound) | (series > upper_bound)
    
    return {
        'Q1': Q1, 'Q3': Q3, 'IQR': IQR, 
        'lower_bound': lower_bound, 'upper_bound': upper_bound,
        'outlier_mask': outlier_mask, 
        'outlier_count': outlier_mask.sum(),
        'outlier_percentage': (outlier_mask.sum() / len(series)) * 100,
    }

print("\n1. Age Outlier Analysis using pandas:")
age_outliers = detect_outliers_iqr_pandas_only(titanic_df['Age'])

print("📊 Age Outlier Statistics:")
for key, value in age_outliers.items():
    if key != 'outlier_mask': 
        print(f"  {key}: {value:.2f}" if isinstance(value, (int, float)) else f"  {key}: {value}")

# Show age outliers using pandas boolean indexing
age_outlier_data = titanic_df[age_outliers['outlier_mask']]
print(f"\n🔍 Age Outliers ({len(age_outlier_data)} passengers):")
print(age_outlier_data[['Name', 'Age', 'Pclass', 'Survived']].head(5))


print("\n3. Comprehensive Outlier Analysis using pandas:")
outlier_summary = []

for col in numerical_cols:
    outliers = detect_outliers_iqr_pandas_only(titanic_df[col])
    outlier_summary.append({
        'Column': col,
        'Outlier_Count': outliers['outlier_count'],
        'Outlier_Percentage': round(outliers['outlier_percentage'], 2),
        'Lower_Bound': round(outliers['lower_bound'], 3),
        'Upper_Bound': round(outliers['upper_bound'], 3)
    })

outlier_df = pd.DataFrame(outlier_summary)
print("📋 Outlier Summary Across All Numerical Columns:")
print(outlier_df)


1. Age Outlier Analysis using pandas:
📊 Age Outlier Statistics:
  Q1: 21.50
  Q3: 36.00
  IQR: 14.50
  lower_bound: -0.25
  upper_bound: 57.75
  outlier_count: 33
  outlier_percentage: 3.70

🔍 Age Outliers (33 passengers):
                              Name   Age Pclass  Survived
11        Bonnell, Miss. Elizabeth  58.0      1      True
33           Wheadon, Mr. Edward H  66.0      2     False
54  Ostby, Mr. Engelhart Cornelius  65.0      1     False
94               Coxon, Mr. Daniel  59.0      3     False
96       Goldschmidt, Mr. George B  71.0      1     False

3. Comprehensive Outlier Analysis using pandas:
📋 Outlier Summary Across All Numerical Columns:
            Column  Outlier_Count  Outlier_Percentage  Lower_Bound  \
0              Age             33                3.70        -0.25   
1             Fare            116               13.02       -26.72   
2            SibSp             46                5.16        -1.50   
3            Parch            213               23.

## 3. Percentile-based Outlier Detection

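This method uses percentile thresholds (e.g., the 5th and 95th percentiles) instead of IQR fences. It is easy to tune and makes no assumption about the shape of the distribution, which is why it is often used on unknown or highly skewed data. The trade-off is that it flags a roughly fixed share of the rows by construction (about 10% for 5th/95th), whether or not those values are truly extreme.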
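
To see how the two methods differ in behavior, the following sketch runs both on a hypothetical, evenly spaced Series (`uniform` is invented for illustration and contains no extreme values).

```python
import pandas as pd

# Hypothetical, evenly spaced data with no extreme values: 1, 2, ..., 100
uniform = pd.Series(range(1, 101))

# Percentile thresholds at the 5th and 95th percentiles
low = uniform.quantile(0.05)
high = uniform.quantile(0.95)
pct_mask = (uniform < low) | (uniform > high)

# IQR fences on the same data, for comparison
q1 = uniform.quantile(0.25)
q3 = uniform.quantile(0.75)
iqr = q3 - q1
iqr_mask = (uniform < q1 - 1.5 * iqr) | (uniform > q3 + 1.5 * iqr)

print(f"Percentile method flags: {pct_mask.sum()} values")
print(f"IQR method flags: {iqr_mask.sum()} values")
```

The percentile rule still marks the five smallest and five largest values, while the IQR fences lie outside the data range and flag nothing. This is one reason to treat percentile flags as "most extreme rows" rather than as evidence of anomalies.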

In [3]:
def detect_outliers_percentile_pandas_only(series, lower_percentile=5, upper_percentile=95):
    """Detect outliers using percentile method with pure pandas"""
    # Calculate percentile thresholds using pandas quantile
    lower_threshold = series.quantile(lower_percentile / 100)
    upper_threshold = series.quantile(upper_percentile / 100)
    
    # Create outlier mask using pandas boolean operations
    outlier_mask = (series < lower_threshold) | (series > upper_threshold)
    
    return {
        'lower_threshold': lower_threshold, 'upper_threshold': upper_threshold,
        'outlier_mask': outlier_mask, 
        'outlier_count': outlier_mask.sum(),
        'outlier_percentage': (outlier_mask.sum() / len(series)) * 100,
        'method': f'{lower_percentile}th/{upper_percentile}th percentile'
    }

print("\n1. Fare Analysis with Different Percentile Thresholds:")

percentile_methods = [(1, 99), (5, 95), (10, 90)] # Test different thresholds

fare_percentile_results = []
for lower, upper in percentile_methods:
    result = detect_outliers_percentile_pandas_only(titanic_df['Fare'], lower, upper)
    fare_percentile_results.append({
        'Method': f'{lower}th/{upper}th percentile',
        'Lower_Threshold': round(result['lower_threshold'], 2),
        'Upper_Threshold': round(result['upper_threshold'], 2), 
        'Outliers': result['outlier_count'],
        'Percentage': round(result['outlier_percentage'], 2)
    })

percentile_comparison = pd.DataFrame(fare_percentile_results)
print("📊 Fare Outliers by Different Percentile Methods:")
print(percentile_comparison)


1. Fare Analysis with Different Percentile Thresholds:
📊 Fare Outliers by Different Percentile Methods:
                 Method  Lower_Threshold  Upper_Threshold  Outliers  \
0   1th/99th percentile             0.00           249.01         9   
1   5th/95th percentile             7.22           112.08        88   
2  10th/90th percentile             7.55            77.96       175   

   Percentage  
0        1.01  
1        9.88  
2       19.64  


## 4. Context-Aware Group-Based Outlier Detection

**Smart outlier detection** recognizes that context matters: a high fare can be normal for 1st Class yet an outlier for 3rd Class. We use `groupby()` to calculate IQR bounds *within* each subgroup, so each value is compared only against its own group.

🎯 **Pure pandas Syntax:** `groupby()` combined with per-group `.quantile()`.
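
As a small illustration of per-group bounds, the sketch below builds a row-level mask with `groupby().transform()`, which broadcasts each group's quartiles back to its rows. This is an alternative formulation to a group-by-group loop, shown on a hypothetical toy DataFrame (`toy_df` and its fares are made up).

```python
import pandas as pd

# Hypothetical fares: two classes with very different price levels
toy_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    'Fare': [80.0, 90.0, 100.0, 110.0, 120.0, 7.0, 8.0, 8.0, 9.0, 60.0],
})

# Per-group quartiles, broadcast back to every row of the group
grouped = toy_df.groupby('Pclass')['Fare']
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Row-level mask: each fare is compared with the fences of its own class
toy_df['Group_Outlier'] = (
    (toy_df['Fare'] < q1 - 1.5 * iqr) | (toy_df['Fare'] > q3 + 1.5 * iqr)
)

print(toy_df[toy_df['Group_Outlier']])
```

In this toy data the fare of 60 is flagged only because it is compared with other 3rd-class fares; against the pooled data it would sit well inside the fences.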

In [6]:

def detect_outliers_by_group_pandas_only(df, value_col, group_cols, multiplier=1.5):
    """Detect outliers within groups using pandas-only methods"""
    outlier_results = []
    
    # observed=True iterates only over category combinations that occur in the data
    # (it also avoids the FutureWarning raised for categorical group keys)
    for group_values, group_data in df.groupby(group_cols, observed=True):
        series = group_data[value_col]
        
        # IQR method within group using pandas quantile
        Q1 = series.quantile(0.25)
        Q3 = series.quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - multiplier * IQR
        upper_bound = Q3 + multiplier * IQR
        outlier_mask = (series < lower_bound) | (series > upper_bound)
        
        outliers_in_group = outlier_mask.sum() if len(series) > 0 else 0
        
        group_info = {}
        for i, col in enumerate(group_cols):
            # Handle single-level grouping (group_values is not a tuple)
            group_info[col] = group_values if not isinstance(group_values, tuple) else group_values[i]
        
        group_info.update({
            'Count': len(series),
            'Mean': series.mean() if len(series) > 0 else 0,
            'Outliers': outliers_in_group,
            'Outlier_Pct': (outliers_in_group / len(series) * 100) if len(series) > 0 else 0,
            # Group maximum, used for the insights printed below
            'Max': series.max() if len(series) > 0 else 0 
        })
        
        outlier_results.append(group_info)
    
    return pd.DataFrame(outlier_results).round(3)

print("\n1. Fare Outliers by Passenger Class using pandas:")
fare_by_class = detect_outliers_by_group_pandas_only(titanic_df, 'Fare', ['Pclass'])
print("📊 Fare Analysis by Class (Context-Aware):")
print(fare_by_class)

# Format the values separately so the 'N/A' fallback does not raise a ValueError

class_1_data = fare_by_class[fare_by_class['Pclass'] == 1]
class_3_data = fare_by_class[fare_by_class['Pclass'] == 3]

# Extract values, ensure defensive access
class_1_max = class_1_data['Max'].iloc[0] if not class_1_data.empty and 'Max' in class_1_data.columns else 'N/A'
class_3_max = class_3_data['Max'].iloc[0] if not class_3_data.empty and 'Max' in class_3_data.columns else 'N/A'

# Helper function to format the output safely
def format_fare(value):
    return f"${value:.2f}" if isinstance(value, (int, float)) else str(value)

print(f"\n🔍 Key Insights:")
print(f"• 1st class max fare: {format_fare(class_1_max)}")
print(f"• 3rd class max fare: {format_fare(class_3_max)}")
print("• Context is essential: an outlier in 3rd class might be normal in 1st.")


print("\n2. Age Outliers by Gender and Class using pandas:")
age_by_sex_class = detect_outliers_by_group_pandas_only(titanic_df, 'Age', ['Sex', 'Pclass'])
print("📊 Age Analysis by Gender and Class:")
print(age_by_sex_class)


1. Fare Outliers by Passenger Class using pandas:
📊 Fare Analysis by Class (Context-Aware):
   Pclass  Count   Mean  Outliers  Outlier_Pct     Max
0       1    216  84.16        20         9.26  512.33
1       2    184  20.66         7         3.80   73.50
2       3    491  13.68        52        10.59   69.55

🔍 Key Insights:
• 1st class max fare: $512.33
• 3rd class max fare: $69.55
• Context is essential: an outlier in 3rd class might be normal in 1st.

2. Age Outliers by Gender and Class using pandas:
📊 Age Analysis by Gender and Class:
      Sex  Pclass  Count   Mean  Outliers  Outlier_Pct   Max
0  female       1     94  34.65         0         0.00  63.0
1  female       2     76  28.70         2         2.63  57.0
2  female       3    144  21.68        27        18.75  63.0
3    male       1    122  41.06         3         2.46  80.0
4    male       2    108  30.68        14        12.96  70.0
5    male       3    347  26.10        37        10.66  74.0


## 5. Outlier Treatment Strategies

After detection, we decide how to handle the outliers.

* **Removal:** Delete the outlier rows (use with caution).
* **Capping (Winsorization):** Limit values to a set threshold (e.g., 5th and 95th percentiles) using `series.clip()`.
* **Transformation:** Apply a mathematical function (e.g., $\sqrt{x}$) to reduce skewness and lessen the outlier's influence.
* **Flagging:** Keep the data but mark it with a binary feature.
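
The four strategies above can be compared side by side on a few numbers. The sketch below uses a hypothetical toy DataFrame (`toy` and its column names are invented), and it caps at the IQR fences, which is an assumption made for this example; percentile limits work the same way with `clip()`.

```python
import pandas as pd

# Hypothetical toy data: one extreme value among otherwise similar ones
toy = pd.DataFrame({'value': [10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 100.0]})

# IQR fences used to decide which rows count as outliers
q1 = toy['value'].quantile(0.25)
q3 = toy['value'].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (toy['value'] < lower_fence) | (toy['value'] > upper_fence)

# Removal: keep only the rows that are not flagged
removed = toy[~is_outlier]

# Capping: pull values beyond a fence back to that fence
toy['value_capped'] = toy['value'].clip(lower=lower_fence, upper=upper_fence)

# Transformation: square root compresses large values more than small ones
toy['value_sqrt'] = toy['value'] ** 0.5

# Flagging: keep every row and record the flag as a 0/1 column
toy['value_outlier'] = is_outlier.astype(int)

print(f"Rows after removal: {len(removed)} of {len(toy)}")
print(toy)
```

Removal shrinks the dataset, capping and transformation keep every row but change the values, and flagging leaves the values untouched and adds information instead.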

In [7]:
print("\n📊 6. Outlier Treatment Strategies with pandas-Only")

# Create a copy for treatment experiments
titanic_treatment = titanic_df.copy()

print("\n1. Capping Strategy (Winsorization) using pandas:")

def cap_outliers_pandas_only(series, lower_percentile=5, upper_percentile=95):
    """Cap outliers using percentile limits with pandas-only methods"""
    lower_cap = series.quantile(lower_percentile / 100)
    upper_cap = series.quantile(upper_percentile / 100)
    
    # Cap using pandas clip method
    capped_series = series.clip(lower=lower_cap, upper=upper_cap)
    
    return capped_series, lower_cap, upper_cap

# Cap fare outliers
fare_capped, fare_lower_cap, fare_upper_cap = cap_outliers_pandas_only(titanic_treatment['Fare'])
titanic_treatment['Fare_Capped'] = fare_capped

print(f"Fare capping results:")
print(f"  Lower cap ({fare_lower_cap:.2f}) | Upper cap ({fare_upper_cap:.2f})")
print(f"  Original Max: ${titanic_treatment['Fare'].max():.2f} → Capped Max: ${fare_capped.max():.2f}")


print("\n2. Transformation Strategy using pandas:")
# Square root transformation (reduces skewness)
titanic_treatment['Fare_Sqrt'] = titanic_treatment['Fare'].apply(lambda x: x**0.5)

# Check skewness reduction using pandas skew method
original_skew = titanic_treatment['Fare'].skew()
sqrt_skew = titanic_treatment['Fare_Sqrt'].skew()

print(f"📐 Skewness comparison (closer to 0 is better):")
print(f"  Original Skew: {original_skew:.3f}")
print(f"  Square Root Skew: {sqrt_skew:.3f}")


print("\n3. Flagging Strategy using pandas:")
# Create comprehensive outlier flags using pandas operations
fare_outliers = detect_outliers_iqr_pandas_only(titanic_treatment['Fare'])['outlier_mask']
age_outliers = detect_outliers_iqr_pandas_only(titanic_treatment['Age'])['outlier_mask']

titanic_treatment['Fare_IQR_Outlier'] = fare_outliers.astype(int)
titanic_treatment['Age_IQR_Outlier'] = age_outliers.astype(int)

# Multi-dimensional outlier flag
titanic_treatment['Total_Outlier_Flags'] = titanic_treatment['Fare_IQR_Outlier'] + titanic_treatment['Age_IQR_Outlier']

# Only two flags are summed, so the maximum possible value is 2
both_flagged = (titanic_treatment['Total_Outlier_Flags'] >= 2).sum()
print(f"Passengers with 2+ outlier flags: {both_flagged}")

print("✅ All outlier treatment strategies demonstrated using pandas-only!")


📊 6. Outlier Treatment Strategies with pandas-Only

1. Capping Strategy (Winsorization) using pandas:
Fare capping results:
  Lower cap (7.22) | Upper cap (112.08)
  Original Max: $512.33 → Capped Max: $112.08

2. Transformation Strategy using pandas:
📐 Skewness comparison (closer to 0 is better):
  Original Skew: 4.787
  Square Root Skew: 2.085

3. Flagging Strategy using pandas:
Passengers with 2+ outlier flags: 9
✅ All outlier treatment strategies demonstrated using pandas-only!


In [9]:
print("\n" + "="*60)
print("📚 FINAL SUMMARY: Outlier Detection and Treatment Audit")
print("="*60)

# 1. Evaluate Skewness
treatment_comparison = []
treatment_methods = {
    'Original': titanic_treatment['Fare'],
    'Capped': titanic_treatment['Fare_Capped'],
    'Sqrt_Transform': titanic_treatment['Fare_Sqrt']
}

for method_name, method_data in treatment_methods.items():
    treatment_comparison.append({
        'Method': method_name,
        'Mean': method_data.mean(),
        'Std': method_data.std(),
        'Skewness': method_data.skew()
    })

comparison_df = pd.DataFrame(treatment_comparison).round(3)
print("📊 Treatment Method Comparison:")
print(comparison_df)

# 2. Survival Impact Analysis
outlier_survival_analysis = []
for col in ['Age', 'Fare']:
    outliers = detect_outliers_iqr_pandas_only(titanic_treatment[col])
    outlier_survival = titanic_treatment[outliers['outlier_mask']]['Survived'].mean()
    normal_survival = titanic_treatment[~outliers['outlier_mask']]['Survived'].mean()
    
    outlier_survival_analysis.append({
        'Column': col,
        'Outlier_Survival_Rate': round(outlier_survival, 3),
        'Normal_Survival_Rate': round(normal_survival, 3),
        'Difference': round(outlier_survival - normal_survival, 3),
    })

survival_analysis_df = pd.DataFrame(outlier_survival_analysis)
print("\n📊 Survival Rate Impact: Outliers vs Normal:")
print(survival_analysis_df)

print(f"\n🏆 RECOMMENDED STRATEGY:")
print("• For **Modeling**: Square root transformation (reduces skewness while keeping every fare's rank order)")
print("• For **Reporting**: Capping (lowest skewness in this comparison; values stay in original fare units)")
print("• For **Exploration**: Flagging (preserves all data points)")

print("\n💡 KEY LEARNINGS (pandas-Only):")
print("1. Detection is not enough; context is key.")
print("2. Use `series.clip()` for effective capping (Winsorization).")
print("3. Survival rates can differ sharply for outliers (fare outliers survived about twice as often).")

print("\n✓ Session 9 completed! Outlier analysis and treatment mastered. 🐼")


📚 FINAL SUMMARY: Outlier Detection and Treatment Audit
📊 Treatment Method Comparison:
           Method   Mean    Std  Skewness
0        Original  32.20  49.69      4.79
1          Capped  27.86  29.11      1.73
2  Sqrt_Transform   4.85   2.95      2.08

📊 Survival Rate Impact: Outliers vs Normal:
  Column  Outlier_Survival_Rate  Normal_Survival_Rate  Difference
0    Age                   0.30                  0.39       -0.08
1   Fare                   0.68                  0.34        0.34

🏆 RECOMMENDED STRATEGY:
• For **Modeling**: Square root transformation (reduces skewness while keeping every fare's rank order)
• For **Reporting**: Capping (lowest skewness in this comparison; values stay in original fare units)
• For **Exploration**: Flagging (preserves all data points)

💡 KEY LEARNINGS (pandas-Only):
1. Detection is not enough; context is key.
2. Use `series.clip()` for effective capping (Winsorization).
3. Survival rates can differ sharply for outliers (fare outliers survived about twice as often).

✓ Session 9 completed! Outlier analysis and treatment ma