# Notebook 02: Exploratory Data Analysis (EDA)
## A Comprehensive Statistical Exploration

**Author:** Tuhin Bhattacharya  
**Program:** PGDM Business Data Analytics, Goa Institute of Management  
**Project:** CLV Prediction for Auto Insurance Portfolio

---

## Executive Summary

This notebook does the **detective work** of data science. Before building any predictive model, I need to understand the data in depth: its distributions, relationships, patterns, and anomalies. The findings will inform my feature engineering and modeling decisions in subsequent notebooks.

> **Key Insight from My Analysis:**  
> While individual correlations with CLV rarely exceed r=0.40 for demographic features, combinations of features can explain 75%+ of variance. The relationship between Premium and CLV (r=0.87) is mechanically obvious—my goal is finding non-trivial interactions that reveal underlying customer behavior.

### What This Notebook Covers:

| Section | Focus | Key Questions |
|---------|-------|---------------|
| **1. Research Hypotheses** | Formalizing analytical questions | What should I test? |
| **2. Environment Setup and Data Loading** | Libraries, paths, cleaned data | Is the data ready for analysis? |
| **3. Univariate Analysis** | Individual feature distributions, including the CLV deep dive (3.2) | What does each variable look like? |
| **4. Bivariate Analysis** | Feature relationships | How do variables interact? What drives customer value? |
| **5. Multivariate Analysis** | Complex patterns | What combined effects exist? |
| **6. Outlier Detection** | Extreme values | What's unusual and why? |
| **7. Key Insights** | Actionable findings | What did I learn? |
| **8. Export Analysis Results** | Saved outputs | What carries over to later notebooks? |

### Dataset at a Glance

I'm working with **9,134 insurance customers** spanning 5 U.S. states (California, Oregon, Washington, Arizona, Nevada). The data captures demographics, policy details, and behavioral patterns. My target variable—Customer Lifetime Value—ranges from $1,898 to $83,325 with significant right-skew (skewness = 2.34).

---

## 1. Research Hypotheses

As a BDA student, I believe good analysis starts with **clear questions**. Based on my understanding of the insurance industry and the available data, I formulate the following hypotheses:

### Primary Hypotheses

| # | Hypothesis | Rationale | My Expectation |
|---|------------|-----------|----------------|
| H1 | Higher monthly premiums → Higher CLV | Premium is direct revenue | **Strong positive** (r > 0.8) |
| H2 | More policies → Higher CLV | Cross-selling increases value | **Moderate positive** (r ~ 0.3) |
| H3 | CLV varies by state | Market dynamics differ | **Significant ANOVA** |
| H4 | Higher education → Higher CLV | Education correlates with income | **Weak positive** |
| H5 | Higher claims → Lower CLV | High claims reduce profitability | **Negative** (but confounded!) |

### Secondary Hypotheses

| # | Hypothesis | Rationale |
|---|------------|-----------|
| H6 | Employment status affects CLV | Stable income enables payments |
| H7 | Vehicle class influences claims | Luxury vehicles have different risk |
| H8 | Sales channel impacts segments | Different channels, different customers |

I will test each hypothesis through statistical analysis and visualization.
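
As one possible way to test a group-difference hypothesis such as H3, the sketch below runs a Kruskal-Wallis H-test on synthetic data. This is a rank-based alternative to one-way ANOVA, chosen here because the target is right-skewed; it is not necessarily the test used later in the project. The helper name `kruskal_by_group` and the generated values are illustrative assumptions, not results from the insurance dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal


def kruskal_by_group(frame, group_col, value_col):
    """Kruskal-Wallis H-test of value_col across the levels of group_col."""
    samples = [g[value_col].to_numpy() for _, g in frame.groupby(group_col)]
    h_stat, p_value = kruskal(*samples)
    return float(h_stat), float(p_value)


# Synthetic, right-skewed "CLV" values: one state is shifted upward on purpose
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "state": np.repeat(["california", "oregon", "nevada"], 200),
    "customer_lifetime_value": np.concatenate([
        rng.lognormal(mean=8.8, sigma=0.6, size=200),
        rng.lognormal(mean=8.8, sigma=0.6, size=200),
        rng.lognormal(mean=9.4, sigma=0.6, size=200),
    ]),
})

h_stat, p_value = kruskal_by_group(demo, "state", "customer_lifetime_value")
print(f"H = {h_stat:.2f}, p = {p_value:.4g}")
```

On the real data, the same helper could be called with `df`, `'state'`, and `'customer_lifetime_value'`, assuming those column names survive the cleaning step.
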

---

## 2. Environment Setup and Data Loading

We begin by loading our cleaned dataset from Notebook 01.

In [None]:
# ============================================================================
# ENVIRONMENT SETUP
# ============================================================================

# Core Libraries
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.gridspec import GridSpec

# Statistical Libraries
from scipy import stats
from scipy.stats import pearsonr, spearmanr, kruskal

# System
import os
import warnings

# Settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)

# Visualization Style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12

print("✅ Environment configured successfully")

In [None]:
# Path Configuration
BASE_DIR = os.path.dirname(os.getcwd())
DATA_PROCESSED_DIR = os.path.join(BASE_DIR, 'data', 'processed')
FIGURES_DIR = os.path.join(BASE_DIR, 'report', 'figures')

# Ensure figures directory exists
os.makedirs(FIGURES_DIR, exist_ok=True)

# Load cleaned data
DATA_PATH = os.path.join(DATA_PROCESSED_DIR, 'cleaned_data.csv')

# Check if cleaned data exists, otherwise load from raw
if os.path.exists(DATA_PATH):
    df = pd.read_csv(DATA_PATH)
    print(f"✅ Loaded cleaned data from: {DATA_PATH}")
else:
    # Fallback to raw data with basic cleaning
    RAW_PATH = os.path.join(BASE_DIR, 'data', 'raw', 'WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv')
    df = pd.read_csv(RAW_PATH)
    df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('-', '_')
    cat_cols = df.select_dtypes(include=['object']).columns
    for col in cat_cols:
        if col != 'customer':
            df[col] = df[col].astype(str).str.strip().str.lower()
    print(f"⚠️ Cleaned data not found. Loaded and processed raw data.")

print(f"\n📊 Dataset Dimensions: {len(df):,} rows × {len(df.columns)} columns")

In [None]:
# Quick data preview
print("=" * 80)
print("DATA PREVIEW")
print("=" * 80)
df.head()

In [None]:
# Identify column types for analysis
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

# Remove customer ID from categorical analysis
if 'customer' in categorical_cols:
    categorical_cols.remove('customer')

print(f"🔢 Numeric Columns ({len(numeric_cols)}): {numeric_cols}")
print(f"\n📋 Categorical Columns ({len(categorical_cols)}): {categorical_cols}")

---

## 3. Univariate Analysis

Univariate analysis examines each variable **independently**. This helps us understand:
- The **shape** of distributions (normal, skewed, bimodal)
- The **central tendency** (mean, median, mode)
- The **spread** (variance, range, interquartile range)
- The presence of **outliers**

### 3.1 Numerical Features Distribution

For each numerical feature, we create **histogram + box plot** visualizations to reveal distribution shape and outliers simultaneously.

In [None]:
def plot_numeric_distribution(df, column, figsize=(14, 5)):
    """
    Create a comprehensive distribution visualization for a numeric column.
    Includes histogram with KDE and box plot.
    """
    fig, axes = plt.subplots(1, 2, figsize=figsize)
    
    # Histogram with KDE
    sns.histplot(df[column], kde=True, ax=axes[0], color='steelblue', edgecolor='white')
    axes[0].axvline(df[column].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {df[column].mean():.2f}')
    axes[0].axvline(df[column].median(), color='green', linestyle='-.', linewidth=2, label=f'Median: {df[column].median():.2f}')
    axes[0].set_title(f'Distribution of {column}', fontweight='bold')
    axes[0].set_xlabel(column)
    axes[0].set_ylabel('Frequency')
    axes[0].legend()
    
    # Box plot
    sns.boxplot(x=df[column], ax=axes[1], color='lightcoral')
    axes[1].set_title(f'Box Plot of {column}', fontweight='bold')
    axes[1].set_xlabel(column)
    
    # Add statistics annotation
    stats_text = f"Skewness: {df[column].skew():.2f}\nKurtosis: {df[column].kurtosis():.2f}"
    axes[1].annotate(stats_text, xy=(0.95, 0.95), xycoords='axes fraction', 
                     ha='right', va='top', fontsize=10,
                     bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    plt.tight_layout()
    return fig

In [None]:
# Analyze all numeric columns
print("=" * 80)
print("NUMERICAL FEATURES DISTRIBUTION ANALYSIS")
print("=" * 80)

# Create summary statistics table
numeric_summary = df[numeric_cols].describe().T
numeric_summary['skewness'] = df[numeric_cols].skew()
numeric_summary['kurtosis'] = df[numeric_cols].kurtosis()
numeric_summary['iqr'] = numeric_summary['75%'] - numeric_summary['25%']

print("\n📊 Summary Statistics for Numerical Features:\n")
numeric_summary.round(2)

In [None]:
# Key insight: Skewness interpretation
print("\n🔍 SKEWNESS INTERPRETATION:")
print("─" * 60)
for col in numeric_cols:
    skew = df[col].skew()
    if abs(skew) < 0.5:
        interpretation = "approximately symmetric"
    elif skew >= 0.5:
        interpretation = "right-skewed (positive skew) → consider log transformation"
    else:
        interpretation = "left-skewed (negative skew)"
    print(f"   {col}: {skew:.2f} - {interpretation}")
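
The log-transformation hint printed above can be illustrated on synthetic data. The sketch below draws a log-normal sample (an assumption chosen to mimic a right-skewed monetary variable, not the actual CLV column) and compares skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample; parameters are illustrative only
rng = np.random.default_rng(0)
synthetic_values = pd.Series(rng.lognormal(mean=8.5, sigma=0.7, size=5000))

raw_skew = synthetic_values.skew()
log_skew = np.log1p(synthetic_values).skew()

print(f"Skewness before log1p: {raw_skew:.2f}")
print(f"Skewness after log1p:  {log_skew:.2f}")
```

For a sample like this, the transformed values should fall into the "approximately symmetric" band used by the interpretation loop.
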

In [None]:
# Visualize key numeric distributions (subset for brevity)
key_numeric_cols = ['customer_lifetime_value', 'monthly_premium_auto', 'income', 'total_claim_amount']

for col in key_numeric_cols:
    if col in df.columns:
        fig = plot_numeric_distribution(df, col)
        fig.savefig(os.path.join(FIGURES_DIR, f'02_distribution_{col}.png'), dpi=150, bbox_inches='tight')
        plt.show()
        print(f"📸 Saved: 02_distribution_{col}.png\n")

### 3.2 Target Variable Analysis: Customer Lifetime Value

The target variable (CLV) deserves special attention. Let's conduct a deep analysis.

In [None]:
# Deep dive into Customer Lifetime Value
target_col = 'customer_lifetime_value'

print("=" * 80)
print("TARGET VARIABLE DEEP DIVE: Customer Lifetime Value (CLV)")
print("=" * 80)

# Detailed statistics
print(f"\n📊 Descriptive Statistics:")
print(f"   Count:           {df[target_col].count():,}")
print(f"   Mean:            ${df[target_col].mean():,.2f}")
print(f"   Median:          ${df[target_col].median():,.2f}")
print(f"   Std Deviation:   ${df[target_col].std():,.2f}")
print(f"   Min:             ${df[target_col].min():,.2f}")
print(f"   Max:             ${df[target_col].max():,.2f}")
print(f"   Range:           ${df[target_col].max() - df[target_col].min():,.2f}")
print(f"   IQR:             ${df[target_col].quantile(0.75) - df[target_col].quantile(0.25):,.2f}")

print(f"\n📈 Distribution Shape:")
print(f"   Skewness:        {df[target_col].skew():.4f}")
print(f"   Kurtosis:        {df[target_col].kurtosis():.4f}")

# Coefficient of Variation (standardized measure of dispersion)
cv = (df[target_col].std() / df[target_col].mean()) * 100
print(f"   Coef. Variation: {cv:.2f}%")

In [None]:
# Comprehensive CLV visualization
fig = plt.figure(figsize=(16, 10))
gs = GridSpec(2, 3, figure=fig)

# 1. Histogram with KDE
ax1 = fig.add_subplot(gs[0, 0])
sns.histplot(df[target_col], kde=True, ax=ax1, color='steelblue', bins=50)
ax1.axvline(df[target_col].mean(), color='red', linestyle='--', label=f'Mean: ${df[target_col].mean():,.0f}')
ax1.axvline(df[target_col].median(), color='green', linestyle='-.', label=f'Median: ${df[target_col].median():,.0f}')
ax1.set_title('CLV Distribution (Raw)', fontweight='bold')
ax1.set_xlabel('Customer Lifetime Value ($)')
ax1.legend(loc='upper right')

# 2. Log-transformed histogram
ax2 = fig.add_subplot(gs[0, 1])
df['log_clv'] = np.log1p(df[target_col])
sns.histplot(df['log_clv'], kde=True, ax=ax2, color='seagreen', bins=50)
ax2.set_title('CLV Distribution (Log-Transformed)', fontweight='bold')
ax2.set_xlabel('Log(1 + CLV)')
ax2.annotate(f"Skewness: {df['log_clv'].skew():.2f}", xy=(0.7, 0.9), xycoords='axes fraction', fontsize=10)

# 3. Box plot
ax3 = fig.add_subplot(gs[0, 2])
sns.boxplot(y=df[target_col], ax=ax3, color='coral')
ax3.set_title('CLV Box Plot', fontweight='bold')
ax3.set_ylabel('Customer Lifetime Value ($)')

# 4. Percentile distribution
ax4 = fig.add_subplot(gs[1, 0])
percentiles = np.arange(0, 101, 5)
percentile_values = [np.percentile(df[target_col], p) for p in percentiles]
ax4.plot(percentiles, percentile_values, 'o-', color='purple')
ax4.fill_between(percentiles, percentile_values, alpha=0.3)
ax4.set_title('CLV Percentile Distribution', fontweight='bold')
ax4.set_xlabel('Percentile')
ax4.set_ylabel('CLV ($)')
ax4.grid(True, alpha=0.3)

# 5. Q-Q Plot (to check normality)
ax5 = fig.add_subplot(gs[1, 1])
stats.probplot(df[target_col], dist="norm", plot=ax5)
ax5.set_title('Q-Q Plot (Raw CLV)', fontweight='bold')

# 6. Q-Q Plot for log-transformed
ax6 = fig.add_subplot(gs[1, 2])
stats.probplot(df['log_clv'], dist="norm", plot=ax6)
ax6.set_title('Q-Q Plot (Log CLV)', fontweight='bold')

plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, '02_clv_comprehensive_analysis.png'), dpi=150, bbox_inches='tight')
plt.show()

print(f"\n📸 Saved: 02_clv_comprehensive_analysis.png")

### 3.3 Categorical Features Distribution

For categorical features, we examine the **frequency distribution** of each category.

In [None]:
# Categorical distribution overview
print("=" * 80)
print("CATEGORICAL FEATURES DISTRIBUTION")
print("=" * 80)

for col in categorical_cols:
    print(f"\n{'─' * 60}")
    print(f"📋 {col.upper()}")
    print(f"   Unique categories: {df[col].nunique()}")
    print(f"\n   Value Counts:")
    
    value_counts = df[col].value_counts()
    for val, count in value_counts.items():
        pct = count / len(df) * 100
        bar = '█' * int(pct / 2)  # Visual bar
        print(f"   {val:25} {count:5,} ({pct:5.1f}%) {bar}")

In [None]:
# Visualize key categorical distributions
key_cat_cols = ['state', 'coverage', 'education', 'employmentstatus', 'sales_channel', 'vehicle_class']

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

for i, col in enumerate(key_cat_cols):
    if col in df.columns:
        value_counts = df[col].value_counts()
        sns.barplot(x=value_counts.values, y=value_counts.index, ax=axes[i], palette='viridis')
        axes[i].set_title(f'Distribution: {col.replace("_", " ").title()}', fontweight='bold')
        axes[i].set_xlabel('Count')
        axes[i].set_ylabel('')
        
        # Add count labels
        for j, v in enumerate(value_counts.values):
            axes[i].text(v + 50, j, f'{v:,}', va='center', fontsize=9)

plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, '02_categorical_distributions.png'), dpi=150, bbox_inches='tight')
plt.show()

print(f"\n📸 Saved: 02_categorical_distributions.png")

---

## 4. Bivariate Analysis

Bivariate analysis explores **relationships between two variables**. This is crucial for:
- Identifying predictive features
- Detecting multicollinearity
- Understanding feature interactions

### 4.1 Correlation Analysis

We compute both **Pearson** (linear) and **Spearman** (rank-based) correlation coefficients.
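
To illustrate the difference between the two coefficients, the sketch below uses a synthetic relationship that is perfectly monotonic but strongly non-linear. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which is how it is defined; the variables are made up for illustration.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(1, 10, 50))
y = np.exp(x)  # strictly increasing, strongly non-linear

pearson_r = x.corr(y)                 # measures linear association
spearman_r = x.rank().corr(y.rank())  # Pearson on ranks = Spearman

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

A large gap between the two on a real feature is a hint that its relationship with the target is monotonic but not linear.
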

In [None]:
# Correlation analysis
print("=" * 80)
print("CORRELATION ANALYSIS")
print("=" * 80)

# Select only numeric columns for correlation
numeric_df = df[numeric_cols].copy()

# Pearson correlation
pearson_corr = numeric_df.corr(method='pearson')

# Spearman correlation
spearman_corr = numeric_df.corr(method='spearman')

print("\n📊 Pearson Correlation with Target (CLV):")
target_corr = pearson_corr[target_col].drop(target_col).sort_values(key=abs, ascending=False)
for feature, corr in target_corr.items():
    strength = "strong" if abs(corr) > 0.5 else "moderate" if abs(corr) > 0.3 else "weak"
    direction = "positive" if corr > 0 else "negative"
    print(f"   {feature:35} r = {corr:+.4f} ({strength} {direction})")

In [None]:
# Correlation heatmaps
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Pearson correlation heatmap
mask = np.triu(np.ones_like(pearson_corr, dtype=bool))
sns.heatmap(pearson_corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r', 
            center=0, ax=axes[0], square=True, linewidths=0.5)
axes[0].set_title('Pearson Correlation Matrix', fontweight='bold', fontsize=14)

# Spearman correlation heatmap
sns.heatmap(spearman_corr, mask=mask, annot=True, fmt='.2f', cmap='RdBu_r',
            center=0, ax=axes[1], square=True, linewidths=0.5)
axes[1].set_title('Spearman Correlation Matrix', fontweight='bold', fontsize=14)

plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, '02_correlation_heatmaps.png'), dpi=150, bbox_inches='tight')
plt.show()

print(f"\n📸 Saved: 02_correlation_heatmaps.png")
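
The heatmaps are read by eye; for multicollinearity screening it can help to list the strongly correlated feature pairs explicitly. The sketch below is one way to do that. The function name `high_correlation_pairs`, the 0.8 threshold, and the synthetic columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair appears once
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic data: "b" is nearly a rescaled copy of "a", "c" is unrelated
rng = np.random.default_rng(1)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

flagged = high_correlation_pairs(demo)
print(flagged)
```

Applied to the notebook's data, the call would be `high_correlation_pairs(df[numeric_cols])`.
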

### 4.2 CLV by Categorical Features

How does Customer Lifetime Value vary across different categorical segments?

In [None]:
# CLV distribution by categorical features
print("=" * 80)
print("CLV DISTRIBUTION BY CATEGORICAL FEATURES")
print("=" * 80)

for col in ['coverage', 'education', 'employmentstatus', 'vehicle_class']:
    if col in df.columns:
        print(f"\n{'─' * 60}")
        print(f"📊 CLV by {col.upper()}:")
        
        summary = df.groupby(col)[target_col].agg(['mean', 'median', 'std', 'count']).round(2)
        summary = summary.sort_values('mean', ascending=False)
        print(summary)

In [None]:
# Visualize CLV by key categories
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

plot_cols = ['coverage', 'education', 'employmentstatus', 'vehicle_class']

for i, col in enumerate(plot_cols):
    row, column = i // 2, i % 2
    if col in df.columns:
        # Order by median CLV
        order = df.groupby(col)[target_col].median().sort_values(ascending=False).index
        
        sns.boxplot(data=df, x=col, y=target_col, ax=axes[row, column], 
                    order=order, palette='Set2')
        axes[row, column].set_title(f'CLV Distribution by {col.replace("_", " ").title()}', fontweight='bold')
        axes[row, column].set_xlabel('')
        axes[row, column].set_ylabel('Customer Lifetime Value ($)')
        axes[row, column].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, '02_clv_by_categories.png'), dpi=150, bbox_inches='tight')
plt.show()

print(f"\n📸 Saved: 02_clv_by_categories.png")

### 4.3 Key Relationship: Monthly Premium vs CLV

In [None]:
# Scatter plot: Monthly Premium vs CLV
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Raw scatter
axes[0].scatter(df['monthly_premium_auto'], df[target_col], alpha=0.3, s=20, color='steelblue')
axes[0].set_xlabel('Monthly Premium ($)')
axes[0].set_ylabel('Customer Lifetime Value ($)')
axes[0].set_title('Monthly Premium vs CLV (Raw)', fontweight='bold')

# Add trend line
z = np.polyfit(df['monthly_premium_auto'], df[target_col], 1)
p = np.poly1d(z)
x_line = np.linspace(df['monthly_premium_auto'].min(), df['monthly_premium_auto'].max(), 100)
axes[0].plot(x_line, p(x_line), "r--", linewidth=2, label='Trend Line')
axes[0].legend()

# Calculate and display correlation (p-value label derived from the test result)
corr, p_value = pearsonr(df['monthly_premium_auto'], df[target_col])
p_text = 'p-value < 0.001' if p_value < 0.001 else f'p-value = {p_value:.3f}'
axes[0].annotate(f'r = {corr:.4f}\n{p_text}', 
                 xy=(0.05, 0.95), xycoords='axes fraction',
                 fontsize=11, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Hued by coverage
if 'coverage' in df.columns:
    for coverage_type in df['coverage'].unique():
        subset = df[df['coverage'] == coverage_type]
        axes[1].scatter(subset['monthly_premium_auto'], subset[target_col], 
                        alpha=0.5, s=25, label=coverage_type.title())
    axes[1].set_xlabel('Monthly Premium ($)')
    axes[1].set_ylabel('Customer Lifetime Value ($)')
    axes[1].set_title('Monthly Premium vs CLV (by Coverage)', fontweight='bold')
    axes[1].legend(title='Coverage')

plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, '02_premium_vs_clv.png'), dpi=150, bbox_inches='tight')
plt.show()

print(f"\n📸 Saved: 02_premium_vs_clv.png")

---

## 5. Multivariate Analysis

Multivariate analysis examines **relationships among three or more variables** simultaneously.
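
A minimal sketch of why combined effects matter: in the synthetic data below (an illustrative assumption, unrelated to the insurance dataset), the outcome depends on the product of two features. Each feature alone shows almost no linear correlation with the outcome, while the interaction term correlates strongly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 5000
demo = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
# The outcome is driven by the interaction, plus noise
demo["y"] = demo["x1"] * demo["x2"] + rng.normal(scale=0.5, size=n)
# Candidate engineered feature
demo["x1_x2"] = demo["x1"] * demo["x2"]

corr_with_y = demo.corr()["y"].drop("y")
print(corr_with_y.round(3))
```

This is the pattern that a pairwise correlation table can miss and that interaction features are meant to capture.
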

### 5.1 Pair Plot (Feature Interactions)

In [None]:
# Pair plot for key numeric features
pair_cols = ['customer_lifetime_value', 'monthly_premium_auto', 'income', 'total_claim_amount', 'number_of_policies']
pair_cols = [c for c in pair_cols if c in df.columns]

# Sample for performance (pair plots are computationally expensive)
sample_df = df[pair_cols + ['coverage']].sample(min(2000, len(df)), random_state=42)

print("Creating pair plot (this may take a moment)...")
pairplot = sns.pairplot(sample_df, hue='coverage', diag_kind='kde', 
                         plot_kws={'alpha': 0.5, 's': 30},
                         palette='Set2')
# `.figure` is the current attribute name; `.fig` is deprecated in recent seaborn releases
pairplot.figure.suptitle('Feature Pair Plot (Colored by Coverage Type)', y=1.02, fontweight='bold')

plt.savefig(os.path.join(FIGURES_DIR, '02_pairplot.png'), dpi=150, bbox_inches='tight')
plt.show()

print(f"\n📸 Saved: 02_pairplot.png")

### 5.2 CLV Segmentation by Multiple Dimensions

In [None]:
# CLV by State and Coverage
if 'state' in df.columns and 'coverage' in df.columns:
    pivot_table = df.pivot_table(values=target_col, 
                                  index='state', 
                                  columns='coverage', 
                                  aggfunc='mean')
    
    plt.figure(figsize=(14, 8))
    sns.heatmap(pivot_table, annot=True, fmt=',.0f', cmap='YlOrRd', 
                linewidths=0.5, cbar_kws={'label': 'Mean CLV ($)'})
    plt.title('Average CLV by State and Coverage Type', fontweight='bold', fontsize=14)
    plt.xlabel('Coverage Type')
    plt.ylabel('State')
    
    plt.tight_layout()
    plt.savefig(os.path.join(FIGURES_DIR, '02_clv_state_coverage_heatmap.png'), dpi=150, bbox_inches='tight')
    plt.show()
    
    print(f"\n📸 Saved: 02_clv_state_coverage_heatmap.png")

---

## 6. Outlier Detection and Analysis

Outliers are extreme values that differ significantly from other observations. They can:
- Indicate data errors (should be fixed)
- Represent genuine extreme cases (often interesting)
- Significantly impact model performance

### 6.1 IQR-Based Outlier Detection

In [None]:
def detect_outliers_iqr(df, column):
    """
    Detect outliers using the Interquartile Range (IQR) method.
    Outliers are values beyond 1.5 * IQR from Q1 or Q3.
    """
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    
    return {
        'column': column,
        'Q1': Q1,
        'Q3': Q3,
        'IQR': IQR,
        'lower_bound': lower_bound,
        'upper_bound': upper_bound,
        'outlier_count': len(outliers),
        'outlier_pct': len(outliers) / len(df) * 100
    }

# Detect outliers for all numeric columns
print("=" * 80)
print("OUTLIER DETECTION (IQR Method)")
print("=" * 80)

outlier_results = []
for col in numeric_cols:
    result = detect_outliers_iqr(df, col)
    outlier_results.append(result)

outlier_df = pd.DataFrame(outlier_results)
outlier_df = outlier_df.sort_values('outlier_pct', ascending=False)

print("\n📊 Outlier Summary:")
outlier_df[['column', 'lower_bound', 'upper_bound', 'outlier_count', 'outlier_pct']].round(2)
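
To make the fence arithmetic concrete, here is a small worked example on eight made-up values. It repeats the same 1.5 × IQR rule with NumPy only, so it runs independently of `detect_outliers_iqr`.

```python
import numpy as np

values = np.array([10, 12, 13, 14, 15, 16, 18, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])  # linear interpolation by default
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")         # Q1 = 12.75, Q3 = 16.5, IQR = 3.75
print(f"Fences: [{lower_fence}, {upper_fence}]")    # [7.125, 22.125]
print(f"Outliers: {outliers.tolist()}")             # [100.0]
```

Only the value 100 falls outside the fences; on a right-skewed variable the upper fence is typically the one that flags points.
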

In [None]:
# Visualize outliers for top features with most outliers
top_outlier_cols = outlier_df.head(4)['column'].tolist()

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for i, col in enumerate(top_outlier_cols):
    # Box plot showing outliers
    sns.boxplot(y=df[col], ax=axes[i], color='lightblue')
    axes[i].set_title(f'Outliers in {col}', fontweight='bold')
    axes[i].set_ylabel(col)
    
    # Add outlier count annotation
    result = detect_outliers_iqr(df, col)
    axes[i].annotate(f"Outliers: {result['outlier_count']} ({result['outlier_pct']:.1f}%)",
                     xy=(0.5, 0.02), xycoords='axes fraction',
                     ha='center', fontsize=10,
                     bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))

plt.tight_layout()
plt.savefig(os.path.join(FIGURES_DIR, '02_outlier_analysis.png'), dpi=150, bbox_inches='tight')
plt.show()

print(f"\n📸 Saved: 02_outlier_analysis.png")

---

## 7. Key Insights Summary

Based on our comprehensive exploratory analysis, here are the **key findings**:

In [None]:
# Generate automated insights report
print("=" * 80)
print("KEY INSIGHTS SUMMARY")
print("=" * 80)

print("\n" + "─" * 60)
print("📊 1. TARGET VARIABLE (Customer Lifetime Value)")
print("─" * 60)
print(f"   • Mean CLV: ${df[target_col].mean():,.2f}")
print(f"   • Median CLV: ${df[target_col].median():,.2f}")
print(f"   • Distribution is RIGHT-SKEWED (skewness = {df[target_col].skew():.2f})")
print(f"   • ⚡ RECOMMENDATION: Apply log transformation for modeling")

print("\n" + "─" * 60)
print("📈 2. STRONGEST PREDICTORS (Correlation with CLV)")
print("─" * 60)
top_correlations = pearson_corr[target_col].drop(target_col).sort_values(key=abs, ascending=False).head(5)
for feature, corr in top_correlations.items():
    direction = "↑" if corr > 0 else "↓"
    print(f"   {direction} {feature}: r = {corr:+.4f}")

print("\n" + "─" * 60)
print("🗂️ 3. CATEGORICAL INSIGHTS")
print("─" * 60)
if 'coverage' in df.columns:
    coverage_clv = df.groupby('coverage')[target_col].mean().sort_values(ascending=False)
    print(f"   • Highest avg CLV by Coverage: {coverage_clv.index[0].upper()} (${coverage_clv.iloc[0]:,.2f})")

if 'sales_channel' in df.columns:
    channel_counts = df['sales_channel'].value_counts()
    print(f"   • Most common Sales Channel: {channel_counts.index[0].upper()} ({channel_counts.iloc[0]:,} customers)")

print("\n" + "─" * 60)
print("⚠️ 4. DATA QUALITY NOTES")
print("─" * 60)
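missing_total = int(df.isnull().sum().sum())  # computed rather than assumed
print(f"   • Missing Values: {missing_total:,}" + (" (Clean dataset)" if missing_total == 0 else ""))
high_outlier_cols = outlier_df[outlier_df['outlier_pct'] > 5]['column'].tolist()
print(f"   • High outlier columns: {', '.join(high_outlier_cols) if high_outlier_cols else 'none'}")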
print(f"   • Total records: {len(df):,}")

print("\n" + "─" * 60)
print("🎯 5. MODELING RECOMMENDATIONS")
print("─" * 60)
print("   • Transform target variable (log1p) to reduce skewness")
print("   • Consider tree-based models (robust to outliers)")
print("   • Engineer interaction features (e.g., Coverage × Education)")
print("   • Focus on high-correlation features for initial model")

---

## 8. Export Analysis Results

Save key analysis outputs for reference in subsequent notebooks.

In [None]:
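# Ensure the output directory exists (it may be missing if the raw-data fallback was used)
os.makedirs(DATA_PROCESSED_DIR, exist_ok=True)

# Export correlation matrix
pearson_corr.to_csv(os.path.join(DATA_PROCESSED_DIR, 'pearson_correlation_matrix.csv'))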
print(f"✅ Saved: pearson_correlation_matrix.csv")

# Export outlier analysis
outlier_df.to_csv(os.path.join(DATA_PROCESSED_DIR, 'outlier_analysis.csv'), index=False)
print(f"✅ Saved: outlier_analysis.csv")

# Export numeric summary
numeric_summary.to_csv(os.path.join(DATA_PROCESSED_DIR, 'numeric_summary_statistics.csv'))
print(f"✅ Saved: numeric_summary_statistics.csv")

---

## Next Steps

In **Notebook 03: Feature Engineering**, we will:

1. Apply log transformation to the target variable
2. Create interaction features based on domain knowledge
3. Implement encoding strategies for categorical variables
4. Scale numerical features appropriately
5. Prepare the final feature matrix for modeling
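
For step 1, a minimal sketch of the round trip, assuming the model is trained on `log1p(CLV)` and predictions are mapped back with `np.expm1`. The first and last sample values are the minimum and maximum CLV quoted in the Executive Summary; the middle one is made up.

```python
import numpy as np

clv = np.array([1898.0, 8000.0, 83325.0])

log_clv = np.log1p(clv)        # transformed target used for training
recovered = np.expm1(log_clv)  # back-transform predictions to dollars

print(log_clv.round(3))
print(recovered.round(2))
```

Because `np.expm1` is the exact inverse of `np.log1p`, any error metric reported in dollars should be computed after the back-transform.
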

---

**End of Notebook 02**

## Deep EDA & Interaction Analysis

Below are the high-resolution figures generated by our pipeline.

### 02 Bleeding Neck
![02 Bleeding Neck](../report/figures/02_bleeding_neck.png)

### 02 CLV by Category
![02 CLV by Category](../report/figures/02_clv_by_category.png)

### 02 Correlation Heatmap
![02 Correlation Heatmap](../report/figures/02_correlation_heatmap.png)

### 02 Target Distribution
![02 Target Distribution](../report/figures/02_target_distribution.png)

### 07 Boxplots
![07 Boxplots](../report/figures/07_boxplots.png)

### 07 Cat Coverage
![07 Cat Coverage](../report/figures/07_cat_coverage.png)

### 07 Cat Education
![07 Cat Education](../report/figures/07_cat_education.png)

### 07 Cat Vehicle
![07 Cat Vehicle](../report/figures/07_cat_vehicle.png)

### 07 Categorical Analysis
![07 Categorical Analysis](../report/figures/07_categorical_analysis.png)

### 07 Channel Analysis
![07 Channel Analysis](../report/figures/07_channel_analysis.png)

### 07 Channel CLV
![07 Channel CLV](../report/figures/07_channel_clv.png)

### 07 Channel Count
![07 Channel Count](../report/figures/07_channel_count.png)

### 07 Correlation Analysis
![07 Correlation Analysis](../report/figures/07_correlation_analysis.png)

### 07 Feature Importance
![07 Feature Importance](../report/figures/07_feature_importance.png)

### 07 Scatter Relationships
![07 Scatter Relationships](../report/figures/07_scatter_relationships.png)

### 07 Tenure Analysis
![07 Tenure Analysis](../report/figures/07_tenure_analysis.png)

### 07 Uni CLV
![07 Uni CLV](../report/figures/07_uni_clv.png)

### 07 Uni Income
![07 Uni Income](../report/figures/07_uni_income.png)

### 07 Uni Months
![07 Uni Months](../report/figures/07_uni_months.png)

### 07 Uni Premium
![07 Uni Premium](../report/figures/07_uni_premium.png)

### 07 Univariate Distributions
![07 Univariate Distributions](../report/figures/07_univariate_distributions.png)

### 09 Hexbin Premium Claims
![09 Hexbin Premium Claims](../report/figures/09_hexbin_premium_claims.png)

### 09 Interaction Income Edu
![09 Interaction Income Edu](../report/figures/09_interaction_income_edu.png)

### 09 Pairplot Key Metrics
![09 Pairplot Key Metrics](../report/figures/09_pairplot_key_metrics.png)

### 09 Violin Vehicle Gender
![09 Violin Vehicle Gender](../report/figures/09_violin_vehicle_gender.png)