## Phase 4: Exploratory Data Analysis (EDA)

### What is EDA?

Exploratory data analysis (EDA) is the process of understanding a dataset through visualization and summary statistics before any model is built.

### EDA Goals

1. Understand feature distributions
2. Identify relationships between features and target
3. Detect data leakage and inconsistencies
4. Guide feature engineering decisions
5. Communicate insights to stakeholders

### Key EDA Techniques

**1. Univariate Analysis (Single Variable)**


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Setup (every cell in this section assumes `df` is the cleaned DataFrame, e.g. df = cleaned_data)
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Numeric variable: Distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram
df['age'].hist(bins=30, ax=axes[0])
axes[0].set_title('Age Distribution (Histogram)')
axes[0].set_xlabel('Age')

# KDE plot (smooth distribution)
df['age'].plot(kind='kde', ax=axes[1])
axes[1].set_title('Age Distribution (KDE)')

# Box plot (shows outliers)
df['age'].plot(kind='box', ax=axes[2])
axes[2].set_title('Age Distribution (Box Plot)')

plt.tight_layout()
plt.show()

# Statistical summary
print(df['age'].describe())
# Output:
# count    10000
# mean      42.5
# std       15.2
# min       18
# 25%       32
# 50%       42
# 75%       53
# max       120


**2. Categorical Variable Distribution**


In [None]:
# Count plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

df['account_type'].value_counts().plot(kind='bar', ax=axes[0])
axes[0].set_title('Account Type Distribution')

# Percentage
(df['account_type'].value_counts() / len(df) * 100).plot(
    kind='pie', ax=axes[1], autopct='%1.1f%%'
)
axes[1].set_title('Account Type Percentage')

plt.tight_layout()
plt.show()


**3. Bivariate Analysis (Two Variables)**


In [None]:
# Numeric vs Numeric: Scatter plot & Correlation
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Scatter plot
axes[0].scatter(df['age'], df['total_spent'], alpha=0.5)
axes[0].set_xlabel('Age')
axes[0].set_ylabel('Total Spent ($)')
axes[0].set_title('Age vs Total Spent')

# Correlation
correlation = df[['age', 'total_spent', 'transaction_count']].corr()
sns.heatmap(correlation, annot=True, ax=axes[1], cmap='coolwarm')
axes[1].set_title('Correlation Matrix')

plt.tight_layout()
plt.show()

print(f"Correlation (Age vs Total Spent): {df['age'].corr(df['total_spent']):.3f}")
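
The coefficient printed above is Pearson's, which only measures linear association. As a minimal sketch on made-up data (the `x` and `y` series are hypothetical, not columns from the dataset), comparing it with a rank-based (Spearman) coefficient can hint at a relationship that is monotonic but non-linear:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y grows exponentially with x (monotonic, non-linear)
x = pd.Series(np.linspace(0, 10, 50))
y = np.exp(x)

# Pearson measures linear association only
pearson = x.corr(y)

# Spearman is Pearson computed on ranks, so it captures any monotonic trend
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients suggests trying a transformation (for example a log) before relying on linear correlation.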


**4. Feature vs Target Analysis**


In [None]:
# Numeric feature vs Binary target
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Box plot: Target groups
sns.boxplot(x='churned', y='age', data=df, ax=axes[0])
axes[0].set_title('Age by Churn Status')
axes[0].set_ylabel('Age')
axes[0].set_xticks([0, 1])  # fix tick positions before relabeling (assumes churned is coded 0/1)
axes[0].set_xticklabels(['No Churn', 'Churned'])

# Violin plot: shows the full shape of each distribution, not just the quartiles
sns.violinplot(x='churned', y='total_spent', data=df, ax=axes[1])
axes[1].set_title('Total Spent by Churn Status')
axes[1].set_ylabel('Total Spent ($)')

plt.tight_layout()
plt.show()

# Statistical comparison
print("Average age by churn:")
print(df.groupby('churned')['age'].agg(['mean', 'median', 'std']))
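
The plots above compare a numeric feature across target groups. For a categorical feature, a comparable check is the churn rate per category. The sketch below uses a small hand-made frame (the values are hypothetical) and assumes the target is coded 0/1, so the group mean equals the churn rate:

```python
import pandas as pd

# Hypothetical toy data: 6 basic accounts and 4 premium accounts
toy = pd.DataFrame({
    "account_type": ["basic"] * 6 + ["premium"] * 4,
    "churned": [1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 target per group is the churn rate for that group
churn_rate = toy.groupby("account_type")["churned"].mean()
print(churn_rate)

# Row-normalized contingency table: share of each outcome within a category
rates = pd.crosstab(toy["account_type"], toy["churned"], normalize="index")
print(rates)
```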


**5. Correlation with Target**


In [None]:
# Calculate correlation with target variable
target = 'churned'
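# numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x
correlations = df.corr(numeric_only=True)[target].sort_values(ascending=False)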

print("Feature Correlation with Churn:")
print(correlations)

# Visualize
plt.figure(figsize=(10, 6))
correlations.drop(target).plot(kind='barh')  # exclude the target's self-correlation of 1.0
plt.title('Feature Correlation with Churn')
plt.xlabel('Correlation Coefficient')
plt.tight_layout()
plt.show()
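
A feature whose correlation with the target is close to +1 or -1 is often a sign of data leakage, not a genuinely strong predictor. The following is a minimal sketch of that check, assuming a hypothetical `cancellation_fee` column and an arbitrary 0.95 threshold; neither comes from the dataset above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
churned = rng.integers(0, 2, n)

toy = pd.DataFrame({
    "churned": churned,
    "age": rng.normal(40, 10, n),
    # Hypothetical leaky column: only non-zero once a customer has already churned
    "cancellation_fee": churned * 25.0,
})

# Correlation of every feature with the target, excluding the target itself
target_corr = toy.corr()["churned"].drop("churned")

# Flag features that track the target almost perfectly (threshold is an assumption)
suspicious = target_corr[target_corr.abs() > 0.95].index.tolist()
print("Possible leakage:", suspicious)
```

Any flagged column still needs a manual check of when its value becomes known relative to prediction time.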


**6. Class Imbalance Check**


In [None]:
# Critical for classification problems
print(df['churned'].value_counts())
print(df['churned'].value_counts(normalize=True))

# Visualize
plt.figure(figsize=(8, 4))
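# sort_index() keeps class 0 first so the tick labels below match the bars
df['churned'].value_counts().sort_index().plot(kind='bar')
plt.title('Churn Distribution')
plt.ylabel('Count')
plt.xticks(ticks=[0, 1], labels=['No Churn', 'Churned'], rotation=0)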
plt.tight_layout()
plt.show()

# Calculate imbalance ratio
positive_class = (df['churned'] == 1).sum()
negative_class = (df['churned'] == 0).sum()
imbalance_ratio = negative_class / positive_class
print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")
# Rule of thumb: a ratio above roughly 10:1 usually calls for special handling (class weights, resampling)
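
One of the options named above, class weights, can be sketched with the "balanced" heuristic `n_samples / (n_classes * class_count)`, the same formula scikit-learn documents for `class_weight="balanced"`. The toy target below is hypothetical:

```python
import pandas as pd

# Hypothetical toy target: 90 non-churners and 10 churners (9:1 imbalance)
y = pd.Series([0] * 90 + [1] * 10, name="churned")

counts = y.value_counts().sort_index()
n_samples, n_classes = len(y), counts.size

# "Balanced" heuristic: rarer classes receive proportionally larger weights
class_weights = (n_samples / (n_classes * counts)).to_dict()
print(class_weights)
```

The resulting dictionary maps each class label to its weight, which is the form many classifiers accept for a class-weight parameter.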


### EDA Questions to Ask

```
1. Data Completeness:
   - How many missing values per column?
   - Are missing patterns random or systematic?

2. Distribution:
   - Is the target well-balanced?
   - Are features normally distributed or skewed?
   - Are there obvious outliers?

3. Relationships:
   - Which features correlate with the target?
   - Are there multicollinearity issues (features highly correlated)?
   - Are relationships linear or non-linear?

4. Time Patterns (if applicable):
   - Are there seasonal patterns?
   - Is there trend over time?
   - Do patterns vary by customer segment?

5. Data Leakage:
   - Are we using future data to predict the past?
   - Are there features that would not be available at prediction time?

6. Segments:
   - Do patterns vary by customer type, geography, time period?
   - Should we build separate models per segment?
```
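
Several of these questions can be answered with a few lines of pandas. The sketch below covers completeness, skewness, outliers, and multicollinearity on synthetic data. The `toy` frame, the 1.5 × IQR rule, and the 0.9 correlation threshold are illustrative assumptions, not values prescribed by this guide:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "total_spent": rng.lognormal(mean=5, sigma=1, size=n),  # right-skewed
})
# Nearly a rescaled copy of total_spent, to simulate multicollinearity
toy["transaction_count"] = toy["total_spent"] / 50 + rng.normal(0, 0.1, n)
toy.loc[:19, "age"] = np.nan  # inject 20 missing values (10% of rows)

# 1. Completeness: percentage of missing values per column
missing_pct = toy.isna().mean().mul(100)

# 2. Distribution: skewness (values far from 0 indicate a skewed feature)
skewness = toy.skew()

# 2. Distribution: flag outliers with the 1.5 * IQR rule
def iqr_outlier_mask(s, k=1.5):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

n_outliers = toy.apply(lambda col: iqr_outlier_mask(col).sum())

# 3. Relationships: feature pairs with |correlation| above 0.9
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]

print(missing_pct, skewness, n_outliers, high_pairs, sep="\n\n")
```

Whether missingness is random or systematic, and whether a feature leaks future information, cannot be read off these numbers alone; those questions still need domain knowledge about how the data was collected.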

### Complete EDA Summary Report


In [None]:
def generate_eda_report(df, target_col):
    """
    Print a text EDA report for `df`.

    Assumes `target_col` is a binary column coded 0/1.
    """
    report = f"""
    ========== EDA REPORT ==========
    
    DATASET SHAPE: {df.shape}
    
    MISSING VALUES:
    {df.isnull().sum()}
    
    DATA TYPES:
    {df.dtypes}
    
    TARGET DISTRIBUTION:
    {df[target_col].value_counts()}
    Class Balance: {(df[target_col] == 1).sum() / len(df) * 100:.2f}% positive
    
    NUMERIC FEATURES SUMMARY:
    {df.describe()}
    
    CORRELATION WITH TARGET (Top 10):
    {df.corr(numeric_only=True)[target_col].drop(target_col).nlargest(10)}
    """
    print(report)

generate_eda_report(cleaned_data, 'churned')


### Tools Used in EDA

| Tool | Purpose |
|------|---------|
| Pandas | Data exploration, statistics |
| Matplotlib | Basic plotting |
| Seaborn | Statistical visualization |
| Plotly | Interactive visualizations |
| Jupyter | Notebook environment |
| ydata-profiling (formerly Pandas Profiling) | Automated EDA report |
| Apache Spark | Large-scale EDA |

---
