# Exploratory Data Analysis (EDA) Learning Guide

## What is Exploratory Data Analysis?

**Exploratory Data Analysis (EDA)** is the critical first step in any data science project. It's the process of investigating datasets to discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical summaries and graphical representations.

## Why is EDA Important?

1. **Understand Your Data** - Get familiar with the structure, size, and content
2. **Detect Data Quality Issues** - Find missing values, duplicates, and errors
3. **Identify Patterns** - Discover relationships and trends in the data
4. **Spot Outliers** - Detect anomalies that might affect analysis
5. **Inform Feature Engineering** - Guide creation of new features for modeling
6. **Select Appropriate Models** - Choose algorithms based on data characteristics
7. **Validate Assumptions** - Test statistical assumptions before modeling

## What You'll Learn in This Notebook

This comprehensive guide covers **11 essential EDA topics**:

1. [Data Loading & Initial Inspection](#1-data-loading--initial-inspection) - Understanding dataset structure
2. [Data Types & Structure](#2-data-types--structure) - Identifying numerical and categorical features
3. [Missing Data Analysis](#3-missing-data-analysis) - Detecting and handling missing values
4. [Descriptive Statistics](#4-descriptive-statistics) - Central tendency, dispersion, and shape
5. [Univariate Analysis](#5-univariate-analysis-single-variable) - Distribution of individual variables
6. [Bivariate Analysis](#6-bivariate-analysis-two-variables) - Relationships between pairs of variables
7. [Multivariate Analysis](#7-multivariate-analysis) - Complex multi-variable patterns
8. [Outlier Detection](#8-outlier-detection) - Identifying and handling anomalies
9. [Distribution Analysis](#9-distribution-analysis) - Understanding data shapes and normality
10. [Correlation Analysis](#10-correlation-analysis) - Feature relationships and dependencies
11. [Best Practices & Summary](#11-eda-best-practices--summary) - Professional EDA workflow

## Tools & Libraries Used

- **pandas** - Data manipulation and analysis
- **numpy** - Numerical operations
- **matplotlib** - Basic plotting
- **seaborn** - Statistical visualizations
- **scipy** - Statistical tests and functions

## How to Use This Notebook

1. **Run cells sequentially** - Each section builds on previous knowledge
2. **Experiment** - Modify code to explore different aspects
3. **Apply to your data** - Replace sample data with your own datasets
4. **Document insights** - Keep notes on findings and decisions

## Dataset Overview

This notebook uses a **synthetic loan default dataset** with 1,000 customers and the following features:

- **Numerical**: age, income, credit_score, loan_amount, employment_years
- **Categorical**: education, owns_home
- **Target**: default (0=No, 1=Yes)

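The dataset has missing values injected into `income`, `credit_score`, and `employment_years`, and its normally distributed columns produce occasional extreme values, so it exercises both the missing-data and outlier steps of a real-world EDA.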

---

Let's begin! 🚀

In [None]:
# ============================================
# SETUP: Import Essential Libraries for EDA
# ============================================

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
from scipy import stats
from scipy.stats import skew, kurtosis

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.precision', 2)

# Plotting settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("✓ Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## 1. Data Loading & Initial Inspection

**Why**: Understand the basic structure and size of your dataset before diving deeper.

**When to use**: 
- First step in any EDA
- After loading new datasets
- When combining multiple data sources

**Key Questions**:
- How many rows and columns?
- What do the first/last few rows look like?
- What's the overall structure?

In [None]:
# ============================================
# 1. DATA LOADING & INITIAL INSPECTION
# ============================================

# Create sample dataset (in practice, you'd load from CSV, database, etc.)
np.random.seed(42)

data = {
    'customer_id': range(1, 1001),
    'age': np.random.randint(18, 80, 1000),
    'income': np.random.normal(50000, 20000, 1000),
    'credit_score': np.random.randint(300, 850, 1000),
    'loan_amount': np.random.normal(15000, 7000, 1000),
    'employment_years': np.random.randint(0, 40, 1000),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], 1000),
    'owns_home': np.random.choice(['Yes', 'No'], 1000),
    'default': np.random.choice([0, 1], 1000, p=[0.85, 0.15])
}

df = pd.DataFrame(data)

# Add some missing values (realistic scenario)
df.loc[np.random.choice(df.index, 50, replace=False), 'income'] = np.nan
df.loc[np.random.choice(df.index, 30, replace=False), 'credit_score'] = np.nan
df.loc[np.random.choice(df.index, 20, replace=False), 'employment_years'] = np.nan

print("=" * 60)
print("INITIAL DATA INSPECTION")
print("=" * 60)

# 1. Shape of dataset
print(f"\n1. Dataset Shape: {df.shape}")
print(f"   - Rows (observations): {df.shape[0]:,}")
print(f"   - Columns (features): {df.shape[1]}")

# 2. First few rows
print("\n2. First 5 rows:")
print(df.head())

# 3. Last few rows
print("\n3. Last 5 rows:")
print(df.tail())

# 4. Random sample
print("\n4. Random sample (3 rows):")
print(df.sample(3))

# 5. Column names and index
print("\n5. Column Names:")
print(df.columns.tolist())

# 6. Basic info (df.info() prints directly and returns None, so it is not wrapped in print)
print("\n6. Dataset Info:")
df.info()

# 7. Memory usage
print(f"\n7. Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

## 2. Data Types & Structure

**Why**: Understanding data types helps choose appropriate analysis methods and visualizations.

**When to use**:
- After loading data
- Before data cleaning
- When planning transformations

**Key Questions**:
- Are numeric columns actually numeric?
- Are categorical variables properly encoded?
- Do data types match expectations?
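
The first two questions can be checked with a small sketch like the one below. It assumes a hypothetical frame named `raw` in which a numeric column was read as text; the column values are made up for illustration and are not part of this notebook's dataset.

```python
import pandas as pd

# Hypothetical example frame: 'income' was read as text because of a stray entry
raw = pd.DataFrame({
    'income': ['52000', '61000', 'unknown', '48000'],
    'owns_home': ['Yes', 'No', 'Yes', 'Yes'],
})

# Coerce to numeric; values that cannot be parsed become NaN
income_numeric = pd.to_numeric(raw['income'], errors='coerce')

# Count entries that failed to parse so they can be inspected rather than silently lost
n_unparseable = int(income_numeric.isna().sum() - raw['income'].isna().sum())

# Low-cardinality text columns are natural candidates for the 'category' dtype
owns_home_cat = raw['owns_home'].astype('category')

print(income_numeric.dtype, n_unparseable, list(owns_home_cat.cat.categories))
```

Counting the coerced values before moving on makes it clear how many entries need a closer look.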

In [None]:
# ============================================
# 2. DATA TYPES & STRUCTURE
# ============================================

print("=" * 60)
print("DATA TYPES ANALYSIS")
print("=" * 60)

# 1. Data types overview
print("\n1. Data Types:")
print(df.dtypes)

# 2. Count of each data type
print("\n2. Data Type Summary:")
print(df.dtypes.value_counts())

# 3. Identify numerical and categorical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"\n3. Numerical Columns ({len(numerical_cols)}):")
print(numerical_cols)

print(f"\n4. Categorical Columns ({len(categorical_cols)}):")
print(categorical_cols)

# 5. Unique values in each column
print("\n5. Unique Values Count:")
for col in df.columns:
    print(f"   {col:20s}: {df[col].nunique():5d} unique values")

# 6. Cardinality check (for categorical variables)
print("\n6. Categorical Variables Detail:")
for col in categorical_cols:
    print(f"\n   {col}:")
    print(f"   - Unique values: {df[col].nunique()}")
    print(f"   - Value counts:")
    print(df[col].value_counts())

# 7. Check for constant columns (no variance)
print("\n7. Checking for Constant Columns:")
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
if constant_cols:
    print(f"   Constant columns found: {constant_cols}")
else:
    print("   ✓ No constant columns found")

# 8. Data type conversion example
print("\n8. Example: Converting Data Types")
mem_before = df.memory_usage(deep=True).sum()
print(f"   Before: default dtype = {df['default'].dtype}")
df['default'] = df['default'].astype('category')
print(f"   After: default dtype = {df['default'].dtype}")
mem_after = df.memory_usage(deep=True).sum()
print(f"   Memory saved: {(mem_before - mem_after) / 1024:.2f} KB")

## 3. Missing Data Analysis

**Why**: Missing data can bias results, reduce statistical power, and cause model errors.

**When to use**:
- Before any analysis
- When deciding on imputation strategies
- When assessing data quality

**Strategies**:
- **Drop**: If under about 5% of values are missing and the missingness looks random (a common rule of thumb)
- **Impute**: Mean/median (numerical), mode (categorical)
- **Flag**: Create indicator variable for missingness
- **Model**: Predict missing values
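
As a minimal sketch of how the **Flag** and **Impute** strategies combine, the example below uses a hypothetical toy column (`toy`, with made-up values) rather than the notebook's dataset. The flag is created before imputation so the information about which rows were missing is not lost.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries and one extreme value
toy = pd.DataFrame({'income': [40000.0, np.nan, 55000.0, 61000.0, np.nan, 1_000_000.0]})

# Flag: record which rows were missing before any values are filled in
toy['income_missing'] = toy['income'].isna().astype(int)

# Impute: the median is less sensitive to the extreme value than the mean
median_income = toy['income'].median()
toy['income_imputed'] = toy['income'].fillna(median_income)

print(toy)
```

The median is used here because a single extreme value would pull a mean-based fill far from typical incomes.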

In [None]:
# ============================================
# 3. MISSING DATA ANALYSIS
# ============================================

print("=" * 60)
print("MISSING DATA ANALYSIS")
print("=" * 60)

# 1. Count missing values
print("\n1. Missing Values Count:")
missing = df.isnull().sum()
print(missing[missing > 0])

# 2. Percentage of missing values
print("\n2. Missing Values Percentage:")
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print(missing_pct[missing_pct > 0])

# 3. Create comprehensive missing data report
def missing_data_report(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return missing_data[missing_data['Total'] > 0]

print("\n3. Missing Data Report:")
print(missing_data_report(df))

# 4. Visualize missing data
print("\n4. Missing Data Visualization:")
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot
missing_data_report(df).plot(kind='bar', ax=axes[0])
axes[0].set_title('Missing Values by Column')
axes[0].set_ylabel('Count / Percentage')
axes[0].legend(['Total Missing', 'Percentage Missing'])

# Heatmap
sns.heatmap(df.isnull(), yticklabels=False, cbar=True, cmap='viridis', ax=axes[1])
axes[1].set_title('Missing Data Heatmap (Yellow = Missing)')

plt.tight_layout()
plt.show()

# 5. Missing data patterns
print("\n5. Missing Data Patterns:")
print(f"   - Rows with any missing value: {df.isnull().any(axis=1).sum()} ({df.isnull().any(axis=1).sum()/len(df)*100:.1f}%)")
print(f"   - Rows with all values present: {(~df.isnull().any(axis=1)).sum()} ({(~df.isnull().any(axis=1)).sum()/len(df)*100:.1f}%)")

# 6. Correlation of missingness
print("\n6. Missingness Correlation:")
missing_corr = df.isnull().corr()
plt.figure(figsize=(8, 6))
sns.heatmap(missing_corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation of Missing Values')
plt.show()

# 7. Example: Handling missing data
print("\n7. Example: Handling Missing Data")

# Strategy 1: Drop rows with missing values
df_dropped = df.dropna()
print(f"   After dropping rows: {df_dropped.shape[0]} rows remain")

# Strategy 2: Fill with median (numerical)
df_filled = df.copy()
for col in ['income', 'credit_score', 'employment_years']:
    if col in df_filled.columns:
        # Assign back instead of using inplace=True on a column selection,
        # which is deprecated and has no effect under copy-on-write
        df_filled[col] = df_filled[col].fillna(df_filled[col].median())
print(f"   After median imputation: {df_filled.isnull().sum().sum()} missing values remain")

# Strategy 3: Forward fill (time series)
# df_ffill = df.ffill()

# Strategy 4: Create missing indicator
df['income_missing'] = df['income'].isnull().astype(int)
print(f"   Created missing indicator: income_missing")
print(f"   - Missing: {df['income_missing'].sum()}")
print(f"   - Present: {(~df['income_missing'].astype(bool)).sum()}")

## 4. Descriptive Statistics

**Why**: Get a statistical summary of your data's central tendency, dispersion, and shape.

**When to use**:
- Understanding data distributions
- Identifying potential outliers
- Comparing variables

**Key Metrics**:
- **Central Tendency**: Mean, median, mode
- **Dispersion**: Std dev, variance, range, IQR
- **Shape**: Skewness, kurtosis
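
To make these metrics concrete, here is a small sketch on a hypothetical sample (`values`, chosen for illustration) that contains one large value. It shows how the mean and range react to that value while the median and IQR barely move.

```python
import pandas as pd

# Hypothetical sample with one large value (30)
values = pd.Series([2, 4, 4, 5, 7, 9, 30])

# Central tendency
mean = values.mean()
median = values.median()

# Dispersion
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
value_range = values.max() - values.min()

# Shape: positive skewness indicates a longer right tail
skewness = values.skew()

print(f"mean={mean:.2f} median={median:.2f} IQR={iqr:.2f} "
      f"range={value_range} skewness={skewness:.2f}")
```

When the mean sits well above the median, as it does here, that is an early hint of right skew or outliers.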

In [None]:
# ============================================
# 4. DESCRIPTIVE STATISTICS
# ============================================

print("=" * 60)
print("DESCRIPTIVE STATISTICS")
print("=" * 60)

# 1. Basic statistics for numerical columns
print("\n1. Basic Statistical Summary:")
print(df.describe())

# 2. Extended statistics
print("\n2. Extended Statistical Summary (including percentiles):")
print(df.describe(percentiles=[.01, .05, .25, .5, .75, .95, .99]))

# 3. Statistics for categorical columns
print("\n3. Categorical Variables Summary:")
print(df.describe(include=['object', 'category']))

# 4. Individual statistics
print("\n4. Detailed Statistics for Numerical Columns:")
for col in numerical_cols:
    # Skip the ID column and the target, which was converted to 'category' in Section 2
    if col not in ('customer_id', 'default'):
        print(f"\n   {col}:")
        print(f"      Mean:     {df[col].mean():.2f}")
        print(f"      Median:   {df[col].median():.2f}")
        print(f"      Mode:     {df[col].mode().values[0]:.2f}")
        print(f"      Std Dev:  {df[col].std():.2f}")
        print(f"      Variance: {df[col].var():.2f}")
        print(f"      Min:      {df[col].min():.2f}")
        print(f"      Max:      {df[col].max():.2f}")
        print(f"      Range:    {df[col].max() - df[col].min():.2f}")
        print(f"      IQR:      {df[col].quantile(0.75) - df[col].quantile(0.25):.2f}")
        print(f"      Skewness: {df[col].skew():.2f}")
        print(f"      Kurtosis: {df[col].kurtosis():.2f}")

# 5. Custom aggregations
print("\n5. Custom Aggregations:")
custom_stats = df[['age', 'income', 'credit_score']].agg({
    'age': ['min', 'max', 'mean', 'median'],
    'income': ['min', 'max', 'mean', 'std'],
    'credit_score': ['min', 'max', 'mean', 'median']
})
print(custom_stats)

# 6. Group statistics
print("\n6. Statistics by Group (Education Level):")
print(df.groupby('education')['income'].describe())

# 7. Summary statistics grouped by the target
print("\n7. Statistics Summary with Target (Default):")
print(df.groupby('default').agg({
    'age': ['mean', 'median'],
    'income': ['mean', 'median'],
    'credit_score': ['mean', 'median']
}))

# 8. Coefficient of Variation (CV)
print("\n8. Coefficient of Variation (Relative Variability):")
for col in ['age', 'income', 'credit_score', 'loan_amount']:
    cv = (df[col].std() / df[col].mean()) * 100
    print(f"   {col:20s}: {cv:.2f}%")

## 5. Univariate Analysis (Single Variable)

**Why**: Understand the distribution and characteristics of individual variables.

**When to use**:
- Analyzing each feature independently
- Detecting outliers and anomalies
- Understanding value distributions

**Techniques**:
- **Numerical**: Histograms, box plots, density plots
- **Categorical**: Bar charts, pie charts, frequency tables

In [None]:
# ============================================
# 5. UNIVARIATE ANALYSIS
# ============================================

print("=" * 60)
print("UNIVARIATE ANALYSIS")
print("=" * 60)

# 1. Histogram for numerical variables
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

numerical_features = ['age', 'income', 'credit_score', 'loan_amount', 'employment_years']

for idx, col in enumerate(numerical_features):
    df[col].hist(bins=30, ax=axes[idx], edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'Distribution of {col}')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Frequency')
    axes[idx].axvline(df[col].mean(), color='red', linestyle='--', label=f'Mean: {df[col].mean():.1f}')
    axes[idx].axvline(df[col].median(), color='green', linestyle='--', label=f'Median: {df[col].median():.1f}')
    axes[idx].legend()

plt.tight_layout()
plt.show()

# 2. Density plots (KDE)
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_features):
    df[col].dropna().plot(kind='density', ax=axes[idx], color='blue', alpha=0.7)
    axes[idx].set_title(f'Density Plot of {col}')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Density')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 3. Box plots for outlier detection
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_features):
    df.boxplot(column=col, ax=axes[idx])
    axes[idx].set_title(f'Box Plot of {col}')
    axes[idx].set_ylabel(col)

plt.tight_layout()
plt.show()

# 4. Categorical variables - bar charts
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Education distribution
df['education'].value_counts().plot(kind='bar', ax=axes[0], color='skyblue', edgecolor='black')
axes[0].set_title('Distribution of Education Levels')
axes[0].set_xlabel('Education')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

# Home ownership distribution
df['owns_home'].value_counts().plot(kind='bar', ax=axes[1], color='lightgreen', edgecolor='black')
axes[1].set_title('Distribution of Home Ownership')
axes[1].set_xlabel('Owns Home')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()

# 5. Pie charts for categorical variables
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Education pie chart
df['education'].value_counts().plot(kind='pie', ax=axes[0], autopct='%1.1f%%', startangle=90)
axes[0].set_title('Education Distribution')
axes[0].set_ylabel('')

# Default rate pie chart
df['default'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%', startangle=90, 
                                   labels=['No Default', 'Default'], colors=['green', 'red'])
axes[1].set_title('Default Rate')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

# 6. Value counts for categorical variables
print("\n6. Frequency Tables:")
for col in categorical_cols:
    print(f"\n   {col}:")
    counts = df[col].value_counts()
    percentages = df[col].value_counts(normalize=True) * 100
    freq_table = pd.DataFrame({'Count': counts, 'Percentage': percentages})
    print(freq_table)

# 7. QQ Plot for normality check
from scipy.stats import probplot

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_features):
    probplot(df[col].dropna(), dist="norm", plot=axes[idx])
    axes[idx].set_title(f'Q-Q Plot: {col}')

plt.tight_layout()
plt.show()

print("\n7. Normality Interpretation:")
print("   - Points close to the line = approximately normal distribution")
print("   - Systematic curvature or deviating tails = non-normal distribution")

## 6. Bivariate Analysis (Two Variables)

**Why**: Understand relationships between pairs of variables.

**When to use**:
- Exploring feature relationships
- Identifying correlations
- Understanding target variable relationships

**Techniques**:
- **Numerical vs Numerical**: Scatter plots, correlation
- **Numerical vs Categorical**: Box plots, violin plots
- **Categorical vs Categorical**: Crosstabs, stacked bars
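
The categorical and mixed cases can be sketched without any plotting. The frame below (`toy`) is hypothetical and its values are made up; it only illustrates how a row-normalized crosstab and a grouped summary answer bivariate questions.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
toy = pd.DataFrame({
    'owns_home': ['Yes', 'Yes', 'No', 'No', 'No', 'Yes'],
    'default':   [0, 0, 1, 0, 1, 0],
    'income':    [60000, 45000, 38000, 52000, 30000, 71000],
})

# Categorical vs categorical: a row-normalized crosstab gives the default rate per group
rates = pd.crosstab(toy['owns_home'], toy['default'], normalize='index') * 100

# Numerical vs categorical: compare a summary statistic across groups
median_income = toy.groupby('default')['income'].median()

print(rates.round(1))
print(median_income)
```

Normalizing by `index` makes each row sum to 100%, so groups of different sizes can be compared directly.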

In [None]:
# ============================================
# 6. BIVARIATE ANALYSIS
# ============================================

print("=" * 60)
print("BIVARIATE ANALYSIS")
print("=" * 60)

# 1. Scatter plots (Numerical vs Numerical)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

pairs = [
    ('age', 'income'),
    ('credit_score', 'loan_amount'),
    ('employment_years', 'income'),
    ('age', 'credit_score')
]

for idx, (x, y) in enumerate(pairs):
    axes[idx].scatter(df[x], df[y], alpha=0.5)
    axes[idx].set_xlabel(x)
    axes[idx].set_ylabel(y)
    axes[idx].set_title(f'{y} vs {x}')
    
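    # Add trend line, fitted only on rows where both variables are present
    # (dropping NaNs per column would give arrays of different lengths)
    pair = df[[x, y]].dropna().sort_values(x)
    z = np.polyfit(pair[x], pair[y], 1)
    p = np.poly1d(z)
    axes[idx].plot(pair[x], p(pair[x]), "r--", alpha=0.8, label='Trend')
    axes[idx].legend()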

plt.tight_layout()
plt.show()

# 2. Correlation between variables
print("\n1. Correlation Analysis:")
correlations = df[['age', 'income', 'credit_score', 'loan_amount', 'employment_years']].corr()
print(correlations)

# 3. Scatter plot with color (target variable)
plt.figure(figsize=(10, 6))
scatter = plt.scatter(df['credit_score'], df['income'], 
                     c=df['default'], cmap='RdYlGn_r', alpha=0.6)
plt.xlabel('Credit Score')
plt.ylabel('Income')
plt.title('Income vs Credit Score (colored by Default)')
plt.colorbar(scatter, label='Default (0=No, 1=Yes)')
plt.grid(True, alpha=0.3)
plt.show()

# 4. Box plots (Numerical vs Categorical)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Income by education
df.boxplot(column='income', by='education', ax=axes[0, 0])
axes[0, 0].set_title('Income by Education Level')
axes[0, 0].set_xlabel('Education')
axes[0, 0].set_ylabel('Income')

# Credit score by home ownership
df.boxplot(column='credit_score', by='owns_home', ax=axes[0, 1])
axes[0, 1].set_title('Credit Score by Home Ownership')
axes[0, 1].set_xlabel('Owns Home')
axes[0, 1].set_ylabel('Credit Score')

# Loan amount by default
df.boxplot(column='loan_amount', by='default', ax=axes[1, 0])
axes[1, 0].set_title('Loan Amount by Default Status')
axes[1, 0].set_xlabel('Default (0=No, 1=Yes)')
axes[1, 0].set_ylabel('Loan Amount')

# Age by education
df.boxplot(column='age', by='education', ax=axes[1, 1])
axes[1, 1].set_title('Age by Education Level')
axes[1, 1].set_xlabel('Education')
axes[1, 1].set_ylabel('Age')

fig.suptitle('')  # clear the automatic "Boxplot grouped by ..." title added by pandas
plt.tight_layout()
plt.show()

# 5. Violin plots (show the full distribution shape, not just the quartiles)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

sns.violinplot(data=df, x='education', y='income', ax=axes[0])
axes[0].set_title('Income Distribution by Education')
axes[0].tick_params(axis='x', rotation=45)

sns.violinplot(data=df, x='default', y='credit_score', ax=axes[1])
axes[1].set_title('Credit Score Distribution by Default')

plt.tight_layout()
plt.show()

# 6. Crosstab (Categorical vs Categorical)
print("\n2. Crosstab Analysis:")
crosstab = pd.crosstab(df['education'], df['default'], normalize='index') * 100
print("\nDefault Rate by Education (%):")
print(crosstab)

crosstab2 = pd.crosstab(df['owns_home'], df['default'], normalize='index') * 100
print("\nDefault Rate by Home Ownership (%):")
print(crosstab2)

# 7. Stacked bar chart
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

pd.crosstab(df['education'], df['default']).plot(kind='bar', stacked=True, ax=axes[0], 
                                                  color=['green', 'red'], alpha=0.7)
axes[0].set_title('Default by Education (Stacked)')
axes[0].set_xlabel('Education')
axes[0].set_ylabel('Count')
axes[0].legend(['No Default', 'Default'])
axes[0].tick_params(axis='x', rotation=45)

pd.crosstab(df['owns_home'], df['default']).plot(kind='bar', stacked=True, ax=axes[1],
                                                  color=['green', 'red'], alpha=0.7)
axes[1].set_title('Default by Home Ownership (Stacked)')
axes[1].set_xlabel('Owns Home')
axes[1].set_ylabel('Count')
axes[1].legend(['No Default', 'Default'])

plt.tight_layout()
plt.show()

# 8. Pairwise scatter plot matrix
print("\n3. Pairwise Relationships (Scatter Matrix):")
from pandas.plotting import scatter_matrix

scatter_matrix(df[['age', 'income', 'credit_score', 'loan_amount']], 
               figsize=(12, 12), alpha=0.5, diagonal='hist')
plt.suptitle('Pairwise Scatter Matrix', y=1.0)
plt.show()

## 7. Multivariate Analysis

**Why**: Understand complex relationships among multiple variables simultaneously.

**When to use**:
- Exploring high-dimensional data
- Understanding feature interactions
- Dimensionality reduction

**Techniques**:
- Correlation heatmaps
- Pair plots with hue
- Parallel coordinates
- PCA visualization
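
PCA visualization can be sketched with NumPy alone. The example below is an illustration under stated assumptions, not this notebook's workflow: the data is randomly generated (`X`, with a deliberately correlated second feature) and the decomposition uses SVD on standardized columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 3 features, where the second feature tracks the first
x1 = rng.normal(size=200)
X = np.column_stack([
    x1,
    2 * x1 + rng.normal(scale=0.1, size=200),
    rng.normal(size=200),
])

# Standardize so that no feature dominates because of its scale
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardized matrix; rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)

# Project onto the first two principal components for a 2D scatter plot
scores = Z @ Vt[:2].T

print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Projected shape:", scores.shape)
```

The two columns of `scores` can be passed to a scatter plot, colored by the target, to look for group separation in two dimensions.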

In [None]:
# ============================================
# 7. MULTIVARIATE ANALYSIS
# ============================================

print("=" * 60)
print("MULTIVARIATE ANALYSIS")
print("=" * 60)

# 1. Correlation heatmap
plt.figure(figsize=(10, 8))
corr_matrix = df[['age', 'income', 'credit_score', 'loan_amount', 'employment_years']].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, square=True, 
            linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

print("\n1. High Correlations (|r| > 0.5):")
# Find highly correlated pairs
high_corr = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.5:
            high_corr.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

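if not high_corr:
    print("   None found (expected here: the synthetic features are generated independently)")
for col1, col2, corr_val in high_corr:
    print(f"   {col1} <-> {col2}: {corr_val:.3f}")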

# 2. Pair plot with target variable
print("\n2. Pair Plot with Default Status:")
# Sample for faster plotting
df_sample = df.sample(min(500, len(df)), random_state=42)
sns.pairplot(df_sample[['age', 'income', 'credit_score', 'loan_amount', 'default']], 
             hue='default', diag_kind='kde', palette='Set1')
plt.suptitle('Pairplot by Default Status', y=1.02)
plt.show()

# 3. Grouped box plots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Income by education and default
df.boxplot(column='income', by=['education', 'default'], ax=axes[0, 0])
axes[0, 0].set_title('Income by Education and Default')
axes[0, 0].set_xlabel('Education - Default')

# Credit score by home ownership and default
df.boxplot(column='credit_score', by=['owns_home', 'default'], ax=axes[0, 1])
axes[0, 1].set_title('Credit Score by Home Ownership and Default')
axes[0, 1].set_xlabel('Owns Home - Default')

# Loan amount by education and home ownership
df.boxplot(column='loan_amount', by=['education', 'owns_home'], ax=axes[1, 0])
axes[1, 0].set_title('Loan Amount by Education and Home Ownership')
axes[1, 0].set_xlabel('Education - Owns Home')

# Age by education and default
df.boxplot(column='age', by=['education', 'default'], ax=axes[1, 1])
axes[1, 1].set_title('Age by Education and Default')
axes[1, 1].set_xlabel('Education - Default')

fig.suptitle('')  # clear the automatic "Boxplot grouped by ..." title added by pandas
plt.tight_layout()
plt.show()

# 4. Facet grid (multiple subplots)
print("\n3. Facet Grid Analysis:")
g = sns.FacetGrid(df, col='owns_home', row='default', hue='education', 
                  height=4, aspect=1.5, palette='Set2')
g.map(plt.scatter, 'credit_score', 'income', alpha=0.5)
g.add_legend()
plt.show()

# 5. 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

scatter = ax.scatter(df['age'], df['income'], df['credit_score'], 
                     c=df['default'], cmap='RdYlGn_r', alpha=0.6)
ax.set_xlabel('Age')
ax.set_ylabel('Income')
ax.set_zlabel('Credit Score')
ax.set_title('3D Scatter: Age, Income, Credit Score (colored by Default)')
plt.colorbar(scatter, label='Default')
plt.show()

# 6. Parallel coordinates
from pandas.plotting import parallel_coordinates

plt.figure(figsize=(14, 6))
# Normalize data for better visualization
df_normalized = df[['age', 'income', 'credit_score', 'loan_amount', 'default']].copy()
for col in ['age', 'income', 'credit_score', 'loan_amount']:
    df_normalized[col] = (df_normalized[col] - df_normalized[col].min()) / (df_normalized[col].max() - df_normalized[col].min())

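# Drop incomplete rows, fix the sample for reproducibility, and sort class labels
# so that green maps to 0 (no default) and red to 1 (default)
parallel_coordinates(df_normalized.dropna().sample(200, random_state=42), 'default',
                     color=['green', 'red'], alpha=0.5, sort_labels=True)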
plt.title('Parallel Coordinates Plot')
plt.ylabel('Normalized Value')
plt.grid(True, alpha=0.3)
plt.show()

# 7. Grouped statistics
print("\n4. Grouped Statistics (Education × Home Ownership × Default):")
grouped_stats = df.groupby(['education', 'owns_home', 'default']).agg({
    'income': 'mean',
    'credit_score': 'mean',
    'loan_amount': 'mean'
}).round(2)
print(grouped_stats)

# 8. Pivot table
print("\n5. Pivot Table (Average Income by Education and Home Ownership):")
pivot = df.pivot_table(values='income', index='education', 
                       columns='owns_home', aggfunc='mean')
print(pivot)

# Heatmap of pivot table
plt.figure(figsize=(8, 6))
sns.heatmap(pivot, annot=True, fmt='.0f', cmap='YlGnBu', linewidths=1)
plt.title('Average Income by Education and Home Ownership')
plt.show()

## 8. Outlier Detection

**Why**: Outliers can skew analysis and impact model performance.

**When to use**:
- Before modeling
- When cleaning data
- When investigating anomalies

**Methods**:
- **IQR Method**: Values beyond Q1-1.5×IQR or Q3+1.5×IQR
- **Z-Score**: |z| > 3 (assuming normal distribution)
- **Visual**: Box plots, scatter plots
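
The two numeric rules do not always agree. The sketch below applies both to a hypothetical eight-value sample (`values`, made up for illustration) that contains one obvious anomaly.

```python
import pandas as pd

# Hypothetical sample; 95 is the anomaly
values = pd.Series([10, 12, 11, 13, 12, 14, 11, 95])

# IQR method: fences at 1.5 x IQR beyond the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score method: distance from the mean in units of standard deviation
z = (values - values.mean()) / values.std(ddof=0)
z_outliers = values[z.abs() > 3]

print(f"IQR fences: [{lower:.3f}, {upper:.3f}]")
print("IQR outliers:", iqr_outliers.tolist())
print("Z-score outliers:", z_outliers.tolist())
```

Here the IQR rule flags 95 but the z-score rule flags nothing: with only eight points, a single extreme value inflates the standard deviation enough to keep its own |z| below 3. This is one reason to cross-check z-scores against a quartile-based rule or a plot on small or contaminated samples.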

In [None]:
# ============================================
# 8. OUTLIER DETECTION
# ============================================

print("=" * 60)
print("OUTLIER DETECTION")
print("=" * 60)

# 1. IQR Method
def detect_outliers_iqr(df, column):
    """Detect outliers using IQR method"""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    
    return outliers, lower_bound, upper_bound

print("\n1. IQR Method for Outlier Detection:")
for col in ['age', 'income', 'credit_score', 'loan_amount']:
    outliers, lower, upper = detect_outliers_iqr(df, col)
    print(f"\n   {col}:")
    print(f"      Lower bound: {lower:.2f}")
    print(f"      Upper bound: {upper:.2f}")
    print(f"      Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")

# 2. Z-Score Method
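def detect_outliers_zscore(df, column, threshold=3):
    """Detect outliers using Z-score method"""
    values = df[column].dropna()
    z_scores = np.abs(np.asarray(stats.zscore(values)))
    # Select by index label: positions in the NaN-dropped series do not
    # line up with positions in the original DataFrame
    return df.loc[values.index[z_scores > threshold]]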

print("\n2. Z-Score Method (|z| > 3):")
for col in ['age', 'income', 'credit_score', 'loan_amount']:
    outliers = detect_outliers_zscore(df, col)
    print(f"   {col}: {len(outliers)} outliers ({len(outliers)/len(df)*100:.1f}%)")

# 3. Box plots for visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

features_to_check = ['age', 'income', 'credit_score', 'loan_amount']

for idx, col in enumerate(features_to_check):
    df.boxplot(column=col, ax=axes[idx])
    axes[idx].set_title(f'Box Plot: {col}')
    axes[idx].set_ylabel(col)
    
    # Mark outlier boundaries
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    
    axes[idx].axhline(y=lower, color='r', linestyle='--', label=f'Lower: {lower:.0f}')
    axes[idx].axhline(y=upper, color='r', linestyle='--', label=f'Upper: {upper:.0f}')
    axes[idx].legend()

plt.tight_layout()
plt.show()

# 4. Scatter plot highlighting outliers
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Income outliers
outliers_income, lower_i, upper_i = detect_outliers_iqr(df, 'income')
axes[0].scatter(df.index, df['income'], alpha=0.5, label='Normal')
axes[0].scatter(outliers_income.index, outliers_income['income'], 
                color='red', alpha=0.8, label='Outliers')
axes[0].axhline(y=lower_i, color='r', linestyle='--')
axes[0].axhline(y=upper_i, color='r', linestyle='--')
axes[0].set_title('Income Outliers')
axes[0].set_xlabel('Index')
axes[0].set_ylabel('Income')
axes[0].legend()

# Credit score outliers
outliers_credit, lower_c, upper_c = detect_outliers_iqr(df, 'credit_score')
axes[1].scatter(df.index, df['credit_score'], alpha=0.5, label='Normal')
axes[1].scatter(outliers_credit.index, outliers_credit['credit_score'], 
                color='red', alpha=0.8, label='Outliers')
axes[1].axhline(y=lower_c, color='r', linestyle='--')
axes[1].axhline(y=upper_c, color='r', linestyle='--')
axes[1].set_title('Credit Score Outliers')
axes[1].set_xlabel('Index')
axes[1].set_ylabel('Credit Score')
axes[1].legend()

plt.tight_layout()
plt.show()

# 5. Outlier summary
print("\n3. Outlier Summary:")
outlier_summary = []

for col in ['age', 'income', 'credit_score', 'loan_amount', 'employment_years']:
    outliers_iqr, lower, upper = detect_outliers_iqr(df, col)
    outliers_z = detect_outliers_zscore(df, col)
    
    outlier_summary.append({
        'Feature': col,
        'IQR_Outliers': len(outliers_iqr),
        'IQR_Pct': f"{len(outliers_iqr)/len(df)*100:.1f}%",
        'Z_Outliers': len(outliers_z),
        'Z_Pct': f"{len(outliers_z)/len(df)*100:.1f}%",
        'Lower_Bound': f"{lower:.1f}",
        'Upper_Bound': f"{upper:.1f}"
    })

outlier_df = pd.DataFrame(outlier_summary)
print(outlier_df.to_string(index=False))

# 6. Handling outliers (examples)
print("\n4. Outlier Handling Strategies:")

# Strategy 1: Remove outliers
# Note: rows where the column is NaN fail both comparisons below, so they are dropped as well
df_no_outliers = df.copy()
for col in ['income', 'loan_amount']:
    outliers, lower, upper = detect_outliers_iqr(df_no_outliers, col)
    df_no_outliers = df_no_outliers[(df_no_outliers[col] >= lower) & (df_no_outliers[col] <= upper)]

print(f"   - Remove: {len(df)} → {len(df_no_outliers)} rows")

# Strategy 2: Cap outliers (winsorization)
df_capped = df.copy()
for col in ['income', 'loan_amount']:
    outliers, lower, upper = detect_outliers_iqr(df_capped, col)
    df_capped[col] = df_capped[col].clip(lower, upper)

print(f"   - Cap: Min income = {df_capped['income'].min():.0f}, Max = {df_capped['income'].max():.0f}")

# Strategy 3: Transform (log transformation)
df_transformed = df.copy()
df_transformed['income_log'] = np.log1p(df_transformed['income'])

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
df['income'].hist(bins=30, ax=axes[0], edgecolor='black')
axes[0].set_title('Original Income Distribution')
axes[0].set_xlabel('Income')

df_transformed['income_log'].hist(bins=30, ax=axes[1], edgecolor='black', color='green')
axes[1].set_title('Log-Transformed Income Distribution')
axes[1].set_xlabel('Log(Income)')

plt.tight_layout()
plt.show()

print(f"   - Transform: Skewness before = {df['income'].skew():.2f}, after = {df_transformed['income_log'].skew():.2f}")

## 9. Distribution Analysis

**Why**: Understanding distributions helps choose appropriate statistical tests and transformations.

**When to use**:
- Before modeling
- When selecting transformations
- When testing assumptions

**Key Concepts**:
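- **Normal**: Symmetric and bell-shaped, so mean ≈ median
- **Skewed**: Right (positive, long tail toward high values) or left (negative, long tail toward low values)
- **Kurtosis**: Tail weight relative to a normal distribution (heavy or light tails)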
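
To make these concepts concrete, here is a minimal sketch on synthetic data. The three samples and the names `samples` and `shape_summary` are assumptions for illustration, not part of the notebook's dataset. It relies on pandas reporting excess kurtosis (Fisher's definition), so a normal sample should land near 0.

```python
import numpy as np
import pandas as pd

# Synthetic samples (assumed for illustration; not the notebook's dataset)
rng = np.random.default_rng(42)
samples = pd.DataFrame({
    "normal": rng.normal(loc=50, scale=10, size=10_000),              # symmetric reference
    "right_skewed": rng.lognormal(mean=0.0, sigma=1.0, size=10_000),  # long right tail
    "heavy_tailed": rng.standard_t(df=5, size=10_000),                # symmetric, heavy tails
})

# One row per sample: location measures plus the two shape measures
shape_summary = pd.DataFrame({
    "mean": samples.mean(),
    "median": samples.median(),
    "skewness": samples.skew(),
    "excess_kurtosis": samples.kurtosis(),  # 0 corresponds to a normal distribution
})

print(shape_summary.round(2))
```

The right-skewed sample should show a mean well above its median and clearly positive skewness, while the Student-t sample should show positive excess kurtosis despite being roughly symmetric.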

In [None]:
# ============================================
# 9. DISTRIBUTION ANALYSIS
# ============================================

print("=" * 60)
print("DISTRIBUTION ANALYSIS")
print("=" * 60)

# 1. Distribution statistics
print("\n1. Distribution Metrics:")
distribution_stats = []

for col in ['age', 'income', 'credit_score', 'loan_amount', 'employment_years']:
    distribution_stats.append({
        'Feature': col,
        'Mean': df[col].mean(),
        'Median': df[col].median(),
        'Mode': df[col].mode().values[0] if len(df[col].mode()) > 0 else np.nan,
        'Skewness': df[col].skew(),
        'Kurtosis': df[col].kurtosis()
    })

dist_df = pd.DataFrame(distribution_stats)
print(dist_df.round(2).to_string(index=False))

print("\n   Interpretation:")
print("   - Skewness ~ 0: Symmetric")
print("   - Skewness > 0: Right-skewed (tail extends right)")
print("   - Skewness < 0: Left-skewed (tail extends left)")
print("   - Kurtosis ~ 0: Tails similar to a normal distribution (pandas reports excess kurtosis)")
print("   - Kurtosis > 0: Heavy tails (more outliers)")
print("   - Kurtosis < 0: Light tails (fewer outliers)")

# 2. Histogram + KDE + Normal curve
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

features = ['age', 'income', 'credit_score', 'loan_amount', 'employment_years']

for idx, col in enumerate(features):
    # Histogram
    data = df[col].dropna()
    axes[idx].hist(data, bins=30, density=True, alpha=0.6, color='skyblue', edgecolor='black')
    
    # KDE
    data.plot(kind='density', ax=axes[idx], color='red', linewidth=2, label='KDE')
    
    # Normal curve for comparison
    mu, sigma = data.mean(), data.std()
    x = np.linspace(data.min(), data.max(), 100)
    axes[idx].plot(x, stats.norm.pdf(x, mu, sigma), 'g--', linewidth=2, label='Normal')
    
    # Draw the mean and median lines before building the legend so their labels are included
    axes[idx].axvline(data.mean(), color='red', linestyle='--', alpha=0.5, label=f'Mean: {data.mean():.0f}')
    axes[idx].axvline(data.median(), color='green', linestyle='--', alpha=0.5, label=f'Median: {data.median():.0f}')
    
    axes[idx].set_title(f'Distribution: {col}')
    axes[idx].set_xlabel(col)
    axes[idx].set_ylabel('Density')
    axes[idx].legend()

plt.tight_layout()
plt.show()

# 3. Normality tests
from scipy.stats import shapiro, normaltest

print("\n2. Normality Tests (p-value > 0.05: no evidence against normality):")
normality_results = []

for col in ['age', 'income', 'credit_score', 'loan_amount']:
    data = df[col].dropna()
    
    # Shapiro-Wilk test: the p-value may be inaccurate above 5000 samples, so subsample (seeded for reproducibility)
    shapiro_stat, shapiro_p = shapiro(data.sample(min(5000, len(data)), random_state=42))
    
    # D'Agostino's K² test
    k2_stat, k2_p = normaltest(data)
    
    normality_results.append({
        'Feature': col,
        'Shapiro_p': f"{shapiro_p:.4f}",
        'K2_p': f"{k2_p:.4f}",
        'Normal?': 'Yes' if shapiro_p > 0.05 and k2_p > 0.05 else 'No'
    })

norm_df = pd.DataFrame(normality_results)
print(norm_df.to_string(index=False))

# 4. Q-Q plots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(features):
    data = df[col].dropna()
    stats.probplot(data, dist="norm", plot=axes[idx])
    axes[idx].set_title(f'Q-Q Plot: {col}')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 5. Transformations for normality
print("\n3. Testing Transformations:")

# Log transformation
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Income - original vs log
axes[0, 0].hist(df['income'].dropna(), bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].set_title(f'Income (Original) - Skew: {df["income"].skew():.2f}')
axes[0, 0].set_xlabel('Income')

income_log = np.log1p(df['income'].dropna())
axes[0, 1].hist(income_log, bins=30, edgecolor='black', alpha=0.7, color='green')
axes[0, 1].set_title(f'Income (Log) - Skew: {income_log.skew():.2f}')
axes[0, 1].set_xlabel('Log(Income)')

# Loan amount - original vs sqrt
axes[1, 0].hist(df['loan_amount'].dropna(), bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].set_title(f'Loan Amount (Original) - Skew: {df["loan_amount"].skew():.2f}')
axes[1, 0].set_xlabel('Loan Amount')

loan_sqrt = np.sqrt(df['loan_amount'].dropna())
axes[1, 1].hist(loan_sqrt, bins=30, edgecolor='black', alpha=0.7, color='orange')
axes[1, 1].set_title(f'Loan Amount (Sqrt) - Skew: {loan_sqrt.skew():.2f}')
axes[1, 1].set_xlabel('Sqrt(Loan Amount)')

plt.tight_layout()
plt.show()

# 6. Distribution by category
print("\n4. Distribution by Category:")
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Income by default
for default_val in [0, 1]:
    df[df['default'] == default_val]['income'].plot(kind='density', ax=axes[0, 0], 
                                                      label=f'Default={default_val}', alpha=0.7)
axes[0, 0].set_title('Income Distribution by Default')
axes[0, 0].set_xlabel('Income')
axes[0, 0].legend()

# Credit score by education (drop missing categories, which would produce an empty group)
for edu in df['education'].dropna().unique():
    df[df['education'] == edu]['credit_score'].plot(kind='density', ax=axes[0, 1], 
                                                      label=edu, alpha=0.7)
axes[0, 1].set_title('Credit Score Distribution by Education')
axes[0, 1].set_xlabel('Credit Score')
axes[0, 1].legend()

# Age by home ownership
for home in df['owns_home'].unique():
    df[df['owns_home'] == home]['age'].plot(kind='density', ax=axes[1, 0], 
                                             label=f'Owns Home={home}', alpha=0.7)
axes[1, 0].set_title('Age Distribution by Home Ownership')
axes[1, 0].set_xlabel('Age')
axes[1, 0].legend()

# Loan amount by default
for default_val in [0, 1]:
    df[df['default'] == default_val]['loan_amount'].plot(kind='density', ax=axes[1, 1], 
                                                           label=f'Default={default_val}', alpha=0.7)
axes[1, 1].set_title('Loan Amount Distribution by Default')
axes[1, 1].set_xlabel('Loan Amount')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

## 10. Correlation Analysis

**Why**: Identify relationships between variables to inform feature selection and engineering.

**When to use**:
- Feature selection
- Multicollinearity detection
- Understanding feature relationships

**Methods**:
- **Pearson**: Linear relationships
- **Spearman**: Monotonic relationships
- **Kendall**: Rank concordance; suited to ordinal data, small samples, and ties
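
As a rough illustration of how the three methods can disagree, the sketch below uses a synthetic, strictly increasing but nonlinear relationship (`y = exp(x)`). The data and the names `demo` and `results` are hypothetical. Pearson should fall short of 1 because the relationship is not linear, while the rank-based Spearman and Kendall coefficients should both be 1 because the ordering is preserved perfectly.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumed for illustration): y grows monotonically but not linearly with x
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=500)
demo = pd.DataFrame({"x": x, "y": np.exp(x)})

# Compute the same pairwise association with each method
results = {
    method: demo["x"].corr(demo["y"], method=method)
    for method in ("pearson", "spearman", "kendall")
}

for method, value in results.items():
    print(f"{method:>8s}: {value:.3f}")
```

A large gap between Pearson and Spearman on real data is a hint that a relationship is monotonic but nonlinear, which may motivate a transformation before fitting a linear model.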

In [None]:
# ============================================
# 10. CORRELATION ANALYSIS
# ============================================

print("=" * 60)
print("CORRELATION ANALYSIS")
print("=" * 60)

# 1. Pearson correlation (linear relationships)
print("\n1. Pearson Correlation Matrix:")
pearson_corr = df[['age', 'income', 'credit_score', 'loan_amount', 'employment_years']].corr(method='pearson')
print(pearson_corr.round(3))

# 2. Spearman correlation (monotonic relationships)
print("\n2. Spearman Correlation Matrix:")
spearman_corr = df[['age', 'income', 'credit_score', 'loan_amount', 'employment_years']].corr(method='spearman')
print(spearman_corr.round(3))

# 3. Correlation heatmap
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Pearson
sns.heatmap(pearson_corr, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8}, ax=axes[0])
axes[0].set_title('Pearson Correlation Heatmap')

# Spearman
sns.heatmap(spearman_corr, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=1, cbar_kws={"shrink": 0.8}, ax=axes[1])
axes[1].set_title('Spearman Correlation Heatmap')

plt.tight_layout()
plt.show()

# 4. Correlation with target variable
print("\n3. Correlation with Target (Default):")
target_corr = df[['age', 'income', 'credit_score', 'loan_amount', 'employment_years', 'default']].corr()['default'].sort_values(ascending=False)
print(target_corr)

# Visualize
plt.figure(figsize=(10, 6))
target_corr.drop('default').plot(kind='barh', color='steelblue', edgecolor='black')
plt.title('Feature Correlation with Default')
plt.xlabel('Correlation Coefficient')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.grid(True, alpha=0.3)
plt.show()

# 5. Highly correlated features (multicollinearity check)
print("\n4. Highly Correlated Pairs (|r| > 0.7):")
high_corr_pairs = []

for i in range(len(pearson_corr.columns)):
    for j in range(i+1, len(pearson_corr.columns)):
        if abs(pearson_corr.iloc[i, j]) > 0.7:
            high_corr_pairs.append({
                'Feature 1': pearson_corr.columns[i],
                'Feature 2': pearson_corr.columns[j],
                'Correlation': pearson_corr.iloc[i, j]
            })

if high_corr_pairs:
    corr_pairs_df = pd.DataFrame(high_corr_pairs)
    print(corr_pairs_df.to_string(index=False))
else:
    print("   ✓ No highly correlated pairs found (good for modeling!)")

# 6. Correlation triangle (lower triangle only)
plt.figure(figsize=(10, 8))
# np.triu flags the upper triangle and diagonal; seaborn hides masked cells, so the lower triangle remains
mask = np.triu(np.ones_like(pearson_corr, dtype=bool))
sns.heatmap(pearson_corr, mask=mask, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix (Lower Triangle)')
plt.show()

# 7. Correlation interpretation guide (rule-of-thumb thresholds)
print("\n5. Correlation Interpretation:")
print("   |r| < 0.3  : Weak")
print("   0.3 ≤ |r| < 0.7 : Moderate")
print("   |r| ≥ 0.7  : Strong")
print("\n   r > 0 : Positive (both increase together)")
print("   r < 0 : Negative (one increases, other decreases)")

# 8. Point-biserial correlation (continuous vs binary)
from scipy.stats import pointbiserialr

print("\n6. Point-Biserial Correlation (Continuous vs Binary Target):")
for col in ['age', 'income', 'credit_score', 'loan_amount', 'employment_years']:
    # Remove NaN values
    mask = df[col].notna() & df['default'].notna()
    corr, p_value = pointbiserialr(df.loc[mask, 'default'], df.loc[mask, col])
    print(f"   {col:20s}: r={corr:6.3f}, p-value={p_value:.4f}")

# 9. Categorical correlation (Cramér's V)
def cramers_v(x, y):
    """Calculate Cramér's V statistic for two categorical variables"""
    contingency = pd.crosstab(x, y)
    # Note: chi2_contingency applies Yates' continuity correction to 2x2 tables by default
    chi2 = stats.chi2_contingency(contingency)[0]
    n = contingency.sum().sum()
    phi2 = chi2 / n
    r, k = contingency.shape
    return np.sqrt(phi2 / min(k-1, r-1))

print("\n7. Cramér's V (Categorical Associations):")
cat_vars = ['education', 'owns_home', 'default']

for i in range(len(cat_vars)):
    for j in range(i+1, len(cat_vars)):
        v = cramers_v(df[cat_vars[i]], df[cat_vars[j]])
        print(f"   {cat_vars[i]:15s} × {cat_vars[j]:15s}: V={v:.3f}")

# 10. Correlation scatter plots with trend lines
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

corr_pairs = [
    ('age', 'income'),
    ('credit_score', 'income'),
    ('employment_years', 'income'),
    ('credit_score', 'loan_amount')
]

for idx, (x, y) in enumerate(corr_pairs):
    # Scatter plot
    axes[idx].scatter(df[x], df[y], alpha=0.5, s=20)
    
    # Trend line
    mask = df[x].notna() & df[y].notna()
    z = np.polyfit(df.loc[mask, x], df.loc[mask, y], 1)
    p = np.poly1d(z)
    axes[idx].plot(df.loc[mask, x].sort_values(), 
                   p(df.loc[mask, x].sort_values()), 
                   "r--", alpha=0.8, linewidth=2)
    
    # Calculate correlation
    corr = df[[x, y]].corr().iloc[0, 1]
    axes[idx].set_title(f'{y} vs {x} (r={corr:.3f})')
    axes[idx].set_xlabel(x)
    axes[idx].set_ylabel(y)
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 11. EDA Best Practices & Summary

**Key Takeaways**:

### **EDA Workflow**
1. **Load & Inspect** → Understand structure
2. **Clean** → Handle missing data, duplicates
3. **Explore** → Univariate, bivariate, multivariate
4. **Transform** → Handle outliers, normalize, encode
5. **Visualize** → Tell data stories
6. **Document** ‚Üí Record findings and decisions

### **Common Pitfalls to Avoid**
- Skipping EDA and jumping to modeling
- Ignoring missing data patterns
- Not checking for outliers
- Assuming normality without testing
- Overlooking data quality issues
- Not documenting assumptions

### **Tools & Libraries**
- **pandas**: Data manipulation
- **numpy**: Numerical operations
- **matplotlib/seaborn**: Visualization
- **scipy**: Statistical tests
- **plotly**: Interactive plots
- **ydata-profiling** (formerly pandas-profiling): Automated EDA reports

### **When is EDA Complete?**
You've answered:
- What is the data structure?
- Are there quality issues?
- What are the distributions?
- How do variables relate?
- What transformations are needed?
- Which features matter most?

In [None]:
# ============================================
# COMPREHENSIVE EDA REPORT GENERATOR
# ============================================

def generate_eda_report(df, target_column=None):
    """
    Generate a comprehensive EDA report for a dataset
    
    Parameters:
    -----------
    df : pd.DataFrame
        The dataset to analyze
    target_column : str, optional
        Name of the target variable for supervised learning
    
    Returns:
    --------
    dict : Comprehensive report with all EDA insights
    """
    
    report = {}
    
    # 1. Dataset Overview
    report['overview'] = {
        'shape': df.shape,
        'rows': df.shape[0],
        'columns': df.shape[1],
        'memory_mb': df.memory_usage(deep=True).sum() / 1024**2,
        'duplicates': df.duplicated().sum()
    }
    
    # 2. Column Types
    report['column_types'] = {
        'numerical': df.select_dtypes(include=[np.number]).columns.tolist(),
        'categorical': df.select_dtypes(include=['object', 'category']).columns.tolist(),
        'datetime': df.select_dtypes(include=['datetime']).columns.tolist()
    }
    
    # 3. Missing Data
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100)
    report['missing_data'] = {
        col: {'count': int(missing[col]), 'percentage': float(missing_pct[col])}
        for col in df.columns if missing[col] > 0
    }
    
    # 4. Numerical Statistics
    report['numerical_stats'] = {}
    for col in report['column_types']['numerical']:
        if df[col].nunique() > 1:  # Skip constant columns
            report['numerical_stats'][col] = {
                'mean': float(df[col].mean()),
                'median': float(df[col].median()),
                'std': float(df[col].std()),
                'min': float(df[col].min()),
                'max': float(df[col].max()),
                'skewness': float(df[col].skew()),
                'kurtosis': float(df[col].kurtosis()),
                'unique': int(df[col].nunique())
            }
    
    # 5. Categorical Statistics
    report['categorical_stats'] = {}
    for col in report['column_types']['categorical']:
        report['categorical_stats'][col] = {
            'unique': int(df[col].nunique()),
            'top_value': str(df[col].mode().values[0]) if len(df[col].mode()) > 0 else None,
            'top_freq': int(df[col].value_counts().iloc[0]) if df[col].notna().any() else 0,  # all-NaN columns have no counts
            'distribution': df[col].value_counts().to_dict()
        }
    
    # 6. Outliers (IQR method)
    report['outliers'] = {}
    for col in report['column_types']['numerical']:
        if df[col].nunique() > 1:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR
            outliers = df[(df[col] < lower) | (df[col] > upper)]
            
            report['outliers'][col] = {
                'count': len(outliers),
                'percentage': float(len(outliers) / len(df) * 100),
                'lower_bound': float(lower),
                'upper_bound': float(upper)
            }
    
    # 7. Correlations
    if len(report['column_types']['numerical']) > 1:
        corr_matrix = df[report['column_types']['numerical']].corr()
        
        # Find high correlations
        high_corr = []
        for i in range(len(corr_matrix.columns)):
            for j in range(i+1, len(corr_matrix.columns)):
                if abs(corr_matrix.iloc[i, j]) > 0.7:
                    high_corr.append({
                        'var1': corr_matrix.columns[i],
                        'var2': corr_matrix.columns[j],
                        'correlation': float(corr_matrix.iloc[i, j])
                    })
        
        report['high_correlations'] = high_corr
    
    # 8. Target Analysis (if provided)
    if target_column and target_column in df.columns:
        report['target_analysis'] = {
            'name': target_column,
            'type': str(df[target_column].dtype),
            'distribution': df[target_column].value_counts().to_dict()
        }
        
        # Correlation with target (if numerical)
        if pd.api.types.is_numeric_dtype(df[target_column]):  # also covers int32, float32, and nullable dtypes
            target_corr = df.select_dtypes(include=[np.number]).corr()[target_column].sort_values(ascending=False)
            report['target_analysis']['correlations'] = target_corr.to_dict()
    
    return report

# Generate report for our dataset
print("=" * 60)
print("GENERATING COMPREHENSIVE EDA REPORT")
print("=" * 60)

report = generate_eda_report(df, target_column='default')

# Display report sections
print("\n📊 DATASET OVERVIEW:")
print(f"   Rows: {report['overview']['rows']:,}")
print(f"   Columns: {report['overview']['columns']}")
print(f"   Memory: {report['overview']['memory_mb']:.2f} MB")
print(f"   Duplicates: {report['overview']['duplicates']}")

print("\n📈 COLUMN TYPES:")
print(f"   Numerical: {len(report['column_types']['numerical'])}")
print(f"   Categorical: {len(report['column_types']['categorical'])}")

print("\n❌ MISSING DATA:")
if report['missing_data']:
    for col, info in report['missing_data'].items():
        print(f"   {col}: {info['count']} ({info['percentage']:.1f}%)")
else:
    print("   ✓ No missing data!")

print("\n📊 NUMERICAL FEATURES SUMMARY:")
# Use col_stats as the loop variable so the scipy.stats module used in earlier cells is not shadowed
for col, col_stats in list(report['numerical_stats'].items())[:3]:  # Show first 3
    print(f"\n   {col}:")
    print(f"      Mean: {col_stats['mean']:.2f}")
    print(f"      Median: {col_stats['median']:.2f}")
    print(f"      Std Dev: {col_stats['std']:.2f}")
    print(f"      Skewness: {col_stats['skewness']:.2f}")

print("\n🏷️  CATEGORICAL FEATURES SUMMARY:")
# Use col_stats as the loop variable so the scipy.stats module used in earlier cells is not shadowed
for col, col_stats in report['categorical_stats'].items():
    print(f"\n   {col}:")
    print(f"      Unique values: {col_stats['unique']}")
    print(f"      Most common: {col_stats['top_value']} ({col_stats['top_freq']} times)")

print("\n⚠️  OUTLIERS DETECTED:")
for col, info in list(report['outliers'].items())[:5]:  # Show first 5
    if info['count'] > 0:
        print(f"   {col}: {info['count']} ({info['percentage']:.1f}%)")

print("\n🔗 HIGH CORRELATIONS (|r| > 0.7):")
if report.get('high_correlations'):
    for corr in report['high_correlations']:
        print(f"   {corr['var1']} × {corr['var2']}: {corr['correlation']:.3f}")
else:
    print("   ✓ No highly correlated features!")

print("\n🎯 TARGET VARIABLE ANALYSIS:")
if 'target_analysis' in report:
    print(f"   Name: {report['target_analysis']['name']}")
    print(f"   Type: {report['target_analysis']['type']}")
    print(f"   Distribution: {report['target_analysis']['distribution']}")

print("\n" + "=" * 60)
print("✅ EDA REPORT COMPLETE")
print("=" * 60)

## Final Summary & Recommendations

### **EDA Process Completed! 🎉**

You've now learned comprehensive EDA techniques covering:

1. ✅ **Data Loading & Inspection** - Understanding dataset structure
2. ✅ **Data Types & Structure** - Identifying numerical and categorical features
3. ✅ **Missing Data Analysis** - Detecting and handling gaps
4. ✅ **Descriptive Statistics** - Central tendency, dispersion, shape
5. ✅ **Univariate Analysis** - Individual variable distributions
6. ✅ **Bivariate Analysis** - Relationships between pairs
7. ✅ **Multivariate Analysis** - Complex multi-variable patterns
8. ✅ **Outlier Detection** - Identifying anomalies
9. ✅ **Distribution Analysis** - Understanding data shapes
10. ✅ **Correlation Analysis** - Feature relationships
11. ✅ **Best Practices** - Professional EDA workflow

### **Next Steps for Your Data Science Journey**

#### **Immediate Actions**
- Practice EDA on real datasets (Kaggle, UCI ML Repository)
- Create EDA templates for different data types
- Build a portfolio of EDA projects

#### **Advanced EDA Topics to Explore**
- **Time Series EDA**: Trend, seasonality, autocorrelation
- **Text Data EDA**: Word frequency, n-grams, sentiment
- **Image Data EDA**: Pixel distributions, image statistics
- **Geospatial EDA**: Maps, spatial patterns
- **Interactive EDA**: Plotly, Bokeh for dashboards

#### **Tools to Master**
- **ydata-profiling** (formerly pandas-profiling): Automated EDA reports
- **sweetviz**: Beautiful comparison reports
- **dtale**: Interactive data exploration
- **Tableau/Power BI**: Business intelligence dashboards

#### **Machine Learning Pipeline**
After EDA, proceed to:
1. **Feature Engineering** - Create new features based on insights
2. **Data Preprocessing** - Scale, encode, transform
3. **Model Selection** - Choose algorithms based on data characteristics
4. **Model Training** - Fit models to data
5. **Model Evaluation** - Assess performance
6. **Deployment** - Put models into production

### **Resources for Further Learning**

#### **Books**
- "Python for Data Analysis" by Wes McKinney
- "Storytelling with Data" by Cole Nussbaumer Knaflic
- "The Art of Statistics" by David Spiegelhalter

#### **Online Courses**
- Kaggle Learn: Data Visualization
- DataCamp: Exploratory Data Analysis in Python
- Coursera: Applied Data Science with Python

#### **Practice Datasets**
- **Kaggle Datasets**: Diverse, real-world data
- **UCI ML Repository**: Classic ML datasets
- **Data.gov**: Government open data
- **Google Dataset Search**: Find any dataset

### **Key Principles to Remember**

1. **Always start with EDA** - Never skip this step
2. **Visualize, visualize, visualize** - A picture is worth a thousand numbers
3. **Question your data** - Don't assume, verify
4. **Document everything** - Record assumptions and decisions
5. **Iterate** - EDA is not linear, revisit as needed
6. **Communicate insights** - Share findings clearly
7. **Think critically** - Correlation ≠ Causation
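
One way to see why correlation does not imply causation is a hidden common cause. The sketch below is a synthetic example under stated assumptions (the variables `x`, `y`, `z` and the helper `residuals` are hypothetical). Here `x` and `y` never influence each other, yet they correlate strongly because both depend on `z`. Removing the linear effect of `z` from each, which is a simple form of partial correlation, should bring the correlation close to zero.

```python
import numpy as np

# Synthetic setup (assumed for illustration): z drives both x and y
rng = np.random.default_rng(1)
n = 5_000
z = rng.normal(size=n)                 # hidden common cause (confounder)
x = 2.0 * z + rng.normal(size=n)       # x depends on z only
y = -1.5 * z + rng.normal(size=n)      # y depends on z only, never on x

# Raw correlation between x and y looks strong
raw_r = np.corrcoef(x, y)[0, 1]

def residuals(target, control):
    """Return what is left of target after removing a linear fit on control."""
    slope, intercept = np.polyfit(control, target, 1)
    return target - (slope * control + intercept)

# Partial correlation: correlate what remains of x and y once z is accounted for
partial_r = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(f"Raw correlation(x, y):          {raw_r:.3f}")
print(f"Partial correlation given z:    {partial_r:.3f}")
```

In real data the confounder is rarely known in advance, so a strong correlation found during EDA is best treated as a lead to investigate rather than as evidence of a causal effect.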

### **Final Thought**

> "Exploratory Data Analysis can never be the whole story, but nothing else can serve as the foundation stone."
> -- John Tukey (Pioneer of EDA)

**You now have the skills to:**
- 🔍 Investigate any dataset systematically
- 📊 Create insightful visualizations
- 🧮 Perform statistical analysis
- 🎯 Identify patterns and anomalies
- 💡 Extract actionable insights
- 🚀 Prepare data for machine learning

**Happy Exploring! 🚀📊**