# Notebook 02: Exploratory Data Analysis
## CLO Loan-Level Liquidity Predictor

Deep dive into loan characteristics, correlations, and relationships with liquidity.

---

**Objectives:**
1. Understand the distribution of each feature
2. Identify correlations between numeric variables
3. Analyze liquidity tier characteristics
4. Explore credit rating patterns
5. Investigate industry sector differences
6. Identify predictive features for liquidity modeling

**Prerequisites:**
- Notebook 01 completed (data collection)
- `data/synthetic_loans.csv` available

---

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

# Configure visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.titlesize'] = 13
plt.rcParams['axes.labelsize'] = 11
warnings.filterwarnings('ignore')

# Load the synthetic loan data
data_path = Path('..') / 'data' / 'synthetic_loans.csv'
df = pd.read_csv(data_path)

print(f"Loaded {len(df):,} loans from {data_path}")
print(f"Columns: {list(df.columns)}")

## 1. Data Overview

Let's examine the structure, data types, and summary statistics of our loan dataset.

In [None]:
# Display basic info about the dataset
print("=" * 60)
print("DATASET INFORMATION")
print("=" * 60)
print(f"\nShape: {df.shape[0]:,} rows x {df.shape[1]} columns")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

print("\n" + "-" * 60)
print("DATA TYPES")
print("-" * 60)
df.info()

print("\n" + "-" * 60)
print("MISSING VALUES")
print("-" * 60)
missing = df.isnull().sum()
if missing.sum() == 0:
    print("No missing values detected.")
else:
    print(missing[missing > 0])

print("\n" + "-" * 60)
print("NUMERIC SUMMARY STATISTICS")
print("-" * 60)
display(df.describe().round(2))

print("\n" + "-" * 60)
print("CATEGORICAL SUMMARY")
print("-" * 60)
print(f"\nCredit Ratings: {df['credit_rating'].nunique()} unique values")
print(df['credit_rating'].value_counts().to_string())
print(f"\nIndustry Sectors: {df['industry_sector'].nunique()} unique values")
print(df['industry_sector'].value_counts().to_string())
print(f"\nCovenant Lite: {df['covenant_lite'].value_counts().to_dict()}")
print(f"\nLiquidity Tiers: {sorted(df['liquidity_tier'].unique())}")
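
The overview above only reports what is in the file. A minimal sketch of a schema check that fails early is shown below. The column list is taken from the columns this notebook references; the helper name `check_schema`, the constant `EXPECTED_COLUMNS`, and the assumption that `loan_id` should be unique are illustrative, not part of the original pipeline.

```python
import pandas as pd

# Columns referenced elsewhere in this notebook
EXPECTED_COLUMNS = [
    'loan_id', 'facility_size', 'current_spread', 'time_to_maturity',
    'trading_volume_30d', 'bid_ask_spread', 'credit_rating',
    'industry_sector', 'covenant_lite', 'liquidity_tier',
]


def check_schema(frame: pd.DataFrame) -> list:
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = []

    missing_cols = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing_cols:
        problems.append(f"missing columns: {missing_cols}")

    # Tiers are documented as 1 (most liquid) through 5 (illiquid)
    if 'liquidity_tier' in frame.columns:
        bad_tier = ~frame['liquidity_tier'].isin([1, 2, 3, 4, 5])
        if bad_tier.any():
            problems.append(f"{int(bad_tier.sum())} rows with liquidity_tier outside 1-5")

    # Assumption: loan_id is meant to identify a loan uniquely
    if 'loan_id' in frame.columns and frame['loan_id'].duplicated().any():
        problems.append("duplicate loan_id values")

    return problems


# Tiny hand-made frame for illustration only (not the project data)
demo = pd.DataFrame({
    'loan_id': ['L1', 'L2', 'L2'],
    'facility_size': [250.0, 500.0, 800.0],
    'liquidity_tier': [1, 3, 7],
})
for problem in check_schema(demo):
    print(problem)
```

In the notebook itself the same function could be called as `check_schema(df)` right after loading.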

## 2. Univariate Analysis

Examining the distribution of each feature individually to understand data ranges and identify any anomalies.

In [None]:
# Numeric column distributions
numeric_cols = ['facility_size', 'current_spread', 'time_to_maturity', 
                'trading_volume_30d', 'bid_ask_spread']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

colors = sns.color_palette('husl', len(numeric_cols))

for idx, col in enumerate(numeric_cols):
    ax = axes[idx]
    
    # Histogram with KDE
    sns.histplot(df[col], kde=True, ax=ax, color=colors[idx], edgecolor='white', alpha=0.7)
    
    # Add mean and median lines
    mean_val = df[col].mean()
    median_val = df[col].median()
    ax.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.1f}')
    ax.axvline(median_val, color='blue', linestyle=':', linewidth=2, label=f'Median: {median_val:.1f}')
    
    ax.set_title(col.replace('_', ' ').title())
    ax.set_xlabel('')
    ax.legend(fontsize=9)

# Liquidity tier distribution in last subplot
ax = axes[5]
tier_counts = df['liquidity_tier'].value_counts().sort_index()
tier_colors = ['#27ae60', '#2ecc71', '#f39c12', '#e67e22', '#c0392b']
bars = ax.bar(tier_counts.index, tier_counts.values, color=tier_colors, edgecolor='white')
ax.set_xlabel('Liquidity Tier')
ax.set_ylabel('Count')
ax.set_title('Liquidity Tier Distribution\n(1=Most Liquid, 5=Illiquid)')

# Add percentage labels
for bar, count in zip(bars, tier_counts.values):
    pct = count / len(df) * 100
    ax.annotate(f'{pct:.1f}%', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.suptitle('Distribution of Numeric Features', fontsize=14, fontweight='bold', y=1.02)
plt.show()

# Categorical distributions
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Credit Rating
ax = axes[0]
rating_order = ['BB+', 'BB', 'BB-', 'B+', 'B', 'B-', 'CCC+', 'CCC']
rating_counts = df['credit_rating'].value_counts().reindex(rating_order)
colors_rating = sns.color_palette('RdYlGn_r', len(rating_order))
bars = ax.bar(range(len(rating_counts)), rating_counts.values, color=colors_rating, edgecolor='white')
ax.set_xticks(range(len(rating_order)))
ax.set_xticklabels(rating_order)
ax.set_xlabel('Credit Rating')
ax.set_ylabel('Count')
ax.set_title('Credit Rating Distribution\n(Better to Worse)')

# Industry Sector
ax = axes[1]
sector_counts = df['industry_sector'].value_counts()
colors_sector = sns.color_palette('Set2', len(sector_counts))
bars = ax.barh(range(len(sector_counts)), sector_counts.values, color=colors_sector, edgecolor='white')
ax.set_yticks(range(len(sector_counts)))
ax.set_yticklabels(sector_counts.index)
ax.set_xlabel('Count')
ax.set_title('Industry Sector Distribution')
ax.invert_yaxis()

# Covenant Lite
ax = axes[2]
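cov_counts = df['covenant_lite'].value_counts()
# Derive labels from the index so they cannot be swapped when standard loans
# outnumber covenant-lite ones (assumes covenant_lite is boolean or 0/1)
cov_labels = ['Covenant-Lite' if bool(flag) else 'Standard' for flag in cov_counts.index]
colors_cov = ['#3498db', '#95a5a6']
wedges, texts, autotexts = ax.pie(cov_counts.values, labels=cov_labels,
                                   autopct='%1.1f%%', colors=colors_cov, startangle=90,
                                   explode=(0.05, 0))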
ax.set_title('Covenant-Lite vs Standard Loans')

plt.tight_layout()
plt.suptitle('Distribution of Categorical Features', fontsize=14, fontweight='bold', y=1.02)
plt.show()
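
The histograms above show anomalies visually. One common way to flag them numerically is the 1.5 x IQR rule, sketched below together with a skewness check. The helper name `iqr_outlier_mask`, the multiplier `k=1.5`, and the sample values are assumptions chosen for illustration; they are not taken from the loan dataset.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of points lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up facility sizes ($M) with one very large deal
demo = pd.Series([100, 120, 130, 150, 160, 180, 200, 2500], name='facility_size')

mask = iqr_outlier_mask(demo)
print("Flagged values:")
print(demo[mask])

# Positive skewness indicates a long right tail
print(f"Skewness: {demo.skew():.2f}")
```

Applied to the notebook data, this could be looped over `numeric_cols`, for example `df[numeric_cols].apply(iqr_outlier_mask).sum()` to count flagged rows per feature.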

## 3. Correlation Analysis

Examining relationships between numeric variables to identify potential predictors of liquidity.

In [None]:
# Select numeric columns for correlation analysis
numeric_df = df[['facility_size', 'current_spread', 'time_to_maturity', 
                 'trading_volume_30d', 'bid_ask_spread', 'liquidity_tier']]

# Calculate correlation matrix
corr_matrix = numeric_df.corr()

# Create correlation heatmap
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Heatmap
ax = axes[0]
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='RdBu_r', center=0,
            fmt='.2f', square=True, linewidths=0.5, ax=ax,
            cbar_kws={'shrink': 0.8, 'label': 'Correlation'})
ax.set_title('Correlation Matrix\n(Lower Triangle)', fontsize=12, fontweight='bold')

# Correlation with liquidity tier (target variable)
ax = axes[1]
target_corr = corr_matrix['liquidity_tier'].drop('liquidity_tier').sort_values()
colors_corr = ['#e74c3c' if x < 0 else '#27ae60' for x in target_corr.values]
bars = ax.barh(range(len(target_corr)), target_corr.values, color=colors_corr, edgecolor='white')
ax.set_yticks(range(len(target_corr)))
ax.set_yticklabels([col.replace('_', ' ').title() for col in target_corr.index])
ax.set_xlabel('Correlation Coefficient')
ax.set_title('Correlation with Liquidity Tier\n(Target Variable)', fontsize=12, fontweight='bold')
ax.axvline(0, color='black', linewidth=0.5)

# Add value labels
for bar, val in zip(bars, target_corr.values):
    x_pos = val + 0.02 if val >= 0 else val - 0.02
    ha = 'left' if val >= 0 else 'right'
    ax.annotate(f'{val:.3f}', xy=(x_pos, bar.get_y() + bar.get_height()/2),
                ha=ha, va='center', fontsize=10)

plt.tight_layout()
plt.show()

# Print key correlations
print("\n" + "=" * 60)
print("KEY CORRELATION INSIGHTS")
print("=" * 60)
print("\nCorrelations with Liquidity Tier (target):")
for feature, corr in target_corr.items():
    direction = "higher tier (less liquid)" if corr > 0 else "lower tier (more liquid)"
    strength = "strong" if abs(corr) > 0.5 else "moderate" if abs(corr) > 0.3 else "weak"
    print(f"  {feature}: {corr:+.3f} ({strength}, associated with {direction})")

# Scatter matrix for key features
print("\n" + "-" * 60)
print("SCATTER MATRIX")
print("-" * 60)

key_features = ['facility_size', 'trading_volume_30d', 'bid_ask_spread', 'liquidity_tier']
# No plt.figure() call here: sns.pairplot creates its own figure,
# so a separate one would only render as an empty plot
scatter_df = df[key_features].copy()

# Create pair plot with liquidity tier as hue
g = sns.pairplot(scatter_df, hue='liquidity_tier', palette='RdYlGn_r', 
                  diag_kind='kde', plot_kws={'alpha': 0.5, 's': 20},
                  corner=True)
g.fig.suptitle('Scatter Matrix of Key Features by Liquidity Tier', y=1.02, fontsize=14, fontweight='bold')
plt.show()
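
`DataFrame.corr()` defaults to Pearson correlation, which measures linear association. Because `liquidity_tier` is an ordinal label, a rank-based measure such as Spearman may be a useful cross-check. The sketch below compares the two on a small made-up frame; the values are illustrative only.

```python
import pandas as pd

# Hand-made example: volume falls and bid-ask widens as the tier number rises
demo = pd.DataFrame({
    'trading_volume_30d': [50.0, 40.0, 22.0, 20.0, 5.0, 1.0],
    'bid_ask_spread':     [10.0, 15.0, 30.0, 45.0, 90.0, 300.0],
    'liquidity_tier':     [1, 1, 2, 3, 4, 5],
})

# Pearson captures linear association; Spearman only uses the rank ordering
pearson = demo.corr(method='pearson')['liquidity_tier'].drop('liquidity_tier')
spearman = demo.corr(method='spearman')['liquidity_tier'].drop('liquidity_tier')

comparison = pd.DataFrame({'pearson': pearson, 'spearman': spearman}).round(3)
print(comparison)
```

In the notebook, the same comparison could be run on `numeric_df`. A large gap between the two columns would suggest a monotonic but non-linear relationship.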

## 4. Liquidity Tier Analysis

Detailed examination of how loan characteristics vary across liquidity tiers.

In [None]:
# Box plots: Features by Liquidity Tier
features_to_plot = ['facility_size', 'current_spread', 'trading_volume_30d', 'bid_ask_spread']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

tier_palette = {1: '#27ae60', 2: '#2ecc71', 3: '#f39c12', 4: '#e67e22', 5: '#c0392b'}

for idx, feature in enumerate(features_to_plot):
    ax = axes[idx]
    sns.boxplot(data=df, x='liquidity_tier', y=feature, palette=tier_palette, ax=ax)
    
    # Add mean markers
    means = df.groupby('liquidity_tier')[feature].mean()
    ax.scatter(range(len(means)), means.values, color='red', marker='D', s=50, zorder=5, label='Mean')
    
    ax.set_xlabel('Liquidity Tier (1=Most Liquid, 5=Illiquid)')
    ax.set_ylabel(feature.replace('_', ' ').title())
    ax.set_title(f'{feature.replace("_", " ").title()} by Liquidity Tier')
    ax.legend(loc='upper right')

plt.tight_layout()
plt.suptitle('Feature Distributions Across Liquidity Tiers', fontsize=14, fontweight='bold', y=1.02)
plt.show()

# Summary table by liquidity tier
print("\n" + "=" * 80)
print("LIQUIDITY TIER CHARACTERISTICS")
print("=" * 80)

tier_summary = df.groupby('liquidity_tier').agg({
    'facility_size': ['count', 'mean', 'median'],
    'current_spread': ['mean', 'median'],
    'trading_volume_30d': ['mean', 'median'],
    'bid_ask_spread': ['mean', 'median'],
    'time_to_maturity': ['mean'],
    'covenant_lite': lambda x: x.sum() / len(x) * 100  # % covenant lite
}).round(2)

tier_summary.columns = ['Count', 'Avg Size ($M)', 'Med Size ($M)', 
                        'Avg Spread (bps)', 'Med Spread (bps)',
                        'Avg Vol 30d ($M)', 'Med Vol 30d ($M)',
                        'Avg Bid-Ask (bps)', 'Med Bid-Ask (bps)',
                        'Avg Maturity (yrs)', '% Cov-Lite']

display(tier_summary)

# Key observations
print("\nKEY OBSERVATIONS:")
print("-" * 60)
print("1. Trading Volume: More liquid loans (lower tier numbers) have higher trading volumes")
print("2. Bid-Ask Spread: Illiquid loans (Tier 5) have wider bid-ask spreads")
print("3. Facility Size: Larger loans tend to be more liquid (lower tier)")
print("4. Credit Spread: Less liquid loans tend to have higher credit spreads")
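
The observations printed above are written by hand rather than computed. A minimal sketch of how they could be checked against the data is to test whether the per-tier medians move monotonically. The demo values below are made up for illustration.

```python
import pandas as pd

# Illustrative data: two loans per tier
demo = pd.DataFrame({
    'liquidity_tier':     [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'trading_volume_30d': [60, 50, 40, 35, 25, 20, 12, 10, 3, 1],
    'bid_ask_spread':     [10, 12, 20, 25, 40, 45, 80, 90, 200, 250],
})

# Median per tier, indexed by tier number in ascending order
medians = demo.groupby('liquidity_tier')[['trading_volume_30d', 'bid_ask_spread']].median()

# Tier 1 is the most liquid, so volume should fall and bid-ask should widen
volume_falls = medians['trading_volume_30d'].is_monotonic_decreasing
spread_widens = medians['bid_ask_spread'].is_monotonic_increasing

print(medians)
print(f"Median volume falls with tier number: {volume_falls}")
print(f"Median bid-ask widens with tier number: {spread_widens}")
```

Running the same check on `df` would show whether each printed observation actually holds for the loaded dataset.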

## 5. Credit Rating Analysis

Examining how credit ratings relate to spreads, trading volumes, and liquidity.

In [None]:
# Define rating order (best to worst)
rating_order = ['BB+', 'BB', 'BB-', 'B+', 'B', 'B-', 'CCC+', 'CCC']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Spread by Credit Rating
ax = axes[0, 0]
sns.boxplot(data=df, x='credit_rating', y='current_spread', order=rating_order,
            palette='RdYlGn_r', ax=ax)
ax.set_xlabel('Credit Rating')
ax.set_ylabel('Current Spread (bps)')
ax.set_title('Credit Spread by Rating\n(Higher risk = Higher spread)')

# Add trend line (mean)
means = df.groupby('credit_rating')['current_spread'].mean().reindex(rating_order)
ax.plot(range(len(means)), means.values, 'r--', marker='D', markersize=6, label='Mean')
ax.legend()

# 2. Trading Volume by Credit Rating
ax = axes[0, 1]
sns.boxplot(data=df, x='credit_rating', y='trading_volume_30d', order=rating_order,
            palette='RdYlGn_r', ax=ax)
ax.set_xlabel('Credit Rating')
ax.set_ylabel('Trading Volume 30d ($M)')
ax.set_title('Trading Volume by Rating')

# 3. Liquidity Tier Distribution by Credit Rating
ax = axes[1, 0]
rating_tier = pd.crosstab(df['credit_rating'], df['liquidity_tier'], normalize='index') * 100
rating_tier = rating_tier.reindex(rating_order)
rating_tier.plot(kind='bar', stacked=True, ax=ax, 
                  colormap='RdYlGn_r', edgecolor='white', width=0.8)
ax.set_xlabel('Credit Rating')
ax.set_ylabel('Percentage (%)')
ax.set_title('Liquidity Tier Distribution by Credit Rating')
ax.legend(title='Liquidity Tier', bbox_to_anchor=(1.02, 1), loc='upper left')
ax.tick_params(axis='x', rotation=0)

# 4. Average Liquidity Tier by Credit Rating
ax = axes[1, 1]
avg_tier = df.groupby('credit_rating')['liquidity_tier'].mean().reindex(rating_order)
colors = sns.color_palette('RdYlGn_r', len(rating_order))
bars = ax.bar(range(len(avg_tier)), avg_tier.values, color=colors, edgecolor='white')
ax.set_xticks(range(len(rating_order)))
ax.set_xticklabels(rating_order)
ax.set_xlabel('Credit Rating')
ax.set_ylabel('Average Liquidity Tier')
ax.set_title('Average Liquidity Tier by Credit Rating\n(Higher = Less Liquid)')
ax.set_ylim(1, 5)

# Add value labels
for bar, val in zip(bars, avg_tier.values):
    ax.annotate(f'{val:.2f}', xy=(bar.get_x() + bar.get_width()/2, bar.get_height()),
                ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.suptitle('Credit Rating Analysis', fontsize=14, fontweight='bold', y=1.02)
plt.show()

# Credit rating summary table
print("\n" + "=" * 80)
print("CREDIT RATING SUMMARY")
print("=" * 80)

rating_summary = df.groupby('credit_rating').agg({
    'loan_id': 'count',
    'facility_size': 'mean',
    'current_spread': 'mean',
    'trading_volume_30d': 'mean',
    'bid_ask_spread': 'mean',
    'liquidity_tier': 'mean'
}).reindex(rating_order).round(2)

rating_summary.columns = ['Count', 'Avg Size ($M)', 'Avg Spread (bps)', 
                          'Avg Vol 30d ($M)', 'Avg Bid-Ask (bps)', 'Avg Liq Tier']
display(rating_summary)

## 6. Size and Trading Relationship

Exploring the relationships between facility size, trading volume, and bid-ask spreads.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 1. Facility Size vs Trading Volume
ax = axes[0]
scatter = ax.scatter(df['facility_size'], df['trading_volume_30d'], 
                     c=df['liquidity_tier'], cmap='RdYlGn_r', 
                     alpha=0.6, s=30, edgecolor='white', linewidth=0.3)
plt.colorbar(scatter, ax=ax, label='Liquidity Tier')

# Add trend line
z = np.polyfit(df['facility_size'], df['trading_volume_30d'], 1)
p = np.poly1d(z)
x_line = np.linspace(df['facility_size'].min(), df['facility_size'].max(), 100)
ax.plot(x_line, p(x_line), 'b--', linewidth=2, label=f'Trend (r={df["facility_size"].corr(df["trading_volume_30d"]):.2f})')

ax.set_xlabel('Facility Size ($M)')
ax.set_ylabel('Trading Volume 30d ($M)')
ax.set_title('Facility Size vs Trading Volume\n(Colored by Liquidity Tier)')
ax.legend(loc='upper left')

# 2. Trading Volume vs Bid-Ask Spread
ax = axes[1]
scatter = ax.scatter(df['trading_volume_30d'], df['bid_ask_spread'],
                     c=df['liquidity_tier'], cmap='RdYlGn_r',
                     alpha=0.6, s=30, edgecolor='white', linewidth=0.3)
plt.colorbar(scatter, ax=ax, label='Liquidity Tier')

# Add trend line
z = np.polyfit(df['trading_volume_30d'], df['bid_ask_spread'], 1)
p = np.poly1d(z)
x_line = np.linspace(df['trading_volume_30d'].min(), df['trading_volume_30d'].max(), 100)
ax.plot(x_line, p(x_line), 'b--', linewidth=2, label=f'Trend (r={df["trading_volume_30d"].corr(df["bid_ask_spread"]):.2f})')

ax.set_xlabel('Trading Volume 30d ($M)')
ax.set_ylabel('Bid-Ask Spread (bps)')
ax.set_title('Trading Volume vs Bid-Ask Spread\n(Colored by Liquidity Tier)')
ax.legend(loc='upper right')

plt.tight_layout()
plt.suptitle('Size and Trading Relationships', fontsize=14, fontweight='bold', y=1.02)
plt.show()

# Additional analysis: Size buckets
print("\n" + "=" * 60)
print("FACILITY SIZE BUCKET ANALYSIS")
print("=" * 60)

df['size_bucket'] = pd.cut(df['facility_size'], 
                           bins=[0, 200, 400, 600, 800, float('inf')],
                           labels=['<$200M', '$200-400M', '$400-600M', '$600-800M', '>$800M'])

# size_bucket is categorical; observed=False keeps empty buckets in the table
size_summary = df.groupby('size_bucket', observed=False).agg({
    'loan_id': 'count',
    'trading_volume_30d': 'mean',
    'bid_ask_spread': 'mean',
    'liquidity_tier': 'mean'
}).round(2)

size_summary.columns = ['Count', 'Avg Vol 30d ($M)', 'Avg Bid-Ask (bps)', 'Avg Liq Tier']
display(size_summary)

# Clean up temporary column
df.drop('size_bucket', axis=1, inplace=True)

## 7. Industry Sector Analysis

Examining liquidity patterns across different industry sectors.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sort sectors by average liquidity tier
sector_order = df.groupby('industry_sector')['liquidity_tier'].mean().sort_values().index.tolist()

# 1. Liquidity Tier Distribution by Sector
ax = axes[0, 0]
sns.boxplot(data=df, x='industry_sector', y='liquidity_tier', order=sector_order,
            palette='Set2', ax=ax)
ax.set_xlabel('Industry Sector')
ax.set_ylabel('Liquidity Tier')
ax.set_title('Liquidity Tier by Industry Sector\n(Sorted by Avg Liquidity)')
ax.tick_params(axis='x', rotation=45)

# 2. Average Bid-Ask Spread by Sector
ax = axes[0, 1]
sector_bidask = df.groupby('industry_sector')['bid_ask_spread'].mean().sort_values()
colors = sns.color_palette('Set2', len(sector_bidask))
bars = ax.barh(range(len(sector_bidask)), sector_bidask.values, color=colors, edgecolor='white')
ax.set_yticks(range(len(sector_bidask)))
ax.set_yticklabels(sector_bidask.index)
ax.set_xlabel('Average Bid-Ask Spread (bps)')
ax.set_title('Average Bid-Ask Spread by Sector')

# Add value labels
for bar, val in zip(bars, sector_bidask.values):
    ax.annotate(f'{val:.1f}', xy=(val + 1, bar.get_y() + bar.get_height()/2),
                ha='left', va='center', fontsize=10)

# 3. Stacked bar: Liquidity tier proportions by sector
ax = axes[1, 0]
sector_tier = pd.crosstab(df['industry_sector'], df['liquidity_tier'], normalize='index') * 100
sector_tier = sector_tier.reindex(sector_order)
sector_tier.plot(kind='barh', stacked=True, ax=ax, 
                  colormap='RdYlGn_r', edgecolor='white', width=0.8)
ax.set_xlabel('Percentage (%)')
ax.set_ylabel('Industry Sector')
ax.set_title('Liquidity Tier Proportions by Sector')
ax.legend(title='Tier', bbox_to_anchor=(1.02, 1), loc='upper left')

# 4. Trading Volume by Sector
ax = axes[1, 1]
sns.boxplot(data=df, x='industry_sector', y='trading_volume_30d', order=sector_order,
            palette='Set2', ax=ax)
ax.set_xlabel('Industry Sector')
ax.set_ylabel('Trading Volume 30d ($M)')
ax.set_title('Trading Volume by Industry Sector')
ax.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.suptitle('Industry Sector Analysis', fontsize=14, fontweight='bold', y=1.02)
plt.show()

# Sector summary table
print("\n" + "=" * 80)
print("INDUSTRY SECTOR SUMMARY")
print("=" * 80)

sector_summary = df.groupby('industry_sector').agg({
    'loan_id': 'count',
    'facility_size': 'mean',
    'current_spread': 'mean',
    'trading_volume_30d': 'mean',
    'bid_ask_spread': 'mean',
    'liquidity_tier': 'mean'
}).sort_values('liquidity_tier').round(2)

sector_summary.columns = ['Count', 'Avg Size ($M)', 'Avg Spread (bps)', 
                          'Avg Vol 30d ($M)', 'Avg Bid-Ask (bps)', 'Avg Liq Tier']
display(sector_summary)

## 8. Key Findings Summary

### Univariate Insights
- **Facility Size**: Right-skewed distribution with median around $400-500M, consistent with leveraged loan market
- **Credit Ratings**: Dominated by B-rated loans, with BB+ being relatively rare (higher quality)
- **Liquidity Tiers**: Most loans fall in Tier 3-4 (moderate to low liquidity)
- **Covenant-Lite**: Majority of loans are covenant-lite, reflecting modern market structure

### Correlation Insights
| Feature | Correlation with Liquidity Tier | Interpretation |
|---------|--------------------------------|----------------|
| Bid-Ask Spread | Strong Positive | Wider spreads = less liquid |
| Trading Volume | Strong Negative | Higher volume = more liquid |
| Facility Size | Moderate Negative | Larger loans = more liquid |
| Current Spread | Weak Positive | Higher credit spread = slightly less liquid |

### Predictive Features for Liquidity Modeling
Based on the EDA, the following features show the strongest association with liquidity tier and are the leading candidates for modeling:

1. **Trading Volume (30d)** - Strongest negative correlation with liquidity tier
2. **Bid-Ask Spread** - Strongest positive correlation with liquidity tier
3. **Facility Size** - Larger deals tend to be more liquid
4. **Credit Rating** - Lower-rated loans tend to be less liquid
5. **Industry Sector** - Some sectors show systematically different liquidity

### Feature Engineering Candidates
- **Size-adjusted trading volume**: `trading_volume_30d / facility_size`
- **Spread relative to rating**: current_spread vs rating-average spread
- **Rating numeric encoding**: Convert ratings to ordinal scale
- **Sector liquidity score**: Based on sector average liquidity
- **Volume-weighted bid-ask**: Combined liquidity metric
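
Three of the candidates above can be sketched with plain pandas, as shown below. The new column names (`turnover_30d`, `spread_vs_rating`, `rating_ordinal`) and the demo values are hypothetical. The ordinal scale simply follows the best-to-worst `rating_order` used earlier in this notebook.

```python
import pandas as pd

# Best-to-worst order, as used in the credit rating analysis
rating_order = ['BB+', 'BB', 'BB-', 'B+', 'B', 'B-', 'CCC+', 'CCC']
rating_rank = {rating: rank for rank, rating in enumerate(rating_order)}

# Small made-up sample for illustration
demo = pd.DataFrame({
    'credit_rating':      ['BB', 'BB', 'B', 'B', 'CCC'],
    'facility_size':      [800.0, 400.0, 500.0, 250.0, 200.0],
    'current_spread':     [300.0, 340.0, 450.0, 550.0, 900.0],
    'trading_volume_30d': [80.0, 20.0, 25.0, 5.0, 1.0],
})

# Average spread of each loan's rating bucket, aligned row by row
rating_avg_spread = demo.groupby('credit_rating')['current_spread'].transform('mean')

features = demo.assign(
    # Size-adjusted trading volume
    turnover_30d=demo['trading_volume_30d'] / demo['facility_size'],
    # Spread relative to the rating-average spread
    spread_vs_rating=demo['current_spread'] - rating_avg_spread,
    # Ordinal rating: 0 = BB+ (best) ... 7 = CCC (worst)
    rating_ordinal=demo['credit_rating'].map(rating_rank),
)
print(features[['turnover_30d', 'spread_vs_rating', 'rating_ordinal']])
```

A sector liquidity score is left out of this sketch because it would be derived from the target itself. It would likely need to be computed on training data only to avoid leakage.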

## 9. Next Steps

With the exploratory analysis complete, we have identified key relationships and predictive features.

### Continue to Notebook 03: Feature Engineering

In the next notebook, we will:
1. Create derived features based on EDA insights
2. Encode categorical variables appropriately
3. Normalize/scale numeric features
4. Prepare the feature matrix for model training

---

**Notebook Series:**
- [x] Notebook 01: Data Collection
- [x] **Notebook 02: Exploratory Data Analysis** (this notebook)
- [ ] Notebook 03: Feature Engineering
- [ ] Notebook 04: Model Training
- [ ] Notebook 05: Model Evaluation

---

**Continue to Notebook 03**: [03_feature_engineering.ipynb](./03_feature_engineering.ipynb)