# Notebook 01: Data Collection
## CLO Loan-Level Liquidity Predictor

This notebook demonstrates data collection from multiple sources for the liquidity prediction model.

**Data Sources:**
1. Synthetic Loan Data (realistic leveraged loan characteristics)
2. FRED API (economic indicators)
3. Yahoo Finance (market data)
4. SEC EDGAR (CLO holdings; not implemented in this notebook, listed as future work)

---

**Prerequisites:**
- Python 3.9+
- Required packages installed (see `requirements.txt`)
- Optional: FRED API key for live economic data

**Output:**
- `data/synthetic_loans.csv` - Generated loan dataset
- `data/market_data.csv` - Fetched market indicators (if available)
- `data/fred_economic_data.csv` - FRED economic indicators (only if the FRED fetch succeeds)

## 1. Setup and Imports

In [None]:
# Add project root to path for imports
import sys
sys.path.insert(0, '..')

# Standard library imports
import warnings
from pathlib import Path

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Configure plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Import our data modules
try:
    from src.data.data_generator import SyntheticLoanGenerator, print_distribution_stats
    print("Loaded: SyntheticLoanGenerator")
except ImportError as e:
    print(f"Could not import SyntheticLoanGenerator: {e}")

try:
    from src.data.fred_fetcher import FREDFetcher, FREDFetcherError
    print("Loaded: FREDFetcher")
except ImportError as e:
    print(f"Could not import FREDFetcher: {e}")
    FREDFetcher = None

try:
    from src.data.yfinance_fetcher import YFinanceFetcher
    print("Loaded: YFinanceFetcher")
except ImportError as e:
    print(f"Could not import YFinanceFetcher: {e}")
    YFinanceFetcher = None

# Note: EDGAR and Kaggle loaders require additional setup/credentials
# from src.data.edgar_parser import EDGARParser
# from src.data.kaggle_loader import KaggleLoader

print("\nSetup complete!")

## 2. Generate Synthetic Loan Data

We use the `SyntheticLoanGenerator` to create realistic leveraged loan data. The generator produces loans with:

- **Facility sizes**: $100M - $3B (log-normal distribution)
- **Credit ratings**: BB+ through CCC (market-weighted distribution)
- **Spreads**: Correlated with credit quality (200-750 bps over SOFR)
- **Trading characteristics**: Volume, bid-ask spreads, liquidity tiers

This synthetic data serves as our primary training dataset when real loan-level data is not available.
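
The internals of `SyntheticLoanGenerator` are not shown in this notebook, so the following is only a minimal sketch of how the characteristics listed above could be sampled. The function name `sketch_loans`, the rating weights, the per-rating base spreads, and the log-normal parameters are all illustrative assumptions, not the project's actual values.

```python
import numpy as np
import pandas as pd

RATINGS = ['BB+', 'BB', 'BB-', 'B+', 'B', 'B-', 'CCC+', 'CCC']
# Hypothetical market weights (sum to 1.0) and base spreads in bps.
WEIGHTS = [0.05, 0.10, 0.15, 0.20, 0.25, 0.15, 0.06, 0.04]
BASE_SPREAD = dict(zip(RATINGS, [225, 250, 300, 350, 425, 500, 625, 725]))


def sketch_loans(n_loans: int, seed: int = 42) -> pd.DataFrame:
    """Sample toy loans: log-normal sizes, weighted ratings, rating-driven spreads."""
    rng = np.random.default_rng(seed)

    # Log-normal facility size in $M, clipped to the $100M - $3B range.
    size = np.clip(rng.lognormal(mean=np.log(500), sigma=0.8, size=n_loans), 100, 3000)

    # Ratings drawn according to the assumed market weights.
    rating = rng.choice(RATINGS, size=n_loans, p=WEIGHTS)

    # Spread = base level for the rating plus noise, clipped to 200-750 bps.
    base = np.array([BASE_SPREAD[r] for r in rating], dtype=float)
    spread = np.clip(base + rng.normal(0, 25, size=n_loans), 200, 750)

    return pd.DataFrame({
        'facility_size': size,
        'credit_rating': rating,
        'current_spread': spread,
    })


sample = sketch_loans(5)
print(sample)
```

Clipping is the simplest way to enforce the stated ranges, but it piles probability mass onto the bounds; a real generator might resample out-of-range draws instead.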

In [None]:
# Initialize the generator with a fixed seed for reproducibility
generator = SyntheticLoanGenerator(seed=42)

# Generate 5000 synthetic loans
print("Generating 5,000 synthetic loans...")
loans_df = generator.generate_loans(n_loans=5000)

# Display basic information
print(f"\nGenerated DataFrame shape: {loans_df.shape}")
print(f"Columns: {list(loans_df.columns)}")

# Show sample data
print("\n" + "="*60)
print("SAMPLE DATA (First 10 rows)")
print("="*60)
display(loans_df.head(10))

In [None]:
# Print comprehensive distribution statistics
print_distribution_stats(loans_df)

## 3. Explore Loan Data Distributions

Let's visualize the key characteristics of our synthetic loan data to verify it resembles real market patterns.
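
Alongside the plots, a numeric check can confirm the main pattern we expect, namely that spreads widen as credit quality falls. The sketch below runs on a small hand-built frame; the column names `credit_rating` and `current_spread` match this notebook, while the helper name `median_spread_by_rating` and the toy values are assumptions made for illustration.

```python
import pandas as pd

rating_order = ['BB+', 'BB', 'BB-', 'B+', 'B', 'B-', 'CCC+', 'CCC']


def median_spread_by_rating(df: pd.DataFrame, order=rating_order) -> pd.Series:
    """Median spread per rating, sorted from strongest to weakest credit."""
    medians = df.groupby('credit_rating')['current_spread'].median()
    # Reindex to credit order and drop ratings absent from the data.
    return medians.reindex(order).dropna()


# Toy data standing in for loans_df.
toy = pd.DataFrame({
    'credit_rating': ['BB+', 'BB+', 'B', 'B', 'CCC', 'CCC'],
    'current_spread': [210, 230, 400, 420, 700, 740],
})

medians = median_spread_by_rating(toy)
is_monotonic = medians.is_monotonic_increasing
print(medians)
print(f"Spreads rise with credit risk: {is_monotonic}")
```

Calling `median_spread_by_rating(loans_df)` applies the same check to the generated data.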

In [None]:
# Create a figure with multiple subplots for loan data exploration
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Facility Size Distribution (linear x-axis)
ax1 = axes[0, 0]
ax1.hist(loans_df['facility_size'], bins=50, edgecolor='white', alpha=0.7, color='steelblue')
ax1.set_xlabel('Facility Size ($M)')
ax1.set_ylabel('Frequency')
ax1.set_title('Distribution of Facility Sizes')
ax1.axvline(loans_df['facility_size'].median(), color='red', linestyle='--', 
            label=f'Median: ${loans_df["facility_size"].median():,.0f}M')
ax1.axvline(loans_df['facility_size'].mean(), color='orange', linestyle='--',
            label=f'Mean: ${loans_df["facility_size"].mean():,.0f}M')
ax1.legend()

# 2. Credit Rating Distribution
ax2 = axes[0, 1]
rating_order = ['BB+', 'BB', 'BB-', 'B+', 'B', 'B-', 'CCC+', 'CCC']
rating_counts = loans_df['credit_rating'].value_counts().reindex(rating_order, fill_value=0)
colors = sns.color_palette('RdYlGn_r', len(rating_order))
rating_counts.plot(kind='bar', ax=ax2, color=colors, edgecolor='white')
ax2.set_xlabel('Credit Rating')
ax2.set_ylabel('Number of Loans')
ax2.set_title('Credit Rating Distribution')
ax2.tick_params(axis='x', rotation=0)

# Add percentage labels on bars
for i, (rating, count) in enumerate(rating_counts.items()):
    pct = count / len(loans_df) * 100
    ax2.annotate(f'{pct:.1f}%', xy=(i, count), ha='center', va='bottom', fontsize=9)

# 3. Current Spread Distribution
ax3 = axes[1, 0]
ax3.hist(loans_df['current_spread'], bins=50, edgecolor='white', alpha=0.7, color='coral')
ax3.set_xlabel('Current Spread (bps over SOFR)')
ax3.set_ylabel('Frequency')
ax3.set_title('Distribution of Credit Spreads')
ax3.axvline(loans_df['current_spread'].median(), color='blue', linestyle='--',
            label=f'Median: {loans_df["current_spread"].median():.0f} bps')
ax3.legend()

# 4. Liquidity Tier Distribution
ax4 = axes[1, 1]
tier_counts = loans_df['liquidity_tier'].value_counts().sort_index()
tier_colors = ['#2ecc71', '#27ae60', '#f39c12', '#e74c3c', '#c0392b']
tier_counts.plot(kind='bar', ax=ax4, color=tier_colors, edgecolor='white')
ax4.set_xlabel('Liquidity Tier')
ax4.set_ylabel('Number of Loans')
ax4.set_title('Liquidity Tier Distribution\n(1=Most Liquid, 5=Illiquid)')
ax4.tick_params(axis='x', rotation=0)

# Add percentage labels
for i, (tier, count) in enumerate(tier_counts.items()):
    pct = count / len(loans_df) * 100
    ax4.annotate(f'{pct:.1f}%', xy=(i, count), ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.suptitle('Synthetic Loan Data Overview', fontsize=14, fontweight='bold', y=1.02)
plt.show()

In [None]:
# Box plot: Spread by Credit Rating
fig, ax = plt.subplots(figsize=(12, 6))

rating_order = ['BB+', 'BB', 'BB-', 'B+', 'B', 'B-', 'CCC+', 'CCC']
sns.boxplot(data=loans_df, x='credit_rating', y='current_spread', 
            order=rating_order, palette='RdYlGn_r', ax=ax)

ax.set_xlabel('Credit Rating')
ax.set_ylabel('Current Spread (bps)')
ax.set_title('Credit Spread by Rating\n(Higher risk ratings command higher spreads)')

# Add reference line for average
ax.axhline(loans_df['current_spread'].mean(), color='blue', linestyle='--', 
           alpha=0.5, label=f'Overall Mean: {loans_df["current_spread"].mean():.0f} bps')
ax.legend()

plt.tight_layout()
plt.show()

## 4. FRED Economic Data (Demonstration)

The FRED (Federal Reserve Economic Data) API provides access to economic indicators that can influence loan liquidity:

- **VIX**: Market volatility expectations
- **Federal Funds Rate**: Base interest rate environment
- **High Yield Spread**: Credit risk premium for junk bonds
- **Investment Grade Spread**: Credit risk premium for IG bonds
- **Yield Curve Slope**: Economic outlook indicator (10Y-2Y spread)

> **Note**: A FRED API key is required for live data. Get one free at: https://fred.stlouisfed.org/docs/api/api_key.html
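
To make the request shape concrete, here is a rough sketch of how one observation query per indicator could be built against FRED's `series/observations` endpoint. The mapping from column names to series IDs is an assumption (the IDs are real FRED series, but this notebook does not show which ones `FREDFetcher` uses), and `build_fred_url` is a hypothetical helper. No network call is made.

```python
import os
from urllib.parse import urlencode

# Assumed mapping; FREDFetcher may use different series.
FRED_SERIES = {
    'vix': 'VIXCLS',                # CBOE Volatility Index
    'fed_funds_rate': 'DFF',        # Effective federal funds rate (daily)
    'hy_spread': 'BAMLH0A0HYM2',    # ICE BofA US High Yield OAS
    'ig_spread': 'BAMLC0A0CM',      # ICE BofA US Corporate OAS
    'yield_curve_slope': 'T10Y2Y',  # 10-year minus 2-year Treasury yield
}

FRED_OBSERVATIONS_URL = 'https://api.stlouisfed.org/fred/series/observations'


def build_fred_url(series_id: str, start_date: str, end_date: str, api_key: str) -> str:
    """Return the observations URL for one series and date range."""
    params = {
        'series_id': series_id,
        'api_key': api_key,
        'file_type': 'json',
        'observation_start': start_date,
        'observation_end': end_date,
    }
    return f"{FRED_OBSERVATIONS_URL}?{urlencode(params)}"


# Fall back to a placeholder so the sketch runs without a real key.
api_key = os.environ.get('FRED_API_KEY', 'your-key-here')
urls = {
    name: build_fred_url(series_id, '2022-01-01', '2024-12-31', api_key)
    for name, series_id in FRED_SERIES.items()
}
print(urls['yield_curve_slope'])
```

FRED publishes the two ICE BofA spread series in percent, so a value of 3.5 corresponds to 350 bps; a conversion is needed if downstream code expects basis points.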

In [None]:
# FRED Data Collection - requires API key
fred_data = None

if FREDFetcher is not None:
    try:
        # Attempt to initialize FRED fetcher (requires API key)
        fred_fetcher = FREDFetcher()
        
        # Define date range for data collection
        start_date = '2022-01-01'
        end_date = '2024-12-31'
        
        print(f"Fetching FRED data from {start_date} to {end_date}...")
        
        # Fetch all economic indicators
        fred_data = fred_fetcher.fetch_all_indicators(start_date, end_date)
        
        print(f"\nFRED Data Retrieved!")
        print(f"Shape: {fred_data.shape}")
        print(f"Date range: {fred_data['date'].min()} to {fred_data['date'].max()}")
        print(f"\nSample data:")
        display(fred_data.tail(10))
        
    except FREDFetcherError as e:
        print(f"FRED API Error: {e}")
        print("\nTo use FRED data, set your API key:")
        print("  1. Get a free key at: https://fred.stlouisfed.org/docs/api/api_key.html")
        print("  2. Set environment variable: export FRED_API_KEY='your-key-here'")
        print("  3. Or add to .env file in project root")
    except Exception as e:
        print(f"Error: {e}")
else:
    print("FREDFetcher not available - check module imports above.")

# Show what data would look like (mock example)
if fred_data is None:
    print("\n" + "="*60)
    print("MOCK DATA EXAMPLE (for demonstration purposes)")
    print("="*60)
    
    # Create mock data to show the expected structure.
    # Values are seeded random placeholders, not real market observations.
    rng = np.random.default_rng(42)
    mock_dates = pd.date_range('2024-01-01', periods=10, freq='W')
    mock_fred = pd.DataFrame({
        'date': mock_dates,
        'vix': rng.uniform(12, 25, 10),
        'fed_funds_rate': [5.25] * 10,
        'hy_spread': rng.uniform(350, 450, 10),
        'ig_spread': rng.uniform(100, 150, 10),
        'yield_curve_slope': rng.uniform(-0.5, 0.2, 10)
    })
    print("\nExpected FRED data structure:")
    display(mock_fred)

## 5. Yahoo Finance Market Data

We use Yahoo Finance to fetch daily historical market indicators:

- **VIX (^VIX)**: CBOE Volatility Index - measures market fear/uncertainty
- **HYG**: iShares iBoxx $ High Yield Corporate Bond ETF - tracks junk bonds
- **LQD**: iShares iBoxx $ Investment Grade Corporate Bond ETF - tracks IG bonds
- **S&P 500 (^GSPC)**: Broad market performance indicator

These market indicators help contextualize loan liquidity conditions.
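
One simple way to turn a daily VIX close into market context is to bucket it into volatility regimes. The sketch below uses the 20 and 30 levels that the VIX plot in this notebook draws as reference lines; the helper name `label_vix_regime` and the middle label `elevated` are assumptions added for illustration, and the toy series stands in for `vix_data['Close']`.

```python
import pandas as pd


def label_vix_regime(close: pd.Series) -> pd.Series:
    """Bucket VIX closes into low (<=20), elevated (20-30], and high (>30)."""
    bins = [float('-inf'), 20, 30, float('inf')]
    labels = ['low', 'elevated', 'high']
    return pd.cut(close, bins=bins, labels=labels)


# Toy closes on four business days.
toy_close = pd.Series(
    [13.5, 19.9, 24.0, 31.2],
    index=pd.date_range('2024-01-02', periods=4, freq='B'),
    name='Close',
)

regimes = label_vix_regime(toy_close)
print(regimes)
```

Because `pd.cut` uses right-closed intervals by default, a close of exactly 20 is labeled `low` and exactly 30 is labeled `elevated`.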

In [None]:
# Yahoo Finance Data Collection
market_data = None
vix_data = None
credit_data = None

if YFinanceFetcher is not None:
    try:
        # Initialize the fetcher
        yf_fetcher = YFinanceFetcher(delay_seconds=0.3)
        
        # Define date range
        start_date = '2023-01-01'
        end_date = '2024-12-31'
        
        print(f"Fetching Yahoo Finance data from {start_date} to {end_date}...")
        print("-" * 60)
        
        # Fetch VIX data
        print("\n1. Fetching VIX volatility index...")
        vix_data = yf_fetcher.fetch_vix(start_date, end_date)
        if not vix_data.empty:
            print(f"   VIX records: {len(vix_data)}")
            print(f"   VIX range: {vix_data['Close'].min():.2f} - {vix_data['Close'].max():.2f}")
        
        # Fetch credit ETF data
        print("\n2. Fetching credit ETF data (HYG, LQD)...")
        credit_data = yf_fetcher.fetch_credit_etfs(start_date, end_date)
        if not credit_data.empty:
            print(f"   Credit ETF records: {len(credit_data)}")
            print(f"   Columns: {list(credit_data.columns)}")
        
        # Fetch combined market data
        print("\n3. Fetching combined market data...")
        market_data = yf_fetcher.fetch_all_market_data(start_date, end_date)
        
        if not market_data.empty:
            print(f"\nMarket Data Retrieved!")
            print(f"Shape: {market_data.shape}")
            print(f"Columns: {list(market_data.columns)}")
            print(f"Date range: {market_data.index.min()} to {market_data.index.max()}")
            print(f"\nSample data (last 10 rows):")
            display(market_data.tail(10))
        
    except Exception as e:
        print(f"Error fetching Yahoo Finance data: {e}")
        import traceback
        traceback.print_exc()
else:
    print("YFinanceFetcher not available - check module imports above.")

In [None]:
# Visualize VIX data if available
if vix_data is not None and not vix_data.empty:
    fig, axes = plt.subplots(2, 1, figsize=(14, 8))
    
    # VIX time series
    ax1 = axes[0]
    ax1.plot(vix_data.index, vix_data['Close'], color='purple', linewidth=1.5, label='VIX Close')
    ax1.fill_between(vix_data.index, vix_data['Close'], alpha=0.3, color='purple')
    
    # Add threshold lines
    ax1.axhline(y=20, color='green', linestyle='--', alpha=0.7, label='Low Volatility (<20)')
    ax1.axhline(y=30, color='red', linestyle='--', alpha=0.7, label='High Volatility (>30)')
    
    ax1.set_xlabel('Date')
    ax1.set_ylabel('VIX Index')
    ax1.set_title('CBOE Volatility Index (VIX) - Market Fear Gauge')
    ax1.legend(loc='upper right')
    ax1.grid(True, alpha=0.3)
    
    # VIX distribution
    ax2 = axes[1]
    ax2.hist(vix_data['Close'].dropna(), bins=50, edgecolor='white', alpha=0.7, color='purple')
    ax2.axvline(vix_data['Close'].mean(), color='red', linestyle='--', 
                label=f'Mean: {vix_data["Close"].mean():.1f}')
    ax2.axvline(vix_data['Close'].median(), color='blue', linestyle='--',
                label=f'Median: {vix_data["Close"].median():.1f}')
    ax2.set_xlabel('VIX Value')
    ax2.set_ylabel('Frequency')
    ax2.set_title('VIX Distribution')
    ax2.legend()
    
    plt.tight_layout()
    plt.show()
    
    # Print statistics
    print("\nVIX Statistics:")
    print(f"  Mean: {vix_data['Close'].mean():.2f}")
    print(f"  Std:  {vix_data['Close'].std():.2f}")
    print(f"  Min:  {vix_data['Close'].min():.2f}")
    print(f"  Max:  {vix_data['Close'].max():.2f}")
else:
    print("VIX data not available - skipping visualization.")

In [None]:
# Visualize credit ETF data if available
if credit_data is not None and not credit_data.empty:
    fig, axes = plt.subplots(2, 1, figsize=(14, 8))
    
    # Price comparison
    ax1 = axes[0]
    if 'hyg' in credit_data.columns:
        ax1.plot(credit_data.index, credit_data['hyg'], label='HYG (High Yield)', linewidth=1.5)
    if 'lqd' in credit_data.columns:
        ax1.plot(credit_data.index, credit_data['lqd'], label='LQD (Investment Grade)', linewidth=1.5)
    
    ax1.set_xlabel('Date')
    ax1.set_ylabel('Price ($)')
    ax1.set_title('Credit ETF Prices - HYG vs LQD')
    ax1.legend(loc='upper right')
    ax1.grid(True, alpha=0.3)
    
    # Normalized comparison (rebased to 100)
    ax2 = axes[1]
    if 'hyg' in credit_data.columns and 'lqd' in credit_data.columns:
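        # Rebase on the first non-missing price so a leading NaN does not blank the series
        hyg_norm = credit_data['hyg'] / credit_data['hyg'].dropna().iloc[0] * 100
        lqd_norm = credit_data['lqd'] / credit_data['lqd'].dropna().iloc[0] * 100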
        
        ax2.plot(credit_data.index, hyg_norm, label='HYG (Normalized)', linewidth=1.5)
        ax2.plot(credit_data.index, lqd_norm, label='LQD (Normalized)', linewidth=1.5)
        ax2.axhline(y=100, color='gray', linestyle='--', alpha=0.5)
        
        ax2.set_xlabel('Date')
        ax2.set_ylabel('Normalized Price (Base=100)')
        ax2.set_title('Credit ETF Performance Comparison (Rebased)')
        ax2.legend(loc='upper right')
        ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
else:
    print("Credit ETF data not available - skipping visualization.")

## 6. Data Summary

Let's review all the datasets we have collected or generated.

In [None]:
print("="*60)
print("DATA COLLECTION SUMMARY")
print("="*60)

# Synthetic loan data
print("\n1. SYNTHETIC LOAN DATA")
print("-"*40)
print(f"   Shape: {loans_df.shape}")
print(f"   Columns: {list(loans_df.columns)}")
print(f"   Memory usage: {loans_df.memory_usage(deep=True).sum() / 1024:.1f} KB")

# FRED data
print("\n2. FRED ECONOMIC DATA")
print("-"*40)
if fred_data is not None:
    print(f"   Shape: {fred_data.shape}")
    print(f"   Columns: {list(fred_data.columns)}")
    print(f"   Date range: {fred_data['date'].min()} to {fred_data['date'].max()}")
else:
    print("   Status: Not available (FRED API key missing or fetch failed)")

# Yahoo Finance data
print("\n3. YAHOO FINANCE MARKET DATA")
print("-"*40)
if market_data is not None and not market_data.empty:
    print(f"   Shape: {market_data.shape}")
    print(f"   Columns: {list(market_data.columns)}")
    print(f"   Date range: {market_data.index.min()} to {market_data.index.max()}")
else:
    print("   Status: Not available or empty")

print("\n" + "="*60)

## 7. Save Data

Save the collected data to the `data/` directory for use in subsequent notebooks.
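
Since later notebooks read these files back, it helps to know how a date-indexed frame survives the CSV round trip. The following is a minimal sketch on toy data written to a temporary directory; `roundtrip_market_csv` is a hypothetical helper, and it assumes the market data has a date index, as the date-range printout in this notebook suggests.

```python
import tempfile
from pathlib import Path

import pandas as pd


def roundtrip_market_csv(df: pd.DataFrame, path: Path) -> pd.DataFrame:
    """Write a date-indexed frame to CSV and read it back with parsed dates."""
    df.to_csv(path, index=True)
    # index_col=0 restores the index; parse_dates=True converts it to datetimes.
    return pd.read_csv(path, index_col=0, parse_dates=True)


# Toy frame standing in for market_data.
toy = pd.DataFrame(
    {'vix': [13.2, 14.1, 12.8]},
    index=pd.date_range('2024-01-02', periods=3, freq='D', name='Date'),
)

with tempfile.TemporaryDirectory() as tmp:
    restored = roundtrip_market_csv(toy, Path(tmp) / 'market_data.csv')

print(restored)
print(type(restored.index).__name__)
```

Without `index_col=0` and `parse_dates=True`, the dates would come back as an ordinary string column, which is a common source of misaligned joins downstream.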

In [None]:
# Define output directory
data_dir = Path('..') / 'data'
data_dir.mkdir(parents=True, exist_ok=True)

print("Saving datasets...")
print("-"*60)

# Save synthetic loan data
loans_path = data_dir / 'synthetic_loans.csv'
loans_df.to_csv(loans_path, index=False)
print(f"Saved: {loans_path}")
print(f"       {len(loans_df):,} loans, {loans_df.shape[1]} columns")

# Save market data if available
if market_data is not None and not market_data.empty:
    market_path = data_dir / 'market_data.csv'
    market_data.to_csv(market_path, index=True)
    print(f"\nSaved: {market_path}")
    print(f"       {len(market_data):,} observations, {market_data.shape[1]} columns")

# Save FRED data if available
if fred_data is not None:
    fred_path = data_dir / 'fred_economic_data.csv'
    fred_data.to_csv(fred_path, index=False)
    print(f"\nSaved: {fred_path}")
    print(f"       {len(fred_data):,} observations, {fred_data.shape[1]} columns")

print("\n" + "-"*60)
print("Data collection complete!")

## 8. Next Steps

With our data collected, we can proceed to the next phase of the analysis:

### Notebook 02: Exploratory Data Analysis
- Deep-dive into loan characteristics
- Correlation analysis between features
- Relationship between loan attributes and liquidity
- Market conditions impact analysis

### Future Notebooks:
- **03: Feature Engineering** - Create predictive features from raw data
- **04: Model Training** - Build and train liquidity prediction models
- **05: Model Evaluation** - Evaluate model performance and explainability

---

### Data Sources Summary

| Source | Status | Description |
|--------|--------|-------------|
| Synthetic Loans | Generated | 5,000 realistic leveraged loan records |
| FRED API | Requires API Key | Economic indicators (VIX, rates, spreads) |
| Yahoo Finance | Available | Daily historical market data (VIX, credit ETFs) |
| SEC EDGAR | Not Implemented | CLO holdings data (future work) |
| Kaggle | Requires Credentials | Additional loan datasets (future work) |

---

**Continue to Notebook 02**: [02_exploratory_analysis.ipynb](./02_exploratory_analysis.ipynb)