# Notebook 03: Feature Engineering
## CLO Loan-Level Liquidity Predictor

Transform raw loan data into ML-ready features using the project's feature engineering modules.

---

**Objectives:**
1. Apply loan-level feature transformations (credit rating encoding, facility size normalization)
2. Generate market-based features (simulated market conditions)
3. Calculate liquidity indicator features (volume metrics, bid-ask analysis)
4. Combine all features into a unified feature matrix
5. Analyze feature correlations with the target variable

**Prerequisites:**
- Notebooks 01 and 02 completed
- `data/synthetic_loans.csv` available
- Feature engineering modules in `src/features/`

**Output:**
- `data/engineered_features.csv` - ML-ready feature matrix

---

## 1. Setup and Data Loading

In [None]:
# Standard library imports
import sys
import warnings
from pathlib import Path

# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Sklearn for scaling
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Configure plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.titlesize'] = 13
plt.rcParams['axes.labelsize'] = 11
warnings.filterwarnings('ignore')

# Add project root to path for imports
sys.path.insert(0, '..')

# Load the data
data_path = Path('..') / 'data' / 'synthetic_loans.csv'
df = pd.read_csv(data_path)

print(f"Loaded {len(df):,} loans from {data_path}")
print(f"Columns: {list(df.columns)}")
print("\nSample data:")
display(df.head())

## 2. Feature Categories Overview

Our feature engineering pipeline creates three categories of features:

### 2.1 Loan-Level Features
Features derived directly from loan characteristics:
- **Credit rating encoding**: Ordinal and one-hot encoding of credit ratings
- **Industry sector encoding**: One-hot encoding of industry sectors
- **Facility size transformations**: Log transform and percentile rank
- **Spread normalization**: Z-scores within rating categories
- **Maturity features**: Buckets and near-maturity flags
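
As a rough illustration of the transformations listed above, here is a minimal pandas sketch on a toy frame. It is not the `LoanFeatureEngine` implementation. The column names follow the feature table later in this notebook, while the sample values, the maturity bucket cut-offs (2 and 5 years), and the 1-year near-maturity threshold applied with a strict `<` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy loans; values are made up for illustration only.
loans = pd.DataFrame({
    "credit_rating": ["BB+", "B", "B", "CCC", "BB+", "CCC"],
    "facility_size": [1200.0, 350.0, 500.0, 150.0, 800.0, 250.0],
    "current_spread": [250.0, 400.0, 450.0, 800.0, 300.0, 900.0],
    "time_to_maturity": [0.5, 3.0, 6.5, 2.0, 4.5, 0.8],
})

# Ordinal encoding: 1 = BB+ (strongest) through 8 = CCC (weakest).
rating_order = ["BB+", "BB", "BB-", "B+", "B", "B-", "CCC+", "CCC"]
rating_map = {rating: rank for rank, rating in enumerate(rating_order, start=1)}
loans["credit_rating_encoded"] = loans["credit_rating"].map(rating_map)

# Log transform and percentile rank (0-100) of facility size.
loans["facility_size_log"] = np.log(loans["facility_size"])
loans["facility_size_pctl"] = loans["facility_size"].rank(pct=True) * 100

# Z-score of the spread within each rating category.
by_rating = loans.groupby("credit_rating")["current_spread"]
loans["spread_z_score"] = (
    loans["current_spread"] - by_rating.transform("mean")
) / by_rating.transform("std")

# Maturity buckets and near-maturity flag (cut-offs are assumed, not the engine's).
loans["maturity_bucket"] = pd.cut(
    loans["time_to_maturity"],
    bins=[0, 2, 5, np.inf],
    labels=["Short", "Medium", "Long"],
)
loans["near_maturity"] = loans["time_to_maturity"] < 1.0

print(loans.round(3))
```

Computing the z-score within each rating group, rather than across the whole portfolio, keeps a wide CCC spread from being flagged as unusual simply because CCC loans trade wide in general.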

### 2.2 Market Features (Simulated)
Features capturing macroeconomic conditions:
- **VIX volatility features**: Level, percentile, and regime classification
- **Credit spread features**: HY/IG spreads and their gap
- **Interest rate features**: Fed funds rate and yield curve slope
- **Composite stress indicator**: Combined market stress score
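
The sketch below shows one plausible way to derive these market features with pandas. The VIX regime thresholds (15, 25, 35) and the equal-weight, min-max-scaled stress score are assumptions chosen for illustration, and `MarketFeatureEngine` may define them differently.

```python
import numpy as np
import pandas as pd

# Toy market snapshots; values are made up for illustration only.
market = pd.DataFrame({
    "vix": [12.0, 18.0, 27.0, 45.0],
    "hy_spread": [3.5, 4.2, 5.5, 9.0],
    "ig_spread": [1.0, 1.3, 1.8, 3.0],
    "yield_curve_slope": [0.8, 0.2, -0.3, -0.6],
})

# Regime classification with assumed thresholds.
market["vix_regime"] = pd.cut(
    market["vix"],
    bins=[0, 15, 25, 35, np.inf],
    labels=["Low", "Normal", "High", "Extreme"],
)

# Gap between high-yield and investment-grade spreads.
market["hy_ig_gap"] = market["hy_spread"] - market["ig_spread"]

# The curve is inverted when the 10Y-2Y slope is negative.
market["curve_inverted"] = market["yield_curve_slope"] < 0


def min_max(series: pd.Series) -> pd.Series:
    """Scale a series to the [0, 1] range."""
    return (series - series.min()) / (series.max() - series.min())


# Assumed composite: equal-weight mean of scaled stress components.
components = pd.DataFrame({
    "vix": min_max(market["vix"]),
    "gap": min_max(market["hy_ig_gap"]),
    "inversion": min_max(-market["yield_curve_slope"]),
})
market["market_stress"] = components.mean(axis=1)

print(market.round(3))
```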

### 2.3 Liquidity Indicator Features
Features measuring trading activity and market depth:
- **Volume metrics**: Trading volume, percentile, and turnover ratio
- **Bid-ask analysis**: Spread level, percentile, and volatility
- **Trading recency**: Days since last trade, trade frequency
- **Dealer metrics**: Quote count and coverage
- **Ownership features**: CLO ownership and concentration
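
A few of these indicators reduce to simple ratios and ranks. The following sketch is a hedged illustration on toy data, not the `LiquidityFeatureEngine` code. `MAX_DEALERS` mirrors the `max_dealers=50` argument used later in this notebook, and expressing dealer coverage on a 0-100 scale is an assumption.

```python
import pandas as pd

MAX_DEALERS = 50  # assumed to match the engine's max_dealers argument

# Toy trading data; values are made up for illustration only.
trades = pd.DataFrame({
    "facility_size": [1000.0, 400.0, 250.0, 800.0],
    "trading_volume_30d": [120.0, 10.0, 0.0, 60.0],
    "bid_ask_spread": [25.0, 90.0, 200.0, 50.0],
    "dealer_quote_count": [20, 5, 1, 10],
})

# Turnover ratio: traded volume relative to facility size.
trades["volume_to_size_ratio"] = trades["trading_volume_30d"] / trades["facility_size"]

# Cross-sectional percentile ranks (0-100).
trades["volume_percentile"] = trades["trading_volume_30d"].rank(pct=True) * 100
trades["bid_ask_percentile"] = trades["bid_ask_spread"].rank(pct=True) * 100

# Dealer coverage: quoting dealers as a percentage of the assumed maximum.
trades["dealer_coverage"] = trades["dealer_quote_count"] * 100 / MAX_DEALERS

print(trades.round(3))
```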

## 3. Loan Feature Engineering

Using the `LoanFeatureEngine` from `src/features/loan_features.py` to transform raw loan attributes.

In [None]:
# Import the loan feature engineering module
from src.features.loan_features import LoanFeatureEngine, print_feature_summary

# Initialize the feature engine
loan_engine = LoanFeatureEngine()

# Apply transformations
print("Applying loan feature transformations...")
loan_features = loan_engine.transform(df)

print(f"\nLoan features shape: {loan_features.shape}")
print(f"Features generated: {list(loan_features.columns)}")

# Display sample output
print("\n" + "="*60)
print("LOAN FEATURES SAMPLE (first 5 rows)")
print("="*60)
display(loan_features.head())

In [None]:
# Print detailed feature summary
print_feature_summary(loan_features)

# Show numeric feature statistics
print("\n" + "="*60)
print("NUMERIC FEATURE STATISTICS")
print("="*60)

numeric_cols = loan_engine.get_numeric_feature_names()
# Filter to columns that exist in loan_features
numeric_cols = [c for c in numeric_cols if c in loan_features.columns]
display(loan_features[numeric_cols].describe().round(3))

## 4. Market Feature Engineering

Using the `MarketFeatureEngine` from `src/features/market_features.py` to generate market condition features.

**Note**: Since we are working with loan-level data without time-series market data, we will simulate market conditions for demonstration purposes.

In [None]:
# Import the market feature engineering module
from src.features.market_features import MarketFeatureEngine

# Initialize the market feature engine
market_engine = MarketFeatureEngine(
    vix_percentile_window=252,
    fed_funds_change_window=30
)

# We don't have time-series market data for each loan, so we draw an
# independent random market snapshot for every row. The draws do not depend
# on loan characteristics, and row order carries no time meaning, so
# rolling features computed from them are illustrative only.

print("Simulating market data for demonstration...")
np.random.seed(42)
n_loans = len(df)

# Create simulated market data that varies slightly around baseline values
# In production, this would come from actual market data feeds
market_data = pd.DataFrame({
    'vix': np.random.uniform(18, 25, n_loans),  # VIX around normal levels
    'hy_spread': np.random.uniform(4.0, 5.5, n_loans),  # HY spread in %
    'ig_spread': np.random.uniform(1.2, 1.8, n_loans),  # IG spread in %
    'fed_funds_rate': np.random.uniform(5.0, 5.5, n_loans),  # Fed funds rate
    'yield_curve_slope': np.random.uniform(-0.5, 0.5, n_loans)  # 10Y-2Y spread
})

# Apply market feature transformations
print("Applying market feature transformations...")
market_features = market_engine.transform(market_data)

print(f"\nMarket features shape: {market_features.shape}")
print(f"Features generated: {list(market_features.columns)}")

# Display sample output
print("\n" + "="*60)
print("MARKET FEATURES SAMPLE (first 5 rows)")
print("="*60)
display(market_features.head())

In [None]:
# Market feature statistics
print("="*60)
print("MARKET FEATURE STATISTICS")
print("="*60)

# Numeric columns only
market_numeric = market_features.select_dtypes(include=[np.number])
display(market_numeric.describe().round(3))

# VIX regime distribution
print("\n" + "-"*40)
print("VIX REGIME DISTRIBUTION")
print("-"*40)
print(market_features['vix_regime'].value_counts())

# Curve inversion statistics
print("\n" + "-"*40)
print("YIELD CURVE STATISTICS")
print("-"*40)
inverted_count = market_features['curve_inverted'].sum()
print(f"Inverted observations: {inverted_count} ({inverted_count/len(market_features)*100:.1f}%)")

## 5. Liquidity Feature Engineering

Using the `LiquidityFeatureEngine` from `src/features/liquidity_features.py` to calculate liquidity indicators.

In [None]:
# Import the liquidity feature engineering module
from src.features.liquidity_features import LiquidityFeatureEngine

# Initialize the liquidity feature engine
liquidity_engine = LiquidityFeatureEngine(max_dealers=50)

# Apply liquidity feature transformations
print("Applying liquidity feature transformations...")
liquidity_features = liquidity_engine.transform(df)

print(f"\nLiquidity features shape: {liquidity_features.shape}")

# Get the feature names from the engine
liquidity_feature_names = liquidity_engine.get_feature_names()
print(f"\nLiquidity indicator features generated:")
for name in liquidity_feature_names:
    if name in liquidity_features.columns:
        print(f"  - {name}")

# Display sample output
print("\n" + "="*60)
print("LIQUIDITY FEATURES SAMPLE (first 5 rows)")
print("="*60)

# Show only the engineered liquidity features
display_cols = ['loan_id'] + [c for c in liquidity_feature_names if c in liquidity_features.columns]
display(liquidity_features[display_cols].head())

In [None]:
# Liquidity feature statistics
print("="*60)
print("LIQUIDITY FEATURE STATISTICS")
print("="*60)

# Get feature descriptions
feature_descriptions = liquidity_engine.get_feature_descriptions()

# Display statistics for liquidity features
liquidity_cols = [c for c in liquidity_feature_names if c in liquidity_features.columns]
display(liquidity_features[liquidity_cols].describe().round(3))

# Feature descriptions
print("\n" + "-"*60)
print("FEATURE DESCRIPTIONS")
print("-"*60)
for name, desc in feature_descriptions.items():
    if name in liquidity_features.columns:
        print(f"  {name:30s}: {desc}")

## 6. Combine All Features

Merge all engineered features into a single feature matrix ready for model training.
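
One caveat worth illustrating: `pd.concat(..., axis=1)` aligns rows by index label, not by position. The toy example below, with hypothetical frames and values, shows how mismatched indexes silently produce extra rows and NaNs, and how resetting the indexes makes the join positional. Whether the feature engines preserve the input index is not shown in this notebook, so treat this as a defensive check rather than a known problem.

```python
import pandas as pd

# Hypothetical feature blocks with different index labels.
loan_part = pd.DataFrame({"facility_size_log": [6.9, 5.8, 6.2]}, index=[10, 11, 12])
market_part = pd.DataFrame({"vix_level": [19.0, 22.5, 24.1]})  # index 0, 1, 2

# Label-based alignment: union of both indexes, padded with NaN.
misaligned = pd.concat([loan_part, market_part], axis=1)
print(misaligned.shape)

# Positional alignment: drop the old labels before concatenating.
aligned = pd.concat(
    [loan_part.reset_index(drop=True), market_part.reset_index(drop=True)],
    axis=1,
)
print(aligned.shape)
```

Comparing `len(features_df)` with `len(df)` after each concatenation is a cheap way to catch this kind of mismatch.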

In [None]:
# Combine all engineered features
print("Combining feature sets...")
print("-"*60)

# Start with loan features
features_df = loan_features.copy()
print(f"1. Loan features: {loan_features.shape[1]} columns")

# Add market features (exclude any duplicate columns)
market_cols_to_add = [c for c in market_features.columns if c not in features_df.columns]
features_df = pd.concat([features_df, market_features[market_cols_to_add]], axis=1)
print(f"2. Added market features: {len(market_cols_to_add)} new columns")

# Add liquidity features (only the derived features, not the original columns)
liquidity_cols_to_add = [c for c in liquidity_feature_names if c in liquidity_features.columns and c not in features_df.columns]
features_df = pd.concat([features_df, liquidity_features[liquidity_cols_to_add]], axis=1)
print(f"3. Added liquidity features: {len(liquidity_cols_to_add)} new columns")

# Remove any duplicate columns that may have been created
features_df = features_df.loc[:, ~features_df.columns.duplicated()]

# Add target variable
features_df['liquidity_tier'] = df['liquidity_tier']

print("\n" + "="*60)
print("FINAL FEATURE MATRIX")
print("="*60)
print(f"Shape: {features_df.shape[0]:,} rows x {features_df.shape[1]} columns")
print(f"Memory usage: {features_df.memory_usage(deep=True).sum() / 1024:.1f} KB")

print("\nAll columns:")
for i, col in enumerate(features_df.columns, 1):
    print(f"  {i:2d}. {col}")

# Display sample
print("\n" + "-"*60)
print("SAMPLE DATA")
print("-"*60)
display(features_df.head())

## 7. Feature Summary Table

Complete list of all engineered features with their descriptions and categories.

In [None]:
# Create comprehensive feature summary table
feature_summary = []

# Loan features
loan_feature_info = {
    'loan_id': ('Identifier', 'Unique loan identifier'),
    'facility_size': ('Loan', 'Total facility amount in millions'),
    'facility_size_log': ('Loan', 'Natural log of facility size'),
    'facility_size_pctl': ('Loan', 'Percentile rank of facility size (0-100)'),
    'credit_rating_encoded': ('Loan', 'Ordinal encoding of credit rating (1=BB+, 8=CCC)'),
    'current_spread': ('Loan', 'Current spread over SOFR in basis points'),
    'spread_z_score': ('Loan', 'Z-score of spread within rating category'),
    'time_to_maturity': ('Loan', 'Years until maturity'),
    'maturity_bucket': ('Loan', 'Maturity bucket: Short/Medium/Long'),
    'near_maturity': ('Loan', 'Flag for loans maturing within 1 year'),
    'covenant_lite': ('Loan', 'Boolean: covenant-lite loan indicator'),
}

# Add rating one-hot features
for rating in ['BB+', 'BB', 'BB-', 'B+', 'B', 'B-', 'CCC+', 'CCC']:
    loan_feature_info[f'rating_{rating}'] = ('Loan', f'One-hot: credit rating is {rating}')

# Add sector one-hot features
for sector in ['Consumer', 'Energy', 'Financials', 'Healthcare', 'Industrials', 'Technology', 'Telecom', 'Utilities']:
    loan_feature_info[f'sector_{sector}'] = ('Loan', f'One-hot: industry sector is {sector}')

# Market features
market_feature_info = {
    'vix_level': ('Market', 'VIX index value'),
    'vix_percentile': ('Market', 'Rolling percentile rank of VIX'),
    'vix_regime': ('Market', 'VIX regime: Low/Normal/High/Extreme'),
    'hy_spread': ('Market', 'High yield corporate bond spread'),
    'ig_spread': ('Market', 'Investment grade corporate bond spread'),
    'hy_ig_gap': ('Market', 'Difference between HY and IG spreads'),
    'fed_funds_rate': ('Market', 'Federal funds effective rate'),
    'fed_funds_change_30d': ('Market', '30-day change in fed funds rate'),
    'yield_curve_slope': ('Market', '10Y-2Y Treasury spread'),
    'curve_inverted': ('Market', 'Boolean: yield curve is inverted'),
    'market_stress': ('Market', 'Composite market stress indicator (0-1)'),
}

# Liquidity features
liquidity_feature_info = {
    'trading_volume_30d': ('Liquidity', 'Rolling 30-day trading volume'),
    'volume_percentile': ('Liquidity', 'Percentile rank of trading volume'),
    'volume_to_size_ratio': ('Liquidity', 'Turnover ratio: volume / facility size'),
    'price_volatility_30d': ('Liquidity', 'Rolling 30-day price volatility'),
    'bid_ask_spread': ('Liquidity', 'Average bid-ask spread in basis points'),
    'bid_ask_percentile': ('Liquidity', 'Percentile rank of bid-ask spread'),
    'spread_volatility': ('Liquidity', 'Standard deviation of bid-ask spread'),
    'dealer_quote_count': ('Liquidity', 'Number of dealers actively quoting'),
    'dealer_coverage': ('Liquidity', 'Quote count as percentage of max dealers'),
    'days_since_last_trade': ('Liquidity', 'Days since most recent trade'),
    'trade_frequency': ('Liquidity', 'Average trades per week'),
    'clo_ownership_pct': ('Liquidity', 'Percentage of loan held by CLOs'),
    'ownership_concentration': ('Liquidity', 'HHI concentration of top holders'),
}

# Combine all feature info
all_feature_info = {**loan_feature_info, **market_feature_info, **liquidity_feature_info}
all_feature_info['liquidity_tier'] = ('Target', 'Liquidity tier (1=Most liquid, 5=Illiquid)')

# Build summary table for features present in our dataset
for col in features_df.columns:
    if col in all_feature_info:
        category, description = all_feature_info[col]
        dtype = str(features_df[col].dtype)
        non_null = features_df[col].notna().sum()
        feature_summary.append({
            'Feature': col,
            'Category': category,
            'Type': dtype,
            'Non-Null': f"{non_null:,}",
            'Description': description
        })

summary_df = pd.DataFrame(feature_summary)

print("="*80)
print("COMPLETE FEATURE SUMMARY")
print("="*80)

# Display by category
for category in ['Identifier', 'Loan', 'Market', 'Liquidity', 'Target']:
    cat_df = summary_df[summary_df['Category'] == category]
    if len(cat_df) > 0:
        print(f"\n--- {category.upper()} FEATURES ({len(cat_df)}) ---")
        display(cat_df[['Feature', 'Type', 'Description']].reset_index(drop=True))

## 8. Feature Correlations with Target

Analyze how each engineered feature correlates with the liquidity tier target variable.
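
Because `liquidity_tier` is an ordinal label (1 to 5), a rank-based correlation can be a useful cross-check on the Pearson coefficients computed below. The sketch computes Spearman correlation as Pearson correlation on ranks, using a toy frame whose values are invented for illustration.

```python
import pandas as pd

# Toy data with monotonic but non-linear relationships to the target.
toy = pd.DataFrame({
    "bid_ask_spread": [20.0, 35.0, 60.0, 150.0, 400.0],
    "trading_volume_30d": [500.0, 200.0, 90.0, 30.0, 5.0],
    "liquidity_tier": [1, 2, 3, 4, 5],
})

features = toy.drop(columns="liquidity_tier")
target = toy["liquidity_tier"]

# Pearson: measures linear association.
pearson = features.corrwith(target)

# Spearman: Pearson correlation applied to the ranks of each variable.
spearman = features.rank().corrwith(target.rank())

print(pd.DataFrame({"pearson": pearson, "spearman": spearman}).round(3))
```

A feature that moves monotonically but non-linearly with the tier scores a perfect Spearman coefficient while its Pearson coefficient stays below 1 in magnitude.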

In [None]:
# Select numeric features for correlation analysis
numeric_features = features_df.select_dtypes(include=[np.number]).columns.tolist()

# Remove identifiers and target from feature list
exclude_cols = ['loan_id', 'liquidity_tier']
numeric_features = [c for c in numeric_features if c not in exclude_cols]

# Calculate correlations with liquidity_tier
correlations = features_df[numeric_features + ['liquidity_tier']].corr()['liquidity_tier'].drop('liquidity_tier')
correlations = correlations.sort_values()

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 10))

# 1. Bar chart of correlations
ax = axes[0]
colors = ['#e74c3c' if x < 0 else '#27ae60' for x in correlations.values]
bars = ax.barh(range(len(correlations)), correlations.values, color=colors, edgecolor='white')
ax.set_yticks(range(len(correlations)))
ax.set_yticklabels([c.replace('_', ' ').title()[:25] for c in correlations.index], fontsize=8)
ax.set_xlabel('Correlation Coefficient')
ax.set_title('Feature Correlations with Liquidity Tier\n(Negative = More Liquid, Positive = Less Liquid)')
ax.axvline(0, color='black', linewidth=0.5)
ax.axvline(-0.3, color='gray', linestyle='--', alpha=0.5)
ax.axvline(0.3, color='gray', linestyle='--', alpha=0.5)

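# 2. Top 10 positive and negative correlations
# Filter by sign first so a feature cannot appear in both lists when
# fewer than 10 correlations share a sign.
ax = axes[1]
top_positive = correlations[correlations > 0].nlargest(10)
top_negative = correlations[correlations < 0].nsmallest(10)
combined = pd.concat([top_negative, top_positive]).sort_values()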

colors_combined = ['#e74c3c' if x < 0 else '#27ae60' for x in combined.values]
bars = ax.barh(range(len(combined)), combined.values, color=colors_combined, edgecolor='white')
ax.set_yticks(range(len(combined)))
ax.set_yticklabels([c.replace('_', ' ').title()[:30] for c in combined.index], fontsize=9)
ax.set_xlabel('Correlation Coefficient')
ax.set_title('Top 10 Strongest Correlations\n(Positive and Negative)')
ax.axvline(0, color='black', linewidth=0.5)

# Add value labels
for bar, val in zip(bars, combined.values):
    x_pos = val + 0.02 if val >= 0 else val - 0.02
    ha = 'left' if val >= 0 else 'right'
    ax.annotate(f'{val:.3f}', xy=(x_pos, bar.get_y() + bar.get_height()/2),
                ha=ha, va='center', fontsize=9)

plt.tight_layout()
plt.suptitle('Feature Correlations with Liquidity Tier', fontsize=14, fontweight='bold', y=1.02)
plt.show()

# Print correlation summary
print("\n" + "="*60)
print("CORRELATION SUMMARY")
print("="*60)
print("\nStrongest NEGATIVE correlations (more liquid):")
for feature, corr in top_negative.items():
    print(f"  {feature:35s}: {corr:+.4f}")

print("\nStrongest POSITIVE correlations (less liquid):")
for feature, corr in top_positive.items():
    print(f"  {feature:35s}: {corr:+.4f}")

In [None]:
# Full correlation heatmap for top features
# Select top features by absolute correlation
top_features = correlations.abs().nlargest(15).index.tolist()
top_features.append('liquidity_tier')

# Calculate correlation matrix
corr_matrix = features_df[top_features].corr()

# Create heatmap
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='RdBu_r', center=0,
            fmt='.2f', square=True, linewidths=0.5,
            cbar_kws={'shrink': 0.8, 'label': 'Correlation'},
            annot_kws={'size': 8})
plt.title('Correlation Heatmap: Top 15 Features + Target', fontsize=14, fontweight='bold')
plt.xticks(rotation=45, ha='right', fontsize=9)
plt.yticks(fontsize=9)
plt.tight_layout()
plt.show()

## 9. Save Engineered Features

Save the complete feature matrix for use in model training.

In [None]:
# Prepare final feature matrix for saving
# Convert categorical columns to strings for CSV compatibility
save_df = features_df.copy()

# Convert category columns to string
for col in save_df.columns:
    if save_df[col].dtype.name == 'category':
        save_df[col] = save_df[col].astype(str)

# Save to data folder
output_path = Path('..') / 'data' / 'engineered_features.csv'
save_df.to_csv(output_path, index=False)

print("="*60)
print("FEATURE ENGINEERING COMPLETE")
print("="*60)
print(f"\nSaved: {output_path}")
print(f"Rows: {len(save_df):,}")
print(f"Columns: {len(save_df.columns)}")
print(f"File size: {output_path.stat().st_size / 1024:.1f} KB")

# Summary by feature category
print("\nFeatures by Category:")
category_counts = summary_df['Category'].value_counts()
for cat, count in category_counts.items():
    print(f"  {cat}: {count} features")

# Verify saved data
print("\n" + "-"*60)
print("VERIFICATION")
print("-"*60)
loaded_df = pd.read_csv(output_path)
print(f"Verified: {len(loaded_df):,} rows loaded from saved file")
print(f"Columns match: {list(loaded_df.columns) == list(save_df.columns)}")

## 10. Next Steps

With feature engineering complete, the feature matrix is ready for model training.

### Feature Engineering Summary

| Category | Count | Key Features |
|----------|-------|-------------|
| Loan Features | ~25 | Credit rating encoding, facility size transforms, spread z-scores |
| Market Features | ~11 | VIX regime, credit spreads, market stress indicator |
| Liquidity Features | ~13 | Volume metrics, bid-ask analysis, dealer coverage |
| **Total** | **~50** | Comprehensive feature set for liquidity prediction |

### Key Predictive Features Identified

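Based on the correlation analysis in Section 8, the features below are the ones most strongly associated with liquidity tier. Pearson correlation captures linear association only, so treat them as candidates to validate during model training: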

**Positive Correlation (Higher = Less Liquid):**
- Bid-ask spread and percentile
- Days since last trade
- Credit rating (ordinal)
- Ownership concentration

**Negative Correlation (Higher = More Liquid):**
- Trading volume and percentile
- Dealer quote count and coverage
- Facility size and percentile
- Trade frequency

---

### Continue to Notebook 04: Model Training

In the next notebook, we will:
1. Prepare train/validation/test splits
2. Train multiple classification models
3. Tune hyperparameters
4. Evaluate model performance

---

**Notebook Series:**
- [x] Notebook 01: Data Collection
- [x] Notebook 02: Exploratory Data Analysis
- [x] **Notebook 03: Feature Engineering** (this notebook)
- [ ] Notebook 04: Model Training
- [ ] Notebook 05: Model Evaluation

---

**Continue to Notebook 04**: [04_model_training.ipynb](./04_model_training.ipynb)