# Notebook 04: Model Training
## CLO Loan-Level Liquidity Predictor

This notebook trains machine learning models for loan liquidity prediction and trade cost estimation.

---

**Objectives:**
1. Train XGBoost classifier for liquidity tier prediction (1-5)
2. Train LightGBM regressor for bid-ask spread prediction
3. Evaluate model performance with cross-validation
4. Analyze feature importance

**Prerequisites:**
- Notebook 03 completed (feature engineering)
- `data/engineered_features.csv` available

---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import sys
sys.path.insert(0, '..')

from src.models.liquidity_model import LiquidityScoreModel
from src.models.spread_model import TradeCostPredictor

# Load engineered features
df = pd.read_csv('../data/engineered_features.csv')
print(f"Loaded {len(df)} samples with {len(df.columns)} features")

In [None]:
# Define feature columns (exclude target and identifiers)
exclude_cols = ['liquidity_tier', 'loan_id']
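# select_dtypes('number') also picks up int32/float32 columns that an exact dtype match would miss
numeric_cols = df.select_dtypes(include='number').columns
feature_cols = [c for c in numeric_cols if c not in exclude_cols]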

X = df[feature_cols]
y = df['liquidity_tier']

# Train-test split (stratify=y keeps tier proportions similar in both sets)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
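
Because the split is stratified on `liquidity_tier`, each tier should appear in roughly the same proportion in the training and test sets. A minimal sketch of how this could be checked, using synthetic labels (the `tiers` series, its class weights, and `dummy_X` are made up for illustration and are not the notebook's data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced tier labels (illustrative only, not the real dataset)
rng = np.random.default_rng(42)
tiers = pd.Series(rng.choice([1, 2, 3, 4, 5], size=1000, p=[0.1, 0.2, 0.4, 0.2, 0.1]))
dummy_X = pd.DataFrame({"x": rng.normal(size=len(tiers))})

X_tr, X_te, y_tr, y_te = train_test_split(
    dummy_X, tiers, test_size=0.2, random_state=42, stratify=tiers
)

# Compare the share of each tier in the two splits
shares = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True).sort_index(),
    "test": y_te.value_counts(normalize=True).sort_index(),
})
max_gap = (shares["train"] - shares["test"]).abs().max()
print(shares.round(3))
print(f"Largest share difference: {max_gap:.4f}")
```

The same comparison can be run on `y_train` and `y_test` from the cell above to confirm that no tier is under-represented in the test set.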

## Liquidity Tier Model (XGBoost Classifier)

Train an XGBoost multi-class classifier to predict liquidity tiers (1-5):
- **Tier 1**: Most Liquid - frequent trading, tight spreads
- **Tier 2**: Liquid
- **Tier 3**: Moderate
- **Tier 4**: Illiquid
- **Tier 5**: Most Illiquid - infrequent trading, wide spreads
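
Recent versions of XGBoost's scikit-learn interface expect class labels to be consecutive integers starting at 0, so tiers 1-5 typically need to be shifted to 0-4 before fitting and shifted back after prediction. Whether `LiquidityScoreModel` already handles this internally is not shown in this notebook; the helper names below are hypothetical and only sketch the mapping.

```python
import numpy as np

TIER_MIN = 1  # lowest tier label used in this notebook


def encode_tiers(tiers):
    """Map tier labels 1..5 to zero-based class indices 0..4."""
    return np.asarray(tiers) - TIER_MIN


def decode_tiers(class_indices):
    """Map zero-based class indices back to tier labels 1..5."""
    return np.asarray(class_indices) + TIER_MIN


tiers = np.array([1, 3, 5, 2, 4])
encoded = encode_tiers(tiers)
print(encoded)
print(decode_tiers(encoded))
```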

In [None]:
# Initialize and train the model
liquidity_model = LiquidityScoreModel(n_estimators=200, max_depth=6, learning_rate=0.1)
cv_metrics = liquidity_model.fit(X_train, y_train)

print("Cross-Validation Results:")
for metric, value in cv_metrics.items():
    if isinstance(value, float):
        print(f"  {metric}: {value:.4f}")
    elif isinstance(value, list):
        print(f"  {metric}: {[f'{v:.4f}' for v in value]}")
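
The internals of `LiquidityScoreModel.fit` are not shown here, so the exact cross-validation scheme is unknown. As a rough illustration of what stratified k-fold cross-validation computes, the sketch below uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost on synthetic data; the fold count, the data, and the stand-in model are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 5-class problem standing in for the liquidity tiers
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)

# Stand-in model; the notebook's wrapper uses XGBoost instead
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)

# Each fold keeps roughly the same class proportions as the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", [f"{s:.4f}" for s in scores])
print(f"Mean accuracy: {scores.mean():.4f} (std {scores.std():.4f})")
```

Reporting the spread across folds alongside the mean gives a sense of how stable the score is before the held-out test set is touched.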

In [None]:
from sklearn.metrics import classification_report, accuracy_score

# Evaluate the classifier on the held-out test set
y_pred = liquidity_model.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Tier 1', 'Tier 2', 'Tier 3', 'Tier 4', 'Tier 5']))

In [None]:
# Get and plot confusion matrix
cm_df = liquidity_model.get_confusion_matrix(X_test, y_test)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues')
plt.title('Liquidity Model - Confusion Matrix')
plt.tight_layout()
plt.show()
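
Since the tiers are ordered, a prediction that misses by one tier is usually less severe than one that misses by several. One way to summarize this from a confusion matrix is a "within one tier" accuracy. The matrix below is hypothetical (rows are assumed to be actual tiers, columns predicted tiers); the same calculation should apply to `cm_df.values` if `get_confusion_matrix` returns a square matrix in that orientation.

```python
import numpy as np

# Hypothetical 5x5 confusion matrix (rows = actual tier, columns = predicted tier)
cm = np.array([
    [40,  8,  2,  0,  0],
    [ 6, 35,  7,  2,  0],
    [ 1,  9, 30,  8,  2],
    [ 0,  2,  7, 33,  8],
    [ 0,  0,  3,  9, 38],
])

total = cm.sum()

# Exact accuracy: share of samples on the main diagonal
exact = np.trace(cm) / total

# Within-one accuracy: share of samples where |actual - predicted| <= 1
rows, cols = np.indices(cm.shape)
within_one = cm[np.abs(rows - cols) <= 1].sum() / total

print(f"Exact accuracy:      {exact:.3f}")
print(f"Within-one accuracy: {within_one:.3f}")
```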

In [None]:
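# Plot the 20 most important features
# (assumes get_feature_importance() returns rows sorted by descending importance)
importance_df = liquidity_model.get_feature_importance()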
plt.figure(figsize=(10, 8))
top_20 = importance_df.head(20)
plt.barh(range(len(top_20)), top_20['importance'].values)
plt.yticks(range(len(top_20)), top_20['feature'].values)
plt.xlabel('Importance')
plt.title('Top 20 Features - Liquidity Model')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## Trade Cost Model (LightGBM Regressor)

Train a LightGBM regressor to predict bid-ask spreads in basis points.

The spread prediction enables:
- Trade cost estimation for portfolio analysis
- Execution planning and optimization
- Liquidity risk management
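
To make the link between a predicted spread and a trade cost concrete, here is a minimal sketch of the arithmetic. It assumes a one-way trade pays half the quoted bid-ask spread relative to mid, which is a common convention but not something this notebook specifies; the function name and parameters are hypothetical and not part of `TradeCostPredictor`.

```python
def estimate_trade_cost(notional, spread_bps, half_spread=True):
    """Estimate the cost of crossing the spread, in the same currency as `notional`.

    half_spread=True assumes a one-way trade pays half the quoted spread
    relative to mid (an assumption, not the notebook's definition).
    """
    spread_fraction = spread_bps / 10_000  # 1 bp = 0.01%
    factor = 0.5 if half_spread else 1.0
    return notional * spread_fraction * factor


one_way = estimate_trade_cost(notional=5_000_000, spread_bps=80)
round_trip = estimate_trade_cost(notional=5_000_000, spread_bps=80, half_spread=False)
print(f"One-way cost:    {one_way:,.0f}")
print(f"Round-trip cost: {round_trip:,.0f}")
```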

In [None]:
# For the spread model, use bid_ask_spread as the target.
# Reuse the split indices so both models share the same train/test rows.
if 'bid_ask_spread' not in df.columns:
    raise KeyError("'bid_ask_spread' column not found in engineered features")
y_spread_train = df.loc[X_train.index, 'bid_ask_spread']
y_spread_test = df.loc[X_test.index, 'bid_ask_spread']

# Remove bid_ask_spread from features for spread model
spread_feature_cols = [c for c in feature_cols if c != 'bid_ask_spread']
X_spread_train = X_train[spread_feature_cols]
X_spread_test = X_test[spread_feature_cols]

# Train spread predictor
spread_model = TradeCostPredictor(n_estimators=200, max_depth=6, learning_rate=0.1)
spread_metrics = spread_model.fit(X_spread_train, y_spread_train)

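print("Training Metrics:")
for metric, value in spread_metrics.items():
    # Only apply float formatting to numeric scalars; print anything else as-is
    if isinstance(value, (int, float, np.number)):
        print(f"  {metric}: {value:.4f}")
    else:
        print(f"  {metric}: {value}")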

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score

y_spread_pred = spread_model.predict(X_spread_test)

print(f"Test MAE: {mean_absolute_error(y_spread_test, y_spread_pred):.2f} bps")
print(f"Test R2: {r2_score(y_spread_test, y_spread_pred):.4f}")

# Scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(y_spread_test, y_spread_pred, alpha=0.5)
plt.plot([y_spread_test.min(), y_spread_test.max()], [y_spread_test.min(), y_spread_test.max()], 'r--')
plt.xlabel('Actual Bid-Ask Spread (bps)')
plt.ylabel('Predicted Bid-Ask Spread (bps)')
plt.title('Spread Model - Actual vs Predicted')
plt.tight_layout()
plt.show()
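
For reference, the two metrics reported above can be computed by hand. The sketch below follows the standard definitions of MAE and R² (the same ones scikit-learn uses) on a small made-up set of spreads; the numbers are illustrative only.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up spreads in basis points
actual = [50, 75, 100, 150, 200]
predicted = [55, 70, 110, 140, 190]

mae_value = mae(actual, predicted)
r2_value = r2(actual, predicted)
print(f"MAE: {mae_value:.2f} bps")
print(f"R2:  {r2_value:.4f}")
```

MAE is in the same units as the target (bps), which makes it easy to compare against typical spread levels, while R² is unitless and measures the share of variance explained.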

In [None]:
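# Plot the 20 most important features for the spread model
# (assumes get_feature_importance() returns rows sorted by descending importance)
spread_importance = spread_model.get_feature_importance()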
plt.figure(figsize=(10, 8))
top_20_spread = spread_importance.head(20)
plt.barh(range(len(top_20_spread)), top_20_spread['importance'].values)
plt.yticks(range(len(top_20_spread)), top_20_spread['feature'].values)
plt.xlabel('Importance')
plt.title('Top 20 Features - Spread Model')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
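# Save trained models (create the output directory if it does not exist)
import os
os.makedirs('../models', exist_ok=True)
liquidity_model.save('../models/liquidity_model.joblib')
spread_model.save('../models/spread_model.joblib')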
print("Models saved successfully!")
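
How the saved files are read back depends on what the wrappers' `save` methods write, which is not shown in this notebook. Assuming they produce ordinary joblib files, a round trip might look like the sketch below, which uses a small scikit-learn model as a stand-in and a temporary directory instead of `../models/`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model fitted on y = 2x + 1 (not one of the notebook's models)
X_fit = np.array([[0.0], [1.0], [2.0]])
y_fit = np.array([1.0, 3.0, 5.0])
model = LinearRegression().fit(X_fit, y_fit)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand_in_model.joblib")
    joblib.dump(model, path)      # serialize to disk
    restored = joblib.load(path)  # read it back

# The restored model should give the same predictions as the original
X_new = np.array([[3.0]])
same = np.allclose(model.predict(X_new), restored.predict(X_new))
print(f"Predictions match after reload: {same}")
```

A quick check like this after saving can catch serialization problems before the next notebook depends on the files.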

## Next Steps

Model training is complete, and both trained models have been saved to the `models/` directory.

### Summary

| Model | Type | Target | Key Metrics |
|-------|------|--------|-------------|
| Liquidity Model | XGBoost Classifier | Liquidity Tier (1-5) | Accuracy, F1-Score |
| Spread Model | LightGBM Regressor | Bid-Ask Spread (bps) | MAE, R2 |

### Saved Artifacts

- `models/liquidity_model.joblib` - Trained XGBoost classifier
- `models/spread_model.joblib` - Trained LightGBM regressor

---

**Continue to Notebook 05**: [05_model_evaluation.ipynb](./05_model_evaluation.ipynb)

In the next notebook, we will:
1. Perform detailed model evaluation
2. Analyze SHAP explanations for interpretability
3. Conduct error analysis
4. Generate model performance reports

---

**Notebook Series:**
- [x] Notebook 01: Data Collection
- [x] Notebook 02: Exploratory Data Analysis
- [x] Notebook 03: Feature Engineering
- [x] **Notebook 04: Model Training** (this notebook)
- [ ] Notebook 05: Model Evaluation