# XGBoost for Credit Classification

This notebook demonstrates XGBoost (eXtreme Gradient Boosting) for credit default prediction using the UCI German Credit dataset.

## Model Overview

**XGBoost** is a gradient boosting algorithm that builds decision trees sequentially, with each new tree fitted to correct the errors of the ensemble built so far.
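The sequential idea can be sketched in a few lines of plain Python. This is a simplified illustration, not XGBoost's actual algorithm: it assumes squared-error loss, a single feature, and one-split "stumps", whereas XGBoost uses second-order gradients, regularised leaf weights, and deeper trees. The toy data and all function names are hypothetical.

```python
# Illustrative sketch only: plain gradient boosting on squared error with stumps.

def fit_stump(x, residuals):
    """Find the single threshold that best splits the residuals (lowest SSE)."""
    best = None
    # Excluding the largest value guarantees both sides of the split are non-empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [
            pi + learning_rate * predict_stump(stump, xi)
            for xi, pi in zip(x, predictions)
        ]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)
        mse_history.append(mse)
    return base, stumps, mse_history


# Hypothetical toy data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 5.1, 4.8]

base, stumps, mse_history = boost(x, y)
print(f"MSE after round 1:  {mse_history[0]:.4f}")
print(f"MSE after round 20: {mse_history[-1]:.4f}")
```

The training error falls as rounds are added because every stump targets what the current ensemble still gets wrong; the learning rate controls how much of each correction is applied.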

### Pros
- State-of-the-art performance on tabular data
- Handles missing values natively
- Built-in regularisation (L1, L2) to prevent overfitting
- Highly optimised for speed and memory efficiency
- Supports custom objective functions
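To make the regularisation point concrete, the sketch below computes the weight assigned to a single leaf from the summed gradients `G` and Hessians `H` of the samples that fall into it. It follows the closed-form leaf weight from the XGBoost formulation, with L1 applied as a soft threshold on `G` and L2 added to the denominator. The parameter names mirror XGBoost's `reg_alpha` and `reg_lambda`; the function itself and the numbers are illustrative assumptions, not library code.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Regularised leaf weight: -soft_threshold(G, alpha) / (H + lambda)."""
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        # L1 zeroes the leaf when the gradient signal is weaker than alpha.
        return 0.0
    return -shrunk / (hess_sum + reg_lambda)


# Hypothetical gradient statistics for one leaf
G, H = -4.0, 2.0

print(leaf_weight(G, H, reg_lambda=0.0))                 # no regularisation
print(leaf_weight(G, H, reg_lambda=2.0))                 # L2 shrinks the weight
print(leaf_weight(G, H, reg_lambda=2.0, reg_alpha=1.0))  # L1 shrinks it further
print(leaf_weight(G, H, reg_alpha=5.0))                  # strong L1 zeroes the leaf
```

Both penalties pull leaf weights towards zero, which limits how much any single tree can change the prediction.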

### Cons
- Requires careful hyperparameter tuning
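- Hard to interpret directly; post-hoc tools such as SHAP are needed to explain predictions
- Trees are built sequentially, so training with many trees can be slower than a Random Forest, whose trees can be built in parallel
- More prone to overfitting than bagging methods such as Random Forest if not regularised properly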

### When to Use
- When you need maximum predictive performance
- When dealing with structured/tabular data
- In competitions or production systems where accuracy is paramount

## Setup

In [None]:
import sys
sys.path.insert(0, '../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from creditclass.preprocessing import prepare_data
from creditclass.training import get_model, train_model, save_model, tune_hyperparameters
from creditclass.evaluation import (
    evaluate_model,
    compute_shap_values,
    get_learning_curve_data,
)
from creditclass.plots import (
    set_plot_style,
    plot_confusion_matrix,
    plot_roc_curve,
    plot_precision_recall,
    plot_feature_importance,
    plot_learning_curve,
    plot_calibration,
    plot_shap_summary,
)

set_plot_style()
RANDOM_STATE = 42

## Load Data

In [None]:
data = prepare_data(
    target_type='default',
    encoding_method='onehot',
    test_size=0.2,
    random_state=RANDOM_STATE,
    scale=False,  # tree splits depend only on feature ordering, so no scaling is needed
)

X_train = data['X_train'].values
X_test = data['X_test'].values
y_train = data['y_train']
y_test = data['y_test']
feature_names = data['feature_names']

print(f"Training set: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Test set: {X_test.shape[0]} samples")

## Training

In [None]:
model = get_model('xgboost')
model = train_model(model, X_train, y_train)

print("Model trained successfully!")
print(f"Number of estimators: {model.n_estimators}")
print(f"Max depth: {model.max_depth}")
print(f"Learning rate: {model.learning_rate}")

## Evaluation

In [None]:
metrics = evaluate_model(model, X_test, y_test)

print("Performance Metrics:")
print("-" * 30)
for name, value in metrics.items():
    if value is not None:
        print(f"{name.capitalize():12} {value:.4f}")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

plot_confusion_matrix(
    model, X_test, y_test,
    class_names=['Good Credit', 'Bad Credit'],
    ax=axes[0],
    title='XGBoost - Confusion Matrix'
)

plot_roc_curve(model, X_test, y_test, ax=axes[1], label='XGBoost')

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
plot_precision_recall(model, X_test, y_test, ax=ax, label='XGBoost')
plt.tight_layout()
plt.show()

## Interpretability

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
plot_feature_importance(
    model,
    feature_names=feature_names,
    top_n=15,
    ax=ax,
    title='XGBoost - Feature Importance (Gain)'
)
plt.tight_layout()
plt.show()

In [None]:
shap_data = compute_shap_values(model, X_test, feature_names=feature_names, max_samples=100)

fig, ax = plt.subplots(figsize=(10, 8))
plot_shap_summary(shap_data, plot_type='bar', max_display=15)
plt.title('XGBoost - SHAP Feature Importance')
plt.tight_layout()
plt.show()

## Hyperparameter Tuning

In [None]:
tuning_results = tune_hyperparameters(
    'xgboost',
    X_train, y_train,
    method='random',
    cv=5,
    scoring='f1',
    n_iter=20
)

print("Best Parameters:")
print(tuning_results['best_params'])
print(f"\nBest CV F1 Score: {tuning_results['best_score']:.4f}")

In [None]:
tuned_model = tuning_results['best_model']
tuned_metrics = evaluate_model(tuned_model, X_test, y_test)

print("\nTuned Model Performance:")
print("-" * 30)
for name, value in tuned_metrics.items():
    if value is not None:
        print(f"{name.capitalize():12} {value:.4f}")

## Learning Curve

In [None]:
lc_model = get_model('xgboost')
lc_data = get_learning_curve_data(lc_model, X_train, y_train, cv=5, scoring='f1')

fig, ax = plt.subplots(figsize=(8, 6))
plot_learning_curve(lc_data, ax=ax, title='XGBoost - Learning Curve')
plt.tight_layout()
plt.show()

## Calibration

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
plot_calibration(model, X_test, y_test, ax=ax, label='XGBoost')
plt.tight_layout()
plt.show()

## Save Model

In [None]:
save_path = save_model(model, 'xgboost')
print(f"Model saved to: {save_path}")

## Summary

### Key Takeaways

1. **Performance**: XGBoost often achieves the best performance on tabular data
2. **Regularisation**: Built-in L1/L2 regularisation helps prevent overfitting
3. **SHAP Integration**: SHAP's TreeExplainer computes exact SHAP values efficiently for tree ensembles
4. **Tuning**: Performance is sensitive to hyperparameters such as learning rate, tree depth, and number of estimators
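To show what an "exact" SHAP value means, the sketch below computes Shapley values by brute force for a tiny hand-written scoring function. This is not how TreeExplainer works internally (it exploits tree structure to avoid enumerating orderings), and it assumes a single baseline row in place of a background dataset. The model, feature names, and values are all hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)      # start from the baseline row
        previous = model(current)
        for i in order:
            current[i] = x[i]         # switch feature i to the applicant's value
            value = model(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]


def toy_risk_model(features):
    """Hypothetical risk score with an income/debt interaction."""
    income, debt, age = features
    score = 0.0
    if income <= 50:
        score += 1.0
    if debt > 10 and income <= 50:
        score += 1.5
    if age < 25:
        score += 0.5
    return score


applicant = [40, 20, 22]   # hypothetical applicant: income, debt, age
baseline = [60, 5, 40]     # hypothetical reference row

values = shapley_values(toy_risk_model, applicant, baseline)
for name, value in zip(["income", "debt", "age"], values):
    print(f"{name:8} {value:+.2f}")

# The attributions add up to the gap between the two predictions.
gap = toy_risk_model(applicant) - toy_risk_model(baseline)
print(f"sum of attributions: {sum(values):.2f}, prediction gap: {gap:.2f}")
```

The interaction term is split evenly between income and debt, and the attributions sum to the difference between the applicant's score and the baseline score. Brute force scales factorially with the number of features, which is why a tree-specific algorithm matters in practice.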

### Recommendations

- Start with conservative parameters and increase complexity gradually
- Use early stopping to find the optimal number of boosting rounds
- Consider class weights (for example XGBoost's `scale_pos_weight`) for imbalanced datasets
- Inspect SHAP values to check that predictions are driven by sensible features
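Two of these recommendations reduce to simple rules, sketched below in plain Python. The first mimics the logic behind early stopping: keep the round with the lowest validation loss and stop once it has not improved for `patience` rounds. The second computes the negative-to-positive ratio that is commonly used as a starting value for XGBoost's `scale_pos_weight`. The helper names, loss values, and labels are hypothetical, and whether the `creditclass` helpers expose these options is not shown in this notebook.

```python
def best_round_with_early_stopping(validation_losses, patience=3):
    """Return (best_round, best_loss), stopping after `patience` rounds without improvement."""
    best_round, best_loss = 0, float("inf")
    for round_index, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_round, best_loss = round_index, loss
        elif round_index - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_loss


def positive_class_weight(labels):
    """Ratio of negative to positive samples, assuming labels are 0 or 1."""
    n_positive = sum(1 for label in labels if label == 1)
    n_negative = len(labels) - n_positive
    return n_negative / n_positive


# Hypothetical validation losses, one per boosting round
losses = [0.70, 0.62, 0.58, 0.57, 0.59, 0.60, 0.61, 0.55]
print(best_round_with_early_stopping(losses, patience=3))

# Hypothetical labels with a 70/30 class split
labels = [0] * 7 + [1] * 3
print(f"{positive_class_weight(labels):.2f}")
```

Note that the loop stops before it reaches the final, lower loss: early stopping trades the chance of a late improvement for shorter training and less overfitting, so `patience` is itself worth tuning.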