# Random Forest for Credit Classification

This notebook demonstrates Random Forest classification for credit default prediction using the UCI German Credit dataset.

## Model Overview

**Random Forest** is an ensemble method that trains many decision trees, each on a bootstrap sample of the data with a random subset of features considered at each split, and aggregates their predictions through voting (classification) or averaging (regression).
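
As a minimal sketch of the aggregation step only (not of how the trees are trained), the following uses hard-coded per-tree outputs. The helper names `majority_vote` and `average_prediction` are illustrative and not part of any library. Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this shows the textbook formulation.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_labels):
    """Classification: return the label predicted by the most trees."""
    return Counter(tree_labels).most_common(1)[0][0]


def average_prediction(tree_values):
    """Regression: return the mean of the per-tree predictions."""
    return mean(tree_values)


# Hypothetical outputs of five trees for one applicant (illustrative values).
tree_labels = ["good", "bad", "good", "good", "bad"]
print(majority_vote(tree_labels))  # good

# Hypothetical outputs of three regression trees for one sample.
tree_values = [10.0, 20.0, 30.0]
print(average_prediction(tree_values))  # 20.0
```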

### Pros
- Handles non-linear relationships effectively
- Robust to outliers and noise
- Built-in feature importance via impurity decrease
- Less prone to overfitting than single decision trees
- Handles both numerical and categorical features (categoricals are one-hot encoded in this notebook)

### Cons
- Can overfit on small datasets with many features
- Less interpretable than linear models
- Slower inference than simpler models
- Memory-intensive for large forests

### When to Use
- When you need good out-of-the-box performance
- When feature interactions are important
- When you want built-in feature importance

## Setup

In [None]:
import sys
sys.path.insert(0, '../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from creditclass.preprocessing import prepare_data
from creditclass.training import get_model, train_model, save_model, tune_hyperparameters
from creditclass.evaluation import (
    evaluate_model,
    compute_shap_values,
    get_learning_curve_data,
)
from creditclass.plots import (
    set_plot_style,
    plot_confusion_matrix,
    plot_roc_curve,
    plot_precision_recall,
    plot_feature_importance,
    plot_learning_curve,
    plot_calibration,
    plot_shap_summary,
)

set_plot_style()
RANDOM_STATE = 42

## Load Data

In [None]:
data = prepare_data(
    target_type='default',
    encoding_method='onehot',
    test_size=0.2,
    random_state=RANDOM_STATE,
    scale=False,  # Random Forest doesn't require scaling
)

X_train = data['X_train'].values
X_test = data['X_test'].values
y_train = data['y_train']
y_test = data['y_test']
feature_names = data['feature_names']

print(f"Training set: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Test set: {X_test.shape[0]} samples")
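
What `encoding_method='onehot'` does inside `prepare_data` is not shown here. As a rough illustration of one-hot encoding in general, the sketch below uses `pandas.get_dummies` on a toy frame. The column names and values are made up for the example, and `prepare_data` may well work differently.

```python
import pandas as pd

# Toy data: one numeric and one categorical column (illustrative values only).
df = pd.DataFrame({
    "duration": [6, 24, 12],
    "purpose": ["car", "education", "car"],
})

# Each category of `purpose` becomes its own indicator column.
encoded = pd.get_dummies(df, columns=["purpose"])
print(list(encoded.columns))
# ['duration', 'purpose_car', 'purpose_education']
```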

## Training

In [None]:
model = get_model('random_forest')
model = train_model(model, X_train, y_train)

print("Model trained successfully!")
print(f"Number of trees: {model.n_estimators}")
print(f"Max depth: {model.max_depth}")

## Evaluation

In [None]:
metrics = evaluate_model(model, X_test, y_test)

print("Performance Metrics:")
print("-" * 30)
for name, value in metrics.items():
    if value is not None:
        print(f"{name.capitalize():12} {value:.4f}")
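
The exact keys returned by `evaluate_model` are not shown in this notebook. Assuming they include the usual precision, recall and F1, the sketch below shows how those three are derived from confusion-matrix counts. The function name and the counts are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.75 recall=0.60 f1=0.67
```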

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

plot_confusion_matrix(
    model, X_test, y_test,
    class_names=['Good Credit', 'Bad Credit'],
    ax=axes[0],
    title='Random Forest - Confusion Matrix'
)

plot_roc_curve(model, X_test, y_test, ax=axes[1], label='Random Forest')

plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
plot_precision_recall(model, X_test, y_test, ax=ax, label='Random Forest')
plt.tight_layout()
plt.show()

## Interpretability

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
plot_feature_importance(
    model,
    feature_names=feature_names,
    top_n=15,
    ax=ax,
    title='Random Forest - Feature Importance (Gini Impurity)'
)
plt.tight_layout()
plt.show()
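
The "Gini" in the plot title refers to impurity-based importance: a feature's importance accumulates the impurity decrease of the splits that use it. A minimal sketch of the impurity decrease for a single split follows. The function names and labels are illustrative, and a real implementation additionally weights each node by the share of samples reaching it and averages over trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


# Illustrative split of 8 samples (0 = good credit, 1 = bad credit).
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]
print(gini(parent))                            # 0.5
print(impurity_decrease(parent, left, right))  # 0.125
```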

In [None]:
shap_data = compute_shap_values(model, X_test, feature_names=feature_names, max_samples=100)

fig, ax = plt.subplots(figsize=(10, 8))
plot_shap_summary(shap_data, plot_type='bar', max_display=15)
plt.title('Random Forest - SHAP Feature Importance')
plt.tight_layout()
plt.show()

## Hyperparameter Tuning

In [None]:
tuning_results = tune_hyperparameters(
    'random_forest',
    X_train, y_train,
    method='random',
    cv=5,
    scoring='f1',
    n_iter=20
)

print("Best Parameters:")
print(tuning_results['best_params'])
print(f"\nBest CV F1 Score: {tuning_results['best_score']:.4f}")

In [None]:
tuned_model = tuning_results['best_model']
tuned_metrics = evaluate_model(tuned_model, X_test, y_test)

print("\nTuned Model Performance:")
print("-" * 30)
for name, value in tuned_metrics.items():
    if value is not None:
        print(f"{name.capitalize():12} {value:.4f}")
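
`tune_hyperparameters` is a project helper whose internals are not shown. Assuming `method='random'` wraps something like scikit-learn's `RandomizedSearchCV`, a self-contained sketch on synthetic data might look like this. The search space and the smaller `n_iter` are placeholders chosen to keep the example fast, not the values the helper uses.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Hypothetical search space; each candidate is drawn at random from these lists.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 4, 8],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,        # number of sampled candidates
    cv=5,            # 5-fold cross-validation per candidate
    scoring="f1",
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(f"Best CV F1: {search.best_score_:.4f}")
```

Because the best score is selected on the cross-validation folds, it tends to be optimistic; the held-out test set remains the fairer comparison between the default and tuned models.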

## Learning Curve

In [None]:
lc_model = get_model('random_forest')
lc_data = get_learning_curve_data(lc_model, X_train, y_train, cv=5, scoring='f1')

fig, ax = plt.subplots(figsize=(8, 6))
plot_learning_curve(lc_data, ax=ax, title='Random Forest - Learning Curve')
plt.tight_layout()
plt.show()
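
`get_learning_curve_data` is likewise a project helper. Assuming it is built on scikit-learn's `learning_curve`, the underlying computation could be sketched as follows on synthetic data. The training-size grid and forest size are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),  # fractions of each training fold
    cv=5,
    scoring="f1",
)

# Score arrays have one row per training size and one column per CV fold.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_val):
    print(f"n={int(n):4d}  train F1={tr:.3f}  validation F1={va:.3f}")
```

A large, persistent gap between the training and validation curves suggests overfitting, while a validation curve that is still rising at the largest size suggests more data would help.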

## Calibration

In [None]:
fig, ax = plt.subplots(figsize=(7, 6))
plot_calibration(model, X_test, y_test, ax=ax, label='Random Forest')
plt.tight_layout()
plt.show()
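
What `plot_calibration` draws is not shown; a reliability diagram is typically built from binned predicted probabilities versus observed frequencies. The sketch below computes those points with scikit-learn's `calibration_curve` and compares Brier scores before and after wrapping the forest in `CalibratedClassifierCV`. The synthetic data, isotonic method and bin count are assumptions for illustration, and recalibration is not guaranteed to improve the score on a small sample.

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

raw = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Isotonic recalibration fitted with internal 5-fold cross-validation.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    method="isotonic",
    cv=5,
).fit(X_tr, y_tr)

raw_prob = raw.predict_proba(X_te)[:, 1]
cal_prob = calibrated.predict_proba(X_te)[:, 1]

# Reliability-diagram points: observed frequency vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_te, raw_prob, n_bins=10)

print(f"Brier score (raw):        {brier_score_loss(y_te, raw_prob):.4f}")
print(f"Brier score (calibrated): {brier_score_loss(y_te, cal_prob):.4f}")
```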

## Save Model

In [None]:
save_path = save_model(model, 'random_forest')
print(f"Model saved to: {save_path}")

## Summary

### Key Takeaways

1. **Performance**: Random Forest can outperform logistic regression when the data contain non-linear patterns or feature interactions that a linear model misses
2. **Feature Importance**: Gini-based importance and SHAP values provide insight into predictive features
3. **Robustness**: Less sensitive to outliers and doesn't require feature scaling
4. **Trade-off**: Better performance comes at the cost of interpretability

### Recommendations

- Tune `max_depth` to control overfitting and `n_estimators` to trade accuracy against training and inference cost
- Use the out-of-bag (OOB) score for a quick validation estimate without a separate hold-out set
- Consider calibrating the model if probability estimates are important
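
As a sketch of the OOB suggestion, assuming the model is a scikit-learn `RandomForestClassifier` (the notebook obtains it through `get_model`, so this is an assumption): passing `oob_score=True` scores each training sample using only the trees that did not see it in their bootstrap sample. Synthetic data stands in for the credit data here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # requires bootstrap=True, which is the default
    random_state=42,
)
forest.fit(X, y)

# Accuracy estimated from out-of-bag samples; no separate validation split needed.
print(f"OOB accuracy: {forest.oob_score_:.4f}")
```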