# CreditClass: Master Analysis Notebook

This notebook provides comprehensive exploratory data analysis (EDA) and model comparison for credit classification using the UCI German Credit dataset.

## Contents

1. **Introduction** - Project overview and dataset description
2. **Setup & Data Loading** - Import the project modules, then download and inspect the raw data
3. **Exploratory Data Analysis** - Distributions, correlations, class balance
4. **Preprocessing** - Encoding and feature engineering demonstration
5. **Model Comparison** - Train and compare all six models
6. **Results** - Comparison tables, charts, ROC overlay
7. **Interpretability** - Feature importance for the tree-based models and SHAP values for XGBoost
8. **Multi-Task Analysis** - Results for tier classification and loan approval
9. **Conclusions** - Summary and recommendations

## 1. Introduction

### Project Overview

**CreditClass** demonstrates classification techniques on real-world credit data. We showcase six different models:

| Model | Type | Key Characteristic |
|-------|------|-------------------|
| Logistic Regression | Linear | Interpretable baseline |
| Random Forest | Ensemble (Bagging) | Robust, handles non-linearity |
| XGBoost | Ensemble (Boosting) | Often a strong performer on tabular data |
| SVM | Kernel-based | Maximum margin classifier |
| k-NN | Instance-based | Simple, non-parametric |
| Neural Network | Deep Learning | Learns complex patterns |

### Dataset

The **UCI German Credit** dataset contains 1,000 samples with 20 features (7 numerical, 13 categorical) describing credit applicants. The task is to classify applicants as good or bad credit risks; 700 applicants are labelled good and 300 bad.
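In the raw UCI file the class label is coded 1 for good and 2 for bad credit. The sketch below shows one way such a label could be mapped to a 0/1 target in which 1 marks bad credit, matching the class order used in the plots later in this notebook. The helper `to_binary_target` and the toy labels are assumptions for illustration only; `create_default_target` in this project may be implemented differently.

```python
# Hypothetical helper: maps raw UCI labels (1 = good, 2 = bad) to 0/1.
RAW_TO_BINARY = {1: 0, 2: 1}  # 1 marks bad credit, the positive class


def to_binary_target(raw_labels):
    """Return a list of 0/1 labels; raise on any unexpected raw code."""
    unknown = set(raw_labels) - set(RAW_TO_BINARY)
    if unknown:
        raise ValueError(f"Unexpected label codes: {sorted(unknown)}")
    return [RAW_TO_BINARY[label] for label in raw_labels]


# Toy labels for illustration only, not rows from the dataset.
toy_raw = [1, 1, 2, 1, 2]
toy_target = to_binary_target(toy_raw)
print(toy_target)  # [0, 0, 1, 0, 1]
```

Failing loudly on an unknown code is a deliberate choice here: a silent default would hide a corrupted or re-coded label column.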

## 2. Setup & Data Loading

In [None]:
import sys
sys.path.insert(0, '../src')

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from creditclass.preprocessing import (
    download_data,
    load_data,
    encode_categoricals,
    create_default_target,
    create_tier_target,
    create_approval_target,
    split_data,
    prepare_data,
    NUMERICAL_COLUMNS,
    CATEGORICAL_COLUMNS,
)
from creditclass.feature_engineering import engineer_all_features
from creditclass.training import get_model, train_model, save_model, get_all_model_names
from creditclass.evaluation import (
    evaluate_model,
    compare_models,
    compute_shap_values,
    get_feature_importance,
)
from creditclass.plots import (
    set_plot_style,
    plot_confusion_matrix,
    plot_roc_curve,
    plot_roc_curves_comparison,
    plot_precision_recall,
    plot_feature_importance,
    plot_correlation_heatmap,
    plot_class_distribution,
    plot_distribution,
    plot_model_comparison,
    plot_metrics_grouped_bar,
    plot_shap_summary,
    plot_calibration,
    COLOURS,
)

set_plot_style()
RANDOM_STATE = 42

print("Setup complete!")

In [None]:
# Download and load data
download_data()
df = load_data()

print(f"Dataset shape: {df.shape}")
print(f"\nFeatures: {df.shape[1] - 1}")
print(f"  - Numerical: {len(NUMERICAL_COLUMNS)}")
print(f"  - Categorical: {len(CATEGORICAL_COLUMNS)}")
print(f"\nSamples: {df.shape[0]}")

In [None]:
# Preview data
df.head()

In [None]:
# Data types and info
df.info()

## 3. Exploratory Data Analysis

### 3.1 Target Distribution

In [None]:
# Class distributions for the three derived targets
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Default target
default_target = create_default_target(df)
plot_class_distribution(
    default_target,
    class_names=['Good Credit', 'Bad Credit'],
    ax=axes[0],
    title='Credit Default (Binary)'
)

# Tier target
tier_target = create_tier_target(df)
plot_class_distribution(
    tier_target,
    class_names=['Low Risk', 'Medium Risk', 'High Risk'],
    ax=axes[1],
    title='Risk Tier (Multi-class)'
)

# Approval target
approval_target = create_approval_target(df)
plot_class_distribution(
    approval_target,
    class_names=['Denied', 'Approved'],
    ax=axes[2],
    title='Loan Approval (Binary)'
)

plt.tight_layout()
plt.show()

print("\nPositive-class share:")
print(f"  Default: {(default_target == 1).mean():.1%} positive (bad credit)")
print(f"  Approval: {(approval_target == 1).mean():.1%} positive (approved)")

### 3.2 Numerical Feature Distributions

In [None]:
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

for i, col in enumerate(NUMERICAL_COLUMNS):
    plot_distribution(df[col], ax=axes[i], kind='both', title=col.replace('_', ' ').title())

# Hide any unused subplots (the grid has 8 slots)
for ax in axes[len(NUMERICAL_COLUMNS):]:
    ax.axis('off')

plt.tight_layout()
plt.show()

In [None]:
# Summary statistics
df[NUMERICAL_COLUMNS].describe().round(2)

### 3.3 Correlation Analysis

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
plot_correlation_heatmap(df, columns=NUMERICAL_COLUMNS, ax=ax)
plt.tight_layout()
plt.show()

### 3.4 Categorical Feature Analysis

In [None]:
# Show categorical value counts for key features
key_categorical = ['checking_account_status', 'credit_history', 'purpose', 'employment_since']

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for i, col in enumerate(key_categorical):
    counts = df[col].value_counts()
    axes[i].barh(counts.index, counts.values, color=COLOURS[i])
    axes[i].set_xlabel('Count')
    axes[i].set_title(col.replace('_', ' ').title())
    axes[i].invert_yaxis()

plt.tight_layout()
plt.show()

### 3.5 Missing Values

In [None]:
missing = df.isnull().sum()
print(f"Total missing values: {missing.sum()}")

if missing.sum() > 0:
    print("\nMissing values by column:")
    print(missing[missing > 0])
else:
    print("No missing values in the dataset.")

## 4. Preprocessing Demonstration
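One-hot encoding replaces each categorical column with one 0/1 column per distinct category, which is why the column count grows after encoding. The sketch below illustrates the idea in plain Python. The function `one_hot` and the toy values are assumptions for illustration; `encode_categoricals(method='onehot')` may differ in details such as column naming or dropping a reference level.

```python
def one_hot(values):
    """One-hot encode a list of category labels.

    Returns (categories, rows), with one 0/1 column per distinct
    category, ordered alphabetically.
    """
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows


# Toy purpose values for illustration only.
toy_purpose = ["car", "education", "car", "furniture"]
columns, encoded = one_hot(toy_purpose)
print(columns)  # ['car', 'education', 'furniture']
print(encoded)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each row contains exactly one 1, so a column with k distinct categories becomes k binary columns under this scheme.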

In [None]:
# Demonstrate encoding
df_encoded, encoders = encode_categoricals(df, method='onehot')

print(f"Original columns: {df.shape[1]}")
print(f"Columns after one-hot encoding: {df_encoded.shape[1]}")
print("\nNew feature names (sample):")
print(df_encoded.columns[:10].tolist())

In [None]:
# Demonstrate feature engineering
df_engineered = engineer_all_features(df)

print(f"Features after engineering: {df_engineered.shape[1]}")
print("\nNew features:")
# Keep the DataFrame's column order; a set difference would print in arbitrary order
new_cols = [col for col in df_engineered.columns if col not in df.columns]
print(new_cols)

## 5. Model Comparison

In [None]:
# Prepare data for modelling
data = prepare_data(
    target_type='default',
    encoding_method='onehot',
    test_size=0.2,
    random_state=RANDOM_STATE,
    scale=True,
)

X_train = data['X_train']
X_test = data['X_test']
X_train_scaled = data['X_train_scaled']
X_test_scaled = data['X_test_scaled']
y_train = data['y_train']
y_test = data['y_test']
feature_names = data['feature_names']

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Features: {X_train.shape[1]}")

In [None]:
# Train all models
trained_models = {}

model_configs = [
    ('logistic_regression', X_train_scaled, X_test_scaled),
    ('random_forest', X_train.values, X_test.values),
    ('xgboost', X_train.values, X_test.values),
    ('svm', X_train_scaled, X_test_scaled),
    ('knn', X_train_scaled, X_test_scaled),
    ('neural_network', X_train_scaled, X_test_scaled),
]

for model_name, X_tr, X_te in model_configs:
    print(f"Training {model_name}...")
    
    if model_name == 'neural_network':
        model = get_model(model_name, params={'epochs': 100})
        model = train_model(model, X_tr, y_train.values)
    else:
        model = get_model(model_name)
        model = train_model(model, X_tr, y_train)
    
    trained_models[model_name] = {
        'model': model,
        'X_test': X_te,
    }

print("\nAll models trained!")

## 6. Results

### 6.1 Metrics Comparison Table
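As a reminder of what the table columns measure, the sketch below derives accuracy, precision, recall and F1 from confusion-matrix counts for 0/1 labels. The function `binary_metrics`, the toy labels, and the convention of returning 0.0 when a denominator is zero are assumptions for illustration; `evaluate_model` may compute these values differently.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

For the toy labels the counts are 2 true positives, 1 false positive, 1 false negative and 4 true negatives, so precision, recall and F1 all equal 2/3 while accuracy is 0.75.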

In [None]:
# Compute metrics for all models
all_metrics = []

for model_name, model_data in trained_models.items():
    model = model_data['model']
    X_te = model_data['X_test']
    
    if model_name == 'neural_network':
        metrics = evaluate_model(model, X_te, y_test.values)
    else:
        metrics = evaluate_model(model, X_te, y_test)
    
    metrics['model'] = model_name
    all_metrics.append(metrics)

comparison_df = pd.DataFrame(all_metrics).set_index('model')
comparison_df = comparison_df.sort_values('f1', ascending=False)

print("Model Comparison (sorted by F1 score):")
print("=" * 60)
comparison_df.round(4)

### 6.2 Metrics Visualisation

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# F1 comparison
plot_model_comparison(comparison_df, metric='f1', ax=axes[0], title='Model Comparison: F1 Score')

# AUC comparison
plot_model_comparison(comparison_df, metric='auc', ax=axes[1], title='Model Comparison: AUC')

plt.tight_layout()
plt.show()

In [None]:
# Grouped bar chart
fig, ax = plt.subplots(figsize=(14, 6))
plot_metrics_grouped_bar(
    comparison_df,
    metrics=['accuracy', 'precision', 'recall', 'f1', 'auc'],
    ax=ax,
    title='All Metrics Comparison'
)
plt.tight_layout()
plt.show()

### 6.3 ROC Curves Comparison
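The area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that definition. The function `auc_by_pairs` and the toy scores are assumptions for illustration; the plotting helpers in this project may rely on a library routine instead.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(toy_labels, toy_scores))  # 0.75
```

This pairwise loop is quadratic in the number of samples, which is fine for a 200-sample test set but not how production libraries compute the metric.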

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))

for i, (model_name, model_data) in enumerate(trained_models.items()):
    model = model_data['model']
    X_te = model_data['X_test']
    
    if model_name == 'neural_network':
        plot_roc_curve(model, X_te, y_test.values, ax=ax, label=model_name, colour=COLOURS[i])
    else:
        plot_roc_curve(model, X_te, y_test, ax=ax, label=model_name, colour=COLOURS[i])

ax.set_title('ROC Curves - All Models')
plt.tight_layout()
plt.show()

### 6.4 Confusion Matrices

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, (model_name, model_data) in enumerate(trained_models.items()):
    model = model_data['model']
    X_te = model_data['X_test']
    
    if model_name == 'neural_network':
        plot_confusion_matrix(
            model, X_te, y_test.values,
            class_names=['Good', 'Bad'],
            ax=axes[i],
            title=model_name.replace('_', ' ').title()
        )
    else:
        plot_confusion_matrix(
            model, X_te, y_test,
            class_names=['Good', 'Bad'],
            ax=axes[i],
            title=model_name.replace('_', ' ').title()
        )

plt.tight_layout()
plt.show()

## 7. Interpretability - Feature Importance and SHAP

In [None]:
# Feature importance for tree-based models
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

plot_feature_importance(
    trained_models['random_forest']['model'],
    feature_names=feature_names,
    top_n=10,
    ax=axes[0],
    title='Random Forest - Feature Importance'
)

plot_feature_importance(
    trained_models['xgboost']['model'],
    feature_names=feature_names,
    top_n=10,
    ax=axes[1],
    title='XGBoost - Feature Importance'
)

plt.tight_layout()
plt.show()

In [None]:
# SHAP for XGBoost (fast with TreeExplainer)
print("Computing SHAP values for XGBoost...")
xgb_shap = compute_shap_values(
    trained_models['xgboost']['model'],
    trained_models['xgboost']['X_test'],
    feature_names=feature_names,
    max_samples=100
)

fig, ax = plt.subplots(figsize=(10, 8))
plot_shap_summary(xgb_shap, plot_type='bar', max_display=15)
plt.title('XGBoost - SHAP Feature Importance')
plt.tight_layout()
plt.show()

## 8. Multi-Task Analysis

### 8.1 Risk Tier Classification (Multi-class)

In [None]:
# Prepare tier classification data
tier_data = prepare_data(
    target_type='tier',
    encoding_method='onehot',
    test_size=0.2,
    random_state=RANDOM_STATE,
    scale=True,
)

# Train XGBoost on tier task
tier_model = get_model('xgboost')
tier_model = train_model(tier_model, tier_data['X_train'].values, tier_data['y_train'])

# Evaluate
tier_metrics = evaluate_model(
    tier_model,
    tier_data['X_test'].values,
    tier_data['y_test'],
    average='macro'
)

print("Risk Tier Classification (XGBoost):")
print("-" * 40)
for name, value in tier_metrics.items():
    if value is not None:
        print(f"{name.capitalize():12} {value:.4f}")

In [None]:
# Confusion matrix for tier classification
fig, ax = plt.subplots(figsize=(7, 6))
plot_confusion_matrix(
    tier_model,
    tier_data['X_test'].values,
    tier_data['y_test'],
    class_names=['Low Risk', 'Medium Risk', 'High Risk'],
    ax=ax,
    title='Risk Tier Classification - Confusion Matrix'
)
plt.tight_layout()
plt.show()

### 8.2 Loan Approval Prediction

In [None]:
# Prepare approval data
approval_data = prepare_data(
    target_type='approval',
    encoding_method='onehot',
    test_size=0.2,
    random_state=RANDOM_STATE,
    scale=True,
)

# Train XGBoost on approval task
approval_model = get_model('xgboost')
approval_model = train_model(approval_model, approval_data['X_train'].values, approval_data['y_train'])

# Evaluate
approval_metrics = evaluate_model(
    approval_model,
    approval_data['X_test'].values,
    approval_data['y_test']
)

print("Loan Approval Prediction (XGBoost):")
print("-" * 40)
for name, value in approval_metrics.items():
    if value is not None:
        print(f"{name.capitalize():12} {value:.4f}")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

plot_confusion_matrix(
    approval_model,
    approval_data['X_test'].values,
    approval_data['y_test'],
    class_names=['Denied', 'Approved'],
    ax=axes[0],
    title='Loan Approval - Confusion Matrix'
)

plot_roc_curve(
    approval_model,
    approval_data['X_test'].values,
    approval_data['y_test'],
    ax=axes[1],
    label='XGBoost'
)
axes[1].set_title('Loan Approval - ROC Curve')

plt.tight_layout()
plt.show()

## 9. Conclusions

In [None]:
# Summary table
print("=" * 70)
print("SUMMARY: Credit Default Prediction")
print("=" * 70)
print(f"\nDataset: UCI German Credit ({len(X_train) + len(X_test)} samples, {X_train.shape[1]} encoded features)")
print(f"\nBest Model: {comparison_df.index[0]} (F1 = {comparison_df.iloc[0]['f1']:.4f})")
print("\nModel Rankings (by F1 score):")
print("-" * 50)

for i, (model_name, row) in enumerate(comparison_df.iterrows(), 1):
    print(f"  {i}. {model_name:25} F1={row['f1']:.4f}  AUC={row['auc']:.4f}")

print("\n" + "=" * 70)

### Key Findings

1. **Model Performance**: Tree-based models (XGBoost, Random Forest) often perform well on tabular datasets like this one; check the ranking in section 6.1 for the result of this run

2. **Interpretability vs Performance Trade-off**:
   - Logistic Regression: Most interpretable, reasonable baseline
   - XGBoost/Random Forest: Often the strongest performers, with SHAP providing post-hoc interpretability
   - Neural Network: May underperform on small tabular datasets

3. **Important Features**: Checking account status, credit history, and credit amount commonly appear among the top predictors; compare with the importance plots in section 7

4. **Class Imbalance**: The dataset has roughly 30% bad credit risks (300 of 1,000), so accuracy alone can be misleading and should be read alongside recall, F1, and AUC
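To make the imbalance point concrete, the sketch below scores a trivial baseline that labels every applicant as good credit, using the class counts of the full dataset (700 good, 300 bad). The variable names are illustrative, and the counts in the actual test split will differ from these full-dataset figures.

```python
# Class counts for the full UCI German Credit dataset: 700 good, 300 bad.
n_good, n_bad = 700, 300
y_true = [0] * n_good + [1] * n_bad  # 1 = bad credit (positive class)
y_pred = [0] * (n_good + n_bad)      # baseline: always predict "good"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / n_bad

print(f"Baseline accuracy: {accuracy:.2f}")           # Baseline accuracy: 0.70
print(f"Baseline recall (bad credit): {recall:.2f}")  # Baseline recall (bad credit): 0.00
```

A model has to beat 70% accuracy before that number says anything, and even then recall on the bad-credit class shows whether it finds the cases that matter.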

### Recommendations

1. **For Production**: Use XGBoost with SHAP explanations to support an audit trail
2. **For Regulatory Compliance**: Consider Logistic Regression for full interpretability
3. **For Exploration**: Use the individual model notebooks for deep dives
4. **For Better Performance**: Consider ensemble stacking or more feature engineering

In [None]:
# Save all models
print("Saving models...")
for model_name, model_data in trained_models.items():
    save_path = save_model(model_data['model'], model_name)
    print(f"  Saved: {model_name} -> {save_path}")

print("\nAll models saved to outputs/models/")