# Logistic Regression for Credit Classification

This notebook demonstrates logistic regression for credit default prediction using the UCI German Credit dataset.

## Model Overview

**Logistic Regression** is a linear model for binary classification. It estimates the probability that an instance belongs to the positive class (here, default) by passing a weighted sum of the features through the logistic (sigmoid) function.
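
As a minimal sketch of that idea (not the implementation used later in this notebook), the logistic function turns a linear score into a probability. The weights, bias, and input below are made-up values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one feature vector (illustration only)
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.0, 2.0])

score = w @ x + b          # linear part: w·x + b
prob = sigmoid(score)      # estimated P(y = 1 | x)
label = int(prob >= 0.5)   # conventional 0.5 decision threshold
```

A score of zero corresponds to a probability of exactly 0.5, so the decision boundary is the set of points where the linear score is zero.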

### Pros
- Highly interpretable coefficients (feature weights)
- Fast training and inference
- Works well as a baseline model
- Supports regularisation (L1, L2) to prevent overfitting
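- Usually outputs reasonably well-calibrated probabilities, because it is fitted by minimising log loss (strong regularisation or class reweighting can degrade this)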

### Cons
- Assumes a linear decision boundary in the feature space
- Struggles with complex, non-linear relationships
- May underperform on datasets with many feature interactions
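
The interaction limitation can be illustrated with a toy XOR-style example, sketched here under the assumption that the label depends only on the product of two features. A brute-force search over linear rules on the raw features never classifies more than three of the four points correctly, while adding the product as an explicit feature makes the classes linearly separable. The data and the search grid are invented for this sketch.

```python
import itertools
import numpy as np

# Four XOR-style points: the label depends only on the interaction x1 * x2
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # [1, 0, 0, 1]

def accuracy(weights, bias, features):
    """Accuracy of the linear rule: predict 1 when weights·x + bias > 0."""
    preds = (features @ weights + bias > 0).astype(int)
    return float((preds == y).mean())

# Brute-force search over a coarse grid of linear rules on the raw features
grid = np.linspace(-2.0, 2.0, 9)
best_raw = max(
    accuracy(np.array([w1, w2]), b, X)
    for w1, w2, b in itertools.product(grid, repeat=3)
)

# Add the interaction as a third feature; a linear rule now separates the classes
X_inter = np.column_stack([X, X[:, 0] * X[:, 1]])
acc_inter = accuracy(np.array([0.0, 0.0, 1.0]), 0.0, X_inter)
```

Here `best_raw` stays at 0.75 while `acc_inter` reaches 1.0, which is why hand-crafted interaction terms are a common remedy when a linear model is required.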

### When to Use
- When interpretability is important (e.g., regulatory requirements)
- As a baseline to compare against more complex models
- When you need fast predictions at scale

## Setup

In [None]:
import sys
sys.path.insert(0, '../src')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from creditclass.preprocessing import prepare_data
from creditclass.training import get_model, train_model, save_model, tune_hyperparameters
from creditclass.evaluation import (
    evaluate_model,
    get_confusion_matrix,
    compute_shap_values,
    get_feature_importance,
    get_learning_curve_data,
)
from creditclass.plots import (
    set_plot_style,
    plot_confusion_matrix,
    plot_roc_curve,
    plot_precision_recall,
    plot_feature_importance,
    plot_learning_curve,
    plot_calibration,
    plot_shap_summary,
)

set_plot_style()

# For reproducibility
RANDOM_STATE = 42

## Load Data

In [None]:
# Prepare data for default prediction task
data = prepare_data(
    target_type='default',
    encoding_method='onehot',
    test_size=0.2,
    random_state=RANDOM_STATE,
    scale=True,
)

X_train = data['X_train_scaled']
X_test = data['X_test_scaled']
y_train = data['y_train']
y_test = data['y_test']
feature_names = data['feature_names']

print(f"Training set: {X_train.shape[0]} samples, {X_train.shape[1]} features")
print(f"Test set: {X_test.shape[0]} samples")
print("\nTarget distribution (train):")
print(y_train.value_counts(normalize=True))
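
The internals of `prepare_data` are not shown in this notebook. Assuming that `scale=True` means standardisation fitted on the training split only, the idea can be sketched as follows on synthetic numbers (not the German Credit data).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a numeric feature matrix (not the German Credit data)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
X_tr, X_te = X[:80], X[80:]

# Fit the scaling statistics on the training split only
mean = X_tr.mean(axis=0)
std = X_tr.std(axis=0)

# Apply the same statistics to both splits to avoid test-set leakage
X_tr_scaled = (X_tr - mean) / std
X_te_scaled = (X_te - mean) / std
```

Scaling matters for logistic regression because regularisation penalises coefficient size, and coefficient size depends on the units of each feature.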

## Training

In [None]:
# Get and train the model
model = get_model('logistic_regression')
model = train_model(model, X_train, y_train)

print("Model trained successfully!")
print(f"\nModel parameters: {model.get_params()}")
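
What `train_model` does internally is not shown here. As a rough sketch of what fitting a logistic regression involves, the block below minimises the mean log loss with plain batch gradient descent on synthetic data. The function `fit_logistic`, the learning rate, and the iteration count are assumptions made for this illustration; production solvers use more sophisticated optimisers and usually add regularisation.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        error = p - y                             # gradient of log loss w.r.t. the score
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
    return w, b

# Synthetic data: the label is driven mainly by the first feature
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 2))
true_score = 2.0 * X_demo[:, 0] - 0.5 * X_demo[:, 1]
y_demo = (rng.random(200) < 1.0 / (1.0 + np.exp(-true_score))).astype(float)

w_hat, b_hat = fit_logistic(X_demo, y_demo)
train_acc = float(((X_demo @ w_hat + b_hat > 0) == (y_demo == 1)).mean())
```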

## Evaluation

In [None]:
# Compute metrics
metrics = evaluate_model(model, X_test, y_test)

print("Performance Metrics:")
print("-" * 30)
for name, value in metrics.items():
    if value is not None:
        print(f"{name.capitalize():12} {value:.4f}")
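
The exact set of metrics returned by `evaluate_model` is not listed in this notebook. Assuming it includes the usual accuracy, precision, recall, and F1, and that class 1 denotes bad credit, the sketch below shows how those quantities follow from the four confusion counts. The labels and predictions are hypothetical.

```python
import numpy as np

# Hypothetical labels and predictions (1 = bad credit / default)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # defaults caught
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # good credit flagged as bad
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # defaults missed
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # good credit passed

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)                        # flagged cases that really default
recall = tp / (tp + fn)                           # defaults that were flagged
f1 = 2 * precision * recall / (precision + recall)
```

On an imbalanced target, accuracy alone can look good while recall on defaults is poor, which is one reason to read the metrics together.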

In [None]:
# Plot confusion matrix
fig, ax = plt.subplots(figsize=(6, 5))
plot_confusion_matrix(
    model, X_test, y_test,
    class_names=['Good Credit', 'Bad Credit'],
    ax=ax,
    title='Logistic Regression - Confusion Matrix'
)
plt.tight_layout()
plt.show()

In [None]:
# Plot ROC curve
fig, ax = plt.subplots(figsize=(7, 6))
plot_roc_curve(model, X_test, y_test, ax=ax, label='Logistic Regression')
plt.tight_layout()
plt.show()
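
The area under the ROC curve has a useful interpretation: it is the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes AUC directly from that definition on hypothetical scores; `roc_auc` is a helper written for this illustration, not part of `creditclass`.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive with every negative; ties count as half a win
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

# Hypothetical labels and predicted default probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

auc = roc_auc(y_true, scores)   # 8 of the 9 positive/negative pairs are ordered correctly
```

The pairwise comparison is quadratic in the number of samples, so it suits small illustrations rather than large test sets.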

In [None]:
# Plot precision-recall curve
fig, ax = plt.subplots(figsize=(7, 6))
plot_precision_recall(model, X_test, y_test, ax=ax, label='Logistic Regression')
plt.tight_layout()
plt.show()

## Interpretability

In [None]:
# Feature importance from coefficients
fig, ax = plt.subplots(figsize=(10, 8))
plot_feature_importance(
    model,
    feature_names=feature_names,
    top_n=15,
    ax=ax,
    title='Logistic Regression - Feature Importance (Absolute Coefficients)'
)
plt.tight_layout()
plt.show()
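
Coefficients are expressed in log-odds, so a common way to read them is to exponentiate them into odds ratios. The sketch below uses hypothetical coefficients and feature names; with standardised inputs, each odds ratio describes the effect of a one-standard-deviation increase in that feature.

```python
import numpy as np

# Hypothetical coefficients for three standardised features (illustration only)
names = ["duration", "credit_amount", "age"]
coefs = np.array([0.45, 0.20, -0.30])

# exp(coef) is the multiplicative change in the odds of default
# for a one-standard-deviation increase in the feature
odds_ratios = np.exp(coefs)

# Rank features by absolute coefficient, largest first
order = np.argsort(-np.abs(coefs))
ranked = [(names[i], round(float(odds_ratios[i]), 3)) for i in order]
```

An odds ratio above 1 raises the odds of default and one below 1 lowers them. Ranking by absolute coefficient hides that sign, so it is worth inspecting both.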

In [None]:
# SHAP values
shap_data = compute_shap_values(model, X_test, feature_names=feature_names, max_samples=100)

fig, ax = plt.subplots(figsize=(10, 8))
plot_shap_summary(shap_data, plot_type='bar', max_display=15)
plt.title('Logistic Regression - SHAP Feature Importance')
plt.tight_layout()
plt.show()
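
For a linear model, SHAP values have a closed form when features are treated as independent: each attribution is the coefficient times the feature's deviation from its background mean. Whether `compute_shap_values` uses exactly this formulation is not shown, so the block below is only a sketch with hypothetical coefficients and synthetic background data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical linear model and background data (illustration only)
coefs = np.array([0.45, 0.20, -0.30])
intercept = -0.8
X_background = rng.normal(size=(100, 3))
x = np.array([1.2, -0.4, 0.5])          # instance to explain

# Under feature independence, each attribution is coef * (x - background mean)
baseline_mean = X_background.mean(axis=0)
shap_values = coefs * (x - baseline_mean)

# Attributions sum to the gap between this score and the average score
score_x = coefs @ x + intercept
expected_score = (X_background @ coefs + intercept).mean()
```

These attributions are in log-odds units, not probabilities, and their sum equals `score_x - expected_score`.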

## Hyperparameter Tuning

In [None]:
# Tune hyperparameters
tuning_results = tune_hyperparameters(
    'logistic_regression',
    X_train, y_train,
    method='grid',
    cv=5,
    scoring='f1'
)

print("Best Parameters:")
print(tuning_results['best_params'])
print(f"\nBest CV F1 Score: {tuning_results['best_score']:.4f}")
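
The search space used by `tune_hyperparameters` is not shown in this notebook. To give a feel for the cost of `method='grid'` with `cv=5`, the sketch below expands a hypothetical grid into its candidate settings and counts the cross-validation fits.

```python
from itertools import product

# Hypothetical search space; the grid used by tune_hyperparameters is not shown here
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "penalty": ["l1", "l2"],
}
cv_folds = 5

# A grid search tries every combination of the listed values
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per cross-validation fold
n_cv_fits = len(candidates) * cv_folds
```

The number of fits grows multiplicatively with each added hyperparameter, which is cheap for logistic regression but worth keeping in mind for slower models.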

In [None]:
# Evaluate tuned model
tuned_model = tuning_results['best_model']
tuned_metrics = evaluate_model(tuned_model, X_test, y_test)

print("\nTuned Model Performance:")
print("-" * 30)
for name, value in tuned_metrics.items():
    if value is not None:
        print(f"{name.capitalize():12} {value:.4f}")

## Learning Curve

In [None]:
# Compute learning curve
lc_model = get_model('logistic_regression')
lc_data = get_learning_curve_data(lc_model, X_train, y_train, cv=5, scoring='f1')

fig, ax = plt.subplots(figsize=(8, 6))
plot_learning_curve(lc_data, ax=ax, title='Logistic Regression - Learning Curve')
plt.tight_layout()
plt.show()

## Calibration

In [None]:
# Plot calibration curve
fig, ax = plt.subplots(figsize=(7, 6))
plot_calibration(model, X_test, y_test, ax=ax, label='Logistic Regression')
plt.tight_layout()
plt.show()
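
A calibration curve compares predicted probabilities with observed outcome rates within probability bins. The sketch below shows one simple way to compute it with equal-width bins. `reliability_curve` is a helper written for this illustration, and the data are synthetic scores constructed to be perfectly calibrated.

```python
import numpy as np

def reliability_curve(y_true, y_prob, n_bins=5):
    """Return (mean predicted probability, observed positive rate) per non-empty bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0 .. n_bins - 1
    bin_ids = np.digitize(y_prob, edges[1:-1])
    mean_pred, frac_pos = [], []
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            mean_pred.append(float(y_prob[mask].mean()))
            frac_pos.append(float(y_true[mask].mean()))
    return np.array(mean_pred), np.array(frac_pos)

# Synthetic, perfectly calibrated scores: labels are drawn with probability y_prob
rng = np.random.default_rng(42)
y_prob = rng.random(5000)
y_true = (rng.random(5000) < y_prob).astype(int)

mean_pred, frac_pos = reliability_curve(y_true, y_prob)
max_gap = float(np.abs(mean_pred - frac_pos).max())
```

For a well-calibrated model the two arrays track each other closely. A test set of a few hundred samples gives noisy bins, so small deviations from the diagonal should not be over-interpreted.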

## Save Model

In [None]:
# Save the trained model
save_path = save_model(model, 'logistic_regression')
print(f"Model saved to: {save_path}")
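
How `save_model` serialises the estimator is not shown here. As a generic sketch of the save-and-reload round trip, the block below uses the standard-library `pickle` module on a plain dictionary standing in for a fitted model. The file name is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a fitted model: any picklable Python object works the same way
model_state = {"coef": [0.45, 0.20, -0.30], "intercept": -0.8}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "logistic_regression.pkl"   # hypothetical file name

    # Serialise to disk
    with path.open("wb") as f:
        pickle.dump(model_state, f)

    # Load it back and confirm nothing changed
    with path.open("rb") as f:
        restored = pickle.load(f)

round_trip_ok = restored == model_state
```

Pickled files should only be loaded from trusted sources, and a saved model is generally only safe to reload with the same library versions that produced it.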

## Summary

### Key Takeaways

1. **Performance**: Logistic regression gives a fast, reproducible baseline for credit default prediction; the default and tuned test metrics above are the reference point for other models
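2. **Interpretability**: Coefficients show the direction and size of each feature's effect on the log-odds of default (per standard deviation, since the features are scaled), not directly on the probability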
3. **Calibration**: Probability estimates are typically well calibrated without post-processing; the calibration curve above is the check for this dataset
4. **Limitations**: May miss non-linear patterns in the data

### Recommendations

- Use as a baseline model for comparison
- Consider L1 regularisation for automatic feature selection
- Compare with non-linear models to assess potential gains