# LightGBM Model for Credit Risk Prediction

This notebook trains a LightGBM classifier and compares it with the Decision Tree model from the paper.

**Key Results:**
- Test F1-Score: **0.550** (4.5% relative improvement over Decision Tree)
- Test ROC-AUC: **0.779** (4.1% relative improvement over Decision Tree)
- Most important feature: **PAY_0** (repayment status in September)

### Import libraries and dataset

In [1]:
import pandas as pd
import numpy as np
import joblib
import lightgbm as lgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report, confusion_matrix

In [None]:
df_original = pd.read_csv('../data/UCI_Credit_Card.csv')
df = df_original.copy()

### Data Preparation

The preprocessing steps are the same as those used for the Decision Tree model, so both models are evaluated on identical data.

In [None]:
# Cap outliers at 1st and 99th percentile for bill and payment amounts
bill_payment_cols = ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 
                     'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2', 
                     'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

for col in bill_payment_cols:
    lower_bound = df[col].quantile(0.01)
    upper_bound = df[col].quantile(0.99)
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)

# Filter valid ages
df = df[(df['AGE'] >= 18) & (df['AGE'] <= 100)]
print(f"Records after outlier handling: {len(df)}")

Records after outlier handling: 30000
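To make the capping step concrete, here is a minimal standard-library sketch of the same idea on toy data. The values are made up for illustration, and `statistics.quantiles(..., method="inclusive")` is used here as a stand-in for the linearly interpolated quantile that pandas computes by default; it is not the notebook's own code.

```python
import statistics

# Toy column: 0..99 plus one extreme value (illustrative data, not from the dataset)
values = [float(v) for v in range(100)] + [10_000.0]

# 99 cut points: index 0 is the 1st percentile, the last one is the 99th
cuts = statistics.quantiles(values, n=100, method="inclusive")
lower_bound, upper_bound = cuts[0], cuts[-1]

# Clip every value into [lower_bound, upper_bound]
capped = [min(max(v, lower_bound), upper_bound) for v in values]

print(lower_bound, upper_bound)  # 1.0 99.0
print(min(capped), max(capped))  # 1.0 99.0
```

Capping changes values but never drops rows, which is why the record count stays at 30000 after this step.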


In [None]:
# Clean EDUCATION: recode 0, 5, 6 as 4 (others)
df['EDUCATION'] = df['EDUCATION'].replace({0: 4, 5: 4, 6: 4})

# Clean MARRIAGE: recode 0 as 3 (others)
df['MARRIAGE'] = df['MARRIAGE'].replace({0: 3})

In [None]:
# Create Gender-Marriage combined feature (as described in the paper)
def create_gender_marriage_category(row):
    sex = row['SEX']
    marriage = row['MARRIAGE']
    
    if sex == 1:  # Male
        if marriage == 1:
            return 1  # Married man
        elif marriage == 2:
            return 2  # Single man
        else:
            return 3  # Other man (MARRIAGE == 3; "divorced" in the paper)
    else:  # Female
        if marriage == 1:
            return 4  # Married woman
        elif marriage == 2:
            return 5  # Single woman
        else:
            return 6  # Other woman (MARRIAGE == 3; "divorced" in the paper)

df['GENDER_MARRIAGE'] = df.apply(create_gender_marriage_category, axis=1)

# Exclude divorced women (category 6) as per the paper
df = df[df['GENDER_MARRIAGE'] != 6].copy()
print(f"Records after excluding divorced women: {len(df)}")

Records after excluding divorced women: 29768
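The six-way encoding above can also be written arithmetically. The sketch below is an illustration only: the function names `gender_marriage_nested` and `gender_marriage_arithmetic` are hypothetical, and the check covers only the SEX and MARRIAGE codes seen in this notebook.

```python
def gender_marriage_nested(sex, marriage):
    """Same branching as the row-wise function in the notebook."""
    if sex == 1:
        return 1 if marriage == 1 else 2 if marriage == 2 else 3
    return 4 if marriage == 1 else 5 if marriage == 2 else 6


def gender_marriage_arithmetic(sex, marriage):
    """Offset by sex (0 for male, 3 otherwise), then add 1, 2, or 3."""
    base = 0 if sex == 1 else 3
    return base + (marriage if marriage in (1, 2) else 3)


# Compare both forms on every (SEX, MARRIAGE) pair, including the raw code 0
combos = [(s, m) for s in (1, 2) for m in (0, 1, 2, 3)]
assert all(
    gender_marriage_nested(s, m) == gender_marriage_arithmetic(s, m)
    for s, m in combos
)
print({pair: gender_marriage_arithmetic(*pair) for pair in combos})
```

The arithmetic form would be straightforward to express as a column-wise operation instead of a row-wise `apply`, although the notebook does not do so.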


In [None]:
# Remove ID column
if 'ID' in df.columns:
    df = df.drop('ID', axis=1)

# Define target and features
target_col = 'default.payment.next.month'
y = df[target_col].copy()

feature_cols = ['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE',
                'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6',
                'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 
                'BILL_AMT5', 'BILL_AMT6',
                'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 
                'PAY_AMT5', 'PAY_AMT6',
                'GENDER_MARRIAGE']

X = df[feature_cols].copy()

print(f"Feature matrix shape: {X.shape}")
print(f"Target vector shape: {y.shape}")
print(f"Default rate: {y.mean():.1%}")

Feature matrix shape: (29768, 24)
Target vector shape: (29768,)
Default rate: 22.1%


In [None]:
# Train-test split (same as Decision Tree: 70-30 with random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,
    random_state=42,
    stratify=y
)

print(f"Training set size: {X_train.shape[0]} ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test set size: {X_test.shape[0]} ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"Training default rate: {y_train.mean():.3%}")
print(f"Test default rate: {y_test.mean():.3%}")

Training set size: 20837 (70.0%)
Test set size: 8931 (30.0%)
Training default rate: 22.134%
Test default rate: 22.136%


### Handle Class Imbalance

`scale_pos_weight` is set to the ratio of negative to positive training samples, so that errors on the minority (default) class carry more weight during training.

In [8]:
# Calculate scale_pos_weight for class imbalance handling
negative_samples = (y_train == 0).sum()
positive_samples = (y_train == 1).sum()
scale_pos_weight_value = negative_samples / positive_samples

print("=" * 50)
print("CLASS IMBALANCE HANDLING")
print("=" * 50)
print(f"Negative samples (no default): {negative_samples}")
print(f"Positive samples (default): {positive_samples}")
print(f"Scale pos weight: {scale_pos_weight_value:.2f}")

CLASS IMBALANCE HANDLING
Negative samples (no default): 16225
Positive samples (default): 4612
Scale pos weight: 3.52
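As a quick sanity check, the arithmetic behind the reported weight can be reproduced from the two class counts alone. This sketch hard-codes the counts printed above rather than recomputing them from the data.

```python
# Class counts copied from the training-split output above
negative_samples = 16225
positive_samples = 4612

# Ratio of negatives to positives, as passed to scale_pos_weight
scale_pos_weight = negative_samples / positive_samples

# The same counts also give the training default rate
default_rate = positive_samples / (negative_samples + positive_samples)

print(f"scale_pos_weight = {scale_pos_weight:.2f}")    # 3.52
print(f"training default rate = {default_rate:.3%}")   # 22.134%
```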


### Hyperparameter Tuning

Grid search with 10-fold cross-validation over 324 parameter combinations (3,240 fits in total), scored by F1.

**Best Parameters Found:**
- `learning_rate`: 0.01
- `max_depth`: 7
- `n_estimators`: 200
- `num_leaves`: 31
- `min_child_samples`: 20

Best cross-validation F1 score: **0.5427**

In [9]:
# Define the parameter grid for hyperparameter tuning
lgb_param_grid = {
    'num_leaves': [20, 31, 50],
    'max_depth': [3, 5, 7, -1],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 500],
    'min_child_samples': [20, 50, 100]
}

# Initialize LightGBM classifier with class imbalance handling
lgb_model = lgb.LGBMClassifier(
    random_state=42,
    scale_pos_weight=scale_pos_weight_value,
    verbose=-1
)

# Perform grid search with cross-validation
print("=" * 50)
print("HYPERPARAMETER TUNING WITH GRID SEARCH")
print("=" * 50)
print(f"Total combinations: 3 * 4 * 3 * 3 * 3 = {3 * 4 * 3 * 3 * 3}")
print("Starting grid search (this may take a few minutes)...\n")

lgb_grid_search = GridSearchCV(
    lgb_model,
    lgb_param_grid,
    cv=10,
    scoring='f1',
    n_jobs=-1,
    verbose=1
)

lgb_grid_search.fit(X_train, y_train)

print("\nBest parameters found:")
for param, value in lgb_grid_search.best_params_.items():
    print(f"  {param}: {value}")
print(f"\nBest cross-validation F1 score: {lgb_grid_search.best_score_:.4f}")

HYPERPARAMETER TUNING WITH GRID SEARCH
Total combinations: 3 * 4 * 3 * 3 * 3 = 324
Starting grid search (this may take a few minutes)...

Fitting 10 folds for each of 324 candidates, totalling 3240 fits

Best parameters found:
  learning_rate: 0.01
  max_depth: 7
  min_child_samples: 20
  n_estimators: 200
  num_leaves: 31

Best cross-validation F1 score: 0.5427
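The candidate and fit counts reported by the grid search follow directly from the grid's shape. A minimal standard-library sketch, reusing the grid values from the cell above (the variable names `n_combinations`, `n_fits`, and `all_candidates` are illustrative):

```python
from itertools import product
from math import prod

# Same grid as in the tuning cell above
lgb_param_grid = {
    'num_leaves': [20, 31, 50],
    'max_depth': [3, 5, 7, -1],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 500],
    'min_child_samples': [20, 50, 100],
}

# Number of candidates is the product of the list lengths
n_combinations = prod(len(values) for values in lgb_param_grid.values())

# Each candidate is fitted once per cross-validation fold
n_fits = n_combinations * 10

# Enumerating the Cartesian product gives the same count
all_candidates = [
    dict(zip(lgb_param_grid, combo))
    for combo in product(*lgb_param_grid.values())
]
assert len(all_candidates) == n_combinations

print(n_combinations, n_fits)  # 324 3240
```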


### Performance Evaluation

**Test Set Results:**
| Metric | Value |
|--------|-------|
| Accuracy | 0.789 |
| Precision | 0.521 |
| Recall | 0.582 |
| F1-Score | 0.550 |
| ROC-AUC | 0.779 |

The model correctly identifies 58.2% of actual defaulters (1,150 of 1,977; recall), and 52.1% of the accounts it flags actually default (1,150 of 2,208; precision).

In [14]:
# Get the best model from grid search
best_lgb_model = lgb_grid_search.best_estimator_

# Make predictions on test set
y_pred_lgb = best_lgb_model.predict(X_test)
y_pred_proba_lgb = best_lgb_model.predict_proba(X_test)[:, 1]

# Calculate performance metrics
accuracy_lgb = accuracy_score(y_test, y_pred_lgb)
precision_lgb = precision_score(y_test, y_pred_lgb)
recall_lgb = recall_score(y_test, y_pred_lgb)
f1_lgb = f1_score(y_test, y_pred_lgb)
roc_auc_lgb = roc_auc_score(y_test, y_pred_proba_lgb)

print("=" * 50)
print("LIGHTGBM PERFORMANCE EVALUATION")
print("=" * 50)
print(f"\nPerformance Metrics:")
print(f"Accuracy:  {accuracy_lgb:.3f}")
print(f"Precision: {precision_lgb:.3f}")
print(f"Recall:    {recall_lgb:.3f}")
print(f"F1-score:  {f1_lgb:.3f}")
print(f"ROC-AUC:   {roc_auc_lgb:.3f}")

print("\n" + "-" * 50)
print("Classification Report:")
print("-" * 50)
print(classification_report(y_test, y_pred_lgb, target_names=['No Default', 'Default']))

print("-" * 50)
print("Confusion Matrix:")
print("-" * 50)
cm_lgb = confusion_matrix(y_test, y_pred_lgb)
print(cm_lgb)
print(f"\nTrue Negatives:  {cm_lgb[0, 0]}")
print(f"False Positives: {cm_lgb[0, 1]}")
print(f"False Negatives: {cm_lgb[1, 0]}")
print(f"True Positives:  {cm_lgb[1, 1]}")

LIGHTGBM PERFORMANCE EVALUATION

Performance Metrics:
Accuracy:  0.789
Precision: 0.521
Recall:    0.582
F1-score:  0.550
ROC-AUC:   0.779

--------------------------------------------------
Classification Report:
--------------------------------------------------
              precision    recall  f1-score   support

  No Default       0.88      0.85      0.86      6954
     Default       0.52      0.58      0.55      1977

    accuracy                           0.79      8931
   macro avg       0.70      0.71      0.71      8931
weighted avg       0.80      0.79      0.79      8931

--------------------------------------------------
Confusion Matrix:
--------------------------------------------------
[[5896 1058]
 [ 827 1150]]

True Negatives:  5896
False Positives: 1058
False Negatives: 827
True Positives:  1150
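The headline metrics can be recovered from the four confusion-matrix counts alone, which is a useful cross-check on the reported numbers. This sketch hard-codes the counts printed above.

```python
# Counts copied from the confusion matrix above
tn, fp, fn, tp = 5896, 1058, 827, 1150

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # share of flagged accounts that default
recall = tp / (tp + fn)      # share of defaulters that are flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.789
print(f"Precision: {precision:.3f}")  # 0.521
print(f"Recall:    {recall:.3f}")     # 0.582
print(f"F1-score:  {f1:.3f}")         # 0.550
```

ROC-AUC is not recoverable this way because it depends on the predicted probabilities, not only on the thresholded labels.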


### Feature Importance Analysis

**Top 5 Features by Gain:**
1. **PAY_0** - Repayment status in September (dominant feature with 274,048 gain)
2. **PAY_AMT2** - Payment amount in August (27,434 gain)
3. **LIMIT_BAL** - Credit limit (25,512 gain)
4. **BILL_AMT1** - Bill statement in September (24,630 gain)
5. **PAY_4** - Repayment status in June (14,138 gain)

PAY_0 is by far the most predictive feature: its total gain is roughly ten times that of the runner-up, PAY_AMT2. This is consistent with the paper's finding that repayment history is the strongest indicator of default risk.
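To put the dominance of PAY_0 in numbers, the sketch below works from the ten largest gain values reported in this notebook's feature-importance output. Because only the top 10 of the 24 features are used, the share is relative to those ten, not to the model's total gain.

```python
# Gain values copied from the top-10 gain table in this notebook's output
top10_gain = {
    "PAY_0": 274047.755493,
    "PAY_AMT2": 27434.305377,
    "LIMIT_BAL": 25512.307361,
    "BILL_AMT1": 24629.907673,
    "PAY_4": 14137.522610,
    "PAY_2": 12836.259132,
    "PAY_AMT1": 12560.479429,
    "PAY_AMT3": 11943.671025,
    "PAY_3": 11438.383887,
    "PAY_AMT4": 10268.698339,
}

total_top10 = sum(top10_gain.values())
pay0_share = top10_gain["PAY_0"] / total_top10
ratio_to_second = top10_gain["PAY_0"] / top10_gain["PAY_AMT2"]

print(f"PAY_0 share of top-10 gain: {pay0_share:.1%}")          # 64.5%
print(f"PAY_0 / PAY_AMT2 gain ratio: {ratio_to_second:.1f}x")   # 10.0x
```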

In [11]:
# Feature Importance Analysis
print("=" * 50)
print("FEATURE IMPORTANCE ANALYSIS")
print("=" * 50)

# Get feature importances (split-based by default in sklearn API)
feature_importance_split = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance (Split)': best_lgb_model.feature_importances_
}).sort_values('Importance (Split)', ascending=False)

# Get gain-based importance using booster
booster = best_lgb_model.booster_
gain_importance = booster.feature_importance(importance_type='gain')
feature_importance_gain = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance (Gain)': gain_importance
}).sort_values('Importance (Gain)', ascending=False)

# Combine both importance measures
feature_importance = feature_importance_split.merge(
    feature_importance_gain, on='Feature'
)
feature_importance = feature_importance.sort_values('Importance (Gain)', ascending=False)

print("\nTop 10 Features by Gain-based Importance:")
print("-" * 50)
print(feature_importance.head(10).to_string(index=False))

print("\n\nTop 10 Features by Split-based Importance:")
print("-" * 50)
print(feature_importance.sort_values('Importance (Split)', ascending=False).head(10).to_string(index=False))

# Highlight PAY variables and AGE (as mentioned in the paper)
print("\n" + "=" * 50)
print("KEY FEATURES FROM PAPER COMPARISON")
print("=" * 50)
pay_features = ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'AGE']
print("\nImportance of PAY variables and AGE:")
key_features = feature_importance[feature_importance['Feature'].isin(pay_features)]
key_features = key_features.sort_values('Importance (Gain)', ascending=False)
print(key_features.to_string(index=False))

FEATURE IMPORTANCE ANALYSIS

Top 10 Features by Gain-based Importance:
--------------------------------------------------
  Feature  Importance (Split)  Importance (Gain)
    PAY_0                 332      274047.755493
 PAY_AMT2                 393       27434.305377
LIMIT_BAL                 616       25512.307361
BILL_AMT1                 649       24629.907673
    PAY_4                 135       14137.522610
    PAY_2                 237       12836.259132
 PAY_AMT1                 360       12560.479429
 PAY_AMT3                 309       11943.671025
    PAY_3                 222       11438.383887
 PAY_AMT4                 258       10268.698339


Top 10 Features by Split-based Importance:
--------------------------------------------------
  Feature  Importance (Split)  Importance (Gain)
BILL_AMT1                 649       24629.907673
LIMIT_BAL                 616       25512.307361
 PAY_AMT2                 393       27434.305377
 PAY_AMT1                 360       12560.479429

### Model Comparison with Decision Tree

**LightGBM vs Decision Tree Performance:**

| Metric | Decision Tree | LightGBM | Relative improvement |
|--------|---------------|----------|----------------------|
| Accuracy | 0.781 | 0.789 | +1.1% |
| Precision | 0.504 | 0.521 | +3.3% |
| Recall | 0.550 | 0.582 | +5.8% |
| F1-Score | 0.526 | 0.550 | +4.5% |
| ROC-AUC | 0.748 | 0.779 | +4.1% |

LightGBM outperforms the Decision Tree model on every metric. The largest relative gain is in recall (+5.8%, or 3.2 percentage points), which means more actual defaulters are detected.
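The percentages in the table are relative changes, which are easy to confuse with absolute differences in percentage points. The sketch below computes both from the unrounded scores in this notebook's comparison output; the dictionary names `absolute_pp` and `relative_pct` are illustrative.

```python
# (Decision Tree, LightGBM) scores copied from the comparison output
scores = {
    "Accuracy": (0.780652, 0.788937),
    "Precision": (0.504174, 0.520833),
    "Recall": (0.549823, 0.581689),
    "F1-Score": (0.526010, 0.549582),
    "ROC-AUC": (0.748421, 0.779349),
}

absolute_pp = {}   # difference in percentage points
relative_pct = {}  # difference relative to the Decision Tree score

for metric, (dt_score, lgb_score) in scores.items():
    diff = lgb_score - dt_score
    absolute_pp[metric] = diff * 100
    relative_pct[metric] = diff / dt_score * 100
    print(f"{metric:<9} {absolute_pp[metric]:+.1f} pp   {relative_pct[metric]:+.1f}%")
```

For recall, for example, this gives a gain of about 3.2 percentage points, which corresponds to the 5.8% relative improvement in the table.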

In [12]:
# Load the Decision Tree model for comparison
dt_model = joblib.load('./models/best_decision_tree_model.pkl')

# Get Decision Tree predictions
y_pred_dt = dt_model.predict(X_test)
y_pred_proba_dt = dt_model.predict_proba(X_test)[:, 1]

# Decision Tree metrics
accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)
f1_dt = f1_score(y_test, y_pred_dt)
roc_auc_dt = roc_auc_score(y_test, y_pred_proba_dt)

# Model Comparison
print("=" * 60)
print("MODEL COMPARISON: DECISION TREE VS LIGHTGBM")
print("=" * 60)

comparison_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC'],
    'Decision Tree': [accuracy_dt, precision_dt, recall_dt, f1_dt, roc_auc_dt],
    'LightGBM': [accuracy_lgb, precision_lgb, recall_lgb, f1_lgb, roc_auc_lgb]
})

comparison_df['Difference'] = comparison_df['LightGBM'] - comparison_df['Decision Tree']
comparison_df['% Change'] = (comparison_df['Difference'] / comparison_df['Decision Tree'] * 100).round(2)

print("\nPerformance Comparison:")
print("-" * 60)
print(comparison_df.to_string(index=False))

print("\n" + "-" * 60)
print("Summary:")
print("-" * 60)
if f1_lgb > f1_dt:
    print(f"✓ LightGBM outperforms Decision Tree on F1-Score by {(f1_lgb - f1_dt):.3f} ({((f1_lgb - f1_dt) / f1_dt * 100):.1f}% improvement)")
else:
    print(f"✗ Decision Tree outperforms LightGBM on F1-Score by {(f1_dt - f1_lgb):.3f}")

if roc_auc_lgb > roc_auc_dt:
    print(f"✓ LightGBM outperforms Decision Tree on ROC-AUC by {(roc_auc_lgb - roc_auc_dt):.3f} ({((roc_auc_lgb - roc_auc_dt) / roc_auc_dt * 100):.1f}% improvement)")
else:
    print(f"✗ Decision Tree outperforms LightGBM on ROC-AUC by {(roc_auc_dt - roc_auc_lgb):.3f}")

MODEL COMPARISON: DECISION TREE VS LIGHTGBM

Performance Comparison:
------------------------------------------------------------
   Metric  Decision Tree  LightGBM  Difference  % Change
 Accuracy       0.780652  0.788937    0.008286      1.06
Precision       0.504174  0.520833    0.016659      3.30
   Recall       0.549823  0.581689    0.031866      5.80
 F1-Score       0.526010  0.549582    0.023572      4.48
  ROC-AUC       0.748421  0.779349    0.030927      4.13

------------------------------------------------------------
Summary:
------------------------------------------------------------
✓ LightGBM outperforms Decision Tree on F1-Score by 0.024 (4.5% improvement)
✓ LightGBM outperforms Decision Tree on ROC-AUC by 0.031 (4.1% improvement)


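Note: the models in this notebook are persisted with `joblib`, which relies on pickle, so model files should only be loaded from a trusted source. See the scikit-learn notes on persistence: https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations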


### Save Model

In [13]:
# Save the best LightGBM model
joblib.dump(best_lgb_model, './models/best_lightgbm_model.pkl')
print("✓ Best LightGBM model saved as 'best_lightgbm_model.pkl'")

# Also save the model parameters for reference
model_info = {
    'best_params': lgb_grid_search.best_params_,
    'best_cv_score': lgb_grid_search.best_score_,
    'test_metrics': {
        'accuracy': accuracy_lgb,
        'precision': precision_lgb,
        'recall': recall_lgb,
        'f1_score': f1_lgb,
        'roc_auc': roc_auc_lgb
    },
    'scale_pos_weight': scale_pos_weight_value
}

joblib.dump(model_info, './models/lightgbm_model_info.pkl')
print("✓ Model info saved as 'lightgbm_model_info.pkl'")

✓ Best LightGBM model saved as 'best_lightgbm_model.pkl'
✓ Model info saved as 'lightgbm_model_info.pkl'
