JoshLG18/DSE-EMP-Project

Predicting Loan Defaults: A Data-Driven Approach to Credit Risk Analysis

Course: BEE2041 Data Science in Economics — Final Empirical Project

Live Blog: joshlg18.quarto.pub/predicting-loan-defaults-a-data-driven-approach-to-credit-risk-analysis


Overview

This project applies machine learning to predict loan defaults using a real-world credit risk dataset. Three classification models (Logistic Regression, Random Forest, and XGBoost) are implemented and compared on standard performance metrics. The analysis goes beyond pure prediction to include:

  • Causal analysis: Estimating heterogeneous treatment effects (CATE) of prior default history on default risk
  • Fairness audit: Examining model performance disparities across demographic subgroups (income, age, home ownership)

The project is written as an interactive Quarto blog (Blog.qmd) that renders to a fully reproducible HTML report.


Research Question

Can machine learning models effectively predict loan defaults, and how fair are their predictions across different borrower groups?


Dataset

Source: Credit Risk Dataset (Kaggle)

| Property | Value |
| --- | --- |
| File | Project/Data/credit_risk_dataset.csv |
| Rows | 32,581 observations |
| Columns | 12 variables |
| Target | loan_status (0 = No Default, 1 = Default) |

Variables

| Variable | Description | Type |
| --- | --- | --- |
| person_age | Borrower age (years) | Numeric |
| person_income | Annual income (USD) | Numeric |
| person_home_ownership | RENT / OWN / MORTGAGE / OTHER | Categorical |
| person_emp_length | Employment length (years) | Numeric |
| loan_intent | Purpose: EDUCATION, MEDICAL, VENTURE, etc. | Categorical |
| loan_grade | Credit grade: A–F | Categorical |
| loan_amnt | Loan amount (USD) | Numeric |
| loan_int_rate | Interest rate (%) | Numeric |
| loan_status | Default indicator (target variable) | Binary |
| loan_percent_income | Loan amount as % of annual income | Numeric |
| cb_person_default_on_file | Prior default on record, Y/N (treatment variable) | Categorical |
| cb_person_cred_hist_length | Credit history length (years) | Numeric |

Data Cleaning

  • Removed 406 duplicate rows
  • Imputed person_emp_length missing values (895) with the median
  • Imputed loan_int_rate missing values (1,133) via regression on loan_grade
  • Filtered implausible outliers (age and employment length capped at realistic maximums)
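
The cleaning steps above can be sketched in pandas as follows. This is a minimal illustration on invented rows, not the project's actual code. Two assumptions are worth flagging: the per-grade mean stands in for the regression on loan_grade (with a single categorical predictor, an ordinary least squares fit reduces to group means), and the caps `MAX_AGE` and `MAX_EMP_LENGTH` are hypothetical because the README does not state the thresholds used.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame using the README's column names; all values are invented.
df = pd.DataFrame({
    "person_age":        [25, 25, 144, 40, 31, 52],
    "person_emp_length": [3.0, 3.0, 5.0, np.nan, 123.0, 10.0],
    "loan_grade":        ["A", "A", "B", "B", "A", "B"],
    "loan_int_rate":     [7.0, 7.0, 11.0, np.nan, 8.0, 12.0],
})

# Hypothetical plausibility caps (the README does not give the real thresholds).
MAX_AGE = 100
MAX_EMP_LENGTH = 60

# 1. Drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# 2. Median-impute missing employment length.
emp_median = df["person_emp_length"].median()
df["person_emp_length"] = df["person_emp_length"].fillna(emp_median)

# 3. Fill missing interest rates with the mean rate of the same loan grade
#    (assumed stand-in for "regression on loan_grade").
grade_mean = df.groupby("loan_grade")["loan_int_rate"].transform("mean")
df["loan_int_rate"] = df["loan_int_rate"].fillna(grade_mean)

# 4. Remove rows with implausible age or employment length.
plausible = (df["person_age"] <= MAX_AGE) & (df["person_emp_length"] <= MAX_EMP_LENGTH)
df = df[plausible].reset_index(drop=True)

print(df)
```

Following the order of the list, imputation runs before the outlier filter, so extreme values still influence the median; filtering first would be an equally defensible choice.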

Methodology

Preprocessing

  • Quantile transformation (to approximate normal distributions) for all numeric features
  • Label encoding for categorical features
  • Class-balanced weighting to handle imbalance (~70% non-default, ~30% default)
  • 80/20 stratified train-test split
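
To make the class weighting and the stratified split concrete, here is a small standard-library sketch. It uses the formula behind scikit-learn's `class_weight="balanced"` option, n_samples / (n_classes * class_count); whether the project relies on that exact option is an assumption. The helper names `balanced_class_weights` and `stratified_split` are hypothetical, and the labels are synthetic.

```python
import random
from collections import Counter, defaultdict

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels mirroring the README's approximate 70/30 class mix.
y = [0] * 700 + [1] * 300

weights = balanced_class_weights(y)
train_idx, test_idx = stratified_split(y)

print(weights)
print(len(train_idx), len(test_idx))
```

With a 70/30 mix the minority class receives a weight of roughly 1.67 and the majority class roughly 0.71, so each class contributes equally to the loss in aggregate.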

Models

| Model | Key Settings |
| --- | --- |
| Logistic Regression | L1/L2 regularisation, GridSearchCV (C ∈ [0.01, 100]) |
| Random Forest | 100 estimators, max_depth ∈ [10, 20], GridSearchCV |
| XGBoost | 200 estimators, L1/L2 regularisation, scale_pos_weight tuning |

All models are tuned via 3-fold stratified K-fold cross-validation, optimising the F1-score.
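
The tuning setup described here might look roughly like the sketch below for the Logistic Regression model. The data is synthetic (`make_classification`), the specific grid values for C are hypothetical since the README gives only the range, and the `liblinear` solver is an assumption chosen because it supports both L1 and L2 penalties.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the preprocessed loan features.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=42
)

# Hypothetical grid: the README states only that C lies in [0.01, 100].
param_grid = {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}

# 3-fold stratified cross-validation, scored on F1 as described in the README.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear", class_weight="balanced"),
    param_grid=param_grid,
    scoring="f1",
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Scoring on F1 rather than accuracy keeps the search from favouring models that simply predict the majority class.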

Evaluation Metrics

  • Accuracy, Precision, Recall (Sensitivity), F1-Score
  • AUC-ROC, Log Loss, Brier Score
  • Confusion matrices
  • Feature importance (Odds Ratios for LR; feature importance scores for RF/XGBoost)
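
For readers who want the definitions behind these metrics, the sketch below computes most of them from scratch on a handful of invented predictions. The function name `evaluate` and the 0.5 decision threshold are assumptions; AUC-ROC is omitted for brevity. The last lines show how a logistic regression coefficient converts to an odds ratio, using a hypothetical coefficient.

```python
import math

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute threshold-based and probability-based metrics for binary labels."""
    y_pred = [int(p >= threshold) for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    n = len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Brier score: mean squared error of the predicted probabilities.
    brier = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / n

    # Log loss, with probabilities clipped away from 0 and 1 to avoid log(0).
    eps = 1e-15
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, clipped)
    ) / n

    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "brier": brier,
        "log_loss": log_loss,
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
    }

# Invented labels and predicted default probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.4, 0.2]
metrics = evaluate(y_true, y_prob)
print(metrics)

# Odds ratio for a logistic regression coefficient (hypothetical value ln 2).
coef = math.log(2)
odds_ratio = math.exp(coef)  # a one-unit increase doubles the odds of default
```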

Causal Analysis

  • Conditional Average Treatment Effects (CATE) estimated using a Random Forest causal model
  • Treatment: cb_person_default_on_file = Y vs. N
  • Individual-level treatment effect distribution visualised
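
The README does not say which Random Forest causal estimator is used, so the sketch below swaps in a simple T-learner as a stand-in: one forest is fitted on treated rows, another on untreated rows, and the CATE estimate is the difference in predicted default probability. All data is synthetic, with a treatment effect deliberately built in for one subgroup; it illustrates the idea only and is not the project's method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 3))        # synthetic covariates
t = rng.binomial(1, 0.3, size=n)   # 1 = prior default on file (synthetic)

# Synthetic ground truth: treatment raises default risk only when X[:, 0] > 0.
p = 0.10 + 0.30 * t * (X[:, 0] > 0)
y = rng.binomial(1, p)

def fit_forest(X_part, y_part):
    """Fit one outcome model; the hyperparameters here are illustrative."""
    model = RandomForestClassifier(
        n_estimators=200, min_samples_leaf=20, random_state=42
    )
    return model.fit(X_part, y_part)

# T-learner: separate outcome models for treated and untreated borrowers.
treated_model = fit_forest(X[t == 1], y[t == 1])
control_model = fit_forest(X[t == 0], y[t == 0])

# Per-row CATE estimate: predicted risk if treated minus predicted risk if not.
cate = treated_model.predict_proba(X)[:, 1] - control_model.predict_proba(X)[:, 1]

high = cate[X[:, 0] > 0].mean()
low = cate[X[:, 0] <= 0].mean()
print(f"mean CATE, X0 > 0: {high:.3f}; X0 <= 0: {low:.3f}")
```

Plotting a histogram of `cate` would give the kind of individual-level effect distribution the last bullet refers to. On observational data such estimates are only causal under the usual no-unmeasured-confounding assumption.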

Fairness Audit

  • Model recall, accuracy, and F1 stratified by:
    • Income (low vs. high)
    • Age (young vs. old)
    • Home Ownership (RENT / OWN / MORTGAGE)
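
A stratified metric of this kind can be computed as in the following sketch, which reports recall per subgroup. The function name `recall_by_group` is hypothetical, and the labels, predictions, and group assignments are invented for illustration.

```python
from collections import defaultdict

def recall_by_group(y_true, y_pred, groups):
    """Recall (share of actual defaults that were predicted) for each subgroup."""
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            true_positives[group] += int(pred == 1)
    return {g: true_positives[g] / positives[g] for g in positives}

# Invented labels and predictions for two home-ownership groups.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
groups = ["RENT"] * 5 + ["MORTGAGE"] * 5

recalls = recall_by_group(y_true, y_pred, groups)
gap = max(recalls.values()) - min(recalls.values())

print(recalls)
print(f"largest recall gap: {gap:.3f}")
```

The same pattern extends to accuracy or F1 by swapping the per-group statistic, and to income or age by changing the grouping column.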

Results Summary

| Model | Accuracy | AUC | F1 | Recall |
| --- | --- | --- | --- | --- |
| XGBoost | 0.861 | 0.825 | 0.620 | 0.559 |
| Random Forest | 0.847 | 0.819 | 0.595 | 0.522 |
| Logistic Regression | 0.819 | 0.795 | 0.534 | 0.466 |

Top predictive features: loan_percent_income, loan_int_rate, loan_grade, person_home_ownership

Fairness findings: Recall differs notably across groups. The models catch a larger share of actual defaults among renters (recall 0.58) than among mortgage holders (0.37), and among low-income borrowers than among high-income borrowers (0.50 vs. 0.40).

Causal finding: The estimated effect of a previous default is heterogeneous. A distinct subgroup shows substantially elevated default risk, which a single average treatment effect would hide; this is why CATE matters for credit decisions.


Repository Structure

Empirical Project/
├── Makefile                        # Render command for reproducibility
├── README.md                       # This file
├── Blog.txt                        # Links to GitHub repo and published blog
│
└── Project/
    ├── Blog.qmd                    # Main Quarto source — all analysis and narrative
    ├── _quarto.yml                 # Quarto configuration (theme, format, TOC)
    ├── _publish.yml                # Quarto publish settings
    ├── styles.css                  # Custom CSS for blog styling
    │
    ├── Data/
    │   └── credit_risk_dataset.csv # Raw dataset
    │
    ├── Image/
    │   ├── banner.jpg              # Header banner
    │   ├── github.png              # GitHub icon
    │   ├── linkedin.png            # LinkedIn icon
    │   └── dataset.png             # Dataset download icon
    │
    ├── References/
    │   ├── references.bib          # BibTeX citations
    │   └── apa.csl                 # APA citation style
    │
    └── Outputs/                    # Rendered HTML output (auto-generated)
        ├── Blog.html
        ├── Blog_files/             # HTML assets, figures, JS/CSS libraries
        ├── data/                   # Dataset copy for in-blog download link
        └── styles.css

Requirements

Python: 3.13.2

| Library | Version |
| --- | --- |
| pandas | 2.2.3 |
| numpy | 2.2.3 |
| matplotlib | 3.10.0 |
| seaborn | 0.13.2 |
| scikit-learn | 1.6.1 |
| xgboost | 2.1.4 |
| statsmodels | 0.14.4 |
| scipy | 1.15.2 |
| plotly | 6.0.1 |
| ipython | 8.32.0 |

Install all dependencies:

pip install pandas==2.2.3 numpy==2.2.3 matplotlib==3.10.0 seaborn==0.13.2 \
    scikit-learn==1.6.1 xgboost==2.1.4 statsmodels==0.14.4 scipy==1.15.2 \
    plotly==6.0.1 ipython==8.32.0

Quarto: quarto.org (required to render Blog.qmd)


How to Reproduce

Clone the repo and render the blog from the project root:

git clone https://github.com/JoshLG18/DSE-EMP-Project.git
cd DSE-EMP-Project

# Render the Quarto blog to HTML
make

# Open the rendered output
make open

Or render directly with Quarto:

cd Project
quarto render Blog.qmd

A random seed of 42 is set throughout to ensure reproducibility.


Links

About

End-to-end credit risk pipeline on real-world loan data. Compared Logistic Regression, Random Forest, and XGBoost; addressed multicollinearity and class imbalance. XGBoost achieved the best results with 92.8% Accuracy, 0.944 AUC, and 0.816 F1-score.
