Predicting Loan Defaults: A Data-Driven Approach to Credit Risk Analysis

Course: BEE2041 Data Science in Economics — Final Empirical Project

Live Blog: joshlg18.quarto.pub/predicting-loan-defaults-a-data-driven-approach-to-credit-risk-analysis

Overview

This project applies machine learning to predict loan defaults using a real-world credit risk dataset. Three classification models are implemented and compared — Logistic Regression, Random Forest, and XGBoost — evaluated on standard performance metrics. The analysis goes beyond pure prediction to include:

Causal analysis: Estimating heterogeneous treatment effects (CATE) of prior default history on default risk
Fairness audit: Examining model performance disparities across demographic subgroups (income, age, home ownership)

The project is written as an interactive Quarto blog (Blog.qmd) that renders to a fully reproducible HTML report.

Research Question

Can machine learning models effectively predict loan defaults, and how fair are their predictions across different borrower groups?

Dataset

Source: Credit Risk Dataset (Kaggle)

Property	Value
File	`Project/Data/credit_risk_dataset.csv`
Rows	32,581 observations
Columns	12 variables
Target	`loan_status` (0 = No Default, 1 = Default)

Variables

Variable	Description	Type
`person_age`	Borrower age (years)	Numeric
`person_income`	Annual income (USD)	Numeric
`person_home_ownership`	RENT / OWN / MORTGAGE / OTHER	Categorical
`person_emp_length`	Employment length (years)	Numeric
`loan_intent`	Purpose: EDUCATION, MEDICAL, VENTURE, etc.	Categorical
`loan_grade`	Credit grade: A–G	Categorical
`loan_amnt`	Loan amount (USD)	Numeric
`loan_int_rate`	Interest rate (%)	Numeric
`loan_status`	Default indicator — target variable	Binary
`loan_percent_income`	Loan amount as a fraction of annual income (e.g. 0.25 = 25%)	Numeric
`cb_person_default_on_file`	Prior default on record (Y/N) — treatment variable	Categorical
`cb_person_cred_hist_length`	Credit history length (years)	Numeric

Data Cleaning

Removed 406 duplicate rows
Imputed person_emp_length missing values (895) with the median
Imputed loan_int_rate missing values (1,133) via regression on loan_grade
Filtered implausible outliers (age and employment length capped at realistic maximums)
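
The cleaning steps above could be sketched in pandas roughly as follows. This is an illustration on toy data, not the code in Blog.qmd: the function name `clean` and the thresholds `max_age` and `max_emp` are assumptions, dropping rows is only one reading of "filtered", and the per-grade mean stands in for the regression on loan_grade (an OLS fit on grade dummies alone predicts exactly the per-grade mean).

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame, max_age: int = 100, max_emp: int = 60) -> pd.DataFrame:
    """Illustrative cleaning pass; max_age and max_emp are assumed thresholds."""
    out = df.drop_duplicates().copy()

    # Median imputation for employment length.
    emp_median = out["person_emp_length"].median()
    out["person_emp_length"] = out["person_emp_length"].fillna(emp_median)

    # Stand-in for a regression on loan_grade: with grade as the only
    # regressor, the fitted value is the mean interest rate of that grade.
    grade_mean = out.groupby("loan_grade")["loan_int_rate"].transform("mean")
    out["loan_int_rate"] = out["loan_int_rate"].fillna(grade_mean)

    # Drop rows with implausible age or employment length.
    keep = (out["person_age"] <= max_age) & (out["person_emp_length"] <= max_emp)
    return out[keep].reset_index(drop=True)


# Toy rows: one exact duplicate, one implausible age, one row with gaps.
toy = pd.DataFrame({
    "person_age": [25, 25, 144, 40, 31],
    "person_emp_length": [3.0, 3.0, 5.0, np.nan, 7.0],
    "loan_grade": ["A", "A", "B", "B", "B"],
    "loan_int_rate": [7.0, 7.0, 11.0, np.nan, 13.0],
})
cleaned = clean(toy)
print(cleaned)
```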

Methodology

Preprocessing

Quantile transformation (to approximate normal distributions) for all numeric features
Label encoding for categorical features
Class-balanced weighting to handle imbalance (~70% non-default, ~30% default)
80/20 stratified train-test split
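
A minimal sketch of these preprocessing steps with scikit-learn, run on synthetic data. The column subset, the sample size, and the choice to fit the quantile transformer on the training split only are assumptions made for illustration; the README does not state where the transformer is fitted.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, QuantileTransformer

# Synthetic stand-in for a few columns of the dataset.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "person_income": rng.lognormal(11, 0.5, n),
    "loan_amnt": rng.uniform(500, 35000, n),
    "person_home_ownership": rng.choice(["RENT", "OWN", "MORTGAGE"], n),
})
y = (rng.random(n) < 0.3).astype(int)

# Label-encode the categorical column to integer codes.
X["person_home_ownership"] = LabelEncoder().fit_transform(X["person_home_ownership"])

# 80/20 split that preserves the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Map numeric features to an approximately normal distribution.
# Fitting on the training split only avoids leaking test information.
num_cols = ["person_income", "loan_amnt"]
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100, random_state=42)
X_train[num_cols] = qt.fit_transform(X_train[num_cols])
X_test[num_cols] = qt.transform(X_test[num_cols])
```

Class-balanced weighting is not shown here; in scikit-learn estimators it is typically requested with `class_weight="balanced"` at model construction.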

Models

Model	Key Settings
Logistic Regression	L1/L2 regularisation, GridSearchCV (`C` ∈ [0.01, 100])
Random Forest	100 estimators, `max_depth` ∈ [10, 20], GridSearchCV
XGBoost	200 estimators, L1/L2 regularisation, `scale_pos_weight` tuning

All models were tuned with 3-fold stratified K-fold cross-validation, optimising the F1-score.
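
As an illustration of this tuning setup, the sketch below runs a grid search for the logistic regression on synthetic data. The specific grid points for `C`, the `liblinear` solver, and the synthetic dataset are assumptions; only the range of `C`, the L1/L2 penalties, the 3 stratified folds, and the F1 objective come from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced binary problem standing in for the loan data.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=42
)

# Assumed grid: five log-spaced values of C, both penalty types.
param_grid = {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(
    # liblinear supports both L1 and L2 penalties.
    LogisticRegression(solver="liblinear", class_weight="balanced"),
    param_grid,
    scoring="f1",
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern would apply to the Random Forest and XGBoost models by swapping the estimator and the parameter grid.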

Evaluation Metrics

Accuracy, Precision, Recall (Sensitivity), F1-Score
AUC-ROC, Log Loss, Brier Score
Confusion matrices
Feature importance (Odds Ratios for LR; feature importance scores for RF/XGBoost)
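
The metrics listed above can all be computed with `sklearn.metrics`. The sketch below uses ten made-up labels and predicted probabilities, with an assumed decision threshold of 0.5; none of these numbers come from the project.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    brier_score_loss,
    confusion_matrix,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Made-up labels and predicted default probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.9, 0.2, 0.7, 0.1])
y_pred = (y_prob >= 0.5).astype(int)  # assumed 0.5 threshold

metrics = {
    # Threshold-based metrics use the hard predictions.
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # Probability-based metrics use the raw scores.
    "auc_roc": roc_auc_score(y_true, y_prob),
    "log_loss": log_loss(y_true, y_prob),
    "brier": brier_score_loss(y_true, y_prob),
}
cm = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(cm)
```

For the logistic regression, odds ratios are obtained by exponentiating the fitted coefficients, for example `np.exp(model.coef_)` on a fitted scikit-learn `LogisticRegression`.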

Causal Analysis

Conditional Average Treatment Effects (CATE) estimated using a Random Forest causal model
Treatment: cb_person_default_on_file = Y vs. N
Individual-level treatment effect distribution visualised
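
The README does not say which Random Forest causal estimator is used, so the sketch below substitutes a simple T-learner: one forest is fitted per treatment arm, and the CATE is the difference of the two predicted default probabilities. The data are synthetic with a known effect, and treatment is randomly assigned, which real default-history data would not satisfy; all names and settings here are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 4000
X = rng.normal(size=(n, 3))
t = rng.integers(0, 2, n)  # 1 = prior default on file (synthetic)

# Synthetic truth: treatment raises default probability by 0.30,
# but only for units with X[:, 0] > 0 (a heterogeneous effect).
p = 0.15 + 0.30 * t * (X[:, 0] > 0)
y = (rng.random(n) < p).astype(int)


def fit_arm(mask):
    """Fit an outcome model on one treatment arm only."""
    model = RandomForestClassifier(
        n_estimators=200, min_samples_leaf=20, random_state=42
    )
    return model.fit(X[mask], y[mask])


m_treated, m_control = fit_arm(t == 1), fit_arm(t == 0)

# T-learner CATE: predicted outcome if treated minus predicted outcome if not.
cate = m_treated.predict_proba(X)[:, 1] - m_control.predict_proba(X)[:, 1]

cate_high = cate[X[:, 0] > 0].mean()
cate_low = cate[X[:, 0] <= 0].mean()
print(f"mean CATE where X0 > 0: {cate_high:.3f}, elsewhere: {cate_low:.3f}")
```

A histogram of `cate` would give the individual-level effect distribution; with observational data, the estimates are only as credible as the assumption that the covariates capture all confounding.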

Fairness Audit

Model recall, accuracy, and F1 stratified by:
- Income (low vs. high)
- Age (young vs. old)
- Home Ownership (RENT / OWN / MORTGAGE)
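
A stratified metric of this kind can be computed with a pandas group-by. The sketch below shows recall per home-ownership group on a handful of made-up rows; the helper `recall_by_group` and the column names `y_true` and `y_pred` are hypothetical, not taken from Blog.qmd.

```python
import pandas as pd


def recall_by_group(df: pd.DataFrame, group_col: str,
                    y_true: str = "y_true", y_pred: str = "y_pred") -> pd.Series:
    """Recall per group: share of actual defaults that were predicted as defaults."""
    positives = df[df[y_true] == 1]
    return positives.groupby(group_col)[y_pred].mean()


# Made-up predictions for eight borrowers.
audit = pd.DataFrame({
    "home": ["RENT", "RENT", "RENT", "RENT", "MORTGAGE", "MORTGAGE", "MORTGAGE", "OWN"],
    "y_true": [1, 1, 1, 0, 1, 1, 0, 1],
    "y_pred": [1, 1, 0, 0, 1, 0, 0, 0],
})
recalls = recall_by_group(audit, "home")
print(recalls)
```

Income and age splits would follow the same pattern after binning the numeric column, for example at the median.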

Results Summary

Model	Accuracy	AUC	F1	Recall
XGBoost	0.861	0.825	0.620	0.559
Random Forest	0.847	0.819	0.595	0.522
Logistic Regression	0.819	0.795	0.534	0.466

Top predictive features: loan_percent_income, loan_int_rate, loan_grade, person_home_ownership

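Fairness findings: Recall differs notably across groups. It is higher for renters (0.58) than for mortgage holders (0.37), and higher for low-income than for high-income borrowers (0.50 vs. 0.40), meaning the model catches a larger share of actual defaults among renters and low-income borrowers.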

Causal finding: Previous default has heterogeneous effects; a distinct subgroup shows substantially elevated default risk, underlining the importance of CATE over average treatment effects in credit decisions.

Repository Structure

Empirical Project/
├── Makefile                        # Render command for reproducibility
├── README.md                       # This file
├── Blog.txt                        # Links to GitHub repo and published blog
│
└── Project/
    ├── Blog.qmd                    # Main Quarto source — all analysis and narrative
    ├── _quarto.yml                 # Quarto configuration (theme, format, TOC)
    ├── _publish.yml                # Quarto publish settings
    ├── styles.css                  # Custom CSS for blog styling
    │
    ├── Data/
    │   └── credit_risk_dataset.csv # Raw dataset
    │
    ├── Image/
    │   ├── banner.jpg              # Header banner
    │   ├── github.png              # GitHub icon
    │   ├── linkedin.png            # LinkedIn icon
    │   └── dataset.png             # Dataset download icon
    │
    ├── References/
    │   ├── references.bib          # BibTeX citations
    │   └── apa.csl                 # APA citation style
    │
    └── Outputs/                    # Rendered HTML output (auto-generated)
        ├── Blog.html
        ├── Blog_files/             # HTML assets, figures, JS/CSS libraries
        ├── data/                   # Dataset copy for in-blog download link
        └── styles.css

Requirements

Python: 3.13.2

Library	Version
pandas	2.2.3
numpy	2.2.3
matplotlib	3.10.0
seaborn	0.13.2
scikit-learn	1.6.1
xgboost	2.1.4
statsmodels	0.14.4
scipy	1.15.2
plotly	6.0.1
ipython	8.32.0

Install all dependencies:

pip install pandas==2.2.3 numpy==2.2.3 matplotlib==3.10.0 seaborn==0.13.2 \
    scikit-learn==1.6.1 xgboost==2.1.4 statsmodels==0.14.4 scipy==1.15.2 \
    plotly==6.0.1 ipython==8.32.0

Quarto: quarto.org (required to render Blog.qmd)

How to Reproduce

Clone the repo and render the blog from the project root:

git clone https://github.com/JoshLG18/DSE-EMP-Project.git
cd DSE-EMP-Project

# Render the Quarto blog to HTML
make

# Open the rendered output
make open

Or render directly with Quarto:

cd Project
quarto render Blog.qmd

A random seed of 42 is set throughout to ensure reproducibility.

Links

Published Blog: joshlg18.quarto.pub/predicting-loan-defaults-a-data-driven-approach-to-credit-risk-analysis
Dataset: Kaggle — Credit Risk Dataset
