Course: BEE2041 Data Science in Economics — Final Empirical Project
Live Blog: joshlg18.quarto.pub/predicting-loan-defaults-a-data-driven-approach-to-credit-risk-analysis
This project applies machine learning to predict loan defaults using a real-world credit risk dataset. Three classification models are implemented and compared — Logistic Regression, Random Forest, and XGBoost — evaluated on standard performance metrics. The analysis goes beyond pure prediction to include:
- Causal analysis: Estimating heterogeneous treatment effects (CATE) of prior default history on default risk
- Fairness audit: Examining model performance disparities across demographic subgroups (income, age, home ownership)
The project is written as an interactive Quarto blog (Blog.qmd) that renders to a fully reproducible HTML report.
Can machine learning models effectively predict loan defaults, and how fair are their predictions across different borrower groups?
Source: Credit Risk Dataset (Kaggle)
| Property | Value |
|---|---|
| File | Project/Data/credit_risk_dataset.csv |
| Rows | 32,581 observations |
| Columns | 12 variables |
| Target | loan_status (0 = No Default, 1 = Default) |
| Variable | Description | Type |
|---|---|---|
| person_age | Borrower age (years) | Numeric |
| person_income | Annual income (USD) | Numeric |
| person_home_ownership | RENT / OWN / MORTGAGE / OTHER | Categorical |
| person_emp_length | Employment length (years) | Numeric |
| loan_intent | Purpose: EDUCATION, MEDICAL, VENTURE, etc. | Categorical |
| loan_grade | Credit grade: A–F | Categorical |
| loan_amnt | Loan amount (USD) | Numeric |
| loan_int_rate | Interest rate (%) | Numeric |
| loan_status | Default indicator (target variable) | Binary |
| loan_percent_income | Loan amount as % of annual income | Numeric |
| cb_person_default_on_file | Prior default on record (Y/N); treatment variable | Categorical |
| cb_person_cred_hist_length | Credit history length (years) | Numeric |
- Removed 406 duplicate rows
- Imputed person_emp_length missing values (895) with the median
- Imputed loan_int_rate missing values (1,133) via regression on loan_grade
- Filtered implausible outliers (age and employment length capped at realistic maximums)
- Quantile transformation (to approximate normal distributions) for all numeric features
- Label encoding for categorical features
- Class-balanced weighting to handle imbalance (~70% non-default, ~30% default)
- 80/20 stratified train-test split
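
The cleaning and preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in Blog.qmd: it runs on a small synthetic frame with made-up values, uses only a handful of the dataset's columns, skips the outlier filter (the thresholds are not stated here), and assumes the quantile transformer is fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, QuantileTransformer

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in for credit_risk_dataset.csv (all values are made up).
df = pd.DataFrame({
    "person_age": rng.integers(20, 70, n).astype(float),
    "person_emp_length": rng.integers(0, 30, n).astype(float),
    "loan_grade": rng.choice(list("ABCDEF"), n),
    "loan_int_rate": rng.uniform(5, 20, n),
    "loan_status": rng.choice([0, 1], n, p=[0.7, 0.3]),
})
df.loc[rng.choice(n, 20, replace=False), "person_emp_length"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "loan_int_rate"] = np.nan

# Cleaning: drop duplicates, then median-impute employment length.
df = df.drop_duplicates().reset_index(drop=True)
df["person_emp_length"] = df["person_emp_length"].fillna(
    df["person_emp_length"].median()
)

# Impute interest rate by regressing it on one-hot encoded loan_grade.
grade_dummies = pd.get_dummies(df["loan_grade"], dtype=float)
known = df["loan_int_rate"].notna()
reg = LinearRegression().fit(grade_dummies[known], df.loc[known, "loan_int_rate"])
df.loc[~known, "loan_int_rate"] = reg.predict(grade_dummies[~known])

# Label-encode the categorical feature.
df["loan_grade"] = LabelEncoder().fit_transform(df["loan_grade"])

# 80/20 stratified split, seed 42.
X = df.drop(columns="loan_status")
y = df["loan_status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Quantile transform numeric features towards a normal distribution.
# Assumption: fitted on the training split only, to avoid test-set leakage.
numeric_cols = ["person_age", "person_emp_length", "loan_int_rate"]
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100, random_state=42)
X_train[numeric_cols] = qt.fit_transform(X_train[numeric_cols])
X_test[numeric_cols] = qt.transform(X_test[numeric_cols])

print(X_train.shape, X_test.shape)
```

Class-balanced weighting is not shown here because it is applied at model-fitting time rather than to the data.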
| Model | Key Settings |
|---|---|
| Logistic Regression | L1/L2 regularisation, GridSearchCV (C ∈ [0.01, 100]) |
| Random Forest | 100 estimators, max_depth ∈ [10, 20], GridSearchCV |
| XGBoost | 200 estimators, L1/L2 regularisation, scale_pos_weight tuning |
All models tuned via 3-fold Stratified K-Fold cross-validation, optimising F1-score.
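
As a rough illustration of this tuning setup, the sketch below runs a grid search for the Logistic Regression model with 3-fold stratified cross-validation and F1 as the scoring metric. The data is synthetic, and the specific grid points for C and the choice of the liblinear solver are assumptions made for the example, not settings taken from Blog.qmd.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (about 70% negative, 30% positive).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.7, 0.3], random_state=42
)

# 3-fold stratified cross-validation with a fixed seed.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Hypothetical grid: points inside the stated range C in [0.01, 100].
param_grid = {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}

search = GridSearchCV(
    # liblinear supports both L1 and L2 penalties.
    LogisticRegression(solver="liblinear", class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1",  # optimise F1 rather than accuracy
    cv=cv,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern would apply to the Random Forest grid over max_depth by swapping the estimator and param_grid.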
- Accuracy, Precision, Recall (Sensitivity), F1-Score
- AUC-ROC, Log Loss, Brier Score
- Confusion matrices
- Feature importance (Odds Ratios for LR; feature importance scores for RF/XGBoost)
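
The metrics listed above can all be computed with scikit-learn. The sketch below uses ten hand-made labels and predicted probabilities, and assumes a 0.5 decision threshold; neither the data nor the threshold comes from the project.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    brier_score_loss,
    confusion_matrix,
    f1_score,
    log_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy labels and predicted default probabilities (made up).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.9, 0.2, 0.7, 0.1])

# Assumption: classify as default when the probability is at least 0.5.
y_pred = (y_prob >= 0.5).astype(int)

metrics = {
    # Threshold-based metrics use the hard predictions.
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    # Probability-based metrics use the predicted probabilities.
    "auc_roc": roc_auc_score(y_true, y_prob),
    "log_loss": log_loss(y_true, y_prob),
    "brier": brier_score_loss(y_true, y_prob),
}

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(cm)
```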
- Conditional Average Treatment Effects (CATE) estimated using a Random Forest causal model
- Treatment: cb_person_default_on_file = Y vs. N
- Individual-level treatment effect distribution visualised
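
One common way to estimate CATE with random forests is a T-learner: fit one outcome model on treated units, another on control units, and take the difference of their predictions. The sketch below illustrates that idea on synthetic data with a known effect. It is an assumption that this resembles the project's approach; the estimator used in Blog.qmd may differ. With observational data such as prior default history, the estimates are only causal if there is no unobserved confounding.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 4000

# Synthetic covariates and a binary treatment (1 = prior default on file).
x = rng.uniform(0, 1, size=(n, 2))
t = rng.integers(0, 2, n)

# Known ground truth: treatment raises default probability by 0.4
# only when the first covariate exceeds 0.5, and by 0.0 otherwise.
p_default = 0.1 + 0.4 * t * (x[:, 0] > 0.5)
y = rng.binomial(1, p_default)

def fit_outcome_model(features, outcome):
    """Fit a random forest that predicts the default probability."""
    model = RandomForestClassifier(
        n_estimators=200, min_samples_leaf=50, random_state=42
    )
    return model.fit(features, outcome)

# T-learner: separate outcome models for treated and control units.
model_treated = fit_outcome_model(x[t == 1], y[t == 1])
model_control = fit_outcome_model(x[t == 0], y[t == 0])

# Individual-level effect: predicted risk if treated minus if untreated.
cate = model_treated.predict_proba(x)[:, 1] - model_control.predict_proba(x)[:, 1]

high = cate[x[:, 0] > 0.5].mean()
low = cate[x[:, 0] <= 0.5].mean()
print(f"mean CATE, x0 > 0.5: {high:.2f}; x0 <= 0.5: {low:.2f}")
```

A histogram of `cate` would then show the individual-level effect distribution, with the affected subgroup appearing as a separate mode.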
- Model recall, accuracy, and F1 stratified by:
- Income (low vs. high)
- Age (young vs. old)
- Home Ownership (RENT / OWN / MORTGAGE)
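
A subgroup audit of this kind amounts to computing the same metrics separately within each group. The sketch below does this for home ownership on a tiny hand-made frame; the helper name `subgroup_metrics` and the `y_true` / `y_pred` column names are hypothetical, chosen for the example.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy audit frame: true labels and model predictions (made up).
audit = pd.DataFrame({
    "person_home_ownership": ["RENT"] * 4 + ["MORTGAGE"] * 4,
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 1, 1, 0, 1, 0, 0, 0],
})

def subgroup_metrics(frame, group_col):
    """Return recall, accuracy and F1 for each level of group_col."""
    rows = []
    for name, group in frame.groupby(group_col):
        rows.append({
            group_col: name,
            "n": len(group),
            "recall": recall_score(group["y_true"], group["y_pred"], zero_division=0),
            "accuracy": accuracy_score(group["y_true"], group["y_pred"]),
            "f1": f1_score(group["y_true"], group["y_pred"], zero_division=0),
        })
    return pd.DataFrame(rows).set_index(group_col)

report = subgroup_metrics(audit, "person_home_ownership")
print(report)
```

For numeric attributes such as income or age, the groups would first have to be derived, for example by splitting at the median; the cut-off used in the project is not stated here.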
| Model | Accuracy | AUC | F1 | Recall |
|---|---|---|---|---|
| XGBoost | 0.861 | 0.825 | 0.620 | 0.559 |
| Random Forest | 0.847 | 0.819 | 0.595 | 0.522 |
| Logistic Regression | 0.819 | 0.795 | 0.534 | 0.466 |
Top predictive features: loan_percent_income, loan_int_rate, loan_grade, person_home_ownership
Fairness findings: Recall differs notably across subgroups: 0.58 for renters versus 0.37 for mortgage holders, and 0.50 for low-income versus 0.40 for high-income borrowers. Actual defaulters are therefore caught more often among renters and low-income borrowers.
Causal finding: Previous default has heterogeneous effects; a distinct subgroup shows substantially elevated default risk, underlining the importance of CATE over average treatment effects in credit decisions.
Empirical Project/
├── Makefile                          # Render command for reproducibility
├── README.md                         # This file
├── Blog.txt                          # Links to GitHub repo and published blog
│
└── Project/
    ├── Blog.qmd                      # Main Quarto source: all analysis and narrative
    ├── _quarto.yml                   # Quarto configuration (theme, format, TOC)
    ├── _publish.yml                  # Quarto publish settings
    ├── styles.css                    # Custom CSS for blog styling
    │
    ├── Data/
    │   └── credit_risk_dataset.csv   # Raw dataset
    │
    ├── Image/
    │   ├── banner.jpg                # Header banner
    │   ├── github.png                # GitHub icon
    │   ├── linkedin.png              # LinkedIn icon
    │   └── dataset.png               # Dataset download icon
    │
    ├── References/
    │   ├── references.bib            # BibTeX citations
    │   └── apa.csl                   # APA citation style
    │
    └── Outputs/                      # Rendered HTML output (auto-generated)
        ├── Blog.html
        ├── Blog_files/               # HTML assets, figures, JS/CSS libraries
        ├── data/                     # Dataset copy for in-blog download link
        └── styles.css
Python: 3.13.2
| Library | Version |
|---|---|
| pandas | 2.2.3 |
| numpy | 2.2.3 |
| matplotlib | 3.10.0 |
| seaborn | 0.13.2 |
| scikit-learn | 1.6.1 |
| xgboost | 2.1.4 |
| statsmodels | 0.14.4 |
| scipy | 1.15.2 |
| plotly | 6.0.1 |
| ipython | 8.32.0 |
Install all dependencies:
pip install pandas==2.2.3 numpy==2.2.3 matplotlib==3.10.0 seaborn==0.13.2 \
scikit-learn==1.6.1 xgboost==2.1.4 statsmodels==0.14.4 scipy==1.15.2 \
plotly==6.0.1 ipython==8.32.0
Quarto: quarto.org (required to render Blog.qmd)
Clone the repo and render the blog from the project root:
git clone https://github.com/JoshLG18/DSE-EMP-Project.git
cd DSE-EMP-Project
# Render the Quarto blog to HTML
make
# Open the rendered output
make open
Or render directly with Quarto:
cd Project
quarto render Blog.qmd
A random seed of 42 is set throughout to ensure reproducibility.