An end-to-end, production-shaped fraud detection system: rich feature engineering, an honest class-imbalance study, three modelling paradigms (gradient boosting · graph neural network · autoencoder), SHAP explainability for compliance, concept-drift monitoring, and a real-time scoring service.
Fraud detection is the canonical hard problem in applied ML and the bread-and-butter of data science at any fintech (GoPay, OVO, Kredivo, Dana). It forces you to confront the issues tutorials skip:
- Extreme class imbalance (~0.5% fraud): accuracy is meaningless
- Asymmetric costs: a missed fraud and a blocked customer are not equal
- Adversarial drift: attack patterns change; a static model decays
- Latency: decisions must happen in milliseconds, mid-transaction
- Explainability: regulators require every automated decline to be justified
This project addresses each one explicitly rather than stopping at a notebook with an inflated accuracy score.
Simulated but realistic credit-card transactions from the Sparkov Data Generator (Kaggle: kartik2112/fraud-detection).
| Property | Value |
|---|---|
| Transactions | 1.85M (1.30M train · 0.56M test) |
| Fraud rate | ~0.5% (realistic extreme imbalance) |
| Split | Temporal: train = earlier period, test = later (no leakage) |
| Cards / Merchants | 983 / 693 |
| Period | Jan 2019 to Dec 2020 |
Unlike the over-used ULB creditcard.csv (PCA-anonymised V1–V28), Sparkov keeps
human-readable columns (merchant, category, geo-coordinates, timestamps), which
is what makes meaningful feature engineering and a transaction graph possible.
All per-card features are computed in strict time order, looking only at the past
(closed='left' rolling windows, shifted expanding stats), so there is zero target leakage.
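A minimal, dependency-free sketch of what "looking only at the past" means for two of the features named in this README. It is not the code in `src/features.py`; the function name, the `(card_id, unix_ts, amount)` tuple layout and the fallback ratio of 1.0 for a card's first transaction are assumptions made for illustration. The half-open window `[t - window, t)` mirrors the semantics of a `closed='left'` rolling window.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import accumulate


def past_only_features(txns, window_secs=3600):
    """txns: list of (card_id, unix_ts, amount).

    Returns one feature dict per input row, in input order. Every feature
    is computed only from the same card's strictly earlier transactions.
    """
    by_card = defaultdict(list)
    for idx, (card, ts, amt) in enumerate(txns):
        by_card[card].append((ts, amt, idx))

    out = [None] * len(txns)
    for rows in by_card.values():
        rows.sort()  # chronological order within the card
        times = [r[0] for r in rows]
        # prefix[j] = sum of the first j amounts (a "shifted" expanding sum)
        prefix = [0.0] + list(accumulate(r[1] for r in rows))
        for ts, amt, idx in rows:
            hi = bisect_left(times, ts)                # rows strictly before ts
            lo = bisect_left(times, ts - window_secs)  # rows before the window start
            past_mean = prefix[hi] / hi if hi else None
            out[idx] = {
                "txn_count_1h": hi - lo,  # count in [ts - window, ts)
                "amt_ratio_to_card_mean": amt / past_mean if past_mean else 1.0,
            }
    return out


example = [("A", 0, 10.0), ("A", 1800, 30.0), ("A", 4000, 60.0), ("B", 100, 5.0)]
features = past_only_features(example)
```

Because the current row never enters its own window or its own mean, the label of a transaction cannot influence the features used to score it.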
| Family | Features | Catches |
|---|---|---|
| Transaction | amount, log-amount | Large-value fraud |
| Temporal | hour, day-of-week, is_night, is_weekend | Fraud clusters at night |
| Demographic | cardholder age, city population | Population priors |
| Geo | haversine distance home-to-merchant, distance from previous txn | Impossible-travel |
| Velocity | rolling count/sum/mean per card over 1h / 24h / 7d | Transaction bursts |
| Behavioral | deviation & ratio vs card's own past mean, secs since last txn, distinct merchants 24h | Out-of-pattern spend |
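The Geo row can be illustrated with the standard haversine formula. This is a sketch rather than the project's implementation: the function names, the `(lat, lon, unix_ts)` tuple layout and the one-second floor on elapsed time are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implied_speed_kmh(prev, curr):
    """prev, curr: (lat, lon, unix_ts) of two consecutive transactions on one card.

    A very high implied speed is the "impossible travel" signal.
    """
    dist = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = max(curr[2] - prev[2], 1) / 3600.0  # floor at 1 s to avoid division by zero
    return dist / hours


one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)  # one degree of longitude at the equator
```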
Signal check (test period, fraud vs legit means):
| Feature | Legit | Fraud | Ratio |
|---|---|---|---|
| `amt` | 67.6 | 528.4 | 7.8× |
| `amt_ratio_to_card_mean` | 1.02 | 6.52 | 6.4× |
| `txn_count_1h` | 0.22 | 0.67 | 3.0× |
| `is_night` | 0.30 | 0.86 | 2.9× |
| Metric | Value | What it means |
|---|---|---|
| PR-AUC | 0.967 | Primary metric for imbalanced fraud (average precision) |
| ROC-AUC | 0.999 | Near-perfect ranking |
| Recall @ top 1% | 98.0% | Reviewing the riskiest 1% of txns catches 98% of fraud |
| Precision @ top 100 | 100% | The 100 highest-risk transactions are all fraud |
| Cost-optimal threshold | 0.019 | Minimises expected business cost; 20% cheaper than the naive 0.5 |
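For readers who want to see how the ranking metrics in the table are defined, here is a small pure-Python sketch. The function names are hypothetical and score ties are broken arbitrarily, so a library implementation may differ slightly when many scores are equal.

```python
def _ranked(scores):
    """Indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def average_precision(y_true, scores):
    """PR-AUC as average precision: mean of precision at the rank of each fraud."""
    hits, total = 0, 0.0
    for rank, i in enumerate(_ranked(scores), start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def recall_at_top_fraction(y_true, scores, fraction=0.01):
    """Share of all fraud found when reviewing the riskiest `fraction` of transactions."""
    k = max(1, int(len(scores) * fraction))
    caught = sum(y_true[i] for i in _ranked(scores)[:k])
    n_fraud = sum(y_true)
    return caught / n_fraud if n_fraud else 0.0


def precision_at_k(y_true, scores, k=100):
    """Share of fraud among the k highest-scored transactions."""
    k = min(k, len(scores))
    if k == 0:
        return 0.0
    return sum(y_true[i] for i in _ranked(scores)[:k]) / k


labels = [1, 0, 1, 0, 0]
risk = [0.9, 0.8, 0.7, 0.2, 0.1]
ap = average_precision(labels, risk)  # frauds sit at ranks 1 and 3
```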
The cost-optimal threshold (0.019, well below 0.5) reflects realistic fraud economics: a missed fraud loses the transaction amount, so it is worth accepting more false positives to catch more fraud. At that threshold the model catches 96.6% of fraud.
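The threshold search can be sketched as a simple sweep. The missed-fraud cost follows the text (the transaction amount is lost); the fixed false-positive cost `fp_cost`, its value of 5.0, the threshold grid and the function names are assumptions for illustration, not the values in `src/config.py`.

```python
def expected_cost(y_true, scores, amounts, threshold, fp_cost=5.0):
    """Total business cost of declining every transaction scored >= threshold."""
    cost = 0.0
    for y, s, amt in zip(y_true, scores, amounts):
        flagged = s >= threshold
        if y == 1 and not flagged:
            cost += amt        # missed fraud: the transaction amount is lost
        elif y == 0 and flagged:
            cost += fp_cost    # blocked legitimate customer (assumed flat cost)
    return cost


def cost_optimal_threshold(y_true, scores, amounts, fp_cost=5.0, grid=None):
    """Return the grid threshold with the lowest total cost (first one on ties)."""
    grid = grid or [i / 1000 for i in range(1, 1000)]
    return min(grid, key=lambda t: expected_cost(y_true, scores, amounts, t, fp_cost))


labels = [1, 0, 0, 1]
risk = [0.30, 0.10, 0.40, 0.90]
amounts = [500.0, 20.0, 30.0, 800.0]
best = cost_optimal_threshold(labels, risk, amounts)
```

In this toy example the naive 0.5 threshold misses the 500.0 fraud, while a lower threshold trades one cheap false positive for catching it, which is the same asymmetry that pushes the real threshold far below 0.5.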
| Model | PR-AUC | ROC-AUC | Role |
|---|---|---|---|
| LightGBM (cost-sensitive, Optuna) | 0.967 | 0.999 | Production workhorse |
| GraphSAGE (GNN, directed card+merchant graph) | 0.368 | 0.985 | Relational burst signal |
| Autoencoder (unsupervised) | 0.135 | 0.866 | Label-free novel-fraud net |
The GNN uses strictly directed past-to-present edges (a transaction can only attend to the card's / merchant's earlier transactions), so there is no temporal leakage.
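One way to build such edges is sketched below. It is an illustration, not the construction in `src/gnn.py`: the dict keys, the function name and the choice to link each transaction to its `k=3` most recent predecessors are all assumptions. Each pair is `(source, target)`, so with source-to-target message passing information flows only forward in time.

```python
from collections import defaultdict


def build_past_to_present_edges(txns, k=3):
    """txns: list of dicts with 'card', 'merchant' and 'ts'; node id = list position.

    Returns sorted (src, dst) pairs where src is a strictly earlier transaction
    sharing the card or the merchant with dst.
    """
    order = sorted(range(len(txns)), key=lambda i: txns[i]["ts"])
    recent = defaultdict(list)  # (kind, key) -> node ids seen so far, oldest first
    edges = set()
    for node in order:
        for kind in ("card", "merchant"):
            key = (kind, txns[node][kind])
            for src in recent[key][-k:]:
                # the strict inequality drops same-timestamp neighbours
                if txns[src]["ts"] < txns[node]["ts"]:
                    edges.add((src, node))
            recent[key].append(node)
    return sorted(edges)


toy = [
    {"card": "A", "merchant": "M1", "ts": 10},
    {"card": "A", "merchant": "M2", "ts": 20},
    {"card": "B", "merchant": "M1", "ts": 30},
    {"card": "A", "merchant": "M1", "ts": 40},
]
edges = build_past_to_present_edges(toy)
```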
Honest outcome: gradient boosting dominates tabular fraud. The GNN adds relational context (and beats the unsupervised baseline), while the autoencoder, trained with no fraud labels at all, still ranks fraud far above random, which is exactly its job as a safety net for novel attacks.
An apples-to-apples study (same LightGBM, same matrix, only the sampling varies) reproduces the 2025 industry consensus:
| Strategy | PR-AUC | Fit time | Train rows | Verdict |
|---|---|---|---|---|
| None | 0.682 | 11.7s | 1.10M | Baseline: imbalance cripples it |
| Cost-sensitive (`scale_pos_weight`) | 0.980 | 13.0s | 1.10M | Best ROI |
| SMOTE | 0.982 | 28.8s | 1.21M | +0.002 PR-AUC for 2.2× the time |
| Undersample | 0.981 | 4.1s | 0.07M | Fast, but discards 94% of data |
Cost-sensitive weighting matches SMOTE on PR-AUC at a fraction of the compute (+0.002 PR-AUC is within noise, for 2.2× the runtime). A 2025 review of 821 papers found only 6% of scale-focused work uses SMOTE successfully; production has largely abandoned it. This project shows why.
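A sketch of what cost-sensitive weighting does. `scale_pos_weight` is a LightGBM parameter; setting it to the negative-to-positive ratio is a common starting heuristic, and whether this project uses exactly that ratio or a tuned value is not stated here. The helper names and the weighted log-loss illustration are assumptions.

```python
import math


def negative_to_positive_ratio(y_train):
    """Common starting value for scale_pos_weight: n_negative / n_positive."""
    n_pos = sum(y_train)
    if n_pos == 0:
        raise ValueError("no positive examples in the training labels")
    return (len(y_train) - n_pos) / n_pos


def weighted_log_loss(y_true, probs, pos_weight):
    """Binary log loss where each positive example counts pos_weight times."""
    total = 0.0
    for y, p in zip(y_true, probs):
        total -= pos_weight * math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(y_true)


y_train = [1] * 5 + [0] * 995  # 0.5% fraud, as in the dataset table
weight = negative_to_positive_ratio(y_train)

# Parameter dict one could pass to a LightGBM binary classifier.
params = {"objective": "binary", "scale_pos_weight": weight}
```

No rows are synthesised or discarded, which is why the training set size and fit time stay close to the unweighted baseline in the table above.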
| Metric | Value |
|---|---|
| Latency P50 | 7.1 ms |
| Latency P95 / P99 | 11.2 / 13.0 ms |
| Throughput (single thread) | ~132 txn/sec |
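Figures like these can be collected with a small timing harness. The sketch below is an assumption about the general approach, not the contents of `streaming/simulate_stream.py`; the function names and the nearest-rank percentile definition are illustrative choices.

```python
import math
import time


def benchmark(score_fn, payloads):
    """Call score_fn on each payload and return per-call latencies in milliseconds."""
    latencies_ms = []
    for payload in payloads:
        start = time.perf_counter()
        score_fn(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def percentile(values, q):
    """Nearest-rank percentile for an integer q between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """P50 / P95 / P99 plus single-thread throughput (calls per second)."""
    total_seconds = sum(latencies_ms) / 1000.0
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "throughput_per_sec": len(latencies_ms) / total_seconds,
    }


timings = benchmark(lambda payload: payload, [1, 2, 3])  # trivial stand-in scorer
```

Single-thread throughput is the reciprocal of the mean latency, so it is driven by the whole distribution and not only by the median.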
Velocity features are maintained incrementally in an in-memory online store, so per-transaction scoring stays in the single-digit-millisecond range, well inside the sub-100ms budget real payment systems require. Replaying the test stream at the cost-optimal threshold catches 85% of fraud at a 1.8% decline rate.
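A minimal sketch of an incremental in-memory store for one velocity window. The class name, method names and the read-then-update protocol are assumptions, not the API of `src/online.py`. Keeping a running sum next to a deque makes each lookup amortised O(1), because every event is appended once and evicted once.

```python
from collections import defaultdict, deque


class OnlineVelocityStore:
    """Per-card rolling count / sum / mean over the last `window_secs` seconds."""

    def __init__(self, window_secs=3600):
        self.window_secs = window_secs
        self._events = defaultdict(deque)  # card -> deque of (ts, amount), oldest first
        self._sums = defaultdict(float)    # card -> running sum of amounts in the window

    def _evict(self, card, ts):
        events = self._events[card]
        cutoff = ts - self.window_secs
        while events and events[0][0] < cutoff:
            _, old_amount = events.popleft()
            self._sums[card] -= old_amount

    def features(self, card, ts):
        """Read features for a transaction at `ts` BEFORE it is recorded (past-only)."""
        self._evict(card, ts)
        n = len(self._events[card])
        total = self._sums[card]
        return {"txn_count": n, "amt_sum": total, "amt_mean": total / n if n else 0.0}

    def update(self, card, ts, amount):
        """Record the transaction after it has been scored."""
        self._events[card].append((ts, amount))
        self._sums[card] += amount


store = OnlineVelocityStore(window_secs=3600)
first = store.features("A", 0)
store.update("A", 0, 10.0)
second = store.features("A", 1800)
store.update("A", 1800, 30.0)
third = store.features("A", 4000)  # the ts=0 event has left the 1h window
```

Calling `features` before `update` keeps the online path consistent with the past-only rule used for the batch features.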
The same recipe was applied to the real-world ULB dataset (284,807 genuine European card transactions, 0.17% fraud, PCA-anonymised):
| Strategy | PR-AUC (ULB, real) | PR-AUC (Sparkov) |
|---|---|---|
| Plain LightGBM | 0.418 | 0.682 |
| Cost-sensitive | 0.025 | 0.980 |
Imbalance strategy is dataset-dependent. Cost-sensitive weighting dominates on Sparkov (strong engineered features) but collapses on ULB (weak PCA features): there, aggressive `scale_pos_weight` floods the top of the score ranking with false positives and a plain model wins. There is no universal imbalance recipe; you validate per-dataset. This is why the project ships an imbalance study, not a single assumed fix.
Raw transaction stream
        │
        ▼
┌──────────────────┐   offline (batch)         online (real-time)
│ Feature pipeline │ ──► src/features.py  ──►  src/online.py (in-memory state)
└──────────────────┘
        │
        ▼
┌──────────────────────────────────────────────┐
│ Models                                       │
│ • LightGBM    (cost-sensitive + Optuna)      │ ◄── production
│ • GraphSAGE   (card-chain transaction graph) │ ◄── relational
│ • Autoencoder (legit-only reconstruction)    │ ◄── unsupervised
└──────────────────────────────────────────────┘
        │
        ▼
┌──────────────────┐  ┌──────────────┐  ┌───────────────┐
│ Evaluation       │  │ SHAP explain │  │ PSI drift     │
│ PR-AUC + cost    │  │ (compliance) │  │ monitoring    │
└──────────────────┘  └──────────────┘  └───────────────┘
        │
        ▼
FastAPI /score  +  Gradio dashboard (HF Spaces)
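The "PSI drift" box refers to the Population Stability Index, which compares a feature's binned distribution in a reference period against a recent period: PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the reference and recent shares in bin i. The sketch below is one common way to compute it, not necessarily what `src/drift.py` does; the quantile binning, the `eps` floor for empty bins and the function name are assumptions. Values above roughly 0.25 are conventionally read as significant drift, which is a rule of thumb and not a threshold taken from this project.

```python
import math
from bisect import bisect_right


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the reference sample `expected`."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    ordered = sorted(expected)
    # Interior bin edges at the reference sample's quantiles.
    edges = [ordered[min(len(ordered) - 1, i * len(ordered) // n_bins)]
             for i in range(1, n_bins)]

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [float(v) for v in range(1000)]
shifted = [v + 500.0 for v in reference]
no_drift = psi(reference, reference)
strong_drift = psi(reference, shifted)
```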
fraud-detection/
├── src/
│   ├── config.py        # paths, feature groups, costs, hyperparams
│   ├── data.py          # Sparkov download + load
│   ├── features.py      # batch feature engineering (no leakage)
│   ├── online.py        # incremental online feature store (real-time)
│   ├── preprocess.py    # matrices + temporal split
│   ├── train.py         # LightGBM + imbalance study + Optuna
│   ├── autoencoder.py   # unsupervised anomaly detector (PyTorch)
│   ├── gnn.py           # GraphSAGE on card-chain graph (PyG)
│   ├── evaluate.py      # PR-AUC, business cost, threshold optimization
│   ├── explain.py       # SHAP
│   └── drift.py         # PSI concept-drift
├── scripts/             # download / features / train / autoencoder / gnn / drift
├── api/                 # FastAPI real-time scoring service
├── streaming/           # streaming replay + latency benchmark
├── app/gradio_app.py    # 6-tab interactive dashboard
├── tests/               # pytest (features, evaluate, drift, online)
└── .github/workflows/   # CI
pip install -r requirements.txt
python scripts/download_data.py # Sparkov (~200 MB)
python scripts/run_features.py # engineered feature tables (~3 min)
python scripts/run_training.py # LightGBM + imbalance study + Optuna
python scripts/run_autoencoder.py # unsupervised baseline (GPU)
python scripts/run_gnn.py # GraphSAGE (GPU)
python scripts/run_drift.py # PSI drift report
python streaming/simulate_stream.py # latency benchmark
python app/gradio_app.py # dashboard at :7860
uvicorn api.main:app --port 8000 # real-time API
pytest tests/ -v # tests
- Imbalanced learning done right: PR-AUC, cost-sensitive learning, business-cost threshold optimization (not accuracy, not the default 0.5), plus the critical-thinking twist that the best imbalance strategy is dataset-dependent (cross-checked on the real ULB data)
- Feature engineering: leakage-safe velocity/behavioral/geo features that carry the signal
- Breadth of modelling: gradient boosting, graph neural networks, and autoencoders, compared honestly
- Production thinking: real-time online features, latency benchmarking, drift monitoring, explainability
- Engineering: typed config, unit tests, CI, MLflow tracking, Docker, deployed demo
Relevant roles: fraud/risk DS at GoPay, OVO, Kredivo, Akulaku, Dana; any payments or lending team.
- Deng et al. Cost-sensitive vs SMOTE at scale; 821-paper review (2025) on imbalanced learning in production
- Hamilton, Ying, Leskovec (2017). Inductive Representation Learning on Large Graphs (GraphSAGE)
- SR 11-7 / FinCEN model-risk guidance. Explainability requirements for automated decisions
- Sparkov Data Generator, Brandon Harris (dataset)
Built by Muhammad Fikri Wahidin · ML Engineer / Data Scientist portfolio