
# FRAUDULENT TRANSACTION DETECTION FOR DIGITAL MONEY TRANSFER

## End-to-End Machine Learning Project Report

**Prepared by:** Ndumbi Kimani
**Project Type:** Supervised Machine Learning – Fraud Classification
**Objective:** Detection of Fraudulent Transactions in Digital Money Transfer Systems

---

# 1. Executive Summary

Fraud within digital money transfer systems presents significant financial, regulatory, and reputational risks. In money markets, even a small number of undetected fraudulent transactions can lead to substantial financial loss and erosion of stakeholder trust.

This project developed and evaluated multiple machine learning models to detect fraudulent transactions. After comprehensive comparison and optimization, **Logistic Regression was selected as the preferred model due to its superior recall performance**, which aligns with the strategic objective of minimizing missed fraud cases.

Key Results:

* ROC-AUC above 0.96 for all four models on the test set (0.9642 to 0.9816)
* Fraud recall of 94% with Logistic Regression, the highest of the four models
* Test accuracy between 95% and 99% across models
* Fully explainable model outputs using SHAP

Given the business context, **maximizing recall (reducing false negatives) was prioritized over maximizing precision**.

---

# 2. Business Context & Strategic Objective

Fraud detection in digital money markets must prioritize:

1. Preventing financial loss
2. Reducing undetected fraud (False Negatives)
3. Maintaining regulatory compliance
4. Supporting audit transparency

In high-risk financial systems, missing a fraudulent transaction is significantly more costly than incorrectly flagging a legitimate one. Therefore, this project explicitly optimized for:

> **Highest possible recall for fraudulent transactions**
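
For reference, recall and precision for the fraud class follow the standard confusion-matrix definitions, where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives. These are the textbook definitions rather than anything specific to this project:

$$
\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}
$$

Optimizing for recall therefore means driving $FN$ (missed fraud) toward zero, typically at the cost of a higher $FP$ count.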

---

# 3. Dataset Overview

The dataset contains structured transaction records including financial, behavioral, and risk-related attributes.

## 3.1 Target Variable

* **is_fraud**

  * 1 = Fraudulent transaction
  * 0 = Legitimate transaction

## 3.2 Feature Categories

### Transaction Features

* Transaction amount
* Transaction fee
* Timestamp
* Channel (mobile, web, ATM, etc.)
* Source & destination currency

### Customer & Account Features

* KYC tier
* Account age
* Home country

### Risk & Behavioral Indicators

* Device risk score
* Velocity indicators
* Fraud risk flags
* Historical fraud signals

The dataset was sorted chronologically so that the time-based train-test split (Section 4.4) mirrors real deployment and later transactions cannot leak into the training data.

---

# 4. Data Preparation & Engineering

## 4.1 Data Cleaning

* Removal of non-informative identifiers
* Validation of financial consistency
* Handling missing values
* Chronological sorting

## 4.2 Feature Engineering

Features were separated into:

* Numerical features (amounts, risk scores, velocity)
* Categorical features (channel, country, currency, KYC tier)

## 4.3 Preprocessing Pipeline

* OneHotEncoder for categorical features
* StandardScaler for numerical features
* Implemented using ColumnTransformer within a Scikit-learn Pipeline

After transformation:

* The original features expanded into 48 model-ready features. The expansion comes from one-hot encoding, which creates one column per category value.
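
The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column names, the toy rows, and the `handle_unknown="ignore"` setting are assumptions, and the real dataset has more columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names and toy rows; the real schema may differ.
numeric_cols = ["amount", "device_risk_score"]
categorical_cols = ["channel", "kyc_tier"]

toy = pd.DataFrame({
    "amount": [12.5, 980.0, 45.0, 3100.0],
    "device_risk_score": [0.10, 0.85, 0.20, 0.95],
    "channel": ["mobile", "web", "atm", "web"],
    "kyc_tier": ["standard", "low", "enhanced", "low"],
})

preprocessor = ColumnTransformer([
    # Scale numerical features to zero mean and unit variance.
    ("num", StandardScaler(), numeric_cols),
    # One column per category; unseen categories at scoring time become all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fitting the preprocessor alone shows how the feature count grows.
encoded = preprocessor.fit_transform(toy)
print(encoded.shape)  # 2 scaled columns plus one column per category value

# Wrapping preprocessing and model together keeps training and scoring consistent.
pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("model", LogisticRegression(class_weight="balanced")),
])
```

Keeping the encoder and scaler inside the pipeline means they are fitted on training data only, which avoids leaking test-set statistics into the model.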

## 4.4 Train-Test Split

* 80% training set
* 20% testing set
* Time-based split to simulate real deployment
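
A time-based 80/20 split can be sketched in plain Python as shown here. The record layout and the `time_based_split` helper are hypothetical; the point is only that the data is sorted once and cut once, so every training row precedes every test row.

```python
from datetime import datetime, timedelta


def time_based_split(records, timestamp_key="timestamp", train_fraction=0.8):
    """Sort records chronologically and cut once at the given fraction."""
    ordered = sorted(records, key=lambda r: r[timestamp_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Toy data: ten transactions one hour apart, deliberately out of order.
start = datetime(2024, 1, 1)
records = [
    {"timestamp": start + timedelta(hours=h), "amount": 10.0 * h}
    for h in (3, 0, 7, 1, 9, 4, 2, 8, 6, 5)
]

train, test = time_based_split(records)
print(len(train), len(test))

# The latest training timestamp is earlier than the earliest test timestamp.
print(max(r["timestamp"] for r in train) < min(r["timestamp"] for r in test))
```

Unlike a random split, this never lets the model train on transactions that occur after the ones it is evaluated on.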

---

# 5. Exploratory Data Analysis (EDA)

Key Findings:

* Significant class imbalance (fraud cases are the minority class)
* Strong predictive signals in:

  * Risk scores
  * Velocity patterns
  * Device-related features
  * Transaction amount anomalies
* High-risk behavioral indicators strongly correlated with fraud

These findings guided model selection and weighting strategies.

---

# 6. Model Development & Comparison

The following models were evaluated:

1. Logistic Regression
2. Random Forest
3. XGBoost
4. LightGBM

All models incorporated class balancing strategies to address fraud imbalance.
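
One common balancing strategy, and the one listed for the tuned Logistic Regression in Section 9.1, is scikit-learn's `balanced` class weighting, which sets each class weight to `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula by hand. The 5% fraud rate is a hypothetical figure, since the report does not state the actual class ratio.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label vector: 95 legitimate (0) and 5 fraudulent (1) transactions.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

With this weighting, each misclassified fraud case contributes far more to the training loss than a misclassified legitimate one, which pushes the model toward higher recall.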

---

# 7. Model Performance Comparison (Test Set)

| Model               | Accuracy | Precision (Fraud) | Recall (Fraud) | F1 Score | ROC-AUC |
| ------------------- | -------- | ----------------- | -------------- | -------- | ------- |
| Logistic Regression | 95%      | 0.74              | **0.94**       | 0.83     | 0.9816  |
| Random Forest       | 99%      | 0.99              | 0.92           | 0.95     | 0.9738  |
| XGBoost             | 98%      | 0.95              | 0.92           | 0.93     | 0.9642  |
| LightGBM            | 98%      | 0.92              | 0.92           | 0.92     | 0.9648  |
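
As a consistency check, the F1 column can be recomputed from the precision and recall columns, since F1 is their harmonic mean. The short sketch below uses only the values from the table; the helper name `f1_from_pr` is illustrative.

```python
def f1_from_pr(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Logistic Regression": (0.74, 0.94, 0.83),
    "Random Forest": (0.99, 0.92, 0.95),
    "XGBoost": (0.95, 0.92, 0.93),
    "LightGBM": (0.92, 0.92, 0.92),
}

for model, (p, r, f1_reported) in reported.items():
    print(f"{model}: computed F1 = {f1_from_pr(p, r):.2f}, reported = {f1_reported:.2f}")
```

All four recomputed values agree with the reported F1 scores to two decimal places.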

---

# 8. Strategic Model Selection: Why Logistic Regression?

Although Random Forest achieved markedly higher precision (0.99 vs 0.74) and higher accuracy (99% vs 95%), **Logistic Regression delivered the highest recall (94%, against 92% for the other three models)**, meaning:

* 94% of fraudulent transactions in the test set were correctly detected.
* Fewer fraudulent transactions go undetected.
* Reduced financial exposure from missed fraud.

Given the high cost of false negatives in money markets:

> **Logistic Regression is selected as the preferred operational model due to its superior fraud recall.**

In fraud-sensitive environments, recall is prioritized over precision because:

* A false positive can be reviewed manually.
* A false negative results in direct financial loss.
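
The trade-off can be made concrete with the test-set figures from Section 7. The sketch below converts precision and recall into missed frauds and false alarms per 100 actual fraud cases, then derives the cost ratio at which Logistic Regression breaks even with Random Forest. The helper name is illustrative, and the calculation assumes each missed fraud and each false alarm carries a fixed cost, which the report does not quantify.

```python
def errors_per_100_frauds(precision, recall):
    """Return (missed_frauds, false_alarms) per 100 actual fraud cases."""
    caught = 100 * recall
    missed = 100 - caught
    # precision = caught / (caught + false_alarms), solved for false_alarms.
    false_alarms = caught * (1 - precision) / precision
    return missed, false_alarms


lr_missed, lr_alarms = errors_per_100_frauds(precision=0.74, recall=0.94)
rf_missed, rf_alarms = errors_per_100_frauds(precision=0.99, recall=0.92)

extra_alarms = lr_alarms - rf_alarms   # additional manual reviews with Logistic Regression
extra_caught = rf_missed - lr_missed   # additional frauds caught with Logistic Regression

# Logistic Regression is cheaper overall only if one missed fraud costs more
# than this many false-alarm reviews.
break_even_ratio = extra_alarms / extra_caught

print(f"Logistic Regression: {lr_missed:.0f} missed, {lr_alarms:.0f} false alarms")
print(f"Random Forest:       {rf_missed:.0f} missed, {rf_alarms:.0f} false alarms")
print(f"Break-even cost ratio (missed fraud : false alarm) = {break_even_ratio:.1f}")
```

Per 100 fraud cases, Logistic Regression catches about 2 more frauds at the price of roughly 32 more false alarms, so the recall-first choice holds when a missed fraud costs more than about 16 manual reviews.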

---

# 9. Hyperparameter Optimization

## 9.1 Logistic Regression Tuning

Optimized parameters:

* Regularization strength (C)
* Penalty type (L1 vs L2)
* Class weights

Best configuration:

* C = 0.1
* Penalty = L1
* Class weight = balanced

Result:

* Maintained high recall
* Improved generalization
* Reduced overfitting risk
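
A recall-oriented search over these three parameters might look like the sketch below. Several details are assumptions rather than the project's actual setup: the data is synthetic, the grid values are hypothetical, `TimeSeriesSplit` stands in for whatever validation scheme was used, and the `liblinear` solver is chosen because it supports both L1 and L2 penalties. It also assumes a scikit-learn version that accepts the `penalty` argument.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic, imbalanced stand-in data (about 5% positives); not the project's dataset.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Hypothetical grid over the three tuned parameters.
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "penalty": ["l1", "l2"],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    scoring="recall",                # select the configuration with the best fraud recall
    cv=TimeSeriesSplit(n_splits=3),  # each fold validates on rows after its training rows
)
search.fit(X, y)

print(search.best_params_)
print(f"mean validation recall: {search.best_score_:.3f}")
```

On this synthetic data the selected configuration will not necessarily match the reported one (C = 0.1, L1, balanced); the sketch only shows the shape of the search.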

## 9.2 Random Forest Tuning

Optimized:

* Number of estimators
* Tree depth
* Split criteria
* Feature selection strategy

While performance improved slightly, recall remained lower than that of Logistic Regression.

---

# 10. Model Explainability (SHAP)

To ensure transparency and regulatory alignment, SHAP (SHapley Additive exPlanations) was implemented.

## 10.1 Global Feature Importance

Top contributing features included:

* Risk scores
* Velocity patterns
* Transaction amount
* Device indicators
* Currency mismatch signals

## 10.2 Local Explanation Capability

SHAP enables:

* Transaction-level explanations
* Clear identification of fraud-driving features
* Audit trail for compliance teams

This ensures the model is:

* Transparent
* Interpretable
* Operationally defensible
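
For a linear model such as Logistic Regression, a transaction-level explanation has a simple closed form when features are treated as independent: each feature's SHAP value, in log-odds units, is its coefficient times the feature's deviation from its mean. The sketch below computes this by hand rather than with the `shap` library. The feature names, coefficients, intercept, and transaction values are all hypothetical.

```python
def linear_shap_values(coefficients, feature_means, x):
    """Per-feature log-odds contributions for a linear model, assuming
    independent features: phi_j = w_j * (x_j - mean_j)."""
    return [w * (xj - mu) for w, xj, mu in zip(coefficients, x, feature_means)]


# Hypothetical standardized features and model parameters, for illustration only.
feature_names = ["device_risk_score", "txn_velocity", "amount"]
coefficients = [1.8, 1.2, 0.4]
feature_means = [0.0, 0.0, 0.0]  # standardized features have mean 0
intercept = -3.0

transaction = [2.0, 1.5, -0.5]
phi = linear_shap_values(coefficients, feature_means, transaction)

# Base value: the model's log-odds output at the average feature vector.
base_value = intercept + sum(w * mu for w, mu in zip(coefficients, feature_means))
log_odds = intercept + sum(w * xj for w, xj in zip(coefficients, transaction))

# List features by the size of their contribution to this prediction.
for name, value in sorted(zip(feature_names, phi), key=lambda t: -abs(t[1])):
    print(f"{name}: {value:+.2f}")

# Additivity: base value plus all contributions equals the model output.
print(abs(base_value + sum(phi) - log_odds) < 1e-9)
```

The additivity check at the end is what makes the explanation usable as an audit trail: the contributions account for the full difference between this transaction's score and the average score.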

---

# 11. Risk Management Impact

This solution provides:

* Early fraud detection capability
* Reduced financial exposure
* Improved regulatory readiness
* Transparent decision-making framework
* Data-driven fraud prevention

By prioritizing recall, the selected model misses about 6% of fraudulent transactions on the test set, compared with about 8% for each of the other three models.

---

# 12. Deployment Readiness

The solution is production-ready because:

* Full preprocessing pipeline integrated
* Hyperparameters optimized
* Model reproducible
* Explainability integrated
* Cross-validation confirms stability

Deployment options include:

* Real-time scoring API
* Batch processing system
* Integration with fraud monitoring dashboard

---

# 13. Limitations & Considerations

* Fraud patterns evolve, so periodic retraining is required.
* The lower precision of Logistic Regression (0.74) means more legitimate transactions are flagged, which increases manual review workload.
* Graph-based methods could further improve detection.

---

# 14. Future Enhancements

* Real-time model monitoring
* Drift detection system
* Ensemble stacking
* Adaptive learning mechanisms
* Integration with rule-based systems

---

# 15. Conclusion

This project successfully delivered a high-performing, explainable fraud detection system optimized for money market risk environments.

Key Achievements:

* End-to-end ML pipeline development
* Recall-focused model optimization
* Hyperparameter tuning
* Robust evaluation
* SHAP-based transparency
* Deployment-ready architecture

Most importantly:

> Logistic Regression was selected due to its superior recall (94%), aligning with the strategic priority of minimizing undetected fraud in financial markets.

