# Predicting Fraud

### On `Mobile money` transactions for `Mara Bank`

## Sections in this notebook.


- Introduction
    - Project Overview
    - Objectives
    - Dataset Background

- Data Importation
    - Loading Required Libraries
    - Reading Data Files
    - Initial Data Preview

- Data Normalization
    - Feature Encoding
    - Feature Normalization
    

- Model Training
    - Pick Models
    - Split Data
    - Training

- Initial Evaluation
    - Cross Validation
    - Classification Report

- Fine Tuning
    - Initialize Parameters
    - Random Tuning
    - Grid Tuning

- Final Evaluation
    - Fit & Predict
    - Evaluation

- Insights and Findings
    - Key Patterns
    - Anomalies
    - Business Insights
    - Recommendations

## Introduction

- Project Overview
- Objectives
- Dataset Background

### Project Overview

Every second, Mara Bank’s mobile money platform processes countless transactions across Nigeria: airtime top-ups, utility bill payments, and peer-to-peer transfers.

But hidden within this stream are fraudulent attempts: some subtle, others blatant. Fraudsters exploit timing gaps, customer behavior, and even system trust. Spotting them in real-time requires not just rules but predictive intelligence.

This project is about building a fraud prediction model that learns from historical transactions to detect and flag suspicious activity before it spreads.

### Project Objective

The key objectives are:

- Detect anomalies at scale: Use transaction history to identify deviations from normal user and network behavior.

- Develop predictive features: Incorporate transaction patterns, velocity, amounts, geolocation, and device data that highlight fraud signals.

- Train robust models: Experiment with machine learning algorithms (tree-based models, gradient boosting) to capture both simple and complex fraud patterns.

- Evaluate with precision: Prioritize recall and precision in performance metrics, because missing fraud is costly, but so is flagging too many genuine users.

- Enable real-time inference: Prepare the model for deployment so Mara Bank can flag or block fraudulent activities instantly as they occur.

The ultimate goal: predict fraudulent transactions with high accuracy, minimizing financial loss while maintaining customer trust.
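To make the precision/recall trade-off from the objectives concrete, here is a minimal sketch in plain Python. The labels, scores, and thresholds are made up for illustration and do not come from the Mara Bank data; `precision_recall` is a hypothetical helper, not part of this project's `lib` package.

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold are flagged as fraud."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for t, f in zip(y_true, flagged) if t and f)        # fraud, flagged
    fp = sum(1 for t, f in zip(y_true, flagged) if not t and f)    # genuine, flagged
    fn = sum(1 for t, f in zip(y_true, flagged) if t and not f)    # fraud, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (illustrative only): 1 = fraud, 0 = genuine
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]

strict = precision_recall(y_true, scores, threshold=0.75)  # few flags, some fraud missed
loose = precision_recall(y_true, scores, threshold=0.30)   # all fraud caught, more false alarms

print("strict threshold:", strict)
print("loose threshold:", loose)
```

A stricter threshold flags fewer genuine users but lets some fraud through, while a looser one catches more fraud at the cost of more false alarms.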

### Background of Dataset

The dataset was generated to mimic the different scenarios in which transactions can occur in Nigeria. It contains transactions from different banks; however, this project focuses on the transactions that belong to `Mara Bank`.

This dataset contains the following:

- `amount`: The value of the transaction.
- `balance`: The account balance after the transaction.
- `time`: The timestamp of the transaction.
- `holder`: The account number of the transaction's initiator or recipient.
- `kyc`: The KYC level of the account.
- `holder_bvn`: The BVN of the transaction's initiator or recipient.
- `holder_bank`: The bank of the account holder.
- `related`: The account number or entity related to the transaction (e.g., recipient account, ATM bank).
- `related_bvn`: The BVN of the related party.
- `related_bank`: The bank of the related party.
- `state`, `latitude`, `longitude`: Location details of the transaction.
- `status`: The outcome of the transaction (e.g., 'SUCCESS', 'FAILED').
- `type`: The transaction type (e.g., 'DEBIT', 'CREDIT').
- `category`: The specific class of transaction (e.g., 'OPENING', 'WITHDRAWAL', 'PAYMENT', 'TRANSFER', 'REVERSAL', 'BILL').
- `channel`: The channel used for the transaction (e.g., 'CARD', 'APP', 'USSD').
- `device`: The device used for the transaction (e.g., 'ATM-001', 'MOBILE-003').
- `nonce`: A unique identifier for related transactions.
- `reported`: Whether the transaction was reported.
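As a rough illustration of narrowing the data to one bank, the sketch below filters a toy frame on `holder_bank`. The rows and the exact bank labels are assumptions made for this example; the real file may encode bank names differently.

```python
import pandas as pd

# Toy rows; the bank labels are illustrative assumptions, not values from the real file
raw = pd.DataFrame({
    "amount": [1500.0, 250.0, 9000.0],
    "holder_bank": ["Mara Bank", "Other Bank", "Mara Bank"],
    "status": ["SUCCESS", "FAILED", "SUCCESS"],
})

# Keep only transactions whose holder account belongs to Mara Bank
mara = raw[raw["holder_bank"] == "Mara Bank"].reset_index(drop=True)
print(mara)
```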

## Data Importation

- Loading Required Libraries
- Reading Data Files
- Initial Data Preview

### Loading required libraries

In [28]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [29]:
# Import modules

import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import TimeSeriesSplit, cross_val_score, GridSearchCV, train_test_split, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor, XGBRFRegressor
from sklearn.preprocessing import RobustScaler
import joblib
from datetime import datetime

In [30]:
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [31]:
from lib import oracle, engineer, analyst

### Reading data files

In [32]:
# Load the transactions dataset for the project 
df = pd.read_csv('../datasets/classified_transactions.csv')

### Initial data preview

In [33]:
# Preview the dataset
df.head()

| | amount | balance | holder | holder_bvn | related | related_bvn | related_bank | state | latitude | longitude | ... | unsual_reported_score | unsual_reported | unsual_reversal_score | unsual_reversal | unsual_related_score | unsual_related | unsual_related_bvn_score | unsual_related_bvn | fraud_score | fraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.314803 | -0.541285 | -0.619772 | -0.791519 | 0.432938 | -0.293271 | -0.62069 | 0.0 | 0.452051 | 0.076179 | ... | 0.098439 | False | 0.094319 | False | 0.148919 | False | 0.146113 | False | 0.135535 | False |
| 1 | 0.613989 | -0.400131 | -0.61597 | -0.780919 | 0.432938 | -0.293271 | -0.62069 | -0.578947 | 0.693422 | -0.095731 | ... | 0.087991 | False | 0.090895 | False | 0.137967 | False | 0.145474 | False | 0.264673 | False |
| 2 | 1.001715 | -0.341206 | -0.608365 | -0.745583 | 0.432938 | -0.293271 | -0.62069 | 0.105263 | 1.00547 | 1.210116 | ... | 0.079467 | False | 0.082812 | False | 0.128382 | False | 0.128906 | False | 0.298879 | False |
| 3 | 0.05701 | -0.484778 | -0.604563 | -0.720848 | 0.432938 | -0.293271 | -0.62069 | 0.473684 | -0.555446 | 0.871618 | ... | 0.075602 | False | 0.082261 | False | 0.125092 | False | 0.125831 | False | 0.181967 | False |
| 4 | 0.67868 | -0.390299 | -0.60076 | -0.713781 | 0.432938 | -0.293271 | -0.62069 | 0.0 | 0.452051 | 0.076179 | ... | 0.075602 | False | 0.077004 | False | 0.125744 | False | 0.121731 | False | 0.274537 | False |


In [34]:
# The basic information about the dataset.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20107 entries, 0 to 20106
Columns: 213 entries, amount to fraud
dtypes: bool(11), float64(202)
memory usage: 31.2 MB


In [35]:
# The shape of the dataset
df.shape

(20107, 213)

## Model Training

In [36]:
SEED = 42

### Model Selection

We will use the following regression estimators as a starting point, then select the best performers and cross-validate them.

In [37]:
train_models = {
    'RandomForestRegressor': RandomForestRegressor(),
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(),
    'XGBRegressor': XGBRegressor(),
    'XGBRFRegressor': XGBRFRegressor()
}

In [38]:
X = df.drop(['fraud_score', 'fraud'], axis=1)
y = df['fraud_score']

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED, test_size=.2)

In [40]:
oracle.train_score_models(models=train_models, seed=SEED, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)

RandomForestRegressor scored: 0.9368503940974509
LinearRegression scored: 0.9584658573939011
DecisionTreeRegressor scored: 0.839615036341375
XGBRegressor scored: 0.9665930501379139
XGBRFRegressor scored: 0.8102676869921176
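The internals of `oracle.train_score_models` are not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch that fits each estimator and reports its R² on the held-out split. The function name `train_score_sketch` and the synthetic data are assumptions made for illustration only.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def train_score_sketch(models, X_train, y_train, X_test, y_test):
    """Fit a fresh copy of each estimator and return {name: R^2 on the test split}."""
    scores = {}
    for name, model in models.items():
        fitted = clone(model).fit(X_train, y_train)
        # For scikit-learn regressors, .score() returns the R^2 coefficient
        scores[name] = fitted.score(X_test, y_test)
    return scores


# Synthetic stand-in data, not the fraud dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_scores = train_score_sketch(
    {
        "LinearRegression": LinearRegression(),
        "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    },
    X_tr, y_tr, X_te, y_te,
)
for name, score in demo_scores.items():
    print(f"{name} scored: {score}")
```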


Out of the 5 estimators, we will continue to cross-validation with the 3 that scored highest.

These are `XGBRegressor`, `LinearRegression`, and `RandomForestRegressor`.

In [43]:
cross_val_models = {
    'RandomForestRegressor': RandomForestRegressor(),
    'XGBRegressor': XGBRegressor(),
    'LinearRegression': LinearRegression()
}

Let's score them using `r2` (R²), which measures how much of the variance in the actual values our predictions explain. The closer to 1, the better.

In [44]:
oracle.crossval_models(models=cross_val_models, scoring='r2', seed=SEED, X=X, y=y)

RandomForestRegressor mean: 0.9074955972207709, std: 0.011942954785290841
XGBRegressor mean: 0.9327441904891298, std: 0.021432275810894092
LinearRegression mean: -353.721520881571, std: 612.7089740500094


XGBRegressor is the best with respect to `r2` scoring.
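R² has no lower bound, which is how a mean as negative as the `LinearRegression` one above can occur: on some folds the predictions are far worse than simply predicting the mean. The sketch below computes R² by hand on made-up numbers to show the three regimes; `r2` here is an illustrative helper, not the scikit-learn function.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # error of our predictions
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # error of predicting the mean
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2(y_true, y_true)                     # predictions match exactly
baseline = r2(y_true, [2.5, 2.5, 2.5, 2.5])      # always predict the mean
wild = r2(y_true, [10.0, -5.0, 12.0, -8.0])      # far worse than the mean

print(perfect, baseline, wild)
```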

Let's score them using `neg_mean_squared_error`, the negated average of the squared prediction errors. Because the value is negated, the closer to 0, the better.

In [45]:
oracle.crossval_models(models=cross_val_models, scoring='neg_mean_squared_error', seed=SEED, X=X, y=y)

RandomForestRegressor mean: -0.0016946185077241862, std: 0.0001415692460052566
XGBRegressor mean: -0.0012262650507573953, std: 0.0003349046236382354
LinearRegression mean: -6.2282864085804235, std: 10.810561272185383


XGBRegressor is the best with respect to `neg_mean_squared_error` scoring.
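scikit-learn negates error metrics so that a higher score is always better, which is why every value above is negative. Here is a small sketch on synthetic data (not the fraud dataset) showing how to recover the plain MSE and RMSE from the negated scores.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=42)

neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)

mse = -neg_mse      # flip the sign to recover the plain MSE per fold
rmse = mse ** 0.5   # per-fold RMSE, in the same units as the target

print(f"mean MSE: {mse.mean()}, mean RMSE: {rmse.mean()}")
```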

Let's score them using `neg_root_mean_squared_error`, the negated square root of the average squared prediction error. Because the value is negated, the closer to 0, the better.

In [46]:
oracle.crossval_models(models=cross_val_models, scoring='neg_root_mean_squared_error', seed=SEED, X=X, y=y)

RandomForestRegressor mean: -0.04112978754919857, std: 0.001720198791412274
XGBRegressor mean: -0.0347237112814857, std: 0.004530885740936038
LinearRegression mean: -1.4448093050535487, std: 2.0348986413605727


XGBRegressor is the best with respect to `neg_root_mean_squared_error` scoring.

Let's score them using `neg_median_absolute_error`, the negated median of the absolute prediction errors (not squared). Because the value is negated, the closer to 0, the better.

In [47]:
oracle.crossval_models(models=cross_val_models, scoring='neg_median_absolute_error', seed=SEED, X=X, y=y)

RandomForestRegressor mean: -0.02162746670202809, std: 0.0015576164993869894
XGBRegressor mean: -0.016182087002567618, std: 0.0008581041581727493
LinearRegression mean: -0.27341604414262993, std: 0.5078555846234488


XGBRegressor is the best with respect to `neg_median_absolute_error` scoring.
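The median absolute error is much less sensitive to a few very bad predictions than RMSE, which may explain why `LinearRegression` looks less extreme here than under the squared-error metrics. The sketch below uses made-up numbers and hand-written helpers to show the difference.

```python
from statistics import median


def median_absolute_error(y_true, y_pred):
    """Median of the absolute differences between actual and predicted values."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean of the squared differences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return (sum(squared) / len(squared)) ** 0.5


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.5, 2.5, 3.5, 4.5, 15.0]  # the last prediction is badly off

medae = median_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)

print(f"median absolute error: {medae}, RMSE: {rmse}")
```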

Overall, `XGBRegressor` is the best option under every scoring metric.

Let's tune the hyperparameters to confirm this.

We will tune the `RandomForestRegressor` and `XGBRegressor` estimators.

## Parameter Tuning

Here are the parameter grids:

In [48]:
rf_params = {
    "n_estimators": [50, 100],             # fewer trees for speed
    "max_depth": [None, 10],               # shallow vs unlimited
    "min_samples_split": [2, 5],           # low vs higher split
    "min_samples_leaf": [1, 2],            # low vs higher leaf
    "max_features": ["sqrt"],              # keep it simple
    "bootstrap": [True],                   # avoid both True/False for speed
    "criterion": ["squared_error"]         # stick with the standard
}

# --- XGBRegressor (Reduced) ---
xgbr_params = {
    "n_estimators": [50, 100, 200],        # cut down boosting rounds
    "learning_rate": [0.05, 0.1],          # common values
    "max_depth": [3, 5],                   # shallow vs medium
    "min_child_weight": [1, 3],            # flexible
    "subsample": [0.8, 1.0],               # avoid too many values
    "colsample_bytree": [0.8],             # fixed for testing
    "gamma": [0, 0.1],                     # lightweight
    "reg_alpha": [0, 0.1],                 # light L1 regularization
    "reg_lambda": [1]                      # standard L2
}

Let's run a randomized search to find the best estimator from randomly sampled parameter combinations.

In [49]:
random_models = [
    ('RandomForestRegressor', RandomForestRegressor(), rf_params),
    ('XGBRegressor', XGBRegressor(), xgbr_params)
]

In [50]:
random_search = oracle.random_search(models=random_models, X=X, y=y, n_iter=10)

RandomForestRegressor best score: 0.9053916768116428 scored by r2
XGBRegressor best score: 0.9474562969276434 scored by r2
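The body of `oracle.random_search` is not shown here. One plausible shape for it is a thin wrapper around scikit-learn's `RandomizedSearchCV`, which samples `n_iter` parameter combinations instead of trying them all. The sketch below assumes that approach and uses a small synthetic dataset and a reduced grid so it runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data, not the fraud dataset
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=0.5, random_state=42)

# Deliberately tiny search space for illustration
param_space = {
    "n_estimators": [10, 20],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 2],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_space,
    n_iter=4,          # sample 4 of the 8 possible combinations
    scoring="r2",
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

print(f"best score: {search.best_score_} scored by r2")
print(f"best params: {search.best_params_}")
```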


In [51]:
grid_models = [
   ('XGBRegressor', XGBRegressor(), xgbr_params),
]
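Before launching an exhaustive search, it can help to count how many candidates the grid contains. The sketch below repeats the `xgbr_params` grid from above and counts its combinations with `ParameterGrid`. The 5-fold figure is an assumption based on `GridSearchCV`'s default; `oracle.grid_search` may use a different setting.

```python
from sklearn.model_selection import ParameterGrid

# Same values as xgbr_params above, repeated so this cell runs on its own
xgbr_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "min_child_weight": [1, 3],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8],
    "gamma": [0, 0.1],
    "reg_alpha": [0, 0.1],
    "reg_lambda": [1],
}

n_candidates = len(ParameterGrid(xgbr_grid))  # every combination of the listed values
n_fits = n_candidates * 5                     # assuming 5-fold cross-validation

print(f"{n_candidates} candidates, {n_fits} model fits")
```

Compared with the 10 sampled candidates of the randomized search, the full grid is considerably more expensive to run.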

In [None]:
grid_search = oracle.grid_search(models=grid_models, X=X, y=y)

Exception ignored on calling ctypes callback function: <bound method DataIter._next_wrapper of <xgboost.data.SingleBatchInternalIter object at 0x2afa2ed50>>
Traceback (most recent call last):
  File "/Users/kennedyikeka/Documents/workshop/money_hound/.venv/lib/python3.11/site-packages/xgboost/core.py", line 637, in _next_wrapper
    return self._handle_exception(lambda: self.next(input_data), 0)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kennedyikeka/Documents/workshop/money_hound/.venv/lib/python3.11/site-packages/xgboost/core.py", line 550, in _handle_exception
    return fn()
           ^^^^
  File "/Users/kennedyikeka/Documents/workshop/money_hound/.venv/lib/python3.11/site-packages/xgboost/core.py", line 637, in <lambda>
    return self._handle_exception(lambda: self.next(input_data), 0)
                                          ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kennedyikeka/Documents/workshop/money_hound/.venv/lib/python3.11/site-package

In [None]:
model = grid_search[0]['Best Estimator']

In [None]:
# Save the model

# model = XGBRegressor()
# model.fit(X_train, y_train)
# model.score(X_test, y_test)


timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
joblib.dump(model, f"../models/predict_fraud_score_model_{timestamp}")

['../models/predict_fraud_score_model_20250906_073426']
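A saved model can be restored later with `joblib.load` and used for prediction. The sketch below round-trips a small stand-in estimator through a temporary file; the file name and the `LinearRegression` stand-in are illustrative and are not the model saved above.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LinearRegression

# Stand-in estimator fitted on a trivial y = x relationship
stand_in = LinearRegression().fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(stand_in, path)      # serialize the fitted estimator
    restored = joblib.load(path)     # load it back as a new object

prediction = restored.predict([[3.0]])[0]
print(f"restored model predicts {prediction} for input 3.0")
```

Loading with `joblib` generally expects library versions compatible with those used when the model was saved, so it is worth recording them alongside the file.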