# Predicting Fraud

### On `Mobile money` transactions for `Mara Bank`

## Sections in this notebook.


- Introduction
    - Project Overview
    - Objectives
    - Dataset Background

- Data Importation
    - Loading Required Libraries
    - Reading Data Files
    - Initial Data Preview

- Data Normalization
    - Feature Encoding
    - Feature Normalization
    

- Model Training
    - Pick Models
    - Split Data
    - Training

- Initial Evaluation
    - Cross Validation
    - Classification Report

- Fine Tuning
    - Initialize Parameters
    - Random Tuning
    - Grid Tuning

- Final Evaluation
    - Fit & Predict
    - Evaluation

- Insights and Findings
    - Key Patterns
    - Anomalies
    - Business Insights
    - Recommendations

## Introduction

- Project Overview
- Objectives
- Dataset Background

### Project Overview

Every second, Mara Bank’s mobile money platform processes countless transactions flowing across Nigeria: airtime top-ups, utility bill payments, and peer-to-peer transfers.

But hidden within this stream are fraudulent attempts: some subtle, others blatant. Fraudsters exploit timing gaps, customer behavior, and even system trust. Spotting them in real-time requires not just rules but predictive intelligence.

This project is about building a fraud prediction model that learns from historical transactions to detect and flag suspicious activity before it spreads.

### Project Objective

The key objectives are:

- Detect anomalies at scale: Use transaction history to identify deviations from normal user and network behavior.

- Develop predictive features: Incorporate transaction patterns, velocity, amounts, geolocation, and device data that highlight fraud signals.

- Train robust models: Experiment with machine learning algorithms (tree-based models, gradient boosting) to capture both simple and complex fraud patterns.

- Evaluate with precision: Prioritize recall and precision in performance metrics — missing fraud is costly, but so is flagging too many genuine users.

- Enable real-time inference: Prepare the model for deployment so Mara Bank can flag or block fraudulent activities instantly as they occur.

The ultimate goal: predict fraudulent transactions with high accuracy, minimizing financial loss while maintaining customer trust.

### Background of Dataset

The dataset was generated to mimic the different scenarios in which transactions can occur in Nigeria. It contains transactions from different banks; however, this project focuses on the transactions that belong to `Mara Bank`.

This dataset contains the following:

- `amount`: The value of the transaction.
- `balance`: The account balance after the transaction.
- `time`: The timestamp of the transaction.
- `holder`: The account number of the transaction's initiator or recipient.
- `kyc`: The KYC level of the account.
- `holder_bvn`: The BVN of the transaction's initiator or recipient.
- `holder_bank`: The bank of the account holder.
- `related`: The account number or entity related to the transaction (e.g., recipient account, ATM bank).
- `related_bvn`: The BVN of the related party.
- `related_bank`: The bank of the related party.
- `state`, `latitude`, `longitude`: Location details of the transaction.
- `status`: The outcome of the transaction (e.g., 'SUCCESS', 'FAILED').
- `type`: The transaction type (e.g., 'DEBIT', 'CREDIT').
- `category`: The specific class of transaction (e.g., 'OPENING', 'WITHDRAWAL', 'PAYMENT', 'TRANSFER', 'REVERSAL', 'BILL').
- `channel`: The channel used for the transaction (e.g., 'CARD', 'APP', 'USSD').
- `device`: The device used for the transaction (e.g., 'ATM-001', 'MOBILE-003').
- `nonce`: A unique identifier for related transactions.
- `reported`: A flag marking whether the transaction was reported.
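Since the project focuses on Mara Bank, narrowing the data to that bank might look like the sketch below. It runs on a tiny made-up frame. The bank names, and the assumption that `holder_bank` stores the literal string `'Mara Bank'`, are illustrative and not taken from the actual dataset.

```python
import pandas as pd

# Toy rows for illustration only; the values are invented, not from the real dataset.
sample = pd.DataFrame({
    "amount": [1500.0, 250.0, 9800.0],
    "holder_bank": ["Mara Bank", "Other Bank", "Mara Bank"],
    "related_bank": ["Other Bank", "Mara Bank", "Mara Bank"],
})

# Keep only transactions whose account holder banks with Mara Bank.
# Assumption: the bank is stored as the exact string "Mara Bank".
mara_only = sample[sample["holder_bank"] == "Mara Bank"].reset_index(drop=True)
print(mara_only)
```

Whether transactions where Mara Bank is only the related party should also be kept is a modelling choice this sketch does not settle.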

## Data Importation

- Loading Required Libraries
- Reading Data Files
- Initial Data Preview

### Loading required libraries

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# Import modules

import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import TimeSeriesSplit, cross_val_score, GridSearchCV, train_test_split, RandomizedSearchCV
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor, XGBRFRegressor
from sklearn.preprocessing import RobustScaler
import joblib
from datetime import datetime

In [None]:
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [None]:
from src.lib.analytics import oracle, engineer, analyst

### Reading data files

In [None]:
# Load the transactions dataset for the project 
df = pd.read_csv('../datasets/classified_transactions.csv')

### Initial data preview

In [None]:
# Preview the dataset
df.head()

In [None]:
# The basic information about the dataset.
df.info()

In [None]:
# The shape of the dataset
df.shape

## Model Training

In [None]:
SEED = 42

### Model Selection

We will use the following regression estimators as a starting point, then select the best of them for cross-validation.

In [None]:
train_models = {
    'RandomForestRegressor': RandomForestRegressor(),
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(),
    'XGBRegressor': XGBRegressor(),
    'XGBRFRegressor': XGBRFRegressor()
}

In [None]:
X = df.drop(['fraud_score', 'fraud'], axis=1)
y = df['fraud_score']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=SEED, test_size=.2)

In [None]:
oracle.train_score_models(models=train_models, seed=SEED, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)
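The `oracle` helpers are project-local, so their internals are not shown in this notebook. As a rough idea of what a fit-and-score comparison like this might do, here is a minimal sketch on synthetic data. The function `train_and_score` is a hypothetical stand-in, and the real `oracle.train_score_models` may report different metrics.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def train_and_score(models, X_train, y_train, X_test, y_test):
    """Fit each estimator and return its R^2 on the held-out split.

    Hypothetical stand-in; not the project's oracle.train_score_models.
    """
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # For regressors, .score() returns the coefficient of determination (R^2).
        scores[name] = model.score(X_test, y_test)
    return scores


# Synthetic data stands in for the transactions dataset.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_scores = train_and_score(
    {
        "LinearRegression": LinearRegression(),
        "RandomForestRegressor": RandomForestRegressor(n_estimators=20, random_state=42),
    },
    Xtr, ytr, Xte, yte,
)

# Print the estimators from best to worst held-out R^2.
for name, score in sorted(demo_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R2 = {score:.3f}")
```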

Out of the 5 estimators, we will continue cross-validation with the top 3: XGBRegressor, XGBRFRegressor, and RandomForestRegressor.

In [None]:
cross_val_models = {
    'RandomForestRegressor': RandomForestRegressor(),
    'XGBRegressor': XGBRegressor(),
    'XGBRFRegressor': XGBRFRegressor()
}

Let's score them using `r2` (the coefficient of determination), which measures how much of the variance in the actual values our predictions explain. The closer to 1, the better.

In [None]:
oracle.crossval_models(models=cross_val_models, scoring='r2', seed=SEED, X=X, y=y)

XGBRegressor is the best with respect to `r2` scoring.
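To make the `r2` score concrete, the sketch below computes it by hand-checkable numbers using `sklearn.metrics.r2_score`. The values are invented for illustration: one set of predictions tracks the targets closely, the other always predicts the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented target values, standing in for fraud scores.
y_true = np.array([0.0, 0.2, 0.4, 0.6, 0.8])

# Predictions that track the actual values closely.
close_pred = np.array([0.0, 0.25, 0.35, 0.6, 0.8])
# A constant prediction at the mean of y_true explains none of the variance.
mean_pred = np.full_like(y_true, y_true.mean())

r2_close = r2_score(y_true, close_pred)  # 1 - 0.005 / 0.4 = 0.9875
r2_mean = r2_score(y_true, mean_pred)    # approximately 0
print(r2_close, r2_mean)
```

A model that does worse than predicting the mean gets a negative `r2`, so the score is not bounded below by 0.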

Let's score them using `neg_mean_squared_error`, the negated average of our squared prediction errors. Because scikit-learn negates the error, a score closer to 0 means a smaller error, so higher is better.

In [None]:
oracle.crossval_models(models=cross_val_models, scoring='neg_mean_squared_error', seed=SEED, X=X, y=y)

XGBRegressor is the best with respect to `neg_mean_squared_error` scoring.

Let's score them using `neg_root_mean_squared_error`, the negated square root of the average squared prediction error. As before, a score closer to 0 means a smaller error.

In [None]:
oracle.crossval_models(models=cross_val_models, scoring='neg_root_mean_squared_error', seed=SEED, X=X, y=y)

XGBRegressor is the best with respect to `neg_root_mean_squared_error` scoring.
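The `neg_` prefix on these scorers can be confusing, so here is a small sketch of how the values relate. It uses synthetic data and scikit-learn's `cross_val_score` directly rather than the project's `oracle.crossval_models`, whose internals are not shown here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data for illustration only.
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=42)

neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_squared_error"
)
neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)

# Flip the sign to recover the usual positive error values.
mse = -neg_mse
rmse = -neg_rmse
print("MSE per fold:", np.round(mse, 2))
print("RMSE per fold:", np.round(rmse, 2))
```

scikit-learn negates error metrics so that every scorer follows the same "greater is better" convention, which is what the search utilities rely on when picking a best estimator.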

Let's score them using `neg_median_absolute_error`, the negated median of the absolute (not squared) prediction errors. A score closer to 0 means a smaller error.

In [None]:
oracle.crossval_models(models=cross_val_models, scoring='neg_median_absolute_error', seed=SEED, X=X, y=y)

XGBRegressor is the best with respect to `neg_median_absolute_error` scoring.
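The median absolute error is worth contrasting with the mean squared error because it is far less sensitive to a few large misses. The sketch below uses invented numbers in which one prediction is badly off.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, median_absolute_error

# Invented values; the last prediction is a deliberately large miss.
y_true = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
y_pred = np.array([0.1, 0.25, 0.3, 0.3, 1.5])

# Absolute errors are roughly [0, 0.05, 0, 0.1, 1.0].
abs_errors = np.abs(y_true - y_pred)

medae = median_absolute_error(y_true, y_pred)  # median of abs_errors, about 0.05
mse = mean_squared_error(y_true, y_pred)       # dominated by the single large miss
print(medae, mse)
```

Here the median stays near 0.05 while the mean squared error is about 0.2, almost all of it from the one outlier.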

Overall, `XGBRegressor` is the best option across all four scoring metrics.

Let's tune the hyperparameters to confirm this.

We will tune the `RandomForestRegressor` and `XGBRegressor` estimators.

## Parameter Tuning

Here are the parameter grids:

In [None]:
rf_params = {
    "n_estimators": [50, 100],             # fewer trees for speed
    "max_depth": [None, 10],               # shallow vs unlimited
    "min_samples_split": [2, 5],           # low vs higher split
    "min_samples_leaf": [1, 2],            # low vs higher leaf
    "max_features": ["sqrt"],              # keep it simple
    "bootstrap": [True],                   # avoid both True/False for speed
    "criterion": ["squared_error"]         # stick with the standard
}

# --- XGBRegressor (Reduced) ---
xgbr_params = {
    "n_estimators": [50, 100, 200],        # cut down boosting rounds
    "learning_rate": [0.05, 0.1],          # common values
    "max_depth": [3, 5],                   # shallow vs medium
    "min_child_weight": [1, 3],            # flexible
    "subsample": [0.8, 1.0],               # avoid too many values
    "colsample_bytree": [0.8],             # fixed for testing
    "gamma": [0, 0.1],                     # lightweight
    "reg_alpha": [0, 0.1],                 # light L1 regularization
    "reg_lambda": [1]                      # standard L2
}
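To get a feel for how much work an exhaustive search over these grids involves, the number of combinations can be counted with scikit-learn's `ParameterGrid`. The sketch repeats the two grids so that it runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as above, repeated so this snippet is self-contained.
rf_params = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt"],
    "bootstrap": [True],
    "criterion": ["squared_error"],
}
xgbr_params = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
    "min_child_weight": [1, 3],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8],
    "gamma": [0, 0.1],
    "reg_alpha": [0, 0.1],
    "reg_lambda": [1],
}

# The grid size is the product of the number of options per parameter.
rf_grid_size = len(ParameterGrid(rf_params))     # 2*2*2*2 = 16
xgb_grid_size = len(ParameterGrid(xgbr_params))  # 3*2*2*2*2*2*2 = 192
print(rf_grid_size, xgb_grid_size)
```

Each combination is fitted once per cross-validation fold, so the total number of fits is the grid size multiplied by the number of folds.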

Let's run a randomized search, which samples `n_iter` parameter combinations from each grid, to find a good estimator quickly.

In [None]:
random_models = [
    ('RandomForestRegressor', RandomForestRegressor(), rf_params),
    ('XGBRegressor', XGBRegressor(), xgbr_params)
]

In [None]:
random_search = oracle.random_search(models=random_models, X=X, y=y, n_iter=10)

In [None]:
grid_search = oracle.grid_search(models=random_models, X=X, y=y)
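The difference between the two searches can be sketched with scikit-learn's `RandomizedSearchCV` and `GridSearchCV`, both imported earlier in the notebook. This is an illustration on synthetic data with a deliberately small grid. It assumes, without confirmation from the source, that the `oracle` helpers wrap these classes in a similar way.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data and a small illustrative grid (8 combinations in total).
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=1.0, random_state=42)
small_grid = {"n_estimators": [10, 20], "max_depth": [None, 5], "min_samples_leaf": [1, 2]}

# Randomized search: evaluates only n_iter sampled combinations.
random_cv = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=small_grid,
    n_iter=4,
    scoring="r2",
    cv=3,
    random_state=42,
)
random_cv.fit(X_demo, y_demo)

# Grid search: evaluates every combination in the grid.
grid_cv = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid=small_grid,
    scoring="r2",
    cv=3,
)
grid_cv.fit(X_demo, y_demo)

print(len(random_cv.cv_results_["params"]), random_cv.best_params_)
print(len(grid_cv.cv_results_["params"]), grid_cv.best_params_)

# The refitted best estimator is available after fitting.
best_model = grid_cv.best_estimator_
```

Randomized search trades completeness for speed, which matters more as the grid grows. The grid search is exhaustive, so on the same folds its best score cannot be lower than the randomized one.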

In [None]:
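# Caution: random_models lists RandomForestRegressor first, so confirm that
# oracle.grid_search returns results in the order these indices assume.
xgboost_model = grid_search[0]['Best Estimator']
rf_model = grid_search[1]['Best Estimator']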

In [None]:
# Create a dataframe to store the feature importances
feature_importances = pd.DataFrame(xgboost_model.feature_importances_, index=X_train.columns, columns=['importance']).sort_values('importance', ascending=False)

# Visualize the feature importances
feature_importances.plot.barh(figsize=(20, 30))
plt.title('XGBoost Feature Importances')

In [None]:
# Save the model
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
joblib.dump(xgboost_model, f"../models/predict_fraud_score_xgboost_{timestamp}")

In [None]:
# Create a dataframe to store the feature importances
feature_importances = pd.DataFrame(rf_model.feature_importances_, index=X_train.columns, columns=['importance']).sort_values('importance', ascending=False)

# Visualize the feature importances
feature_importances.plot.barh(figsize=(20, 30))
plt.title('Random Forest Feature Importances')

In [None]:
# Save the model
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
joblib.dump(rf_model, f"../models/predict_fraud_score_rf_{timestamp}")
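A saved model can later be restored with `joblib.load`. The round trip below is a minimal sketch using a small model trained on synthetic data and a temporary directory. The file name `demo_model.joblib` is illustrative and not one of the project's saved files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary location, then load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Loading should happen in an environment with library versions compatible with those used for saving, and only from files you trust, since joblib files are pickle-based.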