# Predicting financial fraud using XGBoost and Bayesian Optimization

This notebook demonstrates how to analyze a dataset and build an XGBoost model, tuned with Bayesian hyperparameter optimization, to predict fraud with a high level of accuracy.

## Contents

1. The data

2. What is the XGBoost model?

3. Parameters of the XGBoost model

4. Bayesian Hyperparameter tuning

5. Why I use XGBoost and Bayesian Hyperparameter tuning

6. Code

7. Bibliography

# The data

The data is a large (493.53 MB) .csv file from kaggle.com ([link](https://www.kaggle.com/code/arjunjoshua/predicting-fraud-in-financial-payment-services/input)).

Be sure to download it and put it in the same folder as this demo.

## Description

PaySim synthetic dataset of mobile money transactions. Each step represents an hour of simulation. This dataset is scaled down to 1/4 of the original dataset, which is presented in the paper "PaySim: A financial mobile money simulator for fraud detection".

## Shape

rows: 6,362,620

columns: 11

## Column names

### step
Maps a unit of time in the real world. In this case 1 step is 1 hour of time.

### type
Transaction type, either ```"CASH_IN"```, ```"CASH_OUT"```, ```"DEBIT"```, ```"PAYMENT"``` or ```"TRANSFER"```.

### amount
The amount of the transaction in local currency.

### nameOrig
The customer who started the transaction.

### oldbalanceOrg
The initial balance before the transaction.

### newbalanceOrig
The customer's balance after the transaction.

### nameDest
The recipient ID of the transaction.

### oldbalanceDest
The initial recipient balance before the transaction.

### newbalanceDest
The recipient's balance after the transaction.

### isFraud
Identifies a fraudulent transaction (1) and a non-fraudulent one (0).

### isFlaggedFraud
Flags illegal attempts to transfer more than 200.000 in a single transaction.

# What is the XGBoost model?

XGBoost stands for e**X**treme **G**radient **Boost**ing. It's a powerful machine learning algorithm used for classification, regression, and ranking tasks.

XGBoost builds upon 4 things:
 
## 1. Decision trees

Decision trees are used for classification and regression tasks. They're a series of yes/no (or if/else) questions about the features of the data set you're working with.

Here's a brief explanation:

To make a decision tree with a dataset, you do something called _recursive splitting_. For the root node of the tree, you look at the features and their values to find the best way to split the data into 2 homogeneous (or as homogeneous as possible) chunks. Then you recursively repeat that process on each chunk until some stopping criterion is met. Below are some examples of stopping criteria:

* Maximum depth: the maximum depth of the tree.
* Minimum sample size: the minimum number of observations allowed per node.
* Minimum impurity: the minimum heterogeneity allowed in a sample, measured in terms of Gini impurity or entropy.
* Maximum number of leaf nodes: the maximum number of leaf nodes allowed in the tree.
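
As a rough illustration of a single splitting step, the sketch below scores every candidate threshold on one feature by weighted Gini impurity and keeps the best one. The function names and the toy data are assumptions made for this example; real libraries search all features and apply the stopping criteria listed above.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best 'value <= threshold' split."""
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        # Weight each child's impurity by its share of the observations.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: transaction amounts and whether each one was fraud.
amounts = [10, 25, 40, 300, 450, 500]
is_fraud = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(amounts, is_fraud)
print(threshold, impurity)
```

Recursive splitting would call `best_split` again on each of the two resulting chunks until a stopping criterion is reached.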


## 2. Supervised machine learning

Supervised machine learning (SML) is a machine learning technique where the model is trained on a dataset whose observations are labeled with a known target value, such as a class. It's used for classification and regression tasks.

Unsupervised machine learning (UML) is used for discovering patterns in a dataset with no particular target variables in mind. After the patterns are discovered, it's up to the data scientist to label the discovered categories.

## 3. Ensemble learning

A "meta approach" to machine learning that combines predictions from several models to arrive at a more accurate prediction. XGBoost is an ensemble of [decision trees](https://scikit-learn.org/stable/modules/tree.html#:~:text=Decision%20Trees%20(DTs)%20are%20a,as%20a%20piecewise%20constant%20approximation.) trained with [gradient boosting](https://towardsdatascience.com/all-you-need-to-know-about-gradient-boosting-algorithm-part-1-regression-2520a34a502).

## 4. Gradient boosting

A classification or regression technique that combines many weak predictors, typically shallow decision trees, by adding them one at a time. On each iteration, a new weak predictor is fit to the errors of the current ensemble (more precisely, to the negative gradient of the loss function) and is then added to the ensemble to correct those errors. This repeats until some stopping criterion is met, like a maximum number of iterations (see the list of stopping criteria above too).
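
To make the iteration concrete, here is a minimal sketch of gradient boosting for squared-error regression, using one-split "stumps" as the weak predictors. It is a simplified illustration written for this notebook, not how XGBoost is implemented internally; the function names and toy data are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the one-split regression stump that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = sum(ys) / len(ys)  # start from a constant prediction
    stumps = []
    predictions = [base] * len(ys)
    history = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    model = lambda x: base + learning_rate * sum(s(x) for s in stumps)
    return model, history


# Hypothetical toy data with a jump between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
model, mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

The training error recorded in `mse_history` never increases from one round to the next, which is the "correct its errors" behavior described above. The `learning_rate` argument plays the same shrinkage role as the XGBoost parameter of that name.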

For more details, here's the [Nvidia glossary page for XGBoost](https://www.nvidia.com/en-us/glossary/data-science/xgboost/#:~:text=XGBoost%2C%20which%20stands%20for%20Extreme,%2C%20classification%2C%20and%20ranking%20problems.). 

# Parameters of the XGBoost model

Here are the hyperparameters I will be focusing on:

```learning_rate```

Controls the step size or shrinkage factor at each boosting iteration. A smaller learning rate can make the model more robust but may require more iterations to converge.

```n_estimators```

Specifies the number of boosting rounds or decision trees to build. Increasing this value can improve the model's performance but may also lead to overfitting if set too high.

```max_depth```

Sets the maximum depth of each decision tree in the ensemble. Deeper trees can capture more complex interactions but may increase the risk of overfitting.

```subsample```

Controls the fraction of training instances used for building each tree. Setting a value less than 1.0 introduces randomness and helps prevent overfitting.

```colsample_bytree```

Specifies the fraction of features or columns used for building each tree. Similar to subsample, it introduces randomness and can prevent overfitting.

```gamma```

Sets the minimum loss reduction required to split a node further. Higher values make the model more conservative, preventing overfitting, but may result in underfitting if set too high.

```lambda or reg_lambda```

Controls L2 regularization, also known as the Ridge regularization term. It helps prevent overfitting by adding a penalty term to the loss function.

```alpha or reg_alpha```

Controls L1 regularization, also known as the Lasso regularization term. It can be used to encourage sparse feature selection and reduce the impact of less important features.

```min_child_weight```

Specifies the minimum sum of instance weights required in a child node. If a split would produce a child below this threshold, the node is not split further. Higher values can help prevent overfitting by avoiding splits that isolate only a few instances.

```objective```

Defines the loss function to be optimized. It depends on the specific task, such as regression, classification, or ranking. XGBoost provides a range of predefined objectives to choose from.

_For more details on the parameters (too many to list here), check out the [documentation](https://xgboost.readthedocs.io/en/stable/parameter.html)._
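
As a quick reference, the parameters above can be collected in a plain dictionary. The values below are illustrative assumptions, not tuned results from this notebook.

```python
# Illustrative values only; they are assumptions, not tuned results.
xgb_params = {
    "learning_rate": 0.1,            # shrinkage applied to each new tree
    "n_estimators": 180,             # number of boosting rounds (trees)
    "max_depth": 6,                  # maximum depth of each tree
    "subsample": 0.8,                # fraction of rows sampled per tree
    "colsample_bytree": 0.8,         # fraction of columns sampled per tree
    "gamma": 1.0,                    # minimum loss reduction needed to split
    "reg_lambda": 1.0,               # L2 penalty
    "reg_alpha": 0.0,                # L1 penalty
    "min_child_weight": 1,           # minimum sum of instance weights in a child
    "objective": "binary:logistic",  # loss for binary classification
}

# Basic sanity checks on the ranges described above.
assert 0 < xgb_params["subsample"] <= 1
assert 0 < xgb_params["colsample_bytree"] <= 1
assert xgb_params["gamma"] >= 0
```

A dictionary like this can be unpacked into the scikit-learn style wrapper with `XGBClassifier(**xgb_params)`.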

# Bayesian Hyperparameter Tuning

Bayesian hyperparameter tuning (BHT) uses Bayesian inference to intelligently search the given hyperparameter space for an optimal set of hyperparameters.

Here's how it works:

1. Define the hyperparameter space by specifying the range or distribution of values for each hyperparameter you want to tune.
2. Choose the objective function: the score to optimize for a given set of hyperparameters, such as mean squared error, accuracy, or F1 score.
3. Build an initial surrogate model. This step often uses random forests, Gaussian processes, or Parzen estimators (as in hyperopt's TPE algorithm, which this notebook uses) to give a cheap estimate of the objective function.
4. Use an acquisition function to choose the next set of hyperparameters, balancing exploration (trying untested regions of the space) against exploitation (trying values near the best ones found so far).
5. Train and evaluate the model with the selected hyperparameters. This is usually done with validation data or cross validation.
6. Use the newly evaluated result to update the surrogate model.
7. Repeat steps 4-6 until some stopping criterion is met, like the number of iterations or some pre-defined convergence point.
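
The loop described in these steps can be sketched in plain Python. This is a deliberately simplified illustration: the surrogate is a kernel-weighted average standing in for a Gaussian process or TPE, the acquisition function is a lower confidence bound, and `validation_loss` is a hypothetical stand-in for training and scoring a real model. None of these names come from hyperopt.

```python
import math
import random


def validation_loss(x):
    """Hypothetical stand-in for 'train a model and score it'; minimum at x = 2."""
    return (x - 2.0) ** 2


def surrogate(x, observed, length_scale=1.0):
    """Cheap loss estimate at x, plus an uncertainty that shrinks near past trials."""
    prior_mean = sum(loss for _, loss in observed) / len(observed)
    weights = [math.exp(-((x - xi) / length_scale) ** 2) for xi, _ in observed]
    total = sum(weights)
    weighted = sum(w * loss for w, (_, loss) in zip(weights, observed))
    # Fall back toward the prior mean where there is little nearby data.
    mean = (weighted + 0.1 * prior_mean) / (total + 0.1)
    uncertainty = 1.0 / (1.0 + total)
    return mean, uncertainty


def acquisition(x, observed, kappa):
    """Lower confidence bound: favors low predicted loss or high uncertainty."""
    mean, uncertainty = surrogate(x, observed)
    return mean - kappa * uncertainty


def optimize(low=-10.0, high=10.0, n_initial=5, n_iterations=25, seed=3):
    rng = random.Random(seed)
    # Steps 1-3: define the range and evaluate a few random starting points.
    starts = [rng.uniform(low, high) for _ in range(n_initial)]
    observed = [(x, validation_loss(x)) for x in starts]
    # Step 7: stop after a fixed number of iterations.
    for _ in range(n_iterations):
        # Step 4: pick the candidate the acquisition function scores best.
        candidates = [rng.uniform(low, high) for _ in range(200)]
        losses = [loss for _, loss in observed]
        spread = max(losses) - min(losses)
        x_next = min(candidates, key=lambda x: acquisition(x, observed, kappa=spread))
        # Steps 5-6: evaluate the real objective and extend the history,
        # which changes the surrogate on the next pass.
        observed.append((x_next, validation_loss(x_next)))
    return min(observed, key=lambda pair: pair[1]), observed


(best_x, best_loss), history = optimize()
print(len(history), best_x, best_loss)
```

In practice a library such as hyperopt handles the surrogate and acquisition steps, so only the search space and the objective function need to be written by hand.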

# Why I use XGBoost and Bayesian Hyperparameter tuning

XGBoost is an excellent tool for classification tasks. It is widely used by machine learning engineers and data scientists for production-grade code. Bayesian Hyperparameter Tuning is a smart way to iteratively search for the optimal set of hyperparameters, often requiring fewer iterations than Grid Search Cross Validation or Random Search Cross Validation.

This is a strong combination.

# Code

In [None]:
# Import all necessary libraries.

# Install any missing packages first. pip skips packages that are already
# installed. Note that sklearn is published on PyPI as "scikit-learn".
%pip install seaborn xgboost scikit-learn hyperopt

import numpy as np
import pandas as pd
import seaborn as sns
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# The random state to be used for the whole program.
random_state = 3

You can download the dataset [here](https://www.kaggle.com/code/arjunjoshua/predicting-fraud-in-financial-payment-services/input).

In [4]:
df = pd.read_csv("PS_20174392719_1491204439457_log.csv")

In [5]:
# Big dataset!
# (6362620, 11)
df.shape

(6362620, 11)

In [6]:
# See if anything pops out with basic stats.
df.describe()

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 |
| mean | 243.3972 | 179861.9 | 833883.1 | 855113.7 | 1100702.0 | 1224996.0 | 0.00129082 | 2.514687e-06 |
| std | 142.332 | 603858.2 | 2888243.0 | 2924049.0 | 3399180.0 | 3674129.0 | 0.0359048 | 0.001585775 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 156.0 | 13389.57 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 239.0 | 74871.94 | 14208.0 | 0.0 | 132705.7 | 214661.4 | 0.0 | 0.0 |
| 75% | 335.0 | 208721.5 | 107315.2 | 144258.4 | 943036.7 | 1111909.0 | 0.0 | 0.0 |
| max | 743.0 | 92445520.0 | 59585040.0 | 49585040.0 | 356015900.0 | 356179300.0 | 1.0 | 1.0 |


In [7]:
# Look at data types.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
 #   Column          Dtype  
---  ------          -----  
 0   step            int64  
 1   type            object 
 2   amount          float64
 3   nameOrig        object 
 4   oldbalanceOrg   float64
 5   newbalanceOrig  float64
 6   nameDest        object 
 7   oldbalanceDest  float64
 8   newbalanceDest  float64
 9   isFraud         int64  
 10  isFlaggedFraud  int64  
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


## Data columns
### step
Maps a unit of time in the real world. In this case 1 step is 1 hour of time.

### type
Transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT or TRANSFER.

### amount
Amount of the transaction in local currency.

### nameOrig
Customer who started the transaction.

### oldbalanceOrg
Initial balance before the transaction.

### newbalanceOrig
Customer's balance after the transaction.

### nameDest
Recipient ID of the transaction.

### oldbalanceDest
Initial recipient balance before the transaction.

### newbalanceDest
Recipient's balance after the transaction.

### isFraud
Identifies a fraudulent transaction (1) and non fraudulent (0).

### isFlaggedFraud
Flags illegal attempts to transfer more than 200.000 in a single transaction.

In [11]:
# See if any columns should be converted to numerical values.
for column in df.columns:
    print(column, df.dtypes[column], df[column].unique()[:5])

step int64 [1 2 3 4 5]
type object ['PAYMENT' 'TRANSFER' 'CASH_OUT' 'DEBIT' 'CASH_IN']
amount float64 [ 9839.64  1864.28   181.   11668.14  7817.71]
nameOrig object ['C1231006815' 'C1666544295' 'C1305486145' 'C840083671' 'C2048537720']
oldbalanceOrg float64 [170136.  21249.    181.  41554.  53860.]
newbalanceOrig float64 [160296.36  19384.72      0.    29885.86  46042.29]
nameDest object ['M1979787155' 'M2044282225' 'C553264065' 'C38997010' 'M1230701703']
oldbalanceDest float64 [    0. 21182. 41898. 10845.  5083.]
newbalanceDest float64 [     0.    40348.79 157982.12  51513.44  16896.7 ]
isFraud int64 [0 1]
isFlaggedFraud int64 [0 1]


In [35]:
# Find out how much fraud occurs.
n_frauds = df[df["isFraud"] == 1].shape[0]
print(f'{n_frauds} frauds out of {df.shape[0]}')
print(f"{round(100 * n_frauds / df.shape[0], 3)} %")
# 8213 frauds out of 6362620
# 0.129 %

8213 frauds out of 6362620
0.129 %
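
With so few fraudulent rows, plain accuracy needs a baseline to be meaningful. The short calculation below, using only the counts printed above, shows the accuracy of a trivial "model" that labels every transaction as non-fraudulent. It also computes the ratio of negatives to positives, which is a common starting value for XGBoost's `scale_pos_weight` parameter on imbalanced data. That parameter is not used elsewhere in this notebook, so treat it as a suggestion to explore.

```python
# Counts taken from the cell output above.
n_total = 6_362_620
n_fraud = 8_213
n_legit = n_total - n_fraud

# Accuracy of a "model" that labels every transaction as non-fraudulent.
baseline_accuracy = n_legit / n_total
print(f"always-predict-0 accuracy: {baseline_accuracy:.3%}")

# Negatives per positive, a common starting point for scale_pos_weight.
imbalance_ratio = n_legit / n_fraud
print(f"negatives per positive   : {imbalance_ratio:.1f}")
```

Any accuracy figure reported later is best read against this baseline of roughly 99.87%.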


In [36]:
# Find out the precision of the isFlaggedFraud flag:
# of the transactions it flags, what share is actually fraud?

positive = df["isFlaggedFraud"] == 1
positives = df[positive].shape[0]

true_positive = positive & (df["isFraud"] == 1)
true_positives = df[true_positive].shape[0]

false_positives = positives - true_positives

print(f"positives      : {positives}")
print(f"true positives : {true_positives}")
print(f"false positives: {false_positives}")

precision = 100 * true_positives / positives

print(f"precision      : {round(precision, 3)}%")

positives      : 16
true positives : 16
false positives: 0
precision      : 100.0%


In [37]:
# Find out the negative predictive value of the isFlaggedFraud flag:
# of the transactions it does not flag, what share is actually legitimate?

negative = df["isFlaggedFraud"] == 0
negatives = df[negative].shape[0]

true_negative = negative & (df["isFraud"] == 0)
true_negatives = df[true_negative].shape[0]

false_negatives = negatives - true_negatives

print(f"negatives                : {negatives}")
print(f"true negatives           : {true_negatives}")
print(f"false negatives          : {false_negatives}")

negative_predictive_value = 100 * true_negatives / negatives

print(f"negative predictive value: {round(negative_predictive_value, 3)}%")
print(f"false omission rate      : {round(100 - negative_predictive_value, 3)}%")

negatives                : 6362604
true negatives           : 6354407
false negatives          : 8197
negative predictive value: 99.871%
false omission rate      : 0.129%
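
The counts printed by the two cells above fill in a complete confusion matrix for `isFlaggedFraud`, so the standard metrics can be computed side by side. This sketch uses only those printed counts; the variable names are chosen for this example.

```python
# Counts taken from the two cell outputs above (isFlaggedFraud vs. isFraud).
tp = 16          # flagged and fraudulent
fp = 0           # flagged but legitimate
fn = 8_197       # not flagged but fraudulent
tn = 6_354_407   # not flagged and legitimate

precision = tp / (tp + fp)    # share of flagged transactions that are fraud
recall = tp / (tp + fn)       # share of frauds that get flagged (true positive rate)
specificity = tn / (tn + fp)  # share of legitimate transactions left unflagged
npv = tn / (tn + fn)          # share of unflagged transactions that are legitimate

print(f"precision  : {precision:.3%}")
print(f"recall     : {recall:.3%}")
print(f"specificity: {specificity:.3%}")
print(f"NPV        : {npv:.3%}")
```

The flag is never wrong when it fires, but its recall is well under 1%, so it misses almost all fraud. That gap is what the model in the rest of this notebook tries to close.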


In [38]:
# Make prediction features and target feature.

X = df[df.columns.drop(["isFraud", "isFlaggedFraud"])].select_dtypes(
    include=["int64", "float64"]
)

y = df["isFraud"]

In [39]:
X.sample(5)

| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest |
|---|---|---|---|---|---|---|
| 617011 | 34 | 1063.07 | 117577.1 | 116514.03 | 0.0 | 0.0 |
| 4036091 | 299 | 38801.52 | 5963988.18 | 6002789.7 | 608714.36 | 569912.85 |
| 3126108 | 236 | 144875.26 | 21449.0 | 0.0 | 0.0 | 144875.26 |
| 5887791 | 403 | 178560.07 | 102622.0 | 0.0 | 1249609.75 | 1428169.82 |
| 1758918 | 161 | 246855.58 | 78475.0 | 325330.58 | 3035577.2 | 2788721.61 |


In [40]:
# Split the data into training (75%) and test (25%) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=.25,
    random_state=random_state
)

In [42]:
space = {
    'max_depth': hp.quniform("max_depth", 3, 18, 1),
    'gamma': hp.uniform ('gamma', 1,9),
    'reg_alpha' : hp.quniform('reg_alpha', 40,180,1),
    'reg_lambda' : hp.uniform('reg_lambda', 0,1),
    'colsample_bytree' : hp.uniform('colsample_bytree', 0.5,1),
    'min_child_weight' : hp.quniform('min_child_weight', 0, 10, 1),
    'n_estimators': 180,
    'seed': 0
    }

In [43]:
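def objective(space):
    # eval_metric and early_stopping_rounds are constructor arguments in
    # XGBoost >= 1.6; passing them to fit() was removed in XGBoost 2.0.
    clf = XGBClassifier(
        n_estimators=space['n_estimators'],
        max_depth=int(space['max_depth']),
        gamma=space['gamma'],
        reg_alpha=int(space['reg_alpha']),
        reg_lambda=space['reg_lambda'],
        min_child_weight=int(space['min_child_weight']),
        # Keep this a float: int() would truncate values in [0.5, 1) to 0.
        colsample_bytree=space['colsample_bytree'],
        random_state=space['seed'],
        eval_metric="auc",
        early_stopping_rounds=10
    )

    evaluation = [(X_train, y_train), (X_test, y_test)]

    clf.fit(
        X_train,
        y_train,
        eval_set=evaluation,
        verbose=False
    )

    # predict() already returns class labels (0 or 1), so no threshold is needed.
    pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    print("SCORE:", accuracy)
    # hyperopt minimizes the loss, so return the negated accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}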

In [None]:
# Run the search. This takes at least 180 minutes.
trials = Trials()

best_hyperparams = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=50,
    trials=trials
)

In [36]:
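# fmin returns a dict of floats: cast max_depth to int, then unpack as keyword arguments.
xgb_optimized = XGBClassifier(n_estimators=180, **{**best_hyperparams, "max_depth": int(best_hyperparams["max_depth"])})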
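
The dictionary returned by `fmin` holds only the tuned values, and `hp.quniform` yields floats such as `7.0`. A small helper can merge the tuned values with the fixed settings and cast the integer-valued ones before the final model is built. The helper name and the example values below are hypothetical; they are not results from this notebook.

```python
def finalize_params(best, fixed, integer_names):
    """Merge tuned and fixed settings, casting integer-valued parameters to int."""
    params = {**fixed, **best}
    for name in integer_names:
        if name in params:
            params[name] = int(params[name])
    return params


# Hypothetical example of what fmin might return (not real results).
example_best = {
    "max_depth": 7.0,
    "gamma": 2.5,
    "reg_alpha": 41.0,
    "reg_lambda": 0.3,
    "colsample_bytree": 0.75,
    "min_child_weight": 4.0,
}

final_params = finalize_params(
    example_best,
    fixed={"n_estimators": 180},
    integer_names=("max_depth", "min_child_weight"),
)
print(final_params)
```

The result can then be unpacked with `XGBClassifier(**final_params)` and fit on the training data.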

# Bibliography

[Data](https://www.kaggle.com/code/arjunjoshua/predicting-fraud-in-financial-payment-services/input)

[Gradient boosting](https://towardsdatascience.com/all-you-need-to-know-about-gradient-boosting-algorithm-part-1-regression-2520a34a502)

[Decision trees](https://scikit-learn.org/stable/modules/tree.html#:~:text=Decision%20Trees%20(DTs)%20are%20a,as%20a%20piecewise%20constant%20approximation.)

[Nvidia glossary page for XGBoost](https://www.nvidia.com/en-us/glossary/data-science/xgboost/#:~:text=XGBoost%2C%20which%20stands%20for%20Extreme,%2C%20classification%2C%20and%20ranking%20problems.)

[XGBoost documentation](https://xgboost.readthedocs.io/en/stable/parameter.html)