# Detecting Fraud in Bank Transactions

Creating a predictive system to detect fraudulent transactions in bank datasets.

## 1. Importing the libraries

In [22]:
# Basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

# Data pre-processing packages
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Machine learning packages
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from xgboost import XGBClassifier


## 2. Creating fake data

## 3. Preprocessing

Data normalization, outliers, missing values, feature selection.

## 4. Loading data

In [2]:
#df = pd.read_csv('data/transferencias.csv')
df = pd.read_csv('data/bank_transactions.csv')
df

Unnamed: 0,Timestamp,country,city,district,postal_code,ip_address,day,hour,minute,operating_system,...,android,ios,purchases,browsing_history,relationship,security_index,transaction_time,credit_limit,balance_history,Target
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


## 5. Exploratory Data Analysis

First, let's check for missing values in the dataset.

In [3]:
# Checking for missing values
print(df.isna().sum())

Timestamp            0
country              0
city                 0
district             0
postal_code          0
ip_address           0
day                  0
hour                 0
minute               0
operating_system     0
amount               0
background           0
complaints           0
transaction_count    0
credit               0
global_limit         0
credit_type          0
merchant             0
accounts             0
loans                0
browser              0
android              0
ios                  0
purchases            0
browsing_history     0
relationship         0
security_index       0
transaction_time     0
credit_limit         0
balance_history      0
Target               0
dtype: int64


Great! There are no missing values in the dataset.

Now, let's check the distribution of the target variable.

In [4]:
display(df['Target'].value_counts())
px.bar(df['Target'].value_counts())

Target
0    284315
1       492
Name: count, dtype: int64

In [5]:
(len(df[df['Target'] == 1]) / len(df['Target'])) * 100

0.1727485630620034

As we can see, the dataset is extremely imbalanced: only 0.17% of the transactions are fraudulent.

Later on, we will need to resample the data to balance the classes.

We also need to know how each variable correlates with the target variable:

In [6]:
df.corr()['Target'].sort_values(ascending=False)

Target               1.000000
background           0.154876
postal_code          0.133447
city                 0.091289
android              0.040413
loans                0.034783
browser              0.020090
minute               0.019875
transaction_time     0.017580
credit_limit         0.009536
balance_history      0.005632
security_index       0.004455
relationship         0.003308
ios                  0.000805
purchases           -0.002685
global_limit        -0.004223
transaction_count   -0.004570
browsing_history    -0.007221
Timestamp           -0.012323
day                 -0.043643
ip_address          -0.094974
operating_system    -0.097733
country             -0.101347
accounts            -0.111485
hour                -0.187257
district            -0.192961
credit_type         -0.196539
amount              -0.216883
complaints          -0.260593
credit              -0.302544
merchant            -0.326481
Name: Target, dtype: float64

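Notice that "background" and "postal_code" have the strongest positive correlation with the target variable, while the strongest correlations overall are negative: "merchant" (-0.33), "credit" (-0.30), and "complaints" (-0.26).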
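Since a strong negative correlation is as informative as a strong positive one, it can help to rank features by absolute value. The following is a minimal sketch on toy data; the frame `toy` and its column names are illustrative stand-ins for the notebook's `df`, not part of the original dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the notebook's `df` (column names are hypothetical).
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=500)
toy = pd.DataFrame({
    "strong_negative": -2.0 * target + rng.normal(size=500),
    "weak_positive": 0.3 * target + rng.normal(size=500),
    "noise": rng.normal(size=500),
    "Target": target,
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_target = toy.corr()["Target"].drop("Target")

# Reorder by absolute value so strong negative correlations are not buried
# at the bottom of the list; the signed values are preserved.
ranked = corr_with_target.reindex(
    corr_with_target.abs().sort_values(ascending=False).index
)
print(ranked)
```

Applied to the real data, the same pattern would be `df.corr()['Target'].drop('Target')` followed by the reindexing step.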

Let's take a broader look at how the features correlate with one another.

In [8]:
df.corr().style.background_gradient(cmap='coolwarm')

Unnamed: 0,Timestamp,country,city,district,postal_code,ip_address,day,hour,minute,operating_system,amount,background,complaints,transaction_count,credit,global_limit,credit_type,merchant,accounts,loans,browser,android,ios,purchases,browsing_history,relationship,security_index,transaction_time,credit_limit,balance_history,Target
Timestamp,1.0,0.117396,-0.010593,-0.419618,-0.10526,0.173072,-0.063016,0.084714,-0.036949,-0.00866,0.030617,-0.247689,0.124348,-0.065902,-0.098757,-0.183453,0.011903,-0.073297,0.090438,0.028975,-0.050866,0.044736,0.144059,0.051142,-0.016182,-0.233083,-0.041407,-0.005135,-0.009413,-0.010596,-0.012323
country,0.117396,1.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.227709,-0.101347
city,-0.010593,0.0,1.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.531409,0.091289
district,-0.419618,-0.0,0.0,1.0,0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.21088,-0.192961
postal_code,-0.10526,-0.0,-0.0,0.0,1.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.098732,0.133447
ip_address,0.173072,0.0,0.0,-0.0,-0.0,1.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,-0.386356,-0.094974
day,-0.063016,-0.0,0.0,0.0,-0.0,0.0,1.0,0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,0.215981,-0.043643
hour,0.084714,-0.0,0.0,0.0,-0.0,0.0,0.0,1.0,0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.397311,-0.187257
minute,-0.036949,-0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,1.0,0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.103079,0.019875
operating_system,-0.00866,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.0,0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.044246,-0.097733


This overview helps detect interdependencies between variables. In the portion of the matrix shown, most features are essentially uncorrelated with one another (off-diagonal values near zero), so multicollinearity is unlikely to be a problem. The exceptions are Timestamp and balance_history, which correlate moderately with several features (for example, city and balance_history at -0.53).
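Reading a large correlation matrix by eye is error-prone, so one option is to list only the feature pairs above a chosen threshold. This is a sketch under assumptions: the helper `correlated_pairs`, the 0.5 threshold, and the toy frame are all hypothetical and chosen for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: "b" is nearly a copy of "a", "c" is independent of both.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=300),
    "c": rng.normal(size=300),
})

pairs = correlated_pairs(toy)
print(pairs)
```

On the notebook's data, `correlated_pairs(df.drop(columns='Target'))` would surface pairs such as the ones involving balance_history.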

## 6. Preparing the data for the models

As we saw in the exploratory analysis, the dataset is imbalanced, with only 0.17% of the transactions being fraudulent. This is a common issue in fraud detection, and it requires special attention when building the model.

To address this issue, we can oversample the minority class, undersample the majority class, or combine both. In this notebook, we will use oversampling, which keeps every legitimate transaction instead of discarding most of them.

We will use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm to oversample the minority class. SMOTE works by creating synthetic examples of the minority class by interpolating between existing examples. This can help to balance the dataset and improve the performance of the model.
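To make the interpolation idea concrete, here is a simplified NumPy sketch of how a SMOTE-style synthetic point can be generated. It is not the `imblearn` implementation: the function `smote_like_samples` and its parameters are hypothetical, it uses brute-force neighbour search, and it assumes `k` is smaller than the number of minority points.

```python
import numpy as np

def smote_like_samples(minority: np.ndarray, n_new: int, k: int = 3, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority points.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of each minority point.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its neighbours for each new sample.
    base_idx = rng.integers(0, len(minority), size=n_new)
    neigh_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Place the new point somewhere on the segment between the two.
    lam = rng.random((n_new, 1))
    return minority[base_idx] + lam * (minority[neigh_idx] - minority[base_idx])

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=5)
print(synthetic)
```

Every synthetic point lies on a line segment between two existing minority points, which is why the new samples stay inside the region the minority class already occupies.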

We will also split the dataset into training and testing sets, using 70% of the data for training and 30% for testing.

### Splitting data into train and test
Setting our explanatory variables:

In [9]:
X = df.drop(['Target'], axis=1)
X

Unnamed: 0,Timestamp,country,city,district,postal_code,ip_address,day,hour,minute,operating_system,...,browser,android,ios,purchases,browsing_history,relationship,security_index,transaction_time,credit_limit,balance_history
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.524980,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.208038,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,1.475829,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.059616,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.001396,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.127434,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00


Setting our response variable:

In [10]:
y = df['Target']

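Running the resampling (oversampling) method. Note that SMOTE is applied here to the full dataset, before the train/test split, so the test set will also contain synthetic samples: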

In [11]:
smt = SMOTE()
X, y = smt.fit_resample(X, y)

Now let's check the results of resampling.

In [12]:
px.bar(y.value_counts(), color=y.value_counts().index, labels={"value": "Count", "index": "Class"})

Splitting into train and test sets:

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
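
When resampling happens before the split, synthetic minority samples end up in the test set, and evaluation on it can look more optimistic than performance on real transactions. A common alternative is to split first (stratified, so both sets keep the original class ratio) and oversample the training set only. The sketch below illustrates that order on toy data; `X_toy` and `y_toy` are hypothetical, and random duplication via `sklearn.utils.resample` is used as a simple stand-in for SMOTE.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (hypothetical): 1000 rows, 2% minority class.
rng = np.random.default_rng(7)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

# 1) Split first, stratified so both sets keep the original class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=7
)

# 2) Oversample the minority class in the training set only.
#    Random duplication is used here; SMOTE().fit_resample(X_tr, y_tr)
#    would take this step's place in the notebook.
minority_mask = y_tr == 1
n_majority = int((~minority_mask).sum())
X_min_up = resample(
    X_tr[minority_mask], replace=True, n_samples=n_majority, random_state=7
)
X_tr_bal = np.vstack([X_tr[~minority_mask], X_min_up])
y_tr_bal = np.concatenate([
    np.zeros(n_majority, dtype=int),
    np.ones(len(X_min_up), dtype=int),
])

# The training set is balanced; the test set keeps the real-world imbalance.
print("train class counts:", np.bincount(y_tr_bal))
print("test class counts: ", np.bincount(y_te))
```

With this order, metrics computed on `X_te` reflect the class proportions the model would face in production.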

## 7. Creating predictive models for fraud detection in bank transactions

Now we will create three machine learning models to predict bank transaction fraud:

1. XGBoost
2. LightGBM
3. Random Forest

### 7.1 XGBoost

Building the model

In [14]:
model = XGBClassifier()

Training a model to detect fraud in bank transactions (can take a while)

In [15]:
model = model.fit(X_train, y_train)
model

parameter,value
objective,'binary:logistic'
base_score,
booster,
callbacks,
colsample_bylevel,
colsample_bynode,
colsample_bytree,
device,
early_stopping_rounds,
enable_categorical,False


Using the test features to generate predictions:

In [None]:
y_predict = model.predict(X_test)

In [None]:
y_predict

array([1, 1, 1, ..., 1, 1, 1], shape=(170589,))

Comparing the model's predictions with the true labels.

Let's create a dataframe that places the expected answers (the "template") next to our model's predictions.

In [19]:
template = pd.DataFrame({'template': y_test, 'predictions': y_predict})
template

Unnamed: 0,template,predictions
338226,1,1
318644,1,1
431631,1,1
381828,1,1
466931,1,1
...,...,...
316720,1,1
171658,0,0
530325,1,1
450324,1,1


At first sight, it looks like our model did a good job!

Using some metrics to evaluate our model:

In [25]:
print(f'Accuracy: \n{accuracy_score(y_test, y_predict)}')

Accuracy: 
0.9998241387193781


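In terms of accuracy, the model scores very high. Keep in mind, however, that the test set was drawn after SMOTE, so it is balanced and includes synthetic fraud samples; this figure is therefore not directly comparable to performance on the original imbalanced data. For per-class metrics, let's check the classification report.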
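
As a reference point for why accuracy alone says little on imbalanced data, the short calculation below uses the class counts reported by `value_counts()` on the original dataset to score a trivial "model" that labels every transaction as legitimate. The variable names are illustrative.

```python
# Class counts reported by value_counts() on the original dataset.
n_legit, n_fraud = 284315, 492

# A trivial "model" that labels every transaction as legitimate.
baseline_accuracy = n_legit / (n_legit + n_fraud)
baseline_fraud_recall = 0 / n_fraud  # it never catches a single fraud

print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"baseline fraud recall: {baseline_fraud_recall:.0%}")
```

Such a baseline is right more than 99.8% of the time while detecting no fraud at all, which is why precision and recall for the fraud class matter more than overall accuracy.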

In [26]:
print(f'Classification metrics: \n{classification_report(y_test, y_predict)}')

Classification metrics: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85252
           1       1.00      1.00      1.00     85337

    accuracy                           1.00    170589
   macro avg       1.00      1.00      1.00    170589
weighted avg       1.00      1.00      1.00    170589



The classification report confirms this: precision, recall, and F1-score all round to 1.00 for both classes.

Finally, we can compute a confusion matrix to see the counts of true and false positives and negatives.

In [27]:
print(f'Confusion matrix: \n{confusion_matrix(y_test, y_predict)}')

Confusion matrix: 
[[85222    30]
 [    0 85337]]


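The main diagonal of the confusion matrix shows the number of correct predictions, and the off-diagonal elements are the incorrect ones. Rows are the true classes and columns are the predicted classes, so the model made 30 false positives (legitimate transactions flagged as fraud) and no false negatives (missed frauds).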
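
The rates behind the matrix can be derived directly from its four cells. The sketch below re-enters the matrix printed above by hand and follows scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[85222, 30],
               [0, 85337]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)            # share of fraud alerts that were real frauds
recall = tp / (tp + fn)               # share of frauds that were caught
false_positive_rate = fp / (fp + tn)  # share of legitimate transactions flagged
accuracy = (tp + tn) / cm.sum()

print(f"precision:           {precision:.6f}")
print(f"recall:              {recall:.6f}")
print(f"false positive rate: {false_positive_rate:.6f}")
print(f"accuracy:            {accuracy:.6f}")
```

The accuracy computed this way matches the value returned by `accuracy_score` earlier, which is a quick consistency check on the matrix.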