# 🏦 Detecting Fraud in Bank Transactions

This notebook demonstrates the development of a predictive system for detecting fraudulent bank transactions using machine learning techniques.
The project aims to simulate a realistic fraud detection workflow, from data generation and preprocessing to model evaluation and interpretation.

# 1. Objective

The main goal of this project is to build and evaluate a machine learning model capable of identifying potentially fraudulent transactions.
Through exploratory data analysis (EDA), feature engineering, and supervised learning, we aim to detect patterns that distinguish frauds from legitimate operations.

# 2. Importing the libraries

Importing the main Python libraries for data manipulation, visualization, and machine learning.

In [45]:
# Basic packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

# Data pre-processing packages
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Machine learning packages
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from xgboost import XGBClassifier


# 3. Creating fake data

Since no real dataset is available, we create a synthetic dataset that mimics real-world bank transactions.
The dataset includes demographic, financial, and behavioral attributes, along with a binary target variable (1 = fraud, 0 = legitimate).
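
The notebook does not show how its synthetic data was produced. As a minimal sketch of one way to generate an imbalanced binary-classification dataset, the example below uses scikit-learn's `make_classification`. The sample size, feature count, class weights, and column names are illustrative assumptions, not the settings behind `bank_transactions.csv`.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Hypothetical generator settings; the notebook does not show how its data was created.
X_syn, y_syn = make_classification(
    n_samples=20_000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    weights=[0.995, 0.005],  # roughly 0.5% positives, chosen for illustration
    flip_y=0,                # no label noise, so the class ratio stays close to `weights`
    random_state=42,
)

# Wrap the arrays in a DataFrame with placeholder column names.
synthetic = pd.DataFrame(X_syn, columns=[f"feature_{i}" for i in range(10)])
synthetic["Target"] = y_syn
print(synthetic["Target"].value_counts())
```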

# 4. Preprocessing

Data preprocessing includes:

- Handling missing values
- Removing outliers
- Normalizing numeric features
- Encoding categorical variables
- Selecting the most relevant features

These steps ensure data quality and improve model performance.
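
Some of the steps listed above (imputation, scaling, and categorical encoding) can be sketched with a scikit-learn `ColumnTransformer`. The toy frame and its column names are hypothetical, and outlier removal and feature selection are left out; this is an illustration under those assumptions rather than the preprocessing actually applied to the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical columns, used only to exercise the pipeline.
raw = pd.DataFrame({
    "amount": [10.0, 250.0, np.nan, 40.0],
    "hour": [1, 13, 22, 9],
    "browser": ["chrome", "firefox", "chrome", np.nan],
})

numeric_cols = ["amount", "hour"]
categorical_cols = ["browser"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing numbers
    ("scale", StandardScaler()),                   # zero mean, unit variance
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),  # fill missing categories
    ("encode", OneHotEncoder(handle_unknown="ignore")),   # one column per category
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

prepared = preprocess.fit_transform(raw)
print(prepared.shape)
```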

Once the data is prepared, we load it into a pandas DataFrame for further exploration.

# 5. Loading the Data

In [2]:
#df = pd.read_csv('data/transferencias.csv')
df = pd.read_csv('data/bank_transactions.csv')
df

Unnamed: 0,Timestamp,country,city,district,postal_code,ip_address,day,hour,minute,operating_system,...,android,ios,purchases,browsing_history,relationship,security_index,transaction_time,credit_limit,balance_history,Target
0,0.00,-1.36,-0.07,2.54,1.38,-0.34,0.46,0.24,0.10,0.36,...,-0.02,0.28,-0.11,0.07,0.13,-0.19,0.13,-0.02,149.62,0
1,0.00,1.19,0.27,0.17,0.45,0.06,-0.08,-0.08,0.09,-0.26,...,-0.23,-0.64,0.10,-0.34,0.17,0.13,-0.01,0.01,2.69,0
2,1.00,-1.36,-1.34,1.77,0.38,-0.50,1.80,0.79,0.25,-1.51,...,0.25,0.77,0.91,-0.69,-0.33,-0.14,-0.06,-0.06,378.66,0
3,1.00,-0.97,-0.19,1.79,-0.86,-0.01,1.25,0.24,0.38,-1.39,...,-0.11,0.01,-0.19,-1.18,0.65,-0.22,0.06,0.06,123.50,0
4,2.00,-1.16,0.88,1.55,0.40,-0.41,0.10,0.59,-0.27,0.82,...,-0.01,0.80,-0.14,0.14,-0.21,0.50,0.22,0.22,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.00,-11.88,10.07,-9.83,-2.07,-5.36,-2.61,-4.92,7.31,1.91,...,0.21,0.11,1.01,-0.51,1.44,0.25,0.94,0.82,0.77,0
284803,172787.00,-0.73,-0.06,2.04,-0.74,0.87,1.06,0.02,0.29,0.58,...,0.21,0.92,0.01,-1.02,-0.61,-0.40,0.07,-0.05,24.79,0
284804,172788.00,1.92,-0.30,-3.25,-0.56,2.63,3.03,-0.30,0.71,0.43,...,0.23,0.58,-0.04,0.64,0.27,-0.09,0.00,-0.03,67.88,0
284805,172788.00,-0.24,0.53,0.70,0.69,-0.38,0.62,-0.69,0.68,0.39,...,0.27,0.80,-0.16,0.12,-0.57,0.55,0.11,0.10,10.00,0


# 6. Exploratory Data Analysis (EDA)

First, let's check for missing values in the dataset.

In [3]:
# Checking for missing values
print(df.isna().sum())

Timestamp            0
country              0
city                 0
district             0
postal_code          0
ip_address           0
day                  0
hour                 0
minute               0
operating_system     0
amount               0
background           0
complaints           0
transaction_count    0
credit               0
global_limit         0
credit_type          0
merchant             0
accounts             0
loans                0
browser              0
android              0
ios                  0
purchases            0
browsing_history     0
relationship         0
security_index       0
transaction_time     0
credit_limit         0
balance_history      0
Target               0
dtype: int64


Great! There are no missing values in the dataset.

Now, let's check the distribution of the target variable.

In [4]:
display(df['Target'].value_counts())
px.bar(df['Target'].value_counts())

Target
0    284315
1       492
Name: count, dtype: int64

In [5]:
(len(df[df['Target'] == 1]) / len(df['Target'])) * 100

0.1727485630620034

⚠️ The dataset is highly imbalanced: only 0.17% of transactions are fraudulent.
To address this, we'll apply a resampling strategy (such as SMOTE) before model training. First, though, we explore feature correlations to identify the variables most related to fraud.

In [6]:
df.corr()['Target'].sort_values(ascending=False)

Target               1.00
background           0.15
postal_code          0.13
city                 0.09
android              0.04
loans                0.03
browser              0.02
minute               0.02
transaction_time     0.02
credit_limit         0.01
balance_history      0.01
security_index       0.00
relationship         0.00
ios                  0.00
purchases           -0.00
global_limit        -0.00
transaction_count   -0.00
browsing_history    -0.01
Timestamp           -0.01
day                 -0.04
ip_address          -0.09
operating_system    -0.10
country             -0.10
accounts            -0.11
hour                -0.19
district            -0.19
credit_type         -0.20
amount              -0.22
complaints          -0.26
credit              -0.30
merchant            -0.33
Name: Target, dtype: float64

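> `merchant` (-0.33), `credit` (-0.30), and `complaints` (-0.26) have the strongest correlations with the target variable, all of them negative. The strongest positive correlations belong to `background` (0.15) and `postal_code` (0.13).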

This overview helps us detect interdependencies and understand feature distributions.
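
Sorting by the signed correlation puts strong negative relationships at the bottom of the list, where they are easy to overlook. A small sketch of ranking by absolute value instead, on toy data with hypothetical column names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
target = rng.integers(0, 2, size=n)

# Toy features (hypothetical names): one rises with the target, one falls, one is noise.
toy = pd.DataFrame({
    "rises_with_fraud": target + rng.normal(0, 1.0, n),
    "falls_with_fraud": -2 * target + rng.normal(0, 1.0, n),
    "noise": rng.normal(0, 1.0, n),
    "Target": target,
})

# Rank features by the magnitude of their correlation while keeping the sign visible.
corr = toy.corr()["Target"].drop("Target")
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Applied to the notebook's data, the same two lines would work on `df.corr()['Target']`.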

# 7. Feature Correlation and Insights

We analyze correlations across all features to identify redundant or irrelevant variables.
This step is crucial for dimensionality reduction and better model interpretability.

In [63]:
df.corr().style.background_gradient(cmap='coolwarm')

Unnamed: 0,Timestamp,country,city,district,postal_code,ip_address,day,hour,minute,operating_system,amount,background,complaints,transaction_count,credit,global_limit,credit_type,merchant,accounts,loans,browser,android,ios,purchases,browsing_history,relationship,security_index,transaction_time,credit_limit,balance_history,Target
Timestamp,1.0,0.117396,-0.010593,-0.419618,-0.10526,0.173072,-0.063016,0.084714,-0.036949,-0.00866,0.030617,-0.247689,0.124348,-0.065902,-0.098757,-0.183453,0.011903,-0.073297,0.090438,0.028975,-0.050866,0.044736,0.144059,0.051142,-0.016182,-0.233083,-0.041407,-0.005135,-0.009413,-0.010596,-0.012323
country,0.117396,1.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,0.0,-0.227709,-0.101347
city,-0.010593,0.0,1.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.531409,0.091289
district,-0.419618,-0.0,0.0,1.0,0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.21088,-0.192961
postal_code,-0.10526,-0.0,-0.0,0.0,1.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.098732,0.133447
ip_address,0.173072,0.0,0.0,-0.0,-0.0,1.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,0.0,0.0,-0.0,-0.386356,-0.094974
day,-0.063016,-0.0,0.0,0.0,-0.0,0.0,1.0,0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,0.0,-0.0,0.0,-0.0,-0.0,0.0,0.215981,-0.043643
hour,0.084714,-0.0,0.0,0.0,-0.0,0.0,0.0,1.0,0.0,0.0,-0.0,0.0,-0.0,0.0,0.0,-0.0,0.0,0.0,0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.397311,-0.187257
minute,-0.036949,-0.0,-0.0,-0.0,0.0,0.0,-0.0,0.0,1.0,0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,-0.0,0.0,0.0,0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.103079,0.019875
operating_system,-0.00866,-0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-0.0,0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0,-0.0,0.0,-0.0,-0.0,-0.0,0.0,-0.0,-0.0,0.0,-0.044246,-0.097733


This overview helps detect interdependencies between variables and understand the data distribution. In this case, the features shown have essentially no correlation with one another, so multicollinearity is unlikely to be a problem. The exceptions are `Timestamp` and `balance_history`, which correlate moderately with several features (for example, -0.53 between `balance_history` and `city`).
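
Reading a 31 × 31 heatmap by eye is error-prone, so it can help to list the strongly correlated pairs directly. The helper below is a hypothetical sketch (the function name and the 0.8 threshold are assumptions) that scans the upper triangle of the correlation matrix:

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=500),  # nearly a duplicate of "a"
    "b": rng.normal(size=500),
})
print(highly_correlated_pairs(toy))
```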

# 8. Preparing data for training

Given the severe class imbalance, we use SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic samples of the minority class, ensuring a balanced training dataset.
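
To make the idea behind SMOTE concrete, here is a simplified sketch of its core step: each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. This is not imbalanced-learn's implementation; `smote_like` and its parameters are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create `n_new` synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbours = nn.kneighbors(X_min)

    base = rng.integers(0, len(X_min), size=n_new)       # random minority rows
    choice = rng.integers(1, k + 1, size=n_new)          # skip column 0 (the row itself)
    picked = neighbours[base, choice]                    # one true neighbour per base row
    gap = rng.random((n_new, 1))                         # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(42)
X_majority = rng.normal(0, 1, size=(500, 2))
X_minority = rng.normal(3, 1, size=(20, 2))

# Generate enough synthetic rows to match the majority class size.
X_new = smote_like(X_minority, n_new=len(X_majority) - len(X_minority))
X_balanced_minority = np.vstack([X_minority, X_new])
print(len(X_majority), len(X_balanced_minority))
```

Because every synthetic row is an interpolation, it always lies within the range spanned by existing minority samples. Oversampling should be applied to the training split only, so that the test set keeps the original class distribution.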

### Splitting data into train and test
Setting our explanatory variables:

In [64]:
X = df.drop(['Target'], axis=1)
X

Unnamed: 0,Timestamp,country,city,district,postal_code,ip_address,day,hour,minute,operating_system,...,browser,android,ios,purchases,browsing_history,relationship,security_index,transaction_time,credit_limit,balance_history
0,0.00,-1.36,-0.07,2.54,1.38,-0.34,0.46,0.24,0.10,0.36,...,0.25,-0.02,0.28,-0.11,0.07,0.13,-0.19,0.13,-0.02,149.62
1,0.00,1.19,0.27,0.17,0.45,0.06,-0.08,-0.08,0.09,-0.26,...,-0.07,-0.23,-0.64,0.10,-0.34,0.17,0.13,-0.01,0.01,2.69
2,1.00,-1.36,-1.34,1.77,0.38,-0.50,1.80,0.79,0.25,-1.51,...,0.52,0.25,0.77,0.91,-0.69,-0.33,-0.14,-0.06,-0.06,378.66
3,1.00,-0.97,-0.19,1.79,-0.86,-0.01,1.25,0.24,0.38,-1.39,...,-0.21,-0.11,0.01,-0.19,-1.18,0.65,-0.22,0.06,0.06,123.50
4,2.00,-1.16,0.88,1.55,0.40,-0.41,0.10,0.59,-0.27,0.82,...,0.41,-0.01,0.80,-0.14,0.14,-0.21,0.50,0.22,0.22,69.99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.00,-11.88,10.07,-9.83,-2.07,-5.36,-2.61,-4.92,7.31,1.91,...,1.48,0.21,0.11,1.01,-0.51,1.44,0.25,0.94,0.82,0.77
284803,172787.00,-0.73,-0.06,2.04,-0.74,0.87,1.06,0.02,0.29,0.58,...,0.06,0.21,0.92,0.01,-1.02,-0.61,-0.40,0.07,-0.05,24.79
284804,172788.00,1.92,-0.30,-3.25,-0.56,2.63,3.03,-0.30,0.71,0.43,...,0.00,0.23,0.58,-0.04,0.64,0.27,-0.09,0.00,-0.03,67.88
284805,172788.00,-0.24,0.53,0.70,0.69,-0.38,0.62,-0.69,0.68,0.39,...,0.13,0.27,0.80,-0.16,0.12,-0.57,0.55,0.11,0.10,10.00


Setting our response variable:

In [65]:
y = df['Target']

Running the resampling (oversampling) method:

In [11]:
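# Hold out 30% for testing (size inferred from the 85,443-row test set), then oversample the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
smt = SMOTE()
X, y = smt.fit_resample(X_train, y_train)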

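Now let's check the results of resampling. Note that the resampled data is stored in `X` and `y`, while the models in section 9 call `fit(X_train, y_train)`. They are therefore trained on the original, imbalanced training split (the LightGBM log reports 337 positives among 199,364 rows). To train on the balanced data, pass the resampled `X` and `y` to `fit` instead.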

In [12]:
px.bar(y.value_counts(), color=y.value_counts().index, labels={"value": "Count", "index": "Class"})

# 9. Creating predictive models for fraud detection in bank transactions

We train multiple machine learning models, such as:

- Random Forest
- XGBoost
- LightGBM

Each model is evaluated based on its ability to correctly classify fraudulent transactions while minimizing false positives.


### Model Evaluation

We use a combination of the following metrics to assess model performance:

- Confusion Matrix
- Precision, Recall, and F1-Score
- ROC-AUC curve (not implemented yet)

Because fraud detection is a highly imbalanced problem, **precision** and **recall** are more important than overall accuracy. The best model should balance sensitivity (detecting fraud) and specificity (avoiding false alarms), so we pay closer attention to the confusion matrix and the **F1-score**.
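
ROC-AUC is listed above as not yet implemented. As a hedged sketch of how it could be computed, the example below uses scikit-learn's `roc_auc_score` together with `average_precision_score`, which summarizes the precision-recall curve and is often more telling on imbalanced data. The labels and scores are made up; in the notebook they would be `y_test` and a model's predicted fraud probabilities.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical labels and scores; in the notebook these would come from
# y_test and a model's predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
y_score = np.array([0.05, 0.10, 0.02, 0.30, 0.08, 0.01, 0.60, 0.90, 0.20, 0.40])

# Both metrics need continuous scores, not thresholded 0/1 predictions.
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)
print(f"ROC-AUC: {roc_auc:.3f}")
print(f"Average precision (PR-AUC): {pr_auc:.3f}")
```

For the scikit-learn style classifiers, the scores would come from `model.predict_proba(X_test)[:, 1]`. A LightGBM `Booster` trained with the `binary` objective already returns probabilities from `predict`.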


## 9.1 XGBoost

Building the model

In [68]:
model = XGBClassifier()

Training the model to detect fraud in bank transactions (can take a while)

In [69]:
model = model.fit(X_train, y_train)
model

0,1,2
,objective,'binary:logistic'
,base_score,
,booster,
,callbacks,
,colsample_bylevel,
,colsample_bynode,
,colsample_bytree,
,device,
,early_stopping_rounds,
,enable_categorical,False


Generating predictions for the test set

In [None]:
y_predict = model.predict(X_test)

In [None]:
y_predict

array([0, 0, 0, ..., 0, 0, 0], shape=(85443,))

Comparing the model's predictions with the true labels.

Let's build a DataFrame that places the true labels (`template`) next to the model's predictions.

In [19]:
template = pd.DataFrame({'template': y_test, 'predictions': y_predict})
template

Sample of the template:


Unnamed: 0,template,predictions,Unnamed: 3
86801,0,0,
34867,0,0,
151239,0,0,
122560,0,0,
77820,0,0,
...,...,...,
11519,0,0,
21449,0,0,
530325,1,1,0.0
450324,1,1,0.0


At first glance, it looks like our model did a good job!

Using some metrics to evaluate our model:

In [25]:
print(f'Accuracy: \n{accuracy_score(y_test, y_predict)}')

Accuracy: 
0.9994733330992591


Accuracy is very high, but on a dataset this imbalanced it is not informative on its own: a model that labels every transaction as legitimate would already be right about 99.8% of the time. Let's check the classification report for a clearer picture.

In [26]:
print(f'Classification metrics: \n{classification_report(y_test, y_predict)}')

Classification metrics: 
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     85288
           1       0.88      0.83      0.85       155

    accuracy                           1.00     85443
   macro avg       0.94      0.91      0.93     85443
weighted avg       1.00      1.00      1.00     85443



The classification report gives a more useful picture: for the fraud class, precision is 0.88, recall is 0.83, and the F1-score is 0.85.
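
To put the accuracy figure in context, this short calculation uses the test-set class counts from the report above (85,288 legitimate, 155 fraudulent) to show what a trivial model that never predicts fraud would score:

```python
# Test-set class counts as reported in the notebook's classification report.
n_legit, n_fraud = 85_288, 155

# A "model" that labels every transaction as legitimate is right on all
# legitimate rows and wrong on all fraudulent ones.
baseline_accuracy = n_legit / (n_legit + n_fraud)
baseline_recall = 0 / n_fraud

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline fraud recall: {baseline_recall:.2f}")
```

Any model has to be judged against this baseline, which is why recall and precision on the fraud class matter more than accuracy here.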

Finally, we can compute a confusion matrix to see the counts of true and false positives and negatives.

In [27]:
print(f'Confusion matrix: \n{confusion_matrix(y_test, y_predict)}')

Confusion matrix: 
[[85270    18]
 [   27   128]]


The main diagonal of the confusion matrix shows the number of correct predictions, and the off-diagonal elements are the errors. The model missed 27 of the 155 fraudulent transactions (false negatives) and flagged 18 legitimate transactions as fraud (false positives).
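
The layout of scikit-learn's confusion matrix is easy to misread, so the sketch below unpacks it on a handful of toy labels and derives precision, recall, and F1 from the four counts. The labels are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels sorted as [0, 1], rows are true classes and columns are
# predicted classes, so ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of fraud alerts that were real fraud
recall = tp / (tp + fn)      # share of real fraud that was caught
f1 = 2 * precision * recall / (precision + recall)
print(tn, fp, fn, tp)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```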

## 9.2 LightGBM

Setting the model

In [None]:
# LightGBM
train_data = lgb.Dataset(X_train, label=y_train)

# Setting parameters for lightgbm
params = {
    'num_leaves': 1000,
    'objective': 'binary',
    'max_depth': 7,
    'learning_rate': .01,
    'max_bin': 200,
    'metric': ['auc', 'binary_logloss']
}

Training our model using lightgbm

In [78]:
num_round = 50

lgbm = lgb.train(params, train_data, num_round)

[LightGBM] [Info] Number of positive: 337, number of negative: 199027
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008620 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 6000
[LightGBM] [Info] Number of data points in the train set: 199364, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.001690 -> initscore=-6.381113
[LightGBM] [Info] Start training from score -6.381113


Testing the model

In [79]:
y_predict = lgbm.predict(X_test)

Evaluating the results

In [None]:
template = pd.DataFrame({'template': y_test, 'predictions': y_predict})
template

Unnamed: 0,template,predictions
86801,0,0.00
34867,0,0.00
151239,0,0.00
122560,0,0.00
77820,0,0.00
...,...,...
11519,0,0.00
21449,0,0.00
129577,0,0.00
197268,0,0.00


LightGBM's `predict` returns probabilities, so we convert them to binary labels by applying a threshold.

In [89]:
y_predict.size

85443

In [92]:
# Converting probabilities into 0 or 1 with a 0.5 threshold (vectorized)
y_predict = np.where(y_predict >= 0.5, 1.0, 0.0)

In [96]:
pd.Series(y_predict).value_counts()

0.00    85350
1.00       93
Name: count, dtype: int64

Now we have only zeros and ones.
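
The 0.5 cutoff is a convention, not a requirement. Lowering the threshold usually trades precision for recall, which matters when missed fraud is costlier than a false alarm. The sketch below sweeps a few thresholds over made-up probabilities that stand in for `lgbm.predict(X_test)`; the values and thresholds are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities standing in for
# y_test and lgbm.predict(X_test).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.02, 0.05, 0.10, 0.35, 0.55, 0.20, 0.30, 0.45, 0.70, 0.95])

results = {}
for threshold in (0.25, 0.5, 0.75):
    y_hat = (proba >= threshold).astype(int)  # vectorized thresholding
    results[threshold] = (
        precision_score(y_true, y_hat),
        recall_score(y_true, y_hat),
    )
    print(f"threshold={threshold:.2f} "
          f"precision={results[threshold][0]:.2f} "
          f"recall={results[threshold][1]:.2f}")
```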

Evaluating the model

In [99]:
print('Classification metrics: \n', classification_report(y_test, y_predict))
print('Accuracy: \n', accuracy_score(y_test, y_predict))
print('\nConfusion matrix: \n', confusion_matrix(y_test, y_predict))

Classification metrics: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85288
           1       0.94      0.56      0.70       155

    accuracy                           1.00     85443
   macro avg       0.97      0.78      0.85     85443
weighted avg       1.00      1.00      1.00     85443

Accuracy: 
 0.999133925541004

Confusion matrix: 
 [[85282     6]
 [   68    87]]


## 9.3 Random Forest

Setting up the model

In [36]:
model = RandomForestClassifier()

Training the model

In [37]:
model = model.fit(X_train, y_train)
model

0,1,2
,n_estimators,100
,criterion,'gini'
,max_depth,
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,'sqrt'
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


Testing the model

In [39]:
y_predict = model.predict(X_test)

Evaluating the model

In [41]:
template = pd.DataFrame({'template': y_test, 'prediction': y_predict})
template

Unnamed: 0,template,prediction
86801,0,0
34867,0,0
151239,0,0
122560,0,0
77820,0,0
...,...,...
11519,0,0
21449,0,0
129577,0,0
197268,0,0


In [44]:
# Getting the metrics
print('Classification metrics: \n', classification_report(y_test, y_predict))
print('Accuracy: \n', accuracy_score(y_test, y_predict))
print('Confusion Matrix: \n', confusion_matrix(y_test, y_predict))

Classification metrics: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85288
           1       0.93      0.83      0.88       155

    accuracy                           1.00     85443
   macro avg       0.97      0.92      0.94     85443
weighted avg       1.00      1.00      1.00     85443

Accuracy: 
 0.9995903701883126
Confusion Matrix: 
 [[85279     9]
 [   26   129]]


# 10. Conclusion

- Hypothesis generation is crucial for fraud detection.
- Imbalanced classes must be handled.
- Sometimes simpler algorithms outperform more complex ones. Random Forest beat both XGBoost and LightGBM on fraud-class F1-score (0.88 versus 0.85 and 0.70), but its processing time was much higher.
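
The comparison can be reproduced from the three confusion matrices reported in section 9. The sketch below recomputes fraud-class precision, recall, and F1 for each model and sorts them by F1. Only the matrix values come from the notebook; the table layout is an illustrative choice.

```python
import pandas as pd

# Confusion matrices reported in the notebook, as (tn, fp, fn, tp).
reported = {
    "XGBoost": (85270, 18, 27, 128),
    "LightGBM": (85282, 6, 68, 87),
    "Random Forest": (85279, 9, 26, 129),
}

rows = []
for name, (tn, fp, fn, tp) in reported.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    rows.append({
        "model": name,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    })

# Best fraud-class F1 first.
comparison = pd.DataFrame(rows).set_index("model").sort_values("f1", ascending=False)
print(comparison.round(2))
```

LightGBM has precision close to Random Forest's but much lower recall, which is what pulls its F1-score down.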