## 1. Importing the libraries

Importing the main Python libraries for data manipulation, visualization, and machine learning.

In [1]:
# Basic packages
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

# Data pre-processing packages
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Machine learning packages
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier


Assuming the data is prepared, we load it into a pandas DataFrame for further exploration.

# 4. Loading the Data

In [2]:
#df = pd.read_csv('data/transferencias.csv')
df = pd.read_csv(os.path.join("data", "bank_transactions_processed.csv"))
df

Unnamed: 0,pca_1,pca_2,pca_3,pca_4,pca_5,pca_6,pca_7,pca_8,pca_9,pca_10,...,pca_13,pca_14,pca_15,pca_16,pca_17,pca_18,pca_19,pca_20,value,is_fraud
0,-0.671428,2.384285,0.166130,-0.794337,1.957361,1.354013,-1.743959,-0.005643,-0.181439,-1.478795,...,-1.525467,-1.289536,-0.883806,1.293161,-1.390802,-0.028964,0.319144,0.535073,0.203084,0
1,0.863544,1.279572,0.534368,-0.520691,0.259223,-0.101313,1.301402,-1.574485,-0.770987,0.163682,...,-0.877596,-1.520939,-0.150134,0.498575,-1.322150,4.198688,2.577431,2.413617,0.811126,0
2,-0.895024,-0.972095,0.408360,-1.357932,-1.361304,1.099777,0.510390,-1.768689,0.527156,-0.004611,...,0.021071,-1.652027,1.161755,2.824849,-1.153931,-0.395659,0.913622,0.033904,-0.645348,0
3,-1.780856,-0.406535,-0.781746,-1.002004,-1.183516,0.142199,-0.288984,0.710746,-0.537743,-1.511962,...,0.166825,-0.657215,-1.387942,0.590543,-1.628303,-0.279937,0.277378,1.185532,-0.991991,0
4,0.879539,0.943594,-0.891395,-0.634451,1.456009,-0.498341,-1.887608,0.900030,-0.871284,0.202399,...,-1.568136,-1.757523,0.108359,0.639796,-0.664426,0.161594,0.939584,-0.347371,-0.208884,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,-3.272626,0.697995,1.462324,-1.336004,2.076722,-1.349106,1.384431,-1.055271,-0.382198,-0.495233,...,-0.542282,2.284305,-0.981249,-0.701960,1.560206,-0.398124,-0.566325,0.153074,1.113355,0
9996,-1.025218,2.614890,1.143024,1.298115,0.337518,-0.637039,1.619402,2.518042,-0.036125,-1.542511,...,-0.079691,1.010998,-0.673343,-0.054092,0.158904,-0.945541,-0.344079,0.156783,-0.543776,0
9997,-2.539605,0.698052,-0.414624,-0.501739,-0.514716,1.031083,0.781892,-0.213110,0.231349,-1.161355,...,-0.426956,3.840834,0.939579,-1.063721,0.717653,4.187816,1.753997,2.330793,-0.319021,0
9998,-0.500749,-1.222346,-1.073083,-1.335034,0.449434,-0.760098,0.043146,0.010107,1.317452,0.177794,...,-0.920368,0.933586,1.786813,-0.833415,-0.477168,0.722694,-0.461358,-0.549694,0.611367,0


# 5. Exploratory Data Analysis (EDA)

First, let's check for missing values in the dataset.

In [3]:
# Checking for missing values
print(df.isna().sum())

pca_1       0
pca_2       0
pca_3       0
pca_4       0
pca_5       0
pca_6       0
pca_7       0
pca_8       0
pca_9       0
pca_10      0
pca_11      0
pca_12      0
pca_13      0
pca_14      0
pca_15      0
pca_16      0
pca_17      0
pca_18      0
pca_19      0
pca_20      0
value       0
is_fraud    0
dtype: int64


Great! There's no missing values in the dataset.

Now, let's check the distribution of the target variable.

In [4]:
display(df['is_fraud'].value_counts())
px.bar(df['is_fraud'].value_counts(), )

is_fraud
0    9857
1     143
Name: count, dtype: int64

In [5]:
(len(df[df['is_fraud'] == 1]) / len(df['is_fraud'])) * 100

1.43

⚠️ The dataset is highly imbalanced — only 1.43% of transactions are fraudulent.
To address this, we’ll apply a resampling strategy (such as SMOTE) before model training in next steps.

# 6. Preparing data for training

First we have split our dataset in train and test sets to avoid data leakage and overfitting.

Given the severe class imbalance, we use SMOTE (Synthetic Minority Oversampling Technique) to generate synthetic samples of the minority class, ensuring a balanced training dataset.

### Spliting data into train and test
Setting our explanatory variables:

In [8]:
X = df.drop(['is_fraud'], axis=1)

Setting our response variable:

In [9]:
y = df['is_fraud']

Splitting our data into train and test sets. We set 20% of the data as the test set and 80% as the train set. We also use `stratify=y` to ensure that the test set has the same distribution of classes as the train set.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

Runnning the resampling (oversampling) method:

In [56]:
smt = SMOTE()
X, y = smt.fit_resample(X_train, y_train)

Now let's check the results of resampling.

In [57]:
px.bar(y.value_counts(), color=y.value_counts().index, labels={"value": "Count", "index": "Class"})

# 7. Creating predictive models for fraud detection in bank transactions

We train multiple machine learning models, such as:

- Random Forest
- XGBoost
- LightGBM

Each model is evaluated based on its ability to correctly classify fraudulent transactions while minimizing false positives.


### Model Evaluation

We use a combination of the following metrics to assess model performance:

- Confusion Matrix
- Precision, Recall, and F1-Score
- ROC–AUC Curve (not implemented yet)

Because fraud detection is a highly imbalanced problem, **precision** are more important than overall accuracy. Moreover, the best model should balance sensitivity (detecting frauds) and specificity (avoiding false alarms), so we gave a higher attention to Confusion Matrix and **F1-Score** reports.


## 7.1 XGBoost

Building the model

In [68]:
xgb = XGBClassifier()

Training the model to detect fraud in bank transactions (can take a while)

In [69]:
xgb_model = xgb.fit(X_train, y_train)
xgb_model

0,1,2
,objective,'binary:logistic'
,base_score,
,booster,
,callbacks,
,colsample_bylevel,
,colsample_bynode,
,colsample_bytree,
,device,
,early_stopping_rounds,
,enable_categorical,False


Using the real values to create a prediction dataset

In [62]:
y_predict = model.predict(X_test)

In [63]:
y_predict

array([0, 0, 0, ..., 0, 0, 0], shape=(3300,))

Comparing model answers with real data prediction.

Let's create a template dataframe with desired answers and our model answers.

In [64]:
template  = pd.DataFrame({'template': y_test, 'predictions': y_predict})
template

Unnamed: 0,template,predictions
6252,0,0
4684,0,0
1731,0,0
4742,0,0
4521,0,0
...,...,...
1744,0,0
9754,0,0
6094,0,0
8781,0,0


At the first sight, looks like our model did a good job!

Using some metrics to evaluate our model:

In [65]:
print(f'Accuracy: \n{accuracy_score(y_test, y_predict)}')

Accuracy: 
0.9869696969696969


In terms of accuracy, our model is great. Considering the imbalanced dataset, this is a very good result. To confirm, let's check the classification report.

In [60]:
print(f'Classification metrics: \n{classification_report(y_test, y_predict)}')

Classification metrics: 
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      3259
           1       0.00      0.00      0.00        41

    accuracy                           0.99      3300
   macro avg       0.49      0.50      0.50      3300
weighted avg       0.98      0.99      0.98      3300



The classification report confirms our high accuracy level.

Finally we can run a confusion matrix to see the true positive and false positive rates.

In [59]:
print(f'Confusion matrix: \n{confusion_matrix(y_test, y_predict)}')

Confusion matrix: 
[[3259    0]
 [  41    0]]


The main diagonal in our confusion matrix shows the number of correct predictions. The off-diagonal elements represent incorrect predictions. Our model brought us only a few incorrect predictions.

## 7.2 LightGBM

Setting the model

In [37]:
# LightGBM
import lightgbm as lgb
lgb = lgb.LGBMClassifier()
lgb.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 102, number of negative: 6598
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000585 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5355
[LightGBM] [Info] Number of data points in the train set: 6700, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.015224 -> initscore=-4.169549
[LightGBM] [Info] Start training from score -4.169549


0,1,2
,boosting_type,'gbdt'
,num_leaves,31
,max_depth,-1
,learning_rate,0.1
,n_estimators,100
,subsample_for_bin,200000
,objective,
,class_weight,
,min_split_gain,0.0
,min_child_weight,0.001


Training and testing our model using lightgbm

In [38]:
y_pred = lgb.predict(X_test)
y_predict = lgb.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.99      1.00      0.99      3259
           1       0.00      0.00      0.00        41

    accuracy                           0.99      3300
   macro avg       0.49      0.50      0.50      3300
weighted avg       0.98      0.99      0.98      3300



Evaluating the results

In [39]:
template = pd.DataFrame({'template': y_test, 'predictions': y_predict})
template

Unnamed: 0,template,predictions
6252,0,0
4684,0,0
1731,0,0
4742,0,0
4521,0,0
...,...,...
1744,0,0
9754,0,0
6094,0,0
8781,0,0


LightGBM returns probabilities. So we have to transform it to binary values setting a threshold

In [40]:
y_predict.size

3300

In [41]:
# Converting probabilities into 0 or 1
for i in range(0, y_predict.size):
    if y_predict[i] >= 0.5:
        y_predict[i] = 1
    else:
        y_predict[i] = 0

In [42]:
pd.value_counts(y_predict)    

0    3300
Name: count, dtype: int64

Now we have only zeros and ones.

Evaluating the model

In [43]:
print('Classification metrics: \n', classification_report(y_test, y_predict))
print('Accuracy: \n', accuracy_score(y_test, y_predict))
print('\nConfusion matrix: \n', confusion_matrix(y_test, y_predict))

Classification metrics: 
               precision    recall  f1-score   support

           0       0.99      1.00      0.99      3259
           1       0.00      0.00      0.00        41

    accuracy                           0.99      3300
   macro avg       0.49      0.50      0.50      3300
weighted avg       0.98      0.99      0.98      3300

Accuracy: 
 0.9875757575757576

Confusion matrix: 
 [[3259    0]
 [  41    0]]


In [45]:
# LightGBM
train_data = lgb.Dataset(X_train, label=y_train)

# Setting parameters for lightgbm
params = {
    'num_leaves': 1000,
    'objective': 'binary',
    'max_depth': 7,
    'learning_rate': .01,
    'max_bin': 200,
    'metric': ['auc', 'binary_logloss']
}

AttributeError: 'LGBMClassifier' object has no attribute 'Dataset'

Training our model using lightgbm

In [44]:
num_round = 50

lgbm = lgb.train(params, train_data, num_round)

AttributeError: 'LGBMClassifier' object has no attribute 'train'

Testing the model

In [26]:
y_predict = lgbm.predict(X_test)

Evaluating the results

In [27]:
template = pd.DataFrame({'template': y_test, 'predictions': y_predict})
template

Unnamed: 0,template,predictions
86801,0,0.001027
34867,0,0.001027
151239,0,0.001143
122560,0,0.001027
77820,0,0.001027
...,...,...
11519,0,0.001066
21449,0,0.001027
129577,0,0.001027
197268,0,0.001089


LightGBM returns probabilities. So we have to transform it to binary values setting a threshold

In [28]:
y_predict.size

85443

In [29]:
# Converting probabilities into 0 or 1
for i in range(0, y_predict.size):
    if y_predict[i] >= 0.5:
        y_predict[i] = 1
    else:
        y_predict[i] = 0

In [30]:
pd.value_counts(y_predict)    

0.0    85350
1.0       93
Name: count, dtype: int64

Now we have only zeros and ones.

Evaluating the model

In [31]:
print('Classification metrics: \n', classification_report(y_test, y_predict))
print('Accuracy: \n', accuracy_score(y_test, y_predict))
print('\nConfusion matrix: \n', confusion_matrix(y_test, y_predict))

Classification metrics: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85288
           1       0.94      0.56      0.70       155

    accuracy                           1.00     85443
   macro avg       0.97      0.78      0.85     85443
weighted avg       1.00      1.00      1.00     85443

Accuracy: 
 0.999133925541004

Confusion matrix: 
 [[85282     6]
 [   68    87]]


## 7.3 Random Forest

Setting up the model

In [36]:
model = RandomForestClassifier()

Training the model

In [37]:
model = model.fit(X_train, y_train)
model

0,1,2
,n_estimators,100
,criterion,'gini'
,max_depth,
,min_samples_split,2
,min_samples_leaf,1
,min_weight_fraction_leaf,0.0
,max_features,'sqrt'
,max_leaf_nodes,
,min_impurity_decrease,0.0
,bootstrap,True


Testing the model

In [39]:
y_predict = model.predict(X_test)

Evaluating the model

In [41]:
template = pd.DataFrame({'template': y_test, 'prediction': y_predict})
template

Unnamed: 0,template,prediction
86801,0,0
34867,0,0
151239,0,0
122560,0,0
77820,0,0
...,...,...
11519,0,0
21449,0,0
129577,0,0
197268,0,0


In [44]:
# Getting the metrics
print('Classification metrics: \n', classification_report(y_test, y_predict))
print('Accuracy: \n', accuracy_score(y_test, y_predict))
print('Confusion Matrix: \n', confusion_matrix(y_test, y_predict))

Classification metrics: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     85288
           1       0.93      0.83      0.88       155

    accuracy                           1.00     85443
   macro avg       0.97      0.92      0.94     85443
weighted avg       1.00      1.00      1.00     85443

Accuracy: 
 0.9995903701883126
Confusion Matrix: 
 [[85279     9]
 [   26   129]]


Conclusion

- Hypotesys generation is crucial for fraud detection.
- Unbalanced classes must be handled.
- Sometimes much simple algorithms are better than complex ones. Random Forest beated LightGBM and XGBoost, but processing time was much higher.