## Introduction

This notebook assumes you already have a pandas DataFrame `df` with the following columns:
- `pca_1` ... `pca_20` (PCA components)
- `value` (transaction amount)
- `is_fraud` (binary target, 0/1)

We will:
1. Split the data (stratified) into train/test.
2. Train four models using reasonable default hyperparameters.
3. Evaluate with classification report, ROC AUC and PR AUC.
4. Show comparative plots and a summary table.


## 1. Importing the required libraries

In [59]:
# Basic packages
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

# Data pre-processing packages
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Machine learning packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score, average_precision_score
from xgboost import XGBClassifier
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

# Note: Make sure xgboost and lightgbm are installed.

Assuming the data is prepared, we load it into a pandas DataFrame for further exploration.

## 2. Loading the Data

In [60]:
df = pd.read_csv(os.path.join("data", "bank_transactions_processed.csv"))
display(df.sample(5))

Unnamed: 0,pca_1,pca_2,pca_3,pca_4,pca_5,pca_6,pca_7,pca_8,pca_9,pca_10,...,pca_13,pca_14,pca_15,pca_16,pca_17,pca_18,pca_19,pca_20,value,is_fraud
1850,-0.521369,2.24817,-1.000626,-0.650667,-1.389915,1.989784,-1.496804,1.5957,-1.517042,0.309474,...,-0.898379,-0.696912,-0.573514,0.23054,0.2737,-0.746194,-0.161422,1.08177,-0.686077,0
6190,-0.447963,-1.300053,-1.506615,-0.918446,1.292154,-1.089791,2.591238,-0.140665,0.209215,-1.156711,...,1.511277,-1.354618,0.513474,0.352735,-0.657805,0.605419,0.180942,-0.039773,-0.703603,1
1919,0.455204,1.167726,4.21276,0.14076,1.267224,-0.061641,0.713301,0.232927,0.919042,-2.169781,...,-0.947903,1.954456,1.097916,0.245586,0.204188,5.083197,-0.358361,-2.541927,0.935902,0
4814,-0.934753,-1.094249,-2.434944,-0.179651,1.421463,0.133711,-0.190779,-1.262137,-1.769587,-1.684574,...,-0.526421,-1.359622,-0.450853,0.002597,0.105777,0.138948,-0.02989,-0.067553,-0.806869,0
7819,1.351281,-3.088381,-0.947682,-0.879908,-1.250638,2.206905,-0.084555,-1.522759,-1.354205,-0.71853,...,-0.303121,0.245802,0.018106,-0.12502,1.318358,-0.195988,0.133265,-0.196254,-0.298706,0


## 3. Exploratory Data Analysis (EDA)

First, let's briefly check for missing values, dtypes and descriptive statistics.

In [61]:
# Checking for missing values
print(df.isna().sum())

pca_1       0
pca_2       0
pca_3       0
pca_4       0
pca_5       0
pca_6       0
pca_7       0
pca_8       0
pca_9       0
pca_10      0
pca_11      0
pca_12      0
pca_13      0
pca_14      0
pca_15      0
pca_16      0
pca_17      0
pca_18      0
pca_19      0
pca_20      0
value       0
is_fraud    0
dtype: int64


In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pca_1     10000 non-null  float64
 1   pca_2     10000 non-null  float64
 2   pca_3     10000 non-null  float64
 3   pca_4     10000 non-null  float64
 4   pca_5     10000 non-null  float64
 5   pca_6     10000 non-null  float64
 6   pca_7     10000 non-null  float64
 7   pca_8     10000 non-null  float64
 8   pca_9     10000 non-null  float64
 9   pca_10    10000 non-null  float64
 10  pca_11    10000 non-null  float64
 11  pca_12    10000 non-null  float64
 12  pca_13    10000 non-null  float64
 13  pca_14    10000 non-null  float64
 14  pca_15    10000 non-null  float64
 15  pca_16    10000 non-null  float64
 16  pca_17    10000 non-null  float64
 17  pca_18    10000 non-null  float64
 18  pca_19    10000 non-null  float64
 19  pca_20    10000 non-null  float64
 20  value     10000 non-null  flo

In [63]:
display(df.describe())

Unnamed: 0,pca_1,pca_2,pca_3,pca_4,pca_5,pca_6,pca_7,pca_8,pca_9,pca_10,...,pca_13,pca_14,pca_15,pca_16,pca_17,pca_18,pca_19,pca_20,value,is_fraud
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,...,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,-7.354117000000001e-17,7.105427000000001e-17,6.750156e-17,7.958079000000001e-17,-5.968559000000001e-17,-3.836931e-17,2.2737370000000003e-17,8.526513e-17,6.821210000000001e-17,2.2737370000000003e-17,...,4.1922020000000005e-17,1.215028e-16,1.477929e-16,-1.179501e-16,1.705303e-17,6.821210000000001e-17,-5.684342e-18,9.237056e-18,5.82645e-17,0.0143
std,1.425766,1.411595,1.238624,1.227578,1.219981,1.181413,1.170264,1.167538,1.163735,1.158356,...,1.143635,1.141923,1.138352,1.132033,1.120515,1.032621,1.018481,1.015899,1.00005,0.11873
min,-3.412884,-3.424606,-2.556509,-1.920182,-2.431773,-3.283442,-2.948267,-2.992338,-3.448462,-3.232591,...,-3.037863,-2.907108,-2.934881,-2.677256,-3.639277,-2.129623,-2.633583,-5.537357,-1.006629,0.0
25%,-1.034605,-1.057621,-0.7469235,-0.9885486,-0.9873619,-0.8433941,-0.8721569,-0.8367222,-0.8410854,-0.867664,...,-0.8358179,-0.8338384,-0.8166699,-0.6433689,-0.7548194,-0.573041,-0.5093968,-0.5712499,-0.7169722,0.0
50%,0.01614279,-0.01977968,-0.1709389,-0.5247201,-0.2979166,-0.01433268,-0.03557377,-0.01335186,-0.05597769,-0.04744226,...,-0.008584502,0.003207329,-0.02600806,-0.02848167,0.006468591,-0.1556714,-0.1142292,-0.03297456,-0.3096602,0.0
75%,1.016025,1.077497,0.5099288,1.430438,1.211329,0.8311552,0.7924358,0.8173948,0.7713267,0.7723551,...,0.7776989,0.7309069,0.8075405,0.5426552,0.7872728,0.2783467,0.2755548,0.507304,0.3984868,0.0
max,3.373057,3.333026,24.35335,2.586224,6.71571,4.252539,3.542977,3.137977,3.789435,3.58644,...,3.311253,3.635107,3.04568,2.788958,3.105128,7.907093,7.088443,5.781896,7.520511,1.0


Great! Our data is normalized and encoded, and there are no missing values in the dataset.

Now, let's check the distribution of the target variable.

In [64]:
display(df['is_fraud'].value_counts())
px.bar(df['is_fraud'].value_counts(), color=df['is_fraud'].value_counts().index)

is_fraud
0    9857
1     143
Name: count, dtype: int64

In [65]:
(len(df[df['is_fraud'] == 1]) / len(df['is_fraud'])) * 100

1.43

⚠️ As expected, the dataset is highly imbalanced: only 1.43% of transactions are fraudulent.
We'll handle this imbalance in the next steps.
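
To make the consequence of this imbalance concrete, the short sketch below computes how a trivial "always predict non-fraud" rule would score, using the class counts printed above. The variable names are illustrative and not part of the notebook.

```python
# Class counts taken from the value_counts() output above.
n_legit = 9857
n_fraud = 143
n_total = n_legit + n_fraud

# A rule that always predicts 0 (non-fraud) is right on every legitimate
# transaction and wrong on every fraudulent one.
baseline_accuracy = n_legit / n_total
baseline_recall = 0 / n_fraud  # it never flags a single fraud

print(f"Fraud share:       {n_fraud / n_total:.2%}")
print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Baseline recall:   {baseline_recall:.2%}")
```

A model therefore has to be judged on how it treats the fraud class, not on accuracy alone.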

## 4. Preparing data for training

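First, we split our dataset into train and test sets, so that models are evaluated on data they have not seen and no information leaks from the test set into training.

Given the severe class imbalance, SMOTE (Synthetic Minority Oversampling Technique) can be used to generate synthetic samples of the minority class and balance the training dataset. Resampling should be applied to the training set only, after the split, so that the test set keeps the real class distribution.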

### Splitting data into train and test
Setting our explanatory variables:

In [66]:
X = df.drop(['is_fraud'], axis=1)

Setting our response variable:

In [67]:
y = df['is_fraud']

Splitting our data into train and test sets. We set 20% of the data as the test set and 80% as the train set. We also use `stratify=y` to ensure that the test set has the same distribution of classes as the train set.

In [68]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

Checking the fraud ratio in train and test datasets:

In [69]:
# Print sizes
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)
print('Train fraud ratio:', y_train.mean(), 'Test fraud ratio:', y_test.mean())

Train shape: (8000, 21) Test shape: (2000, 21)
Train fraud ratio: 0.01425 Test fraud ratio: 0.0145


The stratify parameter worked fine and the proportions of fraud and non-fraud transactions were preserved in both the training and testing sets.

Running the resampling (oversampling) method:
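
The notebook imports `SMOTE` from `imblearn`, but the resampling cell itself is not shown here. As a rough illustration of what SMOTE does, the sketch below builds synthetic minority rows by interpolating between a minority row and one of its nearest minority neighbours. It is a simplified NumPy version written for this explanation, not imblearn's implementation; `smote_like_oversample` and the toy data are hypothetical names.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority rows
    and their k nearest minority neighbours (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority rows.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # random minority row
    pick = rng.integers(0, k, size=n_new)   # one of its k neighbours
    neigh = neighbours[base, pick]
    gap = rng.random((n_new, 1))            # position on the connecting segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Toy data standing in for the fraud rows of X_train (assumption).
toy_rng = np.random.default_rng(0)
X_fraud_toy = toy_rng.normal(size=(10, 3))
X_synth = smote_like_oversample(X_fraud_toy, n_new=25, k=3)
print(X_synth.shape)
```

With the imported class, the usual call would be `X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)`, after which the models would be fitted on `X_res, y_res`. The outputs in the following sections report 8000 training samples, so they appear to come from the original, non-resampled `X_train`.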

## 5. Creating predictive models for fraud detection in bank transactions

We train several machine learning models that are widely used in fraud detection:

- Logistic Regression
- Random Forest
- XGBoost
- LightGBM

Each model is evaluated based on its ability to correctly classify fraudulent transactions while minimizing false positives.


### Model Evaluation

Since accuracy is rarely the right yardstick for fraud detection, we use a combination of the following metrics to assess model performance:

- Precision
- Recall
- F1-Score
- AUC-ROC
- AUC-PR (more informative than AUC-ROC on imbalanced data)

Because fraud detection is a highly imbalanced problem, **precision** on the fraud class is more important than overall accuracy. Moreover, the best model should balance sensitivity (detecting frauds) and specificity (avoiding false alarms), so we pay particular attention to the **Confusion Matrix** and the **F1-Score** to ensure good performance on both fronts. Both kinds of error are costly in real-world fraud detection systems, especially in financial applications: false positives lead to customer dissatisfaction and operational costs, while false negatives result in financial losses.
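
For reference, the threshold-based metrics listed above can be written in terms of the confusion-matrix counts for the fraud class, where $TP$, $FP$ and $FN$ denote true positives, false positives and false negatives:

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```

None of these uses the true negatives, which is why they stay informative when the negative class dominates.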


### Auxiliary function

First of all, let's define a helper function to evaluate our models:

In [70]:
# Print basic metrics
def eval_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print(classification_report(y_test, y_pred))
    print("ROC AUC:", roc_auc_score(y_test, y_prob))
    print("PR AUC:", average_precision_score(y_test, y_prob))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

# Plotting helper: ROC and PR for multiple models
def plot_roc_pr(models, X_test, y_test, figsize=(12, 5)):
    """Plot ROC and Precision-Recall curves for a dict of fitted models.
    models: dict{name: model}
    """
    # Create both axes up front and pass them explicitly; without `ax=`,
    # each from_estimator call would open a new figure of its own.
    fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=figsize)

    # ROC subplot
    for name, model in models.items():
        try:
            RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax_roc)
        except Exception as e:
            print(f"Could not plot ROC for {name}: {e}")
    ax_roc.set_title('ROC Curves')

    # PR subplot
    for name, model in models.items():
        try:
            PrecisionRecallDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax_pr)
        except Exception as e:
            print(f"Could not plot PR for {name}: {e}")
    ax_pr.set_title('Precision-Recall Curves')
    fig.tight_layout()
    plt.show()

### 5.1 Logistic Regression

We use the `class_weight='balanced'` parameter to handle the class imbalance.
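
As a quick illustration of what `'balanced'` means, scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula by hand to the training-split class counts (7886 legitimate, 114 fraud, as reported in the later cells); the variable names are illustrative.

```python
# Class counts in the training split.
n_neg, n_pos = 7886, 114
n_samples = n_neg + n_pos
n_classes = 2

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
w_neg = n_samples / (n_classes * n_neg)
w_pos = n_samples / (n_classes * n_pos)

print(f"weight for class 0: {w_neg:.4f}")
print(f"weight for class 1: {w_pos:.4f}")
print(f"ratio w_pos / w_neg: {w_pos / w_neg:.2f}")
```

The ratio between the two weights equals the negative-to-positive ratio, so each fraud sample counts roughly as much as 69 legitimate ones during fitting.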

In [71]:
logreg = LogisticRegression(
    class_weight='balanced',
    max_iter=500,
    n_jobs=-1,
    random_state=42
)
logreg.fit(X_train, y_train)
print("Logistic Regression model trained successfully.")
print(f"Training samples: {len(X_train)}, Fraud rate: {y_train.mean():.4f}")
print(f"Test samples: {len(X_test)}, Fraud rate: {y_test.mean():.4f}")
eval_model(logreg, X_test, y_test)

Logistic Regression model trained successfully.
Training samples: 8000, Fraud rate: 0.0143
Test samples: 2000, Fraud rate: 0.0145
              precision    recall  f1-score   support

           0       0.99      0.58      0.73      1971
           1       0.01      0.41      0.03        29

    accuracy                           0.58      2000
   macro avg       0.50      0.50      0.38      2000
weighted avg       0.97      0.58      0.72      2000

ROC AUC: 0.48805962315645834
PR AUC: 0.014218485662085913
Confusion Matrix:
[[1152  819]
 [  17   12]]
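
To connect the report with the confusion matrix, the sketch below recomputes the fraud-class metrics by hand from the four counts printed above. scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`; the variable names are illustrative.

```python
# Confusion matrix printed above, laid out as [[TN, FP], [FN, TP]].
tn, fp = 1152, 819
fn, tp = 17, 12

precision = tp / (tp + fp)            # share of flagged transactions that are fraud
recall = tp / (tp + fn)               # share of frauds that were flagged
f1 = 2 * precision * recall / (precision + recall)
false_alarm_rate = fp / (fp + tn)     # share of legitimate transactions flagged

print(f"precision:        {precision:.3f}")
print(f"recall:           {recall:.3f}")
print(f"F1:               {f1:.3f}")
print(f"false alarm rate: {false_alarm_rate:.3f}")
```

The rounded values match the classification report (0.01, 0.41, 0.03). The model flags legitimate and fraudulent transactions at almost the same rate, which is consistent with a ROC AUC close to 0.5.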


### 5.2 Random Forest

We can also set `class_weight='balanced_subsample'` to handle the imbalanced data. With this option, the class weights are recomputed from the bootstrap sample of each tree.

In [72]:
rf = RandomForestClassifier(
    n_estimators=400,
    random_state=42,
    n_jobs=-1,
    class_weight='balanced_subsample',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
)
rf.fit(X_train, y_train)
eval_model(rf, X_test, y_test)

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1971
           1       0.00      0.00      0.00        29

    accuracy                           0.99      2000
   macro avg       0.49      0.50      0.50      2000
weighted avg       0.97      0.99      0.98      2000

ROC AUC: 0.4752882310747214
PR AUC: 0.01536493223531143
Confusion Matrix:
[[1971    0]
 [  29    0]]


### 5.3 XGBoost

XGBoost is a powerful gradient boosting framework that often provides excellent performance for tabular data like transaction datasets.

It's particularly effective for this type of binary classification problem due to its ability to handle imbalanced data and capture complex feature interactions effectively.

To deal with the class imbalance in our fraud detection dataset, we'll use the `scale_pos_weight` parameter to give higher weight to the minority class (fraudulent transactions).

The `scale_pos_weight` parameter is calculated as the ratio of negative samples to positive samples to balance the classes.

In [73]:
# Calculating the scale_pos_weight
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()

scale = neg / pos
print(f"Negative samples: {neg}, Positive samples: {pos}")
print(f"Scale factor: {scale:.2f}")
print(f"scale_pos_weight for XGBoost: {scale:.2f}")

Negative samples: 7886, Positive samples: 114
Scale factor: 69.18
scale_pos_weight for XGBoost: 69.18


This means each fraudulent training sample is weighted about 69 times more than a legitimate one, so missing a fraud is penalized far more heavily than raising a false alarm.
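
The effect of the weight can be sketched with a hand-written weighted log loss. This is only an illustration that assumes `scale_pos_weight` acts as a multiplier on the loss contribution of positive samples; it is not XGBoost's internal code, and `weighted_logloss` is a hypothetical helper.

```python
import math

def weighted_logloss(y_true, p_fraud, scale_pos_weight=1.0):
    """Per-sample log loss where positive (fraud) samples are multiplied
    by scale_pos_weight; negative samples keep weight 1."""
    if y_true == 1:
        return -scale_pos_weight * math.log(p_fraud)
    return -math.log(1.0 - p_fraud)

scale = 7886 / 114  # neg / pos, as computed in the cell above

# Missing a fraud: true label 1, predicted fraud probability 0.1.
loss_missed_fraud = weighted_logloss(1, 0.1, scale)
# The same mistake without any reweighting.
loss_unweighted = weighted_logloss(1, 0.1)
# A false alarm of similar confidence: true label 0, predicted 0.9.
loss_false_alarm = weighted_logloss(0, 0.9, scale)

print(f"missed fraud (weighted):   {loss_missed_fraud:.2f}")
print(f"missed fraud (unweighted): {loss_unweighted:.2f}")
print(f"false alarm:               {loss_false_alarm:.2f}")
```

Under this assumption, an equally confident mistake costs about 69 times more on a fraud than on a legitimate transaction.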

Training and evaluating the model:

In [41]:
xgb = XGBClassifier(
    n_estimators=400,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.9,
    colsample_bytree=0.8,
    objective="binary:logistic",
    eval_metric="logloss",
    scale_pos_weight=scale,
    n_jobs=-1,
    random_state=42
)

xgb.fit(X_train, y_train)
eval_model(xgb, X_test, y_test)

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1971
           1       0.00      0.00      0.00        29

    accuracy                           0.98      2000
   macro avg       0.49      0.50      0.50      2000
weighted avg       0.97      0.98      0.98      2000

ROC AUC: 0.5675396700432128
PR AUC: 0.01980903675577001
Confusion Matrix:
[[1965    6]
 [  29    0]]


### 5.4 LightGBM

We train LightGBM with `is_unbalance=True`, which makes LightGBM reweight the classes internally. Alternatively, we could provide `scale_pos_weight` or `class_weight`; note that `is_unbalance` and `scale_pos_weight` should not be set at the same time.

In [None]:
import lightgbm as lgb

lgbm = lgb.LGBMClassifier(
    n_estimators=400,
    learning_rate=0.05,
    max_depth=-1,
    subsample=0.9,
    colsample_bytree=0.8,
    n_jobs=-1,
    random_state=42,
    is_unbalance=True  # quick handling of imbalance
)

lgbm.fit(X_train, y_train)
print('LightGBM trained')
eval_model(lgbm, X_test, y_test)

[LightGBM] [Info] Number of positive: 114, number of negative: 7886
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.025854 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5355
[LightGBM] [Info] Number of data points in the train set: 8000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.014250 -> initscore=-4.236646
[LightGBM] [Info] Start training from score -4.236646
LightGBM trained
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      1971
           1       0.00      0.00      0.00        29

    accuracy                           0.99      2000
   macro avg       0.49      0.50      0.50      2000
weighted avg       0.97      0.99      0.98      2000

ROC AUC: 0.5689042845396175
PR AUC: 0.018466551261816604
Confusion Matrix:
[[1971    0]
 [  29    0]]


## 6. Comparative plots and summary

In [58]:
models = {
    'XGBoost': xgb,
    'LightGBM': lgbm,
    'Random Forest': rf,
    'Logistic Regression': logreg,
}

results = []
for name, model in models.items():
    if hasattr(model, 'predict_proba'):
        y_prob = model.predict_proba(X_test)[:, 1]
    else:
        try:
            y_score = model.decision_function(X_test)
            y_prob = 1 / (1 + np.exp(-y_score))
        except Exception:
            y_prob = model.predict(X_test)
    roc = roc_auc_score(y_test, y_prob)
    pr = average_precision_score(y_test, y_prob)
    results.append({'Model': name, 'ROC_AUC': roc, 'PR_AUC': pr})

summary_df = pd.DataFrame(results).sort_values('PR_AUC', ascending=False).reset_index(drop=True)
print(summary_df)

                 Model   ROC_AUC    PR_AUC
0              XGBoost  0.567540  0.019809
1             LightGBM  0.568904  0.018467
2        Random Forest  0.475288  0.015365
3  Logistic Regression  0.488060  0.014218


**Explanation:**
- We rank models by PR_AUC (average precision) because Precision-Recall is more informative on imbalanced datasets.
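
To put the PR_AUC values in context, a model that assigns random scores has an expected average precision of roughly the fraud prevalence in the test set. The sketch below compares each model with that chance level, using only numbers printed above; the dictionary is copied by hand from the summary table.

```python
# Test-set composition from the classification reports above.
n_test, n_fraud_test = 2000, 29
prevalence = n_fraud_test / n_test  # approximate chance-level PR AUC

# PR AUC values copied from the summary table.
pr_auc = {
    "XGBoost": 0.019809,
    "LightGBM": 0.018467,
    "Random Forest": 0.015365,
    "Logistic Regression": 0.014218,
}

print(f"Chance-level PR AUC (prevalence): {prevalence:.4f}")
for name, value in pr_auc.items():
    print(f"{name:<20} lift over chance: {value / prevalence:.2f}x")
```

A lift close to 1 means the ranking of transactions is barely better than random, so the differences between the models in the table should be read with caution.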

-----

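## Conclusion

- Hypothesis generation is crucial for fraud detection.
- Imbalanced classes must be handled.
- Sometimes much simpler algorithms are better than complex ones. In this run, however, XGBoost and LightGBM ranked above Random Forest and Logistic Regression by PR AUC, and Random Forest's processing time was much higher.
- All four models scored close to chance level (ROC AUC between 0.48 and 0.57), so none of them can yet be considered a reliable fraud detector.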