# Neural Network with PyTorch Lightning

As stated in the assignment prompt, my goal was not to create the perfect model for fraud detection. Rather, my objective was to demonstrate my skills in data science and software engineering.

To that end, I built this neural network with the PyTorch Lightning framework, even though classical machine learning models (for example, the XGBoost model used elsewhere in this project) are often better suited to heavily imbalanced problems like this one and need fewer resources.

I appreciate PyTorch Lightning because it organizes PyTorch code into clear modules and makes it easy to plug in services such as logging and callbacks.

## Imports

In [1]:
import os
import torch
import numpy as np
import pandas as pd
from torch.nn import BCEWithLogitsLoss
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight


from fia.nn.train import TrainerManager
from fia.nn.model import LightningFraudClassifier
from fia.nn.classifier import FraudDetectionModel
from fia.nn.data import DataModule

from fia.preprocess import preprocess_with_labelencoder
from fia.nn.metrics import get_metrics
from fia.plots import PerformancePlotter
from fia.constants import (
    COL_BANK_MONTHS_COUNT,
    COL_PREV_ADDRESS_MONTHS_COUNT,
    COL_VELOCITY_4W,
    COL_DF_LABEL_FRAUD
)



## Load the data

In [2]:
FILENAME = "./data/Base.csv"
df = pd.read_csv(FILENAME)
df.shape

(1000000, 32)

## Remove the features that bring bias

From the Exploratory Data Analysis notebook, we identified features that do not contribute to improving the model, so we decided to remove these columns.

In [3]:
df = df.drop(columns=[
    COL_BANK_MONTHS_COUNT, 
    COL_PREV_ADDRESS_MONTHS_COUNT,
    COL_VELOCITY_4W
    ]
)

## Remove the empty rows

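As we observed during the EDA, very little data is missing. Unlike XGBoost, which handles missing values natively, a neural network cannot consume them directly, so the simplest starting point is to drop the affected rows. In the columns listed below, missing values are encoded as -1.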

In [4]:
cols_missing = [
    'current_address_months_count',
    'session_length_in_minutes',
    'device_distinct_emails_8w',
    'intended_balcon_amount'
]

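# Missing values are encoded as -1 in these columns; convert them to NaN
df[cols_missing] = df[cols_missing].replace(-1, np.nan)

# Drop every row that contains at least one missing value
df = df.dropna()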
df.shape

(993607, 29)

## Preprocessing

In [5]:
df_preprocessed, label_encoder, scaler = preprocess_with_labelencoder(
    df=df, 
    col_label=COL_DF_LABEL_FRAUD
)

## Split the dataframe into train, validation, and test sets

In [6]:
test_size = 0.30
val_size = 0.5

train_df, test_df = train_test_split(
    df_preprocessed,
    test_size=test_size,
    random_state=42,
    shuffle=True,
    stratify=df_preprocessed[COL_DF_LABEL_FRAUD],
)

# Split the held-out 30% evenly into test and validation dataframes
test_df, val_df = train_test_split(
    test_df,
    test_size=val_size,
    shuffle=True,
    random_state=42,
    stratify=test_df[COL_DF_LABEL_FRAUD],
)
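As a sanity check on the two-stage split, the arithmetic below reproduces the partition sizes and the number of batches per epoch. It is a minimal sketch that assumes the held-out part of each split is rounded up (which is how scikit-learn sizes the test portion) and that the last incomplete batch is not dropped; the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples: int, held_out_fraction: float) -> tuple[int, int]:
    """Return (n_kept, n_held_out) for a fractional split.

    Assumption: the held-out part is rounded up, the rest is kept.
    """
    n_held_out = math.ceil(held_out_fraction * n_samples)
    return n_samples - n_held_out, n_held_out

n_rows = 993_607   # rows left after dropna(), from the shape printed earlier
batch_size = 128   # batch size used by the DataModule in this notebook

# First split: 70% train, 30% held out
n_train, n_holdout = split_sizes(n_rows, 0.30)
# Second split: the held-out rows are divided into test (kept) and validation (held out)
n_test, n_val = split_sizes(n_holdout, 0.5)

print(n_train, n_val, n_test)
print(math.ceil(n_train / batch_size), math.ceil(n_test / batch_size))
```

Under these assumptions the result is roughly a 70/15/15 split, and the batch counts agree with the 5434 training steps and 1165 test steps shown in the progress bars of the training and test cells.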

## Compute the class weights

In [7]:
# Compute the class weights.
# np.unique returns the classes sorted, so index 1 is always the positive class.
class_weights = compute_class_weight(
    class_weight="balanced",
    classes=np.unique(train_df[COL_DF_LABEL_FRAUD]),
    y=train_df[COL_DF_LABEL_FRAUD],
)
tensor_class_weights = torch.tensor(data=class_weights, dtype=torch.float32)
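The "balanced" heuristic documented by scikit-learn weights each class by `n_samples / (n_classes * n_samples_in_class)`. The following pure-Python sketch reimplements that formula to make the computation explicit; the helper name `balanced_class_weights` and the toy labels are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels: list[int]) -> dict[int, float]:
    """Weight each class by n_samples / (n_classes * n_samples_in_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    # Iterate over the sorted classes so the result has a stable order
    return {cls: n_samples / (n_classes * counts[cls]) for cls in sorted(counts)}

# Toy labels (invented): 98 legitimate rows and 2 fraud rows
toy_labels = [0] * 98 + [1] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarer the class, the larger its weight. Read in reverse, a weight of about 45 for the positive class, as printed in the next cell, corresponds to a fraud rate of roughly 1 / (2 * 45.19), or about 1.1% of the training rows.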

## Create the Lightning Datamodule

In [8]:
pl_datamodule = DataModule(
    train_df=train_df,
    val_df=val_df,
    test_df=test_df, 
    batch_size=128,
    prefetch_factor=2,
    persistent_workers=True
)

num_classes = tensor_class_weights.shape[0]
print(tensor_class_weights)

tensor([ 0.5056, 45.1874])


## Create the Lightning Module

In [9]:
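# Number of input features: every column of the training dataframe except the label
model = FraudDetectionModel(train_df.shape[1] - 1)
metrics = get_metrics(num_classes=num_classes)
# pos_weight scales the loss contribution of the positive (fraud) samples
criterion = BCEWithLogitsLoss(pos_weight=tensor_class_weights[1])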

pl_model = LightningFraudClassifier(
    num_classes=num_classes,
    model=model,
    metrics=metrics,
    criterion=criterion,
)
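To make the effect of `pos_weight` concrete, the sketch below writes out the per-sample loss that `BCEWithLogitsLoss` documents, `-(pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x)))`, in plain Python. It is an illustration rather than PyTorch's implementation; the function name and the example logit are hypothetical.

```python
import math

def weighted_bce_with_logits(logit: float, target: float, pos_weight: float = 1.0) -> float:
    """Per-sample binary cross-entropy on a raw logit, positives scaled by pos_weight."""
    # log(sigmoid(x)) = -log(1 + exp(-x)) and log(1 - sigmoid(x)) = -log(1 + exp(x))
    log_p = -math.log1p(math.exp(-logit))
    log_not_p = -math.log1p(math.exp(logit))
    return -(pos_weight * target * log_p + (1.0 - target) * log_not_p)

logit = -2.0  # hypothetical model output: a confident "not fraud" score

# The same wrong prediction on a fraud sample, with and without the weight
loss_missed_fraud = weighted_bce_with_logits(logit, target=1.0, pos_weight=45.1874)
loss_unweighted = weighted_bce_with_logits(logit, target=1.0)
# Negative samples are not affected by pos_weight
loss_legit = weighted_bce_with_logits(logit, target=0.0, pos_weight=45.1874)

print(loss_missed_fraud / loss_unweighted)
```

A missed fraud is penalized `pos_weight` times more than it would be without the weight, while legitimate samples are unaffected. The PyTorch documentation suggests the ratio of negative to positive samples as a starting value for `pos_weight`; with the class weights printed above, that ratio would be about 45.1874 / 0.5056 ≈ 89, roughly twice the value used here, so it may be worth trying as an alternative.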

## Create the Trainer

In [10]:
run_datadir = "./model_trained"

trainer = TrainerManager(
    pl_datamodule=pl_datamodule,
    pl_model=pl_model,
    run_datadir=run_datadir
)

## Train the model

In [11]:
model_trained, _ = trainer.train(epochs=22, use_gpu=True)

Seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 4090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
c:\Users\rodri\Documents\AI\FIA\.venv\lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:654: Checkpoint directory C:\Users\rodri\Documents\AI\FIA\model_trained\checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type                | Params | Mode 
--------------------------------------------------------------
0 | metrics       | MetricCollection    | 0      | train
1 | model         | FraudDetectionModel | 12.4 K | train
2 | criterion  

Epoch 0: 100%|██████████| 5434/5434 [00:52<00:00, 103.47it/s, v_num=343a]

Metric val_loss improved. New best score: 0.646


Epoch 1: 100%|██████████| 5434/5434 [00:28<00:00, 188.31it/s, v_num=343a]

Metric val_loss improved by 0.012 >= min_delta = 0.0. New best score: 0.634


Epoch 2: 100%|██████████| 5434/5434 [00:28<00:00, 187.91it/s, v_num=343a]

Metric val_loss improved by 0.004 >= min_delta = 0.0. New best score: 0.630


Epoch 3: 100%|██████████| 5434/5434 [00:28<00:00, 188.67it/s, v_num=343a]

Metric val_loss improved by 0.006 >= min_delta = 0.0. New best score: 0.624


Epoch 4: 100%|██████████| 5434/5434 [00:27<00:00, 194.56it/s, v_num=343a]

Metric val_loss improved by 0.005 >= min_delta = 0.0. New best score: 0.618


Epoch 5: 100%|██████████| 5434/5434 [00:28<00:00, 190.07it/s, v_num=343a]

Metric val_loss improved by 0.003 >= min_delta = 0.0. New best score: 0.615


Epoch 7: 100%|██████████| 5434/5434 [00:28<00:00, 192.67it/s, v_num=343a]

Metric val_loss improved by 0.000 >= min_delta = 0.0. New best score: 0.615


Epoch 8: 100%|██████████| 5434/5434 [00:28<00:00, 192.75it/s, v_num=343a]

Metric val_loss improved by 0.003 >= min_delta = 0.0. New best score: 0.612


Epoch 9: 100%|██████████| 5434/5434 [00:28<00:00, 193.95it/s, v_num=343a]

Metric val_loss improved by 0.002 >= min_delta = 0.0. New best score: 0.610


Epoch 10: 100%|██████████| 5434/5434 [00:29<00:00, 187.36it/s, v_num=343a]

Metric val_loss improved by 0.001 >= min_delta = 0.0. New best score: 0.608


Epoch 11: 100%|██████████| 5434/5434 [00:28<00:00, 187.93it/s, v_num=343a]

Metric val_loss improved by 0.000 >= min_delta = 0.0. New best score: 0.608


Epoch 12: 100%|██████████| 5434/5434 [00:28<00:00, 190.24it/s, v_num=343a]

Metric val_loss improved by 0.002 >= min_delta = 0.0. New best score: 0.606


Epoch 13: 100%|██████████| 5434/5434 [00:28<00:00, 193.02it/s, v_num=343a]

Metric val_loss improved by 0.003 >= min_delta = 0.0. New best score: 0.602


Epoch 16: 100%|██████████| 5434/5434 [00:28<00:00, 193.97it/s, v_num=343a]

Monitored metric val_loss did not improve in the last 3 records. Best score: 0.602. Signaling Trainer to stop.


Epoch 16: 100%|██████████| 5434/5434 [00:28<00:00, 193.58it/s, v_num=343a]
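The log above shows early stopping at work: the best `val_loss` was reached at epoch 13, and training stopped at epoch 16 after 3 validation checks without improvement, with `min_delta = 0.0`. The patience rule can be sketched as follows. This is a simplified illustration, not Lightning's implementation, and the loss values are hypothetical.

```python
from typing import Optional

def first_stop_epoch(
    val_losses: list[float], patience: int = 3, min_delta: float = 0.0
) -> Optional[int]:
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best score: remember it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses, not the values from the run above
losses = [0.65, 0.63, 0.62, 0.625, 0.61, 0.612, 0.611, 0.613, 0.60]
print(first_stop_epoch(losses))
```

In this toy sequence the best loss before stopping is at epoch 4 and the run stops at epoch 7, so the final value of 0.60 is never reached. This is the usual trade-off of a small patience value.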


## Test the model

In [12]:
test_metrics = trainer.test()

Restoring states from the checkpoint path at C:\Users\rodri\Documents\AI\FIA\model_trained\checkpoints\epoch=13-val_loss=0.60-v1.ckpt
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loaded model weights from the checkpoint at C:\Users\rodri\Documents\AI\FIA\model_trained\checkpoints\epoch=13-val_loss=0.60-v1.ckpt


Testing DataLoader 0: 100%|██████████| 1165/1165 [00:04<00:00, 273.19it/s]
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 test_accuracy_weighted     0.8883662819862366
       test_auroc           0.8881074786186218
 test_f1_score_weighted     0.1182829886674881
        test_loss           0.5866850018501282
 test_precision_weighted    0.06480459868907928
  test_recall_weighted      0.6767737865447998
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
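The table combines a high recall with a very low precision, which is typical of a heavily weighted positive class. A short calculation helps to read it: the reported F1 score is the harmonic mean of the reported precision and recall, and the precision translates directly into a number of false alarms per detected fraud. The variable names below are chosen for illustration only.

```python
precision = 0.06480459868907928  # test_precision_weighted from the table above
recall = 0.6767737865447998      # test_recall_weighted from the table above

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# False alarms per correctly flagged fraud: FP / TP = 1 / precision - 1
false_alarms_per_hit = 1 / precision - 1

print(round(f1, 4), round(false_alarms_per_hit, 1))
```

The computed F1 matches `test_f1_score_weighted`, which suggests that these three metrics describe the positive (fraud) class. In practical terms, the model catches about two thirds of the frauds in the test set at the cost of roughly 14 false alarms for each fraud it flags correctly.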


## Load the trained model from the best local checkpoint

Let's evaluate the model with the same plotter used for the XGBoost model. To do that, we first need to load the state dict from the best checkpoint.

In [13]:
from fia.nn.model import FraudDetectionModel
from pathlib import Path

path_checkpoints_folder = Path(run_datadir, "checkpoints")
listfile_in_checkpoint = os.listdir(path_checkpoints_folder)
checkpoint_file = Path(path_checkpoints_folder, listfile_in_checkpoint[0])
print(checkpoint_file)

checkpoints = torch.load(checkpoint_file)

model_trained\checkpoints\epoch=13-val_loss=0.60-v1.ckpt
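Taking the first entry of `os.listdir` works when the folder holds a single checkpoint, but the listing order is arbitrary and the training log warned that the checkpoint directory was not empty. One possible way to make the choice explicit is to parse the validation loss from the file names. The sketch below assumes the naming pattern seen in the path printed above; the helper `pick_best_checkpoint` and the example listing are hypothetical.

```python
import re
from typing import Optional

# Matches names like "epoch=13-val_loss=0.60-v1.ckpt" (pattern inferred from the printed path)
_CKPT_PATTERN = re.compile(r"epoch=(\d+)-val_loss=(\d+(?:\.\d+)?)")

def pick_best_checkpoint(filenames: list[str]) -> Optional[str]:
    """Return the .ckpt name with the lowest val_loss encoded in its file name."""
    candidates = []
    for name in filenames:
        match = _CKPT_PATTERN.search(name)
        if name.endswith(".ckpt") and match:
            epoch, val_loss = int(match.group(1)), float(match.group(2))
            # Sort by loss first; prefer the later epoch when losses tie
            candidates.append((val_loss, -epoch, name))
    return min(candidates)[2] if candidates else None

# Hypothetical directory listing for illustration
names = ["epoch=05-val_loss=0.62.ckpt", "epoch=13-val_loss=0.60-v1.ckpt", "notes.txt"]
print(pick_best_checkpoint(names))
```

Because the file name stores the loss rounded to two decimals, ties are possible. If the `ModelCheckpoint` callback object is accessible, its `best_model_path` attribute avoids parsing names altogether.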




In [14]:
# Get the input dimension of the model from the checkpoint
inputs_dim = checkpoints["state_dict"]["model.fc1.weight"].shape[1]

# Keep only the weights of the wrapped network and strip the "model." prefix.
# Other entries, such as "criterion.pos_weight", do not belong to FraudDetectionModel.
state_dict = checkpoints["state_dict"]
new_state_dict = {
    k[len("model."):]: v
    for k, v in state_dict.items()
    if k.startswith("model.")
}

In [15]:
# Load the model from the checkpoint and switch it to inference mode
model_loaded = FraudDetectionModel(inputs_dim)
model_loaded.load_state_dict(new_state_dict)
model_loaded.eval()

# Load the test dataloader from the DataModule
test_dataloader = pl_datamodule.test_dataloader()

y_true = []
y_probs = []

with torch.no_grad():  # Disable gradient computation
    for inputs, targets in test_dataloader:
        outputs = model_loaded(inputs)
        # Sigmoid turns the logits into probabilities for the positive class
        positive_probs = torch.sigmoid(outputs).reshape(-1).cpu().numpy()

        y_true.extend(targets.reshape(-1).cpu().numpy())
        y_probs.extend(positive_probs)

If the filter lets the criterion entry through (for example, when the key is misspelled as "critirion.pos_weight"), `load_state_dict` fails with:

RuntimeError: Error(s) in loading state_dict for FraudDetectionModel:
	Unexpected key(s) in state_dict: "criterion.pos_weight". 

In [None]:
plotter = PerformancePlotter()
plotter.plot_metrics(y_true, y_probs) 