Author:
        
        PARK, JunHo, junho@ccnets.org

        
        KIM, JeongYoong, jeongyoong@ccnets.org
        
    COPYRIGHT (c) 2024. CCNets. All Rights reserved.

# Credit Card Fraud Detection: Handling Imbalanced Dataset with CCNet

## Introduction

This tutorial explores the use of a Cooperative Encoding Network (CCNet) to address challenges associated with imbalanced datasets in the domain of credit card fraud detection. By leveraging the power of synthetic data generation, we aim to enhance the diversity and volume of training data, thereby improving the robustness and accuracy of models designed to identify fraudulent transactions.

## Tutorial Goals

The objectives of this tutorial are designed to guide you through the process of enhancing data quality and model performance:

### Dataset Recreation with CCNet
- **Understand Data Augmentation**: Learn how encoding techniques can be used to generate synthetic data instances that closely mimic the characteristics of real-world fraudulent and non-fraudulent transactions.
- **Impact on Model Training**: Assess how augmenting the dataset influences the training process and subsequently, the model's ability to generalize from training to real-world scenarios.

### Model Training and Evaluation
- **Dual Model Training**: Train two distinct models to directly compare performance metrics:
  - A model trained on the **original dataset**.
  - A model trained on the **CCNet-augmented dataset**.
- **Performance Metrics**: Use the F1 score, a critical measure for models operating on imbalanced datasets, to evaluate and compare the effectiveness of these models.

### Testing and Validation
- **Independent Model Testing**: Conduct a thorough evaluation of both models using a standalone test set that was not involved in the training phase.
- **Objective Analysis**: Critically analyze the outcomes to validate whether data augmentation through CCNet offers a tangible benefit in detecting credit card fraud.

## Conclusion

By the end of this tutorial, participants will not only grasp the theoretical underpinnings of using synthetic data to combat data imbalance but also gain hands-on experience in applying these concepts through CCNet to potentially enhance model performance in fraud detection tasks.


In [1]:
import sys
path_append = "../"
sys.path.append(path_append)  # Go up one directory from where you are.

import torch
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler

In [2]:
dataroot = path_append + "../data/credit_card_fraud_detection/creditcard.csv"
df = pd.read_csv(dataroot)
df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.166480,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.167170,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.379780,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.108300,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.50,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.206010,0.502292,0.219422,0.215153,69.99,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
284802,172786.0,-11.881118,10.071785,-9.834783,-2.066656,-5.364473,-2.606837,-4.918215,7.305334,1.914428,...,0.213454,0.111864,1.014480,-0.509348,1.436807,0.250034,0.943651,0.823731,0.77,0
284803,172787.0,-0.732789,-0.055080,2.035030,-0.738589,0.868229,1.058415,0.024330,0.294869,0.584800,...,0.214205,0.924384,0.012463,-1.016226,-0.606624,-0.395255,0.068472,-0.053527,24.79,0
284804,172788.0,1.919565,-0.301254,-3.249640,-0.557828,2.630515,3.031260,-0.296827,0.708417,0.432454,...,0.232045,0.578229,-0.037501,0.640134,0.265745,-0.087371,0.004455,-0.026561,67.88,0
284805,172788.0,-0.240440,0.530483,0.702510,0.689799,-0.377961,0.623708,-0.686180,0.679145,0.392087,...,0.265245,0.800049,-0.163298,0.123205,-0.569159,0.546668,0.108821,0.104533,10.00,0


In [3]:
print('No Frauds', round(df['Class'].value_counts()[0] / len(df) *100,2), '%of the dataset')
print('Frauds', round(df['Class'].value_counts()[1] / len(df) *100,2), '%of the dataset')

No Frauds 99.83 %of the dataset
Frauds 0.17 %of the dataset


In [4]:
# https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_7_DeepLearning/FeedForwardNeuralNetworks.html
class LabeledDataset(torch.utils.data.Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        vals = torch.tensor(self.x[index], dtype = torch.float32)
        label = torch.tensor(self.y[index], dtype = torch.float32)
        return vals, label

class UnlabelledDataset(torch.utils.data.Dataset):
    def __init__(self, x):
        self.x = x
        
    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        vals = torch.tensor(self.x[index], dtype = torch.float32)
        return vals, None

sc = StandardScaler()
df.iloc[:, :-1] = sc.fit_transform(df.iloc[:, :-1])
n_elements = df.shape[1]
print(n_elements)

#### Initial Setup and Model Configuration

This section initializes the environment by setting a fixed random seed to ensure reproducibility of results. It imports necessary configurations and initializes model parameters with specific configurations. The model specified here is set to have no core model but uses a 'deepfm' encoder model for data processing, which is particularly tailored for structured or tabular data like credit card transactions.


In [6]:
# Set a fixed random seed for reproducibility of experiments
from nn.utils.init import set_random_seed
set_random_seed(0)

# Importing configuration setups for ML parameters and data
from tools.setting.ml_params import MLParameters
from tools.setting.data_config import DataConfig
from trainer_hub import TrainerHub

# Configuration for the data handling, defining dataset specifics and the task type
data_config = DataConfig(dataset_name='CreditCardFraudDetection', task_type='augmentation', obs_shape=[n_elements], label_size=None)

# Initializing ML parameters without a core model and setting the encoder model to 'deepfm' with specific configurations
ml_params = MLParameters(core_model='none', encoder_model='deepfm')
ml_params.encoder_config.num_layers = 4
ml_params.encoder_config.d_model = 256

# Setting training parameters and device configuration
ml_params.training.num_epoch = 10
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

# Create a TrainerHub instance to manage training and data processing
trainer_hub = TrainerHub(ml_params, data_config, device, use_print=True, use_wandb=False, use_full_eval=False)


  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(


#### Dataset Splitting for Training and Testing

The original dataset is split into training and testing parts to evaluate the model's performance accurately. This step is crucial for validating the effectiveness of the training on unseen data.


In [7]:
# Splitting the dataset into training and test sets for model evaluation
df_train, df_test = train_test_split(df, test_size=0.5, shuffle=False)
X_train, y_train = df_train.iloc[:, :-1].values, df_train.iloc[:, -1:].values
X_test, y_test = df_test.iloc[:, :-1].values, df_test.iloc[:, -1:].values

# Preparing the unlabelled and labelled datasets for use in training and testing
_df_train = df_train.iloc[:, :].values 
unlabelled_trainset = UnlabelledDataset(_df_train)
trainset = LabeledDataset(X_test, y_test)
testset = LabeledDataset(X_test, y_test)

In [8]:
trainer_hub.train(unlabelled_trainset)

Epochs:   0%|          | 0/10 [00:00<?, ?it/s]

Iterations:   0%|          | 0/2225 [00:00<?, ?it/s]

[0/10][50/2225][Time 2.02]
Unified LR across all optimizers: 0.0001995308238189185
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.5559	Gen: 0.5476	Rec: 0.5942	E: 0.5092	R: 0.6025	P: 0.5859
[0/10][100/2225][Time 1.73]
Unified LR across all optimizers: 0.00019907191565870155
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.5097	Gen: 0.4905	Rec: 0.5432	E: 0.4570	R: 0.5624	P: 0.5240
[0/10][150/2225][Time 1.66]
Unified LR across all optimizers: 0.00019861406295796434
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.4619	Gen: 0.4461	Rec: 0.4918	E: 0.4162	R: 0.5075	P: 0.4760
[0/10][200/2225][Time 1.64]
Unified LR across all optimizers: 0.00019815726328921765
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.4097	Gen: 0.4073	Rec: 0.4369	E: 0.3801	R: 0.4392	P: 0.4345
[0/10][250/2225][Time 1.58]
Unified LR across all optimizers: 0.00019770151423055492
-----------------

Iterations:   0%|          | 0/2225 [00:00<?, ?it/s]

[1/10][25/2225][Time 2.04]
Unified LR across all optimizers: 0.00018030592393534033
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.1200	Gen: 0.1265	Rec: 0.1284	E: 0.1181	R: 0.1218	P: 0.1349
[1/10][75/2225][Time 1.86]
Unified LR across all optimizers: 0.0001798912318178735
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.1184	Gen: 0.1236	Rec: 0.1266	E: 0.1154	R: 0.1214	P: 0.1317
[1/10][125/2225][Time 1.76]
Unified LR across all optimizers: 0.00017947749346581006
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.1143	Gen: 0.1209	Rec: 0.1222	E: 0.1130	R: 0.1155	P: 0.1288
[1/10][175/2225][Time 1.76]
Unified LR across all optimizers: 0.0001790647066855505
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.1198	Gen: 0.1261	Rec: 0.1277	E: 0.1182	R: 0.1213	P: 0.1340
[1/10][225/2225][Time 1.87]
Unified LR across all optimizers: 0.00017865286928854052
-------------------

Iterations:   0%|          | 0/2225 [00:00<?, ?it/s]

[2/10][0/2225][Time 1.95]
Unified LR across all optimizers: 0.00016293335327318117
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0871	Gen: 0.0944	Rec: 0.0916	E: 0.0899	R: 0.0843	P: 0.0988
[2/10][50/2225][Time 2.04]
Unified LR across all optimizers: 0.00016255861695947546
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0801	Gen: 0.0874	Rec: 0.0853	E: 0.0822	R: 0.0779	P: 0.0926
[2/10][100/2225][Time 1.75]
Unified LR across all optimizers: 0.00016218474251537463
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0822	Gen: 0.0892	Rec: 0.0883	E: 0.0831	R: 0.0813	P: 0.0953
[2/10][150/2225][Time 1.77]
Unified LR across all optimizers: 0.00016181172795863357
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0788	Gen: 0.0869	Rec: 0.0838	E: 0.0819	R: 0.0757	P: 0.0918
[2/10][200/2225][Time 2.02]
Unified LR across all optimizers: 0.0001614395713115662
-------------------

Iterations:   0%|          | 0/2225 [00:00<?, ?it/s]

[3/10][25/2225][Time 2.44]
Unified LR across all optimizers: 0.00014689600866445298
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0669	Gen: 0.0741	Rec: 0.0697	E: 0.0712	R: 0.0626	P: 0.0769
[3/10][75/2225][Time 2.65]
Unified LR across all optimizers: 0.00014655815721980301
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0641	Gen: 0.0713	Rec: 0.0669	E: 0.0684	R: 0.0597	P: 0.0742
[3/10][125/2225][Time 5.65]
Unified LR across all optimizers: 0.00014622108281191326
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0666	Gen: 0.0731	Rec: 0.0689	E: 0.0708	R: 0.0625	P: 0.0754
[3/10][175/2225][Time 5.98]
Unified LR across all optimizers: 0.00014588478365364866
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0694	Gen: 0.0757	Rec: 0.0724	E: 0.0726	R: 0.0661	P: 0.0787
[3/10][225/2225][Time 6.01]
Unified LR across all optimizers: 0.0001455492579619846
------------------

Iterations:   0%|          | 0/2225 [00:00<?, ?it/s]

[4/10][0/2225][Time 5.60]
Unified LR across all optimizers: 0.00013274250092153782
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0608	Gen: 0.0660	Rec: 0.0619	E: 0.0649	R: 0.0568	P: 0.0671
[4/10][50/2225][Time 6.19]
Unified LR across all optimizers: 0.00013243720164138364
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0628	Gen: 0.0683	Rec: 0.0640	E: 0.0672	R: 0.0585	P: 0.0695
[4/10][100/2225][Time 4.69]
Unified LR across all optimizers: 0.00013213260453008872
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0580	Gen: 0.0631	Rec: 0.0597	E: 0.0613	R: 0.0546	P: 0.0648
[4/10][150/2225][Time 5.42]
Unified LR across all optimizers: 0.00013182870797270977
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0538	Gen: 0.0591	Rec: 0.0567	E: 0.0562	R: 0.0514	P: 0.0620
[4/10][200/2225][Time 5.54]
Unified LR across all optimizers: 0.0001315255103580172
-------------------

Iterations:   0%|          | 0/2225 [00:00<?, ?it/s]

[5/10][25/2225][Time 2.03]
Unified LR across all optimizers: 0.00011967680756448871
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0472	Gen: 0.0527	Rec: 0.0493	E: 0.0507	R: 0.0438	P: 0.0547
[5/10][75/2225][Time 1.79]
Unified LR across all optimizers: 0.0001194015585451698
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0485	Gen: 0.0531	Rec: 0.0502	E: 0.0514	R: 0.0457	P: 0.0548
[5/10][125/2225][Time 1.79]
Unified LR across all optimizers: 0.0001191269425810282
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0444	Gen: 0.0497	Rec: 0.0469	E: 0.0473	R: 0.0415	P: 0.0522
[5/10][175/2225][Time 1.69]
Unified LR across all optimizers: 0.00011885295821607745
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0491	Gen: 0.0534	Rec: 0.0509	E: 0.0517	R: 0.0465	P: 0.0552
[5/10][225/2225][Time 1.78]
Unified LR across all optimizers: 0.0001185796039976797
--------------------

Iterations:   0%|          | 0/2225 [00:00<?, ?it/s]

[6/10][0/2225][Time 1.82]
Unified LR across all optimizers: 0.00010814588417241378
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0430	Gen: 0.0482	Rec: 0.0447	E: 0.0466	R: 0.0394	P: 0.0499
[6/10][50/2225][Time 1.86]
Unified LR across all optimizers: 0.00010789715554096363
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0503	Gen: 0.0542	Rec: 0.0507	E: 0.0538	R: 0.0468	P: 0.0547
[6/10][100/2225][Time 1.84]
Unified LR across all optimizers: 0.00010764899896949131
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0396	Gen: 0.0444	Rec: 0.0415	E: 0.0424	R: 0.0367	P: 0.0463
[6/10][150/2225][Time 1.81]
Unified LR across all optimizers: 0.00010740141314229549
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0492	Gen: 0.0542	Rec: 0.0493	E: 0.0541	R: 0.0443	P: 0.0543
[6/10][200/2225][Time 1.87]
Unified LR across all optimizers: 0.0001071543967467006
-------------------

Iterations:   0%|          | 0/2225 [00:00<?, ?it/s]

[7/10][25/2225][Time 2.00]
Unified LR across all optimizers: 9.750120782072374e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0400	Gen: 0.0450	Rec: 0.0408	E: 0.0442	R: 0.0358	P: 0.0458
[7/10][75/2225][Time 1.80]
Unified LR across all optimizers: 9.72769612655121e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0390	Gen: 0.0440	Rec: 0.0399	E: 0.0432	R: 0.0349	P: 0.0448
[7/10][125/2225][Time 1.77]
Unified LR across all optimizers: 9.705323046306541e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0478	Gen: 0.0534	Rec: 0.0471	E: 0.0542	R: 0.0415	P: 0.0527
[7/10][175/2225][Time 1.72]
Unified LR across all optimizers: 9.683001422718531e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0394	Gen: 0.0445	Rec: 0.0398	E: 0.0441	R: 0.0347	P: 0.0449
[7/10][225/2225][Time 1.80]
Unified LR across all optimizers: 9.660731137440147e-05
--------------------Tra

Iterations:   0%|          | 0/2225 [00:00<?, ?it/s]

[8/10][0/2225][Time 1.78]
Unified LR across all optimizers: 8.810691513448475e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0375	Gen: 0.0429	Rec: 0.0379	E: 0.0425	R: 0.0324	P: 0.0433
[8/10][50/2225][Time 1.89]
Unified LR across all optimizers: 8.790427485288373e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0395	Gen: 0.0453	Rec: 0.0404	E: 0.0444	R: 0.0347	P: 0.0462
[8/10][100/2225][Time 1.75]
Unified LR across all optimizers: 8.770210063099734e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0386	Gen: 0.0438	Rec: 0.0393	E: 0.0431	R: 0.0340	P: 0.0445
[8/10][150/2225][Time 1.92]
Unified LR across all optimizers: 8.750039139691806e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0375	Gen: 0.0428	Rec: 0.0385	E: 0.0418	R: 0.0333	P: 0.0437
[8/10][200/2225][Time 2.01]
Unified LR across all optimizers: 8.729914608120354e-05
--------------------Tra

Iterations:   0%|          | 0/2225 [00:00<?, ?it/s]

[9/10][25/2225][Time 1.94]
Unified LR across all optimizers: 7.943465170874777e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0377	Gen: 0.0435	Rec: 0.0375	E: 0.0438	R: 0.0316	P: 0.0433
[9/10][75/2225][Time 1.92]
Unified LR across all optimizers: 7.925195707953989e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0347	Gen: 0.0404	Rec: 0.0349	E: 0.0402	R: 0.0293	P: 0.0405
[9/10][125/2225][Time 1.93]
Unified LR across all optimizers: 7.90696826363192e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0371	Gen: 0.0428	Rec: 0.0371	E: 0.0428	R: 0.0315	P: 0.0428
[9/10][175/2225][Time 1.98]
Unified LR across all optimizers: 7.888782741268464e-05
--------------------Training Metrics--------------------
Trainer:  deepfm
Inf: 0.0375	Gen: 0.0431	Rec: 0.0378	E: 0.0428	R: 0.0322	P: 0.0434
[9/10][225/2225][Time 1.73]
Unified LR across all optimizers: 7.870639044445802e-05
--------------------Tra

In [9]:
batch_size = 64  # Lower than the original batch size
# Use DataLoader to handle smaller batches
from torch.nn.utils.rnn import pad_sequence
def collate_fn(batch):
    X, y = zip(*batch)
    # Directly use the tensors from X if they are already tensors, else convert appropriately
    X_padded = pad_sequence([x.clone().detach() if isinstance(x, torch.Tensor) else torch.tensor(x) for x in X], batch_first=True, padding_value=0)
    
    if any(label is None for label in y):
        y_padded = None
    else:
        # Directly use the tensors from y if they are already tensors, else convert appropriately
        y_padded = pad_sequence([label.clone().detach() if isinstance(label, torch.Tensor) else torch.tensor(label) for label in y], batch_first=True, padding_value=-1)
    
    return X_padded, y_padded

#### Data Loading and Synthetic Data Generation

This section deals with loading the unlabelled dataset, processing it through the trained model to create synthetic data. This data augmentation step is crucial for models that benefit from larger datasets, such as in fraud detection scenarios.


In [10]:
# Loading the unlabelled data and preparing it for processing222
train_loader = torch.utils.data.DataLoader(dataset=unlabelled_trainset, batch_size=batch_size, collate_fn=collate_fn, shuffle=False)

# Generate synthetic data through the model to augment the training dataset
recreated_dataset = None
with torch.no_grad():
    for data, _ in train_loader:
        data = data.to(device)
        batch_recreated_data = trainer_hub.encoder_ccnet.synthesize(data, output_multiplier=2)
        recreated_dataset = torch.cat([recreated_dataset, batch_recreated_data], dim = 0) if recreated_dataset is not None else batch_recreated_data
        
recreated_dataset.shape

torch.Size([284806, 31])

#### Data Preparation for Model Training

After synthetic data generation, this section separates the data and labels for training purposes, preparing them for use in machine learning models to ensure proper supervision and evaluation.


In [11]:
# Separate the recreated data into features and labels for training
recreated_training_data, recreated_labels = recreated_dataset[:, :-1].clone().detach().cpu().numpy(), recreated_dataset[:, -1:].clone().detach().cpu().numpy()
ccnet_recreated_dataset = LabeledDataset(recreated_training_data, recreated_labels)

num_features = recreated_training_data.shape[1]
num_classes = recreated_labels.shape[1]
num_features, num_classes

(30, 1)

In [12]:
class DNN(torch.nn.Module):
    def __init__(self, input_size, output_size, num_layers=4, hidden_size=256):
        super(DNN, self).__init__()
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_size = hidden_size
        
        # Create a list to hold all layers
        layers = []
        
        # Input layer
        layers.append(torch.nn.Linear(input_size, hidden_size))
        layers.append(torch.nn.ReLU())
        
        # Hidden layers
        for _ in range(num_layers - 2):
            layers.append(torch.nn.Linear(hidden_size, hidden_size))
            layers.append(torch.nn.ReLU())
        
        # Output layer
        layers.append(torch.nn.Linear(hidden_size, output_size))
        
        # Register all layers
        self.layers = torch.nn.Sequential(*layers)

    def forward(self, x):
        x = self.layers(x)
        return torch.sigmoid(x)

#### Training Supervised Models

This section outlines the process of training supervised learning models using both original and synthetic datasets. The `train_supervised_model` function is designed to iterate through the dataset, perform forward passes, compute loss, and update model weights using backpropagation.


In [13]:
def train_supervised_model(model, dataset):
    # Initialize the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # Ensure reproducibility by resetting the random seed
    set_random_seed(0)
    # Create DataLoader for batch processing
    trainloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    
    # Training loop
    for epoch in range(4):  # Train for 2 epochs as an example
        for i, (data, label) in enumerate(trainloader):
            data = data.to(device)
            label = label.to(device)
            # Perform forward pass
            output = model(data)
            # Compute loss
            loss = torch.nn.functional.mse_loss(output, label)
            # Backward pass to compute gradients
            loss.backward()
            # Update weights
            optimizer.step()
            # Reset gradients
            optimizer.zero_grad()


#### Model Training Using Recreated and Original Datasets

Models are trained using both datasets generated through the Data Augmentation process and the original dataset. This comparison helps to determine the effectiveness of the synthetic data in improving model performance.


In [14]:
# Initialize and train a model on the recreated dataset
model_trained_on_recreated = DNN(input_size=num_features, output_size=num_classes).to(device)
train_supervised_model(model_trained_on_recreated, ccnet_recreated_dataset)

# Initialize and train a model on the original dataset
model_trained_on_original = DNN(input_size=num_features, output_size=num_classes).to(device)
train_supervised_model(model_trained_on_original, trainset)

#### Evaluating Model Performance

After training, the models are evaluated using the F1 score, a harmonic mean of precision and recall, which is particularly useful in the context of imbalanced datasets like fraud detection. This step is critical for assessing the quality of the models trained on different types of data.


In [15]:
from sklearn.metrics import f1_score

def get_f1_score(model, testset, batch_size=batch_size):
    model.eval()  # Set the model to evaluation mode
    y_true = []
    y_pred = []
    # DataLoader for testing
    data_loader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False)

    # No gradient computation needed during inference
    with torch.no_grad():
        for data, label in data_loader:
            data = data.to(device)
            label = label.to(device)
            output = model(data)
            # Process output for binary classification
            predicted = (output.squeeze() > 0.5).long()
            y_true.extend(label.squeeze().long().cpu().numpy())
            y_pred.extend(predicted.cpu().numpy())

    # Compute and return the F1 score
    score = f1_score(y_true, y_pred, average='binary')
    return score

# Calculate F1 scores for both models
f1_score_original = get_f1_score(model_trained_on_original, testset)
f1_score_recreated = get_f1_score(model_trained_on_recreated, testset)

# Output the results
print("F1 score of the supervised learning model trained on the original data: ", f1_score_original)
print("F1 score of the supervised learning model trained on the recreated data: ", f1_score_recreated)


F1 score of the supervised learning model trained on the original data:  0.0
F1 score of the supervised learning model trained on the recreated data:  0.5833333333333334
