# Modeling

This notebook walks through loading the heterogeneous graph and training a model on it. The graph is built with the Deep Graph Library (DGL), and training is implemented in PyTorch.

Training on the CPU is recommended if your machine can handle it. To train on a GPU, install a GPU-enabled build of DGL separately.

## Colab setting

If you want to train in Colab, run the following two cells first to mount Google Drive and change to the project directory.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import os
cur_path = "/content/drive/MyDrive/graph-fraud-detection-F/"
os.chdir(cur_path)
!pwd

/content/drive/MyDrive/graph-fraud-detection-F
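
The two cells above only work inside Colab. As a minimal sketch, a guard like the one below lets the same notebook run elsewhere as well. The helper name `in_colab` is hypothetical, and the check assumes that a Colab kernel already has the `google.colab` module loaded.

```python
import os
import sys


def in_colab() -> bool:
    """Return True when running inside a Colab kernel.

    Assumption: Colab preloads the `google.colab` module, so its presence
    in `sys.modules` is used as the signal.
    """
    return "google.colab" in sys.modules


if in_colab():
    # Import lazily so this cell does not fail outside Colab.
    from google.colab import drive

    drive.mount("/content/drive")
    os.chdir("/content/drive/MyDrive/graph-fraud-detection-F/")

print("working directory:", os.getcwd())
```

Outside Colab the guard skips the mount and leaves the working directory unchanged.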


In [3]:
# Uninstall everything that might conflict (may take a minute)
!pip uninstall -y torch torchvision torchaudio torchtext torchdata
!pip uninstall -y dgl dgl-cu101 dgl-cu102 dgl-cu110 dgl-cu111 dgl-cu113 dgl-cu116 dgl-cu117 dgl-cu118 dgl-cu121 dgl-cu122 dgl-cu123 dgl-cu124 dgl-cu12 dgl-cuda10 dgl-cuda11 dgl-cuda12

Found existing installation: torch 1.13.1
Uninstalling torch-1.13.1:
  Successfully uninstalled torch-1.13.1
Found existing installation: dgl 1.1.1
Uninstalling dgl-1.1.1:
  Successfully uninstalled dgl-1.1.1

In [4]:
!pip install torch==1.13.1
!pip install dgl==1.1.1 -f https://data.dgl.ai/wheels/repo.html
!pip install numpy==1.23.5 pandas==1.5.3 scikit-learn==1.2.2 matplotlib==3.6.3 tqdm

Collecting torch==1.13.1
  Using cached torch-1.13.1-cp311-cp311-manylinux1_x86_64.whl.metadata (24 kB)
Using cached torch-1.13.1-cp311-cp311-manylinux1_x86_64.whl (887.4 MB)
Installing collected packages: torch
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
timm 1.0.15 requires torchvision, which is not installed.
fastai 2.7.19 requires pandas, which is not installed.
fastai 2.7.19 requires torchvision>=0.11, which is not installed.
accelerate 1.6.0 requires torch>=2.0.0, but you have torch 1.13.1 which is incompatible.
Successfully installed torch-1.13.1
Looking in links: https://data.dgl.ai/wheels/repo.html
Collecting dgl==1.1.1
  Using cached dgl-1.1.1-cp311-cp311-manylinux1_x86_64.whl.metadata (530 bytes)
Using cached dgl-1.1.1-cp311-cp311-manylinux1_x86_64.whl (6.3 MB)
Installing collected packages: dgl
Successfu

In [None]:
#!pip install dgl
#-cu101

## Training (All in 1)

In this part, a single command runs the whole training pipeline. Uncomment the line below to use it.

In [None]:
#!python train.py --n-epochs 1000

Using backend: pytorch
DLG version: 0.6.1
numpy version:1.19.5 PyTorch version:1.8.1+cu101 DGL version:0.6.1
Namespace(compute_metrics=True, dropout=0.2, edges='relation*', embedding_size=360, labels='tags.csv', lr=0.01, model_dir='./model/2021_05_12_17_27_00', n_epochs=1000, n_hidden=16, n_layers=3, new_accounts='test.csv', nodes='features.csv', num_gpus=0, optimizer='adam', output_dir='./output', target_ntype='TransactionID', threshold=0, training_dir='./data', weight_decay=0.0005)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_id_04_edgelist.csv', 'relation_id_01_edgelist.csv', 'relation_id_03_edgelist.csv', 

## Training (Detailed)

As an alternative to the all-in-one command, the steps below walk through training cell by cell.

### Prepare environment

In [5]:
import os
import sys
import glob

os.environ['DGLBACKEND'] = 'pytorch'

import torch as th
import dgl
import numpy as np

from gnn.estimator_fns import *
from gnn.graph_utils import *
from gnn.data import *
from gnn.utils import *
from gnn.pytorch_model import *
from train import *

DLG version: 1.1.1


### Load data

Recall the edges we defined earlier and the CSV files in which we saved them.

In [6]:
file_list = glob.glob('./data/*edgelist.csv')

# Keep only the file names of the relation edge lists, joined by commas.
edges = ",".join(os.path.basename(path) for path in file_list if "relation" in path)

To train the graph neural network, we need to define a few hyperparameters that determine properties such as the class of graph neural network model to use, the network architecture, and the optimizer and its parameters.

Here we set only a few of the hyperparameters. To see all the hyperparameters and their default values, see `gnn/estimator_fns.py`. The parameters set below are:

- **nodes** is the name of the file that contains the node_ids of the target nodes and the node features
- **edges** is a regular expression that, when expanded, lists all the filenames of the edge lists
- **labels** is the name of the file that contains the target node_ids and their labels
- **model** specifies which graph neural network to use; this should be set to r-gcn
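
To make the **edges** parameter concrete, here is a minimal sketch of how a regular expression such as `relation*` could be expanded into a list of edge-list filenames. The helper `expand_edgelists` is hypothetical and only illustrates the idea; the project's own `get_edgelists` may behave differently.

```python
import os
import re
import tempfile
from typing import List


def expand_edgelists(pattern: str, directory: str) -> List[str]:
    """Return the sorted filenames in `directory` whose start matches `pattern`."""
    compiled = re.compile(pattern)
    return sorted(name for name in os.listdir(directory) if compiled.match(name))


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("relation_card1_edgelist.csv",
                 "relation_addr1_edgelist.csv",
                 "features.csv"):
        open(os.path.join(tmp, name), "w").close()
    matched = expand_edgelists("relation*", tmp)

print(matched)
```

Read as a regular expression, `relation*` means `relatio` followed by zero or more `n`, so it still matches every filename that starts with `relation`, while `features.csv` is excluded.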

The following hyperparameters can be tuned to improve model performance:

- **batch-size** is the number of nodes that are used to compute a single forward pass of the GNN
- **embedding-size** is the size of the embedding dimension for non-target nodes
- **n-neighbors** is the number of neighbours to sample for each target node during graph sampling for mini-batch training
- **n-layers** is the number of GNN layers in the model
- **n-epochs** is the number of training epochs for the model training job
- **optimizer** is the optimization algorithm used for gradient based parameter updates
- **lr** is the learning rate for parameter updates
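
As a rough illustration, hyperparameters like these are typically exposed on the command line with `argparse`. The function `build_parser` below is hypothetical and covers only a subset of the options. Its defaults are copied from the `Namespace` printed by this notebook; the real definitions live in `gnn/estimator_fns.py` and may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering a few of the hyperparameters listed above."""
    parser = argparse.ArgumentParser(description="GNN training hyperparameters (sketch)")
    parser.add_argument("--nodes", type=str, default="features.csv")
    parser.add_argument("--edges", type=str, default="relation*")
    parser.add_argument("--labels", type=str, default="tags.csv")
    parser.add_argument("--n-layers", type=int, default=3)
    parser.add_argument("--n-epochs", type=int, default=700)
    parser.add_argument("--optimizer", type=str, default="adam")
    parser.add_argument("--lr", type=float, default=0.01)
    return parser


# Parse an explicit argument list so the sketch also runs inside a notebook,
# where sys.argv holds the kernel's own arguments.
demo_args = build_parser().parse_args(["--n-epochs", "1000", "--lr", "0.005"])
print(demo_args.n_epochs, demo_args.lr, demo_args.edges)
```

`argparse` converts hyphens in flag names to underscores, so `--n-epochs` is read back as `demo_args.n_epochs`. Flags that are not passed keep their defaults.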

### Generate graph

In [7]:
print('numpy version:{} PyTorch version:{} DGL version:{}'.format(np.__version__,
                                                                    th.__version__,
                                                                    dgl.__version__))

args = parse_args()
print(args)

numpy version:1.23.5 PyTorch version:1.13.1+cu117 DGL version:1.1.1
Namespace(training_dir='./data', model_dir='./model/2025_05_25_16_57_50', output_dir='./output', nodes='features.csv', target_ntype='TransactionID', edges='relation*', labels='tags.csv', new_accounts='test.csv', compute_metrics=True, threshold=0, num_gpus=0, optimizer='adam', lr=0.01, n_epochs=700, n_hidden=16, n_layers=3, weight_decay=0.0005, dropout=0.2, embedding_size=360)


In [8]:
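# Explicit, comma-separated list of edge-list file names built in the "Load data" cell.
args.edges = edges

# Note: this overrides the assignment above by expanding the 'relation*' pattern instead.
args.edges = get_edgelists('relation*', args.training_dir)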

g, features, target_id_to_node, id_to_node = construct_graph(args.training_dir,
                                                                args.edges,
                                                                args.nodes,
                                                                args.target_ntype)

mean, stdev, features = normalize(th.from_numpy(features))

print('feature mean shape:{}, std shape:{}'.format(mean.shape, stdev.shape))

Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgelist.csv', 'relation_id_01_edgelist.csv', 'relation_id_02_edgelist.csv', 'relation_id_03_edgelist.csv', 'relation_id_04_edgelist.csv', 'relation_id_05_edgelist.csv', 'relation_id_06_edgelist.csv', 'relation_id_07_edgelist.csv', 'relation_id_08_edgelist.csv', 'relation_id_09_edgelist.csv', 'relation_id_10_edgelist.csv', 'relation_id_11_edgelist.csv', 'relation_id_12_edgelist.csv', 'relation_id_13_edgelist.csv', 'relation_id_14_edgelist.csv', 'relation_id_15_edgelist.csv', 'relation_id_16_edgelist.csv', 'relation_id_17_edgelist.csv', 'relation_id_18_
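
The `normalize` call in the cell above returns a mean, a standard deviation, and the transformed features. The sketch below shows one common way to do this: standardizing each feature column to zero mean and unit variance. It is written in plain Python for clarity, and the helper `standardize` is hypothetical. Whether the project's `normalize` does exactly this is an assumption.

```python
from statistics import fmean, pstdev
from typing import List, Tuple

Matrix = List[List[float]]


def standardize(rows: Matrix, eps: float = 1e-12) -> Tuple[List[float], List[float], Matrix]:
    """Standardize each column to zero mean and unit variance.

    Returns (mean, stdev, standardized_rows), mirroring the order of the
    values unpacked from `normalize` in the cell above.
    """
    columns = list(zip(*rows))
    col_mean = [fmean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is zero.
    col_stdev = [max(pstdev(col), eps) for col in columns]
    scaled = [
        [(value - m) / s for value, m, s in zip(row, col_mean, col_stdev)]
        for row in rows
    ]
    return col_mean, col_stdev, scaled


demo_mean, demo_stdev, demo_scaled = standardize(
    [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
)
print(demo_mean, demo_stdev)
```

The notebook later passes `mean` and `stdev` to `save_model`, presumably so that the same scaling can be applied to new data at inference time.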

In [9]:
g.nodes['target'].data['features'] = features

print("Getting labels")
n_nodes = g.number_of_nodes('target')

labels, _, test_mask = get_labels(target_id_to_node,
                                            n_nodes,
                                            args.target_ntype,
                                            os.path.join(args.training_dir, args.labels),
                                            os.path.join(args.training_dir, args.new_accounts))
print("Got labels")

labels = th.from_numpy(labels).float()
test_mask = th.from_numpy(test_mask).float()

n_nodes = th.sum(th.tensor([g.number_of_nodes(n_type) for n_type in g.ntypes]))
n_edges = th.sum(th.tensor([g.number_of_edges(e_type) for e_type in g.etypes]))

print("""----Data statistics------'
            #Nodes: {}
            #Edges: {}
            #Features Shape: {}
            #Labeled Test samples: {}""".format(n_nodes,
                                                    n_edges,
                                                    features.shape,
                                                    test_mask.sum()))

Getting labels
Got labels
----Data statistics------'
            #Nodes: 726345
            #Edges: 19518802
            #Features Shape: torch.Size([590540, 390])
            #Labeled Test samples: 118108.0


### Start training

The training log and the results are saved to the same folder.

In [10]:
if args.num_gpus:
    cuda = True
    device = th.device('cuda:0')
else:
    cuda = False
    device = th.device('cpu')
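
The cell above trusts `args.num_gpus` alone. A slightly more defensive variant, sketched below, also checks whether CUDA is actually usable and falls back to the CPU otherwise. The helper `select_device_name` is hypothetical and is kept free of PyTorch imports so that it can be run anywhere.

```python
def select_device_name(num_gpus: int, cuda_available: bool) -> str:
    """Pick a device string, falling back to CPU when CUDA is unavailable."""
    if num_gpus > 0 and cuda_available:
        return "cuda:0"
    return "cpu"


# In the notebook this could be used as:
#   device = th.device(select_device_name(args.num_gpus, th.cuda.is_available()))
print(select_device_name(num_gpus=0, cuda_available=True))
print(select_device_name(num_gpus=1, cuda_available=False))
print(select_device_name(num_gpus=1, cuda_available=True))
```

With this check, requesting a GPU on a machine without a working CUDA setup falls back to the CPU instead of failing when the model is moved to the device.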

In [None]:
print("Initializing Model")
in_feats = features.shape[1]
n_classes = 2

ntype_dict = {n_type: g.number_of_nodes(n_type) for n_type in g.ntypes}

model = get_model(ntype_dict, g.etypes, vars(args), in_feats, n_classes, device)
print("Initialized Model")

features = features.to(device)

labels = labels.long().to(device)
test_mask = test_mask.to(device)
# g = g.to(device)

loss = th.nn.CrossEntropyLoss()

# print(model)
optim = th.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)

print("Starting Model training")

initial_record()

model, class_preds, pred_proba = train_fg(model, optim, loss, features, labels, g, g,
                                            test_mask, device, args.n_epochs,
                                            args.threshold,  args.compute_metrics)
print("Finished Model training")

print("Saving model")

os.makedirs(args.model_dir, exist_ok=True)

save_model(g, model, args.model_dir, id_to_node, mean, stdev)
print("Model and metadata saved")

Initializing Model
Initialized Model
Starting Model training
Epoch 00000, Time(s) 57.9874, Loss 0.6547, F1 0.0000 
Epoch 00001, Time(s) 57.3823, Loss 0.9180, F1 0.0000 
Epoch 00002, Time(s) 56.9625, Loss 0.6393, F1 0.0000 
Epoch 00003, Time(s) 56.8712, Loss 0.2876, F1 0.1591 
Epoch 00004, Time(s) 56.8676, Loss 0.6427, F1 0.0000 
Epoch 00005, Time(s) 56.8425, Loss 0.1941, F1 0.0000 
Epoch 00006, Time(s) 56.8770, Loss 0.2660, F1 0.0000 
Epoch 00007, Time(s) 56.8355, Loss 0.2724, F1 0.0000 
Epoch 00008, Time(s) 56.8680, Loss 0.2441, F1 0.0000 
Epoch 00009, Time(s) 56.8777, Loss 0.2086, F1 0.0000 
Epoch 00010, Time(s) 56.8418, Loss 0.1835, F1 0.0000 
Epoch 00011, Time(s) 56.8701, Loss 0.1720, F1 0.0000 
Epoch 00012, Time(s) 56.9887, Loss 0.1661, F1 0.0000 
Epoch 00013, Time(s) 57.4061, Loss 0.1592, F1 0.0000 
Epoch 00014, Time(s) 57.6687, Loss 0.1511, F1 0.0000 
Epoch 00015, Time(s) 57.8424, Loss 0.1460, F1 0.0000 
Epoch 00016, Time(s) 57.9489, Loss 0.1440, F1 0.0000 
Epoch 00017, Time(s) 