<a href="https://colab.research.google.com/github/JidapaBur/DADS7201/blob/main/sna/fraud/final/20_modeling_for_students.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Modeling

This notebook walks you through loading a heterogeneous graph and training a model on it. The graph is built with the Deep Graph Library (DGL), and training is implemented in PyTorch.

Training on the CPU is recommended if your machine can handle it. If you need a GPU, install the GPU build of DGL separately.

## Colab setting

To train in Colab, run the two cells below first: they mount Google Drive and change to the project directory.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
cur_path = "/content/drive/MyDrive/graph-fraud-detection/"
os.chdir(cur_path)
!pwd

/content/drive/MyDrive/graph-fraud-detection
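
Before starting a training run, it can help to confirm that the mounted folder contains the files the script expects. The sketch below is only an illustration: the file names come from the defaults that `train.py` prints later in this notebook, while the helper `missing_inputs` and the exact folder layout are assumptions.

```python
from pathlib import Path

# File names taken from the defaults printed by train.py in this notebook.
# The layout under the project folder is an assumption; adjust it to your Drive.
EXPECTED = ["train.py", "data/features.csv", "data/tags.csv", "data/test.csv"]


def missing_inputs(root="."):
    """Return the expected files that are not present under `root` (hypothetical helper)."""
    root = Path(root)
    return [name for name in EXPECTED if not (root / name).is_file()]


missing = missing_inputs(".")
print("All expected files found" if not missing else f"Missing: {missing}")
```

An empty result suggests the working directory is set correctly; anything listed is worth checking before training.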


In [3]:
#!pip install dgl # This might install a version compatible with the current PyTorch
#!pip install  dgl -f https://data.dgl.ai/wheels/torch-2.3/repo.html
!pip install dgl==1.1.2  # Or other version that works without GraphBolt

#-cu101

Collecting dgl==1.1.2
  Downloading dgl-1.1.2-cp311-cp311-manylinux1_x86_64.whl.metadata (530 bytes)
Downloading dgl-1.1.2-cp311-cp311-manylinux1_x86_64.whl (6.0 MB)
Installing collected packages: dgl
Successfully installed dgl-1.1.2


## Training (All in 1)

In this section, a single call to `train.py` runs the whole training pipeline. The comment under each command records the metrics that run reported.
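
The runs in this section differ only in their command-line flags, so a small helper can make such a sweep explicit. This is a minimal sketch: the flag names are the ones used in this notebook, while `build_command` and the value grid are illustrative assumptions.

```python
import shlex


def build_command(script="train.py", **flags):
    """Turn keyword arguments into a command line, e.g. n_hidden=32 -> --n-hidden 32."""
    parts = ["python", script]
    for name, value in flags.items():
        parts += ["--" + name.replace("_", "-"), str(value)]
    return shlex.join(parts)


# Flag names match the cells in this section; the threshold grid is only an example.
for threshold in (0.3, 0.25, 0.2):
    print(build_command(n_epochs=10, n_hidden=48, embedding_size=180,
                        n_layers=2, weight=2, threshold=threshold))
```

In Colab, a command string built this way could be run from a cell with `!{cmd}`, where `cmd` holds the string.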

In [None]:
!python train.py --n-epochs 10
#F1: 0.2557, Precision: 0.2568, Recall: 0.2545, Acc: 0.9443, ROC: 0.7703, PR: 0.2243, AP: 0.2243

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_21_15_41_22', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=16, n_layers=3, weight_decay=0.0005, dropout=0.2, embedding_size=360, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgelist.csv', 'rela

In [None]:
!python train.py --n-epochs 10 --n-hidden 32
#F1: 0.2680, Precision: 0.3133, Recall: 0.2342, Acc: 0.9519, ROC: 0.7390, PR: 0.1828, AP: 0.1828

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_21_15_44_03', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=32, n_layers=3, weight_decay=0.0005, dropout=0.2, embedding_size=360, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgelist.csv', 'rela

In [None]:
!python train.py --n-epochs 10 --n-hidden 32 --threshold 0.3
#F1: 0.1336, Precision: 0.4375, Recall: 0.0788, Acc: 0.9616, ROC: 0.7390, PR: 0.1828, AP: 0.1828

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_21_15_51_04', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.3, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=32, n_layers=3, weight_decay=0.0005, dropout=0.2, embedding_size=360, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgelist.csv', 're

In [None]:
!python train.py --n-epochs 10 --n-hidden 32 --threshold 0.1
#F1: 0.0516, Precision: 0.5714, Recall: 0.0270, Acc: 0.9627, ROC: 0.7390, PR: 0.1828, AP: 0.1828

Traceback (most recent call last):
  File "/content/drive/MyDrive/graph-fraud-detection/train.py", line 9, in <module>
    import dgl
  File "/usr/local/lib/python3.11/dist-packages/dgl/__init__.py", line 16, in <module>
    from . import (
  File "/usr/local/lib/python3.11/dist-packages/dgl/dataloading/__init__.py", line 4, in <module>
    from .base import *
  File "/usr/local/lib/python3.11/dist-packages/dgl/dataloading/base.py", line 7, in <module>
    from ..convert import heterograph
  File "/usr/local/lib/python3.11/dist-packages/dgl/convert.py", line 7, in <module>
    from scipy.sparse import spmatrix
  File "/usr/local/lib/python3.11/dist-packages/scipy/sparse/__init__.py", line 315, in <module>
    from . import csgraph
  File "/usr/local/lib/python3.11/dist-packages/scipy/sparse/csgraph/__init__.py", line 187, in <module>
    from ._laplacian import laplacian
  File "/usr/local/lib/python3.11/dist-packages/scipy/sparse/csgraph/_laplacian.py", line 7, in <module>
    from sc

In [7]:
!python train.py --n-epochs 10 --n-hidden 32 --threshold 0.3 --weight 3
#F1: 0.0941, Precision: 0.5111, Recall: 0.0518, Acc: 0.9625, ROC: 0.7581, PR: 0.1991, AP: 0.1991

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_22_08_32_56', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.3, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=32, n_layers=3, weight_decay=0.0005, dropout=0.2, embedding_size=360, weight=3.0, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgeli

In [8]:
!python train.py --n-epochs 10 --n-hidden 32 --threshold 0.2 --weight 4
#F1: 0.0090, Precision: 1.0000, Recall: 0.0045, Acc: 0.9626, ROC: 0.6954, PR: 0.1864, AP: 0.1864

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_22_08_37_52', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.2, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=32, n_layers=3, weight_decay=0.0005, dropout=0.2, embedding_size=360, weight=4.0, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgeli

In [9]:
!python train.py --n-epochs 10 --n-hidden 32 --threshold 0.25 --weight 3
#F1: 0.0766, Precision: 0.6923, Recall: 0.0405, Acc: 0.9633, ROC: 0.7581, PR: 0.1991, AP: 0.1991

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_22_08_44_37', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.25, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=32, n_layers=3, weight_decay=0.0005, dropout=0.2, embedding_size=360, weight=3.0, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgel

In [10]:
!python train.py --n-epochs 10 --n-hidden 32 --weight 2
#F1: 0.2861, Precision: 0.4145, Recall: 0.2185, Acc: 0.9590, ROC: 0.7664, PR: 0.2371, AP: 0.2371

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_22_08_48_47', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=32, n_layers=3, weight_decay=0.0005, dropout=0.2, embedding_size=360, weight=2.0, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgelist

In [15]:
!python train.py --n-epochs 10 --n-hidden 48 --embedding-size 180 --n-layers 2 --weight 2
#F1: 0.2627, Precision: 0.1886, Recall: 0.4324, Acc: 0.9087, ROC: 0.8252, PR: 0.2761, AP: 0.2761

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_22_08_58_54', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=48, n_layers=2, weight_decay=0.0005, dropout=0.2, embedding_size=180, weight=2.0, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgelist

In [17]:
!python train.py --n-epochs 10 --n-hidden 48 --embedding-size 180 --n-layers 2 --weight 2 --threshold 0.3 --device cpu
#F1: 0.2939, Precision: 0.2430, Recall: 0.3716, Acc: 0.9329, ROC: 0.8252, PR: 0.2761, AP: 0.2761

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_22_09_18_43', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.3, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=48, n_layers=2, weight_decay=0.0005, dropout=0.2, embedding_size=180, weight=2.0, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgeli

In [18]:
!python train.py --n-epochs 10 --n-hidden 48 --embedding-size 180 --n-layers 2 --weight 2 --threshold 0.25 --device cpu
#F1: 0.3055, Precision: 0.2663, Recall: 0.3581, Acc: 0.9388, ROC: 0.8252, PR: 0.2761, AP: 0.2761

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_22_09_23_46', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.25, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=48, n_layers=2, weight_decay=0.0005, dropout=0.2, embedding_size=180, weight=2.0, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgel

In [19]:
!python train.py --n-epochs 10 --n-hidden 48 --embedding-size 180 --n-layers 2 --weight 2 --threshold 0.2 --device cpu
#F1: 0.3120, Precision: 0.2916, Recall: 0.3356, Acc: 0.9444, ROC: 0.8252, PR: 0.2761, AP: 0.2761

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_22_09_30_43', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.2, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=48, n_layers=2, weight_decay=0.0005, dropout=0.2, embedding_size=180, weight=2.0, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgeli

In [20]:
!python train.py --n-epochs 10 --n-hidden 48 --embedding-size 180 --n-layers 2 --weight 2 --threshold 0.1 --device cpu
#F1: 0.3211, Precision: 0.3861, Recall: 0.2748, Acc: 0.9563, ROC: 0.8252, PR: 0.2761, AP: 0.2761

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_22_10_43_20', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.1, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=48, n_layers=2, weight_decay=0.0005, dropout=0.2, embedding_size=180, weight=2.0, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgeli

In [22]:
!python train.py --n-epochs 10 --n-hidden 48 --embedding-size 180 --n-layers 2 --weight 2 --threshold 0.05 --device cpu
#F1: 0.3005, Precision: 0.4605, Recall: 0.2230, Acc: 0.9610, ROC: 0.8252, PR: 0.2761, AP: 0.2761

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_22_10_50_43', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.05, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=48, n_layers=2, weight_decay=0.0005, dropout=0.2, embedding_size=180, weight=2.0, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgel
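
To compare the runs above, the metric comments can be parsed and ranked. The sketch below copies three of those comments verbatim; the run labels, `parse_metrics`, and `f1_from` are hypothetical names. It also recomputes F1 as the harmonic mean of precision and recall, which should agree with the logged value up to rounding.

```python
import re

# Metric comments copied from three of the cells above; the dictionary keys are just labels.
RUNS = {
    "hidden=16": "F1: 0.2557, Precision: 0.2568, Recall: 0.2545, Acc: 0.9443, ROC: 0.7703, PR: 0.2243, AP: 0.2243",
    "hidden=32, weight=2": "F1: 0.2861, Precision: 0.4145, Recall: 0.2185, Acc: 0.9590, ROC: 0.7664, PR: 0.2371, AP: 0.2371",
    "hidden=48, layers=2, weight=2, threshold=0.1": "F1: 0.3211, Precision: 0.3861, Recall: 0.2748, Acc: 0.9563, ROC: 0.8252, PR: 0.2761, AP: 0.2761",
}


def parse_metrics(line):
    """Parse 'Name: value, Name: value' into a dict of floats."""
    return {name: float(value) for name, value in re.findall(r"(\w+):\s*([0-9.]+)", line)}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


results = {name: parse_metrics(line) for name, line in RUNS.items()}
for name, m in sorted(results.items(), key=lambda kv: kv[1]["F1"], reverse=True):
    recomputed = f1_from(m["Precision"], m["Recall"])
    print(f"{name:45s} F1={m['F1']:.4f} (recomputed {recomputed:.4f}) ROC={m['ROC']:.4f}")
```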

#End here ...

In [None]:
1/0  # Deliberately raise an error to stop execution here.

ZeroDivisionError: division by zero

## Training (Detailed)

As an alternative to the all-in-one script, you can run the training steps one at a time.

### Prepare environment

In [None]:
import os
import sys
import glob

os.environ['DGLBACKEND'] = 'pytorch'

import torch as th
import dgl
import numpy as np

from gnn.estimator_fns import *
from gnn.graph_utils import *
from gnn.data import *
from gnn.utils import *
from gnn.pytorch_model import *
from train import *

### Load data

Recall the edges defined earlier and the CSV files they were saved to.

In [None]:
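# Collect every edge-list CSV in ./data.
file_list = glob.glob('./data/*edgelist.csv')

# Keep the relation files and join their base names into one comma-separated string.
edges = ",".join(os.path.basename(f) for f in file_list if "relation" in f)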
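
To see what the `edges` string looks like without touching the file system, here is a small sketch with a hypothetical file list. The third file name is invented to show the filter at work, and the filter is applied to the base name rather than the full path.

```python
import os

# Hypothetical glob result; the real list depends on the files in ./data.
file_list = [
    "./data/relation_card1_edgelist.csv",
    "./data/relation_addr1_edgelist.csv",
    "./data/other_edgelist.csv",  # invented name, filtered out below
]

# Keep only relation files and drop the directory part of each path.
names = [os.path.basename(f) for f in file_list]
edges = ",".join(name for name in names if "relation" in name)
print(edges)  # relation_card1_edgelist.csv,relation_addr1_edgelist.csv
```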

To train the graph neural network, we need to set a few hyperparameters that determine the class of graph neural network model, the network architecture, and the optimizer and its settings.

Only a few of the hyperparameters are set here. For the full list and the default values, see `gnn/estimator_fns.py`. The parameters set below are:

- **nodes** is the name of the file that contains the node IDs of the target nodes and the node features.
- **edges** is a wildcard pattern (for example `relation*`) that expands to the file names of all the edge lists.
- **labels** is the name of the file that contains the target node IDs and their labels.
- **model** specifies which graph neural network to use; it should be set to `r-gcn`.

The following hyperparameters can be tuned to improve model performance:

- **batch-size** is the number of nodes used to compute a single forward pass of the GNN.
- **embedding-size** is the size of the embedding dimension for non-target nodes.
- **n-neighbors** is the number of neighbors to sample for each target node during graph sampling for mini-batch training.
- **n-layers** is the number of GNN layers in the model.
- **n-epochs** is the number of training epochs.
- **optimizer** is the optimization algorithm used for gradient-based parameter updates.
- **lr** is the learning rate for parameter updates.
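
As a rough picture of how such hyperparameters are exposed, the sketch below defines a small `argparse` parser. It is not the parser from `gnn/estimator_fns.py`: `sketch_parser` is a hypothetical name, and the defaults simply mirror values in the `Namespace` that `train.py` printed earlier in this notebook.

```python
import argparse


def sketch_parser():
    """Illustrative parser only; the real definitions live in gnn/estimator_fns.py."""
    p = argparse.ArgumentParser()
    # 10 is the value passed explicitly in the earlier runs; the script's own default is not shown.
    p.add_argument("--n-epochs", type=int, default=10)
    # The remaining defaults mirror the Namespace printed by the first run.
    p.add_argument("--n-hidden", type=int, default=16)
    p.add_argument("--n-layers", type=int, default=3)
    p.add_argument("--embedding-size", type=int, default=360)
    p.add_argument("--lr", type=float, default=0.01)
    p.add_argument("--optimizer", type=str, default="adam")
    return p


# Passing an explicit list avoids picking up the notebook kernel's own arguments.
sketch_args = sketch_parser().parse_args(["--n-hidden", "48", "--n-layers", "2"])
print(vars(sketch_args))
```

Note that `argparse` converts dashes in flag names to underscores, so `--n-hidden` becomes `n_hidden` on the resulting namespace.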

### Generate graph

In [None]:
print('numpy version:{} PyTorch version:{} DGL version:{}'.format(np.__version__,
                                                                    th.__version__,
                                                                    dgl.__version__))

args = parse_args()
print(args)

In [None]:
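# Either use the comma-separated string built above ...
args.edges = edges

# ... or let get_edgelists expand the pattern. This second assignment is the one that takes effect.
args.edges = get_edgelists('relation*', args.training_dir)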

g, features, target_id_to_node, id_to_node = construct_graph(args.training_dir,
                                                                args.edges,
                                                                args.nodes,
                                                                args.target_ntype)

mean, stdev, features = normalize(th.from_numpy(features))

print('feature mean shape:{}, std shape:{}'.format(mean.shape, stdev.shape))
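
The code of `normalize` is not shown in this notebook. Judging by its return values (a mean, a standard deviation, and the transformed features), it presumably standardizes each feature column. Under that assumption, a dependency-free sketch of column-wise standardization might look like this; `standardize` is a hypothetical name, and using the population standard deviation is my choice, not necessarily the library's.

```python
from statistics import fmean, pstdev


def standardize(rows):
    """Column-wise z-score: returns (means, stdevs, standardized rows)."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Replace a zero standard deviation with 1.0 so constant columns do not divide by zero.
    stdevs = [pstdev(col) or 1.0 for col in columns]
    scaled = [[(x - m) / s for x, m, s in zip(row, means, stdevs)] for row in rows]
    return means, stdevs, scaled


# Two samples, two features; the second feature is constant.
means, stdevs, scaled = standardize([[1.0, 10.0], [3.0, 10.0]])
print(means, stdevs, scaled)
```

Keeping the mean and standard deviation around matters because the same statistics have to be applied to new data at prediction time, which is presumably why they are passed to `save_model` later.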

In [None]:
g.nodes['target'].data['features'] = features

print("Getting labels")
n_nodes = g.number_of_nodes('target')

labels, _, test_mask = get_labels(target_id_to_node,
                                            n_nodes,
                                            args.target_ntype,
                                            os.path.join(args.training_dir, args.labels),
                                            os.path.join(args.training_dir, args.new_accounts))
print("Got labels")

labels = th.from_numpy(labels).float()
test_mask = th.from_numpy(test_mask).float()

n_nodes = th.sum(th.tensor([g.number_of_nodes(n_type) for n_type in g.ntypes]))
n_edges = th.sum(th.tensor([g.number_of_edges(e_type) for e_type in g.etypes]))

print("""----Data statistics------
            #Nodes: {}
            #Edges: {}
            #Features Shape: {}
            #Labeled Test samples: {}""".format(n_nodes,
                                                    n_edges,
                                                    features.shape,
                                                    test_mask.sum()))

### Start training

Training progress and the final results are written to the same folder.

In [None]:
if args.num_gpus:
    cuda = True
    device = th.device('cuda:0')
else:
    cuda = False
    device = th.device('cpu')

In [None]:
print("Initializing Model")
in_feats = features.shape[1]
n_classes = 2

ntype_dict = {n_type: g.number_of_nodes(n_type) for n_type in g.ntypes}

model = get_model(ntype_dict, g.etypes, vars(args), in_feats, n_classes, device)
print("Initialized Model")

features = features.to(device)

labels = labels.long().to(device)
test_mask = test_mask.to(device)
# g = g.to(device)

loss = th.nn.CrossEntropyLoss()

# print(model)
optim = th.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)

print("Starting Model training")

initial_record()

model, class_preds, pred_proba = train_fg(model, optim, loss, features, labels, g, g,
                                            test_mask, device, args.n_epochs,
                                            args.threshold,  args.compute_metrics)
print("Finished Model training")

print("Saving model")

# Create the model directory if it does not exist yet.
os.makedirs(args.model_dir, exist_ok=True)

save_model(g, model, args.model_dir, id_to_node, mean, stdev)
print("Model and metadata saved")
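
The training cell above uses `th.nn.CrossEntropyLoss` with two classes. To make the quantity being minimized concrete, here is a plain-Python sketch of mean softmax cross-entropy over a batch. The function name and the example logits are illustrative, and reading label 1 as "fraud" is an assumption about the label encoding.

```python
import math


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch of logit rows and integer labels."""
    total = 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the max before exponentiating, for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[label]  # negative log-probability of the true class
    return total / len(labels)


# Two example nodes with two class scores each (assumed: 0 = legitimate, 1 = fraud).
print(cross_entropy([[2.0, 0.0], [0.0, 0.0]], [0, 1]))
```

When both logits are equal the model assigns probability 0.5 to each class, so the per-node loss is `log(2)`; a confident correct prediction drives the loss toward zero.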

In [None]:
%tb

In [None]:
!pip install torchdata