<a href="https://colab.research.google.com/github/JidapaBur/DADS7201/blob/main/sna/fraud/final/20_modeling_for_students.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Modeling

This notebook walks through loading a heterogeneous graph and training a model on it. The graph is built with the Deep Graph Library (DGL), and training is implemented in PyTorch.

Training on the CPU is recommended when your machine can handle it. To train on a GPU, install the GPU build of DGL separately.

## Colab setting

To train in Colab, run the next two cells first: they mount Google Drive and change to the project directory.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import os
cur_path = "/content/drive/MyDrive/graph-fraud-detection/"
os.chdir(cur_path)
!pwd

/content/drive/MyDrive/graph-fraud-detection
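
The two cells above only work inside Colab. If the notebook may also run locally, a guard along these lines can pick the working directory. This is a minimal sketch: `in_colab` and `resolve_project_dir` are hypothetical helper names, not part of the project, and the check assumes that the `google.colab` module is already loaded in `sys.modules` inside a Colab kernel.

```python
import os
import sys

# Path used in the cell above when running in Colab.
COLAB_PROJECT_DIR = "/content/drive/MyDrive/graph-fraud-detection/"


def in_colab() -> bool:
    """Return True when the google.colab module is already loaded (assumed Colab-only)."""
    return "google.colab" in sys.modules


def resolve_project_dir(local_dir: str = ".") -> str:
    """Pick the Drive path in Colab, otherwise an absolute local directory."""
    if in_colab():
        return COLAB_PROJECT_DIR
    return os.path.abspath(local_dir)


project_dir = resolve_project_dir()
print("Working directory would be:", project_dir)
```

Outside Colab the helper falls back to the directory the notebook was started from, so the relative `./data` and `./model` paths used later still resolve.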


In [3]:
#!pip install dgl # This might install a version compatible with the current PyTorch
#!pip install  dgl -f https://data.dgl.ai/wheels/torch-2.3/repo.html
!pip install dgl==1.1.2  # Or other version that works without GraphBolt
#!pip install --upgrade dgl -f https://data.dgl.ai/wheels/torch-2.4/cu124/repo.html

#-cu101

Collecting dgl==1.1.2
  Downloading dgl-1.1.2-cp311-cp311-manylinux1_x86_64.whl.metadata (530 bytes)
Downloading dgl-1.1.2-cp311-cp311-manylinux1_x86_64.whl (6.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.0/6.0 MB 47.6 MB/s eta 0:00:00
Installing collected packages: dgl
Successfully installed dgl-1.1.2


In [None]:
!pip install optuna

Collecting optuna
  Downloading optuna-4.4.0-py3-none-any.whl.metadata (17 kB)
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.16.4-py3-none-any.whl.metadata (7.3 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Downloading optuna-4.4.0-py3-none-any.whl (395 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 395.9/395.9 kB 1.5 MB/s eta 0:00:00
Downloading alembic-1.16.4-py3-none-any.whl (247 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 247.0/247.0 kB 8.0 MB/s eta 0:00:00
Downloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Installing collected packages: colorlog, alembic, optuna
Successfully installed alembic-1.16.4 colorlog-6.9.0 optuna-4.4.0


## Training (All in 1)

This section trains the model with a single command that runs the whole pipeline.

In [4]:
!python train_origin.py --n-epochs 10 --n-hidden 32 --embedding-size 128 --dropout 0.0
#F1: 0.2667, Precision: 0.3095, Recall: 0.2342, Acc: 0.9516, ROC: 0.7389, PR: 0.1833, AP: 0.1833

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_25_21_45_06', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.5, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=32, n_layers=3, weight_decay=0.0005, dropout=0.0, embedding_size=128, weight=None, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgel

In [None]:
!python train.py --n-epochs 10 --n-hidden 48 --embedding-size 180 --n-layers 2 --weight 2 --threshold 0.05 --dropout 0.0
#F1: 0.3005, Precision: 0.4605, Recall: 0.2230, Acc: 0.9610, ROC: 0.8252, PR: 0.2761, AP: 0.2761

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_25_21_28_39', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.05, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=48, n_layers=2, weight_decay=0.0005, dropout=0.0, embedding_size=180, weight=2.0, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgel
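
The `--threshold` flag changes how predicted fraud probabilities become class labels. As a rough illustration, assume the script labels a transaction as fraud when its probability exceeds the threshold; the exact rule lives in `train.py` and is not shown here. Under that assumption, lowering the threshold from 0.5 to 0.05 flags more transactions. The probabilities below are made-up values and `apply_threshold` is a hypothetical helper.

```python
from typing import List


def apply_threshold(probabilities: List[float], threshold: float) -> List[int]:
    """Label a sample 1 (fraud) when its probability is above the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probabilities]


# Hypothetical fraud probabilities for five transactions (not real model output).
example_probs = [0.02, 0.04, 0.07, 0.30, 0.90]

preds_default = apply_threshold(example_probs, 0.5)
preds_low = apply_threshold(example_probs, 0.05)

print("threshold 0.50:", preds_default)
print("threshold 0.05:", preds_low)
```

A lower threshold typically raises recall at the cost of precision, so it is worth comparing both metrics when changing it.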

In [6]:
!python train_origin.py --n-epochs 10 --n-hidden 32 --embedding-size 128 --dropout 0.1

DLG version: 1.1.2
numpy version:2.0.2 PyTorch version:2.6.0+cu124 DGL version:1.1.2
Namespace(training_dir='./data', predicting_dir='./tdata', model_dir='./model/2025_07_25_21_52_34', model_predict_dir='./model/2025_07_20_08_53_50', output_dir='./output', nodes='features.csv', edges='relation*', labels='tags.csv', new_accounts='test.csv', target_ntype='TransactionID', compute_metrics=True, threshold=0.5, optimizer='adam', lr=0.01, n_epochs=10, n_hidden=32, n_layers=3, weight_decay=0.0005, dropout=0.1, embedding_size=128, weight=None, num_gpus=0)
Getting relation graphs from the following edge lists : ['relation_card1_edgelist.csv', 'relation_card2_edgelist.csv', 'relation_card3_edgelist.csv', 'relation_card4_edgelist.csv', 'relation_card5_edgelist.csv', 'relation_card6_edgelist.csv', 'relation_ProductCD_edgelist.csv', 'relation_addr1_edgelist.csv', 'relation_addr2_edgelist.csv', 'relation_P_emaildomain_edgelist.csv', 'relation_R_emaildomain_edgelist.csv', 'relation_TransactionID_edgel
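
The F1 scores recorded in the run comments above can be cross-checked from the reported precision and recall, since F1 is their harmonic mean. The sketch below does that arithmetic with the figures copied from the two commented runs. Because the inputs are already rounded to four decimals, the recomputed values should agree with the reported ones only to within rounding. `f1_score` is written here for illustration and is not taken from the project.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the run comments.
runs = {
    "train_origin.py": (0.3095, 0.2342, 0.2667),
    "train.py": (0.4605, 0.2230, 0.3005),
}

for name, (precision, recall, reported) in runs.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed F1 = {computed:.4f}, reported F1 = {reported:.4f}")
```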

## Training (Detailed)

In addition to the all-in-one scripts above, you can run the same pipeline step by step, as described in this section.

### Prepare environment

In [None]:
import os
import sys
import glob

os.environ['DGLBACKEND'] = 'pytorch'

import torch as th
import dgl
import numpy as np

from gnn.estimator_fns import *
from gnn.graph_utils import *
from gnn.data import *
from gnn.utils import *
from gnn.pytorch_model import *
from train import *

### Load data

Recall the edges we defined earlier and the CSV files in which we saved them.

In [None]:
file_list = glob.glob('./data/*edgelist.csv')

# Keep only the relation edge lists and join their base names with commas.
edges = ",".join(os.path.basename(path) for path in file_list if "relation" in path)
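
To make the result of this cell concrete, the sketch below applies the same filter-and-join step to a hard-coded list of paths instead of the real `./data` directory. The first two file names appear in the training log earlier in this notebook; the `join_relation_files` helper and the `other_edgelist.csv` entry are hypothetical.

```python
import os
from typing import List


def join_relation_files(paths: List[str]) -> str:
    """Keep paths containing 'relation' and join their base names with commas."""
    return ",".join(os.path.basename(p) for p in paths if "relation" in p)


sample_paths = [
    "./data/relation_card1_edgelist.csv",
    "./data/relation_addr1_edgelist.csv",
    "./data/other_edgelist.csv",  # hypothetical file that the filter drops
]

print(join_relation_files(sample_paths))
```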

To train the graph neural network, we need to define a few hyperparameters that determine the class of graph neural network model, the network architecture, and the optimizer and its parameters.

Only a few of the hyperparameters are set here. To see all of them and their default values, see `gnn/estimator_fns.py`. The parameters set below are:

- **nodes** is the name of the file that contains the node IDs of the target nodes and the node features.
- **edges** is a glob pattern (for example `relation*`) that expands to the filenames of all the edge lists.
- **labels** is the name of the file that contains the target node IDs and their labels.
- **model** specifies which graph neural network to use; this should be set to r-gcn.

The following hyperparameters can be tuned to improve model performance:

- **batch-size** is the number of nodes used to compute a single forward pass of the GNN.
- **embedding-size** is the size of the embedding dimension for non-target nodes.
- **n-neighbors** is the number of neighbours to sample for each target node during graph sampling for mini-batch training.
- **n-layers** is the number of GNN layers in the model.
- **n-epochs** is the number of training epochs.
- **optimizer** is the optimization algorithm used for gradient-based parameter updates.
- **lr** is the learning rate for parameter updates.
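
As a rough picture of how such hyperparameters reach the training script, the sketch below defines a handful of the flags with `argparse`. It is not the project's `parse_args`. The defaults for `--n-layers`, `--lr` and `--optimizer` are the values visible in the printed `Namespace` output earlier in this notebook for runs that did not pass those flags, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """A small, illustrative subset of the training flags (not the real parser)."""
    parser = argparse.ArgumentParser(description="Illustrative GNN training flags")
    parser.add_argument("--n-epochs", type=int, required=True)
    parser.add_argument("--n-layers", type=int, default=3)
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--optimizer", type=str, default="adam")
    return parser


# Passing an explicit list keeps the example independent of the notebook's own argv.
demo_args = build_parser().parse_args(["--n-epochs", "10", "--n-layers", "2"])
print(demo_args)
```

Dashes in flag names become underscores on the parsed object, which is why `--n-epochs` shows up as `n_epochs` in the printed `Namespace`.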

### Generate graph

In [None]:
print('numpy version:{} PyTorch version:{} DGL version:{}'.format(np.__version__,
                                                                    th.__version__,
                                                                    dgl.__version__))

args = parse_args()
print(args)

In [None]:
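# Option 1: pass the comma-joined file names built in the "Load data" cell.
args.edges = edges

# Option 2: let get_edgelists expand the pattern. This assignment overrides the one above.
args.edges = get_edgelists('relation*', args.training_dir)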

g, features, target_id_to_node, id_to_node = construct_graph(args.training_dir,
                                                                args.edges,
                                                                args.nodes,
                                                                args.target_ntype)

mean, stdev, features = normalize(th.from_numpy(features))

print('feature mean shape:{}, std shape:{}'.format(mean.shape, stdev.shape))
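
For intuition about what a normalization step of this kind does, here is a minimal column-wise standardization in plain Python. It assumes `normalize` computes a per-feature mean and standard deviation and returns the standardized matrix, which is consistent with the three values it returns; the project's actual implementation may differ, for example in how it handles zero variance. `standardize_columns` is a hypothetical helper.

```python
from statistics import fmean, pstdev
from typing import List, Tuple

Matrix = List[List[float]]


def standardize_columns(rows: Matrix) -> Tuple[List[float], List[float], Matrix]:
    """Return per-column mean, population std, and the standardized rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]
    # Constant columns (std of 0) are mapped to 0.0 to avoid division by zero.
    scaled = [
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]
    return means, stdevs, scaled


toy_features = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
toy_mean, toy_std, toy_scaled = standardize_columns(toy_features)
print("mean:", toy_mean)
print("std:", toy_std)
print("scaled:", toy_scaled)
```

Keeping the mean and standard deviation around matters because the same statistics have to be applied to new data at prediction time, which is presumably why they are passed to `save_model` later in this notebook.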

In [None]:
g.nodes['target'].data['features'] = features

print("Getting labels")
n_nodes = g.number_of_nodes('target')

labels, _, test_mask = get_labels(target_id_to_node,
                                            n_nodes,
                                            args.target_ntype,
                                            os.path.join(args.training_dir, args.labels),
                                            os.path.join(args.training_dir, args.new_accounts))
print("Got labels")

labels = th.from_numpy(labels).float()
test_mask = th.from_numpy(test_mask).float()

n_nodes = th.sum(th.tensor([g.number_of_nodes(n_type) for n_type in g.ntypes]))
n_edges = th.sum(th.tensor([g.number_of_edges(e_type) for e_type in g.etypes]))

print("""----Data statistics------
            #Nodes: {}
            #Edges: {}
            #Features Shape: {}
            #Labeled Test samples: {}""".format(n_nodes,
                                                    n_edges,
                                                    features.shape,
                                                    test_mask.sum()))

### Start training

The training log and the results are saved to the same folder.

In [None]:
if args.num_gpus:
    cuda = True
    device = th.device('cuda:0')
else:
    cuda = False
    device = th.device('cpu')

In [None]:
print("Initializing Model")
in_feats = features.shape[1]
n_classes = 2

ntype_dict = {n_type: g.number_of_nodes(n_type) for n_type in g.ntypes}

model = get_model(ntype_dict, g.etypes, vars(args), in_feats, n_classes, device)
print("Initialized Model")

features = features.to(device)

labels = labels.long().to(device)
test_mask = test_mask.to(device)
# g = g.to(device)

loss = th.nn.CrossEntropyLoss()

# print(model)
optim = th.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)

print("Starting Model training")

initial_record()

model, class_preds, pred_proba = train_fg(model, optim, loss, features, labels, g, g,
                                            test_mask, device, args.n_epochs,
                                            args.threshold,  args.compute_metrics)
print("Finished Model training")

print("Saving model")

# Create the model directory if it does not exist yet.
os.makedirs(args.model_dir, exist_ok=True)

save_model(g, model, args.model_dir, id_to_node, mean, stdev)
print("Model and metadata saved")

In [None]:
%tb

In [None]:
!pip install torchdata

## JIDAPA



In [None]:
from train_origin import train_fg, initial_record
from gnn.data import get_features, get_labels, parse_edgelist
import dgl
import torch as th
import optuna
import glob
import os
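from collections import defaultdict

# Note: get_model and get_metrics, used in objective() below, are not imported here.
# They are assumed to come from the star imports in the "Prepare environment" cell.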


In [None]:
def parse_edgelist_multi(edge_files, tag_path):
    graphs = []
    for edge_file in edge_files:
        if os.stat(edge_file).st_size == 0:
            print(f"⚠️ Empty file: {edge_file}")
            continue
        try:
            result = parse_edgelist(edge_file, tag_path)  # ⛔️ id_to_node is no longer passed
            if isinstance(result, tuple):
                g = result[0]
            else:
                g = result
            if hasattr(g, 'ntypes'):
                graphs.append(g)
            else:
                print(f"⚠️ Not a valid DGLGraph: {edge_file}")
        except Exception as e:
            print(f"❌ Error parsing {edge_file}: {e}")
    print(f"📎 Merging {len(graphs)} graphs...")
    if not graphs:
        raise ValueError("No valid graphs to merge.")
    return dgl.merge(graphs)



In [None]:
def load_graph_data():
    data_dir = './data'
    edge_files = glob.glob(os.path.join(data_dir, 'relation_*_edgelist.csv'))
    tag_path = os.path.join(data_dir, 'tags.csv')

    print(f"🧩 Found {len(edge_files)} edge files")

    g = parse_edgelist_multi(edge_files, tag_path)
    features = get_features(g)
    labels = get_labels(g, 'TransactionID', tag_path)
    test_mask = g.ndata['target']
    user_type = 'TransactionID'

    return g, features, labels, test_mask, user_type


In [None]:
# ✅ Objective for Optuna
def objective(trial):
    params = {
        'lr': trial.suggest_float('lr', 1e-4, 1e-2, log=True),
        'n_hidden': trial.suggest_categorical('n_hidden', [16, 32, 64]),
        'n_layers': trial.suggest_int('n_layers', 1, 3),
        'embedding_size': trial.suggest_categorical('embedding_size', [64, 128, 256]),
        'dropout': trial.suggest_float('dropout', 0.1, 0.5),
        # Sampled on a log scale because the range spans four orders of magnitude.
        'weight_decay': trial.suggest_float('weight_decay', 1e-6, 1e-2, log=True),
        'threshold': trial.suggest_float('threshold', 0.3, 0.7),
        'optimizer': trial.suggest_categorical('optimizer', ['adam', 'adamw'])
    }

    g, features, labels, test_mask, user_type = load_graph_data()
    device = th.device('cuda' if th.cuda.is_available() else 'cpu')

    # ✅ Build the target mask from NTYPE (based on the actual node type)
    if dgl.NTYPE not in g.ndata:
        raise ValueError("Graph missing dgl.NTYPE info")

    ntype_tensors = g.ndata[dgl.NTYPE]
    user_type_id = None
    if hasattr(g, "ntypes"):
        for i, name in enumerate(g.ntypes):
            if name == user_type:
                user_type_id = i
                break
    if user_type_id is None:
        raise ValueError(f"user_type '{user_type}' not found in g.ntypes")

    g.ndata["target"] = (ntype_tensors == user_type_id)

    # ✅ Create the model and optimizer
    model = get_model(
        {user_type: features.shape[0]},
        ["dummy_edge"],
        {
            'n_hidden': params['n_hidden'],
            'n_layers': params['n_layers'],
            'embedding_size': params['embedding_size'],
            'dropout': params['dropout']
        },
        features.shape[1],
        2,
        device
    )
    optimizer_cls = th.optim.Adam if params["optimizer"] == "adam" else th.optim.AdamW
    optimizer = optimizer_cls(model.parameters(), lr=params["lr"], weight_decay=params["weight_decay"])
    loss = th.nn.CrossEntropyLoss()

    initial_record()
    model, class_preds, pred_proba = train_fg(
        model, optimizer, loss, features, labels,
        g, g, test_mask, device, 10, params["threshold"], compute_metrics=False
    )

    _, f1, *_ = get_metrics(class_preds, pred_proba, labels.numpy(), test_mask.numpy(), './output/')
    return f1

In [None]:
if __name__ == "__main__":
    study = optuna.create_study(direction="maximize", study_name="gnn_realdata")
    study.optimize(objective, n_trials=30)
    print("Best F1:", study.best_value)
    print("Best Params:", study.best_trial.params)
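
To see in outline what the study loop does, the sketch below runs a plain random search over two of the same parameters with a toy objective. It is a stand-in for illustration, not Optuna: Optuna's default TPE sampler adapts to earlier trials, whereas this loop samples each trial independently. `toy_objective` is an invented function rather than a trained model's F1, and all helper names are hypothetical.

```python
import math
import random
from typing import Dict, Tuple


def sample_params(rng: random.Random) -> Dict[str, float]:
    """Draw lr log-uniformly from [1e-4, 1e-2] and dropout uniformly from [0.1, 0.5]."""
    return {
        "lr": 10 ** rng.uniform(-4, -2),
        "dropout": rng.uniform(0.1, 0.5),
    }


def toy_objective(params: Dict[str, float]) -> float:
    """Invented score that peaks near lr=1e-3 and dropout=0.3 (not a real F1)."""
    return 1.0 - abs(math.log10(params["lr"]) + 3) - abs(params["dropout"] - 0.3)


def random_search(n_trials: int, seed: int = 0) -> Tuple[float, Dict[str, float]]:
    """Evaluate n_trials random configurations and keep the best one (maximize)."""
    rng = random.Random(seed)
    best_value, best_params = float("-inf"), {}
    for _ in range(n_trials):
        params = sample_params(rng)
        value = toy_objective(params)
        if value > best_value:
            best_value, best_params = value, params
    return best_value, best_params


best_value, best_params = random_search(n_trials=30)
print("Best value:", best_value)
print("Best params:", best_params)
```

Sampling the learning rate on a log scale mirrors the `log=True` argument used for `lr` in the objective, so small and large learning rates are explored equally often.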

In [None]:
result = load_graph_data()
print(len(result), result)
