# Edge Classification Pretraining demo (IEEE)

## Overview

It is often helpful to pre-train a network on a related dataset and then fine-tune the learned weights on the downstream task of interest.

This notebook demonstrates the steps for pretraining a GNN on synthetic data and fine-tuning it on real data.
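The two-phase idea can be sketched without any GNN machinery. The block below is a toy illustration only, not the notebook's model: a pure-Python logistic regression is first trained on a stand-in "synthetic" set and then trained further on a stand-in "real" set, carrying the weights over between phases. The helper names (`sgd_epochs`, `make_samples`, `accuracy`) and the toy data are assumptions made for this example.

```python
import math
import random


def sgd_epochs(weights, samples, epochs, lr):
    """Run plain SGD on the logistic loss and return the updated weights."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i, xi in enumerate(x):
                w[i] -= lr * (p - y) * xi
    return w


def accuracy(weights, samples):
    """Fraction of samples whose predicted class matches the label."""
    hits = 0
    for x, y in samples:
        z = sum(wi * xi for wi, xi in zip(weights, x))
        hits += int((z > 0) == (y == 1))
    return hits / len(samples)


def make_samples(n, shift, rng):
    """Toy binary data: label is 1 when x0 + x1 > shift; the last feature is a bias term."""
    out = []
    for _ in range(n):
        x0, x1 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        out.append(((x0, x1, 1.0), int(x0 + x1 > shift)))
    return out


rng = random.Random(0)
synthetic = make_samples(500, shift=0.1, rng=rng)  # stand-in for generated data
real = make_samples(200, shift=0.0, rng=rng)       # stand-in for original data

w0 = [0.0, 0.0, 0.0]
w_pre = sgd_epochs(w0, synthetic, epochs=5, lr=0.1)   # phase 1: pretrain
w_fine = sgd_epochs(w_pre, real, epochs=5, lr=0.1)    # phase 2: fine-tune from pretrained weights

print("pretrained only:", accuracy(w_pre, real))
print("after fine-tuning:", accuracy(w_fine, real))
```

The only point of the sketch is the hand-off: the second call starts from `w_pre` rather than from `w0`, which is what the GNN sections of this notebook do with a single model object.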

### Imports

In [1]:
# preprocessing
from syngen.preprocessing.datasets.ieee import IEEEPreprocessing

# generation
from syngen.synthesizer import StaticBipartiteGraphSynthesizer
from syngen.generator.tabular import CTGANGenerator
from syngen.generator.graph import RMATBipartiteGenerator
from syngen.graph_aligner.xgboost_aligner import XGBoostAligner

# training
import torch
from syngen.benchmark.data_loader.datasets.edge_ds import EdgeDS
from syngen.benchmark.models import GATEC
from syngen.benchmark.tasks.ec import train_epoch

# utils
import dgl
import time
import cudf
import numpy as np
import pandas as pd
from syngen.utils.types import MetaData, ColumnType

DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


### Generate synthetic data

In the following cells, a synthesizer is instantiated and fitted on the IEEE dataset.

Once fitted, the synthesizer is used to generate synthetic data with similar characteristics.

For a more detailed explanation, check out `e2e_ieee_demo.ipynb`.
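For intuition about the graph generator, the block below is a minimal sketch of R-MAT-style edge sampling for a rectangular (bipartite) adjacency matrix. It is not the `RMATBipartiteGenerator` implementation. It assumes that the four values logged as `Fit results src_dst` during fitting are quadrant probabilities in the order (a, b, c, d), which the source does not confirm, and the handling of unequal bit depths and out-of-range samples is a simplification chosen for this example.

```python
import random


def rmat_edge(n_src, n_dst, probs, rng):
    """Sample one (src, dst) pair by recursive quadrant selection."""
    a, b, c, d = probs
    src_bits = max(1, (n_src - 1).bit_length())
    dst_bits = max(1, (n_dst - 1).bit_length())
    while True:
        src = dst = 0
        for level in range(max(src_bits, dst_bits)):
            split_src = level < src_bits
            split_dst = level < dst_bits
            r = rng.random()
            if split_src and split_dst:
                # a = top-left, b = top-right, c = bottom-left, d = bottom-right
                lower = r >= a + b
                right = (a <= r < a + b) or (r >= a + b + c)
            elif split_src:
                # only rows left to split: use the row marginals (a + b vs c + d)
                lower, right = r >= a + b, False
            else:
                # only columns left to split: use the column marginals (a + c vs b + d)
                lower, right = False, r >= a + c
            if split_src:
                src = (src << 1) | int(lower)
            if split_dst:
                dst = (dst << 1) | int(right)
        # reject samples that fall outside the requested node ranges
        if src < n_src and dst < n_dst:
            return src, dst


# Values copied from the fit log; the (a, b, c, d) ordering is an assumption.
fit_probs = (0.43224427700042733, 0.22040712833404563,
             0.06775572299957267, 0.27959287166595437)
rng = random.Random(42)
sample_edges = [rmat_edge(17091, 198, fit_probs, rng) for _ in range(5)]
print(sample_edges)
```

Skewed quadrant probabilities are what give R-MAT graphs their heavy-tailed degree distribution: low node ids are reached far more often than high ones.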

In [2]:
edge_feature_generator = CTGANGenerator(epochs=10, batch_size=2000)
static_graph_generator = RMATBipartiteGenerator(seed=42)
preprocessing = IEEEPreprocessing(cached=False)
graph_aligner = XGBoostAligner(features_to_correlate_edge={'TransactionAmt': ColumnType.CONTINUOUS})

num_edges = 52008
num_nodes_src_set = 17091
num_nodes_dst_set = 198
num_edges_src_dst = num_edges
num_edges_dst_src = num_edges

DEBUG:root:Initialized logger
DEBUG:root:Using seed: 42


In [3]:
synthesizer = StaticBipartiteGraphSynthesizer(
                                    graph_generator=static_graph_generator,
                                    graph_info=preprocessing.graph_info,
                                    edge_feature_generator=edge_feature_generator,
                                    graph_aligner=graph_aligner)

In [4]:
data = preprocessing.transform('/workspace/data/ieee-fraud/data.csv')

INFO:syngen.preprocessing.base_preprocessing:read data : (52008, 386)
INFO:syngen.preprocessing.base_preprocessing:droping column: []


In [5]:
synthesizer.fit(edge_data = data[MetaData.EDGE_DATA])

INFO:syngen.synthesizer.static_bipartite_graph_synthesizer:Fitting feature generator...
INFO:syngen.synthesizer.static_bipartite_graph_synthesizer:Fitting graph generator...
DEBUG:root:Fit results dst_src: None
DEBUG:root:Fit results src_dst: (0.43224427700042733, 0.22040712833404563, 0.06775572299957267, 0.27959287166595437)
Fitting TransactionAmt ...
Parameters: { "n_estimators", "silent", "verbose_eval" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


[0]	train-rmse:1.08987
[1]	train-rmse:1.07147
[2]	train-rmse:1.05633
[3]	train-rmse:1.03909
[4]	train-rmse:1.02918
[5]	train-rmse:1.02108
[6]	train-rmse:1.01415
[7]	train-rmse:1.00876
[8]	train-rmse:1.00009
[9]	train-rmse:0.99292


In [6]:
synthetic_data = synthesizer.generate(num_nodes_src_set, num_nodes_dst_set, num_edges_src_dst, num_edges_dst_src, graph_noise=0.5)

INFO:syngen.synthesizer.static_bipartite_graph_synthesizer:Generating graph...
INFO:syngen.utils.gen_utils:writing to file table_edge_samples_52018.csv


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.05it/s]


INFO:syngen.synthesizer.static_bipartite_graph_synthesizer:Generating final graph...


Aligner - preds edge: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:00<00:00, 74.44it/s]


Finished ranking, overlaying features on generated graph...
INFO:syngen.synthesizer.static_bipartite_graph_synthesizer:Saving data to ./


### Train GNN

To train an example GNN we need the following:

- a dataset object instantiated using either the synthetic or original data
- a model, an optimizer, and hyperparameters

The tool implements an example data loader for edge classification under `syngen/benchmark/data_loader`.

This dataset object is used to create the DGL graphs corresponding to both the generated data and the real data.

#### Create datasets

In [7]:
dataset = EdgeDS(target_col='isFraud')
g, e_ids = dataset.get_graph(data[MetaData.EDGE_DATA],graph_info=preprocessing.graph_info)
gs, es_ids = dataset.get_graph(synthetic_data[MetaData.EDGE_DATA],graph_info=preprocessing.graph_info)

#### Create helper function


The helper function defines a simple training loop and standard metrics for edge classification.

In [8]:
def train(model, optimizer, graph, edge_ids, epochs, shuffle=True, batch_size=256):
    sampler = dgl.dataloading.MultiLayerFullNeighborSampler(1)
    dataloader = dgl.dataloading.EdgeDataLoader(
        graph, edge_ids, sampler,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=False,
        num_workers=8)
    
    best_val_acc, best_test_acc = 0, 0
    total_batches = []
    batch_times = []

    for e in range(epochs):

        train_acc, val_acc, test_acc, losses, e_batch_times = train_epoch(model, dataloader, optimizer, verbose=True)
        if e == 0:
            e_batch_times = e_batch_times[10:]  # ignore warm-up steps
        batch_times += e_batch_times
        
        val_acc = np.mean(val_acc)
        test_acc = np.mean(test_acc)
        train_acc = np.mean(train_acc)
        loss = np.mean(losses)

        if best_val_acc < val_acc:
            best_val_acc = val_acc
            best_test_acc = test_acc
        
        if (e+1) % 1 == 0:
            print('epoch {}, loss: {:.3f}, train acc: {:.3f} val acc: {:.4f} (best {:.4f}), test acc: {:.4f} (best {:.4f})'.format(
                e+1, loss, train_acc, val_acc, best_val_acc, test_acc, best_test_acc))
    return best_val_acc.item(), best_test_acc.item(), np.mean(batch_times)

#### No-Pretrain

Without pre-training, the model is trained from scratch on the original data graph.

In [9]:
in_feats = g.ndata.get('feat').shape[1]
in_edge_feats = g.edata.get('feat').shape[1]
model = GATEC(
    in_dim=in_feats, 
    in_dim_edge=in_edge_feats, 
    hidden_dim=64, 
    out_dim=32, 
    num_classes=2, 
    n_heads=2,
    in_feat_dropout=0.2,
    dropout=0.2,
    n_layers=1).cuda()

optimizer = torch.optim.Adam(model.parameters(),lr=0.001,weight_decay=0.0)
train(model, optimizer, g, e_ids, 5)

epoch 1, loss: 0.123, train acc: 0.982 val acc: 0.9823 (best 0.9823), test acc: 0.9836 (best 0.9836)
epoch 2, loss: 0.087, train acc: 0.983 val acc: 0.9837 (best 0.9837), test acc: 0.9835 (best 0.9835)
epoch 3, loss: 0.087, train acc: 0.983 val acc: 0.9836 (best 0.9837), test acc: 0.9838 (best 0.9835)
epoch 4, loss: 0.087, train acc: 0.983 val acc: 0.9829 (best 0.9837), test acc: 0.9835 (best 0.9835)
epoch 5, loss: 0.087, train acc: 0.983 val acc: 0.9831 (best 0.9837), test acc: 0.9838 (best 0.9835)


(0.9836600881581213, 0.9835433869385252, 0.07241769592729536)
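Note that the "best" test accuracy in the log is the test accuracy at the epoch with the highest validation accuracy, not the maximum test accuracy seen, because `train` only updates `best_test_acc` when `best_val_acc` improves. The short sketch below replays that selection rule on the per-epoch values copied (rounded, as printed) from the log above.

```python
# (val_acc, test_acc) per epoch, copied from the printed log
history = [
    (0.9823, 0.9836),
    (0.9837, 0.9835),
    (0.9836, 0.9838),
    (0.9829, 0.9835),
    (0.9831, 0.9838),
]

best_val, test_at_best = 0.0, 0.0
for val_acc, test_acc in history:
    if val_acc > best_val:  # same rule as in train(): select on validation only
        best_val, test_at_best = val_acc, test_acc

max_test = max(test for _, test in history)

print(best_val, test_at_best)  # 0.9837 0.9835
print(max_test)                # 0.9838
```

Selecting on validation accuracy avoids tuning on the test split, which is why the reported value (0.9835) sits slightly below the highest test accuracy in the log (0.9838).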

#### Pretrain

In this example, the model is first trained on the generated data for a fixed epoch budget (5 epochs here).

Subsequently it is further trained on the original data graph.

In [10]:
model_pretrain = GATEC(
    in_dim=in_feats, 
    in_dim_edge=in_edge_feats, 
    hidden_dim=64, 
    out_dim=32, 
    num_classes=2, 
    n_heads=2,
    in_feat_dropout=0.2,
    dropout=0.2,
    n_layers=1).cuda()

print('Pretraining')
optimizer_pretrain = torch.optim.Adam(model_pretrain.parameters(),lr=0.0005,weight_decay=0.0)
_, _, synthetic_graph_throughput_batches = train(model_pretrain, optimizer_pretrain, gs, es_ids, 5)

print('Fine-tuning')
optimizer_pretrain = torch.optim.Adam(model_pretrain.parameters(),lr=0.0005,weight_decay=0.0)
val_acc, test_acc, original_graph_throughput_batches = train(model_pretrain, optimizer_pretrain, g, e_ids, 5)

Pretraining
epoch 1, loss: 0.430, train acc: 0.855 val acc: 0.8576 (best 0.8576), test acc: 0.8483 (best 0.8483)
epoch 2, loss: 0.387, train acc: 0.855 val acc: 0.8551 (best 0.8576), test acc: 0.8482 (best 0.8483)
epoch 3, loss: 0.344, train acc: 0.855 val acc: 0.8570 (best 0.8576), test acc: 0.8473 (best 0.8483)
epoch 4, loss: 0.340, train acc: 0.854 val acc: 0.8581 (best 0.8581), test acc: 0.8499 (best 0.8499)
epoch 5, loss: 0.340, train acc: 0.854 val acc: 0.8557 (best 0.8581), test acc: 0.8500 (best 0.8499)
Fine-tuning
epoch 1, loss: 0.098, train acc: 0.983 val acc: 0.9836 (best 0.9836), test acc: 0.9838 (best 0.9838)
epoch 2, loss: 0.087, train acc: 0.983 val acc: 0.9836 (best 0.9836), test acc: 0.9841 (best 0.9841)
epoch 3, loss: 0.087, train acc: 0.983 val acc: 0.9830 (best 0.9836), test acc: 0.9833 (best 0.9841)
epoch 4, loss: 0.087, train acc: 0.983 val acc: 0.9832 (best 0.9836), test acc: 0.9834 (best 0.9841)
epoch 5, loss: 0.087, train acc: 0.983 val acc: 0.9831 (best 0.9836

In [11]:
synthetic_graph_throughput_batches, original_graph_throughput_batches

(0.07172738509898138, 0.07249173903376749)
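Despite the variable names, the two values above are what `train` returns as `np.mean(batch_times)`, that is, a mean time per batch rather than a throughput. A rough conversion to edges per second is sketched below. It assumes the times are in seconds and that every batch holds the full `batch_size=256` edges (the last batch of an epoch may be smaller because `drop_last=False`); the helper name `edges_per_second` is made up for this example.

```python
def edges_per_second(mean_batch_time_s, batch_size=256):
    """Convert a mean per-batch time into an approximate edges/second rate."""
    return batch_size / mean_batch_time_s


# Mean batch times copied from the cell output above
synthetic_rate = edges_per_second(0.07172738509898138)
original_rate = edges_per_second(0.07249173903376749)

print(f"synthetic graph: {synthetic_rate:.0f} edges/s")
print(f"original graph:  {original_rate:.0f} edges/s")
```

Under these assumptions the two graphs train at nearly the same rate, which is the comparison this cell is meant to support.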

### CLI example

The tool also provides this functionality through its CLI.

The commands below generate the synthetic graph and then pretrain and fine-tune on the downstream task, mirroring the steps above.

#### Generate synthetic graph

In [1]:
!python -m syngen synthesize \
--synthesizer static_bipartite_graph \
--preprocessing ieee \
--aligner xg_boost \
--graph-generator rmat_bipartite \
--gg-seed 42 \
--edge-generator ctgan \
--eg-batch-size 2000 \
--eg-epochs 10 \
--num-nodes-src-set 17091 \
--num-nodes-dst-set 198 \
--num-edges-src-dst 52008 \
--num-edges-dst-src 52008 \
--data-path '/workspace/data/ieee-fraud/data.csv' \
--save-path '/workspace/ieee/' \
--features-to-correlate-edge "{\"TransactionAmt\": \"continuous\"}"

INFO:__main__:|    Synthetic Graph Generation Tool    |
INFO:syngen.preprocessing.base_preprocessing:read data : (52008, 386)
INFO:syngen.preprocessing.base_preprocessing:droping column: []
DEBUG:root:Initialized logger
DEBUG:root:Using seed: 42
INFO:syngen.synthesizer.static_bipartite_graph_synthesizer:Fitting feature generator...
INFO:syngen.synthesizer.static_bipartite_graph_synthesizer:Fitting graph generator...
DEBUG:root:Fit results dst_src: None
DEBUG:root:Fit results src_dst: (0.43224427700042733, 0.22040712833404563, 0.06775572299957267, 0.27959287166595437)
DEBUG:asyncio:Using selector: EpollSelector
DEBUG:asyncio:Using selector: EpollSelector
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
distributed.preloading - INFO - Im
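The hand-escaped JSON passed to `--features-to-correlate-edge` is easy to get wrong. One option, sketched below, is to build the argument list in Python and let `json.dumps` and `shlex.join` handle the quoting. The flags and values are copied from the `synthesize` command above; nothing is executed here, and passing the list to `subprocess.run` is left as an assumption about how one might use it.

```python
import json
import shlex

# Same mapping as the escaped JSON string in the command above
features = {"TransactionAmt": "continuous"}

args = [
    "python", "-m", "syngen", "synthesize",
    "--synthesizer", "static_bipartite_graph",
    "--preprocessing", "ieee",
    "--aligner", "xg_boost",
    "--graph-generator", "rmat_bipartite",
    "--gg-seed", "42",
    "--edge-generator", "ctgan",
    "--eg-batch-size", "2000",
    "--eg-epochs", "10",
    "--num-nodes-src-set", "17091",
    "--num-nodes-dst-set", "198",
    "--num-edges-src-dst", "52008",
    "--num-edges-dst-src", "52008",
    "--data-path", "/workspace/data/ieee-fraud/data.csv",
    "--save-path", "/workspace/ieee/",
    "--features-to-correlate-edge", json.dumps(features),
]

# shlex.join quotes each argument for a POSIX shell, so this prints a
# copy-pasteable command line
command = shlex.join(args)
print(command)
```

Keeping the arguments as a list also avoids a second round of shell escaping if the command is launched from Python rather than pasted into a terminal.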

#### Results without pretraining

In [2]:
!python -m syngen pretrain \
--model gat_ec \
--hidden-dim 64 \
--out-dim 32 \
--n-layers 1 \
--n-heads 2 \
--weight-decay 0.0 \
--learning-rate 0.0005 \
--batch-size 256 \
--pretrain-epochs 0 \
--finetune-epochs 5 \
--data-path '/workspace/data/ieee-fraud/data.csv' \
--pretraining-path '/workspace/ieee/' \
--preprocessing ieee \
--task ec \
--target-col isFraud \
--num-classes 2 \
--log-interval 1

INFO:__main__:|    Synthetic Graph Generation Tool    |
INFO:syngen.preprocessing.base_preprocessing:read data : (52008, 386)
INFO:syngen.preprocessing.base_preprocessing:droping column: []
INFO:syngen.benchmark.tasks.ec:Finetuning: In epoch 0, loss: 0.142, val acc: 0.985 (best 0.985), test acc: 0.985 (best 0.985)
INFO:syngen.benchmark.tasks.ec:Finetuning: In epoch 1, loss: 0.089, val acc: 0.985 (best 0.985), test acc: 0.985 (best 0.985)
INFO:syngen.benchmark.tasks.ec:Finetuning: In epoch 2, loss: 0.089, val acc: 0.985 (best 0.985), test acc: 0.985 (best 0.985)
INFO:syngen.benchmark.tasks.ec:Finetuning: In epoch 3, loss: 0.089, val acc: 0.985 (best 0.985), test acc: 0.985 (best 0.985)
INFO:syngen.benchmark.tasks.ec:Finetuning: In epoch 4, loss: 0.089, val acc: 0.985 (best 0.985), test acc: 0.985 (best 0.985)
INFO:__main__:{'finetune-loss': 0.08860890291558177, 'finetune-val-acc': 0.9854286459146762, 'finetune-test-acc': 0.9847563131182802}

#### Pretrain and finetune

In [3]:
!python -m syngen pretrain \
--model gat_ec \
--hidden-dim 64 \
--out-dim 32 \
--n-layers 1 \
--n-heads 2 \
--weight-decay 0.0 \
--learning-rate 0.0005 \
--batch-size 256 \
--pretrain-epochs 5 \
--finetune-epochs 5 \
--data-path '/workspace/data/ieee-fraud/data.csv' \
--pretraining-path '/workspace/ieee/' \
--preprocessing ieee \
--task ec \
--target-col isFraud \
--num-classes 2 \
--log-interval 1

INFO:__main__:|    Synthetic Graph Generation Tool    |
INFO:syngen.preprocessing.base_preprocessing:read data : (52008, 386)
INFO:syngen.preprocessing.base_preprocessing:droping column: []
INFO:syngen.benchmark.tasks.ec:Running pretraining ...
INFO:syngen.benchmark.tasks.ec:Pretraining epoch 0, loss: 0.046, val acc: 1.000 (best 1.000), test acc: 1.000 (best 1.000)
INFO:syngen.benchmark.tasks.ec:Pretraining epoch 1, loss: 0.000, val acc: 1.000 (best 1.000), test acc: 1.000 (best 1.000)
INFO:syngen.benchmark.tasks.ec:Pretraining epoch 2, loss: 0.000, val acc: 1.000 (best 1.000), test acc: 1.000 (best 1.000)
INFO:syngen.benchmark.tasks.ec:Pretraining epoch 3, loss: 0.000, val acc: 1.000 (best 1.000), test acc: 1.000 (best 1.000)
INFO:syngen.benchmark.tasks.ec:Pretraining epoch 4, loss: 0.000, val acc: 1.000 (best 1.000), test acc: 1.000 (best 1.000)
INFO:syngen.benchmark.tasks.ec:Finetuning: In epoch 0, loss: 0.097, val acc: 0.982 (best 0.982), test acc: 0.982 (best 0.982)
INFO:syngen.be