# **Tutorial**

In this notebook, we construct our own graph neural network with PyTorch Geometric (PyG) and apply it to an Open Graph Benchmark (OGB) dataset. The dataset is used to benchmark the model's performance on a graph property prediction task: predicting properties of entire graphs or subgraphs.

**Note**: Make sure to **run all the cells in each section sequentially**, so that intermediate variables and packages carry over to the next cell.

We recommend running this notebook in Colab so you don't need to install the dependencies yourself!

# Device
You might need to use a GPU for this Colab to run quickly.

Please click `Runtime` and then `Change runtime type`. Then set the `hardware accelerator` to **GPU**.

# Setup
Installing PyG on Colab can be a little tricky. First, check which version of PyTorch you are running. Copy that version string and paste it into the URLs in the install cell below.

In [1]:
import torch
import os
print("PyTorch has version {}".format(torch.__version__))

PyTorch has version 2.1.0+cu118


Download the necessary packages for PyG. Make sure that the torch version in the URLs matches the output of the cell above. If you run into issues, more information can be found on [PyG's installation page](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html).

In [None]:
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-2.1.0+cu118.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-2.1.0+cu118.html
!pip install torch-geometric
!pip install ogb
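
The two wheel URLs above follow one pattern: the PyTorch version string is spliced into the name of the index page. A minimal sketch of that substitution, assuming the URL layout shown in the install cell; `pyg_wheel_url` is a hypothetical helper for illustration, not part of PyG:

```python
def pyg_wheel_url(torch_version: str) -> str:
    """Build the wheel index URL for a given PyTorch version string.

    Hypothetical helper: it only mirrors the URL pattern used in the
    install cell and does not check that the page exists.
    """
    return f"https://pytorch-geometric.com/whl/torch-{torch_version}.html"


# In the notebook you would pass torch.__version__ instead of a literal.
print(pyg_wheel_url("2.1.0+cu118"))
```

If the printed URL differs from the one in the install cell, the installed PyTorch version has probably changed and the `-f` arguments need updating.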

# Dataset downloading

In [10]:
import torch_geometric.transforms as T
from ogb.graphproppred import PygGraphPropPredDataset

dataset_name = 'ogbg-molhiv'
# Load the dataset and transform it to sparse tensor
dataset = PygGraphPropPredDataset(name=dataset_name,
                                transform=T.ToSparseTensor())
print('The {} dataset has {} graph'.format(dataset_name, len(dataset)))

# Extract the graph
print("Example of graph:", dataset[0])

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# If you use GPU, the device should be cuda
print('Device: {}'.format(device))

The ogbg-molhiv dataset has 41127 graph
Example of graph: Data(edge_attr=[40, 3], x=[19, 9], y=[1, 1], num_nodes=19, adj_t=[19, 19, nnz=40])
Device: cuda
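
In the example graph above, `adj_t=[19, 19, nnz=40]` describes a 19 x 19 sparse adjacency matrix with 40 nonzero entries. The sketch below shows, in plain Python, how such a nonzero count arises when each undirected edge is stored in both directions. It is a simplified illustration on a toy graph (the function name and the toy edges are assumptions), not how `T.ToSparseTensor()` is implemented:

```python
def to_adjacency(num_nodes, edges):
    """Return per-node sorted neighbor lists and the number of nonzeros."""
    rows = [set() for _ in range(num_nodes)]
    for src, dst in edges:
        rows[src].add(dst)
    nnz = sum(len(r) for r in rows)
    return [sorted(r) for r in rows], nnz


# Toy path graph 0 - 1 - 2, each undirected edge listed in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
rows, nnz = to_adjacency(3, edges)
print(rows)  # [[1], [0, 2], [1]]
print(nnz)   # 4
```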


# GCN Model

In [11]:
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool


class GCN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers,
                 dropout):
        super().__init__()

        # A list of GCNConv layers: input -> hidden -> ... -> output
        self.convs = torch.nn.ModuleList()
        self.convs.append(GCNConv(input_dim, hidden_dim))
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_dim, hidden_dim))
        self.convs.append(GCNConv(hidden_dim, output_dim))

        # A list of 1D batch normalization layers, one per non-final conv
        self.bns = torch.nn.ModuleList()
        for _ in range(num_layers - 1):
            self.bns.append(torch.nn.BatchNorm1d(hidden_dim))
        self.dropout = dropout

        # Maps the pooled graph embedding to a single logit
        self.linear = torch.nn.Linear(output_dim, 1)

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()
        for bn in self.bns:
            bn.reset_parameters()
        self.linear.reset_parameters()

    def forward(self, x, adj_t, batch):
        # Conv -> BatchNorm -> ReLU -> Dropout for every layer but the last
        for conv, bn in zip(self.convs[:-1], self.bns):
            x = conv(x, adj_t)
            x = bn(x)
            x = F.relu(x)
            x = F.dropout(x, self.dropout, training=self.training)
        x = self.convs[-1](x, adj_t)

        # Average the node embeddings of each graph, then project to one logit
        x = global_mean_pool(x, batch)
        return self.linear(x)
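
The model above turns node embeddings into one vector per graph with `global_mean_pool(x, batch)`, where `batch` assigns each node to a graph in the mini-batch. The following plain-Python sketch illustrates that averaging step on lists; `mean_pool` is a hypothetical stand-in written for illustration, and it assumes every graph index in `batch` has at least one node:

```python
def mean_pool(node_features, batch):
    """Average node feature vectors per graph.

    node_features: list of equal-length float lists, one per node.
    batch: list of graph indices, one per node (like PyG's `batch` vector).
    """
    num_graphs = max(batch) + 1
    dim = len(node_features[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for feats, g in zip(node_features, batch):
        counts[g] += 1
        for k, value in enumerate(feats):
            sums[g][k] += value
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]


# Two graphs: nodes 0 and 1 belong to graph 0, node 2 belongs to graph 1.
x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
batch = [0, 0, 1]
pooled = mean_pool(x, batch)
print(pooled)  # [[2.0, 3.0], [10.0, 20.0]]
```

Because the result has one row per graph, the final linear layer produces one logit per graph regardless of how many nodes each graph has.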

# Training arguments

In [21]:
import torch
from torch_geometric.loader import DataLoader  # torch_geometric.data.DataLoader is deprecated
from torch_geometric.nn import GCNConv
import torch.nn.functional as F
import numpy as np
from torch_geometric.nn.pool import global_mean_pool
from tqdm import tqdm

split_idx = dataset.get_idx_split()
train_loader = DataLoader(dataset[split_idx["train"]], batch_size=64, shuffle=True)
valid_loader = DataLoader(dataset[split_idx["valid"]], batch_size=64, shuffle=False)
test_loader = DataLoader(dataset[split_idx["test"]], batch_size=64, shuffle=False)

args = {
    'device': device,
    'num_layers': 5,
    'hidden_dim': 512,
    'dropout': 0.5,
    'lr': 0.01,
    'epochs': 30,
    'out_channels': 128,
}

model = GCN(dataset.num_node_features, args['hidden_dim'],
            args["out_channels"], args['num_layers'],
            args['dropout']).to(device)

# Loss and optimizer
criterion = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
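
`BCEWithLogitsLoss` takes raw logits, not probabilities, and applies the sigmoid internally. The sketch below writes out binary cross-entropy on a single logit in one common numerically stable form and checks it against the naive formula. It is a plain-Python illustration of the math; PyTorch's own implementation may differ in detail, and the function names are assumptions:

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


logit, target = 2.0, 1.0
stable = bce_with_logits(logit, target)
naive = -math.log(sigmoid(logit))  # -log(p) for a positive example
print(stable, naive)               # the two values agree up to rounding
```

The practical consequence is that the model's final layer should not apply a sigmoid itself, and that its outputs have to be passed through a sigmoid before being read as probabilities.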



# Train/eval functions

In [22]:
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
# Training loop
def train():
  model.train()
  train_loss = []
  for data in tqdm(train_loader, total=len(train_loader)):
      optimizer.zero_grad()
      data = data.to(device)
      data.adj_t = data.adj_t.to_symmetric()
      data.x = data.x.float()
      out = model(data.x, data.adj_t, data.batch)
      loss = criterion(out, data.y.view(-1, 1).to(torch.float32))
      loss.backward()
      optimizer.step()
      train_loss.append(loss.item())

  return sum(train_loss)/len(train_loss)

# Evaluation
def evaluate(model, loader, display=False):
  model.eval()
  with torch.no_grad():
      y_true = []
      y_pred = []
      for data in loader:
          data = data.to(device)
          data.x = data.x.float()
          data.adj_t = data.adj_t.to_symmetric()
          out = model(data.x, data.adj_t, data.batch)
          y_true.append(data.y.view(-1).cpu().numpy())
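          # `out` holds raw logits, so this threshold yields hard 0/1 labels
          # (logit > 0.5 corresponds to a probability above roughly 0.62)
          y_pred.append((out > 0.5).view(-1).cpu().numpy())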

  y_true = np.concatenate(y_true)
  y_pred = np.concatenate(y_pred)

  # Report ROC AUC, F1 score, and accuracy; all three are computed from the hard 0/1 labels above
  if display:
    print(f"ROC AUC Score: {roc_auc_score(y_true, y_pred)}")
    print(f"F1 Score: {f1_score(y_true, y_pred)}")
    print(f"Accuracy: {accuracy_score(y_true, y_pred)}")
  else:
    return roc_auc_score(y_true, y_pred)
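
`evaluate()` above passes thresholded 0/1 predictions to `roc_auc_score`. ROC AUC is a ranking metric, so it is usually computed from continuous scores (logits or probabilities) instead. The sketch below uses a small pairwise AUC written in plain Python to show the difference; the logits are made-up illustrative values and `roc_auc` is a hypothetical helper, not scikit-learn's implementation:

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Quadratic in the number of samples, so this is
    meant for illustration only.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
logits = [-3.0, -1.0, -2.0, 0.2]        # hypothetical model outputs
hard = [int(z > 0.5) for z in logits]   # same threshold as evaluate()

auc_scores = roc_auc(y_true, logits)
auc_hard = roc_auc(y_true, hard)
print(auc_scores)  # 0.75
print(auc_hard)    # 0.5
```

When every logit falls below the threshold, all hard predictions are 0 and the AUC collapses to exactly 0.5, even though the raw scores still carry ranking information. One way to act on this, under the assumption that nothing else in the notebook depends on the hard labels, would be to collect `out.view(-1)` for the ROC AUC and keep the thresholded labels only for F1 and accuracy.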

# Training

In [23]:
import copy
best_model = None
best_valid_roc_auc = 0
for epoch in range(args["epochs"]):
  train_loss = train()
  roc_auc_val = evaluate(model, valid_loader)
  print(f"Epoch #{epoch + 1}. Train Loss: {train_loss}. ROC_AUC_val: {roc_auc_val}")
  if roc_auc_val > best_valid_roc_auc:
    best_valid_roc_auc = roc_auc_val
    best_model = copy.deepcopy(model)

100%|██████████| 515/515 [00:22<00:00, 22.97it/s]


Epoch #1. Train Loss: 0.18592179684529025. ROC_AUC_val: 0.5


100%|██████████| 515/515 [00:22<00:00, 22.87it/s]


Epoch #2. Train Loss: 0.15578844076908618. ROC_AUC_val: 0.5


100%|██████████| 515/515 [00:22<00:00, 22.69it/s]


Epoch #3. Train Loss: 0.15357059603104892. ROC_AUC_val: 0.5559000220458554


100%|██████████| 515/515 [00:22<00:00, 23.23it/s]


Epoch #4. Train Loss: 0.1515967304881626. ROC_AUC_val: 0.5


100%|██████████| 515/515 [00:21<00:00, 24.11it/s]


Epoch #5. Train Loss: 0.14920218694340256. ROC_AUC_val: 0.5


100%|██████████| 515/515 [00:21<00:00, 23.88it/s]


Epoch #6. Train Loss: 0.14720059062887741. ROC_AUC_val: 0.5


100%|██████████| 515/515 [00:22<00:00, 22.77it/s]


Epoch #7. Train Loss: 0.1460420221017981. ROC_AUC_val: 0.5481426366843034


100%|██████████| 515/515 [00:22<00:00, 23.34it/s]


Epoch #8. Train Loss: 0.1455311675818221. ROC_AUC_val: 0.5


100%|██████████| 515/515 [00:22<00:00, 23.25it/s]


Epoch #9. Train Loss: 0.14324428084695223. ROC_AUC_val: 0.5061728395061729


100%|██████████| 515/515 [00:21<00:00, 24.04it/s]


Epoch #10. Train Loss: 0.14204590580995802. ROC_AUC_val: 0.5


100%|██████████| 515/515 [00:21<00:00, 23.45it/s]


Epoch #11. Train Loss: 0.14202204871452548. ROC_AUC_val: 0.5185185185185185


100%|██████████| 515/515 [00:22<00:00, 22.83it/s]


Epoch #12. Train Loss: 0.1407237218602479. ROC_AUC_val: 0.49987599206349204


100%|██████████| 515/515 [00:22<00:00, 23.05it/s]


Epoch #13. Train Loss: 0.1407564827141542. ROC_AUC_val: 0.5


100%|██████████| 515/515 [00:22<00:00, 22.89it/s]


Epoch #14. Train Loss: 0.13989909892015664. ROC_AUC_val: 0.537037037037037


100%|██████████| 515/515 [00:21<00:00, 23.51it/s]


Epoch #15. Train Loss: 0.13876484247304283. ROC_AUC_val: 0.5


100%|██████████| 515/515 [00:23<00:00, 22.38it/s]


Epoch #16. Train Loss: 0.1381435959916381. ROC_AUC_val: 0.5123456790123457


100%|██████████| 515/515 [00:21<00:00, 23.45it/s]


Epoch #17. Train Loss: 0.13824860710397507. ROC_AUC_val: 0.5


100%|██████████| 515/515 [00:22<00:00, 23.36it/s]


Epoch #18. Train Loss: 0.1374718436627712. ROC_AUC_val: 0.6035741843033509


100%|██████████| 515/515 [00:22<00:00, 22.87it/s]


Epoch #19. Train Loss: 0.1362828128293012. ROC_AUC_val: 0.5185185185185185


100%|██████████| 515/515 [00:22<00:00, 23.27it/s]


Epoch #20. Train Loss: 0.13628632044690905. ROC_AUC_val: 0.5302441578483246


100%|██████████| 515/515 [00:21<00:00, 24.00it/s]


Epoch #21. Train Loss: 0.1357616801646728. ROC_AUC_val: 0.49975198412698413


100%|██████████| 515/515 [00:21<00:00, 23.83it/s]


Epoch #22. Train Loss: 0.13618539322203801. ROC_AUC_val: 0.5308641975308642


100%|██████████| 515/515 [00:22<00:00, 22.80it/s]


Epoch #23. Train Loss: 0.1360282866302335. ROC_AUC_val: 0.5482666446208113


100%|██████████| 515/515 [00:22<00:00, 23.00it/s]


Epoch #24. Train Loss: 0.13582148599320823. ROC_AUC_val: 0.5491347001763669


100%|██████████| 515/515 [00:22<00:00, 23.17it/s]


Epoch #25. Train Loss: 0.13539949756441186. ROC_AUC_val: 0.5123456790123457


100%|██████████| 515/515 [00:23<00:00, 21.48it/s]


Epoch #26. Train Loss: 0.13408220889571223. ROC_AUC_val: 0.6399911816578483


100%|██████████| 515/515 [00:21<00:00, 23.71it/s]


Epoch #27. Train Loss: 0.13377767369802138. ROC_AUC_val: 0.536913029100529


100%|██████████| 515/515 [00:22<00:00, 23.39it/s]


Epoch #28. Train Loss: 0.13321246951818466. ROC_AUC_val: 0.5487626763668431


100%|██████████| 515/515 [00:22<00:00, 22.73it/s]


Epoch #29. Train Loss: 0.13240979595598087. ROC_AUC_val: 0.5792548500881834


100%|██████████| 515/515 [00:22<00:00, 22.64it/s]


Epoch #30. Train Loss: 0.1321655114949907. ROC_AUC_val: 0.5182705026455026
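
The training loop above stores the best model with `copy.deepcopy(model)` rather than a plain assignment. The sketch below shows why, using a dictionary as a stand-in for a model whose parameters are updated in place (the names are illustrative assumptions):

```python
import copy

weights = {"w": [0.0]}                  # stand-in for mutable model parameters

best_alias = weights                    # another name for the same object
best_snapshot = copy.deepcopy(weights)  # independent copy of the current state

weights["w"][0] = 1.0                   # "training" continues and mutates in place

print(best_alias["w"][0])     # 1.0, the alias followed the update
print(best_snapshot["w"][0])  # 0.0, the snapshot kept the old value
```

Without the deep copy, `best_model` would simply track the final epoch's weights instead of those from the epoch with the best validation score.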


In [25]:
print("Performance on validation dataset:")
evaluate(best_model, valid_loader, display=True)
print("\nPerformance on test dataset:")
evaluate(best_model, test_loader, display=True)

Performance on validation dataset:
ROC AUC Score: 0.6399911816578483
F1 Score: 0.3833333333333333
Accuracy: 0.9820082664721614

Performance on test dataset:
ROC AUC Score: 0.5554327043782229
F1 Score: 0.18404907975460125
Accuracy: 0.9676635059567226
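
The results above pair a high accuracy with a much lower F1 score. One plausible reading is class imbalance: when positives are rare, predicting the majority class almost everywhere already gives a high accuracy. The sketch below illustrates the effect on synthetic labels (30 positives out of 1000, chosen for illustration and not taken from the dataset):

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Synthetic, imbalanced labels and a classifier that always predicts 0.
y_true = [1] * 30 + [0] * 970
y_pred = [0] * 1000
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc)  # 0.97
print(f1)   # 0.0
```

For a task like this, accuracy alone says little about how well the positive class is detected, which is why the notebook also reports F1 and ROC AUC.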
