Build and train a graph convolutional network (GCN) using PyTorch Geometric (PyG) for a node property prediction task.

We will use the ogbn-products dataset.

## OGBN-Products

The ogbn-products dataset is an undirected, unweighted graph representing an Amazon product co-purchasing network. Nodes represent products sold on Amazon, and an edge between two products indicates that they are purchased together. Node features are generated by extracting bag-of-words features from the product descriptions and then applying principal component analysis (PCA) to reduce the dimension to 100.

The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used for target labels.
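To make the feature description concrete, here is a minimal sketch of the bag-of-words plus PCA idea on a toy corpus. It is not the pipeline used to build ogbn-products; the documents, the vocabulary, and the helper names `bag_of_words` and `pca_reduce` are all assumptions made for illustration, and the toy example reduces to 2 dimensions rather than 100.

```python
import numpy as np

def bag_of_words(docs, vocab):
    """Count how often each vocabulary word occurs in each document (hypothetical helper)."""
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for word in doc.lower().split():
            if word in index:
                counts[row, index[word]] += 1
    return counts

def pca_reduce(features, k):
    """Project centered features onto their top-k principal directions via SVD."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

# Toy "product descriptions" (made up for this sketch).
docs = ["usb cable charger", "usb charger fast", "cotton shirt blue", "blue cotton socks"]
vocab = sorted({word for doc in docs for word in doc.split()})

X = bag_of_words(docs, vocab)   # one row per product, one column per word
Z = pca_reduce(X, k=2)          # low-dimensional node features
print(X.shape, Z.shape)
```

In the real dataset the same kind of reduction yields the 100-dimensional vector stored in `data.x` for each node.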

In [None]:
import torch
import os
print("PyTorch has version {}".format(torch.__version__))

PyTorch has version 2.0.1+cu118


Install the packages required by PyG. Make sure the torch version in the wheel URL matches the output of the cell above. If you run into any issues, see [PyG's installation page](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html).
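As a small illustration of why the torch version matters, the sketch below builds the wheel index URL from a version string, following the URL pattern used in the install cell. The helper names `split_torch_version` and `pyg_wheel_index` are hypothetical and not part of PyG or pip.

```python
def split_torch_version(version: str):
    """Split a string such as '2.0.1+cu118' into (release, build tag). Hypothetical helper."""
    release, _, build = version.partition("+")
    return release, build or "cpu"

def pyg_wheel_index(version: str) -> str:
    """Build the wheel index URL in the same form the install cell passes to `pip -f`."""
    return f"https://pytorch-geometric.com/whl/torch-{version}.html"

release, build = split_torch_version("2.0.1+cu118")
url = pyg_wheel_index("2.0.1+cu118")
print(release, build)
print(url)
```

If the installed torch build and the wheel index disagree (for example a CPU build of torch with CUDA wheels), the compiled extensions may fail to import.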

In [None]:
# Install torch geometric
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-{torch.__version__}.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-{torch.__version__}.html
!pip install torch-geometric
!pip install ogb

Looking in links: https://pytorch-geometric.com/whl/torch-2.0.1+cu118.html
Collecting torch-scatter
  Downloading https://data.pyg.org/whl/torch-2.0.0%2Bcu118/torch_scatter-2.1.1%2Bpt20cu118-cp310-cp310-linux_x86_64.whl (10.2 MB)
Installing collected packages: torch-scatter
Successfully installed torch-scatter-2.1.1+pt20cu118
Looking in links: https://pytorch-geometric.com/whl/torch-2.0.1+cu118.html
Collecting torch-sparse
  Downloading https://data.pyg.org/whl/torch-2.0.0%2Bcu118/torch_sparse-0.6.17%2Bpt20cu118-cp310-cp310-linux_x86_64.whl (4.8 MB)
Installing collected packages: torch-sparse
Successfully installed torch-sparse-0.6.17+pt20cu118
Collecting torch-geometric
  Downloading torch_geometric-2.3.1.tar.gz (661 kB)

In [None]:
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
import torch_geometric.transforms as T
from torch_geometric.loader import DataLoader  # torch_geometric.data.DataLoader is deprecated in PyG 2.x
import numpy as np
from torch_geometric.typing import SparseTensor

## Load and Preprocess the Dataset

In [None]:
dataset_name = 'ogbn-products'
dataset = PygNodePropPredDataset(name=dataset_name,
                                 transform=T.ToSparseTensor())
data = dataset[0]

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# If you use GPU, the device should be cuda
print('Device: {}'.format(device))

# data = data.to(device)
# split_idx = dataset.get_idx_split()
# train_idx = split_idx['train'].to(device)

This will download 1.38GB. Will you proceed? (y/N)
y
Downloading http://snap.stanford.edu/ogb/data/nodeproppred/products.zip


Downloaded 1.38 GB: 100%|██████████| 1414/1414 [00:17<00:00, 80.17it/s] 


Extracting dataset/products.zip


Processing...


Loading necessary files...
This might take a while.
Processing graphs...


100%|██████████| 1/1 [00:01<00:00,  1.89s/it]


Converting graphs into PyG objects...


100%|██████████| 1/1 [00:00<00:00, 97.22it/s]


Saving...


Done!


Device: cuda


In [None]:
data

Data(num_nodes=2449029, x=[2449029, 100], y=[2449029, 1], adj_t=[2449029, 2449029, nnz=123718280])

This dataset is very large, and training on the full graph in Colab may fail with an out-of-memory error.

One solution is to use batching and train on subgraphs. Here, we simply build a smaller dataset so that we can train on it in one go.

In [None]:
# We need edge indices to build a subgraph; recover them from the sparse adjacency matrix.
# coo() returns (row, col, value) without reaching into private storage attributes.
row, col, _ = data.adj_t.coo()
data.edge_index = torch.stack([row, col])

# We will only use the first 100000 nodes.
sub_nodes = 100000
sub_graph = data.subgraph(torch.arange(sub_nodes))

# Update the adjacency matrix to match the new graph
sub_graph.adj_t = SparseTensor(
    row=sub_graph.edge_index[0],
    col=sub_graph.edge_index[1],
    sparse_sizes=None,
    is_sorted=True,
    trust_data=True,
)

sub_graph = sub_graph.to(device)

sub_graph


Data(num_nodes=100000, x=[100000, 100], y=[100000, 1], adj_t=[100000, 100000, nnz=2818046], edge_index=[2, 2818046])
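The subgraph taken above is an induced subgraph: an edge survives only if both of its endpoints are among the kept nodes, which is why 2,818,046 of the original 123,718,280 entries remain. The dependency-free sketch below illustrates that rule on a toy edge list; `induced_subgraph` is a hypothetical helper, not the PyG implementation.

```python
def induced_subgraph(edges, keep):
    """Keep edges whose endpoints are both in `keep`, relabeled to 0..len(keep)-1."""
    new_id = {old: new for new, old in enumerate(keep)}
    return [(new_id[u], new_id[v]) for u, v in edges if u in new_id and v in new_id]

# Toy undirected graph stored as directed pairs in both directions.
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 5), (5, 2), (3, 4), (4, 3)]

# Keep nodes 0..3: edges touching node 4 or node 5 are dropped.
sub_edges = induced_subgraph(edges, keep=[0, 1, 2, 3])
print(sub_edges)
```

Because edges to nodes outside the subset are discarded, each kept product loses part of its neighborhood, so accuracy on this subgraph is not directly comparable to results on the full graph.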

In [None]:
# Split the nodes into train, validation and test sets (80% / 5% / 15%)
split_sizes = [int(sub_nodes*0.8),int(sub_nodes*0.05),int(sub_nodes*0.15)]
# A random permutation of node ids; chunk it into the three splits
indices = torch.randperm(sub_nodes)
split_idx = dict(zip(["train", "valid", "test"], torch.split(indices, split_sizes, dim=0)))
split_idx

{'train': tensor([36219, 25071, 89645,  ..., 30159, 77955, 70048]),
 'valid': tensor([88275, 65023, 89446,  ..., 64010, 89158, 23509]),
 'test': tensor([51001, 24995, 66942,  ..., 18649, 19824, 75270])}
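The split above is a plain random partition of node ids. The following dependency-free sketch shows the same idea with a fixed seed so that the partition is reproducible; the function name `random_split` and the seed value are assumptions for illustration, and the sketch rounds each size rather than truncating it.

```python
import random

def random_split(num_nodes, fractions, seed=0):
    """Shuffle node ids with a fixed seed and cut them into named, disjoint chunks."""
    indices = list(range(num_nodes))
    random.Random(seed).shuffle(indices)
    splits, start = {}, 0
    for name, fraction in fractions.items():
        size = round(num_nodes * fraction)
        splits[name] = indices[start:start + size]
        start += size
    return splits

splits = random_split(100, {"train": 0.8, "valid": 0.05, "test": 0.15})
print({name: len(ids) for name, ids in splits.items()})
```

Note that this ignores the official split shipped with the dataset (`dataset.get_idx_split()`), which the notebook leaves commented out because it refers to node ids of the full graph.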

In [None]:
print(f"Feature Length of each node: {data.x.shape[1]}")

Feature Length of each node: 100


## GCN Model

Now we will implement our GCN model!
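For reference, `GCNConv` is based on the graph convolution of Kipf and Welling. The symbols below are not used elsewhere in this notebook and are introduced here only for illustration. One layer computes

$$
H^{(l+1)} = \hat{D}^{-1/2}\,\hat{A}\,\hat{D}^{-1/2}\,H^{(l)}\,W^{(l)} + b^{(l)}, \qquad \hat{A} = A + I,
$$

where $A$ is the adjacency matrix, $\hat{D}$ is the diagonal degree matrix of $\hat{A}$, $H^{(0)} = X$ holds the 100-dimensional node features, and $W^{(l)}$, $b^{(l)}$ are the learnable weight and bias of layer $l$. The nonlinearity (ReLU in this notebook) is applied by the model between layers, not inside `GCNConv` itself.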

In [None]:
import torch.nn as nn
from torch_geometric.nn import GCNConv

class GCNModel(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers):
        super(GCNModel, self).__init__()
        self.conv_layers = nn.ModuleList()

        # input layer
        self.conv_layers.append(GCNConv(in_channels, hidden_channels))

        # hidden layers
        for _ in range(num_layers - 2):
            self.conv_layers.append(GCNConv(hidden_channels, hidden_channels))

        # output layer
        self.conv_layers.append(GCNConv(hidden_channels, out_channels))

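    def forward(self, x, edge_index):
        # Apply ReLU between layers only; the last layer returns raw logits,
        # because nn.CrossEntropyLoss applies log-softmax internally.
        for layer in self.conv_layers[:-1]:
            x = F.relu(layer(x, edge_index))
        return self.conv_layers[-1](x, edge_index)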

in_channels = data.x.shape[1]  # feature dimension of each node
hidden_channels = 128
out_channels = dataset.num_classes  # no. of output classes
num_layers = 3

model = GCNModel(in_channels, hidden_channels, out_channels, num_layers).to(device)

In [None]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

train_idx = split_idx['train']
valid_idx = split_idx['valid']

def train(epoch):
    model.train()
    optimizer.zero_grad()
    # Use the same sparse adjacency as evaluate() so both passes see the same graph input
    out = model(sub_graph.x, sub_graph.adj_t)

    # Labels have shape [N, 1]; CrossEntropyLoss expects shape [N]
    target = sub_graph.y[train_idx].squeeze().long()

    loss = criterion(out[train_idx], target)
    loss.backward()
    optimizer.step()

def evaluate(split):
    model.eval()
    with torch.no_grad():
        out = model(sub_graph.x, sub_graph.adj_t)
        pred = out.argmax(dim=-1, keepdim=True)
    idx = split_idx[split]
    # Fraction of nodes in the split whose predicted class matches the label
    correct = pred[idx] == sub_graph.y[idx]
    return int(correct.sum()) / int(idx.size(0))

num_epochs = 50
for epoch in range(1, num_epochs + 1):
    train(epoch)
    accuracy = evaluate('valid')  # Calculate validation accuracy for the 'valid' split
    # Accuracy is a fraction in [0, 1], so it is printed without a percent sign
    print(f'Epoch: {epoch:02d}, Validation Accuracy: {accuracy:.4f}')


Epoch: 01, Validation Accuracy: 0.7562
Epoch: 02, Validation Accuracy: 0.7860
Epoch: 03, Validation Accuracy: 0.7868
Epoch: 04, Validation Accuracy: 0.7854
Epoch: 05, Validation Accuracy: 0.7864
Epoch: 06, Validation Accuracy: 0.7890
Epoch: 07, Validation Accuracy: 0.7922
Epoch: 08, Validation Accuracy: 0.7906
Epoch: 09, Validation Accuracy: 0.7884
Epoch: 10, Validation Accuracy: 0.7870
Epoch: 11, Validation Accuracy: 0.7858
Epoch: 12, Validation Accuracy: 0.7874
Epoch: 13, Validation Accuracy: 0.7892
Epoch: 14, Validation Accuracy: 0.7910
Epoch: 15, Validation Accuracy: 0.7930
Epoch: 16, Validation Accuracy: 0.7924
Epoch: 17, Validation Accuracy: 0.7928
Epoch: 18, Validation Accuracy: 0.7930
Epoch: 19, Validation Accuracy: 0.7918
Epoch: 20, Validation Accuracy: 0.7924
Epoch: 21, Validation Accuracy: 0.7920
Epoch: 22, Validation Accuracy: 0.7912
Epoch: 23, Validation Accuracy: 0.7914
Epoch: 24, Validation Accuracy: 0.7908
Epoch: 25, Validation Accuracy: 0.7908
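The accuracy reported in the log is the fraction of nodes whose highest-scoring class matches the true label, so a value such as 0.79 means roughly 79% of validation nodes are classified correctly. The dependency-free sketch below spells out that computation on made-up scores; the function name `accuracy` and the toy values are assumptions for illustration only.

```python
def accuracy(logits, labels):
    """Fraction of rows whose arg-max class equals the corresponding label."""
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)

# Four toy nodes, three classes (values invented for this sketch).
logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
labels = [1, 0, 2, 1]

acc = accuracy(logits, labels)
print(f"{acc:.4f}")
```

The same fraction can be computed for the held-out nodes by calling the notebook's `evaluate` function with `'test'` once training has finished.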
