# Introduction to PyTorch Geometric (PyG)

PyTorch Geometric (PyG): https://pytorch-geometric.readthedocs.io/en/latest/

## 0. Installation

See https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html

## 1. Data Format

See https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html#data-handling-of-graphs and understand the meaning of `edge_index`.
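
As a minimal sketch of the format: `edge_index` is an integer tensor of shape `[2, num_edges]` in COO format, where row 0 holds the source node of each edge and row 1 holds the target node. The toy graph below is an illustrative assumption, not part of the lecture data. An undirected edge is stored once per direction.

```python
import torch

# Toy undirected graph with 3 nodes and edges 0-1 and 1-2 (hypothetical).
# Each undirected edge appears twice, once per direction.
edge_index = torch.tensor([[0, 1, 1, 2],   # source nodes
                           [1, 0, 2, 1]],  # target nodes
                          dtype=torch.long)

num_nodes = 3
num_edges = edge_index.shape[1]  # number of directed edges (columns)

# Node degree: count how often each node appears as a source.
degree = torch.bincount(edge_index[0], minlength=num_nodes)
print(num_edges, degree.tolist())
```

Reading the tensor column by column gives the edge list: column `j` is the edge from `edge_index[0, j]` to `edge_index[1, j]`.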

## 2. Example

The following code shows how to use PyG to build a GCN for node classification. We will walk through the code and add comments to it during this lecture.

In [1]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Citeseer', name='Citeseer')

In [2]:
dataset 

Citeseer()

In [3]:
data = dataset[0]
data

Data(x=[3327, 3703], edge_index=[2, 9104], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327])

In [9]:
data.edge_index

tensor([[ 628,  158,  486,  ..., 2820, 1643,   33],
        [   0,    1,    1,  ..., 3324, 3325, 3326]])

In [4]:
data.x.shape

torch.Size([3327, 3703])

In [5]:
data.y.max() + 1

tensor(6)

In [10]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, num_node_features, num_hidden, num_classes):
        super().__init__()
        # Two-layer GCN: input features -> hidden units -> class scores
        self.conv1 = GCNConv(num_node_features, num_hidden)
        self.conv2 = GCNConv(num_hidden, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        # Dropout is only applied in training mode (after model.train())
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        # Per-class log-probabilities, to be paired with F.nll_loss
        return F.log_softmax(x, dim=1)

In [11]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = dataset[0].to(device)
model = GCN(num_node_features=data.x.shape[1], 
            num_hidden=16,
            num_classes=(data.y.max()+1).item()
           ).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

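model.train()  # training mode: enables dropout
for epoch in range(200):
    optimizer.zero_grad()  # clear gradients from the previous step
    out = model(data)      # full-batch forward pass over all nodes
    # Compute the loss on the training nodes only
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print('Epoch {0}: {1}'.format(epoch, loss.item()))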

Epoch 0: 1.7981531620025635
Epoch 10: 0.3964819610118866
Epoch 20: 0.1508595496416092
Epoch 30: 0.06388244777917862
Epoch 40: 0.040696993470191956
Epoch 50: 0.05569307878613472
Epoch 60: 0.07979481667280197
Epoch 70: 0.04876231029629707
Epoch 80: 0.03782535344362259
Epoch 90: 0.040128253400325775
Epoch 100: 0.04260769858956337
Epoch 110: 0.03815246745944023
Epoch 120: 0.03554859012365341
Epoch 130: 0.0326622910797596
Epoch 140: 0.04496221989393234
Epoch 150: 0.052857302129268646
Epoch 160: 0.02636527270078659
Epoch 170: 0.055984869599342346
Epoch 180: 0.04254470393061638
Epoch 190: 0.030840350314974785


In [21]:
data.y.dtype

torch.int64

In [9]:
model.eval()  # evaluation mode: disables dropout
pred = model(data).argmax(dim=1)  # predicted class for every node
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f'Accuracy: {acc:.4f}')

Accuracy: 0.6910
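
The accuracy computation above can be wrapped in a small helper so that the same logic applies to any mask. This is a sketch only: `masked_accuracy` is a hypothetical name, and the tensors below are toy stand-ins rather than Citeseer values.

```python
import torch

def masked_accuracy(pred: torch.Tensor, y: torch.Tensor, mask: torch.Tensor) -> float:
    """Fraction of correct predictions among the nodes selected by a boolean mask."""
    correct = (pred[mask] == y[mask]).sum()
    return int(correct) / int(mask.sum())

# Toy stand-in values (hypothetical, not from Citeseer).
pred = torch.tensor([0, 2, 1, 1])
y = torch.tensor([0, 1, 1, 2])
mask = torch.tensor([True, True, True, False])

acc = masked_accuracy(pred, y, mask)
print(f'Accuracy: {acc:.4f}')
```

With the notebook's variables, `masked_accuracy(pred, data.y, data.val_mask)` would report accuracy on the validation nodes, which the training loop above does not use.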


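## Q: How do we use PyG in our project?

Essentially, only a few more steps are needed:
- Convert the provided data into the PyG format.
    - PyG provides `from_scipy_sparse_matrix` (in `torch_geometric.utils`) to convert a SciPy sparse adjacency matrix into `edge_index`.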

In [11]:
import scipy.sparse as sp
import numpy as np
import json
adj = sp.load_npz('./data_2024/adj.npz')
feat  = np.load('./data_2024/features.npy')
labels = np.load('./data_2024/labels.npy')
splits = json.load(open('./data_2024/splits.json'))
idx_train, idx_test = splits['idx_train'], splits['idx_test']

In [12]:
from torch_geometric.utils import from_scipy_sparse_matrix

In [13]:
# from_scipy_sparse_matrix returns a tuple: (edge_index, edge_weight)
edge_index, edge_weight = from_scipy_sparse_matrix(adj)

In [14]:
edge_index, edge_weight

(tensor([[   0,    0,    0,  ..., 2478, 2478, 2479],
         [1084, 1104, 1288,  ...,  931,  933,  999]]),
 tensor([1., 1., 1.,  ..., 1., 1., 1.]))

In [15]:
feat

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [16]:
labels.shape

(496,)

In [17]:
len(splits['idx_train']), len(splits['idx_test'])

(496, 1984)
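
The shapes above suggest that `labels` covers only the training nodes (496 labels for 496 training indices), so a full-length label vector and boolean masks have to be built by hand. The sketch below does this on toy stand-in arrays. It assumes that `labels[i]` belongs to node `idx_train[i]`, which should be checked against the project description. The value `-1` for unlabeled nodes is an arbitrary placeholder chosen here.

```python
import numpy as np
import torch

# Toy stand-ins for the project files (hypothetical values):
# 5 nodes in total, 2 labeled training nodes, 3 test nodes.
feat = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]], dtype=np.int64)
labels = np.array([2, 0], dtype=np.int64)  # assumed aligned with idx_train
idx_train, idx_test = [3, 1], [0, 2, 4]

num_nodes = feat.shape[0]

# The features are stored as int64; GCNConv needs floating-point input.
x = torch.from_numpy(feat).float()

# Full-length label vector; -1 marks nodes without a known label.
y = torch.full((num_nodes,), -1, dtype=torch.long)
y[idx_train] = torch.from_numpy(labels)

# Boolean masks in the same style as the Planetoid datasets.
train_mask = torch.zeros(num_nodes, dtype=torch.bool)
train_mask[idx_train] = True
test_mask = torch.zeros(num_nodes, dtype=torch.bool)
test_mask[idx_test] = True

print(y.tolist(), train_mask.tolist())
```

These tensors, together with the edge index tensor returned by `from_scipy_sparse_matrix`, can then be passed to `torch_geometric.data.Data(x=x, edge_index=edge_index, y=y)`, after which the GCN training loop from Section 2 applies with `train_mask` selecting the labeled nodes.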

## How to submit the result

In [18]:
preds = pred[idx_test]
np.savetxt('submission.txt', preds, fmt='%d')

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

In [19]:
preds

tensor([1, 2, 1,  ..., 2, 5, 0], device='cuda:0')

In [21]:
preds = pred[idx_test].cpu().numpy()

np.savetxt('submission.txt', preds, fmt='%d')
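
A quick way to check the written file is to read it back and compare it with the predictions. The sketch below uses toy predictions and a temporary directory (both assumptions for illustration). In the notebook, `preds` would come from `pred[idx_test]`.

```python
import os
import tempfile

import numpy as np
import torch

# Toy predictions (hypothetical); in the notebook this would be pred[idx_test].
preds = torch.tensor([1, 2, 1, 5, 0])

# detach() and cpu() are safe to call on a CPU tensor as well,
# so the same line works whether or not the model ran on a GPU.
preds_np = preds.detach().cpu().numpy()

path = os.path.join(tempfile.mkdtemp(), 'submission.txt')
np.savetxt(path, preds_np, fmt='%d')

# Read the file back to confirm it holds one integer label per line.
loaded = np.loadtxt(path, dtype=int)
print(loaded.shape, np.array_equal(loaded, preds_np))
```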