# Introduction to PyTorch Geometric (PyG)

PyTorch Geometric (PyG): https://pytorch-geometric.readthedocs.io/en/latest/

## 0. Instllation 

See https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html

## 1. Data Format

See https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html#data-handling-of-graphs and understand the meaning of `edge_index`.

In [7]:
!pip install torch_geometric



## 2. Example

The following code provides an example to use PyG for building GCN to solve the node classification task. We will walk through the code and write code comments in this lecture. 

In [10]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='data/Citeseer', name='Citeseer')

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.citeseer.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.citeseer.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.citeseer.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.citeseer.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.citeseer.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.citeseer.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.citeseer.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.citeseer.test.index
Processing...
Done!


In [11]:
dataset 

Citeseer()

In [12]:
data = dataset[0]
data

Data(x=[3327, 3703], edge_index=[2, 9104], y=[3327], train_mask=[3327], val_mask=[3327], test_mask=[3327])

In [13]:
data.x.shape

torch.Size([3327, 3703])

In [14]:
data.y.max() + 1

tensor(6)

In [15]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, num_node_features, num_hidden, num_classes):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, num_hidden)
        self.conv2 = GCNConv(num_hidden, num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)

In [16]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
data = dataset[0].to(device)
model = GCN(num_node_features=data.x.shape[1], 
            num_hidden=16,
            num_classes=(data.y.max()+1).item()
           ).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print('Epoch {0}: {1}'.format(epoch, loss.item()))

Epoch 0: 1.7953003644943237
Epoch 10: 0.3820640742778778
Epoch 20: 0.12834005057811737
Epoch 30: 0.06699023395776749
Epoch 40: 0.03642746061086655
Epoch 50: 0.048827774822711945
Epoch 60: 0.03443939611315727
Epoch 70: 0.05090121552348137
Epoch 80: 0.025061285123229027
Epoch 90: 0.036277711391448975
Epoch 100: 0.03450305014848709
Epoch 110: 0.05213276296854019
Epoch 120: 0.033971529453992844
Epoch 130: 0.03782632201910019
Epoch 140: 0.032948918640613556
Epoch 150: 0.032123465090990067
Epoch 160: 0.025365522131323814
Epoch 170: 0.03311274200677872
Epoch 180: 0.044955410063266754
Epoch 190: 0.011644215323030949


In [17]:
model.eval()
pred = model(data).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = int(correct) / int(data.test_mask.sum())
print(f'Accuracy: {acc:.4f}')

Accuracy: 0.6730


## Q: How to use PyG in our project?

Essentially, there are only a few more steps we need to do:
- we need to convert the provided data into the PyG format.
    - PyG

In [20]:
import scipy.sparse as sp
import numpy as np
import json

adj = sp.load_npz('data/adj.npz')
feat  = np.load('data/features.npy')
labels = np.load('data/labels.npy')
splits = json.load(open('data/splits.json'))
idx_train, idx_test = splits['idx_train'], splits['idx_test']

In [21]:
# what data do we have in this project?

# 1. Adjacent matrix: a n x n matrix where n is the number of nodes in this graph
print(adj.shape) # 2480 nodes, and each entry of adj indicates whether two nodes are connected or not (0 if disconnected, 1 otherwise)
print(adj[0, 1104]) # adj[0, 1104] = 1 indicates node 0 and node 1104 are connected
print(adj[1, 1003]) # adj[1, 1003] = 0 indicates node 1 and node 1003 aren't connected

(2480, 2480)
1.0
0.0


In [22]:
# 2. feat matrix: a n x d matrix where each row is a d-dimensional feature vector of a node
print(feat.shape) # 2480 nodes each containing a 1390-dimensional feature vector
print(feat[6])    # the feature vector of node 6

(2480, 1390)
[0 0 0 ... 0 0 0]


In [23]:
# 3. labels: class labels of training nodes
print(labels.shape)  # note: labels contain only the class labels of training nodes
print(labels.max() + 1) # There are 7 classes in total, and each training node belongs to one of 7 classes

(496,)
7


In [24]:
# 4. splits: A python dictionary for train-test set split
print(splits.keys()) # 'idx_train' includes node index for training, and 'idx_test' includes node index for testing
print("# train nodes = ", len(splits['idx_train'])) # 496 nodes for training, and "labels" above correspond to the class labels of these training nodes
print("# test nodes = ", len(splits['idx_test']))  # 1984 nodes for testing

dict_keys(['idx_train', 'idx_test'])
# train nodes =  496
# test nodes =  1984


In [25]:
from torch_geometric.utils import from_scipy_sparse_matrix

In [26]:
edge_index = from_scipy_sparse_matrix(adj)

In [27]:
edge_index

(tensor([[   0,    0,    0,  ..., 2478, 2478, 2479],
         [1084, 1104, 1288,  ...,  931,  933,  999]]),
 tensor([1., 1., 1.,  ..., 1., 1., 1.]))

## How to submit the result

In [29]:
preds = pred[idx_test]
np.savetxt('submission.txt', preds, fmt='%d')

## One intuitive Example using MLP

This example is just used to show what is expected in 'submission.txt'. 
No graph structure (adjacency matrix) will be used, thus only torch functions are needed.
Feel free to use torch_geometric when incorporating graph structure in more advanced graph mining algorithms.


Remark 1: without using the graph structure (adjacency matrix), this could be suboptimal.

Remark 2: see the tutorial "Introduction to Pytorch" for MLP construction and training

Remark 3: feel free to run this example to see what is expected in 'submission.txt'

Remark 4: don't forget to rename your submitted files to '{YourTeamName}_submission.txt' (feel free to pick a fun name for your team!)

In [31]:
import torch
from torch import nn
import torch.nn.functional as F

class Linear(nn.Module): # Inheritance torch.nn.Module
    def __init__(self, in_features, out_features):
        super(Linear, self).__init__()
        self.weight = nn.Parameter(torch.randn(in_features, out_features)) 
        self.bias = nn.Parameter(torch.randn(out_features))
    
    def forward(self, x): # x is the input
        x = x.mm(self.weight) 
        return x + self.bias

class MLP(nn.Module):
    def __init__(self, in_features, hidden_features, out_features):
        super(MLP, self).__init__()
        self.in_features = in_features
        self.layer1 = Linear(in_features, hidden_features)  # Linear()
        self.layer2 = Linear(hidden_features, out_features)
        
    def forward(self,x):
        x = self.layer1(x)
        x = F.relu(x) # ReLU
        return self.layer2(x)

train_feats, train_labels = torch.from_numpy(feat[idx_train]).float(), torch.from_numpy(labels).long()
data_train = torch.utils.data.TensorDataset(train_feats, train_labels) # wrap tensors into a dataset object
feat_dim = train_feats.shape[1]
num_classes = labels.max() + 1
model = MLP(in_features=feat_dim, hidden_features=256, out_features=num_classes)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device) 

In [32]:
def train(model, data, num_epochs=5, learning_rate=1e-3, batch_size=32):
    # define an optimizer
    optimizer = torch.optim.Adam(model.parameters(),
                                 lr=learning_rate, 
                                 weight_decay=1e-5) # weight_decay is the L2 Regularization
    
    # Put the data into DataLoader so we can get a batch of data
    train_loader = torch.utils.data.DataLoader(data, 
                                               batch_size=batch_size, 
                                               shuffle=True)
    # define loss function
    criterion = nn.CrossEntropyLoss()
    
    for epoch in range(num_epochs):
        loss_total = 0 # 
        for data in train_loader:
            # clear the gradient
            optimizer.zero_grad()
            
            feat, label = data
            feat = feat.to(device)
            label = label.to(device)
            
            # forward, loss and backward 
            output = model(feat)
            loss = criterion(output, label)
            loss.backward()
            
            # optimize the parameters
            optimizer.step()
            
        loss_total += loss.item() 
        print('Epoch: {}, Training Loss: {:.4f}'.format(epoch+1, loss_total))

In [33]:
train(model, data_train)

Epoch: 1, Training Loss: 42.2957
Epoch: 2, Training Loss: 34.0030
Epoch: 3, Training Loss: 55.2561
Epoch: 4, Training Loss: 30.5589
Epoch: 5, Training Loss: 38.9551


In [34]:
# test
output = model(torch.from_numpy(feat).float().to(device))
pred = output.argmax(1)
preds = pred[idx_test]
np.savetxt('submission.txt', preds, fmt='%d')