This tutorial is adapted from [Dair-AI Pytorch Geometric Tutorial](https://github.com/dair-ai/GNNs-Recipe) by Elvis Saravia.

### Installing dependencies

In [2]:
import torch
print("PyTorch has version {}".format(torch.__version__))

PyTorch has version 2.3.1


Installing PyG can be tricky. Run the cell below; if you run into issues, see [PyG's installation page](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html).

In [3]:
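# Install dependencies (optional; this wheel index targets torch 1.7.0 + CUDA 10.1,
# so adjust the URL to match your installed PyTorch/CUDA version before enabling it)
# !pip install -q torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cu101.html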
# Install torch geometric
!pip install -q torch-geometric

In [4]:
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid

# Loading Data

One of the cool things about the PyTorch Geometric library is that it ships with out-of-the-box benchmark datasets that are ready to use and explore. A popular one is Cora, a citation network commonly used for node classification.

"The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words." - [Papers with Code](https://paperswithcode.com/dataset/cora).

Let's load the Cora dataset:

In [5]:
dataset = Planetoid(root='tmp/Cora', name='Cora')

Let's check some of the properties of the Cora dataset.

In [6]:
# number of graphs
print("Number of graphs: ", len(dataset))

# number of features
print("Number of features: ", dataset.num_features)

# number of classes
print("Number of classes: ", dataset.num_classes)

Number of graphs:  1
Number of features:  1433
Number of classes:  7


### Meaning of the contents of each element of the dataset
`x`: [number of nodes, number of features],<br>
`y`: [number of nodes],<br>
`edge_index`: [2, number of edges],<br>
`train_mask`: [number of nodes],<br>
`test_mask`: [number of nodes],<br>
`val_mask`: [number of nodes]

In [12]:
print(dataset[0])

Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])


In [7]:
print(dataset[0].edge_index)

tensor([[ 633, 1862, 2582,  ...,  598, 1473, 2706],
        [   0,    0,    0,  ..., 2707, 2707, 2707]])


We can see that this particular graph dataset contains only one graph. Graph datasets can be complex and may include multiple graphs, depending on the type of data and the application. Let's check more properties of the Cora graph:

In [6]:
# select the first graph
data = dataset[0]

# number of nodes
print("Number of nodes: ", data.num_nodes)

# number of edges
print("Number of edges: ", data.num_edges)

# check if directed
print("Is directed: ", data.is_directed())

Number of nodes:  2708
Number of edges:  10556
Is directed:  False
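The `Is directed:  False` result means that every edge appears in both directions in `edge_index`, so the reported edge count is the number of directed entries (columns), not the number of undirected links. The following is a minimal sketch on a made-up 3-node graph using plain `torch`; the toy graph and the variable names are illustrative assumptions, not part of the Cora data.

```python
import torch

# Toy undirected graph with 3 nodes and 2 undirected links: 0-1 and 1-2.
# Each link is stored twice, once per direction (row 0 = source, row 1 = target).
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])

# Number of columns = number of directed entries (4 here, for 2 undirected links).
num_edges = edge_index.shape[1]

# The graph is undirected if reversing every edge yields the same edge set.
forward = set(map(tuple, edge_index.t().tolist()))
backward = set(map(tuple, edge_index.flip(0).t().tolist()))
is_undirected = forward == backward

print(num_edges, is_undirected)
```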


You can sample nodes from the graph this way:

In [7]:
# sample nodes from the graph
print("Shape of sample nodes: ", data.x[:5].shape)

Shape of sample nodes:  torch.Size([5, 1433])


We extracted 5 nodes from the graph and checked the shape of their feature matrix. You can see that each node has `1433` features.

Another great advantage of using PyTorch Geometric to load the Cora data is that it comes pre-processed and ready to use. It also has the splits for training, validation and test which we can directly use for training a GNN.

Let's check some stats for the partitions of the data:

In [8]:
# check training nodes
print("# of nodes to train on: ", data.train_mask.sum().item())
# check test nodes
print("# of nodes to test on: ", data.test_mask.sum().item())
# check validation nodes
print("# of nodes to validate on: ", data.val_mask.sum().item())

# of nodes to train on:  140
# of nodes to test on:  1000
# of nodes to validate on:  500


This information is important: the masks tell us which nodes to compute the training loss on and which nodes to hold out for validation and testing.
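As a rough illustration of how such boolean masks are used, the sketch below indexes toy tensors with a mask to select only the "training" rows. The values here are made up for the example and are not taken from Cora.

```python
import torch

# Toy stand-ins for model outputs, labels, and a training mask (not Cora data).
out = torch.tensor([[0.9, 0.1],
                    [0.2, 0.8],
                    [0.6, 0.4],
                    [0.3, 0.7]])
y = torch.tensor([0, 1, 1, 1])
train_mask = torch.tensor([True, False, True, False])

# Boolean indexing keeps only the rows where the mask is True.
train_out = out[train_mask]
train_y = y[train_mask]
num_train = int(train_mask.sum())

# Accuracy restricted to the masked nodes.
train_acc = (train_out.argmax(dim=1) == train_y).float().mean().item()
print(train_out.shape, train_y.tolist(), num_train, train_acc)
```

The same pattern applies to `data.train_mask`, `data.val_mask`, and `data.test_mask`: the model sees the whole graph, but losses and metrics are computed only on the selected nodes.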

In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
data = dataset[0].to(device)

cuda


# TO-DO: Model and Training

Finally, let's define a standard GCN to train on the Cora dataset. The aim is to train a model that gets better at predicting the class of the node.

Use the built-in `GCNConv` module.

The model below should have two `GCNConv` layers. The first layer is followed by a `ReLU` non-linearity and `Dropout`. Its output is fed to the second layer, on top of which you should apply `log_softmax` to get log-probabilities over the classes (the `NLLLoss` used below expects log-probabilities, not raw `Softmax` outputs). You can experiment with the number of hidden channels between the layers.

In [19]:
class GCN(torch.nn.Module):

    def __init__(self, num_features, num_classes, hidden_channels=100):
        super().__init__()
        # TO_DO: two graph convolution layers
        self.conv1 = GCNConv(in_channels=num_features, out_channels=hidden_channels)
        self.conv2 = GCNConv(in_channels=hidden_channels, out_channels=num_classes)

    # TO_DO
    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        # only drop activations in training mode (F.dropout defaults to training=True)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        # log-probabilities over the classes, as expected by NLLLoss
        x = F.log_softmax(x, dim=1)

        return x
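For intuition about what a graph convolution layer computes, here is a dense, simplified sketch of the propagation rule from the original GCN formulation, `D^-1/2 (A + I) D^-1/2 X W`. It is an illustration under stated assumptions: `gcn_layer` is a hypothetical helper (not a PyG function), it omits the bias term, and it builds a dense adjacency matrix, whereas `GCNConv` works with sparse message passing.

```python
import torch

def gcn_layer(x, edge_index, weight):
    """One simplified GCN step: D^-1/2 (A + I) D^-1/2 X W (dense, no bias)."""
    n = x.shape[0]
    adj = torch.zeros(n, n)
    adj[edge_index[0], edge_index[1]] = 1.0      # dense adjacency from edge list
    adj = adj + torch.eye(n)                     # add self-loops
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)      # degrees >= 1 thanks to self-loops
    norm_adj = deg_inv_sqrt[:, None] * adj * deg_inv_sqrt[None, :]
    return norm_adj @ x @ weight                 # aggregate neighbours, then project

# Toy graph: 3 nodes in a path 0-1-2, one-hot features, all-ones weights.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
x = torch.eye(3)
weight = torch.ones(3, 2)

h = gcn_layer(x, edge_index, weight)
print(h.shape)
```

Each output row mixes a node's own features with those of its neighbours, weighted by the degree normalisation, which is why stacking two such layers lets a node see information from two hops away.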

In [20]:
model = GCN(dataset.num_features, dataset.num_classes).to(device) # TO_DO



Define an optimizer:

In [18]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)  # lr=0.01, weight_decay=5e-4


Define a loss function:

In [13]:
loss_fn = torch.nn.NLLLoss()  # negative log-likelihood loss; expects log-probabilities as input
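Note that `NLLLoss` expects log-probabilities, so it pairs with `log_softmax` rather than `softmax`. The sketch below, on random toy logits (an assumption for illustration only), shows that `log_softmax` followed by NLL matches `cross_entropy` on raw logits, while feeding plain probabilities into NLL gives a negative and therefore meaningless loss value.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 7)              # toy scores: 4 nodes, 7 classes
targets = torch.tensor([0, 3, 6, 2])    # toy labels

# Correct pairing: log-probabilities into NLL.
nll_on_log_probs = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Equivalent shortcut: cross-entropy applied directly to the logits.
cross_entropy = F.cross_entropy(logits, targets)

# Incorrect pairing: plain probabilities into NLL.
nll_on_probs = F.nll_loss(F.softmax(logits, dim=1), targets)

print(torch.allclose(nll_on_log_probs, cross_entropy))
print(nll_on_probs.item() < 0)
```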

Train the model on the training nodes for 200 epochs:

In [14]:
from sklearn.metrics import accuracy_score

In [15]:
data.train_mask

tensor([ True,  True,  True,  ..., False, False, False], device='cuda:0')

In [17]:
# train the model
model.train()
losses = []
accuracies = []

for epoch in range(200):

    optimizer.zero_grad()
    out = model(data.x, data.edge_index)

    # compute the loss on the training nodes only
    loss = loss_fn(out[data.train_mask], data.y[data.train_mask])  # TO-DO
    losses.append(loss.item())

    # accuracy on the training nodes; sklearn needs CPU arrays, not CUDA tensors
    pred = out[data.train_mask].argmax(dim=-1)
    acc = accuracy_score(data.y[data.train_mask].cpu().numpy(), pred.cpu().numpy())
    accuracies.append(acc)

    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print('Epoch: {}, Loss: {:.4f}, Training Acc: {:.4f}'.format(epoch+1, loss.item(), acc))




In [None]:
# plot the loss and accuracy
plt.plot(losses)
plt.plot(accuracies)
plt.legend(['Loss', 'Accuracy'])
plt.show()

You should aim for 100% accuracy on the training data, and ~80% on the test data.

In [None]:
# evaluate the model on the test set
model.eval()  # evaluation mode
with torch.no_grad():  # no gradients needed for inference
    pred = model(data.x, data.edge_index).argmax(dim=1)
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
acc = (correct / data.test_mask.sum()).item()
print(f'Accuracy: {acc:.4f}')
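One detail worth checking when evaluating is how dropout behaves. The `torch.nn.Dropout` module follows `train()` / `eval()`, but the functional `F.dropout` defaults to `training=True` and only becomes a no-op when it is passed `training=False` (for example `training=self.training` inside a module). A small sketch on a toy tensor, not tied to the Cora model:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(1000)
drop = torch.nn.Dropout(p=0.5)

# Training mode: entries are zeroed at random, survivors are scaled by 1 / (1 - p).
drop.train()
train_out = drop(x)

# Evaluation mode: the module passes its input through unchanged.
drop.eval()
eval_out = drop(x)

# The functional form ignores module state; it must be told explicitly.
functional_off = F.dropout(x, p=0.5, training=False)

print(sorted(train_out.unique().tolist()))
print(torch.equal(eval_out, x), torch.equal(functional_off, x))
```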