[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/KedoKudo/DT_GNN_Tutorial/blob/tree/main/notebooks/05_node_classification.ipynb)

# Node Classification with PyTorch Geometric

In this notebook, we'll explore one of the foundational tasks in graph learning: node classification, using PyTorch Geometric (PyG).
Given a graph in which only some nodes are labeled, the goal is to predict the labels of the remaining nodes.
The exercise gives hands-on experience with the problem and shows how PyTorch Geometric handles graph-based learning tasks.

## Table of Contents

1. [Loading the Dataset](#Loading-the-Dataset)
2. [Defining the Model](#Defining-the-Model)
3. [Setting up the Training Loop](#Setting-up-the-Training-Loop)
4. [Evaluating Training Results](#Evaluating-Training-Results)

In [1]:
# Uncomment the following line to install the required packages if needed.
# !pip install torch torchvision torchaudio torch_geometric

## Loading the Dataset <a name="Loading-the-Dataset"></a>

PyTorch Geometric provides several benchmark datasets tailored for the node classification problem. Let's use one such dataset.

In [2]:
from torch_geometric.datasets import Planetoid

# Using the Cora dataset as an example
dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0]

print(f'Number of classes: {dataset.num_classes}')
print(f'Number of node features: {data.num_features}')

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index


Number of classes: 7
Number of node features: 1433


Processing...
Done!
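The `data` object loaded above also carries boolean masks (`train_mask`, `val_mask`, `test_mask`) that mark which nodes belong to each split. The snippet below is a minimal sketch of how such masks select nodes; it uses a made-up six-node example rather than Cora, and every value in it is an assumption chosen for illustration.

```python
import torch

# Toy stand-in for the label and mask tensors of a PyG `Data` object.
# 6 nodes, 2 classes; all values here are made up for illustration.
y = torch.tensor([0, 1, 0, 1, 1, 0])  # one label per node
train_mask = torch.tensor([True, True, False, False, False, False])
val_mask = torch.tensor([False, False, True, True, False, False])
test_mask = torch.tensor([False, False, False, False, True, True])

# Boolean indexing keeps only the nodes where the mask is True.
train_labels = y[train_mask]
num_train = int(train_mask.sum())

# The three splits should not share any node.
overlap = (train_mask & val_mask) | (train_mask & test_mask) | (val_mask & test_mask)
splits_overlap = bool(overlap.any())

print(f'Training labels: {train_labels.tolist()}')
print(f'Number of training nodes: {num_train}')
print(f'Splits overlap: {splits_overlap}')
```

The same indexing pattern, applied to `data.y` and `data.train_mask`, restricts the loss to the labeled training nodes.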


## Defining the Model <a name="Defining-the-Model"></a>

For this exercise, we'll use a simple two-layer Graph Convolutional Network (GCN).

In [3]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

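class GCN(torch.nn.Module):
    """Two-layer GCN that maps node features to per-class log-probabilities."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.conv1 = GCNConv(input_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, output_dim)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        # First layer: graph convolution, non-linearity, then dropout
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)  # Only active in training mode
        # Second layer: one score per class for every node
        x = self.conv2(x, edge_index)

        # Log-probabilities, to be paired with F.nll_loss
        return F.log_softmax(x, dim=1)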

# Pick the best available device: CUDA GPU, then Apple Silicon (MPS), then CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")  # e.g. M1 Mac
else:
    device = torch.device("cpu")
model = GCN(input_dim=data.num_features, hidden_dim=16, output_dim=dataset.num_classes).to(device)
data = data.to(device)
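To make the graph convolution less of a black box, here is a rough sketch of the propagation rule from the original GCN formulation: add self-loops to the adjacency matrix, normalize it symmetrically by node degree, then multiply by the features and a weight matrix. This is a dense, bias-free illustration written for this tutorial, not PyG's implementation; `GCNConv` works on a sparse `edge_index`, and the function name `dense_gcn_layer` and the toy graph are assumptions.

```python
import torch

def dense_gcn_layer(x, adj, weight):
    """One GCN propagation step on a dense adjacency matrix (no bias term)."""
    a_hat = adj + torch.eye(adj.size(0))  # add self-loops
    deg = a_hat.sum(dim=1)                # node degrees, self-loop included
    d_inv_sqrt = deg.pow(-0.5)
    # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
    norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
    return norm_adj @ x @ weight

# Toy path graph 0 - 1 - 2 (made up for illustration)
adj = torch.tensor([[0., 1., 0.],
                    [1., 0., 1.],
                    [0., 1., 0.]])
x = torch.eye(3)       # one-hot node features
weight = torch.eye(3)  # identity weights, so the output is the normalized adjacency

out = dense_gcn_layer(x, adj, weight)
print(out)
```

With identity features and weights, each row of `out` shows how much a node draws from itself and from its neighbors when its features are aggregated.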

## Setting up the Training Loop <a name="Setting-up-the-Training-Loop"></a>

Now that we have our dataset and model, let's set up the training loop.

In [4]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
model.train()

for epoch in range(200):
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    
    # Print the loss every 20 epochs
    if epoch % 20 == 0:
        print(f'Epoch: {epoch}, Loss: {loss.item()}')

Epoch: 0, Loss: 1.9450783729553223
Epoch: 20, Loss: 0.22880995273590088
Epoch: 40, Loss: 0.05259493738412857
Epoch: 60, Loss: 0.03490505367517471
Epoch: 80, Loss: 0.030518341809511185
Epoch: 100, Loss: 0.030535725876688957
Epoch: 120, Loss: 0.030138257890939713
Epoch: 140, Loss: 0.033741842955350876
Epoch: 160, Loss: 0.02363608032464981
Epoch: 180, Loss: 0.026172390207648277
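The model returns log-probabilities via `F.log_softmax`, which is why the loop uses `F.nll_loss` rather than `F.cross_entropy`. The short check below, run on random logits and made-up labels, illustrates that the two formulations give the same loss up to floating-point tolerance.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 7)             # 5 nodes, 7 classes (random values)
labels = torch.tensor([0, 3, 6, 2, 2])  # made-up labels

# Option 1: log-softmax in the model, negative log-likelihood as the loss
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Option 2: raw logits passed directly to cross-entropy
loss_ce = F.cross_entropy(logits, labels)

losses_match = torch.allclose(loss_nll, loss_ce)
print(f'NLL on log-softmax: {loss_nll.item():.6f}')
print(f'Cross-entropy on logits: {loss_ce.item():.6f}')
```

Either pairing works for training; what matters is not applying `log_softmax` twice, for example by feeding log-probabilities into `F.cross_entropy`.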


## Evaluating Training Results <a name="Evaluating-Training-Results"></a>

After training, we need to check how well the model generalizes.
We'll do this by measuring its accuracy on the validation and test nodes, which are selected by `data.val_mask` and `data.test_mask` and were not used in the training loss.

First, let's define a helper function to compute accuracy:

In [5]:
def compute_accuracy(output, labels, mask):
    _, predictions = output.max(dim=1)
    correct = predictions[mask].eq(labels[mask]).sum().item()
    total = mask.sum().item()
    return correct / total
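As a quick sanity check, the helper can be tried on a tiny hand-made example. The scores, labels, and mask below are made up for illustration; the helper is repeated so the snippet runs on its own.

```python
import torch

def compute_accuracy(output, labels, mask):
    # Same helper as above, repeated so this snippet is self-contained.
    _, predictions = output.max(dim=1)
    correct = predictions[mask].eq(labels[mask]).sum().item()
    total = mask.sum().item()
    return correct / total

# Made-up scores for 4 nodes and 3 classes; the row-wise argmax is [0, 1, 2, 1].
toy_output = torch.tensor([[0.9, 0.05, 0.05],
                           [0.1, 0.8, 0.1],
                           [0.2, 0.2, 0.6],
                           [0.3, 0.4, 0.3]])
toy_labels = torch.tensor([0, 1, 1, 1])
toy_mask = torch.tensor([True, False, True, True])  # node 1 is excluded

toy_acc = compute_accuracy(toy_output, toy_labels, toy_mask)
print(f'Toy accuracy: {toy_acc:.4f}')
```

Of the three masked nodes, nodes 0 and 3 are predicted correctly and node 2 is not, so the helper returns 2/3.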

Now, we can compute the accuracy for our trained model on the validation and test sets:

In [6]:
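model.eval()  # Switch off dropout for evaluation
with torch.no_grad():  # Gradients are not needed for evaluation
    output = model(data)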

val_accuracy = compute_accuracy(output, data.y, data.val_mask)
test_accuracy = compute_accuracy(output, data.y, data.test_mask)

print(f'Validation Accuracy: {val_accuracy:.4f}')
print(f'Test Accuracy: {test_accuracy:.4f}')

Validation Accuracy: 0.7840
Test Accuracy: 0.8150
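Calling `model.eval()` before evaluation matters because the forward pass uses `F.dropout(x, training=self.training)`. The small sketch below, on a made-up tensor of ones, illustrates the difference between the two modes.

```python
import torch
import torch.nn.functional as F

x = torch.ones(1000)  # made-up input

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2.
train_out = F.dropout(x, p=0.5, training=True)

# Evaluation mode: dropout does nothing and returns the input values unchanged.
eval_out = F.dropout(x, p=0.5, training=False)

print(f'Distinct values in training mode: {train_out.unique().tolist()}')
print(f'Unchanged in evaluation mode: {torch.equal(eval_out, x)}')
```

Leaving the model in training mode during evaluation would randomly drop hidden units and make the reported accuracy noisier.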


These accuracies indicate how well the trained model generalizes to nodes that were not part of the training loss.
Your numbers may differ from the ones shown here, since they depend on the dataset, the architecture, and random factors such as weight initialization and dropout.
If the results are unsatisfactory, adjusting the hyperparameters, the model architecture, or the training strategy is a natural next step.
If interested, you can also check out libraries like [Optuna](https://optuna.org) and [DeepHyper](https://deephyper.readthedocs.io/en/latest/) for automated hyperparameter tuning.