# Graph Neural Networks with PyTorch Geometric

## What is a Graph Neural Network? (Intuition)

A **graph** is a data structure made of:

* **nodes** (objects, entities)
* **edges** (relations between objects)

Examples:

* Social networks: users = nodes, friendships = edges
* Citation networks: papers = nodes, citations = edges
* Molecules: atoms = nodes, bonds = edges

A **Graph Neural Network (GNN)** learns representations of nodes or whole graphs by:

1. Letting each node look at its neighbors
2. Aggregating (summing / averaging / weighting) neighbor information
3. Updating node representations via neural networks

This process is called **message passing**.
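
The three steps above can be sketched in plain Python, without PyTorch. This is only an illustration: the toy inputs `features` and `neighbors` and the helper `message_passing_step` are hypothetical. The update step is left as the identity, whereas a real GNN layer would apply learned weights there.

```python
def message_passing_step(features, neighbors):
    """One round of message passing with mean aggregation (no learned weights)."""
    updated = []
    for node, own in enumerate(features):
        # 1. look at the neighbors
        messages = [features[n] for n in neighbors[node]]
        # 2. aggregate: mean over the neighbors and the node itself
        aggregated = sum(messages + [own]) / (len(messages) + 1)
        # 3. update: identity here; a GNN would apply a neural network
        updated.append(aggregated)
    return updated

# Path graph 0 - 1 - 2 with one scalar feature per node
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = [-1.0, 0.0, 1.0]

print(message_passing_step(features, neighbors))  # [-0.5, 0.0, 0.5]
```

After one round, each node's value has moved toward its neighbors' values. Stacking several rounds lets information travel further across the graph.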

## What Goes In and What Comes Out?

### Input to a GNN

* **Node features** `x`
  Shape: `[num_nodes, num_node_features]`
* **Graph structure** `edge_index`
  Shape: `[2, num_edges]`
* (Optional) **Edge features**, positions, etc.

### Output of a GNN

Depends on the task:

* **Node classification** → one class prediction per node
  Shape of the class scores: `[num_nodes, num_classes]`
* **Graph classification** → one class prediction per graph
  Shape of the class scores: `[num_graphs, num_classes]`
* **Regression** → continuous values per node or per graph

## Graphs as `Data` Objects

In PyG, **one graph = one `Data` object**.

**Key intuition**

* `x` → what a node *is*
* `edge_index` → who talks to whom
* undirected graph → every edge is stored twice, once in each direction

In [1]:
import torch
from torch_geometric.data import Data

# edges: 0↔1, 1↔2
edge_index = torch.tensor([
    [0, 1, 1, 2],
    [1, 0, 2, 1]
], dtype=torch.long)

# node features
x = torch.tensor([
    [-1.0],
    [ 0.0],
    [ 1.0]
])

data = Data(x=x, edge_index=edge_index)
print(data)


Data(x=[3, 1], edge_index=[2, 4])
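
Writing both directions of every edge by hand is error-prone. A minimal sketch of a helper that builds the two rows from a list of undirected pairs is shown here. The name `to_edge_index` is hypothetical and the code is plain Python. PyG also provides `torch_geometric.utils.to_undirected` for tensors.

```python
def to_edge_index(undirected_edges):
    """Turn undirected (u, v) pairs into two rows: source nodes and target nodes."""
    sources, targets = [], []
    for u, v in undirected_edges:
        # store each undirected edge as two directed edges
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]

edge_index_rows = to_edge_index([(0, 1), (1, 2)])
print(edge_index_rows)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```

The result matches the `edge_index` written out above and could be passed to `torch.tensor(..., dtype=torch.long)`.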


## Built-in Datasets

PyTorch Geometric provides a wide collection of **ready-to-use benchmark datasets** for graph machine learning. These datasets cover a variety of tasks such as node classification, graph classification, and regression, and they can be downloaded and processed automatically.

In this tutorial, we will use the **Cora** dataset, a standard benchmark for **node classification** in citation networks. Its main characteristics are:

* **Task type:** node classification
* **Number of graphs:** 1
* **Number of nodes:** 2,708 (scientific papers)
* **Number of classes:** 7
* **Domain:** citation network (edges represent citations between papers)
* **Availability:** automatically downloaded and prepared by PyTorch Geometric

In [2]:
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='/tmp/Cora', name='Cora') # Download and load the Cora dataset
data = dataset[0]

print(data.x.shape)           # [num_nodes, num_node_features]
print(data.edge_index.shape)  # [2, num_edges]
print(data.y.shape)           # [num_nodes], one class label per node

torch.Size([2708, 1433])
torch.Size([2, 10556])
torch.Size([2708])


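Boolean masks define **which nodes belong to the training, validation, and test sets**:

In [3]:
data.train_mask  # True for nodes used to compute the training loss
data.val_mask    # True for nodes used for validation
data.test_mask   # True for test nodes (only this last line is displayed)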

tensor([False, False, False,  ...,  True,  True,  True])
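
To make the role of a mask concrete, here is a small plain-Python sketch that scores predictions only on the masked nodes. The helper `masked_accuracy` and the toy values are hypothetical. With tensors, the same selection is written as `pred[mask]`.

```python
def masked_accuracy(predictions, labels, mask):
    """Accuracy computed only over the nodes where mask is True."""
    selected = [(p, y) for p, y, m in zip(predictions, labels, mask) if m]
    correct = sum(p == y for p, y in selected)
    return correct / len(selected)

predictions = [0, 2, 1, 1, 0]
labels      = [0, 1, 1, 1, 2]
test_mask   = [False, False, True, True, True]

# Only the last three nodes are scored; two of them are predicted correctly.
print(masked_accuracy(predictions, labels, test_mask))
```

Nodes outside the mask still take part in message passing. The mask only decides which nodes contribute to the loss or the metric.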

## A Tiny GNN

Instead of using standard layers provided by PyTorch (such as linear or convolutional layers for images), **PyTorch Geometric offers layers specifically designed for graph-structured data**.

For example, these layers implement **graph convolution**, where each node updates its representation by aggregating information from its neighboring nodes according to the graph structure.
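
For reference, the graph convolution of Kipf and Welling, which `GCNConv` is based on, can be written as a single matrix expression. The symbols are introduced here and do not appear in the text above: $A$ is the adjacency matrix, $I$ the identity (self-loops), $\hat{D}$ the diagonal degree matrix of $\hat{A}$, $X$ the node feature matrix, and $\Theta$ the learned weight matrix.

$$
X' = \hat{D}^{-1/2}\,\hat{A}\,\hat{D}^{-1/2}\,X\,\Theta,
\qquad \hat{A} = A + I,
\qquad \hat{D}_{ii} = \sum_j \hat{A}_{ij}
$$

In words: each node sums the transformed features of its neighbors and of itself, with the contribution of neighbor $j$ to node $i$ scaled by $1/\sqrt{\hat{D}_{ii}\,\hat{D}_{jj}}$. A nonlinearity such as ReLU is applied separately, between layers.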

In [4]:
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, out_dim)

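    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)   # aggregate information from 1-hop neighbors
        x = F.relu(x)
        x = self.conv2(x, edge_index)   # second layer reaches 2-hop neighbors
        return F.log_softmax(x, dim=1)  # per-node log-probabilities over classes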

## Training on Part of the Graph

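Training a GCN follows the usual PyTorch loop. The only difference is that the model sees the whole graph in every forward pass, while the loss is computed only on the nodes selected by `train_mask`.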

In [5]:
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
    
model = GCN(
    dataset.num_node_features,
    16,
    dataset.num_classes
).to(device)

data = data.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

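model.train()
for epoch in range(100):
    optimizer.zero_grad()
    out = model(data)  # forward pass over the whole graph
    # loss only on the training nodes
    loss = F.nll_loss(
        out[data.train_mask],
        data.y[data.train_mask]
    )
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():  # gradients are not needed for evaluation
    pred = model(data).argmax(dim=1)

# fraction of test nodes predicted correctly
acc = (
    pred[data.test_mask]
    == data.y[data.test_mask]
).sum() / data.test_mask.sum()

print(f"Accuracy: {acc:.4f}")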

Accuracy: 0.7710


## Mini-batching Many Graphs

Mini-batching in **PyTorch Geometric** works slightly differently than in standard deep learning workflows.

For **large single graphs**, it is often not possible (or not necessary) to split the graph itself. Instead, the model processes the **entire graph**, while the loss is computed only on a **subset of nodes** (e.g. using train, validation, and test masks).

For **datasets consisting of many small graphs**, PyTorch Geometric supports true mini-batching. Multiple graphs are **merged into a single batch** by:

* concatenating node feature matrices,
* concatenating edge indices,
* creating a block-diagonal adjacency structure internally.
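
The merge described above can be sketched in plain Python. This is an illustration of the idea, not PyG's actual implementation. The helper `collate` and the toy graphs are hypothetical. The key detail is that the edge indices of each graph are shifted by the number of nodes already in the batch, so the graphs stay disconnected from each other.

```python
def collate(graphs):
    """Merge graphs given as (node_features, [sources, targets]) into one big graph."""
    x, sources, targets, batch = [], [], [], []
    for graph_id, (features, (src, dst)) in enumerate(graphs):
        offset = len(x)  # number of nodes already in the batch
        x += features
        sources += [s + offset for s in src]
        targets += [t + offset for t in dst]
        batch += [graph_id] * len(features)  # remember which graph each node came from
    return x, [sources, targets], batch

graph_a = ([[1.0], [2.0]], [[0, 1], [1, 0]])                     # 2 nodes, edge 0↔1
graph_b = ([[3.0], [4.0], [5.0]], [[0, 1, 1, 2], [1, 0, 2, 1]])  # 3 nodes, edges 0↔1, 1↔2

x, edge_index_rows, batch = collate([graph_a, graph_b])
print(edge_index_rows)  # [[0, 1, 2, 3, 3, 4], [1, 0, 3, 2, 4, 3]]
print(batch)            # [0, 0, 1, 1, 1]
```

Because no edge connects nodes of different graphs, message passing on the merged graph gives the same result as running it on each graph separately.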

This is handled automatically by the `DataLoader`:

In [6]:

from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')
loader = DataLoader(dataset, batch_size=32, shuffle=True)

batch = next(iter(loader))
print(batch)

DataBatch(edge_index=[2, 4296], x=[1115, 3], y=[32], batch=[1115], ptr=[33])
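
In the printed batch, `batch=[1115]` assigns each of the 1,115 nodes to one of the 32 graphs. For graph classification, this vector is what turns per-node features into one vector per graph. Below is a minimal plain-Python sketch of such a mean readout. The helper `mean_readout` and the toy values are hypothetical. In PyG the corresponding function is `torch_geometric.nn.global_mean_pool`.

```python
def mean_readout(x, batch, num_graphs):
    """Average node feature vectors per graph, where batch[i] is the graph id of node i."""
    dim = len(x[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for features, graph_id in zip(x, batch):
        counts[graph_id] += 1
        for k, value in enumerate(features):
            sums[graph_id][k] += value
    return [[s / counts[g] for s in sums[g]] for g in range(num_graphs)]

x = [[1.0], [2.0], [3.0], [4.0], [5.0]]  # 5 nodes, 1 feature each
batch = [0, 0, 1, 1, 1]                  # first 2 nodes: graph 0, last 3: graph 1

print(mean_readout(x, batch, num_graphs=2))  # [[1.5], [4.0]]
```

The output has one row per graph, which matches the `[num_graphs, ...]` shape expected for graph-level tasks.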
