# Introduction

This notebook follows the [Introduction By Example](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html) from the PyTorch Geometric (PyG) documentation.

__Other Tutorial Resources:__
- [pypi torch-geometric](https://pypi.org/project/torch-geometric/)

In [1]:
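# Note: the wheel index should match the installed PyTorch/CUDA build; the URLs below target torch 1.10.0 with CUDA 11.3.
!pip install -q torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html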
!pip install -q torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
!pip install -q git+https://github.com/pyg-team/pytorch_geometric.git

In [16]:
conda env list

# conda environments:
#
base                     /users/cwoest/Applications/anaconda3
gdsctools_env            /users/cwoest/Applications/anaconda3/envs/gdsctools_env
master-thesis-log     *  /users/cwoest/Applications/anaconda3/envs/master-thesis-log
r-env                    /users/cwoest/Applications/anaconda3/envs/r-env


Note: you may need to restart the kernel to use updated packages.


In [17]:
conda --version

conda 4.10.3

Note: you may need to restart the kernel to use updated packages.


In [14]:
import sys
import torch

device = torch.device("cpu")

print(f"""
    Python version:   {sys.version}
    PyTorch version:  {torch.__version__}
    Device:           {device}
    CUDA available:   {torch.cuda.is_available()}
""")


    Python version:   3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:45:10) [Clang 12.0.1 ]
    PyTorch version:  1.11.0
    Device:           cpu
    CUDA available:   False



## Data Handling of Graphs

| Attributes | Shape | Description |
| ---------- | ----- | ----------- |
| `data.x`          | `[num_nodes, num_node_features]` | Node feature matrix |
| `data.edge_index` | `[2, num_edges]` | Graph connectivity in COO format |
| `data.edge_attr`  | `[num_edges, num_edge_features]` | Edge feature matrix |
| `data.y`          | arbitrary | Target to train against. <br> - Node-level targets `[num_nodes, *]` <br> - Graph-level targets `[1, *]` |
| `data.pos`        | `[num_nodes, num_dimensions]` | Node position matrix |

In [15]:
import torch
from torch_geometric.data import Data

########################
# Definition of a graph.
########################

# Define edges.
edge_index = torch.tensor(
    [[0, 1, 1, 2],  # from
     [1, 0, 2, 1]], # to
     dtype=torch.long
)

# Define nodes.
x = torch.tensor(
    [[-1], [0], [1]], dtype=torch.float
)

# Define single graph in PyG.
data = Data(
    x=x,
    edge_index=edge_index
)

# Shapes of the edge and feature matrices.
print(data)


Data(x=[3, 1], edge_index=[2, 4])


- `[3, 1]` means the graph has 3 nodes and 1 feature per node
- `[2, 4]` means `edge_index` stores 4 directed edges; since each undirected edge is stored once per direction, this corresponds to 2 undirected edges

The above graph `data` has the following adjacency matrix:
```
    0  1  0
    1  0  1 
    0  1  0
```
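The relationship between `edge_index` and the adjacency matrix above can be made explicit with a small sketch. It uses plain Python lists rather than tensors so that it runs without PyG; the helper name `to_adjacency` is hypothetical and not part of the library.

```python
def to_adjacency(edge_index, num_nodes):
    """Build a dense 0/1 adjacency matrix from a COO edge list.

    edge_index is a pair of equally long lists: [sources, targets].
    """
    sources, targets = edge_index
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in zip(sources, targets):
        adj[src][dst] = 1
    return adj


# Same connectivity as the graph defined above.
edge_index = [[0, 1, 1, 2],   # from
              [1, 0, 2, 1]]   # to

adj = to_adjacency(edge_index, num_nodes=3)
for row in adj:
    print(row)
```

Each column of `edge_index` sets exactly one entry of the matrix, which is why an undirected edge needs two columns.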

__Remember__: 

- `x` has shape `[num_nodes, num_node_features]`
- `edge_index` has shape `[2, num_edges]`

In [16]:
# Define as list of index tuples.
edge_index = torch.tensor([[0, 1],
                           [1, 0],
                           [1, 2],
                           [2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

# If the edges are given as a list of (source, target) index tuples,
# transpose the tensor and call contiguous() before passing it to Data.
data = Data(x=x, edge_index=edge_index.t().contiguous())
print(data)

Data(x=[3, 1], edge_index=[2, 4])
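
What the transpose does can be seen without tensors. The sketch below converts a list of `(source, target)` pairs into the two-row layout that `edge_index` expects; `pairs_to_coo` is a hypothetical helper used only for illustration.

```python
def pairs_to_coo(pairs):
    """Turn [(src, dst), ...] into [[src, ...], [dst, ...]]."""
    if not pairs:
        return [[], []]
    sources, targets = zip(*pairs)
    return [list(sources), list(targets)]


pairs = [(0, 1), (1, 0), (1, 2), (2, 1)]
coo = pairs_to_coo(pairs)
print(coo)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
```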


In [26]:
####################
# Utility functions.
####################

# Keys of the graph.
print(data.keys)

# Access value of a key.
print(data['x']) # or print(data.x)

# Check if a key is contained in the graph.
print('edge_attr' in data)

# Size parameters of the graph.
print(f"""\nGraph parameters
---------------------
   # of nodes       : {data.num_nodes}
   # of edges       : {data.num_edges}
   # nodes features : {data.num_node_features}
   # edge features  : {data.num_edge_features}
   
   has_isolated_nodes : {data.has_isolated_nodes()}
   has_self_loops     : {data.has_self_loops()}
   is_directed        : {data.is_directed()}
""")



['x', 'edge_index']
tensor([[-1.],
        [ 0.],
        [ 1.]])
False

Graph parameters
---------------------
   # of nodes       : 3
   # of edges       : 4
   # nodes features : 1
   # edge features  : 0
   
   has_isolated_nodes : False
   has_self_loops     : False
   is_directed        : False



## Common Benchmark Datasets

- [TUDatasets](https://chrsmrrs.github.io/datasets/docs/datasets/)

In [36]:
from torch_geometric.datasets import TUDataset


dataset = TUDataset(root='ENZYMES', name='ENZYMES')

print(f""" 
    dataset             : {dataset}
    len(dataset)        : {len(dataset)}
    num_classes         : {dataset.num_classes}

    Node:
    -----
    num_node_features   : {dataset.num_node_features}
    num_node_attributes : {dataset.num_node_attributes}
    num_node_labels     : {dataset.num_node_labels}  

    Edge:
    -----      
    num_edge_features   : {dataset.num_edge_features}
    num_edge_attributes : {dataset.num_edge_attributes}
    num_edge_labels     : {dataset.num_edge_labels}
""")

 
    dataset             : ENZYMES(600)
    len(dataset)        : 600
    num_classes         : 6

    Node:
    -----
    num_node_features   : 3
    num_node_attributes : 0
    num_node_labels     : 3  

    Edge:
    -----      
    num_edge_features   : 0
    num_edge_attributes : 0
    num_edge_labels     : 0



- `len(dataset)` = Number of Graphs
- _ENZYMES_ consists of 600 graphs within 6 classes

In [38]:
# Access a specific graph.
data = dataset[3]
print(data)

print(f"is_undirected : {data.is_undirected()}")

Data(edge_index=[2, 90], x=[24, 3], y=[1])
is_undirected : True


Thus, the graph `dataset[3]` from _ENZYMES_ contains
- 24 nodes
  - each node has 3 features
- 90 entries in `edge_index` (since the graph is _undirected_, each edge is stored in both directions, so there are 45 edges)
- one label in `y`, i.e. the graph is assigned to exactly one class
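
The halving of the edge count relies on every edge being stored in both directions. A minimal sketch of that check, on plain Python lists and with hypothetical helper names, might look like this:

```python
def is_undirected(edge_index):
    """True if every edge (u, v) also appears as (v, u)."""
    edges = set(zip(*edge_index))
    return all((dst, src) in edges for src, dst in edges)


def num_undirected_edges(edge_index):
    """Count each {u, v} pair once; a self-loop counts once as well."""
    return len({frozenset(edge) for edge in zip(*edge_index)})


# Toy graph with the same layout as edge_index: row 0 = sources, row 1 = targets.
edge_index = [[0, 1, 1, 2],
              [1, 0, 2, 1]]

print(is_undirected(edge_index))         # True
print(num_undirected_edges(edge_index))  # 2
```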

In [42]:
data.y

tensor([5])

## Mini-batches

In [47]:
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

dataset = TUDataset(root='ENZYMES', name='ENZYMES', use_node_attr=True)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

print(f"Iterations : {len(loader)}\n")
for batch in loader:
    print(batch)
    print(f"num_graphs: {batch.num_graphs}")


Iterations : 19

DataBatch(edge_index=[2, 3850], x=[993, 21], y=[32], batch=[993], ptr=[33])
num_graphs: 32
DataBatch(edge_index=[2, 3814], x=[984, 21], y=[32], batch=[984], ptr=[33])
num_graphs: 32
DataBatch(edge_index=[2, 3756], x=[1020, 21], y=[32], batch=[1020], ptr=[33])
num_graphs: 32
DataBatch(edge_index=[2, 4030], x=[1118, 21], y=[32], batch=[1118], ptr=[33])
num_graphs: 32
DataBatch(edge_index=[2, 4350], x=[1088, 21], y=[32], batch=[1088], ptr=[33])
num_graphs: 32
DataBatch(edge_index=[2, 4132], x=[1077, 21], y=[32], batch=[1077], ptr=[33])
num_graphs: 32
DataBatch(edge_index=[2, 4266], x=[1093, 21], y=[32], batch=[1093], ptr=[33])
num_graphs: 32
DataBatch(edge_index=[2, 3772], x=[1069, 21], y=[32], batch=[1069], ptr=[33])
num_graphs: 32
DataBatch(edge_index=[2, 4150], x=[1118, 21], y=[32], batch=[1118], ptr=[33])
num_graphs: 32
DataBatch(edge_index=[2, 3820], x=[967, 21], y=[32], batch=[967], ptr=[33])
num_graphs: 32
DataBatch(edge_index=[2, 3708], x=[967, 21], y=[32], batch=

In [49]:
# num_iters_with_full_batch_size + one iteration with smaller batch size.
18*32 + 24

600
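
The same arithmetic generalizes to any dataset and batch size. The sketch below assumes the loader keeps the final, smaller batch, which matches the 19 iterations reported above; `batch_plan` is a hypothetical helper.

```python
import math


def batch_plan(num_samples, batch_size):
    """Return (number of iterations, size of the final batch)."""
    if num_samples <= 0:
        return 0, 0
    num_iters = math.ceil(num_samples / batch_size)
    last_batch = num_samples - (num_iters - 1) * batch_size
    return num_iters, last_batch


print(batch_plan(600, 32))  # (19, 24)
```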

In [59]:
from torch_scatter import scatter_mean
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader

dataset = TUDataset(root='ENZYMES', name='ENZYMES', use_node_attr=True)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for i, data in enumerate(loader):
    x = scatter_mean(data.x, data.batch, dim=0)
    print(f"""
        i                 : {i}
        data              : {data}
        num_graphs        : {data.num_graphs}
        scatter_mean.size : {x.size()}         (average node features in the node dimension)
    """)


        i                 : 0
        data              : DataBatch(edge_index=[2, 4030], x=[1035, 21], y=[32], batch=[1035], ptr=[33])
        num_graphs        : 32
        scatter_mean.size : torch.Size([32, 21])         (average node features in the node dimension)
    

        i                 : 1
        data              : DataBatch(edge_index=[2, 4318], x=[1116, 21], y=[32], batch=[1116], ptr=[33])
        num_graphs        : 32
        scatter_mean.size : torch.Size([32, 21])         (average node features in the node dimension)
    

        i                 : 2
        data              : DataBatch(edge_index=[2, 3726], x=[949, 21], y=[32], batch=[949], ptr=[33])
        num_graphs        : 32
        scatter_mean.size : torch.Size([32, 21])         (average node features in the node dimension)
    

        i                 : 3
        data              : DataBatch(edge_index=[2, 4042], x=[1157, 21], y=[32], batch=[1157], ptr=[33])
        num_graphs        : 32
      

- `batch`: column vector which maps each node to its respective graph in the batch
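
To make the role of `batch` concrete, the following sketch reproduces the per-graph averaging on plain Python lists. It illustrates the idea only and is not how `scatter_mean` is implemented; `mean_pool` and the toy values are hypothetical.

```python
def mean_pool(x, batch, num_graphs):
    """Average node feature rows per graph.

    x     : list of node feature rows, shape [num_nodes, num_features]
    batch : list mapping each node to its graph index, shape [num_nodes]
    """
    num_features = len(x[0])
    sums = [[0.0] * num_features for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for row, graph_id in zip(x, batch):
        counts[graph_id] += 1
        for j, value in enumerate(row):
            sums[graph_id][j] += value
    # Graphs without nodes keep an all-zero row.
    return [
        [s / counts[g] if counts[g] else 0.0 for s in sums[g]]
        for g in range(num_graphs)
    ]


# Two graphs: nodes 0-1 belong to graph 0, nodes 2-4 to graph 1.
x = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [3.0, 6.0], [6.0, 3.0]]
batch = [0, 0, 1, 1, 1]

pooled = mean_pool(x, batch, num_graphs=2)
print(pooled)  # [[2.0, 3.0], [3.0, 3.0]]
```

The result has one row per graph and one column per feature, which is why the size reported above is `[32, 21]` for a batch of 32 graphs with 21 node features.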

## Learning Methods on Graphs

Implementation of a simple two-layer GCN built from `GCNConv` layers.

- [Explanation of GCNs](http://tkipf.github.io/graph-convolutional-networks/)

In [60]:
from torch_geometric.datasets import Planetoid

# Load Cora dataset
dataset = Planetoid(root='Cora', name='Cora')

Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.tx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.allx
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.y
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ty
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.ally
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.graph
Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
Processing...
Done!


In [96]:
print(f""" 
    dataset[0] : {dataset[0]}
    dataset[0].is_undirected() : {dataset[0].is_undirected()}
""")

if dataset[0].is_undirected():
    print(f"actual number of edges : {dataset[0].edge_index.shape[1] / 2}")

 
    dataset[0] : Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
    dataset[0].is_undirected() : True

actual number of edges : 5278.0


The Cora graph from the Planetoid collection contains
- 2708 nodes with 1433 features each
- 10556 entries in `edge_index` (since the graph is undirected, each edge is stored twice, giving 10556/2 = 5278 edges)
- 2708 target values, one per node
  - Since `y` has shape `[2708, *]` we have _node-level prediction targets_, thus a target for each node.
- The train mask contains 2708 boolean entries, each specifying whether the corresponding node belongs to the training set
  - The same holds for the val and test masks
  - If, e.g., index 15 is `True` in the val mask, it is `False` in the train and test masks, because the three splits do not overlap.
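
The disjointness of the masks can be checked directly. The sketch below uses short hypothetical masks in place of the 2708-entry tensors; on the real data the same comparison would be applied to `train_mask`, `val_mask`, and `test_mask`.

```python
def masks_are_disjoint(*masks):
    """True if no position is True in more than one mask."""
    return all(sum(flags) <= 1 for flags in zip(*masks))


# Hypothetical masks for a 6-node graph; node 5 belongs to no split.
train_mask = [True, True, False, False, False, False]
val_mask = [False, False, True, False, False, False]
test_mask = [False, False, False, True, True, False]

print(masks_are_disjoint(train_mask, val_mask, test_mask))  # True
print(masks_are_disjoint(train_mask, train_mask))           # False
```

Note that disjoint masks do not have to cover every node: a node may be excluded from all three splits.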

In [97]:
print(f"""
    dataset           : {dataset}
    num_node_features : {dataset.num_node_features}
    num_edge_features : {dataset.num_edge_features}
    num_classes       : {dataset.num_classes}
    num_features      : {dataset.num_features}
""")


    dataset           : Cora()
    num_node_features : 1433
    num_edge_features : 0
    num_classes       : 7
    num_features      : 1433



In [101]:
# Implementation of a 2-layer GCN.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_node_features, 16)
        self.conv2 = GCNConv(16, dataset.num_classes)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)

- 2 convolutional layers
- 1st layer : `[1433, 16]`
- ReLU
- Dropout
- 2nd layer : `[16, 7]`
- Log-Softmax
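
For intuition about what a single graph convolution computes, the sketch below applies the propagation rule described in the linked GCN explanation, $\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} X$, to the 3-node graph from the data handling section. It is an assumption that this mirrors the default behavior of `GCNConv` (self-loops plus symmetric normalization); the learnable weight matrix and bias are left out, and `gcn_propagate` is a hypothetical helper.

```python
import math


def gcn_propagate(adj, x):
    """Compute D^-1/2 (A + I) D^-1/2 X on plain Python lists.

    Assumes adj is a 0/1 adjacency matrix without self-loops.
    """
    n = len(adj)
    # Add self-loops so each node also keeps its own features.
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    out = []
    for i in range(n):
        row = [0.0] * len(x[0])
        for j in range(n):
            if a_hat[i][j]:
                # Symmetric normalization by the degrees of both endpoints.
                weight = a_hat[i][j] / math.sqrt(deg[i] * deg[j])
                for k, value in enumerate(x[j]):
                    row[k] += weight * value
        out.append(row)
    return out


adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
x = [[-1.0], [0.0], [1.0]]

propagated = gcn_propagate(adj, x)
for row in propagated:
    print([round(v, 4) for v in row])
```

In the model above, each `GCNConv` layer additionally multiplies the features by a learned weight matrix, which is what changes the feature dimension from 1433 to 16 and from 16 to 7.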

In [146]:
###################################################
# Train model on the training nodes for 200 epochs.
###################################################

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Move model and data to the selected device (GPU if available, otherwise CPU).
model = GCN().to(device)
data = dataset[0].to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# Set module in training mode.
model.train()

NUM_EPOCHS = 200
for epoch in range(NUM_EPOCHS):
    # Set all gradients to zero.
    optimizer.zero_grad()

    out = model(data) 

    # Calculate negative log likelihood loss for the train data.
    loss = F.nll_loss(
        out[data.train_mask],
        data.y[data.train_mask]
    )

    # Backpropagation: Compute gradients via chain rule.
    loss.backward()

    # Performs a single optimization step (parameter update).
    optimizer.step()

    # Print training statistics.
    running_loss = loss.item()
    if epoch % 20 == 0:    # print every 20 epochs
        print(f'Epoch: {(epoch + 1):3d}  Loss: {running_loss:.10f}')
        for pg in optimizer.param_groups:
            print(pg['weight_decay'])
        running_loss = 0.0



Epoch:   1  Loss: 1.9526757002
0.0005
Epoch:  21  Loss: 0.2469878048
0.0005
Epoch:  41  Loss: 0.0600667670
0.0005
Epoch:  61  Loss: 0.0701837912
0.0005
Epoch:  81  Loss: 0.0456110276
0.0005
Epoch: 101  Loss: 0.0341489501
0.0005
Epoch: 121  Loss: 0.0428811610
0.0005
Epoch: 141  Loss: 0.0218738653
0.0005
Epoch: 161  Loss: 0.0245752838
0.0005
Epoch: 181  Loss: 0.0235794187
0.0005
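
The loss in the loop above combines the `log_softmax` in the model with `nll_loss` restricted to the training nodes. The sketch below spells out that computation on plain Python lists with made-up scores; it illustrates the formula and is not the PyTorch implementation. The helper names are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one row of scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]


def masked_nll(scores, targets, mask):
    """Mean negative log-likelihood over the rows selected by mask."""
    losses = [
        -log_softmax(row)[target]
        for row, target, keep in zip(scores, targets, mask)
        if keep
    ]
    return sum(losses) / len(losses)


# Made-up scores for 3 nodes and 2 classes; only nodes 0 and 1 are training nodes.
scores = [[0.0, 0.0], [2.0, 0.0], [0.0, 5.0]]
targets = [0, 0, 1]
train_mask = [True, True, False]

loss = masked_nll(scores, targets, train_mask)
print(f"{loss:.4f}")
```

Node 2 does not contribute to the loss because its mask entry is `False`, just as only the nodes selected by `data.train_mask` contribute in the training loop.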


In [141]:
# Evaluate the model on the test nodes.
model.eval()

pred = model(data).argmax(dim=1)  # Predicted class per node: index of the largest log-probability in each row.
correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()  # Number of correct predictions.
acc = int(correct) / int(data.test_mask.sum())  # How many of all test nodes were predicted correctly?
print(f"Accuracy : {acc:.4f}")

Accuracy : 0.7960
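
The evaluation step reduces to an argmax per node followed by a masked comparison. A minimal sketch on plain Python lists, with made-up scores and hypothetical helper names, could look like this:

```python
def argmax(row):
    """Index of the largest value in a row (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def masked_accuracy(scores, targets, mask):
    """Fraction of masked nodes whose predicted class equals the target."""
    pairs = [
        (argmax(row), target)
        for row, target, keep in zip(scores, targets, mask)
        if keep
    ]
    correct = sum(pred == target for pred, target in pairs)
    return correct / len(pairs)


# Made-up scores for 4 nodes and 2 classes; node 0 is not a test node.
scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
targets = [1, 0, 0, 1]
test_mask = [False, True, True, True]

acc = masked_accuracy(scores, targets, test_mask)
print(f"Accuracy : {acc:.4f}")
```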
