# Neuro-symbolic AI - GNN

This workbook explores the concept of **hybrid intelligent systems**, and in particular how to combine a knowledge-base structure with a neural network. In other words, it shows how to create a **neuro-symbolic** solution.


Throughout this notebook, you will work on three simple hybrid systems, using graphs and, of course, neural networks. You will find some guided examples to aid your understanding, and some exercises for you to implement on your own.

#### Content:
* [GNN - Graph Neural Networks](#gnn)
    * [Getting started](#gnn-start)
    * [Exercise: node-level task](#gnn-node)
    * [Exercise: edge-level task](#gnn-edge)
    * [Exercise: graph-level task](#gnn-graph)

## GNN - Graph Neural Networks <a class="anchor" id="gnn"></a>

**Graph Neural Networks (GNNs)** are a type of neural network designed to work with **graph data**, a structured representation of data as _nodes_ and _edges_ (see Week3 and Week4 learning material for more details). GNNs can be used in various applications such as social network analysis, recommendation systems, and drug discovery.

GNNs accept a graph as input, with information attached to its nodes, edges and global context. Through the layers of the neural network, GNNs capture the relationships and dependencies in the data without changing the connectivity of the input graph.

The picture below shows the structure of a GNN:

<figure>
<img src="gnn_structure.png" alt="GNN structure" style="width: 800px;"/>
<figcaption style="text-align:center;font-style:italic">(from <a href='https://tkipf.github.io/graph-convolutional-networks/'>Kipf, T.(2016) "Graph Convolutional Neural Network"</a> ) </figcaption>    
</figure>

In a GNN, the layers are responsible for aggregating information from neighboring nodes and iteratively updating each node's features. In most GNNs, this process is performed through **message passing**, a key concept that involves three steps:

1. **gathering**: for each node in the graph, the layer gathers all neighboring node features (also called _messages_) using a defined function.
2. **aggregating**: aggregate all gathered messages using a function like sum to create a pooled message.
3. **updating**: pass the pooled messages through an update function, to update the node information for the next layer.

This process is shown in the picture below:


<figure>
<img src="message_passing.png" alt="GNN message passing" style="width: 800px;"/>
<figcaption style="text-align:center;font-style:italic">(from <a href='https://distill.pub/2021/gnn-intro'>Sanchez-Lengeling, et al., "A Gentle Introduction to Graph Neural Networks", Distill, 2021</a> ) </figcaption>    
</figure>
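
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used by any library: the toy graph, the feature values, the sum aggregation, and the update rule (own features plus pooled message, a linear map, then ReLU) are all assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np

# Toy undirected graph with 4 nodes and links 0-1, 0-2, 2-3 (made-up data).
# Each link is listed in both directions as (source, target).
edges = [(0, 1), (1, 0), (0, 2), (2, 0), (2, 3), (3, 2)]

# One 2-dimensional feature vector per node (illustrative values).
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0],
                     [2.0, 0.0]])


def message_passing_step(features, edges, weight):
    pooled = np.zeros_like(features)
    for source, target in edges:
        # 1. gathering: the message is the feature vector of the neighbour
        message = features[source]
        # 2. aggregating: sum all messages arriving at the target node
        pooled[target] += message
    # 3. updating: combine each node's own features with its pooled message,
    #    then apply a linear map followed by a ReLU non-linearity
    return np.maximum((features + pooled) @ weight, 0.0)


weight = np.eye(2)  # identity weights, so the output is just the summed features
updated = message_passing_step(features, edges, weight)
print(updated)
```

With identity weights, node 0 ends up with its own features plus those of nodes 1 and 2. In a trained GNN, `weight` would be learned.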

There are different types of GNNs. Two of the most important are Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs):

* [**Graph Convolutional Networks (GCN)**](https://openreview.net/pdf?id=SJU4ayYgl): this type of GNN uses **convolution operations** to update node representations during the message passing phase, similar to the convolution operations used in Convolutional Neural Networks (CNNs) for image processing. GCNs are effective for tasks where nodes have a clear notion of ordering and locality in the graph structure (for example in human-pose based tasks).
* [**Graph Attention Networks (GAT)**](https://arxiv.org/pdf/1710.10903.pdf): in this case **attention mechanisms** are used to assign different weights to different neighbors based on their relevance, instead of using fixed weights to aggregate information during message passing. GATs are particularly useful when nodes have varying importance (for example in social network analysis tasks), or when capturing long-range dependencies in the graph.
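
To make the "fixed weights" of a GCN concrete, the sketch below applies the propagation rule from the GCN paper linked above, $H' = \mathrm{ReLU}(\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} H W)$, to a tiny graph. It is a dense NumPy illustration under assumed toy inputs (a 3-node path graph, identity features, all-ones weights), not how PyTorch Geometric implements `GCNConv` internally.

```python
import numpy as np


def gcn_layer(adjacency, features, weight):
    # add self-loops so that each node also keeps its own features
    a_hat = adjacency + np.eye(adjacency.shape[0])
    # degree of every node in the self-looped graph
    degree = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(degree))
    # symmetric normalisation: entry (i, j) becomes 1 / sqrt(d_i * d_j)
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
    # aggregate neighbour features, apply the weights, then ReLU
    return np.maximum(a_norm @ features @ weight, 0.0)


# path graph 0 - 1 - 2 (made-up example)
adjacency = np.array([[0.0, 1.0, 0.0],
                      [1.0, 0.0, 1.0],
                      [0.0, 1.0, 0.0]])
features = np.eye(3)          # one-hot feature per node
weight = np.ones((3, 2))      # stand-in for learned weights
hidden = gcn_layer(adjacency, features, weight)
print(hidden.shape)
```

The coefficients in `a_norm` depend only on node degrees. A GAT would replace them with scores computed from the node features themselves.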

A detailed (but accessible) description of the characteristics of GNNs can be found here: https://distill.pub/2021/gnn-intro/

GNNs are mainly used for three kinds of task:
* **node classification**: predicting the label of a node in a graph
* **link prediction**: predicting the existence of edges between nodes
* **graph classification**: classifying an entire graph based on its structural properties

Below we are going to look into these 3 tasks in more detail.

### Getting started <a class="anchor" id="gnn-start"></a>

Let's start by importing all the libraries we need. We will implement our model using the deep learning package [PyTorch](https://github.com/pytorch/pytorch) (similar to TensorFlow).

In [None]:
# uncomment the following if you do not have PyTorch already installed
# !pip install torch

# uncomment the following if you do not have PyTorch Geometric already installed
# PyTorch Geometric provides us a set of common graph layers
# !pip install torch-geometric

In [None]:
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

import torch
from torch_geometric.utils import to_networkx
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid, TUDataset

### Exercise: node-level task <a class="anchor" id="gnn-node"></a>

For node-level tasks, GNNs are used to classify nodes in a graph in a semi-supervised manner: we have a graph in which only some of the nodes are labeled. The aim of the GNN is to learn from those labeled examples during training and to generalize to the unlabeled nodes.

In this guided exercise we will use the [Cora dataset](https://en.wikipedia.org/wiki/CORA_dataset), a citation network consisting of 2708 scientific publications (nodes) connected by edges that represent the citation of one paper by another. The papers in the graph are labeled with 7 different labels: Neural Networks, Rule Learning, Reinforcement Learning, Probabilistic Methods, Theory, Genetic Algorithms, Case-Based Reasoning. Much as MNIST is widely used as a benchmark for computer vision models, Cora is widely used to evaluate the performance of graph neural networks and other graph-based algorithms.

Let's start by exploring the dataset.

**Cora dataset**

In [None]:
# get the Cora dataset + metadata
# this will download the dataset into the current folder
cora_dataset = Planetoid(root='./', name="Cora")

# get the actual data (the dataset contains a single graph)
cora = cora_dataset[0]

In [None]:
# explore dataset characteristics
print('Number of nodes:', cora.num_nodes)
print('Number of edges:', cora.num_edges)
print('Number of features:', cora_dataset.num_features)
print('Number of classes:', cora_dataset.num_classes)
print('Has isolated nodes:', cora.has_isolated_nodes())  
print('Has self-loops:', cora.has_self_loops())  
print('Is undirected:', cora.is_undirected())  

The dataset contains 2708 publications and 10556 links. Each node/publication in the dataset is described by a 0/1-valued vector indicating whether or not a specific word from the dictionary appears in the publication. The dictionary consists of 1433 unique words (features).
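
A note on the edge count: PyTorch Geometric stores connectivity in an `edge_index` array of shape `[2, num_edges]`, and an undirected link appears there once per direction. Assuming `Is undirected` printed `True` above, the 10556 edges reported by `num_edges` would therefore correspond to 10556 / 2 = 5278 distinct pairs of papers. The sketch below builds such an array by hand for a made-up four-paper graph, using NumPy as a stand-in for a tensor.

```python
import numpy as np

# three undirected citation links between four papers (made-up example)
undirected_links = [(0, 1), (0, 2), (2, 3)]

# store each link in both directions as a 2 x num_edges array:
# row 0 holds the source nodes, row 1 holds the target nodes
sources = [u for u, v in undirected_links] + [v for u, v in undirected_links]
targets = [v for u, v in undirected_links] + [u for u, v in undirected_links]
edge_index = np.array([sources, targets])

print(edge_index.shape)          # (2, 6)
print(edge_index.shape[1] // 2)  # 3 distinct undirected links
```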

We can use NetworkX to plot the graph (the code below will take a few minutes to run).

In [None]:
G = to_networkx(cora, to_undirected=True)

# use spring layout
pos = nx.spring_layout(G)

# compute degree centrality
# degree represents the number of edges from each node,
# the centrality allows us to understand which nodes are more 'popular'
cent = nx.degree_centrality(G)
cent_array = np.array(list(cent.values()))

# size of nodes will be proportional to their popularity
node_size = list(map(lambda x: x * 500, cent.values()))


plt.figure(figsize=(20, 20))

#draw nodes
nodes = nx.draw_networkx_nodes(G, pos, node_size=node_size,
                               cmap=plt.cm.plasma,
                               nodelist=list(cent.keys()))
# draw edges
edges = nx.draw_networkx_edges(G, pos, width=0.25, alpha=0.3)

plt.show()
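
For reference, the degree centrality that `nx.degree_centrality` returns for a node is its degree divided by the largest possible degree, `n - 1`. The pure-Python sketch below reproduces that calculation, and the node-size scaling used above, on a made-up four-node graph.

```python
# toy undirected graph as an adjacency list (made-up data)
neighbours = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}

n = len(neighbours)
# degree centrality: fraction of the other nodes each node is connected to
centrality = {node: len(adj) / (n - 1) for node, adj in neighbours.items()}

# same scaling as in the plot above: node size proportional to centrality
node_size = [value * 500 for value in centrality.values()]
print(centrality)
```

Node 0 is linked to every other node, so its centrality is 1.0 and it would be drawn largest.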

The papers in the Cora dataset are divided into 7 classes. Let's see which of the classes is the most common (and whether we have an unbalanced dataset).

In [None]:
# associate labels to class numbers
# see https://keras.io/examples/graph/gnn_citations/
label_dict = {
    0: "Theory",
    1: "Reinforcement_Learning",
    2: "Genetic_Algorithms",
    3: "Neural_Networks",
    4: "Probabilistic_Methods",
    5: "Case_Based",
    6: "Rule_Learning"}

In [None]:
# get the labels
labels = cora.y.numpy()

In [None]:
# count instances for each label
count_label = [list(labels).count(key) for key in label_dict.keys()]
dict(zip(label_dict.values(), count_label))

**Task 1 (optional)**

Colour-code the nodes in the above graph by subject area.

In [None]:
# write here your code

**Train the model**

It's time to train our model. The Cora data object contains three boolean masks, `train_mask`, `val_mask`, and `test_mask`, which indicate which nodes we should use for training, validation, and testing, respectively. The `x` tensor is the **feature** tensor of our 2708 publications, and `y` holds the **labels** for all nodes (see above, already used to count the labels).

Note that `x` and `y` are [PyTorch tensors](https://pytorch.org/tutorials/beginner/introyt/tensors_deeper_tutorial.html), multi-dimensional arrays that allow for efficient computation within the PyTorch framework.
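
The masks are used by indexing a tensor with them, which keeps only the rows where the mask is `True`. The sketch below shows the idea on made-up data. NumPy arrays stand in for the tensors here, which is an assumption made to keep the example dependency-light; PyTorch tensors support the same boolean-mask indexing.

```python
import numpy as np

# five nodes with three features each, and one label per node (made-up values)
x = np.arange(15).reshape(5, 3)
y = np.array([0, 1, 1, 2, 0])

# boolean masks select disjoint subsets of the nodes
train_mask = np.array([True, True, False, False, False])
test_mask = np.array([False, False, False, True, True])

print(x[train_mask].shape)  # (2, 3): features of the training nodes
print(y[test_mask])         # [2 0]: labels of the test nodes
print(train_mask.sum())     # 2: number of training nodes
```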

In [None]:
print('Number of training nodes:', cora.train_mask.sum().numpy())
print('Number of validation nodes:', cora.val_mask.sum().numpy())
print('Number of test nodes:', cora.test_mask.sum().numpy())

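**Note**: The three masks cover 1640 labelled nodes in total (140 for training, 500 for validation, 1000 for testing). Only the training nodes contribute to the loss during the learning process; the others are held out for evaluation.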

We are going to use a GCN model structure for our task. The model will contain two GCNConv layers, ReLU activation, a dropout rate of 0.1, and 16 hidden channels.

In [None]:
dim_in = cora.num_features
dim_out = cora_dataset.num_classes
hidden_channels = 16

class GCNModel(torch.nn.Module):
    def __init__(self):
        # define model components
        super(GCNModel, self).__init__()
        
        # activation function
        self.relu = torch.nn.ReLU()
        
        # layers
        self.conv1 = GCNConv(dim_in, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, dim_out)
        
        # dropout
        self.dropout = torch.nn.Dropout(p=0.1)
        
    # construct the feed-forward network
    def forward(self, data):
        # for graph layers, we need the feature of the nodes (the tensor 'x')
        # and we need also the "edge_index" tensor as additional input
        # this represents which node is connected to which other node
        x, edge_index = data.x, data.edge_index
        
        # layer 1: input conv layer (1433, 16) followed by a ReLU activation
        x = self.relu(self.conv1(x, edge_index))
        
        # add dropout with probability 0.1
        x = self.dropout(x)
        
        # layer 2: output conv layer (16, 7) followed by a ReLU activation
        x = self.relu(self.conv2(x, edge_index))
        
        return x

A more elaborate implementation of the same idea, written with TensorFlow/Keras, is available here: https://keras.io/examples/graph/gnn_citations/

In [None]:
# instantiate the model, the optimizer and the loss function
model = GCNModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()
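
`torch.nn.CrossEntropyLoss` expects the raw scores (logits) produced by the model and the integer class labels. It applies a log-softmax to the scores and averages the negative log-probability assigned to the correct class. The NumPy sketch below reproduces that calculation by hand on made-up numbers; it is meant as an illustration of the formula, not a replacement for the PyTorch criterion.

```python
import numpy as np


def cross_entropy(logits, targets):
    # subtract the row-wise maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    # log-softmax: log of the normalised class probabilities
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # pick the log-probability of the correct class for every node, then average
    return -log_probs[np.arange(len(targets)), targets].mean()


# two nodes, three classes (made-up scores)
logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.2, 0.2]])
targets = np.array([0, 2])
print(cross_entropy(logits, targets))
```

The second node has identical scores for all three classes, so its contribution to the loss is `log(3)`, the value for a uniform guess.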

Before we start training our model, we want to visualise the node representations (embeddings) of our untrained GCN network, in other words what the network knows (or doesn't know) at the moment. For this visualisation we make use of [**t-SNE**](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html), a statistical method for projecting high-dimensional representations onto a 2D plane.

In [None]:
from sklearn.manifold import TSNE

# the function below takes the nodes embeddings as 'known' by the GCN and the related labels
# the function returns a 2D-plot
def visualise_embeddings(nodes_embed, labels):
    tsne = TSNE(n_components=2).fit_transform(nodes_embed.detach().numpy())

    plt.figure(figsize=(10,10))
    plt.xticks([])
    plt.yticks([])

    plt.scatter(tsne[:, 0], tsne[:, 1], s=70, c=labels, cmap="Set2")
    plt.show()
    


In [None]:
# get GCN knowledge of the nodes embeddings
kb = model(cora)

# plot the embeddings on a 2D plane, coloured by class label
visualise_embeddings(kb, labels=cora.y)

Right now, our GCN knowledge is very messy! Let's train the model and make the network learn about the Cora dataset.

In [None]:
# Helper function for the training phase:
# compute the loss and perform back-propagation
def train():
    model.train()
    # set the gradients to zero before starting the backpropagation of the training process
    # see this thread for more details: 
    # https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch/48009142#48009142
    optimizer.zero_grad()
    out = model(cora)
    loss = criterion(out[cora.train_mask], cora.y[cora.train_mask])
    loss.backward()
    optimizer.step()
    return loss

# Helper function for the testing phase
def test():
    model.eval()
    out = model(cora)
    pred = out.argmax(dim=1)
    # count correct predictions
    acc = (pred[cora.test_mask] == cora.y[cora.test_mask]).sum().item() / cora.test_mask.sum().item()
    return acc

In [None]:
#Train the model
EPOCHS = 201
for epoch in range(EPOCHS):
    loss = train()
    acc = test()
    if epoch % 10 == 0:    # print every 10 epochs
        print(f'Epoch: {epoch:03d}, Train Loss: {loss:.4f}, Test Accuracy: {acc:.4f}')

Not too bad for a very simple GCN! Let's see what the model now knows about the dataset.

In [None]:
# get GCN knowledge of the nodes embeddings
kb = model(cora)

# plot the embeddings on a 2D plane, coloured by class label
visualise_embeddings(kb, labels=cora.y)

The 'knowledge' of our system is now more structured!

**Task 2**

The [**Zachary's karate club network**](https://en.wikipedia.org/wiki/Zachary%27s_karate_club) is a graph describing the social network of 34 members (nodes) of a karate club, and the interactions between members outside the club (edges). There are 4 classes, representing the community each member (node) belongs to.

Implement a GCN for this node-level task, to classify the members into their 4 communities.

In [None]:
# write here your code

# uncomment the following code to get the dataset
# from torch_geometric.datasets import KarateClub
# dataset = KarateClub()[0]

### Exercise: edge-level task <a class="anchor" id="gnn-edge"></a>

For edge-level tasks, GNNs are used to predict whether there is, or should be, an edge between two nodes. This is particularly useful for social network analysis.

There are different approaches to edge prediction. Most are based on an encoding-decoding process: we encode the nodes by training a GCN (as done above) to produce node embeddings, then decode by computing a similarity score for each pair of node embeddings. The score should be close to 1 if there should be a link, and close to 0 otherwise.
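
The decoding step can be sketched as follows. One common choice of similarity score, assumed here, is the dot product of the two node embeddings squashed through a sigmoid so that it lies between 0 and 1. The embeddings below are made up, as if they had come from a trained encoder, and `decode` is a hypothetical helper name.

```python
import numpy as np


def decode(embeddings, node_pairs):
    # node_pairs has shape (2, num_pairs): first row sources, second row targets
    src, dst = node_pairs
    # dot product between the embeddings of the two endpoints of each pair
    scores = (embeddings[src] * embeddings[dst]).sum(axis=1)
    # squash the score into (0, 1) so that it reads as a link probability
    return 1.0 / (1.0 + np.exp(-scores))


# made-up embeddings for three nodes
embeddings = np.array([[2.0, 1.0],
                       [1.5, 1.0],
                       [-2.0, -1.0]])
# candidate links: (0, 1) and (0, 2)
node_pairs = np.array([[0, 0],
                       [1, 2]])
probs = decode(embeddings, node_pairs)
print(probs)
```

Nodes 0 and 1 have similar embeddings and get a score close to 1, while nodes 0 and 2 point in opposite directions and get a score close to 0.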

**Task 3**


For this task, please review the code in the [PyTorch Geometric link prediction](https://github.com/pyg-team/pytorch_geometric/blob/master/examples/link_pred.py) example. It again uses the Cora dataset, but this time to predict edges between nodes.


In [None]:
# write here your code

### Exercise: graph-level task <a class="anchor" id="gnn-graph"></a>



Graph-level tasks refer to the problem of classifying entire graphs based on some structural graph properties. For these tasks, the input is a dataset of graphs. These solutions are often applied in the field of chemistry to classify chemical compounds: molecules can be represented as graphs, where atoms are nodes and chemical bonds are edges.

For our guided exercise we will use the [MUTAG](https://paperswithcode.com/dataset/mutag) dataset. This dataset consists of 188 chemical compounds, each represented as a graph. Each graph (molecule) has around 18 nodes and 20 edges, and is labeled as either mutagenic (positive class) or non-mutagenic (negative class).

In [None]:
mutag = TUDataset(root='./', name='MUTAG')

In [None]:
# explore dataset characteristics
print('Number of graphs:', len(mutag))
print('Number of features:', mutag.num_features)
print('Number of classes:', mutag.num_classes)

Each node in the graph has 7 features (atom types). Let's get some details on a specific graph.

In [None]:
# get first graph
g = mutag[0]  

print('Number of nodes:',g.num_nodes)
print('Number of edges:',g.num_edges)
print('Graph label:', g.y.numpy().item())
print('Has isolated nodes:', g.has_isolated_nodes())
print('Has self-loops:',g.has_self_loops())
print('Is undirected:',g.is_undirected())

We can draw the graph using NetworkX.

In [None]:
G = to_networkx(g, to_undirected=True)

nx.draw(G)

plt.show()

We can now start training our model. After shuffling the graphs in our dataset, we split them into training (about 80%) and test (about 20%) sets: the first 150 graphs are used for training and the remaining 38 for testing.

In [None]:
# set the random seed for the shuffling
torch.manual_seed(42)
# shuffle dataset
mutag = mutag.shuffle()

# create train and test split
train_dataset = mutag[:150]
test_dataset = mutag[150:]

print('Number of training graphs:',len(train_dataset))
print('Number of test graphs:',len(test_dataset))

While training a model, we want to pass the data in 'batches' and reshuffle it at every epoch, which helps to reduce overfitting. In PyTorch this is done using [`DataLoader`](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#preparing-your-data-for-training-with-dataloaders), an iterable construct that creates the batches to feed into the model. Here we use the PyTorch Geometric version of `DataLoader`, which knows how to batch graphs.
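
Batching graphs differs from batching images because graphs have different sizes. PyTorch Geometric handles this by merging the graphs of a batch into one large disconnected graph: node features are stacked, edge indices are shifted so they keep pointing at the right nodes, and a `batch` vector records which graph each node came from. The NumPy sketch below imitates this for two made-up graphs; it illustrates the idea and is not the library's own code.

```python
import numpy as np

# two tiny graphs (made-up): a 3-node path and a 2-node pair
x_a = np.ones((3, 2))
edges_a = np.array([[0, 1, 1, 2],
                    [1, 0, 2, 1]])
x_b = np.zeros((2, 2))
edges_b = np.array([[0, 1],
                    [1, 0]])

# stack the node features of both graphs
x = np.vstack([x_a, x_b])
# shift the node indices of the second graph by the size of the first
edge_index = np.hstack([edges_a, edges_b + len(x_a)])
# the batch vector records which graph every node came from
batch = np.array([0] * len(x_a) + [1] * len(x_b))

print(x.shape)     # (5, 2)
print(edge_index)
print(batch)       # [0 0 0 1 1]
```

No edge connects the two groups of nodes, so message passing never mixes information between graphs in the same batch.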

In [None]:
from torch_geometric.loader import DataLoader

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

For the structure of our model, we can reuse the one from the node-level task. However, a graph-level task produces one label per graph rather than one per node, so we need to add an average pooling step, which collapses the node embeddings of each graph into a single vector, and a linear layer that maps this vector to the class scores.
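
The average pooling step can be sketched in NumPy as follows. The function `mean_pool` is a hypothetical stand-in written for illustration; it assumes a `batch` vector that assigns every node to a graph, in the same spirit as the one passed to `global_mean_pool` in the model below.

```python
import numpy as np


def mean_pool(node_embeddings, batch):
    num_graphs = batch.max() + 1
    # average the rows (nodes) that belong to each graph
    return np.stack([node_embeddings[batch == g].mean(axis=0)
                     for g in range(num_graphs)])


# five nodes with 2-dimensional embeddings (made-up values)
node_embeddings = np.array([[1.0, 2.0],
                            [3.0, 4.0],
                            [5.0, 6.0],
                            [0.0, 2.0],
                            [4.0, 0.0]])
# the first three nodes belong to graph 0, the last two to graph 1
batch = np.array([0, 0, 0, 1, 1])
pooled = mean_pool(node_embeddings, batch)
print(pooled)  # one row per graph
```

Whatever the number of nodes in each graph, the result has one fixed-size vector per graph, which is what the final linear layer needs.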

In [None]:
from torch.nn import Linear
from torch_geometric.nn import global_mean_pool

dim_in = mutag.num_node_features
dim_out = mutag.num_classes
hidden_channels = 16

class GCNModelGraph(torch.nn.Module):
    def __init__(self):
        super(GCNModelGraph, self).__init__()
        
        # activation function
        self.relu = torch.nn.ReLU()
        
        # layers
        self.conv1 = GCNConv(dim_in, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.lin = Linear(hidden_channels, dim_out)
        
        # dropout
        self.dropout = torch.nn.Dropout(p=0.1)
        
        
    def forward(self, data):
        # for graph layers, we need the feature of the nodes (the tensor 'x')
        # and we need also the "edge_index" tensor as additional input
        # this represents which node is connected to which other node
        x, edge_index, batch = data.x, data.edge_index, data.batch
        
        # layer 1: activate input conv layer (7,16) using ReLu function
        x = self.relu(self.conv1(x, edge_index))
        
        # add dropout with probability 0.1
        x = self.dropout(x)
        
        # layer 2 : activate  conv layer (16,16) using ReLu function
        x = self.relu(self.conv2(x, edge_index))

        # average pooling
        x = global_mean_pool(x, batch)
        
        # get final classifier
        x = self.lin(x)
        
        return x

model = GCNModelGraph()
print(model)

In [None]:
# instantiate the model, the optimizer and the loss function
model = GCNModelGraph()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

In [None]:
# these helper functions are similar to the one used above

def train():
    model.train()
    # iterate through the batches generated by the Dataloader
    for data in train_loader:
        # set the gradients to zero before starting the backpropagation of the training process
        # see this thread for more details: 
        # https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch/48009142#48009142
        optimizer.zero_grad()
        out = model(data)
        loss = criterion(out, data.y)
        loss.backward()
        optimizer.step()
    return loss


def test():
    model.eval()
    correct_pred = 0
    # iterate through the batches generated by the Dataloader
    for data in test_loader:
        out = model(data)
        pred = out.argmax(dim=1)
        # count correct predictions
        correct_pred += int((pred == data.y).sum())
    return correct_pred / len(test_loader.dataset)


In [None]:
#Train the model
EPOCHS = 201
for epoch in range(EPOCHS):
    loss = train()
    acc = test()
    if epoch % 10 == 0:    # print every 10 epochs
        print(f'Epoch: {epoch:03d}, Train Loss: {loss:.4f}, Test Accuracy: {acc:.4f}')

**Task 4**

The accuracy obtained is not very high. What happens if you add a third GCNConv layer and increase the dropout probability?

In [None]:
# write here your code

**Task 5**

The MUTAG dataset is part of a large collection of graph classification datasets, known as the [TUDatasets](https://chrsmrrs.github.io/datasets/). All these datasets are directly accessible in PyTorch Geometric through [`torch_geometric.datasets.TUDataset`](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.TUDataset).

Implement a similar GCN system with another TUDataset.

In [None]:
# write here your code