In [1]:
import torch
import torch_geometric

# Graph Machine Learning with Graph Neural Networks (GNNs)

Having explored network science, we are about to dive into Graph Neural Networks (GNNs). The best introduction to GNNs is a long blog post by Sanchez-Lengeling et al. entitled [A Gentle Introduction to Graph Neural Networks](https://distill.pub/2021/gnn-intro/), which the authors have _generously_ licensed under the Creative Commons. This lets me use their work to explain how GNNs work while providing source code alongside it to turn your theoretical understanding into a practical one.

## Citation: A Gentle Introduction to Graph Neural Networks

Parts of the content in Part 4 of this course are based upon: `Sanchez-Lengeling, et al., "A Gentle Introduction to Graph Neural Networks", Distill, 2021.` This content is cited inline. Students are encouraged to read this blog post before or after class, and to reference it if they become confused about concepts in their data science and machine learning practice. 

The full list of authors is:

* [Benjamin Sanchez-Lengeling](https://research.google/people/106640/)
* [Emily Reif](https://research.google/people/106150/)
* [Adam Pearce](https://research.google/people/AdamPearce/)
* [Alexander B. Wiltschko](https://www.linkedin.com/in/alex-wiltschko-0a7b7537/)

During the course you will have access to the instructor, who understands GNNs and can elaborate further and answer any questions you may have :)

## Why is there so much talk about Graph Neural Networks?

Knowledge graphs are at the peak of the Gartner hype cycle, and graph neural networks (GNNs) are soon to climb it because they unlock the potential of enterprise knowledge graphs. Data lakes put data in one place, knowledge graphs link datasets together, and graph neural networks automate business processes using data from across an enterprise.



Most graph databases are fast becoming cloud-based GNN platforms:

* Neo4j → [Neo4j Graph Data Science](https://neo4j.com/product/graph-data-science/)
* TigerGraph → [Machine Learning Workbench](https://www.tigergraph.com/ml-workbench/)
* ArangoDB → [ArangoGraphML](https://www.arangodb.com/arangodb-for-machine-learning/)
* Kumo → [SQL query the future](https://kumo.ai/)


# Notes: Extra Text

Let's wrap our dataset in a `torch_geometric` `Dataset` class.

# PyG: PyTorch Geometric aka `torch_geometric`

## Describing Graphs with PyG `Data` Classes

Entire graphs in PyG are described by `Data` objects. The simple 3-node, 2-edge graph with a single feature per node from the [PyG documentation](https://pytorch-geometric.readthedocs.io/en/latest/get_started/introduction.html) is shown below. Note that each undirected edge must be listed in both directions, so the 2 edges become 4 columns in `edge_index`.

<center><img src="images/3-node-2-edge-pyg-graph.svg" width="300px" /></center>

In [2]:
import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[-1], [0], [1]], dtype=torch.float)

data = Data(x=x, edge_index=edge_index)
print(data)
data.validate(raise_on_error=True)

Data(x=[3, 1], edge_index=[2, 4])


True

`Data` classes can describe themselves.

In [3]:
data.keys

['edge_index', 'x']

In [4]:
print("Describing our happy little Graph :)\n")
print(f"Number of nodes: {data.num_nodes:,}")
print(f"Number of edges: {data.num_edges:,}")
print(f"Number of node features: {data.num_node_features:,}")
print(f"Has isolated nodes: {data.has_isolated_nodes()}")
print(f"Has self loops: {data.has_self_loops()}")
print(f"Is directed: {data.is_directed()}")

Describing our happy little Graph :)

Number of nodes: 3
Number of edges: 4
Number of node features: 1
Has isolated nodes: False
Has self loops: False
Is directed: False


### Directed Graph `Data`

Below we make a directed version by listing each edge in one direction only. The adjacency matrix is then no longer symmetric across its diagonal.

In [5]:
directed_data = Data(x=x, edge_index=torch.tensor([[1,1],[0,2]]))
print(directed_data)
directed_data.edge_index

Data(x=[3, 1], edge_index=[2, 2])


tensor([[1, 1],
        [0, 2]])

In [6]:
directed_data.is_directed()

True
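To make the adjacency matrix view concrete, here is a minimal pure-Python sketch that rebuilds both graphs from their edge lists and checks for symmetry. It is not how PyG implements `is_directed()`; the helper names `adjacency_matrix` and `is_symmetric` are made up for illustration.

```python
def adjacency_matrix(edge_index, num_nodes):
    """Build a dense 0/1 adjacency matrix from a [sources, targets] edge list."""
    sources, targets = edge_index
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for s, t in zip(sources, targets):
        adj[s][t] = 1
    return adj


def is_symmetric(adj):
    """True when every edge (i, j) has a matching reverse edge (j, i)."""
    n = len(adj)
    return all(adj[i][j] == adj[j][i] for i in range(n) for j in range(n))


# Same edge lists as the two Data objects above, written as plain Python lists
undirected_edges = [[0, 1, 1, 2], [1, 0, 2, 1]]
directed_edges = [[1, 1], [0, 2]]

undirected_adj = adjacency_matrix(undirected_edges, num_nodes=3)
directed_adj = adjacency_matrix(directed_edges, num_nodes=3)

print(undirected_adj)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(directed_adj)    # [[0, 0, 0], [1, 0, 1], [0, 0, 0]]
print(is_symmetric(undirected_adj), is_symmetric(directed_adj))  # True False
```

A symmetric matrix corresponds to `is_directed()` returning `False`, and an asymmetric one to `True`.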

# Graph Neural Networks (GNNs) with DGL (Deep Graph Library)

[DGL or Deep Graph Library](https://dgl.ai) is one of the simplest ways to get started with graph machine learning using graph neural networks (GNNs).

First we will cover a few common operations with each major task type we covered in the lecture: node-level, edge-level, subgraph-level and graph-level.

## Node-Level Tasks: Classification

Node-level tasks usually involve property prediction - classifying nodes into categories or regressing one of their numeric properties. We'll cover both.

As in the network science section of this course, we will start with a Text Attributed Graph (TAG) called a Citation Graph. We are going to use the [CORA dataset](https://relational.fit.cvut.cz/dataset/CORA), [described by Papers with Code](https://paperswithcode.com/dataset/cora) as:

> Introduced by Andrew McCallum et al. in [Automating the Construction of Internet Portals with Machine Learning](https://doi.org/10.1023/A:1009953814988)
>
> The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

### CORA Node Features: Bag of Words

Note that the features for this network are a [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model: simple and _sparse_, unlike modern text representations, which are _dense_, distributed representations in the form of language models or [embeddings](https://cloud.google.com/blog/topics/developers-practitioners/meet-ais-multitool-vector-embeddings). Each node has a row in the feature matrix and each of the 1,433 unique words gets a column holding a 0 or 1 for the absence or presence of that word. Before [Word2Vec](https://arxiv.org/abs/1301.3781) popularized text embeddings in 2013, the features for NLP problems were mostly 0s, with a few non-zero values.

<center><img src="images/sparse_vs_dense_vectors.webp" width="800px" alt="Bag-of-Words (BoW) sparse vectors used in traditional NLP versus dense, embedded vector representations used in modern deep learning NLP" /></center>
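As a rough illustration of how such a 0/1 feature matrix comes about, here is a minimal sketch that builds binary Bag of Words vectors for three made-up "papers". The documents and the helper names are hypothetical, and the preprocessing behind the real CORA features may well differ (tokenization, stop words, stemming).

```python
def build_vocabulary(documents):
    """Map each unique lowercase word to a column index, in sorted order."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: column for column, word in enumerate(words)}


def binary_bag_of_words(document, vocabulary):
    """Return a 0/1 vector: 1 where the vocabulary word appears in the document."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical toy "papers"; CORA itself has 2,708 papers and 1,433 words
documents = [
    "graph neural networks",
    "neural networks learn",
    "graph theory",
]
vocabulary = build_vocabulary(documents)
features = [binary_bag_of_words(doc, vocabulary) for doc in documents]

print(list(vocabulary))  # ['graph', 'learn', 'networks', 'neural', 'theory']
for row in features:
    print(row)
# [1, 0, 1, 1, 0]
# [0, 1, 1, 1, 0]
# [1, 0, 0, 0, 1]
```

With only 5 words the rows are fairly full. With 1,433 columns and short abstracts, most entries in each row would be 0, which is what makes the representation sparse.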

The [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) prevented NLP applications from realizing their modern capabilities: the more words that were added, the more dimensions the feature data gained, and the more dimensions you add to a _sparse_ feature vector, the more the distances between any two vectors start to approximate the same value. The vectors stretch out over many dimensions and look the same.
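One way to see this effect is to measure how different the nearest and farthest pairs of points are as the number of dimensions grows. The sketch below is an illustration under simplifying assumptions: it uses uniformly random dense points rather than real sparse text features, and the `relative_contrast` measure is just one convenient choice.

```python
import math
import random


def relative_contrast(num_points, num_dims, rng):
    """(max - min) / min over all pairwise Euclidean distances of random points."""
    points = [[rng.random() for _ in range(num_dims)] for _ in range(num_points)]
    distances = [
        math.dist(points[i], points[j])
        for i in range(num_points)
        for j in range(i + 1, num_points)
    ]
    return (max(distances) - min(distances)) / min(distances)


rng = random.Random(0)  # fixed seed so the run is repeatable
low = relative_contrast(num_points=30, num_dims=2, rng=rng)
high = relative_contrast(num_points=30, num_dims=1000, rng=rng)

print(f"2 dimensions:    {low:.2f}")
print(f"1000 dimensions: {high:.2f}")
```

In 2 dimensions the farthest pair is many times farther apart than the nearest pair. In 1,000 dimensions the contrast drops to a small fraction, so "near" and "far" carry much less information.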

Embeddings like Word2Vec relate _sparse representations_ of words to the text around them by training a shallow neural network and keeping its middle layer, creating _dense representations_.

<center><img src="images/from_sparse_to_dense.webp" width="800px" alt="The Word2Vec's Skipgram architecture maps sparse to dense vectors via a shallow embedding technique" /></center>

These are very useful because the dimensions of the feature vector encode semantics, and because you can compare two dense vectors and get a sense of how similar the objects they represent are. This makes them well suited to information retrieval applications like search and clustering.

<center><img src="images/king_minus_man_plus_woman.webp" width="800px" alt="Given the dense embedding vector for the word 'king', if we subtract the vector for 'man' and add 'woman', we arrive at a vector very close to 'queen'." /></center>
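The arithmetic in the figure can be sketched with cosine similarity on tiny hand-made vectors. The three dimensions (royalty, male, female) and the values below are illustrative assumptions chosen so the analogy works exactly; learned Word2Vec embeddings have hundreds of dimensions and only approximate this behavior.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors with dimensions (royalty, male, female).
# These are illustrative assumptions, not learned Word2Vec embeddings.
vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
}

# king - man + woman, computed element by element
analogy = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Rank every word by similarity to the analogy vector
ranked = sorted(
    vectors,
    key=lambda word: cosine_similarity(analogy, vectors[word]),
    reverse=True,
)
print(ranked[0])  # queen
```

The same `cosine_similarity` comparison is what search and clustering applications typically run over dense vectors.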

We could use a language model or large language model (LLM) to embed the features or the original text and get better performance from our GCN. However, it is good to start simple and worry about feature engineering later... you can spend an endless amount of time over-optimizing a task nobody cares about. Make sure they want the prototype before you engineer incredible performance. A Bag of Words representation is a fine start.

### CORA Classifier: Graph Convolutional Network

We are going to use a neural network architecture that may be familiar to you: a convolutional neural network. The type we will employ is called a Graph Convolutional Network (GCN). Message passing occurs between nodes and the series of input messages to a node are summarized by the layers of a GCN after each round of message passing.

<center><img src="images/gcn-decagon-overview.png" width="1000px" alt="Graph Neural Networks for Multirelational Link Prediction" /><a href="https://snap.stanford.edu/decagon/">Graph Neural Networks for Multirelational Link Prediction, Zitnik et al., 2018</a></center>
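A single round of message passing can be sketched in a few lines of plain Python. This is a simplification for intuition: each node just averages its neighbors' scalar features, whereas a real GCN layer also applies a learned weight matrix and a degree-based normalization. The graph and feature values are made up.

```python
def message_passing_round(neighbors, features):
    """One round: each node averages the feature values sent by its neighbors."""
    updated = {}
    for node, incoming in neighbors.items():
        messages = [features[neighbor] for neighbor in incoming]
        updated[node] = sum(messages) / len(messages)
    return updated


# Path graph 0 - 1 - 2 with one scalar feature per node (values are arbitrary)
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = {0: 1.0, 1: 2.0, 2: 6.0}

after_one_round = message_passing_round(neighbors, features)
print(after_one_round)  # {0: 2.0, 1: 3.5, 2: 2.0}
```

After one round, node 1 has summarized both of its neighbors, while nodes 0 and 2 only know about node 1. A second round would let information from node 0 reach node 2.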


There is often a bit of tinkering required to make GNNs run, so even for this simple problem in DGL, we must specify our GNN architecture. It is simple enough. Let's see how it looks...

Note: Figures Sources: [Dense Vectors: Capturing Meaning with Code](https://towardsdatascience.com/dense-vectors-capturing-meaning-with-code-88fc18bd94b9) by [James Briggs](https://jamescalam.medium.com/), [Graph Neural Networks for Multirelational Link Prediction, Zitnik et al., 2018](https://snap.stanford.edu/decagon/)

### Building a GCN in DGL

Let's build, train and evaluate our first GNN: a graph convolutional network for classifying CORA articles into categories.

Note: Source for this section is the [Blitz Tutorial, Node Classification with DGL](https://docs.dgl.ai/tutorials/blitz/1_introduction.html#sphx-glr-tutorials-blitz-1-introduction-py).

In [7]:
import os

# DGL can also use Tensorflow or MXNet
os.environ["DGLBACKEND"] = "pytorch"
import dgl
import dgl.data
import torch
import torch.nn as nn
import torch.nn.functional as F

For now we will use a pre-loaded dataset. It contains the standard CORA bag-of-words (BoW) features. Later we will construct our own graphs and perform feature engineering on them to do more sophisticated work.

In [8]:
dataset = dgl.data.CoraGraphDataset()

print(f"Number of categories: {dataset.num_classes}")

  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done loading data from cached files.
Number of categories: 7


In [9]:
# There can be more than one graph, this dataset has just one
g = dataset[0]
g

Graph(num_nodes=2708, num_edges=10556,
      ndata_schemes={'feat': Scheme(shape=(1433,), dtype=torch.float32), 'label': Scheme(shape=(), dtype=torch.int64), 'test_mask': Scheme(shape=(), dtype=torch.bool), 'val_mask': Scheme(shape=(), dtype=torch.bool), 'train_mask': Scheme(shape=(), dtype=torch.bool)}
      edata_schemes={})

`train_mask`, `val_mask` and `test_mask` are boolean masks that select the rows of the `label` and `feat` tensors that make up the training, validation and test datasets respectively. Each entry in `ndata_schemes` is a [Scheme](https://github.com/dmlc/dgl/blob/master/python/dgl/frame.py#L125) which, with `DGLBACKEND=pytorch`, describes the shape and dtype of the underlying [torch.Tensor](https://pytorch.org/docs/stable/tensors.html).
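The idea behind the masks can be sketched with plain Python lists. The 5-node example and the `select_rows` helper are hypothetical; with tensors, the same selection is written `labels[train_mask]`, as in the training function later in this notebook.

```python
def select_rows(rows, mask):
    """Keep only the rows whose mask entry is True."""
    return [row for row, keep in zip(rows, mask) if keep]


# Hypothetical 5-node graph; CORA's masks work the same way over 2,708 nodes
labels = [3, 4, 4, 0, 2]
train_mask = [True, True, False, False, False]
val_mask = [False, False, True, False, False]
test_mask = [False, False, False, True, True]

print(select_rows(labels, train_mask))  # [3, 4]
print(select_rows(labels, test_mask))   # [0, 2]

# Summing a mask counts the nodes in each split
print(sum(train_mask), sum(val_mask), sum(test_mask))  # 2 1 2
```

Because all three masks index the same node list, every node stays in the graph for message passing even when its label is hidden from the loss.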

In [10]:
print("Node features")
print(g.ndata)

print("Edge features")
print(g.edata)

Node features
{'feat': tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]]), 'label': tensor([3, 4, 4,  ..., 3, 3, 3]), 'test_mask': tensor([False, False, False,  ...,  True,  True,  True]), 'val_mask': tensor([False, False, False,  ..., False, False, False]), 'train_mask': tensor([ True,  True,  True,  ..., False, False, False])}
Edge features
{}


### GCN Model Architecture - Diagrams, then Code

The model itself is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) that uses the [dgl.nn.conv.GraphConv](https://docs.dgl.ai/generated/dgl.nn.pytorch.conv.GraphConv.html) class. 

<center><img src="images/Schematic-diagram-of-a-two-layer-GCN-model-The-dark-green-denotes-target-nodes-that-need_W640.jpg" alt="Diagram of 2-layer GCN from Graph neural networks in node classification: survey and evaluation, Xiao et al., 2022" width="600px" /></center>

<br />

<center>Image credit: <a href="https://www.researchgate.net/publication/355873169_Graph_neural_networks_in_node_classification_survey_and_evaluation">Diagram of 2-layer GCN from Graph neural networks in node classification: survey and evaluation, Xiao et al., 2022</a></center>

<br />

Let's dig into this diagram of our GCN before coding it in DGL.

### Over Smoothing in GNNs: Too Many Layers Means Too Many Hops Sampled

Note that **each layer of the GCN represents a round of message passing where nodes aggregate information from their neighbors.** This is important to know because if you have too many layers in a GNN, you run into the [oversmoothing problem](https://towardsdatascience.com/over-smoothing-issue-in-graph-neural-network-bddc8fbc2472) where nodes start to look the same as all the other nodes.

<center><img src="images/GNN-oversmoothing-first-layer.webp" width="840px" alt="First layer of GNN message passing, aggregation and summarization results in features of different colors" /></center>
<center>The first layer of GNN message passing, aggregation and summarization results in features represented by different colors.</center>
<center><i>Image credit: <a href="https://towardsdatascience.com/over-smoothing-issue-in-graph-neural-network-bddc8fbc2472">Over-smoothing issue in graph neural network</a> by <a href="https://towardsdatascience.com/over-smoothing-issue-in-graph-neural-network-bddc8fbc2472">Anas Ait Aomar</a></i></center>

<br /><br />

<center><img src="images/GNN-oversmoothing-second-layer.webp" width="1000px" alt="Second layer of GNN message passing, aggregation and summarization results in features with more similar colors" /></center>
<center>The second layer of GNN message passing, aggregation and summarization results in features represented by more similar colors.</center>

<center><i>Image credit: <a href="https://towardsdatascience.com/over-smoothing-issue-in-graph-neural-network-bddc8fbc2472">Over-smoothing issue in graph neural network</a> by <a href="https://towardsdatascience.com/over-smoothing-issue-in-graph-neural-network-bddc8fbc2472">Anas Ait Aomar</a></i></center>
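The convergence shown in the figures can be reproduced numerically. The sketch below is a simplified stand-in for a GNN layer, assuming plain mean aggregation over each node and its neighbors with no learned weights, and it tracks how far apart the node features are after each round.

```python
def mean_aggregate(neighbors, features):
    """Each node replaces its feature with the mean over itself and its neighbors."""
    return {
        node: sum(features[n] for n in [node] + adjacent) / (len(adjacent) + 1)
        for node, adjacent in neighbors.items()
    }


def spread(features):
    """Gap between the largest and smallest node feature."""
    return max(features.values()) - min(features.values())


# Path graph 0 - 1 - 2 with the same features as the PyG example: -1, 0, 1
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = {0: -1.0, 1: 0.0, 2: 1.0}

spreads = [spread(features)]
for _ in range(4):
    features = mean_aggregate(neighbors, features)
    spreads.append(spread(features))

print(spreads)  # [2.0, 1.0, 0.5, 0.25, 0.125]
```

On this tiny graph the spread halves with every round, so after a handful of layers the three nodes are nearly indistinguishable. Learned weights and nonlinearities change the details in a real GNN, but the tendency is the same.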

### ReLU Activation Function

Note how the GraphConv layers in the GCN architecture diagram above are separated by a ReLU layer. Without this layer, the GCN could not learn effectively. ReLU is an activation function that enables nonlinearity in neural networks, which lets them model messy data in a way that is much more powerful than a linear model. ReLU is defined as `max(0, x)`, which means that it maps negative values to 0 and leaves positive values alone. Note that there are many variants of ReLU that attempt to improve its performance.

<center><img src="images/relu.png" width="600px" alt="ReLU is max(0, x), making its plot flat when x is less than zero, and evenly diagonal in a 1:1 ratio when x is greater than zero." /></center>
<center>The ReLU activation function: <code>max(0, x)</code></center>
<center><i>Image Credit: <a href="https://medium.com/@danqing/a-practical-guide-to-relu-b83ca804f1f7">A Practical Guide to ReLU</a> by <a href="https://medium.com/@danqing">Danqing Liu</a></i></center>

> we can stack as many linear classifiers as we want on top of each other, and without nonlinear functions between them, it will just be the same as one linear classifier.
>
> But if we put a nonlinear function between them, such as max, then this is no longer true. Now each linear layer is actually somewhat decoupled from the other ones and can do its own useful work. The max function operates as a simple if statement.
>
_Source: [Nonlinearity and Neural Networks](https://medium.com/unpackai/nonlinearity-and-neural-networks-2ffaaac0e6ff) by [Aravinda 加阳](https://medium.com/@aravinda-gn)_
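The claim in the quote can be checked with one-dimensional "layers". The weights below are arbitrary integers picked for illustration; the point is only that two stacked linear layers collapse into one, while putting `max(0, x)` between them does not.

```python
def linear(weight, bias):
    """Return a one-dimensional linear layer x -> weight * x + bias."""
    return lambda x: weight * x + bias


def relu(x):
    """ReLU activation: negative inputs become 0, others pass through."""
    return max(0, x)


# Arbitrary integer weights chosen for illustration
layer1 = linear(2, -4)
layer2 = linear(-3, 5)

# Stacking two linear layers equals ONE linear layer with merged parameters:
# -3 * (2x - 4) + 5 = -6x + 17
merged = linear(-6, 17)

inputs = [-2, 0, 1, 3, 5]
stacked_outputs = [layer2(layer1(x)) for x in inputs]
merged_outputs = [merged(x) for x in inputs]
relu_outputs = [layer2(relu(layer1(x))) for x in inputs]

print(stacked_outputs)  # [29, 17, 11, -1, -13]
print(merged_outputs)   # [29, 17, 11, -1, -13]
print(relu_outputs)     # [5, 5, 5, -1, -13]
```

The ReLU version is flat for small inputs and sloped for large ones, a shape that no single line `a * x + b` can produce.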

This video by [deeplizard on YouTube](https://www.youtube.com/@deeplizard) explains ReLU and its significance:

In [11]:
%%HTML
<center><iframe width="800" height="460" src="https://www.youtube.com/embed/6MmGNZsA5nI?si=sglt8BijkpykWdWP&amp;start=10"></iframe></center>

### Coding the Above GCN Diagram

The equivalent DGL code for the GCN diagram above appears below. The graph structure and CORA BoW features are the input, which feeds into one GCN layer, then a ReLU activation function, then another GCN layer whose outputs are mapped to the labels of our classes, in this case fields of study.

In [12]:
from dgl.nn import GraphConv


class GCN(nn.Module):
    """2-layer Graph Convolutional Network"""
    
    def __init__(self, in_feats, h_feats, num_classes):
        """Set up two GCN layers with the input, hidden and output dimensions."""
        super().__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        """Run a forward pass and return one row of class logits per node."""
        h = self.conv1(g, in_feat)  # first round of message passing
        h = F.relu(h)  # nonlinearity between the two layers
        h = self.conv2(g, h)  # second round, projecting to num_classes
        return h


# Create the model with given dimensions
model = GCN(g.ndata["feat"].shape[1], 16, dataset.num_classes)

In [13]:
model

GCN(
  (conv1): GraphConv(in=1433, out=16, normalization=both, activation=None)
  (conv2): GraphConv(in=16, out=7, normalization=both, activation=None)
)
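The `normalization=both` in the printout refers to symmetric degree normalization. As I read the DGL `GraphConv` documentation, each message from node j to node i is scaled by 1 / sqrt(deg_i * deg_j) before being summed. The sketch below shows only that aggregation step on a made-up 3-node graph, leaving out the learned weight matrix and bias, so treat it as an approximation of the idea rather than DGL's implementation.

```python
import math


def normalized_aggregate(neighbors, features):
    """Sum neighbor features, scaling each message by 1 / sqrt(deg_i * deg_j)."""
    degree = {node: len(adjacent) for node, adjacent in neighbors.items()}
    return {
        node: sum(
            features[neighbor] / math.sqrt(degree[node] * degree[neighbor])
            for neighbor in adjacent
        )
        for node, adjacent in neighbors.items()
    }


# Path graph 0 - 1 - 2, so the degrees are 1, 2 and 1
neighbors = {0: [1], 1: [0, 2], 2: [1]}
features = {0: 1.0, 1: 2.0, 2: 3.0}

aggregated = normalized_aggregate(neighbors, features)
print({node: round(value, 3) for node, value in aggregated.items()})
# {0: 1.414, 1: 2.828, 2: 1.414}
```

Scaling by both degrees keeps high-degree nodes from dominating: they neither flood their neighbors with large messages nor accumulate outsized sums themselves.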

### Training a GCN

Below we define a training function that will iteratively train our GCN using message passing.

In [30]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


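def metrics(y_true, y_pred):
    """Compute accuracy plus micro-averaged precision, recall and F1."""
    # With one label per node, micro-averaging pools all classes together,
    # so these three scores come out equal to accuracy.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="micro"),
        "recall": recall_score(y_true, y_pred, average="micro"),
        "f1": f1_score(y_true, y_pred, average="micro"),
    }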


def train(g, model):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    best_val_acc = 0
    best_test_acc = 0

    features = g.ndata["feat"]
    labels = g.ndata["label"]
    train_mask = g.ndata["train_mask"]
    val_mask = g.ndata["val_mask"]
    test_mask = g.ndata["test_mask"]
    for e in range(100):
        # Forward
        logits = model(g, features)

        # Compute prediction
        pred = logits.argmax(1)

        # Compute loss
        # Note that you should only compute the losses of the nodes in the training set.
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])

        # Compute accuracy on training/validation/test
        train_acc = (pred[train_mask] == labels[train_mask]).float().mean()
        val_acc = (pred[val_mask] == labels[val_mask]).float().mean()
        test_acc = (pred[test_mask] == labels[test_mask]).float().mean()

        train_scores = metrics(labels[train_mask], pred[train_mask])
        val_scores = metrics(labels[val_mask], pred[val_mask])
        test_scores = metrics(labels[test_mask], pred[test_mask])

        # Save the best validation accuracy and the corresponding test accuracy.
        if best_val_acc < val_acc:
            best_val_acc = val_acc
            best_test_acc = test_acc

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if e % 5 == 0:
            print(
                f"In epoch {e}, loss: {loss:.3f}, val acc: {val_acc:.3f} (best {best_val_acc:.3f}), test acc: {test_acc:.3f} (best {best_test_acc:.3f}),",
                f'val precision: {val_scores["precision"]:.3f}, val recall: {val_scores["recall"]:.3f}, val f1: {val_scores["f1"]:.3f}'
            )


model = GCN(g.ndata["feat"].shape[1], 16, dataset.num_classes)
train(g, model)

In epoch 0, loss: 1.946, val acc: 0.108 (best 0.108), test acc: 0.104 (best 0.104), val precision: 0.108, val recall: 0.108, val f1: 0.108
In epoch 5, loss: 1.894, val acc: 0.528 (best 0.552), test acc: 0.553 (best 0.557), val precision: 0.528, val recall: 0.528, val f1: 0.528
In epoch 10, loss: 1.817, val acc: 0.608 (best 0.608), test acc: 0.586 (best 0.586), val precision: 0.608, val recall: 0.608, val f1: 0.608
In epoch 15, loss: 1.716, val acc: 0.662 (best 0.662), test acc: 0.665 (best 0.665), val precision: 0.662, val recall: 0.662, val f1: 0.662
In epoch 20, loss: 1.591, val acc: 0.664 (best 0.672), test acc: 0.669 (best 0.666), val precision: 0.664, val recall: 0.664, val f1: 0.664
In epoch 25, loss: 1.446, val acc: 0.674 (best 0.674), test acc: 0.679 (best 0.679), val precision: 0.674, val recall: 0.674, val f1: 0.674
In epoch 30, loss: 1.284, val acc: 0.714 (best 0.714), test acc: 0.705 (best 0.705), val precision: 0.714, val recall: 0.714, val f1: 0.714
In epoch 35, loss: 1.1
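The epoch 0 loss of 1.946 in the log above has a simple interpretation: it is what cross-entropy gives when a model spreads its probability evenly over 7 classes, since ln 7 ≈ 1.946. The sketch below computes cross-entropy for a single node in plain Python; `F.cross_entropy` applies the same formula and, by default, averages it over all the nodes it is given.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [value - max(logits) for value in logits]  # for numerical stability
    exps = [math.exp(value) for value in shifted]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])


num_classes = 7
uninformed_logits = [0.0] * num_classes  # equal score for every class

print(f"{cross_entropy(uninformed_logits, label=3):.3f}")  # 1.946
print(f"{math.log(num_classes):.3f}")                      # 1.946

# A confident, correct prediction drives the loss toward 0
confident_logits = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
print(cross_entropy(confident_logits, label=3) < 0.1)  # True
```

So an untrained GCN starts at roughly the loss of uniform guessing, and the falling loss in later epochs reflects probability mass moving onto the correct classes.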

In [32]:
model

GCN(
  (conv1): GraphConv(in=1433, out=16, normalization=both, activation=None)
  (conv2): GraphConv(in=16, out=7, normalization=both, activation=None)
)