# Graph Neural Networks (GNN)

For this part, we will use Spektral to build a GNN.

## What is Spektral?
Spektral is a Python library for graph deep learning, based on the Keras API and TensorFlow 2. The main goal of this project is to provide a simple but flexible framework for creating graph neural networks (GNNs).

You can use Spektral for classifying the users of a social network, predicting molecular properties, generating new graphs with GANs, clustering nodes, predicting links, and any other task where data is described by graphs.

Spektral implements some of the most popular layers for graph deep learning, including:

- Graph Convolutional Networks (GCN)
- Chebyshev convolutions
- GraphSAGE
- ARMA convolutions
- Edge-Conditioned Convolutions (ECC)
- Graph attention networks (GAT)
- Approximated Personalized Propagation of Neural Predictions (APPNP)
- Graph Isomorphism Networks (GIN)
- Diffusional Convolutions


### Graphs
A graph is a mathematical object that represents relations between entities. We call the entities "nodes" and the relations "edges".

Both the nodes and the edges can have vector features.

In Spektral, graphs are represented with instances of spektral.data.Graph. A graph can have four main attributes:

- a: the adjacency matrix
- x: the node features
- e: the edge features
- y: the labels

A graph can have all of these attributes or none of them. Since Graphs are just plain Python objects, you can also add extra attributes if you want. For instance, see graph.n_nodes, graph.n_node_features, etc.

### Adjacency matrix (graph.a)
Each entry a[i, j] of the adjacency matrix is non-zero if there exists an edge going from node j to node i, and zero otherwise.

We can represent a as a dense np.array or as a Scipy sparse matrix of shape [n_nodes, n_nodes]. Using an np.array to represent the adjacency matrix can be expensive, since we need to store a lot of 0s in memory, so sparse matrices are usually preferable.

With sparse matrices, we only need to store the non-zero entries of a. In practice, we can implement a sparse matrix by only storing the indices and values of the non-zero entries in a list, and assuming that if a pair of indices is missing from the list then its corresponding value will be 0.
This is called the COOrdinate format and it is the format used by TensorFlow to represent sparse tensors.

For example, the adjacency matrix of a weighted ring graph with 4 nodes:
       
       [[0, 1, 0, 2],
        [3, 0, 4, 0],
        [0, 5, 0, 6],
        [7, 0, 8, 0]]
             
can be represented in COOrdinate format as follows:

        R, C, V
        0, 1, 1
        0, 3, 2
        1, 0, 3
        1, 2, 4
        2, 1, 5
        2, 3, 6
        3, 0, 7
        3, 2, 8
 
where R indicates the "row" indices, C the columns, and V the non-zero values a[i, j]. For example, in the second line, we see that there is an edge that goes from node 3 to node 0 with weight 2.

We also see that, in this case, all edges have a corresponding edge that goes in the opposite direction. For the sake of this example, all edges have been assigned a different weight. In practice, however, edge i, j will often have the same weight as edge j, i and the adjacency matrix will be symmetric.
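
As a minimal sketch of the COOrdinate format described above, the same ring graph can be converted with SciPy's `coo_matrix`. The variable names (`a_dense`, `a_sparse`, `triplets`) are chosen here for illustration only.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Dense adjacency matrix of the weighted 4-node ring graph from the example.
a_dense = np.array([
    [0, 1, 0, 2],
    [3, 0, 4, 0],
    [0, 5, 0, 6],
    [7, 0, 8, 0],
])

# Converting to COO keeps only the non-zero entries.
a_sparse = coo_matrix(a_dense)

# Each (row, col, value) triplet corresponds to one line of the R, C, V table.
triplets = list(zip(a_sparse.row.tolist(),
                    a_sparse.col.tolist(),
                    a_sparse.data.tolist()))
for r, c, v in triplets:
    print(r, c, v)

# The dense matrix can be recovered from the sparse one.
assert (a_sparse.toarray() == a_dense).all()
```

Only 8 triplets are stored instead of 16 entries; the saving grows quickly for large graphs in which most pairs of nodes are not connected.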

Many convolutional and pooling layers in Spektral use this sparse representation of matrices to do their computation, and sometimes you will see in the documentation a comment saying that "This layer expects a sparse adjacency matrix."

### Node features (graph.x)

When working with graph neural networks, we usually associate a vector of features with each node of a graph. This is no different from how every pixel in an image has an [R, G, B, A] vector associated with it.

Since we have n_nodes nodes and each node has a feature vector of size n_node_features, we can stack all features in a matrix x of shape [n_nodes, n_node_features].

In Spektral, x is always represented with a dense np.array (since in this case we don't run the risk of storing many useless zeros -- at least not often).


### Edge features (graph.e)
Similar to node features, we can also have features associated with edges. These are usually different from the edge weights that we saw for the adjacency matrix, and often represent the kind of relation between two nodes (e.g., acquaintances, friends, or partners).

When representing edge features, we run into the same problems that we have for the adjacency matrix.

If we store them in a dense np.array, then the array will have shape [n_nodes, n_nodes, n_edge_features] and most of its entries will be zeros. Unfortunately, order-3 tensors cannot be represented as Scipy sparse matrices, so we need to be smart about it.

Similar to how we stored the adjacency matrix as a list of entries r, c, v, here we can use the COOrdinate format to represent our edge features. Assume that, in the example above, each edge has n_edge_features=3 features. We could do something like:

        R, C, V
        0, 1, [ef_1, ef_2, ef_3]
        0, 3, [ef_1, ef_2, ef_3]
        1, 0, [ef_1, ef_2, ef_3]
        1, 2, [ef_1, ef_2, ef_3]
        2, 1, [ef_1, ef_2, ef_3]
        2, 3, [ef_1, ef_2, ef_3]
        3, 0, [ef_1, ef_2, ef_3]
        3, 2, [ef_1, ef_2, ef_3]

Since we already have the information of R and C in the adjacency matrix, we only need to store the V column as a matrix e of shape [n_edges, n_edge_features]. In this case, n_edges indicates the number of non-zero entries in the adjacency matrix.

Note that, since we have separated the edge features from the edge indices of the adjacency matrix, the order in which we store the edge features is very important. We must not break the correspondence between the edges in a and the edges in e.

In Spektral, we always assume that edges are sorted in the row-major ordering (we first sort by row, then by column, like in the example above). This is not important when building the adjacency matrix, but it is important when building e.

You can use spektral.utils.sparse.reorder to sort a matrix of edge features in the correct row-major order given by an edge index (i.e., the matrix obtained by stacking the R and C columns).
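
To make the row-major requirement concrete, here is a plain NumPy sketch of the idea. It is not Spektral's implementation of spektral.utils.sparse.reorder; the edge index and feature values are made up for illustration.

```python
import numpy as np

# Edge index (R and C columns) listed in an arbitrary, unsorted order.
edge_index = np.array([
    [3, 0],
    [0, 3],
    [2, 1],
    [0, 1],
])
# One feature vector per edge, in the same (unsorted) order as edge_index.
e = np.array([
    [30.0, 30.5],
    [3.0, 3.5],
    [21.0, 21.5],
    [1.0, 1.5],
])

# np.lexsort uses the LAST key as the primary one, so passing
# (columns, rows) sorts by row first and breaks ties by column.
order = np.lexsort((edge_index[:, 1], edge_index[:, 0]))

# Apply the same permutation to both arrays to keep them aligned.
edge_index_sorted = edge_index[order]
e_sorted = e[order]
print(edge_index_sorted)
print(e_sorted)
```

The key point is that the edge index and the edge features are permuted together, so row k of e_sorted still describes edge k of edge_index_sorted.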


### Labels (graph.y)
Finally, in many machine learning tasks we want to predict a label given an input. When working with GNNs, labels can be of two types:

- Graph labels represent some global properties of an entire graph;
- Node labels represent some properties of each individual node in a graph.

Spektral supports both kinds.

Labels are dense np.arrays or scalars, stored in the y attribute of a Graph object.
Graph-level labels can be either scalars or 1-dimensional arrays of shape [n_labels, ].
Node-level labels can be 1-dimensional arrays of shape [n_nodes, ] (representing a scalar label for each node), or 2-dimensional arrays of shape [n_nodes, n_labels].

The difference between graph-level and node-level labels is relevant only when using a DisjointLoader.



### Datasets
The spektral.data.Dataset container provides some useful functionality to manipulate collections of graphs.

Let's load a popular benchmark dataset for graph classification:

In [1]:
%pip install spektral

Note: you may need to restart the kernel to use updated packages.


In [3]:
from spektral.datasets import TUDataset

dataset = TUDataset('PROTEINS')
dataset

Successfully loaded PROTEINS.


TUDataset(n_graphs=1113)

In [4]:
# We can now retrieve individual graphs:
dataset[0]

Graph(n_nodes=42, n_node_features=4, n_edge_features=None, n_labels=2)

In [6]:
# Or shuffle the data:
import numpy as np 

np.random.shuffle(dataset)

In [7]:
# The dataset is its own container type, not a NumPy array,
# even though NumPy can shuffle it in place:
print(type(dataset))

<class 'spektral.datasets.tudataset.TUDataset'>


In [8]:
# Or slice the dataset into sub-datasets:
dataset[:100]

TUDataset(n_graphs=100)

Datasets also provide methods for applying transforms to each datum:

- apply(transform) - modifies the dataset in-place, by applying the transform to each graph;
- map(transform) - returns a list obtained by applying the transform to each graph;
- filter(function) - removes from the dataset any graph for which function(graph) is False. This is also an in-place operation.

For example, let's modify our dataset so that we only have graphs with fewer than 500 nodes:

In [9]:
dataset.filter(lambda g: g.n_nodes < 500)
dataset

TUDataset(n_graphs=1111)

Now let's apply some transforms to our graphs. For example, we can modify each graph so that the node features also contain the one-hot-encoded degree of the nodes.

First, we compute the maximum degree of the dataset, so that we know the size of the one-hot vectors: 

In [10]:
max_degree = dataset.map(lambda g: g.a.sum(-1).max(), reduce=max)

Take a moment to read the lambda function and work out what it does. Also notice that we passed a reduction function to the method, using the reduce keyword. This is run on the output list computed by the map.
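
As a rough illustration of the map-then-reduce step, the same computation can be written with plain NumPy on two small, made-up graphs. The helper name graph_max_degree is hypothetical and only mirrors the lambda used with dataset.map.

```python
import numpy as np

# Binary adjacency matrices of two small hypothetical graphs.
a_path = np.array([   # path 0 - 1 - 2
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])
a_star = np.array([   # star with center 0 and three leaves
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])

def graph_max_degree(a):
    # Summing over the last axis adds up each row: one degree per node.
    degrees = a.sum(-1)
    return degrees.max()

# "map" step: one value per graph; "reduce" step: collapse the list with max.
per_graph = [graph_max_degree(a) for a in (a_path, a_star)]
max_degree = max(per_graph)
print(per_graph, max_degree)
```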

Now we are ready to augment our node features with the one-hot-encoded degree. Spektral has a lot of pre-implemented transforms that we can use:

In [17]:
from spektral.transforms import Degree
dataset.apply(Degree(int(max_degree)))

We can see that it worked because now we have an extra max_degree + 1 node features:

In [18]:
dataset[0]

Graph(n_nodes=23, n_node_features=56, n_edge_features=None, n_labels=2)
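
The following NumPy sketch shows one way a one-hot degree encoding can be appended to the node features. It is an illustration of the idea under simple assumptions (a dense, binary adjacency matrix and a known max_degree), not the code of Spektral's Degree transform.

```python
import numpy as np

# Hypothetical graph: star with center 0 and three leaves.
a = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])
x = np.ones((4, 2))   # two original features per node
max_degree = 3        # assumed to be known for the whole dataset

# Degree of each node, used as an index into an identity matrix.
degrees = a.sum(-1).astype(int)
one_hot = np.eye(max_degree + 1)[degrees]   # shape [n_nodes, max_degree + 1]

# Append the one-hot degree to the existing node features.
x_augmented = np.concatenate([x, one_hot], axis=-1)
print(x_augmented.shape)  # (4, 6)
```

Degrees range from 0 to max_degree, which is why the encoding needs max_degree + 1 columns.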

Since we will be using a GCNConv layer in our GNN, we also want to follow the original paper that introduced this layer and do some extra pre-processing of the adjacency matrix.

Since this is a fairly common operation, Spektral has a transform to do it:

In [19]:
from spektral.transforms import GCNFilter
dataset.apply(GCNFilter())
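
For reference, the pre-processing proposed in the GCN paper adds self-loops to the adjacency matrix and then normalizes it symmetrically by the node degrees. The sketch below implements that formula with dense NumPy arrays; the function name gcn_normalize is made up here, and the code is only an approximation of what a transform like GCNFilter is expected to compute.

```python
import numpy as np

def gcn_normalize(a):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    a_tilde = a + np.eye(a.shape[0])   # add self-loops
    degrees = a_tilde.sum(-1)          # degrees including the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    # Scale rows and columns by D^{-1/2}.
    return d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]

# Hypothetical graph: path 0 - 1 - 2.
a = np.array([
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])
a_hat = gcn_normalize(a)
print(np.round(a_hat, 3))
```

Because every node gets a self-loop, no degree is zero and the division is always safe for non-negative edge weights.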

Many layers will require you to do some form of preprocessing. If you don't want to go back to the literature every time, every convolutional layer in Spektral has a preprocess(a) method that you can use to transform the adjacency matrix as needed.

Have a look at the handy LayerPreprocess transform.

## Creating a GNN

Creating GNNs is where Spektral really shines. Since Spektral is designed as an extension of Keras, you can plug any Spektral layer into a Keras Model without modifications.
We just need to use the functional API because GNN layers usually need two or more inputs (so no Sequential models for now).

For our first GNN, we will create a simple network that first does a bit of graph convolution, then sums all the nodes together (known as "global pooling"), and finally classifies the result with a dense softmax layer. We will also use dropout for regularization.

Let's start by importing the necessary layers:

In [20]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout
from spektral.layers import GCNConv, GlobalSumPool

Now we can use model subclassing to define our model:



In [21]:
class MyFirstGNN(Model):

    def __init__(self, n_hidden, n_labels):
        super().__init__()
        self.graph_conv = GCNConv(n_hidden)
        self.pool = GlobalSumPool()
        self.dropout = Dropout(0.5)
        self.dense = Dense(n_labels, 'softmax')

    def call(self, inputs):
        out = self.graph_conv(inputs)
        out = self.dropout(out)
        out = self.pool(out)
        out = self.dense(out)

        return out

And that's it!

Note how we mixed layers from Spektral and Keras interchangeably: it's all just computation with tensors underneath.
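
To show what "computation with tensors" means here, the forward pass of a model like this one can be approximated in plain NumPy. This is a simplified sketch: it assumes the graph convolution is a linear map applied to neighbor-averaged features, skips dropout and biases, and uses random weights and an identity matrix as a stand-in for the pre-processed adjacency matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_features, n_hidden, n_labels = 5, 3, 4, 2

x = rng.normal(size=(n_nodes, n_features))   # node features
a_hat = np.eye(n_nodes)                      # stand-in for a pre-processed adjacency matrix

# Randomly initialized weights; in the real model these are learned.
w_conv = rng.normal(size=(n_features, n_hidden))
w_dense = rng.normal(size=(n_hidden, n_labels))

# Graph convolution: mix neighbor features, then apply a linear map.
h = a_hat @ x @ w_conv                       # [n_nodes, n_hidden]

# Global sum pooling: collapse the node dimension.
pooled = h.sum(axis=0)                       # [n_hidden]

# Dense softmax classifier.
logits = pooled @ w_dense
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)
```

Note that the pooled vector has the same size regardless of how many nodes the graph has, which is what lets a single dense layer classify graphs of different sizes.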

This also means that if you want to break free from Graph and Dataset and every other feature of Spektral, you can.

Note: If you don't want to subclass Model to implement your GNN, you can also use the classical declarative style. You just need to pay attention to the Input and leave "node" dimensions unspecified (so None instead of n_nodes).

### Training the GNN
Now we're ready to train the GNN. First, we instantiate and compile our model:

In [22]:
model = MyFirstGNN(32, dataset.n_labels)
model.compile('adam', 'categorical_crossentropy')

And we're almost there!

However, here's where graphs get in our way. Unlike regular data, like images or sequences, graphs cannot be stretched, cut, or reshaped so that we can fit them into tensors of pre-defined shapes. If a graph has 10 nodes and another one has 4, we have to keep them that way.

This means that iterating over a dataset in mini-batches is not trivial and we cannot simply use the model.fit() method of Keras as-is.

We have to use a data Loader.

### Loaders

Loaders iterate over a graph dataset to create mini-batches. They hide a lot of the complexity behind the process so that you don't need to think about it. You only need to read the page on data modes in the Spektral documentation, so that you know which loader to use.

Each loader has a load() method that returns a data generator that Keras can process.

Since we're doing graph-level classification, we can use a BatchLoader. It's a bit slow and memory intensive (a DisjointLoader would have been better), but it lets us simplify the definition of MyFirstGNN. Again, go read about data modes after this tutorial.
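
The reason batch mode costs extra memory is that graphs of different sizes are zero-padded to a common size. The sketch below illustrates that padding idea with NumPy; pad_batch is a hypothetical helper written for this example, not part of Spektral.

```python
import numpy as np

def pad_batch(graphs):
    """Zero-pad a list of (x, a) pairs to the size of the largest graph."""
    n_max = max(x.shape[0] for x, _ in graphs)
    n_features = graphs[0][0].shape[1]
    x_batch = np.zeros((len(graphs), n_max, n_features))
    a_batch = np.zeros((len(graphs), n_max, n_max))
    for i, (x, a) in enumerate(graphs):
        n = x.shape[0]
        x_batch[i, :n] = x        # real nodes first, padding after
        a_batch[i, :n, :n] = a    # padded nodes have no edges
    return x_batch, a_batch

# Two hypothetical graphs with 10 and 4 nodes and 3 features per node.
g1 = (np.ones((10, 3)), np.eye(10))
g2 = (np.ones((4, 3)), np.eye(4))

x_batch, a_batch = pad_batch([g1, g2])
print(x_batch.shape, a_batch.shape)  # (2, 10, 3) (2, 10, 10)
```

The smaller graph occupies a 10 x 10 adjacency slot even though it only has 4 nodes, which is where the wasted memory comes from.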

Let's create a data loader:

In [24]:
from spektral.data import BatchLoader

loader = BatchLoader(dataset, batch_size=32)

And we can finally train our GNN!

Since loaders are essentially generators, we need to provide the steps_per_epoch keyword to model.fit() and we don't need to specify a batch size:

In [25]:
model.fit(loader.load(), steps_per_epoch=loader.steps_per_epoch, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x1a9288b4d60>

### Node-level learning
Besides learning to predict labels for the whole graph, like in this tutorial, GNNs are very effective at learning to predict labels for each node. This is called "node-level learning" and we usually do it for datasets with one big graph (think a social network).

# Best Practices

## Going beyond the Sequential model: the Keras functional API

The Keras functional API is a way to create models that are more flexible than the tf.keras.Sequential API. The functional API can handle models with non-linear topology, shared layers, and even multiple inputs or outputs.

The main idea is that a deep learning model is usually a directed acyclic graph (DAG) of layers. So the functional API is a way to build graphs of layers.
Consider the following model:

```
(input: 784-dimensional vectors)
       ↧
[Dense (64 units, relu activation)]
       ↧
[Dense (64 units, relu activation)]
       ↧
[Dense (10 units, softmax activation)]
       ↧
(output: a probability distribution over 10 classes)
```

This is a basic graph with three layers.
To see how to build this type of model step by step, follow the Keras functional API guide: https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/functional_api.ipynb#scrollTo=6Q1lukiNDLoy
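
As a minimal sketch, the three-layer graph above could be written with the functional API as follows. The model name "mnist_model" and the variable names are arbitrary choices for this example.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Input node: 784-dimensional vectors (the batch size is left unspecified).
inputs = keras.Input(shape=(784,))

# Each layer is called on the output of the previous one.
x = layers.Dense(64, activation="relu")(inputs)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)

# The model is defined by its inputs and outputs.
model = keras.Model(inputs=inputs, outputs=outputs, name="mnist_model")
model.summary()
```

Because layers are called like functions on tensors, the same pattern extends to models with several inputs, which is what GNN layers need (node features plus an adjacency matrix).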


## Inspecting and monitoring deep-learning models using Keras callbacks and TensorBoard


For details, see the Keras guide on writing your own callbacks: https://keras.io/guides/writing_your_own_callbacks/


## Hyperparameter optimization

The process of optimizing hyperparameters typically looks like this:

1. Choose a set of hyperparameters (automatically).
2. Build the corresponding model.
3. Fit it to your training data, and measure the final performance on the validation data.
4. Choose the next set of hyperparameters to try (automatically).
5. Repeat.
6. Eventually, measure performance on your test data.
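
The loop above can be sketched in a few lines of plain Python using random search. The function validation_score is a hypothetical stand-in for "build the model, fit it, and evaluate on validation data"; in a real project that step would train a network.

```python
import random

def validation_score(units, learning_rate):
    # Hypothetical stand-in for building, fitting, and validating a model.
    # By construction it peaks at units=48, learning_rate=0.01.
    return 1.0 - abs(units - 48) / 100 - abs(learning_rate - 0.01) * 10

random.seed(0)
best_score, best_hps = float("-inf"), None

for trial in range(20):
    # Steps 1 and 4: choose the next set of hyperparameters (here: at random).
    hps = {
        "units": random.choice([16, 32, 48, 64]),
        "learning_rate": random.choice([0.001, 0.01, 0.1]),
    }
    # Steps 2 and 3: build, fit, and measure validation performance.
    score = validation_score(**hps)
    if score > best_score:
        best_score, best_hps = score, hps

# Step 6 would evaluate the model built from best_hps on the test data.
print(best_hps, round(best_score, 3))
```

Tuners such as Bayesian optimization replace the random choice in the loop with a strategy that uses the scores of earlier trials to pick the next candidate.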

In [32]:
%pip install keras-tuner -q

Note: you may need to restart the kernel to use updated packages.


In [33]:
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    units = hp.Int(name="units", min_value=16, max_value=64, step=16)
    model = keras.Sequential([
        layers.Dense(units, activation="relu"),
        layers.Dense(10, activation="softmax")
    ])
    optimizer = hp.Choice(name="optimizer", values=["rmsprop", "adam"])
    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    return model

In [34]:
import kerastuner as kt

class SimpleMLP(kt.HyperModel):
    def __init__(self, num_classes):
        self.num_classes = num_classes

    def build(self, hp):
        units = hp.Int(name="units", min_value=16, max_value=64, step=16)
        model = keras.Sequential([
            layers.Dense(units, activation="relu"),
            layers.Dense(self.num_classes, activation="softmax")
        ])
        optimizer = hp.Choice(name="optimizer", values=["rmsprop", "adam"])
        model.compile(
            optimizer=optimizer,
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
        return model

hypermodel = SimpleMLP(num_classes=10)



In [39]:

tuner = kt.BayesianOptimization(
    build_model,
    objective="val_accuracy",
    max_trials=100,
    executions_per_trial=2,
    directory="mnist_kt_test",
    overwrite=True,
)

In [40]:
tuner.search_space_summary()

Search space summary
Default search space size: 2
units (Int)
{'default': None, 'conditions': [], 'min_value': 16, 'max_value': 64, 'step': 16, 'sampling': None}
optimizer (Choice)
{'default': 'rmsprop', 'conditions': [], 'values': ['rmsprop', 'adam'], 'ordered': False}


In [41]:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape((-1, 28 * 28)).astype("float32") / 255
x_test = x_test.reshape((-1, 28 * 28)).astype("float32") / 255
x_train_full = x_train[:]
y_train_full = y_train[:]
num_val_samples = 10000
x_train, x_val = x_train[:-num_val_samples], x_train[-num_val_samples:]
y_train, y_val = y_train[:-num_val_samples], y_train[-num_val_samples:]
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5),
]
tuner.search(
    x_train, y_train,
    batch_size=128,
    epochs=5,
    validation_data=(x_val, y_val),
    callbacks=callbacks,
    verbose=2,
)

Trial 5 Complete [00h 00m 05s]
val_accuracy: 0.9662500023841858

Best val_accuracy So Far: 0.9662500023841858
Total elapsed time: 00h 00m 25s
INFO:tensorflow:Oracle triggered exit


Querying the best hyperparameter configurations



In [42]:
top_n = 4
best_hps = tuner.get_best_hyperparameters(top_n)

In [47]:

def get_best_epoch(hp):
    model = build_model(hp)
    callbacks=[
        keras.callbacks.EarlyStopping(
            monitor="val_loss", mode="min", patience=10)
    ]
    history = model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        epochs=5,
        batch_size=128,
        callbacks=callbacks)
    val_loss_per_epoch = history.history["val_loss"]
    best_epoch = val_loss_per_epoch.index(min(val_loss_per_epoch)) + 1
    print(f"Best epoch: {best_epoch}")
    return best_epoch

In [49]:
def get_best_trained_model(hp):
    best_epoch = get_best_epoch(hp)
    # Build a fresh, untrained model from the hyperparameters.
    model = build_model(hp)
    # Retrain on the full training data, for slightly more epochs
    # since there is more data to fit.
    model.fit(
        x_train_full, y_train_full,
        batch_size=128, epochs=int(best_epoch * 1.2))
    return model

best_models = []
for hp in best_hps:
    model = get_best_trained_model(hp)
    model.evaluate(x_test, y_test)
    best_models.append(model)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Best epoch: 5
Epoch 1/6


In [50]:
best_models = tuner.get_best_models(top_n)

## Model ensembling
"Another powerful technique for obtaining the best possible results on a task is model
ensembling. Ensembling consists of pooling together the predictions of a set of different
models, to produce better predictions. If you look at machine-learning competitions,
in particular on Kaggle, you’ll see that the winners use very large ensembles of
models that inevitably beat any single model, no matter how good."


    preds_a = model_a.predict(x_val)
    preds_b = model_b.predict(x_val)
    preds_c = model_c.predict(x_val)
    preds_d = model_d.predict(x_val)
    final_preds = 0.25 * (preds_a + preds_b + preds_c + preds_d)
    
 
 
A smarter way to ensemble classifiers is to do a weighted average, where the
weights are learned on the validation data—typically, the better classifiers are given a
higher weight, and the worse classifiers are given a lower weight. To search for a good
set of ensembling weights, you can use random search or a simple optimization algorithm
such as Nelder-Mead:

    preds_a = model_a.predict(x_val)
    preds_b = model_b.predict(x_val)
    preds_c = model_c.predict(x_val)
    preds_d = model_d.predict(x_val)
    final_preds = 0.5 * preds_a + 0.25 * preds_b + 0.1 * preds_c + 0.15 * preds_d
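
As a hedged sketch of how such weights might be searched, the example below runs Nelder-Mead (via scipy.optimize.minimize) on synthetic validation predictions. Everything about the data is made up: fake_predict stands in for model.predict(x_val), and the softmax parameterization of the weights is one possible design choice, not the only one.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_samples, n_classes, n_models = 200, 3, 3

# Synthetic validation labels and predictions (stand-ins for real models).
y_val = rng.integers(0, n_classes, size=n_samples)
one_hot = np.eye(n_classes)[y_val]
noise_levels = [0.5, 1.0, 2.0]   # lower noise = better hypothetical model

def fake_predict(noise):
    logits = 2.0 * one_hot + noise * rng.normal(size=one_hot.shape)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

# Shape [n_models, n_samples, n_classes].
preds = np.stack([fake_predict(s) for s in noise_levels])

def to_weights(params):
    # Softmax keeps the weights positive and summing to 1.
    exp = np.exp(params - params.max())
    return exp / exp.sum()

def val_loss(params):
    ensemble = np.tensordot(to_weights(params), preds, axes=1)  # weighted average
    picked = ensemble[np.arange(n_samples), y_val]
    return -np.mean(np.log(picked + 1e-12))                     # cross-entropy

start = np.zeros(n_models)   # corresponds to equal weights
result = minimize(val_loss, start, method="Nelder-Mead")
weights = to_weights(result.x)
print("weights:", np.round(weights, 3))
print("equal-weight loss:", val_loss(start), "optimized loss:", result.fun)
```

Since the weights are fitted on the validation data, the final ensemble should still be checked on separate test data.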
    
Winning machine-learning competitions or otherwise obtaining the best possible
results on a task can only be done with large ensembles of models. Ensembling
via a well-optimized weighted average is usually good enough. Remember:
diversity is strength.

## Multi-GPU and distributed training (optional)
There are generally two ways to distribute computation across multiple devices:

**Data parallelism**, where a single model gets replicated on multiple devices or
multiple machines. Each of them processes different batches of data, then they merge
their results. There exist many variants of this setup, that differ in how the different
model replicas merge results, in whether they stay in sync at every batch or whether they
are more loosely coupled, etc.

**Model parallelism**, where different parts of a single model run on different devices,
processing a single batch of data together. This works best with models that have a
naturally-parallel architecture, such as models that feature multiple branches.

This guide focuses on data parallelism, in particular **synchronous data parallelism**,
where the different replicas of the model stay in sync after each batch they process.
Synchronicity keeps the model convergence behavior identical to what you would see for
single-device training.
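
The claim about identical convergence behavior can be checked on a toy problem. The NumPy sketch below is not tf.distribute code; it simulates two replicas of a linear model by hand, with each replica computing the gradient on half of the batch, and compares the averaged gradient with the single-device one.

```python
import numpy as np

rng = np.random.default_rng(0)

# A global batch for a linear model y = x @ w with mean-squared-error loss.
x = rng.normal(size=(8, 3))
y = rng.normal(size=(8,))
w = np.zeros(3)

def mse_gradient(x_part, y_part, w):
    # Gradient of mean((x @ w - y) ** 2) with respect to w.
    residual = x_part @ w - y_part
    return 2.0 * x_part.T @ residual / len(y_part)

# Single-device training: one gradient over the whole batch.
full_gradient = mse_gradient(x, y, w)

# Data parallelism: two replicas share the same weights, each one sees half
# of the batch, and the gradients are averaged before the update.
shards = zip(np.split(x, 2), np.split(y, 2))
replica_gradients = [mse_gradient(xs, ys, w) for xs, ys in shards]
synced_gradient = np.mean(replica_gradients, axis=0)

print(np.allclose(full_gradient, synced_gradient))
```

With equally sized shards, averaging the per-replica gradients gives the same update as the full batch, so every replica ends each step with identical weights.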

Specifically, this guide teaches you how to use the `tf.distribute` API to train Keras
models on multiple GPUs, with minimal changes to your code, in the following two setups:

- On multiple GPUs (typically 2 to 8) installed on a single machine (single host,
multi-device training). This is the most common setup for researchers and small-scale
industry workflows.
- On a cluster of many machines, each hosting one or multiple GPUs (multi-worker
distributed training). This is a good setup for large-scale industry workflows, e.g.
training high-resolution image classification models on tens of millions of images using
20-100 GPUs.


For a full walkthrough, see the Keras distributed training guide: https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/distributed_training.ipynb

## Training Keras models with TensorFlow Cloud (optional)
TensorFlow Cloud is a library that makes it easier to do training and hyperparameter tuning of Keras models on Google Cloud.

Using TensorFlow Cloud's run API, you can send your model code directly to your Google Cloud account, and use Google Cloud compute resources without needing to log in and interact with the Cloud UI (once you have set up your project in the console).

This means that you can use your Google Cloud compute resources directly from inside a Python notebook: a notebook just like this one! You can also send models to Google Cloud from a plain .py Python script.

For details, see the Keras guide on training models with TensorFlow Cloud: https://colab.research.google.com/github/keras-team/keras-io/blob/master/guides/ipynb/training_keras_models_on_cloud.ipynb

References:
- TensorFlow tutorials: https://www.tensorflow.org/tutorials
- Keras developer guides: https://keras.io/guides/
- François Chollet, Deep Learning with Python