# Supervised graph classification with GCN


<table><tr><td>Run the latest release of this notebook:</td><td><a href="https://mybinder.org/v2/gh/stellargraph/stellargraph/master?urlpath=lab/tree/demos/graph-classification/gcn-supervised-graph-classification.ipynb" alt="Open In Binder" target="_parent"><img src="https://mybinder.org/badge_logo.svg"/></a></td><td><a href="https://colab.research.google.com/github/stellargraph/stellargraph/blob/master/demos/graph-classification/gcn-supervised-graph-classification.ipynb" alt="Open In Colab" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg"/></a></td></tr></table>

This notebook demonstrates how to train a graph classification model in a supervised setting using graph convolutional layers followed by a mean pooling layer as well as any number of fully connected layers.

The graph convolutional classification model architecture is based on the one proposed in [1] (see Figure 5 in [1]) using the graph convolutional layers from [2]. This demo differs from [1] in the dataset, MUTAG, used here; MUTAG is a collection of static graphs representing chemical compounds with each graph associated with a binary label. Furthermore, none of the graph convolutional layers in our model utilise an attention head as proposed in [1].

Evaluation data for graph kernel-based approaches shown in the very last cell in this notebook are taken from [3].

**References**

[1] Fake News Detection on Social Media using Geometric Deep Learning, F. Monti, F. Frasca, D. Eynard, D. Mannion, and M. M. Bronstein, ICLR 2019. ([link](https://arxiv.org/abs/1902.06673))

[2] Semi-supervised Classification with Graph Convolutional Networks, T. N. Kipf and M. Welling, ICLR 2017. ([link](https://arxiv.org/abs/1609.02907))

[3] An End-to-End Deep Learning Architecture for Graph Classification, M. Zhang, Z. Cui, M. Neumann, Y. Chen, AAAI-18. ([link](https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewPaper/17146))

In [1]:
# install StellarGraph if running on Google Colab
import sys
if 'google.colab' in sys.modules:
  %pip install -q stellargraph[demos]==1.2.1

In [2]:
# verify that we're using the correct version of StellarGraph for this notebook
import stellargraph as sg

try:
    sg.utils.validate_notebook_version("1.2.1")
except AttributeError:
    raise ValueError(
        f"This notebook requires StellarGraph version 1.2.1, but a different version {sg.__version__} is installed.  Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
    ) from None

In [3]:
import pandas as pd
import numpy as np

import stellargraph as sg
from stellargraph.mapper import PaddedGraphGenerator
from stellargraph.layer import GCNSupervisedGraphClassification
from stellargraph import StellarGraph

from stellargraph import datasets

from sklearn import model_selection
from IPython.display import display, HTML

from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.callbacks import EarlyStopping
import tensorflow as tf
import matplotlib.pyplot as plt
import json
import pprint
from stellargraph.datasets.dataset_loader import DatasetLoader

## Import the data

(See [the "Loading from Pandas" demo](../basics/loading-pandas.ipynb) for details on how data can be loaded.)

In [4]:
def _load_graph_kernel_dataset(dataset):
    expected_files = dataset.expected_files
    A_filename = expected_files[0]
    graph_indicator_filename = expected_files[1]
    node_labels_filename = expected_files[2]
    edge_labels_filename = expected_files[3]
    graph_labels_filename = expected_files[4]
    
    def _load_from_txt_file(filename, names=None, dtype=None, index_increment=None):
        df = pd.read_csv(filename,header=None,index_col=False,dtype=dtype,names=names)
        # We optionally increment the index by 1 because indexing (e.g. node IDs) for this dataset starts
        # at 1, whereas the implicit Pandas DataFrame index starts at 0, which could cause confusion when
        # selecting rows later on.
        if index_increment:
            df.index = df.index + index_increment
        return df

    # edge information:
    df_graph = _load_from_txt_file(filename=A_filename, names=["source", "target"])

    if dataset._edge_labels_as_weights:
        # there are edge labels that can be used as edge weights
        df_edge_labels = _load_from_txt_file(
            filename=edge_labels_filename, names=["weight"], dtype=int
        )
        df_graph = pd.concat([df_graph, df_edge_labels], axis=1)

    # node information:
    df_graph_ids = _load_from_txt_file(
        filename=graph_indicator_filename, names=["graph_id"], index_increment=1
    )

    df_node_labels = _load_from_txt_file(
        filename=node_labels_filename, dtype="category", index_increment=1
    )
    # One-hot encode the node labels because these are used as node features in graph classification
    # tasks.
    df_node_features = pd.get_dummies(df_node_labels)

    # graph information:
    df_graph_labels = _load_from_txt_file(
        filename=graph_labels_filename, dtype="category", names=["label"], index_increment=1
    )

    # split the data into each of the graphs, based on the nodes in each one
    def graph_for_nodes(nodes):
        # each graph is disconnected, so the source is enough to identify the graph for an edge
        edges = df_graph[df_graph["source"].isin(nodes.index)]
        return StellarGraph(nodes, edges)

    groups = df_node_features.groupby(df_graph_ids["graph_id"])
    graphs = [graph_for_nodes(nodes) for _, nodes in groups]

    return graphs, df_graph_labels["label"]
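
To make the two key steps of the loader above concrete, the sketch below applies them to a tiny made-up dataset: one-hot encoding categorical node labels with `pd.get_dummies`, then grouping the node rows by graph id. The toy values and the `toy_*` variable names are illustrative assumptions and are not taken from the real files.

```python
import pandas as pd

# Toy data (hypothetical): five nodes with 1-based IDs, spread over two graphs.
node_ids = [1, 2, 3, 4, 5]
toy_graph_ids = pd.Series([1, 1, 1, 2, 2], index=node_ids, name="graph_id")
toy_node_labels = pd.DataFrame(
    {0: pd.Categorical(["a", "b", "a", "c", "b"])}, index=node_ids
)

# One column per distinct label value; each row has exactly one "hot" entry.
toy_node_features = pd.get_dummies(toy_node_labels)

# Split the feature rows into one DataFrame per graph.
toy_groups = {int(gid): rows for gid, rows in toy_node_features.groupby(toy_graph_ids)}
toy_sizes = {gid: len(rows) for gid, rows in toy_groups.items()}

print(toy_node_features.shape)  # (5, 3)
print(toy_sizes)                # {1: 3, 2: 2}
```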

In [5]:
class GuavData():
    name="PPMI"
#     expected_files=["GCN_A.txt",
#         "GCN_graph_indicator.txt",
#         "GCN_node_labels.txt",
#         "GCN_edge_labels.txt",
#         "GCN_graph_labels.txt"]
    expected_files=["pp/GCN_A.txt",
            "pp/GCN_graph_indicator.txt",
            "pp/GCN_node_labels.txt",
            "pp/GCN_edge_labels.txt",
            "pp/GCN_graph_labels.txt"]
#     expected_files=["MUTAG_A.txt",
#         "MUTAG_graph_indicator.txt",
#         "MUTAG_node_labels.txt",
#         "MUTAG_edge_labels.txt",
#         "MUTAG_graph_labels.txt"]
    # NOTE: the trailing comma below makes `description` a one-element tuple rather than a string.
    description="Each graph represents a brain derived from an fMRI. There are 164 nodes with 141 distinct node labels.",
    _edge_labels_as_weights = False
    _node_attributes = False

    def load(self):
        """
        Load this dataset into a list of StellarGraph objects with corresponding labels, downloading it if required.

        Note: Edges in MUTAG are labelled as one of 4 values: aromatic, single, double, and triple indicated by integers
        0, 1, 2, 3 respectively. The edge labels are included in the  :class:`.StellarGraph` objects as edge weights in
        integer representation.

        Returns:
            A tuple that is a list of :class:`.StellarGraph` objects and a Pandas Series of labels one for each graph.
        """
        return _load_graph_kernel_dataset(self)

In [6]:
dataset = GuavData()
display(dataset.description)
graphs, graph_labels = dataset.load()

('Each graph represents a brain derived from an fMRI. There are 164 nodes with 141 distinct node labels.',)

The `graphs` value is a list of `StellarGraph` instances, each of which carries a one-hot node feature vector (length 142 here):

In [7]:
print(graphs[0].info())

StellarGraph: Undirected multigraph
 Nodes: 164, Edges: 4518

 Node types:
  default: [164]
    Features: float32 vector, length 142
    Edge types: default-default->default

 Edge types:
    default-default->default: [4518]
        Weights: all 1 (default)
        Features: none


In [8]:
print(graphs[1].info())

StellarGraph: Undirected multigraph
 Nodes: 164, Edges: 4518

 Node types:
  default: [164]
    Features: float32 vector, length 142
    Edge types: default-default->default

 Edge types:
    default-default->default: [4518]
        Weights: all 1 (default)
        Features: none


Summary statistics of the sizes of the graphs:

In [9]:
summary = pd.DataFrame(
    [(g.number_of_nodes(), g.number_of_edges()) for g in graphs],
    columns=["nodes", "edges"],
)
summary.describe().round(1)

| | nodes | edges |
|---|---|---|
| count | 188.0 | 188.0 |
| mean | 164.0 | 3140.1 |
| std | 0.0 | 1289.4 |
| min | 164.0 | 1724.0 |
| 25% | 164.0 | 1844.0 |
| 50% | 164.0 | 3654.0 |
| 75% | 164.0 | 4455.0 |
| max | 164.0 | 4645.0 |


In [10]:
# print(dir(dataset))

In [11]:
# print(dir(dataset.load()[0][0]))

The labels are `1` or `-1`:

In [12]:
graph_labels.value_counts().to_frame()

| | label |
|---|---|
| -1 | 15744 |
| 1 | 15088 |
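
The two counts above sum to 30832, which equals 188 × 164 (graphs × nodes per graph) rather than 188. This suggests the labels file holds one row per node instead of one per graph, although that is an inference from the numbers and not something the files confirm. A small guard along the following lines can catch such a mismatch before training; the helper name `check_label_alignment` is hypothetical.

```python
def check_label_alignment(n_graphs, n_labels, nodes_per_graph=None):
    """Return a short diagnosis of how the label count relates to the graph count."""
    if n_labels == n_graphs:
        return "ok"
    if nodes_per_graph is not None and n_labels == n_graphs * nodes_per_graph:
        return "one label per node"
    return "mismatch"


# Numbers taken from the outputs in this notebook.
diagnosis = check_label_alignment(188, 15744 + 15088, nodes_per_graph=164)
print(diagnosis)  # one label per node
```

In the notebook itself this could be called as `check_label_alignment(len(graphs), len(graph_labels), nodes_per_graph=164)`.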


In [13]:
graph_labels = pd.get_dummies(graph_labels, drop_first=True)

### Prepare graph generator

To feed data to the `tf.keras` model that we will create later, we need a data generator. For supervised graph classification, we create an instance of `StellarGraph`'s `PaddedGraphGenerator` class. Note that `graphs` is a list of `StellarGraph` graph objects.

In [14]:
generator = PaddedGraphGenerator(graphs=graphs)
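
Graphs in a batch can have different numbers of nodes, so they have to be padded to a common size before they can be stacked into one tensor. The sketch below illustrates the general idea with NumPy: zero-pad each feature and adjacency matrix and keep a boolean mask marking the real nodes. It is a simplified illustration and not `PaddedGraphGenerator`'s actual implementation; the function name `pad_batch` is hypothetical.

```python
import numpy as np


def pad_batch(features, adjacencies):
    """Zero-pad per-graph arrays to the largest node count in the batch."""
    max_nodes = max(f.shape[0] for f in features)
    n_feat = features[0].shape[1]
    batch = len(features)
    x = np.zeros((batch, max_nodes, n_feat), dtype=np.float32)
    a = np.zeros((batch, max_nodes, max_nodes), dtype=np.float32)
    mask = np.zeros((batch, max_nodes), dtype=bool)
    for i, (f, adj) in enumerate(zip(features, adjacencies)):
        n = f.shape[0]
        x[i, :n] = f          # real node features; padded rows stay zero
        a[i, :n, :n] = adj    # real adjacency block; padded rows/columns stay zero
        mask[i, :n] = True    # True for real nodes, False for padding
    return x, mask, a


# Two toy graphs with 2 and 4 nodes, 3 features per node.
feats = [np.ones((2, 3)), np.ones((4, 3))]
adjs = [np.eye(2), np.eye(4)]
x, mask, a = pad_batch(feats, adjs)
print(x.shape, mask.sum(axis=1))  # (2, 4, 3) [2 4]
```

In this notebook every graph has 164 nodes, so the padding is effectively a no-op here.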

### Create the Keras graph classification model

We are now ready to create a `tf.keras` graph classification model using `StellarGraph`'s `GCNSupervisedGraphClassification` class together with standard `tf.keras` layers, e.g. `Dense`.

The input is the graph represented by its adjacency and node feature matrices. The first two layers are graph convolutional layers as in [2], each with 64 units and `relu` activations. The next layer is a mean pooling layer, where the learned node representations are averaged to create a graph representation. The graph representation is input to two fully connected layers with 32 and 16 units respectively and `relu` activations. The last layer is the output layer with a single unit and `sigmoid` activation.

![](graph_classification_architecture.png)

In [15]:
def create_graph_classification_model(generator):
    gc_model = GCNSupervisedGraphClassification(
        layer_sizes=[64, 64],
        activations=["relu", "relu"],
        generator=generator,
        dropout=0.5,
    )
    x_inp, x_out = gc_model.in_out_tensors()
    predictions = Dense(units=32, activation="relu")(x_out)
    predictions = Dense(units=16, activation="relu")(predictions)
    predictions = Dense(units=1, activation="sigmoid")(predictions)

    # Let's create the Keras model and prepare it for training
    model = Model(inputs=x_inp, outputs=predictions)
    model.compile(optimizer=Adam(0.005), loss=binary_crossentropy, metrics=["acc"])

    return model
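
As a rough illustration of what the convolutional and pooling stages compute, the NumPy sketch below applies the propagation rule from [2], `relu(D^-1/2 (A + I) D^-1/2 H W)`, twice and then averages over nodes. It assumes random, untrained weights, omits bias and dropout, and uses small layer sizes for readability, so it is a sketch of the idea and not the `GCNSupervisedGraphClassification` implementation.

```python
import numpy as np


def gcn_layer(a, h, w):
    """One graph convolution: relu(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = a + np.eye(a.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # degrees are >= 1 thanks to self-loops
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ h @ w, 0.0)


rng = np.random.default_rng(0)
a = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph on 3 nodes
h = rng.normal(size=(3, 4))   # 4 input features per node
w1 = rng.normal(size=(4, 8))  # illustrative sizes, not the model's 64 units
w2 = rng.normal(size=(8, 8))

node_repr = gcn_layer(a, gcn_layer(a, h, w1), w2)  # one row per node
graph_repr = node_repr.mean(axis=0)                # mean pooling over nodes
print(node_repr.shape, graph_repr.shape)  # (3, 8) (8,)
```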

### Train the model

We can now train the model using the model's `fit` method. First, we specify some important training parameters: the number of training epochs, the number of folds for cross-validation, and the number of times to repeat cross-validation.

In [16]:
epochs = 200  # maximum number of training epochs
folds = 10  # the number of folds for k-fold cross validation
n_repeats = 5  # the number of repeats for repeated k-fold cross validation

In [17]:
es = EarlyStopping(
    monitor="val_loss", min_delta=0, patience=25, restore_best_weights=True
)
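
The patience logic behind early stopping can be sketched in plain Python. The function below is a simplified model of the behaviour (stop once the validation loss has failed to improve for `patience` consecutive epochs, and remember the best epoch), not the Keras source; the name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based, for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember this epoch and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.63]
result = early_stopping_epoch(losses, patience=3)
print(result)  # (5, 2)
```

With `restore_best_weights=True`, the weights kept after stopping correspond to the best epoch (index 2 in this toy run), not the epoch at which training stopped.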

The function `train_fold` trains a graph classification model on a single fold of the data.

In [19]:
def train_fold(model, train_gen, test_gen, es, epochs):
    history = model.fit(
        train_gen, epochs=epochs, validation_data=test_gen, verbose=2, callbacks=[es],
    )
    # calculate performance on the test data and return along with history
    test_metrics = model.evaluate(test_gen, verbose=0)
    test_acc = test_metrics[model.metrics_names.index("acc")]

    return history, test_acc

In [20]:
def get_generators(train_index, test_index, graph_labels, batch_size):
    train_gen = generator.flow(
        train_index, targets=graph_labels.iloc[train_index].values, batch_size=batch_size
    )
    test_gen = generator.flow(
        test_index, targets=graph_labels.iloc[test_index].values, batch_size=batch_size
    )

    return train_gen, test_gen

The code below puts all the above functionality together in a training loop for repeated k-fold cross-validation with `folds=10` and `n_repeats=5`; that is, we perform 10-fold cross-validation 5 times.

**Note**: The code below may take a long time to run, depending on the value of `n_repeats`: each repeat trains and evaluates 10 graph classification models, one per fold of the data. For progress updates, keep `verbose=2` in the call to the `fit` method inside `train_fold`.

In [21]:
test_accs = []

stratified_folds = model_selection.RepeatedStratifiedKFold(
    n_splits=folds, n_repeats=n_repeats
).split(graph_labels, graph_labels)
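
The splitter yields positional indices into whatever is passed to `split`, so the labels need one row per graph for those indices to be valid positions in `graphs`. The toy example below, using 20 made-up binary labels, shows the number of splits produced and the range of the indices; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

toy_labels = np.array([0, 1] * 10)       # 20 made-up binary labels
toy_x = np.zeros((len(toy_labels), 1))   # placeholder features; only the row count matters

splitter = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
splits = list(splitter.split(toy_x, toy_labels))

max_index = max(int(test.max()) for _, test in splits)
print(len(splits))                  # 10
print(max_index < len(toy_labels))  # True
```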

In [22]:
print(enumerate(stratified_folds).__sizeof__())

48


In [23]:
# print(graph_labels)

In [25]:
for i, (train_index, test_index) in enumerate(stratified_folds):
    print(f"Training and evaluating on fold {i+1} out of {folds * n_repeats}...")
    print("train_index" + str(train_index))
    print("test_index" + str(test_index))
    train_gen, test_gen = get_generators(
        train_index, test_index, graph_labels, batch_size=30
    )

    model = create_graph_classification_model(generator)

    history, acc = train_fold(model, train_gen, test_gen, es, epochs)

    test_accs.append(acc)

Training and evaluating on fold 1 out of 50...
train_index[    0     1     2 ... 30829 30830 30831]
test_index[   12    13    17 ... 30802 30815 30824]
Epoch 1/200


UnknownError:  IndexError: index 8827 is out of bounds for axis 0 with size 188
Traceback (most recent call last):

  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\ops\script_ops.py", line 249, in __call__
    ret = func(*args)

  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 645, in wrapper
    return func(*args, **kwargs)

  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 961, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\keras\engine\data_adapter.py", line 837, in wrapped_generator
    for data in generator_fn():

  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\keras\engine\data_adapter.py", line 963, in generator_fn
    yield x[i]

  File "C:\ProgramData\Anaconda3\lib\site-packages\stellargraph\mapper\padded_graph_generator.py", line 314, in __getitem__
    graphs = self.graphs[batch_ilocs]

IndexError: index 8827 is out of bounds for axis 0 with size 188


	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]] [Op:__inference_train_function_2481]

Function call stack:
train_function


In [None]:
print(
    f"Accuracy over all folds mean: {np.mean(test_accs)*100:.3}% and std: {np.std(test_accs)*100:.2}%"
)
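
One detail of the format specifiers used above: `:.3` and `:.2` without a trailing `f` request significant digits, not decimal places, and can switch to scientific notation. The values below are made up purely to show the difference.

```python
mean_acc = 86.54321  # made-up percentage for illustration
std_acc = 12.3       # made-up percentage for illustration

general_mean = f"{mean_acc:.3}"
general_std = f"{std_acc:.2}"
fixed_std = f"{std_acc:.1f}"

print(general_mean)  # 86.5
print(general_std)   # 1.2e+01
print(fixed_std)     # 12.3
```

Using `:.1f` instead keeps the output in fixed-point notation regardless of magnitude.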

Finally, we plot a histogram of the accuracy of all `n_repeats x folds` models trained (50 in total).

In [None]:
plt.figure(figsize=(8, 6))
plt.hist(test_accs)
plt.xlabel("Accuracy")
plt.ylabel("Count")

The histogram shown above indicates the difficulty of training a good model on the MUTAG dataset due to the following factors:
- small amount of available data, i.e., only 188 graphs
- small amount of validation data since for a single fold only 19 graphs are used for validation
- the data are unbalanced since the majority class is twice as prevalent in the data
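
The validation-set size quoted in the list can be checked with a little arithmetic. The sketch below assumes the plain k-fold sizing rule (the first `n % k` folds receive one extra sample); stratified splitting may distribute the remainder slightly differently. The helper name `kfold_test_sizes` is hypothetical.

```python
def kfold_test_sizes(n_samples, n_splits):
    """Test-fold sizes under the plain k-fold rule: the first n % k folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes = kfold_test_sizes(188, 10)
print(sizes)       # [19, 19, 19, 19, 19, 19, 19, 19, 18, 18]
print(sum(sizes))  # 188
```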

Given the above, average performance as estimated using repeated 10-fold cross validation displays high variance but overall good performance for a straightforward application of graph convolutional neural networks to supervised graph classification. The high variance is likely the result of the small dataset size.

Generally, performance is a bit lower than SOTA in recent literature. However, we have not tuned the model for the best performance possible so some improvement over the current baseline may be attainable.

Compared with graph kernel-based approaches, our straightforward GCN with mean pooling is competitive, with the WL kernel being the exception.

For comparison, some performance numbers repeated from [3] for graph kernel-based approaches are:
- Graphlet Kernel (GK): $81.39\pm1.74$
- Random Walk Kernel (RW): $79.17\pm2.07$
- Propagation Kernel (PK): $76.00\pm2.69$
- Weisfeiler-Lehman Subtree Kernel (WL): $84.11\pm1.91$
