# Write your own GNN module

In [the introduction](1_introduction.ipynb), we learned how to use the built-in [graph convolution modules](https://docs.dgl.ai/api/python/nn.pytorch.html#module-dgl.nn.pytorch.conv) to build a multi-layer graph neural network. Sometimes, however, you may want to invent a new way of aggregating neighbor information. DGL's message passing APIs are designed for this scenario.

Goals of this tutorial:

* Understand DGL's message passing APIs.
* Implement GraphSAGE convolution on your own.
* Implement Graph Attention Networks on your own.

In [1]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

Using backend: pytorch


This tutorial assumes that you already know the pipeline of full-graph node classification.  If not, please refer to the [introduction](1_introduction.ipynb).

The following data loading and training loop code is copied directly from the introduction tutorial.

In [2]:
import dgl.data

dataset = dgl.data.CoraGraphDataset()
g = dataset[0]

def train(g, net):
    optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
    all_logits = []
    best_val_acc = 0
    best_test_acc = 0

    features = g.ndata['feat']
    labels = g.ndata['label']
    train_mask = g.ndata['train_mask']
    val_mask = g.ndata['val_mask']
    test_mask = g.ndata['test_mask']
    for e in range(200):
        # Forward
        logits = net(g, features)

        # Compute prediction
        pred = logits.argmax(1)

        # Compute loss
        # Note that we should only compute the losses of the nodes in the training set,
        # i.e. with train_mask 1.
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])

        # Compute accuracy on training/validation/test
        train_acc = (pred[train_mask] == labels[train_mask]).float().mean()
        val_acc = (pred[val_mask] == labels[val_mask]).float().mean()
        test_acc = (pred[test_mask] == labels[test_mask]).float().mean()

        # Save the best validation accuracy and the corresponding test accuracy.
        if best_val_acc < val_acc:
            best_val_acc = val_acc
            best_test_acc = test_acc

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        all_logits.append(logits.detach())

        if e % 5 == 0:
            print('In epoch {}, loss: {:.3f}, val acc: {:.3f} (best {:.3f}), test acc: {:.3f} (best {:.3f})'.format(
                e, loss, val_acc, best_val_acc, test_acc, best_test_acc))

Loading from cache failed, re-processing.
Finished data loading and preprocessing.
  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done saving data into cached files.




## Message passing and GNNs

DGL follows the *message passing paradigm* inspired by the Message Passing Neural Network proposed by [Gilmer et al.](https://arxiv.org/abs/1704.01212) Essentially, they found many GNN models can fit into the following framework:

$$
m_{u\to v}^{(l)} = M^{(l)}\left(h_v^{(l-1)}, h_u^{(l-1)}, e_{u\to v}^{(l-1)}\right)
$$

$$
m_{v}^{(l)} = \sum_{u\in\mathcal{N}(v)}m_{u\to v}^{(l)}
$$

$$
h_v^{(l)} = U^{(l)}\left(h_v^{(l-1)}, m_v^{(l)}\right)
$$

where DGL calls $M^{(l)}$ the *message function*, $\sum$ the *reduce function*, and $U^{(l)}$ the *update function*.  Note that $\sum$ here can represent any function and is not necessarily a summation.
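The three equations can be traced by hand on a tiny graph. The sketch below is plain Python rather than DGL code, and it assumes scalar node features, a made-up three-node graph, and hypothetical choices for the message, reduce, and update steps (the names `message`, `reduce_fn`, and `update` are illustrative only).

```python
# Plain-Python sketch of one message passing layer (not DGL code).
# Assumptions: a toy directed graph, scalar features, edge features omitted.

edges = [(0, 2), (1, 2), (2, 0)]   # directed edges u -> v
h = {0: 1.0, 1: 2.0, 2: 3.0}       # h^{(l-1)} for every node


def message(h_v, h_u):
    # M: here the message is simply the source node feature.
    return h_u


def reduce_fn(messages):
    # The "sum" in the second equation; it could be swapped for mean, max, etc.
    return sum(messages)


def update(h_v, m_v):
    # U: combine the node's previous feature with its aggregated message.
    return h_v + m_v


# Equation 1: compute one message per edge and deliver it to the destination.
mailbox = {v: [] for v in h}
for u, v in edges:
    mailbox[v].append(message(h[v], h[u]))

# Equations 2 and 3: reduce each mailbox, then update each node.
h_new = {v: update(h[v], reduce_fn(mailbox[v])) for v in h}
print(h_new)
```

Node 1 has no incoming edges in this toy graph, so its mailbox is empty and its feature is left unchanged by the update.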

For example, the [GraphSAGE convolution (Hamilton et al., 2017)](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) takes the following mathematical form:

$$
h_{\mathcal{N}(v)}^k\leftarrow \text{Average}\{h_u^{k-1},\forall u\in\mathcal{N}(v)\}
$$

$$
h_v^k\leftarrow \text{ReLU}\left(W^k\cdot \text{CONCAT}(h_v^{k-1}, h_{\mathcal{N}(v)}^k) \right)
$$

You can see that message passing is directional: the message sent from node $u$ to node $v$ is not necessarily the same as the message sent from $v$ to $u$ in the opposite direction.

DGL graphs provide two members, `srcdata` and `dstdata`, for message passing.  You first put the input node features in `srcdata`.  After message passing, you retrieve the result from `dstdata`.

<div class="alert alert-info">
    <b>Note: </b>In full-graph message passing, both the input nodes and the output nodes are the full node set.  Therefore, in a homogeneous graph (i.e. one with a single node type and a single edge type), <code>srcdata</code> and <code>dstdata</code> are identical to <code>ndata</code>.  See <a href=H3_message_passing.ipynb>here (TODO)</a> for heterogeneous graph message passing.
</div>

Although DGL has built-in support for GraphSAGE via [`dgl.nn.SAGEConv`](https://docs.dgl.ai/api/python/nn.pytorch.html#sageconv), here is how you can implement GraphSAGE convolution in DGL on your own.

In [3]:
import dgl.function as fn

class SAGEConv(nn.Module):
    """Graph convolution module used by the GraphSAGE model.
    
    Parameters
    ----------
    in_feat : int
        Input feature size.
    out_feat : int
        Output feature size.
    """
    def __init__(self, in_feat, out_feat):
        super(SAGEConv, self).__init__()
        # A linear submodule for projecting the input and neighbor feature to the output.
        self.linear = nn.Linear(in_feat * 2, out_feat)
    
    def forward(self, g, h):
        """Forward computation
        
        Parameters
        ----------
        g : Graph
            The input graph.
        h : Tensor
            The input node feature.
        """
        with g.local_scope():
            g.srcdata['h'] = h
            # update_all is a message passing API.
            g.update_all(message_func=fn.copy_u('h', 'm'), reduce_func=fn.mean('m', 'h_neigh'))
            h_neigh = g.dstdata['h_neigh']
            h_total = torch.cat([h, h_neigh], dim=1)
            return self.linear(h_total)

The central piece in this code is the [`g.update_all`](https://docs.dgl.ai/generated/dgl.DGLGraph.update_all.html#dgl.DGLGraph.update_all) function, which gathers and averages the neighbor features. There are three concepts here:
* Message function `fn.copy_u('h', 'm')` that copies the node feature under name `'h'` as *messages* sent to neighbors.
* Reduce function `fn.mean('m', 'h_neigh')` that averages all the received messages under name `'m'` and saves the result as a new node feature `'h_neigh'`.
* `update_all` tells DGL to trigger the message and reduce functions for all the nodes and edges.
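To make these three pieces concrete, here is a plain-Python emulation of what the `update_all` call in `SAGEConv.forward` computes, followed by the concatenation step. It is only a sketch of the arithmetic, not how DGL implements it; the graph, feature values, and variable names are made up, and every node is given at least one incoming edge so that the mean is always defined.

```python
# Plain-Python emulation of
#   g.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'h_neigh'))
# on a toy graph (illustrative values, not DGL code).

src = [0, 1, 2, 2]                          # edge i goes from src[i] ...
dst = [1, 2, 0, 1]                          # ... to dst[i]
h = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]    # one 2-d feature per node

num_nodes = len(h)

# Message function copy_u: each edge carries a copy of its source feature.
mailbox = [[] for _ in range(num_nodes)]
for u, v in zip(src, dst):
    mailbox[v].append(h[u])

# Reduce function mean: average the received messages, per feature dimension.
h_neigh = []
for msgs in mailbox:
    dim = len(msgs[0])
    h_neigh.append([sum(m[k] for m in msgs) / len(msgs) for k in range(dim)])

# Counterpart of torch.cat([h, h_neigh], dim=1): concatenate per node.
h_total = [h[v] + h_neigh[v] for v in range(num_nodes)]
print(h_total)
```

Each row of `h_total` has twice the input feature size, which is why the linear layer in `SAGEConv` is created with `in_feat * 2` inputs.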

Afterwards, you can stack your own GraphSAGE convolution layers to form a multi-layer GraphSAGE network.

<div class="alert alert-info">
    <b>Note</b>: the GraphSAGE model here is only a demonstration of how you could write a neural network module; the hyperparameters and architecture choices are not tuned so the performance here may be inferior to what the paper actually reported.  For an accurate reproduction, please refer to <a href=https://github.com/dmlc/dgl/blob/master/examples/pytorch/graphsage/train_full.py>our example</a>.
</div>

In [4]:
class Net(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(Net, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats)
        self.conv2 = SAGEConv(h_feats, num_classes)
    
    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h
    
net = Net(g.ndata['feat'].shape[1], 16, dataset.num_classes)
train(g, net)

In epoch 0, loss: 1.949, val acc: 0.122 (best 0.122), test acc: 0.130 (best 0.130)
In epoch 5, loss: 1.882, val acc: 0.420 (best 0.420), test acc: 0.462 (best 0.462)
In epoch 10, loss: 1.761, val acc: 0.580 (best 0.592), test acc: 0.638 (best 0.639)
In epoch 15, loss: 1.580, val acc: 0.616 (best 0.616), test acc: 0.654 (best 0.654)
In epoch 20, loss: 1.343, val acc: 0.622 (best 0.632), test acc: 0.656 (best 0.654)
In epoch 25, loss: 1.073, val acc: 0.654 (best 0.654), test acc: 0.675 (best 0.675)
In epoch 30, loss: 0.800, val acc: 0.680 (best 0.680), test acc: 0.701 (best 0.701)
In epoch 35, loss: 0.560, val acc: 0.716 (best 0.716), test acc: 0.725 (best 0.725)
In epoch 40, loss: 0.371, val acc: 0.728 (best 0.732), test acc: 0.747 (best 0.736)
In epoch 45, loss: 0.238, val acc: 0.734 (best 0.736), test acc: 0.750 (best 0.752)
In epoch 50, loss: 0.152, val acc: 0.730 (best 0.736), test acc: 0.749 (best 0.752)
In epoch 55, loss: 0.099, val acc: 0.728 (best 0.736), test acc: 0.749 (best 0

## More customization

In DGL, we provide many built-in message and reduce functions under the `dgl.function` package.

![api](assets/dgl-mp.png)

You can find more details in [the API doc](https://docs.dgl.ai/api/python/function.html).
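As a rough guide to reading the built-in names, `u` refers to the source node, `v` to the destination node, and `e` to the edge. The plain-Python sketch below emulates three message functions (`copy_u`, `u_mul_e`, `u_add_v`) and two reducers (sum and max) on a made-up graph with scalar features. The helper `reduce_by_dst` is hypothetical and only illustrates the arithmetic, not DGL's implementation.

```python
# Plain-Python emulation of a few dgl.function built-ins (illustrative only).

src = [0, 1, 2]
dst = [2, 2, 0]
h = [1.0, 2.0, 3.0]          # node feature 'h'
w = [10.0, 100.0, 1000.0]    # edge feature 'w'

edge_ids = range(len(src))

# Message functions produce one value per edge.
copy_u = [h[src[e]] for e in edge_ids]               # source feature as is
u_mul_e = [h[src[e]] * w[e] for e in edge_ids]       # source feature times edge feature
u_add_v = [h[src[e]] + h[dst[e]] for e in edge_ids]  # source plus destination feature


def reduce_by_dst(messages, reducer):
    """Group per-edge messages by destination node and reduce each group."""
    mailbox = {}
    for e, m in enumerate(messages):
        mailbox.setdefault(dst[e], []).append(m)
    return {v: reducer(msgs) for v, msgs in mailbox.items()}


summed = reduce_by_dst(u_mul_e, sum)   # counterpart of a sum reducer
largest = reduce_by_dst(copy_u, max)   # counterpart of a max reducer
print(summed, largest)
```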

These APIs allow one to quickly implement new graph convolution modules. For example, the following implements a new `SAGEConv` that aggregates neighbor representations using a weighted average.  Note that the `edata` member can hold edge features, which can also take part in message passing.

In [5]:
class WeightedSAGEConv(nn.Module):
    """Graph convolution module used by the GraphSAGE model with edge weights.
    
    Parameters
    ----------
    in_feat : int
        Input feature size.
    out_feat : int
        Output feature size.
    """
    def __init__(self, in_feat, out_feat):
        super(WeightedSAGEConv, self).__init__()
        # A linear submodule for projecting the input and neighbor feature to the output.
        self.linear = nn.Linear(in_feat * 2, out_feat)
    
    def forward(self, g, h, w):
        """Forward computation
        
        Parameters
        ----------
        g : Graph
            The input graph.
        h : Tensor
            The input node feature.
        w : Tensor
            The edge weight.
        """
        with g.local_scope():
            g.srcdata['h'] = h
            g.edata['w'] = w
            g.update_all(fn.u_mul_e('h', 'w', 'm'), fn.mean('m', 'h_neigh'))
            h_neigh = g.dstdata['h_neigh']
            h_total = torch.cat([h, h_neigh], dim=1)
            return self.linear(h_total)

Because the graph in this dataset does not have edge weights, we manually assign all edge weights to one in the `forward()` function of the model.  You can replace it with your own edge weights.

In [6]:
class Net(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(Net, self).__init__()
        self.conv1 = WeightedSAGEConv(in_feats, h_feats)
        self.conv2 = WeightedSAGEConv(h_feats, num_classes)
    
    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat, torch.ones(g.num_edges()).to(g.device))
        h = F.relu(h)
        h = self.conv2(g, h, torch.ones(g.num_edges()).to(g.device))
        return h
    
net = Net(g.ndata['feat'].shape[1], 16, dataset.num_classes)
train(g, net)

In epoch 0, loss: 1.952, val acc: 0.162 (best 0.162), test acc: 0.149 (best 0.149)
In epoch 5, loss: 1.858, val acc: 0.290 (best 0.290), test acc: 0.299 (best 0.299)
In epoch 10, loss: 1.697, val acc: 0.668 (best 0.668), test acc: 0.658 (best 0.671)
In epoch 15, loss: 1.467, val acc: 0.662 (best 0.668), test acc: 0.658 (best 0.671)
In epoch 20, loss: 1.182, val acc: 0.702 (best 0.702), test acc: 0.703 (best 0.703)
In epoch 25, loss: 0.877, val acc: 0.722 (best 0.722), test acc: 0.715 (best 0.715)
In epoch 30, loss: 0.597, val acc: 0.736 (best 0.736), test acc: 0.736 (best 0.736)
In epoch 35, loss: 0.378, val acc: 0.742 (best 0.744), test acc: 0.746 (best 0.747)
In epoch 40, loss: 0.229, val acc: 0.752 (best 0.752), test acc: 0.752 (best 0.752)
In epoch 45, loss: 0.138, val acc: 0.750 (best 0.752), test acc: 0.757 (best 0.752)
In epoch 50, loss: 0.084, val acc: 0.756 (best 0.756), test acc: 0.763 (best 0.761)
In epoch 55, loss: 0.054, val acc: 0.758 (best 0.758), test acc: 0.763 (best 0

## Even more customization by user-defined function

DGL allows user-defined message and reduce functions for maximal expressiveness. Here is a user-defined message function that is equivalent to `fn.u_mul_e('h', 'w', 'm')`.

In [7]:
def u_mul_e_udf(edges):
    return {'m' : edges.src['h'] * edges.data['w']}

`edges` has three members: `src`, `data` and `dst`, representing the source node features, edge features, and destination node features for all edges.

You can also write your own reduce function.  For example, the following is equivalent to the built-in `fn.sum('m', 'h')` function that sums up the incoming messages:

In [8]:
def sum_udf(nodes):
    return {'h': nodes.mailbox['m'].sum(1)}

In short, DGL will group the nodes by their in-degrees, and for each group DGL stacks the incoming messages along the second dimension.  One can then perform a reduction along the second dimension to aggregate messages.
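The grouping described above can be pictured with a small plain-Python sketch. It assumes a made-up mailbox of 2-dimensional messages and mimics only the bookkeeping: nodes with the same in-degree are put in one bucket so that their messages stack into a rectangular array, and the reduction then runs over the second dimension as in `nodes.mailbox['m'].sum(1)`. This is an illustration, not DGL's actual implementation.

```python
from collections import defaultdict

# Destination node -> list of incoming messages (illustrative values).
mailbox = {
    0: [[1.0, 1.0]],
    1: [[2.0, 0.0], [0.0, 2.0]],
    2: [[3.0, 3.0]],
    3: [[1.0, 0.0], [1.0, 1.0]],
}

# Group nodes by in-degree so that every bucket stacks into a rectangular
# array of shape (nodes_in_bucket, in_degree, feature_size).
buckets = defaultdict(list)
for node, msgs in mailbox.items():
    buckets[len(msgs)].append(node)

h = {}
for degree, nodes in buckets.items():
    stacked = [mailbox[n] for n in nodes]
    # Sum over the second dimension (the in_degree axis) for each node.
    reduced = [[sum(col) for col in zip(*node_msgs)] for node_msgs in stacked]
    for n, value in zip(nodes, reduced):
        h[n] = value

print(dict(buckets))
print(h)
```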

For more details on customizing message and reduce function with user-defined function, please refer to the [API reference](https://docs.dgl.ai/api/python/udf.html#apiudf).

## Computing softmax over incoming edges of all nodes

A common reduction operation is to compute a weighted aggregation of messages via a softmax.  That is, given the edge weights $z_{i\to j}$ and the node features $h_i^{(l)}$, you would like to compute a softmax over all incoming edges of every individual node $j$, and compute a weighted average based on them:

$$
a_{i\to j} = \frac{\exp(z_{i\to j})}{\sum_{i'\in\mathcal{N}(j)}\exp(z_{i'\to j})} \qquad h_j^{(l+1)} = \sum_{i\in \mathcal{N}(j)} a_{i\to j} h_i^{(l)}
$$

DGL has an efficient wrapper function `dgl.ops.edge_softmax` that computes the softmax output in a memory- and time-efficient way.
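For reference, the two formulas above can be evaluated directly in plain Python. The sketch below assumes a made-up graph with scalar features and one logit per edge; it is a straightforward reference computation for checking intuition, not how `edge_softmax` is implemented. Subtracting the per-node maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math
from collections import defaultdict

src = [0, 1, 2, 0]
dst = [2, 2, 0, 1]
z = [0.0, 0.0, 1.5, -3.0]   # one logit z_{i->j} per edge
h = [1.0, 3.0, 5.0]         # scalar node features h_i

# Collect the ids of the incoming edges of every destination node.
incoming = defaultdict(list)
for eid, v in enumerate(dst):
    incoming[v].append(eid)

# Softmax over the incoming edges of each node.
a = [0.0] * len(z)
for v, eids in incoming.items():
    z_max = max(z[e] for e in eids)
    exps = {e: math.exp(z[e] - z_max) for e in eids}
    total = sum(exps.values())
    for e in eids:
        a[e] = exps[e] / total

# Weighted sum of source features using the softmax weights.
h_next = [0.0] * len(h)
for e, (u, v) in enumerate(zip(src, dst)):
    h_next[v] += a[e] * h[u]

print(a)
print(h_next)
```

Because the weights on each node's incoming edges sum to one, the weighted sum is already a weighted average of the neighbor features.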

For example, DGL has built-in support for [Graph Attention Networks (Veličković et al., 2017)](https://arxiv.org/abs/1710.10903) via [`dgl.nn.GATConv`](https://docs.dgl.ai/api/python/nn.pytorch.html#dgl.nn.pytorch.conv.GATConv), but here is how you can implement GAT convolutions in DGL on your own.

In [9]:
import dgl.function as fn
from dgl.ops import edge_softmax

class GATConv(nn.Module):
    """Graph convolution module used by the GAT model.
    
    Parameters
    ----------
    in_feat : int
        Input feature size.
    out_feat : int
        Output feature size.
    """
    def __init__(self, in_feat, out_feat):
        super(GATConv, self).__init__()
        # A linear submodule for projecting the aggregated neighbor feature to the output.
        self.linear = nn.Linear(in_feat, out_feat)
    
    def forward(self, g, h):
        """Forward computation
        
        Parameters
        ----------
        g : Graph
            The input graph.
        h : Tensor
            The input node feature.
        """
        with g.local_scope():
            g.srcdata['h'] = h
            g.dstdata['h'] = h
            # Compute a dot product for all edges between input nodes and output nodes.
            # The dot product will serve as logits of the softmax.
            g.apply_edges(fn.u_dot_v('h', 'h', 'z'))
            # Compute a softmax over the incoming edges of all nodes.
            g.edata['a'] = edge_softmax(g, g.edata['z'])
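            # Note: the softmax weights already sum to 1 over each node's incoming edges,
            # so fn.sum would give the weighted average in the formula above; fn.mean
            # divides that weighted sum by the in-degree once more.
            g.update_all(fn.u_mul_e('h', 'a', 'm'), fn.mean('m', 'h_neigh'))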
            h_neigh = g.dstdata['h_neigh']
            return self.linear(h_neigh)

The code here introduces two more new methods:

* `apply_edges` takes in a single message function.  It computes the message function output for all the edges and stores it as an edge feature.  Here, the statement `g.apply_edges(fn.u_dot_v('h', 'h', 'z'))` computes for each edge a dot product of its source node feature and destination node feature.
* `edge_softmax` takes in a DGL graph object as well as an edge weight tensor.  It returns another edge weight tensor representing the post-softmax edge weights.
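Unlike `update_all`, `apply_edges` has no reduce step, so its result has one entry per edge rather than one per node. A plain-Python picture of the `u_dot_v` computation, on a made-up graph with 2-dimensional features, might look like this (illustration only; the helper `dot` is not part of DGL):

```python
# Plain-Python emulation of g.apply_edges(fn.u_dot_v('h', 'h', 'z')).

src = [0, 1, 2]
dst = [1, 2, 0]
h = [[1.0, 0.0], [0.5, 2.0], [3.0, 1.0]]   # one 2-d feature per node


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


# One logit per edge: source feature dotted with destination feature.
z = [dot(h[u], h[v]) for u, v in zip(src, dst)]
print(z)
```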

You can again create your own multi-layer GAT as usual.

<div class="alert alert-info">
    <b>Note</b>: the GAT model here is only a demonstration of how you could write a neural network module; the hyperparameters and architecture choices are not tuned so the performance here may be inferior to what the paper actually reported.  For an accurate reproduction, please refer to <a href=https://github.com/dmlc/dgl/tree/master/examples/pytorch/gat/train.py>our example</a>.
</div>

In [10]:
class Net(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(Net, self).__init__()
        self.conv1 = GATConv(in_feats, h_feats)
        self.conv2 = GATConv(h_feats, num_classes)
    
    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h
    
net = Net(g.ndata['feat'].shape[1], 64, dataset.num_classes)
train(g, net)

In epoch 0, loss: 1.948, val acc: 0.124 (best 0.124), test acc: 0.130 (best 0.130)
In epoch 5, loss: 1.933, val acc: 0.066 (best 0.124), test acc: 0.065 (best 0.130)
In epoch 10, loss: 1.907, val acc: 0.112 (best 0.124), test acc: 0.125 (best 0.130)
In epoch 15, loss: 1.867, val acc: 0.096 (best 0.124), test acc: 0.108 (best 0.130)
In epoch 20, loss: 1.816, val acc: 0.164 (best 0.164), test acc: 0.160 (best 0.160)
In epoch 25, loss: 1.755, val acc: 0.290 (best 0.290), test acc: 0.305 (best 0.305)
In epoch 30, loss: 1.688, val acc: 0.284 (best 0.308), test acc: 0.284 (best 0.297)
In epoch 35, loss: 1.615, val acc: 0.300 (best 0.308), test acc: 0.300 (best 0.297)
In epoch 40, loss: 1.536, val acc: 0.318 (best 0.320), test acc: 0.330 (best 0.325)
In epoch 45, loss: 1.452, val acc: 0.322 (best 0.328), test acc: 0.329 (best 0.327)
In epoch 50, loss: 1.367, val acc: 0.322 (best 0.328), test acc: 0.341 (best 0.327)
In epoch 55, loss: 1.284, val acc: 0.348 (best 0.348), test acc: 0.377 (best 0

## Recap

* Use `srcdata` and `dstdata` to assign input node features and retrieve output node features.
* Use the built-in message and reduce functions in `dgl.function` to customize a new NN module.
* User-defined functions provide even more flexibility.
* `dgl.ops.edge_softmax` can compute softmax over incoming edges efficiently.