# Write your own GNN module

In [the introduction](1_introduction.ipynb), we learned how to use the built-in [graph convolution modules](https://docs.dgl.ai/api/python/nn.pytorch.html#module-dgl.nn.pytorch.conv) to build a multi-layer graph neural network. However, sometimes you may want to invent a new way of aggregating neighbor information. DGL's message passing APIs are designed for this scenario.

Goals of this tutorial:

* Understand DGL's message passing APIs.
* Implement GraphSAGE convolution on your own.
* Implement Graph Attention Networks on your own.

In [1]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

Using backend: pytorch


## Message passing and GNNs

DGL follows the *message passing paradigm* inspired by the Message Passing Neural Network proposed by [Gilmer et al.](https://arxiv.org/abs/1704.01212) Essentially, they found that many GNN models fit into the following framework:

$$
m_{u\sim v}^{(l)} = M^{(l)}\left(h_v^{(l-1)}, h_u^{(l-1)}, e_{u\sim v}^{(l-1)}\right)
$$

$$
m_{v}^{(l)} = \sum_{u\in\mathcal{N}(v)}m_{u\sim v}^{(l)}
$$

$$
h_v^{(l)} = U^{(l)}\left(h_v^{(l-1)}, m_v^{(l)}\right)
$$

where DGL calls $M^{(l)}$ the *message function* and $\sum$ the *reduce function*.  Note that $\sum$ here can represent any function and is not necessarily a summation.
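
To make the three equations concrete, here is a minimal plain-Python sketch of one round of message passing on a tiny directed graph. It does not use DGL. The edge list, the scalar features, and the names `message_fn`, `reduce_fn`, and `update_fn` are hypothetical choices made only for illustration.

```python
# One round of message passing on a toy directed graph, in plain Python.
# Each edge is a (u, v) pair meaning "u sends a message to v".
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
h = {0: 1.0, 1: 2.0, 2: 4.0}        # scalar node features h^{(l-1)}
e = {edge: 1.0 for edge in edges}    # scalar edge features (all 1 here)

def message_fn(h_v, h_u, e_uv):
    # M: in this sketch the message is the source feature scaled by the edge feature.
    return h_u * e_uv

def reduce_fn(messages):
    # The "sum" of the framework; a mean or max would also be a valid choice.
    return sum(messages)

def update_fn(h_v, m_v):
    # U: combine the old node feature with the aggregated message.
    return h_v + m_v

# Deliver every message to the mailbox of its destination node.
mailbox = {v: [] for v in h}
for (u, v) in edges:
    mailbox[v].append(message_fn(h[v], h[u], e[(u, v)]))

# Reduce each mailbox and update the node feature.
h_new = {v: update_fn(h[v], reduce_fn(mailbox[v])) for v in h}
print(h_new)
```

In this toy graph the message on edge `(0, 2)` is `1.0` while the message on the reverse edge `(2, 0)` is `4.0`, because each message is built from the feature of its own source node.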

For example, the [GraphSAGE convolution (Hamilton et al., 2017)](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) takes the following mathematical form:

$$
h_{\mathcal{N}(v)}^k\leftarrow \text{Average}\{h_u^{k-1},\forall u\in\mathcal{N}(v)\}
$$

$$
h_v^k\leftarrow \text{ReLU}\left(W^k\cdot \text{CONCAT}(h_v^{k-1}, h_{\mathcal{N}(v)}^k) \right)
$$

You can see that message passing is directional: the message sent from one node $u$ to another node $v$ is not necessarily the same as the message sent from $v$ to $u$ in the opposite direction.

DGL graphs provide two members `srcdata` and `dstdata` for the purpose of message passing.  You first put the input node features in `srcdata`.  After you perform message passing, you can retrieve the result of message passing from `dstdata`.

<div class="alert alert-info">
    <b>Note: </b>In full graph message passing, both the input nodes and the output nodes are the full node set.  Therefore, <code>srcdata</code> and <code>dstdata</code> in a homogeneous graph (i.e. one with only one node type and one edge type) are identical to <code>ndata</code>.  See <a href=H3_message_passing.ipynb>here (TODO)</a> for heterogeneous graph message passing.
</div>

Although DGL has built-in support for GraphSAGE via [`dgl.nn.SAGEConv`](https://docs.dgl.ai/api/python/nn.pytorch.html#sageconv), here is how you can implement GraphSAGE convolution in DGL yourself.

In [2]:
import dgl.function as fn

class SAGEConv(nn.Module):
    """Graph convolution module used by the GraphSAGE model.
    
    Parameters
    ----------
    in_feat : int
        Input feature size.
    out_feat : int
        Output feature size.
    """
    def __init__(self, in_feat, out_feat):
        super(SAGEConv, self).__init__()
        # A linear submodule for projecting the input and neighbor feature to the output.
        self.linear = nn.Linear(in_feat * 2, out_feat)
    
    def forward(self, g, h):
        """Forward computation
        
        Parameters
        ----------
        g : Graph
            The input graph.
        h : Tensor
            The input node feature.
        """
        with g.local_scope():
            g.srcdata['h'] = h
            # update_all is a message passing API.
            g.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'h_neigh'))
            h_neigh = g.dstdata['h_neigh']
            h_total = torch.cat([h, h_neigh], dim=1)
            return F.relu(self.linear(h_total))

The central piece in this code is the [`g.update_all`](https://docs.dgl.ai/generated/dgl.DGLGraph.update_all.html#dgl.DGLGraph.update_all) function, which gathers and averages the neighbor features. There are three concepts here:
* Message function `fn.copy_u('h', 'm')` that copies the node feature under name `'h'` as *messages* sent to neighbors.
* Reduce function `fn.mean('m', 'h_neigh')` that averages all the received messages under name `'m'` and saves the result as a new node feature `'h_neigh'`.
* `update_all` tells DGL to trigger the message and reduce functions for all the nodes and edges.
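
As a rough mental model, the call `g.update_all(fn.copy_u('h', 'm'), fn.mean('m', 'h_neigh'))` can be pictured with the plain-Python sketch below. This is only an illustration of the result, not how DGL computes it internally. The toy edge list (`src`, `dst`) and the feature values are made up, and every node is given at least one incoming edge so that the mean is always defined.

```python
# Plain-Python model of update_all(copy_u('h', 'm'), mean('m', 'h_neigh')).
src = [0, 0, 1, 2]   # source node of each edge
dst = [1, 2, 2, 0]   # destination node of each edge
h = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]   # one 2-dimensional feature per node

num_nodes = len(h)
dim = len(h[0])

# Message step (copy_u): every edge carries a copy of its source node feature.
m = [h[u] for u in src]

# Reduce step (mean): average the messages arriving at each destination node.
sums = [[0.0] * dim for _ in range(num_nodes)]
counts = [0] * num_nodes
for msg, v in zip(m, dst):
    counts[v] += 1
    for k in range(dim):
        sums[v][k] += msg[k]
h_neigh = [[s / counts[v] for s in sums[v]] for v in range(num_nodes)]

# The concatenation [h, h_neigh] that SAGEConv feeds into its linear layer.
h_total = [h[v] + h_neigh[v] for v in range(num_nodes)]
print(h_neigh)
print(h_total)
```

Node 2 receives messages from nodes 0 and 1, so its `h_neigh` entry is the element-wise average of their features.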

## More customization

In DGL, we provide many built-in message and reduce functions under the `dgl.function` package.

![api](assets/dgl-mp.png)

You can find more details in [the API doc](https://docs.dgl.ai/api/python/function.html).

These APIs allow one to quickly implement new graph convolution modules. For example, the following implements a new `SAGEConv` that aggregates neighbor representations using a weighted average.  Note that the `edata` member can hold edge features, which can also take part in message passing.

In [3]:
class WeightedSAGEConv(nn.Module):
    """Graph convolution module used by the GraphSAGE model with edge weights.
    
    Parameters
    ----------
    in_feat : int
        Input feature size.
    out_feat : int
        Output feature size.
    """
    def __init__(self, in_feat, out_feat):
        super(WeightedSAGEConv, self).__init__()
        # A linear submodule for projecting the input and neighbor feature to the output.
        self.linear = nn.Linear(in_feat * 2, out_feat)
    
    def forward(self, g, h, w):
        """Forward computation
        
        Parameters
        ----------
        g : Graph
            The input graph.
        h : Tensor
            The input node feature.
        w : Tensor
            The edge weight.
        """
        with g.local_scope():
            g.srcdata['h'] = h
            g.edata['w'] = w
            # update_all is a message passing API.
            g.update_all(fn.u_mul_e('h', 'w', 'm'), fn.mean('m', 'h_neigh'))
            h_neigh = g.dstdata['h_neigh']
            h_total = torch.cat([h, h_neigh], dim=1)
            return F.relu(self.linear(h_total))

## Even more customization by user-defined function

DGL allows user-defined message and reduce functions for maximal expressiveness. Here is a user-defined message function that is equivalent to `fn.u_mul_e('h', 'w', 'm')`.

In [4]:
def u_mul_e_udf(edges):
    return {'m' : edges.src['h'] * edges.data['w']}

`edges` has three members: `src`, `data` and `dst`, representing the source node feature, edge feature, and destination node feature for all edges.
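
The sketch below illustrates, in plain Python, what "for all edges" means here: each member behaves like a batch with one row per edge. The lists `edges_src_h` and `edges_dst_h` are hypothetical stand-ins for `edges.src['h']` and `edges.dst['h']`, and the toy graph and values are made up for illustration.

```python
# Plain-Python model of the batched view that a message UDF receives.
src = [0, 0, 1]   # source node of each edge
dst = [1, 2, 2]   # destination node of each edge
h = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # node features
w = [0.5, 2.0, 1.0]                        # one scalar weight per edge

# Row i holds the feature of the source (or destination) node of edge i.
edges_src_h = [h[u] for u in src]
edges_dst_h = [h[v] for v in dst]

# Equivalent of edges.src['h'] * edges.data['w'], one message per edge.
m = [[x * w_e for x in row] for row, w_e in zip(edges_src_h, w)]
print(m)
```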

You can also write your own reduce function.  For example, the following is equivalent to the builtin `fn.sum('m', 'h')` function that sums up the incoming messages:

In [5]:
def sum_udf(nodes):
    return {'h': nodes.mailbox['m'].sum(1)}

In short, DGL will group the nodes by their in-degrees, and for each group DGL stacks the incoming messages along the second dimension.  One can then perform a reduction along the second dimension to aggregate messages.
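
The grouping described above can be sketched in plain Python as follows. This only illustrates the idea and is not DGL's implementation. The names `inbox` and `buckets` are hypothetical, and the toy edges and messages are made up.

```python
from collections import defaultdict

# Toy graph: destination of each edge and one message per edge (feature size 1).
dst = [1, 2, 2, 0, 0]
m = [[1.0], [1.0], [2.0], [4.0], [2.0]]

# Collect the incoming messages of every destination node.
inbox = defaultdict(list)
for msg, v in zip(m, dst):
    inbox[v].append(msg)

# Group nodes by in-degree so that each group forms a regular array.
buckets = defaultdict(list)
for v, msgs in inbox.items():
    buckets[len(msgs)].append(v)

h = {}
for degree, nodes in sorted(buckets.items()):
    # For this group the mailbox has shape (len(nodes), degree, feature_size).
    mailbox = [inbox[v] for v in nodes]
    # Equivalent of nodes.mailbox['m'].sum(1): reduce over the degree axis.
    for v, stacked in zip(nodes, mailbox):
        h[v] = [sum(col) for col in zip(*stacked)]
print(h)
```

Nodes 0 and 2 both have in-degree 2, so they land in the same group and their messages can be stacked into one array and reduced together.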

For more details on customizing message and reduce functions with user-defined functions, please refer to the [API reference](https://docs.dgl.ai/api/python/udf.html#apiudf).

## Computing softmax over incoming edges of all nodes

A common reduction operation is to compute a weighted aggregation of messages via a softmax.  That is, given the edge weights $z_{i\to j}$ and the node features $h_i^{(l)}$, you would like to compute a softmax over all incoming edges of every individual node $j$, and compute a weighted average of the source node features based on them:

$$
a_{i\to j} = \frac{\exp(z_{i\to j})}{\sum_{i'\in\mathcal{N}(j)}\exp(z_{i'\to j})} \qquad h_j^{(l+1)} = \sum_{i\in \mathcal{N}(j)} a_{i\to j} h_i^{(l)}
$$

DGL has an efficient wrapper function `dgl.ops.edge_softmax` that computes the softmax output in a memory- and time-efficient way.
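
To see what the two formulas compute, here is a plain-Python sketch of a softmax over the incoming edges of each node, followed by the weighted sum. It is not DGL's implementation. The toy graph and values are made up, and subtracting the per-node maximum before exponentiating is a standard numerical-stability step added here as an assumption, not something stated above.

```python
import math
from collections import defaultdict

src = [0, 1, 2, 0]            # source node i of each edge
dst = [2, 2, 0, 1]            # destination node j of each edge
z = [1.0, 2.0, 0.5, -3.0]     # one logit z_{i->j} per edge
h = [1.0, 3.0, 5.0]           # scalar node features

# Group edge ids by destination node.
incoming = defaultdict(list)
for eid, v in enumerate(dst):
    incoming[v].append(eid)

# Softmax over the incoming edges of each node.
a = [0.0] * len(z)
for v, eids in incoming.items():
    z_max = max(z[e] for e in eids)                    # for numerical stability
    exps = {e: math.exp(z[e] - z_max) for e in eids}
    total = sum(exps.values())
    for e in eids:
        a[e] = exps[e] / total

# Weighted sum of source features: h_j = sum_i a_{i->j} * h_i.
h_new = [0.0] * len(h)
for eid, (u, v) in enumerate(zip(src, dst)):
    h_new[v] += a[eid] * h[u]
print(a)
print(h_new)
```

Because the weights on the incoming edges of each node sum to 1, the final step is a plain sum; no further division by the in-degree is needed.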

For example, DGL has built-in support for [Graph Attention Networks (Veličković et al., 2017)](https://arxiv.org/abs/1710.10903) via [`dgl.nn.GATConv`](https://docs.dgl.ai/api/python/nn.pytorch.html#dgl.nn.pytorch.conv.GATConv), but here is how you can implement GAT convolutions in DGL yourself.

In [9]:
import dgl.function as fn
from dgl.ops import edge_softmax

class GATConv(nn.Module):
    """Simplified graph attention convolution (dot-product attention over incoming edges).
    
    Parameters
    ----------
    in_feat : int
        Input feature size.
    out_feat : int
        Output feature size.
    """
    def __init__(self, in_feat, out_feat):
        super(GATConv, self).__init__()
        # A linear submodule for projecting the aggregated neighbor feature to the output.
        # forward() applies it to h_neigh alone, so the input size is in_feat, not in_feat * 2.
        self.linear = nn.Linear(in_feat, out_feat)
    
    def forward(self, g, h):
        """Forward computation
        
        Parameters
        ----------
        g : Graph
            The input graph.
        h : Tensor
            The input node feature.
        """
        with g.local_scope():
            g.srcdata['h'] = h
            g.dstdata['h'] = h
            # Compute a dot product for all edges between input nodes and output nodes.
            # The dot product will serve as logits of the softmax.
            g.apply_edges(fn.u_dot_v('h', 'h', 'z'))
            # Compute a softmax over the incoming edges of all nodes.
            g.edata['a'] = edge_softmax(g, g.edata['z'])
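            # The softmax weights already sum to 1 over each node's incoming edges,
            # so a sum (not a mean) gives the weighted average in the formula above.
            g.update_all(fn.u_mul_e('h', 'a', 'm'), fn.sum('m', 'h_neigh'))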
            h_neigh = g.dstdata['h_neigh']
            return F.relu(self.linear(h_neigh))

The code here introduces two more APIs:

* `apply_edges` takes in a single message function.  It computes the message function output for all the edges and stores it as an edge feature.  Here, the statement `g.apply_edges(fn.u_dot_v('h', 'h', 'z'))` computes for each edge the dot product of its source node feature and its destination node feature, and stores it under the name `'z'`.
* `edge_softmax` takes in a DGL graph object as well as an edge weight tensor.  It returns another edge weight tensor representing the post-softmax edge weights.
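
As a small illustration of the per-edge dot product, the plain-Python sketch below computes one logit per edge from the features of its two endpoints. It does not use DGL. The helper `dot` and the toy values are hypothetical.

```python
# Plain-Python model of a per-edge dot product between source and destination features.
src = [0, 0, 1]   # source node of each edge
dst = [1, 2, 2]   # destination node of each edge
h = [[1.0, 2.0], [3.0, 4.0], [0.5, -1.0]]   # node features

def dot(x, y):
    # Dot product of two equal-length feature vectors.
    return sum(a * b for a, b in zip(x, y))

# One logit per edge: <h[source], h[destination]>.
z = [dot(h[u], h[v]) for u, v in zip(src, dst)]
print(z)
```

These per-edge values play the role of the logits $z_{i\to j}$ that are then normalized over the incoming edges of each node.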

## Recap

* Use `srcdata` and `dstdata` to assign input node features and retrieve output node features.
* Use the built-in message and reduce functions in `dgl.function` to customize a new NN module.
* User-defined functions provide even more flexibility.
* `dgl.ops.edge_softmax` can compute softmax over incoming edges efficiently.