# Link Prediction using Graph Neural Networks

GNNs are powerful tools for many machine learning tasks on graphs. This tutorial teaches the basic workflow of using GNNs for link prediction. We again use the Cora dataset but try to predict interactions (citation relationships) between two papers.

In this tutorial, you will learn:
* Prepare training and testing sets for link prediction task.
* Build a GNN-based link prediction model.
* Train the model and verify the result.

<div class="alert alert-info">
    <b>Note: </b>The Cora dataset provided by DGL is bidirectional, meaning that the edges can only represent whether a citation relationship exist between two papers: they cannot tell which paper cites which other paper.
</div>

In [1]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
import itertools
import numpy as np
import scipy.sparse as sp

Using backend: pytorch


## Load graph and features

Following the [introduction](1_introduction.ipynb), we first load the Zachery's Karate Club graph and creates node embeddings.

In [4]:
import dgl.data

dataset = dgl.data.CoraGraphDataset()
g = dataset[0]

Loading from cache failed, re-processing.
Finished data loading and preprocessing.
  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done saving data into cached files.


## Prepare training and testing sets

In general, a link prediction data set contains two types of edges, *positive* and *negative edges*. Positive edges are usually drawn from the existing edges in the graph. In this example, we randomly pick 1000 edges for testing and leave the rest for training.

In [9]:
# Split edge set for training and testing
TEST_SIZE = 1000
u, v = g.edges()
eids = np.arange(g.num_edges())
eids = np.random.permutation(eids)
test_pos_u, test_pos_v = u[eids[:TEST_SIZE]], v[eids[:TEST_SIZE]]
train_pos_u, train_pos_v = u[eids[TEST_SIZE:]], v[eids[TEST_SIZE:]]

Since the number of negative edges is large, sampling is usually desired. How to choose proper negative sampling algorithms is a widely-studied topic and is out of scope of this tutorial.  Here we simply randomly sample the same number of negative edges as the positive edges.

In [10]:
# Find all negative edges and split them for training and testing
adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())))
adj_neg = 1 - adj.todense() - np.eye(g.num_nodes())
neg_u, neg_v = np.where(adj_neg != 0)
neg_eids = np.random.choice(len(neg_u), g.num_edges())
test_neg_u, test_neg_v = neg_u[neg_eids[:TEST_SIZE]], neg_v[neg_eids[:TEST_SIZE]]
train_neg_u, train_neg_v = neg_u[neg_eids[TEST_SIZE:]], neg_v[neg_eids[TEST_SIZE:]]

Put positive and negative edges together and form training and testing sets.

In [11]:
# Create training set.
train_u = torch.cat([torch.as_tensor(train_pos_u), torch.as_tensor(train_neg_u)])
train_v = torch.cat([torch.as_tensor(train_pos_v), torch.as_tensor(train_neg_v)])
train_label = torch.cat([torch.zeros(len(train_pos_u)), torch.ones(len(train_neg_u))])

# Create testing set.
test_u = torch.cat([torch.as_tensor(test_pos_u), torch.as_tensor(test_neg_u)])
test_v = torch.cat([torch.as_tensor(test_pos_v), torch.as_tensor(test_neg_v)])
test_label = torch.cat([torch.zeros(len(test_pos_u)), torch.ones(len(test_neg_u))])

## Define a GraphSAGE model

Our model will be a two-layer [GraphSAGE convolution (Hamilton et al., 2017)](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf), which takes the following mathematical form:

$$
h_{\mathcal{N}(v)}^k\leftarrow \text{Average}\{h_u^{k-1},\forall u\in\mathcal{N}(v)\}
$$

$$
h_v^k\leftarrow \text{ReLU}\left(W^k\cdot \text{CONCAT}(h_v^{k-1}, h_{\mathcal{N}(v)}^k) \right)
$$

DGL supports, alongside GraphSAGE via [`dgl.nn.SAGEConv`](https://docs.dgl.ai/api/python/nn.pytorch.html#sageconv), [many other graph convolution modules](https://docs.dgl.ai/api/python/nn.pytorch.html#module-dgl.nn.pytorch.conv).

In [13]:
from dgl.nn import SAGEConv

# ----------- 2. create model -------------- #
# build a two-layer GraphSAGE model
class GraphSAGE(nn.Module):
    def __init__(self, in_feats, h_feats):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, 'mean')
        self.conv2 = SAGEConv(h_feats, h_feats, 'mean')
    
    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h
    
# Create the model with hidden layer dimension 16.
net = GraphSAGE(g.ndata['feat'].shape[1], 16)

We then optimize the model using the following loss function.

$$
\hat{y}_{u\sim v} = \sigma(h_u^T h_v)
$$

$$
\mathcal{L} = -\sum_{u\sim v\in \mathcal{D}}\left( y_{u\sim v}\log(\hat{y}_{u\sim v}) + (1-y_{u\sim v})\log(1-\hat{y}_{u\sim v})) \right)
$$

Essentially, the model predicts a score for each edge by dot-producting the representations of its two end-points. It then computes a binary cross entropy loss with the target $y$ being 0 or 1 meaning whether the edge is a positive one or not.

In [17]:
# ----------- 3. set up loss and optimizer -------------- #
# in this case, loss will in training loop
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)

# ----------- 4. training -------------------------------- #
all_logits = []
for e in range(100):
    # forward
    logits = net(g, g.ndata['feat'])
    pred = torch.sigmoid((logits[train_u] * logits[train_v]).sum(dim=1))
    
    # compute loss
    loss = F.binary_cross_entropy(pred, train_label)
    
    # backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    all_logits.append(logits.detach())
    
    if e % 5 == 0:
        print('In epoch {}, loss: {}'.format(e, loss))

In epoch 0, loss: 0.715376615524292
In epoch 5, loss: 0.6942322850227356
In epoch 10, loss: 0.6904299855232239
In epoch 15, loss: 0.683499276638031
In epoch 20, loss: 0.6637445092201233
In epoch 25, loss: 0.6305882334709167
In epoch 30, loss: 0.5955677032470703
In epoch 35, loss: 0.5764950513839722
In epoch 40, loss: 0.5546678900718689
In epoch 45, loss: 0.5326592922210693
In epoch 50, loss: 0.5074127912521362
In epoch 55, loss: 0.47796398401260376
In epoch 60, loss: 0.451864093542099
In epoch 65, loss: 0.42299002408981323
In epoch 70, loss: 0.3948816955089569
In epoch 75, loss: 0.36992791295051575
In epoch 80, loss: 0.3451216220855713
In epoch 85, loss: 0.32089129090309143
In epoch 90, loss: 0.2978387773036957
In epoch 95, loss: 0.27626192569732666


In [18]:
# ----------- 5. check results ------------------------ #
pred = torch.sigmoid((logits[test_u] * logits[test_v]).sum(dim=1))
print('Accuracy', ((pred >= 0.5) == test_label).sum().item() / len(pred))

Accuracy 0.7495


## What's next?

If you wish to scale up your link prediction model, please see the tutorial [Stochastic Training of GNN for Link Prediction on Large Graphs](L2_large_link_prediction.ipynb).
* The training experience on large graph is different from training on full graphs, so we recommend you going through the tutorial [Stochastic Training of GNN for Node Classification on Large Graphs](L1_large_node_classification.ipynb) first.