# Link Prediction using Graph Neural Networks

In the [introduction](1_introduction.ipynb), you already learned the basic workflow of using GNNs for node classification, i.e. predicting the category of a node in a graph.  This tutorial will teach you how to train a GNN for link prediction, i.e. predicting the existence of an edge between two arbitrary nodes in a graph.

Goal of this tutorial:

* Build a GNN-based link prediction model.

In [1]:
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F
import itertools
import numpy as np
import scipy.sparse as sp

Using backend: pytorch


## Overview of Link Prediction with GNN

Many applications such as social recommendation, item recommendation, knowledge graph completion, etc., can be formulated as link prediction, which predicts whether an edge exists between two particular nodes.  This tutorial shows an example of predicting whether a citation relationship, either citing or being cited, between two papers exists in a citation network.

## Load graph and features

Following the [introduction](1_introduction.ipynb), we first load the Cora dataset.

In [6]:
import dgl.data

dataset = dgl.data.CoraGraphDataset()
g = dataset[0]

Loading from cache failed, re-processing.
Finished data loading and preprocessing.
  NumNodes: 2708
  NumEdges: 10556
  NumFeats: 1433
  NumClasses: 7
  NumTrainingSamples: 140
  NumValidationSamples: 500
  NumTestSamples: 1000
Done saving data into cached files.


<div class="alert alert-info">
    
**Note**: some domains such as large-scale recommender systems or information retrieval favor metrics that emphasize good performance of top-K predictions.  In these cases you may want to consider other metrics such as mean average precision, and use more sophisticated online negative sampling methods such as [LambdaRank] or one in [PinSage].

</div>

This tutorial follows a relatively simple practice from [SEAL](https://papers.nips.cc/paper/2018/file/53f0d7c537d99b3824f0f99d62ea2428-Paper.pdf).  It formulates the link prediction problem as a binary classification problem as follows:

* Treat the edges in the graph as *positive examples*.
* Sample a number of non-existent edges (i.e. node pairs with no edges between them) as *negative* examples.
* Divide the positive examples and negative examples into a training set and a test set.
* Evaluate the model with any binary classification metric such as Area Under Curve (AUC).

## Prepare training and testing sets

This tutorial randomly picks 10% of the edges for positive examples in the test set, and leave the rest for the training set.  It then samples the same number of edges for negative examples in both sets.

In [7]:
# Split edge set for training and testing
u, v = g.edges()

eids = np.arange(g.number_of_edges())
eids = np.random.permutation(eids)
test_size = int(len(eids) * 0.1)
train_size = g.number_of_edges() - test_size
test_pos_u, test_pos_v = u[eids[:test_size]], v[eids[:test_size]]
train_pos_u, train_pos_v = u[eids[test_size:]], v[eids[test_size:]]

In [9]:
# Find all negative edges and split them for training and testing
adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())))
adj_neg = 1 - adj.todense() - np.eye(g.number_of_nodes())
neg_u, neg_v = np.where(adj_neg != 0)

neg_eids = np.random.choice(len(neg_u), g.number_of_edges())
test_neg_u, test_neg_v = neg_u[neg_eids[:test_size]], neg_v[neg_eids[:test_size]]
train_neg_u, train_neg_v = neg_u[neg_eids[test_size:]], neg_v[neg_eids[test_size:]]

In [10]:
# Create training set.
train_u = torch.cat([torch.as_tensor(train_pos_u), torch.as_tensor(train_neg_u)])
train_v = torch.cat([torch.as_tensor(train_pos_v), torch.as_tensor(train_neg_v)])
train_label = torch.cat([torch.zeros(len(train_pos_u)), torch.ones(len(train_neg_u))])

# Create testing set.
test_u = torch.cat([torch.as_tensor(test_pos_u), torch.as_tensor(test_neg_u)])
test_v = torch.cat([torch.as_tensor(test_pos_v), torch.as_tensor(test_neg_v)])
test_label = torch.cat([torch.zeros(len(test_pos_u)), torch.ones(len(test_neg_u))])

## Define a GraphSAGE model

This tutorial builds a model consisting of two [GraphSAGE](https://arxiv.org/abs/1706.02216) layers, each computes new node representations by averaging neighbor information.  DGL provides `dgl.nn.SAGEConv` that conveniently creates a GraphSAGE layer.

In [16]:
from dgl.nn import SAGEConv

# ----------- 2. create model -------------- #
# build a two-layer GraphSAGE model
class GraphSAGE(nn.Module):
    def __init__(self, in_feats, h_feats):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, 'mean')
        self.conv2 = SAGEConv(h_feats, h_feats, 'mean')
    
    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h
    
model = GraphSAGE(g.ndata['feat'].shape[1], 16)

The model then predicts the probability of existence of an edge by computing a dot product between the representations of both incident nodes.

$$
\hat{y}_{u\sim v} = \sigma(h_u^T h_v)
$$

The loss function is simply binary cross entropy loss.

$$
\mathcal{L} = -\sum_{u\sim v\in \mathcal{D}}\left( y_{u\sim v}\log(\hat{y}_{u\sim v}) + (1-y_{u\sim v})\log(1-\hat{y}_{u\sim v})) \right)
$$

<div class="alert alert-info">
    
**Note**: this tutorial did not include evaluation on a validation set.  In practice you should save and evaluate the best model based on performance on the validation set.

</div>

In [17]:
# ----------- 3. set up loss and optimizer -------------- #
# in this case, loss will in training loop
optimizer = torch.optim.Adam(itertools.chain(model.parameters()), lr=0.01)

# ----------- 4. training -------------------------------- #
all_logits = []
for e in range(100):
    # forward
    logits = model(g, g.ndata['feat'])
    pred = torch.sigmoid((logits[train_u] * logits[train_v]).sum(dim=1))
    
    # compute loss
    loss = F.binary_cross_entropy(pred, train_label)
    
    # backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    all_logits.append(logits.detach())
    
    if e % 5 == 0:
        print('In epoch {}, loss: {}'.format(e, loss))

In epoch 0, loss: 0.7159186005592346
In epoch 5, loss: 0.6917704939842224
In epoch 10, loss: 0.6827414035797119
In epoch 15, loss: 0.6644165515899658
In epoch 20, loss: 0.6312846541404724
In epoch 25, loss: 0.5956920981407166
In epoch 30, loss: 0.5744430422782898
In epoch 35, loss: 0.5570815205574036
In epoch 40, loss: 0.540522575378418
In epoch 45, loss: 0.5278851985931396
In epoch 50, loss: 0.5155758857727051
In epoch 55, loss: 0.5017237067222595
In epoch 60, loss: 0.4842952489852905
In epoch 65, loss: 0.46314892172813416
In epoch 70, loss: 0.44303062558174133
In epoch 75, loss: 0.4239100217819214
In epoch 80, loss: 0.402626633644104
In epoch 85, loss: 0.3804161250591278
In epoch 90, loss: 0.35775208473205566
In epoch 95, loss: 0.33470577001571655


In [19]:
# ----------- 5. check results ------------------------ #
from sklearn.metrics import roc_auc_score
with torch.no_grad():
    pred = torch.sigmoid((logits[test_u] * logits[test_v]).sum(dim=1))
    pred = pred.numpy()
    label = test_label.numpy()
    print('AUC', roc_auc_score(label, pred))

AUC 0.7918914669481818


## What's Next?

* See [here](L2_large_link_prediction.ipynb) for a tutorial on link prediction on a large-scale graph.