# Introduction

The source code: https://github.com/dglai/WSDM21-Hands-on-Tutorial/blob/main/L2_large_link_prediction.ipynb

This replica is simple, but it clearly shows how unsupervised learning (here, link prediction) works in DGL.

# Loading the Dataset

In [28]:
import dgl
import torch
from dgl.data import AsNodePredDataset
import numpy as np 

device = 'cpu'

In [2]:
def load_cora():
    data0 = dgl.data.CSVDataset('../graph_dgl/cora_csv/')
    data = AsNodePredDataset(data0, split_ratio=(0.5,0.2,0.3))
    g = data[0]
    g.ndata["features"] = g.ndata.pop("feat")
    g.ndata["labels"] = g.ndata.pop("label")
    return g, data.num_classes

In [7]:
raw_g, n_classes = load_cora()

Done loading data from cached files.


In [8]:
g = dgl.add_reverse_edges(raw_g)

In [17]:
node_features = g.ndata['features']
node_labels = g.ndata['labels']
num_features = node_features.shape[1]
num_classes = (node_labels.max() + 1).item()
print('Number of classes: {:d}'.format(num_classes))

Number of classes: 7


In [18]:
train_nid = torch.nonzero(g.ndata['train_mask'], as_tuple=True)[0]
val_nid = torch.nonzero(g.ndata['val_mask'], as_tuple=True)[0]
test_nid = torch.nonzero(~(g.ndata['train_mask'] | g.ndata['val_mask']), as_tuple=True)[0]

In [23]:
train_nid.shape, val_nid.shape, test_nid.shape

(torch.Size([1354]), torch.Size([541]), torch.Size([813]))
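
The three sizes above are consistent with the `split_ratio=(0.5, 0.2, 0.3)` passed to `AsNodePredDataset`. The arithmetic below is only a sanity check. It assumes the node count is the sum of the three printed sizes, that the train and validation sizes are truncated to integers, and that the test split takes the remainder. It is not a description of how DGL computes the split internally.

```python
# Assumption: the total node count is the sum of the three printed split sizes.
num_nodes = 1354 + 541 + 813
split_ratio = (0.5, 0.2, 0.3)

# Assumption: train/val sizes are truncated; the test split takes the rest.
n_train = int(num_nodes * split_ratio[0])
n_val = int(num_nodes * split_ratio[1])
n_test = num_nodes - n_train - n_val

print(num_nodes, n_train, n_val, n_test)  # 2708 1354 541 813
```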

# Defining Neighbor Sampler and Data Loader in DGL

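DGL provides `dgl.dataloading.EdgeDataLoader` to iterate over edges in minibatches for edge classification or link prediction tasks. Newer DGL releases deprecate `EdgeDataLoader` in favor of `dgl.dataloading.DataLoader` combined with `dgl.dataloading.as_edge_prediction_sampler`; the code here targets the older API.

Link prediction requires a negative sampler, which supplies non-edges for the model to contrast against the real (positive) edges. DGL provides built-in negative samplers such as `dgl.dataloading.negative_sampler.Uniform`.

This tutorial **draws 5 negative examples per positive example**, sampled uniformly.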
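
To make the idea concrete, here is a minimal plain-Python sketch of uniform negative sampling. It is not DGL's implementation: the function name `uniform_negative_edges` and the toy edge list are hypothetical. It keeps the source node of each positive edge and pairs it with `k` destination nodes drawn uniformly from all nodes. A drawn pair may happen to be a real edge, and this sketch does not filter such cases.

```python
import random

def uniform_negative_edges(pos_edges, num_nodes, k, seed=0):
    """Hypothetical helper: for each positive edge (u, v), emit k pairs (u, v')
    where v' is drawn uniformly from all node IDs."""
    rng = random.Random(seed)
    neg_edges = []
    for u, _v in pos_edges:
        for _ in range(k):
            neg_edges.append((u, rng.randrange(num_nodes)))
    return neg_edges

# Toy graph with 10 nodes and 3 positive edges (illustrative values only).
pos_edges = [(0, 1), (1, 2), (2, 3)]
neg_edges = uniform_negative_edges(pos_edges, num_nodes=10, k=5)

print(len(neg_edges))  # 15, i.e. 5 negatives per positive edge
```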

In [25]:
negative_sampler = dgl.dataloading.negative_sampler.Uniform(5)

In [31]:
sampler = dgl.dataloading.MultiLayerNeighborSampler([4, 4])  # sample 4 neighbors per node at each of the 2 layers
train_dataloader = dgl.dataloading.EdgeDataLoader(
    g,
    torch.arange(g.number_of_edges()),  # the edges to iterate over (here: all edges)
    sampler,                            # neighbor sampler
    negative_sampler=negative_sampler,  # 5 uniform negatives per positive edge
    device=device,
    # --- the following arguments are inherited from the PyTorch DataLoader ---
    batch_size=128,
    shuffle=True,
    drop_last=False,  # keep the last batch even if it is incomplete
    num_workers=0
)

In [32]:
input_nodes, pos_graph, neg_graph, mfgs = next(iter(train_dataloader))
print('Number of input nodes:', len(input_nodes))
print('Positive graph # nodes:', pos_graph.number_of_nodes(), '# edges:', pos_graph.number_of_edges())
print('Negative graph # nodes:', neg_graph.number_of_nodes(), '# edges:', neg_graph.number_of_edges())
print(mfgs)

Number of input nodes: 2158
Positive graph # nodes: 726 # edges: 128
Negative graph # nodes: 726 # edges: 640
[Block(num_src_nodes=2158, num_dst_nodes=1614, num_edges=5055), Block(num_src_nodes=1614, num_dst_nodes=726, num_edges=2170)]
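
The printed numbers fit together in a few ways worth spelling out: the positive graph holds one edge per batch element, the negative graph holds 5 times as many, the two message flow graphs (MFGs) chain into each other, and the last MFG's destination nodes are the nodes of the positive and negative graphs. The sketch below checks these relations using the values copied from the output above. The variable names are illustrative only, and the edge bound assumes each destination node receives at most `fanout` sampled incoming edges.

```python
batch_size = 128
negatives_per_positive = 5
fanout = 4

# Values copied from the printed output of the first minibatch.
input_nodes = 2158
pos_edges, neg_edges = 128, 640
pair_graph_nodes = 726
blocks = [
    {"src": 2158, "dst": 1614, "edges": 5055},
    {"src": 1614, "dst": 726, "edges": 2170},
]

checks = {
    "one positive edge per batch element": pos_edges == batch_size,
    "5 negatives per positive": neg_edges == batch_size * negatives_per_positive,
    "input nodes feed the first MFG": input_nodes == blocks[0]["src"],
    "MFGs chain together": blocks[0]["dst"] == blocks[1]["src"],
    "last MFG outputs the pair-graph nodes": blocks[-1]["dst"] == pair_graph_nodes,
    "at most `fanout` sampled edges per dst node": all(
        b["edges"] <= fanout * b["dst"] for b in blocks
    ),
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```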
