# Installation

In [1]:
!pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
!pip install torch-geometric

Looking in links: https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
Collecting torch-scatter
  Downloading https://pytorch-geometric.com/whl/torch-1.8.0%2Bcu101/torch_scatter-2.0.7-cp37-cp37m-linux_x86_64.whl (2.5MB)
Installing collected packages: torch-scatter
Successfully installed torch-scatter-2.0.7
Looking in links: https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
Collecting torch-sparse
  Downloading https://pytorch-geometric.com/whl/torch-1.8.0%2Bcu101/torch_sparse-0.6.9-cp37-cp37m-linux_x86_64.whl (1.5MB)
Installing collected packages: torch-sparse
Successfully installed torch-sparse-0.6.9
Looking in links: https://pytorch-geometric.com/whl/torch-1.8.0+cu101.html
Collecting torch-cluster
  Downloading https://pytorch-geometric.com/whl/torch-1.8.0%2Bcu101/torch_cluster-1.5.9-cp37-cp37m-linux_x86_64.whl (1.0MB)

# Introduction
In the previous notebook we looked at link prediction using an encoder-decoder framework. In this notebook we will consider an alternative, more general approach to link prediction, described in the paper "[Link Prediction Based on Graph Neural Networks](https://www.researchgate.net/profile/Muhan_Zhang2/publication/323443864_Link_Prediction_Based_on_Graph_Neural_Networks/links/5dc113364585151435e9382a/Link-Prediction-Based-on-Graph-Neural-Networks.pdf)".

The approach is called SEAL (Subgraphs, Embeddings, and Attributes for Link prediction) and is illustrated in the figure below:

![alt text](https://ai.science/api/authorized-images/YShW5uEMQ1dJUj1UMyFhsSvYvuzRx2GV0RYXygKPfpoqson6GarYb4%2B8rIDTD%2BRP0NRmOcQ4y2uTzvzTtdYpExjb9X6QQmHbLD%2F5zni0Y4tkszpDqvz4Wn%2FO2L9bP1%2Fgh5sOyDTcsJ3n5nn16MnjGDh%2FwNgd9zpwn3VoA3WFQdBg%2B82QyZn2H1W8uD8j6gOr2BC56qbRwuouhQGmVeCZu5gQoDEztnDvo2qEUclyiTB8svQssa7JO7MuxAa7XR66JIppvj%2FlMdnLadPxWdAkPKL70ENFhjTShmFksRhDGVyccLROaRBNPSjiHcMQTX%2BSFo2XskBzK0P4EFWdwssahwvmbsn0p%2BdGdO0DFFAw4X44pz%2Fqk%2FUeoiNj19cy9FzupHfm%2F1AHOdwNgPFzDnsiCjttp39%2FXDkPr8%2Fy41Y2HaXWJBmBZg%2BFqWuzY49H549slpwfD%2FiQQTsQXTzD1WktDZoyDbfY7OEanaZa5eOlbjJg5pCOc5DFTvUw9p3ILa%2BDIMo7yUey%2FJV%2FdXzxWXCMkwrr4Z4MJ7woWQsq7Zf3abMVWIa2fa%2FROcnbfNxOM1XQTvqyy3JB5xRvARRYOvwygwGYGo6aQlXkQ%2BHweKWiWeFX3MYH7yjVa2IgXGb6dwWmPiSo3HiML5MwPFX7ClU0KMfbkLH0FI8JSwtm2m9hko8%3Dg)

The basic idea is to extract a subgraph around the candidate link and then classify that subgraph. The nodes in these subgraphs can have a feature vector built from attributes, from embeddings obtained via an encoder, or from both. This effectively translates a link prediction problem into a graph classification problem.

# Dataset
We will use the same PPI network as in the previous notebooks.

In [2]:
import torch
from torch_geometric.datasets import PPI
from torch_geometric.utils import train_test_split_edges

# Load dataset 
PPI.url = "https://data.dgl.ai/dataset/ppi.zip" #  Workaround for wrong URL in pytorch geometric
dataset_ppi = PPI(root="./tmp/ppi") 

# For simplicity, pick the largest graph out of the dataset
data = max(dataset_ppi, key=lambda x: x.num_nodes)

# Remove properties related to node-labeling (not needed here)
data.train_mask = data.val_mask = data.test_mask = data.y = None

# Create train/val/test split
data = train_test_split_edges(data, val_ratio=0.25, test_ratio=0.25,)
#data.x = torch.ones([data.x.shape[0], 2])


Downloading https://data.dgl.ai/dataset/ppi.zip
Extracting tmp/ppi/ppi.zip
Processing...
Done!


# Subgraphs
The SEAL algorithm requires the extraction of subgraphs enclosing links, as well as the distance of every node in the subgraph to each of the two end nodes of the link. This functionality is not yet provided by PyTorch Geometric, so we will use NetworkX for this purpose.

In [3]:
import networkx as nx
from torch_geometric.data import Data
from torch_geometric.utils import to_networkx

# Create a Data object using only the positive training edges
data_pos = Data(edge_index=data.train_pos_edge_index, num_nodes=data.num_nodes)

# Convert this graph to networkx format
G_train_pos=to_networkx(data_pos).to_undirected()




In the SEAL link prediction framework, the nodes in the edge-enclosing subgraph are assigned a structural label according to their distance from the two end nodes of the edge being considered. This label, called the double radius, is defined for node $i$ in the subgraph by

$$ f(i) = 1 + \min(d_x, d_y) + \left\lfloor \frac{d}{2} \right\rfloor \left( \left\lfloor \frac{d}{2} \right\rfloor + (d \bmod 2) - 1 \right) $$

where $x$ and $y$ are the end nodes of the considered edge, $d_x$ is the distance from node $i$ to $x$, $d_y$ is the distance from node $i$ to $y$, and $d = d_x + d_y$. Here $\lfloor d/2 \rfloor$ and $d \bmod 2$ are the integer quotient and remainder of $d$ divided by 2.

If $d_x = \infty$ or $d_y = \infty$ in the subgraph, the node gets a label of 0. The double radius of nodes $x$ and $y$ themselves is 1, which marks them as the target nodes of the link.
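To make the formula concrete, the following sketch evaluates it for a few small distance pairs. It assumes that the division and remainder in the formula are integer operations, which matches the implementation used in the next cell; the helper name `double_radius_label` is introduced here only for illustration.

```python
import math

def double_radius_label(d_x, d_y):
    """Double-radius label of a node at distances d_x and d_y from the two end nodes."""
    if d_x == 0 or d_y == 0:
        return 1  # the node is one of the two end nodes
    if math.isinf(d_x) or math.isinf(d_y):
        return 0  # unreachable from one of the end nodes
    d = d_x + d_y
    return 1 + min(d_x, d_y) + (d // 2) * (d // 2 + d % 2 - 1)

# Print the label for a few (d_x, d_y) pairs
for d_x, d_y in [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3)]:
    print((d_x, d_y), double_radius_label(d_x, d_y))
```

Nodes that are closer to both end nodes receive smaller labels, so the label encodes how central a node is with respect to the candidate link.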

In [4]:
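import numpy as np

def double_radius(d_x, d_y):
    # The two end nodes of the edge always get label 1
    if (d_x == 0) or (d_y == 0):
        return 1

    # Nodes unreachable from one of the end nodes get label 0
    if np.isinf(d_x) or np.isinf(d_y):
        return 0

    d = d_x + d_y
    dr = 1 + min(d_x, d_y) + (d // 2) * (d // 2 + d % 2 - 1)
    return dr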


We need to obtain n-hop subgraphs around each node. We will use n_hops=1.



In [5]:
def create_ego_graphs(G, n_hops):
    dict_ego_graphs= {}
    for v in G.nodes():
        dict_ego_graphs[v] = nx.ego_graph(G,v, n_hops)
    return dict_ego_graphs
  
node_to_nhop_subgraphs = create_ego_graphs(G=G_train_pos,n_hops=1)
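For intuition, an n-hop ego graph contains every node within `n_hops` steps of the centre node. The dependency-free sketch below reproduces that node set with a breadth-first search on a toy adjacency dictionary. The names `toy_adj` and `nodes_within_hops` are illustrative, and the code is not meant to mirror how NetworkX implements `ego_graph`.

```python
from collections import deque

def nodes_within_hops(adj, source, n_hops):
    """Return {node: distance} for all nodes at most n_hops away from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == n_hops:
            continue  # do not expand beyond the hop limit
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Toy path graph 0 - 1 - 2 - 3 - 4
toy_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

print(nodes_within_hops(toy_adj, 2, 1))  # node 2 and its direct neighbours
print(nodes_within_hops(toy_adj, 2, 2))  # the whole path
```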

Once we have the n-hop subgraphs around each node, we can easily compute the double radius of the nodes in each edge-enclosing subgraph. (This is of course trivial for n_hops=1.)

In [6]:
import numpy as np

def get_edge_to_double_radii(node_to_nhop_subgraphs, edge, n_hops):
    v_x, v_y = edge

    # Compute the distance from v_x to each node in its n-hop subgraph
    subgraph_x = node_to_nhop_subgraphs[v_x]
    d_x = nx.single_source_shortest_path_length(subgraph_x, v_x, cutoff=n_hops)
    
    # Compute the distance from v_y to each node in its n-hop subgraph
    subgraph_y = node_to_nhop_subgraphs[v_y]
    d_y = nx.single_source_shortest_path_length(subgraph_y, v_y, cutoff=n_hops)
    
    # Get the union of the node ids for the two subgraphs
    nodes_subgraph = set(d_x.keys()) | set(d_y.keys())

    # Compute the double radius for each node in the union of the two subgraphs
    double_radii = {}

    for v_n in nodes_subgraph:
        d_x_to_n = d_x.get(v_n, np.inf)
        d_y_to_n = d_y.get(v_n, np.inf)
        double_radii[v_n] = double_radius(d_x_to_n, d_y_to_n)
    
    return double_radii


We will store the double radii for each edge-enclosing subgraph


In [7]:
# Helper function to get the double radii for every edge in edge_list
def get_double_radii(edge_list, node_to_nhop_subgraphs, n_hops):
    edge_to_double_radii = {}

    for edge in edge_list:
        if edge[1] > edge[0]:  # Only get enclosing subgraph once
            double_radii = get_edge_to_double_radii(node_to_nhop_subgraphs, edge, n_hops=n_hops)
            edge_to_double_radii[tuple(edge)] = double_radii
    return edge_to_double_radii

In [8]:
edge_to_double_radii_train_pos = get_double_radii(data.train_pos_edge_index.T.numpy(), node_to_nhop_subgraphs, 1)

We will now build the same subgraphs for negative samples. For the sake of speed, we will only use one set of negative training edge samples. We will also have to create subgraphs for the validation and testing sets.

In [9]:
from torch_geometric.utils import negative_sampling
neg_edge_index = negative_sampling(edge_index=data.train_pos_edge_index,
                                    num_nodes=data.x.size(0))

edge_to_double_radii_train_neg = get_double_radii(neg_edge_index.T.numpy(), node_to_nhop_subgraphs, 1)
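Conceptually, negative sampling draws node pairs that are not connected by an existing edge. The following dependency-free sketch illustrates the idea with rejection sampling on a toy graph. It is not PyG's implementation of `negative_sampling`; the function name `sample_negative_edges` is hypothetical, and the sketch assumes an undirected graph without self-loops.

```python
import random

def sample_negative_edges(num_nodes, pos_edges, num_samples, seed=0):
    """Sample distinct unordered node pairs that are not in pos_edges."""
    rng = random.Random(seed)
    existing = {frozenset(e) for e in pos_edges}
    max_pairs = num_nodes * (num_nodes - 1) // 2 - len(existing)
    if num_samples > max_pairs:
        raise ValueError("not enough non-edges to sample from")

    negatives = set()
    while len(negatives) < num_samples:
        u, v = rng.sample(range(num_nodes), 2)  # two distinct nodes
        pair = frozenset((u, v))
        if pair not in existing:  # reject pairs that are real edges
            negatives.add(pair)
    return sorted(tuple(sorted(p)) for p in negatives)

toy_pos_edges = [(0, 1), (1, 2), (2, 3)]
toy_neg_edges = sample_negative_edges(num_nodes=5, pos_edges=toy_pos_edges, num_samples=4)
print(toy_neg_edges)
```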

## Exercise 1

Compute the map from each edge to the double radii of the nodes in its enclosing subgraph for the validation and testing sets.


### Reference Solution

In [10]:
edge_to_double_radii_val_pos = get_double_radii(data.val_pos_edge_index.T.numpy(), node_to_nhop_subgraphs, 1)

edge_to_double_radii_val_neg = get_double_radii(data.val_neg_edge_index.T.numpy(), node_to_nhop_subgraphs, 1)

In [11]:
edge_to_double_radii_test_pos = get_double_radii(data.test_pos_edge_index.T.numpy(), node_to_nhop_subgraphs, 1)

edge_to_double_radii_test_neg = get_double_radii(data.test_neg_edge_index.T.numpy(), node_to_nhop_subgraphs, 1)

# Converting Networkx subgraphs back to Data Objects

Let's recap what we have now. For each edge in the training set, we have extracted an enclosing subgraph. For each node we assigned a structural label called the "double radius". Now we need to translate these subgraphs into a list of PyG Data objects. We will also assign a feature vector made by concatenating a one-hot encoding of the double radius with the original features in the dataset. We will record the existence or non-existence of an edge by assigning a label to these Data objects.
(Note that this may take some time to complete.)


In [14]:
# Helper function to create the dataset
from sklearn.preprocessing import OneHotEncoder
from torch_geometric.utils import subgraph
from tqdm.auto import tqdm  # progress bar used in the loop below

def create_dataset(edge_to_double_radii_pos, edge_to_double_radii_neg, 
                   max_radius, edge_index, device):
    # One-hot encoder covering the labels 0..max_radius
    X = [[i] for i in range(max_radius + 1)] 
    # Note: scikit-learn >= 1.2 names this argument sparse_output
    encoder = OneHotEncoder(sparse=False)
    encoder.fit(X)

    dataset = []

    for graph_label, edge_to_double_radii in [(0, edge_to_double_radii_neg),
                                              (1, edge_to_double_radii_pos)]:
        for edge in tqdm(edge_to_double_radii):

            double_radii_subgraph = edge_to_double_radii[edge] 
            node_ids_subgraph = sorted(double_radii_subgraph.keys())

            # Create subgraph, with nodes relabeled to be contiguous
            edge_index_subgraph, _ = subgraph(node_ids_subgraph, edge_index,
                                              relabel_nodes=True)

            # Convert dict to np.array, in the same order as node_ids_subgraph
            double_radii_subgraph = np.asarray([double_radii_subgraph[key] 
                                                for key in node_ids_subgraph])

            # Create one-hot encoding of the double radii
            struct_features = encoder.transform(double_radii_subgraph.reshape(-1, 1))

            # Concatenate the one-hot encoding with the existing node features
            # (taken from the global `data` object loaded earlier)
            x = torch.cat([torch.tensor(struct_features, dtype=torch.float), 
                           data.x[node_ids_subgraph]], dim=1)

            dataset.append(Data(x=x, edge_index=edge_index_subgraph, 
                                y=torch.tensor([graph_label])).to(device))
    return dataset
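The feature construction inside `create_dataset` boils down to one-hot encoding each node's double radius and prepending it to that node's original features. The sketch below shows this on a made-up four-node subgraph using plain Python lists; the feature values and the `one_hot` helper are illustrative assumptions, not part of the notebook's pipeline.

```python
def one_hot(label, num_classes):
    """Return a list of length num_classes with a 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

max_radius = 2
toy_double_radii = [1, 1, 2, 0]  # structural labels of four nodes
toy_node_features = [[0.5, 0.1], [0.3, 0.9], [0.0, 0.2], [0.7, 0.7]]  # made-up attributes

# One row per node: [one-hot(double radius) | original features]
toy_x = [one_hot(r, max_radius + 1) + feats
         for r, feats in zip(toy_double_radii, toy_node_features)]

for row in toy_x:
    print(row)
```

With `max_radius = 2` each node gains three structural feature columns, so the classifier's input width is the original feature width plus three.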

In [15]:
dataset = create_dataset(edge_to_double_radii_train_pos, edge_to_double_radii_train_neg, 2, data_pos.edge_index, 'cuda')

## Exercise 2
The dataset created in the previous cell is the training dataset. Create the validation and testing dataset using the same procedure.
(You need the edge_to_double_radii dictionaries created in Exercise 1)

In [16]:
dataset_val = create_dataset(edge_to_double_radii_val_pos, edge_to_double_radii_val_neg, 2, data_pos.edge_index, 'cuda')

dataset_test = create_dataset(edge_to_double_radii_test_pos, edge_to_double_radii_test_neg, 2, data_pos.edge_index, 'cuda')

## Handling multiple graphs in a batch
We have now created a list of Data objects called *dataset*. Each Data object contains an edge index, a node feature matrix and a label (check this!). We now need to create a GNN that assigns a label to each subgraph and we will be done with the link-prediction task.

In PyTorch Geometric, multiple graphs can be treated as a single large graph when [batching with a DataLoader object](https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html?highlight=Batch#mini-batches). Message passing between nodes is not affected, simply because nodes in different graphs will not be connected (PyG takes care of assigning appropriate node indices during batching). In order to keep track of which graph a node belongs to, a Data object representing a batch has the *.batch* attribute, which can be used, for example, to pool all the node features in a graph.

In [17]:
from torch_geometric.data import DataLoader
loader = DataLoader(dataset=dataset,batch_size=1000,shuffle=True)

# Example: fetch a single batch
data_batch = next(iter(loader))
print("Graph labels of each node in the batch: ", data_batch.batch)
print("Number of nodes in a batch:", data_batch.num_nodes)
print("Number of edges in a batch:", data_batch.num_edges)

Graph labels of each node in the batch:  tensor([  0,   0,   0,  ..., 999, 999, 999], device='cuda:0')
Number of nodes in a batch: 62669
Number of edges in a batch: 445566
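The batching described above can be pictured as shifting the node indices of each graph by the number of nodes that came before it, and recording a graph id per node. The following plain-Python sketch illustrates that idea and uses the resulting batch vector for mean pooling. It is a conceptual illustration under simplifying assumptions (one scalar feature per node), not PyG's actual implementation; `collate` and `mean_pool` are hypothetical helper names.

```python
def collate(graphs):
    """Merge (num_nodes, edge_list) graphs into one edge list plus a batch vector."""
    edges, batch, offset = [], [], 0
    for graph_id, (num_nodes, edge_list) in enumerate(graphs):
        # Shift node indices so graphs do not collide
        edges.extend((u + offset, v + offset) for u, v in edge_list)
        # Record which graph every node belongs to
        batch.extend([graph_id] * num_nodes)
        offset += num_nodes
    return edges, batch

def mean_pool(values, batch):
    """Average one scalar feature per node over each graph in the batch."""
    num_graphs = max(batch) + 1
    sums = [0.0] * num_graphs
    counts = [0] * num_graphs
    for value, graph_id in zip(values, batch):
        sums[graph_id] += value
        counts[graph_id] += 1
    return [s / c for s, c in zip(sums, counts)]

toy_graphs = [(3, [(0, 1), (1, 2)]), (2, [(0, 1)])]
toy_edges, toy_batch = collate(toy_graphs)
toy_pooled = mean_pool([1.0, 2.0, 3.0, 10.0, 20.0], toy_batch)

print(toy_edges)
print(toy_batch)
print(toy_pooled)
```

Because no edge crosses from one graph's index range into another's, message passing on the merged graph treats every subgraph independently.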


In [18]:
len([1 for d in dataset_val if d.y == 1])

13344

In [19]:
len(dataset)

53393

## Graph Classification GNN

We will now define a simple graph classification GNN. It consists of one convolutional layer, followed by global mean pooling, and finally a linear layer and log_softmax for classification.


In [20]:
from torch_geometric.nn import GCNConv, global_mean_pool
import torch.nn.functional as F

class GraphClassifierNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Input width: one-hot double radius plus original node features
        self.conv1 = GCNConv(dataset[0].num_node_features, 4)
        self.lin = torch.nn.Linear(4, 2)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        # Average node embeddings per subgraph to get one vector per graph
        x = global_mean_pool(x, batch)
        x = self.lin(x)
        return F.log_softmax(x, dim=1)
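The network returns log-probabilities, which pair with a negative log-likelihood loss during training; together the two amount to the usual cross-entropy loss. The small plain-Python sketch below shows the arithmetic for a single two-class prediction. The helper names `log_softmax` and `nll` are illustrative stand-ins written for this example, not the PyTorch functions themselves.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax of a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

toy_logits = [2.0, 0.0]  # made-up scores for classes "no link" and "link"
toy_log_probs = log_softmax(toy_logits)

print([math.exp(lp) for lp in toy_log_probs])  # class probabilities, summing to 1
print(nll(toy_log_probs, target=0))            # small loss: class 0 is favoured
print(nll(toy_log_probs, target=1))            # larger loss: class 1 is disfavoured
```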

## Exercise 3
Train the graph network and compute the AUC on the validation and test sets (use the results from Exercise 1 and Exercise 2). A sample evaluation function is provided below.

In [21]:
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def test(model, dataset):
    model.eval()
    loader = DataLoader(dataset=dataset, batch_size=1000, shuffle=False)
    labels = []
    predicted = []
    for batch in loader:
        # Hard class predictions (argmax); the AUC below is therefore computed
        # from 0/1 predictions rather than from class probabilities
        preds = torch.argmax(model(batch.to('cuda')), dim=1)
        labels.extend(batch.y.cpu().numpy())
        predicted.extend(preds.cpu().numpy())
    return roc_auc_score(labels, predicted)
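The `test` function above passes hard class predictions to `roc_auc_score`. AUC is more commonly computed from continuous scores, and the two can differ noticeably. The plain-Python sketch below illustrates this with a pairwise-ranking definition of AUC on made-up data; `pairwise_auc` is a hypothetical helper written for this example and should agree with scikit-learn's result in the binary case.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_probs = [0.2, 0.3, 0.4, 0.45]             # made-up probabilities of class 1
toy_hard = [int(p > 0.5) for p in toy_probs]  # thresholded, as argmax would do

auc_from_probs = pairwise_auc(toy_labels, toy_probs)
auc_from_hard = pairwise_auc(toy_labels, toy_hard)
print(auc_from_probs, auc_from_hard)
```

In this toy case the scores rank every positive above every negative, yet thresholding maps all of them to class 0 and the ranking information is lost. Passing the class-1 probability (for instance the exponential of the second log-softmax column) instead of the argmax would preserve it.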


## Reference Solution
Here we train the model on the training set and show the train, validation and test AUC

In [22]:

graph_classifier = GraphClassifierNet().to("cuda")
optimizer = torch.optim.Adam(graph_classifier.parameters(), lr=0.01)

for epoch in tqdm(range(1, 21)):
  graph_classifier.train()  # test() switches to eval mode, so reset every epoch
  for batch in loader:  # named `batch` to avoid overwriting the global `data` graph
    optimizer.zero_grad()
    batch = batch.to("cuda")
    log_softmax = graph_classifier(batch)
    nll_loss = F.nll_loss(log_softmax, batch.y)
    nll_loss.backward()
    optimizer.step()
  if epoch % 5 == 0:
    print(f'{epoch} epoches......')
    print('train auc:', test(graph_classifier, dataset))
    print('validation auc:', test(graph_classifier, dataset_val))
    print('test auc:', test(graph_classifier, dataset_test))
    print('loss:', nll_loss.item())  # loss of the last mini-batch in this epoch

5 epoches......
train auc: 0.7799545418253404
validation auc: 0.7755170863309352
test auc: 0.7791891486810552
loss: 0.4622424244880676
10 epoches......
train auc: 0.7907656172586974
validation auc: 0.7867206235011992
test auc: 0.7881819544364508
loss: 0.4829671084880829
15 epoches......
train auc: 0.7955166920176641
validation auc: 0.7903926858513189
test auc: 0.7938024580335732
loss: 0.43833380937576294
20 epoches......
train auc: 0.7961572785791451
validation auc: 0.7918540167865706
test auc: 0.7936900479616306
loss: 0.4544219970703125

