# Graph Convolutional Network for Link Prediction
This notebook demonstrates the training of [Graph Convolutional Networks (GCN)](https://arxiv.org/pdf/1609.02907.pdf) for Link Prediction with TigerGraph. Pytorch Geometric's implementation of GCN is used here. We train the model on the Cora dataset from [PyG datasets](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.Planetoid) with TigerGraph as the data store. The dataset contains 2708 machine learning papers and 10556 citation links between the papers. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from a dictionary. The dictionary consists of 1433 unique words. Each paper is classified into one of seven classes based on the topic. The goal is to predict whether two papers are linked or not.

The following libraries are required to run this notebook. Uncomment to install them if necessary. You might need to restart the kernel after installing.

In [None]:
#!pip install torch==1.12.0 --extra-index-url https://download.pytorch.org/whl/cpu
#!pip install torch-scatter==2.0.9 torch-sparse==0.6.14 torch-cluster==1.6.0 torch-spline-conv==1.2.1 torch-geometric==2.0.4 -f https://data.pyg.org/whl/torch-1.12.0+cpu.html
#!pip install pyTigerGraph[gds]
#!pip install tensorboard # If you use tensorboard for visualization later

**NOTE**: Currently, your database needs to be activated (only once) to enjoy all the functions provided by the ML Workbench. If you are using ML Workbench on Cloud, then the activator is included and you can run the cell below (uncomment first) to activate. For other versions of the Workbench, you can download the activator at https://act.tigergraphlabs.com. Detailed instructions are also included on that website. 

In [None]:
# Uncomment below and fill out the necessary information. For detailed instructions, please see https://act.tigergraphlabs.com
# !mlwb activate [database address] -u [username] -p [password] -s [secret]

## Table of Contents
* [Data Processing](#data_processing)  
* [Whole Graph Training](#train_whole)  
* [Stochastic Batch Training](#train_subgraph) 

## Data Processing <a name="data_processing"></a>

Here we assume the dataset is already ingested into the TigerGraph database. If not, please refer to the  [data ingestion](https://github.com/TigerGraph-DevLabs/mlworkbench-docs/blob/main/tutorials/basics/0_data_ingestion.ipynb) tutorial first.

For each edge, the original dataset include `is_train` and `is_val` attributes. You may add `is_test` if you want the train/validation/test splits. Otherwise, you can just use the edgeSplitter to get train/validation sets.

### Connect to TigerGraph

The `TigerGraphConnection` class represents a connection to the TigerGraph database. Under the hood, it stores the necessary information to communicate with the database. It is able to perform quite a few database tasks. Please see its [documentation](https://docs.tigergraph.com/pytigergraph/current/intro/) for details.

In [1]:
from pyTigerGraph import TigerGraphConnection

conn = TigerGraphConnection(
    host="http://127.0.0.1", # Change the address to your database server's
    graphname="Cora",
    username="tigergraph",
    password="tigergraph",
)

<span style="color:red">Uncomment cell below and run to get and set token if token authentication is enabled</span>. 
* This is required for all databases on tgcloud.
* `<secret>` is your user secret. See https://docs.tigergraph.com/tigergraph-server/current/user-access/managing-credentials#_secrets for details.
* If you don't know your secret, you can use `secret=conn.createSecret()` to create one.

In [None]:
#conn.getToken(<secret>)

In [2]:
conn.getVertexCount('*')

{'Paper': 2708}

In [3]:
conn.getEdgeCount('*')

{'Cite': 10556}

### Train/validation split

Split the edges into 80% train and 20% validation.

In [5]:
%%time
splitter = conn.gds.edgeSplitter(is_train=0.8, is_val=0.2)

Installing and optimizing queries. It might take a minute if this is the first time you use this loader.
Query installation finished.
CPU times: user 228 ms, sys: 39.5 ms, total: 268 ms
Wall time: 49.3 s


In [6]:
%%time
splitter.run()

Splitting edges...
Edge split finished successfully.
CPU times: user 4.73 ms, sys: 945 µs, total: 5.68 ms
Wall time: 72.6 ms


## Train on whole graph <a name="train_whole"></a>

Here, we use the full graph for link prediction. This will **NOT** work when the graph is very large. See the section of Stochastic Mini-Batch Training for real use. However, we still include this example for illustration purposes.

We load the whole graph from TigerGraph which includes the feature and split results.

### Construct graph loader and negative edges

In [7]:
graph_loader = conn.gds.graphLoader(
    num_batches=1,
    v_in_feats = ["x"],
    e_extra_feats=["is_train","is_val"],
    output_format = "PyG")

Installing and optimizing queries. It might take a minute if this is the first time you use this loader.
Query installation finished.


In [8]:
data = graph_loader.data

In [9]:
data

Data(edge_index=[2, 10556], is_train=[10556], is_val=[10556], x=[2708, 1433])

In [10]:
train_edge_index = data.edge_index[:, data.is_train]
val_edge_index = data.edge_index[:, data.is_val]

In [11]:
import torch

neg_val_edge = torch.randint(0, data.x.shape[0], val_edge_index.size(), dtype=torch.long)

In [12]:
train_edge_index.shape, val_edge_index.shape, neg_val_edge.shape

(torch.Size([2, 8454]), torch.Size([2, 2102]), torch.Size([2, 2102]))

### Construct GCN Model

We use dot product to measure the similarity of two nodes in a decode function.

In [13]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv


class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers, dropout, **kwargs):
        super(GCN, self).__init__()
        self.convs = torch.nn.ModuleList()
        self.convs.append(GCNConv(in_channels, hidden_channels))
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_channels, hidden_channels))
        self.convs.append(GCNConv(hidden_channels, out_channels))
        self.dropout = dropout

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()

    def forward(self, x, adj_t):
        for i, conv in enumerate(self.convs[:-1]):
            x = conv(x, adj_t)
            x = F.relu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.convs[-1](x, adj_t)
        return x

    def decode(self, z, pos_edge_index, neg_edge_index):
        edge_index = torch.cat([pos_edge_index, neg_edge_index], dim=-1) # concatenate pos and neg edges
        logits = (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)  # dot product 
        return logits


### Get binary labels for positive and negative edges

In [14]:
def get_link_labels(pos_edge_index, neg_edge_index):
    E = pos_edge_index.size(1) + neg_edge_index.size(1)
    link_labels = torch.zeros(E, dtype=torch.float)
    link_labels[:pos_edge_index.size(1)] = 1.
    return link_labels

### Define Hyperparameters

In [15]:
# Hyperparameters
hp = {"hidden_dim": 128, "out_dim": 64, "num_layers": 2,
      "dropout": 0.6, "lr": 0.01, "l2_penalty": 5e-4}

### Instantiate Model and optimizer

In [16]:
model = GCN(1433, hp["hidden_dim"], hp["out_dim"], hp["num_layers"], hp["dropout"])
optimizer = torch.optim.Adam(
    model.parameters(), lr=hp["lr"], weight_decay=hp["l2_penalty"]
)

In [17]:
val_labels = get_link_labels(val_edge_index, neg_val_edge)
val_labels

tensor([1., 1., 1.,  ..., 0., 0., 0.])

### Train the model

In [18]:
from sklearn.metrics import roc_auc_score

In [19]:
for epoch in range(30):
    model.train()
    neg_train_edge = torch.randint(0, data.x.shape[0], train_edge_index.size(), dtype=torch.long)
    h = model(data.x.float(), train_edge_index)
    logits = model.decode(h, train_edge_index, neg_train_edge)
    labels = get_link_labels(train_edge_index, neg_train_edge)
    loss = F.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    model.eval()
    with torch.no_grad():
        val_logits = model.decode(h, val_edge_index, neg_val_edge)
        val_logits = val_logits.sigmoid()
        print('Epoch: {}, training loss: {}, valid roc_auc_score: {}'.format(epoch, loss.item(), roc_auc_score(val_labels, val_logits)))

Epoch: 0, training loss: 0.6500421166419983, valid roc_auc_score: 0.8383314427562532
Epoch: 1, training loss: 1.081446886062622, valid roc_auc_score: 0.812061323500522
Epoch: 2, training loss: 1.1687164306640625, valid roc_auc_score: 0.7562404886470317
Epoch: 3, training loss: 0.6771160364151001, valid roc_auc_score: 0.8145703742799436
Epoch: 4, training loss: 0.6512936949729919, valid roc_auc_score: 0.8144599271592186
Epoch: 5, training loss: 0.6514720320701599, valid roc_auc_score: 0.8029452490084655
Epoch: 6, training loss: 0.6470925211906433, valid roc_auc_score: 0.7915663664979481
Epoch: 7, training loss: 0.6444706916809082, valid roc_auc_score: 0.7924901163406516
Epoch: 8, training loss: 0.6450539827346802, valid roc_auc_score: 0.8071957657108768
Epoch: 9, training loss: 0.6447546482086182, valid roc_auc_score: 0.7999462022938599
Epoch: 10, training loss: 0.6415925025939941, valid roc_auc_score: 0.7999011634065152
Epoch: 11, training loss: 0.6420110464096069, valid roc_auc_score:

## Stochastic Batch Training <a name="train_subgraph"></a>

For stochastic batch training, we split the training edges into batches. At each specific batch, to do the link prediction, we need to know the neighbor graphs for each pair of nodes that has an edge.

We use the edgeNeighborLoader, which can load the neighbors of the pair nodes of an edge and has the same parameters as neighborLoader(). The result of a batch is, for example,

`Data(edge_index=[2, 6917], is_train=[6917], is_val=[6917], is_test=[6917], is_seed=[6917], x=[2188, 1433], y=[2188])`

where `is_seed` indicates whether each edge is a seed edge or not


In [20]:
# Hyperparameters
hp = {"hidden_dim": 128, "out_dim": 64, "num_layers": 2,
      "dropout": 0.6, "lr": 0.01, "l2_penalty": 5e-4}

In [21]:
model = GCN(1433, hp["hidden_dim"], hp["out_dim"], hp["num_layers"], hp["dropout"])
optimizer = torch.optim.Adam(
    model.parameters(), lr=hp["lr"], weight_decay=hp["l2_penalty"]
)

### Construct the edge_neighbor_loader for train/val edges

In [22]:
train_edge_neighbor_loader = conn.gds.edgeNeighborLoader(
    v_in_feats=["x"],
    v_out_labels=["y"],
    num_batches=5,
    e_extra_feats=["is_train","is_val"],
    output_format="PyG",
    num_neighbors=10,
    num_hops=2,
    filter_by="is_train",
    shuffle=False,
)

Installing and optimizing queries. It might take a minute if this is the first time you use this loader.
Query installation finished.


In [23]:
val_edge_neighbor_loader = conn.gds.edgeNeighborLoader(
    v_in_feats=["x"],
    v_out_labels=["y"],
    num_batches=5,
    e_extra_feats=["is_train","is_val"],
    output_format="PyG",
    num_neighbors=10,
    num_hops=2,
    filter_by="is_val",
    shuffle=False,
)

In [None]:
for epoch in range(10):
    model.train()
    total_loss = 0
    for bid, batch in enumerate(train_edge_neighbor_loader):
        # get the training edges and negative edges sampled in the same batch
        train_edges = batch.edge_index[:, batch.is_seed]
        neg_train_edges = torch.randint(0, batch.x.shape[0], train_edges.size(), dtype=torch.long)
        # The graph only include the edges whose is_train is True
        train_graph_edges = batch.edge_index[:, batch.is_train]
        h = model(batch.x.float(), train_graph_edges)
        logits = model.decode(h, train_edges, neg_train_edges)
        labels = get_link_labels(train_edges, neg_train_edges)
        loss = F.binary_cross_entropy_with_logits(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    model.eval()
    all_labels = []
    all_logits = []
    for batch in val_edge_neighbor_loader:
        val_edges = batch.edge_index[:, batch.is_seed]
        neg_val_edges = torch.randint(0, batch.x.shape[0], val_edges.size(), dtype=torch.long)
        # Need to use the train edge for GCN
        val_graph_edges = batch.edge_index[:, batch.is_train]
        with torch.no_grad():
            h = model(batch.x.float(), val_graph_edges)
            logits = model.decode(h, val_edges, neg_val_edges)
            labels = get_link_labels(val_edges, neg_val_edges)
            logits = logits.sigmoid()
            all_labels.extend(labels)
            all_logits.extend(logits)
    print('Epoch: {}, training loss: {}, valid roc_auc_score: {}'.format(epoch, total_loss, roc_auc_score(all_labels, all_logits)))
    

Epoch: 0, training loss: 3.904411494731903, valid roc_auc_score: 0.8237392959086584
Epoch: 1, training loss: 3.1914963126182556, valid roc_auc_score: 0.8996884395360859
Epoch: 2, training loss: 2.9413991570472717, valid roc_auc_score: 0.9132218330419763
Epoch: 3, training loss: 2.5968366861343384, valid roc_auc_score: 0.9252033087060395
Epoch: 4, training loss: 2.427817314863205, valid roc_auc_score: 0.9299228861824314
Epoch: 5, training loss: 2.3323494493961334, valid roc_auc_score: 0.9444609410999989
Epoch: 6, training loss: 2.3284645080566406, valid roc_auc_score: 0.9524406324093495
Epoch: 7, training loss: 2.2777881622314453, valid roc_auc_score: 0.9523057420733823
Epoch: 8, training loss: 2.2383748292922974, valid roc_auc_score: 0.9618454989629739
Epoch: 9, training loss: 2.2285644710063934, valid roc_auc_score: 0.9601150098542369
Epoch: 10, training loss: 2.2250851690769196, valid roc_auc_score: 0.9643573788182338
