# GraphSAGE
This notebook demonstrates the training of [GraphSAGE models](https://arxiv.org/abs/1706.02216) with TigerGraph. [Spektral](https://graphneural.network/)'s implementation of GraphSAGE, built on TensorFlow/Keras, is used here. We train the model on the Cora dataset from [PyG datasets](https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.Planetoid) with TigerGraph as the data store. The dataset contains 2708 machine learning papers and 10556 citation links between the papers. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from a dictionary. The dictionary consists of 1433 unique words. Each paper is classified into one of seven classes based on the topic. The goal is to predict the class of each vertex in the graph.

The following libraries are required to run this notebook. Uncomment to install them if necessary. You might need to restart the kernel after installing.

In [None]:
#!pip install spektral
#!pip install pyTigerGraph
#!pip install tensorboard # If you use tensorboard for visualization later

**NOTE**: Currently, your database needs to be activated (only once) to enjoy all the functions provided by the ML Workbench. If you are using ML Workbench on Cloud, then the activator is included and you can run the cell below (uncomment first) to activate. For other versions of the Workbench, you can download the activator at https://act.tigergraphlabs.com. Detailed instructions are also included on that website. 

In [None]:
# Uncomment below and fill out the necessary information. For detailed instructions, please see https://act.tigergraphlabs.com
# !mlwb activate [database address] -u [username] -p [password] -s [secret]

## Table of Contents
* [Data Processing](#data_processing)  
* [Train on whole graph](#train_whole)  
* [Train on neighborhood subgraphs](#train_subgraph) 
* [Inference](#inference) 

## Data Processing <a name="data_processing"></a>

Here we assume the dataset has already been ingested into the TigerGraph database. If not, please refer to the [data ingestion](https://github.com/TigerGraph-DevLabs/mlworkbench-docs/blob/main/tutorials/basics/0_data_ingestion.ipynb) tutorial first. Since the dataset already comes with a train/validation/test split of its vertices, we don't need to split it again. We still include the splitting code below for general use cases.

### Connect to TigerGraph

The `TigerGraphConnection` class represents a connection to the TigerGraph database. Under the hood, it stores the necessary information to communicate with the database. It is able to perform quite a few database tasks. Please see its [documentation](https://docs.tigergraph.com/pytigergraph/current/intro/) for details.

**Note**: Secret instead of username/password is required for TG cloud DBs created after 7/5/2022. Otherwise, you can leave it blank.

In [None]:
from pyTigerGraph import TigerGraphConnection

conn = TigerGraphConnection(
    host="http://127.0.0.1", # Change the address to your database server's
    graphname="Cora",
    username="tigergraph",
    password="tigergraph",
    gsqlSecret="" # secret instead of user/pass is required for TG cloud DBs created after 7/5/2022  
)

<span style="color:red">Uncomment cell below and run to get and set token if token authentication is enabled</span>. 
* This is required for all databases on tgcloud.
* `<secret>` is your user secret. See https://docs.tigergraph.com/tigergraph-server/current/user-access/managing-credentials#_secrets for details.
* If you don't know your secret, you can use `secret=conn.createSecret()` to create one.

In [None]:
#conn.getToken(<secret>)

In [None]:
conn.getVertexCount('*')

In [None]:
conn.getEdgeCount()

### Train/validation/test split

In [None]:
# The code in this cell is commented out because there is no need to split the vertices into 
# training/validation/test sets, as the split is already done in the original dataset. 
# See notebook 1_data_processing for examples on the split function.

#split = conn.gds.vertexSplitter(train_mask=0.8, val_mask=0.1, test_mask=0.1)
#split.run()

In [None]:
print(
    "Number of vertices in training set:",
    conn.getVertexCount("Paper", where="train_mask!=0"),
)
print(
    "Number of vertices in validation set:",
    conn.getVertexCount("Paper", where="val_mask!=0"),
)
print(
    "Number of vertices in test set:", 
    conn.getVertexCount("Paper", where="test_mask!=0"),
)
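
For intuition, a random vertex split with the same 80/10/10 proportions as the commented-out `vertexSplitter` call can be sketched in NumPy. This is only an illustration of the idea, not TigerGraph's implementation; the function name `random_split_masks` and the fixed seed are assumptions made for this example.

```python
import numpy as np

def random_split_masks(num_vertices, train=0.8, val=0.1, seed=0):
    """Assign every vertex to exactly one of train/val/test at random."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_vertices)
    n_train = int(train * num_vertices)
    n_val = int(val * num_vertices)
    masks = np.zeros((3, num_vertices), dtype=bool)
    masks[0, order[:n_train]] = True                 # training vertices
    masks[1, order[n_train:n_train + n_val]] = True  # validation vertices
    masks[2, order[n_train + n_val:]] = True         # remaining vertices go to test
    return masks

train_mask, val_mask, test_mask = random_split_masks(2708)
print(train_mask.sum(), val_mask.sum(), test_mask.sum())
```

Each vertex ends up in exactly one set, which is the property the `train_mask`, `val_mask`, and `test_mask` attributes are expected to have.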

## Train on whole graph <a name="train_whole"></a>
We first train the model on the whole graph. This will **NOT** work when the graph is large, because the entire graph has to be loaded into memory at once. See [Train on neighborhood subgraphs](#train_subgraph) for real use. We still include this example for illustration. Hyperparameters for the model and training environment are defined below.

In [None]:
# Hyperparameters
hp = {"hidden_dim": 64, 
      "num_layers": 2, 
      "dropout": 0.6,
      "lr": 0.001, 
      "l2_penalty": 5e-4}

### Construct graph loader

The `GraphLoader` can fetch the whole graph from the database all at once (`num_batches=1`). See the tutorial on dataloaders for details.

In [None]:
graph_loader = conn.gds.graphLoader(
    v_in_feats=["x"],
    v_out_labels=["y"],
    v_extra_feats=["train_mask", "val_mask", "test_mask"],
    num_batches=1,
    output_format="spektral",
    shuffle=False
)

In [None]:
# Get the whole graph from the loader in Spektral format
data = graph_loader.data

data

In [None]:
x, adj, y, mask_tr, mask_va, mask_te = data.x, data.A, data.y, data.train_mask, data.val_mask, data.test_mask

### Construct model and optimizer

We build a GraphSAGE model with 2 convolutional layers, and use the Adam optimizer with the learning rate defined in `hp` (0.001).

In [None]:
import numpy as np
import tensorflow as tf

from spektral.layers import GraphSageConv
from tensorflow.keras.layers import Dropout, Input
from tensorflow.keras.losses import CategoricalCrossentropy
from tensorflow.keras.metrics import categorical_accuracy
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
from tensorflow.keras.utils import to_categorical

In [None]:
device = tf.device("GPU" if tf.config.list_physical_devices('GPU') else "CPU")

l2_penalty = hp["l2_penalty"]
x_in = Input(shape=(data.n_node_features,))
a_in = Input(shape=(None,), sparse=True)

sage1 = GraphSageConv(
    channels=hp["hidden_dim"],
    aggregate_op="mean",
    activation="relu",
    kernel_regularizer=l2(l2_penalty),  # adds the L2 term to model.losses
    )([x_in, a_in])

x_2 = Dropout(hp["dropout"])(sage1)
sage2 = GraphSageConv(
    channels=7,  # one output per Cora class
    aggregate_op="mean",
    activation="softmax",
    )([x_2, a_in])  # feed the dropout output (not sage1) so dropout takes effect

model = Model(inputs=[x_in, a_in], outputs=sage2)
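
As a rough guide to what each GraphSAGE layer computes, the sketch below implements one mean-aggregator layer in NumPy, following the formulation in the GraphSAGE paper: average the neighbor features, concatenate them with the node's own features, apply a linear map and a nonlinearity, then L2-normalize each row. It is a simplified dense illustration and not Spektral's code, which may differ in details such as the order of normalization and activation. The name `sage_mean_layer` and the toy inputs are made up for this example.

```python
import numpy as np

def sage_mean_layer(x, adj, weight):
    """One GraphSAGE layer with a mean aggregator (dense, for illustration).

    x:      (n_nodes, n_feats) node features
    adj:    (n_nodes, n_nodes) 0/1 adjacency matrix
    weight: (2 * n_feats, n_out) weights applied to [self, neighbor mean]
    """
    degree = adj.sum(axis=1, keepdims=True)
    # Mean of neighbor features; isolated nodes get a zero vector.
    neigh_mean = (adj @ x) / np.maximum(degree, 1)
    h = np.concatenate([x, neigh_mean], axis=1) @ weight
    h = np.maximum(h, 0)  # ReLU
    norm = np.linalg.norm(h, axis=1, keepdims=True)
    return h / np.maximum(norm, 1e-12)  # row-wise L2 normalization

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path graph 0-1-2
x = rng.random((3, 4))
out = sage_mean_layer(x, adj, rng.random((8, 2)))
print(out.shape)
```

Stacking two such layers lets each vertex see information from its 2-hop neighborhood, which is why the model above uses two convolutional layers.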

In [None]:
model.summary()

In [None]:
optimizer = Adam(learning_rate=hp["lr"])
loss_fn = CategoricalCrossentropy()

one_hot_y = to_categorical(y)
# Convert the SciPy sparse (COO) adjacency matrix to a TensorFlow sparse tensor
tf_a = tf.SparseTensor(
    indices=np.array([adj.row, adj.col]).T,
    values=adj.data,
    dense_shape=adj.shape)
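
The conversion above relies on the adjacency matrix being stored in coordinate (COO) form, where `row[k]`, `col[k]`, and `data[k]` describe the k-th nonzero entry. `tf.SparseTensor` expects `indices` with one `(row, col)` pair per nonzero, which is why the two arrays are stacked and transposed. A NumPy-only sketch on a made-up 3-vertex graph shows the layout:

```python
import numpy as np

# Toy undirected graph with edges 0-1 and 1-2, stored as COO triplets.
row = np.array([0, 1, 1, 2])
col = np.array([1, 0, 2, 1])
values = np.ones(4, dtype=np.float32)

# Same stacking as in the cell above: one (row, col) pair per nonzero entry.
indices = np.array([row, col]).T

# Rebuild the dense matrix from the triplets to check the layout.
dense = np.zeros((3, 3), dtype=np.float32)
dense[indices[:, 0], indices[:, 1]] = values
print(indices.shape)
print(dense)
```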

### Train the model

In [None]:
from datetime import datetime

In [None]:
# Training step
@tf.function
def train():
    with tf.GradientTape() as tape:
        predictions = model([x, tf_a], training=True)
        loss = loss_fn(tf.boolean_mask(one_hot_y, mask_tr), tf.boolean_mask(predictions, mask_tr))
        loss += sum(model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

In [None]:
@tf.function
def evaluate():
    predictions = model([x, tf_a], training=False)
    losses = []
    accuracies = []
    for mask in [mask_tr, mask_va, mask_te]:
        loss = loss_fn(tf.boolean_mask(one_hot_y, mask), tf.boolean_mask(predictions, mask))
        loss += sum(model.losses)
        losses.append(loss)
        acc = tf.reduce_mean(categorical_accuracy(tf.boolean_mask(one_hot_y, mask), tf.boolean_mask(predictions, mask)))
        accuracies.append(acc)
    return losses, accuracies
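
Both `train` and `evaluate` score only the vertices selected by a mask, so the model sees the whole graph while the loss and accuracy come from one subset. The NumPy sketch below reproduces that masked computation on made-up values. The helper `masked_metrics` is hypothetical, and its result should match the Keras loss only up to Keras's internal clipping and normalization.

```python
import numpy as np

def masked_metrics(one_hot_y, probs, mask, eps=1e-7):
    """Cross-entropy and accuracy over the rows where mask is True."""
    y = one_hot_y[mask]
    p = np.clip(probs[mask], eps, 1.0)
    loss = -np.mean(np.sum(y * np.log(p), axis=1))
    acc = np.mean(y.argmax(axis=1) == p.argmax(axis=1))
    return loss, acc

one_hot_y = np.eye(3)[[0, 1, 2, 1]]          # labels 0, 1, 2, 1
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],           # misclassified, but masked out below
                  [0.2, 0.5, 0.3]])
mask = np.array([True, True, False, True])
loss, acc = masked_metrics(one_hot_y, probs, mask)
print(f"loss={loss:.4f}, acc={acc:.2f}")
```

The third vertex is misclassified, yet it does not affect the result because the mask excludes it.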

In [None]:
epochs = 20
for epoch in range(1, epochs + 1):
    train()
    l, a = evaluate()
    print(
        "Epoch {:.0f}:\n"
        "Training Loss: {:.4f}, Training Accuracy: {:.4f}, "
        "Validation Loss: {:.4f}, Validation Accuracy: {:.4f}".format(epoch, l[0], a[0], l[1], a[1])
    )

### Test the model

In [None]:
l, a = evaluate()

print("Testing Loss: {:.4f}, Testing Accuracy: {:.4f}".format(l[2], a[2]))

## Train on Neighborhood Subgraphs <a name="train_subgraph"></a>
Alternatively, we can train the model on neighborhood subgraphs. Each subgraph contains the 2-hop neighborhood of a batch of seed vertices. This method allows us to train the model on graphs much larger than the Cora dataset, because the whole graph is never loaded into memory at once.

We keep the same model hyperparameters as before (with the learning rate raised to 0.01) and add `batch_size`, `num_neighbors`, and `num_hops` to configure the `NeighborLoader`, which loads the subgraphs. Once we finish iterating over all the subgraphs generated by the loader, every vertex in the graph is guaranteed to have been covered (except for those filtered out by a user-provided mask).

In [None]:
# Hyperparameters
hp = {"batch_size": 64, 
      "num_neighbors": 10, 
      "num_hops": 2, 
      "hidden_dim": 64,
      "num_layers": 2, 
      "dropout": 0.6, 
      "lr": 0.01, 
      "l2_penalty": 5e-4}

### Construct neighborhood subgraph loader

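We use 3 subgraph loaders in total. The `train_loader` and `valid_loader` are constructed here, and the `test_loader` is constructed later in the testing section. The `train_loader` only uses vertices in the training set as seeds, the `valid_loader` only uses vertices in the validation set, and the `test_loader` only uses vertices in the test set.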

In [None]:
train_loader = conn.gds.neighborLoader(
    v_in_feats=["x"],
    v_out_labels=["y"],
    v_extra_feats=["train_mask","val_mask","test_mask"],
    output_format="spektral",
    batch_size=hp["batch_size"],
    num_neighbors=hp["num_neighbors"],
    num_hops=hp["num_hops"],
    shuffle=True,
    filter_by="train_mask",
)

In [None]:
valid_loader = conn.gds.neighborLoader(
    v_in_feats=["x"],
    v_out_labels=["y"],
    v_extra_feats=["train_mask","val_mask","test_mask"],
    output_format="spektral",
    batch_size=hp["batch_size"],
    num_neighbors=hp["num_neighbors"],
    num_hops=hp["num_hops"],
    shuffle=False,
    filter_by="val_mask",
)
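
To make `num_neighbors` and `num_hops` concrete, here is a conceptual sketch of neighborhood sampling in plain Python. The real sampling is performed by the loader inside TigerGraph and may differ in its details. The function `sample_neighborhood` and the toy graph are assumptions for illustration only.

```python
import random

def sample_neighborhood(adj_list, seeds, num_neighbors, num_hops, seed=0):
    """Collect the vertices reached from the seeds by sampling up to
    num_neighbors neighbors per vertex, repeated for num_hops hops."""
    rng = random.Random(seed)
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(num_hops):
        next_frontier = []
        for v in frontier:
            neighbors = adj_list.get(v, [])
            k = min(num_neighbors, len(neighbors))
            for u in rng.sample(neighbors, k):
                if u not in visited:
                    visited.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return visited

# Toy path graph 0-1-2-3-4; two hops from vertex 0 reach vertices 1 and 2.
adj_list = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
subgraph = sample_neighborhood(adj_list, seeds=[0], num_neighbors=10, num_hops=2)
print(sorted(subgraph))
```

Capping the number of sampled neighbors per vertex bounds the size of each subgraph, which keeps the memory use per batch predictable even when some vertices have very high degree.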

### Construct model and optimizer

We build a GraphSAGE model with 2 convolutional layers, and use the Adam optimizer with a learning rate of 0.01.

In [None]:
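# Note: x_in, a_in and sage2 were created in the whole-graph section, so this
# model shares those layers and starts from their current (already trained) weights.
model = Model(inputs=[x_in, a_in], outputs=sage2)
model.summary()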

In [None]:
optimizer = Adam(learning_rate=hp["lr"])
loss_fn = CategoricalCrossentropy()

### Train the model

In [None]:
def preprocess_batch(graph):
    """Return features, a TensorFlow sparse adjacency tensor, and one-hot labels."""
    x, adj, y = graph.x, graph.A, graph.y
    # Fix the number of classes so that a batch missing some class still yields 7 columns
    one_hot_y = to_categorical(y, num_classes=7)
    # Convert the SciPy sparse matrix to a TensorFlow sparse tensor
    tf_a = tf.SparseTensor(
        indices=np.array([adj.row, adj.col]).T,
        values=adj.data,
        dense_shape=adj.shape)
    return x, tf_a, one_hot_y

val_acc_metric = tf.keras.metrics.CategoricalAccuracy()
val_loss_metric = tf.keras.metrics.CategoricalCrossentropy()

In [None]:
@tf.function
def train(x, tf_a, one_hot_y, mask):
    with tf.GradientTape() as tape:
        predictions = model([x, tf_a], training=True)
        loss = loss_fn(tf.boolean_mask(one_hot_y, mask), tf.boolean_mask(predictions, mask))
        acc = tf.reduce_mean(categorical_accuracy(tf.boolean_mask(one_hot_y, mask), tf.boolean_mask(predictions, mask)))
        loss += sum(model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss, acc

In [None]:
@tf.function
def test_step(x, tf_a, one_hot_y, mask, metrics=()):
    # Update each metric using only the vertices selected by the mask
    val_logits = model([x, tf_a], training=False)
    for metric in metrics:
        metric.update_state(tf.boolean_mask(one_hot_y, mask), tf.boolean_mask(val_logits, mask))

In [None]:
for epoch in range(10):
    for bid, batch in enumerate(train_loader):
        x, tf_a, one_hot_y = preprocess_batch(batch)
        loss, acc = train(x, tf_a, one_hot_y, batch.train_mask)
        print("Epoch {}, Train Batch {}, Loss {:.4f}, Accuracy {:.4f}".format(epoch, bid, loss, acc))
    for batch in valid_loader:
        x, tf_a, one_hot_y = preprocess_batch(batch)
        test_step(x, tf_a, one_hot_y, batch.val_mask, metrics = [val_acc_metric, val_loss_metric])
    val_acc = val_acc_metric.result()
    val_loss = val_loss_metric.result()
    val_acc_metric.reset_states()
    val_loss_metric.reset_states()
    print("Epoch {}, Valid Loss {:.4f}, Valid Accuracy {:.4f}".format(epoch, val_loss, val_acc))
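
The validation metrics are accumulated across all batches of an epoch and reset afterwards, so the reported accuracy is weighted by how many vertices each batch contributes. The small class below mimics that update/result/reset cycle in plain Python. `RunningAccuracy` is a hypothetical stand-in written for this example, not part of the Keras API.

```python
class RunningAccuracy:
    """Accumulates correct/total counts across batches."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.correct = 0
        self.total = 0

    def update(self, labels, predictions):
        self.correct += sum(int(y == p) for y, p in zip(labels, predictions))
        self.total += len(labels)

    def result(self):
        return self.correct / self.total if self.total else 0.0

metric = RunningAccuracy()
metric.update([0, 1, 2], [0, 1, 1])   # batch 1: 2 of 3 correct
metric.update([2], [2])               # batch 2: 1 of 1 correct
epoch_acc = metric.result()
metric.reset()                        # start the next epoch from zero
print(epoch_acc)
```

Averaging the two per-batch accuracies instead would give a different number, because the second batch holds a single vertex.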

### Visualize training status

We can use tensorboard to visualize and track training status. If you are running this notebook on ML Workbench Cloud, please go to the Tensorboards app to start a tensorboard server (see [doc](https://docs.tigergraph.com/ml-workbench/current/on-cloud/tensorboard) for details), and you can skip the rest of this section.

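Otherwise, uncomment and run the code below to start a tensorboard server locally in the background, reading from the `logs` directory. If there is already a tensorboard server running, skip the cell below or you will get an error complaining that the port is in use. Note that the training loop above does not write any summaries, so plots only appear if training metrics are logged to `logs`.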

In [None]:
# import os
# os.system("tensorboard --logdir logs --port 6006 --bind_all &")

Once the tensorboard service is running, go to `localhost:6006` in your browser and you should see all the pretty plots as below. For details on using tensorboard, please refer to its [official doc](https://www.tensorflow.org/tensorboard/get_started).

![Screen Shot 2022-02-09 at 5.02.00 PM.png](attachment:18fe3102-f23f-4e44-bb20-24e1a21cbb00.png)

### Test the model

In [None]:
test_loader = conn.gds.neighborLoader(
    v_in_feats=["x"],
    v_out_labels=["y"],
    v_extra_feats=["train_mask","val_mask","test_mask"],
    output_format="spektral",
    batch_size=hp["batch_size"],
    num_neighbors=hp["num_neighbors"],
    num_hops=hp["num_hops"],
    shuffle=False,
    filter_by="test_mask",
)

In [None]:
acc = tf.keras.metrics.CategoricalAccuracy()
for batch in test_loader:
    x, tf_a, one_hot_y = preprocess_batch(batch)
    test_step(x, tf_a, one_hot_y, batch.test_mask, metrics = [acc])
print("Accuracy: {:.4f}".format(acc.result()))
acc.reset_states()

## Inference <a name="inference"></a>

Finally, we use the trained model for node classification. At this stage, we typically do inference/prediction for specific nodes instead of random batches, so we will create a new data loader.

In [None]:
infer_loader = conn.gds.neighborLoader(
    v_in_feats=["x"],
    v_out_labels=["y"],
    v_extra_feats=["train_mask","val_mask","test_mask"],
    output_format="spektral",
    num_neighbors=hp["num_neighbors"],
    num_hops=hp["num_hops"],
    shuffle=False,
)

In [None]:
# Fetch specific nodes by their IDs and do prediction. 
# Each node is represented by a dict with two mandatory keys: primary_id and type.
input_nodes = [{"primary_id": 7, "type": "Paper"}, 
               {"primary_id": 999, "type": "Paper"}]
data = infer_loader.fetch(input_nodes)

In [None]:
# The returned data are the neighborhood subgraphs of the input nodes.
# The original IDs of the nodes in the subgraphs are stored in the 
# `primary_id` attribute.
data

In [None]:
# Predict. Predictions for both the input nodes and others in their 
# neighborhoods are generated.
x, tf_a, one_hot_y = preprocess_batch(data)
pred = model([x, tf_a], training=False)
print("ID: Label")
for i,j in zip(data.primary_id, pred):
    print("{}:{}".format(i, tf.math.argmax(j)))
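
Because predictions are produced for every vertex in the fetched neighborhoods, it is often convenient to keep only the rows for the requested vertices. The sketch below assumes, as the loop above implies, that the ID sequence is aligned row by row with the prediction matrix. The helper `predictions_for` and the sample values are made up. IDs are compared as strings here because their type in the loader output may differ from the type used in the request.

```python
import numpy as np

def predictions_for(requested_ids, primary_ids, probs):
    """Map each requested vertex ID to its predicted class index."""
    labels = np.asarray(probs).argmax(axis=1)
    lookup = {str(pid): int(label) for pid, label in zip(primary_ids, labels)}
    return {rid: lookup[str(rid)] for rid in requested_ids}

# Made-up stand-ins for data.primary_id and the model output.
primary_ids = ["7", "999", "12", "345"]
probs = np.array([[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.3, 0.7]])
result = predictions_for([7, 999], primary_ids, probs)
print(result)
```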