# GNN Inference on Google Vertex AI using TigerGraph

In this notebook, we train a Graph Attention Network (GAT) on the Cora graph stored in TigerGraph, package it in a Docker container, and deploy it to Google Vertex AI as an inference endpoint.

## Setup

We first create a working directory for the model code, configuration, and weights. Passing `exist_ok=True` makes the cell safe to re-run when the directory already exists.

In [3]:
import os

source_directory = "gat_cora"

os.makedirs(source_directory, exist_ok=True)

## Define The Model

We define a Graph Attention Network (GAT) model and write it to a file called `model.py`.

In [4]:
%%writefile $source_directory/model.py

import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class GAT(torch.nn.Module):
    def __init__(
        self, num_features, num_layers, out_dim, dropout, hidden_dim, num_heads
    ):
        super().__init__()
        self.dropout = dropout
        self.layers = torch.nn.ModuleList()
        for i in range(num_layers):
            in_units = num_features if i == 0 else hidden_dim * num_heads
            out_units = out_dim if i == (num_layers - 1) else hidden_dim
            heads = 1 if i == (num_layers - 1) else num_heads
            self.layers.append(
                GATConv(in_units, out_units, heads=heads, dropout=dropout)
            )

    def reset_parameters(self):
        for layer in self.layers:
            layer.reset_parameters()

    def forward(self, data):
        x, edge_index = data.x.float(), data.edge_index
        for layer in self.layers[:-1]:
            x = layer(x, edge_index)
            x = F.elu(x)
            x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.layers[-1](x, edge_index)
        return x

Overwriting gat_cora/model.py


## Model Parameters

Here, we define a dictionary holding the parameters for the model, the data loaders, the optimizer, and the database connection.

In [5]:
parameters = {
    "model_name": "GAT",
    "model_config": {
        "num_features": 1433, # Number of features on Cora vertices 
        "out_dim": 7,         # Number of classes in Cora
        "num_heads": 8,       # Number of attention heads in GAT model
        "hidden_dim": 8,      # Number of hidden units in GAT model
        "num_layers": 2,      # Number of GAT layers in GAT model
        "dropout": 0.6        # Dropout probability in GAT model
    },
    "infer_loader_config": {
        "v_in_feats": ["x"],     # List of vertex features to be loaded
        "v_out_labels": ["y"],   # List of vertex labels to be loaded
        "v_extra_feats": ["train_mask","val_mask","test_mask"],     # Split masks; the model itself does not read them at inference time
        "output_format": "PyG",  # Using PyTorch Geometric format
        "batch_size": 64,        # Batch size for inference
        "num_neighbors": 10,     # Number of neighbors per vertex
        "num_hops": 2,           # How deep to go in the graph
        "shuffle": False         # Don't shuffle the data
    },
    "training_loader_config": {
        "v_in_feats": ["x"],
        "v_out_labels": ["y"],
        "v_extra_feats": ["train_mask","val_mask","test_mask"],
        "output_format": "PyG",
        "batch_size": 64, 
        "num_neighbors": 10, 
        "num_hops": 2,
        "shuffle": True
    },
    "optimizer_config": {
        "lr": 0.01,
        "weight_decay": 5e-4,
    },
    "connection_config": {
        "host": "http://35.230.92.92", 
        "graphname": "Cora", 
        "username": "tigergraph", 
        "password": "tigergraph"
    }
}

### Write Parameters to JSON File
We will write the parameters dictionary to a JSON file so that we can easily access the parameters when creating the inference container.

In [6]:
import json

# Use a context manager so the file is flushed and closed
with open("{}/config.json".format(source_directory), "w") as f:
    json.dump(parameters, f)
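
As a quick sanity check, the configuration can be read back the same way the inference container will read it. The snippet below is a minimal sketch: it uses a trimmed-down, illustrative `example_parameters` dictionary and a temporary directory (both assumptions for the example) rather than the real `gat_cora` folder.

```python
import json
import os
import tempfile

# Illustrative subset of the parameters dictionary (not the full config)
example_parameters = {
    "model_name": "GAT",
    "model_config": {"num_features": 1433, "out_dim": 7, "dropout": 0.6},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")
    # Write the config, then read it back from disk
    with open(config_path, "w") as f:
        json.dump(example_parameters, f)
    with open(config_path) as f:
        loaded = json.load(f)

# Dicts, strings, ints, and floats survive the JSON round trip unchanged
print(loaded == example_parameters)
```

Note that JSON has no tuple or set type, so a config that relies on those would not round-trip this cleanly.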

## Train a GNN Model

### Load the Model
Here, we add `source_directory` to Python's module search path and look up the model class by name. This is equivalent to writing `from source_directory.model import ModelName`.

Since `source_directory` and `ModelName` are specific to each developer's configuration, we read them from variables and use the `sys` module to make the model importable instead of hard-coding the import.

In [5]:
import sys
sys.path.append(source_directory)

import model
GAT = getattr(model, parameters["model_name"])


In [6]:
GAT

model.GAT
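
The same lookup can be tried in isolation. The sketch below writes a tiny stand-in module to a temporary directory (the module name `demo_model` and the class `Dummy` are made up for illustration), adds that directory to `sys.path`, and resolves the class from a string with `getattr`.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical stand-in for model.py; the names are illustrative only
module_source = (
    "class Dummy:\n"
    "    def __init__(self, size):\n"
    "        self.size = size\n"
)

tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "demo_model.py"), "w") as f:
    f.write(module_source)

sys.path.append(tmp_dir)        # make the directory importable
importlib.invalidate_caches()   # ensure the freshly written file is discovered
module = importlib.import_module("demo_model")
cls = getattr(module, "Dummy")  # class chosen by name at runtime
instance = cls(**{"size": 3})   # keyword arguments unpacked from a dict
print(cls.__name__, instance.size)
```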

#### Instantiate the Model Class
Here, we unpack the `model_config` dictionary as keyword arguments (`**`) to the model constructor.

In [7]:
gat = GAT(**parameters["model_config"])
gat

GAT(
  (layers): ModuleList(
    (0): GATConv(1433, 8, heads=8)
    (1): GATConv(64, 7, heads=1)
  )
)
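
The printed layer sizes follow directly from the loop in `model.py`: hidden layers concatenate their attention heads, so the next layer sees `hidden_dim * num_heads` inputs, while the last layer uses a single head. A small pure-Python sketch of that bookkeeping (the helper name `gat_layer_shapes` is ours, not part of the notebook):

```python
def gat_layer_shapes(num_features, num_layers, out_dim, hidden_dim, num_heads):
    """Return (in_units, out_units, heads) per layer, mirroring GAT.__init__."""
    shapes = []
    for i in range(num_layers):
        last = i == num_layers - 1
        in_units = num_features if i == 0 else hidden_dim * num_heads
        out_units = out_dim if last else hidden_dim
        heads = 1 if last else num_heads
        shapes.append((in_units, out_units, heads))
    return shapes

shapes = gat_layer_shapes(
    num_features=1433, num_layers=2, out_dim=7, hidden_dim=8, num_heads=8
)
print(shapes)  # [(1433, 8, 8), (64, 7, 1)]
```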

### Create Data Loaders
Here, we instantiate a connection to our TigerGraph database with `pyTigerGraph`. Then we create data loaders for training, validation, and testing datasets. We will use the **Neighbor Sampling** technique introduced in the GraphSAGE paper to generate batches of data.

In [8]:
from pyTigerGraph import TigerGraphConnection

conn = TigerGraphConnection(**parameters["connection_config"])

In [9]:
train_loader = conn.gds.neighborLoader(
    **parameters["training_loader_config"],
    filter_by="train_mask"
)

In [10]:
valid_loader = conn.gds.neighborLoader(
    **parameters["training_loader_config"],
    filter_by="val_mask"
)

In [11]:
test_loader = conn.gds.neighborLoader(
    **parameters["training_loader_config"],
    filter_by="test_mask"
)
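
To make the sampling idea concrete, here is a minimal, library-free sketch of neighbor sampling on a toy adjacency list. It is not how `neighborLoader` is implemented; the function `sample_neighborhood` and the toy graph are assumptions for illustration. Starting from the seed vertices, it samples at most `num_neighbors` neighbors of every frontier vertex, repeated for `num_hops` hops.

```python
import random

def sample_neighborhood(adj, seeds, num_neighbors, num_hops, rng):
    """Return the set of vertices reached by sampling up to num_neighbors
    neighbors of every frontier vertex, for num_hops hops."""
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(num_hops):
        next_frontier = []
        for v in frontier:
            neighbors = adj.get(v, [])
            k = min(num_neighbors, len(neighbors))
            for u in rng.sample(neighbors, k):
                if u not in visited:
                    visited.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return visited

# Toy undirected graph as an adjacency list
toy_graph = {0: [1, 2, 3], 1: [0, 4], 2: [0, 5], 3: [0], 4: [1], 5: [2]}
subgraph_nodes = sample_neighborhood(
    toy_graph, seeds=[0], num_neighbors=2, num_hops=2, rng=random.Random(0)
)
print(sorted(subgraph_nodes))
```

Capping the fan-out per hop keeps the size of each batch bounded even when some vertices have very many neighbors, which is the motivation for the `num_neighbors` and `num_hops` settings in the loader configuration.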

### Setup Optimizer
Here, we define the `Adam` optimizer and move the model to the correct device (CPU or GPU).

In [12]:
import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

gat.to(device)

optimizer = torch.optim.Adam(
    gat.parameters(), **parameters["optimizer_config"]
)

### Train the Model

In [13]:
from datetime import datetime
from pyTigerGraph.gds.metrics import Accumulator, Accuracy

In [14]:
global_steps = 0
logs = {}
for epoch in range(10):
    # Train
    gat.train()
    epoch_train_loss = Accumulator()
    epoch_train_acc = Accuracy()
    for bid, batch in enumerate(train_loader):
        batchsize = batch.x.shape[0]
        batch.to(device)
        # Forward pass
        out = gat(batch)
        # Calculate loss
        loss = F.cross_entropy(out[batch.train_mask], batch.y[batch.train_mask])
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_train_loss.update(loss.item() * batchsize, batchsize)
        # Predict on training data
        with torch.no_grad():
            pred = out.argmax(dim=1)
            epoch_train_acc.update(pred[batch.train_mask], batch.y[batch.train_mask])
        # Log training status after each batch
        logs["loss"] = epoch_train_loss.mean
        logs["acc"] = epoch_train_acc.value
        print(
            "Epoch {}, Train Batch {}, Loss {:.4f}, Accuracy {:.4f}".format(
                epoch, bid, logs["loss"], logs["acc"]
            )
        )
        global_steps += 1
    # Evaluate
    gat.eval()
    epoch_val_loss = Accumulator()
    epoch_val_acc = Accuracy()
    for batch in valid_loader:
        batchsize = batch.x.shape[0]
        batch.to(device)
        with torch.no_grad():
            # Forward pass
            out = gat(batch)
            # Calculate loss
            valid_loss = F.cross_entropy(out[batch.val_mask], batch.y[batch.val_mask])
            epoch_val_loss.update(valid_loss.item() * batchsize, batchsize)
            # Prediction
            pred = out.argmax(dim=1)
            epoch_val_acc.update(pred[batch.val_mask], batch.y[batch.val_mask])
    # Log validation result after each epoch
    logs["val_loss"] = epoch_val_loss.mean
    logs["val_acc"] = epoch_val_acc.value
    print(
        "Epoch {}, Valid Loss {:.4f}, Valid Accuracy {:.4f}".format(
            epoch, logs["val_loss"], logs["val_acc"]
        )
    )

Epoch 0, Train Batch 0, Loss 2.0951, Accuracy 0.0694
Epoch 0, Train Batch 1, Loss 1.9448, Accuracy 0.2133
Epoch 0, Train Batch 2, Loss 1.8532, Accuracy 0.2850
Epoch 0, Valid Loss 1.7549, Valid Accuracy 0.3582
Epoch 1, Train Batch 0, Loss 1.6790, Accuracy 0.4177
Epoch 1, Train Batch 1, Loss 1.5420, Accuracy 0.5000
Epoch 1, Train Batch 2, Loss 1.4300, Accuracy 0.5509
Epoch 1, Valid Loss 1.5380, Valid Accuracy 0.5374
Epoch 2, Train Batch 0, Loss 1.2564, Accuracy 0.6056
Epoch 2, Train Batch 1, Loss 1.2103, Accuracy 0.6528
Epoch 2, Train Batch 2, Loss 1.1751, Accuracy 0.6667
Epoch 2, Valid Loss 1.3291, Valid Accuracy 0.6209
Epoch 3, Train Batch 0, Loss 1.0522, Accuracy 0.7105
Epoch 3, Train Batch 1, Loss 1.0109, Accuracy 0.7162
Epoch 3, Train Batch 2, Loss 1.0068, Accuracy 0.6971
Epoch 3, Valid Loss 1.2183, Valid Accuracy 0.6531
Epoch 4, Train Batch 0, Loss 0.9570, Accuracy 0.6883
Epoch 4, Train Batch 1, Loss 0.9151, Accuracy 0.7234
Epoch 4, Train Batch 2, Loss 0.9039, Accuracy 0.7311
Epoch
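
The loss logged during training is a size-weighted running mean: each batch contributes `loss * batchsize` to a running total and `batchsize` to a running count. The class below is a simplified stand-in written for illustration (it is not the `pyTigerGraph` `Accumulator` itself) that shows the arithmetic on made-up batch values.

```python
class RunningMean:
    """Simplified stand-in for a weighted running average."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, count=1):
        # value is expected to be already multiplied by its weight
        self.total += value
        self.count += count

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0

tracker = RunningMean()
# (mean loss of the batch, number of samples in the batch); values are made up
for batch_loss, batch_size in [(2.0, 10), (1.0, 30)]:
    tracker.update(batch_loss * batch_size, batch_size)

print(tracker.mean)  # (2.0*10 + 1.0*30) / 40 = 1.25
```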

### Test the Model

In [15]:
gat.eval()
acc = Accuracy()
for batch in test_loader:
    batch.to(device)
    with torch.no_grad():
        pred = gat(batch).argmax(dim=1)
        acc.update(pred[batch.test_mask], batch.y[batch.test_mask])
print("Accuracy: {:.4f}".format(acc.value))

Accuracy: 0.7401
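
The test accuracy only counts vertices selected by `test_mask`; the other vertices in each sampled batch are there to provide neighborhood context. A plain-Python sketch of that masked accuracy, with made-up predictions and labels (the helper `masked_accuracy` is ours):

```python
def masked_accuracy(preds, labels, mask):
    """Fraction of correct predictions among positions where mask is True."""
    selected = [(p, y) for p, y, m in zip(preds, labels, mask) if m]
    if not selected:
        return 0.0
    correct = sum(1 for p, y in selected if p == y)
    return correct / len(selected)

# Made-up values: 5 vertices in a batch, 3 of them in the test split
preds = [3, 1, 5, 3, 0]
labels = [3, 2, 5, 0, 0]
test_mask = [True, True, False, True, False]
accuracy = masked_accuracy(preds, labels, test_mask)
print(accuracy)  # 1 correct out of the 3 masked vertices
```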


### Save the Trained Model Weights

In [16]:
torch.save(gat.state_dict(), "{}/model.pth".format(source_directory))

## Create Dockerfile

Google Vertex AI hosts models in Docker containers. We use a Dockerfile to build the container image.

In [26]:
%%writefile Dockerfile

FROM ubuntu:latest

# Install some basic utilities
RUN apt-get update && apt-get install -y \
    curl \
    ca-certificates \
    sudo \
    git \
    bzip2 \
    libx11-6 \
    wget \
    pip \
 && rm -rf /var/lib/apt/lists/*

WORKDIR /opt
# Set up the Conda environment
ENV CONDA_AUTO_UPDATE_CONDA=false \
    PATH=/opt/miniconda/bin:$PATH
COPY ./gat_cora/environment.yml /opt/environment.yml
RUN curl -sLo /opt/miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-py39_4.11.0-Linux-x86_64.sh \
 && chmod +x /opt/miniconda.sh \
 && /opt/miniconda.sh -b -p /opt/miniconda \
 && rm /opt/miniconda.sh \
 && conda env update -n base -f /opt/environment.yml \
 && rm /opt/environment.yml \
 && conda clean -ya

RUN pip install --no-index torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html \
 && pip install --no-index torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu113.html \
 && pip install --no-index torch-cluster -f https://data.pyg.org/whl/torch-1.10.0+cu113.html \
 && pip install --no-index torch-spline-conv -f https://data.pyg.org/whl/torch-1.10.0+cu113.html \
 && pip install torch-geometric \
 && pip cache purge

# install - requirements.txt
COPY ./gat_cora/requirements.txt /tmp/requirements.txt
RUN python3 -m pip install -r /tmp/requirements.txt --quiet --no-cache-dir \
  && rm -f /tmp/requirements.txt

ENV TARGET_DIR /opt/kserve-demo
WORKDIR ${TARGET_DIR}
COPY ./gat_cora/ ${TARGET_DIR}/gat_cora/

# Run main.py as a script; "python3 -m" expects a module name, not a file path
ENTRYPOINT ["python3", "./gat_cora/main.py"]

Overwriting Dockerfile


## Define main.py File

This `main.py` file will load the model and start running an HTTP server for model inference within the Docker container.

In [18]:
%%writefile $source_directory/main.py

import json
import logging
import os
import sys
from typing import Dict

import torch
import kserve
from google.cloud import storage
from kserve import Model, Storage
from kserve.model import ModelMissingError, InferenceError
import pyTigerGraph as tg

logger = logging.getLogger(__name__)

class VertexClassifier(Model):
    def __init__(self, name: str, source_directory: str):
        super().__init__(name)
        self.name = name
        self.source_dir = source_directory
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        # Load configuration JSON file
        with open(os.path.join(source_directory, "config.json")) as json_file:
            data = json.load(json_file)
            self.model_config = data["model_config"]
            connection_config = data["connection_config"]
            loader_config = data["infer_loader_config"]
            self.model_name = data["model_name"]

        # Make model.py importable from the source directory
        sys.path.append(source_directory)
        # Setup Connection to TigerGraph Database
        self.conn = tg.TigerGraphConnection(**connection_config)

        # Setup Inference Loader
        self.infer_loader = self.conn.gds.neighborLoader(**loader_config)

        # Setup Model
        self.model = self.load_model()

    def load(self):
        # The model is already loaded in __init__, so only mark it as ready
        self.ready = True
        return self.ready

    def load_model(self):
        import model
        mdl = getattr(model, self.model_name)(**self.model_config)
        logger.info("Instantiated Model")
        # map_location lets weights saved on a GPU load on a CPU-only host
        with open(os.path.join(self.source_dir, "model.pth"), 'rb') as f:
            mdl.load_state_dict(torch.load(f, map_location=self.device))
        mdl.to(self.device).eval()
        logger.info("Loaded Model")
        return mdl

    def predict(self, request: Dict) -> Dict:
        # Requests built as {"instances": [payload]} (see request.json) wrap the
        # payload; the local test sends the payload directly
        if "instances" in request:
            request = request["instances"][0]
        input_nodes = request["vertices"]
        input_ids = set([str(node['primary_id']) for node in input_nodes])
        logger.info(input_ids)
        data = self.infer_loader.fetch(input_nodes).to(self.device)
        logger.info(f"predicting {data}")
        with torch.no_grad():
            output = self.model(data)
        # Assumes the first len(input_nodes) output rows correspond to the
        # requested vertices, in request order
        returnJSON = {}
        for i, node in enumerate(input_nodes):
            returnJSON[node["primary_id"]] = output[i].tolist()
        return returnJSON

if __name__ == "__main__":
    model_name = os.environ.get('K_SERVICE', "tg-gat-gcp-demo-predictor-default")
    model_name = '-'.join(model_name.split('-')[:-2]) # removing suffix "-predictor-default"
    print(model_name)
    logging.info(f"Starting model '{model_name}'")
    # config.json, model.py, and model.pth live next to this file
    source_directory = os.path.dirname(os.path.abspath(__file__))
    model = VertexClassifier(model_name, source_directory)
    kserve.ModelServer(http_port=8080).start([model])


Overwriting gat_cora/main.py
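
The name handling at the bottom of `main.py` strips the last two dash-separated tokens from the service name. A quick check of that string manipulation on the default value (the helper name `strip_predictor_suffix` is ours):

```python
def strip_predictor_suffix(service_name):
    """Drop a trailing "-predictor-default" style suffix (the last two tokens)."""
    return "-".join(service_name.split("-")[:-2])

model_name = strip_predictor_suffix("tg-gat-gcp-demo-predictor-default")
print(model_name)  # tg-gat-gcp-demo
```

The resulting name is the one the server registers the model under, so it is also the name that appears in the prediction URL path `/v1/models/<model_name>:predict`.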


## Write requirements.txt File

In [19]:
%%writefile $source_directory/requirements.txt

# kubeflow packages
kfp==1.6.3
kfp-server-api==1.6.0
kserve==0.8

# common packages
#bokeh==2.3.2
#cloudpickle==1.6.0
#dill==0.3.4
#pandas==1.2.4

# pytorch packages
#fastai==2.4
class-resolver==0.3.9

# TigerGraph
pyTigerGraph[gds]==0.9

Overwriting gat_cora/requirements.txt


## Write environment.yml File

In [20]:
%%writefile $source_directory/environment.yml
name: base
dependencies:
- numpy=1.21.2
- pip=21.2.4
- python=3.9.7
- pytorch::pytorch=1.10.0=py3.9_cuda11.3_cudnn8.2.0_0
- scipy=1.7.1
- cloudpickle=2.0.0  

Overwriting gat_cora/environment.yml


## Build Docker Image
Using the Dockerfile defined above, we will build the Docker image for inference.

In [14]:
!docker build --rm --platform linux/amd64 --no-cache -t kserve-base:1.0 . 

[1A[1B[0G[?25l[+] Building 0.0s (0/1)                                                         
[?25h[1A[0G[?25l[+] Building 0.2s (2/3)                                                         
[34m => [internal] load build definition from Dockerfile                       0.0s
[0m[34m => => transferring dockerfile: 37B                                        0.0s
[0m[34m => [internal] load .dockerignore                                          0.0s
[0m[34m => => transferring context: 2B                                            0.0s
[0m => [internal] load metadata for docker.io/library/ubuntu:latest           0.1s
[?25h[1A[1A[1A[1A[1A[1A[0G[?25l[+] Building 0.3s (2/3)                                                         
[34m => [internal] load build definition from Dockerfile                       0.0s
[0m[34m => => transferring dockerfile: 37B                                        0.0s
[0m[34m => [internal] load .dockerignore                           

## Run Docker Image Locally

In [None]:
!docker run -p 8080:8080 kserve-base:1.0

## Test Local Deployment

In [16]:
import requests

data = {"vertices": [{"primary_id": 7, "type": "Paper"}, {"primary_id": 17, "type": "Paper"}, {"primary_id": 27, "type": "Paper"}, {"primary_id": 37, "type": "Paper"}]}

resp = requests.post("http://localhost:8080/v1/models/tg-gat-gcp-demo:predict", json=data)

In [30]:
resp.text

'{"7": [0.6309744119644165, -1.8825989961624146, -3.1260595321655273, 4.124495983123779, 0.9012938737869263, -1.2114927768707275, -0.04733648896217346], "17": [-0.8207836151123047, -1.460455060005188, -2.670682191848755, 5.9252238273620605, 0.5515968203544617, -2.1751182079315186, -2.2578446865081787], "27": [1.3214856386184692, -1.6524208784103394, -2.7388596534729004, -0.7472782731056213, 0.7124011516571045, 2.9769928455352783, -1.7700220346450806], "37": [0.11736150830984116, -0.13363397121429443, -1.6442967653274536, 2.4018805027008057, -0.10535058379173279, -0.5798088312149048, -0.6423999071121216]}'
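
Each vertex ID in the response maps to seven raw scores (logits), one per Cora class. A short sketch of turning such a response into predicted class indices via softmax probabilities, using two entries copied from the response above (the helper `softmax` is written here for illustration):

```python
import json
import math

# Two entries copied from the response above
resp_text = (
    '{"7": [0.6309744119644165, -1.8825989961624146, -3.1260595321655273, '
    '4.124495983123779, 0.9012938737869263, -1.2114927768707275, '
    '-0.04733648896217346], '
    '"27": [1.3214856386184692, -1.6524208784103394, -2.7388596534729004, '
    '-0.7472782731056213, 0.7124011516571045, 2.9769928455352783, '
    '-1.7700220346450806]}'
)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

predictions = {}
for vertex_id, logits in json.loads(resp_text).items():
    probs = softmax(logits)
    # Index of the most probable class
    predictions[vertex_id] = max(range(len(probs)), key=probs.__getitem__)

print(predictions)  # {'7': 3, '27': 5}
```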

## Upload to GCP

In [20]:
!docker tag kserve-base:1.0 us-central1-docker.pkg.dev/tigergraph-ml/gnn-inference/cora-gat-inference:latest

In [10]:
!docker push us-central1-docker.pkg.dev/tigergraph-ml/gnn-inference/cora-gat-inference:latest


## Deploy Model

In [None]:
!gcloud ai models upload --region=us-central1 --display-name=cora-gat --container-image-uri=us-central1-docker.pkg.dev/tigergraph-ml/gnn-inference/cora-gat-inference:latest

In [7]:
!gcloud ai models list --region=us-central1 


## Create Endpoint

In [None]:
!gcloud ai endpoints create --region=us-central1 --display-name=coragat

In [11]:
!gcloud ai endpoints list --region=us-central1 --filter=display_name=coragat


## Deploy Model to Endpoint

In [None]:
!gcloud ai endpoints deploy-model 1271026645616033792 --region=us-central1 --model=8814375701554659328 --display-name=coragat

## Run Prediction

In [19]:
import json

# Use a context manager so the file is flushed and closed
with open("request.json", "w") as f:
    json.dump({"instances": [data]}, f)
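
The request file wraps the same payload used for the local test in an `instances` list. A small sketch of building that structure for arbitrary vertex IDs (the helper `build_request` is hypothetical; the `Paper` vertex type comes from the local test payload):

```python
import json

def build_request(vertex_ids, vertex_type="Paper"):
    """Build the {"instances": [{"vertices": [...]}]} body used in this notebook."""
    vertices = [{"primary_id": vid, "type": vertex_type} for vid in vertex_ids]
    return {"instances": [{"vertices": vertices}]}

request_body = build_request([7, 17, 27, 37])
print(json.dumps(request_body, indent=2))
```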

In [None]:
!gcloud ai endpoints predict 3370829971877527552 --region=us-central1 --json-request=request.json

In [13]:
from google.cloud import aiplatform

aiplatform.init(project="tigergraph-ml", location="us-central1")

endpoint = aiplatform.Endpoint("3370829971877527552")

# instances must be a list, matching the {"instances": [data]} layout of request.json
prediction = endpoint.predict(instances=[data])
print(prediction)
