# Exploring how the TorchData API works with TigerGraph data

### Doris Voina (dorisvoina@gmail.com), Feng Shi (bill.shi@tigergraph.com)

## What is TorchData?

TorchData (or "torchdata") is part of the PyTorch project and provides modular data loading building blocks for PyTorch. According to the [documentation](https://pytorch.org/data/main/),
>TorchData is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines.

Each of these primitives (DataPipes) performs one task, and they can be chained together. For example:
- FileLister: lists out files in a directory
- Filter: filters the elements in a DataPipe based on a given function
- FileOpener: consumes file paths and returns opened file streams
- Mapper: applies a function over each item from the source DataPipe
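
The chaining idea behind these primitives can be sketched with plain Python generators. This is only a conceptual analogy, not the torchdata API: the helper names `filter_pipe` and `map_pipe` and the file names are made up for illustration.

```python
def filter_pipe(source, predicate):
    """Lazily yield only the items for which predicate(item) is true."""
    for item in source:
        if predicate(item):
            yield item


def map_pipe(source, fn):
    """Lazily yield fn(item) for every item in the source."""
    for item in source:
        yield fn(item)


# Hypothetical file names standing in for the output of a file lister.
file_names = ["nodes.csv", "edges.csv", "README.md"]

# Chain the stages: keep CSV files, then transform each name.
csv_only = filter_pipe(file_names, lambda name: name.endswith(".csv"))
upper = map_pipe(csv_only, str.upper)

# Nothing runs until the chain is consumed.
result = list(upper)
print(result)  # ['NODES.CSV', 'EDGES.CSV']
```

As with DataPipes, each stage only wraps the previous one, so no work happens until the final pipeline is iterated.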

This tutorial showcases how to use TorchData to load data from a TigerGraph database for graph machine learning tasks. The Cora dataset will be used here, which is a well-known graph dataset of papers and their citations.

Let's explore torchdata through a concrete problem: node classification. Given a graph, a common task is to assign a class to each node. A typical approach is to take the node features and classify the nodes from them, for example with a neural network. Besides the features provided in the dataset, we can enrich the input with properties derived from the graph structure, such as PageRank. PageRank estimates how important a node is from the number and quality of the connections pointing to it. The underlying assumption is that more important nodes are likely to receive more connections from other nodes.
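
To make the idea concrete, here is a minimal power-iteration sketch of PageRank on a toy graph. It is not the algorithm implementation that runs inside the database: the function name `pagerank`, the damping factor of 0.85, the iteration count, and the toy edges are assumptions for illustration. The scores here are normalized to sum to 1, so they will not match the scale of scores returned by the database.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed graph given as (src, dst) pairs."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    scores = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Every node gets a baseline share, plus whatever its in-neighbors pass on.
        new_scores = [(1.0 - damping) / num_nodes] * num_nodes
        for src, dst in edges:
            new_scores[dst] += damping * scores[src] / out_degree[src]
        # Nodes without outgoing edges spread their score evenly over all nodes.
        dangling = sum(s for s, d in zip(scores, out_degree) if d == 0)
        scores = [s + damping * dangling / num_nodes for s in new_scores]
    return scores


# Toy citation graph: papers 0, 1 and 2 all cite paper 3; paper 3 cites paper 0.
edges = [(0, 3), (1, 3), (2, 3), (3, 0)]
scores = pagerank(edges, num_nodes=4)

# Rank the papers from most to least important.
ranking = sorted(range(4), key=lambda n: scores[n], reverse=True)
print(ranking)  # [3, 0, 1, 2]
```

Paper 3 ranks first because it receives the most citations, and paper 0 ranks second because its single citation comes from the highly ranked paper 3.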

**Outline**
- Data Ingestion: Ingest the Cora dataset into a TigerGraph database in the cloud.
- Graph Features: Compute PageRank scores for each node in the database.
- Data Loader: Create a data loader that preprocesses the data for ML.
- Model Training: Train a simple feedforward neural network on the data for node classification.


In [1]:
# A few Python packages are required for running the code below
%pip install torchdata
%pip install pyTigerGraph

Note: you may need to restart the kernel to use updated packages.


## Data Ingestion

We will use a free TigerGraph database on tgcloud to host the data. If you don't have a tgcloud account, simply go to www.tgcloud.io and get one. 

After logging in, click `Create Solution` and follow the instructions to create a database. (The free tier is enough for this demo, although larger instances will bring better performance.) A solution called `torchdata-demo` at `torchdata-demo.i.tgcloud.io` is used in this tutorial.

After the solution is created, open GraphStudio from `Applications`, and click `Global View`->`Create a graph` to create an empty graph called `Cora`.   

Finally, click `Admin` in the upper right corner to switch to the admin portal and go to `Management/Users` to create a secret for the graph. Note: you will not be able to see the full secret once you leave the page, so make sure to copy it to a safe place for future use.

Run the code below to ingest the Cora data into the database. **Replace the secret variable below with the secret generated just now**.

In [1]:
# Make a connection

from pyTigerGraph import TigerGraphConnection

secret = "udnte2efngcbsdvjalkp3r90lqer0n9h"

conn = TigerGraphConnection(
    host="https://torchdata-demo.i.tgcloud.io",
    graphname="Cora",
    gsqlSecret=secret,
)

apiToken = conn.getToken(secret)[0]

In [2]:
# Create and run schema change job
query = """
USE GRAPH Cora

CREATE SCHEMA_CHANGE JOB Cora_job FOR GRAPH Cora {
    ADD VERTEX Paper (PRIMARY_ID id Int, x List<Int>, y Int, train_mask Bool, val_mask Bool, test_mask Bool) WITH primary_id_as_attribute="true";
    ADD DIRECTED EDGE Cite (from Paper, to Paper, time Int, is_train Bool, is_val Bool);
}

RUN SCHEMA_CHANGE JOB Cora_job
"""
print(conn.gsql(query))

Using graph 'Cora'
Successfully created schema change jobs: [Cora_job].
Kick off schema change job Cora_job
Doing schema change on graph 'Cora' (current version: 0)
Trying to add local vertex 'Paper' to the graph 'Cora'.
Trying to add local edge 'Cite' to the graph 'Cora'.

Graph Cora updated to new version 1
The job Cora_job completes in 2.358 seconds!


In [3]:
# Create loading job
query = """
USE GRAPH Cora

CREATE LOADING JOB load_cora_data FOR GRAPH Cora {
    DEFINE FILENAME node_csv;
    DEFINE FILENAME edge_csv;

    LOAD node_csv TO VERTEX Paper VALUES ($0, SPLIT($1," "), $2, $3, $4, $5) USING header="false", separator=",";
    LOAD edge_csv TO EDGE Cite VALUES ($0, $1, _, _, _) USING header="false", separator=",";
}
"""
print(conn.gsql(query))

Using graph 'Cora'
Successfully created loading jobs: [load_cora_data].


In [4]:
# Load data. The data files are in the same repo as this notebook
print(conn.runLoadingJobWithFile("./data/cora/nodes.csv", "node_csv", "load_cora_data"))
print(conn.runLoadingJobWithFile("./data/cora/edges.csv", "edge_csv", "load_cora_data"))

[{'sourceFileName': 'Online_POST',
  'statistics': {'validLine': 10556,
   'rejectLine': 0,
   'failedConditionLine': 0,
   'notEnoughToken': 0,
   'invalidJson': 0,
   'oversizeToken': 0,
   'vertex': [],
   'edge': [{'typeName': 'Cite',
     'validObject': 10556,
     'noIdFound': 0,
     'invalidAttribute': 0,
     'invalidVertexType': 0,
     'invalidPrimaryId': 0,
     'invalidSecondaryId': 0,
     'incorrectFixedBinaryLength': 0}],
   'deleteVertex': [],
   'deleteEdge': []}}]

If every step above finished successfully, you should now see the Cora graph in GraphStudio. There should be 2708 vertices and 10556 edges. For visual inspection, go to `Explore Graph` and pick a few vertices at random to see what they look like.

## Graph Features

### PageRank

Install the PageRank algorithm.

In [5]:
featurizer = conn.gds.featurizer()
featurizer.installAlgorithm("tg_pagerank")

Installing and optimizing the queries, it might take a minute


'tg_pagerank'

Run PageRank and fetch the results with HttpReader.

In [2]:
from torchdata.datapipes.iter import HttpReader, IterableWrapper

url = "https://torchdata-demo.i.tgcloud.io:443/restpp/query/Cora/tg_pagerank"
payload = {
    "v_type": "Paper",
    "e_type": "Cite",
    "top_k": 2708,
    "print_accum": True
}
authHeader = {'Authorization': "Bearer " + apiToken}
out_pr = HttpReader(
    IterableWrapper([url]),
    timeout=None,  # no request timeout
    params=payload, headers=authHeader)

Parse the output from HttpReader into a Python dictionary.

In [3]:
from torchdata.datapipes.iter import IterDataPipe
import torch.utils.data 
import json

@torch.utils.data.functional_datapipe('process_data')
class HttpReader_processing(IterDataPipe):
    # A custom DataPipe that parses the JSON response returned by HttpReader.
    def __init__(self, out: IterDataPipe):
        super().__init__()
        self.out = out

    def __iter__(self):
        # readlines() yields (url, line) pairs from the response stream
        reader_dp = self.out.readlines()
        it = iter(reader_dp)
        # Only the first line is read; it is expected to hold the whole JSON response
        path, line = next(it)

        out = json.loads(line.decode("utf8"))

        yield out

out_pr = out_pr.process_data()
out_pagerank = next(iter(out_pr))
print(out_pagerank['results'][0]['@@top_scores_heap'][:5])

[{'Vertex_ID': '1358', 'score': 33.06401}, {'Vertex_ID': '1701', 'score': 16.8922}, {'Vertex_ID': '1986', 'score': 14.46646}, {'Vertex_ID': '306', 'score': 13.72521}, {'Vertex_ID': '1810', 'score': 9.81972}]
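
The indexing above follows the nesting of the REST response. As a hedged, offline sketch (no database needed), the snippet below builds a response body with the same nesting as the output shown above, using two of its entries, and turns it into a lookup table. The helper name `scores_from_response` is hypothetical.

```python
import json

# A stand-in for one line of the HTTP response, with the same nesting as the
# output printed above (only two entries are included here).
raw_line = json.dumps({
    "results": [{"@@top_scores_heap": [
        {"Vertex_ID": "1358", "score": 33.06401},
        {"Vertex_ID": "1701", "score": 16.8922},
    ]}]
}).encode("utf8")


def scores_from_response(line: bytes) -> dict:
    """Decode one response line and map each vertex ID to its score."""
    body = json.loads(line.decode("utf8"))
    heap = body["results"][0]["@@top_scores_heap"]
    return {entry["Vertex_ID"]: entry["score"] for entry in heap}


scores_by_vid = scores_from_response(raw_line)
print(scores_by_vid["1358"])  # 33.06401
```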


Create a dataframe with two columns: vid and PageRank score.

In [4]:
import pandas as pd

df_pr = pd.DataFrame.from_records(out_pagerank['results'][0]['@@top_scores_heap'])
df_pr.columns = ["vid", "pagerank"]
df_pr.head()

| | vid | pagerank |
|---|---|---|
| 0 | 1358 | 33.06401 |
| 1 | 1701 | 16.8922 |
| 2 | 1986 | 14.46646 |
| 3 | 306 | 13.72521 |
| 4 | 1810 | 9.81972 |


### Other node features + labels

Install the query that pulls node features `x`, `y`, `train_mask`, `val_mask` and `test_mask`.

In [24]:
# Install UDFs
ExprFunctions="https://tg-mlworkbench.s3.us-west-1.amazonaws.com/udf/1.0/ExprFunctions.hpp" 
ExprUtil="" 
conn.installUDF(ExprFunctions, ExprUtil)

# Create query to get data
query = """
USE GRAPH Cora

CREATE QUERY vertex_loader(
    SET<VERTEX> input_vertices,
    INT num_batches=1, 
    BOOL shuffle=FALSE,
    STRING filter_by
){
    /*
    This query generates batches of vertices. If `input_vertices` is given, it will generate 
    a batch of those vertices. Otherwise, it will divide all vertices into `num_batches` 
    batches and return each batch separately.

    Parameters :
      input_vertices : What vertices to get.
      num_batches    : Number of batches to divide all vertices.
      shuffle        : Whether to shuffle vertices before collecting data.
      filter_by      : A Boolean attribute to determine which vertices are included.
                       Only effective when `input_vertices` is NULL.
    */
    INT num_vertices;
    SumAccum<INT> @tmp_id;

    # Shuffle vertex ID if needed
    start = {ANY};
    IF shuffle THEN
        num_vertices = start.size();
        res = SELECT s 
              FROM start:s
              POST-ACCUM s.@tmp_id = floor(rand()*num_vertices);
    ELSE
        res = SELECT s 
              FROM start:s
              POST-ACCUM s.@tmp_id = getvid(s);
    END;

    # Generate batches
    FOREACH batch_id IN RANGE[0, num_batches-1] DO
        MapAccum<VERTEX, STRING> @@v_batch;
        IF input_vertices.size()==0 THEN
            start = {ANY};
            IF filter_by IS NOT NULL THEN
                seeds = SELECT s 
                        FROM start:s 
                        WHERE s.getAttr(filter_by, "BOOL") and s.@tmp_id % num_batches == batch_id
                        POST-ACCUM @@v_batch += (s -> (int_to_string(getvid(s)) + "," + int_to_string(s.x)+","+int_to_string(s.y)+","+bool_to_string(s.train_mask)+","+bool_to_string(s.val_mask)+","+bool_to_string(s.test_mask) + "\n"));
            ELSE
                seeds = SELECT s 
                        FROM start:s 
                        WHERE s.@tmp_id % num_batches == batch_id
                        POST-ACCUM @@v_batch += (s -> (int_to_string(getvid(s)) + "," + int_to_string(s.x)+","+int_to_string(s.y)+","+bool_to_string(s.train_mask)+","+bool_to_string(s.val_mask)+","+bool_to_string(s.test_mask) + "\n"));
            END;
        ELSE
            start = input_vertices;
            seeds = SELECT s 
                    FROM start:s 
                    POST-ACCUM @@v_batch += (s -> (int_to_string(getvid(s)) + "," + int_to_string(s.x)+","+int_to_string(s.y)+","+bool_to_string(s.train_mask)+","+bool_to_string(s.val_mask)+","+bool_to_string(s.test_mask) + "\n"));
        END;
        # Add to response
        PRINT @@v_batch AS vertex_batch;  
    END;
}

INSTALL QUERY vertex_loader
"""
print(conn.gsql(query))

ExprFunctions installed successfully
Using graph 'Cora'
Successfully created queries: [vertex_loader].
Start installing queries, about 1 minute ...
vertex_loader query: curl -X GET 'https://127.0.0.1:9000/query/Cora/vertex_loader?input_vertices[INDEX]=VALUE&input_vertices[INDEX].type=VERTEX_TYPE&[num_batches=VALUE]&[shuffle=VALUE]&filter_by=VALUE'. Add -H "Authorization: Bearer TOKEN" if authentication is enabled.
Select 'm1' as compile server, now connecting ...
Node 'm1' is prepared as compile server.

Query installation finished.


Run the query and get results with HttpReader.

In [5]:
url = "https://torchdata-demo.i.tgcloud.io:443/restpp/query/Cora/vertex_loader"
out = HttpReader(
    IterableWrapper([url]),
    headers=authHeader)

out = out.process_data()
out_features = next(iter(out))
out_features = out_features["results"][0]["vertex_batch"]

Create a dataframe to store all the features

In [6]:
vids, x, y, train_mask, val_mask, test_mask = [], [], [], [], [], []
for vid, line in out_features.items():
    vids.append(vid)
    split_line = line.strip().split(",")
    # Field 0 is the internal vertex ID; the features in field 1 are space-separated
    x.append([int(i) for i in split_line[1].strip().split()])
    y.append(int(split_line[2]))
    # Append to the lists instead of overwriting them with a scalar
    train_mask.append(int(split_line[3]))
    val_mask.append(int(split_line[4]))
    test_mask.append(int(split_line[5]))

df_td = pd.DataFrame({
    "vid": vids, 
    "x": x, 
    "y": y, 
    "train_mask": train_mask, 
    "val_mask": val_mask, 
    "test_mask": test_mask
})

df_td.head()

| | vid | x | y | train_mask | val_mask | test_mask |
|---|---|---|---|---|---|---|
| 0 | 1005 | [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 3 | 0 | 0 | 1 |
| 1 | 1031 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 3 | 0 | 0 | 1 |
| 2 | 1656 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1 | 0 | 0 | 1 |
| 3 | 959 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 3 | 0 | 0 | 1 |
| 4 | 846 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 3 | 0 | 0 | 1 |
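
Each value in `vertex_batch` is a single comma-separated line assembled by the query: the internal vertex ID, the space-separated feature vector, the label, and the three masks. The sketch below parses one such line in isolation. The helper name `parse_vertex_line` and the sample line are made up, and it assumes the Boolean fields arrive as `0`/`1`, which is what the integer casts in the cell above imply.

```python
def parse_vertex_line(line):
    """Split one vertex_loader line into named, typed fields."""
    fields = line.strip().split(",")
    return {
        "internal_id": int(fields[0]),
        # Features are separated by spaces inside the second field.
        "x": [int(value) for value in fields[1].split()],
        "y": int(fields[2]),
        "train_mask": bool(int(fields[3])),
        "val_mask": bool(int(fields[4])),
        "test_mask": bool(int(fields[5])),
    }


# Hypothetical line with a 4-dimensional feature vector (Cora's is much longer).
sample_line = "7,0 1 0 1,3,1,0,0\n"
record = parse_vertex_line(sample_line)
print(record["x"], record["y"], record["train_mask"])  # [0, 1, 0, 1] 3 True
```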


Merge with pagerank to get the dataframe for model training

In [7]:
df_paper = df_td.merge(df_pr, on="vid")
df_paper.head()

| | vid | x | y | train_mask | val_mask | test_mask | pagerank |
|---|---|---|---|---|---|---|---|
| 0 | 1005 | [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 3 | 0 | 0 | 1 | 1.0 |
| 1 | 1031 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 3 | 0 | 0 | 1 | 0.74351 |
| 2 | 1656 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1 | 0 | 0 | 1 | 0.62322 |
| 3 | 959 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 3 | 0 | 0 | 1 | 1.0 |
| 4 | 846 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 3 | 0 | 0 | 1 | 0.90899 |


## Data Loader

Use torchdata to build data loaders for splitting, shuffling, batching, collating, etc.

In [8]:
import random

# Fractions of the data used for training and validation; the remainder is used for testing.
train_perc = 0.7
valid_perc = 0.15

# Define sample function for train/validation/test split.
# `n` is the data item; it is ignored because the split is purely random.
def sample_fn(n):
    r = random.random()
    if r<=train_perc:
        return 0
    elif r<train_perc+valid_perc:
        return 1
    else:
        return 2
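
Because `sample_fn` draws a fresh random number on every call, the same sample is not guaranteed to land in the same split if it is classified more than once, for example when a pipeline is iterated again. One possible alternative, sketched below, derives the split from a stable hash of the vertex ID. The helper name `stable_split`, the use of MD5, and the idea of passing the vertex ID to the classifier are assumptions for illustration; the data loader in this tutorial does not pass vertex IDs.

```python
import hashlib


def stable_split(vid, train_frac=0.7, valid_frac=0.15):
    """Map a vertex ID to 0 (train), 1 (validation) or 2 (test), reproducibly."""
    digest = hashlib.md5(str(vid).encode("utf8")).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    r = int(digest[:8], 16) / 0x100000000
    if r < train_frac:
        return 0
    if r < train_frac + valid_frac:
        return 1
    return 2


# The same ID always maps to the same split, no matter how often it is asked.
splits = [stable_split(vid) for vid in range(2708)]
print(splits == [stable_split(vid) for vid in range(2708)])  # True
```

The trade-off is that the split sizes are only approximately 70/15/15, as they are with the random version.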

In [10]:
import torch

# Define collate function
def coll_fn(batch):
    xs = [sample[0] + [sample[1]] for sample in batch]
    ys = [sample[2] for sample in batch] 
    return torch.tensor(xs), torch.tensor(ys)
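
What batching and collating do to the data can be illustrated without torch. The sketch below uses plain lists; the helper names `batch` and `collate_lists` and the toy samples are hypothetical, and the actual `coll_fn` above returns tensors rather than lists.

```python
def batch(samples, batch_size):
    """Yield consecutive chunks of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def collate_lists(batch_samples):
    """Append the PageRank score to each feature vector and separate the labels."""
    xs = [features + [pagerank] for features, pagerank, _ in batch_samples]
    ys = [label for _, _, label in batch_samples]
    return xs, ys


# Each sample is (feature vector, PageRank score, label), as in the zipped pipeline.
samples = [
    ([0, 1, 0], 0.5, 2),
    ([1, 0, 0], 1.25, 0),
    ([0, 0, 1], 0.75, 1),
]

batches = [collate_lists(chunk) for chunk in batch(samples, 2)]
print(batches[0])  # ([[0, 1, 0, 0.5], [1, 0, 0, 1.25]], [2, 0])
print(batches[1])  # ([[0, 0, 1, 0.75]], [1])
```

The last batch is smaller than the others whenever the number of samples is not a multiple of the batch size.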

Next, we create our own `data_loader` function that applies a Batcher, a Shuffler, and a Collator. Because it is built from these data primitives, the data loader is easy to customize.

In [11]:
from torchdata.datapipes.iter import Zipper

def data_loader(x, x2, y, shuffle, batch_sz, collator_fn=coll_fn, sample_fn=sample_fn):
    data_x = IterableWrapper(x)
    data_x2 = IterableWrapper(x2)
    data_y = IterableWrapper(y)
    data = Zipper(data_x, data_x2, data_y)

    train_set, valid_set, test_set = data.demux(
        num_instances=3, classifier_fn=sample_fn)

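    # Use the collate function passed in as an argument
    train_set = train_set.batch(batch_sz).collate(collator_fn)
    valid_set = valid_set.batch(batch_sz).collate(collator_fn)
    test_set = test_set.batch(batch_sz).collate(collator_fn)

    if shuffle:
        # Shuffling after batching reorders whole batches, not individual samples
        train_set = train_set.shuffle()
        valid_set = valid_set.shuffle()
        test_set = test_set.shuffle()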

    return train_set, valid_set, test_set


In [12]:
shuffle = True
batch_size = 100
train_set, valid_set, test_set = data_loader(
    df_paper.x, df_paper.pagerank, df_paper.y, shuffle, batch_size)


## Model Training

### Create model

Define a simple feedforward network with two linear layers (one hidden layer) and a ReLU non-linearity between them.

In [13]:
from torch import nn

class simple_NN(nn.Module):
    def __init__(self, input_size, hidden_dim, output_size):
        super(simple_NN, self).__init__()
        self.linear1 = nn.Linear(input_size, hidden_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(hidden_dim, output_size)

    def forward(self, x):
        x = x.double()
        out = self.linear1(x)
        out = self.relu(out)
        out = self.linear2(out)
        return out


In [20]:
input_size = len(df_paper.x.iloc[0]) + 1 
hidden_dim = 256
output_size = len(df_paper.y.unique())

model = simple_NN(input_size, hidden_dim, output_size)
model.double()

simple_NN(
  (linear1): Linear(in_features=1434, out_features=256, bias=True)
  (relu): ReLU()
  (linear2): Linear(in_features=256, out_features=7, bias=True)
)

Choose the optimizer (Adam) and the loss function (cross-entropy loss).

In [21]:
from torch import optim

optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

### Train model

Define the validation function.

In [22]:
import torch.nn.functional as F

def validate(model, dataloader):
    model.eval()
    val_loss = 0
    val_acc = 0
    total = 0
    for data, target in dataloader:
        with torch.no_grad():
            output = model(data)
            val_loss += F.cross_entropy(output, target, reduction="sum").item()
            pred = output.argmax(dim=1)
            val_acc += (pred==target).sum().item()
            total += len(target)
    val_loss /= total
    val_acc /= total
    print('Validation, loss: {:.6f}, accuracy: {:.6f}'.format(val_loss, val_acc))

    return val_acc
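
`validate` sums the per-sample losses (`reduction="sum"`) and divides by the total number of samples, instead of averaging the per-batch means. The arithmetic sketch below, with made-up loss values, shows why the two differ when the last batch is smaller than the others.

```python
# Hypothetical per-sample losses for two batches of unequal size.
batch_losses = [[0.2, 0.4, 0.6, 0.8], [2.0]]

# What validate() computes: total loss divided by total sample count.
total_loss = sum(sum(losses) for losses in batch_losses)
total_count = sum(len(losses) for losses in batch_losses)
per_sample_mean = total_loss / total_count

# The alternative: average the batch means, which over-weights the small batch.
batch_means = [sum(losses) / len(losses) for losses in batch_losses]
mean_of_batch_means = sum(batch_means) / len(batch_means)

print(round(per_sample_mean, 6), round(mean_of_batch_means, 6))  # 0.8 1.25
```

Dividing by the total count gives every sample the same weight regardless of which batch it falls in.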

Define the training loop

In [23]:
def train(model, train_data, val_data, epochs):
    train_acc = []
    val_acc = []

    for epoch in range(epochs):
        # validate() switches the model to eval mode, so re-enable training mode each epoch
        model.train()
        for batch_idx, (data, target) in enumerate(train_data):
            optimizer.zero_grad()
            output = model(data)
            loss = loss_fn(output, target)
            loss.backward()
            optimizer.step()
            predicted = output.argmax(dim=1)
            train_acc.append(float((predicted == target).sum())/len(target))
        
            if batch_idx % 10 == 0:
                print('Train Epoch: {}, batch #: {}, loss: {:.6f}, accuracy: {:.6f}'.format(
                    epoch, batch_idx, loss.item(), train_acc[-1]))
    
        val_acc.append(validate(model, val_data))
        
    return train_acc, val_acc

Train our neural network and print the training status

In [24]:
train_acc, val_acc = train(model, train_set, valid_set, 10)



Train Epoch: 0, batch #: 0, loss: 1.949416, accuracy: 0.170000
Train Epoch: 0, batch #: 10, loss: 1.829437, accuracy: 0.300000
Validation, loss: 1.663495, accuracy: 0.304136
Train Epoch: 1, batch #: 0, loss: 1.608917, accuracy: 0.350000
Train Epoch: 1, batch #: 10, loss: 1.574102, accuracy: 0.390000
Validation, loss: 1.209667, accuracy: 0.681818
Train Epoch: 2, batch #: 0, loss: 1.239762, accuracy: 0.660000
Train Epoch: 2, batch #: 10, loss: 0.910784, accuracy: 0.810000
Validation, loss: 0.860516, accuracy: 0.744949
Train Epoch: 3, batch #: 0, loss: 0.720289, accuracy: 0.800000
Train Epoch: 3, batch #: 10, loss: 0.663243, accuracy: 0.863158
Validation, loss: 0.584671, accuracy: 0.857831
Train Epoch: 4, batch #: 0, loss: 0.520550, accuracy: 0.900000
Train Epoch: 4, batch #: 10, loss: 0.508892, accuracy: 0.880000
Validation, loss: 0.428960, accuracy: 0.900726
Train Epoch: 5, batch #: 0, loss: 0.481546, accuracy: 0.850000
Train Epoch: 5, batch #: 10, loss: 0.346480, accuracy: 0.930000
Val

Compute the testing accuracy

In [25]:
test_acc = validate(model, test_set)
print("Final test accuracy is {}".format(test_acc))

Validation, loss: 0.148143, accuracy: 0.975000
Final test accuracy is 0.975
