# Data Processing

This notebook demonstrates how to use `tgml` for common data processing tasks on graphs stored in `TigerGraph`.

Note: For `tgml` to work, the Graph Data Processing Service has to be running on the TigerGraph server.

## Define Graph

Conceptually, the `TigerGraph` class represents the graph stored in the database. Under the hood, it stores the information needed to communicate with the TigerGraph database. It can read `username` and `password` from the environment variables `TGUSERNAME` and `TGPASSWORD`, so we recommend storing credentials there instead of hardcoding them. If you do pass `username` and `password` to the class constructor, the environment variables are ignored.
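
The precedence described above can be illustrated with a minimal sketch. `resolve_credentials` is a hypothetical helper written for this notebook, not part of `tgml`, and the per-argument fallback is an assumption about how such a lookup might behave.

```python
import os


def resolve_credentials(username=None, password=None):
    """Hypothetical helper (not tgml code): explicit arguments win,
    otherwise fall back to TGUSERNAME / TGPASSWORD, otherwise None."""
    if username is None:
        username = os.environ.get("TGUSERNAME")
    if password is None:
        password = os.environ.get("TGPASSWORD")
    return username, password


# Placeholder values for the demo only. In practice, export these in your
# shell or secret manager rather than setting them in code.
os.environ["TGUSERNAME"] = "alice"
os.environ["TGPASSWORD"] = "secret"

print(resolve_credentials())                  # ('alice', 'secret')
print(resolve_credentials("bob", "hunter2"))  # ('bob', 'hunter2')
```

With the variables exported, the constructor call can omit `username` and `password` so that no credentials appear in the notebook.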

In [1]:
from tgml.data import TigerGraph

Args to the `TigerGraph` class:
*    host (str, optional): Address of the server. Defaults to "http://localhost".
*    graph (str, optional): Name of the graph. Defaults to None.
*    username (str, optional): Username. Defaults to the env variable TGUSERNAME or None.
*    password (str, optional): Password for the user. Defaults to the env variable TGPASSWORD or None.
*    rest_port (str, optional): Port for the REST endpoint. Defaults to "9000".
*    gs_port (str, optional): Port for GraphStudio. Defaults to "14240".
*    token_auth (bool, optional): Whether to use token authentication. If True, token authentication must be turned on in the TigerGraph database server. Defaults to True.

In [2]:
tgraph = TigerGraph(
    host="http://127.0.0.1", # Change the address to your database server's
    graph="Cora",
    username="tigergraph",
    password="tigergraph",
    token_auth=False # Set to True only if token authentication is turned on in the TigerGraph database server.
)

Initializing the graph. It might take a minute if this is the first time you run it.


In [3]:
# Basic metadata about the graph such as schema.
tgraph.info()

Using graph 'Cora'
---- Graph Cora
Vertex Types: 
  - VERTEX Paper(PRIMARY_ID id INT, x LIST<INT>, y INT, train_mask BOOL, val_mask BOOL, test_mask BOOL, tmp_id INT, tmp_id2 INT, tmp_id3 INT, train BOOL, val BOOL, test BOOL, bool_attr BOOL DEFAULT "false") WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
Edge Types: 
  - DIRECTED EDGE Cite(FROM Paper, TO Paper)

Graphs: 
  - Graph Cora(Paper:v, Cite:e)
Jobs: 
Queries: 
  - export_edge(string output_path) (installed v2)
  - export_vertex_train_val_test(string output_path) (installed v2)
  - export_vertex_x_y_train_mask_val_mask_test_mask(string output_path) (installed v2)
  - get_vertex_number(string v_type, string filter_by) (installed v2)
  - shuffle_vertices(string tmp_id) (installed v2)
  - tg_neighbor_sampler_x(set<vertex> input_vertices, string vertex_filename, string edge_filename, int batch_id, int num_batches, int num_neighbors, int num_hops, string filter_by, string tmp_id) (installed v2)
  - tg_neighbor_samp

In [4]:
# Total number of vertices
tgraph.number_of_vertices()

2708

In [5]:
# Number of vertices of a specific type
tgraph.number_of_vertices("Paper")

2708

In [6]:
# Total number of edges
tgraph.number_of_edges()

10556

In [7]:
# Number of edges of a specific type
tgraph.number_of_edges("Cite")

10556

## Train/Validation/Test Split

In [8]:
from tgml.utils import split_vertices

`tgml` provides a utility function, `split_vertices`, that splits vertices into training, validation, and test sets. More precisely, it creates three boolean attributes, each indicating whether a vertex belongs to the corresponding set. For example, to split the vertices into 80% train, 10% validation, and 10% test, pass `train_mask=0.8, val_mask=0.1, test_mask=0.1` to the function. This creates the attributes `train_mask`, `val_mask`, and `test_mask` in the graph if they don't already exist. At random, roughly 80% of vertices are set to `train_mask=1`, 10% to `val_mask=1`, and 10% to `test_mask=1`, with no overlap between the partitions. You can name the attributes however you like as long as you follow the same `name=fraction` format, such as `yesterday=0.8, today=0.1, tomorrow=0.1`, but we recommend meaningful names.
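
To make the idea of a random, non-overlapping split concrete, here is a minimal pure-Python sketch. `sketch_split` is a hypothetical function and is not how `tgml` implements `split_vertices` (which runs inside the database). It assumes one uniform random draw per vertex, mapped onto consecutive intervals whose widths are the requested fractions.

```python
import random


def sketch_split(num_vertices, seed=0, **fractions):
    """Hypothetical illustration: one boolean mask per keyword argument,
    with each vertex assigned to at most one mask."""
    if sum(fractions.values()) > 1.0 + 1e-9:
        raise ValueError("fractions must sum to at most 1")
    rng = random.Random(seed)
    masks = {name: [False] * num_vertices for name in fractions}
    for v in range(num_vertices):
        draw = rng.random()  # uniform in [0, 1)
        lower = 0.0
        for name, frac in fractions.items():
            upper = lower + frac
            if lower <= draw < upper:  # draw falls in this partition's interval
                masks[name][v] = True
                break
            lower = upper
    return masks


masks = sketch_split(2708, train_mask=0.8, val_mask=0.1, test_mask=0.1)

# No vertex is in more than one partition.
assert all(sum(flags) <= 1 for flags in zip(*masks.values()))

for name, mask in masks.items():
    print(name, sum(mask) / len(mask))
```

Because each vertex is assigned independently in this sketch, the realized fractions are close to, but not exactly, the requested ones.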

In [9]:
split_vertices(tgraph, train_mask=0.8, val_mask=0.1, test_mask=0.1)

Installing and optimizing queries. It might take a minute if this is the first time you run it.


Now the split is done. Load all vertices and check if the split is correct. See the next tutorial for details on `VertexLoader` and other data loaders.

In [10]:
from tgml.dataloaders import VertexLoader

In [11]:
%%time
vertex_loader = VertexLoader(tgraph, attributes="train_mask,val_mask,test_mask")

Installing and optimizing queries. It might take a minute if this is the first time you use this loader.
CPU times: user 5.78 ms, sys: 2.24 ms, total: 8.02 ms
Wall time: 44.8 s


In [12]:
%%time
data = vertex_loader.data

CPU times: user 6.69 ms, sys: 3 ms, total: 9.69 ms
Wall time: 574 ms


In [13]:
data.train_mask.sum()/len(data), data.val_mask.sum()/len(data), data.test_mask.sum()/len(data)

(0.7976366322008862, 0.10376661742983752, 0.09859675036927622)
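
The fractions above are close to the requested 0.8/0.1/0.1 but not exact, as expected from a random assignment. As a quick sanity check, and assuming `len(data)` equals the 2708 vertices reported earlier, the fractions can be converted back to vertex counts:

```python
# Fractions copied from the output above.
fractions = (0.7976366322008862, 0.10376661742983752, 0.09859675036927622)
# Assumption: len(data) equals the count returned by number_of_vertices().
num_vertices = 2708

counts = [round(f * num_vertices) for f in fractions]
print(counts)                       # [2160, 281, 267]
print(sum(counts) == num_vertices)  # True
```

The counts add up to the full vertex set, which is consistent with every vertex landing in exactly one partition.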

## Feature Engineering

We are adding graph algorithms to the workbench to perform feature engineering tasks. Currently we are experimenting with PageRank and a preview is given below. The interface might change when we roll out the full feature engineering function.

### PageRank
PageRank is the algorithm that originally powered Google's search engine, where it ranked more influential webpages above less influential ones. A page's influence was measured by its PageRank score, which is based on the importance of the pages that link to it. More generally, PageRank scores each vertex in a graph according to how influential the vertices with edges pointing to it are.

The documentation of the PageRank query is found here: https://docs.tigergraph.com/graph-algorithm-library/centrality/pagerank.
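
For intuition, the following is a minimal pure-Python sketch of the textbook PageRank power iteration. It is not the TigerGraph query or the `tgml` implementation, which may differ in score normalization and in how vertices without outgoing edges are handled. The parameter names `max_change`, `max_iter`, and `damping` are borrowed from the `run` call in this notebook, and their meaning here (stop once the largest score change falls below `max_change`, or after `max_iter` iterations) is an assumption.

```python
def pagerank_sketch(edges, max_change=0.001, max_iter=25, damping=0.85):
    """Textbook PageRank on a list of directed (source, target) edges."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    incoming = {v: [] for v in vertices}
    for src, dst in edges:
        out_degree[src] += 1
        incoming[dst].append(src)

    n = len(vertices)
    scores = {v: 1.0 / n for v in vertices}  # start from a uniform distribution
    for _ in range(max_iter):
        # Score held by vertices without outgoing edges is spread evenly.
        dangling = sum(scores[v] for v in vertices if out_degree[v] == 0)
        new_scores = {}
        for v in vertices:
            received = sum(scores[u] / out_degree[u] for u in incoming[v])
            new_scores[v] = (1 - damping) / n + damping * (received + dangling / n)
        diff = max(abs(new_scores[v] - scores[v]) for v in vertices)
        scores = new_scores
        if diff < max_change:  # converged
            break
    return scores


# Toy citation graph: A and B both cite C, and C cites A.
scores = pagerank_sketch([("A", "C"), ("B", "C"), ("C", "A")])
for vertex, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(vertex, round(score, 3))
```

In this toy graph, C is cited by both other vertices and ends up with the highest score, while B, which nobody cites, ends up with the lowest.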

In [15]:
from tgml.featurizer import PageRank

In [16]:
feat = PageRank(tgraph, 
                result_attr = "pagerank", # Name of the attribute that will store results.
                timeout = 300000, # Timeout in milliseconds for running the algorithm. It does not apply to initialization steps, including query installation and schema changes.
                )

In [17]:
feat.run(max_change=0.001, max_iter=25, damping=0.85)

'PageRank scores saved to attribute pagerank'