## Data Loaders

This notebook demonstrates the use of different data loaders in `tgml`. The job of a data loader is to pull data from the TigerGraph database. Currently, the following data loaders are provided:
* EdgeLoader, which returns either the whole edgelist or batches of edges. Edge attributes are not supported currently.
* VertexLoader, which returns either all the vertices or batches of vertices. Vertex attributes are supported.
* GraphLoader, which returns the whole graph in `PyG` or `DGL` format.
* NeighborLoader, which returns subgraphs using neighbor sampling.

Every data loader above can either stream data directly from the server to the user or cache it in cloud storage. With caching, data is first written to cloud storage and then downloaded to the local machine, so a single consumer will see slower loads than with direct streaming. However, when several consumers need the same data, for example when training different models in parallel or tuning hyperparameters, cloud caching reduces the load on the database server and may therefore be faster than having every consumer query the server at the same time.

Note: For the data loaders to work, the Graph Data Processing Service has to be running on the TigerGraph server.

### Define Graph

Conceptually, the `TigerGraph` class represents the graph stored in the database. Under the hood, it stores the necessary information to communicate with the TigerGraph database. It can read `username` and `password` from environment variables `TGUSERNAME` and `TGPASSWORD`. Hence, we recommend storing those credentials in the environment variables instead of hardcoding them in code. However, if you do provide `username` and `password` to this class constructor, the environment variables will be ignored.
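
The fallback described above can be pictured with a small standalone sketch. The helper name `resolve_credentials` is hypothetical and not part of `tgml`; it only illustrates the idea that explicit arguments take precedence over the `TGUSERNAME` and `TGPASSWORD` environment variables. The per-field fallback shown here is an assumption, since the actual constructor logic is not shown in this notebook.

```python
import os


def resolve_credentials(username=None, password=None):
    """Hypothetical helper: explicit arguments win, else fall back to env vars."""
    if username is None:
        username = os.environ.get("TGUSERNAME")
    if password is None:
        password = os.environ.get("TGPASSWORD")
    return username, password


# Normally these are set in the shell or a secrets manager, not in code.
os.environ["TGUSERNAME"] = "tigergraph"
os.environ["TGPASSWORD"] = "example-password"

print(resolve_credentials())                   # values taken from the environment
print(resolve_credentials("alice", "s3cret"))  # explicit arguments override them
```

With the variables set in the environment, the connection can be created without any credentials in the notebook itself.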

In [1]:
from tgml.data import TigerGraph

Args to the `TigerGraph` class:
*    host (str, optional): Address of the server. Defaults to "http://localhost".
*    graph (str, optional): Name of the graph. Defaults to None.
*    username (str, optional): Username. Defaults to the env variable TGUSERNAME or None.
*    password (str, optional): Password for the user. Defaults to the env variable TGPASSWORD or None.
*    rest_port (str, optional): Port for the REST endpoint. Defaults to "9000".
*    gs_port (str, optional): Port for GraphStudio. Defaults to "14240".
*    token_auth (bool, optional): Whether to use token authentication. If True, token authentication must be turned on in the TigerGraph database server. Defaults to True.

In [2]:
tgraph = TigerGraph(
    host="http://127.0.0.1", # Change the address to your database server's
    graph="Cora",
    username="tigergraph",
    password="tigergraph",
    token_auth=False # Whether to use token authentication. If True, token authentication must be turned on in the TigerGraph database server.
)

In [3]:
# Basic metadata about the graph such as schema.
tgraph.info()

Using graph 'Cora'
---- Graph Cora
Vertex Types: 
  - VERTEX Paper(PRIMARY_ID id INT, x LIST<INT>, y INT, train_mask BOOL, val_mask BOOL, test_mask BOOL) WITH STATS="OUTDEGREE_BY_EDGETYPE", PRIMARY_ID_AS_ATTRIBUTE="true"
Edge Types: 
  - DIRECTED EDGE Cite(FROM Paper, TO Paper)

Graphs: 
  - Graph Cora(Paper:v, Cite:e)
Jobs: 
  - CREATE LOADING JOB load_cora_data {
      DEFINE FILENAME edge_csv = "/home/tigergraph/data/Cora/edges.csv";
      DEFINE FILENAME node_csv = "/home/tigergraph/data/Cora/nodes.csv";
      LOAD node_csv TO VERTEX Paper VALUES($"id", SPLIT($"x", " "), $"y", $"train", $"valid", $"test") USING SEPARATOR=",", HEADER="true", EOL="\n";
      LOAD edge_csv TO EDGE Cite VALUES($"source", $"target") USING SEPARATOR=",", HEADER="true", EOL="\n";
    }

Queries: 
  - get_vertex_number(string v_type, string filter_by) (installed v2)






In [4]:
# Total number of vertices
tgraph.number_of_vertices()

2708

In [5]:
# Number of vertices of a specific type
tgraph.number_of_vertices("Paper")

2708

In [7]:
# Number of vertices of a specific type and filtered by a boolean attribute
tgraph.number_of_vertices(vertex_type = "Paper", filter_by = "train_mask")

140

In [6]:
# Total number of edges
tgraph.number_of_edges()

10556

In [9]:
# Number of edges of a specific type
tgraph.number_of_edges("Cite")

10556

### Edge Loader

In [10]:
from tgml.dataloaders import EdgeLoader

The first time you initialize the loader on a graph in TigerGraph, initialization might take a minute because the loader installs the corresponding query in the database and optimizes it. The query only needs to be installed once, so later initializations on the same TigerGraph graph should be much faster.

There are two ways to use the data loader.
* First, it can be used as an iterator, which means you can loop through it to get every batch of data. If you load all edges at once, the iterator yields a single batch containing all the edges.
* Second, you can access the `data` property of the class directly. If there is only one batch of data to load, it gives you that batch directly instead of an iterator, which is usually more convenient in that case. If there are multiple batches to load, it returns the loader itself, which you can then iterate over.
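
The two access patterns can be illustrated without a database. The class below is a hypothetical stand-in, not the `tgml` implementation: it serves rows from a plain list and mimics only the behaviour described above, namely that iteration yields batches and that `data` returns either the single batch or the loader itself.

```python
class ToyLoader:
    """Hypothetical stand-in for a tgml loader; it serves rows from a list."""

    def __init__(self, rows, num_batches=1):
        self.rows = rows
        self.num_batches = num_batches

    def __iter__(self):
        # Split rows into num_batches contiguous chunks of near-equal size.
        size, extra = divmod(len(self.rows), self.num_batches)
        start = 0
        for i in range(self.num_batches):
            end = start + size + (1 if i < extra else 0)
            yield self.rows[start:end]
            start = end

    @property
    def data(self):
        # One batch: hand back the batch itself. Several: hand back the loader.
        if self.num_batches == 1:
            return next(iter(self))
        return self


edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]

single = ToyLoader(edges)
print(single.data == edges)            # True: the lone batch is returned directly

batched = ToyLoader(edges, num_batches=2)
print(batched.data is batched)         # True: iterate over it to get the batches
print([len(b) for b in batched.data])  # [3, 2]
```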

Args to `EdgeLoader` class:
* graph (TigerGraph): Connection to the TigerGraph database.
* batch_size (int, optional): Size of each batch. If given, `num_batches` will be recalculated based on batch size. Defaults to None.
* num_batches (int, optional): Number of batches to split the whole dataset. Defaults to 1.
* local_storage_path (str, optional): Place to store data locally. Defaults to "./tmp".
* cloud_storage_path (str, optional): S3 path used for cloud caching. Defaults to None.
* buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
* output_format (str, optional): Format of the output data of the loader. Defaults to "dataframe".
* aws_access_key_id (str, optional): AWS access key. Defaults to None.
* aws_secret_access_key (str, optional): AWS access key secret. Defaults to None.

If using cloud caching, cloud storage access keys need to be provided. For AWS S3, `aws_access_key_id` and `aws_secret_access_key` are required. However, the class can read them from the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, and again it is recommended to store those credentials in the `.env` file instead of hardcoding them.
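
On the `batch_size` and `num_batches` arguments: the notebook does not show how `num_batches` is recalculated from `batch_size`. A ceiling division is one plausible rule, and it agrees with the batch counts reported in this notebook (42 batches for the 10,556 Cora edges at `batch_size=256`, 28 batches for the 2,708 vertices at `batch_size=100`). The function name `expected_num_batches` is hypothetical.

```python
import math


def expected_num_batches(num_items, batch_size):
    """Assumed rule: enough batches that the average batch is at most batch_size."""
    return math.ceil(num_items / batch_size)


print(expected_num_batches(10556, 256))  # 42
print(expected_num_batches(2708, 100))   # 28
```

Note that the individual batches printed in this notebook are not all exactly `batch_size` rows long, so `batch_size` is better read as a target size than as a guarantee.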

#### Load all edges at once directly to local. Default.

In [11]:
%%time
edge_loader = EdgeLoader(tgraph)

CPU times: user 6.62 ms, sys: 2.85 ms, total: 9.47 ms
Wall time: 34.4 s


In [12]:
%%time
# Use case 1: iterator
data = []
for batch in edge_loader:
    data.append(batch)

CPU times: user 9.52 ms, sys: 3.26 ms, total: 12.8 ms
Wall time: 1.3 s


In [13]:
data

[          0     1
 0      2703  1298
 1      1573   598
 2      2206  1951
 3       206    71
 4      2543  1000
 ...     ...   ...
 10551   427  1528
 10552  1518  1129
 10553  1518  1063
 10554  1518   577
 10555   407  1681
 
 [10556 rows x 2 columns]]

In [14]:
%%time
# Use case 2: `data` property
data = edge_loader.data

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 8.11 µs


In [15]:
data

          0     1
0      2703  1298
1      1573   598
2      2206  1951
3       206    71
4      2543  1000
...     ...   ...
10551   427  1528
10552  1518  1129
10553  1518  1063
10554  1518   577


#### Stream batches of edges directly to local.

In [16]:
%%time
edge_loader = EdgeLoader(tgraph, batch_size = 256)

CPU times: user 6.19 ms, sys: 2.49 ms, total: 8.68 ms
Wall time: 35.3 s


In [17]:
%%time
# Use case 1: as an iterator
data = []
for batch in edge_loader:
    data.append(batch)

CPU times: user 199 ms, sys: 22.4 ms, total: 221 ms
Wall time: 3.28 s


In [18]:
print("Number of batches: ", len(data))
data

Number of batches:  42


[        0     1
 0     239   887
 1     484  2046
 2     239  2182
 3     137  2144
 4    2032  1336
 ..    ...   ...
 283   661  1045
 284   251  1542
 285  1656   306
 286  1518   577
 287   251  1300
 
 [288 rows x 2 columns],
         0     1
 0    2135   175
 1    2135  2282
 2    1033  2168
 3     969  2450
 4    1154  1007
 ..    ...   ...
 236  1745  2596
 237  1927  1435
 238   251   507
 239   251  1933
 240   407   695
 
 [241 rows x 2 columns],
         0     1
 0    2118  2117
 1    1273  2034
 2    1735   969
 3      49  2034
 4    1975  1676
 ..    ...   ...
 245  1821   603
 246  1821   316
 247  1701  1857
 248  1219    45
 249   244  1610
 
 [250 rows x 2 columns],
         0     1
 0    1336  2032
 1    2296  1134
 2    2163  1348
 3     708   836
 4     258  1401
 ..    ...   ...
 238  1839  2424
 239   156  1358
 240  1215  1131
 241  1753  1358
 242   251  1413
 
 [243 rows x 2 columns],
         0     1
 0    2199   731
 1     258  2645
 2    1295  1171
 3    19

In [19]:
# Use case 2: `data` property
# Since there are multiple batches of data,
# the `data` property returns the loader itself.
data = edge_loader.data

In [20]:
%%time
print("Number of batches: ", sum(1 for batch in data))

Number of batches:  42
CPU times: user 207 ms, sys: 23.1 ms, total: 231 ms
Wall time: 3.17 s


### Vertex Loader

In [21]:
from tgml.dataloaders import VertexLoader

The first time you initialize the loader on a graph in TigerGraph, initialization might take half a minute because the loader installs the corresponding query in the database and optimizes it. The query only needs to be installed once, so later initializations on the same TigerGraph graph should be much faster.

There are two ways to use the data loader. 
* First, it can be used as an iterator, which means you can loop through it to get every batch of data. If you load all vertices at once, there will be only one batch of data (of all the vertices) in the iterator. 
* Second, you can access the `data` property of the class directly. If there is only one batch of data, it will give you the batch directly instead of an iterator, which might make more sense in that case. If there are multiple batches of data to load, it will return the loader again.

Args to `VertexLoader` class:
* graph (TigerGraph): Connection to the TigerGraph database.
* batch_size (int, optional): Size of each batch. If given, `num_batches` will be recalculated based on batch size. Defaults to None.
* num_batches (int, optional): Number of batches to split the whole dataset. Defaults to 1.
* attributes (str, optional): Vertex attributes to get, separated by comma. Defaults to "".
* local_storage_path (str, optional): Place to store data locally. Defaults to "./tmp".
* cloud_storage_path (str, optional): S3 path used for cloud caching. Defaults to None.
* buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
* output_format (str, optional): Format of the output data of the loader. Only pandas dataframe is supported. Defaults to "dataframe".
* aws_access_key_id (str, optional): AWS access key. Defaults to None.
* aws_secret_access_key (str, optional): AWS access key secret. Defaults to None.

#### Load all vertices at once directly to local. Default.

In [22]:
%%time
vertex_loader = VertexLoader(tgraph, attributes="x,y")
# Note: vertex primary ID will be extracted automatically. 
# No need to specify it as an attribute.

CPU times: user 6.67 ms, sys: 2.53 ms, total: 9.2 ms
Wall time: 35.5 s


In [23]:
%%time
# Use case 1: as an iterator
data = []
for batch in vertex_loader:
    data.append(batch)

CPU times: user 91.6 ms, sys: 29.7 ms, total: 121 ms
Wall time: 1.07 s


In [24]:
data

[      primary_id                                                  x  y
 0           2703  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...  3
 1            206  0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  2
 2           1203  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  0
 3           1269  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  5
 4           1573  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...  3
 ...          ...                                                ... ..
 2703        1931  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  3
 2704         463  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  4
 2705        1699  0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ...  1
 2706        1780  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  1
 2707         968  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  1
 
 [2708 rows x 3 columns]]
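
The display above suggests that the `x` attribute arrives as a space-separated string rather than as a numeric array. Assuming that is the case, a small helper can convert it before modelling. The name `parse_feature_string` and the shortened sample value are made up for illustration.

```python
def parse_feature_string(value):
    """Turn a space-separated string such as "0 0 1 0" into a list of ints."""
    return [int(token) for token in value.split()]


row_x = "0 0 0 0 0 1 0 0"  # shortened, made-up example value
features = parse_feature_string(row_x)
print(features)       # [0, 0, 0, 0, 0, 1, 0, 0]
print(sum(features))  # 1
```

On a batch returned by the loader, the same function could be applied to the whole column, for example with `batch["x"].map(parse_feature_string)`.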

In [25]:
%%time
# Use case 2: `data` property
data = vertex_loader.data

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 8.82 µs


In [26]:
data

      primary_id                                                  x  y
0           2703  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...  3
1            206  0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  2
2           1203  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  0
3           1269  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  5
4           1573  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...  3
...          ...                                                ... ..
2703        1931  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  3
2704         463  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  4
2705        1699  0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ...  1
2706        1780  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  1


#### Stream batches of vertices directly to local.

In [27]:
%%time
vertex_loader = VertexLoader(tgraph, 
                             batch_size=100,
                             attributes="x,y")

CPU times: user 6.96 ms, sys: 2.62 ms, total: 9.58 ms
Wall time: 36.1 s


In [28]:
%%time
# Use case 1: as an iterator
data = []
for batch in vertex_loader:
    data.append(batch)

CPU times: user 265 ms, sys: 67.2 ms, total: 332 ms
Wall time: 2.37 s


In [30]:
print("Number of batches: ", len(data))
data

Number of batches:  28


[     primary_id                                                  x  y
 0          1051  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  4
 1          2703  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...  3
 2          2118  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...  3
 3          1550  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  2
 4          2448  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  2
 ..          ...                                                ... ..
 97         1349  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  4
 98           64  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  2
 99         1337  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  1
 100         427  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  2
 101        1984  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  4
 
 [102 rows x 3 columns],
      primary_id                                                  x  y
 0          1573  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [31]:
# Use case 2: `data` property
# Since there are multiple batches of data,
# the `data` property returns the loader itself.
data = vertex_loader.data

In [32]:
%%time
print("Number of batches: ", sum(1 for batch in data))

Number of batches:  28
CPU times: user 234 ms, sys: 53.7 ms, total: 288 ms
Wall time: 2.35 s


### Graph Loader

#### Load the whole graph directly to local

In [33]:
from tgml.dataloaders import GraphLoader

The first time you initialize the loader on a graph in TigerGraph, initialization might take half a minute because the loader installs the corresponding query in the database and optimizes it. The query only needs to be installed once, so later initializations on the same TigerGraph graph should be much faster.

There are two ways to use the data loader. 
* First, it can be used as an iterator, which means you can loop through it to get every batch of data. Since this loader loads the whole graph at once, there will be only one batch of data (of the whole graph) in the iterator. 
* Second, you can access the `data` property of the class directly. Since there is only one batch of data (the whole graph), it will give you the batch directly instead of an iterator.

Args to `GraphLoader` class:
* graph (TigerGraph): Connection to the TigerGraph database.
* v_in_feats (str, optional): Attributes to be used as input features and their types. Attributes should be separated by ',' and an attribute and its type should be separated by ':'. The type can be omitted together with the separator ':', in which case the attribute defaults to type "float32". Defaults to "".
* v_out_labels (str, optional): Attributes to be used as labels for prediction. It follows the same format as 'v_in_feats'. Defaults to "".
* v_extra_feats (str, optional): Other attributes to get such as indicators of train/test data. It follows the same format as 'v_in_feats'. Defaults to "".
* local_storage_path (str, optional): Place to store data locally. Defaults to "./tmp".
* cloud_storage_path (str, optional): S3 path used for cloud caching. Defaults to None.
* buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
* output_format (str, optional): Format of the output data of the loader. Only "PyG" is supported. Defaults to "PyG".
* reindex (bool, optional): Whether to reindex the vertices. Defaults to False.
* aws_access_key_id (str, optional): AWS access key. Defaults to None.
* aws_secret_access_key (str, optional): AWS access key secret. Defaults to None.
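
The attribute format used by `v_in_feats`, `v_out_labels`, and `v_extra_feats` can be made concrete with a small parser. This is only a sketch of the format as described in the argument list, not the parser `tgml` uses internally; `parse_feature_spec` is a hypothetical name.

```python
def parse_feature_spec(spec, default_type="float32"):
    """Parse "attr:type,attr" into an ordered {attribute: type} mapping."""
    parsed = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty spec or a trailing comma
        name, _, dtype = item.partition(":")
        # A missing type (no ':' part) falls back to the default.
        parsed[name.strip()] = dtype.strip() or default_type
    return parsed


print(parse_feature_spec("x"))
# {'x': 'float32'}
print(parse_feature_spec("train_mask:bool,val_mask:bool,test_mask:bool"))
# {'train_mask': 'bool', 'val_mask': 'bool', 'test_mask': 'bool'}
```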

In [34]:
%%time
graph_loader = GraphLoader(
                 graph = tgraph,
                 v_in_feats = "x:float32",
                 v_out_labels = "y:int",
                 v_extra_feats = "train_mask:bool,val_mask:bool,test_mask:bool",
                 output_format = "PyG",
                 reindex=False)

CPU times: user 4.2 ms, sys: 2.06 ms, total: 6.26 ms
Wall time: 37.9 s


In [35]:
%%time
# Use case 1: as an iterator.
data = []
for batch in graph_loader:
    data.append(batch)

CPU times: user 570 ms, sys: 64.4 ms, total: 634 ms
Wall time: 2.35 s


In [36]:
data

[Data(edge_index=[2, 10556], x=[2708, 1433], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])]

In [37]:
%%time
# Use case 2: `.data` property
data = graph_loader.data

CPU times: user 7 µs, sys: 0 ns, total: 7 µs
Wall time: 11.2 µs


In [38]:
data

Data(edge_index=[2, 10556], x=[2708, 1433], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])

#### Stream subgraphs with neighbor sampling

In [39]:
from tgml.dataloaders import NeighborLoader

`NeighborLoader` is a data loader that performs neighbor sampling as introduced in the [Inductive Representation Learning on Large Graphs](https://arxiv.org/abs/1706.02216) paper.

Specifically, it first chooses `batch_size` vertices as seeds, then picks `num_neighbors` neighbors of each seed at random, then `num_neighbors` neighbors of each of those neighbors, and so on for `num_hops` hops. This generates one subgraph. As you loop through the data loader, every vertex is eventually chosen as a seed, and you get all the subgraphs expanded from those seeds.
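
The sampling procedure can be sketched in plain Python on a toy adjacency list. This is an illustration of the idea only: the real loader runs the sampling inside the database, and details such as expanding each newly reached vertex only once are assumptions of this sketch. The function `sample_subgraph` and the tiny graph are made up.

```python
import random


def sample_subgraph(adjacency, seeds, num_neighbors=10, num_hops=2, rng=None):
    """Return (vertices, edges) reached by sampling neighbors hop by hop."""
    rng = rng or random.Random()
    vertices = set(seeds)
    edges = set()
    frontier = list(seeds)
    for _ in range(num_hops):
        next_frontier = []
        for vertex in frontier:
            neighbors = adjacency.get(vertex, [])
            # Keep every neighbor when there are fewer than num_neighbors.
            k = min(num_neighbors, len(neighbors))
            for neighbor in rng.sample(neighbors, k):
                edges.add((vertex, neighbor))
                if neighbor not in vertices:
                    vertices.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return vertices, edges


# Tiny made-up directed graph: vertex -> list of out-neighbors.
adjacency = {0: [1, 2, 3], 1: [4], 2: [4, 5], 3: [], 4: [6], 5: [], 6: []}
vertices, edges = sample_subgraph(adjacency, seeds=[0], num_neighbors=2,
                                  num_hops=2, rng=random.Random(0))
print(sorted(vertices))
print(sorted(edges))
```

With `num_hops=2`, vertex 6 can never appear in the result because it is three hops away from the seed, which shows how `num_hops` bounds the size of each subgraph.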

To limit seeds to certain vertices, pass the name of a boolean vertex attribute to `filter_by`; only vertices for which that attribute is true can be chosen as seeds.

The first time you initialize the loader on a graph in TigerGraph, initialization might take half a minute because the loader installs the corresponding query in the database and optimizes it. The query only needs to be installed once, so later initializations on the same TigerGraph graph should be much faster.

Args to `NeighborLoader` class:
* graph (TigerGraph): Connection to the TigerGraph database.
* tmp_id (str, optional): Attribute name that holds the temporary ID of vertices. Defaults to "tmp_id".
* v_in_feats (str, optional): Attributes to be used as input features and their types. Attributes should be separated by ',' and an attribute and its type should be separated by ':'. The type can be omitted together with the separator ':', in which case the attribute defaults to type "float32". Defaults to "".
* v_out_labels (str, optional): Attributes to be used as labels for prediction. It follows the same format as 'v_in_feats'. Defaults to "".
* v_extra_feats (str, optional): Other attributes to get such as indicators of train/test data. It follows the same format as 'v_in_feats'. Defaults to "".
* local_storage_path (str, optional): Place to store data locally. Defaults to "./tmp".
* cloud_storage_path (str, optional): S3 or GCP path used for cloud caching. Defaults to None.
* buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
* output_format (str, optional): Format of the output data of the loader. Only "PyG" is supported. Defaults to "PyG".
* batch_size (int, optional): Number of vertices as seeds in each batch. Defaults to None.
* num_batches (int, optional): Number of batches to split the vertices. Defaults to 1.
* num_neighbors (int, optional): Number of neighbors to sample for each vertex. Defaults to 10.
* num_hops (int, optional): Number of hops to traverse when sampling neighbors. Defaults to 2.
* cache_id (str, optional): A tag attached to data generated. Defaults to None.
* shuffle (bool, optional): Whether to shuffle the vertices after every epoch. Defaults to False.
* filter_by (str, optional): A boolean attribute used to indicate which vertices can be included as seeds. Defaults to None.
* aws_access_key_id (str, optional): AWS access key. Defaults to None.
* aws_secret_access_key (str, optional): AWS access key secret. Defaults to None.

In [40]:
%%time
neighbor_loader = NeighborLoader(
                 graph = tgraph,
                 tmp_id = "tmp_id",
                 v_in_feats = "x:float32",
                 v_out_labels = "y:int",
                 v_extra_feats = "train_mask:bool,val_mask:bool,test_mask:bool",
                 output_format = "PyG",
                 batch_size = 64,
                 num_neighbors = 10,
                 num_hops = 2)

CPU times: user 30.5 ms, sys: 10.4 ms, total: 41 ms
Wall time: 2min 24s


In [41]:
%%time
data = []
for batch in neighbor_loader:
    data.append(batch)
print("Number of batches: ", len(data))

Number of batches:  43
CPU times: user 9.06 s, sys: 512 ms, total: 9.58 s
Wall time: 17 s


In [42]:
data

[Data(edge_index=[2, 4312], x=[1076, 1433], y=[1076], train_mask=[1076], val_mask=[1076], test_mask=[1076]),
 Data(edge_index=[2, 3835], x=[991, 1433], y=[991], train_mask=[991], val_mask=[991], test_mask=[991]),
 Data(edge_index=[2, 3087], x=[825, 1433], y=[825], train_mask=[825], val_mask=[825], test_mask=[825]),
 Data(edge_index=[2, 2461], x=[803, 1433], y=[803], train_mask=[803], val_mask=[803], test_mask=[803]),
 Data(edge_index=[2, 3130], x=[909, 1433], y=[909], train_mask=[909], val_mask=[909], test_mask=[909]),
 Data(edge_index=[2, 3717], x=[1005, 1433], y=[1005], train_mask=[1005], val_mask=[1005], test_mask=[1005]),
 Data(edge_index=[2, 3036], x=[823, 1433], y=[823], train_mask=[823], val_mask=[823], test_mask=[823]),
 Data(edge_index=[2, 3444], x=[921, 1433], y=[921], train_mask=[921], val_mask=[921], test_mask=[921]),
 Data(edge_index=[2, 4011], x=[1115, 1433], y=[1115], train_mask=[1115], val_mask=[1115], test_mask=[1115]),
 Data(edge_index=[2, 3229], x=[820, 1433], y=[820

In [43]:
%%time
neighbor_loader = NeighborLoader(
                 graph = tgraph,
                 tmp_id = "tmp_id",
                 v_in_feats = "x:float32",
                 v_out_labels = "y:int",
                 v_extra_feats = "train_mask:bool,val_mask:bool,test_mask:bool",
                 output_format = "PyG",
                 batch_size = 16,
                 num_neighbors = 10,
                 num_hops = 2,
                 filter_by = "train_mask")

CPU times: user 10.5 ms, sys: 0 ns, total: 10.5 ms
Wall time: 61.7 ms


In [44]:
%%time
data = []
for batch in neighbor_loader:
    data.append(batch)
print("Number of batches: ", len(data))

Number of batches:  9
CPU times: user 2.09 s, sys: 72.2 ms, total: 2.16 s
Wall time: 1.28 s


In [45]:
data

[Data(edge_index=[2, 1065], x=[287, 1433], y=[287], train_mask=[287], val_mask=[287], test_mask=[287]),
 Data(edge_index=[2, 565], x=[212, 1433], y=[212], train_mask=[212], val_mask=[212], test_mask=[212]),
 Data(edge_index=[2, 464], x=[170, 1433], y=[170], train_mask=[170], val_mask=[170], test_mask=[170]),
 Data(edge_index=[2, 907], x=[308, 1433], y=[308], train_mask=[308], val_mask=[308], test_mask=[308]),
 Data(edge_index=[2, 1234], x=[330, 1433], y=[330], train_mask=[330], val_mask=[330], test_mask=[330]),
 Data(edge_index=[2, 1123], x=[294, 1433], y=[294], train_mask=[294], val_mask=[294], test_mask=[294]),
 Data(edge_index=[2, 1727], x=[435, 1433], y=[435], train_mask=[435], val_mask=[435], test_mask=[435]),
 Data(edge_index=[2, 1057], x=[329, 1433], y=[329], train_mask=[329], val_mask=[329], test_mask=[329]),
 Data(edge_index=[2, 824], x=[282, 1433], y=[282], train_mask=[282], val_mask=[282], test_mask=[282])]

### Smart Cloud Caching

When you provide `cloud_storage_path` when creating a loader (this applies to the vertex, edge, and graph loaders alike), data is first written to cloud storage and then downloaded to the local machine, so loading is slower than streaming directly from the server. However, when several consumers need the same data, for example when training different models in parallel or tuning hyperparameters, cloud caching reduces the load on the server and may therefore be faster than having every consumer query the server at the same time.

To share the cloud cache between different consumers, provide the same `cache_id` when creating the loaders. Below we create two loaders in the same Python session to demonstrate cloud caching; in practice, you would run parallel Python sessions, each with its own loader.
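
The sharing idea can be illustrated with an in-memory toy. How `tgml` actually coordinates consumers through cloud storage is not shown in this notebook, so everything below is a hypothetical sketch: `ToyCloudCache` stands in for the shared storage location and `export_vertices` for the expensive database export.

```python
class ToyCloudCache:
    """Hypothetical in-memory stand-in for a shared cloud storage location."""

    def __init__(self):
        self.store = {}
        self.server_hits = 0

    def fetch(self, cache_id, export_from_server):
        # Only the first consumer for a given cache_id triggers an export.
        if cache_id not in self.store:
            self.server_hits += 1
            self.store[cache_id] = export_from_server()
        return self.store[cache_id]


def export_vertices():
    # Placeholder for the expensive database export.
    return [{"primary_id": 2703, "y": 3}, {"primary_id": 206, "y": 2}]


cache = ToyCloudCache()
first = cache.fetch("test_smart_cache", export_vertices)
second = cache.fetch("test_smart_cache", export_vertices)
print(first == second)    # True: both consumers see the same data
print(cache.server_hits)  # 1: the server was asked only once
```

Consumers that pass different `cache_id` values would each trigger their own export in this sketch, which is why the tag has to match for the cache to be shared.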

In [46]:
%%time
vertex_loader = VertexLoader(tgraph, 
                             batch_size=100,
                             attributes="x,y",
                             cache_id="test_smart_cache",
                             cloud_storage_path="s3://graph-export-dev/cora_vertices",
                             aws_access_key_id="your aws_access_key_id", # This can be read from the env variable `AWS_ACCESS_KEY_ID` as well
                             aws_secret_access_key="your aws_secret_access_key" # This can be read from the env variable `AWS_SECRET_ACCESS_KEY` as well
                             )

CPU times: user 8.92 ms, sys: 57 µs, total: 8.97 ms
Wall time: 35.1 ms


In [47]:
%%time
data = []
for batch in vertex_loader:
    data.append(batch)

CPU times: user 1.04 s, sys: 118 ms, total: 1.16 s
Wall time: 11.1 s


In [48]:
data

[     primary_id                                                  x  y
 0          2703  0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...  3
 1          1051  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  4
 2          1550  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  2
 3          2593  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...  3
 4           857  0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...  4
 ..          ...                                                ... ..
 97          432  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  3
 98          692  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...  3
 99         1873  0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...  3
 100         181  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  6
 101        2608  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...  6
 
 [102 rows x 3 columns],
      primary_id                                                  x  y
 0           367  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [49]:
%%time
vertex_loader2 = VertexLoader(tgraph, 
                             batch_size=100,
                             attributes="x,y",
                             cache_id="test_smart_cache",
                             cloud_storage_path="s3://graph-export-dev/cora_vertices",
                             aws_access_key_id="your aws_access_key_id", # This can be read from the env variable `AWS_ACCESS_KEY_ID` as well
                             aws_secret_access_key="your aws_secret_access_key" # This can be read from the env variable `AWS_SECRET_ACCESS_KEY` as well
                             )

CPU times: user 9.84 ms, sys: 0 ns, total: 9.84 ms
Wall time: 33.7 ms


In [50]:
%%time
data2 = []
for batch in vertex_loader2:
    data2.append(batch)

CPU times: user 926 ms, sys: 90.4 ms, total: 1.02 s
Wall time: 5.53 s


In [51]:
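# Both loaders should return the same batches from the shared cache.
# `all(d1 == d2)` would only iterate over column labels, so use `DataFrame.equals`.
assert len(data) == len(data2)
for d1, d2 in zip(data, data2):
    assert d1.equals(d2)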