Re: This thread https://github.com/Snapchat/GiGL/pull/432/files#r2687675025
We currently sample equally from each storage node for each compute node. So if we have the below setup with 2 storage nodes and 4 compute nodes:
Storage node 0:
[0, 1, 2, 3]
Storage node 1:
[4, 5, 6, 7]
Compute node 0 samples:
[[0], [4]]
Compute node 1 samples:
[[1], [5]]
Compute node 2 samples:
[[2], [6]]
Compute node 3 samples:
[[3], [7]]
This may not be efficient, since every compute node opens a connection to every storage node. It may be better to have some setup like:
Compute node 0 samples:
[[0, 1], []]
Compute node 1 samples:
[[2, 3], []]
Compute node 2 samples:
[[], [4, 5]]
Compute node 3 samples:
[[], [6, 7]]
To reduce overall network chatter across the cluster.
Fortunately, since the input_nodes are entirely user controlled, users should be able to tune this themselves, and we can add a flag to RemoteDistDataset.get_node_ids to control how we shard.
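A minimal sketch of the two sharding strategies, to make the trade-off concrete. Note `shard_node_ids`, its `mode` flag, and its return layout are all hypothetical (not GiGL APIs); this assumes compute nodes divide evenly into storage partitions and ignores remainders:

```python
from typing import Literal


def shard_node_ids(
    storage_partitions: list[list[int]],
    num_compute_nodes: int,
    mode: Literal["even", "locality"] = "even",
) -> list[list[list[int]]]:
    """Assign node ids to compute nodes.

    `storage_partitions[s]` holds the ids living on storage node s.
    Returns `out[c][s]`: the ids compute node c samples from storage node s.
    """
    num_storage = len(storage_partitions)
    out: list[list[list[int]]] = [
        [[] for _ in range(num_storage)] for _ in range(num_compute_nodes)
    ]
    if mode == "even":
        # Current behavior: each compute node takes an equal slice of every
        # storage partition, so all C x S (compute, storage) pairs connect.
        for s, ids in enumerate(storage_partitions):
            chunk = len(ids) // num_compute_nodes
            for c in range(num_compute_nodes):
                out[c][s] = ids[c * chunk : (c + 1) * chunk]
    else:
        # Proposed behavior: dedicate a group of compute nodes to each
        # storage node, so each compute node talks to only one storage node.
        per_storage = num_compute_nodes // num_storage
        for s, ids in enumerate(storage_partitions):
            chunk = len(ids) // per_storage
            for i in range(per_storage):
                c = s * per_storage + i
                out[c][s] = ids[i * chunk : (i + 1) * chunk]
    return out
```

With the 2-storage / 4-compute example above, "even" reproduces the current layout (8 connections total), while "locality" reproduces the proposed one (4 connections total).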