Multi-Level-Multi-Queue (MLMQ) is a GPU-friendly data structure designed to accelerate Single-Source Shortest Path (SSSP) computation on GPUs.
The main code of MLMQ is in core, including the high-level templated framework and instantiated queues.
SSSP is the SSSP program based on MLMQ, where users can specify the types of queues and their parameters by manually editing the configuration file.
SSSP_adaptive is the SSSP program with adaptive MLMQ setup policy. The parameters will be automatically configured based on the input graph information.
ADDS is the baseline with the best performance, from "Wang, K., Fussell, D., & Lin, C. (2021, February). A fast work-efficient SSSP algorithm for GPUs. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 133-146)". You can also get the original version at https://zenodo.org/records/4365954#.X-2s8nVKiZQ.
Other baselines are not included. They can be found in:
H-BF from "Busato, F., & Bombieri, N. (2015). An efficient implementation of the Bellman-Ford algorithm for Kepler GPU architectures. IEEE Transactions on Parallel and Distributed Systems, 27(8), 2222-2233."
Gunrock and Gunrock-CPU at https://github.com/gunrock/gunrock, from "Wang, Y., Davidson, A., Pan, Y., Wu, Y., Riffel, A., & Owens, J. D. (2016, February). Gunrock: A high-performance graph processing library on the GPU. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (pp. 1-12)."
Ubuntu 20.04.1, gcc/g++ 7.5, CUDA >= 11.1, compute capability >= 8.0.
To compile, enter the target folder and execute the make command. For example,
```shell
cd SSSP
make
```
You can also compile with the script,
```shell
cd scripts
sh build_all.sh
```
Notice: users need to change the compilation option based on the compute capability of the GPU being used. The currently released version targets compute capability 8.6. For example, on an NVIDIA A100 (compute capability 8.0), the compilation option -gencode=arch=compute_86,code=sm_86 should be changed to -gencode=arch=compute_80,code=sm_80.
Most graphs we currently use have int edge weights. To run on a graph with float edge weights, change #define TYPE_INT to #define TYPE_FLOAT in core/include/common.h.
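For reference, the switch in core/include/common.h looks roughly like this (a sketch; the exact surrounding context in the header may differ):

```cpp
/* core/include/common.h -- edge-weight type selection (sketch) */
// #define TYPE_INT      /* default: int edge weights */
#define TYPE_FLOAT       /* use float edge weights instead */
```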
We use graph data in binary GR format, the same as ADDS. Download graph data at https://zenodo.org/records/4365954#.X-2s8nVKiZQ.
We also provide a tool in graph_transform to convert sparse matrix data in mtx format to gr format. Change #define TYPE_INT to #define TYPE_FLOAT in csr_graph.h to use float edge weights; otherwise int edge weights are used. Run the tool with the following commands:
```shell
cd graph_transform
make
./transform [sparse matrix file (.mtx)] [graph data file (.gr)]
```
Change the MLMQ configuration in core/include/common.h. Set MLMQ_TYPE to select the queue types. Tunable parameters include node_size, l2_batch_size, mlmq_delta, BNUM, and BUCKET_MAX.
The recommended configurations for the datasets in the paper are given in configuration.csv.
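As an illustration, a configuration in core/include/common.h might look like the following. The macro names come from the parameters listed above, but the values and the form of the MLMQ_TYPE setting shown here are placeholders, not recommended settings; take the actual values for each dataset from configuration.csv:

```cpp
/* core/include/common.h -- MLMQ configuration (illustrative values only) */
#define MLMQ_TYPE      L1NF_L2DQ  /* queue combination; this name is hypothetical */
#define node_size      64         /* placeholder value */
#define l2_batch_size  1024       /* placeholder value */
#define mlmq_delta     8192       /* placeholder value */
#define BNUM           64         /* placeholder value */
#define BUCKET_MAX     10         /* placeholder value */
```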
After configuring, compile again and run with ./main -i [graph file]. For example,
```shell
cd SSSP
make
./main -i USA-road-d.NY.gr
```
Run with our adaptive setup policy to enable automatic queue selection and parameter configuration on the current datasets. This version may show slight performance gaps on some graphs compared with the optimal manual configuration, but overall it achieves performance close to optimal.
Run with ./main -i [graph file]. For example,
```shell
cd SSSP_adaptive
make
./main -i USA-road-d.NY.gr
```
Run the ADDS baseline in ADDS. ads_int is for int edge weights and ads_float is for float edge weights.
Run with ./sssp [graph file]. For example,
```shell
cd ADDS/ads_int
make
./sssp USA-road-d.NY.gr
```
The currently released queue instances include: L0 vector (embedded in the SSSP program), L1 vector (L1V), L1 near-far queue (L1NF), L1 filter queue (L1FQ), L2 vector (L2V), L2 priority queue (L2PQ), and L2 delta queue (L2DQ). We do not recommend using L2PQ, as its performance is extremely low due to limited parallelism.
We are working towards better implementations for L2 Priority Queue (L2PQ), L2 Multi-vector Queue (L2MV), and L2 Multi-queue (L2MPQ).
This guide describes how to implement custom L1 (warp-local) and L2 (global) queues compatible with the MLMQ system.
L1 queues operate at the warp level, using registers or shared memory. Each warp maintains its own L1 queue instance.
```cpp
template <typename eletype>
class L1Queue {
public:
    __device__ init_status init(int wid, int lane_id);
    __device__ read_status read(eletype* node_in, int& read_num, int lane_id);
    __device__ write_status write(eletype* node_out, eletype* buffer, int& write_num, int& buffer_num, int lane_id);
    __device__ int get_queue_size();
};
```
| Function | Description |
|---|---|
| init(int wid, int lane_id) | Initializes the queue before use. Called per warp. |
| read(eletype* node_in, int& read_num, int lane_id) | Reads data from the queue. Updates read_num for the current thread. |
| write(eletype* node_out, eletype* buffer, int& write_num, int& buffer_num, int lane_id) | Writes items to the queue and optionally flushes them to the buffer for L2. |
| get_queue_size() | Returns the total size of the queue. |
Notes:
- lane_id: thread ID within the warp.
- buffer: used for flushing data to the L2 queue.
- write_num, buffer_num, read_num: per-warp counters.
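To make the control flow concrete, a worker warp might drive an L1 queue roughly as follows. This is an illustrative sketch against the interface above, not the shipped implementation; the staging-array size and the elided relaxation step are assumptions.

```cpp
// Illustrative per-warp driver loop for an L1 queue (sketch).
template <typename eletype>
__device__ void warp_process(L1Queue<eletype>& q, eletype* l2_buffer) {
    int lane_id = threadIdx.x % 32;    // thread ID within the warp
    int wid     = threadIdx.x / 32;    // warp ID within the block
    q.init(wid, lane_id);

    eletype node_in[4], node_out[4];   // per-thread staging (size assumed)
    while (q.get_queue_size() > 0) {
        int read_num = 0;
        q.read(node_in, read_num, lane_id);

        int write_num = 0, buffer_num = 0;
        // ... relax the edges of node_in[0..read_num) and collect
        //     improved neighbors into node_out[0..write_num) ...

        // Write back to L1; items that do not fit (or fall outside the
        // L1 priority range) go to the L2 flush buffer.
        q.write(node_out, l2_buffer, write_num, buffer_num, lane_id);
    }
}
```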
L2 queues reside in global memory and support multi-warp, multi-block coordination. Each L2 queue has a manager thread responsible for metadata handling.
```cpp
template <typename eletype>
class L2Queue {
public:
    init_status host_init(int max_size, mlmq_mdata& mdata, eletype init_limits, mlmq_setup setup);
    __device__ read_status read(eletype* node_in, int& read_num, int bid, int wid, int lane_id, unsigned* debug_time);
    __device__ write_status write(eletype* node_out, int& write_num, int bid, int wid, int lane_id, unsigned* debug_time);
    __device__ int get_queue_size();
    __device__ void manager_run(int wid, int lane_id);
    __device__ void update_done(int on_the_fly_num);
};
```
| Function | Description |
|---|---|
| host_init(...) | Host-side queue initialization before kernel launch. |
| read(...) | Reads elements from the global queue into a local buffer. |
| write(...) | Writes elements from a local buffer to the global queue. |
| get_queue_size() | Returns the current global queue size. |
| manager_run(...) | Manager-side function that maintains queue metadata. |
| update_done(on_the_fly_num) | Lazily updates progress-tracking metadata. |
Notes:
- bid, wid, lane_id: block ID, warp ID, and thread ID within the warp.
- debug_time: optional pointer for timing or profiling.
- on_the_fly_num: number of elements currently being processed, used for manager-worker coordination.
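A kernel using an L2 queue typically dedicates one warp to metadata management while the remaining warps act as workers; the sketch below shows that split under the interface above. The warp-role assignment, the simplified termination check, and the staging sizes are assumptions for illustration, not the shipped implementation.

```cpp
// Illustrative manager/worker split for an L2 queue (sketch).
template <typename eletype>
__global__ void l2_kernel(L2Queue<eletype> q) {
    int lane_id = threadIdx.x % 32;
    int wid     = threadIdx.x / 32;
    int bid     = blockIdx.x;

    if (bid == 0 && wid == 0) {
        // Manager warp: maintain queue metadata until the work is done.
        q.manager_run(wid, lane_id);
    } else {
        eletype node_in[4], node_out[4];  // per-thread staging (size assumed)
        while (q.get_queue_size() > 0) {  // termination check is simplified
            int read_num = 0;
            q.read(node_in, read_num, bid, wid, lane_id, /*debug_time=*/nullptr);

            int write_num = 0;
            // ... process node_in[0..read_num), filling node_out[0..write_num) ...
            q.write(node_out, write_num, bid, wid, lane_id, /*debug_time=*/nullptr);

            // Tell the manager how many items this warp still has in flight.
            q.update_done(/*on_the_fly_num=*/read_num);
        }
    }
}
```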
- L1 Queues: Use warp shuffle and ballot intrinsics for intra-warp coordination.
- L2 Queues: Use atomic operations for global coordination. Minimize contention.
- Buffering: Accumulate writes in L1 before flushing to L2 for efficiency.
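A common pattern combining these guidelines is warp-aggregated enqueueing: lanes vote with a ballot, compute their slot with a popcount prefix, and a single lane issues one atomic per warp to reserve space, so global contention stays low. The sketch below illustrates the pattern for int items; it is not taken from the MLMQ sources.

```cpp
// Sketch: warp-aggregated enqueue using ballot/shuffle intrinsics.
__device__ void warp_enqueue(int* queue, int* tail, int item, bool has_item) {
    unsigned active = __ballot_sync(0xffffffffu, has_item);
    if (active == 0) return;                           // nothing to write

    int lane   = threadIdx.x % 32;
    int prefix = __popc(active & ((1u << lane) - 1));  // items in lower lanes
    int leader = __ffs(active) - 1;                    // lowest active lane

    int base = 0;
    if (lane == leader)
        base = atomicAdd(tail, __popc(active));        // one atomic per warp
    base = __shfl_sync(0xffffffffu, base, leader);     // broadcast base index

    if (has_item)
        queue[base + prefix] = item;
}
```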
By following this interface, you can plug custom queue implementations into the MLMQ execution framework and explore the trade-offs between parallelism, memory locality, and priority guarantees.