
Unroll deposit bit plane loop #35

Merged
merged 1 commit into LLNL:develop
Jan 31, 2019

Conversation

LennartNoordsij
Contributor

Making the loop bound static prevents warp divergence and speeds up the writing. CUDA 8 supports specifying the unroll factor (via #pragma unroll), which is already done in other functions of the CUDA implementation.
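A minimal sketch of the idea behind this change (function and variable names are illustrative, not taken from the zfp sources): with a runtime bound, threads in a warp may execute different trip counts and diverge; with a compile-time bound, every thread runs the same fully unrollable loop and unneeded iterations are simply masked out. In CUDA the loop would additionally carry a `#pragma unroll` hint.

```c
#include <stdint.h>

#define MAXBITS 32  /* static, compile-time loop bound */

/* Before: the bound n varies per call (per thread on the GPU),
 * so warps may diverge and the compiler cannot fully unroll. */
uint32_t deposit_bits_dynamic(const uint32_t *planes, int n) {
  uint32_t word = 0;
  for (int i = 0; i < n; i++)
    word |= (planes[i] & 1u) << i;
  return word;
}

/* After: static bound; iterations beyond n are masked out rather
 * than skipped, so the loop body is uniform across the warp.
 * In CUDA one would precede the loop with:
 *   #pragma unroll MAXBITS
 */
uint32_t deposit_bits_static(const uint32_t *planes, int n) {
  uint32_t word = 0;
  for (int i = 0; i < MAXBITS; i++)
    word |= (i < MAXBITS && i < n ? planes[i] & 1u : 0u) << i;
  return word;
}
```

Both variants produce the same word; the static version trades a handful of masked iterations for branch-free, unrollable code.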

@mclarsen
Contributor

Thanks!

@lindstro
Member

@LennartNoordsij This sounds a bit like a related optimization that we have been working on. One of the steps of zfp is transposition of the binary matrix of coefficient bits so that the matrix is organized by column (bit plane) instead of by row (coefficient). In the current CPU implementation, this transposition is done on the fly by repeated access to the rows to extract each column, i.e., for precision p and dimensionality d, we make 4^d * p memory accesses. Using a more optimized explicit transposition, we reduce this to lg(4^d) * p = 2 * d * p memory accesses (more than 10x fewer in 3D). Moreover, this computation is highly parallelizable as the algorithm is based on divide and conquer.
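The divide-and-conquer transposition described above can be sketched for the 2D case (d = 2, so 4^d = 16 coefficients), here with 16 bit planes stored MSB first. This is a classic bit-matrix transpose in the style of Hacker's Delight, not zfp's actual implementation: lg(16) = 4 passes, each touching all 16 words, gives 2 * d * p = 64 word accesses versus the 4^d * p = 256 bit extractions of the naive on-the-fly approach.

```c
#include <stdint.h>

/* Transpose a 16x16 bit matrix in place.  Row i of A holds the 16
 * coefficient bits of coefficient i, MSB first; on return, row j
 * holds bit plane j.  Each of the lg(16) = 4 passes swaps the two
 * off-diagonal sub-blocks at the current granularity j. */
void transpose16(uint16_t A[16]) {
  uint16_t m = 0x00FFu;  /* mask selecting the lower half-block */
  for (int j = 8; j != 0; j >>= 1, m ^= (uint16_t)(m << j)) {
    for (int k = 0; k < 16; k = (k + j + 1) & ~j) {
      /* swap the j x j sub-blocks of rows k..k+j-1 and k+j..k+2j-1 */
      uint16_t t = (uint16_t)((A[k] ^ (A[k + j] >> j)) & m);
      A[k] ^= t;
      A[k + j] ^= (uint16_t)(t << j);
    }
  }
}
```

Because each pass operates on independent row pairs, the passes map naturally onto parallel hardware, which is the divide-and-conquer property mentioned above.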

On the CPU, I've seen a 20x speedup of the transposition itself and a resulting 3x speedup of the (de)compressor using this optimization, which completely unrolls the whole loop over bit planes (the outer for (k) loop). This would likely run even faster if we take advantage of parallelism, but would require a significant rewrite of the CUDA implementation, which as you know currently parallelizes only over zfp blocks.

Anyway, since it seems related to your optimization, I thought I'd mention it.

@codecov-io commented Jan 29, 2019

Codecov Report

Merging #35 into develop will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff            @@
##           develop      #35   +/-   ##
========================================
  Coverage    91.54%   91.54%           
========================================
  Files           46       46           
  Lines         2247     2247           
========================================
  Hits          2057     2057           
  Misses         190      190

Last update be42427...57abe08.

@salasoom merged commit 201d32f into LLNL:develop on Jan 31, 2019