
Unroll deposit bit plane loop #35

Merged
merged 1 commit into LLNL:develop
Jan 31, 2019

Conversation

LennartNoordsij
Contributor

Making the loop bound static prevents warp divergence and speeds up the writing. CUDA 8 supports specifying the unroll factor (via #pragma unroll), which is already done in other functions of the CUDA implementation.
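A minimal sketch of the idea behind this change (function and variable names are illustrative, not taken from the zfp sources): with a runtime bound, threads in a warp may execute different trip counts and diverge; with a compile-time bound, every thread runs the same fully unrollable loop and unneeded iterations are simply masked out. In CUDA the loop would additionally carry a `#pragma unroll` hint.

```c
#include <stdint.h>

#define MAXBITS 32  /* static, compile-time loop bound */

/* Before: the bound n varies per call (per thread on the GPU),
 * so warps may diverge and the compiler cannot fully unroll. */
uint32_t deposit_bits_dynamic(const uint32_t *planes, int n) {
  uint32_t word = 0;
  for (int i = 0; i < n; i++)
    word |= (planes[i] & 1u) << i;
  return word;
}

/* After: static bound; iterations beyond n are masked out rather
 * than skipped, so the loop body is uniform across the warp.
 * In CUDA one would precede the loop with:
 *   #pragma unroll MAXBITS
 */
uint32_t deposit_bits_static(const uint32_t *planes, int n) {
  uint32_t word = 0;
  for (int i = 0; i < MAXBITS; i++)
    word |= (i < MAXBITS && i < n ? planes[i] & 1u : 0u) << i;
  return word;
}
```

Both variants produce the same word; the static version trades a handful of masked iterations for branch-free, unrollable code.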

@mclarsen
Contributor

Thanks!

@lindstro
Member

@LennartNoordsij This sounds a bit like a related optimization that we have been working on. One of the steps of zfp is transposition of the binary matrix of coefficient bits so that the matrix is organized by column (bit plane) instead of by row (coefficient). In the current CPU implementation, this transposition is done on the fly by repeated access to the rows to extract each column, i.e., for precision p and dimensionality d, we make 4^d * p memory accesses. Using a more optimized explicit transposition, we reduce this to lg(4^d) * p = 2 * d * p memory accesses (more than 10x fewer in 3D). Moreover, this computation is highly parallelizable as the algorithm is based on divide and conquer.
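The divide-and-conquer transposition described above can be sketched for the 2D case (d = 2, so 4^d = 16 coefficients), here with 16 bit planes stored MSB first. This is a classic bit-matrix transpose in the style of Hacker's Delight, not zfp's actual implementation: lg(16) = 4 passes, each touching all 16 words, gives 2 * d * p = 64 word accesses versus the 4^d * p = 256 bit extractions of the naive on-the-fly approach.

```c
#include <stdint.h>

/* Transpose a 16x16 bit matrix in place.  Row i of A holds the 16
 * coefficient bits of coefficient i, MSB first; on return, row j
 * holds bit plane j.  Each of the lg(16) = 4 passes swaps the two
 * off-diagonal sub-blocks at the current granularity j. */
void transpose16(uint16_t A[16]) {
  uint16_t m = 0x00FFu;  /* mask selecting the lower half-block */
  for (int j = 8; j != 0; j >>= 1, m ^= (uint16_t)(m << j)) {
    for (int k = 0; k < 16; k = (k + j + 1) & ~j) {
      /* swap the j x j sub-blocks of rows k..k+j-1 and k+j..k+2j-1 */
      uint16_t t = (uint16_t)((A[k] ^ (A[k + j] >> j)) & m);
      A[k] ^= t;
      A[k + j] ^= (uint16_t)(t << j);
    }
  }
}
```

Because each pass operates on independent row pairs, the passes map naturally onto parallel hardware, which is the divide-and-conquer property mentioned above.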

On the CPU, I've seen a 20x speedup of the transposition itself and a resulting 3x speedup of the (de)compressor using this optimization, which completely unrolls the whole loop over bit planes (the outer for (k) loop). This would likely run even faster if we take advantage of parallelism, but would require a significant rewrite of the CUDA implementation, which as you know currently parallelizes only over zfp blocks.

Anyway, since it seems related to your optimization, I thought I'd mention it.

@codecov-io commented Jan 29, 2019

Codecov Report

Merging #35 into develop will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff            @@
##           develop      #35   +/-   ##
========================================
  Coverage    91.54%   91.54%           
========================================
  Files           46       46           
  Lines         2247     2247           
========================================
  Hits          2057     2057           
  Misses         190      190

Last update be42427...57abe08.

@salasoom merged commit 201d32f into LLNL:develop on Jan 31, 2019