-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add inclusive_scan with initial value support (warp/block) #1749
Conversation
🟩 CI Results [ Failed: 0 | Passed: 198 | Total: 198 ]
|
# | Runner |
---|---|
154 | linux-amd64-cpu16 |
16 | linux-arm64-cpu16 |
16 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
👃 Inspect Changes
Modifications in project?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
da78a11
to
0ca0b3c
Compare
🟩 CI Results [ Failed: 0 | Passed: 198 | Total: 198 ]
|
# | Runner |
---|---|
154 | linux-amd64-cpu16 |
16 | linux-arm64-cpu16 |
16 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
👃 Inspect Changes
Modifications in project?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I have troubles understanding what an initial value to a scan is, so maybe the documentation could be improved in this regard. However, it seems the term is used in several places, so it's out of scope for this PR.
0ca0b3c
to
f3142fa
Compare
🟨 CI Results [ Failed: 3 | Passed: 195 | Total: 198 ]
|
# | Runner |
---|---|
154 | linux-amd64-cpu16 |
16 | linux-arm64-cpu16 |
16 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
👃 Inspect Changes
Modifications in project?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
🟩 CI Results [ Failed: 0 | Passed: 198 | Total: 198 ]
|
# | Runner |
---|---|
154 | linux-amd64-cpu16 |
16 | linux-arm64-cpu16 |
16 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
👃 Inspect Changes
Modifications in project?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments on value-based API. I think the issue would be caught by an _api
tests we write for literalinclude as opposed to code-block based documentation. If you have spare moment, it'd be nice to rewrite new docs to use literalincludes.
🟩 CI Results [ Failed: 0 | Passed: 198 | Total: 198 ]
|
# | Runner |
---|---|
154 | linux-amd64-cpu16 |
16 | linux-arm64-cpu16 |
16 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
👃 Inspect Changes
Modifications in project?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
1 similar comment
🟩 CI Results [ Failed: 0 | Passed: 198 | Total: 198 ]
|
# | Runner |
---|---|
154 | linux-amd64-cpu16 |
16 | linux-arm64-cpu16 |
16 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
👃 Inspect Changes
Modifications in project?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments on the test and implementation side of value-based API of block scan. Please, let me know when those are addressed and I'll have another look.
🟩 CI Results [ Failed: 0 | Passed: 198 | Total: 198 ]
|
# | Runner |
---|---|
154 | linux-amd64-cpu16 |
16 | linux-arm64-cpu16 |
16 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
👃 Inspect Changes
Modifications in project?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
1 similar comment
🟩 CI Results [ Failed: 0 | Passed: 198 | Total: 198 ]
|
# | Runner |
---|---|
154 | linux-amd64-cpu16 |
16 | linux-arm64-cpu16 |
16 | linux-amd64-gpu-v100-latest-1 |
12 | windows-amd64-cpu16 |
👃 Inspect Changes
Modifications in project?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
a932030
to
726e988
Compare
🟨 CI Results: Pass: 97%/249 | Total Time: 4d 16h | Avg Time: 27m 13s | Hits: 60%/244265
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
🏃 Runner counts (total jobs: 249)
# | Runner |
---|---|
178 | linux-amd64-cpu16 |
40 | linux-amd64-gpu-v100-latest-1 |
16 | linux-arm64-cpu16 |
15 | windows-amd64-cpu16 |
f38095f
to
5bf71e0
Compare
🟩 CI Results: Pass: 100%/249 | Total Time: 1d 05h | Avg Time: 7m 08s | Hits: 99%/248441
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
🏃 Runner counts (total jobs: 249)
# | Runner |
---|---|
178 | linux-amd64-cpu16 |
40 | linux-amd64-gpu-v100-latest-1 |
16 | linux-arm64-cpu16 |
15 | windows-amd64-cpu16 |
c5fd4e9
to
a2c3772
Compare
a2c3772
to
e41e85a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor documentation fixes below. Thank you for taking your time to improve examples!
cub/cub/block/block_scan.cuh
Outdated
//! Snippet | ||
//! +++++++ | ||
//! | ||
//! The code snippet below illustrates an inclusive prefix max scan of 512 integer items that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! The code snippet below illustrates an inclusive prefix max scan of 512 integer items that | |
//! The code snippet below illustrates an inclusive prefix max scan of 128 integer items that |
cub/cub/block/block_scan.cuh
Outdated
//! +++++++ | ||
//! | ||
//! The code snippet below illustrates an inclusive prefix max scan of 512 integer items that | ||
//! are partitioned in a :ref:`blocked arrangement <flexible-data-arrangement>` across 128 threads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! are partitioned in a :ref:`blocked arrangement <flexible-data-arrangement>` across 128 threads | |
//! are partitioned in a :ref:`blocked arrangement <flexible-data-arrangement>` across 64 threads |
cub/cub/block/block_scan.cuh
Outdated
//! | ||
//! The code snippet below illustrates an inclusive prefix max scan of 512 integer items that | ||
//! are partitioned in a :ref:`blocked arrangement <flexible-data-arrangement>` across 128 threads | ||
//! where each thread owns 4 consecutive items. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! where each thread owns 4 consecutive items. | |
//! where each thread owns 2 consecutive items. |
cub/cub/block/block_scan.cuh
Outdated
//! Suppose the set of input ``thread_data`` across the block of threads is | ||
//! ``{[0, -1], [2, -3],[4, -5], ... [126, -127]}``. | ||
//! The corresponding output ``thread_data`` in those threads will be | ||
//! ``{[1, 1], [2, 2],[3, 3], ... [126, 126]}``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: I think we cover this in the kernel itself. I think seeing input and expected output in the docs might be confusing, but feel free to disagree:
//! Suppose the set of input ``thread_data`` across the block of threads is | |
//! ``{[0, -1], [2, -3],[4, -5], ... [126, -127]}``. | |
//! The corresponding output ``thread_data`` in those threads will be | |
//! ``{[1, 1], [2, 2],[3, 3], ... [126, 126]}``. |
cub/cub/block/block_scan.cuh
Outdated
//! | ||
//! The code snippet below illustrates an inclusive prefix max scan of 128 integer items that | ||
//! are partitioned in a :ref:`blocked arrangement <flexible-data-arrangement>` across 64 threads | ||
//! where each thread owns 4 consecutive items. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
//! where each thread owns 4 consecutive items. | |
//! where each thread owns 2 consecutive items. |
// warp #2 input: {2, 3, 4, 5, ..., 33} | ||
// warp #4 input: {3, 4, 5, 6, ..., 34} | ||
|
||
// Collectively compute the block-wide inclusive prefix max scan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Collectively compute the block-wide inclusive prefix max scan | |
// Collectively compute warp-wide inclusive prefix max scans |
// warp #2 input: {1, 1, 1, 1, ..., 1} | ||
// warp #4 input: {1, 1, 1, 1, ..., 1} | ||
|
||
// Collectively compute the block-wide inclusive prefix max scan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Collectively compute the block-wide inclusive prefix max scan | |
// Collectively compute warp-wide inclusive prefix max scans |
__global__ void InclusiveScanKernelAggr(int* output, int* d_warp_aggregate) | ||
{ | ||
// Specialize WarpScan for type int | ||
typedef cub::WarpScan<int> warp_scan_t; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typedef cub::WarpScan<int> warp_scan_t; | |
using warp_scan_t = cub::WarpScan<int>; |
|
||
cuda::std::inclusive_scan(start, end, start, sum_op<int>{}, init_val); | ||
|
||
expected_aggr.push_back(expected[i * 32 + 31] - init_val); // warp aggregate doed not take |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expected_aggr.push_back(expected[i * 32 + 31] - init_val); // warp aggregate doed not take | |
expected_aggr.push_back(expected[i * 32 + 31] - init_val); // warp aggregate does not take |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: is this always equivalent to:
expected_aggr.push_back(expected[i * 32 + 31] - init_val); // warp aggregate doed not take | |
expected_aggr.push_back(32); // warp aggregate does not take |
}; | ||
// input: {[0, -1], [2, -3],[4, -5], ... [126, -127]} | ||
|
||
// Collectively compute the block-wide inclusive prefix max scan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: I was under impression that it's "prefix sum" or "scan". I never heart of "prefix scan". Is this term used in the wild or is it a typo?
🟨 CI finished in 5h 44m: Pass: 99%/249 | Total: 2d 07h | Avg: 13m 20s | Max: 2h 01m | Hits: 91%/246866
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
🏃 Runner counts (total jobs: 249)
# | Runner |
---|---|
178 | linux-amd64-cpu16 |
40 | linux-amd64-gpu-v100-latest-1 |
16 | linux-arm64-cpu16 |
15 | windows-amd64-cpu16 |
🟩 CI finished in 19h 49m: Pass: 100%/249 | Total: 2d 07h | Avg: 13m 28s | Max: 2h 01m | Hits: 91%/248572
|
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
Thrust | |
CUDA Experimental |
Modifications in project or dependencies?
Project | |
---|---|
CCCL Infrastructure | |
libcu++ | |
+/- | CUB |
+/- | Thrust |
CUDA Experimental |
🏃 Runner counts (total jobs: 249)
# | Runner |
---|---|
178 | linux-amd64-cpu16 |
40 | linux-amd64-gpu-v100-latest-1 |
16 | linux-arm64-cpu16 |
15 | windows-amd64-cpu16 |
Doesn't fix but works towards #693.
Adds warp_scan with initial value support on warp and block level. Tests added.
todo: