Fix OOB in warpspeed scan kernel on last partial tiles #8134

Open
gonidelis wants to merge 1 commit into NVIDIA:main from gonidelis:lookahead_oob

Conversation

@gonidelis
Member

@gonidelis gonidelis commented Mar 23, 2026

Tries to fix one of the bugs in https://nvbugspro.nvidia.com/bug/5999870: the Thrust out-of-bounds access assertion.

I'll need @bernhardmgruber's eyes and advice on this one, as he is the original author. I'm not sure why we chose to read the last tile in full unconditionally; my guess is that we wanted to avoid thread divergence, but I'm not sure.

I think reading the last tile in full and calling scan_op on the garbage data in the scan_warpspeed kernel causes complex scan ops, i.e. ones that dereference memory, to read garbage.

This PR keeps the full last tile read and just sanitizes the redundant data.
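To make the suspected failure mode concrete, here is a minimal host-side sketch, not code from CCCL. `ArgMaxByTable` is a hypothetical scan operator whose operands are indices into a lookup table, so a garbage operand turns into an out-of-range access. The sketch only records that case instead of dereferencing it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical operator (not from CCCL): operands are indices into `table`,
// and the result is the index whose table entry is larger.
struct ArgMaxByTable
{
  const std::vector<int>& table;
  bool saw_invalid_operand;

  std::size_t operator()(std::size_t a, std::size_t b)
  {
    assert(!table.empty()); // the sketch assumes a non-empty table
    // On the GPU an out-of-range index would be dereferenced and read garbage;
    // this host model only records that it happened.
    if (a >= table.size() || b >= table.size())
    {
      saw_invalid_operand = true;
      return a < table.size() ? a : b;
    }
    return table[a] >= table[b] ? a : b;
  }
};
```

With `table = {5, 9, 2}`, calling `op(0, 1)` stays in range, while `op(1, 123456)` (a stand-in for a slot past the end of a partial tile) sets `saw_invalid_operand`. A plain `+` on integers would hide the same garbage operand, which may be why only some operators trip the assertion.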

I want to do a regression analysis to see if there is any perf damage.

Fixes: #8136

@copy-pr-bot
Contributor

copy-pr-bot bot commented Mar 23, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@gonidelis gonidelis requested a review from miscco March 23, 2026 10:13
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Mar 23, 2026
@miscco
Contributor

miscco commented Mar 23, 2026

The reason we are doing this is that not reading the full tile had considerable perf issues (15%, if I recall correctly).

The current scan implementation does the same, so we deemed it acceptable to read beyond the bounds.

Comment on lines +364 to +371
// For partial tiles, overwrite invalid warp sums with warp 0's sum so that
// squadScanStore never passes garbage data to scan_op.
if (is_last_tile && squad.isLeaderThreadOfWarp() && squad.warpRank() >= valid_warps)
{
  refSumThreadAndWarpW.data()[squadReduce.threadCount() + squad.warpRank()] =
    refSumThreadAndWarpW.data()[squadReduce.threadCount()];
}

Contributor

I don't believe this is the right place; we have already used scan_op on invalid data.

If we do want to fix it, we need to fix it in squadLoadSmem in the reduce and scan squads to use the last valid data point instead.

Member Author

squadLoadSmem only loads the data; no scan_op is applied at that point. Then this chunk executes:

        if (is_last_tile)
        {
          // TODO(bgruber): for operators where we know the identity we can probably optimize further here
          regThreadSum = ThreadReducePartial(regInput, scan_op, valid_items_this_thread);
          regWarpSum   = warpReducePartial(regThreadSum, scan_op, valid_threads_this_warp);
        }
        else
        {
          regThreadSum = ThreadReduce(regInput, scan_op);
          regWarpSum   = warpReduce(regThreadSum, scan_op);
        }

which only applies scan_op to valid data, since the if (is_last_tile) check applies scan_op either partially, to the valid items only, or fully (if the tile is not the last one).

So at what point before the code I added is scan_op being applied to garbage data? The test doesn't even throw. Should I run it with a sanitizer?
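As a host-side illustration of the predicated path described here, the sketch below folds only the first `valid_items` entries of a thread's registers. `thread_reduce_partial` is a hypothetical stand-in, not the `ThreadReducePartial` used in the kernel, and it assumes at least one valid item.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Fold only the first `valid_items` entries; trailing entries are never
// passed to `op`, whatever they contain.
template <class T, std::size_t N, class Op>
T thread_reduce_partial(const std::array<T, N>& items, Op op, std::size_t valid_items)
{
  assert(valid_items >= 1 && valid_items <= N);
  T acc = items[0];
  for (std::size_t i = 1; i < valid_items; ++i)
  {
    acc = op(acc, items[i]);
  }
  return acc;
}
```

For `items = {1, 2, 3, -1000}` and `valid_items = 3`, a counting `+` operator is invoked twice and yields 6; the trailing `-1000` never reaches the operator.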

Member Author

I did run it with the sanitizer and it did flag OOB global reads from TMA bulk load. I guess that's in line with the optimizations offered by TMA. The loaded garbage enters shared memory and registers but is still never passed to scan_op. IMO we shouldn't dig deeper, since the present PR already contains the fix (given that the benchmarks do not regress).

Contributor

I did run it with the sanitizer and it did flag OOB global reads from TMA bulk load.

This is a known compute-sanitizer bug and it will be fixed in a future version. It's because they did not properly handle the ignore bytes on the .ignore_oob version of bulk copy.

Contributor

My point was that reading out of bounds data is already invalid.

Furthermore, if we would just replace the invalid data with the final data point we could guarantee that we pass valid data to the algorithm.

That might help with keeping the squad uniform
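A minimal host-side sketch of this suggestion, under the assumption that the tile has at least one valid item: pad every slot past the valid range with the last valid item, then run an unpredicated scan over the whole tile. `pad_with_last_valid` and `inclusive_scan_full` are hypothetical names, not functions from the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Replace every slot past the valid range with a copy of the last valid item.
inline void pad_with_last_valid(std::vector<int>& tile, std::size_t valid_items)
{
  assert(valid_items >= 1 && valid_items <= tile.size());
  for (std::size_t i = valid_items; i < tile.size(); ++i)
  {
    tile[i] = tile[valid_items - 1];
  }
}

// Unpredicated inclusive scan over the whole tile.
template <class Op>
std::vector<int> inclusive_scan_full(const std::vector<int>& tile, Op op)
{
  std::vector<int> out(tile.size());
  if (tile.empty())
  {
    return out;
  }
  out[0] = tile[0];
  for (std::size_t i = 1; i < tile.size(); ++i)
  {
    out[i] = op(out[i - 1], tile[i]);
  }
  return out;
}
```

Because an inclusive scan result at position i depends only on inputs 0..i, the outputs for the valid prefix are unchanged by the padding, and every operand the operator sees is a real input value. Whether this beats predication on the GPU is exactly what would need measuring.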

Contributor

My point was that reading out of bounds data is already invalid.

Yes, and the CI passed until we added the RTX PRO 6000. I have a vague feeling that uninitialized SMEM reads produced zero on earlier architectures, but that may have changed now, which is why we are seeing tests failing.

if we would just replace the invalid data with the final data point we could guarantee that we pass valid data to the algorithm.

I am not sure what will work better: prefilling accumulator values with a valid value before overwriting them, or just predicating the scan steps.

Contributor

It's definitely something we need to measure, but we already had some predicated steps, so just predicating squadLoadSmem would confine that to just 2 places.

@bernhardmgruber
Contributor

I just hit this bug locally as well while working on #7565. It does not surface on main though and this is extremely weird.

@bernhardmgruber
Contributor

Not sure why we chose to read the last tile in full unconditionally

15% better perf

@gonidelis
Member Author

Regression Analysis @miscco @bernhardmgruber

(cccl) (cccl) coder ➜ ~/cccl $ _deps/nvbench-src/scripts/nvbench_compare.py  main.json patch.json 
['main.json', 'patch.json']
# base

## [0] NVIDIA RTX PRO 6000 Blackwell Workstation Edition

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|---------------|---------|----------|
|   I8    |      I64      |      2^16      |   5.514 us |      17.85% |   6.170 us |       6.30% |      0.656 us |  11.89% |   SLOW   |
|   I8    |      I64      |      2^20      |   7.197 us |      20.15% |   8.180 us |      10.77% |      0.983 us |  13.65% |   SLOW   |
|   I8    |      I64      |      2^24      |  41.392 us |       3.63% |  43.091 us |       6.42% |      1.698 us |   4.10% |   SLOW   |
|   I8    |      I64      |      2^28      | 622.339 us |       1.16% | 621.885 us |       1.17% |     -0.454 us |  -0.07% |   SAME   |
|   I8    |      I64      |      2^32      |  10.273 ms |      28.64% |  10.942 ms |      43.18% |    668.748 us |   6.51% |   SAME   |
|   I16   |      I64      |      2^16      |   6.830 us |      22.15% |   7.926 us |      17.95% |      1.096 us |  16.04% |   SAME   |
|   I16   |      I64      |      2^20      |  10.279 us |       6.32% |  10.242 us |       2.28% |     -0.037 us |  -0.36% |   SAME   |
|   I16   |      I64      |      2^24      |  76.336 us |       2.01% |  76.534 us |       2.05% |      0.198 us |   0.26% |   SAME   |
|   I16   |      I64      |      2^28      |   1.275 ms |       0.76% |   1.274 ms |       0.78% |     -0.145 us |  -0.01% |   SAME   |
|   I16   |      I64      |      2^32      |  29.159 ms |      46.45% |  33.387 ms |      57.25% |      4.228 ms |  14.50% |   SAME   |
|   I32   |      I64      |      2^16      |   6.481 us |      16.96% |   7.560 us |      30.64% |      1.079 us |  16.65% |   SAME   |
|   I32   |      I64      |      2^20      |  13.403 us |       9.87% |  16.215 us |       7.83% |      2.812 us |  20.98% |   SLOW   |
|   I32   |      I64      |      2^24      | 152.824 us |       1.15% | 152.618 us |       1.14% |     -0.206 us |  -0.13% |   SAME   |
|   I32   |      I64      |      2^28      |   2.541 ms |       0.59% |   2.542 ms |       0.63% |      0.729 us |   0.03% |   SAME   |
|   I32   |      I64      |      2^32      |  67.690 ms |      84.04% |  51.592 ms |      64.40% | -16097.792 us | -23.78% |   SAME   |
|   I64   |      I64      |      2^16      |   7.329 us |      15.16% |   7.353 us |      15.37% |      0.024 us |   0.33% |   SAME   |
|   I64   |      I64      |      2^20      |  24.158 us |       8.93% |  22.530 us |       9.63% |     -1.628 us |  -6.74% |   SAME   |
|   I64   |      I64      |      2^24      | 313.596 us |       1.17% | 314.260 us |       1.18% |      0.664 us |   0.21% |   SAME   |
|   I64   |      I64      |      2^28      |   5.014 ms |       0.45% |   5.010 ms |       0.48% |     -3.644 us |  -0.07% |   SAME   |
|   I64   |      I64      |      2^32      |  82.482 ms |       0.07% |  82.483 ms |       0.17% |      0.869 us |   0.00% |   SAME   |
|  I128   |      I64      |      2^16      |   8.849 us |      15.57% |   8.708 us |      11.79% |     -0.141 us |  -1.60% |   SAME   |
|  I128   |      I64      |      2^20      |  41.135 us |      12.93% |  42.363 us |      10.24% |      1.228 us |   2.98% |   SAME   |
|  I128   |      I64      |      2^24      | 589.405 us |       7.18% | 595.913 us |       1.20% |      6.508 us |   1.10% |   SAME   |
|  I128   |      I64      |      2^28      |   9.503 ms |       0.37% |   9.501 ms |       0.31% |     -2.126 us |  -0.02% |   SAME   |
|   F32   |      I64      |      2^16      |   6.717 us |      13.66% |   6.154 us |       2.96% |     -0.562 us |  -8.37% |   FAST   |
|   F32   |      I64      |      2^20      |  17.367 us |      15.05% |  15.470 us |      21.32% |     -1.897 us | -10.92% |   SAME   |
|   F32   |      I64      |      2^24      | 146.343 us |       7.42% | 149.385 us |       6.04% |      3.042 us |   2.08% |   SAME   |
|   F32   |      I64      |      2^28      |   2.537 ms |       0.57% |   2.538 ms |       0.64% |      0.567 us |   0.02% |   SAME   |
|   F32   |      I64      |      2^32      |  40.471 ms |       0.69% |  40.478 ms |       0.69% |      7.560 us |   0.02% |   SAME   |
|   F64   |      I64      |      2^16      |   9.412 us |      17.93% |   9.342 us |      21.85% |     -0.070 us |  -0.74% |   SAME   |
|   F64   |      I64      |      2^20      |  29.931 us |       7.71% |  27.867 us |       8.84% |     -2.064 us |  -6.89% |   SAME   |
|   F64   |      I64      |      2^24      | 310.463 us |       1.14% | 310.818 us |       1.12% |      0.355 us |   0.11% |   SAME   |
|   F64   |      I64      |      2^28      |   4.995 ms |       0.50% |   5.009 ms |       0.47% |     14.349 us |   0.29% |   SAME   |
|   F64   |      I64      |      2^32      |  82.281 ms |       0.09% |  82.428 ms |       0.23% |    147.279 us |   0.18% |   SLOW   |
|   C32   |      I64      |      2^16      |   6.898 us |      16.94% |   6.873 us |      14.65% |     -0.025 us |  -0.36% |   SAME   |
|   C32   |      I64      |      2^20      |  21.457 us |       8.23% |  21.113 us |      14.92% |     -0.344 us |  -1.60% |   SAME   |
|   C32   |      I64      |      2^24      | 303.867 us |       4.56% | 308.313 us |       4.93% |      4.446 us |   1.46% |   SAME   |
|   C32   |      I64      |      2^28      |   6.028 ms |      19.31% |   5.020 ms |       0.50% |  -1007.610 us | -16.72% |   FAST   |
|   C32   |      I64      |      2^32      |  82.465 ms |       0.13% |  82.529 ms |       0.13% |     63.960 us |   0.08% |   SAME   |

# Summary

- Total Matches: 39
  - Pass    (diff <= min_noise): 32
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  7
(cccl) (cccl) coder ➜ ~/cccl $ 
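For reading the Status column, the rule below is a sketch inferred from the summary lines ("diff <= min_noise", "infinite noise") and from the rows in the table. It is not copied from nvbench_compare.py; the function name `classify` and the "UNKNOWN" label are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>

// Classify one comparison row. All three arguments use the same unit
// (percent here, as printed in the table).
inline std::string classify(double ref_noise, double cmp_noise, double diff)
{
  if (!std::isfinite(ref_noise) || !std::isfinite(cmp_noise))
  {
    return "UNKNOWN"; // assumed label for the "infinite noise" case
  }
  const double min_noise = std::min(ref_noise, cmp_noise);
  if (std::fabs(diff) <= min_noise)
  {
    return "SAME"; // the difference is within the quieter run's noise
  }
  return diff < 0 ? "FAST" : "SLOW";
}
```

Checked against three rows above: `classify(17.85, 6.30, 11.89)` gives SLOW, `classify(22.15, 17.95, 16.04)` gives SAME, and `classify(13.66, 2.96, -8.37)` gives FAST. A consequence of this rule is that a row with very low noise on both sides can be flagged for a tiny difference, such as the 0.18% F64 2^32 row.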

@gonidelis
Member Author

Another run. I guess the noise is too high?

(cccl) (cccl) coder ➜ ~/cccl $ _deps/nvbench-src/scripts/nvbench_compare.py  main2.json patch.json 
['main2.json', 'patch.json']
# base

## [0] NVIDIA RTX PRO 6000 Blackwell Workstation Edition

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|---------------|---------|----------|
|   I8    |      I64      |      2^16      |   6.703 us |      22.10% |   6.170 us |       6.30% |     -0.532 us |  -7.94% |   FAST   |
|   I8    |      I64      |      2^20      |   8.193 us |       9.15% |   8.180 us |      10.77% |     -0.013 us |  -0.16% |   SAME   |
|   I8    |      I64      |      2^24      |  42.062 us |       3.14% |  43.091 us |       6.42% |      1.029 us |   2.45% |   SAME   |
|   I8    |      I64      |      2^28      | 609.565 us |       1.04% | 621.885 us |       1.17% |     12.320 us |   2.02% |   SLOW   |
|   I8    |      I64      |      2^32      |  13.061 ms |      64.94% |  10.942 ms |      43.18% |  -2118.631 us | -16.22% |   SAME   |
|   I16   |      I64      |      2^16      |   6.991 us |      23.08% |   7.926 us |      17.95% |      0.935 us |  13.38% |   SAME   |
|   I16   |      I64      |      2^20      |  12.354 us |       7.72% |  10.242 us |       2.28% |     -2.112 us | -17.10% |   FAST   |
|   I16   |      I64      |      2^24      |  73.922 us |       2.33% |  76.534 us |       2.05% |      2.612 us |   3.53% |   SLOW   |
|   I16   |      I64      |      2^28      |   1.278 ms |       0.69% |   1.274 ms |       0.78% |     -3.372 us |  -0.26% |   SAME   |
|   I16   |      I64      |      2^32      |  26.767 ms |      39.02% |  33.387 ms |      57.25% |      6.620 ms |  24.73% |   SAME   |
|   I32   |      I64      |      2^16      |   6.668 us |      19.31% |   7.560 us |      30.64% |      0.892 us |  13.37% |   SAME   |
|   I32   |      I64      |      2^20      |  14.660 us |       9.28% |  16.215 us |       7.83% |      1.555 us |  10.61% |   SLOW   |
|   I32   |      I64      |      2^24      | 154.830 us |       1.16% | 152.618 us |       1.14% |     -2.212 us |  -1.43% |   FAST   |
|   I32   |      I64      |      2^28      |   2.560 ms |       0.63% |   2.542 ms |       0.63% |    -17.738 us |  -0.69% |   FAST   |
|   I32   |      I64      |      2^32      |  56.611 ms |      68.68% |  51.592 ms |      64.40% |  -5019.306 us |  -8.87% |   SAME   |
|   I64   |      I64      |      2^16      |   7.586 us |      14.80% |   7.353 us |      15.37% |     -0.233 us |  -3.07% |   SAME   |
|   I64   |      I64      |      2^20      |  22.325 us |       7.45% |  22.530 us |       9.63% |      0.205 us |   0.92% |   SAME   |
|   I64   |      I64      |      2^24      | 318.625 us |       1.14% | 314.260 us |       1.18% |     -4.366 us |  -1.37% |   FAST   |
|   I64   |      I64      |      2^28      |   5.008 ms |       0.50% |   5.010 ms |       0.48% |      2.262 us |   0.05% |   SAME   |
|   I64   |      I64      |      2^32      |  82.395 ms |       0.17% |  82.483 ms |       0.17% |     87.906 us |   0.11% |   SAME   |
|  I128   |      I64      |      2^16      |   8.267 us |       6.14% |   8.708 us |      11.79% |      0.441 us |   5.33% |   SAME   |
|  I128   |      I64      |      2^20      |  39.956 us |      10.31% |  42.363 us |      10.24% |      2.407 us |   6.02% |   SAME   |
|  I128   |      I64      |      2^24      | 603.474 us |       2.95% | 595.913 us |       1.20% |     -7.561 us |  -1.25% |   FAST   |
|  I128   |      I64      |      2^28      |  16.708 ms |     118.05% |   9.501 ms |       0.31% |  -7207.273 us | -43.14% |   FAST   |
|   F32   |      I64      |      2^16      |   6.436 us |      12.43% |   6.154 us |       2.96% |     -0.282 us |  -4.38% |   FAST   |
|   F32   |      I64      |      2^20      |  16.028 us |      12.68% |  15.470 us |      21.32% |     -0.558 us |  -3.48% |   SAME   |
|   F32   |      I64      |      2^24      | 158.255 us |       1.15% | 149.385 us |       6.04% |     -8.870 us |  -5.60% |   FAST   |
|   F32   |      I64      |      2^28      |   2.562 ms |       0.56% |   2.538 ms |       0.64% |    -24.293 us |  -0.95% |   FAST   |
|   F32   |      I64      |      2^32      |  42.591 ms |      35.52% |  40.478 ms |       0.69% |  -2112.355 us |  -4.96% |   FAST   |
|   F64   |      I64      |      2^16      |   8.493 us |      10.84% |   9.342 us |      21.85% |      0.849 us |   9.99% |   SAME   |
|   F64   |      I64      |      2^20      |  22.099 us |       6.66% |  27.867 us |       8.84% |      5.768 us |  26.10% |   SLOW   |
|   F64   |      I64      |      2^24      | 311.565 us |       1.06% | 310.818 us |       1.12% |     -0.746 us |  -0.24% |   SAME   |
|   F64   |      I64      |      2^28      |   4.992 ms |       0.49% |   5.009 ms |       0.47% |     17.279 us |   0.35% |   SAME   |
|   F64   |      I64      |      2^32      | 104.807 ms |     145.87% |  82.428 ms |       0.23% | -22378.573 us | -21.35% |   FAST   |
|   C32   |      I64      |      2^16      |   6.439 us |      16.29% |   6.873 us |      14.65% |      0.434 us |   6.74% |   SAME   |
|   C32   |      I64      |      2^20      |  24.976 us |       5.55% |  21.113 us |      14.92% |     -3.863 us | -15.47% |   FAST   |
|   C32   |      I64      |      2^24      | 313.365 us |       1.11% | 308.313 us |       4.93% |     -5.051 us |  -1.61% |   FAST   |
|   C32   |      I64      |      2^28      |   5.015 ms |       0.49% |   5.020 ms |       0.50% |      5.602 us |   0.11% |   SAME   |
|   C32   |      I64      |      2^32      |  82.479 ms |       0.12% |  82.529 ms |       0.13% |     50.176 us |   0.06% |   SAME   |

I don't feel the results are reliable; any hints?

@bernhardmgruber
Contributor

any hints?

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|---------------|---------|----------|
|   I8    |      I64      |      2^32      |  13.061 ms |      64.94% |  10.942 ms |      43.18% |  -2118.631 us | -16.22% |   SAME   |

Look at the Noise columns. I would say more than 10% noise is problematic. I know you are benchmarking on your local workstation. I recommend closing or minimizing any applications with a GPU context, such as a browser, IDE, etc., before running the benchmark. This usually helps, but the most reliable benchmarks are still done on the cluster.
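The 10% rule of thumb can be applied mechanically before interpreting any row. The helper below is a hypothetical sketch, not part of nvbench; it treats a row as usable only when both the reference and the comparison run stay at or below the limit.

```cpp
#include <cassert>

// Hypothetical filter: trust a row only if both noise figures (in percent)
// are at or below `max_noise_pct`.
inline bool is_reliable(double ref_noise_pct, double cmp_noise_pct, double max_noise_pct = 10.0)
{
  assert(max_noise_pct >= 0.0);
  return ref_noise_pct <= max_noise_pct && cmp_noise_pct <= max_noise_pct;
}
```

For the quoted I8 / 2^32 row, `is_reliable(64.94, 43.18)` is false, while the I8 / 2^28 row of the same run, `is_reliable(1.04, 1.17)`, passes.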

@gonidelis
Member Author

Will rerun with graphics apps closed. Unfortunately, it's hard to find sm100+ in the clusters. I'll poll though.

@gonidelis gonidelis marked this pull request as ready for review March 24, 2026 14:16
@gonidelis gonidelis requested a review from a team as a code owner March 24, 2026 14:16
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Mar 24, 2026
@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 44m: Pass: 100%/249 | Total: 7d 20h | Max: 1h 28m | Hits: 74%/159565

See results here.


Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[BUG] warpspeed scan causes OOB reads in some Thrust tests

3 participants