Is this a duplicate?
Type of Bug
Runtime Error
Component
CUB
Describe the bug
cub::DeviceScan::InclusiveScan using the Warpspeed scan path on sm_121 can trigger an illegal memory access for a very small scan when the input type is 32-bit and the output type is 64-bit, and the output iterator is offset by one element.
This was minimized from failures in RAPIDS cuVS IVF-PQ tests on Blackwell. The original cuVS call was a thrust::inclusive_scan over uint32_t cluster sizes into int64_t offsets at cluster_offsets + 1. The standalone reproducer below removes cuVS, Thrust API, RAFT, RMM, custom streams, async allocation, and large inputs.
Minimal requirements observed for the failure:
- GPU target/runtime:
sm_121 on NVIDIA GB10 (cc=12.1)
- CUB
DeviceScan::InclusiveScan
- Input type is 32-bit and output type is 64-bit, e.g.
int* -> long long*
num_items == 2 is sufficient
- Output pointer is shifted by one element (
out + 1)
- Warpspeed scan is enabled
Compiling the same source with -DCCCL_DISABLE_WARPSPEED_SCAN makes the reproducer pass.
How to Reproduce
Use CCCL from source at 2b7188d2b5e4f2671f68078abc7d842e6b1e3d22 (v3.4.0.dev-687-g2b7188d2b5) with CUDA 13.1 and compile for sm_121.
repro.cu:
#include <cub/device/device_scan.cuh>
#include <cstdio>
int main()
{
int in_h[2] = {1, 1};
int* in{};
long long* out{};
void* tmp{};
size_t tmp_bytes{};
cudaMalloc(&in, sizeof(in_h));
cudaMalloc(&out, 3 * sizeof(*out));
cudaMemcpy(in, in_h, sizeof(in_h), cudaMemcpyHostToDevice);
cudaMemset(out, 0, 3 * sizeof(*out));
cub::DeviceScan::InclusiveScan(tmp, tmp_bytes, in, out + 1, cuda::std::plus<>{}, 2);
cudaMalloc(&tmp, tmp_bytes);
cub::DeviceScan::InclusiveScan(tmp, tmp_bytes, in, out + 1, cuda::std::plus<>{}, 2);
auto status = cudaDeviceSynchronize();
if (status != cudaSuccess) {
std::printf("sync failed: %s\n", cudaGetErrorString(status));
return 1;
}
long long out_h[3]{};
cudaMemcpy(out_h, out, sizeof(out_h), cudaMemcpyDeviceToHost);
std::printf("ok: {%lld, %lld, %lld}\n", out_h[0], out_h[1], out_h[2]);
}
Failing build/run:
nvcc -std=c++20 \
-arch=sm_121 \
-I/path/to/cccl/thrust \
-I/path/to/cccl/cub \
-I/path/to/cccl/libcudacxx/include \
repro.cu -o repro
./repro
Observed output:
sync failed: an illegal memory access was encountered
Control build with Warpspeed disabled:
nvcc -std=c++20 \
-arch=sm_121 \
-DCCCL_DISABLE_WARPSPEED_SCAN \
-I/path/to/cccl/thrust \
-I/path/to/cccl/cub \
-I/path/to/cccl/libcudacxx/include \
repro.cu -o repro_no_warpspeed
./repro_no_warpspeed
Observed output:
Expected behavior
The scan should complete without an illegal memory access and produce {0, 1, 2} in the shifted output buffer.
Reproduction link
No response
Operating System
Ubuntu 24.04.4 LTS, Linux 6.17.0-1014-nvidia, aarch64
nvidia-smi output
Wed May 6 06:03:17 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 Off | N/A |
| N/A 33C P8 3W / N/A | Not Supported | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
NVCC version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_Dec_16_07:27:17_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.115
Build cuda_13.1.r13.1/compiler.37061995_0
Is this a duplicate?
Type of Bug
Runtime Error
Component
CUB
Describe the bug
cub::DeviceScan::InclusiveScanusing the Warpspeed scan path onsm_121can trigger an illegal memory access for a very small scan when the input type is 32-bit and the output type is 64-bit, and the output iterator is offset by one element.This was minimized from failures in RAPIDS cuVS IVF-PQ tests on Blackwell. The original cuVS call was a
thrust::inclusive_scanoveruint32_tcluster sizes intoint64_toffsets atcluster_offsets + 1. The standalone reproducer below removes cuVS, Thrust API, RAFT, RMM, custom streams, async allocation, and large inputs.Minimal requirements observed for the failure:
sm_121on NVIDIA GB10 (cc=12.1)DeviceScan::InclusiveScanint* -> long long*num_items == 2is sufficientout + 1)Compiling the same source with
-DCCCL_DISABLE_WARPSPEED_SCANmakes the reproducer pass.How to Reproduce
Use CCCL from source at
2b7188d2b5e4f2671f68078abc7d842e6b1e3d22(v3.4.0.dev-687-g2b7188d2b5) with CUDA 13.1 and compile forsm_121.repro.cu:Failing build/run:
Observed output:
Control build with Warpspeed disabled:
Observed output:
Expected behavior
The scan should complete without an illegal memory access and produce
{0, 1, 2}in the shifted output buffer.Reproduction link
No response
Operating System
Ubuntu 24.04.4 LTS, Linux 6.17.0-1014-nvidia, aarch64
nvidia-smi output
NVCC version