Skip to content

[BUG]: Warpspeed InclusiveScan illegal memory access for shifted output #8838

Description

@bdice

Is this a duplicate?

  • I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct

Type of Bug

Runtime Error

Component

CUB

Describe the bug

cub::DeviceScan::InclusiveScan using the Warpspeed scan path on sm_121 can trigger an illegal memory access for a very small scan when the input type is 32-bit and the output type is 64-bit, and the output iterator is offset by one element.

This was minimized from failures in RAPIDS cuVS IVF-PQ tests on Blackwell. The original cuVS call was a thrust::inclusive_scan over uint32_t cluster sizes into int64_t offsets at cluster_offsets + 1. The standalone reproducer below removes cuVS, Thrust API, RAFT, RMM, custom streams, async allocation, and large inputs.

Minimal requirements observed for the failure:

  • GPU target/runtime: sm_121 on NVIDIA GB10 (cc=12.1)
  • CUB DeviceScan::InclusiveScan
  • Input type is 32-bit and output type is 64-bit, e.g. int* -> long long*
  • num_items == 2 is sufficient
  • Output pointer is shifted by one element (out + 1)
  • Warpspeed scan is enabled

Compiling the same source with -DCCCL_DISABLE_WARPSPEED_SCAN makes the reproducer pass.

How to Reproduce

Use CCCL from source at 2b7188d2b5e4f2671f68078abc7d842e6b1e3d22 (v3.4.0.dev-687-g2b7188d2b5) with CUDA 13.1 and compile for sm_121.

repro.cu:

#include <cub/device/device_scan.cuh>
#include <cstdio>

int main()
{
  int in_h[2] = {1, 1};
  int* in{};
  long long* out{};
  void* tmp{};
  size_t tmp_bytes{};

  cudaMalloc(&in, sizeof(in_h));
  cudaMalloc(&out, 3 * sizeof(*out));
  cudaMemcpy(in, in_h, sizeof(in_h), cudaMemcpyHostToDevice);
  cudaMemset(out, 0, 3 * sizeof(*out));

  cub::DeviceScan::InclusiveScan(tmp, tmp_bytes, in, out + 1, cuda::std::plus<>{}, 2);
  cudaMalloc(&tmp, tmp_bytes);
  cub::DeviceScan::InclusiveScan(tmp, tmp_bytes, in, out + 1, cuda::std::plus<>{}, 2);

  auto status = cudaDeviceSynchronize();
  if (status != cudaSuccess) {
    std::printf("sync failed: %s\n", cudaGetErrorString(status));
    return 1;
  }

  long long out_h[3]{};
  cudaMemcpy(out_h, out, sizeof(out_h), cudaMemcpyDeviceToHost);
  std::printf("ok: {%lld, %lld, %lld}\n", out_h[0], out_h[1], out_h[2]);
}

Failing build/run:

nvcc -std=c++20 \
  -arch=sm_121 \
  -I/path/to/cccl/thrust \
  -I/path/to/cccl/cub \
  -I/path/to/cccl/libcudacxx/include \
  repro.cu -o repro

./repro

Observed output:

sync failed: an illegal memory access was encountered

Control build with Warpspeed disabled:

nvcc -std=c++20 \
  -arch=sm_121 \
  -DCCCL_DISABLE_WARPSPEED_SCAN \
  -I/path/to/cccl/thrust \
  -I/path/to/cccl/cub \
  -I/path/to/cccl/libcudacxx/include \
  repro.cu -o repro_no_warpspeed

./repro_no_warpspeed

Observed output:

ok: {0, 1, 2}

Expected behavior

The scan should complete without an illegal memory access and produce {0, 1, 2} in the shifted output buffer.

Reproduction link

No response

Operating System

Ubuntu 24.04.4 LTS, Linux 6.17.0-1014-nvidia, aarch64

nvidia-smi output

Wed May  6 06:03:17 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0 Off |                  N/A |
| N/A   33C    P8              3W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_Dec_16_07:27:17_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.115
Build cuda_13.1.r13.1/compiler.37061995_0

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Fields

No fields configured for issues without a type.

Projects

Status
In Review

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions