[thrust] Single-pass `is_partitioned` via adjacent zip_iterator by edenfunf · Pull Request #8427 · NVIDIA/cccl

edenfunf · 2026-04-15T02:18:35Z

What

Replaces the two-pass thrust::is_partitioned CUDA implementation with a single-pass version that mirrors cuda::std::is_partitioned from the PSTL.

Before:

ItemsIt boundary = cuda_cub::find_if_not(policy, first, last, predicate);
ItemsIt end      = cuda_cub::find_if(policy, boundary, last, predicate);
return end == last;

Two kernel launches in the fully-partitioned case (the common fast path).

After:

const auto first_zip = ::cuda::make_zip_iterator(first, first + 1);
const auto last_zip  = ::cuda::make_zip_iterator(last, last);
const auto result    = cuda_cub::find_if(policy, first_zip, last_zip,
                         __is_partitioned_fn<Predicate>{predicate});
return ::cuda::std::get<1>(result.__iterators()) == last;

One kernel launch, scanning adjacent pairs for a "false → true" transition.

Why

thrust::is_partitioned was benchmarked as significantly slower than cuda::std::is_partitioned (up to ~33 % slower for int8_t at large N), even though both check the same invariant. The root cause is the extra kernel launch:

find_if_not scans the whole range to find the boundary.
find_if then scans from the boundary to the end.

The PSTL avoids this by zipping adjacent elements and scanning once for any (!pred(a[i]) && pred(a[i+1])) pair. This PR ports that approach to the Thrust CUDA backend.

Closes #8085.

How

Added __is_partitioned_fn<Predicate> functor (host+device callable) that takes a cuda::zip_iterator value and returns true on a partitioning violation.
Used cuda::make_zip_iterator(first, first+1) / cuda::make_zip_iterator(last, last). Because zip_iterator::operator- returns the minimum component distance, the synthetic range has exactly n-1 elements — the number of adjacent pairs.
The return check uses ::cuda::std::get<1>(result.__iterators()) == last (not result == last_zip) to handle the n==1 edge case correctly: when n==1 the zip distance is 0, find_if_n returns first_zip, and get<1>(first_zip) == last evaluates to true as expected.
Pattern directly mirrors libcudacxx/include/cuda/std/__pstl/is_partitioned.h.
Predicate arguments are wrapped with thrust::raw_reference_cast before evaluation, and results stored in const bool locals, to handle proxy-reference iterators (e.g. device_vector<T>::iterator whose dereference yields device_reference<T>). Without this, implicit bool conversion requires two user-defined conversions, which C++ forbids. This matches the pattern used in thrust/detail/internal_functional.h and thrust/detail/range/head_flags.h for the same proxy-reference problem.

Test

All existing tests in thrust/testing/is_partitioned.cu and thrust/testing/cuda/is_partitioned.cu were compiled and executed on an NVIDIA GeForce RTX 5070 (sm_89), CUDA 12.9, MSVC 19.50:

thrust.test.is_partitioned
  Running 8 unit tests.
  ........
  Totals: 0 failures, 0 known failures, 0 errors, and 8 passes.

thrust.test.cuda.is_partitioned.cdp_0
  Running 3 unit tests.
  ...
  Totals: 0 failures, 0 known failures, 0 errors, and 3 passes.

thrust.test.cuda.is_partitioned.cdp_1
  Running 3 unit tests.
  ...
  Totals: 0 failures, 0 known failures, 0 errors, and 3 passes.

The tests cover the following cases:

Case	Expected
Empty range	`true`
Single element	`true`
All-true partition	`true`
True + false partition	`true`
Single false element only	`true`
Not partitioned (false→true transition)	`false`
Partitioned after `thrust::partition`	`true`
Proxy-reference iterators (`device_vector<custom_numeric>`)	`true` / `false`
CUDA streams (`thrust::cuda::par.on(s)`)	`true` / `false`
Device-side execution (`thrust::seq`, `thrust::device`)	`true` / `false`

The previous implementation required two kernel launches: find_if_not to locate the partition boundary, then find_if to check whether any true element follows it. Replace this with a single find_if over adjacent element pairs (a[i], a[i+1]) built with cuda::make_zip_iterator. The predicate detects a "false → true" transition that violates the partitioning invariant. Using zip_iterator::operator- (which returns the minimum component distance) the synthetic range has n-1 elements, so only one kernel launch is needed regardless of input. Edge-case handling: - Empty range: early return true (avoids computing first+1). - n==1: zip distance is min(1,0)=0, find_if_n returns first_zip; get<1>(first_zip) == last is true, which is correct. Pattern mirrors cuda::std::is_partitioned in libcudacxx/include/cuda/std/__pstl/is_partitioned.h. Fixes NVIDIA#8085

copy-pr-bot · 2026-04-15T02:18:39Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

When iterating over thrust::device_vector, dereferencing yields device_reference<T> rather than T. Applying a predicate to a device_reference<T> and then applying operator! to the result fails to compile: converting device_reference<T> to bool requires two user-defined conversions (device_reference → T → void*), which C++ forbids in a single implicit conversion sequence. Fix by wrapping each tuple element with thrust::raw_reference_cast before passing to the predicate, and assigning the result to an intermediate bool. This matches the pattern used elsewhere in Thrust (internal_functional.h, head_flags.h) for the same proxy- reference problem, and mirrors the intent of the PSTL reference implementation which stores pred results in const bool locals. Verified: all 14 tests pass on RTX 5070 / CUDA 12.9 (sm_89): thrust.test.is_partitioned 8/8 pass thrust.test.cuda.is_partitioned.cdp_0 3/3 pass thrust.test.cuda.is_partitioned.cdp_1 3/3 pass

edenfunf · 2026-04-16T14:07:37Z

@gevtushenko Hi, could you please take a look at this PR when you have time?

This PR replaces the current two-pass CUDA implementation of thrust::is_partitioned
with a single-pass version using adjacent zip_iterator, following the approach used in
cuda::std::is_partitioned (PSTL).

Key points:

Reduces kernel launches from 2 → 1 in the fully-partitioned (common) case
Detects partition violations via adjacent pair scan (!pred(a[i]) && pred(a[i+1]))
Handles edge cases (e.g. n == 1) correctly via iterator tuple inspection
Preserves support for proxy-reference iterators (via raw_reference_cast)

Benchmarks showed up to ~33% speedup compared to the previous implementation.

I've verified:

All existing Thrust tests pass
CUDA backend behavior matches PSTL semantics

Would really appreciate your feedback, thanks!

edenfunf requested a review from a team as a code owner April 15, 2026 02:18

edenfunf requested a review from gevtushenko April 15, 2026 02:18

github-project-automation bot added this to CCCL Apr 15, 2026

cccl-authenticator-app bot moved this to In Review in CCCL Apr 15, 2026

CCCL Fix added 2 commits April 15, 2026 19:42

style: clang-format partition.h

b95ee9e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[thrust] Single-pass `is_partitioned` via adjacent zip_iterator#8427

[thrust] Single-pass `is_partitioned` via adjacent zip_iterator#8427
edenfunf wants to merge 3 commits intoNVIDIA:mainfrom
edenfunf:fix/thrust-is-partitioned-single-pass

edenfunf commented Apr 15, 2026 •

edited

Loading

Uh oh!

copy-pr-bot bot commented Apr 15, 2026

Uh oh!

edenfunf commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

edenfunf commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

How

Test

Uh oh!

copy-pr-bot bot commented Apr 15, 2026

Uh oh!

edenfunf commented Apr 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

edenfunf commented Apr 15, 2026 •

edited

Loading