
changed is_partitioned impl to make it similar to pstl's is_partitioned #8130

Open
dhruvilmehta wants to merge 4 commits into NVIDIA:main from dhruvilmehta:partitionbranch

Conversation

@dhruvilmehta
Contributor

@dhruvilmehta dhruvilmehta commented Mar 21, 2026

Description

Changed the implementation of thrust::is_partitioned to make it similar to libcudacxx::is_partitioned (#8084).

closes #8085

Made the implementation similar to pstl's implementation.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@dhruvilmehta dhruvilmehta requested a review from a team as a code owner March 21, 2026 17:48
@dhruvilmehta dhruvilmehta requested a review from miscco March 21, 2026 17:48
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 21, 2026
@copy-pr-bot
Contributor

copy-pr-bot bot commented Mar 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 21, 2026
@bernhardmgruber
Contributor

Thx for the PR! We need a benchmark that shows the benefits of the new approach. Can you please add the benchmark from #8084 to Thrust and measure the performance before and after your changes, and report the nvbench diff? Instructions for benchmarking are in our documentation: https://nvidia.github.io/cccl/unstable/cub/benchmarking.html

Comment on lines +236 to +238
auto result = thrust::transform_reduce(
exec, first, last, detail::is_partitioned_unary_op<Predicate>{pred}, identity, detail::partition_binary_op());

Contributor


I believe this is not the right approach here.

The reason is that transform_reduce does not short-circuit on a negative match, so, especially for larger arrays, you spend considerable time searching for a result you already know.

In #8084 I used our DeviceFind::If machinery, because that stops searching after the first hit.
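The difference between the two strategies can be sketched on the host with standard C++ only. This is a minimal illustration under stated assumptions, not the Thrust, CUB, or libcudacxx code: `partition_state`, `combine`, `is_partitioned_reduce`, and `is_partitioned_find` are hypothetical names, a plain loop stands in for `thrust::transform_reduce`, and `std::find_if_not` / `std::none_of` stand in for a find-based search such as DeviceFind::If.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Summary of a subrange, as a reduction-based variant might track it.
// Hypothetical type, for illustration only.
struct partition_state {
  bool has_true;   // some element in the subrange satisfies the predicate
  bool has_false;  // some element in the subrange does not satisfy it
  bool ok;         // within the subrange, no true element follows a false one
};

// Associative combine of two adjacent subranges (left comes first).
// The concatenation is partitioned only if both halves are and the left
// half has no false element before a true element in the right half.
inline partition_state combine(partition_state l, partition_state r) {
  return {l.has_true || r.has_true,
          l.has_false || r.has_false,
          l.ok && r.ok && !(l.has_false && r.has_true)};
}

// Reduction-based variant: always evaluates the predicate on every element,
// even when an earlier element has already decided the answer.
template <class It, class Pred>
bool is_partitioned_reduce(It first, It last, Pred pred) {
  partition_state acc{false, false, true};  // identity of combine
  for (; first != last; ++first) {
    const bool p = pred(*first);
    acc = combine(acc, partition_state{p, !p, true});
  }
  return acc.ok;
}

// Find-based variant: skip the leading true elements, then stop at the
// first true element found after the boundary.
template <class It, class Pred>
bool is_partitioned_find(It first, It last, Pred pred) {
  It boundary = std::find_if_not(first, last, pred);
  return std::none_of(boundary, last, pred);
}
```

Both functions return the same answer. On an input that violates the partition near the front, the find-based variant stops after a handful of predicate calls, while the reduction-based variant still visits the whole range. That is the cost the review comment refers to.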

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Mar 23, 2026
@dhruvilmehta dhruvilmehta requested a review from a team as a code owner April 11, 2026 18:59
@dhruvilmehta dhruvilmehta requested a review from shwina April 11, 2026 18:59
@dhruvilmehta
Contributor Author

I made the implementation similar to pstl's implementation. There's no major difference in the benchmark results I got.

['../oldimpl.json', '../new7impl.json']
# base

## [0] Tesla T4

|  T   |  Elements  |  MismatchAt  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|------|------------|--------------|------------|-------------|------------|-------------|-------------|---------|----------|
|  I8  |    2^16    |      1       |  45.334 us |      18.33% |  45.856 us |      71.58% |    0.522 us |   1.15% |   SAME   |
|  I8  |    2^20    |      1       |  51.538 us |      49.43% |  50.800 us |      55.33% |   -0.738 us |  -1.43% |   SAME   |
|  I8  |    2^24    |      1       |  52.547 us |      22.93% |  52.088 us |      67.03% |   -0.460 us |  -0.87% |   SAME   |
|  I8  |    2^28    |      1       |  52.800 us |      75.12% |  51.713 us |      37.90% |   -1.087 us |  -2.06% |   SAME   |
|  I8  |    2^16    |     0.5      |  40.156 us |      70.62% |  37.834 us |      27.81% |   -2.322 us |  -5.78% |   SAME   |
|  I8  |    2^20    |     0.5      |  51.094 us |      16.84% |  49.730 us |      26.71% |   -1.364 us |  -2.67% |   SAME   |
|  I8  |    2^24    |     0.5      |  53.648 us |      93.98% |  51.422 us |      22.64% |   -2.227 us |  -4.15% |   SAME   |
|  I8  |    2^28    |     0.5      |  52.578 us |      37.88% |  51.725 us |      19.76% |   -0.853 us |  -1.62% |   SAME   |
|  I8  |    2^16    |     0.01     |  39.959 us |      38.91% |  38.194 us |      81.40% |   -1.766 us |  -4.42% |   SAME   |
|  I8  |    2^20    |     0.01     |  52.631 us |      86.96% |  50.805 us |      43.13% |   -1.826 us |  -3.47% |   SAME   |
|  I8  |    2^24    |     0.01     |  52.499 us |      10.31% |  52.492 us |      86.75% |   -0.008 us |  -0.01% |   SAME   |
|  I8  |    2^28    |     0.01     |  53.753 us |     111.54% |  52.706 us |      75.45% |   -1.047 us |  -1.95% |   SAME   |
| I16  |    2^16    |      1       |  41.483 us |      83.14% |  37.766 us |      23.98% |   -3.717 us |  -8.96% |   SAME   |
| I16  |    2^20    |      1       |  56.753 us |      77.24% |  56.059 us |      68.79% |   -0.694 us |  -1.22% |   SAME   |
| I16  |    2^24    |      1       |  66.726 us |      84.07% |  59.674 us |      32.90% |   -7.052 us | -10.57% |   SAME   |
| I16  |    2^28    |      1       |  65.277 us |     221.17% |  62.395 us |     103.03% |   -2.882 us |  -4.41% |   SAME   |
| I16  |    2^16    |     0.5      |  41.761 us |     112.52% |  37.572 us |      79.34% |   -4.188 us | -10.03% |   SAME   |
| I16  |    2^20    |     0.5      |  56.374 us |      24.10% |  55.865 us |      64.63% |   -0.509 us |  -0.90% |   SAME   |
| I16  |    2^24    |     0.5      |  60.688 us |      52.68% |  60.477 us |      87.88% |   -0.210 us |  -0.35% |   SAME   |
| I16  |    2^28    |     0.5      |  62.222 us |      69.35% |  59.797 us |      25.45% |   -2.426 us |  -3.90% |   SAME   |
| I16  |    2^16    |     0.01     |  47.228 us |      76.56% |  43.949 us |      65.08% |   -3.279 us |  -6.94% |   SAME   |
| I16  |    2^20    |     0.01     |  56.698 us |      72.63% |  56.757 us |      72.10% |    0.059 us |   0.10% |   SAME   |
| I16  |    2^24    |     0.01     |  61.906 us |      76.56% |  60.659 us |      64.02% |   -1.246 us |  -2.01% |   SAME   |
| I16  |    2^28    |     0.01     |  61.559 us |      99.16% |  60.249 us |      25.47% |   -1.310 us |  -2.13% |   SAME   |
| I32  |    2^16    |      1       |  42.349 us |      32.27% |  40.112 us |      55.90% |   -2.237 us |  -5.28% |   SAME   |
| I32  |    2^20    |      1       |  64.538 us |      68.48% |  62.926 us |      21.84% |   -1.611 us |  -2.50% |   SAME   |
| I32  |    2^24    |      1       | 298.761 us |       1.68% | 298.374 us |       2.57% |   -0.388 us |  -0.13% |   SAME   |
| I32  |    2^28    |      1       |   3.948 ms |       4.30% |   3.944 ms |       2.66% |   -4.081 us |  -0.10% |   SAME   |
| I32  |    2^16    |     0.5      |  42.137 us |      24.59% |  39.374 us |      80.15% |   -2.763 us |  -6.56% |   SAME   |
| I32  |    2^20    |     0.5      |  69.540 us |      15.57% |  69.100 us |      63.16% |   -0.440 us |  -0.63% |   SAME   |
| I32  |    2^24    |     0.5      | 301.848 us |       2.19% | 300.784 us |       1.28% |   -1.065 us |  -0.35% |   SAME   |
| I32  |    2^28    |     0.5      |   3.962 ms |       5.42% |   3.961 ms |       4.93% |   -1.704 us |  -0.04% |   SAME   |
| I32  |    2^16    |     0.01     |  49.163 us |      89.30% |  45.400 us |      16.42% |   -3.764 us |  -7.66% |   SAME   |
| I32  |    2^20    |     0.01     |  71.883 us |      63.06% |  70.766 us |       4.59% |   -1.117 us |  -1.55% |   SAME   |
| I32  |    2^24    |     0.01     | 305.453 us |      25.33% | 304.298 us |      16.40% |   -1.154 us |  -0.38% |   SAME   |
| I32  |    2^28    |     0.01     |   4.785 ms |       0.50% |   4.819 ms |       0.50% |   33.909 us |   0.71% |   SLOW   |
| I64  |    2^16    |      1       |  43.228 us |      17.70% |  42.177 us |      71.66% |   -1.051 us |  -2.43% |   SAME   |
| I64  |    2^20    |      1       |  84.847 us |      14.31% |  94.500 us |     177.12% |    9.653 us |  11.38% |   SAME   |
| I64  |    2^24    |      1       | 541.479 us |       0.88% | 547.327 us |      20.87% |    5.848 us |   1.08% |   SLOW   |
| I64  |    2^28    |      1       |   7.810 ms |       1.16% |   7.811 ms |       2.04% |    0.767 us |   0.01% |   SAME   |
| I64  |    2^16    |     0.5      |  43.242 us |      14.36% |  39.416 us |      26.06% |   -3.826 us |  -8.85% |   SAME   |
| I64  |    2^20    |     0.5      |  86.760 us |      70.38% |  84.565 us |       4.86% |   -2.195 us |  -2.53% |   SAME   |
| I64  |    2^24    |     0.5      | 550.581 us |      16.48% | 548.877 us |      17.38% |   -1.704 us |  -0.31% |   SAME   |
| I64  |    2^28    |     0.5      |   7.829 ms |       1.91% |   7.816 ms |       0.11% |  -12.872 us |  -0.16% |   FAST   |
| I64  |    2^16    |     0.01     |  50.755 us |     157.63% |  43.512 us |      67.15% |   -7.243 us | -14.27% |   SAME   |
| I64  |    2^20    |     0.01     |  94.091 us |      85.55% |  90.271 us |      29.35% |   -3.820 us |  -4.06% |   SAME   |
| I64  |    2^24    |     0.01     | 548.130 us |       0.85% | 544.964 us |       0.68% |   -3.166 us |  -0.58% |   SAME   |
| I64  |    2^28    |     0.01     |   8.296 ms |       1.99% |   8.274 ms |       0.50% |  -21.517 us |  -0.26% |   SAME   |
| I128 |    2^16    |      1       |  46.213 us |      35.17% |  45.490 us |      75.35% |   -0.722 us |  -1.56% |   SAME   |
| I128 |    2^20    |      1       | 113.639 us |       4.16% | 113.891 us |      49.22% |    0.252 us |   0.22% |   SAME   |
| I128 |    2^24    |      1       |   1.040 ms |      14.70% |   1.034 ms |      12.78% |   -6.069 us |  -0.58% |   SAME   |
| I128 |    2^28    |      1       |  15.581 ms |       0.15% |  15.593 ms |       0.20% |   11.849 us |   0.08% |   SAME   |
| I128 |    2^16    |     0.5      |  47.244 us |      34.59% |  48.081 us |     148.75% |    0.837 us |   1.77% |   SAME   |
| I128 |    2^20    |     0.5      | 120.271 us |      67.71% | 118.311 us |      33.21% |   -1.960 us |  -1.63% |   SAME   |
| I128 |    2^24    |     0.5      |   1.032 ms |       0.44% |   1.032 ms |       0.49% |   -0.506 us |  -0.05% |   SAME   |
| I128 |    2^28    |     0.5      |  15.618 ms |       1.38% |  15.604 ms |       0.78% |  -13.271 us |  -0.08% |   SAME   |
| I128 |    2^16    |     0.01     |  50.795 us |      92.86% |  49.921 us |      73.83% |   -0.874 us |  -1.72% |   SAME   |
| I128 |    2^20    |     0.01     | 118.445 us |      12.91% | 117.693 us |      28.31% |   -0.752 us |  -0.64% |   SAME   |
| I128 |    2^24    |     0.01     |   1.050 ms |      10.23% |   1.045 ms |       7.30% |   -5.625 us |  -0.54% |   SAME   |
| I128 |    2^28    |     0.01     |  16.042 ms |       0.08% |  16.028 ms |       0.06% |  -14.009 us |  -0.09% |   FAST   |
| F32  |    2^16    |      1       |  43.349 us |      50.85% |  38.624 us |      29.72% |   -4.725 us | -10.90% |   SAME   |
| F32  |    2^20    |      1       |  63.879 us |      16.81% |  62.904 us |      29.56% |   -0.975 us |  -1.53% |   SAME   |
| F32  |    2^24    |      1       | 302.790 us |      23.29% | 298.561 us |      13.27% |   -4.229 us |  -1.40% |   SAME   |
| F32  |    2^28    |      1       |   3.950 ms |       2.89% |   3.965 ms |       6.60% |   15.831 us |   0.40% |   SAME   |
| F32  |    2^16    |     0.5      |  42.210 us |      53.55% |  38.462 us |      26.29% |   -3.748 us |  -8.88% |   SAME   |
| F32  |    2^20    |     0.5      |  68.671 us |       4.19% |  68.953 us |      59.08% |    0.282 us |   0.41% |   SAME   |
| F32  |    2^24    |     0.5      | 307.301 us |      31.60% | 300.741 us |       1.16% |   -6.560 us |  -2.13% |   FAST   |
| F32  |    2^28    |     0.5      |   3.981 ms |       0.16% |   3.985 ms |       3.57% |    4.630 us |   0.12% |   SAME   |
| F32  |    2^16    |     0.01     |  49.036 us |      43.25% |  44.988 us |      72.12% |   -4.049 us |  -8.26% |   SAME   |
| F32  |    2^20    |     0.01     |  72.343 us |      60.99% |  72.100 us |      55.31% |   -0.243 us |  -0.34% |   SAME   |
| F32  |    2^24    |     0.01     | 305.944 us |      19.45% | 303.353 us |      19.43% |   -2.591 us |  -0.85% |   SAME   |
| F32  |    2^28    |     0.01     |   4.944 ms |       3.84% |   4.817 ms |       2.14% | -126.893 us |  -2.57% |   FAST   |
| F64  |    2^16    |      1       |  44.029 us |      94.79% |  42.527 us |      26.53% |   -1.502 us |  -3.41% |   SAME   |
| F64  |    2^20    |      1       |  87.612 us |      83.17% |  84.878 us |      84.00% |   -2.734 us |  -3.12% |   SAME   |
| F64  |    2^24    |      1       | 539.380 us |       0.87% | 536.610 us |       0.92% |   -2.770 us |  -0.51% |   SAME   |
| F64  |    2^28    |      1       |   7.781 ms |       0.50% |   7.777 ms |       0.16% |   -4.394 us |  -0.06% |   SAME   |
| F64  |    2^16    |     0.5      |  43.664 us |      30.53% |  41.996 us |      15.39% |   -1.668 us |  -3.82% |   SAME   |
| F64  |    2^20    |     0.5      |  85.005 us |       4.18% |  85.806 us |      55.31% |    0.800 us |   0.94% |   SAME   |
| F64  |    2^24    |     0.5      | 547.693 us |      14.47% | 539.460 us |       0.73% |   -8.233 us |  -1.50% |   FAST   |
| F64  |    2^28    |     0.5      |   7.804 ms |       2.28% |   7.796 ms |       2.04% |   -7.296 us |  -0.09% |   SAME   |
| F64  |    2^16    |     0.01     |  47.953 us |      40.22% |  46.605 us |      69.76% |   -1.348 us |  -2.81% |   SAME   |
| F64  |    2^20    |     0.01     |  89.236 us |       4.18% |  86.693 us |       3.80% |   -2.543 us |  -2.85% |   SAME   |
| F64  |    2^24    |     0.01     | 557.966 us |      31.47% | 547.790 us |      12.80% |  -10.177 us |  -1.82% |   SAME   |
| F64  |    2^28    |     0.01     |   8.363 ms |       0.18% |   8.325 ms |       0.11% |  -38.158 us |  -0.46% |   FAST   |

# Summary

- Total Matches: 84
  - Pass    (diff <= min_noise): 76
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  8
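
For reading the Status column: the rows above are consistent with a rule that compares the absolute %Diff against the smaller of the two noise figures. The sketch below is inferred only from the summary legend and the table rows, not taken from the nvbench comparison script, and `classify` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cmath>
#include <string>

// Classify one row from its %Diff and the two noise percentages.
// Assumption: a row counts as SAME when |%Diff| <= min(Ref Noise, Cmp Noise),
// which matches the "diff <= min_noise" wording in the summary.
inline std::string classify(double pct_diff, double ref_noise_pct,
                            double cmp_noise_pct) {
  const double min_noise = std::min(ref_noise_pct, cmp_noise_pct);
  if (std::abs(pct_diff) <= min_noise) {
    return "SAME";
  }
  // Negative diff means the compared run took less time than the reference.
  return pct_diff < 0 ? "FAST" : "SLOW";
}
```

For example, the I32, 2^28, 0.01 row has %Diff 0.71% against noise of 0.50% on both sides, so it is flagged SLOW, while the I64, 2^20, 1 row has %Diff 11.38% against a minimum noise of 14.31% and stays SAME.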

@dhruvilmehta dhruvilmehta changed the title changed is_partitioned impl by using thrust::transform_reduce instead of is_sorted changed is_partitioned impl to make it similar to pstl's is_partitioned Apr 11, 2026

Development

Successfully merging this pull request may close these issues.

thrust::is_partitioned is slower than it should be