
Support fancy iterators in vectorized transform and port thrust::tabulate to it#6012

Merged
bernhardmgruber merged 9 commits into NVIDIA:main from bernhardmgruber:fancy_transform
Sep 30, 2025

Conversation

@bernhardmgruber
Contributor

@bernhardmgruber bernhardmgruber commented Sep 24, 2025

Benchmark results for thrust::sequence and thrust::tabulate before (Ref) and after (Cmp) this change:

B200

# sequence

## [0] NVIDIA B200

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |    2^16    |   8.726 us |       4.23% |   8.950 us |       4.17% |    0.224 us |   2.57% |   SAME   |
|   I8    |    2^20    |   9.672 us |       3.85% |   9.031 us |       8.99% |   -0.641 us |  -6.63% |   FAST   |
|   I8    |    2^24    |  25.408 us |       1.22% |  11.243 us |       2.92% |  -14.165 us | -55.75% |   FAST   |
|   I8    |    2^28    | 275.036 us |       0.20% |  44.473 us |       0.95% | -230.563 us | -83.83% |   FAST   |
|   I16   |    2^16    |   8.681 us |       3.93% |   8.832 us |       4.84% |    0.151 us |   1.74% |   SAME   |
|   I16   |    2^20    |   9.707 us |       2.99% |   9.138 us |       4.58% |   -0.569 us |  -5.86% |   FAST   |
|   I16   |    2^24    |  25.654 us |       2.27% |  13.627 us |       2.73% |  -12.027 us | -46.88% |   FAST   |
|   I16   |    2^28    | 275.101 us |       0.16% |  79.732 us |       0.96% | -195.368 us | -71.02% |   FAST   |
|   I32   |    2^16    |   8.865 us |       8.48% |   9.033 us |       3.60% |    0.168 us |   1.89% |   SAME   |
|   I32   |    2^20    |   9.757 us |       4.64% |   9.463 us |       4.26% |   -0.294 us |  -3.01% |   SAME   |
|   I32   |    2^24    |  25.449 us |       1.43% |  17.660 us |       2.32% |   -7.789 us | -30.60% |   FAST   |
|   I32   |    2^28    | 275.090 us |       0.14% | 150.338 us |       0.43% | -124.752 us | -45.35% |   FAST   |
|   I64   |    2^16    |   8.701 us |       6.58% |   8.884 us |       4.88% |    0.184 us |   2.11% |   SAME   |
|   I64   |    2^20    |   9.958 us |       3.20% |  10.153 us |       3.33% |    0.195 us |   1.96% |   SAME   |
|   I64   |    2^24    |  26.378 us |       1.63% |  26.722 us |       2.78% |    0.343 us |   1.30% |   SAME   |
|   I64   |    2^28    | 291.084 us |       0.40% | 291.368 us |       0.38% |    0.284 us |   0.10% |   SAME   |

# seg_size

## [0] NVIDIA B200

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |    2^16    |  10.114 us |       3.24% |  10.642 us |       4.50% |    0.528 us |   5.22% |   SLOW   |
|   I8    |    2^20    |  11.566 us |       2.66% |  10.818 us |       5.76% |   -0.747 us |  -6.46% |   FAST   |
|   I8    |    2^24    |  42.721 us |       1.53% |  17.983 us |       2.28% |  -24.739 us | -57.91% |   FAST   |
|   I8    |    2^28    | 541.304 us |       0.10% | 125.420 us |       0.34% | -415.884 us | -76.83% |   FAST   |
|   I16   |    2^16    |  10.094 us |       4.36% |   9.902 us |       5.06% |   -0.192 us |  -1.90% |   SAME   |
|   I16   |    2^20    |  12.110 us |       4.82% |  10.572 us |       3.11% |   -1.538 us | -12.70% |   FAST   |
|   I16   |    2^24    |  46.531 us |       2.00% |  23.409 us |       1.68% |  -23.123 us | -49.69% |   FAST   |
|   I16   |    2^28    | 584.457 us |       0.09% | 211.022 us |       0.32% | -373.435 us | -63.89% |   FAST   |
|   I32   |    2^16    |  10.137 us |       4.13% |   9.696 us |       5.77% |   -0.442 us |  -4.36% |   FAST   |
|   I32   |    2^20    |  12.144 us |       4.38% |  11.012 us |       2.85% |   -1.132 us |  -9.32% |   FAST   |
|   I32   |    2^24    |  50.881 us |       1.04% |  32.073 us |       1.54% |  -18.809 us | -36.97% |   FAST   |
|   I32   |    2^28    | 643.465 us |       0.08% | 339.741 us |       0.14% | -303.724 us | -47.20% |   FAST   |
|   I64   |    2^16    |  10.248 us |       4.25% |   9.777 us |       6.25% |   -0.470 us |  -4.59% |   FAST   |
|   I64   |    2^20    |  13.325 us |       2.16% |  12.565 us |       2.51% |   -0.761 us |  -5.71% |   FAST   |
|   I64   |    2^24    |  61.892 us |       0.68% |  49.092 us |       1.32% |  -12.800 us | -20.68% |   FAST   |
|   I64   |    2^28    | 808.201 us |       0.10% | 622.470 us |       0.20% | -185.731 us | -22.98% |   FAST   |

Fixes: #5808
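
For readers unfamiliar with why tabulate maps onto a transform: `tabulate` writes `op(i)` to output position `i`, which is the same as transforming a counting ("fancy") iterator that produces `0, 1, 2, ...` without reading memory. The following is a minimal host-only sketch of that idea, not the Thrust or CUB implementation. The names `counting_iter`, `transform_n`, `tabulate_sketch`, `sequence_sketch`, `squares`, and `self_check` are hypothetical, and expressing a sequence through tabulate is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a fancy counting iterator: it stores no data and
// computes its value on access instead of loading it from memory.
struct counting_iter
{
  std::int64_t base;
  std::int64_t operator*() const { return base; }
  std::int64_t operator[](std::ptrdiff_t n) const { return base + n; }
};

// Plain sequential transform: out[k] = op(in[k]) for k in [0, n).
template <class In, class Out, class Op>
void transform_n(In in, std::size_t n, Out out, Op op)
{
  for (std::size_t k = 0; k < n; ++k)
  {
    out[k] = op(in[static_cast<std::ptrdiff_t>(k)]);
  }
}

// tabulate expressed as a transform over a counting iterator: out[i] = op(i).
template <class Out, class Op>
void tabulate_sketch(Out out, std::size_t n, Op op)
{
  transform_n(counting_iter{0}, n, out, op);
}

// Assumption for illustration: a sequence is a tabulate with op(i) = init + i * step.
template <class Out, class T>
void sequence_sketch(Out out, std::size_t n, T init, T step)
{
  tabulate_sketch(out, n, [=](std::int64_t i) { return static_cast<T>(init + static_cast<T>(i) * step); });
}

// Small usage helper: the first n square numbers.
std::vector<int> squares(std::size_t n)
{
  std::vector<int> v(n);
  tabulate_sketch(v.data(), n, [](std::int64_t i) { return static_cast<int>(i * i); });
  return v;
}

// Usage check for both helpers.
void self_check()
{
  assert((squares(4) == std::vector<int>{0, 1, 4, 9}));
  std::vector<int> s(4);
  sequence_sketch(s.data(), s.size(), 10, 3);
  assert((s == std::vector<int>{10, 13, 16, 19}));
}
```

Because the only real memory traffic in such a call is the output write, the input side has no element size to contribute, which is the situation the review comments further down discuss.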

@copy-pr-bot
Contributor

copy-pr-bot bot commented Sep 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Sep 24, 2025
Comment on lines +377 to +380
(int{sizeof(it_value_t<RandomAccessIteratorsIn>)}
* THRUST_NS_QUALIFIER::is_contiguous_iterator_v<RandomAccessIteratorsIn>) ...,
int{size_of<it_value_t<RandomAccessIteratorOut>>});
Contributor

🙀 I believe those warrant a slightly more elaborate comment

Contributor Author

Added.

Contributor Author

I think this needs a bigger refactoring in general

@bernhardmgruber bernhardmgruber force-pushed the fancy_transform branch 2 times, most recently from 3a9e091 to 512baf5 September 26, 2025 10:50
@bernhardmgruber bernhardmgruber marked this pull request as ready for review September 26, 2025 11:05
@bernhardmgruber bernhardmgruber requested review from a team as code owners September 26, 2025 11:05
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Sep 26, 2025
@bernhardmgruber bernhardmgruber changed the title from "Support mixing fancy iterators with normal ones in vectorized transform" to "Support fancy iterators in vectorized transform and port thrust::tabulate to it" Sep 28, 2025
Comment on lines +249 to +252
constexpr int element_size = int{first_nonzero_value(
(sizeof(it_value_t<RandomAccessIteratorsIn>)
* THRUST_NS_QUALIFIER::is_contiguous_iterator_v<RandomAccessIteratorsIn>) ...,
size_of<output_t>)};
Contributor

We should really pull that out into a function

Contributor Author

I am strongly considering refactoring the entire mess, so let's postpone any small fixes for now.
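
For context on the expression under discussion: multiplying each input's `sizeof` by its "is contiguous" flag zeroes out the fancy iterators, and taking the first nonzero value then picks the element size of the first contiguous input, with the output size as the fallback. A minimal host-only sketch of that selection follows. It is not the CUB code: `first_nonzero`, `is_contiguous_v`, `value_t`, `fancy_counter`, and `element_size_v` are hypothetical names, and treating only raw pointers as contiguous is a simplifying assumption.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Returns the first nonzero argument, or 0 if all arguments are zero.
// Hypothetical re-implementation; the helper used in the PR may differ.
template <class... Ts>
constexpr std::size_t first_nonzero(Ts... values)
{
  std::size_t result = 0;
  // Fold over the comma operator: keep the first nonzero value encountered.
  ((result = (result == 0 ? static_cast<std::size_t>(values) : result)), ...);
  return result;
}

// Assumption: only raw pointers count as contiguous iterators in this sketch.
template <class It>
inline constexpr bool is_contiguous_v = std::is_pointer_v<It>;

// Value type obtained by dereferencing the iterator.
template <class It>
using value_t = std::remove_cv_t<std::remove_reference_t<decltype(*std::declval<It>())>>;

// Hypothetical fancy iterator: produces values without reading memory.
struct fancy_counter
{
  long long operator*() const { return 0; }
};

// sizeof * bool is 0 for fancy inputs, so they are skipped by first_nonzero.
// The output element size is the fallback when no input is contiguous.
template <class Out, class... Ins>
inline constexpr std::size_t element_size_v =
  first_nonzero((sizeof(value_t<Ins>) * is_contiguous_v<Ins>) ..., sizeof(value_t<Out>));

// A fancy input is skipped; the first contiguous input decides.
static_assert(element_size_v<int*, fancy_counter, const short*> == sizeof(short));
// No contiguous input at all: fall back to the output element size.
static_assert(element_size_v<int*, fancy_counter> == sizeof(int));
```

Pulling the selection into a named helper like this is one way to address the request for a function; the real refactoring may look different.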

Comment on lines +365 to +368
static constexpr bool can_memcpy_contiguous_inputs =
((!THRUST_NS_QUALIFIER::is_contiguous_iterator_v<RandomAccessIteratorsIn>
|| THRUST_NS_QUALIFIER::is_trivially_relocatable_v<it_value_t<RandomAccessIteratorsIn>>)
&& ...);
Contributor

We should have a trait for that, I guess it will come up more often
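
As an illustration of what such a trait could look like, the condition reads: every input is either not contiguous (so it is never copied with memcpy) or has a value type that is safe to copy bytewise. The sketch below is a hedged, host-only approximation. `can_memcpy_contiguous_v`, `is_contiguous_v`, `is_relocatable_v`, `value_t`, `fancy_counter`, and `tracked` are hypothetical names, raw pointers stand in for contiguous iterators, and `std::is_trivially_copyable_v` is used as a conservative stand-in for Thrust's trivially-relocatable trait.

```cpp
#include <type_traits>
#include <utility>

// Assumption: only raw pointers count as contiguous iterators in this sketch.
template <class It>
inline constexpr bool is_contiguous_v = std::is_pointer_v<It>;

// Conservative stand-in for a "trivially relocatable" check.
template <class T>
inline constexpr bool is_relocatable_v = std::is_trivially_copyable_v<T>;

// Value type obtained by dereferencing the iterator.
template <class It>
using value_t = std::remove_cv_t<std::remove_reference_t<decltype(*std::declval<It>())>>;

// True when every contiguous input has a bytewise-copyable value type.
// Non-contiguous (fancy) inputs impose no constraint.
template <class... Its>
inline constexpr bool can_memcpy_contiguous_v =
  ((!is_contiguous_v<Its> || is_relocatable_v<value_t<Its>>) && ...);

// Hypothetical fancy iterator: produces values without reading memory.
struct fancy_counter
{
  long long operator*() const { return 0; }
};

// A type with a user-provided copy constructor is not trivially copyable.
struct tracked
{
  tracked() = default;
  tracked(const tracked&) {}
  int x = 0;
};

static_assert(can_memcpy_contiguous_v<const int*, fancy_counter>);
static_assert(!can_memcpy_contiguous_v<const tracked*, const int*>);
```

A variable template keeps the call site to a single name, which matters if the same check shows up in several dispatch paths.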

@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 3h 48m: Pass: 100%/185 | Total: 8d 00h | Max: 3h 34m | Hits: 18%/189478

See results here.

@bernhardmgruber bernhardmgruber enabled auto-merge (squash) September 29, 2025 19:14
@bernhardmgruber bernhardmgruber merged commit ad70011 into NVIDIA:main Sep 30, 2025
195 of 196 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Sep 30, 2025
@bernhardmgruber bernhardmgruber deleted the fancy_transform branch September 30, 2025 13:00
Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Use cub::DeviceTransform::Transform in thrust::tabulate

3 participants