Adds new virtual shared memory facility to DeviceMergeSort #1117

Merged: 9 commits merged into NVIDIA:main on Dec 6, 2023

@elstehle (Collaborator) commented on Nov 16, 2023

Description

Closes #549

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Results for cub.bench.merge_sort.keys.base on V100

cub.bench.merge_sort.keys.base

[0] Tesla V100-SXM2-32GB

T{ct} OffsetT{ct} Elements{io} Entropy Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 1 57.361 us 6.58% 54.631 us 2.62% -2.730 us -4.76% FAIL
I8 I32 2^20 1 161.505 us 0.92% 161.193 us 0.83% -0.312 us -0.19% PASS
I8 I32 2^24 1 1.572 ms 0.15% 1.571 ms 0.16% -0.931 us -0.06% PASS
I8 I32 2^28 1 29.056 ms 0.50% 29.026 ms 0.50% -29.795 us -0.10% PASS
I8 I32 2^16 0.201 54.049 us 2.62% 53.846 us 2.68% -0.203 us -0.38% PASS
I8 I32 2^20 0.201 156.019 us 0.98% 155.003 us 0.82% -1.016 us -0.65% PASS
I8 I32 2^24 0.201 1.475 ms 0.17% 1.475 ms 0.16% -0.526 us -0.04% PASS
I8 I32 2^28 0.201 26.618 ms 0.50% 26.616 ms 0.50% -1.927 us -0.01% PASS
I8 I64 2^16 1 55.521 us 2.57% 55.111 us 2.51% -0.410 us -0.74% PASS
I8 I64 2^20 1 164.403 us 0.95% 163.368 us 0.84% -1.035 us -0.63% PASS
I8 I64 2^24 1 1.583 ms 0.17% 1.584 ms 0.15% 1.653 us 0.10% PASS
I8 I64 2^28 1 29.250 ms 0.50% 29.246 ms 0.50% -3.738 us -0.01% PASS
I8 I64 2^16 0.201 54.600 us 2.47% 54.734 us 2.80% 0.134 us 0.25% PASS
I8 I64 2^20 0.201 156.824 us 0.98% 157.054 us 0.85% 0.231 us 0.15% PASS
I8 I64 2^24 0.201 1.487 ms 0.16% 1.489 ms 0.15% 1.929 us 0.13% PASS
I8 I64 2^28 0.201 26.831 ms 0.50% 26.828 ms 0.50% -2.132 us -0.01% PASS
I16 I32 2^16 1 57.918 us 2.81% 57.779 us 2.69% -0.139 us -0.24% PASS
I16 I32 2^20 1 170.207 us 0.84% 170.695 us 0.94% 0.488 us 0.29% PASS
I16 I32 2^24 1 1.881 ms 0.16% 1.883 ms 0.16% 1.344 us 0.07% PASS
I16 I32 2^28 1 35.901 ms 0.22% 35.906 ms 0.25% 5.266 us 0.01% PASS
I16 I32 2^16 0.201 57.400 us 3.16% 57.698 us 2.78% 0.298 us 0.52% PASS
I16 I32 2^20 0.201 162.373 us 0.90% 162.991 us 0.85% 0.619 us 0.38% PASS
I16 I32 2^24 0.201 1.792 ms 0.18% 1.793 ms 0.17% 1.500 us 0.08% PASS
I16 I32 2^28 0.201 33.123 ms 0.50% 33.123 ms 0.50% 0.358 us 0.00% PASS
I16 I64 2^16 1 58.243 us 2.71% 58.684 us 2.64% 0.441 us 0.76% PASS
I16 I64 2^20 1 171.645 us 0.83% 172.408 us 0.87% 0.763 us 0.44% PASS
I16 I64 2^24 1 1.847 ms 0.17% 1.846 ms 0.15% -1.493 us -0.08% PASS
I16 I64 2^28 1 35.107 ms 0.42% 35.067 ms 0.29% -40.072 us -0.11% PASS
I16 I64 2^16 0.201 58.037 us 3.04% 57.528 us 2.34% -0.508 us -0.88% PASS
I16 I64 2^20 0.201 164.881 us 0.90% 164.071 us 0.86% -0.811 us -0.49% PASS
I16 I64 2^24 0.201 1.772 ms 0.18% 1.770 ms 0.17% -1.292 us -0.07% PASS
I16 I64 2^28 0.201 32.553 ms 0.50% 32.555 ms 0.50% 1.652 us 0.01% PASS
I32 I32 2^16 1 55.880 us 3.10% 55.278 us 2.89% -0.602 us -1.08% PASS
I32 I32 2^20 1 168.913 us 1.12% 168.515 us 1.14% -0.398 us -0.24% PASS
I32 I32 2^24 1 2.697 ms 0.11% 2.697 ms 0.11% -0.165 us -0.01% PASS
I32 I32 2^28 1 53.283 ms 0.02% 53.291 ms 0.02% 7.912 us 0.01% PASS
I32 I32 2^16 0.201 55.770 us 2.80% 55.098 us 2.82% -0.673 us -1.21% PASS
I32 I32 2^20 0.201 165.220 us 1.28% 164.335 us 1.11% -0.885 us -0.54% PASS
I32 I32 2^24 0.201 2.653 ms 0.11% 2.653 ms 0.11% -0.802 us -0.03% PASS
I32 I32 2^28 0.201 51.982 ms 0.49% 51.985 ms 0.50% 3.468 us 0.01% PASS
I32 I64 2^16 1 56.621 us 2.84% 55.863 us 2.81% -0.759 us -1.34% PASS
I32 I64 2^20 1 170.427 us 0.99% 169.637 us 1.05% -0.789 us -0.46% PASS
I32 I64 2^24 1 2.708 ms 0.11% 2.708 ms 0.13% 0.210 us 0.01% PASS
I32 I64 2^28 1 53.294 ms 0.02% 53.299 ms 0.02% 5.306 us 0.01% PASS
I32 I64 2^16 0.201 56.533 us 4.31% 56.666 us 2.90% 0.133 us 0.24% PASS
I32 I64 2^20 0.201 165.266 us 1.17% 166.081 us 1.07% 0.815 us 0.49% PASS
I32 I64 2^24 0.201 2.662 ms 0.13% 2.662 ms 0.13% 0.196 us 0.01% PASS
I32 I64 2^28 0.201 52.051 ms 0.48% 52.050 ms 0.49% -1.024 us -0.00% PASS
I64 I32 2^16 1 63.103 us 2.65% 63.859 us 2.52% 0.756 us 1.20% PASS
I64 I32 2^20 1 395.912 us 0.65% 396.724 us 0.63% 0.812 us 0.21% PASS
I64 I32 2^24 1 5.765 ms 0.08% 5.763 ms 0.07% -2.177 us -0.04% PASS
I64 I32 2^28 1 116.216 ms 0.08% 116.220 ms 0.08% 4.097 us 0.00% PASS
I64 I32 2^16 0.201 64.570 us 2.46% 63.857 us 2.80% -0.714 us -1.11% PASS
I64 I32 2^20 0.201 401.225 us 0.72% 400.720 us 0.70% -0.505 us -0.13% PASS
I64 I32 2^24 0.201 5.808 ms 0.08% 5.806 ms 0.08% -2.260 us -0.04% PASS
I64 I32 2^28 0.201 115.780 ms 0.04% 115.788 ms 0.03% 8.285 us 0.01% PASS
I64 I64 2^16 1 64.539 us 2.72% 64.120 us 2.60% -0.419 us -0.65% PASS
I64 I64 2^20 1 398.059 us 0.63% 398.081 us 0.66% 0.022 us 0.01% PASS
I64 I64 2^24 1 5.773 ms 0.06% 5.772 ms 0.06% -0.942 us -0.02% PASS
I64 I64 2^28 1 116.296 ms 0.06% 116.236 ms 0.09% -60.043 us -0.05% PASS
I64 I64 2^16 0.201 65.613 us 2.49% 65.376 us 2.70% -0.237 us -0.36% PASS
I64 I64 2^20 0.201 403.609 us 0.59% 404.018 us 0.65% 0.409 us 0.10% PASS
I64 I64 2^24 0.201 5.811 ms 0.08% 5.812 ms 0.07% 0.424 us 0.01% PASS
I64 I64 2^28 0.201 115.835 ms 0.04% 115.847 ms 0.05% 11.450 us 0.01% PASS
I128 I32 2^16 1 72.274 us 2.59% 72.825 us 2.81% 0.551 us 0.76% PASS
I128 I32 2^20 1 716.527 us 0.57% 715.194 us 0.65% -1.332 us -0.19% PASS
I128 I32 2^24 1 11.891 ms 0.09% 11.891 ms 0.09% 0.357 us 0.00% PASS
I128 I32 2^28 1 235.594 ms 0.07% 235.573 ms 0.06% -21.125 us -0.01% PASS
I128 I32 2^16 0.201 72.226 us 2.61% 73.122 us 2.67% 0.897 us 1.24% PASS
I128 I32 2^20 0.201 716.452 us 0.56% 716.085 us 0.60% -0.367 us -0.05% PASS
I128 I32 2^24 0.201 11.892 ms 0.09% 11.892 ms 0.10% -0.119 us -0.00% PASS
I128 I32 2^28 0.201 235.665 ms 0.10% 235.692 ms 0.06% 26.811 us 0.01% PASS
I128 I64 2^16 1 73.328 us 2.85% 73.319 us 2.66% -0.009 us -0.01% PASS
I128 I64 2^20 1 718.717 us 0.51% 716.663 us 0.65% -2.054 us -0.29% PASS
I128 I64 2^24 1 11.896 ms 0.08% 11.898 ms 0.09% 2.024 us 0.02% PASS
I128 I64 2^28 1 235.582 ms 0.07% 235.663 ms 0.10% 80.335 us 0.03% PASS
I128 I64 2^16 0.201 73.022 us 2.49% 73.111 us 2.60% 0.089 us 0.12% PASS
I128 I64 2^20 0.201 718.404 us 0.55% 717.463 us 0.55% -0.941 us -0.13% PASS
I128 I64 2^24 0.201 11.894 ms 0.08% 11.895 ms 0.10% 0.929 us 0.01% PASS
I128 I64 2^28 0.201 235.753 ms 0.09% 235.775 ms 0.10% 21.412 us 0.01% PASS
F32 I32 2^16 1 55.453 us 3.03% 55.095 us 3.21% -0.358 us -0.65% PASS
F32 I32 2^20 1 167.572 us 1.03% 167.444 us 1.16% -0.127 us -0.08% PASS
F32 I32 2^24 1 2.689 ms 0.11% 2.689 ms 0.12% -0.099 us -0.00% PASS
F32 I32 2^28 1 53.139 ms 0.03% 53.137 ms 0.02% -2.793 us -0.01% PASS
F32 I32 2^16 0.201 55.219 us 2.93% 54.730 us 3.19% -0.489 us -0.88% PASS
F32 I32 2^20 0.201 164.975 us 1.29% 164.847 us 1.24% -0.128 us -0.08% PASS
F32 I32 2^24 0.201 2.647 ms 0.12% 2.647 ms 0.11% -0.509 us -0.02% PASS
F32 I32 2^28 0.201 51.889 ms 0.48% 51.892 ms 0.49% 2.645 us 0.01% PASS
F32 I64 2^16 1 55.499 us 2.88% 55.432 us 3.12% -0.066 us -0.12% PASS
F32 I64 2^20 1 168.288 us 1.01% 168.434 us 0.96% 0.146 us 0.09% PASS
F32 I64 2^24 1 2.700 ms 0.12% 2.700 ms 0.12% -0.237 us -0.01% PASS
F32 I64 2^28 1 53.142 ms 0.02% 53.146 ms 0.02% 4.096 us 0.01% PASS
F32 I64 2^16 0.201 55.844 us 3.27% 55.986 us 2.85% 0.142 us 0.25% PASS
F32 I64 2^20 0.201 166.714 us 1.30% 166.461 us 1.17% -0.253 us -0.15% PASS
F32 I64 2^24 0.201 2.656 ms 0.13% 2.656 ms 0.11% -0.065 us -0.00% PASS
F32 I64 2^28 0.201 51.972 ms 0.48% 51.959 ms 0.48% -12.497 us -0.02% PASS
F64 I32 2^16 1 63.133 us 2.73% 63.064 us 2.88% -0.070 us -0.11% PASS
F64 I32 2^20 1 396.074 us 0.72% 395.387 us 0.71% -0.687 us -0.17% PASS
F64 I32 2^24 1 5.766 ms 0.09% 5.764 ms 0.07% -1.448 us -0.03% PASS
F64 I32 2^28 1 116.245 ms 0.10% 116.256 ms 0.06% 10.984 us 0.01% PASS
F64 I32 2^16 0.201 64.031 us 2.80% 64.176 us 2.59% 0.144 us 0.23% PASS
F64 I32 2^20 0.201 399.947 us 0.61% 399.620 us 0.64% -0.326 us -0.08% PASS
F64 I32 2^24 0.201 5.797 ms 0.07% 5.796 ms 0.07% -0.753 us -0.01% PASS
F64 I32 2^28 0.201 115.938 ms 0.05% 115.920 ms 0.04% -18.247 us -0.02% PASS
F64 I64 2^16 1 63.897 us 3.08% 64.237 us 2.76% 0.341 us 0.53% PASS
F64 I64 2^20 1 398.150 us 0.63% 398.080 us 0.65% -0.070 us -0.02% PASS
F64 I64 2^24 1 5.771 ms 0.07% 5.772 ms 0.08% 1.036 us 0.02% PASS
F64 I64 2^28 1 116.202 ms 0.09% 116.256 ms 0.05% 53.901 us 0.05% PASS
F64 I64 2^16 0.201 64.399 us 2.79% 64.525 us 2.71% 0.126 us 0.20% PASS
F64 I64 2^20 0.201 401.394 us 0.68% 400.991 us 0.61% -0.403 us -0.10% PASS
F64 I64 2^24 0.201 5.801 ms 0.07% 5.803 ms 0.07% 2.177 us 0.04% PASS
F64 I64 2^28 0.201 115.977 ms 0.06% 115.981 ms 0.04% 3.723 us 0.00% PASS
C64 I32 2^16 1 254.834 us 1.94% 254.690 us 1.97% -0.144 us -0.06% PASS
C64 I32 2^20 1 1.087 ms 1.01% 1.089 ms 0.91% 1.958 us 0.18% PASS
C64 I32 2^24 1 17.241 ms 0.11% 17.283 ms 0.15% 42.071 us 0.24% FAIL
C64 I32 2^28 1 347.701 ms 0.05% 349.092 ms 0.04% 1.391 ms 0.40% FAIL
C64 I32 2^16 0.201 378.342 us 1.10% 377.350 us 1.15% -0.992 us -0.26% PASS
C64 I32 2^20 0.201 1.978 ms 2.33% 1.974 ms 2.27% -4.128 us -0.21% PASS
C64 I32 2^24 0.201 29.942 ms 0.16% 29.996 ms 0.16% 54.392 us 0.18% FAIL
C64 I32 2^28 0.201 518.921 ms 0.04% 519.988 ms 0.02% 1.067 ms 0.21% FAIL
C64 I64 2^16 1 257.246 us 1.75% 256.792 us 1.77% -0.454 us -0.18% PASS
C64 I64 2^20 1 1.097 ms 0.98% 1.094 ms 0.94% -2.902 us -0.26% PASS
C64 I64 2^24 1 17.512 ms 0.12% 17.473 ms 0.16% -39.089 us -0.22% FAIL
C64 I64 2^28 1 354.382 ms 0.03% 353.200 ms 0.05% -1.182 ms -0.33% FAIL
C64 I64 2^16 0.201 380.073 us 1.15% 380.308 us 1.12% 0.235 us 0.06% PASS
C64 I64 2^20 0.201 2.001 ms 2.32% 1.988 ms 2.17% -13.876 us -0.69% PASS
C64 I64 2^24 0.201 30.424 ms 0.11% 30.330 ms 0.13% -94.027 us -0.31% FAIL
C64 I64 2^28 0.201 527.414 ms 0.04% 526.048 ms 0.02% -1.366 ms -0.26% FAIL
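
For readers checking the table by hand, here is a minimal sketch of how the Diff, %Diff and Status columns relate to the other columns. Diff and %Diff follow directly from the reported times. The Status rule (FAIL when the absolute %Diff exceeds the smaller of the two noise values) is an assumption inferred from the rows above, not something this PR states. The function name `compare_row` is hypothetical. Under that reading, FAIL marks a difference larger than measurement noise in either direction, including speedups such as the first row.

```python
def compare_row(ref_time, cmp_time, ref_noise_pct, cmp_noise_pct):
    """Recompute Diff, %Diff and Status for one benchmark row.

    Times must share a unit (e.g. both in us). Noise values are percentages.
    The PASS/FAIL rule is an assumption inferred from the table, not a
    documented guarantee of the comparison tool.
    """
    diff = cmp_time - ref_time              # negative means Cmp is faster
    pct_diff = 100.0 * diff / ref_time      # relative to the reference run
    threshold = min(ref_noise_pct, cmp_noise_pct)
    status = "PASS" if abs(pct_diff) <= threshold else "FAIL"
    return diff, pct_diff, status


def format_row(diff, pct_diff, status):
    """Render the three derived columns the way the table shows them."""
    return f"{diff:.3f} {pct_diff:.2f}% {status}"


# First row of the table: I8, I32, 2^16 elements, entropy 1 (times in us).
print(format_row(*compare_row(57.361, 54.631, 6.58, 2.62)))
# prints "-2.730 -4.76% FAIL"
```

With this reading, the FAIL entries in the C64 rows reflect very low noise (0.02% to 0.16%) rather than large changes: every %Diff in those rows stays within about 0.4% of the reference time.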

@elstehle elstehle requested review from a team as code owners November 16, 2023 18:00
@elstehle elstehle requested review from gevtushenko and miscco and removed request for a team November 16, 2023 18:00
@elstehle elstehle merged commit d855743 into NVIDIA:main Dec 6, 2023
516 checks passed
Linked issue: Integrate new VShmem facility into DeviceMergeSort