Use synchronize_optional for device-to-device copy in thrust::copy() #3149

Merged
miscco merged 1 commit into NVIDIA:main from davidwendt:thrust-copy-nosync
Dec 13, 2024
@davidwendt
Contributor

Description

Changes the call to synchronize() for the target stream to synchronize_optional() in the trivial_copy_device_to_device() utility. This allows the caller to pass in a thrust::cuda::par_nosync policy without requiring a sync on the stream. The return value for thrust::copy() does not rely on the copy result but is simply an increment of the output iterator.

References issue #1474; this change only handles the device-to-device stream sync case.

For RAPIDS libcudf this results in an up to 2x performance improvement in certain cases that use thrust::copy().
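
To illustrate the idea described above, here is a minimal host-only sketch of the difference between an unconditional and an optional stream synchronization. It is a toy model, not Thrust's implementation: `sync_policy`, `nosync_policy`, `fake_stream`, and `copy_n_model` are hypothetical names invented for this example, and the "stream" only counts how often the host would have waited on it. The two free functions borrow the names `synchronize` and `synchronize_optional` from the description but are simplified stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for execution policies; these are NOT Thrust types.
struct sync_policy   { static constexpr bool optional_sync_required = true;  };
struct nosync_policy { static constexpr bool optional_sync_required = false; };

// Toy "stream" that only counts how many times the host waited on it.
struct fake_stream { int sync_count = 0; };

// Unconditional wait: every policy pays for the synchronization.
template <class Policy>
void synchronize(Policy, fake_stream& s) { ++s.sync_count; }

// Optional wait: skipped when the policy opts out of synchronization.
template <class Policy>
void synchronize_optional(Policy, fake_stream& s) {
  if (Policy::optional_sync_required) { ++s.sync_count; }
}

// Models a device-to-device copy. The returned position depends only on the
// element count, never on the copied data, so waiting is not needed to
// compute it.
template <class Policy>
std::size_t copy_n_model(Policy p, fake_stream& s,
                         const std::vector<int>& in, std::vector<int>& out) {
  assert(out.size() >= in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = in[i];  // stands in for the asynchronous memcpy
  }
  synchronize_optional(p, s);  // the model's analogue of the changed call
  return in.size();            // "output iterator" advanced by the count
}
```

In this model, calling `copy_n_model` with `nosync_policy{}` leaves `sync_count` at 0, while `sync_policy{}` increments it once; both return the same end position. With a real asynchronous copy, a caller that skips the wait would need to synchronize the stream itself before reading the destination.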

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt requested review from a team as code owners December 12, 2024 22:52
@copy-pr-bot
Contributor

copy-pr-bot bot commented Dec 12, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@davidwendt
Contributor Author

If necessary, I have reproducer code and nsys traces that show the synchronize is removed from thrust::copy() (device-to-device) when this change is applied.

@miscco
Contributor

miscco commented Dec 13, 2024

/ok to test

@github-actions
Contributor

🟩 CI finished in 1h 31m: Pass: 100%/94 | Total: 2d 14h | Avg: 39m 57s | Max: 1h 08m | Hits: 67%/12384
  • 🟩 thrust: Pass: 100%/46 | Total: 1d 00h | Avg: 32m 18s | Max: 1h 03m | Hits: 69%/9260

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 48m 40s | Avg: 24m 20s | Max: 28m 22s
    🟩 cpu
      🟩 amd64              Pass: 100%/44  | Total: 23h 40m | Avg: 32m 17s | Max:  1h 03m | Hits:  69%/9260  
      🟩 arm64              Pass: 100%/2   | Total:  1h 05m | Avg: 32m 33s | Max: 35m 42s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  3h 29m | Avg: 29m 56s | Max: 54m 01s | Hits:  62%/1852  
      🟩 12.5               Pass: 100%/2   | Total:  1h 42m | Avg: 51m 12s | Max: 52m 04s
      🟩 12.6               Pass: 100%/37  | Total: 19h 33m | Avg: 31m 43s | Max:  1h 03m | Hits:  71%/7408  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 57m 08s | Avg: 28m 34s | Max: 30m 10s
      🟩 nvcc11.1           Pass: 100%/7   | Total:  3h 29m | Avg: 29m 56s | Max: 54m 01s | Hits:  62%/1852  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 42m | Avg: 51m 12s | Max: 52m 04s
      🟩 nvcc12.6           Pass: 100%/35  | Total: 18h 36m | Avg: 31m 54s | Max:  1h 03m | Hits:  71%/7408  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 57m 08s | Avg: 28m 34s | Max: 30m 10s
      🟩 nvcc               Pass: 100%/44  | Total: 23h 48m | Avg: 32m 28s | Max:  1h 03m | Hits:  69%/9260  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  1h 51m | Avg: 27m 46s | Max: 34m 21s
      🟩 Clang10            Pass: 100%/1   | Total: 32m 50s | Avg: 32m 50s | Max: 32m 50s
      🟩 Clang11            Pass: 100%/1   | Total: 30m 44s | Avg: 30m 44s | Max: 30m 44s
      🟩 Clang12            Pass: 100%/1   | Total: 30m 20s | Avg: 30m 20s | Max: 30m 20s
      🟩 Clang13            Pass: 100%/1   | Total: 32m 40s | Avg: 32m 40s | Max: 32m 40s
      🟩 Clang14            Pass: 100%/1   | Total: 33m 17s | Avg: 33m 17s | Max: 33m 17s
      🟩 Clang15            Pass: 100%/1   | Total: 32m 04s | Avg: 32m 04s | Max: 32m 04s
      🟩 Clang16            Pass: 100%/1   | Total: 35m 11s | Avg: 35m 11s | Max: 35m 11s
      🟩 Clang17            Pass: 100%/1   | Total: 36m 17s | Avg: 36m 17s | Max: 36m 17s
      🟩 Clang18            Pass: 100%/7   | Total:  3h 01m | Avg: 25m 54s | Max: 35m 40s
      🟩 GCC6               Pass: 100%/2   | Total: 51m 33s | Avg: 25m 46s | Max: 29m 45s
      🟩 GCC7               Pass: 100%/2   | Total: 55m 59s | Avg: 27m 59s | Max: 31m 43s
      🟩 GCC8               Pass: 100%/1   | Total: 32m 47s | Avg: 32m 47s | Max: 32m 47s
      🟩 GCC9               Pass: 100%/3   | Total:  1h 27m | Avg: 29m 05s | Max: 34m 26s
      🟩 GCC10              Pass: 100%/1   | Total: 35m 27s | Avg: 35m 27s | Max: 35m 27s
      🟩 GCC11              Pass: 100%/1   | Total: 34m 58s | Avg: 34m 58s | Max: 34m 58s
      🟩 GCC12              Pass: 100%/1   | Total: 34m 29s | Avg: 34m 29s | Max: 34m 29s
      🟩 GCC13              Pass: 100%/8   | Total:  3h 29m | Avg: 26m 08s | Max: 38m 19s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 40m 02s | Avg: 40m 02s | Max: 40m 02s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 54m 01s | Avg: 54m 01s | Max: 54m 01s | Hits:  62%/1852  
      🟩 MSVC14.29          Pass: 100%/1   | Total: 53m 15s | Avg: 53m 15s | Max: 53m 15s | Hits:  62%/1852  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 18m | Avg: 46m 17s | Max:  1h 03m | Hits:  74%/5556  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 42m | Avg: 51m 12s | Max: 52m 04s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  9h 15m | Avg: 29m 15s | Max: 36m 17s
      🟩 GCC                Pass: 100%/19  | Total:  9h 01m | Avg: 28m 30s | Max: 38m 19s
      🟩 Intel              Pass: 100%/1   | Total: 40m 02s | Avg: 40m 02s | Max: 40m 02s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 06m | Avg: 49m 13s | Max:  1h 03m | Hits:  69%/9260  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 42m | Avg: 51m 12s | Max: 52m 04s
    🟩 gpu
      🟩 v100               Pass: 100%/46  | Total:  1d 00h | Avg: 32m 18s | Max:  1h 03m | Hits:  69%/9260  
    🟩 jobs
      🟩 Build              Pass: 100%/40  | Total: 23h 08m | Avg: 34m 42s | Max:  1h 03m | Hits:  62%/7408  
      🟩 TestCPU            Pass: 100%/3   | Total: 36m 49s | Avg: 12m 16s | Max: 21m 11s | Hits:  99%/1852  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 01m | Avg: 20m 21s | Max: 20m 59s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 22m 27s | Avg: 22m 27s | Max: 22m 27s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  1h 57m | Avg: 23m 31s | Max: 25m 34s
      🟩 14                 Pass: 100%/4   | Total:  2h 29m | Avg: 37m 27s | Max: 54m 01s | Hits:  62%/1852  
      🟩 17                 Pass: 100%/12  | Total:  7h 33m | Avg: 37m 48s | Max: 54m 09s | Hits:  62%/3704  
      🟩 20                 Pass: 100%/23  | Total: 11h 56m | Avg: 31m 08s | Max:  1h 03m | Hits:  80%/3704  
    
  • 🟩 cub: Pass: 100%/45 | Total: 1d 13h | Avg: 49m 36s | Max: 1h 08m | Hits: 60%/3124

    🟩 cpu
      🟩 amd64              Pass: 100%/43  | Total:  1d 11h | Avg: 49m 20s | Max:  1h 08m | Hits:  60%/3124  
      🟩 arm64              Pass: 100%/2   | Total:  1h 51m | Avg: 55m 31s | Max: 55m 58s
    🟩 ctk
      🟩 11.1               Pass: 100%/7   | Total:  5h 40m | Avg: 48m 36s | Max: 55m 57s | Hits:  60%/781   
      🟩 12.5               Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 07m
      🟩 12.6               Pass: 100%/36  | Total:  1d 05h | Avg: 48m 56s | Max:  1h 08m | Hits:  60%/2343  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 00m
      🟩 nvcc11.1           Pass: 100%/7   | Total:  5h 40m | Avg: 48m 36s | Max: 55m 57s | Hits:  60%/781   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 07m
      🟩 nvcc12.6           Pass: 100%/34  | Total:  1d 03h | Avg: 48m 17s | Max:  1h 08m | Hits:  60%/2343  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 00m
      🟩 nvcc               Pass: 100%/43  | Total:  1d 11h | Avg: 49m 07s | Max:  1h 08m | Hits:  60%/3124  
    🟩 cxx
      🟩 Clang9             Pass: 100%/4   | Total:  3h 22m | Avg: 50m 33s | Max: 53m 34s
      🟩 Clang10            Pass: 100%/1   | Total: 53m 49s | Avg: 53m 49s | Max: 53m 49s
      🟩 Clang11            Pass: 100%/1   | Total: 53m 17s | Avg: 53m 17s | Max: 53m 17s
      🟩 Clang12            Pass: 100%/1   | Total: 50m 43s | Avg: 50m 43s | Max: 50m 43s
      🟩 Clang13            Pass: 100%/1   | Total: 52m 58s | Avg: 52m 58s | Max: 52m 58s
      🟩 Clang14            Pass: 100%/1   | Total: 55m 22s | Avg: 55m 22s | Max: 55m 22s
      🟩 Clang15            Pass: 100%/1   | Total: 53m 53s | Avg: 53m 53s | Max: 53m 53s
      🟩 Clang16            Pass: 100%/1   | Total: 52m 28s | Avg: 52m 28s | Max: 52m 28s
      🟩 Clang17            Pass: 100%/1   | Total: 53m 07s | Avg: 53m 07s | Max: 53m 07s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 28m | Avg: 46m 51s | Max:  1h 00m
      🟩 GCC6               Pass: 100%/2   | Total:  1h 35m | Avg: 47m 40s | Max: 47m 45s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 39m | Avg: 49m 56s | Max: 50m 43s
      🟩 GCC8               Pass: 100%/1   | Total: 57m 20s | Avg: 57m 20s | Max: 57m 20s
      🟩 GCC9               Pass: 100%/3   | Total:  2h 25m | Avg: 48m 25s | Max: 51m 34s
      🟩 GCC10              Pass: 100%/1   | Total: 54m 49s | Avg: 54m 49s | Max: 54m 49s
      🟩 GCC11              Pass: 100%/1   | Total: 54m 48s | Avg: 54m 48s | Max: 54m 48s
      🟩 GCC12              Pass: 100%/1   | Total: 56m 53s | Avg: 56m 53s | Max: 56m 53s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 33m | Avg: 34m 14s | Max: 58m 21s
      🟩 Intel2023.2.0      Pass: 100%/1   | Total: 55m 35s | Avg: 55m 35s | Max: 55m 35s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 55m 57s | Avg: 55m 57s | Max: 55m 57s | Hits:  60%/781   
      🟩 MSVC14.29          Pass: 100%/1   | Total:  1h 05m | Avg:  1h 05m | Max:  1h 05m | Hits:  60%/781   
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 11m | Avg:  1h 05m | Max:  1h 08m | Hits:  60%/1562  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 07m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total: 15h 55m | Avg: 50m 18s | Max:  1h 00m
      🟩 GCC                Pass: 100%/19  | Total: 13h 58m | Avg: 44m 07s | Max: 58m 21s
      🟩 Intel              Pass: 100%/1   | Total: 55m 35s | Avg: 55m 35s | Max: 55m 35s
      🟩 MSVC               Pass: 100%/4   | Total:  4h 12m | Avg:  1h 03m | Max:  1h 08m | Hits:  60%/3124  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 10m | Avg:  1h 05m | Max:  1h 07m
    🟩 gpu
      🟩 v100               Pass: 100%/45  | Total:  1d 13h | Avg: 49m 36s | Max:  1h 08m | Hits:  60%/3124  
    🟩 jobs
      🟩 Build              Pass: 100%/39  | Total:  1d 11h | Avg: 54m 09s | Max:  1h 08m | Hits:  60%/3124  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 23m 33s | Avg: 23m 33s | Max: 23m 33s
      🟩 GraphCapture       Pass: 100%/1   | Total: 19m 34s | Avg: 19m 34s | Max: 19m 34s
      🟩 HostLaunch         Pass: 100%/2   | Total: 35m 41s | Avg: 17m 50s | Max: 19m 36s
      🟩 TestGPU            Pass: 100%/2   | Total: 41m 56s | Avg: 20m 58s | Max: 23m 14s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 25m 56s | Avg: 25m 56s | Max: 25m 56s
    🟩 std
      🟩 11                 Pass: 100%/5   | Total:  3h 59m | Avg: 47m 49s | Max: 53m 25s
      🟩 14                 Pass: 100%/4   | Total:  3h 27m | Avg: 51m 59s | Max: 55m 57s | Hits:  60%/781   
      🟩 17                 Pass: 100%/12  | Total: 11h 25m | Avg: 57m 05s | Max:  1h 07m | Hits:  60%/1562  
      🟩 20                 Pass: 100%/24  | Total: 18h 20m | Avg: 45m 51s | Max:  1h 08m | Hits:  60%/781   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 09s | Avg: 4m 34s | Max: 7m 09s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 09s | Avg:  4m 34s | Max:  7m 09s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 09s | Avg:  4m 34s | Max:  7m 09s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 09s | Avg:  4m 34s | Max:  7m 09s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 09s | Avg:  4m 34s | Max:  7m 09s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 09s | Avg:  4m 34s | Max:  7m 09s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 09s | Avg:  4m 34s | Max:  7m 09s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 09s | Avg:  4m 34s | Max:  7m 09s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 00s | Avg:  2m 00s | Max:  2m 00s
      🟩 Test               Pass: 100%/1   | Total:  7m 09s | Avg:  7m 09s | Max:  7m 09s
    
  • 🟩 python: Pass: 100%/1 | Total: 27m 52s | Avg: 27m 52s | Max: 27m 52s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 27m 52s | Avg: 27m 52s | Max: 27m 52s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 27m 52s | Avg: 27m 52s | Max: 27m 52s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 27m 52s | Avg: 27m 52s | Max: 27m 52s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 27m 52s | Avg: 27m 52s | Max: 27m 52s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 27m 52s | Avg: 27m 52s | Max: 27m 52s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 27m 52s | Avg: 27m 52s | Max: 27m 52s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 27m 52s | Avg: 27m 52s | Max: 27m 52s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 27m 52s | Avg: 27m 52s | Max: 27m 52s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
CUB
+/- Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 94)

# Runner
70 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16

@miscco miscco merged commit a9cefd9 into NVIDIA:main Dec 13, 2024
110 checks passed
@davidwendt davidwendt deleted the thrust-copy-nosync branch December 13, 2024 15:21
3 participants