Fix atomic reduce for arches < 600 with dtype double #5428
NaderAlAwar merged 7 commits into NVIDIA:main from
Conversation
…o fall back to the default reduce
The second option is to use emulation; see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions.
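The emulation mentioned above is a compare-and-swap loop on the 64-bit bit pattern of the double, as described in the CUDA programming guide for pre-sm_60 devices. A host-side C++ sketch of the same technique, using `std::atomic<uint64_t>` where device code would use `atomicCAS` on `unsigned long long` (illustrative only, not the PR's actual code):

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Host-side sketch of the CAS-loop technique for emulating double
// atomicAdd. On a pre-sm_60 GPU, atomicCAS on unsigned long long plays
// the role that compare_exchange_weak plays here.
double emulated_atomic_add(std::atomic<std::uint64_t>& target, double val) {
    std::uint64_t expected = target.load();
    std::uint64_t desired;
    do {
        double cur;
        std::memcpy(&cur, &expected, sizeof cur);   // reinterpret bits as double
        double next = cur + val;
        std::memcpy(&desired, &next, sizeof next);  // back to a bit pattern
        // Compare-and-swap on the *bit pattern*, not the double value;
        // on failure, `expected` is refreshed with the current bits.
    } while (!target.compare_exchange_weak(expected, desired));
    double old;
    std::memcpy(&old, &expected, sizeof old);
    return old;  // like CUDA's atomicAdd, return the previous value
}
```

Note that the comparison is done on integer bit patterns rather than double values; the reason matters for the NaN caveat raised later in this thread.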
🟨 CI finished in 1h 51m: Pass: 93%/162 | Total: 3d 16h | Avg: 32m 40s | Max: 1h 36m | Hits: 77%/145529
Projects with changes in this run (+/-): CUB
Modifications in project or dependencies (+/-): CCCL Packaging, CUB, Thrust, CUDA Experimental, stdpar, python, CCCL C Parallel Library, Catch2Helper
🏃 Runner counts (total jobs: 162)
| # | Runner |
|---|---|
| 93 | linux-amd64-cpu16 |
| 17 | linux-amd64-gpu-l4-latest-1 |
| 17 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 9 | linux-amd64-gpu-h100-latest-1 |
| 7 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
…ing for architecture at compile time to decide whether to fall back will not work, because the arch macros are 0 in host code
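The point above is that `__CUDA_ARCH__` is only defined during nvcc's device-code passes, so host-side launch logic cannot use it to select a fallback for old GPUs. A minimal sketch (compiles as plain C++, where the macro is absent just as in a host pass):

```cpp
// Why a compile-time arch check cannot drive the fallback from the host:
// __CUDA_ARCH__ is only defined while compiling *device* code, so any
// host-side dispatch sees no architecture information at all.
constexpr int compile_time_arch() {
#ifdef __CUDA_ARCH__
    return __CUDA_ARCH__;  // device pass: the real target, e.g. 520, 600, ...
#else
    return 0;              // host pass: macro absent, no information
#endif
}
```

In host code the function above always reports 0, which is why the actual target must instead be discovered at run time (e.g. by querying device properties) or the decision made inside device code.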
@fbusato I ended up going with this approach. Falling back to another implementation in
@NaderAlAwar be careful about NaNs if you want to go with this path 😄
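One well-known NaN pitfall in this area: a CAS loop that compares double *values* can misbehave once the accumulator holds NaN, because NaN compares unequal to itself; this is why the CUDA programming guide's emulation compares integer bit patterns instead. A small demonstration of the difference:

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// NaN compares unequal to itself, so a value-based compare-and-swap could
// never observe "expected == current" once the accumulator holds NaN.
bool value_equal(double a, double b) { return a == b; }

// Comparing the raw 64-bit patterns sidesteps the problem: identical bits
// compare equal regardless of whether they encode a NaN.
bool bits_equal(double a, double b) {
    std::uint64_t ba, bb;
    std::memcpy(&ba, &a, sizeof ba);
    std::memcpy(&bb, &b, sizeof bb);
    return ba == bb;
}
```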
🟩 CI finished in 1h 52m: Pass: 100%/162 | Total: 1d 22h | Avg: 17m 16s | Max: 1h 50m | Hits: 91%/152477
@fbusato after considering this and discussing with Georgii, we decided to just disable atomic reduce with doubles for pre-sm60. Supporting this might prove to be more trouble than it's worth.
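The decision above amounts to a simple dispatch: skip the atomic-reduce path for double on devices older than sm_60 (native double `atomicAdd` requires compute capability 6.0) and use the default reduce instead. A hypothetical sketch of that decision; the names are illustrative, not CUB's API:

```cpp
// Hypothetical dispatch mirroring the decision described above.
// Native double atomicAdd arrived with compute capability 6.0 (sm_60).
enum class ReducePath { Atomic, Default };

ReducePath choose_reduce_path(int sm_arch, bool dtype_is_double) {
    if (dtype_is_double && sm_arch < 600) {
        return ReducePath::Default;  // no hardware double atomicAdd: fall back
    }
    return ReducePath::Atomic;
}
```

In practice `sm_arch` would come from a runtime device query, since (per the earlier discussion) the arch macros carry no information in host code.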
🟨 CI finished in 2h 18m: Pass: 90%/162 | Total: 1d 18h | Avg: 15m 37s | Max: 2h 15m | Hits: 94%/150282
🟩 CI finished in 1h 30m: Pass: 100%/162 | Total: 1d 12h | Avg: 13m 39s | Max: 1h 16m | Hits: 97%/152477
🟩 CI finished in 4h 06m: Pass: 100%/162 | Total: 3d 18h | Avg: 33m 23s | Max: 1h 42m | Hits: 77%/152477
Description
Closes #5427.
Checklist