Rework our `fabs` implementation to be potentially constexpr #5302
miscco merged 2 commits into NVIDIA:main
Conversation
Force-pushed from 5ab43f7 to 6c65cc3
🟨 CI finished in 1h 52m: Pass: 99%/205 | Total: 3d 16h | Avg: 25m 59s | Max: 1h 49m | Hits: 81%/331213
Modifications in project or dependencies?

| Modified | Project |
|---|---|
|  | CCCL Infrastructure |
| +/- | CCCL Packaging |
| +/- | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 205)
| # | Runner |
|---|---|
| 128 | linux-amd64-cpu16 |
| 23 | windows-amd64-cpu16 |
| 14 | linux-amd64-gpu-h100-latest-1 |
| 14 | linux-amd64-gpu-rtxa6000-latest-1 |
| 12 | linux-arm64-cpu16 |
| 11 | linux-amd64-gpu-rtx2080-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
🟨 CI finished in 2h 26m: Pass: 99%/205 | Total: 1d 15h | Avg: 11m 27s | Max: 1h 08m | Hits: 95%/337287
🟨 CI finished in 2h 39m: Pass: 99%/205 | Total: 1d 15h | Avg: 11m 27s | Max: 1h 08m | Hits: 95%/337287
🟨 CI finished in 4h 56m: Pass: 99%/205 | Total: 1d 15h | Avg: 11m 27s | Max: 1h 08m | Hits: 95%/337287
```cpp
}
#  endif // _CCCL_HAS_LONG_DOUBLE()
#endif // !_CCCL_HAS_CONSTEXPR_BIT_CAST() && _CCCL_COMPILER(GCC)
// We cannot use `abs.f16` or `abs.bf16` because it is not IEEE 754 compliant, see docs
```
Wait, do you mean that __habs is not IEEE 754 compliant? https://docs.nvidia.com/cuda/cuda-math-api/cuda_math_api/group__CUDA__MATH____HALF__ARITHMETIC.html#_CPPv46__habsK6__half
I think that was not the case for all versions; @davebayer would remember best.
The problem is that it is undefined whether a positive or a negative NaN is returned when a NaN is passed to the instruction. If I remember correctly, the C++ standard requires returning a positive NaN.
I did a bit of investigation. If we consider the latest C23 ISO/IEC 9899 standard, which also captures the C++ numerical behavior of fabs, then fabs applied to a NaN is not strictly specified (F.10.4.3, https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf).
IEEE 754 defines |x| at the bit level: the result is the operand with the sign bit set to 0.
A 2021 proposal (https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2651.pdf) says:
> fabs(x) returns a value with the same bit representation as x, except with the sign bit set to 0 (positive), for all values of x (even quiet and signaling NaNs)
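For illustration, a minimal constexpr-friendly sketch of that bit-level definition for `float`, assuming a constexpr `cuda::std::bit_cast` is available (the names are illustrative; this is not the PR's actual implementation):

```cpp
#include <cuda/std/bit>
#include <cuda/std/cstdint>

// Sketch only: |x| per IEEE 754 is the input with the sign bit cleared,
// which also turns a negative NaN into a positive NaN with the same payload.
__host__ __device__ constexpr float fabs_bitwise(float x)
{
  const auto bits = cuda::std::bit_cast<cuda::std::uint32_t>(x);
  return cuda::std::bit_cast<float>(bits & 0x7fffffffu);
}
```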
The difference is huge: 4x more instructions.
https://godbolt.org/z/n87hGWW9d
The current implementation is also a single instruction. We just do:

```
return float_value & (~sign_mask_of_float);
```

I've checked it in the past and that's how __builtin_fabs is implemented by the host compilers.
The example that I posted above shows the opposite.

> I've checked it in the past and that's how __builtin_fabs is implemented by the host compilers.

That is what IEEE 754 requires (not C/C++). This is one of the reasons why I don't like compiler built-ins in device code.
Anyway, I'm OK if we go with the most conservative way, but we should document the performance issue.
Sorry, I didn't see the comment with the example for some reason! In that case we can call __habs if the expression is not being constant evaluated; that should work fine :)
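A minimal sketch of that dispatch for `float`, assuming `cuda::std::is_constant_evaluated()` and the constexpr bitwise fallback from above (illustrative names, not the PR's code; for `__half` the run-time branch would call `__habs` in device code):

```cpp
#include <cuda/std/bit>
#include <cuda/std/cstdint>
#include <cuda/std/type_traits>
#include <cmath>

// Sketch: use the run-time builtin when not constant evaluated,
// otherwise fall back to the constexpr-friendly bitwise path.
__host__ __device__ constexpr float fabs_dispatch(float x)
{
  if (!cuda::std::is_constant_evaluated())
  {
    return ::fabsf(x); // single instruction at run time
  }
  const auto bits = cuda::std::bit_cast<cuda::std::uint32_t>(x);
  return cuda::std::bit_cast<float>(bits & 0x7fffffffu);
}
```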
🟩 CI finished in 2h 36m: Pass: 100%/210 | Total: 4d 14h | Avg: 31m 42s | Max: 1h 53m | Hits: 67%/321009
🏃 Runner counts (total jobs: 210)
| # | Runner |
|---|---|
| 128 | linux-amd64-cpu16 |
| 23 | windows-amd64-cpu16 |
| 17 | linux-amd64-gpu-l4-latest-1 |
| 12 | linux-arm64-cpu16 |
| 11 | linux-amd64-gpu-rtx2080-latest-1 |
| 10 | linux-amd64-gpu-h100-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
```cpp
}
#  endif // _CCCL_HAS_LONG_DOUBLE()
#endif // !_CCCL_HAS_CONSTEXPR_BIT_CAST() && _CCCL_COMPILER(GCC)
// We cannot use `abs.f16` or `abs.bf16` because it is not IEEE 754 compliant, see docs
```
In conclusion, I would suggest using `abs.f16` and not `abs.b16`.
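For reference, a device-only sketch of driving the `abs.f16` instruction directly through inline PTX (illustrative only, assuming sm_53 or newer; not necessarily what the PR ends up doing):

```cpp
#include <cuda_fp16.h>

// Sketch: a single abs.f16 instruction operating on a 16-bit register.
__device__ __half habs_via_ptx(__half x)
{
  unsigned short bits   = __half_as_ushort(x);
  unsigned short result = 0;
  asm("abs.f16 %0, %1;" : "=h"(result) : "h"(bits));
  return __ushort_as_half(result);
}
```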
🟨 CI finished in 1h 55m: Pass: 96%/210 | Total: 4d 19h | Avg: 32m 52s | Max: 1h 54m | Hits: 62%/294835
🟩 CI finished in 1h 16m: Pass: 100%/210 | Total: 1d 17h | Avg: 11m 44s | Max: 1h 12m | Hits: 94%/320925
fbusato left a comment:
Looks good. The only thing missing is to handle the run-time cases with:
- `::fabsf` for `float`
- `::fabs` for `double`
- `__habs` for `__half`
- `__nv_fp128_fabs` for `__float128`
see https://github.com/NVIDIA/cccl/pull/5302/files#r2216525334 for the discussion
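A rough sketch of that per-type run-time selection, shown here only for `float` and `double`; the `__half` and `__float128` branches would follow the same pattern with `__habs` and `__nv_fp128_fabs`, guarded by the appropriate feature checks (a sketch, not the PR's code):

```cpp
#include <cuda/std/type_traits>
#include <cmath>

// Sketch: pick the run-time builtin per type. The constexpr bitwise path
// would remain the fallback for constant evaluation (not shown here).
template <class T>
__host__ __device__ T runtime_fabs(T x)
{
  static_assert(cuda::std::is_same_v<T, float> || cuda::std::is_same_v<T, double>,
                "sketch only covers float and double");
  if constexpr (cuda::std::is_same_v<T, float>)
  {
    return ::fabsf(x);
  }
  else
  {
    return ::fabs(x);
  }
}
```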
Force-pushed from c31a348 to a31bfa3
🟨 CI finished in 1h 50m: Pass: 91%/210 | Total: 4d 05h | Avg: 29m 03s | Max: 1h 42m | Hits: 74%/306037
🟨 CI finished in 5h 54m: Pass: 99%/210 | Total: 4d 18h | Avg: 32m 36s | Max: 2h 54m | Hits: 79%/320925
🟩 CI finished in 16h 14m: Pass: 100%/210 | Total: 4d 17h | Avg: 32m 31s | Max: 2h 54m | Hits: 79%/320925
🟩 CI finished in 3h 45m: Pass: 100%/210 | Total: 5d 05h | Avg: 35m 57s | Max: 2h 09m | Hits: 43%/322263
…5302) * Rework our `fabs` implementation to be potentially constexpr * Use the compiler builtin when possible
We really want this to be available at compile time, and we cannot use the device builtins anyhow, because they are not fully conforming.
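Assuming the reworked implementation, compile-time use would look roughly like this on compilers where the constexpr path is available (a sketch, not taken from the PR's tests):

```cpp
#include <cuda/std/cmath>

// Only valid where fabs is actually constexpr, e.g. when a constexpr
// bit_cast is available to the implementation.
static_assert(cuda::std::fabs(-1.5f) == 1.5f, "");
static_assert(cuda::std::fabs(-0.0) == 0.0, "");
```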