Conversation
This will be slower on 100-class hardware, but this isn't where

/build
Greptile Summary

This PR adds a device-only fast path for

Confidence Score: 5/5

Safe to merge; the only finding is a minor comment inaccuracy in the test file. The core algorithm is mathematically correct — bit-field extraction, scale computation, sign application, and fast2sum all check out. No P0/P1 issues found. The single P2 finding is a wrong threshold value in a test comment that doesn't affect behavior.

Important Files Changed: test/00_misc/FloatFloatTests.cu (minor comment fix at line 2337)
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["fltflt(double x) constructor"] --> B{"__CUDA_ARCH__ defined?"}
    B -- No --> HOST["HOST path\nhi = float(x)\nlo = float(x - double(hi))"]
    B -- Yes --> CE{"__builtin_is_constant_evaluated?"}
    CE -- Yes --> CONSTEXPR["CONSTEXPR path (compile-time)\nhi = float(x)\nlo = float(x - double(hi))"]
    CE -- No --> BITS["Extract IEEE-754 bits\nsign, e_x, mant"]
    BITS --> CHK{"e_x == 0 OR\nhi_exp <= 0 OR\nhi_exp >= 255?"}
    CHK -- Yes --> FALLBACK["FALLBACK\nhi = float(x)\nlo = 0.0f\n(NaN / Inf / subnormal / out-of-range)"]
    CHK -- No --> FAST["FAST PATH (no FP64)\nhi = truncated top-23 mantissa bits"]
    FAST --> RES["r = remaining 29 mantissa bits\nr_float = __uint2float_rn(r)\nlo_exp = hi_exp >= 53 ? hi_exp-52 : 0\nlo_raw = r_float * 2^(lo_exp-127)\nApply sign bit to lo_raw"]
    RES --> F2S["fast2sum\ns = hi + lo_raw\nlo = lo_raw - (s - hi)\nhi = s\n(guarantees fl(hi+lo) == hi)"]
```
Reviews (1): Last reviewed commit: "apply fast2sum at end, tweak handling of..."
```cpp
// Large double near FLT_MAX: hi_exp = 254, triggers overflow path (hi_exp >= 226).
// x = FLT_MAX + 2^76 (adds a nonzero residual so lo != 0).
```
Incorrect overflow-path threshold in comment
The comment says "triggers overflow path (hi_exp >= 226)" but the actual fallback threshold in the constructor is hi_exp >= 255. With hi_exp = 254, this case takes the fast path, not the fallback — so the comment contradicts both the threshold value and the direction. The intent of the test (verifying the fast path handles the maximum non-fallback exponent without producing Inf for lo) is sound, but the comment is misleading.
```diff
- // Large double near FLT_MAX: hi_exp = 254, triggers overflow path (hi_exp >= 226).
- // x = FLT_MAX + 2^76 (adds a nonzero residual so lo != 0).
+ // Large double near FLT_MAX: hi_exp = 254, maximum value that takes the fast path
+ // (fast path requires hi_exp < 255). x = FLT_MAX + 2^76 (adds a nonzero residual so lo != 0).
```
/build
This improves the overflow/underflow handling and adds a fast2sum step to keep the invariant
```cpp
src_val = src_val + static_cast<T>(0.0001);
// For double, increment the bit pattern to get the next representable value
// so the loop anti-aliasing doesn't introduce a double-precision add.
if constexpr (std::is_same_v<T, double>) {
```
We should try to use cuda::std here, especially since it's a device function, even though it doesn't matter for this particular function
that's fine (though we do use this elsewhere in the repo)
```cpp
    }
};

struct DoubleToFltFlt {
```
Should we have these defined in cast.h instead?
```cpp
unsigned long long mant = xbits & 0x000FFFFFFFFFFFFFULL;
// hi_exp: float biased exponent = (e_x - 1023) + 127 = e_x - 896.
int hi_exp = (int)e_x - 896;
if (e_x == 0 || hi_exp <= 0 || hi_exp >= 255) {
```
We may need a bit more guard for hi_exp (e.g., >= 254 for this path). We could end up with hi == FLT_MAX and then have that overflow to Inf during the fast2sum. That would then change our accuracy guarantees in the case that hi_exp == 254. Or it may be enough to just add (hi_exp == 254 && hi_mantissa == 0x7FFFFF).
I actually think that would be ok: the only way it would overflow is if lo_raw >= ulp(FLT_MAX), in which case hi should round up to Inf anyway. The lo will be NaN, but that's what it would be now.
I think we can have lo_raw >= ulp(FLT_MAX) (thus the need for fast2sum to renormalize). r is in [0, 2^29-1], but r_float can round up to 2^29. The scale factor in this case is the float with biased exponent 254-52=202, so the unbiased exponent is 202-127=75. Then lo_raw is 2^29 * 2^75 = 2^104 = ulp(FLT_MAX). So, in this case it's the hi + lo_raw during renormalization that will overflow s to Inf, but we generally need the renormalization because the rounding for r_float can round up to ulp(hi).
actually i realized that isn't quite right. let me think about this a bit more
2x faster is still great! One potential concern could be increased register pressure, but it seems unlikely for that to result in the old behavior being faster. I can try it in the recently merged sarbp example to see if there is any change in register usage or any spilling for that kernel for one sanity check.
This speeds up `double` to `fltflt` conversions on fp64-decoupled hardware. I also tweaked the cast benchmark so that it isn't affected by fp64 arithmetic perf.

L40S results
Before (with updated benchmark)
After