
add emulated double->fltflt cast #1159

Open

simonbyrne wants to merge 2 commits into main from sbyrne/fltflt-cast-emulated

Conversation

@simonbyrne
Collaborator

This speeds up double to fltflt conversions on fp64-decoupled hardware. I also tweaked the cast benchmark so that it isn't affected by fp64 arithmetic perf.

L40S results

Before (with updated benchmark)

Benchmark       float        double       fltflt       fltflt vs dbl
------------------------------------------------------------------
cast2fltflt     1.00x        32.52x       4.39x        7.40x

--------------------------------------------------------------------------------
Raw timings (auto-scaled units):

Benchmark       float           double          fltflt          fltflt vs dbl
---------------------------------------------------------------------------
cast2fltflt     618.045 us      20.100 ms       2.716 ms        7.40x

After

Benchmark       float        double       fltflt       fltflt vs dbl
------------------------------------------------------------------
cast2fltflt     1.00x        10.65x       4.40x        2.42x

--------------------------------------------------------------------------------
Raw timings (auto-scaled units):

Benchmark       float           double          fltflt          fltflt vs dbl
---------------------------------------------------------------------------
cast2fltflt     618.190 us      6.585 ms        2.721 ms        2.42x
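For reference, the conversion being replaced on device is the classic two-float split, which relies on an FP64 subtract. A minimal host-side sketch (hypothetical names `fltflt_ref`/`split_ref`, not the library's actual types):

```cpp
#include <cassert>

// Sketch of the conversion this PR replaces on device: split a double into
// two floats whose sum approximates it. The subtraction is an FP64 operation,
// which is slow on fp64-decoupled parts such as the L40S.
struct fltflt_ref { float hi, lo; };

static fltflt_ref split_ref(double x) {
  float hi = (float)x;                 // round x to nearest float
  float lo = (float)(x - (double)hi);  // residual, rounded to float (FP64 subtract)
  return {hi, lo};
}
```

The sum hi + lo recovers roughly 48 bits of the original double, versus 24 for a plain float cast.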

@copy-pr-bot

copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@simonbyrne simonbyrne requested a review from tbensonatl April 23, 2026 04:28
@simonbyrne simonbyrne marked this pull request as ready for review April 23, 2026 20:03
@simonbyrne
Collaborator Author

This will be slower on 100-class hardware, but that isn't where fltflt is intended to be used.

@simonbyrne
Collaborator Author

/build

@greptile-apps
Contributor

greptile-apps Bot commented Apr 23, 2026

Greptile Summary

This PR adds a device-only fast path for double → fltflt conversion that replaces the original FP64 subtract-and-cast approach with pure IEEE-754 bit manipulation (__double_as_longlong, __int_as_float, __uint2float_rn), eliminating FP64 instructions on fp64-decoupled hardware (e.g. L40S). The benchmark's loop-increment is also changed to a ULP-step for double to avoid polluting the cast timing with a FP64 add. The algorithm correctly handles the fallback for NaN/Inf/subnormals/out-of-float-range inputs, uses fast2sum to guarantee fl(hi+lo) == hi, and is guarded by __builtin_is_constant_evaluated() so constexpr usage still compiles cleanly.

Confidence Score: 5/5

Safe to merge; the only finding is a minor comment inaccuracy in the test file.

The core algorithm is mathematically correct — bit-field extraction, scale computation, sign application, and fast2sum all check out. No P0/P1 issues found. The single P2 finding is a wrong threshold value in a test comment that doesn't affect behavior.

test/00_misc/FloatFloatTests.cu (minor comment fix at line 2337)

Important Files Changed

Filename Overview
include/matx/kernels/fltflt.h Adds a device-only fast path for double→fltflt conversion using IEEE-754 bit manipulation to avoid FP64 instructions; fallback for NaN/Inf/subnormal/out-of-float-range doubles; fast2sum ensures fl(hi+lo)==hi. Algorithm appears mathematically correct.
test/00_misc/FloatFloatTests.cu Adds ConvertFromDouble test covering pi, zero, exact round-trip, small doubles (hi_exp < 53 boundary), negative values, and the hi_exp=254 edge case; one comment incorrectly states the fallback threshold as ">= 226" (should be ">= 255").
bench/00_misc/fltflt_arithmetic.cu Replaces FP64 add in benchmark loop increment with a ULP-step via bit manipulation for the double specialization, preventing FP64 latency from contaminating cast timing.
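The benchmark change described above can be sketched as follows (hypothetical helper name `next_ulp`; the actual benchmark code is in bench/00_misc/fltflt_arithmetic.cu and may differ):

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Advance a positive, finite double to the next representable value by
// incrementing its bit pattern -- no FP64 arithmetic involved. IEEE-754
// doubles of the same sign order the same way as their bit patterns.
static double next_ulp(double v) {
  std::uint64_t bits;
  std::memcpy(&bits, &v, sizeof bits);
  ++bits;  // step to the next representable positive value
  std::memcpy(&v, &bits, sizeof v);
  return v;
}
```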

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["fltflt(double x) constructor"] --> B{__CUDA_ARCH__ defined?}
    B -- No --> HOST["HOST path\nhi = float(x)\nlo = float(x - double(hi))"]
    B -- Yes --> CE{__builtin_is_constant_evaluated?}
    CE -- Yes --> CONSTEXPR["CONSTEXPR path (compile-time)\nhi = float(x)\nlo = float(x - double(hi))"]
    CE -- No --> BITS["Extract IEEE-754 bits\nsign, e_x, mant"]
    BITS --> CHK{"e_x == 0 OR\nhi_exp <= 0 OR\nhi_exp >= 255?"}
    CHK -- Yes --> FALLBACK["FALLBACK\nhi = float(x)\nlo = 0.0f\n(NaN / Inf / subnormal / out-of-range)"]
    CHK -- No --> FAST["FAST PATH (no FP64)\nhi = truncated top-23 mantissa bits"]
    FAST --> RES["r = remaining 29 mantissa bits\nr_float = __uint2float_rn(r)\nlo_exp = hi_exp >= 53 ? hi_exp-52 : 0\nlo_raw = r_float * 2^(lo_exp-127)\nApply sign bit to lo_raw"]
    RES --> F2S["fast2sum\ns = hi + lo_raw\nlo = lo_raw - (s - hi)\nhi = s\n(guarantees fl(hi+lo) == hi)"]
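The flowchart's fast path can be sketched in portable C++, with host-side stand-ins for the CUDA intrinsics (memcpy for __double_as_longlong/__int_as_float, a plain cast for __uint2float_rn). This is illustrative only, not the PR's exact code: it routes hi_exp < 53 to a plain fallback instead of clamping lo_exp, and its fallback uses the host-style subtract rather than lo = 0.0f.

```cpp
#include <cstdint>
#include <cstring>

struct fltflt_s { float hi, lo; };

static fltflt_s from_double(double x) {
  std::uint64_t xbits;
  std::memcpy(&xbits, &x, sizeof xbits);
  std::uint32_t sign = (std::uint32_t)(xbits >> 63);
  int e_x = (int)((xbits >> 52) & 0x7FF);
  std::uint64_t mant = xbits & 0x000FFFFFFFFFFFFFULL;
  int hi_exp = e_x - 896;  // rebias: (e_x - 1023) + 127
  if (e_x == 0 || hi_exp < 53 || hi_exp >= 255) {
    // Fallback for NaN/Inf/subnormal/out-of-range (this sketch also routes
    // small inputs here; the real code clamps lo_exp to 0 instead).
    float hi = (float)x;
    return {hi, (float)(x - (double)hi)};
  }
  // hi: sign | rebiased exponent | top 23 mantissa bits, truncated
  std::uint32_t hb = (sign << 31) | ((std::uint32_t)hi_exp << 23)
                   | (std::uint32_t)(mant >> 29);
  float hi;
  std::memcpy(&hi, &hb, sizeof hi);
  // lo: remaining 29 mantissa bits, rounded to float, scaled by 2^(hi_exp-179)
  std::uint32_t r = (std::uint32_t)(mant & 0x1FFFFFFFu);
  float r_float = (float)r;  // stands in for __uint2float_rn
  std::uint32_t sb = (std::uint32_t)(hi_exp - 52) << 23;  // value 2^(hi_exp-179)
  float scale;
  std::memcpy(&scale, &sb, sizeof scale);
  float lo_raw = sign ? -(r_float * scale) : (r_float * scale);
  // fast2sum: renormalize so that fl(hi + lo) == hi
  float s = hi + lo_raw;
  float lo = lo_raw - (s - hi);
  return {s, lo};
}
```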

Reviews (1): Last reviewed commit: "apply fast2sum at end, tweak handling of..."

Comment on lines +2337 to +2338
// Large double near FLT_MAX: hi_exp = 254, triggers overflow path (hi_exp >= 226).
// x = FLT_MAX + 2^76 (adds a nonzero residual so lo != 0).
Contributor


P2 Incorrect overflow-path threshold in comment

The comment says "triggers overflow path (hi_exp >= 226)" but the actual fallback threshold in the constructor is hi_exp >= 255. With hi_exp = 254, this case takes the fast path, not the fallback — so the comment contradicts both the threshold value and the direction. The intent of the test (verifying the fast path handles the maximum non-fallback exponent without producing Inf for lo) is sound, but the comment is misleading.

Suggested change
// Large double near FLT_MAX: hi_exp = 254, triggers overflow path (hi_exp >= 226).
// x = FLT_MAX + 2^76 (adds a nonzero residual so lo != 0).
// Large double near FLT_MAX: hi_exp = 254, maximum value that takes the fast path
// (fast path requires hi_exp < 255). x = FLT_MAX + 2^76 (adds a nonzero residual so lo != 0).

@cliffburdick
Collaborator

/build

@simonbyrne
Collaborator Author

This improves the overflow/underflow handling and adds a fast2sum step to keep the invariant fl(hi + lo) == hi. Unfortunately this adds some cost, but it is still 2x faster than the old conversion.


Benchmark       float        double       fltflt       fltflt vs dbl
------------------------------------------------------------------
cast2fltflt     1.00x        16.85x       4.36x        3.87x

--------------------------------------------------------------------------------
Raw timings (auto-scaled units):

Benchmark       float           double          fltflt          fltflt vs dbl
---------------------------------------------------------------------------
cast2fltflt     622.532 us      10.489 ms       2.713 ms        3.87x
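The fast2sum step mentioned above is the standard Dekker renormalization; a minimal sketch (assuming the usual precondition |a| >= |b|):

```cpp
// fast2sum (Dekker): for |a| >= |b|, s = fl(a + b) and t is the exact
// rounding error, so s + t == a + b exactly and fl(s + t) == s.
static void fast2sum(float a, float b, float &s, float &t) {
  s = a + b;
  float z = s - a;  // exact: the part of b that was absorbed into s
  t = b - z;        // exact: what was rounded away
}
```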

src_val = src_val + static_cast<T>(0.0001);
// For double, increment the bit pattern to get the next representable value
// so the loop anti-aliasing doesn't introduce a double-precision add.
if constexpr (std::is_same_v<T, double>) {
Collaborator


We should try to use cuda::std here especially since it's a device function, even though it doesn't matter for this particular function

Collaborator Author


that's fine (though we do use this elsewhere in the repo)

}
};

struct DoubleToFltFlt {
Collaborator


Should we have these defined in cast.h instead?

unsigned long long mant = xbits & 0x000FFFFFFFFFFFFFULL;
// hi_exp: float biased exponent = (e_x - 1023) + 127 = e_x - 896.
int hi_exp = (int)e_x - 896;
if (e_x == 0 || hi_exp <= 0 || hi_exp >= 255) {
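The rebias in the quoted snippet can be checked with a small standalone function (hypothetical name `hi_exp_of`):

```cpp
#include <cstdint>
#include <cstring>

// Worked check of the rebias above: a double's biased exponent e_x maps to
// the float biased exponent via e_x - 896, since (e_x - 1023) + 127 = e_x - 896.
// For x = 1.0, e_x = 1023 and hi_exp = 127, the biased exponent of 1.0f.
static int hi_exp_of(double x) {
  std::uint64_t xb;
  std::memcpy(&xb, &x, sizeof xb);
  int e_x = (int)((xb >> 52) & 0x7FF);
  return e_x - 896;
}
```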
Collaborator


We may need a bit more guard for hi_exp (e.g., >= 254 for this path). We could end up with hi == FLT_MAX and then have that overflow to Inf during the fast2sum. That would then change our accuracy guarantees in the case that hi_exp == 254. Or it may be enough to just add (hi_exp == 254 && hi_mantissa == 0x7FFFFF).

Collaborator Author


I actually think that would be ok: the only way it would overflow would be if lo_raw >= ulp(FLT_MAX), in which case hi should round up to Inf anyway. the lo will be NaN, but that's what it would be now.

Collaborator


I think we can have lo_raw >= ulp(FLT_MAX) (thus the need for fast2sum to renormalize). r is in [0, 2^29-1], but r_float can round up to 2^29. The scale factor in this case is the float with biased exponent 254-52=202, so the unbiased exponent is 202-127=75. Then lo_raw is 2^29 * 2^75 = 2^104 = ulp(FLT_MAX). So, in this case it's the hi + lo_raw during renormalization that will overflow s to Inf, but we generally need the renormalization because the rounding for r_float can round up to ulp(hi).
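The arithmetic in this comment is verifiable numerically; a sketch (hypothetical function name, using FLT_MAX and nextafter in place of device code):

```cpp
#include <cfloat>
#include <cmath>

// Numeric check of the scenario above: r = 2^29 - 1 needs 29 bits, so
// rounding it to a 24-bit float mantissa rounds up to 2^29; with the
// hi_exp = 254 scale of 2^75 this gives lo_raw = 2^104 = ulp(FLT_MAX).
static bool lo_raw_can_reach_ulp_flt_max() {
  float r_float = (float)((1u << 29) - 1);  // rounds up to 2^29
  float scale = std::ldexp(1.0f, 75);       // biased exponent 254 - 52 = 202
  float lo_raw = r_float * scale;           // 2^29 * 2^75 = 2^104
  // ulp(FLT_MAX): gap between FLT_MAX and the previous float, exactly 2^104
  float ulp_max = FLT_MAX - std::nextafter(FLT_MAX, 0.0f);
  return lo_raw == ulp_max && r_float == 536870912.0f;
}
```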

Collaborator Author


actually i realized that isn't quite right. let me think about this a bit more

@tbensonatl
Collaborator

This improves the overflow/underflow handling and adds a fast2sum step to keep the invariant fl(hi + lo) == hi. Unfortunately this adds some cost, but it is still 2x faster than the old conversion.

2x faster is still great! One potential concern could be increased register pressure, but it seems unlikely for that to result in the old behavior being faster. I can try it in the recently merged sarbp example to see if there is any change in register usage or any spilling for that kernel for one sanity check.

@coveralls

Coverage Status

Coverage is 91.801% for sbyrne/fltflt-cast-emulated into main. No base build found for main.


4 participants