
add emulated double->fltflt cast #1159

Open

simonbyrne wants to merge 2 commits into main from sbyrne/fltflt-cast-emulated

Conversation

@simonbyrne
Collaborator

This speeds up double to fltflt conversions on fp64-decoupled hardware. I also tweaked the cast benchmark so that it isn't affected by fp64 arithmetic perf.

L40S results

Before (with updated benchmark)

Benchmark       float        double       fltflt       fltflt vs dbl
------------------------------------------------------------------
cast2fltflt     1.00x        32.52x       4.39x        7.40x

--------------------------------------------------------------------------------
Raw timings (auto-scaled units):

Benchmark       float           double          fltflt          fltflt vs dbl
---------------------------------------------------------------------------
cast2fltflt     618.045 us      20.100 ms       2.716 ms        7.40x

After

Benchmark       float        double       fltflt       fltflt vs dbl
------------------------------------------------------------------
cast2fltflt     1.00x        10.65x       4.40x        2.42x

--------------------------------------------------------------------------------
Raw timings (auto-scaled units):

Benchmark       float           double          fltflt          fltflt vs dbl
---------------------------------------------------------------------------
cast2fltflt     618.190 us      6.585 ms        2.721 ms        2.42x
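For reference, the conversion being replaced on device is the classic two-float split, which relies on an FP64 subtract. A minimal host-side sketch (hypothetical names `fltflt_ref`/`split_ref`, not the library's actual types):

```cpp
#include <cassert>

// Sketch of the conversion this PR replaces on device: split a double into
// two floats whose sum approximates it. The subtraction is an FP64 operation,
// which is slow on fp64-decoupled parts such as the L40S.
struct fltflt_ref { float hi, lo; };

static fltflt_ref split_ref(double x) {
  float hi = (float)x;                 // round x to nearest float
  float lo = (float)(x - (double)hi);  // residual, rounded to float (FP64 subtract)
  return {hi, lo};
}
```

The sum hi + lo recovers roughly 48 bits of the original double, versus 24 for a plain float cast.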

@copy-pr-bot

copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@simonbyrne simonbyrne requested a review from tbensonatl April 23, 2026 04:28
@simonbyrne simonbyrne marked this pull request as ready for review April 23, 2026 20:03
@simonbyrne
Collaborator Author

This will be slower on 100-class hardware, but that isn't where fltflt is intended to be used.

@simonbyrne
Collaborator Author

/build

@greptile-apps
Contributor

greptile-apps Bot commented Apr 23, 2026

Greptile Summary

This PR adds a device-only fast path for double → fltflt conversion that replaces the original FP64 subtract-and-cast approach with pure IEEE-754 bit manipulation (__double_as_longlong, __int_as_float, __uint2float_rn), eliminating FP64 instructions on fp64-decoupled hardware (e.g. L40S). The benchmark's loop-increment is also changed to a ULP-step for double to avoid polluting the cast timing with a FP64 add. The algorithm correctly handles the fallback for NaN/Inf/subnormals/out-of-float-range inputs, uses fast2sum to guarantee fl(hi+lo) == hi, and is guarded by __builtin_is_constant_evaluated() so constexpr usage still compiles cleanly.

Confidence Score: 5/5

Safe to merge; the only finding is a minor comment inaccuracy in the test file.

The core algorithm is mathematically correct — bit-field extraction, scale computation, sign application, and fast2sum all check out. No P0/P1 issues found. The single P2 finding is a wrong threshold value in a test comment that doesn't affect behavior.

test/00_misc/FloatFloatTests.cu (minor comment fix at line 2337)

Important Files Changed

Filename Overview
include/matx/kernels/fltflt.h Adds a device-only fast path for double→fltflt conversion using IEEE-754 bit manipulation to avoid FP64 instructions; fallback for NaN/Inf/subnormal/out-of-float-range doubles; fast2sum ensures fl(hi+lo)==hi. Algorithm appears mathematically correct.
test/00_misc/FloatFloatTests.cu Adds ConvertFromDouble test covering pi, zero, exact round-trip, small doubles (hi_exp < 53 boundary), negative values, and the hi_exp=254 edge case; one comment incorrectly states the fallback threshold as ">= 226" (should be ">= 255").
bench/00_misc/fltflt_arithmetic.cu Replaces FP64 add in benchmark loop increment with a ULP-step via bit manipulation for the double specialization, preventing FP64 latency from contaminating cast timing.
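The benchmark change described above can be sketched as follows (hypothetical helper name `next_ulp`; the actual benchmark code is in bench/00_misc/fltflt_arithmetic.cu and may differ):

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Advance a positive, finite double to the next representable value by
// incrementing its bit pattern -- no FP64 arithmetic involved. IEEE-754
// doubles of the same sign order the same way as their bit patterns.
static double next_ulp(double v) {
  std::uint64_t bits;
  std::memcpy(&bits, &v, sizeof bits);
  ++bits;  // step to the next representable positive value
  std::memcpy(&v, &bits, sizeof v);
  return v;
}
```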

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["fltflt(double x) constructor"] --> B{__CUDA_ARCH__ defined?}
    B -- No --> HOST["HOST path\nhi = float(x)\nlo = float(x - double(hi))"]
    B -- Yes --> CE{__builtin_is_constant_evaluated?}
    CE -- Yes --> CONSTEXPR["CONSTEXPR path (compile-time)\nhi = float(x)\nlo = float(x - double(hi))"]
    CE -- No --> BITS["Extract IEEE-754 bits\nsign, e_x, mant"]
    BITS --> CHK{"e_x == 0 OR\nhi_exp <= 0 OR\nhi_exp >= 255?"}
    CHK -- Yes --> FALLBACK["FALLBACK\nhi = float(x)\nlo = 0.0f\n(NaN / Inf / subnormal / out-of-range)"]
    CHK -- No --> FAST["FAST PATH (no FP64)\nhi = truncated top-23 mantissa bits"]
    FAST --> RES["r = remaining 29 mantissa bits\nr_float = __uint2float_rn(r)\nlo_exp = hi_exp >= 53 ? hi_exp-52 : 0\nlo_raw = r_float * 2^(lo_exp-127)\nApply sign bit to lo_raw"]
    RES --> F2S["fast2sum\ns = hi + lo_raw\nlo = lo_raw - (s - hi)\nhi = s\n(guarantees fl(hi+lo) == hi)"]
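The flowchart's fast path can be sketched in portable C++, with host-side stand-ins for the CUDA intrinsics (memcpy for __double_as_longlong/__int_as_float, a plain cast for __uint2float_rn). This is illustrative only, not the PR's exact code: it routes hi_exp < 53 to a plain fallback instead of clamping lo_exp, and its fallback uses the host-style subtract rather than lo = 0.0f.

```cpp
#include <cstdint>
#include <cstring>

struct fltflt_s { float hi, lo; };

static fltflt_s from_double(double x) {
  std::uint64_t xbits;
  std::memcpy(&xbits, &x, sizeof xbits);
  std::uint32_t sign = (std::uint32_t)(xbits >> 63);
  int e_x = (int)((xbits >> 52) & 0x7FF);
  std::uint64_t mant = xbits & 0x000FFFFFFFFFFFFFULL;
  int hi_exp = e_x - 896;  // rebias: (e_x - 1023) + 127
  if (e_x == 0 || hi_exp < 53 || hi_exp >= 255) {
    // Fallback for NaN/Inf/subnormal/out-of-range (this sketch also routes
    // small inputs here; the real code clamps lo_exp to 0 instead).
    float hi = (float)x;
    return {hi, (float)(x - (double)hi)};
  }
  // hi: sign | rebiased exponent | top 23 mantissa bits, truncated
  std::uint32_t hb = (sign << 31) | ((std::uint32_t)hi_exp << 23)
                   | (std::uint32_t)(mant >> 29);
  float hi;
  std::memcpy(&hi, &hb, sizeof hi);
  // lo: remaining 29 mantissa bits, rounded to float, scaled by 2^(hi_exp-179)
  std::uint32_t r = (std::uint32_t)(mant & 0x1FFFFFFFu);
  float r_float = (float)r;  // stands in for __uint2float_rn
  std::uint32_t sb = (std::uint32_t)(hi_exp - 52) << 23;  // value 2^(hi_exp-179)
  float scale;
  std::memcpy(&scale, &sb, sizeof scale);
  float lo_raw = sign ? -(r_float * scale) : (r_float * scale);
  // fast2sum: renormalize so that fl(hi + lo) == hi
  float s = hi + lo_raw;
  float lo = lo_raw - (s - hi);
  return {s, lo};
}
```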

Reviews (1): Last reviewed commit: "apply fast2sum at end, tweak handling of..."

Comment on lines +2337 to +2338
// Large double near FLT_MAX: hi_exp = 254, triggers overflow path (hi_exp >= 226).
// x = FLT_MAX + 2^76 (adds a nonzero residual so lo != 0).
Contributor


P2 Incorrect overflow-path threshold in comment

The comment says "triggers overflow path (hi_exp >= 226)" but the actual fallback threshold in the constructor is hi_exp >= 255. With hi_exp = 254, this case takes the fast path, not the fallback — so the comment contradicts both the threshold value and the direction. The intent of the test (verifying the fast path handles the maximum non-fallback exponent without producing Inf for lo) is sound, but the comment is misleading.

Suggested change
// Large double near FLT_MAX: hi_exp = 254, triggers overflow path (hi_exp >= 226).
// x = FLT_MAX + 2^76 (adds a nonzero residual so lo != 0).
// Large double near FLT_MAX: hi_exp = 254, maximum value that takes the fast path
// (fast path requires hi_exp < 255). x = FLT_MAX + 2^76 (adds a nonzero residual so lo != 0).

@cliffburdick
Collaborator

/build

@simonbyrne
Collaborator Author

This improves the overflow/underflow handling and adds a fast2sum step to keep the invariant fl(hi + lo) == hi. Unfortunately this adds some cost, but it is still 2x faster than the old conversion.


Benchmark       float        double       fltflt       fltflt vs dbl
------------------------------------------------------------------
cast2fltflt     1.00x        16.85x       4.36x        3.87x

--------------------------------------------------------------------------------
Raw timings (auto-scaled units):

Benchmark       float           double          fltflt          fltflt vs dbl
---------------------------------------------------------------------------
cast2fltflt     622.532 us      10.489 ms       2.713 ms        3.87x
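The fast2sum step mentioned above is the standard Dekker renormalization; a minimal sketch (assuming the usual precondition |a| >= |b|):

```cpp
// fast2sum (Dekker): for |a| >= |b|, s = fl(a + b) and t is the exact
// rounding error, so s + t == a + b exactly and fl(s + t) == s.
static void fast2sum(float a, float b, float &s, float &t) {
  s = a + b;
  float z = s - a;  // exact: the part of b that was absorbed into s
  t = b - z;        // exact: what was rounded away
}
```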

src_val = src_val + static_cast<T>(0.0001);
// For double, increment the bit pattern to get the next representable value
// so the loop anti-aliasing doesn't introduce a double-precision add.
if constexpr (std::is_same_v<T, double>) {
Collaborator


We should try to use cuda::std here especially since it's a device function, even though it doesn't matter for this particular function

Collaborator Author


that's fine (though we do use this elsewhere in the repo)

}
};

struct DoubleToFltFlt {
Collaborator


Should we have these defined in cast.h instead?

unsigned long long mant = xbits & 0x000FFFFFFFFFFFFFULL;
// hi_exp: float biased exponent = (e_x - 1023) + 127 = e_x - 896.
int hi_exp = (int)e_x - 896;
if (e_x == 0 || hi_exp <= 0 || hi_exp >= 255) {
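The rebias in the quoted snippet can be checked with a small standalone function (hypothetical name `hi_exp_of`):

```cpp
#include <cstdint>
#include <cstring>

// Worked check of the rebias above: a double's biased exponent e_x maps to
// the float biased exponent via e_x - 896, since (e_x - 1023) + 127 = e_x - 896.
// For x = 1.0, e_x = 1023 and hi_exp = 127, the biased exponent of 1.0f.
static int hi_exp_of(double x) {
  std::uint64_t xb;
  std::memcpy(&xb, &x, sizeof xb);
  int e_x = (int)((xb >> 52) & 0x7FF);
  return e_x - 896;
}
```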
Collaborator


We may need a bit more guard for hi_exp (e.g., >= 254 for this path). We could end up with hi == FLT_MAX and then have that overflow to Inf during the fast2sum. That would then change our accuracy guarantees in the case that hi_exp == 254. Or it may be enough to just add (hi_exp == 254 && hi_mantissa == 0x7FFFFF).

Collaborator Author


I actually think that would be ok: the only way it would overflow would be if lo_raw >= ulp(FLT_MAX), in which case hi should round up to Inf anyway. the lo will be NaN, but that's what it would be now.

Collaborator


I think we can have lo_raw >= ulp(FLT_MAX) (thus the need for fast2sum to renormalize). r is in [0, 2^29-1], but r_float can round up to 2^29. The scale factor in this case is the float with biased exponent 254-52=202, so the unbiased exponent is 202-127=75. Then lo_raw is 2^29 * 2^75 = 2^104 = ulp(FLT_MAX). So, in this case it's the hi + lo_raw during renormalization that will overflow s to Inf, but we generally need the renormalization because the rounding for r_float can round up to ulp(hi).
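The arithmetic in this comment is verifiable numerically; a sketch (hypothetical function name, using FLT_MAX and nextafter in place of device code):

```cpp
#include <cfloat>
#include <cmath>

// Numeric check of the scenario above: r = 2^29 - 1 needs 29 bits, so
// rounding it to a 24-bit float mantissa rounds up to 2^29; with the
// hi_exp = 254 scale of 2^75 this gives lo_raw = 2^104 = ulp(FLT_MAX).
static bool lo_raw_can_reach_ulp_flt_max() {
  float r_float = (float)((1u << 29) - 1);  // rounds up to 2^29
  float scale = std::ldexp(1.0f, 75);       // biased exponent 254 - 52 = 202
  float lo_raw = r_float * scale;           // 2^29 * 2^75 = 2^104
  // ulp(FLT_MAX): gap between FLT_MAX and the previous float, exactly 2^104
  float ulp_max = FLT_MAX - std::nextafter(FLT_MAX, 0.0f);
  return lo_raw == ulp_max && r_float == 536870912.0f;
}
```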

Collaborator Author


actually i realized that isn't quite right. let me think about this a bit more

@tbensonatl
Collaborator

This improves the overflow/underflow handling and adds a fast2sum step to keep the invariant fl(hi + lo) == hi. Unfortunately this adds some cost, but it is still 2x faster than the old conversion.

2x faster is still great! One potential concern could be increased register pressure, but it seems unlikely for that to result in the old behavior being faster. I can try it in the recently merged sarbp example to see if there is any change in register usage or any spilling for that kernel for one sanity check.

@coveralls

Coverage Status

Coverage is 91.801% for sbyrne/fltflt-cast-emulated into main. No base build found for main.


4 participants