
[ConstantFolding] Add flag to disable call folding #140270


Merged
merged 1 commit into from
May 30, 2025

Conversation

LewisCrawford
Contributor

Add an optional flag to disable constant-folding for function calls. This applies to both intrinsics and libcalls.

This is not necessary in most cases, so it is disabled by default, but in cases that require bit-exact precision between the result from constant-folding and run-time execution, this flag can be useful and may help with debugging. Cases where mismatches can occur include GPU execution vs host-side folding, cross-compilation scenarios, or compilation vs execution environments with different math library versions.

This applies only to calls, rather than all FP arithmetic. Methods such as fast-math-flags can be used to limit reassociation, fma-fusion etc, and basic arithmetic operations are precisely defined in IEEE 754. However, other math operations such as sqrt, sin, pow etc. represented by either libcalls or intrinsics are less well defined, and may vary more between different architectures/library implementations.

As this option is not intended for most common use-cases, this patch takes the more conservative approach of disabling constant-folding even for operations like fmax, copysign, fabs etc. in order to keep the implementation simple, rather than sprinkling checks for this flag throughout.

The use-cases for this option are similar to StrictFP, but it is limited only to FP call folding, rather than all FP operations, as it is about precise arithmetic results, rather than FP environment behaviours. It can also be used when linking .bc files compiled with different StrictFP settings with llvm-link.

@llvmbot
Member

llvmbot commented May 16, 2025

@llvm/pr-subscribers-llvm-analysis

Author: Lewis Crawford (LewisCrawford)

Changes

Add an optional flag to disable constant-folding for function calls. This applies to both intrinsics and libcalls.

This is not necessary in most cases, so is disabled by default, but in cases that require bit-exact precision between the result from constant-folding and run-time execution, having this flag can be useful, and may help with debugging. Cases where mismatches can occur include GPU execution vs host-side folding, cross-compilation scenarios, or compilation vs execution environments with different math library versions.

This applies only to calls, rather than all FP arithmetic. Methods such as fast-math-flags can be used to limit reassociation, fma-fusion etc, and basic arithmetic operations are precisely defined in IEEE 754. However, other math operations such as sqrt, sin, pow etc. represented by either libcalls or intrinsics are less well defined, and may vary more between different architectures/library implementations.

As this option is not intended for most common use-cases, this patch takes the more conservative approach of disabling constant-folding even for operations like fmax, copysign, fabs etc. in order to keep the implementation simple, rather than sprinkling checks for this flag throughout.

The use-cases for this option are similar to StrictFP, but it is only limited to FP call folding, rather than all FP operations, as it is about precise arithmetic results, rather than FP environment behaviours. It also can be used to when linking .bc files compiled with different StrictFP settings with llvm-link.


Full diff: https://github.com/llvm/llvm-project/pull/140270.diff

2 Files Affected:

  • (modified) llvm/lib/Analysis/ConstantFolding.cpp (+23-3)
  • (added) llvm/test/Transforms/InstSimplify/disable_folding.ll (+54)
diff --git a/llvm/lib/Analysis/ConstantFolding.cpp b/llvm/lib/Analysis/ConstantFolding.cpp
index 412a0e8979193..2b02db88e809d 100644
--- a/llvm/lib/Analysis/ConstantFolding.cpp
+++ b/llvm/lib/Analysis/ConstantFolding.cpp
@@ -64,6 +64,11 @@
 
 using namespace llvm;
 
+static cl::opt<bool> DisableFPCallFolding(
+    "disable-fp-call-folding",
+    cl::desc("Disable constant-folding of FP intrinsics and libcalls."),
+    cl::init(false), cl::Hidden);
+
 namespace {
 
 //===----------------------------------------------------------------------===//
@@ -1576,6 +1581,17 @@ bool llvm::canConstantFoldCallTo(const CallBase *Call, const Function *F) {
     return false;
   if (Call->getFunctionType() != F->getFunctionType())
     return false;
+
+  // Allow FP calls (both libcalls and intrinsics) to avoid being folded.
+  // This can be useful for GPU targets or in cross-compilation scenarios
+  // when the exact target FP behaviour is required, and the host compiler's
+  // behaviour may be slightly different from the device's run-time behaviour.
+  if (DisableFPCallFolding && (F->getReturnType()->isFloatingPointTy() ||
+                               any_of(F->args(), [](const Argument &Arg) {
+                                 return Arg.getType()->isFloatingPointTy();
+                               })))
+    return false;
+
   switch (F->getIntrinsicID()) {
   // Operations that do not operate floating-point numbers and do not depend on
   // FP environment can be folded even in strictfp functions.
@@ -1700,7 +1716,6 @@ bool llvm::canConstantFoldCallTo(const CallBase *Call, const Function *F) {
   case Intrinsic::x86_avx512_vcvtsd2usi64:
   case Intrinsic::x86_avx512_cvttsd2usi:
   case Intrinsic::x86_avx512_cvttsd2usi64:
-    return !Call->isStrictFP();
 
   // NVVM FMax intrinsics
   case Intrinsic::nvvm_fmax_d:
@@ -1775,6 +1790,7 @@ bool llvm::canConstantFoldCallTo(const CallBase *Call, const Function *F) {
   case Intrinsic::nvvm_d2ull_rn:
   case Intrinsic::nvvm_d2ull_rp:
   case Intrinsic::nvvm_d2ull_rz:
+    return !Call->isStrictFP();
 
   // Sign operations are actually bitwise operations, they do not raise
   // exceptions even for SNANs.
@@ -3886,8 +3902,12 @@ ConstantFoldStructCall(StringRef Name, Intrinsic::ID IntrinsicID,
 Constant *llvm::ConstantFoldBinaryIntrinsic(Intrinsic::ID ID, Constant *LHS,
                                             Constant *RHS, Type *Ty,
                                             Instruction *FMFSource) {
-  return ConstantFoldIntrinsicCall2(ID, Ty, {LHS, RHS},
-                                    dyn_cast_if_present<CallBase>(FMFSource));
+  auto *Call = dyn_cast_if_present<CallBase>(FMFSource);
+  // Ensure we check flags like StrictFP that might prevent this from getting
+  // folded before generating a result.
+  if (Call && !canConstantFoldCallTo(Call, Call->getCalledFunction()))
+    return nullptr;
+  return ConstantFoldIntrinsicCall2(ID, Ty, {LHS, RHS}, Call);
 }
 
 Constant *llvm::ConstantFoldCall(const CallBase *Call, Function *F,
diff --git a/llvm/test/Transforms/InstSimplify/disable_folding.ll b/llvm/test/Transforms/InstSimplify/disable_folding.ll
new file mode 100644
index 0000000000000..66adf6af1e97f
--- /dev/null
+++ b/llvm/test/Transforms/InstSimplify/disable_folding.ll
@@ -0,0 +1,54 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -passes=instsimplify -march=nvptx64 --mcpu=sm_86 --mattr=+ptx72 -S | FileCheck %s --check-prefixes CHECK,FOLDING_ENABLED
+; RUN: opt < %s -disable-fp-call-folding -passes=instsimplify -march=nvptx64 --mcpu=sm_86 --mattr=+ptx72 -S | FileCheck %s --check-prefixes CHECK,FOLDING_DISABLED
+
+; Check that we can disable folding of intrinsic calls via both the -disable-fp-call-folding flag and the strictfp attribute.
+
+; Should be folded by default unless -disable-fp-call-folding is set
+define float @test_fmax_ftz_nan_xorsign_abs_f() {
+; FOLDING_ENABLED-LABEL: define float @test_fmax_ftz_nan_xorsign_abs_f() {
+; FOLDING_ENABLED-NEXT:    ret float -2.000000e+00
+;
+; FOLDING_DISABLED-LABEL: define float @test_fmax_ftz_nan_xorsign_abs_f() {
+; FOLDING_DISABLED-NEXT:    [[RES:%.*]] = call float @llvm.nvvm.fmax.ftz.nan.xorsign.abs.f(float 1.250000e+00, float -2.000000e+00)
+; FOLDING_DISABLED-NEXT:    ret float [[RES]]
+;
+  %res = call float @llvm.nvvm.fmax.ftz.nan.xorsign.abs.f(float 1.25, float -2.0)
+  ret float %res
+}
+
+; Check that -disable-fp-call-folding triggers for LLVM instrincis, not just NVPTX target-specific ones.
+define float @test_llvm_sin() {
+; FOLDING_ENABLED-LABEL: define float @test_llvm_sin() {
+; FOLDING_ENABLED-NEXT:    ret float 0x3FDEAEE880000000
+;
+; FOLDING_DISABLED-LABEL: define float @test_llvm_sin() {
+; FOLDING_DISABLED-NEXT:    [[RES:%.*]] = call float @llvm.sin.f32(float 5.000000e-01)
+; FOLDING_DISABLED-NEXT:    ret float [[RES]]
+;
+  %res = call float @llvm.sin.f32(float 0.5)
+  ret float %res
+}
+
+; Should not be folded, even when -disable-fp-call-folding is not set, as it is marked as strictfp.
+define float @test_fmax_ftz_nan_f_strictfp() {
+; CHECK-LABEL: define float @test_fmax_ftz_nan_f_strictfp() {
+; CHECK-NEXT:    [[RES:%.*]] = call float @llvm.nvvm.fmax.ftz.nan.f(float 1.250000e+00, float -2.000000e+00) #[[ATTR1:[0-9]+]]
+; CHECK-NEXT:    ret float [[RES]]
+;
+  %res = call float @llvm.nvvm.fmax.ftz.nan.f(float 1.25, float -2.0) #1
+  ret float %res
+}
+
+; Check that strictfp disables folding for LLVM math intrinsics like sin.f32
+; even when -disable-fp-call-folding is not set.
+define float @test_llvm_sin_strictfp() {
+; CHECK-LABEL: define float @test_llvm_sin_strictfp() {
+; CHECK-NEXT:    [[RES:%.*]] = call float @llvm.sin.f32(float 5.000000e-01) #[[ATTR1]]
+; CHECK-NEXT:    ret float [[RES]]
+;
+  %res = call float @llvm.sin.f32(float 0.5) #1
+  ret float %res
+}
+
+attributes #1 = { strictfp }

@llvmbot
Member

llvmbot commented May 16, 2025

@llvm/pr-subscribers-llvm-transforms


@Artem-B
Member

Artem-B commented May 16, 2025

While I can see potential usefulness to be able to generate specific target code instead of getting a reference result computed by LLVM, I am not convinced that this patch is the way to go.

It does give us a bit more control and will get some computations done on the GPU instead of using the LLVM-computed result, but it's neither here nor there and leaves us quite far from "let GPU do all FP calculations". The fact that LLVM will still be able to optimize regular FP operations renders it all almost moot. E.g., if LLVM happens to inline some trigonometric function with a constant argument, it may be able to fold it, completely bypassing this option.
Things get further complicated by the fact that NVPTX often passes reduced precision FP types as opaque integers, and we presumably want to avoid folding those, too.

That said, as a debug flag for disabling folding of some functions in general it may be useful. Working around function folding in tests is somewhat common.

If we do want to apply the no-folding to a subset of functions, we may need to find a more precise way to determine that set. Manually curating the list will be a pain. For the functions we don't want to fold during compilation we may need to have a way to mark them explicitly, perhaps via an attribute on the function itself, or on the caller function.

Another possibility is to allow specifying a list of functions or patterns to match, and apply the flag only to the matching functions. This way it will be up to the user to specify which function calls they want to preserve.

@LewisCrawford
Contributor Author

The context for this flag is that I want to add constant-folding support for all these NVVM math intrinsics here: #141233

From the list of intrinsics supported there, several might end up with slightly different results from the device-side version of the code depending on what the host-side compiler's math library does to constant fold them.

  • nvvm.cos.approx.*
  • nvvm.ex2.approx.*
  • nvvm.lg2.approx.*
  • nvvm.rcp.*
  • nvvm.rsqrt.approx.*
  • nvvm.sin.approx.*
  • nvvm.sqrt.f
  • nvvm.sqrt.rn.*
  • nvvm.sqrt.approx.*

There have also been other discussions about folding FP call instructions (e.g. here: https://discourse.llvm.org/t/fp-constant-folding-of-floating-point-operations/73138 ).

It does give us a bit more control and will get some computations done on the GPU instead of LLVM-computed result, but it's neither here nor there and leaves us quite far from "let GPU do all FP calculations".

The aim here is not "let GPU do all FP calculations", but just to handle the narrower cases of math library function-calls and intrinsics where potential precision differences are more likely to be visible. A more general "disable all FP folding" or even "disable all constant-folding" flag might also have some value, but I think this narrower flag is all that is needed to cover the potential problems users expecting bit-accurate results could face from the folding in #141233 (or are already facing from the folding of similar LLVM sin/cos intrinsics or libcalls).

If LLVM happens to inline some trigonometric function with constant argument, it may be able to fold it all completely bypassing this option.

If LLVM is able to inline the function, then it already has the exact implementation available, so the problem of using a different implementation for a function like sin to fold an intrinsic like llvm.sin or nvvm.sin.approx would not occur. This patch is more for cases where functions are either intrinsics, or folded by function-name as a libcall, rather than having a fully specified implementation available to inline and fold that way.

Things get further complicated by the fact that NVPTX often passes reduced precision FP types as opaque integers, and we presumably want to avoid folding those, too.

Currently, we don't have constant-folding support for those operations. The f16 versions of the above intrinsics, like nvvm.ex2.approx.f16x2 are implemented with real floating-point types rather than ints, so should be covered by this patch. The intrinsics involving smaller FP types like nvvm_e5m2x2_to_f16x2_rn, nvvm_e2m3x2_to_f16x2_rn_relu etc all look like they are conversion intrinsics, rather than more complex math-library-like intrinsics. These should have well-defined conversion semantics, so the implementation of the constant-folding on the host-side should be bitwise identical to the device-side if we ever add an implementation for this, so it will not cause the sort of problems this disable-fp-call-folding flag is intended to solve. Can you give any examples of an intrinsic using an int to represent a small FP type that would potentially cause precision issues between host and device-side execution if it was folded?

That said, as a debug flag for disabling folding of some functions in general it may be useful. Working around function folding in tests is somewhat common.

This is a good point, which I hadn't considered as a use-case before. However, if we narrow this flag to only cover specific functions, it seems like it will become less useful for this, as users will need to carefully check which functions are/are not covered by it.

If we do want to apply the no-folding to a subset of functions, we may need to find a more precise way to determine that set.

What do you view as the benefit of making the subset narrower? I agree that the current implementation is broader than it needs to be, and that something like including ex2 but excluding fabs would stop the flag from blocking folding that would be precise. However, I do not expect this flag to be used by most people. When it is used, the fact that it applies to all functions with FP inputs or outputs makes its scope easy to understand from the flag-name/description without checking the LLVM source-code for a precise list of functions. I expect it to be useful to test with vs without the flag to spot cases where a host vs device mismatch occurs; users can then use another method to avoid constant-folding (e.g. passing a value as a kernel parameter or via a load from memory) if they determine a specific point where this matters in their code and the performance is too slow with the flag enabled.

For the functions we don't want to fold during compilation we may need to have a way to mark them explicitly, perhaps via an attribute on the function itself, or on the caller function.

I don't think the caller function would work, as you'd need to block inlining for those functions to preserve the attribute, which would potentially have even more of a perf impact than just not folding a few instructions (and would add complexity). Adding an attribute to the function, e.g. specifying something like MayFoldInexactly in the intrinsic definitions for functions like nvvm.ex2.approx.* (or even the inverse - adding FoldsExactly to fabs, fmax etc) might be a decent way to implement this if there is real value to narrowing this to a small subset of functions. I'm not 100% sure how this would work for LibCalls, but the NVPTX backend does not use LibCalls, so that is not strictly necessary for the use-cases I need this flag for.

However, I think this approach could become error-prone, as it would be easy to miss adding this attribute in a case where it would be needed. It also makes the semantics of a flag like this harder to understand for users without reading the implementation for which functions it includes. There may also be cases where the functions are almost exact, but NaN payloads or FTZ semantics might be slightly different depending on the host vs device, or library-version used.

Another possibility is to allow specifying a list of functions or patterns to match, and apply the flag only to the matching functions. This way it will be up to the user to specify which function calls they want to preserve.

This seems very flexible and powerful for the user. However, I don't think there are enough people who would need this functionality to make it worth implementing all the additional complexity required to parse and check this list. Cases like intrinsics that are auto-upgraded (e.g. nvvm_fabs_f gets upgraded into nvvm_fabs) or transformed (e.g. NVPTXTargetTransformInfo.cpp turns nvvm_fmax_f into llvm.maxnum currently), or optimized in InstCombine somehow, would complicate this sort of mechanism too. Users would need to know all variants of the intrinsic that input might get turned into in order for this to work reliably.

Currently, I still think the simple approach in this patch is best. It makes it easy for us as maintainers, as we do not need to evaluate individual libcalls/intrinsics for whether they need to be included in/excluded from this flag, and easy for users, as they do not need to check exactly which calls it covers. It's still a fairly blunt instrument, so I don't think it will be useful for users that would need this in production for performance-critical code, but I think it is broad enough to be useful as a debugging tool that can help find precision issues, which can then be worked around in other ways. There may be use-cases for more general flags to disable all folding or all FP folding, or more specific flags that control folding of specific functions, but I think the current implementation is a decent middle-ground between those two extremes, and is simple enough to be useful without adding an additional maintenance burden.

@nikic nikic requested review from arsenm and efriedma-quic May 23, 2025 18:43
@efriedma-quic
Collaborator

I don't see why you'd use an inaccurate implementation to constant-fold something like nvvm.rsqrt.approx.f32. It maps to some exact formula; likely a small table lookup plus linear interpolation. (Unless it isn't consistent across targets?)

For the target-independent transcendental intrinsics, you can use a nobuiltin call to the actual implementation. For non-transcendental intrinsics, you can control lowering with fast-math flags. If you want to turn off optimizations for debugging, we have other tools for that, like opt-bisect-limit.

Disabling folding for everything, even cases where it's possible to fold deterministically, is a good indication to users that we don't expect them to use this flag in production. So maybe this is okay.

@Artem-B
Member

Artem-B commented May 27, 2025

@LewisCrawford Thank you. I appreciate your thoughtful response. With the intended scope of the patch as a debug-only knob, most of my concerns and handwaving are moot, and applying it to functions with FP results or arguments is good enough.

We can revisit more granular selection if/when we actually need it.

Contributor

@arsenm arsenm left a comment


If we want such a debugging flag, I think this is the wrong heuristic. I would be more interested in a way of disabling folding of any calls that go through the host library. Not folding exact functions we directly and correctly implement in APFloat is silly.

@Artem-B
Member

Artem-B commented May 28, 2025

If we want such a debugging flag, I think this is the wrong heuristic.

It's a heuristic that works well enough for the immediate use case @LewisCrawford needs it for right now.

But I agree that it would be nice to make the selection more granular.

I think pattern matching applicable functions, similar to how we select them for -filter-print-funcs, may work in this case. Selecting them based on argument type happens to work here, but IMO it's a bit too wide (we may not want/need all such functions) and too narrow (we may want other functions; comparing FP results between host and device is only one of the possible use cases for such a flag).

@LewisCrawford
Contributor Author

LewisCrawford commented May 30, 2025

I think to cover all conceivable use-cases we'd need flags for:

1: Disable all folding
2: Disable all FP folding
3: Disable FP call folding
4: Disable FP call folding implemented via a call to the host library
5: Disable folding only specific intrinsics/libcalls marked as potentially inexact between implementations
6: Disable folding only for specific intrinsics/libcalls named via a command-line arg

For the specific use-case people have been asking me for this, (3) seems the best balance, but I'd be happy for any of the others to be added in addition to cover more general or more narrow use-cases.

I agree, it does seem a bit silly to disable exact implementations. However, one example that has been brought up to me is that of fabs. The NVVM version of fabs is not necessarily bit-exact, as it may canonicalize NaNs. The PTX spec states:

For abs.f32, NaN input yields unspecified NaN.
Future implementations may comply with the IEEE 754 standard by preserving payload and modifying only the sign bit.

So it is technically legal for us to fold by changing only the sign bit, since the NaN output is unspecified. However, this might produce a different result from the hardware if it instead chooses a canonical NaN value here (and different architectures may technically produce different NaN values).
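To make the fabs case concrete, here is a small standalone C++ illustration (not LLVM code; the helper names are made up) of the two legal behaviours: a bitwise fold that only clears the sign bit, versus an implementation that canonicalizes NaNs:

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical illustration, not LLVM code: two spec-compliant ways to
// implement fabs on a NaN input under the quoted PTX wording.

inline uint32_t float_bits(float x) {
  uint32_t b;
  std::memcpy(&b, &x, sizeof(b));
  return b;
}

inline float bits_float(uint32_t b) {
  float x;
  std::memcpy(&x, &b, sizeof(x));
  return x;
}

// Fold-style fabs: clear only the sign bit, preserving any NaN payload.
inline float fabs_bitwise(float x) {
  return bits_float(float_bits(x) & 0x7FFFFFFFu);
}

// Hardware-style fabs that canonicalizes: NaN inputs come out as the
// canonical quiet NaN, losing the payload.
inline float fabs_canonicalizing(float x) {
  if (x != x)  // NaN compares unequal to itself
    return std::numeric_limits<float>::quiet_NaN();
  return fabs_bitwise(x);
}
```

Both results are valid under the quoted PTX wording, which is exactly why a compile-time fold and a run-time execution can disagree bit-for-bit on NaN inputs.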

Also, we could choose to fold this using either a libcall to fabs, or with APFloat's clearSign function. In 2019, LLVM's target-independent abs intrinsic was switched (along with several others) from libcall to APFloat implementation here: https://reviews.llvm.org/D67459 . We want this flag to be slightly more general than just not folding host libcalls (4), because that implementation can change (and has changed in that review), and some cases with bit-exact implementations via APFloat can still produce differing results on NVPTX hardware in cases where the spec allows flexibility (e.g. around NaN canonicalization).

rsqrt is another example where the PTX spec allows flexibility:

The maximum relative error for rsqrt.f32 over the entire positive finite floating-point range is 2^-22.9.

So it is technically possible that the host-side folding may be more precise than the device-side implementation without violating the spec (and e.g. x86 may be different from aarch64 on the host-side, and sm60 may be different from sm100 on the device-side if they happen to implement this slightly differently).
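The gap the spec permits can be seen numerically. In this hedged sketch (plain C++, not a real device implementation), the host-side chain 1.0f / sqrtf(x) uses two correctly rounded IEEE 754 operations, so its relative error stays within about 2^-23, comfortably inside the 2^-22.9 budget; a device implementation anywhere else inside that budget is equally spec-compliant, so the two may legally differ:

```cpp
#include <cmath>

// Measures the worst relative error of the single-precision chain
// 1.0f / sqrtf(x) against a double-precision reference over [1, 4).
// Each float op is correctly rounded (<= 0.5 ulp), so the combined
// relative error is bounded by roughly 2^-23 < 2^-22.9.
inline double max_rsqrt_rel_error() {
  double worst = 0.0;
  for (float x = 1.0f; x < 4.0f; x += 0.001f) {
    double ref = 1.0 / std::sqrt(static_cast<double>(x));  // reference
    double approx = 1.0f / std::sqrt(x);  // two rounded float ops
    double rel = std::fabs(approx - ref) / ref;
    if (rel > worst)
      worst = rel;
  }
  return worst;
}
```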

So (3) seems the best balance between allowing a little FP math to be folded (regular adds/muls etc), while disabling calls to other functions consistently without requiring end-users to know implementation details about whether libcalls are used in the implementation (which may change between versions), or whether the specific intrinsics get auto-upgraded or transformed into other intrinsics later on.
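As a hedged sketch (all names here are illustrative, not the actual patch), the conservative approach described above amounts to a single early-out at the call-folding entry point, rather than per-intrinsic checks sprinkled through every folding routine:

```cpp
#include <optional>

namespace sketch {

// Stand-in for the command-line flag the patch adds (name illustrative).
bool DisableFPCallFolding = false;

// Toy "folder" for a unary FP call such as sinf: returns the folded
// constant, or nullopt when folding is disabled and the call must
// survive to run time.
std::optional<float> foldUnaryFPCall(float (*fn)(float), float arg) {
  if (DisableFPCallFolding)
    return std::nullopt;  // leave the call for run time
  return fn(arg);  // normal constant folding
}

}  // namespace sketch
```

A single uniform check like this is easy to reason about and cannot drift out of sync as individual folding routines change, at the cost of also disabling folds (fmax, copysign, fabs, etc.) that are exact.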

@LewisCrawford LewisCrawford merged commit 1f7885c into llvm:main May 30, 2025
15 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented May 30, 2025

LLVM Buildbot has detected a new failure on builder lldb-arm-ubuntu running on linaro-lldb-arm-ubuntu while building llvm at step 6 "test".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/18/builds/16796

Here is the relevant piece of the build log for reference:
Step 6 (test) failure: build (failure)
...
PASS: lldb-api :: tools/lldb-dap/variables/children/TestDAP_variables_children.py (1222 of 3095)
UNSUPPORTED: lldb-api :: tools/lldb-server/TestAppleSimulatorOSType.py (1223 of 3095)
PASS: lldb-api :: tools/lldb-dap/threads/TestDAP_threads.py (1224 of 3095)
PASS: lldb-api :: tools/lldb-server/TestGdbRemoteAttach.py (1225 of 3095)
PASS: lldb-api :: tools/lldb-server/TestGdbRemoteCompletion.py (1226 of 3095)
PASS: lldb-api :: tools/lldb-server/TestGdbRemoteExitCode.py (1227 of 3095)
UNSUPPORTED: lldb-api :: tools/lldb-server/TestGdbRemoteFork.py (1228 of 3095)
UNSUPPORTED: lldb-api :: tools/lldb-server/TestGdbRemoteForkNonStop.py (1229 of 3095)
UNSUPPORTED: lldb-api :: tools/lldb-server/TestGdbRemoteForkResume.py (1230 of 3095)
PASS: lldb-api :: tools/lldb-server/TestGdbRemoteAuxvSupport.py (1231 of 3095)
FAIL: lldb-api :: tools/lldb-dap/variables/TestDAP_variables.py (1232 of 3095)
******************** TEST 'lldb-api :: tools/lldb-dap/variables/TestDAP_variables.py' FAILED ********************
Script:
--
/usr/bin/python3.10 /home/tcwg-buildbot/worker/lldb-arm-ubuntu/llvm-project/lldb/test/API/dotest.py -u CXXFLAGS -u CFLAGS --env LLVM_LIBS_DIR=/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./lib --env LLVM_INCLUDE_DIR=/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/include --env LLVM_TOOLS_DIR=/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./bin --arch armv8l --build-dir /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex --lldb-module-cache-dir /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex/module-cache-clang/lldb-api --executable /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./bin/lldb --compiler /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./bin/clang --dsymutil /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./bin/dsymutil --make /usr/bin/gmake --llvm-tools-dir /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./bin --lldb-obj-root /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/tools/lldb --lldb-libs-dir /home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/./lib --cmake-build-type Release /home/tcwg-buildbot/worker/lldb-arm-ubuntu/llvm-project/lldb/test/API/tools/lldb-dap/variables -p TestDAP_variables.py
--
Exit Code: 1

Command Output (stdout):
--
lldb version 21.0.0git (https://github.com/llvm/llvm-project.git revision 1f7885cf9c6801d11491c8c194c999f7223dd141)
  clang revision 1f7885cf9c6801d11491c8c194c999f7223dd141
  llvm revision 1f7885cf9c6801d11491c8c194c999f7223dd141
Skipping the following test categories: ['libc++', 'dsym', 'gmodules', 'debugserver', 'objc']

--
Command Output (stderr):
--
UNSUPPORTED: LLDB (/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/bin/clang-arm) :: test_darwin_dwarf_missing_obj (TestDAP_variables.TestDAP_variables) (requires one of macosx, darwin, ios, tvos, watchos, bridgeos, iphonesimulator, watchsimulator, appletvsimulator) 
UNSUPPORTED: LLDB (/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/bin/clang-arm) :: test_darwin_dwarf_missing_obj_with_symbol_ondemand_enabled (TestDAP_variables.TestDAP_variables) (requires one of macosx, darwin, ios, tvos, watchos, bridgeos, iphonesimulator, watchsimulator, appletvsimulator) 
========= DEBUG ADAPTER PROTOCOL LOGS =========
1748601886.723957777 --> (stdio) {"command":"initialize","type":"request","arguments":{"adapterID":"lldb-native","clientID":"vscode","columnsStartAt1":true,"linesStartAt1":true,"locale":"en-us","pathFormat":"path","supportsRunInTerminalRequest":true,"supportsVariablePaging":true,"supportsVariableType":true,"supportsStartDebuggingRequest":true,"supportsProgressReporting":true,"$__lldb_sourceInitFile":false},"seq":1}
1748601886.727401495 <-- (stdio) {"body":{"$__lldb_version":"lldb version 21.0.0git (https://github.com/llvm/llvm-project.git revision 1f7885cf9c6801d11491c8c194c999f7223dd141)\n  clang revision 1f7885cf9c6801d11491c8c194c999f7223dd141\n  llvm revision 1f7885cf9c6801d11491c8c194c999f7223dd141","completionTriggerCharacters":["."," ","\t"],"exceptionBreakpointFilters":[{"default":false,"filter":"cpp_catch","label":"C++ Catch"},{"default":false,"filter":"cpp_throw","label":"C++ Throw"},{"default":false,"filter":"objc_catch","label":"Objective-C Catch"},{"default":false,"filter":"objc_throw","label":"Objective-C Throw"}],"supportTerminateDebuggee":true,"supportsBreakpointLocationsRequest":true,"supportsCancelRequest":true,"supportsCompletionsRequest":true,"supportsConditionalBreakpoints":true,"supportsConfigurationDoneRequest":true,"supportsDataBreakpoints":true,"supportsDelayedStackTraceLoading":true,"supportsDisassembleRequest":true,"supportsEvaluateForHovers":true,"supportsExceptionInfoRequest":true,"supportsExceptionOptions":true,"supportsFunctionBreakpoints":true,"supportsHitConditionalBreakpoints":true,"supportsInstructionBreakpoints":true,"supportsLogPoints":true,"supportsModulesRequest":true,"supportsReadMemoryRequest":true,"supportsRestartRequest":true,"supportsSetVariable":true,"supportsStepInTargetsRequest":true,"supportsSteppingGranularity":true,"supportsValueFormattingOptions":true},"command":"initialize","request_seq":1,"seq":0,"success":true,"type":"response"}
1748601886.727997541 --> (stdio) {"command":"launch","type":"request","arguments":{"program":"/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex/tools/lldb-dap/variables/TestDAP_variables.test_indexedVariables/a.out","initCommands":["settings clear --all","settings set symbols.enable-external-lookup false","settings set target.inherit-tcc true","settings set target.disable-aslr false","settings set target.detach-on-error false","settings set target.auto-apply-fixits false","settings set plugin.process.gdb-remote.packet-timeout 60","settings set symbols.clang-modules-cache-path \"/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex/module-cache-lldb/lldb-api\"","settings set use-color false","settings set show-statusline false"],"disableASLR":false,"enableAutoVariableSummaries":false,"enableSyntheticChildDebugging":false,"displayExtendedBacktrace":false},"seq":2}
1748601886.728522062 <-- (stdio) {"body":{"category":"console","output":"Running initCommands:\n"},"event":"output","seq":0,"type":"event"}
1748601886.728566647 <-- (stdio) {"body":{"category":"console","output":"(lldb) settings clear --all\n"},"event":"output","seq":0,"type":"event"}
1748601886.728588820 <-- (stdio) {"body":{"category":"console","output":"(lldb) settings set symbols.enable-external-lookup false\n"},"event":"output","seq":0,"type":"event"}
1748601886.728600979 <-- (stdio) {"body":{"category":"console","output":"(lldb) settings set target.inherit-tcc true\n"},"event":"output","seq":0,"type":"event"}
1748601886.728612661 <-- (stdio) {"body":{"category":"console","output":"(lldb) settings set target.disable-aslr false\n"},"event":"output","seq":0,"type":"event"}
1748601886.728623867 <-- (stdio) {"body":{"category":"console","output":"(lldb) settings set target.detach-on-error false\n"},"event":"output","seq":0,"type":"event"}
1748601886.728635073 <-- (stdio) {"body":{"category":"console","output":"(lldb) settings set target.auto-apply-fixits false\n"},"event":"output","seq":0,"type":"event"}
1748601886.728647232 <-- (stdio) {"body":{"category":"console","output":"(lldb) settings set plugin.process.gdb-remote.packet-timeout 60\n"},"event":"output","seq":0,"type":"event"}
1748601886.728676081 <-- (stdio) {"body":{"category":"console","output":"(lldb) settings set symbols.clang-modules-cache-path \"/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex/module-cache-lldb/lldb-api\"\n"},"event":"output","seq":0,"type":"event"}
1748601886.728689432 <-- (stdio) {"body":{"category":"console","output":"(lldb) settings set use-color false\n"},"event":"output","seq":0,"type":"event"}
1748601886.728701115 <-- (stdio) {"body":{"category":"console","output":"(lldb) settings set show-statusline false\n"},"event":"output","seq":0,"type":"event"}
1748601886.894868374 <-- (stdio) {"command":"launch","request_seq":2,"seq":0,"success":true,"type":"response"}
1748601886.894940615 <-- (stdio) {"event":"initialized","seq":0,"type":"event"}
1748601886.894950867 <-- (stdio) {"body":{"module":{"addressRange":"0xf7c0b000","debugInfoSize":"983.3KB","id":"0D794E6C-AF7E-D8CB-B9BA-E385B4F8753F-5A793D65","name":"ld-linux-armhf.so.3","path":"/usr/lib/arm-linux-gnueabihf/ld-linux-armhf.so.3","symbolFilePath":"/usr/lib/arm-linux-gnueabihf/ld-linux-armhf.so.3","symbolStatus":"Symbols loaded."},"reason":"new"},"event":"module","seq":0,"type":"event"}
1748601886.895180464 <-- (stdio) {"body":{"module":{"addressRange":"0x5e0000","debugInfoSize":"25.5KB","id":"ADAECA3F","name":"a.out","path":"/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex/tools/lldb-dap/variables/TestDAP_variables.test_indexedVariables/a.out","symbolFilePath":"/home/tcwg-buildbot/worker/lldb-arm-ubuntu/build/lldb-test-build.noindex/tools/lldb-dap/variables/TestDAP_variables.test_indexedVariables/a.out","symbolStatus":"Symbols loaded."},"reason":"new"},"event":"module","seq":0,"type":"event"}

@nikic
Copy link
Contributor

nikic commented May 30, 2025

@LewisCrawford @llvm.fabs must lower to a strictly bitwise operation -- unlike many other FP intrinsics, there is no leeway here. If it fails to do that on NVPTX, that is a backend bug and you need to lower it differently.


I'm not really happy about this flag as implemented. I think a cleaner way would be to tie into the existing AllowNonDeterministic flag, to disable all non-deterministic constant folding. This includes FP calls, but also e.g. non-deterministic NaN results and non-determinism due to FMF.

@LewisCrawford
Copy link
Contributor Author

The @llvm.fabs intrinsic is lowered correctly in NVPTX. It's the target-specific @llvm.nvvm.fabs intrinsic that is allowed to have weird NaN behaviour.

I've merged it for now to unblock adding more NVVM-intrinsic constant-folding in #141233, but I'll take a look at whether AllowNonDeterministic might work here in a follow-up patch.

sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Jun 3, 2025