Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AArch64] Don't try to vectorize fixed point to fp narrowing conversion #130665

Merged
merged 2 commits into from
Mar 25, 2025

Conversation

pranavk
Copy link
Contributor

@pranavk pranavk commented Mar 10, 2025

GCC, correctly, doesn't vectorize in this case. Absence of direct instructions to convert larger fixed point to lower floating point precision inadvertently causes rounding leading to subtle differences across ISAs.

https://godbolt.org/z/ssEchMWrE

Co-authored by: @echristo

@llvmbot
Copy link
Member

llvmbot commented Mar 10, 2025

@llvm/pr-subscribers-backend-aarch64

Author: Pranav Kant (pranavk)

Changes

GCC, correctly, doesn't vectorize in this case. Absence of direct instructions to convert larger fixed point to lower floating point precision inadvertently causes rounding leading to subtle differences across ISAs.

https://godbolt.org/z/ssEchMWrE


Full diff: https://github.com/llvm/llvm-project/pull/130665.diff

1 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+5-13)
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 9511206c0660a..8ba763df1360b 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -5095,19 +5095,11 @@ SDValue AArch64TargetLowering::LowerVectorINT_TO_FP(SDValue Op,
   uint64_t VTSize = VT.getFixedSizeInBits();
   uint64_t InVTSize = InVT.getFixedSizeInBits();
   if (VTSize < InVTSize) {
-    MVT CastVT =
-        MVT::getVectorVT(MVT::getFloatingPointVT(InVT.getScalarSizeInBits()),
-                         InVT.getVectorNumElements());
-    if (IsStrict) {
-      In = DAG.getNode(Opc, dl, {CastVT, MVT::Other},
-                       {Op.getOperand(0), In});
-      return DAG.getNode(ISD::STRICT_FP_ROUND, dl, {VT, MVT::Other},
-                         {In.getValue(1), In.getValue(0),
-                          DAG.getIntPtrConstant(0, dl, /*isTarget=*/true)});
-    }
-    In = DAG.getNode(Opc, dl, CastVT, In);
-    return DAG.getNode(ISD::FP_ROUND, dl, VT, In,
-                       DAG.getIntPtrConstant(0, dl, /*isTarget=*/true));
+    // Due to the absence of any vector instructions to directly convert
+    // larger fixed point to lower floating point, we end up using intermediate
+    // representation before finally getting VTSize-d floating point. This extra
+    // rounding can lead to subtly incorrect results.
+    return SDValue();
   }
 
   if (VTSize > InVTSize) {

@pranavk pranavk requested review from echristo and lntue March 10, 2025 20:31
@rnk
Copy link
Collaborator

rnk commented Mar 10, 2025

Test case? Maybe you have one but forgot to git add the new file.

@davemgreen
Copy link
Collaborator

There will be a lot of test cases that need updating. We should attempt to limit the damage this does, and I don't believe the default expansion is very efficient for unsigned operations.

@echristo
Copy link
Contributor

There will be a lot of test cases that need updating. We should attempt to limit the damage this does, and I don't believe the default expansion is very efficient for unsigned operations.

I did notice that arguably turning off vectorization of this operation might be the most efficient way. It does leave the loop rather than unrolling it, but the code is quite a bit tighter and the scalar conversion is precise.

@davemgreen
Copy link
Collaborator

I did notice that arguably turning off vectorization of this operation might be the most efficient way. It does leave the loop rather than unrolling it, but the code is quite a bit tighter and the scalar conversion is precise.

Yeah - we should probably try to address both. There are two issues going on - no matter where the vector i64->f32 convert comes from (vectorization, intrinsics, optimizations, the front end) we should be producing correctly rounded code (*). This patch is the bug fix for that issue. Then there is the question of whether the vectorizer should be producing the vector operations. That will probably require a cost-model change to increase the costs (which might mean they are still used in loops with other vector operations, they will just be scalarized for that portion, it still might be profitable overall).

(* I was wondering if there was something that we could do with fast-math to make this OK. Especially as AFAICT it has been this way for 10+ years without anyone having an issue so far. None of the existing flags match exactly, but maybe contract or afn could be co-opted into it. (If fptrunc+fptrunc could be contracted, maybe fptrunc can be "uncontracted" into fptrunc+fptrunc). Something for a follow-on if we need it maybe, we might be able to use SVE instructions instead for some of them if the architecture feature is present).

If you can update all the failing test cases we can take a look at whether they need to change. I believe, for example, that i32->half conversions are always fine (https://alive2.llvm.org/ce/z/RXvYNv). That probably applies to higher int sizes too as the range or a half is relatively narrow. BFloats I would guess the same as float, unfortunately, but it would be good if we could prove one was right or wrong.

Copy link

github-actions bot commented Mar 18, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@pranavk
Copy link
Contributor Author

pranavk commented Mar 18, 2025

Thanks for taking a look.

I think you are right about ->half conversions. For now, I have made this conditional on target type being f32 to try to limit the impact, and also to avoid the backend crash for one of the tests (llvm/test/CodeGen/AArch64/sitofp-to-tbl.ll:sitofp_v8i8_to_v8f16).

The diffs look pretty inefficient as we guessed.

For the test that crashes, backend crashes because sitofp's operand in this test (v8i8) gets promoted to v8i32 instead of custom-lowered. So the place where I added SDValue is never encountered which in other cases leads to expansion of vector ops to scalar nodes. This leads of v8i32in the DAG by the time we call SelectionDAG::Legalize() which now encounters the return SDValue I introduced here which fails expansion and ends up trying to find an appropriate libcall for v4i32 -> v4f16 conversion which fails with stack trace below:

Assertion `LC != RTLIB::UNKNOWN_LIBCALL && "Unable to legalize as libcall"' failed
...
#10 0x000055b3966f8f03 (anonymous namespace)::SelectionDAGLegalize::ConvertNodeToLibcall(llvm::SDNode*) /usr/local/google/home/prka/p/llvm/llvm_main2/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:4800:21
#11 0x000055b3966e7e92 (anonymous namespace)::SelectionDAGLegalize::LegalizeOp(llvm::SDNode*) /usr/local/google/home/prka/p/llvm/llvm_main2/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:1353:7
#12 0x000055b3966e5f04 llvm::SelectionDAG::Legalize() /usr/local/google/home/prka/p/llvm/llvm_main2/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp:5861:13
#13 0x000055b396891413 llvm::SelectionDAGISel::CodeGenAndEmitDAG() /usr/local/google/home/prka/p/llvm/llvm_main2/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:1078:3
#14 0x000055b396890193 llvm::SelectionDAGISel::SelectBasicBlock(llvm::ilist_iterator_w_bits<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void, true, llvm::BasicBlock>, false, true>, llvm::ilist_iterator_w_bits<llvm::ilist_detail::node_options<llvm::Instruction, true, false, void, true, llvm::BasicBlock>, false, true>, bool&) /usr/local/google/home/prka/p/llvm/llvm_main2/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:877:1
#15 0x000055b39688fd76 llvm::SelectionDAGISel::SelectAllBasicBlocks(llvm::Function const&) /usr/local/google/home/prka/p/llvm/llvm_main2/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:1903:11
#16 0x000055b39688cd99 llvm::SelectionDAGISel::runOnMachineFunction(llvm::MachineFunction&) /usr/local/google/home/prka/p/llvm/llvm_main2/llvm/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp:614:3
#17 0x000055b39482de05 (anonymous namespace)::AArch64DAGToDAGISel::runOnMachineFunction(llvm::Mach

Copy link
Collaborator

@davemgreen davemgreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unsigned variants are trying to do some sort of math to keep things in vectors, but it would be better to just scalarize as we have a scalar instruction and the number of vector lanes is only 2. Can we try and get it to scalarize instead?

@@ -0,0 +1,89 @@
; RUN: llc -o - %s | FileCheck %s
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove this test, given the others that change.

Comment on lines 4 to 6
; CHECK: autogen_SD19655
; CHECK: scvtf
; CHECK: ret
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove these old check lines

; CHECK-NEXT: scvtf v1.2d, v7.2d
; CHECK-NEXT: fcvtn2 v4.4s, v5.2d
; CHECK-NEXT: fcvtn2 v2.4s, v3.2d
; CHECK-NEXT: fmov x12, d0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have this still work as before? It is converting to a fp16, but probably doing so in multiple steps so there is still a i64->fp32 convert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried looking. If we want to do that, the check likely needs to be moved earlier before LowerVectorINT_TO_FP (and I am not sure there's any appropriate position for that earlier) as there doesn't seem to be any way to distinguish between instructions that are generated by the compiler (during split, promote, etc.) vs original sint_to_fp instruction once we reach LowerVectorINT_TO_FP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I pattern-matched that and was able to exclude all such tests that eventually end up rounding to f16 as that doesn't exhibit rounding errors we are worried about. That did get rid of ~5 tests that remain unchanged with this PR now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will refactor the existing code in a while. Perhaps there are elegant ways to pattern match a thing like this.

@pranavk pranavk force-pushed the sint_to_fp branch 2 times, most recently from 1dfeb67 to e6a91ff Compare March 20, 2025 21:36
davemgreen added a commit to davemgreen/llvm-project that referenced this pull request Mar 21, 2025
After llvm#130665 these operations will be scalarized to avoid double-rounding.
This updates the cost model to match.

In the future we might be able to use SVE instructions to help, but for the
moment the costs shoule be higher. Costsize and Latency costs are not yet
expected to be accurate.
Copy link
Collaborator

@davemgreen davemgreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put together a patch in #132366 to update the costs. I will look into updating the GISel codegen too, if you are not looking into it. I am hoping it is just an update to the legalization descriptors for the operations.

davemgreen added a commit to davemgreen/llvm-project that referenced this pull request Mar 24, 2025
This is a GISel equivalent of llvm#130665, preventing a double-rounding issue in
sitofp/uitofp by scalarizing i64->f32 converts. Most of the changes are made in
the ActionDefinitionsBuilder for G_SITOFP/G_UITOFP. Because it is legal to
convert i64->f16 itofp without double-rounding, but not a fpround i64->f16,
that variant is lowered to build the two extends.
Copy link
Collaborator

@davemgreen davemgreen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the complex-int-to-fp.ll test to remove the older CHECK lines, to make sure they shouldn't cause a problem.

LGTM, so long as the tests pass. Thanks

@pranavk
Copy link
Contributor Author

pranavk commented Mar 24, 2025

Rebased. If tests pass, I will merge this.

Thank you for reviewing and other patches (cost-model, and GISel changes).

@pranavk pranavk merged commit 0919ab3 into llvm:main Mar 25, 2025
11 checks passed
davemgreen added a commit that referenced this pull request Mar 26, 2025
After #130665 these operations are scalarized to avoid
double-rounding. This updates the cost model to match.

In the future we might be able to use SVE instructions to help, but for
the moment the costs should be higher. Costsize and Latency costs are
not yet expected to be accurate. The vector insert/extract will use the
cost of VectorInsertExtractBaseCost (2 by default).
davemgreen added a commit to davemgreen/llvm-project that referenced this pull request Mar 27, 2025
This is a GISel equivalent of llvm#130665, preventing a double-rounding issue in
sitofp/uitofp by scalarizing i64->f32 converts. Most of the changes are made in
the ActionDefinitionsBuilder for G_SITOFP/G_UITOFP. Because it is legal to
convert i64->f16 itofp without double-rounding, but not a fpround i64->f16,
that variant is lowered to build the two extends.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants