-
Notifications
You must be signed in to change notification settings - Fork 13.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AArch64] Don't try to vectorize fixed point to fp narrowing conversion #130665
Conversation
@llvm/pr-subscribers-backend-aarch64 Author: Pranav Kant (pranavk) ChangesGCC, correctly, doesn't vectorize in this case. Absence of direct instructions to convert larger fixed point to lower floating point precision inadvertently causes rounding leading to subtle differences across ISAs. https://godbolt.org/z/ssEchMWrE Full diff: https://github.com/llvm/llvm-project/pull/130665.diff 1 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 9511206c0660a..8ba763df1360b 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -5095,19 +5095,11 @@ SDValue AArch64TargetLowering::LowerVectorINT_TO_FP(SDValue Op,
uint64_t VTSize = VT.getFixedSizeInBits();
uint64_t InVTSize = InVT.getFixedSizeInBits();
if (VTSize < InVTSize) {
- MVT CastVT =
- MVT::getVectorVT(MVT::getFloatingPointVT(InVT.getScalarSizeInBits()),
- InVT.getVectorNumElements());
- if (IsStrict) {
- In = DAG.getNode(Opc, dl, {CastVT, MVT::Other},
- {Op.getOperand(0), In});
- return DAG.getNode(ISD::STRICT_FP_ROUND, dl, {VT, MVT::Other},
- {In.getValue(1), In.getValue(0),
- DAG.getIntPtrConstant(0, dl, /*isTarget=*/true)});
- }
- In = DAG.getNode(Opc, dl, CastVT, In);
- return DAG.getNode(ISD::FP_ROUND, dl, VT, In,
- DAG.getIntPtrConstant(0, dl, /*isTarget=*/true));
+ // Due to the absence of any vector instructions to directly convert
+ // larger fixed point to lower floating point, we end up using intermediate
+ // representation before finally getting VTSize-d floating point. This extra
+ // rounding can lead to subtly incorrect results.
+ return SDValue();
}
if (VTSize > InVTSize) {
|
Test case? Maybe you have one but forgot to |
There will be a lot of test cases that need updating. We should attempt to limit the damage this does, and I don't believe the default expansion is very efficient for unsigned operations. |
I did notice that arguably turning off vectorization of this operation might be the most efficient way. It does leave the loop rather than unrolling it, but the code is quite a bit tighter and the scalar conversion is precise. |
Yeah - we should probably try to address both. There are two issues going on - no matter where the vector i64->f32 convert comes from (vectorization, intrinsics, optimizations, the front end) we should be producing correctly rounded code (*). This patch is the bug fix for that issue. Then there is the question of whether the vectorizer should be producing the vector operations. That will probably require a cost-model change to increase the costs (which might mean they are still used in loops with other vector operations, they will just be scalarized for that portion, it still might be profitable overall). (* I was wondering if there was something that we could do with fast-math to make this OK. Especially as AFAICT it has been this way for 10+ years without anyone having an issue so far. None of the existing flags match exactly, but maybe contract or afn could be co-opted into it. (If fptrunc+fptrunc could be contracted, maybe fptrunc can be "uncontracted" into fptrunc+fptrunc). Something for a follow-on if we need it maybe, we might be able to use SVE instructions instead for some of them if the architecture feature is present). If you can update all the failing test cases we can take a look at whether they need to change. I believe, for example, that i32->half conversions are always fine (https://alive2.llvm.org/ce/z/RXvYNv). That probably applies to higher int sizes too as the range or a half is relatively narrow. BFloats I would guess the same as float, unfortunately, but it would be good if we could prove one was right or wrong. |
✅ With the latest revision this PR passed the C/C++ code formatter. |
Thanks for taking a look. I think you are right about ->half conversions. For now, I have made this conditional on target type being f32 to try to limit the impact, and also to avoid the backend crash for one of the tests ( The diffs look pretty inefficient as we guessed. For the test that crashes, backend crashes because sitofp's operand in this test (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unsigned variants are trying to do some sort of math to keep things in vectors, but it would be better to just scalarize as we have a scalar instruction and the number of vector lanes is only 2. Can we try and get it to scalarize instead?
@@ -0,0 +1,89 @@ | |||
; RUN: llc -o - %s | FileCheck %s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove this test, given the others that change.
; CHECK: autogen_SD19655 | ||
; CHECK: scvtf | ||
; CHECK: ret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove these old check lines
; CHECK-NEXT: scvtf v1.2d, v7.2d | ||
; CHECK-NEXT: fcvtn2 v4.4s, v5.2d | ||
; CHECK-NEXT: fcvtn2 v2.4s, v3.2d | ||
; CHECK-NEXT: fmov x12, d0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to have this still work as before? It is converting to a fp16, but probably doing so in multiple steps so there is still a i64->fp32 convert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried looking. If we want to do that, the check likely needs to be moved earlier before LowerVectorINT_TO_FP (and I am not sure there's any appropriate position for that earlier) as there doesn't seem to be any way to distinguish between instructions that are generated by the compiler (during split, promote, etc.) vs original sint_to_fp instruction once we reach LowerVectorINT_TO_FP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I pattern-matched that and was able to exclude all such tests that eventually end up rounding to f16 as that doesn't exhibit rounding errors we are worried about. That did get rid of ~5 tests that remain unchanged with this PR now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will refactor the existing code in a while. Perhaps there are elegant ways to pattern match a thing like this.
1dfeb67
to
e6a91ff
Compare
After llvm#130665 these operations will be scalarized to avoid double-rounding. This updates the cost model to match. In the future we might be able to use SVE instructions to help, but for the moment the costs shoule be higher. Costsize and Latency costs are not yet expected to be accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've put together a patch in #132366 to update the costs. I will look into updating the GISel codegen too, if you are not looking into it. I am hoping it is just an update to the legalization descriptors for the operations.
This is a GISel equivalent of llvm#130665, preventing a double-rounding issue in sitofp/uitofp by scalarizing i64->f32 converts. Most of the changes are made in the ActionDefinitionsBuilder for G_SITOFP/G_UITOFP. Because it is legal to convert i64->f16 itofp without double-rounding, but not a fpround i64->f16, that variant is lowered to build the two extends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the complex-int-to-fp.ll test to remove the older CHECK lines, to make sure they shouldn't cause a problem.
LGTM, so long as the tests pass. Thanks
Rebased. If tests pass, I will merge this. Thank you for reviewing and other patches (cost-model, and GISel changes). |
After #130665 these operations are scalarized to avoid double-rounding. This updates the cost model to match. In the future we might be able to use SVE instructions to help, but for the moment the costs should be higher. Costsize and Latency costs are not yet expected to be accurate. The vector insert/extract will use the cost of VectorInsertExtractBaseCost (2 by default).
This is a GISel equivalent of llvm#130665, preventing a double-rounding issue in sitofp/uitofp by scalarizing i64->f32 converts. Most of the changes are made in the ActionDefinitionsBuilder for G_SITOFP/G_UITOFP. Because it is legal to convert i64->f16 itofp without double-rounding, but not a fpround i64->f16, that variant is lowered to build the two extends.
GCC, correctly, doesn't vectorize in this case. Absence of direct instructions to convert larger fixed point to lower floating point precision inadvertently causes rounding leading to subtle differences across ISAs.
https://godbolt.org/z/ssEchMWrE
Co-authored by: @echristo