-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AMDGPU General Protection Faults #735
Comments
File is |
I believe this is the runtime issue that @nickdesaulniers mentioned in IRC? |
Yes, @yshui great job debugging. You've come to the same conclusions I did. While it seems that their driver is compiled w/ 16B stack alignment, the rest of the x86_64 kernel is compiled with 8B stack alignment (as a micro optimization). Thus, there are no guarantees that the stack is actually 16B aligned. I added the I don't know why their kernel uses a 16B stack alignment, since it's not possible to guarantee your stack is actually 16B aligned if called from 8B stack aligned translation units (without additional inline assembly to detect and realign such cases, which I would consider a hack upon a hack). Even if the 16B alignment was removed from their driver, it would still be incorrect to use If we implemented the soft-fp routines in the kernel (there is precedent for lifting them from libgcc's source), we'd also need to remove the calls to The soft-fp routines only make sense if the code is not performance critical. @Ajs1984 suspects it has something to do with setting mode lines; I'm not sure if that would be harmed by soft-fp, but it's worth a try. Once soft-fp routines were implemented, we could then remove the Finally, we could spend more time on x86_64 instruction selection w/ sse but no sse2, but it's a very limited use case that even LLVM's x86_64 backend maintainer had doubts about that approach. |
According to the Makefile, clang should get the |
Hmm, the amdgpu module (specifically, the dc) is the only place the stack is 16 bytes aligned. All other parts of the kernel have 8 bytes alignment. How does this work with gcc? Is the stack pointer somehow aligned before calling the amdgpu functions? Normally you would use |
Probably need input from AMD people to understand. |
I assume that either:
It is not, hence this GPF w/ Clang. You'd have to audit the translation unit boundaries, and annotate them. Without an automated way of doing so or validating them, it would very easy to break and would be a mental burden for AMDGPU driver developers to remember to do and check. Thus, I consider that brittle.
Oh, interesting, I did not know about that flag. If clang implements it, I wonder if we can use it for just those translation units? (Normally, I wouldn't adjust the stack alignment, but such is life). |
If that is true, then passing |
I agree, but maybe there's another reason it's used that I can't think of. I did not go read the commits that added it to see if maybe a good reason was provided, but that's a good place to start.
I've asked the AMDGPU maintainer to ask around and set up a call, and I've asked their partner engineer on an internal thread related to CrOS. I've also spoken in person with an AMDGPU LLVM developer about this FWIW. I haven't heard back from them yet. I give them the benefit of the doubt, but should chase if I don't hear back soon. |
To a very rough approximation, SSE is support for
Not quite. The aligned movs (MOVAPD in this case) require a 16b aligned memory operand, which in this case it doesn't get because it references a 16b object misaligned on an 8b stack. An 8 byte alignment generally, with 16 in a callee is an ABI breakage, unless the function knows it need to cope with misaligned callers. Either
Code generation to use MOVAPD is correct in this case. It is on a temporary variable spilled to the stack, and because the stack is 16 bytes aligned, this is a safe instruction to schedule. |
Well put (more precise).
Agreed.
I believe the global switch occurred before the driver was upstreamed? Maybe it existed out of tree prior to the switch, and only for older kernels?
@yshui pointed out
$ cd linux-next
$ grep -r incoming-stack-boundary
$ echo $?
1 so 1, but also 2: https://godbolt.org/z/N9b67Q |
What is the best solution here? It seems quite clear there is a bug in amdgpu at least, since it should have used So we need to:
? |
I don't immediately see anything in the driver which would necessitate the use of a 16 byte stack. I'd start by dropping that and seeing what explodes. Once Clang sees that the stack is only 8b aligned, its no longer correct to schedule MOVAPD, and it will probably use MOVUPD instead which is the unaligned variant. |
I was thinking more of https://godbolt.org/z/TeiLEr but yes. I don't see suitable prologue code being generated even in GCC. |
|
This feels like papering over an ABI mismatch which is bad behavior we shouldn't bend over backwards to support IMO. While it is a possible route to a solution, I consider it suboptimal.
Agreed.
Oh, good point, and if it doesn't, that's a bug in Clang.
Oh, I was just showing the flag being unrecognized. The code in the link was irrelevant.
Should the flag be You both are making me so happy with your use of godbolt! 💟 |
diff --git a/drivers/gpu/drm/amd/display/dc/calcs/Makefile b/drivers/gpu/drm/amd/display/dc/calcs/Makefile
index 985633c08a26..3c3778de33b4 100644
--- a/drivers/gpu/drm/amd/display/dc/calcs/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/calcs/Makefile
@@ -24,13 +24,7 @@
# It calculates Bandwidth and Watermarks values for HW programming
#
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
- cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
- cc_stack_align := -mstack-alignment=16
-endif
-
-calcs_ccflags := -mhard-float -msse $(cc_stack_align)
+calcs_ccflags := -mhard-float -msse
ifdef CONFIG_CC_IS_CLANG
calcs_ccflags += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dcn20/Makefile b/drivers/gpu/drm/amd/display/dc/dcn20/Makefile
index ddb8d5649e79..13d1fd2bf8f8 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn20/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dcn20/Makefile
@@ -10,13 +10,7 @@ ifdef CONFIG_DRM_AMD_DC_DSC_SUPPORT
DCN20 += dcn20_dsc.o
endif
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
- cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
- cc_stack_align := -mstack-alignment=16
-endif
-
-CFLAGS_$(AMDDALPATH)/dc/dcn20/dcn20_resource.o := -mhard-float -msse $(cc_stack_align)
+CFLAGS_$(AMDDALPATH)/dc/dcn20/dcn20_resource.o := -mhard-float -msse
ifdef CONFIG_CC_IS_CLANG
CFLAGS_$(AMDDALPATH)/dc/dcn20/dcn20_resource.o += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dcn21/Makefile b/drivers/gpu/drm/amd/display/dc/dcn21/Makefile
index ef673bffc241..c4042732c8a0 100644
--- a/drivers/gpu/drm/amd/display/dc/dcn21/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dcn21/Makefile
@@ -3,13 +3,7 @@
DCN21 = dcn21_hubp.o dcn21_hubbub.o dcn21_resource.o
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
- cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
- cc_stack_align := -mstack-alignment=16
-endif
-
-CFLAGS_$(AMDDALPATH)/dc/dcn21/dcn21_resource.o := -mhard-float -msse $(cc_stack_align)
+CFLAGS_$(AMDDALPATH)/dc/dcn21/dcn21_resource.o := -mhard-float -msse
ifdef CONFIG_CC_IS_CLANG
CFLAGS_$(AMDDALPATH)/dc/dcn21/dcn21_resource.o += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dml/Makefile b/drivers/gpu/drm/amd/display/dc/dml/Makefile
index 5b2a65b42403..a604cdbaf317 100644
--- a/drivers/gpu/drm/amd/display/dc/dml/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dml/Makefile
@@ -24,13 +24,7 @@
# It provides the general basic services required by other DAL
# subcomponents.
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
- cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
- cc_stack_align := -mstack-alignment=16
-endif
-
-dml_ccflags := -mhard-float -msse $(cc_stack_align)
+dml_ccflags := -mhard-float -msse
ifdef CONFIG_CC_IS_CLANG
dml_ccflags += -msse2
diff --git a/drivers/gpu/drm/amd/display/dc/dsc/Makefile b/drivers/gpu/drm/amd/display/dc/dsc/Makefile
index b456cd23c6fa..8954ba2c284b 100644
--- a/drivers/gpu/drm/amd/display/dc/dsc/Makefile
+++ b/drivers/gpu/drm/amd/display/dc/dsc/Makefile
@@ -1,13 +1,7 @@
#
# Makefile for the 'dsc' sub-component of DAL.
-ifneq ($(call cc-option, -mpreferred-stack-boundary=4),)
- cc_stack_align := -mpreferred-stack-boundary=4
-else ifneq ($(call cc-option, -mstack-alignment=16),)
- cc_stack_align := -mstack-alignment=16
-endif
-
-dsc_ccflags := -mhard-float -msse $(cc_stack_align)
+dsc_ccflags := -mhard-float -msse
ifdef CONFIG_CC_IS_CLANG
dsc_ccflags += -msse2
but I still see $ llvm-objdump -d drivers/gpu/drm/amd/display/dc/dcn20/dcn20_resource.o | grep movaps
9b3: 0f 28 05 00 00 00 00 movaps (%rip), %xmm0
fcc: 0f 28 05 00 00 00 00 movaps (%rip), %xmm0
$ llvm-objdump -d drivers/gpu/drm/amd/display/dc/calcs/dcn_calcs.o | grep movaps
d4c: 0f 28 c4 movaps %xmm4, %xmm0
1160: 0f 28 d0 movaps %xmm0, %xmm2
... |
Ah - much better.
Godbolt is awesome.
Those 4 are all fine. The first two will be to objects store in .data etc, whose alignment doesn't depend on the stack at all. They are instruction-pointer relative accesses without outstanding relocations (hence the 0's) because the object files haven't been linked yet. The second two are register to register moves, so have no alignment restrictions. You'll want to grep for 'mova' because for the float variants alone, you've got |
I have a kernel built without the stack alignment option, with clang. Seems to work fine so far. |
Ah, sorry, I should have included the full output, and used
You mean with outstanding relocations? If so, that makes sense, and I should have recognized the missing relocation info from the 4B's of zeros.
Right, but I think there were then stack references later on in the full output, IIRC. (Shall post full output). Maybe those had relocations as well though (I always forget
That's great news @yshui ! Can you test running it with Clang and GCC, and if it seems ok, I'm happy to send the above diff with your tested-by+debugged-by && @andyhhp 's suggested by tags (or whatever you all would like) to LKML. |
(And if that works for both, I wonder if we can then further [re-]add |
Yes I did. (Sorry - it was late here.) If you want a S-by tag, I'm |
@nickdesaulniers The gcc side seems to work fine as well, I didn't test adding On the clang side, I hit a GPU timeout, which I don't know if is related to this or not. I still haven't seen a GPF, which is good. |
Great, thanks for the report. I've asked the AMD folks on our internal CrOS thread to test the diff as well. I'll send the patch by EOD, hopefully with 1 more tested by tag on it. 🏁 |
@nickdesaulniers Isn't enabling sse2 for gcc<7.1 also an option? Edit: just tried, nope. https://godbolt.org/z/xXbjMr |
Another question would be: is |
Probably not. They use |
The x86 kernel is compiled with an 8B stack alignment via `-mpreferred-stack-boundary=3` for GCC since 3.6-rc1 via commit d9b0cde ("x86-64, gcc: Use -mpreferred-stack-boundary=3 if supported") or `-mstack-alignment=8` for Clang. Parts of the AMDGPU driver are compiled with 16B stack alignment. Generally, the stack alignment is part of the ABI. Linking together two different translation units with differing stack alignment is dangerous, particularly when the translation unit with the smaller stack alignment makes calls into the translation unit with the larger stack alignment. While 8B aligned stacks are sometimes also 16B aligned, they are not always. Multiple users have reported General Protection Faults (GPF) when using the AMDGPU driver compiled with Clang. Clang is placing objects in stack slots assuming the stack is 16B aligned, and selecting instructions that require 16B aligned memory operands. At runtime, syscall handlers with 8B aligned stack call into code that assumes 16B stack alignment. When the stack is a multiple of 8B but not 16B, these instructions result in a GPF. Remove the code that added compatibility between the differing compiler flags, as it will result in runtime GPFs when built with Clang. Cleanups for GCC will be sent in later patches in the series. Link: ClangBuiltLinux/linux#735 Tested-by: Shirish S <shirish.s@amd.com> Debugged-by: Yuxuan Shui <yshuiv7@gmail.com> Reported-by: Shirish S <shirish.s@amd.com> Reported-by: Yuxuan Shui <yshuiv7@gmail.com> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The x86 kernel is compiled with an 8B stack alignment via `-mpreferred-stack-boundary=3` for GCC since 3.6-rc1 via commit d9b0cde ("x86-64, gcc: Use -mpreferred-stack-boundary=3 if supported") or `-mstack-alignment=8` for Clang. Parts of the AMDGPU driver are compiled with 16B stack alignment. Generally, the stack alignment is part of the ABI. Linking together two different translation units with differing stack alignment is dangerous, particularly when the translation unit with the smaller stack alignment makes calls into the translation unit with the larger stack alignment. While 8B aligned stacks are sometimes also 16B aligned, they are not always. Multiple users have reported General Protection Faults (GPF) when using the AMDGPU driver compiled with Clang. Clang is placing objects in stack slots assuming the stack is 16B aligned, and selecting instructions that require 16B aligned memory operands. At runtime, syscall handlers with 8B aligned stack call into code that assumes 16B stack alignment. When the stack is a multiple of 8B but not 16B, these instructions result in a GPF. Remove the code that added compatibility between the differing compiler flags, as it will result in runtime GPFs when built with Clang. Cleanups for GCC will be sent in later patches in the series. Link: ClangBuiltLinux#735 Tested-by: Shirish S <shirish.s@amd.com> Debugged-by: Yuxuan Shui <yshuiv7@gmail.com> Reported-by: Shirish S <shirish.s@amd.com> Reported-by: Yuxuan Shui <yshuiv7@gmail.com> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
The patches have been accepted and presumably passed AMD's internal testing because they are on their way for 5.4: https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-5.4-2019-10-30 |
Nick's patches have been merged and released in 5.4-rc6. |
dmesg
Disassembly
rbp-30 is not aligned to 16 bytes boundary. Hopefully I didn't misunderstand anything?
The text was updated successfully, but these errors were encountered: