AMDGPU general protection fault with LTO #1374
Comments
Please do try a vanilla kernel first if it is not too difficult. Some custom kernels have a lot of out of tree patches, which can make things more complicated to debug. |
Makes sense. I just did, and same error. Logs: 5.12.3-vanilla.log |
Your build command is a little odd:
I do not see
I have no idea if this is actually related to your issue, just a general observation. It may be worth reporting this to |
from the initial log:
so some bad stack access in dcn30_update_bw_bounding_box perhaps? @aruhier can you provide the output of |
I do, I export |
Sorry, I'm really not used to working with LLVM binaries, so I probably won't be able to adapt the commands you give me very much.
Which is probably not what you want. However, I ran it on the amdgpu module: amdgpu.log Edit: sorry, for that I had to recompile my kernel, as I had cleaned my modules (I won't do that again for the next debugging steps). To make debugging easier, here's the log of the panic with this kernel and module: 5.12.3-vanilla.log |
$ python -c 'print(hex(0x2efc90+0x33))'
0x2efcc3
So again, we're using SSE/SIMD instructions that require 16B aligned operands. If the stack was 0xffffb80d81b6b5d8, then 0xffffb80d81b6b5d8 + 240 is 0xffffb80d81b6b6c8 which is obviously not a multiple of 16B. This mentions that movaps will GP fault on misaligned operands. So this is the same symptom as #735 . So my next line of investigation is:
It looks like @aruhier can you provide the output of your build invocation with |
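The alignment arithmetic in the comment above can be re-checked with a few lines of Python. This is only a sketch of the calculation: the stack address and the 240-byte offset are the values quoted in that comment, while the helper name `is_aligned` is made up for illustration.

```python
# Values quoted in the comment above (not re-derived from the logs).
STACK_POINTER = 0xFFFFB80D81B6B5D8
OFFSET = 240
# Aligned SSE moves such as movaps require 16-byte aligned memory operands.
SSE_ALIGNMENT = 16


def is_aligned(address: int, alignment: int) -> bool:
    """Return True if address is a multiple of alignment (hypothetical helper)."""
    return address % alignment == 0


operand = STACK_POINTER + OFFSET

print(hex(operand))                        # address of the memory operand
print(operand % SSE_ALIGNMENT)             # remainder; non-zero means misaligned
print(is_aligned(operand, SSE_ALIGNMENT))  # 16-byte check used by movaps
print(is_aligned(operand, 8))              # 8-byte check, for comparison
```

The operand address is a multiple of 8 but not of 16, which matches the description of a misaligned `movaps` operand.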
We do have this in
Might not work correctly, of course. |
Here is the full log of the build: genkernel-vanilla.log.gz I looked at
|
@samitolvanen is I don't think so, from the provided log:
Indeed. |
The linker flags used for
Based on the log, it definitely looks like something isn't defined correctly there. |
KBUILD_LDFLAGS is getting clobbered in ...
ifdef CONFIG_LTO_CLANG
KBUILD_LDFLAGS += -plugin-opt=-code-model=kernel \
-plugin-opt=-stack-alignment=$(if $(CONFIG_X86_32),4,8)
endif
...
KBUILD_LDFLAGS := -m elf_$(UTS_MACHINE)
... I'll draft up a patch and have @aruhier test it before sending it along. |
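To make the clobbering easier to see, here is a toy Python model of the two assignment operators involved. It is not GNU Make and ignores everything except ordering: `:=` replaces the variable's contents, `+=` appends to them. The function name `apply_assignments` is hypothetical, and `-m elf_x86_64` assumes `$(UTS_MACHINE)` expands to `x86_64` on a 64-bit build (so the stack alignment value is 8).

```python
def apply_assignments(assignments):
    """Apply (operator, value) pairs to a single variable, in order.

    Toy model only: ':=' replaces the contents, '+=' appends to them.
    """
    words = []
    for operator, value in assignments:
        if operator == ":=":
            words = value.split()          # earlier contents are discarded
        elif operator == "+=":
            words = words + value.split()  # earlier contents are kept
        else:
            raise ValueError(f"unsupported operator: {operator}")
    return " ".join(words)


# Flags from the snippet above, with the 64-bit alignment value filled in.
LTO_FLAGS = "-plugin-opt=-code-model=kernel -plugin-opt=-stack-alignment=8"
# Assumption: $(UTS_MACHINE) expands to x86_64.
EMULATION = "-m elf_x86_64"

# Order seen in the thread: LTO flags appended first, then ':=' resets the variable.
clobbered = apply_assignments([("+=", LTO_FLAGS), (":=", EMULATION)])

# Order after the fix described later in the thread: emulation flag first
# (using '+='), LTO flags appended afterwards.
preserved = apply_assignments([("+=", EMULATION), ("+=", LTO_FLAGS)])

print(clobbered)
print(preserved)
```

In the first ordering the `-plugin-opt=` flags never reach the linker; in the second they survive alongside the emulation flag.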
Sure, thanks! |
@aruhier please apply this and see if resolves your issue! Additionally, please provide me with an email address if you would like a proper @nickdesaulniers @samitolvanen if you have any suggestions for the wording, let me know. |
I will also stick a |
I'll test it in ~12h, thanks a lot for this patch! |
The patch looks good to me, but would it be better to just change the |
Thank you for reporting it :)
Sounds good, I have updated the patch now. Once you test it and confirm it works, I will add a |
yeah, resetting KBUILD_LDFLAGS seems like a mistake TBH; not sure if that changes the Fixes tag. |
Sure, I can change the |
Built today on 5.12.4, and it runs great 🎉 PS: I don't know whether you want to close this issue once the patch is sent, so I left it open, but feel free to close it. |
Patch submitted: https://lore.kernel.org/r/20210518190106.60935-1-nathan@kernel.org/
We will close the issue when this patch is merged into Linus's tree. Thanks again for the report and testing! |
Merged into mainline: https://git.kernel.org/torvalds/c/0024430e920f This should be backported to stable automatically. |
Commit b33fff0 ("x86, build: allow LTO to be selected") added a couple of '-plugin-opt=' flags to KBUILD_LDFLAGS because the code model and stack alignment are not stored in LLVM bitcode. However, these flags were added to KBUILD_LDFLAGS prior to the emulation flag assignment, which uses ':=', so they were overwritten and never added to $(LD) invocations.

The absence of these flags caused misalignment issues in the AMDGPU driver when compiling with CONFIG_LTO_CLANG, resulting in general protection faults.

Shuffle the assignment below the initial one so that the flags are properly passed along and all of the linker flags stay together.

At the same time, avoid any future issues with clobbering flags by changing the emulation flag assignment to '+=' since KBUILD_LDFLAGS is already defined with ':=' in the main Makefile before being exported for modification here as a result of commit ce99d0b ("kbuild: clear LDFLAGS in the top Makefile").

Cc: stable@vger.kernel.org
Fixes: b33fff0 ("x86, build: allow LTO to be selected")
Link: ClangBuiltLinux#1374
Reported-by: Anthony Ruhier <aruhier@mailbox.org>
Tested-by: Anthony Ruhier <aruhier@mailbox.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210518190106.60935-1-nathan@kernel.org
Linking this in https://gitlab.freedesktop.org/drm/amd/-/issues/1555 |
commit 0024430 upstream.

Commit b33fff0 ("x86, build: allow LTO to be selected") added a couple of '-plugin-opt=' flags to KBUILD_LDFLAGS because the code model and stack alignment are not stored in LLVM bitcode. However, these flags were added to KBUILD_LDFLAGS prior to the emulation flag assignment, which uses ':=', so they were overwritten and never added to $(LD) invocations.

The absence of these flags caused misalignment issues in the AMDGPU driver when compiling with CONFIG_LTO_CLANG, resulting in general protection faults.

Shuffle the assignment below the initial one so that the flags are properly passed along and all of the linker flags stay together.

At the same time, avoid any future issues with clobbering flags by changing the emulation flag assignment to '+=' since KBUILD_LDFLAGS is already defined with ':=' in the main Makefile before being exported for modification here as a result of commit:

ce99d0b ("kbuild: clear LDFLAGS in the top Makefile")

Fixes: b33fff0 ("x86, build: allow LTO to be selected")
Reported-by: Anthony Ruhier <aruhier@mailbox.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Anthony Ruhier <aruhier@mailbox.org>
Cc: stable@vger.kernel.org
Link: ClangBuiltLinux/linux#1374
Link: https://lore.kernel.org/r/20210518190106.60935-1-nathan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Commit b33fff0 ("x86, build: allow LTO to be selected") added a couple of '-plugin-opt=' flags to KBUILD_LDFLAGS because the code model and stack alignment are not stored in LLVM bitcode. However, these flags were added to KBUILD_LDFLAGS prior to the emulation flag assignment, which uses ':=', so they were overwritten and never added to $(LD) invocations.

The absence of these flags caused misalignment issues in the AMDGPU driver when compiling with CONFIG_LTO_CLANG, resulting in general protection faults.

Shuffle the assignment below the initial one so that the flags are properly passed along and all of the linker flags stay together.

At the same time, avoid any future issues with clobbering flags by changing the emulation flag assignment to '+=' since KBUILD_LDFLAGS is already defined with ':=' in the main Makefile before being exported for modification here as a result of commit:

ce99d0b ("kbuild: clear LDFLAGS in the top Makefile")

Fixes: b33fff0 ("x86, build: allow LTO to be selected")
Reported-by: Anthony Ruhier <aruhier@mailbox.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Anthony Ruhier <aruhier@mailbox.org>
Cc: stable@vger.kernel.org
Link: ClangBuiltLinux/linux#1374
Link: https://lore.kernel.org/r/20210518190106.60935-1-nathan@kernel.org
(cherry picked from commit 0024430)
Bug: 187129171
Signed-off-by: Connor O'Brien <connoro@google.com>
Change-Id: I9f9c056829483f341251cc7407d0029c05e8b503
Hello,
I'm using linux-xanmod 5.12, and tried clang with thinlto. It boots but when the amdgpu driver loads, I get a general protection fault. Same with full LTO. However it runs fine without LTO.
Xanmod is using a multigen LRU, but disabling it gives the same error (so I can try with a vanilla kernel, but I don't think that's the issue here).
Logs: 5.12.3-2.log
Config: kernel-config-5.12.3-x86_64.txt
CPU: Ryzen 5950x
GPU: AMD 6800xt
LLVM: 12
Distribution: Gentoo, using genkernel 4.2.1.