-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SRSO] "missing return thunk" warning when booting on Intel or Zen1 machine #1911
Comments
If I build with IAS and GNU as in two separate build folders then copy The diff of
|
Actually, it is not whitespace difference, it seems like the machine code is slightly different, my fancy diff viewer helped visualize it. |
I can confirm this with clang-16 + Linux 6.4.9 + Nick's and Petr's patches. Kaby Lake laptop. |
Same with clang-16 + kernel 6.1.44 + both patches, intel Kaby Lake. Disappears by disabling LLVM_IAS. |
This might be Intel specific. EDIT: d'oh it was in the title. Though @misotolar 's data point below shows otherwise. FWIW, I cannot reproduce on
(Zen 2) |
It appears to be, as I don't see it on either of my AMD boxes. |
AMD Ryzen 5 3500U, Full LTO:
|
Interesting data point, as that is Zen+, whereas Nick and I only have Zen 2 hardware. |
I think the padding has different encodings between functions, but is a red herring because it is of the same length. The only other difference that I spot is llvm/llvm-project#64603. |
Ok, I need folks that can reproduce this crash (@nathanchance or @misotolar ) to test this for me.
That should invoke GNU as rather than clang's integrated assembler, but make gas behave like clang wrt. the relocations for x86. If that continues to fail to boot with the above trace, then the issue is something other than the relocations. if that boots without issue, then the relocations themselves are the problem. |
I should note that there is no boot failure (at least for me), just the warning in
Is this backwards (i.e., if Regardless:
|
Yes, sorry. Based on @MaskRay 's comment, binutils-2_24 and older may also be broken in the same way then. |
Brothers, please test: From 45bd5cc6edf3dd974ca030a1f969fcec1391acac Mon Sep 17 00:00:00 2001
From: Nick Desaulniers <ndesaulniers@google.com>
Date: Fri, 11 Aug 2023 08:42:07 -0700
Subject: [PATCH] x86/srso: fix "missing return thunk" on non -mno-shared
assemblers
A few users have reported observing the following splat from a
WARN_ONCE:
[ 0.086618] ------------[ cut here ]------------
[ 0.086996] missing return thunk: __ret+0x5/0x7e-__ret+0x0/0x7e: e9 f6 ff ff ff
[ 0.087005] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:753 apply_returns+0x2da/0x4
30
[ 0.088328] Modules linked in:
[ 0.088585] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0-rc5-00056-gcacc6e22932f #1
[ 0.089216] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 0
4/01/2014
[ 0.089329] RIP: 0010:apply_returns+0x2da/0x430
[ 0.089624] Code: ff ff 0f 0b e9 c8 fd ff ff c6 05 60 bd c2 01 01 48 c7 c7 ae 5a 68 bd 4c 89 ee
4c 89 e2 b9 05 00 00 00 4d 89 e8 e8 b6 4d 05 00 <0f> 0b e9 a0 fd ff ff 45 85 e4 0f 84 2e ff ff ff
48 c7 c7 6e 5a 68
[ 0.090328] RSP: 0000:ffffffffbda03e20 EFLAGS: 00010246
[ 0.090740] RAX: cb2b7f056bc62700 RBX: ffffffffbe319188 RCX: ffffffffbda53e80
[ 0.091328] RDX: ffffffffbda03cd8 RSI: 00000000ffffdfff RDI: ffffffffbda84110
[ 0.091891] RBP: ffffffffbda03ef8 R08: 0000000000001fff R09: ffffffffbda54110
[ 0.092328] R10: 0000000000005ffd R11: 0000000000000004 R12: ffffffffbcf60040
[ 0.093328] R13: ffffffffbcf60045 R14: ffffffffbe319180 R15: ffffffffbda03e38
[ 0.093896] FS: 0000000000000000(0000) GS:ffff97db5ee00000(0000) knlGS:0000000000000000
[ 0.094328] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.094775] CR2: ffff97db55001000 CR3: 000000001442a001 CR4: 0000000000770ef0
[ 0.095329] PKRU: 55555554
[ 0.095555] Call Trace:
[ 0.095755] <TASK>
[ 0.095930] ? __warn+0xc3/0x1c0
[ 0.096328] ? apply_returns+0x2da/0x430
[ 0.096621] ? report_bug+0x14e/0x1f0
[ 0.096860] ? handle_bug+0x3d/0x80
[ 0.097087] ? exc_invalid_op+0x1a/0x50
[ 0.097328] ? asm_exc_invalid_op+0x1a/0x20
[ 0.097645] ? __ret+0x5/0x7e
[ 0.097847] ? zen_untrain_ret+0x1/0x1
[ 0.098329] ? apply_returns+0x2da/0x430
[ 0.098586] ? __ret+0x5/0x7e
[ 0.098781] ? __ret+0x14/0x7e
[ 0.098981] ? __ret+0xa/0x7e
[ 0.099175] alternative_instructions+0x47/0x110
[ 0.099329] arch_cpu_finalize_init+0x2c/0x50
[ 0.099613] start_kernel+0x2e4/0x390
[ 0.099853] x86_64_start_reservations+0x24/0x30
[ 0.100328] x86_64_start_kernel+0xab/0xb0
[ 0.100595] secondary_startup_64_no_verify+0x17a/0x17b
[ 0.100957] </TASK>
[ 0.101101] ---[ end trace 0000000000000000 ]---
It seems that the presence of (or lack thereof) relocations in
arch/x86/lib/retpoline.o seem to be triggering this. I'm not certain,
but I suspect that this code may be checking the return thunk BEFORE
relocations have been applied.
GNU as ("GAS") has a command line flag pair -mshared/-mno-shared that
controls this behavior. In binutils 2.25, the implicit default value for
this flag was changed from -mshared to -mno-shared, but only for x86.[0]
Building with KAFLAGS=-Wa,-mshared can reproduce the above splat.
While Documentation/process/changes.rst currently lists binutils 2.25 as
the minimum supported version, the SRSO patches were backported to
stable's linux-5.4.y where binutils 2.21 is still supported. We could
add -Wa,-mno-shared to KBUILD_AFLAGS, but Clang's integrated assembler
doesn't support this flag, and defaults to -mshared for all
architectures. [1]
Instead, we can simply add a local label that aliases the global label
__ret, and refer to that within arch/x86/lib/retpoline.S to avoid any
relocations being generated for any assembler regardless of its implicit
default behavior with respect to -mshared/-mno-shared.
Cc: stable@vger.kernel.org
Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://github.com/ClangBuiltLinux/linux/issues/1911
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=b084df0b8d1262fb1e969c74bcc5c61e262a6199 [0]
Link: https://github.com/llvm/llvm-project/issues/64603 [1]
---
arch/x86/lib/retpoline.S | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index 5c43684ec982..5acb78da5488 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -184,7 +184,7 @@ SYM_FUNC_END(srso_safe_ret_alias)
* from re-poisioning the BTB prediction.
*/
.align 64
- .skip 64 - (__ret - zen_untrain_ret), 0xcc
+ .skip 64 - (.L__ret - zen_untrain_ret), 0xcc
SYM_START(zen_untrain_ret, SYM_L_GLOBAL, SYM_A_NONE)
ANNOTATE_NOENDBR
/*
@@ -217,6 +217,7 @@ SYM_START(zen_untrain_ret, SYM_L_GLOBAL, SYM_A_NONE)
* which will be contained safely by the INT3.
*/
SYM_INNER_LABEL(__ret, SYM_L_GLOBAL)
+SYM_INNER_LABEL(.L__ret, SYM_L_LOCAL)
ret
int3
SYM_CODE_END(__ret)
@@ -230,7 +231,7 @@ SYM_CODE_END(__ret)
* Jump back and execute the RET in the middle of the TEST instruction.
* INT3 is for SLS protection.
*/
- jmp __ret
+ jmp .L__ret
int3
SYM_FUNC_END(zen_untrain_ret)
__EXPORT_THUNK(zen_untrain_ret)
@@ -265,7 +266,7 @@ SYM_FUNC_END(srso_untrain_ret)
__EXPORT_THUNK(srso_untrain_ret)
SYM_FUNC_START(__x86_return_thunk)
- ALTERNATIVE_2 "jmp __ret", "call srso_safe_ret", X86_FEATURE_SRSO, \
+ ALTERNATIVE_2 "jmp .L__ret", "call srso_safe_ret", X86_FEATURE_SRSO, \
"call srso_safe_ret_alias", X86_FEATURE_SRSO_ALIAS
int3
SYM_CODE_END(__x86_return_thunk)
--
2.41.0.694.ge786442a9b-goog @torvic9 @misotolar @autogris if you respond with:
|
That diff appears to work for me in QEMU. I will do a boot test on bare metal shortly. |
Somewhat interesting... I see the warning when booting
Will test that patch to make sure every location that I see the warning in is fixed. |
Then there may be some kind of race between the static call patching of this and relocations. |
That diff resolves the warning for me in all three places that I could reproduce it. |
Not tried the diff yet, but without it, I get the same error message on a Zen2 machine using clang 16.0.6+ThinLTO and the two accepted patches.
and
|
My Zen1 laptop boot |
|
https://github.com/ClangBuiltLinux/linux/commit/150c42407f87463c27a2459e06845965291d9973.patch By combining it with the above two patches, it can be built with Clang(full lto) without any problems. Formatting only. From 45bd5cc6edf3dd974ca030a1f969fcec1391acac Mon Sep 17 00:00:00 2001
From: Nick Desaulniers <ndesaulniers@google.com>
Date: Fri, 11 Aug 2023 08:42:07 -0700
Subject: [PATCH] x86/srso: fix "missing return thunk" on non -mno-shared
assemblers
A few users have reported observing the following splat from a
WARN_ONCE:
[ 0.086618] ------------[ cut here ]------------
[ 0.086996] missing return thunk: __ret+0x5/0x7e-__ret+0x0/0x7e: e9 f6 ff ff ff
[ 0.087005] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:753 apply_returns+0x2da/0x4
30
[ 0.088328] Modules linked in:
[ 0.088585] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.5.0-rc5-00056-gcacc6e22932f #1
[ 0.089216] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.2-1-1 0
4/01/2014
[ 0.089329] RIP: 0010:apply_returns+0x2da/0x430
[ 0.089624] Code: ff ff 0f 0b e9 c8 fd ff ff c6 05 60 bd c2 01 01 48 c7 c7 ae 5a 68 bd 4c 89 ee
4c 89 e2 b9 05 00 00 00 4d 89 e8 e8 b6 4d 05 00 <0f> 0b e9 a0 fd ff ff 45 85 e4 0f 84 2e ff ff ff
48 c7 c7 6e 5a 68
[ 0.090328] RSP: 0000:ffffffffbda03e20 EFLAGS: 00010246
[ 0.090740] RAX: cb2b7f056bc62700 RBX: ffffffffbe319188 RCX: ffffffffbda53e80
[ 0.091328] RDX: ffffffffbda03cd8 RSI: 00000000ffffdfff RDI: ffffffffbda84110
[ 0.091891] RBP: ffffffffbda03ef8 R08: 0000000000001fff R09: ffffffffbda54110
[ 0.092328] R10: 0000000000005ffd R11: 0000000000000004 R12: ffffffffbcf60040
[ 0.093328] R13: ffffffffbcf60045 R14: ffffffffbe319180 R15: ffffffffbda03e38
[ 0.093896] FS: 0000000000000000(0000) GS:ffff97db5ee00000(0000) knlGS:0000000000000000
[ 0.094328] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.094775] CR2: ffff97db55001000 CR3: 000000001442a001 CR4: 0000000000770ef0
[ 0.095329] PKRU: 55555554
[ 0.095555] Call Trace:
[ 0.095755] <TASK>
[ 0.095930] ? __warn+0xc3/0x1c0
[ 0.096328] ? apply_returns+0x2da/0x430
[ 0.096621] ? report_bug+0x14e/0x1f0
[ 0.096860] ? handle_bug+0x3d/0x80
[ 0.097087] ? exc_invalid_op+0x1a/0x50
[ 0.097328] ? asm_exc_invalid_op+0x1a/0x20
[ 0.097645] ? __ret+0x5/0x7e
[ 0.097847] ? zen_untrain_ret+0x1/0x1
[ 0.098329] ? apply_returns+0x2da/0x430
[ 0.098586] ? __ret+0x5/0x7e
[ 0.098781] ? __ret+0x14/0x7e
[ 0.098981] ? __ret+0xa/0x7e
[ 0.099175] alternative_instructions+0x47/0x110
[ 0.099329] arch_cpu_finalize_init+0x2c/0x50
[ 0.099613] start_kernel+0x2e4/0x390
[ 0.099853] x86_64_start_reservations+0x24/0x30
[ 0.100328] x86_64_start_kernel+0xab/0xb0
[ 0.100595] secondary_startup_64_no_verify+0x17a/0x17b
[ 0.100957] </TASK>
[ 0.101101] ---[ end trace 0000000000000000 ]---
It seems that the presence of (or lack thereof) relocations in
arch/x86/lib/retpoline.o seem to be triggering this. I'm not certain,
but I suspect that this code may be checking the return thunk BEFORE
relocations have been applied.
GNU as ("GAS") has a command line flag pair -mshared/-mno-shared that
controls this behavior. In binutils 2.25, the implicit default value for
this flag was changed from -mshared to -mno-shared, but only for x86.[0]
Building with KAFLAGS=-Wa,-mshared can reproduce the above splat.
While Documentation/process/changes.rst currently lists binutils 2.25 as
the minimum supported version, the SRSO patches were backported to
stable's linux-5.4.y where binutils 2.21 is still supported. We could
add -Wa,-mno-shared to KBUILD_AFLAGS, but Clang's integrated assembler
doesn't support this flag, and defaults to -mshared for all
architectures. [1]
Instead, we can simply add a local label that aliases the global label
__ret, and refer to that within arch/x86/lib/retpoline.S to avoid any
relocations being generated for any assembler regardless of its implicit
default behavior with respect to -mshared/-mno-shared.
Cc: stable@vger.kernel.org
Fixes: fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Closes: https://github.com/ClangBuiltLinux/linux/issues/1911
Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=b084df0b8d1262fb1e969c74bcc5c61e262a6199 [0]
Link: https://github.com/llvm/llvm-project/issues/64603 [1]
---
arch/x86/lib/retpoline.S | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/x86/lib/retpoline.S b/arch/x86/lib/retpoline.S
index 5c43684ec982..5acb78da5488 100644
--- a/arch/x86/lib/retpoline.S
+++ b/arch/x86/lib/retpoline.S
@@ -184,7 +184,7 @@ SYM_FUNC_END(srso_safe_ret_alias)
* from re-poisioning the BTB prediction.
*/
.align 64
- .skip 64 - (__ret - zen_untrain_ret), 0xcc
+ .skip 64 - (.L__ret - zen_untrain_ret), 0xcc
SYM_START(zen_untrain_ret, SYM_L_GLOBAL, SYM_A_NONE)
ANNOTATE_NOENDBR
/*
@@ -217,6 +217,7 @@ SYM_START(zen_untrain_ret, SYM_L_GLOBAL, SYM_A_NONE)
* which will be contained safely by the INT3.
*/
SYM_INNER_LABEL(__ret, SYM_L_GLOBAL)
+SYM_INNER_LABEL(.L__ret, SYM_L_LOCAL)
ret
int3
SYM_CODE_END(__ret)
@@ -230,7 +231,7 @@ SYM_CODE_END(__ret)
* Jump back and execute the RET in the middle of the TEST instruction.
* INT3 is for SLS protection.
*/
- jmp __ret
+ jmp .L__ret
int3
SYM_FUNC_END(zen_untrain_ret)
__EXPORT_THUNK(zen_untrain_ret)
@@ -265,7 +266,7 @@ SYM_FUNC_END(srso_untrain_ret)
__EXPORT_THUNK(srso_untrain_ret)
SYM_FUNC_START(__x86_return_thunk)
- ALTERNATIVE_2 "jmp __ret", "call srso_safe_ret", X86_FEATURE_SRSO, \
+ ALTERNATIVE_2 "jmp .L__ret", "call srso_safe_ret", X86_FEATURE_SRSO, \
"call srso_safe_ret_alias", X86_FEATURE_SRSO_ALIAS
int3
SYM_CODE_END(__x86_return_thunk) |
Building kernel 6.1.45 fails with this additional patch, giving the following errror:
LLVM_IAS was set to 1 again, and the 2 previous working patches (https://lore.kernel.org/lkml/20230809-gds-v1-1-eaac90b0cbcc@google.com/ and 150c424) were also applied. |
With this patch on top of the lld fix from 8a9b1f6 applied to either git master or the 6.4.10 stable kernel, I'm seeing the 'Size expression must be absolute' build error too.
This is with llvm-as, clang and lld from the most recent 16.0.6 release, so it may be llvm version dependent? I'm not using LTO, and initially didn't test with the LTO fix from 150c424 but applying this in addition has no effect on the error. Without the retpoline.S patch, the kernel builds fine but emits exactly the "missing return thunk" warning described above when booting a test Skylake host. |
Sorry, it turns out I am an idiot. I fixed it, then re-read the patch you posted more carefully and realised that I had accidentally hand-applied -SYM_INNER_LABEL(__ret, SYM_L_GLOBAL)
+SYM_INNER_LABEL(.L__ret, SYM_L_LOCAL) instead of SYM_INNER_LABEL(__ret, SYM_L_GLOBAL)
+SYM_INNER_LABEL(.L__ret, SYM_L_LOCAL) Your patch didn't have the problem in the first place. Apologies for the noise! When correctly applied (!), your patch does indeed build fine and fixes the warning for both my Intel Skylake box and a Zen2 4800U box that also had the same warning when I tested it. If it's helpful,
|
Sure, here: config.txt |
Oh my, I made the exact same mistake adjusting it manually to 6.1.45. Sorry for the trouble. |
With this patch applied, no more warning on my Zen2 box running 6.4.10. For Zen2: |
Boris points out that the second patch of Peter's clean up series (which depends on the first) should resolve this and I can confirm that for at least mainline and 6.4. 6.1 and earlier seems a lot harder since that was before call depth tracking was added. |
Peter mentions on IRC that 770ae1b should be applied to 6.1.y first. I confirmed it applies cleanly, and @nathanchance confirmed then that the next two patches applied + fixed the issue for that branch of stable |
Peter's patch is now in -tip: https://git.kernel.org/tip/d43490d0ab824023e11d0b57d0aeec17a6e0ca13 |
This is now in mainline: https://git.kernel.org/linus/d43490d0ab824023e11d0b57d0aeec17a6e0ca13 It has been backported to 6.4 so far but it should go back to 6.1 and 5.15, it appears the conflicts/issues with all the SRSO fixes have delayed these backports past 6.4 for now. |
Do we want to keep this open to track backports? |
Sure. |
Stable backports out for review. 6.1: https://lore.kernel.org/20230824141447.392496309@linuxfoundation.org/ |
The fix is now released in the stable kernels. 6.1: https://git.kernel.org/stable/c/44dbc912fd8a132f01eca8fc0928d6c9beb1f51c |
After commit fb3bd914b3ec ("x86/srso: Add a Speculative RAS Overflow mitigation"), I see the following stack trace when booting on my two pieces of Intel hardware (either bare metal or virtually):
I can reproduce this with just
CC=clang
, so it appears unrelated to the existing issues withld.lld
. This appears to be related to the integrated assembler, as this is not reproducible withCC=clang LLVM_IAS=0
...The text was updated successfully, but these errors were encountered: