stack protector boot failure in android-4.14-stable #1815

Closed
nathanchance opened this issue Mar 15, 2023 · 28 comments
Labels
[ARCH] x86_64: This bug impacts ARCH=x86_64
boot failure: This issue results in a failure to boot
[BUG] llvm: A bug that should be fixed in upstream LLVM
[FIXED][LLVM] 16: This bug was fixed in LLVM 16.0
[PATCH] Submitted: A patch has been submitted for review

Comments

@nathanchance
Member

CI reports:

https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/4423392922/jobs/7758026839

It can be reproduced at llvmorg-16.0.0-rc4.

$ make -skj"$(nproc)" AR=llvm-ar ARCH=x86_64 CC=clang LD=ld.lld NM=llvm-nm O=build mrproper x86_64_cuttlefish_defconfig all
...

$ boot-qemu.py -a x86_64 -k build -t 30s
QEMU location: /usr/bin

QEMU version: QEMU emulator version 7.2.0
$ timeout --foreground 25s stdbuf -eL -oL /usr/bin/qemu-system-x86_64 -display none -nodefaults -no-reboot -d unimp,guest_errors -append 'console=ttyS0 earlycon=uart8250,io,0x3f8' -kernel .../build/arch/x86/boot/bzImage -initrd .../images/x86_64/rootfs.cpio -cpu host -enable-kvm -m 512m -smp 8 -serial mon:stdio
[    0.000000] Linux version 4.14.309-00199-ge8edc5a4238a (nathan@dev-arch.thelio-3990X) (ClangBuiltLinux clang version 16.0.0, ClangBuiltLinux LLD 16.0.0) #1 SMP PREEMPT Wed Mar 15 14:43:49 MST 2023
[    0.000000] Command line: console=ttyS0 earlycon=uart8250,io,0x3f8
...
[    0.010000] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[    0.010000] Spectre V2 : Mitigation: Retpolines
[    0.010000] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[    0.010000] Spectre V2 : Spectre v2 / SpectreRSB : Filling RSB on VMEXIT
[    0.010000] RETBleed: Vulnerable
[    0.010000] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[    0.010000] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[    0.010006] Freeing SMP alternatives memory: 132K
[    0.012725] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in:         (ptrval)
[    0.012725]
[    0.013587] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.309-00199-ge8edc5a4238a #1
[    0.014176] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.1-1-1 04/01/2014
[    0.014903] Call Trace:
[    0.015096]  panic+0x11b/0x570
[    0.015327]  ? acpi_enable+0x110/0x210
[    0.015613]  ? start_kernel+0xc5d/0xc90
[    0.015905]  ? early_idt_handler_array+0x120/0x120
[    0.016277]  __stack_chk_fail+0x14/0x20
[    0.016565]  start_kernel+0xc5d/0xc90
[    0.016864]  x86_64_start_reservations+0x30a/0x310
[    0.017250]  x86_64_start_kernel+0x44a/0x450
[    0.017579]  secondary_startup_64+0xa5/0xb0
[    0.017931] Rebooting in 5 seconds..

This appears to be an LLVM 16 regression, as it is not seen with LLVM 15. It could also be related to LTO but I have not tried that yet since I just wanted to get the initial report out there.

@nathanchance nathanchance added [BUG] Untriaged Something isn't working [ARCH] x86_64 This bug impacts ARCH=x86_64 boot failure This issue results in a failure to boot labels Mar 15, 2023
@nathanchance
Member Author

nathanchance commented Mar 16, 2023

This is not reproducible without LTO. My LLVM bisect landed on

llvm/llvm-project@d656ae2

d656ae28095726830f9beb8dbd4d69f5144ef821 is the first bad commit
commit d656ae28095726830f9beb8dbd4d69f5144ef821
Author: Xiang1 Zhang <xiang1.zhang@intel.com>
Date:   Fri Dec 9 19:16:00 2022 +0800

    Enhance stack protector
    
    Reviewed By: LuoYuanke
    
    Differential Revision: https://reviews.llvm.org/D139254

 llvm/lib/CodeGen/StackProtector.cpp                |  69 +++++++--
 llvm/test/CodeGen/X86/stack-protector-2.ll         |  30 ++++
 llvm/test/CodeGen/X86/stack-protector-no-return.ll | 165 +++++----------------
 .../CodeGen/X86/stack-protector-recursively.ll     |  26 ++++
 4 files changed, 152 insertions(+), 138 deletions(-)
 create mode 100644 llvm/test/CodeGen/X86/stack-protector-recursively.ll
bisect found first bad commit
# bad: [603c286334b07f568d39f6706c848f576914f323] Bump the trunk major version to 17
# good: [809855b56f06dd7182685f88fbbc64111df9339a] Bump the trunk major version to 16
git bisect start 'llvmorg-17-init' 'llvmorg-16-init'
# good: [a784de783af5096e593c5e214c2c78215fe303f5] [flang] Add -ffp-contract option processing
git bisect good a784de783af5096e593c5e214c2c78215fe303f5
# good: [3c700cf754dbeb5f1f7c1e03e1f04ed716f7d9dc] [BOLT] Use std::optional instead of None in comments (NFC)
git bisect good 3c700cf754dbeb5f1f7c1e03e1f04ed716f7d9dc
# bad: [5f2acfb6356f012a9191e8ea931ce7be686e1ba8] [ScalarEvolutionExpanderTest] Avoid sign warning
git bisect bad 5f2acfb6356f012a9191e8ea931ce7be686e1ba8
# bad: [4b2cf982cc51b425b935842e64aa7ec645ad6807] [clangd] Support type hints for `decltype(expr)`
git bisect bad 4b2cf982cc51b425b935842e64aa7ec645ad6807
# bad: [8580de156f2bceebd31d884a5217f936a7da1ea6] [NFC] Fixes ModuleMaker example build failure caused by c143b77b30fc23f70aac94be66e412651771c0fc
git bisect bad 8580de156f2bceebd31d884a5217f936a7da1ea6
# bad: [b19c26747f4e76549d29573946f097851db14829] [AMDGPU][GFX940][DOC][NFC] Update assembler syntax description
git bisect bad b19c26747f4e76549d29573946f097851db14829
# bad: [8cdb1aa1ec2ba15f5ec8641f5ece23758bf15a06] [Clang] Convert PCH tests to opaque pointers (NFC)
git bisect bad 8cdb1aa1ec2ba15f5ec8641f5ece23758bf15a06
# bad: [62fec084d67af5b3d55b09271a5b9aab604698f5] [mlir] Add LLDB visualizers for MLIR constructs
git bisect bad 62fec084d67af5b3d55b09271a5b9aab604698f5
# good: [29e8de5de1611dad9c71f351616992ceaeb5c5cc] [VPlan] Summarize recipes used to model inductions (NFC).
git bisect good 29e8de5de1611dad9c71f351616992ceaeb5c5cc
# bad: [47b9da72e032f8042d0fdbfef75ecfbb3c6960eb] [VP][RISCV] Add vp.bitreverse and RISC-V support.
git bisect bad 47b9da72e032f8042d0fdbfef75ecfbb3c6960eb
# good: [4d1c5b946ad7f10d398b43e7f20a528407fb79b9] [openmp] Fix a doc comment issue found by -Wdocumentation
git bisect good 4d1c5b946ad7f10d398b43e7f20a528407fb79b9
# bad: [94f290e71600bf694646c454b0618bb3504bc711] [AArch64][NFC] Add tests for D134260
git bisect bad 94f290e71600bf694646c454b0618bb3504bc711
# good: [26330e5f5dc7466d9809091f904a5cb100bc17f6] [gn build] Port 443b46e6d313
git bisect good 26330e5f5dc7466d9809091f904a5cb100bc17f6
# bad: [d656ae28095726830f9beb8dbd4d69f5144ef821] Enhance stack protector
git bisect bad d656ae28095726830f9beb8dbd4d69f5144ef821
# first bad commit: [d656ae28095726830f9beb8dbd4d69f5144ef821] Enhance stack protector

@nickdesaulniers nickdesaulniers self-assigned this Mar 27, 2023
@nickdesaulniers
Member

I can repro the report.

acpi_enable takes this path out.

https://reviews.llvm.org/D139254 mentions skipping the check for noreturn calls. I didn't see any such calls in the IR, though perhaps LTO is able to deduce calls to functions outside of the TU are noreturn even when they weren't explicitly forward declared as such.

Adding __attribute__((no_stack_protector)) to acpi_enable simply moves the panic up one frame to start_kernel (without any hint to "Kernel stack is corrupted"). Adding noinline to acpi_enable does nothing.

I don't see any difference in control flow between LLVM commits (26330e5f5dc7 and d656ae280957).
26330e5f5dc7.good.txt
d656ae280957.bad.txt
Still digging.

@nickdesaulniers
Member

Enabling KASAN disables LTO, so the failure is not reproducible with KASAN.
Not reproducible with UBSAN enabled (LTO stays on).
Getting strange crashes in GDB trying to isolate/follow this.

@nickdesaulniers
Member

There appears to be a stack guard check in start_kernel after the call to arch_post_acpi_subsys_init before rest_init. That's what's calling __stack_chk_fail. Without https://reviews.llvm.org/D139254, there is no stack protector generated.

This smells like
commit a9a3ed1 ("x86: Fix early boot crash on gcc-10, third try")
not being obeyed.

@nickdesaulniers
Member

There are actually 2 stack guard checks inserted into start_kernel. If I can find the two functions that were inlined and mark them noinline, that should allow us to boot. Not sure yet if that's the proper fix, vs trying to deploy the fn attr no_stack_protector which is nascent and not all compiler versions support just yet.

@nickdesaulniers
Member

nickdesaulniers commented Mar 27, 2023

We discussed adding the attribute before:

Specifically, it looks like perhaps it's the calls to memblock_virt_alloc_nopanic from setup_command_line that's triggering this, and panic, and rest_init. If I make a copy of memblock_virt_alloc_nopanic, mark that copy noinline, then call that copy (rather than memblock_virt_alloc_nopanic) in setup_command_line that eliminates some of the control flow paths to __stack_chk_fail.

rest_init is marked sspstrong in IR, but not noreturn. I'm not sure yet why the call to rest_init is followed by a ud2. (perhaps related to a9a3ed1)

The comment block in kernel/sched/idle.c cpu_startup_entry is suspect.

        /*
         * This #ifdef needs to die, but it's too late in the cycle to
         * make this generic (arm and sh have never invoked the canary
         * init for the non boot cpus!). Will be fixed in 3.11
         */
#ifdef CONFIG_X86
        /*
         * If we're the non-boot CPU, nothing set the stack canary up
         * for us. The boot CPU already has it initialized but no harm
         * in doing it again. This is a good place for updating it, as
         * we wont ever return from this function (so the invalid
         * canaries already on the stack wont ever trigger).
         */
        boot_init_stack_canary();
#endif

(introduced by commit cf37b6b ("sched/idle: Move cpu/idle.c to sched/idle.c"))
Didn't a9a3ed1 mention boot_init_stack_canary already being called? Is the boot cpu having its stack canary initialized twice (for the main thread)?

start_kernel
  -> boot_init_stack_canary()
  -> arch_call_rest_init()
    -> rest_init()
      -> cpu_startup_entry()
        -> boot_init_stack_canary() (oops!)

@nickdesaulniers
Member

nickdesaulniers commented Mar 27, 2023

diff --git a/init/main.c b/init/main.c
index d50ea3c3473e..3320c7ed2f6a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -535,6 +535,7 @@ static void __init mm_init(void)
        pti_init();
 }
 
+__attribute__((no_stack_protector))
 asmlinkage __visible void __init start_kernel(void)
 {
        char *command_line;

still panics with another stack check failure:

[    0.010000] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: 00000000f6f32a35
[    0.010000] 
[    0.010000] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.309-01424-ge8edc5a4238a-dirty #11
[    0.010000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[    0.010000] Call Trace:
[    0.010000]  panic+0x11b/0x570
[    0.010000]  ? tick_check_new_device+0x2a7/0x300
[    0.010000]  ? start_secondary+0x4ca/0x4d0
[    0.010000]  ? setup_boot_APIC_clock.cfi_jt+0x8/0x8
[    0.010000]  ? unlz4.cfi_jt+0x8/0x8
[    0.010000]  __stack_chk_fail+0x14/0x20
[    0.010000]  start_secondary+0x4ca/0x4d0
[    0.010000]  secondary_startup_64+0xa5/0xb0

Probably need the attribute on start_secondary or cpu_startup_entry as well.

@nickdesaulniers
Member

The call to boot_init_stack_canary in cpu_startup_entry should probably be moved to arch/x86/kernel/smpboot.c start_secondary before the call to cpu_startup_entry.

@nickdesaulniers
Member

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 66f2a950935a..821a9ed2d339 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -210,6 +210,7 @@ static int enable_start_cpu0;
 /*
  * Activate a secondary processor.
  */
+__attribute__((no_stack_protector))
 static void notrace start_secondary(void *unused)
 {
        /*
diff --git a/init/main.c b/init/main.c
index d50ea3c3473e..3320c7ed2f6a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -535,6 +535,7 @@ static void __init mm_init(void)
        pti_init();
 }
 
+__attribute__((no_stack_protector))
 asmlinkage __visible void __init start_kernel(void)
 {
        char *command_line;

allows us to boot (though this should get wired up properly in include/linux/compiler_attributes.h).

I'll need to think hard about whether we want to remove prevent_tail_call_optimization if !__has_attribute(no_stack_protector) and how best to support 4.14.y.
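
For illustration, the wiring mentioned above could look roughly like the sketch below. The macro name `__no_stack_protector`, the `HAVE_NO_STACK_PROTECTOR` helper, and the function `reseeds_canary` are assumptions made up for this example and are not taken from the 4.14 tree. On compilers that lack the attribute, the macro simply expands to nothing.

```c
/*
 * Sketch only: a portable wrapper around the no_stack_protector attribute.
 * The macro and function names here are hypothetical.
 */
#ifndef __has_attribute
# define __has_attribute(x) 0   /* compilers without __has_attribute */
#endif

#if __has_attribute(no_stack_protector)
# define __no_stack_protector __attribute__((no_stack_protector))
# define HAVE_NO_STACK_PROTECTOR 1
#else
# define __no_stack_protector   /* expands to nothing on older compilers */
# define HAVE_NO_STACK_PROTECTOR 0
#endif

/* Hypothetical stand-in for a function that replaces the canary midway. */
__no_stack_protector
int reseeds_canary(int x)
{
    char buf[64];               /* a local array would normally get a canary */

    buf[0] = (char)x;
    return buf[0] + 1;
}
```

With a wrapper like this, callers of boot_init_stack_canary could be annotated without breaking builds on toolchains that predate the attribute, although on those toolchains the annotation has no effect.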

@nickdesaulniers
Member

nickdesaulniers commented Mar 28, 2023

Here's my understanding of the bug (and a proto commit message):

If LLVM can see the definition of cpu_startup_entry across translation units, such as when CONFIG_LTO_CLANG_FULL is enabled, it's able to infer that cpu_startup_entry is noreturn. Because rest_init ends in an unconditional call to cpu_startup_entry, rest_init is transitively inferred to be noreturn as well.

clang-16 will now check the stack protector before calling noreturn functions. This behavior differs from GCC.

When rest_init is not inlined into start_kernel (config dependent), the call to rest_init (which was inferred to be noreturn) will be preceded by a stack canary check. This check will unconditionally fail, since the stack canary set at the beginning of start_kernel will have been swapped with a new canary value when boot_init_stack_canary was called midway through start_kernel.

Back during the discussion that ultimately led to

commit a9a3ed1 ("x86: Fix early boot crash on gcc-10, third try")

it was obvious that a true solution required marking each caller of boot_init_stack_canary as __attribute__((no_stack_protector)). At the time, such a function attribute did not exist in GCC, but was added in gcc-11 (and clang-7).

Now that this option exists in both toolchains, we should start looking at how best to use it to better express intent, thus rolling back

commit a9a3ed1 ("x86: Fix early boot crash on gcc-10, third try")

at least when newer toolchains are in use. We might want to do this in separate steps in order to fix this on branches of stable. Another potential cleanup here is that it appears that boot_init_stack_canary is called twice for the boot cpu.

Cc: stable@vger.kernel.org
Cc: Xiang1 Zhang xiang1.zhang@intel.com
Cc: Luo, Yuanke yuanke.luo@intel.com
Reported-by: Nathan Chancellor nathan@kernel.org
Link: #1815
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58245
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94722
Link: https://reviews.llvm.org/rGd656ae28095726830f9beb8dbd4d69f5144ef821
Link: https://lore.kernel.org/all/20200316130414.GC12561@hirez.programming.kicks-ass.net/
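
To make the failure mode concrete, here is a small user-space model of the sequence described in the proto commit message above. It is only a sketch: `stack_guard`, `reseed_guard` and `simulate_start_kernel` are hypothetical stand-ins for the per-CPU canary, boot_init_stack_canary() and start_kernel(), and the real check is emitted by the compiler rather than written by hand.

```c
#include <stdint.h>

/* Simulated global stack guard; stands in for the per-CPU canary. */
static uint64_t stack_guard = 0x1111111111111111ULL;

/* Stand-in for boot_init_stack_canary(): installs a different guard value. */
static void reseed_guard(void)
{
    stack_guard ^= 0xdeadbeefULL;   /* XOR with nonzero always changes it */
}

/*
 * Models a stack-protected function. The "prologue" copies the guard into
 * the frame; a later check compares that copy against the current guard.
 * Returns 1 if the check would end up in __stack_chk_fail(), else 0.
 */
int simulate_start_kernel(int reseed_midway, int check_before_noreturn)
{
    uint64_t saved = stack_guard;       /* prologue: save the canary */

    if (reseed_midway)
        reseed_guard();                 /* canary replaced mid-function */

    if (check_before_noreturn)          /* check before the noreturn call */
        return saved != stack_guard;    /* mismatch reads as corruption */

    return 0;                           /* no check: the call just proceeds */
}
```

In this model, `simulate_start_kernel(1, 1)` reports a failure even though nothing overwrote the frame, which mirrors the panic above: the saved copy is stale, not corrupted.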

@nickdesaulniers
Member

nickdesaulniers commented Mar 28, 2023

commit d4e5268 ("x86,objtool: Mark cpu_startup_entry() __noreturn")

marked cpu_startup_entry explicitly as noreturn in v5.18-rc5.

When building mainline with LTO, I only see the call to panic as noreturn. I wonder why rest_init isn't inferred as noreturn...in the IR it is! Ah, because the call to arch_call_rest_init was not inlined. I wonder if we can improve llvm's attributor, though that will just make this problem more obvious on mainline.

TODO: write a test for

...
; Function Attrs: fn_ret_thunk_extern noinline noredzone noreturn nounwind null_pointer_is_valid sspstrong
define hidden void @rest_init() local_unnamed_addr #6 section ".ref.text" align 16 {
...
; Function Attrs: cold fn_ret_thunk_extern noredzone nounwind null_pointer_is_valid optsize sspstrong
define weak hidden void @arch_call_rest_init() local_unnamed_addr #4 section ".init.text" align 16 {
entry:
  tail call void @rest_init() #29
  unreachable
}

in llvm/unittests/Transforms/IPO/AttributorTest.cpp for AANoReturn.

@nickdesaulniers
Member

Yi Kong mentions there's -mllvm -disable-check-noreturn-call to undo this. Guessing Android already shipped this flag globally. I might send that back through to stable 4.14, then pursue no_stack_protector.

@nickdesaulniers
Member

TODO: write a test for
in llvm/unittests/Transforms/IPO/AttributorTest.cpp for AANoReturn.

Seems it's addNoReturnAttrs in llvm/lib/Transforms/IPO/FunctionAttrs.cpp. Not sure yet why it's failing to mark arch_call_rest_init as noreturn. This should work. https://godbolt.org/z/evdo9zxYn

@nickdesaulniers
Member

Not sure yet why it's failing to mark arch_call_rest_init as noreturn.

Looks like because arch_call_rest_init has weak linkage, the definition is not considered "exact" so propagation of noreturn is skipped. I'm not sure that's correct. Let me see if we can fix that.
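
A rough illustration of why a weak definition is treated as inexact. It uses a return value instead of noreturn because that is easier to observe. `arch_hook` and `call_hook` are hypothetical names, and this is a sketch of the general linkage concern rather than of what addNoReturnAttrs does internally.

```c
/*
 * A weak default. At link time a strong definition of arch_hook() from
 * another object file may replace this body, so what the compiler sees
 * here is not necessarily what runs.
 */
__attribute__((weak)) int arch_hook(void)
{
    return 0;
}

int call_hook(void)
{
    /*
     * The compiler should not fold this call to the constant 0 based only
     * on the weak body above; by the same reasoning it may decline to
     * infer properties such as noreturn from a weak definition.
     */
    return arch_hook();
}
```

Built on its own, with no strong override linked in, `call_hook()` returns 0. Linking in a strong `arch_hook` that returns something else changes the result without recompiling `call_hook`.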

@nickdesaulniers
Member

Yeah, if I remove the check for hasExactDefinition in addNoReturnAttrs, I can get mainline with CONFIG_LTO_CLANG=y+CONFIG_STACKPROTECTOR_STRONG=y to panic in the same way (demonstrating that we just narrowly avoid this on any kernel branch with 53c99bd due to a bug in LLVM). I'm going to fix that in LLVM then backport -mllvm -disable-check-noreturn-call (if it works, still haven't tested it) through all stable branches, then add no_stack_protector to seal this radioactive biohazard in concrete for eternity.

@nickdesaulniers
Member

https://reviews.llvm.org/D147177

All that said, it would be weird and bad if a weak definition was noreturn but the strong one was not. Perhaps arch_call_rest_init and rest_init should also be marked noreturn if cpu_startup_entry was in d4e5268.

Testing -mllvm -disable-check-noreturn-call, that builds+boots but produces the following new warning from objtool:

vmlinux.o: warning: objtool: .text.unlikely.panic: unexpected end of section

Disassembling panic, it appears to end in multiple int3 trapping instructions, so I'm not sure quite yet what the problem is.

@nickdesaulniers
Member

Testing -mllvm -disable-check-noreturn-call, that builds+boots but produces the following new warning from objtool:

That seems to be only with https://reviews.llvm.org/D147177 applied though.

@nathanchance
Member Author

It appears that https://lore.kernel.org/cover.1680912057.git.jpoimboe@kernel.org/ will expose this on mainline:

# Base is from https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/?h=objtool/core
$ git log --oneline FETCH_HEAD^..
aadc1854f91a x86/hyperv: Mark hv_ghcb_terminate() as noreturn
1087e11a3b30 scsi: message: fusion: Mark mpt_halt_firmware() __noreturn
029ce5efb21e objtool: Include weak functions in global_noreturns check
61a9c0038927 x86/cpu: Mark {hlt,resume}_play_dead() __noreturn
9b9a021acf20 cpu: Mark nmi_panic_self_stop() __noreturn
b8151be824fa cpu: Mark panic_smp_self_stop() __noreturn
5af90c36450b arm64/cpu: Mark cpu_park_loop() and friends __noreturn
d18f262d34e5 btrfs: Mark btrfs_assertfail() __noreturn
e80cc52c35af x86/head: Mark *_start_kernel() __noreturn
ece10631a9fe init: Mark start_kernel() __noreturn
8ecdb31ce33c init: Mark [arch_call_]rest_init() __noreturn
fb799447ae29 x86,objtool: Split UNWIND_HINT_EMPTY in two

$ boot-qemu.py -k $TMP_BUILD_FOLDER/linux-next
...
[    0.000000] Linux version 6.3.0-rc1-00040-gaadc1854f91a (nathan@dev-arch.thelio-3990X) (ClangBuiltLinux clang version 17.0.0 (https://github.com/llvm/llvm-project e08af170736271f022b2cab44d58326356ce1db8), ClangBuiltLinux LLD 17.0.0) #1 SMP PREEMPT_DYNAMIC Mon Apr 10 12:14:39 MST 2023
...
[    0.094791] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_kernel+0x4de/0x4e0
[    0.095774] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.3.0-rc1-00040-gaadc1854f91a #1
[    0.095774] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[    0.095774] Call Trace:
[    0.095774]  <TASK>
[    0.095774]  panic+0x148/0x3b0
[    0.095774]  ? start_kernel+0x4de/0x4e0
[    0.095774]  __stack_chk_fail+0x14/0x20
[    0.095774]  start_kernel+0x4de/0x4e0
[    0.095774]  x86_64_start_reservations+0x24/0x30
[    0.095774]  x86_64_start_kernel+0xab/0xb0
[    0.095774]  secondary_startup_64_no_verify+0xe1/0xeb
[    0.095774]  </TASK>
[    0.095774] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_kernel+0x4de/0x4e0 ]---

@nickdesaulniers
Member

@nickdesaulniers nickdesaulniers added [PATCH] Submitted A patch has been submitted for review [BUG] llvm A bug that should be fixed in upstream LLVM and removed [BUG] Untriaged Something isn't working labels Apr 10, 2023
@nickdesaulniers
Member

nickdesaulniers added a commit to llvm/llvm-project that referenced this issue Apr 13, 2023
…functions

https://reviews.llvm.org/rGd656ae28095726830f9beb8dbd4d69f5144ef821
introduced additional checks before calling noreturn functions in
response to this security paper related to Catch Handler Oriented
Programming (CHOP):
https://download.vusec.net/papers/chop_ndss23.pdf
See also:
https://bugs.chromium.org/p/llvm/issues/detail?id=30

This causes stack canaries to be inserted in C code which was
unexpected; we noticed certain Linux kernel trees stopped booting after
this (in functions trying to initialize the stack canary itself).
ClangBuiltLinux/linux#1815

There is no point checking the stack canary like this when exceptions
are disabled (-fno-exceptions or function is marked noexcept) or for C
code.  The GCC patch for this issue does something similar:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=a25982ada523689c8745d7fb4b1b93c8f5dab2e7

Android measured a 2% regression in RSS as a result of d656ae2 and
undid it globally:
https://android-review.googlesource.com/c/platform/build/soong/+/2524336

Reviewed By: xiangzhangllvm

Differential Revision: https://reviews.llvm.org/D147975
llvmbot pushed a commit to llvm/llvm-project-release-prs that referenced this issue Apr 13, 2023
…functions


(cherry picked from commit fc4494d)
tru pushed a commit to llvm/llvm-project-release-prs that referenced this issue Apr 17, 2023
…functions


(cherry picked from commit fc4494d)
@nickdesaulniers nickdesaulniers added the [FIXED][LLVM] 16 This bug was fixed in LLVM 16.0 label Apr 18, 2023
@nickdesaulniers
Member

The llvm-16 CI job is still failing because its toolchain build dates from April 4, while the backport was merged April 17 (linked above):
https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/4729226155/jobs/8392326831
llvm-17 is good:
https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/4729214601/jobs/8392130188

Leaving open until we confirm clang-16 boots are fixed. Hopefully the kernel patches get picked up too, but that's not a blocker for closing this out.

@nathanchance
Member Author

apt.llvm.org is only a couple of days behind, I assume we will get a new rebuild here soon.

$ sudo apt info clang-16 &| grep Version:
Version: 1:16.0.2~++20230414073057+b5aa566a7e53-1~exp1~20230414073206.71

We may have to request an explicit update to tuxmake/tuxsuite's container images, as the stable ones are only rebuilt monthly, rather than daily like the nightly containers.

@nathanchance
Member Author

On both aarch64 and x86_64:

$ sudo apt info clang-16 &| grep Version:
Version: 1:16.0.2~++20230419070902+18ddebe1a1a9-1~exp1~20230419071013.73

I have requested that TuxMake and TuxSuite be updated to explicitly include this: https://gitlab.com/Linaro/tuxmake/-/issues/202

I think this can be closed now but I'll let someone else do that if they agree with my assessment.

@nathanchance
Member Author

TuxMake containers are updated so our builds should go back to green shortly:

https://gitlab.com/Linaro/tuxmake/-/issues/202#note_1361045827

Closing this out.

ramosian-glider added a commit to ramosian-glider/syzkaller that referenced this issue May 9, 2023
Linux v6.4-rc1 built with Clang versions <= 16 with stack protector
enabled panic with the following stack trace:

Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_kernel+0xd8a/0xd90
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.3.0-rc1-00042-g9ea7e6b62c2b-dirty #106
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88
 dump_stack_lvl+0x1bc/0x250 lib/dump_stack.c:106
 dump_stack+0x1e/0x20 lib/dump_stack.c:113
 panic+0x4cd/0xc10 kernel/panic.c:340
 __stack_chk_fail+0x18/0x20 kernel/panic.c:759
 start_kernel+0xd8a/0xd90 init/main.c:?
 x86_64_start_reservations+0x2e/0x30 arch/x86/kernel/head64.c:556
 x86_64_start_kernel+0x118/0x120 arch/x86/kernel/head64.c:537
 secondary_startup_64_no_verify+0xcf/0xdb arch/x86/kernel/head_64.S:358
 </TASK>

ClangBuiltLinux/linux#1815 describes the
problem, which is fixed on the Clang side
(https://reviews.llvm.org/D147975), but before the fix reaches syzbot
we'll have to keep the stack protector disabled.
ramosian-glider added a commit to google/syzkaller that referenced this issue May 9, 2023
flemairen6 pushed a commit to Xilinx/llvm-project that referenced this issue May 10, 2023
…functions

@nickdesaulniers
Member

Need https://gitlab.com/Linaro/tuxmake/-/merge_requests/306 to update the android containers, too.

@nickdesaulniers
Member

very strange, there's a similar-ish boot failure: https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/4977144432/jobs/8906720088#step:5:4501

but:

  1. android-4.19-stable not android-4.14-stable
  2. ARCH=arm64 not ARCH=x86
  3. Debian clang version 16.0.3 0419 which I think should have the fix. Was stack protector boot failure in android-4.14-stable #1815 (comment) not addressed?

@nickdesaulniers
Member

llvm/llvm-project-release-prs@cbd175f was the fix merged May 1, so 16.0.3 from 4/19 is not new enough.

@nickdesaulniers
Member

https://gitlab.com/Linaro/tuxmake/-/issues/202#note_1392060006 mentions that clang-16 tuxmake containers have been upgraded to 16.0.4. Will close this out once verified in CI.

f0rm2l1n pushed a commit to f0rm2l1n/my-syzkaller that referenced this issue Jun 8, 2023
veselypeta pushed a commit to veselypeta/cherillvm that referenced this issue Aug 21, 2024
…functions
