CoCo code breaks with CFI #2015

Closed
0n-s opened this issue Apr 13, 2024 · 6 comments
Labels
[ARCH] x86_64 This bug impacts ARCH=x86_64 Changed defaults The compiler/linker/etc. has been patched to implicitly change a default value [FEATURE] CFI Related to building the kernel with Clang Control Flow Integrity

Comments

@0n-s

0n-s commented Apr 13, 2024

Greetings. I've tried to build an x86_64 kernel with CFI enabled. It's a "distro kernel" config of sorts, kitted out with every driver & every bit of functionality. The x86 CoCo infrastructure breaks when you enable CFI. Here's the relevant log snippet:

<4>[    0.913788] ------------[ cut here ]------------
<4>[    0.916633] no CFI hash found at: __cfi_cc_platform_has+0x0/0x20 ffffffffaa451460 90 90 90 90 90
<4>[    0.919977] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:1183 __apply_fineibt+0xb15/0xb80
<4>[    0.923297] Modules linked in:
<4>[    0.924268] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.5-ehhhhhhhhh #1 77789fd3cc3ef13083ac0da371db90ee68380b84
<4>[    0.926629] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-stable202302-for-qemu 03/01/2023
<4>[    0.929965] RIP: 0010:__apply_fineibt+0xb15/0xb80
<4>[    0.933296] Code: 80 7c 24 0c 00 74 4f 48 c7 c7 6e b5 a0 aa eb 41 48 c7 c7 c0 1b a2 aa 48 89 de 48 89 da b9 05 00 00 00 49 89 d8 e8 8b b2 09 00 <0f> 0b eb 1c 48 c7 c7 c0 1b a2 aa 48 89 ee 48 89 ea b9 05 00 00 00
<4>[    0.936635] RSP: 0000:ffffffffab203e68 EFLAGS: 00010246
<4>[    0.939962] RAX: 0000000000000000 RBX: ffffffffaa451460 RCX: 0000000000000000
<4>[    0.942013] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
<4>[    0.943298] RBP: 000000009434d7bb R08: 0000000000000000 R09: 0000000000000000
<4>[    0.945433] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffac202de0
<4>[    0.946637] R13: ffffffffac1f5258 R14: ffffffffac32fd18 R15: ffffffffac304e94
<4>[    0.949962] FS:  0000000000000000(0000) GS:ffff9dcdf6c00000(0000) knlGS:0000000000000000
<4>[    0.953301] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[    0.956629] CR2: ffff9dcd55001000 CR3: 0000000213a22001 CR4: 0000000000060ef0
<4>[    0.959968] Call Trace:
<4>[    0.963299]  <TASK>
<4>[    0.963983]  ? __warn+0xcf/0x1e0
<4>[    0.966630]  ? __apply_fineibt+0xb15/0xb80
<4>[    0.969964]  ? report_bug+0x154/0x220
<4>[    0.973299]  ? handle_bug+0x3d/0x90
<4>[    0.976630]  ? exc_invalid_op+0x1a/0x70
<4>[    0.977841]  ? asm_exc_invalid_op+0x1a/0x20
<4>[    0.979972]  ? memset_orig+0xb0/0xb0
<4>[    0.981083]  ? __apply_fineibt+0xb15/0xb80
<4>[    0.983297]  ? __apply_fineibt+0xb15/0xb80
<4>[    0.984587]  alternative_instructions+0x3f/0x15b
<4>[    0.986635]  arch_cpu_finalize_init+0x46/0xbb
<4>[    0.987978]  start_kernel+0x3bb/0x48b
<4>[    0.989964]  x86_64_start_reservations+0x32/0x40
<4>[    0.991376]  x86_64_start_kernel+0x78/0x8b
<4>[    0.993300]  secondary_startup_64_no_verify+0x185/0x19b
<4>[    0.994871]  </TASK>
<4>[    0.996630] ---[ end trace 0000000000000000 ]---

You don't need any CoCo-supporting HW or setup to trigger this; it'll break anywhere on boot. Specifically, it happens during the alternative patching stage, so you can't miss it. Just enable any option that selects CONFIG_ARCH_HAS_CC_PLATFORM (which builds the cc_platform_has function seen above).

I am compiling with LLVM=1 LLVM_IAS=1, with self-built LLVM 18.1.4.

Let me know if you need the config.
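
For anyone decoding the warning above: the five bytes printed at the end of the "no CFI hash found" line are the start of the `__cfi_` preamble. As far as I understand it, a kCFI preamble on x86_64 starts with `b8` (a `mov $hash, %eax`) followed by the 32-bit type hash, so five `90` bytes (NOPs) would mean no hash was emitted for that function. A minimal Python sketch under that assumption; the helper name and regex are hypothetical, not anything from the kernel tree:

```python
import re

# Tail of the warning line: symbol, 16-digit address, then five hex bytes.
WARN_RE = re.compile(
    r"no CFI hash found at: (?P<sym>\S+) (?P<addr>[0-9a-f]{16}) "
    r"(?P<bytes>(?:[0-9a-f]{2} ?){5})"
)

# Assumption: a kCFI preamble begins with this opcode (mov $imm32, %eax).
MOV_EAX_IMM32 = 0xB8


def decode_cfi_warning(line):
    """Return (symbol, hash or None) for a 'no CFI hash found' log line.

    Returns None if the line is not such a warning.
    """
    m = WARN_RE.search(line)
    if m is None:
        return None
    raw = bytes(int(b, 16) for b in m.group("bytes").split())
    if raw[0] != MOV_EAX_IMM32:
        # e.g. 90 90 90 90 90: only NOPs where the hash should be
        return m.group("sym"), None
    return m.group("sym"), int.from_bytes(raw[1:5], "little")


line = ("<4>[    0.916633] no CFI hash found at: "
        "__cfi_cc_platform_has+0x0/0x20 ffffffffaa451460 90 90 90 90 90")
print(decode_cfi_warning(line))
# -> ('__cfi_cc_platform_has+0x0/0x20', None)
```

A `None` hash for a symbol like this one suggests the compiler never wrote the type hash into the preamble, which points at the toolchain rather than at the function being patched.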

@nathanchance nathanchance added [BUG] Untriaged Something isn't working [ARCH] x86_64 This bug impacts ARCH=x86_64 [FEATURE] CFI Related to building the kernel with Clang Control Flow Integrity labels Apr 13, 2024
@nathanchance
Member

Hmmm, I am not able to reproduce this on top of defconfig with the following options enabled. Perhaps I will need your full configuration.

# LLVM=1 implies LLVM_IAS=1 now, menuconfig to enable CFI and TDX_GUEST to select ARCH_HAS_CC_PLATFORM
$ make -skj"$(nproc)" ARCH=x86_64 LLVM=1 defconfig menuconfig bzImage

$ rg -N 'ALTERNATIVE|CC_PLATFORM|CFI|IBT' .config
CONFIG_CC_HAS_IBT=y
CONFIG_X86_KERNEL_IBT=y
CONFIG_FUNCTION_PADDING_CFI=11
CONFIG_FINEIBT=y
CONFIG_ARCH_SUPPORTS_CFI_CLANG=y
CONFIG_ARCH_USES_CFI_TRAPS=y
CONFIG_CFI_CLANG=y
# CONFIG_CFI_PERMISSIVE is not set
CONFIG_ARCH_HAS_CC_PLATFORM=y

$ boot-qemu.py -k .
...
$ .../qemu-system-x86_64 -display none -nodefaults -drive if=pflash,format=raw,file=/usr/share/edk2/x64/OVMF_CODE.fd,readonly=on -drive if=pflash,format=raw,file=<boot_utils>/images/x86_64/OVMF_VARS.fd -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci -append 'console=ttyS0 earlycon=uart8250,io,0x3f8' -kernel <build_dir>/arch/x86/boot/bzImage -initrd <boot_utils>/images/x86_64/rootfs.cpio -cpu host -enable-kvm -m 512m -smp 64 -serial mon:stdio
...
[    0.000000] Linux version 6.8.6 (nathan@dev-arch.thelio-3990X) (ClangBuiltLinux clang version 18.1.3 (https://github.com/llvm/llvm-project.git c13b7485b87909fcf739f62cfa382b55407433c0), ClangBuiltLinux LLD 18.1.3) #1 SMP PREEMPT_DYNAMIC Sat Apr 13 07:38:12 MST 2024
...
[    0.313494] SMP alternatives: Using kCFI
[    0.344079] Freeing SMP alternatives memory: 56K
...

I'll try to reproduce with both tip of tree LLVM main and release/18.x later. If this does not happen with earlier LLVM releases, it would be helpful to bisect what LLVM commit introduced this.
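
The `rg` check above can also be scripted when comparing several configs. This is only a rough sketch: the option names are copied from the output above, while the helper names and the choice of which options count as "required" are illustrative assumptions.

```python
# Options taken from the rg output above; which ones are "required" is a guess.
REQUIRED = (
    "CONFIG_CFI_CLANG",
    "CONFIG_FINEIBT",
    "CONFIG_X86_KERNEL_IBT",
    "CONFIG_ARCH_HAS_CC_PLATFORM",
)


def parse_kconfig(text):
    """Map CONFIG_* names to values; '# CONFIG_X is not set' maps to 'n'."""
    suffix = " is not set"
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            opts[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            opts[name] = value
    return opts


def missing_options(text, required=REQUIRED):
    """Return the required options that are not set to 'y'."""
    opts = parse_kconfig(text)
    return [name for name in required if opts.get(name) != "y"]


SAMPLE = """\
CONFIG_CC_HAS_IBT=y
CONFIG_X86_KERNEL_IBT=y
CONFIG_FUNCTION_PADDING_CFI=11
CONFIG_FINEIBT=y
CONFIG_ARCH_SUPPORTS_CFI_CLANG=y
CONFIG_ARCH_USES_CFI_TRAPS=y
CONFIG_CFI_CLANG=y
# CONFIG_CFI_PERMISSIVE is not set
CONFIG_ARCH_HAS_CC_PLATFORM=y
"""
print(missing_options(SAMPLE))  # -> []
```

To run it against a real build tree, pass the contents of its `.config` (for example `open(".config").read()`) instead of `SAMPLE`.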

@0n-s
Author

0n-s commented Apr 13, 2024

Hmmmm, I probably should've disabled cc_platform_has before I filed this bug 'cause I get this now:

<4>[    1.019866] ------------[ cut here ]------------
<4>[    1.020839] no CFI hash found at: __cfi_do_syscall_64+0x0/0x20 ffffffff8ec43460 90 90 90 90 90
<4>[    1.024186] WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:1183 __apply_fineibt+0xb15/0xb80
<4>[    1.027506] Modules linked in:
<4>[    1.030840] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.5-hopeforcfi #2 568f2a53d64ea19d91d6d5e0bd3c11c677f120f0
<4>[    1.034173] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-stable202302-for-qemu 03/01/2023
<4>[    1.037507] RIP: 0010:__apply_fineibt+0xb15/0xb80
<4>[    1.040840] Code: 80 7c 24 0c 00 74 4f 48 c7 c7 b9 9c 20 8f eb 41 48 c7 c7 94 03 22 8f 48 89 de 48 89 da b9 05 00 00 00 49 89 d8 e8 0b 62 09 00 <0f> 0b eb 1c 48 c7 c7 94 03 22 8f 48 89 ee 48 89 ea b9 05 00 00 00
<4>[    1.044173] RSP: 0000:ffffffff8fa03e68 EFLAGS: 00010246
<4>[    1.047506] RAX: 0000000000000000 RBX: ffffffff8ec43460 RCX: 0000000000000000
<4>[    1.050839] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
<4>[    1.053458] RBP: 000000003f2abd0f R08: 0000000000000000 R09: 0000000000000000
<4>[    1.054173] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff909f52bc
<4>[    1.057506] R13: ffffffff909e7760 R14: ffffffff90b2140c R15: ffffffff90af6714
<4>[    1.060839] FS:  0000000000000000(0000) GS:ffff9d7f36c00000(0000) knlGS:0000000000000000
<4>[    1.064172] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[    1.066286] CR2: ffff9d7f0dc01000 CR3: 000000028c822001 CR4: 0000000000060ef0
<4>[    1.067510] Call Trace:
<4>[    1.068468]  <TASK>
<4>[    1.070841]  ? __warn+0xcf/0x1e0
<4>[    1.072077]  ? __apply_fineibt+0xb15/0xb80
<4>[    1.073643]  ? report_bug+0x154/0x220
<4>[    1.074174]  ? handle_bug+0x3d/0x90
<4>[    1.075501]  ? exc_invalid_op+0x1a/0x70
<4>[    1.077506]  ? asm_exc_invalid_op+0x1a/0x20
<4>[    1.080840]  ? memset_orig+0xb0/0xb0
<4>[    1.084174]  ? __apply_fineibt+0xb15/0xb80
<4>[    1.085717]  ? __apply_fineibt+0xb15/0xb80
<4>[    1.087507]  alternative_instructions+0x3f/0x15b
<4>[    1.089250]  arch_cpu_finalize_init+0x46/0xbb
<4>[    1.090841]  start_kernel+0x3bb/0x48b
<4>[    1.092246]  x86_64_start_reservations+0x32/0x40
<4>[    1.094173]  x86_64_start_kernel+0x6e/0x7b
<4>[    1.095721]  secondary_startup_64_no_verify+0x180/0x19b
<4>[    1.097508]  </TASK>
<4>[    1.098392] ---[ end trace 0000000000000000 ]---
<3>[    1.100840] SMP alternatives: Something went horribly wrong trying to rewrite the CFI implementation.

This is on the same config as the original, but without CONFIG_ARCH_HAS_CC_PLATFORM. So it doesn't seem to be a problem with the CoCo code. Additionally, I can reproduce this exact same splat with defconfig+CFI+cc_platform_has, so I think it may be a problem with my TC. I am applying some patches from my distro for additional hardening by default to match their GCC (SSP by default, stuff like that). Allow me to rebuild LLVM to see if that's it (still sticking with stable LLVM & not ToT).

Also, I used ToT to test just before filing this report, & the same issue is present there. Has been for months, as a matter of fact. But again, it may just be the patches messing with it, so give me a day or so to test fully vanilla LLVM.

But if it helps, here is my config: lolconfig.txt

It does play with fire a little compared to the usual distro kernel config, with stuff like TRIM_UNUSED_KSYMS, thin LTO, & other such stuff, so it might uncover issues of its own.

@nathanchance
Member

> I am applying some patches from my distro for additional hardening by default to match their GCC (SSP by default, stuff like that). Allow me to rebuild LLVM to see if that's it (still sticking with stable LLVM & not ToT).

Something with these patches is almost certainly going to be the root cause of the problem. If it is, it would be pretty helpful to narrow down the exact change that causes it, as it may be possible to disable it for the kernel explicitly so that regardless of changed defaults, everything works. There is precedent for this with GCC, which allows certain defaults to be customized by the user.

> Also, I used ToT to test just before filing this report, & the same issue is present there. Has been for months, as a matter of fact. But again, it may just be the patches messing with it, so give me a day or so to test fully vanilla LLVM.

No worries.

> But if it helps, here is my config: lolconfig.txt
>
> It does play with fire a little compared to the usual distro kernel config, with stuff like TRIM_UNUSED_KSYMS, thin LTO, & other such stuff, so it might uncover issues of its own.

I'll give it a go when I have some time.

@nathanchance nathanchance added Changed defaults The compiler/linker/etc. has been patched to implicitly change a default value and removed [BUG] Untriaged Something isn't working labels Apr 13, 2024
@0n-s
Author

0n-s commented Apr 13, 2024

> Something with these patches is almost certainly going to be the root cause of the problem. If it is, it would be pretty helpful to narrow down the exact change that causes it, as it may be possible to disable it for the kernel explicitly so that regardless of changed defaults, everything works. There is precedent for this with GCC, which allows certain defaults to be customized by the user.

Yes. Most people building the kernel with Clang use truly vanilla LLVM, but distributions often patch LLVM to have sane security defaults, so it would be nice to have an OOTB LLVM build with those. I do wish they'd use Clang configuration files, but AFAIK only Gentoo does that RN.

> I'll give it a go when I have some time.

Thanks! The devil is always in the details, but please remember to disable most modules here to save on compile times; I don't think it'll affect whether the bug is reproducible or not. This config enables basically everything, but I don't think it'd hurt to disable network drivers, multimedia, SCSI, DRM, industrial I/O, filesystems, & sound card support. Just those account for some 3/4 of the build.

@0n-s
Author

0n-s commented Apr 13, 2024

Also worth noting: for the LLVM I am building RN, aside from removing all patches, I have also taken out -ftrivial-auto-var-init=zero for the build (of LLVM, not the kernel). I have no idea how the kcfi sanitizer works, but seeing as the error is "no CFI hash", maybe there's some uninitialized variable usage going on? This is all uneducated conjecture, though; I am simply mentioning all the details just in case anyone who knows better might be able to figure out whether that'd affect it or not.

Would be neat if that was actually the cause; that option in most cases provides a safe fallback for buggy code, but for some code it brings out its edge cases 100% of the time, as opposed to basically never in practice.

@0n-s
Author

0n-s commented Apr 15, 2024

What I have found is that this issue only occurs because my LLVM was itself compiled with a random ToT revision of LLVM that was silently buggy for kcfi only, somehow (I swear, I run the kernels I build with LLVM through plenty of testsuites, they do absolutely work just fine). I did one build that differed from the compiler in the initial bug report only in that it used my system GCC to compile the 1st stage of the LLVM build, & that was enough to make it work.

So it's not the patches that were relevant, though I've since dropped those. So the solution is just to stop building stable LLVM with a dangerous cowboy revision of LLVM, even if it passed the testsuite at the time.

I just built the entire config shared above with these issues fixed, & everything works:

<6>[    0.842618] SMP alternatives: Using kCFI

So the issue was just one of those fun compiler issues that requires a 3-stage compile to fix, & I didn't think to do that since it'd been working perfectly for everything else. Really was just me being a little foolish, sorry for that.
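
For anyone else chasing a similar toolchain problem, the boot log is enough to tell the two outcomes in this thread apart. A small Python sketch that only knows the messages quoted here; the function name and status labels are made up for illustration, and any other message (including other success messages the kernel may print) falls through to "unknown":

```python
def cfi_boot_status(dmesg_text):
    """Classify a boot log using only the messages quoted in this thread.

    Returns (status, matching_lines) where status is 'failed', 'kcfi',
    or 'unknown'.
    """
    lines = dmesg_text.splitlines()
    failures = [
        line for line in lines
        if "no CFI hash found at:" in line
        or "trying to rewrite the CFI implementation" in line
    ]
    if failures:
        return "failed", failures
    for line in lines:
        if "SMP alternatives: Using kCFI" in line:
            return "kcfi", [line]
    return "unknown", []


good_log = "<6>[    0.842618] SMP alternatives: Using kCFI\n"
print(cfi_boot_status(good_log)[0])  # -> kcfi
```

Feeding it the output of `dmesg` (or a saved serial console log) after each rebuild of the compiler gives a quick pass/fail signal without reading the whole splat.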

@0n-s 0n-s closed this as completed Apr 15, 2024