Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INIT_STACK_ALL_ZERO - Framework Laptop system freezes (kernel oops?) on boot #1626

Closed
0x647262 opened this issue Apr 26, 2022 · 34 comments
Closed
Labels
[ARCH] x86_64 This bug impacts ARCH=x86_64 [BUG] linux A bug that should be fixed in the mainline kernel. [FIXED][LINUX] 6.0 This bug was fixed in Linux 6.0

Comments

@0x647262
Copy link

0x647262 commented Apr 26, 2022

If this is not the correct place to report this bug, would someone be able to point me to the correct mailing list it should be reported to? I'm not 100% sure if this is Clang specific, since GCC 11.2.0 (as far as I am aware) does not trigger CC_HAS_AUTO_VAR_INIT_ZERO when building a kernel config.

Hardware:

Framework Laptop (BIOS: 3.07)

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 140
model name	: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
stepping	: 1
microcode	: 0x9a
cpu MHz		: 2800.000
cache size	: 12288 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 27
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple pml ept_mode_based_exec tsc_scaling
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips	: 5608.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

OS:

Arch Linux

System Information:

Linux version 5.17.4-arch1-1 (linux@archlinux) (clang version 13.0.1, LLD 13.0.1) #1 SMP PREEMPT Mon, 25 Apr 2022 20:04:10 +0000

Kernel Command Line:

initrd=\intel-ucode.img initrd=\initramfs-linux.img root=UUID=69dcbb13-26b1-4e22-beaf-022d614ba025 rootflags=rw,noatime,compress-force=zstd:1,ssd,space_cache=v2,subvolid=257,subvol=/mithril.btrfssv rw

Issue:

System boots from initrd completely fine, but after switching root freezes. When booted with the kernel command line option oops=panic the device's caps lock LED flashes after the freeze, indicating a kernel panic(?).

Files:

init-stack-all-zero-config.txt

EDIT: Corrected config init-stack-all-zero-config.txt

dmesg.log

@nickdesaulniers
Copy link
Member

Thanks for reporting. I didn't see anything suspicious in the logs. We should test your config in QEMU to see if the hang is easily reproducible.

cc @isanbard @kees

@nickdesaulniers nickdesaulniers added [BUG] Untriaged Something isn't working [ARCH] x86_64 This bug impacts ARCH=x86_64 labels Apr 26, 2022
@torvic9
Copy link

torvic9 commented Apr 26, 2022

Not sure if I am just very confused, but... the dmesg output doesn't fit the config you posted.
For example, the config has BFQ and Zswap disabled, but they appear as enabled in your log.
(The config itself looks quite small similar to defconfig.)

Now, using Arch's system clang 13.0.1 and INIT_STACK_ALL_ZERO, my 5.17 compiles and boots just fine on this Kaby Lake laptop, but it uses a way more "complete" config.

@nathanchance
Copy link
Member

nathanchance commented Apr 26, 2022

I have CONFIG_INIT_STACK_ALL_ZERO=y in the configuration that I boot test on four different machines (my main workstation and three test systems) with Arch Linux and none of them have issues with booting (although I use tip of tree LLVM for those builds). I'll see if I can reproduce this in QEMU.

Just to confirm, changing CONFIG_INIT_STACK_ALL_ZERO to CONFIG_INIT_STACK_NONE allows the system to fully boot?

@nathanchance
Copy link
Member

Hmmm, I cannot reproduce this in QEMU:

$ cat /proc/version
Linux version 5.17.4 (nathan@dev-arch.thelio-3990X) (clang version 13.0.1, LLD 13.0.1) #1 SMP PREEMPT Tue Apr 26 11:30:09 MST 2022

$ zgrep INIT_STACK /proc/config.gz
# CONFIG_INIT_STACK_NONE is not set
# CONFIG_INIT_STACK_ALL_PATTERN is not set
CONFIG_INIT_STACK_ALL_ZERO=y

That could be due to a difference in hardware, meaning I do not load a driver that is problematic, although I would expect to see problems sooner, as the modules should be loaded during the initrd stage.

@kees kees added the more info needed More information requested to issue author from project members. label Apr 26, 2022
@0x647262
Copy link
Author

Not sure if I am just very confused, but... the dmesg output doesn't fit the config you posted.
For example, the config has BFQ and Zswap disabled, but they appear as enabled in your log.
(The config itself looks quite small similar to defconfig.)

Sorry, about that... Grabbed the wrong config. I've updated the original report with the correct one.

I have CONFIG_INIT_STACK_ALL_ZERO=y in the configuration that I boot test on four different machines (my main workstation and three test systems) with Arch Linux and none of them have issues with booting (although I use tip of tree LLVM for those builds). I'll see if I can reproduce this in QEMU.

I installed the (INIT_STACK_ALL_ZERO) kernel on my desktop (AMD Ryzen 7 1700) and an older desktop I have sitting around (AMD FX8350), and both boot without issue.

Perhaps the issue is specific to the hardware configuration of Framework Laptops? That seems a bit more likely to me than it being Tigerlake specific...

@nickdesaulniers
Copy link
Member

The next steps would be to find the object file affected by this.

First steps, you can use

subdir-ccflags-y += -ftrivial-auto-var-init=uninitialized

in various directories' Makefiles. Then, once we find which directory, we can pare this down further to just:

CFLAGS_<object file name> += -ftrivial-auto-var-init=uninitialized

to isolate the translation unit. Then finally, we can sprinkle __attribute((uninitialized)) in that TU to find which variables are problematic.

This will probably take multiple builds+boots, @0x647262 .

@nathanchance
Copy link
Member

@0x647262 If you want help doing that with Arch's PKGBUILD, let me know, as I have done it before.

@kees kees removed the more info needed More information requested to issue author from project members. label Apr 27, 2022
@kees kees changed the title INIT_STACK_ALL_ZERO - System freezes (kernel oops?) on boot INIT_STACK_ALL_ZERO - Framework Laptop system freezes (kernel oops?) on boot Apr 27, 2022
@0x647262
Copy link
Author

@nickdesaulniers

After making those changes to a subdir's Makefile, what specifically will I be looking for during compilation / on boot?

This will probably take multiple builds+boots

No problem! It might not be the fastest turnaround, but I'm happy to work though this issue.

@nathanchance Your help would be appreciated, since my PKGBUILD skills are a tad rusty.

@nathanchance
Copy link
Member

I'll do a write up on this process tomorrow then!

@nathanchance
Copy link
Member

After making those changes to a subdir's Makefile, what specifically will I be looking for during compilation / on boot?

Sorry, forgot to answer this. You are specifically looking for the issue to be resolved, as that is basically turning off CONFIG_INIT_STACK_ALL_ZERO for the particular translation unit. Once we have gotten down to the particular translation unit that is problematic, we can bisect the specific functions (if necessary).

@nathanchance
Copy link
Member

Turns out I had some time today :)

https://nathanchance.dev/posts/package-standalone-linux-kernel-with-abs/

@0x647262 that is how to actually generate the .pkg.tar.zst package, I will try my best to explain the process of bisecting the object files in a separate post tomorrow.

@nathanchance
Copy link
Member

Only a day late:

https://nathanchance.dev/posts/bisect-compiler-flag-problem-linux-kernel/

Honestly, I am not super happy with how I described the process but I could not come up with anything that I liked better, it is better than nothing, and I am willing to update it based on any questions that might come up during the process!

@0x647262
Copy link
Author

0x647262 commented May 4, 2022

So I've narrowed the bug down to a single directory: drivers/gpu/drm/i915/gvt, but I've ran into a bit of a snag... When adding:

subdir-ccflags-y += -ftrivial-auto-var-init=uninitialized

To the top of the directory's Makefile, the issue is resolved. However, when I add individual cflags for the all of the resulting object files the module compiles, but the kernel panic is reintroduced. I'm a bit stumped on this one, because it seems like subdir-ccflags-y and CFLAGS_foo.o (for each object file in the directory) would be more-or-less equivalent.

If someone could help point me back in the right direction here I'd greatly appreciate it! 😃

@nathanchance
Copy link
Member

So I've narrowed the bug down to a single directory: drivers/gpu/drm/i915/gvt

Awesome!

but I've ran into a bit of a snag... When adding:

subdir-ccflags-y += -ftrivial-auto-var-init=uninitialized

To the top of the directory's Makefile, the issue is resolved. However, when I add individual cflags for the all of the resulting object files the module compiles, but the kernel panic is reintroduced. I'm a bit stumped on this one, because it seems like subdir-ccflags-y and CFLAGS_foo.o (for each object file in the directory) would be more-or-less equivalent.

If someone could help point me back in the right direction here I'd greatly appreciate it! 😃

Ah drivers/gpu/drm Makefiles... :(

In this particular case, it is because of the way that drivers/gpu/drm/i915/Makefile adds the drivers/gpu/drm/i915/gvt objects to the build (I can explain in depth if you care but I am time constrained at the moment). You'll want to do something like:

$ cd drivers/gpu/drm/i915

$ for file in gvt/*.o; do
echo CFLAGS_$file := -ftrivial-auto-var-init=uninitialized
done >>Makefile

then work off of drivers/gpu/drm/i915/Makefile.

For me, that produces:

diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile
index d2b18f03a33c..fe0c5720ebe5 100644
--- a/drivers/gpu/drm/i915/Makefile
+++ b/drivers/gpu/drm/i915/Makefile
@@ -355,3 +355,25 @@ quiet_cmd_hdrtest = HDRTEST $(patsubst %.hdrtest,%.h,$@)

 $(obj)/%.hdrtest: $(src)/%.h FORCE
        $(call if_changed_dep,hdrtest)
+CFLAGS_gvt/aperture_gm.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/cfg_space.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/cmd_parser.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/debugfs.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/display.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/dmabuf.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/edid.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/execlist.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/fb_decoder.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/firmware.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/gtt.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/handlers.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/interrupt.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/kvmgt.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/mmio_context.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/mmio.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/opregion.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/page_track.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/sched_policy.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/scheduler.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/trace_points.o := -ftrivial-auto-var-init=uninitialized
+CFLAGS_gvt/vgpu.o := -ftrivial-auto-var-init=uninitialized

and it works:

$ rg ftrivial-auto-var-init=uninitialized drivers/gpu/drm/i915/gvt/.debugfs.o.cmd
1:cmd_drivers/gpu/drm/i915/gvt/debugfs.o := ccache clang -Wp,-MMD,drivers/gpu/drm/i915/gvt/.debugfs.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -Qunused-arguments -fmacro-prefix-map=./= -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type -Wno-format-security -std=gnu11 --target=x86_64-linux-gnu -fintegrated-as -Werror=unknown-warning-option -Werror=ignored-optimization-argument -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mstack-alignment=8 -mskip-rax-setup -mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables -mretpoline-external-thunk -fno-delete-null-pointer-checks -Wno-frame-address -Wno-address-of-packed-member -O2 -Wframe-larger-than=2048 -fstack-protector-strong -Werror -Wimplicit-fallthrough -Wno-gnu -Wno-unused-but-set-variable -Wno-unused-const-variable -fomit-frame-pointer -ftrivial-auto-var-init=zero -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang -fno-stack-clash-protection -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wcast-function-type -fno-strict-overflow -fno-stack-check -Werror=date-time -Werror=incompatible-pointer-types -Wno-initializer-overrides -Wno-format -Wno-sign-compare -Wno-format-zero-length -Wno-pointer-to-enum-cast -Wno-tautological-constant-out-of-range-compare -Wno-unaligned-access -Wall -Wextra -Wno-format-security -Wno-unused-parameter -Wno-type-limits -Wno-missing-field-initializers -Wno-sign-compare -Wno-shift-negative-value -Wno-unused-but-set-variable -Wno-frame-address -I./drivers/gpu/drm/i915 -ftrivial-auto-var-init=uninitialized    -DKBUILD_MODFILE='"drivers/gpu/drm/i915/kvmgt"' -DKBUILD_BASENAME='"debugfs"' -DKBUILD_MODNAME='"kvmgt"' -D__KBUILD_MODNAME=kmod_kvmgt -c -o drivers/gpu/drm/i915/gvt/debugfs.o drivers/gpu/drm/i915/gvt/debugfs.c  ; ./tools/objtool/objtool  --hacks=jump_label  --hacks=noinstr    --orc  --retpoline --uaccess    --static-call    drivers/gpu/drm/i915/gvt/debugfs.o

@0x647262
Copy link
Author

0x647262 commented May 4, 2022

TIL!

That's perfect, I'll poke at this some more tonight!

@0x647262
Copy link
Author

0x647262 commented Jun 5, 2022

Had some time this weekend to continue work on this. It seems that when CFLAGS_gvt/filename.o := -ftrivial-auto-var-init=uninitialized is passed, the -ftrivial-auto-var-init=uninitialized is added further along in the filename.o.cmd's arguments, which produces a faulty i915 module (one that continues to freeze).

WORKING | subdir-cflags-firmware.o.cmd.txt

NONWORKING | file-cflags-firmware.o.cmd.txt

Is there a way to manually place the -ftrivial-auto-var-init=uninitialized in a filename.o.cmd file (either via the Makefile, or prehaps overriding it)?

EDIT: A diff for quick reference:

drb@aether ~/D/G/a/l/trunk (packages/packages/linux-clang-lto-debug) [2]> diff --color --unified ~/subdir-cflags-firmware.o.cmd.txt ~/file-cflags-firmware.o.cmd.txt
--- /home/drb/subdir-cflags-firmware.o.cmd.txt	2022-06-05 17:38:21.021966253 -0500
+++ /home/drb/file-cflags-firmware.o.cmd.txt	2022-06-05 17:51:58.550426182 -0500
@@ -1,4 +1,4 @@
-cmd_drivers/gpu/drm/i915/gvt/firmware.o := clang -Wp,-MMD,drivers/gpu/drm/i915/gvt/.firmware.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -Qunused-arguments -fmacro-prefix-map=./= -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type -Wno-format-security -std=gnu11 --target=x86_64-linux-gnu -fintegrated-as -Werror=unknown-warning-option -Werror=ignored-optimization-argument -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64 -mno-80387 -mstack-alignment=8 -mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables -mretpoline-external-thunk -fno-delete-null-pointer-checks -Wno-frame-address -Wno-address-of-packed-member -O2 -Wframe-larger-than=2048 -fstack-protector-strong -Wno-gnu -Wno-unused-but-set-variable -Wno-unused-const-variable -ftrivial-auto-var-init=zero -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang -fno-stack-clash-protection -pg -mfentry -DCC_USING_NOP_MCOUNT -DCC_USING_FENTRY -fno-lto -flto=thin -fsplit-lto-unit -fvisibility=hidden -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wcast-function-type -fno-strict-overflow -fno-stack-check -Werror=date-time -Werror=incompatible-pointer-types -Wno-initializer-overrides -Wno-format -Wno-sign-compare -Wno-format-zero-length -Wno-pointer-to-enum-cast -Wno-tautological-constant-out-of-range-compare -g -gdwarf-5 -Wall -Wextra -Wno-format-security -Wno-unused-parameter -Wno-type-limits -Wno-missing-field-initializers -Wno-sign-compare -Wno-shift-negative-value -Wno-unused-but-set-variable -Wno-frame-address -I./drivers/gpu/drm/i915 -ftrivial-auto-var-init=uninitialized -I ./drivers/gpu/drm/i915 -I ./drivers/gpu/drm/i915/gvt/  -DMODULE  -DKBUILD_BASENAME='"firmware"' -DKBUILD_MODNAME='"i915"' -D__KBUILD_MODNAME=kmod_i915 -c -o drivers/gpu/drm/i915/gvt/firmware.o drivers/gpu/drm/i915/gvt/firmware.c 
+cmd_drivers/gpu/drm/i915/gvt/firmware.o := clang -Wp,-MMD,drivers/gpu/drm/i915/gvt/.firmware.o.d -nostdinc -I./arch/x86/include -I./arch/x86/include/generated  -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/compiler-version.h -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -Qunused-arguments -fmacro-prefix-map=./= -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Werror=return-type -Wno-format-security -std=gnu11 --target=x86_64-linux-gnu -fintegrated-as -Werror=unknown-warning-option -Werror=ignored-optimization-argument -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -fcf-protection=none -m64 -mno-80387 -mstack-alignment=8 -mtune=generic -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables -mretpoline-external-thunk -fno-delete-null-pointer-checks -Wno-frame-address -Wno-address-of-packed-member -O2 -Wframe-larger-than=2048 -fstack-protector-strong -Wno-gnu -Wno-unused-but-set-variable -Wno-unused-const-variable -ftrivial-auto-var-init=zero -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang -fno-stack-clash-protection -pg -mfentry -DCC_USING_NOP_MCOUNT -DCC_USING_FENTRY -fno-lto -flto=thin -fsplit-lto-unit -fvisibility=hidden -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wcast-function-type -fno-strict-overflow -fno-stack-check -Werror=date-time -Werror=incompatible-pointer-types -Wno-initializer-overrides -Wno-format -Wno-sign-compare -Wno-format-zero-length -Wno-pointer-to-enum-cast -Wno-tautological-constant-out-of-range-compare -g -gdwarf-5 -Wall -Wextra -Wno-format-security -Wno-unused-parameter -Wno-type-limits -Wno-missing-field-initializers -Wno-sign-compare -Wno-shift-negative-value -Wno-unused-but-set-variable -Wno-frame-address -I./drivers/gpu/drm/i915 -I ./drivers/gpu/drm/i915 -I ./drivers/gpu/drm/i915/gvt/ -ftrivial-auto-var-init=uninitialized  -DMODULE  -DKBUILD_BASENAME='"firmware"' -DKBUILD_MODNAME='"i915"' -D__KBUILD_MODNAME=kmod_i915 -c -o drivers/gpu/drm/i915/gvt/firmware.o drivers/gpu/drm/i915/gvt/firmware.c 
 
 source_drivers/gpu/drm/i915/gvt/firmware.o := drivers/gpu/drm/i915/gvt/firmware.c
 

@kees
Copy link

kees commented Jun 6, 2022

Had some time this weekend to continue work on this. It seems that when CFLAGS_gvt/filename.o := -ftrivial-auto-var-init=uninitialized is passed, the -ftrivial-auto-var-init=uninitialized is added further along in the filename.o.cmd's arguments, which produces a faulty i915 module (one that continues to freeze).

You mean "firmware" not "filename", yes?

WORKING | subdir-cflags-firmware.o.cmd.txt

NONWORKING | file-cflags-firmware.o.cmd.txt

It seems firmware.o needs "uninitialized" after the -I args??

@0x647262
Copy link
Author

0x647262 commented Jun 6, 2022

You mean "firmware" not "filename", yes?

I used "filename" since it happens with all of the files in the i915/gvt directory. i915/gvt/firmware.o is just the one I'm using as an example.

It seems firmware.o needs "uninitialized" after the -I args??

That's correct. I tried modifying the .o.cmd file directly, but it seems to be overridden by the Makefile.

@nickdesaulniers
Copy link
Member

nickdesaulniers commented Jun 6, 2022

The .cmd files shouldn't be modified by hand; the command line args the compiler uses are NOT consumed from those files. The compiler writes out the commands it was passed to those files; I'm pretty sure make consumes the .cmd files to know whether to rebuild an object file.

the -ftrivial-auto-var-init=uninitialized is added further along in the filename.o.cmd's arguments

That's what I'd expect. The last instance of -ftrivial-auto-var-init= wins. So we care specifically about the order -ftrivial-auto-var-init=zero -ftrivial-auto-var-init=uninitialized. If subdir-ccflags-y += -ftrivial-auto-var-init=uninitialized works, but @nathanchance 's diff above does not, then something very fishy is going on here.

@nathanchance
Copy link
Member

I agree with what Nick said. All that means is that gvt/firmware.c is not responsible for the issue. Remember, adding -ftrivial-auto-var-init=uninitialized to the flags should resolve the issue; if it does not, that just means the issue is not with that file.

@0x647262
Copy link
Author

0x647262 commented Jun 7, 2022

Remember, adding -ftrivial-auto-var-init=uninitialized to the flags should resolve the issue; if it does not, that just means the issue is not with that file.

Right. I neglected to explain that this is occurring with all the files. I tried bisecting the i915/gvt directory by trying half of the files at a time to no avail. Basically where I'm at is the subdir option works as expected, but the directive for individual files does not.

Here's the patch I'm using on the i915 directory's Makefile:
i915-gvt-ftrivial-auto-var-init-uninitialized.patch.txt

@nathanchance
Copy link
Member

So subdir-ccflags-y := -ftrivial-auto-var-init=uninitialized in drivers/gpu/drm/i915/gvt/Makefile works but that diff doesn't? It might be helpful to diff the .o.cmd files then so we can see if there is a difference in the flags getting added.

@0x647262
Copy link
Author

0x647262 commented Jun 8, 2022

It might be helpful to diff the .o.cmd files then so we can see if there is a difference in the flags getting added.

The only difference is the order in which the flag is added:

# Working (subdir):
-I./drivers/gpu/drm/i915 -ftrivial-auto-var-init=uninitialized -I ./drivers/gpu/drm/i915 -I ./drivers/gpu/drm/i915/gvt/  -DMODULE  -DKBUILD_BASENAME='"firmware"'
# Faulty (per-file):
-I./drivers/gpu/drm/i915 -I ./drivers/gpu/drm/i915 -I ./drivers/gpu/drm/i915/gvt/ -ftrivial-auto-var-init=uninitialized  -DMODULE  -DKBUILD_BASENAME='"firmware"'

That's what spurred my intial question regarding "manually overriding" the flag's placement. As far as I can tell, that's the only difference between the two methods.

@nathanchance
Copy link
Member

That is incredibly odd... Perhaps it might be better to try just removing -ftrivial-auto-var-init=zero instead.

CFLAGS_REMOVE_gvt/firmware.o := -ftrivial-auto-var-init=zero

instead of CFLAGS_... := .... With that, you should see no instances of -ftrivial-auto-var-init=zero.

@0x647262
Copy link
Author

0x647262 commented Jun 9, 2022

So after a bit of finessing I've narrowed down the file to:

i915/display/intel_bw.o

The gvt directory ended up being a dead end because the subdir CC options set in its makefile propagated to the rest of the module. Before I start jumping into adding __attribute to individual variables, is there a way to "bisect" the file? I.E. add the uninitialized attribute to half the file at a time to speed up the debugging process? Or am I stuck with manually sprinkling it throught the file? 😅

@kees
Copy link

kees commented Jun 9, 2022

Before I start jumping into adding __attribute to individual variables, is there a way to "bisect" the file?

I don't know a good way to do this, but if it helps, here's a diff with them all marked uninit, and you can just carve out portions of it...:
uninit.patch.txt

I generated this with a Coccinelle script:

@set_uninit@
type TYPE;
identifier VAR;
@@

 TYPE VAR
+__attribute__((uninitialized))
 ;

@nathanchance
Copy link
Member

You should be able to take that monolithic patch, apply it, commit it, use git checkout -p HEAD^ + git cherry-pick to break it into the individual functions, and bisect off of that.

@0x647262
Copy link
Author

0x647262 commented Aug 21, 2022

Had some time to continue hacking on things again today and I've narrowed the issue down to the following line in intel_bw.c:

 403   │         struct intel_bw_info *bi_next;

Using the patch:

diff --git a/drivers/gpu/drm/i915/display/intel_bw.c b/drivers/gpu/drm/i915/display/intel_bw.c
index 37bd7b17f3d0..ffe29d6b328c 100644
--- a/drivers/gpu/drm/i915/display/intel_bw.c
+++ b/drivers/gpu/drm/i915/display/intel_bw.c
@@ -400,7 +400,7 @@ static int tgl_get_bw_info(struct drm_i915_private *dev_priv, const struct intel
 
 	for (i = 0; i < num_groups; i++) {
 		struct intel_bw_info *bi = &dev_priv->max_bw[i];
-		struct intel_bw_info *bi_next;
+		struct intel_bw_info *bi_next __attribute__((uninitialized));
 		int clpchgroup;
 		int j;

Following the code a bit further I found the definition of the struct in question here:

 756   │     struct intel_bw_info {
 757   │         /* for each QGV point */
 758   │         unsigned int deratedbw[I915_NUM_QGV_POINTS];
 759   │         /* for each PSF GV point */
 760   │         unsigned int psf_bw[I915_NUM_PSF_GV_POINTS];
 761   │         u8 num_qgv_points;
 762   │         u8 num_psf_gv_points;
 763   │         u8 num_planes;
 764   │     } max_bw[6];

Although I'm not certain whether or not this is the correct next step, I tried applying the following patch:

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 00d7eeae33bd..4d72d2850960 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -755,12 +755,12 @@ struct drm_i915_private {

	struct intel_bw_info {
		/* for each QGV point */
-		unsigned int deratedbw[I915_NUM_QGV_POINTS];
+		unsigned int deratedbw[I915_NUM_QGV_POINTS] __attribute__((uninitialized));
		/* for each PSF GV point */
-		unsigned int psf_bw[I915_NUM_PSF_GV_POINTS];
-		u8 num_qgv_points;
-		u8 num_psf_gv_points;
-		u8 num_planes;
+		unsigned int psf_bw[I915_NUM_PSF_GV_POINTS] __attribute__((uninitialized));
+		u8 num_qgv_points __attribute__((uninitialized));
+		u8 num_psf_gv_points __attribute__((uninitialized));
+		u8 num_planes __attribute__((uninitialized));
	} max_bw[6];

	struct intel_global_obj bw_obj;

In an attempt to further narrow down the issue, but due to my limited knowledge of C, was met with the following warnings during compilation (truncated for berevity):

  CC [M]  drivers/gpu/drm/i915/i915_getparam.o
In file included from drivers/gpu/drm/i915/i915_drm_client.c:14:
In file included from ./drivers/gpu/drm/i915/gem/i915_gem_context.h:12:
In file included from ./drivers/gpu/drm/i915/gt/intel_context.h:14:
./drivers/gpu/drm/i915/i915_drv.h:758:62: warning: 'uninitialized' attribute only applies to local variables [-Wignored-attributes]
                unsigned int deratedbw[I915_NUM_QGV_POINTS] __attribute__((uninitialized));
...

Which indicate to me that __attribute__ ((uninitialized)) cannot be applied (at least in the method I tried) to members of that struct. I did a bit of searching to figure out if there was a way to toggle that attribute for members of a struct but came up empty handed.

So I guess I'm ready for some more help on this one! (and a huge thanks to everyone who has helped me along with this issue so far! 😃)

@nathanchance
Copy link
Member

Aha! So this is why my Intel test box needs this patch to work properly. I never thought about it enough to realize this is the same problem. If the first if statement is not taken, bi_next will be a NULL pointer with -ftrivial-auto-var-init=zero and the following else branch will dereference it. I am not sure why that patch has not been picked up but I will try to chase it on Monday (I have kind of forgotten I have been applying that patch since I built the system earlier this year...).

@0x647262
Copy link
Author

Oh, neat!

Here's to hoping a fix gets mainlined! 😃

@nathanchance
Copy link
Member

nathanchance commented Aug 22, 2022

Patch pinged: https://lore.kernel.org/YwPoCqvQ02kUl9tP@dev-arch.thelio-3990X/ (might take a bit to show up)

@nathanchance nathanchance added [PATCH] Exists There is a patch that fixes this issue [BUG] linux A bug that should be fixed in the mainline kernel. [PATCH] Accepted A submitted patch has been accepted upstream and removed [BUG] Untriaged Something isn't working [PATCH] Exists There is a patch that fixes this issue labels Aug 22, 2022
@nathanchance
Copy link
Member

@0x647262
Copy link
Author

0x647262 commented Sep 9, 2022

Looks like this snuck into 5.19.8 😃 https://lwn.net/ml/linux-kernel/1662629579112246@kroah.com/

If it's not an issue I'll close this out!

@nathanchance
Copy link
Member

Ah nice, I did not realize it had been applied to a fixes branch as well!

https://git.kernel.org/linus/458ec0c8f35963626ccd51c3d50b752de5f1b9d4

Thank you again for the report and the time spent helping us get to the bottom of this!

@nathanchance nathanchance added [FIXED][LINUX] 6.0 This bug was fixed in Linux 6.0 and removed [PATCH] Accepted A submitted patch has been accepted upstream labels Sep 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[ARCH] x86_64 This bug impacts ARCH=x86_64 [BUG] linux A bug that should be fixed in the mainline kernel. [FIXED][LINUX] 6.0 This bug was fixed in Linux 6.0
Projects
None yet
Development

No branches or pull requests

5 participants