Linux 5.18 NVIDIA module won't load: Missing ENDBR: _portMemAllocatorAllocNonPagedWrapper+0x0/0x10 #256

Closed
rnd-ash opened this issue May 25, 2022 · 87 comments
Assignees
Labels
bug Something isn't working NV-Triaged An NVBug has been created for dev to investigate

Comments

@rnd-ash

rnd-ash commented May 25, 2022

NVIDIA Open GPU Kernel Modules Version

515.43.04

Does this happen with the proprietary driver (of the same version) as well?

Yes

Operating System and Version

Arch Linux

Kernel Release

5.18.0-arch1-1

Hardware: GPU

RTX 3070 laptop (System 76 Oryx 8)

Describe the bug

Since upgrading to kernel 5.18, loading the open NVIDIA driver (or the proprietary one) fails with the same kernel log:

[    5.429675] nvidia-nvlink: Nvlink Core is being initialized, major device number 510

[    5.429718] traps: Missing ENDBR: _portMemAllocatorAllocNonPagedWrapper+0x0/0x10 [nvidia]
[    5.429816] ------------[ cut here ]------------
[    5.429817] kernel BUG at arch/x86/kernel/traps.c:252!
[    5.429828] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    5.429830] CPU: 9 PID: 948 Comm: modprobe Tainted: G           OE     5.18.0-arch1-1 #1 b71a70fe104889aac2f32556bc52f649da2881d2
[    5.429832] Hardware name: System76 Oryx Pro/Oryx Pro, BIOS 2021-09-23_b9b0e89 09/23/2021
[    5.429833] RIP: 0010:exc_control_protection+0xc2/0xd0
[    5.429837] Code: 8b 93 80 00 00 00 be f9 00 00 00 48 c7 c7 d3 ab 66 b5 e8 d1 01 50 ff e9 72 ff ff ff 48 c7 c7 ba ab 66 b5 e8 c7 31 fb ff 0f 0b <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 66 0f 1f 00 55 53 48 89
[    5.429838] RSP: 0018:ffffa9c3413b3bb8 EFLAGS: 00010002
[    5.429839] RAX: 000000000000004d RBX: ffffa9c3413b3bd8 RCX: 0000000000000027
[    5.429840] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9d195fa616a0
[    5.429841] RBP: 0000000000000003 R08: 0000000000000000 R09: ffffa9c3413b39d8
[    5.429842] R10: 0000000000000003 R11: ffffffffb5ecaa08 R12: 0000000000000000
[    5.429842] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    5.429843] FS:  00007f0aa9bbe740(0000) GS:ffff9d195fa40000(0000) knlGS:0000000000000000
[    5.429844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.429845] CR2: 00007f0aa8382000 CR3: 00000001063ce002 CR4: 0000000000f70ee0
[    5.429846] PKRU: 55555554
[    5.429847] Call Trace:
[    5.429848]  <TASK>
[    5.429849]  asm_exc_control_protection+0x22/0x30
[    5.429852] RIP: 0010:_portMemAllocatorAllocNonPagedWrapper+0x0/0x10 [nvidia]
[    5.429920] Code: 08 48 89 d0 48 89 0f 48 c1 e0 17 48 31 c2 48 89 c8 48 c1 e8 05 48 31 c8 48 31 d0 48 c1 ea 12 48 31 d0 48 89 47 08 01 c8 c3 90 <48> 89 f7 e9 38 0f 00 00 0f 1f 84 00 00 00 00 00 48 89 f7 e9 88 0f
[    5.429921] RSP: 0018:ffffa9c3413b3c80 EFLAGS: 00010202
[    5.429922] RAX: ffffffffc1eae5f0 RBX: 0000000000000010 RCX: 0000000000000000
[    5.429923] RDX: 0000000000000000 RSI: 000000000000002c RDI: ffffffffc20f7b70
[    5.429923] RBP: ffffa9c3413b3c98 R08: 0000000000000020 R09: ffffffffc20f7bf0
[    5.429924] R10: ffffffffc20f55d0 R11: 0000000000000000 R12: ffffffffc20f7b70
[    5.429925] R13: 00007f0aa8382dc0 R14: 000055916224ef30 R15: ffffa9c3413b3e20
[    5.429926]  ? portCryptoPseudoRandomGeneratorGetU32+0x30/0x30 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.429991]  _portMemAllocatorAlloc+0x2e/0x170 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430054]  portCryptoPseudoRandomGeneratorCreate+0x16/0xb0 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430117]  portCryptoInitialize+0x2a/0x40 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430182]  portInitialize+0x2b/0x40 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430246]  coreInitializeRm+0x24/0x90 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430324]  RmInitRm+0x9/0x20 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430399]  rm_init_rm+0x9/0x10 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430472]  nvidia_init_module+0x22e/0x5b0 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430517]  ? nvidia_init_module+0x5b0/0x5b0 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430565]  nvidia_frontend_init_module+0x50/0x91 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430616]  ? nvidia_init_module+0x5b0/0x5b0 [nvidia 5737a4bc014c2c47af46ebdec30e9ee078e09f14]
[    5.430663]  do_one_initcall+0x5a/0x220
[    5.430667]  do_init_module+0x4a/0x240
[    5.430670]  __do_sys_init_module+0x138/0x1b0
[    5.430672]  do_syscall_64+0x5c/0x90
[    5.430674]  ? exc_page_fault+0x74/0x170
[    5.430676]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    5.430677] RIP: 0033:0x7f0aa9512c3e
[    5.430679] Code: 48 8b 0d 5d b1 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2a b1 0e 00 f7 d8 64 89 01 48
[    5.430680] RSP: 002b:00007fff39f3cc58 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    5.430681] RAX: ffffffffffffffda RBX: 000055916224ebd0 RCX: 00007f0aa9512c3e
[    5.430682] RDX: 000055916224ef30 RSI: 00000000008f1db0 RDI: 00007f0aa7a91010
[    5.430682] RBP: 00007f0aa7a91010 R08: 000055916224eae0 R09: 0000000000000000
[    5.430683] R10: 0000000000000005 R11: 0000000000000246 R12: 000055916224ef30
[    5.430684] R13: 000055916224ed00 R14: 000055916224ebd0 R15: 000055916224ef60
[    5.430685]  </TASK>
[    5.430685] Modules linked in: pcc_cpufreq(-) nvidia(OE+) acpi_cpufreq(-) bnep bridge stp llc btusb btrtl btbcm btintel uvcvideo btmtk videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 videobuf2_common ecdh_generic videodev mc snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_hda_codec_realtek snd_sof_intel_hda snd_hda_codec_generic snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_soc_hdac_hda iwlmvm snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi joydev intel_tcc_cooling soundwire_bus mousedev ledtrig_audio mac80211 x86_pkg_temp_thermal intel_powerclamp snd_soc_core coretemp snd_compress ac97_bus kvm_intel libarc4 hid_multitouch snd_hda_codec_hdmi 8250_dw spi_nor mei_pxp snd_pcm_dmaengine mei_hdcp ee1004 mtd i915 iTCO_wdt snd_hda_intel kvm intel_pmc_bxt snd_intel_dspcfg iTCO_vendor_support intel_rapl_msr iwlwifi irqbypass snd_intel_sdw_acpi snd_hda_codec crct10dif_pclmul crc32_pclmul
[    5.430709]  ghash_clmulni_intel snd_hda_core iwlmei vfat aesni_intel processor_thermal_device_pci_legacy processor_thermal_device pmt_telemetry snd_hwdep crypto_simd pmt_class cryptd fat intel_cstate r8169 drm_buddy cfg80211 intel_uncore snd_pcm processor_thermal_rfim realtek psmouse ttm processor_thermal_mbox mei_me snd_timer rfkill pcspkr i2c_i801 mdio_devres processor_thermal_rapl intel_lpss_pci spi_intel_pci intel_rapl_common snd libphy intel_lpss drm_dp_helper spi_intel i2c_smbus soundcore int340x_thermal_zone thunderbolt mei i2c_hid_acpi idma64 intel_gtt intel_vsec intel_soc_dts_iosf i2c_hid intel_hid video intel_scu_pltdrv sparse_keymap system76_acpi mac_hid coreboot_table dm_multipath dm_mod ipmi_devintf ipmi_msghandler crypto_user acpi_call(OE) fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 serio_raw atkbd uas libps2 usb_storage usbhid vivaldi_fmap nvme xhci_pci nvme_core crc32c_intel i8042 xhci_pci_renesas serio
[    5.430736] ---[ end trace 0000000000000000 ]---

To Reproduce

  1. Upgrade to kernel 5.18
  2. Reboot
  3. Observe nvidia module won't load and check kernel logs for the same error

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

Originally I thought this issue had to do with optimus-manager (as I am using a hybrid setup, I use that utility to switch between Intel and NVIDIA mode), but after uninstalling optimus-manager the same issue occurs.

@rnd-ash rnd-ash added the bug Something isn't working label May 25, 2022
@rnd-ash rnd-ash changed the title Linux 5.18 nvidia module wont' load: Missing ENDBR: _portMemAllocatorAllocNonPagedWrapper+0x0/0x10 Linux 5.18 NVIDIA module won't load: Missing ENDBR: _portMemAllocatorAllocNonPagedWrapper+0x0/0x10 May 25, 2022
@gauravjuvekar
Member

Hi, I couldn't repro this with 5.18.0-arch1-1 from testing. Can you try with nvidia-open-dkms-515.43.04-8 just to be sure that the kernel module was rebuilt with the matching kernel headers?

@rnd-ash
Author

rnd-ash commented May 25, 2022

I was using the nvidia-open-dkms package, had run mkinitcpio multiple times and DKMS said it was installing the open modules for 5.18, so I assume I had matching headers

Downgraded to kernel 5.17.9-arch1-1 and everything works for me

@aritger
Collaborator

aritger commented May 25, 2022

@rnd-ash: Could you experiment to see if the same ENDBR error happens with:
(1) the open kernel modules packaged with the NVIDIA .run file (i.e., install from .run file with -m=kernel-open)
(2) the closed kernel modules packaged with the NVIDIA .run file (i.e., install from .run file with -m=kernel)

I'm curious if the problem has something to do with how the open nvidia.ko was built by arch-linux (maybe something about the toolchain used). I think experiments (1) and (2) should help shake that out.

It looks like ENDBR is new in 5.18. I wonder if the problem here only manifests with certain kernel kconfigs. E.g., maybe it requires X86_KERNEL_IBT

@rnd-ash
Author

rnd-ash commented May 26, 2022

From archlinux's config file, I can see that on the problematic kernel version, X86_KERNEL_IBT is enabled here
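
The config check described above can be scripted. This is a minimal sketch, assuming the kernel exposes its build config at /proc/config.gz (not every distribution does); the helper name kernel_ibt_enabled is made up for illustration.

```shell
# Sketch: report whether a kernel config enables CONFIG_X86_KERNEL_IBT.
# kernel_ibt_enabled is a hypothetical helper name.
kernel_ibt_enabled() {
    # $1: path to a kernel config file, plain text or gzip-compressed
    case "$1" in
        *.gz) gzip -dc -- "$1" ;;
        *)    cat -- "$1" ;;
    esac | grep -q '^CONFIG_X86_KERNEL_IBT=y$'
}

# Usage on a running system (assumes /proc/config.gz exists):
#   kernel_ibt_enabled /proc/config.gz && echo "IBT is compiled in"
```

If the function succeeds on the failing kernel and fails on the working one, that points to the kconfig option rather than the distribution's toolchain.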

I tried to download the .run file from https://us.download.nvidia.com/XFree86/Linux-x86_64/515.43.04/NVIDIA-Linux-x86_64-515.43.04.run, but every time I tried to run it I kept getting installation failed.

However, I switched over to try both the nvidia-open-dkms and nvidia-dkms packages from arch (PKGBUILDs can be seen here and here), and they all result in the same ENDBR error.

@atiensivu

Does it work if you pass the kernel ' ibt=off' ?

@rnd-ash
Author

rnd-ash commented May 26, 2022

Does it work if you pass the kernel ' ibt=off' ?

Just tried it, it does!
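
To confirm that the parameter actually reached the running kernel, the boot command line can be inspected. A minimal sketch; the helper name cmdline_has_ibt_off is hypothetical.

```shell
# Sketch: succeed if a kernel command line contains the ibt=off parameter.
# cmdline_has_ibt_off is a hypothetical helper name.
cmdline_has_ibt_off() {
    # $1: kernel command line as a single string
    case " $1 " in
        *" ibt=off "*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Usage on a running system:
#   cmdline_has_ibt_off "$(cat /proc/cmdline)" && echo "IBT disabled at boot"
```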

@danrbball1

ibt=off also works for me from grub. I am running Arcolinux on a Dell XPS 9520 (NVIDIA 3050 and 16 GB ram). I am also running NVIDIA Prime. Any idea what the issue may be?

@QuestionMark001

QuestionMark001 commented May 29, 2022

GPU: NVIDIA RTX 3060 laptop
Driver Version: Closed NVIDIA Driver 515.43.04
I also faced this problem. After updating the kernel to Linux 5.18, boot displays "Failed start to Linux Kernel".

@QuestionMark001

QuestionMark001 commented May 29, 2022

Does it work if you pass the kernel ' ibt=off' ?

Perfect fix😉

@Six6pounder

Same issue here. Kernel: 5.18 - GPU Driver: NVIDIA 515.43.04 - Rtx 3080 desktop

@atiensivu thank you, what does ibt=off do? It boots if I use it

@edjubert

I also faced this problem.
Kernel: 5.18.0-arch1-1
GPU Driver: NVIDIA 515.43.04 - RTX 3070 (laptop)

Optimus-manager failed.
Adding ibt=off to bootloader (grub for me) fixed it
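
For GRUB, the workaround can be applied along these lines. This is only a sketch: the helper name add_ibt_off_to_grub is hypothetical, it assumes GRUB_CMDLINE_LINUX_DEFAULT is a double-quoted value on one line (as in a stock /etc/default/grub), and it uses GNU sed's -i option.

```shell
# Sketch: append ibt=off to GRUB_CMDLINE_LINUX_DEFAULT in a GRUB defaults file.
# add_ibt_off_to_grub is a hypothetical helper name.
add_ibt_off_to_grub() {
    # $1: path to the defaults file
    # Do nothing if the parameter is already present.
    grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*ibt=off' "$1" && return 0
    # Insert " ibt=off" just before the closing quote of the value.
    sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 ibt=off"/' "$1"
}

# Usage (as root), then regenerate the config; the output path is an
# assumption and differs between distributions:
#   add_ibt_off_to_grub /etc/default/grub
#   grub-mkconfig -o /boot/grub/grub.cfg
```

The function is idempotent, so running it again after a package update rewrites nothing if the parameter is still there.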

@kv-y

kv-y commented May 29, 2022

what does ibt=off do?

Indirect Branch Tracking

Add support for Intel CET-IBT (Indirect Branch Tracking), a hardware-supported, coarse-grained, forward-edge Control Flow Integrity protection. It enforces that all indirect calls must land on an ENDBR instruction; as such, the compiler will instrument the code with them to make this happen.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7001052160d172f6de06adeffde24dde9935ece8
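
The missing instruction is visible in the original report's "Code:" dump. ENDBR64 encodes as the bytes f3 0f 1e fa, and the byte at the trapped RIP is wrapped in <...>. The sketch below checks whether the bytes at RIP begin with that encoding; the helper name rip_starts_with_endbr64 is hypothetical.

```shell
# Sketch: succeed if the instruction bytes at a trapped RIP begin with
# ENDBR64 (f3 0f 1e fa). Input is the hex dump from an oops "Code:" line.
# rip_starts_with_endbr64 is a hypothetical helper name.
rip_starts_with_endbr64() {
    # Drop everything before the <..> marker and strip the brackets, so the
    # result starts at the byte located at RIP.
    bytes=$(printf '%s\n' "$1" | sed -n 's/.*<\([0-9a-f][0-9a-f]\)>/\1/p')
    case "$bytes" in
        "f3 0f 1e fa"*) return 0 ;;
        *)              return 1 ;;
    esac
}

# Usage with the tail of the dump at _portMemAllocatorAllocNonPagedWrapper+0x0:
#   rip_starts_with_endbr64 "c8 c3 90 <48> 89 f7 e9 38 0f 00 00" || echo "no ENDBR64"
```

For the dump in the report the check fails: the function starts with 48 89 f7 rather than f3 0f 1e fa, which is what the "Missing ENDBR" message is complaining about.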

@Tudmotu

Tudmotu commented May 29, 2022

Does anyone know how to add this parameter when using EFIStub?

Edit:
Downgraded my kernel for now.
I found this issue after my boot was hanging on "start job is running for Load Kernel Modules".
To downgrade:

  • Use a live USB (systemrescue, arch install, etc) to chroot into your installation (mount /boot, /var, etc as necessary)
  • cd /var/cache/pacman/pkg
  • pacman -U linux-5.17.5.arch1-1-x86_64.pkg.tar.zst

@ocelik94

same goes for me

Kernel: 5.18.0-arch1-1
GPU Driver: NVIDIA 515.43.04 - RTX 2080

@mahancoder

Happens to me too
Kernel: 5.18.0-arch1-1
Driver: NVIDIA 515.43.04
GPU: MX450

Setting ibt=off fixes the issue temporarily, but cannot be considered a full solution.

@CryptLabs

I can confirm, I have the same issue.
Arch Linux
5.18.0-zen1-1-zen
Nvidia RTX 5000

How can we fix this issue?

@mahancoder

mahancoder commented May 29, 2022

How can we fix this issue?

@CryptLabs You can temporarily fix the issue by adding ibt=off to your kernel command line parameters

@EugeneKorshenko

EugeneKorshenko commented May 29, 2022

I can confirm the same issue on my laptop.

Arch Linux
5.18.0-arch1-1
Driver Version: 515.43.04

RTX 3070 Laptop

@CryptLabs

CryptLabs commented May 29, 2022

@mahancoder I have used ibt=off.
However, as you said, I feel that this is not a good solution.

@domino14

please fix

@edjubert

edjubert commented May 30, 2022

I just installed latest nvidia dkms drivers (515.43.04-2) and the fix does not work anymore

@m1guelperez

m1guelperez commented May 30, 2022

5.18.0-arch1-1
nvidia-dkms 515.43.04-2
Nvidia GTX1080

Same problem here: the latest Nvidia driver literally broke my system. I was stuck on "Reached target Graphical Interface" and received several errors. The only way to interact with the system was CTRL+ALT+F2.

What fixed the issue:
Either uninstall all Nvidia packages or pass ibt=off as a kernel parameter.

I just installed latest nvidia drivers (515.43.04-6) and the fix does not work anymore

Did you try to remove the ibt=off flag when using the latest Nvidia driver?

@codicocodes

I had the same issue this morning when updating the drivers. It seems that ibt=off was removed from my kernel options in my latest update; when I re-added it, the drivers started working again.

@rnd-ash
Author

rnd-ash commented May 30, 2022

There appears to be an open bug now on Archlinux about this issue
https://bugs.archlinux.org/task/74891

@edjubert

@m1guelperez yes, the first thing I've done is to remove and reboot but still not working.

Also, I'm not sure it's related, but even with the nvidia drivers loading properly, HDMI does not seem to work with the workaround (my HDMI is wired to my GPU)

@mtijanic
Collaborator

The following patch will insert the necessary endbr64 instructions:

diff --git a/src/nvidia-modeset/Makefile b/src/nvidia-modeset/Makefile
index c63b86b..69490d0 100644
--- a/src/nvidia-modeset/Makefile
+++ b/src/nvidia-modeset/Makefile
@@ -95,7 +95,6 @@ CFLAGS += -ffunction-sections
 CFLAGS += -fdata-sections
 CFLAGS += -ffreestanding
 
-CONDITIONAL_CFLAGS := $(call TEST_CC_ARG, -fcf-protection=none)
 CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -Wformat-overflow=2)
 CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -Wformat-truncation=1)
 ifeq ($(TARGET_ARCH),x86_64)
diff --git a/src/nvidia/Makefile b/src/nvidia/Makefile
index 9bdb826..cc05ab7 100644
--- a/src/nvidia/Makefile
+++ b/src/nvidia/Makefile
@@ -119,8 +119,6 @@ CFLAGS += -fdata-sections
 NV_KERNEL_O_LDFLAGS += --gc-sections
 EXPORTS_LINK_COMMAND = exports_link_command.txt
 
-CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -fcf-protection=none)
-
 ifeq ($(TARGET_ARCH),x86_64)
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mindirect-branch-register)
   CONDITIONAL_CFLAGS += $(call TEST_CC_ARG, -mindirect-branch=thunk-extern)

Is there anyone facing these problems that can try rebuilding the modules with the patch and report back?

I'm not sure why the -fcf-protection=none is there in the first place, but I expect it was an attempt to minimize the code size.
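
Before or after applying the patch, it can help to see which Makefiles still pass the flag. A minimal sketch; the helper name find_cf_protection_none is hypothetical, and the paths in the usage comment are the two files touched by the diff above.

```shell
# Sketch: list Makefile lines that still force -fcf-protection=none.
# find_cf_protection_none is a hypothetical helper name.
find_cf_protection_none() {
    # $@: one or more Makefile paths; prints file:line:text for each match
    # and returns non-zero when nothing matches.
    grep -Hn -e '-fcf-protection=none' -- "$@"
}

# Usage from the root of an open-gpu-kernel-modules checkout:
#   find_cf_protection_none src/nvidia/Makefile src/nvidia-modeset/Makefile
```

No output and a non-zero exit status after patching suggest both removals were applied.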

@m1guelperez

@m1guelperez yes, the first thing I've done is to remove and reboot but still not working.

Also, I'm not sure it's related, but even with the nvidia drivers loading properly, HDMI does not seem to work with the workaround (my HDMI is wired to my GPU)

Hmm, I can't help you there since I use DP. But I will definitely wait with any updates for now. 😄

@TheBakerCat

I can confirm the issue.

i5-11400h + RTX 3050ti laptop
nvidia-dkms 515.43.04-2 + 5.18.zen1-1

ibt=off fixes the issue

@gtkramer

gtkramer commented Jul 11, 2022

Thank you for forwarding this internally, @mtijanic! I bought an RTX 3070 and an i7-12700F for my custom desktop PC build that I'm eager to use for working with Blender and Unreal Engine 5 on Linux. Running into this right away after doing a clean install of Arch was a bit surprising, but I'm glad there are workarounds for now. Seeing this discussion about NVIDIA drivers on GitHub is exciting! Hopefully the development team can find a way to make even the proprietary driver work with the new IBT feature on the latest Intel processors!

@gtkramer

gtkramer commented Jul 11, 2022

On an updated Arch Linux install, this is what worked for me to get a Wayland session using the proprietary NVIDIA drivers:

cat >> /etc/modprobe.d/nvidia.conf <<EOF
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia-drm modeset=1
EOF
mkinitcpio -p linux
systemctl enable nvidia-{hibernate,suspend,resume}
reboot

No modification was necessary to /etc/environment. It wasn't until I read /usr/lib/udev/rules.d/61-gdm.rules that I found what the required criteria were for GDM to start a Wayland session. GDM is picky about power management. I found this to be rather helpful too: https://download.nvidia.com/XFree86/Linux-x86_64/515.57/README/powermanagement.html

If anyone is relying on UEFI booting with efibootmgr, here's what got me over the IBT hump:

efibootmgr -c -d "${BLOCK_DEV}" -p 1 -L 'Arch Linux' -l /vmlinuz-linux -u 'cryptdevice=PARTLABEL=root:root root=/dev/mapper/root rw ibt=off initrd=/initramfs-linux.img quiet'

Some of the other kernel parameters are for an encrypted hard drive. Though my drive supports TCG OPAL, my motherboard does not support booting from such a drive when it's enabled. I found that using block device software encryption like this is much less of a hassle. CPUs probably accelerate this anyway nowadays. Oh, the many hours it took for my thick skull to learn this :)

@xnox

xnox commented Jul 14, 2022

CONDITIONAL_CFLAGS := $(call TEST_CC_ARG, -fcf-protection=none)

Note that the above is not required for v5.13+ kernels, as the kernel build was fixed to always emit the cf-protection value the kernel needs when the compiler supports the flag. It would be best to make these lines conditional, or, for example, to always use IBT-compatible flags that are backwards compatible.

@totof3110

Ran into the same issue with newly released kernel 5.19 on Manjaro. ibt=off also solves it.

GPU: GeForce RTX 3080 laptop.

@dxbednarczyk

On 5.19.4 (Fedora 36, 3070ti 12400f), the only way I can boot into my system is by not setting ibt=off. Otherwise, grub2 boots into emergency mode and does not write anything to journalctl (silent fail?). I assume this has either something to do with a newer kernel or some Red Hat magic

@renie

renie commented Sep 11, 2022

Still needs ibt=off to boot.

Kernel 5.19.7 (Arch, GTX 1650Ti Mobile, Intel 1165G7).
NVIDIA driver: 515.65.01
GRUB: 2.06.r322.gd9b4638c5-4



edit: new versions tested

Kernel 5.19.13-arch1-1 (Arch, GTX 1650Ti Mobile, Intel 1165G7).
NVIDIA driver: 515.76
GRUB: 2:2.06.r334.g340377470-1


edit: new versions tested

Kernel 6.0.2-arch1-1 (Arch, GTX 1650Ti Mobile, Intel 1165G7).
NVIDIA driver: 520.56.06
GRUB: 2:2.06.r334.g340377470-1

@RIvance

RIvance commented Nov 2, 2022

Does it work if you pass the kernel ' ibt=off' ?

asus z690-p prime + RTX 3080 TI + archlinux with kernel 6.0.4, same issue, perfectly solved! Thank you 😃

@karmux

karmux commented Nov 6, 2022

ibt=off also fixes an Nvidia 3050 Ti laptop on newer kernels. Without it the system is not bootable.

@gtkramer

gtkramer commented Nov 8, 2022

https://www.phoronix.com/news/Linux-IBT-By-Default-Tip

According to Phoronix, the default configuration for the Linux kernel will now have IBT turned on for everyone. This will no longer be a distro-specific change. NVIDIA, please consider implementing proper support for IBT sooner rather than later for all of your Linux drivers.

@mtijanic
Collaborator

mtijanic commented Nov 8, 2022

Hi all, just want to reassure you that this is an issue that is actively being worked on. For open-gpu-kernel-modules, there is a patch posted above that you (or a distro) can apply, and we will be integrating something similar soon enough. Since the patch changes the compile flags for the project, we need to properly QA it on a wide range of HW and kernel configs, hence the delay.

For the proprietary driver, the story is a fair bit more complicated, as that driver is built with a custom patched-up version of GCC, that doesn't support IBT. Which means we have to port the patches over and build with a completely new compiler version. And that requires even more QA. It's all actively being worked on, but I'm sure you understand why we can't give an ETA.

Thanks for the interest and the understanding!

@gtkramer

gtkramer commented Nov 8, 2022

Thank you, @mtijanic, for the detailed response and update! It's always interesting to see glimpses into the software development processes of other companies. Working at a large one myself, I understand that there are reasons for how things are done. I appreciate the diligence the team is taking to ensure a quality release, as these things can be tricky.

@alexm77

alexm77 commented Nov 9, 2022

Hi all, just want to reassure you that this is an issue that is actively being worked on. For open-gpu-kernel-modules, there is a patch posted above that you (or a distro) can apply, and we will be integrating something similar soon enough. Since the patch changes the compile flags for the project, we need to properly QA it on a wide range of HW and kernel configs, hence the delay.

For the proprietary driver, the story is a fair bit more complicated, as that driver is built with a custom patched-up version of GCC, that doesn't support IBT. Which means we have to port the patches over and build with a completely new compiler version. And that requires even more QA. It's all actively being worked on, but I'm sure you understand why we can't give an ETA.

Thanks for the interest and the understanding!

An ETA is just that: an estimate. You can't even estimate whether it will be a couple of months, a year, or more?

@aritger
Collaborator

aritger commented Nov 10, 2022

For open-gpu-kernel-modules, this should be addressed in 525.53. Since the Issue here is specifically tracking open-gpu-kernel-modules, I'm going to mark this Issue as closed.
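
A quick way to tell whether an installed open kernel module is at or past that release is a version comparison. This is a sketch under assumptions: the helper name version_at_least is hypothetical, it relies on GNU sort -V, and the result only means something for the open modules, since the closed-source modules are on a separate timeline.

```shell
# Sketch: succeed if driver version $1 is at least version $2.
# version_at_least is a hypothetical helper; it relies on GNU sort -V.
version_at_least() {
    # The smaller of the two versions sorts first; $1 >= $2 exactly when
    # that smaller version is $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (assumes the nvidia module is installed so modinfo can find it):
#   version_at_least "$(modinfo -F version nvidia)" 525.53 && echo "has the fix"
```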

For the closed-source kernel modules, we're still wrangling with the toolchain issues that mtijanic mentioned. I suspect it could be as much as a few months until we ship the close-source kernel modules with IBT support. But, yes, we're in a race with Linux kernel 6.2.

@solomonbstoner

Encountered the same issue. Running 6.0.11-arch1-1 with Nvidia driver version 525.60.11. ibt=off fixed the issue.
I am surprised there was no mention of systemd-modules-load anywhere. Prior to the fix, my GUI wouldn't start, but I could use the other TTYs. systemctl --failed reported that systemd-modules-load was killed due to a segmentation fault. The Nvidia kernel module was not singled out in the logs. Only after digging deeper did I notice the ENDBR error message, which led me here.
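
Since the failed unit does not name the module, searching the kernel log for the ENDBR message directly can shorten that digging. A minimal sketch; the helper name missing_endbr_symbols is hypothetical.

```shell
# Sketch: print the symbol named in each "Missing ENDBR" line read from stdin.
# missing_endbr_symbols is a hypothetical helper name.
missing_endbr_symbols() {
    sed -n 's/.*Missing ENDBR: \([^ ]*\).*/\1/p'
}

# Usage on a running system (journalctl -k shows kernel messages; -b -1
# selects the previous boot):
#   journalctl -k -b -1 | missing_endbr_symbols
```

Fed the log line from the original report, it prints _portMemAllocatorAllocNonPagedWrapper+0x0/0x10.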

@ccwienk

ccwienk commented Dec 24, 2022

@aritger : is there some means to track this patch's arrival in closed-source kmod? thx

gtkramer added a commit to gtkramer/arch-linux-setup-scripts that referenced this issue Dec 24, 2022
This is required when using the proprietary NVIDIA driver with 12th gen
Intel processors until NVIDIA makes an update to support IBT.

NVIDIA/open-gpu-kernel-modules#256

The open NVIDIA driver has been fixed, but prefer to use the proprietary
version.  The open version is considered alpha quality according to a
note on the Arch Linux NVIDIA wiki.

Add nvidia kernel module parameters via modprobe.d configuration
instead.
@amrit1711 amrit1711 self-assigned this May 14, 2024
@amrit1711
Collaborator

Hi All,
This bug was fixed a long time ago. Could someone please verify the fix and share test results?
Thanks in advance.

@ricocheting

I had this bug on Arch, nvidia-dkms, and an RTX 3090. I used the ibt=off kernel parameter to boot for months. Fairly recently during other changes I removed the line from grub and I can confirm it is fixed for me and everything works normally.
