Application/driver freeze on resume from suspend with optimus #739

Open
mobarre opened this issue Feb 23, 2016 · 11 comments

Comments

@mobarre

mobarre commented Feb 23, 2016

I'm cross posting this also on the bumblebee bugtracker, because there seem to be a lot of similar yet different issues posted here. Original post: https://devtalk.nvidia.com/default/topic/918576/linux/application-driver-freeze-on-resume-from-suspend-with-optimus/

When I start an application with optirun and/or primus and suspend my laptop, the application is frozen on resume. From what I gather of dmesg, the driver doesn't seem to be able to wake up the card or restore its state properly.

Steps to reproduce:
That's the easy part: run glxspheres (32-bit or 64-bit) through optirun, suspend, resume and voilà! glxspheres should be frozen. The rest of the system works fine. Sometimes restarting the OpenGL app is enough, sometimes a full system restart is needed.

This happens with any OpenGL application that I run with optirun. It has been happening since at least September (possibly longer).

So far, the most useful logs I can produce are an extract of my dmesg with the whole suspend/resume process. File is attached. You should notice the mess here:

[ 987.402372] NVRM: GPU at PCI:0000:03:00: GPU-c3350c76-8707-abd9-a985-52814992bd10
[ 987.402380] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: Shader Program Header 1 Error
[ 987.402442] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: Shader Program Header 2 Error
[ 987.402493] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: Shader Program Header 3 Error
[ 987.402545] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: Shader Program Header 9 Error
[ 987.402596] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: Shader Program Header 18 Error
[ 987.402648] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x405840=0xa204020e
[ 987.402727] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ChID 0010, Class 0000b097, Offset 00001644, Data 00000001
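For anyone comparing logs, the Xid lines above can be pulled out of a saved dmesg dump with a few lines of Python. This is only a minimal sketch: the names `XidEvent`, `parse_xid` and `xid_events` are made up for illustration, and the regular expression is tuned to the line format quoted in this thread, so other driver versions may print something different.

```python
import re
from typing import List, NamedTuple, Optional

# Matches lines such as:
# [ 987.402380] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ...
# Assumption: the format is exactly the one quoted in this issue.
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): "
    r"(?P<code>\d+), (?P<msg>.*)"
)


class XidEvent(NamedTuple):
    timestamp: float  # seconds since boot, as printed by dmesg
    pci: str          # PCI address of the GPU, e.g. "0000:03:00"
    code: int         # Xid number, e.g. 13
    message: str      # rest of the line after the Xid number


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return an XidEvent for an NVRM Xid line, or None for any other line."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(float(m["ts"]), m["pci"], int(m["code"]), m["msg"].strip())


def xid_events(dmesg_text: str) -> List[XidEvent]:
    """Collect all Xid events found in a block of dmesg output."""
    events = []
    for line in dmesg_text.splitlines():
        event = parse_xid(line)
        if event is not None:
            events.append(event)
    return events
```

Feeding it the output of `dmesg` saved to a file gives one record per Xid line, which makes it easier to compare the messages between two machines.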

I did check that under Windows 10 the crash does not happen (so it is not a hardware issue).

Additional info:

Distribution: Arch Linux

Package version(s):

  • bumblebee 3.2.1-10
  • primus 20151110-1
  • nvidia 361.28-1
  • xorg 1.18.1-3
  • kernel 4.4.1-2-ARCH

Hardware:

  • laptop is an ASUS UX303LN with an Nvidia GeForce 840M graphics card + an Intel Corporation Haswell-ULT Integrated Graphics Controller.

Is anyone else seeing this? What information would help pin this down?

@Yamakuzure

By pure chance, I happened to try that out involuntarily just yesterday.

I am starting VMware Workstation using primusrun so I can have 3D acceleration in Windows 10. Yesterday I forgot to shut down my virtual machine before going home. When I woke up my laptop at home, VMware Workstation was still running and I could log into Windows just fine. Nothing froze.

My setup:

  • Gentoo Linux
  • Bumblebee from git, "develop" branch
  • primus and bbswitch from git, "master" branch
  • nvidia-358.16
  • xorg-server-1.17.4
  • kernel-4.3.5 with gentoo patchset

I have seen posts in the gentoo forums with problems when using the newest nvidia drivers.

Please note that the bumblebee and primus releases you use are very old, while the driver and kernel are bleeding edge. That does clash a bit, I guess...

@mobarre
Author

mobarre commented Feb 23, 2016

OK, so if I understand correctly, you suggest I try the latest git versions of bumblebee and primus to match the kernel and drivers?

I'll give it a shot. It does make sense, although I did have the same issue with older Arch kernels (4.3) and most certainly with nvidia driver 358.

@Yamakuzure

There had been some development regarding module unloading and the nvidia-uvm module, which did not exist when the last release came out.

Further, I think you have to patch in the awareness of the UVM module with the patch I attached. Unfortunately I can't seem to find out where I got it. :-( (But I am sure it was from one of the issues opened here regarding nvidia-uvm.)

nvidia-uvm-support.patch.txt

mobarre pushed a commit to sn-archi/Bumblebee that referenced this issue Feb 23, 2016
@mobarre
Author

mobarre commented Feb 23, 2016

OK, no success with the latest git of bumblebeed. Primus is already the latest git version.

The patch, although unrelated in my honest opinion, does help with module unloading on a clean application shutdown, but it doesn't change the behavior on suspend.
The application should indeed be suspended and not shut down, so the nvidia module should not be unloaded. You expect to find the module loaded and operational on resume.

@mobarre
Author

mobarre commented Feb 23, 2016

oh, and setting the bridge to virtualgl changes the error message in the kernel logs:
[ 1205.292587] PM: resume of devices complete after 777.383 msecs
[ 1205.293070] PM: Finishing wakeup.
[ 1205.293072] Restarting tasks ...
[ 1205.298268] NVRM: GPU at PCI:0000:03:00: GPU-c3350c76-8707-abd9-a985-52814992bd10
[ 1205.298277] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception on GPC 0: SAVE_RESTORE_ADDR_OOB
[ 1205.298332] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ESR 0x500900=0x80000001

[ 1205.298398] NVRM: Xid (PCI:0000:03:00): 13, Graphics Exception: ChID 0010, Class 0000b097, Offset 00001b0c, Data 1000f010
[ 1205.300415] done.
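To check how closely the NVRM messages follow the resume, the timestamps in an excerpt like the one above can be compared against the `PM: resume of devices complete` line. The following is a rough sketch only: `nvrm_after_resume` and the 5-second window are arbitrary choices for illustration, not anything bumblebee or the nvidia driver provides.

```python
import re
from typing import List, Optional, Tuple

# "[ 1205.292587] PM: resume of devices complete ..." -> (timestamp, message)
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")
RESUME_MARKER = "PM: resume of devices complete"


def nvrm_after_resume(dmesg_text: str, window_s: float = 5.0) -> List[Tuple[float, str]]:
    """Return (seconds since resume, message) for every NVRM line logged
    within window_s seconds after the most recent resume marker."""
    hits: List[Tuple[float, str]] = []
    resume_ts: Optional[float] = None
    for line in dmesg_text.splitlines():
        m = TS_RE.match(line.strip())
        if m is None:
            continue  # blank lines or continuation lines without a timestamp
        ts, msg = float(m.group(1)), m.group(2)
        if msg.startswith(RESUME_MARKER):
            resume_ts = ts
        elif (
            resume_ts is not None
            and msg.startswith("NVRM:")
            and ts - resume_ts <= window_s
        ):
            hits.append((round(ts - resume_ts, 6), msg))
    return hits
```

On the excerpt above this would report the NVRM lines as arriving a few milliseconds after the devices finish resuming, which is consistent with the driver failing to restore the GPU state rather than the application crashing later.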

I feel like I'm getting somewhere. I'll try to downgrade the nvidia kernel module to a version that actually had some success with some users, although I'm not sure anyone really suspends an optimus laptop while running a 3D application...

@mobarre
Author

mobarre commented Feb 23, 2016

New attempt made with a near full system downgrade to:

linux 4.2.1-1-ARCH
xorg server 1.17.2-4
nvidia driver 355.11-1
bumblebee 20150118-2
primus 20150118-2

It behaves exactly the same. Is there any way I can get more debug logging for the suspend process on the bumblebee and nvidia driver side?

@mobarre
Author

mobarre commented Feb 23, 2016

Note that the PR in no way fixes the issue, but it seems to make module unloading work for those whose nvidia driver loads nvidia_modeset.

@szebrowski

szebrowski commented Feb 26, 2016

Same problem here:

[28191.854094] NVRM: GPU at PCI:0000:01:00: GPU-f0067e93-55ea-0863-49f5-0485472bf256
[28191.854103] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 1 Error
[28191.854108] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 2 Error
[28191.854112] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 3 Error
[28191.854116] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 9 Error
[28191.854119] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: Shader Program Header 18 Error
[28191.854125] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x405840=0xa204020e
[28191.854149] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ChID 0010, Class 0000b097, Offset 00002390, Data 00000000
[28193.526034] r8169 0000:09:00.0 enp9s0: link up
[28193.526048] IPv6: ADDRCONF(NETDEV_CHANGE): enp9s0: link becomes ready

@mobarre
Author

mobarre commented Feb 26, 2016

OK, I've been investigating more. Since we have someone else, let's see what we have in common. I never posted the full dmesg, which might have been silly: http://pastebin.com/HUM0C38h
Several issues are visible in there, although I'm quite convinced that the main issue is the Xid errors. Still, some highlights:

  • Driver issues:
    [ 7.985106] nvidia: module license 'NVIDIA' taints kernel.
    [ 7.985110] Disabling lock debugging due to kernel taint
    [ 8.008226] nvidia 0000:03:00.0: Refused to change power state, currently in D3
    [ 8.008248] NVRM: This is a 64-bit BAR mapped above 4GB by the system
    NVRM: BIOS or the Linux kernel, but the PCI bridge
    NVRM: immediately upstream of this GPU does not define
    NVRM: a matching prefetchable memory window.
    [ 8.008252] NVRM: This may be due to a known Linux kernel bug. Please
    NVRM: see the README section on 64-bit BARs for additional
    NVRM: information.
    [ 8.008381] nvidia: probe of 0000:03:00.0 failed with error -1
    [ 8.009100] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
    [ 8.009488] NVRM: The NVIDIA probe routine failed for 1 device(s).
    [ 8.009491] NVRM: None of the NVIDIA graphics adapters were initialized!
    [ 8.009493] [drm] Module unloaded
    [ 8.009666] NVRM: NVIDIA init module failed!
  • ACPI/power-management issues which might or might not be related:
    [ 738.966460] bbswitch: disabling discrete graphics
    [ 738.966498] ACPI Warning: SB.PCI0.RP05.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires Package
    [ 738.979619] pci 0000:03:00.0: Refused to change power state, currently in D0
  • And of course stacktraces that might or might not appear. It depends.
    [ 738.948139] ------------[ cut here ]------------
    [ 738.948144] WARNING: CPU: 1 PID: 2371 at drivers/base/driver.c:191 driver_unregister+0x47/0x50()
    [ 738.948145] Unexpected driver unregister!
    [ 738.948146] Modules linked in: nvidia(PO-) ecb ecryptfs cbc encrypted_keys mcryptd sha1_ssse3 sha1_generic trusted sha256_ssse3 sha256_generic hmac drbg ansi_cprng ctr ccm fuse arc4 nls_iso8859_1 nls_cp437 intel_rapl vfat x86_pkg_temp_thermal fat intel_powerclamp coretemp asus_nb_wmi kvm_intel iTCO_wdt asus_wmi iTCO_vendor_support kvm mxm_wmi sparse_keymap irqbypass crct10dif_pclmul crc32_pclmul iwlmvm mac80211 uvcvideo aesni_intel aes_x86_64 videobuf2_vmalloc lrw videobuf2_memops gf128mul videobuf2_v4l2 glue_helper videobuf2_core ablk_helper v4l2_common cryptd mousedev iwlwifi input_leds videodev snd_soc_rt5640 media snd_soc_rl6231 btusb snd_hda_codec_hdmi psmouse btrtl btbcm btintel serio_raw snd_hda_codec_conexant pcspkr snd_hda_codec_generic cfg80211 bluetooth i2c_i801 lpc_ich snd_hda_intel mei_me
    [ 738.948181] snd_hda_codec rfkill shpchp mei thermal snd_soc_core snd_hda_core snd_compress snd_pcm_dmaengine snd_hwdep ac97_bus wmi dw_dmac int3402_thermal snd_pcm dw_dmac_core battery elan_i2c processor_thermal_device spi_pxa2xx_platform snd_timer int340x_thermal_zone i2c_hid snd_soc_sst_acpi gpio_lynxpoint snd intel_soc_dts_iosf 8250_dw i2c_designware_platform i2c_designware_core fjes soundcore iosf_mbi acpi_als tpm_tis ac tpm int3400_thermal kfifo_buf acpi_thermal_rel industrialio evdev processor mac_hid sch_fq_codel joydev ip_tables x_tables ext4 crc16 mbcache jbd2 dm_mod hid_generic sd_mod hid_logitech_hidpp hid_logitech_dj usbhid hid atkbd libps2 crc32c_intel ahci libahci libata xhci_pci xhci_hcd scsi_mod usbcore usb_common i8042 serio sdhci_acpi sdhci led_class mmc_core bbswitch(O) i915 video
    [ 738.948215] button intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm [last unloaded: nvidia_modeset]
    [ 738.948221] CPU: 1 PID: 2371 Comm: rmmod Tainted: P W O 4.4.1-2-ARCH #1
    [ 738.948222] Hardware name: ASUSTeK COMPUTER INC. UX303LN/UX303LN, BIOS UX303LN.204 09/01/2014
    [ 738.948223] 0000000000000000 000000002d76c1e7 ffff88031bc83dc0 ffffffff812c7f39
    [ 738.948226] ffff88031bc83e08 ffff88031bc83df8 ffffffff810765b2 ffffffffa1131d68
    [ 738.948227] ffffffffa1131de8 ffffffffa1131de0 0000000000000800 000000000087d010
    [ 738.948229] Call Trace:
    [ 738.948234] [] dump_stack+0x4b/0x72
    [ 738.948238] [] warn_slowpath_common+0x82/0xc0
    [ 738.948241] [] warn_slowpath_fmt+0x5c/0x80
    [ 738.948255] [] ? unregister_chrdev_region+0x41/0x50
    [ 738.948257] [] driver_unregister+0x47/0x50
    [ 738.948261] [] pci_unregister_driver+0x29/0x90
    [ 738.948349] [] ebridge_exit+0x15/0x20 [nvidia]
    [ 738.948393] [] nvidia_exit_module+0x29/0xa8 [nvidia]
    [ 738.948444] [] nvidia_frontend_exit_module+0x9/0x2c [nvidia]
    [ 738.948448] [] SyS_delete_module+0x1ae/0x250
    [ 738.948450] [] ? exit_to_usermode_loop+0x5e/0xc0
    [ 738.948454] [] entry_SYSCALL_64_fastpath+0x12/0x71
    [ 738.948456] ---[ end trace 35a0ca57557b6435 ]---
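To make it easier to compare these highlights across machines, the same kinds of lines can be pulled out of a full dmesg dump automatically. The sketch below is an assumption-laden helper, not an established tool: the category names and regular expressions are hypothetical and simply mirror the lines quoted above.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical categories, modelled on the highlights quoted in this comment.
PATTERNS = {
    "driver_probe": re.compile(r"NVRM: .*failed|nvidia: probe of .* failed"),
    "power_state": re.compile(r"Refused to change power state"),
    "acpi": re.compile(r"ACPI (Warning|Error)"),
    "kernel_warning": re.compile(r"WARNING: CPU: \d+ PID: \d+"),
}


def classify(dmesg_text: str) -> Dict[str, List[str]]:
    """Group dmesg lines by the first category whose pattern matches them."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for line in dmesg_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                buckets[name].append(line.strip())
                break  # one category per line is enough for a quick overview
    return dict(buckets)
```

Running `classify` on two dumps and comparing which categories show up in both is a quick way to see what the affected systems have in common.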

@szebrowski could you have a look at your kernel logs and tell me whether you get anything that looks like those highlights? What laptop are you seeing this on? Can you give a quick version list (driver, xorg, bumblebee and kernel at least)?

Also, if you could +1 and add info on the nvidia devtalk post here: https://devtalk.nvidia.com/default/topic/918576/linux/application-driver-freeze-on-resume-from-suspend-with-optimus/ it could help. It's not like the nvidia devs seem to be giving a shit at the moment.

@ArchangeGabriel
Member

I’m adding this bug report to my review queue for 4.0. Will test it on my system to see if I can reproduce, and else will dig into your logs to see what we have here.

@mobarre
Author

mobarre commented May 17, 2016

Little update:
I've tried removing bumblebee and using the nvidia driver + modesetting driver.
The suspend issue isn't there anymore, although I do get screen corruption on my X background.
Also, gdm is a no-go (X gives me a black screen), and a secondary monitor blackens the screen too (at least with gnome-shell) when plugged in.

Bottom line: bumblebee might be doing something that prevents normal resume. I'm still waiting for some feedback from the nvidia forums (I wouldn't hold my breath though...). Apparently when you code a proprietary driver, you make it your duty to leave people completely in the dark :)


4 participants