bbswitch is broken with kernel 4.8 pcie port power management #140

Open
nathanielwarner opened this issue Oct 9, 2016 · 121 comments

Comments

@nathanielwarner nathanielwarner commented Oct 9, 2016

I just upgraded to kernel 4.8, and bbswitch 0.8-1 no longer works properly. When I try to run something with primusrun, it fails with "bumblebee could not enable discrete graphics card" or something, and I get this in dmesg:

bbswitch: enabling discrete graphics
pci 0000:01:00.0: Refused to change power state, currently in D3
pci 0000:01:00.0: Refused to change power state, currently in D3

When I use the kernel command line option pcie_port_pm=off primusrun works again, and I get this in dmesg upon using primusrun:

bbswitch: enabling discrete graphics
nvidia-nvlink: Nvlink Core is being initialized, major device number 242
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  370.28  Thu Sep  1 19:45:04 PDT 2016
vgaarb: this pci device is not a vga device
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  370.28  Thu Sep  1 19:18:48 PDT 2016
nvidia-modeset: Allocated GPU:0 (GPU-33c835cf-d564-600a-037b-c7ecb9188d7c) @ PCI:0000:01:00.0
nvidia-modeset: Freed GPU:0 (GPU-33c835cf-d564-600a-037b-c7ecb9188d7c) @ PCI:0000:01:00.0
vgaarb: this pci device is not a vga device
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
nvidia-modeset: Unloading
nvidia-nvlink: Unregistered the Nvlink Core, major device number 242
bbswitch: disabling discrete graphics
ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
pci 0000:01:00.0: Refused to change power state, currently in D0

Is lack of support for Kernel 4.8 default configuration an issue that anyone else is having? I'm running Manjaro with Kernel 4.8.1-1, Nvidia driver 370.28, bbswitch 0.8-1.

Member

@Lekensteyn Lekensteyn commented Oct 10, 2016

bbswitch has indeed not been updated for the new PM method in kernel 4.8. If you have a newer machine (>= 2015), you might experience issues if you enabled runtime PM for devices.

Do you happen to have udev rules or other "laptop mode tools" that enable power saving features (i.e. by writing auto to the power/control node in sysfs)? It is my current belief that your problem cannot occur unless you enable such power saving methods.

As a workaround you can boot with the pcie_port_pm=off kernel option (or disable runtime PM for the NVIDIA PCI device or its parent PCIe port).
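
The second workaround (disabling runtime PM for the GPU or its parent port) can be sketched as follows. This is a minimal illustration, assuming the standard sysfs power/control node, where writing "on" keeps a device out of runtime PM. The function name and the SYSFS_PCI override are hypothetical helpers, not part of bbswitch, and the write needs root.

```sh
# Base directory of PCI devices in sysfs. The override exists only so the
# function can be tried against a scratch directory (an assumption of
# this sketch, not a kernel feature).
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# disable_runtime_pm 0000:01:00.0
# Writes "on" to power/control of the device and of its parent PCIe port.
disable_runtime_pm() {
    dev="$1"
    # Resolve the sysfs symlink so the parent directory is the upstream port.
    dev_dir=$(cd "$SYSFS_PCI/$dev" 2>/dev/null && pwd -P) || {
        echo "no such PCI device: $dev" >&2
        return 1
    }
    parent_dir=$(dirname "$dev_dir")
    for dir in "$dev_dir" "$parent_dir"; do
        if [ -w "$dir/power/control" ]; then
            echo on > "$dir/power/control"   # "on" = runtime PM disabled
            echo "$dir/power/control -> on"
        fi
    done
}
```

The setting does not survive a reboot, so it would have to be reapplied at boot (for example from a udev rule or a startup script).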

Author

@nathanielwarner nathanielwarner commented Oct 10, 2016

I am using TLP and Powertop, but bbswitch still doesn't work with those disabled. The strange thing is that bbswitch seems to think the NVIDIA card is stuck in D0 power state on startup, but then is unable to start it upon invocation of primusrun, and reports that the card is stuck in D3.
On startup, I get these messages:

[    8.164115] bbswitch: version 0.8
[    8.164123] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.GFX0
[    8.164132] bbswitch: Found discrete VGA device 0000:01:00.0: \_SB_.PCI0.PEG0.PEGP
[    8.164148] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[    8.164326] bbswitch: detected an Optimus _DSM function
[    8.164338] bbswitch: device 0000:01:00.0 is in use by driver 'nvidia', refusing OFF
[    8.164341] bbswitch: Succesfully loaded. Discrete card 0000:01:00.0 is on
[    8.164941] [drm] [nvidia-drm] [GPU ID 0x00000100] Unloading driver
[    8.183647] nvidia-modeset: Unloading
[    8.200285] Bluetooth: BNEP (Ethernet Emulation) ver 1.3
[    8.200287] Bluetooth: BNEP filters: protocol multicast
[    8.200293] Bluetooth: BNEP socket layer initialized
[    8.200384] nvidia-nvlink: Unregistered the Nvlink Core, major device number 242
[    8.221637] bbswitch: disabling discrete graphics
[    8.221655] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[    8.236787] pci 0000:01:00.0: Refused to change power state, currently in D0

And on invocation of primusrun, I get this:

[  225.420138] bbswitch: enabling discrete graphics
[  225.496646] pci 0000:01:00.0: Refused to change power state, currently in D3
[  225.573232] pci 0000:01:00.0: Refused to change power state, currently in D3

Is it possible that the kernel-based port power management is able to control the power state, but bbswitch is not? This would make sense because on pre-4.8 kernel versions there is no kernel-based PCIe port power management, and the card seems to always be stuck in D0 (see dmesg output in my first post), and bbswitch has no problems this way. It is when the card is successfully put into D3 that bbswitch is unable to use it.

Member

@Lekensteyn Lekensteyn commented Oct 10, 2016

Have you rebooted after disabling TLP? The PCIe port power management introduced with 4.8 cannot be combined with bbswitch in one boot; that case is not supported (this may or may not work, no guarantees).

Have you tried the kernel option which I mentioned above? What is your laptop and GPU btw?

If you see the messages "bbswitch: enabling discrete graphics" followed by "Refused to change power state, currently in D3" (or similarly, "disabling" and "currently in D0"), then it is an indication that something went wrong...
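
The message pairing described above can be checked mechanically. The sketch below is an illustration only: it reads kernel log text on stdin and reports when a bbswitch request is followed by a "Refused to change power state" line naming the state the card should have left. The function name is made up for this example.

```sh
# Usage: dmesg | bbswitch_refused
# Exit status 0 means a refused transition was found, 1 means none was.
bbswitch_refused() {
    awk '
        # Remember which state the card must leave for the request to work.
        /bbswitch: enabling discrete graphics/  { stuck = "D3"; next }
        /bbswitch: disabling discrete graphics/ { stuck = "D0"; next }
        stuck != "" && /Refused to change power state, currently in/ {
            if ($NF == stuck) {
                printf "stuck in %s after bbswitch request\n", stuck
                bad = 1
            }
        }
        END { exit (bad ? 0 : 1) }
    '
}
```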

Author

@nathanielwarner nathanielwarner commented Oct 10, 2016

Yes, I rebooted after disabling tlp and powertop. As I said above, when I boot with the kernel option you mention, primusrun works again, but bbswitch reports that it cannot change the GPU power state out of D0 when primusrun stops running. But that happened with earlier kernels as well. And yes, something is clearly going wrong. But maybe this has actually been a problem all along, and is only now showing itself because the kernel is putting the dGPU out of D0.
My laptop is a Dell XPS 15 9550; the GPUs are Intel HD 530 + GeForce GTX 960M.

@rockorequin rockorequin commented Oct 12, 2016

Odd, I have exactly the same laptop and I'm using tlp and powertop, but I don't get this problem until after a suspend/resume cycle. Or is this issue fixed in bbswitch 0.8.4ubuntu1?

Author

@nathanielwarner nathanielwarner commented Oct 12, 2016

You're sure you have the exact same model, and that you're running Kernel 4.8? It's possible you have a different BIOS than me (If you don't know, Dell has been rapidly pushing out BIOS updates to try to fix an alarming number of issues. Many of the updates have made things worse, so I'm currently on an older BIOS.) It's also possible that the issue is fixed in bbswitch 0.8.4ubuntu1. Maybe I'll try Ubuntu and see if that works better.

@rockorequin rockorequin commented Oct 12, 2016

Yes, it's the same model, with the same GPUs. It has the 4K screen and I'm running the 1.2.0 BIOS (I tried 1.2.14, but it has a screen flickering problem which makes it unusable.) It's possible I disabled tlp and powertop power management and forgot, of course.

Author

@nathanielwarner nathanielwarner commented Oct 12, 2016

I'm in the exact same situation as you, on 1.2.0 (Seriously, Dell needs to get their act together!)
Since disabling tlp and powertop didn't solve it for me, my only guesses are that the version of bbswitch you have is newer than mine, or that you are running a different kernel version (pre 4.8).

@rockorequin rockorequin commented Oct 12, 2016

I'm running the mainline 4.8 kernel also (with the patch from https://bugs.freedesktop.org/show_bug.cgi?id=97596 to avoid a weird flickering artefact that occurs on Skylake architecture with 4.8 if you have a second monitor attached).

Author

@nathanielwarner nathanielwarner commented Oct 12, 2016

Ok, I'll probably try Ubuntu with the mainline kernel at some point to see if that fixes the issue. Until then, should this issue be closed?

Member

@Lekensteyn Lekensteyn commented Oct 17, 2016

@rockorequin Perhaps you are using nouveau instead of bbswitch? Personally I am back to nouveau since my new laptop requires it for an external monitor.

Author

@nathanielwarner nathanielwarner commented Oct 18, 2016

I actually did try it with Ubuntu 16.10 with Kernel 4.8, and it is fixed. There must be something internal to Manjaro that is screwing it up.

Member

@ArchangeGabriel ArchangeGabriel commented Oct 21, 2016

I’m reopening this, because even if it seems to work (i.e. it reports OFF), the power consumption and temperature correspond to an ON card on my setup. Adding pcie_port_pm=off to the kernel parameters solves it.

When using nouveau, temperature and power consumption also correspond to an ON card.

Member

@Lekensteyn Lekensteyn commented Oct 21, 2016

The result of combining the DSM method (as used by bbswitch) with the new power resources method (as used since Linux 4.8 and nouveau) in a single boot is not known (I would call it undefined behavior). Forcing pcie_port_pm=off basically reverts to the DSM method.

How do you observe that the video card is off with nouveau? You have to check your dmesg for the last messages related to nouveau. If you see "DRM: resuming kernel object tree" with no "suspending console" as follow up, then you know something is keeping the device busy.
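
A rough way to apply that check without scrolling through the log by hand is sketched below. It assumes the nouveau message texts quoted above and only reports which of the two appeared last; the function name is hypothetical.

```sh
# Usage: dmesg | nouveau_last_pm_event
# Prints "suspended" if the most recent nouveau PM message was a
# "suspending console" line, "resumed" if it was a "resuming kernel
# object tree" line, and "unknown" if neither message was seen.
nouveau_last_pm_event() {
    awk '
        /nouveau .*DRM: suspending console/          { last = "suspended" }
        /nouveau .*DRM: resuming kernel object tree/ { last = "resumed" }
        END { print (last == "" ? "unknown" : last) }
    '
}
```

A result of "resumed" on an otherwise idle system would suggest that something is keeping the device busy.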

Member

@ArchangeGabriel ArchangeGabriel commented Oct 22, 2016

I do get those lines at the end:

kernel: nouveau 0000:01:00.0: DRM: suspending console...
kernel: nouveau 0000:01:00.0: DRM: suspending display...
kernel: nouveau 0000:01:00.0: DRM: evicting buffers...
kernel: nouveau 0000:01:00.0: DRM: waiting for kernel channels to go idle...
kernel: nouveau 0000:01:00.0: DRM: suspending client object trees...
kernel: nouveau 0000:01:00.0: DRM: suspending kernel object tree...

That being said, I probably need to investigate further (power consumption with bbswitch vs nouveau vs nothing, each with and without pcie_port_pm=off) to properly determine what works and what does not, and then start reporting bugs against kernel/nouveau.

Member

@ArchangeGabriel ArchangeGabriel commented Oct 22, 2016

Also, @Lekensteyn, I grabbed this at some point; if I remember correctly, it was while running a boot without pcie_port_pm=off on my newer machine and trying to echo OFF to bbswitch after seeing the temperature increase:

[13827.423220] bbswitch: disabling discrete graphics
[13827.423230] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[13827.424013] ------------[ cut here ]------------
[13827.424017] WARNING: CPU: 3 PID: 2343 at drivers/pci/pci.c:1616 pci_disable_device+0xa8/0xd0
[13827.424018] pci 0000:01:00.0: disabling already-disabled device
[13827.424019] Modules linked in:
[13827.424019]  bbswitch(O) mousedev snd_hda_codec_conexant snd_hda_codec_generic hid_generic arc4 msr iTCO_wdt i2c_designware_platform hp_wmi iTCO_vendor_support i2c_designware_core mxm_wmi joydev sparse_keymap nls_iso8859_1 $
[13827.424049]  evdev hp_wireless ac mac_hid tpm_tis acpi_pad tpm_tis_core tpm sch_fq_codel ip_tables x_tables btrfs xor raid6_pq algif_skcipher af_alg dm_crypt dm_mod rtsx_pci_sdmmc mmc_core serio_raw atkbd libps2 crct10dif_p$
[13827.424077] CPU: 3 PID: 2343 Comm: tee Tainted: G        W  O    4.8.2-1-ARCH #1
[13827.424078] Hardware name: HP HP ZBook Studio G3/80D4, BIOS N82 Ver. 01.07 04/27/2016
[13827.424079]  0000000000000286 00000000be2b784c ffff8808181bbcf8 ffffffff812fe280
[13827.424081]  ffff8808181bbd48 0000000000000000 ffff8808181bbd38 ffffffff8107c85b
[13827.424083]  0000065000000000 ffff88089b30c000 ffff88089b2fefa0 00007ffd831c7e40
[13827.424086] Call Trace:
[13827.424090]  [<ffffffff812fe280>] dump_stack+0x63/0x83
[13827.424092]  [<ffffffff8107c85b>] __warn+0xcb/0xf0
[13827.424093]  [<ffffffff8107c8df>] warn_slowpath_fmt+0x5f/0x80
[13827.424094]  [<ffffffff8134bf6b>] ? __pci_set_master+0x3b/0xf0
[13827.424096]  [<ffffffff8134ee98>] pci_disable_device+0xa8/0xd0
[13827.424098]  [<ffffffffa06a548d>] bbswitch_off+0xad/0x240 [bbswitch]
[13827.424100]  [<ffffffffa06a5870>] bbswitch_proc_write+0xb0/0xc7 [bbswitch]
[13827.424102]  [<ffffffff81276f82>] proc_reg_write+0x42/0x70
[13827.424104]  [<ffffffff812087b7>] __vfs_write+0x37/0x140
[13827.424107]  [<ffffffff810c7b87>] ? percpu_down_read+0x17/0x50
[13827.424108]  [<ffffffff81209586>] vfs_write+0xb6/0x1a0
[13827.424109]  [<ffffffff8120aa05>] SyS_write+0x55/0xc0
[13827.424111]  [<ffffffff815f7cf2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[13827.424112] ---[ end trace 4f6318674a3d9756 ]---

Will try to reproduce, but I think this is likely caused by bbswitch/pcie_port_pm interaction on 4.8 for newer systems.

Member

@Lekensteyn Lekensteyn commented Oct 22, 2016

For your last issue, if you have some udev rule enabling runtime PM for devices (e.g. "laptop mode tools") then indeed it will upset bbswitch on the new behavior (4.8 without pcie_port_pm=off on newer laptops).

Author

@nathanielwarner nathanielwarner commented Oct 22, 2016

I should point out that if you're using nouveau (rather than nvidia), you actually don't need bumblebee or bbswitch- you can just use DRI_PRIME=1 before the app you want to run with the discrete gpu. See https://wiki.archlinux.org/index.php/PRIME

Member

@ArchangeGabriel ArchangeGabriel commented Oct 22, 2016

@nathanielwarner If you’re telling that to me, I assure you that I know. ;) But that’s not really related to the current issue.

@Lekensteyn OK, I’ve got tlp installed (and running) on the same system, I’ll also try with or without it to see what it gives. So that’s one more factor to try. Should have time tomorrow to look at all that. :)

On a side note, do you still intend to update bbswitch for supporting this new method any time soon? Maybe we should release Bumblebee 4.0 without waiting much further and add a release note about bbswitch state (interaction with 4.8 pcie_port_pm, open/known issues). ;)

Member

@Lekensteyn Lekensteyn commented Oct 22, 2016

"intend to update bbswitch for supporting this new method"
Yes.

"any time soon?"
No (time constraints). nouveau seems to work, so I have not really prioritized it here.

I was hoping to get this fixed before Bumblebee 4, but it seems things are really stalling, so maybe it is better to release it since it at least improves the nvidia driver situation. Release note with known issues should be ok :)

Member

@bluca bluca commented Oct 22, 2016

+1 for a new release, debian 9 deadlines are approaching fast :-)

Member

@ArchangeGabriel ArchangeGabriel commented Oct 23, 2016

OK, I’ll go through all open issues soon (help appreciated) and will try to release by the end of the week. Stay tuned. If there is any need for discussion, Bumblebee-Project/Bumblebee#319 is the place to go now. ;)

@GreatBigWhiteWorld GreatBigWhiteWorld commented Oct 31, 2016

I see that in your bumblebee 4 issue, you are delaying its release for another few weeks. So I need to get this to work even temporarily.

If I understand right, I need to add "pcie_port_pm=off" in the grub configuration as a kernel parameter, and the drawback is that I am constantly running on the nvidia card, right?

Thanks in advance.

Member

@Lekensteyn Lekensteyn commented Oct 31, 2016

@GreatBigWhiteWorld pcie_port_pm=off is a workaround that allows you to use bbswitch with kernel 4.8 and newer. If you use older kernels, you do not need that option.
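
For readers unsure how to set the option, here is a hedged sketch. It assumes GRUB with its defaults in /etc/default/grub and a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line; other boot loaders and distributions differ. The function name is illustrative, and it only prints the modified file so the result can be reviewed before anything is overwritten.

```sh
# Usage: add_kernel_param pcie_port_pm=off [/etc/default/grub]
# Prints the file with the parameter appended to
# GRUB_CMDLINE_LINUX_DEFAULT, or unchanged if it is already there.
# Only meant for simple parameters without "/" or "&" in them.
add_kernel_param() {
    param="$1"
    file="${2:-/etc/default/grub}"
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$file"; then
        cat "$file"
    else
        # Insert the parameter just before the closing double quote.
        sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file"
    fi
}
```

After the edited file is in place, the GRUB configuration has to be regenerated (typically update-grub on Debian/Ubuntu, or grub-mkconfig -o /boot/grub/grub.cfg on Arch-based systems) and the machine rebooted.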

If you use nouveau (and not bbswitch nor the nvidia proprietary driver), then you do not have to do anything.

@GreatBigWhiteWorld GreatBigWhiteWorld commented Nov 1, 2016

Thanks. Yes I am running 4.8 kernel and bbswitch is always off at the moment. I guess I need this option.

@ademcal ademcal commented Nov 13, 2016

I have the same problem, and it was solved with the kernel parameter pcie_port_pm=off. My laptop is a Dell 7559 running openSUSE Tumbleweed.

The dmesg output is below:

[    7.982013] ------------[ cut here ]------------
[    7.982017] WARNING: CPU: 6 PID: 1550 at ../drivers/pci/pci.c:1616 pci_disable_device+0xa1/0xd0
[    7.982018] pci 0000:02:00.0: disabling already-disabled device
[    7.982019] Modules linked in:
[    7.982019]  af_packet nf_log_ipv6 xt_pkttype nf_log_ipv4 nf_log_common xt_LOG xt_limit bnep nls_iso8859_1 nls_cp437 vfat fat uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core hid_generic videodev usbhid btusb btrtl snd_hda_codec_hdmi dell_led arc4 intel_rapl x86_pkg_temp_thermal intel_powerclamp snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec hid_multitouch kvm_intel snd_hda_core kvm snd_hwdep irqbypass iwlmvm snd_pcm iTCO_wdt crct10dif_pclmul iTCO_vendor_support crc32_pclmul mac80211 snd_seq crc32c_intel ghash_clmulni_intel i2c_designware_platform snd_seq_device snd_timer i2c_designware_core aesni_intel idma64 virt_dma iwlwifi dell_wmi aes_x86_64 sparse_keymap lrw dell_smbios glue_helper dcdbas dell_smm_hwmon ablk_helper cryptd rtsx_pci_ms hci_uart
[    7.982039]  snd ip6t_REJECT nf_reject_ipv6 btbcm memstick pcspkr i2c_i801 mei_me cfg80211 i2c_smbus mei intel_lpss_pci int3403_thermal btqca xt_tcpudp soundcore joydev btintel nf_conntrack_ipv6 battery pinctrl_sunrisepoint bluetooth nf_defrag_ipv6 ac pinctrl_intel intel_lpss_acpi intel_lpss fan processor_thermal_device int3402_thermal int340x_thermal_zone dell_rbtn shpchp int3400_thermal intel_soc_dts_iosf acpi_als acpi_thermal_rel kfifo_buf tpm_tis fjes thermal tpm_tis_core industrialio rfkill ip6table_raw acpi_pad tpm ipt_REJECT nf_reject_ipv4 iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables rtsx_pci_sdmmc mmc_core mxm_wmi i915 serio_raw xhci_pci
[    7.982058]  rtsx_pci mfd_core xhci_hcd i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt usbcore fb_sys_fops usb_common drm wmi video i2c_hid button coretemp msr sg bbswitch(O) efivarfs [last unloaded: nvidia]
[    7.982067] CPU: 6 PID: 1550 Comm: bumblebeed Tainted: P     U     O    4.8.6-2-default #1
[    7.982067] Hardware name: Dell Inc. Inspiron 7559/0H0CC0, BIOS 1.2.0 09/22/2016
[    7.982068]  0000000000000000 ffffffffb03a4272 ffff99f4daab3da8 0000000000000000
[    7.982070]  ffffffffb007de2e ffff99f501704000 ffff99f4daab3df8 ffff99f4daab3f28
[    7.982072]  00000000017d7270 0000000000000000 0000000000000028 ffffffffb007de9f
[    7.982074] Call Trace:
[    7.982082]  [<ffffffffb002eefe>] dump_trace+0x5e/0x310
[    7.982085]  [<ffffffffb002f2cb>] show_stack_log_lvl+0x11b/0x1a0
[    7.982087]  [<ffffffffb0030001>] show_stack+0x21/0x40
[    7.982090]  [<ffffffffb03a4272>] dump_stack+0x5c/0x7a
[    7.982093]  [<ffffffffb007de2e>] __warn+0xbe/0xe0
[    7.982096]  [<ffffffffb007de9f>] warn_slowpath_fmt+0x4f/0x60
[    7.982098]  [<ffffffffb03eb551>] pci_disable_device+0xa1/0xd0
[    7.982101]  [<ffffffffc036e409>] bbswitch_off+0x89/0x230 [bbswitch]
[    7.982104]  [<ffffffffc036e7c3>] bbswitch_proc_write+0x93/0xaa [bbswitch]
[    7.982108]  [<ffffffffb02854dd>] proc_reg_write+0x3d/0x60
[    7.982111]  [<ffffffffb02187c3>] __vfs_write+0x23/0x140
[    7.982114]  [<ffffffffb0219080>] vfs_write+0xb0/0x190
[    7.982115]  [<ffffffffb021a302>] SyS_write+0x42/0x90
[    7.982118]  [<ffffffffb06d43f6>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[    7.983563] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x1e/0xa8

[    7.983564] Leftover inexact backtrace:

[    7.983566] ---[ end trace 8e83878053cc2799 ]---

@ssbb ssbb commented Nov 13, 2016

I have a Dell XPS 15 9550 with a 960M too. cat /proc/acpi/bbswitch tells me that the GPU is off, but my laptop is noisy all the time. I think it happens only with the 4.8 kernel, since it did not happen before.

I added pcie_port_pm=off as a kernel parameter, but it does not seem to help:

[  193.771954] bbswitch: enabling discrete graphics
[  199.161884] bbswitch: disabling discrete graphics
[  199.161893] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[  262.993580] bbswitch: enabling discrete graphics
[  263.317141] nvidia: module license 'NVIDIA' taints kernel.
[  263.317143] Disabling lock debugging due to kernel taint
[  263.324303] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[  263.324323] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  375.10  Fri Oct 14 10:30:06 PDT 2016 (using threaded interrupts)
[  263.899187] vgaarb: this pci device is not a vga device
[  263.907317] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[  263.907458] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[  263.907543] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[  263.907620] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[  263.907696] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[  263.907811] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[  263.907888] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[  263.937265] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[  264.134610] vgaarb: this pci device is not a vga device
[  264.417771] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  375.10  Fri Oct 14 10:05:55 PDT 2016
[  267.564436] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[  267.570652] nvidia-modeset: Unloading
[  267.583848] nvidia-nvlink: Unregistered the Nvlink Core, major device number 241
[  267.611886] bbswitch: disabling discrete graphics
[  267.611895] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[  267.627364] pci 0000:01:00.0: Refused to change power state, currently in D0

@Zeben Zeben commented Jul 8, 2018

Bump:

The dGPU powers on when I plug in the AC adapter; the cat command gives me wrong results:

The issue was solved by changing RUNTIME_PM_ON_AC from on to auto, which enables runtime PM for all PCI devices even when the AC adapter is plugged in. But I'm not sure if it's right...

@liskin liskin commented Jul 8, 2018

@Zeben The whole point of disabling pm on AC is that you get rid of those waits of (possibly) tens or hundreds of milliseconds for devices to power up. Try doing lspci on battery: it's not instantaneous but takes almost a second. Try plugging in headphones: there's a somewhat annoying click when the soundcard powers down. On the other hand, with pm enabled, you'll hear your fan a lot less often. It's your decision to make. :-)

@Zeben Zeben commented Jul 8, 2018

So, after more experiments I've reached some conclusions.

Everything works without problems in three configurations.

  1. Installed packages: linux 4.17.4, bbswitch, bumblebee, tlp, tlp-ui, nvidia.
    Blacklisted 01:00.0 Nvidia card in RUNTIME_PM_BLACKLIST variable.
    Using pcie_port_pm=on in kernel command-line options.
    Works with plugging/unplugging AC adapter.
    Works enabling/disabling dGPU power.
    No errors.

  2. Installed packages: linux 4.17.4, bumblebee, tlp, tlp-ui, nvidia.
    bbswitch removed.
    Using pcie_port_pm=on in kernel command-line options.
    Tip-1: to allow the dGPU to power off, we need to unload the nvidia and nvidia_modeset kernel modules manually.
    Tip-2: when the AC adapter is plugged in or unplugged, the dGPU stays powered on. We need to find the dGPU's vendor/device IDs via lspci -nn, add the device to the udev rules, and always set it to auto instead of on.
    I guess this will be the default configuration in future versions of Linux-based distributions, after some fixes.

  3. Legacy configuration.
    Installed packages: linux 4.16.12, bumblebee, bbswitch-dkms, nvidia-dkms, laptop-mode-tools.
    Using pcie_port_pm=off in kernel command-line options.
    No additional changes needed.
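
Tip-2 of the second configuration could look roughly like the sketch below, which prints a udev rule that sets power/control to auto for one PCI device. The function name, the rules file name and the example IDs are assumptions for illustration; the real IDs are the vendor:device pair that lspci -nn shows in square brackets.

```sh
# Usage: udev_runtime_pm_rule <vendor-hex> <device-hex>
# IDs are given without the 0x prefix, as lspci -nn prints them.
udev_runtime_pm_rule() {
    vendor="$1"
    device="$2"
    printf 'ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x%s", ATTR{device}=="0x%s", ATTR{power/control}="auto"\n' \
        "$vendor" "$device"
}

# Example (IDs and file name are placeholders, substitute your own):
#   udev_runtime_pm_rule 10de 139b | sudo tee /etc/udev/rules.d/80-dgpu-pm.rules
```

Whether such a rule is enough depends on what else rewrites power/control later, for instance a power-saving tool reacting to AC events.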

Many thanks to @liskin for the suggestions and tips. Maybe our conversation will be helpful for those who have the same issues. Waiting for a complete implementation of dynamic switchable graphics that works out of the box, without bbswitch.

@real-or-random real-or-random commented Jul 9, 2018

Hm, for me removing pcie_port_pm=off does not help. Without that, I cannot load the nvidia driver.

However, the problem with 4.16.13 went away in a later kernel version (actually already some weeks ago, I just forgot to report it here). So for me, pcie_port_pm=off is still the way to go...

@Zeben Zeben commented Jul 9, 2018

@real-or-random I've combined two techniques to make switchable graphics usable: runtime PM for all devices (by keeping pcie_port_pm=on or removing the option completely) and blacklisting the dGPU in tlp. As a result, bbswitch doesn't interfere with the kernel's new runtime PM, and the bbswitch-related tracebacks in dmesg are also gone. Now bbswitch completely controls the dGPU, and the device isn't controlled by runtime PM.
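
To confirm which TLP values are actually configured, a small reader like the one below may help. It is a sketch that assumes the TLP settings named in this thread (RUNTIME_PM_BLACKLIST, RUNTIME_PM_ON_AC) and the /etc/default/tlp location used by TLP releases of that period; newer releases keep their settings in /etc/tlp.conf. The function name is hypothetical.

```sh
# Usage: tlp_setting RUNTIME_PM_BLACKLIST [config-file]
# Prints the last uncommented value assigned to the key, without quotes.
tlp_setting() {
    key="$1"
    file="${2:-/etc/default/tlp}"
    sed -n "s/^[[:space:]]*$key=//p" "$file" | tail -n 1 | tr -d '"'
}
```

For the first configuration described earlier, tlp_setting RUNTIME_PM_BLACKLIST would be expected to print 01:00.0.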

@real-or-random real-or-random commented Jul 10, 2018

I tried that but it didn't work. But I'm not convinced that the blacklisting in tlp worked, because powertop still showed PM enabled for the NVIDIA card. Is that the right place to check? (Where can I check manually?)
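
One manual check is to read the device's runtime PM nodes in sysfs directly. The sketch below assumes the standard power/control and power/runtime_status files; the function name and the SYSFS_PCI override are illustrative only.

```sh
# Base directory of PCI devices in sysfs (overridable for experiments).
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Usage: show_runtime_pm 0000:01:00.0
# control:        "on" (runtime PM disabled) or "auto" (runtime PM allowed)
# runtime_status: e.g. "active" or "suspended"
show_runtime_pm() {
    dev="$1"
    for node in control runtime_status; do
        f="$SYSFS_PCI/$dev/power/$node"
        if [ -r "$f" ]; then
            printf '%s: %s\n' "$node" "$(cat "$f")"
        else
            printf '%s: unavailable\n' "$node"
        fi
    done
}
```

If control still reads auto for the GPU after the blacklist entry was added, the blacklist is probably not taking effect.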

@IngeniousDox IngeniousDox commented Jul 25, 2018

@liskin I have a Dell XPS 15 9570, and I have reached the same point as you. I'm using runtime PM without bbswitch, because using bbswitch (normal or pm-rework branch) results in a dGPU you cannot power back on. Optirun / Primusrun both work and load the nvidia module, but unloading afterwards does not work. I tried normal bumblebee and bumblebee-git with the development branch with libkmod2. So I have to remove it with modprobe -r.

You said something about making a bug report about it, but I can't find anything. Did I miss it? Or were you still planning to make it? I guess we need a new PMMethod=modules_only that only unloads the modules? You seem to know what the issue is; I'm rather new to laptops with nvidia / bumblebee.

@Lekensteyn We talked on IRC briefly about your pm-rework branch. I thought it was working, since compared to the normal bbswitch, it turned the dGPU on / off. However I could not load the nvidia module due to:

[ 1604.981868] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 1604.982029] NVRM: This is a 64-bit BAR mapped above 4GB by the system
               NVRM: BIOS or the Linux kernel, but the PCI bridge
               NVRM: immediately upstream of this GPU does not define
               NVRM: a matching prefetchable memory window.

This probably has to do with the torvalds/linux@abf92f8 commit you already pointed out before. As you suggested, I switched to using runtime-pm (where Bumblebee is not unloading the nvidia module, as I wrote at the start of this post). But I figured I'd follow up with you on my attempt to use pm-rework.

@TungstenOxide TungstenOxide commented Jul 25, 2018

Which distro are you using? I gave up and used Arch and that's working pretty well.

@IngeniousDox IngeniousDox commented Jul 25, 2018

I have Arch on the 9570. It seems to work differently than your 9560, since I followed the archwiki 9560 page at first, but it simply did not work. (See my/our struggle in Dell XPS 15 9570 - bbswitch not working, Nvdia card won't power off/on). However, with runtime-pm, turning the dGPU on with bumblebee works; this happens just by loading the nvidia module. Unloading the nvidia module lets the dGPU go into suspend. I'm using TLP right now for that.

However, bumblebee doesn't unload the modules after it is used, so I have to do that manually. I figure I can make a wrapper script, that calls a wrapped "modprobe -r nvidia" script that I can then allow to be used with sudo without password. Seems to me a good compromise between security and usability. But, I figure I'm not the only one with this problem and it could be better implemented in Bumblebee.
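
The wrapper idea could be sketched as below. This is not Bumblebee's behavior, only an illustration: it unloads whichever of the usual NVIDIA modules are loaded, dependents first. The MODPROBE and PROC_MODULES overrides are assumptions added so the function can be dry-run; the module list may differ between driver versions.

```sh
# Command used to remove modules; set MODPROBE=echo for a dry run.
MODPROBE="${MODPROBE:-modprobe}"

# Usage: unload_nvidia   (needs root unless dry-running)
unload_nvidia() {
    # Dependents come before the core nvidia module.
    for mod in nvidia_drm nvidia_modeset nvidia_uvm nvidia; do
        # /proc/modules lists one module per line, name first.
        if grep -q "^$mod " "${PROC_MODULES:-/proc/modules}"; then
            $MODPROBE -r "$mod" || return 1
        fi
    done
}
```

Installed as a root-owned script, this is the kind of single command that a narrowly scoped passwordless sudo rule could allow.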

Now, I know people think Bumblebee is "feature complete", but I guess this actually is a case where a new PMMethod based on just unloading the nvidia modules could be handy.

Member

@ArchangeGabriel ArchangeGabriel commented Jul 25, 2018

Bumblebee is not feature complete. There are still improvements to be made w.r.t. modules and PM handling, as well as Vulkan/VDPAU support missing (but being worked on).

@IngeniousDox IngeniousDox commented Jul 25, 2018

I was mostly referring to the sentiment in the bumblebee issue "Is this project dead?". Some called it dead, some called it complete. But yeah, I have been following the Vulkan discussion, since I use DXVK myself.

Now, you say "but being worked on", was that just referring to Vulkan/VDPAU? Or is there work under way for PM handling just by unloading modules after use?

Member

@ArchangeGabriel ArchangeGabriel commented Jul 25, 2018

I was referring to Vulkan/VDPAU. Actually, reworking PM/modules should be easier, but it requires some free time.

@liskin

@liskin liskin commented Jul 25, 2018

@IngeniousDox You didn't miss anything, I'm just too busy/lazy. :-)

@marine1988

@marine1988 marine1988 commented Jul 30, 2018

I have a Xiaomi Mi Book 13.3 2017 and I'm having the same issue:

[ 2444.539249] [ERROR]Cannot access secondary GPU - error: Could not enable discrete graphics card

I use tlp and powertop, so is that the problem? How can I add the parameters to the kernel command line to fix it?

@KonnorTimmons1297

@KonnorTimmons1297 KonnorTimmons1297 commented Aug 6, 2018

Hi, I'm having this issue as well on Arch Linux. I'm using a Dell G5 15 5587 laptop, with Nvidia GTX 1050 Ti 4GB.

I'm using kernel version 4.17.11-arch1 .

I have installed bumblebee, bbswitch (and bbswitch-dkms), the intel drivers, the nvidia drivers, and primus. When trying to run primus I get this error:

primus: fatal: Bumblebee daemon reported: error: Could not enable discrete graphics card

At exactly the same time that this happens, I see these lines in the debug logging of bumblebeed:

[ 427.789535] [DEBUG]Accepted new connection
[ 427.790244] [INFO]Switching dedicated card ON [bbswitch]
[ 427.790383] [ERROR]Could not enable discrete graphics card
[ 427.790595] [DEBUG]Socket closed.

Looking further in dmesg, I see these lines:

[ 234.727340] bbswitch: enabling discrete graphics
[ 234.727403] pci 0000:01:00.0: Refused to change power state, currently in D3

I have tlp installed, and I have added the nvidia drivers as well as the pci id to its blacklist. I have also added the kernel parameter pcie_port_pm=off and the results are the same as above.
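
When a kernel parameter seems to make no difference, one thing worth ruling out is that it never reached the running kernel. A minimal sketch of such a check, assuming the running kernel's command line is read from /proc/cmdline; the helper name and its optional file argument are hypothetical.

```shell
#!/bin/sh
# has_kernel_param PARAM [CMDLINE_FILE]
# Succeeds if PARAM appears as a whole word on the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline; the override is a hypothetical
# hook that allows checking a saved copy.
has_kernel_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # One word per line, then an exact, fixed-string, whole-line match.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}
```

For example, `has_kernel_param pcie_port_pm=off && echo "parameter is active"` prints the message only if the bootloader actually passed the option.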

There are no nouveau drivers installed, so there is no conflict there.

Additionally, when trying to manually enable the card using this command:

sudo tee /proc/acpi/bbswitch <<<ON

and then cat-ing the file, I receive this output

0000:01:00.0 OFF
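
The manual test above can be wrapped in a small helper that requests ON and then confirms the state really changed. This is only a sketch, assuming the one-line "<pci address> <state>" format shown in the output above; the function names and the optional file argument are hypothetical.

```shell
#!/bin/sh
# Print the state (second field, e.g. "ON" or "OFF") reported by bbswitch.
# The optional argument is a hypothetical hook for reading another file.
bbswitch_state() {
    awk '{ print $2; exit }' "${1:-/proc/acpi/bbswitch}"
}

# Request ON, then verify that bbswitch reports ON afterwards.
# Needs root; returns non-zero if the card stayed off.
bbswitch_enable() {
    echo ON > /proc/acpi/bbswitch || return 1
    [ "$(bbswitch_state)" = ON ]
}
```

If `bbswitch_enable` fails while dmesg shows "Refused to change power state, currently in D3", the request reached bbswitch but the device did not leave D3, which matches the symptom reported in this issue.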

Currently this is the only thread that I've found where people are actively working on this issue. I'd like to hear what you guys think the issue is and help come up with a solution to this problem.

@liskin

@liskin liskin commented Aug 6, 2018

@KonnorTimmons1297 Can you try the bbswitch-less approach as well? I have no idea why bbswitch doesn't work for you as you seem to have done all the right steps, but spending more time making it work is a bit pointless these days. :-)

@KonnorTimmons1297

@KonnorTimmons1297 KonnorTimmons1297 commented Aug 6, 2018

@liskin Alright, are you referring to using the nouveau drivers and using the PRIME method to switch between intel graphics and the dedicated graphics card? I thought that nouveau driver was causing problems with an optimus system.

If this is what you are talking about, then I just want to confirm that I have the right idea of what to do next. I need to uninstall the nvidia driver, bbswitch, bumblebee, and primus. Then install the nouveau drivers and xrandr (or something like that)?

@liskin

@liskin liskin commented Aug 6, 2018

@KonnorTimmons1297 No, I'm not referring to that. I'm referring to the fact that modern kernels on modern systems will power the card off as soon as you unload the nvidia module. Just drop bbswitch and give it a try. And read the comments above you if you run into problems. (Which you will.)

@IngeniousDox

@IngeniousDox IngeniousDox commented Aug 6, 2018

@liskin I have opened an issue with a Feature Request. I think I covered everything, but if you could look at it and see if I missed something, that would be nice.

@KonnorTimmons1297 This is the Arch thread where we were hunting for solutions as well: https://bbs.archlinux.org/viewtopic.php?pid=1800742. In the end, we ended up using Bumblebee with PMMethod=none and something like tlp to put the PCI bus + GPU into runtime suspend when it isn't used. There are some issues still, which is why I opened a feature request. And there are perhaps some other kinks you need to be aware of, but they are discussed in that forum thread.
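
To confirm which PMMethod Bumblebee is actually configured with, the value can be read back from its config file. A rough sketch, assuming the usual /etc/bumblebee/bumblebee.conf location and a [driver-nvidia] section that carries the PMMethod key; the function name is hypothetical.

```shell
#!/bin/sh
# bumblebee_pm_method [CONF] [SECTION]
# Print the PMMethod value from the given INI-style section.
# Defaults are assumptions: /etc/bumblebee/bumblebee.conf, driver-nvidia.
bumblebee_pm_method() {
    conf=${1:-/etc/bumblebee/bumblebee.conf}
    section=${2:-driver-nvidia}
    awk -F= -v want="[$section]" '
        # A line starting with "[" opens a new section.
        /^\[/ { in_section = ($0 == want) }
        # Inside the wanted section, print the value of PMMethod and stop.
        in_section && $1 == "PMMethod" { print $2; exit }
    ' "$conf"
}
```

With the setup described above, `bumblebee_pm_method` would be expected to print `none`. The sketch does not handle spaces around the `=` sign.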

@KonnorTimmons1297

@KonnorTimmons1297 KonnorTimmons1297 commented Aug 6, 2018

Alright, this stuff is still kind of new to me. I'm not entirely sure how to do this, however I do understand what it means to load a kernel module. Looking at the ArchWiki I get the sense that this is what needs to be done, correct me if I'm wrong.

Uninstall bbswitch, run this command sudo modprobe nvidia to load the NVidia module into the kernel. Once the module has been loaded, I should try primusrun glxgears again as well?

@liskin

@liskin liskin commented Aug 6, 2018

@IngeniousDox Thanks!

@KonnorTimmons1297

@KonnorTimmons1297 KonnorTimmons1297 commented Aug 7, 2018

Alright, I uninstalled bbswitch and tried manually loading the nvidia kernel module. After that, I was able to successfully run glxspheres64 & glxgears using the dGPU. However, once the kernel module is loaded, I am unable to completely remove it.

I'd like to completely disable the dGPU while it is not in use so that I can extend my battery life. I realize that that's what bbswitch was originally intended to do, but because of this power management bug, it is unable to automate the process of turning the card on and off.

I have tried removing the kernel module by running sudo modprobe -r nvidia and sudo rmmod nvidia, but each of these returns this line:

modprobe: FATAL: Module nvidia is in use.

Is there a way that I can manually enable and disable the kernel module, as well as cut power to the card itself? I have tlp installed and I thought that if I removed the card and the driver from its blacklist, it would take care of that for me, but that isn't the case because the card is still 'active' according to /sys/module/nvidia/drivers/pci:nvidia/0000:01:00.0/power/runtime_status.
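
The runtime PM state mentioned above can be read from two sysfs attributes of the device: power/control (the policy, "auto" or "on") and power/runtime_status (the current state, for example "active" or "suspended"). A small sketch, assuming the dGPU sits at 0000:01:00.0 as in the logs in this thread; the function name and the directory override are hypothetical.

```shell
#!/bin/sh
# gpu_runtime_pm [DEVICE_DIR]
# Print the runtime PM policy and current status of a PCI device.
# DEVICE_DIR defaults to the address seen in this thread (an assumption
# for any other machine).
gpu_runtime_pm() {
    dev=${1:-/sys/bus/pci/devices/0000:01:00.0}
    printf 'control=%s status=%s\n' \
        "$(cat "$dev/power/control")" \
        "$(cat "$dev/power/runtime_status")"
}
```

A result of `control=on` means runtime suspend is not allowed for the device at all, while `control=auto status=active` means it is allowed but something is still keeping the card busy.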

Do you guys know what can be done to shut the card down at all?

@liskin

@liskin liskin commented Aug 7, 2018

You absolutely need to unload the module. If it says it's in use, it's in use, and bbswitch wouldn't have helped you either. Something is keeping the card not just powered on, but in use. Try lsof /dev/nvidia0, that should help you debug what's using it.

@liskin

@liskin liskin commented Aug 7, 2018

Oh, and it could also be that another module is using it. In my case, I need to rmmod nvidia-modeset first before rmmod nvidia.
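
Which modules have to go first can be read from the "Used by" column of lsmod. A minimal sketch that extracts those names from lsmod-style text on standard input; the function name is hypothetical.

```shell
#!/bin/sh
# module_holders MODULE
# Read lsmod output on stdin and print, one per line, the modules
# listed in the "Used by" column for MODULE.
module_holders() {
    awk -v m="$1" '
        # Columns: name, size, use count, comma-separated holders.
        $1 == m && NF >= 4 { gsub(",", "\n", $4); print $4 }
    '
}
```

Running `lsmod | module_holders nvidia` lists the modules that need an rmmod before nvidia itself. An empty result together with a non-zero use count suggests that a process, rather than another module, is holding the driver.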

@IngeniousDox

@IngeniousDox IngeniousDox commented Aug 7, 2018

Even if you unload the other modules in the correct order, it could be that the nvidia module is kept busy by X. For sddm I had to make a special xorg.conf that didn't use modesetting (see the Arch forum thread). I used xf86-video-intel for the intel GPU, and the dummy driver for the nvidia GPU. That allowed me to load / unload the modules without issue.

However it seems that doesn't work for gdm/gnome-shell. michelesr outlined his method for that in the Arch forum thread. Anyways, like liskin said, you can use lsof /dev/nvidia0 to check what it could be for you.

@KonnorTimmons1297

@KonnorTimmons1297 KonnorTimmons1297 commented Aug 8, 2018

I ran lsof /dev/nvidiactl and found that the module was being used by Xorg, which explains why I am unable to unload it.

I was able to 'disable' the card, according to bbswitch, by using the command echo 'auto' > '/sys/bus/pci/devices/0000:01:00.0/power/control'. However, the nvidia kernel module was still loaded, which made me unsure that the card was actually turning off. So I decided to try the same xorg.conf that @IngeniousDox created on the forum (https://bbs.archlinux.org/viewtopic.php?pid=1800742).

After adding the xorg.conf and rebooting, the battery drain seems to have stabilized and the battery is not draining as fast.

@ChronicledMonocle

@ChronicledMonocle ChronicledMonocle commented Jan 17, 2019

@KonnorTimmons1297 I have the same laptop as you (Dell G5 15 5587), but with a 1060. I'm running Manjaro with the 4.20 kernel and bbswitch still doesn't work. I have to set PMMethod to none in bumblebee.conf and then reboot to get the dGPU working. Otherwise I run into the same stuck-in-D3 issue.

@kshots

@kshots kshots commented Jun 5, 2019

Same here with an alienware m15 (rtx 2060... so no nouveau anytime in the next few years). Disabling PM in bbswitch allows optirun to work... Otherwise, I get the same D3 stuck state issue.
