On kernel 3.10 bumblebee will be unable to turn the nvidia card back on or even unexpectedly shut down the computer #455

Closed
rechapit opened this Issue Aug 12, 2013 · 30 comments
@rechapit

Distro: debian jessie/sid
kernel: Linux ncc-74656-a 3.10-2-amd64 #1 SMP Debian 3.10.5-1 x86_64 GNU/Linux
bumblebee version: 3.2.1
baseboard-manufacturer: ASUSTeK COMPUTER INC.
baseboard-product-name: N56VZ
baseboard-version : 1.0

system-manufacturer : ASUSTeK COMPUTER INC.
system-product-name : N56VZ
system-version : 1.0

bios-vendor : American Megatrends Inc.
bios-version : N56VZ.216
bios-release-date : 12/06/2012

Works perfectly with kernel 3.9.8, but on kernel 3.10.5 the nvidia card fails to start or the computer may shut down.
Output from /var/log/messages:
Aug 12 10:18:13 ncc-74656-a kernel: [ 150.039613] bbswitch: enabling discrete graphics
Aug 12 10:18:14 ncc-74656-a kernel: [ 150.456339] pci 0000:01:00.0: power state changed by ACPI to D0
Aug 12 10:18:14 ncc-74656-a kernel: [ 150.662868] nvidia: module license 'NVIDIA' taints kernel.
Aug 12 10:18:14 ncc-74656-a kernel: [ 150.662873] Disabling lock debugging due to kernel taint
Aug 12 10:18:14 ncc-74656-a kernel: [ 150.675549] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=none:owns=none
Aug 12 10:18:14 ncc-74656-a kernel: [ 150.675655] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 325.15 Wed Jul 31 18:50:56 PDT 2013
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107624] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107639] NVRM: os_pci_init_handle: invalid context!
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107642] NVRM: os_pci_init_handle: invalid context!
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107648] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107653] NVRM: os_pci_init_handle: invalid context!
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107654] NVRM: os_pci_init_handle: invalid context!
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.137672] NVRM: RmInitAdapter failed! (0x25:0x28:1157)
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.137683] NVRM: rm_init_adapter(0) failed

@amonakov

This is an nVidia driver bug. I've reported the issue to them, but we can't do much beyond that.

Check if booting with rcutree.rcu_idle_gp_delay=1, or running

sudo tee /sys/module/rcutree/parameters/rcu_idle_gp_delay <<<1

before invoking optirun makes it any better.
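
For anyone trying the runtime variant of this workaround, here is a minimal sketch in POSIX shell. The sysfs path is the one from the command above; the helper names `rcu_delay_status` and `set_rcu_delay` are made up for illustration, and the parameter only exists on kernels that expose it.

```shell
#!/bin/sh
# Path taken from the command above; override RCU_PARAM if your kernel differs.
RCU_PARAM=${RCU_PARAM:-/sys/module/rcutree/parameters/rcu_idle_gp_delay}

# Print the current value stored in file $1, or "unavailable" if it cannot be read.
# (Hypothetical helper, not part of Bumblebee.)
rcu_delay_status() {
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "unavailable"
    fi
}

# Write 1 to file $1. Returns non-zero if the file is missing or not writable,
# which is what happens when not running as root.
set_rcu_delay() {
    [ -w "$1" ] || return 1
    echo 1 > "$1"
}
```

After running `set_rcu_delay "$RCU_PARAM"` as root, start optirun as usual. A value written this way is lost on reboot, so the boot parameter is the persistent option.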

@dilworks

Ouch. And kernel 3.10 just landed on Testing (and I really need it for a few PM fixes on my laptop - coincidentally an Asus too...)

So what's the suggestion - boot with 3.9 for Optimus in the meantime? Does it affect every GPU and laptop out there? (mine is a Sandy Bridge + 610M). I'm going to test anyway.

@WhiteWolf1776

I'm on 3.10 with an asus n56vj laptop. Works fine as long as I use the nvidia 325.15 driver... all the previous ones will not work.

@dilworks

OK, 3.10 update applied to my K53SD.

So far so good... didn't bother with the rcu_idle_gp_delay setting - just ran primusrun glxspheres, seems to work.

Closed the pretty spheres windows, ran primusrun glxgears... still working. GPU shuts down and restarts as expected.

Went off to run Portal... it's segfaulting with primusrun (ye olde Steam Overlay bug that still hits me, but only on Portal - guess that it doesn't like the cake). OK, let's disable the overlay, it runs fine.

Checked dmesg... nothing odd there (aside from a big fat backtrace of my USB3 port acting wonky again, but that's completely unrelated to Bumblebee).

So... could it be a specific issue that only affects some laptops? So far my 325.15 + 3.10 on my Sandy Bridge + 610M seems to be behaving as intended.

@WhiteWolf1776

Yea, steam community messes with some games, dota2, etc. Easy enough to disable tho. I keep meaning to try running steam with primus instead of just the game, but keep forgetting to try. I keep thinking the issue may be steam community / overlay is running on the intel since steam is running on intel, causing some issues.

@dilworks

In my case, the overlay fails only on Portal, and only through primusrun, not on Intel.

It works flawlessly on things like Euro Truck Simulator 2 (which also got a massive patch today - new kernel, new Steam beta, new ETS2 patch!?).

Anyway, it seems that I dodged a bullet this time... just after being hit by a whole ammo magazine because of #452

The nVidia threads regarding 3.10 and Optimus are not that long at this stage. We should build a table of affected and non-affected configurations, to see if it hits a particular combo (Ivy vs Sandy?).

@amonakov

Yes, the 3.10 problem affects only some laptops. It appears specific to Kepler GPUs, which would imply IvyBridge CPUs.

Segfaults with Steam overlay are fixed in primus' git, but the distro packages have not been updated.

@WhiteWolf1776

Maybe some distros are waiting for a 'release' ;)

@p2004a

I looked at rechapit's logs and I have the same problem. It stopped working after upgrading from the 3.9 to the 3.10 kernel. I'm using debian testing.

@rickysarraf

This hit me on a Lenovo ThinkPad W530 machine.

[132472.285497] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
[132472.655995] nvidia 0000:01:00.0: irq 50 for MSI/MSI-X
[132474.583033] xhci_hcd 0000:00:14.0: power state changed by ACPI to D3cold
[132495.861860] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[132495.861881] NVRM: os_pci_init_handle: invalid context!
[132495.861884] NVRM: os_pci_init_handle: invalid context!
[132495.861890] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[132495.861894] NVRM: os_pci_init_handle: invalid context!
[132495.861895] NVRM: os_pci_init_handle: invalid context!
[132495.888638] NVRM: RmInitAdapter failed! (0x25:0x28:1157)
[132495.888647] NVRM: rm_init_adapter(0) failed
[132495.889042] bumblebeed[20121]: [132409.079926] [WARN]XORG Open ACPI failed (/var/run/acpid.socket) (No such file or directory)
[132495.889427] bumblebeed[20121]: [132409.079969] [ERROR]XORG NVIDIA(0): Failed to initialize the NVIDIA GPU at PCI:1:0:0. Please
[132495.889652] bumblebeed[20121]: [132409.079973] [ERROR]XORG NVIDIA(0): check your system's kernel log for additional error
[132495.889849] bumblebeed[20121]: [132409.079977] [ERROR]XORG NVIDIA(0): messages and refer to Chapter 8: Common Problems in the
[132495.890047] bumblebeed[20121]: [132409.079980] [ERROR]XORG NVIDIA(0): README for additional information.
[132495.890243] bumblebeed[20121]: [132409.079983] [ERROR]XORG NVIDIA(0): Failed to initialize the NVIDIA graphics device!
[132495.890434] bumblebeed[20121]: [132409.079986] [ERROR]XORG NVIDIA(0): Failing initialization of X screen 0
[132495.890627] bumblebeed[20121]: [132409.079989] [ERROR]XORG Screen(s) found, but none have a usable configuration.
[132495.890931] bumblebeed[20121]: [132409.081743] [ERROR]X did not start properly
[132500.432476] bbswitch: enabling discrete graphics
[132500.432663] bumblebeed[20121]: [132413.620535] [ERROR]Could not enable discrete graphics card
[132572.708669] bbswitch: enabling discrete graphics
[132572.708682] nvidia 0000:01:00.0: power state changed by ACPI to D0
[132572.723626] nvidia 0000:01:00.0: Refused to change power state, currently in D3

@nbdt

Having the same issue on Debian testing. 3.10-2 and NVIDIA Corporation GK107M [GeForce GT 650M]

@danilogr

What about 3.11? I get the same problem there.
I'm using Arch Linux and I have a Sony Vaio S (SVS15116FXB)

$ lspci -vnn | grep '\[030[02]\]'
00:02.0 VGA compatible controller [0300]: Intel Corporation 3rd Gen Core processor Graphics Controller [8086:0166] (rev 09) (prog-if 00 [VGA controller])
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107M [GeForce GT 640M LE] [10de:0fd3] (rev ff) (prog-if ff)

$ uname -a
Linux 3.11.3-1-ARCH #1 SMP PREEMPT Wed Oct 2 01:38:48 CEST 2013 x86_64 GNU/Linux

@ivankukobko

True story. After updating to Ubuntu 13.10 optirun shuts down my laptop (Lenovo V580 core i5 / nvidia GT 640M)

@austn3

Same problem with a Vaio S 13" GeForce GT 640M LE. Laptop completely shuts down when trying to turn on the card.

uname -a
Linux 3.11.5-1-ARCH #1 SMP PREEMPT Mon Oct 14 08:31:43 CEST 2013 x86_64 GNU/Linux

@austn3

Not sure if this will help everyone, but this solved my problem.

https://bbs.archlinux.org/viewtopic.php?pid=1326090#p1326090

@ArchangeGabriel
Bumblebee-Project member

Issues between NVIDIA and kernel should now be fixed.

@amonakov

What? Where is any evidence that this is fixed?

amonakov reopened this Dec 6, 2013

@ArchangeGabriel
Bumblebee-Project member

One user provided a workaround, and I remember reading news on Phoronix (which I can’t find anymore) that NVIDIA published a new driver version fixing incompatibilities with 3.10+. I thought this issue was among them; if it’s not, anyone still affected should say so and we’ll see what can be done.

BTW, that’s an issue at NVIDIA level, not ours AFAIK.

@amonakov

nVidia fixed the source incompatibility, but the "fallen off the bus" issue on 3.10+ still remains.

This is indeed either an nVidia or a kernel issue, but I believe it's better to keep it open to help affected users find the workaround.
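
To help tell this issue apart from other optirun failures, here is a small sketch that checks a saved kernel log for the message quoted in the logs earlier in this thread. The function name `has_bus_failure` is made up for illustration, and the pattern is copied from those log lines rather than from any NVIDIA documentation.

```shell
# Succeeds if the log file $1 contains the NVRM "fallen off the bus" message
# seen in the logs posted in this thread. (Hypothetical helper name.)
has_bus_failure() {
    grep -q 'NVRM: GPU at .* has fallen off the bus' "$1"
}
```

For example, save the log with `dmesg > /tmp/dmesg.txt` right after a failed optirun and then run `has_bus_failure /tmp/dmesg.txt && echo "same signature as this issue"`.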

@ArchangeGabriel
Bumblebee-Project member

Ok, sorry for that. And I agree, it’s better to keep it open to avoid users opening new ones.

@xarses

@amonakov's kernel options resolved my issue with debian 3.12 kernel on lenovo W530

@rickysarraf

@xarses Could you please mention which kernel options you used?

Are you referring to the option "rcutree.rcu_idle_gp_delay=1"?

@ex0hunt

Same error with/without rcu_idle_gp_delay (#570)

> cat /sys/module/rcutree/parameters/rcu_idle_gp_delay
1
> optirun glxspheres
[ 1428.559477] [ERROR]Cannot access secondary GPU - error: Could not enable discrete graphics card
[ 1428.559529] [ERROR]Aborting because fallback start is disabled.

OS: Gentoo_x64
bbswitch: 0.7, 0.8, 9999
nvidia_drivers: 334.21-r3, 331.49
kernel: 3.12, 3.10

@mcilloni

It happened to me too, today, with kernel 3.10, 3.14 and nvidia 344.21 on arch x86_64

It didn't work for a while, then after a reboot some hours later optirun worked, but primusrun crashed the whole system.

Now I've rebooted again on 3.14 and primusrun works perfectly. I am on kepler (650m) and ivy (i7-3610QM)

@ex0hunt

CONFIG_HZ_1000=y
CONFIG_HZ=1000

and kernel 3.14 fixed my problem
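
To see which timer frequency a kernel was built with before rebuilding it, a sketch like the following reads the value from a kernel config file. The function name `kernel_hz` is made up for illustration, and where the config file lives (for example `/boot/config-$(uname -r)`) is an assumption that varies by distro.

```shell
# Print the CONFIG_HZ value from the kernel config file given as $1.
# Prints nothing if the option is absent. Lines such as CONFIG_HZ_1000=y
# are skipped on purpose: only the numeric CONFIG_HZ= line matches.
kernel_hz() {
    sed -n 's/^CONFIG_HZ=\([0-9][0-9]*\)$/\1/p' "$1"
}
```

On a system that ships its config under /boot, `kernel_hz "/boot/config-$(uname -r)"` would print the configured value.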

@mcilloni

I fixed my issues by blacklisting nvidia and adding rcutree.rcu_idle_gp_delay=1 to the kernel cmdline
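
To confirm that the boot parameter actually reached the running kernel, a sketch like this looks for an exact token in a command-line file such as `/proc/cmdline`. The function name `cmdline_has` is made up for illustration. It does not cover the blacklisting part.

```shell
# Succeeds if the kernel command line stored in file $1 contains
# the exact space-separated token $2. (Hypothetical helper name.)
cmdline_has() {
    tr ' ' '\n' < "$1" | grep -qxF -- "$2"
}
```

For example: `cmdline_has /proc/cmdline rcutree.rcu_idle_gp_delay=1 && echo "workaround active"`.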

@phaazon

Issue not solved with kernel 3.16.1-1

Additional info: http://lpaste.net/110063

@koct9i

I've found a possible solution. Could somebody please test the top patch from https://github.com/koct9i/linux/commits/acpi and report the results?

The patch should apply cleanly even on v3.10. Also remove the workaround (rcutree.rcu_idle_gp_delay=1) from the command line when testing.

@ArchangeGabriel
Bumblebee-Project member

For information, this is fixed in 3.19 (per this commit: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=74b51ee152b6d99e61ba329799a039453fb9438f). It might be backported to older kernels, but I'm not sure about that.
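
As a rough check of whether a given kernel is new enough, the sketch below compares a release string against 3.19. The function name `kernel_has_fix` is made up for illustration, and the 3.19 cutoff is taken from the comment above. A distro kernel with an older version number may still carry a backport, so a negative result is only a hint.

```shell
# Succeeds if the kernel release string $1 (as printed by `uname -r`)
# is version 3.19 or newer. Only major and minor numbers are compared.
kernel_has_fix() {
    major=${1%%.*}             # text before the first dot
    rest=${1#*.}               # text after the first dot
    minor=${rest%%[!0-9]*}     # leading digits of the remainder
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "${minor:-0}" -ge 19 ]; }
}
```

For example: `kernel_has_fix "$(uname -r)" || echo "kernel older than 3.19, workaround may still be needed"`.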

@ArchangeGabriel
Bumblebee-Project member

I’m closing this since the fix should be well documented by now and most recent distros ship a >=3.19 kernel.