On kernel 3.10 bumblebee will be unable to turn the nvidia card back on or even unexpectedly shut down the computer #455

Closed
rechapit opened this issue Aug 12, 2013 · 32 comments

@rechapit

Distro: debian jessie/sid
kernel: Linux ncc-74656-a 3.10-2-amd64 #1 SMP Debian 3.10.5-1 x86_64 GNU/Linux
bumblebee version: 3.2.1
baseboard-manufacturer: ASUSTeK COMPUTER INC.
baseboard-product-name: N56VZ
baseboard-version : 1.0
system-manufacturer : ASUSTeK COMPUTER INC.
system-product-name : N56VZ
system-version : 1.0
bios-vendor : American Megatrends Inc.
bios-version : N56VZ.216
bios-release-date : 12/06/2012

Works perfectly with kernel 3.9.8, but on kernel 3.10.5 the nvidia card fails to start, or the computer may shut down.
Output from /var/log/messages:
Aug 12 10:18:13 ncc-74656-a kernel: [ 150.039613] bbswitch: enabling discrete graphics
Aug 12 10:18:14 ncc-74656-a kernel: [ 150.456339] pci 0000:01:00.0: power state changed by ACPI to D0
Aug 12 10:18:14 ncc-74656-a kernel: [ 150.662868] nvidia: module license 'NVIDIA' taints kernel.
Aug 12 10:18:14 ncc-74656-a kernel: [ 150.662873] Disabling lock debugging due to kernel taint
Aug 12 10:18:14 ncc-74656-a kernel: [ 150.675549] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=none:owns=none
Aug 12 10:18:14 ncc-74656-a kernel: [ 150.675655] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 325.15 Wed Jul 31 18:50:56 PDT 2013
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107624] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107639] NVRM: os_pci_init_handle: invalid context!
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107642] NVRM: os_pci_init_handle: invalid context!
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107648] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107653] NVRM: os_pci_init_handle: invalid context!
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.107654] NVRM: os_pci_init_handle: invalid context!
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.137672] NVRM: RmInitAdapter failed! (0x25:0x28:1157)
Aug 12 10:18:20 ncc-74656-a kernel: [ 157.137683] NVRM: rm_init_adapter(0) failed

Attachment:

@amonakov
Contributor

This is an nVidia driver bug. I've reported the issue to them, but we can't do much beyond that.

Check if booting with rcutree.rcu_idle_gp_delay=1, or running

sudo tee /sys/module/rcutree/parameters/rcu_idle_gp_delay <<<1

before invoking optirun makes it any better.
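
A minimal sketch of the runtime variant of this workaround, under stated assumptions: the helper below writes a value to a module parameter file only when that file exists, so it fails loudly on kernels that do not expose `rcu_idle_gp_delay`. The function name `set_module_param` is hypothetical and not part of Bumblebee; the sysfs path is the one quoted above.

```shell
#!/bin/sh
# set_module_param FILE VALUE
# Hypothetical helper: write VALUE to a module parameter file under sysfs.
# Returns non-zero if the file does not exist or cannot be written.
set_module_param() {
    param_file=$1
    value=$2
    if [ ! -e "$param_file" ]; then
        echo "parameter file not found: $param_file" >&2
        return 1
    fi
    printf '%s\n' "$value" > "$param_file"
}
```

Run as root before invoking optirun, it would be called as `set_module_param /sys/module/rcutree/parameters/rcu_idle_gp_delay 1`. A value written this way does not survive a reboot, which is why the boot parameter form exists as well.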

@dilworks

Ouch. And kernel 3.10 just landed on Testing (and I really need it for a few PM fixes on my laptop - coincidentally an Asus too...)

So what's the suggestion - boot with 3.9 for Optimus in the meanwhile? Does it affect every GPU and laptop out there? (mine is a Sandy Bridge + 610M). I'm going to test anyway.

@WhiteWolf1776

I'm on 3.10 with an asus n56vj laptop. Works fine as long as I use the nvidia 325.15 driver... none of the previous ones work.

@dilworks

OK, 3.10 update applied to my K53SD.

So far so good... didn't bother with the rcu_idle_gp_delay setting - just ran primusrun glxspheres, and it seems to work.

Closed the pretty spheres window, ran primusrun glxgears... still working. GPU shuts down and restarts as expected.

Went off to run Portal... it's segfaulting with primusrun (ye olde Steam Overlay bug that still hits me, but only on Portal - guess that it doesn't like the cake). OK, let's disable the overlay, it runs fine.

Checked dmesg... nothing odd there (aside from a big fat backtrace of my USB3 port acting wonky again, but that's completely unrelated to Bumblebee).

So... could it be a specific issue that only affects some laptops? So far my 325.15 + 3.10 on my Sandy Bridge + 610M seems to be behaving as intended.

@WhiteWolf1776

Yea, steam community messes with some games, dota2, etc. Easy enough to disable tho. I keep meaning to try running steam with primus instead of just the game, but keep forgetting to try. I keep thinking the issue may be steam community / overlay is running on the intel since steam is running on intel, causing some issues.

@dilworks

In my case, the overlay fails only on Portal, and only through primusrun, not on Intel.

It works flawlessly on things like Euro Truck Simulator 2 (which also got a massive patch today - new kernel, new Steam beta, new ETS2 patch!?).

Anyway, it seems that I dodged a bullet this time... just after being hit by a whole ammo magazine because of #452

The nVidia threads regarding 3.10 and Optimus are not that long at this stage. We should build a table of affected and non-affected configurations, to see if it hits a particular combo (Ivy vs Sandy?).

@amonakov
Contributor

Yes, the 3.10 problem affects only some laptops. It appears specific to Kepler GPUs, which would imply IvyBridge CPUs.

Segfaults with Steam overlay are fixed in primus' git, but the distro packages have not been updated.

@WhiteWolf1776

Maybe some distros are waiting for a 'release' ;)

@p2004a

p2004a commented Aug 21, 2013

I looked at rechapit's logs and I have the same problem. It stopped working after upgrading from the 3.9 to the 3.10 kernel. I'm using debian testing.

@rickysarraf

This hit me on a Lenovo ThinkPad W530 machine.

[132472.285497] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
[132472.655995] nvidia 0000:01:00.0: irq 50 for MSI/MSI-X
[132474.583033] xhci_hcd 0000:00:14.0: power state changed by ACPI to D3cold
[132495.861860] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[132495.861881] NVRM: os_pci_init_handle: invalid context!
[132495.861884] NVRM: os_pci_init_handle: invalid context!
[132495.861890] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[132495.861894] NVRM: os_pci_init_handle: invalid context!
[132495.861895] NVRM: os_pci_init_handle: invalid context!
[132495.888638] NVRM: RmInitAdapter failed! (0x25:0x28:1157)
[132495.888647] NVRM: rm_init_adapter(0) failed
[132495.889042] bumblebeed[20121]: [132409.079926] [WARN]XORG Open ACPI failed (/var/run/acpid.socket) (No such file or directory)
[132495.889427] bumblebeed[20121]: [132409.079969] [ERROR]XORG NVIDIA(0): Failed to initialize the NVIDIA GPU at PCI:1:0:0. Please
[132495.889652] bumblebeed[20121]: [132409.079973] [ERROR]XORG NVIDIA(0): check your system's kernel log for additional error
[132495.889849] bumblebeed[20121]: [132409.079977] [ERROR]XORG NVIDIA(0): messages and refer to Chapter 8: Common Problems in the
[132495.890047] bumblebeed[20121]: [132409.079980] [ERROR]XORG NVIDIA(0): README for additional information.
[132495.890243] bumblebeed[20121]: [132409.079983] [ERROR]XORG NVIDIA(0): Failed to initialize the NVIDIA graphics device!
[132495.890434] bumblebeed[20121]: [132409.079986] [ERROR]XORG NVIDIA(0): Failing initialization of X screen 0
[132495.890627] bumblebeed[20121]: [132409.079989] [ERROR]XORG Screen(s) found, but none have a usable configuration.
[132495.890931] bumblebeed[20121]: [132409.081743] [ERROR]X did not start properly
[132500.432476] bbswitch: enabling discrete graphics
[132500.432663] bumblebeed[20121]: [132413.620535] [ERROR]Could not enable discrete graphics card
[132572.708669] bbswitch: enabling discrete graphics
[132572.708682] nvidia 0000:01:00.0: power state changed by ACPI to D0
[132572.723626] nvidia 0000:01:00.0: Refused to change power state, currently in D3

@nbdt

nbdt commented Sep 2, 2013

Having the same issue on Debian testing. 3.10-2 and NVIDIA Corporation GK107M [GeForce GT 650M]

@danilogr

danilogr commented Oct 6, 2013

What about 3.11? I get the same problem there.
I'm using Arch Linux and I have a Sony Vaio S (SVS15116FXB)

$ lspci -vnn | grep '\[030[02]\]'
00:02.0 VGA compatible controller [0300]: Intel Corporation 3rd Gen Core processor Graphics Controller [8086:0166] (rev 09) (prog-if 00 [VGA controller])
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK107M [GeForce GT 640M LE] [10de:0fd3] (rev ff) (prog-if ff)

$ uname -a
Linux 3.11.3-1-ARCH #1 SMP PREEMPT Wed Oct 2 01:38:48 CEST 2013 x86_64 GNU/Linux

@ivankukobko

True story. After updating to Ubuntu 13.10, optirun shuts down my laptop (Lenovo V580, Core i5 / nvidia GT 640M)

@austn3

austn3 commented Oct 18, 2013

Same problem with a Vaio S 13" GeForce GT 640M LE. Laptop completely shuts down when trying to turn on the card.

uname -a
Linux 3.11.5-1-ARCH #1 SMP PREEMPT Mon Oct 14 08:31:43 CEST 2013 x86_64 GNU/Linux

@austn3

austn3 commented Oct 18, 2013

Not sure if this will help everyone, but this solved my problem.

https://bbs.archlinux.org/viewtopic.php?pid=1326090#p1326090

@ArchangeGabriel
Member

Issues between NVIDIA and kernel should now be fixed.

@amonakov
Contributor

amonakov commented Dec 6, 2013

What? Where is any evidence that this is fixed?

@amonakov reopened this Dec 6, 2013

@ArchangeGabriel
Member

One user provided a workaround, and I remember reading a news item on Phoronix (which I can’t find anymore) saying that NVIDIA published a new version solving incompatibilities with 3.10+. I thought that covered issues such as this one; if it doesn’t, someone affected should say so and we’ll see what can be done.

BTW, that’s an issue at NVIDIA level, not ours AFAIK.

@amonakov
Contributor

amonakov commented Dec 6, 2013

nVidia fixed the source incompatibility, but the "fallen off the bus" issue on 3.10+ still remains.

This is indeed either an nVidia or a kernel issue, but I believe it's better to keep it open to help affected users find the workaround.

@ArchangeGabriel
Member

Ok, sorry for that. And I agree, it’s better to keep it open to avoid users opening new ones.

@xarses

xarses commented Jan 23, 2014

@amonakov's kernel options resolved my issue with debian 3.12 kernel on lenovo W530

@rickysarraf

@xarses Could you please mention which kernel options you used?

Are you referring to the option "rcutree.rcu_idle_gp_delay=1"?

@dkatargin

Same error with/without rcu_idle_gp_delay (#570)

> cat /sys/module/rcutree/parameters/rcu_idle_gp_delay
1
> optirun glxspheres
[ 1428.559477] [ERROR]Cannot access secondary GPU - error: Could not enable discrete graphics card
[ 1428.559529] [ERROR]Aborting because fallback start is disabled.

OS: Gentoo_x64
bbswitch: 0.7, 0.8, 9999
nvidia_drivers: 334.21-r3, 331.49
kernel: 3.12, 3.10

@mcilloni

It happened to me too, today, with kernel 3.10, 3.14 and nvidia 344.21 on arch x86_64

It didn't work for a while, then after a reboot some hours later optirun worked, but primusrun crashed the whole system.

Now I've rebooted again on 3.14 and primusrun works perfectly. I am on kepler (650m) and ivy (i7-3610QM)

@dkatargin

CONFIG_HZ_1000=y
CONFIG_HZ=1000

and kernel 3.14 fixed my problem

@mcilloni

I fixed my issues by blacklisting nvidia and adding rcutree.rcu_idle_gp_delay=1 to the kernel cmdline
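
To confirm that the option actually reached the running kernel, something like the following sketch may help. It assumes the booted command line is readable from `/proc/cmdline`; the function name `has_cmdline_opt` is hypothetical.

```shell
#!/bin/sh
# has_cmdline_opt FILE OPTION
# Hypothetical helper: succeed if OPTION appears as a whole word in FILE
# (normally /proc/cmdline, which holds the booted kernel's command line).
has_cmdline_opt() {
    cmdline_file=$1
    opt=$2
    # Split on spaces, then match the option literally (-F) as a full line (-x).
    tr ' ' '\n' < "$cmdline_file" | grep -Fxq -- "$opt"
}
```

For example, `has_cmdline_opt /proc/cmdline rcutree.rcu_idle_gp_delay=1 && echo "option present"`. Matching whole words keeps `rcutree.rcu_idle_gp_delay=10` from being mistaken for `=1`.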

@hadronized

Issue not solved with kernel 3.16.1-1

Additional info: http://lpaste.net/110063

@koct9i

koct9i commented Nov 4, 2014

I've found a possible solution. Somebody please check the top patch from https://github.com/koct9i/linux/commits/acpi and report results.

The patch should apply cleanly even on v3.10. Also remove the workaround (rcutree.rcu_idle_gp_delay=1) from the command line.

@ArchangeGabriel
Member

For information, this is fixed in 3.19 (per this commit https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=commit/?id=74b51ee152b6d99e61ba329799a039453fb9438f), and might be backported to older kernels, but I’m not sure about that.

@ArchangeGabriel
Member

I’m closing this since the fix should by now be widely documented, and most recent distros ship a >=3.19 kernel.

@y-usuzumi

I still occasionally run into this issue:
kernel 4.13.5. rcutree.rcu_idle_gp_delay=1 didn't help.

dmesg:

[18454.962751] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[18454.963239] NVRM: The NVIDIA GPU 0000:01:00.0
               NVRM: (PCI ID: 10de:139b) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[18454.963267] nvidia: probe of 0000:01:00.0 failed with error -1
[18454.963286] NVRM: The NVIDIA probe routine failed for 1 device(s).
[18454.963287] NVRM: None of the NVIDIA graphics adapters were initialized!
[18454.963382] nvidia-nvlink: Unregistered the Nvlink Core, major device number 242

@grubshka

I'm using 4.19.0.9 and rcutree.rcu_idle_gp_delay does not exist anymore.
I tried rcutree.gp_init_delay=1 and now everything is back!
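
Since the available parameter differs between kernel versions, a rough sketch like the one below can show which of the two names mentioned in this thread a given kernel exposes. It assumes both would appear under `/sys/module/rcutree/parameters` when present, which is not verified here for every kernel; the function name `list_rcu_delay_params` is hypothetical.

```shell
#!/bin/sh
# list_rcu_delay_params DIR
# Hypothetical helper: print each of the two parameters named in this thread
# that exists under DIR, together with its current value.
list_rcu_delay_params() {
    param_dir=$1
    for name in rcu_idle_gp_delay gp_init_delay; do
        if [ -r "$param_dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$param_dir/$name")"
        fi
    done
}
```

Typical use would be `list_rcu_delay_params /sys/module/rcutree/parameters`. An empty result means neither parameter is exposed at that location.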
