[Debian] Nvidia Card randomly turning on #144

Closed
Hoverbear opened this issue May 2, 2012 · 65 comments

Comments

@Hoverbear

Hi all,
Having some issues with Bumblebee on Debian Wheezy... Upon starting the daemon, power management seems to work all right, but after a while the card just randomly turns on. At first I thought it was Flash accessing the 32-bit nvidia GLX or something, but I don't think that's the case.

Using Bumblebee with the Nvidia binary driver (though the problem exists with Nouveau as well).

# cat /etc/modprobe.d/nvidia.conf 
blacklist nvidia

Here's a grab from /var/log/messages:

May  2 08:12:22 turing kernel: [14433.970674] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:12:22 turing kernel: [14433.970697] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:12:22 turing kernel: [14433.970709] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
May  2 08:12:22 turing kernel: [14433.970724] nvidia 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  2 08:12:22 turing kernel: [14433.970731] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:12:22 turing kernel: [14433.970775] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
May  2 08:12:22 turing kernel: [14433.971486] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  295.40  Thu Apr  5 21:37:00 PDT 2012
May  2 08:23:14 turing kernel: [15085.014467] bbswitch: disabling discrete graphics
May  2 08:23:14 turing kernel: [15085.030200] pci 0000:01:00.0: Refused to change power state, currently in D0
May  2 08:23:14 turing kernel: [15085.031875] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:23:14 turing kernel: [15085.142589] pci 0000:01:00.0: power state changed by ACPI to D3
May  2 08:23:25 turing kernel: [15095.819263] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:23:25 turing kernel: [15095.819273] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:23:25 turing kernel: [15095.819339] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:23:25 turing kernel: [15095.819350] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
May  2 08:23:25 turing kernel: [15095.819366] nvidia 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  2 08:23:25 turing kernel: [15095.819401] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
May  2 08:23:25 turing kernel: [15095.820118] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  295.40  Thu Apr  5 21:37:00 PDT 2012
May  2 08:23:30 turing kernel: [15101.269372] bbswitch: disabling discrete graphics
May  2 08:23:30 turing kernel: [15101.285279] pci 0000:01:00.0: Refused to change power state, currently in D0
May  2 08:23:30 turing kernel: [15101.286816] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:23:31 turing kernel: [15101.397524] pci 0000:01:00.0: power state changed by ACPI to D3
May  2 08:24:30 turing kernel: [15160.564168] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:24:30 turing kernel: [15160.564178] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:24:30 turing kernel: [15160.564183] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
May  2 08:24:30 turing kernel: [15160.564191] nvidia 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  2 08:24:30 turing kernel: [15160.564209] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
May  2 08:24:30 turing kernel: [15160.564527] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  295.40  Thu Apr  5 21:37:00 PDT 2012
May  2 08:24:30 turing kernel: [15160.565381] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:27:57 turing kernel: [15366.801038] bbswitch: disabling discrete graphics
May  2 08:27:57 turing kernel: [15366.814651] pci 0000:01:00.0: Refused to change power state, currently in D0
May  2 08:27:57 turing kernel: [15366.815807] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:27:57 turing kernel: [15366.927034] pci 0000:01:00.0: power state changed by ACPI to D3
May  2 08:28:25 turing kernel: [15395.063147] bbswitch: enabling discrete graphics
May  2 08:28:25 turing kernel: [15395.304803] pci 0000:01:00.0: power state changed by ACPI to D0
May  2 08:28:25 turing kernel: [15395.304826] pci 0000:01:00.0: power state changed by ACPI to D0
May  2 08:28:25 turing kernel: [15395.304855] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:28:25 turing kernel: [15395.304943] pci 0000:01:00.0: power state changed by ACPI to D0
May  2 08:28:25 turing kernel: [15395.304951] pci 0000:01:00.0: power state changed by ACPI to D0
May  2 08:28:25 turing kernel: [15395.304968] pci 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  2 08:28:25 turing kernel: [15395.339725] bbswitch: disabling discrete graphics
May  2 08:28:25 turing kernel: [15395.355479] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:28:25 turing kernel: [15395.466181] pci 0000:01:00.0: power state changed by ACPI to D3
@Hoverbear
Author

Managed to get a clip of /var/log/messages right after (I think) the card turned on


/var/log# tail messages
May  2 08:35:47 turing kernel: [15836.349142] e1000e 0000:00:19.0: BAR 1: set to [mem 0xf392b000-0xf392bfff] (PCI address [0xf392b000-0xf392bfff])
May  2 08:35:47 turing kernel: [15836.349152] e1000e 0000:00:19.0: BAR 2: set to [io  0x6080-0x609f] (PCI address [0x6080-0x609f])
May  2 08:35:59 turing kernel: [15847.909345] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:35:59 turing kernel: [15847.909353] thinkpad_acpi: EC reports that Thermal Table has changed
May  2 08:35:59 turing kernel: [15847.909386] nvidia 0000:01:00.0: power state changed by ACPI to D0
May  2 08:35:59 turing kernel: [15847.909402] nvidia 0000:01:00.0: enabling device (0000 -> 0003)
May  2 08:35:59 turing kernel: [15847.909423] nvidia 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
May  2 08:35:59 turing kernel: [15847.909469] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
May  2 08:35:59 turing kernel: [15847.910192] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  295.40  Thu Apr  5 21:37:00 PDT 2012
May  2 08:36:15 turing kernel: [15863.962150] [Hardware Error]: Machine check events logged

@Lekensteyn
Member

What is that "Hardware Error"? After blacklisting, did you run update-initramfs -u? Always check whether /dev/nvidia{ctl,0} exist. If they do, then you can expect your card to power on at random.
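
The checks suggested here can be scripted. This is only a minimal sketch: the helper name `nvidia_nodes` and its optional directory argument are made up for illustration, and only the `/dev/nvidia0` and `/dev/nvidiactl` paths come from this thread.

```shell
#!/bin/sh
# List the NVIDIA device nodes found in a directory (default: /dev).
# The directory argument exists only so the check can be tried on a scratch dir.
nvidia_nodes() {
    dir="${1:-/dev}"
    for node in "$dir/nvidia0" "$dir/nvidiactl"; do
        [ -e "$node" ] && echo "$node"
    done
    return 0
}

found="$(nvidia_nodes)"
if [ -n "$found" ]; then
    echo "device nodes present, the card may get powered on:"
    echo "$found"
else
    echo "no /dev/nvidia* nodes found"
fi

# /proc/modules lists the loaded kernel modules, one per line.
if grep -q '^nvidia ' /proc/modules 2>/dev/null; then
    echo "nvidia module is loaded"
fi
```

If the nodes show up right after boot even though the module is blacklisted, the blacklist probably has not reached the initramfs yet, which is what `update-initramfs -u` takes care of.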

@Hoverbear
Author

Hi Lekensteyn, I just updated Bumblebee from http://suwako.nomanga.net/ .... Going to do some things and see if the problem is still present. I had not run update-initramfs; I've run it now. Both /dev/nvidia{ctl,0} exist, however.

@Lekensteyn
Member

update-initramfs just makes sure that the blacklist applies at next boot. When the /dev/nvidia* files exist, your card will get turned on at random. So, by blacklisting, you prevent that from happening on the next boot until you use the nvidia card through optirun.

@Hoverbear
Author

After doing update-initramfs -u and rebooting, as well as updating to the newest bumblebee from said repo, the problem seems to be solved. I will re-open the issue if I encounter it again.

Thank you Lekensteyn for your help.

@Hoverbear Hoverbear reopened this May 3, 2012
@Hoverbear
Author

It appears I'm still having this issue. I have the nvidia module blacklisted in my /etc/modprobe.d/nvidia.conf and have run update-initramfs -u.

@Lekensteyn
Member

Does /dev/nvidia* exist after such a failure? Note that you can just purge the nvidia driver if you do not need to use the nvidia card through optirun or CUDA.

@Hoverbear
Author

I'm noting the appearance of /dev/nvidia{ctl,0} and /dev/nvram even when the card is off. I would like to keep the ability to use the Nvidia card in the future, as I am interested in CUDA.

$  lspci -v -d 10de:
01:00.0 VGA compatible controller: NVIDIA Corporation GF119 [Quadro NVS 4200M] (rev ff) (prog-if ff)
    !!! Unknown header type 7f
$ ls /dev/nv*
/dev/nvidia0  /dev/nvidiactl  /dev/nvram

@Lekensteyn
Member

Whenever the card is enabled, you can disable it by triggering a start/stop action: optirun true. This is not ideal, but at least it disables the card. I've noticed that some users who have CUDA installed experience this issue more often.
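
A minimal sketch of that workaround as a script. It assumes bbswitch exposes its state in `/proc/acpi/bbswitch` as a single line such as `0000:01:00.0 ON`; the helper name `card_state` is made up for illustration.

```shell
#!/bin/sh
# Print the power state (ON or OFF) found in a bbswitch status file.
# The path argument is only there so the parser can be tried on a sample file.
card_state() {
    awk '{ print $2 }' "${1:-/proc/acpi/bbswitch}"
}

# Only act on a machine that actually has bbswitch and optirun.
if [ -r /proc/acpi/bbswitch ] && command -v optirun >/dev/null 2>&1; then
    if [ "$(card_state)" = "ON" ]; then
        # Starting and stopping a trivial program lets Bumblebee switch the card off again.
        optirun true
    fi
fi
```

Running something like this periodically would only shorten the time the card stays on; it does not stop whatever wakes the card up in the first place.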

@Hoverbear
Author

Hi Lekensteyn, I guess that's a suitable workaround. Thank you. Closing this issue.

@Lekensteyn Lekensteyn reopened this May 8, 2012
@Lekensteyn
Member

@Hoverbear I've noticed that the NVreg_ModifyDeviceFiles=0 module option prevents the /dev/nvidia* files from being created. This indeed does what its description suggests, rendering the module unusable. However, it also seems that these files can be removed safely when the module is unloaded. Can you try that and see if you have any issues when manually removing those files?
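
One way to try the manual removal without deleting the nodes while the driver is still in use, as a rough sketch. The function name `remove_stale_nodes` and its two optional arguments (a module list file and a device directory) are made up for illustration; they default to `/proc/modules` and `/dev`.

```shell
#!/bin/sh
# Remove the NVIDIA device nodes, but only when the nvidia module is not loaded.
# $1: file listing loaded modules (default /proc/modules)
# $2: directory holding the device nodes (default /dev)
remove_stale_nodes() {
    modules="${1:-/proc/modules}"
    dir="${2:-/dev}"
    if grep -q '^nvidia ' "$modules" 2>/dev/null; then
        echo "nvidia module still loaded, leaving the nodes alone"
        return 1
    fi
    rm -f "$dir/nvidia0" "$dir/nvidiactl"
}

# Intended use, as root, after the module has been unloaded:
#   remove_stale_nodes
```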

@Hoverbear
Author

Hi Lekensteyn, I actually ended up erasing that installation and trying F17; however, I'm back on Debian and no longer seem to be having the issue.

@Lekensteyn
Member

Ah, with the proprietary nvidia driver?

@Hoverbear
Author

Yes, in fact.

@hni

hni commented Jun 7, 2012

Hi Lekensteyn,
I have the same issue on Arch Linux. So what is the procedure if I want to use optirun but do not want the nvidia card to turn on at random? Will deleting /dev/nvidia* or using NVreg_ModifyDeviceFiles=0 solve the problem permanently without interfering with bumblebee?

EDIT: manually running 'optirun true' is not really helping, because often when I am not at my machine, nvidia will decide to activate the card, resulting in a temperature increase of between 5 and 20 degrees.

@Lekensteyn
Member

Do you happen to use CUDA?

@hni

hni commented Jun 8, 2012

Not that I know. I haven't installed any CUDA related packages consciously. I use the nvidia proprietary driver, my gpu is NVS 4200M inside a T420s Thinkpad. Anecdotally, the only other person I have seen with this problem also uses a T420 with Arch Linux (reference: https://bbs.archlinux.org/viewtopic.php?pid=1112688#p1112688).

@Lekensteyn
Member

libvdpau is mentioned, do you have that installed? Can you try modifying the C source to delete /dev/nvidia0 and /dev/nvidiactl after disabling the video card? Have a look at src/switchers/switchers.c (iirc)

@hni

hni commented Jun 8, 2012

yes, I have libvdpau installed. Working on modifying the C source at the moment. One thing I noticed is that when I do cat /dev/nvidia0, the nvidia card will switch on. Just mentioning it in case it is significant.

@Lekensteyn
Member

Yeah, that is exactly the problem here. libvdpau probes for the device, which in turn loads the driver and enables the card. That is really, really bad and totally undesirable. I'm curious whether removing the /dev/nvidia{ctl,0} stuff helps.

@hni

hni commented Jun 8, 2012

running my own version of bumblebee now with this change: hni@26f23f2

Will report back how it works, looks good so far.

@hni

hni commented Jun 17, 2012

So far I have no issues. It might be worth adding something similar to the main branch to expose it to more users.

@Lekensteyn
Member

It may be worth filing this issue on nvnews.net. The char devices are supposed to get unregistered when the module is unloaded.

@hni

hni commented Jun 17, 2012

Sorry, I am not familiar with that website. Is that an official NVIDIA forum? I was not able to find a bug tracker there. Do you mean I should start a thread here: http://www.nvnews.net/vbulletin/forumdisplay.php?f=14?

@Lekensteyn
Member

Yup, that one.

@gnetwork-git

thanks to you both. I am settling in after a large meal and watching Batman Begins (2005). will try these hacks tomorrow, and also the "clean" way from hni's package. ultimately, i want to make a debianized package version available through a repository, for Wheezy and SolusOS. maybe I should contact the guy with the current debian repo too, he must know about this. i know this can be fixed on my machine, but feel it's more useful to make it easily available to all (without them going through 100 steps only to find it fails). will post then, thank you.

@gnetwork-git

@nxdefiant
I'm on Debian Wheezy too, the only difference on SolusOS is Gnome 3 desktop is modified to look like Gnome 2.
My /lib/udev/rules.d/91-permissions.rules also contains
SUBSYSTEM=="nvidia", GROUP="video"
I avoided your suggestion to add myself to group "video" at first, as I have no idea of the implications of doing this (stability, security, etc.), but after doing so all seems to work fine now, thank you.

@Lekensteyn
as the hni source was incomplete, i merged it with the standard bumblebee source and then built it; the final outcome was a mess, enough said.
The code you mentioned, added to /etc/rc.local, didn't help (not sure if i did it right anyway - limited instructions).

You say "If xorg runs as root, the device files should be created". xorg is indeed running as root, and the files were not recreated.

nxdefiant's suggestion of creating /etc/udev/rules.d/99_nvidia.rules:
DEVPATH=="/module/nvidia", ACTION=="remove", RUN+="/bin/rm /dev/nvidia0 /dev/nvidiactl"
did work, but only after adding myself to group "video". To your knowledge, is doing this a potential problem or security issue? or is there a better way, like a patch or something?
If ok, for now this is an easy enough fix for the problem in Debian Wheezy.

@Lekensteyn
Member

@gnetwork-git Hopefully you did not add the acpi-handle-hack, that was something machine-specific ;)

Adding the 99_nvidia.rules thing does not compromise security. Adding a user to the video group allows you to restrict access to a select group. It should be relatively safe, though it also allows members to access other /dev/dri/card* devices. I am not sure what the exact implications are, other than having a direct line with the kernel video driver.

@hni

hni commented Jul 17, 2012

@gnetwork-git
note that the patch is in the 'develop' branch, not master. Cloning and building that branch should work, I have been running it ever since I forked and there have been no changes

@gnetwork-git

@hni i used the one from https://github.com/hni/Bumblebee/tarball/master and merged it with the standard.

@Lekensteyn
Great. So we now have just a 2 line fix for the problem of cards turning on unnecessarily (usually by Flash or Mplayer, and not under optirun) in Debian Wheezy and possibly other distros.

Run as root, substituting your username for $USER:

# echo 'DEVPATH=="/module/nvidia", ACTION=="remove", RUN+="/bin/rm /dev/nvidia0 /dev/nvidiactl"' >> /etc/udev/rules.d/99-nvidiactrl.rules
# usermod -a -G video $USER
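
Because `>>` appends the rule again on every run, a repeatable variant may be handy. The following is only a sketch: the helper names `add_rule` and `in_group` and the user name `alice` are placeholders, while the rule text and file name are the ones given above.

```shell
#!/bin/sh
# The rule from this thread: delete the stale nodes when the nvidia module is removed.
RULE='DEVPATH=="/module/nvidia", ACTION=="remove", RUN+="/bin/rm /dev/nvidia0 /dev/nvidiactl"'

# Append the rule to a file only if that exact line is not in it yet.
add_rule() {
    file="$1"
    grep -qxF "$RULE" "$file" 2>/dev/null || printf '%s\n' "$RULE" >> "$file"
}

# Succeed if user $1 is a member of group $2.
in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qxF "$2"
}

# Intended use, as root ("alice" stands for your own user name):
#   add_rule /etc/udev/rules.d/99-nvidiactrl.rules
#   in_group alice video || usermod -a -G video alice
```

A new group membership normally takes effect only after logging out and back in.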

Will you be doing much more on Bumblebee, or winding down due to coming Prime release, maybe 6 months away? - though I will believe it when I see it!

@nxdefiant

I'm wondering. I always thought being a member of the video group was required for dri to work?

@gnetwork-git

@hni if your fork works well, i'm happy to try it, and make available as repo.
just post download link and basic instructions. thanks.

@hni

hni commented Jul 17, 2012

@gnetwork-git the download link is https://github.com/hni/Bumblebee/tarball/develop. Instructions are the same as for the unpatched bumblebee.

As written further above, I have tried to raise this issue in the nvidia forums so that the fix can be included in the nvidia proprietary driver, but there has been no answer for roughly a month. Hence the question remains whether this patch should be included in mainline. I personally think udev is not the right place to handle this, but I have no strong feelings either way. If it is decided this patch should not be merged into mainline, I might merge it into my forked master and create an Arch Linux package for convenience.

@Lekensteyn
Member

The devices are supposed to get removed on module unload. Why that isn't happening, I don't know.

@Lekensteyn Lekensteyn reopened this Jul 17, 2012
@gnetwork-git

@hni tried that one too. sorry mate, the instructions for building your source cannot be the same as for unpatched Bumblebee; you are missing files, that's why i had to merge it last time to try to make it work.
if you don't have a complete source, instructions are needed to make it work.

@Lekensteyn "The devices are supposed to get removed on module unload. Why that isn't happening, I don't know."
i'm just guessing, but maybe a permission issue with some OS configs?

@Lekensteyn
Member

@gnetwork-git I'd expect the nvidia driver to take care of the devices itself, but since it's closed source, anything could happen (or not).

Building from git requires you to run autoreconf -fi to generate configure.

@gnetwork-git

ok i had another go, and always start with a clean install for surety. installed autoreconf, then ran autoreconf -fi as suggested, then did ./configure with defaults for Wheezy as per http://wiki.debian.org/Bumblebee
it solved the problem of the card unnecessarily turning ON, but when running optirun the renderer was Gallium, not NVIDIA, and ran way too slow. so i added Driver=nvidia and KernelDriver=nvidia to bumblebee.conf,
rebooted, then tried optirun under the nvidia driver: again slow, and the same error as above in #144 (comment)

$ optirun /opt/VirtualGL/bin/glxspheres
Polygons in scene: 62464
Visual ID of window: 0x21
NVIDIA: could not open the device file /dev/nvidiactl (Permission denied).
[VGL] WARNING: The OpenGL rendering context obtained on X display
[VGL] :8 is indirect, which may cause performance to suffer.
[VGL] If :8 is a local X display, then the framebuffer device
[VGL] permissions may be set incorrectly.
Context is Indirect
OpenGL Renderer: GeForce GT 525M/PCIe/SSE2
54.016971 frames/sec - 48.427295 Mpixels/sec
55.457619 frames/sec - 49.718864 Mpixels/sec
54.347052 frames/sec - 48.723219 Mpixels/sec
53.702776 frames/sec - 48.145613 Mpixels/sec
53.669497 frames/sec - 48.115777 Mpixels/sec
53.517453 frames/sec - 47.979467 Mpixels/sec
52.942158 frames/sec - 47.463703 Mpixels/sec

looks like nxdefiant hack is only one to work in Debian Wheezy for now.

PS: i normally get well over 100 on framerates and Mpixels/sec on optirun glxspheres

@nxdefiant

According to http://wiki.debian.org/NvidiaGraphicsDrivers/ you need to be in the video group to get any 3d acceleration.

@hni

hni commented Jul 18, 2012

@gnetwork-git sorry can't help you with your problem. I built from my hni git 'develop' branch (see https://gist.github.com/3135959 for configure options, albeit tailored to Arch Linux), added myself to the bumblebee group (I am also member of the video group) and then everything works.

@gnetwork-git

@nxdefiant the only place on that page that mentions adding yourself to video group, is in reference to serious problem running
$ grep Driver /etc/X11/xorg.conf 2>&1|grep nvidia
and we know in bumblebee there is no xorg.conf

i think adduser to video group may be required in our case as bumblebee is being subverted or bypassed, and normally bumblebee (the group which we are added to) takes care of things.

@hni thanks for your work, its probably better suited for Arch. Wheezy can be fixed with one command and adduser to video group.

@nxdefiant

gnetwork-git, it is in the problem section because the X-user is usually put into the video-group by the Debian-Installer.

@seankhl

seankhl commented Jul 31, 2012

Sorry if this is redundant, but I tried the 2-liner that was proposed for Wheezy and it worked. Thanks.

@gnetwork-git

seanlaguna glad it worked for you, it is the current fix we use with SolusOS 2, which is based on Wheezy.
I will be keeping the following page updated for best installation procedure of Bumblebee in SolusOS/Debian Wheezy, bookmark/share it:
http://main.solusos.com/showthread.php?1817-Optimus-Solutions-NVIDIA-Intel-Hybrid-Graphics-Bumblebee-SolusOS-2

@gnetwork-git

Wow. I just read the following article, and it appears Nvidia may have to pay some more attention to /dev/nvidia0 - it is now a security issue...
NVIDIA Linux Driver Hack Gives You Root Access
http://www.phoronix.com/scan.php?page=news_item&px=MTE1MTk
"...It basically abuses the fact that the /dev/nvidia0 device accept changes to the VGA window and moves the window around until it can read/write to somewhere useful in physical RAM, then it just does an priv escalation by writing directly to kernel memory."

@svenstaro

I've got the same problem on Arch Linux and bumblebeed 3.0.1. It appears there is no fix thus far?

@hni

hni commented Aug 20, 2012

I never received a reply at the nvidia forums (http://www.nvnews.net/vbulletin/showthread.php?t=184442). I think both the udev workaround and the changes in my fork do the job. Would be great to have a patched package in Community though :-).

@Lekensteyn
Member

imo the udev rule is the best solution to this.

Lekensteyn added a commit that referenced this issue Aug 20, 2012
If a user has udev, install it in /lib/udev/rules.d (if you are a
packager) or /etc/udev/rules.d (if you are a user). (GH-144)
@svenstaro

@hni You should write the NVIDIA Linux team directly. Also, where are your changes?

The udev rule should be added to the Bumblebee installation routine.

@grossetti

The OP said it does the same thing w/ the Nouveau driver, so maybe this is not an NVIDIA driver issue but a more general issue?
