Memory/data corruption / crash on Lenovo T440p (GT 730M). #78
Kernel version:
OS: Linux Mint 16 with a mainline kernel; this also happens with the default (Ubuntu-patched) 3.11 kernel and on other distributions. The system crashes at this line: https://github.com/Bumblebee-Project/bbswitch/blob/master/bbswitch.c#L289 If the bbswitch module is loaded during suspend, it crashes right away (obviously, as bbswitch enables the card); if the module is unloaded, it crashes on resume (probably because the kernel sets the power state). cat /proc/acpi/dump_info:
Launchpad gives me a timeout, here's the ACPI debug tarball: http://media.leoluk.de/LENOVO-20AWS02A00.tar.gz
|
Interesting ACPI methods for For some reason,
|
Linux tries to always report compatibility with Windows (such as |
No, I haven't installed either driver. How do I enable nouveau's dynamic power management? |
Can you post your dmesg somewhere? Unless you blacklisted it, nouveau will get loaded (bumblebee does unload it before using bbswitch, so be sure to disable bumblebeed too). To enable dynamic PM, you can write to sysfs or use |
This is what happened after loading the nouveau module:
|
I tried disabling/enabling the card after loading the nouveau module and enabling runtime PM, but it still crashed the system. |
Switching it off using acpi_call works fine (enabling still crashes the system):
|
Is there any way to prevent the card from being enabled / keep the system from crashing after a resume? |
|
I tried this, but it did not work. As soon as the I installed Windows and tried enabling/disabling the card, which worked fine, so apparently it's doing something different. |
I'm also observing this on my T440p (with the newest BIOS, 1.17) with Ubuntu 13.10. As soon as I do echo ON > /proc/acpi/bbswitch I get weird memory corruption issues. It's apparently not a driver issue (nouveau/nvidia), as the driver is not even loaded when I do the switch. I already use acpi_osi="!Windows 2012" for other reasons (backlight control), but it does not help. |
By the way, I get some ACPI warnings during the first module load:
[ 5.100039] bbswitch: version 0.7
[ 5.100042] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.VID_
[ 5.100046] bbswitch: Found discrete VGA device 0000:02:00.0: \_SB_.PCI0.PEG_.VID_
[ 5.100052] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[ 5.100321] bbswitch: detected an Optimus _DSM function
[ 5.100332] pci 0000:02:00.0: enabling device (0004 -> 0007)
[ 5.100359] bbswitch: Succesfully loaded. Discrete card 0000:02:00.0 is on
[ 5.101973] bbswitch: disabling discrete graphics
[ 5.101980] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95) |
Those warnings are harmless, see ee0591b. The memory corruption, etc. sound (as I said before) like issues related to power. @leoluk You stated that you have tried nouveau, in what way did you disable the nvidia card? Have you just enabled runtime PM and then waited for it to kick in? Or write to the vgaswitcheroo file in debugfs? |
The |
How to reliably reproduce the problem:
Maybe this is useful for gathering more information (all T440p affected? only a small subset? just the latest BIOS version?). |
Partial success! I recompiled the mainline kernel with a modified DSDT table and commented out the entire The kernel understands that the power state change fails:
Now I'll just have to figure out why powering on the card is crashing the system. Any ideas? The Windows driver might use another mechanism to power down the card. |
Hi, Let me know if I can provide any additional information. Keep in mind that I'm not a hardware expert (and DSDT hacking is a thing I've never touched). |
I wonder if it has something to do with bbswitch reading from the PCI configuration space to determine whether a card is available or not. @leoluk If vgaswitcheroo is not available, I still would like to know if nouveau runtime PM exposes the issues experienced here. Watch for the DSM warnings (hey, useful debugging tool now :-) ) to see when the card gets disabled. |
@Lekensteyn If that helps: the problem can be triggered using only @rkaw92 If you're adventurous and submit your machine information before (see below), you could try the modified kernel and check if it prevents the memory corruption. https://github.com/Bumblebee-Project/bbswitch#reporting-bugs |
I just tried out nouveau on vanilla kernel Shortly after that the system crashed again with the known symptoms. I guess nouveau decided that the card was not in use and tried to disable it. After the crash I briefly saw two of the ACPI warnings, but dmesg was quickly flooded by the memory errors. I will try to capture that more precisely. |
Here is a more complete log of the crash using nouveau: https://gist.github.com/x-quadraht/7902930 You can see virtuoso dying, as a first casualty of the memory corruption... |
Sorry, I haven't had the time to properly survey my system nor try the DSDT override.
|
@rkaw92 That's what I tried first (actually, I disabled it with |
Two observations: |
Interesting observation, this means that the DSDT override is not even necessary (but useful if you want to make sure that the card stays disabled). Possible explanation: bbswitch calls |
I ended up returning my T440p for unrelated reasons (fan noise, broken wifi), so unfortunately I can no longer contribute to this bug report by debugging it. A few suggestions:
|
I'd be interested in taking a look at using the windows kernel debugger. I don't have any experience w/ Windows debugging though (although probably could figure it out from above documentation), and more problematically how to install the checked acpi.sys :| |
I can confirm the same filesystem corruption with bumblebee. I have the same T440p. I first noticed it right after installing Manjaro Linux, which installs with bbswitch enabled by default. I then switched to Arch Linux, installed from scratch without bbswitch, and saw no issues. I then installed bumblebee/bbswitch and within a few minutes got filesystem corruption. I then installed Fedora, which does not include it in the default install either, and it is likewise stable. |
Hello, |
Hi, Running 4.8 for a while right now (without any patches). Everything seems to be OK. But I don't use NVIDIA GPU at all. |
Does the left side of the keyboard still get hot or is it okay when the |
It becomes a bit hot when I run CPU-intensive tasks. But under normal conditions (like browsing) it's just slightly warm. |
I still see memory corruption using kernel 4.8.6 on T440p. Turning the card off works, but if I run The last message from nouveau is "DRM: resuming kernel object tree..." @dionorgua Are you using any special kernel cmdline parameters like acpi_osi=XYZ? Anything else that might be special in your setup? I'm running BIOS version 1.17, which is pretty old. Should I update? |
@xqms Are you using any special kernel parameters or modules? Can you post your full dmesg somewhere? |
@Lekensteyn I have no custom kernel modules. I'm using the ubuntu mainline 4.8.6 kernel, which is pretty vanilla. I will try to capture dmesg, although there is a bit of luck involved whether the system stays alive long enough to save something. |
dmesg is here: https://gist.github.com/xqms/9c5f35509a4ea9e2d9a6be9bb55c50b7 This time the system froze completely without further evidence for memory corruption. |
My parameters from /proc/cmdline: I don't think that there is anything really special here. i915.enable_psr=0 is needed to avoid display flickering when an external monitor is connected. As for the BIOS, I currently use 2.37. I think it's better to update, because there were 'Win8/Win10 support' entries in the changelog that are probably related to Optimus. The old BIOS is probably the root cause of the memory corruption for you. There are also some differences in dmesg. Yours:
Mine:
So you don't have the message about PR and 'will not use DSM'. In any case, I want to warn you that once you have updated, there is no way to downgrade again. |
@dionorgua Could you upload your acpidump? Alternatively, upload acpidump+dmidecode+lspci following the instructions at https://bugs.launchpad.net/lpbugreporter/+bug/752542 In a slightly older tarball for this laptop (LENOVO-20AWS02A00, BIOS GLET41WW (1.16), 10/27/2013), there seems to be some code for Windows 8 compatibility, but I cannot judge whether it is good or bad. Having something to compare against would be nice. Also note that this version differs by one compared to @xqms's dmesg (so maybe you could also upload your acpidump?) As for the |
I uploaded my data here: https://bugs.launchpad.net/lpbugreporter/+bug/752542/comments/801 I'll try the BIOS update next. |
Yep, the BIOS update did the trick. acpidump from the new BIOS: https://bugs.launchpad.net/lpbugreporter/+bug/752542/comments/802 |
It's from BIOS 2.37. |
@xqms you've updated to 2.39, everything else is OK? |
@dionorgua Yes, so far everything works. I haven't done anything fancy with the nvidia GPU yet, though. |
Hm, BIOS 1.17 has a date before 2015 (11/14/2013), so that is why the new PR3 functionality in kernel 4.8 was not activated. Between 1.17 and 2.39 there are no significant changes in the ACPI tables that could have an influence on this. |
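To check whether a given machine falls on the same side of that cut-off, the BIOS date can be read from sysfs and compared against the year. This is only a rough sketch: the 2015 threshold is taken from the comment above, the `MM/DD/YYYY` format is assumed from the dates quoted in this thread, and the function names are made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Assumption: cut-off year as stated in the comment above ("a date before 2015").
PR3_MIN_BIOS_YEAR = 2015


def bios_year(date_text: str) -> Optional[int]:
    """Extract the year from a DMI BIOS date such as '11/14/2013'."""
    parts = date_text.strip().split("/")
    if len(parts) != 3 or not parts[2].isdigit():
        return None  # unexpected format, do not guess
    return int(parts[2])


def bios_new_enough(date_text: str, min_year: int = PR3_MIN_BIOS_YEAR) -> bool:
    """True if the BIOS date is at or after the assumed cut-off year."""
    year = bios_year(date_text)
    return year is not None and year >= min_year


def read_bios_date(path: str = "/sys/class/dmi/id/bios_date") -> str:
    """Read the BIOS date exported by the kernel's DMI sysfs interface."""
    return Path(path).read_text()
```

On a running system, `bios_new_enough(read_bios_date())` would report whether the installed BIOS is dated 2015 or later; for the BIOS 1.17 date quoted above (11/14/2013) it returns `False`.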
Do I understand correctly that currently there's no way to have both proper PM (fully turned off card) and nvidia proprietary drivers at all? I'm not complaining, it's just that there are several threads with discussions mostly about nouveau going on, and it's difficult to catch up on the progress. P.S. That's on T440p with new BIOSes. |
Probably not. As far as I understand, currently (with the latest kernel and DRI_PRIME) nouveau is responsible for power management of the NVIDIA card. It may be possible to get power management plus the NVIDIA driver with a dedicated X server, using scripts that rmmod nvidia && modprobe nouveau once the external GPU is unused. But I don't know whether that is implemented anywhere. |
Yeah, I've thought about something along those lines if bbswitch is not a full-featured option currently. Thanks for the confirmation! |
If you are feeling adventurous, there is a pm-rework branch which you can try. The vgaswitcheroo addition is however not stable, it oopsed somewhere. That is, try the branch at 5c7b3f5 |
I'd be glad to. Do you want any specific tests or just feedback on whether it works? On November 20, 2016 11:57:20 PM GMT+03:00, Peter Wu notifications@github.com wrote:
|
@abbradar Some people have reported that it works for them, maybe it is also good enough for you. YMMV As for the test: OFF/ON/OFF/ON, system sleep/resume. |
Lenovo T440p, BIOS 2.39. Disconnected external displays, cold boot, bumblebeed disabled.
So, no luck for me it seems. If you want any additional testing, want to try out a patch on this hardware etc. feel free to ping! And thanks for the try, anyway ^_^ |
Normally this should show auto/suspended for both the Nvidia and parent device:
For this to work, you must not have the |
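The auto/suspended state mentioned above can also be read programmatically. The following is a minimal sketch, assuming the standard runtime PM sysfs attributes `power/control` and `power/runtime_status`; the helper names are hypothetical, and the device address in the usage note is the discrete card from the log earlier in this thread, which may differ on other machines.

```python
from pathlib import Path
from typing import Dict, List


def runtime_pm_state(device_dir: str) -> Dict[str, str]:
    """Read the runtime PM attributes of one PCI device directory in sysfs."""
    power = Path(device_dir) / "power"
    return {
        "control": (power / "control").read_text().strip(),
        "runtime_status": (power / "runtime_status").read_text().strip(),
    }


def is_runtime_suspended(state: Dict[str, str]) -> bool:
    """True if the device is under automatic PM and currently suspended."""
    return state["control"] == "auto" and state["runtime_status"] == "suspended"


def gpu_and_parent(device_dir: str) -> List[Path]:
    """Return the resolved device directory and its parent (the upstream bridge).

    Entries under /sys/bus/pci/devices are symlinks; after resolving one,
    the parent directory is the device it hangs off.
    """
    gpu = Path(device_dir).resolve()
    return [gpu, gpu.parent]
```

For example, looping over `gpu_and_parent("/sys/bus/pci/devices/0000:02:00.0")` and printing `runtime_pm_state(str(d))` for each entry shows both devices side by side. This only reads sysfs and does not change the power state.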
I've run into another interesting problem: when I'm not plugged into the power adapter, X.org hangs when I have bbswitch loaded. I only discovered it now, because bbswitch didn't work at all before because I haven't had I've implemented the alternative switching method proposed by @dionorgua in Bumblebee-Project/Bumblebee#820 |
Any update? I ran into this problem recently. My entire disk is fucked up. I was brought to the emergency mode. fsck reported millions of errors and then rendered my laptop unbootable. |
If you don't use the NVIDIA blob, it's already fixed. Not sure about nvidia. You don't need to do anything except use a recent kernel. nouveau is able to suspend the NVIDIA graphics card (and resume it when needed). |
I made a fork of nvidia-xrun that automatically loads nouveau on exit to power down the GPU. This should be an OK workaround. |
I just installed bbswitch on the newly-released Thinkpad T440p.
Loading the bbswitch module and disabling the card works perfectly fine:
But enabling it again seems to mess up the PCI bus, crash the network adapters, and cause data corruption (files read from the disk contain random characters, filesystem errors, some files are missing, empty or filled with random data after a reboot).
There are no bbswitch errors, but shortly after entering the command, the syslog fills with various kernel messages related to internal devices no longer responding. The filesystem sometimes remounts as read-only, and the system becomes unusable and has to be reset.
At this point, the machine is unable to write files to the disk or a USB stick or communicate with the network, so I made some "screen shots" using my smartphone. I tried to redirect the syslog to another machine using the internal network as well as a USB WLAN adapter, but the data cuts off as soon as the graphics card is enabled.
Installing the other bumblebee components or the Nvidia/Nouveau drivers does not make any difference.