Memory/data corruption / crash on Lenovo T440p (GT 730M). #78

Open
leoluk opened this issue Dec 1, 2013 · 297 comments

@leoluk

leoluk commented Dec 1, 2013

I just installed bbswitch on the newly-released Thinkpad T440p.

Loading the bbswitch module and disabling the card works perfectly fine:

[  142.881587] bbswitch: version 0.7 
[  142.881593] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.VID_
[  142.881596] bbswitch: Found discrete VGA device 0000:02:00.0: \_SB_.PCI0.PEG_.VID_
[Package] (20130517/nsarguments-95)  
[  142.882097] bbswitch: detected an Optimus _DSM function
[  142.882106] pci 0000:02:00.0: enabling device (0004 -> 0007)
[  142.882127] bbswitch: Succesfully loaded. Discrete card 0000:02:00.0 is on
[  156.250136] bbswitch: disabling discrete graphics
[Package] (20130517/nsarguments-95)  
[  156.265409] thinkpad_acpi: EC reports that Thermal Table has changed
[  156.376985] pci 0000:02:00.0: power state changed by ACPI to D3cold

But enabling it again seems to mess up the PCI bus, crash the network adapters, and cause data corruption (files read from the disk contain random characters, filesystem errors, some files are missing, empty or filled with random data after a reboot).

[  160.406244] bbswitch: enabling discrete graphics
[  160.647323] pci 0000:02:00.0: power state changed by ACPI to D0 
[  160.647336] thinkpad_acpi: EC reports that Thermal Table has changed

There are no bbswitch errors, but shortly after entering the command, the syslog fills with various kernel messages related to internal devices no longer responding. The filesystem sometimes remounts as read-only, and the system becomes unusable and has to be reset.

At this point, the machine is unable to write files to the disk or a USB stick or communicate with the network, so I made some "screen shots" using my smartphone. I tried to redirect the syslog to another machine using the internal network as well as a USB WLAN adapter, but the data cuts off as soon as the graphics card is enabled.

Installing the other bumblebee components or the Nvidia/Nouveau drivers does not make any difference.

@leoluk
Author

leoluk commented Dec 2, 2013

Kernel version:

Linux 3.12.0-031200-generic #201311031935 SMP Mon Nov 4 00:36:54 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux 

OS: Linux Mint 16 with a mainline kernel. This also happens with the default (Ubuntu-patched) 3.11 kernel and on other distributions.

The system crashes at this line: https://github.com/Bumblebee-Project/bbswitch/blob/master/bbswitch.c#L289

If the bbswitch module is loaded during suspend, the system crashes right away (obviously, as bbswitch enables the card). If the module is unloaded, it crashes on resume (probably because the kernel sets the power state).

cat /proc/acpi/dump_info:

0000:00:00.0 060000 
0000:00:01.0 060400 \_SB_.PCI0.PEG0
0000:00:01.1 060400 \_SB_.PCI0.PEG_
0000:00:02.0 030000 \_SB_.PCI0.VID_
0000:00:03.0 040300 \_SB_.PCI0.B0D3
0000:00:14.0 0c0330 \_SB_.PCI0.XHCI
0000:00:16.0 078000 
0000:00:19.0 020000 \_SB_.PCI0.IGBE
0000:00:1a.0 0c0320 \_SB_.PCI0.EHC2
0000:00:1b.0 040300 \_SB_.PCI0.HDEF
0000:00:1c.0 060400 \_SB_.PCI0.EXP1
0000:00:1c.1 060400 \_SB_.PCI0.EXP2
0000:00:1d.0 0c0320 \_SB_.PCI0.EHC1
0000:00:1f.0 060100 \_SB_.PCI0.LPC_
0000:00:1f.2 010601 \_SB_.PCI0.SAT1
0000:00:1f.3 0c0500 \_SB_.PCI0.SMBU
0000:02:00.0 030000 \_SB_.PCI0.PEG_.VID_
0000:03:00.0 ff0000 
0000:04:00.0 028000 
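
Each row of the listing above pairs a PCI address with its class code and, where one exists, an ACPI handle. As a rough sketch (the function name `display_devices` is made up for illustration), the VGA-class entries, class code 0x0300xx, can be picked out like this:

```python
def display_devices(listing):
    """Return (pci_address, acpi_handle) for every VGA-class (0x0300xx) entry."""
    devices = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        address, class_code = fields[0], fields[1]
        # Entries without an ACPI handle have only two columns.
        handle = fields[2] if len(fields) > 2 else None
        if class_code.startswith("0300"):
            devices.append((address, handle))
    return devices


# A few rows copied from the listing above.
SAMPLE = r"""
0000:00:02.0 030000 \_SB_.PCI0.VID_
0000:00:03.0 040300 \_SB_.PCI0.B0D3
0000:02:00.0 030000 \_SB_.PCI0.PEG_.VID_
0000:03:00.0 ff0000
"""

for address, handle in display_devices(SAMPLE):
    print(address, handle)
```

On the sample rows this reports the integrated device at 0000:00:02.0 and the discrete one at 0000:02:00.0, the same two devices bbswitch names in its log.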

Launchpad gives me a timeout, here's the ACPI debug tarball:

http://media.leoluk.de/LENOVO-20AWS02A00.tar.gz

# echo "\_SB.PCI0.PEG.VID.ISOP" > /proc/acpi/call 
# cat /proc/acpi/call 
0xffffffff

@leoluk
Author

leoluk commented Dec 2, 2013

Interesting ACPI methods for \_SB.PCI0.PEG.VID. Calling PSOF 0 does not seem to switch it off, unfortunately.

For some reason, \WIN8 is 0x1 even if acpi_osi=Linux.

                    Method (_PS0, 0, NotSerialized)
                    {
                        If (LNot (VMSH))
                        {
                            GPON (0x00)
                        }
                    }

                    Method (_PS1, 0, NotSerialized)
                    {
                        Noop
                    }

                    Method (_PS2, 0, NotSerialized)
                    {
                        Noop
                    }

                    Method (_PS3, 0, NotSerialized)
                    {
                        If (LNot (VMSH))
                        {
                            GPOF (0x00)
                        }
                    }

                    Method (GPON, 1, NotSerialized)
                    {
                        If (ISOP ())
                        {
                            If (DGOS)
                            {
                                \VHYB (0x02, 0x00)
                                Sleep (0x64)
                                If (LEqual (ToInteger (Arg0), 0x00)) {}
                                \VHYB (0x00, 0x01)
                                Sleep (0x64)
                                \VHYB (0x02, 0x01)
                                Sleep (0x01)
                                \VHYB (0x08, 0x01)
                                Store (0x0A, Local0)
                                Store (0x32, Local1)
                                While (Local1)
                                {
                                    Sleep (Local0)
                                    If (\LCHK (0x01))
                                    {
                                        Break
                                    }

                                    Decrement (Local1)
                                }

                                \VHYB (0x08, 0x03)
                                \VHYB (0x04, 0x00)
                                \SWTT (0x01)
                                Store (Zero, DGOS)
                            }
                            Else
                            {
                                If (LAnd (LNotEqual (VSID, 0x220F17AA), LNotEqual (VSID, 0x221D17AA)))
                                {
                                    \VHYB (0x04, 0x00)
                                }
                            }

                            \VHYB (0x09, \_SB.PCI0.PEG.VID.HDAS)
                        }
                        Else
                        {
                            Store (0x220E17AA, VIDS)
                        }
                    }

                    Method (GPOF, 1, NotSerialized)
                    {
                        If (ISOP ())
                        {
                            If (LOr (VMSH, LEqual (\_SB.PCI0.PEG.VID.OMPR, 0x03)))
                            {
                                \SWTT (0x00)
                                \VHYB (0x08, 0x00)
                                Store (0x0A, Local0)
                                Store (0x32, Local1)
                                While (Local1)
                                {
                                    Sleep (Local0)
                                    If (\LCHK (0x00))
                                    {
                                        Break
                                    }

                                    Decrement (Local1)
                                }

                                \VHYB (0x08, 0x02)
                                \VHYB (0x02, 0x00)
                                Sleep (0x64)
                                \VHYB (0x00, 0x00)
                                If (LEqual (ToInteger (Arg0), 0x00)) {}
                                Store (One, DGOS)
                                Store (0x02, \_SB.PCI0.PEG.VID.OMPR)
                            }
                        }
                    }

                    Method (_STA, 0, NotSerialized)
                    {
                        Return (0x0F)
                    }

                    Method (_DSM, 4, NotSerialized)
                    {
                        If (\CMPB (Arg0, Buffer (0x10)
                                {
                                    /* 0000 */    0xF8, 0xD8, 0x86, 0xA4, 0xDA, 0x0B, 0x1B, 0x47,
                                    /* 0008 */    0xA7, 0x2B, 0x60, 0x42, 0xA6, 0xB5, 0xBE, 0xE0
                                }))
                        {
                            Return (NVOP (Arg0, Arg1, Arg2, Arg3))
                        }

                        If (\CMPB (Arg0, Buffer (0x10)
                                {
                                    /* 0000 */    0x01, 0x2D, 0x13, 0xA3, 0xDA, 0x8C, 0xBA, 0x49,
                                    /* 0008 */    0xA5, 0x2E, 0xBC, 0x9D, 0x46, 0xDF, 0x6B, 0x81
                                }))
                        {
                            Return (NVPS (Arg0, Arg1, Arg2, Arg3))
                        }

                        If (\WIN8)
                        {
                            If (\CMPB (Arg0, Buffer (0x10)
                                    {
                                        /* 0000 */    0x75, 0x0B, 0xA5, 0xD4, 0xC7, 0x65, 0xF7, 0x46,
                                        /* 0008 */    0xBF, 0xB7, 0x41, 0x51, 0x4C, 0xEA, 0x02, 0x44
                                    }))
                            {
                                Return (NBCI (Arg0, Arg1, Arg2, Arg3))
                            }
                        }

                        Return (Buffer (0x04)
                        {
                            0x01, 0x00, 0x00, 0x80
                        })
                    }
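
The `_DSM` method above compares its first argument against three 16-byte buffers. ACPI stores UUIDs in a mixed-endian byte order, so the buffers are easier to compare with other sources once converted to the usual text form. A small sketch using Python's standard `uuid` module; the dictionary keys are simply the names of the methods each branch calls in the ASL, and `dsm_uuid` is an illustrative helper name:

```python
import uuid

# Byte buffers copied from the _DSM method, keyed by the method each branch calls.
DSM_BUFFERS = {
    "NVOP": [0xF8, 0xD8, 0x86, 0xA4, 0xDA, 0x0B, 0x1B, 0x47,
             0xA7, 0x2B, 0x60, 0x42, 0xA6, 0xB5, 0xBE, 0xE0],
    "NVPS": [0x01, 0x2D, 0x13, 0xA3, 0xDA, 0x8C, 0xBA, 0x49,
             0xA5, 0x2E, 0xBC, 0x9D, 0x46, 0xDF, 0x6B, 0x81],
    "NBCI": [0x75, 0x0B, 0xA5, 0xD4, 0xC7, 0x65, 0xF7, 0x46,
             0xBF, 0xB7, 0x41, 0x51, 0x4C, 0xEA, 0x02, 0x44],
}


def dsm_uuid(raw):
    """Convert an ACPI UUID buffer (first three fields little-endian) to text."""
    return str(uuid.UUID(bytes_le=bytes(raw)))


for name, raw in DSM_BUFFERS.items():
    print(name, dsm_uuid(raw))
```

The first buffer is the one bbswitch and acpi_call pass in this thread; the other two are only reachable through the branches shown in the ASL.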

@Lekensteyn
Member

Linux always tries to report compatibility with Windows (such as \WIN8) because BIOS vendors write code that assumes anything else is broken or outdated. The symptoms you described sound like a power shortage. Do you observe similar problems when using the nouveau driver with dynamic power management enabled?

@leoluk
Author

leoluk commented Dec 2, 2013

No, I haven't installed either driver. How do I enable nouveau's dynamic power management?

@Lekensteyn
Member

Can you post your dmesg somewhere? Unless you blacklisted it, nouveau will get loaded (bumblebee does unload it before using bbswitch, so be sure to disable bumblebeed too). To enable dynamic PM, you can write to sysfs or use powertop to enable Runtime PM at tunables.
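
For the sysfs route, runtime PM for a PCI device is controlled through its `power/control` attribute, where `auto` allows the kernel to suspend the device and `on` keeps it powered. A minimal sketch, assuming the discrete card sits at 0000:02:00.0 as in the logs above; the helper names and the `sysfs_root` parameter are illustrative, and the write needs root:

```python
import os


def runtime_pm_control_path(address, sysfs_root="/sys"):
    """Path of the runtime PM control attribute for a PCI device."""
    return os.path.join(sysfs_root, "bus", "pci", "devices", address, "power", "control")


def set_runtime_pm(address, enabled=True, sysfs_root="/sys"):
    """Write 'auto' (runtime PM allowed) or 'on' (device kept powered). Needs root."""
    value = "auto" if enabled else "on"
    with open(runtime_pm_control_path(address, sysfs_root), "w") as control:
        control.write(value + "\n")
    return value


# Example (not executed here): set_runtime_pm("0000:02:00.0")
```

Given the symptoms reported in this issue, trying this from a live image rather than an existing installation is the safer choice.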

@leoluk
Author

leoluk commented Dec 2, 2013

This is what happened after loading the nouveau module:

[ 3144.325079] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Integer], ACPI requires [Package] (20130725/nsarguments-95)
[ 3144.325160] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[ 3144.325346] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[ 3144.325522] pci 0000:02:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
[ 3144.325524] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG_.VID_ handle
[ 3144.325546] nouveau 0000:02:00.0: enabling device (0004 -> 0007)
[ 3144.325660] [drm] hdmi device  not found 2 0 1
[ 3144.325767] nouveau E[  DEVICE][0000:02:00.0] unknown chipset, 0x108100a1
[ 3144.325769] nouveau E[     DRM] failed to create 0x80000080, -22
[ 3144.325856] nouveau: probe of 0000:02:00.0 failed with error -22

@leoluk
Author

leoluk commented Dec 2, 2013

I tried disabling/enabling the card after loading the nouveau module and enabling runtime PM, but it still crashed the system.

@leoluk
Author

leoluk commented Dec 2, 2013

Switching it off using acpi_call works fine (enabling still crashes the system):

# echo "\_SB.PCI0.PEG.VID._DSM {0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47,0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0} 0x100 0x1A {0x1,0x0,0x0,0x3}" > /proc/acpi/call ; cat /proc/acpi/call 
{0x59, 0x00, 0x00, 0x11}

# echo "\_SB.PCI0.PEG.VID._PS3" > /proc/acpi/call ; cat /proc/acpi/call 
0x2called
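
The `_DSM` request above is just a formatted string: the ACPI path, the 16-byte UUID buffer, a revision, a function index and an argument buffer. A sketch of how that string could be assembled (the function name `acpi_call_dsm` is made up; it only builds the text and does not write to /proc/acpi/call):

```python
# UUID buffer, revision, function and arguments copied from the command above.
OPTIMUS_DSM_UUID = [0xF8, 0xD8, 0x86, 0xA4, 0xDA, 0x0B, 0x1B, 0x47,
                    0xA7, 0x2B, 0x60, 0x42, 0xA6, 0xB5, 0xBE, 0xE0]


def acpi_call_dsm(handle, uuid_bytes, revision, function, args):
    """Format a _DSM invocation in the syntax used for /proc/acpi/call above."""
    uuid_part = ",".join(f"0x{b:02X}" for b in uuid_bytes)
    args_part = ",".join(f"0x{b:X}" for b in args)
    return f"{handle}._DSM {{{uuid_part}}} 0x{revision:X} 0x{function:X} {{{args_part}}}"


print(acpi_call_dsm(r"\_SB.PCI0.PEG.VID", OPTIMUS_DSM_UUID, 0x100, 0x1A, [0x1, 0x0, 0x0, 0x3]))
```

Keeping the string construction separate from the write makes it easy to inspect the request before it is sent to the firmware.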

@leoluk
Author

leoluk commented Dec 2, 2013

Is there any way to prevent the card from being enabled / keep the system from crashing after a resume?

@Lekensteyn
Member

acpi_call does not work well with resume. You could try commenting out some problematic parts in bbswitch such that s/r still works.

@leoluk
Author

leoluk commented Dec 5, 2013

I tried this, but it did not work. As soon as the _PS0 function is called, the system crashes. Commenting out the ACPI calls in bbswitch_on (or even the entire function), the PM handlers or even unloading the module before suspending the system did not help (it still crashes on resume). If I don't disable the PM handler, it crashes before it suspends.

I installed Windows and tried enabling/disabling the card, which worked fine, so apparently it's doing something different.

@xqms

xqms commented Dec 5, 2013

I'm also observing this on my T440p (with the newest BIOS 1.17) with Ubuntu 13.10. As soon as I do

echo ON > /proc/acpi/bbswitch

I get weird memory corruption issues. It's apparently not a driver issue (nouveau/nvidia) as the driver is not even loaded when I do the switch.

I already use

acpi_osi="!Windows 2012"
for other reasons (backlight control), but it does not help.

@xqms

xqms commented Dec 5, 2013

By the way, I get some ACPI warnings during the first module load:

[    5.100039] bbswitch: version 0.7
[    5.100042] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.VID_
[    5.100046] bbswitch: Found discrete VGA device 0000:02:00.0: \_SB_.PCI0.PEG_.VID_
[    5.100052] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[    5.100321] bbswitch: detected an Optimus _DSM function
[    5.100332] pci 0000:02:00.0: enabling device (0004 -> 0007)
[    5.100359] bbswitch: Succesfully loaded. Discrete card 0000:02:00.0 is on
[    5.101973] bbswitch: disabling discrete graphics
[    5.101980] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)

@Lekensteyn
Member

Those warnings are harmless, see ee0591b.

The memory corruption, etc. sound (as I said before) like issues related to power. @leoluk You stated that you have tried nouveau; in what way did you disable the nvidia card? Did you just enable runtime PM and then wait for it to kick in? Or did you write to the vgaswitcheroo file in debugfs?

@leoluk
Author

leoluk commented Dec 5, 2013

The /sys/kernel/debug/vgaswitcheroo directory was missing, so I just tried all the methods I already knew, just with nouveau loaded and runtime PM enabled. But I just realized that's probably not what you were thinking about.

@leoluk
Author

leoluk commented Dec 5, 2013

How to reliably reproduce the problem:

  • Boot from any sufficiently recent Linux live image (I used Linux Mint 16 x64, kernel 3.11.0-12-generic). Don't use your existing installation (if you have any), because you'd risk messing it up.

  • Open a terminal and run this:

    wget https://github.com/Bumblebee-Project/bbswitch/archive/master.zip; unzip master.zip; cd bbswitch-master; make; sudo make load; sudo tee /proc/acpi/bbswitch <<<OFF
    
  • Your kernel log (type dmesg) should show a message like this one:
    [ 4550.007526] bbswitch: Succesfully loaded. Discrete card 0000:02:00.0 is off

  • Now run sudo tee /proc/acpi/bbswitch <<<ON. If you just lost all network connectivity and the system gradually stops working, your device is affected as well.

Maybe this is useful for gathering more information (all T440p affected? only a small subset? just the latest BIOS version?).
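
When collecting reports from several machines, the card state can be read back from /proc/acpi/bbswitch before and after each step. A small sketch, assuming the file contains the PCI address followed by ON or OFF as described in the bbswitch README; the function name is illustrative:

```python
def parse_bbswitch_state(text):
    """Split a line such as '0000:02:00.0 OFF' into (pci_address, is_on)."""
    address, state = text.split()
    if state not in ("ON", "OFF"):
        raise ValueError(f"unexpected bbswitch state: {state!r}")
    return address, state == "ON"


# Example with the address used throughout this thread.
print(parse_bbswitch_state("0000:02:00.0 OFF\n"))
```

On a real system the input would come from reading /proc/acpi/bbswitch, which only exists while the module is loaded.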

@leoluk
Author

leoluk commented Dec 5, 2013

Partial success! I recompiled the mainline kernel with a modified DSDT table and commented out the entire \_SB.PCI0.PEG.VID.GPON method (which is called from _PS0 and the NVP3 power resource), preventing bbswitch, the kernel or anything else from powering on the card again. Powering down the card and subsequent suspend/resume is now working, Bumblebee obviously isn't.

The kernel understands that the power state change fails:

[ 408.603544] pci 0000:02:00.0: Refused to change power state, currently in D3

Now I'll just have to figure out why powering on the card is crashing the system. Any ideas? The Windows driver might use another mechanism to power down the card.
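
One detail of the GPON and GPOF methods quoted earlier may be relevant here: both poll \LCHK in a loop of 0x32 iterations with a 0x0A ms sleep each, then carry on with the next \VHYB call whether or not the check ever succeeded. What \LCHK actually tests is not visible in this excerpt, so the following is only a timing model of that loop, with made-up Python names:

```python
def poll_link(check, tries=0x32, sleep_ms=0x0A):
    """Model of the ASL While loop: return the milliseconds spent waiting.

    `check` stands in for \\LCHK; no real sleeping is done here.
    """
    waited = 0
    while tries:
        waited += sleep_ms          # Sleep (Local0)
        if check():                 # If (\LCHK (...)) { Break }
            break
        tries -= 1                  # Decrement (Local1)
    return waited


print(poll_link(lambda: True))   # check succeeds on the first poll
print(poll_link(lambda: False))  # check never succeeds
```

In this model the loop gives up after 50 polls, or 500 ms, and the ASL has no branch that handles that case differently from success.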

@rkaw92

rkaw92 commented Dec 9, 2013

Hi,
this also occurs on my T440p. NVIDIA 730m, newest BIOS (did not test on the previous BIOS - it made the trackpoint completely unusable so I let it go before I even installed Linux). Until finding this bug report, I was wondering where the memory corruption was coming from...

Let me know if I can provide any additional information. Keep in mind that I'm not a hardware expert (and DSDT hacking is a thing I've never touched).

@Lekensteyn
Member

I wonder if it has something to do with bbswitch reading from the PCI configuration space to determine whether a card is available or not.

@leoluk If vgaswitcheroo is not available, I still would like to know if nouveau runtime PM exposes the issues experienced here. Watch for the DSM warnings (hey, useful debugging tool now :-) ) to see when the card gets disabled.
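
Since each `_DSM` call produces one of the type-mismatch warnings, their timestamps show when something tried to switch the card. A sketch that pulls those timestamps out of a saved dmesg; the regular expression is written against the warning lines posted earlier in this thread, and the names are illustrative:

```python
import re

DSM_WARNING = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ACPI Warning: (?P<path>\S+)\._DSM: Argument #4 type mismatch"
)


def dsm_warning_times(dmesg_text):
    """Return the kernel timestamps (seconds) of all _DSM type-mismatch warnings."""
    return [float(m.group("ts")) for m in DSM_WARNING.finditer(dmesg_text)]


# Lines copied from the bbswitch load log posted earlier.
SAMPLE = r"""
[    5.100046] bbswitch: Found discrete VGA device 0000:02:00.0: \_SB_.PCI0.PEG_.VID_
[    5.100052] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
[    5.101973] bbswitch: disabling discrete graphics
[    5.101980] ACPI Warning: \_SB_.PCI0.PEG_.VID_._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20130725/nsarguments-95)
"""

print(dsm_warning_times(SAMPLE))
```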

@leoluk
Author

leoluk commented Dec 10, 2013

@Lekensteyn If that helps: the problem can be triggered using only acpi_call, which does not seem to do anything related to the PCI configuration space. I'm not familiar with nouveau's runtime PM/Optimus support, but what I gathered from reading through the source is that it should automatically disable the card if it's idle, right? So I just load the module, enable automatic runtime PM, and wait? How do I enable it afterwards? By using PRIME, or by disabling runtime PM again (setting PM to "on")?

@rkaw92 If you're adventurous and submit your machine information before (see below), you could try the modified kernel and check if it prevents the memory corruption.

https://github.com/Bumblebee-Project/bbswitch#reporting-bugs

@xqms

xqms commented Dec 11, 2013

I just tried out nouveau on vanilla kernel 3.13.0-rc3 as the GT730M is not supported by nouveau in the stable kernel.
Here is dmesg after it has loaded: https://gist.github.com/x-quadraht/7902666

Shortly after that the system crashed again with the known symptoms. I guess nouveau decided that the card was not in use and tried to disable it. After the crash I briefly saw two of the ACPI warnings, but dmesg was quickly flooded by the memory errors. I will try to capture that more precisely.

@xqms

xqms commented Dec 11, 2013

Here is a more complete log of the crash using nouveau: https://gist.github.com/x-quadraht/7902930

You can see virtuoso dying, as a first casualty of the memory corruption...

@rkaw92

rkaw92 commented Dec 11, 2013

Sorry, I haven't had the time to properly survey my system nor try the DSDT override.
As a temporary workaround, I have decided to install acpi_call, which seems to work, provided that you never really need to power on the NVIDIA GPU. Blacklisted nouveau and uninstalled NVIDIA proprietary drivers, too.

echo "\_SB.PCI0.PEG.VID._DSM {0xF8,0xD8,0x86,0xA4,0xDA,0x0B,0x1B,0x47,0xA7,0x2B,0x60,0x42,0xA6,0xB5,0xBE,0xE0} 0x100 0x1A {0x1,0x0,0x0,0x3}" >/proc/acpi/call
echo "\_SB.PCI0.PEG.VID.GPOF" >/proc/acpi/call

@leoluk
Author

leoluk commented Dec 12, 2013

@rkaw92 That's what I tried first (actually, I disabled it with bbswitch and then unloaded the module), but the kernel enabled the card after resuming from standby and it still crashed the system, which is why I patched my DSDT table. Could you try if standby works for you using the manual ACPI calls?

@rkaw92

rkaw92 commented Dec 12, 2013

Two observations:
A) Indeed, unloading the bbswitch module does not fix the problem - the kernel still (supposedly) attempts to re-enable the card at resume. I am not sure if the module gets re-loaded somehow (could not verify - dmesg, lsmod and friends got overwritten as soon as I resumed).
B) Using just acpi_call with the methods outlined above, I am able to suspend/resume without issues. Thus, it seems to be different from using bbswitch. This is my temporary solution, which I've been using since yesterday with full success (for a workaround) and no discernible side effects apart from complete NVIDIA disablement.

@leoluk
Author

leoluk commented Dec 12, 2013

Interesting observation, this means that the DSDT override is not even necessary (but useful if you want to make sure that the card stays disabled). Possible explanation: bbswitch calls pci_save_state before it disables the device. On suspend, the card is already disabled so the state is not saved again. On resume, the kernel restores the previously saved state and enables the device.

@leoluk
Author

leoluk commented Dec 23, 2013

I ended up returning my T440p for unrelated reasons (fan noise, broken wifi), so unfortunately I can no longer contribute to this bug report by debugging it.

A few suggestions:

  • apparently, there's someone on the Arch forums who has a T440p with an older BIOS revision where Bumblebee/bbswitch works - maybe compare the ACPI tables?
  • the AMLI debug extension for the Windows kernel debugger could be used to trace the methods called by the Nvidia driver (I can provide the checked acpi.sys for Windows 7 x64, if anyone is interested)

@jhnphm

jhnphm commented Dec 30, 2013

I'd be interested in taking a look at using the windows kernel debugger. I don't have any experience w/ Windows debugging though (although probably could figure it out from above documentation), and more problematically how to install the checked acpi.sys :|

@seanvk

seanvk commented Jan 3, 2014

I can confirm the same filesystem corruption with bumblebee. I have the same T440p. I first noticed it right after installing Manjaro Linux, which installs with bbswitch enabled by default. I then switched to Arch Linux, installed from scratch without bbswitch, and saw no issues. I then installed bumblebee/bbswitch and within a few minutes got filesystem corruption. I then installed Fedora, which does not include it in the default install either, and it is likewise stable.

@abbradar

abbradar commented Jan 4, 2014

Hello,
I have this problem, too, and I've managed to install the Windows debuggers, symbols, checked acpi.sys and whatever else is needed for ACPI debugging in Windows (what a pain...). I don't have any experience with this, though, and my blocker now is that the ACPI event dump is too big (many pages even for a second or two); when capturing, say, 30 seconds, WinDbg can't digest that much at all. Maybe someone more familiar with ACPI debugging could give some advice? Some filter on the output, maybe?

@dionorgua

Hi,

Running 4.8 for a while right now (without any patches). Everything seems to be OK. But I don't use NVIDIA GPU at all.
Tried to launch a few games using DRI_PRIME=1. And it works (card is resumed when needed and then goes to sleep once it isn't needed).

@kubik369

kubik369 commented Oct 14, 2016

Does the left side of the keyboard still get hot or is it okay when the
Nvidia GPU is turned off?

@dionorgua

It becomes a bit hot when I run CPU-intensive tasks. But in normal conditions (like browsing) it's just slightly warm.

@xqms

xqms commented Nov 3, 2016

I still see memory corruption using kernel 4.8.6 on T440p. Turning the card off works, but if I run lspci or start Xorg (which apparently also enumerates the card), lspci freezes and I get random failures in dmesg and/or filesystem corruption.

The last message from nouveau is "DRM: resuming kernel object tree..."

@dionorgua Are you using any special kernel cmdline parameters like acpi_osi=XYZ? Anything else that might be special in your setup?

I'm running BIOS version 1.17, which is pretty old. Should I update?

@Lekensteyn
Member

@xqms Are you using any special kernel parameters or modules? Can you post your full dmesg somewhere?

@xqms

xqms commented Nov 3, 2016

@Lekensteyn I have no custom kernel modules. I'm using the ubuntu mainline 4.8.6 kernel, which is pretty vanilla.

I will try to capture dmesg, although there is a bit of luck involved whether the system stays alive long enough to save something.

@xqms

xqms commented Nov 3, 2016

dmesg is here: https://gist.github.com/xqms/9c5f35509a4ea9e2d9a6be9bb55c50b7

This time the system froze completely without further evidence for memory corruption.

@dionorgua

My parameters from /proc/cmdline:
BOOT_IMAGE=/vmlinuz-4.8.5+ root=/dev/mapper/debpad2-root ro elevator=deadline init=/lib/sysvinit/init intel_iommu=on init=/bin/systemd i915.enable_psr=0

I don't think there is anything really special here. i915.enable_psr=0 is needed to avoid display flickering when an external monitor is connected.

As for the BIOS, I currently use 2.37. I think it's better to update, because there were 'Win8/Win10 support' entries in the changelog that are probably related to Optimus. The old BIOS is probably the root cause of the memory corruption for you. There are also some differences in dmesg. Yours:

[   11.580024] ACPI Warning: \_SB.PCI0.PEG.VID._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[   11.580226] ACPI Warning: \_SB.PCI0.PEG.VID._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[   11.580489] pci 0000:02:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
[ 11.580492] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG_.VID_ handle

Mine:

Oct 27 18:56:17 debpad kernel: [  140.650174] ACPI Warning: \_SB.PCI0.PEG.VID._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
Oct 27 18:56:17 debpad kernel: [  140.650498] ACPI Warning: \_SB.PCI0.PEG.VID._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
Oct 27 18:56:17 debpad kernel: [  140.650805] pci 0000:02:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
Oct 27 18:56:17 debpad kernel: [  140.650809] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG_.VID_ handle
Oct 27 18:56:17 debpad kernel: [  140.650809] nouveau: detected PR support, will not use DSM

So you don't have the message about PR support and 'will not use DSM'.
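
To compare more logs the same way, the check boils down to looking for that one nouveau line. A minimal sketch using the two excerpts quoted above (the function and variable names are illustrative):

```python
PR_MESSAGE = "nouveau: detected PR support, will not use DSM"


def uses_power_resources(dmesg_text):
    """True if nouveau reported that it will use PR support instead of the DSM."""
    return any(PR_MESSAGE in line for line in dmesg_text.splitlines())


# Excerpts copied from the two logs quoted above.
OLD_BIOS_LOG = r"""
[   11.580489] pci 0000:02:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
[ 11.580492] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG_.VID_ handle
"""

NEW_BIOS_LOG = r"""
Oct 27 18:56:17 debpad kernel: [  140.650809] VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG_.VID_ handle
Oct 27 18:56:17 debpad kernel: [  140.650809] nouveau: detected PR support, will not use DSM
"""

print(uses_power_resources(OLD_BIOS_LOG), uses_power_resources(NEW_BIOS_LOG))
```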

But in any case I just want to warn you that once updated, there will be no way to downgrade back.

@Lekensteyn
Member

Lekensteyn commented Nov 3, 2016

@dionorgua Could you upload your acpidump? Alternatively, upload acpidump+dmidecode+lspci following the instructions at https://bugs.launchpad.net/lpbugreporter/+bug/752542

In a slightly older tarball for this laptop (LENOVO-20AWS02A00, BIOS GLET41WW (1.16), 10/27/2013), there seems to be some code for Windows 8 compatibility, but I cannot judge whether it is good or bad. Having something to compare would be nice. Also note that this version differs by one compared to @xqms's dmesg (so maybe you could also upload your acpidump?)

As for the _PR3 vs _DSM message, upgrading BIOS could indeed help.

@xqms

xqms commented Nov 4, 2016

I uploaded my data here: https://bugs.launchpad.net/lpbugreporter/+bug/752542/comments/801

I'll try the BIOS update next.

@xqms

xqms commented Nov 4, 2016

Yep, the BIOS update did the trick. lspci no longer freezes the system and so far I had none of the memory corruption symptoms.

acpidump from the new BIOS: https://bugs.launchpad.net/lpbugreporter/+bug/752542/comments/802

@dionorgua

It's from BIOS 2.37.
LENOVO-20ANCTO1WW.tar.gz

@dionorgua

@xqms you've updated to 2.39, everything else is OK?

@xqms

xqms commented Nov 4, 2016

@dionorgua Yes, so far everything works. I haven't done anything fancy with the nvidia GPU yet, though.

@Lekensteyn
Member

Hm, BIOS 1.17 has a date before 2015 (11/14/2013), so that is why the new PR3 functionality in kernel 4.8 was not activated. Between 1.17 and 2.39 there are no significant changes in the ACPI tables that could have an influence on this.
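
The date condition can be checked from user space, since the BIOS date is exposed in /sys/class/dmi/id/bios_date in MM/DD/YYYY form. A rough sketch of the comparison described above; the 2015 cutoff is taken from this comment, and the function names are made up for illustration:

```python
def bios_year(bios_date):
    """Extract the year from a DMI BIOS date such as '11/14/2013'."""
    month, day, year = bios_date.strip().split("/")
    return int(year)


def pr3_expected(bios_date, cutoff_year=2015):
    """True if the BIOS date is not before the cutoff year."""
    return bios_year(bios_date) >= cutoff_year


# Dates mentioned in this thread: BIOS 1.17 and BIOS 1.16.
print(pr3_expected("11/14/2013"), pr3_expected("10/27/2013"))
```

Both dates quoted in this thread fall before the cutoff, which matches the missing 'detected PR support' message in the BIOS 1.17 log.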

@abbradar

abbradar commented Nov 20, 2016

@Lekensteyn,

@dionorgua on the latest laptops, bbswitch is not necessarily unsafe but it might not fully turn your graphics card off.

Do I understand correctly that currently there's no way to have both proper PM (fully turned off card) and nvidia proprietary drivers at all? I'm not complaining, it's just that there are several threads with discussions mostly about nouveau going on, and it's difficult to catch up on the progress.

P.S. That's on T440p with new BIOSes.

@dionorgua

Probably not. As far as I understand, currently (with the latest kernel and DRI_PRIME) nouveau is responsible for power management of the NVIDIA card.

It may be possible to get power management plus the NVIDIA driver with a dedicated X server, using some scripts that run rmmod nvidia && modprobe nouveau once the external GPU is unused. But I don't know whether that is implemented anywhere.

@abbradar

Yeah, I've thought about something along those lines if bbswitch is not a full-featured option currently. Thanks for the confirmation!

@Lekensteyn
Member

If you are feeling adventurous, there is a pm-rework branch which you can try. The vgaswitcheroo addition is however not stable, it oopsed somewhere. That is, try the branch at 5c7b3f5

@abbradar

I'd be glad to. Do you want any specific tests or just feedback on whether it works?


@Lekensteyn
Member

Lekensteyn commented Nov 20, 2016

@abbradar Some people have reported that it works for them, maybe it is also good enough for you. YMMV

As for the test: OFF/ON/OFF/ON, system sleep/resume.

@abbradar

abbradar commented Nov 20, 2016

Lenovo T440p, BIOS 2.39. Disconnected external displays, cold boot, bumblebeed disabled.

[    5.059683] bbswitch: version 0.8
[    5.059687] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.VID_
[    5.059691] bbswitch: Found discrete VGA device 0000:02:00.0: \_SB_.PCI0.PEG_.VID_
[    5.059698] ACPI Warning: \_SB.PCI0.PEG.VID._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[    5.059961] bbswitch: detected an Optimus _DSM function
[    5.059964] bbswitch: Succesfully loaded. Discrete card 0000:02:00.0 is on

echo OFF > /proc/acpi/bbswitch

[   61.705104] vga_switcheroo: enabled
[   61.705112] bbswitch: disabling discrete graphics
[   61.705120] ACPI Warning: \_SB.PCI0.PEG.VID._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
[   61.716917] bbswitch 0000:02:00.0: Refused to change power state, currently in D0

So, no luck for me it seems. If you want any additional testing, want to try out a patch on this hardware etc. feel free to ping! And thanks for the try, anyway ^_^

@Lekensteyn
Member

pci_set_power_state(pdev, PCI_D3hot) should probably become pci_set_power_state(pdev, PCI_D3cold). Should not make much of a difference except that maybe the "Refused to change power state" warning goes away.

Normally this should show auto/suspended for both the Nvidia and parent device:

grep . /sys/bus/pci/devices/0000:02:00.0/{../,}power/{control,runtime_status}

For this to work, you must not have the pcie_port_pm=off option in your kernel command.
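
The same check can be scripted. A sketch that builds the four paths the grep command expands to (parent bridge first, then the card) and tests a kernel command line for the option mentioned above; the helper names and the `sysfs_root` parameter are illustrative:

```python
import os


def runtime_pm_paths(address, sysfs_root="/sys"):
    """Paths of power/control and power/runtime_status for the parent and the device.

    The '..' is kept unnormalized on purpose: the sysfs entry is a symlink, and
    the kernel resolves '..' relative to the link target (the parent bridge).
    """
    device = os.path.join(sysfs_root, "bus", "pci", "devices", address)
    return [
        os.path.join(base, "power", attr)
        for base in (os.path.join(device, ".."), device)
        for attr in ("control", "runtime_status")
    ]


def read_runtime_pm(address, sysfs_root="/sys"):
    """Read all four attributes; expected values are 'auto' and 'suspended'."""
    report = {}
    for path in runtime_pm_paths(address, sysfs_root):
        with open(path) as attribute:
            report[path] = attribute.read().strip()
    return report


def port_pm_disabled(cmdline):
    """True if pcie_port_pm=off appears on the given kernel command line."""
    return "pcie_port_pm=off" in cmdline.split()


# Example (not executed here): read_runtime_pm("0000:02:00.0")
```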

@abbradar

I've run into another interesting problem: when I'm not plugged into the power adapter, X.org hangs when I have bbswitch loaded. I have discovered it only now. bbswitch didn't work before at all because I hadn't blacklisted the nvidia-drm module, so it got loaded and bbswitch couldn't try to disable the card. I haven't yet tried the pm-rework branch w.r.t. this problem; I'll do this later.

I've implemented alternative switching method proposed by @dionorgua in Bumblebee-Project/Bumblebee#820

@y-usuzumi

Any update?

I ran into this problem recently. My entire disk is fucked up. I was brought to the emergency mode. fsck reported millions of errors and then rendered my laptop unbootable.

@dionorgua

If you don't use the NVIDIA blob, it's already fixed. Not sure about nvidia. You don't need to do anything except use a recent kernel: nouveau is able to suspend the NVIDIA graphics (and resume them when needed).

@xobust

xobust commented Aug 17, 2017

I made a fork of nvidia-xrun that automatically loads nouveau on exit to power down the GPU. This should be an OK workaround.
