Heavy system instability after suspend - cause possibly identified #3359

Closed
Aekez opened this Issue Nov 30, 2017 · 3 comments

Aekez commented Nov 30, 2017

Qubes OS version:

Qubes 4 RC-2

Affected TemplateVMs:

sys-net


Steps to reproduce the behavior:

This issue is triggered by following the Qubes guide on reloading driver modules to regain internet after suspend / hibernation. Specifically, it is the last section of the guide, which automates the reload by blacklisting the drivers during suspend / hibernation, found here:
https://www.qubes-os.org/doc/wireless-troubleshooting/
Doing this manually is not a problem; it's the automatic process in the blacklist configuration file that is the root of the problem. After removing the driver module blacklisting, the system works flawlessly again, except, of course, for the driver issue the blacklist originally solves.

Driver modules are iwlwifi and iwlmvm, same as the ones in the guide, on this particular machine.
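
For readers unfamiliar with the automatic setup, here is a minimal sketch of how such a blacklist file could be written inside sys-net. The default path `/rw/config/suspend-module-blacklist` and the one-module-per-line layout are assumptions based on the linked guide and should be checked there; the `write_blacklist` helper and the `BLACKLIST_FILE` variable are hypothetical names used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: write module names, one per line, to a blacklist file.
# ASSUMPTION: the default path below follows the linked Qubes guide; verify it
# against the guide before relying on it.
BLACKLIST_FILE="${BLACKLIST_FILE:-/rw/config/suspend-module-blacklist}"

write_blacklist() {
    # $1: target file; remaining arguments: module names to list
    target="$1"
    shift
    printf '%s\n' "$@" > "$target"
}

# Example (inside sys-net, as root), using the modules named in this report:
#   write_blacklist "$BLACKLIST_FILE" iwlmvm iwlwifi
```

Deleting the file (or emptying it) would correspond to "removing the driver module blacklisting" as described above.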

Unfortunately I have not found time to try this out in the new release, Qubes 4 RC-3, but if it is not resolved by then, I'll update this thread when I get around to trying Qubes 4 RC-3.

I can reproduce this issue 100% of the time simply by putting these drivers in the automatic blacklist and then suspending or hibernating.

The system is otherwise completely stable. The issue occurs only with the above steps, and only after a hibernate or suspend.

Expected behavior:

  • Working graphic / interface stability after returning from suspend or hibernation.
  • No random kernel panics during resume from suspend or hibernate.
  • Visible VM graphics, instead of only dom0 interface.

Actual behavior:

The interface freezes, or the graphical server appears to collapse and quickly becomes unstable, after returning from suspend or hibernation while using the automatic driver module blacklist solution from the Qubes guide. It appears to be purely a graphical collapse; everything else seems to still work, it's just not visible on the screen. For example, I can still run qubes-dom0-update, so sys-net and sys-firewall are still working, but all VM graphics are either frozen or gone (depending on the scenario), including network widgets and other menu icons originating from within any VM. I assume the other AppVMs still work too, though not visibly on the screen. The exceptions are when the entire chain of user processes collapses and kicks me back to the login screen, or when a kernel panic causes a reboot.

How quickly it happens varies, but it's usually soon after returning from suspend or hibernation. Often only the graphical element I was interacting with keeps working, while the rest of the interface freezes. Nothing works except the graphical "action" I was doing at the moment of the freeze. See below for examples of the most common symptoms. The less common symptoms are not freezes, but appear to be related to the same root cause.

Most common symptoms after wake

  • If I was scrolling a window, I can still scroll up and down, but nothing else works.
  • If I was opening the Qubes widget, then only that works, and the rest is frozen.
  • If I was in the Qubes menu, then nothing else works except the Qubes menu.

Less common symptoms, which still happen frequently enough (1-2 times a day with some 5-6 suspends).

  • The system does not freeze, but while all VMs are still running, none of them appear graphically. Only dom0 appears. All VM windows and widgets, such as networking, are gone, and no amount of VM restarting fixes it; only a full Qubes system restart does.
  • Sometimes the "system does not freeze" case above is followed by a delayed freeze.
  • By all appearances, a kernel panic sometimes happens during the hibernation / suspend process, so that when waking up, you'll see the disk decryption prompt from a reboot, rather than the usual screensaver login screen. Using the workarounds (see general notes), I never encounter kernel panics, and the system works flawlessly.

It's 100% reproducible on this particular machine: if a common symptom does not happen, then a less common one does. It's always one of the two types of scenarios.

General notes:

  • This issue appears to have started about 10 days before the Qubes 4 RC-3 release. I believe the cause may have been some of the Python code that fixed some of the VM issues back then. This was in the testing repository.

  • The wireless driver guide, linked above, worked flawlessly in Qubes 3.2 and in Qubes 4 RC-2 as well, until the above update came around.

  • Undoing the automatic driver blacklist fixes everything above, but leaves the user without internet after suspend / hibernation.

  • I found a workaround: a bash file in sys-net that runs rmmod and modprobe on the drivers in question, which I trigger with a keybinding for "qvm-run sys-net 'bash wifi-resume' " set up with the dom0 xfce4 keyboard tools. Doing a manual, keybinding-triggered driver reset in sys-net causes no issues; it's entirely the automatic blacklisting that causes the issue.

  • Another workaround I found was to make sure sys-net is not running during suspend / hibernate. Any other VM can run just fine; as long as sys-net is not running, the above issue does not happen.

  • I'll be happy to provide whatever information, logs or otherwise, would help solve this. I'm content with my manual workaround and am not seeking a solution as such, but I'll be glad to help with whatever I can.
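
The bash-file workaround mentioned in the notes above could look roughly like the sketch below. The actual contents of `wifi-resume` were not posted, so this is an illustration under assumptions: the function names, the `--apply` flag, and the default of only printing the planned commands are all hypothetical. The module names are the ones from this report; since iwlmvm depends on iwlwifi, it is unloaded first and loaded last.

```shell
#!/bin/bash
# wifi-resume (hypothetical layout): reload the Wi-Fi driver modules inside
# sys-net after resume. Module names are taken from the report above.

# iwlmvm depends on iwlwifi, so unload it first and load it last.
UNLOAD_ORDER="iwlmvm iwlwifi"
LOAD_ORDER="iwlwifi iwlmvm"

# Print the commands a reload would run, one per line, without running them.
plan_reload() {
    for m in $UNLOAD_ORDER; do echo "rmmod $m"; done
    for m in $LOAD_ORDER; do echo "modprobe $m"; done
}

# Run the reload with sudo. Failures are reported but do not stop the script,
# so an already-unloaded module does not block the modprobe steps.
apply_reload() {
    for m in $UNLOAD_ORDER; do
        sudo rmmod "$m" || echo "warning: rmmod $m failed" >&2
    done
    for m in $LOAD_ORDER; do
        sudo modprobe "$m" || echo "warning: modprobe $m failed" >&2
    done
}

# ASSUMPTION: only touch the kernel when --apply is given; otherwise just
# print the plan. This guard is not part of the original workaround.
case "${1:-}" in
    --apply) apply_reload ;;
    *)       plan_reload ;;
esac
```

With this layout, the dom0 keybinding from the report would become `qvm-run sys-net 'bash wifi-resume --apply'`, while running the script without arguments only lists the four commands it would execute.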


Related issues:

None that I can think of; I have never seen this issue before. It seems unique, and possibly related to the Python code update in the testing repository some 10 days before the Qubes 4 RC-3 release.

Aekez commented Nov 30, 2017

I did a full test yesterday with 20 suspends, and there was no issue.
Today, it happened again, even though I had removed the blacklist module fix. The only difference is that this time I moved the laptop around physically in my bag.

Apparently the blacklisting of these modules does not cause the issue, but it certainly does aggravate it.
Now I only get it sometimes (1 out of 20 times so far), instead of all the time.

Though it seems a little different now: it only happens when I move my laptop in my bag, but not when I test it on the table.
Could a screen sleep sensor be causing the conflict, perhaps? It's a laptop / tablet hybrid, and it has a suspend sensor next to the camera. Maybe this could explain why it seems to happen only when moving now. But I wouldn't know; at the very least it doesn't appear to be a loose connection in the hardware.

  • Removing the blacklist, as explained in the primary post above, mostly fixes it.
  • Suspending without closing the laptop lid and without moving it around seems to keep it stable too. But it's too early to tell if this is a factor or not.

Aekez commented Jan 17, 2018

For the last 10 days (suspending multiple times daily), Qubes 4 RC-2, fully updated (current testing), has been stable on this one system with regard to all the issues above.

I'm not entirely sure which day or update fixed it, but it was probably further back than the last 10 days. I can only speak for my own system setup, but as of this time, the above is no longer an issue.

I'm also able to use the wi-fi driver fix "Automatically reloading drivers on suspend/resume" @ https://www.qubes-os.org/doc/wireless-troubleshooting/ again, without issues.

Also, I'm not sure what happened with the battery issue during suspend, but it's pretty good now. It can last over the whole weekend in suspend mode. All in all, it just seems to work smoothly now for this laptop. For reference, this laptop/tablet is an Asus T300 Chi.

Member

andrewdavidwong commented Jan 18, 2018

Closing this for now as it appears that the issue no longer affects any Qubes users. If you believe this is a mistake, or if anyone is still affected by this issue, please leave a comment, and we'll be happy to reopen this. Thank you.