broken SSL connections after dom0 upgrade #2840
marmarek
Jun 2, 2017
Member
Booting with an older dom0 kernel didn't help. Booting with an older VM kernel setting in QVMM did not help either.
Have you tried that for all VMs on the network path too?
Similar report: QubesOS/updates-status#55 (comment)
adrelanos referenced this issue in QubesOS/updates-status on Jun 2, 2017: linux-kernel v4.9.29-17 (r3.2) #55 (Closed)
Just now tested. Yes, same issues in AppVM -> sys-firewall -> sys-net.
MarioGeckler
Jun 2, 2017
I have had the same problem since the last reboot (just a few minutes ago). The problem exists in all AppVMs.
The problem exists even when browsing in sys-firewall, but seems to be ok when browsing directly in sys-net.
adrelanos
Jun 2, 2017
Member
but seems to be ok when browsing directly in sys-net.
Some more browsing and/or browser restarts and/or VM restarts will also trigger it in that VM?
rtiangha
Jun 2, 2017
Need some more data to help rule things out:
- Templates behind sys-net, sys-firewall and affected AppVM(s)
- Kernel version(s) where this is happening, and kernel version(s) where this doesn't happen
- NIC hardware (to help rule out driver regressions)
This is also similar to this one: #2778
adrelanos
Jun 2, 2017
Member
I just managed to break my Qubes installation on USB using R3.2. Using only the dom0 stable repository worked. After enabling qubes-dom0-testing and running qubes-dom0-update from it, this issue happens.
Templates behind sys-net, sys-firewall and affected AppVM(s)
sys-net / sys-firewall based on fedora-24 Template (also happening with fedora-24).
Happening to AppVM's based on Debian / Fedora / Whonix templates.
Kernel version(s) where this is happening, and kernel version(s) where this doesn't happen
Worked with 4.4.67-12 (current version in qubes dom0 stable repository). After the upgrade, that kernel gets somehow uninstalled?
However, after upgrading from qubes-dom0-testing, switching back to kernel 4.4.55-11 and 4.4.31-11 in grub boot menu does not fix the issue either.
NIC hardware (to help rule out driver regressions)
Happening with two different computers. One has Intel network chips (wired and wifi), the other has Qualcomm (wired and wifi).
This is also similar to this one: #2778
Yes. (I referenced this above.)
rtiangha
Jun 2, 2017
Thanks. Sorry about that; there's like 3 tickets now that I'm trying to follow with similar behaviors.
I'm curious: Did you upgrade everything in qubes-dom0-testing, or just the kernel? The interesting part to me is that going back to 4.4.55 still exhibits the behavior, which may imply it's actually not the kernel or just the kernel alone if you did upgrade more from testing than just the kernel.
adrelanos
Jun 2, 2017
Member
Right.
Thanks. I'm curious: Did you upgrade everything in qubes-dom0-testing, or just the kernel?
Everything. (Just a "dumb" run of "sudo qubes-dom0-update".)
adrelanos
Jun 2, 2017
Member
I forgot to keep the dom0 stable -> dom0 testing upgrade log. Do you know which packages / versions changed?
Do we have such a diff easily accessible? @marmarek
marmarek
Jun 2, 2017
Member
/var/log/dnf.rpm.log
Regarding kernel version - I think the important part is VM kernel version in all VMs on network path. Try changing VM kernel in sys-net, sys-firewall etc.
rtiangha
Jun 2, 2017
Also, just as a sanity check, (once you've verified changing the vm kernel), try changing sys-net and sys-firewall to debian-8 if you can. The other report said they were running Fedora 24 too and so I'm wondering if the distro behind the vm makes a difference, just to rule it out.
adrelanos
Jun 2, 2017
Member
This upgrade broke it.
kernel                                x86_64  1000:4.9.29-17.pvops.qubes  qubes-dom0-cached   40 M
kernel-qubes-vm                       x86_64  1000:4.9.29-17.pvops.qubes  qubes-dom0-cached   45 M
Upgrading:
qubes-core-dom0-linux                 x86_64  3.2.14-1.fc23               qubes-dom0-cached  130 k
qubes-core-dom0-linux-kernel-install  x86_64  3.2.14-1.fc23               qubes-dom0-cached  8.1 k
qubes-db                              x86_64  3.2.4-1.fc23                qubes-dom0-cached   23 k
qubes-db-dom0                         x86_64  3.2.4-1.fc23                qubes-dom0-cached  6.5 k
qubes-db-libs                         x86_64  3.2.4-1.fc23                qubes-dom0-cached   17 k
qubes-gpg-split-dom0                  x86_64  2.0.25-1.fc23               qubes-dom0-cached   15 k
qubes-input-proxy                     x86_64  1.0.9-1.fc23                qubes-dom0-cached   24 k
qubes-manager                         x86_64  3.2.12-1.fc23               qubes-dom0-cached  1.8 M
MarioGeckler
Jun 2, 2017
I have debian-8 in my VMs and have the errors.
rtiangha
Jun 2, 2017
Thanks. That helps.
ij1
Jun 2, 2017
@rtiangha I'll continue discussions here. I'm not fully in current-testing, only selectively.
4.4.67-12 in all VMs works for me, 4.9.29-17 does not.
I've only the kernel and qubes-manager from adrelanos' list (but I believe those are unrelated anyway). And I've pretty much ruled out the effect of dom0 kernel version.
According to my testing, only VM kernel versions matter (like marmarek also noted). I've tested earlier (before even reporting this initially) with various combos of old/new VM kernel versions, and all of the combos seemed to fail with 4.9 present in any VM. I don't know why sys-net on 4.9 + others on 4.4 failed; perhaps that implies that the main problem might not be in the tx path itself but in the inter-VM packet transfer, which gets blocked(?) for some reason.
rtiangha
Jun 2, 2017
Thanks. I'll start digging into the kernel config then to see what might be causing this. Are you comfortable with compiling Qubes kernels? If so, I'll make a public testing branch for this for anyone willing to help test things; might be faster than asking @marmarek to continue uploading a bunch of different kernels to current-testing.
marmarek
Jun 2, 2017
Member
I can confirm this issue after changing just final AppVM kernel to 4.9.29-17. Other VMs on the network path are on 4.4.67-12.
And, it works fine on 4.9.28-16. So, far fewer changes to validate :)
rtiangha
Jun 2, 2017
Well, there are only really two major changes between those two (outside of additional driver support for hardware drivers that weren't present in 4.4). The increase in bits in regards to ASLR (taken from the Hardened Kernel Community Project's work with 4.11):
-CONFIG_ARCH_MMAP_RND_BITS=28
+CONFIG_ARCH_MMAP_RND_BITS=32
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
-CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
+CONFIG_ARCH_MMAP_RND_COMPAT_BITS=16
and enabling slub debugging by default (an attempt to try and access some of the security features included in that functionality without having to force a user to modify their kernel command line options for every VM):
-# CONFIG_SLUB_DEBUG_ON is not set
+CONFIG_SLUB_DEBUG_ON=y
I can't look at this until late tonight when I get home to my dev machine, but off the top of my head, I'm leaning towards CONFIG_SLUB_DEBUG_ON since it might be too aggressive in the things that it's doing (it does the full suite of slub debug stuff) and it might explain why the issue manifests intermittently (really, all we want is the free poisoning feature which one can toggle by adding slub_debug=P on the kernel command line, and I think that alone works fine). So reverting that might help and would be the first thing I would try. Otherwise, it'd be the ASLR stuff, which would be unfortunate (and weird but it is an advanced setting and so maybe it might not have gotten as much testing upstream as it might have if it were a normal option to toggle).
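The comparison above (which options changed between the two kernel configs) can be sketched as a small script. This is only an illustration, not part of the Qubes build tooling: the function names `parse_kernel_config` and `diff_kernel_configs` are made up for this example, and the two sample configs contain just the lines quoted in this comment.

```python
def parse_kernel_config(text):
    """Map option name -> value; '# CONFIG_X is not set' is stored as None."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            name = line[2:-len(" is not set")]
            options[name] = None
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kernel_configs(old_text, new_text):
    """Return {option: (old_value, new_value)} for options whose value differs.

    An option that is absent and one that is 'not set' both count as None.
    """
    old = parse_kernel_config(old_text)
    new = parse_kernel_config(new_text)
    changed = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changed[name] = (old.get(name), new.get(name))
    return changed


# Sample input: only the config lines quoted in the comment above.
OLD_CONFIG = """\
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
# CONFIG_SLUB_DEBUG_ON is not set
"""
NEW_CONFIG = """\
CONFIG_ARCH_MMAP_RND_BITS=32
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=16
CONFIG_SLUB_DEBUG_ON=y
"""

if __name__ == "__main__":
    for name, (before, after) in diff_kernel_configs(OLD_CONFIG, NEW_CONFIG).items():
        print(f"{name}: {before} -> {after}")
```

Running the same comparison on the full config files of a working and a failing kernel would list every candidate option, which keeps the set of things to bisect small.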
rtiangha
Jun 2, 2017
UNLESS this is only happening on Intel wifi cards, in which case, it could also be the enablement of broadcast filtering (to try and guard against broadcast floods):
-# CONFIG_IWLWIFI_BCAST_FILTERING is not set
+CONFIG_IWLWIFI_BCAST_FILTERING=y
So for those who are experiencing this issue, please chime in with your NICs and what chipset they're using. However, I think someone said they were using a Qualcomm based card as well as a USB NIC and I'm not sure if Intel makes USB wifi devices, so it's probably not this option. But more data points are always useful.
Intel, but wired one (e1000e).
Also, I haven't changed kernel in netvm - that is still running 4.4.67.
rtiangha
Jun 2, 2017
That's good to know.
Well, I don't have access to Qubes on the machine I'm on right now, but I do have access to a linux vm and git.
So for anyone who has the capability to compile and test a kernel, I've uploaded a new config that disables only CONFIG_SLUB_DEBUG_ON in order to narrow this down a bit:
https://github.com/rtiangha/qubes-linux-kernel/tree/stable-4.9
If anyone does try this, please report back the results. Otherwise, I'll test in depth when I get home later tonight.
ij1
Jun 2, 2017
My build for a kernel without the SLUB_DEBUG_ON or kASLR changes is about to complete.
rtiangha
Jun 2, 2017
Cool; looking forward to the results @ij1. Just curious: Did you disable the CONFIG_SLUB_DEBUG option entirely, or just the CONFIG_SLUB_DEBUG_ON option? I believe CONFIG_SLUB_DEBUG has always been enabled ever since the early days, so leaving that on should be fine. It's just enabling the whole suite of slub debug options by default that could be causing the issues here.
ij1
Jun 2, 2017
So far everything seems good but I'll report back if the issue still reappears.
...Hmm, it might be worth testing also with slub_debug=- added to the VM kernels cmdline for 4.9.29-17 to see if that's enough to get rid of the issue.
rtiangha
Jun 2, 2017
That'd definitely be easier than compiling another kernel to see if it's the kASLR stuff or the SLUB DEBUG stuff that's causing it, and then we'd know for sure which of the two it is. Plus, it's something that an affected user can do right now (assuming it does override the compiled-in option). Hopefully, it's not the combination of the two.
ij1
Jun 2, 2017
I was wondering whether there could be something lurking in the Qubes-related patches that gets uncovered by the SLUB debug. It would explain why the upstream kernel does not seem to have any reports of similar issues (at least I didn't find anything in a quick search, or maybe almost nobody runs with slub debug active). There are at least some netfront/back related changes that might explain why it's the network that fails.
rtiangha
Jun 2, 2017
That's an interesting thought. It's before my time, so I don't know what the origins are, or whether the Qubes project modified or uses only a subset of the XSA 155 security patches that were released back in the day. It could also be that updated versions of those patches, released by Oracle, are available and might address these SSL issues. Checking the Xen mailing list, the latest version of the XSA 155 patches is 6 (the only difference between it and 5, which was the first public release, is an updated CREDITS section), but I don't know what version is included in the Qubes repository; for example, is it a version prior to 5 that was released internally or to a select group of developers before public disclosure? Personally, I've always been surprised that they are still patched in and were never merged upstream (XSA 157 was). I've never touched them nor looked into it much though; not my area of expertise to make those kinds of calls, especially if there are reasons for any differences.
rtiangha
Jun 2, 2017
OK, so I got back and rather than compiling test kernels right away (which I will still do as a sanity check), I booted up two VMs, one with the current kernel and one with slub_debug=- as part of its kernel options, and I can already see an immediate difference.
If other people could corroborate the results ( qvm-prefs -s VM kernelopts "slub_debug=- (plus any other kernel options your VM may already have)" ), then I think the culprit might have been found.
As for slub debugging exposing some issues with Qubes' Xen patches, I can't find equivalent XSA 155 patches in the Version 6 Xen patch set for the stuff that's currently in the Qubes repository (ex. the numbers and names don't match). @marmarek, are those from an earlier patch set, or did Qubes cherry pick from one of those files? Could @ij1 be onto something where slub debugging may have inadvertently exposed a flaw in some of Qubes' netfront/back patches?
https://xenbits.xen.org/xsa/advisory-155.html
https://www.kernel.org/doc/Documentation/vm/slub.txt
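For anyone wanting to apply the workaround without losing kernel options a VM already has, the merge step can be sketched as below. This is a hedged illustration only: `with_kernel_option` and `qvm_prefs_command` are hypothetical helper names, the VM name `work` and the existing option `nopat` are assumed example values, and the command string simply follows the `qvm-prefs -s VM kernelopts "..."` form quoted above. The script only builds and prints the command; it does not run it.

```python
import shlex


def with_kernel_option(current_opts, option):
    """Return current_opts with `option` appended, replacing any option of the same name."""
    key = option.split("=", 1)[0]
    kept = [tok for tok in current_opts.split() if tok.split("=", 1)[0] != key]
    return " ".join(kept + [option])


def qvm_prefs_command(vm_name, kernelopts):
    """Build (not run) the qvm-prefs command line in the form quoted in this thread."""
    return f"qvm-prefs -s {shlex.quote(vm_name)} kernelopts {shlex.quote(kernelopts)}"


if __name__ == "__main__":
    existing = "nopat"  # assumed example; check your VM's current kernelopts first
    merged = with_kernel_option(existing, "slub_debug=-")
    print(qvm_prefs_command("work", merged))  # "work" is a hypothetical VM name
```

Replacing an existing `slub_debug=...` token rather than appending a second one avoids relying on how the kernel resolves a repeated parameter.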
marmarek
Jun 2, 2017
Member
Those patches didn't make it into the final advisory, because the Xen Security Team does not consider frontends to be security-critical code ("PV frontend patches will be developed and released (publicly) after the embargo date."). No one picked up those patches when I sent them to security@xen during the embargo, and re-sending them publicly is on my very long todo-list...
Anyway I'm just checking that idea by compiling the kernel without those patches but SLUB debugging enabled.
marmarek
Jun 3, 2017
Member
Ok, looks like the patches are not the problem. Without them but with SLUB debugging enabled the problem is still there.
rtiangha
Jun 3, 2017
OK, I'm fairly confident it's the CONFIG_SLUB_DEBUG_ON option then, although I'm still compiling on this end. But if other people encountering the issue can work around it by adding slub_debug=- as a kernel option with 4.9.29, then that's the issue.
Not sure if you want to push out a 4.9.30 now with slub debugging off or wait a bit for more feedback since 4.9.31 is around the corner and will probably be out in the next couple of days anyways. But there seems to be a userspace workaround that could be applied in the meantime for those who are really affected so there may be no rush.
FYI, it's also happening in 4.11; I just never noticed and chalked it up to internet issues since both my machines had the same behavior; it wasn't until this thread that I realized they were both Qubes machines. So I apologize for not catching this myself. Usually I run new kernel options on my machines for a few weeks to test them before pushing them out; this one, I didn't do as much of.
rtiangha
Jun 3, 2017
So I'm 100% certain this is caused by that one kernel option now. Thanks for pointing it out.
You can still enable the page poisoning kernel security protections if you want by adding slub_debug=P and page_poison=1 to your VM and dom0 kernel options; I used those options for about a month with no ill effect, and I'm using them again now. They say there's a bit of a performance hit with that feature (which is always the trade-off), but I don't think it's noticeable, or at least, it isn't noticeable for me.
ij1
Jun 3, 2017
I've also confirmed that 4.9.29-17 with slub_debug=- really works.
@marmarek thanks for testing. I guess it must be an upstream bug then.
I intend to keep trying to hunt down the actual upstream bug myself but in the meantime, the workaround to disable slub debug seems good enough.
@rtiangha I was also very unsure at first whether it was just due to my ISP dropping some packets or a genuine bug.
andrewdavidwong added the bug, C: kernel, and P: critical labels on Jun 4, 2017
andrewdavidwong added this to the Release 3.2 updates milestone on Jun 4, 2017
ij1
Jun 4, 2017
I've now found the offending line and pretty much the cause too:
In xennet_start_xmit(...):
(slots > 1 && !xennet_can_sg(dev)), where slots == 2 because SKB's linear part has high offset on the page (for me, at least 3800, 3128, 2760 start offsets occur). I guess it's because of SLUB debug adding some extra stuff below (and above) the allocations. The linear part then spills over to the next page if the size of the packet is large enough.
An obvious workaround would be to enable SG for the device. Is there some downside to that, or a reason why it's not enabled?
The other option would be to ensure that slots == 1 but I don't know if there's a way to ensure with SLUB debug on that the packet data would not cross page boundaries so it seems quite fragile.
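The page-crossing arithmetic described above can be worked through with a small model. It is a simplified sketch, not the driver's code: `linear_slots` and `max_single_slot_length` are hypothetical helpers, a 4096-byte page is assumed, and the packet lengths 66 and 1514 are assumed examples of a small and a full-size Ethernet frame. The start offsets are the ones reported in this comment.

```python
PAGE_SIZE = 4096  # assumed page size


def linear_slots(start_offset, length, page_size=PAGE_SIZE):
    """Number of pages touched by `length` bytes starting `start_offset` bytes into a page."""
    if length <= 0:
        return 0
    offset = start_offset % page_size
    return (offset + length + page_size - 1) // page_size  # ceiling division


def max_single_slot_length(start_offset, page_size=PAGE_SIZE):
    """Largest linear length that still fits in one page from this start offset."""
    return page_size - (start_offset % page_size)


if __name__ == "__main__":
    for offset in (3800, 3128, 2760):  # start offsets reported above
        for length in (66, 1514):      # assumed small and full-size frame lengths
            print(f"offset={offset} len={length} -> slots={linear_slots(offset, length)}")
        print(f"  fits in one slot up to {max_single_slot_length(offset)} bytes")
```

In this model, small packets stay within one page while full-size ones spill into a second page, which would be consistent with the failures showing up mostly on larger transfers such as SSL handshakes.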
rtiangha
Jun 4, 2017
Is that like CONFIG_SG_SPLIT? If so, it isn't set (no particular reason, I think; or rather, I don't recall ever touching it), however it seems to be hidden in make menuconfig. It's supposed to appear under Library Routines but it's not there and I can't figure out how to make it appear. It still appears in the config file, though:
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
# CONFIG_SG_SPLIT is not set
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_SG_CHAIN=y
CONFIG_ARCH_HAS_PMEM_API=y
marmarek
Jun 4, 2017
Member
SG is specifically disabled here. I don't remember details, but it caused other problems.
This is not happening anymore. Anything left to do here?
andrewdavidwong
Oct 19, 2017
Member
Closing this as "resolved." If you believe the issue is not yet resolved, if anyone is still affected by this issue, or if it needs to be reopened for merge/builder tracking, please leave a comment, and we'll be happy to reopen this. Thank you.
adrelanos commented Jun 2, 2017
Qubes OS version (e.g., R3.2): R3.2
Affected TemplateVMs (e.g., fedora-23, if applicable): all
Expected behavior:
https and ssh working normally
Actual behavior:
https mostly and ssh fully broken
Steps to reproduce the behavior:
1. go to https://www.youtube.com
2. click on some link on youtube
3. click on some other link on youtube
General notes:
If I go to forums.whonix.org, it loads. But when I click on "latest", it never finishes loading.
(Just an example web site. Happens everywhere on https.)
It's happening since the latest dom0 upgrades.
From a non-Qubes system using the same network, everything works normally. My internet connection is still very fast in an http speed test, so I guess my network connection is fine.
Sometimes (after browser or VM restart) it's possible to view a https page one time. But when clicking on some link there, it stops working. Http only websites still work fine.
This is happening in both Firefox and Chromium, in debian-8, debian-stretch and Fedora.
Firefox doesn't show any useful error message.
When that happens, in Chromium I get the following most of the time.
Running Chromium with `chromium --use-spdy=off` helps a bit. A few more page views are possible until the same error message appears.
I did run `sudo debsums -s` in my Debian TemplateVM to make sure no files (of the openssl package or so) are broken. Another Qubes R3.2 installation (very similar, for testing purposes) on my external USB hard drive has the very same symptoms, so I think we can exclude file corruption as well.
This bug is currently killing all my productivity.
Booting with an older dom0 kernel didn't help. Booting with an older VM kernel setting in QVMM did not help either.
Happens both on Wi-Fi and on wired internet connection.
With non-Qubes Debian on bare metal (a different computer), I could not reproduce this issue. (I hope I can avoid having to boot my Qubes-only notebook with Debian.)
While this issue is happening, I don't see any new messages in syslog or kern.log.
Qubes R3.2 fully updated on another computer does not show this issue either.
All very strange. Do we have a hardware specific kernel bug here that leads to preventing SSL connections? I may be able to remove the disk from the affected computer and (USB) boot it in the non-affected one to get higher certainty if that would be useful.
possibly related:
#2778