qubes-iptables race condition in fedora-28 #3939
Comments
fosslinux
May 30, 2018
Does the problem and/or the race condition occur in appvms based on other templates (eg debian)? Also is the race condition in qubes-iptables, Fedora's setup or iptables?
andrewdavidwong added the bug and C: core labels on May 30, 2018
andrewdavidwong added this to the Release 4.0 updates milestone on May 30, 2018
Aekez
May 30, 2018
I can confirm this issue happens on my system too. It is 100% reproducible in sys-net / sys-firewall, based on a couple of hours spent swapping in different fedora-27/28 templates, changing settings, restarting, looking for errors, etc.
Testing background
I have not tried individual AppVMs on F27 / F28. Because of the upcoming EOL for F26, I upgraded the sys-VMs today, and out of habit I restart and test internet access before replacing individual regular AppVMs. The lack of internet access from sys-net and sys-firewall caught my attention before I started replacing F26 AppVMs, so I did not test regular AppVMs with F27/F28 very much; they may well be affected too. After these failed attempts I went looking here, and sure enough, this issue seems to be exactly what I'm seeing.
Regarding the random behavior
I did have one successful run with F26 in sys-net and sys-firewall and an AppVM running F27, but I only tried this layout once. Given the 50-70% failure rate tasket reported above, it may very well be that I was just lucky on the first try and that it would not work every time.
Tested Templates
- Fedora-27 (full download from repository): no or limited internet access.
  - Surprisingly, dom0 and templates can update, but AppVMs (e.g. Firefox) have no internet access. This also seems to match tasket's description.
  - Going back to fedora-26 for the sys-VMs immediately restores internet access.
  - F26 has not been updated to the recent current-testing updates, for fear of losing all internet access. Dom0 and all other templates are fully updated to the latest current-testing.
- Fedora-28 (full download from repository): no or limited internet access.
  - Surprisingly, dom0 and templates can update, but AppVMs (e.g. Firefox) have no internet access. This also seems to match tasket's description.
  - Going back to fedora-26 for the sys-VMs immediately restores internet access.
  - F26 has not been updated to the recent current-testing updates, for fear of losing all internet access. Dom0 and all other templates are fully updated to the latest current-testing.
- Fedora-27 (upgraded from Fedora-26): no internet at all.
  - The F26-to-F27 instructions in the Qubes docs were followed exactly.
  - The Fedora upgrade itself went error-free.
  - qubes-core-agent-networking reported a failure towards the very end of the upgrade.
  - qubes-core-agent-qrexec reported a failure towards the very end of the upgrade.
  - Since dom0 is on current-testing, and because these two core-agent packages failed during the upgrade, I immediately followed up with a current-testing update in the fedora-27 template, which then correctly installed higher versions than the ones that failed. The failed-upgrade issue "seemed" to have been avoided, but as it turns out it probably wasn't.
  - Everything "seemed" normal at that point, so I shut everything down, switched the sys-VMs to fedora-27, fully restarted the system, and tested internet access.
  - Unlike the full download from the repository, this left Qubes with no internet access at all.
  - Going back to fedora-26 for the sys-VMs immediately restores internet access.
  - F26 has not been updated to the recent current-testing updates, for fear of losing all internet access. Dom0 and all other templates are fully updated to the latest current-testing.
Three big critical concerns
- This leaves users stuck on fedora-26 even though it reaches EOL in less than 2 days.
- Unlucky Qubes users who deleted their old fedora-26 template (or ran updates, if F27/F28 previously worked) before testing whether fedora-27/28 works might not be able to return to fedora-26 (or the previous update state) easily, or at all, to regain internet access and fix this issue.
- A big question is how many users are affected: few, many, everyone? On at least one Qubes system, I can reproduce this issue of no or limited internet access from the sys-VMs 100% of the time.
Clarifications
- No AppVMs had internet access once F27 or F28 was used for sys-net, sys-firewall, or both.
- This includes Debian and Whonix, which had no internet access either.
- The odd part is that dom0 and templates had internet access most of the time. The only time updates had no internet access was when I used the template upgraded from F26 to F27. Perhaps I was just lucky, given the random nature mentioned by tasket.
marmarek
May 30, 2018
Member
Does it also happen with qubes-core-agent 4.0.29? That version contains a related fix.
tasket
May 30, 2018
@marmarek Yes, it still happens, but the log is a bit different:
May 30 11:52:58 sys-vpn2 systemd[1]: Starting Qubes base firewall settings...
May 30 11:52:58 sys-vpn2 qubes-iptables[357]: iptables: Applying firewall rules: iptables-restore v1.6.2: wait seconds not numeric
May 30 11:52:58 sys-vpn2 qubes-iptables[357]: Error occurred at line: 0
May 30 11:52:58 sys-vpn2 qubes-iptables[357]: Try `iptables-restore -h' or 'iptables-restore --help' for more information.
May 30 11:52:58 sys-vpn2 qubes-iptables[357]: FAIL
May 30 11:52:58 sys-vpn2 systemd[1]: qubes-iptables.service: Main process exited, code=exited, status=1/FAILURE
May 30 11:52:58 sys-vpn2 systemd[1]: qubes-iptables.service: Failed with result 'exit-code'.
May 30 11:52:58 sys-vpn2 systemd[1]: Failed to start Qubes base firewall settings.
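For anyone who wants to check quickly whether a VM hit this failure, here is a minimal sketch. The function name qubes_iptables_failed is made up for this example, and the two strings it searches for are copied from the log above; the journalctl invocation in the comment is only a suggestion for how to feed it.

```shell
# Hypothetical helper: succeeds (exit 0) if the journal text on stdin
# contains one of the qubes-iptables failure lines seen in the log above.
qubes_iptables_failed() {
    grep -qF -e 'qubes-iptables.service: Failed' \
             -e 'Failed to start Qubes base firewall settings'
}

# Possible use inside the affected VM (not run here):
#   journalctl -b -u qubes-iptables.service | qubes_iptables_failed \
#       && echo "base firewall rules were not applied"
```

The helper only reads text, so it can also be run against a saved copy of the journal.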
fosslinux
May 30, 2018
@tasket Just clarifying, is this log from the appVM or the netVM?
fosslinux
May 30, 2018
@tasket @Aekez I'm finding this very unusual. I cloned my fedora template, upgraded straight from 26 to 28, and switched my sys-net and sys-firewall netVMs to fedora-28. Everything works perfectly.
I think the defining difference between our setups is the templates used for the appVMs. I use the archlinux template for 90% of my appVMs, and the remaining 10% (i.e. one VM) is Debian (only because a package isn't available on Arch). I only use Fedora for my netVMs; I generally avoid Fedora (I don't like RPMs).
I am unable to reproduce this problem.
tasket
May 30, 2018
@sstt011 This happens in proxyVMs (provides network, no NIC hardware) based on an installed fedora-28 template. Debian-9 doesn't suffer from the problem.
fosslinux
May 31, 2018
@tasket Can you connect to the local Qubes network, your local LAN or the internet in appVMs?
Aekez
May 31, 2018
@sstt011 I only upgraded from F26 to F27. Perhaps I should try the F26-to-F28 approach later when I can find the time (unfortunately my time is very limited today).
If it works for you, then either there is a difference in our templates, or something is amiss in dom0 on the systems where this issue happens. Are you on the latest current-testing in the template and dom0 too? When did you upgrade the template: within the last few days, or further back? The time of the upgrade could make a difference too (i.e. something may have changed in the Fedora or the Qubes update repository between when you did it and when we did it). If the templates do not differ, then the cause must be in dom0?
I do not (yet, at least) have the skills to quickly inspect iptables and check the network from an AppVM's perspective, so hopefully tasket can provide that insight. However, I can get to two or three other Qubes systems within my reach, test the same template on them (via qvm-backup), and see whether sys-net / sys-firewall works on F27 there.
I'll give a shout when I find time of my own to test this; apologies that I can't try it right away.
tasket
May 31, 2018
@sstt011 No, appVMs cannot access the net through these fedora 28 proxyVMs. The iptables rules needed to forward traffic are not added.
I have a small patch that moves $CMD_ARGS to the end of the command line in qubes-iptables 4.0.29, which should resolve the issue. This is needed because iptables-restore has a crummy command syntax:
QubesOS/qubes-core-agent-linux#120
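To confirm that the forwarding rules are really missing in a proxyVM, something like the following might help. This is only a sketch: chain_has_rules is a hypothetical helper, and it assumes the rules are inspected through the output of iptables -S, where appended rules show up as lines starting with -A.

```shell
# Hypothetical helper: reads `iptables -S` output on stdin and succeeds
# only if the given chain (default: FORWARD) has at least one "-A" rule.
chain_has_rules() {
    chain="${1:-FORWARD}"
    grep -q -e "^-A ${chain} "
}

# Possible use inside the proxyVM (needs root, not run here):
#   sudo iptables -S | chain_has_rules FORWARD || echo "no FORWARD rules loaded"
```

Policy lines (those starting with -P) are deliberately ignored, since they are present even when no rules were loaded.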
fosslinux
May 31, 2018
@Aekez Does QubesOS/qubes-core-agent-linux#120 fix it for you?
tasket referenced this issue on Jun 1, 2018: Lost internet on all VMs after updating from testing repos #3949 (Open)
Aekez
Jun 1, 2018
@sstt011 To be determined. I currently don't know how to use the component build (I normally do the full Qubes build, not individual components). I gather that I need to apply the Build Configuration from QubesOS/qubes-core-agent-linux#120 to the travis-build script. I seem to be close to getting it working, but figuring out how to apply this modification is taking a lot of time, so a quick pointer to where the Build Configuration goes, so that I can run the script and test the modified qubes-core-agent-* in the fedora-27 template, would be greatly appreciated.
Meanwhile, I managed to reproduce this issue on my second Qubes system/hardware, which is probably no surprise given that tasket's fix appears to have located the problem. Am I right to assume that the fix has been found, but has the cause of the issue been found as well? Or are the cause and the trigger one and the same? In the former case, it seems to be triggered by something in the Qubes current-testing updates. I found that internet works from the sys-VMs when installing and using a fresh fedora-27/28 template from the Qubes repositories without any updates applied (on both Qubes systems/hardware). The moment updates are applied to the fedora-27/28 template used for sys-net / sys-firewall, all bets are off: after the sys-VMs are restarted, there is no longer any internet from sys-net / sys-firewall.
At this point it is probably redundant to test the same templates on different systems/hardware, since the failure can be predicted on both systems: it works before updates are applied and fails after. I believe normal current updates still worked and that it failed only once the current-testing updates were applied, but I need to try this again to confirm, as I did not take proper notes on this particular difference (whether internet worked between the current and current-testing updates). The issue is triggered by something in one of the repositories, most likely current-testing.
I'm a little puzzled why not everyone on current-testing is experiencing this issue; or maybe not that many people are running current-testing.
Please don't spend valuable time on my input if it doesn't give you any helpful information. For my own needs I'll wait for the official fix, but if I can help with further testing I'll gladly do so with the time I can find.
tasket
Jun 1, 2018
The template initially has no --wait parameter, but the update adds the --wait parameter to the wrong place in the iptables-restore command. This causes qubes-iptables.service to fail and firewall rules for basic traffic flow are not enacted. My fix moves --wait parameter so the service doesn't fail.
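To illustrate why the position of --wait matters, here is a small simulation. restore_stub is a stand-in written for this example, not the real iptables-restore; it assumes that --wait treats the next word as its seconds value unless that word starts with a dash, which would explain the "wait seconds not numeric" message in the log. The file path is a placeholder.

```shell
# restore_stub is NOT iptables-restore; it only mimics the assumed
# handling of "--wait [seconds]" so the two argument orders can be compared.
restore_stub() {
    file=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --wait)
                # If another word follows and it does not start with "-",
                # it is taken as the seconds value.
                if [ "$#" -gt 1 ] && [ "${2#-}" = "$2" ]; then
                    case "$2" in
                        ''|*[!0-9]*)
                            echo "wait seconds not numeric" >&2
                            return 1
                            ;;
                    esac
                    shift   # drop one extra word (the seconds value)
                fi
                ;;
            *)
                file="$1"
                ;;
        esac
        shift
    done
    echo "would load rules from: ${file:-stdin}"
}

# --wait first: the file name is swallowed as the seconds value, the stub fails.
restore_stub --wait /tmp/rules.example || echo "failed"

# --wait last: nothing follows it, so the file name is seen as a file.
restore_stub /tmp/rules.example --wait
```

With this stub, passing an explicit number (for example --wait 0 followed by the file) also succeeds, which matches the other workaround discussed in this thread.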
donob4n
Jun 2, 2018
I also fixed it, by adding a '0' after the --wait.
CMD_ARGS=
if "$CMD-restore" --help 2>&1 | grep -q wait=; then
    # Quoted so that "0" is part of the value rather than run as a command.
    CMD_ARGS="--wait 0"
fi
I don't know what the desired wait time is.
tasket
Jun 3, 2018
@donob4n I think the intended wait time is 'indefinite', and the manpage for iptables-restore does not indicate that '0' has special behavior here, so I would opt for moving --wait to the end of the line instead of using '0'.
donob4n
Jun 3, 2018
Yours is probably the better solution; I added a 0 because the error complained about a missing numeric value.
Aekez
Jun 4, 2018
Would it be better to pull the affected package update(s) from current-testing until the fix replaces the bad packages? I know it's current-testing and use is at one's own risk, but it seems (?) many people use current-testing for whatever reason, even if they don't plan to report issues. Opinions may differ here, but quite a few people posting on the qubes-users mailing list probably never should have run current-testing and did so only to get faster updates. Maybe updates can be addictive? Anyway, this might affect more people than just the testers.
@tasket I see, thanks for the explanation. The cause now makes much more sense, in addition to the effect. Knowing what's going on when issues happen is always appreciated.
marmarek closed this in marmarek/qubes-core-agent-linux@8b1cb80 on Jun 4, 2018
qubesos-bot
Jun 4, 2018
Automated announcement from builder-github
The package core-agent-linux has been pushed to the r4.0 testing repository for the CentOS centos7 template.
To test this update, please install it with the following command:
sudo yum update --enablerepo=qubes-vm-r4.0-current-testing
qubesos-bot added the r4.0-centos7-cur-test label on Jun 4, 2018
qubesos-bot referenced this issue in QubesOS/updates-status on Jun 4, 2018: core-agent-linux v4.0.30 (r4.0) #551 (Closed)
qubesos-bot added the r4.0-buster-cur-test and r4.0-jessie-cur-test labels on Jun 4, 2018
qubesos-bot
Jun 4, 2018
Automated announcement from builder-github
The package qubes-core-agent_4.0.30-1+deb9u1 has been pushed to the r4.0 testing repository for the Debian template.
To test this update, first enable the testing repository in /etc/apt/sources.list.d/qubes-*.list by uncommenting the line containing stretch-testing (or appropriate equivalent for your template version), then use the standard update command:
sudo apt-get update && sudo apt-get dist-upgrade
qubesos-bot added the r4.0-stretch-cur-test label on Jun 4, 2018
qubesos-bot
Jun 4, 2018
Automated announcement from builder-github
The component core-agent-linux (including package python2-dnf-plugins-qubes-hooks-4.0.30-1.fc26) has been pushed to the r4.0 testing repository for the Fedora template.
To test this update, please install it with the following command:
sudo yum update --enablerepo=qubes-vm-r4.0-current-testing
qubesos-bot added the r4.0-buster-stable and r4.0-jessie-stable labels and removed the r4.0-buster-cur-test and r4.0-jessie-cur-test labels on Jun 12, 2018
qubesos-bot
Jun 12, 2018
Automated announcement from builder-github
The component core-agent-linux (including package python2-dnf-plugins-qubes-hooks-4.0.30-1.fc26) has been pushed to the r4.0 stable repository for the Fedora template.
To install this update, please use the standard update command:
sudo yum update
qubesos-bot removed the r4.0-fc26-cur-test label on Jun 12, 2018
qubesos-bot added the r4.0-fc26-stable, r4.0-fc27-stable, r4.0-fc28-stable and r4.0-jessie-stable labels and removed the r4.0-fc27-cur-test and r4.0-fc28-cur-test labels on Jun 12, 2018
qubesos-bot
Jun 13, 2018
Automated announcement from builder-github
The package qubes-core-agent_4.0.30-1+deb9u1 has been pushed to the r4.0 stable repository for the Debian template.
To install this update, please use the standard update command:
sudo apt-get update && sudo apt-get dist-upgrade
qubesos-bot added the r4.0-stretch-stable label and removed the r4.0-stretch-cur-test label on Jun 13, 2018
qubesos-bot
Jun 29, 2018
Automated announcement from builder-github
The package core-agent-linux has been pushed to the r4.0 stable repository for the Fedora centos7 template.
To install this update, please use the standard update command:
sudo yum update
tasket commented May 30, 2018
Qubes OS version:
R4.0
Affected component(s):
qubes-iptables
Steps to reproduce the behavior:
Install options that alter the f28 startup sequence, such as qubes-tunnel service, and create a proxyVM.
Specific conditions seem to matter very little. I have reproduced the problem with the qubes service enabled, deactivated (but listed), removed from services tab with qubes-firewall.d script still enabled, with script removed but service enabled, with meminfowriter enabled... and other combinations of the above.
This also affects proxyVMs like sys-firewall which have no service or firewall options activated, and may affect f28 proxyVMs where the template is unmodified - see this qubes-users discussion.
Expected behavior:
The Qubes proxyVM default iptables rules should be active after VM startup.
Actual behavior:
On about 50-70% of VM starts, the default iptables rules are totally absent.
Failure log:
General notes:
The log and the erratic behavior suggest there is a compromise (for speed?) in f28's specific systemd matrix that has resulted in a race condition.
Related issues: