New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VMs with Whonix-Tor networking timeout and do not recover #1779

Closed
bnvk opened this Issue Feb 25, 2016 · 30 comments

Comments

Projects
None yet
5 participants
@bnvk

bnvk commented Feb 25, 2016

For the last 3 - 4 days, randomly throughout the day sometimes when switching WiFi networks and sometimes while using a good steady connection, VMs with Tor internet connections start timing out and failing while other VMs stay connected! Non-Tor VMs don't seem affected, and the only Tor VM that does not time out is my Jabber VM.

Usually, the only thing I can do to fix it is close all my VMs and restart them! This has happened multiple times each day :/

@adrelanos was a recent update in packages something that could be causing this?

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Feb 25, 2016

Member
Member

adrelanos commented Feb 25, 2016

@mfc

This comment has been minimized.

Show comment
Hide comment
@mfc

mfc Feb 27, 2016

Member

this is happening to me as well. I'm on 3.1rc3 with updated templates.

fyi an easier workaround is to create a second sys-whonix (say sys-tor) and switch the networking for running appvms to that proxyvm, then shutdown and restart sys-whonix.

Member

mfc commented Feb 27, 2016

this is happening to me as well. I'm on 3.1rc3 with updated templates.

fyi an easier workaround is to create a second sys-whonix (say sys-tor) and switch the networking for running appvms to that proxyvm, then shutdown and restart sys-whonix.

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Feb 28, 2016

Member

@marmarek have you been able to reproduce this?

I am unable to reproduce this. Can change to my router's WiFi, to my mobile WiFi hotspot back and forth without permanent connection issues.

I am working on instructions on how to debug such issues so these boil down comparable Qubes-only. In essence firewall unloading, perhaps adding Qubes specific firewall rules only, restoring /etc/resolv.conf, disabling uwt wrappers, and then use ifconfig, ping, curl etc. to figure out what the issue is.

It's more likely, that some Qubes or Qubes-Whonix specific issue is going on here. It's good to have Non-Qubes-Whonix as a point of reference - it does not happen there.

To simplify the Qubes-Whonix implementation [and to making it more similar to Non-Qubes-Whonix] and thereby make issues such as this less likely, I would need feature optional static IP addresses (#1477).

Member

adrelanos commented Feb 28, 2016

@marmarek have you been able to reproduce this?

I am unable to reproduce this. Can change to my router's WiFi, to my mobile WiFi hotspot back and forth without permanent connection issues.

I am working on instructions on how to debug such issues so these boil down comparable Qubes-only. In essence firewall unloading, perhaps adding Qubes specific firewall rules only, restoring /etc/resolv.conf, disabling uwt wrappers, and then use ifconfig, ping, curl etc. to figure out what the issue is.

It's more likely, that some Qubes or Qubes-Whonix specific issue is going on here. It's good to have Non-Qubes-Whonix as a point of reference - it does not happen there.

To simplify the Qubes-Whonix implementation [and to making it more similar to Non-Qubes-Whonix] and thereby make issues such as this less likely, I would need feature optional static IP addresses (#1477).

@marmarek

This comment has been minimized.

Show comment
Hide comment
@marmarek

marmarek Feb 28, 2016

Member

@marmarek have you been able to reproduce this?

No, only suspend-related problem (#1764)

Member

marmarek commented Feb 28, 2016

@marmarek have you been able to reproduce this?

No, only suspend-related problem (#1764)

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Feb 28, 2016

Member

There was some general "ProxyVM sometimes not reliable" issue where a VPN ProxyVM would fail connection independently of wifi hopping and suspend here on Qubes github issues. I commented on it. Can't find it at the moment. Do you remember which one I mean?

Member

adrelanos commented Feb 28, 2016

There was some general "ProxyVM sometimes not reliable" issue where a VPN ProxyVM would fail connection independently of wifi hopping and suspend here on Qubes github issues. I commented on it. Can't find it at the moment. Do you remember which one I mean?

@marmarek

This comment has been minimized.

Show comment
Hide comment
Member

marmarek commented Feb 28, 2016

@bnvk

This comment has been minimized.

Show comment
Hide comment
@bnvk

bnvk Feb 28, 2016

Hrm. What commands can I type in VMs and (which ones) should I do to offer help debugging non-suspend connection dropping help?

bnvk commented Feb 28, 2016

Hrm. What commands can I type in VMs and (which ones) should I do to offer help debugging non-suspend connection dropping help?

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Feb 28, 2016

Member

Marek Marczykowski-Górecki:

#1067?

No.

Member

adrelanos commented Feb 28, 2016

Marek Marczykowski-Górecki:

#1067?

No.

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Mar 10, 2016

Member

Instructions on how to remove Whonix Tor default networking for Whonix-Gateway. After applying these instructions, Whonix-Gateway will connect to clearnet.
"How to UnWhonix":
https://www.whonix.org/wiki/Dev/Test

Member

adrelanos commented Mar 10, 2016

Instructions on how to remove Whonix Tor default networking for Whonix-Gateway. After applying these instructions, Whonix-Gateway will connect to clearnet.
"How to UnWhonix":
https://www.whonix.org/wiki/Dev/Test

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Mar 10, 2016

Member

Please apply the above instructions. @bnvk
See if you are able to connect then. If not, it gets a lot closer to a Qubes-only, not-caused-by-Whonix network related issue.

Your time in Whonix VMs is reasonable correct? (set to UTC, yes, but correct as per "real" UTC plus/minus a minute or so)

Member

adrelanos commented Mar 10, 2016

Please apply the above instructions. @bnvk
See if you are able to connect then. If not, it gets a lot closer to a Qubes-only, not-caused-by-Whonix network related issue.

Your time in Whonix VMs is reasonable correct? (set to UTC, yes, but correct as per "real" UTC plus/minus a minute or so)

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Mar 10, 2016

Member

From there lots of test commands should work as if it was not Whonix. Such as...

  • ping 8.8.8.8
  • nslookup google.com
  • etc.

Also useful to restart Tor.

sudo service tor restart

While checking Tor's log.

tail -f /var/run/tor/log
Member

adrelanos commented Mar 10, 2016

From there lots of test commands should work as if it was not Whonix. Such as...

  • ping 8.8.8.8
  • nslookup google.com
  • etc.

Also useful to restart Tor.

sudo service tor restart

While checking Tor's log.

tail -f /var/run/tor/log
@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Mar 14, 2016

Member

It was caused by a clock jump which should be unrelated to WiFi network switching and unrelated to suspend/resume. To manually fix the clock:

https://www.whonix.org/wiki/Troubleshooting#Clock_Fix


The crude simple fix would be
"disable sdwdate systemd service"

TemplateVM:

sudo systemctl mask sdwdate
Member

adrelanos commented Mar 14, 2016

It was caused by a clock jump which should be unrelated to WiFi network switching and unrelated to suspend/resume. To manually fix the clock:

https://www.whonix.org/wiki/Troubleshooting#Clock_Fix


The crude simple fix would be
"disable sdwdate systemd service"

TemplateVM:

sudo systemctl mask sdwdate
@bnvk

This comment has been minimized.

Show comment
Hide comment
@bnvk

bnvk Mar 14, 2016

@adrelanos thanks! I will try the resetting the clock manually for now and then disable if still presents issues!

bnvk commented Mar 14, 2016

@adrelanos thanks! I will try the resetting the clock manually for now and then disable if still presents issues!

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Mar 15, 2016

Member

In Whonix 13, the rewrite of sdwdate will be deployed. We'll also have reworked all time sources. When Whonix 13 testers-only version is out, please see if this is still happening for you.

Member

adrelanos commented Mar 15, 2016

In Whonix 13, the rewrite of sdwdate will be deployed. We'll also have reworked all time sources. When Whonix 13 testers-only version is out, please see if this is still happening for you.

@mfc

This comment has been minimized.

Show comment
Hide comment
@mfc

mfc Mar 16, 2016

Member

yes I can confirm it is caused by clock jump.

Member

mfc commented Mar 16, 2016

yes I can confirm it is caused by clock jump.

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Mar 18, 2016

Member

Command to check if sdwdate really stops doing anything.

In sys-whonix.

sudo service sdwdate status

Should reveal, that it's not loaded then.

If the clock is still jumping, that could be that something is causing clock jumps independently of sdwdate.

Also from the log that @bnvk has sent me I cannot make head or tail. It's unlikely, that sdwdate would result in a clock jump that big. Right after boot, when sdwdate will use date it could theoretically happen - if the media diff is so big - which is unlikely since we obtain the time from 3 servers. During run - without suspend / resume - big clock jumps are even less likely. Because then sdwdate would use sclockadj. Also since sclockadj was introduced, there were to my knowledge never any reports wrt clock jumps that break connectivity unrelated to suspend / resume.

IIUC @bnvk is running with sdwdate disabled now. I speculate, there could still be big clock jumps independently from sdwdate. I've found some reports of clock jumps related to plain Xen on search engines.

To figure that out, please run "sometimes".

qvm-run --pass-io sys-whonix "date -u" ; date -u

And in comparison.

qvm-run --pass-io sys-firewall "date -u" ; date -u
qvm-run --pass-io random-app-vms "date -u" ; date -u

Let's see if we catch something else.

Member

adrelanos commented Mar 18, 2016

Command to check if sdwdate really stops doing anything.

In sys-whonix.

sudo service sdwdate status

Should reveal, that it's not loaded then.

If the clock is still jumping, that could be that something is causing clock jumps independently of sdwdate.

Also from the log that @bnvk has sent me I cannot make head or tail. It's unlikely, that sdwdate would result in a clock jump that big. Right after boot, when sdwdate will use date it could theoretically happen - if the media diff is so big - which is unlikely since we obtain the time from 3 servers. During run - without suspend / resume - big clock jumps are even less likely. Because then sdwdate would use sclockadj. Also since sclockadj was introduced, there were to my knowledge never any reports wrt clock jumps that break connectivity unrelated to suspend / resume.

IIUC @bnvk is running with sdwdate disabled now. I speculate, there could still be big clock jumps independently from sdwdate. I've found some reports of clock jumps related to plain Xen on search engines.

To figure that out, please run "sometimes".

qvm-run --pass-io sys-whonix "date -u" ; date -u

And in comparison.

qvm-run --pass-io sys-firewall "date -u" ; date -u
qvm-run --pass-io random-app-vms "date -u" ; date -u

Let's see if we catch something else.

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Mar 18, 2016

Member

More ways to verify, that sdwdate does not run.

ls /var/log/sdwdate.log

Won't be existing.

ps aux | grep sdwdate
ps aux | grep sclockadj

No process found.


The only other mechanism of Whonix other than sdwdate that interferes with the clock but only at boot is bootclockrandomization. Unlikely to be the cause because it only moves the clock +/- 180 seconds. Anyhow. To exclude it, it can be disabled.

TemplateVM:

sudo systemctl mask bootclockrandomization
Member

adrelanos commented Mar 18, 2016

More ways to verify, that sdwdate does not run.

ls /var/log/sdwdate.log

Won't be existing.

ps aux | grep sdwdate
ps aux | grep sclockadj

No process found.


The only other mechanism of Whonix other than sdwdate that interferes with the clock but only at boot is bootclockrandomization. Unlikely to be the cause because it only moves the clock +/- 180 seconds. Anyhow. To exclude it, it can be disabled.

TemplateVM:

sudo systemctl mask bootclockrandomization
@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Mar 18, 2016

Member

Check.

dmesg

It might reveal any kernel messages wrt the clock.


Please post the output of the following two commands.

cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
Member

adrelanos commented Mar 18, 2016

Check.

dmesg

It might reveal any kernel messages wrt the clock.


Please post the output of the following two commands.

cat /sys/devices/system/clocksource/clocksource0/available_clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

adrelanos added a commit to Whonix/sdwdate that referenced this issue Mar 18, 2016

adrelanos added a commit to Whonix/whonixcheck that referenced this issue Mar 21, 2016

look out for timekeeping watchdog kernel messages
Kernel messages wrt timekeeping can indicate issues which generate lots of
confusion, inconvenience and even time related fingerprinting issues. See
QubesOS/qubes-issues#1779 (new information to be
added there soon). Which took a while to debug.

https://phabricator.whonix.org/T480
@bnvk

This comment has been minimized.

Show comment
Hide comment
@bnvk

bnvk Mar 22, 2016

After disabling sdwdate and still experiencing issues (albeit, less frequently), running date -u in dom0, NetVMs and all my AppVMs, the time reported was correct. Only in the sys-whonix VMs had the time slipped by almost 4 hrs, which of course made my Tor connection fail.

The output of dmesg in sys-whonix produced a bunch of stuff, but @adrelanos seems to think the last three lines starting with timekeeping watchdog: are Xen issue

[ 2.556649] nf_conntrack version 0.5.0 (1804 buckets, 7216 max)
[ 5.040819] Adding 1048572k swap on /dev/xvdc1. Priority:-1 extents:1 across:1048572k SSFS
[ 5.173840] EXT4-fs (xvdb): mounted filesystem with ordered data mode. Opts: discard
[ 6.220787] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 61.173500] vif vif-221-0 vif221.0: Guest Rx ready
[ 323.323338] vif vif-222-0 vif222.0: Guest Rx ready
[ 1583.454513] vif vif-225-0 vif225.0: Guest Rx ready
[ 1751.947171] vif vif-226-0 vif226.0: Guest Rx ready
[ 2743.843690] timekeeping watchdog: Marking clocksource 'tsc' as unstable, because the skew is too large:
[ 2743.843711] 'xen' wd_now: eac101ac3978 wd_last: eac0273e8104 mask: ffffffffffffffff
[ 2743.843716] 'tsc' cs_now: 14d48b5e5 cs_last: 1ff05ffd07de mask: ffffffffffffffff

@marmarek what do you think about this?

bnvk commented Mar 22, 2016

After disabling sdwdate and still experiencing issues (albeit, less frequently), running date -u in dom0, NetVMs and all my AppVMs, the time reported was correct. Only in the sys-whonix VMs had the time slipped by almost 4 hrs, which of course made my Tor connection fail.

The output of dmesg in sys-whonix produced a bunch of stuff, but @adrelanos seems to think the last three lines starting with timekeeping watchdog: are Xen issue

[ 2.556649] nf_conntrack version 0.5.0 (1804 buckets, 7216 max)
[ 5.040819] Adding 1048572k swap on /dev/xvdc1. Priority:-1 extents:1 across:1048572k SSFS
[ 5.173840] EXT4-fs (xvdb): mounted filesystem with ordered data mode. Opts: discard
[ 6.220787] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 61.173500] vif vif-221-0 vif221.0: Guest Rx ready
[ 323.323338] vif vif-222-0 vif222.0: Guest Rx ready
[ 1583.454513] vif vif-225-0 vif225.0: Guest Rx ready
[ 1751.947171] vif vif-226-0 vif226.0: Guest Rx ready
[ 2743.843690] timekeeping watchdog: Marking clocksource 'tsc' as unstable, because the skew is too large:
[ 2743.843711] 'xen' wd_now: eac101ac3978 wd_last: eac0273e8104 mask: ffffffffffffffff
[ 2743.843716] 'tsc' cs_now: 14d48b5e5 cs_last: 1ff05ffd07de mask: ffffffffffffffff

@marmarek what do you think about this?

@marmarek marmarek added the duplicate label Mar 23, 2016

@marmarek

This comment has been minimized.

Show comment
Hide comment
@marmarek

marmarek Mar 23, 2016

Member

Generally yes, it is somehow Xen issue. Basically VM can have multiple clocksources - currently there are two:

  • xen - absolute time provided by Xen
  • tsc - time counted by the VM itself

When you pause the VM (which also covers host suspend case), tsc clocksource is frozen for that time. And after resume it still have the old (from before suspend) value. But xen clocksource points the right time - this is why tsc is marked unstable.
In Qubes we have a mechanism qvm-sync-clock, which is called just after resume to fix the problem. But Whonix have it disabled to have time slightly different than dom0 (to prevent time fingerprinting issues). The proper fix is implemented as part of #1764, which is going to be a part of Whonix 13 (BTW @adrelanos, do you consider it as an update to Whonix 12? Or Whonix 13 is soon enough?). In the meantime you can use instruction here: https://www.whonix.org/wiki/Troubleshooting#Clock_Fix

Anyway, this ticket looks to be a duplicate of #1764

Member

marmarek commented Mar 23, 2016

Generally yes, it is somehow Xen issue. Basically VM can have multiple clocksources - currently there are two:

  • xen - absolute time provided by Xen
  • tsc - time counted by the VM itself

When you pause the VM (which also covers host suspend case), tsc clocksource is frozen for that time. And after resume it still have the old (from before suspend) value. But xen clocksource points the right time - this is why tsc is marked unstable.
In Qubes we have a mechanism qvm-sync-clock, which is called just after resume to fix the problem. But Whonix have it disabled to have time slightly different than dom0 (to prevent time fingerprinting issues). The proper fix is implemented as part of #1764, which is going to be a part of Whonix 13 (BTW @adrelanos, do you consider it as an update to Whonix 12? Or Whonix 13 is soon enough?). In the meantime you can use instruction here: https://www.whonix.org/wiki/Troubleshooting#Clock_Fix

Anyway, this ticket looks to be a duplicate of #1764

@marmarek marmarek closed this Mar 23, 2016

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Mar 24, 2016

Member

Marek Marczykowski-Górecki:

(BTW @adrelanos, do you consider it as an update to
Whonix 12? Or Whonix 13 is soon enough?).

Will answer in #1764.

Anyway, this ticket looks to be a duplicate of #1764

I don't think this is a duplicate.

@bnvk may correct me, but AFAIK, this bug is happening while...

  • He disabled sdwdate.
  • He disabled bootclock randomization.
  • Did not use the VM pause feature.
  • Did not use notebook suspend feature.
  • Still seeing seemingly random clock jumps.

Right, @bnvk?

Member

adrelanos commented Mar 24, 2016

Marek Marczykowski-Górecki:

(BTW @adrelanos, do you consider it as an update to
Whonix 12? Or Whonix 13 is soon enough?).

Will answer in #1764.

Anyway, this ticket looks to be a duplicate of #1764

I don't think this is a duplicate.

@bnvk may correct me, but AFAIK, this bug is happening while...

  • He disabled sdwdate.
  • He disabled bootclock randomization.
  • Did not use the VM pause feature.
  • Did not use notebook suspend feature.
  • Still seeing seemingly random clock jumps.

Right, @bnvk?

@mfc

This comment has been minimized.

Show comment
Hide comment
@mfc

mfc Mar 24, 2016

Member

I get it from notebook suspend, Marek's description fits my issue at least.

Member

mfc commented Mar 24, 2016

I get it from notebook suspend, Marek's description fits my issue at least.

@bnvk

This comment has been minimized.

Show comment
Hide comment
@bnvk

bnvk Mar 28, 2016

  • Did not use the VM pause feature.
  • Did not use notebook suspend feature.

Wrong. I am seeing this behavior even when I have not paused or put my laptop to sleep. The time clock slips considerably in my sys-whonix while just using Qubes

bnvk commented Mar 28, 2016

  • Did not use the VM pause feature.
  • Did not use notebook suspend feature.

Wrong. I am seeing this behavior even when I have not paused or put my laptop to sleep. The time clock slips considerably in my sys-whonix while just using Qubes

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Mar 28, 2016

Member

Brennan Novak:

  • Did not use the VM pause feature.
  • Did not use notebook suspend feature.

Wrong. I am seeing this behavior even when I have not paused or put my laptop to sleep. The time clock slips considerably in my sys-whonix while just using Qubes

That's what I said. (Overread the word "not"?=

Member

adrelanos commented Mar 28, 2016

Brennan Novak:

  • Did not use the VM pause feature.
  • Did not use notebook suspend feature.

Wrong. I am seeing this behavior even when I have not paused or put my laptop to sleep. The time clock slips considerably in my sys-whonix while just using Qubes

That's what I said. (Overread the word "not"?=

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Apr 9, 2016

Member

This is not a duplicate. But I am inclined to make it one by creating a fresh ticket with a clear report. Before I do that, I need a small confirmation. After having (hopefully) cleared about the confusion about the word "not"...

@bnvk Can you confirm, that my summary from #1779 (comment) is correct?

Member

adrelanos commented Apr 9, 2016

This is not a duplicate. But I am inclined to make it one by creating a fresh ticket with a clear report. Before I do that, I need a small confirmation. After having (hopefully) cleared about the confusion about the word "not"...

@bnvk Can you confirm, that my summary from #1779 (comment) is correct?

@bnvk

This comment has been minimized.

Show comment
Hide comment
@bnvk

bnvk Apr 9, 2016

@bnvk Can you confirm, that my summary from #1779 (comment) is correct?

Correct.

bnvk commented Apr 9, 2016

@bnvk Can you confirm, that my summary from #1779 (comment) is correct?

Correct.

@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos May 18, 2016

Member

@bnvk can you please try Whonix 13 testers release?

https://www.whonix.org/blog/qubes-whonix-13-0-0-0-7-testers-wanted

I have a small hope that this fixes your issue.

Otherwise on clock jumps you can then also try a new utility:

sudo /usr/lib/sdwdate/clock-fix
Member

adrelanos commented May 18, 2016

@bnvk can you please try Whonix 13 testers release?

https://www.whonix.org/blog/qubes-whonix-13-0-0-0-7-testers-wanted

I have a small hope that this fixes your issue.

Otherwise on clock jumps you can then also try a new utility:

sudo /usr/lib/sdwdate/clock-fix
@adrelanos

This comment has been minimized.

Show comment
Hide comment
@adrelanos

adrelanos Jun 3, 2016

Member

Whonix stable is out btw. ( https://www.whonix.org/blog/whonix-13-released )

Related, and possible the cause for this issue:
#2044

Member

adrelanos commented Jun 3, 2016

Whonix stable is out btw. ( https://www.whonix.org/blog/whonix-13-released )

Related, and possible the cause for this issue:
#2044

@bnvk

This comment has been minimized.

Show comment
Hide comment
@bnvk

bnvk Jun 11, 2016

@adrelanos just installed will test and check over the next week and will close if the problems seems to be fixed!

bnvk commented Jun 11, 2016

@adrelanos just installed will test and check over the next week and will close if the problems seems to be fixed!

@bnvk

This comment has been minimized.

Show comment
Hide comment
@bnvk

bnvk Jun 18, 2016

After running the new templates for a week I have never experienced this. Thanks @adrelanos 👍 😄

bnvk commented Jun 18, 2016

After running the new templates for a week I have never experienced this. Thanks @adrelanos 👍 😄

@bnvk bnvk closed this Jun 18, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment