sys-whonix doesn't connect to Tor after system suspend #1764

Closed
marmarek opened this Issue Feb 19, 2016 · 23 comments

marmarek commented Feb 19, 2016

After a system suspend, sys-whonix cannot connect to Tor. When trying to reach any site, it logs:

Feb 16 10:37:02 host Tor[3507]: Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
(...)
Feb 16 10:41:29 host Tor[3507]: Tried for 120 seconds to get a connection to [scrubbed]:0. Giving up. (waiting for circuit)

I guess it's because of a desynchronized clock: the actual time is now Feb 19 16:36:42.
When I set the date manually (date -s 'Feb 19 16:30:00') to an approximately current value, it works again:

Feb 19 16:30:04 host Tor[3507]: Tor has successfully opened a circuit. Looks like client functionality is working.

First of all, sys-whonix currently has no idea when the system was suspended. For any solution to this problem, that probably has to change. Is it possible to somehow force a reconnection to Tor, even with such a large clock difference? IIUC, a Tor connection is required for sdwdate to sync the time.
Another solution would be to properly suspend the VM. This means the kernel would sync the time after resume (based on clocksource xen? not sure, but it works for sys-net).

/cc @adrelanos

marmarek commented Feb 19, 2016

Related: after setting the time manually, something (sdwdate?) keeps setting the time:

Feb 19 16:36:52 host systemd[1390]: Time has been changed
Feb 19 16:36:53 host systemd[1]: Time has been changed
Feb 19 16:36:53 host systemd[1390]: Time has been changed
Feb 19 16:36:54 host systemd[1]: Time has been changed
Feb 19 16:36:54 host systemd[1390]: Time has been changed
Feb 19 16:36:55 host systemd[1]: Time has been changed
Feb 19 16:36:55 host systemd[1390]: Time has been changed
Feb 19 16:36:56 host systemd[1]: Time has been changed
Feb 19 16:36:56 host systemd[1390]: Time has been changed
Feb 19 16:36:57 host systemd[1]: Time has been changed
Feb 19 16:36:57 host systemd[1390]: Time has been changed
Feb 19 16:36:58 host systemd[1]: Time has been changed
Feb 19 16:36:58 host systemd[1390]: Time has been changed
Feb 19 16:36:59 host systemd[1]: Time has been changed
Feb 19 16:36:59 host systemd[1390]: Time has been changed
Feb 19 16:37:00 host systemd[1390]: Time has been changed
Feb 19 16:37:00 host systemd[1]: Time has been changed
Feb 19 16:37:01 host systemd[1390]: Time has been changed

Process list:

user@host:~$ ps aux|grep sdwdate
sdwdate    814  0.0  0.6  14372  4220 ?        Ss   Feb18   0:34 /bin/bash /usr/bin/sdwdate
sdwdate  15460  0.0  0.4  14372  3184 ?        S    16:30   0:00 /bin/bash /usr/bin/sdwdate
sdwdate  15461  0.0  0.3  14372  2232 ?        S    16:30   0:00 /bin/bash /usr/bin/sdwdate
root     15470  0.0  0.5  51072  3660 ?        S    16:30   0:00 sudo INLINEDIR=/var/cache/sdwdate/sclockadj /usr/lib/sdwdate/sclockadj --no-verbose --no-debug --no-first-wait --move-min 500000 --move-max 500000 --wait-min 1000000000 --wait-max 1000000000 --subtract 2992752664174
sdwdate  15472  0.0  0.0   5800   660 ?        S    16:30   0:00 sleep 9120
root     15473 29.7 15.5 175864 108168 ?       Sl   16:30   2:13 ruby /usr/lib/sdwdate/sclockadj --no-verbose --no-debug --no-first-wait --move-min 500000 --move-max 500000 --wait-min 1000000000 --wait-max 1000000000 --subtract 2992752664174
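
For a sense of scale, the sclockadj arguments in the process list above can be turned into a rough estimate of how long such a gradual adjustment would run. This is only a back-of-the-envelope sketch: it assumes that all values are nanoseconds and that each step moves the clock by --move-max and then sleeps for --wait-max. The function name is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: estimate how long a stepwise clock adjustment takes.
# Assumption: total, move and wait are all given in nanoseconds.
adjust_eta_seconds() {
    local total_ns=$1 move_ns=$2 wait_ns=$3
    # Number of steps, rounded up.
    local steps=$(( (total_ns + move_ns - 1) / move_ns ))
    # One wait per step, converted from nanoseconds to seconds.
    echo $(( steps * wait_ns / 1000000000 ))
}

# Values taken from the sclockadj command line shown above.
eta=$(adjust_eta_seconds 2992752664174 500000 1000000000)
echo "about ${eta} seconds, roughly $(( eta / 86400 )) days"
```

Under those assumptions, an offset of about 50 minutes corrected at 0.5 ms per second would take on the order of 69 days, which would fit the steady one-per-second stream of "Time has been changed" messages.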

adrelanos commented Feb 19, 2016

The time changed messages are generated by sclockadj. Unrelated.
Requires $someone to finish / create sclockadj2. More info:

- Whonix/sdwdate#4

https://www.whonix.org/wiki/Dev/TimeSync#Adjusting_time_slowly_using_adjtimex.2Fntp_adjtime


It's a known issue.
https://www.whonix.org/wiki/Known_Issues#Suspend_.2F_Hibernate_Issues

Fun to fix.

How to fix... On resume... Inside Whonix.... To do....

  1. sudo service sdwdate stop
  2. have dom0 tell Qubes-Whonix a slightly randomized time [1] and
    set it using date
  3. sudo service sdwdate start
  4. might have to restart Tor at this point depending on if they already
    fixed the bugs [that I reported already] requiring this

Minor: check if sdwdate is even installed beforehand.

Do we have a ticket for Qubes to implement dispatching a hook on resume?
Once that is done, the above is simple.

[1] code similar to this:
https://github.com/Whonix/bootclockrandomization/blob/master/usr/share/bootclockrandomization/start
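
The resume steps above could be sketched roughly as follows. This is only an illustration under assumptions: get_dom0_time is a hypothetical placeholder for whatever mechanism ends up delivering dom0's time to the VM, the 180-second bound is an arbitrary choice, and the possible Tor restart from step 4 is left out.

```shell
#!/bin/bash
# Return base_time shifted by a random offset in [-max_offset, +max_offset] seconds.
randomize_time() {
    local base_time=$1 max_offset=$2
    local rand span
    # Read 4 random bytes as an unsigned integer.
    rand=$(od -An -N4 -tu4 /dev/urandom | tr -d ' \n')
    span=$(( 2 * max_offset + 1 ))
    echo $(( base_time - max_offset + rand % span ))
}

# Hypothetical placeholder: how dom0 hands its time to the VM is not settled here.
get_dom0_time() {
    echo "get_dom0_time: not implemented" >&2
    return 1
}

# Sketch of the resume sequence; defined but not called here.
fix_clock_after_resume() {
    local host_time new_time
    # Minor point from above: do nothing if sdwdate is not installed.
    command -v sdwdate >/dev/null 2>&1 || return 0
    host_time=$(get_dom0_time) || return 1
    new_time=$(randomize_time "$host_time" 180)
    sudo service sdwdate stop
    sudo date -s "@${new_time}"
    sudo service sdwdate start
}
```

Keeping the randomization in its own function means it can be tried out with plain numbers, without touching the system clock.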

marmarek commented Feb 19, 2016

On Fri, Feb 19, 2016 at 08:29:18AM -0800, Patrick Schleizer wrote:

>   1. sudo service sdwdate stop

I guess, should be start here :)

> Do we have a ticket for Qubes to implement dispatching a hook on resume?
> Once that is done, the above is simple.

Yes, #1663.

rootkovska commented Feb 22, 2016

BTW, I have also started observing problems with time syncing in my sys-whonix gw VM. While Tor connects fine, the sdwdate process starts consuming 100% CPU (luckily one core only). The only solution is to stop the service. The date is out of sync and far in the past (e.g. the previous day). Tested with all the latest patches.

adrelanos commented Feb 22, 2016

For reference, there is more information on sclockadj here:

https://groups.google.com/d/msg/qubes-users/QO4He5mZDzc/68iyt4-5BgAJ

adrelanos commented Mar 11, 2016

I might be able to come up with a workaround, using a clock-jump-detector.

Pros:

  • If the clock is less than 1 hour in the past or less than 3 hours in the future
    • no 100% use of cpu issue
    • restored Tor connectivity
  • If the clock is more than 1 hour in the past or more than 3 hours in the future
    • no 100% use of cpu issue
    • (((no restored Tor connectivity)))

Technical details:

A clock-jump-detector could be invented. A script that runs a loop, that stores unixtime in variable A, waits, stores another unixtime in variable B, then compares those.
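
A minimal sketch of such a detector, assuming a plain polling loop is acceptable. The function names, the 10-second interval and the 5-second tolerance are illustrative choices, not anything that exists in sdwdate.

```shell
#!/bin/bash
# Succeeds (exit 0) if the clock moved by something other than the time we slept.
# a, b: unixtime before and after the wait; waited: seconds slept;
# tolerance: allowed deviation in seconds (an arbitrary illustrative parameter).
clock_jumped() {
    local a=$1 b=$2 waited=$3 tolerance=$4
    local drift=$(( b - a - waited ))
    # Absolute value, so jumps into the past are caught as well.
    [ "$drift" -lt 0 ] && drift=$(( -drift ))
    [ "$drift" -gt "$tolerance" ]
}

# The loop described above; defined here but not started.
clock_jump_detector() {
    local interval=${1:-10} a b
    while true; do
        a=$(date +%s)
        sleep "$interval"
        b=$(date +%s)
        if clock_jumped "$a" "$b" "$interval" 5; then
            echo "clock jump detected: ${a} -> ${b}"
        fi
    done
}
```

Any deliberate clock change made while the loop runs would trip the same check.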

When sdwdate sets the time using date (rather than sclockadj) [2], sdwdate would need to:

  1. stop the clock-jump-detector systemd service.
  2. set the time using date.
  3. restart the clock-jump-detector systemd service.

Caveats:

  • during 1) to 3) there is room for a race condition [4], probably happening very seldom
  • when users change the time manually, it will trigger clock-jump-detector
  • As a general Tor (non-Whonix!) issue, if the clock is more than 1 hour in the past or more than 3 hours in the future, Tor can't connect. So there is no sane way (read: only a fingerprintable way) to recover Tor automatically in such situations. [1]
    • This goes for any Tor-inside-a-VM or Tor-on-the-host project.
    • To recover from such situations for a Tor-inside-a-VM (speak: Whonix) project, cooperation of the host (speak: dom0) is required. (That would be qubes.GetRandomizedTime [3]. [Or at least a qubes.GetTime dom0 qrexec service.])
  • It would be a hack. #1663 and qubes.GetRandomizedTime [3] would be a cleaner solution providing better usability [5].

[1] https://www.whonix.org/wiki/Dev/TimeSync#Tor_Consensus_Method
[2] after boot or when manually instructed so
[3] https://groups.google.com/d/msg/qubes-devel/aN3IOv6JmKw/_XOwbV-EAgAJ
[4] meaning, that the clock-jump-detector mechanism would not work then
[5] recover connectivity no matter how long the computer was being suspended
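
The connectivity window described in the caveats above can be written down as a small check. This is a sketch only: the limits are the ones quoted in this thread rather than values taken from Tor's source, and the function name is made up.

```shell
#!/bin/bash
# Limits as quoted in this thread, in seconds (not taken from Tor's source).
MAX_PAST=3600      # clock at most 1 hour behind
MAX_FUTURE=10800   # clock at most 3 hours ahead

# Prints "recoverable" if the local clock is inside the window, else "needs-host-time".
# local_time and real_time are unixtime values; real_time would have to come from
# outside the VM, which is exactly the part that needs dom0's cooperation.
classify_skew() {
    local local_time=$1 real_time=$2
    local skew=$(( local_time - real_time ))
    if [ "$skew" -gt "-$MAX_PAST" ] && [ "$skew" -lt "$MAX_FUTURE" ]; then
        echo "recoverable"
    else
        echo "needs-host-time"
    fi
}
```

A clock that is three days behind, as in the original report, falls into the "needs-host-time" case.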

marmarek commented Mar 11, 2016

I think we can add qubes.GetRandomizedTime, based on bootclockrandomization, to R3.1 and R3.0. Adding a new service is safe in terms of regressions. What is not so safe is changing the qubes.SuspendPre logic (to be called in all the VMs). That needs to be done very carefully.

adrelanos commented Mar 11, 2016

On second thought, unfortunately qubes.GetRandomizedTime is probably of lower priority for a long time. This is because Xen cannot provide us with a fully independent VM clock anyhow. (details 1 / details 2) So the clock correlation attack that it is supposed to defeat cannot be implemented anyhow. (Sure, qubes.GetRandomizedTime would be a nice-to-have.)

Would having a simpler qubes.GetTime service make sense in the meantime? For non-Whonix VMs? Or should it be skipped and qubes.GetRandomizedTime be implemented right away to ease the next iteration far in the future? [to avoid having to restrict, in the far future, which types of VMs get access to qubes.GetTime]

adrelanos added a commit to adrelanos/qubes-core-admin-linux that referenced this issue Mar 11, 2016

implemented dom0 qubes.GetTime
Required for fixing 'sys-whonix doesn't connect to Tor after system suspend'.

QubesOS/qubes-issues#1764

adrelanos added a commit to adrelanos/qubes-core-admin that referenced this issue Mar 11, 2016

implemented dom0 qubes.GetTime
Required for fixing 'sys-whonix doesn't connect to Tor after system suspend'.

QubesOS/qubes-issues#1764

adrelanos commented Mar 11, 2016

Can you please check if QubesOS/qubes-core-admin#22 would make sense?

marmarek commented Mar 11, 2016

If going non-randomized time option, IMHO simply qubes.SetDateTime can be used - providing appropriate Whonix-specific handler. This handler should:

  1. Check if provided time is off more than 180s - if so, probably suspend happened. Or maybe check for +-1h, because that is the range really hurting tor, right?
  2. Depending on above check - either ignore the value, or randomize it slightly and set (taking care to not conflict with sdwdate)

This would be even better than qubes.GetTime, because it would work without any dom0 modification (so just whonix template will be enough, no need to handle multiple cases).
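
The check in step 1 could look something like the sketch below. It covers only the decision. How the handler receives the time, and the randomize-and-set part of step 2, are left out. The function name is hypothetical.

```shell
#!/bin/bash
# Decide what a Whonix-specific time handler should do with a time offered by dom0.
# Prints "ignore" when the VM clock is close enough, "set" when a suspend is likely.
# threshold is in seconds; 180 and 3600 are the two candidates discussed above.
decide_time_action() {
    local provided=$1 current=$2 threshold=$3
    local diff=$(( provided - current ))
    # Absolute difference: the VM clock may be behind or ahead.
    [ "$diff" -lt 0 ] && diff=$(( -diff ))
    if [ "$diff" -gt "$threshold" ]; then
        echo "set"
    else
        echo "ignore"
    fi
}

# Example: dom0 offers a time 3 days ahead of the VM clock.
decide_time_action 1455900000 1455640800 180
```

Passing the threshold as an argument makes it easy to compare the 180 s and ±1 h variants.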

That said, I have nothing against qubes.GetRandomizedTime. The fact that we have to deal with this very ticket proves that it isn't exactly trivial to get host time, even with clocksource=xen.

adrelanos commented Mar 11, 2016

Marek Marczykowski-Górecki:

> If going non-randomized time option, IMHO simply qubes.SetDateTime can be used - providing appropriate Whonix-specific handler.

qubes.SetDateTime is called too often, by qvm-sync-clock, and also at
times when we would rather avoid it. qubes.GetTime could be used upon real
resume only.

Alternatively I was wondering if the [new] hook notifying the VM of
resume should also notify the VM of the current unixtime as an extra
parameter. (#1663) That would be fine until we get a fully independent
VM clock in far future.

>   1. Check if provided time is off more than 180s - if so, probably suspend happened. Or maybe check for +-1h, because that is the range really hurting tor, right?
>   2. Depending on above check - either ignore the value, or randomize it slightly and set (taking care to not conflict with sdwdate)

More certainty than "probably suspend happened" would be good, because
then we could just act as on boot. Short-term plan: use sdwdate, set
time using date. Long-term plan: block networking until sdwdate is
done, set time using sdwdate and date, with sdwdate-like security and
accuracy. That is a better fingerprinting defense than "boot" (resume) clock
randomization after suspend alone.

> The fact that we have to deal with this very ticket proves that it isn't exactly trivial to get host time, even with clocksource=xen.

Not trivial, but also not super hard. Just would require quite some time
for research and development.

  • requires a kernel module that is not exactly simple to set up on Qubes
    generally (following pvgrub instructions)
  • writing a custom kernel module accessing clocksource xen
  • [or some other C / kernel trickery I am not aware of to access
    clocksource xen]
  • dealing with the (un)reliability of clocksource xen. It may work okay
    as long as suspend/resume is not involved (for clock correlation
    attacks) but not after actual suspend/resume.

marmarek commented Mar 12, 2016

>   • writing a custom kernel module accessing clocksource xen

Ok, you're right. It looks like a module is required. Really simple one:

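#include <linux/module.h>
#include <linux/timekeeping.h>
#include <asm/x86_init.h>   /* declares x86_platform */

/* Print the platform wallclock, which is independent of the VM's system time. */
static int __init gettime(void) {
    struct timespec ts;
    x86_platform.get_wallclock(&ts);
    /* Note: %ld does not zero-pad tv_nsec (56864426 ns prints as .56864426);
     * %09ld would give a fixed-width fraction. */
    printk(KERN_INFO "persistent_clock: %ld.%ld\n", ts.tv_sec, ts.tv_nsec);
    /* Fail on purpose: the module never stays loaded, so it can be inserted again. */
    return -1;
}

MODULE_LICENSE("GPL");
module_init(gettime);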

output:

[986617.296157] persistent_clock: 1457743780.493455504
[986629.957794] persistent_clock: 1457743793.155087310
[986630.458305] persistent_clock: 1457743793.655793983
[986630.859566] persistent_clock: 1457743794.56864426
[986631.218631] persistent_clock: 1457743794.415929105

Which is host (or Xen?) time, even after setting VM time to something totally different (with date -s).
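
To compare these values against the VM's own clock, the log lines need to be parsed with some care. The sketch below is a hypothetical helper, not part of any Qubes or Whonix package. It treats the two fields as separate integers, because the fraction in the output above is not zero-padded.

```shell
#!/bin/bash
# Convert a "persistent_clock: SEC.NSEC" log line into whole nanoseconds.
# SEC and NSEC are parsed as two separate integers: ".56864426" in the output
# above means 56864426 ns, not 568644260 ns.
persistent_clock_ns() {
    local line=$1 stamp sec nsec
    # Drop everything up to and including the "persistent_clock: " label.
    stamp=${line##*persistent_clock: }
    sec=${stamp%%.*}
    nsec=${stamp#*.}
    # Strip leading zeros so the shell does not treat the value as octal.
    nsec=$(printf '%s' "$nsec" | sed 's/^0*//')
    nsec=${nsec:-0}
    echo $(( sec * 1000000000 + nsec ))
}

persistent_clock_ns '[986630.859566] persistent_clock: 1457743794.56864426'
```

Dividing the result by 1000000000 and subtracting the output of date +%s would then give the offset between host time and VM time in seconds.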

marmarek commented Mar 12, 2016

And this time is properly loaded back as system time, when the VM is properly suspended before suspending the host. Which is the case for NetVM (or more precisely: VM with some PCI device).
This is something we consider to change in Qubes 4.0 - properly suspend all the VMs before suspending the host.

Anyway, I think it would be better to do qubes.GetRandomizedTime just now, to not rollback/limit qubes.GetTime in the future. Probably a simple copy&paste from bootclockrandomization, so not really much more work.

adrelanos added a commit to adrelanos/bootclockrandomization that referenced this issue Mar 13, 2016

adrelanos added a commit to adrelanos/qubes-core-admin that referenced this issue Mar 13, 2016

implemented dom0 qubes.GetRandomizedTime
Required for fixing 'sys-whonix doesn't connect to Tor after system suspend'.

QubesOS/qubes-issues#1764

adrelanos commented Mar 13, 2016

Okay, agreed.

Please check if QubesOS/qubes-core-admin#23 makes sense.

adrelanos added a commit to adrelanos/sdwdate that referenced this issue Mar 15, 2016

adrelanos commented Mar 15, 2016

We now have:

Next thing I'll be working on is suspend / resume scripts (#1663). After that this ticket is trivial to solve. (Just use #1663 to call the sdwdate suspend handler scripts.)

marmarek commented Mar 15, 2016

This is also in current-testing for R3.1.

adrelanos commented Mar 18, 2016

> Next thing I'll be working on is suspend / resume scripts (#1663).

Was done by Marek.

> After that this ticket is trivial to solve. (Just use #1663 to call the sdwdate suspend handler scripts.)

Done:
Whonix/sdwdate@5c1aea7

adrelanos added a commit to Whonix/sdwdate that referenced this issue Mar 18, 2016

adrelanos commented Mar 19, 2016

Done. Will be released with Whonix 13.


For testing purposes. Inside Whonix.

sudo sh -x /etc/qubes-rpc/qubes.SuspendPreAll
sudo sh -x /etc/qubes-rpc/qubes.SuspendPostAll

Might require a real host system suspend / resume. Does not work yet with suspending / resuming a VM in QVMM. Asked about that: #1663 (comment)

adrelanos commented Mar 24, 2016

Marek Marczykowski-Górecki:

> (BTW @adrelanos, do you consider it as an update to Whonix 12? Or Whonix 13 is soon enough?).

No upgrade to Whonix 12. I release Whonix stable upgrades with the smallest possible diff and as seldom as possible to avoid a disaster. [defined as: broken package manager, breaking connectivity and ability to upgrade for all users at once] There are too many combinations of versions. [Whonix 12 vs 13; Qubes 3.0 vs 3.1; Whonix stable vs testers; Qubes stable vs testing] Very time-consuming to test manually. And we don't have automated Q/A, CI builds, tests, release manager etc. in place. And this is a big change. [sdwdate was rewritten in the meantime and its dependencies changed]

adrelanos commented May 17, 2016

What's next in this ticket? It is fixed in Whonix 13 which is due to be released soon-ish. I just now verified that again.

Do we want to keep tickets open until an upgrade has been released to stable that fixes it?

Or do we want to close tickets as soon as the code to implement them is done?

andrewdavidwong commented May 17, 2016

> Do we want to keep tickets open until an upgrade has been released to stable that fixes it?
>
> Or do we want to close tickets as soon as the code to implement them is done?

It's Marek's call, of course, but from what I've observed, I think the current practice is to close them once the code is done. Then, we have qubes-builder-github for notifications once packages containing fixes/features are available in repos.

marmarek commented May 18, 2016

👍

adrelanos commented Jun 2, 2016

Please close.

@marmarek marmarek closed this Jun 2, 2016
