New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save bandwidth by moderating onion pinging #542

Merged
merged 1 commit into from Aug 5, 2017

Conversation

Projects
9 participants
@zugz

zugz commented May 1, 2017

Currently, a large majority of the traffic in the tox network is generated by
pinging onion nodes with Onion Request packets, which a node does to keep
itself announced and to check for friends coming online. This PR significantly
reduces the rate at which such requests are sent.

In part this is by fixing some bugs; probably most significantly, previously
the nodes for friends were always considered timed out when we came to ping
them, since the timeout for announce nodes was used for the check rather than
that for friend nodes, leading to some extra traffic.

Then it implements two schemes for reducing unnecessary pinging. The first is
to start to trust slightly announce nodes and paths which have been alive for a
while, and not ping them so aggressively - while reverting to the old behaviour
at the first sign of failure. Based on my testing, this reducing announce
traffic by around a factor of 4, from around 2 packets per second to around
~0.5, with no discernable problems.

The second is to gradually reduce the rate of pinging we do to check on offline
friends when they've been offline for some time. Since some users typically
have many offline friends, this should result in a large reduction in traffic.

This last should be the only change which could cause any problems for users.
The principal effect is on friend requests: if we make a friend request
to an offline node, and it remains offline for some time while we stay online,
then when it comes online it may take longer for it to receive the friend
request. With the constants as I've set them, the extreme case is that they
stay offline for at least 1600 minutes, and then the request will take
something like 5 minutes on average to get through. The same delay applies to
established friends who come online after being offline for a while and who
somehow fail to find our announcement - that shouldn't happen, but possibly it
sometimes does (and an attacker can cause it by poisoning our announcement
neighbourhood).

Please note when testing this PR that you shouldn't expect to see a huge
bandwidth reduction - what you should see if you packet-sniff is a reduction
(after your node has been up for a while) in Onion Request packets (first byte
0x80), but it needs a large proportion of the network to adopt it before we'll
start to see actual significant bandwidth reduction.


This change is Reviewable

@zugz

This comment has been minimized.

Show comment
Hide comment
@zugz

zugz May 1, 2017

The other potential problem I meant to mention: this PR reduces the
sensitivity to being disconnected from the network, since this is detected by
seeing if enough time passes without us receiving a response to an onion
request, and now we send requests less regularly it can happen that we don't
receive any responses for a while. I've increased the timeout from 19s to 75s,
which is enough to make false positives very unlikely. If this is a problem,
it wouldn't be too hard to add a mechanism to deliberately test the network if
we receive nothing for say 30s.

zugz commented May 1, 2017

The other potential problem I meant to mention: this PR reduces the
sensitivity to being disconnected from the network, since this is detected by
seeing if enough time passes without us receiving a response to an onion
request, and now we send requests less regularly it can happen that we don't
receive any responses for a while. I've increased the timeout from 19s to 75s,
which is enough to make false positives very unlikely. If this is a problem,
it wouldn't be too hard to add a mechanism to deliberately test the network if
we receive nothing for say 30s.

@SkyzohKey

This comment has been minimized.

Show comment
Hide comment
@SkyzohKey

SkyzohKey May 1, 2017

Does this allows ToxCore to perform better on mobile devices. How I understand that PR it would fix bandwidth issues on mobile, right ?

SkyzohKey commented May 1, 2017

Does this allows ToxCore to perform better on mobile devices. How I understand that PR it would fix bandwidth issues on mobile, right ?

@zugz

This comment has been minimized.

Show comment
Hide comment
@zugz

zugz May 1, 2017

zugz commented May 1, 2017

@mannol

This comment has been minimized.

Show comment
Hide comment
@mannol

mannol May 4, 2017

The problem with mobile is the 'per second' notation. It's not(that much) about the amount of traffic you send but rather how frequent you send that traffic. Mobile devices shouldn't be a part of the DHT at all.

mannol commented May 4, 2017

The problem with mobile is the 'per second' notation. It's not(that much) about the amount of traffic you send but rather how frequent you send that traffic. Mobile devices shouldn't be a part of the DHT at all.

@zoff99

This comment has been minimized.

Show comment
Hide comment
@zoff99

zoff99 May 4, 2017

@mannol actually on mobile the "per second" does not matter when it comes to battery usage.
the things is, you can buy more bandwidth but you can't increase the battery size. if it's empty it's empty. game over.

on mobile (in my opinion) only timeouts really matter. so that the CPU can go into deepsleep and save battery.

zoff99 commented May 4, 2017

@mannol actually on mobile the "per second" does not matter when it comes to battery usage.
the things is, you can buy more bandwidth but you can't increase the battery size. if it's empty it's empty. game over.

on mobile (in my opinion) only timeouts really matter. so that the CPU can go into deepsleep and save battery.

@mannol

This comment has been minimized.

Show comment
Hide comment
@mannol

mannol May 4, 2017

@zoff99 the CPU is woken/prevented from sleeping every time the packet is received on the radio.

mannol commented May 4, 2017

@zoff99 the CPU is woken/prevented from sleeping every time the packet is received on the radio.

@zoff99

This comment has been minimized.

Show comment
Hide comment
@zoff99

zoff99 May 4, 2017

@mannol yeah, that's why we need at least 2 minutes total silence on Android to get some deep sleep.
so if there are no active calls and no active filetransfers, and no recent chat messages outgoing or incoming, mobile-c-toxcore should sleep for 2 (or a bit more) minutes without doing anything on the network. (and then continure WITHOUT bootstrapping again)

tests have shown it does wonders to your battery life

zoff99 commented May 4, 2017

@mannol yeah, that's why we need at least 2 minutes total silence on Android to get some deep sleep.
so if there are no active calls and no active filetransfers, and no recent chat messages outgoing or incoming, mobile-c-toxcore should sleep for 2 (or a bit more) minutes without doing anything on the network. (and then continure WITHOUT bootstrapping again)

tests have shown it does wonders to your battery life

@mannol

This comment has been minimized.

Show comment
Hide comment
@mannol

mannol May 4, 2017

The problem is: when in a DHT, the device has to communicate with the DHT by responding on DHT requests. That's why mobile devices shouldn't be a part of the DHT at all.

mannol commented May 4, 2017

The problem is: when in a DHT, the device has to communicate with the DHT by responding on DHT requests. That's why mobile devices shouldn't be a part of the DHT at all.

@zoff99

This comment has been minimized.

Show comment
Hide comment
@zoff99

zoff99 May 4, 2017

if in the near future 80% of all tox devices will be mobile devices (without beeing part of the DHT), will DHT still work as expected?

zoff99 commented May 4, 2017

if in the near future 80% of all tox devices will be mobile devices (without beeing part of the DHT), will DHT still work as expected?

@mannol

This comment has been minimized.

Show comment
Hide comment
@mannol

mannol May 4, 2017

Probably.

mannol commented May 4, 2017

Probably.

@GrayHatter

This comment has been minimized.

Show comment
Hide comment
@GrayHatter

GrayHatter May 11, 2017

Currently, a large majority of the traffic in the tox network is generated by
pinging onion nodes with Onion Request packets,

I can haz dis data tooo?

In part this is by fixing some bugs; probably most significantly, previously
the nodes for friends were always considered timed out when we came to ping
them, since the timeout for announce nodes was used for the check rather than
that for friend nodes, leading to some extra traffic.

Which commit[s] fixes this?

Then it implements two schemes for reducing unnecessary pinging.

reading through the code, what would you think about externalizing this logic a bit. E.g., can we make the decision in one function, and do the work in another. Instead of the current decide+act in the same functions.

The first is to start to trust slightly announce nodes and paths which have been alive for a
while, and not ping them so aggressively [...] with no discernable problems.

Onion paths, so if one of the nodes goes down, they whole route is stale? And what time frame did you do you testing.

The second is to gradually reduce the rate of pinging we do to check on offline
friends when they've been offline for some time. [...] With the constants as I've set them, the extreme case is that they stay offline for at least 1600 minutes, and then the request will take
something like 5 minutes on average to get through.

Is that friend requests, or friend connections.

The same delay applies to established friends who come online after being offline for a while and who somehow fail to find our announcement - that shouldn't happen, but possibly it sometimes does (and an attacker can cause it by poisoning our announcement neighbourhood).

Tox isn't very adaptive to poor network conditions, if I'm on a bad connection, would this diff change it from hard to make a connection, to impossible?

The other potential problem I meant to mention: this PR reduces the
sensitivity to being disconnected from the network, since this is detected by
seeing if enough time passes without us receiving a response to an onion
request, and now we send requests less regularly it can happen that we don't
receive any responses for a while. I've increased the timeout from 19s to 75s,
which is enough to make false positives very unlikely. If this is a problem,
it wouldn't be too hard to add a mechanism to deliberately test the network if
we receive nothing for say 30s.

Which network? The onion network, or the DHT network?

It should help a lot, but tox will always use a fair amount of bandwidth. Making up numbers which might be way off, I guess this should get it down to around 2-3KBps in each direction when idling,

What data are you using to get this estimate?

and with more work (optimising DHT traffic, and making sure we only send good nodes in response to onion requests) I have the impresion that as low as 1 could be feasible. I don't know if that's good enough for mobile users.

Yeah, as @mannol said. Delay in between is what helps mobile users the most. The goal that will best help mobile users will be LONG delays between packets, not just fewer packets. That said, this will be make it much easier to use on metered connections.

The DHT backend is what makes tox a good protocol. The problem is not that tox uses the DHT on mobile, the problem is the DHT is dumb. I smart DHT protocol should be able to bootstrap, and find and connect to ALL online friends in under 5s. Creating and validating onion routes is one of the things that makes this hard to do. So any optimizations of onion routing is a good place to start. (A part from dropping the onion for announce/friend finding)

GrayHatter commented May 11, 2017

Currently, a large majority of the traffic in the tox network is generated by
pinging onion nodes with Onion Request packets,

I can haz dis data tooo?

In part this is by fixing some bugs; probably most significantly, previously
the nodes for friends were always considered timed out when we came to ping
them, since the timeout for announce nodes was used for the check rather than
that for friend nodes, leading to some extra traffic.

Which commit[s] fixes this?

Then it implements two schemes for reducing unnecessary pinging.

reading through the code, what would you think about externalizing this logic a bit. E.g., can we make the decision in one function, and do the work in another. Instead of the current decide+act in the same functions.

The first is to start to trust slightly announce nodes and paths which have been alive for a
while, and not ping them so aggressively [...] with no discernable problems.

Onion paths, so if one of the nodes goes down, they whole route is stale? And what time frame did you do you testing.

The second is to gradually reduce the rate of pinging we do to check on offline
friends when they've been offline for some time. [...] With the constants as I've set them, the extreme case is that they stay offline for at least 1600 minutes, and then the request will take
something like 5 minutes on average to get through.

Is that friend requests, or friend connections.

The same delay applies to established friends who come online after being offline for a while and who somehow fail to find our announcement - that shouldn't happen, but possibly it sometimes does (and an attacker can cause it by poisoning our announcement neighbourhood).

Tox isn't very adaptive to poor network conditions, if I'm on a bad connection, would this diff change it from hard to make a connection, to impossible?

The other potential problem I meant to mention: this PR reduces the
sensitivity to being disconnected from the network, since this is detected by
seeing if enough time passes without us receiving a response to an onion
request, and now we send requests less regularly it can happen that we don't
receive any responses for a while. I've increased the timeout from 19s to 75s,
which is enough to make false positives very unlikely. If this is a problem,
it wouldn't be too hard to add a mechanism to deliberately test the network if
we receive nothing for say 30s.

Which network? The onion network, or the DHT network?

It should help a lot, but tox will always use a fair amount of bandwidth. Making up numbers which might be way off, I guess this should get it down to around 2-3KBps in each direction when idling,

What data are you using to get this estimate?

and with more work (optimising DHT traffic, and making sure we only send good nodes in response to onion requests) I have the impresion that as low as 1 could be feasible. I don't know if that's good enough for mobile users.

Yeah, as @mannol said. Delay in between is what helps mobile users the most. The goal that will best help mobile users will be LONG delays between packets, not just fewer packets. That said, this will be make it much easier to use on metered connections.

The DHT backend is what makes tox a good protocol. The problem is not that tox uses the DHT on mobile, the problem is the DHT is dumb. I smart DHT protocol should be able to bootstrap, and find and connect to ALL online friends in under 5s. Creating and validating onion routes is one of the things that makes this hard to do. So any optimizations of onion routing is a good place to start. (A part from dropping the onion for announce/friend finding)

@iphydf iphydf added this to the v0.1.9 milestone May 13, 2017

@zugz

This comment has been minimized.

Show comment
Hide comment
@zugz

zugz May 14, 2017

zugz commented May 14, 2017

@iphydf iphydf modified the milestones: v0.1.9, v0.1.10 Jun 5, 2017

@CLAassistant

This comment has been minimized.

Show comment
Hide comment
@CLAassistant

CLAassistant Jul 1, 2017

CLA assistant check
All committers have signed the CLA.

CLAassistant commented Jul 1, 2017

CLA assistant check
All committers have signed the CLA.

@robinlinden

This comment has been minimized.

Show comment
Hide comment
@robinlinden

robinlinden Jul 2, 2017

Member

:lgtm_strong:


Reviewed 1 of 2 files at r1, 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


toxcore/onion_client.h, line 82 at r3 (raw file):

    /* number of times pinged without success. */
    unsigned int last_pinged_times;

uint8_t and maybe unsuccessful_pings? Would mean the comment isn't needed.


Comments from Reviewable

Member

robinlinden commented Jul 2, 2017

:lgtm_strong:


Reviewed 1 of 2 files at r1, 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


toxcore/onion_client.h, line 82 at r3 (raw file):

    /* number of times pinged without success. */
    unsigned int last_pinged_times;

uint8_t and maybe unsuccessful_pings? Would mean the comment isn't needed.


Comments from Reviewable

@robinlinden

This comment has been minimized.

Show comment
Hide comment
@robinlinden

robinlinden Jul 21, 2017

Member

Reviewed 2 of 2 files at r6.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

Member

robinlinden commented Jul 21, 2017

Reviewed 2 of 2 files at r6.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@robinlinden robinlinden merged commit 5dd2557 into TokTok:master Aug 5, 2017

4 of 6 checks passed

ci/circleci CircleCI is running your tests
Details
license/cla Contributor License Agreement is not signed yet.
Details
code-review/reviewable 1/1 LGTMs
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.05%) to 73.955%
Details

@SkyzohKey SkyzohKey added this to Triage in Group Chats via automation Nov 19, 2017

@SkyzohKey SkyzohKey added the PR:ready label Nov 19, 2017

@SkyzohKey SkyzohKey added this to Triage in Mobile optimizations via automation Nov 19, 2017

@SkyzohKey SkyzohKey removed this from Triage in Group Chats Nov 19, 2017

@SkyzohKey SkyzohKey moved this from Triage to Done in Mobile optimizations Nov 19, 2017

@SkyzohKey SkyzohKey added PR:shipped and removed PR:ready labels Nov 28, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment