
Gossip far less in steady state #3064

Merged
rustyrussell merged 16 commits into ElementsProject:master from rustyrussell:gossip-far-less on Sep 20, 2019

Conversation

@rustyrussell (Contributor) commented Sep 16, 2019:

Get our own gossip generation rate down:

  1. Never generate two gossip msgs for the same channel/node less than 5 minutes apart.
  2. More generally, suppress identical gossip messages we generate.
  3. Refresh channels every 13 days, not 7.

Filter others' gossip more aggressively:

  1. Ignore "nothing changed" channel_update and node_announcement messages (allow refresh 1 per 7 days)
  2. Generally ratelimit to 1 update per day, with a burst of up to 4.
  3. Ignore messages > 1 day in future, or 2 weeks in past.

And numerous support cleanups along the way.
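As an aside, here is a minimal sketch of the age/recency filter in point 3 of the list above; the function name and exact cutoffs are illustrative, not necessarily the PR's:

    #include <stdbool.h>
    #include <stdint.h>

    /* Reject gossip timestamped more than a day ahead of our clock,
     * or more than two weeks behind it. */
    #define MAX_FUTURE_SECONDS (24 * 60 * 60)
    #define MAX_PAST_SECONDS   (14 * 24 * 60 * 60)

    static bool timestamp_reasonable(uint32_t timestamp, int64_t now)
    {
            if ((int64_t)timestamp > now + MAX_FUTURE_SECONDS)
                    return false;   /* Too far in the future. */
            if ((int64_t)timestamp < now - MAX_PAST_SECONDS)
                    return false;   /* Too old to be worth relaying. */
            return true;
    }

The window is asymmetric on purpose: up to a day of clock skew is tolerated, while anything older than the two-week prune horizon would generally be discarded by peers anyway.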

sig is only non-const so we can override it if NULL, but talz helps
us here.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Write helpers to split it into non-timestamp, non-signature parts,
and simply compare those.  We extract a helper to do channel_update, too.

This is more generic than our previous approach, and simpler.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
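To make the commit message above concrete, here is a rough sketch of comparing two raw channel_update messages while ignoring signature and timestamp; the helper name and offset macros are mine (offsets follow BOLT #7), not the PR's actual code:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* BOLT #7 channel_update layout (raw wire bytes, including the type):
     *   u16 type | 64B signature | 32B chain_hash | 8B short_channel_id
     *   | u32 timestamp | flags, fees, htlc limits... */
    #define CU_SIG_OFF  2
    #define CU_SIG_LEN  64
    #define CU_TS_OFF   (CU_SIG_OFF + CU_SIG_LEN + 32 + 8)
    #define CU_REST_OFF (CU_TS_OFF + 4)

    /* Do two channel_updates differ in anything other than signature
     * and timestamp? */
    static bool cupdate_differs(const unsigned char *a, size_t alen,
                                const unsigned char *b, size_t blen)
    {
            if (alen != blen || alen < CU_REST_OFF)
                    return true;
            /* Type bytes. */
            if (memcmp(a, b, CU_SIG_OFF) != 0)
                    return true;
            /* chain_hash and short_channel_id, between sig and timestamp. */
            if (memcmp(a + CU_SIG_OFF + CU_SIG_LEN, b + CU_SIG_OFF + CU_SIG_LEN,
                       CU_TS_OFF - (CU_SIG_OFF + CU_SIG_LEN)) != 0)
                    return true;
            /* Everything after the timestamp. */
            return memcmp(a + CU_REST_OFF, b + CU_REST_OFF,
                          alen - CU_REST_OFF) != 0;
    }

The same split-and-compare idea applies to node_announcement, whose timestamp sits after the variable-length features field rather than at a fixed offset.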
Default is 5 x gossip interval == 5 minutes.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Never make them less than gossip_min_interval apart.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell added the gossip label Sep 16, 2019
@rustyrussell rustyrussell added this to the 0.7.3 milestone Sep 16, 2019
@rustyrussell rustyrussell requested a review from cdecker as a code owner Sep 16, 2019
@rustyrussell rustyrussell added this to To Do in Gossip: sip, don't gulp. via automation Sep 16, 2019
@cdecker (Member) left a comment:

Just some clarifications for my own benefit, and some minor points about having sticky open channels because we over-aggressively discard disabling updates.

@@ -196,6 +196,8 @@ void gossip_init(struct lightningd *ld, int connectd_fd)
get_offered_globalfeatures(tmpctx),
ld->rgb,
ld->alias, ld->config.channel_update_interval,
/* gossip_min_interval: 5x the broadcast interval */
ld->config.broadcast_interval_msec / 200,

@cdecker (Member), Sep 16, 2019:

This constant needs a bit more explanation.

@rustyrussell (Author), Sep 16, 2019:

msec->sec, but it's replaced in a future patch anyway?

* - MUST set `timestamp` to be greater than that of any
* previous `node_announcement` it has previously created.
*/
/* We do better: never send them less than 5 minutes apart. */

@cdecker (Member), Sep 16, 2019:

As a side note: if we try to send a node announcement more frequently than every 5 minutes we will actually never send anything except maybe the first one: the tal_free on line 576 will cancel any queued announcement, replacing it with a new one, which then gets cancelled again in the next iteration. Should probably warn against this somewhere :-)

@rustyrussell (Author), Sep 16, 2019:

? We'll re-run this after 5 minutes, generate a new node_announcement, and now the timestamp will be ok?

else if (node_id_eq(&chan->nodes[1]->id, &rstate->local_id))
direction = 1;
else
return NULL;

@cdecker (Member), Sep 16, 2019:

Shouldn't this be fatal()?

@rustyrussell (Author), Sep 16, 2019:

No, it should be called maybe_add_local_chan() though....

}

/*~ This is when channeld asks us for a `channel_update` for a local channel.
/*~ This is when channeld asks us for a channel_update for a local channel.
* It does that to fill in the error field when lightningd fails an HTLC and
* sets the UPDATE bit in the error type. lightningd is too important to
* fetch this itself, so channeld does it (channeld has to talk to us for
* other things anyway, so why not?). */
static bool handle_get_update(struct peer *peer, const u8 *msg)

@cdecker (Member), Sep 16, 2019:

I was confused for a while, maybe we should rename this function to be handle_get_local_channel_update just to make it clear to lazy people like me, who don't read the docstring, what this handles 😉

/* Create an unsigned channel_update: we backdate enables, so
* we can always send a disable in an emergency. */
if (!lc->disable)
timestamp -= daemon->gossip_min_interval;

@cdecker (Member), Sep 16, 2019:

Are we sure this won't land below the previous gossip message's timestamp, which would make this one look outdated and therefore get discarded? This can happen if at time 5m we queue an update, and then at time 9m we queue an emergency update (disable). The backdated timestamp for the emergency is 4m, which is older than the prior regular update, making it obsolete.

I don't think we can set a correct timestamp without any awareness of what the previous one was, which is what we did before. Or am I missing something?

@rustyrussell (Author), Sep 16, 2019:

This will hit the "back off and retry" logic:

		/* Is it too soon to send another update? */
		next = hc->bcast.timestamp
			+ GOSSIP_MIN_INTERVAL(daemon->rstate->dev_fast_gossip);
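For readers following along, a rough sketch of how the backdating and the back-off interact, with illustrative names and the PR's default 5-minute interval (this is a simplification, not the actual update_local_channel code):

    #include <stdbool.h>
    #include <stdint.h>

    #define GOSSIP_MIN_INTERVAL_SEC 300     /* 5 minutes */

    struct update_plan {
            uint32_t timestamp;
            bool defer;             /* too soon: arm a timer for retry_at */
            uint32_t retry_at;
    };

    static struct update_plan plan_local_update(uint32_t now, bool disable,
                                                uint32_t last_bcast_timestamp)
    {
            struct update_plan p;

            /* Is it too soon to send another update for this half-channel? */
            p.retry_at = last_bcast_timestamp + GOSSIP_MIN_INTERVAL_SEC;
            p.defer = (now < p.retry_at);

            /* Backdate enables, so an emergency disable stamped "now" is
             * always strictly newer than the last enable we sent. */
            p.timestamp = disable ? now : now - GOSSIP_MIN_INTERVAL_SEC;
            return p;
    }

The idea being: because actual broadcasts are kept at least gossip_min_interval apart, a timestamp backdated by that same interval cannot fall below the previously broadcast one.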
@@ -0,0 +1,456 @@
/* Routines to make our own gossip messages */

@cdecker (Member), Sep 16, 2019:

The filename kind of implies that it is a utility that compiles to a binary. Wouldn't gossip_msgs.c or gossip_wire.c be better?

@rustyrussell (Author), Sep 16, 2019:

No, these are the routines which generate the gossip for our own node and channels. But the name "generate" is used elsewhere in the source... agree it could be better.

@@ -147,6 +147,12 @@ static void update_own_node_announcement(struct daemon *daemon)
/* Discard existing timer. */
daemon->node_announce_timer = tal_free(daemon->node_announce_timer);

/* If we ever use set-based propagation, ensuring we toggle the lower
 * bit in consecutive timestamps makes it more robust. */

@cdecker (Member), Sep 16, 2019:

A link to the rationale would be great here 😉

@rustyrussell (Author), Sep 16, 2019:

Future patch :)

if (daemon->rstate->prune_timeout < 24*3600)
highwater = now - daemon->rstate->prune_timeout / 2;
else
highwater = now - (daemon->rstate->prune_timeout - 24*3600);

@cdecker (Member), Sep 16, 2019:

This feels rather dangerous: if we are online for the 13 days since our last update, and then happen to be offline for exactly that last 24-hour window, our channel gets pruned despite us being up >90% of the time. With better set-reconciliation, I agree we could then get this stuff back, but for now sending after half the prune interval makes sure we'd have to be offline for 7+ days before we get pruned.

Notice that the channel would have been soft-disabled anyway, since our peers will have sent out the disabling update, so the node can't really be reached; hence the long timeout.

@rustyrussell (Author), Sep 16, 2019:

Does it matter? There's no danger in being pruned, as we'll come back and get unpruned? And it would take both sides to be offline to be pruned like this.
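To make the arithmetic concrete, here is the quoted highwater calculation worked through with the usual two-week prune timeout (constants illustrative; the PR reads the timeout from the routing state):

    #include <stdint.h>

    #define PRUNE_TIMEOUT_SEC (14 * 24 * 3600)      /* two weeks */

    static uint32_t refresh_highwater(uint32_t now)
    {
            if (PRUNE_TIMEOUT_SEC < 24 * 3600)
                    return now - PRUNE_TIMEOUT_SEC / 2;     /* tiny dev timeouts */
            /* Normal case: refresh anything not updated in the last 13 days,
             * leaving a 24-hour window before peers would prune it. */
            return now - (PRUNE_TIMEOUT_SEC - 24 * 3600);   /* now - 13 days */
    }

So the refresh point moves from "half the prune interval" (7 days) to "prune interval minus one day" (13 days), which is exactly the trade-off being debated in this thread.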

}

/* Make sure it's not spamming us. */
if (!ratelimit(&hc->tokens, hc->bcast.timestamp, timestamp)) {

@cdecker (Member), Sep 16, 2019:

If we exhaust our rate for channel_updates and then a disable comes in, we will be stuck with an open channel that should have been soft-disabled?

@cdecker (Member), Sep 16, 2019:

Or could this be sidestepped, by adding a timestamp that is 6 hours after the previous one? In this case this needs documentation in the spec.

@rustyrussell (Author), Sep 16, 2019:

I thought about adding an extra once-off burst allowance for disable. But really, nodes wouldn't use it, since they're misbehaving at this point anyway. Wiser would be to track channels which are saturated and mark them disabled? I left a bit for that :)


/* First, top up tokens, avoiding overflow. */
num_tokens = *tokens + ((new_timestamp - prev_timestamp)
/ GOSSIP_TOKEN_TIME(rstate->dev_fast_gossip));

@cdecker (Member), Sep 16, 2019:

This may lead us to effectively allow fewer than 1 update per day, since the division will round down.

@cdecker (Member), Sep 16, 2019:

Not a big problem in ratelimiting, but it may have some strange effects if other implementations expect to be able to send one a day: we may end up accepting just one every two days.

@rustyrussell (Author), Sep 16, 2019:

In practice, 1 per day is walking the edge already I think.
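For context, a minimal token-bucket sketch of the ratelimit() being discussed, with illustrative constants matching the PR description (one token per day, burst of up to 4); the integer division is where the rounding-down effect cdecker mentions comes from:

    #include <stdbool.h>
    #include <stdint.h>

    #define GOSSIP_TOKEN_TIME_SEC (24 * 3600)   /* earn one token per day */
    #define TOKEN_MAX             4             /* burst of up to 4 */

    static bool ratelimit_sketch(uint8_t *tokens,
                                 uint32_t prev_timestamp, uint32_t new_timestamp)
    {
            /* Top up tokens; integer division rounds down, so 23h59m of
             * elapsed time earns nothing. */
            uint64_t num = *tokens + (uint64_t)(new_timestamp - prev_timestamp)
                                     / GOSSIP_TOKEN_TIME_SEC;
            if (num > TOKEN_MAX)
                    num = TOKEN_MAX;        /* avoid unbounded bursts */

            if (num < 1)
                    return false;           /* over the limit: ignore update */

            *tokens = num - 1;              /* each accepted update costs one */
            return true;
    }

Because leftover fractional days are discarded rather than carried over, a peer whose updates arrive just under 24 hours apart earns no token for most of them and ends up with roughly one accepted update every two days, which is the effect discussed above.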

Normally we'd put a pointer into struct half_chan for local
information, but it would be NULL on 99.99% of nodes.  Instead, keep a
separate hash table.

This immediately subsumes the previous "map of local-disabled
channels", and will be enhanced further.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
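A small sketch of the side-table idea, assuming a lookup keyed by the numeric short_channel_id (the real code keeps a proper hash table; the fixed-size probing here is just to keep the example self-contained):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Local-only channel state lives here instead of bloating every
     * struct half_chan with a pointer that is NULL on almost all nodes. */
    struct local_chan_sketch {
            uint64_t scid;          /* short_channel_id */
            int direction;          /* which side of the channel we are */
            bool local_disabled;
            bool in_use;
    };

    #define LOCAL_CHAN_SLOTS 64

    static struct local_chan_sketch local_chans[LOCAL_CHAN_SLOTS];

    static struct local_chan_sketch *local_chan_find(uint64_t scid)
    {
            /* Linear probe from a trivial hash; fine for a handful of
             * local channels. */
            for (size_t i = 0; i < LOCAL_CHAN_SLOTS; i++) {
                    size_t slot = (size_t)(scid + i) % LOCAL_CHAN_SLOTS;
                    if (!local_chans[slot].in_use)
                            return NULL;
                    if (local_chans[slot].scid == scid)
                            return &local_chans[slot];
            }
            return NULL;
    }

The payoff is what the commit message says: the old map of locally-disabled channels collapses into one bool here, and later local-update state (such as the pending-update timer) can hang off the same structure.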
gossipd: clean up local channel updates.

Make update_local_channel use a timer if it's too soon to make another
update.

1. Implement cupdate_different() which compares two updates.
2. Make update_local_channel() take a single arg for timer usage.
3. Set timestamp of non-disable update back 5 minutes, so we can
   always generate a disable update if we need to.
4. Make update_local_channel() itself do the "unchanged update" suppression.
5. Keep a pointer to the current timer so we override any old updates with
   a new one, to avoid a race.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
gossipd.c is doing too many things: this is a start.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
One day is plenty of time to propagate the update.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We can look up local channels directly now, which offers simplifications.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is useful for various "partial timestamp" forms of propagation
in future, esp. minisketch.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If you send a message which simply changes timestamp and signature, we
drop it.  You shouldn't be doing that, and the door to ignoring them
was opened by option_gossip_query_ex, which would allow clients to
ignore updates with the same checksum.

This is more aggressive at reducing spam messages, but we still allow refreshes
(to be conservative, we allow them even when only half way through the
refresh period).

I dropped the now-unnecessary sleep from test_gossip_pruning, too.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
And similar for node_announcement.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
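Putting the two commit messages above together, the acceptance rule for an otherwise-identical update might look roughly like this (constants illustrative: half of the two-week prune period):

    #include <stdbool.h>
    #include <stdint.h>

    #define PRUNE_PERIOD_SEC      (14 * 24 * 3600)
    #define REFRESH_ALLOWANCE_SEC (PRUNE_PERIOD_SEC / 2)    /* 7 days */

    /* An incoming channel_update or node_announcement that differs only in
     * signature/timestamp is kept only if it is old enough to count as a
     * refresh; otherwise it is dropped as spam. */
    static bool accept_unchanged_refresh(uint32_t have_ts, uint32_t new_ts)
    {
            return new_ts >= have_ts + REFRESH_ALLOWANCE_SEC;
    }

Updates that actually change content are presumably still subject to the general ratelimit described earlier rather than this check.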
@rustyrussell rustyrussell force-pushed the rustyrussell:gossip-far-less branch from ac02ac4 to 442414a Sep 16, 2019
@rustyrussell (Contributor, Author) commented Sep 17, 2019:

Weird crash under Travis, trying to diagnose... Fixed a real bug! Nice...

@rustyrussell rustyrussell force-pushed the rustyrussell:gossip-far-less branch 2 times, most recently from 50ccdd1 to a2f5c78 Sep 17, 2019
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
…sip.

It's generally clearer to have simple hardcoded numbers with an
#if DEVELOPER around them than apparent variables which aren't really variable.

Interestingly, our pruning test was always kinda broken: we have to pass
two cycles, since l2 will refresh the channel once to avoid pruning.

Do the more obvious thing: cut the network in half and check that
l1 and l3 time out.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
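The pattern referred to above, sketched with illustrative values (the dev_fast_gossip toggle is the one visible in the review snippets earlier in this thread):

    #ifndef DEVELOPER
    #define DEVELOPER 0     /* normally defined by the build system */
    #endif

    /* Hardcode the real interval; DEVELOPER builds can shrink it so tests
     * don't wait minutes for gossip timers. */
    #if DEVELOPER
    #define GOSSIP_MIN_INTERVAL(dev_fast_gossip) \
            ((dev_fast_gossip) ? 5 : 300)
    #else
    #define GOSSIP_MIN_INTERVAL(dev_fast_gossip) 300
    #endif

This matches how GOSSIP_MIN_INTERVAL(daemon->rstate->dev_fast_gossip) is used above: callers don't need to know whether they are in a developer build.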
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Suggested-by: @cdecker.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell force-pushed the rustyrussell:gossip-far-less branch from a2f5c78 to 2bd929e Sep 18, 2019
@rustyrussell rustyrussell requested a review from cdecker Sep 19, 2019
@rustyrussell rustyrussell merged commit 6a8d18c into ElementsProject:master Sep 20, 2019
2 of 3 checks passed
ackbot: Need at least 1 ACK
bitcoin-bot/fixups: PR does not contain unsquashed fixups
continuous-integration/travis-ci/pr: The Travis CI build passed
Gossip: sip, don't gulp. automation moved this from To Do to Done Sep 20, 2019