
Gossip rework part 3: gossipd now uses gossmap #6941

Merged


rustyrussell
Contributor

@rustyrussell commented Dec 14, 2023

~~Based on #6904~~ Merged into master!

Lots of moving code out of gossipd/routing.c, followed by getting rid of spam and zombie code.

Then we switch everything across to our new "gossmap_manage" API, and clean up.

@cdecker
Member

cdecker commented Jan 31, 2024

Rebased on top of master, reviewing now.

@rustyrussell
Contributor Author

Tweaked the overstrict test change in the final commit.

The fuzzer is complaining about leaks, but it seems to be an OpenSSL thing; this series doesn't contain any fuzz changes, either...

@cdecker
Member

cdecker commented Feb 1, 2024

Yeah, I noticed the fuzzer complaining in some other PRs too. We should likely split out the fuzzer into a cron-triggered CI run instead, since at best it can catch newly introduced bugs (but randomized), and at worst it delays a completely unrelated PR just because it tried some other path through the program and hit a latent one.

@cdecker
Member

cdecker commented Feb 1, 2024

Rebased on top of master.

ACK 23cc4f3

@morehouse
Contributor

@cdecker

Yeah, I noticed the fuzzer complaining in some other PRs too. We should likely split out the fuzzer into a cron-triggered CI run instead, since at best it can catch newly introduced bugs (but randomized), and at worst it delays a completely unrelated PR just because it tried some other path through the program and hit a latent one.

This is not how the fuzz regression tests work in CI. No new fuzzing is done, the fuzz target is only run on the existing inputs in the corpus. The regression tests really do belong in CI, so they can actually detect regressions.

I haven't looked in detail, but one particular fuzz test seems to be failing due to a new bug in OpenSSL -- perhaps the CI container recently upgraded OpenSSL? We should really figure out what's going on and fix that, or if necessary disable the particular test that is failing. Getting rid of all fuzz regression testing in CI (#7029) is not the right solution IMO.

@rustyrussell
Contributor Author

@cdecker

Yeah, I noticed the fuzzer complaining in some other PRs too. We should likely split out the fuzzer into a cron-triggered CI run instead, since at best it can catch newly introduced bugs (but randomized), and at worst it delays a completely unrelated PR just because it tried some other path through the program and hit a latent one.

This is not how the fuzz regression tests work in CI. No new fuzzing is done, the fuzz target is only run on the existing inputs in the corpus. The regression tests really do belong in CI, so they can actually detect regressions.

I haven't looked in detail, but one particular fuzz test seems to be failing due to a new bug in OpenSSL -- perhaps the CI container recently upgraded OpenSSL? We should really figure out what's going on and fix that, or if necessary disable the particular test that is failing. Getting rid of all fuzz regression testing in CI (#7029) is not the right solution IMO.

I agree, and will create a patch to re-enable all but this one until it's fixed...

@cdecker
Member

cdecker commented Feb 2, 2024

Good point, and thanks @morehouse for the additional context. I did not know it was only regression testing, though I could have imagined as much from the workflow name :-)

The CI and its brittleness have long been a major issue for CLN maintainers, and this is a good example: the first assumption is always that it must be a flake, not that there is something really wrong. Something something signal vs noise :-)

@morehouse
Contributor

Here's a PR to revert #7029 and avoid the LSan reports: #7032

The only way you'll see private channel_updates is if you put them
there yourself with localmods.

I also renamed the confusing gossmap_chan_capacity to gossmap_chan_has_capacity.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
gossip_store_del: takes a gossmap-style offset-of-msg, not offset-of-hdr.
gossip_store_flag: sets an arbitrary flag on a gossip_store hdr.
gossip_store_get_timestamp/gossip_store_set_timestamp: access a gossip_store hdr.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
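For orientation, here is a rough sketch of what such helpers could look like. The names are the ones listed in the commit message; the parameter lists and the fixed record header they imply are assumptions for illustration, not the actual declarations in gossipd's gossip_store.h.

```c
/* Illustrative only: signatures are assumed, not copied from gossip_store.h.
 * The convention being described is that callers pass the gossmap-style
 * offset of the *message*, and the helpers step back over the record
 * header themselves. */
#include <stdbool.h>
#include <stdint.h>

struct gossip_store;

/* Delete the record whose message starts at msg_off (not the header offset). */
bool gossip_store_del(struct gossip_store *gs, uint64_t msg_off, int type);

/* Set an arbitrary flag bit in the record header preceding msg_off. */
void gossip_store_flag(struct gossip_store *gs, uint64_t msg_off, uint16_t flag);

/* Read or write the timestamp field stored in the record header. */
uint32_t gossip_store_get_timestamp(struct gossip_store *gs, uint64_t msg_off);
void gossip_store_set_timestamp(struct gossip_store *gs, uint64_t msg_off,
				uint32_t timestamp);
```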
And helpers to tell if a node_announcement exists, and get a
full channel_update.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In particular, allow callers to see unknown records we ignore (and let
them fail as a result), and get called if we can't pack a
channel_update into our internal format.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It was an obscure dev command, as it never worked reliably.
It would be much easier to re-implement once this is done.

This turned out to reveal a tiny leak on
tests/test_gossip.py::test_gossip_store_load_amount_truncated where we
didn't immediately free chan_ann if it was dangling.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Makes it easier to wean off routing.c.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is easier for future callers, which don't have a convenient peer
structure: in particular, asynchronous processing of gossip for peers.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Again, we don't necessarily have a peer pointer.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Most nodes don't really care about exact timestamps on gossip filters,
so just keep a flag on whether we have anything in the gossip_store,
and use that to determine whether we ask peers for everything.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
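As a concrete illustration of that trade-off (not the actual gossipd code), the peer filter could be chosen from a single flag like this; first_timestamp and timestamp_range are the real BOLT 7 gossip_timestamp_filter fields, while the helper and the six-hour backdate are arbitrary choices for the sketch.

```c
/* Hypothetical sketch: choose gossip_timestamp_filter parameters from a
 * single "do we have any gossip at all?" flag, instead of tracking exact
 * timestamps per peer. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct gossip_filter {
	uint32_t first_timestamp;
	uint32_t timestamp_range;
};

static struct gossip_filter choose_filter(bool store_has_gossip)
{
	struct gossip_filter f;

	if (!store_has_gossip) {
		/* Empty store: ask the peer for everything it has. */
		f.first_timestamp = 0;
		f.timestamp_range = UINT32_MAX;
	} else {
		/* Otherwise only ask for reasonably fresh gossip; the exact
		 * backdate is an arbitrary illustrative choice. */
		f.first_timestamp = (uint32_t)time(NULL) - 6 * 60 * 60;
		f.timestamp_range = UINT32_MAX;
	}
	return f;
}
```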
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We never enabled it, because we seemed to be eliminating valid
channels.  We discard zombie-marked records on loading.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We weakened this progressively over time, and gossip v1.5 makes spam
impossible by protocol, so we can wait until then.

Removing this code simplifies things a great deal!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Removed: Protocol: we no longer ratelimit gossip messages by channel, making our code far simpler.
…ntries.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is a fair amount of code, but much is taken from
the old routing.c, with the difference that this uses
common/gossmap instead of our own structures.

The interfaces are fairly clear (see the sketch after this list):

1. gossmap_manage_new - allocator
2. gossmap_manage_channel_announcement
   - handle new channel announcement msg
   - if too early, keeps it in early map
   - queues it, asks lightningd about the UTXO.
3. gossmap_manage_handle_get_txout_reply
   - handle response from lightningd for above.
4. gossmap_manage_channel_update
   - handle channel_update message
   - may have to wait on pending channel_announcement
5. gossmap_manage_node_announcement
   - handle node_announcement msg
   - may have to wait on pending channel_announcement
6. gossmap_manage_new_block
   - see if early announces can now be processed.
7. gossmap_manage_channel_spent
   - lightningd tells us UTXO is spent
   - may prepare channel for closing in 12 blocks.
8. gossmap_manage_channel_dying
   - gossip_store load tells us channel was spent earlier.
   - like gossmap_manage_channel_spent, but maybe < 12.
9. gossmap_manage_get_gossmap
   - gossmap accessor: seeker and queries will need this.
10. gossmap_manage_new_peer
   - a new peer has connected, give them all our gossip.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
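To make the shape of the API above easier to follow, here is a comment-only lifecycle sketch of the intended call order. The function names are the ten listed in the commit message; the arguments shown in the comments are simplified placeholders, not the real prototypes in gossipd's gossmap_manage header.

```c
/* Sketch only: the calls below are left as comments because their exact
 * signatures are not reproduced here; this just shows the intended order
 * of events for incoming gossip. */
struct gossmap_manage;
struct gossmap;

static void incoming_gossip_lifecycle_sketch(struct gossmap_manage *gm)
{
	(void)gm;	/* unused: the calls below are illustrative comments */

	/* 1. channel_announcement arrives: it is parked as pending and a
	 *    UTXO lookup is sent to lightningd.
	 *    gossmap_manage_channel_announcement(gm, ca_msg, source_peer); */

	/* 2. lightningd answers the lookup; a valid announcement graduates
	 *    into the gossip_store.
	 *    gossmap_manage_handle_get_txout_reply(gm, reply_msg); */

	/* 3. channel_update / node_announcement may arrive before step 2
	 *    completes; they wait on the pending channel_announcement.
	 *    gossmap_manage_channel_update(gm, cu_msg, source_peer);
	 *    gossmap_manage_node_announcement(gm, na_msg, source_peer); */

	/* 4. Each new block lets "too early" announcements be retried, and a
	 *    spent funding output starts the 12-block dying countdown.
	 *    gossmap_manage_new_block(gm, new_height);
	 *    gossmap_manage_channel_spent(gm, block_height, scid); */

	/* 5. Seeker and queries read the graph via the accessor.
	 *    struct gossmap *map = gossmap_manage_get_gossmap(gm); */
}
```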
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
No external interfaces, we start the timer on allocation.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
At initialization, gossipd is supposed to send all the local
channel_updates and any node_announcement it knows, so lightningd
doesn't generate fresh ones unnecessarily.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's an unnecessary round-trip, and can cause us to complain in CI, in
the case where the channel has been closed by the time we ask.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we've got more than 10000 pending channel_announcements,
complain and stop processing any more.

If this becomes a problem, we can limit individual peers.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
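A minimal sketch of that kind of back-pressure check follows; the 10000 threshold comes from the commit message, while the counter and warning flag are hypothetical stand-ins for whatever pending map gossipd actually keeps.

```c
/* Hypothetical sketch: once too many channel_announcements are waiting on
 * UTXO lookups, complain once and drop further ones instead of queueing
 * without bound. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_PENDING_CANNOUNCE 10000

static bool maybe_queue_cannounce(size_t *num_pending, bool *warned)
{
	if (*num_pending >= MAX_PENDING_CANNOUNCE) {
		if (!*warned) {
			fprintf(stderr,
				"too many pending channel_announcements (%zu),"
				" dropping new ones\n", *num_pending);
			*warned = true;
		}
		return false;	/* Caller should drop this announcement. */
	}
	(*num_pending)++;
	return true;
}
```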
We add a temporary stub gossmap_manage constructor, which simply opens
the gossmap and doesn't do anything else.

Then seeker uses this, rather than routing.c, to probe.

We optimize our "get random node announcements" a bit by traversing a
random set of nodes directly: if we have no node_announcement for a
node, we query its first channel.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
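A sketch of that probing strategy, using stand-in data structures (the node array and its fields below are hypothetical, not the gossmap accessors):

```c
/* Hypothetical sketch: walk a random sample of nodes and, for each one we
 * have no node_announcement for, remember the scid of its first channel so
 * the seeker can ask a peer about it (e.g. via query_short_channel_ids).
 * Sampling with replacement may pick duplicates; fine for a sketch. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct fake_node {
	bool have_node_announcement;
	uint64_t first_channel_scid;	/* 0 if the node has no channels */
};

static size_t pick_probe_targets(const struct fake_node *nodes,
				 size_t num_nodes,
				 uint64_t *scids_out, size_t max_out)
{
	size_t found = 0;

	for (size_t tries = 0; tries < num_nodes && found < max_out; tries++) {
		const struct fake_node *n = &nodes[rand() % num_nodes];

		if (n->have_node_announcement || n->first_channel_scid == 0)
			continue;
		scids_out[found++] = n->first_channel_scid;
	}
	return found;
}
```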
This is a bit less optimal than before, where we had an ordered map of
channels and could easily serve "channels between scids 800000x and
900000x".  We now iterate all of them.

The rest is fairly mechanical.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
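For reference, the "iterate all of them" approach looks roughly like this; the block-height-in-the-top-three-bytes layout of a short_channel_id is per BOLT 7, while the flat scid array stands in for the real gossmap iteration.

```c
/* Sketch: answer a query_channel_range-style request by scanning every known
 * scid and keeping those whose funding block lies in the requested range.
 * The top 3 bytes of a short_channel_id encode the block height (BOLT 7). */
#include <stddef.h>
#include <stdint.h>

static inline uint32_t scid_blocknum(uint64_t scid)
{
	return (uint32_t)(scid >> 40);
}

static size_t scids_in_range(const uint64_t *all_scids, size_t n,
			     uint32_t first_block, uint32_t num_blocks,
			     uint64_t *out)
{
	size_t found = 0;

	for (size_t i = 0; i < n; i++) {
		uint32_t block = scid_blocknum(all_scids[i]);

		if (block >= first_block && block - first_block < num_blocks)
			out[found++] = all_scids[i];
	}
	return found;
}
```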
The gossip_store_load is now basically a noop, since gossmap
does that.

gossipd removes a pile of routines dealing with messages,
in favor of just handing them to gossmap_manage.

The stub gossmap_manage constructor is removed entirely.

We simplified behaviour around channel_announcements with
no channel update: we now add them to the store, and go
back to fix the timestamp later.  This changes a test,
which explicitly tests for the old behaviour.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't use the dying flag, and we can manually append the
addendum rather than having gossip_store_add present a
bizarre interface.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
gossmap offsets are to the beginning of the message, whereas
the gossip_store uses the header offset.  Convert the internals
of gossip_store to use gossmap-style uniformly, even where it's
a little less convenient.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
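To see the two conventions side by side: a gossip_store record is a fixed-size header followed by the message, so converting between them is just a constant offset. The header layout below is a simplified stand-in, not the exact struct in gossip_store.h.

```c
/* Simplified stand-in for the on-disk record header; only its fixed size
 * matters for the offset conversion shown here. */
#include <stdint.h>

struct example_gossip_hdr {
	uint16_t flags;		/* deleted/dying/etc. bits */
	uint16_t len;		/* length of the message that follows */
	uint32_t crc;
	uint32_t timestamp;
};

/* gossmap-style offsets point at the message itself... */
static inline uint64_t msg_off_from_hdr_off(uint64_t hdr_off)
{
	return hdr_off + sizeof(struct example_gossip_hdr);
}

/* ...whereas the store used to track the header offset. */
static inline uint64_t hdr_off_from_msg_off(uint64_t msg_off)
{
	return msg_off - sizeof(struct example_gossip_hdr);
}
```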
Instead of "new" and "load", we don't really need to "load" anything,
so do everything in gossip_store_new.

Have it do the compaction/rewrite, and collect the dying records.

We can get bad gossip if a node processes a gossip message after we've closed:

```
_________________________________________ ERROR at teardown of test_closing_specified_destination _________________________________________
...
>           raise ValueError(str(errors))
E           ValueError:
E           Node errors:
E            - lightningd-1: had warning messages
E            - lightningd-4: had bad gossip messages
E           Global errors:
...
lightningd-1 2024-02-03T00:29:02.299Z INFO    0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199-connectd: Received WIRE_WARNING: WARNING: channel_announcement: no unspent txout 105x1x0
lightningd-1 2024-02-03T00:29:02.300Z DEBUG   0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199-connectd: peer_in WIRE_WARNING
lightningd-1 2024-02-03T00:29:02.300Z INFO    0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199-connectd: Received WIRE_WARNING: WARNING: channel_announcement: no unspent txout 103x1x0
lightningd-1 2024-02-03T00:29:02.339Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: peer_in WIRE_WARNING
lightningd-1 2024-02-03T00:29:02.339Z INFO    035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: Received WIRE_WARNING: WARNING: channel_announcement: no unspent txout 103x1x0
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This time in renepay tests.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell merged commit 28c4a52 into ElementsProject:master on Feb 3, 2024
35 checks passed