
Gossip rework part 3: gossipd now uses gossmap #6941

Merged


rustyrussell
Contributor

@rustyrussell commented Dec 14, 2023

~~Based on #6904~~ Merged into master!

Lots of moving code out of gossipd/routing.c, followed by getting rid of spam and zombie code.

Then we switch everything across to our new "gossmap_manage" API, and clean up.

@cdecker
Member

cdecker commented Jan 31, 2024

Rebased on top of master, reviewing now.

@rustyrussell
Contributor Author

Tweaked the overstrict test change in the final commit.

The fuzzer is complaining about leaks, but it seems to be an OpenSSL thing; this series doesn't contain any fuzz changes, either...

@cdecker
Member

cdecker commented Feb 1, 2024

Yeah, I noticed the fuzzer complaining in some other PRs too. We should likely split out the fuzzer into a cron-triggered CI run instead, since at best it can catch newly introduced bugs (but randomized), and at worst it delays a completely unrelated PR just because it tried some other path through the program and hit a latent one.

@cdecker
Member

cdecker commented Feb 1, 2024

Rebased on top of master.

ACK 23cc4f3

@morehouse
Contributor

@cdecker

Yeah, I noticed the fuzzer complaining in some other PRs too. We should likely split out the fuzzer into a cron-triggered CI run instead, since at best it can catch newly introduced bugs (but randomized), and at worst it delays a completely unrelated PR just because it tried some other path through the program and hit a latent one.

This is not how the fuzz regression tests work in CI. No new fuzzing is done, the fuzz target is only run on the existing inputs in the corpus. The regression tests really do belong in CI, so they can actually detect regressions.

I haven't looked in detail, but one particular fuzz test seems to be failing due to a new bug in OpenSSL -- perhaps the CI container recently upgraded OpenSSL? We should really figure out what's going on and fix that, or if necessary disable the particular test that is failing. Getting rid of all fuzz regression testing in CI (#7029) is not the right solution IMO.

@rustyrussell
Contributor Author

@cdecker

Yeah, I noticed the fuzzer complaining in some other PRs too. We should likely split out the fuzzer into a cron-triggered CI run instead, since at best it can catch newly introduced bugs (but randomized), and at worst it delays a completely unrelated PR just because it tried some other path through the program and hit a latent one.

This is not how the fuzz regression tests work in CI. No new fuzzing is done, the fuzz target is only run on the existing inputs in the corpus. The regression tests really do belong in CI, so they can actually detect regressions.

I haven't looked in detail, but one particular fuzz test seems to be failing due to a new bug in OpenSSL -- perhaps the CI container recently upgraded OpenSSL? We should really figure out what's going on and fix that, or if necessary disable the particular test that is failing. Getting rid of all fuzz regression testing in CI (#7029) is not the right solution IMO.

I agree, and will create a patch to re-enable all but this one until it's fixed...

@cdecker
Member

cdecker commented Feb 2, 2024

Good point, and thanks @morehouse for the additional context. I did not know it was only regression testing, though I could have imagined as much from the workflow name :-)

The CI and its brittleness have long been a major issue for CLN maintainers, and this is a good example: the first assumption is always that it must be a flake, not that there is something really wrong. Something something signal vs noise :-)

@morehouse
Contributor

Here's a PR to revert #7029 and avoid the LSan reports: #7032

The only way you'll see private channel_updates is if you put them
there yourself with localmods.

I also renamed the confusing gossmap_chan_capacity to gossmap_chan_has_capacity.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
gossip_store_del: takes a gossmap-style offset-of-msg, not offset-of-hdr.
gossip_store_flag: sets an arbitrary flag on a gossip_store hdr.
gossip_store_get_timestamp/gossip_store_set_timestamp: access a gossip_store hdr.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
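For orientation, here is a rough sketch of what such helpers could look like. The names are the ones listed in the commit message; the parameter lists and the fixed record header they imply are assumptions for illustration, not the actual declarations in gossipd's gossip_store.h.

```c
/* Illustrative only: signatures are assumed, not copied from gossip_store.h.
 * The convention being described is that callers pass the gossmap-style
 * offset of the *message*, and the helpers step back over the record
 * header themselves. */
#include <stdbool.h>
#include <stdint.h>

struct gossip_store;

/* Delete the record whose message starts at msg_off (not the header offset). */
bool gossip_store_del(struct gossip_store *gs, uint64_t msg_off, int type);

/* Set an arbitrary flag bit in the record header preceding msg_off. */
void gossip_store_flag(struct gossip_store *gs, uint64_t msg_off, uint16_t flag);

/* Read or write the timestamp field stored in the record header. */
uint32_t gossip_store_get_timestamp(struct gossip_store *gs, uint64_t msg_off);
void gossip_store_set_timestamp(struct gossip_store *gs, uint64_t msg_off,
				uint32_t timestamp);
```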
And helpers to tell if a node_announcement exists, and get a
full channel_update.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In particular, allow callers to see unknown records we ignore (and let
them fail as a result), and get called if we can't pack a
channel_update into our internal format.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It was an obscure dev command, as it never worked reliably.
It would be much easier to re-implement once this is done.

This turned out to reveal a tiny leak on
tests/test_gossip.py::test_gossip_store_load_amount_truncated where we
didn't immediately free chan_ann if it was dangling.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Makes it easier to wean off routing.c.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is easier for future callers, which don't have a convenient peer
structure: in particular, asynchronous processing of gossip for peers.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Again, we don't necessarily have a peer pointer.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Most nodes don't really care about exact timestamps on gossip filters,
so just keep a flag on whether we have anything in the gossip_store,
and use that to determine whether we ask peers for everything.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
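As a concrete illustration of that trade-off (not the actual gossipd code), the peer filter could be chosen from a single flag like this; first_timestamp and timestamp_range are the real BOLT 7 gossip_timestamp_filter fields, while the helper and the six-hour backdate are arbitrary choices for the sketch.

```c
/* Hypothetical sketch: choose gossip_timestamp_filter parameters from a
 * single "do we have any gossip at all?" flag, instead of tracking exact
 * timestamps per peer. */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct gossip_filter {
	uint32_t first_timestamp;
	uint32_t timestamp_range;
};

static struct gossip_filter choose_filter(bool store_has_gossip)
{
	struct gossip_filter f;

	if (!store_has_gossip) {
		/* Empty store: ask the peer for everything it has. */
		f.first_timestamp = 0;
		f.timestamp_range = UINT32_MAX;
	} else {
		/* Otherwise only ask for reasonably fresh gossip; the exact
		 * backdate is an arbitrary illustrative choice. */
		f.first_timestamp = (uint32_t)time(NULL) - 6 * 60 * 60;
		f.timestamp_range = UINT32_MAX;
	}
	return f;
}
```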
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We never enabled it, because we seemed to be eliminating valid
channels.  We discard zombie-marked records on loading.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We weakened this progressively over time, and gossip v1.5 makes spam
impossible by protocol, so we can wait until then.

Removing this code simplifies things a great deal!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Removed: Protocol: we no longer ratelimit gossip messages by channel, making our code far simpler.
…ntries.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is a fair amount of code, but much is taken from
the old routing.c, with the difference that this uses
common/gossmap instead of our own structures.

The interfaces are fairly clear (see the sketch after this list):

1. gossmap_manage_new - allocator
2. gossmap_manage_channel_announcement
   - handle new channel announcement msg
   - if too early, keeps it in early map
   - queues it, asks lightningd about the UTXO.
3. gossmap_manage_handle_get_txout_reply
   - handle response from lightningd for above.
4. gossmap_manage_channel_update
   - handle channel_update message
   - may have to wait on pending channel_announcement
5. gossmap_manage_node_announcement
   - handle node_announcement msg
   - may have to wait on pending channel_announcement
6. gossmap_manage_new_block
   - see if early announces can now be processed.
7. gossmap_manage_channel_spent
   - lightningd tells us UTXO is spent
   - may prepare channel for closing in 12 blocks.
8. gossmap_manage_channel_dying
   - gossip_store load tells us channel was spent earlier.
   - like gossmap_manage_channel_spent, but maybe < 12.
9. gossmap_manage_get_gossmap
   - gossmap accessor: seeker and queries will need this.
10. gossmap_manage_new_peer
   - a new peer has connected, give them all our gossip.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
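To make the shape of the API above easier to follow, here is a comment-only lifecycle sketch of the intended call order. The function names are the ten listed in the commit message; the arguments shown in the comments are simplified placeholders, not the real prototypes in gossipd's gossmap_manage header.

```c
/* Sketch only: the calls below are left as comments because their exact
 * signatures are not reproduced here; this just shows the intended order
 * of events for incoming gossip. */
struct gossmap_manage;
struct gossmap;

static void incoming_gossip_lifecycle_sketch(struct gossmap_manage *gm)
{
	(void)gm;	/* unused: the calls below are illustrative comments */

	/* 1. channel_announcement arrives: it is parked as pending and a
	 *    UTXO lookup is sent to lightningd.
	 *    gossmap_manage_channel_announcement(gm, ca_msg, source_peer); */

	/* 2. lightningd answers the lookup; a valid announcement graduates
	 *    into the gossip_store.
	 *    gossmap_manage_handle_get_txout_reply(gm, reply_msg); */

	/* 3. channel_update / node_announcement may arrive before step 2
	 *    completes; they wait on the pending channel_announcement.
	 *    gossmap_manage_channel_update(gm, cu_msg, source_peer);
	 *    gossmap_manage_node_announcement(gm, na_msg, source_peer); */

	/* 4. Each new block lets "too early" announcements be retried, and a
	 *    spent funding output starts the 12-block dying countdown.
	 *    gossmap_manage_new_block(gm, new_height);
	 *    gossmap_manage_channel_spent(gm, block_height, scid); */

	/* 5. Seeker and queries read the graph via the accessor.
	 *    struct gossmap *map = gossmap_manage_get_gossmap(gm); */
}
```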
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
No external interfaces, we start the timer on allocation.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
At initialization, gossipd is supposed to send all the local
channel_updates and any node_announcement it knows, so lightningd
doesn't generate fresh ones unnecessarily.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's an unnecessary round-trip, and can cause us to complain in CI, in
the case where the channel has been closed by the time we ask.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we've got more than 10000 pending channel_announcements,
complain and stop processing any more.

If this becomes a problem, we can limit individual peers.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
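A minimal sketch of that kind of back-pressure check follows; the 10000 threshold comes from the commit message, while the counter and warning flag are hypothetical stand-ins for whatever pending map gossipd actually keeps.

```c
/* Hypothetical sketch: once too many channel_announcements are waiting on
 * UTXO lookups, complain once and drop further ones instead of queueing
 * without bound. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_PENDING_CANNOUNCE 10000

static bool maybe_queue_cannounce(size_t *num_pending, bool *warned)
{
	if (*num_pending >= MAX_PENDING_CANNOUNCE) {
		if (!*warned) {
			fprintf(stderr,
				"too many pending channel_announcements (%zu),"
				" dropping new ones\n", *num_pending);
			*warned = true;
		}
		return false;	/* Caller should drop this announcement. */
	}
	(*num_pending)++;
	return true;
}
```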
We add a temporary stub gossmap_manage constructor, which simply opens
the gossmap and doesn't do anything else.

Then seeker uses this, rather than routing.c, to probe.

We optimize our "get random node announcements" a bit by traversing a
random set of nodes directly: if we have no node_announcement for a
node, we query its first channel.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
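A sketch of that probing strategy, using stand-in data structures (the node array and its fields below are hypothetical, not the gossmap accessors):

```c
/* Hypothetical sketch: walk a random sample of nodes and, for each one we
 * have no node_announcement for, remember the scid of its first channel so
 * the seeker can ask a peer about it (e.g. via query_short_channel_ids).
 * Sampling with replacement may pick duplicates; fine for a sketch. */
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct fake_node {
	bool have_node_announcement;
	uint64_t first_channel_scid;	/* 0 if the node has no channels */
};

static size_t pick_probe_targets(const struct fake_node *nodes,
				 size_t num_nodes,
				 uint64_t *scids_out, size_t max_out)
{
	size_t found = 0;

	for (size_t tries = 0; tries < num_nodes && found < max_out; tries++) {
		const struct fake_node *n = &nodes[rand() % num_nodes];

		if (n->have_node_announcement || n->first_channel_scid == 0)
			continue;
		scids_out[found++] = n->first_channel_scid;
	}
	return found;
}
```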
This is a bit less optimal than before, where we had an ordered map of
channels and could easily serve "channels between scids 800000x and
900000x".  We now iterate all of them.

The rest is fairly mechanical.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
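For reference, the "iterate all of them" approach looks roughly like this; the block-height-in-the-top-three-bytes layout of a short_channel_id is per BOLT 7, while the flat scid array stands in for the real gossmap iteration.

```c
/* Sketch: answer a query_channel_range-style request by scanning every known
 * scid and keeping those whose funding block lies in the requested range.
 * The top 3 bytes of a short_channel_id encode the block height (BOLT 7). */
#include <stddef.h>
#include <stdint.h>

static inline uint32_t scid_blocknum(uint64_t scid)
{
	return (uint32_t)(scid >> 40);
}

static size_t scids_in_range(const uint64_t *all_scids, size_t n,
			     uint32_t first_block, uint32_t num_blocks,
			     uint64_t *out)
{
	size_t found = 0;

	for (size_t i = 0; i < n; i++) {
		uint32_t block = scid_blocknum(all_scids[i]);

		if (block >= first_block && block - first_block < num_blocks)
			out[found++] = all_scids[i];
	}
	return found;
}
```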
The gossip_store_load is now basically a noop, since gossmap
does that.

gossipd removes a pile of routines dealing with messages,
in favor of just handing them to gossmap_manage.

The stub gossmap_manage constructor is removed entirely.

We simplified behaviour around channel_announcements with
no channel update: we now add them to the store, and go
back to fix the timestamp later.  This changes a test,
which explicitly tests for the old behaviour.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't use the dying flag, and we can manually append the
addendum rather than having gossip_store_add present a
bizarre interface.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
gossmap offsets are to the beginning of the message, whereas
the gossip_store uses the header offset.  Convert the internals
of gossip_store to use gossmap-style uniformly, even where it's
a little less convenient.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
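To see the two conventions side by side: a gossip_store record is a fixed-size header followed by the message, so converting between them is just a constant offset. The header layout below is a simplified stand-in, not the exact struct in gossip_store.h.

```c
/* Simplified stand-in for the on-disk record header; only its fixed size
 * matters for the offset conversion shown here. */
#include <stdint.h>

struct example_gossip_hdr {
	uint16_t flags;		/* deleted/dying/etc. bits */
	uint16_t len;		/* length of the message that follows */
	uint32_t crc;
	uint32_t timestamp;
};

/* gossmap-style offsets point at the message itself... */
static inline uint64_t msg_off_from_hdr_off(uint64_t hdr_off)
{
	return hdr_off + sizeof(struct example_gossip_hdr);
}

/* ...whereas the store used to track the header offset. */
static inline uint64_t hdr_off_from_msg_off(uint64_t msg_off)
{
	return msg_off - sizeof(struct example_gossip_hdr);
}
```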
Instead of "new" and "load", we don't really need to "load" anything,
so do everything in gossip_store_new.

Have it do the compaction/rewrite, and collect the dying records.

We can get bad gossip if a node processes a gossip message after we've closed:

```
_________________________________________ ERROR at teardown of test_closing_specified_destination _________________________________________
...
>           raise ValueError(str(errors))
E           ValueError:
E           Node errors:
E            - lightningd-1: had warning messages
E            - lightningd-4: had bad gossip messages
E           Global errors:
...
lightningd-1 2024-02-03T00:29:02.299Z INFO    0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199-connectd: Received WIRE_WARNING: WARNING: channel_announcement: no unspent txout 105x1x0
lightningd-1 2024-02-03T00:29:02.300Z DEBUG   0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199-connectd: peer_in WIRE_WARNING
lightningd-1 2024-02-03T00:29:02.300Z INFO    0382ce59ebf18be7d84677c2e35f23294b9992ceca95491fcf8a56c6cb2d9de199-connectd: Received WIRE_WARNING: WARNING: channel_announcement: no unspent txout 103x1x0
lightningd-1 2024-02-03T00:29:02.339Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: peer_in WIRE_WARNING
lightningd-1 2024-02-03T00:29:02.339Z INFO    035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: Received WIRE_WARNING: WARNING: channel_announcement: no unspent txout 103x1x0
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This time in renepay tests.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell merged commit 28c4a52 into ElementsProject:master on Feb 3, 2024
35 checks passed