Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix routing infinite loop, bad scoring #7127

Merged

Conversation

rustyrussell
Copy link
Contributor

@rustyrussell rustyrussell commented Mar 5, 2024

Three small fixes: the scoring ones may have a large effect in real routing scenarios!

Fixes: #7092

Copy link
Collaborator

@vincenzopalazzo vincenzopalazzo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 4b06cde

plugins/libplugin-pay.c Outdated Show resolved Hide resolved
@rustyrussell
Copy link
Contributor Author

The final commit is wrong: we do, indeed, sum the inputs to that function. However, we shouldn't. Will fix separately (now I'm writing tests!)

@rustyrussell rustyrussell marked this pull request as draft March 6, 2024 00:34
Wrote a test program which passed num_channel_updates_rejected as NULL
(which we don't usually do), and valgrind complained:

```
==1048302== Conditional jump or move depends on uninitialised value(s)
==1048302==    at 0x118B90: update_channel (gossmap.c:550)
==1048302==    by 0x119EEE: map_catchup (gossmap.c:663)
==1048302==    by 0x11A299: load_gossip_store (gossmap.c:726)
==1048302==    by 0x11A352: gossmap_load (gossmap.c:1052)
==1048302==    by 0x125362: main (run-route-infloop.c:90)
```

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell marked this pull request as ready for review March 6, 2024 04:22
@rustyrussell
Copy link
Contributor Author

Even after fixing the capacity bias, we had a similar problem with our risk values (being 0). In fact, the scoring callback was counterintuitive, so I've fixed that entirely.

The main contribution was to spend a day writing a proper unit test of the routing code, which uncovered this confusion and now gives me much more optimism that this is correct.

Please re-review: the first two commits are the same, though.

@cdecker cdecker self-assigned this Mar 6, 2024
The amount is set not to crash by default, but run
"common/test/run-route-infloop 8388607" and you'll see a crash.

Sorry about the 7MB blob, but this testing was quite revealing and
I consider it worth adding.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We attempted to introduce a "capacity bias" so we would penalize
channels where we were using a large amount of their total capacity.

Unfortunately, this was both naive, and applied wrongly:

	-log((capmsat + 1 - amtmsat) / (capmsat + 1));

As an integer gives 0 up to about 65% capacity.  We then applied this
by multiplying:

	(cmsat * rmsat * bias) / (cmsat + rmsat + bias + 1);

Giving a total score of 0 in these cases!  If we're using the
arithmetic mean we really need to add 1 to bias.  We might as well
use a double the whole way through, for a slightly more fine-grained
approach.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It gave 0.  A lot.

Firstly, rmsat was often very small, because delays are often small.   Much
smaller than the actual fee.

We really just want to offset the bias and multiply it.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The "path_score" callback was supposed to evaluate the *entire path*,
but that was counter-intuitive and opened the door to a cost function
bug which caused this path cost to be less than the closer path.

In particular, the capacity bias code didn't understand this at all.

1. Rename the function to `channel_score` and remove the "distance"
   parameter (always "1" since you're supposed to be evaluating a
   single hop).
2. Rename "cost" to the more specific "fee": "score" is our
   actual cost function result (we avoid the word "cost" as it
   may get confused with satoshi amounts).
3. For capacity biassing, we do want to know the amount, but
   explicitly hand that as a separate parameter "total".
4. Fix a minor bug where total handed to scoring function previously
   included channel fee (this is wrong: fee is paid before sending into
   channel).
5. Remove the now-unused total_delay member from the dijkstra
   struct.

Here are the results of our test now (routing 4194303 msat, which
didn't crash the old code, so we could compare).  In both cases
we could find routes to 615 nodes:

Linear success probability (when found): min-max(mean +/- stddev)
	Before: 0.484764-0.999750(0.9781+/-0.049)
	After:  0.487040-0.999543(0.952548+/-0.075)

Hops:
	Before: 1-5(2.13821+/-0.66)
	After:  1-5(2.98374+/-0.77)

Fees:
	Before: 0-50041(2173.75+/-5.3e+03)
	After:  0-50848(922.457+/-2.7e+03)

Delay (blocks):
	Before: 0-294(83.1642+/-68)
	After:  0-196(65.8081+/-60)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: ElementsProject#7092
Changelog-Fixed: Plugins: `pay` would occasionally crash on routing.
Changelog-Fixed: Plugins: `pay` route algorithm fixed and refined to balance fees and capacity far better.
I noticed that run-route-infloop chose some worse-looking paths after
routing was fixed, eg the second node:

Before:
	Destination node, success, probability, hops, fees, cltv, scid...
	02b3aa1e4ed31be83cca4bd367b2c01e39502cb25e282a9b4520ad376a1ba0a01a,1,0.991856,2,1004,40,2572260x39x0/1,2131897x45x0/0

After:
	Destination node, success, probability, hops, fees, cltv, scid...
	02b3aa1e4ed31be83cca4bd367b2c01e39502cb25e282a9b4520ad376a1ba0a01a,1,0.954540,3,1046,46,2570715x21x0/1,2346882x26x14/1,2131897x45x0/0

This is because although the final costs don't reflect it, routing was taking
into account local channels, and 2572260x39x0/1 has a base fee of 2970.

There's an easy fix: when we the pay plugin creates localmods for our
gossip graph, add all local channels with delay and fees equal to 0.
We do the same thing in our unit test.  This improves things across
the board:

Linear success probability (when found): min-max(mean +/- stddev)
	Before: 0.487040-0.999543(0.952548+/-0.075)
	After:  0.486985-0.999750(0.975978+/-0.053)

Hops:
	Before: 1-5(2.98374+/-0.77)
	After:  1-5(2.09593+/-0.63)

Fees:
	Before: 0-50848(922.457+/-2.7e+03)
	After:  0-50041(861.621+/-2.7e+03)

Delay (blocks):
	Before: 0-196(65.8081+/-60)
	After:  0-190(60.3285+/-60)

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Plugins: `pay` route algorithm doesn't bias against our own "expensive" channels any more.
Set up a simple line of channel pairs, where one should be preferred
over the other for our various reasons.  Make sure this works, both
using a low-level call to Dijkstra and at a higher level as the pay
plugin does.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@cdecker cdecker added this to the v24.02.1 milestone Mar 7, 2024
@cdecker cdecker merged commit 65efa2a into ElementsProject:master Mar 7, 2024
35 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

plugin-pay: crashed during normal operation
3 participants