Fix routing infinite loop, bad scoring #7127

rustyrussell · 2024-03-05T04:10:34Z

Three small fixes: the scoring ones may have a large effect in real routing scenarios!

vincenzopalazzo

ACK 4b06cde

plugins/libplugin-pay.c

rustyrussell · 2024-03-06T00:34:41Z

The final commit is wrong: we do, indeed, sum the inputs to that function. However, we shouldn't. Will fix separately (now I'm writing tests!)

Wrote a test program which passed num_channel_updates_rejected as NULL (which we don't usually do), and valgrind complained: ``` ==1048302== Conditional jump or move depends on uninitialised value(s) ==1048302== at 0x118B90: update_channel (gossmap.c:550) ==1048302== by 0x119EEE: map_catchup (gossmap.c:663) ==1048302== by 0x11A299: load_gossip_store (gossmap.c:726) ==1048302== by 0x11A352: gossmap_load (gossmap.c:1052) ==1048302== by 0x125362: main (run-route-infloop.c:90) ``` Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell · 2024-03-06T04:24:47Z

Even after fixing the capacity bias, we had a similar problem with our risk values (being 0). In fact, the scoring callback was counterintuitive, so I've fixed that entirely.

The main contribution was to spend a day writing a proper unit test of the routing code, which uncovered this confusion and now gives me much more optimism that this is correct.

Please re-review: the first two commits are the same, though.

The amount is set not to crash by default, but run "common/test/run-route-infloop 8388607" and you'll see a crash. Sorry about the 7MB blob, but this testing was quite revealing and I consider it worth adding. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

We attempted to introduce a "capacity bias" so we would penalize channels where we were using a large amount of their total capacity. Unfortunately, this was both naive, and applied wrongly: -log((capmsat + 1 - amtmsat) / (capmsat + 1)); As an integer gives 0 up to about 65% capacity. We then applied this by multiplying: (cmsat * rmsat * bias) / (cmsat + rmsat + bias + 1); Giving a total score of 0 in these cases! If we're using the arithmetic mean we really need to add 1 to bias. We might as well use a double the whole way through, for a slightly more fine-grained approach. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

It gave 0. A lot. Firstly, rmsat was often very small, because delays are often small. Much smaller than the actual fee. We really just want to offset the bias and multiply it. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

The "path_score" callback was supposed to evaluate the *entire path*, but that was counter-intuitive and opened the door to a cost function bug which caused this path cost to be less than the closer path. In particular, the capacity bias code didn't understand this at all. 1. Rename the function to `channel_score` and remove the "distance" parameter (always "1" since you're supposed to be evaluating a single hop). 2. Rename "cost" to the more specific "fee": "score" is our actual cost function result (we avoid the word "cost" as it may get confused with satoshi amounts). 3. For capacity biassing, we do want to know the amount, but explicitly hand that as a separate parameter "total". 4. Fix a minor bug where total handed to scoring function previously included channel fee (this is wrong: fee is paid before sending into channel). 5. Remove the now-unused total_delay member from the dijkstra struct. Here are the results of our test now (routing 4194303 msat, which didn't crash the old code, so we could compare). In both cases we could find routes to 615 nodes: Linear success probability (when found): min-max(mean +/- stddev) Before: 0.484764-0.999750(0.9781+/-0.049) After: 0.487040-0.999543(0.952548+/-0.075) Hops: Before: 1-5(2.13821+/-0.66) After: 1-5(2.98374+/-0.77) Fees: Before: 0-50041(2173.75+/-5.3e+03) After: 0-50848(922.457+/-2.7e+03) Delay (blocks): Before: 0-294(83.1642+/-68) After: 0-196(65.8081+/-60) Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Fixes: ElementsProject#7092 Changelog-Fixed: Plugins: `pay` would occasionally crash on routing. Changelog-Fixed: Plugins: `pay` route algorithm fixed and refined to balance fees and capacity far better.

I noticed that run-route-infloop chose some worse-looking paths after routing was fixed, eg the second node: Before: Destination node, success, probability, hops, fees, cltv, scid... 02b3aa1e4ed31be83cca4bd367b2c01e39502cb25e282a9b4520ad376a1ba0a01a,1,0.991856,2,1004,40,2572260x39x0/1,2131897x45x0/0 After: Destination node, success, probability, hops, fees, cltv, scid... 02b3aa1e4ed31be83cca4bd367b2c01e39502cb25e282a9b4520ad376a1ba0a01a,1,0.954540,3,1046,46,2570715x21x0/1,2346882x26x14/1,2131897x45x0/0 This is because although the final costs don't reflect it, routing was taking into account local channels, and 2572260x39x0/1 has a base fee of 2970. There's an easy fix: when we the pay plugin creates localmods for our gossip graph, add all local channels with delay and fees equal to 0. We do the same thing in our unit test. This improves things across the board: Linear success probability (when found): min-max(mean +/- stddev) Before: 0.487040-0.999543(0.952548+/-0.075) After: 0.486985-0.999750(0.975978+/-0.053) Hops: Before: 1-5(2.98374+/-0.77) After: 1-5(2.09593+/-0.63) Fees: Before: 0-50848(922.457+/-2.7e+03) After: 0-50041(861.621+/-2.7e+03) Delay (blocks): Before: 0-196(65.8081+/-60) After: 0-190(60.3285+/-60) Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-Changed: Plugins: `pay` route algorithm doesn't bias against our own "expensive" channels any more.

Set up a simple line of channel pairs, where one should be preferred over the other for our various reasons. Make sure this works, both using a low-level call to Dijkstra and at a higher level as the pay plugin does. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell added crash pay-plugin labels Mar 5, 2024

rustyrussell requested a review from renepickhardt March 5, 2024 04:10

vincenzopalazzo approved these changes Mar 5, 2024

View reviewed changes

cdecker reviewed Mar 5, 2024

View reviewed changes

plugins/libplugin-pay.c Outdated Show resolved Hide resolved

rustyrussell marked this pull request as draft March 6, 2024 00:34

rustyrussell force-pushed the guilt/route-infinite-loop branch from 4b06cde to 3ab7d6c Compare March 6, 2024 04:22

rustyrussell marked this pull request as ready for review March 6, 2024 04:22

rustyrussell requested review from cdecker and vincenzopalazzo March 6, 2024 04:22

cdecker self-assigned this Mar 6, 2024

rustyrussell force-pushed the guilt/route-infinite-loop branch from 3ab7d6c to 0fc538b Compare March 7, 2024 00:54

rustyrussell added 6 commits March 7, 2024 13:26

plugins/pay: fix route scoring calculation.

9ecb26c

It gave 0. A lot. Firstly, rmsat was often very small, because delays are often small. Much smaller than the actual fee. We really just want to offset the bias and multiply it. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell force-pushed the guilt/route-infinite-loop branch from 0fc538b to 3d457de Compare March 7, 2024 02:57

cdecker added this to the v24.02.1 milestone Mar 7, 2024

cdecker merged commit 65efa2a into ElementsProject:master Mar 7, 2024
35 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix routing infinite loop, bad scoring #7127

Fix routing infinite loop, bad scoring #7127

rustyrussell commented Mar 5, 2024 •

edited

Loading

vincenzopalazzo left a comment

rustyrussell commented Mar 6, 2024

rustyrussell commented Mar 6, 2024

Fix routing infinite loop, bad scoring #7127

Fix routing infinite loop, bad scoring #7127

Conversation

rustyrussell commented Mar 5, 2024 • edited Loading

vincenzopalazzo left a comment

Choose a reason for hiding this comment

rustyrussell commented Mar 6, 2024

rustyrussell commented Mar 6, 2024

rustyrussell commented Mar 5, 2024 •

edited

Loading