-
Notifications
You must be signed in to change notification settings - Fork 886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix routing infinite loop, bad scoring #7127
Fix routing infinite loop, bad scoring #7127
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACK 4b06cde
The final commit is wrong: we do, indeed, sum the inputs to that function. However, we shouldn't. Will fix separately (now I'm writing tests!) |
Wrote a test program which passed num_channel_updates_rejected as NULL (which we don't usually do), and valgrind complained: ``` ==1048302== Conditional jump or move depends on uninitialised value(s) ==1048302== at 0x118B90: update_channel (gossmap.c:550) ==1048302== by 0x119EEE: map_catchup (gossmap.c:663) ==1048302== by 0x11A299: load_gossip_store (gossmap.c:726) ==1048302== by 0x11A352: gossmap_load (gossmap.c:1052) ==1048302== by 0x125362: main (run-route-infloop.c:90) ``` Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
4b06cde
to
3ab7d6c
Compare
Even after fixing the capacity bias, we had a similar problem with our risk values (being 0). In fact, the scoring callback was counterintuitive, so I've fixed that entirely. The main contribution was to spend a day writing a proper unit test of the routing code, which uncovered this confusion and now gives me much more optimism that this is correct. Please re-review: the first two commits are the same, though. |
3ab7d6c
to
0fc538b
Compare
The amount is set not to crash by default, but run "common/test/run-route-infloop 8388607" and you'll see a crash. Sorry about the 7MB blob, but this testing was quite revealing and I consider it worth adding. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We attempted to introduce a "capacity bias" so we would penalize channels where we were using a large amount of their total capacity. Unfortunately, this was both naive, and applied wrongly: -log((capmsat + 1 - amtmsat) / (capmsat + 1)); As an integer gives 0 up to about 65% capacity. We then applied this by multiplying: (cmsat * rmsat * bias) / (cmsat + rmsat + bias + 1); Giving a total score of 0 in these cases! If we're using the arithmetic mean we really need to add 1 to bias. We might as well use a double the whole way through, for a slightly more fine-grained approach. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It gave 0. A lot. Firstly, rmsat was often very small, because delays are often small. Much smaller than the actual fee. We really just want to offset the bias and multiply it. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The "path_score" callback was supposed to evaluate the *entire path*, but that was counter-intuitive and opened the door to a cost function bug which caused this path cost to be less than the closer path. In particular, the capacity bias code didn't understand this at all. 1. Rename the function to `channel_score` and remove the "distance" parameter (always "1" since you're supposed to be evaluating a single hop). 2. Rename "cost" to the more specific "fee": "score" is our actual cost function result (we avoid the word "cost" as it may get confused with satoshi amounts). 3. For capacity biassing, we do want to know the amount, but explicitly hand that as a separate parameter "total". 4. Fix a minor bug where total handed to scoring function previously included channel fee (this is wrong: fee is paid before sending into channel). 5. Remove the now-unused total_delay member from the dijkstra struct. Here are the results of our test now (routing 4194303 msat, which didn't crash the old code, so we could compare). In both cases we could find routes to 615 nodes: Linear success probability (when found): min-max(mean +/- stddev) Before: 0.484764-0.999750(0.9781+/-0.049) After: 0.487040-0.999543(0.952548+/-0.075) Hops: Before: 1-5(2.13821+/-0.66) After: 1-5(2.98374+/-0.77) Fees: Before: 0-50041(2173.75+/-5.3e+03) After: 0-50848(922.457+/-2.7e+03) Delay (blocks): Before: 0-294(83.1642+/-68) After: 0-196(65.8081+/-60) Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Fixes: ElementsProject#7092 Changelog-Fixed: Plugins: `pay` would occasionally crash on routing. Changelog-Fixed: Plugins: `pay` route algorithm fixed and refined to balance fees and capacity far better.
I noticed that run-route-infloop chose some worse-looking paths after routing was fixed, eg the second node: Before: Destination node, success, probability, hops, fees, cltv, scid... 02b3aa1e4ed31be83cca4bd367b2c01e39502cb25e282a9b4520ad376a1ba0a01a,1,0.991856,2,1004,40,2572260x39x0/1,2131897x45x0/0 After: Destination node, success, probability, hops, fees, cltv, scid... 02b3aa1e4ed31be83cca4bd367b2c01e39502cb25e282a9b4520ad376a1ba0a01a,1,0.954540,3,1046,46,2570715x21x0/1,2346882x26x14/1,2131897x45x0/0 This is because although the final costs don't reflect it, routing was taking into account local channels, and 2572260x39x0/1 has a base fee of 2970. There's an easy fix: when we the pay plugin creates localmods for our gossip graph, add all local channels with delay and fees equal to 0. We do the same thing in our unit test. This improves things across the board: Linear success probability (when found): min-max(mean +/- stddev) Before: 0.487040-0.999543(0.952548+/-0.075) After: 0.486985-0.999750(0.975978+/-0.053) Hops: Before: 1-5(2.98374+/-0.77) After: 1-5(2.09593+/-0.63) Fees: Before: 0-50848(922.457+/-2.7e+03) After: 0-50041(861.621+/-2.7e+03) Delay (blocks): Before: 0-196(65.8081+/-60) After: 0-190(60.3285+/-60) Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-Changed: Plugins: `pay` route algorithm doesn't bias against our own "expensive" channels any more.
Set up a simple line of channel pairs, where one should be preferred over the other for our various reasons. Make sure this works, both using a low-level call to Dijkstra and at a higher level as the pay plugin does. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
0fc538b
to
3d457de
Compare
Three small fixes: the scoring ones may have a large effect in real routing scenarios!
Fixes: #7092