Askrene more tweaks #7734

rustyrussell · 2024-10-10T07:08:24Z

OK, lots of real-life testing for askrene, resulting in a number of bug fixes, including a nasty cycle fix from @Lagrang3 Looking forward to @Lagrang3 review!

(I need to benchmark again, too!)
Changelog-None

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

doc/schemas/lightning-getroutes.json

plugins/askrene/askrene.c

plugins/askrene/explain_failure.c

rustyrussell · 2024-10-11T10:03:15Z

I wonder about using "double" for network costs instead of s64? I had an issue with giant k, causing an overflow in linearize_channel:

It doesn't seem to be an issue with fixed k, though.

		cost[i] = params->cost_fraction[i]
		          *params->amount.millisatoshis /* Raw: linearize_channel */
		          *params->k_factor/(b-a);

rustyrussell · 2024-10-11T10:32:41Z

Rebase with minor changes (and whitespace fix which was stopping CI).

Lagrang3 · 2024-10-11T10:39:42Z

I wonder about using "double" for network costs instead of s64? I had an issue with giant k, causing an overflow in linearize_channel:

It doesn't seem to be an issue with fixed k, though.
		cost[i] = params->cost_fraction[i]
		          *params->amount.millisatoshis /* Raw: linearize_channel */
		          *params->k_factor/(b-a);

The theory bounds the complexity of the MCF algorithms based on the assumption that costs, capacities and demand/supply are all integers.

rustyrussell · 2024-10-11T10:59:15Z

I wonder about using "double" for network costs instead of s64? I had an issue with giant k, causing an overflow in linearize_channel:
It doesn't seem to be an issue with fixed k, though.
		cost[i] = params->cost_fraction[i]
		          *params->amount.millisatoshis /* Raw: linearize_channel */
		          *params->k_factor/(b-a);
The theory bounds the complexity of the MCF algorithms based on the assumption that costs, capacities and demand/supply are all integers.

Hmm, true. I need to think harder about the constraints here if we want to vary k again, though. It's multiplied by the product of two amounts, though one is scaled down to sats: that makes me a little nervous!

And unify logging for better debugging. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-EXPERIMENTAL: `askrene` now has better logging, gives notifications of progress.

Rather than adding to the gossmap modifications directly, populate the layer and have the normal layer application logic do it. This is consistent when we query layers in the next patch. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-EXPERIMENTAL: `getroutes` now applies `auto.sourcefree` layer in the order specified, so doesn't alter channels changed in later layers.

… for completeness. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

…annels. In particular, this lets you find the exact htlc_maximum_msat/htlc_minimum_msat values. This means we actually create real channel_updates for local mods, which requires a second "local" scratch region. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

The fp16_t values are approximations (overestimate for htlc_max, underestimate for htlc_min), so in the refinement step we should use the exact values. This also fixes a logic bug: flow_remaining_capacity returned the total capacity, not the additional capacity! Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-EXPERIMENTAL: `askrene` now honors exact htlc_maximum_msat limits.

This happens in the coming "real network" test! We add fees and hit htlc_max, but don't have another flow to add to. Rather than MCF again, we split the flow into two. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Flow cycles can occur if we have arc zero arc costs. The previous path construction from the flow in the network assumed the absence of such cycles and would enter an infinite loop if it hit one. With his patch wee add cycle detection and removal during the path construction phase. Reported-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Lagrang3 <lagrang3@protonmail.com> Changelog-EXPERIMENTAL: `askrene` infinite loop fixed

I tested with a really large gossmap (hacked to be 4GB), and when we keep retrying to minimize cost (calling minflow 11 times), and we don't free tmpctx. Due to an issue with how gossmap estimates the index sizes, we ended up running out of memory. This fixes it. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

I checked the failures, they seem real. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

We ask it again, but reduce fees by 1msat from the previous answer. This is really nasty, as it frequently exercises the case where we only go over fee when we do the refinement step. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

The current prob_cost_factor setting does not seem to make mu very effective, in fact, it gives strange results: plugin-cln-askrene: notify msg unusual: The flows had a fee of 151950msat, greater than max of 53697msat, retrying with mu of 10%... plugin-cln-askrene: notify msg unusual: The flows had a fee of 220126msat, greater than max of 53697msat, retrying with mu of 20%... We would expect increasing mu to *reduce* the fee! As a first step, simplify (it can't be infinite, and the -1 are weird). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

…annel. During "test_real_data", then only successes with reduced fees were 92 on "mu=10", and only 1 on "mu=30": the rest went to mu=100 and failed. I tried numerous approaches, and in the end, opted for the simplest: The typical range of probability costs looks likes: min = 0, max = 924196240, mean = 10509.4, stddev = 1.9e+06 The typical range of linear fee costs looks like: min = 0, max = 101000000, mean = 81894.6, stddev = 2.6e+06 This implies a k factor of 8 makes the two comparable. This makes the two numbers comparable, and thus makes "mu" much more effective. Here are the number of different mu values we succeeded at: 87 mu=0 90 mu=10 42 mu=20 24 mu=30 17 mu=40 19 mu=50 19 mu=60 11 mu=70 95 mu=80 19 mu=90 Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

I noticed this in the logs: plugin-cln-askrene: notify msg unusual: The flows had a fee of 151950msat, greater than max of 53697msat, retrying with mu of 10%... plugin-cln-askrene: notify msg unusual: The flows had a fee of 220126msat, greater than max of 53697msat, retrying with mu of 20%... We would expect increasing mu to *reduce* the fee! Turns out that our linear fee is a bad terrible approximation, because I was using base_fee_penalty of 10.0. | | / __ <- real fee, with base: fee = base + propfee * amount. | / __/ | _// | __/ | __/_/ |/ _/ | _/ <- linearized fee: fee = linear * amount |/ +----------------------------------- These cross over where linear = propfee + base / amount. Assume we split the payment into 10 parts, this implies that the base_fee_penalty should be 10 / amount (this gives a slight penalty to the normal case, but that's ok). This gives better results, too: we get down to 650099 sats in fees, vs 801613 before. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Even after the previous fix, we still occasionally increase fees when my increases. This is due to the difference between MCF's linear fees, and actual fees, and is unavoidable, but add a check if it somehow happens. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Lagrang3 · 2024-10-11T12:35:40Z

No worries. I am working on a new MCF solver. We will have more control on the base unit.

plugins/askrene/askrene.c

While the `k=8` value worked for the current main network tests with the amounts in those tests, it wasn't robust across a wider range of values (as demonstrated when other test changes broke tests!). Time to do this properly: calculate the ratio at the time we combine them, using median values. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

I noticed that increasing mu a little bit sometimes made a big difference, because by completely ignoring fees we were choosing the worst of two channels in some cases. Start at 1% fees; this saves a lot on initial fees in this test! Here's the new stats on mu levels: 96 mu=1 90 mu=10 41 mu=20 30 mu=30 24 mu=40 19 mu=50 22 mu=60 8 mu=70 95 mu=80 19 mu=90 Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-EXPERIMENTAL: `askrene` is now better at finding low-fee paths.

We were trying to get the max capacity of a flow to see if we could add some more sats, and hit an assertion: tests/test_askrene.py:707: ``` DEBUG plugin-cln-askrene: notify msg info: Flow reduced to deliver 88070161msat not 90008000msat, because 107x1x0/1 has remaining capacity 88071042msat DEBUG plugin-cln-askrene: notify msg info: Flow reduced to deliver 284138158msat not 284787000msat, because 108x1x0/1 has remaining capacity 284141000msat **BROKEN** plugin-cln-askrene: Flow delivers 129565000msat but max only 56506138msat INFO plugin-cln-askrene: Killing plugin: exited during normal operation ``` We need to *unreserve* our flow before asking for max capacity. We were also missing a few less important cases where we altered flows without altering the reservation, so fix those too. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell · 2024-10-12T07:07:01Z

Added a new method for deriving k which should be more stable (and indeed, other changes kept breaking tests which didn't use the real data, as the k value of 8 was too different). Also fixed another unrelated bug in reservations.

Since we know the total reservations on each hop, we can more easily determine probabilities than using flowset_probability() which has to replicate this collision detection. We leave both in place for now, to check. The results are not identical, due to slightly different calculation methods. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Now we've checked it gives the same answers, we can remove a lot of work in flow.c. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

Lagrang3 doesn't like the logging in here at all, but he suggested we at least be consistent! Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

pyln-testing: understand gossip_store_file arg in get_nodes.

e5a57d4

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell added this to the v24.11 milestone Oct 10, 2024

rustyrussell requested a review from Lagrang3 October 10, 2024 07:08

rustyrussell requested a review from cdecker as a code owner October 10, 2024 07:08

rustyrussell force-pushed the askrene-more-tweaks branch 4 times, most recently from def417c to e234c03 Compare October 11, 2024 07:02

Lagrang3 reviewed Oct 11, 2024

View reviewed changes

doc/schemas/lightning-getroutes.json Outdated Show resolved Hide resolved

Lagrang3 reviewed Oct 11, 2024

View reviewed changes

plugins/askrene/askrene.c Outdated Show resolved Hide resolved

Lagrang3 reviewed Oct 11, 2024

View reviewed changes

plugins/askrene/explain_failure.c Show resolved Hide resolved

rustyrussell force-pushed the askrene-more-tweaks branch from e234c03 to de3419b Compare October 11, 2024 10:32

rustyrussell and others added 14 commits October 11, 2024 21:30

askrene: give notifications back to caller as we go.

23fc792

And unify logging for better debugging. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-EXPERIMENTAL: `askrene` now has better logging, gives notifications of progress.

askrene: make auto.sourcefree a real layer, too.

ea074bf

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-EXPERIMENTAL: `getroutes` now applies `auto.sourcefree` layer in the order specified, so doesn't alter channels changed in later layers.

gossmap: include cltv_expiry_delta in gossmap_chan_get_update_details…

e7ac294

… for completeness. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

pytest: test askrene on real network data.

3b019e2

I checked the failures, they seem real. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

pytest: test askrene with worse maxfee argument.

302fde5

We ask it again, but reduce fees by 1msat from the previous answer. This is really nasty, as it frequently exercises the case where we only go over fee when we do the refinement step. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell force-pushed the askrene-more-tweaks branch from de3419b to 87c4da3 Compare October 11, 2024 11:00

Lagrang3 reviewed Oct 11, 2024

View reviewed changes

plugins/askrene/askrene.c Show resolved Hide resolved

rustyrussell added 3 commits October 12, 2024 12:50

rustyrussell force-pushed the askrene-more-tweaks branch from 87c4da3 to 8b83d12 Compare October 12, 2024 07:06

rustyrussell added 3 commits October 13, 2024 09:47

askrene: remove flowset_probability() now refine step calculates it.

2eb37d7

Now we've checked it gives the same answers, we can remove a lot of work in flow.c. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

askrene: more logging in explain_failure.

2d3c004

Lagrang3 doesn't like the logging in here at all, but he suggested we at least be consistent! Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

rustyrussell merged commit 318e49e into ElementsProject:master Oct 14, 2024
38 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Askrene more tweaks #7734

Askrene more tweaks #7734

Uh oh!

rustyrussell commented Oct 10, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rustyrussell commented Oct 11, 2024

Uh oh!

rustyrussell commented Oct 11, 2024

Uh oh!

Lagrang3 commented Oct 11, 2024

Uh oh!

rustyrussell commented Oct 11, 2024

Uh oh!

Lagrang3 commented Oct 11, 2024

Uh oh!

Uh oh!

rustyrussell commented Oct 12, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Askrene more tweaks #7734

Askrene more tweaks #7734

Uh oh!

Conversation

rustyrussell commented Oct 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rustyrussell commented Oct 11, 2024

Uh oh!

rustyrussell commented Oct 11, 2024

Uh oh!

Lagrang3 commented Oct 11, 2024

Uh oh!

rustyrussell commented Oct 11, 2024

Uh oh!

Lagrang3 commented Oct 11, 2024

Uh oh!

Uh oh!

rustyrussell commented Oct 12, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rustyrussell commented Oct 10, 2024 •

edited

Loading