PostRestartHtlcCleaner handle channel closing #1338

t-bast · 2020-03-02T14:14:51Z

There are a few edge cases where we currently don't reconcile correctly after a restart (which may lead to channels being closed, which is inconvenient). It's recommended to review commit by commit.

There are dangling elements in the map that can be ignored: HTLCs that still appear downstream in a closing channel, but were correctly resolved upstream (adde74c).

A more complex scenario to handle is when the downstream channel is closing (because our peer didn't send a revocation in time, an HTLC timed out or some other failure) and we missed the notification from the channel because of a reboot. In some cases the channel will not re-emit an event, so we need to look at the channel state to correctly fail upstream (d90500a).

There was also some clean-up that could be done on the scripts helpers and htlc-timeout txs post-MPP (2245197 and 4aee086).

codecov-io · 2020-03-02T14:34:25Z

Codecov Report

Merging #1338 into master will decrease coverage by 0.03%.
The diff coverage is 88.09%.

@@            Coverage Diff             @@
##           master    #1338      +/-   ##
==========================================
- Coverage   86.42%   86.39%   -0.04%     
==========================================
  Files         119      119              
  Lines        9261     9306      +45     
  Branches      390      387       -3     
==========================================
+ Hits         8004     8040      +36     
- Misses       1257     1266       +9

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...c/main/scala/fr/acinq/eclair/payment/Auditor.scala | `93.47% <50.00%> (ø)` | |
| ...c/main/scala/fr/acinq/eclair/channel/Channel.scala | `85.71% <73.68%> (+0.05%)` | ⬆️ |
| .../eclair/payment/relay/PostRestartHtlcCleaner.scala | `85.43% <81.08%> (-2.57%)` | ⬇️ |
| ...c/main/scala/fr/acinq/eclair/channel/Helpers.scala | `96.30% <98.07%> (-0.19%)` | ⬇️ |
| ...n/scala/fr/acinq/eclair/transactions/Scripts.scala | `90.47% <100.00%> (+2.24%)` | ⬆️ |
| ...clair/blockchain/electrum/ElectrumClientPool.scala | `78.49% <0.00%> (-4.31%)` | ⬇️ |
| ...nq/eclair/blockchain/electrum/ElectrumWallet.scala | `81.00% <0.00%> (+0.25%)` | ⬆️ |
| ...q/eclair/blockchain/electrum/ElectrumWatcher.scala | `55.20% <0.00%> (+1.60%)` | ⬆️ |

pm47

I fail to see how 8815b91 is related to dust HTLCs. Overridden/Timed out yes, but dust?

BTW I think dust HTLCs should be failed quickly (as soon as the commitment tx confirms), but I don't think we do it?

pm47 · 2020-03-16T17:56:06Z

eclair-core/src/main/scala/fr/acinq/eclair/payment/relay/Relayer.scala

@@ -149,10 +149,10 @@ class Relayer(nodeParams: NodeParams, router: ActorRef, register: ActorRef, comm

    case Status.Failure(addFailed: AddHtlcFailed) =>
      addFailed.origin match {
-        case Origin.Local(id, None) => log.error(s"received unexpected add failed with no sender (paymentId=$id)")
+        case Origin.Local(id, None) => postRestartCleaner forward addFailed


beautiful, makes so much sense to have all those cases properly handled!

Suggested change

-        case Origin.Local(id, None) => postRestartCleaner forward addFailed
+        case Origin.Local(_, None) => postRestartCleaner forward addFailed

Exactly, I've always felt uneasy with those unhandled cases, I knew I was missing something but I didn't know what... Now I know, and I'll sleep better xD

They're back and still not handled now, but this is because they really should never appear now that AddHtlcFailed is used properly...still feels dangerous though, I don't know how I could improve that.

When a channel is closed we want to remove its HTLCs from our list of pending broken HTLCs (they are being resolved on-chain). We should also ignore outgoing HTLCs that have already been settled upstream (which can happen when downstream is closing).

When a downstream channel is closing, we can safely fail upstream the HTLCs that were either timed out on-chain or not included in the broadcast commit transaction. Channels will not always raise events about those after a reboot, so we need to inspect the channel state and detect such HTLCs.
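As a rough illustration of that rule, here is a minimal sketch over a simplified model. The `OutgoingHtlc` type, the function name and its three parameters are hypothetical and are not eclair's actual types or API; the sketch only assumes that we can tell, from the stored channel state, which HTLCs have an output in the broadcast commit transaction and which have a confirmed htlc-timeout transaction.

```scala
object FailUpstreamSketch {
  // Hypothetical, simplified model of an outgoing (downstream) HTLC.
  final case class OutgoingHtlc(id: Long, paymentHash: String)

  /**
   * @param pending         outgoing HTLCs still present in the channel state after the restart
   * @param inPublishedTx   ids of HTLCs that have an output in the broadcast commit transaction
   * @param timedOutOnChain ids of HTLCs whose htlc-timeout transaction has confirmed
   * @return the HTLCs that can be failed upstream without waiting for a channel event
   */
  def htlcsToFailUpstream(pending: Set[OutgoingHtlc],
                          inPublishedTx: Set[Long],
                          timedOutOnChain: Set[Long]): Set[OutgoingHtlc] =
    pending.filter { htlc =>
      // Either the HTLC never made it into the published commitment,
      // or it did and its timeout has already been confirmed on-chain.
      !inPublishedTx.contains(htlc.id) || timedOutOnChain.contains(htlc.id)
    }
}
```

An HTLC that is in the published commit transaction but has not timed out yet is left alone in this sketch, since it may still be fulfilled downstream.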

Add a helper function to extract the payment_hash or preimage from an HTLC script seen on-chain.
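To give an idea of the preimage side of this, here is a minimal sketch that does not parse scripts at all: it assumes we are handed the witness stack of a spending transaction as raw byte arrays and a set of payment hashes we care about (hex-encoded), and looks for a 32-byte item that hashes to one of them. `PreimageSketch` and `findPreimage` are hypothetical names for illustration and are not the helper added to `Scripts.scala`.

```scala
import java.security.MessageDigest

object PreimageSketch {
  private def sha256(data: Array[Byte]): Array[Byte] =
    MessageDigest.getInstance("SHA-256").digest(data)

  private def hex(bytes: Array[Byte]): String =
    bytes.map(b => "%02x".format(b)).mkString

  /**
   * Returns (paymentHashHex, preimage) for the first 32-byte witness item
   * whose SHA-256 is one of the known payment hashes, if any.
   */
  def findPreimage(witness: Seq[Array[Byte]],
                   knownPaymentHashes: Set[String]): Option[(String, Array[Byte])] =
    witness.iterator
      .filter(_.length == 32) // preimages are 32 bytes long
      .map(item => (hex(sha256(item)), item))
      .find { case (hash, _) => knownPaymentHashes.contains(hash) }
}
```

Checking the hash instead of relying on the position of the item in the witness keeps the sketch independent of the exact script layout, at the cost of needing the list of payment hashes up front.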

With MPP, it's possible that a channel contains multiple HTLCs for the same payment hash, and potentially even for the same expiry and amount. We add more fine-grained handling of HTLC timeouts that share the same payment hash. This allows a cleaner handling after a restart, and makes sure we correctly detect failure that should be propagated upstream. Otherwise we wouldn't be losing any money, but some channels may be closed that we can avoid.
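One way to picture the problem: if two HTLCs share the same payment hash and expiry, a single confirmed htlc-timeout tx must fail exactly one of them, not both. The sketch below is only an illustration under assumptions: the `Htlc` and `ConfirmedTimeout` types are hypothetical, and picking the lowest HTLC id first among identical candidates is an arbitrary choice made for the example, not something stated in this PR.

```scala
object HtlcTimeoutSketch {
  // Hypothetical, simplified HTLC model.
  final case class Htlc(id: Long, paymentHash: String, cltvExpiry: Long, amountMsat: Long)
  // What this sketch assumes we can learn from a confirmed htlc-timeout tx.
  final case class ConfirmedTimeout(paymentHash: String, cltvExpiry: Long)

  /** Assigns each confirmed timeout to a distinct matching HTLC, so N confirmed txs fail at most N HTLCs. */
  def timedOutHtlcs(htlcs: Seq[Htlc], confirmed: Seq[ConfirmedTimeout]): Set[Htlc] = {
    val start = (htlcs.sortBy(_.id), Set.empty[Htlc])
    val (_, matched) = confirmed.foldLeft(start) {
      case ((remaining, acc), tx) =>
        remaining.find(h => h.paymentHash == tx.paymentHash && h.cltvExpiry == tx.cltvExpiry) match {
          // Consume the matched HTLC so that it cannot be matched by another tx.
          case Some(h) => (remaining.filterNot(_ == h), acc + h)
          case None    => (remaining, acc)
        }
    }
    matched
  }
}
```

With two identical HTLCs and one confirmed timeout, only one HTLC is returned; the second is returned once a second matching timeout confirms.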

A couple refactorings to avoid duplication and some clean-up.

It may happen that a commit tx and some htlc-timeout txs end up in the same block. In that case, there is no guarantee on the order we'll receive the confirmation events. If any tx in a local/remoteCommitPublished is confirmed, that implicitly means that the commit tx is confirmed (because it spends from it). So we can consider the closing type known and forward the failure upstream.
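The reasoning above can be sketched as follows. This is a simplified model, not the actual `ChannelTypes.scala` code: `Tx` and `CommitPublished` are hypothetical stand-ins, and `irrevocablySpent` is modeled here as a plain set of confirmed txids, which is an assumption made to keep the example short.

```scala
object ClosingTypeSketch {
  final case class Tx(txid: String)

  // Hypothetical stand-in for a local/remoteCommitPublished record.
  final case class CommitPublished(commitTx: Tx,
                                   htlcTimeoutTxs: List[Tx],
                                   irrevocablySpent: Set[String]) {
    /**
     * True as soon as the commit tx, or any tx that spends from it, is confirmed.
     * A confirmed child implies a confirmed parent, so the order in which
     * confirmation events are received does not matter.
     */
    def isConfirmed: Boolean =
      (commitTx :: htlcTimeoutTxs).exists(tx => irrevocablySpent.contains(tx.txid))
  }
}
```

With this check, receiving the confirmation of an htlc-timeout tx before the confirmation of the commit tx is enough to consider the closing type known.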

t-bast · 2020-04-01T14:32:21Z

79f8d37 fixes the transient integration test failures. These failures happened because we may receive BITCOIN_TX_CONFIRMED for an htlc-timeout tx before we receive BITCOIN_TX_CONFIRMED for the commit-tx. Because of the changes introduced in 0791853, the closing_type would be None and we would not fail the HTLC upstream.

There are many ways we can fix that:

1. Update isClosingTypeAlreadyKnown to handle txs out-of-order: if any tx is confirmed, we know the closing_type because this tx spends from the commit-tx (or is the commit-tx).
2. When we receive BITCOIN_TX_CONFIRMED but the parent of that tx isn't confirmed, stash that message for later (delay its processing).
3. When we receive BITCOIN_TX_CONFIRMED for the (local/remote) commit-tx, check if we already have other irrevocablySpent txs that we didn't fully process and process them.

I chose solution 1 as I think it's the one that makes the most sense. Maybe we should rename the isConfirmed field that I introduced or move it somewhere else, let me know.

pm47

30e4f97 is much better.

What do you think of the tests failing for 79f8d37? I don't remember seeing that kind of failure before.

t-bast · 2020-04-01T15:37:54Z

> What do you think of the tests failing for 79f8d37? I don't remember seeing that kind of failure before.

It was because doing what I did for revoked created a regression on the .get that's done below...
So I finally kept revoked the way it was before, especially since it doesn't impact htlc-timeout txs.

t-bast requested a review from pm47 March 2, 2020 14:14

t-bast force-pushed the post-restart-improvements branch from 1520ca3 to 5936b1e Compare March 5, 2020 12:45

pm47 reviewed Mar 16, 2020

t-bast force-pushed the post-restart-improvements branch from 8815b91 to 813ad9c Compare March 20, 2020 17:55

t-bast changed the title ~~PostRestartHtlcCleaner handle channel close~~ PostRestartHtlcCleaner handle channel closing Mar 20, 2020

t-bast force-pushed the post-restart-improvements branch from 813ad9c to 4aee086 Compare March 30, 2020 11:53

pm47 reviewed Mar 30, 2020

t-bast added 6 commits March 31, 2020 18:57

Add missing cases to PostRestart

0a9d2bb

Add helper function to HTLC scripts

6d8d2be

Fix script function signatures (for real)

6267a8d

Refactor isClosingAlreadyKnown

0791853

t-bast force-pushed the post-restart-improvements branch from 6576390 to 0791853 Compare March 31, 2020 17:02

t-bast requested a review from pm47 March 31, 2020 17:03

pm47 reviewed Apr 1, 2020

pm47 reviewed Apr 1, 2020

Address PR comments

b9590b1

pm47 previously approved these changes Apr 1, 2020

t-bast dismissed pm47’s stale review via 79f8d37 April 1, 2020 14:25

pm47 reviewed Apr 1, 2020

fixup! Handle out-of-order htlc-timeout txs

30e4f97

pm47 reviewed Apr 1, 2020

pm47 approved these changes Apr 1, 2020

t-bast merged commit f9789b7 into master Apr 1, 2020

t-bast deleted the post-restart-improvements branch April 1, 2020 15:51
