PostRestartHtlcCleaner handle channel closing #1338

Merged
merged 9 commits into from Apr 1, 2020

Conversation

t-bast
Member

@t-bast t-bast commented Mar 2, 2020

There are a few edge cases where we currently don't reconcile correctly after a restart, which may lead to channels being closed unnecessarily. I recommend reviewing commit by commit.

There are dangling elements in the map that can be ignored: HTLCs that still appear downstream in a closing channel, but were correctly resolved upstream (adde74c).

A more complex scenario to handle is when the downstream channel is closing (because our peer didn't send a revocation in time, an HTLC timed out or some other failure) and we missed the notification from the channel because of a reboot. In some cases the channel will not re-emit an event, so we need to look at the channel state to correctly fail upstream (d90500a).

There was also some clean-up that could be done on the script helpers and htlc-timeout txs post-MPP (2245197 and 4aee086).

@t-bast t-bast requested a review from pm47 March 2, 2020 14:14
@codecov-io

codecov-io commented Mar 2, 2020

Codecov Report

Merging #1338 into master will decrease coverage by 0.03%.
The diff coverage is 88.09%.

@@            Coverage Diff             @@
##           master    #1338      +/-   ##
==========================================
- Coverage   86.42%   86.39%   -0.04%     
==========================================
  Files         119      119              
  Lines        9261     9306      +45     
  Branches      390      387       -3     
==========================================
+ Hits         8004     8040      +36     
- Misses       1257     1266       +9     
| Impacted Files | Coverage Δ |
|---|---|
| ...c/main/scala/fr/acinq/eclair/payment/Auditor.scala | 93.47% <50.00%> (ø) |
| ...c/main/scala/fr/acinq/eclair/channel/Channel.scala | 85.71% <73.68%> (+0.05%) ⬆️ |
| .../eclair/payment/relay/PostRestartHtlcCleaner.scala | 85.43% <81.08%> (-2.57%) ⬇️ |
| ...c/main/scala/fr/acinq/eclair/channel/Helpers.scala | 96.30% <98.07%> (-0.19%) ⬇️ |
| ...n/scala/fr/acinq/eclair/transactions/Scripts.scala | 90.47% <100.00%> (+2.24%) ⬆️ |
| ...clair/blockchain/electrum/ElectrumClientPool.scala | 78.49% <0.00%> (-4.31%) ⬇️ |
| ...nq/eclair/blockchain/electrum/ElectrumWallet.scala | 81.00% <0.00%> (+0.25%) ⬆️ |
| ...q/eclair/blockchain/electrum/ElectrumWatcher.scala | 55.20% <0.00%> (+1.60%) ⬆️ |

@t-bast t-bast force-pushed the post-restart-improvements branch from 1520ca3 to 5936b1e Compare March 5, 2020 12:45
Member

@pm47 pm47 left a comment

I fail to see how 8815b91 is related to dust HTLCs. Overridden/Timed out yes, but dust?

BTW I think dust HTLCs should be failed quickly (as soon as the commitment tx confirms), but I don't think we do it?

@@ -149,10 +149,10 @@ class Relayer(nodeParams: NodeParams, router: ActorRef, register: ActorRef, comm

      case Status.Failure(addFailed: AddHtlcFailed) =>
        addFailed.origin match {
-         case Origin.Local(id, None) => log.error(s"received unexpected add failed with no sender (paymentId=$id)")
+         case Origin.Local(id, None) => postRestartCleaner forward addFailed

Member

beautiful, makes so much sense to have all those cases properly handled!

Suggested change
- case Origin.Local(id, None) => postRestartCleaner forward addFailed
+ case Origin.Local(_, None) => postRestartCleaner forward addFailed

Member Author

Exactly, I've always felt uneasy with those unhandled cases. I knew I was missing something but I didn't know what... Now I know, and I'll sleep better xD

Member Author

They're back and still not handled, but this is because they really should never appear now that AddHtlcFailed is used properly. It still feels dangerous though, and I don't know how I could improve that.

@t-bast t-bast changed the title PostRestartHtlcCleaner handle channel close PostRestartHtlcCleaner handle channel closing Mar 20, 2020
When a channel is closed we want to remove its HTLCs from our
list of pending broken HTLCs (they are being resolved on-chain).

We should also ignore outgoing HTLCs that have already been
settled upstream (which can happen when downstream is closing).
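The two clean-up rules in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Upstream`, `Outgoing`, `removeClosedChannel` and `ignoreSettledUpstream` are hypothetical names, and the real eclair types carry much more information.

```scala
object BrokenHtlcSketch {
  // Hypothetical, simplified stand-ins; eclair's real types are richer.
  final case class Upstream(channelId: String, htlcId: Long)
  final case class Outgoing(channelId: String, htlcId: Long, origin: Upstream)

  // HTLCs of a closed channel are being resolved on-chain, so stop tracking them.
  def removeClosedChannel(pending: Set[Upstream], closedChannelId: String): Set[Upstream] =
    pending.filterNot(_.channelId == closedChannelId)

  // Keep only outgoing HTLCs whose upstream HTLC is still pending;
  // the others were already settled upstream and can be ignored.
  def ignoreSettledUpstream(outgoing: Seq[Outgoing], pending: Set[Upstream]): Seq[Outgoing] =
    outgoing.filter(o => pending.contains(o.origin))
}
```

Both functions are pure filters over immutable collections, which keeps the sketch easy to test in isolation.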
When a downstream channel is closing, we can safely fail upstream the
HTLCs that were either timed out on-chain or not included in the
broadcast commit transaction.

Channels will not always raise events about those after a reboot, so we
need to inspect the channel state and detect such HTLCs.
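As a rough sketch of that inspection, assuming we can list the outgoing HTLCs of the latest channel state, the HTLCs present in the broadcast commit transaction, and the HTLCs whose timeout was seen on-chain (all three inputs and the `Htlc` type are hypothetical simplifications, not eclair's API):

```scala
object ClosingDownstreamSketch {
  // Hypothetical minimal HTLC representation.
  final case class Htlc(id: Long, paymentHash: String)

  // HTLCs that can be failed upstream while the downstream channel is closing.
  def htlcsToFailUpstream(outgoing: Set[Htlc],
                          inPublishedCommit: Set[Htlc],
                          timedOutOnChain: Set[Htlc]): Set[Htlc] = {
    // Not included in the broadcast commit tx: they can never be fulfilled on-chain.
    val overridden = outgoing.diff(inPublishedCommit)
    // Included in the commit tx, but their timeout was observed on-chain.
    val timedOut = outgoing.intersect(timedOutOnChain)
    overridden.union(timedOut)
  }
}
```

An HTLC that is in the broadcast commit transaction and has not timed out yet stays pending in this sketch, since downstream could still reveal the preimage.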
To extract the payment_hash or preimage from an HTLC script seen on-chain.
With MPP, it's possible that a channel contains multiple HTLCs for the
same payment hash, and potentially even for the same expiry and amount.

We add more fine-grained handling of HTLC timeouts that share the same
payment hash. This allows cleaner handling after a restart, and makes
sure we correctly detect failures that should be propagated upstream.
Without this we wouldn't lose any money, but some channels might be
closed unnecessarily.
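One way to picture the problem: several HTLCs can look identical on-chain, so each confirmed htlc-timeout tx has to be paired with a distinct HTLC. The sketch below is only an illustration under stated assumptions: `Htlc`, `TimeoutTx` and `matchTimeouts` are hypothetical, the fields of `TimeoutTx` stand for whatever can be recovered from the confirmed transaction, and pairing with the lowest unmatched HTLC id is an arbitrary tie-break chosen for this example.

```scala
object MppTimeoutMatchingSketch {
  // Hypothetical simplified types.
  final case class Htlc(id: Long, paymentHash: String, amountMsat: Long, cltvExpiry: Long)
  final case class TimeoutTx(txId: String, paymentHash: String, amountMsat: Long, cltvExpiry: Long)

  // Pair each confirmed htlc-timeout tx with one HTLC that is not matched yet.
  // Returns a map from txId to the HTLC considered timed out by that tx.
  def matchTimeouts(htlcs: Seq[Htlc], confirmed: Seq[TimeoutTx]): Map[String, Htlc] = {
    val start = (htlcs.sortBy(_.id), Map.empty[String, Htlc])
    val (_, matched) = confirmed.foldLeft(start) { case ((remaining, acc), tx) =>
      val candidate = remaining.find { h =>
        h.paymentHash == tx.paymentHash && h.amountMsat == tx.amountMsat && h.cltvExpiry == tx.cltvExpiry
      }
      candidate match {
        // Consume the matched HTLC so a second identical tx picks the next one.
        case Some(h) => (remaining.filterNot(_.id == h.id), acc + (tx.txId -> h))
        case None    => (remaining, acc)
      }
    }
    matched
  }
}
```

With two identical HTLCs and two confirmed timeout txs, each tx ends up paired with a different HTLC, so both failures can be propagated upstream instead of only one.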
@t-bast t-bast requested a review from pm47 March 31, 2020 17:03
A couple refactorings to avoid duplication and some clean-up.
pm47
pm47 previously approved these changes Apr 1, 2020
It may happen that a commit tx and some htlc-timeout txs end up in the
same block. In that case, there is no guarantee on the order we'll receive
the confirmation events.

If any tx in a local/remoteCommitPublished is confirmed, that implicitly
means that the commit tx is confirmed (because it spends from it).
So we can consider the closing type known and forward the failure upstream.
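The reasoning above can be sketched as a single check. This is a hedged illustration only: `CommitPublished` here is a hypothetical, stripped-down model that tracks transaction ids as strings, and `isConfirmed` is an assumed shape rather than the field actually added in the PR.

```scala
object ClosingTypeSketch {
  // Hypothetical model of a published commitment and the htlc-timeout txs spending from it.
  final case class CommitPublished(commitTxId: String, htlcTimeoutTxIds: Set[String]) {
    // If the commit tx or any tx spending from it is confirmed, the commit tx is
    // necessarily confirmed too, whatever order the confirmation events arrive in.
    def isConfirmed(irrevocablySpentTxIds: Set[String]): Boolean =
      (htlcTimeoutTxIds + commitTxId).exists(irrevocablySpentTxIds.contains)
  }
}
```

The check does not depend on event ordering, which is the property needed when the commit tx and an htlc-timeout tx land in the same block.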
@t-bast
Member Author

t-bast commented Apr 1, 2020

79f8d37 fixes the transient integration tests failure. These failures happened because we may receive BITCOIN_TX_CONFIRMED for an htlc-timeout tx before we receive BITCOIN_TX_CONFIRMED for the commit-tx. Because of the changes introduced in 0791853, the closing_type would be None and we would not fail the HTLC upstream.

There are many ways we can fix that:

  1. Update isClosingTypeAlreadyKnown to handle txs out-of-order: if any tx is confirmed, we know the closing_type because this tx spends from the commit-tx (or is the commit-tx)
  2. When we receive BITCOIN_TX_CONFIRMED but the parent of that tx isn't confirmed, stash that message for later (delay its processing)
  3. When we receive BITCOIN_TX_CONFIRMED for the (local/remote) commit-tx, check if we already have other irrevocablySpent txs that we didn't fully process and process them

I chose solution 1 as I think it's the one that makes the most sense. Maybe we should rename the isConfirmed field that I introduced or move it somewhere else, let me know.

Member

@pm47 pm47 left a comment

30e4f97 is much better.

What do you think of the tests failing for 79f8d37? I don't remember seeing that kind of failure before.

@t-bast
Member Author

t-bast commented Apr 1, 2020

What do you think of the tests failing for 79f8d37? I don't remember seeing that kind of failure before.

It was because doing what I did for revoked created a regression on the .get that's done below...
So I finally kept revoked the way it was before, especially since it doesn't impact htlc-timeout txs.

@t-bast t-bast merged commit f9789b7 into master Apr 1, 2020
@t-bast t-bast deleted the post-restart-improvements branch April 1, 2020 15:51