Channels shutting down with "bad reestablish revocation_number" #5240

Closed
laanwj opened this issue May 5, 2022 · 4 comments · Fixed by #5256

Comments

laanwj (Contributor) commented May 5, 2022

Issue and Steps to Reproduce

In the last day, on a fairly large node, three channels have shut down on me with these messages:

                     "message": "channeld: sent ERROR bad reestablish revocation_number: 7457 vs 7462"
                     "message": "channeld: sent ERROR bad reestablish revocation_number: 271 vs 513"
                     "message": "channeld: sent ERROR bad reestablish revocation_number: 529 vs 543"

I am sure two of them are with c-lightning nodes, and probably all three are. Two of them are fairly recent channels (opened last week).

I have never restored from a backup, nor have I shut down c-lightning uncleanly, at least in the last month. Also, possibly important: the closes did not happen when my daemon started, but while it was running and the channels had been active for at least a few hours.

Using c-lightning master as of 43ace03.

getinfo output

{
   "num_peers": 65,
   "num_pending_channels": 0,
   "num_active_channels": 63,
   "num_inactive_channels": 2,
   "version": "v0.11.0.1-6-g43ace03",
   "blockheight": 734998,
   "network": "bitcoin",
   "our_features": {
      "init": "80282269a2",
      "node": "800080282269a2",
      "channel": "",
      "invoice": "02000020024100"
   }
}

Plugins and configuration:

    --plugin=$HOME/src/plugins/lnurlp/lnurlp.py \
    --plugin=$HOME/src/plugins/rebalance/rebalance.py \
    --plugin=$HOME/src/plugins/summary/summary.py \
    --plugin=$HOME/src/plugins/prometheus/prometheus.py \
    --experimental-dual-fund \
    --experimental-onion-messages \
    --experimental-offers \

Logs before close:

2022-05-04T10:47:34.373Z INFO    plugin-bcli: bitcoin-cli initialized and connected to bitcoind. (c-lightning started)
⋮ 
2022-05-04T13:07:13.965Z INFO    [X]-channeld-chan#59650: Peer connection lost
2022-05-04T13:07:13.965Z INFO    [X]-chan#59650: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-04T13:07:14.102Z UNUSUAL [X]-chan#59650: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish revocation_number: 529 vs 543
2022-05-04T13:07:14.123Z INFO    [X]-chan#59650: State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL
⋮ 
2022-05-05T01:18:43.090Z INFO    [Y]-channeld-chan#59624: Peer connection lost
2022-05-05T01:18:43.093Z INFO    [Y]-chan#59624: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (9)
2022-05-05T01:18:43.198Z UNUSUAL [Y]-chan#59624: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish revocation_number: 7457 vs 7462
2022-05-05T01:18:43.223Z INFO    [Y]-chan#59624: State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL
⋮ 
2022-05-05T07:48:29.676Z INFO    [Z]-channeld-chan#59646: Peer connection lost
2022-05-05T07:48:29.677Z INFO    [Z]-chan#59646: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (9)
2022-05-05T07:48:29.856Z UNUSUAL [Z]-chan#59646: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish revocation_number: 271 vs 513
2022-05-05T07:48:29.860Z INFO    [Z]-chan#59646: State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL

Context information that may be useful:

  • Two of the three failing channels were intentionally dual-funded: one I opened to the other node using a lease, the other was opened to me after I set funderupdate fixed 100 false. Opening logs:
2022-03-31T08:10:45.242Z INFO    [Y]-chan#59624: State changed from CHANNELD_AWAITING_LOCKIN to CHANNELD_NORMAL
2022-05-02T09:42:22.796Z INFO    [Z]-chan#59646: State changed from DUALOPEND_OPEN_INIT to DUALOPEND_AWAITING_LOCKIN
2022-05-02T18:45:45.343Z INFO    [X]-chan#59650: State changed from DUALOPEND_OPEN_INIT to DUALOPEND_AWAITING_LOCKIN
  • The operating system is FreeBSD 13.0-RELEASE-p2; c-lightning is compiled with clang 11.0.1 and uses the default sqlite database backend.
laanwj (Contributor, Author) commented May 5, 2022

I've had contact with the admin of one of the other peers. Their node (running c-lightning 6ced555) wasn't restarting at the time of the disconnect either, and doesn't have any other problems. I struggle to understand how this sudden state loss(?) could have come about. I do think all three channels were fairly active, so it could be some kind of race condition.

Messages in their log around the event:

2022-05-05T07:48:32.962Z INFO    [M]-chan#2: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-05T07:48:33.466Z INFO    [M]-channeld-chan#2: Peer connection lost
2022-05-05T07:48:33.467Z INFO    [M]-chan#2: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-05T07:48:34.895Z UNUSUAL [M]-chan#2: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel [chan]: bad reestablish revocation_number: 271 vs 513
2022-05-05T07:48:34.896Z INFO    [M]-chan#2: State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL
2022-05-05T07:49:28.783Z UNUSUAL [M]-chan#2: Peer permanent failure in AWAITING_UNILATERAL: Funding transaction spent
2022-05-05T07:49:28.784Z INFO    [M]-chan#2: State changed from AWAITING_UNILATERAL to AWAITING_UNILATERAL
2022-05-05T07:49:28.784Z INFO    [M]-chan#2: State changed from AWAITING_UNILATERAL to FUNDING_SPEND_SEEN
2022-05-05T07:49:28.831Z INFO    [M]-chan#2: State changed from FUNDING_SPEND_SEEN to ONCHAIN

I've started my node again (had shut it down in a kind of panic 😄) and am seeing no further channel closes.

laanwj (Contributor, Author) commented May 5, 2022

Looks like #5235 describes a similar issue. I did not make any change to the CLTV setting, though.

niftynei (Collaborator) commented May 11, 2022

First, it's worth pointing out that post channel open the code path is the same. In other words, once a channel has moved from the DUALOPEND_ states to CHANNELD_NORMAL, it uses the same code as a channel opened via the v1 code path, all in channeld.

The exception is 'leased' channels, which include an extra update message for the CSV lock that meters the lease timeout; but given that you've seen the same issue on a non-leased channel, that's a good indication the lease handling is likely not the problem here.

Second, as a point of interest, in the error bad reestablish revocation_number: 271 vs 513, the first number (271) is what your peer supplied as its next_revocation_number in its channel_reestablish message; the second (513) is your node's own next_index. Your node is failing the channel because you believe them to be behind.

Thanks for the detailed report.
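
For readers who want to see what that comparison looks like in code, here is a minimal, hypothetical sketch of such a check (invented names, not the actual channeld source), fed with the numbers from the log line above:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of the kind of sanity check described above; the
 * names and structure are invented for illustration and are not the
 * actual channeld code. */
static int reestablish_revocation_ok(uint64_t their_next_revocation_number,
				     uint64_t our_next_index)
{
	/* Allow the usual off-by-one for a revoke_and_ack the peer may
	 * still be waiting to receive; anything further behind means the
	 * peer has lost state, or we are looking at a stale message. */
	if (our_next_index > 0 &&
	    their_next_revocation_number < our_next_index - 1) {
		fprintf(stderr,
			"bad reestablish revocation_number: %" PRIu64
			" vs %" PRIu64 "\n",
			their_next_revocation_number, our_next_index);
		return 0;
	}
	return 1;
}

int main(void)
{
	/* The numbers from the log line quoted in this issue. */
	return reestablish_revocation_ok(271, 513) ? 0 : 1;
}
```

The one-commitment tolerance in the sketch roughly mirrors the BOLT 2 case where a revoke_and_ack simply needs to be retransmitted; only a larger gap is treated as the peer being behind.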

laanwj (Contributor, Author) commented May 13, 2022

> First, it's worth pointing out that post channel open the code path is the same. In other words, once a channel has moved from the DUALOPEND_ states to CHANNELD_NORMAL, it uses the same code as a channel opened via the v1 code path, all in channeld.
> … that's a good indication the lease handling is likely not the problem here.

Thanks for the explanation; I agree. It was pretty far-fetched for it to be about the kind of opening, but I couldn't think of anything else.

> Second, as a point of interest, in the error bad reestablish revocation_number: 271 vs 513, the first number (271) is what your peer supplied as its next_revocation_number in its channel_reestablish message; the second (513) is your node's own next_index. Your node is failing the channel because you believe them to be behind.

I wondered who was blaming whom!

rustyrussell added a commit to rustyrussell/lightning that referenced this issue May 13, 2022
We had multiple reports of channels being unilaterally closed because
it seemed like the peer was sending old revocation numbers.

Turns out, it was actually old reestablish messages!  When we have a
reconnection, we would put the new connection aside, and tell lightningd
to close the current connection: when it did, we would restart
processing of the initial reconnection.

However, we could end up with *multiple* "reconnecting" connections,
while waiting for an existing connection to close.  Though the
connections were long gone, there could still be messages queued
(particularly the channel_reestablish message, which comes early on).

Eventually, a normal reconnection would cause us to process one of
these reconnecting connections, and channeld would see the (perhaps
very old!) messages, and get confused.

(I have a test which triggers this, but it also hangs the connect
 command, due to other issues we will fix in the next release...)

Fixes: ElementsProject#5240
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
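
To make the race described in this commit message concrete, here is a deliberately simplified, hypothetical sketch of the pattern it describes (all types, fields, and numbers invented for illustration): several "reconnecting" connections are parked with their early messages still queued, and resuming the wrong one replays a channel_reestablish that is far behind the channel's real state.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Deliberately simplified illustration of the bug pattern described in
 * the commit message above; every type, field and number is invented. */
struct parked_conn {
	int fd;                          /* socket of a connection that is long gone */
	uint64_t queued_reestablish;     /* next_revocation_number captured at connect time */
};

int main(void)
{
	/* Each reconnection was put aside while waiting for lightningd to
	 * close the previous connection, so several could pile up, each
	 * carrying a channel_reestablish from a different moment. */
	struct parked_conn parked[] = {
		{ .fd = 11, .queued_reestablish = 271 },  /* very old */
		{ .fd = 12, .queued_reestablish = 509 },  /* slightly old */
		{ .fd = 13, .queued_reestablish = 513 },  /* matches current state */
	};
	uint64_t our_next_index = 513;

	/* Bug pattern: a later, normal reconnection resumes one of the
	 * stale entries instead of the newest one... */
	const struct parked_conn *resumed = &parked[0];

	/* ...and channeld, seeing a number far behind its own, fails the
	 * channel exactly as in the logs above. */
	if (resumed->queued_reestablish < our_next_index - 1)
		printf("bad reestablish revocation_number: %" PRIu64 " vs %" PRIu64 "\n",
		       resumed->queued_reestablish, our_next_index);
	return 0;
}
```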
rustyrussell added a commit to rustyrussell/lightning that referenced this issue May 16, 2022
We don't have to put aside a peer which is reconnecting and wait for
lightningd to remove the old peer, we can now simply free the old
and add the new.

Fixes: ElementsProject#5240
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
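
A hedged sketch of the behavioural change this later commit describes (invented types, not the actual connectd code): rather than parking the new connection and waiting for lightningd to tear down the old one, the old connection is freed immediately and the new one takes its place, so no stale queued messages survive to be replayed.

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: invented types standing in for real connection
 * state (fds, queued messages, subdaemon handles, ...). */
struct conn {
	int id;
};

struct peer {
	struct conn *active;
};

/* The change described in the commit message: on reconnect, free the
 * old connection immediately and install the new one, instead of
 * putting the new one (and its queued channel_reestablish) aside until
 * lightningd has removed the old peer. */
static void peer_reconnected(struct peer *p, struct conn *incoming)
{
	free(p->active);          /* nothing stale is left around to replay later */
	p->active = incoming;
}

int main(void)
{
	struct peer p = { .active = malloc(sizeof(struct conn)) };
	p.active->id = 1;

	struct conn *incoming = malloc(sizeof(struct conn));
	incoming->id = 2;

	peer_reconnected(&p, incoming);
	printf("active connection is now #%d\n", p.active->id);

	free(p.active);
	return 0;
}
```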