Channels shutting down with "bad reestablish revocation_number" #5240

Closed
laanwj opened this issue May 5, 2022 · 4 comments · Fixed by #5256

Comments

laanwj (Contributor) commented May 5, 2022

Issue and Steps to Reproduce

In the last day, on a fairly large node, three channels have shut down on me with these messages:

                     "message": "channeld: sent ERROR bad reestablish revocation_number: 7457 vs 7462"
                     "message": "channeld: sent ERROR bad reestablish revocation_number: 271 vs 513"
                     "message": "channeld: sent ERROR bad reestablish revocation_number: 529 vs 543"

I am sure two of them are with c-lightning nodes, and probably all three are. Two of them are fairly recent channels (opened last week).

I have never restored from a backup, nor have I shut down c-lightning uncleanly, at least in the last month. Also, possibly important: the closes did not happen when my daemon started, but while it was running and the channels had been active for at least a few hours.

Using c-lightning master as of 43ace03.

getinfo output

{
   "num_peers": 65,
   "num_pending_channels": 0,
   "num_active_channels": 63,
   "num_inactive_channels": 2,
   "version": "v0.11.0.1-6-g43ace03",
   "blockheight": 734998,
   "network": "bitcoin",
   "our_features": {
      "init": "80282269a2",
      "node": "800080282269a2",
      "channel": "",
      "invoice": "02000020024100"
   }
}

Plugins and configuration:

    --plugin=$HOME/src/plugins/lnurlp/lnurlp.py \
    --plugin=$HOME/src/plugins/rebalance/rebalance.py \
    --plugin=$HOME/src/plugins/summary/summary.py \
    --plugin=$HOME/src/plugins/prometheus/prometheus.py \
    --experimental-dual-fund \
    --experimental-onion-messages \
    --experimental-offers \

Logs before close:

2022-05-04T10:47:34.373Z INFO    plugin-bcli: bitcoin-cli initialized and connected to bitcoind. (c-lightning started)
⋮ 
2022-05-04T13:07:13.965Z INFO    [X]-channeld-chan#59650: Peer connection lost
2022-05-04T13:07:13.965Z INFO    [X]-chan#59650: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-04T13:07:14.102Z UNUSUAL [X]-chan#59650: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish revocation_number: 529 vs 543
2022-05-04T13:07:14.123Z INFO    [X]-chan#59650: State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL
⋮ 
2022-05-05T01:18:43.090Z INFO    [Y]-channeld-chan#59624: Peer connection lost
2022-05-05T01:18:43.093Z INFO    [Y]-chan#59624: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (9)
2022-05-05T01:18:43.198Z UNUSUAL [Y]-chan#59624: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish revocation_number: 7457 vs 7462
2022-05-05T01:18:43.223Z INFO    [Y]-chan#59624: State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL
⋮ 
2022-05-05T07:48:29.676Z INFO    [Z]-channeld-chan#59646: Peer connection lost
2022-05-05T07:48:29.677Z INFO    [Z]-chan#59646: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (9)
2022-05-05T07:48:29.856Z UNUSUAL [Z]-chan#59646: Peer permanent failure in CHANNELD_NORMAL: channeld: sent ERROR bad reestablish revocation_number: 271 vs 513
2022-05-05T07:48:29.860Z INFO    [Z]-chan#59646: State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL

Context information that may be useful:

  • Two of the three failing channels were intentionally dual-funded: one I opened to the other node using a lease, the other was opened to me after I set funderupdate fixed 100 false. Opening logs:
2022-03-31T08:10:45.242Z INFO    [Y]-chan#59624: State changed from CHANNELD_AWAITING_LOCKIN to CHANNELD_NORMAL
2022-05-02T09:42:22.796Z INFO    [Z]-chan#59646: State changed from DUALOPEND_OPEN_INIT to DUALOPEND_AWAITING_LOCKIN
2022-05-02T18:45:45.343Z INFO    [X]-chan#59650: State changed from DUALOPEND_OPEN_INIT to DUALOPEND_AWAITING_LOCKIN
  • The operating system is FreeBSD 13.0-RELEASE-p2; c-lightning is compiled with clang 11.0.1 and uses the default sqlite database backend.
laanwj (Contributor, Author) commented May 5, 2022

I've had contact with the admin of one of the other peers. Their node (running c-lightning 6ced555) wasn't restarting at the time of the disconnect either, and doesn't have any other problems. I struggle to understand how this sudden state loss(?) could have come about. I do think all three channels were fairly active, so it could be some kind of race condition.

Messages in their log around the event:

2022-05-05T07:48:32.962Z INFO    [M]-chan#2: Peer transient failure in CHANNELD_NORMAL: Reconnected
2022-05-05T07:48:33.466Z INFO    [M]-channeld-chan#2: Peer connection lost
2022-05-05T07:48:33.467Z INFO    [M]-chan#2: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)
2022-05-05T07:48:34.895Z UNUSUAL [M]-chan#2: Peer permanent failure in CHANNELD_NORMAL: channeld: received ERROR error channel [chan]: bad reestablish revocation_number: 271 vs 513
2022-05-05T07:48:34.896Z INFO    [M]-chan#2: State changed from CHANNELD_NORMAL to AWAITING_UNILATERAL
2022-05-05T07:49:28.783Z UNUSUAL [M]-chan#2: Peer permanent failure in AWAITING_UNILATERAL: Funding transaction spent
2022-05-05T07:49:28.784Z INFO    [M]-chan#2: State changed from AWAITING_UNILATERAL to AWAITING_UNILATERAL
2022-05-05T07:49:28.784Z INFO    [M]-chan#2: State changed from AWAITING_UNILATERAL to FUNDING_SPEND_SEEN
2022-05-05T07:49:28.831Z INFO    [M]-chan#2: State changed from FUNDING_SPEND_SEEN to ONCHAIN

I've started my node again (had shut it down in a kind of panic 😄) and am seeing no further channel closes.

laanwj (Contributor, Author) commented May 5, 2022

Looks like #5235 describes a similar issue. I did not make any change to the CLTV setting, though.

niftynei (Collaborator) commented May 11, 2022

First, it's worth pointing out that post channel open the code path is the same. In other words, once a channel has moved from the DUALOPEND_ states to CHANNELD_NORMAL, it uses the same code as a channel opened via the v1 code path, all in channeld.

The exception is 'leased' channels, which include an extra update message for the CSV lock that meters the lease timeout; but given that you've seen the same issue on a non-leased channel, that's a good indication the lease handling is likely not the problem here.

Second, as a point of interest, in the error bad reestablish revocation_number: 271 vs 513, the first number (271) is what your peer supplied as its next_revocation_number in its channel_reestablish message; the second (513) is your node's own next_index. Your node is failing the channel because you believe them to be behind.

Thanks for the detailed report.
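
For readers who want to see what that comparison looks like in code, here is a minimal, hypothetical sketch of such a check (invented names, not the actual channeld source), fed with the numbers from the log line above:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of the kind of sanity check described above; the
 * names and structure are invented for illustration and are not the
 * actual channeld code. */
static int reestablish_revocation_ok(uint64_t their_next_revocation_number,
				     uint64_t our_next_index)
{
	/* Allow the usual off-by-one for a revoke_and_ack the peer may
	 * still be waiting to receive; anything further behind means the
	 * peer has lost state, or we are looking at a stale message. */
	if (our_next_index > 0 &&
	    their_next_revocation_number < our_next_index - 1) {
		fprintf(stderr,
			"bad reestablish revocation_number: %" PRIu64
			" vs %" PRIu64 "\n",
			their_next_revocation_number, our_next_index);
		return 0;
	}
	return 1;
}

int main(void)
{
	/* The numbers from the log line quoted in this issue. */
	return reestablish_revocation_ok(271, 513) ? 0 : 1;
}
```

The one-commitment tolerance in the sketch roughly mirrors the BOLT 2 case where a revoke_and_ack simply needs to be retransmitted; only a larger gap is treated as the peer being behind.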

laanwj (Contributor, Author) commented May 13, 2022

> First, it's worth pointing out that post channel open the code path is the same. In other words, once a channel has moved from the DUALOPEND_ states to CHANNELD_NORMAL, it uses the same code as a channel opened via the v1 code path, all in channeld.
> … that's a good indication the lease handling is likely not the problem here.

Thanks for the explanation; I agree. It was pretty far-fetched for it to be about the kind of opening, but I couldn't think of anything else.

> Second, as a point of interest, in the error bad reestablish revocation_number: 271 vs 513, the first number (271) is what your peer supplied as its next_revocation_number in its channel_reestablish message; the second (513) is your node's own next_index. Your node is failing the channel because you believe them to be behind.

I wondered who was blaming whom!

rustyrussell added a commit to rustyrussell/lightning that referenced this issue May 13, 2022
We had multiple reports of channels being unilaterally closed because
it seemed like the peer was sending old revocation numbers.

Turns out, it was actually old reestablish messages!  When we have a
reconnection, we would put the new connection aside, and tell lightningd
to close the current connection: when it did, we would restart
processing of the initial reconnection.

However, we could end up with *multiple* "reconnecting" connections,
while waiting for an existing connection to close.  Though the
connections were long gone, there could still be messages queued
(particularly the channel_reestablish message, which comes early on).

Eventually, a normal reconnection would cause us to process one of
these reconnecting connections, and channeld would see the (perhaps
very old!) messages, and get confused.

(I have a test which triggers this, but it also hangs the connect
 command, due to other issues we will fix in the next release...)

Fixes: ElementsProject#5240
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
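
To make the race described in this commit message concrete, here is a deliberately simplified, hypothetical sketch of the pattern it describes (all types, fields, and numbers invented for illustration): several "reconnecting" connections are parked with their early messages still queued, and resuming the wrong one replays a channel_reestablish that is far behind the channel's real state.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Deliberately simplified illustration of the bug pattern described in
 * the commit message above; every type, field and number is invented. */
struct parked_conn {
	int fd;                          /* socket of a connection that is long gone */
	uint64_t queued_reestablish;     /* next_revocation_number captured at connect time */
};

int main(void)
{
	/* Each reconnection was put aside while waiting for lightningd to
	 * close the previous connection, so several could pile up, each
	 * carrying a channel_reestablish from a different moment. */
	struct parked_conn parked[] = {
		{ .fd = 11, .queued_reestablish = 271 },  /* very old */
		{ .fd = 12, .queued_reestablish = 509 },  /* slightly old */
		{ .fd = 13, .queued_reestablish = 513 },  /* matches current state */
	};
	uint64_t our_next_index = 513;

	/* Bug pattern: a later, normal reconnection resumes one of the
	 * stale entries instead of the newest one... */
	const struct parked_conn *resumed = &parked[0];

	/* ...and channeld, seeing a number far behind its own, fails the
	 * channel exactly as in the logs above. */
	if (resumed->queued_reestablish < our_next_index - 1)
		printf("bad reestablish revocation_number: %" PRIu64 " vs %" PRIu64 "\n",
		       resumed->queued_reestablish, our_next_index);
	return 0;
}
```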
rustyrussell added a commit to rustyrussell/lightning that referenced this issue May 16, 2022
We don't have to put aside a peer which is reconnecting and wait for
lightningd to remove the old peer, we can now simply free the old
and add the new.

Fixes: ElementsProject#5240
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
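
A hedged sketch of the behavioural change this later commit describes (invented types, not the actual connectd code): rather than parking the new connection and waiting for lightningd to tear down the old one, the old connection is freed immediately and the new one takes its place, so no stale queued messages survive to be replayed.

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: invented types standing in for real connection
 * state (fds, queued messages, subdaemon handles, ...). */
struct conn {
	int id;
};

struct peer {
	struct conn *active;
};

/* The change described in the commit message: on reconnect, free the
 * old connection immediately and install the new one, instead of
 * putting the new one (and its queued channel_reestablish) aside until
 * lightningd has removed the old peer. */
static void peer_reconnected(struct peer *p, struct conn *incoming)
{
	free(p->active);          /* nothing stale is left around to replay later */
	p->active = incoming;
}

int main(void)
{
	struct peer p = { .active = malloc(sizeof(struct conn)) };
	p.active->id = 1;

	struct conn *incoming = malloc(sizeof(struct conn));
	incoming->id = 2;

	peer_reconnected(&p, incoming);
	printf("active connection is now #%d\n", p.active->id);

	free(p.active);
	return 0;
}
```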