Silent failure to broadcast anchor tx, then crash #67

Closed
gwillen opened this issue Nov 4, 2016 · 8 comments


@gwillen
Contributor

gwillen commented Nov 4, 2016

This time I tried 'connect'ing to the cat picture server using an input tx that only had one confirm. I don't know whether that was the cause, but I ended up with a connection but no channel, and lightningd never said anything about broadcasting an anchor tx.

(First thing-that-seems-like-a-bug: If lightningd is not opening a channel because the tx I supplied isn't deep enough yet, it would be helpful if it would tell me -- this protocol is very fiddly so good error reporting is key. There wasn't anything in the log, either.)

At this point 'getpeers' reported:

{ "peers" : 
	[ 
		{ "name" : "02915506c736ffec49ad58fc021779600dcd2b7a52ac97690571aea5b4d9be2706:", "state" : "STATE_OPEN_WAIT_FOR_OPEN_WITHANCHOR", "peerid" : "02915506c736ffec49ad58fc021779600dcd2b7a52ac97690571aea5b4d9be2706", "connected" : true } ] }

After a couple more confirms on the input tx, and nothing further from lightningd, I killed and restarted it. Still no anchor tx, but now 'getpeers' yielded:

{ "peers" : 
	[ 
		{ "name" : "02915506c736ffec49ad58fc021779600dcd2b7a52ac97690571aea5b4d9be2706:", "state" : "STATE_INIT", "peerid" : "02915506c736ffec49ad58fc021779600dcd2b7a52ac97690571aea5b4d9be2706", "connected" : true } ] }

So I decided to run 'connect' again just to see what I would get, yielding finally:

lightningd(4399): Connected json input
02915506c736ffec49ad58fc021779600dcd2b7a52ac97690571aea5b4d9be2706: Disconnected
lightningd: daemon/db.c:1686: db_forget_peer: Assertion `peer->state == STATE_CLOSED' failed.
lightningd(4399): FATAL SIGNAL 6 RECEIVED
Fatal signal 6. Log dumped in crash.log

So I guess I ended up in an inconsistent state and then went bang. That part definitely seems like a bug!
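For context on the assertion in the log, here is a minimal sketch in C. The state enum is a hypothetical reduction (the real daemon has many more states), and `db_forget_peer_checked` is a made-up name; only the `STATE_CLOSED` precondition comes from the crash above. It shows a defensive variant that would report the inconsistent state instead of aborting via `assert`:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical reduced state enum; names mirror the getpeers output above. */
enum peer_state {
    STATE_INIT,
    STATE_OPEN_WAIT_FOR_OPEN_WITHANCHOR,
    STATE_CLOSED
};

struct peer {
    enum peer_state state;
};

/* Sketch of a defensive variant of db_forget_peer(): instead of
 * asserting peer->state == STATE_CLOSED (which raises SIGABRT on any
 * inconsistent state, as in the crash log), refuse and report so the
 * caller can surface a useful error. */
static bool db_forget_peer_checked(struct peer *peer)
{
    if (peer->state != STATE_CLOSED) {
        fprintf(stderr, "db_forget_peer: peer in unexpected state %d\n",
                peer->state);
        return false;
    }
    /* ... delete the peer's rows from the database here ... */
    return true;
}
```

With the peer stuck in STATE_INIT after the restart, a check like this would have logged the state mismatch instead of dumping crash.log.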

@gwillen
Contributor Author

gwillen commented Nov 4, 2016

After a night's reflection, I notice something -- I have several times seen issues that feel like "lightningd is confused about keeping a TCP connection in sync with a lightning channel over that connection." The user interface treats the two together -- 'connect' accomplishes both -- but it feels like, especially in error cases or across daemon restarts, the daemon has trouble keeping all the state machines synced up with each other.

@cdecker
Member

cdecker commented Nov 5, 2016

There is no minimum required depth for the funds used to create a channel, so even 0-conf should work. As a matter of fact, that's how I open most of my channels. Not broadcasting is strange; do you have any indication in the logs of where it fails? Maybe we can add a JSON-RPC method to rebroadcast or extract funding transactions, so we can manually trigger/release them.

@cdecker
Member

cdecker commented Nov 5, 2016

Just had the same crash after attempting to disconnect a node that was unreachable, and in state STATE_INIT. https://gist.github.com/anonymous/251ce28774d9f0c094a5dbe06388b6bb

@cdecker
Member

cdecker commented Nov 5, 2016

Turns out it was a bit different: calling close on a channel that is neither in STATE_NORMAL nor STATE_NORMAL_COMMITTING will trigger an assertion and fault our client. But it seems that we are conflating connection state and channel state in a few places.
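A hedged sketch of the kind of guard that could replace the assertion. `STATE_NORMAL` and `STATE_NORMAL_COMMITTING` come from the comment above; the enum layout, function name, and error string are made up for illustration:

```c
#include <stddef.h>

/* Hypothetical reduced state enum for illustration only. */
enum channel_state {
    STATE_INIT,
    STATE_NORMAL,
    STATE_NORMAL_COMMITTING,
    STATE_CLOSED
};

/* Return NULL if a 'close' is allowed in this state, otherwise an
 * error string the JSON-RPC layer could return to the client
 * instead of tripping an assertion and killing the daemon. */
static const char *close_not_allowed_reason(enum channel_state s)
{
    if (s == STATE_NORMAL || s == STATE_NORMAL_COMMITTING)
        return NULL;
    return "Channel is not in a closable state";
}
```

Checking the state at the RPC boundary keeps channel-state errors as client-visible errors rather than daemon crashes.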

@rustyrussell
Contributor

OK, I have a patch which annotates the state information so getpeers will give you some more idea of what's happening. But:

STATE_OPEN_WAIT_FOR_OPEN_WITHANCHOR : We're actually waiting for them to send the OPEN packet! You shouldn't see that state for more than 1 RTT. That's why we reverted to STATE_INIT on restart; we didn't receive anything from the peer. Though clearly we didn't do anything useful there either...

The reconnect assert is another bug. I'll look at that too! Thanks!

@gwillen
Contributor Author

gwillen commented Nov 6, 2016

Interesting. @rustyrussell does this mean that the reason I never broadcast an anchor is that I was waiting for the remote host to send the OPEN packet?

It's interesting that this state can persist -- does this imply that we got a TCP connection, sent a packet through it, and then ... silence? It seems like this ought to resolve quickly one way or another.

@cdecker
Member

cdecker commented Nov 6, 2016

I don't think we have regular pings, so it could be a TCP connection getting stuck forever.
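In the absence of protocol-level pings, one stopgap (a sketch only; the helper name is made up, and the real fix would be application-level pings) is enabling TCP keepalive on the peer socket so the kernel eventually detects a dead connection:

```c
#include <sys/socket.h>

/* Enable SO_KEEPALIVE so the kernel probes idle connections and
 * eventually errors out the socket if the peer is gone. Default
 * probe timers are long (often over two hours), so this only
 * bounds how long a dead connection can linger; it does not
 * replace protocol-level pings. Returns 0 on success, -1 on error. */
static int enable_keepalive(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}
```

Called once on each accepted or connected peer socket, this would at least prevent a silently stuck TCP connection from pinning a channel in a wait state forever.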

@rustyrussell
Contributor

@cdecker I'll open another bug report for yours, one sec.

rustyrussell added a commit that referenced this issue Nov 10, 2016
db_forget_peer() was harmless, but we haven't been entered into the
database yet anyway, and it asserted that we should have been STATE_CLOSED.

Closes: #67
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>