Silent failure to broadcast anchor tx, then crash #67

Closed
gwillen opened this issue Nov 4, 2016 · 8 comments


@gwillen
Contributor

gwillen commented Nov 4, 2016

This time I tried 'connect'ing to the cat picture server using an input tx that only had one confirm. I don't know whether that was the cause, but I ended up with a connection but no channel, and lightningd never said anything about broadcasting an anchor tx.

(First thing-that-seems-like-a-bug: If lightningd is not opening a channel because the tx I supplied isn't deep enough yet, it would be helpful if it would tell me -- this protocol is very fiddly so good error reporting is key. There wasn't anything in the log, either.)

At this point 'getpeers' reported:

{ "peers" : 
	[ 
		{ "name" : "02915506c736ffec49ad58fc021779600dcd2b7a52ac97690571aea5b4d9be2706:", "state" : "STATE_OPEN_WAIT_FOR_OPEN_WITHANCHOR", "peerid" : "02915506c736ffec49ad58fc021779600dcd2b7a52ac97690571aea5b4d9be2706", "connected" : true } ] }

After a couple more confirms on the input tx, and nothing further from lightningd, I killed and restarted it. Still no anchor tx, but now 'getpeers' yielded:

{ "peers" : 
	[ 
		{ "name" : "02915506c736ffec49ad58fc021779600dcd2b7a52ac97690571aea5b4d9be2706:", "state" : "STATE_INIT", "peerid" : "02915506c736ffec49ad58fc021779600dcd2b7a52ac97690571aea5b4d9be2706", "connected" : true } ] }

So I decided to run 'connect' again just to see what I would get, yielding finally:

lightningd(4399): Connected json input
02915506c736ffec49ad58fc021779600dcd2b7a52ac97690571aea5b4d9be2706: Disconnected
lightningd: daemon/db.c:1686: db_forget_peer: Assertion `peer->state == STATE_CLOSED' failed.
lightningd(4399): FATAL SIGNAL 6 RECEIVED
Fatal signal 6. Log dumped in crash.log

So I guess I ended up in an inconsistent state and then went bang. That part definitely seems like a bug!
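For context on the assertion in the log, here is a minimal sketch in C. The state enum is a hypothetical reduction (the real daemon has many more states), and `db_forget_peer_checked` is a made-up name; only the `STATE_CLOSED` precondition comes from the crash above. It shows a defensive variant that would report the inconsistent state instead of aborting via `assert`:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical reduced state enum; names mirror the getpeers output above. */
enum peer_state {
    STATE_INIT,
    STATE_OPEN_WAIT_FOR_OPEN_WITHANCHOR,
    STATE_CLOSED
};

struct peer {
    enum peer_state state;
};

/* Sketch of a defensive variant of db_forget_peer(): instead of
 * asserting peer->state == STATE_CLOSED (which raises SIGABRT on any
 * inconsistent state, as in the crash log), refuse and report so the
 * caller can surface a useful error. */
static bool db_forget_peer_checked(struct peer *peer)
{
    if (peer->state != STATE_CLOSED) {
        fprintf(stderr, "db_forget_peer: peer in unexpected state %d\n",
                peer->state);
        return false;
    }
    /* ... delete the peer's rows from the database here ... */
    return true;
}
```

With the peer stuck in STATE_INIT after the restart, a check like this would have logged the state mismatch instead of dumping crash.log.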

@gwillen
Contributor Author

gwillen commented Nov 4, 2016

After a night's reflection, I notice something -- I have several times seen issues that feel like "lightningd is confused about keeping a TCP connection in sync with a lightning channel over that connection." The user interface treats the two together -- 'connect' accomplishes both -- but it feels like, especially in error cases or across daemon restarts, the daemon has trouble keeping all the state machines synced up with each other.

@cdecker
Member

cdecker commented Nov 5, 2016

There is no minimum required depth for the funds used to create a channel, so even 0-conf should work. As a matter of fact, that's how I open most of my channels. Not broadcasting is strange; do you have any indication in the logs of where it fails? Maybe we can add a JSON-RPC method to rebroadcast or extract funding transactions, so we can manually trigger/release them.

@cdecker
Member

cdecker commented Nov 5, 2016

Just had the same crash after attempting to disconnect a node that was unreachable, and in state STATE_INIT. https://gist.github.com/anonymous/251ce28774d9f0c094a5dbe06388b6bb

@cdecker
Member

cdecker commented Nov 5, 2016

Turns out it was a bit different: calling close on a channel that is neither in STATE_NORMAL nor STATE_NORMAL_COMMITTING will trigger an assertion and fault our client. But it seems that we are conflating connection state and channel state in a few places.
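A hedged sketch of the kind of guard that could replace the assertion. `STATE_NORMAL` and `STATE_NORMAL_COMMITTING` come from the comment above; the enum layout, function name, and error string are made up for illustration:

```c
#include <stddef.h>

/* Hypothetical reduced state enum for illustration only. */
enum channel_state {
    STATE_INIT,
    STATE_NORMAL,
    STATE_NORMAL_COMMITTING,
    STATE_CLOSED
};

/* Return NULL if a 'close' is allowed in this state, otherwise an
 * error string the JSON-RPC layer could return to the client
 * instead of tripping an assertion and killing the daemon. */
static const char *close_not_allowed_reason(enum channel_state s)
{
    if (s == STATE_NORMAL || s == STATE_NORMAL_COMMITTING)
        return NULL;
    return "Channel is not in a closable state";
}
```

Checking the state at the RPC boundary keeps channel-state errors as client-visible errors rather than daemon crashes.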

@rustyrussell
Contributor

OK, I have a patch which annotates the state information so getpeers will give you some more idea of what's happening. But:

STATE_OPEN_WAIT_FOR_OPEN_WITHANCHOR : We're actually waiting for them to send the OPEN packet! You shouldn't see that state for more than 1 RTT. That's why we reverted to STATE_INIT on restart; we didn't receive anything from the peer. Though clearly we didn't do anything useful there either...

The reconnect assert is another bug. I'll look at that too! Thanks!

@gwillen
Contributor Author

gwillen commented Nov 6, 2016

Interesting. @rustyrussell does this mean that the reason I never broadcast an anchor is that I was waiting for the remote host to send the OPEN packet?

It's interesting that this state can persist -- does this imply that we got a TCP connection, sent a packet through it, and then ... silence? It seems like this ought to resolve quickly one way or another.

@cdecker
Member

cdecker commented Nov 6, 2016

I don't think we have regular pings, so it could be a TCP connection getting stuck forever.
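In the absence of protocol-level pings, one stopgap (a sketch only; the helper name is made up, and the real fix would be application-level pings) is enabling TCP keepalive on the peer socket so the kernel eventually detects a dead connection:

```c
#include <sys/socket.h>

/* Enable SO_KEEPALIVE so the kernel probes idle connections and
 * eventually errors out the socket if the peer is gone. Default
 * probe timers are long (often over two hours), so this only
 * bounds how long a dead connection can linger; it does not
 * replace protocol-level pings. Returns 0 on success, -1 on error. */
static int enable_keepalive(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}
```

Called once on each accepted or connected peer socket, this would at least prevent a silently stuck TCP connection from pinning a channel in a wait state forever.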

@rustyrussell
Contributor

@cdecker I'll open another bug report for yours, one sec.

rustyrussell added a commit that referenced this issue Nov 10, 2016
db_forget_peer() was harmless, but we haven't been entered into the
database yet anyway, and it asserted that we should have been STATE_CLOSED.

Closes: #67
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>