Skip to content

Conversation

@niftynei
Copy link
Collaborator

@niftynei niftynei commented Nov 11, 2020

If we're already attempting to connect to a peer, we would ignore
new connection requests. This is problematic if your node has bad
connection details for the node -- you can't update it while inflight.

This patch appends new connection suggestions to the list of connections
to try.

Fixes #4154

Changelog-None

@niftynei niftynei requested a review from cdecker as a code owner November 11, 2020 00:26
Copy link
Contributor

@rustyrussell rustyrussell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs some pycheck fixes, and Travis is grumpy...

@plugin.async_method('waitconn')
def wait_connection(request, plugin):
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
addr = ('localhost', 6669)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use port number 0, which will assign an unused ephemeral port? Then use getsockname and print out what port you got.

This avoids conflicts on port 6669.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, updated.

Copy link
Member

@cdecker cdecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Travis failure might just be a timeout that is a bit optimistic in the .result(1) calls below. I'm wondering though why the decoy port is managed in a plugin at all, since the creation of the listening socket and acceptance of the connection could just as well be done in the test itself.

@niftynei
Copy link
Collaborator Author

niftynei commented Nov 13, 2020

I'm wondering though why the decoy port is managed in a plugin at all, since the creation of the listening socket and acceptance of the connection could just as well be done in the test itself.

Needs signaling between the listening socket and the main test thread. By using a plugin we can take advantage of wait_for_log, which gives it a nice symmetry with our other tests.

Open to different ideas for how to do inter-thread signaling on python. /me eyes Golang channels.

@niftynei niftynei force-pushed the nifty/fix-4154 branch 2 times, most recently from 13792a4 to a8b1a38 Compare November 13, 2020 00:18
@cdecker
Copy link
Member

cdecker commented Nov 13, 2020

Open to different ideas for how to do inter-thread signaling on python. /me eyes Golang channels.

Ah yes, inter-thread / inter-process communication :-)
Since you mention Golang channels: the queue.Queue class is a multi-threading enabled primitive allowing mpmc semantics to transfer pieces of data between threads. Alternatively if you just need one-time signals (wait for startup, etc) threading.Lock might be an alternative.

But no need to amend, I was just curious as to why the big hammer was used in this case :-)

@niftynei
Copy link
Collaborator Author

I was just curious as to why the big hammer was used in this case :-)

I can't take credit for the idea, @rustyrussell suggested it first lol.

@cdecker
Copy link
Member

cdecker commented Nov 16, 2020

Seems Travis doesn't like the change just yet. Failing tests are:

  • test_connect_stresstest
  • test_plugin_command (unrelated flaky test)

The stresstest one seems to be a crash here:

lightningd-3: 2020-11-13T01:24:08.887Z **BROKEN** connectd: FATAL SIGNAL 6 (version v0.9.2rc1-3-g4bd7b63)
lightningd-3: 2020-11-13T01:24:08.887Z **BROKEN** connectd: backtrace: common/daemon.c:43 (send_backtrace) 0x5643377b5753
lightningd-3: 2020-11-13T01:24:08.887Z **BROKEN** connectd: backtrace: common/daemon.c:51 (crashdump) 0x5643377b57a3
lightningd-3: 2020-11-13T01:24:08.887Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0x7f08e7e9e03f
lightningd-3: 2020-11-13T01:24:08.888Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0x7f08e7e9dfb7
lightningd-3: 2020-11-13T01:24:08.888Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0x7f08e7e9f920
lightningd-3: 2020-11-13T01:24:08.888Z **BROKEN** connectd: backtrace: common/wireaddr.c:106 (towire_wireaddr_internal) 0x5643377c17c8
lightningd-3: 2020-11-13T01:24:08.888Z **BROKEN** connectd: backtrace: connectd/connectd_wiregen.c:287 (towire_connectd_connect_failed) 0x5643377a7fc3
lightningd-3: 2020-11-13T01:24:08.888Z **BROKEN** connectd: backtrace: connectd/connectd.c:658 (connect_failed) 0x5643377aa616
lightningd-3: 2020-11-13T01:24:08.889Z **BROKEN** connectd: backtrace: connectd/connectd.c:787 (try_connect_one_addr) 0x5643377aaaf6
lightningd-3: 2020-11-13T01:24:08.892Z **BROKEN** connectd: backtrace: connectd/connectd.c:696 (destroy_io_conn) 0x5643377aa7b4
lightningd-3: 2020-11-13T01:24:08.892Z **BROKEN** connectd: backtrace: ccan/ccan/io/poll.c:244 (destroy_conn) 0x5643377f3883
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: ccan/ccan/io/poll.c:250 (destroy_conn_close_fd) 0x5643377f38a3
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:240 (notify) 0x5643378015f6
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:402 (del_tree) 0x564337801ae5
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:412 (del_tree) 0x564337801b37
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:486 (tal_free) 0x564337801e71
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: connectd/connectd.c:288 (connected_to_peer) 0x5643377a967b
lightningd-3: 2020-11-13T01:24:08.894Z **BROKEN** connectd: backtrace: connectd/connectd.c:469 (peer_connected) 0x5643377a9d4b
lightningd-3: 2020-11-13T01:24:08.894Z **BROKEN** connectd: backtrace: connectd/peer_exchange_initmsg.c:94 (peer_init_received) 0x5643377ada20
lightningd-3: 2020-11-13T01:24:08.894Z **BROKEN** connectd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x5643377f1216
lightningd-3: 2020-11-13T01:24:08.894Z **BROKEN** connectd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x5643377f1d93
lightningd-3: 2020-11-13T01:24:08.894Z **BROKEN** connectd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x5643377f1dd1
lightningd-3: 2020-11-13T01:24:08.894Z **BROKEN** connectd: backtrace: ccan/ccan/io/poll.c:445 (io_loop) 0x5643377f3f97
lightningd-3: 2020-11-13T01:24:08.894Z **BROKEN** connectd: backtrace: connectd/connectd.c:1690 (main) 0x5643377ad6e3
lightningd-3: 2020-11-13T01:24:08.894Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0x7f08e7e80bf6
lightningd-3: 2020-11-13T01:24:08.895Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0x5643377a6da9
lightningd-3: 2020-11-13T01:24:08.895Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0xffffffffffffffff
lightningd-3: 2020-11-13T01:24:08.895Z **BROKEN** connectd: FATAL SIGNAL (version v0.9.2rc1-3-g4bd7b63)
lightningd-3: 2020-11-13T01:24:08.895Z **BROKEN** connectd: backtrace: common/daemon.c:43 (send_backtrace) 0x5643377b5753
lightningd-3: 2020-11-13T01:24:08.895Z **BROKEN** connectd: backtrace: common/status.c:206 (status_failed) 0x5643377bfc43
lightningd-3: 2020-11-13T01:24:08.895Z **BROKEN** connectd: backtrace: common/subdaemon.c:25 (status_backtrace_exit) 0x5643377bfe93
lightningd-3: 2020-11-13T01:24:08.895Z **BROKEN** connectd: backtrace: common/daemon.c:54 (crashdump) 0x5643377b57ac
lightningd-3: 2020-11-13T01:24:08.896Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0x7f08e7e9e03f
lightningd-3: 2020-11-13T01:24:08.896Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0x7f08e7e9dfb7
lightningd-3: 2020-11-13T01:24:08.896Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0x7f08e7e9f920
lightningd-3: 2020-11-13T01:24:08.896Z **BROKEN** connectd: backtrace: common/wireaddr.c:106 (towire_wireaddr_internal) 0x5643377c17c8
lightningd-3: 2020-11-13T01:24:08.896Z **BROKEN** connectd: backtrace: connectd/connectd_wiregen.c:287 (towire_connectd_connect_failed) 0x5643377a7fc3
lightningd-3: 2020-11-13T01:24:08.896Z **BROKEN** connectd: backtrace: connectd/connectd.c:658 (connect_failed) 0x5643377aa616
lightningd-3: 2020-11-13T01:24:08.896Z **BROKEN** connectd: backtrace: connectd/connectd.c:787 (try_connect_one_addr) 0x5643377aaaf6
lightningd-3: 2020-11-13T01:24:08.897Z **BROKEN** connectd: backtrace: connectd/connectd.c:696 (destroy_io_conn) 0x5643377aa7b4
lightningd-3: 2020-11-13T01:24:08.897Z **BROKEN** connectd: backtrace: ccan/ccan/io/poll.c:244 (destroy_conn) 0x5643377f3883
lightningd-3: 2020-11-13T01:24:08.897Z **BROKEN** connectd: backtrace: ccan/ccan/io/poll.c:250 (destroy_conn_close_fd) 0x5643377f38a3
lightningd-3: 2020-11-13T01:24:08.898Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:240 (notify) 0x5643378015f6
lightningd-3: 2020-11-13T01:24:08.898Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:402 (del_tree) 0x564337801ae5
lightningd-3: 2020-11-13T01:24:08.898Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:412 (del_tree) 0x564337801b37
lightningd-3: 2020-11-13T01:24:08.899Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:486 (tal_free) 0x564337801e71
lightningd-3: 2020-11-13T01:24:08.900Z **BROKEN** connectd: backtrace: connectd/connectd.c:288 (connected_to_peer) 0x5643377a967b
lightningd-3: 2020-11-13T01:24:08.900Z **BROKEN** connectd: backtrace: connectd/connectd.c:469 (peer_connected) 0x5643377a9d4b
lightningd-3: 2020-11-13T01:24:08.901Z **BROKEN** connectd: backtrace: connectd/peer_exchange_initmsg.c:94 (peer_init_received) 0x5643377ada20
lightningd-3: 2020-11-13T01:24:08.902Z **BROKEN** connectd: backtrace: ccan/ccan/io/io.c:59 (next_plan) 0x5643377f1216
lightningd-3: 2020-11-13T01:24:08.902Z **BROKEN** connectd: backtrace: ccan/ccan/io/io.c:407 (do_plan) 0x5643377f1d93
lightningd-3: 2020-11-13T01:24:08.902Z **BROKEN** connectd: backtrace: ccan/ccan/io/io.c:417 (io_ready) 0x5643377f1dd1
lightningd-3: 2020-11-13T01:24:08.905Z **BROKEN** connectd: backtrace: ccan/ccan/io/poll.c:445 (io_loop) 0x5643377f3f97
lightningd-3: 2020-11-13T01:24:08.905Z **BROKEN** connectd: backtrace: connectd/connectd.c:1690 (main) 0x5643377ad6e3
lightningd-3: 2020-11-13T01:24:08.906Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0x7f08e7e80bf6
lightningd-3: 2020-11-13T01:24:08.906Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0x5643377a6da9
lightningd-3: 2020-11-13T01:24:08.906Z **BROKEN** connectd: backtrace: (null):0 ((null)) 0xffffffffffffffff
lightningd-3: 2020-11-13T01:24:08.906Z **BROKEN** connectd: STATUS_FAIL_INTERNAL_ERROR: FATAL SIGNAL

So likely just a null-pointer somewhere in towire_connectd_connect_failed

If we're already attempting to connect to a peer, we would ignore
new connection requests. This is problematic if your node has bad
connection details for the node -- you can't update it while inflight.

This patch appends new connection suggestions to the list of connections
to try.

Fixes ElementsProject#4154
@niftynei
Copy link
Collaborator Author

niftynei commented Nov 16, 2020

So I'm a bit confused by this crash. Line 288 of connected_to_peer is freeing connect, which is notifying the callback destroy_io_conn.

lightningd-3: 2020-11-13T01:24:08.892Z **BROKEN** connectd: backtrace: connectd/connectd.c:696 (destroy_io_conn) 0x5643377aa7b4
lightningd-3: 2020-11-13T01:24:08.892Z **BROKEN** connectd: backtrace: ccan/ccan/io/poll.c:244 (destroy_conn) 0x5643377f3883
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: ccan/ccan/io/poll.c:250 (destroy_conn_close_fd) 0x5643377f38a3
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:240 (notify) 0x5643378015f6
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:402 (del_tree) 0x564337801ae5
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:412 (del_tree) 0x564337801b37
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: ccan/ccan/tal/tal.c:486 (tal_free) 0x564337801e71
lightningd-3: 2020-11-13T01:24:08.893Z **BROKEN** connectd: backtrace: connectd/connectd.c:288 (connected_to_peer) 0x5643377a967b

Except that, we shouldn't be calling destroy_io_conn at all, because we ostensibly set the conn's finish to NULL just before freeing the connection.

static void connected_to_peer(struct daemon *daemon,
			      struct io_conn *conn,
			      const struct node_id *id)
{
	/* Don't call destroy_io_conn */
	io_set_finish(conn, NULL, NULL);

	/* We allocate 'conn' as a child of 'connect': we don't want to free
	 * it just yet though.  tal_steal() it onto the permanent 'daemon'
	 * struct. */
	tal_steal(daemon, conn);

	/* Now free the 'connecting' struct. */
	tal_free(find_connecting(daemon, id));
}

Update: the connection associated with the connecting struct returned by find_connecting is not the same as the conn passed in via connected_to. Calling io_set_finish on the wrong conn doesn't clear destroy_io_conn as intended.

Copy link
Contributor

@rustyrussell rustyrussell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack 6491961

Nice catch of long-outstanding race between incoming and outgoing conns!

@niftynei niftynei merged commit abad494 into ElementsProject:master Nov 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Problem connecting after a new IP address

3 participants