
xud will not reconnect after connection is lost #1699

Closed
ghost opened this issue Jun 29, 2020 · 9 comments
Labels: bug (Something isn't working), p2p (Peer to peer networking), P2 (mid priority)

Comments

@ghost

ghost commented Jun 29, 2020

The mainnet xud-docker environment did not reconnect to existing peers within 6 hours after the host machine lost connectivity and came back online 30 minutes later.

I'm gathering logs and will post findings from further investigation into this issue.

ghost added the bug (Something isn't working) label Jun 29, 2020
ghost self-assigned this Jun 29, 2020
@raladev
Contributor

raladev commented Jul 2, 2020

I think I faced this issue once on a testnet env, but without the 30 min offline period, and when I tried to connect manually I got a connection refused error. Restarting the env fixed the problem for me and for @erkarl.

raladev self-assigned this Jul 2, 2020
kilrau added the P2 (mid priority) label Jul 6, 2020
@raladev
Contributor

raladev commented Jul 9, 2020

It seems I can easily reproduce it by switching my VPN on and off.

@kilrau
Contributor

kilrau commented Jul 14, 2020

I noticed that I didn't have any peers on testnet today. Restarting xud brought back my peers. Analyzing the logs, I can't get a clear picture of what is happening, just a constant cycle of ResponseStalling, closing the connection, and reconnecting:

14/07/2020 11:36:06.580 [P2P] error: Peer timed out waiting for response to packet 2cfaeca0-c5c6-11ea-a0fe-97d81e57d941
14/07/2020 11:36:06.581 [P2P] debug: Peer 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2 (ArmyTreat): closing socket. reason: ResponseStalling
14/07/2020 11:36:06.591 [ORDERBOOK] debug: removed all orders for peer 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2 (ArmyTreat)
14/07/2020 11:36:06.592 [P2P] debug: attempting to reconnect to a disconnected peer 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2 (ArmyTreat)
14/07/2020 11:36:06.594 [P2P] info: Peer 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2 (ArmyTreat) socket closed
14/07/2020 11:36:07.345 [RPC] debug: received call /xudrpc.Xud/ListPeers
14/07/2020 11:36:07.754 [P2P] debug: Connected pre-handshake to 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885
14/07/2020 11:36:08.502 [P2P] debug: Peer 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885 session in-encryption enabled
14/07/2020 11:36:08.650 [RPC] debug: received call /xudrpc.Xud/ListOrders
14/07/2020 11:36:08.891 [P2P] debug: Peer 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2 (ArmyTreat) session out-encryption enabled
14/07/2020 11:36:08.895 [P2P] verbose: opened connection to 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2 (ArmyTreat) at wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885
14/07/2020 11:36:09.907 [P2P] verbose: received 0 orders from 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2 (ArmyTreat)
14/07/2020 11:36:10.162 [P2P] verbose: received 0 orders from 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2 (ArmyTreat)
14/07/2020 11:36:10.165 [P2P] verbose: received 0 nodes (0 new) from 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2 (ArmyTreat)
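
The first excerpt is consistent with a per-packet response timer: a timeout is armed when a packet expecting a response is sent, and on expiry the socket is closed with ResponseStalling and a reconnect is attempted. A minimal sketch of that cycle, where all names and the timeout value are assumptions rather than actual xud code:

```typescript
// Hypothetical sketch of per-packet response-stall detection; identifiers and
// the timeout value are illustrative, not taken from the xud codebase.
type PacketId = string;

class PeerConnection {
  private responseTimers = new Map<PacketId, ReturnType<typeof setTimeout>>();
  private static readonly RESPONSE_TIMEOUT_MS = 60_000; // assumed value

  constructor(
    private readonly closeSocket: (reason: string) => void,
    private readonly reconnect: () => void,
  ) {}

  /** Arm a timer when we send a packet that expects a response. */
  public expectResponse(packetId: PacketId): void {
    const timer = setTimeout(() => {
      // No response arrived in time: close the socket and try to reconnect,
      // mirroring the "ResponseStalling" / "attempting to reconnect" lines above.
      this.responseTimers.delete(packetId);
      this.closeSocket('ResponseStalling');
      this.reconnect();
    }, PeerConnection.RESPONSE_TIMEOUT_MS);
    this.responseTimers.set(packetId, timer);
  }

  /** Clear the timer when the matching response arrives. */
  public handleResponse(packetId: PacketId): void {
    const timer = this.responseTimers.get(packetId);
    if (timer) {
      clearTimeout(timer);
      this.responseTimers.delete(packetId);
    }
  }
}
```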

And at times it fails to reconnect multiple times in a row because the peer says we are already connected:

13/07/2020 12:03:16.857 [P2P] debug: Connected pre-handshake to 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885
13/07/2020 12:03:17.410 [P2P] debug: received disconnecting packet from 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885 due to AlreadyConnected
13/07/2020 12:03:17.755 [P2P] info: Socket closed prior to handshake with 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885
13/07/2020 12:03:17.756 [P2P] warn: could not open connection to outbound peer (02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885): Peer 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885 disconnected from us due to AlreadyConnected
13/07/2020 12:03:18.288 [P2P] debug: Connected pre-handshake to 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885
13/07/2020 12:03:18.784 [P2P] debug: received disconnecting packet from 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885 due to AlreadyConnected
13/07/2020 12:03:19.035 [P2P] info: Socket closed prior to handshake with 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885
13/07/2020 12:03:19.035 [P2P] warn: could not open connection to outbound peer (02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885): Peer 02e33d476a3475c7355ea04cb5e61e9577ea142ccb4a0aff974a2bebac2d07b6a2@wr5xrgmpftaog3fsoat6ntlr2ylpw2avcj73dm5slnuo62esohbdg6ad.onion:18885 disconnected from us due to AlreadyConnected

@kilrau
Contributor

kilrau commented Jul 14, 2020

At least the latter should be fixable by checking connected peers (which can be inbound) before firing a reconnect. It looks like we don't do that and keep trying to reconnect even though the peer has already connected to us. What is your take? @LePremierHomme
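
A rough sketch of that guard (illustrative names only, not actual xud code): skip the reconnect whenever a live connection to that node pubkey already exists, whether it was dialed by us or by the peer.

```typescript
// Illustrative sketch; Peer, PeerManager and reconnectIfDisconnected are
// assumed names, not actual xud identifiers.
interface Peer {
  nodePubKey: string;
  inbound: boolean; // true if the peer dialed us
}

class PeerManager {
  private peers = new Map<string, Peer>(); // live connections keyed by node pubkey

  /** Only dial a dropped peer if no live connection (inbound or outbound) exists. */
  public reconnectIfDisconnected = async (
    nodePubKey: string,
    dial: (pubKey: string) => Promise<void>,
  ): Promise<void> => {
    if (this.peers.has(nodePubKey)) {
      // The peer may already have re-established an inbound connection to us;
      // dialing again would only be rejected with AlreadyConnected.
      return;
    }
    await dial(nodePubKey);
  };
}
```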

@kilrau
Contributor

kilrau commented Jul 24, 2020

@michael1011 observed a similar scenario on the boltz xud testnet node:

Jul 23 16:18:49 testnet xud[51256]: 23/07/2020 16:18:49.215 [P2P] info: Peer 029f39cce2646d305fae75eb578929c9f84fbfdffd9739db016a528760e4aa392c (RiceNeck) socket closed
Jul 23 16:19:28 testnet xud[51256]: 23/07/2020 16:19:28.093 [P2P] info: Peer 029f39cce2646d305fae75eb578929c9f84fbfdffd9739db016a528760e4aa392c (RiceNeck) socket closed
Jul 23 16:30:42 testnet xud[51256]: 23/07/2020 16:30:42.351 [P2P] info: Peer 029f39cce2646d305fae75eb578929c9f84fbfdffd9739db016a528760e4aa392c (RiceNeck) socket closed
Jul 24 11:51:56 testnet xud[51256]: 24/07/2020 11:51:56.586 [P2P] error: Peer timed out waiting for response to packet 0b242430-cda4-11ea-9b59-abb4b5ffae6e
Jul 24 11:51:56 testnet xud[51256]: 24/07/2020 11:51:56.593 [P2P] info: Peer 03dc5dbe77106209287a404d2bd5648a8c8a4ffaa8f86d2cfa5c22e662e65e8809 (CertainRubber) socket closed
Jul 24 13:23:35 testnet xud[51256]: 24/07/2020 13:23:35.576 [P2P] info: Peer 02799dff47d38da549c9f9e3dcd94388e3a9fe05b4ea50ea0636562b4ba2b4ca4a (AttitudeErase) socket closed
Jul 24 13:23:35 testnet xud[51256]: 24/07/2020 13:23:35.808 [P2P] warn: could not open connection to outbound peer (02799dff47d38da549c9f9e3dcd94388e3a9fe05b4ea50ea0636562b4ba2b4ca4a@xud.kilrau.com:18885): could not connect to peer at xud.kilrau.com:18885: connect ECONNREFUSED 188.127.180.82:18885

kilrau added the p2p (Peer to peer networking) label Aug 13, 2020
@LePremierHomme
Contributor

LePremierHomme commented Aug 14, 2020

@kilrau from your first log paragraph I don’t see that the peer is rejecting the reconnection (perhaps because the log is partial), and in the second one I don’t see what precedes the rejection.

We always check for connected peers before trying to open a new outbound connection. This makes me think that maybe the peer's view is that we were still connected, which I'm not sure is possible (because we don't disconnect upon receiving a Disconnecting packet, but rather wait for the actual socket closure by the peer, so it should be up to date), or maybe we have an issue with how we handle the race condition of both sides trying to connect at the same time (while the network is unreliable).
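
For the race where both sides dial each other at roughly the same time, one generic resolution (a sketch under my own assumptions, not a description of xud's actual behavior) is a deterministic tie-break so both nodes agree on which duplicate connection to drop, e.g. by comparing node pubkeys:

```typescript
// Generic duplicate-connection tie-break; not xud's actual logic.
const shouldKeepInbound = (ourPubKey: string, theirPubKey: string): boolean =>
  // Both nodes evaluate the same comparison, so they agree on which duplicate
  // to drop: the node with the smaller pubkey keeps its outbound connection.
  theirPubKey < ourPubKey;

const resolveDuplicateConnection = (
  ourPubKey: string,
  theirPubKey: string,
  closeInbound: () => void,
  closeOutbound: () => void,
): void => {
  if (shouldKeepInbound(ourPubKey, theirPubKey)) {
    closeOutbound();
  } else {
    closeInbound();
  }
};
```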

@kilrau @raladev Can you reproduce this and get the full logs (trace level) of both nodes?

@kilrau
Contributor

kilrau commented Aug 14, 2020

@michael1011 can you still dig up these logs? #1699 (comment)

@raladev
Contributor

raladev commented Aug 14, 2020

Description:
I used two xud simnet setups on different machines to test this. One of them (good_xud) was online the whole time during testing; the other (bad_xud) was the one actively used for testing.

Steps:

  1. Disabled wifi for 5 seconds on the bad machine -> the disabled node lost its peers, but the connection to both peers was restored ~10-20 sec after re-enabling (Time 15:55)
  2. Disabled wifi for ~2 minutes on the bad machine -> same result (Time 15:56)
  3. Enabled VPN (Germany, Munich) -> same result (Time 15:58)
  4. Disabled VPN -> lost peers, reconnection attempts end in Timeout (Time 16:00)
  5. Called Connect manually from both nodes -> SwapTimeOut error (Time 16:04:20)
  6. Restarted xud on the bad machine -> no graceful shutdown, so I used docker rm -f simnet_xud_1 (AND LOST ALL LOGS of the bad machine!)
  7. Started a new container -> connection is fine
  8. Enabled VPN (Paris) -> lost peers while the connection was re-established, successful reconnect after ~15 sec
  9. Disabled VPN -> lost peers, reconnection attempts end in Timeout
  10. Tried a manual connect after 5 min -> SwapTimeout from both sides

Logs:
log_of_good_xud.log
log_of_bad_xud.log

Note:
About the rejected connection error: I think the swap timeout will disappear after some time and I will then get a rejection error instead of a timeout error from the good machine.

UPDATE after 1 hour:
Got all peers back, so my case is different and we can't reliably reproduce the bug; I will try disabling wifi for 30 min.

Update after 45 min offline:
Got peers back after ~1-2 min.

@kilrau
Contributor

kilrau commented Nov 25, 2020

Just ran a test on a machine running testnet and mainnet arby, unplugging the network cable at 9:44 and re-plugging it at 10:25:

testnet_xud_1:

25/11/2020 09:39:31.406 [P2P] debug: Connection attempt #8 to 036501a51f2279b2176e2fc69a2f6cad8a5b74783195ed90f278fc5d6787498773@w6qxmyl6chlohza77t4ai3kbmgw4h2wjtraed6gtoi7qrfmolkknttad.onion:18885 failed: Socks5 proxy rejected connection - HostUnreachable. retrying in 640 sec...
Nov 25 09:39:32.000 [notice] Closed 1 streams for service [scrubbed].onion for reason resolve failed. Fetch status: No more HSDir available to query.
25/11/2020 09:39:32.627 [P2P] debug: Connection attempt #8 to 02a9d1b8e95aeb8f8bba899565ae0defc5ff20f7813258e51dc7a36810ba201e51@qv6nfch6mssif7xnzp7objdcf23cxwh75tsv4astibhup4ztyexx2dad.onion:18885 failed: Socks5 proxy rejected connection - HostUnreachable. retrying in 640 sec...
Nov 25 09:39:36.000 [notice] Closed 1 streams for service [scrubbed].onion for reason resolve failed. Fetch status: No more HSDir available to query.
25/11/2020 09:39:36.707 [P2P] debug: Connection attempt #8 to 0221c2efea43118d59ef8dd9c800501febde8dd4ba1f25456b9dbd2e38f3573ce3@kizwyl76djv4x3ajdwquzlr5qwybvk53l2b4q75gwugq5gz5gl3i27qd.onion:18885 failed: Socks5 proxy rejected connection - HostUnreachable. retrying in 640 sec...
Nov 25 09:39:46.000 [notice] Closed 1 streams for service [scrubbed].onion for reason resolve failed. Fetch status: No more HSDir available to query.
25/11/2020 09:39:46.410 [P2P] debug: Connection attempt #8 to 03fe8eed12784d2bc532a6943bd0fb5c86c55b82802650a6f6240df22617689ab1@vcxtxlytr3gi3zpvg355ta2s7cjlnfcrbkqqzfkdjaprgcknrod7zkad.onion:18885 failed: Socks5 proxy rejected connection - HostUnreachable. retrying in 640 sec...

*UNPLUG
Logs til re-plug: https://paste.ubuntu.com/p/XCmk5Fkt6v/
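
As an aside, the retry delays in the testnet log above look like a simple doubling backoff: a 5-second base delay doubled per attempt would give 5 * 2^7 = 640 sec at attempt #8, matching the log. A minimal sketch, with the base and factor being assumptions rather than confirmed xud constants:

```typescript
// Assumed backoff schedule; the base delay and doubling factor are guesses
// that merely fit the "attempt #8 ... retrying in 640 sec" lines above.
const BASE_DELAY_SEC = 5;

const retryDelaySec = (attempt: number): number =>
  BASE_DELAY_SEC * 2 ** (attempt - 1);

// retryDelaySec(1) === 5, retryDelaySec(8) === 640
```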

mainnet_xud_1:

25/11/2020 09:30:34.845 [P2P] debug: received disconnecting packet from 02f5ca87acf65b56c490726a9e71e3418f5a96b6e29a28e013d927d89337de031b (MansionCube) due to Shutdown
25/11/2020 09:30:34.848 [P2P] info: Peer 02f5ca87acf65b56c490726a9e71e3418f5a96b6e29a28e013d927d89337de031b (MansionCube) socket closed
25/11/2020 09:30:34.850 [ORDERBOOK] debug: removed all orders for peer 02f5ca87acf65b56c490726a9e71e3418f5a96b6e29a28e013d927d89337de031b (MansionCube)
25/11/2020 09:30:42.155 [RPC] error: call /xudrpc.Xud/PlaceOrderSync errored with code 5: order with id arby-BTC/USDT-sell-order could not be found
25/11/2020 09:30:42.158 [RPC] error: call /xudrpc.Xud/PlaceOrderSync errored with code 5: order with id arby-BTC/USDT-buy-order could not be found

*UNPLUG
Logs til re-plug: https://paste.ubuntu.com/p/cwb5c4787g/

Observations: xud reconnected to all peers just fine and arby started issuing orders again. Since we don't know exactly how to reproduce this, closing until someone (cc @raladev) finds a way to reliably reproduce it.

kilrau closed this as completed Nov 25, 2020