Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NonP2P relays fail to keep an established connection to a P2P node #4465

Closed
coot opened this issue Mar 21, 2023 · 2 comments · Fixed by #4467
Closed

NonP2P relays fail to keep an established connection to a P2P node #4465

coot opened this issue Mar 21, 2023 · 2 comments · Fixed by #4467
Assignees
Labels
bug Something isn't working

Comments

@coot
Copy link
Contributor

coot commented Mar 21, 2023

After 1.35.6 release of cardano-node, we received some reports where BPs weren't able to keep an established connection to a P2P node.

@karknu was able to reproduce the issue and capture a pcap trace, this was interpreted to create the following diagram (where BP is assumed to be a non P2P node):

sequenceDiagram
    BP->>P2P: connect (producer) - A
    P2P->>BP: connect (local root peer) - B
    note right of P2P: P2P is using a fixed PORT
    BP->>P2P: (timeout - tear down by BP - A) [BP no outgoing diffussion]
    note over BP,P2P: BP belives existing connection - no BP outbound connection attempts
    P2P->>BP: after 79 sec: P2P tears down due to non-respose of keep alive protocol[observed]<br>Suggests BP response to timer teardown "captured B" first connection in some way
    note over BP,P2P: time reference T
    loop until BP 'bad peer' timer expires - trace retries at T+10 and T+18
      P2P--xBP: attempts new connection - TCP connections but<br>BP resets before process handshake (assumed still in 'sin bin')
    end
    note over BP,P2P: some seconds later (21s after T in trace)
    BP->>P2P:connect (producer) - A [success]
Loading

We believe this is a bug in the legacy non-p2p code, which is triggered by P2P nodes which bind to their advertised addressed for outbound connections. The non-p2p component which tracks the state of inbound & outbound connections can be confused by this and depending on which nodes connects first (e.g. P2P -> NonP2P or NonP2P -> P2P) this can either result in minutes long or hour long (or more) problems for a non-p2p node to establish a connection to a p2p node.

@coot coot added bug Something isn't working networking labels Mar 21, 2023
@coot coot assigned coot and dcoutts Mar 21, 2023
@Straightpool
Copy link

Straightpool commented Mar 21, 2023

I have had an issue which did not recover within 21seconds, the BP lost the outgoing connection to the P2P Relay for many hours until I restarted the BP. All nodes 1.35.6, BP in non-P2P mode, 2 relays one in P2P, one in non-P2P mode.

image
The bright green line is the block producer, time is CET.

Here the log entries of the block producer
Connection lost:

{"app":[],"at":"2023-03-17T01:24:56.52Z","data":{"domain":"\"relay1.str8pool.com\"","event":"Application Exception: <P2P-IP:P2P-Port> ExceededTimeLimit (ChainSync (Header (HardForkBlock (': * ByronBlock (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyMAEra 'Mary StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (AlonzoEra StandardCrypto)) (': * (ShelleyBlock (Praos StandardCrypto) (BabbageEra StandardCrypto)) ('[] *))))))))) (Tip HardForkBlock (': * ByronBlock (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyMAEra 'Mary StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (AlonzoEra StandardCrypto)) (': * (ShelleyBlock (Praos StandardCrypto) (BabbageEra StandardCrypto)) ('[] *))))))))) (ServerAgency TokNext TokMustReply)","kind":"SubscriptionTrace"},"env":"1.35.6:07b0c","host":"ub22cn","loc":null,"msg":"","ns":["cardano.node.DnsSubscription"],"pid":"1391","sev":"Error","thread":"2470"} {"app":[],"at":"2023-03-17T01:24:56.53Z","data":{"address":"<P2P-IP:P2P-Port>","event":"ErrorPolicySuspendConsumer (Just (ApplicationExceptionTrace ExceededTimeLimit (ChainSync (Header (HardForkBlock (': * ByronBlock (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyMAEra 'Mary StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (AlonzoEra StandardCrypto)) (': * (ShelleyBlock (Praos StandardCrypto) (BabbageEra StandardCrypto)) ('[] *))))))))) (Tip HardForkBlock (': * ByronBlock (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyEra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyMAEra 'Allegra StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (ShelleyMAEra 'Mary StandardCrypto)) (': * (ShelleyBlock (TPraos StandardCrypto) (AlonzoEra StandardCrypto)) (': * (ShelleyBlock (Praos StandardCrypto) (BabbageEra StandardCrypto)) ('[] *))))))))) (ServerAgency TokNext TokMustReply))) 20s","kind":"ErrorPolicyTrace"},"env":"1.35.6:07b0c","host":"ub22cn","loc":null,"msg":"","ns":["cardano.node.ErrorPolicy"],"pid":"1391","sev":"Notice","thread":"339"}

After this the IP of the P2P-relay never appears in the BP log again until I restart the node. Interestingly, peer fetch entries henceforth on the BP only listed the non-P2P relay, not the P2P-relay ever again, so the BP never attempts to reconnect:

{"app":[],"at":"2023-03-17T01:25:25.06Z","data":{"kind":"PeersFetch","peers":[{"kind":"FetchDecision results","length":"1","peer":{"local":{"addr":"<BP-IP>","port":"<local BP Port>"},"remote":{"addr":"<non-P2P Relay>","port":"<non P2P-Relay Port>"}}}]},"env":"1.35.6:07b0c","host":"ub22cn","loc":null,"msg":"","ns":["cardano.node.BlockFetchDecision"],"pid":"1391","sev":"Info","thread":"328"}

On the P2P Relay around the same time:

{"app":[],"at":"2023-03-17T01:24:56.63Z","data":{"kind":"ConnectionManagerCounters","state":{"duplex":12,"fullDuplex":0,"inbound":43,"outbound":49,"unidirectional":80}},"env":"1.35.6:07b0c","host":"ub22r1","loc":null,"msg":"","ns":["cardano.node.ConnectionManager"],"pid":"44439","sev":"Info","thread":"12657"} {"app":[],"at":"2023-03-17T01:24:56.67Z","data":{"connectionHandler":{"command":"ShutdownPeer","context":"InboundError","kind":"Error","reason":"MuxError (MuxIOException Network.Socket.recvBuf: resource vanished (Connection reset by peer)) \"(recv errored)\""},"connectionId":{"localAddress":{"address":"<P2P-Relay IP>","port":"<P2P Relay Port>"},"remoteAddress":{"address":"<BP-IP>","port":"<BP-IP Port>"}},"kind":"ConnectionHandler"},"env":"1.35.6:07b0c","host":"ub22r1","loc":null,"msg":"","ns":["cardano.node.ConnectionManager"],"pid":"44439","sev":"Info","thread":"13283"} {"app":[],"at":"2023-03-17T01:24:56.67Z","data":{"kind":"ConnectionManagerCounters","state":{"duplex":12,"fullDuplex":0,"inbound":42,"outbound":49,"unidirectional":79}},"env":"1.35.6:07b0c","host":"ub22r1","loc":null,"msg":"","ns":["cardano.node.ConnectionManager"],"pid":"44439","sev":"Info","thread":"13283"}

After this the BP IP is listed on the P2P Relay log only in such contexts:

{"app":[],"at":"2023-03-17T06:30:08.12Z","data":{"kind":"PeersFetch","peers":[{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"3001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"6001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"3001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"3001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"6000"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"6000"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"3001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"6791"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"4830"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"5001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"19001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"3001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"3001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"3001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"3001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"9010"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"6000"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<Some Relay>","port":"3001"}}},{"declined":"FetchDeclineChainNotPlausible","kind":"FetchDecision declined","peer":{"local":{"addr":"<P2P-Relay IP>","port":"<P2P-Relay Port>"},"remote":{"addr":"<BP IP>","port":"<BP Port>"}}},{"kind":"FetchDecision results","length":"1","peer":{"local":{"addr":"<P2P Relay IP>","port":"<P2P Relay Port>"},"remote":{"addr":"<Some Relay?","port":"3001"}}}]},"env":"1.35.6:07b0c","host":"ub22r1","loc":null,"msg":"","ns":["cardano.node.BlockFetchDecision"],"pid":"44439","sev":"Info","thread":"324"

The P2P-Relay still held the connection to the BP at that time, uni-directionally to fetch new blocks, so no need to add the connection again from this end.

Hope the above is helpful. Definitely matching to the title of this issue "NonP2P relays fail to keep an established connection to a P2P node", in this case a NonP2P-Block producer fails to keep an established connection to a P2P node and even fails to re-establish the connection until restart.

@coot
Copy link
Contributor Author

coot commented Mar 22, 2023

Thanks @Straightpool, this fits quite well with IOG & CF investigation. We're colaborating on a fix.

coot added a commit to IntersectMBO/cardano-node that referenced this issue Mar 23, 2023
@iohk-bors iohk-bors bot closed this as completed in 807ba41 Mar 23, 2023
coot added a commit to IntersectMBO/cardano-node that referenced this issue Mar 24, 2023
coot added a commit to IntersectMBO/cardano-node that referenced this issue Mar 24, 2023
coot added a commit to IntersectMBO/cardano-node that referenced this issue Mar 24, 2023
coot added a commit to IntersectMBO/cardano-node that referenced this issue Mar 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants