Fix hole punching to work in both directions #605

Closed
tegefaulkes opened this issue Oct 25, 2023 · 7 comments
@tegefaulkes
Contributor

Specification

While testing hole punching on the testnet, we found that a connection could only be punched between the two test machines in one direction. The connection could be established with hole punching from NodeA -> NodeB just fine, but starting the connection in reverse, with hole punching from NodeB -> NodeA, could not be done.

From this we can conclude that hole punching does work, since we can see it working in the forward case, but there is some flaw in the hole punching procedure that prevents it from working with certain kinds of NATs.

Before we can move forward with testing this, other things need to be cleaned up.

  1. We need more informative logs when it comes to the hole punching procedure.
  2. We need to reduce the amount of noise in the logging output. As it stands, there is too much logging about connections. We need a discussion about how much logging is appropriate and what kind of logging isn't needed at the INFO level.

Then we can move on to identifying why connections can only be punched one way. My first guess is that we are not providing the correct information when signalling. Right now the signalling protocol provides the src host:port to the target, but I don't think it's providing the target's information back to the src node. That is one thing to look into.
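To make the suspected gap concrete, here is a minimal sketch of what the signalling payloads might look like under that assumption; the type and field names are illustrative and not Polykey's actual wire format.

```ts
// Illustrative only; not Polykey's actual signalling types.
type Address = { host: string; port: number };

// What the signaller relays today (as described above): the source's
// observed address is forwarded to the target so it can punch back...
type SignalToTarget = {
  srcNodeId: string;
  srcAddress: Address; // src host:port as observed by the signaller
};

// ...but if nothing equivalent comes back to the source, it can only punch
// towards whatever (possibly stale) record it already holds. A symmetric
// response would close that gap:
type SignalResponseToSource = {
  dstNodeId: string;
  dstAddress: Address; // target host:port as observed by the signaller
};
```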

Additional context

Tasks

  1. Identify why a forward connection can be punched, but the reverse can't be punched.
  2. Apply a fix
@tegefaulkes tegefaulkes added the development Standard development label Oct 25, 2023
@tegefaulkes tegefaulkes self-assigned this Oct 25, 2023
@CMCDragonkai
Member

The local node graph is out of date; the signalling node has the up-to-date information.

The local node's node graph isn't GCing the old node record, and we are not getting the updated information.

The existing records could be wrong.

@CMCDragonkai
Member

CMCDragonkai commented Oct 26, 2023

We are going to be moving to a node only connecting to one of the seed nodes. This is necessary for scalability; we can't have a node connecting to every seed node.

We will need to find a common seed node to do this. If the source node is connected to seed node A, and the target node is connected to seed node B, then right now the source node has to connect to seed node B to do the hole punching.

Afterwards, during a hole punching request to the seed node, the response could give you new information about the target node, which may be more up to date than your current information.

If that does update the current node record, then the current hole punching request has to be cancelled and restarted with the new node record.

In general we need a better policy for getting up-to-date node records.
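A rough sketch of that cancel-and-restart behaviour, assuming hypothetical `punchHole` and `requestSignal` helpers (these are not Polykey's real API):

```ts
// Hypothetical helpers and types; not Polykey's real API.
type NodeAddress = { host: string; port: number };
type PunchHole = (address: NodeAddress, signal: AbortSignal) => Promise<void>;
type RequestSignal = (nodeId: string) => Promise<NodeAddress>;

// Punch towards the cached record, but if the signalling response carries a
// newer address for the target, cancel the in-flight attempt and restart it.
async function connectWithFreshRecord(
  nodeId: string,
  cachedAddress: NodeAddress,
  punchHole: PunchHole,
  requestSignal: RequestSignal,
): Promise<void> {
  let controller = new AbortController();
  let attempt = punchHole(cachedAddress, controller.signal);
  const freshAddress = await requestSignal(nodeId);
  if (
    freshAddress.host !== cachedAddress.host ||
    freshAddress.port !== cachedAddress.port
  ) {
    // Newer record from the signaller: abort and retry with it.
    controller.abort();
    await attempt.catch(() => {});
    controller = new AbortController();
    attempt = punchHole(freshAddress, controller.signal);
  }
  await attempt;
}
```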

@CMCDragonkai
Member

CMCDragonkai commented Oct 26, 2023

Up-to-date node records:

Effect on Ongoing Operations: If the failed node was part of an ongoing lookup operation, the local node will continue the lookup by querying other closest nodes from its routing table. The failed node is skipped, and its failure impacts only the efficiency but not the correctness of the lookup operation.

Retry Policy: Some implementations incorporate a retry mechanism before marking a node as stale or removing it. This accounts for temporary network issues or delays that might have caused the failed connection attempt.

Marking as Stale or Immediate Removal: When a node fails to respond, it is either marked as stale or immediately removed from the bucket in the routing table. The exact behavior can vary depending on the implementation.

  1. We want to connect to NodeB, but NodeB's record is wrong. That means the connection fails; basically a timeout is reached.
  2. At this point we need a lookup operation, and it should only be done afterwards, where we ask other nodes close to it what the node's record is - you assume you don't have the record anymore.
  3. Reconciliation - we need to only take in records with the latest up-to-date information.
  4. Retry - retry with the new information.
  5. At the same time, we should have some sort of TTL applied here for prioritised refreshing - instead of immediate removal, mark the record as "stale" - this means there's a delayed GC.
  6. Instead of an active GC where we drop records, we could simply mark a record as "stale", which just means we failed to get a connection to this record the last time we tried - in fact this could be a timestamp: NULL means not stale, and a set timestamp means it failed at that time (the start time of the attempt). See the sketch after this list.
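A minimal sketch of that stale-timestamp idea, assuming an illustrative record shape (not the actual NodeGraph schema):

```ts
// Illustrative record shape; not the actual NodeGraph schema.
type NodeRecord = {
  nodeId: string;
  host: string;
  port: number;
  // null means the record is considered live; a timestamp records when the
  // last failed connection attempt to this record started.
  staleSince: number | null;
};

function markStale(record: NodeRecord, attemptStart: number): NodeRecord {
  return { ...record, staleSince: attemptStart };
}

function markFresh(record: NodeRecord): NodeRecord {
  return { ...record, staleSince: null };
}

// Delayed GC: only drop records that have been stale for longer than the TTL.
function shouldGc(record: NodeRecord, now: number, ttlMs: number): boolean {
  return record.staleSince !== null && now - record.staleSince > ttlMs;
}
```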

@tegefaulkes
Contributor Author

I have a better understanding of what's happening now. During testing between the office and a VM running on my home NAS I can see the same behaviour.

I've observed the following.

  1. Both sides can initiate the hole punch connection and succeed. I tested this by starting each node fresh and changing the order; the last node started will initiate the connection during network discovery.
  2. The node that first initiated the connection has the correct IPv4 address; the node receiving the connection has the IPv6-mapped IPv4 address in its node graph.
  3. The node with the IPv6-mapped address could not start the connection.

So I think it's a combination of two problems:

  1. The problem we solve with feat: connection with ICE now gets target port from signalling node #624. We should try connecting on the host and port provided by the signalling node, since that's the correct information at this moment.
  2. The issue NodesListConnections handler is randomly mapping IP addresses as IPv6 mapped IPv4 #614 is rearing its head here, since the node that has the IPv6-mapped address is the one that fails to initiate.

Moving forward I can merge #624 and fix up #614. This should address the current problems I'm seeing.

@CMCDragonkai
Member

So they are connected together!

@tegefaulkes
Contributor Author

Ok, so after merging #624 and running some testing, I can now start the connection from either side without issue. I can see that the problem with #614 is resolved now.

Moving forward we should do some more extensive testing between different kinds of NAT. It would be useful to create a diagnostic command to easily profile the kind of NAT we're dealing with and how it behaves. This can just be a CLI command that reports some useful information about a connection, e.g. local host and port, remote host and port, and the same information from the peer's side. We can look into methods of identifying the kind of NAT or whether the node is publicly accessible.
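As a starting point, here is a rough sketch of the kind of information such a command could report, using Node's dgram module directly; the real command would go through Polykey's own transport, and the peer's view of our address would have to be echoed back by the peer itself. The remote host and port here are placeholder values.

```ts
// Rough sketch only; placeholder remote address, not a real diagnostic command.
import * as dgram from 'node:dgram';

function reportConnectionInfo(remoteHost: string, remotePort: number): void {
  const socket = dgram.createSocket('udp4');
  socket.connect(remotePort, remoteHost, () => {
    const local = socket.address();
    const remote = socket.remoteAddress();
    // The peer's view of our address (needed to infer the NAT mapping
    // behaviour) has to come from the peer itself, e.g. echoed back in a
    // response, which is what the diagnostic protocol would add.
    console.log(`local:  ${local.address}:${local.port}`);
    console.log(`remote: ${remote.address}:${remote.port}`);
    socket.close();
  });
}

reportConnectionInfo('192.0.2.10', 4000);
```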

@tegefaulkes
Contributor Author

Two main things were done to address this issue.

  1. Signalling was updated to provide the address the signaller sees to both the target AND the source node. This ensures there is no chance of acting on out-of-date information.
  2. Fixed a problem where, when using a dual-stack socket, the hosts returned from NodeConnection were in the IPv6-mapped format. Hosts are now canonicalised: plain IPv4 when the address was IPv4 or in the mapped format, and IPv6 for IPv6. See the sketch below.
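A minimal sketch of that canonicalisation, assuming a simple string check (this is illustrative, not the actual utility used in the fix):

```ts
// Illustrative canonicalisation; not the actual utility from the fix.
// A dual-stack socket reports IPv4 peers as IPv6-mapped addresses like
// `::ffff:192.168.1.5`; for the node graph we want the plain IPv4 form.
function toCanonicalHost(host: string): string {
  const mapped = host.toLowerCase().match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/);
  return mapped ? mapped[1] : host;
}

// toCanonicalHost('::ffff:192.168.1.5') === '192.168.1.5'
// toCanonicalHost('2001:db8::1')        === '2001:db8::1'
```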

I'm considering this done now unless any other hole punching issues come up, but those would be tracked in a new issue.
