Fix hole punching to work in both directions #605

Closed
tegefaulkes opened this issue Oct 25, 2023 · 7 comments
@tegefaulkes
Contributor

Specification

While testing hole punching on the testnet, we found that a connection could only be punched between the two test machines in one direction. The connection could be established with hole punching from NodeA -> NodeB just fine, but starting the connection in reverse, with hole punching from NodeB -> NodeA, could not be done.

From this we can conclude that hole punching does work, since we can see it working in the forward case, but there is some flaw in the hole punching procedure that prevents it from working with certain kinds of NATs.

Before we can move forward with testing this, other things need to be cleaned up.

  1. We need more informative logs when it comes to the hole punching procedure.
  2. We need to reduce the amount of noise in the logging output. As it stands, there is too much logging about connections. We need a discussion about how much logging is appropriate and what kind of logging isn't needed at the INFO level.

Then we can move on to identifying why connections can only be punched one way. My first guess is that we are not providing the correct information when signalling. Right now the signalling protocol provides the src host:port to the target, but I don't think it's providing the target's information back to the src node. That is one thing to look into.
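To make the suspected gap concrete, here is a minimal sketch of what the signalling payloads might look like under that assumption; the type and field names are illustrative and not Polykey's actual wire format.

```ts
// Illustrative only; not Polykey's actual signalling types.
type Address = { host: string; port: number };

// What the signaller relays today (as described above): the source's
// observed address is forwarded to the target so it can punch back...
type SignalToTarget = {
  srcNodeId: string;
  srcAddress: Address; // src host:port as observed by the signaller
};

// ...but if nothing equivalent comes back to the source, it can only punch
// towards whatever (possibly stale) record it already holds. A symmetric
// response would close that gap:
type SignalResponseToSource = {
  dstNodeId: string;
  dstAddress: Address; // target host:port as observed by the signaller
};
```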

Additional context

Tasks

  1. Identify why a forward connection can be punched, but the reverse can't be punched.
  2. Apply a fix
@tegefaulkes tegefaulkes added the development Standard development label Oct 25, 2023
@tegefaulkes tegefaulkes self-assigned this Oct 25, 2023
@CMCDragonkai
Member

The local node graph is out of date; the signalling node has the up-to-date information.

The local node's node graph isn't GCing the old node record, and we are not getting the updated information.

The existing records could be wrong.

@CMCDragonkai
Member

CMCDragonkai commented Oct 26, 2023

We are going to be moving to a node only connecting to one of the seed nodes. This is necessary for scalability; we can't have a node connecting to every seed node.

We will need to find a common seed node to do this. If the source node is connected to seed node A, and the target node is connected to seed node B, then right now the source node has to connect to seed node B to do the hole punching.

Afterwards, during a hole punching request to the seed node, the response could give you new information about the target node, which may be more up to date than your current information.

If that does update the current node record, then the current hole punching request has to be cancelled and restarted with the new node record.

In general we need a better policy for getting up-to-date node records.
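A rough sketch of that cancel-and-restart behaviour, assuming hypothetical `punchHole` and `requestSignal` helpers (these are not Polykey's real API):

```ts
// Hypothetical helpers and types; not Polykey's real API.
type NodeAddress = { host: string; port: number };
type PunchHole = (address: NodeAddress, signal: AbortSignal) => Promise<void>;
type RequestSignal = (nodeId: string) => Promise<NodeAddress>;

// Punch towards the cached record, but if the signalling response carries a
// newer address for the target, cancel the in-flight attempt and restart it.
async function connectWithFreshRecord(
  nodeId: string,
  cachedAddress: NodeAddress,
  punchHole: PunchHole,
  requestSignal: RequestSignal,
): Promise<void> {
  let controller = new AbortController();
  let attempt = punchHole(cachedAddress, controller.signal);
  const freshAddress = await requestSignal(nodeId);
  if (
    freshAddress.host !== cachedAddress.host ||
    freshAddress.port !== cachedAddress.port
  ) {
    // Newer record from the signaller: abort and retry with it.
    controller.abort();
    await attempt.catch(() => {});
    controller = new AbortController();
    attempt = punchHole(freshAddress, controller.signal);
  }
  await attempt;
}
```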

@CMCDragonkai
Member

CMCDragonkai commented Oct 26, 2023

Up-to-date node records:

Effect on Ongoing Operations: If the failed node was part of an ongoing lookup operation, the local node will continue the lookup by querying other closest nodes from its routing table. The failed node is skipped, and its failure impacts only the efficiency but not the correctness of the lookup operation.

Retry Policy: Some implementations incorporate a retry mechanism before marking a node as stale or removing it. This accounts for temporary network issues or delays that might have caused the failed connection attempt.

Marking as Stale or Immediate Removal: When a node fails to respond, it is either marked as stale or immediately removed from the bucket in the routing table. The exact behavior can vary depending on the implementation.

  1. We want to connect to NodeB, but NodeB's record is wrong. That means the connection fails; basically a timeout is reached.
  2. At this point we need a lookup operation, and it should only be done afterwards, where we ask other nodes close to it what the node's record is - you assume you don't have the record anymore.
  3. Reconciliation - we need to only take in records with the latest up-to-date information.
  4. Retry - retry with the new information.
  5. At the same time, we should have some sort of TTL applied here for prioritised refreshing - instead of immediate removal, mark the record as "stale" - this means there's a delayed GC.
  6. Instead of an active GC where we drop records, we could simply mark a record as "stale", which just means we failed to get a connection to this record the last time we tried - in fact this could be a timestamp: NULL means not stale, and a set timestamp means it failed at that time (the start time of the attempt). See the sketch after this list.
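A minimal sketch of that stale-timestamp idea, assuming an illustrative record shape (not the actual NodeGraph schema):

```ts
// Illustrative record shape; not the actual NodeGraph schema.
type NodeRecord = {
  nodeId: string;
  host: string;
  port: number;
  // null means the record is considered live; a timestamp records when the
  // last failed connection attempt to this record started.
  staleSince: number | null;
};

function markStale(record: NodeRecord, attemptStart: number): NodeRecord {
  return { ...record, staleSince: attemptStart };
}

function markFresh(record: NodeRecord): NodeRecord {
  return { ...record, staleSince: null };
}

// Delayed GC: only drop records that have been stale for longer than the TTL.
function shouldGc(record: NodeRecord, now: number, ttlMs: number): boolean {
  return record.staleSince !== null && now - record.staleSince > ttlMs;
}
```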

@tegefaulkes
Contributor Author

I have a better understanding of what's happening now. During testing between the office and a VM running on my home NAS I can see the same behaviour.

I've observed the following.

  1. Both sides can initiate the hole punch connection and succeed. I tested this by starting each node fresh and changing the order; the last node started will initiate the connection during network discovery.
  2. The node that first initiated the connection has the correct IPv4 address; the node receiving the connection has the IPv6-mapped IPv4 address in its node graph.
  3. The node with the IPv6-mapped address could not start the connection.

So I think it's a combination of two problems:

  1. The problem we solve with feat: connection with ICE now gets target port from signalling node #624. We should try connecting on the host and port provided by the signalling node, since that's the correct information at this moment.
  2. The issue NodesListConnections handler is randomly mapping IP addresses as IPv6 mapped IPv4 #614 is rearing its head here, since the node that has the IPv6-mapped address is the one that fails to initiate.

Moving forward I can merge #624 and fix up #614. This should address the current problems I'm seeing.

@CMCDragonkai
Member

So they are connected together!

@tegefaulkes
Contributor Author

Ok, so after merging #624 and running some testing, I can now start the connection from either side without issue. I can see that the problem with #614 is resolved now.

Moving forward we should do some more extensive testing between different kinds of NAT. It would be useful to create a diagnostic command to easily profile the kind of NAT we're dealing with and how it behaves. This can just be a CLI command that reports some useful information about a connection, e.g. local host and port, remote host and port, and the same information from the peer's side. We can look into methods of identifying the kind of NAT or whether the node is publicly accessible.
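As a starting point, here is a rough sketch of the kind of information such a command could report, using Node's dgram module directly; the real command would go through Polykey's own transport, and the peer's view of our address would have to be echoed back by the peer itself. The remote host and port here are placeholder values.

```ts
// Rough sketch only; placeholder remote address, not a real diagnostic command.
import * as dgram from 'node:dgram';

function reportConnectionInfo(remoteHost: string, remotePort: number): void {
  const socket = dgram.createSocket('udp4');
  socket.connect(remotePort, remoteHost, () => {
    const local = socket.address();
    const remote = socket.remoteAddress();
    // The peer's view of our address (needed to infer the NAT mapping
    // behaviour) has to come from the peer itself, e.g. echoed back in a
    // response, which is what the diagnostic protocol would add.
    console.log(`local:  ${local.address}:${local.port}`);
    console.log(`remote: ${remote.address}:${remote.port}`);
    socket.close();
  });
}

reportConnectionInfo('192.0.2.10', 4000);
```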

@tegefaulkes
Contributor Author

Two main things were done to address this issue.

  1. Signalling was updated to provide the address the signaller sees to both the target AND the source node. This ensures there is no chance of acting on out-of-date information.
  2. Fixed a problem where, when using a dual-stack socket, the hosts returned from NodeConnection were in the IPv6-mapped format. Hosts are now canonicalised: plain IPv4 when the address was IPv4 or in the mapped format, and IPv6 for IPv6. See the sketch below.
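A minimal sketch of that canonicalisation, assuming a simple string check (this is illustrative, not the actual utility used in the fix):

```ts
// Illustrative canonicalisation; not the actual utility from the fix.
// A dual-stack socket reports IPv4 peers as IPv6-mapped addresses like
// `::ffff:192.168.1.5`; for the node graph we want the plain IPv4 form.
function toCanonicalHost(host: string): string {
  const mapped = host.toLowerCase().match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/);
  return mapped ? mapped[1] : host;
}

// toCanonicalHost('::ffff:192.168.1.5') === '192.168.1.5'
// toCanonicalHost('2001:db8::1')        === '2001:db8::1'
```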

I'm considering this done now unless any other hole punching issues come up, but those would be tracked in a new issue.
