Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce the timeout for establishing a Node Connection within the Discovery domain (by adding timer override to NodeConnectionManager) #353

Closed
emmacasolin opened this issue Feb 27, 2022 · 8 comments
Assignees
Labels
development Standard development r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy

Comments

@emmacasolin
Copy link
Contributor

Specification

The default timeout for node connections is currently set inside the NodeConnectionManager for 20s. This can only be set when creating a new Polykey agent by setting the nodeConnectionManagerConfig when calling PolykeyAgent.createPolykeyAgent():

// From src/config.defaults
nodeConnectionManagerConfig: {
  connConnectTime: 20000, // this is the relevant setting
  connTimeoutTime: 60000,
  initialClosestNodes: 3,
},

This value is subsequently used for all node connections, regardless of what the node connection is used for or where it originates from. This is a problem for the Discovery domain, where a default timeout of 20s is too long. The Discovery domain calls NodeManager.requestChainData(), which calls NodeConnectionManager.withConnF(), which subsequently creates the node connection that we use. We need to be able to change the timeout for creating this node connection from inside the Discovery domain to something smaller, for example 1-2s. This does not need to be a modifiable option that is available to the user, since it should not affect the behaviour of the Discovery domain, so the timeout can just be a constant.

In order for the Discovery domain to be able to set its own timeout for Node Connections, we need to propagate the timeout through a few different places in the code. Specifically:

  1. The chosen timeout must be set as a constant property of the Discovery class
  2. NodeManager.requestChainData() needs to have an extra parameter for this timeout (alternatively, since this requestChainData() method is not used anywhere else outside of Discovery we could move it into the Discovery class and methods called by it could just directly use the Discovery's timeout property)
  3. NodeConnectionManager.withConnF() (called by requestChainData()) needs an additional timeout parameter
  4. NodeConnectionManager.acquireConnection() (called by withConnF()) needs an additional timeout parameter
  5. NodeConnectionManager.createConnection() (called by acquireConnection()) needs an additional timeout parameter
  6. NodeConnectionManager.establishNodeConnection() (called by createConnection()) needs an additional timeout parameter which defaults to NodeConnectionManager.connConnectTime if not set explicitly - all instances of using NodeConnectionManager.connConnectTime inside this method need to switch over to using this parameter instead

Additional context

Tasks

  1. Create a timeout property for the Discovery class (around 1000-2000ms)
  2. Propagate this timeout through the relevant methods to the establishment of a node connection
@emmacasolin emmacasolin added the development Standard development label Feb 27, 2022
@CMCDragonkai CMCDragonkai mentioned this issue Mar 10, 2022
29 tasks
@CMCDragonkai CMCDragonkai changed the title Reduce the timeout for establishing a Node Connection within the Discovery domain Reduce the timeout for establishing a Node Connection within the Discovery domain (by adding timer override to NodeConnectionManager) Mar 23, 2022
@CMCDragonkai
Copy link
Member

Was this done? The NodeConnectionManager should have a timeout override, but is the discovery making use of it?

@emmacasolin
Copy link
Contributor Author

The discovery domain doesn't seem to have been touched in #378 so I don't think so

@teebirdy teebirdy added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 24, 2022
@CMCDragonkai CMCDragonkai added r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy and removed r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices labels Jul 24, 2022
@tegefaulkes
Copy link
Contributor

This is a more general problem since the discovery system doesn't set timeouts for connection to nodes and identity providers. Identity providers uses cross-fetch that doesn't have a programmed timeout but uses an AbortSignal. Which we are not utilising at the moment.

To solve this in a general sense we need to make use of the timedCancellable decorator for all of the discovery tasks. We need to incorporate a task deadline no matter weather it's connecting to a node or an identity provider. The deadline doesn't just matter for connecting to a node but also the RPC calls. This can be implemented as a GRPC deadline.

Places that require integration with Timer, timedCancellable decorator and HOF:

  • GitHubProvider
  • NodeConnectionManager,
  • NodeConnection,
  • Proxy
  • Discovery
  • GRPCClientAgent

In the GRPC client stubs this is an example of the call signature. Take notes of the options: Partial<grpc.CallOptions>, this is where you pass the deadline property. This is defined as a type Deadline = Date | number as a Date in the future or a Unix time in milliseconds. If the server doesn't answer before the deadline then we get an exception.

    public nodesClosestLocalNodesGet(request: polykey_v1_nodes_nodes_pb.Node, metadata: grpc.Metadata, options: Partial<grpc.CallOptions>, callback: (error: grpc.ServiceError | null, response: polykey_v1_nodes_nodes_pb.NodeTable) => void): grpc.ClientUnaryCall;

The server should also have a default deadline set for if the client doesn't pass a deadline to the server. This is to prevent a DDOS attack or the case of slow clients holding up the server. According to the GRPC docs the client can cancel at any time. but the server can just close the stream or end with a response. This will likely be a server deadline exceeded exception that the server throws. The longest deadline here will be the server's deadline. If the client deadline is reached then the client throws it's deadline error and cancels the connection. It is possible that both sides throws the exception when the deadline is exceeded.

@tegefaulkes
Copy link
Contributor

tegefaulkes commented Sep 15, 2022

1-2 seconds is too strict, especially for contacting identities. It's also not about what the deadline is for establishing connection but deadline for each individual call as well. It's not that the discovery is serialised, We need to make this concurrent so we can have each timeout be long running but do multiple at a time.

So Discovery needs to make use of the TaskManager to do concurrent tasks to do discovery. There should be at least 2 task handlers.

  1. visitNode
  2. visitIdentity

With reference to #328 the decision for which nodes and identities to visit should be proximity to our own gestalt, priority and trust. Where trust is the ACL property.

@CMCDragonkai
Copy link
Member

Let's make #445 stage the changes of integration task manager into discovery.

But to actually to solve this issue, we will need to incorporate deadlines to the RPC communication, and also timers and timedCancellable to all the methods in Discovery, Nodes... etc.

@CMCDragonkai
Copy link
Member

The main thing to understand here is to realise that Discovery is doing something similar to NodeGraph, only that the GestaltGraph is the database.

@tegefaulkes
Copy link
Contributor

This issue is a pretty old one and the way we solve this has changed by the usage of timedCancellable. This issue is pretty specific to using shorter timeouts than the NodeConnectionManager defaults when making connections to other nodes during discovery.

The core changes for this is handled by #477. but that is a generalised update to the discovery domain. This will be complete once #477 is complete and we set the shorter timeouts for connections during discovery.

@tegefaulkes
Copy link
Contributor

Discovery.processVertex has been updated to pass the ctx and connectionTimeout paramets to all calls within it that establish a connection, Theses are nodeManager.requestChainData and discovery.getIdentityInfo. These have their individual timeouts applied but can still be cancelled via processvertx's cancellation or timeout. I've set connectionTimeout to 2000ms for the task handler. This may need to be adjusted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Standard development r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy
Development

No branches or pull requests

4 participants