Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extracting Node Connection Management out of NodeManager to NodeConnectionManager #310

Merged
merged 1 commit into from
Feb 14, 2022

Conversation

joshuakarp
Copy link
Contributor

@joshuakarp joshuakarp commented Dec 22, 2021

Description

The refactoring effort to make our nodes domain more robust.

                              ┌──────────┐
                              │ Sigchain │
                              └────▲─────┘
                                   │
                                   │
                                   │
                            ┌──────┴──────┐
                            │ NodeManager ├──────────────────┐
                            └──────┬──────┘                  │
                                   │                         │
                                   │                         │
                                   │                         │
                       ┌───────────▼───────────┐             │
┌──────────────┐       │                       │             │
│ ForwardProxy ◄───────┤ NodeConnectionManager │             │
└──────────────┘       │                       │       ┌─────▼─────┐
                       │   ┌──────────────┐    ├───────► NodeGraph │
┌──────────────┐       │   │NodeConnection│    │       └───────────┘
│ ReverseProxy ◄───────┤   └──────────────┘    │
└──────────────┘       │                       │
                       └─────▲─────▲────▲──────┘
                             │     │    │
                             │     │    │
                  ┌──────────┘     │    └─────────────┐
                  │                │                  │
             ┌────┴────┐    ┌──────┴─────┐  ┌─────────┴──────────┐
             │Discovery│    │VaultManager│  │NotificationsManager│
             └─────────┘    └────────────┘  └────────────────────┘

Issues Fixed

Tasks

  1. NodeConnectionManager:

    • draft API plan:
    class NodeConnectionManager {
       // creates/retrieves a connection, and uses to perform passed f
       public withConnection<T>(targetNodeId, f: (conn: NodeConnection => Promise<T>)): Promise<T>;
       // simply creates a connection, but doesn't use it for anything yet
       public createConnection(targetNodeId): Promise<void>;
       public destroyConnection(targetNodeId): Promise<void>;
       // Migrated from NodeManager:
       // helper function to create and set a NodeConnection (called twice in createConnection)
       protected establishNodeConnection(targetNodeId, lock): Promise<void>;
       public findNode(targetNodeId): Promise<NodeAddress>;
       public openConnection(egressHost, egressPort, timer): Promise<void>;
       // Migrated from NodeGraph:
       public getClosestLocalNodes(targetNodeId): Promise<Array<NodeData>>;
       public getClosestGlobalNodes(targetNodeId): Promise<NodeAddress | undefined>;
    }
  2. NodeConnection:

    • draft API plan:
    class NodeConnection {
      // publicly exposed client, such that NodeConnectionManager.withConnection can do calls
      // such as conn.client.doSomething()
      public client: GRPCClientAgent;
      // does everything that NodeConnection.start currently does
      public create/start();
      // removes itself from the reference it has to NodeConnectionManager's connection map
      public destroy()/stop();
      // Retained from current NodeConnection:
      public getRootCertChain();
      public getExpectedPublicKey();
    }
  3. NodeGraph:

    • draft API plan:
    class NodeGraph {
      // no changes to current API, except the function removals as listed below
    }
  4. NodeManager:

    • draft API plan:
    class NodeManager {
      // Retaining the following existing functions:
      public getNodeId();
      public knowsNode(targetNodeId);
      public pingNode(targetNodeId);
      // or could even move this to KeyManager now - gets the target node's public key
      public getPublicKey(targetNodeId); 
      // queries the encapsulated sigchain
      public getChainData();
      // queries a target node's sigchain
      public requestChainData(targetNodeId);
      public claimNode(targetNodeId);
    }
  5. confirm issues have been addressed

  6. things to double check.

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

Once this is merged, it's time to do Testnet Deployment! #194

@joshuakarp joshuakarp self-assigned this Dec 22, 2021
@joshuakarp
Copy link
Contributor Author

From yesterday's discussions, the plan for the NodeConnectionManager locking is as follows:

class NodeConnectionManager {
  public withConnection() {
    // if conn exists, lock = its read lock
    // otherwise, lock = a new write lock
    createConnection(lock)
    // release lock
  }

  public createConnection(lock?) {
    if (lock == null) newLock = new write lock
    // do connection creation
    if (lock == null) release newLock
  }
}

As a result of this, the expectation is that we'd only really pass in our own lock when you know what you're doing, and that you've already acquired the lock that you're passing in. That seems like somewhat ambiguous behaviour to me. I feel like all of this locking behaviour shouldn't be publicly exposed at all. I know that it's an optional parameter, but we don't want other domains to need to think about this lock parameter at all.

Alternatively, we could do the following:

class NodeConnectionManager {
  public withConnection() {
    createConnection()
    // now we know a connection must exist
    // so simply get the connAndLock, acquire its read lock and do the operation
  }

  public createConnection() {
    // this should basically be identical behaviour to what getConnectionToNode currently is
    // however, it shouldn't return the NodeConnection: 
    get connAndLock
    if connAndLock != null
      if conn != null: return
      acquire write lock
      if conn != null after acquiring: return
      otherwise create connection
      release write lock
    if connAndLock == null:
      create lock and acquire write
      set lock in connections map
      create conn
      set conn in connections map
      release write lock
  }
}

@joshuakarp
Copy link
Contributor Author

joshuakarp commented Dec 22, 2021

With the current committed usage of RWLock, I've been able to verify concurrent connection behaviour:

  public async withConnection<T>(
    targetNodeId,
    f: (conn: NodeConnection) => Promise<T>,
  ): Promise<T> {
    // Ensure a connection exists
    await this.createConnection(targetNodeId);
    const connAndLock = this.connections.get(targetNodeId);
    // Unexpected behvaiour: between call to create connection and retrieval, 
    // the conn or conn entry was deleted
    // if (connAndLock == null || connAndLock.connection == null) {
    //   throw new Error();
    // }
    // The above is commented out such that we don't need to create a NodeConnection - just testing the locking mechanisms
    // Acquire the read lock (allowing concurrent connection usage)
    return await connAndLock!.lock.read<T>(async () => {
      return await f(connAndLock!.connection!);
    });
  };

  // these are used by `withConnection`
  // reusing terms from network domain:
  // connConnectTime - timeout for opening the connection
  // connTimeoutTime  - TTL for the connection itself
  public async createConnection(targetNodeId: NodeId): Promise<void> {
    let connection: NodeConnection | undefined;
    let lock: utils.RWLock;
    let connAndLock = this.connections.get(targetNodeId);
    if (connAndLock != null) {
      ({ connection, lock } = connAndLock);
      // Connection already exists, so return
      if (connection != null) return;
      // Acquire the write (creation) lock
      await lock.write(async () => {
        // Once lock is released, check again if the conn now exists
        connAndLock = this.connections.get(targetNodeId);
        if (connAndLock != null && connAndLock.connection != null) return;
        // TODO: create the connection and set in map
        console.log('existing lock: created connection');
      });
    } else {
      lock = new utils.RWLock();
      connAndLock = { lock };
      this.connections.set(targetNodeId, connAndLock);
      await lock.write(async () => {
        // TODO: create the connection and set in map
        console.log('no existing entry: created connection');
      });
    }
  }

With the following test.ts:

import { NodeConnectionManager } from './src/nodes';
import type { NodeId } from './src/nodes/types';
import { sleep } from './src/utils';

async function main() {
  const connManager = new NodeConnectionManager();
  await connManager.createConnection('a' as NodeId);
  await Promise.race([
    connManager.withConnection('a' as NodeId, async () => {
      console.log('1.0');
      await sleep(1000);
      console.log('1.1');
      await sleep(1000);
      console.log('1.2');
      await sleep(1000);
      console.log('1.3');
      await sleep(1000);
      console.log('1.4');
    }),
    connManager.withConnection('a' as NodeId, async () => {
      console.log('2.0');
      await sleep(1000);
      console.log('2.1');
      await sleep(1000);
      console.log('2.2');
      await sleep(1000);
      console.log('2.3');
      await sleep(1000);
      console.log('2.4');
    }),
  ]);
}

main();

We get:

[nix-shell:~/Documents/MatrixAI/js-polykey]$ npm run ts-node -- test-node.ts 

> @matrixai/polykey@1.0.0 ts-node /home/josh/Documents/MatrixAI/js-polykey
> ts-node --require tsconfig-paths/register "test-node.ts"

no existing entry: created connection
existing lock: created connection
existing lock: created connection
1.0
2.0
1.1
2.1
1.2
2.2
1.3
2.3
1.4
2.4

Changing the return await connAndLock!.lock.read<T> to a write lock, return await connAndLock!.lock.write<T> in withConnection, we get:

[nix-shell:~/Documents/MatrixAI/js-polykey]$ npm run ts-node -- test-node.ts 

> @matrixai/polykey@1.0.0 ts-node /home/josh/Documents/MatrixAI/js-polykey
> ts-node --require tsconfig-paths/register "test-node.ts"

no existing entry: created connection
existing lock: created connection
existing lock: created connection
1.0
1.1
1.2
1.3
1.4
2.0
2.1
2.2
2.3
2.4

This is expected, as the write lock should block concurrent usage.

@joshuakarp
Copy link
Contributor Author

There needs to be some further discussion concerning the async-init patterns we use for the nodes domain.

Original comment from @CMCDragonkai about this here #225 (comment):

  • NodeGraph - CDSS - because this is now shared between NodeConnectionManager and NodeManager, this has to be created outside of NodeManager as a singleton and shared - created first
  • NodeConnectionManager - SS or CDSS or CD - created second
  • NodeManager - SS or CDSS or CD - created third

NodeGraph (currently CDSS):

  • should be CDSS, as it has persistent data storage that should be initialised on create/start, and completely wiped on destroy
  • ultimately just a data structure now, akin to ACL/GestaltGraph (which also use CDSS)

NodeConnection (currently CDSS):

  • potentially this only needs to be a CD pattern - this would also follow the usage of the CD pattern with GRPCClientAgent
  • I don't foresee any problems with migrating the current start into createNodeConnection instead. start currently:
    • creates the GRPCClientAgent
    • sends hole punching message to the target node (using the seed connections)
  • this could also relate to the choices made in Integrate async-init pattern to network/Connection.ts #293

NodeConnectionManager:

  • most likely dependent on what NodeConnection is - if it remains as CDSS, then this will also likely be CDSS (stops the NodeConnections on stop, and destroys on destroy), otherwise this could also be CD

NodeManager (currently CDSS):

  • given that NodeGraph will no longer be an encapsulated dependency on NodeManager, there's also some flexibility with what this will need to be. Previously, we had to propagate the NodeGraph.destroy into the NodeManager.destroy, but this will no longer be the case

Also worthwhile to recall this comment from me regarding a problem with ordering on PolykeyAgent startup #289 (comment) and the discussion of using the start-stop pattern as a way to mitigate this #225 (comment).

@joshuakarp
Copy link
Contributor Author

We'll also need to slightly rework the usage of seed node connections.

At the moment, we have a function NodeManager.getConnectionsToSeedNodes that establishes/retrieves the connections to each of our seed nodes. This is used in a few places:

  • PolykeyAgent.start: for initially establishing the connections. This is currently separated from the regular NodeManager.start as a result of some ordering issues discussed here Supporting trusted seed nodes (nodes domain refactoring, static sources for seed nodes) #289 (comment). Ultimately, this problem should be aimed to be resolved in this nodes refactoring effort.
  • NodeConnection.start: the seed node connections are passed in every createNodeConnection call. This is currently done to allow hole punching messages to be sent to the seed nodes, and then relayed to the target node to allow the connection to be established. Calls NodeConnection.sendHolePunchMessage.
  • NodeGraph.syncNodeGraph: for initially populating a new NodeGraph with some initial nodes closest to it. We ask the seed nodes for the closest nodes to our newly created node. Calls NodeConnection.getClosestNodes.

This shouldn't require too much of a rework. Initially thinking about this, we can instead statically store the seed node IDs in NodeConnectionManager, and just call withConnection on each of these IDs with the appropriate functionality we require.

@joshuakarp
Copy link
Contributor Author

There's also the problem of code repetition when using withConnection. For example, we currently have a getClosestNodes function in NodeConnection. This is called in 2 places:

  1. NodeGraph.syncNodeGraph
  2. NodeGraph.getClosestGlobalNodes

Both places connect to a remote node and retrieve a set of closest nodes (i.e. the same functionality).

We'll no longer be wrapping this common functionality within a NodeConnection function. Instead, we could wrap it as a common function in some other class like so:

class NodeConnectionManager {
  public requestClosestNodes(targetNodeId) {
    return this.withConnection(targetNodeId, async () => {
      // functionality to get the closest nodes
    });
  }
  
  public syncNodeGraph() {
    // ...
    this.requestClosestNodes(this.keyManager.getNodeId());
  }

  public getClosestGlobalNodes(targetNodeId) {
    // ...
    this.requestClosestNodes(targetNodeId);
  }
}

Although I am wary that this starts to lean towards the issues that we're currently trying to solve (where NodeConnection has multiple functions that simply construct GRPC messages and relay calls to the GRPC client). Here, we seem to just be pulling this problem out of NodeConnection and into some other domain.

@joshuakarp
Copy link
Contributor Author

Given that NodeGraph will no longer be treated as an encapsulated dependency of NodeManager, we also have the opportunity to remove the forwarding calls in NodeManager that simply query the NodeGraph. For example:

  public async getNode(targetNodeId: NodeId): Promise<NodeAddress | undefined> {
    return await this.nodeGraph.getNode(targetNodeId);
  }

We have similar functions for setting a node, getting all the node buckets, getting the closest global nodes, and refreshing the buckets.

This would mean that we would instead inject NodeGraph into the domains that it would need to be used.

So 2 alternatives here:

  1. We retain the forwarding functions in NodeManager, and keep it as is where we inject NodeManager everywhere it's necessary to query/add to the NodeGraph
  2. We remove the forwarding functions, and instead inject NodeGraph everywhere we need it. This greatly simplifies the API of NodeManager, but means we'd need to expose NodeGraph in lots of different places

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 28, 2021

Regarding with contexts and general resource management patterns.

In OOP we have RAII, and this is being done through the js-async-init. But it's fairly a manual process with createX, destroy, start, stop being used. In our overall app architecture this is put together as matroska dolls and nesting the create, start, stop, destroy.

In imperative programming, we have the try finally pattern that interacts with RAII. Basically create the resource initially, then in finally destroy the resource. This works well with RAII. And it's also really flexible, since it can be used pretty much anywhere.

In functional programming we have additional pattern: with contexts or known as the bracketing pattern. This is a higher order function that takes a callback and injects it the resource, and when the callback is done, it will release/destroy the resource. We have used this in a number of places as it's a nicer interface for interacting with resources and it automates the destruction as it scopes the resource lifecycle to the function callstack/scope. We have used this in a number of places and intend to use it for things like withConn, withVault, withLock... etc.

One issue with using brackets, is that JS has 2 kinds of async functions (assume that such resource management always takes place asynchronously), ones that return a promise, and ones that return an async generator. And this would require 2 different variants of the bracket function like withConnF and withConnG. Additionally nesting brakets/with contexts can lead to a callback mountain which isn't as nice, in which case a generalised with combinator or promise-based chaining system should be used.

For example a general with combinator would look like this:

// similar to python's with Context() as a, Context() as b:
withF([resource, resource], (a, b) => {

});
// need a async generator variant
withG([resource, resource], async function* () {

});

To do the above, the resources must all have a similar expected interface. Perhaps something like:

// the T may be optional, in that it may not actually be available
// locking resource doesn't give any lock resource to the callback
type Resource<T> = () => Promise<[T, () => Promise<void>]>

This would allow the with combinator to always work any resource such as those exposed via imperative APIs.

let count = 0;
withF([
  async () => {
    ++count;
    return [count, async () => { --count; }]
  }
], (c) => {
  console.log(c);
})

The with combinator signature might be complicated, since you have to accumulate the types of resources that will be acquired, and expect a function to be passed in that takes those types. This would be similar to the signature for Promise.all since it also has to take an array of arbitrary types and then spread it into the then callback.

A promise chaining system would be similar to how promises can chain stuff:

resourceA().with(resourceB).with(resourceC).then((a, b, c) => {
});

But I can't really see this being much better than what is discussed above.

@tegefaulkes @joshuakarp relevant to our discussions about the right API for nodes and vaults.

@CMCDragonkai
Copy link
Member

If withF and withG are generalised utility functions. Then we wouldn't expect each class/domain to expose it. Instead each domain would need to expose a "resource API" as per the example given:

type Resource<T> = () => Promise<[T, () => Promise<void>]>

So imagine for ACL, instead of transaction, you expose the ability to lock the ACL.

class ACL {
  public function acquireLock() {
    await this.lock.acquire();
    return [undefined, () => {
      release();
    }];
  }
}

Then external users could then use withF([acl.acquireLock()], ...);.

@tegefaulkes
Copy link
Contributor

Right now NodeConnection requires a Map<NodeId, NodeConnection>. this doesn't work with the withConnection()method of using the connections. I could usewithConnection()inside of theNodeConnectionbut having theNodeConnectionkeep a reference back to theNodeConnectionManager` doesn't feel quite right.

@CMCDragonkai
Copy link
Member

It's totally fine for node connection to keep a reference back to the manager. We do this already in the networking proxies.

@CMCDragonkai
Copy link
Member

We would normally want the collection to be weakly referenced. But it's not possible in JS even with weak map. So you'll have to use Map with a reverse reference for garbage collecting it's reference.

@CMCDragonkai
Copy link
Member

The RWLock should be changed over to write-preferring to prevent write-starvation. Example of this is here: https://gist.github.com/CMCDragonkai/4de5c1526fc58dac259e321db8cf5331#file-rwlock-write-ts

I've already used this in js-async-init to great effect. The current RWLock in js-polykey is read-preferring is not the best way to do it.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Jan 6, 2022

Priorities:

  1. NodeConnectionManager - this is the centerpiece
  2. NodeManager - moving functionality out of this into NodeConnectionManager and also using NodeConnectionManager when interacting with nodes
  3. NodeGraph - changing this to being a data structure and move out some functionality into NodeConnectionManager
  4. Then address the error handling, first in GRPCClient and watching the connectivity state, then in NodeConnection, look into ready method wrapping as well when other users end up calling on a "broken"/"dead" node connection
  5. Finally look at all the other domains that need to call other keynode's agent service like VaultManager.. etc and how they should be changed accordingly to use withConn or our generic with combinator

@tegefaulkes
Copy link
Contributor

I've created an interface that exposes only the needed functionality from NodeConnectionManager for NodeConnection. This includes a personalised callback to handle destroying and removing itself from the connection map.

@CMCDragonkai
Copy link
Member

Rebase please!

@CMCDragonkai
Copy link
Member

@tegefaulkes can you make sure this PR removes this:

[nix-shell:~/Projects/js-polykey/tests]$ ag 'failedConnectionTimeout'
global.d.ts
11:declare var failedConnectionTimeout: number;

bin/nodes/ping.test.ts
114:    global.failedConnectionTimeout * 2,
136:    global.failedConnectionTimeout * 2,

bin/nodes/find.test.ts
195:    global.failedConnectionTimeout * 2,

bin/nodes/add.test.ts
98:    global.failedConnectionTimeout,
117:    global.failedConnectionTimeout,

client/rpcNodes.test.ts
230:    global.failedConnectionTimeout * 2,
275:    global.failedConnectionTimeout * 2,
301:    global.failedConnectionTimeout * 2,

nodes/NodeManager.test.ts
265:      global.failedConnectionTimeout * 2,
315:    global.failedConnectionTimeout * 2,
391:    global.failedConnectionTimeout * 2,

This should no longer be necessary once you have timeouts dependency injected, or timer mocking. Once done, remove the variable from jest.config.js.

@CMCDragonkai
Copy link
Member

I've added #274, #216, #254 to this PR @tegefaulkes since these are all related together, it's best to refactor these together.

@tegefaulkes
Copy link
Contributor

Re-based on master.

@CMCDragonkai
Copy link
Member

There's some confusion between all these methods that we should make clearer.

  • withConnection - with wrapper, should have promise and async generator variant
  • createConnection - public methods to create a connection but doesn't return the connection
  • destroyConnection - public method to destroy a connection
  • establishNodeConnection - wrapped function that actually creates the connection
  • openConnection - hole punch for reverse proxy
  • fwdOpenConnection - hole punch for forward proxy

These are all quite confusing, we should be renaming these accordingly.

  1. Investigate whether createConnection should really be public or not, and this has a relationship with the seed nodes synchronisation, it may not be necessary anymore
  2. Maybe createConnection should actually be getConnection, in this case the connection object is actually returned. But this becomes a manual usage.
  3. The establishNodeConnection which is a protected function should be renamed so that it's clearer that it is a wrapper worker function. This is based on the object map pattern which was originally called createObject, and if 2 can be addressed, then establishNodeConnection can be createConnection
  4. The hole punching functions shouldn't be called openConnection... etc, they can be called holePunchForward or holePunchReverse

@tegefaulkes I think it's a good idea for you to review in detail the design of 2 things:

  1. The kademlia node graph so you have an idea how the contact database is being synchronised
  2. How the hole punching ICE logic is supposed to work

@tegefaulkes
Copy link
Contributor

When moving some functionality about I keep getting to the question "Do I put it in the NodeManager or the NodeConnectionManager?". It seems to me that the NodeConnectionManager should solely handle the NodeConnections and their life cycle. So maybe all functionality such as getClosestLocalNodes, sendHolePunchMessage or findNode should be in the NodeManager?

Also to note, there are two functions for getClosestLocalNodes. One that calls the GRPC and one that is called by the service. Like above I think they should both be in the NodeManager. However I need a good way to tell them apart by name.

@CMCDragonkai
Copy link
Member

Connection-related work is indeed in the NodeConnectionManager. So if there are functions relating to managing connections, doing hole punching.

Note in the diagram #225 (comment) NodeManager requires NodeConnectionManager but not the other way around. So this has some implications for how sending notifications needs to occur. It's possible that NodeConnectionManager needs NotificationsManager to do it. But there's a recursive dependency here too.

This mutually recursive dependency has to be resolved by further abstraction, or some sort of normalisation. Prototype how it can be done with sending notifications.

For getClosestLocalNodes... you have the sender and receiver. One is to ask another node what their closest local nodes are, and another is to give back what the closest local nodes are. You could centralise it by giving it a parameter, where if the nodeId is undefined, then it assumes it means itself. Or separate the 2 functions entirely by different names.

@tegefaulkes
Copy link
Contributor

NodeManager has a getNodeId() method that just wraps the KeyManager.getNodeId() maybe we should get this directly from the KeyManager?

@tegefaulkes
Copy link
Contributor

I've done almost everything now. only things left are

  • refactoring the destroyedCallback stuff is mostly complete, just have to fix a small problem with it. I've made a comment about it. but it's getting too late to finish now.
  • the NCM.getInterface() method can be removed and replaced with casting to the interface. But we will have to extract the destroyCallback from it and create that separately.

@CMCDragonkai
Copy link
Member

Regarding the NodeConnectionManagerInterface, once you have removed the destroyCallback from the interface, you can do:

  1. Remove the getInterface method entirely. You can just use structural type to satisfy the interface.
  2. Move the NodeConnectionManagerInterface into NodeConnection.ts, it is the receiver that cares about this type, not the sender in this case.
  3. The NodeConnectionManagerInterface can just be private type to NodeConnection.ts, no need to export it. It's equivalent to an embedded type in the function signature to NodeConnection.createNodeConnection.

This means the destroyCallback is passed separately.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Feb 14, 2022

Some more issues I'm working on:

  1. The placement of the watchConnectivityState should be in the GRPCClient.constructor because it needs to have access the instance to be able to call this.destroy. - DONE
  2. Whether to do wrapping of destroyCallback in NC or not. I figured that no, because each class can handle their own responsibility here.
  3. How should we be checking whether to call the destroyCallback or not, either by checking the status 'destroying' and also when it is already destroyed. - Now also checking && !client[asyncInit.destroyed].
  4. The errors that can come into checkState. According to grpc-js, the only possible errors are: new Error('Deadline passed without connectivity state change') and possibly an exception when the client is stopped. See grpc-node/packages/grpc-js/src/channel.ts. What is the exception when it is stopped? - Many exception checks were removed, because no exceptions could be thrown there. Some are undefined behaviour because the error should never be thrown.
  5. Should be able to remove the usage of setImmediate from watchConnectivityState it should be able to register the handler synchronously. - Now we just call checkState immediately, but we may require a eslint comment to allow it to pass the linting.
  6. Changed to use connectivityState enum, rather than the numbers.

@CMCDragonkai
Copy link
Member

Still need to add a test that tests the READY to IDLE transition.

@joshuakarp
Copy link
Contributor Author

joshuakarp commented Feb 14, 2022

Still need to add a test that tests the READY to IDLE transition.

See this comment #224 (comment)

Any transition from READY to IDLE is going to kill the connection with the current implementation. At this point, I believe there's only 2 cases where this transition is observable:

  1. when the server closes the connection (already have a test for this)
  2. when the underlying GRPC TTL times out (no test for this yet)

At the moment, the GRPC TTL is internally something like 6 minutes according to @tegefaulkes. I can look into how to override this, but note that with the current implementation, it's just going to close the connection.

If we want to support the transition cycle between IDLE and READY (when the GRPC TTL expires), then some extra logic is necessary (I could chat to @tegefaulkes about this).

@CMCDragonkai
Copy link
Member

Pushed up some changes to the NodeConnection hierarchy. @tegefaulkes please adjust the changes in the NodeConnectionManager and update test for it as well. There was a lot of un-used parameters for NodeConnection.

@joshuakarp
Copy link
Contributor Author

There's a GRPC_ARG_CLIENT_IDLE_TIMEOUT_MS https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga51ab062269cd81298f5adb6fd9a45e99

Timeout after the last RPC finishes on the client channel at which the channel goes back into IDLE state.

Int valued, milliseconds. INT_MAX means unlimited. The default value is 30 minutes and the min value is 1 second.

But there doesn't seem to be a field for this in ChannelOptions.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Feb 14, 2022

Still need to add a test that tests the READY to IDLE transition.

See this comment #224 (comment)

Any transition from READY to IDLE is going to kill the connection with the current implementation. At this point, I believe there's only 2 cases where this transition is observable:

  1. when the server closes the connection (already have a test for this)
  2. when the underlying GRPC TTL times out (no test for this yet)

At the moment, the GRPC TTL is internally something like 6 minutes according to @tegefaulkes. I can look into how to override this, but note that with the current implementation, it's just going to close the connection.

If we want to support the transition cycle between IDLE and READY (when the GRPC TTL expires), then some extra logic is necessary (I could chat to @tegefaulkes about this).

Yes we want a test for 2., see this: https://grpc.github.io/grpc/core/group__grpc__arg__keys.html. According to the docs (https://github.com/grpc/grpc/blob/master/doc/connectivity-semantics-and-api.md) a IDLE_TIMEOUT should exist. But I cannot find this parameter in the GRPC js codebase.

Maybe @tegefaulkes just run a test and leave it running for 10-15 minutes... etc, and see?

@CMCDragonkai
Copy link
Member

Once we find out how to trigger it, we can always use timer mocking to make it to grpc that the time has passed alot!

@CMCDragonkai
Copy link
Member

@tegefaulkes at any case, if we cannot find a quick fix to that, we can add that as a later issue. And just fix up the final code based on my changes, and then we can lint/squash merge.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Feb 14, 2022

Will also create a new issue for resolving the cycle in NodeConnection when it has to create the node connection instance before it creates the client instance. Obviously we would want to create the client instance first. But creating the client instance requires the destroyCallback, and the destroy callback requires a reference to the node instance.

A similar situation occurs in GRPCClient, which was resolved by calling this.watchConnectivityState inside the constructor, so it has access to the instance as well.

But in this case, the async creation of the derived GRPCClient classes requires the destroyCallback, because they are constructing the instance.

It seems like the proper solution is to change to StartStop, or CreateDestroyStartStop for these connection classes and make it similar to network domain's ConnectionForward and ConnectionReverse. The CreateDestroy pattern here is just a bit inflexible.

New issue started here #333.

@joshuakarp
Copy link
Contributor Author

There's a GRPC_ARG_CLIENT_IDLE_TIMEOUT_MS https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga51ab062269cd81298f5adb6fd9a45e99

Timeout after the last RPC finishes on the client channel at which the channel goes back into IDLE state.
Int valued, milliseconds. INT_MAX means unlimited. The default value is 30 minutes and the min value is 1 second.

But there doesn't seem to be a field for this in ChannelOptions.

There's a few mentions of this in the C-based GRPC: https://github.com/grpc/grpc/search?q=GRPC_ARG_CLIENT_IDLE_TIMEOUT_MS&type=code

@CMCDragonkai
Copy link
Member

There's a GRPC_ARG_CLIENT_IDLE_TIMEOUT_MS https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga51ab062269cd81298f5adb6fd9a45e99

Timeout after the last RPC finishes on the client channel at which the channel goes back into IDLE state.
Int valued, milliseconds. INT_MAX means unlimited. The default value is 30 minutes and the min value is 1 second.

But there doesn't seem to be a field for this in ChannelOptions.

There's a few mentions of this in the C-based GRPC: https://github.com/grpc/grpc/search?q=GRPC_ARG_CLIENT_IDLE_TIMEOUT_MS&type=code

Can you make a question on grpc-js about this parameter.

@joshuakarp
Copy link
Contributor Author

I've posted a question on their discussion channel https://groups.google.com/g/grpc-io/c/yq4pmBGaXOQ

@tegefaulkes
Copy link
Contributor

Everything is fixed now.
I've run all the tests and only 3 tests files are failing.

I'm going to start the squash and merge process now.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Feb 14, 2022

Ok I'll add create a new issue regarding the idle timeout. #332.

…ctionManager:

- fixes and updating inline documentation
- Removing connection establishment before adding node to NodeGraph
- Made `createConnection` private
- Removed transaction and locking from `NodeManager`
- removed `clearDB` from `NodeGraph`
- Added `nodeConnectionManagerConfig` to `PolykeyAgent`
- added `withConnF`, `withConnG` and `acquireConnection` to `NodeConnectionManager`
- moved some functionality from `NodeManager` and `NodeGraph` to `NodeConnectionManager`
- Implementing `NodeConnectionManager.ts`
- Extracted `NodeGraph` from `nodeManager`
- Updated `nodes/errors.ts` exit codes
- Updated `@matrixai/id` to `^3.0.0`
- Fixed random failures in NodeGraph.test.ts
- Expanded tests for `NodeConnectionManager` and `NodeConnection` to check connection failure conditions cleanup triggered in response
- NodeConnection connection failure handling
- Updated asyn-init pattern for `NodeGraph`, `NodeManager`, `NodeConnection` and `NodeConnectionManager`
- Shuffled methods to appropriate classes
- Switched to selective imports to remove cycles
- Fixed GRPC logging
- Refactored `NodeGraph` distance calulation and fixed tests
- NodeConnection only attemps hole punching for non seed nodes
- Removed `getNodeId` wrappers from nodes classes
- Added curly rule to eslint to prevent braceless for single statement
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment