Continuous connection attempts between a deployed seed node and local agent #415

Closed
4 tasks done
emmacasolin opened this issue Jul 13, 2022 · 10 comments · Fixed by #445
Assignees
Labels
bug Something isn't working r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices

Comments

@emmacasolin
Contributor

emmacasolin commented Jul 13, 2022

Describe the bug

After initially connecting to a deployed seed node, the seed node will also attempt to open a forward connection back to the local agent that initialised the original connection. The two nodes will then continue attempting to connect to each other (how long this continues is yet to be determined, but it is at least several minutes).

To Reproduce

  1. Connect to a deployed agent as a seed node using npm run polykey -- agent start --seednodes="[deployedAgentNodeId]@[deployedAgentHost]:[deployedAgentPort]" --verbose --format json

Expected behavior

Connections should not be re-attempted continuously, regardless of whether a successful connection was established in the beginning. Once a connection is established there should be no more logs about the state of the connection unless it changes. Some of the connections may also be to other nodes as part of syncing the node graph in the background; however, if each of these connections is logged it might start to get confusing, so maybe we should rethink this.

Additional context

Tasks

  • 1. Remove logging messages for getting existing connections. We should only be logging the life-cycle of connections. trivial
  • 2. Add 'jitter' to the refreshBucket delays to avoid large batches happening at once (see the sketch after this list). trivial
  • 3. Add a 'heuristic' for refreshBucket to avoid asking the same node over and over again. 1 hour
    • If a refreshBucket operation returns fewer than 20 nodes then that's the WHOLE network it can see; reset the delays for all other buckets.
    • If a node has been queried recently then skip asking it again. Might not need this one; it will be addressed alongside Starting Connection Forward infinite loop #413, if at all.
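
A minimal sketch of the jitter in task 2, assuming a hypothetical jitteredDelay helper and using setTimeout as a stand-in for whatever scheduler actually runs the refreshBucket tasks:

// Sketch only: jitteredDelay and the 256-bucket loop are illustrative,
// not the actual Polykey scheduling code.
function jitteredDelay(baseDelay: number, spread: number = 0.5): number {
  // Spread the delay uniformly within +/- (spread / 2) of the base delay,
  // e.g. spread = 0.5 yields a delay in [0.75 * baseDelay, 1.25 * baseDelay].
  return baseDelay + (Math.random() - 0.5) * spread * baseDelay;
}

const oneHourMs = 60 * 60 * 1000;
for (let bucketIndex = 0; bucketIndex < 256; bucketIndex++) {
  // Without jitter, all of these tasks would fire at exactly the same time.
  setTimeout(() => {
    // refreshBucket(bucketIndex) would run here.
  }, jitteredDelay(oneHourMs));
}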
@emmacasolin
Contributor Author

After following the lifecycle of a local node starting up, syncing its node graph with a deployed agent seed node, running for a few minutes, then being killed, I believe this issue is likely closely related to (if not the same issue as) #413. The process of syncing the node graph in the beginning looks normal (on both sides), but once this is done we get a continuous stream of NodeConnectionManager logs about connecting back to the local agent. Since we don't see any actual logs of the connection being made (on either side), I believe the deployed agent isn't actually making multiple connections - it just attempts to, realises there's an existing one, and drops the connection. But these logs can be very confusing, so maybe we should rethink whether they're needed?

Once the local agent is killed, though, we see the Starting Connection Forward logs from #413, which makes me think these logs are coming from the same underlying process and just look different depending on whether the local agent is alive or not. Either way, this seemingly infinite logging seems unnecessary.

Logs on the deployed agent:

// First connection (syncing node graph) - local logs look normal so not included
INFO:Proxy:Handling connection from 202.92.108.228:57021
INFO:ConnectionReverse 202.92.108.228:57021:Starting Connection Reverse
INFO:ConnectionReverse 202.92.108.228:57021:Started Connection Reverse
INFO:ConnectionReverse 202.92.108.228:57021:Composing Connection Reverse
INFO:ConnectionReverse 202.92.108.228:57021:Composed Connection Reverse
INFO:PolykeyAgent:Reverse connection adding vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg:202.92.108.228:57021 to NodeGraph
INFO:Proxy:Handled connection from 202.92.108.228:57021
INFO:ConnectionForward 202.92.108.228:57021:Starting Connection Forward
INFO:ConnectionForward 202.92.108.228:57021:Started Connection Forward
INFO:NodeConnectionManager:Getting connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:no existing entry, creating connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnection 202.92.108.228:57021:Creating NodeConnection
INFO:clientFactory:Creating GRPCClientAgent connecting to 202.92.108.228:57021
INFO:Proxy:Handling CONNECT to 202.92.108.228:57021
INFO:ConnectionForward 202.92.108.228:57021:Composing Connection Forward
INFO:ConnectionForward 202.92.108.228:57021:Composed Connection Forward
INFO:Proxy:Handled CONNECT to 202.92.108.228:57021
INFO:clientFactory:Created GRPCClientAgent connecting to 202.92.108.228:57021
INFO:NodeConnection 202.92.108.228:57021:Created NodeConnection
INFO:NodeConnectionManager:withConnF calling function with connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:Getting connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:existing entry found for vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:withConnF calling function with connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:ConnectionReverse 202.92.108.228:57021:Stopping Connection Reverse
INFO:ConnectionReverse 202.92.108.228:57021:Stopped Connection Reverse

// Finished syncing node graph - from this point there are no more local logs
INFO:NodeConnectionManager:Getting connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:existing entry found for vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:withConnF calling function with connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:Getting connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:existing entry found for vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:withConnF calling function with connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:Getting connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:existing entry found for vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
INFO:NodeConnectionManager:withConnF calling function with connection to vjrd8n87opao3ctcnirccdo9e6crpq5r92m17umnv9ku00od9vdpg
... 
// This just keeps going until I kill the local agent

// This is the point at which I kill the local agent
INFO:ConnectionForward 202.92.108.228:57021:Stopping Connection Forward
INFO:clientFactory:Destroying GRPCClientAgent connected to 202.92.108.228:57021
INFO:NodeConnection 202.92.108.228:57021:Destroying NodeConnection
INFO:NodeConnection 202.92.108.228:57021:Destroyed NodeConnection
INFO:clientFactory:Destroyed GRPCClientAgent connected to 202.92.108.228:57021
INFO:ConnectionForward 202.92.108.228:57021:Stopped Connection Forward
INFO:ConnectionForward 202.92.108.228:57021:Starting Connection Forward
INFO:ConnectionForward 202.92.108.228:57021:Starting Connection Forward
INFO:ConnectionForward 202.92.108.228:57021:Starting Connection Forward
...
// These `Starting Connection Forward` logs just keep going and don't stop

@emmacasolin
Contributor Author

emmacasolin commented Jul 14, 2022

I actually can't replicate these logs when just simulating the setup using local agents - the continuous logs (both the NodeConnectionManager and ConnectionForward ones) only seem to appear on a deployed agent.

To clarify, my local setup just uses 127.0.0.1 for all agents; however, maybe it would be worth testing with namespaces?

@emmacasolin
Contributor Author

I think this issue is actually not about the number or frequency of connection attempts, because these are easily configurable/adjustable. I think what this is really about is our log messages not being very clear. It doesn't matter what the purpose of a node connection is: the log for an attempted node connection always looks the same, so it can be very hard to 1) debug what part of the code is actually running at any given moment and 2) know what tasks are occurring in the background.

I think a couple of things need to happen here:

  1. If a connection is not the result of something that was directly specified by the user (e.g. tasks coming from an asynchronous queue), it should not be logged by default. While these logs are very useful when debugging, for a naive user, or even someone familiar with the code, they can be incredibly confusing if they're coming from something you didn't directly ask to happen.
  2. The `withConnF calling function with connection to...` log message should be made a little more descriptive if possible - it would be good to include the function being called, although it's often just an arrow function.

Possibly unrelated but I've also seen this error a couple of times - maybe too many connections at once?

(node:843042) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 message listeners added to [UTP]. Use emitter.setMaxListeners() to increase limit

@tegefaulkes
Contributor

For 1. I think it's still useful to know when connections are happening, background or not, but maybe we can reduce the noise a fair bit: just log that we're starting a node connection or a proxy connection, and log connection failures. Any extra information could fall under a debug level?

For 2. it would be useful to know what's getting called, but I don't know what useful information to log out. Keep in mind that the withConnF usage usually looks like this:

nodeConnectionManager.withConnF(
    targetNodeId,
    async connection => {
        const client = connection.getClient();
        await client.doThing(Message);
    },
);
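
One low-effort option, sketched here rather than taken from the real NodeConnectionManager API, is to let callers pass a short label that gets included in that log line; the interface names below are illustrative stand-ins:

// Sketch only: ConnectionLike, ConnectionManagerLike and withConnFLabelled
// are hypothetical; the label (e.g. 'syncNodeGraph', 'refreshBucket') is the
// only change over the existing log message.
interface ConnectionLike {
  getClient(): unknown;
}

interface ConnectionManagerLike {
  withConnF<T>(
    targetNodeId: string,
    f: (connection: ConnectionLike) => Promise<T>,
  ): Promise<T>;
}

async function withConnFLabelled<T>(
  manager: ConnectionManagerLike,
  targetNodeId: string,
  label: string,
  f: (connection: ConnectionLike) => Promise<T>,
): Promise<T> {
  console.info(`withConnF [${label}] calling function with connection to ${targetNodeId}`);
  return manager.withConnF(targetNodeId, f);
}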

@CMCDragonkai
Member

CMCDragonkai commented Jul 15, 2022

Regarding the warning. It seems the default number of listeners is 10: https://nodejs.org/api/events.html#eventsdefaultmaxlisteners.

So if you add more than 10, it starts warning about that. (Does it prevent the additional listener? Is it a soft limit or a hard limit?)

Which event emitter is this? Is it the event bus? We should be setting a default number of maximum listeners... but also, if we are going to allow it to be unbounded, then we must control that boundary so that we don't end up with a memory leak.
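
For reference, this is a soft limit: Node still registers listeners past the threshold and only emits a process warning. A minimal sketch of the relevant knobs, assuming the offending emitter is a plain EventEmitter:

import { EventEmitter } from 'events';

const emitter = new EventEmitter();

// The default threshold is events.defaultMaxListeners (10). Raise it on a
// specific emitter if a high listener count is legitimate (e.g. many
// concurrent connections sharing one socket)...
emitter.setMaxListeners(64);

// ...or pass 0 for unlimited, which silences the warning but gives up leak
// detection for this emitter:
// emitter.setMaxListeners(0);

// The warning surfaces as a process warning and can be observed directly:
process.on('warning', (warning) => {
  if (warning.name === 'MaxListenersExceededWarning') {
    console.warn('Possible listener leak:', warning.message);
  }
});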

@CMCDragonkai
Member

These log messages should be removed. This is because the whole idea is that we are memoising the node connections. If the connection already exists, we re-use it. If it doesn't, we start it up. Log messages indicating "async lifecycles" are useful, and that's why we log them out. So it makes sense to log when we are creating a new node connection. It doesn't make sense to log when we are just checking whether a node connection already exists and reusing it.

In the future, a tracing system can be introduced to handle both async object lifecycles AND async function scope lifecycles. But for now these logs should just be removed.

At the same time, @tegefaulkes indicates that there are a lot of connection requests being made to the same seed node over and over, due to the refresh buckets.

We can add some heuristics to "coalesce" the result set of the refresh buckets operation. Basically, if we have asked the seed node for some nodes using findNode, the number of nodes returned is less than 20, and we have called it within the last 30-60 minutes, then the result set is likely going to be the same. In such a case we can just do nothing in our refresh bucket operations.
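
A sketch of that heuristic, with hypothetical names (findNode, resetRefreshBucketTimers) standing in for whatever the node graph and task queue actually expose:

// Sketch only: if a findNode returns fewer than nodeBucketLimit (20) nodes,
// we have likely already seen the whole visible network, so re-querying the
// other buckets soon is unlikely to discover anything new.
const nodeBucketLimit = 20;
const coalesceWindowMs = 60 * 60 * 1000; // 30-60 minutes; 60 used here
let lastSmallResultAt: number | undefined;

async function refreshBucketCoalesced(
  bucketIndex: number,
  findNode: (bucketIndex: number) => Promise<Array<string>>,
  resetRefreshBucketTimers: () => void,
): Promise<void> {
  // Skip the query entirely if a recent findNode already showed the network
  // is smaller than a single bucket.
  if (
    lastSmallResultAt !== undefined &&
    Date.now() - lastSmallResultAt < coalesceWindowMs
  ) {
    return;
  }
  const nodes = await findNode(bucketIndex);
  if (nodes.length < nodeBucketLimit) {
    lastSmallResultAt = Date.now();
    // The whole visible network fits in one result set, so push back the
    // delays on every other refreshBucket task as well.
    resetRefreshBucketTimers();
  }
}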

Furthermore, we should probably add some random jitter to the refresh bucket TTL. Right now there's a thundering herd occurring 1 hr after the first sync node graph, where there will be 256 (255?) refreshBucket tasks doing a findNode operation. A random jitter can help stagger this out a bit.

Not sure about the event emitter problem. This may be less of a problem if we change to using EventTarget in #444.

@CMCDragonkai
Member

There are only 2 uses of EventEmitter atm:

»» ~/Projects/Polykey/src
 ♖ ag 'EventEmitter'                                                                                        (feature-crypto) pts/2 14:04:05
events/EventBus.ts
1:import { EventEmitter } from 'events';
3:class EventBus extends EventEmitter {
6:  constructor(...args: ConstructorParameters<typeof EventEmitter>) {
8:    // EventEmitter's captureRejections option is only accessible through a private symbol
37:      // This matches the behaviour of EventEmitter which uses `process.nextTick`
39:        if (typeof this[EventEmitter.captureRejectionSymbol] === 'function') {
40:          this[EventEmitter.captureRejectionSymbol](e, event, ...args);

grpc/utils/FlowCountInterceptor.ts
6:import EventEmitter from 'events';
18:class FlowCountInterceptor extends EventEmitter {

It's not likely that the EventBus is the one resulting in the warning.

But the FlowCountInterceptor might be. This was intended to allow us to have asynchronous interceptors... This may be an issue. We may wish to change this to EventTarget.

We should also verify that there's no memory leak here...
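
If FlowCountInterceptor does turn out to be the source, a rough sketch of what an EventTarget-based version could look like (not the current implementation; EventTarget has no listener-count warning, though it silently deduplicates identical listeners):

// Sketch only: the real FlowCountInterceptor carries more state than this.
class FlowCountEvent extends Event {
  constructor(public readonly count: number) {
    super('flowCount');
  }
}

class FlowCounter extends EventTarget {
  protected count: number = 0;

  public increment(): void {
    this.count += 1;
    // dispatchEvent runs listeners synchronously, much like EventEmitter.emit,
    // but there is no MaxListenersExceededWarning equivalent.
    this.dispatchEvent(new FlowCountEvent(this.count));
  }
}

const counter = new FlowCounter();
counter.addEventListener('flowCount', (event) => {
  console.log('flow count is now', (event as FlowCountEvent).count);
});
counter.increment();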

@CMCDragonkai
Member

@tegefaulkes if you can figure out a nice heuristic to limit the refreshBucket chatter for a small network, can you post it here and try it out?

@tegefaulkes
Contributor

> There are only 2 uses of EventEmitter atm: [...] We should also verify that there's no memory leak here...

I think this is reasonably separate from this issue. A new issue should be created for it.

@tegefaulkes
Contributor

I've added tasks to the issue description. Overall this should take 1-2 hours to complete.

tegefaulkes added a commit that referenced this issue Sep 15, 2022
Removing excessive logging for using connections. We don't need 3 log messages for each time we use an existing connection.

Adding 'jitter' or spacing to the `refreshBuckets` delays so that they don't run all at once. This is implemented with a `refreshBucketDelaySpread` parameter that specifies the multiple of the delay to spread across. Defaults to 0.5 for 50%.

Adding a 'heuristic' to `refreshBucket` to prevent it from contacting the same nodes repeatedly. Currently this is just a check in `getClosestGlobalNodes` where, if we find fewer than `nodeBucketLimit` nodes, we reset the timer on all `refreshBucket` tasks.

Adding tests for checking the spread of `refreshBucket` delays, and another test for resetting the timer on `refreshBucket` tasks if a `findNode` finds fewer than 20 nodes.

#415
tegefaulkes added a commit that referenced this issue Sep 21, 2022

tegefaulkes added a commit that referenced this issue Sep 21, 2022
@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Jul 10, 2023