Prevent `PolykeyAgent` crashes from connection failures #592

tegefaulkes · 2023-10-19T02:24:38Z

Specification

reffer to this comment #551 (comment)

In the service of making the PolykeyAgent more robust, we need to prevent connection failures from causing the PolykeyAgent to crash. This means that any connection or stream done in the background needs to be handled if it fails. Any errors need to be caught handled and possibly logged depending on the level of severity. There are some places where this needs to be handled.

The NodeConnectionManager needs to handle any connection failures gracefully. Any connection failures that happen internally should not bubble up to the top.
Stream handling needs to handle any errors. I think the RPC should be gracefully handing stream failures so errors shouldn't come from the streams themselves.
Any connection related methods in the NodeConnectionManager such as getClosestNodes and findNodes should not throw anything by design. These need to be checked.
Anything that uses nodeConnectionManager.withConnF needs to deal with connection failures at any time. Remember a connection can fail at any time for any reason. Just having access to a connection doesn't mean it's always safe to use.

On reflection on the problem. The stream factory is throwing and we need to handle that condition. That could be the long and short of it.

We need to add more testing for race conditions. Where normal operations are done concurrently with the connection ending. This should be a NodeConnection and NodeConnectionManager level test.

In two places we are using a hard coded magic number as an error code for closing a QUICConnection. We need to make a nodes domain enum for QUICConnection error codes.

On reflection there is only 1 place where a QUICConnection is force destroyed directly and that's the NodeConnection.destroy method. Two new enums need to be created in place of the hard coded reason and code used here.

enum ConnectionErrorCode {
  ForceClose = 1,
}

enum ConnectionErrorReason {
  ForceClose = 'NodeConnection is forcing destruction',
}

No new errors are made for this, We don't throw any error locally in this case and quic has it's own errors for connections that ended with an error.

Additional context

Related 6th Testnet Deployment #551
Related 6th Testnet Deployment #551 (comment)
Fixed by Addressing crashes from failing background connections #609

Tasks

1. Prevent any connection failures in the background from bubbling up to the top of the program. The NCM should never throw when a connection fails.
2. expand tests to include connection failures and concurrent connection failures.
3. Remove magic number error codes from the nodes domain and use an enum to get the code and reason. all forced connection stops should use this.

The text was updated successfully, but these errors were encountered:

CMCDragonkai · 2023-10-23T04:17:11Z

There are 2 magic numbers being used inside NodeConnectionManager, these 2 numbers (42, 1) is being used as a connection error code. Unlike streams, we don't immediately map all stream codes to exceptions and back. Connection error codes are raw codes and they need to be replaced with some sort of enum.

Once we receive a particular code for a connection, we should be turning that into an exception.

The reason we don't also have reasonToCode and codeToReason for connections is because connections do not have a natural way of throwing errors, whereas streams based on webstreams do.

So the conversion to exceptions needs to be done on the outside. So enums is not enough, in fact it should be done with regular exceptions. We need exceptions for each of relevant codes, as well, as an exception for unknown codes.

CMCDragonkai · 2023-10-23T04:18:48Z

We not observed a case where a connection failure is resulting a crash. However the connection failure handling is messy and needs refactoring.

tegefaulkes · 2023-10-25T04:15:33Z

I'm Looking into this now.

tegefaulkes · 2023-10-25T07:37:55Z

There is definitely a problem with concurrently establishing a connection. While I added locking and logic to handle a connection being concurrently created. The solution is not up to the task.

In essence we have a connection and a shadow connection doing the same thing and punching each other in the face. Both sides cancel the other and we end up with a connection in the connection map that has already failed. So it seems clean up in this case is failing as well but that's a side issue.

We need a way for both sides to determine which connection to keep, without them coordinating. A decision can be made deterministically on both sides by comparing the connection Id where the lower ID wins.

tegefaulkes · 2023-10-26T04:22:12Z

From the forward connection perspective, both sides establish a successful connection and add it to the connection map before handling any reverse connection.

So the decision which connection we keep needs to be made when handling the reverse connection on both sides. Without the nodes communicating they need to select the same connection to keep. To do this I was going to compare the ConnectionIds of the connections and select one of the deterministically. To do this, bot nodes need the same connection Ids for each connection. The problem is, when logging out the connection ids I get 4 different IDs...

nodeA:handling reverse connection
nodeA:existing connection G0EpF7c_vZI01NZ9LpMwOuHTfuk
nodeA:new connection IUlCGdmBNY54StkXA31KrfmuSBE
nodeB:handling reverse connection
nodeB:existing connection du9EryP--4GNdCVL4EIPPSFELj4
nodeB:new connection NTIjRdkq5j3MX1mrAElcjvBHuvo

Understandable since the QUICConnection uses the scid for the connection id. But there are two, the scid and dcid. These are reversed on the complementary connection. I need access to both the scid and dcid of the QUICConnction to compare them properly.

tegefaulkes added the development Standard development label Oct 19, 2023

tegefaulkes self-assigned this Oct 19, 2023

tegefaulkes mentioned this issue Oct 19, 2023

6th Testnet Deployment #551

Closed

This was referenced Oct 25, 2023

Addressing crashes from failing background connections #609

Merged

CLI Beta Launch MatrixAI/Polykey-CLI#40

Closed

tegefaulkes mentioned this issue Oct 27, 2023

feat: expose more connectionId information MatrixAI/js-quic#73

Merged

8 tasks

tegefaulkes closed this as completed in #609 Nov 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent `PolykeyAgent` crashes from connection failures #592

Prevent `PolykeyAgent` crashes from connection failures #592

tegefaulkes commented Oct 19, 2023 •

edited

CMCDragonkai commented Oct 23, 2023

CMCDragonkai commented Oct 23, 2023

tegefaulkes commented Oct 25, 2023

tegefaulkes commented Oct 25, 2023

tegefaulkes commented Oct 26, 2023

Prevent PolykeyAgent crashes from connection failures #592

Prevent PolykeyAgent crashes from connection failures #592

Comments

tegefaulkes commented Oct 19, 2023 • edited

Specification

Additional context

Tasks

CMCDragonkai commented Oct 23, 2023

CMCDragonkai commented Oct 23, 2023

tegefaulkes commented Oct 25, 2023

tegefaulkes commented Oct 25, 2023

tegefaulkes commented Oct 26, 2023

Prevent `PolykeyAgent` crashes from connection failures #592

Prevent `PolykeyAgent` crashes from connection failures #592

tegefaulkes commented Oct 19, 2023 •

edited