Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent PolykeyAgent crashes from connection failures #592

Closed
3 tasks done
tegefaulkes opened this issue Oct 19, 2023 · 5 comments · Fixed by #609
Closed
3 tasks done

Prevent PolykeyAgent crashes from connection failures #592

tegefaulkes opened this issue Oct 19, 2023 · 5 comments · Fixed by #609
Assignees
Labels
development Standard development

Comments

@tegefaulkes
Copy link
Contributor

tegefaulkes commented Oct 19, 2023

Specification

reffer to this comment #551 (comment)

In the service of making the PolykeyAgent more robust, we need to prevent connection failures from causing the PolykeyAgent to crash. This means that any connection or stream done in the background needs to be handled if it fails. Any errors need to be caught handled and possibly logged depending on the level of severity. There are some places where this needs to be handled.

  1. The NodeConnectionManager needs to handle any connection failures gracefully. Any connection failures that happen internally should not bubble up to the top.
  2. Stream handling needs to handle any errors. I think the RPC should be gracefully handing stream failures so errors shouldn't come from the streams themselves.
  3. Any connection related methods in the NodeConnectionManager such as getClosestNodes and findNodes should not throw anything by design. These need to be checked.
  4. Anything that uses nodeConnectionManager.withConnF needs to deal with connection failures at any time. Remember a connection can fail at any time for any reason. Just having access to a connection doesn't mean it's always safe to use.

On reflection on the problem. The stream factory is throwing and we need to handle that condition. That could be the long and short of it.

We need to add more testing for race conditions. Where normal operations are done concurrently with the connection ending. This should be a NodeConnection and NodeConnectionManager level test.

In two places we are using a hard coded magic number as an error code for closing a QUICConnection. We need to make a nodes domain enum for QUICConnection error codes.

On reflection there is only 1 place where a QUICConnection is force destroyed directly and that's the NodeConnection.destroy method. Two new enums need to be created in place of the hard coded reason and code used here.

enum ConnectionErrorCode {
  ForceClose = 1,
}

enum ConnectionErrorReason {
  ForceClose = 'NodeConnection is forcing destruction',
}

No new errors are made for this, We don't throw any error locally in this case and quic has it's own errors for connections that ended with an error.

Additional context

Tasks

  • 1. Prevent any connection failures in the background from bubbling up to the top of the program. The NCM should never throw when a connection fails.
  • 2. expand tests to include connection failures and concurrent connection failures.
  • 3. Remove magic number error codes from the nodes domain and use an enum to get the code and reason. all forced connection stops should use this.
@tegefaulkes tegefaulkes added the development Standard development label Oct 19, 2023
@tegefaulkes tegefaulkes self-assigned this Oct 19, 2023
@CMCDragonkai
Copy link
Member

There are 2 magic numbers being used inside NodeConnectionManager, these 2 numbers (42, 1) is being used as a connection error code. Unlike streams, we don't immediately map all stream codes to exceptions and back. Connection error codes are raw codes and they need to be replaced with some sort of enum.

Once we receive a particular code for a connection, we should be turning that into an exception.

The reason we don't also have reasonToCode and codeToReason for connections is because connections do not have a natural way of throwing errors, whereas streams based on webstreams do.

So the conversion to exceptions needs to be done on the outside. So enums is not enough, in fact it should be done with regular exceptions. We need exceptions for each of relevant codes, as well, as an exception for unknown codes.

@CMCDragonkai
Copy link
Member

We not observed a case where a connection failure is resulting a crash. However the connection failure handling is messy and needs refactoring.

@tegefaulkes
Copy link
Contributor Author

I'm Looking into this now.

@tegefaulkes
Copy link
Contributor Author

There is definitely a problem with concurrently establishing a connection. While I added locking and logic to handle a connection being concurrently created. The solution is not up to the task.

In essence we have a connection and a shadow connection doing the same thing and punching each other in the face. Both sides cancel the other and we end up with a connection in the connection map that has already failed. So it seems clean up in this case is failing as well but that's a side issue.

We need a way for both sides to determine which connection to keep, without them coordinating. A decision can be made deterministically on both sides by comparing the connection Id where the lower ID wins.

@tegefaulkes
Copy link
Contributor Author

From the forward connection perspective, both sides establish a successful connection and add it to the connection map before handling any reverse connection.

So the decision which connection we keep needs to be made when handling the reverse connection on both sides. Without the nodes communicating they need to select the same connection to keep. To do this I was going to compare the ConnectionIds of the connections and select one of the deterministically. To do this, bot nodes need the same connection Ids for each connection. The problem is, when logging out the connection ids I get 4 different IDs...

nodeA:handling reverse connection
nodeA:existing connection G0EpF7c_vZI01NZ9LpMwOuHTfuk
nodeA:new connection IUlCGdmBNY54StkXA31KrfmuSBE
nodeB:handling reverse connection
nodeB:existing connection du9EryP--4GNdCVL4EIPPSFELj4
nodeB:new connection NTIjRdkq5j3MX1mrAElcjvBHuvo

Understandable since the QUICConnection uses the scid for the connection id. But there are two, the scid and dcid. These are reversed on the complementary connection. I need access to both the scid and dcid of the QUICConnction to compare them properly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development Standard development
Development

Successfully merging a pull request may close this issue.

2 participants