Prevent PolykeyAgent crashes from connection failures
#592
Comments
There are 2 magic numbers being used inside Once we receive a particular code for a connection, we should be turning that into an exception. The reason we don't also have So the conversion to exceptions needs to be done on the outside. So enums are not enough; in fact, it should be done with regular exceptions. We need an exception for each of the relevant codes, as well as an exception for unknown codes.
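As a rough illustration of the idea in the comment above, a received connection code could be mapped to an exception class, with a fallback for codes that are not recognised. This is only a sketch: the class names, the `FORCE_CLOSE_CODE` value, and the `errorFromConnectionCode` helper are all hypothetical and do not come from the Polykey code base.

```typescript
// Hypothetical base class for errors derived from a received connection code.
class ErrorConnectionCode extends Error {
  readonly code: number;
  constructor(message: string, code: number) {
    super(message);
    this.code = code;
  }
}

// One exception per relevant code (hypothetical name).
class ErrorConnectionForceClose extends ErrorConnectionCode {}

// Fallback exception for any code we do not recognise (hypothetical name).
class ErrorConnectionUnknownCode extends ErrorConnectionCode {}

// Assumed value; the issue does not state the actual codes in use.
const FORCE_CLOSE_CODE = 1;

// Converts a received code into an exception "on the outside" of the connection.
function errorFromConnectionCode(code: number): ErrorConnectionCode {
  switch (code) {
    case FORCE_CLOSE_CODE:
      return new ErrorConnectionForceClose('Connection was force closed', code);
    default:
      return new ErrorConnectionUnknownCode(
        `Connection closed with unknown code ${code}`,
        code,
      );
  }
}
```

Keeping the conversion in one function means callers never compare against raw numbers, and an unexpected code still surfaces as a typed error rather than being silently ignored.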
We have not observed a case where a connection failure results in a crash. However, the connection failure handling is messy and needs refactoring.
I'm looking into this now.
There is definitely a problem with concurrently establishing a connection. I added locking and logic to handle a connection being created concurrently, but the solution is not up to the task. In essence we have a connection and a shadow connection doing the same thing and punching each other in the face. Both sides cancel the other's connection, and we end up with a connection in the connection map that has already failed. So it seems clean-up is failing in this case as well, but that's a side issue. We need a way for both sides to determine which connection to keep without coordinating. The decision can be made deterministically on both sides by comparing the connection IDs, where the lower ID wins.
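The "lower ID wins" rule described above could look something like the following sketch. It assumes that connection IDs are byte arrays and that both nodes observe the same ID for a given connection; neither assumption is confirmed by this issue. The names `ConnectionId`, `compareIds`, and `selectConnectionToKeep` are hypothetical.

```typescript
// Assumed representation: the issue only says "connection Id".
type ConnectionId = Uint8Array;

// Lexicographic byte comparison: negative if a < b, positive if a > b, 0 if equal.
function compareIds(a: ConnectionId, b: ConnectionId): number {
  const len = Math.min(a.length, b.length);
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  // When one ID is a prefix of the other, the shorter one sorts first.
  return a.length - b.length;
}

// Picks which of two concurrently established connections to keep.
// The lower ID wins, so the result does not depend on argument order
// as long as the two IDs differ.
function selectConnectionToKeep<T extends { connectionId: ConnectionId }>(
  forward: T,
  reverse: T,
): T {
  return compareIds(forward.connectionId, reverse.connectionId) <= 0
    ? forward
    : reverse;
}
```

Because the comparison depends only on the two IDs, each node reaches the same answer independently, even though each one labels a different connection as its "forward" one.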
From the forward connection perspective, both sides establish a successful connection and add it to the connection map before handling any reverse connection. So the decision about which connection to keep needs to be made when handling the reverse connection on both sides. Without communicating, the nodes need to select the same connection to keep. To do this I was going to compare the
Understandable since the
Specification
Refer to this comment: #551 (comment)
In the service of making the PolykeyAgent more robust, we need to prevent connection failures from causing the PolykeyAgent to crash. This means that any connection or stream running in the background needs to be handled if it fails. Any errors need to be caught, handled, and possibly logged depending on their severity. There are some places where this needs to be handled:
NodeConnectionManager needs to handle any connection failures gracefully. Connection failures that happen internally should not bubble up to the top.
RPC should handle stream failures gracefully, so errors shouldn't come from the streams themselves.
NodeConnectionManager methods such as getClosestNodes and findNodes should not throw anything by design. These need to be checked.
nodeConnectionManager.withConnF needs to deal with connection failures at any time. Remember that a connection can fail at any time for any reason. Just having access to a connection doesn't mean it's always safe to use.
On reflection, the stream factory is throwing and we need to handle that condition. That could be the long and short of it.
We need to add more testing for race conditions, where normal operations are done concurrently with the connection ending. This should be a NodeConnection and NodeConnectionManager level test.
In two places we are using a hard-coded magic number as an error code for closing a QUICConnection. We need to make a nodes domain enum for QUICConnection error codes.
On reflection, there is only 1 place where a QUICConnection is force destroyed directly, and that's the NodeConnection.destroy method. Two new enums need to be created in place of the hard-coded reason and code used there.
No new errors are made for this. We don't throw any error locally in this case, and quic has its own errors for connections that ended with an error.
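The two enums mentioned above might be shaped roughly as follows. This is a sketch only: the enum names, the member name, the numeric value, and the reason text are all placeholders, since the issue does not state the actual values. `ForceStopParams` is a stand-in for whatever the underlying connection accepts when it is stopped, not the real QUICConnection API.

```typescript
// Hypothetical enum for the error code sent when force destroying a connection.
enum ConnectionErrorCode {
  ForceClose = 1, // assumed value
}

// Hypothetical enum for the matching human-readable reason.
enum ConnectionErrorReason {
  ForceClose = 'NodeConnection is forcing destruction', // assumed text
}

// Stand-in for the parameters passed when stopping the underlying connection.
type ForceStopParams = { errorCode: number; reason: string };

// Single source of truth for the code and reason used by forced stops,
// replacing literals scattered through the destroy logic.
function forceStopParams(): ForceStopParams {
  return {
    errorCode: ConnectionErrorCode.ForceClose,
    reason: ConnectionErrorReason.ForceClose,
  };
}
```

With this arrangement, any place that force stops a connection reads the code and reason from the enums, so the values only need to change in one spot.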
Additional context
Tasks
NCM should never throw when a connection fails.
enum to get the code and reason. All forced connection stops should use this.
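One way to approach the first task is to wrap connection usage so that failures are returned as values instead of being thrown. The sketch below is not the actual withConnF implementation; `Result`, `tryWithConnection`, `acquire`, `use`, and `onError` are hypothetical names introduced for illustration.

```typescript
// A failure is reported as a value rather than thrown to the caller.
type Result<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Acquires a connection, runs an operation on it, and catches any failure.
// `acquire` and `use` may each reject at any time, for example if the
// stream factory throws or the connection ends mid-operation.
async function tryWithConnection<C, T>(
  acquire: () => Promise<C>,
  use: (conn: C) => Promise<T>,
  onError: (e: unknown) => void = () => {},
): Promise<Result<T>> {
  try {
    const conn = await acquire();
    return { ok: true, value: await use(conn) };
  } catch (e) {
    // Log or record the failure according to severity, then swallow it.
    onError(e);
    return { ok: false, error: e };
  }
}
```

A background caller can then branch on `ok` and decide whether to retry, skip the node, or log, without any rejection escaping to the top level.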