Clean up StreamObserver on failed openStream, attempt to solve No Handlers Found exception #250
Conversation
Whenever the `CommandChannelImpl` and `QueryChannelImpl` try to open a stream with the `CommandProviderOutbound` and `QueryProviderOutbound` respectively, we should catch exceptions. If we do not, the channel implementations get stuck in a faulty state, where they created their `StreamObserver`s but weren't able to connect them. This situation can occur if an AxonServerConnector-Java user dispatches a command/query while all connections are down. By catching the exception and clearing the `StreamObserver`s, we resolve this issue. #bug/stream-cleanup-on-opening-failure
@smcvb I am testing this locally with the setup that failed before. I'm afraid the problem hasn't been resolved. I do see the additional log line:
It keeps looping over this error on every connect attempt. The root cause is a scheduler that is shut down:
We will have to look into that root cause to fix the re-registering option.
Whether exceptions are thrown or reported through a callback, they should result in the same handling logic.
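As a rough illustration of that principle (all names below are hypothetical, not the connector's actual API): a synchronous throw while opening the stream and an asynchronous `onError` callback should both converge on one failure handler.

```java
import io.grpc.stub.StreamObserver;
import java.util.function.Consumer;
import java.util.function.Function;

// Both failure paths end up in the same handler: an exception thrown while
// opening the stream, and an error reported later through onError.
final class StreamOpening {

    static <IN, OUT> StreamObserver<OUT> open(
            Function<StreamObserver<IN>, StreamObserver<OUT>> openStream,
            Consumer<Throwable> failureHandler) {
        try {
            return openStream.apply(new StreamObserver<IN>() {
                @Override
                public void onNext(IN message) {
                    // handle inbound messages
                }

                @Override
                public void onError(Throwable t) {
                    failureHandler.accept(t); // asynchronous failure path
                }

                @Override
                public void onCompleted() {
                    // stream closed by the other side
                }
            });
        } catch (RuntimeException e) {
            failureHandler.accept(e);         // synchronous failure path
            return null;
        }
    }
}
```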
@abuijze and I had a look at the changes made, diving a bit deeper into it all. @MORLACK, if you could give this new format another try, that would be very helpful!
I found a discrepancy between gRPC's `isShutdown()` and `getState()` behavior. I am working on a fix to improve the internal connection tracking.
While `isShutdown()` returns true immediately upon a `shutdown()` call, `getState()` doesn't return `SHUTDOWN` immediately. The latter state is modified on a scheduled task. Additional calls to `isShutdown()` were added to prevent starting calls on a channel that has been shut down already.
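A minimal sketch of such a guard, assuming a hypothetical `ChannelGuard` helper that is not part of the connector itself:

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

// Hypothetical guard illustrating the discrepancy described above:
// isShutdown() flips to true immediately after shutdown(), while getState()
// may still report a stale, non-SHUTDOWN state for a short while.
final class ChannelGuard {

    static boolean safeToStartCall(ManagedChannel channel) {
        if (channel == null || channel.isShutdown()) {
            // Trust isShutdown() first, so no call is started on a channel
            // that has been shut down already.
            return false;
        }
        // 'false' avoids triggering a connection attempt just by reading the state.
        return channel.getState(false) != ConnectivityState.SHUTDOWN;
    }
}
```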
Each attempt to retrieve a channel would schedule an attempt to connect if it wasn't in the ready state yet. However, when disconnected, this means that each attempt to perform a call will result in a scheduled task to verify connections. This commit changes that to only connect a channel when it is created for the first time. If it fails to connect, the channel will itself schedule a task to verify the connection status in a certain timeframe.
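A rough sketch of that behavior change, with hypothetical names and a hypothetical delay that are not taken from the actual commit:

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Connect once when the wrapper is created; if the channel is not READY,
// the wrapper reschedules its own verification instead of every outgoing
// call scheduling a new connect task.
final class SelfVerifyingChannel {

    private final ManagedChannel channel;
    private final ScheduledExecutorService scheduler;

    SelfVerifyingChannel(ManagedChannel channel, ScheduledExecutorService scheduler) {
        this.channel = channel;
        this.scheduler = scheduler;
        verifyConnection(); // only triggered on creation, not on every call
    }

    private void verifyConnection() {
        if (channel.isShutdown()) {
            return; // never reschedule against a channel that is already gone
        }
        // 'true' asks gRPC to attempt a connection if the channel is idle.
        if (channel.getState(true) != ConnectivityState.READY) {
            scheduler.schedule(this::verifyConnection, 5, TimeUnit.SECONDS);
        }
    }

    ManagedChannel channel() {
        return channel;
    }
}
```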
Kudos, SonarCloud Quality Gate passed!
@abuijze I just tested this locally in my hotel demo version that broke it. It fully recovers now!
Whenever the `CommandChannelImpl` and `QueryChannelImpl` try to open a stream with their `CommandProviderOutbound` and `QueryProviderOutbound` respectively, we should catch exceptions. Exceptions may occur when, for example, the connection to the Axon Server instance is down or faulty at that moment in time.
If we do not catch these exceptions, the channel implementations get stuck in a faulty state. In this state, they've already created their `StreamObserver` instance. As the `StreamObserver` instance is used to deduce whether the connection is live, a following reconnect will be ignored. A consequence of this is that the command and query handlers never get registered on a reconnect, resulting in the `NoHandlerForCommandException` and `NoHandlerForQueryException`.
A user of the axonserver-connector-java module may reach this problematic state when all Axon Server connections are down while it dispatches a command or query.
This is particularly easy to replicate when using Axon Server Standard edition, as only a single connection is present.
To resolve the above, this pull request catches the exceptions on the `openStream` invocations towards gRPC. If an exception is caught, the constructed `StreamObserver` is removed, thus resolving the faulty state. Additionally, several debug and trace statements are introduced to ease future debugging.
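To make the idea concrete, here is a minimal sketch of the clean-up around a hypothetical holder for the outbound stream; the names below are illustrative and not the connector's actual fields or methods:

```java
import io.grpc.stub.StreamObserver;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// If opening the outbound stream fails, clear the stored StreamObserver so the
// channel no longer looks "connected"; the next reconnect then registers the
// command/query handlers again instead of being ignored.
final class OutboundStreamHolder<T> {

    private final AtomicReference<StreamObserver<T>> outbound = new AtomicReference<>();

    void openStream(Supplier<StreamObserver<T>> opener) {
        try {
            outbound.set(opener.get());
        } catch (RuntimeException e) {
            // Clearing the reference resolves the faulty state described above.
            outbound.set(null);
        }
    }

    // Used to deduce whether the connection is live; a cleared reference means
    // a following reconnect will not be skipped.
    boolean isConnected() {
        return outbound.get() != null;
    }
}
```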
Lastly, adding a test case to the Integration Test suite was difficult, as ToxiProxy only breaks the connection for a while instead of causing a harsh connection breakdown. Due to this, it's not trivial to reach the above-described state, and hence no test cases were introduced.