
Clean up StreamObserver on failed openStream attempt to solve No Handlers Found exception #250

Merged

Conversation

@smcvb (Contributor) commented Oct 28, 2022

Whenever the CommandChannelImpl and QueryChannelImpl try to open a stream with their CommandProviderOutbound and QueryProviderOutbound respectively, we should catch exceptions.
Exceptions may occur when, for example, the connection to the Axon Server instance is down or faulty at that moment in time.

If we do not catch these exceptions, the channel implementations get stuck in a faulty state.
In this state, they have already created their StreamObserver instance.
As the StreamObserver instance is used to deduce whether the connection is live, a subsequent reconnect will be ignored.
As a consequence, the command and query handlers never get registered on a reconnect, resulting in the NoHandlerForCommandException and NoHandlerForQueryException.

A user of the axonserver-connector-java module may reach this problematic state when all Axon Server connections are down while the application dispatches a command or query.
This is particularly easy to replicate when using Axon Server Standard Edition, as only a single connection is present.

To resolve the above, this pull request catches the exceptions on the openStream invocations towards gRPC.
If an exception is caught, the constructed StreamObserver is removed, thus resolving the faulty state.
In addition, several debug and trace statements are introduced to ease future debugging.
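
For illustration, a rough sketch of the idea follows (the class, field, and method names are made up for this example; the actual CommandChannelImpl differs):

```java
import java.util.concurrent.atomic.AtomicReference;

import io.axoniq.axonserver.grpc.command.CommandProviderInbound;
import io.axoniq.axonserver.grpc.command.CommandProviderOutbound;
import io.axoniq.axonserver.grpc.command.CommandServiceGrpc;
import io.grpc.stub.StreamObserver;

class CommandStreamOpener {

    // Holds the outbound stream; a non-null value is interpreted as "connection is live".
    private final AtomicReference<StreamObserver<CommandProviderOutbound>> outboundStream =
            new AtomicReference<>();

    void open(CommandServiceGrpc.CommandServiceStub stub,
              StreamObserver<CommandProviderInbound> inboundObserver) {
        try {
            // openStream can throw, e.g. when the connection to Axon Server is down or faulty.
            outboundStream.set(stub.openStream(inboundObserver));
        } catch (Exception e) {
            // Clear the half-initialized observer again, so a later reconnect is not ignored
            // and the command handlers get re-registered.
            outboundStream.set(null);
        }
    }
}
```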

Lastly, adding a test case to the Integration Test suite proved difficult, as ToxiProxy only breaks the connection for a while rather than causing a hard connection breakdown.
Due to this, it is not trivial to reach the state described above, and hence no test cases were introduced.

Whenever the CommandChannelImpl and QueryChannelImpl try to open a
stream with the CommandProviderOutbound and QueryProviderOutbound
respectively, we should catch exceptions. If we do not, the channel
implementations get stuck in a faulty state, where they created their
StreamObservers, but weren't able to connect them. This situation can
occur if an AxonServerConnector-Java user dispatches a command/query
while all connections are down. By catching the exception and clearing
the StreamObservers, we resolve this issue.

#bug/stream-cleanup-on-opening-failure
@CodeDrivenMitch (Contributor)

@smcvb I am testing this locally with the setup that failed before. I'm afraid the problem hasn't been resolved.

I do see the additional log line:

2022-10-31 09:17:52.958  WARN 45477 --- [   scheduling-1] i.a.a.c.command.impl.CommandChannelImpl  : Failed trying to open stream for CommandChannel in context [default]. Cleaning up for following attempt...

io.grpc.StatusRuntimeException: UNKNOWN: Uncaught exception in the SynchronizationContext. Re-thrown.

It keeps looping over this error on every connect attempt. The root cause is a scheduler that is shut down:

org.axonframework.axonserver.connector.command.AxonServerCommandDispatchException: Exception while dispatching a command to AxonServer
	at org.axonframework.axonserver.connector.command.AxonServerCommandBus.doDispatch(AxonServerCommandBus.java:197) ~[axon-server-connector-4.6.1.jar:4.6.1]
	at org.axonframework.axonserver.connector.command.AxonServerCommandBus.dispatch(AxonServerCommandBus.java:161) ~[axon-server-connector-4.6.1.jar:4.6.1]
	at org.axonframework.commandhandling.gateway.AbstractCommandGateway.send(AbstractCommandGateway.java:76) ~[axon-messaging-4.6.1.jar:4.6.1]
	at org.axonframework.commandhandling.gateway.DefaultCommandGateway.send(DefaultCommandGateway.java:83) ~[axon-messaging-4.6.1.jar:4.6.1]
	at org.axonframework.extensions.tracing.TracingCommandGateway.lambda$doSendAndExtract$5(TracingCommandGateway.java:152) ~[axon-tracing-4.6.1.jar:4.6.1]
	at org.axonframework.extensions.tracing.TracingCommandGateway.sendWithSpan(TracingCommandGateway.java:172) ~[axon-tracing-4.6.1.jar:4.6.1]
	at org.axonframework.extensions.tracing.TracingCommandGateway.doSendAndExtract(TracingCommandGateway.java:151) ~[axon-tracing-4.6.1.jar:4.6.1]
	at org.axonframework.extensions.tracing.TracingCommandGateway.sendAndWait(TracingCommandGateway.java:118) ~[axon-tracing-4.6.1.jar:4.6.1]
	at io.axoniq.demo.hotel.booking.command.demo.AutomaticAccountCommandDispatcher.dispatch(AutomaticAccountCommandDispatcher.kt:18) ~[classes/:na]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:na]
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:na]
	at java.base/java.lang.reflect.Method.invoke(Method.java:566) ~[na:na]
	at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84) ~[spring-context-5.3.22.jar:5.3.22]
	at io.opentelemetry.javaagent.instrumentation.spring.scheduling.SpringSchedulingRunnableWrapper.run(SpringSchedulingRunnableWrapper.java:35) ~[na:na]
	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) ~[spring-context-5.3.22.jar:5.3.22]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[na:na]
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) ~[na:na]
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[na:na]
	at java.base/java.lang.Thread.run(Thread.java:829) ~[na:na]
Caused by: io.grpc.StatusRuntimeException: UNKNOWN: Uncaught exception in the SynchronizationContext. Re-thrown.
	at io.grpc.Status.asRuntimeException(Status.java:526) ~[grpc-api-1.49.0.jar:1.49.0]
	at io.grpc.internal.RetriableStream$1.uncaughtException(RetriableStream.java:75) ~[grpc-core-1.49.0.jar:1.49.0]
	at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:97) ~[grpc-api-1.49.0.jar:1.49.0]
	at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127) ~[grpc-api-1.49.0.jar:1.49.0]
	at io.grpc.internal.RetriableStream.cancel(RetriableStream.java:493) ~[grpc-core-1.49.0.jar:1.49.0]
	at io.grpc.internal.RetriableStream.start(RetriableStream.java:361) ~[grpc-core-1.49.0.jar:1.49.0]
	at io.grpc.internal.ClientCallImpl.startInternal(ClientCallImpl.java:289) ~[grpc-core-1.49.0.jar:1.49.0]
	at io.grpc.internal.ClientCallImpl.start(ClientCallImpl.java:191) ~[grpc-core-1.49.0.jar:1.49.0]
	at io.grpc.ForwardingClientCall.start(ForwardingClientCall.java:32) ~[grpc-api-1.49.0.jar:1.49.0]
	at io.opentelemetry.javaagent.shaded.instrumentation.grpc.v1_6.TracingClientInterceptor$TracingClientCall.start(TracingClientInterceptor.java:97) ~[na:na]
	at io.grpc.ForwardingClientCall.start(ForwardingClientCall.java:32) ~[grpc-api-1.49.0.jar:1.49.0]
	at io.axoniq.axonserver.connector.impl.GrpcBufferingInterceptor$AdditionalMessageRequestingCall.start(GrpcBufferingInterceptor.java:69) ~[axonserver-connector-java-4.6.3-SNAPSHOT.jar:4.6.3-SNAPSHOT]
	at io.grpc.ForwardingClientCall.start(ForwardingClientCall.java:32) ~[grpc-api-1.49.0.jar:1.49.0]
	at io.axoniq.axonserver.connector.impl.HeaderAttachingInterceptor$HeaderAttachedCall.start(HeaderAttachingInterceptor.java:68) ~[axonserver-connector-java-4.6.3-SNAPSHOT.jar:4.6.3-SNAPSHOT]
	at io.grpc.stub.ClientCalls.startCall(ClientCalls.java:341) ~[grpc-stub-1.49.0.jar:1.49.0]
	at io.grpc.stub.ClientCalls.asyncStreamingRequestCall(ClientCalls.java:332) ~[grpc-stub-1.49.0.jar:1.49.0]
	at io.grpc.stub.ClientCalls.asyncBidiStreamingCall(ClientCalls.java:119) ~[grpc-stub-1.49.0.jar:1.49.0]
	at io.axoniq.axonserver.grpc.command.CommandServiceGrpc$CommandServiceStub.openStream(CommandServiceGrpc.java:195) ~[axonserver-connector-java-4.6.3-SNAPSHOT.jar:4.6.3-SNAPSHOT]
	at io.axoniq.axonserver.connector.command.impl.CommandChannelImpl.doCreateCommandStream(CommandChannelImpl.java:172) ~[axonserver-connector-java-4.6.3-SNAPSHOT.jar:4.6.3-SNAPSHOT]
	at io.axoniq.axonserver.connector.command.impl.CommandChannelImpl.connect(CommandChannelImpl.java:152) ~[axonserver-connector-java-4.6.3-SNAPSHOT.jar:4.6.3-SNAPSHOT]
	at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.notifyWhenStateChanged(AxonServerManagedChannel.java:260) ~[axonserver-connector-java-4.6.3-SNAPSHOT.jar:4.6.3-SNAPSHOT]
	at io.axoniq.axonserver.connector.impl.ContextConnection.ensureConnected(ContextConnection.java:218) ~[axonserver-connector-java-4.6.3-SNAPSHOT.jar:4.6.3-SNAPSHOT]
	at io.axoniq.axonserver.connector.impl.ContextConnection.commandChannel(ContextConnection.java:171) ~[axonserver-connector-java-4.6.3-SNAPSHOT.jar:4.6.3-SNAPSHOT]
	at org.axonframework.axonserver.connector.command.AxonServerCommandBus.doDispatch(AxonServerCommandBus.java:178) ~[axon-server-connector-4.6.1.jar:4.6.1]
	... 21 common frames omitted
Caused by: java.util.concurrent.RejectedExecutionException: Task io.grpc.internal.SerializingExecutor@6c452cfd rejected from java.util.concurrent.ThreadPoolExecutor@58cdeb8e[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 217]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[na:na]
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355) ~[na:na]
	at io.grpc.internal.SerializingExecutor.schedule(SerializingExecutor.java:102) ~[grpc-core-1.49.0.jar:1.49.0]
	at io.grpc.internal.SerializingExecutor.execute(SerializingExecutor.java:95) ~[grpc-core-1.49.0.jar:1.49.0]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.closedInternal(ClientCallImpl.java:752) ~[grpc-core-1.49.0.jar:1.49.0]
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.closed(ClientCallImpl.java:688) ~[grpc-core-1.49.0.jar:1.49.0]
	at io.grpc.internal.RetriableStream$4.run(RetriableStream.java:498) ~[grpc-core-1.49.0.jar:1.49.0]
	at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95) ~[grpc-api-1.49.0.jar:1.49.0]
	... 42 common frames omitted

We will have to look into that root cause to fix the re-registering behavior.

Base automatically changed from bug/junit5-and-testng-together to connector-4.6.x October 31, 2022 09:16
Whether exceptions are thrown or reported through a callback, they should result in the same handling logic.
@smcvb (Contributor, Author) commented Nov 2, 2022

@abuijze and I had a look at the changes made, diving a bit deeper into it all.
We're confident the predicament occurs because gRPC does not propagate the exception on the response channel at all, thus never invoking onError.
As such, we've adjusted the changes I made earlier by invoking onError ourselves.
Doing so, we ensure the clean-up occurs, as well as a disconnect.
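
Roughly, the adjusted approach looks like this (using the same made-up names as the sketch in the description; not the actual CommandChannelImpl code):

```java
void open(CommandServiceGrpc.CommandServiceStub stub,
          StreamObserver<CommandProviderInbound> inboundObserver) {
    try {
        outboundStream.set(stub.openStream(inboundObserver));
    } catch (Exception e) {
        // gRPC does not deliver this failure on the response stream itself,
        // so signal it ourselves; the regular onError handling then performs
        // the clean-up and the disconnect.
        inboundObserver.onError(e);
    }
}
```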

@MORLACK, if you could give this new version another try, that would be very helpful!

@abuijze (Contributor) commented Nov 3, 2022

I found a discrepancy between gRPC's channel.getState(false) == SHUTDOWN and channel.isShutdown(). Whereas the latter checks a flag that is set immediately on calling channel.shutdown(), the former checks state that is updated by a scheduled activity on the channel. Since we check using getState(), it is possible that the reconnection process temporarily keeps a reference to a channel in the shutdown state, causing the stack trace @MORLACK is seeing.
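
As a stand-alone illustration of that discrepancy against the plain io.grpc.ManagedChannel API (the localhost:8124 address is only an example):

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class ShutdownStateCheck {

    public static void main(String[] args) {
        ManagedChannel channel = ManagedChannelBuilder.forAddress("localhost", 8124)
                                                      .usePlaintext()
                                                      .build();
        channel.shutdown();

        // The shutdown flag is set immediately by shutdown()...
        System.out.println("isShutdown: " + channel.isShutdown());
        // ...while the connectivity state is only updated by scheduled work on the
        // channel, so it may not report SHUTDOWN yet at this point.
        System.out.println("getState:   " + channel.getState(false));
    }
}
```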

I am working on a fix to improve the internal connection tracking.

While isShutdown() returns true immediately upon a shutdown() call, getState() doesn't return SHUTDOWN immediately. The latter state is modified on a scheduled task.

Additional calls to isShutdown() were added to prevent starting calls on a channel that has been shut down already.
Each attempt to retrieve a channel would schedule an attempt to connect if it wasn't in the ready state yet. However, when disconnected, this means that each attempt to perform a call will result in a scheduled task to verify connections.

This commit changes that to only connect a channel when it is created for the first time. If it fails to connect, the channel will itself schedule a task to verify the connection status in a certain timeframe.
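
A rough sketch of that scheduling idea (the class name, the 2-second delay, and the helper methods are made up; the actual AxonServerManagedChannel logic is more involved):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ReconnectingChannel {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Called once, when the channel is created; retrieving the channel later on
    // no longer schedules connect attempts of its own.
    void connectOnCreation() {
        if (!tryConnect()) {
            scheduleVerification();
        }
    }

    private void scheduleVerification() {
        // The channel re-checks its own connection after a fixed delay, instead of
        // every call attempt queueing yet another verification task.
        scheduler.schedule(this::verifyConnection, 2, TimeUnit.SECONDS);
    }

    private void verifyConnection() {
        if (!tryConnect()) {
            scheduleVerification();
        }
    }

    private boolean tryConnect() {
        // Placeholder for the actual (re)connect logic towards Axon Server.
        return false;
    }
}
```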
@sonarcloud (bot) commented Nov 3, 2022

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: 72.6%
Duplication: 0.0%

@CodeDrivenMitch (Contributor)

@abuijze I just tested this locally in my hotel demo version that broke it. It fully recovers now!

@CodeDrivenMitch CodeDrivenMitch merged commit 25c857e into connector-4.6.x Nov 4, 2022
@smcvb smcvb deleted the bug/stream-cleanup-on-opening-failure branch November 4, 2022 14:49