
[BUG] UndeliverableException when creating resource group and network security group in heavy load #33056

wangwenbj opened this issue Jan 18, 2023 · 54 comments
Assignees: XiaofeiCao
Labels: ARM, customer-reported, Mgmt, needs-team-attention, pillar-reliability, question, Service Attention


@wangwenbj

Describe the bug
We encountered the following errors under heavy load when creating resource groups and network security groups with the new Azure Java SDK. The HTTP client is OkHttpClient. This issue does not happen with the old RxJava-based SDK, though.

Exception or Stack Trace

Exception in thread "RxCachedThreadScheduler-141" io.reactivex.rxjava3.exceptions.UndeliverableException: The exception could not be delivered to the consumer because it has already canceled/disposed the flow or the exception has nowhere to go to begin with. Further reading: https://github.com/ReactiveX/RxJava/wiki/What's-different-in-2.0#error-handling | reactor.core.Exceptions$ReactiveException: java.lang.InterruptedException
at io.reactivex.rxjava3.plugins.RxJavaPlugins.onError(RxJavaPlugins.java:372)
at io.reactivex.rxjava3.internal.operators.single.SingleFromCallable.subscribeActual(SingleFromCallable.java:49)
at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
at io.reactivex.rxjava3.internal.operators.single.SingleResumeNext.subscribeActual(SingleResumeNext.java:39)
at io.reactivex.rxjava3.core.Single.subscribe(Single.java:4855)
at io.reactivex.rxjava3.internal.operators.single.SingleSubscribeOn$SubscribeOnObserver.run(SingleSubscribeOn.java:89)
at io.reactivex.rxjava3.core.Scheduler$DisposeTask.run(Scheduler.java:644)
at io.reactivex.rxjava3.internal.schedulers.ScheduledRunnable.run(ScheduledRunnable.java:65)
at io.reactivex.rxjava3.internal.schedulers.ScheduledRunnable.call(ScheduledRunnable.java:56)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

To Reproduce
This issue cannot be reproduced easily. It happens every now and then in our production environment, and we have no way to catch and handle it.

During large-scale resource group creation we encounter this issue occasionally. I have reproduced it only once locally, by provisioning 100 resource groups in parallel.

Code Snippet
ResourceGroup.DefinitionStages.WithCreate creator = this.azureResoureManager.resourceGroups()
        .define(resourceGroupName)
        .withRegion(region);
// convert the Reactor Mono returned by createAsync() into an RxJava3 Single
return ReactorToRxV3Interop.monoToSingle(creator.createAsync());

Expected behavior
No exception happens, or if an exception does happen, we have a way to catch it inside the reactor chain.

Screenshots
API error; no screenshots.

Additional context
This part of the log is what we catch in our customized OkHttp interceptor. However, after the exception is thrown, the upper chain loses track of it, which causes the chain to never terminate.

2023-01-11T17:05:44.011Z [trace_id=9492315ecd8cdf9e9db291d40c42e57b] [transaction_id=1e99ae844e81ce79] ERROR [gement.azure.com/...] .i.i.AzureResilienceInterceptorImpl.logRetryInfoForError:506 - Exception: java.io.IOException: Canceled
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at com.vmware.horizon.sg.clouddriver.impl.azure.internal.interceptor.AzureResilienceInterceptorImpl.intercept(AzureResilienceInterceptorImpl.java:117)
at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at com.vmware.horizon.sg.clouddriver.impl.azure.internal.DynamicThrottleInterceptor.intercept(DynamicThrottleInterceptor.java:80)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.kt:221)
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)


Information Checklist
Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added
@ghost ghost added needs-triage This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Jan 18, 2023
@joshfree joshfree added ARM Mgmt This issue is related to a management-plane library. pillar-reliability The issue is related to reliability, one of our core engineering pillars. (includes stress testing) labels Jan 19, 2023
@ghost ghost removed the needs-triage This is a new issue that needs to be triaged to the appropriate team. label Jan 19, 2023
@joshfree
Member

Thank you for reaching out to us via this github issue, @wangwenbj. @weidongxu-microsoft will be able to help route your issue further. Please note that if this problem requires immediate attention, please refer to Azure support plan details here: https://github.com/Azure/azure-sdk-for-java/blob/main/SUPPORT.md#support

@weidongxu-microsoft
Member

weidongxu-microsoft commented Jan 20, 2023

@wangwenbj

What is the version of the SDK?
What is the version of azure-core-http-okhttp?

Also, may I ask why you chose OkHttpClient over NettyClient?

@wangwenbj
Author

wangwenbj commented Jan 20, 2023 via email

@wangwenbj
Author

wangwenbj commented Jan 30, 2023 via email

@XiaofeiCao
Contributor

Hi @wangwenbj ,
I've tried creating 100 resource groups multiple times but was not able to reproduce the issue...

You can refer to this doc for throttling control.

P.S. You don't have to write your own ReactorToRxV3Interop. There's official support for converting a Mono to an RxJava3 Single.
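
For reference, a minimal sketch of that conversion without a custom adaptor, assuming RxJava3 is on the classpath (azureResourceManager, resourceGroupName, and region stand in for your own objects). Since Mono implements the Reactive Streams Publisher interface, Single.fromPublisher can wrap it directly; the reactor-adapter add-on also ships an RxJava3Adapter if you prefer a dedicated bridge:

import com.azure.resourcemanager.resources.models.ResourceGroup;
import io.reactivex.rxjava3.core.Single;

// Mono<T> implements org.reactivestreams.Publisher<T>, so RxJava3 can subscribe to it directly.
Single<ResourceGroup> createSingle = Single.fromPublisher(
        azureResourceManager.resourceGroups()
                .define(resourceGroupName)
                .withRegion(region)
                .createAsync());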

@wangwenbj
Author

wangwenbj commented Jan 30, 2023 via email

@XiaofeiCao
Contributor

XiaofeiCao commented Jan 30, 2023

OK, got it.

Anything you can think of that might have caused this issue?

I'm not sure. From the log I can't tell the root cause of the exception. And for your description:

after the exception is thrown, the upper chain lost track of this exception. Which caused the chain to never stop.

I don't quite understand; can you elaborate on this? What do you mean by "never stop"?

@wangwenbj
Author

wangwenbj commented Jan 30, 2023 via email

@XiaofeiCao
Contributor

Thanks @wangwenbj

I saw a blocking get operation get canceled in DynamicThrottleInterceptor (Exception: java.io.IOException: Canceled), and an InterruptedException is thrown. You may want some special error handling here, as described in the RxJava3 error-handling guide:

In addition, some 3rd party libraries/code throw when they get interrupted by a cancel/dispose call which leads to an undeliverable exception most of the time. Internal changes in 2.0.6 now consistently cancel or dispose a Subscription/Disposable before cancelling/disposing a task or worker (which causes the interrupt on the target thread).

// in some library
try {
   doSomethingBlockingly();
} catch (InterruptedException ex) {
   // check if the interrupt is due to cancellation
   // if so, no need to signal the InterruptedException
   if (!disposable.isDisposed()) {
      observer.onError(ex);
   }
}

If the library/code already did this, the undeliverable InterruptedExceptions should stop now. If this pattern was not employed before, we encourage updating the code/library in question.

By the way, could you show me the code snippet of DynamicThrottleInterceptor, please?

@wangwenbj
Author

wangwenbj commented Jan 30, 2023 via email

@XiaofeiCao
Contributor

OK, our track 1 library uses RxJava and your code uses RxJava3. There is a difference in error handling since RxJava2, especially for undeliverable exceptions:

One important design requirement for 2.x is that no Throwable errors should be swallowed. This means errors that can't be emitted because the downstream's lifecycle already reached its terminal state or the downstream cancelled a sequence which was about to emit an error.

My best guess is that this error actually happens with the old RxJava as well, but gets swallowed. You can try adding a global error handler that handles specific exceptions based on whether they represent a likely bug or an ignorable application/network state, as described in the RxJava3 error-handling guide:

RxJavaPlugins.setErrorHandler(e -> {
    if (e instanceof UndeliverableException) {
        e = e.getCause();
    }
    if ((e instanceof IOException) || (e instanceof SocketException)) {
        // fine, irrelevant network problem or API that throws on cancellation
        return;
    }
    if (e instanceof InterruptedException) {
        // fine, some blocking code was interrupted by a dispose call
        return;
    }
    if ((e instanceof NullPointerException) || (e instanceof IllegalArgumentException)) {
        // that's likely a bug in the application
        Thread.currentThread().getUncaughtExceptionHandler()
            .uncaughtException(Thread.currentThread(), e);
        return;
    }
    if (e instanceof IllegalStateException) {
        // that's a bug in RxJava or in a custom operator
        Thread.currentThread().getUncaughtExceptionHandler()
            .uncaughtException(Thread.currentThread(), e);
        return;
    }
    // replace Log.warning with your logging framework of choice
    Log.warning("Undeliverable exception received, not sure what to do", e);
});

@wangwenbj
Author

wangwenbj commented Feb 1, 2023 via email

@XiaofeiCao
Contributor

XiaofeiCao commented Feb 1, 2023

Sure. Would you help me confirm the code at line 80 of DynamicThrottleInterceptor? I assume the exception originated there?

at com.vmware.horizon.sg.clouddriver.impl.azure.internal.DynamicThrottleInterceptor.intercept(DynamicThrottleInterceptor.java:80)

@wangwenbj
Author

wangwenbj commented Feb 1, 2023 via email

@XiaofeiCao
Contributor

Hi @wangwenbj , I saw a very similar situation where the chain got stalled when a non-IOException was thrown in an interceptor:
square/retrofit#3453

I wonder if this is the case here. What did you do with the exception after you logged it in your custom interceptor (AzureResilienceInterceptorImpl.logRetryInfoForError)? Did you wrap it into some other non-IOException?

@wangwenbj
Author

wangwenbj commented Feb 8, 2023 via email

@XiaofeiCao
Contributor

XiaofeiCao commented Feb 8, 2023

Hi @wangwenbj

why this issue is not happening in the old rxjava version SDK?

I'm not sure. Are you using the same version of OkHttp3 as before?

My other speculation is that the Rxjava->Rxjava3 adaptor that you used before behaves differently than the Reactor->Rxjava3 adaptor you are using now. This is pure speculation...

General good practice (from the official OkHttp documentation) is that you don't throw your own exceptions in interceptors, whether they are IOExceptions or not.
Instead, if you want to signal a failure, return a synthetic HTTP response:

 @Throws(IOException::class)
 override fun intercept(chain: Interceptor.Chain): Response {
   if (myConfig.isInvalid()) {
     return Response.Builder()
         .request(chain.request())
         .protocol(Protocol.HTTP_1_1)
         .code(400)
         .message("client config invalid")
         .body("client config invalid".toResponseBody(null))
         .build()
   }

   return chain.proceed(chain.request())
 }
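
For a Java codebase, a rough equivalent of the same idea might look like the sketch below; the quota check and the 429 status are placeholders chosen for the throttling scenario discussed here, not code from either project:

import java.io.IOException;

import okhttp3.Interceptor;
import okhttp3.MediaType;
import okhttp3.Protocol;
import okhttp3.Request;
import okhttp3.Response;
import okhttp3.ResponseBody;

final class ThrottleGuardInterceptor implements Interceptor {
    @Override
    public Response intercept(Chain chain) throws IOException {
        Request request = chain.request();
        if (quotaExceeded(request)) { // placeholder check, not from the original thread
            // Signal the failure with a synthetic response instead of throwing a custom exception.
            return new Response.Builder()
                    .request(request)
                    .protocol(Protocol.HTTP_1_1)
                    .code(429)
                    .message("client-side throttled")
                    .body(ResponseBody.create(MediaType.get("text/plain"), "client-side throttled"))
                    .build();
        }
        return chain.proceed(request);
    }

    private boolean quotaExceeded(Request request) {
        return false; // placeholder for the real quota/delay calculation
    }
}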

@wangwenbj
Author

wangwenbj commented Feb 9, 2023 via email

@wangwenbj
Author

wangwenbj commented Feb 15, 2023 via email

@XiaofeiCao
Contributor

Thanks @wangwenbj. And does the UndeliverableException still persist?

@wangwenbj
Author

wangwenbj commented Mar 1, 2023 via email

@XiaofeiCao
Contributor

XiaofeiCao commented Mar 2, 2023

Hi wen,

Thanks for the clarification. Rxjava translators should be fine.

For 1, I don't think so, since our track 1 SDK has been officially deprecated since March 2022.
For 4, do you mean

chain.blockingGet()

or

chain.map(v -> {
    anotherChain.blockingGet();
    return v;
})

?
The latter is not correct, since one shouldn't make blocking calls inside a chain. A code snippet would help us better understand your situation.

Another thing: have you set any callTimeout on the OkHttpClient (or OkHttpAsyncHttpClient)? We can't control the timeout exception since it comes directly from OkHttp itself. You could set the call timeout to a higher value if this is the case.
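
For illustration, a minimal sketch of raising the call timeout on a pre-built OkHttpClient and handing it to azure-core (the five-minute value is an arbitrary placeholder, not a recommendation):

import java.time.Duration;

import com.azure.core.http.HttpClient;
import com.azure.core.http.okhttp.OkHttpAsyncHttpClientBuilder;
import okhttp3.OkHttpClient;

OkHttpClient okHttpClient = new OkHttpClient.Builder()
        .callTimeout(Duration.ofMinutes(5)) // placeholder value; 0 disables the whole-call timeout
        .build();

// Wrap the pre-configured OkHttpClient so the Azure SDK uses it for all requests.
HttpClient httpClient = new OkHttpAsyncHttpClientBuilder(okHttpClient).build();

The resulting HttpClient can then be supplied when building the resource manager, for example through its configure().withHttpClient(...) step.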

@XiaofeiCao
Contributor

XiaofeiCao commented Mar 2, 2023

Also, I saw from the stacktrace that there's a blockingGet in AzureResilienceInterceptorImpl:

com.vmware.horizon.sg.clouddriver.impl.azure.internal.interceptor.AzureResilienceInterceptorImpl.intercept(AzureResilienceInterceptorImpl.java:117) at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)

Usually it should be fine to do blocking calls in OkHttp interceptors. However, if you could share what you do with the blockingGet, it would help us better understand the situation. You could do that in my personal repo: https://github.com/XiaofeiCao/ioexception_repro, or email me if that's possible.

Further question, does this line always appear in the exception's stacktrace?
Or does the exception happen somewhere else too? If so, could you share the stacktrace?

@wangwenbj
Author

wangwenbj commented Mar 2, 2023 via email

@XiaofeiCao
Contributor

@wangwenbj Could you show me how you set up your OkHttpClient, please? Or do you leave it at the defaults?

@wangwenbj
Author

wangwenbj commented Mar 8, 2023 via email

@XiaofeiCao
Contributor

XiaofeiCao commented Mar 9, 2023

Thanks @wangwenbj for your code snippet!

I was able to reproduce your situation in my demo repo test.

Exception in thread "Thread-11" reactor.core.Exceptions$ReactiveException: java.lang.InterruptedException
    at reactor.core.Exceptions.propagate(Exceptions.java:396)
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:91)
    at reactor.core.publisher.Mono.block(Mono.java:1742)
    at com.azure.resourcemanager.resources.implementation.DeploymentsClientImpl.checkExistence(DeploymentsClientImpl.java:7569)
    at com.azure.resourcemanager.resources.implementation.DeploymentsImpl.checkExistence(DeploymentsImpl.java:102)
    at com.azure.resourcemanager.repro.ioexception.test.undeliverable.CallTimeoutMockTests$1.run(CallTimeoutMockTests.java:129)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.InterruptedException
    at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1048)
    at java.base/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)
    at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:87)
    ... 5 more
java.io.IOException: Canceled
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at com.azure.resourcemanager.repro.ioexception.test.undeliverable.CallTimeoutMockTests.lambda$buildHttpClient$1(CallTimeoutMockTests.java:177)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)

It's very similar to this issue, in which the calling thread got interrupted.

The IOException: Canceled is logged in the OkHttpClient interceptor and is caused by the thread interruption. Now it's all about finding where this interruption occurred.

Does this error always occur on this line?

com.azure.resourcemanager.resources.implementation.DeploymentsClientImpl.checkExistence(DeploymentsClientImpl.java:7569)

@wangwenbj
Author

wangwenbj commented Mar 9, 2023 via email

@XiaofeiCao
Contributor

Hi @wangwenbj ,

Thanks for the information.

Sorry for not making my point clear. The above demo only simulated the error log. It's a guess of what actually happened.

I'm still trying to reproduce it under normal conditions.

@XiaofeiCao
Contributor

XiaofeiCao commented Mar 10, 2023

I've updated my real-time test to 100 concurrent resource group creations and deletions.

I'll leave it running till the bug is reproduced.

Meanwhile, may I know what you do after you throw an Exception in DynamicThrottleInterceptor when the calculated quota delay is positive?

long delay = getQuotaDelay(requestMethod, requestUrl, clientId);
if (delay > 0) {
    throw new Exception();
}

@weidongxu-microsoft
Member

weidongxu-microsoft commented Mar 10, 2023

Let's make it simpler.

@XiaofeiCao , you already have the test running. Configure it to match the author's setup as closely as possible (same OkHttpClient config, same interceptor configuration, same scale, same AKS instance configuration if need be), and run it until we see the same problem.

If we reproduce it, we diagnose and fix it. If we don't see it, that does not prove there is no bug in the SDK, but at least it means a bug is unlikely.

The reason is that we apparently cannot get the code from the author's stress test, and even if we had it, it might contain too much code that does not belong to the SDK and could itself be a cause. We'd like to limit Xiaofei's reproduction to a relatively simple scenario with minimal non-SDK code, so that it focuses on reproducing an SDK bug.

@wangwenbj , if you think Xiaofei's test fails to reproduce the problem, please let him know what you'd like him to change.
Both Xiaofei and I have email addresses in our profiles, and you can email us anything you think might help diagnose the problem.

@wangwenbj
Author

wangwenbj commented Mar 13, 2023 via email

@ghost ghost added the needs-team-attention This issue needs attention from Azure service team or SDK team label Mar 13, 2023
@ghost

ghost commented Mar 13, 2023

Thank you for your feedback. This has been routed to the support team for assistance.

@wangwenbj
Author

@XiaofeiCao
According to the Azure network support team, this issue seems to happen in the following sequence:

  1. Submit a request, e.g. create a resource group.
  2. The request succeeds on the service side within seconds.
  3. Using the new Azure SDK, we do not see any response for 20 minutes and finally time out on the client side.
  4. The Azure service registers a client failure after 20 minutes and then refuses the request.

Screenshot 2023-03-15 at 13 14 58

Screenshot 2023-03-15 at 13 15 04

@navba-MSFT navba-MSFT added Service Attention This issue is responsible by Azure service team. and removed CXP Attention labels Mar 17, 2023
@ghost

ghost commented Mar 17, 2023

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @armleads-azure.


@navba-MSFT

@armleads-azure Could you please look into this ? Thanks in advance.

CC @jennyhunter-msft @josephkwchan

@XiaofeiCao
Contributor

Thanks @wangwenbj , I was able to get the request log from your second screenshot. I believe it's a NetworkSecurityGroup query?

Strangely, the httpStatusCode is 404, which means the NSG is not deployed (or, less likely, the client sent the wrong URL)...

I couldn't locate the log from your first screenshot. Are they targeting the same networkSecurityGroup?

@XiaofeiCao
Contributor

Hi @wangwenbj , would you try replacing the sync call

Single.fromCallable(() -> azureResourceManager.deployments().checkExistence(resourceGroupName, nsgName))

with the async one below, and see if the exception is thrown again?

azureResourceManager.deployments().manager().serviceClient().getDeployments().checkExistenceAsync(resourceGroupName, nsgName)

Also, avoid any sync HTTP calls in a Reactor/RxJava chain, like the first code snippet (checkExistence's implementation is checkExistenceAsync().block()). I tried it in my repo and it got stuck:
https://github.com/XiaofeiCao/ioexception_repro/blob/5db3fbcb4c6b03196d0b56f8555c9fa7849210b7/src/test/java/com/azure/resourcemanager/repro/ioexception/test/undeliverable/BatchCreateResourceGroupTests.java#L107

@wangwenbj
Author

wangwenbj commented Mar 28, 2023 via email

@XiaofeiCao
Contributor

XiaofeiCao commented Mar 28, 2023

I see.

Wrapping a sync call into async is tricky in this case. If you are doing a simple sync call with no IO operations involved, e.g. getting a model's innerModel properties, you can safely do that.

But if IO operations are involved, I think you should always avoid it. In this case, checkDeploymentExists() is implemented as checkDeploymentExistsAsync().block(), which involves an HTTP invocation. You should always resort to an async variant if one is available.

Unfortunately, in this case we didn't provide an async variant in the convenience layer. You could use serviceClient-level code instead, which is

azureResourceManager.deployments().manager().serviceClient().getDeployments().checkExistenceAsync(resourceGroupName, nsgName)

Then wrap it using Single.fromPublisher.
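
Putting that together, a minimal sketch (resourceGroupName and nsgName are your own values, and this assumes the async variant returns a Mono<Boolean>, as the boolean result of the sync checkExistence suggests):

import io.reactivex.rxjava3.core.Single;
import reactor.core.publisher.Mono;

Mono<Boolean> existsMono = azureResourceManager
        .deployments()
        .manager()
        .serviceClient()
        .getDeployments()
        .checkExistenceAsync(resourceGroupName, nsgName);

// No block() anywhere in the chain; the HTTP call stays fully asynchronous.
Single<Boolean> exists = Single.fromPublisher(existsMono);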

@XiaofeiCao
Contributor

Hi, does the issue still persist?

You could also try

Single.fromCallable(() ->
        azureResourceManager
                .deployments()
                .checkExistence(resourceGroupName, nsgName))
        .subscribeOn(Schedulers.io());

@wangwenbj
Author

Hi Xiaofei,

So what you mean is that we could prevent this from happening if we did not throw an IOException? That's probably not easy to do. Let me describe our usage:

  1. We added an exception that extends IOException at the OkHttpClient interceptor level to avoid hitting the Azure quota limit (this is not the root cause of this issue, but it is an IOException). We use an RxJava wrapper to retry on that exception, which is why we need the async/sync transformation. Do you have any suggestions for this implementation, since once this exception is thrown, an IOException is unavoidable?
  2. We are trying to modify the usages that have been identified; we need some more time for testing.

The issue originally identified here could be caused by requests for creating, getting, and maybe updating resources. Before we rolled back to the old SDK, it happened in many places, so my guess is that it could be a framework-level issue or in some common area. With the same implementation on the old Azure SDK, no similar issue has happened since. I hope this helps identify the real root cause.

Also, I will keep trying and keep you updated on any progress. Thanks!

@waynewang1989

@XiaofeiCao We just encountered another issue which might be related to this one.
We are trying to re-enable the new Azure SDK in our production environment, and when we try to call this API:

azure.resourceGroups().getByName(name);

We have several threads hanging on this API call. Please take some time to check. The SDK version is <com.azure.resourcemanager.version>2.26.0</com.azure.resourcemanager.version>.
We are opening an SR ticket with the Azure support team in the meantime. Please let us know if there's anything you need from the backend service.
Thanks!

@XiaofeiCao
Contributor

Thanks @waynewang1989 for reporting!

To clarify, does the thread hang forever, or does it terminate after some time with a failure (similar to the 20-minute behavior seen before)?

If the latter, is the error stack trace also similar to before?

@waynewang1989

@XiaofeiCao,
It seems to hang forever. We have an upper-layer timeout reported. From the logs, no REST call to Azure is happening via the configured OkHttpClient (as the client of HttpClient).
And this is a sync call.
