io.netty.util.IllegalReferenceCountException after upgrading azure-cosmos from v4.53.1 to v4.56.0 [BUG] #39252

varenyavv opened this issue Mar 15, 2024 · 4 comments
Labels: Client, Cosmos, cosmos-java-ecosystem-dt-planning, customer-reported, needs-team-attention, question, Service Attention

Comments


varenyavv commented Mar 15, 2024

Describe the bug
After upgrading the azure-cosmos library from v4.53.1 to v4.56.0, our application started encountering io.netty.util.IllegalReferenceCountException intermittently.

Exception or Stack Trace

io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
	at io.netty.util.internal.ReferenceCountUpdater.toLiveRealRefCnt(ReferenceCountUpdater.java:83)
	at io.netty.util.internal.ReferenceCountUpdater.release(ReferenceCountUpdater.java:148)
	at io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:101)
	at io.netty.buffer.ByteBufInputStream.close(ByteBufInputStream.java:145)
	at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._closeInput(UTF8StreamJsonParser.java:295)
	at com.fasterxml.jackson.core.base.ParserBase.close(ParserBase.java:393)
	at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4950)
	at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:3258)
	at com.azure.cosmos.implementation.directconnectivity.JsonNodeStorePayload.fromJson(JsonNodeStorePayload.java:28)
	at com.azure.cosmos.implementation.directconnectivity.JsonNodeStorePayload.<init>(JsonNodeStorePayload.java:19)
	at com.azure.cosmos.implementation.directconnectivity.StoreResponse.<init>(StoreResponse.java:69)
	at com.azure.cosmos.implementation.RxGatewayStoreModel.lambda$toDocumentServiceResponse$1(RxGatewayStoreModel.java:372)
	at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:106)
	at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:74)
	at reactor.core.publisher.FluxPeek$PeekSubscriber.onNext(FluxPeek.java:200)
	at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122)
	at reactor.core.publisher.FluxDoFinally$DoFinallySubscriber.onNext(FluxDoFinally.java:113)
	at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onNext(FluxHandleFuseable.java:194)
	at reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onNext(FluxContextWrite.java:107)
	at reactor.core.publisher.Operators$BaseFluxToMonoOperator.completePossiblyEmpty(Operators.java:2097)
	at reactor.core.publisher.MonoCollectList$MonoCollectListSubscriber.onComplete(MonoCollectList.java:118)
	at reactor.core.publisher.FluxPeek$PeekSubscriber.onComplete(FluxPeek.java:260)
	at reactor.core.publisher.FluxMap$MapSubscriber.onComplete(FluxMap.java:144)
	at reactor.netty.channel.FluxReceive.onInboundComplete(FluxReceive.java:415)
	at reactor.netty.channel.ChannelOperations.onInboundComplete(ChannelOperations.java:446)
	at reactor.netty.channel.ChannelOperations.terminate(ChannelOperations.java:500)
	at reactor.netty.http.client.HttpClientOperations.onInboundNext(HttpClientOperations.java:782)
	at reactor.netty.channel.ChannelOperationsHandler.channelRead(ChannelOperationsHandler.java:114)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:289)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
	at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475)
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338)
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387)
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)
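
For context, this exception is Netty's signal that a reference-counted buffer was released after its reference count had already dropped to zero (a double release). A minimal standalone illustration that produces the same error message, not the SDK's actual code path:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.IllegalReferenceCountException;

public class RefCountDemo {
    public static void main(String[] args) {
        ByteBuf buf = Unpooled.buffer(16); // refCnt == 1 after allocation
        buf.release();                     // refCnt == 0, buffer is deallocated
        try {
            buf.release();                 // a second release is illegal
        } catch (IllegalReferenceCountException e) {
            // Prints "refCnt: 0, decrement: 1", matching the message above
            System.out.println(e.getMessage());
        }
    }
}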

To Reproduce
Steps to reproduce the behavior:
  1. Use com.azure:azure-cosmos:4.56.0.
  2. Create a Cosmos client using gateway mode.
  3. Save a JSON document using the code snippet provided below, at the throughput shown in the screenshots.
Our JSON documents are no larger than 4 KB.
Sample Document

{
    "id": "GPS321348639",
    "referenceId": "321348639",
    "fulfillmentTypeName": "16",
    "sourceSystem": "GPS",
    "memberLookupId": "543311380245",
    "submitDate": "2021-09-30 13:01:51",
    "requestStatus": "SHIPPED",
    "address": {
        "mailToName": {
            "first": "DUMMY",
            "middle": "J",
            "last": "DUMMY"
        },
        "lineOne": "PO BOX XXX",
        "town": "DUMMY",
        "stateProvinceCode": "XX",
        "postalCode": "6546",
        "countryCode": "UNITED STATES"
    },
    "insuredPlanId": "123456",
    "applicationId": "7890156",
    "asOfDate": "2021-09-30",
    "processDate": "2021-09-30 22:44:04",
    "printVendorHistory": [
        {
            "carrier": "XYZZ",
            "shippingMethod": "FIRST_CLASS",
            "status": "SHIPPED",
            "date": "2021-10-02 12:00:00",
            "batchId": 0
        }
    ],
    "mailByDate": "2021-09-30",
    "recipientType": "MEMBER",
    "sourceSystemData": {
        "nameValuePairs": {}
    },
    "statusChangeHistory": [
        {
            "status": "PROCESSED",
            "timestamp": "2021-09-30 22:44:04",
            "source": "Generator Microservice"
        },
        {
            "status": "SHIPPED",
            "timestamp": "2021-10-05 04:13:11",
            "source": "Vendor Microservice"
        }
    ],
    "sentToGeneratorCount": 1,
    "audit": {
        "createdBy": "gp-fulfillment-request",
        "createdOn": "2021-09-30 22:43:59",
        "modifiedBy": "gp-fulfillment-request",
        "modifiedOn": "2021-10-05 04:13:11"
    },
    "_etag": "\"f900f0ed-0000-0300-0000-646e3ac10000\"",
    "_rid": "uC5NAJLopl4IAAAAAAAAAA==",
    "_self": "dbs/uC5NAA==/colls/uC5NAJLopl4=/docs/uC5NAJLopl4IAAAAAAAAAA==/",
    "_attachments": "attachments/",
    "_ts": 1684945601
}

Code Snippet

// Our internal method to retrieve a CosmosContainer object using gateway mode
com.azure.cosmos.CosmosContainer cosmosContainer = this.getContainer();
CosmosItemResponse<T> response = cosmosContainer.createItem(
    item,
    new com.azure.cosmos.models.PartitionKey(partitionKey),
    new com.azure.cosmos.models.CosmosItemRequestOptions());
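
For reference, a self-contained sketch of the setup described in the repro steps, assuming a plain gateway-mode client; the endpoint, key, database name, container name, and partition key path (/id) below are placeholders, not values from our environment:

import java.util.HashMap;
import java.util.Map;

import com.azure.cosmos.CosmosClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.CosmosContainer;
import com.azure.cosmos.models.CosmosItemRequestOptions;
import com.azure.cosmos.models.CosmosItemResponse;
import com.azure.cosmos.models.PartitionKey;

public class GatewayModeRepro {
    public static void main(String[] args) {
        // Gateway connection mode, as used in our setup
        CosmosClient client = new CosmosClientBuilder()
            .endpoint("https://<account>.documents.azure.com:443/")
            .key("<key>")
            .gatewayMode()
            .buildClient();

        CosmosContainer container = client
            .getDatabase("<database>")
            .getContainer("<container>");

        // Small document (< 4 KB), similar in shape to the sample above
        Map<String, Object> item = new HashMap<>();
        item.put("id", "GPS321348639");
        item.put("referenceId", "321348639");
        item.put("requestStatus", "SHIPPED");

        // Assumes the container is partitioned on /id
        CosmosItemResponse<Map<String, Object>> response = container.createItem(
            item,
            new PartitionKey("GPS321348639"),
            new CosmosItemRequestOptions());

        System.out.println("Create status: " + response.getStatusCode());
        client.close();
    }
}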

Expected behavior
The library should save the document without throwing the reported exception.

Screenshots
Correlation between the errors encountered by the application and the request throughput at the container:
[Screenshot: error trend]
[Screenshot: throughput trend]

Setup (please complete the following information):

  • OS: Ubuntu 22.04.4 LTS
  • IDE: IntelliJ IDEA
  • Library/Libraries: com.azure:azure-cosmos:4.56.0
  • Java version: OpenJDK 17.0.10
  • App Server/Environment: Tomcat embedded in Spring Boot
  • Frameworks: Spring Boot v3.2.3

Additional context
The change related to PR #38072 must be causing this issue. I also found a couple-of-years-old issue, #9802, which reports the same error.
cc: @FabianMeiswinkel @kushagraThapar

Information Checklist
Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • Bug Description Added
  • Repro Steps Added
  • Setup information Added
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @kushagraThapar @pjohari-ms @TheovanKraay.

kushagraThapar (Member) commented:

@varenyavv I investigated this issue and couldn't reproduce it with Netty's leak detection enabled, running against the latest version of azure-cosmos. The code I used is here: https://github.com/kushagraThapar/cosmos-java-sdk-testing/blob/main/src/main/java/com/example/common/NettyMemoryIssue.java

The only difference is that I am not using the payload you mentioned. Do you think the issue is strictly related to the payload type and size?
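
For anyone attempting a repro, a minimal sketch of enabling Netty's paranoid leak detection (the programmatic equivalent of passing -Dio.netty.leakDetection.level=paranoid on the JVM command line); the level must be set before any Netty buffers are allocated, i.e. before the Cosmos client is built:

import io.netty.util.ResourceLeakDetector;

public class LeakDetectionSetup {
    public static void main(String[] args) {
        // Equivalent to -Dio.netty.leakDetection.level=paranoid;
        // set this before building the CosmosClient so all buffers are tracked.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
        System.out.println("Netty leak detection level: " + ResourceLeakDetector.getLevel());
    }
}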

varenyavv (Author) commented:

Apologies for the delayed response. I attempted to replicate the issue locally but was unsuccessful. The error occurred specifically in our production environment, prompting us to revert to version 4.53.1, which has been stable since then. We'll proceed to test the latest version in our staging environment to determine if the issue persists. By the way, could this happen due to networking issues?

@kushagraThapar
Copy link
Member

Thanks @varenyavv for trying it out. Let us know how it goes in your staging environment; I will keep this issue open.
NOTE: It can happen because of a networking issue, since that could leave some of the buffers hanging, but having a repro would be the first step to look into it more deeply.
