Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf-improvement avoiding extra-buffer copy for query and point operations #38072

Merged

Conversation

FabianMeiswinkel
Copy link
Member

@FabianMeiswinkel FabianMeiswinkel commented Dec 15, 2023

Description

This PR is refactoring the transport to allow avoiding one buffer copy. Before this PR the flow of a request in the direct/rntbd hot path a bit simplified was:
Netty ByteBuf -> Copy into byte[] in RntbdResponse.toStoreResponse -> ObjectMapper deserialization off of the byte[]

The goal of this PR is to avoid the additional buffer copy -> allow directly deserializing from the netty buteBuf.

The perf results below show a good improvement (5-10 percent) - but to be clear only when the GC pressure is high - to simulate this in this benchmark I have set -Xmx to 2GB (max. heap restricted to 2 GB) and used a point-read benchmark reading docs >= 1.5 MB of size with concurrency of 35.

Kusto query against https://fabianm-perf-results.westeurope.kusto.windows.net/fabianm-java-perf-results

Results
| where TIMESTAMP > datetime(2023-12-19 18:30:00.0000000)
| where TestVariationName == "ReadLarge_NoMonitoring_NoJavaAgent"
| where BranchName startswith "origin"
| summarize avg(SuccessRate) by bin(TIMESTAMP, 5m), strcat(BranchName, "_", CommitId)
| render timechart

image

High-level design
Cosmos DB NoSql always requires documents to be stored as Json. Using this to allow parsing the Netty ByteBuf into a JsonNode directly - and normalizing on it. There are a few code paths that need special-casing (Queries extract docs form a json array, Stored procedures have special response payload etc.) - but on the hot path for point operations this always avoids one buffer copy. And this refactoring will also help with Binary implementation because we have a single place now where we need to support parsing Binary into a Json Node.

All SDK Contribution checklist:

  • The pull request does not introduce [breaking changes]
  • CHANGELOG is updated for new features, bug fixes or other significant changes.
  • I have read the contribution guidelines.

General Guidelines and Best Practices

  • Title of the pull request is clear and informative.
  • There are a small number of commits, each of which have an informative message. This means that previously merged commits do not appear in the history of the PR. For more information on cleaning up the commits in your PR, see this page.

Testing Guidelines

  • Pull request includes test coverage for the included changes.

@azure-sdk
Copy link
Collaborator

azure-sdk commented Dec 15, 2023

API change check

APIView has identified API level changes in this PR and created following API reviews.

com.azure:azure-cosmos
com.azure:azure-cosmos-encryption

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - spark

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

Copy link
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to run few tests with netty byte buff leak detection enabled to make sure there are no obvious memory leaks.

Great work @FabianMeiswinkel, thank you!

@FabianMeiswinkel
Copy link
Member Author

It would be good to run few tests with netty byte buff leak detection enabled to make sure there are no obvious memory leaks.

Great work @FabianMeiswinkel, thank you!

I have used that to fix some of the initial issues I had. The memory leak detection errors showed up in local tests under debugger - and I have validated that they were gone with query and Cosmos Item tests.

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - spark

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@xinlian12 xinlian12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - tests

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@FabianMeiswinkel
Copy link
Member Author

/azp run java - cosmos - spark

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @FabianMeiswinkel

@FabianMeiswinkel FabianMeiswinkel merged commit ad74eee into Azure:main Dec 28, 2023
70 checks passed
lmolkova pushed a commit to lmolkova/azure-sdk-for-java that referenced this pull request Jan 4, 2024
…tions (Azure#38072)

* Perf-improvement avoiding extra-buffer copy for query and point operations

* Fixing JDK8 build errors

* Update StoreResponseBuilder.java

* Fixing test issues

* Update BatchResponseParser.java

* Update CosmosItemResponse.java

* Update ResponseUtils.java

* Fixing test issues

* Test changes

* Fixing test failures

* Fixing encryption test failures

* Update UtilsTest.java

* Update ClientTelemetryTest.java

* Fixing some encryption relates test failures

* Update CosmosItemTest.java

* Update CosmosItemResponse.java

* Update CosmosItemResponse.java

* Avoided double-deserialization in Gateway mode

* Removing unnecessary imports

* Update RxGatewayStoreModel.java

* Addressing code review feedback
@varenyavv
Copy link

@FabianMeiswinkel @kushagraThapar As we upgraded from azure-cosmos:4.53.1 to azure-cosmos:4.56.0 we started encountering the below exception and quite possibly it is related to this change:

exception: io.netty.util.IllegalReferenceCountException: refCnt: 0, decrement: 1
  at io.netty.util.internal.ReferenceCountUpdater.toLiveRealRefCnt(ReferenceCountUpdater.java:83)
  at io.netty.util.internal.ReferenceCountUpdater.release(ReferenceCountUpdater.java:148)
  at io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:101)
  at io.netty.buffer.ByteBufInputStream.close(ByteBufInputStream.java:145)
  at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._closeInput(UTF8StreamJsonParser.java:295)
  at com.fasterxml.jackson.core.base.ParserBase.close(ParserBase.java:393)
  at com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4950)
  at com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:3258)
  at com.azure.cosmos.implementation.directconnectivity.JsonNodeStorePayload.fromJson(JsonNodeStorePayload.java:28)
  at com.azure.cosmos.implementation.directconnectivity.JsonNodeStorePayload.(JsonNodeStorePayload.java:19)
  at com.azure.cosmos.implementation.directconnectivity.StoreResponse.(StoreResponse.java:69)
  at com.azure.cosmos.implementation.RxGatewayStoreModel.lambda$toDocumentServiceResponse$1(RxGatewayStoreModel.java:372)

I found a couple of year old #9802 which talks about the same error. Can you please look into it once?

@kushagraThapar
Copy link
Member

@varenyavv thanks for providing the exception details, can you also please provide the operation type you noticed this error on? I would like to know your workload pattern and throughput. Are you using direct or gateway mode? To help us investigate this issue, can you please create a separate github issue and provide all the necessary details?

You can create the github issue on this repo and tag us.

@varenyavv
Copy link

@kushagraThapar Thank you for your reply! We are using gateway mode. I created #39252 with all the necessary details.

@kushagraThapar
Copy link
Member

Sounds good, lets investigate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants