[QUERY] Can excessive memory usage by reactor.util.concurrent.SpscArrayQueue while consuming from an EventHub be controlled? Or is this a bug? #39386

Closed
ChrisCollinsIBM opened this issue Mar 25, 2024 · 4 comments


ChrisCollinsIBM commented Mar 25, 2024

Query/Question
While consuming from an Event Hub, we're seeing an exceptionally large amount of memory in use on the Java heap.

In the example below there are PartitionEvent objects containing EventData, whose AmqpMessageBody bodies are roughly 148,000 bytes each (each body itself bundles multiple messages, which is why a single message is so large).

Each ArrayList contains about 500 of these events (and could hold 549 based on its instantiated size), and the SpscArrayQueue can hold up to 512 of these ArrayList objects.

Class Name                                                                                      | Shallow Heap | Retained Heap | Percentage
--------------------------------------------------------------------------------------------------------------------------------------------
reactor.util.concurrent.SpscArrayQueue @ 0x603b21700                                            |          400 | 8,433,524,592 |     21.14%
'- array java.lang.Object[512] @ 0x6052d6dd0                                                    |        2,064 | 8,433,524,192 |     21.14%
   '- java.util.ArrayList @ 0x6e97f36e0                                                         |           32 |    50,389,456 |      0.13%
      '- elementData java.lang.Object[549] @ 0x947f6cdb0                                        |        2,208 |    50,389,424 |      0.13%
         '- com.azure.messaging.eventhubs.models.PartitionEvent @ 0x6ef385a80                   |           32 |       148,496 |      0.00%
            '- eventData com.azure.messaging.eventhubs.EventData @ 0x6ef386b80                  |           32 |       148,432 |      0.00%
               '- annotatedMessage com.azure.core.amqp.models.AmqpAnnotatedMessage @ 0x6ef388d30|           48 |       148,224 |      0.00%
                  |- amqpMessageBody com.azure.core.amqp.models.AmqpMessageBody @ 0x6ef38ca50   |           32 |       147,696 |      0.00%
                  |- messageAnnotations java.util.HashMap @ 0x6ef38caa0                         |           48 |           288 |      0.00%
                  |- properties com.azure.core.amqp.models.AmqpMessageProperties @ 0x6ef38cb20  |           64 |            64 |      0.00%
                  |- deliveryAnnotations java.util.HashMap @ 0x6ef38ca70                        |           48 |            48 |      0.00%
                  |- footer java.util.HashMap @ 0x6ef38cad0                                     |           48 |            48 |      0.00%
                  '- header com.azure.core.amqp.models.AmqpMessageHeader @ 0x6ef38cb00          |           32 |            32 |      0.00%
--------------------------------------------------------------------------------------------------------------------------------------------

So we're trying to understand why there are over 56,000 messages in memory (8,433,524,592 bytes of retained heap / 148,496 bytes per sample message) when the default cache (100) and prefetch (300) values are being used with a batch size of 999.

There also appears to be one of these queues for each partition-pump-x-x thread, which scales the memory usage up massively when consuming from multiple partitions or Event Hubs. A sketch of the kind of setup in play follows below.
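
For context, here is a minimal sketch of the kind of processor setup in play (not our exact code; the connection strings, names, and handler body are placeholders). The batch size is the maxBatchSize argument to processEventBatch, and prefetch, if overridden at all, would be set on the builder:

import com.azure.messaging.eventhubs.EventProcessorClient;
import com.azure.messaging.eventhubs.EventProcessorClientBuilder;
import com.azure.messaging.eventhubs.checkpointstore.blob.BlobCheckpointStore;
import com.azure.storage.blob.BlobContainerAsyncClient;
import com.azure.storage.blob.BlobContainerClientBuilder;

import java.time.Duration;

public class ProcessorSketch {
    public static void main(String[] args) {
        // Placeholder connection details.
        String eventHubConnectionString = "<event-hub-connection-string>";
        String storageConnectionString = "<storage-connection-string>";

        BlobContainerAsyncClient checkpointContainer = new BlobContainerClientBuilder()
                .connectionString(storageConnectionString)
                .containerName("<checkpoint-container>")
                .buildAsyncClient();

        EventProcessorClient processor = new EventProcessorClientBuilder()
                .connectionString(eventHubConnectionString, "<event-hub-name>")
                .consumerGroup("$Default")
                .checkpointStore(new BlobCheckpointStore(checkpointContainer))
                // Batch size (999 in our case) is the maxBatchSize argument below.
                .processEventBatch(batchContext -> {
                    // actual processing logic omitted
                    batchContext.updateCheckpoint();
                }, 999, Duration.ofSeconds(30))
                .processError(errorContext ->
                        System.err.println("Error: " + errorContext.getThrowable()))
                // Prefetch defaults to 300; uncomment to override it if the builder
                // in your SDK version exposes this setter.
                //.prefetchCount(300)
                .buildEventProcessorClient();

        processor.start();
    }
}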

Why is this not a Bug or a feature Request?
Before filing this as a bug, we want to make sure there aren't existing settings we could use to control this behavior.

Setup (please complete the following information if applicable):

  • OS: RHEL7 / Java 8.0.7.20
  • Library/Libraries:
    com.azure:azure-core:1.28.0
    com.azure:azure-core-amqp:2.0.5.1
    com.azure:azure-messaging-eventhubs:5.11.2
    com.azure:azure-messaging-eventhubs-checkpointstore-blob:1.12.2
    com.azure:azure-storage-blob:12.21.1
    com.azure:azure-storage-common:12.16.1

Information Checklist
Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • [ X ] Query Added
  • [ X ] Setup information Added
github-actions bot added the customer-reported, needs-triage, and question labels on Mar 25, 2024
@anuchandy (Member)

Hi @ChrisCollinsIBM, we optimized the memory allocation of the processor in version 5.18.0 (see azure-sdk-for-java/sdk/eventhubs/azure-messaging-eventhubs/CHANGELOG.md at main · Azure/azure-sdk-for-java (github.com)); this should lower the memory usage that you are seeing.

Each partition has a dedicated connection (link) to the service, and an in-memory queue exists for each partition receive. Each partition receive is managed by an instance of partition-pump-x-x, so, as you observed, the allocation adds up based on the number of partition receives hosted on one machine.
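
As a rough illustration only, the buffering scales linearly with the number of partition receives on the host. In the sketch below, the per-queue retained size is the ~8.4 GB figure from your heap dump and the partition count is a made-up example:

public class PerPartitionBufferEstimate {
    public static void main(String[] args) {
        // Observed in the heap dump above: one SpscArrayQueue retaining ~8.4 GB.
        long retainedBytesPerPartitionQueue = 8_433_524_592L;

        // Hypothetical example: partition receives hosted on this machine.
        int partitionReceivesOnHost = 4;

        long estimatedBytes = retainedBytesPerPartitionQueue * partitionReceivesOnHost;
        System.out.printf("Rough buffered memory across partition receives: %.1f GB%n",
                estimatedBytes / (1024.0 * 1024.0 * 1024.0));
    }
}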

@ChrisCollinsIBM (Author)

Thanks for the prompt response, @anuchandy. I did find #38572 when digging into the Event Hubs messaging history, so I presume that's the fix you're referring to.

Since we're using the prefetch default (300), what would you suggest as a reasonable batch size? I see 100 tossed around in many discussions, based on a 3:1 prefetch:batch ratio, but maybe that was cache-related. In other places I see a batch size of 10. Some guidance on this would be great, thanks!

github-actions bot removed the needs-triage label on Mar 26, 2024
@anuchandy (Member)

Hello @ChrisCollinsIBM, sorry for the late response. @conniey and I discussed this. We don't have a one-size-fits-all recommendation for tuning prefetch and batch size for optimal memory; there is also a third variable, the expected event size. Our suggestion is to run the application (with the actual event-processing logic) and tune these values to achieve the expected throughput. While doing this exercise, identify an appropriate value for the max heap size (-Xmx). The idea is: once the application reaches a steady state with the expected throughput, force a full GC using a tool such as JConsole and check how much memory is occupied afterwards. You want to size the heap such that only ~30% is occupied after a full GC; use this value to set the max heap size (-Xmx). Then size the host (e.g., container) memory to have an additional ~1 GB for the non-heap needs of the JVM instance.
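
For illustration, a small sketch of that sizing arithmetic; the post-full-GC live-set value is a placeholder you would measure yourself (e.g., with JConsole):

public class HeapSizingSketch {
    public static void main(String[] args) {
        // Placeholder: heap occupied right after forcing a full GC at steady state.
        double liveSetAfterFullGcGb = 3.0;

        // Target: the live set should be only ~30% of the heap after a full GC.
        double targetOccupancyAfterFullGc = 0.30;
        double recommendedXmxGb = liveSetAfterFullGcGb / targetOccupancyAfterFullGc;

        // Host/container memory: the heap plus roughly 1 GB for the JVM's non-heap needs.
        double recommendedHostMemoryGb = recommendedXmxGb + 1.0;

        System.out.printf("-Xmx ~= %.1f GB, host memory ~= %.1f GB%n",
                recommendedXmxGb, recommendedHostMemoryGb);
    }
}

For example, a ~3 GB live set after a full GC would suggest roughly -Xmx10g and about 11 GB of host memory.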

anuchandy self-assigned this on Apr 29, 2024
@anuchandy (Member)

Closing this; please refer to the previous comment.
