[fix] Fix memory leak in TopicName cache and reduce heap memory usage through deduplication #24457
base: master
Conversation
Pull Request Overview
This PR fixes a memory leak in the Pulsar Proxy and client by reverting the previous caching strategy and introducing a new NameCache implementation that uses soft references and string interning to reduce heap memory usage.
- Introduces NameCache, TopicNameCache, and NamespaceNameCache classes to provide efficient, self-clearing caches.
- Updates TopicName and NamespaceName to use interning when parsing topic names.
- Removes obsolete configuration properties and scheduled tasks related to cache clearing.
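As a rough illustration of the string-interning idea mentioned above (a simplified sketch, not the PR's actual code; the class name is hypothetical): repeated parses of topic names can share a single canonical instance of each string component, so equal tenant/namespace/local-name strings are stored on the heap only once.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of component-level deduplication: equal strings parsed
// from many different topic names end up sharing one heap object.
public final class InternPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance for the given string.
    public String intern(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}
```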
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
pulsar-common/src/main/java/org/apache/pulsar/common/naming/TopicNameCache.java | Added a new cache for TopicName using the NameCache abstraction. |
pulsar-common/src/main/java/org/apache/pulsar/common/naming/TopicName.java | Updated to use TopicNameCache and string interning. |
pulsar-common/src/main/java/org/apache/pulsar/common/naming/NamespaceNameCache.java | Added a new cache for NamespaceName using the NameCache abstraction. |
pulsar-common/src/main/java/org/apache/pulsar/common/naming/NamespaceName.java | Updated to use NamespaceNameCache and string interning. |
pulsar-common/src/main/java/org/apache/pulsar/common/naming/NameCache.java | Introduced the new NameCache base class with soft references and periodic maintenance. |
pulsar-broker/src/test/resources/configurations/pulsar_broker_test_standalone.conf | Removed obsolete cache configuration properties. |
pulsar-broker/src/test/resources/configurations/pulsar_broker_test.conf | Removed obsolete cache configuration properties. |
pulsar-broker/src/test/java/org/apache/pulsar/common/naming/ServiceConfigurationTest.java | Removed tests for now-obsolete cache configuration. |
pulsar-broker/src/test/java/org/apache/pulsar/broker/service/StandaloneTest.java | Removed assertions related to the obsolete cache configuration. |
pulsar-broker/src/test/java/org/apache/pulsar/broker/PulsarServiceTest.java | Removed cache configuration tests and related lookup tests. |
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java | Removed the scheduled cache clearance task. |
pulsar-broker-common/src/main/java/org/apache/pulsar/broker/ServiceConfiguration.java | Removed obsolete configuration fields concerning the cache. |
microbench/src/main/java/org/apache/pulsar/common/naming/package-info.java | New file addition for package-level documentation. |
microbench/src/main/java/org/apache/pulsar/common/naming/TopicNameBenchmark.java | Added a benchmark to validate the performance of TopicName lookup. |
conf/broker.conf | Removed obsolete cache configuration properties. |
I ran a test to compare the two implementations with all keys already inserted into the cache. The new implementation's performance gap seems acceptable (only about 1/2 of the current implementation).
But when I increased the loop count from 10000 to 200000, the gap became large.
loop count: 1000000
I changed the benchmark to the following:

```java
@Warmup(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Benchmark
public void testNoCache() {
    for (int i = 0; i < 1000000; i++) {
        new TopicName("test-" + i);
    }
}
```

loop count 1000000:
loop count 200000:
loop count 50000:
This has an explanation: the soft-reference cache will push out entries that aren't used (i.e., entries with no strong reference held elsewhere).
@BewareMyPower please check the JMH flame graph and repro instructions in #24457 (comment). The flame graph also shows that it delegates directly to ConcurrentHashMap; that's why its performance is similar. One missing detail in the test setup is setting heap parameters for the JMH forked JVM, since without that the results can vary due to the soft reference usage. I also noticed that the test would need to keep strong references in another data structure to simulate instances that are in use (as in a real broker or Pulsar client); otherwise soft references could be cleared unless the initial heap size is reasonable. I believe you came across this with larger numbers of entries. JMH also often produces skewed results on macOS. Another reason is not using the JMH "black hole" (or returning a computed value). What's your test setup like?
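For illustration, a setup addressing those points might look roughly like the sketch below. This is not the benchmark used in this PR; the class name, topic names, entry counts, and heap sizes are made up.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.pulsar.common.naming.TopicName;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
// Pin the forked JVM's heap so soft-reference behavior is comparable across runs.
@Fork(value = 1, jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
@Warmup(iterations = 1, time = 10)
@Measurement(iterations = 1, time = 10)
public class TopicNameLookupSketch {

    // Strong references simulating TopicName instances that are still "in use",
    // as a real broker or client would hold them, so soft refs are not cleared.
    private final List<TopicName> inUse = new ArrayList<>();

    @Setup(Level.Trial)
    public void setup() {
        for (int i = 0; i < 100_000; i++) {
            inUse.add(TopicName.get("persistent://public/default/topic-" + i));
        }
    }

    @Benchmark
    public void lookup(Blackhole bh) {
        // Consume each result via the Blackhole so the JIT cannot eliminate the work.
        for (int i = 0; i < 100_000; i++) {
            bh.consume(TopicName.get("persistent://public/default/topic-" + i));
        }
    }
}
```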
You can check benchmark.diff.txt for how I set up the tests. BTW, @codelipenghui mentioned "over engineering" here. However, IMO, the cache itself is over-engineering to me. I checked many of the places where it is used. Both the existing solution and this solution have issues when a lot of different topic names are used; the existing solution relies on a manual call to clear the cache. Actually, I think we should remove the cache, which brings many issues that should never happen. From my perspective, introducing the cache is just a workaround for the poor implementation of TopicName parsing.
@BewareMyPower There was no review, and there is a rationale for why this PR makes sense and improves Pulsar without causing performance regressions. I agree that in this case, it might take longer to reach consensus.
@BewareMyPower How did you set up JMH to run the benchmarks that you shared? JMH results could be skewed for multiple reasons in this case, as explained in my previous comment, #24457 (comment).
@BewareMyPower This is an invalid JMH test case. In JMH, you need to use the "black hole" to consume computed values:

```java
@Warmup(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Benchmark
public void testNoCache(Blackhole blackhole) {
    for (int i = 0; i < 1000000; i++) {
        blackhole.consume(new TopicName("test-" + i));
    }
}
```
Another detail is the short benchmark duration. A better choice could be to run a single, longer iteration:
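For example, a hedged sketch of such a configuration, applied to the benchmark shown above (the iteration count and duration are illustrative, not a prescription):

```java
// Single long warmup and measurement iteration; results consumed via Blackhole.
@Warmup(iterations = 1, time = 10, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 1, time = 10, timeUnit = TimeUnit.SECONDS)
@Benchmark
public void testNoCache(Blackhole blackhole) {
    for (int i = 0; i < 1000000; i++) {
        blackhole.consume(new TopicName("test-" + i));
    }
}
```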
…rences being collected
I improved the JMH benchmark and fixed some flaws in it. macOS results (will run on Linux soon):
Linux results:
Well, based on my improvement in #24463, I migrated the cache implementation of this PR. The benchmark ran on an Ubuntu 22.04 runner in GitHub Actions; see https://github.com/BewareMyPower/pulsar/actions/runs/15872123416/job/44750991782?pr=45 for the result:
The result is similar to my benchmark on macOS:
Both show that getting the topic name from the cache is not significantly better than my latest optimization. Please leave comments directly in BewareMyPower#45 if you think there is something wrong with my benchmark code, since you said:
Great work @BewareMyPower. The main reason why I wouldn't remove the cache is its deduplication functionality. If TopicName instances are strongly referenced, the more of them there are, the worse the problem is.
The benchmark code looks fine now. On macOS, the results can simply differ from Linux x86_64 in many cases, especially if there's any code that accesses System.nanoTime().
…rentTimeMillis instead of nanoTime - postpone shrinking the cache until the maintenance task runs
@poorbarcode we can proceed with your PR and come back to the usefulness of a soft reference cache and instance deduplication in optimizing memory usage of topic pattern listing / watching use cases.
A few observations, if helpful:
Fixes #24445
Motivation
There is currently a memory leak in Pulsar Proxy and Pulsar client caused by the TopicName cache. The TopicName cache grows without bounds and eventually consumes all available heap memory when there is high cardinality of topic names resolved during the runtime of a single JVM.
Before PR #23052, Guava's LoadingCache was used to limit the cache size to 100,000 entries. It was replaced with a plain ConcurrentHashMap because of concerns about the CPU overhead of the Guava cache.
However, #23052 only addressed cache expiration for the broker component. Since the TopicName cache is included in the Pulsar client, it should be self-contained without requiring additional maintenance tasks to keep the cache size bounded.
Modifications
Introduces a new NameCache class that uses ConcurrentHashMap with SoftReference values to allow memory-pressure-based garbage collection of unreferenced cache entries; a simplified sketch of this approach is shown at the end of this section.
Key Benefits
Performance Characteristics
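A simplified sketch of the soft-reference caching approach described above (an assumed shape for illustration only; the actual NameCache in this PR includes additional machinery such as periodic maintenance):

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Simplified illustration of a soft-value cache: values are held via
// SoftReference so the GC can reclaim unreferenced entries under memory
// pressure; cleared entries are simply recomputed on the next lookup.
public class SoftValueCache<K, V> {
    private final ConcurrentHashMap<K, SoftReference<V>> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    public SoftValueCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = ref != null ? ref.get() : null;
        if (value != null) {
            return value;
        }
        // Entry missing or cleared by the GC: recompute and store a fresh soft ref.
        V loaded = loader.apply(key);
        map.put(key, new SoftReference<>(loaded));
        return loaded;
    }

    // Drops map entries whose soft references have been cleared, so the map
    // itself does not accumulate dead entries.
    public void purgeClearedEntries() {
        map.entrySet().removeIf(e -> e.getValue().get() == null);
    }
}
```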
Documentation
doc
doc-required
doc-not-needed
doc-complete