
[fix] Fix memory leak in TopicName cache and reduce heap memory usage through deduplication #24457


Open
wants to merge 41 commits into base: master

Conversation

lhotari
Member

@lhotari lhotari commented Jun 24, 2025

Fixes #24445

Motivation

There is currently a memory leak in the Pulsar Proxy and Pulsar client caused by the TopicName cache. The cache grows without bounds and eventually consumes all available heap memory when a high cardinality of topic names is resolved during the lifetime of a single JVM.

Before PR #23052, Guava's LoadingCache was used to limit the cache size to 100,000 entries. That was replaced with ConcurrentHashMap due to CPU overhead concerns with Guava compared to plain ConcurrentHashMap.

However, #23052 only addressed cache expiration for the broker component. Since the TopicName cache is included in the Pulsar client, it should be self-contained without requiring additional maintenance tasks to keep the cache size bounded.

Modifications

  • Revert [improve] [broker] Improve CPU resources usege of TopicName Cache #23052 changes: Remove the broker-specific cache management that was added in the original PR
  • Add generic NameCache implementation: Create an abstract NameCache class that uses ConcurrentHashMap with SoftReference values to allow memory pressure-based garbage collection of unreferenced cache entries
  • Implement memory-efficient caching: Use soft references to ensure cache entries are automatically cleared when the JVM runs low on memory
  • Add string deduplication: Use Guava's Interner class to deduplicate String instances of TopicName and NamespaceName components (tenant, namespace, localName, etc.) to reduce heap memory usage
  • Performance optimization: Implement custom cache that is ~3x more performant than Caffeine cache with similar functionality
  • Add benchmarking: Include JMH benchmark to validate performance (achieves ~28M ops/second on Mac M3 Max)
  • Unified implementation: Both TopicNameCache and NamespaceNameCache extend the same generic NameCache implementation
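The combination of a ConcurrentHashMap with SoftReference values and string interning can be sketched roughly as follows. This is a minimal, JDK-only illustration of the idea; class and method names here are hypothetical and this is not the actual Pulsar implementation (which also runs a periodic maintenance task and uses Guava's Interner):

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Minimal sketch of a soft-reference name cache. Entries whose values have been
// garbage-collected under memory pressure are transparently recomputed.
class NameCacheSketch<K, V> {
    private final ConcurrentHashMap<K, SoftReference<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    // Deduplication in the spirit of Guava's Interner: map each string to a single
    // canonical instance so equal name components share one object on the heap.
    private static final ConcurrentHashMap<String, String> INTERNED = new ConcurrentHashMap<>();

    static String intern(String s) {
        String existing = INTERNED.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    NameCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        SoftReference<V> ref = cache.get(key);
        V value = (ref != null) ? ref.get() : null;
        if (value == null) {
            // Missing, or the SoftReference was cleared by the GC: parse again.
            value = loader.apply(key);
            cache.put(key, new SoftReference<>(value));
        }
        return value;
    }
}
```

A TopicName-specific cache would then supply a loader that parses the topic string and interns its tenant, namespace, and local-name components.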

Key Benefits

  1. Memory leak prevention: Cache automatically shrinks under memory pressure through soft references
  2. Heap memory reduction: String interning significantly reduces memory footprint for applications with many topic names
  3. Self-contained solution: No external maintenance tasks required - the cache manages itself
  4. Performance maintained: Despite added safety features, performance remains excellent
  5. Consistent behavior: Same caching strategy applied to both TopicName and NamespaceName

Performance Characteristics

  • End-to-end performance: ~28M TopicName.get operations per second (Mac M3 Max)
  • Custom implementation is ~3x faster than equivalent Caffeine cache
  • Memory usage significantly reduced through string deduplication
  • Automatic cache shrinking when memory pressure is detected

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@github-actions github-actions bot added the doc-not-needed label ("Your PR changes do not impact docs") Jun 24, 2025
@lhotari lhotari requested a review from codelipenghui June 24, 2025 16:00

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes a memory leak in the Pulsar Proxy and client by reverting the previous caching strategy and introducing a new NameCache implementation that uses soft references and string interning to reduce heap memory usage.

  • Introduces NameCache, TopicNameCache, and NamespaceNameCache classes to provide efficient, self-clearing caches.
  • Updates TopicName and NamespaceName to use interning when parsing topic names.
  • Removes obsolete configuration properties and scheduled tasks related to cache clearing.

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 1 comment.

File Description
pulsar-common/src/main/java/org/apache/pulsar/common/naming/TopicNameCache.java Added a new cache for TopicName using the NameCache abstraction.
pulsar-common/src/main/java/org/apache/pulsar/common/naming/TopicName.java Updated to use TopicNameCache and string interning.
pulsar-common/src/main/java/org/apache/pulsar/common/naming/NamespaceNameCache.java Added a new cache for NamespaceName using the NameCache abstraction.
pulsar-common/src/main/java/org/apache/pulsar/common/naming/NamespaceName.java Updated to use NamespaceNameCache and string interning.
pulsar-common/src/main/java/org/apache/pulsar/common/naming/NameCache.java Introduced the new NameCache base class with soft references and periodic maintenance.
pulsar-broker/src/test/resources/configurations/pulsar_broker_test_standalone.conf Removed obsolete cache configuration properties.
pulsar-broker/src/test/resources/configurations/pulsar_broker_test.conf Removed obsolete cache configuration properties.
pulsar-broker/src/test/java/org/apache/pulsar/common/naming/ServiceConfigurationTest.java Removed tests for now-obsolete cache configuration.
pulsar-broker/src/test/java/org/apache/pulsar/broker/service/StandaloneTest.java Removed assertions related to the obsolete cache configuration.
pulsar-broker/src/test/java/org/apache/pulsar/broker/PulsarServiceTest.java Removed cache configuration tests and related lookup tests.
pulsar-broker/src/main/java/org/apache/pulsar/broker/service/BrokerService.java Removed the scheduled cache clearance task.
pulsar-broker-common/src/main/java/org/apache/pulsar/broker/ServiceConfiguration.java Removed obsolete configuration fields concerning the cache.
microbench/src/main/java/org/apache/pulsar/common/naming/package-info.java New file addition for package-level documentation.
microbench/src/main/java/org/apache/pulsar/common/naming/TopicNameBenchmark.java Added a benchmark to validate the performance of TopicName lookup.
conf/broker.conf Removed obsolete cache configuration properties.

@lhotari lhotari changed the title [fix] Fix memory leak in Pulsar Proxy and Pulsar client caused by TopicName cache and reduce heap memory usage [fix] Fix memory leak in TopicName cache and reduce heap memory usage through deduplication Jun 24, 2025
@BewareMyPower
Contributor

I ran a test to compare the get performance with the old (existing) implementation. The LegacyTopicName is migrated from current TopicName in master.

benchmark.diff.txt

All keys have been inserted into the cache. The new implementation's performance gap seems acceptable (only 1/2 of the current implementation)

Benchmark      Mode  Cnt     Score     Error  Units
Main.testNew  thrpt    5  2013.904 ± 214.360  ops/s
Main.testOld  thrpt    5  4278.375 ±  82.866  ops/s

But when I increased the loop count from 10000 to 200000, the gap became large.

Benchmark      Mode  Cnt   Score   Error  Units
Main.testNew  thrpt    5   4.659 ± 0.565  ops/s
Main.testOld  thrpt    5  63.299 ± 3.645  ops/s

loop count: 1000000

Benchmark      Mode  Cnt   Score   Error  Units
Main.testNew  thrpt    5   0.506 ± 0.200  ops/s
Main.testOld  thrpt    5  11.127 ± 1.189  ops/s

@BewareMyPower
Contributor

BewareMyPower commented Jun 25, 2025

I changed TopicName's constructor to public and added a separate test:

    @Warmup(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
    @Measurement(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
    @Benchmark
    public void testNoCache() {
        for (int i = 0; i < 1000000; i++) {
            new TopicName("test-" + i);
        }
    }

loop count 1000000:

Main.testNew      thrpt    5   0.513 ± 0.204  ops/s
Main.testNoCache  thrpt    5   0.713 ± 0.361  ops/s
Main.testOld      thrpt    5  11.027 ± 0.648  ops/s

loop count 200000:

Main.testNew      thrpt    5   4.922 ± 0.609  ops/s
Main.testNoCache  thrpt    5   4.649 ± 0.188  ops/s
Main.testOld      thrpt    5  65.304 ± 1.194  ops/s

loop count 50000:

Benchmark          Mode  Cnt    Score     Error  Units
Main.testNew      thrpt    5  278.312 ± 150.342  ops/s
Main.testNoCache  thrpt    5   21.216 ±   1.151  ops/s
Main.testOld      thrpt    5  463.201 ±  17.506  ops/s

@lhotari
Member Author

lhotari commented Jun 25, 2025

I ran a test to compare the get performance with the old (existing) implementation. The LegacyTopicName is migrated from current TopicName in master.

benchmark.diff.txt

All keys have been inserted into the cache. The new implementation's performance gap seems acceptable (only 1/2 of the current implementation)

Benchmark      Mode  Cnt     Score     Error  Units
Main.testNew  thrpt    5  2013.904 ± 214.360  ops/s
Main.testOld  thrpt    5  4278.375 ±  82.866  ops/s

But when I increased the loop count from 10000 to 200000, the gap became large.

Benchmark      Mode  Cnt   Score   Error  Units
Main.testNew  thrpt    5   4.659 ± 0.565  ops/s
Main.testOld  thrpt    5  63.299 ± 3.645  ops/s

loop count: 1000000

Benchmark      Mode  Cnt   Score   Error  Units
Main.testNew  thrpt    5   0.506 ± 0.200  ops/s
Main.testOld  thrpt    5  11.127 ± 1.189  ops/s

This has an explanation: the soft-reference cache will push out entries that aren't in use (no strong reference held in other locations).

@lhotari
Member Author

lhotari commented Jun 25, 2025

@BewareMyPower please check the JMH flame graph and repro instructions in #24457 (comment) .

The flame graph also shows that it delegates directly to ConcurrentHashMap. That's why its performance is similar.

One missing detail in the test setup is setting heap parameters for the JMH forked JVM; without that, the results can vary due to the soft reference usage. I also noticed that the test would need to keep strong references in another data structure to simulate instances which are in use (as in a real broker or Pulsar client), since otherwise soft references could be cleared unless the initial heap size is reasonable. I believe you came across this with larger numbers of entries.

JMH often produces skewed results on macOS. Another reason is the lack of a JMH "black hole" (or returning a computed value) to consume results. What's your test setup like?
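The strong-reference point above can be demonstrated with a small JDK-only sketch (illustrative only, not part of the actual benchmark): a SoftReference whose referent is also strongly held elsewhere cannot be cleared, which is exactly the situation of TopicName instances that are in use in a real broker or client.

```java
import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.List;

// Simulates cache entries that are "in use": the strong reference held in
// strongRefs guarantees the SoftReference cannot be cleared by the GC.
public class StrongRefDemo {
    public static boolean softRefSurvivesWhileStronglyHeld() {
        List<Object> strongRefs = new ArrayList<>();
        Object inUse = new Object();
        strongRefs.add(inUse); // simulates a live TopicName held by application code
        SoftReference<Object> soft = new SoftReference<>(inUse);
        System.gc(); // a hint only; the referent stays strongly reachable regardless
        boolean alive = soft.get() != null;
        strongRefs.clear();
        return alive;
    }
}
```

In a JMH run, pinning the heap size of the forked JVM (for example via -Xms/-Xmx in the fork's jvmArgs) similarly keeps soft-reference clearing behavior reproducible across runs.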

@BewareMyPower
Contributor

You can check benchmark.diff.txt for how I set up the tests.

BTW, @codelipenghui mentioned "over engineering" here. However, IMO, the cache itself is over-engineering. I checked many places where TopicName is used; mostly it's just a temporary variable used for some conversion.

Both the existing solution and this solution have issues when a lot of different topic names are used. The existing solution relies on manually calling clearIfReachedMaxCapacity, and introducing a scheduled task also brings unnecessary complexity. The solution in this PR relies on GC to clean up entries from the cache; if any issue happens, it could be harder to diagnose.

Actually, I think we should remove the cache, which brings many issues that should never happen. From my perspective, introducing the cache is just a workaround for the poor implementation of TopicName's constructor, and it reflects a "GC PTSD" coding style in Pulsar.

@lhotari
Member Author

lhotari commented Jun 25, 2025

We should not dismiss the requested change so quickly before reaching the consensus. Such change that affects a widely used common class needs a more careful review.

@BewareMyPower There was no review yet, and there is a rationale for why this PR makes sense and improves Pulsar without causing performance regressions. I agree that in this case, it might take longer to reach consensus.

@lhotari
Member Author

lhotari commented Jun 25, 2025

You can check benchmark.diff.txt for how I set up the tests.

@BewareMyPower How did you set up JMH to run the benchmarks you shared? JMH results could be skewed for multiple reasons in this case, as explained in my previous comment, #24457 (comment)

@lhotari
Member Author

lhotari commented Jun 25, 2025

I changed TopicName's constructor to public and added a separate test:

    @Warmup(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
    @Measurement(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
    @Benchmark
    public void testNoCache() {
        for (int i = 0; i < 1000000; i++) {
            new TopicName("test-" + i);
        }
    }

@BewareMyPower This is an invalid JMH test case. In JMH, you need to use the "black hole" to consume computed values.
For example:

    @Warmup(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
    @Measurement(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
    @Benchmark
    public void testNoCache(Blackhole blackhole) {
        for (int i = 0; i < 1000000; i++) {
            blackhole.consume(new TopicName("test-" + i));
        }
    }

@lhotari
Member Author

lhotari commented Jun 25, 2025

    @Warmup(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
    @Measurement(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)

Another detail is the short benchmark duration. A better choice could be to run a single longer iteration:

    @Warmup(iterations = 1, time = 10, timeUnit = TimeUnit.SECONDS)
    @Measurement(iterations = 1, time = 10, timeUnit = TimeUnit.SECONDS)

BewareMyPower added a commit to BewareMyPower/pulsar that referenced this pull request Jun 25, 2025
@lhotari
Member Author

lhotari commented Jun 25, 2025

I improved the JMH benchmark and fixed some flaws in it.

MacOS results (will soon run on Linux):

Benchmark                                (testMode)   Mode  Cnt         Score          Error  Units
TopicNameBenchmark.topicLookup001             basic  thrpt    3  26378883.001 ±  3852072.154  ops/s
TopicNameBenchmark.topicLookup001   invalidateCache  thrpt    3  24498489.317 ±  6733474.934  ops/s
TopicNameBenchmark.topicLookup001  strongReferences  thrpt    3  27146611.212 ± 20090217.310  ops/s
TopicNameBenchmark.topicLookup010             basic  thrpt    3  22199659.357 ±  5861674.879  ops/s
TopicNameBenchmark.topicLookup010   invalidateCache  thrpt    3  21913068.107 ±  3784974.042  ops/s
TopicNameBenchmark.topicLookup010  strongReferences  thrpt    3  22537707.826 ±  4154289.677  ops/s
TopicNameBenchmark.topicLookup100             basic  thrpt    3  16356441.900 ± 12729573.003  ops/s
TopicNameBenchmark.topicLookup100   invalidateCache  thrpt    3  15799393.963 ±  5250569.159  ops/s
TopicNameBenchmark.topicLookup100  strongReferences  thrpt    3  16439089.083 ± 11551885.443  ops/s

@lhotari
Member Author

lhotari commented Jun 25, 2025

Linux results

Benchmark                                      (testMode)   Mode  Cnt         Score           Error  Units
TopicNameBenchmark.topicLookup001                   basic  thrpt    3   8296768.710 ±   3847917.674  ops/s
TopicNameBenchmark.topicLookup001:async             basic  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup001         invalidateCache  thrpt    3   7719755.753 ±   1550321.165  ops/s
TopicNameBenchmark.topicLookup001:async   invalidateCache  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup001        strongReferences  thrpt    3   8340214.037 ±   1532080.072  ops/s
TopicNameBenchmark.topicLookup001:async  strongReferences  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup010                   basic  thrpt    3  33723499.610 ±  77685435.770  ops/s
TopicNameBenchmark.topicLookup010:async             basic  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup010         invalidateCache  thrpt    3  29339606.271 ±  43286582.744  ops/s
TopicNameBenchmark.topicLookup010:async   invalidateCache  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup010        strongReferences  thrpt    3  37706919.154 ±  33147758.020  ops/s
TopicNameBenchmark.topicLookup010:async  strongReferences  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup100                   basic  thrpt    3  37475011.818 ± 159568954.829  ops/s
TopicNameBenchmark.topicLookup100:async             basic  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup100         invalidateCache  thrpt    3  33964067.415 ±  28649039.174  ops/s
TopicNameBenchmark.topicLookup100:async   invalidateCache  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup100        strongReferences  thrpt    3  41307757.659 ±  48036888.867  ops/s
TopicNameBenchmark.topicLookup100:async  strongReferences  thrpt                NaN                    ---

flamegraph

@BewareMyPower
Contributor

Well, based on my improvement in #24463, I migrated the cache implementation of this PR. The benchmark ran on an Ubuntu 22.04 runner in GitHub Actions; here is the result from https://github.com/BewareMyPower/pulsar/actions/runs/15872123416/job/44750991782?pr=45:

TopicNameBenchmark.testConstruct      thrpt       254.510          ops/s
TopicNameBenchmark.testReadFromCache  thrpt       294.822          ops/s

The result is similar to my benchmark on macOS:

TopicNameBenchmark.testConstruct      thrpt       345.498          ops/s
TopicNameBenchmark.testReadFromCache  thrpt       375.209          ops/s

Both show that getting the topic name from the cache is not significantly better than my latest optimization.

Please leave comments directly in BewareMyPower#45 if you think there is something wrong with my benchmark code since you said

I wouldn't trust JMH results from benchmarking on Mac OS.

@lhotari
Member Author

lhotari commented Jun 25, 2025

Both shows getting the topic name from cache is not significantly better than my latest optimization.

Great work @BewareMyPower . The main reason I wouldn't remove the cache is its deduplication functionality. If TopicName instances are strongly referenced, the more there are, the worse the problem becomes.

Please leave comments directly in BewareMyPower#45 if you think there is something wrong with my benchmark code since you said

I wouldn't trust JMH results from benchmarking on Mac OS.

The benchmark code looks fine now. On macOS, the results could simply differ from Linux x86_64 in many cases, especially if any code calls System.nanoTime(), which is many times slower on macOS than on Linux.
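The System.nanoTime() cost can be checked directly with a JDK-only micro-measurement (a rough, illustrative sketch; absolute numbers vary widely by OS and hardware):

```java
// Roughly measures the average cost of a System.nanoTime() call by timing a
// tight loop of calls. On macOS this call tends to be several times more
// expensive than on Linux, which skews benchmarks invoking it per operation.
public class NanoTimeCost {
    public static double avgNanosPerCall(int iterations) {
        long start = System.nanoTime();
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            sink += System.nanoTime();
        }
        long elapsed = System.nanoTime() - start;
        // Use the accumulated value so the loop cannot be optimized away.
        if (sink == 42) {
            System.out.println("unlikely");
        }
        return (double) elapsed / iterations;
    }
}
```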

lhotari added 2 commits June 25, 2025 12:52
…rentTimeMillis instead of nanoTime

- postpone shrinking the cache until the maintenance task runs
@lhotari
Member Author

lhotari commented Jun 25, 2025

Latest benchmark results on Linux after removing the System.nanoTime usage and moving cache maintenance into the single-threaded maintenance task:

Benchmark                                      (testMode)   Mode  Cnt         Score           Error  Units
TopicNameBenchmark.topicLookup001                   basic  thrpt    3   8772255.043 ±   2202646.758  ops/s
TopicNameBenchmark.topicLookup001:async             basic  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup001         invalidateCache  thrpt    3   7787641.717 ±   4826933.972  ops/s
TopicNameBenchmark.topicLookup001:async   invalidateCache  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup001        strongReferences  thrpt    3   8697105.056 ±   2378331.526  ops/s
TopicNameBenchmark.topicLookup001:async  strongReferences  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup010                   basic  thrpt    3  59980650.077 ±  18606653.977  ops/s
TopicNameBenchmark.topicLookup010:async             basic  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup010         invalidateCache  thrpt    3  54319848.644 ± 104656951.523  ops/s
TopicNameBenchmark.topicLookup010:async   invalidateCache  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup010        strongReferences  thrpt    3  61755152.726 ±  28077697.228  ops/s
TopicNameBenchmark.topicLookup010:async  strongReferences  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup100                   basic  thrpt    3  93982448.999 ±  36720417.649  ops/s
TopicNameBenchmark.topicLookup100:async             basic  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup100         invalidateCache  thrpt    3  81061925.069 ±  55132917.420  ops/s
TopicNameBenchmark.topicLookup100:async   invalidateCache  thrpt                NaN                    ---
TopicNameBenchmark.topicLookup100        strongReferences  thrpt    3  73160471.462 ±  47659030.231  ops/s
TopicNameBenchmark.topicLookup100:async  strongReferences  thrpt                NaN                    ---

flamegraph

recent detail: (screenshot)
It would require a longer run to get more accurate results, but it looks like the performance is now very close to ConcurrentHashMap.get. Obviously running the benchmark with the current master branch would be a way to get an accurate comparison. I'll handle that later.

Contributor

@poorbarcode poorbarcode left a comment


@lhotari
Member Author

lhotari commented Jun 25, 2025

#24457 (comment)
#24457 (comment)
#24457 (comment)

@poorbarcode we can proceed with your PR and come back to the usefulness of a soft reference cache and instance deduplication in optimizing memory usage of topic pattern listing / watching use cases.

@ben-manes

ben-manes commented Jun 26, 2025

A few observations if helpful,

  • It looks like Pulsar is using an old version of Caffeine (2.9.1; May 2021) in order to run on Java 8. Is that still necessary or can you use the current version (3.2.1; June 2025)?
  • You were using expireAfterAccess, which incurs an expensive system-time call on every read, while the replacement has no time threshold and an optional space limit. Can you use a size bound for a fairer comparison?
  • The maintenance is O(n) on the caller thread, which is likely fine but is a caller-facing penalty. The scan is of the underlying array, which at the previous setting of 100K entries would be up to 175-250K slots to traverse. That can have surprisingly bad performance when not handled properly (e.g. the clustering effect breaks ehcache's sampled LRU assumptions).
    • Caffeine uses strictly amortized O(1) algorithms that are inexpensive. However by default it will defer to ForkJoinPool.commonPool() since we don't know the usage costs of foreign code (e.g. CHM blocks on a hashbin collision so a long compute could stall eviction). You can use executor(Runnable::run) in most cases.
  • If soft references are overused, they have traditionally been costly by increasing GC pauses, which makes them look good in benchmarks but very problematic in real workloads, slowing the whole system down. They can cause old-gen GC thrashing, so newer GCs will often evict aggressively, leaving less content cached on a real system. The eviction will likely discard useful content due to the global LRU, where a polluter of inexpensive data forces the eviction of valuable, expensive content. This spooky action at a distance, caused by the eviction policy not being localized, can be harder to debug and causes overall slower system performance as projects evolve and the tradeoffs are forgotten.
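The size-bounded alternative suggested above can be sketched with JDK classes alone. This is a minimal, non-thread-safe illustration; with Caffeine on the classpath, something like Caffeine.newBuilder().maximumSize(n).executor(Runnable::run).build(loader) would be the realistic equivalent.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A minimal size-bounded LRU cache using only the JDK, illustrating a space
// limit instead of time-based expiry or soft references. Not thread-safe;
// shown only to make the "size bounding" comparison concrete.
class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // access-order iteration gives LRU behavior
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least-recently-accessed entry once the bound is exceeded.
        return size() > maxEntries;
    }
}
```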

Successfully merging this pull request may close these issues.

[Bug] Memory leak in Pulsar Proxy with TopicName
8 participants