
Groundwork to demonstrate FlyWeight transient memory consumption #245

Merged: 2 commits merged into RoaringBitmap:master from blacelle:MeasureIteratorTransientHeap on Apr 18, 2018

Conversation

@blacelle (Member)

This follows the work done in #243 and the discussion in #244.

This first commit introduces a test class (TestIteratorMemory), based on IteratorsBenchmark32, to provide some transient memory allocation measurements. It typically reports:

Standard Iteration allocated: 765872
FlyWeight Iteration allocated: 904
Boxed Iteration allocated: 2364816

It provides figures only under the Oracle JVM (as it relies on com.sun.management.ThreadMXBean), but the tests should remain green, i.e. not throw, under other JVMs.

@coveralls commented Apr 15, 2018

Coverage decreased (-0.008%) to 91.313% when pulling b125873 on blacelle:MeasureIteratorTransientHeap into 5d61acf on RoaringBitmap:master.

@richardstartin (Member)

This measures allocation rates in interpreted code, right? What if C2 escape analysis decides to scalarise half those allocations once the code gets hot?

I would prefer to use -prof gc with JMH, and refer to the normalised allocation rate, GC count, and GC time values it produces. Since flyweights trade collections for stores, and next-generation garbage collectors are making collections cheaper at the expense of stores, we need to be able to control the GC algorithm (which JMH's @Fork annotation allows).

@richardstartin (Member) commented Apr 15, 2018

I added these annotations to IteratorsBenchmark32, choosing G1 as the collector.

@Threads(8) // provoke any possible contention on auxiliary data structures
@Measurement(time = 30) // this should be long enough for accumulated garbage to make a difference
@Fork(value = 1, jvmArgsPrepend = {"-XX:-TieredCompilation", "-XX:+UseG1GC"}) // skip C1, use G1

The object graph in the benchmark isn't complex enough to make garbage collection difficult, and you observe no benefit from flyweights in this benchmark. Here are the garbage collection statistics (only measured with hot compiled code) showing that there is a lot of garbage generated by the standard iterator, and virtually none by the flyweight.

Benchmark                                                             Mode  Cnt       Score         Error   Units
IteratorsBenchmark32.testFlyweight_a                                  avgt    5       2.067 ±       0.357   ms/op
IteratorsBenchmark32.testFlyweight_a:·gc.alloc.rate                   avgt    5      ≈ 10⁻⁴                MB/sec
IteratorsBenchmark32.testFlyweight_a:·gc.alloc.rate.norm              avgt    5       0.068 ±       0.012    B/op
IteratorsBenchmark32.testFlyweight_a:·gc.count                        avgt    5         ≈ 0                counts
IteratorsBenchmark32.testFlyweight_b                                  avgt    5     257.347 ±      37.047   ms/op
IteratorsBenchmark32.testFlyweight_b:·gc.alloc.rate                   avgt    5      ≈ 10⁻⁴                MB/sec
IteratorsBenchmark32.testFlyweight_b:·gc.alloc.rate.norm              avgt    5       8.296 ±       1.116    B/op
IteratorsBenchmark32.testFlyweight_b:·gc.count                        avgt    5         ≈ 0                counts
IteratorsBenchmark32.testFlyweight_c                                  avgt    5    2883.477 ±      25.115   ms/op
IteratorsBenchmark32.testFlyweight_c:·gc.alloc.rate                   avgt    5      ≈ 10⁻⁴                MB/sec
IteratorsBenchmark32.testFlyweight_c:·gc.alloc.rate.norm              avgt    5      83.974 ±       2.417    B/op
IteratorsBenchmark32.testFlyweight_c:·gc.count                        avgt    5         ≈ 0                counts
IteratorsBenchmark32.testStandard_a                                   avgt    5       2.036 ±       0.310   ms/op
IteratorsBenchmark32.testStandard_a:·gc.alloc.rate                    avgt    5    2809.553 ±     419.834  MB/sec
IteratorsBenchmark32.testStandard_a:·gc.alloc.rate.norm               avgt    5  748688.068 ±       0.010    B/op
IteratorsBenchmark32.testStandard_a:·gc.churn.G1_Eden_Space           avgt    5    2815.227 ±     421.814  MB/sec
IteratorsBenchmark32.testStandard_a:·gc.churn.G1_Eden_Space.norm      avgt    5  750205.142 ±   18894.732    B/op
IteratorsBenchmark32.testStandard_a:·gc.churn.G1_Old_Gen              avgt    5       0.001 ±       0.003  MB/sec
IteratorsBenchmark32.testStandard_a:·gc.churn.G1_Old_Gen.norm         avgt    5       0.400 ±       0.826    B/op
IteratorsBenchmark32.testStandard_a:·gc.count                         avgt    5     245.000                counts
IteratorsBenchmark32.testStandard_a:·gc.time                          avgt    5     723.000                    ms
IteratorsBenchmark32.testStandard_b                                   avgt    5     242.516 ±       9.946   ms/op
IteratorsBenchmark32.testStandard_b:·gc.alloc.rate                    avgt    5      12.357 ±       0.499  MB/sec
IteratorsBenchmark32.testStandard_b:·gc.alloc.rate.norm               avgt    5  393255.831 ±       0.300    B/op
IteratorsBenchmark32.testStandard_b:·gc.churn.G1_Eden_Space           avgt    5      18.732 ±      98.787  MB/sec
IteratorsBenchmark32.testStandard_b:·gc.churn.G1_Eden_Space.norm      avgt    5  602894.992 ± 3179726.854    B/op
IteratorsBenchmark32.testStandard_b:·gc.churn.G1_Survivor_Space       avgt    5       1.225 ±      10.549  MB/sec
IteratorsBenchmark32.testStandard_b:·gc.churn.G1_Survivor_Space.norm  avgt    5   39282.001 ±  338229.877    B/op
IteratorsBenchmark32.testStandard_b:·gc.count                         avgt    5       2.000                counts
IteratorsBenchmark32.testStandard_b:·gc.time                          avgt    5     114.000                    ms
IteratorsBenchmark32.testStandard_c                                   avgt    5    2878.964 ±     208.415   ms/op
IteratorsBenchmark32.testStandard_c:·gc.alloc.rate                    avgt    5       1.385 ±       0.100  MB/sec
IteratorsBenchmark32.testStandard_c:·gc.alloc.rate.norm               avgt    5  524403.065 ±       4.899    B/op
IteratorsBenchmark32.testStandard_c:·gc.count                         avgt    5         ≈ 0                counts

In conclusion, the allocation rate is clearly higher with standard iterators! But this is to be expected; the real question is whether it is actually harmful.

What would be interesting is to generate an object graph that is difficult to collect, run some background work that mutates it during the benchmark, and then run the benchmark itself. For the record, I don't have anything against flyweights.

@lemire (Member) commented Apr 18, 2018

Ok guys... do we merge this?

@richardstartin (Member)

@lemire @blacelle I guess my point was that we already have this information available: I produced the data above with OpenJDK, confident that it does not relate to interpreted code. I also ran the same test with an early-access OpenJDK Shenandoah build, which was only possible because I was using JMH.

@lemire (Member) commented Apr 18, 2018

@richardstartin I am trying to understand where this is going.

My view is that @blacelle is establishing that flyweight iterations generate fewer allocations. That might be "self-evident", but we know not to take anything for granted.

Then Richard points out that, in JITed code, the result could be different. Ok. That seems like an excellent question.

Benoit, what is your take on this? Assuming that you agree that you are not measuring the result of JITed code, is there any easy fix for this?

Another point that Richard raises is whether the extra allocation is harmful. I think we agree that it is a different question, but maybe we want to go one step at a time.

I think that Richard's implicit point is that it is a myth that GC pressure is a performance bottleneck. The problem is that it is hard to prove a negative. That is, how can you prove that flyweight iterators never help?

The counterpart is more interesting: can we come up with an example where flyweight iterators help?

Ok... but we have to start somewhere and this PR is a start.

@richardstartin (Member)

@lemire on the contrary: the JMH profiling demonstrates clearly that flyweights allocate a lot less memory. I’m pointing out we can already use JMH -prof GC to check that.

@lemire (Member) commented Apr 18, 2018

@richardstartin

Yes, I understand that we have established that flyweight iterators produce far fewer allocations even with JITed code. But that's not where I am at.

I am facing a screen. There is a button "squash and merge". I am trying to decide whether to push it. That is all.

Let me simplify the question. We have a choice.

  1. We can drop this PR.
  2. We can merge this PR as-is.
  3. We can ask for changes.

Given this choice, what are you saying?

@richardstartin (Member)

Better information is already available via JMH.

@blacelle (Member, Author)

Leaving aside the question of how com.sun.management.ThreadMXBean computes memory allocation (especially with respect to escape analysis), I feel this PR is still useful for writing unit tests. JMH excels at producing benchmarks, and @richardstartin provided valuable insights into GC behaviour with small JMH tweaks: this is excellent.

Still, this PR can be pushed further in a future commit to ensure FlyWeight iterators do not allocate more memory. The output of the proposed JUnit tests is very simple to check and read; simpler, in my view, than analyzing JMH output. The tests also run much more quickly.

In short, this PR enables automated checks on memory allocation; it does not aim at precise benchmark figures (it is certainly less rich than JMH in that respect).

@lemire (Member) commented Apr 18, 2018

Merging. We can always remove it later.

@lemire merged commit 02ea66a into RoaringBitmap:master on Apr 18, 2018
@richardstartin (Member)

Automated checks relating to allocation in interpreted code, which are essentially meaningless. The test produces a number, but the wrong number.

@ppiotrow (Member)

A few weeks ago I played with the Google allocation instrumenter. Since that code is ready, I will allow myself to open a PR so we can at least compare results.

@ppiotrow mentioned this pull request on Apr 19, 2018
Smallhi pushed a commit to Smallhi/RoaringBitmap that referenced this pull request Jun 14, 2021
…ringBitmap#245)

* Groundwork to demonstrate FlyWeight transient memory consumption

* Fix code style