
Groundwork to demonstrate FlyWeight transient memory consumption #245

Merged: 2 commits merged into RoaringBitmap:master from blacelle:MeasureIteratorTransientHeap on Apr 18, 2018

Conversation

@blacelle (Member)

This follows the work done in #243 and the discussion in #244.

This first commit introduces a test class (TestIteratorMemory), based on IteratorsBenchmark32, to provide some transient memory allocation measurements. It typically reports:

Standard Iteration allocated: 765872
FlyWeight Iteration allocated: 904
Boxed Iteration allocated: 2364816

It provides figures only under the Oracle JVM (as it relies on com.sun.management.ThreadMXBean), but the tests should remain green, i.e. not throw, under other JVMs.

@coveralls commented Apr 15, 2018

Coverage decreased (-0.008%) to 91.313% when pulling b125873 on blacelle:MeasureIteratorTransientHeap into 5d61acf on RoaringBitmap:master.

@richardstartin (Member)

This measures allocation rates in interpreted code, right? What if C2 escape analysis decides to scalarise half those allocations once the code gets hot?

I would prefer to use -prof gc with JMH, and refer to the normalised allocation rate, GC count, and GC time values it produces. Since flyweights trade collections for stores, and next-generation garbage collectors are making collections cheaper at the expense of stores, we need to be able to control the GC algorithm (which JMH's @Fork annotation allows).

@richardstartin (Member) commented Apr 15, 2018

I added these annotations to IteratorsBenchmark32, choosing G1 as the collector.

@Threads(8) // provoke any possible contention on auxiliary data structures
@Measurement(time = 30) // this should be long enough for accumulated garbage to make a difference
@Fork(value = 1, jvmArgsPrepend = {"-XX:-TieredCompilation", "-XX:+UseG1GC"}) // skip C1, use G1

The object graph in the benchmark isn't complex enough to make garbage collection difficult, and you observe no benefit from flyweights in this benchmark. Here are the garbage collection statistics (only measured with hot compiled code) showing that there is a lot of garbage generated by the standard iterator, and virtually none by the flyweight.

Benchmark                                                             Mode  Cnt       Score         Error   Units
IteratorsBenchmark32.testFlyweight_a                                  avgt    5       2.067 ±       0.357   ms/op
IteratorsBenchmark32.testFlyweight_a:·gc.alloc.rate                   avgt    5      ≈ 10⁻⁴                MB/sec
IteratorsBenchmark32.testFlyweight_a:·gc.alloc.rate.norm              avgt    5       0.068 ±       0.012    B/op
IteratorsBenchmark32.testFlyweight_a:·gc.count                        avgt    5         ≈ 0                counts
IteratorsBenchmark32.testFlyweight_b                                  avgt    5     257.347 ±      37.047   ms/op
IteratorsBenchmark32.testFlyweight_b:·gc.alloc.rate                   avgt    5      ≈ 10⁻⁴                MB/sec
IteratorsBenchmark32.testFlyweight_b:·gc.alloc.rate.norm              avgt    5       8.296 ±       1.116    B/op
IteratorsBenchmark32.testFlyweight_b:·gc.count                        avgt    5         ≈ 0                counts
IteratorsBenchmark32.testFlyweight_c                                  avgt    5    2883.477 ±      25.115   ms/op
IteratorsBenchmark32.testFlyweight_c:·gc.alloc.rate                   avgt    5      ≈ 10⁻⁴                MB/sec
IteratorsBenchmark32.testFlyweight_c:·gc.alloc.rate.norm              avgt    5      83.974 ±       2.417    B/op
IteratorsBenchmark32.testFlyweight_c:·gc.count                        avgt    5         ≈ 0                counts
IteratorsBenchmark32.testStandard_a                                   avgt    5       2.036 ±       0.310   ms/op
IteratorsBenchmark32.testStandard_a:·gc.alloc.rate                    avgt    5    2809.553 ±     419.834  MB/sec
IteratorsBenchmark32.testStandard_a:·gc.alloc.rate.norm               avgt    5  748688.068 ±       0.010    B/op
IteratorsBenchmark32.testStandard_a:·gc.churn.G1_Eden_Space           avgt    5    2815.227 ±     421.814  MB/sec
IteratorsBenchmark32.testStandard_a:·gc.churn.G1_Eden_Space.norm      avgt    5  750205.142 ±   18894.732    B/op
IteratorsBenchmark32.testStandard_a:·gc.churn.G1_Old_Gen              avgt    5       0.001 ±       0.003  MB/sec
IteratorsBenchmark32.testStandard_a:·gc.churn.G1_Old_Gen.norm         avgt    5       0.400 ±       0.826    B/op
IteratorsBenchmark32.testStandard_a:·gc.count                         avgt    5     245.000                counts
IteratorsBenchmark32.testStandard_a:·gc.time                          avgt    5     723.000                    ms
IteratorsBenchmark32.testStandard_b                                   avgt    5     242.516 ±       9.946   ms/op
IteratorsBenchmark32.testStandard_b:·gc.alloc.rate                    avgt    5      12.357 ±       0.499  MB/sec
IteratorsBenchmark32.testStandard_b:·gc.alloc.rate.norm               avgt    5  393255.831 ±       0.300    B/op
IteratorsBenchmark32.testStandard_b:·gc.churn.G1_Eden_Space           avgt    5      18.732 ±      98.787  MB/sec
IteratorsBenchmark32.testStandard_b:·gc.churn.G1_Eden_Space.norm      avgt    5  602894.992 ± 3179726.854    B/op
IteratorsBenchmark32.testStandard_b:·gc.churn.G1_Survivor_Space       avgt    5       1.225 ±      10.549  MB/sec
IteratorsBenchmark32.testStandard_b:·gc.churn.G1_Survivor_Space.norm  avgt    5   39282.001 ±  338229.877    B/op
IteratorsBenchmark32.testStandard_b:·gc.count                         avgt    5       2.000                counts
IteratorsBenchmark32.testStandard_b:·gc.time                          avgt    5     114.000                    ms
IteratorsBenchmark32.testStandard_c                                   avgt    5    2878.964 ±     208.415   ms/op
IteratorsBenchmark32.testStandard_c:·gc.alloc.rate                    avgt    5       1.385 ±       0.100  MB/sec
IteratorsBenchmark32.testStandard_c:·gc.alloc.rate.norm               avgt    5  524403.065 ±       4.899    B/op
IteratorsBenchmark32.testStandard_c:·gc.count                         avgt    5         ≈ 0                counts

In conclusion, the allocation rate is clearly higher with standard iterators! But this is to be expected; the real question is whether it is actually harmful.

What would be interesting is to generate an object graph that is difficult to collect, run some background work that mutates it during the benchmark, and then run the benchmark itself. For the record, I don't have anything against flyweights.

@lemire (Member) commented Apr 18, 2018

Ok guys... do we merge this?

@richardstartin (Member)

@lemire @blacelle I guess my point was that we already have this information available: I produced the data above with OpenJDK, confident that it does not relate to interpreted code. I also ran the same test with an early-access OpenJDK Shenandoah build, which was only possible because I was using JMH.

@lemire (Member) commented Apr 18, 2018

@richardstartin I am trying to understand where this is going.

My view is that @blacelle is establishing that flyweight iterations generate fewer allocations. That might be "self-evident", but we know not to take anything for granted.

Then Richard points out that, in JITed code, the result could be different. Ok. That seems like an excellent question.

Benoit, what is your take on this? Assuming that you agree that you are not measuring the result of JITed code, is there any easy fix for this?

Another point that Richard raises is whether the extra allocation is harmful. I think we agree that it is a different question, but maybe we want to go one step at a time.

I think that Richard's implicit point is that it is a myth that GC pressure is a performance bottleneck. The problem is that it is hard to prove a negative. That is, how can you prove that flyweight iterators never help?

The counterpart is more interesting: can we come up with an example where flyweight iterators help?

Ok... but we have to start somewhere and this PR is a start.

@richardstartin (Member)

@lemire on the contrary: the JMH profiling demonstrates clearly that flyweights allocate a lot less memory. I’m pointing out we can already use JMH -prof GC to check that.

@lemire (Member) commented Apr 18, 2018

@richardstartin

Yes, I understand that we have established that flyweight iterators produce far fewer allocations even with JITed code. But that's not where I am at.

I am facing a screen. There is a button "squash and merge". I am trying to decide whether to push it. That is all.

Let me simplify the question. We have a choice.

  1. We can drop this PR.
  2. We can merge this PR as-is.
  3. We can ask for changes.

Given this choice, what are you saying?

@richardstartin (Member)

Better information is already available via JMH.

@blacelle (Member, Author)

Leaving aside the question of how com.sun.management.ThreadMXBean computes memory allocation (especially with respect to escape analysis), I feel this PR is still useful for writing unit tests. JMH excels at producing benchmarks, and @richardstartin provided valuable insights into GC behaviour with small JMH tweaks: this is excellent.

Still, this PR can be pushed further in a future commit to ensure FlyWeight iterators do not allocate more memory. The output of the proposed JUnit tests is very simple to check and read; simpler, in my view, than analyzing JMH output. The tests also run much more quickly.

In short, this PR enables automated checks on memory allocation; it does not aim at precise benchmark figures (it is certainly less rich than JMH in that respect).

@lemire (Member) commented Apr 18, 2018

Merging. We can always remove it later.

@lemire merged commit 02ea66a into RoaringBitmap:master on Apr 18, 2018
@richardstartin (Member)

Automated checks relating to allocation in interpreted code, which are essentially meaningless. The test produces a number, but the wrong number.

@ppiotrow (Member)

A few weeks ago I played with the Google allocation instrumenter. Since that code is ready, I will allow myself to open a PR so we can at least compare results.

@ppiotrow mentioned this pull request on Apr 19, 2018
Smallhi pushed a commit to Smallhi/RoaringBitmap that referenced this pull request Jun 14, 2021
…ringBitmap#245)

* Groundwork to demonstrate FlyWeight transient memory consumption

* Fix code style