
Profiling Memory Usage and Object Creation #1204

Closed
benjchristensen opened this issue May 16, 2014 · 54 comments

@benjchristensen
Member

We need to spend time profiling memory and object allocation and finding places where we can improve.

I would really appreciate help diving into this and finding problem areas. Even if you don't fix them, just identifying use cases, operators, etc. would be very valuable.

This is partly a result of the fact that in Netflix production we have seen an increase in YoungGen GCs since 0.17.x.

The areas to start with should probably be:

  • Observable.create
  • Observable.lift
  • Subscriber
  • CompositeSubscription
  • map
  • flatMap

If you can or want to get involved, please comment here so we can all collaborate.

@akarnokd
Member

The most likely source of garbage is the CompositeSubscription: whenever a task is scheduled or a new merge-source appears, the state transition creates a new State object and copies the subscription array. PR #1145 reduces the amount of garbage by switching to HashSet on a larger composite. Since flatMap->mergeMap->merge needs to track the active inner subscriptions, it uses a composite as well and would gain the same benefits.
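
For illustration, the copy-on-write pattern described above looks roughly like this (a simplified sketch with illustrative names, not the actual rx.subscriptions.CompositeSubscription source): every add CASes in a brand-new State whose subscription array is a copy of the old one, so each scheduled task or inner merge source allocates both an array and a State holder.

    import java.util.Arrays;
    import java.util.concurrent.atomic.AtomicReference;

    // Simplified sketch of a copy-on-write composite; names are illustrative.
    final class CopyOnWriteComposite {
        private static final class State {
            final Object[] subscriptions;
            State(Object[] subscriptions) { this.subscriptions = subscriptions; }
        }

        private final AtomicReference<State> state =
                new AtomicReference<State>(new State(new Object[0]));

        void add(Object subscription) {
            State oldState;
            State newState;
            do {
                oldState = state.get();
                // Every add copies the array and wraps it in a new State object;
                // this is the per-schedule / per-merge-source garbage described above.
                Object[] copy = Arrays.copyOf(oldState.subscriptions,
                                              oldState.subscriptions.length + 1);
                copy[copy.length - 1] = subscription;
                newState = new State(copy);
            } while (!state.compareAndSet(oldState, newState));
        }
    }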

@benjchristensen
Member Author

The HashSet change shouldn't kick in though, because these aren't generally large lists (a handful at most, not hundreds). I expect it's just far too many of them being created.

@daschl
Contributor

daschl commented May 18, 2014

I'd also like to help out here, since it's crucial for our use cases as well.

@benjchristensen
Member Author

Thanks @daschl I'd appreciate your help. Profiling and identifying hot spots is what we need most right now.

@daschl
Contributor

daschl commented May 20, 2014

I did some GC profiling of my test workloads and I'd also like to nominate:

  • BlockingObservable (it seems that the .single() case in particular is not optimized)
  • Observable.subscribe (for both Observer and Subscriber) - in my GC logs these account for 40% of the overall GC pressure.

The bad news is that I had to fall back from Observables to plain execution on the hot code path (aside from the overall wrapping Observable), because using them there produces far too much garbage. Moving away from Rx on the hot code path took my throughput from 20% to 80% according to the GC logs, which matches my earlier findings: I could not sustain constant IO throughput because full GCs were happening far too frequently.

@benjchristensen
Member Author

Observable.subscribe (for both Observer and Subscriber)

Not surprised on this. Were you able to identify what the garbage is?

benjchristensen added this to the 0.19 milestone May 20, 2014
@daschl
Contributor

daschl commented May 20, 2014

I made some changes in my code and will re-profile and post the results here so we get better measurements.

Here is a slightly older profile run which might give you a hint or two.
[screenshot: profile run, 2014-05-20 10:13:10]
[screenshot: profile run, 2014-05-20 10:13:00]

@benjchristensen
Member Author

We can definitely improve on the AtomicReference by using AtomicReferenceFieldUpdater. The others will require more effort to analyze, along with sample code to see what's triggering them.
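
For reference, the conversion looks roughly like this (a minimal sketch with illustrative class and field names): the per-instance AtomicReference wrapper goes away and a single static updater drives a plain volatile field with the same compare-and-set semantics.

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

    // Before: every instance allocates a separate AtomicReference wrapper object.
    class WithAtomicReference {
        private final AtomicReference<Object> state = new AtomicReference<Object>();

        boolean casState(Object expect, Object update) {
            return state.compareAndSet(expect, update);
        }
    }

    // After: a plain volatile field plus one shared static updater, so there is
    // no extra allocation per instance but the CAS semantics are unchanged.
    class WithFieldUpdater {
        private volatile Object state;

        private static final AtomicReferenceFieldUpdater<WithFieldUpdater, Object> STATE =
                AtomicReferenceFieldUpdater.newUpdater(WithFieldUpdater.class, Object.class, "state");

        boolean casState(Object expect, Object update) {
            return STATE.compareAndSet(this, expect, update);
        }
    }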

@akarnokd
Member

I think most of the AtomicReference allocations come from the CompositeSubscription instances.

@benjchristensen
Member Author

I went back in history to 0.16.1 to compare performance of the basic map/flatMap behavior and found that current master is faster. This fits what I expected from the work of the past couple of months, which means there is no obvious glaring issue here (unless my test is completely flawed).

Here is the code for the test:
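
(The benchmark source isn't embedded in this comment. As a rough, hypothetical reconstruction based on the benchmark names in the results below and the @GenerateMicroBenchmark / UseCaseInput pattern quoted later in this thread, the transform tests could look something like the following; the UseCaseInput setup shown here is an assumption.)

    import java.util.concurrent.CountDownLatch;

    import org.openjdk.jmh.annotations.GenerateMicroBenchmark;
    import org.openjdk.jmh.annotations.Level;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    import rx.Observable;
    import rx.Observer;

    // Hedged reconstruction of the transform benchmarks; the real PerfTransforms
    // and UseCaseInput classes in the perf module may differ in detail.
    public class PerfTransforms {

        @GenerateMicroBenchmark
        public void mapTransformation(UseCaseInput input) throws InterruptedException {
            input.observable.map(i -> String.valueOf(i))
                            .map(s -> Integer.parseInt(s))
                            .subscribe(input.observer);
            input.awaitCompletion();
        }

        @GenerateMicroBenchmark
        public void flatMapTransformsUsingJust(UseCaseInput input) throws InterruptedException {
            input.observable
                    .flatMap(i -> Observable.just(String.valueOf(i)).map(s -> Integer.parseInt(s)))
                    .subscribe(input.observer);
            input.awaitCompletion();
        }

        @State(Scope.Thread)
        public static class UseCaseInput {
            @Param({ "1", "1024" })
            public int size;

            Observable<Integer> observable;
            Observer<Integer> observer;
            private CountDownLatch latch;

            @Setup(Level.Invocation)
            public void setup() {
                observable = Observable.range(0, size);
                latch = new CountDownLatch(1);
                observer = new Observer<Integer>() {
                    @Override public void onCompleted() { latch.countDown(); }
                    @Override public void onError(Throwable e) { latch.countDown(); }
                    @Override public void onNext(Integer value) { /* consume */ }
                };
            }

            public void awaitCompletion() throws InterruptedException {
                latch.await();
            }
        }
    }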

Results

0.16

Benchmark                                       (size)   Mode   Samples         Mean   Mean error    Units
r.u.PerfTransforms.flatMapTransformsUsingFrom        1  thrpt         5  2421210.583    68845.966    ops/s
r.u.PerfTransforms.flatMapTransformsUsingFrom     1024  thrpt         5     1017.787       73.597    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust        1  thrpt         5  2398541.067    90703.197    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust     1024  thrpt         5      990.623      100.735    ops/s
r.u.PerfTransforms.mapTransformation                 1  thrpt         5  4020548.060   262841.500    ops/s
r.u.PerfTransforms.mapTransformation              1024  thrpt         5    16205.747      352.618    ops/s

Master

Benchmark                                       (size)   Mode   Samples         Mean   Mean error    Units
r.u.PerfTransforms.flatMapTransformsUsingFrom        1  thrpt         5  3184873.133   172320.420    ops/s
r.u.PerfTransforms.flatMapTransformsUsingFrom     1024  thrpt         5     9079.937      343.905    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust        1  thrpt         5  3411785.677    73767.161    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust     1024  thrpt         5    10860.963      294.309    ops/s
r.u.PerfTransforms.mapTransformation                 1  thrpt         5  7208334.997   703327.745    ops/s
r.u.PerfTransforms.mapTransformation              1024  thrpt         5    18720.797      278.529    ops/s

GC

On the master branch test I'm seeing GC results like this:

Iteration   5: 3189218.350 ops/s
          GC | wall time = 5.001 secs,  GC time = 0.047 secs, GC% = 0.94%, GC count = +98
Iteration   5: 9198.700 ops/s
          GC | wall time = 5.002 secs,  GC time = 0.048 secs, GC% = 0.96%, GC count = +98

versus 0.16

Iteration   5: 2420099.017 ops/s
          GC | wall time = 5.000 secs,  GC time = 0.046 secs, GC% = 0.92%, GC count = +96
Iteration   5: 993.867 ops/s
          GC | wall time = 5.001 secs,  GC time = 0.100 secs, GC% = 2.00%, GC count = +212

Summary

Unless I'm mistaken, current code is better:

  • similar GC behavior on Observable with 1 item, better GC behavior with 1024 items
  • higher ops/second on all of the map/flatMap tests

I'll start profiling this and improving it ... but this does not reveal the source of the problems seen. Possibly it's related to schedulers, or to a specific operator. I exercised map, flatMap (and thus merge), Observable, and subscribe here to cover the most fundamental ones.

@benjchristensen
Member Author

The observeOn test shows the cost of scheduling ... particularly for Observables of a single item.

    @GenerateMicroBenchmark
    public void observeOn(UseCaseInput input) throws InterruptedException {
        input.observable.observeOn(Schedulers.computation()).subscribe(input.observer);
        input.awaitCompletion();
    }
r.u.PerfObserveOn.observeOn        1  thrpt         5   457830.180     9943.221    ops/s
r.u.PerfObserveOn.observeOn     1024  thrpt         5    14496.430      518.463    ops/s

Thus, with an Observable of 1 item we can do 457k onNext/second, and with an Observable of 1024 items we can do 14.8 million onNext/second (14.5k * 1024).

@benjchristensen
Member Author

By the way, all testing is just being done on my Mac laptop ... so these numbers are all relative and not representative of proper server hardware.

@benjchristensen
Member Author

Converting from AtomicReference to AtomicReferenceFieldUpdater improved performance from this:

Benchmark                                       (size)   Mode   Samples         Mean   Mean error    Units
r.u.PerfTransforms.flatMapTransformsUsingFrom        1  thrpt         5  3184873.133   172320.420    ops/s
r.u.PerfTransforms.flatMapTransformsUsingFrom     1024  thrpt         5     9079.937      343.905    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust        1  thrpt         5  3411785.677    73767.161    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust     1024  thrpt         5    10860.963      294.309    ops/s
r.u.PerfTransforms.mapTransformation                 1  thrpt         5  7208334.997   703327.745    ops/s
r.u.PerfTransforms.mapTransformation              1024  thrpt         5    18720.797      278.529    ops/s

to this:

Benchmark                                       (size)   Mode   Samples         Mean   Mean error    Units
r.u.PerfTransforms.flatMapTransformsUsingFrom        1  thrpt         5  3459205.110   124790.906    ops/s
r.u.PerfTransforms.flatMapTransformsUsingFrom     1024  thrpt         5     9225.037      604.720    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust        1  thrpt         5  3639603.060   225599.038    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust     1024  thrpt         5    11135.613      337.022    ops/s
r.u.PerfTransforms.mapTransformation                 1  thrpt         5  7264202.633   214787.109    ops/s
r.u.PerfTransforms.mapTransformation              1024  thrpt         5    18795.790      713.668    ops/s

@daschl
Contributor

daschl commented May 21, 2014

@benjchristensen I suppose the AtomicReferenceFieldUpdater gain mostly comes from less GC pressure; is that why the raw throughput increases in your tests? How did the GC wall time change?

@daschl
Contributor

daschl commented May 21, 2014

If you want me to run a specific workload/type of test let me know so we can compare results.

@akarnokd
Member

I've been experimenting with FieldUpdaters and Unsafe for the SerializedObserver. It seems that by using Unsafe directly, and thus avoiding security checks and an indirection, I can get about 8-10% more throughput. The downsides are that it is scheduled to be removed/standardized in future Java versions, and that its availability is platform dependent. Should I pursue the Unsafe path?

@daschl
Contributor

daschl commented May 21, 2014

@akarnokd since RxJava also runs on Android, I'm not sure how good/standard the support is there.

I know that the Netty folks are having the same issues, and they are wrapping those Unsafe calls in a PlatformDependent util class.

@benjchristensen
Member Author

wrapping those Unsafe calls in a PlatformDependent util class.

This sounds like a valid approach for us. As we mature Rx we'll want to squeeze as much performance out of it as we can, while still remaining portable.
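
A PlatformDependent-style wrapper could be sketched roughly as follows (class name and structure are illustrative, not actual Netty or RxJava code): probe for sun.misc.Unsafe once at class-load time via reflection, and let callers fall back to the Atomic*FieldUpdater path when it isn't available (for example on some Android runtimes or under a restrictive SecurityManager). Operators would then pick the Unsafe-backed or updater-backed implementation once, rather than checking on every call.

    import java.lang.reflect.Field;

    // Illustrative sketch of a one-time Unsafe availability probe.
    final class UnsafeAccess {
        private static final sun.misc.Unsafe UNSAFE;

        static {
            sun.misc.Unsafe unsafe = null;
            try {
                // Reflectively grab the singleton; this can fail on platforms that
                // don't ship sun.misc.Unsafe or that restrict reflective access.
                Field field = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
                field.setAccessible(true);
                unsafe = (sun.misc.Unsafe) field.get(null);
            } catch (Throwable t) {
                unsafe = null; // fall back to Atomic*FieldUpdater-based code paths
            }
            UNSAFE = unsafe;
        }

        static boolean isUnsafeAvailable() {
            return UNSAFE != null;
        }

        static sun.misc.Unsafe unsafe() {
            return UNSAFE;
        }

        private UnsafeAccess() {
        }
    }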

@benjchristensen
Member Author

Testing with:

../gradlew benchmarks '-Pjmh=-f 1 -tu s -bm thrpt -wi 5 -i 5 -r 5 -prof GC .*PerfTransforms.*'
May 21st 0efda07
Benchmark                                       (size)   Mode   Samples         Mean   Mean error    Units
r.u.PerfTransforms.flatMapTransformsUsingFrom        1  thrpt         5  4597237.797   227193.650    ops/s
r.u.PerfTransforms.flatMapTransformsUsingFrom     1024  thrpt         5    12334.190      214.479    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust        1  thrpt         5  4522036.147   216489.787    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust     1024  thrpt         5    12293.223      321.573    ops/s
r.u.PerfTransforms.mapTransformation                 1  thrpt         5  9133316.230   303274.438    ops/s
r.u.PerfTransforms.mapTransformation              1024  thrpt         5    19698.323      150.313    ops/s
May 26th a34cba2
Benchmark                                       (size)   Mode   Samples         Mean   Mean error    Units
r.u.PerfTransforms.flatMapTransformsUsingFrom        1  thrpt         5  4367166.623   145506.857    ops/s
r.u.PerfTransforms.flatMapTransformsUsingFrom     1024  thrpt         5    11382.233      425.976    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust        1  thrpt         5  4468497.400    83649.115    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust     1024  thrpt         5    11374.780      780.039    ops/s
r.u.PerfTransforms.mapTransformation                 1  thrpt         5  8851147.610   303583.393    ops/s
r.u.PerfTransforms.mapTransformation              1024  thrpt         5    19649.227     1134.279    ops/s

According to these results we got slower (though it appears to be within the mean error, so if not slower, then no better).

@akarnokd
Member

Which Java version is this? Java 6's intrinsics aren't as good as those in newer versions. Maybe the wip field in the merge operator causes more false sharing and needs padding.

@benjchristensen
Member Author

/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk

@benjchristensen
Member Author

Master branch with /Library/Java/JavaVirtualMachines/jdk1.8.0_05.jdk

Benchmark                                       (size)   Mode   Samples         Mean   Mean error    Units
r.u.PerfTransforms.flatMapTransformsUsingFrom        1  thrpt         5  4378589.533   109056.155    ops/s
r.u.PerfTransforms.flatMapTransformsUsingFrom     1024  thrpt         5    10702.953      447.216    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust        1  thrpt         5  4341206.933   184228.619    ops/s
r.u.PerfTransforms.flatMapTransformsUsingJust     1024  thrpt         5    10961.550      545.574    ops/s
r.u.PerfTransforms.mapTransformation                 1  thrpt         5  8996983.320   226242.030    ops/s
r.u.PerfTransforms.mapTransformation              1024  thrpt         5    19423.813      779.759    ops/s

@benjchristensen
Member Author

Here is a simple test without JMH (but using the same coding pattern) that shows significant increases in throughput from 0.16 -> 0.17 -> 0.18 -> the current master branch for this code:

    public void mapTransformation(UseCaseInput input) throws InterruptedException {
        input.observable.map(i -> {
            return String.valueOf(i);
        }).map(i -> {
            return Integer.parseInt(i);
        }).subscribe(input.observer);
        input.awaitCompletion();
    }
master
Run: 10 - 10,333,567 ops/sec 
Run: 11 - 10,235,100 ops/sec 
Run: 12 - 10,269,259 ops/sec 
Run: 13 - 10,170,560 ops/sec 
Run: 14 - 10,072,319 ops/sec 
Version 0.18.3
Run: 10 - 7,973,782 ops/sec 
Run: 11 - 8,695,425 ops/sec 
Run: 12 - 8,283,768 ops/sec 
Run: 13 - 8,283,562 ops/sec 
Run: 14 - 8,270,888 ops/sec 
Version 0.17.6 (using OnSubscribe)
Run: 10 - 7,812,927 ops/sec 
Run: 11 - 7,632,713 ops/sec 
Run: 12 - 7,488,673 ops/sec 
Run: 13 - 7,975,944 ops/sec 
Run: 14 - 7,882,146 ops/sec 
Version 0.17.6 (using OnSubscribeFunc)
Run: 10 - 7,049,700 ops/sec 
Run: 11 - 7,175,042 ops/sec 
Run: 12 - 7,173,240 ops/sec 
Run: 13 - 7,069,685 ops/sec 
Run: 14 - 7,248,320 ops/sec 
Version 0.16.1
Run: 10 - 4,765,899 ops/sec 
Run: 11 - 4,792,623 ops/sec 
Run: 12 - 4,709,982 ops/sec 
Run: 13 - 4,761,451 ops/sec 
Run: 14 - 4,769,627 ops/sec 

@daschl
Contributor

daschl commented May 28, 2014

Very good progress! I'll get back to profiling from master next week.

@akarnokd
Member

I ran some benchmarks with AtomicInteger and volatile int inside merge, and it appears the JMH benchmark gives 10% more throughput with AtomicInteger.
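
For context, the two variants being compared are roughly the following (a sketch, not the actual merge operator): a dedicated AtomicInteger per instance versus a volatile int driven through a shared AtomicIntegerFieldUpdater. The updater avoids the extra allocation, but the counter then lives inside the owning object, which can change false-sharing behavior, as noted earlier in the thread.

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

    // Variant A: the work-in-progress counter is its own AtomicInteger object
    // (one extra allocation, but the value lives in a separate object).
    class WipWithAtomicInteger {
        private final AtomicInteger wip = new AtomicInteger();

        int increment() { return wip.incrementAndGet(); }
        int decrement() { return wip.decrementAndGet(); }
    }

    // Variant B: a volatile int on the owning object plus one shared static updater
    // (no per-instance allocation, but the field shares cache lines with its neighbors).
    class WipWithFieldUpdater {
        volatile int wip;

        private static final AtomicIntegerFieldUpdater<WipWithFieldUpdater> WIP =
                AtomicIntegerFieldUpdater.newUpdater(WipWithFieldUpdater.class, "wip");

        int increment() { return WIP.incrementAndGet(this); }
        int decrement() { return WIP.decrementAndGet(this); }
    }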

@benjchristensen
Member Author

Well that's odd and doesn't help much when two different ways of measuring are giving contradictory results :-(

@benjchristensen
Member Author

[screenshot: replaystate-arraylist-object]

@benjchristensen
Member Author

[screenshot: hystrix rx-with-0.18.2]

@benjchristensen
Member Author

Those were from 0.18.2 ... now with master, plus a modified CompositeSubscription, we get this:

[screenshot: hystrix rx-with-master, new CompositeSubscription]

@benjchristensen
Member Author

Here is evidence of the need for work on ReplaySubject:

[screenshot: replaysubject-1]

[screenshot: replaysubject-2]

[screenshot: replaysubject-3]

@benjchristensen
Member Author

[screenshot: 0.18.2 map/observeOn]

[screenshot: master map/observeOn]

@benjchristensen
Member Author

The Subscription[] issue is definitely being caused by the use of a Scheduler. Here is some evidence:

[screenshot: subscription-array-when-using-scheduler]

Then the master branch with the CompositeSubscription change:

[screenshot: 2014-05-29 10:58:09 AM]

The issue is definitely the CompositeSubscription, as the following shows; this is master as it currently stands, without my changes to CompositeSubscription:

[screenshot: master-without-changing-compositesubscription]

@mattrjacobs
Contributor

I've added some logging to our production instances and discovered that the large Subscription arrays we see in practice are caused by a prefetching operation which generates many (> 500) HystrixCommands, calls observe() on them, puts them in an ArrayList, and finally calls Observable.merge on that list to have a single place to listen for overall completion.

This seems like a valid case to support, and any work that improves performance for large Subscription arrays would be a meaningful improvement.
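
For reference, that fan-out pattern looks roughly like this (a self-contained sketch; in the real workload each source is a HystrixCommand.observe() rather than Observable.just):

    import java.util.ArrayList;
    import java.util.List;

    import rx.Observable;

    // Sketch of the prefetch fan-out/merge pattern described above.
    public class PrefetchFanOutExample {
        public static void main(String[] args) {
            List<Observable<Integer>> observables = new ArrayList<Observable<Integer>>();
            for (int i = 0; i < 500; i++) {
                observables.add(Observable.just(i)); // stand-in for command.observe()
            }

            // Merging gives a single place to listen for overall completion; merge must
            // track every inner subscription, which is where the very large
            // Subscription arrays come from.
            Observable.merge(observables).subscribe(
                    item -> { /* consume prefetched result */ },
                    error -> { /* handle failure */ },
                    () -> System.out.println("all prefetch commands completed"));
        }
    }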

benjchristensen added a commit to benjchristensen/RxJava that referenced this issue May 29, 2014
- significant reduction in object allocations
- details on research available at ReactiveX#1204
@benjchristensen
Member Author

I have submitted a pull request for this: #1281

We will be testing the code in our environment shortly.

@benjchristensen
Member Author

For anyone wanting to dig into this, Java Flight Recorder has been very helpful, and far better than the other profiling tools I've tried for this.

@benjchristensen
Member Author

Backporting to 0.18.x in #1283

@chrisjenx

Superb work, guys. I've held off on 0.18.x on Android as I noticed an increase in GC. Really glad you take this seriously!

@benjchristensen
Member Author

Thanks @chrisjenx ... it looks like the most glaring issues are resolved and the low-hanging fruit has been taken care of. There are a few other things for us to improve on, but I think we'll release 0.19 early next week. I would appreciate your feedback on whether you see an improvement.

I have also opened #1299 to document our attempts on blocking vs non-blocking implementations and to seek input from anyone who can provide better solutions.

@benjchristensen
Member Author

@akarnokd Is there anything else that stands out to you that we should fix before closing this issue? I'll continue doing some profiling but it seems the obvious ones are done.

We'll continue working on performance going forward and those can have their own issues and pull requests, so if nothing else obvious stands out let's close this issue and not leave it open-ended.

@akarnokd
Member

The history List in ReplaySubject: since an ArrayList uses more memory than the actual items, it might be worth compacting it on a terminal state (a one-time cost, but it might be expensive and run out of memory). Alternatively, it could use a fixed-increment expansion strategy. A third option is to have a cache() overload that passes in a capacity hint to reduce reallocation and wasted space.

@benjchristensen
Member Author

I think that the object allocation penalty of resizing after a terminal event would be worse.

A cache() overload that takes a capacity hint may be valuable, particularly in the single-item case where it could use just a single volatile ref instead of an array.
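
As an illustration of the capacity-hint idea (a hypothetical sketch only, not an existing ReplaySubject or cache() API): sizing the backing list once up front avoids both the repeated ArrayList reallocation-and-copy and most of the wasted slack.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical replay buffer with a capacity hint; the real ReplaySubject
    // internals also deal with terminal state and concurrent readers.
    final class ReplayBuffer<T> {
        private final List<T> history;

        ReplayBuffer(int capacityHint) {
            // Sized once up front; an un-hinted ArrayList starts small and
            // repeatedly reallocates and copies as items arrive.
            this.history = new ArrayList<T>(capacityHint);
        }

        void add(T value) {
            history.add(value);
        }

        int size() {
            return history.size();
        }
    }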

@benjchristensen
Member Author

I have opened a new issue for the cache() overload: #1303

@benjchristensen
Member Author

I'm closing this issue out as I believe we have handled the most glaring problems and don't want this to be a never-ending issue. We will of course continue seeking to improve performance, but let's use individual issues for each improvement/problem we find.

Thanks, everyone, for your involvement on this one; it was rather significant and important.

@Xorlev and @daschl I would appreciate feedback once you've had a chance to try the changes in the master branch (or the portion that was backported to 0.18.4), to know whether you see the improvements or still have issues. @Xorlev, in particular I'd like to know whether the issue you had was only the GC pressure, or whether you still see signs of a memory leak (which I have not seen yet).

@Xorlev

Xorlev commented Jun 3, 2014

@benjchristensen Hystrix 1.3.16 with RxJava 0.18.4 has been in prod for about a day now, and I'm happy to report a decrease in garbage (and in CPU usage in general). I believe the GC pressure and the suboptimal subscription removal were causing the leak-like behavior. @mattrjacobs's use case matches a few of our own (fan out commands, wait on all), which is likely the source of the large numbers of subscriptions.

I'll keep an eye out for any similar issues that might crop up.

Thanks a lot for all the help and dedication to improving RxJava.

@benjchristensen
Member Author

Excellent. Thank you @Xorlev for the confirmation. I'll release Hystrix 1.3.17 in a few days, hopefully with RxJava 0.19 as a dependency and at least one performance optimization I found I can make in Hystrix directly.
