Lock-free, MPSC-queue based, fast-path serializing Observer. #1235
Conversation
RxJava-pull-requests #1136 SUCCESS
One additional note. It is possible the JVM will eliminate the

```java
 * if the {@link Observer} is null.
 */
@SuppressWarnings("unchecked")
public boolean accept2(Observer<? super T> o, Object n) {
```
Could the return value just be added to the `accept` method without a performance hit, instead of having the awkwardly named, almost identical `accept2` method?
Most likely yes.
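For illustration, here is a minimal sketch of what folding the boolean result into a single `accept` could look like. The sentinel object and the `SimpleObserver` interface are simplified stand-ins invented for this sketch, not RxJava's actual notification machinery:

```java
// Hypothetical sketch: a single accept that also reports termination,
// making a separate accept2 variant unnecessary. COMPLETE and
// SimpleObserver are illustrative, not RxJava's real types.
class NotificationSketch {
    /** Sentinel marking an onCompleted notification. */
    static final Object COMPLETE = new Object();

    interface SimpleObserver {
        void onNext(Object t);
        void onCompleted();
    }

    /**
     * Delivers the materialized notification n to o and reports whether
     * it was terminal, so a draining loop can stop on a true result.
     */
    static boolean accept(SimpleObserver o, Object n) {
        if (n == COMPLETE) {
            o.onCompleted();
            return true;   // terminal: caller should stop draining
        }
        o.onNext(n);
        return false;      // more notifications may follow
    }
}
```

The return value costs nothing extra on the hot path; the branch on the sentinel is already there.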
Your research on this is awesome. I look forward to playing with this queue implementation and seeing its performance compared to alternatives. However, this slows down the normal non-contended use cases:

With MPSC

Master branch

Test using:
I don't think
This applies in this case because we are NOT exchanging events between threads; we are actively trying not to do that. A high-performance queue (particularly of the lock-free kind) will, however, make great sense in places such as
I suggest we get this queue and counter into

Here is my branch where I merged this PR and moved the code into

By the way, I'll spend more time testing the contended cases. Don't close this PR, I'm not done reviewing it all.
The JVM can remove a synchronized block if it detects single-threaded use; I can't compete with that using atomics. This is a tradeoff scenario: get very fast synchronous behavior, or get double the throughput in the contended case. If we value the fast synchronous behavior more, then there is no need to go atomic; we just need to make sure we don't use wait/notify.
I know that; we determined that in #1190. If we're going to change the decision on what tradeoff to make, then we need to go back and revisit why we made the decision we did. The expected common case is when there is no contention because of how

I think we need a broader set of use cases to determine which direction the tradeoff should be made.
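To make the synchronized alternative being weighed here concrete, below is a minimal sketch of a serializer built on a plain monitor and an `emitting` flag, with no wait/notify, so an uncontended caller only pays for a lock the JVM can bias or elide. All names are illustrative; this is not the actual `SerializedObserver` code:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Illustrative synchronized serializer: contended producers enqueue and
// leave immediately; the current emitter drains, so no thread ever parks.
class SynchronizedSerializer<T> {
    private final Consumer<T> downstream;
    private boolean emitting;        // guarded by 'this'
    private Queue<T> buffer;         // guarded by 'this'; created lazily on contention

    SynchronizedSerializer(Consumer<T> downstream) {
        this.downstream = downstream;
    }

    public void onNext(T t) {
        synchronized (this) {
            if (emitting) {
                // Another thread is emitting: hand the value over and leave.
                if (buffer == null) {
                    buffer = new ArrayDeque<>();
                }
                buffer.add(t);
                return;
            }
            emitting = true;         // claim the emitter role
        }
        downstream.accept(t);        // uncontended path: emit outside the lock
        for (;;) {                   // drain anything that raced in meanwhile
            T next;
            synchronized (this) {
                if (buffer == null || (next = buffer.poll()) == null) {
                    emitting = false;
                    return;
                }
            }
            downstream.accept(next);
        }
    }
}
```

In the single-threaded case the buffer is never allocated and each call is one uncontended lock acquisition, which is exactly the pattern lock elision targets.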
This was automatically closed as I merged the queue implementation. We still need to finish discussing the SerializedObserver part. |
I have my doubts now. Some simple benchmarks show improvements; other JMH benchmarks don't. Without industrial use cases, I think synchronized constructs are enough, because of either low contention or single-threaded use. In the latter case, JVM optimizations will always win. For example, I tried double-checked locking with Composite and barely got 10 MOps/s. The current version gives 1.3 GOps/s if run within a 100k loop long enough.
I think if in doubt, let's just not outsmart the JVM :)
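For reference, the double-checked locking construct mentioned above looks roughly like this; a generic lazy-initialization sketch, not the actual `Composite` code:

```java
import java.util.function.Supplier;

// Generic double-checked locking: a volatile read on the fast path,
// the lock taken only while the value is still uninitialized.
class DclLazy<T> {
    private volatile T value;            // volatile is essential for safe publication
    private final Supplier<T> factory;

    DclLazy(Supplier<T> factory) {
        this.factory = factory;
    }

    public T get() {
        T v = value;                     // first, unlocked check (the fast path)
        if (v == null) {
            synchronized (this) {
                v = value;               // second check, now under the lock
                if (v == null) {
                    value = v = factory.get();
                }
            }
        }
        return v;
    }
}
```

Even this "cheap" fast path still pays for a volatile load on every call, which is one plausible reason it loses badly to a plain synchronized block that the JVM can elide entirely.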
Here is an ad hoc conversation on Twitter with some experts in this area: https://twitter.com/benjchristensen/status/472362740510494721. I'd like to improve our perf testing so we have cases for uncontended, normal occasional contention, and highly contended use, and test them in a few places. I'll do the upgrade of JMH today so we have the latest code. Once I have those in place I'll put it out there for review and guidance.
For what it is worth, I find the state-machine approach leads to hard-to-read code, so if there is no obvious win, I'd go for simple.
I've rewritten the `SerializedObserver` to improve performance. It is now lock-free, uses a multiple-producer-single-consumer queue based on Netty's implementation, and employs fast-path logic for single-threaded write-through.

Benchmarked by measuring how fast 500k integers can get through it when running 1 to 8 producers at the same time. For the single-threaded case, master gives about 18 MOps/s while this implementation gives ~36 MOps/s (it would be ~16 MOps/s on the slow path). For more than 2 producers, master gives ~5.5 MOps/s and this gives ~11.5 MOps/s. For 2 producers, i.e., 1 producer and 1 consumer, master gives ~4.5 MOps/s and this gives ~8.5 MOps/s.
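The fast-path/slow-path discipline described above can be sketched with plain JDK classes, with an `AtomicInteger` and a `ConcurrentLinkedQueue` standing in for the PR's `PaddedAtomicInteger` and `MpscPaddedQueue`. This is a simplified illustration of the technique (no terminal events or error handling), not the PR's actual code:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class FastPathSerializer<T> {
    private final Consumer<T> downstream;
    // Work-in-progress counter: 0 = idle, >0 = number of undrained onNext calls.
    private final AtomicInteger wip = new AtomicInteger();
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();

    FastPathSerializer(Consumer<T> downstream) {
        this.downstream = downstream;
    }

    public void onNext(T t) {
        if (wip.compareAndSet(0, 1)) {
            // Fast path: we are alone, write through without touching the queue.
            downstream.accept(t);
            if (wip.decrementAndGet() == 0) {
                return;                   // nobody raced in while we emitted
            }
        } else {
            // Slow path: offer BEFORE incrementing so the active emitter is
            // guaranteed to see the item once it observes the new count.
            queue.offer(t);
            if (wip.getAndIncrement() != 0) {
                return;                   // the active emitter will drain it
            }
        }
        drain();                          // we became responsible for draining
    }

    private void drain() {
        do {
            T v = queue.poll();
            if (v != null) {
                downstream.accept(v);
            }
        } while (wip.decrementAndGet() != 0);
    }
}
```

A single-threaded caller always wins the CAS, emits directly, and never allocates a queue node, which is where the write-through speedup comes from.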
The two new classes, `PaddedAtomicInteger` and `MpscPaddedQueue`, will come in handy with other lock-free structures such as Schedulers. We may consider adding back the `rx.util` or some other sub-package to store these helper classes: they don't need to be part of the public API but can be left `public` to enable cross-package access internally.

Things I learned during the implementation:
- `sun.misc.Unsafe` can add 8-10% more throughput. To avoid platform issues, I stayed with the FieldUpdaters.
- `getAndIncrement` and `decrementAndGet` are intrinsified in Java 8 and are compiled to a single x86 instruction, which generally outperforms any CAS loop. The same is true for `getAndSet`.
- Padding the `tail` in the `MpscPaddedQueue` again helps separate the producers thrashing on the tail from the consumer reading the head. Without it, the throughput would decrease by about ~1.5 MOps/s.
- `merge` and co, which serialize multiple sources. Note, however, that if the source speed isn't as high as in the benchmark, this implementation still provides lower latency than the alternatives, because the fast path would most likely be open if the source emission is interleaved.
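The padding idea behind `PaddedAtomicInteger` can be sketched as below. Note this is an illustration of the general trick, not the PR's class: unused `long` fields push neighbouring hot data onto a different cache line so independent counters don't false-share. Actual field layout is ultimately up to the JVM (Java 8's JVM-internal `@Contended` annotation is the sanctioned way to request this):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative cache-line padding: seven unused longs (~56 bytes) after
// AtomicInteger's int value fill out a typical 64-byte cache line, so a
// hot field allocated next to this object lands on a different line.
class PaddedAtomicIntegerSketch extends AtomicInteger {
    private static final long serialVersionUID = 1L;

    public volatile long p1, p2, p3, p4, p5, p6, p7;

    PaddedAtomicIntegerSketch(int initialValue) {
        super(initialValue);
    }

    /** Touches the padding so aggressive optimizers can't treat it as dead. */
    public long paddingSum() {
        return p1 + p2 + p3 + p4 + p5 + p6 + p7;
    }
}
```

The class behaves exactly like `AtomicInteger`; only its memory footprint changes, which is why it can be dropped into hot counters like the queue's `tail` without touching the algorithm.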