
Migrate from SynchronizedObserver to SerializedObserver #962

Merged

Conversation

benjchristensen
Member

This pull request replaces use of SynchronizedObserver with SerializedObserver.

Why?

1) Deadlocks

Holding a lock while emitting notifications (onNext, onCompleted, onError) allows for deadlocks if the event results in a cycle back to the parent.

While testing RxJava 0.17.0 in Netflix production we ran into one of these. The vulnerability has existed all along but we finally hit it.

This issue has been reported before as well, for example: http://twistedoakstudios.com/blog/Post8424_deadlocks-in-practice-dont-hold-locks-while-notifying

2) Blocking Threads

The use of synchronized can block threads. That can be acceptable when it guards short operations such as modifying a data structure. When emitting a notification, however, it is a problem: the work triggered by an onNext can take a non-deterministically long time to complete, so any other thread trying to emit is blocked for that entire time.

If the source threads are event loops (such as Vert.x or Netty) this will block the event loops.

For example, if two network calls off two Netty event loops are being merged (such as via flatMap) and one of them does further slow processing that causes onNext to be slow, it will block the other onNext which blocks the event loop and prevents any further IO on that thread. This is a significant problem for system scale and breaks the promise of Rx being a non-blocking, reactive library.

Solution

The synchronize operator and the SynchronizedObserver and SynchronizedSubscriber classes have been deprecated. They are replaced by serialize, SerializedObserver and SerializedSubscriber.

The SerializedObserver still ensures that only a single thread can emit onNext, onCompleted, or onError at a time, but it does not hold a lock while doing so. Instead of blocking threads it accepts incoming events into a buffer. Thus, it becomes an asynchronous operator.

The merge operator (which impacts flatMap) now uses SerializedObserver, along with any other place in RxJava that needed synchronization.
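
For illustration, here is a minimal sketch of serializing concurrent emissions by wrapping an Observer in a SerializedObserver. It assumes the rx.Observer and rx.observers.SerializedObserver APIs introduced by this pull request; exact signatures may differ between releases.

    import rx.Observer;
    import rx.observers.SerializedObserver;

    public class SerializeExample {
        public static void main(String[] args) {
            Observer<Integer> downstream = new Observer<Integer>() {
                @Override
                public void onCompleted() {
                    System.out.println("completed");
                }

                @Override
                public void onError(Throwable e) {
                    e.printStackTrace();
                }

                @Override
                public void onNext(Integer value) {
                    System.out.println("got " + value);
                }
            };

            // Multiple threads may call this wrapper concurrently; the downstream
            // Observer sees one notification at a time, and contended events are
            // buffered rather than blocking the calling threads.
            Observer<Integer> serialized = new SerializedObserver<Integer>(downstream);

            serialized.onNext(1);
            serialized.onNext(2);
            serialized.onCompleted();
        }
    }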

Implementation

Three implementations were written and tested: a state machine (SerializedObserverViaStateMachine), a queue and counter (SerializedObserverViaQueueAndCounter), and a queue and lock (SerializedObserverViaQueueAndLock).

Performance testing revealed:

    /**
     * 1 streams emitting in a tight loop. Testing for single-threaded overhead.
     * 
     * -> blocking synchronization (SynchronizedObserver)
     * 
     * Run: 10 - 58,186,310 ops/sec
     * Run: 11 - 60,592,037 ops/sec
     * Run: 12 - 58,099,263 ops/sec
     * Run: 13 - 59,034,765 ops/sec
     * Run: 14 - 58,231,548 ops/sec
     * 
     * -> state machine technique (SerializedObserverViaStateMachine)
     * 
     * Run: 10 - 34,668,810 ops/sec
     * Run: 11 - 32,874,312 ops/sec
     * Run: 12 - 33,389,339 ops/sec
     * Run: 13 - 35,269,946 ops/sec
     * Run: 14 - 34,165,013 ops/sec
     * 
     * -> using queue and counter technique (SerializedObserverViaQueueAndCounter)
     * 
     * Run: 10 - 19,548,387 ops/sec
     * Run: 11 - 19,471,069 ops/sec
     * Run: 12 - 19,480,112 ops/sec
     * Run: 13 - 18,720,550 ops/sec
     * Run: 14 - 19,070,383 ops/sec
     * 
     * -> using queue and lock technique (SerializedObserverViaQueueAndLock)
     * 
     * Run: 10 - 51,295,152 ops/sec
     * Run: 11 - 50,317,937 ops/sec
     * Run: 12 - 51,126,331 ops/sec
     * Run: 13 - 52,418,291 ops/sec
     * Run: 14 - 51,694,710 ops/sec
     */

    /**
     * 2 streams emitting in tight loops so very high contention.
     * 
     * -> blocking synchronization (SynchronizedObserver)
     * 
     * Run: 10 - 8,361,252 ops/sec
     * Run: 11 - 7,184,728 ops/sec
     * Run: 12 - 8,249,685 ops/sec
     * Run: 13 - 6,831,595 ops/sec
     * Run: 14 - 8,003,358 ops/sec
     * 
     * (faster because it allows each thread to be "single threaded" while blocking the other)
     * 
     * -> state machine technique (SerializedObserverViaStateMachine)
     * 
     * Run: 10 - 4,060,062 ops/sec
     * Run: 11 - 3,561,131 ops/sec
     * Run: 12 - 3,721,387 ops/sec
     * Run: 13 - 3,693,909 ops/sec
     * Run: 14 - 3,516,324 ops/sec
     * 
     * -> using queue and counter technique (SerializedObserverViaQueueAndCounter)
     * 
     * Run: 10 - 4,300,229 ops/sec
     * Run: 11 - 4,395,995 ops/sec
     * Run: 12 - 4,551,550 ops/sec
     * Run: 13 - 4,443,235 ops/sec
     * Run: 14 - 4,158,475 ops/sec
     * 
     * -> using queue and lock technique (SerializedObserverViaQueueAndLock)
     * 
     * Run: 10 - 6,369,781 ops/sec
     * Run: 11 - 6,933,872 ops/sec
     * Run: 12 - 5,652,535 ops/sec
     * Run: 13 - 5,503,716 ops/sec
     * Run: 14 - 6,219,264 ops/sec
     */

    /**
     * 2 streams emitting once a millisecond. Slow emission so little to no contention.
     * 
     * -> blocking synchronization (SynchronizedObserver)
     * 
     * Run: 10 - 1,996 ops/sec
     * Run: 11 - 1,996 ops/sec
     * Run: 12 - 1,995 ops/sec
     * Run: 13 - 1,997 ops/sec
     * Run: 14 - 1,996 ops/sec
     * 
     * -> state machine technique (SerializedObserverViaStateMachine)
     * 
     * Run: 10 - 1,996 ops/sec
     * Run: 11 - 1,996 ops/sec
     * Run: 12 - 1,996 ops/sec
     * Run: 13 - 1,996 ops/sec
     * Run: 14 - 1,996 ops/sec
     * 
     * -> using queue and counter technique (SerializedObserverViaQueueAndCounter)
     * 
     * Run: 10 - 1,996 ops/sec
     * Run: 11 - 1,996 ops/sec
     * Run: 12 - 1,996 ops/sec
     * Run: 13 - 1,996 ops/sec
     * Run: 14 - 1,995 ops/sec
     * 
     * -> using queue and lock technique (SerializedObserverViaQueueAndLock)
     * 
     * Run: 10 - 1,996 ops/sec
     * Run: 11 - 1,996 ops/sec
     * Run: 12 - 1,997 ops/sec
     * Run: 13 - 1,996 ops/sec
     * Run: 14 - 1,995 ops/sec
     */

The state machine solution was tested in production but caused performance problems, most likely due to the heavy object allocation it requires.

The elegant "queue and counter" solution does not perform well enough in the non-contended case.

The "queue and lock" model performs well in the non-contended case and under contention, despite not being a very elegant solution and requiring the use of mutex locks for the state changes (but it does not hold the locks during notification).

Considerations

This does allow unbounded buffer growth (as observeOn and zip already do) rather than blocking the producer threads.

Conclusion

The implementation in this pull request can and likely will be improved over time. The other implementations are purposely included so that others can offer further insight into how to do this better.

This change is important to ensure RxJava is non-blocking, and canary testing of this change in the Netflix production environment suggests it is both performant and functionally correct.

@cloudbees-pull-request-builder

RxJava-pull-requests #902 SUCCESS
This pull request looks good

benjchristensen added a commit that referenced this pull request Mar 13, 2014
Migrate from SynchronizedObserver to SerializedObserver
@benjchristensen benjchristensen merged commit 4834d85 into ReactiveX:master Mar 13, 2014
@benjchristensen benjchristensen deleted the serialize-synchronize branch March 13, 2014 18:34
if (canEmit) {
    // we won the right to emit
    try {
        drainQueue(list);
Contributor

should onError be draining the queue before sending the error?

Member Author

Yes we can, but if we're winning the right to emit immediately as it's doing here, it's highly unlikely there is anything in the queue.

Member Author

Sorry, by "yes we can" I meant we can clear it ... which means skip draining it in the onError case.

@akarnokd
Member

How about the enqueue/drain approach, with a synchronized counter increment and queue replacement for the drain, similar to your code? (No ConcurrentLinkedQueue.)

@benjchristensen
Member Author

Possibly, but it exposes a different problem: starvation. One thread could end up constantly looping and draining the queue and never get a chance to return and emit data that it needs to emit.

I considered setting a max number of loops, but once those loops finish the other issue would still occur: items could be left sitting in the queue for some period of time until another event arrives.

The queue/drain model works far better when there is a single thread dedicated to draining, as in observeOn. When stealing threads like this I'm not sure how to avoid one of the two problems, and I'm debating which one is worse (obviously I've decided starvation is worse for now, and termination events always end up causing the queue to be drained).
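
As a rough sketch of the enqueue/drain variant being discussed (synchronized state changes, the list swapped out for draining, no ConcurrentLinkedQueue; names are illustrative and only onNext is handled), the starvation concern shows up in the drain loop: the thread that wins keeps draining for as long as other producers keep appending.

    import java.util.ArrayList;
    import java.util.List;

    import rx.Observer;

    class EnqueueDrainSketch<T> {
        private final Observer<? super T> actual;
        private int wip;          // items queued but not yet emitted; guarded by "this"
        private List<T> queue;    // guarded by "this"

        EnqueueDrainSketch(Observer<? super T> actual) {
            this.actual = actual;
        }

        public void onNext(T t) {
            synchronized (this) {
                if (queue == null) {
                    queue = new ArrayList<T>();
                }
                queue.add(t);
                if (wip++ != 0) {
                    return; // another thread is draining and will emit this item
                }
            }
            // We are the draining thread.
            while (true) {
                List<T> drain;
                synchronized (this) {
                    drain = queue;
                    queue = null;
                }
                for (T item : drain) {
                    actual.onNext(item);
                }
                synchronized (this) {
                    wip -= drain.size();
                    if (wip == 0) {
                        return; // nothing left; release the drain role
                    }
                    // Otherwise loop again: a fast producer on another thread can
                    // keep us here indefinitely -- the starvation problem above.
                }
            }
        }
    }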

@headinthebox
Contributor

But streams are not guaranteed to terminate. They can be silent forever, like Observable.never.

@benjchristensen
Member Author

They can be silent forever like Observable.never

In which case it would never emit anything and never queue anything.

But streams are not guaranteed to terminate.

Of course, in which case it's the next onNext that would drain the queue.

The most concerning type of use case in the current implementation is one where 2 events are emitted, one is delivered, the other is queued, then a long time passes before any further events occur. In this use case, the second onNext is just sitting in a queue waiting to be delivered.

The only model I can think of that solves this requires injecting extra concurrency - using a Scheduler (defaulting to the computation one) to drain the queue, but optimizing for using the providing thread in the non-contended case, only using the Scheduler to drain the queue when there is contention. I'm not a fan of requiring that though for merge since that means flatMap could result in additional concurrency - but perhaps it's an okay tradeoff to sometimes end up delivering the events via a different Scheduler (think "event loop") to ensure these two bad cases can't happen. It would be a significant departure to start allowing additional scheduling and concurrency on merge though.

@abersnaze
Contributor

Could a fast producer see that it's starving another thread and proactively steal canEmit to take over draining for a bit?


@benjchristensen
Member Author

Could a fast producer see that it's starving another thread and proactively
steal the canEmit to take over draining for a bit.

How would you do that, by keeping track of the number of onNext received by each Thread ID?

And how would it steal the canEmit without blocking?

@benjchristensen
Member Author

I don't know if this issue will remain in a few weeks if we end up implementing the continuation back-pressure solution, as the inner queue would be bounded to a relatively small size (128? 512? 1024?) and the origin threads would end up parking themselves if they fill their buffers.

This would naturally cause a form of rescheduling and no single thread would starve out another as all would be required to limit the amount they emit.

@Strilanc

Glad to see people taking notice of the deadlock issue.

Here are some tweaks that might help the queue-and-counter case:

  • Optimistically start operations by CompareSwapping 1 for 0 into the counter. If it succeeds, do your work without queueing it (then try to drain the queue and decrement out of that responsibility as usual). This saves two atomic queue operations when things don't overlap, but costs a CAS when they do. (A rough sketch of this fast path follows this list.)
  • Wrap the Observable instead of the Observer. This reduces the number of atomic operations when there are multiple subscriptions, but makes each serialized operation take longer (so will cause more overlap).
  • Since draining the queue can potentially live-lock a thread, keep a non-synchronized count of how much work you have drained. When it exceeds 10 or 100 or whatever, spin off a new thread pool task to do the work so the caller gets control back. (Handing off to another producer is hard)
  • Improve the concurrent queue, using the knowledge that our case is many-producer-single-consumer. For example, a long time ago I tried an idea where producers did a single atomic Exchange, the consumer had no atomic operations, but there were issues with when queued operations became visible and that problem complicated draining and ultimately made it slower.
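
A hedged sketch of that optimistic CAS fast path on top of a queue-and-counter serializer follows. The class and field names are illustrative rather than the RxJava implementation, only onNext is handled, and the terminal-event and thread-safety concerns raised later in this thread are not addressed.

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicInteger;

    import rx.Observer;

    class QueueAndCounterFastPathSketch<T> {
        private final Observer<? super T> actual;
        private final AtomicInteger count = new AtomicInteger();
        private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<T>();

        QueueAndCounterFastPathSketch(Observer<? super T> actual) {
            this.actual = actual;
        }

        public void onNext(T t) {
            // Fast path: CAS 1 for 0 into the counter and, if it wins, emit
            // directly, skipping both queue operations.
            if (count.compareAndSet(0, 1)) {
                actual.onNext(t);
                if (count.decrementAndGet() == 0) {
                    return; // nothing arrived while we were emitting
                }
            } else {
                // Contended path: enqueue, and become the drainer only if the
                // counter was back at zero by the time we incremented it.
                queue.offer(t);
                if (count.getAndIncrement() != 0) {
                    return; // another thread is already draining
                }
            }
            // Drain whatever other producers queued in the meantime.
            do {
                actual.onNext(queue.poll());
            } while (count.decrementAndGet() != 0);
        }
    }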

@benjchristensen
Member Author

Hi @Strilanc, thanks for getting involved.

queue-and-counter case

Are you able to get it to perform as well as the queue-and-lock implementation?

Wrap the Observable instead of the Observer.

What do you mean by this? Here is the use case that needs to be solved: https://github.com/Netflix/RxJava/blob/master/rxjava-core/src/main/java/rx/operators/OperatorMerge.java#L39

Since draining the queue can potentially live-lock a thread

We have chosen the other tradeoff of possibly delayed delivery right now so this doesn't happen. We discuss this tradeoff in the comments above. Your proposed solution is one of the possibilities discussed but involves adding additional concurrency, which has traditionally been unwanted (for cases like merge/flatMap). I think this issue goes away if we end up implementing the back pressure design we're considering, which would eliminate the unbounded buffers and thus naturally eliminate the live-lock/thread-starvation issue. (Example origin supporting continuation and parking itself: https://github.com/benjchristensen/RxBackpressure/blob/master/rx-bp-prototype/src/main/java/rx/operators/OnSubscribeFromIterable.java#L69)

Improve the concurrent queue

Definitely. As part of the continuation implementation for back pressure I expect us to end up using optimized ring buffers since we will have fixed size buffers. We may be able to use a multi-producer-single-consumer ring buffer in some places, but in this case where we're stealing producer threads it's not a traditional producer/consumer use case so likely will still need multi-producer-multi-consumer.

@Strilanc

Hey @benjchristensen ,

Are you able to get it to perform as well as the queue-and-lock implementation?

I used the optimistic-CAS trick as part of fixing a performance problem in an Objective-C app that used ReactiveCocoa. However, that was essentially a single-threaded case and I can't test on anything even approaching the scale Netflix can.

What do you mean by this? Here is the use case that needs to be solved: https://github.com/Netflix/RxJava/blob/master/rxjava-core/src/main/java/rx/operators/OperatorMerge.java#L39

You're right that it doesn't apply to that case. I was thinking of subscribing to a BehaviorSubject, where it would be wasteful to do the serialization once per observer instead of once inside the subject.

We have chosen the other tradeoff of possibly delayed delivery right now so this doesn't happen. We discuss this tradeoff in the comments above. [...]

I think those comments are all about passing the draining to another producer. The suggestion I'm giving is to transfer the work to the thread pool. This avoids issues where the queue stops until another producer comes along, and the corresponding tricky synchronization involved in that.

but in this case where we're stealing producer threads it's not a traditional producer/consumer use case so likely will still need multi-producer-multi-consumer.

Yes, it is a bit of a unique case. The consumer is only ever in one place, but that place keeps teleporting around.

@benjchristensen
Member Author

I can't test on anything even approaching the scale Netflix can.

I don't think Netflix scale impacts this very much, since these use cases apply to many applications regardless of horizontal or vertical scale. There are three primary use cases that occur when serializing:

  1. No contention

This needs to be optimized for as it's very common, and the JVM does a great job of recognizing when a lock is uncontended and performing well.

  2. Highly contended

This happens when tight for/while loops are emitting and being merged. It's somewhat of a bad use case, but it happens. Locks perform rather well per-thread since they block the other threads.

  3. Intermittent contention

This case is easy: locks, queues, and CAS all perform about the same.

I was thinking of subscribing to a BehaviorSubject, where it would be wasteful to do the serialization once per observer instead of once inside the subject.

That makes sense.

The suggestion I'm giving is to transfer the work to the thread pool. This avoids issues where the queue stops until another producer comes along, and the corresponding tricky synchronization involved in that.

Agreed. It will be far easier in this case, and much more similar to how the queue-drain model works in observeOn. We've been trying to eliminate adding concurrency to serialize though, as the last thing anyone expects flatMap/merge to do is inject a new thread to consume the work.

In Rx it's fairly normal for the producer to move across threads (as long as the emissions are serialized) since merging of streams happens all the time or other things like onErrorResumeNext and timeout. Moving consumption to a new thread though only happens when explicitly asked for via observeOn.

That said, if the back-pressure work does not solve the trade-offs (thread-starvation or notification delay) we will likely end up pursuing the fallback solution of moving work to a Scheduler when the live-lock counter threshold is hit.

@benjchristensen
Member Author

Optimistically start operations by CompareSwapping 1 for 0 into the counter. If it succeeds, do your work without queueing it

I had done this optimization on the state machine implementation but not on the queue-and-counter one, as I'm not quite sure the race condition is safe. It seems it should be fine since it drains AFTER emitting and the fast path only occurs when the count is 0, but I get multiple unit test failures when I attempt this optimization and haven't spent the time to figure out why. Code updated to not queue in the non-contended case here: https://github.com/benjchristensen/RxJava/blob/serialize-implementations/rxjava-core/src/main/java/rx/observers/SerializedObserverViaQueueAndCounter.java#L50

With this optimization here are the before and after on performance:

     * -> using queue and counter technique (SerializedObserverViaQueueAndCounter)
     * 
     * Run: 10 - 19,548,387 ops/sec
     * Run: 11 - 19,471,069 ops/sec
     * Run: 12 - 19,480,112 ops/sec
     * Run: 13 - 18,720,550 ops/sec
     * Run: 14 - 19,070,383 ops/sec
     * 
     * ... after optimizations
     * 
     * Run: 10 - 40,202,262 ops/sec
     * Run: 11 - 40,628,288 ops/sec
     * Run: 12 - 41,878,527 ops/sec
     * Run: 13 - 43,400,405 ops/sec
     * Run: 14 - 40,002,117 ops/sec

However, the queue-and-lock implementation is still faster:

     * Run: 10 - 51,295,152 ops/sec
     * Run: 11 - 50,317,937 ops/sec
     * Run: 12 - 51,126,331 ops/sec
     * Run: 13 - 52,418,291 ops/sec
     * Run: 14 - 51,694,710 ops/sec

... and this optimization makes the contended cases slower:

     * Run: 10 - 4,300,229 ops/sec
     * Run: 11 - 4,395,995 ops/sec
     * Run: 12 - 4,551,550 ops/sec
     * Run: 13 - 4,443,235 ops/sec
     * Run: 14 - 4,158,475 ops/sec
     * 
     * ... after "optimizations"
     * 
     * Run: 10 - 2,857,008 ops/sec
     * Run: 11 - 3,093,778 ops/sec
     * Run: 12 - 4,009,758 ops/sec
     * Run: 13 - 3,094,435 ops/sec
     * Run: 14 - 3,166,119 ops/sec

Even if these performance changes made it better than queue-and-lock, I am not certain it is thread-safe: 6 unit tests across the codebase now fail, whereas queue-and-lock still performs better and all tests pass.

I will explore this more as we finish experimenting with back pressure as that will define how merge and the primary use case for serialization/synchronization can be implemented.

@benjchristensen
Member Author

Thanks for continuing on this. I'm not surprised we can't beat the compiler :-) I'm curious why the CAS operations aren't similarly optimized.

This exploration isn't done yet, and I'll come back to it in the coming weeks. We don't necessarily need to be faster than locks to end up using CAS; there are other benefits to CAS over locks.

@davidmoten
Collaborator

I believe I'm seeing a nasty bug related to this, where the use of merge + SerializedSubscriber results in onCompleted being called before onNext has finished. I thought I'd mention it now, while I dig around trying to get a unit test, in case it rings bells for those of you who are familiar with the changed code. My use case worked as expected in 0.17.0, then not at all in 0.17.1. Unfortunately the code is too complex to present here without adding stupid amounts of noise, so I will try to distill a unit test.

@Strilanc

@benjchristensen @davidmoten I think the problem has to do with the queue not being drained properly. There's a copy-away-to-avoid-locking thing happening, but drainQueue only drains the copy. It doesn't go back and check if there's more to drain, which can leave the queue with items hanging.

Actually, I'm kind of surprised any concurrent completions are working. If terminated is set while the queue is being concurrently drained, I don't think anything forwards the completion ever!

There's also a race on emitting being cleared vs another producer taking up draining.

I think your particular bug is because of this:

    if (canEmit) {
        // we won the right to emit
        try {
            drainQueue(list);
            actual.onNext(t);

There's a race with an onComplete firing before that call to drainQueue runs, then onNext ruins everything. ... Except that list != queue, so I dunno...

Still, the hand-off process between producers is wrong in the current queue+lock implementation. It's rotten enough to consider starting from scratch to avoid contamination from the same subtle mistakes. There definitely need to be tests that detect these.

@davidmoten
Collaborator

Thanks for that, I'll stop chasing a unit test. I'll leave it to the RxJava experts to run with.


@benjchristensen
Member Author

If terminated is set while the queue is being concurrently drained, I don't think anything forwards the completion ever!

It happens right here: https://github.com/Netflix/RxJava/blob/master/rxjava-core/src/main/java/rx/observers/SerializedObserver.java#L160

@benjchristensen
Member Author

Can one of you please provide me a test case? The only way I can cause a problem so far is if the source Observable breaks the Rx contract.

Here are examples where threads are racing each other:

    @Test
    public void testConcurrency() {

        Observable<Integer> o = Observable.create(new OnSubscribe<Integer>() {

            @Override
            public void call(final Subscriber<? super Integer> s) {
                Schedulers.newThread().schedule(new Action1<Inner>() {

                    @Override
                    public void call(Inner inner) {
                        for (int i = 0; i < 10000; i++) {
                            s.onNext(1);
                        }
                        s.onCompleted();
                    }

                });
            }
        });

        for (int i = 0; i < 1000; i++) {
            Observable<Integer> merge = Observable.merge(o, o, o);
            TestSubscriber<Integer> ts = new TestSubscriber<Integer>();
            merge.subscribe(ts);

            ts.awaitTerminalEvent();
            assertEquals(1, ts.getOnCompletedEvents().size());
            assertEquals(30000, ts.getOnNextEvents().size());
            List<Integer> onNextEvents = ts.getOnNextEvents();
            System.out.println("onNext: " + onNextEvents.size() + " onCompleted: " + ts.getOnCompletedEvents().size());
        }
    }

Another variant, this time with sleeps to cause random concurrent behavior:

    @Test
    public void testConcurrencyWithSleeping() {

        Observable<Integer> o = Observable.create(new OnSubscribe<Integer>() {

            @Override
            public void call(final Subscriber<? super Integer> s) {
                Schedulers.newThread().schedule(new Action1<Inner>() {

                    @Override
                    public void call(Inner inner) {
                        for (int i = 0; i < 100; i++) {
                            s.onNext(1);
                            try {
                                Thread.sleep(1);
                            } catch (InterruptedException e) {
                                e.printStackTrace();
                            }
                        }
                        s.onCompleted();
                    }

                });
            }
        });

        for (int i = 0; i < 100; i++) {
            Observable<Integer> merge = Observable.merge(o, o, o);
            TestSubscriber<Integer> ts = new TestSubscriber<Integer>();
            merge.subscribe(ts);

            ts.awaitTerminalEvent();
            assertEquals(1, ts.getOnCompletedEvents().size());
            assertEquals(300, ts.getOnNextEvents().size());
            List<Integer> onNextEvents = ts.getOnNextEvents();
            System.out.println("onNext: " + onNextEvents.size() + " onCompleted: " + ts.getOnCompletedEvents().size());
        }
    }

Both of those work correctly.

I am however able to get it to behave poorly when the Rx contract is broken:

    @Test
    public void testConcurrencyWithBrokenContract() {

        Observable<Integer> o = Observable.create(new OnSubscribe<Integer>() {

            @Override
            public void call(final Subscriber<? super Integer> s) {
                Schedulers.newThread().schedule(new Action1<Inner>() {

                    @Override
                    public void call(Inner inner) {
                        for (int i = 0; i < 10000; i++) {
                            s.onNext(1);
                        }
                        s.onCompleted();
                        for (int i = 0; i < 100; i++) {
                            s.onNext(1);
                        }
                        s.onCompleted();
                        for (int i = 0; i < 100; i++) {
                            s.onNext(1);
                        }
                    }

                });
            }
        });

        for (int i = 0; i < 1000; i++) {
            Observable<Integer> merge = Observable.merge(o, o, o);
            TestSubscriber<Integer> ts = new TestSubscriber<Integer>();
            merge.subscribe(ts);

            ts.awaitTerminalEvent();
            assertEquals(1, ts.getOnCompletedEvents().size());
            assertEquals(30000, ts.getOnNextEvents().size());
            List<Integer> onNextEvents = ts.getOnNextEvents();
            System.out.println("onNext: " + onNextEvents.size() + " onCompleted: " + ts.getOnCompletedEvents().size());
        }
    }

This emits: java.lang.AssertionError: expected:<30000> but was:<29377>

The fix for this is to not decrement the count if onComplete is received more than once:

             final class InnerObserver extends Subscriber<T> {

+                private boolean innerCompleted = false;
+
                 public InnerObserver() {
                 }

                 @Override
                 public void onCompleted() {
-                    if (runningCount.decrementAndGet() == 0 && completed) {
-                        o.onCompleted();
+                    if (!innerCompleted) {
+                        // we check if already completed otherwise a misbehaving Observable that emits onComplete more than once
+                        // will cause the runningCount to decrement multiple times.
+                        innerCompleted = true;
+                        if (runningCount.decrementAndGet() == 0 && completed) {
+                            o.onCompleted();
+                        }
+                        cleanup();
                     }
-                    cleanup();
                 }

@benjchristensen
Member Author

Just merged the fix for an Observable that emits onCompleted more than once.

@davidmoten
Collaborator

Ben, this serialized subscriber complicates my world for this reason:

I am merging asynchronous observables. Each observable has a resource that must remain open until all onNext calls have finished for that observable. The OnSubscribe for the observable gets a fresh resource, calls onNext repeatedly, then closes the resource and calls onCompleted. Because the SerializedSubscriber has buffered things asynchronously, I am encountering the situation where the resource has been closed before the onNext calls have finished, and I get errors.

Unless I'm mistaken this is a most undesirable effect that will complicate the hell out of writing Observables.

@Strilanc

@benjchristensen My guess was that if a subject sent a re-entrant onCompleted in its serialized onNext callback, that the onComplete would not forward. Something like:

Subject<Object> s = new Subject<>();
Observable<Object> r = s.Serialized();
bool[] a = new bool[1];
r.onCompleted(() -> a[0] = true);
r.onNext((x) -> s.sendCompleted());
s.sendNext(null);
assert(a[0]);

(I apologize if I got some of the method names wrong.)

@Strilanc

@davidmoten It sounds like the problem is that the resource is being closed by the producer, instead of the consumer, when it's the consumer that relies upon it. Would it also break if delay was applied to it?

I don't know the conventions of RxJava well enough to suggest what should be done. My guess would be that the consumer should be responsible for disposing the subscription, even if it completes, and so you'd cleanup-on-dispose rather than cleanup-on-complete-sent.

@davidmoten
Collaborator

@Strilanc good description. Not sure about delay; I'll look it up.

Cleanup on completion, I suspect, is achieved with the Observable.using() method, so I'll try that, but cleanup on dispose is also a possibility. I wrote a quick test and unsubscribe is called even after onCompleted, so that could work. I'm a bit concerned that, in the synchronous case, if I use unsubscribe to close resources then the subsequent observable (say in a flatMap) might commence and be blocked waiting for a resource from a singleton resource pool.

@Strilanc

@davidmoten I don't know the specifics of the situation so this might be a ridiculous suggestion in context, but what about releasing producing rights after on-complete is sent, while only releasing the actual resource based on a subscription reference count?

So, consumer1 subscribes and the resource is allocated. The notifications get sent, and the resource is now no longer "busy" but does still have to exist. Consumer2 subscribes, and does not wait because the resource is not "busy" (despite already being allocated). Consumer2 unsubscribes, and the resource goes back to "allocated but not busy". Consumer1 unsubscribes and the resource gets released.

@benjchristensen
Member Author

Each observable has a resource that must remain open till all onNext calls have finished for that observable.

@davidmoten Once an Observable has emitted all of its onNext it has no further control over the events. Anything downstream can be async (such as zip and observeOn) and onCompleted and cleanup will have been performed by the source Observable while the events are still being processed downstream. That is the nature of async push.

My guess was that if a subject sent a re-entrant onCompleted in its serialized onNext callback, that the onComplete would not forward.

@Strilanc The Rx contract is clear that once an onCompleted or onError is received no further events are to be sent. Those are terminal events. Thus your example breaks the contract by sending onComplete to Observable "s" and then expecting to send more events to it.

My guess would be that the consumer should be responsible for disposing the subscription, even if it completes, and so you'd cleanup-on-dispose rather than cleanup-on-complete-sent.

The unsubscribe always gets called at the end (after onCompleted or onError), but events cannot rely upon the source Observable not having been cleaned up, because any number of compositions can cause the source Observable to be long gone by the time the events are processed. Some examples include: zip, observeOn, cache, replay, delay, window, buffer, and now merge.

The unsubscribe event also can occur well before the final Subscriber receives the events. For example, the merge operator eagerly invokes unsubscribe as soon as onComplete is received otherwise memory leaks can occur on long-lived streaming use cases since the final Subscriber may never actually complete.

From what I can tell, the difference in behavior you're seeing is coincidental. Before this change, the synchronized solution would allow a broken contract to flow through, such as: onNext, onCompleted, onNext. This would flow through until some operator or the final SafeSubscriber filtered out the last onNext that broke the contract. The new serialized solution maintains the contract and filters out everything after a terminal event (according to the Rx contract). It doesn't strictly need to be done in serialize, but it is an optimization that allows it to finish draining the queue. I could go either way on this one ... but no operator or Observable should depend on events propagating after onCompleted, onError or unsubscribe.

@davidmoten What is the use case you have where emitted events depend on their original Observable not having been cleaned up? That is not going to be reliable when it gets composed.

@davidmoten
Collaborator

@benjchristensen thanks for your detailed response. I've refactored my observable to internalize usage of the resource to the Observable, and it now only emits items that do not depend on the resource. I did this on my first iteration of the code a month ago but found nothing in particular pushing me to do it (clearly I hadn't used zip, cache etc. in my use cases and was deliberately avoiding observeOn).

The use case was resource = JDBC connection and item = ResultSet. The first action from the observable was always a map to something not dependent on the resultset/connection, which was frequently flatmapped, but SerializedSubscriber threw a spanner in the works, and quite rightly so I realize. I've internalized that map now so that a ResultSet is not emitted, and all my unit tests pass with 0.17.1. At least now I'll be able to throw async operators at it with impunity. FYI the project is at https://github.com/davidmoten/rxjava-jdbc.

Thanks again for your time and to @Strilanc; 'twas all me and was easily fixed.
Dave


@headinthebox
Contributor

Always keep a copy of http://go.microsoft.com/fwlink/?LinkID=205219 next to your keyboard (which reminds me, we should do a refresh of this, and for every language).

@benjchristensen
Member Author

@davidmoten Glad it was easy to resolve.

benjchristensen added a commit to benjchristensen/RxJava that referenced this pull request Mar 26, 2014
@davidmoten
Collaborator

@benjchristensen With reference to this problem:

The most concerning type of use case in the current implementation is one where 2 events are emitted, one is delivered, the other is queued, then a long time passes before any further events occur. In this use case, the second onNext is just sitting in a queue waiting to be delivered.

I'm encountering this problem now using merge. My use case is described here. My assumption at this point is that the problem is here to stay, and I'm looking for the least objectionable way of handling it. Do we need to be able to parameterize the use of SerializedSubscriber so that we can avoid this behaviour if we choose to? In SerializedSubscriber, once a thread wins the right to emit, we currently drain the queue before onNext is called. Is there any scope for draining the queue some limited/configurable number of times AFTER the onNext is called as well? I certainly see problems with this too, because the last drain of the queue still offers the chance of delaying an incoming onNext. For certain use cases, however, it could reduce the likelihood of the delay happening. Perhaps some probabilistic diminishment is the best we can hope for.

@Strilanc

@davidmoten If an event sits in the queue until the next event, instead of eventually being forwarded even if another event doesn't come along, it's a bug.

Are you actually experiencing this bug?

@davidmoten
Collaborator

@Strilanc Yeah, I'm experiencing it as described here. My first paragraph above quotes Ben on this one, and it's a known issue. I realize on reviewing the conversation above that my rambly ideas seem to have been talked about already. I thought I'd push it along a little, seeing as I'm bumping into the issue now (though not without a somewhat ugly workaround). It strikes me as a serious side-effect of work that is trying to ameliorate deadlocks and blocking.

@benjchristensen
Member Author

I have created #998 for us to continue this discussion and determine how to move forward.
