
XADD Chunked Array Queue #230

Merged: 1 commit merged into JCTools:master from franz1981:xadd_q on Apr 6, 2019

Conversation

5 participants
franz1981 (Contributor) commented Jan 18, 2019

It is a chunked array queue designed to scale with the number of producers.
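As a rough illustration of the producer-side idea (illustrative names only, not this PR's code): each producer claims a slot with a single fetch-and-add (an XADD instruction on x86) on a shared producer index and then locates the chunk that owns that slot, so the hot offer path has no CAS retry loop.

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Minimal sketch of the producer side of an XADD-based chunked queue:
    // producers never CAS-loop on the index, they claim a slot with a single
    // getAndIncrement (an XADD on x86) and then locate the chunk that owns
    // that slot. Names are illustrative, not this PR's.
    final class XaddQueueSketch<E> {
        static final int CHUNK_SIZE = 1024;

        static final class Chunk<E> {
            final long baseIndex; // element index of the first slot in this chunk
            final AtomicReferenceArray<E> slots = new AtomicReferenceArray<>(CHUNK_SIZE);
            final AtomicReference<Chunk<E>> next = new AtomicReference<>();
            Chunk(long baseIndex) { this.baseIndex = baseIndex; }
        }

        private final AtomicLong producerIndex = new AtomicLong();
        private final Chunk<E> firstChunk = new Chunk<>(0);

        public void offer(E e) {
            final long index = producerIndex.getAndIncrement(); // the XADD: one atomic op per offer
            Chunk<E> chunk = firstChunk;
            // Walk (and, if needed, append) chunks until reaching the one that owns `index`.
            // A real queue would cache the current producer chunk and recycle drained
            // chunks instead of walking from the start on every offer.
            while (index >= chunk.baseIndex + CHUNK_SIZE) {
                Chunk<E> next = chunk.next.get();
                if (next == null) {
                    next = new Chunk<>(chunk.baseIndex + CHUNK_SIZE);
                    if (!chunk.next.compareAndSet(null, next)) {
                        next = chunk.next.get(); // another producer appended it first
                    }
                }
                chunk = next;
            }
            chunk.slots.lazySet((int) (index - chunk.baseIndex), e); // publish the element
        }
    }

The single-consumer poll/peek side and the chunk recycling discussed later in this thread are omitted from the sketch.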

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch from 333bfc5 to 27a0229 Jan 18, 2019

franz1981 (Contributor, Author) commented Jan 18, 2019

@nitsanw It is just a lil' more than a POC, but I will keep improving it if the logic seems reasonable and sound to you, so please take a look at the logic 👍
I have yet to fix:

  1. tests
  2. implement missing methods
  3. use standard naming
  4. use Unsafe where necessary
  5. padding to protect from false sharing (see the sketch at the end of this comment)
  6. "maybe" implement AtomicChunk::next using a JUMP + AtomicChunk element

The first benchmarks show better scaling than the other multi-producer queues, and the code/logic/layout is not optimized at all 👍
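A minimal sketch of the padding mentioned in item 5, in the field-layering style JCTools queues generally use (runs of unused long fields isolate the hot counters on their own cache lines); class and field names here are illustrative only:

    // Illustrative only: hot, independently-updated fields (producer index vs
    // consumer index) are separated by "cold" padding fields so they do not
    // share a cache line, avoiding false sharing between producers and the consumer.
    abstract class PadProducerIndex {
        long p01, p02, p03, p04, p05, p06, p07, p08;
        long p11, p12, p13, p14, p15, p16, p17;
    }

    abstract class ProducerIndexField extends PadProducerIndex {
        protected volatile long producerIndex; // bumped by producers via XADD
    }

    abstract class PadConsumerIndex extends ProducerIndexField {
        long c01, c02, c03, c04, c05, c06, c07, c08;
        long c11, c12, c13, c14, c15, c16, c17;
    }

    abstract class ConsumerIndexField extends PadConsumerIndex {
        protected long consumerIndex; // single consumer, so plain writes plus ordered stores suffice
    }

    // The concrete queue class would then extend ConsumerIndexField and keep
    // its remaining, rarely-written fields after the padded counters.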

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch from 27a0229 to c158651 Jan 18, 2019

coveralls commented Jan 18, 2019

Pull Request Test Coverage Report for Build 511

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 47.61%

Totals Coverage Status
Change from base Build 507: 0.0%
Covered Lines: 1673
Relevant Lines: 3514

💛 - Coveralls
eolivelli left a comment

Awesome.
Just a newbie question: why is there no new test case?

franz1981 (Contributor, Author) commented Jan 18, 2019

@eolivelli I will probably reuse some of the already existing tests, but as I have written above, this is just a PR to get a review on the queue logic; I will squash a more refined version if the logic is validated 👍

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch 4 times, most recently from 59519c0 to 714a8fc Jan 18, 2019

franz1981 (Contributor, Author) commented Jan 20, 2019

I'm working on an idea to recycle the consumed buffers and make it GC-free in the same way MpscChunkedArrayQueue is 👍
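A minimal sketch of one way such recycling can work, assuming the single consumer hands a fully-drained chunk back through a small holder that producers check before allocating; this illustrates the idea only and is not the PR's freeBuffer implementation (names are hypothetical):

    import java.util.concurrent.atomic.AtomicReference;

    // Sketch of the recycling idea: the single consumer parks a drained chunk
    // in a one-slot "spare" holder instead of letting it become garbage, and a
    // producer that needs to append a new chunk reuses the spare when one is
    // available. With a single slot the hand-off is a plain getAndSet, so
    // there is no ABA hazard to reason about.
    final class ChunkRecycler<C> {
        private final AtomicReference<C> spare = new AtomicReference<>();

        // Consumer side: called once a chunk has been fully drained.
        void release(C drainedChunk) {
            spare.lazySet(drainedChunk); // overwrite any older spare; that one becomes garbage
        }

        // Producer side: returns a recycled chunk, or null if the caller must
        // allocate a fresh one (keeping the queue unbounded). A real
        // implementation must also reset the chunk's slots and identity
        // (e.g. its base index) before re-publishing it.
        C acquireOrNull() {
            return spare.getAndSet(null);
        }
    }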

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch 5 times, most recently from 41dfa8e to a27ecf0 Jan 20, 2019

franz1981 (Contributor, Author) commented Jan 20, 2019

@eolivelli @nitsanw I've played a lil' more with this over the weekend, making it possible to recycle chunks, adding tests and padding, and refining the implementation to use Unsafe.
The performance results are interesting: with 2 producers I'm getting 20 ops/us vs 13 ops/us for MpscUnboundedArrayQueue, with no garbage produced.
With 7 producers I'm getting 40 ops/us vs 8 ops/us for MpscUnboundedArrayQueue, with nearly no garbage produced (it depends on long stalls of any producer thread).
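For context, throughput numbers like these are typically taken with a JMH group benchmark that pins N producer threads on offer and one consumer thread on poll; here is a minimal sketch of such a harness (not JCTools' benchmark code), using the MpscUnboundedArrayQueue baseline mentioned above as the queue under test:

    import java.util.Queue;
    import java.util.concurrent.TimeUnit;

    import org.jctools.queues.MpscUnboundedArrayQueue;
    import org.openjdk.jmh.annotations.*;

    // Sketch of a producer/consumer throughput benchmark: 7 producer threads
    // offer while 1 consumer thread polls, reported in ops/us. Swap the queue
    // constructed in init() for the XADD queue under test to compare.
    @State(Scope.Group)
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public class QueueThroughputSketch {
        private static final Integer ONE = 1;
        Queue<Integer> queue;

        @Setup
        public void init() {
            queue = new MpscUnboundedArrayQueue<>(1024); // baseline from the comment above
        }

        @Benchmark
        @Group("tpt")
        @GroupThreads(7) // the producer count being scaled above
        public void offer() {
            queue.offer(ONE);
        }

        @Benchmark
        @Group("tpt")
        @GroupThreads(1) // single consumer
        public Integer poll() {
            return queue.poll();
        }
    }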

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch 2 times, most recently from 1ee2c31 to c72f06b Jan 20, 2019

franz1981 (Contributor, Author) commented Jan 20, 2019

I think that both poll/peek and fill could be improved a lot; the single-threaded fill benchmarks are not as good as I was expecting!

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch from c72f06b to 6c4aa8f Jan 20, 2019

consumerBuffer = next;
this.consumerBuffer = consumerBuffer;
}
E e = consumerBuffer.lvElement(consumerOffset);

nitsanw (Contributor) commented on the diff above, Jan 21, 2019

The following code can be rearranged, as is done elsewhere, for single-consumer code

franz1981 (Author) replied, Jan 21, 2019

Do you mean by grouping everything under a single method or just by grouping the "slow" path?

        assert !firstElementOfNewChunk;
        if (lvProducerIndex() == consumerIndex)
        {
            return null;
        }
        e = spinForElement(consumerBuffer, consumerOffset);
        consumerBuffer.spElement(consumerOffset, null);
        soConsumerIndex(consumerIndex + 1);
        return e;
nitsanw (Contributor) commented Jan 21, 2019

Great stuff! This makes for an interesting tradeoff, though the progress is still blocked if the new buffer allocator is interrupted (so this is still blocking producers on new buffer). The progress guarantees are the same overall, so this is not more "progressive" IMO. You can make it more progressive if the first to discover a missing chunk allocates it, rather than the first in chunk, but this is not simple...

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch from 6c4aa8f to 931fcc8 Jan 21, 2019

franz1981 (Contributor, Author) commented Jan 21, 2019

though the progress is still blocked if the new buffer allocator is interrupted (so this is still blocking producers on new buffer).

Eheh, smart observation: that's why I've added the recycling of the buffers, to speed up publishing a new chunk to be used by the other producers awaiting it.

The progress guarantees are the same overall, so this is not more "progressive" IMO

You're right, but I wasn't able to find a better name for this :P
MostlyProgressive seems even worse eheh

You can make it more progressive if the first to discover a missing chunk allocates it, rather than the first in chunk, but this is not simple...

Nope, it is not simple, but it is an interesting idea! Not sure I will be able to get it into this PR

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch from 931fcc8 to 3722d6b Jan 21, 2019

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch from 3722d6b to 4ca24d8 Jan 21, 2019

franz1981 (Contributor, Author) commented Jan 21, 2019

@nitsanw I've addressed most of the changes I've understood: let me know if it seems better and... I'll probably need some help to improve fill/drain, bud 👍

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch 5 times, most recently from 2564497 to abec981 Jan 21, 2019

franz1981 (Contributor, Author) commented Jan 21, 2019

@nitsanw I've added some significant changes to the producerBuffer publication/initialization logic due to the complication of recycling instances, but I think I have addressed the weird corner cases caused by long stalls of producer threads. Please let me know if the comments are clear enough: I wasn't expecting the recycling logic to affect the offer logic that much!

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch 2 times, most recently from e29c6e1 to f4b7931 Jan 22, 2019

franz1981 (Contributor, Author) commented Jan 24, 2019

I'm just too curious to see how it performs on a machine with many cores: @nitsanw is there any chance to test this one on such hardware?

franz1981 (Contributor, Author) commented Jan 27, 2019

@nitsanw I've addressed the challenge in a separate commit:

though the progress is still blocked if the new buffer allocator is interrupted (so this is still blocking producers on new buffer). The progress guarantees are the same overall, so this is not more "progressive" IMO. You can make it more progressive if the first to discover a missing chunk allocates it, rather than the first in chunk, but this is not simple...

Now it should be more progressive: any offer that finds a missing chunk (including previous ones) will try to fill the gap, but given that the (partial) ordering of producer buffer publication should be preserved, I've used a CAS on a separate producerChunkIndex to maintain the publication order.
Maybe there is a better way to make it totally progressive, but I will probably need a lot more time to find it :)

From a perf point of view I can't see any evident benefit in this new solution, but maybe I'm not benchmarking it properly, or the implementation is just not "progressive" enough.
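A rough sketch of what that cooperative, ordered publication can look like (hypothetical names, not this PR's code), reusing the Chunk type from the earlier sketch: producers that find the chunk counter behind the chunk they need race with a CAS to advance it one step at a time, and only each step's winner links that chunk, so publication order is preserved even when several producers discover the gap at once.

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of cooperative, in-order chunk publication. Any producer that
    // finds producerChunkIndex behind the chunk it needs tries to advance it
    // by one with a CAS; only each step's winner links that chunk.
    final class CooperativeAppendSketch<E> {
        private static final int CHUNK_SIZE = XaddQueueSketch.CHUNK_SIZE;

        private final AtomicLong producerChunkIndex = new AtomicLong(); // number of the last claimed chunk
        private volatile XaddQueueSketch.Chunk<E> tail;                  // last fully published chunk

        CooperativeAppendSketch(XaddQueueSketch.Chunk<E> first) {
            this.tail = first;
        }

        // Called by a producer whose claimed element index falls in chunk number `requiredChunk`.
        XaddQueueSketch.Chunk<E> producerChunkFor(long requiredChunk, XaddQueueSketch.Chunk<E> cached) {
            long appended;
            while ((appended = producerChunkIndex.get()) < requiredChunk) {
                // Losers simply re-read and help with the next missing step;
                // the CAS winner appends exactly chunk `appended + 1`.
                if (producerChunkIndex.compareAndSet(appended, appended + 1)) {
                    XaddQueueSketch.Chunk<E> prev;
                    // Wait until the previous step's winner has published chunk `appended`.
                    while ((prev = tail).baseIndex != appended * (long) CHUNK_SIZE) {
                        Thread.onSpinWait();
                    }
                    XaddQueueSketch.Chunk<E> next =
                            new XaddQueueSketch.Chunk<>(prev.baseIndex + CHUNK_SIZE);
                    prev.next.lazySet(next);
                    tail = next; // publish in order: only this step's winner writes tail
                }
            }
            // All chunks up to `requiredChunk` are claimed; chase next links from the
            // caller's cached chunk, spinning where a link has not been written yet.
            XaddQueueSketch.Chunk<E> chunk = cached;
            while (chunk.baseIndex < requiredChunk * (long) CHUNK_SIZE) {
                XaddQueueSketch.Chunk<E> next;
                while ((next = chunk.next.get()) == null) {
                    Thread.onSpinWait();
                }
                chunk = next;
            }
            return chunk;
        }
    }

As noted in the review, a winner that stalls before linking still blocks later producers at the spin, so the overall progress guarantee of this sketch is unchanged; the gain is only that any producer can help advance past previously missing chunks.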

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch 3 times, most recently from fbfafb0 to c6a4da9 Jan 27, 2019

franz1981 (Contributor, Author) commented Feb 15, 2019

@eolivelli @felixbarny did you guys have a chance to try it or check the logic?

felixbarny commented Feb 15, 2019

Sorry, but I didn't try it out yet. For my use case, it's important that the size of the queue is limited. This one seems to be unlimited? Ideally, I'd also need an XADD-based MPMC queue.

<offtopic>
A quick outline of my use case. I'm writing an APM agent. It needs to collect traces from multiple application threads which are transferred to a reporter thread. After reporting, the objects are reset and put in an object pool for reuse. When a request is recorded, the agent takes a recycled object from the pool in the context of an application thread. I'm currently using a MPSC disruptor to transfer the objects from the application threads to the reporter thread. For the object pool, I'm using a MPMCArrayQueue. This design is quite simple but getting objects from the pool and putting them into the disruptor is contended. I have some ideas to reduce the contention with ThreadLocals but that would be even more off-topic here. A XADD-based MPMC queue would allow the architecture to remain simple but still not be prone to contention.
</offtopic>

franz1981 (Contributor, Author) commented Feb 15, 2019

I have already implemented a bounded MPSC XADD queue, but it is prone to being blocked forever in very special cases, i.e. it is in the experimental branch of JCTools for that reason; it is inspired by the Aeron log buffer algorithm.
ATM I can't see a way to preserve the queue semantics while using XADD on the producer side: without being too negative, I think this is difficult to achieve with a bounded queue.
That's why I have built this unbounded queue.

This queue could be "easily" turned into an MPMC one (but without using XADD on the consumer side, obviously), and I could provide a weakOffer method that would just weakly back-pressure a producer by looking at the current distance of the consumers, reducing the chance of allocating new chunks (see the sketch below).
I'm not sure it is the right tool TBH, whereas using an MPMC bounded queue with batch drain/fill should work a lot better.
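Purely as an illustration of the weakOffer idea floated above (this method is not part of the PR): the producer takes a racy snapshot of the distance between the indices and declines to offer beyond a soft capacity. lvProducerIndex()/lvConsumerIndex() stand for JCTools-style volatile loads of the two indices and are assumed here.

    // Hypothetical weakOffer, sketched only to illustrate the idea floated
    // above. Back-pressure is best-effort because the size check and the XADD
    // claim are not atomic together, so the soft capacity can be overshot
    // under contention.
    public boolean weakOffer(E e, int softCapacity) {
        final long inFlight = lvProducerIndex() - lvConsumerIndex(); // racy size snapshot
        if (inFlight >= softCapacity) {
            return false;    // caller may retry, back off, or drop
        }
        return offer(e);     // the normal unbounded XADD offer path
    }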

franz1981 (Contributor, Author) commented Mar 3, 2019

@kay I have always referred to Nitsan for the review of this PR, but please, feel free to validate the code as well, eh 👍

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch from c6a4da9 to 125917c Mar 11, 2019

franz1981 (Contributor, Author) commented Mar 11, 2019

@nitsanw I've pushed a squashed commit addressing a null check, but the logic should be the same.
The first commit, which wasn't performing cooperative/progressive allocation of missing chunks, isn't needed anymore IMHO, but I'm leaving it here while awaiting the code review: I will squash the commits into one right after that 👍

@franz1981 franz1981 force-pushed the franz1981:xadd_q branch from 125917c to b0febce Mar 18, 2019

franz1981 (Contributor, Author) commented Mar 18, 2019

I've decided to squash the commits into one, given that the "more progressive" version is preferable to the original one. @nitsanw let me know if I'm missing other parts to clean up or any improvement you see to the algorithm/implementation.
I see that the freeBuffer could be replaced by an SpscArrayQueue configurable at startup and many volatile loads could be replaced, but my only perf-wise concern is the batch operations, which don't seem as effective as I was expecting.

franz1981 (Contributor, Author) commented Apr 1, 2019

@nitsanw I've tried a scaling benchmark with increasing numbers of producers, but I'm not getting what I was expecting:
[image: scaling benchmark results]
According to http://www.perfdynamics.com/Manifesto/USLscalability.html it is quite clear that there are some other costs in place that are reducing the scalability of this queue.
The first part of the shape of the chart perfectly matches the classic USL (with some coherency and contention factors to be estimated):
[image: classic USL curve]
But the most intriguing part to me is that I've found an evident pattern in the tail of the chart, and it seems somehow related to Gene Amdahl himself:
[image]
After several hours I can show you my bet:
[image]
Can you see it? The nose of Amdahl is a perfect match with my results: and it is totally reproducible!!!

It is clear that the Universal Scalability Law is an extension of Amdahl's Law, but Amdahl himself is a personified extension of it as well; and he seems to go beyond that too (!!!).
I'm sure that Little too could be hidden somewhere in my benchmark results...
maybe next 1st of April I'll find out :)

@nitsanw nitsanw merged commit 9f745a0 into JCTools:master Apr 6, 2019

2 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage remained the same at 47.61%
nitsanw (Contributor) commented Apr 6, 2019

I'll review further before the next release, but generally it looks good :-)
Thanks for your efforts and patience
