
XADD Chunked Array Queue #230

Merged: 1 commit merged into JCTools:master from the xadd_q branch, Apr 6, 2019

Conversation

franz1981 (Collaborator) commented Jan 18, 2019

It is a chunked array queue designed to scale with the number of producers.
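
As a rough sketch of the core idea (hypothetical names, greatly simplified with respect to the actual PR code): each producer claims a slot with a single getAndIncrement on a shared producer index, which compiles down to a lock xadd on x86, and the chunk index plus the offset inside the chunk are derived from the claimed sequence, so producers never retry a CAS loop against each other to publish.

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Hypothetical illustration only: the real queue links AtomicChunk instances
    // together and recycles them; here a single fixed chunk stands in for that.
    final class XaddOfferSketch<E>
    {
        private static final int CHUNK_SIZE = 1024;            // power of two
        private static final int CHUNK_MASK = CHUNK_SIZE - 1;
        private static final int CHUNK_SHIFT = Integer.numberOfTrailingZeros(CHUNK_SIZE);

        private final AtomicLong producerIndex = new AtomicLong();
        private final AtomicReferenceArray<E> chunk = new AtomicReferenceArray<>(CHUNK_SIZE);

        boolean offer(final E e)
        {
            // one atomic fetch-and-add per offer: producers never contend on a CAS retry loop
            final long seq = producerIndex.getAndIncrement();
            final long chunkIndex = seq >> CHUNK_SHIFT;         // which chunk to write into
            final int offsetInChunk = (int) (seq & CHUNK_MASK); // slot inside that chunk
            // the real offer walks/appends the chunk list until it reaches chunkIndex;
            // this sketch just writes into the single pre-allocated chunk
            assert chunkIndex == 0 : "sketch only models the first chunk";
            chunk.lazySet(offsetInChunk, e);                    // ordered store publishes the element
            return true;
        }
    }

The consumer side then spins until the element store becomes visible, which is the spinForElement path discussed further down.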

franz1981 (Collaborator, Author):

@nitsanw It is just a little more than a POC, but I will continue to improve it if the logic seems reasonable and sound to you, so please take a look at the logic 👍
I have yet to:

  1. fix the tests
  2. implement the missing methods
  3. use standard naming
  4. use Unsafe where necessary
  5. add padding to protect from false sharing
  6. "maybe" implement AtomicChunk::next using a JUMP + AtomicChunk element

The first benchmarks show better scaling than the other multi-producer queues, and the code/logic/layout is not optimized at all 👍

coveralls commented Jan 18, 2019

Pull Request Test Coverage Report for Build 511

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 47.61%

Totals (Coverage Status):
  • Change from base Build 507: 0.0%
  • Covered Lines: 1673
  • Relevant Lines: 3514

💛 - Coveralls

eolivelli left a comment:

Awesome.
Just a newbie question, why is there no new test case?

franz1981 (Collaborator, Author) commented Jan 18, 2019

@eolivelli
I will probably reuse some of the already existing tests, but as I have written above this is just a PR to get a review on the queue logic; I will squash a more refined version if the logic gets validated 👍

@franz1981 franz1981 force-pushed the xadd_q branch 4 times, most recently from 59519c0 to 714a8fc Compare January 19, 2019 15:47
franz1981 (Collaborator, Author):

I'm working on an idea to recycle the consumed buffers and make it GC-free in the same way MpscChunkedArrayQueue is 👍
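
To make the direction concrete, a minimal sketch of what I mean by recycling (hypothetical, the chunk type is reduced to a bare array; the real code has linked AtomicChunk instances): the single consumer parks a drained chunk in a one-slot free list instead of dropping it, and a producer that has to append a new chunk tries that free list before allocating.

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.concurrent.atomic.AtomicReferenceArray;

    // Hypothetical one-slot chunk pool: enough for a single-consumer queue,
    // because only the consumer thread ever releases chunks.
    final class ChunkPoolSketch<E>
    {
        private final AtomicReference<AtomicReferenceArray<E>> freeChunk = new AtomicReference<>();

        // consumer side: park a fully drained chunk instead of letting the GC take it
        void release(final AtomicReferenceArray<E> drained)
        {
            freeChunk.lazySet(drained);
        }

        // producer side: reuse the parked chunk if present, otherwise allocate a new one
        AtomicReferenceArray<E> acquire(final int chunkSize)
        {
            final AtomicReferenceArray<E> pooled = freeChunk.getAndSet(null);
            return pooled != null ? pooled : new AtomicReferenceArray<>(chunkSize);
        }
    }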

@franz1981 franz1981 force-pushed the xadd_q branch 5 times, most recently from 41dfa8e to a27ecf0 Compare January 20, 2019 14:55
franz1981 (Collaborator, Author) commented Jan 20, 2019

@eolivelli @nitsanw I've played a little more with this over the weekend, making it possible to recycle chunks, adding tests and padding, and refining the implementation to use Unsafe.
The performance results are interesting: with 2 producers I'm getting 20 ops/us vs 13 ops/us for MpscUnboundedArrayQueue, with no garbage produced.
With 7 producers I'm getting 40 ops/us vs 8 ops/us for MpscUnboundedArrayQueue, with near-zero garbage produced (it depends on long stalls of any producer thread).

franz1981 (Collaborator, Author):

I think that both poll/peek and fill could be improved a lot; the single-threaded fill benchmarks are not as good as I was expecting!

consumerBuffer = next;
this.consumerBuffer = consumerBuffer;
}
E e = consumerBuffer.lvElement(consumerOffset);

Contributor (review comment on the snippet above):

The following code can be re-arranged as elsewhere for single-consumer code.

franz1981 (Collaborator, Author) replied Jan 21, 2019:

Do you mean grouping everything under a single method, or just grouping the "slow" path?

        assert !firstElementOfNewChunk;
        // empty queue: the producer index hasn't moved past the consumer index
        if (lvProducerIndex() == consumerIndex)
        {
            return null;
        }
        // an offer is in flight: spin until its element store becomes visible
        e = spinForElement(consumerBuffer, consumerOffset);
        // clear the slot with a plain store, then publish the new consumer index
        consumerBuffer.spElement(consumerOffset, null);
        soConsumerIndex(consumerIndex + 1);
        return e;

nitsanw (Contributor) commented Jan 21, 2019

Great stuff! This makes for an interesting tradeoff, though the progress is still blocked if the new buffer allocator is interrupted (so this is still blocking producers on new buffer). The progress guarantees are the same overall, so this is not more "progressive" IMO. You can make it more progressive if the first to discover a missing chunk allocates it, rather than the first in chunk, but this is not simple...

franz1981 (Collaborator, Author):

though the progress is still blocked if the new buffer allocator is interrupted (so this is still blocking producers on new buffer).

Eheh, smart observation: that's why I've added the recycling of the buffers, to speed up publishing a new chunk for the other producers awaiting it.

The progress guarantees are the same overall, so this is not more "progressive" IMO

You're right, but I wasn't able to find a better name for this :P
MostlyProgressive seems even worse eheh

You can make it more progressive if the first to discover a missing chunk allocates it, rather than the first in chunk, but this is not simple...

Nope, but it is an interesting idea! Not sure I will be able to get it into this PR

franz1981 (Collaborator, Author):

@nitsanw I've addressed most of the changes as I've understood them: let me know if it seems better and... I will probably need some help to improve fill/drain, bud 👍

@franz1981 franz1981 force-pushed the xadd_q branch 5 times, most recently from 2564497 to abec981 Compare January 21, 2019 18:47
franz1981 (Collaborator, Author) commented Jan 21, 2019

@nitsanw I've made some significant changes to the producerBuffer publication/initialization logic due to the complication of recycling instances, but I think I have addressed the weird corner cases caused by long stalls of producer threads. Please let me know if the comments are clear enough: I wasn't expecting the recycling logic to affect the offer logic that much!

@franz1981 franz1981 force-pushed the xadd_q branch 2 times, most recently from e29c6e1 to f4b7931 Compare January 23, 2019 06:31
franz1981 (Collaborator, Author):

I'm just too curious to see how it performs on a machine with many cores: @nitsanw is there any chance to test this one on such hardware?

franz1981 (Collaborator, Author) commented Jan 27, 2019

@nitsanw I've addressed the challenge in a separate commit:

though the progress is still blocked if the new buffer allocator is interrupted (so this is still blocking producers on new buffer). The progress guarantees are the same overall, so this is not more "progressive" IMO. You can make it more progressive if the first to discover a missing chunk allocates it, rather than the first in chunk, but this is not simple...

Now it should be more progressive: any offer that finds any missing chunk (including previous ones) will try to fill the gap, but given that the (partial) ordering of producer buffer publication should be preserved, I've used a CAS on a separate producerChunkIndex to maintain the publication order.
Maybe there is a better way to make it totally progressive, but I will probably need a lot more time to find it :)

From a perf point of view I can't see any evident benefit in this new solution, but maybe I'm not benchmarking it properly or the implementation is just not "progressive" enough.
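
To make the idea concrete, a heavily simplified sketch (names and structure are hypothetical and don't match the actual commit): any producer whose claimed sequence maps into a chunk that isn't visible yet helps to append the missing chunks one by one; the CAS on producerChunkIndex elects a single winner per chunk, which preserves the publication order of chunks while everyone else keeps helping or spinning instead of waiting for one designated producer.

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicReference;

    // Hypothetical sketch of the "more progressive" chunk append.
    final class ProgressiveAppendSketch<E>
    {
        static final class Chunk<T>
        {
            final long index;
            final AtomicReference<Chunk<T>> next = new AtomicReference<>();

            Chunk(final long index)
            {
                this.index = index;
            }
        }

        private final AtomicLong producerChunkIndex = new AtomicLong();
        private final AtomicReference<Chunk<E>> producerChunk =
            new AtomicReference<Chunk<E>>(new Chunk<E>(0));

        // Called by a producer that needs chunk `requiredIndex` but cannot see it yet:
        // it helps to append every missing chunk, in order, instead of waiting for a
        // single designated "first in chunk" producer to do it.
        Chunk<E> findOrAppend(final long requiredIndex)
        {
            Chunk<E> current = producerChunk.get();
            while (current.index < requiredIndex)
            {
                final long nextIndex = current.index + 1;
                // only one producer wins the right to append chunk `nextIndex`
                if (producerChunkIndex.compareAndSet(current.index, nextIndex))
                {
                    final Chunk<E> appended = new Chunk<E>(nextIndex);
                    current.next.lazySet(appended);
                    producerChunk.lazySet(appended);
                    current = appended;
                }
                else
                {
                    // someone else is appending: re-read and keep helping/spinning
                    final Chunk<E> latest = producerChunk.get();
                    current = latest.index > current.index ? latest : current;
                }
            }
            return current;
        }
    }

In this sketch producers still spin in the small window between the winning CAS and the publication of the new chunk, which is one reason it isn't totally progressive.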

@franz1981 franz1981 force-pushed the xadd_q branch 3 times, most recently from fbfafb0 to c6a4da9 Compare January 28, 2019 08:59
franz1981 (Collaborator, Author):

@eolivelli @felixbarny did you get a chance to try it or check the logic, guys?

felixbarny:

Sorry, but I didn't try it out yet. For my use case, it's important that the size of the queue is limited. This one seems to be unbounded? Ideally, I'd also need an XADD-based MPMC queue.

<offtopic>
A quick outline of my use case. I'm writing an APM agent. It needs to collect traces from multiple application threads, which are transferred to a reporter thread. After reporting, the objects are reset and put into an object pool for reuse. When a request is recorded, the agent takes a recycled object from the pool in the context of an application thread. I'm currently using an MPSC disruptor to transfer the objects from the application threads to the reporter thread. For the object pool, I'm using an MPMCArrayQueue. This design is quite simple, but getting objects from the pool and putting them into the disruptor is contended. I have some ideas to reduce the contention with ThreadLocals, but that would be even more off-topic here. An XADD-based MPMC queue would allow the architecture to remain simple but still not be prone to contention (a rough sketch of the current setup follows below).
</offtopic>
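
For context, a hypothetical sketch of that pipeline using plain JCTools queues in place of the disruptor (class and field names are made up): application threads borrow recycled events from an MPMC pool and hand them to the single reporter thread, which recycles them after reporting; the pool poll and the transfer offer are the contended hot spots I mentioned.

    import org.jctools.queues.MpmcArrayQueue;
    import org.jctools.queues.MpscArrayQueue;

    // Hypothetical illustration of the pipeline described above, not the agent's real code.
    final class ReporterPipelineSketch
    {
        static final class Event
        {
            // trace payload fields would live here; they get reset before reuse
        }

        private final MpmcArrayQueue<Event> pool = new MpmcArrayQueue<>(1024);        // object pool, MPMC
        private final MpscArrayQueue<Event> toReporter = new MpscArrayQueue<>(1024);  // stands in for the MPSC disruptor

        // application thread: both calls below are contended across all application threads
        boolean record()
        {
            Event event = pool.poll();
            if (event == null)
            {
                event = new Event(); // pool exhausted: allocate (or drop the trace)
            }
            return toReporter.offer(event);
        }

        // reporter thread: drain one event, report it, then recycle it back into the pool
        void reportOne()
        {
            final Event event = toReporter.poll();
            if (event != null)
            {
                // ... serialize and send the event ...
                pool.offer(event); // best effort: if the pool is full the event is just dropped
            }
        }
    }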

franz1981 (Collaborator, Author) commented Feb 15, 2019

I have already implemented a bounded MPSC XADD queue, but it is prone to being blocked forever in very special cases, i.e. it sits in the experimental branch of JCTools for that reason, and it is inspired by the Aeron log buffer algorithm.
I can't ATM see a way to preserve the queue semantics while using XADD on the producer side: without being too negative, I think this is difficult to achieve with a bounded queue.
That's why I have built this unbounded queue.

This queue could be "easily" turned into an MPMC one (but without using XADD on the consumer side, obviously) and I could provide a weakOffer method that would just weakly back-pressure a producer by looking at its current distance from the consumers, reducing the chance of allocating new chunks.
I'm not sure it is the right tool TBH, given that an MPMC bounded queue with batch drain/fill should work a lot better.
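
Purely as a hypothetical sketch of what I mean by weakOffer (not code from this PR, and the check here uses the queue's size() rather than a raw producer/consumer index distance): the producer compares the observed backlog against a soft capacity and backs off instead of appending, so the queue stays unbounded but a well-behaved producer still sees back-pressure.

    import java.util.Queue;

    // Hypothetical wrapper: size() is only an estimate under concurrency and the
    // check is not atomic with the offer, so the bound is soft by design.
    final class WeakOfferSketch<E>
    {
        private final Queue<E> queue;      // e.g. the unbounded XADD queue
        private final int softCapacity;

        WeakOfferSketch(final Queue<E> queue, final int softCapacity)
        {
            this.queue = queue;
            this.softCapacity = softCapacity;
        }

        // returns false when the observed backlog already exceeds the soft capacity;
        // the caller can retry, drop, or fall back to a blocking strategy
        boolean weakOffer(final E e)
        {
            if (queue.size() >= softCapacity)
            {
                return false;              // weak back-pressure: avoid growing into a new chunk
            }
            return queue.offer(e);
        }
    }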

franz1981 (Collaborator, Author) commented Mar 3, 2019

@kay I have always referenced Nitsan for the review of this PR, but please, feel free to validate the code as well eh 👍

franz1981 (Collaborator, Author):

@nitsanw I've pushed a squashed commit addressing a null check, but the logic should be the same.
The first commit, which wasn't performing cooperative/progressive allocation of missing chunks, isn't needed anymore IMHO, but I'm leaving it here while awaiting the code review: I will squash the commits into one right after that 👍

franz1981 (Collaborator, Author):

I've decided to squash the commits into one, given that the "more progressive" version is preferable to the original one. @nitsanw let me know if I'm missing other parts to clean up or any improvement you see in the algorithm/implementation.
I see that the freeBuffer could be replaced by a SpscArrayQueue configurable at startup and that many volatile loads could be replaced, but my only concern perf-wise is the batch operations, which don't seem to be as effective as I was expecting.

franz1981 (Collaborator, Author) commented Apr 1, 2019

@nitsanw I've tried a scaling benchmark with an increasing number of producers, but I'm not getting what I was expecting:
[image: throughput vs number of producers]
According to http://www.perfdynamics.com/Manifesto/USLscalability.html it is quite clear that there are some other costs in place that are reducing the scalability of this queue.
The first part of the chart's shape perfectly matches the classic USL (for some coherency and contention factors still to be estimated):
[image: classic USL curve]
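
For reference, the model from that link (the standard USL formulation, nothing specific to this PR): relative capacity as a function of the number of producers N, with a contention coefficient sigma and a coherency coefficient kappa; Amdahl's Law is the special case with the coherency term set to zero.

    % Universal Scalability Law: \sigma = contention, \kappa = coherency cost
    C(N) = \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)}

    % Amdahl's Law is recovered when \kappa = 0:
    C(N) = \frac{N}{1 + \sigma (N - 1)}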
But the most intriguing part to me is that I've found an evident pattern in the tail of the chart,
and it seems somehow related to Gene Amdahl himself:
[image: Gene Amdahl]
After several hours I can show you my bet:
[image: the benchmark chart with Amdahl's profile overlaid]
Can you see it? The nose of Amdahl is a perfect match with my results: and it is totally reproducible!!!

It is clear that the Universal Scalability Law is an extension of Amdahl's Law, but Amdahl himself is a personified extension of it as well; and he seems to go beyond that too (!!!).
I'm sure that Little too could be hidden somewhere in my benchmark results...
maybe next 1st of April I'll find out :)

@nitsanw nitsanw merged commit 9f745a0 into JCTools:master Apr 6, 2019
nitsanw (Contributor) commented Apr 6, 2019

I'll review further before the next release, but generally it looks good :-)
Thanks for your efforts and patience
