
Can someone lend a really multicore machine (16/32/64?) to experiment? #3

Closed

t0yv0 opened this issue Feb 21, 2014 · 4 comments

Comments

@t0yv0
Contributor

t0yv0 commented Feb 21, 2014

It might be interesting to benchmark and tune scaling to many cores.

@polytypic
Member

FYI, I just ordered a new machine with a 10-core (20-thread) CPU and will later extend that with a second CPU to be able to test and fix low level scaling issues.

BTW, I believe the main (and almost only) component in Hopac that needs to scale (and is already designed to do so) is the work-distributing scheduler. I believe almost everything else should actually be optimized, by default, for low-contention scenarios and for low overhead (minimizing memory usage and the cost of creating and discarding objects like channels and jobs).

This way, end-to-end processing performance with massive use of lightweight threads is maximized (see the section "A Case Study on Concurrency-Oriented Programming" in chapter 4 of the book "Erlang Programming" by Cesarini and Thompson). The other approach, maximizing the throughput of individual components (queues/pipelines), leads to an architecture like the LMAX Disruptor, which is a very different way to organize a program: you wouldn't build a system that has millions of ring buffers being dynamically created and destroyed all the time.

It is certainly possible to mix and match, and in the future I plan to implement slightly more scalable channel-like abstractions that I call FanIn and FanOut buffers. But making basic channels, locks, ivars and so on more multicore scalable would, in my estimate (based on my understanding of low-level hardware mechanisms), actually reduce performance by about a factor of two in the limit (assuming perfect code generation by the .Net runtime) and, at the limit, require a factor of N more memory (N being the number of cores).

When message passing abstractions are used in a system that uses lightweight threads for end-to-end processing, they are quite naturally used mostly in a low-contention manner. An ivar, for example, is typically allocated to get a response from a concurrent server. It is typically accessed by at most two threads, each of which accesses the ivar only once, and most likely not at the exact same time. This is a very low contention scenario. If an individual ivar were made "multicore scalable" for high-contention scenarios, it would require more than doubling the amount of memory used per ivar, and the typical usage pattern would require about double the number of cache-line transfers. So, in my opinion, it makes more sense to make ivars scalable in the sense that you can have twice as many with twice the performance in the typical usage pattern.
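To make the low-contention ivar pattern concrete, here is a rough sketch of the request/response usage described above. This is a Python analogy, not Hopac code: the one-shot reply cell (modeled here with a bounded `queue.Queue`) plays the role of an ivar, and the `server`/`ask` names are hypothetical. The point is that each cell is written once by the server and read once by the client, usually not at the same moment.

```python
# Python analogy (not Hopac): an "ivar"-style one-shot reply cell.
# Each cell is touched by exactly two threads, once each -- the
# low-contention pattern described in the comment above.
import threading
import queue

def server(requests):
    """Concurrent server: answers each request through its reply cell."""
    while True:
        item = requests.get()
        if item is None:            # shutdown signal
            return
        n, reply = item             # reply acts like a freshly allocated ivar
        reply.put(n * n)            # filled exactly once

def ask(requests, n):
    """Client side: allocate a fresh one-shot cell per request."""
    reply = queue.Queue(maxsize=1)  # one-shot cell; only two threads see it
    requests.put((n, reply))
    return reply.get()              # read exactly once

requests = queue.Queue()
t = threading.Thread(target=server, args=(requests,))
t.start()
print(ask(requests, 7))  # -> 49
requests.put(None)
t.join()
```

Because a fresh cell is allocated per request, cheap allocation and low per-object overhead matter far more here than high-contention throughput of any single cell.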

@t0yv0
Contributor Author

t0yv0 commented Apr 9, 2014

Sounds great. Yes, I completely agree with the analysis. It will be nice to have a "large" better-throughput channel to throw in for those rare cases when there's lots of contention, but that's just icing on the cake.

@polytypic
Member

FYI, just received the new machine this morning. After unwrapping it I ran the benchmarks and didn't really see any surprises: the benchmarks that should scale without problems did so. PingPong (with a ping-pong pair per core) and ThreadRing (with a ring per core) scaled nearly linearly, as they should (ignoring memory bandwidth on longer rings), because they don't really have any sequential bottlenecks.

The benchmarks that fundamentally cannot scale didn't, and actually showed somewhat lower performance, which is not surprising: PostMailbox and CounterActor just measure contention and really don't have any parallelism, and Chameneos is also contention heavy and simply doesn't have enough parallelism. I will be looking into somewhat more scalable locking solutions to see whether the performance degradation can be reduced (it is impossible to make those benchmarks "scale" unless their semantics are changed significantly).

Performance of the PrimeStream benchmark seemed to scale, but probably not linearly. In this benchmark memory bandwidth requirements also increase with the number of filter threads (it uses n concurrent threads to compute the nth prime, and those do take memory), so it is quite possible that it cannot really scale any better without faster memory.

On the other hand, the Fibonacci benchmark is now clearly showing the scaling issues (touching shared memory) that I knew were there theoretically, but had never seen on lower core count machines. Needless to say I will be working on this as soon as I have the time. I would expect minor improvements across all the benchmarks, measurable improvement in the contention benchmarks, and asymptotic improvement in the Fibonacci benchmark.
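For readers unfamiliar with the benchmark shapes mentioned above, here is a minimal sketch of what one PingPong pair does, as a Python analogy rather than the actual Hopac benchmark: a message bounces back and forth over two channels. The full benchmark runs one independent pair per core, so there is no shared state between pairs and near-linear scaling is the expected result. Names here (`pong`, `ping_pong`) are illustrative, not from the Hopac source.

```python
# Python analogy of one PingPong pair: two "channels" (queues) and a
# partner thread that bounces every message straight back. The real
# benchmark runs one such independent pair per core.
import threading
import queue

def pong(inbox, outbox):
    while True:
        msg = inbox.get()
        if msg is None:        # shutdown signal
            return
        outbox.put(msg)        # bounce the message back

def ping_pong(rounds):
    a, b = queue.Queue(), queue.Queue()
    t = threading.Thread(target=pong, args=(a, b))
    t.start()
    for i in range(rounds):
        a.put(i)               # ping
        assert b.get() == i    # wait for the pong
    a.put(None)
    t.join()
    return rounds

print(ping_pong(1000))  # -> 1000
```

CounterActor-style benchmarks are the opposite shape: every core sends to the same single mailbox, so they measure contention on one shared point rather than parallel work.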

@polytypic
Member

Just another FYI: I still haven't had time to try the real solutions I have in mind for improving low-level scaling, but I did a very quick experiment with the Fibonacci benchmark, commenting out a few lines of the <*> operator implementation that touch shared memory (on lower core machines those lines improve performance by avoiding some allocations when it appears all cores are already busy). On the 10-core machine this more than doubled the performance of the fib variations using <*>.
