
Can someone lend a really multicore machine (16/32/64?) to experiment? #3

Closed

t0yv0 opened this issue Feb 21, 2014 · 4 comments

Comments

@t0yv0
Contributor

t0yv0 commented Feb 21, 2014

It might be interesting to benchmark and tune scaling to many cores.

@polytypic
Member

FYI, I just ordered a new machine with a 10-core (20-thread) CPU and will later extend that with a second CPU to be able to test and fix low level scaling issues.

BTW, I believe the main (and almost only) component in Hopac that needs to scale (and is already designed to do so) is the work-distributing scheduler. I believe almost everything else should actually be optimized, by default, for low-contention scenarios and for low overhead (minimizing memory usage and the cost of creating and discarding objects like channels and jobs).

This way, end-to-end processing performance with massive use of lightweight threads is maximized (see the section "A Case Study on Concurrency-Oriented Programming" in chapter 4 of the book "Erlang Programming" by Cesarini and Thompson). The other approach, maximizing the throughput of individual components (queues/pipelines), leads to an architecture like the LMAX Disruptor, which is a very different way to organize a program: you wouldn't build a system that has millions of ring buffers being dynamically created and destroyed all the time.

It is certainly possible to mix and match, and in the future I plan to implement slightly more scalable channel-like abstractions that I call FanIn and FanOut buffers. But making basic channels, locks, ivars and so on more multicore scalable would, in my estimate (based on my understanding of low-level hardware mechanisms), actually reduce performance by about a factor of two in the limit (assuming perfect code generation by the .Net runtime) and, at the limit, require a factor of N more memory (N being the number of cores).

When message passing abstractions are used in a system that uses lightweight threads for end-to-end processing, they are quite naturally used mostly in a low-contention manner. An ivar, for example, is typically allocated to get a response from a concurrent server. It is typically accessed by at most two threads, each of which accesses the ivar only once, and most likely not at the exact same time. This is a very low contention scenario. If an individual ivar were made "multicore scalable" for high-contention scenarios, it would require more than doubling the amount of memory used per ivar, and the typical usage pattern would require about double the number of cache-line transfers. So, in my opinion, it makes more sense to make ivars scalable in the sense that you can have twice as many with twice the performance in the typical usage pattern.
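To make the low-contention ivar pattern concrete, here is a rough sketch of the request/response usage described above. This is a Python analogy, not Hopac code: the one-shot reply cell (modeled here with a bounded `queue.Queue`) plays the role of an ivar, and the `server`/`ask` names are hypothetical. The point is that each cell is written once by the server and read once by the client, usually not at the same moment.

```python
# Python analogy (not Hopac): an "ivar"-style one-shot reply cell.
# Each cell is touched by exactly two threads, once each -- the
# low-contention pattern described in the comment above.
import threading
import queue

def server(requests):
    """Concurrent server: answers each request through its reply cell."""
    while True:
        item = requests.get()
        if item is None:            # shutdown signal
            return
        n, reply = item             # reply acts like a freshly allocated ivar
        reply.put(n * n)            # filled exactly once

def ask(requests, n):
    """Client side: allocate a fresh one-shot cell per request."""
    reply = queue.Queue(maxsize=1)  # one-shot cell; only two threads see it
    requests.put((n, reply))
    return reply.get()              # read exactly once

requests = queue.Queue()
t = threading.Thread(target=server, args=(requests,))
t.start()
print(ask(requests, 7))  # -> 49
requests.put(None)
t.join()
```

Because a fresh cell is allocated per request, cheap allocation and low per-object overhead matter far more here than high-contention throughput of any single cell.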

@t0yv0
Contributor Author

t0yv0 commented Apr 9, 2014

Sounds great. Yes, I completely agree with the analysis. It will be nice to have a "large" better-throughput channel to throw in for those rare cases when there's lots of contention, but that's just icing on the cake.

@polytypic
Member

FYI, just received the new machine this morning. After unwrapping it I ran the benchmarks and didn't really see any surprises: the benchmarks that should scale without problems did so. PingPong (with a ping-pong pair per core) and ThreadRing (with a ring per core) scaled nearly linearly, as they should (ignoring memory bandwidth on longer rings), because they don't really have any sequential bottlenecks.

The benchmarks that fundamentally cannot scale didn't, and actually showed somewhat lower performance, which is not surprising: PostMailbox and CounterActor just measure contention and really don't have any parallelism, and Chameneos is also contention heavy and simply doesn't have enough parallelism. I will be looking into somewhat more scalable locking solutions to see whether the performance degradation can be reduced (it is impossible to make those benchmarks "scale" unless their semantics are changed significantly).

Performance of the PrimeStream benchmark seemed to scale, but probably not linearly. In this benchmark memory bandwidth requirements also increase with the number of filter threads (it uses n concurrent threads to compute the nth prime, and those do take memory), so it is quite possible that it cannot really scale any better without faster memory.

On the other hand, the Fibonacci benchmark is now clearly showing the scaling issues (touching shared memory) that I knew were there theoretically, but had never seen on lower core count machines. Needless to say I will be working on this as soon as I have the time. I would expect minor improvements across all the benchmarks, measurable improvement in the contention benchmarks, and asymptotic improvement in the Fibonacci benchmark.
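For readers unfamiliar with the benchmark shapes mentioned above, here is a minimal sketch of what one PingPong pair does, as a Python analogy rather than the actual Hopac benchmark: a message bounces back and forth over two channels. The full benchmark runs one independent pair per core, so there is no shared state between pairs and near-linear scaling is the expected result. Names here (`pong`, `ping_pong`) are illustrative, not from the Hopac source.

```python
# Python analogy of one PingPong pair: two "channels" (queues) and a
# partner thread that bounces every message straight back. The real
# benchmark runs one such independent pair per core.
import threading
import queue

def pong(inbox, outbox):
    while True:
        msg = inbox.get()
        if msg is None:        # shutdown signal
            return
        outbox.put(msg)        # bounce the message back

def ping_pong(rounds):
    a, b = queue.Queue(), queue.Queue()
    t = threading.Thread(target=pong, args=(a, b))
    t.start()
    for i in range(rounds):
        a.put(i)               # ping
        assert b.get() == i    # wait for the pong
    a.put(None)
    t.join()
    return rounds

print(ping_pong(1000))  # -> 1000
```

CounterActor-style benchmarks are the opposite shape: every core sends to the same single mailbox, so they measure contention on one shared point rather than parallel work.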

@polytypic
Member

Just another FYI: I still haven't had time to try the real solutions I have in mind for improving low-level scaling, but I did a very quick experiment with the Fibonacci benchmark, commenting out a few lines of the <*> operator implementation that touch shared memory (on lower core machines those lines improve performance by avoiding some allocations when it appears all cores are already busy). On the 10-core machine this more than doubled the performance of the fib variations using <*>.
