Question about CounterActor benchmark #4
I played with the CounterActor benchmark a bit. I am getting about 10x better throughput on the one using a native lock vs the Hopac lock. I guess this is as expected: the Hopac lock's advantage is that for long critical sections it does not tie up a whole thread, right? There is also perhaps some extra overhead in how the parallel-for is done, to account for this. A bit more surprising is that my first attempt at implementing the lock API in F# performs a bit better (1.5x). A bit strange; perhaps I'm not implementing it correctly, or the benchmark is not representative, or there's some other disadvantage.

If you have a spare moment, could you give it a look?

I'm still not quite getting the DoJob/Worker/Cont protocol. Is my attempt basically how you use it? Is there some more stuff one can do with the Worker struct?

About backoff / Thread.SpinWait: I think I got the idea from Aaron Turon's work, like Reagents and his thesis. I can't confirm any benefit, but it does not seem to hurt either. I have a 6-core machine; it might make a difference for scaling to more cores. I'm not sure what the mechanism is exactly that makes it work, but something bad happens when many cores are reading the same memory address, so "backing off" to a spin-wait reportedly helps.

Is there anywhere else to back off? Through the worker? Like, switch to a different task if contention on the current one is too high?
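For concreteness, here is a minimal sketch of the kind of spin-wait backoff described above, as it might look in a CAS retry loop. This is illustrative only: the function name, the doubling schedule, and the cap are assumptions, not code from the benchmark.

```fsharp
open System.Threading

// A CAS increment with exponential spin-wait backoff (illustrative).
let incrementWithBackoff (cell: byref<int>) =
  let mutable spins = 4
  let mutable finished = false
  while not finished do
    let seen = cell
    if Interlocked.CompareExchange(&cell, seen + 1, seen) = seen then
      finished <- true
    else
      // Contention: pause so fewer cores hammer the same cache line
      // at once, then retry with a longer pause next time.
      Thread.SpinWait spins
      spins <- min (spins * 2) 1024
```

The point of Thread.SpinWait here is that a waiting core stops issuing interlocked operations against the contested cache line for a while, which reduces coherence traffic for the thread that is actually making progress.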
Before getting your message I actually removed the previous versions of the …

Regarding the performance of the native vs Hopac versions, your analysis is …

The particular "CounterActor" benchmark is rather problematic in the sense …

I'll reply to the other points shortly.
Hmm... It would seem that the AltLock implementation is not correct. If I'm …
Yes, I think my mistake is that I don't understand how to handle Cont properly; once fixed, the performance will likely degrade and all will be well :) I got … Must be some other API in there to …
BTW, for understanding spinlocks, I would recommend reading chapter 7 of …

I'm not at all opposed to tweaking the spinlock implementation used in …

Actually, probably the most viable alternative for the current TTAS …
I think I'll write a document with some notes on the internal …
https://github.com/VesaKarvonen/Hopac/blob/master/Docs/Internals.md

Will write more shortly.
Thanks, this helps. The …

So it looks like the C# code is basically inlining everything, from locking to data structures like work queues, and minimizing the number of objects by cramming things together to reduce memory/GC use. OK, this makes sense; not surprising it's hard to read, then. I just thought this is like an advertisement for MLton. I never used it much, but the presentations claim it gives free or almost free abstractions, something Hopac has to achieve by hand in C#/F# here for performance.

Aha, so you are using TTAS "inlined". And that is generally a better default than .NET lock/Monitor? Very curious. I guess because of lower resource use, and especially for the uncontended case. OK.
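For readers unfamiliar with the term: a test-and-test-and-set (TTAS) lock spins on an ordinary read and only attempts the expensive interlocked exchange once the lock looks free. A minimal standalone sketch, not Hopac's inlined version:

```fsharp
open System.Threading

// Minimal TTAS spinlock sketch: 0 = free, 1 = held.
type TtasLock() =
  let mutable state = 0
  member __.Enter() =
    let mutable acquired = false
    while not acquired do
      // "Test": spin on plain reads first, so waiters do not keep
      // bouncing the cache line around with interlocked operations.
      while Volatile.Read(&state) = 1 do
        Thread.SpinWait 20
      // "Test-and-set": the lock looked free, so try to take it.
      if Interlocked.Exchange(&state, 1) = 0 then
        acquired <- true
  member __.Exit() =
    Volatile.Write(&state, 0)
```

This matches the guess above: a very cheap uncontended path and low per-lock resource use, at the cost of burning CPU when waits get long.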
So I attempted to fix my F#. Now I only run the continuation when the lock is … Does this look correct now? It seems OK to me, but the performance results on the old CounterActor benchmark are baffling: this is now 5-6x faster than the original Hopac Lock.

```fsharp
namespace Hopac

module AltLock =
  open System
  open System.Threading
  open Hopac
  open Hopac.Core

  [<AbstractClass>]
  type Section() =
    abstract Run : unit -> unit

  [<Sealed>]
  type Section<'T>(f: unit -> 'T, cont: Cont<'T>) =
    inherit Section()
    override self.Run() =
      try
        let r = f ()
        Scheduler.Push(JobWork(Job.result r, cont))
      with e ->
        Scheduler.Push(FailWork(e, cont))

  type private LockState =
    | Free
    | Locked of list<Section>

  let private ( == ) (a: obj) (b: obj) =
    Object.ReferenceEquals(a, b)

  // CAS loop: atomically replace `state` with `update state` and
  // return the value that was replaced.
  let inline private update (state: byref<'T>) (update: 'T -> 'T) =
    let mutable loop = true
    let mutable old = Unchecked.defaultof<_>
    while loop do
      old <- state
      if old == Interlocked.CompareExchange(&state, update old, old) then
        loop <- false
    old

  [<Sealed>]
  type AltLock internal () =
    [<VolatileField>]
    let mutable state = Free

    // Returns true if the lock is acquired; otherwise returns false
    // and puts the critical section corresponding to `f` and `cont`
    // into the wait list.
    let enterLock f cont =
      let st =
        update &state (function
          | Free -> Locked []
          | Locked xs ->
            let work = Section<_>(f, cont) :> Section
            Locked (work :: xs))
      match st with
      | Free -> true
      | _ -> false

    // Releases the lock, but also runs the critical sections of all
    // waiters and schedules their continuations to execute.
    let exitLock () =
      let mutable loop = true
      while loop do
        let st =
          update &state (function
            | Locked [] -> Free
            | Locked xs -> Locked []
            | _ -> failwith "Impossible")
        match st with
        | Locked [] -> loop <- false
        | Locked works ->
          for w in works do
            w.Run()
        | _ -> failwith "Impossible"

    member internal self.Protect<'R>(f: unit -> 'R) =
      { new Job<'R>() with
          member __.DoJob(worker, cont) =
            if enterLock f cont then
              let mutable result = Unchecked.defaultof<_>
              let mutable error = Unchecked.defaultof<_>
              let mutable hasError = false
              try
                result <- f ()
              with e ->
                hasError <- true
                error <- e
              exitLock ()
              if hasError then
                cont.DoHandle(&worker, error)
              else
                cont.DoCont(&worker, result) }

  let create () = AltLock()

  let protect (lock: AltLock) f = lock.Protect(f)
```
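For reference, a usage sketch (hypothetical driver code, not from the thread; actually starting the job on Hopac's scheduler is elided):

```fsharp
// Hypothetical use of the AltLock module above.
let l = AltLock.create ()
let counter = ref 0

// AltLock.protect returns a Job<'R> that composes and runs like any
// other Hopac job; here the critical section is a plain increment
// of shared state.
let incrementJob : Job<unit> =
  AltLock.protect l (fun () -> counter := !counter + 1)
```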
Interesting. Have you tried the AltLock implementation with the Chameneos …

Assuming the alternative lock is correct, then whether you push into the …

The current design of the "scheduler", which comprises both the worker …

There is an opportunity for optimization on the line `Scheduler.Push(JobWork(Job.result r, cont))`. Here you could just use the Value field of the cont: `cont.Value <- r`. That eliminates two allocations.
Heh... Improvement suggestions are more than welcome. :)

BTW, it just popped into my mind regarding AltLock that the order in which …

On MLton... Yeah... It is something that I think is not widely appreciated …
Darn it, I totally missed that, indeed! `val inline duringJob: Lock -> Job<'a> -> Job<'a>` Yes, this would be harder. Long live MLton!

I was a bit unsure about pushing work into the same worker. With AltLock, when multiple lightweight threads come in at the same time, they end up funneling into one thread, which executes the critical sections. So I thought it's important to spawn the continuations back out, to get N threads running again; otherwise the lock might artificially reduce parallelism. But I have not looked at the scheduler much yet; if there is work stealing in the scheduler, this is nothing to worry about. Sounds like …
In the current implementation, the Worker.Push method pushes the given work …
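To make the local-vs-global distinction concrete, here is a deliberately simplified model. It is not Hopac's actual scheduler; the type MiniWorker, its members, and the plain concurrent queues are all illustrative. The point is only that "push to worker" keeps work near the thread that produced it, while "push to scheduler" makes it immediately visible to every worker; with work stealing, locally pushed work can still migrate, which is what the reduced-parallelism worry above hinges on:

```fsharp
open System.Collections.Concurrent

type Work = unit -> unit

// Simplified model: each worker owns a local queue and all workers
// share one global queue.
type MiniWorker(globalQ: ConcurrentQueue<Work>) =
  let localQ = ConcurrentQueue<Work>()
  // Analogue of Worker.Push: stay on the current worker.
  member __.PushLocal(w: Work) = localQ.Enqueue w
  // Analogue of Scheduler.Push: visible to all workers.
  member __.PushGlobal(w: Work) = globalQ.Enqueue w
  member __.TryTake() =
    // Prefer local work; fall back to the global queue, which here
    // stands in for stealing from other workers.
    match localQ.TryDequeue() with
    | true, w -> Some w
    | _ ->
      match globalQ.TryDequeue() with
      | true, w -> Some w
      | _ -> None
```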
Replaced this:

```fsharp
override self.Run() =
  try
    let r = f ()
    Scheduler.Push(JobWork(Job.result r, cont))
  with e ->
    Scheduler.Push(FailWork(e, cont))
```

with this:

```fsharp
override self.Run(worker) =
  try
    cont.Value <- f ()
    Worker.Push(&worker, cont)
  with e ->
    Scheduler.Push(FailWork(e, cont))
```

Gets weird. What I observe: no progress, all cores spinning, on "warmup n=10000". Deadlock on the TTAS spin-locks? No, it looks like … A bug in my lock is also possible. Let me think.
I would assume a change elsewhere (some changes needed to pass the worker?) …
Oh, one thing. You are now messing with the very internals of Hopac, so …
Thanks for breaking my mental infinite loop. That's it, I think. Maybe there should be some kind of asserts throughout the code, to allow compiling in Debug mode with assert checking?
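For what it's worth, F#'s built-in `assert` already behaves that way: it compiles to `System.Diagnostics.Debug.Assert`, so the check runs only when the DEBUG symbol is defined and disappears in Release builds. A small sketch with a made-up invariant:

```fsharp
// `assert` is conditional on the DEBUG compilation symbol; Release
// builds skip the check entirely.
let pop (xs: 'T list) =
  assert (not (List.isEmpty xs))  // hypothetical invariant
  List.head xs, List.tail xs
```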
Did you make further progress with the lock?

Just to clear up a possible confusion: Hopac doesn't use Hopac.Lock internally. It is an abstraction provided for clients of the library. Sometimes it is more natural to implement an algorithm in terms of locks than higher-level message-passing primitives. You could compare it to something like a Monitor or Mutex in .NET/Windows: a type of synchronization object that potentially blocks the user job. Internally, Hopac uses lower-level mechanisms.
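Going by the `duringJob` signature quoted earlier, client code would look roughly like the sketch below. `Lock.Now.create` is assumed as the constructor name, and error handling is ignored:

```fsharp
open Hopac

// Sketch of client-side use of Hopac.Lock (user-level locking).
let l = Lock.Now.create ()
let counter = ref 0

// Only one job at a time runs the protected job; other jobs are
// suspended as lightweight threads instead of blocking an OS thread.
let guardedIncrement : Job<unit> =
  Lock.duringJob l (Job.delay (fun () ->
    counter := !counter + 1
    Job.result ()))
```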
Hey, I'm planning to get back to this. A lot going on, like Russia invading my home country. Yes, it's clear that we're talking about user-level locks here. --A