
Conversation

MartinNowak
Member

  • uses double-checked locking (the correct one)
  • all instances shared the same global mutex
  • minimal overhead and simple API

@MartinNowak
Member Author

Herb Sutter recently mentioned that call_once is implemented using something other than DCL, but I couldn't find the details anywhere.
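For context, the double-checked locking named in the first bullet can be sketched with C++11 atomics. This is an illustrative pattern only (the helper name call_once_dcl is made up here), not the druntime implementation:

```cpp
#include <atomic>
#include <mutex>

std::atomic<bool> initialized{false};
std::mutex init_mutex;

template <typename F>
void call_once_dcl(F f) {
    // Fast path: after initialization, only a cheap acquire load is paid.
    if (!initialized.load(std::memory_order_acquire)) {
        std::lock_guard<std::mutex> lock(init_mutex);
        // Re-check under the lock: another thread may have won the race.
        if (!initialized.load(std::memory_order_relaxed)) {
            f();
            // Release store publishes f()'s side effects to other threads.
            initialized.store(true, std::memory_order_release);
        }
    }
}
```

The acquire/release pair is what makes this the "correct" variant of DCL, as opposed to the broken pre-C++11 folklore version with plain loads and stores.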

@MartinNowak
Member Author

all instances shared the same global mutex

There is a potential for deadlock here when one thread waits in callOnce for the result of another thread that also invokes callOnce. It's probably not a practical issue, and I'd rather save all the mutex allocations.

* Executes the given $(D_PARAM callable) exactly once, even when invoked by
* multiple threads simultaneously. The implementation guarantees that all
* calling threads block until $(D_PARAM callable) returns and that any
* side-effects of $(D_PARAM callable) is globally visible afterwards.
Contributor

nitpick: is -> are

Member Author

done

@MartinNowak MartinNowak force-pushed the master branch 2 times, most recently from 61d7958 to 53412fc Compare November 10, 2014 15:26
@dnadlinger
Contributor

Did you measure whether the variant using atomics or the TLS one is actually better for typical applications (typical, of course, being hard to define)? I'd actually expect the latter to be faster in cases where there are a lot of concurrent accesses.

Furthermore, I don't think the dead-lock issue should be brushed aside so lightly. Concurrency primitives that come with hidden traps are rarely a good thing.

@MartinNowak
Member Author

Did you measure whether the variant using atomics or the TLS one is actually better for typical applications (typical, of course, being hard to define)?
I'd actually expect the latter to be faster in cases where there are a lot of concurrent accesses.

No. There is going to be only a single write (no contention), acquire/release is good enough (and cheap or even free), and TLS access is slower (emulated TLS).
For dmd there is a little extra overhead, because core.atomic doesn't use intrinsics.

@MartinNowak
Member Author

Furthermore, I don't think the dead-lock issue should be brushed aside so lightly. Concurrency primitives that come with hidden traps are rarely a good thing.

I added a note in the docs. But I could still only come up with very contrived examples where this would be a problem.

@mihails-strasuns

Usual ping to @complexmath

@dnadlinger
Contributor

Depending on the platform/TLS model, TLS accesses might be nearly free too, but I guess you are right: in general they aren't, and at least on x86 the read should be free (unless you get extremely unlucky with what else is on your cache line, of course, but that's beside the point here).

@MartinNowak MartinNowak force-pushed the master branch 2 times, most recently from 996d8bc to 2e80999 Compare November 12, 2014 00:05
@MartinNowak
Member Author

Are we still waiting for something?

@MartinNowak MartinNowak force-pushed the master branch 2 times, most recently from 07c94d0 to 2e80999 Compare November 14, 2014 00:28
@dnadlinger
Contributor

I'd like one or two more opinions before adding a new API artifact, just as a matter of principle.

{
import core.atomic;

static shared bool flag;
Contributor

Maybe a silly question, but shouldn't __gshared be used here instead of shared?

Member Author

It's used in atomic operations and therefore requires shared. And shared static variables are global, not thread-local, just like __gshared.

Contributor

@MartinNowak thanks


Also, I think the general rule of thumb is to always try going for shared instead of __gshared unless there are transitivity issues that prevent it (as with Mutex).

@quickfur
Member

I'd like to help review this, but my experience with concurrent programming is rather limited, so I don't know if I can give a meaningful review. But at least on the surface, the code looks OK to me.

@dnadlinger
Contributor

Hm, one additional, non-obvious pitfall is that multiple callOnce invocations with, say, the same function declaration as a parameter instead of a lambda will also only invoke the target once. I'm starting to think that we should simply use something like std::once_flag in C++.

This way, the deadlock issue would, by default, be drastically reduced in scope to areas that are already coupled in source code, as we'd just put the Mutex in that object. If the few extra bytes were critical for a given application, the user could always make the decision to re-use it on their own.

You could also make specifying the once_flag analog optional, defaulting to a separate one per invocation.
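For reference, this is how the C++ analogue reads; the flag object, not the callable, carries the once state (the names ensure_resource and init_resource are made up for illustration):

```cpp
#include <mutex>

// The once state lives in the flag object, not in the callable's identity.
std::once_flag resource_flag;
int init_count = 0;

void init_resource() { ++init_count; }

void ensure_resource() {
    // Exactly one caller runs init_resource; concurrent callers
    // block until it has completed.
    std::call_once(resource_flag, init_resource);
}
```

Reusing the same flag from several call sites couples them explicitly in source code, which is the scoping property argued for above.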

@DmitryOlshansky
Member

I'm starting to think that we should simply use something like std::once_flag in C++.

Agreed.

* $(D_PARAM callable) should not wait for another thread that might
* also invoke callOnce or a dead-lock can happen.
*/
void callOnce(alias callable)()
Contributor

Do we want to have a template constraint making sure that callable is callable?

@IgorStepanov
Contributor

@MartinNowak this implementation looks correct, but I think that a global mutex with static this (which may be instantiated more than once) is stodgy. Why don't you use a bool flag instead of the second mutex?

void callOnce(alias callable)()
{
    import core.atomic;
    import core.thread : Thread;

    static shared bool flag;  // set once callable has completed
    static shared bool flag2; // guards the call itself
    if (!atomicLoad!(MemoryOrder.acq)(flag))
    {
        if (cas(&flag2, false, true))
        {
            if (!atomicLoad!(MemoryOrder.acq)(flag))
            {
                callable();
                atomicStore!(MemoryOrder.rel)(flag, true);
                atomicStore!(MemoryOrder.rel)(flag2, false);
            }
            else
            {
                while (atomicLoad!(MemoryOrder.acq)(flag2))
                    Thread.yield();
            }
        }
    }
}
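For comparison, here is the same spin-waiting idea sketched with C++11 atomics (spin_once is a made-up name). Note that in the D snippet above, a thread that loses the cas race falls through without waiting; this sketch instead makes losers spin until the winner finishes, which is what callOnce's blocking guarantee requires:

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> done{false};  // set once f has completed
std::atomic<bool> busy{false};  // claimed by the thread that runs f

template <typename F>
void spin_once(F f) {
    if (done.load(std::memory_order_acquire))
        return;  // already initialized, fast path
    if (!busy.exchange(true, std::memory_order_acq_rel)) {
        // We won the race and run the callable exactly once.
        f();
        done.store(true, std::memory_order_release);
    } else {
        // Lost the race: spin until the winner has finished.
        while (!done.load(std::memory_order_acquire))
            std::this_thread::yield();
    }
}
```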

@MartinNowak
Member Author

Hm, one additional, non-obvious pitfall is that multiple callOnce invocations with, say, the same function declaration as a parameter instead of a lambda will also only invoke the target once.

Yes, and that's perfectly fine. Because it means that multiple invocations of, say, a library initialization will only happen once, it's a sane default. You can easily work around it by wrapping your function call in a lambda.

I'm starting to think that we should simply use something like std::once_flag in C++.

No, I thought about this quite a while and I don't think it's worth the trouble.
Not only is the API much more complex (and requires way more documentation), but the dead-lock problem is even bigger with a separate once_flag. It's a single flag that you can associate with different functions, which allows for hard-to-debug lock ordering problems.
My conclusion was to go with a simple API, and add things like separate flags when there is an actual need.

This way, the deadlock issue would by default drastically be reduced in scope to areas that are already coupled in source code, as we'd just put the Mutex in that object. If the few extra bytes would be critical for a given application, the user could always make the decision to re-use it on their own.

A once_flag can only be used once, hence the name.

@dnadlinger
Contributor

Hm, one additional, non-obvious pitfall is that multiple callOnce invocations with, say, the same function declaration as a parameter instead of a lambda will also only invoke the target once.

Yes, and that's perfectly fine. Because it means that multiple invocations of, say, a library initialization will only happen once, it's a sane default. You can easily work around it by wrapping your function call in a lambda.

It's also perfectly unexpected if you think of it as "do this function call once". Your "easy" workaround will appear completely mysterious to many users, because it relies on the fact that two semantically identical lambdas are still distinct symbols, something that, if I remember correctly, isn't even documented right now.

Note that you can write your library initialization very naturally in the other model too, and this time without relying on non-obvious details:

void apiClient1() { ensureInitialized(); … } 
void apiClient2() { ensureInitialized(); … }
void ensureInitialized() { callOnce({ auto h = dlopen(…); … }); }

I'm starting to think that we should simply use something like std::once_flag in C++.

No, I thought about this quite a while and I don't think it's worth the trouble.
Not only is the API much more complex (and requires way more documentation)

"much more complex"? It adds a single type. I don't think it would necessarily entail more documentation (and certainly not "way"), if you, from a user's point of view, properly documented the details of your proposal.

the dead-lock problem is even bigger with a separate once_flag.

The problem that exists in your implementation certainly isn't; it is at least as big as with a separate once_flag, and in my opinion much worse (as you can't even see the sharing in the code). There might be a different one, but let's keep the discussion to the point.

It's a single flag that you can associate with different functions, which allows for hard-to-debug lock ordering problems.

That is indeed an issue to consider.

My conclusion was to go with a simple API, and add things like separate flags if there is an actual need.

I don't think your API is conceptually simpler than having a separate flag. The latter can easily be explained as the thread-global equivalent of static bool flag; if (!flag) { flag = true; … }.

This way, the deadlock issue would by default drastically be reduced in scope to areas that are already coupled in source code, as we'd just put the Mutex in that object. If the few extra bytes would be critical for a given application, the user could always make the decision to re-use it on their own.

A once_flag can only be used once, hence the name.

Err, yes, of course.

@MartinNowak
Member Author

To cite http://en.cppreference.com/w/cpp/thread/call_once

Exactly one execution of exactly one of the functions (passed as f to the invocations in the group) is performed. It is undefined which function will be selected for execution. The selected function runs in the same thread as the call_once invocation it was passed to.
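A minimal C++ sketch of the cited group semantics (the names are made up). When the calls happen sequentially, the first function in the group runs and the second is never invoked:

```cpp
#include <mutex>

std::once_flag group_flag;
int which = 0;

void first()  { which = 1; }
void second() { which = 2; }

// Both call_once invocations belong to the same group (same flag),
// so exactly one of first/second executes. Sequentially, first wins.
void demo() {
    std::call_once(group_flag, first);
    std::call_once(group_flag, second);
}
```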

The problem that exists in your implementation certainly isn't; it is at least as big as with a separate once_flag, and in my opinion much worse (as you can't even see the sharing in the code).

It's mentioned in the documentation, and because it's a recursive mutex, you'd need to interact with another thread that is already inside callOnce. I don't agree that this is a very practical problem, but we could add an overload of callOnce that takes a specific mutex argument to resolve it.

Note that you can write your library initialization very naturally in the other model too, and this time without relying on non-obvious details:

Sure, it's cleaner if the library provides a thread-safe initialization function, but if it doesn't, you're stuck. And if you have multiple libraries in your app that use the same underlying library, you'll end up initializing it twice.

I don't think your API is conceptually more simple than having a separate flag.

It saves the not-so-trivial boilerplate of declaring and initializing a separate flag.

class Foo
{
    // hypothetical OnceFlag-based API
    private static shared OnceFlag fooFlag;
    private static __gshared Foo _instance;

    shared static this()
    {
        fooFlag = OnceFlag();
    }

    static Foo instance()
    {
        callOnce!({ _instance = new Foo(); })(fooFlag);
        return _instance;
    }
}
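For comparison, a sketch of what this boilerplate looks like with C++'s std::once_flag, whose constexpr default constructor makes the shared static this() initialization step unnecessary:

```cpp
#include <mutex>

class Foo {
public:
    static Foo& instance() {
        // constexpr-constructible flag: no static constructor needed.
        static std::once_flag flag;
        static Foo* obj = nullptr;
        std::call_once(flag, [] { obj = new Foo(); });
        return *obj;
    }
};
```

Whether the declaration boilerplate or the explicit flag object is the lesser evil is exactly the trade-off being debated here.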

It's also perfectly unexpected if you think of it as "do this function call once".

I added an example for this, but I'm still missing a valid use case where you'd actually want this behavior.

@MartinNowak
Member Author

The problem that exists in your implementation certainly isn't; it is at least as big as with a separate once_flag

It's simpler because there is no lock order issue. The following code will always dead-lock, but again I can only come up with very contrived examples.

import std.concurrency, std.stdio;

callOnce!({
    auto tid = spawn({
        callOnce!({ writeln("foobar"); })();
        ownerTid.send(true);
    });
    receiveOnly!bool();
});

@IgorStepanov
Contributor

The following code will always dead-lock...

Does it happen because you use one global mutex for all callOnce instances? Why not use separate mutexes/flags for each callOnce instance?

@MartinNowak
Member Author

Does it happen because you use one global mutex for all callOnce instances?

It would be much harder to dead-lock with a mutex per instance.

Why don't use separate mutexes/flags for each callOnce instance?

Because it's a waste of resources; each mutex would be used at most once.
And I still don't think that it's a practical issue, but I'll add an overload that takes a mutex to resolve the discussion.

@JakobOvrum
Contributor

Will do it just fine in thread-local case.

Whether to use lazy or eager initialization (and whether to use compile-time or runtime initialization in the eager case) is orthogonal to whether it's TLS or global. TLS objects benefit from lazy initialization in exactly the same ways:

  • The object may not always be used, so the cost of initialization is only paid when necessary, or start-up time is reduced by delaying initialization to time-of-use.
  • Initialization may depend on runtime arguments only available at the place of use.
  • Runtime initialization is desired but a static constructor is out of the question, such as in circular dependency scenarios.

Sure, but it's also easy.

It's cluttered imperative boilerplate:

static bool isInitialized = false;
static T theSingleton;
if(!isInitialized)
{
    theSingleton = ...;
    isInitialized = true;
}
// use `theSingleton`

vs declarative:

static T theSingleton;
initOnce!theSingleton(...);
// use `theSingleton`

Of course the global singleton is much more involved and error-prone than the TLS one, but initOnce helps in the TLS case too.
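For reference, C++11 achieves the same declarative shape with a block-scope static, whose first-use initialization the compiler is required to make thread-safe (theConfig and Config are made-up example names):

```cpp
#include <string>

struct Config { std::string name; };

// C++11 "magic static": the initializer runs exactly once, even under
// concurrent first calls; the compiler emits the synchronization.
Config& theConfig() {
    static Config cfg{"default"};
    return cfg;
}
```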

edit:

To be clear, I'm not against putting it in std.concurrency, I think that's perfectly reasonable.

@DmitryOlshansky
Member

@JakobOvrum The point is that

static T theSingleton;
initOnce!theSingleton(...);
// use `theSingleton`

Most likely has to be:

@property auto theSingleton()
{
    static Singleton s;
    initOnce!(s)(...);
    return s;
}

So for singletons that's the same accessor boilerplate anyway and trading one for another really is not a deal breaker.

@DmitryOlshansky
Member

Only in C++. D doesn't have runtime static variable initializers; they have to be computable at compile time.

Right sorry for confusion.

@JakobOvrum
Contributor

So for singletons that's the same accessor boilerplate anyway and trading one for another really is not a deal breaker.

Firstly, it's clearly not the same amount of boilerplate. But more importantly, lazy initialization is not just for the singleton pattern.

@MartinNowak
Member Author

Firstly, it's clearly not the same amount of boilerplate. But more importantly, lazy initialization is not just for the singleton pattern.

You correctly identified the usability problem. Now, it happens quite often to me that I accidentally declare a thread-local variable instead of a global one. That would result in quite some debugging and confusion if initOnce could be used with TLS variables, e.g. (to stick with my use case) calling the thread-unsafe curl_global_init multiple times, possibly from multiple threads.
So the usability problem isn't really resolved; therefore we should disallow TLS variables altogether.

@MartinNowak
Member Author

Firstly, it's clearly not the same amount of boilerplate.

But it's almost the same amount, and we have a policy of not adding trivial stuff to Phobos.

static Instance var;
if (!var) var = new Instance();

@JakobOvrum
Contributor

You correctly identified the usability problem. Now, it happens quite often to me that I accidentally declare a thread-local variable instead of a global one. That would result in quite some debugging and confusion if initOnce could be used with TLS variables, e.g. (to stick with my use case) calling the thread-unsafe curl_global_init multiple times, possibly from multiple threads.
So the usability problem isn't really resolved; therefore we should disallow TLS variables altogether.

I think this is a good point for callOnce, but I'm not so sure about initOnce. The D type system prevents you from initializing a non-shared TLS variable with a shared object, so this will already fail to compile.

But almost and we have the policy to not add trivial stuff to phobos.

Most lazily initialized variables will be of reference type, so I guess that's a good point.

@MartinNowak
Member Author

Most lazily initialized variables will be of reference type, so I guess that's a good point.

Updated and all things addressed, at least I hope so ;).

@MartinNowak
Member Author

Thanks @JakobOvrum, this really came out much better than the initial callOnce.
The only thing I don't like that much is that the documentation for the mutex overload is equally prominent.

@MartinNowak
Member Author

The only thing I don't like that much is that the documentation for the mutex overload is equally prominent.

Partly fixed now; I attached each unittest example to its own overload, so the one taking a mutex is separated a little better.
https://dlang.dawg.eu/phobos-prerelease/std_concurrency.html#.initOnce
Ready to merge.

auto ref initOnce(alias var)(lazy typeof(var) init, Mutex mutex)
{
    // check that var is global; can't take the address of a TLS variable
    static assert(is(typeof({ __gshared p = &var; })),
        "var must be 'static shared' or '__gshared'.");
Contributor

Why are TLS variables disabled when the type system prevents you from doing it wrong anyway?

@MartinNowak
Member Author

I think this is a good point for callOnce, but I'm not so sure about initOnce.

Why are you starting that discussion again? This is a primitive for race-free lazy initialization. There are no races when assigning TLS variables.

The D type system prevents you from initializing a non-shared TLS variable with a shared object, so this will already fail to compile.

Most of the time it's not even possible to use shared, a prominent example being Mutex.
Not sure why you're trying so hard to squeeze lazy TLS initialization into a concurrency primitive.
But this example here is a real usability problem.

static Mutex mutex; // should be __gshared
void foo()
{
    synchronized (initOnce!mutex(new Mutex)) {
        // Oh, not mutually exclusive at all. Who's going to debug this?
    }
}
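The analogous trap exists in C++ with thread_local. A small sketch (made-up names) showing that a mistakenly thread-local mutex provides no cross-thread exclusion, since every thread locks its own private copy:

```cpp
#include <mutex>
#include <thread>

// Mistakenly thread-local: each thread gets its OWN mutex object.
thread_local std::mutex tls_mutex;

// Returns true if another thread can take "the" lock even while the
// calling thread holds its own copy, i.e. there is no mutual exclusion.
bool other_thread_can_lock() {
    bool locked = false;
    std::thread t([&] {
        locked = tls_mutex.try_lock();  // locks a different object!
        if (locked)
            tls_mutex.unlock();
    });
    t.join();
    return locked;
}
```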

@MartinNowak
Member Author

Anything left to do?

@MartinNowak
Member Author

ping

* simultaneously calling initOnce with the same $(D_PARAM var) argument block
* until $(D_PARAM var) is fully initialized. All side-effects of $(D_PARAM init)
* are globally visible afterwards.
*
Member

Missing Params: block

Member Author

Should I seriously write var - the variable to initialize and init - the initializer even though the first sentence perfectly describes what this function does? This feels a lot like auto sum = a + b; // add a and b comments.

Member

Yes. The main description can be more of an overview, and be more of an explanation of what the point of this function is. For example:

http://www.cplusplus.com/reference/cmath/sin/

That's normal and expected.

Member Author

Done, but I still disagree.
We don't improve our documentation by adding noise.
What's that supposed to tell me?
http://www.cplusplus.com/reference/cmath/pow/#parameters


/**
* Initializes $(D_PARAM var) with the lazy $(D_PARAM init) value in a
* thread-safe manner.
Contributor

Could you please add a blank line to make the first part a summary.

Member Author

done

- uses double-checked locking (the correct one)

- minimal overhead and simple API

- default mutex is shared among all instances
@MartinNowak
Member Author

Anything left to do? This has been on hold for yet another week and almost two months in total.

@MartinNowak
Member Author

Ping again

@DmitryOlshansky
Member

LGTM

@MartinNowak
Member Author

Someone merge this please?

@MartinNowak
Member Author

Auto-merge toggled on

MartinNowak added a commit that referenced this pull request Jan 20, 2015
@MartinNowak MartinNowak merged commit dcee2cd into dlang:master Jan 20, 2015