Data race in vectorstats - std::lgamma is not thread safe on Linux #2857
I found an implementation of the natural logarithm of the gamma function in the Ruby language repository. After substituting
Yikes. Okay. I was not aware of that.

The transformation from t-statistic / F-statistic to Z-statistic is computationally expensive, so I pre-generate a lookup table, on which linear interpolation is then performed. What I suspect the sanitizer doesn't like is the fact that I'm trying to avoid mutexing every single time I read from one of those tables.
So is the problem that releasing the mutex lock is insufficient for the newly generated data to become thread-synchronized? There's probably some other more appropriate design pattern; it's been too long since I played with threading primitives. But since these transformations are done for every element (e.g. voxel, fixel) for every permutation, I would like it to be possible to read from shared data without having to thread-lock every single time. This is still only for the fixed homoscedastic case. It's also currently possible to disable use of the lookup tables by commenting out this line.
Yes, effectively we need to provide a global mutex (or the equivalent logic) that protects calls to `std::lgamma`. This simple piece of code, compiled with TSAN, reports a race condition:
So I think we need to find an alternative to `std::lgamma`. I wrote a small microbenchmark here to get a vague idea of how the implementation in the Ruby repo might perform, and it appears reasonably fast compared to `std::lgamma`.
The code around the t->Z and F->Z conversions is complex precisely because of numerical accuracy problems; indeed I added a unit test for it. My concern is that addressing two threads generating tables at the same time may not fully resolve the issue: in the use case you pasted, only one table should ever need to be generated, yet a problem is still reported. So presumably two threads are generating the same table at the same time despite the current checks. Perhaps we should avoid
Your benchmark runs the log-abs-gamma against log-gamma only, across positive values only. It's been too long since I did this work to recall whether compatibility with negative values is required here; but at least the former will have a code branch that the latter doesn't, which could bias the benchmark.

RE code structure: I'm way out of practice here. I want threads to be able to generate these tables as needed, but each unique table should be generated only once, by whichever thread needs it first, with other threads then synchronising with the shared data so they can make use of it. Only one thread should be generating a table at any given time (unlike the intent of the current code, which tries to allow different threads to generate different tables simultaneously). Importantly, data reads need to be cheap and non-locking wherever possible. So I think I need something like:
?
If we were to grab the Eigen code in
I misread the definition. To evaluate precision, I wrote a very simple (and naive) program to see how the value of the alternative implementation differs from `std::lgamma`:
In my testing, this was disabled. Furthermore, I cannot reproduce the race condition on macOS, even with the simple two-threaded program I posted above, but I can on Linux.
That's good to hear! If Eigen has an independent implementation then using it would also be my preferred solution. I just hope that internally it doesn't call the OS's implementation of `std::lgamma`.
There is some concern that the source of the code (website; repository) may be unaware of the distinction, and be using one where the other is intended.
OK, so testing with the Eigen implementation: should we just stick with it, and effectively keep using that?
The bigger question is, in the case of having
This seems to be the case (e.g. see here).
OK, so there's not likely to be any solution here that doesn't come with downsides.
Notably the Eigen code seems to be using
I think I've been tying myself in knots again. It's not about whether the logarithm of the gamma function is negative, but whether the gamma function itself is negative. Here the inputs (DoF / rank) are exclusively positive, in which case the gamma function is always positive. So in our use case, what's written to the sign bit is not only never read from, it should also be the same value on every write. So I think the solution here is actually to tell the thread sanitiser to ignore this race condition. It's inconsequential, and we don't want to incur mutex locking on Mac.
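If we go that route, TSAN supports runtime suppression files, so the report can be silenced without touching the code; something along these lines (the filename is arbitrary):

```
# tsan.supp: ignore reports whose stack contains lgamma
# (glibc writes the sign of gamma(x) to the global signgam; in our use
#  case that value is write-only and identical across threads)
race:lgamma
```

The binary would then be run with `TSAN_OPTIONS="suppressions=tsan.supp"`. An alternative is annotating a wrapper function with `__attribute__((no_sanitize("thread")))`, which bakes the exemption into the source rather than requiring an environment variable.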
Also, the regularised incomplete beta function is defined only for positive parameters a, b > 0 and 0 ≤ x ≤ 1.
@daljit46 I might need your help with this. What I think I want to do is:
Previously in
My opinion is that we should try to avoid this "solution". The problem is that in C++, a data race is undefined behaviour, and there are virtually zero guarantees that there will be no fatal consequences. Thus a C++ program is allowed to crash, or worse, perform a nonsensical transformation (or even set fire to your device), without contradicting the ISO C++ specification.
Additionally, I see that
OK, I've had a quick look into this, and came across much the same posts as you already have. The main thing I was looking into was whether a data race that involves writes only, with no intended or actual reading of the racing variable, is likely to be a problem. As far as I can tell, all the examples of undefined behaviour I've seen involve getting the wrong value of the variable - which may indeed have nasty side effects - but only if the code actually ever needs to read the variable. If all the code does is write to the variable, it really doesn't matter how operations are reordered - at least I can't see how any of the examples I've come across would apply. There is however one interesting example that involves re-ordering operations and speculative early loading - and that seems to apply to precisely the use case that @Lestropie was talking about in this previous post, with this pseudo-code:
This looks similar to the example in section 2.1 ("Double checks for lazy initialization") of Hans-J. Boehm's paper "How to miscompile programs with 'benign' data races". Maybe that's all we need to fix, assuming the generation of the table is then guaranteed to be single-threaded...?
The lookup table generation may be a more difficult fix than just the use of `std::lgamma`. The theory there is that:
Therefore there may be different t->Z / F->Z transforms required. What the current code is attempting to achieve is: for any given rank / degrees of freedom, the lookup table should be generated just once, and then made available for all threads to read from. The current design, however, is seemingly broken. Options:
Thread sanitizer has reported another data race in our codebase. Here is the output:

The issue seems to be triggered by the call to `std::lgamma` in `core/math/betainc.cpp:65`. It's not fully clear to me whether this is a problem with our code or not (addressing #2795 would make the code much more readable and easier to debug). However, it seems likely that this is an issue in the function itself, as apparently `std::lgamma` is not guaranteed to be thread safe (e.g. see here). From cppreference:

It's worth noting that I can only reproduce this on Linux, while on macOS there is no race condition. This helps corroborate the idea that the issue lies in the specific implementation of `lgamma` on Linux.