atomic_ref when external underlying type size is not natural atomic size (not power of two) #130
Comments
We generally would expect if possible to go through lock free implementations - though we are doing experiments right now to see whether lock tables in some cases can actually be faster than CAS loops. That said, CUDA 10.2 for example does implement lock free atomics for types smaller than 32 bits via CAS loops on 32 bit types see the |
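A minimal sketch of the "CAS loop on a 32-bit type" idea mentioned above, not the CUDA implementation itself. The helper name `fetch_add_u16_lane` and the lane layout are assumptions made for illustration; the 32-bit word is declared as a `std::atomic<std::uint32_t>` up front, so no pointer casting is involved.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper: atomically adds `delta` to one 16-bit lane of a 32-bit
// word, using only a 32-bit compare-exchange. `lane` is 0 (low half) or
// 1 (high half). Returns the previous value of that lane.
inline std::uint16_t fetch_add_u16_lane(std::atomic<std::uint32_t>& word,
                                        unsigned lane, std::uint16_t delta) {
    const unsigned shift = lane * 16u;
    const std::uint32_t mask = std::uint32_t{0xFFFFu} << shift;
    std::uint32_t old_word = word.load(std::memory_order_relaxed);
    for (;;) {
        const std::uint16_t old_lane =
            static_cast<std::uint16_t>((old_word & mask) >> shift);
        const std::uint16_t new_lane =
            static_cast<std::uint16_t>(old_lane + delta);  // wraps modulo 2^16
        // Replace only the selected lane; the neighbouring lane is carried over.
        const std::uint32_t new_word =
            (old_word & ~mask) | (std::uint32_t{new_lane} << shift);
        // On failure, old_word is refreshed with the current value and we retry.
        if (word.compare_exchange_weak(old_word, new_word,
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed)) {
            return old_lane;
        }
    }
}
```

Every update to one lane can make a concurrent update to the other lane fail and retry, which is one reason such a loop may be slower than a native narrow atomic.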
Thanks for replying quickly. Installed CUDA 10.2. I also see that it is supposed to work for sizes 1 and 2: My question is rather about this:
I also don't see a point to do CAS loop for |
On NVIDIA GPUs there are no 16-bit atomic ops, so they have to implement them via something else. Those _Interlock things are Windows-only too. I am not sure in general which hardware platforms can do 16-bit atomic operations natively. And yes they don't have an implementation of But Note: the standard does not mandate this. So it's a quality-of-implementation thing whether they do it or not. In a couple of months we will actually provide an implementation backported to at least C++14, covering more or less all platforms. We do plan to implement smaller atomics appropriately. |
I'm delighted to see! I'm proposing a PR to MSVC STL that patches Currently, it is lock-free only for sizes 1, 2, 4, and 8. But it could be lock-free for sizes like 3 or 7. I'm wondering if I should support it, and, if so, what exactly I should do. Some things that I suspect:
|
Actually, I was just thinking that there might be other problems with the trick of doing the CAS on a larger object. For example, what if I actually allocated just 2 bytes? Sure, underneath it will likely have done a larger allocation, but technically it would constitute a memory access violation to, say, cast a short pointer to an int pointer so you can do a 4-byte CAS. |
This access violation can be avoided depending on the boundary. I know the page size, and will fall back to lock-based. |
With short and int it is completely avoided by required_alignment == sizeof(short); that's why odd sizes are a complex special case. I thought it is complex on the one hand, and not really very useful on the other. |
I'm now also thinking that an idea to have Overalignment would imply oversize, so stores may be done as conventional stores, with no CAS loop. Sure, it will partly defeat the purpose of having an array of such types. But aren't atomics about performance anyway? The oversize penalty is likely to be less than the CAS penalty. |
If you do that you could use |
Hm. On Windows 32-bit x86, default alignment for So, I thought it could be solved without requiring alignment, but then |
Yeah, we would most likely require 8-byte alignment on our side if sizeof is 8 bytes. But that still allows you to use a simple array; you just need a proper allocator that does the alignment. For a 3-byte type that is not possible: even if the pointer is 4-byte aligned, ptr[1] will not be, unless the struct itself has a 4-byte alignas attribute (whatever that was called). |
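The array point above can be checked with a small sketch. The struct names `Packed3` and `Padded3` and the helper `element_is_aligned` are hypothetical, and the `static_assert`s encode what mainstream ABIs do rather than anything the standard guarantees.

```cpp
#include <cstddef>

// A plain 3-byte struct: elements of an array sit 3 bytes apart.
struct Packed3 { unsigned char r, g, b; };

// The same members with alignas(4): the size is rounded up to 4 bytes,
// so every array element starts on a 4-byte boundary.
struct alignas(4) Padded3 { unsigned char r, g, b; };

// Assumption: holds on common ABIs (x86, x86-64, ARM).
static_assert(sizeof(Packed3) == 3 && alignof(Packed3) == 1, "unexpected layout");
static_assert(sizeof(Padded3) == 4 && alignof(Padded3) == 4, "unexpected layout");

// True if element `i` of an array whose base is `alignment`-aligned
// is itself `alignment`-aligned.
constexpr bool element_is_aligned(std::size_t elem_size, std::size_t i,
                                  std::size_t alignment) {
    return (elem_size * i) % alignment == 0;
}
```

With `Packed3`, only every fourth element (offsets 0, 12, 24, ...) lands on a 4-byte boundary; with `Padded3`, every element does, at the cost of one padding byte per element.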
So And This means that the CAS trick will work only sometimes. When a value's location crosses certain boundaries, the implementation has to fall back to locks (not only to avoid spanning cache lines, but also to avoid access violations!). Since the efficiency of CAS is questioned already:
and runtime branching on whether to CAS or not would reduce its efficiency even more, probably always locking for EDIT: Actually, access violations can be excluded by accessing at the proper alignment. So only crossing a cache line boundary would be a problem. |
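The runtime branch discussed above boils down to a boundary test on the object's address. A minimal sketch, assuming a hypothetical helper `crosses_boundary` and a power-of-two boundary (for example 4 or 8 for an aligned CAS window, or 64 as an assumed cache line size):

```cpp
#include <cstddef>
#include <cstdint>

// True if the byte range [addr, addr + size) spans more than one
// `boundary`-sized, `boundary`-aligned block. `size` must be non-zero.
constexpr bool crosses_boundary(std::uintptr_t addr, std::size_t size,
                                std::size_t boundary) {
    return (addr / boundary) != ((addr + size - 1) / boundary);
}
```

For a 3-byte object at offset 2, `crosses_boundary(2, 3, 4)` is true, so no aligned 4-byte window contains it, while `crosses_boundary(2, 3, 8)` is false, so an aligned 8-byte window does. An implementation following this idea would pick the smallest aligned window that contains the object and fall back to a lock only when no supported window fits.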
But there are bigger problems with using a 4-byte CAS on a 3-byte struct, even if the type is manually aligned to 4-byte boundaries (or constrained with an |
CAS can assume some value for the padding bits, and use the same value in both the exchange and the comparand. On failure it will know the actual values, and can again use the same padding in the exchange and the comparand. I think there's no problem implementing a 3-byte atomic with 4-byte
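A sketch of the padding handling described above, under stated assumptions: the 3-byte value lives in the low 24 bits of a `std::atomic<std::uint32_t>` cell (which sidesteps the aliasing question rather than answering it), and `Rgb`, `pack`, `unpack`, and `compare_exchange_rgb` are hypothetical names introduced for illustration.

```cpp
#include <atomic>
#include <cstdint>

struct Rgb { std::uint8_t r, g, b; };

constexpr std::uint32_t kValueMask = 0x00FFFFFFu;  // low 3 bytes hold the value

inline std::uint32_t pack(Rgb v) {
    return std::uint32_t{v.r} | (std::uint32_t{v.g} << 8) |
           (std::uint32_t{v.b} << 16);
}

inline Rgb unpack(std::uint32_t w) {
    return Rgb{static_cast<std::uint8_t>(w),
               static_cast<std::uint8_t>(w >> 8),
               static_cast<std::uint8_t>(w >> 16)};
}

// 3-byte compare-exchange built on a 4-byte CAS. The padding byte is guessed
// as zero first; if the CAS fails only because the padding differs, it is
// retried with the observed padding in both comparand and exchange.
inline bool compare_exchange_rgb(std::atomic<std::uint32_t>& cell,
                                 Rgb& expected, Rgb desired) {
    std::uint32_t exp_word = pack(expected);
    for (;;) {
        const std::uint32_t pad = exp_word & ~kValueMask;
        if (cell.compare_exchange_strong(exp_word, pack(desired) | pad)) {
            return true;
        }
        // exp_word now holds the observed word.
        if ((exp_word & kValueMask) != pack(expected)) {
            expected = unpack(exp_word);  // genuine mismatch: report it
            return false;
        }
        // Only the padding differed: loop and retry with the observed padding.
    }
}
```

The padding byte is preserved rather than normalized, so a caller never observes a spurious failure caused by padding alone.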
So my questions are:
|
I think that it may be possible to implement lock-free atomic_ref<T> for non-natural atomic sizes, that is, sizes that are not a power of two.
Such an atomic_ref<T> could access aligned memory as a wider type. Most of the time it would have to fall back to compare-exchange (though a load may still work as a normal load). Sure, this has to deal with locally solving the strict aliasing issue, and it will also depend on a platform-specific guarantee that it is possible at all, but I think on x86-64 it is possible.
I want to confirm, if possible, whether such a lock-free atomic_ref<T> is expected from an implementation, or whether an implementation should fall back to lock-based for such sizes.
(Cross-posting from here: https://stackoverflow.com/q/62004443/2945027)