atomic_ref when external underlying type size is not natural atomic size (not power of two) #130

Closed
AlexGuteniev opened this issue May 26, 2020 · 14 comments

Comments

@AlexGuteniev

I think it may be possible to implement a lock-free atomic_ref<T> for sizes that are not natural atomic sizes, that is, sizes that are not a power of two.

Such an atomic_ref<T> could access aligned memory as a wider type. Most of the time it would have to fall back to compare-exchange (though load could still work as a plain load).

Sure, this has to deal with the strict-aliasing issue locally, and it also depends on a platform-specific guarantee that this is possible at all, but I think on x86-64 it is.
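Roughly, what I have in mind is something like the following (an illustrative sketch only, using GCC/Clang __atomic builtins for concreteness; the function names are mine, and it assumes the enclosing 4-byte word is 4-byte aligned and safely accessible):

#include <cstdint>
#include <cstring>

struct S { char a, b, c; };                      // sizeof(S) == 3

S load_via_wider(S* p) {
    // Plain wider load: read the whole 4-byte word, then drop the extra byte.
    std::uint32_t word = __atomic_load_n(reinterpret_cast<std::uint32_t*>(p), __ATOMIC_SEQ_CST);
    S result;
    std::memcpy(&result, &word, sizeof(S));      // the first 3 bytes are the object
    return result;
}

void store_via_wider(S* p, S desired) {
    // A store has to preserve the unrelated 4th byte, hence the CAS loop.
    std::uint32_t* wp = reinterpret_cast<std::uint32_t*>(p);   // the strict-aliasing issue lives here
    std::uint32_t expected = __atomic_load_n(wp, __ATOMIC_SEQ_CST);
    for (;;) {
        std::uint32_t next = expected;                     // keep the observed extra byte
        std::memcpy(&next, &desired, sizeof(S));           // overwrite the 3 payload bytes
        if (__atomic_compare_exchange_n(wp, &expected, next, /*weak*/ true,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
            break;                                         // on failure, expected is refreshed
    }
}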

I want to confirm, if possible, whether such a lock-free atomic_ref<T> is expected from an implementation, or whether an implementation should fall back to lock-based atomics for such sizes.

(Cross-posting from https://stackoverflow.com/q/62004443/2945027.)

@crtrott
Collaborator

crtrott commented May 26, 2020

We generally would expect implementations to go lock-free where possible, though we are doing experiments right now to see whether lock tables in some cases can actually be faster than CAS loops. That said, CUDA 10.2, for example, does implement lock-free atomics for types smaller than 32 bits via CAS loops on 32-bit types; see the include/cuda/std/detail/__atomic_derived file, right at the top.
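The general shape of that approach is roughly this (an illustration only, not the actual __atomic_derived code; it assumes a little-endian platform whose narrowest native CAS is 32-bit, and the function name is mine):

#include <cstdint>

void atomic_store_u8(std::uint8_t* p, std::uint8_t desired) {
    // Operate on the enclosing naturally aligned 32-bit word.
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    std::uint32_t* word = reinterpret_cast<std::uint32_t*>(addr & ~std::uintptr_t{3});
    unsigned shift      = (addr & 3) * 8;                  // byte position within the word (little-endian)
    std::uint32_t mask  = std::uint32_t{0xFF} << shift;

    std::uint32_t expected = __atomic_load_n(word, __ATOMIC_SEQ_CST);
    for (;;) {
        std::uint32_t next = (expected & ~mask) | (std::uint32_t{desired} << shift);
        if (__atomic_compare_exchange_n(word, &expected, next, /*weak*/ true,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
            return;                                        // a failed CAS refreshes expected
    }
}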

@AlexGuteniev
Author

Thanks for replying quickly.

I installed CUDA 10.2 and see the CAS-based implementation. There is no implementation of atomic_ref, though.

I also see that it is supposed to work for sizes 1 and 2: enable_if<sizeof(_Type) <= 2, int>

My question is rather about this:

#include <atomic>

struct S { char a, b, c; };
static_assert(sizeof(S) == 3);

// The array as a whole is aligned as atomic_ref requires, but because
// sizeof(S) == 3, elements other than s[0] are only 1-byte aligned.
struct alignas(std::atomic_ref<S>::required_alignment) D
{
  S s[1000];
} d;

std::atomic_ref<S> r(d.s[0]);

I also don't see the point of doing a CAS loop for sizeof(_Type) == 2. I think it is possible to just have required_alignment of 2 and use smaller-operand intrinsics such as _InterlockedExchange16 or _InterlockedExchangeAdd16.
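For example, a 2-byte fetch_add on MSVC could be a thin wrapper like this (a hypothetical helper of mine, not code from the MSVC STL), instead of a 32-bit CAS loop:

#include <intrin.h>

// Returns the previous value, like atomic_ref<short>::fetch_add would.
short fetch_add_16(short* p, short value) {
    return _InterlockedExchangeAdd16(p, value);   // short* converts to short volatile* implicitly
}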

@crtrott
Collaborator

crtrott commented May 26, 2020

On NVIDIA GPUs there are no 16-bit atomic ops, so they have to be implemented via something else. Those _Interlocked intrinsics are Windows-only, too. In general, I am not sure which hardware platforms can do 16-bit atomic operations natively.

And yes, they don't have an implementation of atomic_ref. I just used it as an example of a vendor-provided implementation of atomics that does CAS-loop-based, lock-free operations for things smaller than 32 bits. That is, I would expect a future version of CUDA to have atomic_ref sit on top of those calls.

But note: the standard does not mandate this, so it's a quality-of-implementation matter whether they do it or not. In a couple of months we will actually provide an implementation back-ported to at least C++14, covering more or less all platforms. We do plan to implement smaller atomics appropriately.

@AlexGuteniev
Author

We do plan to implement smaller atomics appropriately.

I'm delighted to hear that!

I'm proposing a PR to the MSVC STL that patches the std::atomic machinery to implement std::atomic_ref using existing code (microsoft/STL#843).

Currently, it is lock-free only for sizes 1, 2, 4, and 8, but it could be lock-free for sizes like 3 or 7. I'm wondering whether I should support that, and if so, what exactly I should do.

Some things that I suspect:

  • Making required_alignment greater than sizeof is not an option, as it would change the size, so I can't easily over-align a 3-byte struct into a 4-byte one.
  • As a result of the small alignment, such an atomic_ref<3 bytes> may cross a cache-line boundary. When it does, it could be better off lock-based, so atomic_ref<3 bytes> may not be is_always_lock_free (see the boundary check sketched below).
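What I mean by the boundary check, roughly (an illustrative helper of mine, assuming 64-byte cache lines):

#include <cstddef>
#include <cstdint>

constexpr std::size_t cache_line_size = 64;   // assumption; not queried from the hardware

bool crosses_cache_line(const void* p, std::size_t size) {
    auto first = reinterpret_cast<std::uintptr_t>(p);
    auto last  = first + size - 1;
    return (first / cache_line_size) != (last / cache_line_size);
}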

@crtrott
Collaborator

crtrott commented May 26, 2020

Actually, I was just thinking that there might be other problems with the do-a-CAS-on-a-larger-object trick. For example, what if I actually allocated only 2 bytes? Sure, underneath it will likely have been a larger allocation, but technically it would constitute a memory access violation to, say, cast a short pointer to an int pointer so you can do a 4-byte CAS.

@AlexGuteniev
Author

This access violation can be avoided depending on the boundary: I know the page size, and will fall back to lock-based when the widened access would cross a page.

@AlexGuteniev
Author

With short and int it is avoided completely by required_alignment == sizeof(short); that's why odd sizes are the complex special case. I thought it is complex on the one hand, and not really very useful on the other.

@AlexGuteniev
Author

AlexGuteniev commented May 27, 2020

I'm now also thinking that the idea of having required_alignment == 4 for a 3-byte struct on Windows, or even for a 2-byte int in a CUDA kernel, isn't necessarily bad.

Over-alignment would imply over-sizing, so stores could be done as conventional stores, with no CAS loop. Sure, it would partly defeat the purpose of having an array of such types, but aren't atomics about performance anyway? The over-size penalty is likely to be smaller than the CAS penalty.
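For what it's worth, over-aligning the type does pull the size up with it, which is what would make plain 4-byte loads and stores possible:

// alignas forces sizeof up to a multiple of the alignment, so the 3 payload
// bytes gain one padding byte.
struct alignas(4) S4 { char a, b, c; };
static_assert(sizeof(S4) == 4);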

@crtrott
Collaborator

crtrott commented May 27, 2020

If you do that, you could just as well use atomic<T> instead of atomic_ref<T>, in my mind. The main motivation for us was that we need to do atomics on things stored simply via a T*, i.e. arrays. What one could do is change the answer of is_always_lock_free based on alignof instead of sizeof.

@AlexGuteniev
Author

AlexGuteniev commented May 27, 2020

Hm. On Windows 32-bit x86, default alignment for int64_t is 4 bytes, not 8.

So int64_t a[1000] allocated on the stack may be aligned to 4 if no alignas(8) is specified.

I thought required_alignment was there to make this the user's problem.

It could be solved without requiring alignment, but then load and store would have to be implemented as fetch_add(0) and exchange respectively, since plain loads and stores would tear on such boundaries. And since in 32-bit mode there is no 64-bit fetch_add or exchange, ultimately it is a fallback to CAS for load and a CAS loop for store.
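Concretely, on 32-bit x86 that fallback would look something like this (my sketch using the cmpxchg8b-based intrinsic, not the MSVC STL implementation; function names are mine):

#include <cstdint>
#include <intrin.h>

std::int64_t load64(std::int64_t* p) {
    // Compare-exchange with an arbitrary comparand: if *p happens to be 0 it is
    // rewritten with 0 (no visible change); either way the return value is the
    // current contents, read atomically.
    return _InterlockedCompareExchange64(p, 0, 0);
}

void store64(std::int64_t* p, std::int64_t desired) {
    std::int64_t expected = *p;               // plain read only as an initial guess
    for (;;) {
        std::int64_t observed = _InterlockedCompareExchange64(p, desired, expected);
        if (observed == expected)
            return;
        expected = observed;                  // retry with the value actually seen
    }
}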

@crtrott
Collaborator

crtrott commented May 27, 2020

Yeah, we would most likely require an alignment of 8 bytes on our side if sizeof is 8 bytes. That still allows you to do a simple array; you just need to use a proper allocator which does alignment. For a 3-byte type that is not possible: even if the pointer is 4-byte aligned, ptr[1] will not be, unless the struct itself is declared alignas(4).

@AlexGuteniev
Author

AlexGuteniev commented Jun 2, 2020

So required_alignment solves the case of int64_t on 32-bit x86.

And required_alignment cannot solve the case of a 3-byte struct, since it cannot change the type of the T* array: required_alignment has to be 1 for a 3-byte struct.

This means that the CAS trick will work only sometimes. When a value's location crosses certain boundaries, the implementation has to fall back to locks (not only to avoid spanning cache lines, but also to avoid access violations!). Since the efficiency of CAS is questioned already:

we are doing experiments right now to see whether lock tables in some cases can actually be faster than CAS loops

and runtime branching on whether to CAS or not would reduce its efficiency even more, always locking for atomic_ref<3-byte struct> is probably the right thing to do.

EDIT: Actually, access violations can be excluded by accessing at proper alignment, so only crossing a cache-line boundary would be a problem.
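To illustrate the dispatch I have in mind (my reading of the above, not actual implementation code; it assumes x86-64, where aligned 4-, 8- and 16-byte CAS exist, and relies on the fact that a naturally aligned window never crosses a page):

#include <cstddef>
#include <cstdint>

enum class Strategy { cas4, cas8, cas16, lock };

Strategy pick_strategy(const void* p, std::size_t obj_size) {
    auto first = reinterpret_cast<std::uintptr_t>(p);
    auto last  = first + obj_size - 1;
    constexpr std::size_t widths[] = {4, 8, 16};
    for (std::size_t w : widths) {
        std::uintptr_t base = first & ~(static_cast<std::uintptr_t>(w) - 1);
        if (last < base + w)                  // fits in one naturally aligned w-byte window
            return w == 4 ? Strategy::cas4
                 : w == 8 ? Strategy::cas8
                 :          Strategy::cas16;
    }
    return Strategy::lock;                    // straddles a 16-byte boundary: take the lock
}

Whether the wider CAS paths are still worth it once the object straddles a cache line is exactly the open performance question.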

@dhollman
Collaborator

dhollman commented Jun 2, 2020

And required_alignment cannot solve the case of a 3-byte struct, since it cannot change the type of the T* array: required_alignment has to be 1 for a 3-byte struct.

required_alignment can be anything the implementation wants it to be. A low quality-of-implementation (QOI) implementation might set required_alignment to std::numeric_limits<size_t>::max() for every template parameter, and that implementation would be technically conforming (though I expect their users would not let them get away with that).

But there are bigger problems with using a 4-byte CAS on a 3-byte struct, even if the type is manually aligned to 4-byte boundaries (or constrained with an alignas clause). Most implementations do not make any guarantees about the values of padding bits, and so unless the platform supports an atomic masked-CAS, you have no guarantees that what you're doing is meaningful.

@AlexGuteniev
Author

CAS can assume some value for the padding bits and use that same value in both the exchange and the comparand. On failure it will learn the actual values, and can again make the exchange and the comparand agree on them.

I think there's no problem implementing a 3-byte atomic with a 4-byte required_alignment using the usual 4-byte CAS or LL/SC. Something similar was mentioned in this thread, though about supporting 1- and 2-byte atomics on a platform that only has 4-byte atomics: #130 (comment)

That said, CUDA 10.2, for example, does implement lock-free atomics for types smaller than 32 bits via CAS loops on 32-bit types; see the include/cuda/std/detail/__atomic_derived file, right at the top.
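Spelled out, the padding handling I have in mind is roughly this (an illustrative sketch with names of my own; it assumes the enclosing 4-byte word is aligned and accessible, and uses GCC/Clang builtins for concreteness):

#include <cstdint>
#include <cstring>

struct S3 { char a, b, c; };                         // stand-in 3-byte type

bool compare_exchange_via_wider(S3* p, S3& expected, S3 desired) {
    std::uint32_t* wp = reinterpret_cast<std::uint32_t*>(p);
    std::uint32_t padding_guess = 0;                 // assumed value of the extra byte
    for (;;) {
        std::uint32_t exp_word = padding_guess;
        std::uint32_t des_word = padding_guess;
        std::memcpy(&exp_word, &expected, sizeof(S3));   // same padding in the comparand...
        std::memcpy(&des_word, &desired,  sizeof(S3));   // ...and in the exchange value
        std::uint32_t observed = exp_word;
        if (__atomic_compare_exchange_n(wp, &observed, des_word, /*weak*/ false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
            return true;
        S3 seen;
        std::memcpy(&seen, &observed, sizeof(S3));
        if (std::memcmp(&seen, &expected, sizeof(S3)) == 0) {
            padding_guess = observed;                // payload matched, only the padding differed: retry
            continue;
        }
        expected = seen;                             // genuine mismatch: report what was actually there
        return false;
    }
}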


So my questions are:

  • What is a high-quality implementation supposed to do for atomic_ref<3 bytes> on a platform that has 1-, 2-, 4- and 8-byte atomics?
  • If handling this case as always lock-based is not the highest quality of implementation, is it too low to be part of a real standard library?
