fix Issue 12891: add atomicFetchAdd to core.atomic #1208
Conversation
Yes, there should also be a
@@ -190,6 +254,14 @@ else version( AsmX86_32 )
//
// += -= *= /= %= ^^= &=
// |= ^= <<= >>= >>>= ~=
static if( op == "+=" && __traits(isIntegral, T) && is(T==V1) ) {
Try `pragma(msg, T, " == ", V1, " = ", is(T == V1), " (op: ", op, ")");` just on function entry, below `body`.
The `is(T == V1)` constraint will miss half of the cases where this optimization is applicable, e.g. `atomicOp!"+="(u8, 1)`.
Note that you can also do the following, which is defined in terms of an int-promoted add with subsequent coercion to ubyte.
```d
ubyte u8;
int val = 1;
u8 += val; // int-promoted add, result coerced back to ubyte
```
I think we can implement this like the following:

```d
size_t tmp = mod; // convert all operands to size_t
asm
{
    mov RAX, tmp;
    mov RDX, val;
}
static if (T.sizeof == 1) asm { lock; xadd [RDX], AL; }
else static if (T.sizeof == 2) asm { lock; xadd [RDX], AX; }
else static if (T.sizeof == 4) asm { lock; xadd [RDX], EAX; }
else static if (T.sizeof == 8) asm { lock; xadd [RDX], RAX; }
asm { mov tmp, RAX; }
return cast(T)(tmp + mod);
```
What does this code do:

```d
asm { mov tmp, RAX; }
return cast(T)(tmp + mod);
```

I also didn't understand this:

```d
return oval + mod;
```

Why is any addition going on here (and why add `mod`... isn't `xadd` doing that job)?
Btw, the pragma printed this to the console for the unittest:

```
ulong == ulong = true (op: +=)
ulong == ulong = true (op: -=)
ulong == ulong = true (op: +=)
ulong == ulong = true (op: -=)
ulong == ulong = true (op: +=)
ulong == ulong = true (op: -=)
ubyte == int = false (op: +=)
ushort == int = false (op: +=)
uint == int = false (op: +=)
ulong == int = false (op: +=)
byte == int = false (op: +=)
short == int = false (op: +=)
int == int = true (op: +=)
long == int = false (op: +=)
ubyte == int = false (op: -=)
ushort == int = false (op: -=)
uint == int = false (op: -=)
ulong == int = false (op: -=)
byte == int = false (op: -=)
short == int = false (op: -=)
int == int = true (op: -=)
long == int = false (op: -=)
ulong == int = false (op: +=)
ulong == int = false (op: -=)
ulong == int = false (op: +=)
ulong == int = false (op: -=)
uint == int = false (op: +=)
ulong == int = false (op: +=)
uint == int = false (op: +=)
uint == int = false (op: +=)
uint == int = false (op: +=)
uint == int = false (op: +=)
```
> Why is any addition going on here (and why added with mod.. isn't xadd doing this job)?

`XADD` returns the value before the addition, so you have to add the operand again to return the total result.
… to size_t for xadd op. Debug pragmas in for testing
Agreed. I've made atomicFetchAdd private. I think separating this logic into another function (at least internally) aids code readability though. My latest patch has most (if not all) of your feedback implemented.
@MartinNowak Btw, please read my replies on your previous code comments. (GitHub hides them once a new commit is made)
Well it turned out to be a bit more involved, but thanks for driving this.
@@ -190,6 +222,14 @@ else version( AsmX86_32 )
//
// += -= *= /= %= ^^= &=
// |= ^= <<= >>= >>>= ~=
static if( op == "+=" && __traits(isIntegral, T) ) {
Looks like you need a `&& T.sizeof <= 4` for the 32-bit case.
I couldn't find any good info online on how xaddl and xadd differ. Is there a reason to use one over the other? Btw, I'm thinking that for `+=1` and `-=1` atomicOp calls (using a static if check) we could instead use atomic `INC` asm calls; this way you transparently get the best performance possible. In atomicFetchAdd, I'm thinking something like this:
I'm still confused: why do we need to "add it again"? Doesn't this mean that this atomic action takes two addition ops to complete, and hence is slow? (asm noob speaking)
No need to,
GCC has 2 builtins, atomic_add_fetch and atomic_fetch_add. The former returns the value after adding the operand, the latter returns the value before adding it; this is also the behavior of
Nope, the second add simply adds 2 registers, which is literally free (your CPU can perform 3 independent adds per cycle). Doing
```d
    assert( atomicValueIsProperlyAligned!(size_t)( cast(size_t) &val ) );
else
    assert( atomicValueIsProperlyAligned!(T)( cast(size_t) &val ) );
}
```
We don't need this `in` contract here; it is only needed for 8-byte CAS on x86-32.
Removed contract
Motivating benchmark:

```d
import core.atomic, std.algorithm, std.datetime, std.stdio;

void main()
{
    auto dur = Duration.max;
    foreach (i; 0 .. 20)
    {
        auto sw = StopWatch(AutoStart.yes);
        shared uint rc;
        foreach (_; 0 .. 1_000_000)
        {
            atomicOp!"+="(rc, 1);
            atomicOp!"-="(rc, 1);
        }
        dur = min(dur, cast(Duration)sw.peek);
    }
    writeln(dur);
}
```

Running this with your pull gives me
…cOp to perform the addition/subtraction op
My powers of deduction are failing me. Why doesn't `+=` return the result of an AddFetch (without the second addition)... as the `+=` expression means "return me the value of myself after the modifier"?
Done in last commit! Also added atomicFetchSub.
Auto-merge toggled on
fix Issue 12891: add atomicFetchAdd to core.atomic
Thanks. There is no Intel instruction that atomically adds a number to a variable and returns the result.
Place opening curly bracket on its own line and remove empty line of spaces.
@jadbox, please follow the D standard library code style and squash such small fix-up commits so they don't pollute the log.
@denis-sh How should I do that exactly? Just perform a rebase squash and push?
@jadbox, yes. You need to prepare your local branch (e.g. rebase squash) and, once it's ready, update the remote one (push with force). GitHub allows non-fast-forward updates for pull requests.
@denis-sh It looks like it shouldn't pollute master, as it was simply merged in (a single commit point). How would compressing the history of the branch that was merge-committed be cleaner?
@jadbox, a "compressed" branch view composed of only merge commits isn't enough, as merged branches often contain a few logically distinct changes divided into commits. When someone looks through the log, they inspect all merged branch commits, and such sequences of nonsense fixups waste vertical space and don't look good. And that's not even mentioning GitHub's commits page.
Sure, the commits don't convey any information, but there is no real need to worry about such small stuff.
If you want to do a little more on this @jadbox, you might consider supporting pointer arithmetic as well, i.e.
[Whitespace] Style fixup for pull #1208.
Rats, this is missing documentation.
My mistake, those functions are simply used internally by atomicOp.
This is the placeholder pull request for the work on improving the performance of fetch-add atomic operations. It adds a new implementation for the `+=` and `-=` atomicOp paths and provides a new method, atomicFetchAdd. I'd debate whether we need atomicFetchSub for this feature, but I'm open to thoughts.
Ideally, I need someone to help benchmark this PR to see if it does indeed improve performance on x86 and x86_64.
https://issues.dlang.org/show_bug.cgi?id=12891