This repository has been archived by the owner on Oct 12, 2022. It is now read-only.

fix Issue 12891: add atomicFetchAdd to core.atomic #1208

Merged
merged 11 commits into dlang:master on Apr 14, 2015

Conversation

@jadbox (Contributor) commented Apr 7, 2015

This is the placeholder pull request for the work on improving performance around fetch-add atomic operations. It provides a new implementation for the += and -= cases of atomicOp and adds a new method, atomicFetchAdd. I'd debate whether we also need atomicFetchSub for this feature, but I'm open to thoughts.

Ideally, I need someone to help benchmark this PR to see whether it does indeed improve performance on x86 and x86_64.

https://issues.dlang.org/show_bug.cgi?id=12891

@MartinNowak (Member)

Yes, there should also be a fetchSub because it maps to a separate intrinsic.
https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html#index-g_t_005f_005fatomic_005fsub_005ffetch-3536
It might be a better overall design not to introduce a new function, but to only change the internal implementation of atomicOp.

@@ -190,6 +254,14 @@ else version( AsmX86_32 )
//
// += -= *= /= %= ^^= &=
// |= ^= <<= >>= >>>= ~=
static if( op == "+=" && __traits(isIntegral, T) && is(T==V1) ) {
@MartinNowak (Member):

Try pragma(msg, T, " == ", V1, " = ", is(T == V1), " (op: ", op, ")"); right at the function entry, just below body.

The is(T == V1) constraint will miss half of the cases where this optimization is applicable, e.g. atomicOp!"+="(u8, 1).

Note that you can also do this, which is defined in terms of an int-promoted add with subsequent coercion to ubyte:

    ubyte u8;
    int val = 1;
    u8 += val;
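
A guard relaxed along those lines might look like this (a sketch; the exact predicate is an assumption, not necessarily the merged code):

// accept any integral operand and rely on integer promotion for the
// add, coercing the result back to T afterwards
static if( op == "+=" && __traits(isIntegral, T) && __traits(isIntegral, V1) ) {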

I think we can implement this like the following.

size_t tmp = mod; // convert all operands to size_t

asm
{
    mov RAX, tmp; // the operand to add
    mov RDX, val; // the address of the shared variable
}

static if (T.sizeof == 1) asm { lock; xadd [RDX], AL; }
else static if (T.sizeof == 2) asm { lock; xadd [RDX], AX; }
else static if (T.sizeof == 4) asm { lock; xadd [RDX], EAX; }
else static if (T.sizeof == 8) asm { lock; xadd [RDX], RAX; }

asm { mov tmp, RAX; } // RAX now holds the value *before* the addition

return cast(T)(tmp + mod); // add once more to produce the post-add value

@jadbox (Contributor, Author):

What does this code do?
asm { mov tmp, RAX; }
return cast(T)(tmp + mod);

I also didn't understand this line:
return oval + mod;

Why is any addition going on here (and why is it added with mod... isn't xadd doing this job)?

@jadbox (Contributor, Author):

Btw, the pragma printed this to the console during the unittest run:

ulong == ulong = true (op: +=)
ulong == ulong = true (op: -=)
ulong == ulong = true (op: +=)
ulong == ulong = true (op: -=)
ulong == ulong = true (op: +=)
ulong == ulong = true (op: -=)
ubyte == int = false (op: +=)
ushort == int = false (op: +=)
uint == int = false (op: +=)
ulong == int = false (op: +=)
byte == int = false (op: +=)
short == int = false (op: +=)
int == int = true (op: +=)
long == int = false (op: +=)
ubyte == int = false (op: -=)
ushort == int = false (op: -=)
uint == int = false (op: -=)
ulong == int = false (op: -=)
byte == int = false (op: -=)
short == int = false (op: -=)
int == int = true (op: -=)
long == int = false (op: -=)
ulong == int = false (op: +=)
ulong == int = false (op: -=)
ulong == int = false (op: +=)
ulong == int = false (op: -=)
uint == int = false (op: +=)
ulong == int = false (op: +=)
uint == int = false (op: +=)
uint == int = false (op: +=)
uint == int = false (op: +=)
uint == int = false (op: +=)

@MartinNowak (Member):

Why is any addition going on here (and why is it added with mod... isn't xadd doing this job)?

XADD returns you the value before the addition, so you have to add it again to return the total result.
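
To illustrate with a minimal non-atomic D analogy (the helper name xaddLike is made up):

// what XADD does, minus the atomicity: the caller gets back the value
// the memory location held *before* the addition
T xaddLike(T)(ref T mem, T add)
{
    T old = mem;
    mem += add;
    return old;
}

// so += must return xaddLike(val, mod) + mod to yield the updated value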

… to size_t for xadd op. Debug pragmas in for testing
@jadbox (Contributor, Author) commented Apr 8, 2015

It might be a better overall design not to introduce a new function, but to only change the internal implementation of atomicOp.

Agreed. I've made atomicFetchAdd private. I think separating this logic into another function (at least internally) aids code readability though.

My latest patch has most (if not all) of your feedback implemented.

@jadbox (Contributor, Author) commented Apr 8, 2015

@MartinNowak Btw, please read my replies on your previous code comments. (GitHub hides them once a new commit is made)

@MartinNowak (Member)

Well, it turned out to be a bit more involved, but thanks for driving this.

@@ -190,6 +222,14 @@ else version( AsmX86_32 )
//
// += -= *= /= %= ^^= &=
// |= ^= <<= >>= >>>= ~=
static if( op == "+=" && __traits(isIntegral, T) ) {
@MartinNowak (Member):

Looks like you need a && T.sizeof <= 4 for the 32 bit case.
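
So the 32-bit path would presumably guard like this (a sketch mirroring the quoted diff line; the exact condition is an assumption):

// x86-32 has no 8-byte XADD register operand, so restrict the fast
// path to operands of at most 4 bytes
static if( op == "+=" && __traits(isIntegral, T) && T.sizeof <= 4 ) {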

@jadbox (Contributor, Author) commented Apr 9, 2015

I couldn't find any good info online on how xaddl and xadd differ. Is there a reason to use one over the other?

Btw, I'm thinking that +=1 and -=1 atomicOp calls (detected with a static if check) could instead use an atomic INC asm instruction. That way you transparently get the best performance possible.

In atomicFetchAdd, I'm thinking something like this:
static if (mod == 1)
{
    asm pure nothrow @nogc
    {
        mov RDX, val;
        lock; inc [RDX];
    }
    return ????
}
else { ...

@jadbox (Contributor, Author) commented Apr 9, 2015

XADD returns you the value before the addition, so you have to add it again to return the total result.

I'm still confused... why do we need to "add it again"? Doesn't this mean that this atomic action takes two addition ops to complete... and hence is slow? (asm noob speaking)

@MartinNowak (Member)

Btw, I'm thinking that +=1 and -=1 atomicOp calls (detected with a static if check) could instead use an atomic INC asm instruction. That way you transparently get the best performance possible.

No need to, inc won't be faster than xadd, and you'd need to make the 1 a template parameter to check this.
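
To make that concrete (a hypothetical sketch, not part of the PR): mod is a runtime argument, so static if (mod == 1) cannot compile; the operand would first have to become a template parameter, e.g.:

import core.atomic : atomicOp;

// hypothetical variant where the operand is a compile-time value, so
// static if can select a specialized path for the +=1 case
T atomicAddImmediate(size_t mod, T)(ref shared T val)
{
    static if (mod == 1)
        return atomicOp!"+="(val, 1);          // a LOCK INC path could go here
    else
        return atomicOp!"+="(val, cast(T)mod); // general LOCK XADD path
}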

I'm still confused... why do we need to "add it again"?

GCC has 2 builtins, atomic_add_fetch and atomic_fetch_add. The former returns the value after adding the op; the latter returns the value before adding the op, which is also the behavior of XADD. So implementing += requires us to add the op again.

Doesn't this mean that this atomic action takes two addition ops to complete... and hence is slow? (asm noob speaking)

Nope, the second add simply adds two registers, which is practically free (your CPU can perform 3 independent adds per cycle). Doing LOCK XADD on a memory location synchronizes the memory bus and costs 20-50 cycles of latency.

assert( atomicValueIsProperlyAligned!(size_t)( cast(size_t) &val ) );
else
assert( atomicValueIsProperlyAligned!(T)( cast(size_t) &val ) );
}
@MartinNowak (Member):

We don't need this in contract here; it is only needed for the 8-byte CAS on x86-32.

@jadbox (Contributor, Author):

Removed contract

@MartinNowak (Member)

Motivating benchmark:

import core.atomic, std.algorithm, std.datetime, std.stdio;

void main()
{
    auto dur = Duration.max;
    foreach (i; 0 .. 20)
    {
        auto sw = StopWatch(AutoStart.yes);
        shared uint rc;
        foreach (_; 0 .. 1_000_000)
        {
            atomicOp!"+="(rc, 1);
            atomicOp!"-="(rc, 1);
        }
        dur = min(dur, cast(Duration)sw.peek);
    }
    writeln(dur);
}

Running this with your pull gives me 18 ms instead of 26 ms; the old code was 44% slower (equivalently, the new code is 30% faster).

@jadbox (Contributor, Author) commented Apr 12, 2015

implementing += requires us to add the op again.

My powers of deduction are failing me. Why doesn't += return the result of an add-fetch (without the second addition), since the += expression means "give me my own value after the modification"?

Even better, we keep this as atomicFetchAdd and do the post-adding in atomicOp.

Done in last commit! Also added atomicFetchSub.
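
A minimal sketch of that split, using the PR's function names (guards and conversions elided):

// atomicFetchAdd/atomicFetchSub return the value from *before* the
// change (XADD semantics); atomicOp re-applies the operand to produce
// the post-operation value that += and -= promise
static if (op == "+=")
    return cast(T)(atomicFetchAdd(val, mod) + mod);
else static if (op == "-=")
    return cast(T)(atomicFetchSub(val, mod) - mod);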

@MartinNowak (Member)

Auto-merge toggled on

MartinNowak added a commit that referenced this pull request Apr 14, 2015
fix Issue 12891: add atomicFetchAdd to core.atomic
@MartinNowak merged commit c5f1019 into dlang:master Apr 14, 2015
@MartinNowak (Member)

Thanks.

There exists no Intel instruction that atomically adds a number to a variable and returns the result.
You only have xadd, which adds a number and returns the value the variable had before the addition.
So in order to get the result after the add, as needed by +=, we have to add the number to the result of atomicFetchAdd.
Hope that answers your question.

denis-sh added a commit to denis-sh/druntime that referenced this pull request Apr 15, 2015
Place opening curly bracket on its own line and remove empty line of spaces.
@denis-sh (Contributor)

@jadbox, please follow the D standard library code style and squash such small fix-up commits so they don't pollute the log.

@jadbox (Contributor, Author) commented Apr 15, 2015

@denis-sh How should I do that exactly? Just perform a rebase squash and push?

@denis-sh (Contributor)

@jadbox, yes. You need to prepare your local branch (e.g. rebase and squash) and, once it's ready, update the remote one (push with force). GitHub allows non-fast-forward updates for pull requests.

@jadbox (Contributor, Author) commented Apr 15, 2015

@denis-sh It looks like it shouldn't pollute master, since it was simply merged in (a single commit point). How would compressing the history of the branch that was merge-committed over be cleaner?

@denis-sh (Contributor)

@jadbox, "compressed" branch view compound from only merge commits isn't enough as merged branches often contain few logically distinct changes divided into commits so when one looks through the log he inspects all merged branch commits and such sequences of non-sense fixups waste vertical space and doesn't look good. I'm not even telling about github's commits page.

@MartinNowak (Member)

Sure, the commits don't convey any information, but there is no real need to worry about such small stuff.

@MartinNowak (Member)

If you want to do a little more on this @jadbox, you might consider supporting pointer arithmetic as well, i.e. shared int* ptr; atomicOp!"+="(ptr, 2).
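
The expected semantics would presumably mirror ordinary D pointer arithmetic (my assumption):

shared int* ptr;
// advancing by 2 elements means advancing by 2 * int.sizeof bytes, so
// the operand would be scaled by the pointee size before the fetch-add
atomicOp!"+="(ptr, 2);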

MartinNowak added a commit that referenced this pull request Apr 20, 2015
@MartinNowak (Member)

Rats, this is missing documentation.

@MartinNowak (Member)

My mistake, those functions are simply used internally by atomicOp.
