This repository was archived by the owner on Oct 12, 2022. It is now read-only.

upgraded int array ops #471

Merged
merged 4 commits into from
Aug 10, 2013

Conversation

John-Colvin
Contributor

Things in this pull, all relating to arrayint.d:

  • ported everything to x64
  • fixed/reimplemented all the multiplication asm, now fully functioning.
  • upgraded unittests to check correct signed behaviour.
  • upgraded unittests to get 100% coverage where appropriate, excepting the little helper function redirects.

On my machine this provides dramatic performance improvements for 64-bit int array ops, although I haven't done a detailed profile as yet.

The reimplemented multiplication asm now requires SSE4.1 (checked at runtime, with a fallback). This shouldn't be a problem on Intel, as SSE4.1 was introduced with Penryn (2007/2008). Slightly older AMD processors will lose out, however, as AMD only introduced SSE4.1 in late 2011 with the Bulldozer architecture.

Further direction if this is accepted:

  • take a similar approach for other types.
  • make use of more recent tech such as AVX.
  • add some amd specific code.

Further down the line:

  • ARM?

@John-Colvin
Contributor Author

Just noticed a bug for uint > uint.max/2

Will fix in the next few days.

[not valid, see below]

@WalterBright
Member

Awesome, thanks for doing this.

As for taking advantage of new architectures and relying on SSE4.1: please add a runtime check for which instruction set is present.

@John-Colvin
Contributor Author

Particular instruction set extensions (e.g. SSE4.1) are checked for at runtime using core.cpuid properties.

FYI: applicable extensions are enabled separately in the unittests and tested (limited by the processor on the testing machine, of course). Currently there is no overlap of feature sets that aren't always present together (e.g. if SSE4.1 is present then so is SSE2), so this is sufficient to get 100% test coverage without problems, for now at least.

@John-Colvin
Contributor Author

I'm a little uncomfortable with all the functions returning int[]; I presume the result is just implicitly cast to uint[] as required?
It feels like a bit of a subversion of the type system to return large negative ints instead of large positive uints (> uint.max/2).
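
For what it's worth, the reinterpretation is at least lossless in both directions, since the 32-bit patterns are identical:

unittest
{
    uint u = uint.max - 1;      // 4294967294
    int  i = cast(int) u;       // -2: same 32 bits, reinterpreted
    assert(cast(uint) i == u);  // converting back loses nothing
}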

@John-Colvin
Contributor Author

OK, there is no bug; I'd just got my understanding of the pmulld instruction wrong. It keeps only the low 32 bits of each lane's 32x32 product, and those are the same for signed and unsigned operands.
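
A one-line check of that fact:

unittest
{
    // the low 32 bits of a 32x32 multiply don't depend on signedness,
    // which is what pmulld computes per lane
    uint x = 0x9000_0000;  // well above uint.max / 2
    uint y = 3;
    assert(cast(uint)(cast(int) x * cast(int) y) == x * y);
}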

From my perspective this is ready for merging, let me know if there are any problems.

@@ -207,6 +223,127 @@ body
}
}
}
version (D_InlineAsm_X86_64)
{
if (sse2 && a.length >= 8)
Member

x86-64 implies SSE2 support, so there is no need for a runtime test, nor for the MMX and scalar fallbacks.

Contributor Author

Those aren't just fallbacks for lack of SSE2 support; they also provide support for smaller arrays.

In x64 I will remove the runtime check for SSE2 and re-implement the 4 <= a.length < 8 case with a single-register SSE2 routine (no loop), the same as already done for the multiplication code.
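
Something along these lines (a sketch only, using DMD-style x86-64 inline asm; the function and variable names are made up):

// handle exactly one 4-int chunk with a single XMM register, no loop
void addOneChunk(int* aptr, const(int)* bptr)
{
    asm
    {
        mov RSI, bptr;       // operand source
        mov RDI, aptr;       // destination
        movdqu XMM0, [RSI];  // unaligned load of 4 ints
        movdqu XMM1, [RDI];
        paddd XMM1, XMM0;    // packed 32-bit add (SSE2)
        movdqu [RDI], XMM1;  // store the result back
    }
}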

@MartinNowak
Member

Thanks for your effort. I wish we had already gotten rid of all those array* modules, but I haven't managed to finish this yet: https://gist.github.com/MartinNowak/1235470

  • AFAIK x64 loop alignment should be 16-byte.
  • As said above, SSE2 is available on all x64 CPUs.
  • Have you benchmarked using 8 registers per loop?

mov bptr, RAX;
}
}
else if (a.length >= 2) //should probably be left to the compiler?
Member

Yeah, please just drop that part and leave it to the while loop at the end.

Contributor Author

will do.

@John-Colvin
Contributor Author

"I wished we already got rid of all those array* modules but I didn't got to finish this yet, https://gist.github.com/dawgfoto/1235470."
That looks very interesting. Are you planning to restart work on that any time soon?

"AFAIK x64 loop alignment should be 16-byte."
Yes, sorry my mistake.

"Have you benchmarked using 8 registers per loop?"
Not yet; I benchmarked 4 registers and couldn't see any significant speedup (< 1%, probably not valid). Do you have any advice for how to do these sorts of micro-benchmarks? On my machine the results are very inconsistent, presumably due to interference from other processes.
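
(A common noise-damping approach, sketched here with current Phobos names, is to take the best of many runs rather than an average, since interference from other processes only ever makes a run slower:)

import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writeln;

void main()
{
    auto a = new int[](1 << 16);
    auto b = new int[](1 << 16);

    long best = long.max;
    foreach (_; 0 .. 1000)
    {
        auto sw = StopWatch(AutoStart.yes);
        a[] += b[];
        sw.stop();
        auto ns = sw.peek.total!"nsecs";
        if (ns < best)
            best = ns;  // keep the least-disturbed run
    }
    writeln("best: ", best, " ns");
}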

@MartinNowak
Member

Have you benchmarked using 8 registers per loop?

When I wrote the arrayOps, the following pattern turned out to be a good performance compromise for short and long arrays.
The important part is that there are two loops: one that loads and stores 4 registers per iteration and another that only uses a single register.
That way you can still use XMM operations for fewer than 16 ints.
I also wonder whether it would make more sense to use SIMD intrinsics for this and leave loop unrolling to the compiler.

size_t arrayOpImpl(alias unit, string op, uint alignMask, T)
(size_t len, T* a, T* b, T* c) if (unit.mangleof != nativeUnit!().mangleof)
in
{
    assert(!(alignMask & 0b100) || !(cast(size_t)a & (unit.regsz - 1)));
    assert(!(alignMask & 0b010) || !(cast(size_t)b & (unit.regsz - 1)));
    assert(!(alignMask & 0b001) || !(cast(size_t)c & (unit.regsz - 1)));
}
body
{
    enum mova = (alignMask & 0b100) ? unit.inst!(T).amov : unit.inst!(T).umov;
    enum movb = (alignMask & 0b010) ? unit.inst!(T).amov : unit.inst!(T).umov;
    enum movc = (alignMask & 0b001) ? unit.inst!(T).amov : unit.inst!(T).umov;
    enum opinst = unit.inst!(T).optab[opStringToEnum(op)];
    alias unit.MMRegs MM;
    enum regsz = toString(unit.regsz);

    auto n = len;

    mixin(`
        asm {
            mov `~GP.DI~`, a; // destination
            mov `~GP.SI~`, b; // 1st oprnd source
            mov `~GP.DX~`, c; // 2nd oprnd source
            mov `~GP.CX~`, n; // loop counter

            align `~toString(2 * size_t.sizeof)~`;
        LloopA:
            cmp `~GP.CX~`, `~toString(4 * unit.regsz / T.sizeof)~`;
            jb LloopB;
            `~movb~` `~MM._0~`, [`~GP.SI~` + 0*`~regsz~`];
            `~movb~` `~MM._1~`, [`~GP.SI~` + 1*`~regsz~`];
            `~movb~` `~MM._2~`, [`~GP.SI~` + 2*`~regsz~`];
            `~movb~` `~MM._3~`, [`~GP.SI~` + 3*`~regsz~`];
            add `~GP.SI~`, 4*`~regsz~`;

            `~movc~` `~MM._4~`, [`~GP.DX~` + 0*`~regsz~`];
            `~movc~` `~MM._5~`, [`~GP.DX~` + 1*`~regsz~`];
            `~movc~` `~MM._6~`, [`~GP.DX~` + 2*`~regsz~`];
            `~movc~` `~MM._7~`, [`~GP.DX~` + 3*`~regsz~`];
            add `~GP.DX~`, 4*`~regsz~`;

            `~opinst~` `~MM._0~`, `~MM._4~`;
            `~opinst~` `~MM._1~`, `~MM._5~`;
            `~opinst~` `~MM._2~`, `~MM._6~`;
            `~opinst~` `~MM._3~`, `~MM._7~`;

            `~mova~`[`~GP.DI~` + 0*`~regsz~`], `~MM._0~`;
            `~mova~`[`~GP.DI~` + 1*`~regsz~`], `~MM._1~`;
            `~mova~`[`~GP.DI~` + 2*`~regsz~`], `~MM._2~`;
            `~mova~`[`~GP.DI~` + 3*`~regsz~`], `~MM._3~`;
            add `~GP.DI~`, 4*`~regsz~`;

            sub `~GP.CX~`, `~toString(4 * unit.regsz / T.sizeof)~`;
            jmp LloopA;

            align `~toString(2 * size_t.sizeof)~`;
        LloopB:
            cmp `~GP.CX~`, `~toString(unit.regsz / T.sizeof)~`;
            jb Ldone;

            `~movb~` `~MM._0~`, [`~GP.SI~`];
            add `~GP.SI~`, `~regsz~`;

            `~movc~` `~MM._4~`, [`~GP.DX~`];
            add `~GP.DX~`, `~regsz~`;

            `~opinst~` `~MM._0~`, `~MM._4~`;

            `~mova~`[`~GP.DI~`], `~MM._0~`;
            add `~GP.DI~`, `~regsz~`;

            sub `~GP.CX~`, `~toString(unit.regsz / T.sizeof)~`;
            jmp LloopB;

        Ldone:
            `~unit.loopCleanup~`
            mov n, `~GP.CX~`;
        }
    `);

    return len - n;
}

@John-Colvin
Contributor Author

I'm working on a version of Martin's code that will hopefully generate the correct code replacing all the array* modules.

The disadvantage of anything based on the current design is that expressions like this one:

a[] *= b[] + c[];

will still completely miss all these hand-optimised loops.

Any suggestions for how we might deal with this? It seems a bit rubbish to present these fast array ops and then give up as soon as someone writes more than one op in an expression.

@John-Colvin
Contributor Author

The more I look into this, the more I feel like it doesn't belong here at all. By implementing these functions so far from the backend, we don't have access to even basic optimisations like expression simplification. Even a[] = b[] - b[] doesn't get spotted. Getting proper behaviour out of multiple-op expressions would require implementing what is effectively a small, very specialised backend inside druntime, which just seems wrong.

On the other side of the coin, I'm getting much better performance (in places) with my hand written loops than gdc or ldc manage with their auto-vectorizing, so leaving things to currently existing backends also seems a bad choice.

Neither way seems good :(

@MartinNowak
Member

I think we should drop the asm code, because using vector types generates much better code and is much easier to write.
Until now I had never used the SIMD support, but I just wrote an almost complete implementation of all the rt.array*.d modules in 100 LOC.
We should seriously consider letting the compiler instantiate a template and getting rid of the fat compiler<=>runtime interface.

https://gist.github.com/MartinNowak/5338859
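
The flavour of it, in a minimal sketch with core.simd vector types (not the gist's actual code; alignment handling elided):

import core.simd : int4;

// a[] += b[], 4 ints at a time; the compiler picks the instructions
void addAssign(int[] a, const(int)[] b)
{
    size_t i = 0;
    for (; i + 4 <= a.length; i += 4)
    {
        // note: these loads assume suitably aligned data
        int4 va = *cast(int4*)(a.ptr + i);
        int4 vb = *cast(const(int4)*)(b.ptr + i);
        *cast(int4*)(a.ptr + i) = va + vb;
    }
    for (; i < a.length; ++i)  // scalar tail
        a[i] += b[i];
}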

@ibuclaw
Member

ibuclaw commented Apr 8, 2013

On the other side of the coin, I'm getting much better performance (in places) with my hand written loops than gdc or ldc manage with their auto-vectorizing, so leaving things to currently existing backends also seems a bad choice.

Actually (I have tested this before), the code generated by the compiler is so suboptimal that auto-vectorization is not even possible. In fact, it's about 3x slower than the generic array op implementations in druntime... And that's bad, given that you can't inline the druntime library calls.

@John-Colvin
Contributor Author

Actually (I have tested this before), the code generated by the compiler is so suboptimal that auto-vectorization is not even possible. In fact, it's about 3x slower than the generic array op implementations in druntime... And that's bad, given that you can't inline the druntime library calls.

GDC does manage auto-vectorisation on array ops. Or am I missing your point?

@John-Colvin
Contributor Author

I think we should drop the asm code, because using vector types generates much better code and is much easier to write.
Until now I had never used the SIMD support, but I just wrote an almost complete implementation of all the rt.array*.d modules in 100 LOC.
We should seriously consider letting the compiler instantiate a template and getting rid of the fat compiler<=>runtime interface.

https://gist.github.com/MartinNowak/5338859

That is very neat, I agree.
However, the problem of dealing with arbitrary array expressions still remains, does it not?

@MartinNowak
Member

The disadvantage of anything based on the current design is that expressions like this one:

a[] *= b[] + c[];
will still completely miss all these hand-optimised loops.

Actually, arbitrarily complex array expressions are supported.
The problem is that we can't compile everything into druntime, and that it becomes hard to maintain.

void foo()
{
    float[] a, b, c, d, e;
    a[] += b[] + 2 * c[] / (d[] * e[]);
}

This, for example, would call _arraySliceSliceExpMulSliceSliceMulDivAddSliceAddass_f.
You can compile with -v or mark foo as nothrow to see the name.
You'll have to add it to the compiler table to make use of it.

What we really need, and I think the prototype proves this, is a proper generic interface.
Also, the SIMD version is much less compile work than the asm mixin string solution, so it becomes feasible to always instantiate a template instead of using a precompiled version. This would allow -march tuning and inlining.

I'm not sure what the interface should look like, but something along these lines could work.

void arrayOp(T, string[N] ops, size_t N, Args...)(T[] result, in ref Args args) if (Args.length == N + 1);

result[] = args[0] ops[0] args[1] ops[1] ... args[N];

@John-Colvin
Contributor Author

I'm familiar with the function names the compiler creates; what I'm struggling to imagine is a good way of implementing the actual templates in druntime concisely for the general case: how to tell the difference between b / (c + a) and (b / c) + a without having to code in an understanding of RPN or similar.

@WalterBright
Member

Are you familiar with "expression templates" in C++? The same technique works in D.

@MartinNowak
Member

How to tell the difference between b / (c + a) and (b / c) + a without having to code in an understanding of RPN or similar.

Yeah, precedence is an issue. RPN sounds interesting but would require heavy template hacking.

Are you familiar with "expression templates" in C++?

Are you suggesting this as compiler => runtime interface?

A pragmatic approach is to let the compiler traverse its expression tree and output a format string containing the parenthesized expression, e.g. "(%0s * (%1s - %2s)) / %4s". This template argument can then be used to generate usable mixin strings.
Intermediate results and optimization would be handled by the compiler again.
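
A minimal CTFE sketch of that expansion step (assuming single-digit operand indices; all names hypothetical):

// turn "(%0s * (%1s - %2s)) / %4s" into per-element mixin code
string expand(string fmt)
{
    string r;
    for (size_t i = 0; i < fmt.length; ++i)
    {
        if (fmt[i] == '%' && i + 2 < fmt.length && fmt[i + 2] == 's')
        {
            r ~= "args[" ~ fmt[i + 1] ~ "][i]";  // operand reference
            i += 2;                              // skip digit and 's'
        }
        else
            r ~= fmt[i];
    }
    return r;
}

static assert(expand("(%0s * (%1s - %2s)) / %4s")
           == "(args[0][i] * (args[1][i] - args[2][i])) / args[4][i]");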

One limitation, though, is that mul/div for integers requires explicit SIMD calls.

@John-Colvin
Contributor Author

GitHub should really have some sort of quoting/threading/reply system.

Are you familiar with "expression templates" in C++? The same technique works in D.

I'm familiar with the concept, but never implemented them myself. It does seem like the most practical approach.

One limitation, though, is that mul/div for integers requires explicit SIMD calls.

They're also not all easy to perform:

Multiplication for 16- and 32-bit integers is easy. 64-bit requires some trickery, but not a great deal (3 multiplies, 2 additions and some shifting/shuffling).
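
For 64-bit, the identity behind the trickery is schoolbook multiplication on 32-bit halves (scalar version, for clarity):

// a * b mod 2^64 from three 32x32 multiplies; the ahi * bhi term
// would only affect bits >= 64, so it is dropped entirely
ulong mulLo64(ulong a, ulong b)
{
    ulong alo = a & 0xFFFF_FFFF, ahi = a >> 32;
    ulong blo = b & 0xFFFF_FFFF, bhi = b >> 32;
    return alo * blo + ((alo * bhi + ahi * blo) << 32);
}

unittest
{
    foreach (x; [2UL, 123_456_789UL, ulong.max])
        foreach (y; [3UL, 0xDEAD_BEEFUL, ulong.max])
            assert(mulLo64(x, y) == x * y);
}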

SIMD integer division doesn't exist in any straightforward way as far as I can tell. ICC has closed-source intrinsics for it, but I have no idea how they work. GCC doesn't manage to produce anything vectorised for division.

@MartinNowak
Member

How to tell the difference between b / (c + a) and (b / c) + a without having to code in an understanding of RPN or similar.
Yeah, precedence is an issue. RPN sounds interesting but would require heavy template hacking.

Not at all that difficult; it's also quite simple to implement on the compiler side.
https://gist.github.com/MartinNowak/5358292
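
The core of the idea fits in a few lines; a CTFE-able sketch (my names, not the gist's):

// rebuild a fully parenthesized expression from an RPN token list
string rpnToExpr(string[] tokens)
{
    string[] stack;
    foreach (t; tokens)
    {
        if (t == "+" || t == "-" || t == "*" || t == "/")
        {
            auto rhs = stack[$ - 1];
            auto lhs = stack[$ - 2];
            stack = stack[0 .. $ - 2];
            stack ~= "(" ~ lhs ~ " " ~ t ~ " " ~ rhs ~ ")";
        }
        else
            stack ~= t;  // operand
    }
    return stack[0];
}

// precedence survives the flat encoding:
static assert(rpnToExpr(["b", "c", "a", "+", "/"]) == "(b / (c + a))");
static assert(rpnToExpr(["b", "c", "/", "a", "+"]) == "((b / c) + a)");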

SIMD integer division doesn't exist in any straightforward way as far as I can tell.

Not enough silicon to waste on. Let's treat it as an afterthought.

@WalterBright
Member

While we wait for a better design strategy, I see no reason not to merge this.

WalterBright added a commit that referenced this pull request Aug 10, 2013
@WalterBright WalterBright merged commit 9b0b6e3 into dlang:master Aug 10, 2013
@John-Colvin
Contributor Author

I never actually got around to doing the last-minute tweaks I should have done. There are still unnecessary runtime SSE2 checks and rather pointless hand-written asm scalar loops in there. Nothing is broken, but I should be able to patch it up properly soon.
