This repository was archived by the owner on Oct 12, 2022. It is now read-only.

upgraded int array ops #471

Merged
merged 4 commits into from
Aug 10, 2013

Conversation

John-Colvin
Contributor

Things in this pull, all relating to arrayint.d:

  • ported everything to x64
  • fixed/reimplemented all the multiplication asm, now fully functioning.
  • upgraded unittests to check correct signed behaviour.
  • upgraded unittests to get 100% coverage where appropriate, excepting the little helper function redirects.

On my machine this provides dramatic performance improvements for 64-bit int array ops, although I haven't done a detailed profile as yet.

The reimplemented multiplication asm now requires SSE4.1 (checked at runtime, with a fallback). This shouldn't be a problem on Intel, as SSE4.1 was introduced with Penryn (2007/2008). Slightly older AMD processors will lose out, however, as AMD only introduced SSE4.1 in late 2011 with the Bulldozer architecture.

Further direction if this is accepted:

  • take a similar approach for other types.
  • make use of more recent tech such as AVX.
  • add some amd specific code.

Further down the line:

  • ARM?

@John-Colvin
Contributor Author

Just noticed a bug for uint > uint.max/2

Will fix in the next few days.

[not valid, see below]

@WalterBright
Member

Awesome, thanks for doing this.

As for taking advantage of new architectures and relying on SSE4.1: please add a runtime check for which instruction set is present.

@John-Colvin
Contributor Author

Particular instruction set extensions (e.g. SSE4.1) are checked for at runtime using core.cpuid properties.

FYI: applicable extensions are enabled separately in the unittests and tested (limited by the processor on the testing machine, of course). Currently there is no overlap of feature sets that aren't always present together (e.g. if SSE4.1 is present then so is SSE2), so this is sufficient to get 100% test coverage without problems, for now at least.

@John-Colvin
Contributor Author

I'm a little uncomfortable with all the functions returning int[]; I presume the result is just implicitly cast to uint[] as required?
It feels like a bit of a subversion of the type system to return large negative ints instead of large positive uints (> uint.max/2).
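
For what it's worth, the reinterpretation is at least lossless in both directions, since the 32-bit patterns are identical:

unittest
{
    uint u = uint.max - 1;      // 4294967294
    int  i = cast(int) u;       // -2: same 32 bits, reinterpreted
    assert(cast(uint) i == u);  // converting back loses nothing
}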

@John-Colvin
Contributor Author

OK, there is no bug; I'd just got my understanding of the pmulld instruction wrong. It keeps only the low 32 bits of each lane's 32x32 product, and those are the same for signed and unsigned operands.
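
A one-line check of that fact:

unittest
{
    // the low 32 bits of a 32x32 multiply don't depend on signedness,
    // which is what pmulld computes per lane
    uint x = 0x9000_0000;  // well above uint.max / 2
    uint y = 3;
    assert(cast(uint)(cast(int) x * cast(int) y) == x * y);
}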

From my perspective this is ready for merging, let me know if there are any problems.

@@ -207,6 +223,127 @@ body
}
}
}
version (D_InlineAsm_X86_64)
{
if (sse2 && a.length >= 8)
Member

x86-64 implies SSE2 support, so there is no need for a runtime test, nor for the MMX and scalar fallbacks.

Contributor Author

Those aren't just fallbacks for lack of SSE2 support; they also provide support for smaller arrays.

In x64 I will remove the runtime check for SSE2 and re-implement the 4 <= a.length < 8 case with a single-register SSE2 routine (no loop), the same as already done for the multiplication code.
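
Something along these lines (a sketch only, using DMD-style x86-64 inline asm; the function and variable names are made up):

// handle exactly one 4-int chunk with a single XMM register, no loop
void addOneChunk(int* aptr, const(int)* bptr)
{
    asm
    {
        mov RSI, bptr;       // operand source
        mov RDI, aptr;       // destination
        movdqu XMM0, [RSI];  // unaligned load of 4 ints
        movdqu XMM1, [RDI];
        paddd XMM1, XMM0;    // packed 32-bit add (SSE2)
        movdqu [RDI], XMM1;  // store the result back
    }
}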

@MartinNowak
Member

Thanks for your effort. I wish we had already gotten rid of all those array* modules, but I haven't managed to finish this yet: https://gist.github.com/MartinNowak/1235470

  • AFAIK x64 loop alignment should be 16-byte.
  • As said above, SSE2 is available on all x64 CPUs.
  • Have you benchmarked using 8 registers per loop?

mov bptr, RAX;
}
}
else if (a.length >= 2) //should probably be left to the compiler?
Member

Yeah, please just drop that part and leave it to the while loop at the end.

Contributor Author

will do.

@John-Colvin
Contributor Author

"I wished we already got rid of all those array* modules but I didn't got to finish this yet, https://gist.github.com/dawgfoto/1235470."
That looks very interesting. Are you planning to restart work on that any time soon?

"AFAIK x64 loop alignment should be 16-byte."
Yes, sorry my mistake.

"Have you benchmarked using 8 registers per loop?"
Not yet; I benchmarked 4 registers and couldn't see any significant speedup (< 1%, probably not valid). Do you have any advice for how to do these sorts of micro-benchmarks? On my machine the results are very inconsistent, presumably due to interference from other processes.
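
(A common noise-damping approach, sketched here with current Phobos names, is to take the best of many runs rather than an average, since interference from other processes only ever makes a run slower:)

import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writeln;

void main()
{
    auto a = new int[](1 << 16);
    auto b = new int[](1 << 16);

    long best = long.max;
    foreach (_; 0 .. 1000)
    {
        auto sw = StopWatch(AutoStart.yes);
        a[] += b[];
        sw.stop();
        auto ns = sw.peek.total!"nsecs";
        if (ns < best)
            best = ns;  // keep the least-disturbed run
    }
    writeln("best: ", best, " ns");
}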

@MartinNowak
Member

Have you benchmarked using 8 registers per loop?

When I wrote the arrayOps, the following pattern turned out to be a good performance compromise for short and long arrays.
The important part is that there are two loops: one that loads and stores 4 registers per iteration and another that only uses a single register.
That way you can still use XMM operations for fewer than 16 ints.
I also wonder whether it would make more sense to use SIMD intrinsics for this and leave loop unrolling to the compiler.

size_t arrayOpImpl(alias unit, string op, uint alignMask, T)
(size_t len, T* a, T* b, T* c) if (unit.mangleof != nativeUnit!().mangleof)
in
{
    assert(!(alignMask & 0b100) || !(cast(size_t)a & (unit.regsz - 1)));
    assert(!(alignMask & 0b010) || !(cast(size_t)b & (unit.regsz - 1)));
    assert(!(alignMask & 0b001) || !(cast(size_t)c & (unit.regsz - 1)));
}
body
{
    enum mova = (alignMask & 0b100) ? unit.inst!(T).amov : unit.inst!(T).umov;
    enum movb = (alignMask & 0b010) ? unit.inst!(T).amov : unit.inst!(T).umov;
    enum movc = (alignMask & 0b001) ? unit.inst!(T).amov : unit.inst!(T).umov;
    enum opinst = unit.inst!(T).optab[opStringToEnum(op)];
    alias unit.MMRegs MM;
    enum regsz = toString(unit.regsz);

    auto n = len;

    mixin(`
        asm {
            mov `~GP.DI~`, a; // destination
            mov `~GP.SI~`, b; // 1st oprnd source
            mov `~GP.DX~`, c; // 2nd oprnd source
            mov `~GP.CX~`, n; // loop counter

            align `~toString(2 * size_t.sizeof)~`;
        LloopA:
            cmp `~GP.CX~`, `~toString(4 * unit.regsz / T.sizeof)~`;
            jb LloopB;
            `~movb~` `~MM._0~`, [`~GP.SI~` + 0*`~regsz~`];
            `~movb~` `~MM._1~`, [`~GP.SI~` + 1*`~regsz~`];
            `~movb~` `~MM._2~`, [`~GP.SI~` + 2*`~regsz~`];
            `~movb~` `~MM._3~`, [`~GP.SI~` + 3*`~regsz~`];
            add `~GP.SI~`, 4*`~regsz~`;

            `~movc~` `~MM._4~`, [`~GP.DX~` + 0*`~regsz~`];
            `~movc~` `~MM._5~`, [`~GP.DX~` + 1*`~regsz~`];
            `~movc~` `~MM._6~`, [`~GP.DX~` + 2*`~regsz~`];
            `~movc~` `~MM._7~`, [`~GP.DX~` + 3*`~regsz~`];
            add `~GP.DX~`, 4*`~regsz~`;

            `~opinst~` `~MM._0~`, `~MM._4~`;
            `~opinst~` `~MM._1~`, `~MM._5~`;
            `~opinst~` `~MM._2~`, `~MM._6~`;
            `~opinst~` `~MM._3~`, `~MM._7~`;

            `~mova~`[`~GP.DI~` + 0*`~regsz~`], `~MM._0~`;
            `~mova~`[`~GP.DI~` + 1*`~regsz~`], `~MM._1~`;
            `~mova~`[`~GP.DI~` + 2*`~regsz~`], `~MM._2~`;
            `~mova~`[`~GP.DI~` + 3*`~regsz~`], `~MM._3~`;
            add `~GP.DI~`, 4*`~regsz~`;

            sub `~GP.CX~`, `~toString(4 * unit.regsz / T.sizeof)~`;
            jmp LloopA;

            align `~toString(2 * size_t.sizeof)~`;
        LloopB:
            cmp `~GP.CX~`, `~toString(unit.regsz / T.sizeof)~`;
            jb Ldone;

            `~movb~` `~MM._0~`, [`~GP.SI~`];
            add `~GP.SI~`, `~regsz~`;

            `~movc~` `~MM._4~`, [`~GP.DX~`];
            add `~GP.DX~`, `~regsz~`;

            `~opinst~` `~MM._0~`, `~MM._4~`;

            `~mova~`[`~GP.DI~`], `~MM._0~`;
            add `~GP.DI~`, `~regsz~`;

            sub `~GP.CX~`, `~toString(unit.regsz / T.sizeof)~`;
            jmp LloopB;

        Ldone:
            `~unit.loopCleanup~`
            mov n, `~GP.CX~`;
        }
    `);

    return len - n;
}

@John-Colvin
Contributor Author

I'm working on a version of Martin's code that will hopefully generate the correct code replacing all the array* modules.

The disadvantage of anything based on the current design is that expressions like this one:

a[] *= b[] + c[];

will still completely miss all these hand-optimised loops.

Any suggestions for how we might deal with this? It seems a bit rubbish to present these fast array ops and then give up as soon as someone writes more than one op in an expression.

@John-Colvin
Contributor Author

The more I look into this, the more I feel like it doesn't belong here at all. By implementing these functions so far from the backend, we don't have access to even basic optimisations like expression simplification. Even a[] = b[] - b[] doesn't get spotted. Getting proper behaviour out of multiple-op expressions would require implementing what is effectively a small, very specialised backend inside druntime, which just seems wrong.

On the other side of the coin, I'm getting much better performance (in places) with my hand written loops than gdc or ldc manage with their auto-vectorizing, so leaving things to currently existing backends also seems a bad choice.

Neither way seems good :(

@MartinNowak
Member

I think we should drop the asm code, because using vector types generates much better code and is much easier to write.
Until now I had never used the SIMD support, but I just wrote an almost complete implementation of all the rt.array*.d modules in 100 LOC.
We should seriously consider letting the compiler instantiate a template and getting rid of the fat compiler<=>runtime interface.

https://gist.github.com/MartinNowak/5338859
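
The flavour of it, in a minimal sketch with core.simd vector types (not the gist's actual code; alignment handling elided):

import core.simd : int4;

// a[] += b[], 4 ints at a time; the compiler picks the instructions
void addAssign(int[] a, const(int)[] b)
{
    size_t i = 0;
    for (; i + 4 <= a.length; i += 4)
    {
        // note: these loads assume suitably aligned data
        int4 va = *cast(int4*)(a.ptr + i);
        int4 vb = *cast(const(int4)*)(b.ptr + i);
        *cast(int4*)(a.ptr + i) = va + vb;
    }
    for (; i < a.length; ++i)  // scalar tail
        a[i] += b[i];
}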

@ibuclaw
Member

ibuclaw commented Apr 8, 2013

On the other side of the coin, I'm getting much better performance (in places) with my hand written loops than gdc or ldc manage with their auto-vectorizing, so leaving things to currently existing backends also seems a bad choice.

Actually (I have tested this before), the code generated by the compiler is so suboptimal that auto-vectorization is not even possible. In fact, it's about 3x slower than the generic array op implementations in druntime... And that's bad, given that you can't inline the druntime library calls.

@John-Colvin
Contributor Author

Actually (I have tested this before), the code generated by the compiler is so suboptimal that auto-vectorization is not even possible. In fact, it's about 3x slower than the generic array op implementations in druntime... And that's bad, given that you can't inline the druntime library calls.

GDC does manage auto-vectorisation on array ops. Or am I missing your point?

@John-Colvin
Contributor Author

I think we should drop the asm code, because using vector types generates much better code and is much easier to write.
Until now I had never used the SIMD support, but I just wrote an almost complete implementation of all the rt.array*.d modules in 100 LOC.
We should seriously consider letting the compiler instantiate a template and getting rid of the fat compiler<=>runtime interface.

https://gist.github.com/MartinNowak/5338859

That is very neat, I agree.
However, the problem of dealing with arbitrary array expressions still remains, does it not?

@MartinNowak
Member

The disadvantage of anything based on the current design is that expressions like this one:

a[] *= b[] + c[];
will still completely miss all these hand-optimised loops.

Actually, arbitrarily complex array expressions are supported.
The problem is that we can't compile everything into druntime, and that it becomes hard to maintain.

void foo()
{
    float[] a, b, c, d, e;
    a[] += b[] + 2 * c[] / (d[] * e[]);
}

This, for example, would call _arraySliceSliceExpMulSliceSliceMulDivAddSliceAddass_f.
You can compile with -v or mark foo as nothrow to see the name.
You'll have to add it to the compiler table to make use of it.

What we really need, and I think the prototype proves this, is a proper generic interface.
Also, the SIMD version is much less compile work than the asm mixin string solution, so it becomes feasible to always instantiate a template instead of using a precompiled version. This would allow -march tuning and inlining.

I'm not sure what the interface should look like, but something along these lines could work.

void arrayOp(T, string[N] ops, size_t N, Args...)(T[] result, in ref Args args) if (Args.length == N + 1);

result[] = args[0] ops[0] args[1] ops[1] ... args[N];

@John-Colvin
Contributor Author

I'm familiar with the function names the compiler creates; what I'm struggling to imagine is a good way of implementing the actual templates in druntime concisely for the general case: how to tell the difference between b / (c + a) and (b / c) + a without having to code in an understanding of RPN or similar.

@WalterBright
Member

Are you familiar with "expression templates" in C++? The same technique works in D.

@MartinNowak
Member

How to tell the difference between b / (c + a) and (b / c) + a without having to code in an understanding of RPN or similar.

Yeah, precedence is an issue. RPN sounds interesting but would require heavy template hacking.

Are you familiar with "expression templates" in C++?

Are you suggesting this as compiler => runtime interface?

A pragmatic approach is to let the compiler traverse its expression tree and output a format string containing the parenthesized expression, e.g. "(%0s * (%1s - %2s)) / %4s". This template argument can then be used to generate usable mixin strings.
Intermediate results and optimization would be handled by the compiler again.
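
A minimal CTFE sketch of that expansion step (assuming single-digit operand indices; all names hypothetical):

// turn "(%0s * (%1s - %2s)) / %4s" into per-element mixin code
string expand(string fmt)
{
    string r;
    for (size_t i = 0; i < fmt.length; ++i)
    {
        if (fmt[i] == '%' && i + 2 < fmt.length && fmt[i + 2] == 's')
        {
            r ~= "args[" ~ fmt[i + 1] ~ "][i]";  // operand reference
            i += 2;                              // skip digit and 's'
        }
        else
            r ~= fmt[i];
    }
    return r;
}

static assert(expand("(%0s * (%1s - %2s)) / %4s")
           == "(args[0][i] * (args[1][i] - args[2][i])) / args[4][i]");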

One limitation, though, is that mul/div for integers requires explicit SIMD calls.

@John-Colvin
Contributor Author

GitHub should really have some sort of quoting/threading/reply system.

Are you familiar with "expression templates" in C++? The same technique works in D.

I'm familiar with the concept, but never implemented them myself. It does seem like the most practical approach.

One limitation, though, is that mul/div for integers requires explicit SIMD calls.

They're also not all easy to perform:

Multiplication for 16- and 32-bit integers is easy. 64-bit requires some trickery, but not a great deal (3 multiplies, 2 additions and some shifting/shuffling).
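
For 64-bit, the identity behind the trickery is schoolbook multiplication on 32-bit halves (scalar version, for clarity):

// a * b mod 2^64 from three 32x32 multiplies; the ahi * bhi term
// would only affect bits >= 64, so it is dropped entirely
ulong mulLo64(ulong a, ulong b)
{
    ulong alo = a & 0xFFFF_FFFF, ahi = a >> 32;
    ulong blo = b & 0xFFFF_FFFF, bhi = b >> 32;
    return alo * blo + ((alo * bhi + ahi * blo) << 32);
}

unittest
{
    foreach (x; [2UL, 123_456_789UL, ulong.max])
        foreach (y; [3UL, 0xDEAD_BEEFUL, ulong.max])
            assert(mulLo64(x, y) == x * y);
}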

SIMD integer division doesn't exist in any straightforward way as far as I can tell. ICC has closed-source intrinsics for it, but I have no idea how they work. GCC doesn't manage to produce anything vectorised for division.

@MartinNowak
Member

How to tell the difference between b / (c + a) and (b / c) + a without having to code in an understanding of RPN or similar.
Yeah, precedence is an issue. RPN sounds interesting but would require heavy template hacking.

Not at all that difficult; it's also quite simple to implement on the compiler side.
https://gist.github.com/MartinNowak/5358292
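
The core of the idea fits in a few lines; a CTFE-able sketch (my names, not the gist's):

// rebuild a fully parenthesized expression from an RPN token list
string rpnToExpr(string[] tokens)
{
    string[] stack;
    foreach (t; tokens)
    {
        if (t == "+" || t == "-" || t == "*" || t == "/")
        {
            auto rhs = stack[$ - 1];
            auto lhs = stack[$ - 2];
            stack = stack[0 .. $ - 2];
            stack ~= "(" ~ lhs ~ " " ~ t ~ " " ~ rhs ~ ")";
        }
        else
            stack ~= t;  // operand
    }
    return stack[0];
}

// precedence survives the flat encoding:
static assert(rpnToExpr(["b", "c", "a", "+", "/"]) == "(b / (c + a))");
static assert(rpnToExpr(["b", "c", "/", "a", "+"]) == "((b / c) + a)");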

SIMD integer division doesn't exist in any straightforward way as far as I can tell.

Not enough silicon to waste on. Let's treat it as an afterthought.

@WalterBright
Member

While we wait for a better design strategy, I see no reason not to merge this.

WalterBright added a commit that referenced this pull request Aug 10, 2013
@WalterBright WalterBright merged commit 9b0b6e3 into dlang:master Aug 10, 2013
@John-Colvin
Contributor Author

I never actually got around to doing the last-minute tweaks I should have done. There are still unnecessary runtime SSE2 checks and rather pointless hand-written asm scalar loops in there. Nothing is broken, but I should be able to patch it up properly soon.
