Just noticed a bug for `uint > uint.max/2`. Will fix in the next few days. [Not valid, see below.]
Awesome, thanks for doing this. As for taking advantage of new architectures and relying on sse4.1: please add a runtime check for which instruction set is present.
Particular instruction set extensions (e.g. sse4.1) are checked for at runtime using core.cpuid properties. FYI: applicable extensions are enabled separately in the unittests and tested (limited by the processor on the testing machine, of course). Currently every feature set used implies the presence of the ones below it (e.g. if sse4.1 is present then so is sse2), so this is sufficient to get 100% test coverage without problems, for now at least.
I'm a little uncomfortable with all the functions returning int[]; I presume the result is just implicitly cast to uint[] as required?
OK, there is no bug; I had just gotten my understanding of the pmulld instruction wrong. From my perspective this is ready for merging; let me know if there are any problems.
```d
@@ -207,6 +223,127 @@ body
        }
    }
}
version (D_InlineAsm_X86_64)
{
    if (sse2 && a.length >= 8)
```
X86-64 implies sse2 support, so there is no need for a runtime test nor for the mmx and scalar fallbacks.
Those aren't just fallbacks for lack of sse2 support; they also provide support for smaller arrays.

On x86-64 I will remove the runtime check for sse2 and re-implement the 4 <= a.length < 8 case with a single-register sse2 routine (no loop), the same as already done for the multiplication code.
Thanks for your effort. I wish we had already gotten rid of all those array* modules, but I didn't get around to finishing it yet: https://gist.github.com/MartinNowak/1235470
```d
        mov bptr, RAX;
    }
}
else if (a.length >= 2) // should probably be left to the compiler?
```
Yeah please just drop that part and leave it to the while loop at the end.
will do.
> "I wished we already got rid of all those array* modules but I didn't got to finish this yet, https://gist.github.com/dawgfoto/1235470."
>
> "AFAIK x64 loop alignment should be 16-byte."
>
> "Have you benchmarked using 8 registers per loop?"
When I wrote the arrayOps, the following pattern turned out to be a good performance compromise for short and long arrays:

```d
size_t arrayOpImpl(alias unit, string op, uint alignMask, T)
    (size_t len, T* a, T* b, T* c) if (unit.mangleof != nativeUnit!().mangleof)
in
{
    assert(!(alignMask & 0b100) || !(cast(size_t)a & (unit.regsz - 1)));
    assert(!(alignMask & 0b010) || !(cast(size_t)b & (unit.regsz - 1)));
    assert(!(alignMask & 0b001) || !(cast(size_t)c & (unit.regsz - 1)));
}
body
{
    enum mova = (alignMask & 0b100) ? unit.inst!(T).amov : unit.inst!(T).umov;
    enum movb = (alignMask & 0b010) ? unit.inst!(T).amov : unit.inst!(T).umov;
    enum movc = (alignMask & 0b001) ? unit.inst!(T).amov : unit.inst!(T).umov;
    enum opinst = unit.inst!(T).optab[opStringToEnum(op)];
    alias unit.MMRegs MM;
    enum regsz = toString(unit.regsz);
    auto n = len;
    mixin(`
    asm {
        mov `~GP.DI~`, a; // destination
        mov `~GP.SI~`, b; // 1st oprnd source
        mov `~GP.DX~`, c; // 2nd oprnd source
        mov `~GP.CX~`, n; // loop counter
        align `~toString(2 * size_t.sizeof)~`;
    LloopA:
        cmp `~GP.CX~`, `~toString(4 * unit.regsz / T.sizeof)~`;
        jb LloopB;
        `~movb~` `~MM._0~`, [`~GP.SI~` + 0*`~regsz~`];
        `~movb~` `~MM._1~`, [`~GP.SI~` + 1*`~regsz~`];
        `~movb~` `~MM._2~`, [`~GP.SI~` + 2*`~regsz~`];
        `~movb~` `~MM._3~`, [`~GP.SI~` + 3*`~regsz~`];
        add `~GP.SI~`, 4*`~regsz~`;
        `~movc~` `~MM._4~`, [`~GP.DX~` + 0*`~regsz~`];
        `~movc~` `~MM._5~`, [`~GP.DX~` + 1*`~regsz~`];
        `~movc~` `~MM._6~`, [`~GP.DX~` + 2*`~regsz~`];
        `~movc~` `~MM._7~`, [`~GP.DX~` + 3*`~regsz~`];
        add `~GP.DX~`, 4*`~regsz~`;
        `~opinst~` `~MM._0~`, `~MM._4~`;
        `~opinst~` `~MM._1~`, `~MM._5~`;
        `~opinst~` `~MM._2~`, `~MM._6~`;
        `~opinst~` `~MM._3~`, `~MM._7~`;
        `~mova~` [`~GP.DI~` + 0*`~regsz~`], `~MM._0~`;
        `~mova~` [`~GP.DI~` + 1*`~regsz~`], `~MM._1~`;
        `~mova~` [`~GP.DI~` + 2*`~regsz~`], `~MM._2~`;
        `~mova~` [`~GP.DI~` + 3*`~regsz~`], `~MM._3~`;
        add `~GP.DI~`, 4*`~regsz~`;
        sub `~GP.CX~`, `~toString(4 * unit.regsz / T.sizeof)~`;
        jmp LloopA;
        align `~toString(2 * size_t.sizeof)~`;
    LloopB:
        cmp `~GP.CX~`, `~toString(unit.regsz / T.sizeof)~`;
        jb Ldone;
        `~movb~` `~MM._0~`, [`~GP.SI~`];
        add `~GP.SI~`, `~regsz~`;
        `~movc~` `~MM._4~`, [`~GP.DX~`];
        add `~GP.DX~`, `~regsz~`;
        `~opinst~` `~MM._0~`, `~MM._4~`;
        `~mova~` [`~GP.DI~`], `~MM._0~`;
        add `~GP.DI~`, `~regsz~`;
        sub `~GP.CX~`, `~toString(unit.regsz / T.sizeof)~`;
        jmp LloopB;
    Ldone:
        `~unit.loopCleanup~`
        mov n, `~GP.CX~`;
    }
    `);
    return len - n;
}
```
I'm working on a version of Martin's code that will hopefully generate the correct code, replacing all the array* modules. The disadvantage of anything based on the current design is that expressions like `a[] *= b[] + c[];` will still completely miss all these hand-optimised loops. Any suggestions for how we might deal with this? It seems a bit rubbish to present these fast array ops and then give up if someone writes more than one op in an expression.
The more I look into this, the more I feel like it doesn't belong here at all. By implementing these functions so far from the backend we don't have access to even basic optimisations like expression simplification; even `a[] = b[] - b[]` doesn't get spotted. Getting proper behaviour out of multiple-op expressions would require implementing what is effectively a small, very specialised backend inside druntime, which just seems wrong. On the other side of the coin, I'm getting much better performance (in places) with my hand-written loops than gdc or ldc manage with their auto-vectorisation, so leaving things to currently existing backends also seems a bad choice. Neither way seems good :(
I think we should drop the asm code because using vector types generates much better code and is much easier to write. |
Actually (I have tested this before) the code generated by the compiler is so suboptimal that auto-vectorisation is not even possible. In fact, it's about 3x slower than the generic array op implementations in druntime, and that's bad given that you can't inline the druntime library calls.
GDC does manage auto-vectorisation on array ops. Or am I missing your point?
That is very neat, I agree. |
Actually, arbitrarily complex array expressions are supported.

```d
void foo()
{
    float[] a, b, c, d, e;
    a[] += b[] + 2 * c[] / (d[] * e[]);
}
```

This for example would call …. What we really need, and I think the prototype proves this, is a proper generic interface. I'm not sure how the interface should look, but something along this line could work:

```d
void arrayOp(T, size_t N, string[N] ops, Args...)(T[] result, in ref Args args)
    if (Args.length == N + 1);
// result[] = args[0] op[0] args[1] op[1] ... args[n];
```
I'm familiar with the function names the compiler creates; what I'm struggling to imagine is a good way of implementing the actual templates in druntime concisely for the general case, i.e. how to tell the difference between `b / (c + a)` and `(b / c) + a` without having to code in an understanding of RPN or similar.
Are you familiar with "expression templates" in C++? The same technique works in D. |
Yeah, precedence is an issue. RPN sounds interesting but would require heavy template hacking.

Are you suggesting this as the compiler => runtime interface? A pragmatic approach is to let the compiler traverse its expression tree and output a format string which contains the parenthesised expression, e.g. "(%0s * (%1s - %2s)) / %4s". Then this template argument can be used to generate usable mixin strings. One limitation though is that mul/div for integers requires explicit simd calls.
GitHub should really have some sort of quoting/threading/reply system.
I'm familiar with the concept, but never implemented them myself. It does seem like the most practical approach.
They're also not all easy to perform: multiplication for 16- and 32-bit integers is easy; 64-bit requires some trickery, but not a great deal (3 multiplies, 2 additions and some shifting/shuffling). SIMD integer division doesn't exist in any straightforward way as far as I can tell; ICC has closed-source intrinsics for it, but I have no idea how they work, and GCC doesn't manage to produce anything vectorised for division.
Not at all that difficult; it's also quite simple to implement on the compiler side.

Not enough silicon to waste on. Let's treat it as an afterthought.
While we wait for a better design strategy, I see no reason not to merge this. |
I never actually got around to doing the last-minute tweaks I should have done. There are still unnecessary runtime sse2 checks and rather pointless hand-written asm scalar loops in there. Nothing is broken, but I should be able to patch it up properly soon.
Things in this pull, all relating to arrayint.d:

- On my machine this provides dramatic performance improvements for 64-bit int array ops, although I haven't done a detailed profile as yet.
- The reimplemented multiplication asm now requires sse4.1 (checked at runtime, with a fallback). This shouldn't be a problem on Intel, as it was introduced with Penryn (2007/2008). Slightly older AMD processors will lose out, however, as sse4.1 was only introduced by AMD in late 2011 with the Bulldozer architecture.
Further direction if this is accepted:
further down the line: