core.simd and gl3n #75

Open
mkalte666 opened this issue Oct 6, 2016 · 14 comments

@mkalte666

Greetings.
I wondered if it would be useful to use https://dlang.org/spec/simd.html in gl3n. This could come in handy if you use gl3n a lot for collision detection or something similar.

Would that make sense to do?

@Dav1dde
Owner

Dav1dde commented Oct 6, 2016

Yes, that would make a lot of sense. Back when I wrote gl3n and SIMD came up, I was waiting on std.simd, but that never happened ...

@mkalte666
Author

I made a fork and will try to see if I can implement that somehow. I wouldn't count on me though, I don't have too much time :/

@mkalte666
Author

mkalte666 commented Oct 6, 2016

So I think there is a problem that will arise if SIMD is used to replace the non-SIMD math: some things might become slower.

performance, hasSimd = true...
Lots of ops took: 0.10047s

vs

performance, hasSimd = false...
Lots of ops took: 0.0750042s

The measured op was:

for (int i = 0; i < 1_000_000; i++) {
    vec4 a = 43223.0;
    vec4 b = 1234.0;
    a+=b;
}

That slowdown becomes even worse if I use automatic array vectorization:
vector[] += r.vector[]
takes 0.15s, so three times as long as when using float4 (in that case).

Now I changed the code a bit:

vec4 a = 43223.0;
vec4 b = 1234.0;
for (int i = 0; i < 1_000_000; i++) {
    a+=b;
}

That results in the expected (even though tiny) speedup.

performance, hasSimd = true...
Lots of ops took: 0.0097749s

vs

performance, hasSimd = false...
Lots of ops took: 0.0159956s

These differences exist because the vectors need to be loaded into the SIMD registers first. So operations on the same set of vectors will speed up a lot, while general use will slow down a lot.
So I think that, to implement this, there would need to be a separate set of functions that utilize it, because otherwise it would just be a slowdown.

@Dav1dde
Owner

Dav1dde commented Oct 6, 2016

Having a separate struct for Vector/Matrix/Quaternion could make sense, depending on how different it is from the code right now; otherwise just a flag passed to the constructor of the structs, making it possible to have both versions.

Code which wants to accept both versions of vectors needs to use something like foo(T)(T vec) if (some_vector!T); that's why I want compile-time interfaces ...
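
A minimal sketch of such a constraint (Vector and is_some_vector here are hypothetical stand-ins, not gl3n's actual API):

struct Vector(T, int N, bool useSimd)
{
    T[N] data;
}

// true for both the plain and the SIMD-backed variant
enum is_some_vector(T) = is(T : Vector!(U, N, S), U, int N, bool S);

void foo(T)(T vec) if (is_some_vector!T)
{
    // shared code path for both variants
}

void main()
{
    foo(Vector!(float, 4, false)());
    foo(Vector!(float, 4, true)());
}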

@mkalte666
Author

mkalte666 commented Oct 8, 2016

Having a separate struct for Vector/Matrix/Quaternion could make sense, depending on how different it is from the code right now; otherwise just a flag passed to the constructor of the structs, making it possible to have both versions.

I tried to integrate it into the normal vector types via a template argument, and in itself that works fine. However, I am not able to notice any speedup at all (tbh I only implemented the basic operations and tested them) - and I suspect the implementation of both the core.simd.Vector types and the __simd magic causes a lot of copying around of data, which is kind of logical because the instructions run on the xmm/ymm registers.

Vector!(float,4,true) a,b,c;
a = b = c = 1234.23234;
for(long i = 0; i < 1_000_000; i++) {
    a+=b;
    a+=c;
    a+=a;
}

This should, in my understanding, run faster if a, b, c use core.simd.Vector!(float[4]) - however, it always ran slower than I expected.

It would be nice to work with data within these registers like you can with the Intel/C++ compiler intrinsics (the _mm_add_ss-like functions that take __m128 and __m256 types). So I'd go so far as to separate the normal vector/matrix types from the SIMD acceleration completely. You would then do something like:

vec4 a = vec4(123, 434, 124, 123);
vec4 b = vec4(434, 342, 323, 434);
simdVec!vec4 areg = a;
simdVec!vec4 breg = b;
for (int i = 0; i < 1_000_000; i++) {
    areg += breg; /// ADDPS
    breg += areg; /// ADDPS
}
float magnitude = areg.magnitude; /// can be done with DPPS and SQRTPS
a = areg.toVec();
b = breg.toVec();

The main difference would be that the SIMD type (which ideally would map directly to a media register) allows no direct access to its memory, to avoid any copying.
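
A rough sketch of what such a wrapper could look like (simdVec is hypothetical, assuming a 4-component float vector with gl3n's .vector member):

import core.simd;

// Hypothetical register-resident wrapper: the payload stays private and no
// pointer or lvalue element access is exposed, so the optimizer is free to
// keep the value in an xmm register across whole expressions.
struct simdVec(V)
{
    private float4 data;

    this(V v)    // one explicit load in
    {
        foreach (i; 0 .. 4)
            data.array[i] = v.vector[i];
    }

    void opOpAssign(string op : "+")(simdVec rhs)
    {
        data += rhs.data;    // a single ADDPS
    }

    V toVec()    // one explicit store out
    {
        V result;
        foreach (i; 0 .. 4)
            result.vector[i] = data.array[i];
        return result;
    }
}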

Also, I'm kind of missing AVX support (256-bit ymm registers, double[4] stuff) in core.simd.__simd. Hmm.

Am I thinking right or am I blubbering complete bullshit? O.o

EDIT: I might have found the reason. This code:

import core.simd;

void doStuff()
{
    float4 x = [1.0, 0.4, 1234.0, 124.0];
    float4 y = [1.0, 0.4, 1234.0, 124.0];
    float4 z = [1.0, 0.4, 1234.0, 123.0];
    for (long i = 0; i < 1_000_000; i++) {
        x += y;
        x += z;
        z += x;
    }
}

This can be split into two parts. The first one is the assignment:

movaps xmm0,XMMWORD PTR [rip+0x0]        # f <void example.doStuff()+0xf>
movaps XMMWORD PTR [rbp-0x40],xmm0
movaps xmm1,XMMWORD PTR [rip+0x0]        # 1a <void example.doStuff()+0x1a>
movaps XMMWORD PTR [rbp-0x30],xmm1
movaps xmm2,XMMWORD PTR [rip+0x0]        # 25 <void example.doStuff()+0x25>
movaps XMMWORD PTR [rbp-0x20],xmm2

Well, OK, it also copies the values onto the stack? Meh. Now the math in the loop:

 movaps xmm3,XMMWORD PTR [rbp-0x30]
 movaps xmm4,XMMWORD PTR [rbp-0x40]
 addps  xmm4,xmm3
 movaps XMMWORD PTR [rbp-0x40],xmm4
 movaps xmm0,XMMWORD PTR [rbp-0x20]
 movaps xmm1,XMMWORD PTR [rbp-0x40]
 addps  xmm1,xmm0
 movaps XMMWORD PTR [rbp-0x40],xmm1
 movaps xmm2,XMMWORD PTR [rbp-0x40]
 movaps xmm3,XMMWORD PTR [rbp-0x20]
 addps  xmm3,xmm2
 movaps XMMWORD PTR [rbp-0x20],xmm3

OUCH! This should simply be:

addps xmm0,xmm1
addps xmm0,xmm2
addps xmm2,xmm0

I guess I should report that as a compiler bug? https://issues.dlang.org/show_bug.cgi?id=16605

@Dav1dde
Owner

Dav1dde commented Oct 8, 2016

Thanks for looking into all of this.

I can't really help you here since my knowledge of SSE/SIMD instructions is very limited. You might want to ask in #D on freenode; there are some very smart people with compiler insight who can probably help you in a timely manner.

@mkalte666
Author

No problem, I enjoy this kind of stuff :)

I'm gonna head there, because I'm still not sure if my knowledge of SSE/SIMD is enough to come to the right conclusions. Let's see where this is headed!

@mkalte666
Author

It was me who was the fool! "-release" != "-O -release -boundscheck=off".
Now that looks like something!

Running ./gl3nspeed 
Doing tests with SIMD=false and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.140215s!
Doing tests with SIMD=true and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.050686s!

@Dav1dde
Owner

Dav1dde commented Oct 9, 2016

That's almost 3 times faster!

@mkalte666
Author

mkalte666 commented Oct 9, 2016

That's almost 3 times faster!

It gets better!

Enter loop count
10000000
Doing tests with SIMD=false and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.139876s!
Speed of the magnitude operation on float |vec4|
took: 5.30766s! 
Doing tests with SIMD=true and LC=10000000
Speed of the += operator on float (vec4+=vec4)
took: 0.0507099s!
Speed of the magnitude operation on float |vec4|
took: 1.02721s! 

I'm gonna clean this up a bit and push it to my fork so you can take a look at it - to see if it fits the guidelines/the way you want stuff to be done for gl3n.

@mkalte666
Author

mkalte666 commented Oct 9, 2016

Here are my changes so far: master...mkalte666:master

I know that this is missing tests etc. I will write those as soon as I can, I guess.

The speed test tool I used is https://github.com/mkalte666/gl3nspeed

You have to compile gl3n/gl3nspeed with "DFLAGS="-release -O -boundscheck=off" dub". Or tell me how I can get dub to use -O xD
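
For reference, one way to get dub to pass those flags is a custom build type in dub.json (a sketch; the build type name release-fast is made up):

{
    "name": "gl3nspeed",
    "buildTypes": {
        "release-fast": {
            "buildOptions": ["releaseMode", "optimize", "noBoundsCheck"]
        }
    }
}

It should then be selectable with dub build --build=release-fast.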

@Dav1dde
Owner

Dav1dde commented Oct 9, 2016

Looks good, minor style things but in general I like how it is done!

You gonna look into matrices as well?

@mkalte666
Author

mkalte666 commented Oct 9, 2016

Looks good, minor style things but in general I like how it is done!

Thanks, I'm trying ^^

You gonna look into matrices as well?

If I find the time. I'm not sure how well that can be done and what instructions already exist that could help out. Also, I still want to look into #68, and I guess that could be combined.

Thinking about speed (and not about time management on my side), this would however be a massive improvement: "4x4 matrix multiplication is 64 multiplications and 48 additions. Using SSE this can be reduced to 16 multiplications and 12 additions (and 16 broadcasts)" http://stackoverflow.com/questions/18499971/efficient-4x4-matrix-multiplication-c-vs-assembly
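
Sketched with core.simd (column-major storage assumed; mul4x4 and splat are made-up helper names, not gl3n code), the trick from that answer looks roughly like this:

import core.simd;

// Broadcast one scalar into all four lanes.
float4 splat(float x)
{
    float4 v = x;
    return v;
}

// result = a * b for column-major 4x4 matrices, one float4 per column.
// Per result column: 4 broadcasts, 4 MULPS, 3 ADDPS - 16/16/12 in total,
// matching the counts quoted above.
void mul4x4(ref const float4[4] a, ref const float4[4] b, ref float4[4] result)
{
    foreach (j; 0 .. 4)
    {
        float4 col = a[0] * splat(b[j].array[0]);
        col += a[1] * splat(b[j].array[1]);
        col += a[2] * splat(b[j].array[2]);
        col += a[3] * splat(b[j].array[3]);
        result[j] = col;
    }
}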

@Dav1dde mentioned this issue Oct 10, 2016
@mkalte666
Author

mkalte666 commented Oct 10, 2016

One thing I wonder is if operations with scalars (vec3 * float etc.) should be vectorized. While the operation itself would speed up, as long as the numerical value is not constant, the resulting code would almost always be slower because the scalar would have to be loaded into a vector beforehand.

The speedy way of doing a multiplication (or any operation) with a scalar would be to hold a (const?) vector somewhere and then do the operations. So doing

Vector!(float,4,true) scalar = 4.0;
Vector!(float,4,true) foo = 1234.01234;
Vector!(float,4,true) bar = 13.2434;
foo *= scalar;
bar *= scalar; 
// ..... probably do this many times 

would almost always result in faster code than doing

foo *= 4.0;
bar *= 4.0; 

because the operator doesn't know if it operates on a constant value or a variable. If there is a way to separate them (detecting whether a value is known at compile time), it could be done though - I don't know how.
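
One conceivable separation (a sketch, not gl3n code: Vec4 and scale are made-up names) is to take compile-time-known scalars as a template value parameter, so the splat can be constant-folded, while runtime scalars keep the plain overload:

import core.simd;

struct Vec4
{
    float4 data;

    // Runtime scalar: the broadcast happens on every call.
    void opOpAssign(string op : "*")(float s)
    {
        float4 sv = s;
        data *= sv;
    }

    // Compile-time scalar: s is a template value parameter, so the
    // optimizer can fold the splat into a constant vector load.
    void scale(float s)()
    {
        float4 sv = s;
        data *= sv;
    }
}

void main()
{
    Vec4 v;
    v *= 4.0f;        // runtime path
    v.scale!(4.0f)(); // compile-time path
}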
