
Exponential: Dual Xeon Gold 6154 result #11

Open
mratsim opened this issue Dec 26, 2018 · 3 comments

Comments


mratsim commented Dec 26, 2018

cc @Laurae2

Warmup: 0.9938 s, result 224 (displayed to avoid compiler optimizing warmup away)

A - tensor shape: [5000000]
Required number of operations:     5.000 millions
Required bytes:                   20.000 MB
Arithmetic intensity:              0.250 FLOP/byte
Theoretical peak single-core:     86.400 GFLOP/s
Theoretical peak multi:          172.800 GFLOP/s
a[0]: -9.999997138977051

Baseline <math.h>
Collected 300 samples in 5.625 seconds
Average time: 17.733 ms
Stddev  time: 0.054 ms
Min     time: 17.698 ms
Max     time: 18.148 ms
Perf:         0.282 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540005829767324e-05

SSE mathfun
Collected 300 samples in 2.094 seconds
Average time: 6.021 ms
Stddev  time: 0.043 ms
Min     time: 5.976 ms
Max     time: 6.544 ms
Perf:         0.830 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540006193565205e-05

SSE fast_exp_sse (low order polynomial)
Collected 300 samples in 1.211 seconds
Average time: 3.073 ms
Stddev  time: 0.062 ms
Min     time: 3.019 ms
Max     time: 3.734 ms
Perf:         1.627 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.545032061287202e-05

AVX2 fmath
Collected 300 samples in 1.060 seconds
Average time: 2.558 ms
Stddev  time: 0.067 ms
Min     time: 2.473 ms
Max     time: 3.056 ms
Perf:         1.955 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540006193565205e-05

AVX2 FMA Minimax
Collected 300 samples in 1.042 seconds
Average time: 2.492 ms
Stddev  time: 0.076 ms
Min     time: 2.383 ms
Max     time: 3.050 ms
Perf:         2.006 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.539992369245738e-05

AVX2 mathfun
Collected 300 samples in 1.307 seconds
Average time: 3.382 ms
Stddev  time: 0.067 ms
Min     time: 3.275 ms
Max     time: 3.906 ms
Perf:         1.478 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.540006193565205e-05

AVX+FMA Schraudolph-approx
Collected 300 samples in 0.933 seconds
Average time: 2.128 ms
Stddev  time: 0.066 ms
Min     time: 2.062 ms
Max     time: 2.607 ms
Perf:         2.350 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.625692963600159e-05

Bench SIMD Math Prims
Collected 300 samples in 4.826 seconds
Average time: 15.079 ms
Stddev  time: 0.189 ms
Min     time: 14.945 ms
Max     time: 15.937 ms
Perf:         0.332 GEXPOP/s

Display output[0] to make sure it's not optimized away
4.539986548479646e-05

c-blake commented Dec 30, 2018

That's pretty good for exp. For log it is usually helpful to exploit the fact that most CPUs can do the domain/range reduction for you, down to a single factor-of-two interval. Specifically, any IEEE float (single, double, probably even half precision) has an exponent field that gives you the base-2 logarithm up to the mantissa contribution. So, if you are willing to assume an IEEE float representation, you only need a good, fast approximation of log_2 on [1/2, 1) or [1, 2). (Log base e, of course, is just a rescaled log_2.)
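Assuming IEEE-754 binary32 and positive, normal inputs, the exponent-field extraction can be sketched in scalar C like this (the quadratic coefficients below are a rough three-point fit for illustration, not production-grade):

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

// Fast log2 sketch for positive, normal float inputs: the IEEE-754
// exponent field supplies floor(log2(x)); only the mantissa m in [1,2)
// still needs a polynomial approximation.
static float fast_log2(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);             // bit-cast without UB
    int e = (int)((bits >> 23) & 0xFF) - 127;   // unbiased exponent
    bits = (bits & 0x007FFFFFu) | 0x3F800000u;  // force exponent to 0
    float m;                                    // now m is in [1, 2)
    memcpy(&m, &bits, sizeof m);
    // Quadratic fit of log2(m) through m = 1, 1.5, 2 (illustrative only;
    // absolute error on [1, 2) stays below ~0.01):
    float p = (-0.33985f * m + 2.01955f) * m - 1.6797f;
    return (float)e + p;
}
```

A real implementation would vectorize this and use a higher-order minimax fit for the mantissa part, but the exponent extraction is the same.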

In the past I've done the mantissa part with a mixture of Taylor series near 1 and Padé approximants further away. For a single-precision log with about 1e-7 relative error, that got down from roughly 60 cycles on Intel cores of the Nehalem era to roughly 17 cycles on Skylake.
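For concreteness, a minimal scalar sketch of the Padé side (this is the standard [2/1] approximant of ln(1+x), not necessarily the exact coefficients used back then):

```c
#include <math.h>

// [2/1] Pade approximant of ln(1+x) around x = 0:
//   ln(1+x) ~= x*(6 + x) / (6 + 4x)
// It reproduces the Taylor series x - x^2/2 + x^3/3 exactly through the
// x^3 term, but the single division buys much better behavior away from
// x = 0 than the truncated series itself.
static double pade_log1p(double x) {
    return x * (6.0 + x) / (6.0 + 4.0 * x);
}
```

On [-0.25, 0.5] this stays within about 1e-3 of log1p, which is why one would switch from a plain Taylor series to a rational form as the argument moves away from 1.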

There may be a faster approach than Taylor/Padé, but the "IEEE exponent field" trick should apply to whatever you do, as long as you're willing to assume IEEE. I also perused your references and didn't see it mentioned. They seem more exp-focused, but in my experience log is generally the more poorly optimized of the two across stdlibs/vendors.

(edit: clarified performance mention in time and accuracy)


mratsim commented Dec 30, 2018

Log focus is coming, I didn't get to that yet.


c-blake commented Dec 30, 2018

Well, when you do get to log, I suggest these blog posts from an eBay engineer as a starting point. (That material didn't exist back when I was figuring this out for myself. :-) )

https://www.ebayinc.com/stories/blogs/tech/fast-approximate-logarithms-part-i-the-basics/

https://www.ebayinc.com/stories/blogs/tech/fast-approximate-logarithms-part-ii-rounding-error/

https://www.ebayinc.com/stories/blogs/tech/fast-approximate-logarithms-part-iii-the-formulas/

He never really mentions Padé approximants, though his rational approximations are similar. He argues for a [3/4, 6/4) interval. (As mentioned, any [t, 2t) works; whatever gets you a fast, well-approximated "shifted mantissa".) Seems like if you have all this Intel-specific SIMD code, assuming IEEE (even little-endian IEEE) is not much of a stretch.
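Assuming IEEE-754 binary32, a scalar sketch of that shifted reduction might look like this (the [2/1] Padé approximant of ln(1+t) stands in for whichever rational fit one actually tunes):

```c
#include <stdint.h>
#include <string.h>
#include <math.h>

// Sketch of a shifted range reduction for natural log on positive,
// normal floats: pull the exponent out of the IEEE representation, then
// shift the mantissa into [0.75, 1.5) so the polynomial argument m - 1
// stays small on both sides of 1.
static float fast_logf(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    int e = (int)((bits >> 23) & 0xFF) - 127;
    bits = (bits & 0x007FFFFFu) | 0x3F800000u;  // m in [1, 2)
    float m;
    memcpy(&m, &bits, sizeof m);
    if (m >= 1.5f) { m *= 0.5f; e += 1; }       // now m in [0.75, 1.5)
    float t = m - 1.0f;                          // t in [-0.25, 0.5)
    // [2/1] Pade approximant of ln(1+t); illustrative accuracy only.
    float p = t * (6.0f + t) / (6.0f + 4.0f * t);
    return p + (float)e * 0.69314718f;           // add e * ln(2)
}
```

The point of the shift is symmetry: without it, t ranges over [0, 1) and the approximation degrades badly near t = 1; with it, |t| never exceeds 0.5.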

Particularly with log, there can be unavoidable speed/accuracy trade-offs. So you may want a Nim call that accepts a target precision (and conceivably even returns an error estimate), defaulting to 23 bits for float32 and 51 bits for float64, so the call site can trade accuracy for speed when lower precision is acceptable in context.
