Faster minimum/maximum/extrema for Float64/32
#43725
Has this been sufficiently
It would be cool to get SIMD for unions with singletons (which should be doable fairly well). It doesn't have to be in this PR, though.
586ec02 to bf461c3
base/reduce.jl (outdated)

    v, i = if rest < 8
        ini, first
    elseif rest < 64
        @inline simd_kernel(Val(4), ini)
    elseif rest < 128
        simd_kernel(Val(8), ini)
    else
        simd_kernel(Val(16), ini)
    end
We need to test all branches
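One way to cover every branch is to pick input lengths that straddle the thresholds 8, 64 and 128 in the snippet above. The following is a minimal sketch, not the PR's actual test code: `reductions_agree` and `sizes` are hypothetical names, and it assumes `rest` tracks the input length closely enough that these sizes reach each branch.

```julia
using Test

# Hypothetical helper (not part of Base): compare the reductions against a
# plain left fold for one element type and one length.
function reductions_agree(::Type{T}, n::Int) where {T}
    A = rand(T, n)
    lo, hi = foldl(min, A), foldl(max, A)
    ok = minimum(A) == lo && maximum(A) == hi && extrema(A) == (lo, hi)
    # A NaN anywhere in the input must propagate to the result.
    for k in unique((1, cld(n, 2), n))
        B = copy(A)
        B[k] = T(NaN)
        ok &= isnan(minimum(B)) && isnan(maximum(B))
    end
    return ok
end

# Lengths just below, at, and just above each assumed threshold.
sizes = (1, 7, 8, 9, 63, 64, 65, 127, 128, 129, 1000)

@testset "minimum/maximum/extrema, $T" for T in (Float32, Float64)
    for n in sizes
        @test reductions_agree(T, n)
    end
end
```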
I would generally expect this much unrolling to benefit microbenchmarks, but harm application performance
The TTFP does seem to be a problem. The first `minimum` call takes 0.3 s,
while on master it takes about 0.05 s...
If we want to make full use of SIMD, this unroll size seems unavoidable.
(AVX2 should be able to perform the NaN check on 16 `Float32` values using only 1 `vcmpps`.)
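For reference, first-call latency of this kind can be estimated roughly as follows. This is only a sketch: the array size is an arbitrary choice, and if the method is already compiled into the system image the first call may not include any compilation at all.

```julia
A = rand(Float32, 4096)

# In a fresh session the first call may include compilation time.
t_first = @elapsed minimum(A)

# The second call reuses the compiled method, so it measures run time only.
t_second = @elapsed minimum(A)

println("first call: ", t_first, " s, second call: ", t_second, " s")
```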
It's not just TTFP. Large binary size is bad for the instruction cache usage. I imagine it's hard to measure this effect though.
62a1824 to 9a44f7e
6b5e2ab to e43c44b
2:extrema:9.6ns minimum:5.0ns
4:extrema:14.3ns minimum:7.8ns
8:extrema:24.2ns minimum:13.2ns
16:extrema:33.9ns minimum:18.6ns
32:extrema:37.4ns minimum:20.6ns
64:extrema:45.3ns minimum:24.1ns
128:extrema:55.3ns minimum:33.0ns
256:extrema:74.9ns minimum:47.2ns
512:extrema:115.2ns minimum:75.6ns
1024:extrema:195.6ns minimum:132.2ns
2048:extrema:358.3ns minimum:247.3ns
4096:extrema:710.5ns minimum:492.3ns
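Output in this shape could be produced by a loop like the one below. It is a rough sketch under two assumptions: that the leading number in each row is the array length, and that `Float64` inputs were used. `avg_ns` is a hypothetical hand-rolled timer, not what was used for the numbers above; a dedicated benchmarking package would give more reliable results.

```julia
# Hypothetical helper: average wall time per call, in nanoseconds.
function avg_ns(f, A; reps = 10_000)
    f(A)                      # warm-up call so compilation is not timed
    sink = zero(eltype(A))    # accumulate results so the calls are not elided
    t0 = time_ns()
    for _ in 1:reps
        sink += first(f(A))   # `first` works for both a number and a tuple
    end
    elapsed = time_ns() - t0
    return elapsed / reps, sink
end

for n in 2 .^ (1:12)          # 2, 4, ..., 4096
    A = rand(Float64, n)
    t_ext, _ = avg_ns(extrema, A)
    t_min, _ = avg_ns(minimum, A)
    println(n, ":extrema:", round(t_ext; digits = 1),
            "ns minimum:", round(t_min; digits = 1), "ns")
end
```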
ac1fc8e to a7aef19
a7aef19 to 67edbed
Since #43573 has not landed, I tried to use
move `mapreduce_impl` to the end of `reduce.jl` Update reduce.jl
typo fix Update reduce.jl
1. Document `_fast`.
2. Fix the unroll size to 16.
3. If the non-SIMD region's length is > 8, use 9 elements to initialize.
Update reduce.jl
And then we can use a `for` loop and let LLVM do the unrolling. TTFP is reduced by about 1/2.
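The idea of a fixed-size inner loop that the compiler may unroll can be illustrated as follows. This is a minimal sketch, not the PR's implementation: `fast_min` and `sketch_minimum` are hypothetical names, the block size of 16 follows the commit message above, and the sketch does not order `-0.0` before `0.0` the way `Base.minimum` does. Whether LLVM actually unrolls or vectorizes the loop depends on the Julia version and the target CPU.

```julia
# Hypothetical helper: NaN-unaware minimum of two floats.
fast_min(x, y) = ifelse(x < y, x, y)

function sketch_minimum(A::Vector{T}) where {T<:Union{Float32,Float64}}
    isempty(A) && throw(ArgumentError("reducing over an empty collection is not allowed"))
    acc = A[1]
    anynan = false
    n = length(A)
    i = 1
    # Main part: blocks of 16 elements. The inner loop has a fixed trip
    # count, which leaves the unrolling decision to the compiler.
    @inbounds while i + 15 <= n
        for j in i:i+15
            x = A[j]
            acc = fast_min(acc, x)
            anynan |= isnan(x)   # track NaN separately from the comparison
        end
        i += 16
    end
    # Tail: the remaining (fewer than 16) elements.
    @inbounds for j in i:n
        x = A[j]
        acc = fast_min(acc, x)
        anynan |= isnan(x)
    end
    return anynan ? T(NaN) : acc
end
```

Tracking NaN in a separate flag keeps the comparison itself branch-free; the NaN result is substituted once at the end.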
67edbed to add7a95
Separated from #43604.
This PR aims to make `minimum`/`maximum`/`extrema` faster through better vectorization. The final goal is to close #31442.
Float benchmark (`Float32`/`Float64`): this PR vs. 1.7.1
Integer benchmark (`Int64`): this PR vs. 1.7.1
Note: Since #43604 is still ongoing, the `extrema`-related optimization is blocked. I am opening this PR just for early review.