
Conversation

@N5N3 N5N3 commented Jan 9, 2022

Separated from #43604.
This PR aims to make minimum/maximum/extrema faster by better vectorization.
The final goal is to close #31442.

Float Benchmark (Float32/Float64):

This PR
julia> a = randn(Float64,4096);

julia> for n = 1:12
           print(1<<n,':')
           t1 = @belapsed Base.mapreduce_impl(Base.ExtremaMap(identity),Base._extrema_rf, $a, 1, (1<<$n))
           t2 = @belapsed Base.mapreduce_impl(identity,min, $a, 1, (1<<$n))
           println("extrema:", round(1e10t1)/10, "ns  minimum:", round(1e10t2)/10,"ns")
       end
2:extrema:8.4ns  minimum:7.5ns
4:extrema:11.1ns  minimum:10.2ns
8:extrema:21.0ns  minimum:15.7ns
16:extrema:21.2ns  minimum:13.7ns
32:extrema:27.8ns  minimum:18.2ns
64:extrema:40.4ns  minimum:27.8ns
128:extrema:60.0ns  minimum:44.1ns
256:extrema:106.9ns  minimum:70.0ns
512:extrema:154.2ns  minimum:98.9ns
1024:extrema:248.6ns  minimum:157.5ns
2048:extrema:439.9ns  minimum:274.3ns
4096:extrema:820.7ns  minimum:516.7ns

julia> a = randn(Float32,4096);

julia> for n = 1:12
           print(1<<n,':')
           t1 = @belapsed Base.mapreduce_impl(Base.ExtremaMap(identity),Base._extrema_rf, $a, 1, (1<<$n))
           t2 = @belapsed Base.mapreduce_impl(identity,min, $a, 1, (1<<$n))
           println("extrema:", round(1e10t1)/10, "ns  minimum:", round(1e10t2)/10,"ns")
       end
2:extrema:7.5ns  minimum:7.3ns
4:extrema:9.9ns  minimum:10.0ns
8:extrema:14.8ns  minimum:15.8ns
16:extrema:21.0ns  minimum:16.1ns
32:extrema:27.5ns  minimum:21.5ns
64:extrema:40.3ns  minimum:31.6ns
128:extrema:55.5ns  minimum:49.5ns
256:extrema:74.0ns  minimum:73.0ns
512:extrema:100.4ns  minimum:98.2ns
1024:extrema:155.8ns  minimum:148.7ns
2048:extrema:265.6ns  minimum:250.4ns
4096:extrema:485.2ns  minimum:452.0ns

Julia 1.7.1
julia> a = randn(Float64,4096);

julia> for n = 1:12
           print(1<<n,':')
           #t1 = @belapsed Base.mapreduce_impl(Base.ExtremaMap(identity),Base._extrema_rf, $a, 1, (1<<$n))
           t2 = @belapsed Base.mapreduce_impl(identity,min, $a, 1, (1<<$n))
           #println("extrema:", round(1e10t1)/10, "ns  minimum:", round(1e10t2)/10,"ns")
           println("minimum:", round(1e10t2)/10,"ns")
       end
2:minimum:8.6ns
4:minimum:11.3ns
8:minimum:17.0ns
16:minimum:32.0ns
32:minimum:61.9ns
64:minimum:119.0ns
128:minimum:235.6ns
256:minimum:469.9ns
512:minimum:588.3ns
1024:minimum:837.3ns
2048:minimum:1300.0ns
4096:minimum:2233.3ns

julia> a = randn(Float32,4096);

julia> for n = 1:12
           print(1<<n,':')
           #t1 = @belapsed Base.mapreduce_impl(Base.ExtremaMap(identity),Base._extrema_rf, $a, 1, (1<<$n))
           t2 = @belapsed Base.mapreduce_impl(identity,min, $a, 1, (1<<$n))
           #println("extrema:", round(1e10t1)/10, "ns  minimum:", round(1e10t2)/10,"ns")
           println("minimum:", round(1e10t2)/10,"ns")
       end
2:minimum:9.3ns
4:minimum:12.0ns
8:minimum:17.5ns
16:minimum:31.9ns
32:minimum:61.6ns
64:minimum:132.5ns
128:minimum:400.5ns
256:minimum:918.5ns
512:minimum:991.7ns
1024:minimum:1250.0ns
2048:minimum:1700.0ns
4096:minimum:2555.6ns

Integer Benchmark (Int64):

This PR
julia> a = rand(Int64,4096);

julia> for n = 1:12
           print(1<<n,':')
           t1 = @belapsed Base.mapreduce_impl(identity,min, $a, 1, (1<<$n))
           println("minimum:", round(1e10t1)/10,"ns")
       end
2:minimum:4.5ns
4:minimum:5.7ns
8:minimum:6.8ns
16:minimum:9.1ns
32:minimum:13.1ns
64:minimum:15.4ns
128:minimum:21.1ns
256:minimum:33.6ns
512:minimum:53.9ns
1024:minimum:95.0ns
2048:minimum:191.4ns
4096:minimum:390.1ns

Julia 1.7.1
julia> a = rand(Int64,4096);

julia> for n = 1:12
           print(1<<n,':')
           t1 = @belapsed Base.mapreduce_impl(identity,min, $a, 1, (1<<$n))
           println("minimum:", round(1e10t1)/10,"ns")
       end
2:minimum:5.8ns
4:minimum:6.3ns
8:minimum:7.5ns
16:minimum:10.6ns
32:minimum:17.2ns
64:minimum:20.3ns
128:minimum:25.7ns
256:minimum:36.0ns
512:minimum:111.7ns
1024:minimum:259.8ns
2048:minimum:543.9ns
4096:minimum:1150.0ns

Note: Since #43604 is still in progress, the extrema-related optimization is blocked. I'm just opening this for early review.

oscardssmith commented Jan 9, 2022

Has this been sufficiently tested with NaN/Inf/missing?

N5N3 commented Jan 9, 2022

NaN, Inf, -0.0, 0.0 should have been tested (previously) in #43604.
As for missing, I limited this optimization to concrete IEEEFloat, so inputs with missing will use the general fallback.
(I think we don't want SIMD for Union inputs.)
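The dispatch restriction described above can be sketched as follows. This is a minimal illustration under my own assumptions, not the PR's actual code; `fastmin` is a made-up name:

```julia
# Restrict a SIMD-friendly fast path to concrete IEEE floats via dispatch,
# so element types like Union{Missing,Float64} take the generic method.
using Base: IEEEFloat   # Union{Float16,Float32,Float64}

function fastmin(a::AbstractArray{T}) where {T<:IEEEFloat}
    # concrete float elements: a plain @simd reduction loop vectorizes well
    v = a[begin]
    @inbounds @simd for i in firstindex(a)+1:lastindex(a)
        v = min(v, a[i])
    end
    return v
end

# anything else (including Union{Missing,T} element types) falls back
fastmin(a::AbstractArray) = minimum(a)
```

A `Vector{Float64}` hits the SIMD method, while a `Vector{Union{Missing,Float64}}` dispatches to the fallback and keeps `missing` propagation.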

@oscardssmith

It would be cool to get SIMD for unions with singletons (which should be doable fairly well). It doesn't have to be in this PR, though.

N5N3 commented Jan 9, 2022

`sum` is also not optimized for `Union` element types.
So a better choice, for me, would be to implement a general `mapreduce_impl` for `Union{Missing,T}` with a fixed unroll size of 4, which should also accelerate other reductions.
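The fixed-unroll idea can be sketched like this. `unrolled_reduce` is a hypothetical name (not Base's API), and the sketch assumes `op` is associative and commutative so the four accumulators may be combined in any order:

```julia
# Reduction with a fixed unroll factor of 4: four independent accumulators
# shorten the dependency chain, which also helps Union{Missing,T} inputs.
function unrolled_reduce(op, a::AbstractArray)
    length(a) < 4 && return reduce(op, a)   # short inputs: generic path
    i = firstindex(a)
    v1, v2, v3, v4 = a[i], a[i+1], a[i+2], a[i+3]
    i += 4
    while i + 3 <= lastindex(a)             # main loop, 4 elements per step
        @inbounds begin
            v1 = op(v1, a[i]);   v2 = op(v2, a[i+1])
            v3 = op(v3, a[i+2]); v4 = op(v4, a[i+3])
        end
        i += 4
    end
    v = op(op(v1, v2), op(v3, v4))          # combine the accumulators
    while i <= lastindex(a)                 # remainder (up to 3 elements)
        @inbounds v = op(v, a[i])
        i += 1
    end
    return v
end
```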

@N5N3 N5N3 force-pushed the simd_reduction_for_min_max branch from 586ec02 to bf461c3 Compare January 9, 2022 16:06
base/reduce.jl Outdated
Comment on lines 1233 to 1378
v, i = if rest < 8
ini, first
elseif rest < 64
@inline simd_kernel(Val(4), ini)
elseif rest < 128
simd_kernel(Val(8), ini)
else
simd_kernel(Val(16), ini)
end
Member

We need to test all branches

Member

I would generally expect this much unrolling to benefit microbenchmarks, but harm application performance

@N5N3 N5N3 (Member Author) Jan 13, 2022

TTFP does seem to be a problem: the first `minimum` call needs 0.3 s, while on master it's about 0.05 s.
If we want to make full use of SIMD, such an unroll size seems unavoidable.
(AVX2 should be able to perform the NaN check on 16 Float32 values using only one `vcmpps`.)

Member

It's not just TTFP: a large binary size is bad for instruction cache usage. I imagine this effect is hard to measure, though.

@N5N3 N5N3 force-pushed the simd_reduction_for_min_max branch 4 times, most recently from 62a1824 to 9a44f7e Compare January 13, 2022 15:36
@N5N3 N5N3 force-pushed the simd_reduction_for_min_max branch 2 times, most recently from 6b5e2ab to e43c44b Compare January 19, 2022 01:32
N5N3 commented Jan 19, 2022

The extrema-related optimization has been added; the latest benchmark for Float64:

2:extrema:9.6ns minimum:5.0ns
4:extrema:14.3ns minimum:7.8ns
8:extrema:24.2ns minimum:13.2ns
16:extrema:33.9ns minimum:18.6ns
32:extrema:37.4ns minimum:20.6ns
64:extrema:45.3ns minimum:24.1ns
128:extrema:55.3ns minimum:33.0ns
256:extrema:74.9ns minimum:47.2ns
512:extrema:115.2ns minimum:75.6ns
1024:extrema:195.6ns minimum:132.2ns
2048:extrema:358.3ns minimum:247.3ns
4096:extrema:710.5ns minimum:492.3ns

@N5N3 N5N3 force-pushed the simd_reduction_for_min_max branch from ac1fc8e to a7aef19 Compare January 21, 2022 05:05
@oscardssmith oscardssmith added the performance Must go faster label Jan 24, 2022
@N5N3 N5N3 force-pushed the simd_reduction_for_min_max branch from a7aef19 to 67edbed Compare January 28, 2022 06:38
N5N3 commented Jan 28, 2022

Since #43573 has not landed, I tried using `Ref{NTuple{16}}` to emulate `MVector{16}`.
A local benchmark shows it works well: TTFP is reduced by about half, and runtime performance is not impacted.
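The trick can be illustrated as follows. This is a sketch under my own assumptions, not the PR's implementation; `setslot!` is a made-up helper. A `Ref` to an `NTuple` acts as a fixed-size mutable buffer, with `Base.setindex` (the non-mutating tuple version) rebuilding the tuple on each update:

```julia
# A Ref{NTuple} as a stand-in for StaticArrays' MVector: fixed size,
# no heap-allocated Array, updated by writing a rebuilt tuple back
# through the Ref.
buf = Ref(ntuple(_ -> Inf, 16))   # 16 Float64 slots, initialized to Inf

# made-up helper: store x in slot i of the buffer
setslot!(r::Ref, i, x) = (r[] = Base.setindex(r[], x, i); r)

setslot!(buf, 3, 1.5)
minimum(buf[])                    # reduce over the buffer's contents
```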

N5N3 added 7 commits May 10, 2022 16:52
move `mapreduce_impl` to the end of `reduce.jl`

Update reduce.jl
typo fix

Update reduce.jl
1. Document `_fast`.
2. Fix the unroll size to 16.
3. If the non-SIMD region's length > 8, use 9 elements to initialize, then use a for loop and let LLVM do the unrolling.
TTFP is reduced by about 1/2.
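Item 3's idea can be sketched like this, under my own assumptions (`llvm_min` is a made-up name): seed the accumulator from a short prefix, then leave a plain loop for LLVM to unroll, rather than hand-unrolling with a large fixed factor:

```julia
function llvm_min(a::Vector{Float64})
    n = length(a)
    n > 8 || return minimum(a)      # short inputs: generic path
    v = a[1]
    @inbounds for i in 2:9          # initialize from 9 elements
        v = min(v, a[i])
    end
    @inbounds @simd for i in 10:n   # plain loop; LLVM unrolls/vectorizes it
        v = min(v, a[i])
    end
    return v
end
```

Keeping the hand-written portion small is what shrinks the generated code and cuts TTFP, while LLVM still emits unrolled SIMD for the main loop.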
@N5N3 N5N3 force-pushed the simd_reduction_for_min_max branch from 67edbed to add7a95 Compare May 10, 2022 08:52
@N5N3 N5N3 changed the title Faster reduction for min max. Faster minimum/maximum/extrema for Float64/32 May 10, 2022
@N5N3 N5N3 closed this Jul 9, 2022
@N5N3 N5N3 deleted the simd_reduction_for_min_max branch July 9, 2022 13:26

Labels

performance Must go faster


Development

Successfully merging this pull request may close these issues.

Extrema is slower than maximum + minimum

6 participants