-
-
Notifications
You must be signed in to change notification settings - Fork 5.7k
faster min/max for IEEEFloats #45532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can you give me a test of this on every pair of |
|
My locate bench shows: (Compared with #41709) |
|
Timing this and #41709 on M1 mac:
Same ratios for Float64 and Float32. |
|
@mcabbott would you mind bench this version on M1? function __min(x, y)
diff = x - y
min = ifelse(signbit(diff), x, y)
anynan = isnan(x)|isnan(y)
ifelse(anynan, diff, min)
end
function __max(x, y)
diff = x - y
max = ifelse(signbit(diff), y, x)
anynan = isnan(x)|isnan(y)
ifelse(anynan, diff, max)
end
function __minmax(x, y)
diff = x - y
sdiff = signbit(diff)
min = ifelse(sdiff, x, y)
max = ifelse(sdiff, y, x)
anynan = isnan(x)|isnan(y)
min = ifelse(anynan, diff, min)
max = ifelse(anynan, diff, max)
min, max
endBTW, does M1 have native LLVM maximum support? I guess that should be the fastest choice. |
|
On M1
|
|
At least |
|
Testing on every function makequiet(x)
# set the quiet bit of a NaN
u = reinterpret(Unsigned, x)
qbit = one(u) << (Base.significand_bits(typeof(x))-1) # quiet bit in NaN
return reinterpret(typeof(x), u | qbit)
end
i16s = typemin(Int16):typemax(Int16)
for ix in i16s, iy in i16s
x = reinterpret(Float16, ix)
y = reinterpret(Float16, iy)
m = min(x, y)
if x < y
m == x || println(x,",",y," -> ",m)
elseif x > y
m == y || println(x,",",y," -> ",m)
elseif x == y
m == x == y && signbit(m) == signbit(x) | signbit(y) || println(x,",",y," -> ",m)
elseif isnan(x) | isnan(y)
makequiet(m) === makequiet(x) || makequiet(m) === makequiet(y) || println(x,",",y," -> ",m," : ",ix,",",iy," -> ",reinterpret(Int16,m))
end
endwhich took 25 minutes on my machine. I did a similar test for My machine lacks native I suppose the faster approach would be to promote to |
|
Non-native Looking at I'm torn. Scalar operation is probably more common in the places where performance is critical, but ultimately the difference won't add up to anything massive. Vectorized operation gives the potential to magnify a small performance difference, but again the I tried to tweak |
|
yeah. The only reason I brought up |
|
Certainly. I was referring to the other line of discussion looking at |
|
Well, I try function __min(x, y)
diff = x - y
min = ifelse(signbit(diff), x, y)
ifelse(isnan(diff), diff, min)
endThe branch is still there but it saves one |
|
Would be nice if these made NNlib avoids |
|
I am aware these algorithms perform very poorly for reductions. I don't have a good explanation as to why except that part of it is that they don't SIMD in reductions. When looking at performance, be warned that I have a different approach (related to the |
That's great! I tried more thorough unroll to make |
This version will yield the incorrect results |
|
LLVM already has a issue to track X86's Therefore, I think it is more important to avoid possible degradation (as reported by @mcabbott). Although I can't reproduce that on my old x86... |
|
Having spent some time thinking about it, I think that the It'd be nice if we can just use @N5N3, you are managing #41709 which accomplishes everything this PR would. Would you like me to close this and you can move forward with that one or should I adopt your versions and continue this PR? |
|
I think keep working on #41709 seems simplier as it covered more cases. (In fact I think it's mergeable) |
|
Lets focus on the other PR then. |
Faster
min,max, andminmaxforIEEEFloat(Float16,Float32,Float64) types.Setup:
Before:
After:
map!indicates the straight-line performance whilebroadcast!allows for SIMD. This PR increases throughput formin/maxby about 35%,minmaxby 25%, and enablesminmaxto SIMD (additional 225% throughput).Note that the
AbstractFloat-specializedminmaxwas deleted and theRealversion redefined to simplyminmax(x,y) = min(x,y),max(x,y). The new version is faster forIEEEFloat(with themin/maxchanges), does not change performance forBitInteger, and does not apply toBigFloat.The
AbstractFloat-specialized versions ofmin/maxare now dead toBasetypes (BigFloathas its own specialization andIEEEFloatis superseded by the new specializations in this PR). I was tempted to remove these methods, but that would be breaking for any user-definedAbstractFloattypes that rely on them. I'm not sure that any do, but I didn't feel the need to break things in this PR.