make "dot" operations (.+ etc) fusing broadcasts #17623
Conversation
This changes semantics and isn't a backport candidate.
@JeffBezanson, I'm noticing an odd behavior with the compiler in the REPL. Basically, the first fused function I compile is slow, but the second and subsequent ones are fast. In particular, the following

x = rand(10^7);
f(x) = x .+ 3.*x.^3 .+ 4.*x.^2
@time f(x);
@time f(x);

reports 40M allocations even on the second run. However, the same function is fast if I compile a different fused function first:

x = rand(10^7);
g(x) = x .+ 3.*x.^3 .- 4.*x.^2
@time g(x);
f(x) = x .+ 3.*x.^3 .+ 4.*x.^2
@time f(x);
@time f(x);

This time only 8 allocations are reported. Any idea what could cause this? (I'll try to reproduce it on the master branch, to see whether it affects the 0.5 loop fusion, and file a separate issue if that is the case.)
#17759?
Ah, thanks @tkelman, that seems like the culprit. Using …
As long as I avoid …

@jrevels, it would be good to get some "vectorized operation" performance benchmarks on @nanosoldier.
@@ -865,7 +865,7 @@
     (begin
       #;(if (and (number? ex) (= ex 0))
             (error "juxtaposition with literal \"0\""))
-      `(call * ,ex ,(parse-unary s))))
+      `(call .* ,ex ,(parse-unary s))))
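For reference, a quick way to see what this hunk changes (checked on a stock Julia, where juxtaposed multiplication still lowers to a plain * call; the patched parser would emit .* here instead):

ex = Meta.parse("2x")
(ex.head, ex.args)   # (:call, Any[:*, 2, :x]) — with this hunk the parser would emit .* instead of *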
TotalVerb
Sep 19, 2016
Contributor
I'm not convinced this revised behaviour is a good thing. This could break things like 2x where x::Diagonal, which broadcast will try to promote to Array.
stevengj
Sep 19, 2016
Author
Member
2 .* x for x::Diagonal is also broken by this PR; why is breaking 2x worse? (Such cases could be fixed by adding specialized broadcast methods, of course.)
TotalVerb
Sep 19, 2016
Contributor
* has different semantics from .*. I think it is a mistake to treat the latter as a superset of the former, as is being done with this implicit multiplication lowering to .*. Intuitively, 2x means 2 * x, not 2 .* x.
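For instance, * and .* genuinely differ for matrices (a quick illustration of the general point, separate from the Diagonal case above):

A = [1 2; 3 4]; B = [10 20; 30 40]
A * B    # matrix product:      [70 100; 150 220]
A .* B   # elementwise product: [10  40;  90 160]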
stevengj
Sep 20, 2016
Author
Member
It doesn't have different semantics for multiplication by scalars...
TotalVerb
Sep 20, 2016
Contributor
But there is no guarantee that 2 .* x is the same as 2 * x for all types.
Also, this looks to be a performance disaster for the common scalar case.
julia> @code_llvm broadcast(*, 2, 2)
define %jl_value_t* @julia_broadcast_65524(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
%3 = alloca %jl_value_t**, align 8
store volatile %jl_value_t** %1, %jl_value_t*** %3, align 8
%4 = add i32 %2, -1
%5 = icmp eq i32 %4, 0
br i1 %5, label %fail, label %pass
fail: ; preds = %top
%6 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
call void @jl_bounds_error_tuple_int(%jl_value_t** %6, i64 0, i64 1)
unreachable
pass: ; preds = %top
%7 = icmp ugt i32 %4, 1
br i1 %7, label %pass.2, label %fail1
fail1: ; preds = %pass
%8 = sext i32 %4 to i64
%9 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
call void @jl_bounds_error_tuple_int(%jl_value_t** %9, i64 %8, i64 2)
unreachable
pass.2: ; preds = %pass
%10 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
%11 = bitcast %jl_value_t** %10 to i64**
%12 = load i64*, i64** %11, align 8
%13 = load i64, i64* %12, align 16
%14 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 2
%15 = bitcast %jl_value_t** %14 to i64**
%16 = load i64*, i64** %15, align 8
%17 = load i64, i64* %16, align 16
%18 = mul i64 %17, %13
%19 = call %jl_value_t* @jl_box_int64(i64 signext %18)
ret %jl_value_t* %19
}
I guess I am not understanding what is gained by this change. Loop fusion can be forced explicitly with .* anyway. Why should the scalar case be disrupted for the convenience of the vector case?
stevengj
Sep 20, 2016
Author
Member
Sure, it's not a big deal.
But why is it a performance disaster for the scalar case? Shouldn't it be getting inlined to be equivalent to 2 * 2?
stevengj
Sep 20, 2016
Author
Member
Performance looks good to me:
julia> f(x, y) = broadcast(*, x, y)
f (generic function with 1 method)
julia> @code_llvm f(2,2)
define i64 @julia_f_70407(i64, i64) #0 {
top:
%2 = mul i64 %1, %0
ret i64 %2
}
stevengj
Sep 20, 2016
Author
Member
Anyway, I'll revert this part of the PR, since it is controversial.
TotalVerb
Sep 20, 2016
Contributor
I wonder why @code_llvm on the broadcast itself is so scary.
yuyichao
Sep 20, 2016
Contributor
jlcall — the IR shown above is the generic jlcall wrapper (note the boxed %jl_value_t** argument signature), not the specialized code that actually runs.
One of the difficulties I'm having with this PR is that it makes it effectively impossible to define specialized methods for …. For example, in Julia ≤ 0.5 we have specialized methods for …. The problem is that as soon as you fuse the operation with another dot call, it produces a fused anonymous function and the specialized …. Do we have to give up on …?
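At the time of this PR, the parser fused a chain of dot calls into a single broadcast over a freshly generated anonymous function (roughly, a .+ b became broadcast(+, a, b), but a .+ 2 .* b became broadcast((a, b) -> a + 2*b, a, b)), so a method specialized on ::typeof(+) has nothing to dispatch on. A tiny stand-alone illustration of that dispatch problem (spec here is a made-up stand-in for broadcast):

spec(f::typeof(+), args...) = "specialized path"
spec(f, args...)            = "generic path"

spec(+, 1, 2)                   # "specialized path" — analogous to a .+ b
spec((x, y) -> x + 2y, 1, 2)    # "generic path"     — analogous to a .+ 2 .* b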
See also ongoing discussion in #18590 (comment).
This has removed a few optimizations for BitArrays. One in particular is the case …. I wonder if there is a way to catch this case again and get the same performance as ….
Yup, the general problem here is …. Now that boolean operations are fused, however, it's not clear to me how often one does non-fused operations on bitarrays. We used to need it for things like …. It also seems to me that there is quite a bit of unrolling that could be done to make chunk-by-chunk processing of BitArrays more efficient, e.g. for ….
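As a rough sketch of the chunk-by-chunk processing being referred to (not the Base implementation; it just uses BitArray's internal chunks field, where one UInt64 packs 64 Bools):

function chunked_and(a::BitVector, b::BitVector)
    length(a) == length(b) || throw(DimensionMismatch("lengths must match"))
    out = falses(length(a))
    ac, bc, oc = a.chunks, b.chunks, out.chunks
    @inbounds for i in eachindex(ac)   # process 64 elements per iteration
        oc[i] = ac[i] & bc[i]
    end
    return out
end

chunked_and(trues(129), falses(129))   # also handles the partial final chunk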
.>=,
.≥,
.\,
.^,
/,
//,
.//,
JeffBezanson
Dec 22, 2016
Member
Is there a reason this is still here?
tkelman
Dec 22, 2016
Contributor
probably just missed — likewise with .>> and .<< below
stevengj
Dec 22, 2016
Author
Member
Yup, just missed them, sorry.
This appears to have broken the ability for packages to define and use ….
Whoops, that wasn't intended.
Will have a PR to fix ….
This is a ~~WIP~~ finished PR making dot operations into "fusing" broadcasts. That is, x .⨳ y (for any binary operator ⨳) is transformed by the parser into (⨳).(x,y), which in turn is fused with other nested "dot" calls into a single broadcast.

To do:

- … x .⨳ y as a fusing "dot" function call.
- … .⨳ method definitions to broadcast(::typeof(⨳), ...) definitions. (Currently, these methods are silently ignored.)
- … (… MethodError: no method matching splice!(::Array{UInt8,1}, ::Array{Int64,1}, ::Array{UInt8,1}) ….)
- [true] .* [true] gives a stack overflow.
- Float64 * Array{Float32} = Array{Float32} etc. (for non-dot ops)
- … broadcast(::typeof(op), ...) methods as possible.
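A minimal sketch of what the fusion buys on a Julia with this fusing dot syntax: the whole chain becomes a single element-wise pass, so only the result array is allocated. The hand-written broadcast below is roughly what the fused lowering computes (the actual fused function is generated by the parser):

x  = rand(10^7)
y  = x .+ 3 .* x .^ 3 .+ 4 .* x .^ 2               # fused into a single broadcast
y2 = broadcast(xi -> xi + 3 * xi^3 + 4 * xi^2, x)  # hand-written equivalent, one allocation
y ≈ y2                                              # true: same element-wise computation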