
make "dot" operations (.+ etc) fusing broadcasts #17623

Merged (4 commits) on Dec 20, 2016

Conversation

stevengj (Member) commented Jul 26, 2016

This is a finished PR (originally opened as a WIP) making dot operations into "fusing" broadcasts. That is, x .⨳ y (for any binary operator ⨳) is transformed by the parser into (⨳).(x,y), which in turn is fused with other nested "dot" calls into a single broadcast.
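To see what the fused lowering amounts to, here is a minimal hand-written sketch (the explicit broadcast call below is spelled out manually to show what the fused expression computes; it is not the parser's literal output):

x = [1.0, 2.0, 3.0]; y = [4.0, 5.0, 6.0]; z = [7.0, 8.0, 9.0]
fused    = x .* y .+ z                                 # one fused loop, no temporary for x .* y
handmade = broadcast((a, b, c) -> a * b + c, x, y, z)  # roughly what the fused call amounts to
@assert fused == handmade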

To do:

  • Parse x .⨳ y as a fusing "dot" function call.
  • Deprecate manual .⨳ method definitions in favor of broadcast(::typeof(⨳), ...) definitions; see the sketch after this list. (Currently, these methods are silently ignored.)
  • Unbreak the REPL. (Somehow this patch breaks it: typing backspace gives MethodError: no method matching splice!(::Array{UInt8,1}, ::Array{Int64,1}, ::Array{UInt8,1}).)
  • [true] .* [true] gives a stack overflow.
  • Fix lots of breaking tests due to "broadcast should treat types as 'scalar-like'" (#16966).
  • Restore the 0.5 behavior of Float64 * Array{Float32} = Array{Float32} etc. (for non-dot ops).
  • Fix breaking method-ambiguity test
  • More tests
  • Documentation
  • Eliminate as many specialized broadcast(::typeof(op), ...) methods as possible.
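As a hedged sketch of the migration in the second to-do item (MyVec is a hypothetical wrapper type, not something from this PR), a package would move from defining the dot operator itself to specializing broadcast on the operator's type:

struct MyVec
    data::Vector{Float64}
end

# old style, deprecated by this change (and currently silently ignored):
#     Base.:(.*)(a::MyVec, b::MyVec) = MyVec(a.data .* b.data)

# new style: specialize broadcast on typeof(*)
Base.broadcast(::typeof(*), a::MyVec, b::MyVec) = MyVec(a.data .* b.data)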

ViralBShah added this to the 0.5.x milestone Jul 26, 2016
tkelman (Contributor) commented Jul 26, 2016

This changes semantics and isn't a backport candidate.

stevengj (Member Author) commented Aug 2, 2016

@JeffBezanson, I'm noticing an odd behavior with the compiler in the REPL. Basically, the first fused function I compile is slow, but the second and subsequent ones are fast.

In particular, the following f(x) = x .+ 3.*x.^3 .+ 4.*x.^2 is extremely slow:

x = rand(10^7);
f(x) = x .+ 3.*x.^3 .+ 4.*x.^2
@time f(x);
@time f(x);

reporting 40M allocations even on the second run. However, the same function is fast if I compile a different fused function first!

x = rand(10^7);
g(x) = x .+ 3.*x.^3 .- 4.*x.^2
@time g(x);
f(x) = x .+ 3.*x.^3 .+ 4.*x.^2
@time f(x);
@time f(x);

This reports only 8 allocations for f(x), and allocates exactly the expected amount of memory for the output array.

Any idea what could cause this? (I'll try to reproduce it in the master branch, to see if it affects the 0.5 loop fusion, and file a separate issue if that is the case.)

tkelman (Contributor) commented Aug 2, 2016

#17759 ?

stevengj (Member Author) commented Aug 2, 2016

Ah, thanks @tkelman, that seems like the culprit. Using f(x) = x .+ 3.*x.*x.*x .+ 4.*x.*x, i.e. avoiding ^, eliminates the problem for me.

stevengj (Member Author) commented Aug 2, 2016

As long as I avoid ^, whose problems seem orthogonal to this PR, the performance is exactly what I was hoping for. For example, y .= x .+ 3.*x.*x.*x .+ 4.*x.*x occurs entirely in-place, with performance identical to writing out the loops manually (with @inbounds annotation).
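For concreteness, this is roughly the hand-written loop that the fused in-place assignment is being compared against (the function name f_loop! is just for illustration, not from this PR):

function f_loop!(y, x)
    @inbounds for i in eachindex(y, x)
        xi = x[i]
        y[i] = xi + 3xi*xi*xi + 4xi*xi
    end
    return y
end
# versus the fused form: y .= x .+ 3 .* x .* x .* x .+ 4 .* x .* x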

@jrevels, it would be good to get some "vectorized operation" performance benchmarks on @nanosoldier.

@@ -865,7 +865,7 @@
 (begin
   #;(if (and (number? ex) (= ex 0))
       (error "juxtaposition with literal \"0\""))
-  `(call * ,ex ,(parse-unary s))))
+  `(call .* ,ex ,(parse-unary s))))
Contributor:

I'm not convinced this revised behaviour is a good thing. This could break things like 2x where x::Diagonal, which broadcast will try to promote to Array.
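For example (a sketch of the concern, using the stock behavior of * for comparison):

using LinearAlgebra          # for Diagonal (it lived in Base at the time of this PR)
x = Diagonal([1.0, 2.0, 3.0])
2 * x                        # stays a Diagonal: only the stored diagonal is scaled
# 2 .* x                     # with juxtaposition lowered to .*, 2x would take this path,
                             # which (at the time) generic broadcast promoted to a dense Array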

Member Author:

2 .* x for x::Diagonal is also broken by this PR; why is breaking 2x worse?

(Such cases could be fixed by adding specialized broadcast methods, of course.)

Contributor:

* has different semantics from .*. I think it is a mistake to treat the latter as a superset of the former, as is being done with this implicit multiplication lowering to .*. Intuitively, 2x means 2 * x, not 2 .* x.
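For instance, for matrices the two operators compute different things entirely (standard behavior, not specific to this PR):

A = [1 2; 3 4]; B = [10 20; 30 40]
A * B     # matrix product
A .* B    # elementwise (Hadamard) product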

Member Author:

It doesn't have different semantics for multiplication by scalars...

Contributor:

But there is no guarantee that 2 .* x is the same as 2 * x for all types.

Also, this looks to be a performance disaster for the common scalar case.

julia> @code_llvm broadcast(*, 2, 2)

define %jl_value_t* @julia_broadcast_65524(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = alloca %jl_value_t**, align 8
  store volatile %jl_value_t** %1, %jl_value_t*** %3, align 8
  %4 = add i32 %2, -1
  %5 = icmp eq i32 %4, 0
  br i1 %5, label %fail, label %pass

fail:                                             ; preds = %top
  %6 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  call void @jl_bounds_error_tuple_int(%jl_value_t** %6, i64 0, i64 1)
  unreachable

pass:                                             ; preds = %top
  %7 = icmp ugt i32 %4, 1
  br i1 %7, label %pass.2, label %fail1

fail1:                                            ; preds = %pass
  %8 = sext i32 %4 to i64
  %9 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  call void @jl_bounds_error_tuple_int(%jl_value_t** %9, i64 %8, i64 2)
  unreachable

pass.2:                                           ; preds = %pass
  %10 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  %11 = bitcast %jl_value_t** %10 to i64**
  %12 = load i64*, i64** %11, align 8
  %13 = load i64, i64* %12, align 16
  %14 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 2
  %15 = bitcast %jl_value_t** %14 to i64**
  %16 = load i64*, i64** %15, align 8
  %17 = load i64, i64* %16, align 16
  %18 = mul i64 %17, %13
  %19 = call %jl_value_t* @jl_box_int64(i64 signext %18)
  ret %jl_value_t* %19
}

I guess I am not understanding what is gained by this change. Loop fusion can be forced explicitly with .* anyway. Why should the scalar case be disrupted for the convenience of the vector case?

Member Author:

Sure, it's not a big deal.

But why is it a performance disaster for the scalar case? Shouldn't it be getting inlined to be equivalent to 2 * 2?

Member Author:

Performance looks good to me:

julia> f(x, y) = broadcast(*, x, y)
f (generic function with 1 method)

julia> @code_llvm f(2,2)

define i64 @julia_f_70407(i64, i64) #0 {
top:
  %2 = mul i64 %1, %0
  ret i64 %2
}

Member Author:

Anyway, I'll revert this part of the PR, since it is controversial.

Contributor:

I wonder why @code_llvm on the broadcast itself is so scary.

Contributor:

jlcall

stevengj (Member Author):

One of the difficulties I'm having with this PR is that it makes it effectively impossible to define specialized methods for .* etcetera.

For example, in Julia ≤ 0.5 we have specialized methods for array .< scalar that produce a BitArray, and we have specialized methods for sparsevector .* sparsevector that preserve the sparsity. With this PR, I can in principle overload broadcast(::typeof(<), array, scalar), but in fact it is almost useless.

The problem is that as soon as you fuse the operation with another dot call, it produces a fused anonymous function and the specialized broadcast method is never called. (The compress-fuse optimizations that inline numeric literals and combine duplicate broadcast arguments are another problem.)
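A small sketch of the bypass (the explicit broadcast calls are written out by hand to show what the two expressions amount to; A, B, and s are just example data):

A = rand(4); B = rand(4); s = 0.5
lone  = broadcast(<, A, s)                             # a lone A .< s dispatches on typeof(<)
fused = broadcast((a, b) -> (a < s) & (b > 0), A, B)   # (A .< s) .& (B .> 0) fuses; the operator
                                                       # is hidden inside an anonymous function
@assert fused == ((A .< s) .& (B .> 0))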

Do we have to give up on .* for sparse arrays, and do we have to give up on elementwise boolean operations producing a BitArray? Or should broadcast always generate a BitArray if it detects that the output type is Bool?

TotalVerb (Contributor):

See also ongoing discussion in #18590 (comment).

carlobaldassi (Member):

This has removed a few optimizations for BitArrays. One in particular is the case A .* B when A and B have the same shape, which previously was specialized and called A & B. The difference is quite significant, e.g. for 1000x1000 BitArrays it's almost 40-fold.

I wonder if there is a way to catch this case again and get the same performance as map. Also, now that we have a generalized dot-operator syntax, it would be particularly useful to exploit the cases when the function is pure and operate chunk-wise (also in map, which currently only recognizes a few operations). Is there a way to determine if a function is pure?

stevengj (Member Author) commented Dec 22, 2016

Yup, the general problem here is broadcast(::typeof(*),...) and similar methods aren't that useful to define, because they won't be called in lots of cases due to fusion. (In the particular case of A .* B, of course, one could simply call A & B.) As far as I know, we don't currently have any way to determine whether a function is pure.

Now that boolean operations are fused, however, it's not clear to me how often one does non-fused operations on bitarrays. We used to need it for things like (A .> 6) .& (B .> 0), but now this is fused into a single loop. How common are operations on large boolean arrays?

It also seems to me that there is quite a bit of unrolling that could be done to make chunk-by-chunk processing of BitArrays more efficient, e.g. for broadcast! with a BitArray result or getindex(A, ::BitArray). We should probably just write out (via metaprogramming) the unrolled loop for all 64 bits in a chunk. But that is separate from the question of chunk-wise pure bit operations.
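As a hedged sketch of the chunk-wise idea for a known-pure two-argument boolean function (this relies on the internal chunks field of BitArray and is illustrative, not code from Base or from this PR):

function chunkwise_and!(dest::BitVector, A::BitVector, B::BitVector)
    length(dest) == length(A) == length(B) || throw(DimensionMismatch())
    Dc, Ac, Bc = dest.chunks, A.chunks, B.chunks
    @inbounds for i in eachindex(Dc)
        Dc[i] = Ac[i] & Bc[i]   # one 64-bit chunk per iteration instead of 64 single-bit operations
    end
    return dest
end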

(excerpt from the operator list under review:)
.>=,
.≥,
.\,
.^,
/,
//,
.//,
Sponsor Member:

Is there a reason this is still here?

Contributor:

probably just missed - likewise with .>> and .<< below

Member Author:

Yup, just missed them, sorry.

tkelman (Contributor) commented Dec 27, 2016

This appears to have broken the ability for packages to define and use .. as an operator.
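For reference, the kind of package usage in question is roughly the following (IntervalSets-style definitions of .. as an infix operator; the interval type here is just a placeholder):

struct MyInterval
    lo::Float64
    hi::Float64
end
..(a, b) = MyInterval(a, b)   # packages define .. as an ordinary binary operator
1 .. 2                        # should parse as a call to .., not as broadcast syntax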

stevengj (Member Author):

Whoops, that wasn't intended.

stevengj (Member Author):

Will have a PR to fix .. shortly.

ararslan mentioned this pull request Jan 9, 2017
tkelman referenced this pull request in JuliaLinearAlgebra/BandedMatrices.jl Feb 4, 2017
ajkeller34 referenced this pull request in PainterQubits/Unitful.jl Feb 21, 2017
tkelman referenced this pull request in SciML/DiffEqProblemLibrary.jl May 4, 2017
Sacha0 added the kind:deprecation label May 14, 2017
Labels: domain:broadcast (Applying a function over a collection), kind:deprecation (This change introduces or involves a deprecation)