
make "dot" operations (.+ etc) fusing broadcasts #17623

Merged
merged 4 commits into from Dec 20, 2016

Conversation

@stevengj
Member

@stevengj stevengj commented Jul 26, 2016

This is a finished PR making dot operations into "fusing" broadcasts. That is, x .⨳ y (for any binary operator ⨳) is transformed by the parser into (⨳).(x, y), which in turn is fused with other nested "dot" calls into a single broadcast.

To do:

  • Parse x .⨳ y as a fusing "dot" function call.
  • Deprecate manual .⨳ method definitions in favor of broadcast(::typeof(⨳), ...) definitions. (Currently, these methods are silently ignored.)
  • Unbreak the REPL. (Somehow this patch breaks it: typing backspace gives MethodError: no method matching splice!(::Array{UInt8,1}, ::Array{Int64,1}, ::Array{UInt8,1}).)
  • Fix [true] .* [true] giving a stack overflow.
  • Fix lots of breaking tests due to #16966.
  • Restore the 0.5 behavior of Float64 * Array{Float32} = Array{Float32} etc. (for non-dot ops).
  • Fix the breaking method-ambiguity test.
  • Add more tests.
  • Add documentation.
  • Eliminate as many specialized broadcast(::typeof(op), ...) methods as possible.
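The fusion semantics described above can be sketched as follows (a minimal illustration using + and * as the stand-in operators, on a Julia version where this lowering exists):

```julia
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# x .+ y is parsed as (+).(x, y), which is broadcast(+, x, y)
@assert (x .+ y) == broadcast(+, x, y)

# Nested dot calls fuse into a single broadcast of one anonymous
# function, with no intermediate arrays:
#   x .+ 2 .* y  is equivalent to  broadcast((a, b) -> a + 2b, x, y)
@assert (x .+ 2 .* y) == broadcast((a, b) -> a + 2b, x, y)
```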
@ViralBShah ViralBShah added this to the 0.5.x milestone Jul 26, 2016
@tkelman
Contributor

@tkelman tkelman commented Jul 26, 2016

This changes semantics and isn't a backport candidate.

@JeffBezanson JeffBezanson modified the milestones: 0.5.x, 0.6.0 Jul 26, 2016
@stevengj stevengj mentioned this pull request Jul 29, 2016
@stevengj
Member Author

@stevengj stevengj commented Aug 2, 2016

@JeffBezanson, I'm noticing an odd behavior with the compiler in the REPL. Basically, the first fused function I compile is slow, but the second and subsequent ones are fast.

In particular, the following f(x) = x .+ 3.*x.^3 .+ 4.*x.^2 is extremely slow:

x = rand(10^7);
f(x) = x .+ 3.*x.^3 .+ 4.*x.^2
@time f(x);
@time f(x);

reporting 40M allocations even on the second run. However, the same function is fast if I compile a different fused function first!

x = rand(10^7);
g(x) = x .+ 3.*x.^3 .- 4.*x.^2
@time g(x);
f(x) = x .+ 3.*x.^3 .+ 4.*x.^2
@time f(x);
@time f(x);

This reports only 8 allocations for f(x), and allocates exactly the expected amount of memory for the output array.

Any idea what could cause this? (I'll try to reproduce it in the master branch, to see if it affects the 0.5 loop fusion, and file a separate issue if that is the case.)

@tkelman
Contributor

@tkelman tkelman commented Aug 2, 2016

#17759 ?

@stevengj
Member Author

@stevengj stevengj commented Aug 2, 2016

Ah, thanks @tkelman, that seems like the culprit. Using f(x) = x .+ 3.*x.*x.*x .+ 4.*x.*x, i.e. avoiding ^, eliminates the problem for me.

@stevengj
Member Author

@stevengj stevengj commented Aug 2, 2016

As long as I avoid ^, whose problems seem orthogonal to this PR, the performance is exactly what I was hoping for. For example, y .= x .+ 3.*x.*x.*x .+ 4.*x.*x occurs entirely in-place, with performance identical to writing out the loops manually (with @inbounds annotation).
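A sketch of the in-place behavior described above (array names are illustrative; with the final syntax, spaces disambiguate 3 .* x from a literal 3.):

```julia
x = rand(1000)
y = similar(x)

# Fused and fully in-place: lowers to broadcast!(f, y, x) for a single
# anonymous f, so the only array ever written is y itself.
y .= x .+ 3 .* x .* x .* x .+ 4 .* x .* x

# The hand-written loop it should match:
z = similar(x)
@inbounds for i in eachindex(x)
    z[i] = x[i] + 3 * x[i]^3 + 4 * x[i]^2
end
@assert y ≈ z
```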

@jrevels, it would be good to get some "vectorized operation" performance benchmarks on @nanosoldier.

@@ -865,7 +865,7 @@
     (begin
      #;(if (and (number? ex) (= ex 0))
        (error "juxtaposition with literal \"0\""))
-     `(call * ,ex ,(parse-unary s))))
+     `(call .* ,ex ,(parse-unary s))))

@TotalVerb

TotalVerb Sep 19, 2016
Contributor

I'm not convinced this revised behaviour is a good thing. This could break things like 2x where x::Diagonal, which broadcast will try to promote to Array.

@stevengj

stevengj Sep 19, 2016
Author Member

2 .* x for x::Diagonal is also broken by this PR; why is breaking 2x worse?

(Such cases could be fixed by adding specialized broadcast methods, of course.)

@TotalVerb

TotalVerb Sep 19, 2016
Contributor

* has different semantics from .*. I think it is a mistake to treat the latter as a superset of the former, as is being done with this implicit multiplication lowering to .*. Intuitively, 2x means 2 * x, not 2 .* x.

@stevengj

stevengj Sep 20, 2016
Author Member

It doesn't have different semantics for multiplication by scalars...
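For concreteness, the distinction under discussion with ordinary dense matrices (the Diagonal case behaves analogously for *):

```julia
A = [1 2; 3 4]
B = [5 6; 7 8]

@assert A * B == [19 22; 43 50]   # matrix product
@assert A .* B == [5 12; 21 32]   # elementwise product
@assert 2 * A == 2 .* A           # scalar case: the two agree
```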

@TotalVerb

TotalVerb Sep 20, 2016
Contributor

But there is no guarantee that 2 .* x is the same as 2 * x for all types.

Also, this looks to be a performance disaster for the common scalar case.

julia> @code_llvm broadcast(*, 2, 2)

define %jl_value_t* @julia_broadcast_65524(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = alloca %jl_value_t**, align 8
  store volatile %jl_value_t** %1, %jl_value_t*** %3, align 8
  %4 = add i32 %2, -1
  %5 = icmp eq i32 %4, 0
  br i1 %5, label %fail, label %pass

fail:                                             ; preds = %top
  %6 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  call void @jl_bounds_error_tuple_int(%jl_value_t** %6, i64 0, i64 1)
  unreachable

pass:                                             ; preds = %top
  %7 = icmp ugt i32 %4, 1
  br i1 %7, label %pass.2, label %fail1

fail1:                                            ; preds = %pass
  %8 = sext i32 %4 to i64
  %9 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  call void @jl_bounds_error_tuple_int(%jl_value_t** %9, i64 %8, i64 2)
  unreachable

pass.2:                                           ; preds = %pass
  %10 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  %11 = bitcast %jl_value_t** %10 to i64**
  %12 = load i64*, i64** %11, align 8
  %13 = load i64, i64* %12, align 16
  %14 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 2
  %15 = bitcast %jl_value_t** %14 to i64**
  %16 = load i64*, i64** %15, align 8
  %17 = load i64, i64* %16, align 16
  %18 = mul i64 %17, %13
  %19 = call %jl_value_t* @jl_box_int64(i64 signext %18)
  ret %jl_value_t* %19
}

I guess I am not understanding what is gained by this change. Loop fusion can be forced explicitly with .* anyway. Why should the scalar case be disrupted for the convenience of the vector case?

@stevengj

stevengj Sep 20, 2016
Author Member

Sure, it's not a big deal.

But why is it a performance disaster for the scalar case? Shouldn't it be getting inlined to be equivalent to 2 * 2?

@stevengj

stevengj Sep 20, 2016
Author Member

Performance looks good to me:

julia> f(x, y) = broadcast(*, x, y)
f (generic function with 1 method)

julia> @code_llvm f(2,2)

define i64 @julia_f_70407(i64, i64) #0 {
top:
  %2 = mul i64 %1, %0
  ret i64 %2
}

@stevengj

stevengj Sep 20, 2016
Author Member

Anyway, I'll revert this part of the PR, since it is controversial.

@TotalVerb

TotalVerb Sep 20, 2016
Contributor

I wonder why @code_llvm on the broadcast itself is so scary.

@yuyichao

yuyichao Sep 20, 2016
Contributor

jlcall — @code_llvm on broadcast itself shows the generic jlcall wrapper, not the specialized code.

@stevengj
Member Author

@stevengj stevengj commented Sep 21, 2016

One of the difficulties I'm having with this PR is that it makes it effectively impossible to define specialized methods for .* etcetera.

For example, in Julia ≤ 0.5 we have specialized methods for array .< scalar that produce a BitArray, and we have specialized methods for sparsevector .* sparsevector that preserve the sparsity. With this PR, I can in principle overload broadcast(::typeof(<), array, scalar), but in fact it is almost useless.

The problem is that as soon as you fuse the operation with another dot call, it produces a fused anonymous function and the specialized broadcast method is never called. (The compress-fuse optimizations that inline numeric literals and combine duplicate broadcast arguments are another problem.)

Do we have to give up on .* for sparse arrays, and do we have to give up on elementwise boolean operations producing a BitArray? Or should broadcast always generate a BitArray if it detects that the output type is Bool?
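(For the record, this is the route later Julia versions took: a sketch on current Julia, where a Bool-valued dot call comes back as a BitArray.)

```julia
A = rand(4)
s = 0.5

r = A .< s              # fused dot call with Bool element type
@assert r isa BitVector # broadcast yields a BitArray for Bool results
@assert r == [a < s for a in A]  # matches the Vector{Bool} comprehension
```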

@TotalVerb
Contributor

@TotalVerb TotalVerb commented Sep 21, 2016

See also ongoing discussion in #18590 (comment).

@carlobaldassi
Member

@carlobaldassi carlobaldassi commented Dec 22, 2016

This has removed a few optimizations for BitArrays. One in particular is the case A .* B when A and B have the same shape, which previously was specialized and called A & B. The difference is quite significant, e.g. for 1000x1000 BitArrays it's almost 40-fold.

I wonder if there is a way to catch this case again and get the same performance as map. Also, now that we have a generalized dot-operator syntax, it would be particularly useful to exploit the cases when the function is pure and operate chunk-wise (also in map, which currently only recognizes a few operations). Is there a way to determine if a function is pure?
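One way to express the fast path explicitly is .&, which for Bool coincides elementwise with * (a sketch; bitrand is from the Random stdlib):

```julia
using Random

A = bitrand(10_000)
B = bitrand(10_000)

# For Bool, true * true == true, so .* and .& agree elementwise.
@assert (A .* B) == (A .& B)
@assert (A .& B) isa BitVector  # result stays a chunked BitArray
```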

@stevengj
Member Author

@stevengj stevengj commented Dec 22, 2016

Yup, the general problem here is that broadcast(::typeof(*), ...) and similar methods aren't that useful to define, because in lots of cases they won't be called, due to fusion. (In the particular case of A .* B, of course, one could simply call A & B.) As far as I know, we don't currently have any way to determine whether a function is pure.

Now that boolean operations are fused, however, it's not clear to me how often one does non-fused operations on bitarrays. We used to need it for things like (A .> 6) .& (B .> 0), but now this is fused into a single loop. How common are operations on large boolean arrays?

It also seems to me that there is quite a bit of unrolling that could be done to make chunk-by-chunk processing of BitArrays more efficient, e.g. for broadcast! with a BitArray result or getindex(A, ::BitArray). We should probably just write out (via metaprogramming) the unrolled loop for all 64 bits in a chunk. But that is separate from the question of chunk-wise pure bit operations.
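A small check of the fused boolean chain mentioned above (illustrative array sizes):

```julia
A = rand(1:10, 1_000)
B = rand(-5:5, 1_000)

# One fused loop producing one BitArray; no temporaries for the
# intermediate A .> 6 and B .> 0 results.
m = (A .> 6) .& (B .> 0)
@assert m isa BitVector
@assert m == map((a, b) -> (a > 6) & (b > 0), A, B)
```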

.>=,
.≥,
.\,
.^,
/,
//,
.//,

@JeffBezanson

JeffBezanson Dec 22, 2016
Member

Is there a reason this is still here?

@tkelman

tkelman Dec 22, 2016
Contributor

probably just missed - likewise with .>> and .<< below

@stevengj

stevengj Dec 22, 2016
Author Member

Yup, just missed them, sorry.

@tkelman
Contributor

@tkelman tkelman commented Dec 27, 2016

This appears to have broken the ability for packages to define and use .. as an operator.

@stevengj
Member Author

@stevengj stevengj commented Dec 27, 2016

Whoops, that wasn't intended.

@stevengj
Member Author

@stevengj stevengj commented Dec 27, 2016

Will have a PR to fix .. shortly.

@Sacha0 Sacha0 mentioned this pull request Jan 7, 2017
@ararslan ararslan mentioned this pull request Jan 9, 2017
tkelman referenced this pull request in JuliaMatrices/BandedMatrices.jl Feb 4, 2017
ajkeller34 referenced this pull request in PainterQubits/Unitful.jl Feb 21, 2017
tkelman referenced this pull request in SciML/DiffEqProblemLibrary.jl May 4, 2017
@Sacha0 Sacha0 added the deprecation label May 14, 2017
@tlienart tlienart mentioned this pull request Mar 27, 2020