RFC: Customizable lazy broadcasting with options for pure-Julia fusion and eager evaluation #25377

timholy · 2018-01-03T21:57:37Z

This PR (WIP) implements broadcasting using a lazy wrapper. It's a collaborative effort with @vtjnash, following up on the branch described here. It's a WIP because some of the sparse/higherorderfns tests are allocating and/or hitting inference limits, and there may be other bugs I haven't yet addressed.

Having gotten most of the way through this, I would characterize this as an attractive but dangerous change. It fixes numerous problems but also risks introducing a whole bunch of fun new bugs. For a few people it might make broadcasting harder to support than how it works now, but for others they will finally be able to use broadcasting.

The overall strategy is to create Broadcasted wrappers, and then copy them (or copyto! for broadcast!) to eagerly evaluate. There's also an intermediate step of instantiation, which just finishes various elements of the wrapper (the indexing information). Certain packages like StaticArrays.jl won't want to use the indexing operations, so this allows them to avoid the performance hit that comes from instantiate.

The trickiest part of all this is that Broadcasted wrappers can be nested, for example in 2 .* (x .+ 1), which leads to Broadcasted(*, 2, Broadcasted(+, x, 1)). In general, what this means is that users may have to walk the tree to implement broadcasting, which may not be a simple operation. To simplify this, I provide two approaches: (1) Broadcasted containers can be flattened, using pure-Julia code to perform the same fusion operations we used to do with in julia-syntax.scm (this is used for implementing sparse broadcasting); (2) there's support for straightforward "incremental" (eager) evaluation, starting with the deepest layers and working your way up (this is used for supporting broadcasting on AbstractRanges). Hopefully some combination of these two will ease most folks' pain.

One immediately practical consequence is that certain operations are much harder on inference, since more operations are handled by pure-Julia code. While I haven't timed carefully, I suspect slowdowns in most or all of the following tests:

linalg/dense
sparse/sparsevector
sparse/higherorderfns
linalg/triangular

Achievements:

fixes generated function body is not pure when using dot notation #21094
fixes better internal interface for extending broadcast #22060
fixes x.*x not customizable #22053
fixes Element-wise call syntax does not always lower to standard broadcast() #19313
resolves this concern
deprecates (via a fallback) (1:5) + 1. We're doing this for other AbstractArrays but such methods were still defined for ranges.
speeds up the ranges.jl test considerably, and also fixes a printing bug when run at the REPL.

timholy · 2018-01-03T22:21:06Z

It's worth pointing out an interesting feature of this approach, since it addresses a concern about "poisoning". Consider x = 1:5 and the operation 2 .+ (x .+ (x .+ 1)) .- 1. Because it only involves numbers and ranges, each step of this will be eagerly evaluated and it will return the value 4:2:12, i.e., still a range.

Now change this to 2 .+ (x .* (x .+ 1)) .- 1. It will eagerly evaluate until it gets to the .*, at which point it will realize that the result can't be captured by an AbstractRange, and it will store the intermediate result as a Vector{Int}. From that point on, the remaining calculation is fused. So in a sense this circumvents the "there is no middle ground" problem, though it still suffers from allocating a temporary.

andyferris · 2018-01-04T13:07:35Z

and it will store the intermediate result as a Vector{Int}

Couldn't it store the intermediate result as a Broadcasted instead?

timholy · 2018-01-04T15:58:43Z

Once you've decided to eagerly evaluate, the problem is that (at least with the current implementation) storing as a Broadcasted will put you into a non-terminating recursion. But hmm, maybe one could construct it as a Broadcasted{VectorStyle}; that seems likely to circumvent the problem and avoid the creation of the temporary. I can play with that.

The other strategy is to figure out at the beginning that it won't return a range. That's a little non-trivial though, because it depends on which operations occur in which order. For example, if r is a UnitRange, then r .- 1 returns a ::UnitRange but 1 .- r returns a StepRange. With deep nesting, it seems like it might be pretty complicated to write something that would figure this out at compile-time (i.e., inferrably). So I took the easy way out and evaluated incrementally.

timholy · 2018-01-04T16:08:40Z

I now know more about the test failures. Here are the important ones:

core.jl: Hang in subtyping with sufficiently-complex Union #25388
numbers.jl: it's the literal_pow problem, replicable with [2, 4, 6].^-2 (which on this branch yields a DomainError)
there are many broadcast methods for special array types like Bidiagonal. These have to become copy(::Broadcasted) methods before this can be merged.
finally, the allocation and inference-limits failures in sparse/higherorderfcns.jl have to be addressed. The allocation is more serious, I'm less worried about the inference limits, because it's close now (works for 7 arguments but not 8). Presumably there's a single parameter that can be tweaked, or we can just live with 7 arguments.

Because it's not obvious this is going to be merged for 1.0, I will likely move on with the rest of my life (this has taken a lot of time...) and come back to these later, unless I hear there's interest in getting it into 1.0.

Sacha0 · 2018-01-04T23:44:39Z

(Triage seemed generally favorable pending further review.

Edit: To expand, triage wasn't terribly worried about e.g. the minor impact on sparse broadcast and presently <30-50% performance regressions mentioned elsewhere. Triage seemed uniformly enthused by the prospect of removing the literal slurping in broadcast lowering, and similarly interested in the other benefits this work might provide (e.g. enabling better accelerator support).)

andyferris · 2018-01-05T04:23:16Z

I'd love this for v1.0, assuming it is feasible to do so.

mbauman · 2018-01-05T19:13:11Z

base/broadcast.jl

-@inline broadcast!(f::Tf, dest, ::BroadcastStyle, As::Vararg{Any,N}) where {Tf,N} = broadcast!(f, dest, nothing, As...)
-
-# Default behavior (separated out so that it can be called by users who want to extend broadcast!).
-@inline function broadcast!(f, dest, ::Nothing, As::Vararg{Any, N}) where N


Hooray for removing this method. I was just about to open this issue:

julia> X = Any[1,2] 2-element Array{Any,1}: 1 2 julia> X .= missing 2-element Array{Any,1}: missing missing julia> X .= nothing ERROR: MethodError: no method matching _broadcast!(::typeof(identity), ::Array{Any,1}) Closest candidates are: _broadcast!(::Any, ::AbstractArray, ::K, ::ID, ::AT, ::BT, ::Val{N}, ::Any) where {K, ID, AT, BT, N} at broadcast.jl:377 _broadcast!(::Any, ::AbstractArray, ::K, ::ID, ::AT, ::Val{nargs}, ::Any, ::Any, ::Any) where {K, ID, AT, nargs} at broadcast.jl:489 _broadcast!(::Any, ::Any, ::Any, ::Any...) where N at broadcast.jl:477 ... Stacktrace: [1] broadcast!(::Function, ::Array{Any,1}, ::Nothing) at ./broadcast.jl:455 [2] top-level scope

but it looks like your branch will fix it (I've not tested to confirm, though). I can still open that issue if you want a bit more motivation here… :)

Great test, thanks! It wasn't fully fixed here, so this was useful.

SimonDanisch · 2018-01-06T18:04:45Z

I want to make my broadcast implementation in GPUArrays 0.7 compatible.
Will the API change a lot, or can I easily make things compatible with master, and once this is merged there won't be much work to support the changes from this branch?

If the first applies, would it make sense to base my refactor on this branch, or is there only a small chance that this will get merged any time soon?

The main one left is in concatenation, in a line inds[i] = offsets[i] .+ cat_indices(x, i)

If you've already said `using Test`, defining a function named `Test` causes problems.

This is consistent with the deprecation of methods like `[1,2,3] + 1`.

Among other things, this supports returning AbstractRanges for appropriate inputs. Fixes #21094, fixes #22053

timholy · 2018-01-07T22:04:03Z

Executive Busy developer summary

OK, except for Julia bugs I think this is basically finished: it has docs and NEWS, and test results all "make sense" (see subsection below). The docs in particular should make reviewing this easier. I'll flag two issues that merit particular discussion (see subsections below).

I may be able to work on the subtyping/inference bugs, but given how much I've sunk into this already I'll wait for at least one substantive review to find out if any of this is worth it 😄.

Finally, I tried to reorder the commits to separate out functionality, but the date stamps messed up the ordering as displayed by GitHub. Three or four of the commits are perhaps of value even if we don't merge the main work here; it's probably pretty obvious which ones these are.

Preventing circular evaluation (`StackOverflowError`s)

As described above, there is an "incremental" evaluation path. In its current state, failure to provide one of the necessary broadcast methods will result in circular evaluation and a StackOverflowError. It's presumably fixable if we require users to specialize this syntax:

copy(bc::Broadcasted{Style}, from_incremental::Bool)

rather than the single-argument method copy(bc). (Conveniently, the general technique was illustrated just today in a discourse post.) That has some obvious negatives: a 2-arg copy is weird, and even types that don't use incremental evaluation are still required to implement the 2-arg version.

Consider allowing a non-inferrable return type from `f.(::StructuredMatrix)`

If you define StructuredMatrix = Union{Diagonal,Bidiagonal,SymTridiagonal,Tridiagonal}, then currently on this branch f.(::T) where T<:StructuredMatrix infers as a Union{T,SparseMatrixCSC}. The reason is that there's a test for whether f(0) == 0; if so, it returns type T, and otherwise a (densified) SparseMatrixCSC. This represents a generalization & replacement of quite a few methods such as

broadcast(::typeof(ceil), D::Diagonal) = Diagonal(...)

We were a bit hit-and-miss about which of these were present (Diagonal did not have as many of these methods as some of the others), and only a few functions were covered. Conversely, the strategy here is exhaustive in covering all functions, at the cost having the return type inferred as a Union.

Of course I can change this back to always return a SparseMatrixCSC, but I thought I'd see whether folks think that our Union-splitting is good enough that this becomes attractive. If we like this, then perhaps I should change it to return a Union{T,<:Matrix} since the only advantage of returning a SparseMatrixCSC is for inferrability.

Test results

Unless stated otherwise, tests pass locally.

test/numbers.jl fails because of the literal_pow issue, and I don't personally have a plan for dealing with that---if we want that test to pass, presumably it will take collaboration with someone who speaks lisp more fluently than I.
test/core.jl fails because of Hang in subtyping with sufficiently-complex Union #25388
test/sparse/higherorderfcns.jl fails because of some inference Heisenbug. All I have to do is force one of the methods to recompile in the running julia session and many more tests (all those with fewer than 8 args) pass. The 8-arg case still fails.
that said, when inference gets fixed, it will pass only because I annotated the @allocated(expr) == 0 tests in test/sparse/higherorderfcns.jl as @test_broken. Despite adding @inlines willy-nilly I've been unable to get rid of all the allocations. (Maybe I'm missing some.)

timholy · 2018-01-07T22:05:07Z

@SimonDanisch, I'd recommend holding off for a few more days if you can, since I think we'll know the ultimate fate of this by then.

Sacha0 · 2018-01-08T05:45:12Z

doc/src/manual/interfaces.md

+   the results of ([`Base.BroadcastStyle`](@ref)) applied to the argument types
+ - execution of `copy(bc::Broadcasted{DestStyle})`, which in simple cases only requires that
+   you allocate the output with [`Base.broadcast_similar`](@ref). In more complex cases, you may
+   wish to specialize `copy` and/or `copy!` for `DestStyle`.


copyto!? :)

Also, did you you mean to number the stages or so?

mbauman · 2018-01-09T00:03:38Z

I wasn't quite clear what this ended up looking like, myself, so as a first pass of a review I figured it'd be helpful to others to see what this actually does. For now this is just a scratchpad for my own notes.

High-level end-user summary of fusion

julia> Meta.lower(Main, :(@. 3 + 2x + sin(x)))
:($(Expr(:thunk, CodeInfo(:(begin
      SSAValue(0) = (Base.getproperty)(Base.Broadcast, :execute)
      SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make)
      SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make)
      SSAValue(3) = (SSAValue(2))(*, 2, x)
      SSAValue(4) = (Base.getproperty)(Base.Broadcast, :make)
      SSAValue(5) = (SSAValue(4))(sin, x)
      SSAValue(6) = (SSAValue(1))(+, 3, SSAValue(3), SSAValue(5))
      SSAValue(7) = (SSAValue(0))(SSAValue(6))
      return SSAValue(7)
  end)))))

In other words, @. 3 + 2x + sin(x) lowers to:

julia> using Base.Broadcast: make, execute, instantiate

julia> x = 1:10
1:10

julia> bc = make(+, 3, make(*, 2, x), make(sin, x))
Broadcasted(+, 3, Broadcasted(*, 2, 1:10), Broadcasted(sin, 1:10))

julia> execute(bc)
10-element Array{Float64,1}:
 ⋮

Simple enough. make is basically a Broadcasted constructor (the type that represents the lazy function application), and execute demarcates the end of the laziness. execute is very simple, too — it's just defined as execute(x) = copy(instantiate(x)).

Nitty-gritty details

This belies a goodly amount of complexity, as evidenced by:

julia> typeof(bc)
Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Base.Broadcast.Unknown,Nothing,Nothing,typeof(+),Base.TupleLL{Int64,Base.TupleLL{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Base.Broadcast.Unknown,Nothing,Nothing,typeof(*),Base.TupleLL{Int64,Base.TupleLL{UnitRange{Int64},Base.TupleLLEnd}}},Base.TupleLL{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Base.Broadcast.Unknown,Nothing,Nothing,typeof(sin),Base.TupleLL{UnitRange{Int64},Base.TupleLLEnd}},Base.TupleLLEnd}}}}

Whoa whoa whoa. What's that? Okay, well, Broadcasted has a bunch of type parameters that help identify how the broadcasting should be done. Specifically, it's Broadcasted{Style<:Union{Nothing,BroadcastStyle}, ElType, Axes, Indexing<:Union{Nothing,TupleLL}, F, Args<:TupleLL}. As you can see in the above type, lots of those parameters are Broadcast.Unknown or Nothing. They haven't been decided yet — that happens with the instantiate function:

julia> bc′ = instantiate(bc)
       typeof(bc′)
Base.Broadcast.Broadcasted{
    #= Style  =# Base.Broadcast.DefaultArrayStyle{1},
    #= ElType =# Float64,
    #= Axes   =# Tuple{Base.OneTo{Int64}},
    #= Idxing =# Base.TupleLL{Tuple{Tuple{},Tuple{}},Base.TupleLL{Tuple{Tuple{Bool},Tuple{Int64}},Base.TupleLL{Tuple{Tuple{Bool},Tuple{Int64}},Base.TupleLLEnd}}},
    #= F      =# typeof(+),
    #= Args   =# Base.TupleLL{Int64,Base.TupleLL{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Int64,Tuple{Base.OneTo{Int64}},Base.TupleLL{Tuple{Tuple{},Tuple{}},Base.TupleLL{Tuple{Tuple{Bool},Tuple{Int64}},Base.TupleLLEnd}},typeof(*),Base.TupleLL{Int64,Base.TupleLL{UnitRange{Int64},Base.TupleLLEnd}}},Base.TupleLL{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Float64,Tuple{Base.OneTo{Int64}},Base.TupleLL{Tuple{Tuple{Bool},Tuple{Int64}},Base.TupleLLEnd},typeof(sin),Base.TupleLL{UnitRange{Int64},Base.TupleLLEnd}},Base.TupleLLEnd}}}
}

Aha, ok, now we can see that it's using the DefaultArrayStyle{1}, with an eltype of Float64. The last four parameters simply encode the types of the four fields of the struct — those are more easily understood via their values:

julia> bc′.axes
(Base.OneTo(10),)

julia> bc′.indexing
{((), ()), ((true,), (1,)), ((true,), (1,))}

julia> bc′.f
+ (generic function with 177 methods)

julia> bc′.args
{3, Broadcasted(*, 2, 1:10), Broadcasted(sin, 1:10)}

Makes sense — we have f and its arguments, as well as information about the resulting axes and how to index into each of the three arguments (I think). That funny thing that prints as a {} list? That's a TupleLL — a cons-list representation of a tuple-like object. Why not just use regular tuples? We want this to be type-stable for computations involving types, so it needs to store the specific Type{T} information for each type (and not just a DataType). It's not quite clear to me why the indexing property needs to be a TupleLL yet.

Incremental results

If I remove the sin(x) from the above broadcast, we end up with a nearly identical broadcast, but once we copy it (by way of execute) we get a range. Hooray!

julia> bc = instantiate(make(+, 3, make(*, 2, x)))
Broadcasted(+, 3, Broadcasted(*, 2, 1:10))

julia> copy(bc)
5:2:23

This occurs because of a hook that checks is_broadcast_incremental before doing the allocation of the return array and copyto!ing it. For this particular bc object, that returns true because it's a DefaultArrayStyle{1} and it recursively walks through all the wrapped functions to determine they're all known (probably) linear operations. It's not an exact calculation, though, since *(::Scalar, ::Range) is linear but *(::Range, ::Range) isn't. If this is true, though, then the broadcast machinery will recursively walk through the entire Broadcasted object and explicitly call broadcast(f, args...) for all the leaf-most-nodes. After each leaf, though, it "tries again" and re-runs all the execute broadcast machinery again with these new intermediate results. Thus, it checks again if it's still is_broadcast_incremental with these new results. This leads to that one temporary Tim talks about in the case where you have a complicated fused broadcast that the maybe-linear-operation-check guesses wrong on.

mbauman · 2018-01-09T00:07:40Z

base/broadcast.jl

-longest_tuple(A, B, Bs...) = longest_tuple(Bs...)
-longest_tuple(A::Tuple) = A
+    @simd for I in CartesianIndices(axes(bc))
+        @inbounds dest[I] = _broadcast_getindex(bc, I)


Nifty that you can index directly into the Broadcasted objects. Have you thought about making this a bit more first class? Like calling it simply getindex(::Broadcasted, …)? Or maybe even Broadcasted <: AbstractArray? Not crucial by any means. Just cool.

Yes, I have, and I've gone back and forth on that. Currently I haven't done that for one somewhat arcane reason: in A .+ (v .+ 1) (A is a matrix and v a vector), the inner broadcasting operation with v doesn't have any objects that have 2d indices. Consequently it seems that bcinner[i, j] should throw a BoundsError if j>1, but _broadcast_getindex doesn't fall victim to that requirement.

Now, you might argue that because bcinner will typically be instantiated, the axes field indicates that really, it does have suitable indices. But not all Broadcasted types get instantiated (e.g., for StaticArrays one may want to short-circuit the axes field computation). Since in such cases there's no hint that bcinner should be supported over a larger domain, I decided not to call it getindex.

Like I said, pretty arcane.

Could we define getindex just on Broadcasted objects that have instantiated indexers? The main reason to consider this is that it provides a very nice public & stable API for working with Broadcasted objects. Folks will need to handle them in their copyto! methods, but it'd be nice to be able to draw some clear lines within Base.Broadcast on what's public and what's private.

How are you imagining StaticArrays would use this? Broadcast.flatten?

broadcasting by preprocessing the arguments to potentially wrap them with indexing helpers.

* Slightly clearer recursion through arg lists in not_nested * Move show(::IO, ::Broadcasted) to a more sensible location and have it print its type fully qualified with the `Style` parameter.

* origin/master: (23 commits) fix deprecations of \cdot and \times (#26884) Support reshaping custom 0-dimensional arrays (#26870) fix some cases of dot syntax lowering (#26878) Pkg3: deterministically close the LibGit2 repo in tests (#26883) code loading docs: add missing graph edge (#26874) add news for #26858 and #26859 [ci skip] (#26869) Deprecate using && and || within at-dot expressions (#26792) widen `Int8` and `Int16` to `Int` instead of `Int32` (#26859) fix #26038, make `isequal` consistent with `hash` for `Ptr` (#26858) Deprecate variadic size(A, dim1, dim2, dims...) method (#26862) add using Random to example in manual (#26864) warn once instead of depwarn since we want to test it Revert "reserve syntax that could be used for computed field types (#18466) (#26816)" (#26857) Fix compilation on LLVM 6.0 change promotion behaviour of `cumsum` and `cumsum!` to match `sum` [LLVM 6] add patch to diamond if-conversion add a precompile command that can be used to precompile all dependencies (#254) use registry if no version entry exist in project for developed pacakges make Pkg3 work as a drop in for the old CI scripts update registries when adding (#253) ...

mbauman · 2018-04-23T19:24:19Z

Ok, I think I have the performance as good as I can make it for now. There are still two classes of regressions:

A failure to emit vectorized instructions in a place where we would have vectorized before. Given f(r, x) = r .= x.*x.*x.*x, code_llvm(f, Tuple{Vector{Float64}, Vector{Float64}}) emits SIMD vectorized instructions on master but not on this branch. This results in a 1.5x performance degradation in this example and is responsible for the 1.85x degradation in the dotop test Nanosoldier flagged. I am not able to understand why LLVM isn't vectorizing this on my branch.
A higher cost to broadcasting over non-type-stable functions or non-concretely-typed collections (including small discriminated unions). This is responsible for the 4-8x regressions in the union tests. I've been able to mitigate this somewhat by performing the same sort of preprocessing as we do in copyto! — that is, precomputing indexing helpers and wrapping arguments as necessary — but I'm still seeing 3x-6x regressions locally. I had hoped the new optimizer would prove more willing to work with the resulting Union{Vector{T}, Vector{S}, Vector{Union{T,S}}}, but so far my rudimentary tests with enable_new_optimizer have shown the same performance.

Everything else here is looking great. Given the many issues this resolves I'm inclined to merge with those two items outstanding. Given the messy history, I'll rebase into one well-documented squashed commit that attributes all four authors here as coauthors. I'll leave this PR as is — I believe the history here is useful — and create a new PR.

andreasnoack · 2018-04-23T19:37:59Z

Do we have any quantifications of the compile time consequences of this change?

jekbradbury · 2018-04-23T19:45:39Z

What about your earlier thought that "we could just use axes"? Does following that through to its conclusion lead to unexpected issues?

mbauman · 2018-04-23T19:49:49Z

Do we have any quantifications of the compile time consequences of this change?

Timing that right now.

What about your earlier thought that "we could just use axes"? Does following that through to its conclusion lead to unexpected issues?

We have two types that we use heavily in broadcast that don't define axes (and are thus "patched up" in broadcast_axes): Ref and Tuple. I'd like to define axes directly on those both, but it seems like a completely orthogonal change that should be tackled independently. So for the time being, broadcast_axes remains.

JeffBezanson · 2018-04-23T20:26:21Z

NEWS.md

+    `AbstractDict`, `AbstractString`, `Tuple` and `NamedTuple` objects ([#24774]).
+    In particular, this means that it returns `CartesianIndex` objects for matrices
+    and higher-dimensional arrays instead of linear indices as was previously the case.
+    Use `Int[LinearIndices(size(a))[i] for i in find(f, a)]` to compute linear indices.


This looks redundant with the next news item (and partly contradicts it).

Ah, thanks for the reminder. This was a merge mistake.

mbauman · 2018-04-23T20:36:41Z

Pseudo-compile-time tests:

test	this PR (`110a0a5`)	master (`692408c`)
make && touch base/version.jl && time make	4m26.788s	4m33.808s
time ./julia stdlib/LinearAlgebra/test/triangular.jl	6m39.415s	6m36.216s
time ./julia -e 'using Random, Test; include("test/broadcast.jl")'	0m40.090s	0m38.218s
time ./julia stdlib/SparseArrays/test/higherorderfns.jl	1m57.239s	1m40.889s
time ./julia stdlib/SparseArrays/test/sparsevector.jl	2m33.843s	2m34.833s
time ./julia stdlib/LinearAlgebra/test/dense.jl	2m30.039s	2m22.954s
perf_op_bcast (1)	0.248796	0.034302
perf_op_bcast (2)	0.051545	0.045142

The perf_op_bcast are the results of this interactive julia prompt, sequentially numbered. I believe the regression in (1) simply reflects the lack of the appropriate precompile statements.

julia> perf_op_bcast!(r, x) = r .= 3 .* x .- 4 .* x.^2 .+ x .* x .- x .^ 3;

julia> @time perf_op_bcast!([1],[1])
  0.034302 seconds (43.61 k allocations: 2.416 MiB)
1-element Array{Int64,1}:
 -1

julia> perf_op_bcast!(r, x) = r .= 3 .* x .- 4 .* x.^2 .+ x .* x .- x .^ 3;

julia> @time perf_op_bcast!([1],[1])
  0.045142 seconds (39.13 k allocations: 2.140 MiB)
1-element Array{Int64,1}:
 -1

[ci skip]

@lower

This patch represents the combined efforts of four individuals, over 60 commits, and an iterated design over (at least) three pull requests that spanned nearly an entire year (closes #22063, #23692, #25377 by superceding them). This introduces a pure Julia data structure that represents a fused broadcast expression. For example, the expression `2 .* (x .+ 1)` lowers to: ```julia julia> Meta.@lower 2 .* (x .+ 1) :($(Expr(:thunk, CodeInfo(:(begin Core.SSAValue(0) = (Base.getproperty)(Base.Broadcast, :materialize) Core.SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(3) = (Core.SSAValue(2))(+, x, 1) Core.SSAValue(4) = (Core.SSAValue(1))(*, 2, Core.SSAValue(3)) Core.SSAValue(5) = (Core.SSAValue(0))(Core.SSAValue(4)) return Core.SSAValue(5) end))))) ``` Or, slightly more readably as: ```julia using .Broadcast: materialize, make materialize(make(*, 2, make(+, x, 1))) ``` The `Broadcast.make` function serves two purposes. Its primary purpose is to construct the `Broadcast.Broadcasted` objects that hold onto the function, the tuple of arguments (potentially including nested `Broadcasted` arguments), and sometimes a set of `axes` to include knowledge of the outer shape. The secondary purpose, however, is to allow an "out" for objects that _don't_ want to participate in fusion. For example, if `x` is a range in the above `2 .* (x .+ 1)` expression, it needn't allocate an array and operate elementwise — it can just compute and return a new range. Thus custom structures are able to specialize `Broadcast.make(f, args...)` just as they'd specialize on `f` normally to return an immediate result. `Broadcast.materialize` is identity for everything _except_ `Broadcasted` objects for which it allocates an appropriate result and computes the broadcast. It does two things: it `initialize`s the outermost `Broadcasted` object to compute its axes and then `copy`s it. Similarly, an in-place fused broadcast like `y .= 2 .* (x .+ 1)` uses the exact same expression tree to compute the right-hand side of the expression as above, and then uses `materialize!(y, make(*, 2, make(+, x, 1)))` to `instantiate` the `Broadcasted` expression tree and then `copyto!` it into the given destination. All-together, this forms a complete API for custom types to extend and customize the behavior of broadcast (fixes #22060). It uses the existing `BroadcastStyle`s throughout to simplify dispatch on many arguments: * Custom types can opt-out of broadcast fusion by specializing `Broadcast.make(f, args...)` or `Broadcast.make(::BroadcastStyle, f, args...)`. * The `Broadcasted` object computes and stores the type of the combined `BroadcastStyle` of its arguments as its first type parameter, allowing for easy dispatch and specialization. * Custom Broadcast storage is still allocated via `broadcast_similar`, however instead of passing just a function as a first argument, the entire `Broadcasted` object is passed as a final argument. This potentially allows for much more runtime specialization dependent upon the exact expression given. * Custom broadcast implmentations for a `CustomStyle` are defined by specializing `copy(bc::Broadcasted{CustomStyle})` or `copyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})`. * Fallback broadcast specializations for a given output object of type `Dest` (for the `DefaultArrayStyle` or another such style that hasn't implemented assignments into such an object) are defined by specializing `copyto(dest::Dest, bc::Broadcasted{Nothing})`. As it fully supports range broadcasting, this now deprecates `(1:5) + 2` to `.+`, just as had been done for all `AbstractArray`s in general. As a first-mover proof of concept, LinearAlgebra uses this new system to improve broadcasting over structured arrays. Before, broadcasting over a structured matrix would result in a sparse array. Now, broadcasting over a structured matrix will _either_ return an appropriately structured matrix _or_ a dense array. This does incur a type instability (in the form of a discriminated union) in some situations, but thanks to type-based introspection of the `Broadcasted` wrapper commonly used functions can be special cased to be type stable. For example: ```julia julia> f(d) = round.(Int, d) f (generic function with 1 method) julia> @inferred f(Diagonal(rand(3))) 3×3 Diagonal{Int64,Array{Int64,1}}: 0 ⋅ ⋅ ⋅ 0 ⋅ ⋅ ⋅ 1 julia> @inferred Diagonal(rand(3)) .* 3 ERROR: return type Diagonal{Float64,Array{Float64,1}} does not match inferred return type Union{Array{Float64,2}, Diagonal{Float64,Array{Float64,1}}} Stacktrace: [1] error(::String) at ./error.jl:33 [2] top-level scope julia> @inferred Diagonal(1:4) .+ Bidiagonal(rand(4), rand(3), 'U') .* Tridiagonal(1:3, 1:4, 1:3) 4×4 Tridiagonal{Float64,Array{Float64,1}}: 1.30771 0.838589 ⋅ ⋅ 0.0 3.89109 0.0459757 ⋅ ⋅ 0.0 4.48033 2.51508 ⋅ ⋅ 0.0 6.23739 ``` In addition to the issues referenced above, it fixes: * Fixes #19313, #22053, #23445, and #24586: Literals are no longer treated specially in a fused broadcast; they're just arguments in a `Broadcasted` object like everything else. * Fixes #21094: Since broadcasting is now represented by a pure Julia datastructure it can be created within `@generated` functions and serialized. * Fixes #26097: The fallback destination-array specialization method of `copyto!` is specifically implemented as `Broadcasted{Nothing}` and will not be confused by `nothing` arguments. * Fixes the broadcast-specific element of #25499: The default base broadcast implementation no longer depends upon `Base._return_type` to allocate its array (except in the empty or concretely-type cases). Note that the sparse implementation (#19595) is still dependent upon inference and is _not_ fixed. * Fixes #25340: Functions are treated like normal values just like arguments and only evaluated once. * Fixes #22255, and is performant with 12+ fused broadcasts. Okay, that one was fixed on master already, but this fixes it now, too. * Fixes #25521. * The performance of this patch has been thoroughly tested through its iterative development process in #25377. There remain [two classes of performance regressions](#25377) that Nanosoldier flagged. * #25691: Propagation of constant literals sill lose their constant-ness upon going through the broadcast machinery. I believe quite a large number of functions would need to be marked as `@pure` to support this -- including functions that are intended to be specialized. (For bookkeeping, this is the squashed version of the [teh-jn/lazydotfuse](#25377) branch as of a1d4e7e. Squashed and separated out to make it easier to review and commit) Co-authored-by: Tim Holy <tim.holy@gmail.com> Co-authored-by: Jameson Nash <vtjnash@gmail.com> Co-authored-by: Andrew Keller <ajkeller34@users.noreply.github.com>

JeffBezanson · 2018-04-24T03:06:50Z

Should this be closed?

mbauman · 2018-04-24T15:01:42Z

Sure, let's make sure discussion is consolidated in the new PR: #26891.

@lower

This patch represents the combined efforts of four individuals, over 60 commits, and an iterated design over (at least) three pull requests that spanned nearly an entire year (closes #22063, #23692, #25377 by superceding them). This introduces a pure Julia data structure that represents a fused broadcast expression. For example, the expression `2 .* (x .+ 1)` lowers to: ```julia julia> Meta.@lower 2 .* (x .+ 1) :($(Expr(:thunk, CodeInfo(:(begin Core.SSAValue(0) = (Base.getproperty)(Base.Broadcast, :materialize) Core.SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(3) = (Core.SSAValue(2))(+, x, 1) Core.SSAValue(4) = (Core.SSAValue(1))(*, 2, Core.SSAValue(3)) Core.SSAValue(5) = (Core.SSAValue(0))(Core.SSAValue(4)) return Core.SSAValue(5) end))))) ``` Or, slightly more readably as: ```julia using .Broadcast: materialize, make materialize(make(*, 2, make(+, x, 1))) ``` The `Broadcast.make` function serves two purposes. Its primary purpose is to construct the `Broadcast.Broadcasted` objects that hold onto the function, the tuple of arguments (potentially including nested `Broadcasted` arguments), and sometimes a set of `axes` to include knowledge of the outer shape. The secondary purpose, however, is to allow an "out" for objects that _don't_ want to participate in fusion. For example, if `x` is a range in the above `2 .* (x .+ 1)` expression, it needn't allocate an array and operate elementwise — it can just compute and return a new range. Thus custom structures are able to specialize `Broadcast.make(f, args...)` just as they'd specialize on `f` normally to return an immediate result. `Broadcast.materialize` is identity for everything _except_ `Broadcasted` objects for which it allocates an appropriate result and computes the broadcast. It does two things: it `initialize`s the outermost `Broadcasted` object to compute its axes and then `copy`s it. Similarly, an in-place fused broadcast like `y .= 2 .* (x .+ 1)` uses the exact same expression tree to compute the right-hand side of the expression as above, and then uses `materialize!(y, make(*, 2, make(+, x, 1)))` to `instantiate` the `Broadcasted` expression tree and then `copyto!` it into the given destination. All-together, this forms a complete API for custom types to extend and customize the behavior of broadcast (fixes #22060). It uses the existing `BroadcastStyle`s throughout to simplify dispatch on many arguments: * Custom types can opt-out of broadcast fusion by specializing `Broadcast.make(f, args...)` or `Broadcast.make(::BroadcastStyle, f, args...)`. * The `Broadcasted` object computes and stores the type of the combined `BroadcastStyle` of its arguments as its first type parameter, allowing for easy dispatch and specialization. * Custom Broadcast storage is still allocated via `broadcast_similar`, however instead of passing just a function as a first argument, the entire `Broadcasted` object is passed as a final argument. This potentially allows for much more runtime specialization dependent upon the exact expression given. * Custom broadcast implmentations for a `CustomStyle` are defined by specializing `copy(bc::Broadcasted{CustomStyle})` or `copyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})`. * Fallback broadcast specializations for a given output object of type `Dest` (for the `DefaultArrayStyle` or another such style that hasn't implemented assignments into such an object) are defined by specializing `copyto(dest::Dest, bc::Broadcasted{Nothing})`. As it fully supports range broadcasting, this now deprecates `(1:5) + 2` to `.+`, just as had been done for all `AbstractArray`s in general. As a first-mover proof of concept, LinearAlgebra uses this new system to improve broadcasting over structured arrays. Before, broadcasting over a structured matrix would result in a sparse array. Now, broadcasting over a structured matrix will _either_ return an appropriately structured matrix _or_ a dense array. This does incur a type instability (in the form of a discriminated union) in some situations, but thanks to type-based introspection of the `Broadcasted` wrapper commonly used functions can be special cased to be type stable. For example: ```julia julia> f(d) = round.(Int, d) f (generic function with 1 method) julia> @inferred f(Diagonal(rand(3))) 3×3 Diagonal{Int64,Array{Int64,1}}: 0 ⋅ ⋅ ⋅ 0 ⋅ ⋅ ⋅ 1 julia> @inferred Diagonal(rand(3)) .* 3 ERROR: return type Diagonal{Float64,Array{Float64,1}} does not match inferred return type Union{Array{Float64,2}, Diagonal{Float64,Array{Float64,1}}} Stacktrace: [1] error(::String) at ./error.jl:33 [2] top-level scope julia> @inferred Diagonal(1:4) .+ Bidiagonal(rand(4), rand(3), 'U') .* Tridiagonal(1:3, 1:4, 1:3) 4×4 Tridiagonal{Float64,Array{Float64,1}}: 1.30771 0.838589 ⋅ ⋅ 0.0 3.89109 0.0459757 ⋅ ⋅ 0.0 4.48033 2.51508 ⋅ ⋅ 0.0 6.23739 ``` In addition to the issues referenced above, it fixes: * Fixes #19313, #22053, #23445, and #24586: Literals are no longer treated specially in a fused broadcast; they're just arguments in a `Broadcasted` object like everything else. * Fixes #21094: Since broadcasting is now represented by a pure Julia datastructure it can be created within `@generated` functions and serialized. * Fixes #26097: The fallback destination-array specialization method of `copyto!` is specifically implemented as `Broadcasted{Nothing}` and will not be confused by `nothing` arguments. * Fixes the broadcast-specific element of #25499: The default base broadcast implementation no longer depends upon `Base._return_type` to allocate its array (except in the empty or concretely-type cases). Note that the sparse implementation (#19595) is still dependent upon inference and is _not_ fixed. * Fixes #25340: Functions are treated like normal values just like arguments and only evaluated once. * Fixes #22255, and is performant with 12+ fused broadcasts. Okay, that one was fixed on master already, but this fixes it now, too. * Fixes #25521. * The performance of this patch has been thoroughly tested through its iterative development process in #25377. There remain [two classes of performance regressions](#25377) that Nanosoldier flagged. * #25691: Propagation of constant literals sill lose their constant-ness upon going through the broadcast machinery. I believe quite a large number of functions would need to be marked as `@pure` to support this -- including functions that are intended to be specialized. (For bookkeeping, this is the squashed version of the [teh-jn/lazydotfuse](#25377) branch as of a1d4e7e. Squashed and separated out to make it easier to review and commit) Co-authored-by: Tim Holy <tim.holy@gmail.com> Co-authored-by: Jameson Nash <vtjnash@gmail.com> Co-authored-by: Andrew Keller <ajkeller34@users.noreply.github.com>

@lower

This patch represents the combined efforts of four individuals, over 60 commits, and an iterated design over (at least) three pull requests that spanned nearly an entire year (closes #22063, #23692, #25377 by superceding them). This introduces a pure Julia data structure that represents a fused broadcast expression. For example, the expression `2 .* (x .+ 1)` lowers to: ```julia julia> Meta.@lower 2 .* (x .+ 1) :($(Expr(:thunk, CodeInfo(:(begin Core.SSAValue(0) = (Base.getproperty)(Base.Broadcast, :materialize) Core.SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(3) = (Core.SSAValue(2))(+, x, 1) Core.SSAValue(4) = (Core.SSAValue(1))(*, 2, Core.SSAValue(3)) Core.SSAValue(5) = (Core.SSAValue(0))(Core.SSAValue(4)) return Core.SSAValue(5) end))))) ``` Or, slightly more readably as: ```julia using .Broadcast: materialize, make materialize(make(*, 2, make(+, x, 1))) ``` The `Broadcast.make` function serves two purposes. Its primary purpose is to construct the `Broadcast.Broadcasted` objects that hold onto the function, the tuple of arguments (potentially including nested `Broadcasted` arguments), and sometimes a set of `axes` to include knowledge of the outer shape. The secondary purpose, however, is to allow an "out" for objects that _don't_ want to participate in fusion. For example, if `x` is a range in the above `2 .* (x .+ 1)` expression, it needn't allocate an array and operate elementwise — it can just compute and return a new range. Thus custom structures are able to specialize `Broadcast.make(f, args...)` just as they'd specialize on `f` normally to return an immediate result. `Broadcast.materialize` is identity for everything _except_ `Broadcasted` objects for which it allocates an appropriate result and computes the broadcast. It does two things: it `initialize`s the outermost `Broadcasted` object to compute its axes and then `copy`s it. Similarly, an in-place fused broadcast like `y .= 2 .* (x .+ 1)` uses the exact same expression tree to compute the right-hand side of the expression as above, and then uses `materialize!(y, make(*, 2, make(+, x, 1)))` to `instantiate` the `Broadcasted` expression tree and then `copyto!` it into the given destination. All-together, this forms a complete API for custom types to extend and customize the behavior of broadcast (fixes #22060). It uses the existing `BroadcastStyle`s throughout to simplify dispatch on many arguments: * Custom types can opt-out of broadcast fusion by specializing `Broadcast.make(f, args...)` or `Broadcast.make(::BroadcastStyle, f, args...)`. * The `Broadcasted` object computes and stores the type of the combined `BroadcastStyle` of its arguments as its first type parameter, allowing for easy dispatch and specialization. * Custom Broadcast storage is still allocated via `broadcast_similar`, however instead of passing just a function as a first argument, the entire `Broadcasted` object is passed as a final argument. This potentially allows for much more runtime specialization dependent upon the exact expression given. * Custom broadcast implmentations for a `CustomStyle` are defined by specializing `copy(bc::Broadcasted{CustomStyle})` or `copyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})`. * Fallback broadcast specializations for a given output object of type `Dest` (for the `DefaultArrayStyle` or another such style that hasn't implemented assignments into such an object) are defined by specializing `copyto(dest::Dest, bc::Broadcasted{Nothing})`. As it fully supports range broadcasting, this now deprecates `(1:5) + 2` to `.+`, just as had been done for all `AbstractArray`s in general. As a first-mover proof of concept, LinearAlgebra uses this new system to improve broadcasting over structured arrays. Before, broadcasting over a structured matrix would result in a sparse array. Now, broadcasting over a structured matrix will _either_ return an appropriately structured matrix _or_ a dense array. This does incur a type instability (in the form of a discriminated union) in some situations, but thanks to type-based introspection of the `Broadcasted` wrapper commonly used functions can be special cased to be type stable. For example: ```julia julia> f(d) = round.(Int, d) f (generic function with 1 method) julia> @inferred f(Diagonal(rand(3))) 3×3 Diagonal{Int64,Array{Int64,1}}: 0 ⋅ ⋅ ⋅ 0 ⋅ ⋅ ⋅ 1 julia> @inferred Diagonal(rand(3)) .* 3 ERROR: return type Diagonal{Float64,Array{Float64,1}} does not match inferred return type Union{Array{Float64,2}, Diagonal{Float64,Array{Float64,1}}} Stacktrace: [1] error(::String) at ./error.jl:33 [2] top-level scope julia> @inferred Diagonal(1:4) .+ Bidiagonal(rand(4), rand(3), 'U') .* Tridiagonal(1:3, 1:4, 1:3) 4×4 Tridiagonal{Float64,Array{Float64,1}}: 1.30771 0.838589 ⋅ ⋅ 0.0 3.89109 0.0459757 ⋅ ⋅ 0.0 4.48033 2.51508 ⋅ ⋅ 0.0 6.23739 ``` In addition to the issues referenced above, it fixes: * Fixes #19313, #22053, #23445, and #24586: Literals are no longer treated specially in a fused broadcast; they're just arguments in a `Broadcasted` object like everything else. * Fixes #21094: Since broadcasting is now represented by a pure Julia datastructure it can be created within `@generated` functions and serialized. * Fixes #26097: The fallback destination-array specialization method of `copyto!` is specifically implemented as `Broadcasted{Nothing}` and will not be confused by `nothing` arguments. * Fixes the broadcast-specific element of #25499: The default base broadcast implementation no longer depends upon `Base._return_type` to allocate its array (except in the empty or concretely-type cases). Note that the sparse implementation (#19595) is still dependent upon inference and is _not_ fixed. * Fixes #25340: Functions are treated like normal values just like arguments and only evaluated once. * Fixes #22255, and is performant with 12+ fused broadcasts. Okay, that one was fixed on master already, but this fixes it now, too. * Fixes #25521. * The performance of this patch has been thoroughly tested through its iterative development process in #25377. There remain [two classes of performance regressions](#25377) that Nanosoldier flagged. * #25691: Propagation of constant literals sill lose their constant-ness upon going through the broadcast machinery. I believe quite a large number of functions would need to be marked as `@pure` to support this -- including functions that are intended to be specialized. (For bookkeeping, this is the squashed version of the [teh-jn/lazydotfuse](JuliaLang/julia#25377) branch as of a1d4e7ec9756ada74fb48f2c514615b9d981cf5c. Squashed and separated out to make it easier to review and commit) Co-authored-by: Tim Holy <tim.holy@gmail.com> Co-authored-by: Jameson Nash <vtjnash@gmail.com> Co-authored-by: Andrew Keller <ajkeller34@users.noreply.github.com>

@lower

This patch represents the combined efforts of four individuals, over 60 commits, and an iterated design over (at least) three pull requests that spanned nearly an entire year (closes #22063, #23692, #25377 by superceding them). This introduces a pure Julia data structure that represents a fused broadcast expression. For example, the expression `2 .* (x .+ 1)` lowers to: ```julia julia> Meta.@lower 2 .* (x .+ 1) :($(Expr(:thunk, CodeInfo(:(begin Core.SSAValue(0) = (Base.getproperty)(Base.Broadcast, :materialize) Core.SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(3) = (Core.SSAValue(2))(+, x, 1) Core.SSAValue(4) = (Core.SSAValue(1))(*, 2, Core.SSAValue(3)) Core.SSAValue(5) = (Core.SSAValue(0))(Core.SSAValue(4)) return Core.SSAValue(5) end))))) ``` Or, slightly more readably as: ```julia using .Broadcast: materialize, make materialize(make(*, 2, make(+, x, 1))) ``` The `Broadcast.make` function serves two purposes. Its primary purpose is to construct the `Broadcast.Broadcasted` objects that hold onto the function, the tuple of arguments (potentially including nested `Broadcasted` arguments), and sometimes a set of `axes` to include knowledge of the outer shape. The secondary purpose, however, is to allow an "out" for objects that _don't_ want to participate in fusion. For example, if `x` is a range in the above `2 .* (x .+ 1)` expression, it needn't allocate an array and operate elementwise — it can just compute and return a new range. Thus custom structures are able to specialize `Broadcast.make(f, args...)` just as they'd specialize on `f` normally to return an immediate result. `Broadcast.materialize` is identity for everything _except_ `Broadcasted` objects for which it allocates an appropriate result and computes the broadcast. It does two things: it `initialize`s the outermost `Broadcasted` object to compute its axes and then `copy`s it. Similarly, an in-place fused broadcast like `y .= 2 .* (x .+ 1)` uses the exact same expression tree to compute the right-hand side of the expression as above, and then uses `materialize!(y, make(*, 2, make(+, x, 1)))` to `instantiate` the `Broadcasted` expression tree and then `copyto!` it into the given destination. All-together, this forms a complete API for custom types to extend and customize the behavior of broadcast (fixes #22060). It uses the existing `BroadcastStyle`s throughout to simplify dispatch on many arguments: * Custom types can opt-out of broadcast fusion by specializing `Broadcast.make(f, args...)` or `Broadcast.make(::BroadcastStyle, f, args...)`. * The `Broadcasted` object computes and stores the type of the combined `BroadcastStyle` of its arguments as its first type parameter, allowing for easy dispatch and specialization. * Custom Broadcast storage is still allocated via `broadcast_similar`, however instead of passing just a function as a first argument, the entire `Broadcasted` object is passed as a final argument. This potentially allows for much more runtime specialization dependent upon the exact expression given. * Custom broadcast implmentations for a `CustomStyle` are defined by specializing `copy(bc::Broadcasted{CustomStyle})` or `copyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})`. * Fallback broadcast specializations for a given output object of type `Dest` (for the `DefaultArrayStyle` or another such style that hasn't implemented assignments into such an object) are defined by specializing `copyto(dest::Dest, bc::Broadcasted{Nothing})`. As it fully supports range broadcasting, this now deprecates `(1:5) + 2` to `.+`, just as had been done for all `AbstractArray`s in general. As a first-mover proof of concept, LinearAlgebra uses this new system to improve broadcasting over structured arrays. Before, broadcasting over a structured matrix would result in a sparse array. Now, broadcasting over a structured matrix will _either_ return an appropriately structured matrix _or_ a dense array. This does incur a type instability (in the form of a discriminated union) in some situations, but thanks to type-based introspection of the `Broadcasted` wrapper commonly used functions can be special cased to be type stable. For example: ```julia julia> f(d) = round.(Int, d) f (generic function with 1 method) julia> @inferred f(Diagonal(rand(3))) 3×3 Diagonal{Int64,Array{Int64,1}}: 0 ⋅ ⋅ ⋅ 0 ⋅ ⋅ ⋅ 1 julia> @inferred Diagonal(rand(3)) .* 3 ERROR: return type Diagonal{Float64,Array{Float64,1}} does not match inferred return type Union{Array{Float64,2}, Diagonal{Float64,Array{Float64,1}}} Stacktrace: [1] error(::String) at ./error.jl:33 [2] top-level scope julia> @inferred Diagonal(1:4) .+ Bidiagonal(rand(4), rand(3), 'U') .* Tridiagonal(1:3, 1:4, 1:3) 4×4 Tridiagonal{Float64,Array{Float64,1}}: 1.30771 0.838589 ⋅ ⋅ 0.0 3.89109 0.0459757 ⋅ ⋅ 0.0 4.48033 2.51508 ⋅ ⋅ 0.0 6.23739 ``` In addition to the issues referenced above, it fixes: * Fixes #19313, #22053, #23445, and #24586: Literals are no longer treated specially in a fused broadcast; they're just arguments in a `Broadcasted` object like everything else. * Fixes #21094: Since broadcasting is now represented by a pure Julia datastructure it can be created within `@generated` functions and serialized. * Fixes #26097: The fallback destination-array specialization method of `copyto!` is specifically implemented as `Broadcasted{Nothing}` and will not be confused by `nothing` arguments. * Fixes the broadcast-specific element of #25499: The default base broadcast implementation no longer depends upon `Base._return_type` to allocate its array (except in the empty or concretely-type cases). Note that the sparse implementation (#19595) is still dependent upon inference and is _not_ fixed. * Fixes #25340: Functions are treated like normal values just like arguments and only evaluated once. * Fixes #22255, and is performant with 12+ fused broadcasts. Okay, that one was fixed on master already, but this fixes it now, too. * Fixes #25521. * The performance of this patch has been thoroughly tested through its iterative development process in #25377. There remain [two classes of performance regressions](#25377) that Nanosoldier flagged. * #25691: Propagation of constant literals sill lose their constant-ness upon going through the broadcast machinery. I believe quite a large number of functions would need to be marked as `@pure` to support this -- including functions that are intended to be specialized. (For bookkeeping, this is the squashed version of the [teh-jn/lazydotfuse](JuliaLang/julia#25377) branch as of a1d4e7ec9756ada74fb48f2c514615b9d981cf5c. Squashed and separated out to make it easier to review and commit) Co-authored-by: Tim Holy <tim.holy@gmail.com> Co-authored-by: Jameson Nash <vtjnash@gmail.com> Co-authored-by: Andrew Keller <ajkeller34@users.noreply.github.com>

@lower

This patch represents the combined efforts of four individuals, over 60 commits, and an iterated design over (at least) three pull requests that spanned nearly an entire year (closes #22063, #23692, #25377 by superceding them). This introduces a pure Julia data structure that represents a fused broadcast expression. For example, the expression `2 .* (x .+ 1)` lowers to: ```julia julia> Meta.@lower 2 .* (x .+ 1) :($(Expr(:thunk, CodeInfo(:(begin Core.SSAValue(0) = (Base.getproperty)(Base.Broadcast, :materialize) Core.SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(3) = (Core.SSAValue(2))(+, x, 1) Core.SSAValue(4) = (Core.SSAValue(1))(*, 2, Core.SSAValue(3)) Core.SSAValue(5) = (Core.SSAValue(0))(Core.SSAValue(4)) return Core.SSAValue(5) end))))) ``` Or, slightly more readably as: ```julia using .Broadcast: materialize, make materialize(make(*, 2, make(+, x, 1))) ``` The `Broadcast.make` function serves two purposes. Its primary purpose is to construct the `Broadcast.Broadcasted` objects that hold onto the function, the tuple of arguments (potentially including nested `Broadcasted` arguments), and sometimes a set of `axes` to include knowledge of the outer shape. The secondary purpose, however, is to allow an "out" for objects that _don't_ want to participate in fusion. For example, if `x` is a range in the above `2 .* (x .+ 1)` expression, it needn't allocate an array and operate elementwise — it can just compute and return a new range. Thus custom structures are able to specialize `Broadcast.make(f, args...)` just as they'd specialize on `f` normally to return an immediate result. `Broadcast.materialize` is identity for everything _except_ `Broadcasted` objects for which it allocates an appropriate result and computes the broadcast. It does two things: it `initialize`s the outermost `Broadcasted` object to compute its axes and then `copy`s it. Similarly, an in-place fused broadcast like `y .= 2 .* (x .+ 1)` uses the exact same expression tree to compute the right-hand side of the expression as above, and then uses `materialize!(y, make(*, 2, make(+, x, 1)))` to `instantiate` the `Broadcasted` expression tree and then `copyto!` it into the given destination. All-together, this forms a complete API for custom types to extend and customize the behavior of broadcast (fixes #22060). It uses the existing `BroadcastStyle`s throughout to simplify dispatch on many arguments: * Custom types can opt-out of broadcast fusion by specializing `Broadcast.make(f, args...)` or `Broadcast.make(::BroadcastStyle, f, args...)`. * The `Broadcasted` object computes and stores the type of the combined `BroadcastStyle` of its arguments as its first type parameter, allowing for easy dispatch and specialization. * Custom Broadcast storage is still allocated via `broadcast_similar`, however instead of passing just a function as a first argument, the entire `Broadcasted` object is passed as a final argument. This potentially allows for much more runtime specialization dependent upon the exact expression given. * Custom broadcast implmentations for a `CustomStyle` are defined by specializing `copy(bc::Broadcasted{CustomStyle})` or `copyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})`. * Fallback broadcast specializations for a given output object of type `Dest` (for the `DefaultArrayStyle` or another such style that hasn't implemented assignments into such an object) are defined by specializing `copyto(dest::Dest, bc::Broadcasted{Nothing})`. As it fully supports range broadcasting, this now deprecates `(1:5) + 2` to `.+`, just as had been done for all `AbstractArray`s in general. As a first-mover proof of concept, LinearAlgebra uses this new system to improve broadcasting over structured arrays. Before, broadcasting over a structured matrix would result in a sparse array. Now, broadcasting over a structured matrix will _either_ return an appropriately structured matrix _or_ a dense array. This does incur a type instability (in the form of a discriminated union) in some situations, but thanks to type-based introspection of the `Broadcasted` wrapper commonly used functions can be special cased to be type stable. For example: ```julia julia> f(d) = round.(Int, d) f (generic function with 1 method) julia> @inferred f(Diagonal(rand(3))) 3×3 Diagonal{Int64,Array{Int64,1}}: 0 ⋅ ⋅ ⋅ 0 ⋅ ⋅ ⋅ 1 julia> @inferred Diagonal(rand(3)) .* 3 ERROR: return type Diagonal{Float64,Array{Float64,1}} does not match inferred return type Union{Array{Float64,2}, Diagonal{Float64,Array{Float64,1}}} Stacktrace: [1] error(::String) at ./error.jl:33 [2] top-level scope julia> @inferred Diagonal(1:4) .+ Bidiagonal(rand(4), rand(3), 'U') .* Tridiagonal(1:3, 1:4, 1:3) 4×4 Tridiagonal{Float64,Array{Float64,1}}: 1.30771 0.838589 ⋅ ⋅ 0.0 3.89109 0.0459757 ⋅ ⋅ 0.0 4.48033 2.51508 ⋅ ⋅ 0.0 6.23739 ``` In addition to the issues referenced above, it fixes: * Fixes #19313, #22053, #23445, and #24586: Literals are no longer treated specially in a fused broadcast; they're just arguments in a `Broadcasted` object like everything else. * Fixes #21094: Since broadcasting is now represented by a pure Julia datastructure it can be created within `@generated` functions and serialized. * Fixes #26097: The fallback destination-array specialization method of `copyto!` is specifically implemented as `Broadcasted{Nothing}` and will not be confused by `nothing` arguments. * Fixes the broadcast-specific element of #25499: The default base broadcast implementation no longer depends upon `Base._return_type` to allocate its array (except in the empty or concretely-type cases). Note that the sparse implementation (#19595) is still dependent upon inference and is _not_ fixed. * Fixes #25340: Functions are treated like normal values just like arguments and only evaluated once. * Fixes #22255, and is performant with 12+ fused broadcasts. Okay, that one was fixed on master already, but this fixes it now, too. * Fixes #25521. * The performance of this patch has been thoroughly tested through its iterative development process in #25377. There remain [two classes of performance regressions](#25377) that Nanosoldier flagged. * #25691: Propagation of constant literals sill lose their constant-ness upon going through the broadcast machinery. I believe quite a large number of functions would need to be marked as `@pure` to support this -- including functions that are intended to be specialized. (For bookkeeping, this is the squashed version of the [teh-jn/lazydotfuse](JuliaLang/julia#25377) branch as of a1d4e7ec9756ada74fb48f2c514615b9d981cf5c. Squashed and separated out to make it easier to review and commit) Co-authored-by: Tim Holy <tim.holy@gmail.com> Co-authored-by: Jameson Nash <vtjnash@gmail.com> Co-authored-by: Andrew Keller <ajkeller34@users.noreply.github.com>

@lower

This patch represents the combined efforts of four individuals, over 60 commits, and an iterated design over (at least) three pull requests that spanned nearly an entire year (closes #22063, #23692, #25377 by superceding them). This introduces a pure Julia data structure that represents a fused broadcast expression. For example, the expression `2 .* (x .+ 1)` lowers to: ```julia julia> Meta.@lower 2 .* (x .+ 1) :($(Expr(:thunk, CodeInfo(:(begin Core.SSAValue(0) = (Base.getproperty)(Base.Broadcast, :materialize) Core.SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(3) = (Core.SSAValue(2))(+, x, 1) Core.SSAValue(4) = (Core.SSAValue(1))(*, 2, Core.SSAValue(3)) Core.SSAValue(5) = (Core.SSAValue(0))(Core.SSAValue(4)) return Core.SSAValue(5) end))))) ``` Or, slightly more readably as: ```julia using .Broadcast: materialize, make materialize(make(*, 2, make(+, x, 1))) ``` The `Broadcast.make` function serves two purposes. Its primary purpose is to construct the `Broadcast.Broadcasted` objects that hold onto the function, the tuple of arguments (potentially including nested `Broadcasted` arguments), and sometimes a set of `axes` to include knowledge of the outer shape. The secondary purpose, however, is to allow an "out" for objects that _don't_ want to participate in fusion. For example, if `x` is a range in the above `2 .* (x .+ 1)` expression, it needn't allocate an array and operate elementwise — it can just compute and return a new range. Thus custom structures are able to specialize `Broadcast.make(f, args...)` just as they'd specialize on `f` normally to return an immediate result. `Broadcast.materialize` is identity for everything _except_ `Broadcasted` objects for which it allocates an appropriate result and computes the broadcast. It does two things: it `initialize`s the outermost `Broadcasted` object to compute its axes and then `copy`s it. Similarly, an in-place fused broadcast like `y .= 2 .* (x .+ 1)` uses the exact same expression tree to compute the right-hand side of the expression as above, and then uses `materialize!(y, make(*, 2, make(+, x, 1)))` to `instantiate` the `Broadcasted` expression tree and then `copyto!` it into the given destination. All-together, this forms a complete API for custom types to extend and customize the behavior of broadcast (fixes #22060). It uses the existing `BroadcastStyle`s throughout to simplify dispatch on many arguments: * Custom types can opt-out of broadcast fusion by specializing `Broadcast.make(f, args...)` or `Broadcast.make(::BroadcastStyle, f, args...)`. * The `Broadcasted` object computes and stores the type of the combined `BroadcastStyle` of its arguments as its first type parameter, allowing for easy dispatch and specialization. * Custom Broadcast storage is still allocated via `broadcast_similar`, however instead of passing just a function as a first argument, the entire `Broadcasted` object is passed as a final argument. This potentially allows for much more runtime specialization dependent upon the exact expression given. * Custom broadcast implmentations for a `CustomStyle` are defined by specializing `copy(bc::Broadcasted{CustomStyle})` or `copyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})`. * Fallback broadcast specializations for a given output object of type `Dest` (for the `DefaultArrayStyle` or another such style that hasn't implemented assignments into such an object) are defined by specializing `copyto(dest::Dest, bc::Broadcasted{Nothing})`. As it fully supports range broadcasting, this now deprecates `(1:5) + 2` to `.+`, just as had been done for all `AbstractArray`s in general. As a first-mover proof of concept, LinearAlgebra uses this new system to improve broadcasting over structured arrays. Before, broadcasting over a structured matrix would result in a sparse array. Now, broadcasting over a structured matrix will _either_ return an appropriately structured matrix _or_ a dense array. This does incur a type instability (in the form of a discriminated union) in some situations, but thanks to type-based introspection of the `Broadcasted` wrapper commonly used functions can be special cased to be type stable. For example: ```julia julia> f(d) = round.(Int, d) f (generic function with 1 method) julia> @inferred f(Diagonal(rand(3))) 3×3 Diagonal{Int64,Array{Int64,1}}: 0 ⋅ ⋅ ⋅ 0 ⋅ ⋅ ⋅ 1 julia> @inferred Diagonal(rand(3)) .* 3 ERROR: return type Diagonal{Float64,Array{Float64,1}} does not match inferred return type Union{Array{Float64,2}, Diagonal{Float64,Array{Float64,1}}} Stacktrace: [1] error(::String) at ./error.jl:33 [2] top-level scope julia> @inferred Diagonal(1:4) .+ Bidiagonal(rand(4), rand(3), 'U') .* Tridiagonal(1:3, 1:4, 1:3) 4×4 Tridiagonal{Float64,Array{Float64,1}}: 1.30771 0.838589 ⋅ ⋅ 0.0 3.89109 0.0459757 ⋅ ⋅ 0.0 4.48033 2.51508 ⋅ ⋅ 0.0 6.23739 ``` In addition to the issues referenced above, it fixes: * Fixes #19313, #22053, #23445, and #24586: Literals are no longer treated specially in a fused broadcast; they're just arguments in a `Broadcasted` object like everything else. * Fixes #21094: Since broadcasting is now represented by a pure Julia datastructure it can be created within `@generated` functions and serialized. * Fixes #26097: The fallback destination-array specialization method of `copyto!` is specifically implemented as `Broadcasted{Nothing}` and will not be confused by `nothing` arguments. * Fixes the broadcast-specific element of #25499: The default base broadcast implementation no longer depends upon `Base._return_type` to allocate its array (except in the empty or concretely-type cases). Note that the sparse implementation (#19595) is still dependent upon inference and is _not_ fixed. * Fixes #25340: Functions are treated like normal values just like arguments and only evaluated once. * Fixes #22255, and is performant with 12+ fused broadcasts. Okay, that one was fixed on master already, but this fixes it now, too. * Fixes #25521. * The performance of this patch has been thoroughly tested through its iterative development process in #25377. There remain [two classes of performance regressions](#25377) that Nanosoldier flagged. * #25691: Propagation of constant literals sill lose their constant-ness upon going through the broadcast machinery. I believe quite a large number of functions would need to be marked as `@pure` to support this -- including functions that are intended to be specialized. (For bookkeeping, this is the squashed version of the [teh-jn/lazydotfuse](JuliaLang/julia#25377) branch as of a1d4e7ec9756ada74fb48f2c514615b9d981cf5c. Squashed and separated out to make it easier to review and commit) Co-authored-by: Tim Holy <tim.holy@gmail.com> Co-authored-by: Jameson Nash <vtjnash@gmail.com> Co-authored-by: Andrew Keller <ajkeller34@users.noreply.github.com>

- referenced non-extant `indices` - `broadcasted_indices` was deprecated to `broadcasted_axes` in JuliaLang/julia#25377 - `broadcasted_axes` was then replaced by plain `axes` JuliaLang/julia@a2feccf - StaticArrays already has `axes` methods

timholy added the needs docs Documentation for this change is required label Jan 3, 2018

mbauman added broadcast needs news A NEWS entry is required for this change labels Jan 3, 2018

timholy mentioned this pull request Jan 4, 2018

RFC: introduce runtime representation of broadcast fusion #23692

Closed

timholy mentioned this pull request Jan 4, 2018

Convolutions for DL JuliaImages/ImageFiltering.jl#52

Open

mbauman reviewed Jan 5, 2018

View reviewed changes

timholy force-pushed the teh-jn/lazydotfuse branch from 478c4a0 to ed52263 Compare January 7, 2018 19:00

timholy and others added 8 commits January 7, 2018 15:04

Reduce build-time calls to broadcasting machinery

c0db74c

The main one left is in concatenation, in a line inds[i] = offsets[i] .+ cat_indices(x, i)

Allow test/core.jl to be run from REPL

0c6617a

If you've already said `using Test`, defining a function named `Test` causes problems.

Turn range&number arithmetic operations into broadcast methods

98fe8ab

This is consistent with the deprecation of methods like `[1,2,3] + 1`.

Make lazy dot fusion

aeba265

Speed up range tests and fix printing at REPL

98f6bdc

Integrate lazy broadcast representation into new broadcast machinery

0698edc

Among other things, this supports returning AbstractRanges for appropriate inputs. Fixes #21094, fixes #22053

Temporarily disable failing tests

4c02b07

Centralize broadcast support for structured matrices

e4d1962

timholy force-pushed the teh-jn/lazydotfuse branch from ed52263 to e4d1962 Compare January 7, 2018 21:05

timholy removed needs docs Documentation for this change is required needs news A NEWS entry is required for this change labels Jan 7, 2018

Sacha0 reviewed Jan 8, 2018

View reviewed changes

mbauman reviewed Jan 9, 2018

View reviewed changes

mbauman added 4 commits April 23, 2018 14:23

Mitagate some of the performance issues with non-type-stable...

db690e0

broadcasting by preprocessing the arguments to potentially wrap them with indexing helpers.

broadcast.jl cleanup:

a2b9015

* Slightly clearer recursion through arg lists in not_nested * Move show(::IO, ::Broadcasted) to a more sensible location and have it print its type fully qualified with the `Style` parameter.

Rename Broadcast.*_indices to *_axes as appropriate

c8bb374

JeffBezanson reviewed Apr 23, 2018

View reviewed changes

Remove spurious NEWS item from merge mistake

6fdb86e

mbauman mentioned this pull request Apr 23, 2018

Slow loop fusion when multiplying a column vector with a row vector #20875

Closed

mbauman added 2 commits April 23, 2018 18:31

Fix broadcast_similar docstring

df51b31

[ci skip]

Fix #22255 by inlining the necessary methods

a1d4e7e

mbauman mentioned this pull request Apr 24, 2018

Customizable lazy fused broadcasting in pure Julia #26891

Merged

mbauman closed this Apr 24, 2018

mbauman added the domain:broadcast Applying a function over a collection label Apr 24, 2018

ararslan deleted the teh-jn/lazydotfuse branch April 25, 2018 01:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Customizable lazy broadcasting with options for pure-Julia fusion and eager evaluation #25377

RFC: Customizable lazy broadcasting with options for pure-Julia fusion and eager evaluation #25377

timholy commented Jan 3, 2018 •

edited by mbauman

Loading

timholy commented Jan 3, 2018

andyferris commented Jan 4, 2018

timholy commented Jan 4, 2018 •

edited

Loading

timholy commented Jan 4, 2018 •

edited

Loading

Sacha0 commented Jan 4, 2018 •

edited

Loading

andyferris commented Jan 5, 2018

mbauman Jan 5, 2018

timholy Jan 7, 2018

SimonDanisch commented Jan 6, 2018

timholy commented Jan 7, 2018

timholy commented Jan 7, 2018

Sacha0 Jan 8, 2018

mbauman commented Jan 9, 2018

mbauman Jan 9, 2018

timholy Jan 9, 2018

mbauman Jan 9, 2018

mbauman commented Apr 23, 2018 •

edited

Loading

andreasnoack commented Apr 23, 2018

jekbradbury commented Apr 23, 2018

mbauman commented Apr 23, 2018

JeffBezanson Apr 23, 2018

mbauman Apr 23, 2018

mbauman commented Apr 23, 2018 •

edited

Loading

JeffBezanson commented Apr 24, 2018

mbauman commented Apr 24, 2018

RFC: Customizable lazy broadcasting with options for pure-Julia fusion and eager evaluation #25377

RFC: Customizable lazy broadcasting with options for pure-Julia fusion and eager evaluation #25377

Conversation

timholy commented Jan 3, 2018 • edited by mbauman Loading

timholy commented Jan 3, 2018

andyferris commented Jan 4, 2018

timholy commented Jan 4, 2018 • edited Loading

timholy commented Jan 4, 2018 • edited Loading

Sacha0 commented Jan 4, 2018 • edited Loading

andyferris commented Jan 5, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SimonDanisch commented Jan 6, 2018

timholy commented Jan 7, 2018

Executive Busy developer summary

Preventing circular evaluation (StackOverflowErrors)

Consider allowing a non-inferrable return type from f.(::StructuredMatrix)

Test results

timholy commented Jan 7, 2018

Choose a reason for hiding this comment

mbauman commented Jan 9, 2018

High-level end-user summary of fusion

Nitty-gritty details

Incremental results

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mbauman commented Apr 23, 2018 • edited Loading

andreasnoack commented Apr 23, 2018

jekbradbury commented Apr 23, 2018

mbauman commented Apr 23, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mbauman commented Apr 23, 2018 • edited Loading

JeffBezanson commented Apr 24, 2018

mbauman commented Apr 24, 2018

timholy commented Jan 3, 2018 •

edited by mbauman

Loading

timholy commented Jan 4, 2018 •

edited

Loading

timholy commented Jan 4, 2018 •

edited

Loading

Sacha0 commented Jan 4, 2018 •

edited

Loading

Preventing circular evaluation (`StackOverflowError`s)

Consider allowing a non-inferrable return type from `f.(::StructuredMatrix)`

mbauman commented Apr 23, 2018 •

edited

Loading

mbauman commented Apr 23, 2018 •

edited

Loading