Julep: solving tricky iteration problems #15648

timholy · 2016-03-28T03:21:02Z

The problem (a brief summary)

In a PR for a blog post, I described a number of challenges for several potential new AbstractArray types. Very briefly, the types and the challenges they exhibit are:

ReshapedArrays: the only efficient way of indexing some types is by indexing the parent array directly (i.e., with an iterator optimized for the pre-reshaping array). This means that indexing a 2D array as B[i,j], for integer i and j, may not always be efficient.
TransposedMatrix (for Taking vector transposes seriously LinearAlgebra.jl#42): iteration is most efficient when it produces a cache-friendly sequence of array accesses. Unfortunately, algorithms that mix column-major and row-major arrays imply that we can't naively allow eachindex(A) to always return the most efficient iterator.
OffsetArrays (sometimes called "Fortran arrays"): these violate the assumptions that array indexes run from 1:size(A,d), and more generally question how one should match elements of two different
arrays.

Possible APIs

In the blog post, only two concrete algorithms were considered: matrix-vector multiplication and copy!. At the risk of being insufficiently general, let's explore possible APIs in terms of these two functions.

EDIT: note that the thoughts on the API are evolving as a consequence of this discussion. To jump ahead to an updated proposal, see #15648 (comment). There may be further updates even later. The material below is left for reference, but some of it is outdated.

The "raw" API

For the moment, let's worry about correctness and efficiency more than elegance. Correctness trumps all other considerations. Efficiency means that we have to be able to iterate over the matrix in
cache-friendly order. That means that the array B has to be "in charge" of the sequence of values and pick its most efficient iterator.

I should emphasize that I haven't taken even the first step towards attempting to implement this API. So this is crystal-ball development, trying to imagine what might work while hoping to shake
out likely problems in advance via community feedback. That caveat understood, here's a possible implementation of matrix-vector multiplication:

RB = eachindex(B)
Rv = eachindex(RB[*,:], v)
Rdest = eachindex(RB[:,*], dest)
for (idest, iB, iv) in zip(Rdest, RB, Rv)
    dest[idest] += B[iB] * v[iv]
end

eachindex(iter, obj) should perform bounds-checking, to ensure that the indexes of iter align with obj (even handling things like OffsetArrays). The notation RB[*,:] means "corresponding to the
second dimension of RB"; interestingly, this expression parses without error, attempting to dispatch to getindex with argument types Tuple{typeof(RB), Function, Colon}. Hopefully there isn't another pending meaning for indexing with a Function.

Because either idest or iv is constant over the "inner" loop (depending on storage order of B), ideally one would like to hoist the corresponding access dest[idest] or v[iv]. I'd argue that's a
job for the compiler, and not something we should have to bake into the API.

Because zip(X, Y) does not allow Y to see the state of X, it seems quite possible that this could not be made efficient in general (consider the case described for ReshapedArrays, where the
efficiently-parametrized iterator for the first dimension depends on the state of the index for the second dimension). If this proves to be the case, we have to abandon zip and write this more like

RB = eachindex(B)
for (iB, idest, iv) in couple(RB, (RB[:,*], dest), (RB[*,:], v))
    dest[idest] += B[iB] * v[iv]
end

where couple is a new exported function.

One interesting thing about this API is that sparse multiplication might be implemented efficiently just by replacing the first line with

RB = eachstored(B)

so that only "stored" elements of B are visited. (@tkelman may have been the first to propose an eachstored iterator.) It should be pointed out that several recent discussions have addressed some of the complications that arise from Inf and NaN, and it remains to be determined whether a version that doesn't ignore these complications can be written with this (or similar) API.

Likewise, copy! might be implemented

Rdest = eachindex(dest)
for (idest, isrc) in couple(Rdest, (Rdest, src))
    dest[idest] = src[isrc]
end

At least for the array types described here, this API seems plausibly capable of generating both correct and (fairly) efficient code. One might ideally want to write generic "tiled" implementations, but that seems beyond the scope of this initial effort.

Making it prettier with "sufficiently smart iteration"

This API is still more complicated than naive iteration. One might possibly be able to hide a lot of this with a magical @iterate macro,

@iterate B dest[i] += B[i,j]*v[j]

for matrix-vector multiplication and

@iterate dest dest[i] = src[i]

for copy!. The idea is that the @iterate macro would expand these expressions into their longer forms above. Note that the first argument to @iterate specifies which object is "in charge" of the
sequence of operations. @iterate might be too limited for many things, but may be worth exploring. The KernelTools repository may have some useful tidbits.

The text was updated successfully, but these errors were encountered:

meggart · 2016-03-29T14:42:50Z

Thanks a lot for writing the blog post and this issue, I learned a lot from it.

I have a question: You propose

    RB = eachindex(B)
    Rv = eachindex(RB[*,:], v)

If B is an Array RB will simply be 1:length(B), and lose all the shape information of B and makes your life harder interpreting second eachindex call.

In general, if you can come up with a couple function (which I think would be great), do you need the prior eachindex call at all? Why not simply do for (iB, idest, iv) in couple(B, (B[:,*], dest), (B[*,:], v)). where again B is the array that determines the iteration order.

So if possible I would prefer having a single couple function that could be heavily specialized for different argument types and one would need eachindex and zip only internally (called by couple).

I am also not sure if it is always a good idea to put one iterator in charge of the whole iteration order. How would you, for example formulate the loop for map!(dest,f,a,b), where the 3 arrays might have different storage orders? Could one implement a couple function that lets the three arrays dest, a, and b decide via "majority vote" how to iterate?

timholy · 2016-03-29T15:18:44Z

You're absolutely right; the intent was to keep all the shape information, and I overlooked the problem you pointed out. I was trying to avoid directly indexing arrays with something weird (the function *), since it seems like that could be unpopular. But what I wrote won't work. A possible alternative would be B[?,:], because ? is not defined yet, and so indexing arrays with ? seems like it might be OK.

Your two points also come together very nicely if one thinks about matrix-matrix (rather than matrix-vector) multiplication. First, you pretty much have to have a syntax that addresses the arrays directly, rather than some "master" iterator. Second, you're right that you probably don't want any iterator in charge. Something like this?

for (idest, iA, jdest, jB, kA, kB) in couple((dest[:,?], A[:,?]), (dest[?,:], B[?,:]), (A[?,:], B[:,?]))
    dest[idest,jdest] += A[idest,kA]*B[kB,jB]
end

which could hopefully be simplified to

@iterate dest[i,j] += A[i,k]*B[k,j]

Here, there's not actually much magic in @iterate (other than parsing the pattern of indexes and array-accesses), all the hard work is being done by couple.

My big worry is that couple could be quite hard to write. Still, it seems like the right place to put the complexity: do a bunch of analysis right at the beginning of the loop to figure out the optimal access pattern.

In some situations it may be difficult/impossible to make it type-stable (though all the ?/: indexing is designed to make this more feasible). As a fallback one could do

function foo!(dest, A, B)
    iters = couple(...)
    _foo(dest, A, B, iters)
end

@noinline function _foo!(dest, A, B, iters)
    for (idest, iA, jdest, jB, kA, kB) in iters
        ...
    end
end

However, for the @iterate macro this would need to be

function foo!(dest, A, B)
    iters = couple(...)
    @noinline function _foo!(dest, A, B, iters)
        for (idest, iA, jdest, jB, kA, kB) in iters
            ...
        end
    end
    _foo(dest, A, B, iters)
end

I haven't yet tested whether _foo! could be placed inside foo! and yet solve the type-instability problem. (I suspect so, but worth testing.)

eschnett · 2016-03-29T15:49:51Z

On Tue, Mar 29, 2016 at 11:18 AM, Tim Holy notifications@github.com wrote:

You're absolutely right; the intent was to keep all the shape information,
and I overlooked the problem you pointed out. I was trying to avoid
directly indexing arrays with something weird (the function *), since it
seems like that could be unpopular. But what I wrote won't work. A possible
alternative would be B[?,:], because ? is not defined yet, and so
indexing arrays with ? seems like it might be OK.

Your two points also come together very nicely if one thinks about
matrix-matrix (rather than matrix-vector) multiplication. First, you pretty
much have to have a syntax that addresses the arrays directly, rather than
some "master" iterator. Second, you're right that you probably don't want
any iterator in charge. Something like this?

for (idest, iA, jdest, jB, kA, kB) in couple((dest[:,?], A[:,?]), (dest[?,:], B[?,:]), (A[?,:], B[:,?]))
dest[idest,jdest] += A[idest,kA]*B[kB,jB]end

Ooooooh. This is now awfully close to an abstract index notation, which
allows generalizing this to more than two dimensions:

for magic in joined(dest[:i,:j], A[;i,:k], B[:k,:j])
dest[magic,:i,:j] += A[magic,:i,:k] * B[magic,:k,:j]
end

Obviously the syntax for the loop body can be improved, maybe with a macro,
maybe with clever types (e.g. turning the indices i,j,k into functions that
take the array as argument).

-erik

which could hopefully be simplified to

@iterate dest[i,j] += A[i,k]*B[k,j]

Here, there's not actually much magic in @iterate (other than parsing the
pattern of indexes and array-accesses).

My big worry is that couple could be quite hard to write. Still, it seems
like the right place to put the complexity: do a bunch of analysis right at
the beginning of the loop to figure out the optimal access pattern.

In some situations it may be difficult/impossible to make it type-stable
(though all the ?/: indexing is designed to make this more feasible). As
a fallback one could do

function foo(dest, A, B)
iters = couple(...)
_foo(dest, A, B, iters)end
@noinline function _foo(dest, A, B, iters)
for (idest, iA, jdest, jB, kA, kB) in iters
...
endend

However, for the @iterate macro this would need to be

function foo(dest, A, B)
iters = couple(...)
@noinline function _foo(dest, A, B, iters)
for (idest, iA, jdest, jB, kA, kB) in iters
...
end
end
_foo(dest, A, B, iters)end

I haven't yet tested whether _foo could be placed inside foo and yet
solve the type-instability problem. (I suspect so, but worth testing.)

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#15648 (comment)

Erik Schnetter schnetter@gmail.com
http://www.perimeterinstitute.ca/personal/eschnetter/

timholy · 2016-03-29T15:56:09Z

@eschnett, what you're suggesting is exactly what the @iterate macro does. Symbols don't have different types, so to do this via functions you have to introduce the appropriate types. But macros can parse the symbols, and emit code that is properly type-stable. In other words, the @iterate macro is effectively a pretty wrapper around couple.

eschnett · 2016-03-29T16:05:24Z

I was trying to generalize the notation for : and ?, and then got side-tracked by how the loop body should look like. So instead of writing

couple((dest[:,?], A[:,?]), (dest[?,:], B[?,:]), (A[?,:], B[:,?]))

one could write

coupled(dest[:i,:j,:k], A[:i,:j,:l], B[:j,:k,:m], C[:m])

which allows for more than two indices that can be used in arbitrary order, allowing transposition.

mbauman · 2016-03-29T17:17:58Z

Let's look at it from the other side. It seems there are three things @iterate needs to do for an expression like dest[i,j] += A[i,k]*B[k,j]:

Determine the optimal loop order. This is hard… and an endless task if we consider BLAS-like partial loop striding. I say we punt, and require the user to specify the order.
Determine how many times to run each loop. This is straightforward.
Request for an efficient index iterator from each array for the given access pattern. Instead of worrying about coupling these iterators together, what if we just asked each array for an iterator across a given cartesian permutation order? E.g.,

@iterate (i, j, k) begin # specify iteration order
   D[i,j] += A[i,k]*B[k,j]
end

# would expand to:
(sz1, sz2, sz3) = … # get loop lengths and validate array sizes
Ditr = eachindex(D, (1, 2)) # will want Val{(1,2)} or some such
Aitr = eachindex(A, (1, 2))
Bitr = eachindex(B, (2, 1))
Dstate = start(iD)
Astate = start(iA)
for _i=1:sz1
    Bstate = start(iB) # B isn't dependent upon i, @iterate just restarts it every loop
    for _j=1:sz2
        (iD, Dstate) = next(Ditr, Dstate) # D isn't dependent on k, so it hoists it
        Astate′ = Astate # A isn't dependent upon j, so we jump back to where we were
        for _k=1:sz3
            (iB, Bstate) = next(Bitr, Bstate)
            (iA, Astate′) = next(Aitr, Astate′)
            D[iD] += A[iA]*B[iB]
        end
    end
    Astate = Astate′
end

I believe this is sufficiently general that @iterate will always be able to work out where to start, hoist, and hold iteration states. The magic, of course, now happens within eachindex(A, permutation). And while I think that will also be rather challenging to write for more than two dimensions, I think it'd be much simpler than trying to couple iterators together.

I'm sure you're aware, Tim, but I'd like to also point to TensorOperations.jl, which I think does some of @iterate (just not the fast iterated non-cartesian part). And although it's reason for being is completely different, numpy.einsum is also somewhat similar. Lots of people seem to use it, but I've never been able to sink my teeth into it — I find it quite unreadable.

timholy · 2016-03-29T17:57:21Z

@eschnett, the reason I think you need to set this up in a type-stable manner is that some array types aren't efficiently indexed by an NTuple of integers. For example, if A is a SparseMatrixCSC, A[i,j] (for integer i, j) does not have impressive performance; a much better strategy is to iterate over an index into rowindex/nzval and keep track of the corresponding i, j as you go. ReshapedArrays are another good example, and the one that drove me to think about this. For both of these cases, you'd really like to use a custom iterator specialized to extract the last bit of performance. (Particularly for the sparse case, if you only need to visit filled values the savings are huge.)

@mbauman, I suspect you're right about some of the challenges. What you're describing is basically how the macros in KernelTools work currently; I also wrote a nifty little macro (@time on steroids) that tests all possible loop orderings and reports the time for each, thus giving the user the chance to hard-code the one that works best. I should take a look at TensorOperations, I haven't looked in a long time so probably don't remember all the tools (or they may have changed).

But for generic code (that might have to handle Matrix and MatrixTranspose), it sure would be nice to be able to be flexible without having to write tons of cases out by hand. May be a pipe dream (I am sure that many have tried to do this before...e.g., Polly).

timholy · 2016-03-29T18:03:10Z

@mbauman, in your eachindex(A, (d1, d2)), am I reading this right that you're specifying that dimension d1 is the "inner" dimension and d2 is the "outer" dimension? Seems like a promising idea!

mbauman · 2016-03-29T18:09:58Z

Yes, precisely. The core idea is that each array only needs to concern itself about the order of iteration, and the @iterate macro itself handles starting, hoisting, and holding the iteration state at particular points in the expansion to ensure things stay in sync.

timholy · 2016-03-29T18:17:53Z

I guess the reason I didn't go this way is because I was attracted by the hope of solving the loop-ordering problem. Since the macro can't see the types, I was looking for a way for it to set things up in a form that would be amenable to type analysis.

eschnett · 2016-03-29T18:26:11Z

@timholy I don't understand the connection between the order in which the indices are traversed and the syntax to describe which indices need to be coupled.

I agree that sparse (or reshaped, or symmetric where only part of the elements are stored, etc.) arrays require special care.

Let's take this loop as example:

for i,j,k in coupled(r[:i,:j,:k], a[:i,:j,:k], b[:j,:k,:i], c[:k,:i,:j])
   r[i,j,k] = a[i,j,k] + b[j,k,i] + c[k,i,j]
end

How would this look like in the ? / : notation?

timholy · 2016-03-29T19:00:12Z

Something like couple(iblock, jblock, kblock) where

iblock = (r[:,?,?], a[:,?,?], b[?,?,:], c[?,:,?])
jblock = (r[?,:,?], a[?,:,?], b[:,?,?], c[?,?,:])
kblock = (r[?,?,:], a[?,?,:], b[?,:,?], c[:,?,?])

I.e., you match the indexes with the :s.

One of the points perhaps worth making is that I was also looking for an analog of for i = 2:n..., and the ?/: syntax might generalize reasonably well there. But bad things happen if there are no ?, because then it would dispatch just to regular getindex.

All this is to simply explain my reasoning; while I'm attracted by solving the loop ordering problem, I recognize that this may not be possible/practical. (I don't know the literature here.)

eschnett · 2016-03-29T19:04:49Z

@timholy Okay; that's a straightforward transformation from the (:i,:j,:k) syntax I like. There's essentially one block per index, specifying where the index appears for each array.

meggart · 2016-03-29T19:35:59Z

Determine the optimal loop order. This is hard… and an endless task if we consider BLAS-like partial loop striding. I say we punt, and require the user to specify the order.

I just want to support @timholy 's point that I think (at least partially) solving the loop ordering problem would be very nice.Ootherwise one would have to accept that generic library functions like map! and copy! might be slow for transposed matrices, since the user would not be able to influence the loop order.

timholy · 2016-03-29T21:11:14Z

Perhaps I'm most interested in a stepwise approach: if possible, pick an API that doesn't prevent us from tackling this someday, but punt on loop-ordering for now. Revamping how we do array iteration is a big problem no matter how we slice it, and we might make faster progress by first choosing the smallest feasible unit and getting that working.

mbauman · 2016-03-29T21:17:49Z

I think that it still may be possible do something about the for loop ordering problem, but it's a bear. To do so, we'll need to use a bunch of flat iterators. Let's take a step back from syntax and type-stability, and just look at what kinds of iterators are required for a flat matrix multiplication loop:

for i=1:ni, j=1:nj, k=1:nk
   D[i,j] += A[i,k]*B[k,j]
end

Consider a vector that's been reshaped into a matrix. How many different iterators does this type need to implement in order to efficiently participate as either A, B, or D in the above expression?

As D: This is column major! Great, so we can just use a range 1:length(A). Oh, but wait, there's an inner k loop that we don't care about. We'll need a lazy version of repeat(1:length(A), inner=[nk]), which I don't think we have. But we can write that iterator wrapper once and be done.
As B: Shucks, row major traversal. Gotta use row-major cartesian iteration. Oh, and we also have to cycle it ni times. Easy enough, we have cycle.
As A: Ok, column major again, that's good. But… how in the world do we express this? It needs to produce chunks of nk elements at a time, repeating each chunk nj times before moving on to the next chunk. Ok, we can write that iterator, too. But what if there was a fourth loop? I'm sure there's a way to do this generally, but man, it'll be a pain.

I think this demonstrates that there are two orthogonal concerns here. As far as the individual array goes, it's really only concerned about traversal ordering. couple (or whatever) can wrap those iterators with the required wrapper in order to comply with the structuring of the other loops.

So what kind of information do we need to dynamically construct those iterators with full generality? An eachindex API would just need to know the ordering of dimensions. couple needs to know which indices are used in each array so it can arrange for an ordering and wrap iterators appropriately.

for (iD, iA, iB) in couple(D, (:i,:j), A, (:i,:k), B, (:k,:j))
   D[iD] += A[iA]*B[iB]
end

Now with this API, I can finally begin to imagine how coupled would work. It can try to find an ordering of ijk that satisfies the most arrays, then request the ordered index iterator from each array, then wrap it with the required iterator to reflect the chosen for-loop structure. Of course, with three-dimensional arrays, there'd also be partially optimal orderings. And this will almost certainly be dependent upon array size, too. Can we make JuMP a dependency of array iteration?

I think this is the fully general, fully orthogonal design. If we break this orthogonality, I'm afraid that we'll never stop writing different flavors of eachindex iterators for every single array type.

timholy · 2016-03-29T22:05:13Z

Can we make JuMP a dependency of array iteration?

I was also wondering if we'd need an optimizer. I doubt that's a direction we want to explore, unless it would also be useful for making more rational decisions about automatic inlining (which it surely would be). Punting to Polly, by marking a block or function with a future @polly hint, seems like a better way to handle this problem.

timholy · 2016-03-30T09:58:26Z

@mbauman, focusing on the reshaped-vector part of your comment above, a lot of the concerns can be resolved if we have an API for iterating along a particular dimension, aka #15459. This is a different form of orthogonality than you're talking about, and it would resolve the need for many different variants of repeat. It might also be easier to write: e.g., for a SparseMatrixCSC, as demonstrated in #15459 the notion of a "column iterator" and "row iterator" are much more straightforward than supporting some kind of repeat iterator that can start and stop anywhere. For ReshapedArrays, single-dimension iterators are annoying but straightforward; as described in the blog post, even when they require a div for construction it greatly reduces the cost of the whole loop.

A second issue is that fully-orthogonal iterators may not generalize well to efficient sparse iteration: you might want your sparse array iterator to "jump" to the next stored entry, and you might need to correspondingly advance any coupled iterators. This kind of thinking is what first suggested to me that something like couple is an important addition to the API: zip provides complete independence, but in zip(eachstored(A), eachstored(B)), if A is sparse and B is dense, the two will rapidly get out-of-sync. In contrast, in couple(eachstored(A), eachstored(B)), the state for B might be updated in a manner "slaved" to the changes in A. It would be nice to make this symmetric, so that no one iterator is "in charge," but I haven't thought that through yet. If it is symmetric, then couple isn't actually expressive enough on its own: in the example above, if A and B have different numbers of stored elements, should it visit the location of only those that are stored in both? Or stored in either one? At a minimum there's an ||/&& problem here.

I suspect that array-iterators need to be a specialized class, supporting more operations than just start, next, and done. Thinking about the sparse case, one candidate is skip, where the skip-value could be positive or negative. This won't be terribly efficient if skip requires calling div (which it would for something like a ReshapedArray). But I think that's OK; we can't have everything. I'll be happy with 90% of everything 😄.

tkelman · 2016-03-30T12:17:06Z

In my mind you would express couple(eachstored(A), eachstored(B)) as either union or intersect depending on the operation being performed.

timholy · 2016-03-30T13:42:43Z

Thanks to #13412, I'm noticing that this is also possible: couple(any, x, y) or couple(all, x, y). I'm not sure you'd really use the any and all functions directly, but you can dispatch on them:

@eval foo(::$(typeof(any))) = 1
@eval foo(::$(typeof(all))) = 2

timholy · 2016-03-30T14:13:56Z

To help see what different APIs "feel like" for writing code, as an alternative to endlessly creating gists I opened ArrayIterationPlayground and invite collaborators. This doesn't (yet) contain any implementations of any API, but there's one file that implements a couple of algorithms using a fictitious API.

I'd recommend we keep discussion mostly here, though, since it's presumably of general interest.

mbauman · 2016-03-30T14:38:02Z

The design space here is just ginormous even before you begin considering sparse arrays. I'm trying to bring a dose of reality to all the magic that's flying around. :)

My knee-jerk reaction is that trying to extract a single cartesian index out of a given iterator to share with another array will prove to be intractable to code generically. There's two operations that may or may not be possible: eachindex iterators may or may not be able to quickly compute the cartesian indices, and the array that index is shared with may or may not be able to use a cartesian index (especially if you consider OffsetArrays, but I'm still not convinced we should support them without requiring a special offset index type). So you've got to support independent iteration no matter what.

I don't think you want to express this as foo[:, ?], since, to my reading, that'd require foo (be it an array or index iterator) to conjure up a wholly independent iterator object. Better to pass the shared index information separately and let couple sort it out.

As far as eachindex over particular dimensions goes, I'm in support. One possibility to allow for non-column major access is to support something like eachindex(A, Order{2}(:), 42, Order{1}(2:size(A,3)) — that'd iterate over (1, 42, 2), (1, 42, 3), (1, 42, 4), …, (2, 42, 2). Expressing this with dispatch in a sensible way worries me… as does specializing it for each iterator type. And I don't think it helps couple simplify any of the work it requires, though, since it's still got to start and stop these iterators at the right points for a flat loop to work.

All that said, I'd be very happy to be proven wrong.

timholy · 2016-03-30T14:49:27Z

Reality is good. I agree that the challenge here is to define the impactful contribution that's feasible and doesn't prevent more progress by future generations.

The implementation of foo[:,?] I had in mind is just

immutable AnyEntry end
const ? = AnyEntry()

# Typically AA<:AbstractArray, but not sure I yet see a reason to require that
immutable DimSelection{AA,D<:Tuple{Vararg{Union{Colon,AnyEntry}}}}
    A::AA
end

@inline getindex(A, I::Union{Colon,AnyEntry}...) = DimSelection{typeof(A), I}(A)

which is just basically a syntax for passing it on, as you say. I'm not sure whether it's a readability-enhancer, however, and the explicit eachindex is definitely on the table. (It's basically a "lazy eachindex" since it doesn't yet make a choice about which particular iterator type gets chosen.)

And yes, writing couple is going to be hard. Before trying that (with couple or any other API), I'm beginning to think the best thing is to see how well the API works before trying to implement it. The recent PRs from several kind folks addressing the TODO list in #15434 are a huge asset, since they help highlight "interesting" iteration cases.

timholy · 2016-03-30T16:43:42Z

To take out a little of the magic, I'm currently imagining something vaguely like this:

immutable CoupledIterator{I,F<:Tuple{Vararg{Function}}}
    iter::I
    itemfuns::F
end

This has a "master" iterator iter, and a tuple-of-functions that produces the item for each of the coupled iterators. next might be defined generally as just

next(iter::CoupledIterator, state) = map(iter.itemfuns, state), next(iter.iter, state)

or something. That map is mapping a tuple of functions over a single object state, which is kinda the opposite of our usual meaning, but hopefully you get the drift. The point is, this produces an item for each of the arrays that went into the iteration. Since it can be of any type, it has at least the potential to allow for efficient access.

As an example, for sparse matrix-vector multiplication A'*v (so we're taking a transpose), I could imagine that couple(all, eachstored(A, (:,j)), v) would have a default fallback behavior of something like

function couple(::typeof(all), iter::SparseCSCColumnIterator, v::AbstractVector)
    CoupledIterator(iter, state->state, state->state.row)
end

This doesn't work out all the details, but it suggests that couple maybe isn't as hard as feared. What this achieves is "slaving" the index for v to the iterator that only visits the stored items of A.

Now, if v is a ReshapedArray or itself sparse, this won't be optimal. I'm prepared to live with that in exchange for the benefit that one gets from handling the sparse matrix sensibly. I'm hoping that this kind of foundation will (1) give a big improvement over current behavior, (2) may not be an impossible dream, and (3) might be an API that's flexible enough that it won't stand in the way of anyone wanting to tackle more ambitious cases some day.

Does that help? Or did you already read my mind (so this is all old news to you) and then see another few miles ahead to big scary roadblocks?

mbauman · 2016-03-30T18:54:05Z

It does help some, yes. I had been thinking you were prioritizing reshaped and offset arrays over eachstored.

I think the couple that you propose in that latest example can be more generally expressed as:

couple(iter, ::Union{Val{:row}, Val{:col}}...) # or dim numbers, or in the value domain, doesn't matter

Now you can also get the column index for the destination vector: (ia, iv, idest) = couple(eachstored(A), Val{:row}, Val{:col}). Heck, make those Val arguments tuples and now you can allow coupling A element-wise to another matrix of the same size (or transpose). That's tractable. But only A gets special handling. All other arrays used here get indexed by cartesian indices.

Where I get all fuzzy is how to extend this to algorithms like gemm, where there's no one array in charge of the entire triplet of loops. Or in allowing the slave arrays to have a say in the kind of iterator they're indexed by. It's those cases where I think I can see a combinatorial explosion of methods with crazy ambiguity problems coming over the horizon.

timholy · 2016-04-04T13:59:20Z

OK, over at ArrayIterationPlayground I've written out a number of algorithms in a proposed API, and that experience seem to suggest we're converging. (It's possible I wasn't consistent, as my API plans evolved as I tried new things.) I found that this API may suffice to implement things as complicated as cholfact!, provided that we only care about arrays with numeric indexes and not arrays indexed as A[:dog, :cat].

Here's a summary of where I am now:

inds(A, d)               # index vector for dimension d (default 1:size(A, d))
inds(A)                  # tuple-of-inds
zeros(Int, -3:3, -1:5)   # creates matrix of size 7-by-7 with the given inds
fill(val, indexes...)    # likewise

a[first+1:(last+first)÷2]          # copy (or view) of the first half of array, skipping the first element

# `index` and `stored` return "indexing hints," the laziest of wrappers
index(A, :, j)          # lazy-wrapper indicating that one wants indexes associated with column j of A
stored(A, :, j)         # just the stored values of A in column j
index(stored(A, :, j))  # just the row-indexes of the stored values in column j
index(A, :, ?)          # row-index iterator for an arbitrary (unknown) column of A
index(A, Dimension{2})  # similar to index(A, ?, :, ?...)

# Likely: (notice this is different from sub, because this keeps original indexes)
index(A, first+1:last)  # would iterate over 2:length(A) for a typical Array
index(A, first+1:last, j)  # like above, but just over indexes for dimension 1

couple(iter1, iter2)    # iterates over iter1, iter2 containers, keeping them in sync
# Do we want/need this? I suspect not (default would be `any`)
couple(any, stored(iter1), stored(iter2))  # visits i if either iter1[i] or iter2[i] has a value
couple(all, stored(iter1), stored(iter2))  # visits i if both iter1[i] and iter2[i] have values

icat(a, b)  # index-preserving concatenation
# For example:
#    icat(7:10, 3)  7:10 is indexed with 1:4, so this creates a vector indexed from 1:5 (numbers aren't tied to an index)
#    icat(3, 7:10)  creates a vector indexed from 0:4
#    icat(5:7, 2:4) is an error, because they have overlapping indexes 1:3
#    icat(5:7, OffsetArray(2:4, 4:6))  indexed from 1:6
#    icat(5:7, OffsetArray(2:4, 5:7))  an error, non-contiguous indexes

Other than the Order type proposed by @mbauman above, I think this incorporates most of the feedback so far. I might prefer to tackle storage order in a second pass, since this is already peering pretty deeply into the crystal ball. And by having iterators for specific dimensions, that may be less relevant.

I've got a lot of teaching to do in the immediate future, so am unlikely to make much further progress for a while. But unless I hear otherwise I may slowly start implementing pieces of this. It's a lot of new exports, so obviously this deserves some thought.

eschnett · 2016-04-04T14:53:49Z

This looks interesting. I'll probably implement them as well for https://github.com/eschnett/FlexibleArrays.jl.

Painting the bike shed: The name couple is too generic; it doesn't indicate how they are coupled. (When I read it first, it evoked a "truck-and-trailer" coupling, which would sequence the iterators -- also a nice idea.) Given that the function takes two iterators as arguments, it's kind of obvious that it combines them in some way, so that doesn't need to be part of the name. You could call it parallel, par, insync, synced, simultaneously, multi, union, intersect, zip, zipindex, izip, ...

timholy · 2016-04-04T15:33:39Z

Thanks for reminding me about FlexibleArrays, I knew I had been told about another "offset" array indexing package out there and just forgot what it was called and who wrote it. It's obviously further along than the one I've been referring to.

Anyway, on the bike shed: I agree couple is too vague, and I really like both insync and synced. zip is out because it already has a meaning different from couple: the iterators increment independently of one another, and that's opposite the intent here.

vtjnash · 2016-04-04T16:03:15Z

I'm not sold on the complexity of couple being necessary. But I haven't put anywhere near as much effort into this as you, so I'll just note my concerns. However, it's not clear to me that you would be able to define this functionality in a general way. Even if you could achieve 80% of the speed of a raw array in the basic case, I still think you are giving up the possibility of vectorizing and blocking in the kernel. At that point, I believe it becomes faster to make a copy first. That also means that the kernel library author only needs to ensure their algorithm is optimized (and tested) for regular Arrays and the array-subtype library author only needs to ensure that they have an optimized copy method.

@iterate B dest[i] += B[i,j]*v[j]

I'm fairly certain this is just special syntax for broadcast (with an accumulator?) or more generally, another implementation of @devec?

Likewise, copy! might be implemented

My sense for copy! is that the most sensible definition is:

for x in keys(src)
  dest[x] = src[x]
end

which is specifically not an iteration over dest (except in the common array case where the keys of dest were the same as the keys of src)

timholy · 2016-04-04T16:30:13Z

The main place you need something like couple is in generalizing sparse matrix algebra. If we don't have a generic iteration framework, we need to implement special methods for each of, e.g., Tridiagonal*Bidiagonal, Bidiagonal*Tridiagonal, Banded*SparseMatrixCSC, etc. Conversion to full is very expensive for such matrices.

Yes, now that I think arrays should be thought of as associative containers, I agree that's the right meaning of copy!. In my current thinking, keys(src) is equivalent to index(stored(src)), and could be supported for arrays (which it isn't currently).

mbauman · 2016-04-04T16:33:01Z

keys is particularly nice in that it doesn't guarantee an order to the iteration, possibly pointing a path towards a row-major solution.

vtjnash · 2016-04-04T17:03:43Z

If we don't have a generic iteration framework

I'm not against that, I'm only questioning whether couple is better than zip. From the top post, I understood that zip should be able to implement the algorithm, but might do a bit of extra work when computing the iterators? Re the initial motivating example of ReshapedArrays, I think the copy-then-kernel or generate-then-copy approach may be faster (and as a bonus resolves the aliasing question)

timholy · 2016-04-04T18:25:25Z

I should have clarified that your copy! implementation is what I think of as the right "meaning" of copy!, with one potential caveat: right now we support copy!(dest::Array{T,2}, src::Array{S,3}) as long as length(dest) >= length(src) and convert(T, s::S) works. A strict notion would probably favor forcing this to become something like copy!(vec(dest), vec(src)) or something (where vec creates a view). That would require that the array's indexes can be meaningfully converted to a linear index, which then gives a point of correspondence between the two arrays. Of course, for iterating through them, there will be cases where you might like something more efficient than a linear index. (For example, copy!(sub(A, 1:end-1, 1:end-1), src).) It's also a good example of where "just make a copy, that solves most performance problems" isn't viable: it might work for src, but not for dest.

Regarding couple vs zip, I guess my point isn't that one is better than the other, but that we need both because we want different behaviors in different circumstances. For something like computing ::SparseMatrixCSC * ::AbstractMatrix, I'm struggling to see how to use our regular zip---perhaps you can explain? As I'm seeing it, for zip I think the contract is that each iterator is independent of the other, and so I don't see a good way to skip over the non-stored elements of the first array without correspondingly "advancing" the iterator for the second.

Unless you're proposing that zip get some keyword arguments endowing it with flexible behavior? Something along those lines seems fine to me; I'm not trying to argue we need a new name, just that we need new functionality.

timholy · 2016-04-04T18:27:57Z

I also edited the top post to try to reduce confusion from the fact that the API proposal is being hammered out here, and hence the top post isn't very representative of my current thinking.

JaredCrean2 · 2016-04-05T19:33:23Z

I haven't been much involved in various stages of iterator development, but looking at #1032 sparked an idea: what if there was a callback function that each iterator gets passed to before get/setindex? This would abstract away the problem of iterators needing to know about things like whether the array indices start at 1, 0 or some other number. The iterator can say "I point to the 3rd element in the 5th row" and the callback function is responsible for transforming this into a value that get/setindex! will accept.

This could solve:

Non 1-based array indexing
Negative indexing (if this is desirable)
Jagged arrays (maybe - this might be more involved to solve completely)

The generic fallbacks for this callback function would define the "standard" array implementation (1-based indexing etc.), but users could provide specific methods for new AbstarctArray subtypes to get different behavior.

mbauman · 2016-04-06T17:08:04Z

That's basically what the status quo is with cartesian indices. In fact, when you index an Array with cartesian indices, it eventually gets reduced to a linear memory offset. There's a little extra overhead involved compared to just iterating over the linear range 1:length(A), but it's not terrible. Our array infrastructure was built upon the assumption that indexing with cartesian indices is relatively fast — either it's the fastest possible way to index into the array (what we call LinearSlow), or it can be converted to a linear index without much overhead (for LinearFast arrays).

That's not true, though, when ReshapedArrays wrap LinearSlow arrays. Accessing the elements of reshape(sprand(2,3,1.0), 6) as though it's a vector, for example, requires using division to figure out that the fifth element is actually pointing at (1,3) in the parent array. That's dog-slow, especially if you have to do it in every single loop. But if you know ahead of time that you're just going to iterate over all the elements in column major order, then you can be clever. With reshapes, the cleverness is just that the order of the elements is the same, regardless of the shape. So if you ask for an iterator over all indices, then ReshapedArray can just wrap the parent array's iterator, and it'll be just as fast as if it weren't reshaped at all. You, as the user, don't even need to know how it worked this magic: eachindex(A) just does it for you.

The trouble is that you don't always want to iterate over all indices, and not always in column-major order. If you're able to describe the access pattern up-front when you request the index iterator, then it allows for cleverness of this nature, even in more complicated nested loops with multiple arrays. And if we're able to always ask arrays for index iterators, then maybe this can safely allow offset arrays to define non-standard Int indexing.

JaredCrean2 · 2016-04-06T21:51:02Z

I see, thanks for the clear explanation. ReshapedArrays are a much harder problem that I was envisioning.

timholy · 2016-04-06T22:03:22Z

ReshapedArrays are a much harder problem that I was envisioning.

Only if you want to avoid performance regressions. If you don't care about that, they're really easy 😄.

shashi · 2016-04-23T20:55:05Z

I came across @Jutho's TensorOperations.jl. The correspondence between the APIs proposed here and that in TensorOperations is fairly clear. As far as the DSL part of the solution to this issue goes, TensorOperations' DSL, seems to be good at expressing loop-ridden algorithms without committing to any implementation scheme or even a commitment to what the allowed indices are.

I wanted to point out this excerpt from the README that talks about a clever implementation strategy:

For add!, trace! and the native implementation of contract!, the problem is recursively divided into smaller blocks by slicing along those dimensions which correspond to the largest strides for all of the arrays. When the subproblem reaches a sufficiently small size, it is evaluated by a separate kernel using a set of nested for loops. The implementation depends heavily on metaprogramming and Julia's unique @generated functions to implement this strategy efficiently for any dimensionality. The minimal problem size is a constant which could be tuned depending on the cache size. The modularity of the implementation also allows to easily replace the kernels if better implementations would exist (e.g. when more SIMD features become available).

I think parallel implementations can exploit facilities of this divide-and-conquer strategy as well. Food for thought.

aaronsheldon · 2016-08-10T21:23:11Z

Perhaps I am way off base on this one but it seems to me that in its most general form this is an attempt to re-invent database joins, especially in the case of sparse matrices. But it has been know for some time that for large numbers of indexes (i.e. rows) it will be hard to beat hash tables on the specific (fields/columns) indexes being joined.

Essentially sparse matrix multiplication is a form of inner join, and sparse matrix addition is a form of outer join.

Put another way, while it may seem you want to march both arrays in some form of "simultaneous" manner, in the most general case, where you give up all prior knowledge of the memory layout, you would want to read one array in the most efficient linear manner possible, and then the other array, while using a pair of hashes that are dual to one another (e.g. hash 1 maps in column major order and hash 2 maps in row major order).

Sacha0 · 2018-01-26T18:32:59Z

Ref.

julia/stdlib/SparseArrays/src/higherorderfns.jl

Line 32 in 296d572

    
           # (1) The definitions below provide a common interface to sparse vectors and matrices

for a little work in this direction. (Namely a hacky common interface for iteration over SparseVectors and SparseMatrixCSCs sufficient for the purposes of sparse map and broadcast.) Best!

StefanKarpinski · 2019-07-10T15:10:01Z

Is this issue still relevant after the revision of the iteration protocol?

vtjnash · 2019-07-10T15:47:47Z

Skimming quickly, I don't think it had anything to do with the protocol, but may have been enabled by the changes to the broadcast framework and perhaps motivating future work such as "For Loops 2.0: Index Notation And The Future Of Tensor Compilers by Peter Ahrens" https://www.youtube.com/watch?v=Rp7sTl9oPNI

I don't think we need an issue open though, if there's nothing to change specifically

tkf · 2019-07-12T22:05:39Z

(Oops, I've been preparing to post a possible new solution here then I just noticed that it's closed. But please allow me to continue the discussion. I think "how to use the best/a good-enough iteration strategy for a given set of containers?" is still an issue.)

I'd like to suggest an approach to this problem "dual" to the iterator-based APIs: use foldl (or reduce) and foreach derived from it for complex loops such as the ones discussed here. The basic API I came up with is:

using NDReducibles: ndreducible
using Referenceables: referenceable

# Implementing C += A * B
foreach(
    # Create a "virtual container" (reducible) where the indices are coupled
    # based on the specification below:
    ndreducible(
        referenceable(C) => (:i, :j),  # mark that we want to mutate C
        A => (:i, :k),
        B => (:k, :j),
    ),
) do (c, a, b)
    c[] += a * b
end

This tries to choose the best nesting order of the for loop. ATM, it only supports DenseArray and alike composed with Transpose/Adjoint and Broadcasted. A few more demos and docs can be found here: https://tkf.github.io/NDReducibles.jl/dev/ (and https://tkf.github.io/Referenceables.jl/dev/ ) Note that, as you can see, I haven't optimized the terseness of the syntax yet.

This API is based on Transducers.jl's version of foldl. It means that it is as powerful as iterator-based protocol in terms of the semantics. That is to say, things like early termination ("break") and filtering also work. However, by completely going to a "callback"-based API, the "loop executor" has much more control than the iterator-based API. For example, it's possible to implement different parallel execution strategies based on the given set of containers (without macro magics). Fancy iterations like the one based on space filling curve is also possible. In fact, callback-based API naturally fits well with the divide-and-conquer approach. Furthermore, it allows you to define copying-to-Array-first as a valid strategy if the combination of the given custom arrays is too complicated to handle (of course, it may need to do a quick memory estimate and use this path only when the copies are not huge). Similarly, references to an element to arrays (as c[] above) do not have to directly invoke getindex/setindex!. This could be an interesting optimization when none of the indices for the destination array(s) are used in the inner-most loop (i.e., read once in an outer loop, put it in a RefValue, use it in the inner loops, and then write it back in the outer loop). [1]

One point of view is that this API is a generalization of the broadcasting API. That is to say, copyto! can be derived from foldl or reduce but not the other way around. So, I think it is worth considering to formalise foldl/reduce as the basic interface that is used from broadcasting as well as mapping, filtering, and reduction. Then, complex loops like GEMM can also be derived as a foreach (side-effect-only mapping).

[1] the last few points are actually doable in zip-like iterator-based interface as well (but not with foreach-like interface)

tomasaschan mentioned this issue Mar 31, 2016

Add blog post on challenging AbstractArray types JuliaLang/www_old.julialang.org#366

Merged

garrison added the julep Julia Enhancement Proposal label Mar 31, 2016

This was referenced Apr 1, 2016

ReshapedArrays, take2 #15449

Merged

RFC: use first/last as endpoints for Range indexing #15750

Closed

timholy mentioned this issue Apr 4, 2016

Naming and composability? eschnett/FastArrays.jl#1

Closed

timholy mentioned this issue Apr 29, 2016

RFC: always warn on overwritten methods #14759

Merged

JeffBezanson mentioned this issue Jul 7, 2016

n-arg map performance #17321

Closed

mbauman mentioned this issue Jan 26, 2018

AbstractSparseArray Interface #25760

Closed

StefanKarpinski closed this as completed Jul 10, 2019

tkf mentioned this issue Jul 13, 2019

Inference of collect JuliaFolds/Transducers.jl#13

Closed

This was referenced Mar 6, 2020

Code generation and cartesian iteration #9080

Closed

Optimize foldl/foreach for zip(arrays...), CartesianIndices, etc. #35036

Open

mbauman mentioned this issue Nov 6, 2020

Implementing new IndexStyle / type of index? #38284

Closed

MasonProtter mentioned this issue May 10, 2023

More transducers #49735

Open

Julep: solving tricky iteration problems #15648

Julep: solving tricky iteration problems #15648

Comments

timholy commented Mar 28, 2016

The problem (a brief summary)

Possible APIs

The "raw" API

Making it prettier with "sufficiently smart iteration"

meggart commented Mar 29, 2016

timholy commented Mar 29, 2016

eschnett commented Mar 29, 2016

timholy commented Mar 29, 2016

eschnett commented Mar 29, 2016

mbauman commented Mar 29, 2016

timholy commented Mar 29, 2016

timholy commented Mar 29, 2016

mbauman commented Mar 29, 2016

timholy commented Mar 29, 2016

eschnett commented Mar 29, 2016

timholy commented Mar 29, 2016

eschnett commented Mar 29, 2016

meggart commented Mar 29, 2016

timholy commented Mar 29, 2016

mbauman commented Mar 29, 2016

timholy commented Mar 29, 2016

timholy commented Mar 30, 2016

tkelman commented Mar 30, 2016

timholy commented Mar 30, 2016

timholy commented Mar 30, 2016

mbauman commented Mar 30, 2016

timholy commented Mar 30, 2016

timholy commented Mar 30, 2016

mbauman commented Mar 30, 2016

timholy commented Apr 4, 2016

eschnett commented Apr 4, 2016

timholy commented Apr 4, 2016

vtjnash commented Apr 4, 2016

timholy commented Apr 4, 2016

mbauman commented Apr 4, 2016

vtjnash commented Apr 4, 2016

timholy commented Apr 4, 2016

timholy commented Apr 4, 2016

JaredCrean2 commented Apr 5, 2016

mbauman commented Apr 6, 2016

JaredCrean2 commented Apr 6, 2016

timholy commented Apr 6, 2016

shashi commented Apr 23, 2016

aaronsheldon commented Aug 10, 2016

Sacha0 commented Jan 26, 2018

StefanKarpinski commented Jul 10, 2019

vtjnash commented Jul 10, 2019

tkf commented Jul 12, 2019