New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Row and Column Iterator for Matrices #29749

Merged
merged 25 commits into from Dec 3, 2018

Conversation

@arnavs
Contributor

arnavs commented Oct 21, 2018

Talked in Slack about creating eachrow(m::Matrix) and eachcolumn(m::Matrix) which return an iterator over the rows and columns of M. Here's a stab at it.

@arnavs arnavs force-pushed the arnavs:master branch from 54250c9 to 6efae6b Oct 21, 2018

@fredrikekre

This comment has been minimized.

Member

fredrikekre commented Oct 21, 2018

See #14491

@arnavs

This comment has been minimized.

Contributor

arnavs commented Oct 21, 2018

Thanks @fredrikekre, I updated the naming convention to match what was happening in that thread (eachrow and eachcol).

Also, what's going on with CI? Looks like we have two errors in something called sockets for 2 of the builds?

@andyferris

This comment has been minimized.

Contributor

andyferris commented Oct 22, 2018

Cool, nice, I was hoping we would implement something like this soon!

This one is a little close to my heart. For this functionality, I was hoping we could be slightly more generic - by being able to slice up arrays of any dimensionality by any one (or more) dimension.

Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors. This would let users interact with the result in ways beyond a simple for loop. For example, the presence of the indices makes it easier to match up multiple containers which share indices along certain dimensions. Finally, different array types may have their own implementations of map, reduce and so-on which is quite different to a sequential for loop over all elements (consider for example doing something in parallel with the rows of a DMatrix from DistributedArrays).

Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview (the latter being lazy). My intention here was always to develop these in a package and hopefully move these to Base through a PR here for Julia 1.x. (It should be noted that I don't particularly like my function names, but they are at least descriptive placeholders for now).

If I may be slightly philosophical - in my view, it seems more productive and flexible to let users put together different "split", "apply" and "combine" strategies like lego bricks, rather than to rely on increasingly complex functions like mapslices to do a split-apply-combine strategy all in one step, which is why I'm advocating for this in general, and excited by this PR in particular.

Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

4 similar comments
@andyferris

This comment has been minimized.

Contributor

andyferris commented Oct 22, 2018

Cool, nice, I was hoping we would implement something like this soon!

This one is a little close to my heart. For this functionality, I was hoping we could be slightly more generic - by being able to slice up arrays of any dimensionality by any one (or more) dimension.

Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors. This would let users interact with the result in ways beyond a simple for loop. For example, the presence of the indices makes it easier to match up multiple containers which share indices along certain dimensions. Finally, different array types may have their own implementations of map, reduce and so-on which is quite different to a sequential for loop over all elements (consider for example doing something in parallel with the rows of a DMatrix from DistributedArrays).

Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview (the latter being lazy). My intention here was always to develop these in a package and hopefully move these to Base through a PR here for Julia 1.x. (It should be noted that I don't particularly like my function names, but they are at least descriptive placeholders for now).

If I may be slightly philosophical - in my view, it seems more productive and flexible to let users put together different "split", "apply" and "combine" strategies like lego bricks, rather than to rely on increasingly complex functions like mapslices to do a split-apply-combine strategy all in one step, which is why I'm advocating for this in general, and excited by this PR in particular.

Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

@andyferris

This comment has been minimized.

Contributor

andyferris commented Oct 22, 2018

Cool, nice, I was hoping we would implement something like this soon!

This one is a little close to my heart. For this functionality, I was hoping we could be slightly more generic - by being able to slice up arrays of any dimensionality by any one (or more) dimension.

Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors. This would let users interact with the result in ways beyond a simple for loop. For example, the presence of the indices makes it easier to match up multiple containers which share indices along certain dimensions. Finally, different array types may have their own implementations of map, reduce and so-on which is quite different to a sequential for loop over all elements (consider for example doing something in parallel with the rows of a DMatrix from DistributedArrays).

Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview (the latter being lazy). My intention here was always to develop these in a package and hopefully move these to Base through a PR here for Julia 1.x. (It should be noted that I don't particularly like my function names, but they are at least descriptive placeholders for now).

If I may be slightly philosophical - in my view, it seems more productive and flexible to let users put together different "split", "apply" and "combine" strategies like lego bricks, rather than to rely on increasingly complex functions like mapslices to do a split-apply-combine strategy all in one step, which is why I'm advocating for this in general, and excited by this PR in particular.

Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

@andyferris

This comment has been minimized.

Contributor

andyferris commented Oct 22, 2018

Cool, nice, I was hoping we would implement something like this soon!

This one is a little close to my heart. For this functionality, I was hoping we could be slightly more generic - by being able to slice up arrays of any dimensionality by any one (or more) dimension.

Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors. This would let users interact with the result in ways beyond a simple for loop. For example, the presence of the indices makes it easier to match up multiple containers which share indices along certain dimensions. Finally, different array types may have their own implementations of map, reduce and so-on which is quite different to a sequential for loop over all elements (consider for example doing something in parallel with the rows of a DMatrix from DistributedArrays).

Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview (the latter being lazy). My intention here was always to develop these in a package and hopefully move these to Base through a PR here for Julia 1.x. (It should be noted that I don't particularly like my function names, but they are at least descriptive placeholders for now).

If I may be slightly philosophical - in my view, it seems more productive and flexible to let users put together different "split", "apply" and "combine" strategies like lego bricks, rather than to rely on increasingly complex functions like mapslices to do a split-apply-combine strategy all in one step, which is why I'm advocating for this in general, and excited by this PR in particular.

Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

@andyferris

This comment has been minimized.

Contributor

andyferris commented Oct 22, 2018

Cool, nice, I was hoping we would implement something like this soon!

This one is a little close to my heart. For this functionality, I was hoping we could be slightly more generic - by being able to slice up arrays of any dimensionality by any one (or more) dimension.

Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors. This would let users interact with the result in ways beyond a simple for loop. For example, the presence of the indices makes it easier to match up multiple containers which share indices along certain dimensions. Finally, different array types may have their own implementations of map, reduce and so-on which is quite different to a sequential for loop over all elements (consider for example doing something in parallel with the rows of a DMatrix from DistributedArrays).

Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview (the latter being lazy). My intention here was always to develop these in a package and hopefully move these to Base through a PR here for Julia 1.x. (It should be noted that I don't particularly like my function names, but they are at least descriptive placeholders for now).

If I may be slightly philosophical - in my view, it seems more productive and flexible to let users put together different "split", "apply" and "combine" strategies like lego bricks, rather than to rely on increasingly complex functions like mapslices to do a split-apply-combine strategy all in one step, which is why I'm advocating for this in general, and excited by this PR in particular.

Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

@andyferris

This comment has been minimized.

Contributor

andyferris commented Oct 22, 2018

(Github had problems earlier today... looks like they posted all my retries!)

@joshday

This comment has been minimized.

joshday commented Oct 22, 2018

I'll add that I've also implemented eachrow/eachcol in OnlineStatsBase.

The difference from the implementation in this PR is that OnlineStatsBase uses a buffer instead of views (I think views were a little slower, but that could be investigated further). The code is here: https://github.com/joshday/OnlineStatsBase.jl/blob/0449a586f9229f6bf43a2214b11a271318a240a4/src/OnlineStatsBase.jl#L84-L142.

@arnavs

This comment has been minimized.

Contributor

arnavs commented Oct 22, 2018

Replies to feedback (first @andyferris, then @joshday):

  1. Fully agree, would love if the slicing API was n-dimensional.

  2. Additionally, since arrays naturally support random-access, it's also possible to turn the "iterator"s into a fully-fledged AbstractArrays in their own right. That is, eachrow of a matrix could return a nested vector of vectors.

    For sure, we could implement rows(M) = collect(eachrow(M)) or something similar. Is that what you had in mind? (Edit: I still think it's valuable to keep the lightweight view iterator for instances that aren't conducive to reproducing all the data.)

  3. Anyway, because I've been interested in this for a little while, I implemented such a nested array dimension splitting in a package called SplitApplyCombine under the functions splitdims and splitdimsview

    Sounds good, I'll check this out.

  4. Anyway, if @arnavs and the community are at all interested, I'd be happy to help expand the functionality here to work on arbitrary dimensions and return a fully-fledged AbstractArray.

    For sure. Maybe if people think this is a good idea (cc: @ararslan, @StefanKarpinski, @fredrikekre), we can merge this and then implement a method eachslice(A::AbstractArray)? Or, if we need to workshop this a bit, that's OK too.

  5. I'll add that I've also implemented eachrow/eachcol in OnlineStatsBase...I think views were a little slower...

    Quite possible, but that's above my pay grade. I'll let others weigh in.

@StefanKarpinski

This comment has been minimized.

Member

StefanKarpinski commented Oct 22, 2018

@mbauman, can you take a look at this?

@ararslan ararslan requested review from mbauman and timholy Oct 22, 2018

@mbauman

This comment has been minimized.

Member

mbauman commented Oct 22, 2018

I agree with most all of what @andyferris wrote above, except that I'd suggest we start as simple as possible. I'll also note that there's another package implementation in JuliennedArrays.jl.

I'd suggest simply:

eachrow(A::AbstractArray) = (view(A, i, :) for i in axes(A, 1))
eachcol(A::AbstractArray) = (view(A, :, j) for j in axes(A, 2))

For the general case, it looks like constant propagation can almost handle:

eachslice(A, d) = (view(A, ntuple(n->n==d ? i : (:), ndims(A))...) for i in axes(A, d))

but we're not quite there yet. It'll take a bit more to make this type-stable.

Edit: oh, of course, we can just use selectdim:

eachslice(A, d) = (selectdim(A, 
@mbauman

I'd really like to see this as a more general structure, but I've added a few pedagogical comments based on the code you've written that I hope will be helpful.

Show resolved Hide resolved stdlib/LinearAlgebra/src/generic.jl Outdated
Show resolved Hide resolved stdlib/LinearAlgebra/src/generic.jl Outdated
Show resolved Hide resolved stdlib/LinearAlgebra/src/generic.jl Outdated
@mbauman

This comment has been minimized.

Member

mbauman commented Oct 22, 2018

buffer instead of views

I think I prefer the views behavior. Yes, buffers can be faster in many situations, but they also can have strange effects if you, e.g., collect the iterator. I prefer the view behaviors, and there's hope that someday they'll be faster (#14955).

@andyferris

This comment has been minimized.

Contributor

andyferris commented Oct 22, 2018

we can merge this and then implement a method eachslice(A::AbstractArray)

except that I'd suggest we start as simple as possible

I agree - let's do baby steps! :)

My concern is more on the bikeshedding front - that a function name such as eachrow is great for an iterator (for row in eachrow(matrix) is indeed very readable), but IMO not so great for a general function which transforms an array into another array. If we know the function will be expanded to included more functionality later, we may as well rename it now to reduce churn.

Our typical policy is to pick a simple English verb for Julia functions (and I don't see why we would deviate from that). Ideas that come to mind include slice, split, splitdims.

I'll also note that there's another package implementation in JuliennedArrays.jl.

Yes, I meant to come back and acknowledge JuliennedArrays (sorry @bramtayl). While I am talking function names - while I really appreciate the julienne play on words (genius!), for Julia Base perhaps a simpler English verb might be preferrable for new users (especially for those whose mother tongue isn't English)?

Show resolved Hide resolved base/abstractarraymath.jl Outdated
@nalimilan

This comment has been minimized.

Contributor

nalimilan commented Dec 1, 2018

Don't worry about fixing DataFrames, we can do that on our end. But why did you remove eachrow and eachcol from exports?

@fredrikekre

This comment has been minimized.

Member

fredrikekre commented Dec 1, 2018

But why did you remove eachrow and eachcol from exports?

I proposed on slack (and I think Matt agreed?) that we leave these unexported for Julia 1.1 but document that they will be exported in Julia 1.2. Exporting new names is breaking and I think we should consider doing the same for other new exports. In this case DataFrames just happens to be a package that we know of, and can fix, but I think it would be nice to give one minor release cycle for new names as a heads up and then we can export them in the next release.

We can export them from the Future standard library so they are available after using Future in Julia 1.1.

@nalimilan

This comment has been minimized.

Contributor

nalimilan commented Dec 1, 2018

That sounds like a bit radical to me. In this case the conflict is obvious (since as you say DataFrames is a highly visible package), but would we be willing to apply this rule to any new export? Looking at NEWS.md, I see 1.1 will also introduce splitpath and isnothing.

There will be some time anyway between the feature freeze and the final release, so that leaves time for packages to spot that their tests are failing on nightlies and fix that. We could also have a policy to file issues against all packages which PkgEval detects as broken due to this, for example right after the feature freeze.

@fredrikekre

This comment has been minimized.

Member

fredrikekre commented Dec 1, 2018

but would we be willing to apply this rule to any new export? Looking at NEWS.md, I see 1.1 will also introduce splitpath and isnothing.

Yes, that is what I meant. We haven't really discussed the problem with new exports AFAIK and if we havr lived without these functions for years I think we can live 4 more months without them?

We could also have a policy to file issues against all packages which PkgEval detects as broken

All Julia code out there is not PkgEvalable.

nalimilan and others added some commits Dec 1, 2018

Update base/abstractarraymath.jl
Co-Authored-By: arnavs <soodarnav01@gmail.com>
@mbauman

This comment has been minimized.

Member

mbauman commented Dec 1, 2018

My thoughts are:

  • we’re very close to 1.1
  • fixing this after we break it will lead to gripes and difficulties validating with PkgEval due to the many downstream dependents
  • exporting from Future is cute, but it doesn’t really seem to gain us much over having folks know that they’re public in Base and will eventually be exported.
  • waiting until 1.2 to export them will give folks time to upgrade DataFrames.
@nalimilan

This comment has been minimized.

Contributor

nalimilan commented Dec 1, 2018

Honestly I don't think adding unexported functions in one release will help people to fix their code in advance. Only people following Julia development closely will be aware of that, and these people are precisely the ones which will adapt before 1.1 is out anyway (like DataFrames). Others will only notice the breakage in 1.2 when we eventually export these symbols.

Regarding DataFrames, let's not take it as an argument to delay the addition of new exports. @bkamins has already made a PR to make the API more consistent with this PR, and we can merge a fix even before this PR is merged in Julia, and tag it in a few days.

@arnavs

This comment has been minimized.

Contributor

arnavs commented Dec 1, 2018

If everyone agrees that ‘eachslice’ is good to export, perhaps we could do that now and continue the discussion on the other two?

Not to put too fine a point on it, but I think we’ve reached the convergence point from this PR (the functionality we want is added, documented, tested, and won’t break things downstream). If I’m wrong, happy to iterate further, but this feature request has been open for years now and it would be nice to close.

@mbauman

This comment has been minimized.

Member

mbauman commented Dec 1, 2018

I would just very much like to find a conclusion here. @arnavs has been very tenacious, persistent and patient for this first PR. I leaned on the side of being conservative to help make that happen but apparently that backfired.

@mbauman

mbauman approved these changes Dec 1, 2018

@nalimilan

This comment has been minimized.

Contributor

nalimilan commented Dec 2, 2018

AFAICT we all agree the PR can be merged. The debate is about whether we should add more exports or not. But if we don't reach an agreement soon better merge and continue the discussion after that.

@bkamins

This comment has been minimized.

Contributor

bkamins commented Dec 2, 2018

If some package used e.g. eachcol earlier (like DataFrames.jl) the maintainers will have to introduce a conditional definition of eachcol and Base.eachcol anyway to keep supporting Julia 1.0 (or handle this duality via Compat.jl) no matter if we export it in Julia 1.1 or 1.2.

@AzamatB

This comment has been minimized.

AzamatB commented Dec 2, 2018

My 2¢: I agree with @nalimilan and @bkamins and think there is no need to postpone exports to 1.2

@AzamatB

AzamatB approved these changes Dec 2, 2018

@arnavs

This comment has been minimized.

Contributor

arnavs commented Dec 2, 2018

Alright, I re-added the exports. Let's never think about this PR ever again :)

@StefanKarpinski

This comment has been minimized.

Member

StefanKarpinski commented Dec 3, 2018

@nalimilan or @mbauman, please merge if and when you think this is ready (with appropriate squashing).

@mbauman

This comment has been minimized.

Member

mbauman commented Dec 3, 2018

I'm thrilled DataFrames folks are on board here and will accommodate the breakage. Thanks everyone. Let's merge!

@mbauman mbauman merged commit 6b04291 into JuliaLang:master Dec 3, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
julia freebsd ci Build done
Details

mbauman added a commit that referenced this pull request Dec 3, 2018

fredrikekre added a commit that referenced this pull request Dec 4, 2018

@nalimilan

This comment has been minimized.

Contributor

nalimilan commented Dec 4, 2018

Thanks! DataFrames PR is JuliaData/DataFrames.jl#1614, will be tagged shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment