Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: The Plan #1

Open
iamed2 opened this issue Feb 21, 2019 · 55 comments
Open

WIP: The Plan #1

iamed2 opened this issue Feb 21, 2019 · 55 comments

Comments

@iamed2
Copy link
Contributor

iamed2 commented Feb 21, 2019

This issue is still being written

See discussion in JuliaArrays/AxisArrays.jl#84 for background

Concepts

AxisArrays.jl is currently a mashup of distinct and mostly orthogonal indexing behaviour. These are, at the highest level:

  • associating a name with a dimension
  • associating an index mapping with a dimension
  • looking up an index in an index mapping
  • translating a getindex call using some transformation into a getindex call on an underlying structure, then constructing a wrapped result

Named Dimensions

Index Lookup

Indexed Dimensions

getindex Translation

@iamed2 iamed2 added this to Planning in Axis/Indexed Arrays Feb 21, 2019
Axis/Indexed Arrays automation moved this from Planning to Done Mar 13, 2019
@iamed2 iamed2 moved this from Done to Planning in Axis/Indexed Arrays Mar 13, 2019
@iamed2 iamed2 reopened this Mar 13, 2019
Axis/Indexed Arrays automation moved this from Planning to In progress Mar 13, 2019
@iamed2 iamed2 moved this from In progress to Planning in Axis/Indexed Arrays Mar 13, 2019
@oxinabox
Copy link
Member

Cc : @blegat , @juan-pablo-vielma

@iamed2
Copy link
Contributor Author

iamed2 commented Mar 13, 2019

I plan to continue adding to this as soon as I make some progress on my current task.

@oxinabox
Copy link
Member

Named Dimensions

These are wanted for Neural networks mostly, to make sure that you are summing and batching etc over the dimension you think you are.
See: http://nlp.seas.harvard.edu/NamedTensor
(In contrast Indexed Dimensions, are wanted for making sure the whole DataScience pipeline is correct)

  • Wrapper Type, that wraps an AbstractArray giving names to the indexes. The index named can go in the type params
  • Mapping function: name2dim(::NamedDimsArray, ::Symbol)::Int that will take an array with named dimentsions and dimention name, and return its dimension index.
  • Overloads for common functions that take a dims argument. Like sum and mean
  • kwarg style indexing feature request: kwargs for getindex JuliaLang/julia#25631

Most of this functionality should be implemented in a kind of traitish way so that
we can easily write code for the various different combinations of
Indexed Dimension, Named Dimension and Indexed Named Dimension

@oxinabox oxinabox mentioned this issue Apr 11, 2019
5 tasks
@Tokazama
Copy link
Member

I just started looking through the NamedDims repo, so I apologize if this is already documented but I missed it. Is NamedDims ultimately intended to be integrated into AxisArrays or is it suppose to be a dependency or something else entirely?

Also, I think https://github.com/JuliaDiffEq/LabelledArrays.jl implements a really simple method of indexing for symbols that can be adapted to indices. I know the comments in https://github.com/invenia/NamedDims.jl/blob/master/src/name_core.jl#L66 indicate can run at compile time but I haven't really stress tested NamedDims for speed yet.

I have a mostly finished version of all the statically typed ranges in base and tested against a lot of the range tests in Base https://github.com/Tokazama/StaticRanges.jl. I'm hoping to figure out a mix of dynamic and static indexing in a week or two when classes finish up here. I figured this may be helpful given how much of AxisArrays's code is dedicated to figuring out indexing.

@oxinabox
Copy link
Member

oxinabox commented Apr 18, 2019

I just started looking through the NamedDims repo, so I apologize if this is already documented but I missed it. Is NamedDims ultimately intended to be integrated into AxisArrays or is it suppose to be a dependency or something else entirely?

NamedDims.jl is intended to A) Be used on its own,
and B) Be integrated into a future package along with IndexedDims.jl (name pending, Indexes.jl?),
and that future package will replace AxisArrays.jl. (Like how StaticArrays.jl, replaced FixedSizedArrays.jl)

So if you only want Named Dimensions and nothing else.
then it is fine to just used NamedDim.jl

Also, I think https://github.com/JuliaDiffEq/LabelledArrays.jl implements a really simple method of indexing for symbols that can be adapted to indices.

Probably. It is a different thing again to either of the packages we're talk about (as I think you know.)
(as an aside: one does have to be careful about referring to index vs dim.)
But Indexes.jl (name pending), needs to support a lot more than symbols so I am not so much seeing what ideas you think are worth adapting.
But I've not really looked at the source.

I know the comments in https://github.com/invenia/NamedDims.jl/blob/master/src/name_core.jl#L66 indicate can run at compile time but I haven't really stress tested NamedDims for speed yet.

As of
invenia/NamedDims.jl@e60f5ea
basically everything is effectively zero-cost
No overhead beyond that of the underlying operations, as they would have been on the wrapped array type.

I have a mostly finished version of all the statically typed ranges in base and tested against a lot of the range tests in Base https://github.com/Tokazama/StaticRanges.jl. I'm hoping to figure out a mix of dynamic and static indexing in a week or two when classes finish up here. I figured this may be helpful given how much of AxisArrays's code is dedicated to figuring out indexing.

This looks interesting and might fill some, or all the pieces that the proposed Indexes.jl wants to do.


In general I am currently focussing on NamedDims, because they solve the bugs that had in my code related to various incorrect dimensions being summed and then std dev'd over.
(The sum of the std devs, is not the std dev of the sums 😬)

Indexes.jl will to solve the other kinds of mistake.

@iamed2
Copy link
Contributor Author

iamed2 commented Jul 25, 2019

On Monday, @oxinabox pointed to https://github.com/andyferris/AcceleratedArrays.jl as a candidate for the index lookup aspect and I think it's a great choice and I've been running with it on what I'm calling IndexedDims.jl. My goal for the JuliaCon hackathon tomorrow is to get a reasonable prototype of the indexed dimensions aspect functional, which may not have all of its necessary performance optimizations and may not actually overload getindex yet.

Key issues that require attention with this approach:

  • selecting a subset of a dimension requires reindexing (re-accelerating), meaning the original index is lost and a new one is constructed anew. It would be nice if perhaps AcceleratedArrays supported intelligent reindexing, so that a new accelerated array can be constructed through indexing without duplicating work in some scenarios (like indexing into an UniqueSortIndex-ed UnitRange with a UnitRange)
  • trying to split indexing into two separate styles is difficult, especially since there's no way to notate another indexing style without macros (due to setindex being a special kind of parsed expression). We do need to be able to index into the underlying array with Ints to participate in the generic AbstractArray interface, but this makes Ints special and hard to handle in the special indexing algorithm (same story for logical masks). Not sure how to solve this yet.
  • it's theoretically possible to handle APL-style multidimensional indexing but I'm punting on that for now as AxisArrays doesn't do it and it's hard. But what should happen when unsupported generic indexing patterns are used? Should we decay to generic AbstractArray indexing, or give a method error?
  • How much can we avoid implementing? Are we okay with just getindex and similar, or do we need to handle stuff like copyto! as well? How about broadcasting; what do we need?
  • How much type stability do we need? Is Julia better at dealing with partial type stability than in the 0.4 era? In which cases (user-facing method calls) can we not know the output type based on the input type?

@c42f
Copy link

c42f commented Jul 26, 2019

It's great to see this issue and some progress on replacing AxisArrays with a set of more minimal orthogonal functionality. Lately I've done a couple of cleanups over at AxisArrays but the package is basically unmaintained and I don't really want to be the new maintainer :-) Partly it's a matter of time, but even more importantly I can't see how the current AxisArrays design can move forward in a clear way. At least, not without removing features.

@rafaqz
Copy link

rafaqz commented Sep 8, 2019

This might be of interest: https://github.com/rafaqz/DimensionalData.jl

It's a mix of AxisArays.jl and NamedDims.jl. Basically as @c42f said above the things I wanted aren't possible without a serious refactor of AxisArrays. Mostly its easier to extend and the syntax is more flexible for using in other packages. I wrote it for spatial data, and its extended in https://github.com/rafaqz/GeoData.jl

Edit: The list of concepts above is basically identical to the concepts in DimensionalData.jl. The main point of difference is the stress on always using abstract types and method dispatch in DimensionalData - inherited behaviour instead of wrapper types.

A few other points relating to the list of @iamed2 and DimensionalData :

  • Using AcceleratedArrays would be great, but I have the same concern about the rebuild - it would be preferable if it was lazy or subsetted intelligently.
  • In DD normal getindex is the primary indexing method, the dimensions and selectors just return indices to getindex which works as usual. Extending packages don't need to know about named dims or selectors, they just need to support getindex and a dims() method.
  • DD methods seem to be type stable more often than the Base methods, as it has more type information
  • Multiple indexing styles coexist by using type wrappers - there are 4 styles of getindex:
    • regular getindex a[1, 2:3]
    • dim indexing where order doesn't matter a[X(2:3), Y(1)]
    • value selector indexing a[At(34.5), Between(70, 90)]
    • dim/value selector indexing a[X<|Between(70, 90), Y<|Near(34)]
      (<| just reduces parentheses)

Its a little verbose in the last case. but I can't think of how to simplify it while maintaining the flexibility.

Also I've just registered DD as I have a whole chain of packages depending on it, but it would be great not to maintain it on my own in future, and I'm keen to integrate the functionality into a future AxisArrays if it's possible.

@kcajf
Copy link

kcajf commented Sep 13, 2019

@rafaqz DimensionalData looks nice, have been playing around with it a bit.

I've started a discussion over at AcceleratedArrays - andyferris/AcceleratedArrays.jl#4 - to try and make rebuilding, slicing and dicing AcceleratedArrays cheaper. I currently have my own shoddy equivalents of UniqueHashIndex and SortedIndex but it would be great to have a common layer on which to build, and AcceleratedArrays seems the furthest ahead so far afaict.

@oxinabox
Copy link
Member

IndexedDims.jl is currently building on AcceleratedArrays.jl

@kcajf
Copy link

kcajf commented Sep 13, 2019

IndexedDims.jl is currently building on AcceleratedArrays.jl

yes, exactly! it's an exciting time to be an accelerated array.

@kcajf
Copy link

kcajf commented Sep 13, 2019

[ ] trying to split indexing into two separate styles is difficult, especially since there's no way to notate another indexing style without macros (due to setindex being a special kind of parsed expression). We do need to be able to index into the underlying array with Ints to participate in the generic AbstractArray interface, but this makes Ints special and hard to handle in the special indexing algorithm (same story for logical masks). Not sure how to solve this yet.

@iamed2 on this topic, I notice you have a (probably zero-overhead) Bypass type in IndexedDims.jl. I personally would favour the opposite, where indexing goes to the underlying array by default unless wrapped in a Rich(x) or something. This allows for mixing indexing types within a getindex call, and never risks violating the AbstractArray interface (or any 3rd party methods that do stuff like indexing with Vector{Int}). I just wonder if there could be a lighter / friendlier syntax for wrapping a "rich" indexer...

@rafaqz
Copy link

rafaqz commented Sep 14, 2019

@kcajf in DimensionalData getindex is just vanilla getindex to preserves the interface. Dimension or selector wrapper types trigger other behaviours as you are suggesting. I think this has to be the default and one of the biggest problems with other packages. It also means wrapper types work even if you dont inherit from AbstractDimensionalArray, because dispatch isn't on the array type but the wrapper.

Edit: yes getting the syntax for these wrapper types is the hardest part. Ive been trying yo find the shortest simplest words for them, but I wish we had better piping operators. A better varaiety of ascii infix ops would also help as sugar for type wrapper syntax

@Tokazama
Copy link
Member

One thing that has yet to be discussed is a meta package. The most obvious benefits to a meta package are:

  • Clear dependency structure. Only the meta package would depend on any of the other packages.
  • Early adoption of these various implementations by a common and consistent API. This ties in a lot with @rafaqz's efforts, but would likely be driven more by traits and syntax. I've begun to work on this in ImageCore (see here for using names associated with dimensions).
  • A more complete replacement of AxisArrays

@rafaqz
Copy link

rafaqz commented Sep 16, 2019

What do you think would be in a meta package?

@Tokazama
Copy link
Member

The only thing that I think is absolutely necessary is an array type with the named dimensions and indexing capabilities combined from all the basic packages. So if we were using IndexedDims and NamedDims there could be some sort of exported constant that wraps an IndexedDimsArray in a NamedDimsArray (or whatever array types we end up with in the end).

Everything else I can think of would mostly be tying syntax together, like making use of well thought out syntax found in DimensionalData. It would be nice if we could use thinks like At and Between across all these array types.

@rafaqz
Copy link

rafaqz commented Sep 16, 2019

Ok, I think we have some quite different approaches here! but I am quite interested to see which is the most practical.

First: do you mean a concrete or abstract array type? I'm concerned that using concrete types anywhere will create problems for extensibility. All of the methods that matter in DimensionalData are on abstract types, and the concrete array type is basically just an example. The AbstractDimensionalArray type provides the inherited behaviour in GeoData.jl - for both AbstractGeoArray and AbstractGeoSeries.

Another difference is I'm avoiding array wrappers and connecting indexing modes to the array type in any way. Most of the time (like view/getindex etc) the AbstractDimensionalArray inheritance isn't the object controlling dispatch: the AbstractDimension and Selector are. With some extra work you can use half the DD package without any inheritance (or wrappers) on the array type.

I was planning for indexing behaviour such as using accelerated arrays to be located entirely in the dimension types, not the array! I think its much simpler and more modular if array types don't 'know' that various indexing types even exist. GeoArray in GeoData.jl has no interaction with dimension/selector indexing, even inherited - dispatch happens on AbstractArray and AbstractDimension and GeoArray only sees normal indices after they are permuted/calculated etc.

@oxinabox
Copy link
Member

First: do you mean a concrete or abstract array type? I'm concerned that using concrete types anywhere will create problems for extensibility

Concrete, but type-parameterized.
Extensibility is via composition (i.e. wrapping),
rather than via inheritance.
You want something that is Named, Indexed, and runs on the GPU?
NamedDimArray{(:x, :y,) IndexedArray{..., CudaArray{Float32, 2}}}` Or (effectively equivelently, if care is taken to keep things orthogonal) IndexedArray{..., NamedDimArray{(:x, :y,) CudaArray{Float32, 2}}}`

It can all work the same on abstract stuff I guess,
but it seems unnesc so far.
Since the behavour its concrete, there is no varients on it.

The reason to make it abstract is if someone decides to have varients on it.
Like the proposal to add co-and-contravarient dimensions to NamedDimsArrays.
But even that I think might be better handled via a new type to compose with:
CoContrDimsArray{Tuple{↑, ↓}, NamedDimsArray{(:x, :y), ...}}

@rafaqz
Copy link

rafaqz commented Sep 16, 2019

Yeah that makes sense, although those examples would work equally well if they were concrete types inheriting from an abstract type and it's not a huge overhead to maintain. I pretty much wrote DimensionalData.jl because I had a use case for inheriting indexing behaviour and AxisArrays couldn't manage it.

In GeoData.jl AbstractGeoArray and AbstractGeoSeries both inherit from DimensionalArray, but have quite different roles and behaviours that are differentiated using dispatch. But the indexing is identical, as they have no interaction with it besides having a dims field they return when dims(a) is called. A bunch of concrete types for lazy NetCDF, GDAL and HDF5 loading inherit from them. These cant easily contain wrapper types as they just file pointers mocking the array interface. But wrapping them complicates dispatch for their other methods. (You can wrap CuArray in regular memory-backed GeoArray where it makes sense)

Then AbstractGeoStack isn't even in AbstractArray but it and its descendants also use dim/selector indexing. In this case extensibility is via defining a dims() method and some additional moving parts to handle a Symbol key + indices in the same getindex.

I can't see how to do these things cleanly with concrete array wrappers. There ends up being too much going on and dispatch and plot recipes get complicated. And it obviously can't work for non-arrays.

Edit: There are also cases where you want to lazily load the dimension arrays (ie when they are big matrices in an online database) and having dimensions handled by a concrete wrapper from another package makes that difficult. Inheriting indexing behaviours and adding a lazy dims() method for the type is pretty easy, and lets you present a unified interface for a lot of different underlying data.

@oxinabox
Copy link
Member

I really think having JuliaLang/julia#25631 will improve this whole situation significantly

I have no idea what part of that issue is not already complete.
NamedDims.jl already uses this.
Perhaps JuliaLang/julia#31732
which NamedDims has to special-case

@mcabbott
Copy link

Maybe the follow-up to that issue is to wonder what happens next to keywords. Should Base perhaps have a fallback method which turns getindex(A; kw...) into getindex(A, NamedTuple) or maybe gerindex(A, Pair.(kw.itr, values(kw.data))...) or something, so that they might be passed through wrappers which don’t understand them?

For example, strings seem to pass along just fine:

julia> Transpose(A3)[10.0, "a"] # AbstractIndices from above
0.6086961196739558

julia> Symmetric(NamedDimsArray{(:i,:j)}(rand(1:99, 3,3)))[i=2, j=2]
ERROR: MethodError: no method matching getindex(::Symmetric{...; i=2, j=2)

julia> Transpose(A4)[Near(23.5), At("b”)] # DimensionalData from above
ERROR: `dims` not defined for type Transpose

@oxinabox
Copy link
Member

Basically, the standard for overloading getindex for wrapper types should be to overload
Base.getindex(x::Wrapper, args...; kwargs...) = getindex(parent(x), args...; kwargs...)
that way it is entirely in the wrapped type (i.e. parent(x)) to either accept kwargs or not.

That would be woth its own issue on JuliaLang/Julia

@mcabbott
Copy link

Actually this seems quite messy, how should the named equivalent of Adjoint(reshape(1:4,2,2) .+ im)[1,:] be handled? adjoint(getindex(parent(x), args...; kwargs…)) would handle scalar output.

@oxinabox
Copy link
Member

Actually this seems quite messy, how should the named equivalent of Adjoint

Yes, doesn't work great for some.
ReshapedArray is another.

It works well for things that don't change indexing, like Symetric and a few others.
I think some of the Triangular types

@mcabbott
Copy link

Now I’m struggling to think of a type simple enough for this to work... I don’t think Symmetric([1 2; 999 3])[2,1] is great. Is this OK for PermutedDimsArray?

julia> Base.getindex(A::PermutedDimsArray; kw...) = getindex(A.parent; kw...)

julia> PermutedDimsArray(NamedDimsArray{(:i,:j,:k)}(rand(3,4,5)), (3,2,1))[j=1]
3×5 NamedDimsArray{(:i, :k),false,Float64,2,Array{Float64,2}}:
 0.148024  0.809751    0.77254    0.239625  0.742684
 ...

@oxinabox
Copy link
Member

oxinabox commented Oct 1, 2019

Probably not, since something that should return a row vector will return a column vector.

@rafaqz
Copy link

rafaqz commented Oct 2, 2019

@mcabbott I agree

A4[iter=Near(21.5), var=1]  # doesn’t work but could

That could be added pretty easily and is easy to write, and I've thought about it a lot. The problem is we would lose dispatch on dimensions with regular indexing. a[iter=Near(1)] would be fine, but a[iter=2] gives us nothing to dispatch on besides the array type.

Dispatch on AbstractDimension allows for a lot of flexible behaviour, like using dims on on existing types that can't easily change their upstream abstract type, or aren't even <: AbstractArray. The ideas was also to go for maximum composability and flexibility - and the type wrapper syntax is very composable.

But I agree the Dim{:var} syntax is annoying. You can use @dim Var to make a short standalone dim type, or just const Var = Dim{:var}. In either case the = syntax only saves one character, but neither is totally satisfying.

Edit: another options is to add a Symbol method to the <| operator to mean "wrap symbol in Dim{}" like: A4[:iter<|Near(21.5), :var<|1]. Which could give the best of both worlds.

I'm not totally comfortable using the pipe like that, and unfortunately the range of right associative operators isn't amazing.

@mcabbott
Copy link

mcabbott commented Oct 2, 2019

Why must it be right-associative? I guess you can have A[:iter ∈ Near(21.5)] when you own the type. How about :var == Index[1]? Interesting though that your dispatch seemingly caught things too early in my Transpose(A4)[Near(23.5), At("b")] example, instead of passing it along.

@rafaqz
Copy link

rafaqz commented Oct 3, 2019

It needs to be right associative otherwise you'll need brackets for ranges. Yes Near is never a problem... but we don't own the types for :x = 1. So the operator really needs to be right associative and owned by DD. With Index you're back where we started with Dim!

I also wouldn't say "too early", it's just a matter of priorities. Mine is to hide the dims mechanics from implementations, so I remove dims as early as possible so implementations only see regular indexing. Transpose and PermuteDimsArray are difficult to deal with because they mess with the dim order but not the data. The solutions could be to define dims for them (as we own the method) and reorder the dims tuple of the contained array to match the transpose. I just haven't written that yet. Another option is to add a constructor for Transpose(a::AbstractDimensionalArray) that rewraps the Transpose in a dim array, as with SubArray in view.

@mcabbott
Copy link

mcabbott commented Oct 3, 2019

OK. My point re dispatch was this: I thought the point of being able to dispatch on getindex(..., ::Selector) was to avoid having to dispatch on getindex(::Wrap{..., OwnType}, ::Int...), especially when your package may not know about Wrap. But having done that, next step needs still to work on Wrap.

@Tokazama
Copy link
Member

Small update

In the process of registering StaticRanges.jl which will hopefully be followed up with SortedArrays.jl.

The first started off as an entirely unrelated project to this but I realized that an implementation that used anything other than 1-based indexing would need to reconstruct the entire array or have a mutable struct and reconstruct the axes when concatenating, resizing, etc. StaticRanges.jl has mutable ranges that can be used to circumvent this (and obviously has statically parametrized ranges too).

@mcabbott
Copy link

mcabbott commented Nov 2, 2019

Maybe this is ready for a link here: AxisRanges.jl defines another thin wrapper, which works with NamedDims.jl for names. The two should commute, apart from bugs. It’s callable so that C[1,2] is always distinct from C("cat", 3.5).

Round brackets look up indices via findfirst(isequal("cat"), ranges(C,1)) by default. These “ranges” can be any AbstractVectors, and ought to work fine with AcceleratedArrays.jl etc. if desired. There are some faster methods built-in for ranges which are literally Julia ranges. The terminology seems a little awkward no matter how you slice it... axes(A) and “indices” are taken by Base, if you ask me, and names(A) is used as in NamedDims; ranges(A) could perhaps be axiskeys(A) or domains(A) or dimensions(A)?

To allow push!(V, 3.6) to extend the range, vectors are special cased with V.ranges::Ref instead of M.ranges::Tuple for matrices etc. If I understand right that’s the same concern that led to mutable types in StaticRanges.jl.

I had a go at making it work with Tables.jl, see what you think. It's not so obvious what the equivalent of setindex! should look like; although for anything not a scalar you can write C("dog") .= 0 as this is a view.

Edit: broadcasting now works, as do map and comprehensions.

@Tokazama
Copy link
Member

Tokazama commented Jan 31, 2020

I've been thinking about the problem of indexing by Ints or searching the axis keys and I thought I'd share it because it seems really nice when I play with it and would lead to a lot less user confusion.

We could have normal indexing by an Int or collection as you would get anywhere in Julia, but to index by the unique axis keys the user would use something like you would put in a filter function. For example:

julia> A = AxisArray2(reshape(1:9, 3,3),
               2:4,     # keys(axes(A, 1))
               3.0:5.0) # keys(axes(A, 2))

julia> A[1,1]
1

julia> A[==(2),==(3.0)]
1

julia> A[1:2,1:2]
2×2 Array{Int64,2}:
 1  4
 2  5

julia> A[<(4), <(5.0)]
2×2 Array{Int64,2}:
 1  4
 2  5

I really like this example because:

  1. I think I could show it to someone new to Julia and they could use AxisArray2 the same as every other array type until they needed to search by axis keys and then they would just use syntax they would use elsewhere in the standard library anyway (e.g., filter, findfirst, findlast, etc. all use this as their first argument).
  2. It doesn't require that we worry about the naming convention of a bunch of new names to accomplish this, so development would betty pretty straightforward. There's only one big thing that I think I would add to the methods here and it would be something like
  • and(>(1), <(4)): all values between 1 and 4. (maybe >(1) & <(4) could be add to base to further simplify this?)
  • or(<(1), >(4)): all values less than 1 and greater than four
  1. It solves other indexing problems in the ecosystem:
    • Replaces Not function from InvertedIndices.jl by just using !=(val).
    • Inefficiency of boolean indexing can be optimized. We wouldn't need to generate an entire BitArray the same size as A because determining what values are true in the BitArray could often be fused with the operation of actually retrieving and creating the new array.

@mcabbott
Copy link

mcabbott commented Feb 2, 2020

One possibly weird feature of using functions like this is that in your example A[==(2),==(3.0)] returns one value, presumably findfirst, while others return an array, indexed by findall. This is different to what filter does. What does A[k -> k==2, isequal(3.0)] return?

In AxisRanges right now, A(==(2),==(3.0)) is an array of all that match (for any functions), while A(2, 3.0) is one element. What's the objection to round brackets?

One concern is that you can't have A(2, 3.0) = 4 as notation for setkey!. You could have A(2, 3.0, :) .= 4, i.e. interpret a trailing : as making a length-1 view (or Ref) to write into. (But that's not implemented yet.)

@Tokazama
Copy link
Member

Tokazama commented Feb 2, 2020

If there's no performance or functionality cost to using round brackets then that's a better idea.

@Tokazama
Copy link
Member

Tokazama commented Feb 2, 2020

in your example A[==(2),==(3.0)] returns one value, presumably findfirst, while others return an array, indexed by findall. This is different to what filter does. What does A[k -> k==2, isequal(3.0)] return?

I'm not particularly fond of supporting anonymous functions like k -> k==2. I like that the resulting Fix2 would let us know the exact argument provided. It seems obvious to me (but perhaps only me) that ==(2) could only return one value because we should have unique keys. Things like >(2) can have multiple values.

In my head it seems to be parallel to the behavior of A[1] and A[1:1] which return different things. The first can only return a single value where the latter could return multiple (although in this case it doesn't) so it produces another array.

If this is too convoluted then I'm probably going down the wrong path though.

@Tokazama
Copy link
Member

I have AxisIndices.jl up and running now. It implements the syntax I have proposed above but I've also tried to make it easy to customize behavior. It should be trivial to implement a type that indexes keys by uses round brackets if people want to experiment with that. Hopefully, this will make it so people don't have to waste time reimplementing all of the methods unrelated to indexing for a new array type (e.g., sum, dropdims, etc.).

@rafaqz
Copy link

rafaqz commented Aug 25, 2020

Just an update for DimensionalData.jl. You can now use named syntax for everyone who prefers that:

julia> A = DimArray(rand(10, 20, 30), (:a, :b, :c));

julia> A[a=2:5, c=9]

DimArray with dimensions:
 Dim{:a}: 2:5 (NoIndex)
 Dim{:b}: Base.OneTo(20) (NoIndex)
and referenced dimensions:
 Dim{:c}: 9 (NoIndex)
and data: 4×20 Array{Float64,2}
 0.868237   0.528297   0.32389     0.89322   0.6776    0.604891
 0.635544   0.0526766  0.965727     0.50829   0.661853  0.410173
 0.732377   0.990363   0.728461     0.610426  0.283663  0.00224321
 0.0849853  0.554705   0.594263     0.217618  0.198165  0.661853

You can also use Val with a tuple for the index to get compile time indexing and lookup where that makes sense:

julia> A = DimArray(rand(3, 3), (cat=Val((:a, :b, :c)), 
                                 val=Val((5.0, 6.0, 7.0))))
DimArray with dimensions:
 Dim{:cat}: Val{(:a, :b, :c)}() (Categorical: Unordered)
 Dim{:val}: Val{(5.0, 6.0, 7.0)}() (Categorical: Unordered)
and data: 3×3 Array{Float64,2}
 0.993357  0.765515  0.914423
 0.405196  0.98223   0.330779
 0.365312  0.388873  0.88732

julia> @btime A[1, 2]
  22.934 ns (1 allocation: 16 bytes)
0.9939444595885871

julia> @btime A[:a, 7.0]
  26.333 ns (1 allocation: 16 bytes)
0.32927504968939925

julia> @btime A[cat=:a, val=7.0]
  31.920 ns (2 allocations: 48 bytes)
0.7476441117572306

The At selector is assumed by default, unless you use an Int

There are more examples in the readme

@NowanIlfideme
Copy link

Are there any updates for this as of April 2021? As an avid xarray user (and novice Julia user), it's hard to figure out whether this discussion has converged, died out, or moved elsewhere (Discourse?)... My guess is that it's still up in the air, considering YAXArrays exists, and nearly all the above are still being developed.

I've yet to find anything I really love from the user's point of view, except maybe this comment above.

Since many like xarray's implementation in Python, it seems to me that it makes sense to have 2 main types:

  • An all-in-one array wrapper, that supports dimensions and coordinates, while also behaving like normal arrays. This probably requires some syntax like A[x=1, time=my_timestamp].
  • A "Dataset" wrapper of a collection of arrays, with automatic alignment of coords between arrays. This could then be "viewed" through different lenses in a Tables.jl-compatible way, e.g. view along "time" dimension gives "rows" of either sub-arrays or scalars.

For fun (and learning Julia more in-depth), I tried my own hand at implementing a mock-up... it's nowhere near as cool as all your folks' packages, with your no-overhead compile-time stuff. 😉

@rafaqz
Copy link

rafaqz commented Apr 25, 2021

It hasn't really converged. A few of us have continued working on packages fleshing out functionality for different use cases, but borrowing each others ideas. Both DimensionalData.jl and AxisKeys.jl do most of what you mention here, and get a lot of use. @Tokazama is working on generalizations of some of these ideas in ArrayInterface.jl and SpatioTemporalTraits.jl, which is very interesting work, and might end up tying some of these packages together.

And not loving things isn't very useful feedback. What specific things don't work for you in available packages? What do you want to do that you can't?

@Tokazama
Copy link
Member

As rafaqz said, I'm working on standardizing a lot of the interface for this kind of stuff. I really wish I had more to say at this point but I really need to get this stuff in place for discussion with those developing other packages before addressing general user interface stuff (mainly documented interfaces with competitive benchmarks times across all indexing types to support the change). I wish I could give an estimated time but I can't seem to finish projects faster than they are assigned to me lately. Hopefully I can start hitting this hard again in the next couple weeks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Development

No branches or pull requests

8 participants