Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CategoricalVector and collapse #88

Merged
merged 14 commits into from
Aug 3, 2017
Merged

Conversation

iamed2
Copy link
Collaborator

@iamed2 iamed2 commented Jun 5, 2017

CategoricalVector allows any type of vector to be a Categorical axis.

flatten collapse concatenates AxisArrays with equal leading axes into a single AxisArray. All additional axes in any of the arrays are flattened into a single additional
CategoricalVector{Tuple} axis

I also changed axistrait to operate on types, with a fallback that calls typeof, in order to use it in @generated functions.

@iamed2
Copy link
Collaborator Author

iamed2 commented Jun 5, 2017

Nightly failure is a Julia bug (that I'm still trying to track down EDIT: JuliaLang/julia#22248). The issue occurs with the loop at the end of test/core.jl. If I assign the loop variables to variables, the loop completes. If not, the loop hangs and Julia is uninterruptible. I will just assign the loop variables for now.

@iamed2
Copy link
Collaborator Author

iamed2 commented Jun 15, 2017

Julia bug is fixed now anyway.

Copy link
Member

@timholy timholy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent work. I have only one substantial change to suggest, but it's nontrivial. Don't hesitate to ask for help if my comments were confusing.

I should also add that I've not really used the categorical capabilities of AxisArrays, so it would be great if a second set of eyeballs could take a peek here.

But this is really nice, thanks (preliminarily) for the contribution!

v = collect(zip([:a, :b, :c][rand(1:3,20)], [:x,:y][rand(1:2,20)], [:x,:y][rand(1:2,20)]))
idx = sortperm(v)
A = AxisArray(data[idx,:], CategoricalVector(v[idx]), [:a, :b])
@test A[:b, :] == A[5:12, :]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume these reflect the actual random numbers produced due to srand(1234)? Would it be clearer to just hardcode v?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would; I was just following test/sortedvector.jl as much as possible. I can change it though :)

REQUIRE Outdated
@@ -1,4 +1,5 @@
julia 0.5
IntervalSets
Iterators
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use IterTools.jl instead, which was created to avoid the name conflicts that arose from having both Base.Iterators and the Iterators package.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I'm actually in the middle of making PRs to transition everyone from Iterators to IterTools! Only missed this because it's still in my 0.5 directory.

src/combine.jl Outdated
return _flatten(gca, As...; kwargs...)
end

function flatten(last_dim::Integer, As::AxisArray...; kwargs...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only sad thing here is that the result of flatten won't be inferrable. Perhaps you don't care, but if there could be an inferrable API, it might be nicer. If you're just comparing AxisArrays, you should be able to split the axes into "common" and "distinct" subsets. For example,

using AxisArrays

# split out the first axis of each axisarray into a separate tuple
@inline selectcommon(As::AxisArray...) = selectcommon(axes.(As)...)
@inline selectcommon(axs::Tuple...) = _selectcommon(first.(axs), Base.tail.(axs))
# If the next axis has common name, keep it and then keep going
@inline _selectcommon(axes1::NTuple{N, Axis{name}}, trailingaxes::NTuple{N, Any}) where {N,name} = (axes1, _selectcommon(first.(trailingaxes), Base.tail.(trailingaxes))...)
# If they don't have the same Axis name, we drop the rest
@inline _selectcommon(axes1, trailingaxes) = ()

A = AxisArray(rand(3,3,3), :x, :y, :z)
B = AxisArray(rand(3,3,3), :x, :y, :t)

cmn = selectcommon(A, B)

should generate a tuple-of-tuples, each element that containing one common axis from each input AxisArray (in this case, it generates (xtuple, ytuple) where xtuple contains the two :x axes and ytuple contains the two :y axes). You can get the (non-common) trailing axes with a call to Base.IteratorsMD.split, just using the N in the NTuple for the length of cmn.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to tweak this to add in the ability to limit the length of the common axes, maybe with Val{max_dim}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I fully understand the concern, can you point me to a use-case example?

If you really can't live without the arithmetic operations, then one can always define a MyVal struct here. There's nothing remotely special about Val per se, it's literally trivial. And there isn't any special support for it in inference.jl either:

tim@diva:~/src/julia-1.0/base$ grep -F "Val{" inference.jl                                                               
tim@diva:~/src/julia-1.0/base$                                                                                                         

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@generated decrement(v::Type{Val{N}}) where N = :(Val{$(N-1)})

# split out the first axis of each axisarray into a separate tuple
@inline selectcommon(v::Type{Val{D}}, As::AxisArray...) where D = selectcommon(v, axes.(As)...)
@inline selectcommon(v::Type{Val{D}}, axs::Tuple...) where D = _selectcommon(v, first.(axs), Base.tail.(axs))
# If the next axis has common name, keep it and then keep going
@inline _selectcommon(v::Type{Val{D}}, axes1::NTuple{N, Axis{name}}, trailingaxes::NTuple{N, Any}) where {D,N,name} = (axes1, _selectcommon(decrement(v), first.(trailingaxes), Base.tail.(trailingaxes))...)
# If we have limited the axes to a certain number, stop
@inline _selectcommon(::Type{Val{0}}, axes1::NTuple{N, Axis{name}}, trailingaxes::NTuple{N, Any}) where {N,name} = ()
# If they don't have the same Axis name, we drop the rest
@inline _selectcommon(::Type{Val{D}}, axes1, trailingaxes) where D = ()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like reasonable code, I just don't understand what you're trying to achieve. Under what circumstance would you flatten with fewer common axes than are available?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, I'll try to explain our use case.

In an internal package we have an appendable feature set type which is index by some set of common indices. At any point we may want the "flattened" version. At any point we may push a new array to the set. We want the result of flattening a particular feature set to have the same axes always. We wouldn't want it to return different numbers of dimensions when there are only N arrays (where they share an extra axis) as when there are N+1 arrays (where array N+1 does not share the extra axis).

Does that make sense? Do you have alternate suggestions?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized that I really have two separate problems:

  1. Flatten all axes after the common axes
  2. Flatten all axes after axis N, ensuring axes 1:N are common

Trying to solve both in one function is creating strange mutant hybrid functions and causing increasing hacks. I think splitting these apart will be easier.

test/combine.jl Outdated
A1 = AxisArray(A1data, Axis{:X}(1:2), Axis{:Y}(1:2))
A2 = AxisArray(reshape(A2data, size(A2data)..., 1), Axis{:X}(1:2), Axis{:Y}(1:2), Axis{:Z}([:foo]))

@test flatten(A1, A2; array_names=[:A1, :A2]) == AxisArray(cat(3, A1data, A2data), Axis{:X}(1:2), Axis{:Y}(1:2), Axis{:page}(CategoricalVector([(:A1,), (:A2, :foo)])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you do make an inferrable version of flatten, it would be great to add @inferred around those calls.

@iamed2
Copy link
Collaborator Author

iamed2 commented Jun 23, 2017

Thanks Tim! I'll work on getting the inferrable flatten tomorrow or Saturday.

@iamed2
Copy link
Collaborator Author

iamed2 commented Jun 23, 2017

Nightly failures are #92

@iamed2
Copy link
Collaborator Author

iamed2 commented Jul 6, 2017

I updated this to only include the fixed-dimension flatten. Hopefully this can get merged and I can do the select-flatten later.

This PR requires 0.6. I think it would be reasonable to have a separate PR to remove support for 0.5. What do you think?

@timholy
Copy link
Member

timholy commented Jul 8, 2017

Sorry I didn't notice this earlier.

I think it would be reasonable to have a separate PR to remove support for 0.5.

Either way would be fine, as long as it's a separate commit.

@timholy
Copy link
Member

timholy commented Jul 8, 2017

Seems fine to me. I'll let you decide how to handle the 0.6 issue and then I think this can be merged.

@iamed2
Copy link
Collaborator Author

iamed2 commented Jul 25, 2017

Can we merge #98 and this?

@@ -0,0 +1,85 @@

export CategoricalVector

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a different name for this, as CategoricalVector is already used in CategoricalArrays, which replaced PooledDataArray in DataTables (and soon in DataFrames). Why not CategoricalAxis?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not an Axis, and it mirrors the SortedVector type.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, IIUC its only purpose is to treat it as a categorical axis, isn't it? Ideas about other possible names? It would be too bad to have conflicts when loading both AxisArrays and DataTables/DataFrames.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could choose not to export it?

There aren't conflicts when you use both packages unless you also use CategoricalVector.

The nomenclature used within AxisArrays is Categorical, which is how CategoricalVector came up.

How about DiscreteVector?

Copy link

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very familiar with the AxisArray code, but flatten sounds a bit unexpected for this function. Base.Iterators.flatten does something very different. Also, this seems to mix two operations which are not necessarily related: concatenation and flattening. Shouldn't the latter be implemented as cat? Then flattening could be enabled as an option to cat.

src/combine.jl Outdated
flatten(::Type{Val{N}}, ::Type{NewArrayType}, As::AxisArray...) -> AxisArray
flatten(::Type{Val{N}}, ::Type{NewArrayType}, labels::Tuple, As::AxisArray...) -> AxisArray

Concatenates AxisArrays with N equal leading axes into a single AxisArray.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backquotes around type names (here and elsewhere).

src/combine.jl Outdated
* `As::AxisArray...`: AxisArrays to be flattened together.
"""
@generated function flatten{N, AN, LType, NewArrayType<:AbstractArray}(
::Type{Val{N}},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the conventional indentation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only two existing cases of too-long method signatures and making this one look like those would mean pushing these arguments themselves too far to the right.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd do it anyway for consistency, and knowing that with the new where syntax the leading type parameters (which are the actual problem here) will be moved to the end of the line or to another line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use the where syntax already (updated now)

src/combine.jl Outdated
end

"""
flatten(::Type{Val{N}}, As::AxisArray...) -> AxisArray

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add examples? I find it hard to understand what this function does.

Copy link
Collaborator Author

@iamed2 iamed2 Jul 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example which has been added:

julia> price_data = AxisArray(rand(10), Axis{:time}(Date(2016,01,01):Day(1):Date(2016,01,10)))
1-dimensional AxisArray{Float64,1,...} with axes:
    :time, 2016-01-01:1 day:2016-01-10
And data, a 10-element Array{Float64,1}:
 0.885014
 0.418562
 0.609344
 0.72221
 0.43656
 0.840304
 0.455337
 0.65954
 0.393801
 0.260207

julia> size_data = AxisArray(rand(10,2), Axis{:time}(Date(2016,01,01):Day(1):Date(2016,01,10)), Axis{:measure}([:area, :volume]))
2-dimensional AxisArray{Float64,2,...} with axes:
    :time, 2016-01-01:1 day:2016-01-10
    :measure, Symbol[:area, :volume]
And data, a 10×2 Array{Float64,2}:
 0.159434     0.456992
 0.344521     0.374623
 0.522077     0.313256
 0.994697     0.320953
 0.95104      0.900526
 0.921854     0.729311
 0.000922581  0.148822
 0.449128     0.761714
 0.650277     0.135061
 0.688773     0.513845

julia> flattened = flatten(Val{1}, (:price, :size), price_data, size_data)
2-dimensional AxisArray{Float64,2,...} with axes:
    :time, 2016-01-01:1 day:2016-01-10
    :flat, Tuple{Symbol,Vararg{Symbol,N} where N}[(:price,), (:size, :area), (:size, :volume)]
And data, a 10×3 Array{Float64,2}:
 0.885014  0.159434     0.456992
 0.418562  0.344521     0.374623
 0.609344  0.522077     0.313256
 0.72221   0.994697     0.320953
 0.43656   0.95104      0.900526
 0.840304  0.921854     0.729311
 0.455337  0.000922581  0.148822
 0.65954   0.449128     0.761714
 0.393801  0.650277     0.135061
 0.260207  0.688773     0.513845

julia> flattened[Axis{:flat}(:size)] == size_data
true

@iamed2
Copy link
Collaborator Author

iamed2 commented Jul 31, 2017

I'm not very familiar with the AxisArray code, but flatten sounds a bit unexpected for this function. Base.Iterators.flatten does something very different. Also, this seems to mix two operations which are not necessarily related: concatenation and flattening. Shouldn't the latter be implemented as cat? Then flattening could be enabled as an option to cat.

The difference from cat is what happens to the axes when merged.

Also, something I want to do but have not yet done is allow merging of the leading axes, similar to join. In that case, it would be even more unlike cat.

export CategoricalVector

"""
A CategoricalVector is an AbstractVector which is treated as a categorical axis regardless

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing backticks around types in this docstring.

@iamed2 iamed2 changed the title Add CategoricalVector and flatten Add CategoricalVector and collapse Jul 31, 2017
Copy link
Member

@timholy timholy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. A couple of tiny changes, and then someone should hit merge.

src/combine.jl Outdated
flat_axis_eltype = _flat_axis_eltype(LType, trailing_axes)
flat_axis_type = CategoricalVector{flat_axis_eltype, Vector{flat_axis_eltype}}

new_axes_type = Tuple{new_common_axes..., Axis{:flat, flat_axis_type}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:collapsed?

src/combine.jl Outdated
end

@generated function collapse{N, AN, LType}(::Type{Val{N}}, labels::NTuple{AN, LType}, As::Vararg{AxisArray, AN})
flat_dim = Val{N + 1}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used?

src/combine.jl Outdated
collapse(::Type{Val{N}}, ::Type{NewArrayType}, labels::Tuple, As::AxisArray...) -> AxisArray

Collapses `AxisArray`s with `N` equal leading axes into a single `AxisArray`.
All additional axes in any of the arrays are flattened into a single additional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collapsed

src/combine.jl Outdated

Collapses `AxisArray`s with `N` equal leading axes into a single `AxisArray`.
All additional axes in any of the arrays are flattened into a single additional
`CategoricalVector{Tuple}` axis.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on the label of the new axis?

src/combine.jl Outdated
arrays. The remaining axes are flattened. All `N` axes must be common
to each input array, at the same dimension. Values from `0` up to the
minimum number of dimensions across all input arrays are allowed.
* `labels::Tuple`: (optional) a label for each `AxisArray` in `As` which is used in the flat
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely clear. Maybe "Index assigned to each array in As in the :collapsed axis"? Since it's optional, also specify what happens by default.

src/combine.jl Outdated
julia> collapsed = collapse(Val{1}, (:price, :size), price_data, size_data)
2-dimensional AxisArray{Float64,2,...} with axes:
:time, 2016-01-01:1 day:2016-01-10
:flat, Tuple{Symbol,Vararg{Symbol,N} where N}[(:price,), (:size, :area), (:size, :volume)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to be fixed if you change the name to :collapsed

@timholy timholy merged commit f0743ce into JuliaArrays:master Aug 3, 2017
@timholy
Copy link
Member

timholy commented Aug 3, 2017

Thanks for your patience, and of course for the contribution!

@iamed2
Copy link
Collaborator Author

iamed2 commented Aug 3, 2017

Thank you for the attention!! :D

@omus omus deleted the flatten branch December 28, 2017 21:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants