Mark arrays with no levels as ordered when assigning ordered value #223

nalimilan · 2019-11-10T18:44:56Z

Adding this special case ensures that `copy!(similar(x), x)` gives an ordered array
when `x` is ordered. This is consistent with what `vcat` already does when one of the
inputs has no levels.

This is needed in particular by the DataFrames `vcat` method, which cannot use our `vcat`.

Fixes JuliaData/DataFrames.jl#2002.
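
A minimal sketch of the situation this PR targets, assuming a CategoricalArrays.jl version that includes this change. The variable names are illustrative only, and the printed values may differ between package versions, so none are asserted here.

```julia
using CategoricalArrays

# An ordered categorical vector.
x = categorical(["a", "b", "c"], ordered=true)

# `similar` allocates a new categorical array; per the discussion in this
# thread it does not carry over the levels (or the orderedness) of `x`.
s = similar(x)
@show levels(s)
@show isordered(s)

# Copying ordered values into the array that has no levels is the special
# case addressed by this PR: the destination should end up ordered.
y = copy!(s, x)
@show isordered(y)
```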

bkamins · 2019-11-11T09:05:45Z

Looks good.

A small, indirectly related question: why does `similar` on an ordered categorical array return an unordered categorical array?

nalimilan · 2019-11-11T10:39:29Z

Good question. Currently similar doesn't preserve levels of the input, as you often want to store completely new values, and levels of the input would be irrelevant and annoying. Since levels are not preserved, orderedness isn't either (as it mainly makes sense in relation with a set of levels).

But yeah, what we could do instead of this PR is have similar preserve orderedness even if it drops all levels. Then we wouldn't need the special-case added by this PR. Instead, we would need another special case: allow setindex! to add new levels to an ordered array when it has no levels. I guess that makes sense and would be more logical than the current state of the PR.

bkamins · 2019-11-11T11:15:55Z

As usual you understand the consequences better here 😄. Feel free to do whatever you find best. I will just leave some thoughts below.

In R neither ordered nor factor allow adding new levels by index setting (NA is produced).

We in CategoricalArrays.jl allow setindex! to extend the pool for unordered categorical, but not for ordered categorical. I understand we want to stick with this distinction.
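
The distinction described above can be sketched as follows. This is only an illustration under the assumption that the behavior matches the description in this thread; the `try`/`catch` is there so the snippet runs either way, and the exact exception type is not assumed.

```julia
using CategoricalArrays

u = categorical(["a", "b"])                 # unordered
o = categorical(["a", "b"], ordered=true)   # ordered

# Assigning an unseen value to an unordered array extends its pool.
u[1] = "c"
@show levels(u)

# Assigning an unseen value to an ordered array is described above as
# not allowed; report whatever happens rather than assuming the error type.
try
    o[1] = "c"
    @show levels(o)
catch err
    @show typeof(err)
end
```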

Given this I think that similar should produce an unordered CategoricalArray and drop all levels by default.

What would you say to adding two `Bool` kwargs to `similar`, something like `keep_ordering` and `keep_levels` (names are tentative), which would default to `false`? If either of them is set to `true`, the respective attribute is copied from the parent.

nalimilan · 2019-11-11T11:23:01Z

Actually, one issue with the approach I mentioned in my previous comment is that `allocatecolumn(::Type{<:CategoricalValue})` and `similar(::Array, ::Type{<:CategoricalValue})` don't have access to the value itself, so they cannot know whether it's ordered or not, and therefore have to return an unordered `CategoricalArray`. It sounds more and more like storing the orderedness in the type would be the best approach in terms of implementation, but it would be annoying for users as they couldn't mark the array as ordered in place.

> What would you say to adding two `Bool` kwargs to `similar`, something like `keep_ordering` and `keep_levels` (names are tentative), which would default to `false`? If either of them is set to `true`, the respective attribute is copied from the parent.

Well, we could do that, but I'm not sure anybody would use it. :-) The main problem this PR intends to fix is that `copy!(similar(x), x)` doesn't preserve orderedness. Adding keyword arguments that are specific to categorical arrays (and therefore cannot be passed to `similar` in generic code) wouldn't fix that.

bkamins · 2019-11-11T12:48:41Z

OK - so I guess we should stick with what you implemented in this PR?

nalimilan · 2019-11-11T15:21:14Z

Well, that sounds like the least problematic solution. Indeed, I've realized that if we put the orderedness in the type, `vcat` of two ordered arrays would have to be ordered, or throw an error. That would mean you can't concatenate two ordered arrays with incompatible levels, which, while not essential, would be annoying and would somewhat break the `AbstractArray` interface.

bkamins · 2019-11-11T19:30:22Z

OK - then, again, I think what you proposed is the way to go (as usual, working out the best design for CategoricalArrays.jl feels like writing a research paper 😄).

nalimilan · 2019-11-13T17:41:29Z

OK, let's go with that. It's indeed hard to believe how tricky proper handling of categorical data is. Maybe that's why R has completely given up:

> x = ordered(c("a", "b", "c"))
> x
[1] a b c
Levels: a < b < c
> y = ordered(c("a", "b", "c"))
> levels(y) <- rev(levels(y))
> y
[1] c b a
Levels: c < b < a
> c(x, y) # (In)famous
[1] 1 2 3 1 2 3
> str(rbind(data.frame(a=x), data.frame(a=y))) # More pernicious
'data.frame':	6 obs. of  1 variable:
 $ a: Ord.factor w/ 3 levels "a"<"b"<"c": 1 2 3 3 2 1
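
For comparison, here is a rough Julia counterpart of concatenating two ordered arrays whose level orders conflict. It is a sketch, not an exact translation: `levels!` reorders the levels without relabeling the values, whereas the R assignment `levels(y) <- rev(levels(y))` relabels them. The result of `isordered(z)` depends on how the installed CategoricalArrays.jl version resolves incompatible orders, so it is printed rather than assumed.

```julia
using CategoricalArrays

x = categorical(["a", "b", "c"], ordered=true)
y = categorical(["a", "b", "c"], ordered=true)

# Reverse the level order of `y`; the stored values stay "a", "b", "c".
levels!(y, reverse(levels(y)))
@show levels(x)
@show levels(y)

# Concatenate two ordered arrays with incompatible level orders.
z = vcat(x, y)
@show string.(z)     # the values themselves should be preserved
@show levels(z)
@show isordered(z)
```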

nalimilan mentioned this pull request Nov 10, 2019

When vcat dataframes, ordering of categorical variables is lost JuliaData/DataFrames.jl#2002

Closed

nalimilan closed this Nov 11, 2019

nalimilan reopened this Nov 11, 2019

nalimilan merged commit f8113c5 into master Nov 13, 2019

nalimilan deleted the nl/ordered branch November 13, 2019 17:41
