Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark arrays with no levels as ordered when assigning ordered value #223

Merged
merged 1 commit into from Nov 13, 2019

Conversation

nalimilan
Copy link
Member

Adding this special case ensures that copy!(similar(x), x) gives an ordered array
when x is ordered. This is consistent with what vcat already does when one of the
inputs has no levels.

This is needed in particular by the DataFrames vcat method, which cannot use our vcat.

Fixes JuliaData/DataFrames.jl#2002.

Adding this special case ensures that `copy!(similar(x), x)` gives an ordered array
when `x` is ordered. This is consistent with what `vcat` already does when one of the
inputs has no levels.

This is needed in particular by the DataFrames `vcat` method, which cannot use our `vcat`.
@bkamins
Copy link
Member

bkamins commented Nov 11, 2019

Looks good.

A small, indirectly related question, is why similar on ordered categorical array returns an unordered categorical array?

@nalimilan
Copy link
Member Author

Good question. Currently similar doesn't preserve levels of the input, as you often want to store completely new values, and levels of the input would be irrelevant and annoying. Since levels are not preserved, orderedness isn't either (as it mainly makes sense in relation with a set of levels).

But yeah, what we could do instead of this PR is have similar preserve orderedness even if it drops all levels. Then we wouldn't need the special-case added by this PR. Instead, we would need another special case: allow setindex! to add new levels to an ordered array when it has no levels. I guess that makes sense and would be more logical than the current state of the PR.

@bkamins
Copy link
Member

bkamins commented Nov 11, 2019

As usual you understand the consequences better here 😄. Feel free to do whatever you find best. I will just leave some thoughts below.

In R neither ordered nor factor allow adding new levels by index setting (NA is produced).

We in CategoricalArrays.jl allow setindex! to extend the pool for unordered categorical, but not for ordered categorical. I understand we want to stick with this distinction.

Given this I think that similar should produce an unordered CategoricalArray and drop all levels by default.

What would you say for adding two Bool kwargs to similar something like keep_ordering and keep_levels (names are tentative) which would default to false. If any of them is set to true the respective attribute is copied from the parent?

@nalimilan
Copy link
Member Author

nalimilan commented Nov 11, 2019

Actually, one issue with the approach I mentioned in my previous comment is that allocatecolumn(::Type{<:CategoricalValue}}) and similar(::Array, ::Type{<:CategoricalValue}) don't have access to the value itself, so they cannot know whether it's ordered or not, and therefore have to return an unordered CategoricalArray. It sounds more and more that storing the orderedness in the type would be the best approach in terms of implementation -- but it would be annoying for users as they couldn't mark the array as ordered in place.

What would you say for adding two Bool kwargs to similar something like keep_ordering and keep_levels (names are tentative) which would default to false. If any of them is set to true the respective attribute is copied from the parent?

Well we could do that, but I'm not sure anybody would use it. :-) The main problem this PR intends to fix is that copy!(similar(x), x) doesn't preserve orderedness. Adding keyword arguments that are specific to categorical arrays (and therefore cannot be passed to similar in generic code) wouldn't fix that.

@bkamins
Copy link
Member

bkamins commented Nov 11, 2019

OK - so I guess we should stick with what you implemented in this PR?

@nalimilan nalimilan closed this Nov 11, 2019
@nalimilan nalimilan reopened this Nov 11, 2019
@nalimilan
Copy link
Member Author

Well that sounds like the less problematic solution. Indeed I've realized that if we put the orderedness in the type, vcat of two ordered arrays would have to be ordered, or throw an error. That would mean you can't concatenate two ordered arrays with incompatible levels, which while not essential would be annoying and somewhat breaking the AbstractArray interface.

@bkamins
Copy link
Member

bkamins commented Nov 11, 2019

OK - then, again, I think what you proposed is a way to go (as usual considering the best design CategoricalArrays.jl is like writing a research paper experience 😄).

@nalimilan
Copy link
Member Author

OK, let's go with that. It's indeed hard to believe how proper handling of categorical data is tricky. Maybe that's why R has completely given up:

> x = ordered(c("a", "b", "c"))
> x
[1] a b c
Levels: a < b < c
> y = ordered(c("a", "b", "c"))
> levels(y) <- rev(levels(y))
> y
[1] c b a
Levels: c < b < a
> c(x, y) # (In)famous
[1] 1 2 3 1 2 3
> str(rbind(data.frame(a=x), data.frame(a=y))) # More pernicious
'data.frame':	6 obs. of  1 variable:
 $ a: Ord.factor w/ 3 levels "a"<"b"<"c": 1 2 3 3 2 1

@nalimilan nalimilan merged commit f8113c5 into master Nov 13, 2019
@nalimilan nalimilan deleted the nl/ordered branch November 13, 2019 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When vcat dataframes, ordering of categorical variables is lost
2 participants