CategoricalArray type not closed under unique method #129

Open
ablaom opened this issue Feb 26, 2018 · 6 comments

@ablaom

ablaom commented Feb 26, 2018

When one applies the unique function to a categorical array, I would expect a categorical array of the same type to be returned but this is not the case. I'm using Julia 0.6:

julia> CategoricalArray(["a","b","c", "a"])
4-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "b"
 "c"
 "a"

julia> unique(ans)
3-element Array{String,1}:
 "a"
 "b"
 "c"

julia> VERSION
v"0.6.2"
@nalimilan
Member

It's not terribly useful to return a CategoricalArray in that case, since with unique values this type will take more memory than a standard Array. But indeed that also means you don't get CategoricalValue/CategoricalString objects which include ordering information.

What's your use case?

@ablaom
Author

ablaom commented Feb 26, 2018 via email

@nalimilan
Member

> I use unique to determine what the values are but my code won't work as expected if unique changes the element type.

More specifically, could you explain briefly why the code doesn't work if the types differ? I'm trying to evaluate whether this pattern can be common.

@ablaom
Author

ablaom commented Feb 28, 2018 via email

@ablaom
Author

ablaom commented May 1, 2019

Update: My code has moved on and the use-case above no longer exists. On reflection, I'm not sure there is a compelling reason to favour different behaviour. Feel free to close.

@alyst
Contributor

alyst commented Jul 7, 2021

I have another case: code that is agnostic of the array representation breaks if unique() returns the raw level values rather than a categorical array.

Suppose there is a function

nodes(edges::AbstractDataFrame) = DataFrame(id = sort!(unique(vcat(edges.source, edges.target))))

that works with the data frame representation of a graph (data frame edges with columns source and target) and returns the data frame of the graph nodes.
The resulting id column is expected to have the same type as edges.source and edges.target, but with the current behavior of unique(::CategoricalVector) it would be a plain Vector if source and target are categorical vectors.
So user code that expects nodes.id to be categorical (e.g. by calling levels(nodes.id)) would fail.
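
For illustration, here is a minimal sketch of a type-preserving alternative. The helper name `unique_categorical` is hypothetical (it is not part of CategoricalArrays), the sample data are made up, and the sketch assumes a CategoricalArrays version that exports `unwrap` (0.9 or later, as far as I know). It does not rely on what `unique` itself returns for a categorical vector.

```julia
using CategoricalArrays

# Hypothetical helper (not part of CategoricalArrays): the distinct values of `x`,
# returned as a categorical vector that keeps the levels and orderedness of `x`.
function unique_categorical(x::CategoricalVector)
    vals = unique(unwrap.(x))          # raw values, in order of first appearance
    out = categorical(vals)            # levels default to sorted order here
    levels!(out, unwrap.(levels(x)))   # restore the original level order
    ordered!(out, isordered(x))        # restore the ordered flag
    return out
end

# Made-up data with a deliberately non-alphabetical level order.
source = categorical(["b", "c", "a", "c"])
levels!(source, ["c", "a", "b"])

ids = unique_categorical(source)
```

With such a helper, `levels(ids)` would still work downstream, and levels that do not occur in the data are kept as well.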

But there are more annoying, subtle bugs. For example, sort!(unique(...)) would sort by value, not by level index.
So if the levels of source and target are not in sorted order, the order of nodes.id would differ from the order in levels(edges.source) etc.
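
The difference between the two sort orders can be sketched as follows. The data are made up, and `unwrap` is assumed to be available (CategoricalArrays 0.9 or later, as far as I know).

```julia
using CategoricalArrays

# Made-up data with a deliberately non-alphabetical level order.
source = categorical(["b", "c", "a", "c"])
levels!(source, ["c", "a", "b"])

# Sorting the categorical vector follows the level order: c, c, a, b.
by_level = sort!(copy(source))

# Sorting the raw strings follows lexicographic order: a, b, c, c.
by_value = sort!(unwrap.(source))

println(unwrap.(by_level))
println(by_value)
```

Any code that compares positions in the sorted result against `levels(source)` would therefore see a mismatch once the categorical wrapper is dropped.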
