Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remaining dependencies on CategoricalArrays #2506

Closed
bkamins opened this issue Oct 31, 2020 · 9 comments · Fixed by #2519
Closed

remaining dependencies on CategoricalArrays #2506

bkamins opened this issue Oct 31, 2020 · 9 comments · Fixed by #2519
Labels
non-breaking The proposed change is not breaking
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Oct 31, 2020

Apart from open PRs we have the following remaining dependencies on CategoricalArrays.jl:

  1. https://github.com/JuliaData/DataFrames.jl/blob/master/src/groupeddataframe/splitapplycombine.jl#L963
  2. https://github.com/JuliaData/DataFrames.jl/blob/master/src/groupeddataframe/splitapplycombine.jl#L1040
  3. https://github.com/JuliaData/DataFrames.jl/blob/master/src/dataframerow/utils.jl#L38

This should be worked on after #2481 is merged.

@nalimilan - it can be done post 0.22 I think but before 1.0 (as this seems only optimization related). But could you please check (you know this code better), as maybe removing the dependency will have consequences like changing type or order of resulting computations (as we had in several cases recently). In general - it would be great if you could make an appropriate PR as you know this code much better than me. Thank you!

@bkamins bkamins added the non-breaking The proposed change is not breaking label Oct 31, 2020
@bkamins bkamins added this to the 1.0 milestone Oct 31, 2020
@bkamins bkamins mentioned this issue Oct 31, 2020
20 tasks
@nalimilan
Copy link
Member

Yes since it's not breaking it doesn't need to be in 0.22.

For the first ones, and ugly hack checking the name of the type could work. For the latter we can just use DataAPI.refpool.

@bkamins
Copy link
Member Author

bkamins commented Nov 6, 2020

I have looked into it and my conclusion is:

For

function hashrows_col!(h::Vector{UInt},
                       n::Vector{Bool},
                       v::AbstractCategoricalVector,
                       firstcol::Bool)

it is problematic to make it generic because CategoricalVector encodes missing as 0 as opposed to PooledVector.
Actually I think it would be best to add hasharray(array, h=0) function to DataAPI.jl that would delegate to the packages doing fast hashing of all their elements. What do you think?


for fast aggregates I was not fully sure how to do it cleanly (we need to assure pool sharing, but calling outcol = CategoricalArray{U, 1}(outcol.refs, incol.pool) seems to require having CategoricalArrays.jl as a dependency)

@nalimilan
Copy link
Member

AFAICT we don't really need to know where missing is: just do hashes = hash.(refpool(v)). I think I wrote the code in its current form because refpool didn't exist until recently.

For aggregates it's harder. Need to think about it.

@bkamins
Copy link
Member Author

bkamins commented Nov 6, 2020

The point is that what you propose does not work:

julia> x = categorical([1,2,missing])
3-element CategoricalArray{Union{Missing, Int64},1,UInt32}:
 1
 2
 missing

julia> hash.(DataAPI.refpool(x))
ERROR: MethodError: no method matching similar(::Type{Array{UInt64,N} where N}, ::Tuple{UnitRange{Int64}})

@bkamins
Copy link
Member Author

bkamins commented Nov 6, 2020

My comment JuliaData/DataAPI.jl#26 (comment) is related. If we want to be able to have a fully generic interop without knowing if something is PooledArray, CategoricalArray or even something else that does not exist yet, we need a bit stricter API I think.

In particular I was wondering what is the benefit of encoding missing as 0 in refpool of CategoricalArray?

@nalimilan
Copy link
Member

Ah yes that annoying similar method... Until we have a more direct solution, you can easily work around this with something like:

rp = DataAPI.refpool(x)
hashes = Vector{UInt}(undef, length(rp))
for (i, v) in zip(eachindex(hashes), rp)
    hashes[i] = hash(v)
end

I think technically we don't strictly need similar to work to write generic code. But yeah it's absence is very annoying.

In particular I was wondering what is the benefit of encoding missing as 0 in refpool of CategoricalArray?

Originally that was the simplest solution. I guess we could store it as length(levels)+1 instead, which would have the advantage of being the sorted order too. But we'd need to check that it doesn't make anything slower (i.e. we need to compiler to hoist the access to the array length in loops). Changing this would also break a lot of package code.

@bkamins
Copy link
Member Author

bkamins commented Nov 6, 2020

Ah - right. I will open a PR for this to try learning something more about these things 😄. It will additionally speed up PooledVector a bit (not much though - I have benchmarked this part of code and it does not seem to be very expensive in the total cost)

@nalimilan
Copy link
Member

Originally that was the simplest solution. I guess we could store it as length(levels)+1 instead, which would have the advantage of being the sorted order too. But we'd need to check that it doesn't make anything slower (i.e. we need to compiler to hoist the access to the array length in loops). Changing this would also break a lot of package code.

Now I see it: the problem with length(levels)+1 is that we would have to recode all missing values everytime we add a new level. We could use reference code 1 instead, but the drawback is that missing values would still not be sorted automatically.

@bkamins
Copy link
Member Author

bkamins commented Nov 7, 2020

Ah - right. Let us leave it as is then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants