remaining dependencies on CategoricalArrays #2506

bkamins · 2020-10-31T17:43:36Z

Apart from open PRs we have the following remaining dependencies on CategoricalArrays.jl:

This should be worked on after #2481 is merged.

@nalimilan - it can be done post 0.22 I think but before 1.0 (as this seems only optimization related). But could you please check (you know this code better), as maybe removing the dependency will have consequences like changing type or order of resulting computations (as we had in several cases recently). In general - it would be great if you could make an appropriate PR as you know this code much better than me. Thank you!

nalimilan · 2020-10-31T17:51:37Z

Yes since it's not breaking it doesn't need to be in 0.22.

For the first ones, and ugly hack checking the name of the type could work. For the latter we can just use DataAPI.refpool.

bkamins · 2020-11-06T09:04:21Z

I have looked into it and my conclusion is:

For

function hashrows_col!(h::Vector{UInt},
                       n::Vector{Bool},
                       v::AbstractCategoricalVector,
                       firstcol::Bool)

it is problematic to make it generic because CategoricalVector encodes missing as 0 as opposed to PooledVector.
Actually I think it would be best to add hasharray(array, h=0) function to DataAPI.jl that would delegate to the packages doing fast hashing of all their elements. What do you think?

for fast aggregates I was not fully sure how to do it cleanly (we need to assure pool sharing, but calling outcol = CategoricalArray{U, 1}(outcol.refs, incol.pool) seems to require having CategoricalArrays.jl as a dependency)

nalimilan · 2020-11-06T09:14:21Z

AFAICT we don't really need to know where missing is: just do hashes = hash.(refpool(v)). I think I wrote the code in its current form because refpool didn't exist until recently.

For aggregates it's harder. Need to think about it.

bkamins · 2020-11-06T09:18:52Z

The point is that what you propose does not work:

julia> x = categorical([1,2,missing])
3-element CategoricalArray{Union{Missing, Int64},1,UInt32}:
 1
 2
 missing

julia> hash.(DataAPI.refpool(x))
ERROR: MethodError: no method matching similar(::Type{Array{UInt64,N} where N}, ::Tuple{UnitRange{Int64}})

bkamins · 2020-11-06T09:21:52Z

My comment JuliaData/DataAPI.jl#26 (comment) is related. If we want to be able to have a fully generic interop without knowing if something is PooledArray, CategoricalArray or even something else that does not exist yet, we need a bit stricter API I think.

In particular I was wondering what is the benefit of encoding missing as 0 in refpool of CategoricalArray?

nalimilan · 2020-11-06T12:55:16Z

Ah yes that annoying similar method... Until we have a more direct solution, you can easily work around this with something like:

rp = DataAPI.refpool(x)
hashes = Vector{UInt}(undef, length(rp))
for (i, v) in zip(eachindex(hashes), rp)
    hashes[i] = hash(v)
end

I think technically we don't strictly need similar to work to write generic code. But yeah it's absence is very annoying.

In particular I was wondering what is the benefit of encoding missing as 0 in refpool of CategoricalArray?

Originally that was the simplest solution. I guess we could store it as length(levels)+1 instead, which would have the advantage of being the sorted order too. But we'd need to check that it doesn't make anything slower (i.e. we need to compiler to hoist the access to the array length in loops). Changing this would also break a lot of package code.

bkamins · 2020-11-06T14:27:14Z

Ah - right. I will open a PR for this to try learning something more about these things 😄. It will additionally speed up PooledVector a bit (not much though - I have benchmarked this part of code and it does not seem to be very expensive in the total cost)

nalimilan · 2020-11-07T14:55:02Z

Originally that was the simplest solution. I guess we could store it as length(levels)+1 instead, which would have the advantage of being the sorted order too. But we'd need to check that it doesn't make anything slower (i.e. we need to compiler to hoist the access to the array length in loops). Changing this would also break a lot of package code.

Now I see it: the problem with length(levels)+1 is that we would have to recode all missing values everytime we add a new level. We could use reference code 1 instead, but the drawback is that missing values would still not be sorted automatically.

bkamins · 2020-11-07T15:07:10Z

Ah - right. Let us leave it as is then.

bkamins added the non-breaking The proposed change is not breaking label Oct 31, 2020

bkamins added this to the 1.0 milestone Oct 31, 2020

bkamins mentioned this issue Oct 31, 2020

Release 0.22 tracking #2484

Closed

20 tasks

This was referenced Nov 6, 2020

make hashrows_col! not depend on CategoricalArrays.jl #2518

Merged

avoid CategoricalArrays dependency in aggregates #2519

Merged

bkamins closed this as completed in #2519 Nov 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

remaining dependencies on CategoricalArrays #2506

remaining dependencies on CategoricalArrays #2506

bkamins commented Oct 31, 2020

nalimilan commented Oct 31, 2020

bkamins commented Nov 6, 2020

nalimilan commented Nov 6, 2020

bkamins commented Nov 6, 2020

bkamins commented Nov 6, 2020

nalimilan commented Nov 6, 2020

bkamins commented Nov 6, 2020

nalimilan commented Nov 7, 2020

bkamins commented Nov 7, 2020

remaining dependencies on CategoricalArrays #2506

remaining dependencies on CategoricalArrays #2506

Comments

bkamins commented Oct 31, 2020

nalimilan commented Oct 31, 2020

bkamins commented Nov 6, 2020

nalimilan commented Nov 6, 2020

bkamins commented Nov 6, 2020

bkamins commented Nov 6, 2020

nalimilan commented Nov 6, 2020

bkamins commented Nov 6, 2020

nalimilan commented Nov 7, 2020

bkamins commented Nov 7, 2020