make hashrows_col! not depend on CategoricalArrays.jl #2518

bkamins · 2020-11-06T14:44:03Z

@nalimilan - this follows your suggestion in #2506 (comment).

It is not fully in line with the DataAPI.jl API (but, as already mentioned, I propose making that API stricter and requiring DataAPI.refpool to return an AbstractVector).

If we agree on the proposal I will add more tests.

src/dataframerow/utils.jl

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2020-11-06T22:22:58Z

Here are the benchmarks of the change:

using DataFrames, PooledArrays, CategoricalArrays, DataAPI

x = [1:9000000; fill(1, 1000001)];
x1 = PooledArray(x);
x2 = categorical(x);
y = [1:9000002; fill(1, 1000000-1)];
y1 = PooledArray(y);
y2 = categorical(y);
length(DataAPI.refpool(x1))/length(x)
length(DataAPI.refpool(y1))/length(y)
for (i,n) in enumerate((x, x1, x2, y, y1, y2))
    @info i
    GC.gc()
    @time DataFrames.hashrows((n,), false)
end


julia> using DataFrames, PooledArrays, CategoricalArrays, DataAPI

julia> x = [1:9000000; fill(1, 1000001)];

julia> x1 = PooledArray(x);
julia> x2 = categorical(x);

julia> y = [1:9000002; fill(1, 1000000-1)];

julia> y1 = PooledArray(y);

julia> y2 = categorical(y);

julia> length(DataAPI.refpool(x1))/length(x)
0.899999910000009

julia> length(DataAPI.refpool(y1))/length(y)
0.900000109999989

julia> for (i,n) in enumerate((x, x1, x2, y, y1, y2))
           @info i
           GC.gc()
           @time DataFrames.hashrows((n,), false)
       end
[ Info: 1
  0.051783 seconds (5 allocations: 76.294 MiB, 5.32% gc time)
[ Info: 2
  0.101282 seconds (7 allocations: 144.959 MiB, 2.91% gc time)
[ Info: 3
  0.158293 seconds (7 allocations: 144.959 MiB, 2.19% gc time)
[ Info: 4
  0.061556 seconds (5 allocations: 76.294 MiB, 2.48% gc time)
[ Info: 5
  0.087488 seconds (5 allocations: 76.294 MiB, 1.94% gc time)
[ Info: 6
  0.112878 seconds (5 allocations: 76.294 MiB, 1.38% gc time)

julia> for (i,n) in enumerate((x, x1, x2, y, y1, y2))
           @info i
           GC.gc()
           @time DataFrames.hashrows((n,), false)
       end
[ Info: 1
  0.055569 seconds (5 allocations: 76.294 MiB, 2.99% gc time)
[ Info: 2
  0.112733 seconds (7 allocations: 144.959 MiB, 2.90% gc time)
[ Info: 3
  0.166712 seconds (7 allocations: 144.959 MiB, 2.05% gc time)
[ Info: 4
  0.062511 seconds (5 allocations: 76.294 MiB, 2.40% gc time)
[ Info: 5
  0.081028 seconds (5 allocations: 76.294 MiB, 1.94% gc time)
[ Info: 6
  0.117411 seconds (5 allocations: 76.294 MiB, 1.64% gc time)

@nalimilan - So it seems that the 90% threshold is OK (probably it could be even a bit lower, but it is hard to tune it optimally).
Also - we can see that when there are so many levels it is better not to do pooling.

@quinnj - do you still disable creation of a PooledArray in CSV.jl if there are too many levels in a categorical column or not?

I will add a test to make sure that all these cases produce the same hashes.

nalimilan · 2020-11-07T11:39:55Z

Thanks for benchmarking! So as I suspected (I was going to comment) the pooled hashing is a bit slower at 90%. I think I'd go with a lower threshold, e.g. 50% or even 10%.

bkamins · 2020-11-07T11:48:57Z

> So as I suspected

I also suspected this, but the previous code did not use this optimization. 10% is too low (copying data is faster than calculating hashes). I will change it to 50% then (it will not be optimal if we hash very long strings though - in that case something closer to 90% is better, but this is probably rare).
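
To make the trade-off concrete, here is a minimal sketch of the idea discussed above. It is not the code from this PR: `hash_first_col` and its `threshold` keyword are hypothetical names, and only `DataAPI.refpool` and `DataAPI.refarray` are actual DataAPI.jl functions. The sketch assumes, as proposed in this thread, that `DataAPI.refpool` returns either `nothing` or an `AbstractVector` with a contiguous block of indices.

```julia
using DataAPI, PooledArrays

# Hypothetical helper for illustration only (not the DataFrames.jl implementation).
# Hashes every element of `v`; when the pool is small relative to the column,
# each pool entry is hashed once and the result is reused through the references.
function hash_first_col(v::AbstractVector; threshold::Real=0.5)
    h = zeros(UInt, length(v))
    rp = DataAPI.refpool(v)  # `nothing` for columns without a pool
    if rp !== nothing && length(rp) < threshold * length(v)
        # Pooled path: one hash per pool entry.
        pool_hashes = UInt[hash(x, zero(UInt)) for x in rp]
        fi = firstindex(rp)
        # Assumes references index into `rp` over a contiguous range starting at `fi`.
        for (i, ref) in enumerate(DataAPI.refarray(v))
            h[i] = pool_hashes[ref - fi + 1]
        end
    else
        # Direct path: one hash per element.
        for (i, x) in enumerate(v)
            h[i] = hash(x, zero(UInt))
        end
    end
    return h
end

few = [1:1_000; fill(1, 9_000)]    # 10% unique values, takes the pooled path
many = [1:9_000; fill(1, 1_000)]   # 90% unique values, takes the direct path

# Both paths should produce the same hashes as the non-pooled column.
@assert hash_first_col(PooledArray(few)) == hash_first_col(few)
@assert hash_first_col(PooledArray(many)) == hash_first_col(many)
```

With `threshold = 0.5` the pooled path is taken only when the pool has fewer than half as many entries as the column has rows, which is the cut-off suggested here; where the real optimum lies depends on how expensive it is to hash the element type.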

bkamins · 2020-11-07T21:21:46Z

Thank you!

bkamins requested a review from nalimilan November 6, 2020 14:44

bkamins added the non-breaking, ecosystem, and performance labels Nov 6, 2020

bkamins added this to the 1.0 milestone Nov 6, 2020

nalimilan reviewed Nov 6, 2020

bkamins and others added 3 commits November 6, 2020 23:01

make hashrows_col! not depend on CategoricalArrays.jl

f2fe516

Update src/dataframerow/utils.jl

ef56881

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

fixes after code review

e3d7d2c

add tests of new hashing code

8c9f8cd

bkamins force-pushed the hashrows_generic branch from 99bf7bf to 8c9f8cd Compare November 6, 2020 22:30

nalimilan reviewed Nov 7, 2020

bkamins commented Nov 7, 2020

bkamins and others added 2 commits November 7, 2020 12:51

Apply suggestions from code review

9bada34

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

update tests

d88d0cd

nalimilan reviewed Nov 7, 2020

nalimilan reviewed Nov 7, 2020

nalimilan approved these changes Nov 7, 2020

Apply suggestions from code review

114eddc

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins mentioned this pull request Nov 7, 2020

Release 0.22 tracking #2484

Closed

20 tasks

bkamins merged commit b9e47e6 into JuliaData:master Nov 7, 2020

bkamins deleted the hashrows_generic branch November 7, 2020 21:21
