make SubIndex strict about duplicates #2022

bkamins · 2019-11-22T13:17:35Z

see the discussion in #1958.

Rationale for this change:

the case when the cost of creation of DataFrameRow or SubDataFrame will be affected is only when indexing by AbstractVector{Int} is performed (abut not ranges - as they will be fast) - so I guess the case when it is needed is pretty rare
if someone really wants to avoid checking for duplicates it is possible to call SubIndex default constructor with appropriate values
the additional benefit is that functions that previously called lazyremap! have a bit less work to do later

src/other/index.jl

bkamins · 2019-11-22T15:55:12Z

I thought at #1958 (comment) you said the performance impact was large for DataFrameRow?

I have said that the performance hit is large for selection of kind df[1, [1,2,3]] ONLY (so selections of the kind :, 1:50 or :[:x1, :x2] are largely unaffected; I assume that selection of kind df[1, [1,2,3]] is actually pretty rare in practice).
Here are the benchmarks for DataFrameRow (as this is the most critical kind):

Current implementation

julia> df = DataFrame(rand(50, 50));

julia> x = collect(1:50);

julia> y = names(df);

julia> @benchmark df[1, :]
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     35.749 ns (0.00% GC)
  median time:      36.557 ns (0.00% GC)
  mean time:        52.476 ns (11.99% GC)
  maximum time:     39.307 μs (99.87% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark df[1, 1:50]
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  2
  --------------
  minimum time:     46.107 ns (0.00% GC)
  median time:      47.422 ns (0.00% GC)
  mean time:        67.729 ns (15.78% GC)
  maximum time:     41.870 μs (99.86% GC)
  --------------
  samples:          10000
  evals/sample:     989

julia> @benchmark df[1, $x]
BenchmarkTools.Trial:
  memory estimate:  144 bytes
  allocs estimate:  3
  --------------
  minimum time:     88.554 ns (0.00% GC)
  median time:      93.652 ns (0.00% GC)
  mean time:        122.653 ns (11.30% GC)
  maximum time:     41.732 μs (99.51% GC)
  --------------
  samples:          10000
  evals/sample:     961

julia> @benchmark df[1, $y]
BenchmarkTools.Trial:
  memory estimate:  4.55 KiB
  allocs estimate:  16
  --------------
  minimum time:     2.722 μs (0.00% GC)
  median time:      2.911 μs (0.00% GC)
  mean time:        4.634 μs (19.63% GC)
  maximum time:     4.658 ms (99.86% GC)
  --------------
  samples:          10000
  evals/sample:     9

Proposed in this PR:

julia> df = DataFrame(rand(50, 50));

julia> x = collect(1:50);

julia> y = names(df);

julia> @benchmark df[1, :]
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     37.261 ns (0.00% GC)
  median time:      38.570 ns (0.00% GC)
  mean time:        53.947 ns (12.11% GC)
  maximum time:     40.133 μs (99.87% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark df[1, 1:50]
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  2
  --------------
  minimum time:     45.804 ns (0.00% GC)
  median time:      50.152 ns (0.00% GC)
  mean time:        81.071 ns (13.05% GC)
  maximum time:     41.938 μs (99.87% GC)
  --------------
  samples:          10000
  evals/sample:     989

julia> @benchmark df[1, $x]
BenchmarkTools.Trial:
  memory estimate:  560 bytes
  allocs estimate:  3
  --------------
  minimum time:     126.487 ns (0.00% GC)
  median time:      138.047 ns (0.00% GC)
  mean time:        171.026 ns (10.79% GC)
  maximum time:     46.290 μs (99.41% GC)
  --------------
  samples:          10000
  evals/sample:     891

julia> @benchmark df[1, $y]
BenchmarkTools.Trial:
  memory estimate:  4.95 KiB
  allocs estimate:  16
  --------------
  minimum time:     2.744 μs (0.00% GC)
  median time:      2.944 μs (0.00% GC)
  mean time:        4.555 μs (20.23% GC)
  maximum time:     4.988 ms (99.90% GC)
  --------------
  samples:          10000
  evals/sample:     9

And what I say is that these nanosecond differences will be negligible anyway as if you try doing anything later with such DataFrameRow it will be more expensive anyway (as I assume that if you create it you actually want to do something with it that is more complex). Here is an example of cost of casting it to a NamedTuple:

julia> dfr =  df[1, y];

julia> @benchmark copy($dfr)
BenchmarkTools.Trial:
  memory estimate:  4.20 KiB
  allocs estimate:  115
  --------------
  minimum time:     5.483 μs (0.00% GC)
  median time:      5.833 μs (0.00% GC)
  mean time:        7.541 μs (10.77% GC)
  maximum time:     7.461 ms (99.89% GC)
  --------------
  samples:          10000
  evals/sample:     6

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

…Index

nalimilan · 2019-11-23T10:26:49Z

That's fine. But why not keep these @boundscheck annotations as they don't hurt?

bkamins · 2019-11-23T11:08:00Z

We can discuss it. It is a matter of taste. I prefer as little magic as possible in the source code, so the average DataFrames.jl user can understand what is going on (the codebase for indexing and views is already quite complex I think).

If we revert @boundscheck I have to add lazyremap! function back to the code base and then also parentindices, getindex and haskey will have do one additional check (as we will dynamically have to check after creation of SubIndex if it was created with @inbounds or without it).

But - if you prefer I will revert @boundscheck - so feel free to decide if it is OK with you.

bkamins · 2019-11-23T11:09:38Z

So in short having or not having @boundscheck is a tradeoff: for a small performance gain in a rare case, we add code complexity and small performance hit in several other places.

nalimilan · 2019-11-25T09:04:27Z

OK, fine with me.

test/dataframerow.jl

test/subdataframe.jl

nalimilan · 2019-11-25T09:03:51Z

test/cat.jl

-    @test vcat(view(df, 1:2, [1,2,3,1,2,3]), view(df, 3:5, [3,2,1,1,2,3])) == df
-    @test all(==(df[1, :]), eachrow(vcat(view(df, [1,1,1], [1,2,3,1,2,3]),
-                                         view(df, [1,1,1], [3,2,1,1,2,3]))))
+    @test_throws ArgumentError vcat(view(df, 1:2, [1,2,3,1,2,3]),


vcat isn't needed anymore, right? Same below for all.

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

bkamins · 2019-11-25T13:02:52Z

Re: vcat not needed - fixed. Thank you (I wanted to minimize changes in the tests but here it does not make sense to retain them - so also thank you for the other fixes).

I have added another tests for duplicate columns when trying to make a view of a SubDataFrame.

bkamins · 2019-11-25T17:37:06Z

@nalimilan - thank you!

make SubIndex strict about duplicates

12b6410

bkamins mentioned this pull request Nov 22, 2019

Add :setequal to cols kwarg in vcat, push! and append! #1958

Merged

nalimilan reviewed Nov 22, 2019

View reviewed changes

src/other/index.jl Show resolved Hide resolved

nalimilan reviewed Nov 22, 2019

View reviewed changes

src/other/index.jl Outdated Show resolved Hide resolved

src/other/index.jl Show resolved Hide resolved

bkamins and others added 5 commits November 22, 2019 16:56

Update src/other/index.jl

e78ac51

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

fix tests

3d6ce2e

Merge remote-tracking branch 'origin/strict_SubIndex' into strict_Sub…

1d48757

…Index

fix more tests

e08fe7b

fix one more test

b8c5a0f

bkamins added this to the 1.0 milestone Nov 23, 2019

nalimilan reviewed Nov 25, 2019

View reviewed changes

bkamins and others added 2 commits November 25, 2019 13:57

Apply suggestions from code review

6fee351

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

additional tests

c8f6a99

nalimilan approved these changes Nov 25, 2019

View reviewed changes

bkamins merged commit 85e5d98 into JuliaData:master Nov 25, 2019

bkamins deleted the strict_SubIndex branch November 25, 2019 17:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make SubIndex strict about duplicates #2022

make SubIndex strict about duplicates #2022

bkamins commented Nov 22, 2019

bkamins commented Nov 22, 2019 •

edited

Loading

nalimilan commented Nov 23, 2019

bkamins commented Nov 23, 2019

bkamins commented Nov 23, 2019

nalimilan commented Nov 25, 2019

nalimilan Nov 25, 2019

bkamins commented Nov 25, 2019

bkamins commented Nov 25, 2019

make SubIndex strict about duplicates #2022

make SubIndex strict about duplicates #2022

Conversation

bkamins commented Nov 22, 2019

bkamins commented Nov 22, 2019 • edited Loading

nalimilan commented Nov 23, 2019

bkamins commented Nov 23, 2019

bkamins commented Nov 23, 2019

nalimilan commented Nov 25, 2019

nalimilan Nov 25, 2019

Choose a reason for hiding this comment

bkamins commented Nov 25, 2019

bkamins commented Nov 25, 2019

bkamins commented Nov 22, 2019 •

edited

Loading