Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make SubIndex strict about duplicates #2022

Merged
merged 8 commits into from
Nov 25, 2019

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Nov 22, 2019

see the discussion in #1958.

Rationale for this change:

  • the case when the cost of creation of DataFrameRow or SubDataFrame will be affected is only when indexing by AbstractVector{Int} is performed (abut not ranges - as they will be fast) - so I guess the case when it is needed is pretty rare
  • if someone really wants to avoid checking for duplicates it is possible to call SubIndex default constructor with appropriate values
  • the additional benefit is that functions that previously called lazyremap! have a bit less work to do later

src/other/index.jl Outdated Show resolved Hide resolved
src/other/index.jl Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Nov 22, 2019

I thought at #1958 (comment) you said the performance impact was large for DataFrameRow?

I have said that the performance hit is large for selection of kind df[1, [1,2,3]] ONLY (so selections of the kind :, 1:50 or :[:x1, :x2] are largely unaffected; I assume that selection of kind df[1, [1,2,3]] is actually pretty rare in practice).
Here are the benchmarks for DataFrameRow (as this is the most critical kind):

Current implementation

julia> df = DataFrame(rand(50, 50));

julia> x = collect(1:50);

julia> y = names(df);

julia> @benchmark df[1, :]
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     35.749 ns (0.00% GC)
  median time:      36.557 ns (0.00% GC)
  mean time:        52.476 ns (11.99% GC)
  maximum time:     39.307 μs (99.87% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark df[1, 1:50]
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  2
  --------------
  minimum time:     46.107 ns (0.00% GC)
  median time:      47.422 ns (0.00% GC)
  mean time:        67.729 ns (15.78% GC)
  maximum time:     41.870 μs (99.86% GC)
  --------------
  samples:          10000
  evals/sample:     989

julia> @benchmark df[1, $x]
BenchmarkTools.Trial:
  memory estimate:  144 bytes
  allocs estimate:  3
  --------------
  minimum time:     88.554 ns (0.00% GC)
  median time:      93.652 ns (0.00% GC)
  mean time:        122.653 ns (11.30% GC)
  maximum time:     41.732 μs (99.51% GC)
  --------------
  samples:          10000
  evals/sample:     961

julia> @benchmark df[1, $y]
BenchmarkTools.Trial:
  memory estimate:  4.55 KiB
  allocs estimate:  16
  --------------
  minimum time:     2.722 μs (0.00% GC)
  median time:      2.911 μs (0.00% GC)
  mean time:        4.634 μs (19.63% GC)
  maximum time:     4.658 ms (99.86% GC)
  --------------
  samples:          10000
  evals/sample:     9

Proposed in this PR:

julia> df = DataFrame(rand(50, 50));

julia> x = collect(1:50);

julia> y = names(df);

julia> @benchmark df[1, :]
BenchmarkTools.Trial:
  memory estimate:  32 bytes
  allocs estimate:  1
  --------------
  minimum time:     37.261 ns (0.00% GC)
  median time:      38.570 ns (0.00% GC)
  mean time:        53.947 ns (12.11% GC)
  maximum time:     40.133 μs (99.87% GC)
  --------------
  samples:          10000
  evals/sample:     993

julia> @benchmark df[1, 1:50]
BenchmarkTools.Trial:
  memory estimate:  80 bytes
  allocs estimate:  2
  --------------
  minimum time:     45.804 ns (0.00% GC)
  median time:      50.152 ns (0.00% GC)
  mean time:        81.071 ns (13.05% GC)
  maximum time:     41.938 μs (99.87% GC)
  --------------
  samples:          10000
  evals/sample:     989

julia> @benchmark df[1, $x]
BenchmarkTools.Trial:
  memory estimate:  560 bytes
  allocs estimate:  3
  --------------
  minimum time:     126.487 ns (0.00% GC)
  median time:      138.047 ns (0.00% GC)
  mean time:        171.026 ns (10.79% GC)
  maximum time:     46.290 μs (99.41% GC)
  --------------
  samples:          10000
  evals/sample:     891

julia> @benchmark df[1, $y]
BenchmarkTools.Trial:
  memory estimate:  4.95 KiB
  allocs estimate:  16
  --------------
  minimum time:     2.744 μs (0.00% GC)
  median time:      2.944 μs (0.00% GC)
  mean time:        4.555 μs (20.23% GC)
  maximum time:     4.988 ms (99.90% GC)
  --------------
  samples:          10000
  evals/sample:     9

And what I say is that these nanosecond differences will be negligible anyway as if you try doing anything later with such DataFrameRow it will be more expensive anyway (as I assume that if you create it you actually want to do something with it that is more complex). Here is an example of cost of casting it to a NamedTuple:

julia> dfr =  df[1, y];

julia> @benchmark copy($dfr)
BenchmarkTools.Trial:
  memory estimate:  4.20 KiB
  allocs estimate:  115
  --------------
  minimum time:     5.483 μs (0.00% GC)
  median time:      5.833 μs (0.00% GC)
  mean time:        7.541 μs (10.77% GC)
  maximum time:     7.461 ms (99.89% GC)
  --------------
  samples:          10000
  evals/sample:     6

@bkamins bkamins added this to the 1.0 milestone Nov 23, 2019
@nalimilan
Copy link
Member

That's fine. But why not keep these @boundscheck annotations as they don't hurt?

@bkamins
Copy link
Member Author

bkamins commented Nov 23, 2019

We can discuss it. It is a matter of taste. I prefer as little magic as possible in the source code, so the average DataFrames.jl user can understand what is going on (the codebase for indexing and views is already quite complex I think).

If we revert @boundscheck I have to add lazyremap! function back to the code base and then also parentindices, getindex and haskey will have do one additional check (as we will dynamically have to check after creation of SubIndex if it was created with @inbounds or without it).

But - if you prefer I will revert @boundscheck - so feel free to decide if it is OK with you.

@bkamins
Copy link
Member Author

bkamins commented Nov 23, 2019

So in short having or not having @boundscheck is a tradeoff: for a small performance gain in a rare case, we add code complexity and small performance hit in several other places.

@nalimilan
Copy link
Member

OK, fine with me.

test/dataframerow.jl Outdated Show resolved Hide resolved
test/dataframerow.jl Outdated Show resolved Hide resolved
test/dataframerow.jl Outdated Show resolved Hide resolved
test/dataframerow.jl Outdated Show resolved Hide resolved
test/subdataframe.jl Outdated Show resolved Hide resolved
test/cat.jl Outdated
@test vcat(view(df, 1:2, [1,2,3,1,2,3]), view(df, 3:5, [3,2,1,1,2,3])) == df
@test all(==(df[1, :]), eachrow(vcat(view(df, [1,1,1], [1,2,3,1,2,3]),
view(df, [1,1,1], [3,2,1,1,2,3]))))
@test_throws ArgumentError vcat(view(df, 1:2, [1,2,3,1,2,3]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vcat isn't needed anymore, right? Same below for all.

bkamins and others added 2 commits November 25, 2019 13:57
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@bkamins
Copy link
Member Author

bkamins commented Nov 25, 2019

Re: vcat not needed - fixed. Thank you (I wanted to minimize changes in the tests but here it does not make sense to retain them - so also thank you for the other fixes).

I have added another tests for duplicate columns when trying to make a view of a SubDataFrame.

@bkamins bkamins merged commit 85e5d98 into JuliaData:master Nov 25, 2019
@bkamins bkamins deleted the strict_SubIndex branch November 25, 2019 17:37
@bkamins
Copy link
Member Author

bkamins commented Nov 25, 2019

@nalimilan - thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants