Speed up getindex with AbstractVector{Bool} row selection #1848

Merged
merged 2 commits into from
Jun 12, 2019

nalimilan

Computing the integer indices before indexing into the vectors is faster: since the array indexing code does that anyway, doing it manually avoids repeating the operation for each column.

using DataFrames, BenchmarkTools
df = DataFrame(rand(10_000, 100));
inds = rand(Bool, nrow(df));
@btime df[inds, :];
@btime df[inds, 1:10];
@btime df[inds, 1:2];
@btime df[inds, [1]];

# git master
julia> @btime df[inds, :];
  4.039 ms (517 allocations: 3.85 MiB)

julia> @btime df[inds, 1:10];
  397.159 μs (76 allocations: 395.48 KiB)

julia> @btime df[inds, 1:2];
  79.435 μs (36 allocations: 80.55 KiB)

julia> @btime df[inds, [1]];
  40.242 μs (34 allocations: 41.34 KiB)

# This PR
julia> @btime df[inds, :];
  1.372 ms (419 allocations: 3.87 MiB)

julia> @btime df[inds, 1:10];
  135.309 μs (68 allocations: 433.06 KiB)

julia> @btime df[inds, 1:2];
  68.609 μs (36 allocations: 119.38 KiB)

julia> @btime df[inds, [1]];
  59.974 μs (35 allocations: 80.33 KiB)

# Alternative using LogicalIndex
julia> @btime df[inds, :];
  3.752 ms (418 allocations: 3.84 MiB)

julia> @btime df[inds, 1:10];
  369.710 μs (67 allocations: 394.58 KiB)

julia> @btime df[inds, 1:2];
  76.246 μs (35 allocations: 80.39 KiB)

julia> @btime df[inds, [1]];
  40.101 μs (34 allocations: 41.28 KiB)

Computing the integer indices before indexing into the vectors is faster.
The array indexing code uses a LogicalIndex wrapper, which counts the true
entries and doesn't allocate a vector of integer indices, but that approach
is only slightly faster when there's a single column, and slower for more
than one column.
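
A rough sketch of the idea, for illustration only: the helper names `rows_per_column` and `rows_shared_indices` are hypothetical and are not the code changed in this PR. It assumes the columns are plain vectors and uses `findall` from Base to turn the Bool mask into integer indices once, so the conversion is not repeated for each column.

```julia
# Illustrative only: `rows_per_column` and `rows_shared_indices` are
# hypothetical names, not functions from DataFrames.jl.

# Index each column directly with the Bool mask; the mask is processed
# again for every column.
rows_per_column(cols, mask::AbstractVector{Bool}) = [col[mask] for col in cols]

# Convert the mask to integer indices once, then reuse them for every column.
function rows_shared_indices(cols, mask::AbstractVector{Bool})
    idx = findall(mask)
    return [col[idx] for col in cols]
end

cols = [rand(10_000) for _ in 1:100]
mask = rand(Bool, 10_000)

# Both approaches select the same rows.
@assert rows_per_column(cols, mask) == rows_shared_indices(cols, mask)
```

The trade-off matches the numbers above: the shared index vector costs one extra allocation, which shows up when a single column is selected but pays off as the number of columns grows.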

@bkamins bkamins left a comment


Looks OK + I left 2 small fixes

Co-Authored-By: Bogumił Kamiński <bkamins@sgh.waw.pl>

bkamins commented Jun 11, 2019

I think it is good to merge.

@nalimilan nalimilan merged commit 724d132 into master Jun 12, 2019
@nalimilan nalimilan deleted the nl/getindex branch June 12, 2019 11:07