Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

allow columns to be indexed for fast access/split by value #24

Closed
HarlanH opened this Issue Jul 16, 2012 · 16 comments

Comments

Projects
None yet
3 participants
Contributor

HarlanH commented Jul 16, 2012

something like index!(df, "colname") creates an index for the clumn and stores it inside the DataFrame object.

@tshort tshort pushed a commit that referenced this issue Jul 19, 2012

@powerdistribution powerdistribution Add an indexing class for AbstractVectors.
One option for issue #24.
Allows fast lookups and comparisons.
Experimental: not loaded by default.
Doesn't (yet) work for AbstractDataVecs.
268faa1
Contributor

tshort commented Jul 19, 2012

In addition to the commit just made for column-based indexes, another option for this is pandas/data.table style indexing where the whole DataFrame is sorted based on keys. In pandas, the indexing is separate from the DataFrame. In data.table, the key columns are columns in the DataFrame. Either way, we would need to add extra fields to the definition of a DataFrame, and both would complicate indexing.

Keeping indexing with columns has advantages:

  • Column ordering is less likely to be messed up inadvertantly.
  • Multiple columns can be indexed; there's no main key grouping.
  • More flexible logic combinations are possible.
  • You can mix and match keyed and non-keyed comparisons.
  • The DataFrame structure and indexing are not disturbed.
  • It's probably less coding and maintenance than the approach below.

The advantages of an overall DataFrame index are:

  • Lookups are somewhat faster than the approach above.
  • The ordering and priority of keys is kept. This makes things like pivoting easier.
  • You could do data.table/pandas shortcuts like df["A"] for df[:( keycol .== "A" )]. But, df["A"] is less descriptive if you don't know what df's keys are, and it's hard to separate keyed lookups along rows with column indexing.
Contributor

HarlanH commented Jul 19, 2012

Tom, thanks for writing that experimental code! That's definitely the direction I'm currently thinking. And I think you nail the pros and cons of the various options.

I'd like to defer this decision until we have a design for distributed DataFrames, which I think will be a uniquely Julian and very valuable feature. What we end up wanting to do for indexes will, I think, be driven in part by what's a good option there. Does that make sense?

Contributor

tshort commented Jul 19, 2012

Definitely. No need to hurry on this. I especially don't want to clutter the DataFrame definition unless we really know what we want.

For now, it was an easy option to try and may give us some pieces to use later.

Contributor

tshort commented Jul 19, 2012

Some speed tests on the column-based indexing:

N = 100000000
srand(1)
# big grouping:
x = randi(5, N)
xi = IndexedVec(x)

@time x[x .== 3]
# elapsed time: 1.309391975402832 seconds
@time x[xi .== 3]
# elapsed time: 0.37038278579711914 seconds

# smaller grouping:
x = randi(50000, N)
xi = IndexedVec(x)

@time x[x .== 3]
# elapsed time: 0.9579911231994629 seconds
@time x[xi .== 3]
# elapsed time: 0.00017404556274414062 seconds
@time x[xi .<= 3]
# elapsed time: 0.0005171298980712891 seconds
Contributor

tshort commented Oct 4, 2012

Bitmap indexing is an interesting option. It's fast and flexible. The main downside is that it's hard/slow to update for changes to the data.

http://en.wikipedia.org/wiki/Bitmap_index
https://sdm.lbl.gov/fastbit/
http://www-vis.lbl.gov/Events/SC05/HDF5FastQuery/

Contributor

tshort commented Jan 2, 2013

I just updated indexing.jl, renamed IndexedVec to IndexedVector, got rid of warnings, and added a few tests. It is also now live, meaning it is loaded with everything else. In updating, I came across a couple of things:

  • DataFrame indexing isn't as general as it used to be, probably due to changes needed to avoid warnings. Not a big deal, but specific ref entries were needed to make this work.
  • Assignment of an IndexedVector to a DataFrame column wasn't straightforward, probably because this type is an AbstractVector, not an AbstractDataVector. I added methods to make it happen, but I don't know if I did it "the right way".

The timings above as well as the advantages and disadvantages still apply.

Contributor

johnmyleswhite commented Jan 2, 2013

Thanks, @tshort! To wrap my head around the indexing strategy, should I just read through indexing.jl? I'm really excited to have this available.

Contributor

tshort commented Jan 2, 2013

The short answer is yes. I'll add some more comments. The idea is that an
IndexedVector is a pointer to the original column along with an indexing
vector equal to order(orig). A comparison operation like idv .> 3
returns an Indexer type. The Indexer type includes a pointer to the
IndexedVector along with a vector of Range1's. DataVecs and DataFrames can
be indexed with Indexers. It's fast because you're using a slice of the
already indexed vector.

Indexer's can be combined with | and &. In the case where the
IndexedVector is the same, the Indexer is just reduced or expanded as
appropriate. This includes: 0 .< idv .< 3 or idv .== 1 | idv .== 4. If
Indexers have different IndexedVectors (like idv1 .== 1 | idv2 .== 1),
then the result is converted to a BitVector.

On Wed, Jan 2, 2013 at 11:35 AM, John Myles White
notifications@github.comwrote:

Thanks, @tshort https://github.com/tshort! To wrap my head around the
indexing strategy, should I just read through indexing.jl? I'm really
excited to have this available.


Reply to this email directly or view it on GitHubhttps://github.com/HarlanH/DataFrames.jl/issues/24#issuecomment-11814185.

Contributor

tshort commented Jan 2, 2013

Also, NA support in DataVecs is spotty. Here is an example:

julia> a = DataVector[1,2,3,NA]
4-element Int64 DataArray
 1
 2
 3
 NA

julia> ia = IndexedVector(a) 
4-element Int64 DataArray
 1
 2
 3
 NA

julia> a[ia .< 3]
3-element Int64 DataArray
 NA                                           # PROBABLY NOT WANTED
 1
 2

julia> a[ia .> 1]
2-element Int64 DataArray
 2
 3

julia> a[ia .== 1]
1-element Int64 DataArray
 1

julia> a = DataVector[NA,1,2,3,NA]
5-element Int64 DataArray
 NA
 1
 2
 3
 NA

julia> ia = IndexedVector(a) 
no method isless(NAtype,NAtype)
 in insertionsort_perm! at sort.jl:163
 in mergesort_perm! at sort.jl:270
 in mergesort_perm at sort.jl:315
 in IndexedVector at /home/tshort/.julia/DataFrames/src/indexing.jl:45

julia> order(a)
no method isless(NAtype,NAtype)
 in insertionsort_perm! at sort.jl:163
 in mergesort_perm! at sort.jl:270
 in mergesort_perm at sort.jl:315
 in order at sort.jl:557
Contributor

johnmyleswhite commented Jan 2, 2013

Have you seen whether just adding isless(NAtype, NAtype) = true to operators.jl makes this better?

Contributor

johnmyleswhite commented Jan 2, 2013

It's possibly worth noting that these types of operations do work correctly for a pure DataArray:

julia> require("DataFrames"); using DataFrames

julia> a = DataVector[1, 2, 3, NA]
4-element Int64 DataArray
1
2
3
NA

julia> a[a .< 3]
2-element Int64 DataArray
1
2

Contributor

tshort commented Jan 2, 2013

Correct. The only thing I would call a bug with the existing DataArray is
the result of order(DataVector[NA, 1,2,3, NA]).

On Wed, Jan 2, 2013 at 12:50 PM, John Myles White
notifications@github.comwrote:

It's possibly worth noting that these types of operations do work
correctly for a pure DataArray:

julia> require("DataFrames"); using DataFrames

julia> a = DataVector[1, 2, 3, NA]
4-element Int64 DataArray
1
2
3
NA

julia> a[a .< 3]
2-element Int64 DataArray
1
2


Reply to this email directly or view it on GitHubhttps://github.com/HarlanH/DataFrames.jl/issues/24#issuecomment-11817077.

Contributor

johnmyleswhite commented Jan 2, 2013

I just fixed a bug in operators. Now we have:

julia> require("DataFrames"); using DataFrames

julia> sort(DataVector[NA, 1, 2, 3, NA])
5-element Int64 DataArray
NA
NA
1
2
3

julia> order(DataVector[NA, 1, 2, 3, NA])
5-element Int64 Array:
1
5
2
3
4

julia> sortperm(DataVector[NA, 1, 2, 3, NA])
([NA,NA,1,2,3],[1, 5, 2, 3, 4])

I think this isn't a bug, but maybe I'm missing something.

Contributor

tshort commented Jan 2, 2013

Works for me now after your fix.

On Wed, Jan 2, 2013 at 1:03 PM, John Myles White
notifications@github.com wrote:

order(DataVector[NA, 1, 2, 3, NA])

Contributor

tshort commented Jan 2, 2013

I just pushed some changes, that make DataVectors with NA's work better with IndexedVectors. Bugs may still lurk, of course.

Contributor

tshort commented Feb 20, 2013

I'm going to close this now and remove the experimental tag on IndexedVectors. IndexedVectors support fast look-up and sorting, and they don't really conflict with anything else. Multiple approaches to indexing can be supported.

Please open a new issue if you'd like a different type of index or more advanced features..

@tshort tshort closed this Feb 20, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment