Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An ETA on Kevin's revisions to PooledDataArray to parameterize the type of the refs field? #241

Closed
dmbates opened this issue Apr 9, 2013 · 21 comments

Comments

@dmbates
Copy link
Contributor

dmbates commented Apr 9, 2013

I had seen messages coming through about @kmsquire providing a parameterized refs field for the PooledDataArray type and the ability to switch to a more compact representation when the size of the pool is small. It seems that this is still in Kevin's fork. Is there something I can do to facilitate bringing this into the main branch? One of the data sets for the "torture test" of the mixed-models software has a factor with 800,000+ levels.

@tshort
Copy link
Contributor

tshort commented Apr 9, 2013

I tried it and tests didn't pass because of an odd error with compact. It
may just be something goofy with my setup, though. If you can get it to
pass tests, I suggest that you go ahead and merge it.

On Tue, Apr 9, 2013 at 1:23 PM, dmbates notifications@github.com wrote:

I had seen messages coming through about @kmsquirehttps://github.com/kmsquireproviding a parameterized refs field for the PooledDataArray type and the
ability to switch to a more compact representation when the size of the
pool is small. It seems that this is still in Kevin's fork. Is there
something I can do to facilitate bringing this into the main branch? One of
the data sets for the "torture test" of the mixed-models software has a
factor with 800,000+ levels.


Reply to this email directly or view it on GitHubhttps://github.com//issues/241
.

@kmsquire
Copy link
Contributor

kmsquire commented Apr 9, 2013

I'm looking at compact now. I also started this before I had commit access to the DataFrames.jl repository. I could move the branch in the main repo so that others could look at/modify it before merging, if that's useful. It would also make it a little easier to test.

@johnmyleswhite
Copy link
Contributor

@kmsquire: Please do whatever you feel you need to do to get this ready.

@dmbates: For your purposes, would the proposed switch to using Uint64 as the default ref type solve your problems? Or do you simultaneously some large factor levels and some highly compact columns?

@dmbates
Copy link
Contributor Author

dmbates commented Apr 9, 2013

@kmsquire Please do create a branch in the DataFrames.jl repository.

@dmbates
Copy link
Contributor Author

dmbates commented Apr 9, 2013

@johnmyleswhite For the time being I am working with a local modification to set the default ref type to Uint32. I am reading an R data frame saved with R-2.15.2 when only 32-bit integers were available. Along the way I discovered an unfortunate inconsistency in R factors. The NA value for an R factor is supposed to be indicated by a ref of 0 but apparently a ref of NA_Integer, which is typemin(Int32), can sneak in there as well. I need to revise the read_RDA code accordingly.

@johnmyleswhite
Copy link
Contributor

Ok. As long as you're able to work, it sounds like we're safe letting Kevin work at a healthy pace rather than merging something incomplete before the end of the day.

That R bug makes me happy that we've chosen the conceptually simpler masking approach for NA's.

@dmbates
Copy link
Contributor Author

dmbates commented Apr 9, 2013

@johnmyleswhite I wouldn't say it was an "R bug" per se - just an undocumented behavior.

@kmsquire
Copy link
Contributor

kmsquire commented Apr 9, 2013

Okay, I've created a pooled_ref_parameterization branch on the main github repo.

As mentioned in the original pull request, the default type is still Uint16. To specify a different type, all relevant PooledDataArray constructors take an additional parameter, r:

julia> using DataFrames

julia> a = PooledDataArray([1,1,2,2])
4-element Int64 PooledDataArray:
 1
 1
 2
 2

julia> a.refs
4-element Uint16 Array:
 0x0001
 0x0001
 0x0002
 0x0002

julia> b = compact(a)
4-element Int64 PooledDataArray:
 1
 1
 2
 2

julia> b.refs
4-element Uint8 Array:
 0x01
 0x01
 0x02
 0x02

julia> c = PooledDataArray([1,1,2,2], Uint8)
4-element Int64 PooledDataArray:
 1
 1
 2
 2

julia> c.refs
4-element Uint8 Array:
 0x01
 0x01
 0x02
 0x02

julia> c = PooledDataArray([1,1,2,2], Int8)
4-element Int64 PooledDataArray:
 1
 1
 2
 2

julia> c.refs
4-element Int8 Array:
 1
 1
 2
 2

I could see changing the default size to Uint32. I think that Uint64 is probably an overkill; if there are more than 4.3 billion types (typemax(Uint32)=4,294,967,295), you probably shouldn't be using a PooledDataArray.

One thing that might be good to add is error checking to make sure appending to the pool doesn't overflow.

@johnmyleswhite
Copy link
Contributor

I think you're right with everything. My only concern is that the overflow checks will slow things down, but I don't see anyway around that if people are allowed to use Uint8 or Uint16 refs.

@kmsquire
Copy link
Contributor

kmsquire commented Apr 9, 2013

I've closed the original pull request from my repo, and opened one from the main DataFrames.jl repo here: #242.

@kmsquire
Copy link
Contributor

kmsquire commented Apr 9, 2013

@johnmyleswhite, we could only do the overflow checks for Uint8 and Uint16 arrays. At most, those checks should only happen 255 or 65535 times, respectively, since it only needs to be done when appending a new value to the pool.

@johnmyleswhite
Copy link
Contributor

That's a good point. Let's add the check.

@kmsquire
Copy link
Contributor

I finally made the time to make the last few changes to this branch that we talked about before:

  1. The default ref type parameter is now Uint32
  2. For ref type parameters of types Uint8, Uint16, Int8, and Int16, overflow checking is done whenever a new pool value is added.

I also added a small bit of documentation, which mostly points out some basic constructors for PooledDataFrames, including the new parameterization for the refs type.

Anyway, it's all available in the pooled_ref_parameterization branch. I think the patch should be good to go, although it could probably use some more testing to shake out any final problems.

@dmbates Doug, did you have the chance to test this out before? If not, can you test this when you have the chance and get back to me with any problems (or feel free to fix them yourself, of course).

Cheers, Kevin

@dmbates
Copy link
Contributor Author

dmbates commented May 1, 2013

@kmsquire It looks fine to me. Thanks for completing this. I look forward to its being merged into the master branch.

I committed a change to RDA.jl to create the new-style PooledDataVector objects from factors in a saved R data frame. I do the compact operation during the creation to save on the copy. For some large data frames that I work with, this results in considerable space savings.

One thing to look into is the m argument in the various constructors. In pooleddataarray.jl it is typed as AbstractArray{Bool,N} whereas the equivalent argument in dataarray.jl is typed as BitArray{N}. That tripped me up.

@dmbates
Copy link
Contributor Author

dmbates commented May 9, 2013

What's holding up the merge of the pooled_ref_parameterization branch into master? I thought I saw it had been merged and mistakenly pushed a commit into master of changes in RDA.jl that depend on it.

@dmbates
Copy link
Contributor Author

dmbates commented May 9, 2013

I am working on merging that branch now. There are still a few failures in the tests, which is why it is taking longer than I expected.

@dmbates
Copy link
Contributor Author

dmbates commented May 9, 2013

I have merged the pooled_ref_parameterization branch after a few modifications to get the tests to pass. However, not all tests pass. The test/constructors.jl test fails. I can reproduce the problem in the REPL but I can't figure out why it is a problem. I pushed to my fork of the master branch dmbates/DataFrames.jl@0bea89ab998335d1656e591f77fbaed2c4a00555

If anyone can determine why PooledDataArray(falses(3)) throws an error I would very much appreciate learning how to fix it.

@tshort
Copy link
Contributor

tshort commented May 9, 2013

Doug, here's the problem code from Kevin's branch in pooleddataarray.jl:

Convert a BitArray to an Array{Bool} (m = missingness)

PooledDataArray{R<:Integer,N}(d::BitArray{N},
m::AbstractArray{Bool, N} = falses(size(a)),
r::Type{R} = DEFAULT_POOLED_REF_TYPE) =
PooledDataArray(convert(Array{Bool}, d), m, r)

I think a should be d in falses(size(a)).

Tracking down which method gets called is a little harder with default
arguments.

On Thu, May 9, 2013 at 1:49 PM, dmbates notifications@github.com wrote:

I have merged the pooled_ref_parameterization branch after a few
modifications to get the tests to pass. However, not all tests pass. The
test/constructors.jl test fails. I can reproduce the problem in the REPL
but I can't figure out why it is a problem. I pushed to my fork of the
master branch dmbates/DataFrames.jl@0bea89ahttps://github.com/dmbates/DataFrames.jl/commit/0bea89ab998335d1656e591f77fbaed2c4a00555

If anyone can determine why PooledDataArray(falses(3)) throws an error I
would very much appreciate learning how to fix it.


Reply to this email directly or view it on GitHubhttps://github.com//issues/241#issuecomment-17678596
.

@dmbates
Copy link
Contributor Author

dmbates commented May 9, 2013

Once again the "given enough eyeballs all bugs are shallow" aphorism is reinforced. I kept seeing the error message and the line number and I never looked at the argument to size(). Thanks. I'm just running the tests again before creating a pull request.

@kmsquire
Copy link
Contributor

Doug, thanks for taking care of this, and sorry that I haven't been more
responsive in testing and merging. I'm currently at a conference, and
haven't had stellar internet access, or much time, for that matter.

On Thu, May 9, 2013 at 12:30 PM, dmbates notifications@github.com wrote:

Once again the "given enough eyeballs all bugs are shallow" aphorism is
reinforced. I kept seeing the error message and the line number and I never
looked at the argument to size(). Thanks. I'm just running the tests again
before creating a pull request.


Reply to this email directly or view it on GitHubhttps://github.com//issues/241#issuecomment-17684570
.

@garborg
Copy link
Contributor

garborg commented Jan 12, 2015

Believe this was done long ago.

@garborg garborg closed this as completed Jan 12, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants