An ETA on Kevin's revisions to PooledDataArray to parameterize the type of the refs field? #241
Comments
I tried it and tests didn't pass because of an odd error with
On Tue, Apr 9, 2013 at 1:23 PM, dmbates notifications@github.com wrote:
|
I'm looking at |
@kmsquire Please do create a branch in the DataFrames.jl repository. |
@johnmyleswhite For the time being I am working with a local modification to set the default ref type to Uint32. I am reading an R data frame saved with R-2.15.2, when only 32-bit integers were available. Along the way I discovered an unfortunate inconsistency in R factors. The NA value for an R factor is supposed to be indicated by a ref of 0, but apparently a ref of NA_integer_, which is typemin(Int32), can sneak in there as well. I need to revise the read_RDA code accordingly. |
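One way to handle both NA encodings when reading factor codes can be sketched as follows. This is only an illustration in current Julia syntax (`UInt32` rather than the `Uint32` spelling used in this thread); `normalize_factor_codes` and `R_NA_INTEGER` are hypothetical names, not part of RDA.jl or DataFrames.jl, and the sketch assumes the raw codes arrive as a `Vector{Int32}`.

```julia
# Hypothetical helper, not part of DataFrames.jl or RDA.jl.
# Maps raw R factor codes to pooled-array style refs, where a ref of 0 means NA.
const R_NA_INTEGER = typemin(Int32)  # the value the comment above identifies with R's integer NA

function normalize_factor_codes(codes::Vector{Int32})
    refs = Vector{UInt32}(undef, length(codes))
    for (i, code) in enumerate(codes)
        if code == 0 || code == R_NA_INTEGER
            refs[i] = 0             # both encodings are treated as missing
        elseif code > 0
            refs[i] = UInt32(code)  # ordinary 1-based level index
        else
            throw(ArgumentError("unexpected factor code $code at position $i"))
        end
    end
    return refs
end

refs = normalize_factor_codes(Int32[1, 0, typemin(Int32), 2])
```

Rejecting other negative codes is a design choice of this sketch; a reader could equally decide to treat them as NA.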
Ok. As long as you're able to work, it sounds like we're safe letting Kevin work at a healthy pace rather than merging something incomplete before the end of the day. That R bug makes me happy that we've chosen the conceptually simpler masking approach for NA's. |
@johnmyleswhite I wouldn't say it was an "R bug" per se - just an undocumented behavior. |
Okay, I've created a As mentioned in the original pull request, the default type is still
julia> using DataFrames
julia> a = PooledDataArray([1,1,2,2])
4-element Int64 PooledDataArray:
1
1
2
2
julia> a.refs
4-element Uint16 Array:
0x0001
0x0001
0x0002
0x0002
julia> b = compact(a)
4-element Int64 PooledDataArray:
1
1
2
2
julia> b.refs
4-element Uint8 Array:
0x01
0x01
0x02
0x02
julia> c = PooledDataArray([1,1,2,2], Uint8)
4-element Int64 PooledDataArray:
1
1
2
2
julia> c.refs
4-element Uint8 Array:
0x01
0x01
0x02
0x02
julia> c = PooledDataArray([1,1,2,2], Int8)
4-element Int64 PooledDataArray:
1
1
2
2
julia> c.refs
4-element Int8 Array:
1
1
2
2
I could see changing the default size to One thing that might be good to add is error checking to make sure appending to the pool doesn't overflow. |
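The session above shows `compact` moving the refs from a 16-bit to an 8-bit type when the pool is small. A minimal sketch of how such a type could be chosen is given below. It is an assumption about the general idea, not the actual implementation in the branch; `smallest_ref_type` is a hypothetical name, and the code uses current Julia spellings (`UInt8`, not `Uint8`).

```julia
# Hypothetical helper: pick the narrowest unsigned type that can index `nlevels` pool entries.
# Ref 0 is reserved for NA, so a type T can address at most typemax(T) levels.
function smallest_ref_type(nlevels::Integer)
    for T in (UInt8, UInt16, UInt32, UInt64)
        nlevels <= typemax(T) && return T
    end
    throw(ArgumentError("pool with $nlevels levels is too large to index"))
end

smallest_ref_type(2)        # a two-level pool, as in the session above
smallest_ref_type(800_000)  # a factor with 800,000+ levels
```

Under this sketch a two-level pool fits in `UInt8`, while a pool of 800,000 levels exceeds what a 16-bit ref can index and needs `UInt32`.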
I think you're right about everything. My only concern is that the overflow checks will slow things down, but I don't see any way around that if people are allowed to use |
I've closed the original pull request from my repo, and opened one from the main DataFrames.jl repo here: #242. |
@johnmyleswhite, we could only do the overflow checks for |
That's a good point. Let's add the check. |
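A rough sketch of what such a check might look like follows. `ToyPooled` and `push_level!` are hypothetical stand-ins written for illustration in current Julia syntax; the field names `refs` and `pool` follow this thread, but nothing here is the DataFrames.jl code. In this sketch the comparison runs only when a new level is added to the pool, not on every element access.

```julia
# Toy stand-in for a pooled array; NOT the DataFrames.jl type.
struct ToyPooled{T,R<:Integer}
    refs::Vector{R}  # index into `pool` for each element; 0 means NA
    pool::Vector{T}  # the distinct levels
end

# Hypothetical helper: add a level to the pool and return its ref,
# refusing to grow past what the ref type R can index.
function push_level!(p::ToyPooled{T,R}, level) where {T,R<:Integer}
    if length(p.pool) >= typemax(R)
        throw(OverflowError("pool has $(length(p.pool)) levels; $R cannot index more"))
    end
    push!(p.pool, level)
    return R(length(p.pool))
end

p = ToyPooled{Int,UInt8}(UInt8[], Int[])
for i in 1:255
    push_level!(p, i)   # fills every ref value a UInt8 can hold
end
# A further push_level!(p, 256) would throw an OverflowError.
```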
I finally made the time to make the last few changes to this branch that we talked about before:
I also added a small bit of documentation, which mostly points out some basic constructors for
Anyway, it's all available in the
@dmbates Doug, did you have the chance to test this out before? If not, can you test this when you have the chance and get back to me with any problems (or feel free to fix them yourself, of course).
Cheers, Kevin |
@kmsquire It looks fine to me. Thanks for completing this. I look forward to its being merged into the master branch. I committed a change to RDA.jl to create the new-style PooledDataVector objects from factors in a saved R data frame. I do the compact operation during the creation to save on the copy. For some large data frames that I work with, this results in considerable space savings. One thing to look into is the |
What's holding up the merge of the |
I am working on merging that branch now. There are still a few failures in the tests, which is why it is taking longer than I expected. |
I have merged the If anyone can determine why |
Doug, here's the problem code from Kevin's branch in pooleddataarray.jl:
# Convert a BitArray to an Array{Bool} (m = missingness)
PooledDataArray{R<:Integer,N}(d::BitArray{N},
I think
Tracking down which method gets called is a little harder with default
On Thu, May 9, 2013 at 1:49 PM, dmbates notifications@github.com wrote:
|
Once again the "given enough eyeballs all bugs are shallow" aphorism is reinforced. I kept seeing the error message and the line number and I never looked at the argument to size(). Thanks. I'm just running the tests again before creating a pull request. |
Doug, thanks for taking care of this, and sorry that I haven't been more
On Thu, May 9, 2013 at 12:30 PM, dmbates notifications@github.com wrote:
|
Believe this was done long ago. |
I had seen messages coming through about @kmsquire providing a parameterized refs field for the PooledDataArray type and the ability to switch to a more compact representation when the size of the pool is small. It seems that this is still in Kevin's fork. Is there something I can do to facilitate bringing this into the main branch? One of the data sets for the "torture test" of the mixed-models software has a factor with 800,000+ levels.