Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Added ref type parameter, compact to PooledDataFrame #227

Closed

Conversation

kmsquire
Copy link
Contributor

@kmsquire kmsquire commented Apr 3, 2013

  • Allows the user to specify the reference type for PooledDataArrays
    (default is still Uint16)
  • Added compact to take an existing PooledDataArray and change the
    reference type to the smallest size possible. Returns the same
    PooledDataArray if nothing has changed, and a new one if the size
    is reduced

The current PooledDataArray tests pass.

Edit: example (from #241 (comment))

julia> using DataFrames

julia> a = PooledDataArray([1,1,2,2])
4-element Int64 PooledDataArray:
 1
 1
 2
 2

julia> a.refs
4-element Uint16 Array:
 0x0001
 0x0001
 0x0002
 0x0002

julia> b = compact(a)
4-element Int64 PooledDataArray:
 1
 1
 2
 2

julia> b.refs
4-element Uint8 Array:
 0x01
 0x01
 0x02
 0x02

julia> c = PooledDataArray([1,1,2,2], Uint8)
4-element Int64 PooledDataArray:
 1
 1
 2
 2

julia> c.refs
4-element Uint8 Array:
 0x01
 0x01
 0x02
 0x02

julia> c = PooledDataArray([1,1,2,2], Int8)
4-element Int64 PooledDataArray:
 1
 1
 2
 2

julia> c.refs
4-element Int8 Array:
 1
 1
 2
 2

@kmsquire
Copy link
Contributor Author

kmsquire commented Apr 3, 2013

WIP because the additional parameterization of the default constructor has a severe conflict with construction from an array + pool (https://github.com/kmsquire/DataFrames.jl/pull/new/pooled_ref_parameterization#L6R120):

julia> import DataFrames
Warning: New definition 
    PooledDataArray(Array{T,N},Array{T,1}) at /home/kmsquire/.julia/DataFrames/src/pooleddataarray.jl:120
is ambiguous with 
    PooledDataArray(Array{R<:Integer,N},Array{T,1}) at /home/kmsquire/.julia/DataFrames/src/pooleddataarray.jl:50.
Make sure 
    PooledDataArray(Array{T,N},Array{T,1})
is defined first.

The parameterizations are slightly different, but not different enough, and moving things around doesn't help in this case. One of the function signatures really just needs to change. Some options

  1. rename one (or more) of the constructors (e.g., PooledDataArray(...) could become pooleddataarray(...) for some/most of the current constructors. They would still will produce the same result, but would have a distinct name.
  2. change the order of refs and pool in the PooledDataArray constructor. EDIT: doesn't work
  3. add a dummy parameter to one of the constructors, so it can be distinguished from the other. E.g., the default PooledDataArray constructor could be
type DefaultPooledConstructor end
function PooledDataArray{T,R<:Integer,N}(refs::Array{R, N},
                                         pool::Vector{T},
                                         ::Type{DefaultPooledConstructor})
    PooledDataArray{T,R,N}(refs, pool)
end

There may be other options. My vote is for 2. Some feedback would be useful (although this is a pretty big change, so I understand it may take some time to review.)

@kmsquire
Copy link
Contributor Author

kmsquire commented Apr 3, 2013

Just realize that I included #226 in this. That change is distinct from this one, and actually still needed even if these changes are accepted. I can remove it from here.

@kmsquire
Copy link
Contributor Author

kmsquire commented Apr 3, 2013

More info/thoughts:

  • The type of refs is currently included only in a subset of the constructors, mainly because
    1. adding it to everything would require twice as many constructors
    2. keyword arguments would really make adding it everywhere a lot easier, and they might just be added soon (keyword arguments JuliaLang/julia#485 (comment))
  • As suggested by @HarlanH here: Bugfix: fix overflow in groupby #226 (comment), it might just be easier to default to Int64 (or Int), and then use compact to reduce the size. The current ambiguity warning might still need to be dealt with, however...
  • Note that changing things in place (compact!) is not likely to be a good idea, since I think it would require that the array be Array{Integer} or Array{Unsigned}, which is actually an array of pointers--not efficient. Hence, this parameterized version.

@tshort
Copy link
Contributor

tshort commented Apr 3, 2013

I prefer #2 out of the choices above.
On Apr 3, 2013 2:02 AM, "Kevin Squire" notifications@github.com wrote:

WIP because the additional parameterization of the default constructor has
a severe conflict with construction from an array + pool (
https://github.com/kmsquire/DataFrames.jl/pull/new/pooled_ref_parameterization#L6R120):

julia> import DataFramesWarning: New definition
PooledDataArray(Array{T,N},Array{T,1}) at /home/kmsquire/.julia/DataFrames/src/pooleddataarray.jl:120is ambiguous with
PooledDataArray(Array{R<:Integer,N},Array{T,1}) at /home/kmsquire/.julia/DataFrames/src/pooleddataarray.jl:50.Make sure
PooledDataArray(Array{T,N},Array{T,1})is defined first.

The parameterizations are slightly different, but not different enough,
and moving things around doesn't help in this case. One of the function
signatures really just needs to change. Some options

  1. rename one (or more) of the constructors (e.g., PooledDataArray(...)could become
    pooleddataarray(...) for some/most of the current constructors. They
    would still will produce the same result, but would have a distinct name.
  2. change the order of refs and pool in the PooledDataArrayconstructor.
  3. add a dummy parameter to one of the constructors, so it can be
    distinguished from the other. E.g., the default PooledDataArray constructor
    could be

type DefaultPooledConstructor endfunction PooledDataArray{T,R<:Integer,N}(refs::Array{R, N},
pool::Vector{T},
::Type{DefaultPooledConstructor})
PooledDataArray{T,R,N}(refs, pool)end

There may be other options. My vote is for 2. Some feedback would be
useful (although this is a pretty big change, so I understand it may take
some time to review.)


Reply to this email directly or view it on GitHubhttps://github.com//pull/227#issuecomment-15819914
.

@kmsquire
Copy link
Contributor Author

kmsquire commented Apr 3, 2013

I haven't actually tried it yet, but thinking a little more, I'm not sure choice 2 is going to work--it still might end up being ambiguous. But I'll try it and see.

@kmsquire
Copy link
Contributor Author

kmsquire commented Apr 3, 2013

Yeah, that didn't work--still ambiguous.

Do either of the other options look reasonable, or does anyone have any other suggestions?

@kmsquire
Copy link
Contributor Author

kmsquire commented Apr 3, 2013

Okay, I think I came up with a reasonable solution.

The problem is that we need to distinguish between normal arrays being converted to PooledDataArrays, and refs arrays, which are arrays of references to the pool of identifiers.

To do this, I added type RefType, which wraps refs arrays and is used only during construction. This seemed like a reasonable tradeoff, since end users will be less likely (I think) to use the refs version of the PooledDataArrays constructor.

This latest version also adds a type parameter as an optional argument to all PooledDataArray constructors, requiring julia commit JuliaLang/julia@92642aa (woot!), so it needs bleeding edge julia to run.

@kmsquire
Copy link
Contributor Author

kmsquire commented Apr 4, 2013

Removed #226 from this pull. I can squash the commits to clean things up, if that's preferable. Let me know.

* Allows the user to specify the reference type for PooledDataArrays
  (default is still Uint16)
* Added `compact` to take an existing PooledDataArray and change the
  reference type to the smallest size possible.  Returns the same
  PooledDataArray if nothing has changed, and a new one if the size
  is reduced
* Added RefType to distinguish refs arrays from normal arrays during PooledDataArray construction
@kmsquire
Copy link
Contributor Author

kmsquire commented Apr 4, 2013

Squashed and rebased.

@tshort
Copy link
Contributor

tshort commented Apr 9, 2013

Kevin, in a brief look at the code and approach, it looks good to me. compact needs to be exported. Also, I couldn't get compact to work. I get the following error:

julia> DataFrames.compact(a)
ERROR: REFTYPE not defined
 in typeinf at inference.jl:1093
 in typeinf_ext at inference.jl:1020

@kmsquire
Copy link
Contributor Author

kmsquire commented Apr 9, 2013

Thanks for the feedback, Tom. I'll look at compact to see what the issue is.

@kmsquire
Copy link
Contributor Author

kmsquire commented Apr 9, 2013

This is idential to the pooled_ref_parameterization branch on the main DataFrames repo. Because others can update that one, I'll close this request, and make another from there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants