Fix InexactError in groupby() due to too many levels
Using the storage size of the first PDA does not make sense, since it can have
very few levels, and yet its combination with other columns may produce a lot
of them. UInt32 will be enough for any reasonable number of levels.
nalimilan committed May 16, 2016
1 parent e975c42 commit 1b61b44
Showing 2 changed files with 7 additions and 1 deletion.
3 changes: 2 additions & 1 deletion src/groupeddataframe/grouping.jl
@@ -90,7 +90,8 @@ function groupby{T}(d::AbstractDataFrame, cols::Vector{T})
     dv = PooledDataArray(d[cols[ncols]])
     # if there are NAs, add 1 to the refs to avoid underflows in x later
     dv_has_nas = (findfirst(dv.refs, 0) > 0 ? 1 : 0)
-    x = copy(dv.refs) .+ dv_has_nas
+    # use UInt32 instead of the PDA's integer size since the number of levels can be high
+    x = copy!(similar(dv.refs, UInt32), dv.refs) .+ dv_has_nas
     # also compute the number of groups, which is the product of the set lengths
     ngroups = length(dv.pool) + dv_has_nas
     # if there's more than 1 column, do roughly the same thing repeatedly
5 changes: 5 additions & 0 deletions test/grouping.jl
@@ -29,4 +29,9 @@ module TestGrouping
     h(df) = g(f(df))

     @test combine(map(h, gd)) == combine(map(g, ga))
+
+    # issue #960
+    +    x = pool(collect(1:20))
+    df = DataFrame(v1=x, v2=x)
+    groupby(df, [:v1, :v2])
 end
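
The failure described in the commit message can be reproduced without DataFrames. The sketch below is an illustration only: `combine_codes` is a hypothetical helper, and the mixed-radix formula it uses is an assumption about how the per-column codes are combined, since that step is not shown in this diff. It is written for current Julia, so it uses `copyto!` where the patch uses `copy!`. Two columns with 20 levels each fit in `UInt8` individually, but their combination can produce up to 400 groups, which does not.

```julia
# Hypothetical helper, not part of DataFrames: combine the integer codes of two
# columns into a single group index, stored in an array with element type T.
function combine_codes(::Type{T}, refs1::Vector{UInt8}, refs2::Vector{UInt8},
                       nlevels1::Int) where {T<:Unsigned}
    # Same idiom as the patch: allocate with the target element type, then copy.
    x = copyto!(similar(refs1, T), refs1)
    for i in eachindex(x)
        # Assumed mixed-radix index: code1 + (code2 - 1) * nlevels1.
        # Storing the Int result converts it to T and throws if it does not fit.
        x[i] = Int(refs1[i]) + (Int(refs2[i]) - 1) * nlevels1
    end
    return x
end

refs = collect(UInt8, 1:20)   # 20 levels fit comfortably in UInt8

# With UInt8 storage, a combined index above 255 cannot be stored.
narrow_failed = try
    combine_codes(UInt8, refs, refs, 20)
    false
catch err
    err isa InexactError
end

# With UInt32 storage, the same computation succeeds.
wide = combine_codes(UInt32, refs, refs, 20)
```

Widening before combining, rather than reusing the element type of one column's refs, means the result type no longer depends on how few levels that particular column has.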
