
Fix StackOverflow in promote_col_type() when grouping with large number of groups #1248

Merged
cjprybol merged 1 commit into JuliaData:master from ExpandingMan:sofix on Oct 9, 2017

Conversation


@ExpandingMan ExpandingMan commented Oct 9, 2017

This is a fix for a rather critical stack overflow error that can occur when aggregating even modestly sized data sets. See the discussion on the Discourse post.

Fixes #1247.
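
For context, here is a minimal sketch of the kind of change involved, assuming the fix follows @nalimilan's suggestion of folding promote_type pairwise instead of splatting (the actual promote_col_type definition in DataFrames may differ):

# Hypothetical before: splatting one element type per column into promote_type.
# The variadic promote_type is defined recursively, one stack frame per argument,
# so thousands of arguments can overflow the stack.
promote_col_type(cols...) = promote_type(map(eltype, cols)...)

# Hypothetical after: mapreduce folds pairwise, keeping the call depth constant.
promote_col_type(cols...) = mapreduce(eltype, promote_type, cols)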


cjprybol commented Oct 9, 2017

Glad to see @nalimilan's suggested fix worked! Thanks for the PR @ExpandingMan.

It looks like this helps with memory and speed quite a bit.

julia> typevec = [Int, Float64, Bool]
3-element Array{DataType,1}:
 Int64
 Float64
 Bool

julia> @time promote_type(rand(typevec, 10^3)...)
  0.264214 seconds (7.17 k allocations: 15.110 MiB)
Float64

julia> @time promote_type(rand(typevec, 10^3)...)
  0.216443 seconds (4.97 k allocations: 13.637 MiB)
Float64

julia> @time mapreduce(x -> x, promote_type, rand(typevec, 10^3))
  0.035831 seconds (12.45 k allocations: 712.130 KiB)
Float64

julia> @time mapreduce(x -> x, promote_type, rand(typevec, 10^3))
  0.023948 seconds (8.37 k allocations: 477.014 KiB)
Float64

julia> @time promote_type(rand(typevec, 10^4)...)
 47.650366 seconds (68.16 k allocations: 1.494 GiB, 53.63% gc time)
Float64

julia> @time mapreduce(x -> x, promote_type, rand(typevec, 10^4))
  0.038973 seconds (8.37 k allocations: 547.373 KiB)
Float64

julia> @time mapreduce(x -> x, promote_type, rand(typevec, 10^7))
  1.104587 seconds (8.37 k allocations: 76.752 MiB)
Float64
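
A note on why the gap grows so quickly, as far as I understand it: the variadic promote_type(Ts...) recurses once per argument, consuming a stack frame each time and compiling a fresh specialization for every distinct argument tuple, while mapreduce folds pairwise at constant stack depth. So pushing the splatted form further should eventually overflow:

# Illustrative, not run here; the exact threshold varies by system.
promote_type(rand(typevec, 10^5)...)                  # likely StackOverflowError
mapreduce(x -> x, promote_type, rand(typevec, 10^5))  # fine at constant depth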

Could you generate a test case similar to the dataset where you encountered this error, to include in the PR? I expect the actual dataset is too large to include in the repo and may also be private, so something along the lines of the following should work:

column_with_lots_of_groups = rand([randstring(8) for i in 1:10^5], 10^7)  # 8 is an arbitrary string length
df = DataFrame(groupbycol = column_with_lots_of_groups, othercol = rand(10^7))  # plus any other fake columns
# followed by the groupby call that broke the current master code with your dataset

Either that, or test promote_col_type directly, whichever is easier to target.
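
A direct test might look something like this (hypothetical: it assumes groupby is the entry point that triggered the overflow; adjust to match the real code path):

using DataFrames, Base.Test

srand(1)                                      # reproducible fake data
groups = [randstring(8) for _ in 1:10^4]      # 8 is an arbitrary string length
df = DataFrame(g = rand(groups, 10^5), x = rand(10^5))
@test isa(groupby(df, :g), GroupedDataFrame)  # overflowed the stack before this fix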


coveralls commented Oct 9, 2017


Coverage remained the same at 72.552% when pulling ffa5b77 on ExpandingMan:sofix into eb3d10a on JuliaData:master.


ExpandingMan commented Oct 9, 2017

Yeah, we definitely need extensive tests on larger datasets. We should work on generating a random dataframe that is reasonably large to test this sort of thing on. I'll have time to do a little bit of this tomorrow, but I think getting proper testing in will be a larger effort.

This was a pretty critical bug. I suggest this be merged now, and I can make another PR for testing.


@nalimilan nalimilan left a comment


Thanks!

I think this kind of test on large data would be better added as part of the benchmarks, since running it on every PR would slow down Travis and AppVeyor or make them fail. Also, we want to track performance, not only failures.
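
A rough sketch of what that could look like in a benchmark suite (assuming BenchmarkTools; the names and sizes here are illustrative, not the repository's actual benchmark layout):

using BenchmarkTools, DataFrames

srand(1)
df = DataFrame(g = rand([randstring(8) for _ in 1:10^4], 10^6), x = rand(10^6))
@benchmark groupby($df, :g)   # tracks time and allocations over releases, not just pass/fail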

@nalimilan

See also my comments on the Discourse post about the difficulty of improving things. (BTW, please post links in both directions when cross-posting to prevent redundant comments.)

@nalimilan nalimilan changed the title fix #1247 Fix StackOverflow in promote_col_type() when grouping with large number of groups Oct 9, 2017
@cjprybol cjprybol merged commit 3f07f79 into JuliaData:master Oct 9, 2017
Development

Successfully merging this pull request may close these issues: stack overflow when doing groupby of large dataframe (#1247).