
speed comparison of dplyr and tidierdata #24

Open
Zhaoju-Deng opened this issue Aug 8, 2023 · 24 comments

@Zhaoju-Deng

Hi Karandeep,
It's nice to have the tidyverse package in Julia!
I tried TidierData to create two new columns on a ~1 GB dataset. I was actually expecting TidierData to be much faster than dplyr, but the results showed dplyr was much faster (0.9 s in dplyr vs. 5.5-8 s in TidierData). Would it be possible to fine-tune the speed so that TidierData becomes more efficient at manipulating large datasets? Personally, I think that matters a lot for data analysis.

kind regards,
Zhaoju

@kdpsingh
Member

kdpsingh commented Aug 8, 2023

Benchmarking is a tricky issue in Julia because the first time you run code, it is compiled (leading to a compilation delay). The code usually runs much, much faster after it has been compiled. This is true even if you change the underlying dataset or certain parameters, so it's not that Julia is cheating by having cached the answer; it is legitimately much faster on the second run.

This issue has been mitigated in Julia 1.9 by precompilation, which caches compiled code. DataFrames.jl (which TidierData.jl wraps) takes advantage of precompilation workflows, but TidierData.jl doesn't yet do any additional precompilation of its own (which it probably should).

I'll move this issue to the TidierData.jl repo, because I agree we should periodically run some basic benchmarks to understand how TidierData.jl stacks up against DataFrames.jl (to understand how much overhead is added) and against R tidyverse.

tl;dr: I agree speed is important. Many published benchmarks show DataFrames.jl to be faster than R tidyverse, but on the first run the compilation can introduce a delay.

Two questions:

  • Which version of Julia are you running?
  • Can you try running your Julia code two times and see if it is faster on the second run? (See the sketch below for one way to check this.)
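For example, something like this (a minimal sketch; the data frame and column names below are placeholders, not your actual data):

using TidierData, DataFrames, Statistics

df = DataFrame(id = repeat(1:3, 4), value = rand(12))  # stand-in for your ~1 GB dataset

# first run: includes compilation time
@time @chain df begin
    @group_by(id)
    @mutate(mean_value = mean(value))
    @ungroup
end

# second run: mostly pure execution time
@time @chain df begin
    @group_by(id)
    @mutate(mean_value = mean(value))
    @ungroup
end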

@kdpsingh kdpsingh transferred this issue from TidierOrg/Tidier.jl Aug 8, 2023
@kdpsingh
Member

From some initial testing, there is likely room for optimization within TidierData.jl. We will do some profiling on our end to understand the bottlenecks, as well as the relationship between data size and overhead (a data-size-dependent overhead would be expected if the additional allocations in TidierData come from inadvertent copies being made).

@Zhaoju-Deng
Author

Hi @kdpsingh,
I tested it on Julia 1.9.0 in VS Code on Windows 10 with an Intel i7-9750H CPU and 64 GB RAM. I have already run TidierData multiple times; the initial run took ~9 seconds, and subsequent runs were in the range of 5-7 seconds. I really appreciate this package, and I think it will make Julia more attractive to people coming from R for data manipulation. Looking forward to the next version of this package!

@kdpsingh
Member

Thanks for sharing. Right now, this package does a lot of extra stuff on top of DataFrames.jl for the sake of user convenience, and I imagine some of that is responsible for the slowdown.

However, I do think some of it is fixable because we can avoid certain steps that I think will speed things up.

So in summary, the package's main selling point at the moment is the consistent syntax. I'm hoping that in the near future the speed penalty won't be as large.

@Zhaoju-Deng
Author

Sounds great! While I can't contribute to the development of this package, I will test it when the next release comes out.

@kdpsingh
Member

kdpsingh commented Aug 11, 2023

Ok, I did some initial exploration and think I know what's responsible for the slowdown. Some of the functions call an extra select() and/or transform(), and I believe that's the underlying cause of the extra allocations and slowness.

We will try the following things in future releases.

  • Set up simple benchmarking tasks in the docs, at least for Tidier.jl vs. DataFrames.jl, to monitor overhead.

  • Autodetect whether the extra calls are needed and eliminate them when not needed.

  • Consider replacing our current internal @chain-based syntax with one that makes a single copy of the data frame up front and then applies the remaining transformations in-place.

  • Add common workflows to PrecompileTools to minimize TTFX (time to first execution); see the sketch below.
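For the PrecompileTools item, the rough shape would be something like the following sketch, placed inside the TidierData module (where DataFrames and the Tidier macros are already available; the toy data frame and column names are just illustrative):

using PrecompileTools

@setup_workload begin
    # a tiny DataFrame is enough to force compilation of the common code paths
    df = DataFrame(a = repeat(1:3, 2), b = rand(6))
    @compile_workload begin
        @chain df begin
            @group_by(a)
            @mutate(b2 = b + 1)
            @ungroup
        end
    end
end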

@kdpsingh
Member

kdpsingh commented Aug 15, 2023

@Zhaoju-Deng, thanks for bringing this up. This issue is mostly resolved in v0.10.0, which is on the registry now. I haven't yet added support for PrecompileTools (which will minimize differences between the first run of the code and subsequent runs), but otherwise you should see major speed-ups in the performance in v0.10.0.

I'll leave this issue open mostly as a placeholder so that we can return to it and add support for PrecompileTools.

Feel free to check it out and see if you notice any difference on your end.

@Zhaoju-Deng
Author

@kdpsingh I just upgraded Julia to v1.10beta1 and tested it again; however, the first run increased to 13.9 seconds and subsequent runs took 6-7 seconds (see the attached screenshot). It is not a big issue for now, but I hope it can be resolved soon.
[screenshot of timing output attached]

@kdpsingh
Member

Thanks for sharing the screenshot!

I'll try to recreate this on my end. If the dataset happens to be publicly available, please let me know -- otherwise I'll create some synthetic data with similar properties.

The precompilation issue will be fixed in a future update.

However, we shouldn't be several-fold slower than dplyr so let me look at this carefully.

@kdpsingh
Member

I think I know what is going on. Tidier.jl currently points to the old version of TidierData.jl, so you're not seeing the changes from the new version yet.

I bet if you go to the package REPL by pressing ] and type in st, it'll point to the older version of TidierData.

I'm fixing the Tidier dependencies right now.

For TidierData.jl:
Slow version = 0.9.2
Fast version = 0.10.0

Feel free to confirm.
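For example (the environment name in the prompt is just illustrative):

julia> ]                       # enter the package REPL
(@v1.9) pkg> st TidierData     # shorthand for `status TidierData`

If that reports TidierData v0.9.2, you're still on the slow version; v0.10.0 is the one with the fixes.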

@kdpsingh
Member

A simple way to fix this is to remove Tidier.jl and to directly update TidierData.jl.

I just pushed the updated version of Tidier.jl to the Julia repository, so that should be fixed soon.

@kdpsingh
Member

The new version of Tidier.jl is now on the registry. If you update it using ] update Tidier, it should now install the latest version of TidierData.jl, which should be much faster.
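Concretely, from the package REPL (the prompt name below is just illustrative), this pulls in the Tidier.jl release that depends on the new TidierData:

(@v1.9) pkg> update Tidier

Or, to skip the meta-package and use TidierData directly:

(@v1.9) pkg> rm Tidier
(@v1.9) pkg> add TidierData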

@Zhaoju-Deng
Author

Zhaoju-Deng commented Aug 16, 2023

Hi @kdpsingh, I tested with Tidier v0.7.6 on Julia v1.10beta1: the first run took ~8 seconds and subsequent runs 4.8-5.1 seconds. With TidierData v0.10.0, the first run took 4.63 seconds and subsequent runs 2.6-2.9 seconds. It seems to have improved a lot, but it is still slower than dplyr; I hope you can refine it to be much faster than dplyr!

@kdpsingh
Member

kdpsingh commented Aug 16, 2023

We'll keep working on it! Step 1 is for us to try to reproduce this result. I'm surprised it is slower than dplyr here but have some ideas.

@kdpsingh
Member

Note to self: My suspicion is that there is still some recompilation happening here because this code isn't wrapped in a function. Will test it out.

@Zhaoju-Deng
Author

Great, I am very interested to see its lightning-fast performance!

@drizk1
Member

drizk1 commented Aug 16, 2023

@Zhaoju-Deng Thanks for your efforts here!

I was wondering, to keep it consistent with the style of benchmarking I have been trying, would it be too much trouble for you to try the following:

using BenchmarkTools

function trial()
    @chain dt begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end

@benchmark trial()

This will keep the methods consistent with what I was trying.

I'll have a benchmark.jl in the repo later this week.

Thank you!

@kdpsingh
Member

kdpsingh commented Aug 16, 2023

I'm fairly convinced that the @time macro being used in the global scope is leading to unreliable results in the timing here (see discussion for a different package at kdpsingh/TidyTable.jl#3).

Wrapping it in a function as @drizk1 suggests is the easiest way to check this.

@drizk1, if we work on a benchmark.jl, we may want to show why folks may get different results when using @time in the global scope, and what the implications of this are (which I can help with).

I don't want to assume that this is the issue (until we check it), so we'll hold off on further optimization until after we generate a set of benchmarks and explain the implications of benchmarking within functions vs. in the global scope.

@drizk1
Member

drizk1 commented Aug 16, 2023

I just ran a quick test with @time vs. @benchmark on the file I've been working with. @time took over twice as long as @benchmark, with 80% of the @time result being recompilation. Very curious to see what @Zhaoju-Deng might find.
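For anyone reproducing this: @time is a one-shot measurement that includes any compilation (and on recent Julia versions prints the compilation share), while @benchmark runs the expression many times after a warmup, so compilation doesn't dominate its statistics. A minimal sketch, assuming a trial() function like the one above:

using BenchmarkTools

@time trial()        # first call: includes compilation and prints "% compilation time"
@time trial()        # second call: usually much faster, little or no compilation

@benchmark trial()   # many samples after warmup; reports min/median/mean run times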

@Zhaoju-Deng
Author

I just ran @benchmark, and the estimated time was indeed only about half of the time reported by @time. However, the @benchmark time was not the total time for the code; the actual running time was much longer than the estimate from @benchmark. I am not familiar with how @benchmark and @time compute their timings, but my feeling is that the @time estimate is closer to the "actual" running time of the code.

@kdpsingh
Member

kdpsingh commented Aug 17, 2023

@Zhaoju-Deng, thanks for doing that.

The short answer is this. If you write code like this...

  @chain dt begin
      @group_by(tmvFrmId, tmvLifeNumber)
      @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
              mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
      @ungroup
  end

...in the global scope, then Julia first compiles the code (which takes a second), and then runs it. It sometimes has to do less compilation the second time around, but still has to do compilation.

However, if you wrap that same code in a function like this...

function analysis()
    @chain dt begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end

...and then you run analysis(), the first time you run it, it compiles, and then it doesn't have to compile again.

Now you might wonder, well how does that help in interactive usage?

Well, if you redefine the function like this, with the data frame dt as an argument...

function analysis(dataset)
    @chain dataset begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end

...then you can update the data frame and re-run the function with the updated data frame, and it'll be lightning fast (often 10x+ faster than tidyverse).

So to summarize:

  • If you wrap code in a function, it only has to compile once. For production workflows where you are running the same code multiple times on different datasets (or on different versions of the same dataset), you pay the compilation penalty once, and after that it's much, much faster (hence why Julia is said to be useful "for production").

  • For purely interactive usage where you are only planning to run your code once with no updates, you'll pay the compilation penalty the very first time, but the compilation time is usually very small (~1-2 seconds) and isn't related to the size of your dataset. For larger datasets, the compilation penalty is usually irrelevant since you usually save much more than 1-2 seconds on the analysis side.

I had the same questions as you about @benchmark() vs. @time, so let me show you a way you can test this out using only the @time macro.

Try the following set-up using @time(). Is it any faster the second time around?

function analysis(dataset)
    @chain dataset begin
        @group_by(tmvFrmId, tmvLifeNumber)
        @mutate(mean_my = mean(skipmissing(tmvMviMilkYield)),
                mean_scc = mean(skipmissing(tmvMviCelICountUdder)))
        @ungroup
    end
end

# the first time, the function compiles
@time analysis(dt) # assuming your dataset is named `dt`

# the second time, there should be no recompilation
@time analysis(dt)

@drizk1
Member

drizk1 commented Dec 13, 2023

With the recent updates to TidierData.jl, I was curious to revisit some benchmarks.

I benchmarked DataFrames.jl vs. TidierData.jl on a data frame of about 7.4 million rows × 11 columns.

Overall they performed nearly identically, coming within 15-20 ms of each other (different cases would lead one to be faster than the other, but only minimally, e.g. 812 ms vs. 828 ms). The only significant time difference was when the summarize macro was used, at which point TidierData was notably slower.

Overall, the progress and performance of TidierData.jl is incredible! Just thought I'd share the update here.

@kdpsingh
Member

Thanks for that update. This is a great reminder that I need to review the benchmarking page you had prepared for our documentation site, clean it up a bit, and make it public.

I'll try to run the @summarize() benchmark to see if I can reproduce it. I think I know why it's happening: we probably make an extra copy of the data. I have to check whether it's avoidable (it probably is). The reason the code needs to be slightly different here than for @mutate is that while there is a transform!() function, there's understandably no combine!() function, so I think I end up making an up-front copy that isn't needed.
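To illustrate the distinction in plain DataFrames.jl terms (a sketch with made-up column names, not the actual TidierData internals):

using DataFrames, Statistics

df = DataFrame(g = [1, 1, 2], x = [1.0, 2.0, 3.0])

# @mutate-style: transform! can write the grouped mean back into the parent in place,
# so no extra copy of df is required
transform!(groupby(df, :g), :x => mean => :x_mean)

# @summarize-style: combine produces a new, smaller data frame (one row per group),
# so there is no in-place combine!() and at least one allocation is unavoidable
summary_df = combine(groupby(df, :g), :x => mean => :x_mean)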

@kdpsingh
Member

kdpsingh commented Dec 14, 2023

Also, at some point we should add precompilation to TidierData to remove any lag from first usage. Even though we are primarily wrapping DataFrames.jl (which already caches precompiled code), the parsing functions should be precompiled.
