
[JOSS review] Paper comments #171

Closed
jpata opened this issue Jun 22, 2022 · 4 comments · Fixed by #174

Comments


jpata commented Jun 22, 2022

Congrats on the substantial work on the package and the paper! I had some comments in the context of the JOSS review which would be good to address before I sign off.
I'm still working through some actual practical use cases in code, so I might have some more comments later.

Content-related comments:

  1. I understand the goal is to provide performant access to ROOT files in a non-vectorized (for-loop) format, in Julia. It might be useful to add some concrete benchmarks to the paper to compare this package (Julia for-loop) to the alternatives:
    • a C++ ROOT loop
    • a C++ RDataFrame vectorized analysis
    • a python/awkward-array vectorized analysis
  2. the interface supports multi-threading which is great, do you have any information on the multi-threading performance (e.g. scaling) and correctness?
  3. lines 21-23: "dancing across language barriers hinders the ability to parallelize tasks that are conceptually trivial most of the time". I'm not sure if this is universally true. Running parallel code from python via e.g. multiprocessing or numba is often quite simple and effective. I think also parallelized RDataFrame via Python should be fairly easy to use and performant.
  4. line 49 states that UnROOT is used in CMS analysis, citing https://doi.org/10.1051/epjconf/202024506002. Checking those proceedings, there is no mention of Julia or UnROOT, therefore, I'm not sure if this is a statement supported by this citation. I checked the other citation for KM3NeT, and there is also no mention of julia/UnROOT there (it's from 2016, which predates uproot/UnROOT I believe).
  5. lines 56-57 mention composability with other nice Julia packages like queries or loop fusion, it would be great to put some example code in the paper!
  6. line 79 in the Summary states that the processing speeds from UnROOT are comparable to the C++ root framework, but there are no concrete benchmark results in the paper to support this claim (see also point 1)

Style-related comments:

  • line 11: "HEP community as been troubled by the two-language problem for a long time" -> This statement is rather generic. Who decides that this is actually a problem? Should one language solve everything, from configuration to computation on different devices and platforms?
  • line 14: "vectorized style, a type of problems which are normally tackled with" -> "usually implemented with ...". I think already here awkward-array (with dedicated precompiled kernels) should be mentioned and cited as a concrete solution to the problem of doing vectorized computations on jagged data.
  • line 18: "physicists who usually have no or little background in software engineering" -> is this statement supported by any studies? In any case, I would imagine that the physicists who are interested in trying Julia (as opposed to using something used by more people) might need to be a bit more software-inclined than the average.
  • line 40: resemble -> represent
  • for awkward-array, in addition to the Zenodo record of the software, it might be worth citing some of these proceedings: 10.1051/epjconf/201921406026, 10.1051/epjconf/202024505023, 10.1088/1742-6596/1525/1/012053 (though according to the awkward authors, citing just the software is preferred)

Moelf commented Jun 23, 2022

thanks for the review!

  1. do you think this is good enough:
    https://github.com/Moelf/UnROOT_RDataFrame_MiniBenchmark

  2. The scaling in the benchmark from point 1 is linear, just like g++-compiled RDataFrame

  3. also related to point 1, but regarding:

I think also parallelized RDataFrame via Python should be fairly easy to use and performant.

It seems Python is 2× slower. (I personally suspect it's due to JIT optimization being lower for interactive use.)

@tamasgal

Thanks for your valuable feedback and thanks @Moelf for the quick fixes in the other PR :)

Regarding 4) we cited the experiments themselves and not direct publications which involve UnROOT, maybe that was a bit unclear or easy to misread. UnROOT is used in many analyses in KM3NeT (as a core library) but I think it's not explicitly mentioned (yet). One of the goals of the JOSS paper submission was to get a DOI and encourage people to mention UnROOT after all ;)
For CMS I think it's a fairly similar situation, but @Moelf will comment for sure.
Should we make it more clear or leave it out?

  5. Yes, we should include examples, that's a good idea indeed.

  6. @Moelf I think your microbenchmarks are well suited for this. Those were also discussed with other experts so I think it's fine! Let's see what @jpata thinks about it.

Thanks for the style related comments, yes there is room for improvement and clarity. I will work on that...
Regarding line 18: we don't have any citation but it's a fairly common observation among many different institutes in my experience. I agree that it reads like a statement backed by evidence, so I'll rephrase that too ;)


jpata commented Jun 30, 2022

@Moelf

thanks for the review! do you think this is good enough: https://github.com/Moelf/UnROOT_RDataFrame_MiniBenchmark

I think quoting these numbers (potentially with a link to the repo) gives a good impression that it's a quantitative statement.

The scaling in the benchmark from point 1 is linear, just like g++-compiled RDataFrame

That's great - I did some tests too and after 2-4 threads I didn't see too much improvement, but that probably depends on the circumstances. Have you checked that the code running with several threads returns exactly the same results (i.e. that there are no multithreading bugs or dependence on non-associative floating-point operations)?
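One way to sketch such a check: run the reduction single-threaded as a reference, then chunked across threads, and compare within a tolerance rather than exactly, since floating-point addition is not associative. Here `process_range` is a hypothetical stand-in for the benchmark's per-event reduction over a range of entries:

```julia
# Single-threaded reference over n entries.
serial = process_range(1:n)

# Split the entry range into one chunk per thread and reduce in parallel.
chunks = Iterators.partition(1:n, cld(n, Threads.nthreads()))
partials = fetch.([Threads.@spawn process_range(r) for r in chunks])
threaded = sum(partials)

# Exact equality can fail for non-associative float sums; use a tolerance.
@assert isapprox(serial, threaded; rtol=1e-12)
```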

It seems Python is 2× slower. (I personally suspect it's due to JIT optimization being lower for interactive use.)

Ha, good to know...

In any case, from my point of view, the performance points 1-2 are addressed.

@tamasgal

For CMS I think it's a fairly similar situation, but @Moelf will comment for sure. Should we make it more clear or leave it out?

If you want to just give a reference to CMS, citing the NanoAOD proceedings is probably a bit of an odd choice. Perhaps a more defensible statement would be that UnROOT, like RDataFrame, can be used directly with flat ntuples such as CMS NanoAOD? In any case, it'd be nice to hear if it's really being used in an analysis (even if a citation is not available, for understandable reasons)!
For KM3NeT, the citation and the statement are also confusing, so it might be good to reword to make it clearer.

Note that I will be mostly off in July, but once the other comments are addressed, I can sign off from my side.

Moelf added a commit that referenced this issue Jul 1, 2022
Moelf mentioned this issue Jul 1, 2022

Moelf commented Jul 4, 2022

That's great - I did some tests too and after 2-4 threads I didn't see too much improvement, but probably depends on the circumstances.

Yeah, it's possible that beyond a certain number of threads we're limited by I/O, which comes in two forms:

  1. the raw bandwidth of the I/O device under random reads
  2. the seek latency (on an HDD) when the reader jumps to a different location

Moelf added a commit that referenced this issue Jul 11, 2022
Moelf added a commit to Moelf/UnROOT.jl that referenced this issue Oct 10, 2022