
[JOSS review] Paper comments #171

Closed
jpata opened this issue Jun 22, 2022 · 4 comments · Fixed by #174

Comments


jpata commented Jun 22, 2022

Congrats on the substantial work on the package and the paper! I had some comments in the context of the JOSS review which would be good to address before I sign off.
I'm still working through some actual practical use cases in code, so I might have some more comments later.

Content-related comments:

  1. I understand the goal is to provide performant access to ROOT files in a non-vectorized (for-loop) format, in Julia. It might be useful to add some concrete benchmarks to the paper to compare this package (Julia for-loop) to the alternatives:
    • a C++ ROOT loop
    • a C++ RDataFrame vectorized analysis
    • a python/awkward-array vectorized analysis
  2. the interface supports multi-threading which is great, do you have any information on the multi-threading performance (e.g. scaling) and correctness?
  3. lines 21-23: "dancing across language barriers hinders the ability to parallelize tasks that are conceptually trivial most of the time". I'm not sure if this is universally true. Running parallel code from python via e.g. multiprocessing or numba is often quite simple and effective. I think also parallelized RDataFrame via Python should be fairly easy to use and performant.
  4. line 49 states that UnROOT is used in CMS analysis, citing https://doi.org/10.1051/epjconf/202024506002. Checking those proceedings, there is no mention of Julia or UnROOT, therefore, I'm not sure if this is a statement supported by this citation. I checked the other citation for KM3NeT, and there is also no mention of julia/UnROOT there (it's from 2016, which predates uproot/UnROOT I believe).
  5. lines 56-57 mention composability with other nice Julia packages like queries or loop fusion, it would be great to put some example code in the paper!
  6. line 79 in the Summary states that the processing speeds from UnROOT are comparable to the C++ root framework, but there are no concrete benchmark results in the paper to support this claim (see also point 1)

Style-related comments:

  • line 11: "HEP community as been troubled by the two-language problem for a long time" -> This statement is rather generic. Who decides that this is actually a problem? Should one language solve everything, from configuration to computation on different devices and platforms?
  • line 14: "vectorized style, a type of problems which are normally tackled with" -> "usually implemented with ...". I think already here awkward-array (with dedicated precompiled kernels) should be mentioned and cited as a concrete solution to the problem of doing vectorized computations on jagged data.
  • line 18: "physicists who usually have no or little background in software engineering" -> is this statement supported by any studies? In any case, I would imagine that the physicists who are interested in trying Julia (as opposed to using something used by more people) might need to be a bit more software-inclined than the average.
  • line 40: resemble -> represent
  • for awkward-array, in addition to the Zenodo record of the software, it might be worth citing some of these proceedings: 10.1051/epjconf/201921406026, 10.1051/epjconf/202024505023, 10.1088/1742-6596/1525/1/012053 (though according to the awkward authors, citing just the software is preferred)

Moelf commented Jun 23, 2022

thanks for the review!

  1. do you think this is good enough:
    https://github.com/Moelf/UnROOT_RDataFrame_MiniBenchmark

  2. The scaling in the benchmark from point 1 is linear, just like g++-compiled RDataFrame

  3. also related to point 1, but regarding:

I think also parallelized RDataFrame via Python should be fairly easy to use and performant.

It seems Python is 2× slower. (I personally suspect it's due to JIT optimization being lower for interactive use.)

@tamasgal

Thanks for your valuable feedback and thanks @Moelf for the quick fixes in the other PR :)

Regarding 4) we cited the experiments themselves and not direct publications which involve UnROOT, maybe that was a bit unclear or easy to misread. UnROOT is used in many analyses in KM3NeT (as a core library) but I think it's not explicitly mentioned (yet). One of the goals of the JOSS paper submission was to get a DOI and encourage people to mention UnROOT after all ;)
For CMS I think it's a fairly similar situation, but @Moelf will comment for sure.
Should we make it more clear or leave it out?

  5. Yes, we should include examples, that's a good idea indeed.

  6. @Moelf I think your microbenchmarks are well suited for this. Those were also discussed with other experts so I think it's fine! Let's see what @jpata thinks about it.

Thanks for the style related comments, yes there is room for improvement and clarity. I will work on that...
Regarding line 18: we don't have any citation but it's a fairly common observation among many different institutes in my experience. I agree that it reads like a statement backed by evidence, so I'll rephrase that too ;)


jpata commented Jun 30, 2022

@Moelf

thanks for the review! do you think this is good enough: https://github.com/Moelf/UnROOT_RDataFrame_MiniBenchmark

I think quoting these numbers (potentially with a link to the repo) gives a good impression that it's a quantitative statement.

The scaling in the benchmark from point 1 is linear, just like g++-compiled RDataFrame

That's great - I did some tests too and after 2-4 threads I didn't see too much improvement, but that probably depends on the circumstances. Have you checked that the code running with several threads returns exactly the same results (i.e. that there are no multithreading bugs or dependence on non-associative floating-point operations)?
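One way to sketch such a check: run the reduction single-threaded as a reference, then chunked across threads, and compare within a tolerance rather than exactly, since floating-point addition is not associative. Here `process_range` is a hypothetical stand-in for the benchmark's per-event reduction over a range of entries:

```julia
# Single-threaded reference over n entries.
serial = process_range(1:n)

# Split the entry range into one chunk per thread and reduce in parallel.
chunks = Iterators.partition(1:n, cld(n, Threads.nthreads()))
partials = fetch.([Threads.@spawn process_range(r) for r in chunks])
threaded = sum(partials)

# Exact equality can fail for non-associative float sums; use a tolerance.
@assert isapprox(serial, threaded; rtol=1e-12)
```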

It seems Python is 2× slower. (I personally suspect it's due to JIT optimization being lower for interactive use.)

Ha, good to know...

In any case, from my point of view, the performance points 1-2 are addressed.

@tamasgal

For CMS I think it's a fairly similar situation, but @Moelf will comment for sure. Should we make it more clear or leave it out?

If you want to just give a reference to CMS, citing the NanoAOD proceedings is probably a bit of an odd choice. Perhaps a more defensible statement would be that UnROOT, like RDataFrame, can be used directly with flat ntuples such as CMS NanoAOD? In any case, it'd be nice to hear if it's really being used in an analysis (even if a citation is not available, for understandable reasons)!
For KM3NeT, the citation and the statement are also confusing, so it might be good to reword to make it clearer.

Note that I will be mostly off in July, but once the other comments are addressed, I can sign off from my side.

Moelf added a commit that referenced this issue Jul 1, 2022
Moelf mentioned this issue Jul 1, 2022

Moelf commented Jul 4, 2022

That's great - I did some tests too and after 2-4 threads I didn't see too much improvement, but probably depends on the circumstances.

Yeah, it's possible that beyond a certain number of threads we're limited by I/O, which comes in two forms:

  1. the raw bandwidth of the I/O device under random reads
  2. the seek latency (on an HDD) when the reader jumps to a different location

Moelf added a commit that referenced this issue Jul 11, 2022
Moelf added a commit to Moelf/UnROOT.jl that referenced this issue Oct 10, 2022