This repository contains implementations of High-Energy Physics (HEP) analysis queries from the IRIS-HEP benchmark written in JSONiq to be run on Rumble.

The purpose of this repository is to study the suitability of JSONiq for HEP analyses and to serve as a use case for improving Rumble. While JSONiq was originally designed for dealing with JSON documents, we believe that it may be suited for HEP analyses as well. Compared to many existing HEP tools, it is a higher-level language that more strictly separates the logic of the analyses from how the data is stored and how the query is executed (which has many advantages, but also some disadvantages). Compared to SQL, it has better support for the nested structure of both data and queries typically found in HEP, and it is standardized and hence portable across different JSONiq implementations.
There are currently three sets of implementations: one index-based (stored in `queries/index-based/`) and two object-based ones (stored in `queries/shredded-objects/` and `queries/native-objects/`). The index-based implementation directly manipulates the columnar data model as it is typically exposed by existing HEP tools, which corresponds to how the data is physically stored in ROOT files. For example, computing the invariant mass looks like this, using `$i` and `$j` to extract `eta`, `phi`, and `pt` of two muons from an event:
```
let $eta-diff := $event.Muon_eta[[$i]] - $event.Muon_eta[[$j]]
let $phi-diff := $event.Muon_phi[[$i]] - $event.Muon_phi[[$j]]
let $cosh := (exp($eta-diff) + exp(-$eta-diff)) div 2
let $invariant-mass :=
    2 * $event.Muon_pt[[$i]] * $event.Muon_pt[[$j]] * ($cosh - cos($phi-diff))
return $invariant-mass
```
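For reference, this snippet evaluates the standard approximation of the invariant mass of two highly relativistic particles; note that the expression bound to `$invariant-mass` is the squared mass, so the mass itself is obtained by taking its square root:

$$m^2 = 2\,p_{T,1}\,p_{T,2}\left(\cosh(\eta_1 - \eta_2) - \cos(\phi_1 - \phi_2)\right)$$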
The object-based implementations first restructure each event, reconstructing the objects from their values in the different columns. The same computation then looks like this:
```
2 * $m1.pt * $m2.pt * (math:cosh($m1.eta - $m2.eta) - cos($m1.phi - $m2.phi))
```
While this is clearly more readable, the restructuring may incur overhead and access data that is not actually needed. The two object-based versions thus do this restructuring at different points in time: `shredded-objects` reads the same file as the `index-based` queries and restructures the events on the fly, while `native-objects` expects the file to contain restructured events already (produced using the method described below). Otherwise, the query implementations are largely identical. Since in `native-objects` the restructuring is materialized in the file, it should be free at query time, whereas it may incur overhead in `shredded-objects`. At least in theory, due to the high-level nature of JSONiq, it should also be possible to eliminate that overhead; this is in fact a standard feature of SQL optimizers, and the same techniques can be applied to JSONiq.
To run the queries, you need the following:

- Docker or an installation of Rumble (see its documentation).
- Python 3 with `pip`.

Then set up the repository:

- Install the Python requirements:

  ```
  pip3 install -r requirements.txt
  ```

- Configure `rumble.sh` or `start-server.sh`. The simplest is to copy the provided files that use Docker:

  ```
  cp rumble.docker.sh rumble.sh
  cp start-server.docker.sh start-server.sh
  ```

  Alternatively, copy the `*.local.sh` variants and modify them to contain the correct paths.

- If you want to use a long-running Rumble server, start it:

  ```
  ./start-server.sh
  ```
The benchmark defines the queries against the following data set:

```
root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root
```
For getting started quickly, two samples of the file converted to Parquet are included in `data/`. Since Rumble supports reading ROOT files directly, running the queries against the original ROOT file only requires a minimal change in the query implementations.
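As a sketch of that change: the provided queries read their Parquet input with `parquet-file()`; to read a ROOT file directly, that call would be swapped for Rumble's ROOT input function. The function name `root-file()` and the field `nMuon` below are assumptions based on Rumble's documentation and the benchmark file; check your Rumble version for the exact signature.

```
(: Reading one of the provided Parquet samples: :)
for $event in parquet-file("data/Run2012B_SingleMu-1000.parquet")
return $event.nMuon

(: Reading the original ROOT file directly instead (assumed reader): :)
(: for $event in root-file("data/Run2012B_SingleMu.root")
   return $event.nMuon :)
```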
Instead of processing the file in place through the `root://` protocol, you may download it to local storage using `xrdcp` (which is installed together with the ROOT framework).
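For example, to copy the benchmark file into `data/`:

```
xrdcp root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root data/
```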
In order to extract a sample for faster processing, you can use `rooteventselector` (also part of the ROOT framework) and its `-f` and `-l` flags.
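For example, the following sketch extracts the first 1000 events; the tree name `Events` is an assumption about the benchmark file, and the flags select the first (`-f`) and last (`-l`) entry to copy:

```
# Copy entries 0..999 of the "Events" tree into a smaller file.
rooteventselector -f 0 -l 999 data/Run2012B_SingleMu.root:Events data/Run2012B_SingleMu-1000.root
```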
You may convert the ROOT files into Parquet (or other formats) using `tools/root2parquet.py`. This downloads the specified Laurelin package from the Internet, so if you are behind a proxy, you may need to configure Spark accordingly.
```
spark-submit \
    --packages edu.vanderbilt.accre:laurelin:1.1.1 \
    tools/root2parquet.py \
    -i data/Run2012B_SingleMu.root \
    -o data/Run2012B_SingleMu.parquet
```
This may create several partitions, each of which is a valid Parquet file. Spark (and hence Rumble) is able to read all of those files as one input data set. If you want to control the number of partitions (and hence files), use the `--num-files` flag.
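For instance, to produce a single Parquet file instead of several partitions (a sketch based on the flag mentioned above):

```
spark-submit \
    --packages edu.vanderbilt.accre:laurelin:1.1.1 \
    tools/root2parquet.py \
    -i data/Run2012B_SingleMu.root \
    -o data/Run2012B_SingleMu.parquet \
    --num-files 1
```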
To restructure the events for the `native-objects` queries (the method referred to above), run the following command from the root of this repository:
```
spark-submit \
    tools/restructure.py \
    -i data/Run2012B_SingleMu.parquet \
    -o data/Run2012B_SingleMu-restructured.parquet
```
This may produce a partitioned data set as with the previous script.
`extract_samples.sh` does the three previous steps in one go for all sample sizes of `n = 2**$l * 1000` for `l = {0..16}`. (Notice that there are fewer than `2**16 * 1000` events in the original data set, so the largest sample contains the full data set and has a misleading file name.) You may need to edit the script to match the paths of some executables or modify your `PATH` accordingly.
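If you are unsure which sample sizes to expect, the following one-liner prints them all (a sketch using plain shell arithmetic):

```
# Print all sample sizes n = 2**l * 1000 for l = 0..16
for l in $(seq 0 16); do echo $(( 2**l * 1000 )); done
```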
`test_queries.py` looks for the input files in `data/` with names of the form `Run2012B_SingleMu{restructured}{suffix}.parquet`, where `{restructured}` is `-restructured` for the `native-objects` queries (and empty otherwise), and `{suffix}` is empty for the full data set and `-{num_events}` for a sample of `{num_events}` events. It also looks for reference results in `queries/{query_name}/ref{suffix}.csv` with the same `{suffix}`.
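As a concrete illustration (using a hypothetical query folder name `query-5`), testing with a 1000-event sample would look for the following files:

```
data/Run2012B_SingleMu-1000.parquet               # index-based and shredded-objects queries
data/Run2012B_SingleMu-restructured-1000.parquet  # native-objects queries
queries/query-5/ref-1000.csv                      # reference result
```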
Run all queries on the full data set using `rumble.sh` from above with the following command:
```
./test_queries.py -v
```
This will currently fail, as we do not have a reference result for the full data set yet. Use `-N 1000` to test with 1000 events instead.
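For example, the following runs all queries verbosely against the 1000-event sample:

```
./test_queries.py -v -N 1000
```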
Run the following command to see more options:

```
$ ./test_queries.py --help
usage: test_queries.py [options] [file_or_dir] [file_or_dir] [...]
...
custom options:
  -Q QUERY_ID, --query-id=QUERY_ID
                        Folder name of query to run.
  -N NUM_EVENTS, --num-events=NUM_EVENTS
                        Number of events taken from the input file. This
                        influences which reference file should be taken.
  -I INPUT_PATH, --input-path=INPUT_PATH
                        Path to input ROOT file.
  --rumble-cmd=RUMBLE_CMD
                        Path to spark-submit.
  --rumble-server=RUMBLE_SERVER
                        Rumble server to connect to.
  --freeze-result       Overwrite reference result.
  --plot-histogram      Plot resulting histogram as PNG file.
...
```
For example, to run all queries whose name contains `shredded-objects` on the test data set with 1000 events using a local server, do the following:

```
./test_queries.py -v -N 1000 --rumble-server http://localhost:8001/jsoniq -k shredded-objects
```
The following errors may be encountered during the execution of the queries:

- Spark `java.lang.OutOfMemoryError: Java heap space`: In this case, it is suggested that `spark.driver.memory` and `spark.executor.memory` are increased, for example to `8g` and `4g`, respectively. These should be set in `<spark_install_dir>/conf/spark-defaults.conf`.
- `Buffer overflow. Available: 0, required: xxx`: In this case, the issue likely stems from the Kryo serialization framework. It is suggested that `spark.kryoserializer.buffer.max` be set to something like `1024m`. This should also be set in `<spark_install_dir>/conf/spark-defaults.conf`.
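Putting both suggestions together, the corresponding lines in `<spark_install_dir>/conf/spark-defaults.conf` would look roughly as follows (the values are the examples from above and may need tuning for your machine):

```
spark.driver.memory              8g
spark.executor.memory            4g
spark.kryoserializer.buffer.max  1024m
```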