
This document provides a guided exploration through several of rebar's sub-commands.

Table of contents

- Run benchmarks and gather measurements
- Explore measurements
- Rank regex engines

Run benchmarks and gather measurements

For this section, you'll need to build both rebar and at least the regex engines that you want to gather measurements for. Note that you don't need to build all of the regex engines, but if you want to follow along with this tutorial, you'll need to build rust/regex, go/regexp and python/re.

The BUILD document explains the build process in more detail, but quickly, you can build rebar like so:

$ git clone https://github.com/BurntSushi/rebar
$ cd rebar
$ cargo install --path .

and then build the regex engines we're interested in like this:

$ rebar build -e '^(rust/regex|go/regexp|python/re)$'

The -e flag is short for --engine, which can be used multiple times and accepts a regex pattern. For example, this command is equivalent to the one above:

$ rebar build -e '^rust/regex$' -e '^go/regexp$' -e '^python/re$'

The -e/--engine flag is accepted by most of rebar's sub-commands.

Assuming you have Cargo, Go and Python installed, you should see some output like the following:

rust/regex: running: cd "engines/rust/regex" && "cargo" "build" "--release"
rust/regex: build complete for version 1.7.2
go/regexp: running: cd "engines/go" && "go" "build"
go/regexp: build complete for version 1.20.1
python/re: running: cd "engines/python" && "virtualenv" "ve"
python/re: build complete for version 3.10.9

Before collecting measurements, it might be a good idea to run rebar's test suite for these engines. This ensures the regex engines are working in a minimally expected way:

$ rebar measure -e '^(rust/regex|go/regexp|python/re)$' -f '^test/' --test
test/func/leftmost-first,count-spans,go/regexp,1.20.1,OK
test/func/leftmost-first,count-spans,python/re,3.10.9,OK
test/func/leftmost-first,count-spans,rust/regex,1.7.2,OK
test/func/dollar-only-matches-end,count,go/regexp,1.20.1,OK
test/func/dollar-only-matches-end,count,python/re,3.10.9,OK
test/func/dollar-only-matches-end,count,rust/regex,1.7.2,OK
[... snip ...]

In reality, running tests is almost like collecting measurements. The only difference is that we provide the -t/--test flag, which tells rebar to collect a single measurement for each regex engine and verify that its result is what is expected.

Let's collect our first measurements. We'll run all of the benchmarks in the curated/04-ruff group. This might take a little while to run, because each benchmark repeats its task until either a certain number of iterations or a time limit has been reached.

$ rebar measure -e '^(rust/regex|go/regexp|python/re)$' -f '^curated/04-ruff' | tee ruff.csv
name,model,rebar_version,engine,engine_version,err,haystack_len,iters,total,median,mad,mean,stddev,min,max
curated/04-ruff-noqa/real,grep-captures,0.0.1 (rev cef9e52192),go/regexp,1.20.1,,32514634,4,5.72s,933.94ms,180.57us,933.98ms,1.10ms,932.48ms,935.57ms
curated/04-ruff-noqa/real,grep-captures,0.0.1 (rev cef9e52192),python/re,3.10.9,,32514634,3,5.52s,1.08s,0.00ns,1.08s,785.70us,1.08s,1.08s
curated/04-ruff-noqa/real,grep-captures,0.0.1 (rev cef9e52192),rust/regex,1.7.2,,32514634,71,4.61s,42.44ms,0.00ns,42.45ms,137.39us,42.19ms,42.77ms
curated/04-ruff-noqa/tweaked,grep-captures,0.0.1 (rev cef9e52192),go/regexp,1.20.1,,32514634,70,4.66s,43.20ms,17.61us,43.47ms,2.11ms,41.00ms,58.82ms
curated/04-ruff-noqa/tweaked,grep-captures,0.0.1 (rev cef9e52192),python/re,3.10.9,,32514634,11,4.96s,283.11ms,0.00ns,283.14ms,759.22us,281.96ms,284.25ms
curated/04-ruff-noqa/tweaked,grep-captures,0.0.1 (rev cef9e52192),rust/regex,1.7.2,,32514634,128,4.56s,23.52ms,225.00ns,23.53ms,59.95us,23.37ms,23.69ms
curated/04-ruff-noqa/compile-real,compile,0.0.1 (rev cef9e52192),go/regexp,1.20.1,,,354872,4.67s,7.33us,0.00ns,7.61us,9.61us,2.41us,465.85us
curated/04-ruff-noqa/compile-real,compile,0.0.1 (rev cef9e52192),python/re,3.10.9,,,39818,4.57s,73.68us,0.00ns,73.77us,1.13us,71.05us,205.05us
curated/04-ruff-noqa/compile-real,compile,0.0.1 (rev cef9e52192),rust/regex,1.7.2,,,85849,4.57s,30.94us,0.00ns,30.98us,613.00ns,28.77us,39.90us

The -f/--filter flag accepts a regex pattern just like the -e/--engine flag, but instead of being applied to the engine name, it's applied to the benchmark name. If we use both -e/--engine and -f/--filter like we do above, then the benchmark has to match both of them in order to run.

Notice here that I used tee to both write the measurement results to stdout and capture them to a file. I did that because measurement results can be used as an input to other rebar sub-commands for inspection. Since measurement results can take a long time to record, it's nice to both be able to see the progress and to capture them for inspection later.
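
For example, once the run above finishes, the captured file can be handed straight to the comparison command described in the next section. This is just a sketch reusing the ruff.csv file captured with tee above; the output is omitted here:

$ rebar cmp ruff.csv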

The next section will discuss how to inspect the results. Namely, the raw CSV data isn't particularly amenable to easy analysis. You can look at it directly if you chop it down a little bit and format it nicely with a tool like xsv:

$ xsv select name,model,engine,iters,median,min ruff.csv | xsv table
name                               model          engine      iters   median    min
curated/04-ruff-noqa/real          grep-captures  go/regexp   4       933.94ms  932.48ms
curated/04-ruff-noqa/real          grep-captures  python/re   3       1.08s     1.08s
curated/04-ruff-noqa/real          grep-captures  rust/regex  71      42.44ms   42.19ms
curated/04-ruff-noqa/tweaked       grep-captures  go/regexp   70      43.20ms   41.00ms
curated/04-ruff-noqa/tweaked       grep-captures  python/re   11      283.11ms  281.96ms
curated/04-ruff-noqa/tweaked       grep-captures  rust/regex  128     23.52ms   23.37ms
curated/04-ruff-noqa/compile-real  compile        go/regexp   354872  7.33us    2.41us
curated/04-ruff-noqa/compile-real  compile        python/re   39818   73.68us   71.05us
curated/04-ruff-noqa/compile-real  compile        rust/regex  85849   30.94us   28.77us

Otherwise, there is one other filtering option worth mentioning. The -m/--model flag lets you filter on the benchmark model used. This is often useful, for example, to exclude measurements for compile time in the case where you're only interested in search time benchmarks:

$ rebar measure -e '^(rust/regex|go/regexp|python/re)$' -f '^curated/04-ruff' -M compile | tee ruff-nocompile.csv
name,model,rebar_version,engine,engine_version,err,haystack_len,iters,total,median,mad,mean,stddev,min,max
curated/04-ruff-noqa/real,grep-captures,0.0.1 (rev cef9e52192),go/regexp,1.20.1,,32514634,4,5.67s,933.62ms,1.81ms,933.93ms,3.48ms,929.68ms,938.80ms
curated/04-ruff-noqa/real,grep-captures,0.0.1 (rev cef9e52192),python/re,3.10.9,,32514634,3,5.52s,1.08s,0.00ns,1.08s,1.13ms,1.08s,1.08s
curated/04-ruff-noqa/real,grep-captures,0.0.1 (rev cef9e52192),rust/regex,1.7.2,,32514634,71,4.61s,42.43ms,0.00ns,42.46ms,163.10us,42.22ms,43.06ms
curated/04-ruff-noqa/tweaked,grep-captures,0.0.1 (rev cef9e52192),go/regexp,1.20.1,,32514634,70,4.61s,42.89ms,13.13us,42.95ms,1.76ms,40.46ms,56.15ms
curated/04-ruff-noqa/tweaked,grep-captures,0.0.1 (rev cef9e52192),python/re,3.10.9,,32514634,11,5.01s,287.84ms,0.00ns,287.26ms,1.67ms,284.65ms,290.43ms
curated/04-ruff-noqa/tweaked,grep-captures,0.0.1 (rev cef9e52192),rust/regex,1.7.2,,32514634,130,4.56s,23.18ms,1.05us,23.19ms,59.18us,23.07ms,23.36ms

In this case, we actually used the inverse flag, -M/--model-not, to exclude benchmarks with the string compile in their model name.
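
Conversely, if you only wanted to look at the compile-time benchmarks, you could use the inclusive -m/--model flag instead. This is just a sketch that reuses the flags introduced above; the output is not shown:

$ rebar measure -e '^(rust/regex|go/regexp|python/re)$' -f '^curated/04-ruff' -m compile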

Explore measurements

In this section, we explore measurements that have already been recorded. You can of course use the very same techniques to explore your own measurements, but by using pre-recorded measurements from this repository we make it possible to do exploratory analysis without needing to build and run the benchmarks. (For example, on my machine, collecting measurements for every benchmark takes about 3 hours. Of course, you can always gather a subset of measurements too, as explained in the previous section.)

All we need for exploratory analysis of measurements is rebar itself and a checkout of this repository. The BUILD document explains that in more detail.

Assuming your current working directory is at the root of the repository, we can start working with a saved set of recorded measurements:

$ ls -l record/all/2023-04-11/*.csv | wc -l
30

Typically, the way to look at a set of measurements is to compare the results for each benchmark across each regex engine. The command to do that is rebar cmp. You just need to give it all of the measurements:

$ rebar cmp record/all/2023-04-11/*.csv

If you run that command, you'll quickly see the problem: the table is just too big. There are too many benchmarks and too many regex engines. There are a few ways to deal with this.

Since the rows of the table are the benchmarks (i.e., each column reflects the results for a single regex engine), you can make the table more readable by limiting the regex engines you examine with the -e/--engine flag. For example, to compare the rust/regex, re2 and go/regexp engines across all benchmarks:

$ rebar cmp record/all/2023-04-11/*.csv -e '^(re2|rust/regex|go/regexp)$'

The above command shows all benchmarks where at least one of the engines has a recorded result. You might instead only want to look at results where all three engines have a result:

$ rebar cmp record/all/2023-04-11/*.csv -e '^(re2|rust/regex|go/regexp)$' --intersection

The --intersection flag is a filter that says, "only show results where every regex engine has a measurement."

But sometimes looking at all results, even for a smaller set of engines, can be overwhelming. What about only looking at benchmarks for which there is a big difference between engines? The -t/--threshold-min and -T/--threshold-max flags let you filter results down to "benchmarks that contain at least one result whose speedup ratio compared to the best is within the threshold range given." The threshold flags are usually most useful when comparing two engines, but they work for any number of engines. For example, we might only want to see benchmarks in which there is at least one result that is 50 times slower than the best:

$ rebar cmp record/all/2023-04-11/*.csv -e '^(re2|rust/regex)$' -t 50

Or perhaps we might only want to see benchmarks in which there is at least one result that is within 1.2 times the speed of the best:

$ rebar cmp record/all/2023-04-11/*.csv -e '^(re2|rust/regex)$' -T 1.20

But what if we want to flip all of this around and look at all of the regex engines, but only for a small set of benchmarks? In this case, the --row flag can be used to flip the rows and columns. That is, the rows become regex engines and the columns become benchmarks. We use the -f/--filter flag to limit our benchmarks (otherwise we'd still have the problem of too many columns, because there are so many benchmarks):

$ rebar cmp record/all/2023-04-11/*.csv --row engine -f mariomka -f regex-redux
engine              imported/mariomka/email  imported/mariomka/uri  imported/mariomka/ip  imported/regex-redux/regex-redux
------              -----------------------  ---------------------  --------------------  --------------------------------
dotnet              -                        -                      -                     223.01ms (17.91x)
dotnet/compiled     30.4 GB/s (1.74x)        8.2 GB/s (1.00x)       1777.3 MB/s (8.61x)   62.28ms (5.00x)
dotnet/nobacktrack  26.9 GB/s (1.96x)        3.5 GB/s (2.37x)       705.1 MB/s (21.71x)   65.55ms (5.27x)
go/regexp           46.6 MB/s (1160.30x)     46.9 MB/s (179.06x)    30.4 MB/s (503.58x)   407.51ms (32.73x)
hyperscan           37.6 GB/s (1.40x)        4.1 GB/s (2.01x)       14.9 GB/s (1.00x)     77.04ms (6.19x)
icu                 30.8 MB/s (1757.19x)     34.3 MB/s (245.30x)    574.2 MB/s (26.66x)   -
java/hotspot        62.1 MB/s (870.02x)      73.0 MB/s (115.08x)    53.0 MB/s (288.82x)   137.21ms (11.02x)
javascript/v8       163.8 MB/s (329.80x)     213.5 MB/s (39.35x)    10.4 GB/s (1.44x)     25.35ms (2.04x)
pcre2               87.8 MB/s (615.69x)      94.4 MB/s (88.98x)     819.4 MB/s (18.68x)   158.03ms (12.69x)
pcre2/jit           497.9 MB/s (108.52x)     496.8 MB/s (16.91x)    2.5 GB/s (5.98x)      21.00ms (1.69x)
perl                159.9 MB/s (337.83x)     154.9 MB/s (54.24x)    581.3 MB/s (26.33x)   151.96ms (12.21x)
python/re           46.3 MB/s (1167.51x)     77.3 MB/s (108.69x)    40.5 MB/s (377.63x)   135.95ms (10.92x)
python/regex        20.6 MB/s (2622.48x)     36.2 MB/s (231.98x)    736.2 MB/s (20.79x)   202.29ms (16.25x)
re2                 986.8 MB/s (54.76x)      891.1 MB/s (9.43x)     982.3 MB/s (15.58x)   30.96ms (2.49x)
regress             88.5 MB/s (610.89x)      99.0 MB/s (84.90x)     105.3 MB/s (145.43x)  63.68ms (5.11x)
rust/regex          825.6 MB/s (65.45x)      765.6 MB/s (10.97x)    721.5 MB/s (21.22x)   13.09ms (1.05x)
rust/regex/meta     52.8 GB/s (1.00x)        7.9 GB/s (1.04x)       3.0 GB/s (5.05x)      12.45ms (1.00x)
rust/regexold       824.6 MB/s (65.53x)      765.6 MB/s (10.97x)    3.0 GB/s (4.95x)      14.88ms (1.20x)

Here we use -f/--filter twice to say "show any benchmarks containing either mariomka or regex-redux in the name."

We can also change the aggregate statistic used for comparison with the -s/--statistic flag. By default, the median is used, but we could use the minimum instead. The minimum refers to the minimum absolute timing, and since throughput is shown by default, this in turn corresponds to the maximum recorded throughput:

$ rebar cmp record/all/2023-04-11/*.csv --row engine -f mariomka -f regex-redux -s min

Speaking of throughput, we can change the default such that only absolute times are shown using the -u/--units flag:

$ rebar cmp record/all/2023-04-11/*.csv --row engine -f mariomka -f regex-redux -s min -u time

Rank regex engines

If you want to get a sense of how regex engines do over a large corpus of benchmarks, you can summarize the results using the rebar rank command. It works by computing, for each benchmark, the speedup ratio of each regex engine relative to the fastest engine in that benchmark. The fastest engine has a speedup ratio of 1.0, and every other engine has a speedup ratio of N, where the engine is said to be "N times slower than the fastest."

The rebar rank command then averages the speedup ratios for every engine using the geometric mean. In other words, this distills the relative performance of a single regex engine across many benchmarks down into a single number.
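
As a rough sketch of that computation (the notation here is introduced for illustration and is not taken from rebar itself), if an engine e has speedup ratios r_{e,1}, ..., r_{e,n} across the n benchmarks it appears in, then its summary score is

$$\mathrm{score}(e) = \left( \prod_{i=1}^{n} r_{e,i} \right)^{1/n}$$

so an engine that is fastest everywhere scores 1.0, and a larger score means the engine is, on (geometric) average, that many times slower than the per-benchmark winners.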

Naively, you can ask for a ranking of regex engines across all recorded results, and rebar will happily give it to you:

$ rebar rank record/all/2023-04-11/*.csv

The problem with this is that some regex engines are used in a lot more benchmarks than others. Notably, rust/regex appears in a lot of them because the author of this barometer also uses this tool for optimizing rust/regex, and so it naturally has many benchmarks defined for it. Conversely, some engines have very few benchmarks defined for them. More to the point, some engines, like rust/regex/hir, are just measuring an aspect of compilation time for the rust/regex engine and aren't relevant for search benchmarks.

In other words, asking for a ranking across all benchmarks in this barometer is a category error. You could make it a little better by limiting the ranking to search-only benchmarks, which means excluding all benchmarks that use the compile model:

$ rebar rank record/all/2023-04-11/*.csv -M compile

But this doesn't address all of the issues.

Another way to go about this is to limit yourself to a subset of benchmarks in which every regex engine has healthy representation compared to the others. The curated benchmarks are specifically designed for this (although they are not perfect at it, so the comparison is still flawed in some sense):

$ rebar rank record/all/2023-04-11/*.csv -M compile -f '^curated/'

You can make the comparison a little bit more rigorous by limiting it to the set of benchmarks for which every regex engine has a result:

$ rebar rank record/all/2023-04-11/*.csv -M compile -f '^curated/' --intersection

It is also possible to do a correct-ish ranking across all measurements if you limit yourself to the benchmarks in common between the engines. In practice, this is most useful in the context of a pairwise comparison, because it ensures you get the most data. (As you add more engines, the set of benchmarks containing all of them starts to shrink.) We do still want to exclude compile benchmarks. (You basically never want to run rebar rank without either -m compile to include only compile-time benchmarks, or -M compile to exclude them. That is, lumping compile-time and search-time benchmarks into the same ranking doesn't make much sense.)

$ rebar rank record/all/2023-04-11/*.csv -M compile -e '^(hyperscan|rust/regex)$' --intersection
Engine      Version           Geometric mean of speed ratios  Benchmark count
------      -------           ------------------------------  ---------------
hyperscan   5.4.1 2023-02-22  1.33                            45
rust/regex  1.7.2             2.73                            45

And if you ever want to drill down into exactly which benchmarks are contributing to the above ranking, you can always flip to using rebar cmp. It accepts the same types of filtering flags, so literally all you have to do is use rebar cmp instead of rebar rank:

$ rebar cmp record/all/2023-04-11/*.csv -M compile -e '^(hyperscan|rust/regex)$' --intersection

Please be careful what conclusions you draw from rebar rank. It exists purely because there is so much data, and it can be useful to get a "general sense" of where regex engines stand with respect to one another. But remember, it is very difficult to use it as a tool for definitively declaring one regex engine as faster than another, especially if the geometric means are anywhere near close to one another. More to the point, biased benchmark selection directly influences what the geometric mean is able to tell you. Its signal is only as good as the inputs given to it.