This repository contains a suite of test cases and implementations for parsing text using Python. Both speed and memory usage are compared for each implementation. Each implementation must be:
- Correct -- it implements the specification exactly
  - Syntax -- it parses all valid inputs and rejects bad ones
  - Semantics -- it produces the expected data structures
- Plausible -- it only uses generally known parser features or parsing strategies
Validation tests can ensure the correctness of an implementation, but its plausibility is harder to pin down. In practice, grammars that require knowledge of a parser's internals or undocumented features to increase performance are discouraged. A grammar should be understandable by someone with access to the parser's documentation and examples.
This table lists the tasks performed and the libraries benchmarked.
| Implementation | Algorithm | JSON | Arithmetic | INI |
|---|---|---|---|---|
| stdlib | handwritten | yes | yes | yes |
| Lark | LALR | yes | yes | yes |
| parsimonious | Recursive Descent | yes | | |
| pe | Recursive Descent | yes | yes | yes |
| pyparsing | Recursive Descent | yes | | |
| SLY | LALR | yes | | |
The following bar chart shows the time in milliseconds to parse a ~5MB JSON file using both CPython and PyPy.
[cpython] stdlib : ▏ 97 ms
[cpython] pe : ▇▇▇▇▇▇▇ 1931 ms
[cpython] lark : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3799 ms
[cpython] sly : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4166 ms
[cpython] parsimonious: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8377 ms
[cpython] pyparsing : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 12412 ms
[pypy] stdlib : ▇ 288 ms
[pypy] pe : ▇▇ 574 ms
[pypy] lark : ▇▇ 638 ms
[pypy] sly : ▇▇ 656 ms
[pypy] pyparsing : ▇▇▇▇▇ 1283 ms
[pypy] parsimonious: ▇▇▇▇▇▇▇▇▇▇▇▇ 3004 ms
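The repository's JSON benchmark is driven by pytest-benchmark, but the core of what is being timed can be sketched with only the standard library. The document and timing loop below are illustrative assumptions, not the repository's code (the real benchmark parses a ~5MB file):

```python
import json
import time

# Hypothetical sample document; the real benchmark uses a ~5MB JSON file.
doc = json.dumps({"items": [{"id": i, "name": f"item-{i}"} for i in range(1000)]})

start = time.perf_counter()
data = json.loads(doc)  # the stdlib parser, implemented in C on CPython
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"parsed {len(data['items'])} items in {elapsed_ms:.2f} ms")
```

The stdlib's C-accelerated `json.loads` is why it dominates the chart above; the other libraries execute their parse loops in pure Python.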
Here are the results for parsing 10k complicated (from a parsing point of view) arithmetic expressions:
[cpython] stdlib : ▇▇ 132 ms
[cpython] pe : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2200 ms
[cpython] lark : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2304 ms
[pypy] stdlib : ▇▇▇▇ 209 ms
[pypy] lark : ▇▇▇▇ 222 ms
[pypy] pe : ▇▇▇▇▇▇▇▇▇▇▇▇ 591 ms
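To give a sense of what the "handwritten" approach in the stdlib row looks like, here is a minimal recursive-descent evaluator for arithmetic with `+ - * /` and parentheses. This is a sketch of the general technique, not the repository's implementation:

```python
import re

# One token per match: an integer or a single operator/parenthesis,
# with optional leading whitespace.
TOKEN = re.compile(r"\s*(\d+|[()+\-*/])")

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"bad character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(text):
    tokens = tokenize(text)
    i = 0

    def peek():
        return tokens[i] if i < len(tokens) else None

    def expr():          # expr := term (('+'|'-') term)*
        nonlocal i
        value = term()
        while peek() in ("+", "-"):
            op = tokens[i]; i += 1
            value = value + term() if op == "+" else value - term()
        return value

    def term():          # term := factor (('*'|'/') factor)*
        nonlocal i
        value = factor()
        while peek() in ("*", "/"):
            op = tokens[i]; i += 1
            value = value * factor() if op == "*" else value / factor()
        return value

    def factor():        # factor := NUMBER | '(' expr ')'
        nonlocal i
        tok = peek()
        if tok == "(":
            i += 1
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            i += 1
            return value
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        i += 1
        return int(tok)

    result = expr()
    if i != len(tokens):
        raise SyntaxError("trailing input")
    return result

print(parse("2*(3+4)-5"))  # → 9
```

Each grammar rule becomes one function, and precedence falls out of the call structure (`expr` calls `term`, which calls `factor`), which is why this style is both fast and easy to audit against a specification.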
And here are the results for parsing an INI file with 1000 sections:
[cpython] stdlib : ▇▇▇▇▇▇▇▇▇▇▇▇ 97 ms
[cpython] pe : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 187 ms
[cpython] lark : ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 379 ms
[pypy] stdlib : ▇▇▇▇ 33 ms
[pypy] lark : ▇▇▇▇▇▇ 51 ms
[pypy] pe : ▇▇▇▇▇▇▇▇ 67 ms
Charts produced with termgraph.
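The stdlib row in the INI benchmark corresponds to `configparser`. A minimal sketch of its input (the 1000-section layout and key names here are assumptions, not the benchmark's actual data):

```python
import configparser

# Build an INI document with 1000 sections, one key each.
text = "\n".join(f"[section{i}]\nkey = value{i}\n" for i in range(1000))

parser = configparser.ConfigParser()
parser.read_string(text)
print(len(parser.sections()))      # → 1000
print(parser["section42"]["key"])  # → value42
```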
These benchmarks were run on a Lenovo ThinkPad with an Intel Core i5-6200U and 8GB of memory, running Pop!_OS Linux 18.04. The millisecond values will change across systems, but the relative performance should be similar (I'd be interested in hearing if you find otherwise!).
Python 3.6+ is required.
First create a virtual environment (recommended) and install the package and requirements:
$ python3 -m venv cpy
$ source cpy/bin/activate
(cpy) $ pip install -r requirements.txt
You can also use PyPy by creating its own virtual environment:
$ pypy3 -m venv pypy
$ source pypy/bin/activate
(pypy) $ pip install -r requirements.txt
From here on it's assumed you're in one of the `(cpy)` or `(pypy)` environments.
The benchmarks are implemented using `pytest` and `pytest-benchmark`, so you can run the tests with `pytest` if you adjust `PYTHONPATH`:
$ PYTHONPATH=. pytest
But it might be easier to just run the included `validate.py` and `benchmark.py` scripts, which pass the appropriate options on to `pytest`:
$ python validate.py # skip performance tests (they take a while)
$ python benchmark.py # skip validity tests
You can give specific library names to limit the tests that are run:
$ python validate.py pe stdlib # only run validation for pe and stdlib
Some tests print diagnostic information to stdout that can be useful, such as the recursion limit used in JSON parsing. Use the following option to see that information:
$ python validate.py -rP # print stdout for each test
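The recursion limit matters because a recursive-descent parser consumes one stack frame per nesting level. A sketch of the effect (the nesting depth here is an arbitrary illustration, not a value from the test suite):

```python
import json
import sys

print(sys.getrecursionlimit())  # typically 1000 on CPython

# 100,000 levels of nesting -- far deeper than the stack allows.
deep = "[" * 100_000 + "]" * 100_000
try:
    json.loads(deep)
    hit_limit = False
except RecursionError:
    hit_limit = True
print("deep nesting exceeded the recursion limit:", hit_limit)
```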
Performance benchmarks are not the only criterion one should use when choosing a parsing library, and this repository is not meant to declare some winner. On the one hand, there are many other valid criteria (ease of use, stability, security, availability, support, etc.), but on the other hand we can't discuss relative performance without numbers, so here we are.