Regular Expression Engines

Lev Eliezer Israel edited this page Nov 11, 2015 · 6 revisions

Python's native regular expression engine uses Perl-style regular expressions, which cannot be compiled to fast DFAs. For regular expressions that don't use Perl-style extensions (such as look-behinds), the re2 library and its Python wrapper can produce code that runs about 100 times faster. Sefaria uses this library to compile the regular expression that finds book titles in text. If 're2' isn't installed on the system, the code falls back to the built-in 're' module.
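The fallback described above can be sketched as a guarded import. This is a minimal illustration and not Sefaria's actual code: the helper names compile_title_pattern and find_titles and the sample titles are hypothetical, and it assumes the installed wrapper exposes compile and finditer in the same way as the standard 're' module.

```python
import re as stdlib_re

# Prefer re2 when the wrapper is installed; otherwise use the built-in engine.
try:
    import re2 as regex_engine
except ImportError:
    regex_engine = stdlib_re


def compile_title_pattern(titles):
    """Build one alternation over all titles (hypothetical helper)."""
    # Longest titles first, so a longer title wins over a shorter one
    # that happens to be its prefix.
    ordered = sorted(titles, key=len, reverse=True)
    # Escaping keeps the pattern free of Perl-style extensions.
    return regex_engine.compile("|".join(stdlib_re.escape(t) for t in ordered))


def find_titles(text, titles):
    """Return every title occurrence in text, in order of appearance."""
    pattern = compile_title_pattern(titles)
    return [match.group(0) for match in pattern.finditer(text)]


if __name__ == "__main__":
    sample = "Compare Genesis 1:1 with Song of Songs 2:1 and Exodus 3."
    print(find_titles(sample, ["Song of Songs", "Genesis", "Exodus"]))
```

Because the pattern is a plain alternation of escaped literals, it stays within the subset that both engines accept, so the same code path works with or without re2.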

Installing re2

To install the Python re2 engine, first install the Python development headers and a build environment with g++, then install Google's re2, and finally install the Python re2 wrapper.

Linux: Use your package manager to install them, e.g. sudo apt-get install python-dev; sudo apt-get install build-essential

Mac OS: You'll likely need to install the Xcode dev tools.

Windows: ???

Google's re2

Compile and install Google's re2 code (https://github.com/google/re2/wiki/Install) with the following:

git clone https://code.googlesource.com/re2
cd re2
make test
make install
make testinstall

Pthread Errors

Some systems report pthread errors during compilation (see https://code.google.com/p/re2/issues/detail?id=100). If you see this error, edit the Makefile so that the LDFLAGS line reads: LDFLAGS?=-lpthread
In this case, you can ignore the errors in make testinstall.

Python re2 wrappers

Compile and install pyre2 from the repository (https://github.com/axiak/pyre2). As of this writing, the version installable with pip is out of date.

git clone https://github.com/axiak/pyre2.git
cd pyre2
sudo python setup.py install
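After installing, a quick way to see which engine Python code will pick up is to try the import directly. The following is a small sketch; the function name active_regex_engine is hypothetical and is not part of pyre2 or Sefaria.

```python
def active_regex_engine():
    """Report which module a try/except import fallback would select."""
    try:
        import re2  # noqa: F401 -- only checking that the wrapper imports
        return "re2"
    except ImportError:
        return "re"


if __name__ == "__main__":
    print("Regular expression engine in use:", active_regex_engine())
```

If this prints 're' after a seemingly successful install, the wrapper was probably installed for a different Python interpreter than the one running the check.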