C++ Quad Wavelet Tree

Wavelet trees[1] are a self-indexing rank and select data structure, i.e., they can answer rank (how often does a symbol occur in a prefix of length $i$) and select (where does a symbol occur for the $i$-th time) queries, while still allowing to access the text.

This makes them an important building block for compressed full-text indices, where they are used to answer rank queries during the pattern matching algorithm, the so called backwards search.

Using a wavelet tree requires $n\lceil\log \sigma \rceil (1+o(1))$ bits of space for a text of length $n$ over an alphabet of size $\sigma$. Access, rank, and select queries can be answered in time $O(\log\sigma)$ time.

This repository provides a fast implementation of wavelet matrices in C++, since they are in practice faster than wavelet trees. A companion Rust implementation is available here.

To improve query performance, our implementation uses a 4-ary tree instead of a binary tree as basis of the wavelet tree. The 4-ary tree layout of a wavelet tree helps to halve the number of cache misses during queries and thus reduces the query latency. Our experimental evaluation shows that our 4-ary wavelet tree can improve the latency of access, rank and select queries by factor of $\approx$ 2 compared to other implementations of wavelet trees contained in the widely used Succinct Data Structure Library (SDSL). For more details, see Benchmarks and the paper [2].

Benchmarks

All the experiments are performed using a single thread on a server machine with 8 Intel~i9-9900KF cores with base frequencies of 3.60 GHz running Linux 5.19.0. Each core has a dedicated L1 cache of size 32 KiB, a dedicated L2 cache of size 256 KiB, a shared L3 cache of size 16 MiB, and 64 GiB of RAM. The code is compiled with GCC 12.2.0 using the highest optimization setting (i.e., flags -O3 -march=native -DNDEBUG -flto). A more detailed experimental evaluation can be found in [2].

Building the code

To just clone the source code, use the following.

git clone git@github.com:MatteoCeregini/quad-wavelet-tree.git
cd quad-wavelet-tree

To compile

mkdir build
cd build
cmake ..
make -j

If you want to run the test for quad vectors, please continue with the following commands.

./test_qvector

If you want to run the test for quad wavelet trees, please continue with the following commands.

./create_dataset PATH_TO_TEXT PATH_TO_DATASET
./test_qwm PATH_TO_DATASET

create_dataset reads a text file, maps it to a sequence of integers, and serializes it to disk:

PATH_TO_TEXT is the path to the text you want to use in the test (something like /texts/english.txt).
PATH_TO_DATASET is the path to the serialized sequence created by create_dataset (for example /texts/english.bin).

The resulting file is then used by create_dataset.

Bibliography

Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In SODA, pages 841–850. ACM/SIAM, 2003.
Matteo Ceregini, Florian Kurpicz, Rossano Venturini. Faster Wavelet Trees with Quad Vectors. Arxiv, 2023.

Please cite the following paper if you use this code.

@misc{QWT,
  author = {Matteo Ceregini, Florian Kurpicz, Rossano Venturini},
  title = {Faster Wavelet Trees with Quad Vectors},
  publisher = {arXiv},
  year = {2023},
  doi = {10.48550/ARXIV.2302.09239},
  url = {http://arxiv.org/abs/2302.09239}
}

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
include		include
tests		tests
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

C++ Quad Wavelet Tree

Benchmarks

Building the code

Bibliography

About

Releases

Packages

Contributors 2

Languages

License

MatteoCeregini/quad-wavelet-tree

Folders and files

Latest commit

History

Repository files navigation

C++ Quad Wavelet Tree

Benchmarks

Building the code

Bibliography

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages