The llamapun library hosts common language and mathematics processing algorithms, used by the KWARC research group.
At its core, llamapun is a Rust implementation that aims for a minimal footprint and optimal runtime, in order to safely scale to corpora of millions of documents and tens of billions of tokens.
- Built-in support for STEM documents in (LaTeXML-flavoured) HTML5.
- Unicode normalization
- Stopwords - based on widely accepted lists, enhanced for STEM texts
- Semi-structured to plain text normalization (math, citations, tables, etc.)
- [TODO #3] Purification of text and math modality (e.g. move trailing dots left in math back into the sentence text)
- Stemming - adaptation of the Morpha stemmer
- Tokenization - rule-based sentence segmentation, and SENNA word tokenization
- Document Narrative Model (DNM) addition to the XML DOM
- XPointer and string offset annotation support
- [TOPORT] Shared packed parse forests for mathematical formulas (aka "disjunctive logical forms")
- High-level iterators over the narrative elements of scientific documents
- Zero-cost abstractions over the source data, as well as over linguistic annotations of various granularity.
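
To give a feel for two of the features listed above (rule-based sentence segmentation and stopword filtering), here is a minimal standalone sketch using only the Rust standard library. It is not llamapun's API: the function names `split_sentences` and `content_words`, the boundary rule, and the tiny `STOPWORDS` list are all hypothetical and chosen purely for illustration. The library's own rules and lists are considerably richer.

```rust
use std::collections::HashSet;

/// Hypothetical, deliberately tiny stopword list (NOT the list shipped with llamapun).
const STOPWORDS: [&str; 8] = ["a", "an", "and", "is", "of", "the", "to", "we"];

/// Naive rule-based sentence segmentation (illustrative assumption, not llamapun's rules):
/// a sentence ends at '.', '!' or '?' when that character is followed by
/// whitespace or by the end of the input. A dot inside "3.5" is therefore kept.
pub fn split_sentences(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if matches!(c, '.' | '!' | '?') {
            let at_boundary = match chars.peek() {
                Some(&(_, next)) => next.is_whitespace(),
                None => true,
            };
            if at_boundary {
                let end = i + c.len_utf8();
                let sentence = text[start..end].trim();
                if !sentence.is_empty() {
                    sentences.push(sentence);
                }
                start = end;
            }
        }
    }
    // Keep any trailing text that has no closing punctuation.
    let rest = text[start..].trim();
    if !rest.is_empty() {
        sentences.push(rest);
    }
    sentences
}

/// Lowercased alphanumeric word tokens of a sentence, with stopwords removed.
pub fn content_words(sentence: &str) -> Vec<String> {
    let stopwords: HashSet<&str> = STOPWORDS.iter().copied().collect();
    sentence
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .filter(|w| !stopwords.contains(w.as_str()))
        .collect()
}
```

Calling `split_sentences` on a paragraph and then `content_words` on each resulting sentence yields a simple bag of content words per sentence. Note that splitting on non-alphanumeric characters also breaks "3.5" into two word tokens, which is one reason a real tokenizer needs more careful rules.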
Additional included Examples
- math-aware corpus token models, via DNM plain text normalization
- math-aware AMS-labeled dataset generation
in the project directory.
It is recommended to use the latest nightly build of Rust (https://github.com/rust-lang-nursery/rustup.rs#working-with-nightly-rust), installed via rustup (www.rustup.rs), and to keep it updated (run 'rustup update' on a regular basis).
For problems with libxml, it helps to install its development headers (libxml2-dev is the package name on Debian-based Linux).