The llamapun library hosts common language and mathematics processing algorithms, used by the KWARC research group.
At its core, llamapun is a Rust implementation that aims for a minimal footprint and optimal runtime, in order to safely scale to corpora of millions of documents and tens of billions of tokens.
- Built-in support for STEM documents in (LaTeXML-flavoured) HTML5.
- Unicode normalization,
- Stopwords - based on widely accepted lists, enhanced for STEM texts,
- Semi-structured to plain text normalization (math, citations, tables, etc.),
- [TODO] Purification of text and math modality (e.g. move trailing dots left in math back into the sentence text),
- Stemming - adaptation of the Morpha stemmer,
- Tokenization - rule-based sentence segmentation and SENNA word tokenization,
- Document Narrative Model (DNM) addition to the XML DOM
- [TODO] XPointer and string offset annotation support
- [TOPORT] Shared packed parse forests for mathematical formulas (aka "disjunctive logical forms")
- High-level iterators over the narrative elements of scientific documents
- Zero-cost abstractions over the source data, as well as over linguistic annotations of various granularity.
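
To give a feel for the preprocessing steps listed above (rule-based sentence segmentation, word tokenization, stopword filtering), here is a minimal sketch using only the Rust standard library. It is not llamapun's API: the function names `split_sentences` and `content_words`, the boundary rule, and the tiny stopword list used in the usage note are all assumptions made for illustration, and the library's actual segmenter and SENNA-style tokenizer are considerably more thorough.

```rust
use std::collections::HashSet;

/// Hypothetical helper: splits text into sentences at '.', '!' or '?'
/// when the mark ends the input, or is followed by whitespace and then
/// an uppercase letter. Returned slices borrow from `text` (no copying).
fn split_sentences(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    for (i, c) in text.char_indices() {
        if matches!(c, '.' | '!' | '?') {
            let end = i + c.len_utf8();
            let rest = &text[end..];
            let trimmed = rest.trim_start();
            // Whitespace directly after the punctuation mark?
            let has_space = trimmed.len() < rest.len();
            // Next visible character is uppercase (or there is none)?
            let next_upper = trimmed.chars().next().map_or(true, |n| n.is_uppercase());
            if rest.is_empty() || (has_space && next_upper) {
                let sentence = text[start..end].trim();
                if !sentence.is_empty() {
                    sentences.push(sentence);
                }
                start = end;
            }
        }
    }
    // Keep any trailing text that lacks closing punctuation.
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}

/// Hypothetical helper: lowercased alphanumeric words of a sentence,
/// with every word found in `stopwords` removed.
fn content_words(sentence: &str, stopwords: &HashSet<&str>) -> Vec<String> {
    sentence
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .filter(|w| !stopwords.contains(w.as_str()))
        .collect()
}
```

With the input `"We use e.g. the DNM. It scales well!"`, the boundary rule leaves the abbreviation `e.g.` intact and yields two sentences; filtering the second one against an assumed stopword set of `it`, `the`, `we` leaves `scales` and `well`. Returning borrowed slices from `split_sentences` mirrors, in a very small way, the idea of working over the source data without copying it.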
in the project directory.
In case of errors, it is recommended to switch to the nightly builds of Rust (https://github.com/rust-lang-nursery/rustup.rs#working-with-nightly-rust) using rustup (www.rustup.rs), and to keep the toolchain updated by running 'rustup update' on a regular basis.
For problems with libxml, it helps to install its development headers (libxml2-dev is the package name on Debian-based Linux distributions).