common language and mathematics processing algorithms, in Rust
The llamapun library hosts common language and mathematics processing algorithms, used by the KWARC research group.

At its core, llamapun is a Rust implementation that aims for a minimal footprint and optimal runtime, in order to safely scale to corpora of millions of documents and tens of billions of tokens.


  • Source Data

  • Preprocessing

    • Unicode normalization,
    • Stopwords - based on widely accepted lists, enhanced for STEM texts,
    • Semi-structured to plain text normalization (math, citations, tables, etc.),
    • [TODO #3] Purification of text and math modality (e.g. move trailing dots left in math back into the sentence text),
    • Stemming - adaptation of the Morpha stemmer,
    • Tokenization - rule-based sentence segmentation, and SENNA word tokenization
  • Shallow Analysis

    • Part-of-speech tagging (via SENNA),
    • Named Entity recognition (via SENNA),
    • Chunking and shallow parsing (via SENNA),
    • Extract token models for GloVe,
    • Pattern-matching library for rule-based extraction and/or bootstrapping,
    • [TODO] Language identification (via libTextCat),
    • N-gram footprints
  • Representation Toolkit

    • Document Narrative Model (DNM) addition to the XML DOM
    • XPointer and string offset annotation support
    • [TOPORT] Shared Packed parse forests for mathematical formulas (aka "disjunctive logical forms")
  • Programming API

    • High-level iterators over the narrative elements of scientific documents
    • Zero-cost abstractions over the source data, as well as over linguistic annotations of various granularity.
  • Additional included Examples

    • math-aware corpus token models, via DNM plain text normalization
    • math-aware AMS-labeled dataset generation
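
To make a few of the steps listed above concrete (tokenization, stopword removal, n-gram footprints), here is a minimal sketch in plain Rust using only the standard library. It does not use the llamapun API: the function names `tokenize`, `remove_stopwords` and `ngram_counts` are hypothetical, the tokenizer is a naive stand-in for the SENNA-based one, and the stopword list is supplied by the caller.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper: lowercase the text and split it into word tokens
/// on any non-alphanumeric character (a naive stand-in for a real tokenizer).
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Hypothetical helper: drop every token that appears in the stopword set.
fn remove_stopwords(tokens: Vec<String>, stopwords: &HashSet<&str>) -> Vec<String> {
    tokens
        .into_iter()
        .filter(|t| !stopwords.contains(t.as_str()))
        .collect()
}

/// Hypothetical helper: count all contiguous n-grams of the token sequence.
/// Each n-gram is keyed by its tokens joined with a single space.
fn ngram_counts(tokens: &[String], n: usize) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    if n == 0 {
        // `windows(0)` would panic, so an empty footprint is returned instead.
        return counts;
    }
    for window in tokens.windows(n) {
        *counts.entry(window.join(" ")).or_insert(0) += 1;
    }
    counts
}
```

A caller would build a `HashSet<&str>` of stopwords, pass the output of `tokenize` through `remove_stopwords`, and then call `ngram_counts` with the desired n. Keeping each step as a separate function mirrors the pipeline layout of the feature list, so any stage can be swapped for a more capable implementation.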

Getting started


To build the library and run its test suite, run

cargo test

in the project directory.

It is recommended to use the latest nightly build of Rust, e.g. via rustup, and to keep it updated (run 'rustup update' on a regular basis).

If you run into problems with libxml, it helps to install its development headers (the package is named libxml2-dev on Debian-based Linux distributions).


  1. Please remember that all third-party tools (such as the SENNA NLP toolkit) enforce their own licensing constraints.

  2. This GitHub repository is a successor to the now deprecated C+Perl LLaMaPUn implementation.