The llamapun library hosts common language and mathematics processing algorithms used by the KWARC research group.
At its core, llamapun is a Rust implementation that aims for a minimal footprint and optimal runtime, in order to safely scale to corpora of millions of documents and tens of billions of tokens.
- Built-in support for STEM documents in (LaTeXML-flavoured) HTML5
- Unicode normalization
- Stopwords, based on widely accepted lists, enhanced for STEM texts
- Semi-structured to plain-text normalization (math, citations, tables, etc.)
- Purification of text and math modality (e.g. moving trailing dots left in math back into the sentence text)
- Stemming, an adaptation of the Morpha stemmer
- Tokenization: rule-based sentence segmentation, and SENNA word tokenization
- Part-of-speech tagging (via SENNA)
- Named-entity recognition (via SENNA)
- Chunking and shallow parsing (via SENNA)
- GloVe word embeddings (Rust reimplementation)
- [TODO] Language identification (via libTextCat)
- [TODO] N-gram footprints
- [TODO] "Definition" paragraph discrimination task (training SVM classifiers on TF-IDF and N-gram bag-of-words features, via libsvm)
- [TODO] "Declaration" sentence discrimination task (training CRF models via CRFsuite)
- Document Narrative Model (DNM) addition to the XML DOM
- [TODO] XPointer and string-offset annotation support
- [TODO] Integration with the CorTeX processing framework
- [TOPORT] Shared packed parse forests for mathematical formulas (aka "disjunctive logical forms")
- High-level iterators over the narrative elements of scientific documents
- Zero-cost abstractions over the source data, as well as over linguistic annotations of various granularity
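To make the tokenization and stopword bullets above concrete, here is a minimal, self-contained Rust sketch of the underlying techniques. It is an illustration only, not llamapun's actual API: the function names and the segmentation rule (a sentence ends at `.`, `!` or `?` followed by whitespace and an uppercase letter) are simplifying assumptions made for this example.

```rust
/// Illustrative rule-based sentence segmentation (NOT llamapun's API):
/// split at '.', '!' or '?' when followed by whitespace and an uppercase letter.
fn segment_sentences(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut sentences = Vec::new();
    let (mut start, mut i) = (0, 0);
    while i < chars.len() {
        if matches!(chars[i], '.' | '!' | '?') {
            // Look ahead: require at least one whitespace char, then uppercase.
            let mut j = i + 1;
            while j < chars.len() && chars[j].is_whitespace() {
                j += 1;
            }
            if j < chars.len() && j > i + 1 && chars[j].is_uppercase() {
                let sentence: String = chars[start..=i].iter().collect();
                sentences.push(sentence.trim().to_string());
                start = j;
                i = j;
                continue;
            }
        }
        i += 1;
    }
    // Flush any trailing material as a final sentence.
    let tail: String = chars[start..].iter().collect::<String>().trim().to_string();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}

/// Illustrative stopword filtering: drop common function words before
/// downstream tasks such as n-gram counting or classification.
fn remove_stopwords<'a>(words: &[&'a str], stopwords: &[&str]) -> Vec<&'a str> {
    words
        .iter()
        .copied()
        .filter(|w| !stopwords.contains(&w.to_lowercase().as_str()))
        .collect()
}

fn main() {
    let text = "We define a group G. It is abelian! Proof follows.";
    println!("{:?}", segment_sentences(text));
    println!("{:?}", remove_stopwords(&["the", "proof", "is", "short"], &["the", "is"]));
}
```

Real segmenters additionally handle abbreviations, initials, and math islands; the look-ahead for whitespace above is the same kind of guard that keeps "U.S." from being split mid-abbreviation.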
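The "semi-structured to plain text" bullet can likewise be sketched in a few lines: math islands in the HTML are replaced by a placeholder word so that sentence-level NLP tools see well-formed running text. The function name, the naive string scanning, and the `MathFormula` placeholder are assumptions made for this sketch, not a description of llamapun's implementation.

```rust
/// Illustrative plain-text normalization (NOT llamapun's API): replace each
/// <math>...</math> island with a placeholder word. The placeholder name
/// "MathFormula" is an assumption for this example.
fn normalize_math(html_fragment: &str) -> String {
    const PLACEHOLDER: &str = "MathFormula";
    let mut out = String::new();
    let mut rest = html_fragment;
    while let Some(open) = rest.find("<math") {
        out.push_str(&rest[..open]);
        match rest[open..].find("</math>") {
            Some(close_rel) => {
                out.push_str(PLACEHOLDER);
                rest = &rest[open + close_rel + "</math>".len()..];
            }
            None => {
                // Unbalanced markup: drop the dangling tail.
                rest = "";
                break;
            }
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    let s = "Let <math>x^2+1</math> be irreducible over <math>\\mathbb{Q}</math>.";
    println!("{}", normalize_math(s));
}
```

A production version would operate on the parsed DOM rather than raw strings, which is exactly what the DNM layer listed above provides.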