common language and mathematics processing algorithms, in Rust

The llamapun library hosts common language and mathematics processing algorithms, used by the KWARC research group.


At its core, llamapun is a Rust implementation that aims at minimal footprint and optimal runtime, in order to safely scale to corpora of millions of documents and tens of billions of tokens.

Features

  • Source Data

  • Preprocessing

    • Unicode normalization,
    • Stopwords - based on widely accepted lists, enhanced for STEM texts,
    • Semi-structured to plain text normalization (math, citations, tables, etc.),
    • [TODO #3] Purification of text and math modality (e.g. move trailing dots left in math back into the sentence text),
    • Stemming - adaptation of the Morpha stemmer,
    • Tokenization - rule-based sentence segmentation, and SENNA word tokenization
  • Shallow Analysis

    • Part-of-speech tagging (via SENNA),
    • Named Entity recognition (via SENNA),
    • Chunking and shallow parsing (via SENNA),
    • Extract token models for GloVe,
    • Pattern-matching library for rule-based extraction and/or bootstrapping,
    • [TODO] Language identification (via libTextCat),
    • N-gram footprints
  • Representation Toolkit

    • Document Narrative Model (DNM) addition to the XML DOM
    • XPointer and string offset annotation support
    • [TOPORT] Shared Packed parse forests for mathematical formulas (aka "disjunctive logical forms")
  • Programming API

    • High-level iterators over the narrative elements of scientific documents
    • Zero-cost abstractions over the source data, as well as over linguistic annotations of various granularity.
  • Additional included Examples

    • math-aware corpus token models, via DNM plain text normalization
    • math-aware AMS-labeled dataset generation
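
To make a few of the listed preprocessing and analysis steps concrete, here is a minimal, standalone sketch of stopword filtering followed by n-gram counting. It uses only the Rust standard library and does not call the llamapun API; the function names `tokenize`, `remove_stopwords` and `ngram_counts` are hypothetical, and the whitespace-and-punctuation split is a simple stand-in, not the SENNA tokenization mentioned above.

```rust
use std::collections::{HashMap, HashSet};

/// Lowercases the text and splits it on any non-alphanumeric character.
/// Hypothetical stand-in for a real tokenizer (this is not SENNA).
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Drops every token that appears in the given stopword set.
fn remove_stopwords(tokens: Vec<String>, stopwords: &HashSet<&str>) -> Vec<String> {
    tokens
        .into_iter()
        .filter(|t| !stopwords.contains(t.as_str()))
        .collect()
}

/// Counts contiguous n-grams; each n-gram is keyed by its tokens joined with spaces.
/// Returns an empty map for n == 0, since a zero-width window is undefined.
fn ngram_counts(tokens: &[String], n: usize) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    if n == 0 {
        return counts;
    }
    for window in tokens.windows(n) {
        *counts.entry(window.join(" ")).or_insert(0) += 1;
    }
    counts
}
```

Under these assumptions, calling `tokenize` on a sentence, passing the result through `remove_stopwords` with a small set such as `the`, `of`, `a`, and then calling `ngram_counts(&tokens, 2)` yields a map from each bigram to its frequency. A production pipeline would swap in a proper tokenizer and a curated stopword list.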

Getting started

Run

cargo test

in the project directory.

It is recommended to use the latest nightly build of Rust (https://github.com/rust-lang-nursery/rustup.rs#working-with-nightly-rust), installed via rustup (www.rustup.rs), and to keep it updated (run 'rustup update' on a regular basis).

For problems with libxml, it helps to install its development headers (libxml2-dev is the package name for a Debian-based Linux).


Disclaimers:

  1. Please remember that all third-party tools (such as the SENNA NLP toolkit) enforce their own licensing constraints.

  2. This GitHub repository is a successor to the now-deprecated C+Perl LLaMaPUn implementation.