common language and mathematics processing algorithms, in Rust


The llamapun library hosts common language and mathematics processing algorithms, used by the KWARC research group.



At its core, llamapun is a Rust implementation that aims for a minimal footprint and optimal runtime, in order to scale safely to corpora of millions of documents and tens of billions of tokens.

Features

  • Source Data

  • Preprocessing

    • Unicode normalization,
    • Stopwords - based on widely accepted lists, enhanced for STEM texts,
    • Semi-structured to plain text normalization (math, citations, tables, etc.),
    • Purification of text and math modality (e.g. move trailing dots left in math back into the sentence text),
    • Stemming - adaptation of the Morpha stemmer,
    • Tokenization - rule-based sentence segmentation, and SENNA word tokenization
  • Shallow Analysis

    • Part-of-speech tagging (via SENNA),
    • Named Entity recognition (via SENNA),
    • Chunking and shallow parsing (via SENNA),
    • GloVe (Rust reimplementation)
    • [TODO] Language identification (via libTextCat),
    • [TODO] N-gram footprints,
    • [TODO] "Definition" paragraph discrimination task (training SVM classifiers, based on TF/IDF and Ngram BoW features, via libsvm)
    • [TODO] "Declaration" sentence discrimination task (training CRF models via CRFsuite).
  • Representation Toolkit

    • Document Narrative Model (DNM) addition to the XML DOM
    • [TODO] XPointer and string offset annotation support
    • [TODO] Integration with the CorTeX processing framework
    • [TOPORT] Shared Packed parse forests for mathematical formulas (aka "disjunctive logical forms")
  • Programming API

    • High-level iterators over the narrative elements of scientific documents
    • Zero-cost abstractions over the source data, as well as over linguistic annotations of various granularity.
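
To make the preprocessing features above more concrete, here is a minimal, standard-library-only sketch of stopword filtering combined with a bag-of-tokens count. It does not use llamapun's actual API: the function names `tokenize` and `bag_of_tokens`, the tiny stopword list, and the split-on-non-alphanumeric tokenizer are all assumptions made for illustration, standing in for the library's SENNA-based tokenization and its STEM-enhanced stopword lists.

```rust
use std::collections::{HashMap, HashSet};

/// Splits text into lowercase word tokens on any non-alphanumeric character.
/// Hypothetical helper: a crude stand-in for a real word tokenizer.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Counts how often each token occurs, skipping any token in `stopwords`.
/// Hypothetical helper, not part of the llamapun API.
fn bag_of_tokens(text: &str, stopwords: &HashSet<&str>) -> HashMap<String, usize> {
    let mut bag = HashMap::new();
    for token in tokenize(text) {
        if !stopwords.contains(token.as_str()) {
            *bag.entry(token).or_insert(0) += 1;
        }
    }
    bag
}

fn main() {
    // Illustrative stopword list only; real lists are far larger.
    let stopwords: HashSet<&str> = ["the", "a", "of", "is"].iter().cloned().collect();
    let bag = bag_of_tokens("The proof of the lemma is a proof by induction.", &stopwords);

    // Sort entries so the printed order is deterministic.
    let mut entries: Vec<_> = bag.iter().collect();
    entries.sort();
    for (token, count) in entries {
        println!("{}\t{}", token, count);
    }
}
```

A production pipeline would apply Unicode normalization and stemming before counting, and would stream over documents rather than hold a whole corpus in memory; the sketch only shows the shape of the counting step.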

Disclaimers:

  1. Please remember that all third-party tools (such as the SENNA NLP toolkit) enforce their own licensing constraints.

  2. This GitHub repository is the successor to the now-deprecated C+Perl LLaMaPUn implementation.
