LLaMaPUn is Language and Mathematics Processing and Understanding


The LLaMaPUn library will consist of a wide range of processing tools for natural language and mathematics.


New: Work has begun on adopting third-party tools (such as the SENNA NLP toolkit) and adapting them to mathematical documents. Accordingly, the build target now focuses on the C programming language, migrating away from Perl. Given the portability of C, we expect to eventually offer high-level wrappers for a variety of scripting languages.

Please remember that all third-party tools enforce their own licensing constraints.

Disclaimer: This GitHub repository is undergoing a gradual migration from the original Subversion repository. The migration consists of reorganizing the libraries and preparing a CPAN-near bundle that includes a testbed and detailed documentation. This process also brings a namespace change to the now properly spelled LLaMaPUn.

Several upcoming deployments of the CorTeX framework have motivated the move to GitHub, and they set the direction for a number of fixes and features to be added to the library.

High-level Overview

  • Preprocessing

    • Unicode normalization,
    • Stopwords - based on widely accepted lists, enhanced for STEM texts,
    • Semi-structured to plain text normalization (math, citations, tables, etc.),
    • Purification of text and math modality (e.g. moving trailing periods left inside math back into the sentence text),
    • Stemming - adaptation of the Morpha stemmer,
    • Tokenization - rule-based sentence segmentation and SENNA word tokenization.
  • Shallow Analysis

    • Language identification (via libTextCat),
    • N-gram footprints,
    • Part-of-speech tagging (via SENNA),
    • Named entity recognition (via SENNA),
    • Chunking and shallow parsing (via SENNA),
    • [TODO] "Definition" paragraph discrimination task (training SVM classifiers, based on TF/IDF and n-gram bag-of-words features, via libsvm),
    • [TODO] "Declaration" sentence discrimination task (training CRF models via CRFsuite).
  • Representation Toolkit

    • Document Narrative Model (DNM) addition to the XML DOM
    • XPointer and string offset annotation support
    • Integration with the CorTeX processing framework
    • [TOPORT] Shared Packed parse forests for mathematical formulas (aka "disjunctive logical forms")
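
As an illustration of two of the preprocessing steps listed above (Unicode normalization and moving trailing punctuation out of math), here is a minimal sketch. It is written in Python for brevity and is not LLaMaPUn's C API: the function names `normalize_unicode` and `purify_math`, the choice of the NFC normalization form, and the punctuation set are all assumptions made for this example.

```python
import unicodedata
from typing import Tuple

# Assumption: these characters count as sentence punctuation when they
# trail a formula. The actual library may use a different set.
TRAILING_PUNCT = ".,;:"


def normalize_unicode(text: str) -> str:
    """Return the text in NFC form (the normalization form is assumed here)."""
    return unicodedata.normalize("NFC", text)


def purify_math(math: str) -> Tuple[str, str]:
    """Split trailing sentence punctuation off a math string.

    Returns (clean_math, punctuation) so the caller can append the
    punctuation to the surrounding sentence text instead.
    """
    stripped = math.rstrip()
    core = stripped.rstrip(TRAILING_PUNCT)
    return core.rstrip(), stripped[len(core):]
```

For example, `purify_math("x^2 + y^2 = z^2.")` separates the formula from the final period, which can then be reattached to the sentence that contains the formula.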

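The n-gram footprints mentioned under Shallow Analysis can be pictured as frequency counts over token windows. The sketch below assumes word-level n-grams over an already tokenized document; the function name `ngram_footprint` is hypothetical and the library's actual representation may differ.

```python
from collections import Counter
from typing import List


def ngram_footprint(tokens: List[str], n: int = 2) -> Counter:
    """Count every contiguous window of n tokens (assumed word-level n-grams)."""
    # range() is empty when the document has fewer than n tokens,
    # so short inputs simply yield an empty footprint.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

Counts like these could serve as the n-gram bag-of-words features that the planned classification tasks refer to.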
See also


Feel free to send any feedback to the project maintainer at

