LLaMaPUn: Language and Mathematics Processing and Understanding

The LLaMaPUn library will consist of a wide range of processing tools for natural language and mathematics.

New: Work has begun on adopting third-party tools (such as the SENNA NLP toolkit) and adapting them to mathematical documents. As a result, the build now targets the C programming language, migrating away from Perl. Given the portability of C, we expect to eventually offer high-level wrappers for a variety of scripting languages.

Please remember that all third-party tools enforce their own licensing constraints.

Disclaimer: This GitHub repository is currently undergoing gradual migration from the original Subversion repository. The migration consists of reorganizing the libraries and preparing a CPAN-near bundle that includes a testbed and detailed documentation. This process also brings a namespace change to the now properly spelled LLaMaPUn.

Several upcoming deployments of the CorTeX framework have motivated the move to GitHub and provide an outlook for a number of fixes and features to be added to the library.

High-level Overview

  • Preprocessing

    • Unicode normalization,
    • Stopwords - based on widely accepted lists, enhanced for STEM texts,
    • Semi-structured to plain text normalization (math, citations, tables, etc.),
    • Purification of text and math modality (e.g. move trailing dots left in math back into the sentence text),
    • Stemming - adaptation of the Morpha stemmer,
    • Tokenization - rule-based sentence segmentation, and SENNA word tokenization
  • Shallow Analysis

    • Language identification (via libTextCat),
    • N-gram footprints,
    • Part-of-speech tagging (via SENNA),
    • Named Entity recognition (via SENNA),
    • Chunking and shallow parsing (via SENNA),
    • [TODO] "Definition" paragraph discrimination task (training SVM classifiers, based on TF-IDF and N-gram BoW features, via libsvm)
    • [TODO] "Declaration" sentence discrimination task (training CRF models via CRFsuite).
  • Representation Toolkit

    • Document Narrative Model (DNM) addition to the XML DOM
    • XPointer and string offset annotation support
    • Integration with the CorTeX processing framework
    • [TOPORT] Shared Packed parse forests for mathematical formulas (aka "disjunctive logical forms")
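
As an illustration of the "purification" step listed under Preprocessing, the following is a minimal sketch in C of moving trailing sentence punctuation out of a math fragment so it can be handed back to the surrounding sentence text. The function name `move_trailing_punct`, its signature, and the punctuation set are assumptions made for this example; they are not part of the LLaMaPUn API, and the library's actual implementation may work differently (for instance, directly on the XML DOM rather than on plain strings).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns nonzero for characters treated as
 * sentence punctuation. The set is an assumption for this sketch. */
static int is_sentence_punct(char c) {
  /* strchr would also match the terminator, so exclude '\0' explicitly. */
  return c != '\0' && strchr(".,;:!?", c) != NULL;
}

/* Hypothetical helper, not a LLaMaPUn function.
 * Truncates `math` in place before any trailing run of spaces and
 * sentence punctuation, and copies the punctuation characters (in
 * their original order, without the spaces) into `punct`.
 * Returns the number of punctuation characters copied.
 * Characters that do not fit into `punct` are dropped. */
size_t move_trailing_punct(char *math, char *punct, size_t punct_size) {
  assert(math != NULL && punct != NULL && punct_size > 0);

  size_t end = strlen(math);
  size_t start = end;

  /* Walk left over the trailing run of spaces and punctuation. */
  while (start > 0 &&
         (is_sentence_punct(math[start - 1]) || math[start - 1] == ' ')) {
    start--;
  }

  /* Copy the punctuation out, skipping spaces. */
  size_t moved = 0;
  for (size_t i = start; i < end; i++) {
    if (math[i] != ' ' && moved + 1 < punct_size) {
      punct[moved++] = math[i];
    }
  }
  punct[moved] = '\0';

  /* Cut the math fragment where the trailing run began. */
  math[start] = '\0';
  return moved;
}
```

Called on a mutable buffer holding `E = mc^2.`, this sketch leaves `E = mc^2` in the math buffer and places `.` in `punct`, which a caller could then prepend to the text that follows the formula. A real implementation would also need to decide how to treat cases such as a trailing ellipsis, which this sketch moves out of the math like any other punctuation.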

See also

Contact

Feel free to send any feedback to the project maintainer at d.ginev@jacobs-university.de


