This repository has been archived by the owner on Aug 2, 2019. It is now read-only.
LLaMaPUn is Language and Mathematics Processing and Understanding

KWARC/deprecated-LLaMaPUn

The LLaMaPUn library will consist of a wide range of processing tools for natural language and mathematics.

New: Efforts have started to adopt third-party tools (such as the SENNA NLP toolkit) and adapt them to mathematical documents. Accordingly, the build target has been refocused on the C programming language, migrating away from Perl. Given the portability of C, we expect to eventually offer high-level wrappers for a variety of scripting languages.

Please remember that all third-party tools enforce their own licensing constraints.

Disclaimer: This GitHub repository is currently undergoing gradual migration from the original Subversion repository. The migration consists of reorganizing the libraries and preparing a CPAN-near bundle that includes a testbed and detailed documentation. This process also brings a namespace change to the now properly spelled LLaMaPUn.

Several upcoming deployments of the CorTeX framework have motivated the move to GitHub and provide an outlook for a number of fixes and features to be added to the library.

High-level Overview

  • Preprocessing

    • Unicode normalization,
    • Stopwords - based on widely accepted lists, enhanced for STEM texts,
    • Semi-structured to plain text normalization (math, citations, tables, etc.),
    • Purification of text and math modality (e.g. move trailing dots left in math back into the sentence text),
    • Stemming - adaptation of the Morpha stemmer,
    • Tokenization - rule-based sentence segmentation, and SENNA word tokenization.
  • Shallow Analysis

    • Language identification (via libTextCat),
    • N-gram footprints,
    • Part-of-speech tagging (via SENNA),
    • Named Entity recognition (via SENNA),
    • Chunking and shallow parsing (via SENNA),
    • [TODO] "Definition" paragraph discrimination task (training SVM classifiers, based on TF/IDF and N-gram BoW features, via libsvm),
    • [TODO] "Declaration" sentence discrimination task (training CRF models via CRFsuite).
  • Representation Toolkit

    • Document Narrative Model (DNM) addition to the XML DOM
    • XPointer and string offset annotation support
    • Integration with the CorTeX processing framework
    • [TOPORT] Shared packed parse forests for mathematical formulas (aka "disjunctive logical forms")
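
To make the "purification of text and math modality" step from the overview more concrete, here is a minimal Python sketch of the idea: sentence punctuation that was typeset inside a formula is moved back into the surrounding text. It is an illustration only, not LLaMaPUn's implementation or API. The `(modality, content)` segment representation, the `purify` function, and the choice of punctuation characters are all assumptions made for this example.

```python
import re

# Hypothetical representation (an assumption for illustration, not LLaMaPUn's
# data model): a document is a list of (modality, content) pairs, where
# modality is either "text" or "math".

# Trailing sentence punctuation at the end of a formula. "!" and "?" are left
# out on purpose, since "!" is commonly a factorial inside math.
TRAILING_PUNCT = re.compile(r"^(.*?)\s*([.,;:]+)$", re.DOTALL)


def purify(segments):
    """Move punctuation that trails a math segment into the text modality."""
    result = []
    carried = ""  # punctuation waiting to be prepended to the next text segment
    for modality, content in segments:
        if modality == "math":
            if carried:
                # Two formulas in a row: emit the pending punctuation as text.
                result.append(("text", carried))
                carried = ""
            match = TRAILING_PUNCT.match(content)
            if match and match.group(1):
                content, carried = match.group(1), match.group(2)
            result.append(("math", content))
        else:
            result.append(("text", carried + content))
            carried = ""
    if carried:
        # The document ended on a formula: keep the punctuation as text.
        result.append(("text", carried))
    return result


if __name__ == "__main__":
    doc = [
        ("text", "By Pythagoras, "),
        ("math", "a^2 + b^2 = c^2."),
        ("text", " Hence the claim"),
    ]
    for segment in purify(doc):
        print(segment)
```

A known limitation of this sketch is that it cannot tell a sentence-ending dot from a trailing ellipsis written as plain dots inside a formula; a real implementation would need to inspect the math markup rather than a flat string.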

See also

Contact

Feel free to send any feedback to the project maintainer at d.ginev@jacobs-university.de
