
Endangered Languages

Resources for conservation, development, and documentation of endangered, minority, and low or under-resourced human languages.

There is no centralized list of open-source code that would be useful for documenting, conserving, developing, preserving, or working with endangered languages. According to some estimates, half of the roughly 7,000 currently spoken languages are expected to become extinct this century (Wikipedia). However, academics, independent scholars, organizations, communities, and individuals are doing a great deal of work to stop or slow this trend. This list is intended to provide a central location to document those efforts.

Slack Group

We have a Slack group for live discussion. Join Us Here!

Publication

A white paper describing this repository was published at the LREC 2016 CCURL Workshop (Collaboration and Computing for Under-Resourced Languages). The paper is in this repository, in the paper folder. Download the raw paper here: Open Source Code Serving Endangered Languages.

Contribute

To edit this list on GitHub, simply click here. If you would like to discuss anything related to this list, please open an issue. If you know of a resource that is not on this list, please add it, either using the link above or by submitting a pull request.

There are more details on contributing in the CONTRIBUTING guide.

If you're interested in discussing the list in some offline capacity, get in touch with @RichardLitt. I'd be more than happy to have a phone call or email exchange.

Definitions

Endangered languages are human languages that are in danger of extinction. This list also encompasses minority languages, which are spoken by a stable but small population (for example, Maltese or Hawaiian), and low- or under-resourced languages, which are spoken by a significant population but under-represented on the web (for instance, Quechua). These languages share certain characteristics; the most pertinent is sparse data and a lack of resources, ranging from spell-checkers to grammars to machine translation corpora. Other under-resourced languages that do not fall under this list include constructed languages (for instance, Klingon or Na'vi), computer languages (for instance, JavaScript or Lua), and extinct languages whose records are so sparse as to be computationally irrelevant for most purposes (for instance, Tocharian).

Open Source "promotes a universal access via a free license to a product's design or blueprint, and universal redistribution of that design or blueprint, including subsequent improvements to it by anyone." (Wiki). This is important because money and resources allocated towards a language or project that is not open source are spent at the expense of possible extensibility elsewhere.

Regarding the name, Endangered Languages may not be the best term, as many low-resource languages are not necessarily endangered. But this term is the most accessible to the widest range of people. Low Resource Languages would also suit this list.

Looking for resources for code languages? Take a look at the awesome lists collection.

Generic Repositories

Massive Dictionary and Lexicography projects

  • ABVD - Austronesian Basic Vocabulary Database
  • CBOLD - Comparative Bantu OnLine Dictionary
  • IE - Indo-European comparative lexical resource
  • REFLEX - A comparative dictionary project for Africa based out of CNRS in France.
  • Southeast Asian lexicography - Several Southeast Asian lexicons hosted.
  • STEDT - Tibeto-Burman focused project where dictionaries from several languages are comparable.
  • Tibeto-Burman lexicography

Single language lexicography projects and utilities

Utilities

Interactions and presentations of data

Software

  • 4lang - Concept dictionary using Eilenberg machines.
  • accentuate.us - Also known as "charlifter". Statistical Unicodification of plain text for many languages.
  • alignment-with-openfst - This is an implementation of the CRF autoencoder framework for four tasks: bitext word alignment, part-of-speech tagging, code switching, and dependency parsing.
  • ANNIS - Search and Visualization in Multilayer Linguistic Corpora
  • Apertium - Apertium is a toolbox to build open-source shallow-transfer machine translation systems, especially suitable for related language pairs: it includes the engine, maintenance tools, and open linguistic data for several language pairs.
  • ark-tweet-nlp - CMU ARK Twitter Part-of-Speech Tagger (Fork)
  • ArtOfReading - Index and processing scripts related to the Art Of Reading illustration collection
  • bayesline - A Multinomial Bayesian Classification for Language Identification
  • bible-corpus-tools - A collection of tools for reading/processing the multilingual Bible corpus.
  • BloomDesktop - Bloom Desktop is a hybrid C#/JavaScript/HTML/CSS Windows application that dramatically "lowers the bar" for language communities who want books in their own languages. Bloom delivers a low-training, high-output system where mother tongue speakers and their advocates work together to foster both community authorship and access to external materia… http://bloomlibrary.org/
  • BloomLibrary - Bloom Library Single Page App, using AngularJS & Bootstrap, Parse.com backend. http://www.bloomlibrary.org
  • brain - Neural networks in JavaScript
  • Bristol Uni MT Morphology tools - This repo is a mirror of scripts available on http://www.cs.bris.ac.uk/Research/MachineLearning/Morphology/resources.jsp#corpus. Included: Ukwabelana - An open-source morphological Zulu corpus and EMMA: A Novel Evaluation Metric for Morphological Analysis.
  • brown-cluster - C++ implementation of the Brown word clustering algorithm.
  • CasualConc - CasualConc is a concordance program that runs natively on Mac OS X 10.5 Leopard or later. It was originally designed for casual use (preliminary analysis or non-research purposes), though [the maintainer] has been using it for his own research (and many others have). It can generate KWIC concordance lines, word clusters, collocation analysis, and word count.
  • cdec - Decoder, aligner, and model optimizer for statistical machine translation and other structured prediction models based on (mostly) context-free formalisms
  • charlint - Charlint is a character normalization/checking tool written in Perl. Among other things, it implements Normalization Form C of Unicode TR 15, as a test platform for Early Uniform Normalization in the W3C Character Model.
  • chorus - A version control system designed to enable workflows appropriate for typical language development teams who are geographically distributed.
  • clam - Computational Linguistics Application Mediator -- Quickly turn NLP applications into RESTful webservices with a web-application front-end. You provide a specification of your command line application, its input, output and parameters, and CLAM wraps around your application to form a fully fledged RESTful webservice.
  • CMU Sphinx - CMUSphinx is a speaker-independent large vocabulary continuous speech recognizer released under BSD style license. It is also a collection of open source tools and resources that allows researchers and developers to build speech recognition systems.
  • cnminlangwebcollect - Chinese minorities website languages detection and websites collection
  • Cog - Cog is a tool for comparing languages using lexicostatistics and comparative linguistics techniques. It can be used to automate much of the process of comparing word lists from different language varieties. http://sillsdev.github.io/cog/
  • convertextract - Convert Excel, Word and PowerPoint files with non-Unicode text (like text requiring SIL fonts) into Unicode, while preserving the original file's formatting.
  • CorpusTools - Phonological CorpusTools http://phonologicalcorpustools.github.io/CorpusTools/
  • CTK - Built around LDC's champollion sentence aligner kernel, Champollion Tool Kit (CTK) aims to provide ready-to-use parallel text sentence alignment tools for as many language pairs as possible. (Original project is on SourceForge: http://champollion.sourceforge.net)
  • CuPED - CuPED ('Customizable Presentation of ELAN Documents') is a tool for transforming time-aligned transcripts, such as those produced by ELAN, into a variety of presentation formats.
  • DataTags - A system to assess the sensitivity and privacy risk of a dataset, and assign a tag to describe how the dataset must be transferred, stored and accessed. (Fork)
  • dataverse - A data repository framework to share and publish research data.
  • dative - A single-page application that interacts with multiple linguistic fieldwork web service databases. Website.
  • DeepLearnToolbox - Matlab/Octave toolbox for deep learning. Includes Deep Belief Nets, Stacked Autoencoders, Convolutional Neural Nets, Convolutional Autoencoders and vanilla Neural Nets. Each method has examples to get you started.
  • Desmeme - Database and tools for exploring linguistic templates
  • dictdb - Dictionary database for language translation
  • discoursegraphs - Python-based tool to convert and merge multilayer annotated linguistic data
  • divvun-suggest - This program does FST lookup on forms specified as Constraint Grammar format readings, and looks up error-tags in an XML file with human-readable messages. It is meant to be used as a late stage of a grammar checker pipeline.
  • DLTK - Deutsch Language Tool Kit. More
  • ELDER: Endangered Language Data Electronic Repository - A web-based ontologically-compliant collaborative linguistic data cataloguing tool.
  • EMMA - A Novel Evaluation Metric for Morphological Analysis
  • enchant - enchant spellchecking library https://abiword.github.io/enchant
  • fast_align - Simple, fast unsupervised word aligner.
  • fastText - Library for fast text representation and classification.
  • FieldWorks - FieldWorks is a suite of software tools for language and cultural data, with support for complex scripts. http://software.sil.org/fieldworks/ FieldWorks Language Explorer (or FLEx, for short) is designed to help field linguists perform many common language documentation and analysis tasks. It can help you elicit and record lexical information, create dictionaries, interlinearize texts, analyze discourse features, and study morphology.
  • Franc - Natural language detection http://wooorm.com/franc/
  • FwDocumentation - Developer documentation for FieldWorks (software tools for language and cultural data, with support for complex scripts).
  • FwLocalizations - Localizations for FieldWorks
  • FwSupportTools - Additional tools for FieldWorks development
  • Gaia - Gaia is an HTML5-based Phone UI for the Boot 2 Gecko Project. NOTE: For details of what branches are used for what releases, see the wiki. If you're interested in setting up a keyboard in a new language, see this.
  • giellakbd-ios - An open source reimplementation of Apple's native iOS keyboard with a specific focus on support for localised keyboards.
  • giella-ime - A fork of LatinIME (by Google for Android), targeting marginalised languages that also deserve first-class status on mobile operating systems.
  • giza-pp - GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model. This package also contains the source for the mkcls tool which generates the word classes necessary for training some of the alignment models.
  • gv-crawl - Global Voices bitext crawler for creating parallel corpora.
  • Glottolog data - Glottolog provides comprehensive reference information for the world's languages.
  • Gramadóir - Grammar checking engine that is designed for the rapid development of grammar checkers for minority languages and other languages with limited computational resources.
  • grind - An InDesign 5.5 plug-in designed to allow Graphite-enabled smart fonts to be used in Adobe InDesign. This project integrates SIL's Graphite 2 smart font technology with our own implementation of a paragraph composer plugin.
  • hermitcrab - HermitCrab.NET is a flexible morphological/phonological parser that takes an item-and-process approach.
  • HFST - This package contains a bridging library for multiple FST libraries and toolkits and a set of tools for processing finite-state automata, especially for linguistic systems.
  • hfst-ospell - HFST spell checker library and command line tool.
  • hfst-ospell-js - Node bindings for hfst-ospell.
  • hfst-optimized-lookup - HFST optimized-lookup standalone library and command line tool.
  • hundict - Bilingual dictionary extractor from parallel corpora.
  • hunspell - Spell checker and morphological analyzer library and program designed for languages with rich morphology and complex word compounding or character encoding
  • huntag - A sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models.
  • icu-dotnet - C# wrapper for ICU4C
  • icu4c - Mirror of svn project at http://source.icu-project.org/repos/icu/icu/. The FieldWorks branch has some FieldWorks specific enhancements.
  • iLanguage - A semi-unsupervised language independent morphological analyzer useful for stemming unknown language text, or getting a rough estimate of possible parses for morphemes in a word. Input: a corpus. Uses compression, maximum entropy and field linguistics.
  • ipa-help - IPA Helps
  • itweets-geodata - Geodata from Indigenous Tweets
  • jQuery.ime - jQuery based input methods library
  • kbdgen - Generate keyboards and keyboard layouts for various operating systems.
  • koreksyon - Tools for developing and implementing spell-checking and grammar-checking capabilities in low-resource languages
  • l20n.js - L20n reinvents software localization. Users should be able to benefit from the entire expressive power of natural languages. L20n keeps simple things simple, and at the same time makes complex things possible. This is the JavaScript implementation of L20n. http://l20n.org
  • langid.py - Stand-alone language identification system.
  • langtech - A host of resources provided in SVN by the University of Tromsø. Details are here and in English here.
  • leebock/languages - Application files for the Smithsonian endangered languages story map.
  • LEGO Unified Concepticon - Material relating to the LEGO Unified Concepticon
  • Lex4All - Pronunciation LEXicons for Any Low-resource Language http://lex4all.github.io/lex4all/
  • lexdb - LexDB is a lexical cognate tracking database. It stores the full provenance of all lexemes and cognate judgements, and allows export into a number of nexus dialects. The database is written in the flexible Python/Django web framework.
  • LfMerge - Send/Receive for languageforge.org
  • liblevenshtein - A library for generating Finite State Transducers based on Levenshtein Automata.
  • libpalaso - Palaso Library: A set of .Net libraries useful for developers of Language Software.
  • LinGO Grammar Matrix - The LinGO Grammar Matrix is a framework for the development of broad-coverage, precision, implemented grammars for diverse languages.
  • Lingpy - LingPy: Python library for quantitative tasks in historical linguistics http://lingpy.org
  • Linguistica - Linguistica is a program designed to explore the unsupervised learning of natural language, with primary focus on morphology (word-structure). It runs under Windows, Mac OS X and Linux, and is written in C++ within the Qt development framework. Its demands on memory depend on the size of the corpus analyzed.
  • long-press - jQuery plugin to ease the writing of accented or rare characters. http://toki-woki.net/lab/long-press
  • low-resource-pos-tagging-2014 - Low-Resource POS-Tagging: 2014
  • lrl - For work concerning low resource languages.
  • MacVoikko - An OS X spelling server based on Voikko.
  • Machine - Machine is a natural language processing library for .NET that is focused on providing tools for processing resource-poor languages (used by FLEx)
  • Make-extensions - Scripts for generating hunspell spellchecking extensions
  • MARY TTS - MARY TTS -- an open-source, multilingual text-to-speech synthesis system written in pure Java http://mary.dfki.de
  • maxent - Maximum Entropy Modeling Toolkit for Python and C++ http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html
  • mgiza - A word alignment tool based on the famous GIZA++, extended to support multi-threading, resume training and incremental training.
  • Minority Translate - Minority Translate is a simple program for helping content generation on smaller sized Wikipedias (actually any sized) by giving pointers to existing articles in other language Wikipedias, so that the user can easily translate or adapt existing texts and thus increase the size and usability of their Wikipedia editions.
  • morfessor - Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
  • morpholm - Morphology-aware language models.
  • morph-test - A Python script to run tests for generation and analysis of a morphological transducer built using the Giella infrastructure. Works with Hfst, Xerox' fst tools, and with Foma.
  • mosesdecoder - Moses, the machine translation system
  • moz-l10n-tiers - Creates a pseudo-locale to evaluate string prioritization for l10n
  • mythes - MyThes is a simple thesaurus that uses a structured text data file and an index file with binary search to look up words and phrases and return information on part of speech, meanings, and synonyms.
  • myWorkSafe - Smart & Simple Backup for Language Development Workers http://myWorkSafe.palaso.org
  • Natural - JavaScript general natural language facilities for node
  • NIST 2008 Open Machine Translation Evaluation
  • NLTK - Python Natural Language Tool Kit. NLTK Source http://www.nltk.org/
  • node-panlex - node.js client for PanLex
  • norma - A tool for automatic spelling normalization
  • nplm - Fork of https://nlg.isi.edu/software/nplm/ with some efficiency tweaks and adaptation for use in mosesdecoder.
  • octothorpe - CouchDB-powered wiki thing
  • OdtXslt - Perform XSLT transform on contents of a package (such as ODT, Docx, etc.)
  • old-webapp - Online Linguistic Database --- software for creating web applications to collaboratively document languages. http://www.onlinelinguisticdatabase.org
  • old-pyramid - Online Linguistic Database migrated to the Pyramid framework.
  • OmegaT-hfst-tokenizer - OmegaT-hfst-tokenizer provides fst-based tokenisation in OmegaT.
  • OpenDataKit - Open Data Kit (ODK) is an open-source suite of tools that helps organizations author, field, and manage mobile data collection solutions
  • OpenNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. Website.
  • ops-devbox - Ansible playbook for a (linux) developer machine
  • panlex-tools - This package contains scripts to transform lexical resources into a format suitable for importing into PanLex. Documentation may be found at https://dev.panlex.org
  • paradigm - PARADIGM is a .Net (C#) implementation of Joseph E. Grimes' 1983 work entitled "Affix Positions and Cooccurrences: The PARADIGM Program".
  • pathway - Preparing language data for publication
  • pdfdroplet - Library and GUI for imposition of PDF pages (e.g. 2-up) http://pdfdroplet.palaso.org
  • pepper - Pepper is a pluggable, Java-based, open source converter framework for linguistic data.
  • phonology-assistant - Phonology Assistant is a discovery tool. Provided with a corpus of phonetic data, it automatically charts the sounds and, through its searching capabilities, helps a user discover and test the rules of sound in a language.
  • pressagio - Pressagio is a library that predicts text based on n-gram models. For example, you can send a string and the library will return the most likely word completions for the last token in the string.
  • PrimerPro - The purpose of PrimerPro is to assist the literacy worker in the development of primers for a given language.
  • pyDelphin - Python libraries for DELPH-IN (Friendly Fork)
  • RBGParser - Graph-based Dependency Parser.
  • Rosetta Pangloss - The Rosetta Project's Pangloss system
  • salm - SALM: Suffix Array and its Applications in Empirical Language Processing by Joy
  • Salt - A graph-based model to store and manipulate linguistic data.
  • saymore - A tool for common language documentation tasks, such as keeping all the resulting files and metadata organized, converting files to archive formats, and transcription.
  • Secwepemc-Facebook - Translate Facebook into unsupported languages
  • SegParser - Randomized Greedy algorithm for joint segmentation, POS tagging and dependency parsing
  • SeedLing - Building and Using A Seed Corpus for the Human Language Project
  • Skype in your language - Translate Skype into unsupported languages
  • solid - Solid is a software tool that can be used to check, clean up, and convert Standard Format (e.g. Toolbox) lexicon data.
  • SPHERE Conversion Tools - Many LDC corpora contain speech files in NIST SPHERE format. The programs below convert SPHERE files to other formats.
  • StandardFormatLib - Standard Format Library
  • Stanford CoreNLP - Stanford CoreNLP: A Java suite of core NLP tools. https://stanfordnlp.github.io/CoreNLP/
  • Stanford CoreNLP Python - Python wrapper for Stanford CoreNLP tools
  • stanza - Stanford NLP group's shared Python tools.
  • str2ipa - Pronunciation dictionaries for languages with close-to-phonetic writing systems
  • sugali - This is a legacy repository of the language identification for many (many) languages project, from the software project course on NLP projects for low-resource languages.
  • SuGarLike - Language Identification for Low Resource Languages (by Susanne, Guy and Liling)
  • tasty-imitation-keyboard - A custom keyboard for iOS8+ that serves as a tasty imitation of the default Apple keyboard. Built using Swift and the latest Apple technologies!
  • teny - Tools for low-resource machine translation.
  • TeraDict - Translate English words into hundreds of languages!
  • Tesseract.js - Pure JavaScript OCR for 62 Languages 📖🎉🖥 http://tesseract.projectnaptha.com/
  • TexNLP - TexNLP: Texas Natural Language Processing tools
  • TiMBL - TiMBL is an open source software package implementing several memory-based learning algorithms, among which IB1-IG, an implementation of k-nearest neighbor classification with feature weighting suitable for symbolic feature spaces, and IGTree, a decision-tree approximation of IB1-IG. All implemented algorithms have in common that they store some representation of the training set explicitly in memory. During testing, new cases are classified by extrapolation from the most similar stored cases.
  • Toney - Tone Classification Software
  • Toolbox Scripts for ELAN - Mirror of Alexander Koenig's Toolbox Scripts https://tla.mpi.nl/tools/tla-tools/elan/thirdparty/
  • ToolsForFieldLinguistics - A collection of scripts and recipes for linguistics
  • translitit-engine - A transliteration engine written in JavaScript
  • Tsammalex data - Tsammalex is a multilingual lexical database on plants and animals.
  • tweet2learn - An app to make it easier to use your native language on Twitter
  • twitter_langid - A hierarchical character-word neural network for language identification
  • UniversalDependencies docs - Universal Dependencies online documentation http://universaldependencies.org/docs/
  • UniversalDependencies tools - Various utilities for processing the data.
  • VocBench - VocBench is a web-based, multilingual, editing and workflow tool that manages thesauri, authority lists and glossaries using SKOS-XL.
  • wavesurfer.js - Navigable waveform built on Web Audio and Canvas https://wavesurfer-js.org/ (Also has an ELAN plugin)
  • web-scriptureforge - Platform for Scripture-related web apps
  • webcorpus - This project is a collection of scripts and programs for creating a webcorpus from crawled data.
  • wikt2dict - Wiktionary parser tool for many language editions.
  • Word Generator - WordGenerator generates hypothetical words from specifications of their syllable structure.
  • WordBoundary - An experiment in the detection and segmentation of word boundaries
  • wordbyword - WordByWord is a free, open source, easy-to-use multimedia vocabulary trainer developed by Vera Ferreira, Peter Bouda, and Ricardo Filipe at CIDLeS with the support of the Foundation for Endangered Languages.
  • WSI4URLang - Word Sense Induction (WSI) for Under-resourced Languages (URLang)
  • XDXF_Makedict - XDXF dictionary format and "makedict" dictionary converting software (official repository)

Annotation

  • AGTK - AGTK is a suite of software components for building tools for annotating linguistic signals, time-series data which documents any kind of linguistic behavior (e.g. audio, video). The internal data structures are based on annotation graphs. (Original project is on SourceForge: https://sourceforge.net/projects/agtk/)
  • brendano - Graph Fragment Language for Easy Syntactic Annotation https://www.cs.cmu.edu/~ark/FUDG/
  • ELAN - ELAN is a professional tool for the creation of complex annotations on video and audio resources.
  • eopas - ETHNOER Online Presentation and Annotation System
  • FLAT - FoLia Linguistic Annotation Tool - FLAT is a web-based linguistic annotation environment based around the FoLiA format (http://proycon.github.io/folia/), a rich XML-based format for linguistic annotation. FLAT allows users to view annotated FoLiA documents and enrich these documents with new annotations; a wide variety of linguistic annotation types is supported through the FoLiA paradigm. It is a document-centric tool that fully preserves and visualises document structure.
  • gfl_syntax - Graph Fragment Language for Easy Syntactic Annotation http://www.ark.cs.cmu.edu/FUDG
  • graf-python - The library graf-python is an open source Python implementation to parse and write GrAF/XML files as described in ISO 24612. The parser of the library creates an annotation graph from the files. The user may then query the annotation graph via the API of graf-python.
  • LDC Word Aligner - LDC Word Aligner is a software tool used for manual annotation of word alignment developed to support Arabic-English and Chinese-English word alignment tasks. It has a clean, easy-to-use interface. Since its development in 2009, LDC has used LDC Word Aligner to generate over 1,000,000 tokens of annotated word alignment data from a variety of genres including broadcast, newswire and web-based sources.
  • poio-analyzer - Poio is a collection of software tools for linguists working in language documentation, descriptive linguistics and/or language typology. It allows linguists to manage and analyze their data. The Poio Interlinear Editor allows users to add morpho-syntactic annotations to transcriptions. It supports various file formats for input, but will only output standardized XML defined by the Corpus Encoding Standard and the Text Encoding Initiative. Several tools for analyzing linguistic data will be made available to further process annotated data. Poio tools are written in Python and are based on PyQt.
  • poio-api - Poio API is a free and open source Python library to access and search data from language documentation in your linguistic analysis workflow. It converts file formats like Elan’s EAF, Toolbox files, Typecraft XML and others into annotation graphs as defined in ISO 24612. Those graphs, for which we use an implementation called “Graph Annotation F… http://www.poio.eu/
  • poio-doc - Documentation of the Poio project. http://www.poio.eu
  • pyannotation - PyAnnotation is a Python Library to access and manipulate linguistically annotated corpus files.
  • XTrans - XTrans is a next-generation multi-platform, multilingual, multi-channel transcription tool that supports manual transcription and annotation of audio recordings. The XTrans toolkit provides new and efficient solutions to common transcription challenges and addresses critical gaps in existing tools. Designed with input from experienced human transcribers working with real world data, XTrans provides a flexible and intuitive graphical user interface for a multitude of speech annotation tasks including (virtual) segmentation of audio into smaller units like turns and sentences; speaker identification; orthographic transcription in any language; and labeling of structural elements of the transcript like topics.

Format Specifications

  • dlx-spec - The official specification for the DLx linguistic data format. http://developer.digitallinguistics.io/spec/
  • FoLiA - FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. http://proycon.github.io/folia/
  • xdxf_makedict - XDXF dictionary format and "makedict" dictionary converting software (official repository).

i18n-related Repositories

  • Express-Lingua - An i18n middleware for the Express.js framework.
  • Polyglot.js - Give your JavaScript the ability to speak many languages.
  • Transifex - System providing a user-friendly, project-oriented approach to translating .po files. Great for non-technical users, free for open-source projects, decent for minority languages; however, it can take a while to get a new language added to the Transifex system, because tickets sometimes get lost in the ticketing system Transifex uses. Provides translation memory, ability to appoint reviewers, etc. Transifex used to have an open source system that you could host on your own, but that seems to have disappeared.

Audio automation

Text automation

  • clld GitHub stars Cross Linguistic Linked Data python library
  • LaTeX2HTML5 GitHub stars LaTeX web components
  • MultilingualCorporaExtractor GitHub stars Node io Spider for extracting multilingual corpora (Fork of a student project)
  • SeedLing GitHub stars Building and Using A Seed Corpus for the Human Language Project (Fork of a student project)

Experimentation

  • experigen GitHub stars A framework for creating linguistic experiments
  • GamifyPsycholinguisticsExperiments GitHub stars A simple Node server to gamify linguistics experiments. It runs offline on a laptop for small-scale experiments and online on a server for large-scale experiments. Data is sent to a Google spreadsheet. (Fork of a dormant project)
  • OpenSesame GitHub stars Graphical experiment builder for the social sciences
  • OPrime GitHub stars Open Source Experimentation Libraries - Online and Offline for Android and HTML5
  • psychopyMegProsody GitHub stars Runs MegProsody using PsychoPy.
  • PsychScript GitHub stars An HTML5/JavaScript library for running behavioural experiments online.

Flashcards

Natural language generation

  • hailo GitHub stars A conversation bot using Markov chains
  • ngram-natural-language-generator GitHub stars Takes in a text file and generates random sentences that sound like they could have been in the file
  • OpenCCG GitHub stars OpenCCG library for parsing and realization with CCG. Includes mini-grammars for Inuit, Nez Perce, Basque and others.
  • SimpleNLG GitHub stars SimpleNLG is a simple Java API designed to facilitate the generation of natural language. It was originally developed at the University of Aberdeen's Department of Computing Science. It currently targets English, but forks exist for French and German.
  • See more at Downloadable NLG systems at the ACL Wiki. Of particular interest there might be the List of resources by language at the wiki.
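As a rough illustration of the Markov-chain approach that several of the entries above mention, here is a minimal Python sketch of a bigram text generator. It is not taken from any of the listed projects; the function names (`build_bigram_model`, `generate`) and the toy corpus are hypothetical, chosen only for this example.

```python
import random
from collections import defaultdict


def build_bigram_model(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    model = defaultdict(list)
    for current, following in zip(words, words[1:]):
        model[current].append(following)
    return model


def generate(model, start, max_words=10, rng=None):
    """Walk the chain from `start`, picking a random observed follower each step."""
    rng = rng or random.Random()
    word = start
    sentence = [word]
    for _ in range(max_words - 1):
        followers = model.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        word = rng.choice(followers)
        sentence.append(word)
    return " ".join(sentence)


# Toy corpus (hypothetical); a real run would read a text file instead.
corpus = "the dog sees the cat and the cat sees the bird"
model = build_bigram_model(corpus)
print(generate(model, "the", rng=random.Random(0)))
```

Because followers are stored with repetition, frequent continuations are chosen proportionally more often, which is what makes the output sound like the source text.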

Computing systems

Android Applications

Chrome Extensions

  • babelfrog GitHub stars Chrome extension to help learn languages as you browse.
  • DictionaryChromeExtension GitHub stars Dictionary for websites in low-resource languages. App and codebase which connects to a Wiktionary to provide definitions of any term on any website (current languages Cherokee 194,426 entries, Inuktitut 251 entries, Kartuli 7,363 entries, Plains Cree (incubation) 0 entries) use

FieldDB

FieldDB is actively worked on by the FieldDB group (formerly known as OpenSourceFieldlinguistics). These repos explicitly work with it but could be repurposed for other projects.

  • FieldDB GitHub stars An offline/online field database which adapts to its user's terminology and I-Language, has plugins for various data automation routines along the process of primary data collection to cleaning to publication and archival. use

FieldDB Webservices/Components/Plugins

Academic Research Paper-Specific Repositories

  • Gargantua GitHub stars Fast Unsupervised Sentence Aligner described in "Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora", COLING 2010.
  • ldc-kiy GitHub stars Materials for: The experimental state of mind in elicitation: illustrations from tonal fieldwork. Submitted to Language Documentation & Conservation, How to study a tone language.
  • Learning to map into a Universal POS tagset Yuan Zhang, Roi Reichart, Regina Barzilay and Amir Globerson
  • low-resource-pos-tagging-2014 GitHub stars and low-resource-pos-tagging-2014 Published in: Learning a Part-of-Speech Tagger from Two Hours of Annotation. Dan Garrette and Jason Baldridge. In Proceedings of NAACL 2013. And in: Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages. Dan Garrette, Jason Mielens, and Jason Baldridge. In Proceedings of ACL 2013.
  • orthotree GitHub stars Linguistic family tree based on orthographic distance
  • type-supervised-tagging-2012emnlp GitHub stars This repository contains the code, scripts, and instructions needed to reproduce the results in the paper: Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries. Dan Garrette and Jason Baldridge. In Proceedings of EMNLP 2012. This code is frozen as of the version used to obtain the results in the paper. It will not be maintained. To see the updated code, visit nlp
  • visualizing-language GitHub stars For visualizations of WALS and other typological databases
  • WALS-APiCS GitHub stars Code for working with WALS-APiCS (Atlas of Pidgin and Creole Language Structures) complexity metrics

Example Repositories

These are repositories that are generally only interesting for training purposes or seeing how something is done.

Language & Code Interfaces

  • قلب GitHub stars A simple, Scheme-like programming language that you code entirely in Arabic. It is an exploration of the impact of human culture on computer science, the role of tradition in software engineering, and the connection between natural and computer languages.

Fonts

  • fontinline GitHub stars Make inline stroke paths from an outline font
  • Noto Fonts GitHub stars Noto is Google’s free font family that aims to support all the world’s scripts. Its design goal is to achieve visual harmonization across languages. Noto fonts are under Apache License 2.0.
  • Unicodify Unicodify is a suite of programs for converting text in a variety of 8-bit encodings to Unicode (using the UTF-16 encoding). Unicodify was particularly designed to handle HTML-based text using non-ISCII 8-bit fonts to render South Asian scripts. However, elements of the suite can map other types of non-ASCII 8-bit encodings, such as Latin-2, ISCII and PASCII.
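The kind of conversion Unicodify describes (text in an 8-bit encoding re-encoded as UTF-16) can be sketched with Python's built-in codecs. This is not Unicodify's own code, and it does not cover the custom font-specific encodings that Unicodify targets; the function name `legacy_to_utf16` and the sample bytes are assumptions made for illustration, using Latin-2 (ISO 8859-2) as the source encoding.

```python
def legacy_to_utf16(raw: bytes, source_encoding: str = "iso8859_2") -> bytes:
    """Decode bytes from an 8-bit encoding and re-encode them as UTF-16."""
    text = raw.decode(source_encoding)  # 8-bit bytes -> Unicode string
    return text.encode("utf-16")        # Unicode string -> UTF-16 with a BOM


# Hypothetical sample: four Latin-2 bytes that include non-ASCII letters.
legacy = b"\xb3\xf3d\xbc"
converted = legacy_to_utf16(legacy)
print(converted.decode("utf-16"))
```

Encodings that Python does not ship a codec for would need an explicit byte-to-character table instead of `bytes.decode`.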

Corpora

These corpora are useful for working with tools on endangered languages. Monolingual corpora that are more for archival efforts should most likely not be included here.

  • bible-corpus GitHub stars A multilingual parallel corpus created from translations of the Bible.
  • poio-corpus GitHub stars The Poio Corpus is a freely available collection of language resources for the lesser-used languages. The data is extracted from free sources like Wikipedia, dictionaries, documents, websites and others.

Organizations

On GitHub

  • batumi GitHub stars - Speech recognition and natural language processing for low-resource languages
  • BloomBooks GitHub stars
  • cmusphinx GitHub stars - Mirror of the SourceForge repositories
  • dativebase GitHub stars - Tools for working with OLD.
  • divvun GitHub stars - The Divvun group at UiT develops proofing tools, keyboard apps and other language technology solutions for minority languages, especially the Sámi languages. Website.
  • FieldDB GitHub stars
  • HFST GitHub stars - Helsinki Finite-State Technology. Website.
  • hunspell GitHub stars
  • lex4all GitHub stars
  • longnow GitHub stars
  • moses-smt GitHub stars - Statistical Machine Translation.
  • NLTK GitHub stars - Natural Language Toolkit
  • PhonologicalCorpusTools GitHub stars
  • Projet de recherche sur l'écriture GitHub stars - Crowdsourcing or conducting large scale psycholinguistics experiments (or statistically significant field linguistics)
  • SIL International (Dev) GitHub stars - Another SIL organization, with many repositories.
  • SIL International GitHub stars - SIL (originally known as the Summer Institute of Linguistics, Inc.) is probably the leading organization providing software and tools tailored for use by field linguists and lexicographers working on endangered languages. A little-known fact is that much of its code is open sourced on GitHub, and SIL is happy to receive open source contributions and collaborate on open source projects.
  • SIL NRSI GitHub stars - SIL Non-Roman Script Initiative. The NRSI is a department of SIL International, whose task is to provide assistance, research and development for SIL International and its partners to support the use of non-Roman and complex scripts in language development.
  • StanfordNLP GitHub stars https://nlp.stanford.edu
  • UniversalDependencies - Universal Dependencies (UD) is a project that is developing cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on an evolution of (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary.
  • utcompling GitHub stars - The University of Texas at Austin's Computational Linguistics Lab. Website.

Other OSS Organisations

  • Giellatekno - Giellatekno combines cutting-edge linguistic and computational research into the analysis of Saami and other morphologically-rich languages, with the development of practical applications. We focus on deep linguistic modeling and on highly efficient and robust computational analysis with a wide empirical coverage. They use svn for their code: all of it can be found here, sorted by language.
  • LTRC: Language Technologies Research Center IIIT Hyderabad LTRC addresses the complex problem of understanding and processing natural languages in both speech and text mode. LTRC conducts research on both basic and applied aspects of language technology. It is the largest academic centre of speech and language technology in South Asia. LTRC carries out its work through four labs, which work in synergy with each other.
  • The Language Archive Part of the MPI

Tutorials

Language Specific Projects

Albanian

sqi :: shqip

Alutiiq

ems :: sugpiaq

  • wiinaq GitHub stars - Word Wiinaq is a Kodiak Alutiiq dictionary web application with automatically generated ending tables and souped-up search capabilities. It is written in Python using Django.

Amharic

amh :: አማርኛ

  • HornMorpho GitHub stars - morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs

Arabic

ara :: العربية

  • Buckwalter GitHub stars A small python script that transliterates Arabic text using the Buckwalter Transliteration Scheme. It allows for multiple decisions to be made around whether or not to include all types of diacritics and characters or ignore them. Useful for NLP experiments where you may want to normalize text.
  • Dialects GitHub stars Django project to allow for documentation (input and display) of linguistic forms in dialects or closely related languages.
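To make the Buckwalter entry above more concrete, here is a minimal Python sketch of Buckwalter transliteration with an option to drop diacritics. It is not the listed script's code; the names `to_buckwalter` and `keep_diacritics` are hypothetical. The table covers the standard Buckwalter symbols for U+0621 to U+063A, U+0640 to U+0652, U+0670 and U+0671; characters outside the table are passed through unchanged.

```python
# Buckwalter symbols for two contiguous Unicode ranges, in code point order.
_LETTERS = "'|>&<}AbptvjHxd*rzs$SDTZEg"  # U+0621 (hamza) .. U+063A (ghain)
_REST = "_fqklmnhwYyFNKaui~o"            # U+0640 (tatweel) .. U+0652 (sukun)

BUCKWALTER = {chr(0x0621 + i): sym for i, sym in enumerate(_LETTERS)}
BUCKWALTER.update({chr(0x0640 + i): sym for i, sym in enumerate(_REST)})
BUCKWALTER.update({"\u0670": "`", "\u0671": "{"})  # dagger alif, alif wasla

# Short vowels, tanween, shadda, sukun, and dagger alif.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def to_buckwalter(text, keep_diacritics=True):
    """Transliterate Arabic text; optionally skip diacritics to normalize it."""
    out = []
    for ch in text:
        if not keep_diacritics and ch in DIACRITICS:
            continue
        out.append(BUCKWALTER.get(ch, ch))  # unknown characters pass through
    return "".join(out)


# kaf, fatha, ta, fatha, ba, fatha
word = "\u0643\u064E\u062A\u064E\u0628\u064E"
print(to_buckwalter(word))
print(to_buckwalter(word, keep_diacritics=False))
```

Dropping diacritics before transliteration is one way to get the normalized text that the entry mentions for NLP experiments.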

Bengali

ben :: বাংলা

  • Bangla-অঙ্কুর for Mac This project aims to develop a phonetic-based Bangla typing system for Macintosh computers, which can be developed into a transliteration technique in the future.
  • Bengali Writer GitHub stars "Bengali Writer" is a set of utilities for computerized editing and typesetting in Bengali, a language of India and Bangladesh. It comprises a set of fonts for Bengali in several formats (METAFONT, BDF, PS), a text editor with spell-checking, export, and more. (Original project is on SourceForge: https://sourceforge.net/projects/bengaliwriter/)
  • Ekushey Bangla Computing and Localization Project for the Bangla speaking people.
  • Lekho GitHub stars A collection of tools and resources for using Bangla on computers (Original project is on SourceForge: https://sourceforge.net/projects/lekho/)

Chichewa

nya :: chicheŵa

Estonian

est :: eesti keel

Georgian

kat :: ქართული

  • awesome-georgia GitHub stars A curated list of awesome libraries and packages specific/related to Georgia (country).
  • Gadatsqvetilebebi GitHub stars გადაწყვეტილებები; Web spider and corpora importer for public legal decisions
  • GeoWordsDatabase GitHub stars Around 310 000 unique Georgian words https://bumbeishvili.github.io/GeoWordsDatabase/
  • Kartuli Speech Recognition GitHub stars ანდროიდის ქართველი მომხმარებლებისთვის სიტყვის ამოცნობის სისტემის შექმნა (Building a speech recognition system for Georgian-speaking Android users). Codebase to turn any webpage from any alphabet into another alphabet; the default is to turn Latin letters into Kartuli. use "Do your friends keep commenting on Facebook with English keyboards (either because they forgot to switch, or because they didn't/can't install a Georgian keyboard)? Now you can read the web through კართული eyes."
  • KartuliChromeExtension GitHub stars Chrome აპლიკაცია, რომელიც ყველა ინგლისურ ასო-ბგერას აჩვენებს ქართულ ასო-ბგერად (A Chrome app that displays every English letter as a Georgian letter).
  • QartuliDaBunebismetkveleba GitHub stars მათემატიკისა და ბუნებისმეტყველების ინტერაქტიული სახელმძღვანელო მე-2 - მე-3 კლასის მოსწავლეებისათვის (An interactive mathematics and natural science textbook for 2nd and 3rd grade students).
  • SakartvelosUzenaesiSasamartloSarke GitHub stars საქართველოს უზენაესი სასამართლო სარკე (Mirror of the Supreme Court of Georgia).
  • SamartlosSakonstitutsioSasamartdoSarke GitHub stars სამართლოს საკონსტიტუციო სასამართდო სარკე (Mirror of the Constitutional Court).
  • translitit-latin-to-mkhedruli-georgian GitHub stars A Latin to ქართული (Mkhedruli Georgian) transliteration function written in JavaScript
  • translitit-mkhedruli-georgian-to-ipa GitHub stars A ქართული (Mkhedruli Georgian) to IPA transliteration function written in JavaScript
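Several of the projects above convert Latin letters to Mkhedruli. The following Python sketch shows one way such a conversion can work, using greedy longest-match over a romanization table. It is not the code of any listed repository: the table assumes the Georgian national romanization system (apostrophes marking ejectives), and the names `ROMANIZATIONS` and `transliterate` are hypothetical.

```python
# Romanizations of the 33 modern Mkhedruli letters, in Unicode order
# (U+10D0 .. U+10F0). Assumption: Georgian national romanization system.
ROMANIZATIONS = [
    "a", "b", "g", "d", "e", "v", "z", "t", "i", "k'", "l",
    "m", "n", "o", "p'", "zh", "r", "s", "t'", "u", "p", "k",
    "gh", "q'", "sh", "ch", "ts", "dz", "ts'", "ch'", "kh", "j", "h",
]
LATIN_TO_MKHEDRULI = {rom: chr(0x10D0 + i) for i, rom in enumerate(ROMANIZATIONS)}
MAX_KEY = max(len(key) for key in LATIN_TO_MKHEDRULI)


def transliterate(text):
    """Replace Latin sequences with Mkhedruli letters, longest match first."""
    lowered = text.lower()
    out = []
    i = 0
    while i < len(lowered):
        for size in range(MAX_KEY, 0, -1):
            chunk = lowered[i:i + size]
            if chunk in LATIN_TO_MKHEDRULI:
                out.append(LATIN_TO_MKHEDRULI[chunk])
                i += size
                break
        else:
            out.append(lowered[i])  # leave unknown characters as they are
            i += 1
    return "".join(out)


print(transliterate("sakartvelo"))
print(transliterate("kartuli"))
```

Longest-match means a sequence such as `ts` always becomes a single letter; text typed without apostrophes cannot distinguish ejectives from aspirates, which a real tool would have to handle separately.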

Fonts

Internationalization and Localization (i18n/l10n)

Guarani

grn :: Guarani

  • ParaMorfo GitHub stars - morphological analysis and generation of Spanish and Guarani verbs, nouns, and adjectives. Used to be here.

Hausa

hau :: هَرْشَن هَوْسَ

  • Hausa GitHub stars Repository for Hausa NLP tools

Hindi

hin :: हिन्दी

  • hindi-morph GitHub stars An open source morphological analyzer for Hindi

Høgnorsk

nno :: Høgnorsk

  • hunspell-hn_NO GitHub stars An early-stage spellchecking tool for Høgnorsk, a conservative variant of Norwegian Nynorsk, based on a set of corpora.

Inuktitut

iku :: Inuktitut

Irish

gle :: Gaeilge

Japanese

jpn :: 日本語

  • JapaneseCorpusAngoSakaguchi GitHub stars Ango Sakaguchi's essays, with some code
  • kuromoji GitHub stars Kuromoji is a self-contained and very easy to use Japanese morphological analyzer designed for search
  • kuromoji-server GitHub stars Kuromoji server and demo that shows Japanese morphological analyzer capabilities

Kinyarwanda

kin :: Ikinyarwanda

Korean

kor :: 한국어

Lingala

lin :: Lingála

Lushootseed

lut :: Lushootseed

Malay

Malagasy

mlg :: Malagasy

  • Global Voices Malagasy Project This page provides a link to a corpus of parallel news articles in Malagasy and English from the Global Voices project. This corpus was collected and aligned at the sentence level by Victor Chahuneau.

Manx

glv :: Gaelg

  • aspell-gv GitHub stars Manx Gaelic dictionary for aspell
  • gaelg GitHub stars NLP resources for Manx Gaelic, mainly in support of the gv2ga MT engine

Migmaq

mic :: Mi'kmaq

Minderico

  • fredericajordarzambarino GitHub stars A web-based game for mobile devices in Minderico, based on the "Who Wants to Be a Millionaire" TV show.

Nishnaabe

oji :: Ojibwe, Odawa, Chippewa, Anishinaabemowin, ᐊᓂᔑᓈᐯᒧᐎᓐ

  • Ojibway-iphone-app GitHub stars An iPhone app with audio and images for learning the Ojibway language.
  • OjibwayMap GitHub stars An iPhone app with audio and images for learning Ojibway language and culture.
  • nishanimate GitHub stars A desktop app to facilitate Nishnaabe-language acquisition via animations produced by the natural language processing of audio-accompanied text.

Oromo

orm :: Oromo

  • hornmorpho GitHub stars morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs

Quechua

que :: Runa Simi

  • AntiMorfo GitHub stars - morphological analysis and generation of Quechua nouns, adjectives, and verbs and Spanish verbs
  • Morphology, spellchecker - XFST and FOMA, plus OpenOffice plugin.

Sami

sma :: Sámi/Saami

  • Mobile keyboards (iOS and Android), learning apps, dictionaries, morphologies, syntax disambiguators, some amount of project collaboration with Apertium on shallow translation between Saami languages, and
  • Oahpa! - A learning portal for Saami languages. Includes WordPress-based, media-rich, lesson-based learning, and morphological and syntactic exercises generated from the morphological and syntactic tools
  • Neahttadigisánit - A morphologically sensitive dictionary with a 'social media input' mode, which allows users to type a 'relaxed' version of the orthography (acdnstz will also be recognized as áčđŋšŧž). It also includes a JavaScript bookmarklet that offers click-to-read dictionary lookup. Also available for other Uralic and non-Uralic languages. Giellatekno does a lot for other minority Uralic languages. Following are some keywords for CTRL+F friendliness:
  • Saami languages: North Saami, Lule Saami, South Saami // Inari Saami, Kildin Saami, Pite Saami, Skolt Saami.
  • Other Uralic languages: Erzya, Finnish, Hill Mari, Ingrian, Khanty, Kven, Komi, Livonian, Meadow Mari, Moksha, Nenets, Nganasan, Olonetsian, Udmurt, Veps.
  • Other languages: Buriat, Cornish, Faroese, Greenlandic, Iñupiaq, Northern Haida, Ojibwe, Plains Cree, Russian.
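The 'relaxed' orthography input described for Neahttadigisánit can be approximated by folding the special North Saami letters to their plain counterparts on both the dictionary side and the query side. The Python sketch below is only an illustration of that idea, not Giellatekno's implementation (which builds on morphological analyzers); the names `relax`, `build_index`, and `lookup`, and the assumption that the folded letters are exactly áčđŋšŧž, are mine.

```python
import unicodedata

# Assumed folding table: North Saami letters -> plain ASCII letters.
RELAXED = str.maketrans("áčđŋšŧž", "acdnstz")


def relax(word):
    """Lowercase a word and fold special letters, e.g. for matching typed input."""
    normalized = unicodedata.normalize("NFC", word.lower())
    return normalized.translate(RELAXED)


def build_index(headwords):
    """Group headwords under their relaxed spelling."""
    index = {}
    for headword in headwords:
        index.setdefault(relax(headword), []).append(headword)
    return index


def lookup(index, typed):
    """Return every headword whose relaxed spelling matches the typed form."""
    return index.get(relax(typed), [])


# Hypothetical mini-lexicon of North Saami headwords.
index = build_index(["čáhci", "áhčči", "giella"])
print(lookup(index, "cahci"))
print(lookup(index, "ahcci"))
```

Indexing by the folded form keeps the lookup a single dictionary access, and a correctly spelled query still matches because it is folded the same way.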

Scottish Gaelic

gla :: Gàidhlig

  • aspell-gd GitHub stars Scottish Gaelic dictionary for aspell
  • briathrachan GitHub stars This is the source code to Briathrachan, a Gaelic-English dictionary app for iOS.
  • gaidhlig GitHub stars NLP resources for Scottish Gaelic, mainly in support of gd2ga/ga2gd MT engines
  • gd-fcfg GitHub stars Context-free feature-based grammar of Scottish Gaelic in the NLTK format
  • gdbank GitHub stars Some tools and resources for natural language processing of Scottish Gaelic. http://www.tantallon.org.uk/cggblog/
  • hunspell-gd GitHub stars Files for building Scottish Gaelic spell checkers

Secwepemctsín

shs :: Secwepemctsín

Somali

som :: Soomaaliga

  • somorph GitHub stars Somali morphological and syntactic analyzers and generators built on XFST and VISL-CG Constraint Grammar. An up-to-date version is checked in to Giellatekno's repository.
  • qaamuus.net A morphologically aware dictionary based on lexical resources found online and the Somali morphology.

Tigrinya

tir :: ትግርኛ

  • HornMorpho GitHub stars morphological analysis and generation of Amharic and Oromo verbs and nouns and Tigrinya verbs

Zulu

zul :: isiZulu

  • Ukwabelana An open-source morphological Zulu corpus

License

License: CC BY-SA 4.0 © Richard Littauer 2014-2017