
#Generalized Language Model Toolkit

The software can be used to compute a Generalized Language Model, which is yet another means of computing a language model.

Getting started

git clone
sudo chmod a+x

You will need to install maven in order to build the project.

sudo apt-get install maven2

You need to copy config.sample.txt to config.txt and read the instructions in config.sample.txt.

cp config.sample.txt config.txt
emacs config.txt

After you have set all your directories in config.txt, you can run the project.


Disk and Main memory requirements

Since generalized language models can become very large, the software is written to use the hard disk. In this sense you can theoretically run the program with very little memory. Still, we recommend 16 GB of main memory for the large English Wikipedia data sets.

We tried to avoid frequent disk accesses. Still, the program will run much faster if you store your data on a solid-state drive.

Download the test data sets

Please refer to in order to download preprocessed and formatted data sets.

If you wish to parse the data yourself (e.g. because you want to use a newer Wikipedia dump) refer to

Citing the paper

If this software or data is of any help to your research, please cite the original publication. You might want to use the following BibTeX entry:

   author = {Pickhardt, Rene and Gottron, Thomas and Körner, Martin and  Wagner, Paul Georg and  Speicher, Till and  Staab, Steffen}, 
   title = {A Generalized Language Model as the Combination of Skipped n-grams and Modified Kneser Ney Smoothing}, 
   year = {2014}, 
   booktitle = {ACL'14: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics}, 


The Generalized Language Models evolved from Paul Georg Wagner's and Till Speicher's Young Scientists project called Typology, which I advised in 2012. The Typology project experimented with and evaluated an idea I had (inspired by Adam Schenker) of representing text as a graph in which the edges encode relationships between words (nowadays known as skipped bi-grams). The graph was used to answer the next word prediction problem, applied to word suggestions in keyboards of modern smartphones. From the convincing results I developed the theory of Generalized Language Models. Most of the code was written by my student assistant Martin Körner, who also wrote his bachelor's thesis about the implementation of a preliminary version of the Generalized Language Models. This thesis is a nice reference if you want to get an understanding of modified Kneser-Ney smoothing for standard language models. In terms of notation and building of generalized language models it is outdated.
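
To make the idea of skipped bi-grams as graph edges more concrete, here is a minimal Python sketch. It is not taken from this toolkit or from Typology; the function names `skipped_bigrams` and `predict_next`, the `max_skip` parameter, and the plain count-summing score are all assumptions chosen for illustration. In particular it does no smoothing, so it is not a Generalized Language Model.

```python
from collections import Counter


def skipped_bigrams(tokens, max_skip=2):
    """Count edges (left, skip, right), where `skip` words lie between the two.

    skip == 0 gives ordinary bi-grams; larger values give skipped bi-grams.
    (Hypothetical helper, not part of the toolkit.)
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        for skip in range(max_skip + 1):
            j = i + skip + 1
            if j >= len(tokens):
                break
            counts[(left, skip, tokens[j])] += 1
    return counts


def predict_next(counts, context, max_skip=2):
    """Rank candidate next words by summing the counts of all matching edges.

    The last context word votes through skip-0 edges, the word before it
    through skip-1 edges, and so on. This is an unsmoothed toy score.
    """
    scores = Counter()
    for skip in range(min(max_skip + 1, len(context))):
        left = context[-(skip + 1)]
        for (edge_left, edge_skip, right), count in counts.items():
            if edge_left == left and edge_skip == skip:
                scores[right] += count
    return scores.most_common()


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    edges = skipped_bigrams(tokens, max_skip=1)
    print(predict_next(edges, ["on", "the"], max_skip=1))
```

Each edge type (one per skip distance) contributes its own evidence for the next word, which is what lets a word further back in the context still influence the suggestion.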