
Collection of three method-of-moments-based algorithms for learning stochastic languages. Accompaniment to the ICML 2014 paper "Methods of Moments for Learning Stochastic Languages: Unified Presentation and Empirical Comparison" by Borja Balle, William L. Hamilton, and Joelle Pineau. The learning algorithms produce weighted finite automata that can be used to make predictions over strings. For more details, see the paper.
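As background for how the learned models make predictions, here is a minimal sketch (not taken from this repository) of how a weighted finite automaton scores a string: the weight of x1...xn is alpha^T A_{x1} ... A_{xn} beta, where alpha and beta are the initial and final weight vectors and each A_sigma is a per-symbol transition matrix.

```python
# Hedged sketch: scoring a string with a weighted finite automaton (WFA).
# The names and the toy parameters below are illustrative, not the repo's.
import numpy as np

def wfa_score(alpha, beta, A, string):
    """Weight of `string` (a sequence of symbol indices) under the WFA."""
    v = alpha
    for symbol in string:
        v = v @ A[symbol]  # propagate the state vector through each symbol
    return float(v @ beta)

# Toy 1-state automaton over a 2-symbol alphabet: symbol 0 has weight 0.5,
# symbol 1 has weight 0.25, and stopping has weight 0.5.
alpha = np.array([1.0])
beta = np.array([0.5])
A = [np.array([[0.5]]), np.array([[0.25]])]

print(wfa_score(alpha, beta, A, [0, 1]))  # 0.5 * 0.25 * 0.5 = 0.0625
```

For a probabilistic WFA the scores of all strings sum to one, so they can be read as string probabilities.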


This repository contains:

  • A spectral learning algorithm (using both string and substring estimates).
  • A convex optimization learning algorithm.
  • A tensor decomposition based learning algorithm.
  • A wrapper for the Treba EM library.
  • 12 synthetic datasets from the PAutomaC competition.
  • One real-world NLP dataset built from the part-of-speech tags in the Penn Treebank2 dataset.


The code requires Python 2.7, SciPy, and the Python SpPy library. Necessary C++ libraries are provided. The latest GCC compiler is recommended.


For the spectral methods: No installation required.

For the CO method: Navigate to the co/cpp directory and run "make".

For the tensor method: Navigate to the tensor/cpp directory and run "make".

Using the code (with settings from the paper):

For all methods: Modify the PAUTOMACPATH, RESULTSPATH, etc. variables at the start of the main method sections as necessary.

For the spectral method:

  • Navigate to the spectral directory.
  • Run python wfaspectrallearn [est-type] [metric] [problem-id] [n-symbols] [basis-size]
    • est-type is either "string" or "substring".
    • metric is either "WER" or "KL" (perplexity).
    • problem-id is the PAutomaC problem id number, or "tree" for the Treebank2 dataset.
    • n-symbols is the number of symbols in the target alphabet.
    • basis-size is the number of prefixes/suffixes to use in estimation.
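The role of the prefix/suffix basis can be illustrated with a generic sketch of the spectral recipe (hedged: this is the textbook construction, not necessarily this repository's exact code). Given empirical Hankel estimates over a basis of prefixes and suffixes, a truncated SVD H ≈ U D V^T yields a WFA via A_sigma = D^{-1} U^T H_sigma V, alpha = V^T h_S, beta = D^{-1} U^T h_P, where h_S is the empty-prefix row and h_P the empty-suffix column.

```python
# Hedged sketch of spectral WFA recovery from Hankel estimates; all names
# here are illustrative, not the repository's API.
import numpy as np

def spectral_wfa(H, H_sigmas, h_S, h_P, rank):
    """Recover WFA parameters (alpha, {A_sigma}, beta) from Hankel blocks."""
    U, d, Vt = np.linalg.svd(H, full_matrices=False)
    U, d, V = U[:, :rank], d[:rank], Vt[:rank].T
    Dinv_Ut = (U / d).T                        # D^{-1} U^T
    A = {s: Dinv_Ut @ Hs @ V for s, Hs in H_sigmas.items()}
    alpha = V.T @ h_S                          # initial weight vector
    beta = Dinv_Ut @ h_P                       # final weight vector
    return alpha, A, beta

# Toy check with the exact Hankel of f(a^n) = 0.5^n * 0.5, over basis
# prefixes {eps, a} and suffixes {eps, a}.
H = np.array([[0.5, 0.25], [0.25, 0.125]])
H_sigmas = {"a": 0.5 * H}                      # here f(u a v) = 0.5 * f(u v)
h_S = H[0]                                     # row indexed by the empty prefix
h_P = H[:, 0]                                  # column indexed by the empty suffix
alpha, A, beta = spectral_wfa(H, H_sigmas, h_S, h_P, rank=1)
print(alpha @ beta, alpha @ A["a"] @ beta)     # close to f(eps)=0.5, f(a)=0.25
```

A larger basis gives more Hankel entries to estimate from data, trading statistical robustness against estimation cost.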

For the convex optimization method:

  • Navigate to the co directory.
  • Run python [model-type] [metric] [problem-id] [n-symbols] [basis-size]
    • model-type should be set to "WFA" (the other settings were not used in the paper).
    • The remaining arguments are the same as above.
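The "KL" (perplexity) metric shared by these commands can be sketched generically. This is a hedged, PAutomaC-style version of the score; the repository's exact normalization may differ.

```python
# Hedged sketch of PAutomaC-style perplexity:
#   2 ** ( - sum_x p_true(x) * log2(p_model(x)) )
# with both distributions normalized over the test strings.
import math

def perplexity(p_true, p_model):
    """p_true, p_model: probabilities the two models assign to the same test strings."""
    zt, zm = sum(p_true), sum(p_model)
    return 2 ** (-sum((t / zt) * math.log2(m / zm)
                      for t, m in zip(p_true, p_model)))

# A model matching the target distribution attains the minimum perplexity,
# 2 ** H(p_true); any mismatch increases the score.
print(perplexity([0.5, 0.25, 0.25], [0.5, 0.25, 0.25]))
```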

For the tensor method:

  • Navigate to the tensor directory.
  • Run python [metric] [problem-id] [n-symbols] [basis-size]. This method automatically outputs a model description in the directory specified by MODELDIR in the main method.

For the EM wrapper code:

  • Navigate to the em directory.
  • Run python [problem-id] [n-symbols]

For the tensor-initialized EM:

  • Navigate to the em directory.
  • Run python [metric] [problem-id] [n-symbols] [path-to-tensor-model-file]








