New Python module release: RerankingParser updated, ModelFetcher added
python/bllipparser/RerankingParser.py: Various improvements.
    Sentence class now ensures tokens are strings to reduce crashing.
    RerankingParser class:
    -   load_unified_model_dir() renamed to
        from_unified_model_dir(); the old method name is now deprecated.
    -   Parser options can now be changed with the set_parser_options()
        method.
    -   parse() and parse_tagged() both default to a new rerank
        mode, rerank='auto', which will only rerank if a reranker model
        is available.
    -   parse_tagged() now raises a ValueError if you provide an invalid
        POS tag (instead of segfaulting).
    -   check_loaded_models() renamed to _check_loaded_models() since it's
        not intended for users.
    -   added get_unified_model_parameters() helper function which
        provides paths to parser and reranker model files.
python/bllipparser/ModelFetcher.py: new Python module which downloads
    and installs BLLIP unified parsing models. Can be used via command
    line or Python library.
python/bllipparser/ParsingShell.py: can now be launched without parsing
    models
README-python.txt: docs, examples updated. Now covers ModelFetcher
    and ParsingShell (the latter was previously distributed but not
    mentioned)
setup.py: updated with latest release information
dmcc committed Feb 10, 2014
1 parent f13217f commit 7ecb65f
Showing 5 changed files with 425 additions and 68 deletions.
106 changes: 93 additions & 13 deletions README-python.txt
@@ -1,24 +1,56 @@
The BLLIP parser (also known as the Charniak-Johnson parser or
Brown Reranking Parser) is described in the paper `Charniak
and Johnson (Association of Computational Linguistics, 2005)
<http://aclweb.org/anthology/P/P05/P05-1022.pdf>`_. This package provides
the BLLIP parser runtime along with a Python interface. Note that it
does not come with any parsing models but includes a downloader.
The primary maintenance for the parser takes place at `GitHub
<http://github.com/BLLIP/bllip-parser>`_.

Fetching parsing models
-----------------------

Before you can parse, you'll need some parsing models. ``ModelFetcher``
will help you download and install parsing models. It can be invoked
from the command line. For example, this will download and install the
standard WSJ model::

    shell% python -mbllipparser.ModelFetcher -i WSJ

Run ``python -mbllipparser.ModelFetcher`` with no arguments for a full
listing of options and available parsing models. It can also be invoked
as a Python library::

    >>> from bllipparser.ModelFetcher import download_and_install_model
    >>> download_and_install_model('WSJ', '/tmp/models')
    /tmp/models/WSJ

In this case, it would download WSJ and install it to
``/tmp/models/WSJ``. Note that it returns the path to the downloaded
model.
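This install-if-missing behavior can be sketched in a few lines. The helper below is a hypothetical simplification for illustration only; the real ``download_and_install_model`` also downloads and unpacks the model archive:

```python
import os
import tempfile

def install_model(name, target_directory):
    """Hypothetical sketch: create target_directory/name unless it
    already exists, and return the model path either way (mirroring
    the no-op behavior of download_and_install_model on reinstall)."""
    output_path = os.path.join(target_directory, name)
    if os.path.exists(output_path):
        return output_path  # already installed: nothing to do
    os.makedirs(output_path)  # the real code downloads and extracts here
    return output_path

base = tempfile.mkdtemp()
first = install_model('WSJ', base)
second = install_model('WSJ', base)  # no-op: the same path comes back
print(first == second)  # → True
```

The key point is that the returned path is stable, so callers can pass it straight to the parser constructor without checking whether a download actually happened.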

Basic usage
-----------

The easiest way to construct a parser is with the
``from_unified_model_dir`` class method. A unified model is a directory
that contains two subdirectories: ``parser/`` and ``reranker/``, each
with the respective model files::

    >>> from bllipparser import RerankingParser, tokenize
    >>> rrp = RerankingParser.from_unified_model_dir('/path/to/model/')

This can be integrated with ModelFetcher (if the model is already
installed, ``download_and_install_model`` is a no-op)::

    >>> model_dir = download_and_install_model('WSJ', '/tmp/models')
    >>> rrp = RerankingParser.from_unified_model_dir(model_dir)

You can also load parser and reranker models manually::

    >>> rrp = RerankingParser()
    >>> rrp.load_parser_model('/tmp/models/WSJ/parser')
    >>> rrp.load_reranker_model('/tmp/models/WSJ/reranker')

Parsing a single sentence and reading information about the top parse
with ``parse()``. The parser produces an *n-best list* of the *n* most
likely parses.
@@ -49,26 +81,74 @@
The reranker can be disabled by setting ``rerank=False``::

    >>> nbest_list = rrp.parse('Parser only!', rerank=False)

You can also parse text with existing POS tags (these act as soft
constraints). In this example, token 0 ('Time') should have tag VB and
token 1 ('flies') should have tag NNS::

    >>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB', 1 : 'NNS'})[0]
    ScoredParse('(S1 (NP (VB Time) (NNS flies)))', parser_score=-53.94938875760073, reranker_score=-15.841407102717749)

You don't need to specify a tag for all words: here, token 0 ('Time') should
have tag VB and token 1 ('flies') is unconstrained::

    >>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB'})[0]
    ScoredParse('(S1 (S (VP (VB Time) (NP (VBZ flies)))))', parser_score=-54.390430751112156, reranker_score=-17.290145080887005)

You can specify multiple tags for each token. When you do this, the
tags for a token will be used in decreasing priority. Token 0 ('Time')
should have tag VB, JJ, or NN and token 1 ('flies') is unconstrained::

    >>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : ['VB', 'JJ', 'NN']})[0]
    ScoredParse('(S1 (NP (NN Time) (VBZ flies)))', parser_score=-42.82904107213723, reranker_score=-12.865900776775314)
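Since ``possible_tags`` values may be a single tag or a list of tags, the mapping has to be normalized before use, and (per this commit) an invalid tag now raises ``ValueError`` instead of segfaulting. A hypothetical sketch of that normalization and validation, using a toy tagset; this is not the actual bllipparser code:

```python
# Toy tagset for illustration; the real parser validates against its
# full Penn Treebank tag inventory.
KNOWN_TAGS = {'VB', 'VBZ', 'NN', 'NNS', 'JJ'}

def normalize_possible_tags(possible_tags):
    """Hypothetical helper: make every value a list of tags and reject
    unknown tags with ValueError (mirroring parse_tagged's behavior)."""
    normalized = {}
    for token_index, tags in possible_tags.items():
        if isinstance(tags, str):
            tags = [tags]  # a bare tag becomes a one-element list
        for tag in tags:
            if tag not in KNOWN_TAGS:
                raise ValueError('Invalid POS tag: %r' % tag)
        normalized[token_index] = list(tags)
    return normalized

print(normalize_possible_tags({0: 'VB', 1: ['NN', 'JJ']}))
# → {0: ['VB'], 1: ['NN', 'JJ']}
```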

There are many parser options which can be adjusted with
``set_parser_options`` (though the defaults should work well for most
cases). The call below changes the size of the n-best list and leaves
all other options at their defaults. It returns a dictionary of the
current options::

    >>> rrp.set_parser_options(nbest=10)
    {'language': 'En', 'case_insensitive': False, 'debug': 0, 'small_corpus': True, 'overparsing': 21, 'smooth_pos': 0, 'nbest': 10}
    >>> nbest_list = rrp.parse('The list is smaller now.', rerank=False)
    >>> len(nbest_list)
    10
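The reset-then-override semantics can be modeled as a dictionary merge. The defaults below are copied from the options dictionary shown above, except for ``nbest``, whose default of 50 is an assumption; the standalone function is a hypothetical model, not the bllipparser implementation:

```python
# Assumed defaults: copied from the options dictionary shown above,
# with nbest=50 assumed as the default n-best list size.
DEFAULTS = {'language': 'En', 'case_insensitive': False, 'debug': 0,
            'small_corpus': True, 'overparsing': 21, 'smooth_pos': 0,
            'nbest': 50}

def set_parser_options(**overrides):
    """Hypothetical model of set_parser_options: every option reverts
    to its default unless overridden, and the full set is returned."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError('Unknown parser options: %s' % sorted(unknown))
    options = dict(DEFAULTS)
    options.update(overrides)
    return options

print(set_parser_options(nbest=10)['nbest'])  # → 10
```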

Use this if all you want is a tokenizer::

    >>> tokenize("Tokenize this sentence, please.")
    ['Tokenize', 'this', 'sentence', ',', 'please', '.']
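For comparison, a naive regex tokenizer reproduces that split on simple inputs; the real ``tokenize`` uses the parser's Penn Treebank tokenization rules and handles many more cases (contractions, quotes, and so on):

```python
import re

def naive_tokenize(text):
    # Illustration only: words and single punctuation marks.
    # Not the parser's actual Penn Treebank tokenizer.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("Tokenize this sentence, please."))
# → ['Tokenize', 'this', 'sentence', ',', 'please', '.']
```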

Parsing shell
-------------

There is an interactive shell which can help visualize a parse::

    shell% python -mbllipparser.ParsingShell /path/to/model

Once in the shell, type a sentence to have the parser parse it::

    rrp> I saw the astronomer with the telescope.
    Tokens: I saw the astronomer with the telescope .

    Parser's parse:
    (S1 (S (NP (PRP I))
         (VP (VBD saw)
          (NP (NP (DT the) (NN astronomer))
           (PP (IN with) (NP (DT the) (NN telescope)))))
         (. .)))

    Reranker's parse: (parser index 2)
    (S1 (S (NP (PRP I))
         (VP (VBD saw)
          (NP (DT the) (NN astronomer))
          (PP (IN with) (NP (DT the) (NN telescope))))
         (. .)))

If you have ``nltk`` installed, you can use its tree visualization to
see the output::

    rrp> visual Show me this parse.
    Tokens: Show me this parse .

    [graphical display of the parse appears]

There is more detailed help inside the shell under the ``help`` command.
146 changes: 146 additions & 0 deletions python/bllipparser/ModelFetcher.py
@@ -0,0 +1,146 @@
#!/usr/bin/env python
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.

"""Simple BLLIP Parser unified parsing model repository and installer."""
from __future__ import division
import sys, urlparse, urllib
from os import makedirs, system, chdir, getcwd
from os.path import basename, exists, join

class ModelInfo:
    def __init__(self, model_desc, url, uncompressed_size='unknown '):
        """uncompressed_size is approximate size in megabytes."""
        self.model_desc = model_desc
        self.url = url
        self.uncompressed_size = uncompressed_size
    def __str__(self):
        return "%s [%sMB]" % (self.model_desc, self.uncompressed_size)

# should this grow large enough, we'll find a better place to store it
models = {
    'OntoNotes-WSJ': ModelInfo('OntoNotes portion of WSJ',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-OntoNotes-WSJ.tar.bz2', 61),
    'SANCL2012-Uniform': ModelInfo('Self-trained model on OntoNotes-WSJ and the Google Web Treebank',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-SANCL2012-Uniform.tar.bz2', 890),
    'WSJ+Gigaword': ModelInfo('Self-trained model on PTB2-WSJ and approx. two million sentences from Gigaword',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-Gigaword2000.tar.bz2', 473),
    'WSJ+PubMed': ModelInfo('Self-trained model on PTB2-WSJ and approx. 200k sentences from PubMed',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-PubMed.tar.bz2', 152),
    'WSJ': ModelInfo('Wall Street Journal corpus from Penn Treebank, version 2',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-no-AUX.tar.bz2', 52),
    'WSJ-with-AUX': ModelInfo('Wall Street Journal corpus from Penn Treebank, version 2 (AUXified version, deprecated)',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-with-AUX.tar.bz2', 55),
}

class UnknownParserModel(ValueError):
    def __str__(self):
        return "Unknown parser model name: " + str(self.args[0])

def download_and_install_model(model_name, target_directory, verbose=False):
    """Downloads and installs models to a specific directory. Models
    can be specified by simple names (use list_models() for a list
    of known models) or a URL. If the model is already installed in
    target_directory, it won't download it again. Returns the path to
    the new model."""

    if model_name.lower().startswith('http'):
        parsed_url = urlparse.urlparse(model_name)
        model_url = model_name
        model_name = basename(parsed_url.path).split('.')[0]
    elif model_name in models:
        model_url = models[model_name].url
    else:
        raise UnknownParserModel(model_name)

    output_path = join(target_directory, model_name)
    if verbose:
        print "Fetching model:", model_name, "from", model_url
        print "Model directory:", output_path

    if exists(output_path):
        if verbose:
            print "Model directory already exists, not reinstalling"
        return output_path

    if verbose:
        def status_func(blocks, block_size, total_size):
            amount_downloaded = blocks * block_size
            if total_size == -1:
                sys.stdout.write('Downloaded %s\r' % amount_downloaded)
            else:
                percent_downloaded = 100 * amount_downloaded / total_size
                size = amount_downloaded / (1024 ** 2)
                sys.stdout.write('Downloaded %.1f%% (%.1f MB)\r' %
                                 (percent_downloaded, size))
    else:
        status_func = None
    downloaded_filename, headers = urllib.urlretrieve(model_url,
                                                      reporthook=status_func)
    if verbose:
        sys.stdout.write('\rDownload complete' + (' ' * 20) + '\n')
        print 'Downloaded to temporary file', downloaded_filename

    try:
        makedirs(output_path)
    except OSError, ose:
        if ose.errno != 17:  # 17 == errno.EEXIST: directory already exists
            raise

    orig_path = getcwd()
    chdir(output_path)
    # by convention, all models are currently in tar.bz2 format
    # we may want to generalize this code later
    assert downloaded_filename.lower().endswith('.bz2')
    command = 'tar xvjf %s' % downloaded_filename
    if verbose:
        print "Extracting with %r to %s" % (command, output_path)
    system(command)
    chdir(orig_path)

    return output_path

def list_models():
    print len(models), "known unified parsing models: [uncompressed size]"
    for key, model_info in sorted(models.items()):
        print '\t%-20s\t%s' % (key, model_info)

def main():
    from optparse import OptionParser
    parser = OptionParser(usage="""%prog [options]
Tool to help you download and install BLLIP Parser models.""")
    parser.add_option("-l", "--list", action='store_true',
                      help="List known parsing models.")
    parser.add_option("-i", "--install", metavar="NAME", action='append',
                      help="Install a unified parser model.")
    parser.add_option("-d", "--directory", default='./models', metavar="PATH",
                      help="Directory to install parsing models in (will be "
                           "created if it doesn't exist). Default: %default")

    (options, args) = parser.parse_args()

    if not (options.list or options.install):
        parser.print_help()
        # flip this on to make 'list' the default action
        options.list = True
        print
    if options.list:
        list_models()
    if options.install:
        for i, model in enumerate(options.install):
            if i:
                print
            try:
                download_and_install_model(model, options.directory,
                                           verbose=True)
            except UnknownParserModel, u:
                print u
                list_models()
                sys.exit(1)

if __name__ == "__main__":
    main()
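A side note on the extraction step in ``download_and_install_model`` above: it shells out to ``tar`` and temporarily changes the working directory. A sketch of an alternative using the standard-library ``tarfile`` module avoids both; the helper name is hypothetical and this is illustrative only, not part of the commit:

```python
import os
import tarfile
import tempfile

def extract_model_archive(archive_path, output_path):
    """Hypothetical alternative to 'tar xvjf' via os.system: extract a
    .tar.bz2 archive with the stdlib tarfile module, no chdir needed."""
    assert archive_path.lower().endswith('.bz2')
    with tarfile.open(archive_path, 'r:bz2') as archive:
        archive.extractall(output_path)
    return output_path

# Build a tiny .tar.bz2 to demonstrate extraction end to end.
workdir = tempfile.mkdtemp()
payload = os.path.join(workdir, 'parser-model.txt')
with open(payload, 'w') as f:
    f.write('dummy model data')
archive_path = os.path.join(workdir, 'model.tar.bz2')
with tarfile.open(archive_path, 'w:bz2') as archive:
    archive.add(payload, arcname='parser-model.txt')

output = extract_model_archive(archive_path, os.path.join(workdir, 'out'))
print(sorted(os.listdir(output)))  # → ['parser-model.txt']
```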
7 changes: 6 additions & 1 deletion python/bllipparser/ParsingShell.py
@@ -22,12 +22,17 @@

from bllipparser.RerankingParser import RerankingParser

# TODO should integrate with bllipparser.ModelFetcher

class ParsingShell(Cmd):
    def __init__(self, model):
        Cmd.__init__(self)
        self.prompt = 'rrp> '
        print "Loading models..."
        if model is None:
            self.rrp = None
        else:
            self.rrp = RerankingParser.from_unified_model_dir(model)
        self.last_nbest_list = []

    def do_visual(self, text):
