New Python module release: RerankingParser updated, ModelFetcher added
python/bllipparser/RerankingParser.py: Various improvements.
    Sentence class now ensures tokens are strings to reduce crashing.
    RerankingParser class:
    -   load_unified_model_dir() renamed to
        from_unified_model_dir(); the old method name is now deprecated.
    -   Parser options can now be changed with the set_parser_options()
        method.
    -   parse() and parse_tagged() both default to a new rerank
        mode, rerank='auto', which will only rerank if a reranker model
        is available.
    -   parse_tagged() now raises a ValueError if you provide an invalid
        POS tag (instead of segfaulting).
    -   check_loaded_models() renamed to _check_loaded_models() since it's
        not intended for users.
    -   added get_unified_model_parameters() helper function which
        provides paths to parser and reranker model files.
python/bllipparser/ModelFetcher.py: new Python module which downloads
    and installs BLLIP unified parsing models. Can be used via command
    line or Python library.
python/bllipparser/ParsingShell.py: can now be launched without parsing
    models
README-python.txt: docs, examples updated. Now covers ModelFetcher
    and ParsingShell (the latter was previously distributed but not
    mentioned)
setup.py: updated with latest release information
dmcc committed Feb 10, 2014
1 parent f13217f commit 7ecb65f
Showing 5 changed files with 425 additions and 68 deletions.
106 changes: 93 additions & 13 deletions README-python.txt
@@ -1,24 +1,56 @@
The BLLIP parser (also known as the Charniak-Johnson parser or
Brown Reranking Parser) is described in the paper `Charniak
and Johnson (Association of Computational Linguistics, 2005)
<http://aclweb.org/anthology/P/P05/P05-1022.pdf>`_. This package provides
the BLLIP parser runtime along with a Python interface. Note that it
does not come with any parsing models but includes a downloader.
The primary maintenance for the parser takes place at `GitHub
<http://github.com/BLLIP/bllip-parser>`_.

Fetching parsing models
-----------------------

Before you can parse, you'll need some parsing models. ``ModelFetcher``
will help you download and install parsing models. It can be invoked
from the command line. For example, this will download and install the
standard WSJ model::

    shell% python -mbllipparser.ModelFetcher -i WSJ

Run ``python -mbllipparser.ModelFetcher`` with no arguments for a full
listing of options and available parsing models. It can also be invoked
as a Python library::

    >>> from bllipparser.ModelFetcher import download_and_install_model
    >>> download_and_install_model('WSJ', '/tmp/models')
    /tmp/models/WSJ

In this case, it would download WSJ and install it to
``/tmp/models/WSJ``. Note that it returns the path to the downloaded
model.
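This install-if-missing behavior can be sketched in a few lines. The helper below is a hypothetical simplification for illustration only; the real ``download_and_install_model`` also downloads and unpacks the model archive:

```python
import os
import tempfile

def install_model(name, target_directory):
    """Hypothetical sketch: create target_directory/name unless it
    already exists, and return the model path either way (mirroring
    the no-op behavior of download_and_install_model on reinstall)."""
    output_path = os.path.join(target_directory, name)
    if os.path.exists(output_path):
        return output_path  # already installed: nothing to do
    os.makedirs(output_path)  # the real code downloads and extracts here
    return output_path

base = tempfile.mkdtemp()
first = install_model('WSJ', base)
second = install_model('WSJ', base)  # no-op: the same path comes back
print(first == second)  # → True
```

The key point is that the returned path is stable, so callers can pass it straight to the parser constructor without checking whether a download actually happened.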

Basic usage
-----------

The easiest way to construct a parser is with the
``from_unified_model_dir`` class method. A unified model is a directory
that contains two subdirectories: ``parser/`` and ``reranker/``, each
with the respective model files::

    >>> from bllipparser import RerankingParser, tokenize
    >>> rrp = RerankingParser.from_unified_model_dir('/path/to/model/')

This can be integrated with ModelFetcher (if the model is already
installed, ``download_and_install_model`` is a no-op)::

    >>> model_dir = download_and_install_model('WSJ', '/tmp/models')
    >>> rrp = RerankingParser.from_unified_model_dir(model_dir)

You can also load parser and reranker models manually::

    >>> rrp = RerankingParser()
    >>> rrp.load_parser_model('/tmp/models/WSJ/parser')
    >>> rrp.load_reranker_model('/tmp/models/WSJ/reranker')

Parsing a single sentence and reading information about the top parse
with ``parse()``. The parser produces an *n-best list* of the *n* most
likely parses.
@@ -49,26 +81,74 @@
The reranker can be disabled by setting ``rerank=False``::

    >>> nbest_list = rrp.parse('Parser only!', rerank=False)

You can also parse text with existing POS tags (these act as soft
constraints). In this example, token 0 ('Time') should have tag VB and
token 1 ('flies') should have tag NNS::

    >>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB', 1 : 'NNS'})[0]
    ScoredParse('(S1 (NP (VB Time) (NNS flies)))', parser_score=-53.94938875760073, reranker_score=-15.841407102717749)

You don't need to specify a tag for all words: here, token 0 ('Time') should
have tag VB and token 1 ('flies') is unconstrained::

    >>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB'})[0]
    ScoredParse('(S1 (S (VP (VB Time) (NP (VBZ flies)))))', parser_score=-54.390430751112156, reranker_score=-17.290145080887005)

You can specify multiple tags for each token. When you do this, the
tags for a token will be used in decreasing priority. Token 0 ('Time')
should have tag VB, JJ, or NN and token 1 ('flies') is unconstrained::

    >>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : ['VB', 'JJ', 'NN']})[0]
    ScoredParse('(S1 (NP (NN Time) (VBZ flies)))', parser_score=-42.82904107213723, reranker_score=-12.865900776775314)
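Since ``possible_tags`` values may be a single tag or a list of tags, the mapping has to be normalized before use, and (per this commit) an invalid tag now raises ``ValueError`` instead of segfaulting. A hypothetical sketch of that normalization and validation, using a toy tagset; this is not the actual bllipparser code:

```python
# Toy tagset for illustration; the real parser validates against its
# full Penn Treebank tag inventory.
KNOWN_TAGS = {'VB', 'VBZ', 'NN', 'NNS', 'JJ'}

def normalize_possible_tags(possible_tags):
    """Hypothetical helper: make every value a list of tags and reject
    unknown tags with ValueError (mirroring parse_tagged's behavior)."""
    normalized = {}
    for token_index, tags in possible_tags.items():
        if isinstance(tags, str):
            tags = [tags]  # a bare tag becomes a one-element list
        for tag in tags:
            if tag not in KNOWN_TAGS:
                raise ValueError('Invalid POS tag: %r' % tag)
        normalized[token_index] = list(tags)
    return normalized

print(normalize_possible_tags({0: 'VB', 1: ['NN', 'JJ']}))
# → {0: ['VB'], 1: ['NN', 'JJ']}
```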

There are many parser options which can be adjusted with
``set_parser_options`` (though the defaults should work well for most
cases). The call below changes the size of the n-best list and leaves
all other options at their defaults. It returns a dictionary of the
current options::

    >>> rrp.set_parser_options(nbest=10)
    {'language': 'En', 'case_insensitive': False, 'debug': 0, 'small_corpus': True, 'overparsing': 21, 'smooth_pos': 0, 'nbest': 10}
    >>> nbest_list = rrp.parse('The list is smaller now.', rerank=False)
    >>> len(nbest_list)
    10
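The reset-then-override semantics can be modeled as a dictionary merge. The defaults below are copied from the options dictionary shown above, except for ``nbest``, whose default of 50 is an assumption; the standalone function is a hypothetical model, not the bllipparser implementation:

```python
# Assumed defaults: copied from the options dictionary shown above,
# with nbest=50 assumed as the default n-best list size.
DEFAULTS = {'language': 'En', 'case_insensitive': False, 'debug': 0,
            'small_corpus': True, 'overparsing': 21, 'smooth_pos': 0,
            'nbest': 50}

def set_parser_options(**overrides):
    """Hypothetical model of set_parser_options: every option reverts
    to its default unless overridden, and the full set is returned."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError('Unknown parser options: %s' % sorted(unknown))
    options = dict(DEFAULTS)
    options.update(overrides)
    return options

print(set_parser_options(nbest=10)['nbest'])  # → 10
```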

Use this if all you want is a tokenizer::

    >>> tokenize("Tokenize this sentence, please.")
    ['Tokenize', 'this', 'sentence', ',', 'please', '.']
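For comparison, a naive regex tokenizer reproduces that split on simple inputs; the real ``tokenize`` uses the parser's Penn Treebank tokenization rules and handles many more cases (contractions, quotes, and so on):

```python
import re

def naive_tokenize(text):
    # Illustration only: words and single punctuation marks.
    # Not the parser's actual Penn Treebank tokenizer.
    return re.findall(r"\w+|[^\w\s]", text)

print(naive_tokenize("Tokenize this sentence, please."))
# → ['Tokenize', 'this', 'sentence', ',', 'please', '.']
```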

Parsing shell
-------------

There is an interactive shell which can help visualize a parse::

    shell% python -mbllipparser.ParsingShell /path/to/model

Once in the shell, type a sentence to have the parser parse it::

    rrp> I saw the astronomer with the telescope.
    Tokens: I saw the astronomer with the telescope .

    Parser's parse:
    (S1 (S (NP (PRP I))
         (VP (VBD saw)
          (NP (NP (DT the) (NN astronomer))
           (PP (IN with) (NP (DT the) (NN telescope)))))
         (. .)))

    Reranker's parse: (parser index 2)
    (S1 (S (NP (PRP I))
         (VP (VBD saw)
          (NP (DT the) (NN astronomer))
          (PP (IN with) (NP (DT the) (NN telescope))))
         (. .)))

If you have ``nltk`` installed, you can use its tree visualization to
see the output::

    rrp> visual Show me this parse.
    Tokens: Show me this parse .

    [graphical display of the parse appears]

There is more detailed help inside the shell under the ``help`` command.
146 changes: 146 additions & 0 deletions python/bllipparser/ModelFetcher.py
@@ -0,0 +1,146 @@
#!/usr/bin/env python
# Licensed under the Apache License, Version 2.0 (the "License"); you may
# not use this file except in compliance with the License. You may obtain
# a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations
# under the License.

"""Simple BLLIP Parser unified parsing model repository and installer."""
from __future__ import division
import sys, urlparse, urllib
from os import makedirs, system, chdir, getcwd
from os.path import basename, exists, join

class ModelInfo:
    def __init__(self, model_desc, url, uncompressed_size='unknown '):
        """uncompressed_size is approximate size in megabytes."""
        self.model_desc = model_desc
        self.url = url
        self.uncompressed_size = uncompressed_size
    def __str__(self):
        return "%s [%sMB]" % (self.model_desc, self.uncompressed_size)

# should this grow large enough, we'll find a better place to store it
models = {
    'OntoNotes-WSJ': ModelInfo('OntoNotes portion of WSJ',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-OntoNotes-WSJ.tar.bz2', 61),
    'SANCL2012-Uniform': ModelInfo('Self-trained model on OntoNotes-WSJ and the Google Web Treebank',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-SANCL2012-Uniform.tar.bz2', 890),
    'WSJ+Gigaword': ModelInfo('Self-trained model on PTB2-WSJ and approx. two million sentences from Gigaword',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-Gigaword2000.tar.bz2', 473),
    'WSJ+PubMed': ModelInfo('Self-trained model on PTB2-WSJ and approx. 200k sentences from PubMed',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-PubMed.tar.bz2', 152),
    'WSJ': ModelInfo('Wall Street Journal corpus from Penn Treebank, version 2',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-no-AUX.tar.bz2', 52),
    'WSJ-with-AUX': ModelInfo('Wall Street Journal corpus from Penn Treebank, version 2 (AUXified version, deprecated)',
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-with-AUX.tar.bz2', 55),
}

class UnknownParserModel(ValueError):
    def __str__(self):
        return "Unknown parser model name: " + str(self.args[0])

def download_and_install_model(model_name, target_directory, verbose=False):
    """Downloads and installs models to a specific directory. Models
    can be specified by simple names (use list_models() for a list
    of known models) or a URL. If the model is already installed in
    target_directory, it won't download it again. Returns the path to
    the new model."""

    if model_name.lower().startswith('http'):
        parsed_url = urlparse.urlparse(model_name)
        model_url = model_name
        model_name = basename(parsed_url.path).split('.')[0]
    elif model_name in models:
        model_url = models[model_name].url
    else:
        raise UnknownParserModel(model_name)

    output_path = join(target_directory, model_name)
    if verbose:
        print "Fetching model:", model_name, "from", model_url
        print "Model directory:", output_path

    if exists(output_path):
        if verbose:
            print "Model directory already exists, not reinstalling"
        return output_path

    if verbose:
        def status_func(blocks, block_size, total_size):
            amount_downloaded = blocks * block_size
            if total_size == -1:
                sys.stdout.write('Downloaded %s\r' % amount_downloaded)
            else:
                percent_downloaded = 100 * amount_downloaded / total_size
                size = amount_downloaded / (1024 ** 2)
                sys.stdout.write('Downloaded %.1f%% (%.1f MB)\r' %
                                 (percent_downloaded, size))
    else:
        status_func = None
    downloaded_filename, headers = urllib.urlretrieve(model_url,
                                                      reporthook=status_func)
    if verbose:
        sys.stdout.write('\rDownload complete' + (' ' * 20) + '\n')
        print 'Downloaded to temporary file', downloaded_filename

    try:
        makedirs(output_path)
    except OSError, ose:
        if ose.errno != 17:  # 17 == errno.EEXIST: directory already exists
            raise

    orig_path = getcwd()
    chdir(output_path)
    # by convention, all models are currently in tar.bz2 format
    # we may want to generalize this code later
    assert downloaded_filename.lower().endswith('.bz2')
    command = 'tar xvjf %s' % downloaded_filename
    if verbose:
        print "Extracting with %r to %s" % (command, output_path)
    system(command)
    chdir(orig_path)

    return output_path

def list_models():
    print len(models), "known unified parsing models: [uncompressed size]"
    for key, model_info in sorted(models.items()):
        print '\t%-20s\t%s' % (key, model_info)

def main():
    from optparse import OptionParser
    parser = OptionParser(usage="""%prog [options]
Tool to help you download and install BLLIP Parser models.""")
    parser.add_option("-l", "--list", action='store_true',
                      help="List known parsing models.")
    parser.add_option("-i", "--install", metavar="NAME", action='append',
                      help="Install a unified parser model.")
    parser.add_option("-d", "--directory", default='./models', metavar="PATH",
                      help="Directory to install parsing models in (will be "
                           "created if it doesn't exist). Default: %default")

    (options, args) = parser.parse_args()

    if not (options.list or options.install):
        parser.print_help()
        # flip this on to make 'list' the default action
        options.list = True
        print
    if options.list:
        list_models()
    if options.install:
        for i, model in enumerate(options.install):
            if i:
                print
            try:
                download_and_install_model(model, options.directory,
                                           verbose=True)
            except UnknownParserModel, u:
                print u
                list_models()
                sys.exit(1)

if __name__ == "__main__":
    main()
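A side note on the extraction step in ``download_and_install_model`` above: it shells out to ``tar`` and temporarily changes the working directory. A sketch of an alternative using the standard-library ``tarfile`` module avoids both; the helper name is hypothetical and this is illustrative only, not part of the commit:

```python
import os
import tarfile
import tempfile

def extract_model_archive(archive_path, output_path):
    """Hypothetical alternative to 'tar xvjf' via os.system: extract a
    .tar.bz2 archive with the stdlib tarfile module, no chdir needed."""
    assert archive_path.lower().endswith('.bz2')
    with tarfile.open(archive_path, 'r:bz2') as archive:
        archive.extractall(output_path)
    return output_path

# Build a tiny .tar.bz2 to demonstrate extraction end to end.
workdir = tempfile.mkdtemp()
payload = os.path.join(workdir, 'parser-model.txt')
with open(payload, 'w') as f:
    f.write('dummy model data')
archive_path = os.path.join(workdir, 'model.tar.bz2')
with tarfile.open(archive_path, 'w:bz2') as archive:
    archive.add(payload, arcname='parser-model.txt')

output = extract_model_archive(archive_path, os.path.join(workdir, 'out'))
print(sorted(os.listdir(output)))  # → ['parser-model.txt']
```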
7 changes: 6 additions & 1 deletion python/bllipparser/ParsingShell.py
@@ -22,12 +22,17 @@

from bllipparser.RerankingParser import RerankingParser

# TODO should integrate with bllipparser.ModelFetcher

class ParsingShell(Cmd):
    def __init__(self, model):
        Cmd.__init__(self)
        self.prompt = 'rrp> '
        print "Loading models..."
        if model is None:
            self.rrp = None
        else:
            self.rrp = RerankingParser.from_unified_model_dir(model)
        self.last_nbest_list = []

    def do_visual(self, text):
