New Python module release: RerankingParser updated, ModelFetcher added

python/bllipparser/RerankingParser.py: Various improvements.
    Sentence class now ensures tokens are strings to avoid crashing
    on non-string input.
    RerankingParser class:
    -   load_unified_model_dir() renamed to
        from_unified_model_dir(); the old name is now deprecated.
    -   Parser options can now be changed with the set_parser_options()
        method.
    -   parse() and parse_tagged() both default to a new rerank
        mode, rerank='auto', which will only rerank if a reranker model
        is available.
    -   parse_tagged() now throws a ValueError if you provide an
        invalid POS tag instead of segfaulting (see the sketch after
        this list).
    -   check_loaded_models() renamed to _check_loaded_models() since it's
        not intended for users.
    -   added get_unified_model_parameters() helper function which
        provides paths to parser and reranker model files.
python/bllipparser/ModelFetcher.py: new Python module which downloads
    and installs BLLIP unified parsing models. Can be used via command
    line or Python library.
python/bllipparser/ParsingShell.py: can now be launched without parsing
    models
README-python.txt: docs, examples updated. Now covers ModelFetcher
    and ParsingShell (the latter was previously distributed but not
    mentioned).
setup.py: updated with the latest release information.
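
A minimal sketch of the new behaviors (assuming the standard WSJ model
has been installed to /tmp/models/WSJ as in the README examples; the tag
'NOT-A-TAG' is made up for illustration)::

    from bllipparser import RerankingParser

    # rerank='auto' is the new default: the reranker runs only if a
    # reranker model is loaded
    rrp = RerankingParser.from_unified_model_dir('/tmp/models/WSJ')
    nbest_list = rrp.parse('Time flies')

    # invalid POS tags now raise ValueError instead of segfaulting
    try:
        rrp.parse_tagged(['Time', 'flies'], possible_tags={0: 'NOT-A-TAG'})
    except ValueError, error:
        print error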
commit 7ecb65fa1770afece25bfa69136f9792564c159e (1 parent: f13217f)
dmcc authored
106 README-python.txt
@@ -1,24 +1,56 @@
The BLLIP parser (also known as the Charniak-Johnson parser or
Brown Reranking Parser) is described in the paper `Charniak
and Johnson (Association for Computational Linguistics, 2005)
-<http://aclweb.org/anthology/P/P05/P05-1022.pdf>`_. This code
-provides a Python interface to the parser. Note that it does
-not contain any parsing models which must be downloaded
-separately (for example, `WSJ self-trained parsing model
-<http://cs.brown.edu/~dmcc/selftraining/selftrained.tar.gz>`_).
+<http://aclweb.org/anthology/P/P05/P05-1022.pdf>`_. This package provides
+the BLLIP parser runtime along with a Python interface. Note that it
+does not come with any parsing models but includes a downloader.
The primary maintenance for the parser takes place at `GitHub
<http://github.com/BLLIP/bllip-parser>`_.
+Fetching parsing models
+-----------------------
+
+Before you can parse, you'll need some parsing models. ``ModelFetcher``
+will help you download and install parsing models. It can be invoked
+from the command line. For example, this will download and install the
+standard WSJ model::
+
+ shell% python -mbllipparser.ModelFetcher -i WSJ
+
+Run ``python -mbllipparser.ModelFetcher`` with no arguments for a full
+listing of options and available parsing models. It can also be invoked
+as a Python library::
+
+ >>> from bllipparser.ModelFetcher import download_and_install_model
+ >>> download_and_install_model('WSJ', '/tmp/models')
+ /tmp/models/WSJ
+
+In this case, it would download WSJ and install it to
+``/tmp/models/WSJ``. Note that it returns the path to the downloaded
+model.
+
Basic usage
-----------
The easiest way to construct a parser is with the
-``load_unified_model_dir`` class method. A unified model is a directory
+``from_unified_model_dir`` class method. A unified model is a directory
that contains two subdirectories: ``parser/`` and ``reranker/``, each
with the respective model files::
>>> from bllipparser import RerankingParser, tokenize
- >>> rrp = RerankingParser.load_unified_model_dir('/path/to/model/')
+ >>> rrp = RerankingParser.from_unified_model_dir('/path/to/model/')
+
+This can be integrated with ModelFetcher (if the model is already
+installed, ``download_and_install_model`` is a no-op)::
+
+ >>> model_dir = download_and_install_model('WSJ', '/tmp/models')
+ >>> rrp = RerankingParser.from_unified_model_dir(model_dir)
+
+You can also load parser and reranker models manually::
+
+ >>> rrp = RerankingParser()
+ >>> rrp.load_parser_model('/tmp/models/WSJ/parser')
+ >>> rrp.load_reranker_model('/tmp/models/WSJ/reranker')
Parsing a single sentence and reading information about the top parse
with ``parse()``. The parser produces an *n-best list* of the *n* most
@@ -49,26 +81,74 @@ The reranker can be disabled by setting ``rerank=False``::
>>> nbest_list = rrp.parse('Parser only!', rerank=False)
-Parsing text with existing POS tag (soft) constraints. In this example,
-token 0 ('Time') should have tag VB and token 1 ('flies') should have
-tag NNS::
+You can also parse text with existing POS tags (these act as soft
+constraints). In this example, token 0 ('Time') should have tag VB and
+token 1 ('flies') should have tag NNS::
>>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB', 1 : 'NNS'})[0]
ScoredParse('(S1 (NP (VB Time) (NNS flies)))', parser_score=-53.94938875760073, reranker_score=-15.841407102717749)
-You don't need to specify a tag for all words: token 0 ('Time') should
+You don't need to specify a tag for every word: here, token 0 ('Time') should
have tag VB and token 1 ('flies') is unconstrained::
>>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB'})[0]
ScoredParse('(S1 (S (VP (VB Time) (NP (VBZ flies)))))', parser_score=-54.390430751112156, reranker_score=-17.290145080887005)
-You can specify multiple tags for each token: token 0 ('Time') should
-have tag VB, JJ, or NN and token 1 ('flies') is unconstrained::
+You can specify multiple tags for each token. When you do this, the
+tags for a token will be used in decreasing priority. Here, token 0
+('Time') should have tag VB, JJ, or NN and token 1 ('flies') is
+unconstrained::
>>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : ['VB', 'JJ', 'NN']})[0]
ScoredParse('(S1 (NP (NN Time) (VBZ flies)))', parser_score=-42.82904107213723, reranker_score=-12.865900776775314)
+There are many parser options which can be adjusted with
+``set_parser_options`` (though the defaults should work well for
+most cases). The call below changes the size of the n-best list and
+picks the defaults for all other options. It returns a dictionary of
+the current options::
+
+ >>> rrp.set_parser_options(nbest=10)
+ {'language': 'En', 'case_insensitive': False, 'debug': 0, 'small_corpus': True, 'overparsing': 21, 'smooth_pos': 0, 'nbest': 10}
+ >>> nbest_list = rrp.parse('The list is smaller now.', rerank=False)
+ >>> len(nbest_list)
+ 10
+
Use this if all you want is a tokenizer::
>>> tokenize("Tokenize this sentence, please.")
['Tokenize', 'this', 'sentence', ',', 'please', '.']
+
+Parsing shell
+-------------
+
+There is an interactive shell which can help visualize a parse::
+
+ shell% python -mbllipparser.ParsingShell /path/to/model
+
+Once in the shell, type a sentence to have the parser parse it::
+
+ rrp> I saw the astronomer with the telescope.
+ Tokens: I saw the astronomer with the telescope .
+
+ Parser's parse:
+ (S1 (S (NP (PRP I))
+ (VP (VBD saw)
+ (NP (NP (DT the) (NN astronomer))
+ (PP (IN with) (NP (DT the) (NN telescope)))))
+ (. .)))
+
+ Reranker's parse: (parser index 2)
+ (S1 (S (NP (PRP I))
+ (VP (VBD saw)
+ (NP (DT the) (NN astronomer))
+ (PP (IN with) (NP (DT the) (NN telescope))))
+ (. .)))
+
+If you have ``nltk`` installed, you can use its tree visualization to
+see the output::
+
+ rrp> visual Show me this parse.
+ Tokens: Show me this parse .
+
+ [graphical display of the parse appears]
+
+There is more detailed help inside the shell under the ``help`` command.
146 python/bllipparser/ModelFetcher.py
@@ -0,0 +1,146 @@
+#!/usr/bin/env python
+# Licensed under the Apache License, Version 2.0 (the "License"); you may
+# not use this file except in compliance with the License. You may obtain
+# a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+# License for the specific language governing permissions and limitations
+# under the License.
+
+"""Simple BLLIP Parser unified parsing model repository and installer."""
+from __future__ import division
+import sys, urlparse, urllib
+from os import makedirs, system, chdir, getcwd
+from os.path import basename, exists, join
+
+class ModelInfo:
+ def __init__(self, model_desc, url, uncompressed_size='unknown '):
+ """uncompressed_size is approximate size in megabytes."""
+ self.model_desc = model_desc
+ self.url = url
+ self.uncompressed_size = uncompressed_size
+ def __str__(self):
+ return "%s [%sMB]" % (self.model_desc, self.uncompressed_size)
+
+# should this grow large enough, we'll find a better place to store it
+models = {
+ 'OntoNotes-WSJ' : ModelInfo('OntoNotes portion of WSJ', 'http://nlp.stanford.edu/~mcclosky/models/BLLIP-OntoNotes-WSJ.tar.bz2', 61),
+ 'SANCL2012-Uniform' : ModelInfo('Self-trained model on OntoNotes-WSJ and the Google Web Treebank',
+ 'http://nlp.stanford.edu/~mcclosky/models/BLLIP-SANCL2012-Uniform.tar.bz2', 890),
+ 'WSJ+Gigaword' : ModelInfo('Self-trained model on PTB2-WSJ and approx. two million sentences from Gigaword',
+ 'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-Gigaword2000.tar.bz2', 473),
+ 'WSJ+PubMed' : ModelInfo('Self-trained model on PTB2-WSJ and approx. 200k sentences from PubMed',
+ 'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-PubMed.tar.bz2', 152),
+ 'WSJ' : ModelInfo('Wall Street Journal corpus from Penn Treebank, version 2',
+ 'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-no-AUX.tar.bz2', 52),
+ 'WSJ-with-AUX' : ModelInfo('Wall Street Journal corpus from Penn Treebank, version 2 (AUXified version, deprecated)',
+ 'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-with-AUX.tar.bz2', 55),
+}
+
+class UnknownParserModel(ValueError):
+ def __str__(self):
+ return "Unknown parser model name: " + self[0]
+
+def download_and_install_model(model_name, target_directory, verbose=False):
+ """Downloads and installs models to a specific directory. Models
+ can be specified by simple names (use list_models() for a list
+ of known models) or a URL. If the model is already installed in
+ target_directory, it won't download it again. Returns the path to
+ the new model."""
+
+ if model_name.lower().startswith('http'):
+ parsed_url = urlparse.urlparse(model_name)
+ model_url = model_name
+ model_name = basename(parsed_url.path).split('.')[0]
+ elif model_name in models:
+ model_url = models[model_name].url
+ else:
+ raise UnknownParserModel(model_name)
+
+ output_path = join(target_directory, model_name)
+ if verbose:
+ print "Fetching model:", model_name, "from", model_url
+ print "Model directory:", output_path
+
+ if exists(output_path):
+ if verbose:
+ print "Model directory already exists, not reinstalling"
+ return output_path
+
+ if verbose:
+ def status_func(blocks, block_size, total_size):
+ amount_downloaded = blocks * block_size
+ if total_size == -1:
+ sys.stdout.write('Downloaded %s\r' % amount_downloaded)
+ else:
+ percent_downloaded = 100 * amount_downloaded / total_size
+ size = amount_downloaded / (1024 ** 2)
+ sys.stdout.write('Downloaded %.1f%% (%.1f MB)\r' % (percent_downloaded, size))
+ else:
+ status_func = None
+ downloaded_filename, headers = urllib.urlretrieve(model_url, reporthook=status_func)
+ if verbose:
+ sys.stdout.write('\rDownload complete' + (' ' * 20) + '\n')
+ print 'Downloaded to temporary file', downloaded_filename
+
+ try:
+ makedirs(output_path)
+ except OSError, ose:
+ if ose.errno != 17:
+ raise
+
+ orig_path = getcwd()
+ chdir(output_path)
+ # by convention, all models are currently in tar.bz2 format
+ # we may want to generalize this code later
+ assert downloaded_filename.lower().endswith('.bz2')
+ command = 'tar xvjf %s' % downloaded_filename
+ if verbose:
+ print "Extracting with %r to %s" % (command, output_path)
+ system(command)
+ chdir(orig_path)
+
+ return output_path
+
+def list_models():
+ print len(models), "known unified parsing models: [uncompressed size]"
+ for key, model_info in sorted(models.items()):
+ print '\t%-20s\t%s' % (key, model_info)
+
+def main():
+ from optparse import OptionParser
+ parser = OptionParser(usage="""%prog [options]
+
+Tool to help you download and install BLLIP Parser models.""")
+ parser.add_option("-l", "--list", action='store_true', help="List known parsing models.")
+ parser.add_option("-i", "--install", metavar="NAME", action='append',
+ help="Install a unified parser model.")
+ parser.add_option("-d","--directory", default='./models', metavar="PATH",
+ help="Directory to install parsing models in (will be created if it doesn't exist). Default: %default")
+
+ (options, args) = parser.parse_args()
+
+ if not (options.list or options.install):
+ parser.print_help()
+ # flip this on to make 'list' the default action
+ options.list = True
+ print
+ if options.list:
+ list_models()
+ if options.install:
+ for i, model in enumerate(options.install):
+ if i:
+ print
+ try:
+ ret = download_and_install_model(model, options.directory, verbose=True)
+ except UnknownParserModel, u:
+ print u
+ list_models()
+ sys.exit(1)
+
+if __name__ == "__main__":
+ main()
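
As a usage sketch of the module above: models may be installed by name
or by URL (both forms below are drawn from the models table and option
defaults in this file; downloading requires network access)::

    from bllipparser.ModelFetcher import (download_and_install_model,
                                          list_models)

    # print the table of known models with approximate uncompressed sizes
    list_models()

    # install by name; returns the model path and is a no-op if the
    # model is already present in the target directory
    path = download_and_install_model('WSJ', './models', verbose=True)

    # a URL works too -- the model name is derived from the basename
    # of the URL path (here, 'BLLIP-WSJ-PubMed')
    download_and_install_model(
        'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-PubMed.tar.bz2',
        './models', verbose=True)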
7 python/bllipparser/ParsingShell.py
@@ -22,12 +22,17 @@
from bllipparser.RerankingParser import RerankingParser
+# TODO should integrate with bllipparser.ModelFetcher
+
class ParsingShell(Cmd):
def __init__(self, model):
Cmd.__init__(self)
self.prompt = 'rrp> '
print "Loading models..."
- self.rrp = RerankingParser.load_unified_model_dir(model)
+ if model is None:
+ self.rrp = None
+ else:
+ self.rrp = RerankingParser.from_unified_model_dir(model)
self.last_nbest_list = []
def do_visual(self, text):
222 python/bllipparser/RerankingParser.py
@@ -14,13 +14,15 @@
lower-level (SWIG-generated) CharniakParser and JohnsonReranker modules
so you don't need to interact with them directly."""
-import os.path
+from os.path import exists, join
import CharniakParser as parser
import JohnsonReranker as reranker
class ScoredParse:
- """Represents a single parse and its associated parser probability
- and reranker score."""
+ """Represents a single parse and its associated parser
+ probability and reranker score. Note that ptb_parse is actually
+ a CharniakParser.InputTree rather than a string (str()ing it will
+    return the actual PTB parse)."""
def __init__(self, ptb_parse, parser_score=None, reranker_score=None,
parser_rank=None, reranker_rank=None):
self.ptb_parse = ptb_parse
@@ -46,6 +48,9 @@ def __init__(self, text_or_tokens, max_sentence_length=399):
self.sentrep = parser.tokenize('<s> ' + text_or_tokens + ' </s>',
max_sentence_length)
else:
+ # text_or_tokens is a sequence -- need to make sure that each
+ # element is a string to avoid crashing
+ text_or_tokens = map(str, text_or_tokens)
self.sentrep = parser.SentRep(text_or_tokens)
def get_tokens(self):
tokens = []
@@ -67,7 +72,7 @@ def __init__(self, sentrep, parses):
self._reranked = False
def __getattr__(self, key):
- """Defer anything unimplemented to our list of ScoredParse objects."""
+ """Delegate everything else to our list of ScoredParse objects."""
return getattr(self.parses, key)
def sort_by_reranker_scores(self):
@@ -121,20 +126,21 @@ def __str__(self):
return parser.asNBestList(self._parses)
def as_reranker_input(self, lowercase=True):
"""Convert the n-best list to an internal structure used as input
- to the reranker. You shouldn't typically need to call this."""
+ to the reranker. You shouldn't typically need to call this."""
return reranker.readNBestList(str(self), lowercase)
class RerankingParser:
"""Wraps the Charniak parser and Johnson reranker into a single
- object. In general, the RerankingParser is not thread safe."""
+ object. Note that RerankingParser is not thread safe."""
def __init__(self):
"""Create an empty reranking parser. You'll need to call
- load_parsing_model() at minimum and load_reranker_model() if
- you're using the reranker. See also the load_unified_model_dir()
+ load_parser_model() at minimum and load_reranker_model() if
+ you're using the reranker. See also the from_unified_model_dir()
classmethod which will take care of calling both of these
for you."""
self._parser_model_loaded = False
self.parser_model_dir = None
+ self.parser_options = {}
self.reranker_model = None
self._parser_thread_slot = parser.ThreadSlot()
self.unified_model_dir = None
@@ -148,43 +154,43 @@ def __repr__(self):
(self.__class__.__name__, self.parser_model_dir,
self.reranker_model)
- def load_parsing_model(self, model_dir, language='En',
- case_insensitive=False, nbest=50, small_corpus=True,
- overparsing=21, debug=0, smoothPos=0):
+ def load_parser_model(self, model_dir, **parser_options):
"""Load the parsing model from model_dir and set parsing
- options. In general, the default options should suffice. Note
- that the parser does not allow loading multiple models within
- the same process."""
+ options. In general, the default options should suffice but see
+ the set_parser_options() method for details. Note that the parser
+ does not allow loading multiple models within the same process
+ (calling this function twice will raise a RuntimeError)."""
if self._parser_model_loaded:
- raise ValueError('Parser is already loaded and can only be loaded once.')
- if not os.path.exists(model_dir):
+ raise RuntimeError('Parser is already loaded and can only be loaded once.')
+ if not exists(model_dir):
raise ValueError('Parser model directory %r does not exist.' % model_dir)
self._parser_model_loaded = True
- parser.loadModel(model_dir)
self.parser_model_dir = model_dir
- parser.setOptions(language, case_insensitive, nbest, small_corpus,
- overparsing, debug, smoothPos)
+ parser.loadModel(model_dir)
+ self.set_parser_options(**parser_options)
def load_reranker_model(self, features_filename, weights_filename,
feature_class=None):
"""Load the reranker model from its feature and weights files. A feature
class may optionally be specified."""
- if not os.path.exists(features_filename):
+ if not exists(features_filename):
raise ValueError('Reranker features filename %r does not exist.' % \
features_filename)
- if not os.path.exists(weights_filename):
+ if not exists(weights_filename):
raise ValueError('Reranker weights filename %r does not exist.' % \
weights_filename)
self.reranker_model = reranker.RerankerModel(feature_class,
features_filename,
weights_filename)
- def parse(self, sentence, rerank=True, max_sentence_length=399):
+ def parse(self, sentence, rerank='auto', max_sentence_length=399):
"""Parse some text or tokens and return an NBestList with the
- results. sentence can be a string or a sequence. If it is a
- string, it will be tokenized. If rerank is True, we will rerank
- the n-best list."""
- self.check_loaded_models(rerank)
+ results. sentence can be a string or a sequence. If it is a
+ string, it will be tokenized. If rerank is True, we will rerank
+ the n-best list, if False the reranker will not be used. rerank
+ can also be set to 'auto' which will only rerank if a reranker
+ model is loaded."""
+ rerank = self._check_loaded_models(rerank)
sentence = Sentence(sentence, max_sentence_length)
try:
@@ -196,20 +202,30 @@ def parse(self, sentence, rerank=True, max_sentence_length=399):
nbest_list.rerank(self)
return nbest_list
- def parse_tagged(self, tokens, possible_tags, rerank=True):
- """Parse some pre-tagged, pre-tokenized text. tokens is a
- sequence of strings. possible_tags is map from token indices
- to possible POS tags. Tokens without an entry in possible_tags
- will be unconstrained by POS. If rerank is True, we will
- rerank the n-best list."""
- self.check_loaded_models(rerank)
+ def parse_tagged(self, tokens, possible_tags, rerank='auto'):
+ """Parse some pre-tagged, pre-tokenized text. tokens must be a
+ sequence of strings. possible_tags is map from token indices
+ to possible POS tags (strings). Tokens without an entry in
+ possible_tags will be unconstrained by POS. POS tags must be
+ in the terms.txt file in the parsing model or else you will get
+ a ValueError. If rerank is True, we will rerank the n-best list,
+ if False the reranker will not be used. rerank can also be set to
+ 'auto' which will only rerank if a reranker model is loaded."""
+ rerank = self._check_loaded_models(rerank)
+ if isinstance(tokens, basestring):
+ raise ValueError("tokens must be a sequence, not a string.")
ext_pos = parser.ExtPos()
for index in range(len(tokens)):
tags = possible_tags.get(index, [])
if isinstance(tags, basestring):
tags = [tags]
- ext_pos.addTagConstraints(parser.VectorString(tags))
+ tags = map(str, tags)
+ valid_tags = ext_pos.addTagConstraints(parser.VectorString(tags))
+ if not valid_tags:
+ # at least one of the tags is bad -- find out which ones
+ # and throw a ValueError
+ self._find_bad_tag_and_raise_error(tags)
sentence = Sentence(tokens)
parses = parser.parse(sentence.sentrep, ext_pos,
@@ -219,36 +235,102 @@ def parse_tagged(self, tokens, possible_tags, rerank=True):
nbest_list.rerank(self)
return nbest_list
- def check_loaded_models(self, rerank):
+ def _find_bad_tag_and_raise_error(self, tags):
+ ext_pos = parser.ExtPos()
+ bad_tags = set()
+ for tag in set(tags):
+ good_tag = ext_pos.addTagConstraints(parser.VectorString([tag]))
+ if not good_tag:
+ bad_tags.add(tag)
+
+ raise ValueError("Invalid POS tags (not present in the parser's terms.txt file): %s" % ', '.join(sorted(bad_tags)))
+
+ def _check_loaded_models(self, rerank):
+ """Given a reranking mode (True, False, 'auto') determines
+ whether we have the appropriately loaded models. Also returns
+ whether the reranker should be used (essentially resolves the
+ value of rerank if rerank='auto')."""
if not self._parser_model_loaded:
raise ValueError("Parser model has not been loaded.")
- if rerank and not self.reranker_model:
+ if rerank == True and not self.reranker_model:
raise ValueError("Reranker model has not been loaded.")
+ if rerank == 'auto':
+ return bool(self.reranker_model)
+ else:
+ return rerank
+
+ def set_parser_options(self, language='En', case_insensitive=False,
+ nbest=50, small_corpus=True, overparsing=21, debug=0, smooth_pos=0):
+ """Set options for the parser. Note that this is called
+ automatically by load_parser_model() so you should only need to
+ call this to update the parsing options. The method returns a
+ dictionary of the new options.
+
+ The options are as follows: language is a string describing
+ the language. Currently, it can be one of En (English), Ch
+ (Chinese), or Ar (Arabic). case_insensitive will make the parser
+ ignore capitalization. nbest is the maximum size of the n-best
+ list. small_corpus=True enables additional smoothing (originally
+ intended for training from small corpora, but helpful in many
+ situations). overparsing determines how much more time the parser
+ will spend on a sentence relative to the time it took to find the
+ first possible complete parse. This affects the speed/accuracy
+ tradeoff. debug takes a non-negative integer. Setting it higher
+ than 0 will cause the parser to print debug messages (surprising,
+ no?). Setting smooth_pos to a number higher than 0 will cause the
+ parser to assign that value as the probability of seeing a known
+ word in a new part-of-speech (one never seen in training)."""
+ if not self._parser_model_loaded:
+ raise RuntimeError('Parser must already be loaded (call load_parser_model() first)')
+
+ parser.setOptions(language, case_insensitive, nbest, small_corpus,
+ overparsing, debug, smooth_pos)
+ self.parser_options = {
+ 'language': language,
+ 'case_insensitive': case_insensitive,
+ 'nbest': nbest,
+ 'small_corpus': small_corpus,
+ 'overparsing': overparsing,
+ 'debug': debug,
+ 'smooth_pos': smooth_pos
+ }
+ return self.parser_options
@classmethod
- def load_unified_model_dir(this_class, model_dir, parsing_options=None,
+ def load_unified_model_dir(this_class, *args, **kwargs):
+ """Deprecated. Use from_unified_model_dir() instead as this
+ method will eventually disappear."""
+ import warnings
+        warnings.warn('RerankingParser.load_unified_model_dir() is deprecated, use RerankingParser.from_unified_model_dir() instead.')
+ return this_class.from_unified_model_dir(*args, **kwargs)
+
+ @classmethod
+ def from_unified_model_dir(this_class, model_dir, parsing_options=None,
reranker_options=None):
"""Create a RerankingParser from a unified parsing model on disk.
- A unified parsing model should have the following filesystem structure:
+ A unified parsing model should have the following filesystem
+ structure:
parser/
- Charniak parser model: should contain pSgT.txt, *.g files,
- and various others
+ Charniak parser model: should contain pSgT.txt, *.g files
+ among others
reranker/
- features.gz -- features for reranker
- weights.gz -- corresponding weights of those features
+ features.gz or features.bz2 -- features for reranker
+ weights.gz or weights.bz2 -- corresponding weights of those
+ features
"""
parsing_options = parsing_options or {}
reranker_options = reranker_options or {}
- rrp = this_class()
- rrp.load_parsing_model(model_dir + '/parser/', **parsing_options)
+ (parser_model_dir, reranker_features_filename,
+ reranker_weights_filename) = get_unified_model_parameters(model_dir)
- reranker_model_dir = model_dir + '/reranker/'
- features_filename = reranker_model_dir + 'features.gz'
- weights_filename = reranker_model_dir + 'weights.gz'
+ rrp = this_class()
+ if parser_model_dir:
+ rrp.load_parser_model(parser_model_dir, **parsing_options)
+ if reranker_features_filename and reranker_weights_filename:
+ rrp.load_reranker_model(reranker_features_filename,
+ reranker_weights_filename, **reranker_options)
- rrp.load_reranker_model(features_filename, weights_filename,
- **reranker_options)
rrp.unified_model_dir = model_dir
return rrp
@@ -259,3 +341,45 @@ def tokenize(text, max_sentence_length=399):
longer than max_sentence_length tokens, it will be truncated."""
sentence = Sentence(text)
return sentence.get_tokens()
+
+def get_unified_model_parameters(model_dir):
+ """Determine the actual parser and reranker model filesystem entries
+ for a unified parsing model. Returns a triple:
+
+ (parser_model_dir, reranker_features_filename,
+ reranker_weights_filename)
+
+ Any of these can be None if that part of the model is not present
+ on disk (though, if you have only one of the reranker model files,
+ the reranker will not be loaded).
+
+ A unified parsing model should have the following filesystem structure:
+
+ parser/
+ Charniak parser model: should contain pSgT.txt, *.g files
+ among others
+ reranker/
+ features.gz or features.bz2 -- features for reranker
+ weights.gz or weights.bz2 -- corresponding weights of those
+ features
+ """
+ if not exists(model_dir):
+ raise IOError("Model directory %r does not exist" % model_dir)
+
+ parser_model_dir = join(model_dir, 'parser')
+ if not exists(parser_model_dir):
+ parser_model_dir = None
+ reranker_model_dir = join(model_dir, 'reranker')
+
+ def get_reranker_model_filename(name):
+ filename = join(reranker_model_dir, '%s.gz' % name)
+ if not exists(filename):
+ # try bz2 version
+ filename = join(reranker_model_dir, '%s.bz2' % name)
+ if not exists(filename):
+ filename = None
+ return filename
+
+ features_filename = get_reranker_model_filename('features')
+ weights_filename = get_reranker_model_filename('weights')
+ return (parser_model_dir, features_filename, weights_filename)
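
To illustrate the new get_unified_model_parameters() helper, here is
a minimal sketch that mirrors what from_unified_model_dir() now does
internally (the model path is hypothetical)::

    from bllipparser.RerankingParser import (RerankingParser,
                                             get_unified_model_parameters)

    # any element of the returned triple can be None if that part of
    # the model is missing on disk
    (parser_model_dir, features_filename,
     weights_filename) = get_unified_model_parameters('/tmp/models/WSJ')

    rrp = RerankingParser()
    if parser_model_dir:
        rrp.load_parser_model(parser_model_dir)
    if features_filename and weights_filename:
        rrp.load_reranker_model(features_filename, weights_filename)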
12 setup.py
@@ -63,18 +63,20 @@ def run(args):
reranker_wrapper]]
# what's with the -O0? well, using even the lowest levels of optimization
-# (gcc -O1) cause symbols to be inlined and disappear in _JohnsonReranker.so.
-# it's not clear how to fix this at this point.
+# (gcc -O1) causes one symbol which we wrap with SWIG to be inlined and
+# disappear in _JohnsonReranker.so, which causes an ImportError. This
+# will hopefully be addressed in the near future.
reranker_module = Extension('bllipparser._JohnsonReranker',
sources=reranker_sources,
extra_compile_args=['-iquote', reranker_base, '-O0'])
setup(name='bllipparser',
- version='2013.10.16-1',
+ version='2014.02.09',
description='Python bindings for the BLLIP natural language parser',
long_description='See http://pypi.python.org/pypi/bllipparser/',
- author='David McClosky',
- author_email='notsoweird+pybllipparser@gmail.com',
+ author='Eugene Charniak, Mark Johnson, David McClosky, many others',
+ maintainer='David McClosky',
+ maintainer_email='notsoweird+pybllipparser@gmail.com',
classifiers=[
'Development Status :: 4 - Beta',
'Intended Audience :: Science/Research',