first-stage/README

/*
 * Copyright 1999, 2000, 2001, 2005, 2006 Brown University, Providence, RI.
 * 
 *                         All Rights Reserved
 * 
 * Permission to use, copy, modify, and distribute this software and its
 * documentation for any purpose other than its incorporation into a
 * commercial product is hereby granted without fee, provided that the
 * above copyright notice appear in all copies and that both that
 * copyright notice and this permission notice appear in supporting
 * documentation, and that the name of Brown University not be used in
 * advertising or publicity pertaining to distribution of the software
 * without specific, written prior permission.
 * 
 * BROWN UNIVERSITY DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE,
 * INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY
 * PARTICULAR PURPOSE.  IN NO EVENT SHALL BROWN UNIVERSITY BE LIABLE FOR
 * ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
 * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
 * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
 * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 */

----------------------------------------------------------------------

1.  Basic Usage

The parser (which is to be found in the sub-directory PARSE)expects
sentences delimited by <s> ...</s>, and outputs the parsed versions in
Penn tree-bank style.  (The <s> and </s> must be separated by spaces
from all other text.) So if the input is

<s> (``He'll work at the factory.'') </s>

the output will be (to cout):

(S1 (PRN (-LRB- -LRB-) (S (`` ``) (NP (PRP He)) (VP (MD 'll) (VP (VB work) (PP (IN at) (NP (DT the) (NN factory))))) (. .) ('' '')) (-RRB- -RRB-)))

If you want to make it slightly easier for humans to read, use the command
line argument -P (for pretty print), in which case you will get:

(S1 (PRN (-LRB- -LRB-)
     (S (`` ``)
      (NP (PRP He))
      (VP (MD 'll)
       (VP (VB work) (PP (IN at) (NP (DT the) (NN factory)))))
      (. .)
      ('' ''))
     (-RRB- -RRB-)))

The parser will take input from either cin, or, if given the name
of a file, from that file.  So in the latter case the call to
the parser would be:

shell> parseIt <path to directory with parsing statistics>  <file of sentences>
E.g.,
shell> parseIt ../DATA/EN/ testSentence
(Note that as the parser now has three separate DATA files, one each
for English, Chinese, and English Language Modeling, the DATA directories
have been made separate subdirectories under the directory DATA).

As indicated above, the parser will first tokenize the input.
If you do NOT want to to tokenize (for some reason you are
handing it pretokenized input, as you would do if you were
testing it's performance on the tree-bank), give it a -K option.

2. Compilation instructions

shell> cd TRAIN
shell> make all
shell> cd ..
shell> cd PARSE
shell> make all

3. N-best Parsing

This version of the parser can produce multiple-best parses.  So if
you what 50 alternative parses rather than just one, just add -N50
to the command line.

In multiparse mode the output format is slightly different;
#of-parses sentence-indicator-string
logProb(parse1)
parse1

logProb(parse2)
parse2

etc.

The sentence indicator string will typically just a sentence number.
However, if the input to the parser is of the form <s sentence-id >
... </s>, then the sentence-id proveded will be used instead.  This is
useful if, e.g., you want to know where article boundaries are.

4. Other options

-S tells the parser to remain silent when it cannot parse a sentence
(it just goes on to the next one.

The parser can now parse Chinese.  It requires that the Chinese
characters already be grouped into words.  Assuming you have 
trained on the Chinese Tree-bank from LDC (see the README for
the TRAIN programs), you tell the parser to be expecting Chinese
by giving it the command-line option -LCh.  (The default is
English, which is also be specified by -LEn .)  The files you
need to train Chinese are in DATA/CH/.

The parser will ignore any sentence consisting of > 100 words +
punctuation.  To change this to, say 200 you give it the on-line
argument -l200.

The parser is set to be case sensitive.  To make it case insensitive
add the command-line flat -C .

Currently there are various array sizes that make 400 the absolute
maximum sentence length.  To allow for longer sentences change (in
Feature.h)
#define MAXSENTLEN 400
Similarly to allow for a larger dictionary of words from training increase
#define MAXNUMWORDS 50000

To see debugging information give it the on-line argument -d#
where # is a number > 10.  As the numbers get larger, the verbosity of
the information increases.

5. Training

There is a subdirectory TRAIN which contains the programs used to
collect the statistics the parser requires from tree-bank data.  As
the parser comes with the statistics it needs you will only need this
if you want to try experiments with the parser on more (or less, or
different) tree-bank data.  For more information see the README file
in TRAIN.

6. Language Modeling

To use the parser as the language model described in Charniak 2001
(Proceedings of ACL) you must first retrain the data using the
settings found in DATA/LM/. 

Then give parseIt a -M command-line argument.  If the data is from
speech, and thus all one case, also use the -C flag.

The output in -L  mode is of the form:

log-grammar-probability	  log-trigram-probability   log-mixed-probability
parse

Again, if the data is from speech and has a limited vocabulary, it
will often be the case that the parser will have a very difficult time
finding a parse because of incorrect words (or, in simulated speech
output, the presence of "unk" the unknown word replacement), and there
will be many parses with equally bad probabilities.  In such cases the
pruning that keeps memory in bounds for 50-best parsing fails.  So
just use 1-best, or maybe 10 best.

7. Faster Parsing

The defaulty speed/accuracy setting should give you the results in the
published papers.  It is, however, easy to get faster parsing at the
expense of some accuracy.  So a command-line argument of -T50 costs
you about a percent of parsing accuracy, but rather than 1.4
sentences/second you will get better than 6 sentences/second.  (The
default is -T210.)

8. Multi-threaded version

parseIt is multi threaded.  It currently assumes two threads (for dual
processors).  To change this, use the command line argument, -t4 to
have it use, e,g, 4 threads.  Currently the maximum number of threads
allowed is 4.  To change this change the following line in Features.h
and recompile parseItT.

#define MAXNUMTHREADS 4

9. evalTree

evalTree takes penn-treebank parse trees from cin, and outputs to cout
sentence-number log2(parse-tree-probability)
for each tree, one per line.

shell> evalTree <path to directory with parsing statistics> 

If the tree is assigned zero probability it returns 0 for the log2
probability.

For reasons that would take us too far afield, about 13% of the time
it returns a probability that is too high.  If you want to be warned
when it is doing this, give evalTree a -W command-line argument and
the output will have an "!" at the end of the line.

10. Parsing from tagged input

This can now be done using a command such as the following:

shell> parseIt -K -Einput.tags <model dir/> input.sgml

and input.sgml looks like this:

<s> This is a test sentence . </s>

and input.tags looks like this:

This DT
is VBZ
a DT
test NN
sentence NN
. .
---

Each token is given a list of zero or more tags and sentences are separated by "---".  If a token is given zero tags, the standard tagging mechanism will be employed.  If a token is given multiple tags, they will each be considered.

Note that the tokenization must match exactly between these files (tokens are space-separated in input.sgml).  To ensure that tokenization matches, you should pretokenize your input and supply the -K flag.

11. Frequently confusing errors

a.  Parser prints something like the following:

    Can't open terms file ../DATA/ENterms.txt
    parseIt: headFinder.C:39: void readHeadInfoEn(std::string&): Assertion `headStrm' failed.

This error means that you forgot to add a / after the model directory (e.g. DATA/EN instead of DATA/EN/).

b.  If parser provides no output at all

This is most likely caused by not having spaces around the <s> and </s> brackets, i.e.

    <s>This is a test sentence.</s>

instead of

    <s> This is a test sentence. </s>