
Initial commit.

This version is based on my latest modifications to the parser from

    http://bllip.cs.brown.edu/download/reranking-parserAug06.tar.gz
commit 5868357986aeace60498860d06d96f545f02a9cc, committed by dmcc on Feb 11, 2011
Showing 327 changed files with 88,420 additions and 0 deletions.
@@ -0,0 +1,224 @@
+~BLLIP/reranking-parser/README
+
+(c) Mark Johnson, Eugene Charniak, 24th November 2005 --- August 2006
+
+We request acknowledgement in any publications that make use of this
+software and any code derived from this software. Please report the
+release date of the software that you are using (this is part of the
+name of the tar file you have downloaded), as this will enable others
+to compare their results to yours.
+
+MULTI-THREADED PARSING
+======================
+
+NEW!!! The first-stage parser, which accounts for about 95% of the
+total parsing time, is now multi-threaded. The default is two
+threads; the current maximum is four. To change the number (or the
+maximum), see the README file for the first-stage parser. For the
+time being a non-threaded version is also available in case threads
+cause problems on your system; send email to ec if they do. See
+below for details.
+
+COMPILING THE PARSER
+====================
+
+To compile the two-stage parser, first define GCCFLAGS appropriately
+for your machine, e.g., with csh or tcsh
+
+> setenv GCCFLAGS "-march=pentium4 -mfpmath=sse -msse2 -mmmx"
+
+or
+
+> setenv GCCFLAGS "-march=opteron -m64"
+
+Then execute
+
+> make
+
+After it has built, the parser can be run with
+
+> parse.sh <sourcefile.txt>
+
+E.g.,
+
+> parse.sh sample-data.txt
+
+The script parse-eval.sh takes a list of treebank files as arguments
+and extracts the terminal strings from them, runs the two-stage parser
+on those terminal strings and then evaluates the parsing accuracy with
+the U. Penn EVAL-B program. For example, on my machine the Penn
+Treebank 3 CD-ROM is installed at /usr/local/data/Penn3/, so the
+following code evaluates the two-stage parser on section 24.
+
+> parse-eval.sh /usr/local/data/Penn3/parsed/mrg/wsj/24/wsj*.mrg
+
+
+TRAINING THE RERANKER
+=====================
+
+Retraining the reranker takes a considerable amount of time, disk
+space and RAM. At Brown we use a dual Opteron machine with 16Gb RAM,
+and it takes around two days. You should be able to do it with only
+8Gb RAM, and maybe even with 4Gb RAM with an appropriately tweaked
+kernel (e.g., sysctl overcommit_memory, and a so-called 4Gb/4Gb split
+if you're using a 32-bit OS).
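+
+For instance, on Linux the overcommit policy can be relaxed with a
+sysctl along these lines (a sketch; whether you need it, and which
+setting, depends on your kernel):
+
+> sysctl -w vm.overcommit_memory=1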
+
+The time and memory you need depend on the features that the reranker
+extracts and the size of the n-best tree training and development
+data. You can change the features that are extracted by changing
+second-stage/programs/features/features.h, and you can reduce the size
+of the n-best tree data by reducing NPARSES in the Makefile from 50
+to, say, 25.
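+
+For example, to halve the size of the n-best training data, edit the
+NPARSES line in the Makefile (shown schematically here) so that
+
+NPARSES=50
+
+becomes
+
+NPARSES=25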
+
+You will need to edit the Makefile in order to retrain the reranker.
+
+First, you need to set the variable PENNWSJTREEBANK in Makefile to the
+directory that holds your version of the Penn WSJ Treebank. On my
+machine this is:
+
+PENNWSJTREEBANK=/usr/local/data/Penn3/parsed/mrg/wsj/
+
+You'll also need the Boost C++ and the PETSc/TAO C++ libraries in
+order to retrain the reranker. The environment variables PETSC_DIR
+and TAO_DIR should point to the installation directories of this
+software. On my machine I define these variables in my .login file
+as follows:
+
+setenv PETSC_DIR /usr/local/share/petsc
+setenv TAO_DIR /usr/local/share/tao
+setenv PETSC_ARCH linux
+setenv BOPT O_c++
+
+Many modern Linux distributions ship with the Boost C++ libraries
+pre-installed. If Boost is not part of your standard libraries and
+headers, you will need to install it and add an include path for it
+to your GCCFLAGS. For example, if you have installed the Boost C++
+libraries in /home/mj/C++/boost, then your GCCFLAGS environment
+variable should be something like:
+
+> setenv GCCFLAGS "-march=pentium4 -mfpmath=sse -msse2 -mmmx -I /home/mj/C++/boost"
+
+or
+
+> setenv GCCFLAGS "-march=opteron -m64 -I /home/mj/C++/boost"
+
+Once this is set up, you retrain the reranker as follows:
+
+> make reranker
+> make nbesttrain
+> make eval-reranker
+
+The script train-eval-reranker.sh does all of this.
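+That is, instead of running the three make targets by hand you can
+simply run (presumably from the top-level directory, where the
+Makefile lives):
+
+> sh train-eval-reranker.sh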
+
+The reranker goal builds all of the programs, nbesttrain constructs
+the 20 folds of n-best parses required for training, and eval-reranker
+extracts features, estimates their weights and evaluates the
+reranker's performance on the development data (dev) and the two test
+data sets (test1 and test2).
+
+If you have a multi-processor machine, you can run two (or more)
+jobs in parallel by running
+
+> make -j 2 nbesttrain
+
+Currently this only helps for nbesttrain (but this is the slowest
+step, so maybe this is not so bad).
+
+The Makefile contains a number of variables that control how the
+training process works. The most important of these is the VERSION
+variable. You should do all of your experiments with VERSION=nonfinal,
+and only run with VERSION=final once to produce results for publication.
+
+If VERSION is nonfinal then the reranker trains on WSJ PTB sections
+2-19, sections 20-21 are used for development, section 22 is used as
+test1 and section 24 is used as test2 (this approximately replicates
+the Collins 2000 setup).
+
+If VERSION is final then the reranker trains on WSJ PTB sections 2-21,
+section 24 is used for development, section 22 is used as test1 and
+section 23 is used as test2.
+
+The Makefile also contains other variables you may want to change,
+such as NBEST, which specifies how many parses are extracted from
+each sentence, and NFOLDS, which specifies how many folds are
+created.
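+
+These variables can also be overridden on the make command line
+rather than by editing the Makefile, e.g. (the values here are only
+illustrative):
+
+> make nbesttrain NBEST=25 NFOLDS=10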
+
+If you decide to experiment with new features or new feature weight
+estimators, take a close look at the Makefile. If you change the
+features, please also change FEATURESNICKNAME; this way your new
+features won't overwrite our existing ones. Similarly, if you change
+the feature weight estimator, please pick a new ESTIMATORNICKNAME,
+and if you change the n-best parser, please pick a new
+NBESTPARSERNICKNAME; this way your new n-best parses or feature
+weights won't overwrite the existing ones.
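+
+For example, a retraining run with your own nicknames might look
+like this (the nicknames are illustrative):
+
+> make eval-reranker FEATURESNICKNAME=myfeatures ESTIMATORNICKNAME=myestimator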
+
+To get rid of (many of) the object files produced in compilation, run:
+
+> make clean
+
+Training, especially constructing the 20 folds of n-best parses,
+produces a lot of temporary files which you can remove if you want to.
+To remove the temporary files used to construct the 20 fold n-best
+parses, run:
+
+> make nbesttrain-clean
+
+All of the information needed by the reranker is in
+second-stage/models. To remove everything except the information
+needed for running the reranking parser, run:
+
+> make train-clean
+
+To clean up everything, including the data needed for running the
+reranking parser, run:
+
+> make real-clean
+
+
+NON-THREADED PARSER
+===================
+
+To use the non-threaded parser instead, change the following line in
+the Makefile:
+
+NBESTPARSER=first-stage/PARSE/parseIt
+
+It should now read:
+
+NBESTPARSER=first-stage/PARSE/oparseIt
+
+That is, the line is identical except for the "o" in oparseIt.
+
+Then run oparse.sh rather than parse.sh.
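+
+For example, mirroring the threaded usage above:
+
+> oparse.sh sample-data.txt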
+
+
+INSTALLING PETSC AND TAO
+========================
+
+You'll need to have PETSc and TAO installed in order to retrain the
+reranker.
+
+These installation instructions work for gcc version 4.2.1 (you also
+need g++ and gfortran).
+
+1. Unpack PETSc and TAO somewhere, and set shell variables to point
+to those directories (put the shell variable definitions in your
+.bash_profile or equivalent):
+
+export PETSC_DIR=/usr/local/share/petsc
+export TAO_DIR=/usr/local/share/tao
+export PETSC_ARCH="linux"
+export BOPT=O_c++
+
+cd /usr/local/share
+ln -s petsc-2.3.3-p6 petsc
+ln -s tao-1.9 tao
+
+2. Configure and build PETSc
+
+cd petsc
+FLAGS="-march=native -mfpmath=sse -msse2 -mmmx -O3 -ffast-math"
+./config/configure.py --with-cc=gcc --with-fc=gfortran --with-cxx=g++ --download-f-blas-lapack=1 --with-mpi=0 --with-clanguage=C++ --with-shared=1 --with-dynamic=1 --with-debugging=0 --with-x=0 --with-x11=0 COPTFLAGS="$FLAGS" FOPTFLAGS="$FLAGS" CXXOPTFLAGS="$FLAGS"
+make all
+
+3. Configure and build TAO
+
+cd ../tao
+make all
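+
+As a quick sanity check that the environment variables are set
+correctly, confirm the PETSc header is where PETSC_DIR says it
+should be (the path below is an example):
+
+> ls $PETSC_DIR/include/petsc.h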
@@ -0,0 +1,22 @@
+
+README.results:
+==============
+
+This file records the results of the n-best parser under
+VERSION=nonfinal: training on sections 2-19, using sections 20-21 as
+dev, and testing on section 22 (test1) and section 24 (test2).
+
+version of 4th September 2006
+=============================
+
+# Evaluating second-stage/features/ec50spnonfinal/dev.gz
+# 1219273 features in second-stage/models/ec50spnonfinal/features.gz
+# 3984 sentences in second-stage/features/ec50spnonfinal/dev.gz
+# ncorrect = 66991, ngold = 73905, nparse = 73397, f-score = 0.909574, -log P = 18859.3, 1219273 nonzero features, mean w = 0.000210852, sd w = 0.000403686
+# Evaluating second-stage/features/ec50spnonfinal/test1.gz
+# 1219273 features in second-stage/models/ec50spnonfinal/features.gz
+# 1700 sentences in second-stage/features/ec50spnonfinal/test1.gz
+# ncorrect = 28022, ngold = 30633, nparse = 30460, f-score = 0.917356, -log P = 8087.09, 1219273 nonzero features, mean w = 0.000210852, sd w = 0.000403686
+# Evaluating second-stage/features/ec50spnonfinal/test2.gz
+# 1219273 features in second-stage/models/ec50spnonfinal/features.gz
+# 1346 sentences in second-stage/features/ec50spnonfinal/test2.gz
+# ncorrect = 23121, ngold = 25729, nparse = 25327, f-score = 0.905711, -log P = 6257.9, 1219273 nonzero features, mean w = 0.000210852, sd w = 0.000403686
@@ -0,0 +1,83 @@
+#! /bin/sh
+
+# This script recompiles the reranker code, rebuilds the nbest trees
+# and retrains and evaluates the reranker itself.
+
+# You can change the flags below here
+
+NBESTPARSERBASEDIR=first-stage
+NBESTPARSERNICKNAME=ec
+
+# NBESTPARSERBASEDIR=first-stage-Aug06
+# NBESTPARSERNICKNAME=Aug06
+
+# FEATUREEXTRACTOR=second-stage/programs/features/extract-spfeatures
+# FEATUREEXTRACTORFLAGS="-l -c -i -s 5"
+# FEATURESNICKNAME=spc
+
+FEATUREEXTRACTOR=second-stage/programs/features/extract-nfeatures
+FEATUREEXTRACTORFLAGS="-l -c -i -s 5 -f splh"
+FEATURESNICKNAME=splh
+
+# ESTIMATOR=second-stage/programs/wlle/cvlm
+# ESTIMATORFLAGS="-l 1 -c0 10 -Pyx_factor 1 -debug 10 -ns -1"
+# ESTIMATORNICKNAME=cvlm-l1c10P1-openmp
+
+ESTIMATOR=second-stage/programs/wlle/cvlm-owlqn
+ESTIMATORFLAGS="-l 1 -c 10 -F 1 -d 10 -n -1 -t 1e-7"
+ESTIMATORNICKNAME=owlqn-l1c10t1e-7
+
+# ESTIMATOR=second-stage/programs/wlle/cvlm-owlqn
+# ESTIMATORFLAGS="-l 1 -p 1 -c 10 -F 1 -d 10 -n -1 -t 1e-7"
+# ESTIMATORNICKNAME=owlqn-l1c10p1t1e-7
+
+# ESTIMATOR=second-stage/programs/wlle/avper
+# ESTIMATORFLAGS="-n 10 -d 0 -F 1 -N 10"
+# ESTIMATORNICKNAME=avper
+
+# ESTIMATOR=second-stage/programs/wlle/gavper
+# ESTIMATORFLAGS="-a -n 10 -d 10 -F 1 -m 999999"
+# ESTIMATORNICKNAME=gavper
+
+# ESTIMATOR=second-stage/programs/wlle/hlm
+# ESTIMATORFLAGS="-l 1 -c 10 -C 10000 -F 1 -d 100 -n 0 -S 7 -t 1e-7"
+# ESTIMATORNICKNAME=hlm2S7
+
+
+###############################################################################
+#
+# You shouldn't need to change anything below here
+#
+FLAGS="NBESTPARSERBASEDIR=$NBESTPARSERBASEDIR NBESTPARSERNICKNAME=$NBESTPARSERNICKNAME FEATUREEXTRACTOR=$FEATUREEXTRACTOR FEATURESNICKNAME=$FEATURESNICKNAME ESTIMATOR=$ESTIMATOR ESTIMATORNICKNAME=$ESTIMATORNICKNAME"
+
+# echo make clean $FLAGS
+# make clean
+
+echo
+echo make reranker $FLAGS
+make reranker $FLAGS
+
+# echo
+# echo make -j 8 nbesttrain
+# make -j 8 nbesttrain
+
+# Avoid remaking the nbest parses. Warning -- you'll recompute the features if you run this!
+#
+# echo
+# echo make touch-nbest $FLAGS
+# make touch-nbest $FLAGS
+
+# The nonfinal version trains on sections 2-19, uses sections 20-21 as dev,
+# section 22 as test1 and 24 as test2 (this is the "Collins' split")
+#
+echo
+echo make eval-reranker VERSION=nonfinal $FLAGS FEATUREEXTRACTORFLAGS="$FEATUREEXTRACTORFLAGS" ESTIMATORFLAGS="$ESTIMATORFLAGS"
+time make eval-reranker VERSION=nonfinal $FLAGS FEATUREEXTRACTORFLAGS="$FEATUREEXTRACTORFLAGS" ESTIMATORFLAGS="$ESTIMATORFLAGS"
+
+# The final version trains on sections 2-21, uses section 24 as dev,
+# section 22 as test1 and section 23 as test2 (this is the standard PARSEVAL split)
+#
+echo
+echo make eval-reranker VERSION=final $FLAGS FEATUREEXTRACTORFLAGS="$FEATUREEXTRACTORFLAGS" ESTIMATORFLAGS="$ESTIMATORFLAGS"
+make eval-reranker VERSION=final $FLAGS FEATUREEXTRACTORFLAGS="$FEATUREEXTRACTORFLAGS" ESTIMATORFLAGS="$ESTIMATORFLAGS"
+
@@ -0,0 +1,66 @@
+##------------------------------------------##
+## Debug mode ##
+## 0: No debugging ##
+## 1: print data for individual sentence ##
+##------------------------------------------##
+DEBUG 0
+
+##------------------------------------------##
+## MAX error ##
+## Number of errors at which to stop the ##
+## process. This is useful if there ##
+## could be tokenization errors. ##
+## The process will stop when this number##
+## of errors has accumulated. ##
+##------------------------------------------##
+MAX_ERROR 10
+
+##------------------------------------------##
+## Cut-off length for statistics ##
+## At the end of evaluation, the ##
+## statistics for the sentences of length##
+## less than or equal to this number will##
+## be shown, on top of the statistics ##
+## for all the sentences ##
+##------------------------------------------##
+CUTOFF_LEN 40
+
+##------------------------------------------##
+## unlabeled or labeled bracketing ##
+## 0: unlabeled bracketing ##
+## 1: labeled bracketing ##
+##------------------------------------------##
+LABELED 1
+
+##------------------------------------------##
+## Delete labels ##
+## list of labels to be ignored. ##
+## If it is a pre-terminal label, delete ##
+## the word along with the brackets. ##
+## If it is a non-terminal label, just ##
+## delete the brackets (don't delete ##
+## children). ##
+##------------------------------------------##
+DELETE_LABEL TOP
+DELETE_LABEL -NONE-
+DELETE_LABEL ,
+DELETE_LABEL :
+DELETE_LABEL ``
+DELETE_LABEL ''
+DELETE_LABEL .
+
+##------------------------------------------##
+## Delete labels for length calculation ##
+## list of labels to be ignored for ##
+## length calculation purpose ##
+##------------------------------------------##
+DELETE_LABEL_FOR_LENGTH -NONE-
+
+##------------------------------------------##
+## Equivalent labels, words ##
+## the pairs are considered equivalent ##
+## This is non-directional. ##
+##------------------------------------------##
+EQ_LABEL ADVP PRT
+
+# EQ_WORD Example example