GitHub - neubig/egret: A fork of the Egret parser that fixes a few bugs

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
Egret		Egret
chn_grammar		chn_grammar
eng_grammar		eng_grammar
jpn_grammar		jpn_grammar
script		script
.gitignore		.gitignore
Makefile		Makefile
README		README
test-chn.txt		test-chn.txt
testeng.gra		testeng.gra
testeng.txt		testeng.txt

Repository files navigation

Egret: A set of syntax parsing tools, including inside-outside grammar estimation, pure PCFG parser, and a latent-annotation-based syntax parser (a reimplementation of Berkeley Parser, with F1-Measure(length <=40words): Chinese 85.45%, English 89.64%, All length: Chinese 84.16%, English 89.02%). It features static and dynamic pruning, and can output n-best trees and packed forests for other NLP applications. It is open source and free for research purposes. 

Directory
=========================
1) Egret 
   - Egret, Project Dir, for code generation
   - src , source code
   - Egret.sln , VS 8.0 solution file
2) chn_grammar, contains latent-annotation pcfg grammar trained on Penn Chinese TreeBank 6.0
   - levelx.grammar, x level grammar file
   - levelx.lexicon, x level lexicon file
3) eng_grammar, contains latent-annotation pcfg grammar trained on Penn English Treebank 3.0
4) testeng.gra, a pure pcfg grammar example
5) testeng.txt, English test sentences
6) test-chn.txt, Chinese test sentences
7) make-linux.sh, shell file to make the linux binary file
8) egret.exe, win32 binary file 

How to compile
=========================
For windows, please open the file "Egret.sln", and click "build". Note: you may install stlport first.
For linux, just run make-linux.sh

Example Usage
=========================
***Example usage of the latent-annotation pcfg Parser
--get 100 best trees
  egret -lapcfg -i=test-chn.txt -data=chn_grammar -n=100 > log.txt
  
--parse with n level grammar (default is 5)
  egret -lapcfg -i=test-chn.txt -data=chn_grammar -grammarLevel=3 > log.txt
 
--get packed forest (dynamic pruning)
  egret -lapcfg -i=test-chn.txt -data=chn_grammar -nbest4threshold=100 -printForest > log.txt
  -nbest4threshold=100 means all the hyperedges which do not appear in the 100 nbest trees will be pruned
  However, please note that even after pruning there are still exponential number of additional trees embedded in the forest 
  because of the sharing structure of forest.
  
--get packed forest (static pruning)
  egret -lapcfg -i=test-chn.txt -data=chn_grammar -threshold=10 -printForest > log.txt
  -threshold=10 means all the hyperedges with margin larger than 10 to the viterbi path will be prune
   
***Example usage of the pure pcfg Parser
  egret -pcfg -g=testeng.gra -i=testeng.txt -n=3
 
***Example usage of inside-outside extimation
  egret -io -g=testeng.gra -i=testeng.txt -pass=3

Acknowledgment
=========================
Parts of this software were written when I studied syntax parsing in NUS and XMU. 
I would like to thank Prof. Hwee Tou Ng(NUS) and Prof. Xiaodong Shi(XMU) for their guidance and support during this process.

Reference
=========================
The inside-outside algorithm was described in
"Trainable grammars for speech recognition."
J. Baker (1979). In J. J. Wolf and D. H. Klatt, editors, Speech communication papers presented at the 97th meeting of the Acoustical Society of America, pages 547¨C550, Cambridge, MA, June 1979. MIT.
And
Christopher D. Manning, Heinrich Sh¨¹tze (1999): Foundations of statistical natural language processing.

The K-Best Algorithm was described in
"Better k-best Parsing." 
Liang Huang and David Chiang (2005). 
In Proceedings of the 9th International Workshop on Parsing Technologies (IWPT 2005), Vancouver, BC. 

The latent-annotation model was described in 
"Learning Accurate, Compact, and Interpretable Tree Annotation"
Slav Petrov, Leon Barrett, Romain Thibaux and Dan Klein 
in COLING-ACL 2006  
And
"Improved Inference for Unlexicalized Parsing"
Slav Petrov and Dan Klein 
in HLT-NAACL 2007
And 
I also learned a lot of valuable details from the source code of Berkeley Parser.

Contact
=========================
Syntax parsing is so full of fun that I have to share my joy with those who are interested in this topic.
Hui Zhang 
zhangh1982@gmail.com
http://sites.google.com/site/zhangh1982

-------------------------
This version of Egret has been modified by Graham Neubig to fix a few bugs.
Major changes include:
* Preventing a segmentation fault in English parsing on encountering unusual
  unknown words
* Removing a flag that lexicalizes punctuation in Chinese parsing only for
  tree output (but not forest output)
* Adding a Japanese parsing model (trained on the Japanese word dependency
  corpus of Mori et al. 2014)

About

A fork of the Egret parser that fixes a few bugs

Readme

Activity

10 stars

4 watching

4 forks

Report repository

Releases

No releases published

Packages

No packages published

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Egret

Egret

chn_grammar

chn_grammar

eng_grammar

eng_grammar

jpn_grammar

jpn_grammar

script

script

.gitignore

.gitignore

Makefile

Makefile

README

README

test-chn.txt

test-chn.txt

testeng.gra

testeng.gra

testeng.txt

testeng.txt

Repository files navigation

About

Releases

Packages

Contributors 2

Languages

neubig/egret

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages