GitHub - Oneplus/cnccgbank: Chinese CCGbank conversion algorithm suite

Oneplus / cnccgbank Public

forked from jogloran/cnccgbank

Notifications You must be signed in to change notification settings
Fork 0
Star 0

Chinese CCGbank conversion algorithm suite

View license

0 stars 9 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 1,432 Commits
apps		apps
lib		lib
munge		munge
patches		patches
scripts		scripts
.gitignore		.gitignore
.svnignore		.svnignore
LICENCE		LICENCE
README		README
alltests.sh		alltests.sh
anno.py		anno.py
c		c
cmp_markedup.py		cmp_markedup.py
config.yml		config.yml
corpusparts.py		corpusparts.py
depvalid.py		depvalid.py
do_filter.sh		do_filter.sh
filter.py		filter.py
findcommon.py		findcommon.py
leaves.py		leaves.py
locate.py		locate.py
make.sh		make.sh
make_all.sh		make_all.sh
make_base.sh		make_base.sh
make_ccgbank.sh		make_ccgbank.sh
make_clean.sh		make_clean.sh
make_corpora.sh		make_corpora.sh
make_markedup.sh		make_markedup.sh
make_noprodrop.sh		make_noprodrop.sh
pp		pp
q		q
regroup.py		regroup.py
regroup_piped.sh		regroup_piped.sh
rmconj.py		rmconj.py
s		s
t		t
trouble.rb		trouble.rb
vocab.py		vocab.py

Repository files navigation

Chinese CCGbank conversion
==========================
(c) 2008-2012 Daniel Tse <cncandc@gmail.com>
University of Sydney

Use of this software is governed by the attached "Chinese CCGbank converter Licence Agreement"
supplied in the Chinese CCGbank conversion distribution. If the LICENCE file is missing, please
notify the maintainer Daniel Tse <cncandc@gmail.com>.

    Licensees shall acknowledge use of the Licensed Software and Derivative Works in all 
    publications of research based in whole or in part on their use through citation of the following publication:

    Chinese CCGbank: extracting CCG derivations from the Penn Chinese Treebank
    Daniel Tse and James R. Curran
    Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pp. 1083-1091, 
    Beijing, China, 2010.

How to obtain a copy of Chinese CCGbank:
----------------------------------------
Python version:
    The Chinese CCGbank conversion process has been tested on Python 2.7.1 and GNU bash 4.1.5.
    The scripts will require a Unix environment with bash available.

Obtaining a copy of Penn Chinese Treebank:
    The Chinese CCGbank conversion process requires a copy of Penn Chinese Treebank (tested on PCTB 6.0,
    may work on other versions; LDC catalog no. LDC2007T36), which can be obtained through the
    Linguistic Data Consortium (LDC). The LDC catalogue page for this corpus is located at:
        http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T36

Installing required dependencies:
    These Python packages are required by the conversion process:
        PyYAML (tested on 3.10)
        PLY (tested on 3.4)
        cmd2 (tested on 0.6.2)

    The easiest way to install these dependencies is through 'easy_install' (Python setuptools).
    Instructions for installing setuptools can be found here:
        http://pypi.python.org/pypi/setuptools

    Once setuptools is installed, execute:
        easy_install PyYAML
        easy_install PLY
        easy_install cmd2

Installing Python C extensions:
    The tokeniser is written in C for speed. To build it, change to the directory of the
    Chinese CCGbank converter distribution, then execute:
        cd lib && python setup.py install
        (sudo may be required for setup.py to install the library in a location accessible
         to all users)

Generating the corpus:
   While in the directory of the Chinese CCGbank converter distribution, execute:
        ./make.sh -o <CHINESE CCGBANK DESTINATION DIRECTORY> -c <CHINESE PENN TREEBANK DIRECTORY>

        where the argument of '-o' is the desired output directory, and
              the argument of '-c' is the Penn Chinese Treebank directory containing '.fid' files.

    Generating the corpus will also write debugging output to the screen. When the process is complete,
    the CCGbank corpus will be output to the directory specified by the argument of '-o'.