JingL1014/Phrase_extractor

=============================================================================
1. Prerequisites
=============================================================================
1a. TurboParser
You need to download TurboParser v2.3 and compile it following the instructions in TurboParser's README file.
LINK:http://www.cs.cmu.edu/~ark/TurboParser/
To use TurboParser, pre-trained Arabic models for POS tagging and dependency parsing are needed:

-> Download the two models from the following link: https://1drv.ms/u/s!An8tQToldhOkhO9O8p3lB1oR5MUl6A
-> Uncompress the models.
-> Create a folder "arabic" under directory models/ in the root folder of TurboParser and save the models in that folder (e.g. /users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/models/arabic/arabic_tagger.model); see the sketch below.
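For instance, from the TurboParser root directory (the source paths and the parser model file name below are illustrative; use the actual names of your uncompressed model files):

mkdir -p models/arabic
mv /path/to/uncompressed/arabic_tagger.model models/arabic/
mv /path/to/uncompressed/arabic_parser.model models/arabic/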


UDpipe
An alternative parser is UDpipe. The pre-compiled binary package (udpipe-1.0.0-bin.zip) can be downloaded at: https://github.com/ufal/udpipe/releases/tag/v1.0.0. To use UDpipe, a language model is needed, which can be downloaded at:
https://1drv.ms/u/s!An8tQToldhOkh_JIep-cJhimMDiboQ
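As a quick, optional check that UDpipe and the model work together, you can run the binary directly (it lives in a platform-specific subdirectory of udpipe-1.0.0-bin, e.g. bin-linux64/; paths below are illustrative):

./bin-linux64/udpipe --tokenize --tag --parse /path/to/arabic-ud-1.2-160523.udpipe input.txt

This reads raw text from input.txt and writes tokenized, tagged, and parsed output in CoNLL-U format to standard output.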


1b. Stanford word segmenter
You need to download Stanford segmenter v3.6.0 and unzip the file.
LINK:http://nlp.stanford.edu/software/segmenter.shtml#Download
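For reference, the segmenter can be run standalone on Arabic text roughly as follows, from inside the unzipped segmenter directory (the run scripts in section 2 invoke it for you; the classifier file name is the one shipped with the 3.6.0 release and may differ in your copy):

java -mx1g -cp "*" edu.stanford.nlp.international.arabic.process.ArabicSegmenter -loadClassifier data/arabic-segmenter-atb+bn+arztrain.ser.gz -textFile input.txt > segmented.txt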

1c. Stanford CoreNLP
If your input file is in document format, you need to use Stanford CoreNLP to do sentence splitting.
LINK:http://stanfordnlp.github.io/CoreNLP/download.html
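For reference, sentence splitting alone can be run from inside the unzipped CoreNLP directory like this (a sketch; run_document_X.sh drives CoreNLP for you):

java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file input.txt -outputFormat text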

============================================================================
2. Usage
============================================================================
Download the entire repository. Shell scripts run_sentence_X.sh and run_document_X.sh are provided to perform phrase extraction on an input Arabic file.

usage: Based on the input format (sentence or document), run either run_sentence_X.sh or run_document_X.sh, where X is turboParser or udpipe depending on your choice of parser. The first argument is the input file name. The second argument is the format of the extracted noun phrases: long, short or both.
-> "long": the extracted noun phrases include not only the head but also additional modifiers (adjectives, prepositional phrases).
-> "short": the extracted noun phrases include just the head.
-> "both": both long and short noun phrases are extracted.

run_sentence_X.sh Sample_arabic_sent.xml [NounPhraseOption]
run_document_X.sh Sample_arabic_doc.xml [NounPhraseOption]
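For example, to extract both long and short noun phrases from the bundled sample sentence file using TurboParser (the exact script file name follows the X convention above; check the repository for the names as shipped):

./run_sentence_turboParser.sh Sample_arabic_sent.xml both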


2a. Input data format
Two types of input format are allowed:
-> sentence: the input file has one sentence per entry. A sample input file, Sample_arabic_sent.xml, can be found under directory data/.
-> document: the input file has one document per entry. For this type of input, sentence splitting is performed before the other processing steps. A sample input file, Sample_arabic_doc.xml, can be found under directory data/.

2b. Running run_document_X.sh or run_sentence_X.sh
You need to change the values of the following parameters to match your setup (a consolidated example appears after the list):

-> SCRIPT: the location where you put folder "scripts"
example:
SCRIPT=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/phrase_extractor/scripts

-> FILE: the location where you put the input file 
example:
FILE=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0/phrase_extractor/data

-> STANFORD_SEG: the location where the stanford segmenter is saved
example:
STANFORD_SEG=/users/ljwinnie/Downloads/stanford-segmenter-2015-12-09

-> TurboParserPath: the location where TurboParser is saved
example:
TurboParserPath=/users/ljwinnie/toolbox/turboParser/TurboParser-2.3.0

-> udpipePath: the location where UDpipe is saved
example:
udpipePath=/users/ljwinnie/Desktop/petrarch2/phrase_extractor/udpipe-1.0.0-bin

-> languageModel: the location where the language model used in UDpipe is saved
example:
languageModel=/users/ljwinnie/Desktop/petrarch2/phrase_extractor/udpipe-1.0.0-bin/model/arabic-ud-1.2-160523.udpipe

-> STANFORD_CORENLP: the location where Stanford CoreNLP is saved
example:
STANFORD_CORENLP=/users/ljwinnie/toolbox/stanford-corenlp-full-2015-01-29
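Putting it together, the parameter block at the top of a run script would look something like this once edited (all paths below are illustrative placeholders for your own installation locations):

SCRIPT=/path/to/phrase_extractor/scripts
FILE=/path/to/phrase_extractor/data
STANFORD_SEG=/path/to/stanford-segmenter-2015-12-09
TurboParserPath=/path/to/TurboParser-2.3.0
udpipePath=/path/to/udpipe-1.0.0-bin
languageModel=/path/to/udpipe-1.0.0-bin/model/arabic-ud-1.2-160523.udpipe
STANFORD_CORENLP=/path/to/stanford-corenlp-full-2015-01-29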
