This directory provides an NLP extractor that is a wrapper for Stanford NLP. Its input is textual data in TSV format, and its output is one or more TSV files containing processed sentence data that can be loaded directly into PostgreSQL / Greenplum tables.
In (1) we show how to use this NLP extractor as an "extractor" in DeepDive, and in (2) we provide a script to run it as a stand-alone application in parallel, without DeepDive; this is the recommended way if you have lots of data, since it can achieve more parallelism.
Compile the NLP extractor with the following command in this directory:
sbt stage
The input file is a TSV file of the following form. Each line is a document. The file should have two columns: the first column is `document_id` (text format), a unique identifier for the document; the second column is `text`, all sentences of the document in plain text:
doc1\tFull text in doc1
doc2\tFull text in doc2
...
Note that the input TSV file should not have a header row.
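If your documents already live in the database, one way to produce this headerless TSV is PostgreSQL's COPY ... TO STDOUT, which emits tab-separated text without a header by default. The `documents` table and its column names here are hypothetical:

psql YOUR_DB_NAME -c "COPY (SELECT document_id, text FROM documents) TO STDOUT" > input.tsv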
Output is another TSV file that contains multiple columns:

- `document_id`: the document_id from the input.
- `sentence`: the raw sentence text, same as the input.
- `words`: (PSQL-formatted) array of words.
- `lemma`: array of lemmatized words.
- `pos_tags`: array of part-of-speech tags.
- `ner_tags`: array of named entity tags.
- `dependencies`: array of collapsed dependencies.
- `sentence_offset`: the index / offset of this sentence in the document.
- `sentence_id`: a unique identifier for the sentence.
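For illustration (the exact tokenization depends on Stanford NLP), a `words` value for the sentence "Barack Obama was born." would be a PostgreSQL-style array literal such as:

{Barack,Obama,was,born,.}

This curly-brace form is what COPY accepts for `text[]` columns.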
You can create a table like the following so that the output TSV file can be imported into the database. (Note that this table is the `output_relation` in DeepDive.)
CREATE TABLE sentences(
document_id bigint,
sentence text,
words text[],
lemma text[],
pos_tags text[],
dependencies text[],
ner_tags text[],
sentence_offset bigint,
sentence_id text
);
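If you end up with a single output file from running the extractor directly, you can load it with the same COPY pattern used for the parallel outputs at the end of this README; `output.tsv` here is a hypothetical file name:

cat output.tsv | psql YOUR_DB_NAME -c "COPY sentences FROM STDIN"

Note that COPY matches TSV columns to table columns by position, so the table's column order must match the extractor's output order.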
When `run.sh` is used with DeepDive, the ideal parallelism sometimes cannot be achieved because of memory problems. In this case, we recommend using `run_parallel.sh` instead. It does the following steps:
- Splits your input file `INPUT_FILE` into chunks in `INPUT_FILE.split/`.
- Uses the system parallelism tool `xargs` to run `run.sh` in parallel; see the sketch after this list. The outputs are saved to `INPUT_FILE.split/*.out`.
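The pattern is roughly the following. This is a minimal sketch that assumes `run.sh` reads input TSV on stdin and writes output TSV on stdout; it is not the script's literal contents:

# Split the input into fixed-size chunks, then run one run.sh process
# per chunk, with at most 8 running at a time.
mkdir -p INPUT_FILE.split
split -l 100 INPUT_FILE INPUT_FILE.split/input-
ls INPUT_FILE.split/input-?? \
  | xargs -P 8 -I {} sh -c './run.sh < "$1" > "$1.out"' _ {}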
Run it with the following command:
./run_parallel.sh INPUT_FILE PARALLELISM [INPUT_BATCH_SIZE=100] [SENTENCE_WORDS_LIMIT=120]
- `INPUT_FILE`: your input TSV file.
- `PARALLELISM`: a number indicating the desired parallelism, e.g. 8.
- `INPUT_BATCH_SIZE`: how many lines each file contains after the split. Default: 100.
- `SENTENCE_WORDS_LIMIT`: skip dependency parsing for any sentence with more words than this number, which speeds up parsing. Default: 120.
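For example, to process `input.tsv` with 8-way parallelism, spelling out the default batch size and sentence word limit explicitly:

./run_parallel.sh input.tsv 8 100 120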
When finished, you should manually import the files in `INPUT_FILE.split/*.out` into your database. You can use a COPY query like this:
cat INPUT_FILE.split/*.out | psql YOUR_DB_NAME -c "COPY sentences FROM STDIN"
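As a quick sanity check after the import (database and table names as above), the row count should match the total number of lines across the `*.out` files:

psql YOUR_DB_NAME -c "SELECT COUNT(*) FROM sentences"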