This directory provides an NLP extractor that is a wrapper for Stanford NLP. Its input is textual data in TSV format, and its output is one or more TSV files containing processed sentence data that can be loaded directly into PostgreSQL / Greenplum tables.
In (1) we show how to use this NLP extractor as an "extractor" in DeepDive, and in (2) we provide a script to run it as a stand-alone application in parallel, without DeepDive; this is the recommended way if you have lots of data, since it can achieve more parallelism.
Compile the NLP extractor with the following command in this directory:
sbt stage
The input file is a TSV file of the following form. Each line is a document. The file should have two columns: the first column is `document_id` (text format), a unique identifier for the document; the second column is `text`, all sentences of the document in plain text:
doc1\tFull text in doc1
doc2\tFull text in doc2
...
Note that the input TSV file should not have a header row.
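If your documents already live in the database, one way to produce this headerless TSV is PostgreSQL's COPY ... TO STDOUT, which emits tab-separated text without a header by default. The `documents` table and its column names here are hypothetical:

psql YOUR_DB_NAME -c "COPY (SELECT document_id, text FROM documents) TO STDOUT" > input.tsv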
Output is another TSV file that contains multiple columns:

- `document_id`: the document_id from the input.
- `sentence`: the raw sentence text, same as the input.
- `words`: (PSQL-formatted) array of words.
- `lemma`: array of lemmatized words.
- `pos_tags`: array of part-of-speech tags.
- `ner_tags`: array of named entity tags.
- `dependencies`: array of collapsed dependencies.
- `sentence_offset`: the index / offset of this sentence in the document.
- `sentence_id`: a unique identifier for the sentence.
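For illustration (the exact tokenization depends on Stanford NLP), a `words` value for the sentence "Barack Obama was born." would be a PostgreSQL-style array literal such as:

{Barack,Obama,was,born,.}

This curly-brace form is what COPY accepts for `text[]` columns.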
You can create a table like the following so that the output TSV file can be imported into the database. (Note that this table is the `output_relation` in DeepDive.)
CREATE TABLE sentences(
document_id bigint,
sentence text,
words text[],
lemma text[],
pos_tags text[],
dependencies text[],
ner_tags text[],
sentence_offset bigint,
sentence_id text
);
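If you end up with a single output file from running the extractor directly, you can load it with the same COPY pattern used for the parallel outputs at the end of this README; `output.tsv` here is a hypothetical file name:

cat output.tsv | psql YOUR_DB_NAME -c "COPY sentences FROM STDIN"

Note that COPY matches TSV columns to table columns by position, so the table's column order must match the extractor's output order.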
When `run.sh` is used with DeepDive, the ideal parallelism sometimes cannot be achieved because of memory problems. In this case, we recommend using `run_parallel.sh` instead. It does the following steps:
- Splits your input file `INPUT_FILE` into chunks in `INPUT_FILE.split/`.
- Uses the system parallelism tool `xargs` to run `run.sh` in parallel; see the sketch after this list. The outputs are saved to `INPUT_FILE.split/*.out`.
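The pattern is roughly the following. This is a minimal sketch that assumes `run.sh` reads input TSV on stdin and writes output TSV on stdout; it is not the script's literal contents:

# Split the input into fixed-size chunks, then run one run.sh process
# per chunk, with at most 8 running at a time.
mkdir -p INPUT_FILE.split
split -l 100 INPUT_FILE INPUT_FILE.split/input-
ls INPUT_FILE.split/input-?? \
  | xargs -P 8 -I {} sh -c './run.sh < "$1" > "$1.out"' _ {}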
Run it with the following command:
./run_parallel.sh INPUT_FILE PARALLELISM [INPUT_BATCH_SIZE=100] [SENTENCE_WORDS_LIMIT=120]
- `INPUT_FILE`: your input TSV file.
- `PARALLELISM`: a number indicating the desired parallelism, e.g. 8.
- `INPUT_BATCH_SIZE`: how many lines each file contains after the split. Default: 100.
- `SENTENCE_WORDS_LIMIT`: skip dependency parsing for any sentence with more words than this number, which speeds up parsing. Default: 120.
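For example, to process `input.tsv` with 8-way parallelism, spelling out the default batch size and sentence word limit explicitly:

./run_parallel.sh input.tsv 8 100 120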
When finished, you should manually import the files in `INPUT_FILE.split/*.out` into your database. You can use a COPY query like this:
cat INPUT_FILE.split/*.out | psql YOUR_DB_NAME -c "COPY sentences FROM STDIN"
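As a quick sanity check after the import (database and table names as above), the row count should match the total number of lines across the `*.out` files:

psql YOUR_DB_NAME -c "SELECT COUNT(*) FROM sentences"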