project
sbt
src
test
.gitignore
README.md
build.sbt
run.sh
run_parallel.sh
setup.sh

Parser

Run setup.sh to install dependencies and build the parser.

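The parser expects its input to contain one document per line. Each document is a JSON object with a key field ("item_id" in the example below) and a content field ("content" in the example below). The field names are passed to the parser with the -k and -v options.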

{ "item_id":"doc1", "content":"Here is the content of my document.\nAnd here's another line." }
{ "item_id":"doc2", "content":"Here's another document." }
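If your documents are not yet in this format, the following is a minimal sketch of how such a file could be produced. The helper name to_json_lines and the sample documents are illustrative assumptions, not part of this package. The point it shows is that newlines inside a document must be escaped so that each document stays on a single line.

```python
import json
import sys


def to_json_lines(documents, key_field="item_id", content_field="content"):
    """Serialize (key, text) pairs as one JSON object per line.

    Hypothetical helper: the field names default to the ones used in
    the README example and should match the -k and -v options.
    """
    lines = []
    for key, text in documents:
        # json.dumps escapes embedded newlines as \n, so every document
        # occupies exactly one line of the output.
        lines.append(json.dumps({key_field: key, content_field: text}))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = [
        ("doc1", "Here is the content of my document.\nAnd here's another line."),
        ("doc2", "Here's another document."),
    ]
    sys.stdout.write(to_json_lines(docs))
```

Redirecting the script's standard output to input.json gives a file that can be used with the commands below.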

You can run the NLP pipeline on 1 core as follows:

cat input.json | ./run.sh -i json -k "item_id" -v "content" > output.tsv

You can run the NLP pipeline on 16 cores as follows:

./run_parallel.sh -in="input.json" --parallelism=16 -i json -k "item_id" -v "content"

You can run the NLP pipeline as a REST service as follows:

./run.sh -p 8080

The output is written in TSV format, so you can load it directly into the database.
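To inspect the TSV output before loading it, a short script along these lines may help. This is only a sketch: the column layout of the output is not described here, so the code makes no assumption about what each column means and simply returns the raw fields. The function name read_tsv is hypothetical.

```python
import csv
import io


def read_tsv(stream):
    """Yield each TSV row as a list of column strings.

    Quote characters are treated as ordinary text (QUOTE_NONE), which
    is an assumption about how the output is written.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


if __name__ == "__main__":
    # Made-up sample rows, only to demonstrate the call.
    sample = io.StringIO("doc1\t1\tHello\ndoc2\t1\tWorld\n")
    for fields in read_tsv(sample):
        print(len(fields), fields)
```

For a real file, pass open("output.tsv", newline="", encoding="utf-8") instead of the in-memory sample.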

Setup

This package requires Java 8.
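To check which Java version is on your path before running setup.sh, something like the following sketch could be used. The function names are hypothetical. It relies on the fact that java -version reports Java 8 as "1.8.x" and later releases as "9", "11.0.2", and so on. Whether releases newer than Java 8 work with this package is not stated here.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy scheme: "1.8.0_292" means Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Return the major version of the `java` on PATH, or None."""
    try:
        # `java -version` writes its report to stderr.
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return None
    return parse_java_major(result.stderr)


if __name__ == "__main__":
    major = installed_java_major()
    if major == 8:
        print("Java 8 found.")
    else:
        print("Java 8 not found (detected: {}).".format(major))
```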