This repository has been archived by the owner on Jul 10, 2019. It is now read-only.

Core module

jnioche edited this page Jan 9, 2013 · 5 revisions


Core commands are found in behemoth-core.job.

CorpusGenerator is used to ingest a corpus of documents on the local filesystem as a collection of Behemoth documents stored in a SequenceFile on HDFS.

usage: com.digitalpebble.behemoth.util.CorpusGenerator -i <localdir> -o <outputDFSdir> [--recurse] [--unpack] [--metadata <key=value;...>]
<localdir>            The input path on the local filesystem
<outputDFSdir>        The output path on HDFS
--recurse             Process the input path recursively
--unpack              Unpack archives
--metadata            Add document metadata as key=value pairs separated by semicolons, e.g. -md source=internet;label=public
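
As a rough sketch, an ingestion run could look like the script below. It assumes the job file is launched with `hadoop jar`; the paths, the `DRY_RUN` switch and the `run` and `generate_corpus` helpers are hypothetical names introduced for illustration. By default the script only prints the command, so it can be checked without a cluster.

```shell
#!/bin/sh
# Sketch only: assumes behemoth-core.job is launched via `hadoop jar`.
JOB="behemoth-core.job"
CLASS="com.digitalpebble.behemoth.util.CorpusGenerator"
INPUT="/tmp/docs"            # hypothetical local directory
OUTPUT="/user/me/corpus"     # hypothetical HDFS directory
DRY_RUN="${DRY_RUN:-1}"      # hypothetical switch: 1 = print only, anything else = execute

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

generate_corpus() {
  # The metadata value is quoted because an unquoted ';' would end the shell command.
  # --metadata is the option name listed above; the listing's example uses -md.
  run hadoop jar "$JOB" "$CLASS" -i "$INPUT" -o "$OUTPUT" --recurse \
      --metadata "source=internet;label=public"
}

generate_corpus
```

Running the script with `DRY_RUN=0` on a machine with Hadoop on the PATH would execute the command instead of printing it.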

CorpusReader is used to print out the contents of documents stored in a SequenceFile on HDFS.

usage: com.digitalpebble.behemoth.util.CorpusReader -i <inputDFSPath> [-c] [-t] [-a] [-m]
<inputDFSPath>        The input path on HDFS
-c                    Print the first 200 characters of binary content
-t                    Display the text
-m                    Display the metadata
-a                    Display the annotations
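
To check what a corpus contains, a call along the following lines might be used. This is a sketch under the same assumption that the job file is launched with `hadoop jar`; the HDFS path, the `DRY_RUN` switch and the `run` and `read_corpus` helpers are hypothetical. Only the `-t` and `-m` flags from the listing above are passed, so text and metadata would be shown.

```shell
#!/bin/sh
# Sketch only: assumes behemoth-core.job is launched via `hadoop jar`.
JOB="behemoth-core.job"
CLASS="com.digitalpebble.behemoth.util.CorpusReader"
CORPUS="/user/me/corpus"     # hypothetical HDFS path of an existing SequenceFile
DRY_RUN="${DRY_RUN:-1}"      # hypothetical switch: 1 = print only, anything else = execute

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

read_corpus() {
  # -t displays the text, -m displays the metadata.
  run hadoop jar "$JOB" "$CLASS" -i "$CORPUS" -t -m
}

read_corpus
```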

CorpusFilter filters the documents stored in the input SequenceFile and writes the documents that pass the filters to a new SequenceFile.

usage: com.digitalpebble.behemoth.util.CorpusFilter -i <inputPath> -o <outputPath>
<inputPath>        The input path on HDFS
<outputPath>       The output path on HDFS

In addition to the parameters above, the filtering options are set through the standard Hadoop configuration, e.g. -D document.filter.url.keep=.*
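
A filtering run combining both kinds of options might be sketched as follows. Two assumptions are made here: that the job file is launched with `hadoop jar`, and that the `-D` option is placed before the command's own parameters, as is the usual convention for generic Hadoop options. The paths, the regular expression, the `DRY_RUN` switch and the `run` and `filter_corpus` helpers are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes behemoth-core.job is launched via `hadoop jar`
# and that generic -D options precede the command's own parameters.
JOB="behemoth-core.job"
CLASS="com.digitalpebble.behemoth.util.CorpusFilter"
INPUT="/user/me/corpus"        # hypothetical HDFS input path
OUTPUT="/user/me/corpus-pdf"   # hypothetical HDFS output path
KEEP='.*\.pdf'                 # hypothetical pattern; single quotes stop shell expansion
DRY_RUN="${DRY_RUN:-1}"        # hypothetical switch: 1 = print only, anything else = execute

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi
}

filter_corpus() {
  # The property is passed as one quoted argument so the shell never globs the regex.
  run hadoop jar "$JOB" "$CLASS" -D "document.filter.url.keep=$KEEP" \
      -i "$INPUT" -o "$OUTPUT"
}

filter_corpus
```

Quoting the pattern matters: an unquoted `.*` can be expanded by the shell against file names in the current directory before Hadoop ever sees it.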

ContentExtractor TODO
