For this tutorial we need to download and configure both Hadoop and Behemoth.
As a first step, install and configure Hadoop, e.g. in a single-node setup, following the Hadoop set-up instructions.
Then, in your new Hadoop installation folder, add the following property to hdfs-site.xml (probably found in the /conf subfolder) to prevent _logs directories from being generated within the output of the Behemoth jobs:
<property>
  <name>hadoop.job.history.user.location</name>
  <value>none</value>
</property>
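To double-check that the property really ended up in the file you edited, a quick grep is usually enough. This is only a sketch: the helper name has_property and the example path are illustrative assumptions, and the check looks for the <name> element only, not for well-formed XML.

```shell
# Hypothetical helper: succeed if the given Hadoop config file declares the named property.
has_property() {
  config_file="$1"
  property_name="$2"
  # -F treats the property name literally, so its dots are not regex wildcards.
  grep -qF "<name>${property_name}</name>" "$config_file"
}

# Example call (the path is an assumption, adjust it to your installation):
# has_property "$HADOOP_HOME/conf/hdfs-site.xml" hadoop.job.history.user.location && echo "property set"
```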
For Hadoop 0.20.x you may need to specify the following in hadoop-env.sh (make sure it is all on one line):
export HADOOP_OPTS="-server -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl"
You will also need to download and compile Behemoth following these instructions.
As part of compiling Behemoth, we need to generate job files for the Tika, GATE and UIMA… modules. Compiling is done with Maven and generates a job file in the /target directory of each module, named after the module as in core/target/behemoth-core-*-job.jar. For ease of use, make sure that the hadoop command is on your PATH and can be called from anywhere, e.g. by setting $HADOOP_HOME globally to point to your Hadoop installation folder and adding its bin directory to PATH.
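A minimal sketch of that environment setup is shown here. The default location /opt/hadoop is purely an assumption for illustration; the lines would normally go into your shell profile so that they apply globally.

```shell
# Assumption: Hadoop lives in /opt/hadoop unless HADOOP_HOME is already set.
export HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"
# Prepend Hadoop's bin directory so that `hadoop` resolves from any working directory.
export PATH="$HADOOP_HOME/bin:$PATH"

# Report whether the command can actually be found.
if command -v hadoop >/dev/null 2>&1; then
  echo "hadoop found at $(command -v hadoop)"
else
  echo "hadoop not found; check HADOOP_HOME ($HADOOP_HOME)" >&2
fi
```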
You will find a script in the main directory of behemoth which simplifies the calls to the Behemoth jobs. The examples below use the explicit form (hadoop jar etc…).
In order to specify your own configuration via the behemoth-site.xml file, you can copy it to the $HADOOP/conf directory or leave it anywhere you want but do
prior to calling the script or the hadoop jar command. Alternatively you can also put it in core/src/main/resources/ and recompile behemoth so that it gets included in the job file.
The first step is to convert a set of documents into a Behemoth corpus by using the CorpusGenerator in
the core module. The class returns a sequence file of Behemoth documents which can then be further
processed using the other modules.
hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i "path to corpus" -o "path for output file"
N.B. Directory paths should be fully qualified in the form file:/path/to/directory, e.g. file:///home/carmen/data/.
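If you tend to forget the scheme, a small wrapper can qualify local paths before they are passed to -i or -o. The helper name to_file_uri is hypothetical, and the sketch does not percent-encode spaces or other special characters.

```shell
# Hypothetical helper: turn a local path into a file: URI, leaving qualified paths alone.
to_file_uri() {
  case "$1" in
    file:*|*://*) printf '%s\n' "$1" ;;              # already qualified (file:, hdfs://, ...)
    /*)           printf 'file://%s\n' "$1" ;;       # absolute local path
    *)            printf 'file://%s/%s\n' "$PWD" "$1" ;;  # relative path, resolved against $PWD
  esac
}

# Example call:
# hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator \
#   -i "$(to_file_uri /home/carmen/data/)" -o "path for output file"
```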
Use the recurse option if you want CorpusGenerator to process the input path recursively e.g.
hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i "path to corpus" -o "path for output file" --recurse
Note: The same can be done using any of the job files generated above, because CorpusGenerator is implemented in the core module, which all the other modules (gate, tika, uima) depend on. In practice, this means that you could also have called it using, for instance, the GATE job file:
hadoop jar gate/target/behemoth-gate*job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i "path to corpus" -o "path for output file"
In the rest of the document we will use the job files instead as they contain the dependencies that the modules require. The reason why it worked with the behemoth-core.jar is that it does not have any dependencies (apart from the Hadoop libs).
A simpler approach is to use the behemoth script and call ‘./behemoth importer’ with the same parameters as above.
Again, using another Behemoth core utility: the CorpusReader, we can have a look at the content of the
produced sequence file. The following command displays all the content in the Behemoth corpus:
hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusReader -i "path to generated Corpus"
url: file:/localPath/corpus/somedocument.rtf
contentType:
metadata: null
Annotations:
Additionally, a couple of flags can be added to the basic command above to display
more information:
-a,--displayAnnotations   display annotations in output
-c,--displayContent       display binary content in output
-m,--displayMetadata      display metadata in output
-t,--displayText          display text in output
For instance, setting -c causes the display of the first 200 characters of byte content.
hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusReader -i "path to generated Corpus" -c
Note: At this stage, the text contained in the documents has not yet been extracted from the original format. Similarly, the content type has not been identified, and there are no annotations for the documents yet.
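For a quick sanity check of a corpus, the CorpusReader output can be piped through a counter. This sketch rests on two assumptions that are not stated in this tutorial: that CorpusReader writes its records to standard output, and that each document record starts with a line beginning with "url: ". The helper name count_docs is hypothetical.

```shell
# Hypothetical helper: count documents in CorpusReader output read from standard input.
# Assumption: every document record starts with a line beginning with "url: ".
count_docs() {
  # grep -c prints 0 and exits non-zero when nothing matches; "|| true" keeps the exit status zero.
  grep -c '^url: ' || true
}

# Example call:
# hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusReader \
#   -i "path to generated Corpus" | count_docs
```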
The Tika module in Behemoth uses the Apache Tika library to extract the text from the documents into a Behemoth sequence file. It offers a variety of identification and filtering options.
Extracting the text is the minimum needed before the content can be used or
processed any further. The basic command for this step is:
hadoop jar tika/target/behemoth-tika-*-job.jar com.digitalpebble.behemoth.tika.TikaDriver -i "path to previous output from the CorpusGenerator" -o "path to output file"
Additionally, we can identify and inspect the languages present in the corpus by running:
(Step 1) hadoop jar language-id/target/behemoth-lang*job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -i corpusTika -o corpusTika-lang
Having detected the language, one can then filter on a specific language ID (in this case ‘en’ for English) and discard the remainder. (The exact distribution of languages in the corpus and their IDs can be inspected in the hadoop jobtracker of that specific job.)
(Step 2) hadoop jar language-id/target/behemoth-lang*-SNAPSHOT-job.jar com.digitalpebble.behemoth.languageidentification.LanguageIdDriver -D document.filter.md.keep.lang=en -i corpusTika-lang -o corpusTika-EN
Note: Skipping a specific language, is done by running the same command with:
The first step here is optional, but allows you to look at the distribution in the corpus.
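The Tika extraction and the two language steps can be chained in a small script. The sketch below is illustrative only: the function names run and tika_english_only, the DRY_RUN switch and the derived output names are assumptions made for this example, not part of Behemoth. With DRY_RUN=1 the commands are printed instead of executed, which helps when checking paths.

```shell
# Print each command instead of running it when DRY_RUN=1
# (an illustrative convention for this sketch, not a Behemoth feature).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: extract text with Tika, tag languages, then keep English documents only.
tika_english_only() {
  corpus="$1"   # output of the CorpusGenerator
  run hadoop jar tika/target/behemoth-tika-*-job.jar \
      com.digitalpebble.behemoth.tika.TikaDriver \
      -i "$corpus" -o "${corpus}-tika" &&
  run hadoop jar language-id/target/behemoth-lang*job.jar \
      com.digitalpebble.behemoth.languageidentification.LanguageIdDriver \
      -i "${corpus}-tika" -o "${corpus}-tika-lang" &&
  run hadoop jar language-id/target/behemoth-lang*job.jar \
      com.digitalpebble.behemoth.languageidentification.LanguageIdDriver \
      -D document.filter.md.keep.lang=en \
      -i "${corpus}-tika-lang" -o "${corpus}-tika-EN"
}

# Example call: DRY_RUN=1; tika_english_only corpus
```

Each step only runs if the previous one succeeded, so a failed Tika job does not trigger the language jobs on missing input.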
The core module allows post-Tika filtering of documents based on regular expressions.
Documents whose mime type, URL, … match the regular expression are retained and written to the new output destination. The first example (1) keeps only HTML documents.
The second example keeps only URLs that contain ‘333’, and the third filters on document
labels, such as ‘contract’ (this would be relevant for further processing in classification/clustering).
(1) hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusFilter -D document.filter.mimetype.keep=.+html.* -i tikaCorpus -o tikaCorpus-html
(2) hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusFilter -D document.filter.url.keep=.+333.* -i tikaCorpus -o tikaCorpus-333
(3) hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusFilter -D document.filter.md.keep.label=contract -i textcorpusTika -o textcorpusTika-contracts
For the last filter it’s also possible to skip a label, by replacing “document.filter.md.keep.label” with
If you apply more than one filter at a time, you can control how the filters interact by specifying a mode: ‘OR’ keeps or skips the document if any filter matches, and ‘AND’ only if all constraints match.
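Before launching a filtering job, it can be useful to try a pattern against a few sample values locally. The sketch below uses grep's extended regular expressions as a stand-in for the Java regular expressions used by the job, and it assumes that a pattern must match the whole value (which would explain why the example patterns start with ".+" and end with ".*"). The helper name matches_filter is hypothetical.

```shell
# Hypothetical helper: succeed if VALUE matches REGEX as a whole string.
# Assumption: the filter has to match the entire value, not just a substring.
matches_filter() {
  regex="$1"
  value="$2"
  # -E: extended regex, -x: whole-line match, -q: no output, only the exit status.
  printf '%s\n' "$value" | grep -Eqx -- "$regex"
}

# Try the mime type pattern from example (1) against a few sample values.
for mime in "text/html" "text/html; charset=utf-8" "application/pdf"; do
  if matches_filter '.+html.*' "$mime"; then
    echo "keep: $mime"
  else
    echo "skip: $mime"
  fi
done
```

Under these assumptions the two HTML types are kept and application/pdf is skipped.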
At any intermediate point, you can extract the documents from a sequence file (e.g. after filtering) for inspection:
hadoop jar core/target/behemoth-core*job.jar com.digitalpebble.behemoth.util.ContentExtractor -i seq-directory -o seqdirectory-output
For this step, the zipped GATE application must be pushed onto the distributed filesystem by copying
the file from your local file system to HDFS.
hadoop fs -copyFromLocal /mylocalpath/ANNIE.zip /apps/ANNIE.zip
In case you haven’t done so already, create a behemoth-site.xml file in your Hadoop/conf directory and add the following properties:
<property>
  <name>gate.annotationset.input</name>
  <value></value>
  <description>Map the information at the behemoth format onto the select annotationset</description>
</property>
<property>
  <name>gate.annotationset.output</name>
  <value></value>
  <description>AnnotationSet to consider when serializing to the behemoth format</description>
</property>
<property>
  <name>gate.annotations.filter</name>
  <value>Token</value>
  <description>Annotations types to consider when serializing to the behemoth format, separated by commas</description>
</property>
<property>
  <name>gate.features.filter</name>
  <value>Token.string</value>
  <description>if specified, only the feature listed for a type will be kept</description>
</property>
<property>
  <name>gate.emptyannotationset</name>
  <value>false</value>
  <description>if specified all the annotations in the Behemoth document will be deleted before processing with GATE</description>
</property>
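As with any Hadoop configuration resource, the properties have to sit inside a <configuration> root element. A minimal sketch for creating such a file from the shell follows; the helper name write_behemoth_site is hypothetical, and only two of the properties are included to keep the example short.

```shell
# Hypothetical helper: write a minimal behemoth-site.xml unless one already exists.
write_behemoth_site() {
  target="$1"
  if [ -e "$target" ]; then
    echo "$target already exists, leaving it untouched" >&2
    return 1
  fi
  # Quoted 'EOF' prevents the shell from expanding anything inside the XML.
  cat > "$target" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>gate.annotations.filter</name>
    <value>Token</value>
  </property>
  <property>
    <name>gate.features.filter</name>
    <value>Token.string</value>
  </property>
</configuration>
EOF
}

# Example call (the path is an assumption, adjust it to your installation):
# write_behemoth_site "$HADOOP_HOME/conf/behemoth-site.xml"
```

Refusing to overwrite an existing file is a deliberate choice here, so that a hand-edited configuration is never lost.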
hadoop jar gate/target/behemoth-gate*job.jar com.digitalpebble.behemoth.gate.GATEDriver "input path" "target output path" /apps/ANNIE.zip
e.g.
hadoop jar gate/target/behemoth-gate*job.jar com.digitalpebble.behemoth.gate.GATEDriver /data/behemothcorpus /data/behemothcorpus-2 /apps/ANNIE.zip
The procedure is very similar for UIMA, where you first generate a job file for the UIMA module and then copy the pear file to HDFS:
hadoop fs -copyFromLocal /mylocalpath/WhitespaceTokenizer.pear /apps/WhitespaceTokenizer.pear
The parameters to specify in behemoth-site.xml are:
<property>
  <name>uima.annotations.filter</name>
  <value>org.apache.uima.TokenAnnotation,org.apache.uima.SentenceAnnotation</value>
  <description>Annotations types to consider when serializing to the behemoth format, separated by commas</description>
</property>
<property>
  <name>uima.features.filter</name>
  <value>org.apache.uima.TokenAnnotation:posTag</value>
  <description>Feature names to consider when serializing to the behemoth format, separated by commas</description>
</property>
hadoop jar uima/target/behemoth-uima*job.jar com.digitalpebble.behemoth.uima.UIMADriver /data/behemothcorpus /data/behemothcorpus-2 /apps/WhitespaceTokenizer.pear
Again, the content of the corpus can be checked by using the CorpusReader:
e.g. hadoop jar core/target/behemoth-core*job.jar com.digitalpebble.behemoth.util.CorpusReader -i /data/behemothcorpus-2 -a
Assuming that you’ve generated a job file for the Mahout module, you can create vectors from
the Behemoth corpus.
There are various options, e.g. using annotations of type Token and taking the value of their string feature instead of relying on the Lucene analysers, as done by Mahout’s SparseVectorsFromSequenceFiles. This allows the use of any features generated by a previous module (e.g. lemmas, POS tags, semantic features, …) as feature values for the clustering / classification with Mahout.
hadoop jar mahout/target/behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth -i "previous output" -o "target output path"
Note: There are a number of flags that can be set at this point, such as weighting scheme (‘-wt’) or
‘-nv’ for “named Vector” (this is particularly relevant for clustering to later identify document-cluster mappings).
For a full list of options, run the following:
hadoop jar mahout/target/behemoth-mahout*job.jar com.digitalpebble.behemoth.mahout.SparseVectorsFromBehemoth
Note: Another, more complete example of clustering with Mahout can be found here.
Note: The command above works with Hadoop 0.21 only.