
Indexing with Pignlproc and Hadoop


This page explains the steps required to generate several types of indexes from an XML dump of Wikipedia using Apache Pig and Hadoop. Some of this code is still in the testing phase, but each of the steps described here has been tested successfully using clients running Ubuntu 10.04 and 12.04, on a Hadoop cluster of five machines running Ubuntu 10.04, as well as in local and pseudo-distributed modes on a single machine.

What You'll Need

- JDK >= 1.6
- Apache Maven 2.2.1
- Apache Pig >= 0.8.1 (version 0.10.0 preferred)
- Hadoop >= 0.20.x (tested with 0.20.2) - if you only want to test the scripts in local mode, you won't need to install Hadoop, as Pig comes with a bundled Hadoop installation for local mode.
- Optional: a stopword list with one word per line, such as [the one used by DBpedia Spotlight](http://spotlight.dbpedia.org/download/release-0.4/stopwords.en.list)
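
A quick way to sanity-check these prerequisites from a terminal (the hadoop check only matters if you plan to run on a real or pseudo-distributed cluster):

    java -version
    mvn -version
    pig -version
    hadoop version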

Steps

  1. Clone this fork of pignlproc:

    git clone https://github.com/dbpedia-spotlight/pignlproc.git /your/local/dir

  1. From the top dir of pignlproc, build the project by running:

    mvn package

This will compile the project and run the tests. If you want to skip the tests, run:

    mvn package -DskipTests=true

  1. Set JAVA_HOME to the location of your JDK, e.g.:

    export JAVA_HOME=/usr/lib/jvm/jdk1.7.0

Add Apache Pig to your PATH, e.g.:

    export PATH=/home/<user>/programs/pig-0.10.0/bin:$PATH
  1. You're now ready to test in local mode. Modify the parameters in indexer-local.pig.params to control the indexing scripts. Note that not all params are used in all scripts - you can look at the top of each script to find the params it uses (an example params file is sketched below the table).

| Param | Function |
| --- | --- |
| **Input/Output** | |
| $INPUT | Specifies the path to the XML wikidump |
| $OUTPUT_DIR | Specifies the directory for output |
| **Configuration** | |
| $MAX_SPAN_LENGTH | The maximum span in chars for individual (not aggregate) contexts |
| $MIN_COUNT | The minimum count for a token to be included in a resource's token index |
| $MIN_CONTEXTS | The minimum number of contexts a resource must have to be included |
| $NUM_DOCS | The number of resources in the wikidump (used to approximate idf) |
| $N | A cutoff param to keep only the top N tokens for a resource |
| $URI_LIST | For filtering scripts, the location of the URI whitelist |
| **Language-Specific** | |
| $LANG | The lowercase language code ('en', 'fr', 'de', etc.) |
| $ANALYZER_NAME | The name of the Lucene Analyzer to use (EnglishAnalyzer, FrenchAnalyzer, etc.) |
| $STOPLIST_NAME | The filename of the stopwords file to use (e.g. 'stopwords.en.list') |
| $STOPLIST_PATH | The path to the dir containing the stoplist (e.g. /user/hadoop/stoplists) |
| $PIGNLPROC_JAR | The local path to the JAR containing the pignlproc UDFs |
| **Hadoop Configuration** | |
| $DEFAULT_PARALLEL | The default number of reducers to use (may be overridden by your cluster config) |
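
A Pig parameter file is just a list of name=value lines. The sketch below is only an illustration of what indexer-local.pig.params might look like; every value (especially NUM_DOCS and the local paths) is an assumption you should replace with numbers and paths that match your own dump and machine:

    # indexer-local.pig.params -- illustrative values only
    INPUT=/data/enwiki-latest-pages-articles.xml
    OUTPUT_DIR=/tmp/pignlproc-output
    MAX_SPAN_LENGTH=500
    MIN_COUNT=2
    MIN_CONTEXTS=1
    NUM_DOCS=4000000
    N=100
    LANG=en
    ANALYZER_NAME=EnglishAnalyzer
    STOPLIST_NAME=stopwords.en.list
    STOPLIST_PATH=/home/<user>/stoplists
    PIGNLPROC_JAR=target/pignlproc-0.1.0-SNAPSHOT.jar
    DEFAULT_PARALLEL=1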

To test in local mode, modify indexer-local.pig.params for your local setup, then run examples/indexing/indexer_small_cluster.pig from the top dir of pignlproc:

    pig -x local -m indexer-local.pig.params examples/indexing/indexer_small_cluster.pig

Note: change $OUTPUT_DIR if necessary to point to an output dir that works for you, and $STOPLIST_PATH to point to the directory containing your stoplist. When the script finishes, check $OUTPUT_DIR to confirm the output.
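
One quick way to confirm that something sensible was produced (directory and file names below are placeholders; they depend on the STORE statements in the script you ran):

    ls -R /path/to/your/output_dir
    # pick one of the part files; if the output is bz2-compressed, use bzcat instead of head
    head /path/to/your/output_dir/*/part-*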
  1. If you want to run indexing on an actual Hadoop cluster, you'll first need to put your wikidump and stoplist into the Hadoop Distributed File System (HDFS).

    hadoop fs -put /location/of/enwiki-latest-pages-articles.xml /location/on/hdfs/enwiki-latest-pages-articles.xml    
    hadoop fs -put /location/of/stopwords.en.list /location/on/hdfs/stopwords.en.list    
    

    Note: Although Pig supports automatic loading of files with .bz2 compression, this feature is not currently implemented in the custom loader in pignlproc, so the extracted version of the XML dump is currently required. This inconvenience will be resolved in the near future.
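
    If you downloaded the compressed dump, one way to extract it before the upload (assuming bzip2 is installed; -k keeps the original archive) is:

        # extract next to the original; then upload the resulting .xml with hadoop fs -put as shown above
        bunzip2 -k /location/of/enwiki-latest-pages-articles.xml.bz2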

    You should also define your output dir ($OUTPUT_DIR) as a directory in HDFS.

    Note also that the parameters containing paths must now be paths in HDFS.

There are currently six possibilities for indexing:

| Script | Function |
| --- | --- |
| indexer_small_cluster.pig | create an index {URI, {(term, count),...}} - count in UDF |
| indexer_lucene.pig | create an index of {URI, {(term, count),...}} - count in MapRed |
| tfidf.pig | create a tfidf index: {URI, {(term, weight),...}} |
| uri_to_context_indexer.pig | create an index {URI, aggregated context} (one long chararray) |
| uri_to_context_indexer_filter.pig | same as above, except filter by a user-provided list of URIs (see the example below) |
| sf_group.pig | create an index {SurfaceForm, {(URI),...}, count} |

Testing has shown that indexer_small_cluster.pig is much more efficient than indexer_lucene.pig on small/mid-sized clusters (up to 35 mappers and 15 reducers), but these scripts have not yet been tested on a very large cluster.
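
For example, to run the whitelist-filtered variant, upload the URI list to HDFS and point $URI_LIST at it in your params file; the invocation then follows the same pattern shown below for indexer_small_cluster.pig (the whitelist filename here is just a placeholder):

    hadoop fs -put /location/of/uri_whitelist.txt /location/on/hdfs/uri_whitelist.txt
    pig -m indexer.pig.params examples/indexing/uri_to_context_indexer_filter.pig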

Make sure that $HADOOP_HOME is set to the location of your local Hadoop installation.

echo $HADOOP_HOME
/local/hadoop-0.20.2

From your client machine you can now run the script using something like the following:

pig -m indexer.pig.params indexer_small_cluster.pig

Substitute 'tfidf.pig' for 'indexer_small_cluster.pig' to try the script that creates a tfidf index.
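
Once the job finishes, a quick way to confirm that the output landed in HDFS (the path is illustrative and matches the example used further down):

    hadoop fs -ls /user/hadoop/output
    hadoop fs -du /user/hadoop/output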

## Output

Support for bz2-compressed JSON output has been added using pignlproc.storage.JsonCompressedStorage. If you want to change the output format, modify the last line of the scripts, e.g.:

STORE counts INTO '$OUTPUT_DIR/token_counts_big_cluster.TSV.bz2' USING PigStorage();

or

DEFINE JsonCompressedStorage pignlproc.storage.JsonCompressedStorage();
STORE counts INTO '$OUTPUT_DIR/token_counts_big_cluster.json.bz2' USING JsonCompressedStorage();

Notes:

1. The speed of indexing obviously depends on the size of your cluster. Constraints such as the size of hadoop.tmp.dir may also affect performance. This code has been tested on the full English Wikipedia with (relatively) good performance on a five-node cluster.

2. You only need the pignlproc JAR and the example scripts to use this code with Hadoop. If you built the project on a different machine, you can copy examples/indexing/indexer_small_cluster.pig, examples/indexing/indexer_lucene.pig and target/pignlproc-0.1.0-SNAPSHOT.jar to a client machine configured for your cluster (you will still need Apache Pig on that machine).

3. Once indexing has finished, get the files back onto your local machine with:

       hadoop fs -get /path/to/hadoop/output /path/to/local

   If you want one big file instead of the part-* files, use:

       hadoop fs -getmerge /path/to/hadoop/output /path/to/local

   If you used JsonCompressedStorage(), delete the automatically generated header and schema files before using -getmerge, otherwise the merged file will not be recognized as bz2. Example:

       hadoop fs -rm /user/hadoop/output/tfidf_token_weights.json.bz2/.pig_header
       hadoop fs -rm /user/hadoop/output/tfidf_token_weights.json.bz2/.pig_schema
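
   Putting the last two steps together for the tfidf output (the local destination path below is just a placeholder):

       hadoop fs -getmerge /user/hadoop/output/tfidf_token_weights.json.bz2 /path/to/local/tfidf_token_weights.json.bz2
       # confirm the merged file is valid bz2 and peek at the first few records
       bzip2 -t /path/to/local/tfidf_token_weights.json.bz2
       bzcat /path/to/local/tfidf_token_weights.json.bz2 | head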