Aapo Kyrola edited this page Jul 24, 2013 · 1 revision

GraphChi for Pig

Pig is a powerful query language for Hadoop, commonly used for large-scale data processing. It is now possible to run GraphChi programs as part of Pig scripts, with just one line of script. This allows easy huge-scale graph computation on data stored in HDFS (Hadoop Distributed File System). Because GraphChi will ultimately execute on only a single Hadoop machine (see How-GraphChi-For-Pig-Works), the size of the Hadoop cluster is not a limiting factor.

GraphChi for Pig is a viable alternative to Giraph, a distributed graph engine built on top of Hadoop. With GraphChi, you can develop your algorithms on your laptop (with realistically sized data) and then deploy them to run on a big cluster. GraphChi will also often run faster and use far fewer resources than the alternatives.

This has been tested only on Pig version 0.10.

How to get started

Basically, you write your GraphChi application in Java or Scala just like before.

However, your application (GraphChiProgram) must extend edu.cmu.graphchi.hadoop.PigGraphChiBase and implement the following methods:

  • createSharder(String graphName, int numShards): initializes a FastSharder with your application-specific edge- and vertex-value parsers. This is what you would normally do in your main().
  • getSchemaString(): returns the schema of your application's output to Pig, for example "(vertex:int, weight:float)".
  • runGraphChi(): runs your program with GraphChi. At the end, if your program returns a value for every vertex, you should initialize a vertex iterator so you can return the results to Pig. For example: this.vertexIterator = VertexAggregator.vertexIterator(engine.numVertices(), getGraphName(), new FloatConverter(), engine.getVertexIdTranslate())
  • getNextResult(TupleFactory tupleFactory): called for every result tuple your program generates.
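Put together, a program that supports Pig looks roughly like the skeleton below. This is an illustrative sketch only, not compilable code: the method names come from the list above, but the return types, modifiers, and the elided engine setup are assumptions, and MyGraphChiProgram is a hypothetical name. See the bundled Pagerank example (edu.cmu.graphchi.apps.pig.PigPagerank) for a complete, working program.

```java
// Hypothetical skeleton: method names are from the list above; return
// types and setup details are assumptions, not the exact GraphChi API.
public class MyGraphChiProgram extends edu.cmu.graphchi.hadoop.PigGraphChiBase {

    @Override
    protected FastSharder createSharder(String graphName, int numShards) throws IOException {
        // Initialize a FastSharder with application-specific edge- and
        // vertex-value parsers -- what you would normally do in main().
        /* return new FastSharder(...); */
        return null; // placeholder
    }

    @Override
    protected String getSchemaString() {
        // Tell Pig the schema of the tuples this program emits.
        return "(vertex:int, weight:float)";
    }

    @Override
    protected void runGraphChi() throws Exception {
        // ... create and run a GraphChiEngine with your GraphChiProgram ...
        // Afterwards, set up an iterator over per-vertex results, e.g.:
        // this.vertexIterator = VertexAggregator.vertexIterator(engine.numVertices(),
        //         getGraphName(), new FloatConverter(), engine.getVertexIdTranslate());
    }

    @Override
    protected Tuple getNextResult(TupleFactory tupleFactory) {
        // Called for every result tuple your program generates; typically
        // pulls the next value from the vertex iterator and packs it into
        // a tuple via tupleFactory.
        return null; // placeholder
    }
}
```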

Running your application with Pig

Below is an example of how to run the Pagerank example with Pig:

REGISTER graphchi-java-0.2-jar-with-dependencies.jar;

pagerank = LOAD '/user/akyrola/graphs/soc-LiveJournal1.txt' USING edu.cmu.graphchi.apps.pig.PigPagerank as (vertex:int, rank:float);

On the first line, you must register the GraphChi jar file (which includes all dependencies). See the README.txt for instructions on how to build this jar file.

On the second line, the program is run. GraphChi programs are implemented as special "loader" functions: they load a graph from HDFS, do their computation, and output result tuples to the Pig runtime.

Remark: unfortunately, you cannot pass a Pig alias to GraphChi; instead, you need to do the following:

 graph = <.... some pig computation... >
 graph = FOREACH graph GENERATE src, dest, value;
 STORE graph INTO 'processing/mygraph';

 myresults = LOAD 'processing/mygraph' USING my.org.graphchi.MyGraphChiProgram();

Example code

Input format

The graph must be stored in edge-list format:

src dest  value
src2 dest2  value

(the delimiter can be a space, comma, or tab).

Vertex ids must be numerical and less than 2^31 (i.e., about 2.1 billion).

If src=dest, then the line is assumed to contain the vertex value for vertex "src". Thus, the same input data is used for initializing both the edges and the vertices.
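The parsing rules above can be illustrated with a short, self-contained snippet (not part of GraphChi; the class name is hypothetical): lines are split on space, comma, or tab, and a line whose source equals its destination is treated as a vertex value rather than an edge.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration of the edge-list input format: one record per line,
// delimited by space, comma, or tab; src == dest carries a vertex value.
public class EdgeListFormat {
    public static void main(String[] args) {
        String[] lines = {
            "1 2 0.5",      // edge 1 -> 2 with value 0.5
            "3,4,1.0",      // comma-delimited edge
            "2\t2\t7.5"     // src == dest: initial value for vertex 2
        };
        Map<Integer, Float> vertexValues = new HashMap<>();
        for (String line : lines) {
            String[] tok = line.split("[ ,\t]+");
            int src = Integer.parseInt(tok[0]);   // ids must fit in a signed 32-bit int
            int dest = Integer.parseInt(tok[1]);
            float value = Float.parseFloat(tok[2]);
            if (src == dest) {
                vertexValues.put(src, value);     // vertex initialization, not an edge
            } else {
                System.out.println("edge " + src + " -> " + dest + " value " + value);
            }
        }
        System.out.println("vertex 2 init = " + vertexValues.get(2));
        // prints two edges, then: vertex 2 init = 7.5
    }
}
```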

Performance

Computing ALS matrix factorization on the Netflix movie-ratings data (used in the Netflix Prize competition) takes 9 minutes with latent-factor dimension D=5 (five iterations). Note: I would be interested in hearing a comparison to Mahout ALS on the same data.

How it works

To be honest, it is a huge hack :). Hadoop administrators should read How-GraphChi-For-Pig-Works to understand how it operates and to troubleshoot any potential problems.

Author

[http://www.cs.cmu.edu/~akyrola Aapo Kyrola], 2013. Follow me on Twitter: @kyrpov.

Acknowledgements

The Pig support for GraphChi was developed during the author's internship at Twitter in Fall 2012. Special thanks to Pankaj Gupta, Dong Wang, and Dmitriy Ryaboy for the opportunity and for their helpful suggestions and ideas.