Personalized Pagerank with DrunkardMob

Note: this work was published in ACM RecSys 2013: http://www.cs.cmu.edu/~akyrola/files/drunkard-final.pdf

DrunkardMob: Introduction

GraphChi's Java-version contains a new system for executing a massive number of random walks. The system will be introduced in detail in an upcoming paper, but here is a short guide how to get started with it. Unlike normal GraphChi applications, this functionality uses a plenty of memory: RAM is used for storing the current state of each walk and the visit distributions of the walks (see below). The graph itself is loaded from disk efficiently, with very small memory footprint. We call this algorithm DrunkardMob.

The prime example of this functionality is the ability to estimate Personalized Pagerank for a huge number of vertices. The basic idea is this: we define a number of sources, i.e vertices that we start walks from. From each source we start usually some thousands of walks. A separate object, DrunkardMobCompanion keeps track of the visit by each walks, and in the end will output a distribution of the top K vertices visited, for each source vertex separately. That sample distribution can be used to estimate the Personalized Pagerank for the source vertex.

DrunkardCompanion can also be run as a separate service. In that case, the random walk engine sends walks in batches to the companion to be processed. Note that companion also requires plenty of memory, and thus it can be useful to run the main process and companion on separate machines.

Example code

Class edu.cmu.graphchi.apps.randomwalks.PersonalizedPageRank contains a fully implemented example. It should be easy to modify it to do weighted sampling of the edges, use in- and out-edges etc...

It is run by giving it a graph and a range of vertices A - B to use as sources. See note about vertex IDs below.

It uses following command line arguments:

usage: PersonalizedPageRank
 -f,--firstsource <arg>      id of the first source vertex (internal id)
 -g,--graph <arg>            graph file name
 -i,--niters <arg>           number of iterations
 -n,--nshards <arg>          number of shards
 -s,--nsources <arg>         number of sources
 -t,--filetype <arg>         filetype (edgelist|adjlist)
 -w,--walkspersource <arg>   number of walks to start from each source
 -u,--companion <arg>        RMI url to the DrunkardCompanion or 'local'
                             (default)

Example command line:

mvn assembly:assembly -DdescriptorId=jar-with-dependencies
java -Xmx4096m -cp target/graphchi-java-0.2-jar-with-dependencies.jar edu.cmu.graphchi.apps.randomwalks.PersonalizedPageRank  --graph=/Users/akyrola/graphs/soc-LiveJournal1.txt --nshards=4 --niters=5 --nsources=10000 --firstsource=0 --walkspersource=4000

Note about vertex IDs

Note, that GraphChi Java shuffles the vertex ids (to enable more efficient preprocessing of the graph). Thus, the source vertex ids are also defined in the shuffled (internal ) ids. It is easy to translate between the original and vertex ids, see example:

/* For debug */
        VertexIdTranslate vertexIdTranslate = this.drunkardMobEngine.getVertexIdTranslate();
        IdCount[] topForFirst = companion.getTop(firstSource);

        System.out.println("Top visits from source vertex " + vertexIdTranslate.forward(firstSource) + " (internal id=" + firstSource + ")");
        for(IdCount idc : topForFirst) {
            System.out.println(vertexIdTranslate.backward(idc.id) + ": " + idc.count);
        }

Reading the results

You can directly query the companion for top K visits for each source (see example above).

In addition, after the process, it will write the results into a file, as done in the example:

 /* Ask companion to dump the results to file */
        int nTop = 100;
        companion.outputDistributions(baseFilename + "_ppr_" + firstSource + "_"
                + (firstSource + numSources - 1) + ".top" + nTop, nTop);

The file is in binary format:
<...>

Note, that the vertex ids are in the internal id space

Memory usage

Each walk takes roughly 4 bytes of memory, so if you have >100GB of memory, you can have hundreds of thousands, or even millions of sources and several thousand of walks per source. On a laptop you of course need to configure less sources.

Acknowledgement

Part of this work was done during Aapo Kyrola's internship at Twitter, Fall 2012.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly