Graphee is a lightweight piece of code written in C++ to compute graph properties, such as: PageRank, TrustRank, harmonic centrality, etc...
What's the purpose of Graphee?
In order to improve search engine results, one has compute some values from the graph of the web. Actually, some famous works are known like the PageRank  or the TrustRank  algorithms. They were both indispensable to make their search engines pertinent and spam-proof.
However, with time, the graph of the web has become wider and wider. It clearly contains more than a
dozen billion vertices and at least ten times more edges. Some, has chosen to persist in full-RAM
computations and using distributed framework as Hadoop or Spark. While, some other approaches as
GraphChi [3;4] or
m-flash [5;6] are using serialization methods,
but both limited by
uint32_t for the vertex ids.
Thus, we propose to develop software able to handle
uint64 and still performing calculations
on a single laptop (sometimes with an external hard-drive for wide graphs).
The graph is converted into an adjacency matrix and then saved with the fully-tested (since 70's) Compressed Sparse Row (CSR) matrix format. The serialization is done by dividing vectors into slices, and matrices into blocks, with sizes not exceeding the RAM limit of the computer.
However, nowadays the graph of the web is so wide that one cannot restart calculations from scratch. The preferred solution would be using delta files, saving only differences between two graph snapshots. But the CSR format is unfriendly with insertions and deletions. So we propose to implement the Dynamic CSR  matrix format, which presents high speedups for the matrix modifications.
Also a binding of Graphee with
Python is under development !
Technically we use
OpenMP technologies for the parallel calculations on CPUs.
The full support or
CUDA standards is one of the major properties to be included soon !
We are presenting the results for the Pagerank computaton of the domain-level graph of the CommonCrawl:
- 98 million vertices,
- 1.5 billion edges.
The graph prepossessing (RAW to CSR) and the 10 iterations of the Pagerank, runs in 16 minutes and 11 seconds on a Dell XPS laptop made of an Intel Core i7-7700HQ @ 2.8GHz x 8, 16GB RAM, 512GB SSD. (We limited the RAM usage to 10GB to avoid freezing because of the RAM usage of other apps)
Examples & functionalities
- Verify that
zlibis installed on your computer. If not follow the link ZLib.
grapheeand synchronize the
submodulesnecessary for the code, with the following command:
$ git clone --recursive https://github.com/QwantResearch/graphee.git
- Compile the customized version of
Snappywhich supports files over
4.2 GB(again the 2018-problem of
$ cd src/snappy $ mkdir -p build $ cd build && cmake .. $ make
Compile the Pagerank example and download the data:
$ make pagerank $ cd examples/ $ chmod u+x load_data.sh $ ./load_data.sh
Then launch it !
$ ./pagerank filelist
The documentation is realized with Doxygen, create it:
$ mkdir -p docs $ make docs
For any questions, comments, or collaborations, please use: n.martin [at] qwantresearch [dot] com or also @stdthread on Twitter.
- The Anatomy of a Large-Scale Hypertextual Web Search Engine, S. Brin and L. Page
- Web Spam Detection with Anti-Trust Rank, V. Krishnan and R. Raj
- GraphChi: Large-Scale Graph Computation on Just a PC, A. Kyrola, G. Blelloch, and C. Guestrin
- Sources: GraphChi/graphchi-cpp
- M-Flash: Fast Billion-Scale Graph Computation Using a Bimodal Block Processing Model, H. Gualdron et al.
- Sources: M-Flash/m-flash-cpp
- Dynamic Sparse-Matrix Allocation on GPUs, J. King, T. Gilray, R.M. Kirby and M. Might