<a href="https://colab.research.google.com/github/MrKCodes/pregel-sample/blob/main/notebooks/Pregel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![arangodb](https://github.com/arangodb/interactive_tutorials/blob/master/notebooks/img/ArangoDB_logo.png?raw=1)

# Iterative, Distributed Graph Analytics with Pregel



*“Many practical computing problems concern large graphs.”*

Distributed graph processing lets you run online analytical processing directly on graphs stored in ArangoDB. This helps you gain analytical insights into your data without relying on external processing systems.
[The processing system](https://www.arangodb.com/docs/stable/graphs-pregel.html) inside ArangoDB is based on Google's Pregel framework: [Pregel: A System for Large-Scale Graph Processing](http://www.dcs.bbk.ac.uk/~dell/teaching/cc/paper/sigmod10/p135-malewicz.pdf). This concept enables us to perform distributed graph processing, without the need for distributed global locking.
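To make the vertex-centric model more concrete, here is a minimal single-process sketch of PageRank written in the Pregel style: in each superstep, every vertex sends a share of its rank along its outgoing edges and then updates its own value using only the messages it received. This is an illustration only, not ArangoDB's implementation; the function name `pregel_pagerank` and its parameters are assumptions made for this example.

```python
from collections import defaultdict


def pregel_pagerank(edges, damping=0.85, threshold=1e-6, max_supersteps=100):
    """Toy PageRank in the vertex-centric (Pregel) style.

    `edges` is a list of (source, target) pairs. All names are illustrative;
    this is a sketch, not ArangoDB's implementation.
    Returns (rank_by_vertex, number_of_supersteps_run).
    """
    vertices = sorted({v for edge in edges for v in edge})
    if not vertices:
        return {}, 0

    out_neighbors = defaultdict(list)
    for source, target in edges:
        out_neighbors[source].append(target)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    superstep = 0

    for superstep in range(1, max_supersteps + 1):
        # Phase 1: every vertex sends rank / out_degree to its out-neighbors.
        # Vertices without outgoing edges send nothing in this simplified
        # version, so their rank mass is not redistributed.
        inbox = defaultdict(float)
        for v in vertices:
            targets = out_neighbors[v]
            if not targets:
                continue
            share = rank[v] / len(targets)
            for target in targets:
                inbox[target] += share

        # Phase 2: each vertex updates itself from its own inbox only,
        # so no global lock on the graph is required.
        new_rank = {
            v: (1.0 - damping) / n + damping * inbox[v] for v in vertices
        }
        max_change = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank

        # Stop once no vertex changed by more than the threshold.
        if max_change < threshold:
            break

    return rank, superstep
```

For example, `pregel_pagerank([("a", "b"), ("b", "c"), ("c", "a")])` assigns all three vertices of the cycle the same rank. In a real Pregel system the two phases run in parallel across workers, with a barrier between supersteps.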

Currently, ArangoDB supports the [following algorithms out of the box](https://www.arangodb.com/docs/stable/graphs-pregel.html#available-algorithms) (for custom algorithms, see the note about Custom Pregel below):
* PageRank
* Seeded PageRank
* Single-Source Shortest Path
* Connected Components:
   * WeaklyConnected
   * StronglyConnected
* Hyperlink-Induced Topic Search (HITS)
* Vertex Centrality
* Effective Closeness
* LineRank
* Label Propagation
* Speaker-Listener Label Propagation


Pregel is not useful for typical online queries, where you only work on a small set of vertices. These kinds of tasks are better suited to AQL traversals.
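For contrast, a small online query of this kind could look like the following sketch, which runs an AQL traversal through python-arango's `db.aql.execute`. The helper name `neighbors_within_two_hops` and the start vertex ID `students/01` are hypothetical, and the sketch assumes the `school` graph has already been populated with vertices and edges.

```python
def neighbors_within_two_hops(db, start_vertex_id):
    """Return the keys of all vertices reachable within two outbound hops.

    `db` is expected to be a python-arango database handle.
    Assumption: a named graph "school" exists and contains `start_vertex_id`.
    """
    query = """
        FOR v IN 1..2 OUTBOUND @start GRAPH 'school'
            RETURN DISTINCT v._key
    """
    # The start vertex is passed as a bind variable rather than being
    # formatted into the query string.
    cursor = db.aql.execute(query, bind_vars={"start": start_vertex_id})
    return list(cursor)
```

A call such as `neighbors_within_two_hops(db, "students/01")` only touches the neighborhood of a single vertex, which is why a traversal is a better fit here than a whole-graph Pregel job.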

Furthermore, for best performance Pregel should be used in combination with [SMART Graphs (Enterprise feature)](https://www.arangodb.com/enterprise-server/smartgraphs/).


# Setup

Before getting started with ArangoDB, we need to prepare our environment and create a temporary database on Oasis, ArangoDB's managed service.

In [1]:
%%capture
!git clone https://github.com/joerg84/ArangoDBUniversity.git
!rsync -av ArangoDBUniversity/ ./ --exclude=.git
!pip3 install pyarango
!pip3 install "python-arango>=5.0"

In [2]:
import json
import requests
import sys
import oasis
import time
from IPython.display import JSON

from arango import ArangoClient

Create the temporary database:

In [3]:
# Retrieve tmp credentials from ArangoDB Tutorial Service
login = oasis.getTempCredentials(tutorialName="Pregel", credentialProvider='https://tutorials.arangodb.cloud:8529/_db/_system/tutorialDB/tutorialDB')

# Connect to the temp database
db = oasis.connect_python_arango(login)

Requesting new temp credentials.
Temp database ready to use.


In [4]:
print("https://{}:{}".format(login["hostname"], login["port"]))
print("Username: " + login["username"])
print("Password: " + login["password"])
print("Database: " + login["dbName"])

https://tutorials.arangodb.cloud:8529
Username: TUTktmqji1u978xic42lzelg
Password: TUTxikhlb9p79v5b5f8isk19
Database: TUTo33q0tsc0situdgx7i3kim


Feel free to use the above URL to check out the UI!

##  Import Data

Let us first start by creating an empty graph:

In [5]:
if db.has_graph('school'):
    school = db.graph('school')
else:
    school = db.create_graph('school')

# Retrieve various graph properties.
print(school.name)
print(school.db_name)
print(school.vertex_collections())
print(school.edge_definitions())

school
TUTo33q0tsc0situdgx7i3kim
[]
[]


Next, we create a Pregel job on the (still empty) graph:

In [6]:
pregel = db.pregel

# Start a new PageRank job on the "school" graph.
job_id = pregel.create_job(
    graph='school',
    algorithm='pagerank',
    store=False,              # do not write results back into the vertex documents
    max_gss=100,              # maximum number of global supersteps
    thread_count=1,
    async_mode=False,
    result_field='result',
    algorithm_params={'threshold': 0.000001}
)

Furthermore, we can observe the status of a given Pregel job.

In [7]:
# Retrieve details of a Pregel job by ID.
job = pregel.job(job_id)
print(job['state'])

print(job)

loading
{'id': '318859117698095', 'algorithm': 'pagerank', 'created': '2025-05-20T11:09:48Z', 'ttl': 600, 'state': 'loading', 'gss': 0, 'database': 'TUTo33q0tsc0situdgx7i3kim', 'user': 'TUTktmqji1u978xic42lzelg', 'graph_loaded': False}
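Because the job above is still in the `loading` state, it can be handy to poll until it finishes. The helper below is a minimal sketch of that idea: `wait_for_job` is a hypothetical name, and the set of terminal state strings is an assumption based on the states listed in the ArangoDB Pregel documentation, so adjust it for your server version.

```python
import time

# Assumption: state strings after which a Pregel job no longer changes.
TERMINAL_STATES = {"done", "canceled", "fatal error", "in error"}


def wait_for_job(pregel, job_id, poll_interval=1.0, timeout=600.0):
    """Poll `pregel.job(job_id)` until the job reaches a terminal state.

    `pregel` is expected to behave like python-arango's `db.pregel`.
    Returns the final job description, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = pregel.job(job_id)
        if job["state"] in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "Pregel job {} still in state '{}'".format(job_id, job["state"])
            )
        time.sleep(poll_interval)
```

With the objects from the cells above, this could be called as `job = wait_for_job(pregel, job_id)` before inspecting the final state.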


And even delete it:

In [8]:
# Delete a Pregel job by ID.
pregel.delete_job(job_id)

True

# Community Detection

Next, let us look at a larger real-world example using the [Pokec Social Network](https://snap.stanford.edu/data/soc-Pokec.html).

In [12]:
# Download the Pokec dataset (be aware that the archive is about 1 GB)
!wget https://pokec-data.s3-us-west-2.amazonaws.com/pokec.tar.gz
!tar xvf pokec.tar.gz
!ls

--2025-05-20 11:10:32--  https://pokec-data.s3-us-west-2.amazonaws.com/pokec.tar.gz
Resolving pokec-data.s3-us-west-2.amazonaws.com (pokec-data.s3-us-west-2.amazonaws.com)... 52.92.148.250, 52.92.225.66, 52.92.146.138, ...
Connecting to pokec-data.s3-us-west-2.amazonaws.com (pokec-data.s3-us-west-2.amazonaws.com)|52.92.148.250|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 985191636 (940M) [application/x-gzip]
Saving to: ‘pokec.tar.gz’


2025-05-20 11:11:18 (21.2 MB/s) - ‘pokec.tar.gz’ saved [985191636/985191636]

pokec/
pokec/relations.jsonl
pokec/profiles.jsonl
AqlCrudTutorial.ipynb	     creds.dat		pokec.tar.gz
AqlGeospatialTutorial.ipynb  data		__pycache__
AqlJoinTutorial.ipynb	     example_output	README.md
AqlPart2Tutorial.ipynb	     FuzzySearch.ipynb	sample_data
AqlTraversalTutorial.ipynb   img		tools
ArangoDBUniversity	     oasis.py		Upsert.ipynb
ArangoSearch.ipynb	     pokec


Next, we will import the profiles and relationships using arangoimport.

*Note: the included arangoimport binary only works on Linux. If you want to run this notebook on a different OS, please use the appropriate arangoimport from the Download area. For more information on how to use the ArangoDB client tools, see the documentation.*

In [18]:
! ./tools/arangoimport -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]}   --file "pokec/profiles.jsonl" --type jsonl --collection profiles --progress true --create-collection true --create-collection-type document
#! ./tools/arangoimport -c none --server.endpoint http+ssl://{login["hostname"]}:{login["port"]} --server.username {login["username"]} --server.database {login["dbName"]} --server.password {login["password"]} --default-replication-factor 3  --input-directory "pokec/relations"

Connected to ArangoDB 'http+ssl://tutorials.arangodb.cloud:8529, database: 'TUTo33q0tsc0situdgx7i3kim', username: 'TUTktmqji1u978xic42lzelg'
----------------------------------------
database:               TUTo33q0tsc0situdgx7i3kim
collection:             profiles
create:                 yes
create database:        no
source filename:        pokec/profiles.jsonl
file type:              jsonl
threads:                2
connect timeout:        5
request timeout:        1200
----------------------------------------
Starting JSON import...
2025-05-20T11:15:17Z [2374] INFO [9ddf3] processed 70252448 bytes (3%) of input file
2025-05-20T11:15:31Z [2374] INFO [9ddf3] processed 140504896 bytes (6%) of input file
2025-05-20T11:15:39Z [2374] INFO [9ddf3] processed 210757344 bytes (9%) of input file
2025-05-20T11:15:46Z [2374] INFO [9ddf3] processed 281009792 bytes (12%) of input file
2025-05-20T11:15:52Z [2374] INFO [9ddf3] processed 351262240 bytes (15%) of inp

# Custom Pregel

So far, we have looked at predefined algorithms. ArangoDB also offers a feature (experimental at the time of writing) that allows users to add and modify their own custom Pregel algorithms at runtime. Check out [this webinar](https://www.arangodb.com/events/arangodb-feature-preview-custom-pregel/) for more details.

# Next Steps

Check out the [community detection tutorial](https://www.arangodb.com/learn/graphs/pregel-community-detection/) to explore further applications of Pregel to social network analytics.


To continue playing and working with ArangoDB beyond the temporary database, you can:

* [Get a 2 week free Trial with the ArangoDB Cloud](https://cloud.arangodb.com/home?utm_source=AQLJoin&utm_medium=Github&utm_campaign=ArangoDB%20University)
* Take the [free Graph Course](https://www.arangodb.com/arangodb-graph-course)  
* [Download ArangoDB](https://www.arangodb.com/download-major/)
* Keep Learning at https://www.arangodb.com/arangodb-training-center/

# Further Links

* https://www.arangodb.com/docs/stable/aql/tutorial.html