# Pagerank on subgraphs—efficient Monte-Carlo estimation

This repo contains the reference code for Subrank, my novel algorithm for efficiently computing the Pagerank distribution over a subgraph $S$ of $G$.
For the reasoning behind the algorithm, the definition and the analysis, I invite the interested reader to [read the paper](https://pippellia.com/pippellia/Social+Graph/Pagerank+on+subgraphs%E2%80%94efficient+Monte-Carlo+estimation).

To play with it, follow these steps:

## Step 0: Build and store the graph

In [12]:
# Imports
from nostr_dvm.utils.wot_utils import build_network_from, save_network, load_network, get_mc_pagerank, get_subrank
import time
import networkx as nx
import random


In [13]:
user = '99bb5591c9116600f845107d31f9b59e2f7c7e09a1ff802e84f1d43da557ca64'


In [14]:
index_map, network_graph = await build_network_from(user, depth=2, max_batch=500, max_time_request=10)
save_network(index_map, network_graph, user)

Step 1: fetching kind 3 events from relays & pre-processing
current network: 44029 npubs
Finished in 39.349228858947754
 > index_map_99bb5591c9116600f845107d31f9b59e2f7c7e09a1ff802e84f1d43da557ca64.json
 > network_graph_99bb5591c9116600f845107d31f9b59e2f7c7e09a1ff802e84f1d43da557ca64.json


## Step 1: Load the graph database

First, load the networkx graph database into memory by running the following code.

In [15]:
# loading the database
print('loading the database...')
tic = time.time()

index_map, G = load_network(user)

toc = time.time()
print(f'finished in {toc-tic} seconds')

loading the database...
finished in 1.1044838428497314 seconds


## Step 2: Compute Pagerank over $G$

Compute the Pagerank over $G$ with the built-in networkx `pagerank` function, which uses the power iteration method.
This vector is treated as the ground-truth Pagerank and is used to measure the error of the Monte-Carlo estimates.

In [16]:
# computing the pagerank
print('computing global pagerank...')
tic = time.time()

p_G = nx.pagerank(G, tol=1e-12)

toc = time.time()
print(f'finished in {toc-tic} seconds')

computing global pagerank...
finished in 1.036012887954712 seconds
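
For readers unfamiliar with power iteration, here is a minimal sketch of the idea in plain Python. It is not the networkx implementation: the function name `pagerank_power_iteration`, the dict-of-lists graph format, and the convergence test are assumptions made for illustration, and every neighbour is assumed to also appear as a key of the graph.

```python
def pagerank_power_iteration(graph, alpha=0.85, tol=1e-12, max_iter=1000):
    """Illustrative power iteration; graph maps node -> list of out-neighbours."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # start from the uniform distribution

    for _ in range(max_iter):
        # rank held by nodes without out-links is spread uniformly over all nodes
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new_rank = {v: (1.0 - alpha) / n + alpha * dangling / n for v in nodes}

        # every other node splits its rank evenly among its out-neighbours
        for v in nodes:
            out = graph[v]
            if out:
                share = alpha * rank[v] / len(out)
                for w in out:
                    new_rank[w] += share

        # stop once the L1 change between two iterations is below tol
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break

    return rank


toy_graph = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a'], 'd': []}
toy_rank = pagerank_power_iteration(toy_graph)
print(toy_rank)
```

The returned values sum to 1, so they can be compared directly with any other Pagerank estimate using an L1 distance.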


## Step 3: Approximate Pagerank over $G$ using Monte-Carlo

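Compute the Pagerank over $G$ using a simple Monte-Carlo implementation, then compute its L1 error against the power-iteration result from Step 2.
With $R = 10$ walks per node on the 44029-node graph, this performs 440290 random walks.
This step is essential because it returns the CSR matrix `walk_visited_count`, which the Subrank algorithm uses later.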

In [17]:
# number of random walks per node
R = 10

# fix the order of the nodes
nodelist = list(G.nodes())

tic = time.time()

# perform the random walks and get the monte-carlo pagerank
walk_visited_count, mc_pagerank = get_mc_pagerank(G, R, nodelist)

toc = time.time()
print(f'performed random walks in {toc-tic} seconds')

# computing the L1 error
error_G_mc = sum( abs(p_G[node] - mc_pagerank[node])
                  for node in G.nodes() )

print(f'error pagerank vs mc pagerank in G = {error_G_mc}')

progress = 100%       
Total walks performed:  440290
performed random walks in 1.0430760383605957 seconds
error pagerank vs mc pagerank in G = 0.020183034486847724
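
To make the Monte-Carlo step less opaque, here is a minimal sketch of one common way to estimate Pagerank with random walks. It is not the code of `get_mc_pagerank`, whose internals are not shown here: the names `mc_pagerank` and `l1_error`, the dict-of-lists graph format, and the stopping rule (stop with probability `1 - alpha` or at a node without out-links) are assumptions made for illustration.

```python
import random


def mc_pagerank(graph, walks_per_node=10, alpha=0.85, seed=0):
    """Illustrative estimator; graph maps node -> list of out-neighbours."""
    rng = random.Random(seed)
    visits = {v: 0 for v in graph}

    for start in graph:
        for _ in range(walks_per_node):
            node = start
            while True:
                visits[node] += 1
                out = graph[node]
                # the walk ends at a dangling node, or with probability 1 - alpha
                if not out or rng.random() > alpha:
                    break
                node = rng.choice(out)

    # the estimate is the share of all visits that landed on each node
    total = sum(visits.values())
    return {v: count / total for v, count in visits.items()}


def l1_error(p, q):
    """L1 distance between two distributions defined over the same nodes."""
    return sum(abs(p[v] - q[v]) for v in p)


toy_graph = {'a': ['b'], 'b': ['c'], 'c': ['a']}
estimate = mc_pagerank(toy_graph, walks_per_node=2000)
uniform = {v: 1.0 / 3.0 for v in toy_graph}
print(l1_error(estimate, uniform))
```

On this symmetric 3-cycle the exact Pagerank is uniform, so the printed L1 error should be small and should shrink as `walks_per_node` grows.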


## Step 4: Select random subgraph $S$ and compute its Pagerank distribution

Select a random subgraph $S$ consisting of 500 nodes, and compute its Pagerank distribution.

In [18]:
# selecting random subgraph S
S_nodes = set(random.sample(list(G.nodes()), k=500))  # sample 500 nodes
S = G.subgraph(S_nodes).copy()

# computing pagerank over S
print('computing local pagerank...')
tic = time.time()

p_S = nx.pagerank(S, tol=1e-12)

toc = time.time()
print(f'finished in {toc-tic} seconds')

computing local pagerank...
finished in 0.0029449462890625 seconds


## Step 5: Approximate Pagerank over $S$ using Subrank

Run the Subrank algorithm to approximate the Pagerank over the subgraph $S$ of $G$, then compute the L1 error.

In [19]:
# computing subrank
print('computing subrank over S...')
tic = time.time()

subrank = get_subrank(S, G, walk_visited_count, nodelist)

toc = time.time()
print(f'performed random walks in {toc-tic} seconds')

# computing the L1 error
error_S_subrank = sum( abs(p_S[node] - subrank[node])
                      for node in S_nodes )

print(f'error pagerank vs subrank in S = {error_S_subrank}')

computing subrank over S...
walks performed = 157
performed random walks in 0.05398416519165039 seconds
error pagerank vs subrank in S = 0.017882599344811713


## Step 6: Approximate Pagerank over $S$ using naive Monte-Carlo recomputation

Run the Monte-Carlo Pagerank algorithm on $S$ as a reference for the number of random walks required and the error achieved.

In [20]:
# computing the monte-carlo pagerank 
print('computing naive monte-carlo pagerank over S')
tic = time.time()

_, mc_pagerank_S_naive = get_mc_pagerank(S,R)

toc = time.time()
print(f'finished in {toc-tic} seconds')

# computing the L1 error
error_S_naive = sum( abs(p_S[node] - mc_pagerank_S_naive[node])
                      for node in S.nodes())

print(f'error pagerank vs mc pagerank in S = {error_S_naive}')

computing naive monte-carlo pagerank over S
progress = 100%       
Total walks performed:  5000
finished in 0.010932207107543945 seconds
error pagerank vs mc pagerank in S = 0.020285385444477645
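
To put the outputs of Steps 5 and 6 side by side, the short snippet below computes the ratio of walks performed. The figures are copied from the single run shown above; since $S$ is sampled at random, another run will give different numbers.

```python
# figures copied from the outputs of Step 5 (Subrank) and Step 6 (naive Monte-Carlo)
subrank_walks = 157
naive_walks = 5000
subrank_error = 0.017882599344811713
naive_error = 0.020285385444477645

# how many more walks the naive recomputation needed in this run
walk_ratio = naive_walks / subrank_walks

print(f'naive Monte-Carlo performed {walk_ratio:.1f}x as many walks as Subrank')
print(f'L1 error: subrank = {subrank_error:.4f}, naive = {naive_error:.4f}')
```

In this run, Subrank performed roughly 32 times fewer walks than the naive recomputation while reaching a comparable L1 error.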
