# Billion-scale attributed isomorphic nodes with 🍇🍇 GRAPE 🍇🍇
In this tutorial, I will show you how to use the [GRAPE library](https://github.com/AnacletoLAB/grape) to identify attributed isomorphic node groups in large graphs and knowledge graphs (KGs). The algorithm works on both attributed and non-attributed graphs, and helps identify notable oddities that can slip into KGs during construction.

I will then briefly explain how the algorithm available in [GRAPE](https://github.com/AnacletoLAB/grape) works.

By the end of the tutorial, you will have a good understanding of how to use [GRAPE](https://github.com/AnacletoLAB/grape) to identify the isomorphic nodes of a graph, and you will be able to apply this knowledge to your projects.

[Remember to ⭐ GRAPE!](https://github.com/AnacletoLAB/grape)

### What is GRAPE?
[🍇🍇 GRAPE 🍇🍇](https://github.com/AnacletoLAB/grape) is a graph processing and embedding library that enables users to easily manipulate and analyze graphs. With [GRAPE](https://github.com/AnacletoLAB/grape), users can efficiently load and preprocess graphs, generate random walks, and apply various node and edge embedding models. Additionally, [GRAPE](https://github.com/AnacletoLAB/grape) provides a fair and reproducible evaluation pipeline for comparing different graph embedding and graph-based prediction methods.

## Isomorphic nodes
Nodes in an isomorphic node group (ING) have the same neighbourhood. They are, therefore, topologically identical, as they cannot be distinguished from one another based on their connections. Intuitively, swapping nodes within an ING does not change the graph topology.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/isomorphic_nodes_spiderman.jpg?raw=true" width=600 />

### A formal definition
Two nodes $a$ and $b$ are isomorphic when they have the same neighbours once $a$ and $b$ themselves are excluded; we denote this restricted neighbourhood as $\widehat{\mathcal{N}}_{w}(v) = \mathcal{N}(v) \setminus \{v, w\}$. Moreover, $a \in \mathcal{N}(b)$ must hold if and only if $b \in \mathcal{N}(a)$. When available, edge labels and weights should also be compared.
$$
    eq(a,b) = (\widehat{\mathcal{N}}_{b}(a) = \widehat{\mathcal{N}}_{a}(b)) \land \left(b \in \mathcal{N}(a) \Longleftrightarrow a \in \mathcal{N}(b)\right)
$$

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/connected_isomorphic_nodes.png?raw=true" width=400 />

### Basic examples
Let's proceed with some intuitive examples:

#### Triangles
In a triangle, i.e. a circular graph with 3 nodes, all nodes are isomorphic to each other.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/triangle.jpg?raw=true" width=200 />

#### Squares
In a square, i.e. a circular graph with 4 nodes, the nodes on opposite corners are isomorphic, forming two pairs.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/squares.png?raw=true" width=200 />
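
The definition and the two examples above can be checked with a few lines of plain Python. This is only a minimal sketch that does not use GRAPE: the adjacency dictionaries and the helper names `neighbourhood_without` and `are_isomorphic` are assumptions made for illustration, and edge labels and weights are ignored.

```python
def neighbourhood_without(adj, v, w):
    """Return N(v) minus {v, w}, the restricted neighbourhood of the definition."""
    return adj[v] - {v, w}


def are_isomorphic(adj, a, b):
    """Evaluate eq(a, b) on an adjacency mapping {node: set of neighbours}."""
    same_neighbours = neighbourhood_without(adj, a, b) == neighbourhood_without(adj, b, a)
    symmetric_link = (b in adj[a]) == (a in adj[b])
    return same_neighbours and symmetric_link


# Hypothetical toy graphs: a 3-cycle and a 4-cycle.
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}

print(are_isomorphic(triangle, 0, 1))  # True: every pair in the triangle matches
print(are_isomorphic(square, 0, 2))    # True: opposite corners share both neighbours
print(are_isomorphic(square, 0, 1))    # False: adjacent corners differ
```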


### The algorithm
As discussed above, nodes in an isomorphic group have the same neighborhood and are therefore topologically indistinguishable: swapping nodes within a group does not change the graph topology.
Isomorphic nodes with high degrees are particularly interesting, as they are most likely duplicates or artifacts caused by a bug in the graph generation.

In GRAPE, we implemented an efficient algorithm to detect Isomorphic Groups. 
Two nodes are Isomorphic if they have the same neighbors. 
Therefore, the naive algorithm would check each pair of nodes in the graph. This approach scales quadratically with the number of nodes, making it intractable on large graphs.
To mitigate this problem, we focus on nodes with degrees higher than a given threshold, as they are the most significant.
Note that many graphs have few high-degree and many low-degree nodes, dramatically reducing the number of nodes to consider.
Our algorithm fingerprints and groups nodes using a hash to reduce the size of the node groups to check.

The algorithm starts by filtering out all the nodes with degrees less than $d_\text{min}$ to reduce the number of comparisons needed.
Then we fingerprint each node through a hash of its degree and its first few neighbors. Finally, we collect the nodes that share a fingerprint; since distinct neighborhoods may collide on the same fingerprint, each such collection is then separated into possibly multiple isomorphic groups.
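
The three steps can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not GRAPE's Rust implementation: the graph is undirected and has no self-loops, Python's built-in `hash` stands in for the configurable hash, edge labels and weights are ignored, and the two-pass handling of adjacent isomorphic nodes is a shortcut chosen for this sketch. The name `isomorphic_groups` and its parameters are hypothetical.

```python
from collections import defaultdict


def isomorphic_groups(adj, d_min=1, k=10):
    """Group isomorphic nodes of an undirected graph without self-loops.

    adj maps each node to the set of its neighbours.
    """
    # Step 1: drop the nodes whose degree is below the threshold.
    candidates = [v for v in adj if len(adj[v]) >= d_min]
    groups = []
    # Two passes (an assumption of this sketch): the open neighbourhood N(v)
    # catches non-adjacent isomorphic nodes, the closed one N(v) | {v}
    # catches adjacent ones, such as the nodes of a triangle.
    for closed in (False, True):
        # Step 2: fingerprint = hash of the degree and the first k neighbours.
        buckets = defaultdict(list)
        for v in candidates:
            neighbours = sorted(adj[v] | {v}) if closed else sorted(adj[v])
            buckets[hash((len(adj[v]), tuple(neighbours[:k])))].append(v)
        # Step 3: a shared fingerprint is only a hint, so compare the full
        # neighbourhoods and split each bucket into verified groups.
        for bucket in buckets.values():
            verified = []
            for v in bucket:
                for group in verified:
                    u = group[0]
                    if (v in adj[u]) == closed and adj[u] - {u, v} == adj[v] - {u, v}:
                        group.append(v)
                        break
                else:
                    verified.append([v])
            groups.extend(group for group in verified if len(group) > 1)
    return sorted(groups)


triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
square = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(isomorphic_groups(triangle))  # [[0, 1, 2]]
print(isomorphic_groups(square))    # [[0, 2], [1, 3]]
```

With `k=0` the fingerprint degenerates to the degree alone, so the buckets grow and the verification in step 3 has to do much more work; a larger `k` makes the fingerprint more selective but more expensive to compute.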

Part of the performance of our algorithm relies on the right tradeoff between hash quality, since we want as few collisions as possible, and how expensive the hash is to compute. For this reason, our implementation supports different hashes, from the cryptographically robust SipHash2-4 and the common xxh3 to two fast but low-quality custom hashes.
 
The following values can be passed to the `hash_name` argument:
- [`"ahasher"`](https://github.com/tkaitchuck/aHash): A really fast, good-quality hash.
- [`"siphash"`](https://en.wikipedia.org/wiki/SipHash): A cryptographically robust hash, which should offer better guarantees on the randomness of the hashes.
- [`"xxh3"`](https://github.com/Cyan4973/xxHash): A robust hash used in many software projects such as Linux, MySQL, and Apache Spark.
- `"simple"`: A simple test hash where the value is XOR-ed against the state, and then 0xed4e83c06c9fe588 is added.
- [`"xorshift"`](https://en.wikipedia.org/wiki/Xorshift): The state is multiplied by the value after being XOR-ed against 0x44d4c5a74c775ba0, and then we execute a round of Xorshift64.

[You can find the Rust implementation of the algorithm here.](https://github.com/AnacletoLAB/ensmallen/blob/c597a58d587721fec762796699742bbb0c307835/graph/src/isomorphism.rs#L18)
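
To make the two custom hashes more concrete, here is a rough Python rendition of their update steps as I read the descriptions above. Several details are assumptions: that it is the value (not the state) that is XOR-ed against the constant in `"xorshift"`, that the Xorshift64 round uses the classic (13, 7, 17) shift triple, that arithmetic wraps at 64 bits, and the seed and the `fingerprint` folding helper, which are hypothetical. The linked Rust source remains the reference.

```python
MASK = (1 << 64) - 1  # keep every intermediate value within 64 bits


def simple_update(state, value):
    """XOR the value into the state, then add the constant from the text."""
    return ((state ^ value) + 0xED4E83C06C9FE588) & MASK


def xorshift64(x):
    """One round of 64-bit Xorshift, assuming the (13, 7, 17) shifts."""
    x ^= (x << 13) & MASK
    x ^= x >> 7
    x ^= (x << 17) & MASK
    return x


def xorshift_update(state, value):
    """Multiply the state by the XOR-ed value, then run one Xorshift64 round."""
    return xorshift64((state * (value ^ 0x44D4C5A74C775BA0)) & MASK)


def fingerprint(update, degree, neighbours, seed=1):
    """Fold the degree and the given neighbours into a single 64-bit state."""
    state = update(seed, degree)
    for neighbour in neighbours:
        state = update(state, neighbour)
    return state


print(hex(simple_update(0, 0)))  # 0xed4e83c06c9fe588
print(hex(fingerprint(xorshift_update, 3, [4, 7, 9])))
```

Both updates cost only a handful of integer operations per neighbour, which is why they are fast; the price is a weaker mixing of the bits compared to xxh3 or SipHash.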

## Installing GRAPE
First, we install the GRAPE library from PyPI:

In [1]:
!pip install grape -qU

## Experiments
Welcome to the experiments section of this tutorial! In this section, we will put our knowledge into practice by applying the isomorphic node group detection algorithm to graphs from the STRING collection and to four larger graphs: the [KGCOVID19 knowledge graph](https://www.cell.com/patterns/fulltext/S2666-3899(20)30203-8?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2666389920302038%3Fshowall%3Dtrue), the [Friendster graph](https://networkrepository.com/friendster.php), the [ClueWeb09 web graph](https://networkrepository.com/web-ClueWeb09.php), and [the WikiData graph](https://www.wikidata.org/wiki/Wikidata:Main_Page).

We run these experiments on a machine with 24 threads and 12 cores.

**Do note that, because of the memory limits of my desktop, I will restart the Jupyter kernel after running the experiment on each of the large graphs.**

My machine only has 24 threads. You can estimate your expected computation time by scaling the timings reported here for 24 threads to the number of threads available on your machine:

In [2]:
import os

os.cpu_count()

24

Also, this machine has about `128GB` of RAM:

In [3]:
import psutil
    
psutil.virtual_memory().total / 1024**3 # total physical memory in GiB

125.7063217163086

### Triangle
In a triangle, a circular graph with 3 nodes, all nodes are isomorphic to each other.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/triangle.jpg?raw=true" width=200 />

In [4]:
from grape import Graph

g = Graph.generate_circle_graph(nodes_number=3)
g.get_isomorphic_node_names_groups()

[['0', '1', '2']]

### Square
In a square, i.e. a circular graph with 4 nodes, the nodes on opposite corners are isomorphic, forming two pairs.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/squares.png?raw=true" width=200 />

In [3]:
from grape import Graph

g = Graph.generate_circle_graph(nodes_number=4)
g.get_isomorphic_node_names_groups()

[['0', '2'], ['1', '3']]

### STRING
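
The benchmark cells in this section assert that the `RAYON_NUM_THREADS` environment variable has a given value, so it has to be set before they run. A minimal sketch of one way to do it from Python follows; it assumes that GRAPE's Rust backend reads the variable when its thread pool is first created, so the assignment must come before GRAPE is imported, and the value `"1"` is just the setting for the first experiment. Exporting the variable in the shell before launching Jupyter should work as well.

```python
import os

# Setting chosen for this sketch: run the Rust backend single-threaded.
os.environ["RAYON_NUM_THREADS"] = "1"

# Import GRAPE only after this point, so that the thread pool has not
# been created yet when the variable is set.
print(os.environ["RAYON_NUM_THREADS"])
```

Since the thread pool is presumably created only once per process, changing the number of threads is another reason to restart the kernel between experiments.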

We evaluate the performance of the hash functions on three STRING graphs using a single thread:

In [18]:
%%time
import os

assert os.environ["RAYON_NUM_THREADS"] == "1"

from tqdm.auto import tqdm
import pandas as pd
from grape.datasets.string import (
    SaccharomycesCerevisiae,
    HomoSapiens,
    MusMusculus
)

string_results = []

for graph in tqdm((SaccharomycesCerevisiae, HomoSapiens, MusMusculus)):
    graph = graph()
    for iteration in range(100):
        for hash_name in ("simple", "ahasher", "xxh3", "siphash"):
            result = {
                "hash_name": hash_name,
                "graph_name": graph.get_name(),
                **graph.get_number_of_isomorphic_node_groups(
                    hash_name=hash_name,
                    minimum_node_degree=100,
                    number_of_neighbours_for_hash=1000,
                )
            }
            
            result.pop("number_of_isomorphic_node_groups")
            
            string_results.append(result)
            
string_results = pd.DataFrame(string_results)
string_results

CPU times: user 5min 15s, sys: 576 ms, total: 5min 16s
Wall time: 5min 17s


| | hash_name | graph_name | collecting_degree_bounded_nodes_and_hash | counting_isomorphic_groups | sorting_degree_bounded_nodes_and_hash |
|---|---|---|---|---|---|
| 0 | simple | SaccharomycesCerevisiae | 25 | 0 | 0 |
| 1 | ahasher | SaccharomycesCerevisiae | 27 | 0 | 0 |
| 2 | xxh3 | SaccharomycesCerevisiae | 150 | 0 | 0 |
| 3 | siphash | SaccharomycesCerevisiae | 47 | 0 | 0 |
| 4 | simple | SaccharomycesCerevisiae | 27 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1195 | siphash | MusMusculus | 266 | 0 | 0 |
| 1196 | simple | MusMusculus | 146 | 0 | 0 |
| 1197 | ahasher | MusMusculus | 165 | 0 | 0 |
| 1198 | xxh3 | MusMusculus | 818 | 0 | 0 |


We compute the mean and standard deviation of the single-thread STRING results for each hash function and graph:

In [19]:
string_results.groupby(["hash_name", "graph_name"]).agg(["mean", "std"])

| hash_name | graph_name | collecting_degree_bounded_nodes_and_hash (mean) | collecting_degree_bounded_nodes_and_hash (std) | counting_isomorphic_groups (mean) | counting_isomorphic_groups (std) | sorting_degree_bounded_nodes_and_hash (mean) | sorting_degree_bounded_nodes_and_hash (std) |
|---|---|---|---|---|---|---|---|
| ahasher | HomoSapiens | 149.69 | 1.252835 | 0.0 | 0.0 | 0.0 | 0.0 |
| ahasher | MusMusculus | 163.57 | 2.749674 | 0.0 | 0.0 | 0.0 | 0.0 |
| ahasher | SaccharomycesCerevisiae | 27.36 | 0.643852 | 0.0 | 0.0 | 0.0 | 0.0 |
| simple | HomoSapiens | 135.75 | 2.969117 | 0.0 | 0.0 | 0.0 | 0.0 |
| simple | MusMusculus | 148.05 | 3.619183 | 0.0 | 0.0 | 0.0 | 0.0 |
| simple | SaccharomycesCerevisiae | 25.34 | 0.768312 | 0.0 | 0.0 | 0.0 | 0.0 |
| siphash | HomoSapiens | 241.85 | 1.833333 | 0.0 | 0.0 | 0.0 | 0.0 |
| siphash | MusMusculus | 265.78 | 4.672248 | 0.0 | 0.0 | 0.0 | 0.0 |
| siphash | SaccharomycesCerevisiae | 44.68 | 0.815011 | 0.0 | 0.0 | 0.0 | 0.0 |
| xxh3 | HomoSapiens | 746.75 | 14.848987 | 0.0 | 0.0 | 0.0 | 0.0 |


We evaluate the performance of the hash functions on the STRING graphs using six threads:

In [1]:
%%time
import os

assert os.environ["RAYON_NUM_THREADS"] == "6"

from tqdm.auto import tqdm
import pandas as pd
from grape.datasets.string import (
    SaccharomycesCerevisiae,
    HomoSapiens,
    MusMusculus
)

string_results = []

for graph in tqdm((SaccharomycesCerevisiae, HomoSapiens, MusMusculus)):
    graph = graph()
    for iteration in range(100):
        for hash_name in ("simple", "ahasher", "xxh3", "siphash"):
            result = {
                "hash_name": hash_name,
                "graph_name": graph.get_name(),
                **graph.get_number_of_isomorphic_node_groups(
                    hash_name=hash_name,
                    minimum_node_degree=100,
                    number_of_neighbours_for_hash=1000,
                )
            }
            
            result.pop("number_of_isomorphic_node_groups")
            
            string_results.append(result)
            
string_results = pd.DataFrame(string_results)
string_results

CPU times: user 5min 21s, sys: 4.27 s, total: 5min 25s
Wall time: 1min 14s


| | hash_name | graph_name | sorting_degree_bounded_nodes_and_hash | collecting_degree_bounded_nodes_and_hash | counting_isomorphic_groups |
|---|---|---|---|---|---|
| 0 | simple | SaccharomycesCerevisiae | 0 | 4 | 0 |
| 1 | ahasher | SaccharomycesCerevisiae | 0 | 4 | 0 |
| 2 | xxh3 | SaccharomycesCerevisiae | 0 | 22 | 0 |
| 3 | siphash | SaccharomycesCerevisiae | 0 | 7 | 0 |
| 4 | simple | SaccharomycesCerevisiae | 0 | 4 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1195 | siphash | MusMusculus | 0 | 44 | 0 |
| 1196 | simple | MusMusculus | 0 | 24 | 0 |
| 1197 | ahasher | MusMusculus | 0 | 27 | 0 |
| 1198 | xxh3 | MusMusculus | 0 | 131 | 0 |


We compute the mean and standard deviation of the six-thread STRING results for each hash function and graph:

In [2]:
string_results.groupby(["hash_name", "graph_name"]).agg(["mean", "std"])

| hash_name | graph_name | sorting_degree_bounded_nodes_and_hash (mean) | sorting_degree_bounded_nodes_and_hash (std) | collecting_degree_bounded_nodes_and_hash (mean) | collecting_degree_bounded_nodes_and_hash (std) | counting_isomorphic_groups (mean) | counting_isomorphic_groups (std) |
|---|---|---|---|---|---|---|---|
| ahasher | HomoSapiens | 0.0 | 0.0 | 25.01 | 0.1 | 0.0 | 0.0 |
| ahasher | MusMusculus | 0.0 | 0.0 | 27.17 | 0.53286 | 0.0 | 0.0 |
| ahasher | SaccharomycesCerevisiae | 0.0 | 0.0 | 4.03 | 0.171447 | 0.0 | 0.0 |
| simple | HomoSapiens | 0.0 | 0.0 | 22.82 | 0.641652 | 0.0 | 0.0 |
| simple | MusMusculus | 0.0 | 0.0 | 24.84 | 0.734709 | 0.0 | 0.0 |
| simple | SaccharomycesCerevisiae | 0.0 | 0.0 | 4.02 | 0.140705 | 0.0 | 0.0 |
| siphash | HomoSapiens | 0.0 | 0.0 | 40.36 | 1.105724 | 0.0 | 0.0 |
| siphash | MusMusculus | 0.0 | 0.0 | 43.95 | 0.592461 | 0.0 | 0.0 |
| siphash | SaccharomycesCerevisiae | 0.0 | 0.0 | 7.07 | 0.256432 | 0.0 | 0.0 |
| xxh3 | HomoSapiens | 0.0 | 0.0 | 121.94 | 1.722225 | 0.0 | 0.0 |


We evaluate the performance of the hash functions on the STRING graphs using twelve threads:

In [1]:
%%time
import os

assert os.environ["RAYON_NUM_THREADS"] == "12"

from tqdm.auto import tqdm
import pandas as pd
from grape.datasets.string import (
    SaccharomycesCerevisiae,
    HomoSapiens,
    MusMusculus
)

string_results = []

for graph in tqdm((SaccharomycesCerevisiae, HomoSapiens, MusMusculus)):
    graph = graph()
    for iteration in range(100):
        for hash_name in ("simple", "ahasher", "xxh3", "siphash"):
            result = {
                "hash_name": hash_name,
                "graph_name": graph.get_name(),
                **graph.get_number_of_isomorphic_node_groups(
                    hash_name=hash_name,
                    minimum_node_degree=100,
                    number_of_neighbours_for_hash=1000,
                )
            }
            
            result.pop("number_of_isomorphic_node_groups")
            
            string_results.append(result)
            
string_results = pd.DataFrame(string_results)
string_results

CPU times: user 5min 23s, sys: 4.63 s, total: 5min 27s
Wall time: 49.5 s


| | hash_name | graph_name | collecting_degree_bounded_nodes_and_hash | counting_isomorphic_groups | sorting_degree_bounded_nodes_and_hash |
|---|---|---|---|---|---|
| 0 | simple | SaccharomycesCerevisiae | 2 | 0 | 0 |
| 1 | ahasher | SaccharomycesCerevisiae | 2 | 0 | 0 |
| 2 | xxh3 | SaccharomycesCerevisiae | 12 | 0 | 0 |
| 3 | siphash | SaccharomycesCerevisiae | 4 | 0 | 0 |
| 4 | simple | SaccharomycesCerevisiae | 2 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1195 | siphash | MusMusculus | 22 | 0 | 0 |
| 1196 | simple | MusMusculus | 13 | 0 | 0 |
| 1197 | ahasher | MusMusculus | 13 | 0 | 0 |
| 1198 | xxh3 | MusMusculus | 66 | 0 | 0 |


We compute the mean and standard deviation of the twelve-thread STRING results for each hash function and graph:

In [2]:
string_results.groupby(["hash_name", "graph_name"]).agg(["mean", "std"])

| hash_name | graph_name | collecting_degree_bounded_nodes_and_hash (mean) | collecting_degree_bounded_nodes_and_hash (std) | counting_isomorphic_groups (mean) | counting_isomorphic_groups (std) | sorting_degree_bounded_nodes_and_hash (mean) | sorting_degree_bounded_nodes_and_hash (std) |
|---|---|---|---|---|---|---|---|
| ahasher | HomoSapiens | 12.35 | 0.641573 | 0.0 | 0.0 | 0.0 | 0.0 |
| ahasher | MusMusculus | 13.24 | 0.452155 | 0.0 | 0.0 | 0.0 | 0.0 |
| ahasher | SaccharomycesCerevisiae | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| simple | HomoSapiens | 11.73 | 0.489382 | 0.0 | 0.0 | 0.0 | 0.0 |
| simple | MusMusculus | 12.7 | 0.460566 | 0.0 | 0.0 | 0.0 | 0.0 |
| simple | SaccharomycesCerevisiae | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| siphash | HomoSapiens | 20.29 | 0.591096 | 0.0 | 0.0 | 0.0 | 0.0 |
| siphash | MusMusculus | 22.09 | 0.287623 | 0.0 | 0.0 | 0.0 | 0.0 |
| siphash | SaccharomycesCerevisiae | 3.13 | 0.337998 | 0.0 | 0.0 | 0.0 | 0.0 |
| xxh3 | HomoSapiens | 60.68 | 1.98418 | 0.0 | 0.0 | 0.0 | 0.0 |


We evaluate the performance of the hash functions on the STRING graphs using 24 threads:

In [5]:
%%time
import os

assert os.environ["RAYON_NUM_THREADS"] == "24"

from tqdm.auto import tqdm
import pandas as pd
from grape.datasets.string import (
    SaccharomycesCerevisiae,
    HomoSapiens,
    MusMusculus
)

string_results = []

for graph in tqdm((SaccharomycesCerevisiae, HomoSapiens, MusMusculus)):
    graph = graph()
    for iteration in range(100):
        for hash_name in ("simple", "ahasher", "xxh3", "siphash"):
            result = {
                "hash_name": hash_name,
                "graph_name": graph.get_name(),
                **graph.get_number_of_isomorphic_node_groups(
                    hash_name=hash_name,
                    minimum_node_degree=100,
                    number_of_neighbours_for_hash=1000,
                )
            }
            
            result.pop("number_of_isomorphic_node_groups")
            
            string_results.append(result)
            
string_results = pd.DataFrame(string_results)
string_results

CPU times: user 8min 6s, sys: 2.71 s, total: 8min 8s
Wall time: 39 s


| | hash_name | graph_name | sorting_degree_bounded_nodes_and_hash | collecting_degree_bounded_nodes_and_hash | counting_isomorphic_groups |
|---|---|---|---|---|---|
| 0 | simple | SaccharomycesCerevisiae | 0 | 2 | 0 |
| 1 | ahasher | SaccharomycesCerevisiae | 0 | 2 | 0 |
| 2 | xxh3 | SaccharomycesCerevisiae | 0 | 8 | 0 |
| 3 | siphash | SaccharomycesCerevisiae | 0 | 3 | 0 |
| 4 | simple | SaccharomycesCerevisiae | 0 | 2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 1195 | siphash | MusMusculus | 0 | 20 | 0 |
| 1196 | simple | MusMusculus | 0 | 11 | 0 |
| 1197 | ahasher | MusMusculus | 0 | 11 | 0 |
| 1198 | xxh3 | MusMusculus | 0 | 50 | 0 |


We compute the mean and standard deviation of the 24-thread STRING results for each hash function and graph:

In [7]:
string_results.groupby(["hash_name", "graph_name"]).agg(["mean", "std"])

| hash_name | graph_name | sorting_degree_bounded_nodes_and_hash (mean) | sorting_degree_bounded_nodes_and_hash (std) | collecting_degree_bounded_nodes_and_hash (mean) | collecting_degree_bounded_nodes_and_hash (std) | counting_isomorphic_groups (mean) | counting_isomorphic_groups (std) |
|---|---|---|---|---|---|---|---|
| ahasher | HomoSapiens | 0.0 | 0.0 | 10.08 | 0.27266 | 0.0 | 0.0 |
| ahasher | MusMusculus | 0.0 | 0.0 | 11.04 | 0.196946 | 0.0 | 0.0 |
| ahasher | SaccharomycesCerevisiae | 0.0 | 0.0 | 2.06 | 0.422116 | 0.0 | 0.0 |
| simple | HomoSapiens | 0.0 | 0.0 | 10.07 | 0.256432 | 0.0 | 0.0 |
| simple | MusMusculus | 0.0 | 0.0 | 11.03 | 0.331967 | 0.0 | 0.0 |
| simple | SaccharomycesCerevisiae | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| siphash | HomoSapiens | 0.0 | 0.0 | 18.24 | 0.494822 | 0.01 | 0.1 |
| siphash | MusMusculus | 0.0 | 0.0 | 20.04 | 0.373896 | 0.0 | 0.0 |
| siphash | SaccharomycesCerevisiae | 0.0 | 0.0 | 3.08 | 0.27266 | 0.0 | 0.0 |
| xxh3 | HomoSapiens | 0.0 | 0.0 | 46.79 | 0.890976 | 0.0 | 0.0 |
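
To summarise how the STRING benchmark scales with the number of threads, the wall times printed by the four `%%time` cells can be turned into speedups and parallel efficiencies. The sketch below only re-uses those four numbers; keep in mind that the wall times also include loading the graphs and the Python loop overhead, so they understate the scaling of the algorithm itself, and that the machine has 12 physical cores, so the 24-thread run relies on hyper-threading.

```python
# Wall times in seconds, copied from the %%time outputs of the four STRING runs.
wall_times = {1: 5 * 60 + 17, 6: 60 + 14, 12: 49.5, 24: 39.0}

# Speedup relative to the single-thread run, and efficiency per thread.
speedups = {threads: wall_times[1] / seconds for threads, seconds in wall_times.items()}
efficiencies = {threads: speedup / threads for threads, speedup in speedups.items()}

for threads in wall_times:
    print(
        f"{threads:>2} threads: speedup {speedups[threads]:.2f}x, "
        f"efficiency {efficiencies[threads]:.0%}"
    )
```

The speedups come out at roughly 4.3x, 6.4x and 8.1x for 6, 12 and 24 threads respectively.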


### KGCOVID19
We kick off our experiments with a relatively small graph, considering the sizes of the networks we will tackle by the end of it: KGCOVID19, with `574K` nodes and `18M` edges.

#### What is KGCOVID19?
[KGCOVID19](https://doi.org/10.1016%2Fj.patter.2020.100155) is a framework for producing knowledge graphs (KGs) that integrate biomedical data related to the COVID-19 pandemic. The framework is designed to be flexible and customizable, allowing researchers to create KGs for different downstream applications, including machine learning tasks, hypothesis-based querying, and browsable user interfaces for exploring and discovering relationships in COVID-19 data. The goal of KGCOVID19 is to provide an up-to-date, integrated source of data on SARS-CoV-2 and related viruses, including SARS-CoV and MERS-CoV, to support the biomedical research community in its efforts to respond to the COVID-19 pandemic. The framework can also be applied to other situations where siloed biomedical data must be quickly integrated for various research purposes, including future pandemics.

In [14]:
%%time
from grape.datasets.kghub import KGCOVID19

kgcovid19 = KGCOVID19()

CPU times: user 27.6 s, sys: 434 ms, total: 28.1 s
Wall time: 4.67 s


We display the number of nodes, `574K`, and of undirected edges, `18M`.

In [2]:
kgcovid19.get_number_of_nodes(), kgcovid19.get_number_of_edges()

(574232, 18251238)

In [15]:
%%timeit
kgcovid19.get_number_of_isomorphic_node_groups(
    minimum_node_degree=100,
    number_of_neighbours_for_hash=100
)

56.9 ms ± 696 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [16]:
%%timeit
kgcovid19.get_number_of_isomorphic_node_groups(
    minimum_node_degree=100,
    number_of_neighbours_for_hash=10
)

8.47 ms ± 47 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [18]:
%%timeit
kgcovid19.get_number_of_isomorphic_node_groups(
    minimum_node_degree=100,
    number_of_neighbours_for_hash=1
)

5.93 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Friendster
[Friendster](https://en.wikipedia.org/wiki/Friendster) was a social networking service launched in 2002. It was one of the first social networking sites and was popular in the early 2000s. The site allowed users to connect with friends and meet new people through the use of personal profiles and networks of friends. Friendster was initially successful but eventually faced competition from more recent social networking sites such as MySpace and Facebook. It was acquired by the Malaysian company MOL Global in 2009, transitioned from a social networking site to a social gaming site in 2011, and suspended its services in 2015.

#### What is the network repository?
[Network Repository](https://networkrepository.com/index.php) is a scientific network data repository that provides interactive visualization and mining tools for analyzing and exploring network data. It is the first interactive repository of its kind. Network Repository is intended to facilitate scientific research on networks by making it easier for researchers to access and analyze an extensive network data collection. It is a valuable resource for researchers in various fields, including network science, bioinformatics, machine learning, data mining, physics, and social science.

#### ⚠️⚠️⚠️ WARNING: Make sure you have enough disk space! ⚠️⚠️⚠️
*Please be aware that this graph is not small and requires a significant amount of disk space to store and work with. Before proceeding with the tutorial, ensure you have enough free space on your hard drive or other storage devices to accommodate the size of the graph. If you do not have sufficient space, you may encounter errors or other issues when downloading or working with the graph. It is important to ensure that you have enough space available before proceeding. If necessary, consider freeing up additional space on your device to make room for the graph.*

In [12]:
!du -sh /bfd/graphs/networkrepository/SocFriendster

97G	/bfd/graphs/networkrepository/SocFriendster
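
Before starting the download, you can check programmatically whether the target filesystem has at least that much free space. A minimal sketch using only the standard library follows; the helper name `has_enough_space` and the `"."` path are placeholders, and the 97 GiB figure is simply the size reported by `du` above, so additional temporary space may be needed while the files are downloaded and preprocessed.

```python
import shutil


def has_enough_space(path, required_gib):
    """Return True when the filesystem holding `path` has the required free space."""
    free_gib = shutil.disk_usage(path).free / 1024**3
    return free_gib >= required_gib


# 97 GiB is the size reported by `du -sh` above; "." is a placeholder path
# to be replaced with the directory where the graphs are stored.
print(has_enough_space(".", 97))
```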


In the next cell, we retrieve and load the Friendster dataset, which GRAPE makes available from the [network repository](https://networkrepository.com/index.php). Do note that we configure it not to load the node names in order to conserve memory.

In [19]:
from grape.datasets.networkrepository import SocFriendster

friendster = SocFriendster(
    # We cannot load the node names, as they would require too much memory
    # for my poor old desktop.
    load_nodes=False,
)

We display the number of nodes, `65.6M`, and of undirected edges, `1.8G`.

In [7]:
friendster.get_number_of_nodes(), friendster.get_number_of_edges()

(65608366, 1806067135)

We check the performance of the algorithm on Friendster with different hash functions, using a single thread:

In [1]:
%%time
import os

assert os.environ["RAYON_NUM_THREADS"] == "1"

from tqdm.auto import tqdm, trange
import pandas as pd

friendster_results = []

for iteration in trange(10):
    for hash_name in ("simple", "ahasher", "xxh3", "siphash"):
        result = {
            "hash_name": hash_name,
            "graph_name": friendster.get_name(),
            **friendster.get_number_of_isomorphic_node_groups(
                hash_name=hash_name,
                minimum_node_degree=100,
                number_of_neighbours_for_hash=1000,
            )
        }

        friendster_results.append(result)
            
friendster_results = pd.DataFrame(friendster_results)
display(friendster_results)

friendster_results.groupby(["hash_name", "graph_name"]).agg(["mean", "std"])

| | hash_name | graph_name | sorting_degree_bounded_nodes_and_hash | counting_isomorphic_groups | collecting_degree_bounded_nodes_and_hash |
|---|---|---|---|---|---|
| 0 | simple | SocFriendster | 450 | 17 | 142789 |
| 1 | ahasher | SocFriendster | 446 | 17 | 144230 |
| 2 | xxh3 | SocFriendster | 446 | 16 | 343295 |
| 3 | siphash | SocFriendster | 452 | 17 | 244417 |
| 4 | simple | SocFriendster | 450 | 17 | 143910 |
| 5 | ahasher | SocFriendster | 447 | 17 | 144759 |
| 6 | xxh3 | SocFriendster | 452 | 17 | 345700 |
| 7 | siphash | SocFriendster | 447 | 17 | 244458 |
| 8 | simple | SocFriendster | 448 | 17 | 143162 |
| 9 | ahasher | SocFriendster | 444 | 18 | 143195 |


CPU times: user 2h 44min 10s, sys: 19.6 s, total: 2h 44min 30s
Wall time: 2h 44min 15s


| hash_name | graph_name | sorting_degree_bounded_nodes_and_hash (mean) | sorting_degree_bounded_nodes_and_hash (std) | counting_isomorphic_groups (mean) | counting_isomorphic_groups (std) | collecting_degree_bounded_nodes_and_hash (mean) | collecting_degree_bounded_nodes_and_hash (std) |
|---|---|---|---|---|---|---|---|
| ahasher | SocFriendster | 448.7 | 2.58414 | 17.4 | 0.516398 | 144223.1 | 647.30526 |
| simple | SocFriendster | 449.7 | 2.110819 | 17.1 | 0.316228 | 142812.2 | 722.642497 |
| siphash | SocFriendster | 449.5 | 3.02765 | 17.3 | 0.483046 | 243392.5 | 1527.175334 |
| xxh3 | SocFriendster | 448.8 | 4.022161 | 17.1 | 0.567646 | 344255.4 | 1799.479382 |


We check the performance of the algorithm on Friendster with different numbers of neighbours used for the hash:

In [1]:
%%time
import os

assert os.environ["RAYON_NUM_THREADS"] == "1"

from tqdm.auto import tqdm, trange
import pandas as pd

friendster_results = []

for iteration in trange(10):
    for k in tqdm((0, 10, 50, 100), leave=False):
        result = {
            "number_of_neighbours_for_hash": k,
            "graph_name": friendster.get_name(),
            **friendster.get_number_of_isomorphic_node_groups(
                hash_name="ahasher",
                minimum_node_degree=100,
                number_of_neighbours_for_hash=k,
            )
        }

        friendster_results.append(result)
            
friendster_results = pd.DataFrame(friendster_results)
display(friendster_results)

friendster_results.groupby(["number_of_neighbours_for_hash", "graph_name"]).agg(["mean", "std"])

| | number_of_neighbours_for_hash | graph_name | sorting_degree_bounded_nodes_and_hash | counting_isomorphic_groups | collecting_degree_bounded_nodes_and_hash |
|---|---|---|---|---|---|
| 0 | 0 | SocFriendster | 479 | 10540922 | 979 |
| 1 | 10 | SocFriendster | 455 | 18 | 4984 |
| 2 | 50 | SocFriendster | 455 | 18 | 25005 |
| 3 | 100 | SocFriendster | 464 | 18 | 53547 |
| 4 | 0 | SocFriendster | 478 | 10466141 | 979 |
| 5 | 10 | SocFriendster | 454 | 18 | 4988 |
| 6 | 50 | SocFriendster | 453 | 18 | 25051 |
| 7 | 100 | SocFriendster | 451 | 18 | 54020 |
| 8 | 0 | SocFriendster | 474 | 10580299 | 983 |
| 9 | 10 | SocFriendster | 452 | 18 | 4956 |


CPU times: user 1d 5h 58min 41s, sys: 2min 37s, total: 1d 6h 1min 19s
Wall time: 1d 5h 58min 46s


| number_of_neighbours_for_hash | graph_name | sorting_degree_bounded_nodes_and_hash (mean) | sorting_degree_bounded_nodes_and_hash (std) | counting_isomorphic_groups (mean) | counting_isomorphic_groups (std) | collecting_degree_bounded_nodes_and_hash (mean) | collecting_degree_bounded_nodes_and_hash (std) |
|---|---|---|---|---|---|---|---|
| 0 | SocFriendster | 474.9 | 3.3483 | 10595880.8 | 247520.199248 | 979.8 | 1.873796 |
| 10 | SocFriendster | 453.4 | 1.837873 | 18.0 | 0.0 | 5006.6 | 65.20259 |
| 50 | SocFriendster | 450.5 | 2.718251 | 18.0 | 0.0 | 25134.1 | 244.430154 |
| 100 | SocFriendster | 453.0 | 4.642796 | 18.0 | 0.0 | 54054.3 | 699.891904 |
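
The tradeoff behind `number_of_neighbours_for_hash` becomes clearer when the three phases are summed for each setting. The sketch below only re-uses the mean values from the aggregated results above, in whatever time unit GRAPE reports them; the variable names are mine. With no neighbours in the hash, the fingerprint is cheap to collect but the counting phase dominates by several orders of magnitude, while larger values make the collection phase increasingly expensive.

```python
# Mean values per phase, copied from the aggregated results above.
# Keys are number_of_neighbours_for_hash; values are
# (sorting, counting, collecting) in the units reported by GRAPE.
means = {
    0: (474.9, 10595880.8, 979.8),
    10: (453.4, 18.0, 5006.6),
    50: (450.5, 18.0, 25134.1),
    100: (453.0, 18.0, 54054.3),
}

# Total cost of the three phases for each setting.
totals = {k: sum(phases) for k, phases in means.items()}
best_k = min(totals, key=totals.get)

for k, total in totals.items():
    print(f"k={k:>3}: total {total:,.1f}")
print("cheapest setting among those tested:", best_k)
```

Among the four tested values, 10 neighbours gives the lowest total on Friendster; this is an observation on this graph and degree threshold, not a general recommendation.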


We check the performance of the algorithm on Friendster when running with six threads:

In [8]:
%%time
import os

assert os.environ["RAYON_NUM_THREADS"] == "6"

from tqdm.auto import tqdm, trange
import pandas as pd
from grape.datasets.networkrepository import SocFriendster

friendster_results = []

friendster = SocFriendster(
    # We cannot load the node names, as they would require too much memory
    # for my poor old desktop.
    load_nodes=False,
)

for iteration in trange(10):
    for hash_name in ("ahasher", ):
        result = {
            "hash_name": hash_name,
            "graph_name": friendster.get_name(),
            **friendster.get_number_of_isomorphic_node_groups(
                hash_name=hash_name,
                minimum_node_degree=100,
                number_of_neighbours_for_hash=1000,
            )
        }

        result.pop("number_of_isomorphic_node_groups")

        friendster_results.append(result)
            
friendster_results = pd.DataFrame(friendster_results)
friendster_results

CPU times: user 51min 51s, sys: 23.3 s, total: 52min 14s
Wall time: 9min 15s


| | hash_name | graph_name | sorting_degree_bounded_nodes_and_hash | counting_isomorphic_groups | collecting_degree_bounded_nodes_and_hash |
|---|---|---|---|---|---|
| 0 | ahasher | SocFriendster | 99 | 3 | 25682 |
| 1 | ahasher | SocFriendster | 98 | 3 | 25618 |
| 2 | ahasher | SocFriendster | 97 | 3 | 25341 |
| 3 | ahasher | SocFriendster | 97 | 3 | 25066 |
| 4 | ahasher | SocFriendster | 97 | 3 | 24930 |
| 5 | ahasher | SocFriendster | 96 | 3 | 24973 |
| 6 | ahasher | SocFriendster | 97 | 3 | 25154 |
| 7 | ahasher | SocFriendster | 99 | 3 | 25026 |
| 8 | ahasher | SocFriendster | 97 | 3 | 24993 |
| 9 | ahasher | SocFriendster | 97 | 3 | 25070 |


We compute the mean and standard deviation:

In [9]:
friendster_results.groupby(["hash_name", "graph_name"]).agg(["mean", "std"])

| hash_name | graph_name | sorting_degree_bounded_nodes_and_hash (mean) | sorting_degree_bounded_nodes_and_hash (std) | counting_isomorphic_groups (mean) | counting_isomorphic_groups (std) | collecting_degree_bounded_nodes_and_hash (mean) | collecting_degree_bounded_nodes_and_hash (std) |
|---|---|---|---|---|---|---|---|
| ahasher | SocFriendster | 97.4 | 0.966092 | 3.0 | 0.0 | 25185.3 | 270.656424 |


We check the performance of the algorithm on Friendster when running with 12 threads:

In [5]:
%%time
import os

assert os.environ["RAYON_NUM_THREADS"] == "12"

from tqdm.auto import tqdm, trange
import pandas as pd
from grape.datasets.networkrepository import SocFriendster

friendster_results = []

friendster = SocFriendster(
    # We cannot load the node names, as they would require too much memory
    # for my poor old desktop.
    load_nodes=False,
)

for iteration in trange(10):
    for hash_name in ("ahasher", ):
        result = {
            "hash_name": hash_name,
            "graph_name": friendster.get_name(),
            **friendster.get_number_of_isomorphic_node_groups(
                hash_name=hash_name,
                minimum_node_degree=100,
                number_of_neighbours_for_hash=1000,
            )
        }

        result.pop("number_of_isomorphic_node_groups")

        friendster_results.append(result)
            
friendster_results = pd.DataFrame(friendster_results)

display(friendster_results)

friendster_results.groupby(["hash_name", "graph_name"]).agg(["mean", "std"])

Unnamed: 0,hash_name,graph_name,counting_isomorphic_groups,collecting_degree_bounded_nodes_and_hash,sorting_degree_bounded_nodes_and_hash
0,ahasher,SocFriendster,2,13769,69
1,ahasher,SocFriendster,2,13756,68
2,ahasher,SocFriendster,2,13712,68
3,ahasher,SocFriendster,2,13900,72
4,ahasher,SocFriendster,2,13695,72
5,ahasher,SocFriendster,2,13737,71
6,ahasher,SocFriendster,2,13584,70
7,ahasher,SocFriendster,2,13734,69
8,ahasher,SocFriendster,2,13872,70
9,ahasher,SocFriendster,2,15253,68


CPU times: user 57min 54s, sys: 22.3 s, total: 58min 16s
Wall time: 5min 56s


hash_name,graph_name,counting_isomorphic_groups_mean,counting_isomorphic_groups_std,collecting_degree_bounded_nodes_and_hash_mean,collecting_degree_bounded_nodes_and_hash_std,sorting_degree_bounded_nodes_and_hash_mean,sorting_degree_bounded_nodes_and_hash_std
ahasher,SocFriendster,2.0,0.0,13901.2,483.117837,69.7,1.567021
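
A rough way to read the `%%time` output is to divide the total CPU time by the wall time, which estimates how many cores were busy on average. The sketch below does this for the 12-thread run above; `to_seconds` is a helper introduced here for illustration, and the ratio is only indicative, since the CPU time covers everything executed in the cell (including loading the graph).

```python
import re


def to_seconds(duration: str) -> float:
    """Convert a `%%time`-style duration such as '58min 16s' to seconds."""
    units = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 0.001}
    total = 0.0
    for value, unit in re.findall(r"([\d.]+)\s*(h|min|ms|s)\b", duration):
        total += float(value) * units[unit]
    return total


# Values copied from the 12-thread Friendster run above.
cpu_total = to_seconds("58min 16s")
wall_time = to_seconds("5min 56s")

# Average number of busy cores over the whole cell.
effective_parallelism = cpu_total / wall_time
print(f"{effective_parallelism:.2f}")
```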


We check the performance of the algorithm on Friendster when running with 24 threads:

In [8]:
%%time
import os

assert os.environ["RAYON_NUM_THREADS"] == "24"

from tqdm.auto import tqdm, trange
import pandas as pd
from grape.datasets.networkrepository import SocFriendster

friendster_results = []

friendster = SocFriendster(
    # We cannot load the node names, as they would require too much memory
    # for my poor old desktop.
    load_nodes=False,
)

for iteration in trange(10):
    for hash_name in ("simple", "ahasher", "xxh3", "siphash"):
        result = {
            "hash_name": hash_name,
            "graph_name": friendster.get_name(),
            **friendster.get_number_of_isomorphic_node_groups(
                hash_name=hash_name,
                minimum_node_degree=100,
                number_of_neighbours_for_hash=50,
            )
        }

        result.pop("number_of_isomorphic_node_groups")

        friendster_results.append(result)
            
friendster_results = pd.DataFrame(friendster_results)
friendster_results

CPU times: user 37min 55s, sys: 3.33 s, total: 37min 58s
Wall time: 1min 40s


Unnamed: 0,hash_name,graph_name,counting_isomorphic_groups,sorting_degree_bounded_nodes_and_hash,collecting_degree_bounded_nodes_and_hash
0,simple,SocFriendster,2,55,1956
1,ahasher,SocFriendster,2,56,1964
2,xxh3,SocFriendster,3,57,3397
3,siphash,SocFriendster,2,57,2462
4,simple,SocFriendster,2,52,1971
5,ahasher,SocFriendster,2,57,1978
6,xxh3,SocFriendster,2,57,3418
7,siphash,SocFriendster,2,58,2339
8,simple,SocFriendster,2,56,1993
9,ahasher,SocFriendster,2,56,2044
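
The benchmark cells assert on `RAYON_NUM_THREADS`, the environment variable that Rayon (the Rust parallelism library) reads to size its global thread pool. A minimal sketch of setting it from Python follows; it assumes the variable must be in place before the first parallel routine runs, so it should be executed before importing `grape`. Exporting the variable in the shell before launching Jupyter is an equivalent option.

```python
import os

# Assumption: Rayon reads this variable when its global thread pool is first
# created, so we set it before importing grape. `setdefault` keeps any value
# already exported in the shell.
os.environ.setdefault("RAYON_NUM_THREADS", "12")

n_threads = int(os.environ["RAYON_NUM_THREADS"])
print(f"Requested Rayon thread pool size: {n_threads}")
```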


In [18]:
%%time
friendster.get_isomorphic_node_ids_groups(minimum_node_degree=100)

CPU times: user 7.5 s, sys: 239 ms, total: 7.74 s
Wall time: 412 ms


[[56464815, 56464814, 56464813, 56464812, 56464811],
 [62680935, 62669483, 62671943],
 [62588028, 62586898, 62659978],
 [62716381, 62702963, 62703096, 62702752],
 [56465443, 56465442],
 [62565196, 62565219, 62565074],
 [13920183, 13920169],
 [56464825, 56464826],
 [56464821, 56464820],
 [62711600, 62711260, 62712212, 62714661]]
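
The method returns a plain list of lists of node IDs, so ordinary Python is enough to summarise it. As a small sketch, the snippet below takes the groups printed above and counts how many groups there are of each size, and how many nodes would be redundant if each group were collapsed into a single representative.

```python
from collections import Counter

# Groups copied from the output above.
groups = [
    [56464815, 56464814, 56464813, 56464812, 56464811],
    [62680935, 62669483, 62671943],
    [62588028, 62586898, 62659978],
    [62716381, 62702963, 62703096, 62702752],
    [56465443, 56465442],
    [62565196, 62565219, 62565074],
    [13920183, 13920169],
    [56464825, 56464826],
    [56464821, 56464820],
    [62711600, 62711260, 62712212, 62714661],
]

# Number of groups for each group size.
size_histogram = Counter(len(group) for group in groups)

# Nodes involved in some group, and nodes that would disappear if every
# group were collapsed into one representative node.
nodes_in_groups = sum(len(group) for group in groups)
redundant_nodes = nodes_in_groups - len(groups)

print(sorted(size_histogram.items()), nodes_in_groups, redundant_nodes)
```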

In [11]:
%%timeit
_ = friendster.get_isomorphic_node_ids_groups(minimum_node_degree=3)

673 ms ± 4.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### ClueWeb
[The ClueWeb09 dataset](http://lemurproject.org/clueweb09/) was created to support research on information retrieval and related human language technologies; it consists of about `1.7` billion web pages collected in January and February 2009 and roughly `8` billion undirected links between them.

Several tracks of the TREC conference use it. The dataset includes web pages in various languages and a web graph that includes unique URLs and total outlinks for the entire dataset and for a subset called TREC Category B (the first 50 million English pages). The ClueWeb09 dataset and its subsets are distributed in different formats, including as tarred/gzipped files on hard disk drives and as a subset that is downloaded from the web. The Lemur Project provides online services for searching and interacting with the ClueWeb09 dataset, including an Indri search engine for searching the English and Japanese subsets and Wikipedia, as well as a batch query service and an attribute lookup service. The Lemur Project also offers hosted copies of the ClueWeb09 dataset for organizations that have licenses to use it.

*We also retrieve this graph from [Network Repository](https://networkrepository.com/index.php)*

#### ⚠️⚠️⚠️ This is a big graph! Make sure you have the disk space! ⚠️⚠️⚠️
*This is a warning to ensure that users have sufficient disk space before downloading and using a large graph. It is important to ensure that you have enough space on your hard drive or another storage device to accommodate the graph size, as attempting to download or work with a graph that is too large for your available space can lead to errors and other issues. It is advisable to check your available disk space before downloading or working with a large graph and free up additional space if necessary.*

In [1]:
!du -sh /bfd/graphs/networkrepository/WebClueweb09/

631G	/bfd/graphs/networkrepository/WebClueweb09/
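
The check on free disk space can also be scripted before starting the download. This is a minimal sketch using only the standard library; `has_free_space` is a helper introduced here for illustration, and the required size is taken from the `du` output above, read as 631 GiB.

```python
import shutil


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return whether the filesystem containing `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


# `du -sh` reported 631G for the ClueWeb09 directory, i.e. about 631 GiB.
REQUIRED_BYTES = 631 * 1024**3

# Replace "." with the directory where the graph will be stored.
print(has_free_space(".", REQUIRED_BYTES))
```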


In the following cell we retrieve and load the `Clueweb09` dataset from the [network repository](https://networkrepository.com/index.php). We configure it to not load the node names in order to conserve memory. The cell also includes a directive to measure and display the execution time of the code.

In [1]:
%%time
from grape.datasets.networkrepository import WebClueweb09

clueweb = WebClueweb09(
    # We cannot load the node names,
    # as they would require too much memory
    # for my poor old desktop.
    load_nodes=False,
)

CPU times: user 2h 42min 17s, sys: 8min 6s, total: 2h 50min 23s
Wall time: 38min 38s


We display the number of nodes, `1.68G`, and of undirected edges, `7.8G`.

In [3]:
clueweb.get_number_of_nodes(), clueweb.get_number_of_edges()

(1684868322, 7811385827)
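
The shorthand used in the text (`1.68G`, `7.8G`) is simply the raw count divided by one billion. A small sketch of that conversion, with `humanize` being a helper introduced here for illustration:

```python
def humanize(count: int) -> str:
    """Format a count in billions, using the `G` suffix used in this tutorial."""
    return f"{count / 1e9:.2f}G"


# Values copied from the output above.
number_of_nodes, number_of_edges = 1684868322, 7811385827

print(humanize(number_of_nodes), humanize(number_of_edges))

# Ratio between the two reported counts.
print(f"{number_of_edges / number_of_nodes:.2f} edges per node")
```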

We compute the isomorphic node groups among nodes with degree at least `1000`:

In [11]:
%%time
isomorphic_node_groups = clueweb\
    .get_isomorphic_node_ids_groups(
        minimum_node_degree=1_000,
        hash_name="ahasher",
        number_of_neighbours_for_hash=1000
    )

CPU times: user 22.2 s, sys: 42.4 ms, total: 22.3 s
Wall time: 22.2 s


We compute the isomorphic node groups among nodes with degree at least `500`:

In [15]:
%%time
isomorphic_node_groups = clueweb\
    .get_isomorphic_node_ids_groups(
        minimum_node_degree=500,
        hash_name="ahasher",
        number_of_neighbours_for_hash=1000
    )

CPU times: user 41.9 s, sys: 63.1 ms, total: 42 s
Wall time: 41.9 s


We time the counting of isomorphic node groups among nodes with degree at least `100`:

In [2]:
%%timeit
clueweb.get_number_of_isomorphic_node_groups(
    minimum_node_degree=100,
    hash_name="ahasher",
    number_of_neighbours_for_hash=1000
)

38.2 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


We count the isomorphic node groups using `10` neighbours for the hash and minimum node degree `500`:

In [18]:
%%time
clueweb.get_number_of_isomorphic_node_groups(
    minimum_node_degree=500,
    hash_name="ahasher",
    number_of_neighbours_for_hash=10
)

CPU times: user 2.95 s, sys: 9.76 ms, total: 2.96 s
Wall time: 2.96 s


{'collecting_degree_bounded_nodes_and_hash': 2863,
 'number_of_isomorphic_node_groups': 10872,
 'counting_isomorphic_groups': 44,
 'sorting_degree_bounded_nodes_and_hash': 53}
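
Besides the number of groups, the returned dictionary reports one timing per phase. These timings appear to be in milliseconds: as the sketch below shows, their sum matches the `2.96 s` wall time reported by `%%time`. This is an inference from the numbers above rather than a documented guarantee.

```python
# Dictionary copied from the output above.
report = {
    "collecting_degree_bounded_nodes_and_hash": 2863,
    "number_of_isomorphic_node_groups": 10872,
    "counting_isomorphic_groups": 44,
    "sorting_degree_bounded_nodes_and_hash": 53,
}

# Sum every entry except the group count, which is not a timing.
phase_total = sum(
    value
    for key, value in report.items()
    if key != "number_of_isomorphic_node_groups"
)

# Presumably milliseconds: compare with the wall time of the cell.
print(phase_total, "ms vs", 2.96 * 1000, "ms of wall time")
```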

We compute the isomorphic node groups of nodes using `10` neighbours, and minimum node degree `100`:

In [22]:
%%time
clueweb.get_number_of_isomorphic_node_groups(
    minimum_node_degree=100,
    hash_name="ahasher",
    number_of_neighbours_for_hash=10
)

CPU times: user 11.7 s, sys: 75.3 ms, total: 11.7 s
Wall time: 11.7 s


{'number_of_isomorphic_node_groups': 91294,
 'counting_isomorphic_groups': 1104,
 'collecting_degree_bounded_nodes_and_hash': 9790,
 'sorting_degree_bounded_nodes_and_hash': 817}

We compute the isomorphic node groups of nodes using `50` neighbours, and minimum node degree `100`:

In [23]:
%%time
clueweb.get_number_of_isomorphic_node_groups(
    minimum_node_degree=100,
    hash_name="ahasher",
    number_of_neighbours_for_hash=50
)

CPU times: user 29.5 s, sys: 102 ms, total: 29.6 s
Wall time: 29.6 s


{'number_of_isomorphic_node_groups': 91294,
 'collecting_degree_bounded_nodes_and_hash': 28287,
 'sorting_degree_bounded_nodes_and_hash': 824,
 'counting_isomorphic_groups': 486}

We compute the isomorphic node groups of nodes using `100` neighbours, and minimum node degree `100`:

In [24]:
%%time
clueweb.get_number_of_isomorphic_node_groups(
    minimum_node_degree=100,
    hash_name="ahasher",
    number_of_neighbours_for_hash=100
)

CPU times: user 57.3 s, sys: 115 ms, total: 57.4 s
Wall time: 57.3 s


{'sorting_degree_bounded_nodes_and_hash': 825,
 'collecting_degree_bounded_nodes_and_hash': 56208,
 'counting_isomorphic_groups': 294,
 'number_of_isomorphic_node_groups': 91294}

We compute the isomorphic node groups of nodes using `1000` neighbours, and minimum node degree `100`, with hash `ahasher`:

In [25]:
%%time
clueweb.get_number_of_isomorphic_node_groups(
    minimum_node_degree=100,
    hash_name="ahasher",
    number_of_neighbours_for_hash=1000
)

CPU times: user 2min 6s, sys: 175 ms, total: 2min 6s
Wall time: 2min 6s


{'collecting_degree_bounded_nodes_and_hash': 125654,
 'counting_isomorphic_groups': 178,
 'number_of_isomorphic_node_groups': 91294,
 'sorting_degree_bounded_nodes_and_hash': 815}
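
The cells above repeat the same call while varying only `number_of_neighbours_for_hash`. A sketch of a helper that runs such a sweep and collects the reports is given below; `sweep_neighbours` is a name introduced here for illustration, and it only relies on the `get_number_of_isomorphic_node_groups` method and parameters already used in this tutorial.

```python
def sweep_neighbours(
    graph,
    neighbour_counts,
    minimum_node_degree=100,
    hash_name="ahasher",
):
    """Run the isomorphic group count once per neighbour budget.

    `graph` is any object exposing `get_number_of_isomorphic_node_groups`
    with the keyword arguments used in the cells above.
    """
    rows = []
    for number_of_neighbours in neighbour_counts:
        report = graph.get_number_of_isomorphic_node_groups(
            minimum_node_degree=minimum_node_degree,
            hash_name=hash_name,
            number_of_neighbours_for_hash=number_of_neighbours,
        )
        # Keep the swept parameter next to the timings it produced.
        rows.append(
            {"number_of_neighbours_for_hash": number_of_neighbours, **report}
        )
    return rows
```

For instance, `sweep_neighbours(clueweb, (10, 50, 100, 1000))` would reproduce the four runs above. In those runs the number of groups stays at `91294`, while the collecting time grows (`9790`, `28287`, `56208`, `125654`) and the counting time shrinks (`1104`, `486`, `294`, `178`) as more neighbours are hashed.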

We compute the isomorphic node groups of nodes using `1000` neighbours, and minimum node degree `100`, with hash `simple`:

In [26]:
%%time
clueweb.get_number_of_isomorphic_node_groups(
    minimum_node_degree=100,
    hash_name="simple",
    number_of_neighbours_for_hash=1000
)

CPU times: user 2min 2s, sys: 189 ms, total: 2min 2s
Wall time: 2min 2s


{'collecting_degree_bounded_nodes_and_hash': 121746,
 'counting_isomorphic_groups': 180,
 'number_of_isomorphic_node_groups': 91294,
 'sorting_degree_bounded_nodes_and_hash': 827}

We compute the isomorphic node groups of nodes using `1000` neighbours, and minimum node degree `100`, with hash `xxh3`:

In [27]:
%%time
clueweb.get_number_of_isomorphic_node_groups(
    minimum_node_degree=100,
    hash_name="xxh3",
    number_of_neighbours_for_hash=1000
)

CPU times: user 6min 11s, sys: 360 ms, total: 6min 11s
Wall time: 6min 10s


{'collecting_degree_bounded_nodes_and_hash': 369907,
 'sorting_degree_bounded_nodes_and_hash': 816,
 'number_of_isomorphic_node_groups': 91294,
 'counting_isomorphic_groups': 181}

We compute the isomorphic node groups of nodes using `1000` neighbours, and minimum node degree `100`, with hash `siphash`:

In [28]:
%%time
clueweb.get_number_of_isomorphic_node_groups(
    minimum_node_degree=100,
    hash_name="siphash",
    number_of_neighbours_for_hash=1000
)

CPU times: user 3min 22s, sys: 248 ms, total: 3min 22s
Wall time: 3min 22s


{'collecting_degree_bounded_nodes_and_hash': 201536,
 'counting_isomorphic_groups': 182,
 'number_of_isomorphic_node_groups': 91294,
 'sorting_degree_bounded_nodes_and_hash': 817}
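
All four hashes find the same `91294` groups on ClueWeb, so the difference between them is in running time, and the collecting phase dominates it. The sketch below puts the collecting times reported above side by side and expresses each one relative to the fastest; the numbers are copied from the outputs and are single runs, so small differences should not be over-interpreted.

```python
# Collecting times copied from the four ClueWeb runs above
# (presumably milliseconds, one run per hash).
collecting_time = {
    "ahasher": 125654,
    "simple": 121746,
    "xxh3": 369907,
    "siphash": 201536,
}

fastest = min(collecting_time, key=collecting_time.get)

# Slowdown of each hash relative to the fastest one.
relative = {
    name: time / collecting_time[fastest]
    for name, time in collecting_time.items()
}

for name, ratio in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{name:8s} {ratio:.2f}x")
```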

### WikiData
[WikiData](https://www.wikidata.org/wiki/Wikidata:Main_Page) is a collaborative, multilingual, free knowledge base that can be read and edited by humans and machines. It provides structured data representing the relationships between concepts and entities, including real-world objects, events, ideas and abstract concepts. The data in WikiData is organized into a graph structure, with nodes representing concepts or entities and edges representing relationships between them. For example, a node for the idea "dog" might be connected to other nodes representing specific dog breeds, such as "Labrador Retriever" or "Poodle," through edges that define the relationship "breed of."

The WikiData graph is constantly growing and changing as users contribute new data and edit existing data. It is based on a flexible data model that allows for creation of new properties and classes to represent the relationships between concepts and entities. The WikiData graph is used in various applications, including data integration, natural language processing, and machine learning. It also provides structured data for Wikipedia and other Wikimedia projects.

#### ⚠️⚠️⚠️ This is a big graph! Make sure you have the disk space! ⚠️⚠️⚠️
*This is a warning to ensure that users have sufficient disk space before downloading and using a large graph. It is important to ensure that you have enough space on your hard drive or another storage device to accommodate the graph size, as attempting to download or work with a graph that is too large for your available space can lead to errors and other issues. It is advisable to check your available disk space before downloading or working with a large graph and free up additional space if necessary.*

In [1]:
!du -sh /bfd/graphs/wikidata/WikiData

1,7T	/bfd/graphs/wikidata/WikiData


In the next cell we retrieve and load the WikiData dataset through GRAPE, which downloads it directly from [WikiData's website](https://www.wikidata.org/wiki/Wikidata:Main_Page). Note that we configure it not to load the node names and edge types in order to conserve memory. The `%%time` directive measures and displays the execution time of the cell.

In [1]:
%%time
from grape.datasets.wikidata import WikiData

wikidata = WikiData(
    load_nodes=False,
    ring_bell=True,
    directed=False
)

CPU times: user 1h 6min 26s, sys: 25min 13s, total: 1h 31min 40s
Wall time: 2h 51min 37s


In [2]:
%%time
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="ahasher",
    minimum_node_degree=10000,
    number_of_neighbours_for_hash=10,
)

CPU times: user 3.46 s, sys: 5.73 ms, total: 3.47 s
Wall time: 3.49 s


{'sorting_degree_bounded_nodes_and_hash': 0,
 'number_of_isomorphic_node_groups': 483,
 'collecting_degree_bounded_nodes_and_hash': 1797,
 'counting_isomorphic_groups': 409}

In [3]:
%%time
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="ahasher",
    minimum_node_degree=1000,
    number_of_neighbours_for_hash=10,
)

CPU times: user 7min 44s, sys: 790 ms, total: 7min 45s
Wall time: 7min 45s


{'collecting_degree_bounded_nodes_and_hash': 8781,
 'counting_isomorphic_groups': 456773,
 'number_of_isomorphic_node_groups': 0,
 'sorting_degree_bounded_nodes_and_hash': 13}

In [5]:
%%time
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="ahasher",
    minimum_node_degree=100_000,
    number_of_neighbours_for_hash=1000,
)

CPU times: user 2.37 s, sys: 7.45 ms, total: 2.38 s
Wall time: 2.39 s


{'collecting_degree_bounded_nodes_and_hash': 2063,
 'counting_isomorphic_groups': 324,
 'number_of_isomorphic_node_groups': 148,
 'sorting_degree_bounded_nodes_and_hash': 0}

In [3]:
%%time
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="ahasher",
    minimum_node_degree=10000,
    number_of_neighbours_for_hash=1000,
)

CPU times: user 4.8 s, sys: 7.74 ms, total: 4.81 s
Wall time: 4.8 s


{'counting_isomorphic_groups': 402,
 'collecting_degree_bounded_nodes_and_hash': 4394,
 'number_of_isomorphic_node_groups': 483,
 'sorting_degree_bounded_nodes_and_hash': 0}

In [4]:
%%time
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="ahasher",
    minimum_node_degree=1000,
    number_of_neighbours_for_hash=1000,
)

CPU times: user 32.9 s, sys: 73.8 ms, total: 33 s
Wall time: 33 s


{'collecting_degree_bounded_nodes_and_hash': 32530,
 'number_of_isomorphic_node_groups': 3749,
 'counting_isomorphic_groups': 442,
 'sorting_degree_bounded_nodes_and_hash': 14}

In [None]:
%%time
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="ahasher",
    minimum_node_degree=500,
    number_of_neighbours_for_hash=1000,
)

In [2]:
%%timeit
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="ahasher",
    minimum_node_degree=100,
    number_of_neighbours_for_hash=1000,
)

1min 9s ± 890 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [2]:
%%time
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="simple",
    minimum_node_degree=100,
    number_of_neighbours_for_hash=1000,
)

CPU times: user 6min 28s, sys: 784 ms, total: 6min 29s
Wall time: 6min 29s


{'counting_isomorphic_groups': 766,
 'number_of_isomorphic_node_groups': 20756,
 'sorting_degree_bounded_nodes_and_hash': 853,
 'collecting_degree_bounded_nodes_and_hash': 386283}

In [3]:
%%time
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="xxh3",
    minimum_node_degree=100,
    number_of_neighbours_for_hash=1000,
)

CPU times: user 9min 46s, sys: 950 ms, total: 9min 47s
Wall time: 9min 46s


{'collecting_degree_bounded_nodes_and_hash': 585297,
 'number_of_isomorphic_node_groups': 20756,
 'sorting_degree_bounded_nodes_and_hash': 839,
 'counting_isomorphic_groups': 498}

In [4]:
%%time
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="siphash",
    minimum_node_degree=100,
    number_of_neighbours_for_hash=1000,
)

CPU times: user 7min 26s, sys: 620 ms, total: 7min 26s
Wall time: 7min 26s


{'counting_isomorphic_groups': 492,
 'number_of_isomorphic_node_groups': 20756,
 'collecting_degree_bounded_nodes_and_hash': 445066,
 'sorting_degree_bounded_nodes_and_hash': 848}

In [5]:
%%timeit
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="ahasher",
    minimum_node_degree=500,
    number_of_neighbours_for_hash=1000,
)

51.4 s ± 32.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [None]:
%%time
wikidata.get_number_of_isomorphic_node_groups(
    hash_name="ahasher",
    minimum_node_degree=100,
    number_of_neighbours_for_hash=100,
)

In [2]:
%%time
import os

assert os.environ["RAYON_NUM_THREADS"] == "12"

from tqdm.auto import tqdm, trange
import pandas as pd

wikidata_results = []

for iteration in trange(10):  
    for hash_name in ("ahasher", ):
        result = {
            "hash_name": hash_name,
            "graph_name": wikidata.get_name(),
            **wikidata.get_number_of_isomorphic_node_groups(
                hash_name=hash_name,
                minimum_node_degree=100,
                number_of_neighbours_for_hash=1000,
            )
        }

        result.pop("number_of_isomorphic_node_groups")

        wikidata_results.append(result)
            
wikidata_results = pd.DataFrame(wikidata_results)

display(wikidata_results)

wikidata_results.groupby(["hash_name", "graph_name"]).agg(["mean", "std"])

Unnamed: 0,hash_name,graph_name,sorting_degree_bounded_nodes_and_hash,collecting_degree_bounded_nodes_and_hash,counting_isomorphic_groups
0,ahasher,WikiData,133,42554,76
1,ahasher,WikiData,138,42647,77
2,ahasher,WikiData,134,42674,77
3,ahasher,WikiData,133,42834,76
4,ahasher,WikiData,128,42794,75
5,ahasher,WikiData,132,42666,75
6,ahasher,WikiData,134,42711,79
7,ahasher,WikiData,136,42779,74
8,ahasher,WikiData,136,42803,74
9,ahasher,WikiData,134,42899,79


CPU times: user 1h 20min 20s, sys: 3.72 s, total: 1h 20min 24s
Wall time: 7min 9s


hash_name,graph_name,sorting_degree_bounded_nodes_and_hash_mean,sorting_degree_bounded_nodes_and_hash_std,collecting_degree_bounded_nodes_and_hash_mean,collecting_degree_bounded_nodes_and_hash_std,counting_isomorphic_groups_mean,counting_isomorphic_groups_std
ahasher,WikiData,133.8,2.699794,42736.1,103.434843,76.2,1.813529


In [2]:
%%time
import os

assert os.environ["RAYON_NUM_THREADS"] == "24"

from tqdm.auto import tqdm, trange
import pandas as pd

wikidata_results = []

for iteration in trange(10):
    for hash_name in ("ahasher", ):
        result = {
            "hash_name": hash_name,
            "graph_name": wikidata.get_name(),
            **wikidata.get_number_of_isomorphic_node_groups(
                hash_name=hash_name,
                minimum_node_degree=100,
                number_of_neighbours_for_hash=1000,
            )
        }

        result.pop("number_of_isomorphic_node_groups")

        wikidata_results.append(result)
            
wikidata_results = pd.DataFrame(wikidata_results)

display(wikidata_results)

wikidata_results.groupby(["hash_name", "graph_name"]).agg(["mean", "std"])

Unnamed: 0,hash_name,graph_name,counting_isomorphic_groups,sorting_degree_bounded_nodes_and_hash,collecting_degree_bounded_nodes_and_hash
0,ahasher,WikiData,88,117,37668
1,ahasher,WikiData,88,117,37514
2,ahasher,WikiData,86,115,37935
3,ahasher,WikiData,83,117,37776
4,ahasher,WikiData,86,119,38213
5,ahasher,WikiData,85,118,37909
6,ahasher,WikiData,83,120,38088
7,ahasher,WikiData,84,112,37930
8,ahasher,WikiData,89,119,37598
9,ahasher,WikiData,85,119,37595


CPU times: user 1h 46min 13s, sys: 8 s, total: 1h 46min 21s
Wall time: 6min 20s


hash_name,graph_name,counting_isomorphic_groups_mean,counting_isomorphic_groups_std,sorting_degree_bounded_nodes_and_hash_mean,sorting_degree_bounded_nodes_and_hash_std,collecting_degree_bounded_nodes_and_hash_mean,collecting_degree_bounded_nodes_and_hash_std
ahasher,WikiData,85.7,2.110819,117.3,2.359378,37822.6,230.467255
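
The two WikiData runs differ only in the number of threads (12 and 24), so their mean collecting times give a direct estimate of how well this phase scales. The following sketch computes the speedup and the corresponding efficiency from the means reported above; it is plain arithmetic on those two numbers and says nothing about why the scaling is what it is.

```python
# Mean collecting times copied from the two WikiData runs above.
mean_collecting_12_threads = 42736.1
mean_collecting_24_threads = 37822.6

# Speedup obtained by doubling the number of threads.
speedup = mean_collecting_12_threads / mean_collecting_24_threads

# Efficiency relative to the ideal 2x speedup of doubling the threads.
efficiency = speedup / (24 / 12)

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```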


We display the number of nodes, `1.29G`, and of undirected edges, `6.2G`.

In [2]:
wikidata.get_number_of_nodes(), wikidata.get_number_of_edges()

(1294126247, 6218438107)

In [4]:
%%time
isomorphic_node_groups = wikidata\
    .get_isomorphic_node_ids_groups(
        minimum_node_degree=1000,
    )

CPU times: user 5min 28s, sys: 319 ms, total: 5min 29s
Wall time: 1min 58s


In [8]:
%%time
wikidata\
    .get_number_of_isomorphic_node_groups(
        minimum_node_degree=1000,
    )

CPU times: user 5min 17s, sys: 334 ms, total: 5min 18s
Wall time: 2min 2s


{'counting_isomorphic_groups': 122633,
 'collecting_degree_bounded_nodes_and_hash': 315,
 'number_of_isomorphic_node_groups': 3749,
 'sorting_degree_bounded_nodes_and_hash': 1}

In [9]:
%%time
wikidata\
    .get_number_of_isomorphic_node_groups(
        minimum_node_degree=1000,
        number_of_neighbours_for_hash=1000
    )

CPU times: user 1min 17s, sys: 101 ms, total: 1min 17s
Wall time: 14.5 s


{'collecting_degree_bounded_nodes_and_hash': 14432,
 'counting_isomorphic_groups': 69,
 'sorting_degree_bounded_nodes_and_hash': 2,
 'number_of_isomorphic_node_groups': 3749}

In [14]:
%%time
wikidata\
    .get_number_of_isomorphic_node_groups(
        minimum_node_degree=200,
        number_of_neighbours_for_hash=100
    )

CPU times: user 3min 1s, sys: 227 ms, total: 3min 2s
Wall time: 54.4 s


{'collecting_degree_bounded_nodes_and_hash': 7914,
 'number_of_isomorphic_node_groups': 11792,
 'counting_isomorphic_groups': 46474,
 'sorting_degree_bounded_nodes_and_hash': 18}

In [26]:
%%time
isomorphisms = wikidata\
    .get_isomorphic_node_ids_groups(
        minimum_node_degree=750_000,
        number_of_neighbours_for_hash=100
    )

CPU times: user 5.48 s, sys: 34.5 ms, total: 5.52 s
Wall time: 257 ms


In [27]:
isomorphisms

[[15433587, 15433592],
 [148905595, 148905596, 148905598, 148905600, 148905602, 148905607, 148905612],
 [64077364, 64077365, 64077367, 64077369, 64077370, 64077371],
 [47692852, 47692869]]

In [10]:
%%time
wikidata\
    .get_number_of_isomorphic_node_groups(
        minimum_node_degree=1000,
        number_of_neighbours_for_hash=10000
    )

CPU times: user 3min 4s, sys: 162 ms, total: 3min 4s
Wall time: 46.6 s


{'number_of_isomorphic_node_groups': 3749,
 'sorting_degree_bounded_nodes_and_hash': 2,
 'collecting_degree_bounded_nodes_and_hash': 46570,
 'counting_isomorphic_groups': 69}

In [31]:
%%time
isomorphic_node_groups = wikidata\
    .get_isomorphic_node_ids_groups(
        minimum_node_degree=100_000
    )

CPU times: user 5.96 s, sys: 22.3 ms, total: 5.98 s
Wall time: 293 ms


## Conclusions
In this tutorial, we covered an efficient algorithm for detecting attributed isomorphic nodes in large graphs, which is part of the quality control suite of the GRAPE graph machine learning library. We emphasized the importance of detecting isomorphic nodes and how it can help improve the quality and accuracy of graph-based machine learning models. The algorithm is implemented in Rust and has Python bindings for improved usability.

I hope you now have a better understanding of the algorithm for detecting attributed isomorphic nodes and the role it plays in the quality control process of graph-based machine learning. Do feel free to reach out with any questions or feedback, as I am always looking for ways to improve this tutorial.

[And remember to ⭐ GRAPE!](https://github.com/AnacletoLAB/grape)