# Exact billion-scale graph diameter with 🍇🍇 GRAPE 🍇🍇
In this tutorial, I will show you how to use the [GRAPE library](https://github.com/AnacletoLAB/grape) to compute the exact diameter of a graph, which is the longest shortest path between any two nodes in the graph. This is a challenging problem, especially for large graphs with millions or billions of nodes and edges.

We will discuss some of the basics of graph analysis and introduce key concepts such as quality control, computational complexity, and breadth-first search. We will also briefly touch on the concept of graph neural networks.

We will discuss how many graphs have short average distances between nodes, and IFUB, an algorithm that efficiently computes the diameter of large graphs. We will run that algorithm on the KGCOVID19 knowledge graph, then on the Friendster social network, and finally on two graphs with over one billion nodes: ClueWeb09 and WikiData. We will see how IFUB performs impeccably on the former and struggles on the latter.

By the end of the tutorial, you will have a good understanding of how to use GRAPE to compute the exact diameter of a graph and apply this knowledge to your own projects.

[Remember to ⭐ GRAPE!](https://github.com/AnacletoLAB/grape)

## Some basics
In this section, we will provide a brief overview of some key concepts that will be discussed throughout the tutorial. These include quality control of datasets, computational complexity, graphs and their various applications, breadth-first search, and graph neural networks. Understanding these basic concepts is essential for understanding the more advanced topics that will be covered in the tutorial. We will also introduce the concept of graph convolutional networks, which are a specialized type of neural network used for processing graph-structured data. **It is likely you are familiar with all of these concepts, and you may just skip this section**, but I made it available so that more readers can be on the same page when the tutorial starts.

### Quality control
Quality control of datasets is the process of ensuring that the data used for various purposes is accurate, reliable, and relevant. It involves checking the data for completeness, accuracy, and consistency, and correcting or removing any errors or inconsistencies that may be present. Quality control of datasets is important because the quality of the data has a direct impact on the accuracy and reliability of the results obtained from the data. Poor quality data can lead to incorrect conclusions, which can have serious consequences in fields such as healthcare, finance, and scientific research. Ensuring the quality of datasets is therefore essential for ensuring the integrity and reliability of the results obtained from the data. [We have already learned how to create an extensive quality control report for graphs in this other GRAPE tutorial](https://github.com/AnacletoLAB/grape/blob/main/tutorials/Create%20extensive%20knowledge%20graph%20reports%20using%20GRAPE.ipynb)

### Computational complexity
[Computational complexity](https://en.wikipedia.org/wiki/Computational_complexity) refers to the amount of resources (e.g., time, space) required by an algorithm to solve a problem. It is typically measured in terms of the size of the input data. Worst-case complexity refers to the maximum amount of resources required by the algorithm over all possible inputs of a given size. This measure is useful because it provides a guarantee on the performance of the algorithm, regardless of the specific input data. However, it may not accurately reflect the average performance of the algorithm on typical input data.

### What is a graph
[A graph](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)) is a data structure that consists of a set of vertices, or nodes, and a set of edges connecting these vertices. Graphs are used to represent relationships between different entities in a wide range of applications, such as social networks, transportation systems, and biological networks.

Some graphs can be very large, with millions or even billions of vertices and edges. The size of a graph can significantly impact the performance of algorithms used to analyze or process it. Therefore, it is important to develop efficient algorithms for analyzing large graphs.

### Breadth-first search
[Breadth-first search (BFS)](https://en.wikipedia.org/wiki/Breadth-first_search) is an algorithm for traversing or searching a graph, tree, or other data structure. It starts at a given node (called the root or starting node) and explores all the nodes at the current distance from the root before moving on to the nodes one step farther away.

The algorithm starts by placing the root node in a queue, which is a first-in, first-out data structure. It then repeatedly removes the first node from the queue, examines it, and adds its neighbors to the end of the queue. By repeating this process, the algorithm visits all the nodes in the graph in a specific order, called a breadth-first traversal.
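
The queue-based procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not GRAPE's implementation: the `bfs_distances` name, the adjacency-dictionary representation and the toy graph are all assumptions made for this example.

```python
from collections import deque


def bfs_distances(adjacency, root):
    """Return the hop distance from `root` to every reachable node."""
    distances = {root: 0}
    queue = deque([root])  # first-in, first-out
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            # The first time we reach a node is along a shortest path.
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Hypothetical toy graph: a path 0 - 1 - 2 - 3 with an extra node 4 attached to 1.
toy_graph = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
print(bfs_distances(toy_graph, 0))
```

Because nodes are dequeued in order of increasing distance, each node's recorded distance is its shortest-path length from the root in an unweighted graph.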

BFS has a number of applications, including finding the shortest path between two nodes in a graph and checking if a graph is connected. It is also used as a building block for other algorithms, such as topological sorting and network connectivity analysis.

### Graph neural networks
[Graph neural networks (GNNs)](https://www.cs.mcgill.ca/~wlh/grl_book/) are a class of neural networks that are specifically designed to process graph-structured data. They have been applied to a variety of tasks including node classification, link prediction, and graph classification. GNNs are particularly useful for tasks that involve the analysis of relationships between entities in a graph, as they are able to incorporate the graph structure in their learning process.

#### Graph convolutional networks
[Graph convolutional networks (GCNs)](https://arxiv.org/pdf/1609.02907.pdf) are a type of neural network designed specifically to operate on graph-structured data. Like traditional convolutional neural networks, GCNs use convolutional layers to process and analyze data. However, rather than operating on grid-structured data such as images, GCNs perform convolutions on the graph structure itself, using the relationships between nodes in the graph as the basis for their analysis. GCNs have been successfully applied to a wide range of tasks in domains such as computer vision, natural language processing, and drug discovery, and have been shown to outperform traditional methods on many graph-based problems.

### What is GRAPE?
[🍇🍇 GRAPE 🍇🍇](https://github.com/AnacletoLAB/grape) is a graph processing and embedding library that enables users to easily manipulate and analyze graphs. With [GRAPE](https://github.com/AnacletoLAB/grape), users can efficiently load and preprocess graphs, generate random walks, and apply various node and edge embedding models. Additionally, [GRAPE](https://github.com/AnacletoLAB/grape) provides a fair and reproducible evaluation pipeline for comparing different graph embedding and graph-based prediction methods.

![GRAPE](https://github.com/AnacletoLAB/grape/raw/main/images/sequence_diagram.png?raw=true)

## What is the diameter?
The diameter of a graph is a measure of the size of the graph, defined as the maximum distance between any pair of nodes in the graph. It is an important metric for characterizing the structure of a graph. Computing the diameter of a graph can be a challenging task, particularly for large graphs with many nodes and edges. Exact algorithms for computing the diameter typically have a high time complexity, scaling poorly with the size of the graph. As a result, it can be difficult to compute the diameter of very large graphs in a reasonable amount of time. Despite this, the diameter is an important metric for various applications, including graph neural networks and graph convolutional networks. These types of networks are used to process and analyze graphs, and the diameter can impact the performance and accuracy of these networks. For example, the diameter of a graph may determine the maximum depth of a graph neural network required to accurately process the graph, and manage expectations of the performance of those models.
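
To make the definition concrete, here is a minimal sketch of the textbook approach: run one breadth-first search from every node and keep the largest distance found. The function names and the toy graphs are assumptions for illustration, and the graph is assumed to be connected and undirected. This is the slow baseline that better algorithms try to avoid.

```python
from collections import deque


def eccentricity(adjacency, source):
    """Largest BFS distance from `source` to any reachable node."""
    distances = {source: 0}
    queue = deque([source])
    farthest = 0
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                farthest = max(farthest, distances[neighbour])
                queue.append(neighbour)
    return farthest


def naive_diameter(adjacency):
    """One BFS per node: O(|V| * |E|) overall on a connected graph."""
    return max(eccentricity(adjacency, node) for node in adjacency)


# Hypothetical toy graphs: a path with 5 nodes and a cycle with 6 nodes.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(naive_diameter(path), naive_diameter(cycle))
```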

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/diameter.jpg?raw=true" width=300 />

## Four degrees of separation
[The Four Degrees of Separation paper](https://arxiv.org/pdf/1111.4570.pdf) presents the results of a study on the distance distribution of the Facebook social network, which at the time of the study contained approximately 721 million active users and approximately 69 billion friendship links. The goal of the study was to identify statistical parameters that can distinguish proper social networks from other complex networks, such as web graphs. The study found that the average distance between two users on Facebook is 4.74, corresponding to 3.74 intermediaries or "degrees of separation." This is lower than the average number of intermediaries found in previous studies, such as the ones conducted by Stanley Milgram, which ranged between 4.4 and 5.7. The study also analyzed the distance distribution of geographic subgraphs of Facebook and observed their evolution over time.

## On computing the diameter of real-world undirected graphs
The algorithm we are going to use in this tutorial is a highly parallelized version of the algorithm described in Crescenzi et al.'s fantastic paper [On computing the diameter of real-world undirected graphs](https://who.rocq.inria.fr/Laurent.Viennot/road/papers/ifub.pdf), which I cannot recommend enough.

This paper presents the IFUB algorithm, an exact method for computing the diameter of large graphs. It works by starting a breadth-first search from a randomly chosen node or a node with the highest degree, and then iteratively refining both a lower bound and an upper bound on the diameter until the two bounds meet.

The IFUB algorithm has a worst-case complexity of `O(|V||E|)`, where `|V|` is the number of nodes and `|E|` is the number of edges in the graph, but it has been shown to work in `O(|E|)` time in practice on almost `200` real-world graphs.
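
To get a feel for the gap between the worst case and the behaviour observed in practice, here is a back-of-the-envelope calculation using the ClueWeb09 sizes reported later in this tutorial. The throughput of one billion edge visits per second is purely an assumption chosen for illustration, not a measurement of GRAPE or of my desktop.

```python
nodes = 1_684_868_322  # ClueWeb09 nodes, as reported later in this tutorial
edges = 7_811_385_827  # ClueWeb09 undirected edges

# Hypothetical throughput, chosen only to make the arithmetic concrete.
ASSUMED_EDGE_VISITS_PER_SECOND = 1e9

# A single BFS touches every edge once: O(|E|).
one_bfs_seconds = edges / ASSUMED_EDGE_VISITS_PER_SECOND

# One BFS per node touches every edge |V| times: O(|V||E|).
seconds_per_year = 3600 * 24 * 365
all_bfs_years = nodes * one_bfs_seconds / seconds_per_year

print(f"one BFS: ~{one_bfs_seconds:.1f} s")
print(f"one BFS per node: ~{all_bfs_years:.0f} years")
```

Under this assumption a single traversal takes seconds, while one traversal per node would take centuries, which is why an algorithm that usually needs only a handful of traversals makes such a difference.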

It is a significant improvement over other methods, such as those based on random sampling, which do not provide a useful bound on the error and may not be precise in practice. The IFUB algorithm has been shown to be particularly useful for computing the diameter of large graphs, where other methods may be too time-consuming.

**The performance of the IFUB algorithm can vary greatly depending on the topology of the input graph. Some graphs may be more suited to IFUB and result in faster execution times, while others may be more challenging for IFUB and result in slower execution times.**
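
The following is a minimal, sequential sketch of the bound-refinement idea, written from the description above. It is not GRAPE's parallel implementation: the function names are hypothetical, the graph is assumed to be connected and undirected, and the default starting node is simply the one with the highest degree. It also returns the number of breadth-first searches it ran, so that the effect of topology is visible on toy graphs.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distance from `source` to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


def ifub_sketch(adjacency, start=None):
    """Return (diameter, number of BFSs) for a connected undirected graph."""
    if start is None:
        start = max(adjacency, key=lambda node: len(adjacency[node]))
    distances = bfs_distances(adjacency, start)
    bfs_count = 1
    eccentricity = max(distances.values())
    # fringes[i] holds the nodes at distance i from the starting node.
    fringes = [[] for _ in range(eccentricity + 1)]
    for node, distance in distances.items():
        fringes[distance].append(node)
    lower_bound = eccentricity
    upper_bound = 2 * eccentricity
    level = eccentricity
    while upper_bound > lower_bound:
        for node in fringes[level]:
            node_eccentricity = max(bfs_distances(adjacency, node).values())
            bfs_count += 1
            lower_bound = max(lower_bound, node_eccentricity)
            if lower_bound >= upper_bound:
                return lower_bound, bfs_count
        # Any two nodes closer than `level` to the start are at most
        # 2 * (level - 1) apart, so a larger lower bound is already exact.
        if lower_bound > 2 * (level - 1):
            return lower_bound, bfs_count
        upper_bound = 2 * (level - 1)
        level -= 1
    return lower_bound, bfs_count


# Hypothetical toy graphs with very different behaviour.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
cycle = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
print(ifub_sketch(path, start=2))
print(ifub_sketch(cycle, start=0))
```

On the path, the bounds meet after testing a single far-away node. On the cycle, every node has the same eccentricity, so the lower bound never improves and the sketch keeps descending through the fringes until the upper bound comes down to meet it.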

## Installing GRAPE
First, we install the GRAPE library from PyPI:

In [4]:
!pip install grape -qU


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.2.2[0m[39;49m -> [0m[32;49m22.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Experiments
Welcome to the experiments section of this tutorial! In this section, we will put our knowledge into practice by applying the IFUB algorithm to compute the diameter of four different graphs: the KGCOVID19 knowledge graph, the Friendster graph, the ClueWeb09 web graph, and the WikiData graph.

We will observe the performance of IFUB on each of these graphs and discuss the results. By the end of this section, you will have a better understanding of how IFUB performs on different types of graphs and how to use GRAPE to compute the exact diameter of a graph.

**Do note that, because of the memory limits of my desktop, I will restart the Jupyter kernel after running the experiment on each of the large graphs.**

### KGCOVID19
We kick off our experiments with a rather small graph, considering the sizes of the networks we are going to tackle by the end of it: KGCOVID19, with `574K` nodes and `18M` edges.

#### What is KGCOVID19?
[KGCOVID19](https://doi.org/10.1016%2Fj.patter.2020.100155) is a framework for producing knowledge graphs (KGs) that integrate biomedical data related to the COVID-19 pandemic. The framework is designed to be flexible and customizable, allowing researchers to create KGs for different downstream applications including machine learning tasks, hypothesis-based querying, and browsable user interfaces for exploring and discovering relationships in COVID-19 data. The goal of KGCOVID19 is to provide an up-to-date, integrated source of data on SARS-CoV-2 and related viruses, including SARS-CoV and MERS-CoV, to support the biomedical research community in its efforts to respond to the COVID-19 pandemic. The framework can also be applied to other situations in which siloed biomedical data must be quickly integrated for various research purposes, including future pandemics.

In [5]:
%%time
from grape.datasets.kghub import KGCOVID19

kgcovid19 = KGCOVID19()

CPU times: user 23 s, sys: 279 ms, total: 23.3 s
Wall time: 1.87 s


We display the number of nodes, `574K`, and of undirected edges, `18M`.

In [6]:
kgcovid19.get_number_of_nodes(), kgcovid19.get_number_of_edges()

(574232, 18251238)

And now we compute the diameter. It should be pretty much instantaneous.

In [7]:
%%time
kgcovid19.get_diameter()

CPU times: user 1.35 s, sys: 10.4 ms, total: 1.36 s
Wall time: 64.7 ms


38.0

### Friendster
[Friendster](https://en.wikipedia.org/wiki/Friendster) was a social networking service launched in 2002. It was one of the first social networking sites, and was popular in the early 2000s. The site allowed users to connect with friends and meet new people through the use of personal profiles and networks of friends. Friendster was initially successful, but it eventually faced competition from newer social networking sites such as MySpace and Facebook. In 2009 it was acquired by a Malaysian company, in 2011 it announced that it was transitioning from a social networking site to a social gaming site, and in 2015 it suspended its services.

#### What is Network Repository?
[Network Repository](https://networkrepository.com/index.php) is a scientific network data repository that provides interactive visualization and mining tools for analyzing and exploring network data. It is the first interactive repository of its kind and is also the largest network repository, containing thousands of network data sets in over 30 domains, including biological, social, and machine learning data. The repository allows users to visualize and explore network data sets, view interactive statistics and plots, and download massive network data sets with billions of edges. It also includes a visual analytics platform called GraphVis, which allows users to interactively analyze and explore network data in real-time over the web and use it for educational purposes. Network Repository is intended to facilitate scientific research on networks by making it easier for researchers to access and analyze a large collection of network data. It is a valuable resource for researchers in a variety of fields, including network science, bioinformatics, machine learning, data mining, physics, and social science.

#### ⚠️⚠️⚠️ WARNING: Make sure you have enough disk space! ⚠️⚠️⚠️
*Please be aware that this graph is not small and requires a significant amount of disk space to store and work with. Before proceeding with the tutorial, make sure that you have enough free space on your hard drive or other storage device to accommodate the size of the graph. If you do not have sufficient space, you may encounter errors or other issues when attempting to download or work with the graph. It is important to ensure that you have enough space available before proceeding. If necessary, consider freeing up additional space on your device to make room for the graph.*

In [8]:
!du -sh /bfd/graphs/networkrepository/SocFriendster

97G	/bfd/graphs/networkrepository/SocFriendster


In the next cell we retrieve and load the Friendster dataset from GRAPE, which fetches it from the [Network Repository](https://networkrepository.com/index.php). Do note that we are configuring it to not load the node names in order to conserve memory. The cell also includes a directive to measure and display the execution time of the code.

In [9]:
%%time
from grape.datasets.networkrepository import SocFriendster

friendster = SocFriendster(
    # We cannot load the node names, as they would require too much memory
    # for my poor old desktop.
    load_nodes=False,
)

CPU times: user 33min 11s, sys: 23.3 s, total: 33min 34s
Wall time: 2min 20s


We display the number of nodes, `65.6M`, and of undirected edges, `1.8G`.

In [10]:
friendster.get_number_of_nodes(), friendster.get_number_of_edges()

(65608366, 1806067135)

And now we compute the diameter. In this particular graph, even though it is large, we see that the IFUB heuristic works great and it completes in a very short time.


In [11]:
%%time
friendster.get_diameter()

CPU times: user 30min 4s, sys: 7.72 s, total: 30min 12s
Wall time: 1min 17s


37.0

### ClueWeb
[The ClueWeb09 dataset](http://lemurproject.org/clueweb09/) was created to support research on information retrieval and related human language technologies; it consists of about `1.7` billion web pages collected in January and February 2009, connected by roughly `8` billion undirected links.

It is used by several tracks of the TREC conference. The dataset includes web pages in various languages and a web graph that includes unique URLs and total outlinks for the entire dataset and for a subset called TREC Category B (the first 50 million English pages). The ClueWeb09 dataset and subsets are distributed in different formats, including as tarred/gzipped files on hard disk drives and as a subset that is downloaded from the web. The Lemur Project provides online services for searching and interacting with the ClueWeb09 dataset, including an Indri search engine for searching the English and Japanese subsets and Wikipedia, as well as a batch query service and an attribute lookup service. The Lemur Project also offers hosted copies of the ClueWeb09 dataset for organizations that have licenses to use it.

*We also retrieve this graph from [Network Repository](https://networkrepository.com/index.php)*

#### ⚠️⚠️⚠️ This is a big graph! Make sure you have the disk space! ⚠️⚠️⚠️
*This is a warning to ensure that users have sufficient disk space available before attempting to download and use a large graph. It is important to ensure that you have enough space on your hard drive or other storage device to accommodate the size of the graph, as attempting to download or work with a graph that is too large for your available space can lead to errors and other issues. It is advisable to check your available disk space before attempting to download or work with a large graph, and to free up additional space if necessary.*

In [8]:
!du -sh /bfd/graphs/networkrepository/WebClueweb09/

631G	/bfd/graphs/networkrepository/WebClueweb09/


In the following cell we retrieve and load the `Clueweb09` dataset from the [network repository](https://networkrepository.com/index.php). We configure it to not load the node names in order to conserve memory. The cell also includes a directive to measure and display the execution time of the code.

In [9]:
%%time
from grape.datasets.networkrepository import WebClueweb09

clueweb = WebClueweb09(
    # We cannot load the node names, as they would require too much memory
    # for my poor old desktop.
    load_nodes=False,
)

CPU times: user 2h 59min 41s, sys: 6min 31s, total: 3h 6min 12s
Wall time: 37min 52s


We display the number of nodes, `1.68G`, and of undirected edges, `7.8G`.

In [10]:
clueweb.get_number_of_nodes(), clueweb.get_number_of_edges()

(1684868322, 7811385827)

And now we compute the diameter. In this particular graph, even though it is colossal, we see that the IFUB heuristic works great and it completes in a very short time.

In [11]:
%%time
clueweb.get_diameter()

CPU times: user 40min 39s, sys: 34.3 s, total: 41min 13s
Wall time: 1min 53s


124.0

### WikiData
[WikiData](https://www.wikidata.org/wiki/Wikidata:Main_Page) is a collaborative, multilingual, free knowledge base that can be read and edited by humans and machines. It provides structured data that represents the relationships between concepts and entities, including real-world objects, events, and ideas, as well as abstract concepts. The data in WikiData is organized into a graph structure, with nodes representing concepts or entities and edges representing relationships between them. For example, a node for the concept "dog" might be connected to other nodes representing specific dog breeds, such as "Labrador Retriever" or "Poodle," through edges that represent the relationship "breed of."

The WikiData graph is constantly growing and changing as users contribute new data and edit existing data. It is based on a flexible data model that allows for the creation of new properties and classes to represent the relationships between concepts and entities. The data in the WikiData graph is available for free and can be accessed through a variety of methods, including the WikiData API, SPARQL queries, and third-party tools and services. The WikiData graph is used in a variety of applications, including data integration, natural language processing, and machine learning. It is also used to provide structured data for Wikipedia and other Wikimedia projects.

#### ⚠️⚠️⚠️ This is a big graph! Make sure you have the disk space! ⚠️⚠️⚠️
*This is a warning to ensure that users have sufficient disk space available before attempting to download and use a large graph. It is important to ensure that you have enough space on your hard drive or other storage device to accommodate the size of the graph, as attempting to download or work with a graph that is too large for your available space can lead to errors and other issues. It is advisable to check your available disk space before attempting to download or work with a large graph, and to free up additional space if necessary.*

In [6]:
!du -sh /bfd/graphs/wikidata/WikiData

1,7T	/bfd/graphs/wikidata/WikiData


In the next cell we retrieve and load the WikiData dataset from GRAPE, directly from [WikiData's website](https://www.wikidata.org/wiki/Wikidata:Main_Page). Do note that we are configuring it to not load the node names and edge types in order to conserve memory. The cell also includes a directive to measure and display the execution time of the code.

In [1]:
%%time
from grape.datasets.wikidata import WikiData

wikidata = WikiData(
    # We cannot load the node names, as they would require too much memory
    # for my poor old desktop.
    load_nodes=False,
    # Same thing for the edge types.
    load_edge_types=False
)

CPU times: user 1h 56min 23s, sys: 4min 52s, total: 2h 1min 15s
Wall time: 20min 23s


We display the number of nodes, `1.29G`, and of undirected edges, `5G`.

In [3]:
wikidata.get_number_of_nodes(), wikidata.get_number_of_edges()

(1294126247, 5040170396)

And now we compute the diameter. In this particular graph, quite differently from the previous one, we encounter a case where IFUB is having a very hard time, and takes very long to compute the diameter.

In [None]:
%%time
wikidata.get_diameter()

This last one, after 12 hours of execution on my 24-thread desktop, has still not finished. **But why is that? Why is IFUB failing here?** IFUB tests all nodes at a given distance `k` from the starting node to check whether any of them is an endpoint of a longest shortest path, and gradually tightens the bounds. If a large number of nodes all sit at a distance equal to or very close to the diameter, IFUB needs to test all of them, leading to an explosion of computing time.
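
One cheap way to get a feel for this on a small graph is to count how many nodes sit at each distance from the starting node, since the nodes in the outer levels are the ones a fringe-by-fringe search may have to test. The sketch below is only an illustration on a toy graph: the `fringe_sizes` helper is hypothetical, it is not part of GRAPE, and I have not run anything like it on WikiData.

```python
from collections import Counter, deque


def fringe_sizes(adjacency, start):
    """Number of nodes at each BFS distance from `start`."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    counts = Counter(distances.values())
    return [counts[distance] for distance in range(max(counts) + 1)]


# Hypothetical toy graph: a cycle with 8 nodes.
cycle = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
sizes = fringe_sizes(cycle, 0)

# Nodes at distance of at least half the eccentricity of the start.
eccentricity = len(sizes) - 1
outer_nodes = sum(sizes[(eccentricity + 1) // 2:])
print(sizes, outer_nodes / sum(sizes))
```

When a large share of the nodes falls in those outer levels and testing them does not raise the lower bound, each one costs a full traversal of the graph.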

## Conclusions

In this tutorial, we learned how to use the GRAPE library to compute the exact diameter of large graphs. We started by discussing some basic concepts of graph analysis, including quality control, computational complexity, and breadth-first search. We briefly touched upon the concept of graph neural networks and discussed the IFUB algorithm, which is an efficient way to compute the diameter of large graphs. We applied IFUB to four different graphs: the KGCOVID19 knowledge graph, Friendster, ClueWeb09, and WikiData. We saw that IFUB performed well on the KGCOVID19, Friendster and ClueWeb09 graphs, but struggled more on the WikiData graph.

You should now have a good understanding of how scalable IFUB is, how to use it with GRAPE and where it may fail.

Do feel free to reach out with questions, so we may improve this tutorial!

[And remember to ⭐ GRAPE!](https://github.com/AnacletoLAB/grape)