# Billion-scale triangle count with 🍇 GRAPE 🍇
In this tutorial, I will show you how to use the [GRAPE library](https://github.com/AnacletoLAB/grape) to count the number of triangles in a graph. We will first compute a vertex cover of the graph using GRAPE, [as extensively covered in this previous tutorial](https://github.com/AnacletoLAB/grape/blob/main/tutorials/Billion-scale%202-approximated%20vertex%20cover%20with%20GRAPE.ipynb), and then use this vertex cover to efficiently count the triangles in the graph by using [this awesome algorithm by Oded Green and David Bader](https://davidbader.net/publication/2013-g-ba/2013-g-ba.pdf).

I will explain what a triangle is, why triangles matter, and what triangle counting is for. By the end of the tutorial, you will have a good understanding of how to use [GRAPE](https://github.com/AnacletoLAB/grape) to count the triangles in a graph and apply this knowledge to your projects.

[Remember to ⭐ GRAPE!](https://github.com/AnacletoLAB/grape)

### What is GRAPE?
[🍇🍇 GRAPE 🍇🍇](https://github.com/AnacletoLAB/grape) is a graph processing and embedding library that enables users to easily manipulate and analyze graphs. With [GRAPE](https://github.com/AnacletoLAB/grape), users can efficiently load and preprocess graphs, generate random walks, and apply various node and edge embedding models. Additionally, [GRAPE](https://github.com/AnacletoLAB/grape) provides a fair and reproducible evaluation pipeline for comparing different graph embedding and graph-based prediction methods.

![features in GRAPE](https://github.com/AnacletoLAB/grape/raw/main/images/sequence_diagram.png?raw=true)

*The methods shown in the tutorial are available from the nightly version of 🍇 on GitHub, which we'll release on PyPI next week. (Today is 2/01/2023)*

## Triangles in graphs
In graph theory, **a triangle is a simple cycle of three vertices**. A triangle is also known as a 3-cycle.

A triangle can be represented by three vertices and the three edges connecting them. For example, in the following graph:

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/triangle.jpg?raw=true" width=200 />

There is one triangle, formed by vertices `1`, `2`, and `3`. The triangle is represented by the three edges connecting these vertices: `(1,2)`, `(2,3)`, and `(3,1)`.

### Why should you care about triangles?
[Triangles](https://en.wikipedia.org/wiki/Triangle_graph) are an important concept in graph theory because they represent a basic unit of connectivity in a graph. In other words, triangles are a measure of how well connected the vertices in a graph are to each other. For example, if a graph has many triangles, it means that the vertices in the graph are well connected to each other, forming a dense and interconnected structure. On the other hand, if a graph has few triangles, it means that the vertices in the graph are less connected to each other, forming a more sparse and disconnected structure.

Triangles also have several applications in various fields, including social network analysis, machine learning, and data mining.

### What is triangle counting?
The triangle count problem is the problem of counting the number of triangles in a graph. It is a subproblem of more general cycle counting problems, such as counting the number of cycles of a given length in a graph.

Triangles can be counted with various algorithms, such as brute-force enumeration, matrix multiplication-based algorithms, and random sampling-based algorithms, the latter returning an estimate rather than an exact count. Enumeration-based methods identify every triangle in the graph and tally them as they go, so the total is simply the number of triangles the algorithm has found.
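To make the brute-force option concrete, here is a minimal sketch in plain Python. It is only an illustration for tiny graphs and has nothing to do with how GRAPE counts triangles; the function name `count_triangles_brute_force` and the toy edge list are made up for this example.

```python
from itertools import combinations


def count_triangles_brute_force(edges):
    """Count distinct triangles by testing every triple of vertices."""
    # Store each undirected edge as a frozenset so (u, v) and (v, u) match.
    # Self-loops are dropped, as they cannot be part of a triangle.
    edge_set = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    nodes = sorted({node for edge in edge_set for node in edge})
    return sum(
        1
        for a, b, c in combinations(nodes, 3)
        if frozenset((a, b)) in edge_set
        and frozenset((b, c)) in edge_set
        and frozenset((a, c)) in edge_set
    )


# The triangle on vertices 1, 2 and 3, plus one extra edge hanging off it.
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
print(count_triangles_brute_force(toy_edges))  # prints 1
```

Testing every triple costs time cubic in the number of vertices, which is why this approach is hopeless on graphs with millions of nodes.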

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/triangles_in_graph.jpg?raw=true" width=500 />

#### Why should I count triangles?
The triangle count problem has several applications in various fields, including social network analysis, machine learning, and data mining. In these fields, the number of triangles in a graph is often used as a measure of the graph's structure and connectivity. For example, in social network analysis, the number of triangles in a person's social network can be used to measure the person's [clustering coefficient](https://en.wikipedia.org/wiki/Clustering_coefficient), which is a measure of how well connected the person's friends are to each other. In machine learning and data mining, triangle counts can be used to identify patterns and trends in large data sets, and for tasks such as building node embeddings that generalize across graphs, i.e. that are not specific to a single graph.
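As a small illustration of the link between triangles and the clustering coefficient, the sketch below computes the local clustering coefficient of one node from the triangles passing through it. The friendship graph and the function name are hypothetical, and this is not GRAPE code.

```python
def local_clustering_coefficient(adjacency, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    neighbours = adjacency[node] - {node}
    degree = len(neighbours)
    if degree < 2:
        return 0.0
    # Each triangle through `node` is an edge between two of its neighbours.
    # Every such edge is seen from both endpoints, hence the division by 2.
    triangles = sum(len(adjacency[n] & neighbours) for n in neighbours) // 2
    return 2 * triangles / (degree * (degree - 1))


# Hypothetical friendship graph: Ann's friends Bob and Cat know each other,
# but neither of them knows Dan.
friends = {
    "Ann": {"Bob", "Cat", "Dan"},
    "Bob": {"Ann", "Cat"},
    "Cat": {"Ann", "Bob"},
    "Dan": {"Ann"},
}
print(local_clustering_coefficient(friends, "Ann"))
```

Ann has three friends, so three possible pairs of friends, and only one of those pairs (Bob and Cat) is connected: her coefficient is one third.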

We will explore in an upcoming tutorial how we can compute the clustering coefficient of large graphs.

### A good way to count triangles!
We will be using [an efficient triangle counting method based on a vertex cover, created by Oded Green and David Bader](https://davidbader.net/publication/2013-g-ba/2013-g-ba.pdf). A vertex cover is a set of vertices such that for every edge in the graph, at least one of its endpoints is included in the vertex cover. By exploiting the properties of a vertex cover, it is possible to significantly reduce the number of intersections of adjacency lists that must be performed in order to count the triangles in a graph.

[We have covered in a previous tutorial how we can compute good 2-approximated vertex covers](https://github.com/AnacletoLAB/grape/blob/main/tutorials/Billion-scale%202-approximated%20vertex%20cover%20with%20GRAPE.ipynb).
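For readers who skipped that tutorial, the definition of a vertex cover can be checked in a few lines. The sketch below uses the classical maximal-matching 2-approximation, which I am assuming here purely for illustration: it is not necessarily the procedure GRAPE runs, and both function names are hypothetical.

```python
def matching_vertex_cover(edges):
    """Classical 2-approximation: take both endpoints of a maximal matching."""
    cover = set()
    for u, v in edges:
        # An edge with no endpoint in the cover yet joins the matching,
        # and both of its endpoints enter the cover.
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


def is_vertex_cover(edges, cover):
    """Check that every edge has at least one endpoint in `cover`."""
    return all(u in cover or v in cover for u, v in edges)


toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
cover = matching_vertex_cover(toy_edges)
print(cover, is_vertex_cover(toy_edges, cover))
```

On this toy graph the approximation picks all four vertices, while `{1, 3}` would be enough: the result is valid and within a factor two of the minimum, but not minimal.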

The algorithm, in python pseudocode, is the following:

```python
number_of_triangles = 0
# We can compute a vertex cover using many approaches
# During the last tutorial, I showed a 2-approximation
# of a minimum vertex cover.
# I stress that the following vertex cover does not need
# to be minimal, but the smaller it is the faster the algorithm
# will be.
vertex_cover = compute_vertex_cover()

# We iterate over all nodes in the vertex cover
# Of course, this iteration can be trivially
# parallelized.
for first in vertex_cover:
    # We iterate over all neighbours of the current node
    for second in neighbours(first):
        # If the second is equal to the first node,
        # or is not in the vertex cover we can skip
        # over this node and continue.
        if second == first or second not in vertex_cover:
            continue
        # Otherwise we can continue, and we iterate
        # over the intersection of the first and second
        # node neighbours.
        for third in intersection(
            neighbours(first),
            neighbours(second)
        ):
            # We skip over the first and second
            # nodes when we encounter them, as
            # these would come from self-loops, not triangles.
            if third == second or third == first:
                continue
            # Then, if also the third node is in
            # the vertex cover, we increase the
            # triangle count by 1 as we will encounter
            # this triangle other times.
            if third in vertex_cover:
                number_of_triangles += 1
            # Otherwise, we increase the count by 3,
            # since we will not encounter this triangle
            # other times.
            else:
                number_of_triangles += 3

# Finally, if the graph is undirected, we 
# need to halve the number of triangles as
# we have counted all triangles twice.

# We may be able to avoid this by building
# an ad-hoc vertex cover.
if not graph.is_directed():
    # Integer division keeps the count an integer.
    number_of_triangles //= 2
```

The algorithm works by iterating over all vertices in the vertex cover, and for each vertex, iterating over its neighbors in the adjacency list. If the neighbor is also included in the vertex cover, the algorithm calculates the intersection of the adjacency lists of the two vertices. Every vertex in the intersection closes a triangle with the two. If this third vertex is also included in the vertex cover, the same triangle will be found again starting from its other edges, so the algorithm increments the counter by one. If the third vertex is not included in the vertex cover, the triangle can only be found from this one edge, the only one with both endpoints in the cover, so the algorithm increments the counter by three to compensate.
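To check the bookkeeping, here is a runnable transcription of the pseudocode on a toy undirected graph, compared against a brute-force reference. It is a plain-Python sketch with hypothetical names (`count_with_vertex_cover`, `count_distinct_triangles`), not GRAPE's implementation.

```python
from itertools import combinations


def count_with_vertex_cover(adjacency, vertex_cover):
    """Runnable transcription of the pseudocode for an undirected graph."""
    total = 0
    for first in vertex_cover:
        for second in adjacency[first]:
            if second == first or second not in vertex_cover:
                continue
            for third in adjacency[first] & adjacency[second]:
                if third == first or third == second:
                    continue
                total += 1 if third in vertex_cover else 3
    # Undirected graph: every (first, second) pair was visited in both orders.
    return total // 2


def count_distinct_triangles(adjacency):
    """Brute-force reference: distinct vertex triples that form a triangle."""
    return sum(
        1
        for a, b, c in combinations(sorted(adjacency), 3)
        if b in adjacency[a] and c in adjacency[b] and a in adjacency[c]
    )


# Toy undirected graph with two triangles: {1, 2, 3} and {2, 3, 4}.
adjacency = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
cover = {2, 3, 4}  # every edge touches 2, 3 or 4

weighted = count_with_vertex_cover(adjacency, cover)
print(weighted, count_distinct_triangles(adjacency))  # prints: 6 2
```

Note that, as written, the pseudocode tallies each triangle once per member vertex, so the toy graph with two distinct triangles yields 6, and the result does not depend on which valid cover is used (`{2, 3}` gives 6 as well). Dividing by three gives the number of distinct triangles; this sketch says nothing about which convention a given library reports.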

This approach has a worst-case time complexity of:

$$O(\lvert \hat{V}\rvert \cdot \hat{d}^2_{\text{max}}) + \underbrace{O(\lvert E\rvert)}_{\text{Vertex cover}}$$

where $\lvert \hat{V}\rvert$ is the cardinality of the vertex cover, $\hat{d}_{\text{max}}$ is the maximum node degree in the vertex cover, and $O(\lvert E\rvert)$ is the time needed to find the vertex cover in a graph with $\lvert E\rvert$ edges. The approach can also be extended to count squares (4-circuits) and generic circuits.

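Overall, counting triangles through a vertex cover compares favourably with the best known time complexity for computing the clustering coefficient, which is $O(\lvert V \rvert \cdot d^2_\text{max})$: since the vertex cover is a subset of the nodes, $\lvert \hat{V}\rvert \leq \lvert V \rvert$ and $\hat{d}_{\text{max}} \leq d_\text{max}$, so the first term of the bound above is never larger.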

**One thing that can be explored to improve the current algorithm is that different approximated vertex covers may lead to different computational requirements. One important trade-off here is that as a vertex cover becomes smaller, and therefore the $\lvert \hat{V}\rvert$ component decreases, it will tend to contain nodes with higher degrees, and therefore $\hat{d}_{\text{max}}$ will increase.**
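The trade-off is easy to see on an extreme, hypothetical example: a star graph, where the smallest cover is the hub alone. The sketch below only evaluates the $\lvert \hat{V}\rvert \cdot \hat{d}^2_{\text{max}}$ term of the bound for two valid covers; the helper `intersection_bound` is made up for this illustration.

```python
def intersection_bound(adjacency, vertex_cover):
    """Evaluate the |cover| * max_degree_in_cover**2 term of the bound."""
    max_degree = max(len(adjacency[node]) for node in vertex_cover)
    return len(vertex_cover) * max_degree ** 2


# Star graph: one hub (node 0) linked to 1000 leaves.
leaves = range(1, 1001)
star = {0: set(leaves), **{leaf: {0} for leaf in leaves}}

hub_cover = {0}           # the minimum vertex cover
leaf_cover = set(leaves)  # a much larger, but still valid, cover

print(intersection_bound(star, hub_cover))   # 1 * 1000**2 = 1000000
print(intersection_bound(star, leaf_cover))  # 1000 * 1**2 = 1000
```

A star has no triangles at all, so this says nothing about actual running time; it only shows that the worst-case bound can favour a larger cover made of low-degree nodes.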

Feel free to reach out if you have ideas.

## Installing GRAPE
First, we install the GRAPE library from PyPI:

In [1]:
!pip install grape -qU

## Experiments
Welcome to the experiments section of this tutorial! In this section, we will put our knowledge into practice by applying the triangle counting algorithm on four different graphs: the [KGCOVID19 knowledge graph](https://www.cell.com/patterns/fulltext/S2666-3899(20)30203-8?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS2666389920302038%3Fshowall%3Dtrue), the [Friendster graph](https://networkrepository.com/friendster.php), the [ClueWeb09 web graph](https://networkrepository.com/web-ClueWeb09.php), and [the WikiData graph](https://www.wikidata.org/wiki/Wikidata:Main_Page).

We run these experiments on a machine with 24 threads and 12 cores.

**Do note that, because of the memory limits of my desktop, I will restart the Jupyter kernel after running the experiment on each of the large graphs.**

My machine only has 24 threads. You can estimate the expected computation time on yours by scaling the times reported here by the ratio between 24 and the number of threads you have:

In [2]:
import os

os.cpu_count()

24
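The rescaling described above can be sketched as follows. It assumes ideal linear scaling with the number of threads, which ignores differences in clock speed, memory bandwidth and disk, so treat the result as a rough guess; the helper `estimate_wall_time` is hypothetical.

```python
import os


def estimate_wall_time(reference_seconds, reference_threads=24, threads=None):
    """Rescale a wall time measured on `reference_threads` to another machine."""
    # Default to the thread count of the machine running this code.
    threads = threads or os.cpu_count() or 1
    return reference_seconds * reference_threads / threads


# Wall time reported later in this tutorial for counting the triangles
# of Friendster on the 24-thread machine: 7min 50s.
friendster_seconds = 7 * 60 + 50
print(estimate_wall_time(friendster_seconds, threads=8) / 60)  # minutes on 8 threads
```

Under these assumptions, a run that took under 8 minutes on 24 threads would take around 23.5 minutes on 8.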

Also, this machine has about `128GB` of RAM:

In [3]:
import psutil
    
psutil.virtual_memory().total / 1024**3 # total physical memory, converted from bytes to GiB

125.7063217163086

### KGCOVID19
We kick off our experiments with a relatively small graph, considering the sizes of the networks we will tackle by the end of it: KGCOVID19, with `574K` nodes and `18M` edges.

#### What is KGCOVID19?
[KGCOVID19](https://doi.org/10.1016%2Fj.patter.2020.100155) is a framework for producing knowledge graphs (KGs) that integrate biomedical data related to the COVID-19 pandemic. The framework is designed to be flexible and customizable, allowing researchers to create KGs for different downstream applications, including machine learning tasks, hypothesis-based querying, and browsable user interfaces for exploring and discovering relationships in COVID-19 data. The goal of KGCOVID19 is to provide an up-to-date, integrated source of data on SARS-CoV-2 and related viruses, including SARS-CoV and MERS-CoV, to support the biomedical research community in its efforts to respond to the COVID-19 pandemic. The framework can also be applied to other situations where siloed biomedical data must be quickly integrated for various research purposes, including future pandemics.

In [4]:
%%time
from grape.datasets.kghub import KGCOVID19

kgcovid19 = KGCOVID19()

CPU times: user 32.1 s, sys: 3.81 s, total: 35.9 s
Wall time: 12 s


We display the number of nodes, `574K`, and of undirected edges, `18M`.

In [5]:
kgcovid19.get_number_of_nodes(), kgcovid19.get_number_of_edges()

(574232, 18251238)

And now we compute the number of triangles of KGCOVID19.

In [6]:
%%time
kgcovid19.get_number_of_triangles()

CPU times: user 2min 24s, sys: 738 ms, total: 2min 25s
Wall time: 6.28 s


1208845416

And done! Just a few seconds! Over $1$ billion triangles!

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/one_billion_triangles.jpeg?raw=true" width=300 />

### Friendster
[Friendster](https://en.wikipedia.org/wiki/Friendster) was a social networking service launched in 2002. It was one of the first social networking sites and was popular in the early 2000s. The site allowed users to connect with friends and meet new people through the use of personal profiles and networks of friends. Friendster was initially successful but eventually faced competition from more recent social networking sites such as MySpace and Facebook. It was acquired by the Malaysian company MOL Global in 2009, announced in 2011 that it was transitioning from a social networking site to a social gaming site, and suspended its services in 2015.

#### What is the network repository?
[Network Repository](https://networkrepository.com/index.php) is a scientific network data repository that provides interactive visualization and mining tools for analyzing and exploring network data. It is the first interactive repository of its kind. Network Repository is intended to facilitate scientific research on networks by making it easier for researchers to access and analyze an extensive network data collection. It is a valuable resource for researchers in various fields, including network science, bioinformatics, machine learning, data mining, physics, and social science.

#### ⚠️⚠️⚠️ WARNING: Make sure you have enough disk space! ⚠️⚠️⚠️
*Please be aware that this graph is not small and requires a significant amount of disk space to store and work with. Before proceeding with the tutorial, ensure you have enough free space on your hard drive or other storage devices to accommodate the size of the graph. If you do not have sufficient space, you may encounter errors or other issues when downloading or working with the graph. It is important to ensure that you have enough space available before proceeding. If necessary, consider freeing up additional space on your device to make room for the graph.*

In [7]:
!du -sh /bfd/graphs/networkrepository/SocFriendster

97G	/bfd/graphs/networkrepository/SocFriendster


In the next cell we retrieve and load the Friendster dataset with GRAPE, sourced from the [network repository](https://networkrepository.com/index.php). Do note that we are configuring it to not load the node names in order to conserve memory. The cell also includes a directive to measure and display the execution time of the code.

In [8]:
%%time
from grape.datasets.networkrepository import SocFriendster

friendster = SocFriendster(
    # We cannot load the node names, as they would require too much memory
    # for my poor old desktop.
    load_nodes=False,
)

CPU times: user 44min 17s, sys: 53.8 s, total: 45min 11s
Wall time: 5min 37s


We display the number of nodes, `65.6M`, and of undirected edges, `1.8G`.

In [9]:
friendster.get_number_of_nodes(), friendster.get_number_of_edges()

(65608366, 1806067135)

We compute the number of triangles in Friendster:

In [10]:
%%time
friendster.get_number_of_triangles()

CPU times: user 2h 53min 45s, sys: 20.6 s, total: 2h 54min 6s
Wall time: 7min 50s


12521172426

A bit slower, but not too bad considering we are already in the billions of edges! $12$ billion triangles!

### ClueWeb
[The ClueWeb09 dataset](http://lemurproject.org/clueweb09/) was created to support research on information retrieval and related human language technologies; it consists of about `1.7` billion web pages that were collected in January and February 2009, connected by roughly `8` billion undirected links.

It is used by several tracks of the TREC conference. The dataset includes web pages in various languages and a web graph that includes unique URLs and total outlinks for the entire dataset and for a subset called TREC Category B (the first 50 million English pages). The ClueWeb09 dataset and subsets are distributed in different formats, including as tarred/gzipped files on hard disk drives and as a subset that is downloaded from the web. The Lemur Project provides online services for searching and interacting with the ClueWeb09 dataset, including an Indri search engine for searching the English and Japanese subsets and Wikipedia, as well as a batch query service and an attribute lookup service. The Lemur Project also offers hosted copies of the ClueWeb09 dataset for organizations that have licenses to use it.

*We also retrieve this graph from [Network Repository](https://networkrepository.com/index.php).*

#### ⚠️⚠️⚠️ This is a big graph! Make sure you have the disk space! ⚠️⚠️⚠️
*This is a warning to ensure that users have sufficient disk space before downloading and using a large graph. It is important to ensure that you have enough space on your hard drive or another storage device to accommodate the graph size, as attempting to download or work with a graph that is too large for your available space can lead to errors and other issues. It is advisable to check your available disk space before downloading or working with a large graph and free up additional space if necessary.*

In [1]:
!du -sh /bfd/graphs/networkrepository/WebClueweb09/

631G	/bfd/graphs/networkrepository/WebClueweb09/


In the following cell we retrieve and load the `Clueweb09` dataset from the [network repository](https://networkrepository.com/index.php). We configure it to not load the node names in order to conserve memory. The cell also includes a directive to measure and display the execution time of the code.

In [2]:
%%time
from grape.datasets.networkrepository import WebClueweb09

clueweb = WebClueweb09(
    # We cannot load the node names, as they would require too much memory
    # for my poor old desktop.
    load_nodes=False,
)

CPU times: user 3h 2min 51s, sys: 6min 59s, total: 3h 9min 50s
Wall time: 37min 59s


We display the number of nodes, `1.68G`, and of undirected edges, `7.8G`.

In [3]:
clueweb.get_number_of_nodes(), clueweb.get_number_of_edges()

(1684868322, 7811385827)

We compute the number of triangles of ClueWeb. Here we start with the heavyweights!

In [4]:
%%time
clueweb.get_number_of_triangles()

CPU times: user 7d 17h 41min 32s, sys: 28min 38s, total: 7d 18h 10min 11s
Wall time: 8h 36min 37s


93039057369

DAMN! That's a big graph! $93$ billion triangles! No wonder it took 8 hours even for this algorithm!

<img src="https://media.tenor.com/T42cqp6YKEEAAAAC/damn-damn-damn-damn.gif" />

### WikiData
[WikiData](https://www.wikidata.org/wiki/Wikidata:Main_Page) is a collaborative, multilingual, free knowledge base that can be read and edited by humans and machines. It provides structured data representing the relationships between concepts and entities, including real-world objects, events, ideas and abstract concepts. The data in WikiData is organized into a graph structure, with nodes representing concepts or entities and edges representing relationships between them. For example, a node for the idea "dog" might be connected to other nodes representing specific dog breeds, such as "Labrador Retriever" or "Poodle," through edges that define the relationship "breed of."

The WikiData graph is constantly growing and changing as users contribute new data and edit existing data. It is based on a flexible data model that allows for creation of new properties and classes to represent the relationships between concepts and entities. The WikiData graph is used in various applications, including data integration, natural language processing, and machine learning. It also provides structured data for Wikipedia and other Wikimedia projects.

#### ⚠️⚠️⚠️ This is a big graph! Make sure you have the disk space! ⚠️⚠️⚠️
*This is a warning to ensure that users have sufficient disk space before downloading and using a large graph. It is important to ensure that you have enough space on your hard drive or another storage device to accommodate the graph size, as attempting to download or work with a graph that is too large for your available space can lead to errors and other issues. It is advisable to check your available disk space before downloading or working with a large graph and free up additional space if necessary.*

In [1]:
!du -sh /bfd/graphs/wikidata/WikiData

1,7T	/bfd/graphs/wikidata/WikiData


In the next cell we retrieve and load the WikiData dataset from GRAPE, directly from [WikiData's website](https://www.wikidata.org/wiki/Wikidata:Main_Page). Do note that we are configuring it to not load the node names and edge types in order to conserve memory. The cell also includes a directive to measure and display the execution time of the code.

In [1]:
%%time
from grape.datasets.wikidata import WikiData

wikidata = WikiData(
    # We cannot load the node names, as they would require too much memory
    # for my poor old desktop.
    load_nodes=False,
    # Same thing for the edge types.
    load_edge_types=False
)

CPU times: user 2h 1min 1s, sys: 5min 21s, total: 2h 6min 22s
Wall time: 20min 29s


We display the number of nodes, `1.29G`, and of undirected edges, `5G`.

In [2]:
wikidata.get_number_of_nodes(), wikidata.get_number_of_edges()

(1294126247, 5040170396)

Here, we try to compute the number of triangles of WikiData.

In [None]:
%%time
wikidata.get_number_of_triangles()

After over two days of computation, this one was still running, so I had to kill it. [We know from the previous tutorial on computing an approximated vertex cover on WikiData](https://github.com/AnacletoLAB/grape/blob/main/tutorials/Billion-scale%202-approximated%20vertex%20cover%20with%20GRAPE.ipynb) that we can quickly compute a vertex cover with only $16\%$ of WikiData's nodes, roughly with the same number of nodes as ClueWeb's, and WikiData has fewer edges than ClueWeb. This means that, most likely, WikiData just has many more triangles than ClueWeb, and therefore much larger intersections between the neighbourhoods of the first and second nodes to process.

## Conclusions

In this tutorial, we learned how to use the [GRAPE](https://github.com/AnacletoLAB/grape) library to compute the exact number of triangles in large graphs. We discussed what a triangle is, and why counting triangles can be useful. We also illustrated an algorithm for counting triangles using an approximated vertex cover.

I hope you now have a better grasp on counting triangles and how to use GRAPE to count them in your projects. Do feel free to reach out with any questions or feedback, as I am always looking for ways to improve this tutorial.

[And remember to ⭐ GRAPE!](https://github.com/AnacletoLAB/grape)