# Graph set algebra with 🍇🍇 GRAPE 🍇🍇
Welcome to this tutorial on graph set algebra! In this tutorial, we will learn how to execute set algebraic operations on graphs using [GRAPE!](https://github.com/AnacletoLAB/grape). Such operations can be useful to analyze and describe the structure and relationships of graphs.

We will start by introducing the basic concepts of set theory and set algebra and how they apply to graphs. We will then explore some of the common operations used in set algebra, such as union, intersection, and complement. We will also discuss how to apply these operations to analyze and understand the structure and relationships of different types of graphs.

By the end of this tutorial, you will have a good understanding of how to use set algebra to analyze and describe the structure and relationships of graphs, and you will be able to apply these techniques to your own projects.

[Remember to ⭐ GRAPE!](https://github.com/AnacletoLAB/grape)

### What is GRAPE?
[🍇🍇 GRAPE 🍇🍇](https://github.com/AnacletoLAB/grape) is a graph processing and embedding library that enables users to easily manipulate and analyze graphs. With [GRAPE](https://github.com/AnacletoLAB/grape), users can efficiently load and preprocess graphs, generate random walks, and apply various node and edge embedding models. Additionally, [GRAPE](https://github.com/AnacletoLAB/grape) provides a fair and reproducible evaluation pipeline for comparing different graph embedding and graph-based prediction methods.

## Set Theory
[Set theory](https://en.wikipedia.org/wiki/Set_theory) is the branch of mathematics that studies sets, which are collections of objects. It is a foundation of modern mathematics and can formalize and generalize mathematical concepts such as numbers, functions, and space.

### The stormy origins of set theory
The history of set theory can be traced back to ancient Greece, where philosophers such as Zeno of Elea and Euclid reasoned about collections of objects to study logical and geometric concepts. However, the modern formulation of set theory began in the late 19th century with the work of Georg Cantor. Cantor developed a theory of infinite sets, which was **a revolutionary idea** at the time because it challenged the traditional mathematical concept of a set as a finite collection of objects.

Initially, set theory was met with resistance from some mathematicians **who believed it was nonsense because it seemed to violate the fundamental principles of mathematics**. For example, an infinite set can be put in one-to-one correspondence with a proper subset of itself, as the natural numbers can with the even numbers, which seemed to contradict the classical principle that the whole is greater than the part. Additionally, set theory admitted sets whose members could not be explicitly constructed, which troubled mathematicians who held that a mathematical object exists only if it can be built explicitly.

Despite this resistance, set theory eventually gained acceptance as a fundamental mathematical concept and became a cornerstone of modern mathematics. **Today, set theory is a fundamental part of the mathematical landscape, and it is used in many areas of mathematics and computer science.**


### Set algebra
[Set algebra](https://en.wikipedia.org/wiki/Algebra_of_sets) is a branch of mathematics that deals with the manipulation and analysis of sets using algebraic techniques. It is used to describe and reason about sets and their properties using mathematical symbols and operations.

In set algebra, sets are represented by capital letters, and the members of a set are written inside curly braces. For example, the set of all strictly positive integers less than $10$ can be written as $\{1, 2, 3, 4, 5, 6, 7, 8, 9\}$. The **union** of two sets $A$ and $B$ is represented by the symbol $\cup$, and it represents the set of elements that belong to either $A$ or $B$ (or both). The **intersection** of two sets $A$ and $B$ is represented by the symbol $\cap$, and it represents the set of elements that belong to both $A$ and $B$.

Set algebra also includes operations such as **complement**, **difference**, and **power set**.

The **complement** of a set $A$ is the set of elements of a given universal set $U$ that *do not belong* to $A$. The **difference** between two sets $A$ and $B$ is the set of elements that belong to $A$ but not to $B$, and it is written $A - B$ or $A \setminus B$. The **power set** of a set $A$ is the set of all possible subsets of $A$, and it is written $\mathcal{P}(A)$.
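
The operations above map directly onto Python's built-in `set` type. The following is a minimal sketch on toy integer sets; the names `universe`, `a`, `b`, and `power_set` are illustrative and are not part of GRAPE.

```python
from itertools import chain, combinations

# A complement only makes sense relative to a universe:
# here, the strictly positive integers less than 10.
universe = set(range(1, 10))
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b                # elements in A or B (or both)
intersection = a & b         # elements in both A and B
difference = a - b           # elements in A but not in B
complement_a = universe - a  # elements of the universe not in A


def power_set(items):
    """Return every subset of `items` as a list of frozensets."""
    items = list(items)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return [frozenset(subset) for subset in subsets]


print(union == {1, 2, 3, 4, 5})
print(intersection == {3, 4})
print(difference == {1, 2})
print(complement_a == {5, 6, 7, 8, 9})
# A set with n elements has 2**n subsets.
print(len(power_set(b)) == 2 ** len(b))
```

Note that the power set grows exponentially with the size of the set, so it is rarely materialized for anything but tiny sets.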

Set algebra is an important tool in many areas of mathematics, including set theory, logic, and computer science. It is used to formalize and analyze sets and their properties, and it is a fundamental part of the mathematical language used to describe and reason about sets.

### Graphs are sets
A [graph](https://en.wikipedia.org/wiki/Graph_(discrete_mathematics)) is a mathematical structure used to represent relationships between objects. It consists of a set of nodes (also called vertices) and a set of edges that connect the nodes. The nodes represent the objects, and the edges represent the relationships between the objects.

Graphs are used to model a wide variety of situations and structures, including social networks, transportation systems, and biological systems. They are a useful tool for understanding and analyzing complex systems and relationships.

Graphs are based on the concept of a set, which is a collection of objects. In a graph, the nodes of the graph represent the elements of the set, and the edges represent the relationships between the nodes and form their own set. Set theory provides the mathematical tools for analyzing and manipulating sets, and it is an important foundation for the study of graphs.

Graphs are an important tool in many areas of mathematics and computer science, and they are used to represent and analyze complex structures and relationships. They are a fundamental concept in the study of set theory and discrete mathematics, and they play a central role in many other areas of mathematics and computer science.


### Graph set algebra
Since graphs are sets, it is possible to readily apply set algebra to graphs. A graph is composed of two sets, so graph set algebra operations should be designed in a way that maintains consistency between the two sets. This is not always possible or desirable to implement, as we will see shortly.
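
To make the consistency issue concrete, here is a minimal sketch that models a graph as a pair of Python sets. `ToyGraph` and the helper functions are hypothetical names introduced only for illustration; this is not how GRAPE stores or combines graphs. The sketch shows that union and intersection keep the node and edge sets consistent, while naively subtracting both sets can leave edges pointing at nodes that no longer exist.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

Edge = Tuple[str, str]


class ToyGraph(NamedTuple):
    """Hypothetical minimal directed graph: a node set plus an edge set."""

    nodes: FrozenSet[str]
    edges: FrozenSet[Edge]


def make_graph(edges: Iterable[Edge]) -> ToyGraph:
    """Build a graph whose nodes are exactly the endpoints of its edges."""
    edge_set = frozenset(edges)
    node_set = frozenset(node for edge in edge_set for node in edge)
    return ToyGraph(node_set, edge_set)


def is_consistent(graph: ToyGraph) -> bool:
    """Check that every edge connects two nodes that belong to the graph."""
    return all(
        src in graph.nodes and dst in graph.nodes for src, dst in graph.edges
    )


def union(a: ToyGraph, b: ToyGraph) -> ToyGraph:
    return ToyGraph(a.nodes | b.nodes, a.edges | b.edges)


def intersection(a: ToyGraph, b: ToyGraph) -> ToyGraph:
    # A shared edge has both endpoints in both graphs, so this stays consistent.
    return ToyGraph(a.nodes & b.nodes, a.edges & b.edges)


def naive_difference(a: ToyGraph, b: ToyGraph) -> ToyGraph:
    # Subtracting both sets independently can leave dangling edges.
    return ToyGraph(a.nodes - b.nodes, a.edges - b.edges)


def edge_difference(a: ToyGraph, b: ToyGraph) -> ToyGraph:
    # One possible design choice: keep all nodes of `a`, drop only shared edges.
    return ToyGraph(a.nodes, a.edges - b.edges)


a = make_graph([("a", "b"), ("b", "c")])
b = make_graph([("b", "c"), ("c", "d")])

print(is_consistent(union(a, b)))             # True
print(is_consistent(intersection(a, b)))      # True
print(is_consistent(naive_difference(a, b)))  # False: edge (a, b) lost node b
print(is_consistent(edge_difference(a, b)))   # True
```

The edge-only difference is just one way to restore consistency; which trade-off is appropriate depends on the application.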

In the following illustration, I tried to represent graph set algebra using three planar graphs following the [RGB luminosity rules](https://en.wikipedia.org/wiki/RGB_color_model). The red graph represents set $A$, the green graph represents set $B$, and the blue graph represents set $C$. The intersections of the sets are shown in yellow $A \cap B$, purple $A \cap C$, cyan $B \cap C$, and white $A \cap B \cap C$. This figure illustrates how set algebra can be used to analyze and describe the relationships between the nodes and edges of a graph.

<img src="https://github.com/AnacletoLAB/grape/blob/main/images/set_algebra.jpg?raw=true" alt="Graph algebra" width=400 />

## Installing GRAPE
Installing the GRAPE library is super-easy. Just [install it from PyPI](https://pypi.org/project/grape/) as follows:

In [1]:
!pip install grape -qU

## Hardware requirements
The following examples use graphs with up to tens of millions of nodes and edges, which are small enough that you can confidently run them on your laptop without worrying about the memory requirements or the execution time.

Nevertheless, to let you easily compare your computer with the one I am currently using, know that I executed this tutorial on a desktop with 12 cores and 24 threads.

In [2]:
import os

os.cpu_count()

24

## Experiments
Welcome to the experiments section of this tutorial! In this section, we will apply various techniques to merge and analyze the five datasets. We will start by merging the HPO and MPO using the union operator, then we will perform an intersection operation between the KGCOVID19 knowledge graph and the union of the HPO and MPO. Next, we will merge the PubMed citation graph with the KGCOVID19 knowledge graph and the HPO and MPO. Finally, we will test the validity of our edge holdouts by checking for overlaps between the training and test sets and by subtracting the test set from the complete Homo Sapiens STRING graph.

By the end of this, you should be a pro at manipulating graphs using graph set algebra.

### Datasets
In this tutorial, we will be working with five datasets: the [Human Phenotype Ontology (HPO)](https://hpo.jax.org/app/), the [Mammalian Phenotype Ontology (MPO)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2801442/), the [STRING protein-protein interaction dataset](https://string-db.org/), [the KGCOVID19 knowledge graph](https://doi.org/10.1016%2Fj.patter.2020.100155), and [the PubMed citation graph](https://github.com/LucaCappelletti94/pubmed_embedding).

The HPO is a standardized vocabulary for describing human phenotypes, used to classify and annotate human phenotypes in a consistent and standardized way. It is an important resource for understanding the underlying causes of human diseases and conditions. The MPO is similar, but for describing the phenotypes of mammals. The STRING dataset is a database of protein-protein interactions in various organisms, represented as a graph. The KGCOVID19 knowledge graph contains information about the COVID-19 pandemic. The PubMed citation graph represents the relationships between articles in the PubMed database, with the nodes of the graph representing articles and the edges representing citations between the articles. The edges can be weighted and labeled with the type of citation.

### Human Phenotype Ontology (HPO)
The [Human Phenotype Ontology (HPO)](https://hpo.jax.org/app/) is a standardized vocabulary for describing human phenotypes, which are the observable characteristics of an individual resulting from the interaction between their genotype and the environment. The HPO is used to classify and annotate human phenotypes in a consistent and standardized way, and it is an important resource for understanding the underlying causes of human diseases and conditions.

The HPO is organized as a hierarchical structure, with each term representing a specific phenotype. The terms are organized into a tree-like structure, with more specific terms being children of more general terms.

*⚠️ Heads up! This line of code will download a graph! Make sure you have a good internet connection and enough disk space before proceeding! 💾*

In [3]:
! du -sh /bfd/graphs/kgobo/HP/2022-10-05

24M	/bfd/graphs/kgobo/HP/2022-10-05


In [4]:
from grape.datasets.kgobo import HP

hpo = HP(version="2022-10-05")

Let's check out the report of HPO:

In [5]:
hpo

### Mammalian Phenotype Ontology
The [Mammalian Phenotype Ontology (MPO)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2801442/) is a standardized vocabulary for describing the phenotypes of mammals. Like the Human Phenotype Ontology (HPO), the MPO is used to classify and annotate the phenotypes of mammals in a consistent and standardized way, and it is an important resource for understanding the underlying causes of diseases and conditions in mammals.

Like HPO, MPO is also organized as a hierarchical structure, with each term representing a specific phenotype. The terms are organized into a tree-like structure, with more specific terms being children of more general terms.

*⚠️ Heads up! This line of code will download a graph! Make sure you have a good internet connection and enough disk space before proceeding! 💾*

In [6]:
! du -sh /bfd/graphs/kgobo/MP/2021-11-04

14M	/bfd/graphs/kgobo/MP/2021-11-04


In [7]:
from grape.datasets.kgobo import MP

mpo = MP(version="2021-11-04")

Let's check out the report of MPO:

In [8]:
mpo

### STRING protein-protein interaction
[STRING (Search Tool for the Retrieval of Interacting Genes/Proteins)](https://string-db.org/) is a database that provides information on protein-protein interactions (PPIs) in a variety of organisms. The database includes a large network of PPIs, which is represented as a graph.

In the STRING graph, the nodes represent proteins, and the edges represent the interactions between the proteins. The edges are weighted to indicate the confidence in the existence of the interaction, with weights ranging from `100` to `1000`. The graph is generally filtered at `700`, as we will see shortly.

In [9]:
%%time
from grape.datasets.string import HomoSapiens

homo_sapiens = HomoSapiens()

CPU times: user 14.4 s, sys: 652 ms, total: 15 s
Wall time: 8.61 s


Let's take a look at STRING's graph report:

In [10]:
homo_sapiens

As suggested by the STRING authors, we filter the graph at `700`:

In [11]:
homo_sapiens = homo_sapiens.filter_from_names(min_edge_weight=700)
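
Conceptually, filtering by weight keeps only the subset of edges whose confidence score reaches the threshold. The following plain-Python sketch illustrates the idea on a handful of made-up edges; the protein names and scores are hypothetical, and treating the threshold as inclusive is an assumption of this sketch rather than a statement about GRAPE's behavior.

```python
# Hypothetical weighted edges: (source, destination, confidence score).
edges = [
    ("P1", "P2", 950),
    ("P1", "P3", 400),
    ("P2", "P3", 700),
    ("P3", "P4", 150),
]

MIN_EDGE_WEIGHT = 700

# Assumption: the threshold is inclusive, so a score of exactly 700 is kept.
high_confidence = [
    (src, dst, weight)
    for src, dst, weight in edges
    if weight >= MIN_EDGE_WEIGHT
]

print(high_confidence)
```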

### KGCOVID19
[KGCOVID19](https://doi.org/10.1016%2Fj.patter.2020.100155) is a framework for producing knowledge graphs (KGs) that ingest and integrate biomedical data related to the COVID-19 pandemic. The framework is designed to be flexible and customizable, allowing researchers to create KGs for different downstream applications, including machine learning tasks, hypothesis-based querying, and browsable user interfaces for exploring and discovering relationships in COVID-19 data. The goal of KGCOVID19 is to provide an up-to-date, integrated source of data on SARS-CoV-2 and related viruses, including SARS-CoV and MERS-CoV, to support the biomedical research community in its efforts to respond to the COVID-19 pandemic. The framework can also be applied to other situations where siloed biomedical data must be quickly integrated for various research purposes, including future pandemics.

*⚠️ Heads up! This line of code will download a graph! Make sure you have a good internet connection and enough disk space before proceeding! 💾*

In [12]:
%%time
from grape.datasets.kghub import KGCOVID19

kgcovid19 = KGCOVID19()

CPU times: user 23.3 s, sys: 222 ms, total: 23.5 s
Wall time: 1.85 s


Let's take a look at its graph report:

In [13]:
kgcovid19

### PubMed
PubMed is a database of biomedical literature maintained by the National Institutes of Health (NIH). It includes citations and abstracts for articles from a wide range of biomedical journals, as well as links to full-text articles when available. PubMed is a valuable resource for researchers and clinicians who are looking for information on biomedical topics.

[The following is a PubMed citation graph that I personally created](https://github.com/LucaCappelletti94/pubmed_embedding). This graph represents the relationships between articles in the PubMed database. The nodes of the graph represent articles, and the edges represent the citations between the articles. 

*⚠️ Heads up! This line of code will download a graph! Make sure you have a good internet connection and enough disk space before proceeding! 💾*

In [14]:
%%time
from grape.datasets.pubmed import PubMed

pubmed = PubMed()

CPU times: user 5min 21s, sys: 6.19 s, total: 5min 27s
Wall time: 31.7 s


Since all other graphs we are considering have edge types, let's add a proper one also to PubMed:

In [15]:
pubmed.set_inplace_all_edge_types("biolink:Mentions")

This graph is built from the XML data provided by PubMed, but this means it contains many singleton nodes, i.e. articles for which we do not know the citations. We can drop them.

In [16]:
pubmed = pubmed.remove_singleton_nodes()

Let's take a look at its graph report:

In [17]:
pubmed

### Merging ontologies

In this section, we are merging the Human Phenotype Ontology (HPO) and the Mammalian Phenotype Ontology (MPO) using the union operator `or` (`|`). The union operator combines the elements of two sets into a single set, with no duplicate elements. In this case, the resulting set `hp_or_mp` will contain all of the elements from the HPO and the MPO, without any duplicates.

The use of the union operator here is useful because it allows us to combine the two ontologies into a single set, making it easier to work with and analyze the data.

In [18]:
%%time
hp_or_mp = hpo | mpo

CPU times: user 653 ms, sys: 1.84 s, total: 2.49 s
Wall time: 128 ms


Let's check out the graph report:

In [19]:
hp_or_mp

### Overlap between ontologies and KGs

This magic line of code is performing an **intersection operation** between the `kgcovid19` and `hp_or_mp` graphs. The intersection operation is represented by the `and` (`&`) symbol, and it returns a new graph that contains only the nodes and edges that are present in both of the input graphs. In this case, the `kgcovid19` graph represents a knowledge graph that integrates biomedical data related to the COVID-19 pandemic, while the `hp_or_mp` graph represents the union of the Human Phenotype Ontology (HPO) and Mammalian Phenotype Ontology (MPO), which are standardized vocabularies for describing human and mammalian phenotypes, respectively. By performing an intersection operation between these two graphs, we are effectively extracting the portion of the COVID-19 knowledge graph that is related to human and mammalian phenotypes.

In [20]:
%%time
kgcovid19_and_hp_or_mp = kgcovid19 & hp_or_mp

CPU times: user 39 s, sys: 5.51 s, total: 44.5 s
Wall time: 2.05 s


Let's take a look at how this looks:

In [21]:
kgcovid19_and_hp_or_mp

### Merging with a citation graph

The following line of code merges the PubMed citation graph with `kgcovid19_and_hp_or_mp`, the portion of the KGCOVID19 knowledge graph that overlaps with the union of the Human Phenotype Ontology (HPO) and Mammalian Phenotype Ontology (MPO). This is useful because the PubMed graph contains articles that are also present in the other three graphs, and merging them will bring together all of the important topological information into one graph. The resulting graph will contain information about the relationships between articles in the PubMed database, as well as the relationships between the entities shared by the KGCOVID19 knowledge graph and the HPO and MPO ontologies. This will allow researchers to more easily explore and analyze the relationships between these datasets.

In [22]:
%%time
pubmed_or_kgcovid19_and_hp_or_mp = pubmed | kgcovid19_and_hp_or_mp

CPU times: user 17min 46s, sys: 54min 9s, total: 1h 11min 55s
Wall time: 3min 16s
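
The three steps above (union, intersection, union) could also be chained in a single expression, in which case Python's operator precedence matters: `&` binds more tightly than `|`. The sketch below shows this on plain Python sets of made-up edges; the `*_like` names and their contents are hypothetical stand-ins for the real graphs.

```python
# Hypothetical edge sets standing in for the real graphs.
pubmed_like = {("article_1", "article_2")}
kg_like = {("HP:1", "HP:2"), ("gene_1", "HP:1")}
hp_like = {("HP:1", "HP:2")}
mp_like = {("MP:1", "MP:2")}

# Step by step, as done in this tutorial.
ontologies = hp_like | mp_like
overlap = kg_like & ontologies
merged = pubmed_like | overlap

# Equivalent one-liner: the parentheses around the ontology union are required.
one_liner = pubmed_like | kg_like & (hp_like | mp_like)

# Without them, the expression parses as pubmed | (kg & hp) | mp,
# which also pulls in every edge of the second ontology.
unparenthesized = pubmed_like | kg_like & hp_like | mp_like

print(merged == one_liner)        # True
print(merged == unparenthesized)  # False
```

Naming the intermediate results, as this tutorial does, avoids the ambiguity altogether.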


Let's check out its graph report:

In [23]:
pubmed_or_kgcovid19_and_hp_or_mp

### Testing the validity of edge holdouts
In this section, we will be testing the validity of our edge holdouts. Edge holdouts are a common method used in machine learning to evaluate the performance of a model. The goal is to ensure that the training and test sets do not share any edges, as this would compromise the validity of the evaluation. To test the validity of our holdouts, we will use various methods to check if the training and test sets overlap, and if they do, we will investigate the cause. We will also subtract the test set from the complete graph and verify that it is a valid holdout.

First, we create an edge holdout. In this instance, we are using a connected holdout. [You can learn more about holdouts in this previous tutorial.](https://github.com/AnacletoLAB/grape/blob/main/tutorials/Graph_holdouts_using_GRAPE.ipynb)

In [24]:
train, test = homo_sapiens.connected_holdout(train_size=0.8)

By definition, training and test holdouts SHOULD NOT share edges. We can trivially test this by making sure that the intersection is empty:

In [25]:
train & test

We can also verify this easily using:

In [26]:
train.overlaps(test), test.overlaps(train)

(False, False)

We can also subtract the test set from the complete graph, and run the same test:

In [27]:
test.overlaps(homo_sapiens - test), test.overlaps(homo_sapiens)

(False, True)

And if we subtract train and test from the complete graph, we should get an empty graph:

In [28]:
homo_sapiens - train - test
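
The same checks can be written with plain Python sets, which may help clarify what each one verifies. This is a minimal sketch on a toy edge set with a simple random split; it does not reproduce the connected holdout used above, and the names `complete`, `train`, and `test` here refer to toy sets rather than GRAPE graphs.

```python
import random

random.seed(42)

# Toy complete graph on 6 nodes: 15 undirected edges stored as ordered pairs.
complete = {(i, j) for i in range(6) for j in range(i + 1, 6)}

# Simple random 80/20 split of the edges (not a connected holdout).
shuffled = sorted(complete)
random.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train = set(shuffled[:cut])
test = set(shuffled[cut:])

# Train and test must not share edges: their intersection is empty.
print(train & test == set())

# Removing the test edges from the complete graph removes any overlap with test.
print((complete - test).isdisjoint(test))

# The complete graph itself does overlap with the test set.
print(not complete.isdisjoint(test))

# Removing both train and test leaves nothing behind.
print(complete - train - test == set())
```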

## Conclusions
In this tutorial, we explored the use of set theory and set algebra in the context of graphs and graph analysis. We learned about various set theory operations that can be applied to graphs, including union, intersection, and difference. These operations allow us to manipulate and analyze graphs in a more formal and precise way, and they are an important part of the mathematical language used to describe and reason about graphs.

We also saw how these operations can be used to merge and compare different graphs, such as ontologies and knowledge graphs, in order to extract and analyze the relevant information contained within them.

Finally, we learned how to test the validity of edge holdouts using set theory operations. Overall, this tutorial provided an introduction to the use of set theory and set algebra in the context of graph analysis and machine learning, and it demonstrated the power and utility of these concepts for understanding and analyzing complex systems and relationships.

I hope you now understand the applications and benefits of graph set algebra and how to apply it to your projects with [GRAPE](https://github.com/AnacletoLAB/grape). Do feel free to reach out with any questions or feedback, as I am always looking for ways to improve this tutorial.

[And remember to ⭐ GRAPE!](https://github.com/AnacletoLAB/grape)