In [None]:
!pip install graphdatascience

# Analyzing the evolution of life on Earth with Neo4j
Explore the NCBI taxonomy of organisms in a graph database
The evolution of life is a beautiful and insightful field of study that traces our origins back to the beginning of life. It helps us understand where we came from and where we are potentially going. The relationships between species are often depicted in the tree of life, which is a model used to describe relationships between various species. Since a tree structure is a form of a graph, it makes sense to store those relationships in a graph database to be analyzed and visualized.
In this blog post, I have decided to import the NCBI taxonomy of organisms into Neo4j, a graph database, where we can easily traverse and analyze relationships between various species.
# Environment and dataset setup
To follow the code examples in this post, you will need to download the Neo4j Desktop application. I have prepared a [database dump](https://drive.google.com/file/d/1-TNOU3KKEaDH6AtXJRQxzy8yRimx41Zt/view?usp=sharing) that you can use to get the Neo4j database up and running without having to import the dataset yourself. Take a look at my [previous blog post](https://tbgraph.wordpress.com/2020/11/11/dump-and-load-a-database-in-neo4j-desktop/) if you need some help with restoring the database dump.

The original dataset is available on the NCBI website. I have used the new tax dump folder downloaded on 13th June 2022 to create the above database dump. While no explicit license is specified for the dataset, the NCBI website states that all information is available within the public domain.
I have made available the code used to import the taxonomy into Neo4j on my GitHub if you want to evaluate the process or make any changes.
# Graph schema
I have imported the following files into Neo4j:
* nodes.dmp
* names.dmp
* host.dmp
* citations.dmp

Some other files contain redundant information that is already present in the nodes.dmp file, which holds the taxonomy of organisms. I also looked at the genetic code files, but since I have no idea what to do with the genetic code names and their translations, I skipped them during import.

I have added a generic label Node to all nodes present in the nodes.dmp file. The nodes with the generic label Node contain multiple properties that can be used to import other files and help experts better analyze the dataset. For us, only the name property will be relevant. The taxonomy hierarchy is represented with the PARENT relationship between nodes. The dataset also contains a file that describes potential hosts of various species, which are represented as POTENTIAL_HOST relationships. Lastly, some of the nodes are mentioned in various medical sources, which are represented as the Citation nodes.
All the nodes with the generic label Node have a secondary label that describes their rank. Some examples of ranks are Species, Family, and Genus.
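
To make the schema a bit more concrete, here is a minimal sketch of how one line of nodes.dmp could be turned into a node with a rank label. It is not the import code I actually used. The dump separates fields with a tab, a pipe, and a tab, and the first three columns are the taxon id, the parent taxon id, and the rank. The `TaxNode` type, the `rank_to_label` helper, and its capitalization rule are assumptions made for illustration, and the sample line is shortened.

```python
from typing import NamedTuple


class TaxNode(NamedTuple):
    tax_id: int
    parent_tax_id: int
    rank: str


def parse_nodes_line(line: str) -> TaxNode:
    """Parse one line of nodes.dmp, keeping only the first three columns."""
    # Splitting on the pipe and stripping whitespace is enough here,
    # because the first three columns never contain a pipe themselves.
    fields = [field.strip() for field in line.split("|")]
    return TaxNode(int(fields[0]), int(fields[1]), fields[2])


def rank_to_label(rank: str) -> str:
    """Hypothetical rule for deriving the secondary label from the rank."""
    # Assumption: capitalize every word and drop the spaces,
    # so "species group" would become "SpeciesGroup".
    return "".join(word.capitalize() for word in rank.split())


# Illustrative line, shortened to four columns (real lines carry more).
sample = "9606\t|\t9605\t|\tspecies\t|\tHS\t|\n"
node = parse_nodes_line(sample)
print(node, rank_to_label(node.rank))
```

The parent taxon id is what ends up as the PARENT relationship, and the rank is what ends up as the secondary label.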

In [1]:
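from graphdatascience import GraphDataScience

# Connection details of the Neo4j instance that holds the taxonomy graph
host = "bolt://44.193.28.203:7687"
user = "neo4j"
password = "combatants-coordinates-tugs"
# Client object used to run all Cypher statements in this post
gds = GraphDataScience(host, auth=(user, password))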

# Exploratory analysis

I looked for the Homo sapiens species in the dataset but couldn't find it under that name. Interestingly, the folks at NCBI decided to name our species simply human. We can examine its taxonomy neighborhood up to three hops away with the following Cypher statement:

In [3]:
gds.run_cypher("""
MATCH p=(n:Node {name:"human"})-[:PARENT*..3]-()
RETURN [n in nodes(p) | n.name] AS result
""")

Unnamed: 0,result
0,"[human, Neanderthal man]"
1,"[human, Homo sp. Altai]"
2,"[human, humans]"
3,"[human, humans, unclassified Homo]"
4,"[human, humans, unclassified Homo, Homo sp.]"
5,"[human, humans, environmental samples]"
6,"[human, humans, environmental samples, Homo sa..."
7,"[human, humans, Homo heidelbergensis]"
8,"[human, humans, Homo/Pan/Gorilla group]"
9,"[human, humans, Homo/Pan/Gorilla group, Pongidae]"


So, the human node is a species that belongs to the humans genus, which is part of the Pongidae family. After a quick Google search, it seems that the Pongidae taxon is obsolete and Hominidae should be used instead; in the NCBI taxonomy, the family sits under the Hominoidea superfamily. Interestingly, the human species has two subspecies, namely Neanderthals and Denisovans, the latter represented by the Homo sp. Altai node. I just learned something new about our history.

The NCBI taxonomy dataset contains only about 10% of the described species of life on the planet, so don't be surprised if some species are missing from the dataset.

Let's examine how many species there are in the dataset with the following Cypher statement:

In [4]:
gds.run_cypher("""
MATCH (s:Species)
RETURN count(s) AS speciesCount
""")

Unnamed: 0,speciesCount
0,1981376


There are almost two million species described in the dataset, which means there is plenty of room to explore.
Next, we can examine the taxonomy hierarchy for the human species all the way to the root of the tree using a simple query:

In [10]:
gds.run_cypher("""
MATCH p=(:Node {name:'human'})-[:PARENT*0..]->(parent)
RETURN parent.name AS lineage, labels(parent)[1] AS rank
""")

Unnamed: 0,lineage,rank
0,human,Species
1,humans,Genus
2,Homo/Pan/Gorilla group,Subfamily
3,Pongidae,Family
4,Hominoidea,Superfamily
5,Catarrhini,Parvorder
6,Simiiformes,Infraorder
7,Haplorrhini,Suborder
8,Primates,Order
9,Euarchontoglires,Superorder


It seems that 31 traversals are needed to get from the human node to the root node. The root node has a self-loop (a relationship with itself), most likely because the source file lists the root taxon as its own parent, and that's why it shows up twice in the results. In addition, a clade, a group of organisms that have evolved from a common ancestor, shows up multiple times in the hierarchy. It looks like the NCBI taxonomy is richer than what you would find with a quick Google search.

Graph databases like Neo4j are also great at finding shortest paths between nodes in the graph. Now, we can answer a critical question: how close are oranges to bananas in the taxonomy?
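
For intuition, the same walk to the root can be sketched in plain Python over a dictionary of parent pointers. This is only a toy model of what the Cypher pattern does, with a handful of names taken from the result above and a hypothetical `toy_parent` mapping. Unlike the variable-length pattern, which is allowed to traverse the self-loop once and therefore returns the root twice, this loop stops as soon as a node is its own parent.

```python
from typing import Dict, List


def lineage(parent: Dict[str, str], start: str) -> List[str]:
    """Follow parent pointers from `start` until the root is reached."""
    path = [start]
    # The root is recognised by being its own parent (the self-loop).
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


# Toy excerpt of the hierarchy. The jump from Hominoidea straight to the
# root is a simplification that skips all the intermediate taxa.
toy_parent = {
    "human": "humans",
    "humans": "Homo/Pan/Gorilla group",
    "Homo/Pan/Gorilla group": "Pongidae",
    "Pongidae": "Hominoidea",
    "Hominoidea": "root",
    "root": "root",
}

for depth, name in enumerate(lineage(toy_parent, "human")):
    print(depth, name)
```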

In [12]:
gds.run_cypher("""
MATCH (h:Node {name:'Valencia orange'}), (g:Node {name:'sweet banana'})
MATCH p=shortestPath( (h)-[:PARENT*]-(g))
RETURN [n in nodes(p) | {name: n.name, rank: labels(n)[1]}] AS path
""")['path'][0]

[{'name': 'Valencia orange', 'rank': 'Species'},
 {'name': 'Microcitrus Swingle', 'rank': 'Genus'},
 {'name': 'Aurantioideae', 'rank': 'Subfamily'},
 {'name': 'Rutaceae', 'rank': 'Family'},
 {'name': 'Sapindales', 'rank': 'Order'},
 {'name': 'malvids', 'rank': 'Clade'},
 {'name': 'rosids', 'rank': 'Clade'},
 {'name': 'Pentapetalae', 'rank': 'Clade'},
 {'name': 'Gunneridae', 'rank': 'Clade'},
 {'name': 'eudicotyledons', 'rank': 'Clade'},
 {'name': 'Mesangiospermae', 'rank': 'Clade'},
 {'name': 'monocotyledons', 'rank': 'Clade'},
 {'name': 'Petrosaviidae S.W.Graham & W.S.Judd, 2007', 'rank': 'Subclass'},
 {'name': 'Commeliniflorae', 'rank': 'Clade'},
 {'name': 'Zingiberiflorae', 'rank': 'Order'},
 {'name': 'Musaceae', 'rank': 'Family'},
 {'name': 'Musa', 'rank': 'Genus'},
 {'name': 'sweet banana', 'rank': 'Species'}]

It seems that the closest common ancestor of the sweet banana and the Valencia orange is the Mesangiospermae clade. Mesangiospermae is a clade of flowering plants.

Another use case for traversing relationships could be finding all the taxa in the same family as a particular species. Here, we will list all the genera in the same family as the sweet banana.
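
In a tree, the shortest path between two nodes climbs from one node to their closest common ancestor and then descends to the other, which is why the ancestor shows up as the turning point of the path. Here is a small sketch of that idea in plain Python, using only the names from the path returned above. The helper functions and the `parent` dictionary are made up for illustration and are not how Neo4j computes shortest paths internally.

```python
from typing import Dict, List, Optional


def ancestors(parent: Dict[str, str], node: str) -> List[str]:
    """Return the node followed by all of its ancestors, closest first."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def closest_common_ancestor(parent: Dict[str, str], a: str, b: str) -> Optional[str]:
    """Return the first ancestor of `b` that is also an ancestor of `a`."""
    seen = set(ancestors(parent, a))
    for node in ancestors(parent, b):
        if node in seen:
            return node
    return None


# Both sides of the path returned by the query, read from leaf to ancestor.
orange_side = ["Valencia orange", "Microcitrus Swingle", "Aurantioideae",
               "Rutaceae", "Sapindales", "malvids", "rosids", "Pentapetalae",
               "Gunneridae", "eudicotyledons", "Mesangiospermae"]
banana_side = ["sweet banana", "Musa", "Musaceae", "Zingiberiflorae",
               "Commeliniflorae", "Petrosaviidae S.W.Graham & W.S.Judd, 2007",
               "monocotyledons", "Mesangiospermae"]

# Build child -> parent pointers from consecutive pairs in each chain.
parent: Dict[str, str] = {}
for chain in (orange_side, banana_side):
    for child, par in zip(chain, chain[1:]):
        parent[child] = par

lca = closest_common_ancestor(parent, "Valencia orange", "sweet banana")
hops = (ancestors(parent, "Valencia orange").index(lca)
        + ancestors(parent, "sweet banana").index(lca))
print(lca, hops)
```

The hop count is the number of PARENT relationships on the path, so it is one less than the number of nodes listed in the query result.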

In [23]:
gds.run_cypher("""
MATCH (:Node {name:'sweet banana'})-[:PARENT*0..]->(f:Family)
MATCH (f)<-[:PARENT*]-(s:Genus)
RETURN s.name AS genus
""")

Unnamed: 0,genus
0,Musella
1,Ensete
2,Musa


Sweet banana belongs to the Musa genus and Musaceae family. Interestingly, there is a Musella genus, which sounds like a small Musa. In fact, after googling the Musella genus, it looks like only a single species is present in the Musella genus. The species is commonly referred to as the Chinese dwarf banana.
# Inference with Neo4j
In the last example, we will look at how to develop inference queries in Neo4j. Inference means we create new relationships based on a set of rules between nodes and either store them in the database or use them at query-time only. Here, I will show you an example of inference queries using new relationships only at query-time when analyzing potential hosts.
First, we will evaluate which organisms have described potential parasites in the dataset.

In [27]:
gds.run_cypher("""
MATCH (n:Node)
RETURN n.name AS organism,
       labels(n)[1] AS rank,
       count{ (n)<-[:POTENTIAL_HOST]-() } AS potentialParasites
ORDER BY potentialParasites DESC
LIMIT 5
""")

Unnamed: 0,organism,rank,potentialParasites
0,vertebrates,Clade,175285
1,human,Species,169891
2,plants,Clade,51
3,Azorhizobium,Genus,0
4,primary endosymbiont of Schizaphis graminum,Species,0


It seems that only three nodes have any described potential parasites, and human is the only species among them; the other two, vertebrates and plants, are clades. I would venture a guess that most, if not all, of the potential parasites for humans are also potential parasites for vertebrates, since the counts are so close.

We can check how many potential hosts each organism has with the following Cypher statement.

In [28]:
gds.run_cypher("""
MATCH (n:Node)
WHERE EXISTS { (n)-[:POTENTIAL_HOST]->()}
WITH count{ (n)-[:POTENTIAL_HOST]->() } AS ph
RETURN ph, count(*) AS count
ORDER BY ph
""")

Unnamed: 0,ph,count
0,1,18359
1,2,163434


18359 organisms have only one known host, while 163434 have two known hosts. Only three nodes act as hosts, and plants account for just 51 of the relationships, so nearly all of the two-host organisms must list both human and vertebrates. This supports my hypothesis that most parasites that attack humans also potentially attack all vertebrates.

Here is where inference queries come into play. We know that vertebrates is a higher-level taxon in the taxonomy of organisms. Therefore, we can traverse from vertebrates down to the species level to examine which species could potentially be used as hosts.

We will use the example of the Monkeypox virus, as it is relevant at the time of writing. First, we can evaluate its potential hosts.
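
The numbers from the last two results can be cross-checked with a bit of arithmetic. The sketch below only reuses the counts printed above; the variable names are mine. Every POTENTIAL_HOST relationship has exactly one start and one end node, so the totals counted from both sides should agree, and the number of organisms that list both human and vertebrates is at least the number of two-host organisms minus the 51 relationships that point to plants.

```python
# Incoming POTENTIAL_HOST counts per host, copied from the first result.
incoming = {"vertebrates": 175285, "human": 169891, "plants": 51}
# Number of organisms per count of outgoing relationships, from the second.
outgoing_histogram = {1: 18359, 2: 163434}

total_in = sum(incoming.values())
total_out = sum(hosts * organisms for hosts, organisms in outgoing_histogram.items())

# Lower bound: a two-host organism that does not point to plants
# must point to both human and vertebrates.
min_both = outgoing_histogram[2] - incoming["plants"]
share_of_human_parasites = min_both / incoming["human"]

print(total_in, total_out, min_both, round(share_of_human_parasites, 3))
```

Both totals match, and at least roughly 96% of the organisms that list human as a potential host also list vertebrates.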

In [29]:
gds.run_cypher("""
MATCH (n: Node {name:"Monkeypox virus"})-[:POTENTIAL_HOST]->(host)
RETURN host.name AS host
""")

Unnamed: 0,host
0,human
1,vertebrates


Notice that both human and vertebrates are described as potential hosts of Monkeypox virus. However, let's say we want to examine all the species that are potentially endangered by the virus.

In [32]:
gds.run_cypher("""
MATCH (n: Node {name:"Monkeypox virus"})-[:POTENTIAL_HOST]->()<-[:PARENT*0..]-(host:Species)
RETURN host.name AS host
LIMIT 10
""")

Unnamed: 0,host
0,human
1,Neoceratodus forsteri
2,Lepidosireniformes sp. BOLD:AAL6055
3,Protopterus sp. NBE-2020
4,Protopterus sp. LMN-2018
5,Protopterus sp. BAFEN289-10
6,Protopterus sp. BOLD:AAL6244
7,Protopterus sp. DRV-2007
8,Protopterus sp. IMCB-2001
9,Protopterus sp.


We have used a limit as there are a lot of vertebrates. Unfortunately, we don't know which of them are extinct as that would help us filter them out and identify only potential victims of the Monkeypox virus that are still alive. However, it is still an excellent example of inference in Neo4j, where we create or infer a new relationship based on the predefined set of rules at query time.
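
The rule behind this inference is simple: if a taxon is a potential host, then every species below it in the hierarchy is a potential host as well. As a rough sketch of the same rule outside of Neo4j, here is a plain Python version over a tiny, made-up excerpt of the graph. The function name, the dictionaries, and the shortcut that attaches species directly to vertebrates or plants are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def infer_host_species(declared_hosts: Iterable[str],
                       parent: Dict[str, str],
                       rank: Dict[str, str]) -> Set[str]:
    """Collect every species at or below any of the declared hosts."""
    # Invert the child -> parent pointers so we can walk downwards.
    children: Dict[str, List[str]] = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)

    inferred: Set[str] = set()
    stack = list(declared_hosts)
    while stack:
        node = stack.pop()
        if rank.get(node) == "Species":
            inferred.add(node)
        stack.extend(children[node])
    return inferred


# Toy data: the real hierarchy has many levels between these nodes.
parent = {
    "human": "humans",
    "humans": "vertebrates",
    "Neoceratodus forsteri": "vertebrates",
    "sweet banana": "plants",
}
rank = {
    "vertebrates": "Clade",
    "plants": "Clade",
    "humans": "Genus",
    "human": "Species",
    "Neoceratodus forsteri": "Species",
    "sweet banana": "Species",
}

hosts = infer_host_species(["human", "vertebrates"], parent, rank)
print(sorted(hosts))
```

Collecting the results in a set also removes duplicates. In this toy example, human is reached both directly and through vertebrates, so a Cypher version of the rule may need `DISTINCT` to get the same effect.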