# Practical: The HALD Knowledge Graph
This section is inspired by Robert Haas's work on biomedical knowledge graphs, which can be found at
https://github.com/robert-haas/awesome-biomedical-knowledge-graphs/tree/main

The data come from [Figshare](https://figshare.com/articles/dataset/HALD_a_human_aging_and_longevity_knowledge_graph_for_precision_gerontology_and_geroscience_analyses/22828196), which provides versions in JSON and CSV. The CSV versions are structured for a particular package that we will not be using, so we will use the JSON files, which are more general.

First we'll create a data directory.

In [None]:
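import os

# Set to True to (re-)download the files; leave False if they are already on disk.
Download = False
datadir = "HALD_Dataset"

# Create the data directory if it does not already exist.
os.makedirs(datadir, exist_ok=True)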

Now we define the list of the files that we want to download. We'll define a *list* of *tuples*, with each tuple representing one of the files that we want to fetch, specifying three things:

* The name that we want the file to be called.
* The URL from where it will be downloaded.
* The MD5 checksum which will allow us to verify the downloaded file's integrity.

In [None]:
# List of files to download
filelist = [
    ("Entity_info.json", "https://figshare.com/ndownloader/files/43612509", '1746cde24a1bac0460f1ccf646608cc9'),
    ("Literature_Info.json", "https://figshare.com/ndownloader/files/43612512", "10b78e8ec30f5b85f2a58d8fe24f056b"),
    ("Longevity_Biomarkers.json", "https://figshare.com/ndownloader/files/43612497", "0dbd9c3f8474dc3cd744ed38af460d75"),
    ("Relation_Info.json", "https://figshare.com/ndownloader/files/43612506", "0c1fa199269adc58f64ad4d5b9fd87b9"),
    ("Aging_Biomarkers.json", "https://figshare.com/ndownloader/files/43612503", "abd0eb6cb7295ae500c5d676b7797324")
]

Now we can download the files. For each file in `filelist` we will:

* Download the file from the URL.
* If the download request indicates that the download is unsuccessful, print an error.
* If the download is successful, verify the checksum and, if it is correct, write the file to disk in `datadir`.

In [None]:
import hashlib
import os

import requests

if Download:
    for name, url, expected_md5 in filelist:
        response = requests.get(url)
        file_path = os.path.join(datadir, name)
        if response.status_code != 200:
            print(f"Failed to download file {name} from {url}")
            continue
        # Verify the checksum before writing anything to disk.
        checksum = hashlib.md5(response.content).hexdigest()
        if checksum == expected_md5:
            with open(file_path, "wb") as file:
                file.write(response.content)
            print(f"SUCCESS: File {name} downloaded from {url} with correct checksum {expected_md5}")
        else:
            print(f"ERROR: File {name} downloaded from {url} with incorrect checksum {checksum} (should be {expected_md5})")
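
If the files are already on disk (for example, when `Download` is `False`), the same MD5 check can be applied to the saved copies. The following is a minimal sketch; `md5_of_file` is a hypothetical helper that is not part of the original notebook.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For each `(name, url, expected_md5)` tuple in `filelist`, comparing `md5_of_file(os.path.join(datadir, name))` with `expected_md5` would then confirm that the local copy is intact.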


## What does the data look like?

Let us inspect these files. The two key files here are those containing the *entities* (nodes) and the *edges* (relations).

In [None]:
import json

def load_json(fname):
    with open(fname, 'rb') as file:
        return json.load(file)

nodes = load_json(f"{datadir}/{filelist[0][0]}")
edges = load_json(f"{datadir}/{filelist[3][0]}")

What are these new data structures?

In [None]:
# Placeholder

How big are they?

In [None]:
# Placeholder

Several nodes in the dataset appear to cause problems for Owlready2. We need to remove these nodes and any edges that refer to them.

In [None]:
# Placeholder

In [None]:
# Placeholder
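
One way to remove a set of nodes together with their edges is sketched below on toy data. Everything here is an assumption made for illustration: the key spellings (`"source entity"`, `"target entity"`), the layout of `edges` as a list of dictionaries, and the contents of `bad_nodes`, since the problematic nodes are not named in this section.

```python
# Toy stand-ins for the loaded data; the real files are much larger.
nodes = {
    "AHR": [{"entity": "AHR", "type": "Gene"}],
    "bad node": [{"entity": "bad node", "type": "Gene"}],
    "aging": [{"entity": "aging", "type": "Disease"}],
}
edges = [
    {"source entity": "AHR", "target entity": "aging"},
    {"source entity": "bad node", "target entity": "aging"},
]

# Hypothetical set of node keys to drop.
bad_nodes = {"bad node"}

# Keep only the nodes that are not in the bad set.
clean_nodes = {key: value for key, value in nodes.items() if key not in bad_nodes}

# Keep only the edges whose two endpoints both survive.
clean_edges = [
    edge
    for edge in edges
    if edge["source entity"] not in bad_nodes
    and edge["target entity"] not in bad_nodes
]

print(len(clean_nodes), len(clean_edges))
```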

Let's look at each of these in more detail. First, let's pick one of the nodes. Let's print the keys and choose one:

In [None]:
# Placeholder

Pick an arbitrary node key, say `AHR`:

In [None]:
# Placeholder

So each node is a list of length 1. What's in that?

In [None]:
# Placeholder

Another dictionary. Let's get the keys:

In [None]:
# Placeholder

In [None]:
# Placeholder

So what does this tell us? These are *annotations* of the node that tell us:
* The name of the **entity** (node)
* What the **type** of the node is (a gene, in this case)
* The **official full name** and **alias names** of the entity.
* A **URL** to the official record of the entity.
* A **description** of the entity.

Some entries tell us about the research papers that include information about the entity:

* The **number of articles** that mention this entity
* The **PMID** (PubMed identifier) of each article that was used to get information about the entity. You can enter these numbers at [https://pubmed.ncbi.nlm.nih.gov](https://pubmed.ncbi.nlm.nih.gov) to get the papers themselves.
* A **sentence** containing the entity name from each of the articles.
* The **JT** (journal title) and **TA** (journal title abbreviation) of each article.
* The **IF** (impact factor) and **IF5** (five-year impact factor) of each of the journals.
* The **year** and **date** each of the articles was published.
* Information about mutations: **mutation position** and **mutation alleles**.
* Some **external links** and the **MeSH ID** for the Medical Subject Headings database.
* Whether the entity is an **aging biomarker** or a **longevity biomarker**.
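
The nesting described above (a dictionary whose values are one-element lists, each holding a dictionary of annotations) can be sketched with a toy record. The key spellings `"entity"`, `"type"` and `"description"` are assumptions; the real file should be checked for the exact names.

```python
# A toy node in the same shape as the walk-through: key -> [annotation dict].
nodes = {
    "AHR": [
        {
            "entity": "AHR",
            "type": "Gene",
            "description": "toy description",
        }
    ]
}

entry = nodes["AHR"]
print(type(entry).__name__, len(entry))  # the value is a list holding one item

record = entry[0]
print(sorted(record.keys()))  # the annotation names
print(record["type"])
```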

Let's now look at one of the edges:

In [None]:
# Placeholder

Perhaps the two key properties of the edge are the

* **source entity** and **target entity** which specify the *entity* property of the nodes that the edge connects. Let us check that these exist and see what they are:

In [None]:
# Placeholder

In [None]:
# Placeholder

The edge also has a

* **relationship**, which specifies how the two nodes are related.
* **source** and **target** attributes, which are alternative names for the entities and which we will not use.
* **source type** and **target type**, which refer to the *type* attribute of the source and target nodes.
* A range of attributes related to the publications in which the relationship modelled by the edge is described (**PMID**, **DP**, **TI**, **TA**, **IF**, **IF5**).
* A list of **method**s, which we will not use.
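
Checking that the endpoints of an edge exist among the nodes can be sketched as follows. Because the edge refers to the *entity* property of a node, the sketch first builds an index keyed by that property. The toy data and key spellings are assumptions.

```python
# Toy nodes and one toy edge.
nodes = {
    "AHR": [{"entity": "AHR", "type": "Gene"}],
    "aging": [{"entity": "aging", "type": "Disease"}],
}
edge = {
    "source entity": "AHR",
    "target entity": "aging",
    "relationship": "associated with",
}

# Index the annotation dictionaries by their "entity" property.
by_entity = {entry[0]["entity"]: entry[0] for entry in nodes.values()}

# Report whether each endpoint of the edge is a known entity.
found = {}
for role in ("source entity", "target entity"):
    name = edge[role]
    found[role] = name in by_entity
    print(role, name, found[role])
```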

## A deeper dive into the data

Let's now take a deeper dive into the data. We should easily be able to find out which types of entity and which types of relation exist, and how many of each there are. This requires a full pass through the data. The edges contain all of the information we need, so we can simply iterate through those.



In [None]:
# Placeholder

How many of each type are there?

In [None]:
# Placeholder
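
A single pass over the edges is enough to collect the type of every entity that appears in at least one edge, and then to count how many entities there are of each type. This sketch uses toy edges; the key spellings, the type names and the list-of-dictionaries layout are assumptions.

```python
from collections import Counter

# Toy edges carrying the entity names and their types.
edges = [
    {"source entity": "AHR", "source type": "Gene",
     "target entity": "aging", "target type": "Disease"},
    {"source entity": "TP53", "source type": "Gene",
     "target entity": "aging", "target type": "Disease"},
    {"source entity": "metformin", "source type": "Pharmaceutical",
     "target entity": "AHR", "target type": "Gene"},
]

# Map each entity to its type; repeated entities simply overwrite themselves.
entity_type = {}
for edge in edges:
    entity_type[edge["source entity"]] = edge["source type"]
    entity_type[edge["target entity"]] = edge["target type"]

# Count distinct entities per type.
type_counts = Counter(entity_type.values())
print(type_counts)
```

Going through a dictionary first means each entity is counted once, however many edges it takes part in.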

Let's now repeat for the edges:

In [None]:
# Placeholder

How many of each type of relation?

In [None]:
# Placeholder

Let's plot some graphs of these counts:

In [None]:
# Placeholder

In [None]:
# Placeholder

It is also interesting to look at the number of times each type of node is connected to each other type of node.

In [None]:
# Placeholder
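
Counting how often each node type is connected to each other node type amounts to counting `(source type, target type)` pairs over the edges. This is a minimal sketch on toy edges, with assumed key spellings and type names.

```python
from collections import Counter

# Toy edges; only the two type fields matter here.
edges = [
    {"source type": "Gene", "target type": "Disease"},
    {"source type": "Gene", "target type": "Disease"},
    {"source type": "Pharmaceutical", "target type": "Gene"},
]

# Count each ordered (source type, target type) pair.
pair_counts = Counter(
    (edge["source type"], edge["target type"]) for edge in edges
)

# Print the pairs from most to least frequent.
for (source_type, target_type), count in pair_counts.most_common():
    print(f"{source_type} -> {target_type}: {count}")
```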

## Creating the Ontology

Now we are in a position to create the ontology. A major challenge here is that we need to do this dynamically, because the class names come from the data rather than being written into the code. Our approach will be to store the classes in a dictionary (`dict`), from which they can be retrieved by name.

We will first create the entities.

In [None]:
# Placeholder
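
The idea of creating classes at run time and keeping them in a dictionary can be sketched with the standard library alone. Here `Entity` is a plain Python stand-in base class and the type names are toy values. With Owlready2 the base would typically be `Thing`, and the classes would be created inside a `with onto:` block, but that part is not shown and should be checked against the Owlready2 documentation.

```python
import types


class Entity:
    """Stand-in base class used only for this sketch."""


# Toy type names, as they might be discovered from the data.
entity_types = ["Gene", "Disease", "Pharmaceutical Preparations"]

classes = {}
for type_name in entity_types:
    # Replace spaces so that the class name is a valid identifier.
    class_name = type_name.replace(" ", "_")
    classes[type_name] = types.new_class(class_name, (Entity,))

# The classes can now be looked up by their original type name.
gene = classes["Gene"]()
print(type(gene).__name__, isinstance(gene, Entity))
```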

Now we create the relationships. For this, we need to determine the domain and range for each type of relation:

In [None]:
# Placeholder
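
One way to determine the domain and range of each relation is to collect, for every relation name, the set of source types (domain) and the set of target types (range) seen in the edges. This is a sketch on toy edges, with assumed key spellings and relation names.

```python
from collections import defaultdict

# Toy edges with a relation name and the types of both endpoints.
edges = [
    {"relationship": "associated with",
     "source type": "Gene", "target type": "Disease"},
    {"relationship": "associated with",
     "source type": "Pharmaceutical", "target type": "Disease"},
    {"relationship": "regulates",
     "source type": "Gene", "target type": "Gene"},
]

domains = defaultdict(set)  # relation -> set of source types
ranges = defaultdict(set)   # relation -> set of target types
for edge in edges:
    domains[edge["relationship"]].add(edge["source type"])
    ranges[edge["relationship"]].add(edge["target type"])

for relation in sorted(domains):
    print(relation, sorted(domains[relation]), sorted(ranges[relation]))
```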

Now that we have this, we can create the relations:

In [None]:
# Placeholder

Check that this looks sensible against the ones printed out above:

In [None]:
# Placeholder

Let's now populate the graph. Start by adding the entities:

In [None]:
# Placeholder

Now add the edges:

In [None]:
# Placeholder

Get the edges for one node.

In [None]:
# Placeholder

Which node has the most connections from it?
Which node has the most connections to it?

In [None]:
# Placeholder

In [None]:
# Placeholder
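
Independently of the ontology, both questions can be answered directly from the edge list by counting how often each entity appears as a source (outgoing connections) and as a target (incoming connections). This is a minimal sketch on toy edges with assumed key spellings.

```python
from collections import Counter

# Toy edges; only the endpoint names matter here.
edges = [
    {"source entity": "AHR", "target entity": "aging"},
    {"source entity": "AHR", "target entity": "cancer"},
    {"source entity": "TP53", "target entity": "aging"},
]

# Outgoing connections per node and incoming connections per node.
out_degree = Counter(edge["source entity"] for edge in edges)
in_degree = Counter(edge["target entity"] for edge in edges)

top_source = out_degree.most_common(1)[0]
top_target = in_degree.most_common(1)[0]
print("Most outgoing:", top_source)
print("Most incoming:", top_target)
```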