This section was inspired by Robert Haas's work on biomedical knowledge graphs, which can be found at
https://github.com/robert-haas/awesome-biomedical-knowledge-graphs/tree/main

The data comes from the dataset page on [Figshare](https://figshare.com/articles/dataset/HALD_a_human_aging_and_longevity_knowledge_graph_for_precision_gerontology_and_geroscience_analyses/22828196). It includes versions in JSON and CSV. The CSV versions are structured for a particular package that we will not be using, so we will use the JSON versions, which are more general.

First we'll create a data directory.

In [1]:
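import os

# Directory that will hold the downloaded HALD files
datadir = "HALD_Dataset"
# Create it if needed; exist_ok avoids an error when it already exists
os.makedirs(datadir, exist_ok=True)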

Now we define the list of the files that we want to download. We'll define a *list* of *tuples*, with each tuple representing one of the files that we want to fetch, specifying three things:

* The name that we want the file to be called.
* The URL from where it will be downloaded.
* The MD5 checksum which will allow us to verify the downloaded file's integrity.

In [2]:
# List of files to download
filelist = [
    ("Entity_info.json", "https://figshare.com/ndownloader/files/43612509", '1746cde24a1bac0460f1ccf646608cc9'),
    ("Literature_Info.json", "https://figshare.com/ndownloader/files/43612512", "10b78e8ec30f5b85f2a58d8fe24f056b"),
    ("Longevity_Biomarkers.json", "https://figshare.com/ndownloader/files/43612497", "0dbd9c3f8474dc3cd744ed38af460d75"),
    ("Relation_Info.json", "https://figshare.com/ndownloader/files/43612506", "0c1fa199269adc58f64ad4d5b9fd87b9"),
    ("Aging_Biomarkers.json", "https://figshare.com/ndownloader/files/43612503", "abd0eb6cb7295ae500c5d676b7797324")
]

Now we can download the files. For each file in `filelist` we will:

* Download the file from the URL.
* If the download request indicates that the download is unsuccessful, print an error.
* If the download is successful, verify the checksum and, if it is correct, write the file to disk in `datadir`.

In [None]:
import requests
import hashlib

for name, url, expected_md5 in filelist:
    response = requests.get(url)
    file_path = os.path.join(datadir, name)
    if response.status_code != 200:
        print(f"Failed to download file {name} from {url}")
    else:
        # Hash the downloaded bytes and compare with the expected checksum
        actual_md5 = hashlib.md5(response.content).hexdigest()
        if actual_md5 == expected_md5:
            # Only write the file to disk once its integrity is confirmed
            with open(file_path, 'wb') as file:
                file.write(response.content)
            print(f"SUCCESS: File {name} downloaded from {url} with correct checksum {expected_md5}")
        else:
            print(f"ERROR: File {name} downloaded from {url} with incorrect checksum {actual_md5} (should be {expected_md5})")
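
Because the loop above fetches every file each time the cell runs, it can be useful to check whether a verified copy is already on disk. The following is a minimal sketch and not part of the original workflow: `md5_of_file` and `is_verified` are hypothetical helper names, and the chunk size is an arbitrary choice.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_verified(path, expected_md5):
    """True if the file exists and its MD5 matches the expected checksum."""
    return os.path.exists(path) and md5_of_file(path) == expected_md5
```

With these helpers, the download loop could skip any entry of `filelist` for which `is_verified` already returns `True` for the target path in `datadir` and the listed checksum.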


Let us inspect these files. The two key files here are those containing the *entities* (nodes) and the *edges* (relations).

In [3]:
import json

def load_json(fname):
    with open(fname, 'rb') as file:
        return json.load(file)

nodes = load_json(f"{datadir}/{filelist[0][0]}")
edges = load_json(f"{datadir}/{filelist[3][0]}")

In [4]:
print(len(nodes))

12257
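
The count alone does not show how the entries are organized. As a minimal sketch, a generic helper can report the top-level container type and one sample entry without assuming anything about the HALD schema. `peek` is a hypothetical name, and the truncation length is arbitrary.

```python
import json


def peek(obj, max_chars=300):
    """Summarize a loaded JSON object: container type, size, and one sample entry."""
    if isinstance(obj, dict):
        # Take the first key/value pair as the sample
        key = next(iter(obj), None)
        sample = None if key is None else {key: obj[key]}
    elif isinstance(obj, list):
        sample = obj[0] if obj else None
    else:
        sample = obj
    size = len(obj) if isinstance(obj, (dict, list)) else 1
    # Truncate the serialized sample so large entries stay readable
    text = json.dumps(sample, ensure_ascii=False, default=str)[:max_chars]
    return f"{type(obj).__name__} with {size} entries; sample: {text}"
```

Calling `print(peek(nodes))` and `print(peek(edges))` would then show what a single entity and a single relation look like before any further processing.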
