This section is inspired by Robert Haas's work on biomedical knowledge graphs, which can be found at
https://github.com/robert-haas/awesome-biomedical-knowledge-graphs/tree/main

The source of the data is the dataset page on [Figshare](https://figshare.com/articles/dataset/HALD_a_human_aging_and_longevity_knowledge_graph_for_precision_gerontology_and_geroscience_analyses/22828196), which includes versions in JSON and CSV. The CSV versions are structured for a particular package that we will not be using, so we will use the JSON versions, which are more general.

First we'll create a data directory.

In [1]:
import os

datadir = "HALD_Dataset"
# Create the data directory if it does not already exist
os.makedirs(datadir, exist_ok=True)

Now we define the list of the files that we want to download. We'll define a *list* of *tuples*, with each tuple representing one of the files that we want to fetch, specifying three things:

* The name that we want the file to be called.
* The URL from where it will be downloaded.
* The MD5 checksum which will allow us to verify the downloaded file's integrity.

In [2]:
# List of files to download
filelist = [
    ("Entity_info.json", "https://figshare.com/ndownloader/files/43612509", '1746cde24a1bac0460f1ccf646608cc9'),
    ("Literature_Info.json", "https://figshare.com/ndownloader/files/43612512", "10b78e8ec30f5b85f2a58d8fe24f056b"),
    ("Longevity_Biomarkers.json", "https://figshare.com/ndownloader/files/43612497", "0dbd9c3f8474dc3cd744ed38af460d75"),
    ("Relation_Info.json", "https://figshare.com/ndownloader/files/43612506", "0c1fa199269adc58f64ad4d5b9fd87b9"),
    ("Aging_Biomarkers.json", "https://figshare.com/ndownloader/files/43612503", "abd0eb6cb7295ae500c5d676b7797324")
]
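As an optional alternative to indexing each tuple by position, the same three fields can be given names. The following is a minimal sketch using `typing.NamedTuple`; the class name `RemoteFile` and its field names are assumptions made for illustration and are not used elsewhere in this notebook.

```python
from typing import NamedTuple


class RemoteFile(NamedTuple):
    """One file to fetch: local name, download URL, expected MD5 checksum."""
    name: str
    url: str
    md5: str


# Same values as the first entry of `filelist`
entity_file = RemoteFile(
    "Entity_info.json",
    "https://figshare.com/ndownloader/files/43612509",
    "1746cde24a1bac0460f1ccf646608cc9",
)

# Fields can be read by name ...
print(entity_file.name)
# ... while positional indexing and unpacking still work as with a plain tuple
name, url, md5 = entity_file
print(entity_file[2] == md5)
```

Because a named tuple is still a tuple, code that reads `f[0]`, `f[1]` and `f[2]` would keep working unchanged.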

Now we can download the files. For each file in `filelist` we will:

* Download the file from the URL.
* If the download request indicates that the download is unsuccessful, print an error.
* If the download is successful, verify the checksum and, if that is correct, write the file to disk in `datadir`.

In [3]:
import requests
import hashlib

for name, url, checksum in filelist:
    response = requests.get(url)
    file_path = os.path.join(datadir, name)
    if response.status_code != 200:
        print(f"Failed to download file {name} from {url}")
    else:
        # Compare the MD5 of the downloaded bytes with the expected checksum
        m = hashlib.md5()
        m.update(response.content)
        if m.hexdigest() == checksum:
            print(f"SUCCESS: File {name} downloaded from {url} with correct checksum {checksum}")
            # Only write the file to disk once its checksum has been verified
            with open(file_path, 'wb') as file:
                file.write(response.content)
        else:
            print(f"ERROR: File {name} downloaded from {url} with incorrect checksum {m.hexdigest()} (should be {checksum})")




SUCCESS: File Entity_info.json downloaded from https://figshare.com/ndownloader/files/43612509 with correct checksum 1746cde24a1bac0460f1ccf646608cc9
SUCCESS: File Literature_Info.json downloaded from https://figshare.com/ndownloader/files/43612512 with correct checksum 10b78e8ec30f5b85f2a58d8fe24f056b
SUCCESS: File Longevity_Biomarkers.json downloaded from https://figshare.com/ndownloader/files/43612497 with correct checksum 0dbd9c3f8474dc3cd744ed38af460d75
SUCCESS: File Relation_Info.json downloaded from https://figshare.com/ndownloader/files/43612506 with correct checksum 0c1fa199269adc58f64ad4d5b9fd87b9
SUCCESS: File Aging_Biomarkers.json downloaded from https://figshare.com/ndownloader/files/43612503 with correct checksum abd0eb6cb7295ae500c5d676b7797324
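
Once the files are on disk, it can be useful to re-check them later without downloading them again. The sketch below reads each file in chunks and compares its MD5 digest with the expected value. The helper names `md5_of_file` and `verify_files` are hypothetical, and the function assumes a list of `(name, url, checksum)` tuples shaped like `filelist`.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling fh.read until it returns b""
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(directory, files):
    """Map each file name to True if it exists in `directory` with the expected MD5."""
    results = {}
    for name, _url, expected in files:
        path = os.path.join(directory, name)
        results[name] = os.path.exists(path) and md5_of_file(path) == expected
    return results
```

In this notebook it could be called as `verify_files(datadir, filelist)`; any entry mapped to `False` would indicate a missing or altered file.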


Let us inspect these files. The two key files here are those containing the *entities* (nodes) and the *edges* (relations).

In [4]:
import json

def load_json(fname):
    with open(fname, 'rb') as file:
        return json.load(file)

nodes = load_json(f"{datadir}/{filelist[0][0]}")
edges = load_json(f"{datadir}/{filelist[3][0]}")

What are these new data structures?

In [5]:
print(type(nodes))
print(type(edges))

<class 'dict'>
<class 'dict'>


How big are they?

In [6]:
print(f"There are {len(nodes)} nodes and {len(edges)} edges")


There are 12257 nodes and 116495 edges


Let's look at each of these in more detail. First, let's pick one of the nodes. Let's print the keys and choose one:

In [7]:
print(list(nodes.keys()))

['MLH1', 'CD4', 'INS', 'MAPT', 'MYC', 'GSR', 'SOD2', 'CRP', 'IL6', 'SIRT1', 'CHGA', 'CFB', 'SKIV2L', 'TNXB', 'FKBPL', 'NOTCH4', 'CFH', 'HTRA1', 'GCG', 'IGF1', 'GH1', 'GHRH', 'WRN', 'NFKB1', 'SHBG', 'PIAS4', 'CCL2', 'RECQL4', 'BLM', 'ALB', 'TNF', 'BCAM', 'CD151', 'GGH', 'FGF23', 'PTH', 'JUNB', 'H2AZ1', 'PAPPA2', 'ELN', 'KIT', 'CSF2', 'VEGFA', 'MYO5A', 'MTOR', 'KLK3', 'AR', 'ACE', 'LMNB1', 'LMNA', 'NUP62', 'ULK1', 'MAP1LC3A', 'PIK3R2', 'IAPP', 'VDR', 'CLPS', 'APOD', 'FERMT2', 'MS4A6A', 'ABCA7', 'SORL1', 'HTT', 'APOB', 'RAF1', 'MAPK3', 'MAPK1', 'MAP2K1', 'MAP2K2', 'CFI', 'SERPINA1', 'IL7', 'KL', 'BECN1', 'NFE2L2', 'SENP7', 'MOB1B', 'CARMIL1', 'PRRC2A', 'TERF2', 'RFWD3', 'PARP1', 'POT1', 'ATM', 'MPHOSPH6', 'PPARGC1A', 'FNDC5', 'BDNF', 'NTRK2', 'CD8A', 'IFITM3', 'TRIM22', 'LY6E', 'IFNAR1', 'CTNNB1', 'APOL1', 'VWF', 'ATR', 'RNF8', 'BRCA1', 'TP53BP1', 'RETN', 'CXCL8', 'IL10', 'IL1B', 'IL13RA2', 'CXCR4', 'POU5F1', 'NANOG', 'IL2', 'APOE', 'NDRG2', 'BACE1', 'GGA3', 'CDK5', 'PIN1', 'STAT3', 'IFNG

Pick a random node key, say `AHR`:

In [5]:
node = nodes['AHR']
print(type(node))
print(len(node))

<class 'list'>
1


So each node is a list of length 1. What's in that?

In [6]:
type(node[0])

dict

Another dictionary. Let's get the keys:

In [7]:
print(node[0].keys())

dict_keys(['entity', 'type', 'PMID', 'official full name', 'sentence', 'numbers of articles', 'JT', 'TA', 'IF', 'IF5', 'year', 'date', 'alias names', 'description', 'url', 'mutation position', 'mutation alleles', 'MeSH ID', 'relation', 'external links', 'aging biomarker', 'longevity biomarker'])
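
We have only looked at one node so far, so it is worth checking whether the others share this shape. The following is a minimal sketch of such a check; `summarise_node_shapes` is a hypothetical helper, and `toy_nodes` is a small made-up stand-in with the same nesting (a dict of one-element lists of dicts) so that the example runs on its own.

```python
from collections import Counter


def summarise_node_shapes(nodes):
    """Count how many entries each node list has and which key sets occur."""
    lengths = Counter(len(entries) for entries in nodes.values())
    key_sets = Counter(
        frozenset(entry.keys())
        for entries in nodes.values()
        for entry in entries
    )
    return lengths, key_sets


# Made-up stand-in for the real `nodes` dictionary (only two keys per entry)
toy_nodes = {
    "AHR": [{"entity": "AHR", "type": "Gene"}],
    "Inflammation": [{"entity": "Inflammation", "type": "Disease"}],
}

lengths, key_sets = summarise_node_shapes(toy_nodes)
print(lengths)        # how many nodes have lists of each length
print(len(key_sets))  # number of distinct key sets across all entries
```

Passing the real `nodes` dictionary instead of `toy_nodes` would show whether every node is a list of length 1 with the same keys.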


In [8]:
for k in node[0].keys():
    print(f"{k}: {node[0][k]}")

entity: AHR
type: Gene
PMID: ['33923487', '30716515', '25777082', '33233417', '28633424', '32939877', '24106308', '31640697', '26790370', '25110076', '29102224', '32915475', '32183254', '23406155', '24495120', '18975255', '28057405', '27363826', '33669008', '15592584', '33866778', '23555298', '32965514', '23614742', '32414118', '26857571', '25186463', '30626868', '25680693', '28526404', '34685709', '33527709', '31001893', '33592460', '29908909', '35766906', '36159806', '17070097', '31391494']
official full name: aryl hydrocarbon receptor
sentence: [['The aryl hydrocarbon receptor (AhR) is a transcription factor deeply implicated in health and diseases.', 'Historically identified as a sensor of xenobiotics and mainly toxic substances, AhR has recently become an emerging pharmacological target in cancer, immunology, inflammatory conditions, and aging.', 'Multiple AhR ligands are recognized, with plant occurring flavonoids being the largest group of natural ligands of AhR in the human die

So what does this tell us? These are *annotations* of the node that tell us:
* The name of the **entity** (node)
* What the **type** of the node is (a gene, in this case)
* The **official full name** and **alias names** of the entity.
* A **URL** to the official record of the entity.
* A **description** of the entity.

Some entries tell us about the research papers that include information about the entity:

* The number of articles that mention this entity (**numbers of articles**).
* The **PMID** (PubMed identifier) of each article that was used to get information about the entity. You can enter these numbers at [https://pubmed.ncbi.nlm.nih.gov](https://pubmed.ncbi.nlm.nih.gov) to get the papers themselves.
* A **sentence** containing the entity name from each of the articles.
* The **JT** (journal title) and **TA** (journal title abbreviation) of each article.
* The **IF** (impact factor) and **IF5** (five-year impact factor) of each of the journals.
* The **year** and **date** each of the articles was published.
* Information about mutations: **mutation position** and **mutation alleles**.
* Some **external links** and the **MeSH ID** for the Medical Subject Headings database.
* Whether the entity is an **aging biomarker** or a **longevity biomarker**.
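
To make the PMID annotation easier to use, the identifiers can be turned into links. The sketch below assumes that a PubMed article page lives at the base address above followed by the PMID; `pubmed_urls` is a hypothetical helper, and `toy_entry` is a cut-down stand-in for `node[0]` that reuses the first few PMIDs printed for `AHR`.

```python
PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov"


def pubmed_urls(node_entry, limit=3):
    """Build PubMed links for up to `limit` PMIDs of a node entry."""
    pmids = node_entry.get("PMID", [])
    return [f"{PUBMED_BASE}/{pmid}/" for pmid in pmids[:limit]]


# Cut-down stand-in for node[0]; the real entry has many more keys
toy_entry = {
    "entity": "AHR",
    "type": "Gene",
    "PMID": ["33923487", "30716515", "25777082", "33233417"],
}

for url in pubmed_urls(toy_entry):
    print(url)
```

With the real data, `pubmed_urls(node[0])` would give links for the first three articles about the chosen node.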

Let's now look at one of the edges:

In [61]:
edgekeys = list(edges.keys())
edge = edges[edgekeys[50000]]
for k in edge.keys():
    print(f"{k}: {edge[k]}")

source entity: Neck Pain
relationship: predict
target entity: Uterine Cervicitis
sentence: ['The only factor significantly predicting rotator cuff tendon tears was old age (odds ratio, 1.04; 95% confidence interval: 1.00-1.09).In patients with shoulder or neck pain, no significant association existed between rotator cuff tendon tears and cervical foraminal stenosis (at the C5 and C6 levels).']
source: ['neck pain']
target: ['cervical foraminal stenosis']
source type: ['Disease']
target type: ['Disease']
PMID: ['30200155']
DP: ['2018 Sep']
date: [20180901]
TI: ['Cross-talk between shoulder and neck pain: an imaging study of association between rotator cuff tendon tears and cervical foraminal stenosis.']
TA: ['Medicine (Baltimore)']
IF: [1.6]
IF5: [1.9]
method: ['shortest path']


Perhaps the two key properties of the edge are the **source entity** and **target entity**, which specify the *entity* property of the nodes that the edge connects. Let us check that these exist:

In [55]:
source_node = nodes[edge['source entity']]
print(source_node)

[{'entity': 'Pulmonary Disease, Chronic Obstructive', 'type': 'Disease', 'PMID': ['35758351', '35749794', '35744080', '35690768', '35655299', '35642183', '35643460', '35609226', '35604255', '35536064', '35534855', '35532589', '35501357', '35345479', '35303167', '35279314', '35279458', '35273027', '35272603', '35252447', '35241091', '35189900', '35165108', '35118641', '35114313', '35082306', '35063235', '35010919', '35002516', '35001332', '34979967', '34943963', '34943847', '34925327', '34914976', '34912578', '34906332', '34902448', '34888281', '34863159', '34850811', '34819432', '34788448', '34769017', '34767594', '34763018', '34757975', '34749325', '34744434', '34680058', '34677889', '34675127', '34666080', '34649490', '34605374', '34597264', '34588196', '34556072', '34530106', '34490189', '34483657', '34479482', '34473553', '34463851', '34459145', '34425308', '34423803', '34402330', '34397750', '34351502', '34340238', '34324449', '34321308', '34302051', '34281568', '34280980', '34275

In [56]:
target_node = nodes[edge['target entity']]
print(target_node)

[{'entity': 'Inflammation', 'type': 'Disease', 'PMID': ['35795858', '35795148', '35787298', '35784327', '35775757', '35770239', '35750356', '35745198', '35745147', '35742971', '35735245', '35729551', '35729501', '35714185', '35710943', '35703366', '35697214', '35684455', '35675903', '35667273', '35665714', '35663976', '35661774', '35658705', '35652508', '35647762', '35645319', '35641938', '35641679', '35638483', '35638413', '35631195', '35628113', '35617332', '35618619', '35617280', '35614304', '35610652', '35607300', '35599014', '35586815', '35581554', '35580547', '35577781', '35573426', '35564391', '35563810', '35561889', '35551917', '35545374', '35544976', '35536668', '35527242', '35514037', '35504286', '35503599', '35490483', '35486520', '35476749', '35471420', '35468098', '35466876', '35461468', '35460762', '35459240', '35458696', '35450990', '35450904', '35447046', '35445358', '35439553', '35434940', '35431244', '35421968', '35421558', '35416312', '35410677', '35408987', '3540903
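
The same lookup can be repeated for every edge rather than a single one. The following is a minimal sketch of such a check; `missing_endpoints` is a hypothetical helper, and the toy dictionaries (including the edge keys `edge-0` and `edge-1` and their contents) are made up so that the example runs without the downloaded files.

```python
def missing_endpoints(nodes, edges):
    """Return the keys of edges whose source or target entity is not a key of `nodes`."""
    missing = []
    for key, edge in edges.items():
        if edge["source entity"] not in nodes or edge["target entity"] not in nodes:
            missing.append(key)
    return missing


# Made-up stand-ins with the same field names as the real data
toy_nodes = {
    "Neck Pain": [{"entity": "Neck Pain", "type": "Disease"}],
    "Inflammation": [{"entity": "Inflammation", "type": "Disease"}],
}
toy_edges = {
    "edge-0": {"source entity": "Neck Pain", "target entity": "Inflammation"},
    # This target is deliberately absent from toy_nodes
    "edge-1": {"source entity": "Neck Pain", "target entity": "Uterine Cervicitis"},
}

print(missing_endpoints(toy_nodes, toy_edges))
```

Calling `missing_endpoints(nodes, edges)` on the real data would return an empty list if every edge connects two known nodes.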

The edge also has:

* A **relationship**, which specifies how the two nodes are related (in this case, *predict*).
* **source** and **target** attributes, which are alternative names for the entities and which we will not use.
* **source type** and **target type**, which refer to the *type* attribute of the source and target nodes.
* A range of attributes related to the publications in which the relationship modelled by the edge is described (**PMID**, **DP**, **date**, **TI**, **TA**, **IF**, **IF5**).
* A list of **method** values, which we will not use.
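
Taken together, the **source entity**, **relationship** and **target entity** fields are enough to treat the edges as a directed, labelled graph. The sketch below counts relationship types and builds a plain adjacency list using only the standard library. The helper names are hypothetical, and `toy_edges` is a made-up stand-in: only the first entry mirrors the edge printed above, while the keys and the other two entries are invented for illustration.

```python
from collections import Counter, defaultdict


def relationship_counts(edges):
    """Count how often each relationship label occurs."""
    return Counter(edge["relationship"] for edge in edges.values())


def adjacency(edges):
    """Map each source entity to a list of (relationship, target entity) pairs."""
    graph = defaultdict(list)
    for edge in edges.values():
        graph[edge["source entity"]].append(
            (edge["relationship"], edge["target entity"])
        )
    return dict(graph)


# Made-up stand-in for the real `edges` dictionary
toy_edges = {
    "e0": {"source entity": "Neck Pain", "relationship": "predict",
           "target entity": "Uterine Cervicitis"},
    "e1": {"source entity": "AHR", "relationship": "regulate",
           "target entity": "Inflammation"},
    "e2": {"source entity": "AHR", "relationship": "predict",
           "target entity": "Neck Pain"},
}

print(relationship_counts(toy_edges))
print(adjacency(toy_edges)["AHR"])
```

A dedicated graph library could be used instead; the dictionary form is chosen here only to keep the example free of extra dependencies.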