## Group 1: merge and biolink format all tsv files
1. Edges use the predicates biolink:physically_interacts_with and biolink:gene_associated_with_condition

2. The corresponding config.json file is: config_bigGIM_interacts_with_associated_with

3. All files share the same columns: "subject, predicate, object, agent_type, knowledge_level, knowledge_source, object_category, publications, subject_category"

4. Full list of tsv files handled in this group is:


In [130]:
## To do list:
## add signor and cell_marker_genes.csv into pharmacogenomics KG
## check overlap with BigGIM KG and the best way to remove redundant edges
## check which part belongs to AML

In [106]:
## Load necessary packages
import os
import pandas as pd
import glob
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

## Define the version number
version_number = "05_19_2025"
deployment_date = "2025-05-19"

## Load files and convert them into separate node & edge files
* check all imported file structure

In [77]:
## Note: change the file paths in the following code to your own
raw_files_path = '/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/primeKG/dataverse_files/'

## Define the output path for node & edge files after formatting
download_path_node_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/primeKG_parsed_node_{version_number}.tsv'
download_path_edge_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/primeKG_parsed_edge_{version_number}.tsv'

In [78]:
## Check which files will be read
## List all PrimeKG csv files found in raw_files_path

for f in os.listdir(raw_files_path):
    if f.endswith('.csv'):
        print(f)

disease_features.csv
nodes.csv
kg_grouped.csv
kg_grouped_diseases.csv
kg_grouped_diseases_bert_map.csv
kg_giant.csv
kg.csv
drug_features.csv
kg_raw.csv
edges.csv


In [79]:
# -----------------------------------------------------------------------------------------------
# Filename				Description
# -----------------------------------------------------------------------------------------------
# nodes.csv				Contains node level information
# 					Primary key: `node_index`
# -----------------------------------------------------------------------------------------------
# edges.csv				Contains undirected relationships between nodes 
# 					Primary key: (`x_index`, `y_index`)
# -----------------------------------------------------------------------------------------------
# kg.csv					This is the Precision Medicine knowledge graph  
# 					Primary key: (`x_index`, `y_index`)
# -----------------------------------------------------------------------------------------------
# disease_features.csv			Contains textual descriptions of diseases 
# 					Primary key: `node_index`
# -----------------------------------------------------------------------------------------------
# drug_features.csv			Contains textual descriptions of drugs 
# 					Primary key: `node_index`
# -----------------------------------------------------------------------------------------------
# kg_raw.csv 				Intermediate PrimeKG made by joining nodes and edges
# -----------------------------------------------------------------------------------------------
# kg_giant.csv				Intermediate PrimeKG made by taking LCC of kg_raw.csv 
# -----------------------------------------------------------------------------------------------
# kg_grouped.csv				Intermediate PrimeKG made by grouping diseases  
# -----------------------------------------------------------------------------------------------
# kg_grouped_diseases.csv			List of all diseases and their assigned group name  
# -----------------------------------------------------------------------------------------------
# kg_grouped_diseases_bert_map.csv	Manual grouping created for diseases using BERT model
# ---------------------------------------------------------------------------------------------

In [80]:
## Read the nodes csv file
nodes_df = pd.read_csv(raw_files_path + 'nodes.csv')

nodes_df.head(10)

Unnamed: 0,node_index,node_id,node_type,node_name,node_source
0,0,9796,gene/protein,PHYHIP,NCBI
1,1,7918,gene/protein,GPANK1,NCBI
2,2,8233,gene/protein,ZRSR2,NCBI
3,3,4899,gene/protein,NRF1,NCBI
4,4,5297,gene/protein,PI4KA,NCBI
5,5,6564,gene/protein,SLC15A1,NCBI
6,6,8668,gene/protein,EIF3I,NCBI
7,7,10826,gene/protein,FAXDC2,NCBI
8,8,4489,gene/protein,MT1A,NCBI
9,9,6272,gene/protein,SORT1,NCBI


In [81]:
## check unique node_type and their node_source
unique_node_type_values = nodes_df['node_type'].unique()
print("All possible node_type are here: " ,unique_node_type_values)

unique_node_source_values = nodes_df['node_source'].unique()
print("All possible node_source are here: " ,unique_node_source_values)

All possible node_type are here:  ['gene/protein' 'drug' 'effect/phenotype' 'disease' 'biological_process'
 'molecular_function' 'cellular_component' 'exposure' 'pathway' 'anatomy']
All possible node_source are here:  ['NCBI' 'DrugBank' 'HPO' 'MONDO_grouped' 'MONDO' 'GO' 'CTD' 'REACTOME'
 'UBERON']


### Implementation notes:
* Essential list of nodes to add: gene/protein, drug, disease, effect/phenotype
* Good to have list: pathway, exposure (more on the environmental side), anatomy (only relevant if we need spatial information, e.g. cancer in a specific organ)
* Maybe not for now?: biological_process, molecular_function, cellular_component
*
* following example inputs here for node normalizer service:
* https://github.com/TranslatorSRI/NodeNormalization/blob/master/documentation/NodeNormalization.ipynb
* 
* for MONDO codes: need to change format to MONDO:XXXXXXX; in most cases the numbers have fewer than the required 7 digits, so they need zero-padding
* Discussion: how to deal with MONDO_grouped?
* see question 1 below
* for NCBI code: add NCBIGene: prefix
* for HPO code: also need to change format to HP:XXXXXXX, 7 digits
* for DrugBank: need to check if Translator accepts DB: codes
* for GO: also need to change format to GO:XXXXXXX, 7 digits
* for CTD: change format to CID:, add prefix
* for REACTOME:
* for UBERON: also need to change format to UBERON:XXXXXXX, 7 digits
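
The prefix and padding rules in the notes above can be sketched as a single helper. This is only an illustration: `to_curie`, `PREFIX_MAP` and `ZERO_PADDED_SOURCES` are hypothetical names, and the prefixes and padding widths are assumptions that should be checked against the Node Normalizer before use. Sources whose handling is still open (MONDO_grouped, CTD, REACTOME) are deliberately left unmapped.

```python
# Hypothetical helper: prefixes and padding widths are assumptions to verify.
PREFIX_MAP = {
    "MONDO": "MONDO:",
    "HPO": "HP:",
    "GO": "GO:",
    "UBERON": "UBERON:",
    "NCBI": "NCBIGene:",
    "DrugBank": "DRUGBANK:",
}
# Sources whose local ids are assumed to be zero-padded to 7 digits
ZERO_PADDED_SOURCES = {"MONDO", "HPO", "GO", "UBERON"}


def to_curie(node_id, node_source):
    """Return a CURIE string, or None for sources that are not covered yet."""
    prefix = PREFIX_MAP.get(node_source)
    if prefix is None:
        return None  # e.g. MONDO_grouped, CTD, REACTOME are still open questions
    local_id = str(node_id)
    if node_source in ZERO_PADDED_SOURCES:
        local_id = local_id.zfill(7)
    return prefix + local_id


print(to_curie(8019, "MONDO"))
print(to_curie(1507, "HPO"))
print(to_curie(9796, "NCBI"))
print(to_curie("DB06980", "DrugBank"))
print(to_curie("C091375", "CTD"))
```

Returning `None` for uncovered sources keeps unresolved cases visible instead of silently emitting an identifier with a guessed prefix.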

In [120]:
sanity_check_df = nodes_df[nodes_df['node_type'] == 'effect/phenotype']

sanity_check_df.head(15)

Unnamed: 0,node_index,node_id,node_type,node_name,node_source
22117,22117,1507,effect/phenotype,Growth abnormality,HPO
22118,22118,107,effect/phenotype,Renal cyst,HPO
22119,22119,1,effect/phenotype,All,HPO
22120,22120,5,effect/phenotype,Mode of inheritance,HPO
22121,22121,10460,effect/phenotype,Abnormality of the female genitalia,HPO
22122,22122,812,effect/phenotype,Abnormal internal genitalia,HPO
22123,22123,14,effect/phenotype,Abnormality of the bladder,HPO
22124,22124,2719,effect/phenotype,Recurrent infections,HPO
22125,22125,11277,effect/phenotype,Abnormality of the urinary system physiology,HPO
22126,22126,8684,effect/phenotype,Aplasia/hypoplasia of the uterus,HPO


In [82]:
## Sanity check on the current format of the file

# Randomly select one row for each unique node_type
sampled_df = nodes_df.groupby('node_type').sample(n=1, random_state=151)  # Set random_state for reproducibility

print(sampled_df)

        node_index        node_id           node_type  \
65751        65751           3476             anatomy   
114674      114674        1902367  biological_process   
124851      124851          70913  cellular_component   
99417        99417          23137             disease   
19830        19830        DB06980                drug   
87918        87918           9598    effect/phenotype   
61700        61700        C091375            exposure   
34206        34206         348793        gene/protein   
115725      115725          36200  molecular_function   
128395      128395  R-HSA-1368082             pathway   

                                                node_name node_source  
65751              respiratory system venous blood vessel      UBERON  
114674  negative regulation of Notch signaling pathway...          GO  
124851                                 Ddb1-Wdr21 complex          GO  
99417                        feigenbaum Bergeron syndrome       MONDO  
19830       

### Open questions for ingesting PrimeKG into the Pharmacogenomics KG
* **Question 1:** How to handle those grouped Mondo codes?

In [83]:
nodes_MONDO_grouped = nodes_df[nodes_df['node_source'] == 'MONDO_grouped']

nodes_MONDO_grouped.head(10)

Unnamed: 0,node_index,node_id,node_type,node_name,node_source
27158,27158,13924_12592_14672_13460_12591_12536_30861_8146...,disease,osteogenesis imperfecta,MONDO_grouped
27159,27159,11160_13119_13978_12060_12327_12670_13210_1106...,disease,autosomal recessive nonsyndromic deafness,MONDO_grouped
27160,27160,8099_12497_12498,disease,congenital stationary night blindness autosoma...,MONDO_grouped
27161,27161,14854_14293_14470_12380_11832_14603_14853_1176...,disease,autosomal dominant nonsyndromic deafness,MONDO_grouped
27162,27162,33202_32776_30905_33670_33200_32740_32732_3320...,disease,"deafness, autosomal recessive",MONDO_grouped
27163,27163,11396_7422,disease,keratoderma hereditarium mutilans,MONDO_grouped
27164,27164,14828_14829_9454_13553_133,disease,immunodeficiency-centromeric instability-facia...,MONDO_grouped
27167,27167,9260_9261_9262_18149,disease,GM1 gangliosidosis,MONDO_grouped
27170,27170,14083_13288_12987_13287_13289_20729_14840_1329...,disease,agammaglobulinemia,MONDO_grouped
27173,27173,14986_14987_10351_10953_13565_9213_12565_13248...,disease,Fanconi anemia complementation group,MONDO_grouped
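
One possible answer to Question 1, offered only as a sketch and not as a decision made in this notebook: a grouped `node_id` looks like several MONDO numbers joined by underscores, so it could be split into one row per member disease. The toy frame below reuses the two untruncated rows shown above; the column names `member_id` and `member_curie` are hypothetical.

```python
import pandas as pd

# Toy rows copied from the untruncated MONDO_grouped examples above
grouped = pd.DataFrame({
    "node_index": [27160, 27163],
    "node_id": ["8099_12497_12498", "11396_7422"],
})

# Split the underscore-joined ids and emit one row per member disease
members = grouped.assign(
    member_id=grouped["node_id"].str.split("_")
).explode("member_id")

# Assumed format: MONDO prefix plus a 7-digit zero-padded number
members["member_curie"] = "MONDO:" + members["member_id"].str.zfill(7)

print(members[["node_index", "member_curie"]])
```

Whether edges of a grouped node should then be copied to every member, or attached to a single representative, remains the open part of the question.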


* **Question 2:** How should we integrate the additional drug and disease feature information into our KG?
* Sample information can be obtained from disease features
    - mondo_id	mondo_name	group_id_bert	group_name_bert	mondo_definition	umls_description	orphanet_definition	**orphanet_prevalence**	orphanet_epidemiology	orphanet_clinical_description	**orphanet_management_and_treatment**	mayo_symptoms	mayo_causes	mayo_risk_factors	mayo_complications	mayo_prevention	mayo_see_doc
    - 12345	acral peeling skin syndrome			Acral peeling skin syndrome (PSS) is a form of PSS characterized by superficial peeling of the skin predominantly affecting the dorsa of the hands and feet.		A rare peeling skin syndrome characterized by superficial peeling of the skin predominantly affecting the dorsa of the hands and feet.	**<1/1000000**	Acral PSS is rare, with approximately 40 cases described in the literature to date.	The disease manifests shortly after birth or in early childhood with superficial peeling on the palmar, plantar and dorsal surfaces of the hands and feet, that leaves residual painless erythema. Manual skin removal is also possible. Seasonal variations are generally observed. Heat, humidity, exposure to water and friction or minor trauma can induce exfoliation. The lesions are not painful and heal without scarring.	**There is no effective treatment.** Emollients are often used to reduce skin peeling. Patients must avoid immersion in water and are recommended to use absorbing powders or aluminum antiperspirants.						
* Sample information can be obtained from drug features
    - description	**half_life**	indication	**mechanism_of_action**	protein_binding	pharmacodynamics	state	atc_1	atc_2	atc_3	atc_4	category	group	pathway	molecular_weight	tpsa	**clogp**
    - Budesonide is a glucocorticoid that is a mix of the 22R and 22S epimer used to treat inflammatory conditions of the lungs and intestines such as asthma, COPD, Crohn's disease, and ulcerative colitis.	**Budesonide has a plasma elimination half life of 2-3.6h. The terminal elimination half life in asthmatic children 4-6 years old is 2.3h.**	Budesonide extended release capsules are indicated for the treatment and maintenance of mild to moderate Crohn’s disease. Various inhaled budesonide products are indicated for prophylactic therapy in asthma and reducing exacerbations of COPD. A budesonide nasal spray is available over the counter for symptoms of hay fever and upper respiratory allergies. Extended release capsules are indicated to induce remission of mild to moderate ulcerative colitis and a rectal foam is used for mild to moderate distal ulcerative colitis.	**The short term effects of corticosteroids are decreased vasodilation and permeability of capillaries, as well as decreased leukocyte migration to sites of inflammation.** Corticosteroids binding to the glucocorticoid receptor mediates changes in gene expression that lead to multiple downstream effects over hours to days.	Corticosteroids are generally bound to corticosteroid binding globulin and serum albumin in plasma. Budesonide is 85-90% protein bound in plasma.	Budesonide is a glucocorticoid used to treat respiratory and digestive conditions by reducing inflammation. It has a wide therapeutic index, as dosing varies highly from patient to patient. Patients should be counselled regarding the risk of hypercorticism and adrenal axis suppression.	Budesonide is a solid.	Budesonide is anatomically related to dermatologicals and respiratory system and respiratory system and alimentary tract and metabolism and respiratory system.	
Budesonide is in the therapeutic group of corticosteroids, dermatological preparations and nasal preparations and drugs for obstructive airway diseases and antidiarrheals, intestinal antiinflammatory/antiinfective agents and drugs for obstructive airway diseases.	Budesonide is pharmacologically related to corticosteroids, plain and decongestants and other nasal preparations for topical use and adrenergics, inhalants and intestinal antiinflammatory agents and other drugs for obstructive airway diseases, inhalants.	The chemical and functional group of  is corticosteroids, potent (group iii) and corticosteroids and adrenergics in combination with corticosteroids or other drugs, excl. anticholinergics and corticosteroids acting locally and glucocorticoids.	Budesonide is part of Adrenal Cortex Hormones ; Adrenals ; Agents to Treat Airway Disease ; Alimentary Tract and Metabolism ; Anti-Asthmatic Agents ; Anti-Inflammatory Agents ; Antidiarrheals, Intestinal Antiinflammatory/antiinfective Agents ; Autonomic Agents ; Bronchodilator Agents ; BSEP/ABCB11 Substrates ; Corticosteroid Hormone Receptor Agonists ; Corticosteroids ; Corticosteroids Acting Locally ; Corticosteroids for Systemic Use ; Corticosteroids, Dermatological Preparations ; Corticosteroids, Potent (Group III) ; Cytochrome P-450 CYP2A6 Inducers ; Cytochrome P-450 CYP2B6 Inducers ; Cytochrome P-450 CYP2B6 Inducers (strength unknown) ; Cytochrome P-450 CYP2C19 Inducers ; Cytochrome P-450 CYP2C19 Inducers (strength unknown) ; Cytochrome P-450 CYP2C8 Inducers ; Cytochrome P-450 CYP2C8 Inducers (strength unknown) ; Cytochrome P-450 CYP2C9 Inducers ; Cytochrome P-450 CYP2C9 Inducers (strength unknown) ; Cytochrome P-450 CYP3A Inducers ; Cytochrome P-450 CYP3A Substrates ; Cytochrome P-450 CYP3A4 Inducers ; Cytochrome P-450 CYP3A4 Inducers (strength unknown) ; Cytochrome P-450 CYP3A4 Substrates ; Cytochrome P-450 CYP3A5 Inducers ; Cytochrome P-450 CYP3A5 Inducers (moderate) ; Cytochrome P-450 Enzyme Inducers ; 
Cytochrome P-450 Substrates ; Dermatologicals ; Drugs for Obstructive Airway Diseases ; Drugs that are Mainly Renally Excreted ; Fused-Ring Compounds ; Hormones ; Hormones, Hormone Substitutes, and Hormone Antagonists ; Immunosuppressive Agents ; Intestinal Antiinflammatory Agents ; Nasal Preparations ; OAT3/SLC22A8 Substrates ; P-glycoprotein substrates ; Peripheral Nervous System Agents ; Pregnanes ; Pregnenediones ; Pregnenes ; Respiratory System Agents ; Steroids.	Budesonide is approved.		The molecular weight is 430.54.	Budesonide has a topological polar surface area of 93.06.	**The log p value of  is 2.9.**
    - clogp: a predicted value for the partition coefficient (LogP) of a molecule
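
Several drug features are stored as sentences rather than numbers (for example "The molecular weight is 430.54."). If they were to become numeric node attributes, one option is to pull out the first number in each sentence. The sketch below assumes that the first number is the value of interest, which holds for the samples shown here but has not been checked on the full file; the `*_value` column names are hypothetical.

```python
import pandas as pd

# Toy values copied from the samples in this notebook; None mimics a missing cell
drug_text = pd.DataFrame({
    "molecular_weight": ["The molecular weight is 430.54.", None],
    "clogp": ["The log p value of  is 2.9.", "The log p value of is 3.36."],
})

# First signed integer or decimal number in the sentence
NUMBER = r"(-?\d+(?:\.\d+)?)"

for col in ["molecular_weight", "clogp"]:
    drug_text[col + "_value"] = (
        drug_text[col].str.extract(NUMBER, expand=False).astype(float)
    )

print(drug_text[["molecular_weight_value", "clogp_value"]])
```

Free-text fields such as `half_life` mix units and ranges ("2-3.6h"), so this simple pattern would not be enough for them.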

In [84]:
## Read the disease features csv file
disease_features_df = pd.read_csv(raw_files_path + 'disease_features.csv')

disease_features_df.head(5)

Unnamed: 0,node_index,mondo_id,mondo_name,group_id_bert,group_name_bert,mondo_definition,umls_description,orphanet_definition,orphanet_prevalence,orphanet_epidemiology,orphanet_clinical_description,orphanet_management_and_treatment,mayo_symptoms,mayo_causes,mayo_risk_factors,mayo_complications,mayo_prevention,mayo_see_doc
0,27165,8019,mullerian aplasia and hyperandrogenism,,,"Deficiency of the glycoprotein WNT4, associate...","Deficiency of the glycoprotein wnt4, associate...","A rare syndrome with 46,XX disorder of sex dev...",,,,,,,,,,
1,27165,8019,mullerian aplasia and hyperandrogenism,,,"Deficiency of the glycoprotein WNT4, associate...","Deficiency of the glycoprotein wnt4, associate...","A rare syndrome with 46,XX disorder of sex dev...",,,,,,,,,,
2,27166,11043,"myelodysplasia, immunodeficiency, facial dysmo...",,,,,,,,,,,,,,,
3,27168,8878,"bone dysplasia, lethal Holmgren type",,,Bone dysplasia lethal Holmgren type (BDLH) is ...,A lethal bone dysplasia with characteristics o...,Bone dysplasia lethal Holmgren type (BDLH) is ...,<1/1000000,,,,,,,,,
4,27169,8905,predisposition to invasive fungal disease due ...,,,,,"A rare, genetic primary immunodeficiency chara...",,,,,,,,,,


In [85]:
## Read the drug features csv file
drug_features_df = pd.read_csv(raw_files_path + 'drug_features.csv')

drug_features_df.head(5)

Unnamed: 0,node_index,description,half_life,indication,mechanism_of_action,protein_binding,pharmacodynamics,state,atc_1,atc_2,atc_3,atc_4,category,group,pathway,molecular_weight,tpsa,clogp
0,14012,Copper is a transition metal and a trace eleme...,,For use in the supplementation of total parent...,Copper is absorbed from the gut via high affin...,Copper is nearly entirely bound by ceruloplasm...,Copper is incorporated into many enzymes throu...,Copper is a solid.,,,,,Copper is part of Copper-containing Intrauteri...,Copper is approved and investigational.,,,,
1,14013,Oxygen is an element displayed by the symbol O...,The half-life is approximately 122.24 seconds,Oxygen therapy in clinical settings is used ac...,Oxygen therapy increases the arterial pressure...,Oxygen binds to oxygen-carrying protein in red...,Oxygen therapy improves effective cellular oxy...,Oxygen is a gas.,Oxygen is anatomically related to various.,Oxygen is in the therapeutic group of all othe...,Oxygen is pharmacologically related to all oth...,The chemical and functional group of is medic...,Oxygen is part of Chalcogens ; Elements ; Gase...,Oxygen is approved and vet_approved.,,The molecular weight is 32.0.,Oxygen has a topological polar surface area of...,
2,14014,"Flunisolide (marketed as AeroBid, Nasalide, Na...",The half-life is 1.8 hours,For the maintenance treatment of asthma as a p...,Flunisolide is a glucocorticoid receptor agoni...,Approximately 40% after oral inhalation,Flunisolide is a synthetic corticosteroid. It ...,Flunisolide is a solid.,Flunisolide is anatomically related to respira...,Flunisolide is in the therapeutic group of nas...,Flunisolide is pharmacologically related to de...,The chemical and functional group of is corti...,Flunisolide is part of Adrenal Cortex Hormones...,Flunisolide is approved and investigational.,,The molecular weight is 434.5.,Flunisolide has a topological polar surface ar...,The log p value of is 2.41.
3,14015,Alclometasone is synthetic glucocorticoid ster...,,For the relief of the inflammatory and pruriti...,The mechanism of the anti-inflammatory activit...,,Alclometasone is a synthetic corticosteroid fo...,Alclometasone is a solid.,Alclometasone is anatomically related to derma...,Alclometasone is in the therapeutic group of c...,Alclometasone is pharmacologically related to ...,The chemical and functional group of is corti...,Alclometasone is part of Adrenal Cortex Hormon...,Alclometasone is approved.,,,,
4,14016,Medrysone is a corticosteroid used in ophthalm...,,"For the treatment of allergic conjunctivitis, ...",There is no generally accepted explanation for...,,Medrysone is a topical anti-inflammatory corti...,Medrysone is a solid.,Medrysone is anatomically related to sensory o...,Medrysone is in the therapeutic group of ophth...,Medrysone is pharmacologically related to anti...,The chemical and functional group of is corti...,Medrysone is part of Adrenal Cortex Hormones ;...,Medrysone is approved.,,The molecular weight is 344.5.,Medrysone has a topological polar surface area...,The log p value of is 3.36.


In [86]:
## Select only the 4 must have node types
must_have_list = ['gene/protein', 'drug', 'disease', 'effect/phenotype']
nodes_select_df = nodes_df[nodes_df['node_type'].isin(must_have_list)]

## count rows
unique_node_count = nodes_select_df['node_index'].nunique()
print(unique_node_count)

68019


In [87]:
sanity_check_df = nodes_select_df[nodes_select_df['node_id'] == '5156']

sanity_check_df.head()

Unnamed: 0,node_index,node_id,node_type,node_name,node_source
3772,3772,5156,gene/protein,PDGFRA,NCBI
36722,36722,5156,disease,encephalomyelitis,MONDO
86127,86127,5156,effect/phenotype,Hypoplastic left atrium,HPO
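
The output above shows one `node_id` shared by three vocabularies. A quick way to measure how common this is, sketched here on a toy frame rather than on `nodes_df`, is to count distinct sources per `node_id` and to confirm that the pair (`node_id`, `node_source`) is unique. The names `toy_nodes`, `ambiguous_ids` and `is_unique_pair` are illustrative.

```python
import pandas as pd

# Toy rows: the three 5156 rows above plus one unambiguous gene
toy_nodes = pd.DataFrame({
    "node_index": [3772, 36722, 86127, 0],
    "node_id": ["5156", "5156", "5156", "9796"],
    "node_source": ["NCBI", "MONDO", "HPO", "NCBI"],
})

# node_id values that appear under more than one source
sources_per_id = toy_nodes.groupby("node_id")["node_source"].nunique()
ambiguous_ids = sources_per_id[sources_per_id > 1].index.tolist()
print(ambiguous_ids)

# The (node_id, node_source) pair should identify a node unambiguously
is_unique_pair = not toy_nodes.duplicated(subset=["node_id", "node_source"]).any()
print(is_unique_pair)
```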


In [88]:
## let's exclude those 'MONDO_grouped' rows for now
nodes_select_no_MONDO_grouped_df = nodes_select_df[nodes_select_df['node_source'] != 'MONDO_grouped']

## count rows
unique_node_count = nodes_select_no_MONDO_grouped_df['node_index'].nunique()
print(unique_node_count)

66752


In [89]:
## get the list of unique node_index values of nodes_select_no_MONDO_grouped_df
## node_id alone is not unique: it needs to be combined with node_source as a prefix to obtain a unique identifier
## (see the example above of the same "node_id" in multiple vocabularies), so edges are filtered on node_index instead
unique_node_ids = nodes_select_no_MONDO_grouped_df['node_index'].unique().tolist()

## Read the edge file (kg.csv)
edges_df = pd.read_csv(raw_files_path + 'kg.csv')

## count rows
print(len(edges_df))

## Filter to only include edges whose x_index and y_index are both in the selected node_index list
edges_filtered_df = edges_df[edges_df['x_index'].isin(unique_node_ids)]
edges_filtered2_df = edges_filtered_df[edges_filtered_df['y_index'].isin(unique_node_ids)]

## count rows
print(len(edges_filtered2_df))

edges_filtered2_df.head(5)

8100498
4009644


Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source
0,protein_protein,ppi,0,9796,gene/protein,PHYHIP,NCBI,8889,56992,gene/protein,KIF15,NCBI
1,protein_protein,ppi,1,7918,gene/protein,GPANK1,NCBI,2798,9240,gene/protein,PNMA1,NCBI
2,protein_protein,ppi,2,8233,gene/protein,ZRSR2,NCBI,5646,23548,gene/protein,TTC33,NCBI
3,protein_protein,ppi,3,4899,gene/protein,NRF1,NCBI,11592,11253,gene/protein,MAN1B1,NCBI
4,protein_protein,ppi,4,5297,gene/protein,PI4KA,NCBI,2122,8601,gene/protein,RGS20,NCBI
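
The `x_id` and `y_id` columns hold both numeric ids (NCBI, MONDO, HPO) and DrugBank-style strings, so pandas may infer mixed types when reading `kg.csv`. A minimal sketch of reading both columns as strings follows; it uses an in-memory toy CSV with a subset of the columns, and whether this is needed for the real file is an assumption.

```python
import io
import pandas as pd

# Toy CSV with a subset of the kg.csv columns
toy_csv = io.StringIO(
    "relation,x_index,x_id,x_source,y_index,y_id,y_source\n"
    "protein_protein,0,9796,NCBI,8889,56992,NCBI\n"
    "drug_effect,14202,DB00598,DrugBank,84392,16,HPO\n"
)

# Force the id columns to string so numeric and DB-style ids share one dtype
toy_edges = pd.read_csv(toy_csv, dtype={"x_id": str, "y_id": str})

print(toy_edges[["x_id", "y_id"]])
```

Reading the ids as strings also makes a later `.astype(str)` a no-op and keeps the zero-padding step predictable.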


In [90]:
## check unique relation
unique_relation_values = edges_filtered2_df['relation'].unique()
print("All possible relation are here: " , unique_relation_values)

All possible relation are here:  ['protein_protein' 'drug_protein' 'contraindication' 'indication'
 'off-label use' 'drug_drug' 'phenotype_protein' 'phenotype_phenotype'
 'disease_phenotype_negative' 'disease_phenotype_positive'
 'disease_protein' 'disease_disease' 'drug_effect']


In [91]:
## check unique display_relation
unique_relation_values = edges_filtered2_df['display_relation'].unique()
print("All possible display relation are here: " , unique_relation_values)

All possible display relation are here:  ['ppi' 'carrier' 'enzyme' 'target' 'transporter' 'contraindication'
 'indication' 'off-label use' 'synergistic interaction' 'associated with'
 'parent-child' 'phenotype absent' 'phenotype present' 'side effect']
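
Before choosing Biolink predicates it can help to see which `display_relation` values occur under each `relation`. The sketch below counts the pairs on a few toy rows (not real counts); on the notebook's data the same `groupby` could be run on `edges_filtered2_df`.

```python
import pandas as pd

# Toy rows only; the real pairs come from the edge dataframe
toy_edges = pd.DataFrame({
    "relation": ["protein_protein", "drug_protein", "drug_protein", "indication"],
    "display_relation": ["ppi", "target", "enzyme", "indication"],
})

# One row per (relation, display_relation) pair with its edge count
pairs = (
    toy_edges.groupby(["relation", "display_relation"])
    .size()
    .reset_index(name="n_edges")
)
print(pairs)
```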


In [92]:
## check unique node types across x_type and y_type
unique_type_values = set(list(edges_filtered2_df['x_type'].unique()) + list(edges_filtered2_df['y_type'].unique()))
print("All possible node types are here: " , unique_type_values)

All possible node types are here:  {'drug', 'disease', 'gene/protein', 'effect/phenotype'}


In [93]:
## check unique node sources across x_source and y_source
unique_type_values = set(list(edges_filtered2_df['x_source'].unique()) + list(edges_filtered2_df['y_source'].unique()))
print("All possible node sources are here: " , unique_type_values)

All possible node sources are here:  {'MONDO', 'HPO', 'DrugBank', 'NCBI'}


In [94]:
## Sanity check anatomy_anatomy relations
## should return empty df
sanity_check_df = edges_filtered2_df[edges_filtered2_df['relation'] == 'anatomy_anatomy']
sanity_check_df.head(10)

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source


### Now handle each node type differently
* Follow the example inputs for the node normalizer service: https://github.com/TranslatorSRI/NodeNormalization/blob/master/documentation/NodeNormalization.ipynb
* Drug: DRUGBANK:DB09145
* Disease: MONDO:0004976
* Gene / Protein: NCBIGene:9496
* Effect / Phenotype: HP:0007354
* Need to find a way to zero-pad MONDO and HP codes to 7 digits
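
Once identifiers are in the CURIE formats listed above, they could be sent to the node normalizer in batches. The sketch below only builds the request bodies and makes no network call. The `curies` key reflects my understanding of the service's POST body and should be checked against the linked notebook; `build_normalizer_payloads` and the batch size are hypothetical.

```python
import json


def build_normalizer_payloads(curies, batch_size=1000):
    """Deduplicate CURIEs and split them into request bodies of limited size."""
    unique_curies = sorted(set(curies))
    return [
        {"curies": unique_curies[i:i + batch_size]}  # assumed body shape
        for i in range(0, len(unique_curies), batch_size)
    ]


# Example CURIEs from the list above, with one duplicate
examples = [
    "DRUGBANK:DB09145",
    "MONDO:0004976",
    "NCBIGene:9496",
    "HP:0007354",
    "MONDO:0004976",
]
payloads = build_normalizer_payloads(examples, batch_size=3)
print(json.dumps(payloads, indent=2))
```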

In [95]:
## Start by selecting rows with MONDO or HPO in either x_source or y_source
## this is only a test run; once the code is done, run it directly on the full df
mondo_or_hpo = ['MONDO', 'HPO']
edges_mondo_hpo_df = edges_filtered2_df[ (edges_filtered2_df['x_source'].isin(mondo_or_hpo)) | (edges_filtered2_df['y_source'].isin(mondo_or_hpo))]

# Randomly select one row for each unique combination
sampled_df = edges_mondo_hpo_df.groupby(['x_source', 'y_source']).sample(n=1, random_state=151)  # Set random_state for reproducibility
sampled_df

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source
3371127,drug_effect,side effect,14202,DB00598,drug,Labetalol,DrugBank,84392,16,effect/phenotype,Urinary retention,HPO
372552,indication,indication,16845,DB00925,drug,Phenoxybenzamine,DrugBank,31122,8233,disease,phaeochromocytoma,MONDO
6110961,drug_effect,side effect,22982,12531,effect/phenotype,Pain,HPO,15111,DB00633,drug,Dexmedetomidine,DrugBank
5796226,phenotype_phenotype,parent-child,93166,100135,effect/phenotype,Absent epiphysis of the distal phalanx of the ...,HPO,26795,100091,effect/phenotype,Abnormality of the epiphysis of the distal pha...,HPO
5857733,disease_phenotype_positive,phenotype present,84496,48,effect/phenotype,Bifid scrotum,HPO,31078,8727,disease,congenital adrenal hyperplasia due to 3-beta-h...,MONDO
5777536,phenotype_protein,associated with,22180,1892,effect/phenotype,Abnormal bleeding,HPO,5499,94,gene/protein,ACVRL1,NCBI
5743204,contraindication,contraindication,29182,7915,disease,systemic lupus erythematosus (disease),MONDO,17308,DB01170,drug,Guanethidine,DrugBank
3093310,disease_phenotype_positive,phenotype present,27213,9712,disease,congenital multicore myopathy with external op...,MONDO,85634,3798,effect/phenotype,Nemaline bodies,HPO
6058939,disease_disease,parent-child,35879,21670,disease,post-infectious syndrome,MONDO,35690,2254,disease,syndromic disease,MONDO
6011774,disease_protein,associated with,31988,7336,disease,isolated cleft palate,MONDO,7019,6662,gene/protein,SOX9,NCBI


In [96]:
## work directly on the full dataframe
## first handle x_id:
## add prefix MONDO: or HP: for rows with source "MONDO" or "HPO" and zero-pad the id to 7 digits, as in the node normalizer examples

## deep copy a df
edges_deep_copy_df = edges_filtered2_df.copy(deep=True)

## Prefix mapping for selected sources
prefix_map = {
    'HPO': 'HP:',
    'MONDO': 'MONDO:',
    'DrugBank': 'DRUGBANK:', 
    'NCBI': 'NCBIGene:'
}

## Initialize new column with default values (original id as string)
edges_deep_copy_df['subject'] = edges_deep_copy_df['x_id'].astype(str)

## Mask for HPO and MONDO → prefix + zero-padded id
mask_hpo_mondo = edges_deep_copy_df['x_source'].isin(['HPO', 'MONDO'])
edges_deep_copy_df.loc[mask_hpo_mondo, 'subject'] = (
    edges_deep_copy_df.loc[mask_hpo_mondo, 'x_source'].map(prefix_map) +
    edges_deep_copy_df.loc[mask_hpo_mondo, 'x_id'].astype(str).str.zfill(7)
)

## Mask for DrugBank and NCBI → prefix only, no padding
mask_drugbank_ncbi = edges_deep_copy_df['x_source'].isin(['DrugBank', 'NCBI'])
edges_deep_copy_df.loc[mask_drugbank_ncbi, 'subject'] = (
    edges_deep_copy_df.loc[mask_drugbank_ncbi, 'x_source'].map(prefix_map) +
    edges_deep_copy_df.loc[mask_drugbank_ncbi, 'x_id'].astype(str)
)

sampled_df = edges_deep_copy_df.groupby(['x_source']).sample(n=1, random_state=151)  # Set random_state for reproducibility
sampled_df

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source,subject
1362595,drug_drug,synergistic interaction,21014,DB13648,drug,Alcuronium,DrugBank,14614,DB06738,drug,Ketobemidone,DrugBank,DRUGBANK:DB13648
5891379,disease_phenotype_positive,phenotype present,88949,11800,effect/phenotype,Midface retrusion,HPO,31464,9841,disease,PEHO syndrome,MONDO,HP:0011800
3153485,disease_phenotype_positive,phenotype present,31611,14892,disease,micrognathia-recurrent infections-behavioral a...,MONDO,22715,750,effect/phenotype,Delayed speech and language development,HPO,MONDO:0014892
116373,protein_protein,ppi,2071,27074,gene/protein,LAMP3,NCBI,10788,347733,gene/protein,TUBB2B,NCBI,NCBIGene:27074


In [97]:
## similarly change for y_id
## Initialize new column with default values (original id as string)
edges_deep_copy_df['object'] = edges_deep_copy_df['y_id'].astype(str)

## Mask for HPO and MONDO → prefix + zero-padded id
mask_hpo_mondo = edges_deep_copy_df['y_source'].isin(['HPO', 'MONDO'])
edges_deep_copy_df.loc[mask_hpo_mondo, 'object'] = (
    edges_deep_copy_df.loc[mask_hpo_mondo, 'y_source'].map(prefix_map) +
    edges_deep_copy_df.loc[mask_hpo_mondo, 'y_id'].astype(str).str.zfill(7)
)

## Mask for DrugBank and NCBI → prefix only, no padding
mask_drugbank_ncbi = edges_deep_copy_df['y_source'].isin(['DrugBank', 'NCBI'])
edges_deep_copy_df.loc[mask_drugbank_ncbi, 'object'] = (
    edges_deep_copy_df.loc[mask_drugbank_ncbi, 'y_source'].map(prefix_map) +
    edges_deep_copy_df.loc[mask_drugbank_ncbi, 'y_id'].astype(str)
)

sampled_df = edges_deep_copy_df.groupby(['y_source']).sample(n=1, random_state=151)  # Set random_state for reproducibility
sampled_df

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source,subject,object
1423055,drug_drug,synergistic interaction,15103,DB00496,drug,Darifenacin,DrugBank,15441,DB00402,drug,Eszopiclone,DrugBank,DRUGBANK:DB00496,DRUGBANK:DB00402
3200343,disease_phenotype_positive,phenotype present,28319,8900,disease,camptodactyly with fibrous tissue hyperplasia ...,MONDO,22650,100490,effect/phenotype,Camptodactyly of finger,HPO,MONDO:0008900,HP:0100490
3249032,disease_protein,associated with,1749,7150,gene/protein,TOP1,NCBI,84244,16267,disease,undifferentiated carcinoma of the corpus uteri,MONDO,NCBIGene:7150,MONDO:0016267
116373,protein_protein,ppi,2071,27074,gene/protein,LAMP3,NCBI,10788,347733,gene/protein,TUBB2B,NCBI,NCBIGene:27074,NCBIGene:347733
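
After both columns are built, a pattern check can catch identifiers that were padded or prefixed incorrectly. This is a sketch under assumptions: the patterns encode the formats used in this notebook (7-digit HP and MONDO numbers, `DB` plus five digits for DrugBank) and are not taken from an official specification. `CURIE_PATTERNS` and `find_bad_curies` are hypothetical names.

```python
import pandas as pd

# Assumed CURIE shape per source, matching the examples in this notebook
CURIE_PATTERNS = {
    "HPO": r"HP:\d{7}",
    "MONDO": r"MONDO:\d{7}",
    "DrugBank": r"DRUGBANK:DB\d{5}",
    "NCBI": r"NCBIGene:\d+",
}


def find_bad_curies(df, curie_col, source_col):
    """Return the rows whose CURIE does not match the pattern for its source."""
    bad = pd.Series(False, index=df.index)
    for source, pattern in CURIE_PATTERNS.items():
        rows = df[source_col] == source
        bad |= rows & ~df[curie_col].str.fullmatch(pattern)
    return df[bad]


# Toy rows: four valid subjects from the samples above and one unpadded HP id
toy = pd.DataFrame({
    "x_source": ["HPO", "MONDO", "DrugBank", "NCBI", "HPO"],
    "subject": ["HP:0011800", "MONDO:0014892", "DRUGBANK:DB13648",
                "NCBIGene:27074", "HP:11800"],
})
bad_rows = find_bad_curies(toy, "subject", "x_source")
print(bad_rows)
```

The same call with the `object` and `y_source` columns would cover the other side of each edge.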


* **Question 3:** Which Biolink node type should effect/phenotype nodes map to?
* Details on the candidates follow
* DiseaseOrPhenotypicFeature
  - Either one of a disease or an individual phenotypic feature. Some knowledge resources such as Monarch treat these as distinct, others such as MESH conflate. Please see definitions of phenotypic feature and disease in this model for their independent descriptions. This class is helpful to enforce domains and ranges that may involve either a disease or a phenotypic feature.
* DiseaseOrPhenotypicFeatureExposure
  - A disease or phenotypic feature state, when viewed as an exposure, represents an precondition, leading to or influencing an outcome, e.g. HIV predisposing an individual to infections; a relative deficiency of skin pigmentation predisposing an individual to skin cancer.
* DiseaseOrPhenotypicFeatureOutcome
  - Physiological outcomes resulting from an exposure event which is the manifestation of a disease or other characteristic phenotype.

In [98]:
## Create a new map to change subject and object category based on x_type and y_type
## {'drug', 'disease', 'gene/protein', 'effect/phenotype'}
category_map = {
    'drug': 'biolink:Drug',
    'gene/protein': 'biolink:Gene',
    'disease': 'biolink:Disease', 
    'effect/phenotype': 'biolink:DiseaseOrPhenotypicFeature'
}

## Initialize the new column with default values (the original x_type as a string)
edges_deep_copy_df['subject_category'] = edges_deep_copy_df['x_type'].astype(str)

## Overwrite only the rows whose x_type has an entry in category_map
mask = edges_deep_copy_df['x_type'].isin(['drug', 'gene/protein', 'disease', 'effect/phenotype'])
edges_deep_copy_df.loc[mask, 'subject_category'] = (
    edges_deep_copy_df.loc[mask, 'x_type'].map(category_map)
)

## Initialize the new column with default values (the original y_type as a string)
edges_deep_copy_df['object_category'] = edges_deep_copy_df['y_type'].astype(str)

## Overwrite only the rows whose y_type has an entry in category_map
mask = edges_deep_copy_df['y_type'].isin(['drug', 'gene/protein', 'disease', 'effect/phenotype'])
edges_deep_copy_df.loc[mask, 'object_category'] = (
    edges_deep_copy_df.loc[mask, 'y_type'].map(category_map)
)

sampled_df = edges_deep_copy_df.groupby(['x_type']).sample(n=1, random_state=151)  # Set random_state for reproducibility
sampled_df

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source,subject,object,subject_category,object_category
3157330,disease_phenotype_positive,phenotype present,31809,32864,disease,intellectual developmental disorder with speec...,MONDO,22759,6,effect/phenotype,Autosomal dominant inheritance,HPO,MONDO:0032864,HP:0000006,biolink:Disease,biolink:DiseaseOrPhenotypicFeature
1162436,drug_drug,synergistic interaction,20349,DB09543,drug,Methyl salicylate,DrugBank,14432,DB09383,drug,Meprednisone,DrugBank,DRUGBANK:DB09543,DRUGBANK:DB09383,biolink:Drug,biolink:Drug
6112214,drug_effect,side effect,22939,83,effect/phenotype,Renal insufficiency,HPO,17730,DB00710,drug,Ibandronate,DrugBank,HP:0000083,DRUGBANK:DB00710,biolink:DiseaseOrPhenotypicFeature,biolink:Drug
5473379,protein_protein,ppi,637,5725,gene/protein,PTBP1,NCBI,866,5595,gene/protein,MAPK3,NCBI,NCBIGene:5725,NCBIGene:5595,biolink:Gene,biolink:Gene
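
The two masked assignments above can also be written as one `map` per column, with an explicit report of any node type that has no entry in `category_map`. This is a minimal sketch on a toy frame standing in for `edges_deep_copy_df`; the helper names `map_categories` and `unmapped_types` and the `anatomy` type in the toy data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

category_map = {
    'drug': 'biolink:Drug',
    'gene/protein': 'biolink:Gene',
    'disease': 'biolink:Disease',
    'effect/phenotype': 'biolink:DiseaseOrPhenotypicFeature',
}

def map_categories(types: pd.Series, mapping: dict) -> pd.Series:
    """Map node types to Biolink categories; types without a mapping are kept unchanged."""
    mapped = types.map(mapping)            # unmapped types become NaN
    return mapped.fillna(types.astype(str))

def unmapped_types(types: pd.Series, mapping: dict) -> set:
    """Return the node types that have no entry in the mapping."""
    return set(types.dropna().unique()) - set(mapping)

# Toy data (hypothetical); 'anatomy' is deliberately absent from category_map
toy_df = pd.DataFrame({
    'x_type': ['drug', 'gene/protein', 'anatomy'],
    'y_type': ['disease', 'effect/phenotype', 'drug'],
})
toy_df['subject_category'] = map_categories(toy_df['x_type'], category_map)
toy_df['object_category'] = map_categories(toy_df['y_type'], category_map)

print(toy_df)
print('Unmapped x_type values:', unmapped_types(toy_df['x_type'], category_map))
```

Reporting the unmapped types makes it visible when a raw PrimeKG type would otherwise slip into the category columns unnoticed.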


In [129]:
## Sanity check: inspect 'side effect' edges (drug_effect relation)
## a non-empty df is expected here
sanity_check_df = edges_deep_copy_df[edges_deep_copy_df['display_relation'] == 'side effect']
print(len(sanity_check_df))
sanity_check_df.head(10)

129568


Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source,subject,object,subject_category,object_category,predicate,knowledge_souce,knowledge_level,agent_type
3348335,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,23158,2027,effect/phenotype,Abdominal pain,HPO,DRUGBANK:DB00583,HP:0002027,biolink:Drug,biolink:DiseaseOrPhenotypicFeature,biolink:has_side_effect,PrimeKG,knowledge_assertion,automated_agent
3348336,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,85849,4396,effect/phenotype,Poor appetite,HPO,DRUGBANK:DB00583,HP:0004396,biolink:Drug,biolink:DiseaseOrPhenotypicFeature,biolink:has_side_effect,PrimeKG,knowledge_assertion,automated_agent
3348337,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,22447,739,effect/phenotype,Anxiety,HPO,DRUGBANK:DB00583,HP:0000739,biolink:Drug,biolink:DiseaseOrPhenotypicFeature,biolink:has_side_effect,PrimeKG,knowledge_assertion,automated_agent
3348338,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,22831,11675,effect/phenotype,Arrhythmia,HPO,DRUGBANK:DB00583,HP:0011675,biolink:Drug,biolink:DiseaseOrPhenotypicFeature,biolink:has_side_effect,PrimeKG,knowledge_assertion,automated_agent
3348339,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,23469,3418,effect/phenotype,Back pain,HPO,DRUGBANK:DB00583,HP:0003418,biolink:Drug,biolink:DiseaseOrPhenotypicFeature,biolink:has_side_effect,PrimeKG,knowledge_assertion,automated_agent
3348340,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,23254,12387,effect/phenotype,Bronchitis,HPO,DRUGBANK:DB00583,HP:0012387,biolink:Drug,biolink:DiseaseOrPhenotypicFeature,biolink:has_side_effect,PrimeKG,knowledge_assertion,automated_agent
3348341,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,23168,1626,effect/phenotype,Abnormality of the cardiovascular system,HPO,DRUGBANK:DB00583,HP:0001626,biolink:Drug,biolink:DiseaseOrPhenotypicFeature,biolink:has_side_effect,PrimeKG,knowledge_assertion,automated_agent
3348342,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,26336,100749,effect/phenotype,Chest pain,HPO,DRUGBANK:DB00583,HP:0100749,biolink:Drug,biolink:DiseaseOrPhenotypicFeature,biolink:has_side_effect,PrimeKG,knowledge_assertion,automated_agent
3348343,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,26160,12735,effect/phenotype,Cough,HPO,DRUGBANK:DB00583,HP:0012735,biolink:Drug,biolink:DiseaseOrPhenotypicFeature,biolink:has_side_effect,PrimeKG,knowledge_assertion,automated_agent
3348344,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,25201,2321,effect/phenotype,Vertigo,HPO,DRUGBANK:DB00583,HP:0002321,biolink:Drug,biolink:DiseaseOrPhenotypicFeature,biolink:has_side_effect,PrimeKG,knowledge_assertion,automated_agent


In [123]:
## Sanity check: inspect 'enzyme' edges (drug_protein relation)
## a non-empty df is expected here
## enzyme: proteins involved in drug metabolism
sanity_check_df = edges_filtered2_df[edges_filtered2_df['display_relation'] == 'enzyme']
sanity_check_df.head(10)

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source
321939,drug_protein,enzyme,14584,DB00130,drug,L-Glutamine,DrugBank,13982,27165,gene/protein,GLS2,NCBI
321940,drug_protein,enzyme,14585,DB11118,drug,Ammonia,DrugBank,13982,27165,gene/protein,GLS2,NCBI
321941,drug_protein,enzyme,14584,DB00130,drug,L-Glutamine,DrugBank,476,2162,gene/protein,F13A1,NCBI
321942,drug_protein,enzyme,14275,DB00997,drug,Doxorubicin,DrugBank,373,4843,gene/protein,NOS2,NCBI
321943,drug_protein,enzyme,14423,DB09237,drug,Levamlodipine,DrugBank,373,4843,gene/protein,NOS2,NCBI
321944,drug_protein,enzyme,14586,DB00157,drug,NADH,DrugBank,2499,128,gene/protein,ADH5,NCBI
321945,drug_protein,enzyme,14587,DB00898,drug,Ethanol,DrugBank,2499,128,gene/protein,ADH5,NCBI
321946,drug_protein,enzyme,14588,DB01020,drug,Isosorbide mononitrate,DrugBank,2499,128,gene/protein,ADH5,NCBI
321947,drug_protein,enzyme,14589,DB11077,drug,Polyethylene glycol 400,DrugBank,2499,128,gene/protein,ADH5,NCBI
321948,drug_protein,enzyme,14590,DB12612,drug,Ozanimod,DrugBank,2499,128,gene/protein,ADH5,NCBI
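
The checks above inspect one relation at a time. A grouped count over relation and endpoint types gives the same information for every relation in one table, including whether a relation appears in both directions (the sampled rows show 'side effect' edges both as drug to effect/phenotype and as effect/phenotype to drug). A minimal sketch on toy data; the helper name `relation_type_counts` is hypothetical.

```python
import pandas as pd

# Toy edges (hypothetical) using the PrimeKG column names from the outputs above
toy_edges = pd.DataFrame({
    'display_relation': ['side effect', 'side effect', 'enzyme', 'ppi'],
    'x_type': ['drug', 'effect/phenotype', 'drug', 'gene/protein'],
    'y_type': ['effect/phenotype', 'drug', 'gene/protein', 'gene/protein'],
})

def relation_type_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count edges per (display_relation, x_type, y_type) combination."""
    return (df.groupby(['display_relation', 'x_type', 'y_type'])
              .size()
              .reset_index(name='n_edges'))

counts = relation_type_counts(toy_edges)
print(counts)
```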


* **Question 4:** general mapping for predicates
* ppi (protein-protein interaction): use the upper-level interacts_with predicate, or the more specific genetically_interacts_with?
* option: use physically_interacts_with
* carrier: use 'biolink:can_be_carried_out_by'?
* enzyme?
* indication?
* synergistic interaction?
* parent-child (between two phenotype nodes): should I use biolink:broad_match?
* phenotype absent: the opposite of has_phenotype, or just ignore?

In [104]:
## Convert display_relation values to Biolink-compliant predicates
## See the PrimeKG supplementary table for descriptions of the relations
## Discussion points for the Translator data ingestion group meeting:
##   - separate out off-label drug usage, clinical trial, and FDA-approved usage
##   - for 'synergistic interaction', combine with knowledge from the LLM clinical trial attempts
##     to find drug combinations and the diseases they targeted
##   - currently exclude 'associated with'
##   - no need to include 'phenotype absent'

list_of_display_relations = ['ppi', 'carrier', 'enzyme', 'target', 'transporter', 'contraindication',
'indication', 'off-label use', 'synergistic interaction', 'associated with',
'parent-child', 'phenotype absent', 'phenotype present', 'side effect']

relation_predicate_map = {
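    'ppi': 'biolink:interacts_with',
    'carrier': 'biolink:can_be_carried_out_by',
    'enzyme': '',  # predicate not decided yet (see Question 4)
    'target': 'biolink:target_for',
    # NOTE: CamelCase names are Biolink classes (here an association), not predicates; revisit this mapping
    'transporter': 'biolink:GeneAffectsChemicalAssociation',
    'contraindication': 'biolink:has_contraindication',
    'indication': '',  # predicate not decided yet (see Question 4)
    'off-label use': 'biolink:treats',
    'synergistic interaction': '',  # predicate not decided yet (see Question 4)
    'associated with': 'biolink:associated_with',
    'parent-child': 'biolink:broad_match',
    'phenotype absent': '',  # left empty on purpose for now (see Question 4)
    'phenotype present': 'biolink:has_phenotype',
    'side effect': 'biolink:has_side_effect',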
}

## Initialize the new column with default values (the original display_relation as a string)
edges_deep_copy_df['predicate'] = edges_deep_copy_df['display_relation'].astype(str)

## Overwrite only the rows whose display_relation is in the list above
mask = edges_deep_copy_df['display_relation'].isin(list_of_display_relations)
edges_deep_copy_df.loc[mask, 'predicate'] = (
    edges_deep_copy_df.loc[mask, 'display_relation'].map(relation_predicate_map)
)

sampled_df = edges_deep_copy_df.groupby(['display_relation']).sample(n=1, random_state=151)  # Set random_state for reproducibility
sampled_df

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source,subject,object,subject_category,object_category,predicate
3259526,disease_protein,associated with,2429,596,gene/protein,BCL2,NCBI,94633,6402,disease,salivary gland basal cell adenocarcinoma,MONDO,NCBIGene:596,MONDO:0006402,biolink:Gene,biolink:Disease,biolink:associated_with
5708531,drug_protein,carrier,4735,5004,gene/protein,ORM1,NCBI,14549,DB01429,drug,Aprindine,DrugBank,NCBIGene:5004,DRUGBANK:DB01429,biolink:Gene,biolink:Drug,biolink:can_be_carried_out_by
5771132,contraindication,contraindication,35773,21148,disease,female reproductive system neoplasm,MONDO,14235,DB00783,drug,Estradiol,DrugBank,MONDO:0021148,DRUGBANK:DB00783,biolink:Disease,biolink:Drug,biolink:has_contraindication
327071,drug_protein,enzyme,15255,DB13174,drug,Rhein,DrugBank,12434,1551,gene/protein,CYP3A7,NCBI,DRUGBANK:DB13174,NCBIGene:1551,biolink:Drug,biolink:Gene,
369124,indication,indication,15090,DB00349,drug,Clobazam,DrugBank,29865,16532,disease,Lennox-Gastaut syndrome,MONDO,DRUGBANK:DB00349,MONDO:0016532,biolink:Drug,biolink:Disease,
389165,off-label use,off-label use,15218,DB06819,drug,Phenylbutyric acid,DrugBank,28466,9475,disease,isovaleric acidemia,MONDO,DRUGBANK:DB06819,MONDO:0009475,biolink:Drug,biolink:Disease,biolink:treats
6033771,disease_disease,parent-child,96078,2996,disease,cavernous sinus meningioma,MONDO,35912,4634,disease,vein disease,MONDO,MONDO:0002996,MONDO:0004634,biolink:Disease,biolink:Disease,biolink:broad_match
3084660,disease_phenotype_negative,phenotype absent,27557,7357,disease,colonic varices without portal hypertension,MONDO,23279,1392,effect/phenotype,Abnormality of the liver,HPO,MONDO:0007357,HP:0001392,biolink:Disease,biolink:DiseaseOrPhenotypicFeature,
5872212,disease_phenotype_positive,phenotype present,22488,4322,effect/phenotype,Short stature,HPO,31854,12496,disease,Koolen de Vries syndrome,MONDO,HP:0004322,MONDO:0012496,biolink:DiseaseOrPhenotypicFeature,biolink:Disease,biolink:has_phenotype
262128,protein_protein,ppi,164,7329,gene/protein,UBE2I,NCBI,2743,55145,gene/protein,THAP1,NCBI,NCBIGene:7329,NCBIGene:55145,biolink:Gene,biolink:Gene,biolink:interacts_with
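
Several relations ('enzyme', 'indication', 'synergistic interaction', 'phenotype absent') map to an empty predicate, so the corresponding edges carry no predicate at this point. A small tally of how many edges each unresolved relation accounts for can help decide which mapping to settle first. This is a sketch on toy data, with a shortened copy of the map; the helper name `missing_predicate_counts` is hypothetical.

```python
import pandas as pd

# Shortened copy of relation_predicate_map, for illustration only
relation_predicate_map = {
    'ppi': 'biolink:interacts_with',
    'enzyme': '',
    'indication': '',
    'side effect': 'biolink:has_side_effect',
}

# Toy edges (hypothetical)
toy = pd.DataFrame({
    'display_relation': ['ppi', 'enzyme', 'enzyme', 'indication', 'side effect'],
})
toy['predicate'] = toy['display_relation'].map(relation_predicate_map)

def missing_predicate_counts(df: pd.DataFrame) -> pd.Series:
    """Count edges per display_relation whose predicate is empty or NaN."""
    missing = df['predicate'].isna() | (df['predicate'] == '')
    return df.loc[missing, 'display_relation'].value_counts()

pending = missing_predicate_counts(toy)
print(pending)
```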


In [107]:
## add a new knowledge_souce column and set value to be "PrimeKG"
## NOTE: the column name is misspelled; the shared column list for this group uses 'knowledge_source'
edges_deep_copy_df['knowledge_souce'] = 'PrimeKG'
## add a new knowledge_level column and set value to be 'knowledge_assertion'
edges_deep_copy_df['knowledge_level'] = 'knowledge_assertion'
## add a new agent_type column and set value to be 'automated_agent'
edges_deep_copy_df['agent_type'] = 'automated_agent'

## copy to a final df
edge_df = edges_deep_copy_df.copy(deep = True)

print(edge_df.shape[0])
## Remove rows where 'subject' or 'object' is NaN
edge_df = edge_df.dropna(subset=['subject', 'object'])

print(edge_df.shape[0])

edge_df['deploy_date'] = deployment_date
edge_df.head(5)

4009644
4009644


Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,...,y_source,subject,object,subject_category,object_category,predicate,knowledge_souce,knowledge_level,agent_type,deploy_date
0,protein_protein,ppi,0,9796,gene/protein,PHYHIP,NCBI,8889,56992,gene/protein,...,NCBI,NCBIGene:9796,NCBIGene:56992,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19
1,protein_protein,ppi,1,7918,gene/protein,GPANK1,NCBI,2798,9240,gene/protein,...,NCBI,NCBIGene:7918,NCBIGene:9240,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19
2,protein_protein,ppi,2,8233,gene/protein,ZRSR2,NCBI,5646,23548,gene/protein,...,NCBI,NCBIGene:8233,NCBIGene:23548,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19
3,protein_protein,ppi,3,4899,gene/protein,NRF1,NCBI,11592,11253,gene/protein,...,NCBI,NCBIGene:4899,NCBIGene:11253,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19
4,protein_protein,ppi,4,5297,gene/protein,PI4KA,NCBI,2122,8601,gene/protein,...,NCBI,NCBIGene:5297,NCBIGene:8601,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19
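
The row count is 4,009,644 both before and after `dropna`, so no subject or object is NaN. Note that `dropna` does not catch empty or whitespace-only strings. If any identifiers could have come through that way (an assumption, not something the output above shows), a stricter filter might look like the following sketch; the helper name `drop_missing_endpoints` is hypothetical.

```python
import numpy as np
import pandas as pd

def drop_missing_endpoints(df: pd.DataFrame, cols=('subject', 'object')) -> pd.DataFrame:
    """Drop rows where any endpoint column is NaN, empty, or whitespace-only."""
    cleaned = df.copy()
    for col in cols:
        # Turn blank strings into NaN so that dropna can remove them
        cleaned[col] = cleaned[col].replace(r'^\s*$', np.nan, regex=True)
    return cleaned.dropna(subset=list(cols))

# Toy edges (hypothetical): only the first row has two usable endpoints
toy = pd.DataFrame({
    'subject': ['NCBIGene:9796', '', None, 'MONDO:0008900'],
    'object': ['NCBIGene:56992', 'HP:0100490', 'HP:0000006', '  '],
})
kept = drop_missing_endpoints(toy)
print(len(toy), '->', len(kept))
```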


In [109]:
## create a context_qualifier column
## no edge here has a context qualifier, so fill the whole column with NaN
edge_df['context_qualifier'] = np.nan
edge_df.head(5)

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,...,subject,object,subject_category,object_category,predicate,knowledge_souce,knowledge_level,agent_type,deploy_date,context_qualifier
0,protein_protein,ppi,0,9796,gene/protein,PHYHIP,NCBI,8889,56992,gene/protein,...,NCBIGene:9796,NCBIGene:56992,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,
1,protein_protein,ppi,1,7918,gene/protein,GPANK1,NCBI,2798,9240,gene/protein,...,NCBIGene:7918,NCBIGene:9240,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,
2,protein_protein,ppi,2,8233,gene/protein,ZRSR2,NCBI,5646,23548,gene/protein,...,NCBIGene:8233,NCBIGene:23548,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,
3,protein_protein,ppi,3,4899,gene/protein,NRF1,NCBI,11592,11253,gene/protein,...,NCBIGene:4899,NCBIGene:11253,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,
4,protein_protein,ppi,4,5297,gene/protein,PI4KA,NCBI,2122,8601,gene/protein,...,NCBIGene:5297,NCBIGene:8601,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,


In [111]:
import uuid
import pandas as pd

## generate uuid from column combination
def generate_uuid_from_columns(df, column_list, namespace=uuid.NAMESPACE_DNS):
    """
    Generates one UUID per row from the values in the specified columns of a Pandas DataFrame.

    Args:
        df (pd.DataFrame): The input DataFrame.
        column_list (list): List of all names of columns to use for UUID generation.
        namespace (uuid.UUID): A UUID namespace (default is uuid.NAMESPACE_DNS).

    Returns:
        pd.Series: A Pandas Series containing the generated UUIDs as hex strings.
    """
    # Concatenate the stringified values row by row, then hash each combined string.
    # (Without axis=1, apply would hash whole columns instead of rows.)
    combined = df[column_list].astype(str).agg(''.join, axis=1)
    return combined.map(lambda s: uuid.uuid5(namespace, s).hex)

def generate_uuid(row):
    """
    Generates a version-5 UUID from the concatenated string values of one row.
    NaN values are stringified as 'nan'; identical row values always give the same UUID.
    """
    combined_string = ''.join(row.astype(str))
    return uuid.uuid5(uuid.NAMESPACE_DNS, combined_string)

In [112]:
### Add an id column (resource id) derived from the columns that define an edge
column_list = ['subject', 'predicate', 'object', 'context_qualifier', 'deploy_date']
# Apply the function to each row to generate UUIDs
edge_df['id'] = edge_df[column_list].apply(generate_uuid, axis=1)

# edge_df['id'] = generate_uuid_from_columns(edge_df, column_list)
edge_df.head(5)

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,...,object,subject_category,object_category,predicate,knowledge_souce,knowledge_level,agent_type,deploy_date,context_qualifier,id
0,protein_protein,ppi,0,9796,gene/protein,PHYHIP,NCBI,8889,56992,gene/protein,...,NCBIGene:56992,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,,82e8043f-e62c-52b6-a9ad-b7119a414cd7
1,protein_protein,ppi,1,7918,gene/protein,GPANK1,NCBI,2798,9240,gene/protein,...,NCBIGene:9240,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,,f0841a15-227c-5cd0-b4ee-6a6c1d824d28
2,protein_protein,ppi,2,8233,gene/protein,ZRSR2,NCBI,5646,23548,gene/protein,...,NCBIGene:23548,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,,3f646090-b060-5616-ba78-3c9446a1c03c
3,protein_protein,ppi,3,4899,gene/protein,NRF1,NCBI,11592,11253,gene/protein,...,NCBIGene:11253,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,,0c29dccf-88cf-5ad5-bab3-d1c705ffb3fd
4,protein_protein,ppi,4,5297,gene/protein,PI4KA,NCBI,2122,8601,gene/protein,...,NCBIGene:8601,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,,7ece0de4-0b62-570c-b257-03b3f698fbe3
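
`uuid.uuid5` is deterministic: the same namespace and name always give the same UUID, which is what makes these ids reproducible across runs. One caveat of joining the values with no separator is that different value combinations can produce the same string. The sketch below illustrates both points with a hypothetical helper, `edge_uuid`, that takes a separator; switching the notebook to a separator would change every id shown above.

```python
import uuid

def edge_uuid(values, sep='|', namespace=uuid.NAMESPACE_DNS):
    """Build a deterministic UUID from a sequence of values joined by a separator."""
    key = sep.join(str(v) for v in values)
    return uuid.uuid5(namespace, key)

edge = ['NCBIGene:9796', 'biolink:interacts_with', 'NCBIGene:56992']
first = edge_uuid(edge)
second = edge_uuid(edge)

# Without a separator, 'NCBIGene:1' + '23' and 'NCBIGene:12' + '3' are the same string
no_sep_1 = edge_uuid(['NCBIGene:1', '23'], sep='')
no_sep_2 = edge_uuid(['NCBIGene:12', '3'], sep='')

# With a separator the two keys differ
with_sep_1 = edge_uuid(['NCBIGene:1', '23'])
with_sep_2 = edge_uuid(['NCBIGene:12', '3'])

print('deterministic:', first == second)
print('collision without separator:', no_sep_1 == no_sep_2)
print('collision with separator:', with_sep_1 == with_sep_2)
```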


### Now create the corresponding node file
* only need three columns: id, name, category

In [113]:
## select the id, name, and category columns and rename them into the desired format
## (rename returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
node_subject_df = edge_df[['subject', 'x_name', 'subject_category']].rename(
    columns={'subject': 'id', 'x_name': 'name', 'subject_category': 'category'})
node_object_df = edge_df[['object', 'y_name', 'object_category']].rename(
    columns={'object': 'id', 'y_name': 'name', 'object_category': 'category'})

## stack subject and object nodes and keep one row per unique (id, name, category)
concat_node_df = pd.concat([node_subject_df, node_object_df]).drop_duplicates(keep='first')

concat_node_df.head()

Unnamed: 0,id,name,category
0,NCBIGene:9796,PHYHIP,biolink:Gene
1,NCBIGene:7918,GPANK1,biolink:Gene
2,NCBIGene:8233,ZRSR2,biolink:Gene
3,NCBIGene:4899,NRF1,biolink:Gene
4,NCBIGene:5297,PI4KA,biolink:Gene
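
`drop_duplicates` removes only rows that are identical in all three columns, so a node id that appears with two different names or categories would still occupy two rows in the node file. A quick consistency check, sketched on toy data with a hypothetical helper `conflicting_nodes`:

```python
import pandas as pd

def conflicting_nodes(nodes: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose id appears with more than one name or category."""
    per_id = nodes.groupby('id')[['name', 'category']].nunique()
    conflict_ids = per_id[(per_id > 1).any(axis=1)].index
    return nodes[nodes['id'].isin(conflict_ids)].sort_values('id')

# Toy node table (hypothetical): the first id has two spellings of its name
toy_nodes = pd.DataFrame({
    'id': ['NCBIGene:9796', 'NCBIGene:9796', 'HP:0000083'],
    'name': ['PHYHIP', 'PHYHIP ', 'Renal insufficiency'],
    'category': ['biolink:Gene', 'biolink:Gene', 'biolink:DiseaseOrPhenotypicFeature'],
})
conflicts = conflicting_nodes(toy_nodes)
print(conflicts)
```

An empty result means every id maps to exactly one name and one category.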


In [115]:
## drop no longer needed columns from edge df
print(edge_df.columns.tolist())
drop_cols = ['relation', 'display_relation', 'x_index', 'x_id', 'x_type', 'x_name', 'x_source', 'y_index', 'y_id', 'y_type', 'y_name', 'y_source']

edge_output_df = edge_df.drop(drop_cols, axis=1)

edge_output_df.head()

['relation', 'display_relation', 'x_index', 'x_id', 'x_type', 'x_name', 'x_source', 'y_index', 'y_id', 'y_type', 'y_name', 'y_source', 'subject', 'object', 'subject_category', 'object_category', 'predicate', 'knowledge_souce', 'knowledge_level', 'agent_type', 'deploy_date', 'context_qualifier', 'id']


Unnamed: 0,subject,object,subject_category,object_category,predicate,knowledge_souce,knowledge_level,agent_type,deploy_date,context_qualifier,id
0,NCBIGene:9796,NCBIGene:56992,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,,82e8043f-e62c-52b6-a9ad-b7119a414cd7
1,NCBIGene:7918,NCBIGene:9240,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,,f0841a15-227c-5cd0-b4ee-6a6c1d824d28
2,NCBIGene:8233,NCBIGene:23548,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,,3f646090-b060-5616-ba78-3c9446a1c03c
3,NCBIGene:4899,NCBIGene:11253,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,,0c29dccf-88cf-5ad5-bab3-d1c705ffb3fd
4,NCBIGene:5297,NCBIGene:8601,biolink:Gene,biolink:Gene,biolink:interacts_with,PrimeKG,knowledge_assertion,automated_agent,2025-05-19,,7ece0de4-0b62-570c-b257-03b3f698fbe3
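
The column list stated for this group uses `knowledge_source`, while the edge table above carries the misspelled `knowledge_souce`. A minimal sketch of renaming the column and moving the core columns to the front before export follows. The helper name `tidy_edge_columns` and the chosen column order are assumptions; `publications` is left out because this edge table has no such column.

```python
import pandas as pd

# Core columns to place first (order is an assumption for illustration)
LEADING_COLS = ['subject', 'predicate', 'object', 'subject_category', 'object_category',
                'knowledge_source', 'knowledge_level', 'agent_type']

def tidy_edge_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename the misspelled column and put the core columns first."""
    renamed = df.rename(columns={'knowledge_souce': 'knowledge_source'})
    leading = [c for c in LEADING_COLS if c in renamed.columns]
    trailing = [c for c in renamed.columns if c not in leading]
    return renamed[leading + trailing]

# Toy edge row (values taken from the first row of the output above)
toy_edges = pd.DataFrame([{
    'subject': 'NCBIGene:9796', 'object': 'NCBIGene:56992',
    'subject_category': 'biolink:Gene', 'object_category': 'biolink:Gene',
    'predicate': 'biolink:interacts_with', 'knowledge_souce': 'PrimeKG',
    'knowledge_level': 'knowledge_assertion', 'agent_type': 'automated_agent',
    'deploy_date': '2025-05-19',
}])
tidy = tidy_edge_columns(toy_edges)
print(tidy.columns.tolist())
```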


In [117]:
## Write the node and edge DataFrames to tab-separated files
concat_node_df.to_csv(download_path_node_file, sep ='\t', index=False)
edge_output_df.to_csv(download_path_edge_file, sep ='\t', index=False)

In [118]:
print("The formatted node file will be saved in this path: ", download_path_node_file)
print("The formatted edge file will be saved in this path: ", download_path_edge_file)

The formatted node file will be saved in this path:  /Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/primeKG_parsed_node_05_19_2025.tsv
The formatted edge file will be saved in this path:  /Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/primeKG_parsed_edge_05_19_2025.tsv


In [119]:
## Create a graph from the DataFrame
## nx.Graph is undirected and holds one edge per node pair, so duplicate and reversed
## (subject, object) rows collapse into a single edge
graph = nx.from_pandas_edgelist(edge_df, 'subject', 'object', edge_attr='predicate')

## Print graph information
print('Number of nodes', graph.number_of_nodes())
print('Number of edges', graph.number_of_edges())
print('Average degree', sum(dict(graph.degree).values()) / graph.number_of_nodes())

Number of nodes 57521
Number of edges 2004436
Average degree 69.69405956085603
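
The edge table has 4,009,644 rows, but the graph reports 2,004,436 edges. The default `nx.Graph` is undirected and keeps a single edge per node pair, so reversed pairs and pairs linked by more than one predicate are merged. If direction and parallel edges should be counted, `create_using=nx.MultiDiGraph` keeps one graph edge per row. A minimal sketch on toy data (the node names `A` and `B` are placeholders):

```python
import networkx as nx
import pandas as pd

# Toy edges (hypothetical): A-B in both directions, plus a second predicate on A->B
toy_edges = pd.DataFrame({
    'subject': ['A', 'B', 'A'],
    'object': ['B', 'A', 'B'],
    'predicate': ['biolink:interacts_with', 'biolink:interacts_with', 'biolink:associated_with'],
})

# Undirected simple graph: all three rows collapse into one edge
undirected = nx.from_pandas_edgelist(toy_edges, 'subject', 'object', edge_attr='predicate')

# Directed multigraph: every row becomes its own edge
multi = nx.from_pandas_edgelist(toy_edges, 'subject', 'object', edge_attr='predicate',
                                create_using=nx.MultiDiGraph)

print('undirected edges:', undirected.number_of_edges())
print('multi-directed edges:', multi.number_of_edges())
```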
