## Group 1: merge and biolink format all tsv files
1. Covers tsv files with the predicates biolink:physically_interacts_with and biolink:gene_associated_with_condition.

2. The corresponding config.json file is config_bigGIM_interacts_with_associated_with.

3. All files share the same columns: "subject, predicate, object, agent_type, knowledge_level, knowledge_source, object_category, publications, subject_category".

4. The full list of tsv files handled in this group is:


In [1]:
## To do list:
## Add signor and cell_marker_genes.csv into pharmacogenomics KG
    ## Done
## Check overlap with BigGIM KG and the best way to remove redundant edges
    ## 0 duplicated edges when checking the same combination of subject, object and predicate
## Check which part belongs to AML
    ## 652 edges related to AML

In [2]:
## Load necessary packages
import os
import pandas as pd
import glob
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

## Define the version number
version_number = "07_08_2025"
deployment_date = "2025-07-08"

In [3]:
## Load the Biolink category and predicate dictionary for mapping subject, object, and predicate types
%run ./Biolink_category_and_predication_dictionary.ipynb

Date of last update:  2025-07-08
Order is to always process Node/category map first, since the Edeg/predicate map depends on biolink-complainat node values
-----------------------------------------------------------------------------------------------------------------------------
Dictionary: category_map, Key template: Subject_category or Object_category
------------------------------------------------------------------------------------------
Dictionary: predicate_map, Key template: (Subject_category, Object_category, Predicate)


In [4]:
# print(category_map)

## Load files and convert them into separate node & edge files
* check all imported file structure

In [5]:
## Note: change the file paths in the code below to match your own environment
raw_files_path = '/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/primeKG/dataverse_files/'

## Define the output path for node & edge files after formatting
download_path_node_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/primeKG_parsed_node_{version_number}.tsv'
download_path_edge_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/primeKG_parsed_edge_{version_number}.tsv'

In [6]:
## Check which files are available to read
## List all PrimeKG csv files found in raw_files_path

for f in os.listdir(raw_files_path):
    if f.endswith('.csv'):
        print(f)

disease_features.csv
nodes.csv
kg_grouped.csv
kg_grouped_diseases.csv
kg_grouped_diseases_bert_map.csv
kg_giant.csv
kg.csv
drug_features.csv
kg_raw.csv
edges.csv


In [7]:
# -----------------------------------------------------------------------------------------------
# Filename				Description
# -----------------------------------------------------------------------------------------------
# nodes.csv				Contains node level information
# 					Primary key: `node_index`
# -----------------------------------------------------------------------------------------------
# edges.csv				Contains undirected relationships between nodes 
# 					Primary key: (`x_index`, `y_index`)
# -----------------------------------------------------------------------------------------------
# kg.csv					This is the Precision Medicine knowledge graph  
# 					Primary key: (`x_index`, `y_index`)
# -----------------------------------------------------------------------------------------------
# disease_features.csv			Contains textual descriptions of diseases 
# 					Primary key: `node_index`
# -----------------------------------------------------------------------------------------------
# drug_features.csv			Contains textual descriptions of drugs 
# 					Primary key: `node_index`
# -----------------------------------------------------------------------------------------------
# kg_raw.csv 				Intermediate PrimeKG made by joining nodes and edges
# -----------------------------------------------------------------------------------------------
# kg_giant.csv				Intermediate PrimeKG made by taking LCC of kg_raw.csv 
# -----------------------------------------------------------------------------------------------
# kg_grouped.csv				Intermediate PrimeKG made by grouping diseases  
# -----------------------------------------------------------------------------------------------
# kg_grouped_diseases.csv			List of all diseases and their assigned group name  
# -----------------------------------------------------------------------------------------------
# kg_grouped_diseases_bert_map.csv	Manual grouping created for diseases using BERT model
# ---------------------------------------------------------------------------------------------

In [8]:
## Read the nodes csv file
nodes_df = pd.read_csv(raw_files_path + 'nodes.csv')

nodes_df.head(10)

Unnamed: 0,node_index,node_id,node_type,node_name,node_source
0,0,9796,gene/protein,PHYHIP,NCBI
1,1,7918,gene/protein,GPANK1,NCBI
2,2,8233,gene/protein,ZRSR2,NCBI
3,3,4899,gene/protein,NRF1,NCBI
4,4,5297,gene/protein,PI4KA,NCBI
5,5,6564,gene/protein,SLC15A1,NCBI
6,6,8668,gene/protein,EIF3I,NCBI
7,7,10826,gene/protein,FAXDC2,NCBI
8,8,4489,gene/protein,MT1A,NCBI
9,9,6272,gene/protein,SORT1,NCBI


In [9]:
## check unique node_type and their node_source
unique_node_type_values = nodes_df['node_type'].unique()
print("All possible node_type are here: " ,unique_node_type_values)

unique_node_source_values = nodes_df['node_source'].unique()
print("All possible node_source are here: " ,unique_node_source_values)

All possible node_type are here:  ['gene/protein' 'drug' 'effect/phenotype' 'disease' 'biological_process'
 'molecular_function' 'cellular_component' 'exposure' 'pathway' 'anatomy']
All possible node_source are here:  ['NCBI' 'DrugBank' 'HPO' 'MONDO_grouped' 'MONDO' 'GO' 'CTD' 'REACTOME'
 'UBERON']


### Implementation notes:
* Essential node types to add: gene/protein, drug, disease, effect/phenotype
* Good-to-have node types: pathway, exposure (more on the environmental side), anatomy (only relevant if we need spatial information, e.g. cancer in a specific organ)
* Maybe not for now: biological_process, molecular_function, cellular_component
* Follow the example inputs for the Node Normalizer service:
* https://github.com/TranslatorSRI/NodeNormalization/blob/master/documentation/NodeNormalization.ipynb
* For MONDO codes: change the format to MONDO:XXXXXXX; most raw IDs have fewer than 7 digits, so they need to be left-padded with zeros
* Discussion: how to deal with MONDO_grouped? See Question 1 below
* For NCBI codes: add the NCBIGene: prefix
* For HPO codes: change the format to HP:XXXXXXX (7 digits)
* For DrugBank: need to check whether Translator accepts DB: codes
* For GO: change the format to GO:XXXXXXX (7 digits)
* For CTD: change format to CID:, add prefix
* For REACTOME:
* For UBERON: change the format to UBERON:XXXXXXX (7 digits)
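
The identifier rules in these notes can be sketched as a small helper. This is only a minimal sketch, not the notebook's final implementation: the names `CURIE_RULES` and `to_curie` are hypothetical, GO and UBERON are assumed to take their own `GO:` and `UBERON:` prefixes, DrugBank is assumed to use the `DRUGBANK:` prefix seen in the Node Normalizer examples, and CTD, REACTOME and MONDO_grouped are left out because their handling is still open.

```python
# Hypothetical rule table: node_source -> (prefix, zero-pad width or None)
CURIE_RULES = {
    "MONDO": ("MONDO:", 7),
    "HPO": ("HP:", 7),
    "GO": ("GO:", 7),
    "UBERON": ("UBERON:", 7),
    "NCBI": ("NCBIGene:", None),
    "DrugBank": ("DRUGBANK:", None),
}


def to_curie(node_id, node_source):
    """Return a prefixed identifier, or None when no rule exists for the source."""
    rule = CURIE_RULES.get(node_source)
    if rule is None:
        return None  # e.g. CTD, REACTOME, MONDO_grouped: still undecided
    prefix, width = rule
    local_id = str(node_id)
    if width is not None:
        local_id = local_id.zfill(width)  # left-pad with zeros to the fixed width
    return prefix + local_id


# IDs taken from sample rows shown in this notebook
for raw_id, source in [(23137, "MONDO"), (9598, "HPO"), (3476, "UBERON"),
                       ("DB06980", "DrugBank"), (348793, "NCBI"), ("C091375", "CTD")]:
    print(source, raw_id, "->", to_curie(raw_id, source))
```

Returning `None` for sources without a rule keeps the undecided cases visible instead of silently passing raw IDs through.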

In [10]:
sanity_check_df = nodes_df[nodes_df['node_type'] == 'effect/phenotype']

sanity_check_df.head(15)

Unnamed: 0,node_index,node_id,node_type,node_name,node_source
22117,22117,1507,effect/phenotype,Growth abnormality,HPO
22118,22118,107,effect/phenotype,Renal cyst,HPO
22119,22119,1,effect/phenotype,All,HPO
22120,22120,5,effect/phenotype,Mode of inheritance,HPO
22121,22121,10460,effect/phenotype,Abnormality of the female genitalia,HPO
22122,22122,812,effect/phenotype,Abnormal internal genitalia,HPO
22123,22123,14,effect/phenotype,Abnormality of the bladder,HPO
22124,22124,2719,effect/phenotype,Recurrent infections,HPO
22125,22125,11277,effect/phenotype,Abnormality of the urinary system physiology,HPO
22126,22126,8684,effect/phenotype,Aplasia/hypoplasia of the uterus,HPO


In [11]:
## Sanity check on the current format of the file

# Randomly select one row for each unique node_type
sampled_df = nodes_df.groupby('node_type').sample(n=1, random_state=151)  # Set random_state for reproducibility

print(sampled_df)

        node_index        node_id           node_type  \
65751        65751           3476             anatomy   
114674      114674        1902367  biological_process   
124851      124851          70913  cellular_component   
99417        99417          23137             disease   
19830        19830        DB06980                drug   
87918        87918           9598    effect/phenotype   
61700        61700        C091375            exposure   
34206        34206         348793        gene/protein   
115725      115725          36200  molecular_function   
128395      128395  R-HSA-1368082             pathway   

                                                node_name node_source  
65751              respiratory system venous blood vessel      UBERON  
114674  negative regulation of Notch signaling pathway...          GO  
124851                                 Ddb1-Wdr21 complex          GO  
99417                        feigenbaum Bergeron syndrome       MONDO  
19830       

### Open questions for ingesting PrimeKG into the Pharmacogenomics KG
* **Question 1:** How should we handle the grouped MONDO codes?

In [12]:
nodes_MONDO_grouped = nodes_df[nodes_df['node_source'] == 'MONDO_grouped']

nodes_MONDO_grouped.head(10)

Unnamed: 0,node_index,node_id,node_type,node_name,node_source
27158,27158,13924_12592_14672_13460_12591_12536_30861_8146...,disease,osteogenesis imperfecta,MONDO_grouped
27159,27159,11160_13119_13978_12060_12327_12670_13210_1106...,disease,autosomal recessive nonsyndromic deafness,MONDO_grouped
27160,27160,8099_12497_12498,disease,congenital stationary night blindness autosoma...,MONDO_grouped
27161,27161,14854_14293_14470_12380_11832_14603_14853_1176...,disease,autosomal dominant nonsyndromic deafness,MONDO_grouped
27162,27162,33202_32776_30905_33670_33200_32740_32732_3320...,disease,"deafness, autosomal recessive",MONDO_grouped
27163,27163,11396_7422,disease,keratoderma hereditarium mutilans,MONDO_grouped
27164,27164,14828_14829_9454_13553_133,disease,immunodeficiency-centromeric instability-facia...,MONDO_grouped
27167,27167,9260_9261_9262_18149,disease,GM1 gangliosidosis,MONDO_grouped
27170,27170,14083_13288_12987_13287_13289_20729_14840_1329...,disease,agammaglobulinemia,MONDO_grouped
27173,27173,14986_14987_10351_10953_13565_9213_12565_13248...,disease,Fanconi anemia complementation group,MONDO_grouped
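
One possible way to approach Question 1, shown only as a sketch: a MONDO_grouped `node_id` appears to be several numeric MONDO IDs joined with underscores, so it can be split into individual `MONDO:` identifiers. The helper name `split_mondo_grouped` is hypothetical. Whether each member should become its own node, or the whole group should map to a single parent term, remains the open question.

```python
def split_mondo_grouped(grouped_id):
    """Split an underscore-joined grouped ID into zero-padded MONDO identifiers."""
    parts = str(grouped_id).split("_")
    return ["MONDO:" + part.zfill(7) for part in parts if part]


# Untruncated IDs copied from the MONDO_grouped rows above
print(split_mondo_grouped("8099_12497_12498"))
print(split_mondo_grouped("11396_7422"))
```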


* **Question 2:** How, if at all, should we integrate the additional drug & disease feature information into our KG?
* Sample information available from disease features:
    - mondo_id	mondo_name	group_id_bert	group_name_bert	mondo_definition	umls_description	orphanet_definition	**orphanet_prevalence**	orphanet_epidemiology	orphanet_clinical_description	**orphanet_management_and_treatment**	mayo_symptoms	mayo_causes	mayo_risk_factors	mayo_complications	mayo_prevention	mayo_see_doc
    - 12345	acral peeling skin syndrome			Acral peeling skin syndrome (PSS) is a form of PSS characterized by superficial peeling of the skin predominantly affecting the dorsa of the hands and feet.		A rare peeling skin syndrome characterized by superficial peeling of the skin predominantly affecting the dorsa of the hands and feet.	**<1/1000000**	Acral PSS is rare, with approximately 40 cases described in the literature to date.	The disease manifests shortly after birth or in early childhood with superficial peeling on the palmar, plantar and dorsal surfaces of the hands and feet, that leaves residual painless erythema. Manual skin removal is also possible. Seasonal variations are generally observed. Heat, humidity, exposure to water and friction or minor trauma can induce exfoliation. The lesions are not painful and heal without scarring.	**There is no effective treatment.** Emollients are often used to reduce skin peeling. Patients must avoid immersion in water and are recommended to use absorbing powders or aluminum antiperspirants.						
* Sample information available from drug features:
    - description	**half_life**	indication	**mechanism_of_action**	protein_binding	pharmacodynamics	state	atc_1	atc_2	atc_3	atc_4	category	group	pathway	molecular_weight	tpsa	**clogp**
    - Budesonide is a glucocorticoid that is a mix of the 22R and 22S epimer used to treat inflammatory conditions of the lungs and intestines such as asthma, COPD, Crohn's disease, and ulcerative colitis.	**Budesonide has a plasma elimination half life of 2-3.6h. The terminal elimination half life in asthmatic children 4-6 years old is 2.3h.**	Budesonide extended release capsules are indicated for the treatment and maintenance of mild to moderate Crohn’s disease. Various inhaled budesonide products are indicated for prophylactic therapy in asthma and reducing exacerbations of COPD. A budesonide nasal spray is available over the counter for symptoms of hay fever and upper respiratory allergies. Extended release capsules are indicated to induce remission of mild to moderate ulcerative colitis and a rectal foam is used for mild to moderate distal ulcerative colitis.	**The short term effects of corticosteroids are decreased vasodilation and permeability of capillaries, as well as decreased leukocyte migration to sites of inflammation.** Corticosteroids binding to the glucocorticoid receptor mediates changes in gene expression that lead to multiple downstream effects over hours to days.	Corticosteroids are generally bound to corticosteroid binding globulin and serum albumin in plasma. Budesonide is 85-90% protein bound in plasma.	Budesonide is a glucocorticoid used to treat respiratory and digestive conditions by reducing inflammation. It has a wide therapeutic index, as dosing varies highly from patient to patient. Patients should be counselled regarding the risk of hypercorticism and adrenal axis suppression.	Budesonide is a solid.	Budesonide is anatomically related to dermatologicals and respiratory system and respiratory system and alimentary tract and metabolism and respiratory system.	
Budesonide is in the therapeutic group of corticosteroids, dermatological preparations and nasal preparations and drugs for obstructive airway diseases and antidiarrheals, intestinal antiinflammatory/antiinfective agents and drugs for obstructive airway diseases.	Budesonide is pharmacologically related to corticosteroids, plain and decongestants and other nasal preparations for topical use and adrenergics, inhalants and intestinal antiinflammatory agents and other drugs for obstructive airway diseases, inhalants.	The chemical and functional group of  is corticosteroids, potent (group iii) and corticosteroids and adrenergics in combination with corticosteroids or other drugs, excl. anticholinergics and corticosteroids acting locally and glucocorticoids.	Budesonide is part of Adrenal Cortex Hormones ; Adrenals ; Agents to Treat Airway Disease ; Alimentary Tract and Metabolism ; Anti-Asthmatic Agents ; Anti-Inflammatory Agents ; Antidiarrheals, Intestinal Antiinflammatory/antiinfective Agents ; Autonomic Agents ; Bronchodilator Agents ; BSEP/ABCB11 Substrates ; Corticosteroid Hormone Receptor Agonists ; Corticosteroids ; Corticosteroids Acting Locally ; Corticosteroids for Systemic Use ; Corticosteroids, Dermatological Preparations ; Corticosteroids, Potent (Group III) ; Cytochrome P-450 CYP2A6 Inducers ; Cytochrome P-450 CYP2B6 Inducers ; Cytochrome P-450 CYP2B6 Inducers (strength unknown) ; Cytochrome P-450 CYP2C19 Inducers ; Cytochrome P-450 CYP2C19 Inducers (strength unknown) ; Cytochrome P-450 CYP2C8 Inducers ; Cytochrome P-450 CYP2C8 Inducers (strength unknown) ; Cytochrome P-450 CYP2C9 Inducers ; Cytochrome P-450 CYP2C9 Inducers (strength unknown) ; Cytochrome P-450 CYP3A Inducers ; Cytochrome P-450 CYP3A Substrates ; Cytochrome P-450 CYP3A4 Inducers ; Cytochrome P-450 CYP3A4 Inducers (strength unknown) ; Cytochrome P-450 CYP3A4 Substrates ; Cytochrome P-450 CYP3A5 Inducers ; Cytochrome P-450 CYP3A5 Inducers (moderate) ; Cytochrome P-450 Enzyme Inducers ; 
Cytochrome P-450 Substrates ; Dermatologicals ; Drugs for Obstructive Airway Diseases ; Drugs that are Mainly Renally Excreted ; Fused-Ring Compounds ; Hormones ; Hormones, Hormone Substitutes, and Hormone Antagonists ; Immunosuppressive Agents ; Intestinal Antiinflammatory Agents ; Nasal Preparations ; OAT3/SLC22A8 Substrates ; P-glycoprotein substrates ; Peripheral Nervous System Agents ; Pregnanes ; Pregnenediones ; Pregnenes ; Respiratory System Agents ; Steroids.	Budesonide is approved.		The molecular weight is 430.54.	Budesonide has a topological polar surface area of 93.06.	**The log p value of  is 2.9.**
    - clogp: a predicted value for the partition coefficient (LogP) of a molecule
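
If the numeric drug features were to become node properties, they would first have to be pulled out of the sentence templates shown above. The following is a minimal sketch, assuming the value is always the last number in the sentence; the helper name `extract_trailing_number` is hypothetical, and free-text fields such as `half_life` do not fit this pattern.

```python
import re

# Matches the last number in a sentence, allowing a trailing period.
_LAST_NUMBER = re.compile(r"(-?\d+(?:\.\d+)?)\.?\s*$")


def extract_trailing_number(sentence):
    """Return the final numeric value of a feature sentence as float, else None."""
    if not isinstance(sentence, str):
        return None  # missing cells are read by pandas as NaN, which is not a str
    match = _LAST_NUMBER.search(sentence)
    return float(match.group(1)) if match else None


# Sentences copied from the drug feature samples above
print(extract_trailing_number("The molecular weight is 430.54."))
print(extract_trailing_number("Budesonide has a topological polar surface area of 93.06."))
print(extract_trailing_number("The log p value of  is 2.9."))
print(extract_trailing_number("Budesonide is approved."))
```

With pandas, the same helper could be applied column-wise, e.g. `drug_features_df['clogp'].map(extract_trailing_number)`.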

In [13]:
## Read the disease features csv file
disease_features_df = pd.read_csv(raw_files_path + 'disease_features.csv')

disease_features_df.head(5)

Unnamed: 0,node_index,mondo_id,mondo_name,group_id_bert,group_name_bert,mondo_definition,umls_description,orphanet_definition,orphanet_prevalence,orphanet_epidemiology,orphanet_clinical_description,orphanet_management_and_treatment,mayo_symptoms,mayo_causes,mayo_risk_factors,mayo_complications,mayo_prevention,mayo_see_doc
0,27165,8019,mullerian aplasia and hyperandrogenism,,,"Deficiency of the glycoprotein WNT4, associate...","Deficiency of the glycoprotein wnt4, associate...","A rare syndrome with 46,XX disorder of sex dev...",,,,,,,,,,
1,27165,8019,mullerian aplasia and hyperandrogenism,,,"Deficiency of the glycoprotein WNT4, associate...","Deficiency of the glycoprotein wnt4, associate...","A rare syndrome with 46,XX disorder of sex dev...",,,,,,,,,,
2,27166,11043,"myelodysplasia, immunodeficiency, facial dysmo...",,,,,,,,,,,,,,,
3,27168,8878,"bone dysplasia, lethal Holmgren type",,,Bone dysplasia lethal Holmgren type (BDLH) is ...,A lethal bone dysplasia with characteristics o...,Bone dysplasia lethal Holmgren type (BDLH) is ...,<1/1000000,,,,,,,,,
4,27169,8905,predisposition to invasive fungal disease due ...,,,,,"A rare, genetic primary immunodeficiency chara...",,,,,,,,,,


In [14]:
## Read the drug features csv file
drug_features_df = pd.read_csv(raw_files_path + 'drug_features.csv')

drug_features_df.head(5)

Unnamed: 0,node_index,description,half_life,indication,mechanism_of_action,protein_binding,pharmacodynamics,state,atc_1,atc_2,atc_3,atc_4,category,group,pathway,molecular_weight,tpsa,clogp
0,14012,Copper is a transition metal and a trace eleme...,,For use in the supplementation of total parent...,Copper is absorbed from the gut via high affin...,Copper is nearly entirely bound by ceruloplasm...,Copper is incorporated into many enzymes throu...,Copper is a solid.,,,,,Copper is part of Copper-containing Intrauteri...,Copper is approved and investigational.,,,,
1,14013,Oxygen is an element displayed by the symbol O...,The half-life is approximately 122.24 seconds,Oxygen therapy in clinical settings is used ac...,Oxygen therapy increases the arterial pressure...,Oxygen binds to oxygen-carrying protein in red...,Oxygen therapy improves effective cellular oxy...,Oxygen is a gas.,Oxygen is anatomically related to various.,Oxygen is in the therapeutic group of all othe...,Oxygen is pharmacologically related to all oth...,The chemical and functional group of is medic...,Oxygen is part of Chalcogens ; Elements ; Gase...,Oxygen is approved and vet_approved.,,The molecular weight is 32.0.,Oxygen has a topological polar surface area of...,
2,14014,"Flunisolide (marketed as AeroBid, Nasalide, Na...",The half-life is 1.8 hours,For the maintenance treatment of asthma as a p...,Flunisolide is a glucocorticoid receptor agoni...,Approximately 40% after oral inhalation,Flunisolide is a synthetic corticosteroid. It ...,Flunisolide is a solid.,Flunisolide is anatomically related to respira...,Flunisolide is in the therapeutic group of nas...,Flunisolide is pharmacologically related to de...,The chemical and functional group of is corti...,Flunisolide is part of Adrenal Cortex Hormones...,Flunisolide is approved and investigational.,,The molecular weight is 434.5.,Flunisolide has a topological polar surface ar...,The log p value of is 2.41.
3,14015,Alclometasone is synthetic glucocorticoid ster...,,For the relief of the inflammatory and pruriti...,The mechanism of the anti-inflammatory activit...,,Alclometasone is a synthetic corticosteroid fo...,Alclometasone is a solid.,Alclometasone is anatomically related to derma...,Alclometasone is in the therapeutic group of c...,Alclometasone is pharmacologically related to ...,The chemical and functional group of is corti...,Alclometasone is part of Adrenal Cortex Hormon...,Alclometasone is approved.,,,,
4,14016,Medrysone is a corticosteroid used in ophthalm...,,"For the treatment of allergic conjunctivitis, ...",There is no generally accepted explanation for...,,Medrysone is a topical anti-inflammatory corti...,Medrysone is a solid.,Medrysone is anatomically related to sensory o...,Medrysone is in the therapeutic group of ophth...,Medrysone is pharmacologically related to anti...,The chemical and functional group of is corti...,Medrysone is part of Adrenal Cortex Hormones ;...,Medrysone is approved.,,The molecular weight is 344.5.,Medrysone has a topological polar surface area...,The log p value of is 3.36.


In [15]:
## Select only the 4 must have node types
must_have_list = ['gene/protein', 'drug', 'disease', 'effect/phenotype']
nodes_select_df = nodes_df[nodes_df['node_type'].isin(must_have_list)]

## count rows
unique_node_count = nodes_select_df['node_index'].nunique()
print(unique_node_count)

68019


In [16]:
sanity_check_df = nodes_select_df[nodes_select_df['node_id'] == '5156']

sanity_check_df.head()

Unnamed: 0,node_index,node_id,node_type,node_name,node_source
3772,3772,5156,gene/protein,PDGFRA,NCBI
36722,36722,5156,disease,encephalomyelitis,MONDO
86127,86127,5156,effect/phenotype,Hypoplastic left atrium,HPO
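
The three rows above show why `node_id` cannot serve as a key on its own. A small sketch on a toy frame built from those rows (the name `toy_nodes` is only for illustration) confirms that the pair (`node_source`, `node_id`) is what distinguishes them:

```python
import pandas as pd

# Toy frame copied from the three rows above
toy_nodes = pd.DataFrame({
    "node_index": [3772, 36722, 86127],
    "node_id": ["5156", "5156", "5156"],
    "node_type": ["gene/protein", "disease", "effect/phenotype"],
    "node_source": ["NCBI", "MONDO", "HPO"],
})

# node_id alone collides across vocabularies ...
dup_by_id = int(toy_nodes.duplicated(subset=["node_id"], keep=False).sum())
# ... while the (node_source, node_id) pair does not
dup_by_pair = int(toy_nodes.duplicated(subset=["node_source", "node_id"], keep=False).sum())

print("rows sharing a node_id:", dup_by_id)
print("rows sharing (node_source, node_id):", dup_by_pair)
```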


In [17]:
## let's exclude those 'MONDO_grouped' rows for now
nodes_select_no_MONDO_grouped_df = nodes_select_df[nodes_select_df['node_source'] != 'MONDO_grouped']

## count rows
unique_node_count = nodes_select_no_MONDO_grouped_df['node_index'].nunique()
print(unique_node_count)

66752


In [18]:
## Get the list of unique node_index values in nodes_select_no_MONDO_grouped_df.
## node_id alone is not a unique identifier: it has to be combined with node_source as a prefix
## (see the example above, where the same node_id appears in multiple vocabularies),
## so the edges are filtered on node_index instead.
unique_node_ids = nodes_select_no_MONDO_grouped_df['node_index'].unique().tolist()

## Read the knowledge graph edge file
edges_df = pd.read_csv(raw_files_path + 'kg.csv')

## count rows
print(len(edges_df))

## Keep only edges whose x_index and y_index are both in the selected node_index list
edges_filtered_df = edges_df[edges_df['x_index'].isin(unique_node_ids)]
edges_filtered2_df = edges_filtered_df[edges_filtered_df['y_index'].isin(unique_node_ids)]

## count rows
print(len(edges_filtered2_df))

edges_filtered2_df.head(5)


8100498
4009644


Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source
0,protein_protein,ppi,0,9796,gene/protein,PHYHIP,NCBI,8889,56992,gene/protein,KIF15,NCBI
1,protein_protein,ppi,1,7918,gene/protein,GPANK1,NCBI,2798,9240,gene/protein,PNMA1,NCBI
2,protein_protein,ppi,2,8233,gene/protein,ZRSR2,NCBI,5646,23548,gene/protein,TTC33,NCBI
3,protein_protein,ppi,3,4899,gene/protein,NRF1,NCBI,11592,11253,gene/protein,MAN1B1,NCBI
4,protein_protein,ppi,4,5297,gene/protein,PI4KA,NCBI,2122,8601,gene/protein,RGS20,NCBI


In [19]:
## check unique relation
unique_relation_values = edges_filtered2_df['relation'].unique()
print("All possible relation are here: " , unique_relation_values)

All possible relation are here:  ['protein_protein' 'drug_protein' 'contraindication' 'indication'
 'off-label use' 'drug_drug' 'phenotype_protein' 'phenotype_phenotype'
 'disease_phenotype_negative' 'disease_phenotype_positive'
 'disease_protein' 'disease_disease' 'drug_effect']


In [20]:
## check unique display_relation
unique_relation_values = edges_filtered2_df['display_relation'].unique()
print("All possible display relation are here: " , unique_relation_values)

All possible display relation are here:  ['ppi' 'carrier' 'enzyme' 'target' 'transporter' 'contraindication'
 'indication' 'off-label use' 'synergistic interaction' 'associated with'
 'parent-child' 'phenotype absent' 'phenotype present' 'side effect']


In [21]:
## check unique node types across x_type and y_type
unique_type_values = set(list(edges_filtered2_df['x_type'].unique()) + list(edges_filtered2_df['y_type'].unique()))
print("All possible node types are here: " , unique_type_values)

All possible node types are here:  {'effect/phenotype', 'disease', 'drug', 'gene/protein'}


In [22]:
## check unique node sources across x_source and y_source
unique_type_values = set(list(edges_filtered2_df['x_source'].unique()) + list(edges_filtered2_df['y_source'].unique()))
print("All possible node sources are here: " , unique_type_values)

All possible node sources are here:  {'NCBI', 'MONDO', 'HPO', 'DrugBank'}


In [23]:
## Sanity check anatomy_anatomy relations
## should return empty df
sanity_check_df = edges_filtered2_df[edges_filtered2_df['relation'] == 'anatomy_anatomy']
sanity_check_df.head(10)

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source


### Investigate reverse-order pairs of subjects & objects
* The raw PrimeKG file appears to keep a reversed copy of every edge: in the checks below, each relation returns the same count in both directions
* Strategy: discard the copy whose direction contradicts what is defined in the Biolink associations & predicates
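
A minimal sketch of that strategy on a toy frame. The `ALLOWED_DIRECTION` table and the helper name `keep_biolink_direction` are hypothetical, and the drug-to-gene and drug-to-disease directions chosen here are assumptions for illustration only; the real table should follow the keys of `predicate_map`.

```python
import pandas as pd

# Hypothetical direction table: display_relation -> (subject type, object type)
ALLOWED_DIRECTION = {
    "target": ("drug", "gene/protein"),
    "transporter": ("drug", "gene/protein"),
    "off-label use": ("drug", "disease"),
}


def keep_biolink_direction(edges, allowed=ALLOWED_DIRECTION):
    """Keep rows whose (x_type, y_type) matches the allowed direction.

    Rows with a display_relation missing from the table are dropped as well.
    """
    expected_x = edges["display_relation"].map({k: v[0] for k, v in allowed.items()})
    expected_y = edges["display_relation"].map({k: v[1] for k, v in allowed.items()})
    mask = (edges["x_type"] == expected_x) & (edges["y_type"] == expected_y)
    return edges[mask]


# Each edge appears twice, once per direction, as in the raw PrimeKG file
toy_edges = pd.DataFrame({
    "display_relation": ["target", "target", "off-label use", "off-label use"],
    "x_type": ["gene/protein", "drug", "disease", "drug"],
    "x_name": ["HDC", "Histidine", "restless legs syndrome", "Levodopa"],
    "y_type": ["drug", "gene/protein", "drug", "disease"],
    "y_name": ["Histidine", "HDC", "Levodopa", "restless legs syndrome"],
})

kept = keep_biolink_direction(toy_edges)
print(kept)
```

Relations whose two ends share a node type (e.g. `ppi`, `synergistic interaction`) cannot be separated by type and are not covered by this sketch.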

In [24]:
sanity_check_df = edges_filtered2_df[ (edges_filtered2_df['display_relation'] == 'transporter') & (edges_filtered2_df['x_type'] == 'drug')]

print(len(sanity_check_df))
# sanity_check_df.head(5)

3092


In [25]:
sanity_check_df = edges_filtered2_df[ (edges_filtered2_df['display_relation'] == 'transporter') & (edges_filtered2_df['x_type'] == 'gene/protein')]

print(len(sanity_check_df))
# sanity_check_df.head(5)

3092


In [26]:
sanity_check_df = edges_filtered2_df[ (edges_filtered2_df['display_relation'] == 'target') & (edges_filtered2_df['x_type'] == 'drug')]

print(len(sanity_check_df))
# sanity_check_df.head(5)

16380


In [27]:
sanity_check_df = edges_filtered2_df[ (edges_filtered2_df['display_relation'] == 'target') & (edges_filtered2_df['x_type'] == 'gene/protein')]

print(len(sanity_check_df))
sanity_check_df.head(5)

16380


Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source
5714056,drug_protein,target,13174,3067,gene/protein,HDC,NCBI,15906,DB00114,drug,Pyridoxal phosphate,DrugBank
5714057,drug_protein,target,13174,3067,gene/protein,HDC,NCBI,15907,DB00117,drug,Histidine,DrugBank
5714058,drug_protein,target,13982,27165,gene/protein,GLS2,NCBI,15908,DB00142,drug,Glutamic acid,DrugBank
5714059,drug_protein,target,476,2162,gene/protein,F13A1,NCBI,15909,DB02340,drug,N-Acetyl-Serine,DrugBank
5714060,drug_protein,target,476,2162,gene/protein,F13A1,NCBI,15910,DB11300,drug,Thrombin,DrugBank


In [28]:
sanity_check_df = edges_filtered2_df[ (edges_filtered2_df['display_relation'] == 'off-label use') & (edges_filtered2_df['x_type'] == 'drug')]

print(len(sanity_check_df))
# sanity_check_df.head(5)

2186


In [29]:
sanity_check_df = edges_filtered2_df[ (edges_filtered2_df['display_relation'] == 'off-label use') & (edges_filtered2_df['x_type'] == 'disease')]

print(len(sanity_check_df))
sanity_check_df.head(5)

2186


Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source
5733816,off-label use,off-label use,33577,5044,disease,hypertensive disorder,MONDO,14257,DB00903,drug,Etacrynic acid,DrugBank
5734096,off-label use,off-label use,33577,5044,disease,hypertensive disorder,MONDO,14668,DB00887,drug,Bumetanide,DrugBank
5734351,off-label use,off-label use,38121,5391,disease,restless legs syndrome,MONDO,14321,DB01235,drug,Levodopa,DrugBank
5734353,off-label use,off-label use,38121,5391,disease,restless legs syndrome,MONDO,19207,DB00190,drug,Carbidopa,DrugBank
5734665,off-label use,off-label use,28396,7803,disease,multiple system atrophy,MONDO,15003,DB01380,drug,Cortisone acetate,DrugBank


### Now handle each node type separately
* Follow the example inputs for the Node Normalizer service: https://github.com/TranslatorSRI/NodeNormalization/blob/master/documentation/NodeNormalization.ipynb
* Drug: DRUGBANK:DB09145
* Disease: MONDO:0004976
* Gene / Protein: NCBIGene:9496
* Effect / Phenotype: HP:0007354
* MONDO and HP codes need to be zero-padded to 7 digits

In [30]:
## Start by selecting edges with MONDO or HPO in either x_source or y_source
## This is only a test run; once the code is done, run directly on the full df
mondo_or_hpo = ['MONDO', 'HPO']
edges_mondo_hpo_df = edges_filtered2_df[ (edges_filtered2_df['x_source'].isin(mondo_or_hpo)) | (edges_filtered2_df['y_source'].isin(mondo_or_hpo))]

# Randomly select one row for each unique combination
sampled_df = edges_mondo_hpo_df.groupby(['x_source', 'y_source']).sample(n=1, random_state=151)  # Set random_state for reproducibility
# sampled_df

In [31]:
## Work directly on the full dataframe
## First process x_id:
## add the prefix MONDO or HP for rows with source "MONDO" or "HPO" and zero-pad to 7 digits as required by the example

## deep copy a df
edges_deep_copy_df = edges_filtered2_df.copy(deep=True)

## Prefix mapping for selected sources
prefix_map = {
    'HPO': 'HP:',
    'MONDO': 'MONDO:',
    'DrugBank': 'DRUGBANK:', 
    'NCBI': 'NCBIGene:'
}

## Initialize new column with default values (original id as string)
edges_deep_copy_df['subject'] = edges_deep_copy_df['x_id'].astype(str)

## Mask for HPO and MONDO → prefix + zero-padded id
mask_hpo_mondo = edges_deep_copy_df['x_source'].isin(['HPO', 'MONDO'])
edges_deep_copy_df.loc[mask_hpo_mondo, 'subject'] = (
    edges_deep_copy_df.loc[mask_hpo_mondo, 'x_source'].map(prefix_map) +
    edges_deep_copy_df.loc[mask_hpo_mondo, 'x_id'].astype(str).str.zfill(7)
)

## Mask for DrugBank and NCBI → prefix only, no padding
mask_drugbank_ncbi = edges_deep_copy_df['x_source'].isin(['DrugBank', 'NCBI'])
edges_deep_copy_df.loc[mask_drugbank_ncbi, 'subject'] = (
    edges_deep_copy_df.loc[mask_drugbank_ncbi, 'x_source'].map(prefix_map) +
    edges_deep_copy_df.loc[mask_drugbank_ncbi, 'x_id'].astype(str)
)

sampled_df = edges_deep_copy_df.groupby(['x_source']).sample(n=1, random_state=151)  # Set random_state for reproducibility
sampled_df

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source,subject
1362595,drug_drug,synergistic interaction,21014,DB13648,drug,Alcuronium,DrugBank,14614,DB06738,drug,Ketobemidone,DrugBank,DRUGBANK:DB13648
5891379,disease_phenotype_positive,phenotype present,88949,11800,effect/phenotype,Midface retrusion,HPO,31464,9841,disease,PEHO syndrome,MONDO,HP:0011800
3153485,disease_phenotype_positive,phenotype present,31611,14892,disease,micrognathia-recurrent infections-behavioral a...,MONDO,22715,750,effect/phenotype,Delayed speech and language development,HPO,MONDO:0014892
116373,protein_protein,ppi,2071,27074,gene/protein,LAMP3,NCBI,10788,347733,gene/protein,TUBB2B,NCBI,NCBIGene:27074


In [32]:
## similarly change for y_id
## Initialize new column with default values (original id as string)
edges_deep_copy_df['object'] = edges_deep_copy_df['y_id'].astype(str)

## Mask for HPO and MONDO → prefix + zero-padded id
mask_hpo_mondo = edges_deep_copy_df['y_source'].isin(['HPO', 'MONDO'])
edges_deep_copy_df.loc[mask_hpo_mondo, 'object'] = (
    edges_deep_copy_df.loc[mask_hpo_mondo, 'y_source'].map(prefix_map) +
    edges_deep_copy_df.loc[mask_hpo_mondo, 'y_id'].astype(str).str.zfill(7)
)

## Mask for DrugBank and NCBI → prefix only, no padding
mask_drugbank_ncbi = edges_deep_copy_df['y_source'].isin(['DrugBank', 'NCBI'])
edges_deep_copy_df.loc[mask_drugbank_ncbi, 'object'] = (
    edges_deep_copy_df.loc[mask_drugbank_ncbi, 'y_source'].map(prefix_map) +
    edges_deep_copy_df.loc[mask_drugbank_ncbi, 'y_id'].astype(str)
)

sampled_df = edges_deep_copy_df.groupby(['y_source']).sample(n=1, random_state=151)  # Set random_state for reproducibility
sampled_df

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source,subject,object
1423055,drug_drug,synergistic interaction,15103,DB00496,drug,Darifenacin,DrugBank,15441,DB00402,drug,Eszopiclone,DrugBank,DRUGBANK:DB00496,DRUGBANK:DB00402
3200343,disease_phenotype_positive,phenotype present,28319,8900,disease,camptodactyly with fibrous tissue hyperplasia ...,MONDO,22650,100490,effect/phenotype,Camptodactyly of finger,HPO,MONDO:0008900,HP:0100490
3249032,disease_protein,associated with,1749,7150,gene/protein,TOP1,NCBI,84244,16267,disease,undifferentiated carcinoma of the corpus uteri,MONDO,NCBIGene:7150,MONDO:0016267
116373,protein_protein,ppi,2071,27074,gene/protein,LAMP3,NCBI,10788,347733,gene/protein,TUBB2B,NCBI,NCBIGene:27074,NCBIGene:347733
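
The subject and object cells above repeat the same prefixing logic. A minimal sketch of one shared helper follows. It assumes `prefix_map` holds the four prefixes visible in the sampled output (`HP:`, `MONDO:`, `DRUGBANK:`, `NCBIGene:`); the names `to_curie`, `example_prefix_map`, and `PADDED_SOURCES` are hypothetical and not part of this notebook.

```python
import pandas as pd

# Assumed prefixes, inferred from the sampled output above (not copied from the notebook)
example_prefix_map = {'HPO': 'HP:', 'MONDO': 'MONDO:', 'DrugBank': 'DRUGBANK:', 'NCBI': 'NCBIGene:'}
# Sources whose local ids are zero-padded to 7 digits
PADDED_SOURCES = {'HPO', 'MONDO'}

def to_curie(source: pd.Series, local_id: pd.Series, prefix_map: dict) -> pd.Series:
    """Return prefix + id for known sources; ids from other sources stay unchanged."""
    ids = local_id.astype(str)
    # Keep the id as-is unless its source needs zero-padding
    padded = ids.where(~source.isin(PADDED_SOURCES), ids.str.zfill(7))
    # Sources missing from prefix_map get an empty prefix
    prefix = source.map(prefix_map).fillna('')
    return prefix + padded

demo = pd.DataFrame({
    'x_source': ['HPO', 'MONDO', 'DrugBank', 'NCBI', 'GO'],
    'x_id': ['11800', '14892', 'DB13648', '27074', '5575'],
})
demo['subject'] = to_curie(demo['x_source'], demo['x_id'], example_prefix_map)
print(demo['subject'].tolist())
```

The same call with `y_source` and `y_id` would produce the object column, so the padding rule lives in one place.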


* **Question 3:** which Biolink node type should effect/phenotype nodes map to?
* Details on the candidates follow:
* DiseaseOrPhenotypicFeature
  - Either one of a disease or an individual phenotypic feature. Some knowledge resources such as Monarch treat these as distinct, others such as MESH conflate. Please see definitions of phenotypic feature and disease in this model for their independent descriptions. This class is helpful to enforce domains and ranges that may involve either a disease or a phenotypic feature.
* DiseaseOrPhenotypicFeatureExposure
  - A disease or phenotypic feature state, when viewed as an exposure, represents an precondition, leading to or influencing an outcome, e.g. HIV predisposing an individual to infections; a relative deficiency of skin pigmentation predisposing an individual to skin cancer.
* DiseaseOrPhenotypicFeatureOutcome
  - Physiological outcomes resulting from an exposure event which is the manifestation of a disease or other characteristic phenotype.
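
Whichever candidate is chosen, it helps to keep the choice in a single dictionary and to list any source types that have no Biolink category yet. A minimal sketch, assuming a small stand-in for the shared `category_map`; the name `example_category_map` and the `'anatomy'` test value are illustrative assumptions:

```python
import pandas as pd

# Hypothetical subset of the shared category_map; the effect/phenotype choice
# mirrors the value visible in this notebook's sampled output
example_category_map = {
    'drug': 'biolink:Drug',
    'gene/protein': 'biolink:Gene',
    'disease': 'biolink:Disease',
    'effect/phenotype': 'biolink:DiseaseOrPhenotypicFeature',
}

types = pd.Series(['drug', 'effect/phenotype', 'anatomy', 'disease'])
mapped = types.map(example_category_map)

# Types with no Biolink category come back as NaN; list them instead of dropping them silently
unmapped_types = sorted(types[mapped.isna()].unique())
print(unmapped_types)
```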

In [33]:
## Deprecated: switched to a general category mapping dictionary shared across different sources
## Create a new map to change subject and object category based on x_type and y_type
## {'drug', 'disease', 'gene/protein', 'effect/phenotype'}
# category_map = {
#     'drug': 'biolink:Drug',
#     'gene/protein': 'biolink:Gene',
#     'disease': 'biolink:Disease', 
#     'effect/phenotype': 'biolink:DiseaseOrPhenotypicFeature'
# }

## Initialize the new column with default values (original x_type as string)
edges_deep_copy_df['subject_category'] = edges_deep_copy_df['x_type'].astype(str)

## Mask
# mask = edges_deep_copy_df['x_type'].isin(['drug', 'gene/protein', 'disease', 'effect/phenotype'])
# edges_deep_copy_df.loc[mask, 'subject_category'] = (
#     edges_deep_copy_df.loc[mask, 'x_type'].map(category_map)
# )

edges_deep_copy_df['subject_category'] = (
    edges_deep_copy_df['x_type'].map(category_map)
)

## Initialize the new column with default values (original y_type as string)
edges_deep_copy_df['object_category'] = edges_deep_copy_df['y_type'].astype(str)

## Mask
# mask = edges_deep_copy_df['y_type'].isin(['drug', 'gene/protein', 'disease', 'effect/phenotype'])
# edges_deep_copy_df.loc[mask, 'object_category'] = (
#     edges_deep_copy_df.loc[mask, 'y_type'].map(category_map)
# )

edges_deep_copy_df['object_category'] = (
    edges_deep_copy_df['y_type'].map(category_map)
)

sampled_df = edges_deep_copy_df.groupby(['x_type']).sample(n=1, random_state=151)  # Set random_state for reproducibility
sampled_df

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source,subject,object,subject_category,object_category
3157330,disease_phenotype_positive,phenotype present,31809,32864,disease,intellectual developmental disorder with speec...,MONDO,22759,6,effect/phenotype,Autosomal dominant inheritance,HPO,MONDO:0032864,HP:0000006,biolink:Disease,biolink:DiseaseOrPhenotypicFeature
1162436,drug_drug,synergistic interaction,20349,DB09543,drug,Methyl salicylate,DrugBank,14432,DB09383,drug,Meprednisone,DrugBank,DRUGBANK:DB09543,DRUGBANK:DB09383,biolink:Drug,biolink:Drug
6112214,drug_effect,side effect,22939,83,effect/phenotype,Renal insufficiency,HPO,17730,DB00710,drug,Ibandronate,DrugBank,HP:0000083,DRUGBANK:DB00710,biolink:DiseaseOrPhenotypicFeature,biolink:Drug
5473379,protein_protein,ppi,637,5725,gene/protein,PTBP1,NCBI,866,5595,gene/protein,MAPK3,NCBI,NCBIGene:5725,NCBIGene:5595,biolink:Gene,biolink:Gene


In [34]:
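## Sanity check the 'side effect' (drug_effect) relations after category mapping
## expect a non-empty df: each row pairs a drug with an effect/phenotype, in either direction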
sanity_check_df = edges_deep_copy_df[edges_deep_copy_df['display_relation'] == 'side effect']
print(len(sanity_check_df))
sanity_check_df.head(10)

129568


Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source,subject,object,subject_category,object_category
3348335,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,23158,2027,effect/phenotype,Abdominal pain,HPO,DRUGBANK:DB00583,HP:0002027,biolink:Drug,biolink:DiseaseOrPhenotypicFeature
3348336,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,85849,4396,effect/phenotype,Poor appetite,HPO,DRUGBANK:DB00583,HP:0004396,biolink:Drug,biolink:DiseaseOrPhenotypicFeature
3348337,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,22447,739,effect/phenotype,Anxiety,HPO,DRUGBANK:DB00583,HP:0000739,biolink:Drug,biolink:DiseaseOrPhenotypicFeature
3348338,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,22831,11675,effect/phenotype,Arrhythmia,HPO,DRUGBANK:DB00583,HP:0011675,biolink:Drug,biolink:DiseaseOrPhenotypicFeature
3348339,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,23469,3418,effect/phenotype,Back pain,HPO,DRUGBANK:DB00583,HP:0003418,biolink:Drug,biolink:DiseaseOrPhenotypicFeature
3348340,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,23254,12387,effect/phenotype,Bronchitis,HPO,DRUGBANK:DB00583,HP:0012387,biolink:Drug,biolink:DiseaseOrPhenotypicFeature
3348341,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,23168,1626,effect/phenotype,Abnormality of the cardiovascular system,HPO,DRUGBANK:DB00583,HP:0001626,biolink:Drug,biolink:DiseaseOrPhenotypicFeature
3348342,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,26336,100749,effect/phenotype,Chest pain,HPO,DRUGBANK:DB00583,HP:0100749,biolink:Drug,biolink:DiseaseOrPhenotypicFeature
3348343,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,26160,12735,effect/phenotype,Cough,HPO,DRUGBANK:DB00583,HP:0012735,biolink:Drug,biolink:DiseaseOrPhenotypicFeature
3348344,drug_effect,side effect,16322,DB00583,drug,Levocarnitine,DrugBank,25201,2321,effect/phenotype,Vertigo,HPO,DRUGBANK:DB00583,HP:0002321,biolink:Drug,biolink:DiseaseOrPhenotypicFeature


In [35]:
## Sanity check the 'enzyme' relations (enzyme: involved in drug metabolism)
## note: this reads edges_filtered2_df, so the subject/object columns added above are not shown
sanity_check_df = edges_filtered2_df[edges_filtered2_df['display_relation'] == 'enzyme']
sanity_check_df.head(10)

Unnamed: 0,relation,display_relation,x_index,x_id,x_type,x_name,x_source,y_index,y_id,y_type,y_name,y_source
321939,drug_protein,enzyme,14584,DB00130,drug,L-Glutamine,DrugBank,13982,27165,gene/protein,GLS2,NCBI
321940,drug_protein,enzyme,14585,DB11118,drug,Ammonia,DrugBank,13982,27165,gene/protein,GLS2,NCBI
321941,drug_protein,enzyme,14584,DB00130,drug,L-Glutamine,DrugBank,476,2162,gene/protein,F13A1,NCBI
321942,drug_protein,enzyme,14275,DB00997,drug,Doxorubicin,DrugBank,373,4843,gene/protein,NOS2,NCBI
321943,drug_protein,enzyme,14423,DB09237,drug,Levamlodipine,DrugBank,373,4843,gene/protein,NOS2,NCBI
321944,drug_protein,enzyme,14586,DB00157,drug,NADH,DrugBank,2499,128,gene/protein,ADH5,NCBI
321945,drug_protein,enzyme,14587,DB00898,drug,Ethanol,DrugBank,2499,128,gene/protein,ADH5,NCBI
321946,drug_protein,enzyme,14588,DB01020,drug,Isosorbide mononitrate,DrugBank,2499,128,gene/protein,ADH5,NCBI
321947,drug_protein,enzyme,14589,DB11077,drug,Polyethylene glycol 400,DrugBank,2499,128,gene/protein,ADH5,NCBI
321948,drug_protein,enzyme,14590,DB12612,drug,Ozanimod,DrugBank,2499,128,gene/protein,ADH5,NCBI


* **Question 4:** general mapping for predicates
* protein-protein interaction (ppi): use physical_interacts_with
* carrier: use 'biolink:can_be_carried_out_by'?
* enzyme?
* indication?
* synergistic interaction?
* parent-child, between two phenotype nodes: should I use biolink:broad_match?
* phenotype absent: the opposite of has_phenotype, or just ignore it?
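
To see which of the open questions above still block edges, the lookup can report every key combination without a predicate. A minimal sketch, assuming `predicate_map` is keyed by `(subject_category, object_category, display_relation)` as stated when the dictionary notebook is loaded; `example_predicate_map` is a hypothetical two-entry subset, and building the keys with `zip` is a suggested alternative to a row-wise `apply`, not the notebook's own method:

```python
import pandas as pd

# Hypothetical two-entry subset of predicate_map, keyed as
# (subject_category, object_category, display_relation)
example_predicate_map = {
    ('biolink:Drug', 'biolink:DiseaseOrPhenotypicFeature', 'side effect'): 'biolink:has_side_effect',
    ('biolink:Disease', 'biolink:DiseaseOrPhenotypicFeature', 'phenotype present'): 'biolink:has_phenotype',
}

edges = pd.DataFrame({
    'subject_category': ['biolink:Drug', 'biolink:Disease', 'biolink:Drug'],
    'object_category': ['biolink:DiseaseOrPhenotypicFeature'] * 2 + ['biolink:Drug'],
    'display_relation': ['side effect', 'phenotype present', 'synergistic interaction'],
})

# zip over the three columns avoids a row-wise DataFrame.apply
keys = list(zip(edges['subject_category'], edges['object_category'], edges['display_relation']))
edges['predicate'] = [example_predicate_map.get(k) for k in keys]

# Key combinations that still lack a predicate, with their edge counts
unmapped = (
    edges[edges['predicate'].isna()]
    .groupby(['subject_category', 'object_category', 'display_relation'])
    .size()
)
print(unmapped)
```

Printing `unmapped` on the full edge table would show how many edges each undecided relation (for example 'synergistic interaction') currently loses.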

In [36]:
## Convert display_relation values to Biolink-compliant predicate types
## See the PrimeKG supplementary table for descriptions of the relations
## Translator discussion points (data ingestion group meeting):
##   - separate out off-label drug usage, clinical trial, and FDA-approved usage
##   - for 'synergistic interaction', combine with knowledge from the LLM clinical trial attempts
##     to find drug combinations and the diseases they targeted
##   - 'associated with' is currently excluded
##   - 'phenotype absent' does not need to be included

list_of_display_relations = ['ppi', 'carrier', 'enzyme', 'target', 'transporter', 'contraindication',
'indication', 'off-label use', 'synergistic interaction', 'associated with',
'parent-child', 'phenotype absent', 'phenotype present', 'side effect']

# relation_predicate_map = {
#     'ppi': 'biolink:physical_interacts_with',
#     'carrier': 'biolink:can_be_carried_out_by',
#     'enzyme': '', 
#     'target': 'biolink:target_for',
#     'transporter': 'biolink:GeneAffectsChemicalAssociation',
#     'contraindication': 'biolink:has_contraindication',
#     'indication': '',
#     'off-label use': 'biolink:treats',
#     'synergistic interaction': '',
#     # 'associated with': 'biolink:associated_with',
#     'parent-child': 'biolink:broad_match',
#     # 'phenotype absent': '',
#     'phenotype present': 'biolink:has_phenotype',
#     'side effect': 'biolink:has_side_effect',
# }

## Initialize the new column with default values (original display_relation as string)
edges_deep_copy_df['predicate'] = edges_deep_copy_df['display_relation'].astype(str)

## Mask
# mask = edges_deep_copy_df['display_relation'].isin(list_of_display_relations)
# edges_deep_copy_df.loc[mask, 'predicate'] = (
#     edges_deep_copy_df.loc[mask, 'display_relation'].map(predicate_map)
# )

edges_deep_copy_df['predicate'] = edges_deep_copy_df.apply(
    lambda row: predicate_map.get((row['subject_category'], row['object_category'], row['display_relation'])),
    axis=1
)

# sampled_df = edges_deep_copy_df.groupby(['display_relation']).sample(n=1, random_state=151)  # Set random_state for reproducibility
# sampled_df

## Add the individual data source (the original source rather than PrimeKG) for each edge in PrimeKG
* The mapping can be found in the PrimeKG build script: https://github.com/mims-harvard/PrimeKG/blob/main/knowledge_graph/build_graph.ipynb
* We then create our own dictionary to map each individual data source onto the ingested KG
* The dictionary key is the unique combination of the following columns:
* x_source, y_source, relation

In [37]:
data_source_map_dict = {
    ('NCBI', 'NCBI', 'protein_protein'): 'NCBI',
    ('NCBI', 'MONDO', 'disease_protein'): 'DisGeNET',
    ('NCBI', 'HPO', 'phenotype_protein'): 'DisGeNET',
    ('NCBI', 'GO', 'molfunc_protein'): 'Gene2GO',
    ('NCBI', 'GO', 'cellcomp_protein'): 'Gene2GO',
    ('NCBI', 'GO', 'bioprocess_protein'): 'Gene2GO',
    ('NCBI', 'UBERON', 'anatomy_protein_present'): 'BGEE',
    ('NCBI', 'UBERON', 'anatomy_protein_absent'): 'BGEE',
    ('NCBI', 'REACTOME', 'pathway_protein'): 'Reactome',
    ('DrugBank', 'MONDO', 'contraindication'): 'DrugCentral', ## Note: double-check whether other relation types also come from this data source
    ('DrugBank', 'MONDO', 'indication'): 'DrugCentral', ## Note: the PrimeKG comments say DiseaseCentral, possibly a typo for DrugCentral
    ('DrugBank', 'MONDO', 'off-label use'): 'DrugCentral',
    ('DrugBank', 'NCBI', 'drug_protein'): 'DrugBank',
    ('DrugBank', 'DrugBank', 'drug_drug'): 'DrugBank',
    ('DrugBank', 'HPO', 'drug_effect'): 'SIDER',
    ('MONDO', 'MONDO', 'disease_disease'): 'MONDO',
    ('MONDO', 'NCBI', 'disease_protein'): 'MONDO',
    ('MONDO', 'HPO', 'disease_phenotype_positive'): 'HPO-A',
    ('MONDO', 'HPO', 'disease_phenotype_negative'): 'HPO-A',
    ('HPO', 'HPO', 'phenotype_phenotype'): 'HPO',
    ('GO', 'GO', 'bioprocess_bioprocess'): 'GO',
    ('GO', 'GO', 'molfunc_molfunc'): 'GO',
    ('GO', 'GO', 'cellcomp_cellcomp'): 'GO',
    ('CTD', 'NCBI', 'exposure_protein'): 'CTD',
    ('CTD', 'MONDO', 'exposure_disease'): 'CTD',
    ('CTD', 'CTD', 'exposure_exposure'): 'CTD',
    ('CTD', 'GO', 'exposure_bioprocess'): 'CTD',
    ('CTD', 'GO', 'exposure_molfunc'): 'CTD',
    ('CTD', 'GO', 'exposure_cellcomp'): 'CTD',
    ('UBERON', 'UBERON', 'anatomy_anatomy'): 'UBERON',
    ('REACTOME', 'REACTOME', 'pathway_pathway'): 'Reactome',
}

## Per-source mapping (currently disabled): map the key combination to a new column 'knowledge_source';
## when no original data source is found, label the edge as coming from PrimeKG itself
# edges_deep_copy_df['knowledge_source'] = edges_deep_copy_df.apply(
#     lambda row: data_source_map_dict.get((row['x_source'], row['y_source'], row['relation']), 'PrimeKG'), axis=1)

## For now, every edge is labeled with PrimeKG as its knowledge source
edges_deep_copy_df['knowledge_source'] = 'PrimeKG'

In [38]:
# Count occurrences of each unique value in 'knowledge_source'
counts = edges_deep_copy_df['knowledge_source'].value_counts()

print(counts)

knowledge_source
PrimeKG    4009644
Name: count, dtype: int64


In [39]:
## Sanity check those rows with PrimeKG as datasource
## and randomly select 1 row out of each relation type

sampled_df = edges_deep_copy_df[edges_deep_copy_df['knowledge_source'] == 'PrimeKG'].groupby(['relation', 'x_source', 'y_source']).sample(n=1, random_state=151)  # Set random_state for reproducibility
# sampled_df

In [40]:
## add a new knowledge_souce_total column and set its value to "PrimeKG"
edges_deep_copy_df['knowledge_souce_total'] = 'PrimeKG'
## add a new knowledge_level column and set its value to 'knowledge_assertion'
edges_deep_copy_df['knowledge_level'] = 'knowledge_assertion'
## add a new agent_type column and set its value to 'automated_agent'
edges_deep_copy_df['agent_type'] = 'automated_agent'

## copy to a final df
edge_df = edges_deep_copy_df.copy(deep = True)

print(edge_df.shape[0])
## Remove rows where 'subject' or 'object' is NaN
edge_df = edge_df.dropna(subset=['subject', 'object'])

print(edge_df.shape[0])

edge_df['deploy_date'] = deployment_date
# edge_df.head(5)

4009644
4009644


In [41]:
## create a context_qualifier column
## no qualifier values are available for these edges, so fill the column with NaN
edge_df['context_qualifier'] = np.nan
# edge_df.head(5)

In [42]:
import uuid
import pandas as pd

## generate uuid from column combination
def generate_uuid_from_columns(df, column_list, namespace=uuid.NAMESPACE_DNS):
    """
    Generates one UUID per row from the values in the specified columns of a Pandas DataFrame.

    Args:
        df (pd.DataFrame): The input DataFrame.
        column_list (list): Names of the columns to use for UUID generation.
        namespace (uuid.UUID): A UUID namespace (default is uuid.NAMESPACE_DNS).

    Returns:
        pd.Series: A Pandas Series containing the generated UUIDs as hex strings.
    """
    ## axis=1 applies the lambda to each row; without it the lambda runs once per column
    return df[column_list].apply(
        lambda row: uuid.uuid5(namespace, ''.join(row.astype(str))).hex, axis=1
    )

def generate_uuid(row):
    """
    Generates a UUID based on the combined values of multiple columns.
    """
    combined_string = ''.join(row.astype(str))
    return uuid.uuid5(uuid.NAMESPACE_DNS, combined_string)
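
`generate_uuid` joins the field values with no separator, so two different rows can in principle produce the same combined string. A minimal sketch of a separator-based variant follows; `edge_uuid` and the `'|'` separator are hypothetical, and switching to it would change every existing id, so it is a design option rather than a drop-in replacement.

```python
import uuid

def edge_uuid(values, sep='|'):
    """Deterministic UUIDv5 from an ordered list of field values (hypothetical helper)."""
    combined = sep.join(str(v) for v in values)
    return uuid.uuid5(uuid.NAMESPACE_DNS, combined)

# Same inputs always give the same id, so re-running the notebook is reproducible
a = edge_uuid(['NCBIGene:9796', 'biolink:physical_interacts_with', 'NCBIGene:56992'])
b = edge_uuid(['NCBIGene:9796', 'biolink:physical_interacts_with', 'NCBIGene:56992'])
print(a == b)

# Without a separator, different field splits collapse to one string
print(''.join(['ab', 'c']) == ''.join(['a', 'bc']))

# With a separator they stay distinct
print(edge_uuid(['ab', 'c']) != edge_uuid(['a', 'bc']))
```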

In [43]:
### Add the resource id column ('id'), generated from the column combination below
column_list = ['subject', 'predicate', 'object', 'context_qualifier', 'deploy_date']
# Apply the function to each row to generate UUIDs
edge_df['id'] = edge_df[column_list].apply(generate_uuid, axis=1)

# edge_df['id'] = generate_uuid_from_columns(edge_df, column_list)
# edge_df.head(5)

### Now create the corresponding node file
* only need three columns: id, name, category
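
Because `drop_duplicates` compares whole rows, an id that appears with two different names or categories survives as two node rows. A minimal sketch of a check for that case, using a toy table; the `'PHYHIP-alt'` name is made up purely to trigger the conflict:

```python
import pandas as pd

# Toy node table; the second NCBIGene:9796 row carries a made-up alternative name
nodes = pd.DataFrame({
    'id': ['NCBIGene:9796', 'NCBIGene:9796', 'NCBIGene:7918'],
    'name': ['PHYHIP', 'PHYHIP-alt', 'GPANK1'],
    'category': ['biolink:Gene', 'biolink:Gene', 'biolink:Gene'],
})

# Count the distinct (name, category) rows per id; more than one means a conflict
rows_per_id = nodes.drop_duplicates().groupby('id').size()
conflicting_ids = rows_per_id[rows_per_id > 1].index.tolist()
print(conflicting_ids)
```

An empty list on the real node table would confirm that `id` alone is a usable key.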

In [44]:
## select the node columns and rename them into the desired format
## (rename returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
node_subject_df = edge_df[['subject', 'x_name', 'subject_category']].rename(
    columns={'subject': 'id', 'x_name': 'name', 'subject_category': 'category'})
node_object_df = edge_df[['object', 'y_name', 'object_category']].rename(
    columns={'object': 'id', 'y_name': 'name', 'object_category': 'category'})

concat_node_df = pd.concat([node_subject_df, node_object_df]).drop_duplicates(keep='first')

concat_node_df.head()

Unnamed: 0,id,name,category
0,NCBIGene:9796,PHYHIP,biolink:Gene
1,NCBIGene:7918,GPANK1,biolink:Gene
2,NCBIGene:8233,ZRSR2,biolink:Gene
3,NCBIGene:4899,NRF1,biolink:Gene
4,NCBIGene:5297,PI4KA,biolink:Gene


In [45]:
## drop no longer needed columns from edge df
print(edge_df.columns.tolist())
drop_cols = ['relation', 'display_relation', 'x_index', 'x_id', 'x_type', 'x_name', 'x_source', 'y_index', 'y_id', 'y_type', 'y_name', 'y_source']

edge_output_df = edge_df.drop(drop_cols, axis=1)

# edge_output_df.head()

['relation', 'display_relation', 'x_index', 'x_id', 'x_type', 'x_name', 'x_source', 'y_index', 'y_id', 'y_type', 'y_name', 'y_source', 'subject', 'object', 'subject_category', 'object_category', 'predicate', 'knowledge_source', 'knowledge_souce_total', 'knowledge_level', 'agent_type', 'deploy_date', 'context_qualifier', 'id']


In [46]:
## exclude rows with null or empty-string values in the predicate column
# edge_output_df = edge_output_df.dropna(subset=['predicate'])

## Drop rows where 'predicate' is NaN, None, or an empty string
edge_output_df = edge_output_df[~edge_output_df['predicate'].isna() & (edge_output_df['predicate'].str.strip() != '')]

## Discard rows where subject_category, object_category, or predicate does not start with the "biolink:" prefix,
## since they could not be converted to a Biolink-compliant form
## Keep only rows where all three columns start with 'biolink:' (na=False treats missing values as non-matching)
edge_output_df = edge_output_df[
    edge_output_df['subject_category'].str.startswith('biolink:', na=False) &
    edge_output_df['object_category'].str.startswith('biolink:', na=False) &
    edge_output_df['predicate'].str.startswith('biolink:', na=False)
]

In [47]:
## Check the knowledge_source column again
## Count occurrences of each unique value in 'knowledge_source'
counts = edge_output_df['knowledge_source'].value_counts()

print(counts)

knowledge_source
PrimeKG    914870
Name: count, dtype: int64


In [48]:
print(len(edge_output_df))

914870


In [49]:
## sanity check on the column names of the parsed KG
## the desired order is:
## 'subject', 'object', 'subject_category', 'object_category', 'predicate', 'knowledge_source', 'knowledge_souce_total', 
## 'knowledge_level', 'agent_type', 'deploy_date', 'context_qualifier', 'id'

## Print the order of column names
print(list(edge_output_df.columns))

['subject', 'object', 'subject_category', 'object_category', 'predicate', 'knowledge_source', 'knowledge_souce_total', 'knowledge_level', 'agent_type', 'deploy_date', 'context_qualifier', 'id']


### Now quality control of the parsed PrimeKG dataframe

In [50]:
## check all unique subject_category values
counts = edge_output_df['subject_category'].value_counts()
print(counts)

subject_category
biolink:Gene       662486
biolink:Disease    160452
biolink:Drug        91932
Name: count, dtype: int64


In [51]:
## check all unique object_category values
counts = edge_output_df['object_category'].value_counts()
print(counts)

object_category
biolink:Gene                          642150
biolink:DiseaseOrPhenotypicFeature    175784
biolink:Disease                        76600
biolink:Drug                           20336
Name: count, dtype: int64


In [52]:
## check all unique predicate values
counts = edge_output_df['predicate'].value_counts()
print(counts)

predicate
biolink:physical_interacts_with           642150
biolink:has_phenotype                     111000
biolink:has_side_effect                    64784
biolink:broad_match                        49452
biolink:has_contraindication               24962
biolink:target_for                         16380
biolink:GeneAffectsChemicalAssociation      3092
biolink:treats                              2186
biolink:can_be_carried_out_by                864
Name: count, dtype: int64


In [53]:
## Group by predicate
grouped = edge_output_df.groupby('predicate')

## For each predicate, output unique (subject_category, object_category) pairs
for predicate, group in grouped:
    print(f"\nPredicate: {predicate}")
    pairs = group[['subject_category', 'object_category']].drop_duplicates()
    for _, row in pairs.iterrows():
        print(f"  ({row['subject_category']}, {row['object_category']})")


Predicate: biolink:GeneAffectsChemicalAssociation
  (biolink:Gene, biolink:Drug)

Predicate: biolink:broad_match
  (biolink:Disease, biolink:Disease)

Predicate: biolink:can_be_carried_out_by
  (biolink:Gene, biolink:Drug)

Predicate: biolink:has_contraindication
  (biolink:Drug, biolink:Disease)

Predicate: biolink:has_phenotype
  (biolink:Disease, biolink:DiseaseOrPhenotypicFeature)

Predicate: biolink:has_side_effect
  (biolink:Drug, biolink:DiseaseOrPhenotypicFeature)

Predicate: biolink:physical_interacts_with
  (biolink:Gene, biolink:Gene)

Predicate: biolink:target_for
  (biolink:Gene, biolink:Drug)

Predicate: biolink:treats
  (biolink:Drug, biolink:Disease)


In [54]:
## check all unique knowledge_source values
counts = edge_output_df['knowledge_source'].value_counts()
print(counts)

knowledge_source
PrimeKG    914870
Name: count, dtype: int64


In [55]:
edge_output_df.head(5)

Unnamed: 0,subject,object,subject_category,object_category,predicate,knowledge_source,knowledge_souce_total,knowledge_level,agent_type,deploy_date,context_qualifier,id
0,NCBIGene:9796,NCBIGene:56992,biolink:Gene,biolink:Gene,biolink:physical_interacts_with,PrimeKG,PrimeKG,knowledge_assertion,automated_agent,2025-07-08,,b517ee25-7533-5773-9d01-55c489e65ce3
1,NCBIGene:7918,NCBIGene:9240,biolink:Gene,biolink:Gene,biolink:physical_interacts_with,PrimeKG,PrimeKG,knowledge_assertion,automated_agent,2025-07-08,,982e1505-0a46-524a-ae94-a01d3ad46662
2,NCBIGene:8233,NCBIGene:23548,biolink:Gene,biolink:Gene,biolink:physical_interacts_with,PrimeKG,PrimeKG,knowledge_assertion,automated_agent,2025-07-08,,88373117-1e09-5fbc-92c7-44196778f5db
3,NCBIGene:4899,NCBIGene:11253,biolink:Gene,biolink:Gene,biolink:physical_interacts_with,PrimeKG,PrimeKG,knowledge_assertion,automated_agent,2025-07-08,,bd1b1329-75a1-5be0-b7ad-8965cef9a0c2
4,NCBIGene:5297,NCBIGene:8601,biolink:Gene,biolink:Gene,biolink:physical_interacts_with,PrimeKG,PrimeKG,knowledge_assertion,automated_agent,2025-07-08,,17fd6751-6b8d-5556-b51b-3f6b2113e7b1


## Integrate signor and cell_marker_genes.csv into pharmacogenomics KG

In [56]:
## Note: change the file path below to your own
signor_cellmarker_files_path = '/Users/Weiqi0/ISB_working/Ilya_lab/Translator/AML_KG_deployment/files/'

In [57]:
## List all csv files in the path (the AML-related input files)

for f in os.listdir(signor_cellmarker_files_path):
    if f.endswith('.csv'):
        print(f)

AML.MONDO_descendants.biolink.compliant.05_12_2025.csv
AMLKG_Gene_Drug_curie.csv
AMLKG_Gene_Disease_curie.csv
AMLKG_DrugApproval_curie.csv
cell_marker_genes.csv
van_galen_cell_type_genes.csv
signor_genes.csv


In [58]:
## Read the cell marker genes csv file
cell_marker_genes_df = pd.read_csv(signor_cellmarker_files_path + 'cell_marker_genes.csv')

## select only the needed columns, split them into node & edge tables, and rename the columns
## into the desired format (chaining .rename returns a new DataFrame and avoids SettingWithCopyWarning)
cell_marker_genes_edge_df = cell_marker_genes_df[['subject_identifier', 'predicate', 'object_identifier',
                                                  'knowledge_level', 'anatomical_context_qualifier',
                                                  'Primary_Knowledge_Source', 'object_category', 'publications',
                                                  'subject_category']].rename(
    columns={'subject_identifier': 'subject', 'object_identifier': 'object',
             'Primary_Knowledge_Source': 'knowledge_source'})
cell_marker_genes_node_subject_df = cell_marker_genes_df[['subject_identifier', 'subject_name', 'subject_category']].rename(
    columns={'subject_identifier': 'id', 'subject_name': 'name', 'subject_category': 'category'})
cell_marker_genes_node_object_df = cell_marker_genes_df[['object_identifier', 'object_name', 'object_category']].rename(
    columns={'object_identifier': 'id', 'object_name': 'name', 'object_category': 'category'})

## vertical concatenation
cell_marker_genes_node_df = pd.concat([cell_marker_genes_node_subject_df, cell_marker_genes_node_object_df])

# cell_marker_genes_edge_df.head(5)

# print(len(cell_marker_genes_edge_df))  # number of edge rows


In [59]:
## Read the SIGNOR genes csv file
signor_genes_df = pd.read_csv(signor_cellmarker_files_path + 'signor_genes.csv')

## select only the needed columns, split them into node & edge tables, and rename the columns
## into the desired format (chaining .rename returns a new DataFrame and avoids SettingWithCopyWarning)
signor_genes_edge_df = signor_genes_df[['subject_identifier', 'predicate', 'object_identifier',
                                        'knowledge_level', 'anatomical_context_qualifier',
                                        'Primary_Knowledge_Source', 'object_category', 'publications',
                                        'subject_category']].rename(
    columns={'subject_identifier': 'subject', 'object_identifier': 'object',
             'Primary_Knowledge_Source': 'knowledge_source'})
signor_genes_node_subject_df = signor_genes_df[['subject_identifier', 'subject_name', 'subject_category']].rename(
    columns={'subject_identifier': 'id', 'subject_name': 'name', 'subject_category': 'category'})
signor_genes_node_object_df = signor_genes_df[['object_identifier', 'object_name', 'object_category']].rename(
    columns={'object_identifier': 'id', 'object_name': 'name', 'object_category': 'category'})

## vertical concatenation
signor_genes_node_df = pd.concat([signor_genes_node_subject_df, signor_genes_node_object_df])

# signor_genes_edge_df.head(5)


In [60]:
## concatenate signor and cellmarker dfs
list_of_input_node_files = [cell_marker_genes_node_df, signor_genes_node_df]

list_of_input_edge_files = [cell_marker_genes_edge_df, signor_genes_edge_df]

cellmarker_signor_node_df = pd.concat(list_of_input_node_files).drop_duplicates()
cellmarker_signor_edge_df = pd.concat(list_of_input_edge_files).drop_duplicates()

## Original row count
print("The original row count: ", cellmarker_signor_edge_df.shape[0])

## Step 1: Remove edges with an unknown predicate (boolean indexing)
value_to_remove = 'unknown'
cellmarker_signor_edge_df_filtered = cellmarker_signor_edge_df[cellmarker_signor_edge_df['predicate'] != value_to_remove]

print("The row count after filtering: ", cellmarker_signor_edge_df_filtered.shape[0])

The original row count:  46820
The row count after filtering:  45737


In [61]:
## Step 2: assign Biolink node categories

## Deprecated: switched to the shared category_map to obtain Biolink-format node categories directly.
# cellmarker_signor_node_df['category'] = 'biolink:' + cellmarker_signor_node_df['category'].astype(str)
# cellmarker_signor_edge_df_filtered['subject_category'] = 'biolink:' + cellmarker_signor_edge_df_filtered['subject_category'].astype(str)
# cellmarker_signor_edge_df_filtered['object_category'] = 'biolink:' + cellmarker_signor_edge_df_filtered['object_category'].astype(str)

cellmarker_signor_node_df['category'] = (
    cellmarker_signor_node_df['category'].map(category_map)
)

## assign() returns a new DataFrame, so the mapped columns are not written onto a
## slice of cellmarker_signor_edge_df (which triggers pandas' SettingWithCopyWarning)
cellmarker_signor_edge_df_filtered = cellmarker_signor_edge_df_filtered.assign(
    subject_category=cellmarker_signor_edge_df_filtered['subject_category'].map(category_map),
    object_category=cellmarker_signor_edge_df_filtered['object_category'].map(category_map),
)


In [62]:
unique_category_values = cellmarker_signor_edge_df_filtered['predicate'].unique()
print("All possible predicate categories are here: " ,unique_category_values)

All possible predicate categories are here:  ['expressed_in' 'upregulates' 'downregulates' 'in_complex']


In [63]:
## Step 3: map source predicates to Biolink predicates
## Earlier manual mapping (deprecated, kept for reference):
##   expressed_in  -> biolink:expressed_in
##   upregulates   -> biolink:upregulated
##   downregulates -> biolink:downregulated
##   in_complex    -> biolink:in_complex_with

## Deprecated: replaced by a direct lookup in predicate_map
# Replace values in 'predicate' where...
# cellmarker_signor_edge_df_filtered.loc[cellmarker_signor_edge_df_filtered['predicate'] == 'expressed_in', 'predicate'] = 'biolink:expressed_in'
# cellmarker_signor_edge_df_filtered.loc[cellmarker_signor_edge_df_filtered['predicate'] == 'upregulates', 'predicate'] = 'biolink:upregulated'
# cellmarker_signor_edge_df_filtered.loc[cellmarker_signor_edge_df_filtered['predicate'] == 'downregulates', 'predicate'] = 'biolink:downregulated'
# cellmarker_signor_edge_df_filtered.loc[cellmarker_signor_edge_df_filtered['predicate'] == 'in_complex', 'predicate'] = 'biolink:in_complex_with'
# cellmarker_signor_edge_df_filtered.loc[cellmarker_signor_edge_df_filtered['predicate'] == 'associated_with_sensitivity_to', 'predicate'] = 'biolink:sensitivity_associated_with'
# cellmarker_signor_edge_df_filtered.loc[cellmarker_signor_edge_df_filtered['predicate'] == 'associated_with_resistance_to', 'predicate'] = 'biolink:resistance_associated_with'
# cellmarker_signor_edge_df_filtered.loc[cellmarker_signor_edge_df_filtered['predicate'] == 'GeneToDiseaseOrPhenotypicFeatureAssociation', 'predicate'] = 'biolink:gene_associated_with_condition'
# cellmarker_signor_edge_df_filtered.loc[cellmarker_signor_edge_df_filtered['predicate'] == 'treated_by', 'predicate'] = 'biolink:treated_by'
# cellmarker_signor_edge_df_filtered.loc[cellmarker_signor_edge_df_filtered['predicate'] == 'treats', 'predicate'] = 'biolink:treats'

# Keep the source predicate under 'predicate_source_name', then look up the Biolink predicate
cellmarker_signor_edge_df_filtered = cellmarker_signor_edge_df_filtered.rename(columns={'predicate': 'predicate_source_name'})

cellmarker_signor_edge_df_filtered['predicate'] = cellmarker_signor_edge_df_filtered.apply(
    lambda row: predicate_map.get((row['subject_category'], row['object_category'], row['predicate_source_name'])),
    axis=1
)

unique_category_values = cellmarker_signor_edge_df_filtered['predicate'].unique()
print("All possible predicate categories are here: " ,unique_category_values)

All possible predicate categories are here:  ['biolink:expressed_in' 'biolink:regulates' None 'biolink:in_complex_with']
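The `None` in the output above means that some (subject_category, object_category, predicate) combinations have no entry in `predicate_map`. A small sketch of how such unmapped combinations could be listed before they are dropped. `toy_predicate_map` and `toy_edges` are hypothetical miniatures, not the real dictionary or data.

```python
import pandas as pd

# Hypothetical miniature of predicate_map:
# (subject_category, object_category, source predicate) -> Biolink predicate
toy_predicate_map = {
    ("biolink:Gene", "biolink:Cell", "expressed_in"): "biolink:expressed_in",
    ("biolink:Gene", "biolink:Gene", "upregulates"): "biolink:regulates",
}

toy_edges = pd.DataFrame({
    "subject_category": ["biolink:Gene", "biolink:Gene", "biolink:Drug"],
    "object_category": ["biolink:Cell", "biolink:Gene", "biolink:Gene"],
    "predicate_source_name": ["expressed_in", "upregulates", "upregulates"],
})

key_cols = ["subject_category", "object_category", "predicate_source_name"]

# Build one tuple key per row and look it up; keys missing from the map give None
keys = list(zip(*(toy_edges[c] for c in key_cols)))
toy_edges["predicate"] = [toy_predicate_map.get(k) for k in keys]

# Combinations without a mapping, with the number of edges affected by each
unmapped = (
    toy_edges[toy_edges["predicate"].isna()]
    .groupby(key_cols)
    .size()
    .reset_index(name="n_edges")
)
print(unmapped)
```

Reviewing this table first shows whether the missing entries are expected gaps or mappings that still need to be added to the dictionary.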


In [64]:
import uuid
import pandas as pd

## generate uuid from column combination
def generate_uuid_from_columns(df, column_list, namespace=uuid.NAMESPACE_DNS):
    """
    Generates one UUID per row from the values in the specified columns of a Pandas DataFrame.

    Args:
        df (pd.DataFrame): The input DataFrame.
        column_list (list): Names of the columns to use for UUID generation.
        namespace (uuid.UUID): A UUID namespace (default is uuid.NAMESPACE_DNS).

    Returns:
        pd.Series: A Pandas Series containing the generated UUIDs as hex strings.
    """
    # axis=1 applies the function row by row; without it, apply works column by column
    return df[column_list].apply(
        lambda row: uuid.uuid5(namespace, ''.join(row.astype(str))).hex, axis=1
    )

def generate_uuid(row):
    """
    Generates a name-based (version 5) UUID from the combined values of multiple columns.
    The same input values always give the same UUID.
    """
    combined_string = ''.join(row.astype(str))
    return uuid.uuid5(uuid.NAMESPACE_DNS, combined_string)

In [65]:
## add deploy_date
cellmarker_signor_edge_df_filtered['deploy_date'] = deployment_date

## Add an 'id' column: a deterministic UUID per edge, built from the columns listed below
column_list = ['subject', 'predicate', 'object', 'anatomical_context_qualifier', 'deploy_date']
# Apply the function to each row to generate UUIDs
cellmarker_signor_edge_df_filtered['id'] = cellmarker_signor_edge_df_filtered[column_list].apply(generate_uuid, axis=1)

# edge_df['id'] = generate_uuid_from_columns(edge_df, column_list)
cellmarker_signor_edge_df_filtered.head(5)

Unnamed: 0,subject,predicate_source_name,object,knowledge_level,anatomical_context_qualifier,knowledge_source,object_category,publications,subject_category,predicate,deploy_date,id
0,NCBIGene:216,expressed_in,CL:0000037,knowledge_assertion,UBERON:0000178,CellMarker,biolink:Cell,PMID:36300619,biolink:Gene,biolink:expressed_in,2025-07-08,43045311-11a8-58c1-9ffc-5184b8fb68c0
1,NCBIGene:100532731,expressed_in,CL:0000037,knowledge_assertion,UBERON:0000178,CellMarker,biolink:Cell,PMID:36300619,biolink:Gene,biolink:expressed_in,2025-07-08,766178a4-bd66-55a7-a6bc-c810ebb3ead6
2,NCBIGene:914,expressed_in,CL:0000084,knowledge_assertion,UBERON:0005408,CellMarker,biolink:Cell,PMID:36300619,biolink:Gene,biolink:expressed_in,2025-07-08,b6a2f2ba-6d36-5430-92fb-8a7e2fbdcef0
3,NCBIGene:920,expressed_in,CL:0000624,knowledge_assertion,UBERON:0000178,CellMarker,biolink:Cell,PMID:36300619,biolink:Gene,biolink:expressed_in,2025-07-08,a26e7342-f8ea-5e81-a800-78399ad33b5d
4,NCBIGene:920,expressed_in,CL:0000492,knowledge_assertion,UBERON:0005408,CellMarker,biolink:Cell,PMID:36300619,biolink:Gene,biolink:expressed_in,2025-07-08,04dfbe13-a9e3-56ad-8755-23d2ddfd60f5
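The ids above are name-based (version 5) UUIDs, so the same column values always produce the same id. The sketch below illustrates that property on toy rows. The helper `edge_uuid` and its `sep` argument are assumptions added for illustration: the separator keeps value pairs such as ('ab', 'c') and ('a', 'bc') from producing the same string, and it also means these ids differ from the ones generated in the notebook.

```python
import uuid
import pandas as pd

def edge_uuid(row, sep="|"):
    """Return a version 5 UUID string for one row (hypothetical helper).

    `sep` is an illustrative addition that keeps adjacent values from running together.
    """
    combined = sep.join(row.astype(str))
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, combined))

toy = pd.DataFrame({
    "subject": ["NCBIGene:216", "NCBIGene:216"],
    "predicate": ["biolink:expressed_in", "biolink:expressed_in"],
    "object": ["CL:0000037", "CL:0000084"],
})
cols = ["subject", "predicate", "object"]

# Generating the ids twice gives identical results, and different rows get different ids
ids_first = toy[cols].apply(edge_uuid, axis=1)
ids_second = toy[cols].apply(edge_uuid, axis=1)

print(ids_first.tolist())
print((ids_first == ids_second).all())
```

Because `deploy_date` is part of the notebook's column list, the same edge receives a new id with every deployment date; whether that is desired depends on how the ids are used downstream.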


### Plover requires Biolink-compliant files, so every category must take the form biolink:XXX
* For example, use "biolink:Gene" instead of "Gene"
* The full list of Biolink predicates is documented here: 

In [66]:
sanity_check_df = cellmarker_signor_node_df[ (cellmarker_signor_node_df['category'] == 'Antibody') ]

# print(len(sanity_check_df))
sanity_check_df.head(5)

Unnamed: 0,id,name,category


In [67]:
## count nodes per category
counts = cellmarker_signor_node_df['category'].value_counts()
print(counts)

category
biolink:Gene                     8814
biolink:ChemicalEntity           1006
biolink:MacromolecularComplex     519
biolink:SmallMolecule             466
biolink:Cell                      460
biolink:PhenotypicFeature         209
biolink:Protein                   143
biolink:ProteinFamily              92
biolink:EnvironmentalProcess       26
biolink:MicroRNA                   25
biolink:Drug                       25
biolink:Noncoding_RNAProduct        1
Name: count, dtype: int64


In [68]:
## Drop rows where subject_category, object_category, or predicate does not start with the "biolink:" prefix,
## since they could not be converted to a Biolink-compliant form
## Keep only rows where all three columns start with 'biolink:' (na=False treats missing values as non-matching)
cellmarker_signor_edge_df_filtered = cellmarker_signor_edge_df_filtered[
    cellmarker_signor_edge_df_filtered['subject_category'].str.startswith('biolink:', na=False) &
    cellmarker_signor_edge_df_filtered['object_category'].str.startswith('biolink:', na=False) &
    cellmarker_signor_edge_df_filtered['predicate'].str.startswith('biolink:', na=False)
]

cellmarker_signor_node_df = cellmarker_signor_node_df[
    cellmarker_signor_node_df['category'].str.startswith('biolink:', na=False)]
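Before rows are discarded, it can help to see how many fail the prefix check and why. A minimal sketch on toy data; the frame `toy` and the shortened column list `check_cols` are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "subject_category": ["biolink:Gene", "Gene", "biolink:Gene"],
    "predicate": ["biolink:regulates", "biolink:regulates", None],
})
check_cols = ["subject_category", "predicate"]

# na=False treats missing values as non-compliant instead of propagating NaN into the mask
compliant = pd.concat(
    [toy[c].str.startswith("biolink:", na=False) for c in check_cols], axis=1
).all(axis=1)

kept = toy[compliant]
dropped = toy[~compliant]
print(len(kept), len(dropped))
print(dropped)
```

Keeping the `dropped` frame around makes it possible to report which source categories or predicates still lack a Biolink mapping.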

In [69]:
## add a new agent_type column and set its value to 'text_mining_agent'
cellmarker_signor_edge_df_filtered['agent_type'] = 'text_mining_agent'

In [70]:
## Print the order of column names
print(list(cellmarker_signor_edge_df_filtered.columns))

## Sanity check on the column names of the parsed KG
## The desired order is:
## 'subject', 'object', 'subject_category', 'object_category', 'predicate', 'knowledge_source', 'knowledge_souce_total', 
## 'knowledge_level', 'agent_type', 'deploy_date', 'context_qualifier', 'id'
desired_column_order = ['subject', 'object', 'subject_category', 'object_category', 'predicate', 'knowledge_source', 'knowledge_souce_total', 
'knowledge_level', 'agent_type', 'deploy_date', 'context_qualifier', 'id']


['subject', 'predicate_source_name', 'object', 'knowledge_level', 'anatomical_context_qualifier', 'knowledge_source', 'object_category', 'publications', 'subject_category', 'predicate', 'deploy_date', 'id', 'agent_type']
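The cell above defines a desired column order and prints the current one, but the two are not compared. A possible sketch of that comparison and of a reordering that puts the desired columns first. The shortened `desired` list and the frame `toy` are illustrative stand-ins, not the notebook's variables.

```python
import pandas as pd

desired = ["subject", "object", "predicate", "id"]  # shortened, illustrative order

toy = pd.DataFrame({
    "id": ["e1"],
    "predicate": ["biolink:regulates"],
    "publications": ["PMID:1"],
    "subject": ["NCBIGene:1"],
    "object": ["NCBIGene:2"],
})

# Columns expected but absent, and columns present but not in the desired list
missing = [c for c in desired if c not in toy.columns]
extra = [c for c in toy.columns if c not in desired]

# Desired columns first (only those present), then the remaining ones in their current order
ordered = toy[[c for c in desired if c in toy.columns] + extra]
print(missing, list(ordered.columns))
```

Listing `missing` separately avoids a KeyError when a desired column has not been created yet.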


In [71]:
## count edges per subject category
counts = cellmarker_signor_edge_df_filtered['subject_category'].value_counts()
print(counts)

subject_category
biolink:Gene                     33076
biolink:MacromolecularComplex     2632
biolink:ChemicalEntity            2522
biolink:ProteinFamily             1341
biolink:SmallMolecule             1085
biolink:Protein                    357
biolink:EnvironmentalProcess       109
biolink:MicroRNA                    43
biolink:Drug                        29
biolink:Noncoding_RNAProduct         3
Name: count, dtype: int64


In [72]:
## count edges per object category
counts = cellmarker_signor_edge_df_filtered['object_category'].value_counts()
print(counts)

object_category
biolink:Gene                     29553
biolink:Cell                      7973
biolink:PhenotypicFeature         1573
biolink:MacromolecularComplex     1218
biolink:SmallMolecule              426
biolink:ProteinFamily              229
biolink:Protein                    225
Name: count, dtype: int64


In [73]:
## check all unique predicate values
counts = cellmarker_signor_edge_df_filtered['predicate'].value_counts()
print(counts)

predicate
biolink:regulates          33096
biolink:expressed_in        7973
biolink:in_complex_with      128
Name: count, dtype: int64


In [74]:
## Group by predicate
grouped = cellmarker_signor_edge_df_filtered.groupby('predicate')

## For each predicate, output unique (subject_category, object_category) pairs
for predicate, group in grouped:
    print(f"\nPredicate: {predicate}")
    pairs = group[['subject_category', 'object_category']].drop_duplicates()
    for _, row in pairs.iterrows():
        print(f"  ({row['subject_category']}, {row['object_category']})")


Predicate: biolink:expressed_in
  (biolink:Gene, biolink:Cell)

Predicate: biolink:in_complex_with
  (biolink:ProteinFamily, biolink:MacromolecularComplex)
  (biolink:MacromolecularComplex, biolink:MacromolecularComplex)
  (biolink:SmallMolecule, biolink:MacromolecularComplex)
  (biolink:ChemicalEntity, biolink:MacromolecularComplex)
  (biolink:Protein, biolink:MacromolecularComplex)

Predicate: biolink:regulates
  (biolink:Gene, biolink:Gene)
  (biolink:MacromolecularComplex, biolink:Gene)
  (biolink:MacromolecularComplex, biolink:PhenotypicFeature)
  (biolink:ChemicalEntity, biolink:Gene)
  (biolink:Protein, biolink:Gene)
  (biolink:Gene, biolink:MacromolecularComplex)
  (biolink:SmallMolecule, biolink:Gene)
  (biolink:SmallMolecule, biolink:SmallMolecule)
  (biolink:SmallMolecule, biolink:ProteinFamily)
  (biolink:ProteinFamily, biolink:Gene)
  (biolink:EnvironmentalProcess, biolink:Gene)
  (biolink:Gene, biolink:PhenotypicFeature)
  (biolink:MacromolecularComplex, biolink:Macromol

In [75]:
## Concatenate to obtain overall node & edge dataframes

list_of_input_node_files = [concat_node_df, cellmarker_signor_node_df]

list_of_input_edge_files = [edge_output_df, cellmarker_signor_edge_df_filtered]

node_final_df = pd.concat(list_of_input_node_files).drop_duplicates()
edge_final_df = pd.concat(list_of_input_edge_files).drop_duplicates()

# edge_df.head(10)

In [76]:
## exclude rows with Null values in the predicate column
edge_final_df = edge_final_df.dropna(subset=['predicate'])

In [77]:
## drop the 'knowledge_souce_total' column (the column name is spelled this way in the data), which is no longer needed
edge_final_df = edge_final_df.drop(columns=['knowledge_souce_total'])

## Now save the concatenated node & edge files

In [78]:
## Save both the node and edge files as TSV
node_final_df.to_csv(download_path_node_file, sep='\t', index=False)
edge_final_df.to_csv(download_path_edge_file, sep='\t', index=False)

In [79]:
print("The formatted node file will be saved in this path: ", download_path_node_file)
print("The formatted edge file will be saved in this path: ", download_path_edge_file)

The formatted node file will be saved in this path:  /Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/primeKG_parsed_node_07_08_2025.tsv
The formatted edge file will be saved in this path:  /Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/primeKG_parsed_edge_07_08_2025.tsv


In [80]:
## Create an undirected graph from the DataFrame
## (nx.Graph keeps a single edge per node pair, so parallel edges and edge direction are collapsed)
graph = nx.from_pandas_edgelist(edge_final_df, 'subject', 'object', edge_attr='predicate')

## Print graph information
print('Number of nodes', graph.number_of_nodes())
print('Number of edges', graph.number_of_edges())
print('Average degree', sum(dict(graph.degree).values()) / graph.number_of_nodes())

Number of nodes 51111
Number of edges 590865
Average degree 23.1208546105535
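The edge count printed above comes from an undirected simple graph, so it can be lower than the number of rows in the edge table whenever two nodes are linked by several predicates or in both directions. A small sketch of the difference, using the `create_using` argument of `nx.from_pandas_edgelist` on toy data.

```python
import networkx as nx
import pandas as pd

toy = pd.DataFrame({
    "subject": ["A", "B", "A"],
    "object": ["B", "A", "B"],
    "predicate": ["biolink:regulates", "biolink:regulates", "biolink:in_complex_with"],
})

# Default nx.Graph: undirected, one edge per node pair
simple = nx.from_pandas_edgelist(toy, "subject", "object", edge_attr="predicate")

# MultiDiGraph: keeps direction and parallel edges, one edge per row
multi = nx.from_pandas_edgelist(
    toy, "subject", "object", edge_attr="predicate", create_using=nx.MultiDiGraph
)

print(simple.number_of_edges(), multi.number_of_edges())
```

Which representation is appropriate depends on the question: the simple graph suits connectivity summaries, while the multigraph matches the row count of the edge file.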


## Check how many edges are duplicated in the BigGIM edge file

In [81]:
## Define the BigGIM version number and the path of the BigGIM edge file
BigGIM_version_number = "03_04_2025"
BigGIM_path_edge_file = '/Users/Weiqi0/ISB_working/Ilya_lab/Translator/BigGIM_plover_deployment/files/BigGIM_DrugResponse/biolink_compliant/BigGIM.interacts_with_associated_with.edges.biolink.compliant.tsv'

In [82]:
BigGIM_edge_df = pd.read_csv(BigGIM_path_edge_file, sep = '\t')
BigGIM_edge_df.head()


Unnamed: 0,subject,predicate,object,agent_type,knowledge_level,anatomical_context_qualifier,knowledge_source,object_category,publications,subject_category,...,subject_aspect_qualifier,supporting_study_cohort,Data_set,P_value,context_qualifier,statistics_method,supporting_study_size,object_aspect_qualifier,deploy_date,id
0,NCBIGene:10461,biolink:expressed_in,CL:0000235,manual_agent,knowledge_assertion,UBERON_0000916,CellMarker,biolink:Cell,PMID:31982413,biolink:Gene,...,,,,,,,,,2025-03-01,b041e47f-c4f5-5166-b4fe-557207fcb593
1,NCBIGene:2215,biolink:expressed_in,CL:0000235,manual_agent,knowledge_assertion,UBERON_0000916,CellMarker,biolink:Cell,PMID:31982413,biolink:Gene,...,,,,,,,,,2025-03-01,99f71540-4efa-55bf-87f1-48cdceb65579
2,NCBIGene:4360,biolink:expressed_in,CL:0000235,manual_agent,knowledge_assertion,UBERON_0000916,CellMarker,biolink:Cell,PMID:31982413,biolink:Gene,...,,,,,,,,,2025-03-01,9683d6f3-ab01-5331-8a8d-0b1dcd291432
3,NCBIGene:11326,biolink:expressed_in,CL:0000235,manual_agent,knowledge_assertion,UBERON_0000916,CellMarker,biolink:Cell,PMID:31982413,biolink:Gene,...,,,,,,,,,2025-03-01,b035b572-65d6-5602-ad4e-bbe2c36cf9e9
4,NCBIGene:9332,biolink:expressed_in,CL:0000235,manual_agent,knowledge_assertion,UBERON_0000916,CellMarker,biolink:Cell,PMID:31982413,biolink:Gene,...,,,,,,,,,2025-03-01,20203999-528e-5335-906e-5441d94ee18c


In [83]:
# Specify the columns to compare
cols = ['subject', 'object', 'predicate']

# Use merge to find common rows based on the specified columns
common = pd.merge(edge_final_df[cols], BigGIM_edge_df[cols], on=cols)

# Count the matching rows (a triple that occurs several times in either frame is counted once per matching pair)
num_duplicates = len(common)

print(f"Number of duplicated rows: {num_duplicates}")

Number of duplicated rows: 148185
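An inner merge returns one row for every matching pair, so the count above may exceed the number of distinct shared (subject, object, predicate) triples if either table repeats a triple. A sketch of the difference on toy frames; `left` and `right` are made-up examples, not the real edge tables.

```python
import pandas as pd

cols = ["subject", "object", "predicate"]

left = pd.DataFrame({
    "subject": ["A", "A", "C"],
    "object": ["B", "B", "D"],
    "predicate": ["p", "p", "p"],
})
right = pd.DataFrame({
    "subject": ["A", "A", "E"],
    "object": ["B", "B", "F"],
    "predicate": ["p", "p", "p"],
})

# Raw merge: the triple (A, B, p) appears twice on each side, giving 2 x 2 matches
raw_matches = len(pd.merge(left[cols], right[cols], on=cols))

# De-duplicating each side first counts every shared triple once
distinct_shared = len(
    pd.merge(left[cols].drop_duplicates(), right[cols].drop_duplicates(), on=cols)
)

print(raw_matches, distinct_shared)
```

Comparing the two numbers on the real tables would show how much of the reported overlap comes from repeated triples.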


## Check how many edges are duplicated in the PharmGKB edge file

In [84]:
## Define the version number
PharmGKB_version_number = "07_08_2025"
PharmGKB_path_edge_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/PharmGKB_parsed_edge_{PharmGKB_version_number}.tsv'

PharmGKB_edge_df = pd.read_csv(PharmGKB_path_edge_file, sep = '\t')
PharmGKB_edge_df.head()

Unnamed: 0,publications,subject,object,subject_category,object_category,predicate,knowledge_source,knowledge_level,agent_type,deploy_date,context_qualifier,id
0,25695618,NCBIGene:162282,MeSH:D000075222,biolink:Gene,biolink:Disease,biolink:associated_with,PharmGKB,knowledge_assertion,automated_agent,2025-07-08,,3f4622cd-cc81-5c60-9fce-4d25214b9b86
1,25695618,NCBIGene:162282,PUBCHEM.COMPOUND:3639,biolink:Gene,biolink:ChemicalEntity,biolink:associated_with,PharmGKB,knowledge_assertion,automated_agent,2025-07-08,,2ebd74fd-1cb6-519c-9051-4e0e5880a57d
2,18511948;25545243,NCBIGene:1728,PUBCHEM.COMPOUND:3385,biolink:Gene,biolink:ChemicalEntity,biolink:associated_with,PharmGKB,knowledge_assertion,automated_agent,2025-07-08,,3e9e9eee-e33e-5903-b6c0-fd6f960ab809
3,24924344;25545243,NCBIGene:1728,PUBCHEM.COMPOUND:6857599,biolink:Gene,biolink:ChemicalEntity,biolink:associated_with,PharmGKB,knowledge_assertion,automated_agent,2025-07-08,,5dfcff5b-dea4-53f7-b9ff-1db8d23309df
4,30237583,NCBIGene:1728,MeSH:D046152,biolink:Gene,biolink:Disease,biolink:associated_with,PharmGKB,knowledge_assertion,automated_agent,2025-07-08,,6149843b-5b1d-562b-b2e9-82391a552db3


In [85]:
## count edges per subject category
counts = PharmGKB_edge_df['subject_category'].value_counts()
print(counts)

subject_category
biolink:Gene              11702
biolink:ChemicalEntity     5907
biolink:Disease            3349
Name: count, dtype: int64


In [87]:
## check all unique predicate values
counts = PharmGKB_edge_df['predicate'].value_counts()

print(counts)

predicate
biolink:associated_with    20958
Name: count, dtype: int64


In [88]:
# Specify the columns to compare
cols = ['subject', 'object', 'predicate']

# Use merge to find common rows based on the specified columns
common = pd.merge(edge_final_df[cols], PharmGKB_edge_df[cols], on=cols)

# Get the number of duplicated rows
num_duplicates = len(common)

print(f"Number of duplicated rows: {num_duplicates}")

Number of duplicated rows: 0


In [89]:
PharmGKB_path_node_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/PharmGKB_parsed_node_{PharmGKB_version_number}.tsv'

PharmGKB_node_df = pd.read_csv(PharmGKB_path_node_file, sep = '\t')
PharmGKB_node_df.head()

Unnamed: 0,id,name,category
0,NCBIGene:162282,ANKFN1,biolink:Gene
1,NCBIGene:1728,NQO1,biolink:Gene
2,NCBIGene:6813,STXBP2,biolink:Gene
3,PUBCHEM.COMPOUND:33,chloroacetaldehyde,biolink:ChemicalEntity
4,NCBIGene:8000,PSCA,biolink:Gene


In [90]:
## Concatenate to obtain overall node & edge dataframes

list_of_input_node_files = [node_final_df, PharmGKB_node_df]

list_of_input_edge_files = [edge_final_df, PharmGKB_edge_df]

node_final_df = pd.concat(list_of_input_node_files).drop_duplicates()
edge_final_df = pd.concat(list_of_input_edge_files).drop_duplicates()

# edge_df.head(10)

In [91]:
## Remove rows whose subject or object is missing or empty
# Drop rows where 'subject' or 'object' is NaN, None, or an empty string

edge_final_df = edge_final_df[~edge_final_df['subject'].isna() & (edge_final_df['subject'].str.strip() != '')]
edge_final_df = edge_final_df[~edge_final_df['object'].isna() & (edge_final_df['object'].str.strip() != '')]

In [92]:
## Create an undirected graph from the DataFrame
## (nx.Graph keeps a single edge per node pair, so parallel edges and edge direction are collapsed)
graph = nx.from_pandas_edgelist(edge_final_df, 'subject', 'object', edge_attr='predicate')

## Print graph information
print('Number of nodes', graph.number_of_nodes())
print('Number of edges', graph.number_of_edges())
print('Average degree', sum(dict(graph.degree).values()) / graph.number_of_nodes())

Number of nodes 52602
Number of edges 599515
Average degree 22.794380441808297


In [93]:
## Define the output paths for the formatted node & edge files
download_path_node_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/Pharmagenomics_KG_parsed_node_{version_number}.tsv'
download_path_edge_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/Pharmagenomics_KG_parsed_edge_{version_number}.tsv'

## Save both the node and edge files
## (saving is disabled while testing)
# node_final_df.to_csv(download_path_node_file, sep ='\t', index=False)
# edge_final_df.to_csv(download_path_edge_file, sep ='\t', index=False)

In [94]:
print("The formatted node file will be saved in this path: ", download_path_node_file)
print("The formatted edge file will be saved in this path: ", download_path_edge_file)

The formatted node file will be saved in this path:  /Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/Pharmagenomics_KG_parsed_node_07_08_2025.tsv
The formatted edge file will be saved in this path:  /Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/Pharmagenomics_KG_parsed_edge_07_08_2025.tsv


In [95]:
edge_final_df.columns

Index(['subject', 'object', 'subject_category', 'object_category', 'predicate',
       'knowledge_source', 'knowledge_level', 'agent_type', 'deploy_date',
       'context_qualifier', 'id', 'predicate_source_name',
       'anatomical_context_qualifier', 'publications'],
      dtype='object')

In [96]:
## Check the knowledge_source column again
## Count occurrences of each unique value in 'knowledge_source'
counts = edge_final_df['knowledge_source'].value_counts()

print(counts)

knowledge_source
PrimeKG          914870
PharmGKB          20427
CellMarker         6599
SIGNOR-250870         1
SIGNOR-58928          1
                  ...  
SIGNOR-249677         1
SIGNOR-277555         1
SIGNOR-256127         1
SIGNOR-261145         1
SIGNOR-261887         1
Name: count, Length: 33227, dtype: int64
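In the counts above, each SIGNOR value appears to be a record-level identifier, which is why `knowledge_source` has more than 33,000 distinct values. For a per-resource summary, one option is to collapse those identifiers for counting only, without changing the edge table. The series `sources` and the regular expression are illustrative assumptions about the identifier format.

```python
import pandas as pd

sources = pd.Series(
    ["PrimeKG", "SIGNOR-250870", "SIGNOR-58928", "CellMarker", "PrimeKG"],
    name="knowledge_source",
)

# Collapse values of the form 'SIGNOR-<digits>' to 'SIGNOR' in a separate series
collapsed = sources.str.replace(r"^SIGNOR-\d+$", "SIGNOR", regex=True)

counts = collapsed.value_counts()
print(counts)
```

Whether the identifier should instead be stored in a separate column is a modelling decision outside the scope of this sketch.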


In [97]:
edge_final_df['knowledge_source'].isna().sum()

np.int64(0)

In [98]:
edge_final_df['id'].isna().sum()

np.int64(0)

In [99]:
edge_final_df['predicate'].isna().sum()

np.int64(0)

In [100]:
edge_final_df['object'].isna().sum()

np.int64(0)

In [101]:
edge_final_df['subject'].isna().sum()

np.int64(0)
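The five missing-value checks above can also be run in a single call. A minimal sketch on a toy frame; the `required` list mirrors the columns checked above, and the data are invented.

```python
import pandas as pd

required = ["subject", "object", "predicate", "knowledge_source", "id"]

toy = pd.DataFrame({
    "subject": ["A", None],
    "object": ["B", "C"],
    "predicate": ["p", "p"],
    "knowledge_source": ["X", "Y"],
    "id": ["1", "2"],
})

# One missing-value count per required column
missing_counts = toy[required].isna().sum()
print(missing_counts)
```

A result of all zeros corresponds to the five separate `np.int64(0)` outputs above.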

## The following code is used for quality control and sanity checks
* Check and confirm that all subject & object types are correctly formatted

In [102]:
## count edges per subject category
counts = edge_final_df['subject_category'].value_counts()

print(counts)

subject_category
biolink:Gene                     705625
biolink:Disease                  163798
biolink:Drug                      91961
biolink:ChemicalEntity             8166
biolink:MacromolecularComplex      2632
biolink:ProteinFamily              1341
biolink:SmallMolecule              1085
biolink:Protein                     357
biolink:EnvironmentalProcess        109
biolink:MicroRNA                     43
biolink:Noncoding_RNAProduct          3
Name: count, dtype: int64


In [103]:
## count edges per object category
counts = edge_final_df['object_category'].value_counts()

print(counts)

object_category
biolink:Gene                          683140
biolink:DiseaseOrPhenotypicFeature    175784
biolink:Disease                        79946
biolink:Drug                           20336
biolink:Cell                            6599
biolink:ChemicalEntity                  5644
biolink:PhenotypicFeature               1573
biolink:MacromolecularComplex           1218
biolink:SmallMolecule                    426
biolink:ProteinFamily                    229
biolink:Protein                          225
Name: count, dtype: int64


In [104]:
## check all unique predicate values
counts = edge_final_df['predicate'].value_counts()

print(counts)

predicate
biolink:physical_interacts_with           642150
biolink:has_phenotype                     111000
biolink:has_side_effect                    64784
biolink:broad_match                        49452
biolink:regulates                          33096
biolink:has_contraindication               24962
biolink:associated_with                    20427
biolink:target_for                         16380
biolink:expressed_in                        6599
biolink:GeneAffectsChemicalAssociation      3092
biolink:treats                              2186
biolink:can_be_carried_out_by                864
biolink:in_complex_with                      128
Name: count, dtype: int64
