## Parser for Table of Pharmacogenomic Biomarkers in Drug Labeling
* obtained the source table here: https://www.fda.gov/drugs/science-and-research-drugs/table-pharmacogenomic-biomarkers-drug-labeling
* last obtained timestamp: 06/25/2025
* Content current as of: 09/23/2024
* additional information: 

In [117]:
## To do list:
## Add

In [1]:
## Load necessary packages
import os
import pandas as pd
import glob
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

## load the TCT related packages
from TCT import node_normalizer
from TCT import name_resolver
from TCT import translator_metakg
from TCT import translator_kpinfo
from TCT import translator_query
from TCT import TCT

## Define the version number
version_number = "09_04_2025"
deployment_date = "2025-09-04"

## Load the Biolink category and predicate dictionary for mapping subject, object, and predicate types

In [2]:
## Load the Biolink category and predicate dictionary for mapping subject, object, and predicate types
%run ./Biolink_category_and_predication_dictionary.ipynb

Date of last update:  2025-09-04
Order is to always process Node/category map first, since the Edeg/predicate map depends on biolink-complainat node values
-----------------------------------------------------------------------------------------------------------------------------
Dictionary: category_map, Key template: Subject_category or Object_category
------------------------------------------------------------------------------------------
Dictionary: predicate_map, Key template: (Subject_category, Object_category, Predicate)
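
The key templates printed above suggest two plain Python dictionaries. The sketch below only illustrates that shape; the real dictionaries are defined in `Biolink_category_and_predication_dictionary.ipynb` and may differ. The three `category_map` entries mirror what the tables in this notebook show (`Drug`, `Gene`, and `Disease` become `biolink:Drug`, `biolink:Gene`, and `biolink:Disease`). The `predicate_map` entry and the helper `to_biolink_category` are hypothetical.

```python
# Illustrative only: the real dictionaries live in the dictionary notebook.
category_map = {
    "Drug": "biolink:Drug",
    "Gene": "biolink:Gene",
    "Disease": "biolink:Disease",
}

# Hypothetical entry that follows the printed key template
# (Subject_category, Object_category, Predicate).
predicate_map = {
    ("biolink:Drug", "biolink:Gene", "has_biomarker"): "biolink:has_biomarker",
}


def to_biolink_category(raw_category):
    """Return the Biolink category, or None when the raw label is unmapped."""
    return category_map.get(raw_category)


print(to_biolink_category("Gene"))     # biolink:Gene
print(to_biolink_category("Unknown"))  # None
```

In pandas, `Series.map(category_map)` behaves the same way for unmapped labels, except that it yields `NaN` instead of `None`, so unmapped categories are worth checking after the mapping step.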


## Load all helper functions

In [120]:
## Load all helper functions
%run /Users/Weiqi0/ISB_working/Hadlock_lab/QI_ISB_Git_repo/TranslatorPharcogenomicsKG/Parser_helper_functions.ipynb

## Load files and convert them into separate node & edge files
* check all imported file structure

In [7]:
## NOTE: change the file path below to the location of the raw files on your own machine
raw_files_path = '/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/FDA_Pharmacogenomic_biomarkers_in_Drug_labeling/'

## NOTE: the output paths are defined at the very end of the notebook

In [8]:
## List all csv files found in the raw files directory

for f in os.listdir(raw_files_path):
    if f.endswith('.csv'):
        print(f)

Table_of_Pharmacogenomic_Biomarkers_in_Drug_Labeling_FDA.csv


In [9]:
## Read the source csv file
source_df = pd.read_csv(raw_files_path + 'Table_of_Pharmacogenomic_Biomarkers_in_Drug_Labeling_FDA.csv')
source_df.head(10)

Unnamed: 0,Drug,subject,subject_category,Therapeutic Area*,Biomarker†,object,object_category,Labeling Sections
0,Odevixibat (1),RXCUI:2563966,Drug,Gastroenterology,ABCB11,NCBIGene:8647,Gene,"Indications and Usage, Clinical Pharmacology, ..."
1,Maralixibat (3),RXCUI:2571074,Drug,Gastroenterology,ABCB11,NCBIGene:8647,Gene,"Indications and Usage, Clinical Pharmacology, ..."
2,Maralixibat (2),RXCUI:2571074,Drug,Gastroenterology,ABCB4,NCBIGene:5244,Gene,"Clinical Pharmacology, Clinical Studies"
3,Resmetirom,RXCUI:2677894,Drug,Gastroenterology,ABCG2,NCBIGene:9429,Gene,Clinical Pharmacology
4,Triheptanoin,RXCUI:1313234,Drug,Inborn Errors of Metabolism,"ACADVL, CPT2, HADHA, HADHB (Long-Chain Fatty A...",MONDO:0008723,Disease,"Indications and Usage, Clinical Studies"
5,Zilucoplan,RXCUI:2672492,Drug,Neurology,ACHR,NCBIGene:1134,Gene,"Indications and Usage, Clinical Studies"
6,Rozanolixizumab-noli (1),RXCUI:2642273,Drug,Neurology,ACHR,NCBIGene:1134,Gene,"Indications and Usage, Clinical Pharmacology, ..."
7,Ravulizumab-cwvz (1),RXCUI:2107316,Drug,Neurology,ACHR,NCBIGene:1134,Gene,"Indications and Usage, Clinical Studies"
8,Efgartigimod Alfa-fcab,RXCUI:2587719,Drug,Neurology,ACHR,NCBIGene:1134,Gene,"Indications and Usage, Clinical Pharmacology, ..."
9,Eculizumab (1),RXCUI:591781,Drug,Neurology,ACHR,NCBIGene:1134,Gene,"Indications and Usage, Clinical Studies"


In [10]:
## remove all occurrences of \xa0
source_df['Drug'] = source_df['Drug'].str.replace(u'\xa0', '', regex=False)

## remove the trailing footnote marker such as (1) or (2), and anything after it, using a regex
source_df['subject_name'] = source_df['Drug'].str.replace(r'\(\d+\).*', '', regex=True)

## remove all occurrences of \xa0
source_df['object_name'] = source_df['Biomarker†'].str.replace(u'\xa0', '', regex=False)

## map subject_category to its Biolink category
source_df['subject_category'] = (
    source_df['subject_category'].map(category_map)
)

## map object_category to its Biolink category
source_df['object_category'] = (
    source_df['object_category'].map(category_map)
)
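
The name cleanup in the cell above can be checked on a few sample strings with the standard `re` module, which uses the same regex syntax as `Series.str.replace(..., regex=True)`. This is a minimal sketch: the sample strings follow the style of the `Drug` column, the position of the `\xa0` character is an assumption, and the final `.strip()` is an addition that the cell above does not perform (the names printed later in this notebook keep a trailing space, for example `'Odevixibat '`).

```python
import re

# Sample values in the style of the 'Drug' column.
# Where exactly \xa0 appears in the real data is an assumption.
raw_names = ["Odevixibat\xa0(1)", "Maralixibat (3)", "Resmetirom"]

FOOTNOTE_PATTERN = r"\(\d+\).*"  # same pattern as in the cell above


def remove_footnote(raw):
    """Mirror the notebook: drop \\xa0, then drop '(n)' and what follows."""
    return re.sub(FOOTNOTE_PATTERN, "", raw.replace("\xa0", ""))


def clean_drug_name(raw):
    """Same as remove_footnote, plus stripping leftover whitespace (an addition)."""
    return remove_footnote(raw).strip()


print([remove_footnote(n) for n in raw_names])
# ['Odevixibat', 'Maralixibat ', 'Resmetirom']
print([clean_drug_name(n) for n in raw_names])
# ['Odevixibat', 'Maralixibat', 'Resmetirom']
```

The equivalent pandas change would be to chain `.str.strip()` after the `str.replace` call, if trailing spaces in `subject_name` are not wanted in the node file.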

In [11]:
## count the unique drug names (subjects) and biomarker names (objects)
unique_node_type_values = source_df['subject_name'].unique()
# print("All possible drug types are here: " ,unique_node_type_values)
print("--------------------------------------------------------------------------")
print(len(unique_node_type_values))

unique_node_source_values = source_df['object_name'].unique()
# print("All possible biomarker are here: " ,unique_node_source_values)
print("--------------------------------------------------------------------------")
print(len(unique_node_source_values))

--------------------------------------------------------------------------
395
--------------------------------------------------------------------------
174


## Run the name resolver to look up the corresponding identifiers in Translator
* uses the name_resolver.lookup() function
* the lookups showed some issues (see the sanity checks below), so identifiers were assigned manually for now
* as a result, the input table already contains subject and object identifiers

### First, a sanity check on some problematic drug names
* even with return_top_response = False, the results do not necessarily include the best match

In [3]:
name = 'Bupivacaine'
input_node_info = name_resolver.lookup(name)
print(input_node_info)

TranslatorNode(curie='CHEBI:3215', label='Bupivacaine', types=['biolink:SmallMolecule', 'biolink:MolecularEntity', 'biolink:ChemicalEntity', 'biolink:PhysicalEssence', 'biolink:ChemicalOrDrugOrTreatment', 'biolink:ChemicalEntityOrGeneOrGeneProduct', 'biolink:ChemicalEntityOrProteinOrPolypeptide', 'biolink:NamedThing', 'biolink:Entity', 'biolink:PhysicalEssenceOrOccurrent', 'biolink:MolecularMixture', 'biolink:ChemicalMixture', 'biolink:Drug', 'biolink:OntologyClass'], synonyms=None, curie_synonyms=None)


In [4]:
print(input_node_info.curie)

CHEBI:3215


In [5]:
name = 'Articaine and Epinephrine '
input_node_info = name_resolver.lookup(name)
print(input_node_info)

TranslatorNode(curie='RXCUI:285091', label='Septocaine', types=['biolink:Drug', 'biolink:ChemicalOrDrugOrTreatment', 'biolink:OntologyClass', 'biolink:MolecularMixture', 'biolink:ChemicalMixture', 'biolink:ChemicalEntity', 'biolink:PhysicalEssence', 'biolink:ChemicalEntityOrGeneOrGeneProduct', 'biolink:ChemicalEntityOrProteinOrPolypeptide', 'biolink:NamedThing', 'biolink:Entity', 'biolink:PhysicalEssenceOrOccurrent'], synonyms=None, curie_synonyms=None)


In [6]:
## The best match is RxCUI 166283, https://mor.nlm.nih.gov/RxNav/search?searchBy=RXCUI&searchTerm=166283
## but it does not appear even in the full result list returned by name_resolver.lookup
name = 'Lidocaine and Prilocaine'
input_node_info = name_resolver.lookup(name, return_top_response = False)
print(input_node_info)

[TranslatorNode(curie='CHEBI:6456', label='Lidocaine', types=['biolink:SmallMolecule', 'biolink:MolecularEntity', 'biolink:ChemicalEntity', 'biolink:PhysicalEssence', 'biolink:ChemicalOrDrugOrTreatment', 'biolink:ChemicalEntityOrGeneOrGeneProduct', 'biolink:ChemicalEntityOrProteinOrPolypeptide', 'biolink:NamedThing', 'biolink:Entity', 'biolink:PhysicalEssenceOrOccurrent', 'biolink:MolecularMixture', 'biolink:ChemicalMixture', 'biolink:Drug', 'biolink:OntologyClass'], synonyms=None, curie_synonyms=None), TranslatorNode(curie='RXCUI:197877', label='lidocaine 25 MG/ML / prilocaine 25 MG/ML Topical Cream', types=['biolink:Drug', 'biolink:ChemicalOrDrugOrTreatment', 'biolink:OntologyClass', 'biolink:MolecularMixture', 'biolink:ChemicalMixture', 'biolink:ChemicalEntity', 'biolink:PhysicalEssence', 'biolink:ChemicalEntityOrGeneOrGeneProduct', 'biolink:ChemicalEntityOrProteinOrPolypeptide', 'biolink:NamedThing', 'biolink:Entity', 'biolink:PhysicalEssenceOrOccurrent'], synonyms=None, curie_syno
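
One way to flag lookups like the ones above for manual review is to compare the queried name with the label of the top result. The sketch below uses only the standard library; the `(query, label)` pairs are copied from the outputs above, while the helper `label_similarity` and the `THRESHOLD` value are assumptions chosen for illustration, not part of the TCT package.

```python
from difflib import SequenceMatcher


def label_similarity(query, label):
    """Similarity in [0, 1] between a queried name and a returned label."""
    a = query.strip().lower()
    b = label.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


# (query, label of the top result) pairs taken from the lookups above.
lookups = [
    ("Bupivacaine", "Bupivacaine"),
    ("Articaine and Epinephrine ", "Septocaine"),
    ("Lidocaine and Prilocaine", "Lidocaine"),
]

THRESHOLD = 0.8  # assumed cutoff; tune it against manually checked names

for query, label in lookups:
    score = label_similarity(query, label)
    status = "ok" if score >= THRESHOLD else "REVIEW"
    print(f"{query!r} -> {label!r}: {score:.2f} {status}")
```

A string comparison of this kind cannot tell that a brand name such as Septocaine may be a legitimate match for a combination product, so it only narrows down which rows need a manual look.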

In [13]:
## Get all names
names = source_df['subject_name'].drop_duplicates().tolist()

print(names)

['Odevixibat ', 'Maralixibat ', 'Resmetirom', 'Triheptanoin', 'Zilucoplan', 'Rozanolixizumab-noli ', 'Ravulizumab-cwvz ', 'Efgartigimod Alfa-fcab', 'Eculizumab ', 'Nedosiran', 'Lumasiran', 'Capivasertib ', 'Sodium Oxybate', 'Tremelimumab-actl ', 'Tepotinib ', 'Pemetrexed ', 'Pembrolizumab ', 'Nivolumab ', 'Lorlatinib ', 'Ipilimumab ', 'Durvalumab ', 'Crizotinib ', 'Ceritinib', 'Cemiplimab-rwlc ', 'Brigatinib', 'Brentuximab Vedotin ', 'Atezolizumab ', 'Alectinib', 'Cholic Acid', 'Nirogacestat ', 'Lecanemab-irmb', 'Aducanumab-avwa', 'Satralizumab-mwge', 'Inebilizumab-cdon', 'Sodium Phenylbutyrate', 'Glycerol Phenylbutyrate ', 'Succinylcholine ', 'Mivacurium', 'Polatuzumab Vedotin-piiq ', 'Vincristine', 'Ponatinib', 'Omacetaxine', 'Nilotinib ', 'Inotuzumab Ozogamicin', 'Imatinib ', 'Dasatinib', 'Busulfan', 'Bosutinib', 'Blinatumomab ', 'Asciminib', 'Vemurafenib ', 'Trametinib ', 'Tovorafenib', 'Nivolumab and Relatlimab-rmbw ', 'Encorafenib ', 'Dabrafenib ', 'Cobimetinib', 'Cetuximab ', 'B

In [131]:
# ## Apply name_resolver.lookup and extract .curie for subject name
# # source_df['subject'] = source_df['subject_name'].apply(lambda name: name_resolver.lookup(name).curie if name_resolver.lookup(name) else None)

# ## switch to use batch_lookup?
# import pandas as pd

# ## Get all names
# names = source_df['subject_name'].tolist()

# ## Break into batches of 25
# batch_size = 25
# batches = [names[i:i + batch_size] for i in range(0, len(names), batch_size)]

# ## Run batch lookups and collect results
# ## using the return_top_response = false option to get more than just top response
# results = {}
# for batch in batches:
#     lookup_results = name_resolver.batch_lookup(batch)  # Expected to return a dict: {name: result or None}
#     for name, result in lookup_results.items():
#         results[name] = result.curie if result else None

# # Map the resolved CURIEs back to the DataFrame
# source_df['subject'] = source_df['subject_name'].map(results)

In [132]:
# ## Apply name_resolver.lookup and extract .curie for object name as well
# # source_df['object'] = source_df['object_name'].apply(lambda name: name_resolver.lookup(name).curie if name_resolver.lookup(name) else None)

# ## switch to use batch_lookup?
# import pandas as pd

# # Get all names
# names = source_df['object_name'].tolist()

# # Break into batches of 25
# batch_size = 25
# batches = [names[i:i + batch_size] for i in range(0, len(names), batch_size)]

# # Run batch lookups and collect results
# results = {}
# for batch in batches:
#     lookup_results = name_resolver.batch_lookup(batch, return_top_response = False)  # Expected to return a dict: {name: result or None}
#     for name, result in lookup_results.items():
#         results[name] = result.curie if result else None

# # Map the resolved CURIEs back to the DataFrame
# source_df['object'] = source_df['object_name'].map(results)
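
The batching idea in the two disabled cells above can be sketched independently of the resolver. In the sketch below, `fake_batch_lookup` is a hypothetical stand-in that returns `{name: curie or None}`; it is not the `name_resolver.batch_lookup` API, whose exact return type is not shown in this notebook. The only real identifier used is `CHEBI:3215` for Bupivacaine, taken from the sanity check earlier. Deduplicating the names first is a suggestion to reduce the number of lookups.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fake_batch_lookup(batch):
    """Hypothetical resolver stand-in: maps each name to a CURIE or None."""
    known = {"Bupivacaine": "CHEBI:3215"}
    return {name: known.get(name) for name in batch}


names = ["Bupivacaine", "Bupivacaine", "Not a real drug"]

# Look up each distinct name once, preserving first-seen order.
unique_names = list(dict.fromkeys(names))

results = {}
for batch in chunked(unique_names, 25):
    results.update(fake_batch_lookup(batch))

# Map the resolved CURIEs back to the original (duplicated) list.
curies = [results[name] for name in names]
print(curies)  # ['CHEBI:3215', 'CHEBI:3215', None]
```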

## add additional needed columns

In [133]:
## add a predicate "biolink:has_biomarker"
source_df['predicate'] = 'biolink:has_biomarker'

## add a new knowledge_source column and set its value to "FDA pharmacogenomics biomarkers table"
source_df['knowledge_source'] = 'FDA pharmacogenomics biomarkers table'

## add a new knowledge_level column and set value to be 'knowledge_assertion', 'prediction', or 'statistical_association'
source_df['knowledge_level'] = 'knowledge_assertion'

## add a new agent_type column and set value to be 'manual_agent', 'automated_agent', 'computational_model', or 'text_mining_agent'
source_df['agent_type'] = 'text_mining_agent'

## create a context_qualifier column from the 'Therapeutic Area*' column
source_df['context_qualifier'] = source_df['Therapeutic Area*']


In [134]:
source_df.head(4)

Unnamed: 0,Drug,subject,subject_category,Therapeutic Area*,Biomarker†,object,object_category,Labeling Sections,subject_name,object_name,predicate,knowledge_source,knowledge_level,agent_type,context_qualifier
0,Odevixibat (1),RXCUI:2563966,biolink:Drug,Gastroenterology,ABCB11,NCBIGene:8647,biolink:Gene,"Indications and Usage, Clinical Pharmacology, ...",Odevixibat,ABCB11,biolink:has_biomarker,FDA pharmacogenomics biomarkers table,knowledge_assertion,text_mining_agent,Gastroenterology
1,Maralixibat (3),RXCUI:2571074,biolink:Drug,Gastroenterology,ABCB11,NCBIGene:8647,biolink:Gene,"Indications and Usage, Clinical Pharmacology, ...",Maralixibat,ABCB11,biolink:has_biomarker,FDA pharmacogenomics biomarkers table,knowledge_assertion,text_mining_agent,Gastroenterology
2,Maralixibat (2),RXCUI:2571074,biolink:Drug,Gastroenterology,ABCB4,NCBIGene:5244,biolink:Gene,"Clinical Pharmacology, Clinical Studies",Maralixibat,ABCB4,biolink:has_biomarker,FDA pharmacogenomics biomarkers table,knowledge_assertion,text_mining_agent,Gastroenterology
3,Resmetirom,RXCUI:2677894,biolink:Drug,Gastroenterology,ABCG2,NCBIGene:9429,biolink:Gene,Clinical Pharmacology,Resmetirom,ABCG2,biolink:has_biomarker,FDA pharmacogenomics biomarkers table,knowledge_assertion,text_mining_agent,Gastroenterology


In [135]:
## copy to a final df
edge_df = source_df.copy(deep = True)

print(edge_df.shape[0])

## drop columns no longer needed
edge_df = edge_df.drop(columns=['Drug', 'Biomarker†', 'Labeling Sections', 'Therapeutic Area*'])

## Drop rows where 'predicate' is NaN, None, or an empty string
edge_df = edge_df[~edge_df['predicate'].isna() & (edge_df['predicate'].str.strip() != '')]


## Drop rows where 'subject' or 'object' is NaN (no identifier was assigned)
edge_df = edge_df.dropna(subset=['subject', 'object'])

print(edge_df.shape[0])

edge_df['deploy_date'] = deployment_date

608
542


In [136]:
### Add an id column: a UUID generated from the columns that identify an edge
column_list = ['subject', 'predicate', 'object', 'context_qualifier', 'deploy_date']
## apply generate_uuid (loaded with the helper functions above) to each row
edge_df['id'] = edge_df[column_list].apply(generate_uuid, axis=1)
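
`generate_uuid` is defined outside this notebook and its implementation is not shown here. The ids in the output of the next cell carry version digit 5, which is consistent with name-based UUIDs. A minimal sketch of such a helper follows, assuming `uuid.uuid5`; the function name `generate_uuid_sketch`, the namespace, and the `|` separator are all hypothetical, so it is not expected to reproduce the ids shown in this notebook.

```python
import uuid


def generate_uuid_sketch(values, namespace=uuid.NAMESPACE_URL):
    """Build a deterministic UUID (version 5) from a sequence of field values.

    The namespace and the '|' separator are assumptions for illustration.
    """
    name = "|".join(str(value) for value in values)
    return str(uuid.uuid5(namespace, name))


edge = ["RXCUI:2563966", "biolink:has_biomarker", "NCBIGene:8647",
        "Gastroenterology", "2025-09-04"]
other_edge = ["RXCUI:2571074", "biolink:has_biomarker", "NCBIGene:8647",
              "Gastroenterology", "2025-09-04"]

first = generate_uuid_sketch(edge)
second = generate_uuid_sketch(edge)

print(first == second)                          # True: same fields, same id
print(first == generate_uuid_sketch(other_edge))  # False: different subject
print(uuid.UUID(first).version)                 # 5
```

With a pandas DataFrame, a function of this shape can be passed to `DataFrame.apply(..., axis=1)` because each row is iterated as a sequence of values. Since `deploy_date` is one of the key columns, the same edge receives a different id in each deployment.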

In [137]:
edge_df.head(2)

Unnamed: 0,subject,subject_category,object,object_category,subject_name,object_name,predicate,knowledge_source,knowledge_level,agent_type,context_qualifier,deploy_date,id
0,RXCUI:2563966,biolink:Drug,NCBIGene:8647,biolink:Gene,Odevixibat,ABCB11,biolink:has_biomarker,FDA pharmacogenomics biomarkers table,knowledge_assertion,text_mining_agent,Gastroenterology,2025-09-04,4e1f7def-7e8d-5204-9826-56887ff5ecb5
1,RXCUI:2571074,biolink:Drug,NCBIGene:8647,biolink:Gene,Maralixibat,ABCB11,biolink:has_biomarker,FDA pharmacogenomics biomarkers table,knowledge_assertion,text_mining_agent,Gastroenterology,2025-09-04,d5f51f86-e8a1-5481-8105-cdd2d9efc023


### Now create the corresponding node file
* only need three columns: id, name, category

In [138]:
## select the node columns and rename them into the desired format
## (rename returns a new DataFrame, which avoids the SettingWithCopyWarning raised by inplace=True on a slice)
node_subject_df = edge_df[['subject', 'subject_name', 'subject_category']].rename(
    columns={'subject': 'id', 'subject_name': 'name', 'subject_category': 'category'}
)
node_object_df = edge_df[['object', 'object_name', 'object_category']].rename(
    columns={'object': 'id', 'object_name': 'name', 'object_category': 'category'}
)

## stack subject and object nodes and keep one row per unique (id, name, category)
concat_node_df = pd.concat([node_subject_df, node_object_df]).drop_duplicates(keep='first')

concat_node_df.head(2)


Unnamed: 0,id,name,category
0,RXCUI:2563966,Odevixibat,biolink:Drug
1,RXCUI:2571074,Maralixibat,biolink:Drug


## Reorder the columns so the order is consistent across files

In [147]:
print(edge_df.columns.tolist())

['subject', 'subject_category', 'object', 'object_category', 'subject_name', 'object_name', 'predicate', 'knowledge_source', 'knowledge_level', 'agent_type', 'context_qualifier', 'deploy_date', 'id']


In [148]:
## reorder the dataframe
desired_order = [
    'subject', 'subject_name', 'subject_category', 'object', 'object_name', 'object_category', 
    'predicate', 
    'knowledge_source', 'knowledge_level', 'context_qualifier', 'agent_type', 
    'deploy_date', 'id'
]

edge_df = edge_df[desired_order]

print(edge_df.columns.tolist())

['subject', 'subject_name', 'subject_category', 'object', 'object_name', 'object_category', 'predicate', 'knowledge_source', 'knowledge_level', 'context_qualifier', 'agent_type', 'deploy_date', 'id']


### Now quality control of the parsed dataframe

In [149]:
## Check the knowledge_source column again
## Count occurrences of each unique value in 'knowledge_source'
counts = edge_df['knowledge_source'].value_counts()

print(counts)

knowledge_source
FDA pharmacogenomics biomarkers table    542
Name: count, dtype: int64


In [150]:
## check all unique subject_category values
counts = edge_df['subject_category'].value_counts()
print(counts)

subject_category
biolink:Drug    542
Name: count, dtype: int64


In [151]:
## check all unique object_category values
counts = edge_df['object_category'].value_counts()
print(counts)

object_category
biolink:Gene                 506
biolink:Disease               28
biolink:PhenotypicFeature      6
biolink:Pathway                2
Name: count, dtype: int64


In [152]:
## Group by predicate
grouped = edge_df.groupby('predicate')

## For each predicate, output unique (subject_category, object_category) pairs
for predicate, group in grouped:
    print(f"\nPredicate: {predicate}")
    pairs = group[['subject_category', 'object_category']].drop_duplicates()
    for _, row in pairs.iterrows():
        print(f"  ({row['subject_category']}, {row['object_category']})")


Predicate: biolink:has_biomarker
  (biolink:Drug, biolink:Gene)
  (biolink:Drug, biolink:Disease)
  (biolink:Drug, biolink:Pathway)
  (biolink:Drug, biolink:PhenotypicFeature)


In [153]:
## Create a graph from the DataFrame
graph = nx.from_pandas_edgelist(edge_df, 'subject', 'object', edge_attr='predicate')

## Print graph information
print('Number of nodes', len(set(graph.nodes)))
print('Number of edges', len(set(graph.edges)))
print('Average degree', sum(dict(graph.degree).values()) / len(graph.nodes))

Number of nodes 501
Number of edges 536
Average degree 2.1397205588822357
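
The graph has 536 edges although the edge table has 542 rows. `nx.from_pandas_edgelist` builds an undirected `nx.Graph` by default, which keeps at most one edge per pair of nodes, so rows that repeat the same subject and object (presumably with a different therapeutic area) collapse into a single edge. The standard-library sketch below reproduces the same counting on toy data; the node names are made up, and self-loops are assumed not to occur.

```python
# Toy edge rows: the first pair appears twice, as it would when one
# drug-biomarker pair is listed under two different contexts.
rows = [
    ("drugA", "geneX"),
    ("drugA", "geneX"),
    ("drugA", "geneY"),
    ("drugB", "geneX"),
]

# Nodes are every distinct subject or object.
nodes = {node for pair in rows for node in pair}

# An undirected simple graph keeps one edge per unordered pair.
edges = {frozenset(pair) for pair in rows}

# Every edge contributes to the degree of two nodes.
average_degree = 2 * len(edges) / len(nodes)

print(len(rows), len(nodes), len(edges), average_degree)  # 4 4 3 1.5

# The same formula with the numbers reported above.
reported_average = 2 * 536 / 501
print(round(reported_average, 4))  # 2.1397
```

If the duplicated pairs matter for downstream use, `edge_df.duplicated(subset=['subject', 'object'], keep=False)` lists them in pandas.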


## Now save the concatenated node & edge files

In [154]:
## Define the output path for node & edge files after formatting
download_path_node_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/wQualityControl/FDA_Pharmacogenomic_biomarkers_parsed_node_{version_number}.tsv'
download_path_edge_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/wQualityControl/FDA_Pharmacogenomic_biomarkers_parsed_edge_{version_number}.tsv'

## write both the node and edge DataFrames to tab-separated files
concat_node_df.to_csv(download_path_node_file, sep ='\t', index=False)
edge_df.to_csv(download_path_edge_file, sep ='\t', index=False)

In [155]:
print("The formatted node file will be saved in this path: ", download_path_node_file)
print("The formatted edge file will be saved in this path: ", download_path_edge_file)

The formatted node file will be saved in this path:  /Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/wQualityControl/FDA_Pharmacogenomic_biomarkers_parsed_node_09_04_2025.tsv
The formatted edge file will be saved in this path:  /Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/wQualityControl/FDA_Pharmacogenomic_biomarkers_parsed_edge_09_04_2025.tsv
