## Parser for Table of Pharmacogenomic Biomarkers in Drug Labeling
* Source table: https://www.fda.gov/drugs/science-and-research-drugs/table-pharmacogenomic-biomarkers-drug-labeling
* Last obtained: 06/25/2025
* Content current as of: 09/23/2024
* Additional information:

In [65]:
## To do list:
## Add

In [66]:
## Load necessary packages
import os
import pandas as pd
import glob
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

## load the TCT related packages
from TCT import node_normalizer
from TCT import name_resolver
from TCT import translator_metakg
from TCT import translator_kpinfo
from TCT import translator_query
from TCT import TCT

## Define the version number
version_number = "07_14_2025"
deployment_date = "2025-07-14"

In [67]:
## Load the Biolink category and predicate dictionary for mapping subject, object, and predicate types
%run ./Biolink_category_and_predication_dictionary.ipynb

Date of last update:  2025-07-08
Order is to always process Node/category map first, since the Edeg/predicate map depends on biolink-complainat node values
-----------------------------------------------------------------------------------------------------------------------------
Dictionary: category_map, Key template: Subject_category or Object_category
------------------------------------------------------------------------------------------
Dictionary: predicate_map, Key template: (Subject_category, Object_category, Predicate)


In [68]:
## Load all helper functions
%run /Users/Weiqi0/ISB_working/Hadlock_lab/QI_ISB_Git_repo/TranslatorPharcogenomicsKG/Parser_helper_functions.ipynb

## Load files and convert them into separate node & edge files
* check all imported file structure

In [69]:
## Note: change the file paths below to match your own environment
raw_files_path = '/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/FDA_Pharmacogenomic_biomarkers_in_Drug_labeling/'

## Define the output path for node & edge files after formatting
download_path_node_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/FDA_Pharmacogenomic_biomarkers_parsed_node_{version_number}.tsv'
download_path_edge_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/FDA_Pharmacogenomic_biomarkers_parsed_edge_{version_number}.tsv'

In [70]:
## List the CSV files found in the raw files directory

for f in os.listdir(raw_files_path):
    if f.endswith('.csv'):
        print(f)

Table_of_Pharmacogenomic_Biomarkers_in_Drug_Labeling_FDA.csv


In [71]:
## Read the source CSV file into a DataFrame
source_df = pd.read_csv(raw_files_path + 'Table_of_Pharmacogenomic_Biomarkers_in_Drug_Labeling_FDA.csv')
source_df.head(10)

Unnamed: 0,Drug,Therapeutic Area*,Biomarker†,Labeling Sections
0,Articaine and Epinephrine (1),Anesthesiology,G6PD,Warnings and Precautions
1,Articaine and Epinephrine (2),Anesthesiology,Nonspecific (Congenital Methemoglobinemia),Warnings and Precautions
2,Bupivacaine (1),Anesthesiology,G6PD,Warnings
3,Bupivacaine (2),Anesthesiology,Nonspecific (Congenital Methemoglobinemia),Warnings
4,Chloroprocaine (1),Anesthesiology,G6PD,Warnings
5,Chloroprocaine (2),Anesthesiology,Nonspecific (Congenital Methemoglobinemia),Warnings
6,Codeine,Anesthesiology,CYP2D6,"Boxed Warning, Warnings and Precautions, Use i..."
7,Desflurane,Anesthesiology,"CACNA1S, RYR1 (Genetic Susceptibility to Mali...","Contraindications, Warnings and Precautions, C..."
8,Isoflurane,Anesthesiology,"CACNA1S, RYR1 (Genetic Susceptibility to Mali...","Contraindications, Warnings, Clinical Pharmaco..."
9,Lidocaine and Prilocaine (1),Anesthesiology,Nonspecific (Congenital Methemoglobinemia),Warnings and Precautions


In [72]:
## remove all occurrences of \xa0
source_df['Drug'] = source_df['Drug'].str.replace(u'\xa0', '', regex=False)

## Remove the trailing footnote marker such as (1) or (2), and anything after it, using a regex
source_df['subject_name'] = source_df['Drug'].str.replace(r'\(\d+\).*', '', regex=True)

## Add a 'subject_category' column to indicate these nodes are all drugs
source_df['subject_category'] = 'biolink:Drug'

## remove all occurrences of \xa0
source_df['object_name'] = source_df['Biomarker†'].str.replace(u'\xa0', '', regex=False)

## Add an 'object_category' column to indicate these nodes are all biomarkers
source_df['object_category'] = 'biolink:BiologicalEntity'
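
The name cleanup above can be tried in isolation on a few toy rows. The sketch below is a variant, not the notebook's exact code: it assumes the footnote marker may be preceded by either a regular or a non-breaking space, turns `\xa0` into a normal space instead of deleting it, and anchors the pattern at the end of the string. The toy drug names are taken from the table preview above.

```python
import pandas as pd

# Toy rows imitating the 'Drug' column. The \xa0 (non-breaking space) before
# the marker is an assumption about how the raw CSV separates it from the name.
toy = pd.DataFrame({"Drug": ["Bupivacaine\xa0(1)", "Bupivacaine (2)", "Codeine"]})

cleaned = (
    toy["Drug"]
    .str.replace("\xa0", " ", regex=False)            # normalise non-breaking spaces
    .str.replace(r"\s*\(\d+\)\s*$", "", regex=True)   # drop a trailing "(n)" marker
    .str.strip()                                      # remove any leftover whitespace
)

print(cleaned.tolist())
```

Anchoring the pattern with `$` keeps parentheses in the middle of a name intact, which may or may not be desirable for this table.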

In [73]:
## Count the unique drug names (subjects) and the unique biomarker names (objects)
unique_node_type_values = source_df['subject_name'].unique()
# print("All possible drug types are here: " ,unique_node_type_values)
print("--------------------------------------------------------------------------")
print(len(unique_node_type_values))

unique_node_source_values = source_df['object_name'].unique()
# print("All possible biomarker are here: " ,unique_node_source_values)
print("--------------------------------------------------------------------------")
print(len(unique_node_source_values))

--------------------------------------------------------------------------
395
--------------------------------------------------------------------------
174


## Run the name resolver to find the corresponding identifiers in Translator
* uses the name_resolver.lookup() function for a single name and name_resolver.batch_lookup() for lists of names

In [74]:
name = 'Bupivacaine'
input_node_info = name_resolver.lookup(name)
print(input_node_info)

TranslatorNode(curie='CHEBI:3215', label='Bupivacaine', types=['biolink:SmallMolecule', 'biolink:MolecularEntity', 'biolink:ChemicalEntity', 'biolink:PhysicalEssence', 'biolink:ChemicalOrDrugOrTreatment', 'biolink:ChemicalEntityOrGeneOrGeneProduct', 'biolink:ChemicalEntityOrProteinOrPolypeptide', 'biolink:NamedThing', 'biolink:Entity', 'biolink:PhysicalEssenceOrOccurrent', 'biolink:MolecularMixture', 'biolink:ChemicalMixture', 'biolink:Drug', 'biolink:OntologyClass'], synonyms=None, curie_synonyms=None)


In [75]:
print(input_node_info.curie)

CHEBI:3215


In [76]:
## Resolve each subject name to a CURIE
## Earlier one-name-at-a-time version, kept for reference:
# source_df['subject'] = source_df['subject_name'].apply(lambda name: name_resolver.lookup(name).curie if name_resolver.lookup(name) else None)

## Current version: use batch_lookup to resolve the names in batches

# Get all names
names = source_df['subject_name'].tolist()

# Break into batches of 25
batch_size = 25
batches = [names[i:i + batch_size] for i in range(0, len(names), batch_size)]

# Run batch lookups and collect results
results = {}
for batch in batches:
    lookup_results = name_resolver.batch_lookup(batch)  # Expected to return a dict: {name: result or None}
    for name, result in lookup_results.items():
        results[name] = result.curie if result else None

# Map the resolved CURIEs back to the DataFrame
source_df['subject'] = source_df['subject_name'].map(results)
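
Because many rows share the same drug name, the list passed to the resolver contains repeats. A minimal sketch of deduplicating before batching is shown below. `fake_batch_lookup` is a hypothetical stand-in that returns CURIE strings directly, so the sketch runs without network access; it does not reproduce the behaviour of `name_resolver.batch_lookup`.

```python
from typing import Dict, Iterable, List, Optional


def fake_batch_lookup(names: List[str]) -> Dict[str, Optional[str]]:
    """Hypothetical stand-in for a batch resolver: maps name -> CURIE or None."""
    known = {"Bupivacaine": "CHEBI:3215"}  # CURIE taken from the lookup output above
    return {name: known.get(name) for name in names}


def resolve_unique(names: Iterable[str], batch_size: int = 25) -> Dict[str, Optional[str]]:
    # dict.fromkeys drops duplicates while preserving first-seen order
    unique = list(dict.fromkeys(names))
    resolved: Dict[str, Optional[str]] = {}
    for start in range(0, len(unique), batch_size):
        resolved.update(fake_batch_lookup(unique[start:start + batch_size]))
    return resolved


mapping = resolve_unique(["Bupivacaine", "Bupivacaine", "Codeine"], batch_size=2)
print(mapping)
```

The resulting dictionary can be mapped back onto the name column in the same way as `results` above.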

In [77]:
# source_df.head(20)

In [78]:
## Resolve each object name to a CURIE as well
## Earlier one-name-at-a-time version, kept for reference:
# source_df['object'] = source_df['object_name'].apply(lambda name: name_resolver.lookup(name).curie if name_resolver.lookup(name) else None)

## Current version: use batch_lookup to resolve the names in batches

# Get all names
names = source_df['object_name'].tolist()

# Break into batches of 25
batch_size = 25
batches = [names[i:i + batch_size] for i in range(0, len(names), batch_size)]

# Run batch lookups and collect results
results = {}
for batch in batches:
    lookup_results = name_resolver.batch_lookup(batch)  # Expected to return a dict: {name: result or None}
    for name, result in lookup_results.items():
        results[name] = result.curie if result else None

# Map the resolved CURIEs back to the DataFrame
source_df['object'] = source_df['object_name'].map(results)

In [79]:
## add a predicate "biolink:has_biomarker"
source_df['predicate'] = 'biolink:has_biomarker'

## add a new knowledge_source column and set its value to the name of the FDA table
source_df['knowledge_source'] = 'FDA pharmacogenomics biomarkers table'

## add a new knowledge_level column and set value to be 'knowledge_assertion', 'prediction', or 'statistical_association'
source_df['knowledge_level'] = 'knowledge_assertion'

## add a new agent_type column and set value to be 'manual_agent', 'automated_agent', 'computational_model', or 'text_mining_agent'
source_df['agent_type'] = 'text_mining_agent'

In [80]:
# source_df.head(20)

In [99]:
## copy to a final df
edge_df = source_df.copy(deep = True)

print(edge_df.shape[0])
## Remove rows where 'subject' or 'object' is NaN, i.e. the name could not be resolved
edge_df = edge_df.dropna(subset=['subject', 'object'])

print(edge_df.shape[0])

edge_df['deploy_date'] = deployment_date

608
608
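
Before dropping rows, it can help to see which names failed to resolve. The following is a small sketch on toy data; the `EXAMPLE:` identifiers are placeholders invented for illustration and are not real CURIEs.

```python
import pandas as pd

# Toy frame with the same column layout as the notebook's source_df.
# 'EXAMPLE:...' values are placeholders, not real identifiers.
toy = pd.DataFrame({
    "subject_name": ["Bupivacaine", "Codeine", "Desflurane"],
    "subject": ["CHEBI:3215", None, "EXAMPLE:1"],
    "object_name": ["G6PD", "CYP2D6", "RYR1"],
    "object": [None, "EXAMPLE:2", "EXAMPLE:3"],
})

# Collect the names whose identifier column is missing
unresolved = {
    col: sorted(toy.loc[toy[col].isna(), f"{col}_name"].unique())
    for col in ("subject", "object")
}
print(unresolved)

# Only rows with both identifiers present survive
kept = toy.dropna(subset=["subject", "object"])
print(len(toy), "->", len(kept))
```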


In [100]:
## create a context_qualifier column from the therapeutic area
edge_df['context_qualifier'] = edge_df['Therapeutic Area*']

In [101]:
### Add an 'id' column for each edge, built from the columns that identify it
### (generate_uuid comes from the helper functions loaded earlier)
column_list = ['subject', 'predicate', 'object', 'context_qualifier', 'deploy_date']
# Apply the function to each row to generate UUIDs
edge_df['id'] = edge_df[column_list].apply(generate_uuid, axis=1)
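
The `generate_uuid` helper is not shown in this notebook. As an illustration only, one way to derive a reproducible identifier from the edge columns is a name-based UUID (`uuid.uuid5`). The namespace URL, the function name `deterministic_edge_id`, and the placeholder object identifier are assumptions made for this sketch and may differ from what the helper actually does.

```python
import uuid

# Hypothetical namespace: any fixed UUID works as long as it never changes.
EDGE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/pharmacogenomics-kg")


def deterministic_edge_id(values) -> str:
    """Join the identifying fields and hash them into a name-based UUID."""
    key = "|".join("" if v is None else str(v) for v in values)
    return str(uuid.uuid5(EDGE_NAMESPACE, key))


# 'EXAMPLE:G6PD' is a placeholder object identifier, not a real CURIE
edge = ["CHEBI:3215", "biolink:has_biomarker", "EXAMPLE:G6PD", "Anesthesiology", "2025-07-14"]
later = edge[:-1] + ["2025-08-01"]

id_a = deterministic_edge_id(edge)
id_b = deterministic_edge_id(edge)
id_c = deterministic_edge_id(later)
print(id_a == id_b, id_a == id_c)
```

With this approach the same edge always receives the same id, while a change in any identifying field (including the deploy date) yields a new one. A function of this shape could be passed to `DataFrame.apply(..., axis=1)`, since iterating a row yields its values.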

In [102]:
## remove no longer needed columns
col_to_drop = ['Drug', 'Therapeutic Area*', 'Biomarker†', 'Labeling Sections']
edge_df = edge_df.drop(col_to_drop, axis = 1).drop_duplicates()

# edge_df.head(10)

### Now create the corresponding node file
* only need three columns: id, name, category

In [103]:
## select the node columns and rename them into the desired format
## (rename returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
node_subject_df = edge_df[['subject', 'subject_name', 'subject_category']].rename(
    columns={'subject': 'id', 'subject_name': 'name', 'subject_category': 'category'})
node_object_df = edge_df[['object', 'object_name', 'object_category']].rename(
    columns={'object': 'id', 'object_name': 'name', 'object_category': 'category'})

concat_node_df = pd.concat([node_subject_df, node_object_df]).drop_duplicates(keep='first')

concat_node_df.head(2)


Unnamed: 0,id,name,category
0,RXCUI:285091,Articaine and Epinephrine,biolink:Drug
2,CHEBI:3215,Bupivacaine,biolink:Drug


In [104]:
## remove no longer needed columns
col_to_drop = ['subject_name', 'object_name']
edge_df = edge_df.drop(col_to_drop, axis = 1).drop_duplicates()

In [105]:
edge_output_df = edge_df.copy(deep = True)

## Drop rows where 'predicate' is NaN, None, or an empty string
edge_output_df = edge_output_df[~edge_output_df['predicate'].isna() & (edge_output_df['predicate'].str.strip() != '')]

## Keep only rows where subject_category, object_category, and predicate all start
## with the 'biolink:' prefix; other rows cannot be converted to a Biolink-compliant form
edge_output_df = edge_output_df[
    edge_output_df['subject_category'].str.startswith('biolink:') &
    edge_output_df['object_category'].str.startswith('biolink:') &
    edge_output_df['predicate'].str.startswith('biolink:')
]
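
In this notebook the three columns are filled with constants, so the prefix filter has no missing values to deal with. If a column ever did contain missing values, `str.startswith` would return NaN for them and the boolean mask would typically fail. A hedged sketch of a more defensive version, using the `na=False` argument on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "subject_category": ["biolink:Drug", "Drug", None],
    "object_category": ["biolink:BiologicalEntity"] * 3,
    "predicate": ["biolink:has_biomarker"] * 3,
})

# Start with an all-True mask and narrow it column by column.
# na=False treats missing values as "does not start with the prefix".
mask = pd.Series(True, index=toy.index)
for col in ["subject_category", "object_category", "predicate"]:
    mask &= toy[col].str.startswith("biolink:", na=False)

compliant = toy[mask]
print(compliant)
```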

In [106]:
## Check the knowledge_source column again
## Count occurrences of each unique value in 'knowledge_source'
counts = edge_output_df['knowledge_source'].value_counts()

print(counts)

knowledge_source
FDA pharmacogenomics biomarkers table    599
Name: count, dtype: int64


### Quality control of the parsed DataFrame

In [107]:
## check all unique subject_category values
counts = edge_output_df['subject_category'].value_counts()
print(counts)

subject_category
biolink:Drug    599
Name: count, dtype: int64


In [108]:
## check all unique object_category values
counts = edge_output_df['object_category'].value_counts()
print(counts)

object_category
biolink:BiologicalEntity    599
Name: count, dtype: int64


In [109]:
## Group by predicate
grouped = edge_output_df.groupby('predicate')

## For each predicate, output unique (subject_category, object_category) pairs
for predicate, group in grouped:
    print(f"\nPredicate: {predicate}")
    pairs = group[['subject_category', 'object_category']].drop_duplicates()
    for _, row in pairs.iterrows():
        print(f"  ({row['subject_category']}, {row['object_category']})")


Predicate: biolink:has_biomarker
  (biolink:Drug, biolink:BiologicalEntity)


## Now download the concatenated node & edge files

In [110]:
## Write the node and edge DataFrames to tab-separated files
concat_node_df.to_csv(download_path_node_file, sep ='\t', index=False)
edge_output_df.to_csv(download_path_edge_file, sep ='\t', index=False)
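
A quick way to sanity-check a written file is to read it back and compare it with the DataFrame in memory. The sketch below does this for a one-row toy node table in a temporary directory, so it does not touch the real output paths.

```python
import os
import tempfile

import pandas as pd

nodes = pd.DataFrame(
    {"id": ["CHEBI:3215"], "name": ["Bupivacaine"], "category": ["biolink:Drug"]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nodes.tsv")
    nodes.to_csv(path, sep="\t", index=False)
    # dtype=str keeps identifiers as text instead of letting pandas guess types
    reloaded = pd.read_csv(path, sep="\t", dtype=str)

print(list(reloaded.columns))
print(reloaded.values.tolist() == nodes.values.tolist())
```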

In [111]:
print("The formatted node file will be saved in this path: ", download_path_node_file)
print("The formatted edge file will be saved in this path: ", download_path_edge_file)

The formatted node file will be saved in this path:  /Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/FDA_Pharmacogenomic_biomarkers_parsed_node_07_14_2025.tsv
The formatted edge file will be saved in this path:  /Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/FDA_Pharmacogenomic_biomarkers_parsed_edge_07_14_2025.tsv


In [112]:
## Create a graph from the edge DataFrame (from_pandas_edgelist builds an undirected Graph by default)
graph = nx.from_pandas_edgelist(edge_output_df, 'subject', 'object', edge_attr='predicate')

## Print graph information
print('Number of nodes', len(set(graph.nodes)))
print('Number of edges', len(set(graph.edges)))
print('Average degree', sum(dict(graph.degree).values()) / len(graph.nodes))

Number of nodes 545
Number of edges 596
Average degree 2.1871559633027524
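
The graph reports fewer edges than the edge table has rows. One likely reason is that an undirected simple graph stores each pair of nodes at most once, so rows that connect the same drug and biomarker under different therapeutic areas collapse into a single edge. The sketch below reproduces that counting with plain Python sets. The node names are placeholders, and it assumes no self-loops, for which the degree formula would differ.

```python
# Toy edge rows: (subject, object, context_qualifier). Names are placeholders.
rows = [
    ("drug:A", "gene:X", "Oncology"),
    ("drug:A", "gene:X", "Neurology"),  # same pair, different therapeutic area
    ("drug:B", "gene:X", "Oncology"),
]

# An undirected simple graph keeps one edge per unordered pair of nodes
distinct_pairs = {frozenset((s, o)) for s, o, _ in rows}
nodes = {n for s, o, _ in rows for n in (s, o)}

# Each edge contributes to the degree of both of its endpoints
average_degree = 2 * len(distinct_pairs) / len(nodes)

print(len(rows), "rows ->", len(distinct_pairs), "edges,", len(nodes), "nodes")
print(average_degree)
```

The same relation holds for the numbers printed above: 2 × 596 / 545 gives the reported average degree.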
