## Parser for SIGNOR

1. This parser is part of the pharmacogenomics KG.

2. Notes:

3. Predicates used: biolink:physically_interacts_with and biolink:gene_associated_with_condition.

4. The corresponding config.json file is config_bigGIM_interacts_with_associated_with.

5. All files in this group share the same columns: "subject, predicate, object, agent_type, knowledge_level, knowledge_source, object_category, publications, subject_category".

6. Full list of input files handled in this group:
    - signor_genes.csv
    

In [58]:
## Load necessary packages
import os
import pandas as pd
import glob
import numpy as np
import networkx as nx
# import matplotlib.pyplot as plt

## load the needed tqdm package
from tqdm import tqdm

# Enable tqdm for pandas apply
tqdm.pandas()

## Define the version number
version_number = "08_18_2025"
deployment_date = "2025-08-18"

### Get the category and predicate dictionary

In [59]:
## Load the Biolink category and predicate dictionary for mapping subject, object, and predicate types
%run ./Biolink_category_and_predication_dictionary.ipynb

Date of last update:  2025-09-01
Order is to always process Node/category map first, since the Edeg/predicate map depends on biolink-complainat node values
-----------------------------------------------------------------------------------------------------------------------------
Dictionary: category_map, Key template: Subject_category or Object_category
------------------------------------------------------------------------------------------
Dictionary: predicate_map, Key template: (Subject_category, Object_category, Predicate)


### Get all helper functions

In [69]:
## Load all helper functions
%run /Users/Weiqi0/ISB_working/Hadlock_lab/QI_ISB_Git_repo/TranslatorPharcogenomicsKG/Parser_helper_functions.ipynb

In [109]:
# ## load the open source LLM model from openAI for batch quality check
# ## example case to confirm working

# import ollama
# response = ollama.chat(
#     model="gpt-oss:20b",
#     messages=[
#         {"role": "user", "content": "How does SLCO1B1 gene affects a patient's response to Atorvastatin and Lovastatin? only 2 sentences summary is enough"},
#     ],
# )
# print(response["message"]["content"])

In [4]:
import requests

# Define the query function with the new model
def query_ollama(prompt, model='gpt-oss:20b'):
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )
    response.raise_for_status()
    return response.json()['response'].strip()
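
The row checkers in this notebook compare the model reply to the exact string 'no', so a reply such as "No." or "**No**" would slip through unnoticed. A minimal sketch of normalizing replies before comparison follows; `normalize_yes_no` is a hypothetical helper that is not part of the original notebook.

```python
import re

def normalize_yes_no(reply):
    """Map a free-text model reply to 'yes', 'no', or None (hypothetical helper)."""
    if not reply:
        return None
    # Look only at the first alphabetic token, ignoring markdown emphasis and punctuation.
    match = re.search(r"[A-Za-z]+", reply)
    if match is None:
        return None
    token = match.group(0).lower()
    return token if token in ("yes", "no") else None

print(normalize_yes_no("**No**, the identifier does not match"))
```

Replies that start with anything other than yes or no come back as None, which makes verbose or malformed answers easy to count and re-run instead of silently treating them as passes.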

## Load files and concatenate them into one merge tsv file

In [5]:
## Note: change the file paths below to match your own environment
raw_files_path = '/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/SIGNOR/'

## Define the output path for node & edge files after formatting
download_path_node_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/SIGNOR_parsed_node_{version_number}.tsv'
download_path_edge_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/SIGNOR_parsed_edge_{version_number}.tsv'

In [6]:
## List all CSV files in the raw SIGNOR directory to confirm what will be read

for f in os.listdir(raw_files_path):
    if f.endswith('.csv'):
        print(f)

signor_genes.csv


In [7]:
## Read the SIGNOR csv file
source_df = pd.read_csv(raw_files_path + 'signor_genes.csv')

## check the length
print("current row number is: ", len(source_df))

## Drop nan values
source_df = source_df.dropna(subset=['subject_identifier', 'object_identifier'])

## check the length after filtering
print("current row number is: ", len(source_df))

source_df.head(10)

current row number is:  38847
current row number is:  38847


Unnamed: 0,subject_identifier,subject_name,subject_category,object_identifier,object_name,object_category,predicate,original_predicate,provided_by,Primary_Knowledge_Source,publications,knowledge_level,anatomical_context_qualifier
0,NCBIGene:1457,CSNK2A1,Gene,NCBIGene:11036,GTF2A1L,Gene,upregulates,up-regulates activity,SIGNOR,SIGNOR-250870,PMID:36243968|PMID:12107178,knowledge_assertion,
1,NCBIGene:2932,GSK3B,Gene,NCBIGene:6688,SPI1,Gene,downregulates,down-regulates quantity by destabilization,SIGNOR,SIGNOR-277542,PMID:36243968|PMID:33188146,knowledge_assertion,
2,NCBIGene:5566,PRKACA,Gene,NCBIGene:2011,MARK2,Gene,downregulates,down-regulates activity,SIGNOR,SIGNOR-276870,PMID:36243968|PMID:25512381,knowledge_assertion,
3,NCBIGene:5793,PTPRG,Gene,NCBIGene:6714,SRC,Gene,downregulates,down-regulates activity,SIGNOR,SIGNOR-254725,PMID:36243968|PMID:25624455,knowledge_assertion,
4,NCBIGene:5592,PRKG1,Gene,NCBIGene:29109,FHOD1,Gene,upregulates,up-regulates,SIGNOR,SIGNOR-170094,PMID:36243968|PMID:21106951,knowledge_assertion,BTO:0000887;BTO:0001260
5,NCBIGene:5595,MAPK3,Gene,NCBIGene:8036,SHOC2,Gene,downregulates,down-regulates quantity by destabilization,SIGNOR,SIGNOR-277443,PMID:36243968|PMID:30865892,knowledge_assertion,
6,NCBIGene:5914,RARA,Gene,NCBIGene:1499,CTNNB1,Gene,downregulates,down-regulates,SIGNOR,SIGNOR-73274,PMID:36243968|PMID:10607566,knowledge_assertion,
7,NCBIGene:3172,HNF4A,Gene,NCBIGene:114902,C1QTNF5,Gene,upregulates,up-regulates quantity by expression,SIGNOR,SIGNOR-254448,PMID:36243968|PMID:20621834,knowledge_assertion,
8,NCBIGene:51497,NELFCD,Gene,SIGNOR:SIGNOR-C521,NELF,Complex,in_complex,form complex,SIGNOR,SIGNOR-271399,PMID:36243968|PMID:18628398,knowledge_assertion,
9,SIGNOR:SIGNOR-C409,Sin3B_complex,Complex,NCBIGene:653604,H3C13,Gene,downregulates,down-regulates activity,SIGNOR,SIGNOR-266974,PMID:36243968|PMID:21041486,knowledge_assertion,


In [8]:
## check all unique values in the predicate column
unique_values = source_df['predicate'].unique()
print(unique_values)

['upregulates' 'downregulates' 'in_complex' 'unknown']


In [9]:
## remove those rows with col(predicate) == 'unknown' from source_df
source_df = source_df[source_df['predicate'] != 'unknown']

print("current row number is: ", len(source_df))

current row number is:  37764


In [10]:
## check all unique values in the subject_category column
unique_values = source_df['subject_category'].unique()
print(unique_values)

['Gene' 'Complex' 'Chemical' 'Phenotype' 'Protein' 'Smallmolecule'
 'Proteinfamily' 'Stimulus' 'Fusion Protein' 'Mirna' 'Antibody' 'Ncrna']


In [11]:
## check all unique values in the object_category column
unique_values = source_df['object_category'].unique()
print(unique_values)

['Gene' 'Complex' 'Phenotype' 'Smallmolecule' 'Proteinfamily' 'Chemical'
 'Protein' 'Fusion Protein' 'Mirna' 'Stimulus']


In [43]:
## check all unique combinations of subject category, object category, and predicate

# The three columns whose distinct value combinations we want to list
col_combos = ['subject_category', 'object_category', 'predicate']

unique_combinations = source_df[col_combos].drop_duplicates()

In [44]:
print(unique_combinations)

      subject_category object_category      predicate
0                 Gene            Gene    upregulates
1                 Gene            Gene  downregulates
8                 Gene         Complex     in_complex
9              Complex            Gene  downregulates
11             Complex       Phenotype    upregulates
...                ...             ...            ...
22783            Mirna       Phenotype  downregulates
24774    Smallmolecule         Protein  downregulates
32052    Smallmolecule         Complex  downregulates
32220        Phenotype   Proteinfamily    upregulates
34290             Gene            Gene     in_complex

[98 rows x 3 columns]
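
The listing above shows which combinations exist but not how many edges fall under each one. A small sketch of counting edges per combination is given here; `toy_df` and its rows are illustrative stand-ins for `source_df`, and the column name `n_edges` is an assumption of this example.

```python
import pandas as pd

# Toy rows standing in for source_df (values are illustrative only).
toy_df = pd.DataFrame({
    "subject_category": ["Gene", "Gene", "Gene", "Complex"],
    "object_category": ["Gene", "Gene", "Complex", "Gene"],
    "predicate": ["upregulates", "upregulates", "in_complex", "downregulates"],
})

# Count rows per (subject_category, object_category, predicate) combination.
combo_counts = (
    toy_df.groupby(["subject_category", "object_category", "predicate"])
    .size()
    .reset_index(name="n_edges")
    .sort_values("n_edges", ascending=False, ignore_index=True)
)
print(combo_counts)
```

Sorting by count puts the dominant combinations first, so rare combinations that may need a mapping decision end up at the bottom of the table.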


In [45]:
## Define the output path for the temporary file of unique category/predicate combinations
download_path_temp_file = '/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/temp/unique_combos_subject_object_predicate.tsv'


## Save the unique combinations as a TSV file
## (comment out the next line to skip writing during testing)
unique_combinations.to_csv(download_path_temp_file, sep='\t', index=False)

## Subject & object mapping
* uses the current Biolink_category_and_predication_dictionary
* existing Biolink category mappings:
    * 'Gene': 'biolink:Gene',
    * 'Chemical': 'biolink:ChemicalEntity',
    * 'Smallmolecule': 'biolink:SmallMolecule',
    * 'Phenotype': 'biolink:PhenotypicFeature',
    * 'Protein': 'biolink:Protein',
* mappings that went through a validity check because of potential issues:
    * 'Antibody': 'biolink:Drug', ## check whether all are indeed drugs
    * 'Complex': 'biolink:MacromolecularComplex',
    * 'Mirna': 'biolink:MicroRNA',
    * 'Ncrna': 'biolink:Noncoding_RNAProduct',
* categories proposed in the Translator ingest group ticket as future categories (not included in the current version of the parsed file):
    * 'Fusion Protein': parent node is 'biolink:Protein', ## potential new entity to raise at the data ingest meeting
    * 'Stimulus': partially overlaps with 'biolink:EnvironmentalProcess', ## potential new entity to raise at the data ingest meeting
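
As a rough illustration of how the mapping listed above can be applied and how unmapped categories can be surfaced, consider the sketch below. The dictionary `example_category_map` copies only a subset of the list and is a stand-in; the real `category_map` is defined in the Biolink dictionary notebook and may differ.

```python
import pandas as pd

# Subset of the mapping listed above; an illustrative stand-in for the real category_map.
example_category_map = {
    "Gene": "biolink:Gene",
    "Chemical": "biolink:ChemicalEntity",
    "Smallmolecule": "biolink:SmallMolecule",
    "Phenotype": "biolink:PhenotypicFeature",
    "Protein": "biolink:Protein",
}

toy_nodes = pd.DataFrame({
    "id": ["NCBIGene:1457", "SIGNOR:SIGNOR-ST1", "SIGNOR:SIGNOR-FP1"],
    "category": ["Gene", "Stimulus", "Fusion Protein"],
})

# Series.map returns NaN for keys missing from the dictionary.
toy_nodes["biolink_category"] = toy_nodes["category"].map(example_category_map)
unmapped = toy_nodes[toy_nodes["biolink_category"].isna()]
print(sorted(unmapped["category"].unique().tolist()))
```

Categories without a mapping show up as NaN rather than raising an error, so listing them explicitly is a cheap guard before writing the node file.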

In [12]:
## create a temp node file

## select the node columns and rename them into the desired format
## (rename returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
node_subject_df = source_df[['subject_identifier', 'subject_name', 'subject_category']].rename(
    columns={'subject_identifier': 'id', 'subject_name': 'name', 'subject_category': 'category'})
node_object_df = source_df[['object_identifier', 'object_name', 'object_category']].rename(
    columns={'object_identifier': 'id', 'object_name': 'name', 'object_category': 'category'})

concat_node_df = pd.concat([node_subject_df, node_object_df]).drop_duplicates(keep='first')

concat_node_df.head(5)


Unnamed: 0,id,name,category
0,NCBIGene:1457,CSNK2A1,Gene
1,NCBIGene:2932,GSK3B,Gene
2,NCBIGene:5566,PRKACA,Gene
3,NCBIGene:5793,PTPRG,Gene
4,NCBIGene:5592,PRKG1,Gene
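
Because `drop_duplicates` removes only fully identical rows, an identifier that appears with two different names or categories would remain in the node table more than once. The sketch below shows one way to list such identifiers; `toy_nodes` is illustrative data and the alias in it is made up.

```python
import pandas as pd

toy_nodes = pd.DataFrame({
    "id": ["NCBIGene:1457", "NCBIGene:1457", "NCBIGene:2932"],
    "name": ["CSNK2A1", "CK2A1", "GSK3B"],  # the second name is a made-up alias
    "category": ["Gene", "Gene", "Gene"],
})

# Count distinct names and categories per identifier.
per_id = toy_nodes.groupby("id")[["name", "category"]].nunique()
conflicting_ids = per_id[(per_id["name"] > 1) | (per_id["category"] > 1)].index.tolist()
print(conflicting_ids)
```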


### Now check all "gene" nodes
* use LLM model to confirm they have the correct ids


In [29]:
# gene_node_df = concat_node_df[concat_node_df['category'] == 'Gene']
# print(len(gene_node_df))

In [30]:
# # Define the row labeling function
# def gene_row_checker(row):
#     combined_text = f"{row['name']} {row['id']}"
#     prompt = f"""Analyze if the NCBI gene identifier and names are matching and if it is an existing Gene, answer only in "yes" if both satisfied, or "no" if any condition is incorrect:
# \"\"\"{combined_text}\"\"\"
# Sentiment:"""
#     return query_ollama(prompt)

In [31]:
# ## apply to each row of the dataframe
# gene_node_df['label'] = gene_node_df.progress_apply(gene_row_checker, axis=1)
# print(gene_node_df[gene_node_df['label'] == 'no'])
# print(len(gene_node_df))

### Now check all "antibody" nodes
* use LLM model as a simple example

In [13]:
antibody_node_df = concat_node_df[concat_node_df['category'] == 'Antibody'].copy()
print(len(antibody_node_df))

25


In [14]:
print(antibody_node_df)

                     id                                      name  category
3156   DRUGBANK:DB06650                                ofatumumab  Antibody
3449   DRUGBANK:DB09037                             pembrolizumab  Antibody
5396   DRUGBANK:DB00087                               alemtuzumab  Antibody
5782   DRUGBANK:DB06273                               Tocilizumab  Antibody
7210   DRUGBANK:DB08904                     Certolizumab (Cimzia)  Antibody
7223   DRUGBANK:DB06366                                pertuzumab  Antibody
7558   DRUGBANK:DB00112                               bevacizumab  Antibody
8145   DRUGBANK:DB08935                              obinutuzumab  Antibody
10318  DRUGBANK:DB05773                 ado-trastuzumab emtansine  Antibody
10909  DRUGBANK:DB09052                              blinatumomab  Antibody
14210  DRUGBANK:DB00056                     gemtuzumab ozogamicin  Antibody
14477  DRUGBANK:DB06674                       Golimumab (Simponi)  Antibody
15942  DRUGB

In [69]:
## run the LLM model using a custom query

# Define the row labeling function
def antibody_row_checker(row):
    combined_text = f"{row['name']} {row['id']}"
    prompt = f"""Analyze if the drugbank identifier and names are matching and if it is an existing FDA-approved drug, answer only in "yes" if both satisfied, or "no" if any condition is incorrect:
\"\"\"{combined_text}\"\"\"
Sentiment:"""
    return query_ollama(prompt)
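
The checker functions in this notebook differ mainly in the question text, so they could be generated from one factory. This is only a sketch: `make_row_checker` and `fake_ask` are hypothetical names, the prompt wording is simplified, and in the notebook the `ask` argument would be `query_ollama`.

```python
def make_row_checker(question, ask):
    """Build a row-checking function (hypothetical helper, not in the notebook).

    question: instruction text sent to the model.
    ask: callable that takes a prompt string and returns the reply, e.g. query_ollama.
    """
    def checker(row):
        combined_text = f"{row['name']} {row['id']}"
        prompt = f'{question}\n"""{combined_text}"""\nAnswer:'
        return ask(prompt)
    return checker

# Stand-in for the model so the example runs without an Ollama server.
def fake_ask(prompt):
    return "yes" if "DRUGBANK:" in prompt else "no"

antibody_checker = make_row_checker(
    "Do the DrugBank identifier and name match? Answer only yes or no.", fake_ask
)
print(antibody_checker({"name": "ofatumumab", "id": "DRUGBANK:DB06650"}))
```

Passing the query function in as an argument also makes each checker testable offline, since a stub can replace the model call.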

In [77]:
## apply to each row of the dataframe
antibody_node_df['label'] = antibody_node_df.apply(antibody_row_checker, axis=1)
print(antibody_node_df[antibody_node_df['label'] == 'no'])
# print(antibody_node_df)

                     id        name  category label
24556  DRUGBANK:DB06186  ipilimumab  Antibody    no
24877  DRUGBANK:DB00002   cetuximab  Antibody    no




### LLM-flagged nodes -> manual check
* DRUGBANK:DB06186 ipilimumab is correct: https://go.drugbank.com/drugs/DB06186
* DRUGBANK:DB00002 cetuximab is correct: https://go.drugbank.com/drugs/DB00002

In [52]:
# # Temporarily show full column width (no truncation)
# pd.set_option('display.max_colwidth', None)

# print(antibody_node_df.head(1))

                    id        name  category     label
3156  DRUGBANK:DB06650  ofatumumab  Antibody  **Verifi

### Now check all "Complex" nodes
* use the LLM to check whether each node can be considered a MacromolecularComplex

In [15]:
complex_node_df = concat_node_df[concat_node_df['category'] == 'Complex'].copy()
print(len(complex_node_df))

519


In [16]:
## run the LLM model using a custom query

# Define the row labeling function
def complex_row_checker(row):
    combined_text = f"{row['name']} {row['id']}"
    prompt = f"""Analyze if it can be considered as a MacromolecularComplex, answer only in "yes" if satisfied, or "no" if condition is incorrect:
\"\"\"{combined_text}\"\"\"
Sentiment:"""
    return query_ollama(prompt)

In [None]:
## apply to each row of the dataframe
complex_node_df['label'] = complex_node_df.progress_apply(complex_row_checker, axis=1)
print(complex_node_df[complex_node_df['label'] == 'no'])

 38%|██████████████████████████▉                                            | 197/519 [16:42<28:59,  5.40s/it]
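
The run above stopped at 197 of 519 rows after roughly 17 minutes, and a plain `apply` loses all labels when it is interrupted. One possible way to make long labeling runs resumable is to save each answer as it arrives, sketched below. The function `label_with_checkpoint`, the JSON cache file, and `fake_ask` are assumptions of this example, not part of the notebook.

```python
import json
import os
import tempfile

def label_with_checkpoint(items, ask, cache_path):
    """Label each (key, prompt) pair, saving after every call so a rerun can resume."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            cache = json.load(fh)
    for key, prompt in items:
        if key in cache:
            continue  # already labeled in an earlier run
        cache[key] = ask(prompt)
        with open(cache_path, "w") as fh:
            json.dump(cache, fh)
    return cache

# Demo with a stand-in model that records how often it is called.
calls = []
def fake_ask(prompt):
    calls.append(prompt)
    return "yes"

cache_file = os.path.join(tempfile.mkdtemp(), "complex_labels.json")
items = [("SIGNOR:SIGNOR-C521", "NELF"), ("SIGNOR:SIGNOR-C409", "Sin3B_complex")]
label_with_checkpoint(items[:1], fake_ask, cache_file)       # simulates an interrupted run
labels = label_with_checkpoint(items, fake_ask, cache_file)  # resumes with one new call
print(len(calls), labels)
```

Keying the cache by node identifier means the saved labels can later be joined back onto the node table with `Series.map`.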

In [28]:
print(complex_node_df[complex_node_df['label'] == 'no'])

                       id                                          name  \
427    SIGNOR:SIGNOR-C532                                         Erlin   
521    SIGNOR:SIGNOR-C217                                           DGC   
757    SIGNOR:SIGNOR-C101                                           TSC   
831    SIGNOR:SIGNOR-C398                     Succinyl-CoA  ATP variant   
971    SIGNOR:SIGNOR-C131                                           NAE   
1441   SIGNOR:SIGNOR-C517                             R2SP co-chaperone   
1660   SIGNOR:SIGNOR-C113                                         RAGAC   
2073   SIGNOR:SIGNOR-C411                  Inner_mitochondrial_membrane   
2388   SIGNOR:SIGNOR-C481   Muscle cell-specific SWI/SNF ARID1A variant   
4017   SIGNOR:SIGNOR-C372                                          POMT   
5022   SIGNOR:SIGNOR-C396                                           IDH   
6244   SIGNOR:SIGNOR-C449                                          WICH   
6963   SIGNOR:SIGNOR-C555

In [46]:
## Define the output path for the temporary file of LLM-flagged complexes
download_path_temp_file = '/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/temp/Signor_LLM_complex_no_rows.tsv'


## Save the complexes labeled 'no' by the LLM as a TSV file
## (comment out the next line to skip writing during testing)
complex_node_df[complex_node_df['label'] == 'no'].to_csv(download_path_temp_file, sep='\t', index=False)

### LLM-flagged nodes -> manual check
* These entities exist in SIGNOR but need to be verified against a third-party source
* tried the following databases:
    * https://www.ebi.ac.uk/complexportal/
    * https://mmcif.wwpdb.org/cgi-bin/swish/swish.cgi?query=&si=0&sort=swishrank
* Interpreting those results requires help from a domain expert
* **For now, remove all rows flagged by the LLM**

### Now check all "Mirna" nodes
* use the LLM to check whether each identifier and name match an existing MicroRNA

In [48]:
Mirna_node_df = concat_node_df[concat_node_df['category'] == 'Mirna'].copy()
print(len(Mirna_node_df))

25


In [55]:
## run the LLM model using a custom query
# Define the row labeling function
def Mirna_row_checker(row):
    combined_text = f"{row['name']} {row['id']}"
    prompt = f"""Analyze if the identifier and names are matching and if it is an existing MicroRNA, answer only in "yes" if both satisfied, or "no" if any condition is incorrect:
\"\"\"{combined_text}\"\"\"
Sentiment:"""
    return query_ollama(prompt)

In [56]:
## apply to each row of the dataframe
Mirna_node_df['label'] = Mirna_node_df.apply(Mirna_row_checker, axis=1)
print(Mirna_node_df)

                                  id         name category label
1642   RNAcentral:URS000039E12D_9606     miR-130a    Mirna   yes
3330   RNAcentral:URS000062749E_9606      miR-155    Mirna   yes
4156   RNAcentral:URS00004CAC40_9606      mir-10b    Mirna   yes
4892   RNAcentral:URS0000150A7D_9606      miR-29b    Mirna   yes
6840   RNAcentral:URS0000759977_9606     miR-199a    Mirna   yes
6927   RNAcentral:URS00005EB596_9606    mir-133a1    Mirna   yes
8331   RNAcentral:URS0000233054_9606      miR-27a    Mirna   yes
11333  RNAcentral:URS000075CF56_9606       MIR1-1    Mirna    no
12565  RNAcentral:URS000037EC34_9606  hsa-mir-223    Mirna   yes
13634  RNAcentral:URS00001CC864_9606      miR-23a    Mirna   yes
14078  RNAcentral:URS0000245997_9606       miR221    Mirna   yes
16566  RNAcentral:URS00001F4E81_9606      miR-132    Mirna   yes
18198  RNAcentral:URS000075C517_9606      miR-495    Mirna   yes
26261  RNAcentral:URS000075B799_9606      miR-29c    Mirna   yes
27280  RNAcentral:URS0000



In [78]:
print(Mirna_node_df[Mirna_node_df['label'] == 'no'])

                                  id      name category label
11333  RNAcentral:URS000075CF56_9606    MIR1-1    Mirna    no
27280  RNAcentral:URS000075D8A0_9606  miR-146a    Mirna    no
38190  RNAcentral:URS000021E7E5_9606  miR-4784    Mirna    no
2631   RNAcentral:URS000033F823_9606    miR-34    Mirna    no


### LLM flagged -> manual check
* RNAcentral:URS000075CF56_9606 MIR1-1 is correct: https://rnacentral.org/rna/URS000075CF56/9606
* RNAcentral:URS000075D8A0_9606 miR-146a is correct: https://rnacentral.org/rna/URS000075D8A0/9606
* RNAcentral:URS000021E7E5_9606 miR-4784 is correct: https://rnacentral.org/rna/URS000021E7E5/9606
* RNAcentral:URS000033F823_9606 miR-34 is indeed incorrect:
    * **https://rnacentral.org/rna/URS000033F823/9606, the name should be miR-34a**
    * https://rnacentral.org/rna/URS0002914588/9606, miR-34 is actually the precursor family


### Now check all "Ncrna" nodes
* use the LLM to check whether each identifier and name match an existing noncoding RNA product

In [49]:
Ncrna_node_df = concat_node_df[concat_node_df['category'] == 'Ncrna'].copy()
print(len(Ncrna_node_df))

1


In [53]:
## run the LLM model using a custom query
# Define the row labeling function
def Ncrna_row_checker(row):
    combined_text = f"{row['name']} {row['id']}"
    prompt = f"""Analyze if the identifier and names are matching and if it is an existing Noncoding_RNAProduct, answer only in "yes" if both satisfied, or "no" if any condition is incorrect:
\"\"\"{combined_text}\"\"\"
Sentiment:"""
    return query_ollama(prompt)

In [54]:
## apply to each row of the dataframe
Ncrna_node_df['label'] = Ncrna_node_df.apply(Ncrna_row_checker, axis=1)
print(Ncrna_node_df)

                                 id    name category label
4968  RNAcentral:URS000075C808_9606  HOTAIR    Ncrna   yes




### LLM flagged -> manual check
* all seems correct now

### Now check all "Stimulus" nodes
* use the LLM to check whether each node matches an existing EnvironmentalProcess

In [57]:
## now try to check Stimulus
Stimulus_node_df = concat_node_df[concat_node_df['category'] == 'Stimulus'].copy()
print(len(Stimulus_node_df))

26


In [58]:
## run the LLM model using a custom query
# Define the row labeling function
def Stimulus_row_checker(row):
    combined_text = f"{row['name']} {row['id']}"
    prompt = f"""Analyze if the identifier and names are matching and if it is an existing EnvironmentalProcess, answer only in "yes" if both satisfied, or "no" if any condition is incorrect:
\"\"\"{combined_text}\"\"\"
Sentiment:"""
    return query_ollama(prompt)

In [59]:
## apply to each row of the dataframe
Stimulus_node_df['label'] = Stimulus_node_df.apply(Stimulus_row_checker, axis=1)
print(Stimulus_node_df)

                       id                 name  category label
123    SIGNOR:SIGNOR-ST22    Unfolded_Proteins  Stimulus    no
180     SIGNOR:SIGNOR-ST1           DNA_damage  Stimulus    no
183    SIGNOR:SIGNOR-ST11                PAMPs  Stimulus    no
282    SIGNOR:SIGNOR-ST28       Osmotic_stress  Stimulus    no
559    SIGNOR:SIGNOR-ST20                  ECM  Stimulus    no
721     SIGNOR:SIGNOR-ST2                  ROS  Stimulus    no
1461    SIGNOR:SIGNOR-ST7            UV stress  Stimulus    no
2710   SIGNOR:SIGNOR-ST26  Blood vessel damage  Stimulus    no
3057    SIGNOR:SIGNOR-ST5           AminoAcids  Stimulus    no
3110   SIGNOR:SIGNOR-ST17        UVB radiation  Stimulus    no
3165    SIGNOR:SIGNOR-ST9            ER stress  Stimulus   yes
3915   SIGNOR:SIGNOR-ST13    Cell-Cell_contact  Stimulus    no
4478   SIGNOR:SIGNOR-ST21          Viral_dsRNA  Stimulus    no
4799   SIGNOR:SIGNOR-ST12                  GFs  Stimulus    no
6148   SIGNOR:SIGNOR-ST18                DAMPS  Stimulu



### LLM flagged -> manual check
* Ionizing radiation: high-energy radiation capable of removing electrons from atoms, which can cause DNA damage, protein oxidation, and cellular stress.
* Hypoxia: a deficiency in oxygen levels that activates a variety of cellular pathways.
* In the long term we want a separate stimulus category in Biolink.
* So for now, **remove all stimulus-related edges**.

### Now check all "Fusion Protein" nodes
* use the LLM to check whether each name is an existing fusion protein

In [89]:
## now try to check fusion_protein
fusion_protein_node_df = concat_node_df[concat_node_df['category'] == 'Fusion Protein'].copy()
print(len(fusion_protein_node_df))

15


In [90]:
print(fusion_protein_node_df.head(5))

                      id           name        category
1282   SIGNOR:SIGNOR-FP1       AML1-ETO  Fusion Protein
1283   SIGNOR:SIGNOR-FP2   PML-RARalpha  Fusion Protein
1331  SIGNOR:SIGNOR-FP16   NUP98 Fusion  Fusion Protein
1674   SIGNOR:SIGNOR-FP6        BCR-ABL  Fusion Protein
1891   SIGNOR:SIGNOR-FP3  CBFbeta-MYH11  Fusion Protein


In [91]:
## run the LLM model using a custom query
# Define the row labeling function
def fusion_protein_row_checker(row):
    combined_text = f"{row['name']} {row['id']}"
    prompt = f"""Analyze if the name is an existing fusion protein, answer only in "yes" if satisfied, or "no" if incorrect:
\"\"\"{combined_text}\"\"\"
Sentiment:"""
    return query_ollama(prompt)

In [92]:
## apply to each row of the dataframe
fusion_protein_node_df['label'] = fusion_protein_node_df.apply(fusion_protein_row_checker, axis=1)
print(fusion_protein_node_df[fusion_protein_node_df['label'] == 'no'])

                       id          name        category label
1331   SIGNOR:SIGNOR-FP16  NUP98 Fusion  Fusion Protein    no
1674    SIGNOR:SIGNOR-FP6       BCR-ABL  Fusion Protein    no
2593   SIGNOR:SIGNOR-FP14    MLL Fusion  Fusion Protein    no
8434    SIGNOR:SIGNOR-FP5       MLL-AF9  Fusion Protein    no
19983  SIGNOR:SIGNOR-FP10      ELE1-RET  Fusion Protein    no




### LLM flagged -> manual check
* used FusionPub to verify manually: https://compbio.uth.edu/FusionPub/
    * NUP98 Fusion: 58 fusion pairs found
    * BCR-ABL: only the BCR-ABL1 fusion found
    * MLL Fusion: 34 fusion pairs found
    * MLL-AF9: **not found**
    * ELE1-RET: **not found**


# Quality control steps, used to keep track of the changes made to the source file
* Remove all rows with 'Fusion Protein' or 'Stimulus' for now
* Antibody: all nodes pass (the two LLM flags were confirmed correct by manual check)
* Remove the complexes flagged by the LLM
* Replace miR-34 with miR-34a, which is the correct name for the identifier
* Ncrna: the single node passed the LLM test

In [32]:
print("total row number before quality control is: ", len(source_df))

total row number before quality control is:  37051


In [33]:
## remove rows where the subject or object category is 'Fusion Protein' or 'Stimulus'
source_df = source_df[ (source_df['subject_category'] != 'Fusion Protein') & (source_df['object_category'] != 'Fusion Protein')]
source_df = source_df[ (source_df['subject_category'] != 'Stimulus') & (source_df['object_category'] != 'Stimulus')]

print("remaining row number is: ", len(source_df))

remaining row number is:  37051


In [34]:
## collect the names of all complexes that need to be removed
flagged_complex_df = complex_node_df[complex_node_df['label'] == 'no']
flagged_complex_names = flagged_complex_df['name'].tolist()
# print(flagged_complex_names)

## remove rows whose subject_name or object_name is one of the flagged complex names
source_df = source_df[
    (~source_df['subject_name'].isin(flagged_complex_names)) &
    (~source_df['object_name'].isin(flagged_complex_names))
]

print("remaining row number is: ", len(source_df))

remaining row number is:  37051


In [35]:
## replace 'miR-34' with 'miR-34a' in the subject_name and object_name columns
source_df['subject_name'] = source_df['subject_name'].replace('miR-34', 'miR-34a')
source_df['object_name'] = source_df['object_name'].replace('miR-34', 'miR-34a')
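
The replacement above renames every 'miR-34' entry regardless of its identifier. A more conservative variant renames only rows that carry the manually checked identifier; the sketch below uses toy data, and the second identifier in it is a made-up placeholder.

```python
import pandas as pd

MIR34A_ID = "RNAcentral:URS000033F823_9606"  # identifier verified in the manual check

toy_edges = pd.DataFrame({
    "subject_identifier": [MIR34A_ID, "RNAcentral:URS0000000000_9606"],  # second id is a placeholder
    "subject_name": ["miR-34", "miR-34"],
})

# Rename only where both the identifier and the old name match.
mask = (toy_edges["subject_identifier"] == MIR34A_ID) & (toy_edges["subject_name"] == "miR-34")
toy_edges.loc[mask, "subject_name"] = "miR-34a"
print(toy_edges)
```

The same mask pattern would apply to the object columns by swapping in `object_identifier` and `object_name`.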

In [36]:
## now check the source_df again with selected columns
source_edge_pair_df = source_df[['subject_name', 'predicate', 'object_name']]

print(source_edge_pair_df.head(5))

  subject_name      predicate object_name
0      CSNK2A1    upregulates     GTF2A1L
1        GSK3B  downregulates        SPI1
2       PRKACA  downregulates       MARK2
3        PTPRG  downregulates         SRC
4        PRKG1    upregulates       FHOD1


In [38]:
temp_df = source_edge_pair_df.head(20).copy()
print(len(temp_df))

20


## Now run a similar LLM check to see whether each pair/edge makes sense
* flag any edge that does not look valid because its **biological meaning is not satisfied**

In [36]:
## run the LLM model using a custom query
# Define the row labeling function
def edge_pair_row_checker(row):
    combined_text = f"{row['subject_name']} {row['predicate']} {row['object_name']}"
    prompt = f"""Analyze whether the following represents a biologically valid subject → predicate → object relationship, 
                no known literature or no evidence indicates or no studies report or no literature support alone are not sufficient reasons for 'no',
                more specifically you need to find if there are evidence or literature showing opposite or different mechanismes. 
Reply strictly in this format:

Label: yes or no
Reason: within 30 words

\"\"\"{combined_text}\"\"\"
"""
    response = query_ollama(prompt)

    # Parse the response
    label = None
    reason = None
    for line in response.splitlines():
        if line.lower().startswith("label:"):
            label = line.split(":", 1)[1].strip().lower()
        elif line.lower().startswith("reason:"):
            reason = line.split(":", 1)[1].strip()

    return pd.Series({
        "label": label,
        "reason": reason
    })
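
The parsing loop inside the function above expects each line to start exactly with "Label:" or "Reason:", so indented or markdown-bold replies would leave both fields empty. A slightly more tolerant standalone parser is sketched below; `parse_label_reason` is a hypothetical name, and the accepted reply variants are assumptions about what the model might return.

```python
def parse_label_reason(response):
    """Extract (label, reason) from a 'Label: ... / Reason: ...' style reply."""
    label, reason = None, None
    for line in response.splitlines():
        # Tolerate indentation and leading markdown bold markers.
        cleaned = line.strip().lstrip("*").strip()
        lowered = cleaned.lower()
        if lowered.startswith("label:"):
            value = cleaned.split(":", 1)[1].strip().strip("*").strip().lower()
            label = value if value in ("yes", "no") else None
        elif lowered.startswith("reason:"):
            reason = cleaned.split(":", 1)[1].strip().strip("*").strip()
    return label, reason

print(parse_label_reason("  **Label:** No\n**Reason:** Opposite effect reported."))
```

Returning None for any label other than yes or no keeps unexpected replies distinguishable from genuine 'no' answers when the results are filtered.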

In [None]:
## apply to each row of the dataframe
temp_df[['label', 'reason']] = temp_df.progress_apply(edge_pair_row_checker, axis=1)
print(temp_df[temp_df['label'] == 'no'])

 10%|████████▌                                                                             | 2/20 [00:29<04:27, 14.85s/it]

In [39]:
print(temp_df[temp_df['label'] == 'no'])

     subject_name      predicate      object_name label  \
0         CSNK2A1    upregulates          GTF2A1L    no   
1           GSK3B  downregulates             SPI1    no   
2          PRKACA  downregulates            MARK2    no   
4           PRKG1    upregulates            FHOD1    no   
5           MAPK3  downregulates            SHOC2    no   
7           HNF4A    upregulates          C1QTNF5    no   
9   Sin3B_complex  downregulates            H3C13    no   
11      ESCRT-III    upregulates  Membrane_fusion    no   
12         MAP2K4    upregulates           MAP2K4    no   
14          F2RL3    upregulates             GNAZ    no   
17         PCDH15    upregulates             TMC2    no   
19           DOK1  downregulates            ITGB4    no   

                                               reason  
0   No literature reports CSNK2A1 activating GTF2A...  
1   No published evidence shows GSK3B downregulate...  
2   No literature reports PRKACA downregulating MA...  
4   No l

In [42]:
## Sample 2000 rows randomly
sampled_df = source_edge_pair_df.sample(n=2000, random_state=1984)

In [43]:
## apply to sampled rows

## apply to each row of the dataframe
# source_edge_pair_df[['label', 'reason']] = source_edge_pair_df.apply(edge_pair_row_checker, axis=1)

sampled_df[['label', 'reason']] = sampled_df.progress_apply(edge_pair_row_checker, axis=1)
print(len(sampled_df[sampled_df['label'] == 'no']))

100%|███████████████████████████████████████████████████████████████████| 2000/2000 [9:13:22<00:00, 16.60s/it]

1119





## Manual check to spot any obvious pairs with issues
* List of spotted issues
* Row structure: subject, object, predicate, (SIGNOR ID,) publications
* 1. Self-regulation vs. indirect positive feedback loop
    * MAP2K4, MAP2K4, upregulates, SIGNOR-251420, PMID:36243968|PMID:9162092
    * Coexpression of MEKK2 or MEKK3 with MKK4 in COS-7 cells resulted in activation of MKK4
    * Source: https://www.sciencedirect.com/science/article/pii/S0021925819625767?via%3Dihub
    * TEK, TEK, upregulates, PMID:36243968|PMID:11513602
    * FGFR1, FGFR1, upregulates, PMID:36243968|PMID:8622701
* 2. Duplicate rows for PRKCA -> CSNK1D
* 3. Pairs that are unlikely (subject category, object category, predicate: reason)
    * Phenotype, Gene, upregulates: phenotypes don't regulate genes; they are outcomes.
    * Gene, Phenotype, downregulates: a gene doesn't downregulate a phenotype; it can cause it.
    * Phenotype, Proteinfamily, downregulates: a phenotype is not a regulatory entity.
    * Phenotype, Complex, downregulates / upregulates: same reasoning; a phenotype isn't an active agent.
    * Phenotype, Phenotype, downregulates: conceptually invalid; phenotypes don't regulate each other.
* 4
* 5
* 6
* 7
* 8
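The three issue types listed above can also be flagged programmatically before any manual review. The following is a minimal sketch on a toy dataframe: the rows are made up for illustration, and the raw category label `Phenotype` is an assumption about how the source file names that category.

```python
import pandas as pd

# Toy edges; column names follow source_df, the rows are illustrative only.
toy_df = pd.DataFrame({
    "subject_name":     ["MAP2K4", "PRKCA", "PRKCA", "Apoptosis", "GSK3B"],
    "subject_category": ["Gene", "Gene", "Gene", "Phenotype", "Gene"],
    "object_name":      ["MAP2K4", "CSNK1D", "CSNK1D", "TP53", "SPI1"],
    "object_category":  ["Gene", "Gene", "Gene", "Gene", "Gene"],
    "predicate":        ["upregulates", "upregulates", "upregulates",
                         "upregulates", "downregulates"],
})

key_cols = ["subject_name", "object_name", "predicate"]

# Issue 1: self-regulation (subject and object are the same entity)
self_loop = toy_df["subject_name"] == toy_df["object_name"]

# Issue 2: repeated subject/object/predicate rows (every copy is marked)
duplicated = toy_df.duplicated(subset=key_cols, keep=False)

# Issue 3: a phenotype used as the regulating subject ("Phenotype" label is assumed)
phenotype_subject = toy_df["subject_category"] == "Phenotype"

flags = toy_df.assign(self_loop=self_loop, duplicated=duplicated,
                      phenotype_subject=phenotype_subject)
print(flags[["subject_name", "object_name", "self_loop", "duplicated", "phenotype_subject"]])
```

Keeping the flags as separate boolean columns makes it possible to inspect each issue type on its own before deciding which rows to drop.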

In [50]:
source_df.head(2)

Unnamed: 0,subject_identifier,subject_name,subject_category,object_identifier,object_name,object_category,predicate,original_predicate,provided_by,Primary_Knowledge_Source,publications,knowledge_level,anatomical_context_qualifier
0,NCBIGene:1457,CSNK2A1,Gene,NCBIGene:11036,GTF2A1L,Gene,upregulates,up-regulates activity,SIGNOR,SIGNOR-250870,PMID:36243968|PMID:12107178,knowledge_assertion,
1,NCBIGene:2932,GSK3B,Gene,NCBIGene:6688,SPI1,Gene,downregulates,down-regulates quantity by destabilization,SIGNOR,SIGNOR-277542,PMID:36243968|PMID:33188146,knowledge_assertion,


In [78]:
print("Row counts after LLM checking")
print(len(source_df))

## Now remove those obvious errors
# Drop rows where subject_name == object_name
source_df_filtered = source_df[source_df['subject_name'] != source_df['object_name']]

## Rename 'predicate' to 'relationship' so that 'predicate' can hold the mapped Biolink predicate below
source_df_filtered = source_df_filtered.rename(columns={'predicate': 'relationship'})

## Rename the raw category columns to 'subject' and 'object';
## the Biolink-mapped values are stored in subject_category and object_category
source_df_filtered = source_df_filtered.rename(
    columns={'subject_category': 'subject', 'object_category': 'object'}
)

## Map the raw subject category to its Biolink category
source_df_filtered['subject_category'] = (
    source_df_filtered['subject'].map(category_map)
)

## Map the raw object category to its Biolink category
source_df_filtered['object_category'] = (
    source_df_filtered['object'].map(category_map)
)

## match only combinations allowed in the pair dictionary
## Apply the mapping and return a Series with 3 columns
source_df_filtered[['predicate', 'subject_direction_qualifier', 'object_direction_qualifier']] = source_df_filtered.apply(
    lambda row: pd.Series(
        predicate_map.get(
            (row['subject_category'], row['object_category'], row['relationship']),
            [None, None, None]  # Default if not found
        )
    ),
    axis=1
)

## drop columns no longer needed
source_df_filtered = source_df_filtered.drop(columns=['subject', 'object', 'relationship', 'original_predicate'])

## Drop unmatched rows, i.e. rows where 'predicate' is NaN, None, or an empty string
source_df_filtered = source_df_filtered[~source_df_filtered['predicate'].isna() & (source_df_filtered['predicate'].str.strip() != '')]


print("Row counts after removing obvious issues")
print(len(source_df_filtered))


Row counts after LLM checking
37051
Row counts after removing obvious issues
32160
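The predicate mapping step relies on a dictionary keyed by `(subject_category, object_category, relationship)` tuples, with `[None, None, None]` as the fallback for combinations that are not allowed. A minimal sketch of that lookup follows. `toy_predicate_map` and `lookup` are hypothetical stand-ins; the real `predicate_map` is loaded from the dictionary notebook, and only the two Gene-Gene entries shown here are taken from the outputs in this notebook.

```python
# Toy stand-in for predicate_map; the real dictionary comes from the helper notebook.
toy_predicate_map = {
    ("biolink:Gene", "biolink:Gene", "upregulates"): ["biolink:regulates", None, "upregulated"],
    ("biolink:Gene", "biolink:Gene", "downregulates"): ["biolink:regulates", None, "downregulated"],
}


def lookup(subject_category, object_category, relationship):
    """Return [predicate, subject_direction_qualifier, object_direction_qualifier]."""
    key = (subject_category, object_category, relationship)
    # Unknown combinations fall back to three None values and are dropped later
    return toy_predicate_map.get(key, [None, None, None])


matched = lookup("biolink:Gene", "biolink:Gene", "upregulates")
unmatched = lookup("biolink:PhenotypicFeature", "biolink:Gene", "upregulates")
print(matched)
print(unmatched)
```

Rows that fall back to the default end up with a missing `predicate`; together with the self-regulating rows removed earlier, these account for the drop from 37051 to 32160 rows.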


In [79]:
## Re-check the unique (subject_category, object_category, predicate) combinations after filtering
col_combos = ['subject_category', 'object_category', 'predicate']

unique_combinations = source_df_filtered[col_combos].drop_duplicates()

print(unique_combinations)

                    subject_category                object_category  \
0                       biolink:Gene                   biolink:Gene   
9      biolink:MacromolecularComplex                   biolink:Gene   
11     biolink:MacromolecularComplex      biolink:PhenotypicFeature   
15            biolink:ChemicalEntity                   biolink:Gene   
37                   biolink:Protein                   biolink:Gene   
47                      biolink:Gene  biolink:MacromolecularComplex   
60             biolink:SmallMolecule                   biolink:Gene   
64             biolink:SmallMolecule          biolink:SmallMolecule   
86             biolink:SmallMolecule          biolink:ProteinFamily   
109            biolink:ProteinFamily                   biolink:Gene   
165                     biolink:Gene      biolink:PhenotypicFeature   
298    biolink:MacromolecularComplex  biolink:MacromolecularComplex   
308                     biolink:Gene                biolink:Protein   
492   

In [80]:
source_df_filtered.head(5)

Unnamed: 0,subject_identifier,subject_name,object_identifier,object_name,provided_by,Primary_Knowledge_Source,publications,knowledge_level,anatomical_context_qualifier,subject_category,object_category,predicate,subject_direction_qualifier,object_direction_qualifier
0,NCBIGene:1457,CSNK2A1,NCBIGene:11036,GTF2A1L,SIGNOR,SIGNOR-250870,PMID:36243968|PMID:12107178,knowledge_assertion,,biolink:Gene,biolink:Gene,biolink:regulates,,upregulated
1,NCBIGene:2932,GSK3B,NCBIGene:6688,SPI1,SIGNOR,SIGNOR-277542,PMID:36243968|PMID:33188146,knowledge_assertion,,biolink:Gene,biolink:Gene,biolink:regulates,,downregulated
2,NCBIGene:5566,PRKACA,NCBIGene:2011,MARK2,SIGNOR,SIGNOR-276870,PMID:36243968|PMID:25512381,knowledge_assertion,,biolink:Gene,biolink:Gene,biolink:regulates,,downregulated
3,NCBIGene:5793,PTPRG,NCBIGene:6714,SRC,SIGNOR,SIGNOR-254725,PMID:36243968|PMID:25624455,knowledge_assertion,,biolink:Gene,biolink:Gene,biolink:regulates,,downregulated
4,NCBIGene:5592,PRKG1,NCBIGene:29109,FHOD1,SIGNOR,SIGNOR-170094,PMID:36243968|PMID:21106951,knowledge_assertion,BTO:0000887;BTO:0001260,biolink:Gene,biolink:Gene,biolink:regulates,,upregulated


In [81]:
# Print all column names
print(source_df_filtered.columns.tolist())

['subject_identifier', 'subject_name', 'object_identifier', 'object_name', 'provided_by', 'Primary_Knowledge_Source', 'publications', 'knowledge_level', 'anatomical_context_qualifier', 'subject_category', 'object_category', 'predicate', 'subject_direction_qualifier', 'object_direction_qualifier']


### Finalize the edge file

In [84]:
## copy to a final edge df
edge_df = source_df_filtered.copy(deep = True)

# Print all column names
# print(edge_df.columns.tolist())

## rename columns to desired parsed format
edge_df = edge_df.rename(columns={'Primary_Knowledge_Source': 'knowledge_source'})
edge_df = edge_df.rename(columns={'anatomical_context_qualifier': 'context_qualifier'})
edge_df = edge_df.rename(columns={'subject_identifier': 'subject'})
edge_df = edge_df.rename(columns={'object_identifier': 'object'})

## add a new agent_type column and set its value to 'automated_agent'
edge_df['agent_type'] = 'automated_agent'

## add a deploy date column
edge_df['deploy_date'] = deployment_date

## Add an 'id' column: one UUID per edge, generated from the columns listed below
column_list = ['subject', 'predicate', 'object', 'context_qualifier', 'deploy_date']
# Apply generate_uuid (from the helper notebook) to each row
edge_df['id'] = edge_df[column_list].apply(generate_uuid, axis=1)

## print head to check
edge_df.head(3)


Unnamed: 0,subject,subject_name,object,object_name,provided_by,knowledge_source,publications,knowledge_level,context_qualifier,subject_category,object_category,predicate,subject_direction_qualifier,object_direction_qualifier,agent_type,deploy_date,id
0,NCBIGene:1457,CSNK2A1,NCBIGene:11036,GTF2A1L,SIGNOR,SIGNOR-250870,PMID:36243968|PMID:12107178,knowledge_assertion,,biolink:Gene,biolink:Gene,biolink:regulates,,upregulated,automated_agent,2025-08-18,d8c53d1b-7e3d-53ee-965c-e60523bb3f4a
1,NCBIGene:2932,GSK3B,NCBIGene:6688,SPI1,SIGNOR,SIGNOR-277542,PMID:36243968|PMID:33188146,knowledge_assertion,,biolink:Gene,biolink:Gene,biolink:regulates,,downregulated,automated_agent,2025-08-18,32a7bd45-71ac-5b11-9cdb-f40329429aa6
2,NCBIGene:5566,PRKACA,NCBIGene:2011,MARK2,SIGNOR,SIGNOR-276870,PMID:36243968|PMID:25512381,knowledge_assertion,,biolink:Gene,biolink:Gene,biolink:regulates,,downregulated,automated_agent,2025-08-18,3cd03f76-b82b-57a2-b773-62f8538a9d28
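The `id` values above have the shape of version-5 (name-based) UUIDs. `generate_uuid` itself lives in the helper notebook and is not shown here, so the following is only a sketch of how a deterministic ID could be derived from the selected columns. The function name `example_generate_uuid`, the `|` separator, and the namespace are all assumptions, and this sketch will not reproduce the IDs printed above.

```python
import uuid

# Hypothetical namespace; the helper notebook may use a different one.
EXAMPLE_NAMESPACE = uuid.NAMESPACE_URL


def example_generate_uuid(values):
    """Build a deterministic UUIDv5 from an ordered sequence of field values."""
    joined = "|".join("" if v is None else str(v) for v in values)
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, joined))


# Fields in the same order as column_list:
# subject, predicate, object, context_qualifier, deploy_date
row = ["NCBIGene:1457", "biolink:regulates", "NCBIGene:11036", None, "2025-08-18"]
first = example_generate_uuid(row)
second = example_generate_uuid(row)
changed = example_generate_uuid(row[:-1] + ["2025-09-01"])

# Same input gives the same ID; a different deploy date gives a different ID
print(first == second, first == changed)
```

With a scheme like this, two edges that agree on every column in `column_list` receive the same ID, so any column that distinguishes edges (for example a direction qualifier) would need to be part of the list to keep IDs unique.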


In [86]:
## reorder the dataframe
desired_order = [
    'subject', 'subject_name', 'subject_category', 'object', 'object_name', 'object_category', 
    'predicate', 'subject_direction_qualifier', 'object_direction_qualifier', 'provided_by', 
    'knowledge_source', 'publications', 'knowledge_level', 'context_qualifier', 'agent_type', 
    'deploy_date', 'id'
]

edge_df = edge_df[desired_order]

print(edge_df.columns.tolist())

['subject', 'subject_name', 'subject_category', 'object', 'object_name', 'object_category', 'predicate', 'subject_direction_qualifier', 'object_direction_qualifier', 'provided_by', 'knowledge_source', 'publications', 'knowledge_level', 'context_qualifier', 'agent_type', 'deploy_date', 'id']


### Obtain the corresponding node file from the final edge file

In [89]:
## Build the node table from the subject and object columns of the edge file;
## rename() returns a new DataFrame here, which avoids pandas' SettingWithCopyWarning
signor_genes_node_subject_df = edge_df[['subject', 'subject_name', 'subject_category']].rename(
    columns={'subject': 'id', 'subject_name': 'name', 'subject_category': 'category'}
)
signor_genes_node_object_df = edge_df[['object', 'object_name', 'object_category']].rename(
    columns={'object': 'id', 'object_name': 'name', 'object_category': 'category'}
)

## vertical concatenation
signor_genes_node_df = pd.concat([signor_genes_node_subject_df, signor_genes_node_object_df])

## check # of node rows before deduplication
print(len(signor_genes_node_df))

# Drop duplicate rows
signor_genes_node_df = signor_genes_node_df.drop_duplicates()

## check # of unique nodes
print(len(signor_genes_node_df))

64320
8666


## The following code is used for quality control and sanity checks
* Check and confirm that all subject and object categories are correctly formatted
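One way to make the format check explicit is to test every category string against a pattern. This is a minimal sketch that assumes the expected convention is `biolink:` followed by an UpperCamelCase class name; `CATEGORY_PATTERN` and `malformed_categories` are hypothetical names, not part of the helper notebook.

```python
import re

# Assumed convention: "biolink:" prefix plus an UpperCamelCase class name
CATEGORY_PATTERN = re.compile(r"^biolink:[A-Z][A-Za-z0-9]*$")


def malformed_categories(categories):
    """Return the sorted unique values that do not match CATEGORY_PATTERN."""
    return sorted({c for c in categories if not CATEGORY_PATTERN.match(str(c))})


example = ["biolink:Gene", "biolink:SmallMolecule", "biolink:Noncoding_RNAProduct", "Gene"]
bad = malformed_categories(example)
print(bad)
```

Under this assumed pattern, a value with an underscore such as `biolink:Noncoding_RNAProduct` would be flagged and may be worth checking against the Biolink model. The same function could be applied to `signor_genes_node_df['category']`.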

In [90]:
## count nodes per category
counts = signor_genes_node_df['category'].value_counts()
print(counts)

category
biolink:Gene                     6375
biolink:ChemicalEntity            992
biolink:SmallMolecule             418
biolink:MacromolecularComplex     416
biolink:PhenotypicFeature         206
biolink:Protein                   126
biolink:ProteinFamily              86
biolink:Drug                       25
biolink:MicroRNA                   21
biolink:Noncoding_RNAProduct        1
Name: count, dtype: int64


In [91]:
## count edges per subject category
counts = edge_df['subject_category'].value_counts()

print(counts)

subject_category
biolink:Gene                     24401
biolink:MacromolecularComplex     2540
biolink:ChemicalEntity            2501
biolink:ProteinFamily             1334
biolink:SmallMolecule             1079
biolink:Protein                    231
biolink:MicroRNA                    42
biolink:Drug                        29
biolink:Noncoding_RNAProduct         3
Name: count, dtype: int64


In [92]:
## count edges per object category
counts = edge_df['object_category'].value_counts()

print(counts)

object_category
biolink:Gene                     28651
biolink:PhenotypicFeature         1550
biolink:MacromolecularComplex     1117
biolink:SmallMolecule              424
biolink:ProteinFamily              222
biolink:Protein                    196
Name: count, dtype: int64


In [98]:
## check missingness
print(edge_df[['predicate', 'subject_direction_qualifier', 'object_direction_qualifier']].isna().sum())

predicate                          0
subject_direction_qualifier    32160
object_direction_qualifier       109
dtype: int64


In [100]:
## fill NA temporarily so that combinations with missing qualifiers are also counted
filled = edge_df[['predicate', 'subject_direction_qualifier', 'object_direction_qualifier']].fillna('None')
counts = filled.value_counts().reset_index(name='count')
print(counts)

                 predicate subject_direction_qualifier  \
0        biolink:regulates                        None   
1        biolink:regulates                        None   
2          biolink:affects                        None   
3          biolink:affects                        None   
4  biolink:in_complex_with                        None   

  object_direction_qualifier  count  
0                upregulated  15032  
1              downregulated   8832  
2                  increased   4886  
3                  decreased   3301  
4                       None    109  
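As an alternative to filling missing values with the string `'None'`, `DataFrame.value_counts` accepts `dropna=False` (pandas 1.3 or later), which keeps combinations containing missing values. A small sketch on a toy dataframe with made-up rows:

```python
import pandas as pd

toy_edges = pd.DataFrame({
    "predicate": ["biolink:regulates", "biolink:regulates", "biolink:in_complex_with"],
    "object_direction_qualifier": ["upregulated", "upregulated", None],
})

# dropna=False keeps combinations that contain missing values
combo_counts = toy_edges.value_counts(dropna=False).reset_index(name="count")
print(combo_counts)
```

This keeps the missing qualifiers as real missing values in the result instead of replacing them with a placeholder string.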


In [101]:
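## Create a graph from the DataFrame
## (from_pandas_edgelist builds an undirected nx.Graph by default, so direction is ignored
## and repeated subject-object pairs collapse into a single edge)
graph = nx.from_pandas_edgelist(edge_df, 'subject', 'object', edge_attr='predicate')

## Print graph information
print('Number of nodes', graph.number_of_nodes())
print('Number of edges', graph.number_of_edges())
print('Average degree', sum(dict(graph.degree).values()) / graph.number_of_nodes())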

Number of nodes 8666
Number of edges 24131
Average degree 5.56912070159243
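The edge count above (24131) is lower than the number of rows in `edge_df` (32160) because an undirected simple graph stores at most one edge per node pair. If direction and parallel edges matter, `create_using=nx.MultiDiGraph` keeps one edge per row. A minimal sketch on toy data with placeholder node names:

```python
import networkx as nx
import pandas as pd

toy_edges = pd.DataFrame({
    "subject":   ["A", "A", "B"],
    "object":    ["B", "B", "A"],
    "predicate": ["biolink:regulates", "biolink:affects", "biolink:regulates"],
})

# Default: undirected simple graph, so all three rows collapse into one A-B edge
simple_graph = nx.from_pandas_edgelist(toy_edges, "subject", "object", edge_attr="predicate")

# MultiDiGraph keeps direction and parallel edges, one edge per row
multi_graph = nx.from_pandas_edgelist(
    toy_edges, "subject", "object", edge_attr="predicate", create_using=nx.MultiDiGraph
)

print(simple_graph.number_of_edges(), multi_graph.number_of_edges())
```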


## Write the parsed node and edge files

In [102]:
## Define the output path for node & edge files after formatting
download_path_node_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/wQualityControl/SINGNOR_parsed_node_{version_number}.tsv'
download_path_edge_file = f'/Users/Weiqi0/ISB_working/Ilya_lab/Translator/Pharmagenomics_KG/files/parsed/wQualityControl/SINGNOR_parsed_edge_{version_number}.tsv'

## Write the node and edge dataframes as tab-separated files
## (comment out the two lines below to skip writing while testing)
signor_genes_node_df.to_csv(download_path_node_file, sep ='\t', index=False)
edge_df.to_csv(download_path_edge_file, sep ='\t', index=False)
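After writing, a quick round-trip read can confirm that the TSV has the expected shape and column order. The sketch below uses a temporary directory and a toy dataframe so that it does not touch the real output paths; the file name `example_edge.tsv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

toy_edge_df = pd.DataFrame({
    "subject": ["NCBIGene:1457"],
    "predicate": ["biolink:regulates"],
    "object": ["NCBIGene:11036"],
    "context_qualifier": [None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this check
    tmp_path = os.path.join(tmp_dir, "example_edge.tsv")
    toy_edge_df.to_csv(tmp_path, sep="\t", index=False)
    reloaded = pd.read_csv(tmp_path, sep="\t")

same_shape = reloaded.shape == toy_edge_df.shape
same_columns = list(reloaded.columns) == list(toy_edge_df.columns)
print(same_shape, same_columns)
```

The same comparison could be run against `edge_df` and the file at `download_path_edge_file` once the real files have been written.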

In [76]:
# Temporarily show full column width (no truncation)
pd.set_option('display.max_colwidth', None)

# Set pandas to display all rows
pd.set_option('display.max_rows', None)
# print(counts)