In [122]:
from rouge_score import rouge_scorer
import llm2geneset
import json
import openai
import numpy as np
import random
import pandas as pd

aclient = openai.AsyncClient()

# ROUGE metric example

In [2]:
# rouge1: unigram overlap
# rouge2: bigram overlap
# rougeL: longest common subsequence
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# score(target, prediction): the reference text comes first
scores = scorer.score('The quick brown fox jumps over the lazy dog',
                      'The quick brown dog jumps on the log.')

In [3]:
scores

{'rouge1': Score(precision=0.75, recall=0.6666666666666666, fmeasure=0.7058823529411765),
 'rouge2': Score(precision=0.2857142857142857, recall=0.25, fmeasure=0.26666666666666666),
 'rougeL': Score(precision=0.625, recall=0.5555555555555556, fmeasure=0.5882352941176471)}
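
The scores above can be reproduced by hand, which makes the three definitions concrete. The sketch below is a simplified re-implementation, not the `rouge_score` code: it lowercases, splits on alphanumeric runs, and skips stemming (which does not change the result for this sentence pair). The names `tokenize`, `prf`, `rouge_n`, and `rouge_l` are illustrative.

```python
from collections import Counter
import re


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric runs, no stemming.
    return re.findall(r"[a-z0-9]+", text.lower())


def prf(overlap, n_pred, n_ref):
    # Precision, recall, and F-measure from an overlap count.
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / n_pred
    recall = overlap / n_ref
    return precision, recall, 2 * precision * recall / (precision + recall)


def rouge_n(reference, prediction, n=1):
    # Clipped n-gram overlap: an n-gram counts at most as often as it
    # appears in both texts.
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, pred = ngrams(tokenize(reference)), ngrams(tokenize(prediction))
    overlap = sum((ref & pred).values())
    return prf(overlap, sum(pred.values()), sum(ref.values()))


def rouge_l(reference, prediction):
    # Longest common subsequence length via dynamic programming.
    ref, pred = tokenize(reference), tokenize(prediction)
    table = [[0] * (len(pred) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, p in enumerate(pred, 1):
            if r == p:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return prf(table[-1][-1], len(pred), len(ref))


reference = 'The quick brown fox jumps over the lazy dog'
prediction = 'The quick brown dog jumps on the log.'
print(rouge_n(reference, prediction, 1))  # precision 6/8, recall 6/9
print(rouge_n(reference, prediction, 2))  # precision 2/7, recall 2/8
print(rouge_l(reference, prediction))     # precision 5/8, recall 5/9
```

The reference has 9 tokens and the prediction 8; they share 6 unigrams, 2 bigrams, and a common subsequence of 5 tokens ("the quick brown jumps the").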

In [4]:
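# Libraries available for evaluation
lib_names = ["KEGG_2021_Human",
             "Reactome_2022",
             "WikiPathway_2023_Human"]
# This run is restricted to KEGG; comment out the next line to evaluate all three.
lib_names = ["KEGG_2021_Human"]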
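
Each trial in the next cell pools the genes of two randomly chosen gene sets into one shuffled query. A minimal, reproducible sketch of that step follows. `toy_library`, `make_mixed_query`, and the `seed` parameter are hypothetical names, and the gene lists are toy data rather than KEGG content. The loop in the notebook draws from the global `random` state without a seed, so its pairs differ from run to run.

```python
import random

# Toy stand-in for a gene set library (hypothetical data, not taken from KEGG).
toy_library = {
    "Pathway A": ["TP53", "MDM2", "CDKN1A", "ATM"],
    "Pathway B": ["IL6", "STAT3", "JAK1"],
    "Pathway C": ["ATM", "BRCA1", "CHEK2"],
}


def make_mixed_query(library, k=2, seed=None):
    # A dedicated Random instance keeps the draw reproducible without
    # touching the global random state.
    rng = random.Random(seed)
    names = rng.sample(sorted(library), k)
    # dict.fromkeys drops duplicate genes while keeping a stable order.
    # Iterating a set of strings can change order between interpreter runs,
    # which would defeat the seeded shuffle.
    genes = list(dict.fromkeys(g for name in names for g in library[name]))
    rng.shuffle(genes)
    return names, genes


names, genes = make_mixed_query(toy_library, seed=0)
print(names)
print(", ".join(genes))
```

Calling `make_mixed_query` twice with the same seed returns the same pair of names and the same gene order, which makes a trial easy to rerun when a model response looks suspicious.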

In [127]:
model = "gpt-4o-2024-05-13"

df_output=pd.DataFrame()

for lib_name in lib_names:
    with open("libs_human/" + model + "/" + lib_name + ".json") as f:
        gen_res = json.load(f)
        descr_cleaned = gen_res['descr_cleaned']
        curated_genesets = gen_res["curated_genesets"]
        lib_size = len(descr_cleaned)
        
        for i in range(10):
            # randomly choose 2 gene sets from library
            rand_choices = random.sample(range(lib_size), 2)
            gt_name = [descr_cleaned[rand_choices[0]], descr_cleaned[rand_choices[1]]]
            # combine and shuffle the genes from 2 gene sets
            genes = [curated_genesets[rand_choices[0]], curated_genesets[rand_choices[1]]]
            genes_combine = set()
            for g in genes:
                genes_combine.update(set(g))  
            genes_combine = list(genes_combine)
            random.shuffle(genes_combine)
            genes_queries = [[", ".join(genes_combine)]]
            
            # use llm2geneset to generate high confidence gene set names;
            # reuse `model` so the queries match the model named in the library path
            llm2geneset_res = await llm2geneset.get_pathways_from_genes(aclient,
                                                                        genes_queries,
                                                                        model=model,
                                                                        n_retry=3,
                                                                        use_sysmsg=False)
            conf = np.array(llm2geneset_res[0]['conf'])
            llm2geneset_names = np.array(llm2geneset_res[0]['parsed_pathways'])[conf == 'high']

            # use GSAI to generate a single gene set name
            gsai_res = await llm2geneset.gsai(aclient, genes_queries, model=model, n_retry=1)
            gsai_name = gsai_res[0]['name']
    
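            gsai_scores = []
            llm2geneset_scores = []

            for ref in gt_name:
                gsai_scores.append(scorer.score(ref, gsai_name)['rouge1'].fmeasure)
                # best ROUGE-1 F-measure over the high-confidence llm2geneset names;
                # fall back to 0.0 when no name was rated high confidence
                # (np.max raises ValueError on an empty list)
                pred_scores = [scorer.score(ref, pred)['rouge1'].fmeasure
                               for pred in llm2geneset_names]
                llm2geneset_scores.append(max(pred_scores, default=0.0))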

            output={"library":[lib_name],
                    "gt_name_1":[gt_name[0]],
                    "gt_name_2":[gt_name[1]],
                    "gsai_ROUGE1_1":[gsai_scores[0]],
                    "gsai_ROUGE1_2":[gsai_scores[1]],
                    "llm2geneset_ROUGE1_1":[llm2geneset_scores[0]],
                    "llm2geneset_ROUGE1_2":[llm2geneset_scores[1]]} 
            df_output = pd.concat([df_output, pd.DataFrame(output)], ignore_index=True)

100%|█████████████████████████████████████████████| 1/1 [00:10<00:00, 10.19s/it]
100%|█████████████████████████████████████████████| 1/1 [00:17<00:00, 17.11s/it]


Name: Immune Response Regulation and Signaling
LLM self-assessed confidence: 0.92

The system of interacting proteins provided is heavily involved in the regulation and signaling associated with immune responses. Below is the detailed analysis of how these proteins interact and function within this biological process:

1. **Cytokine Signaling and Regulation**:
   - *Interleukins and Receptors*: Proteins such as IL2 (and its receptors IL2RA, IL2RB, and IL2RG), IL4 (and IL4R), IL6 (and IL6R and IL6ST), IL18, IL21 (and IL21R), IL23A (and IL23R), IL27RA, and others, mediate various signaling pathways crucial for immune response modulation.
   - *Interferons and Receptors*: Proteins like IFNG, IFNB1, and multiple IFNA isoforms (IFNA1, IFNA2, IFNA5, IFNA6, IFNA7, IFNA8, IFNA10, IFNA13, IFNA14, IFNA16, IFNA17, IFNA21) along with their receptors (IFNGR1, IFNGR2) play significant roles in antiviral responses and immune regulation.
   - *Tumor Necrosis Factor*: TNF signaling mediated by proteins

100%|█████████████████████████████████████████████| 1/1 [00:09<00:00,  9.34s/it]
100%|█████████████████████████████████████████████| 1/1 [00:10<00:00, 10.38s/it]


### Name: Immune Response and Signal Transduction
### LLM self-assessed confidence: 0.92

This system of interacting proteins predominantly operates within immune response pathways and signal transduction mechanisms. The analysis indicates a robust network involving T cell activation, antigen recognition, intracellular signaling, and subsequent gene expression and cellular response.

1. **Protein Tyrosine Kinases and Adapters:** Proteins such as LCK, ZAP70, SRC, and LAT play a pivotal role in T cell receptor (TCR) signaling. LCK phosphorylates ITAMs (Immunoreceptor Tyrosine-based Activation Motifs) on the TCR complex, leading to ZAP70 recruitment and activation. LAT serves as an adaptor that organizes signaling complexes necessary for downstream signaling events.

2. **Transcription Factors and Signal Propagation:** The transcription factors JUN, FOS, and RELA integrate signals from MAPK and NF-κB pathways, translating extracellular stimuli into gene expression changes that drive cell 

100%|█████████████████████████████████████████████| 1/1 [00:05<00:00,  5.71s/it]
100%|█████████████████████████████████████████████| 1/1 [00:07<00:00,  7.80s/it]


Name: Immune response and signal transduction
LLM self-assessed confidence: 0.88

1. The presence of various interferon proteins, such as IFNA1, IFNA2, IFNA10, IFNA13, and IFNA21, suggests the system's involvement in the immune response. These interferons are pivotal for antiviral defense mechanisms and immune regulation.

2. Proteins like TLR4, TLR7, TLR8, and TLR3 (Toll-like receptors) further implicate this system in pathogen recognition and activation of innate immunity. These receptors detect microbial components and initiate innate immune responses through downstream signal transduction pathways, including the activation of pro-inflammatory cytokines like TNF and IL1B.

3. Signal transduction pathways are also supported by various MAPK family members (MAPK1, MAPK3, MAPK8, MAPK11, MAPK12, MAPK13, MAPK14) which are heavily involved in transducing extracellular signals to intracellular responses, contributing to cell growth, differentiation, and inflammatory responses.

4. JAK1, STA

100%|█████████████████████████████████████████████| 1/1 [00:02<00:00,  2.43s/it]
100%|█████████████████████████████████████████████| 1/1 [00:06<00:00,  6.64s/it]


Name: Collagen synthesis and modification
LLM self-assessed confidence: 0.90

The provided set of interacting proteins consists largely of enzymes involved in collagen biosynthesis, modification, and other structural proteins. Collagens, forming a major part of the extracellular matrix (ECM), are critical for tissue integrity and function.

1. **Type of Collagens:** Many collagens are included, such as COL1A1, COL1A2, COL2A1, COL4A1, COL4A2, COL5A1, COL5A2, COL6A1, COL6A2, COL7A1, COL8A1, COL8A2, COL9A1, COL9A2, COL10A1, COL11A1, COL11A2, and COL12A1. These proteins are integral to forming the structural framework of tissues.

2. **Collagen Processing Enzymes:** Proteins like UGT2A1, UGT2B7, and CES2 are part of the glycosylation and modification machinery, essential for post-translational modifications of collagen, impacting its stability, folding, and function.

3. **Cross-linking and Matrix Interactions:** Proteins such as SLC8A1, SLC8A2, and SLC8A3, involved in calcium homeostasis,

100%|█████████████████████████████████████████████| 1/1 [00:03<00:00,  3.09s/it]
100%|█████████████████████████████████████████████| 1/1 [00:12<00:00, 12.51s/it]


Name: Nucleotide metabolism and regulation
LLM self-assessed confidence: 0.97

1. **Adenylate Kinases (AKs)**: AK1, AK2, AK3, AK4, AK5, AK6, AK7, AK8, and AK9 are involved in the regulation of adenine nucleotide compositions, helping maintain the cellular energy homeostasis by catalyzing the interconversion of adenine nucleotides (ATP, ADP, and AMP). This group of enzymes is crucial for proper energy transfer within cells.

2. **Phosphodiesterases (PDEs)**: Multiple PDE family members, such as PDE1A, PDE1B, PDE1C, PDE2A, PDE3A, PDE3B, PDE4A, PDE4B, PDE4C, PDE4D, PDE5A, and others, play key roles in cyclic nucleotide metabolism by hydrolyzing cAMP and cGMP, which are important second messengers in numerous signaling pathways. PDE activity helps to regulate intracellular concentrations of these cyclic nucleotides, thereby controlling signal transduction responses.

3. **Nucleotidases (NTs)**: Proteins such as NT5C, NT5E, NT5C1A, NT5C2, and others, are involved in the hydrolysis of nucleo

100%|█████████████████████████████████████████████| 1/1 [00:05<00:00,  5.67s/it]
100%|█████████████████████████████████████████████| 1/1 [00:11<00:00, 11.25s/it]


Name: Cell Motility and Cytoskeletal Reorganization
LLM self-assessed confidence: 0.92

The provided protein interaction system prominently features proteins involved in cell motility, cytoskeletal reorganization, and integrin-mediated signaling pathways. 

1. **Integrins and Adhesion**: Proteins such as ITGA5, ITGB1, ITGA3, ITGB2, ITGA6, ITGA7, and ITGB3 are integrins that mediate cell adhesion to the extracellular matrix and are essential for cell migration. Integrins interact with actin cytoskeleton through complex signaling cascades, including focal adhesion kinases (FAKs) like PTK2, and scaffold proteins like PXN (paxillin).

2. **Actin Cytoskeleton Dynamics**: Many proteins are involved in actin filament polymerization and depolymerization, critical for pseudopod formation, lamellipodia extension, and cell motility. Actin-related proteins such as ACTN1, ACTB, ACTG1, and ARPC (Actin-Related Protein Complex subunits: ARPC1A, ARPC1B, ARPC2, ARPC3, ARPC4, and ARPC5) coordinate these 

100%|█████████████████████████████████████████████| 1/1 [00:04<00:00,  4.84s/it]
100%|█████████████████████████████████████████████| 1/1 [00:09<00:00,  9.27s/it]


Name: Inflammatory Signaling and Immune Response Regulation
LLM self-assessed confidence: 0.80

Analysis:

The given protein set comprises several diverse components that collectively play critical roles in inflammatory signaling, immune response regulation, and cellular dynamics.

1. **Cytokines and Receptors**: 
   - **IL1B, IL6, IL10, IL12A, IL12B, IL18, TNF, IFNG**: These proteins are key cytokines and interleukins involved in the regulation of immune responses. They stimulate inflammatory pathways and modulate the activities of various immune cells.
   - **FAS, FASLG**: These proteins mediate apoptosis via the extrinsic pathway, contributing to the regulation of immune responses and the elimination of infected or cancerous cells.
  
2. **Signaling Proteins and Kinases**: 
   - **MYD88, PRKACA, PRKACB, PRKACG, PRKCB, PRKCA, PRKCG**: These proteins are involved in various signaling pathways, including the activation of inflammatory genes and the regulation of immune cell function.
 

100%|█████████████████████████████████████████████| 1/1 [00:07<00:00,  7.62s/it]
100%|█████████████████████████████████████████████| 1/1 [00:06<00:00,  6.86s/it]


Name: Signal Transduction and Cellular Stress Response
LLM self-assessed confidence: 0.91

In the provided system, the predominant biological processes revolve around signal transduction pathways and cellular responses to external stimuli and stress.

1. **Signal Transduction**: The system involves multiple growth factors and receptor tyrosine kinases (RTKs), such as FGF family members (FGF1, FGF2, FGF19), and their receptors (FGFR1, FGFR2, FGFR3, FGFR4). These proteins are crucial for transmitting signals from the cell surface to the interior, triggering various cellular responses such as proliferation, differentiation, and survival. Additionally, proteins like VEGFA, VEGFB, and their receptors (VEGFR1, VEGFR2) play central roles in vascular endothelial growth signaling. The presence of integrins (ITGA2, ITGA5, ITGB1, ITGB3) further supports cell adhesion, migration, and integration of extracellular signals into intracellular signaling pathways.

2. **Cellular Stress Response**: Prote

100%|█████████████████████████████████████████████| 1/1 [00:03<00:00,  3.47s/it]
100%|█████████████████████████████████████████████| 1/1 [00:07<00:00,  7.22s/it]


Name: DNA Repair and Immune Response Regulation
LLM self-assessed confidence: 0.90

1. **DNA Repair**: This system includes several major players in nucleotide excision repair (NER) and other DNA repair pathways. Proteins like ERCC2, GTF2H3, ERCC5, ERCC3, and ERCC1 are components of the transcription factor IIH (TFIIH) complex, essential for NER. Other relevant proteins such as XPA and DDB1 are crucial for damage recognition and repair process coordination, indicating a strong emphasis on maintaining genomic stability.

2. **DNA Replication and Synthesis**: Proteins like POLD1, POLE, RFC1, RFC2, RFC4, PCNA, and LIG1 are central to DNA replication. They function in synthesizing new DNA strands, ensuring that DNA is accurately copied and maintained. This integration is critical for cell cycle progression and genomic fidelity.

3. **Immune Response Regulation**: Several proteins in this system, particularly those involving HLA classes (e.g., HLA-DPB1, HLA-DQB1, HLA-DRB1, and related molec

100%|█████████████████████████████████████████████| 1/1 [00:06<00:00,  6.17s/it]
100%|█████████████████████████████████████████████| 1/1 [00:09<00:00,  9.55s/it]

Name: Metabolic and xenobiotic processing
LLM self-assessed confidence: 0.85

1. The proteins in this system are primarily involved in various aspects of metabolism, including lipid, carbohydrate, amino acid, and retinoid metabolism. For example, HADHA and HADHB participate in fatty acid beta-oxidation, enabling the breakdown and utilization of fatty acids. 

2. Proteins such as LDHA, LDHB, and LDHC are involved in lactate metabolism, catalyzing the interconversion of lactate and pyruvate within the glycolytic cycle, which is crucial for cellular energy production under anaerobic conditions. 

3. Enzymes such as ADH1A, ADH1B, and ADH1C are responsible for the oxidation of alcohols to aldehydes, which is part of the detoxification process. Similarly, aldehyde dehydrogenases like ALDH1A1 and ALDH1A2 further convert these aldehydes into carboxylic acids, aiding in their excretion.

4. The large number of cytochrome P450 enzymes (e.g., CYP1A1, CYP2C9, CYP3A4) signify their collective role 




In [128]:
df_output

,library,gt_name_1,gt_name_2,gsai_ROUGE1_1,gsai_ROUGE1_2,llm2geneset_ROUGE1_1,llm2geneset_ROUGE1_2
0,KEGG_2021_Human,Th17 cell differentiation,Cytosolic DNA-sensing pathway,0.0,0.0,0.25,0.285714
1,KEGG_2021_Human,DNA replication,Yersinia infection,0.0,0.0,1.0,0.0
2,KEGG_2021_Human,Coronavirus disease,Hypertrophic cardiomyopathy,0.0,0.0,0.0,0.0
3,KEGG_2021_Human,Protein digestion and absorption,Drug metabolism,0.25,0.0,0.222222,0.5
4,KEGG_2021_Human,Purine metabolism,Glutathione metabolism,0.333333,0.333333,1.0,1.0
5,KEGG_2021_Human,RIG-I-like receptor signaling pathway,Regulation of actin cytoskeleton,0.0,0.0,0.545455,1.0
6,KEGG_2021_Human,Vasopressin-regulated water reabsorption,African trypanosomiasis,0.2,0.0,0.0,0.0
7,KEGG_2021_Human,PI3K-Akt signaling pathway,Ubiquinone and other terpenoid-quinone biosynt...,0.2,0.166667,1.0,0.0
8,KEGG_2021_Human,Nucleotide excision repair,Leishmaniasis,0.222222,0.0,0.4,0.0
9,KEGG_2021_Human,Retinol metabolism,Propanoate metabolism,0.333333,0.333333,0.5,0.5
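
A per-method average makes the table easier to read at a glance. The sketch below copies the four score columns from the table above into plain Python tuples and averages them over all 20 reference names (10 trials, 2 references each). The variable names are illustrative, and with only 10 trials from a single library the averages are indicative at best.

```python
from statistics import fmean

# ROUGE-1 F-measures copied from the table above, one tuple per trial:
# (gsai_ROUGE1_1, gsai_ROUGE1_2, llm2geneset_ROUGE1_1, llm2geneset_ROUGE1_2)
rows = [
    (0.0, 0.0, 0.25, 0.285714),
    (0.0, 0.0, 1.0, 0.0),
    (0.0, 0.0, 0.0, 0.0),
    (0.25, 0.0, 0.222222, 0.5),
    (0.333333, 0.333333, 1.0, 1.0),
    (0.0, 0.0, 0.545455, 1.0),
    (0.2, 0.0, 0.0, 0.0),
    (0.2, 0.166667, 1.0, 0.0),
    (0.222222, 0.0, 0.4, 0.0),
    (0.333333, 0.333333, 0.5, 0.5),
]

# Pool both reference names per trial for each method.
gsai_values = [value for row in rows for value in row[:2]]
llm2geneset_values = [value for row in rows for value in row[2:]]

means = {
    "gsai": fmean(gsai_values),
    "llm2geneset": fmean(llm2geneset_values),
}
for method, value in means.items():
    print(f"{method}: mean ROUGE-1 F-measure = {value:.3f}")
```

With the DataFrame still in memory, `df_output.filter(like="ROUGE1").mean()` should give the corresponding per-column averages.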
