### Generating a signaling network using LLMs ###
This script takes a gene list and queries an LLM to generate a signaling network. The LLM is asked to identify interactions between the proteins encoded by these genes, restricted to interactions supported by experimental evidence. The resulting network is then sent to Cytoscape through py4cytoscape, a Python interface to Cytoscape.

In [1]:
import os, time, json, re, openai
import pandas as pd
from typing import List, Literal
from pydantic import BaseModel

In [2]:
# SDK API keys
api_key = '######'
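Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY` (the variable name and the helper `load_api_key` are illustrative assumptions, not part of the original script):

```python
import os

def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in an environment variable (hypothetical helper)."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {var_name} environment variable before running this notebook."
        )
    return key

# api_key = load_api_key()  # could replace the hardcoded assignment above
```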

In [3]:
# classes used as the structured format for the LLM output
# Pydantic's BaseModel validates that each object matches the field types defined below

class GeneInteraction(BaseModel):
    # each GeneInteraction involves a source, a target, and a direction (stimulation or inhibition)
    source_node: str
    target_node: str
    action: Literal["stimulation", "inhibition"]
    pubmedIDs: list[str]


class InteractionList(BaseModel):
    # storing list of GeneInteractions - generating one, final InteractionList
    interactions: list[GeneInteraction]
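To illustrate what the schema enforces, the sketch below validates a JSON payload against the same two classes and shows that an `action` outside the `Literal` is rejected. It assumes Pydantic v2 (for `model_validate_json`); the node names and PubMed ID in the payload are placeholders, not real data.

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class GeneInteraction(BaseModel):
    source_node: str
    target_node: str
    action: Literal["stimulation", "inhibition"]
    pubmedIDs: list[str]

class InteractionList(BaseModel):
    interactions: list[GeneInteraction]

# Placeholder payload: names and the ID are made up for illustration.
valid_json = (
    '{"interactions": [{"source_node": "A", "target_node": "B", '
    '"action": "inhibition", "pubmedIDs": ["00000000"]}]}'
)
parsed = InteractionList.model_validate_json(valid_json)
print(parsed.interactions[0].action)

# An action outside the Literal fails validation.
bad_json = valid_json.replace("inhibition", "binding")
try:
    InteractionList.model_validate_json(bad_json)
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```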

In [4]:
from openai import OpenAI

# prompt_LLM should return InteractionList object
def prompt_LLM(gene_symbol_list, max_tokens: int, temp: float) -> InteractionList:
    
    client = OpenAI(api_key=api_key) # API key
    model_version = "gpt-4.1"  # swap in another model name (e.g., gpt-5) as needed

    # High-level instructions to guide LLM assistant (how the model should behave)
    assistant_instructions = (
        "You are a helpful research assistant who is an expert at searching "
        "published research articles and identifying molecular interactions. "
        "Always respond in valid JSON matching the given schema."
    )

    # Text prompt to encourage structured output from the LLM
    prompt = f"""
You are given a list of genes and other signaling nodes:

{gene_symbol_list}

Identify interactions between any proteins encoded by the genes
in this list. Only include interactions that are supported by published
experimental literature.

Return ONLY JSON matching this schema:

{{
  "interactions": [
    {{
      "source_node": "string",
      "target_node": "string",
      "action": "stimulation" | "inhibition",
      "pubmedIDs": ["12345678", "23456789", "34567890"]
    }},
    ...
  ]
}}

- Please return at least 2 pubmedIDs per interaction.
- There should be no duplicate interactions.
- Use exactly the field names above.
- Use "stimulation" or "inhibition" only.
- Use PubMed IDs as strings.
- Do not include any extra top-level keys or text.
    """.strip()

    # actual API call - parse method for structured outputs (use client.responses.create for unstructured output)
    response = client.responses.parse(
        model=model_version,
        input=prompt,                          # passing prompt
        max_output_tokens=max_tokens,
        temperature=temp,              
        instructions=assistant_instructions,   # instructions for how the LLM assistant should behave
        text_format=InteractionList,           # ensuring structured format (based on defined class)
    )

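    # parsing output from LLM; output_parsed can be None (for example, if the model refuses)
    parsed = response.output_parsed
    if parsed is None:
        raise ValueError("LLM response could not be parsed into an InteractionList")

    return InteractionList(interactions=list(parsed.interactions))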
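The prompt asks for no duplicate interactions and at least two PubMed IDs per interaction, but the schema itself does not enforce either rule. The sketch below shows one way to apply them after the call. It works on plain dicts to stay self-contained; the helper name `clean_interactions` and the case-insensitive comparison of symbols are assumptions, and the records are placeholders.

```python
def clean_interactions(records, min_pmids=2):
    """Drop duplicate edges and edges with too few PubMed IDs (hypothetical helper)."""
    seen = set()
    kept = []
    for rec in records:
        key = (rec["source_node"].upper(), rec["target_node"].upper(), rec["action"])
        if key in seen:
            continue  # same source, target, and action already kept
        if len(set(rec["pubmedIDs"])) < min_pmids:
            continue  # fewer distinct citations than the prompt requested
        seen.add(key)
        kept.append(rec)
    return kept

# Placeholder records; names and IDs are illustrative only.
records = [
    {"source_node": "A", "target_node": "B", "action": "stimulation", "pubmedIDs": ["1", "2"]},
    {"source_node": "a", "target_node": "B", "action": "stimulation", "pubmedIDs": ["3", "4"]},
    {"source_node": "B", "target_node": "C", "action": "inhibition", "pubmedIDs": ["5"]},
]
cleaned = clean_interactions(records)
print(len(cleaned))
```

With Pydantic v2 objects, dicts of this shape could be obtained with `[gi.model_dump() for gi in interactions.interactions]`.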


In [5]:
# smooth muscle cell type 1 specific markers in atherosclerotic plaques
gene_list = ["ACTC1","SBSPON","RERGL","SORL1","C11orf96","TCEAL2","LDB3","TPH1","SCRG1","NPR1","HSPB7","MYOCD","GRIA2",
"CNN1","LDOC1","PLN","SOST","KCNMA1","SUSD5","CSRP2","HACD1","MRAP2","SMTN","ST6GAL2","PHACTR1"]

# TCGA data - glioblastoma solid tumors, bulk sequencing with differential expression comparing genes upregulated in recurrent gliomas (relative to chemoresponsive)
# note: this assignment overrides the smooth muscle list above
gene_list = ["CDR1","CEP72","COL2A1","SGMS2","SMIM22","ATP5PD","TDRD6","S100A13","NME7","KRT86","PNKP","PKMYT1",
             "H2AC20","ALDH1A3","SNAI3","PIGH","FUT7","RNF214","PDZD7","FKBP15","CXCR5","PVALEF"]

# choose your own gene_list!

In [6]:
tokens = 40000 # reasoning models (5.0+) use a LOT of tokens: they take time to think, explore possibilities, and plan, and this hidden token usage counts toward the limit
temperature = 1.0

# prompting LLM and storing interaction list
interactions = prompt_LLM(gene_list, tokens, temperature)

In [8]:
# looking at InteractionList object and the list of gene interactions
interactions.interactions

[GeneInteraction(source_node='Hdac6', target_node='Sirt1', action='stimulation', pubmedIDs=['19416850', '22908229']),
 GeneInteraction(source_node='Rhoa', target_node='Foxm1', action='stimulation', pubmedIDs=['25622913', '31972819']),
 GeneInteraction(source_node='Il10', target_node='Il23a', action='inhibition', pubmedIDs=['19884610', '23698715']),
 GeneInteraction(source_node='Il10', target_node='Tnfsf12', action='inhibition', pubmedIDs=['15608266', '15187134']),
 GeneInteraction(source_node='Hdac5', target_node='Foxm1', action='inhibition', pubmedIDs=['18483264', '30471474']),
 GeneInteraction(source_node='Cpe', target_node='Igf1r', action='stimulation', pubmedIDs=['9639650', '12970300'])]
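None of the symbols in the output above appear in either `gene_list` defined earlier, so it can be worth checking the returned nodes against the input before plotting. The sketch below flags off-list nodes and malformed PubMed IDs. The helper `audit_edges` is hypothetical, and the pattern assumes a PMID is a positive integer without leading zeros; it checks format only, not whether the article exists or supports the interaction.

```python
import re

# Assumption: a PMID is a positive integer without leading zeros.
PMID_PATTERN = re.compile(r"[1-9]\d*")

def audit_edges(edges, allowed_symbols):
    """Return (off_list_nodes, malformed_pmids) for (source, target, pmids) tuples."""
    allowed = {s.upper() for s in allowed_symbols}
    off_list = set()
    malformed = set()
    for source, target, pmids in edges:
        for node in (source, target):
            if node.upper() not in allowed:
                off_list.add(node)
        for pmid in pmids:
            if not PMID_PATTERN.fullmatch(pmid):
                malformed.add(pmid)
    return sorted(off_list), sorted(malformed)

# Two edges copied from the output above, checked against a few symbols from gene_list.
edges = [
    ("Hdac6", "Sirt1", ["19416850", "22908229"]),
    ("Cpe", "Igf1r", ["9639650", "12970300"]),
]
off_list, malformed = audit_edges(edges, ["CDR1", "CEP72", "COL2A1"])
print(off_list)   # ['Cpe', 'Hdac6', 'Igf1r', 'Sirt1']
print(malformed)  # []
```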

### Plotting output using cytoscape ###
Cytoscape is a Java-based software environment used for network visualization and analysis. It requires a node table and an edge table. We can interface with it via the `py4cytoscape` API.

In [9]:
# parsing LLM output into pandas DataFrames (a node table and an edge table)
def interactionlist_to_cytoscape_df(interaction_list):

    # constructing edge dataframe
    edges_records = []
    for gene_interaction in interaction_list.interactions:
        edges_records.append({
            "source": gene_interaction.source_node,  # source
            "target": gene_interaction.target_node,  # target
            "interaction": gene_interaction.action,  # passing stimulation/inhibition to cytoscape interaction column
            "pmids": gene_interaction.pubmedIDs   # passing pubmedIDs as citation field
        })
    edges_df = pd.DataFrame(edges_records)

    # constructing node dataframe
    node_names = set()
    for gene_interaction in interaction_list.interactions:
        node_names.add(gene_interaction.source_node)
        node_names.add(gene_interaction.target_node)

    nodes_df = pd.DataFrame({"id": sorted(node_names)}) 

    return nodes_df, edges_df
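The `pmids` field in each edge record holds a Python list. If a list-valued column turns out to be awkward for the edge table or for CSV export (an assumption; whether it matters depends on the py4cytoscape and Cytoscape versions in use), one option is to join the IDs into a single delimited string before building the DataFrame. The helper `flatten_pmids` and the `;` separator are illustrative choices.

```python
def flatten_pmids(edge_records, sep=";"):
    """Return copies of edge records with the 'pmids' list joined into one string."""
    flattened = []
    for rec in edge_records:
        new_rec = dict(rec)  # shallow copy so the input records are left untouched
        new_rec["pmids"] = sep.join(rec["pmids"])
        flattened.append(new_rec)
    return flattened

# One record in the shape built by interactionlist_to_cytoscape_df,
# using an edge from the LLM output shown earlier.
records = [{"source": "Il10", "target": "Il23a", "interaction": "inhibition",
            "pmids": ["19884610", "23698715"]}]
print(flatten_pmids(records)[0]["pmids"])  # 19884610;23698715
```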


In [10]:
# Convert to cytoscape compatible DataFrames
nodes_df, edges_df = interactionlist_to_cytoscape_df(interactions) # passing the InteractionList returned by prompt_LLM

In [11]:
# taking a peek at the node and edge tables
nodes_df

        id
0      Cpe
1    Foxm1
2    Hdac5
3    Hdac6
4    Igf1r
5     Il10
6    Il23a
7     Rhoa
8    Sirt1
9  Tnfsf12


In [12]:
edges_df

   source   target  interaction                 pmids
0   Hdac6    Sirt1  stimulation  [19416850, 22908229]
1    Rhoa    Foxm1  stimulation  [25622913, 31972819]
2    Il10    Il23a   inhibition  [19884610, 23698715]
3    Il10  Tnfsf12   inhibition  [15608266, 15187134]
4   Hdac5    Foxm1   inhibition  [18483264, 30471474]
5     Cpe    Igf1r  stimulation   [9639650, 12970300]


In [14]:
import py4cytoscape as p4c

# Now construct the network in cytoscape and display 
p4c.create_network_from_data_frames(
    nodes=nodes_df,
    edges=edges_df,
    title='Network',
    collection='Collection'
)

Applying default style...
Applying preferred layout


325