# Bioinformatics Databases

The following introduction was written by BRAD:

This tutorial shows how to use the Bioinformatic Retrieval Augmented Data (BRAD) chatbot to retrieve enriched pathways and functional annotations for your gene lists from the Enrichr and Gene Ontology (GO) databases. BRAD combines a language model with a text database, web browsing, and access to the Enrichr and Gene Ontology databases, so that it can answer questions in biology, bioinformatics, genetics, and data science.

To begin, install BRAD and its dependencies by following the installation instructions in the official BRAD GitHub repository. Once installed, open a terminal or command prompt, type `brad`, and press Enter. From a Python session or notebook, you can instead start a chat with `from BRAD import brad` followed by `brad.chat()`, as in the Scratch section of this notebook. With BRAD running, you can submit gene lists and request enriched pathways and functional annotations.


To query the Enrichr and Gene Ontology databases, start the chat with a request such as: "Hello BRAD, I would like to use the Gene Ontology and ENRICHR databases to explore the functions and pathways enriched in my gene list. Could you please assist me with this task?" BRAD acknowledges the request and describes the steps required to proceed. The sessions recorded in the Scratch section instead route the query directly to the database module with a command of the form `/force GGET look up the following genes on enrichr: PCNA, CDT1, ...`.


For Enrichr analysis, provide your genes in one of the accepted formats: a comma-separated string of gene names (for example `gene1,gene2,gene3`), a plain-text gene list file (for example `gene1.txt`), or a TSV/CSV file (for example `gene_list.tsv`). BRAD may ask which format or gene list you are working with. Once the gene list is provided, BRAD runs a pipeline that fetches the relevant enriched pathways and summarizes the results in plain language.
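
As a rough illustration of the input formats described above, the sketch below normalizes either a comma-separated string or a small text/TSV/CSV file into a clean list of gene symbols. The helper name `load_gene_list` and the assumption that gene symbols sit in the first column of a file are hypothetical; BRAD's own input handling may differ.

```python
import csv
from pathlib import Path


def load_gene_list(source):
    """Return gene symbols from a comma-separated string or a file path.

    Hypothetical helper for illustration only; not part of BRAD.
    """
    path = Path(source)
    suffix = path.suffix.lower()
    if suffix in {".txt", ".tsv", ".csv"} and path.is_file():
        delimiter = "," if suffix == ".csv" else "\t"
        with path.open(newline="") as handle:
            rows = csv.reader(handle, delimiter=delimiter)
            # Assumption: gene symbols are in the first column, one per row.
            genes = [row[0] for row in rows if row]
    else:
        # Otherwise treat the input as a comma-separated string of names.
        genes = source.split(",")
    # Trim surrounding whitespace and drop empty entries.
    return [gene.strip() for gene in genes if gene.strip()]


print(load_gene_list("PCNA, CDT1, MYOG"))  # ['PCNA', 'CDT1', 'MYOG']
```

Stripping whitespace at this stage avoids entries such as `' CDT1'` that arise when a list is split on commas alone.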


For Gene Ontology analysis, once you supply a gene list, BRAD queries the database for functional annotations and their corresponding GO term labels. The results are presented as GO slim and full sets, which helps you understand the biological roles and molecular functions of the genes in your list.


In summary, BRAD provides a simple way to query the Enrichr and Gene Ontology databases. By following this tutorial and practicing with BRAD, you will learn to use these databases to run enrichment analysis on your gene lists and to interpret the biological implications of your data. Happy exploring!

# Enrichr

# Gene Ontology

# Querying Databases with BRAD

# Connecting New Databases?

# Scratch

In [1]:
from BRAD.gene_ontology import geneOntology
from BRAD.enrichr import queryEnrichr

[nltk_data] Downloading package words to /home/jpic/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jpic/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jpic/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain
from BRAD import llms

In [6]:
llm = llms.load_nvidia(nvidia_model='meta/llama3-70b-instruct')

Enter your NVIDIA API key:  ········


In [None]:
from BRAD import brad
brad.chat()



Welcome to RAG! The chat log from this conversation will be saved to ~/BRAD/2024-06-06-23:15:43.863730.json. How can I help?


Enter your NVIDIA API key:  ········



Would you like to use a database with BRAD [Y/N]?


 N


Thu Jun  6 23:15:54 2024 INFO local




Input >>  /force GGET look up the following genes on gene ontology: PCNA, CDT1, GEM cMYC, MYOD, MYOG


Thu Jun  6 23:16:06 2024 INFO GGET


RAG >> 1:

> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:

    
    GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.
    
    ENRICHR: is a tool used to lookup sets of genes and their functional association. ENRICHR has access to many gene-set libraries including Allen_Brain_Atlas_up, ENCODE_Histone_Modifications_2015, Enrichr_Libraries_Most_Popular_Genes, FANTOM6_lncRNA_KD_DEGs, GO_Biological_Process_2023, GTEx, Human_Gene_Atlas, KEGG, REACTOME, Transcription_Factor_PPIs, WikiPathways and man

 Y


Directory '/home/jpic/RAG-DEV/tutorials/online-bioinformatics-databases/go_charts' created successfully

 would you like to download the paper associated with these genes [Y/N]?


 Y


{'numberOfHits': 1, 'results': [{'id': 'GO:0043626', 'isObsolete': False, 'name': 'PCNA complex', 'definition': {'text': 'A protein complex composed of three identical PCNA monomers, each comprising two similar domains, which are joined in a head-to-tail arrangement to form a homotrimer. Forms a ring-like structure in solution, with a central hole sufficiently large to accommodate the double helix of DNA. Originally characterized as a DNA sliding clamp for replicative DNA polymerases and as an essential component of the replisome, and has also been shown to be involved in other processes including Okazaki fragment processing, DNA repair, translesion DNA synthesis, DNA methylation, chromatin remodeling and cell cycle regulation.', 'xrefs': [{'dbCode': 'PMID', 'dbId': '12829735'}]}, 'synonyms': [{'name': 'proliferating cell nuclear antigen complex', 'type': 'exact'}, {'name': 'sliding clamp', 'type': 'broad'}, {'name': 'PCNA homotrimer', 'type': 'exact'}], 'aspect': 'cellular_component',

 Y


{'numberOfHits': 1, 'results': [{'id': 'GO:0090618', 'isObsolete': False, 'name': 'DNA clamp unloading', 'definition': {'text': 'The process of removing the PCNA complex from DNA when Okazaki fragments are completed or the replication fork terminates.', 'xrefs': [{'dbCode': 'PMID', 'dbId': '23499004'}]}, 'synonyms': [{'name': 'PCNA unloading', 'type': 'related'}], 'children': [{'id': 'GO:0061860', 'relation': 'part_of'}], 'aspect': 'biological_process', 'usage': 'Unrestricted'}], 'pageInfo': None}
['23499004']
Directory '/home/jpic/RAG-DEV/tutorials/online-bioinformatics-databases/specialized_docs' created successfully
23499004
https://doi.org/10.1016/j.molcel.2013.02.012
Not public
Directory '/home/jpic/RAG-DEV/tutorials/online-bioinformatics-databases/go_charts' created successfully

 would you like to download the paper associated with these genes [Y/N]?


 Y


{'numberOfHits': 1, 'results': [{'id': 'GO:0070557', 'isObsolete': False, 'name': 'PCNA-p21 complex', 'definition': {'text': 'A protein complex that contains the cyclin-dependent protein kinase inhibitor p21WAF1/CIP1 bound to PCNA; formation of the complex inhibits DNA replication.', 'xrefs': [{'dbCode': 'PMID', 'dbId': '7911228'}, {'dbCode': 'PMID', 'dbId': '7915843'}]}, 'aspect': 'cellular_component', 'usage': 'Unrestricted'}], 'pageInfo': None}
['7911228', '7915843']
Directory '/home/jpic/RAG-DEV/tutorials/online-bioinformatics-databases/specialized_docs' created successfully
7911228
https://doi.org/10.1038/369574a0
Not public
7915843
http://www.ncbi.nlm.nih.gov/pmc/articles/pmc44665/
https://ncbi.nlm.nih.gov/pmc/articles/PMC44665/pdf/pnas01140-0357.pdf
Directory '/home/jpic/RAG-DEV/tutorials/online-bioinformatics-databases/go_charts' created successfully

 would you like to download the paper associated with these genes [Y/N]?


 Y


{'numberOfHits': 1, 'results': [{'id': 'GO:0003689', 'isObsolete': False, 'name': 'DNA clamp loader activity', 'definition': {'text': 'Facilitating the opening of the ring structure of the PCNA complex, or any of the related sliding clamp complexes, and their closing around the DNA duplex, driven by ATP hydrolysis.', 'xrefs': [{'dbCode': 'PMID', 'dbId': '16082778'}]}, 'synonyms': [{'name': 'DNA-protein loading ATPase activity', 'type': 'related'}, {'name': 'protein-DNA loading ATPase activity', 'type': 'related'}, {'name': 'PCNA loading complex activity', 'type': 'narrow'}, {'name': 'PCNA loading activity', 'type': 'narrow'}, {'name': 'DNA clamp loading ATPase activity', 'type': 'exact'}], 'aspect': 'molecular_function', 'usage': 'Unrestricted'}], 'pageInfo': None}
['16082778']
Directory '/home/jpic/RAG-DEV/tutorials/online-bioinformatics-databases/specialized_docs' created successfully
16082778
http://www.ncbi.nlm.nih.gov/pmc/articles/pmc1160110/
https://ncbi.nlm.nih.gov/pmc/articles/

 Y


{'numberOfHits': 1, 'results': [{'id': 'GO:0003892', 'isObsolete': True, 'name': 'obsolete proliferating cell nuclear antigen', 'definition': {'text': 'OBSOLETE. A nuclear protein that associates as a trimer and then interacts with delta DNA polymerase and epsilon DNA polymerase, acting as an auxiliary factor for DNA replication and DNA repair.', 'xrefs': [{'dbCode': 'ISBN', 'dbId': '0123668387'}]}, 'comment': "This term was made obsolete because describing something as an 'antigen' means that an organism can produce antibodies to it, which says nothing about the gene product activity.", 'synonyms': [{'name': 'PCNA', 'type': 'exact'}, {'name': 'proliferating cell nuclear antigen', 'type': 'exact'}], 'aspect': 'molecular_function', 'usage': 'Unrestricted'}], 'pageInfo': None}
['0123668387']
Directory '/home/jpic/RAG-DEV/tutorials/online-bioinformatics-databases/specialized_docs' created successfully
0123668387
0123668387 could not be gathered.
No Gene Ontology Available - Searching Gene

 Y


PCNA
Directory '/home/jpic/RAG-DEV/tutorials/online-bioinformatics-databases/go_annotations' created successfully
Error occurred while searching database: 400 Client Error:  for url: https://www.ebi.ac.uk/QuickGO/services/annotation/downloadSearch?geneProductId=PCNA
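
One plausible cause of the 400 error above is that the QuickGO annotation services expect gene product identifiers in a `DB:ID` form (for example a UniProtKB accession) rather than a bare gene symbol such as `PCNA`. The sketch below only builds such a URL and sends no request. The helper name `build_annotation_url` is hypothetical, and the accession `P12004` is assumed here to be the UniProtKB entry for human PCNA.

```python
from urllib.parse import urlencode

# Endpoint taken from the error message above.
QUICKGO_DOWNLOAD_URL = (
    "https://www.ebi.ac.uk/QuickGO/services/annotation/downloadSearch"
)


def build_annotation_url(accession, database="UniProtKB"):
    """Build a QuickGO annotation URL with a DB:ID gene product identifier.

    Hypothetical helper for illustration; it does not contact the service.
    """
    params = {"geneProductId": f"{database}:{accession}"}
    return f"{QUICKGO_DOWNLOAD_URL}?{urlencode(params)}"


# Assumption: P12004 is the UniProtKB accession for human PCNA.
print(build_annotation_url("P12004"))
```

Mapping gene symbols to accessions before querying is a separate step that this sketch does not cover.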


In [None]:
from BRAD import brad
brad.chat()



Welcome to RAG! The chat log from this conversation will be saved to ~/BRAD/2024-06-06-22:53:48.784530.json. How can I help?


Enter your NVIDIA API key:  ········



Would you like to use a database with BRAD [Y/N]?


 N


Thu Jun  6 22:53:55 2024 INFO local




Input >>  /force GGET look up the following genes on gene ontology: PCNA, CDT1, GEM cMYC, MYOD, MYOG


Thu Jun  6 22:53:59 2024 INFO GGET


RAG >> 1:

> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:

    
    GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.
    
    ENRICHR: is a tool used to lookup sets of genes and their functional association. ENRICHR has access to many gene-set libraries including Allen_Brain_Atlas_up, ENCODE_Histone_Modifications_2015, Enrichr_Libraries_Most_Popular_Genes, FANTOM6_lncRNA_KD_DEGs, GO_Biological_Process_2023, GTEx, Human_Gene_Atlas, KEGG, REACTOME, Transcription_Factor_PPIs, WikiPathways and man

Input >>  /force GGET look up the following genes on enrichr: PCNA, CDT1, GEM cMYC, MYOD, MYOG


Thu Jun  6 22:54:14 2024 INFO GGET


RAG >> 2:

> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:
Human: /force GGET look up the following genes on gene ontology: PCNA, CDT1, GEM cMYC, MYOD, MYOG
BRAD:  database: GENEONTOLOGY
genes: "PCNA, CDT1, GEM cMYC, MYOD, MYOG"
code: None
    
    GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.
    
    ENRICHR: is a tool used to lookup sets of genes and their functional association. ENRICHR has access to many gene-set libraries including Allen_Brain_Atlas_up, ENCODE_Histone_Modifications_201

Thu Jun  6 22:54:15 2024 INFO Performing Enichr analysis using database GO_Biological_Process_2021.


The following table was generated by quering the gene list against GO_Biological_Process_2021:


Unnamed: 0,rank,path_name,p_val,z_score,combined_score,overlapping_genes,adj_p_val,database
0,1,positive regulation of DNA replication (GO:0045740),7e-06,888.0,10567.294169,"['CDT1', 'PCNA']",0.00072,GO_Biological_Process_2021
1,2,regulation of transcription involved in G1/S transition of mitotic cell cycle (GO:0000083),2.5e-05,443.666667,4706.270584,"['CDT1', 'PCNA']",0.00131,GO_Biological_Process_2021
2,3,regulation of DNA replication (GO:0006275),6.9e-05,260.705882,2499.635584,"['CDT1', 'PCNA']",0.002422,GO_Biological_Process_2021


The table has been saved to: RAG-gget-GO_Biological_Process_2021-2024-06-0622:54:15.939575.csv
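
The saved table can be post-processed with ordinary CSV tooling. As a minimal sketch, the code below parses the three rows shown above and keeps the pathways whose adjusted p-value falls under a cutoff. The function name `significant_pathways` and the cutoff value are illustrative assumptions, not part of BRAD.

```python
import ast
import csv
import io

# Rows copied from the GO_Biological_Process_2021 table above.
TABLE = """Unnamed: 0,rank,path_name,p_val,z_score,combined_score,overlapping_genes,adj_p_val,database
0,1,positive regulation of DNA replication (GO:0045740),7e-06,888.0,10567.294169,"['CDT1', 'PCNA']",0.00072,GO_Biological_Process_2021
1,2,regulation of transcription involved in G1/S transition of mitotic cell cycle (GO:0000083),2.5e-05,443.666667,4706.270584,"['CDT1', 'PCNA']",0.00131,GO_Biological_Process_2021
2,3,regulation of DNA replication (GO:0006275),6.9e-05,260.705882,2499.635584,"['CDT1', 'PCNA']",0.002422,GO_Biological_Process_2021
"""


def significant_pathways(csv_text, alpha=0.05):
    """Return (path_name, adj_p_val, genes) for rows with adj_p_val < alpha."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        adj_p = float(row["adj_p_val"])
        if adj_p < alpha:
            # The gene column is stored as the text of a Python list.
            genes = ast.literal_eval(row["overlapping_genes"])
            hits.append((row["path_name"], adj_p, genes))
    return hits


# Illustrative cutoff; choose a threshold that suits your analysis.
for name, adj_p, genes in significant_pathways(TABLE, alpha=0.002):
    print(f"{adj_p:.5f}  {name}  {genes}")
```

With the cutoff used here, the first two pathways are kept and the third (adjusted p-value 0.002422) is dropped.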


Input >>  /force GGET look up the following genes on BioCarta_2015: PCNA, CDT1, GEM cMYC, MYOD, MYOG


Thu Jun  6 23:09:12 2024 INFO GGET


RAG >> 3:

> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:
Human: /force GGET look up the following genes on gene ontology: PCNA, CDT1, GEM cMYC, MYOD, MYOG
BRAD:  database: GENEONTOLOGY
genes: "PCNA, CDT1, GEM cMYC, MYOD, MYOG"
code: None
Human: /force GGET look up the following genes on enrichr: PCNA, CDT1, GEM cMYC, MYOD, MYOG
BRAD:  database: ENRICHR
genes: "PCNA, CDT1, GEM cMYC, MYOD, MYOG"
code: None
    
    GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.
    
    ENRICHR: is a tool use

Thu Jun  6 23:09:13 2024 INFO Performing Enichr analysis using database BioCarta_2015.



> Finished chain.
 database: ENRICHR
genes: "PCNA, CDT1, GEM cMYC, MYOD, MYOG"
code: None
table: None
mode name: None
mode: None
 database: ENRICHR
genes: PCNA, CDT1, GEM cMYC, MYOD, MYOG
code: None
table: None
mode name: None
mode: None
None
{'database': 'ENRICHR', 'genes': ['PCNA', ' CDT1', ' GEM cMYC', ' MYOD', ' MYOG'], 'code': 'None'}
['PCNA', ' CDT1', ' GEM cMYC', ' MYOD', ' MYOG']
The following table was generated by quering the gene list against BioCarta_2015:


Unnamed: 0,rank,path_name,p_val,z_score,combined_score,overlapping_genes,adj_p_val,database
0,1,cdk regulation of dna replication,0.004492,293.794118,1588.072397,['CDT1'],0.013477,BioCarta_2015
1,2,il-2 receptor beta chain in t cell activation,0.012191,103.890625,457.850064,['PCNA'],0.017133,BioCarta_2015
2,3,p53 signaling pathway,0.017133,73.261029,297.934433,['PCNA'],0.017133,BioCarta_2015


The table has been saved to: RAG-gget-BioCarta_2015-2024-06-0623:09:14.010768.csv


Input >>  /force GGET look up the following genes on gene ontology: PCNA, CDT1, GEM cMYC, MYOD, MYOG


Thu Jun  6 23:13:16 2024 INFO GGET


RAG >> 4:

> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:
Human: /force GGET look up the following genes on gene ontology: PCNA, CDT1, GEM cMYC, MYOD, MYOG
BRAD:  database: GENEONTOLOGY
genes: "PCNA, CDT1, GEM cMYC, MYOD, MYOG"
code: None
Human: /force GGET look up the following genes on enrichr: PCNA, CDT1, GEM cMYC, MYOD, MYOG
BRAD:  database: ENRICHR
genes: "PCNA, CDT1, GEM cMYC, MYOD, MYOG"
code: None
Human: /force GGET look up the following genes on BioCarta_2015: PCNA, CDT1, GEM cMYC, MYOD, MYOG
BRAD:  database: ENRICHR
genes: "PCNA, CDT1, GEM cMYC, MYOD, MYOG"
code: None
table: None
mode name: None
mode: None
    
    GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and dissemina

In [None]:
/force GGET look up the following genes on gene ontology: PCNA, CDT1, GEM cMYC, MYOD, MYOG

In [116]:
def parse_llm_response(response):
    """
    Parses the LLM response to extract the database name and search terms.
    
    Parameters:
    response (str): The response from the LLM.
    
    Returns:
    dict: A dictionary with the database name and a list of search terms.
    """
    # Initialize an empty dictionary to hold the parsed data
    parsed_data = {}

    # Split the response into lines
    lines = response.strip().split('\n')

    # Extract the database name
    database_line = lines[0].replace("Database:", "").strip()
    parsed_data["database"] = database_line

    genes_line = lines[1].replace("Genes:", "").strip()
    parsed_data["genes"] = genes_line.split(',')

    code_line = lines[2].replace("Code:", "").strip()
    parsed_data["code"] = code_line.split(',')

    table_line = lines[3].replace("Table:", "").strip()
    parsed_data["table"] = table_line.split(',')

    mode_name_line = lines[4].replace("Mode Name:", "").strip()
    parsed_data["mode_name"] = mode_name_line.split(',')

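    mode_line = lines[5].replace("Mode:", "").strip()
    # Use mode_line here; the outputs recorded in this notebook came from a run
    # that reused mode_name_line, which is why 'mode' repeats the mode name there.
    parsed_data["mode"] = mode_line.split(',')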

    return parsed_data
    
def getTablesFormatting(tables):
    tablesString = ""
    for tab in tables:
        columns_list = list(tables[tab].columns)
        truncated_columns = columns_list[:10]  # Get the first 10 entries
        if len(columns_list) > 10:
            truncated_columns.append("...")  # Add '...' if the list is longer than 10
        tablesString += tab + '.columns = ' + str(truncated_columns) + '\n'
    return tablesString
    
memory = ConversationBufferMemory(ai_prefix="BRAD")

# Define the mapping of keywords to functions
database_functions = {
    'ENRICHR'   : queryEnrichr,
    'GENEONTOLOGY' : geneOntology,
}

# Identify the database and the search terms
template = """Current conversation:\n{{history}}

GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.

ENRICHR: is a tool used to lookup sets of genes and their functional association. ENRICHR has access to many gene-set libraries including Allen_Brain_Atlas_up, ENCODE_Histone_Modifications_2015, Enrichr_Libraries_Most_Popular_Genes, FANTOM6_lncRNA_KD_DEGs, GO_Biological_Process_2023, GTEx, Human_Gene_Atlas, KEGG, REACTOME, Transcription_Factor_PPIs, WikiPathways and many others databases.

Available Tables:
{tables}

Query:{{input}}

**INSTRUCTIONS**
1. From the query, decide if GENEONTOLOGY or ENRICHR should be used
2. Identify genes in the user's input that should be searched. Propose genes with similar names, correct the user's spelling, or make small modifications to the list, but do not propose genes that are not found in the human's input.
3. If the user wants to extract genes from a table (python dataframe), provide the necessary code to get the genes into a python list, otherwise, say None.
4. If there is code required, identify the name of the table, otherwise say None
5. If there is code required, identify the row or column name of the table, otherwise say None
6. If there is code required, specify if it is a row or column
Format your output as follows with no additional information:

Database: <ENRICHR or GENEONTOLOGY>
Genes: <List of genes separated by commas in query or None if code is required>
Code: <True or None>
Table: <Table Name>
Mode Name: <Row or Column Name>
Mode: <Row or Column>
"""
tablesInfo = getTablesFormatting(tables)
filled_template = template.format(tables=tablesInfo) #, history=None, input=None)
PROMPT = PromptTemplate(input_variables=["history", "input"], template=filled_template)

conversation = ConversationChain(prompt  = PROMPT,
                                 llm     = llm,
                                 verbose = True,
                                 memory  = memory,
                                )

In [117]:
query = 'look up the heart column genes in KEGG'
response = conversation.predict(input=query)
parse_llm_response(response)



> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:


GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.

ENRICHR: is a tool used to lookup sets of genes and their functional association. ENRICHR has access to many gene-set libraries including Allen_Brain_Atlas_up, ENCODE_Histone_Modifications_2015, Enrichr_Libraries_Most_Popular_Genes, FANTOM6_lncRNA_KD_DEGs, GO_Biological_Process_2023, GTEx, Human_Gene_Atlas, KEGG, REACTOME, Transcription_Factor_PPIs, WikiPathways and many others databases.

Avail

{'database': 'ENRICHR',
 'genes': ['None'],
 'code': ['True'],
 'table': ['single cells'],
 'mode_name': ['Heart'],
 'mode': ['Heart']}
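
`parse_llm_response` above reads fields by line position, so a reply with fewer than six lines (such as the three-line replies recorded in the chat sessions earlier) would raise an `IndexError`, and gene names keep their leading spaces. As a sketch of one alternative, the hypothetical function `parse_key_value_response` below looks fields up by key, tolerates missing lines, and strips whitespace. It is an illustration under these assumptions, not BRAD's implementation.

```python
EXPECTED_KEYS = ("database", "genes", "code", "table", "mode name", "mode")


def parse_key_value_response(response, expected_keys=EXPECTED_KEYS):
    """Parse 'Key: value' lines into a dict; missing keys map to None."""
    parsed = {key.replace(" ", "_"): None for key in expected_keys}
    for line in response.strip().splitlines():
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        if not sep or key not in expected_keys:
            continue  # Skip lines that are not recognized fields.
        value = value.strip().strip('"')
        name = key.replace(" ", "_")
        if value == "None":
            parsed[name] = None
        elif name == "genes":
            # Split on commas and trim each gene name.
            parsed[name] = [g.strip() for g in value.split(",") if g.strip()]
        else:
            parsed[name] = value
    return parsed


reply = ' database: ENRICHR\ngenes: "PCNA, CDT1, MYOD, MYOG"\ncode: None'
print(parse_key_value_response(reply))
```

Matching on lowercase keys lets the same parser accept both the `Database:` and `database:` spellings that appear in this notebook.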


In [105]:
def getTablesFormatting(tables):
    tablesString = ""
    for tab in tables:
        columns_list = list(tables[tab].columns)
        truncated_columns = columns_list[:10]  # Get the first 10 entries
        if len(columns_list) > 10:
            truncated_columns.append("...")  # Add '...' if the list is longer than 10
        tablesString += tab + '.columns = ' + str(truncated_columns) + '\n'
    return tablesString
    
memory = ConversationBufferMemory(ai_prefix="BRAD")

# Define the mapping of keywords to functions
database_functions = {
    'ENRICHR'   : queryEnrichr,
    'GENEONTOLOGY' : geneOntology,
}

# Identify the database and the search terms
template = """Current conversation:\n{{history}}

GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.

ENRICHR: is a tool used to lookup sets of genes and their functional association. ENRICHR has access to many gene-set libraries including Allen_Brain_Atlas_up, ENCODE_Histone_Modifications_2015, Enrichr_Libraries_Most_Popular_Genes, FANTOM6_lncRNA_KD_DEGs, GO_Biological_Process_2023, GTEx, Human_Gene_Atlas, KEGG, REACTOME, Transcription_Factor_PPIs, WikiPathways and many others databases.

Available Tables:
{tables}

Query:{{input}}

From the query, decide if GENEONTOLOGY or ENRICHR should be searched. Also, identify the gene names from the user's query that should be searched. Feel free to propose genes with similar names, correct the user's spelling, or make small modifications to the list, but do not propose genes that are wildly different from those included in the user's query. Also, if the user wants to extract genes from a table, provide the necessary code to get the genes, otherwise, say None. Format your output as follows with no additions:

Database: <ENRICHR or GENEONTOLOGY>
Search Terms: <improved search terms>
Code: <provide code to extract genes from dataframes such as: list(tableName[columnName].values) when the genes are in a column or list(tableName.loc[rowName].values) when the genes are in a row>
"""
tablesInfo = getTablesFormatting(tables)
filled_template = template.format(tables=tablesInfo) #, history=None, input=None)
PROMPT = PromptTemplate(input_variables=["history", "input"], template=filled_template)

conversation = ConversationChain(prompt  = PROMPT,
                                 llm     = llm,
                                 verbose = True,
                                 memory  = memory,
                                )

In [106]:
query = 'look up the heart column genes in KEGG'
response = conversation.predict(input=query)
print(response)



> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:


GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.

ENRICHR: is a tool used to lookup sets of genes and their functional association. ENRICHR has access to many gene-set libraries including Allen_Brain_Atlas_up, ENCODE_Histone_Modifications_2015, Enrichr_Libraries_Most_Popular_Genes, FANTOM6_lncRNA_KD_DEGs, GO_Biological_Process_2023, GTEx, Human_Gene_Atlas, KEGG, REACTOME, Transcription_Factor_PPIs, WikiPathways and many others databases.

Avail

In [104]:
list(tables['single cells']['Heart'].values)

KeyError: 'Heart'
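
The `KeyError` above arises because the model selected the `single cells` table, whereas the table listing printed by `getTablesFormatting` in this notebook shows `Heart` as a column of `organs`. A minimal sketch of a guard is to look up which tables actually contain a column before indexing. The helper name `find_tables_with_column` is hypothetical, and plain lists stand in for the pandas DataFrames.

```python
# Column names copied from the getTablesFormatting output in this notebook;
# plain lists stand in for the DataFrames.
TABLE_COLUMNS = {
    "organs": ["Liver", "Kidney", "Heart"],
    "single cells": ["Cell_1", "Cell_2", "Cell_3"],
    "barcodess": ["Barcode_1", "Barcode_2", "Barcode_3"],
}


def find_tables_with_column(table_columns, column_name):
    """Return (table, column) pairs whose column matches, ignoring case."""
    wanted = column_name.casefold()
    return [
        (table, column)
        for table, columns in table_columns.items()
        for column in columns
        if column.casefold() == wanted
    ]


print(find_tables_with_column(TABLE_COLUMNS, "heart"))  # [('organs', 'Heart')]
```

An empty result signals that the requested column does not exist in any table, which can be reported to the user instead of raising an exception.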

In [68]:
x = "foo {test} bar {{other}}".format(test="test") # other won't be filled in here
print(x)                              # prints "foo test bar {other}"
print(x.format(other="whatever"))     # prints "foo test bar whatever"


foo test bar {other}
foo test bar whatever


In [84]:
def getTablesFormatting(tables):
    tablesString = ""
    for tab in tables.keys():
        tablesString += tab + '.columns = ' + str(list(tables[tab].columns)[:10]) + '\n'
    return tablesString

tbs = getTablesFormatting(tables)
print(tbs)

organs.columns = ['Liver', 'Kidney', 'Heart']
single cells.columns = ['Cell_1', 'Cell_2', 'Cell_3']
barcodess.columns = ['Barcode_1', 'Barcode_2', 'Barcode_3']



In [74]:
list(df.columns)

['A', 'B', 'C']

In [87]:
from langchain.chains.summarize import load_summarize_chain
conversation = ConversationChain(prompt  = PROMPT,
                                 llm     = llm,
                                 verbose = True,
                                 memory  = memory,
                                )

In [47]:
chain = PROMPT | llm

In [49]:
response = chain.predict(input=query)

AttributeError: 'RunnableSequence' object has no attribute 'predict'

In [22]:
query = 'look up PCNA, CDT1, GEM, cMYA, and CDT1 with gene ontology'
response = conversation.predict(input=query)
print(response)



> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:


GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.

ENRICHR: is a tool used to lookup sets of genes and their functional association. ENRICHR has access to many gene-set libraries including Allen_Brain_Atlas_up, ENCODE_Histone_Modifications_2015, Enrichr_Libraries_Most_Popular_Genes, FANTOM6_lncRNA_KD_DEGs, GO_Biological_Process_2023, GTEx, Human_Gene_Atlas, KEGG, REACTOME, Transcription_Factor_PPIs, WikiPathways and many others databases.

Query

In [29]:
query = 'check out all of the genes in scRNAseq under gene-names in SILAC_Phosphoproteomics'
response = conversation.predict(input=query)
print(response)



> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:


GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.

ENRICHR: is a tool used to lookup sets of genes and their functional association. ENRICHR has access to many gene-set libraries including Allen_Brain_Atlas_up, ENCODE_Histone_Modifications_2015, Enrichr_Libraries_Most_Popular_Genes, FANTOM6_lncRNA_KD_DEGs, GO_Biological_Process_2023, GTEx, Human_Gene_Atlas, KEGG, REACTOME, Transcription_Factor_PPIs, WikiPathways and many others databases.

Query

In [30]:
query = 'check out all of the genes sc1'
response = conversation.predict(input=query)
print(response)



> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:
Human: check out all of the genes in scRNAseq under gene-names in SILAC_Phosphoproteomics
BRAD: Database: ENRICHR
Search Terms: SILAC_Phosphoproteomics
Code: scRNAseq_gene_names = scRNAseq.values.flatten().tolist()

GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.

ENRICHR: is a tool used to lookup sets of genes and their functional association. ENRICHR has access to many gene-set libraries including Allen_Brain_Atlas_up, ENCODE_Histone_Modificat
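The replies recorded in the conversation history above follow a three-field layout (`Database:`, `Search Terms:`, `Code:`). A minimal sketch of splitting such a reply into its fields, assuming each field sits on its own line; `parse_brad_response` is a hypothetical helper written for illustration, not part of BRAD or LangChain:

```python
def parse_brad_response(text):
    """Split a 'Database / Search Terms / Code' reply into a dict.

    Hypothetical helper for illustration; assumes one field per line.
    """
    expected = {"Database", "Search Terms", "Code"}
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if sep and key.strip() in expected:
            fields[key.strip()] = value.strip()
    return fields


reply = """Database: ENRICHR
Search Terms: SILAC_Phosphoproteomics
Code: scRNAseq_gene_names = scRNAseq.values.flatten().tolist()"""

parsed = parse_brad_response(reply)
print(parsed["Database"])
print(parsed["Code"])
```

Lines that do not start with one of the three expected field names are ignored, so extra prose in a reply would not break the parse.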

In [31]:
query = 'find the definitions of all the genes sc1'
response = conversation.predict(input=query)
print(response)



> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:
Human: check out all of the genes in scRNAseq under gene-names in SILAC_Phosphoproteomics
BRAD: Database: ENRICHR
Search Terms: SILAC_Phosphoproteomics
Code: scRNAseq_gene_names = scRNAseq.values.flatten().tolist()
Human: check out all of the genes sc1
BRAD: Database: ENRICHR
Search Terms: sc1
Code: sc1_gene_names = scRNAseq['sc1'].values.flatten().tolist()

GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.

ENRICHR: is a tool used to lookup sets 
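The two `Code:` lines in the history above differ only in whether a column is selected before flattening. A small sketch of what each expression returns, using a stand-in `scRNAseq` DataFrame; the real table is not shown in this tutorial, so the column `sc2` and the gene symbols here are assumptions:

```python
import pandas as pd

# Stand-in for the scRNAseq table; contents are made up for illustration.
# dtype=object is used so that .values is a plain NumPy array.
scRNAseq = pd.DataFrame(
    {"sc1": ["TP53", "MYC"], "sc2": ["EGFR", "KRAS"]},
    dtype=object,
)

# Every cell of the table, read row by row.
scRNAseq_gene_names = scRNAseq.values.flatten().tolist()

# Only the genes in column 'sc1'.
sc1_gene_names = scRNAseq["sc1"].values.flatten().tolist()

print(scRNAseq_gene_names)
print(sc1_gene_names)
```

Flattening the whole frame reads the values row by row, so genes from different columns end up interleaved in the resulting list.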

In [35]:
query = 'find the ontology definitions of all the genes sc1'
response = conversation.predict(input=query)
print(response)



> Entering new ConversationChain chain...
Prompt after formatting:
Current conversation:


GENEONTOLOGY: The Gene Ontology (GO) is an initiative to unify the representation of gene and gene product attributes across all species via the aims: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO.

ENRICHR: is a tool used to lookup sets of genes and their functional association. ENRICHR has access to many gene-set libraries including Allen_Brain_Atlas_up, ENCODE_Histone_Modifications_2015, Enrichr_Libraries_Most_Popular_Genes, FANTOM6_lncRNA_KD_DEGs, GO_Biological_Process_2023, GTEx, Human_Gene_Atlas, KEGG, REACTOME, Transcription_Factor_PPIs, WikiPathways and many others databases.

Query

NameError: name 'XX' is not defined

In [40]:
import pandas as pd

# Sample DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

# Access row with label 1
row = df.loc[1].values
print(row)


[2 5 8]
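`df.loc[1]` selects by index label, which matches the row position here only because the sample frame uses the default integer index. A brief sketch of the difference, assuming a frame with string labels like the gene tables built later in this notebook:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
    index=["x", "y", "z"],  # string labels instead of 0, 1, 2
)

by_label = df.loc["y"].values    # row whose label is 'y'
by_position = df.iloc[1].values  # second row, whatever its label

print(by_label)
print(by_position)
```

With string labels, `df.loc[1]` would raise a `KeyError`, so `iloc` is the safer choice when a row is wanted by position.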


In [41]:
df

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


In [78]:
import pandas as pd

# Example 1: Single cell gene expression
data_single_cell = {
    "Cell_1": ["GeneA", "GeneB", "GeneC"],
    "Cell_2": ["GeneD", "GeneE", "GeneF"],
    "Cell_3": ["GeneG", "GeneH", "GeneI"]
}
df_single_cell = pd.DataFrame(data_single_cell, index=["Gene_1", "Gene_2", "Gene_3"])
print("Single Cell Gene Expression DataFrame:")
print(df_single_cell)
print()

# Example 2: Gene expression by barcode
data_barcode = {
    "Barcode_1": ["GeneJ", "GeneK", "GeneL", "GeneM"],
    "Barcode_2": ["GeneN", "GeneO", "GeneP", "GeneQ"],
    "Barcode_3": ["GeneR", "GeneS", "GeneT", "GeneU"]
}
df_barcode = pd.DataFrame(data_barcode, index=["Sample_1", "Sample_2", "Sample_3", "Sample_4"])
print("Gene Expression by Barcode DataFrame:")
print(df_barcode)
print()

# Example 3: Gene expression in different organs
data_organs = {
    "Liver": ["GeneV", "GeneW", "GeneX"],
    "Kidney": ["GeneY", "GeneZ", "GeneA1"],
    "Heart": ["GeneB1", "GeneC1", "GeneD1"]
}
df_organs = pd.DataFrame(data_organs, index=["Mouse_1", "Mouse_2", "Mouse_3"])
print("Gene Expression in Different Organs DataFrame:")
print(df_organs)

tables = {
    'organs':df_organs,
    'single cells':df_single_cell,
    'barcodes':df_barcode
}

Single Cell Gene Expression DataFrame:
       Cell_1 Cell_2 Cell_3
Gene_1  GeneA  GeneD  GeneG
Gene_2  GeneB  GeneE  GeneH
Gene_3  GeneC  GeneF  GeneI

Gene Expression by Barcode DataFrame:
         Barcode_1 Barcode_2 Barcode_3
Sample_1     GeneJ     GeneN     GeneR
Sample_2     GeneK     GeneO     GeneS
Sample_3     GeneL     GeneP     GeneT
Sample_4     GeneM     GeneQ     GeneU

Gene Expression in Different Organs DataFrame:
         Liver  Kidney   Heart
Mouse_1  GeneV   GeneY  GeneB1
Mouse_2  GeneW   GeneZ  GeneC1
Mouse_3  GeneX  GeneA1  GeneD1
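Each table above can be reduced to a flat gene list in the same way as the `values.flatten().tolist()` expressions in BRAD's generated code earlier in this section. A minimal sketch, rebuilding one of the tables so the block runs on its own; the names `gene_lists` and `liver_genes` are assumptions for illustration:

```python
import pandas as pd

df_organs = pd.DataFrame(
    {
        "Liver": ["GeneV", "GeneW", "GeneX"],
        "Kidney": ["GeneY", "GeneZ", "GeneA1"],
        "Heart": ["GeneB1", "GeneC1", "GeneD1"],
    },
    index=["Mouse_1", "Mouse_2", "Mouse_3"],
)
tables = {"organs": df_organs}

# Map each table name to a flat list of the gene names it holds.
gene_lists = {
    name: df.to_numpy().flatten().tolist() for name, df in tables.items()
}

# One column at a time, e.g. the genes listed under 'Liver'.
liver_genes = tables["organs"]["Liver"].tolist()

print(gene_lists["organs"])
print(liver_genes)
```

Selecting a column first keeps the list specific to one organ, whereas flattening the whole table mixes genes from all three.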
