# Retrieval Augmented Generation (RAG) with BRAD

Given a collection of documents (PDFs), the RAG pipeline first builds a database by splitting the text into chunks of roughly equal size (`chunk_size`). Consecutive chunks can optionally overlap (`chunk_overlap`). In BRAD these values default to 700 and 200, respectively, and can be changed for different applications. Each chunk is then vectorized with an embedding model from HuggingFace (note: this step may take a while). Given a query, BRAD embeds it in the same embedding space, finds the top k chunks (preset to 4) with the highest cosine similarity to the query, and uses these chunks as the basis for its response.
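
The chunking step described above can be sketched in plain Python. This is a simplified illustration using fixed character windows, not BRAD's implementation: `split_into_chunks` is a hypothetical helper, and the `RecursiveCharacterTextSplitter` imported later in this notebook also tries to break at natural separators, so its chunks are not exactly equal in length.

```python
def split_into_chunks(text, chunk_size=700, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap (illustrative helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: 1600 characters of repeating digits.
text = "".join(str(i % 10) for i in range(1600))
chunks = split_into_chunks(text)
print([len(c) for c in chunks])  # [700, 700, 600]
```

With these defaults each window advances by 500 characters, so the last 200 characters of one chunk are repeated at the start of the next.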

# Literature Databases

## Building a Database

In [1]:
import subprocess
import os
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
import chromadb
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [3]:
from BRAD import rag

In [7]:
rag.create_database(docsPath='papers/',
                    dbName='tutorialDatabase',
                    dbPath='databases/',
                    HuggingFaceEmbeddingsModel = 'BAAI/bge-base-en-v1.5',
                    chunk_size=[700],
                    chunck_overlap=[200],
                    v=True)


Work Directory: /home/jpic/RAG-DEV/tutorials/RAG-with-BRAD


ImportError: Could not import sentence_transformers python package. Please install it with `pip install sentence-transformers`.

# Connecting Literature Databases to BRAD

In [21]:
from BRAD import llms
llm = llms.load_nvidia()

Enter your NVIDIA API key:  ········




## Specifying the Database

When running `brad.chat()`, there is an option to use a previously saved database. Type Y to supplement your query with the database.

In [8]:
import subprocess
import os
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
import chromadb
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain import PromptTemplate, LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

In [9]:
# Load the database
persist_directory = '/nfs/turbo/umms-indikar/shared/projects/RAG/databases/DigitalLibrary-10-June-2024/'
embeddings_model = HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5')
db_name = "DigitalLibrary"
_client_settings = chromadb.PersistentClient(path=(persist_directory + db_name))
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings_model, client=_client_settings, collection_name=db_name)

ImportError: Could not import sentence_transformers python package. Please install it with `pip install sentence-transformers`.

## Viewing the Documents from BRAD

In [None]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n" + d.page_content for i, d in enumerate(docs)]))

# MultiQuery RAG

In [5]:
from BRAD import brad
brad.chat(ragvectordb=vectordb)

Welcome to RAG! The chat log from this conversation will be saved to /home/jpic/BRAD/2024-06-16-23:53:20.661368.json. How can I help?


Sun 16 Jun 2024 11:53:20 PM EDT INFO local




Input >>  /force RAG what cellular processes is the MYOD gene involved in?


Sun 16 Jun 2024 11:53:42 PM EDT INFO RAG


RAG >> 1: 

Sun 16 Jun 2024 11:53:43 PM EDT INFO Generated queries: ['1. Which cellular functions does the MYOD gene influence or regulate?', '2. Can you identify the specific cellular pathways where the MYOD gene plays a role?', '3. What are the major cellular activities associated with the expression of the MYOD gene?']




> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
expressed, how does Myod regulate skeletal muscle cell
differentiation? In one sense, the answer seems fairly simple:Myod is a transcription factor with binding sites in theregulatory regions of many genes that are expressed in skeletalmuscle. Myod forms heterodimers with the nearly ubiquitousE-protein sub-family of bHLH proteins through the interactionof the HLH domains (see Fig. 1) (Lassar et al., 1991; Murre etal., 1989). The basic regions act as sequence-speciﬁc DNA-binding domains that recognize a binding site with the simplecore consensus sequence of CANNTG, termed an E-box, andshow additional preferences for internal and ﬂankingsequences (Blackwell and Weintraub, 1

Input >>  q


Thanks for chatting today! I hope to talk soon, and don't forget that a record of this conversation is available at: /home/jpic/BRAD/2024-06-16-23:53:20.661368.json
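
The "Generated queries" line in the log above shows the question being rephrased three ways before retrieval. A minimal sketch of the merging idea follows, assuming each rephrasing is retrieved separately and the unique union of the hits is kept. The toy corpus, `keyword_retrieve`, and `multi_query_retrieve` are hypothetical stand-ins for illustration; BRAD's actual retriever uses embeddings rather than keyword overlap.

```python
def multi_query_retrieve(queries, retrieve, k=4):
    """Run every rephrased query and return the unique union of the hits."""
    seen = set()
    merged = []
    for q in queries:
        for doc in retrieve(q, k):
            if doc not in seen:  # keep the first occurrence, preserve order
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical toy corpus (not taken from the BRAD database).
corpus = [
    "myod regulates skeletal muscle differentiation",
    "notch signaling delays myoblast differentiation",
    "myogenin activates muscle structural genes",
]


def keyword_retrieve(query, k):
    """Toy retriever: rank documents by the number of words shared with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(doc.split())), doc) for doc in corpus]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked if score > 0][:k]


queries = ["myod differentiation", "muscle genes"]
merged = multi_query_retrieve(queries, keyword_retrieve, k=2)
print(len(merged))  # 3
```

The first document is returned by both queries but appears only once in the merged list.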


# Contextual Compression

In [81]:
chatstatus = {
    'config' : {
        'debug':True
    }
}
query = 'What cellular processes is MYOD involved in?'
documentSearch = vectordb.similarity_search_with_relevance_scores(query=query, k=10)

In [82]:
documentSearch[0][1]

0.7678906917572021
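
The ranking behind a search like the one above can be sketched with plain cosine similarity over toy vectors. This is an illustration only: `cosine_similarity` and `top_k` are hypothetical helpers, the two-dimensional vectors are made up, and the score returned by the vector store may be a normalized relevance value rather than the raw cosine.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=4):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 2-D "embeddings" for four chunks and one query.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query_vec = [1.0, 0.0]
ranked = top_k(query_vec, chunk_vecs, k=2)
print([i for i, _ in ranked])  # [0, 2]
```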

In [77]:
def summarizeDocumentTemplate():
    template = """**INSTRUCTIONS**
You are an assistant responsible for compressing the important information in a document.
You will be given a users query and a piece of text. Summarize the text with the following aims:
1. remove information that is not complete ideas or unrelated to the topic of the user
2. improve the clarity of the writing and information
If there is no relevant information, say "None"

**USER QUERY**
{user_query}

**TEXT**
{text}

**OUTPUT**
<put summary output here>
"""
    return template
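
The template above tells the model to answer "None" when a chunk has nothing relevant to the query. One possible follow-up, which is not part of this notebook, is to drop such chunks before they reach the final prompt. The sketch below assumes plain `(text, score)` pairs and uses `fake_summarize` as a stand-in for the LLM call; both helper names are hypothetical.

```python
def drop_empty_summaries(scored_texts, summarize):
    """Summarize each chunk and keep only those with a non-"None" summary."""
    kept = []
    for text, score in scored_texts:
        summary = summarize(text).strip()
        # Tolerate quotes or a trailing period around the sentinel value.
        if summary.strip("\"'.").lower() == "none":
            continue
        kept.append((summary, score))
    return kept


def fake_summarize(text):
    """Stand-in for an LLM: echoes the text if it mentions Myod, else says None."""
    return text if "myod" in text.lower() else "None"


# Made-up chunks and scores, for illustration only.
scored_texts = [
    ("Myod initiates myogenic differentiation.", 0.77),
    ("Page header: Development", 0.41),
]
print(drop_empty_summaries(scored_texts, fake_summarize))
```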

In [83]:
def contextualCompression(documentSearch, chatstatus):
    """
    Summarizes the content of documents based on a user query, updating the 
    document search results with these summaries.

    Args:
        documentSearch (list): A list of documents where each document is a tuple, 
                               and the first element of the tuple has an attribute 
                               `page_content` containing the text content of the document.
        chatstatus (dict): BRAD chatstatus used to track debugging

    Returns:
        list: The modified `documentSearch` list with updated `page_content` for each 
              document, replaced by their summaries.

    Example:
        documentSearch = [(Document(page_content="..."),), ...]
        chatstatus = {'config': {'debug': True}}
        updatedDocs = contextualCompression(documentSearch, chatstatus)
    """
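    template = summarizeDocumentTemplate()
    # The template has two placeholders, so declare both input variables.
    PROMPT = PromptTemplate(input_variables=["user_query", "text"], template=template)
    reducedDocs = []
    for i, doc in enumerate(documentSearch):
        # Each search result is a (Document, relevance_score) tuple.
        pageContent = doc[0].page_content
        # Note: `query` and `llm` are notebook-level globals defined in earlier cells.
        prompt = PROMPT.format(text=pageContent, user_query=query)
        res = llm.invoke(input=prompt)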
        summary = res.content.strip()
        if chatstatus['config']['debug']:
            print('============')
            print(pageContent)
            print('Summary: ' + summary)
        doc[0].page_content = summary
        documentSearch[i] = doc
    return documentSearch

contextualCompression(documentSearch, chatstatus)

a combination of promoter-speciﬁc regulation of Myod binding and activity.
Because Myod initiates the myogenic differentiation
program and that program temporally regulates the activity of
Myod, it follows that Myod programs the regulation of its ownactivity. It does this, at least in part, through a feed-forward
Development
Summary: MYOD is involved in initiating the myogenic differentiation program, which temporally regulates its own activity through a feed-forward mechanism.
transcription and Myod protein
activity (Kopan et al., 1994; Nofziger et al., 1999), andprobably contributes to regulating differentiation in vivo. It isinteresting that while we have identiﬁed several mechanismsthat might delay myoblast differentiation, such as mitogens andNotch signaling, we do not yet have a good understanding ofthe events that occur in vivo to overcome these inhibitorysignals and to induce differentiation at a speciﬁc time andplace.
A feed-forward circuit as a quantal step
How does a single 

[(Document(page_content='MYOD is involved in initiating the myogenic differentiation program, which temporally regulates its own activity through a feed-forward mechanism.', metadata={'page': 4, 'source': '/nfs/turbo/umms-indikar/shared/projects/RAG/papers/DigitalLibrary-9-June-2024/The circuitry of a master switch.pdf'}),
  0.7678906917572021),
 (Document(page_content='MYOD is a transcription factor involved in muscle differentiation as indicated by its role in gene transcription and regulation of myoblast differentiation both in vitro and in vivo (Kopan et al., 1994; Nofziger et al., 1999). Despite known inhibitors of differentiation like mitogens and Notch signaling, the precise mechanisms allowing differentiation to occur at specific times and places remain unclear. As for the execution of an entire program of cell differentation by a single transcription factor, research has shown that expression levels of many RNAs change during skeletal muscle differentiation in cultured C2C12 c

In [80]:
documentSearch

[((Document(page_content='MYOD is involved in initiating the myogenic differentiation program, which temporally regulates its own activity through a feed-forward mechanism.', metadata={'page': 4, 'source': '/nfs/turbo/umms-indikar/shared/projects/RAG/papers/DigitalLibrary-9-June-2024/The circuitry of a master switch.pdf'}),
   0.7678906917572021),
  0.7678906917572021),
 ((Document(page_content='MYOD is a transcription factor involved in muscle differentiation as indicated by its role in gene transcription and regulation of myoblast differentiation both in vitro and in vivo (Kopan et al., 1994; Nofziger et al., 1999). Despite known inhibitors of differentiation like mitogens and Notch signaling, the precise mechanisms allowing differentiation to occur at specific times and places remain unclear. As for the execution of an entire program of cell differentation by a single transcription factor, research has shown that expression levels of many RNAs change during skeletal muscle different

In [72]:
documentSearch

[(Document(page_content='a combination of promoter-speciﬁc regulation of Myod binding and activity.\nBecause Myod initiates the myogenic differentiation\nprogram and that program temporally regulates the activity of\nMyod, it follows that Myod programs the regulation of its ownactivity. It does this, at least in part, through a feed-forward\nDevelopment', metadata={'page': 4, 'source': '/nfs/turbo/umms-indikar/shared/projects/RAG/papers/DigitalLibrary-9-June-2024/The circuitry of a master switch.pdf'}),
  0.7678906917572021),
 (Document(page_content='transcription and Myod protein\nactivity (Kopan et al., 1994; Nofziger et al., 1999), andprobably contributes to regulating differentiation in vivo. It isinteresting that while we have identiﬁed several mechanismsthat might delay myoblast differentiation, such as mitogens andNotch signaling, we do not yet have a good understanding ofthe events that occur in vivo to overcome these inhibitorysignals and to induce differentiation at a speciﬁc

In [59]:
reducedText

['Myod is involved in initiating the myogenic differentiation program, which regulates its own activity.',
 'MYOD is involved in the process of cell differentiation in muscular tissues, specifically during skeletal muscle differentiation. It does so by regulating gene expression, as evidenced by changes in expression levels of many RNAs observed in microarray studies.',
 'MYOD is involved in the regulation of myogenin expression during muscle cell differentiation.',
 'MYOD is involved in the process of muscle conversion in cells, as expressed from a constitutive promoter, it can transform different cell types into muscle. However, homOzygous gene-targeted mutants of MYOD or Myf-5 produce normal amounts of muscle in mice. The recent studies resolved this paradox by showing that both MyoD and Myf-5 are required in the double homOzygous mutants for proper muscle development.',
 'Myod is a transcription factor involved in skeletal muscle cell differentiation. It forms heterodimers with E-p

In [51]:
res.content.strip()

'MyoD is involved in defining the myoblast state, positioning cells in muscle-forming regions, and receiving inhibitory signals from the environment. It primarily stabilizes the determined state via autoactivation. Myogenin, which is activated by MyoD, is used for actual activation of most muscle structural genes. MRM, which shares features with myogenin, may have a partially overlapping function with myogenin. The distinctions between their functions can blur under certain conditions.'

In [46]:
prompt = PROMPT.format(text='YY', user_query='XX')
llm.invoke(input=prompt)

ChatMessage(content=" I'm just a computer program, so I don't have the ability to feel emotions like a human does. I'm here to help answer any questions you have to the best of my ability. Is there a specific topic you'd like to know more about?", response_metadata={'role': 'assistant', 'content': " I'm just a computer program, so I don't have the ability to feel emotions like a human does. I'm here to help answer any questions you have to the best of my ability. Is there a specific topic you'd like to know more about?", 'token_usage': {'prompt_tokens': 14, 'total_tokens': 70, 'completion_tokens': 56}, 'model_name': 'mistralai/mistral-7b-instruct-v0.2'}, id='run-41ae1e56-16c0-42e8-a37e-e87909df9eb5-0', role='assistant')

In [44]:
prompt = PROMPT.format(text='YY', user_query='XX')
print(prompt)

**INSTRUCTIONS**
You are an assistant responsible for compressing the important information in a document.
You will be given a users query and a piece of text. Summarize the text to contain only the information
relevant to answering the users question. If no information is in the text is related, return None in the
summary section.

**USER QUERY**
XX

**TEXT**
YY

**OUTPUT**
Summary: <output here>



In [32]:
doc.page_content

AttributeError: 'tuple' object has no attribute 'page_content'
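
The error above arises because each result of `similarity_search_with_relevance_scores` is a `(Document, score)` tuple, so `page_content` lives on the first element rather than on the tuple itself. A small sketch of the unpacking, using a stand-in `Document` dataclass and made-up scores so it runs without LangChain:

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for the LangChain Document class, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Made-up results in the same (Document, score) shape as the search output.
results = [
    (Document("Myod initiates the myogenic differentiation program."), 0.77),
    (Document("Notch signaling can delay myoblast differentiation."), 0.62),
]

# Unpack each tuple instead of calling .page_content on the tuple.
for doc, score in results:
    print(f"{score:.2f}  {doc.page_content}")

texts = [doc.page_content for doc, _score in results]
```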

In [34]:
llm

ChatNVIDIA(model='mistralai/mistral-7b-instruct-v0.2')

In [8]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
retriever = vectordb.as_retriever()
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents(query='What cellular processes is MYOD involved in?')
pretty_print_docs(compressed_docs)

NameError: name 'pretty_print_docs' is not defined

In [10]:
pretty_print_docs(compressed_docs)

Document 1:
Myod initiates the myogenic differentiation program and is involved in regulating its own activity through a feed-forward mechanism. (Context: a combination of promoter-specific regulation of Myod binding and activity. Because Myod initiates the myogenic differentiation program and that program temporally regulates the activity of Myod, it follows that Myod programs the regulation of its own activity. It does this, at least in part, through a feed-forward mechanism.)
----------------------------------------------------------------------------------------------------
Document 2:
transcription and Myod protein activity (Kopan et al., 1994; Nofziger et al., 1999)
Myod expression and activity leads to changes in gene expression during skeletal muscle differentiation (Delgado et al., 2003; Tomczak et al., 2004)
----------------------------------------------------------------------------------------------------
Document 3:
MyoD is involved in the regulation of myogenin as well as

In [9]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n" + d.page_content for i, d in enumerate(docs)]))