# Retrieval Augmented Generation (RAG) with BRAD

Given a collection of documents (PDFs), the RAG pipeline first builds a database by splitting the text into chunks of a fixed size (`chunk_size`). Consecutive chunks can optionally overlap (`chunk_overlap`). Within BRAD these values default to 700 and 200, respectively, but they can be changed for different applications. Each chunk is then vectorized with an embedding model from HuggingFace (note: this step may take a while). Given a query, the RAG embeds the query in the same embedding space, finds the top k chunks (preset to 4) with the highest cosine similarity to the query, and uses these chunks as the basis for the response.
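
The chunk-embed-retrieve steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not BRAD's implementation: the function names (`chunk_text`, `embed`, `cosine`, `top_k`) are hypothetical, and a bag-of-words count vector stands in for the HuggingFace embedding model.

```python
import math
from collections import Counter


def chunk_text(text, chunk_size=700, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, chunks, k=4):
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = ["MYOD drives muscle differentiation",
          "the weather is nice",
          "muscle cells"]
print(top_k("muscle differentiation", chunks, k=2))
```

With the defaults, each new chunk starts 500 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next.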

# Literature Databases

## Building a Database

In [1]:
import subprocess
import os
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
import chromadb
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [3]:
from BRAD import rag

## Fixed Size Chunking

In [17]:
docsPath='papers/'
dbName='database'
dbPath='databases/'
HuggingFaceEmbeddingsModel = 'BAAI/bge-base-en-v1.5'
chunk_size=[700]
chunk_overlap=[200]
v=False

In [8]:
local = os.getcwd()  ## Get local dir
os.chdir(local)      ## shift the work dir to local dir

print('\nWork Directory: {}'.format(local)) if v else None

#%% Phase 1 - Load DB
embeddings_model = HuggingFaceEmbeddings(model_name=HuggingFaceEmbeddingsModel)

print('\nDocuments loading from:', docsPath) if v else None

In [None]:
text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader(docsPath,
                         glob="**/*.pdf",
                         loader_cls=PyPDFLoader, 
                         show_progress=True,
                         use_multithreading=True)
docs_data = loader.load()

In [20]:
print('\nDocuments loaded...') if v else None

for i in range(len(chunk_size)):
    for j in range(len(chunk_overlap)):
        text_splitter = RecursiveCharacterTextSplitter(chunk_size = chunk_size[i],
                                                        chunk_overlap = chunk_overlap[j],
                                                        separators=[" ", ",", "\n", ". "])
        data_splits = text_splitter.split_documents(docs_data)
        
        print('Documents split into chunks...') if v else None
        print('Initializing Chroma Database...') if v else None

        dbName = "DB_cosine_cSize_%d_cOver_%d" %(chunk_size[i], chunk_overlap[j])

        os.makedirs(dbPath + dbName, exist_ok=True)  # create the database directory if it does not exist
        _client_settings = chromadb.PersistentClient(path=(dbPath+dbName))

        vectordb = Chroma.from_documents(documents           = data_splits,
                                         embedding           = embeddings_model,
                                         client              = _client_settings,
                                         collection_name     = dbName,
                                         collection_metadata = {"hnsw:space": "cosine"})

        print('Completed Chroma Database: ', dbName) if v else None
        del vectordb, text_splitter, data_splits

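
The loop above builds one database per (`chunk_size`, `chunk_overlap`) pair. As a rough sketch of just the bookkeeping, the snippet below enumerates the pairs and creates one directory per database with `os.makedirs`. The helper name `make_db_dirs` is hypothetical, and skipping pairs whose overlap is not smaller than the chunk size is an assumption added for illustration.

```python
import itertools
import os
import tempfile


def make_db_dirs(db_root, chunk_sizes, chunk_overlaps):
    """Create one directory per (chunk_size, chunk_overlap) pair and return {name: path}."""
    paths = {}
    for size, overlap in itertools.product(chunk_sizes, chunk_overlaps):
        if overlap >= size:
            continue  # assumption: such a pair would never advance through the text
        name = "DB_cosine_cSize_%d_cOver_%d" % (size, overlap)
        path = os.path.join(db_root, name)
        os.makedirs(path, exist_ok=True)  # creates parents too; no error if present
        paths[name] = path
    return paths


# Use a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as root:
    created = make_db_dirs(root, chunk_sizes=[700, 1000], chunk_overlaps=[0, 200])
    for name in sorted(created):
        print(name)
```

Each returned path could then be passed to `chromadb.PersistentClient(path=...)` as in the cell above.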


## Contextual Chunking

In [4]:
from semantic_router.splitters import RollingWindowSplitter
from semantic_router.utils.logger import logger

In [8]:
HuggingFaceEmbeddingsModel = 'BAAI/bge-base-en-v1.5'
encoder = HuggingFaceEmbeddings(model_name=HuggingFaceEmbeddingsModel)



In [17]:
import os
from getpass import getpass
from semantic_router.encoders import OpenAIEncoder

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass(
    "OpenAI API key: "
)
# encoder = OpenAIEncoder(openai_base_url='https://integrate.api.nvidia.com/v1')
encoder = OpenAIEncoder(name="text-embedding-3-small", openai_base_url='https://integrate.api.nvidia.com/v1')

In [18]:
splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=True,
    min_split_tokens=100,
    max_split_tokens=1000,
    window_size=2,
    plot_splits=True,  # set this to true to visualize chunking
    enable_statistics=True  # to print chunking stats
)

https://github.com/aurelio-labs/semantic-chunkers
  splitter = RollingWindowSplitter(
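
To give some intuition for contextual (semantic) chunking, the sketch below starts a new chunk wherever the similarity between consecutive sentences drops below a threshold. It is a simplified illustration and not the algorithm used by `RollingWindowSplitter`: the helper names are hypothetical, the threshold is fixed rather than dynamic, there is no rolling window or token limit, and a bag-of-words vector stands in for the encoder.

```python
import math
from collections import Counter


def _bow(text):
    """Toy encoder: bag-of-words count vector."""
    return Counter(text.lower().split())


def _cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_split(sentences, threshold=0.2):
    """Group consecutive sentences, cutting where adjacent similarity < threshold."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if _cosine(_bow(prev), _bow(cur)) < threshold:
            groups.append([cur])    # topic shift: start a new chunk
        else:
            groups[-1].append(cur)  # same topic: extend the current chunk
    return [" ".join(g) for g in groups]


sentences = [
    "MyoD regulates muscle genes",
    "MyoD binds muscle promoters",
    "Chroma stores embedding vectors",
    "Chroma indexes embedding vectors",
]
for chunk in semantic_split(sentences):
    print(chunk)
```

Unlike fixed-size chunking, the chunk boundaries here follow the content, so chunk lengths vary with the text.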


In [19]:
from datasets import load_dataset

dataset = load_dataset("jamescalam/ai-arxiv2", split="train")
dataset

Dataset({
    features: ['id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'content', 'references'],
    num_rows: 2673
})

In [20]:
logger.setLevel("WARNING")  # reduce logs from splitter
splits = splitter([dataset["content"][0]])

2024-06-17 09:44:09 ERROR semantic_router.utils.logger Error encoding documents ['4 2 0 2', 'n a J 8 ] G L . s c [', '1 v 8 8 0 4 0 . 1 0 4 2 : v i X r a', '# Mixtral of Experts', 'Albert Q.', 'Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed', 'Abstract', 'We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model.', 'Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts).', 'For every token, at each layer, a router network selects two experts to process the current state and combine their outputs

ValueError: No embeddings returned. Error: 404 page not found

In [10]:
help(RollingWindowSplitter)

Help on class RollingWindowSplitter in module semantic_router.splitters.rolling_window:

class RollingWindowSplitter(semantic_router.splitters.base.BaseSplitter)
 |  RollingWindowSplitter(encoder: semantic_router.encoders.base.BaseEncoder, name='rolling_window_splitter', threshold_adjustment=0.01, dynamic_threshold: bool = True, window_size=5, min_split_tokens=100, max_split_tokens=300, split_tokens_tolerance=10, plot_splits=False, enable_statistics=False) -> None
 |  
 |  Method resolution order:
 |      RollingWindowSplitter
 |      semantic_router.splitters.base.BaseSplitter
 |      pydantic.v1.main.BaseModel
 |      pydantic.v1.utils.Representation
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __call__(self, docs: List[str]) -> List[semantic_router.schema.DocumentSplit]
 |      Split documents into smaller chunks based on semantic similarity.
 |      
 |      :param docs: list of text documents to be split, if only wanted to
 |          split a single document, pa

In [1]:
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma

# Embedding model used to vectorize each chunk (same model as above).
embeddings_model = HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5')

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = UnstructuredPDFLoader('papers/Pore-C.pdf').load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = Chroma.from_documents(documents, embeddings_model)


In [5]:
import subprocess
import os
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
import chromadb

In [6]:
docsPath='papers/'
dbName='tutorialDatabase'
dbPath='databases/'
HuggingFaceEmbeddingsModel = 'BAAI/bge-base-en-v1.5'
chunk_size=[700]
chunk_overlap=[200]
v=True

In [7]:
dbPath   += dbName

local = os.getcwd()  ## Get local dir
os.chdir(local)      ## shift the work dir to local dir

print('\nWork Directory: {}'.format(local)) if v else None

#%% Phase 1 - Load DB
embeddings_model = HuggingFaceEmbeddings(model_name=HuggingFaceEmbeddingsModel)

print('\nDocuments loading from:', docsPath) if v else None

text_loader_kwargs={'autodetect_encoding': True}
loader = DirectoryLoader(docsPath,
                         glob="**/*.pdf",
                         loader_cls=UnstructuredPDFLoader, 
                         #loader_kwargs=text_loader_kwargs,
                         show_progress=True,
                         )
# docs_data = loader.load()


Work Directory: /home/jpic/RAG-DEV/tutorials/RAG-with-BRAD





Documents loading from: papers/


In [None]:
docs_data = loader.load()

In [7]:
import pdfminer
help(pdfminer)
from pdfminer import psparser

Help on package pdfminer:

NAME
    pdfminer

PACKAGE CONTENTS
    _saslprep
    arcfour
    ascii85
    ccitt
    cmapdb
    converter
    data_structures
    encodingdb
    fontmetrics
    glyphlist
    high_level
    image
    jbig2
    latin_enc
    layout
    lzw
    pdfcolor
    pdfdevice
    pdfdocument
    pdffont
    pdfinterp
    pdfpage
    pdfparser
    pdftypes
    psparser
    runlength
    settings
    utils

FILE
    (built-in)




## Wikipedia Retrieval

In [34]:
from langchain_community.retrievers import WikipediaRetriever
retriever = WikipediaRetriever(top_k_results=10)

In [28]:
help(retriever)

Help on WikipediaRetriever in module langchain_community.retrievers.wikipedia object:

class WikipediaRetriever(langchain_core.retrievers.BaseRetriever, langchain_community.utilities.wikipedia.WikipediaAPIWrapper)
 |  WikipediaRetriever(*, wiki_client: Any = None, top_k_results: int = 3, lang: str = 'en', load_all_available_meta: bool = False, doc_content_chars_max: int = 4000, name: Optional[str] = None, tags: Optional[List[str]] = None, metadata: Optional[Dict[str, Any]] = None) -> None
 |  
 |  `Wikipedia API` retriever.
 |  
 |  It wraps load() to get_relevant_documents().
 |  It uses all WikipediaAPIWrapper arguments without any change.
 |  
 |  Method resolution order:
 |      WikipediaRetriever
 |      langchain_core.retrievers.BaseRetriever
 |      langchain_core.runnables.base.RunnableSerializable
 |      langchain_core.load.serializable.Serializable
 |      langchain_community.utilities.wikipedia.WikipediaAPIWrapper
 |      pydantic.v1.main.BaseModel
 |      pydantic.v1.utils

In [35]:
docs = retriever.invoke('kronecker product')

In [36]:
len(docs)

10

## Building an Online Database (Tutorial)

In [10]:
import bs4
from langchain import hub
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter


In [19]:
# Load, chunk and index the contents of the blog.
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()


In [21]:
from langchain_openai import OpenAIEmbeddings  # requires the langchain-openai package and OPENAI_API_KEY

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

# Retrieve and generate using the relevant snippets of the blog.
retriever = vectorstore.as_retriever()


In [20]:
docs

[Document(page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final re

In [2]:
import poppler #-utils

In [3]:
import pdfinfo

ModuleNotFoundError: No module named 'pdfinfo'

In [1]:
from pdf2image import convert_from_path
images = convert_from_path('papers/Pore-C.pdf')


PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

In [20]:
!apt-get update

/bin/bash: apt-get: command not found


In [3]:
rag.create_database(docsPath='papers/',
                    dbName='tutorialDatabase',
                    dbPath='databases/',
                    HuggingFaceEmbeddingsModel = 'BAAI/bge-base-en-v1.5',
                    chunk_size=[700],
                    chunck_overlap=[200],
                    v=True)


Work Directory: /home/jpic/RAG-DEV/tutorials/RAG-with-BRAD





Documents loading from: papers/


  0%|          | 0/1 [00:00<?, ?it/s]Error loading file papers/Pore-C.pdf
100%|██████████| 1/1 [00:00<00:00,  1.11it/s]

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

In [None]:
from BRAD import rag

In [1]:
import subprocess
import os
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
import chromadb

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def create_database(docsFile=None, docsPath='/nfs/turbo/umms-indikar/shared/projects/RAG/papers/', dbName=None, dbPath='/nfs/turbo/umms-indikar/shared/projects/RAG/databases/', HuggingFaceEmbeddingsModel = 'BAAI/bge-base-en-v1.5', chunk_size=[700], chunck_overlap=[200], v=False):
    # Handle arguments: extend the paths only when the optional names are given
    if docsFile is not None:
        docsPath = os.path.join(docsPath, docsFile)
    if dbName is not None:
        dbPath = os.path.join(dbPath, dbName)
    
    local = os.getcwd()  ## Get local dir
    os.chdir(local)      ## shift the work dir to local dir
    
    print('\nWork Directory: {}'.format(local)) if v else None

    #%% Phase 1 - Load DB
    embeddings_model = HuggingFaceEmbeddings(model_name=HuggingFaceEmbeddingsModel)
    
    print('\nDocuments loading from:', docsPath) if v else None

    text_loader_kwargs={'autodetect_encoding': True}
    loader = DirectoryLoader(docsPath,
                             glob="**/*.pdf",
                             loader_cls=UnstructuredPDFLoader, 
                             loader_kwargs=text_loader_kwargs,
                             show_progress=True,
                             use_multithreading=True)
    docs_data = loader.load()

    print('\nDocuments loaded...') if v else None

    # Build one database for every (chunk size, chunk overlap) combination
    for i in range(len(chunk_size)):
        for j in range(len(chunck_overlap)):
            text_splitter = RecursiveCharacterTextSplitter(chunk_size = chunk_size[i],
                                                            chunk_overlap = chunck_overlap[j],
                                                            separators=[" ", ",", "\n", ". "])
            data_splits = text_splitter.split_documents(docs_data)
            
            print('Documents split into chunks...') if v else None
            print('Initializing Chroma Database...') if v else None

            dbName = "DB_cosine_cSize_%d_cOver_%d" %(chunk_size[i], chunck_overlap[j])

            # Create the database directory if it does not exist yet
            dbDir = os.path.join(dbPath, dbName)
            os.makedirs(dbDir, exist_ok=True)
            _client_settings = chromadb.PersistentClient(path=dbDir)

            vectordb = Chroma.from_documents(documents           = data_splits,
                                             embedding           = embeddings_model,
                                             client              = _client_settings,
                                             collection_name     = dbName,
                                             collection_metadata = {"hnsw:space": "cosine"})

            print('Completed Chroma Database: ', dbName) if v else None
            del vectordb, text_splitter, data_splits

# Connecting Literature Databases to BRAD

In [21]:
from BRAD import llms
llm = llms.load_nvidia()

Enter your NVIDIA API key:  ········




## Specifying the Database

When running `brad.chat()`, there is an option to use a previously saved database. When prompted, type Y to supplement your query with the database.

In [22]:
import subprocess
import os
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
import chromadb
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain import PromptTemplate, LLMChain
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain

In [23]:
# Load the database
persist_directory = '/nfs/turbo/umms-indikar/shared/projects/RAG/databases/DigitalLibrary-10-June-2024/'
embeddings_model = HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5')
db_name = "DigitalLibrary"
_client_settings = chromadb.PersistentClient(path=(persist_directory + db_name))
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings_model, client=_client_settings, collection_name=db_name)



## Using the RAG system

## Viewing the Documents from BRAD

# MultiQuery RAG

In [5]:
from BRAD import brad
brad.chat(ragvectordb=vectordb)

Welcome to RAG! The chat log from this conversation will be saved to /home/jpic/BRAD/2024-06-16-23:53:20.661368.json. How can I help?


Sun 16 Jun 2024 11:53:20 PM EDT INFO local




Input >>  /force RAG what cellular processes is the MYOD gene involved in?


Sun 16 Jun 2024 11:53:42 PM EDT INFO RAG


RAG >> 1: 

Sun 16 Jun 2024 11:53:43 PM EDT INFO Generated queries: ['1. Which cellular functions does the MYOD gene influence or regulate?', '2. Can you identify the specific cellular pathways where the MYOD gene plays a role?', '3. What are the major cellular activities associated with the expression of the MYOD gene?']


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
expressed, how does Myod regulate skeletal muscle cell
differentiation? In one sense, the answer seems fairly simple:Myod is a transcription factor with binding sites in theregulatory regions of many genes that are expressed in skeletalmuscle. Myod forms heterodimers with the nearly ubiquitousE-protein sub-family of bHLH proteins through the interactionof the HLH domains (see Fig. 1) (Lassar et al., 1991; Murre etal., 1989). The basic regions act as sequence-speciﬁc DNA-binding domains that recognize a binding site with the simplecore consensus sequence of CANNTG, termed an E-box, andshow additional preferences for internal and ﬂankingsequences (Blackwell and Weintraub, 1

Input >>  q


Thanks for chatting today! I hope to talk soon, and don't forget that a record of this conversation is available at: /home/jpic/BRAD/2024-06-16-23:53:20.661368.json
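
The log above shows the original question being rewritten into three alternative queries before retrieval. A minimal sketch of how the results of several query variants can be merged is given below. It is an illustration under stated assumptions rather than BRAD's or LangChain's code: `multi_query_retrieve`, `keyword_retrieve`, and the tiny `corpus` are hypothetical, and keyword overlap stands in for the vector search.

```python
def multi_query_retrieve(queries, retrieve, k=4):
    """Run every query variant and return the de-duplicated union of the hits.

    `retrieve(query, k)` must return a list of documents; documents are kept
    in first-seen order.
    """
    seen = set()
    merged = []
    for q in queries:
        for doc in retrieve(q, k):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Toy corpus and retriever (assumptions for illustration only).
corpus = [
    "myod regulates muscle differentiation",
    "myod activates myogenin",
    "chroma stores vectors",
]


def keyword_retrieve(query, k):
    """Rank documents by the number of words they share with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(doc.split())), doc) for doc in corpus]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [doc for _, doc in ranked[:k]]


variants = ["myod muscle", "myogenin activation myod"]
for doc in multi_query_retrieve(variants, keyword_retrieve):
    print(doc)
```

Each variant can surface chunks the others miss, and the union is what gets passed to the language model as context.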


# Contextual Compression

In [81]:
chatstatus = {
    'config' : {
        'debug':True
    }
}
query = 'What cellular processes is MYOD involved in?'
documentSearch = vectordb.similarity_search_with_relevance_scores(query=query, k=10)

In [82]:
documentSearch[0][1]

0.7678906917572021

In [77]:
def summarizeDocumentTemplate():
    template = """**INSTRUCTIONS**
You are an assistant responsible for compressing the important information in a document.
You will be given a user's query and a piece of text. Summarize the text with the following aims:
1. remove information that is incomplete or unrelated to the user's query
2. improve the clarity of the writing and information
If there is no relevant information, say "None"

**USER QUERY**
{user_query}

**TEXT**
{text}

**OUTPUT**
<put summary output here>
"""
    return template

In [83]:
def contextualCompression(documentSearch, chatstatus):
    """
    Summarizes the content of documents based on a user query, updating the 
    document search results with these summaries.

    Note: the user query (`query`) and the language model (`llm`) are read
    from the notebook's global scope.

    Args:
        documentSearch (list): A list of (document, score) tuples, where the 
                               first element of each tuple has an attribute 
                               `page_content` containing the text content of the document.
        chatstatus (dict): BRAD chatstatus used to track debugging

    Returns:
        list: The modified `documentSearch` list with updated `page_content` for each 
              document, replaced by their summaries.

    Example:
        documentSearch = [(Document(page_content="..."), 0.77), ...]
        chatstatus = {'config': {'debug': True}}
        updatedDocs = contextualCompression(documentSearch, chatstatus)
    """
    template = summarizeDocumentTemplate()
    # The template has two placeholders, so both are declared as input variables
    PROMPT = PromptTemplate(input_variables=["user_query", "text"], template=template)
    for doc, _score in documentSearch:
        pageContent = doc.page_content
        prompt = PROMPT.format(text=pageContent, user_query=query)
        res = llm.invoke(input=prompt)
        summary = res.content.strip()
        if chatstatus['config']['debug']:
            print('============')
            print(pageContent)
            print('Summary: ' + summary)
        # Overwrite the chunk text in place with its query-focused summary
        doc.page_content = summary
    return documentSearch

contextualCompression(documentSearch, chatstatus)

a combination of promoter-speciﬁc regulation of Myod binding and activity.
Because Myod initiates the myogenic differentiation
program and that program temporally regulates the activity of
Myod, it follows that Myod programs the regulation of its ownactivity. It does this, at least in part, through a feed-forward
Development
Summary: MYOD is involved in initiating the myogenic differentiation program, which temporally regulates its own activity through a feed-forward mechanism.
transcription and Myod protein
activity (Kopan et al., 1994; Nofziger et al., 1999), andprobably contributes to regulating differentiation in vivo. It isinteresting that while we have identiﬁed several mechanismsthat might delay myoblast differentiation, such as mitogens andNotch signaling, we do not yet have a good understanding ofthe events that occur in vivo to overcome these inhibitorysignals and to induce differentiation at a speciﬁc time andplace.
A feed-forward circuit as a quantal step
How does a single 

[(Document(page_content='MYOD is involved in initiating the myogenic differentiation program, which temporally regulates its own activity through a feed-forward mechanism.', metadata={'page': 4, 'source': '/nfs/turbo/umms-indikar/shared/projects/RAG/papers/DigitalLibrary-9-June-2024/The circuitry of a master switch.pdf'}),
  0.7678906917572021),
 (Document(page_content='MYOD is a transcription factor involved in muscle differentiation as indicated by its role in gene transcription and regulation of myoblast differentiation both in vitro and in vivo (Kopan et al., 1994; Nofziger et al., 1999). Despite known inhibitors of differentiation like mitogens and Notch signaling, the precise mechanisms allowing differentiation to occur at specific times and places remain unclear. As for the execution of an entire program of cell differentation by a single transcription factor, research has shown that expression levels of many RNAs change during skeletal muscle differentiation in cultured C2C12 c
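
The compression step can also be written without mutating the search results. The sketch below is a hypothetical variant, not BRAD's implementation: `Doc`, `compress`, and `stub_summarize` are names introduced here for illustration, and the stub stands in for the LLM call. It unpacks each `(document, score)` tuple, replaces the text with its summary, and drops chunks whose summary is "None".

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Doc:
    """Minimal stand-in for a retrieved document."""
    page_content: str


def compress(scored_docs, query, summarize):
    """Return new (doc, score) pairs holding query-focused summaries.

    Chunks whose summary is "None" (no relevant content) are dropped;
    the input list and its documents are left untouched.
    """
    compressed = []
    for doc, score in scored_docs:
        summary = summarize(query, doc.page_content).strip()
        if summary.lower() == "none":
            continue
        compressed.append((replace(doc, page_content=summary), score))
    return compressed


def stub_summarize(query, text):
    """Hypothetical stand-in for an LLM: keep the first sentence mentioning the query."""
    for sentence in text.split(". "):
        if query.lower() in sentence.lower():
            return sentence
    return "None"


results = [
    (Doc("Myod initiates differentiation. Unrelated footer"), 0.77),
    (Doc("page 4"), 0.50),
]
for doc, score in compress(results, "myod", stub_summarize):
    print(score, doc.page_content)
```

Returning a new list keeps the original chunks available, so the cell can be re-run without summarizing text that has already been summarized.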

In [80]:
documentSearch

[((Document(page_content='MYOD is involved in initiating the myogenic differentiation program, which temporally regulates its own activity through a feed-forward mechanism.', metadata={'page': 4, 'source': '/nfs/turbo/umms-indikar/shared/projects/RAG/papers/DigitalLibrary-9-June-2024/The circuitry of a master switch.pdf'}),
   0.7678906917572021),
  0.7678906917572021),
 ((Document(page_content='MYOD is a transcription factor involved in muscle differentiation as indicated by its role in gene transcription and regulation of myoblast differentiation both in vitro and in vivo (Kopan et al., 1994; Nofziger et al., 1999). Despite known inhibitors of differentiation like mitogens and Notch signaling, the precise mechanisms allowing differentiation to occur at specific times and places remain unclear. As for the execution of an entire program of cell differentation by a single transcription factor, research has shown that expression levels of many RNAs change during skeletal muscle different

In [72]:
documentSearch

[(Document(page_content='a combination of promoter-speciﬁc regulation of Myod binding and activity.\nBecause Myod initiates the myogenic differentiation\nprogram and that program temporally regulates the activity of\nMyod, it follows that Myod programs the regulation of its ownactivity. It does this, at least in part, through a feed-forward\nDevelopment', metadata={'page': 4, 'source': '/nfs/turbo/umms-indikar/shared/projects/RAG/papers/DigitalLibrary-9-June-2024/The circuitry of a master switch.pdf'}),
  0.7678906917572021),
 (Document(page_content='transcription and Myod protein\nactivity (Kopan et al., 1994; Nofziger et al., 1999), andprobably contributes to regulating differentiation in vivo. It isinteresting that while we have identiﬁed several mechanismsthat might delay myoblast differentiation, such as mitogens andNotch signaling, we do not yet have a good understanding ofthe events that occur in vivo to overcome these inhibitorysignals and to induce differentiation at a speciﬁc

In [59]:
reducedText

['Myod is involved in initiating the myogenic differentiation program, which regulates its own activity.',
 'MYOD is involved in the process of cell differentiation in muscular tissues, specifically during skeletal muscle differentiation. It does so by regulating gene expression, as evidenced by changes in expression levels of many RNAs observed in microarray studies.',
 'MYOD is involved in the regulation of myogenin expression during muscle cell differentiation.',
 'MYOD is involved in the process of muscle conversion in cells, as expressed from a constitutive promoter, it can transform different cell types into muscle. However, homOzygous gene-targeted mutants of MYOD or Myf-5 produce normal amounts of muscle in mice. The recent studies resolved this paradox by showing that both MyoD and Myf-5 are required in the double homOzygous mutants for proper muscle development.',
 'Myod is a transcription factor involved in skeletal muscle cell differentiation. It forms heterodimers with E-p

In [51]:
res.content.strip()

'MyoD is involved in defining the myoblast state, positioning cells in muscle-forming regions, and receiving inhibitory signals from the environment. It primarily stabilizes the determined state via autoactivation. Myogenin, which is activated by MyoD, is used for actual activation of most muscle structural genes. MRM, which shares features with myogenin, may have a partially overlapping function with myogenin. The distinctions between their functions can blur under certain conditions.'

In [46]:
prompt = PROMPT.format(text='YY', user_query='XX')
llm.invoke(input=prompt)

ChatMessage(content=" I'm just a computer program, so I don't have the ability to feel emotions like a human does. I'm here to help answer any questions you have to the best of my ability. Is there a specific topic you'd like to know more about?", response_metadata={'role': 'assistant', 'content': " I'm just a computer program, so I don't have the ability to feel emotions like a human does. I'm here to help answer any questions you have to the best of my ability. Is there a specific topic you'd like to know more about?", 'token_usage': {'prompt_tokens': 14, 'total_tokens': 70, 'completion_tokens': 56}, 'model_name': 'mistralai/mistral-7b-instruct-v0.2'}, id='run-41ae1e56-16c0-42e8-a37e-e87909df9eb5-0', role='assistant')

In [44]:
prompt = PROMPT.format(text='YY', user_query='XX')
print(prompt)

**INSTRUCTIONS**
You are an assistant responsible for compressing the important information in a document.
You will be given a users query and a piece of text. Summarize the text to contain only the information
relevant to answering the users question. If no information is in the text is related, return None in the
summary section.

**USER QUERY**
XX

**TEXT**
YY

**OUTPUT**
Summary: <output here>



In [32]:
doc.page_content

AttributeError: 'tuple' object has no attribute 'page_content'

In [34]:
llm

ChatNVIDIA(model='mistralai/mistral-7b-instruct-v0.2')

In [8]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

def pretty_print_docs(docs):
    # Print each document's text, separated by a horizontal rule
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n" + d.page_content for i, d in enumerate(docs)]))

retriever = vectordb.as_retriever()
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents(query='What cellular processes is MYOD involved in?')

In [10]:
pretty_print_docs(compressed_docs)

Document 1:
Myod initiates the myogenic differentiation program and is involved in regulating its own activity through a feed-forward mechanism. (Context: a combination of promoter-specific regulation of Myod binding and activity. Because Myod initiates the myogenic differentiation program and that program temporally regulates the activity of Myod, it follows that Myod programs the regulation of its own activity. It does this, at least in part, through a feed-forward mechanism.)
----------------------------------------------------------------------------------------------------
Document 2:
transcription and Myod protein activity (Kopan et al., 1994; Nofziger et al., 1999)
Myod expression and activity leads to changes in gene expression during skeletal muscle differentiation (Delgado et al., 2003; Tomczak et al., 2004)
----------------------------------------------------------------------------------------------------
Document 3:
MyoD is involved in the regulation of myogenin as well as

In [11]:
compressed_docs

[Document(page_content='Myod initiates the myogenic differentiation program and is involved in regulating its own activity through a feed-forward mechanism. (Context: a combination of promoter-specific regulation of Myod binding and activity. Because Myod initiates the myogenic differentiation program and that program temporally regulates the activity of Myod, it follows that Myod programs the regulation of its own activity. It does this, at least in part, through a feed-forward mechanism.)', metadata={'page': 4, 'source': '/nfs/turbo/umms-indikar/shared/projects/RAG/papers/DigitalLibrary-9-June-2024/The circuitry of a master switch.pdf'}),
 Document(page_content='transcription and Myod protein activity (Kopan et al., 1994; Nofziger et al., 1999)\nMyod expression and activity leads to changes in gene expression during skeletal muscle differentiation (Delgado et al., 2003; Tomczak et al., 2004)', metadata={'page': 4, 'source': '/nfs/turbo/umms-indikar/shared/projects/RAG/papers/DigitalL

In [9]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n" + d.page_content for i, d in enumerate(docs)]))

In [4]:
import logging
from langchain.retrievers.multi_query import MultiQueryRetriever

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [11]:
retriever = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)

In [20]:
# MultiQueryRetriever does not expose relevance scores; use the standard retrieval call
docs = retriever.get_relevant_documents(query='What genes are related to MYOD?')

In [16]:
docs

[Document(page_content='There is a positive correlation between regulated MyoD\nbinding (i.e., sites that are preferentially bound in myoblasts or\ndifferentiated myotubes) and gene expression. This suggests\nthat sites regulating gene transcription in myotubes require addi-tional factors to modulate MyoD binding. Indeed, the sites asso-\nciated with myotube-expressed genes are enriched for motifs of\nfactors that are activated by MyoD and function with MyoD ina positive feed-forward circuit, as demonstrated previouslywith Mef2 ( Penn et al., 2004 ) and Myog ( Cao et al., 2006 ). In addi-\ntion, the Pbx/Meis complex cooperates with MyoD in activatinga subset of genes ( Berkes et al., 2004; Maves et al., 2007 ).\nAnother ﬁnding was that genes decreasing expression with\ndifferentiation were associated with decreased MyoD binding\nat sites enriched for RP58 and AP1 motifs. A recent study (pub-lished after our analysis was complete) identiﬁed RP58 as a gene\nactivated by MyoD during muscl

In [3]:
help(MultiQueryRetriever.from_llm)

Help on method from_llm in module langchain.retrievers.multi_query:

from_llm(retriever: langchain_core.retrievers.BaseRetriever, llm: langchain_core.language_models.base.BaseLanguageModel, prompt: langchain_core.prompts.prompt.PromptTemplate = PromptTemplate(input_variables=['question'], template='You are an AI language model assistant. Your task is \n    to generate 3 different versions of the given user \n    question to retrieve relevant documents from a vector  database. \n    By generating multiple perspectives on the user question, \n    your goal is to help the user overcome some of the limitations \n    of distance-based similarity search. Provide these alternative \n    questions separated by newlines. Original question: {question}'), parser_key: Optional[str] = None, include_original: bool = False) -> 'MultiQueryRetriever' method of pydantic.v1.main.ModelMetaclass instance
    Initialize from llm using default template.
    
    Args:
        retriever: retriever to query do

# Building a L