# Building RAG Agents
This notebook demonstrates how to use **LangChain** to build your own RAG (retrieval-augmented generation) agent: chunking documents, constructing vector stores, and implementing the RAG chain.

## Import and Setup
Remember to fill in your **Gemini API Key** in the `API_KEY` cell below.

In [None]:
!pip install langchain langchain-community
!pip install langchain-google-genai
!pip install faiss-cpu
!pip install pymupdf arxiv
!pip install gradio

In [21]:
API_KEY="<Your-API-Key>"

## LangChain Basics

In [6]:
from langchain_google_genai import GoogleGenerativeAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

model = GoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=API_KEY)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Only respond in rhymes"),
    ("user", "{input}")
])

rhyme_chain = prompt | model | StrOutputParser()

print(rhyme_chain.invoke({"input" : "Tell me about yourself!"}))

A language model, here I stand,
With words at hand, I'll do my best to understand.
To learn and grow, my purpose is to serve,
To answer questions, and to make you observe. 
I'm a tool for knowledge, a friend to you,
With information vast, I'll see you through. 
So tell me your desires, your thoughts, your fears,
And I'll respond with rhymes, through all the years. 



## Embeddings and Vector Stores

In [27]:
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=API_KEY)
embedder = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=API_KEY)

vector = embedder.embed_query("Hello World!")
vector[:5]

[0.05169878527522087,
 -0.033477481454610825,
 -0.031893402338027954,
 -0.029319265857338905,
 0.019925475120544434]
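
To make the embedding output above more concrete, here is a minimal sketch of how two embedding vectors can be compared with cosine similarity. The three-dimensional vectors are hypothetical toy values, not real model output; with the real embedder you would pass the lists returned by `embedder.embed_query(...)` instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values chosen for illustration only)
hello = [0.9, 0.1, 0.0]
greeting = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Texts with similar meaning should score closer to 1.0
print("hello vs greeting:", cosine_similarity(hello, greeting))
print("hello vs invoice: ", cosine_similarity(hello, invoice))
```

Vectors that point in similar directions score close to 1.0, which is the property a vector store relies on when it looks for related text.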

In [32]:
from langchain_community.vectorstores import FAISS  ## vector stores now live in langchain-community
from pprint import pprint

## Each entry is one message; every line needs a trailing comma,
## otherwise Python silently concatenates adjacent string literals.
conversation = [
    "[User]  Hello! My name is Beras, and I'm a big blue bear! Can you please tell me about the rocky mountains?",
    "[Agent] The Rocky Mountains are a beautiful and majestic range of mountains that stretch across North America",
    "[Beras] Wow, that sounds amazing! Ive never been to the Rocky Mountains before, but Ive heard many great things about them.",
    "[Agent] I hope you get to visit them someday, Beras! It would be a great adventure for you!",
    "[Beras] Thank you for the suggestion! Ill definitely keep it in mind for the future.",
    "[Agent] In the meantime, you can learn more about the Rocky Mountains by doing some research online or watching documentaries about them.",
    "[Beras] I live in the arctic, so I'm not used to the warm climate there. I was just curious, ya know!",
    "[Agent] Absolutely! Lets continue the conversation and explore more about the Rocky Mountains and their significance!",
]

vector_store = FAISS.from_texts(conversation, embedding=embedder)
retriever = vector_store.as_retriever()

pprint(retriever.invoke("What is your name?"))

[Document(metadata={}, page_content="[User]  Hello! My name is Beras, and I'm a big blue bear! Can you please tell me about the rocky mountains?"),
 Document(metadata={}, page_content='[Agent] Absolutely! Lets continue the conversation and explore more about the Rocky Mountains and their significance!'),
 Document(metadata={}, page_content='[Agent] The Rocky Mountains are a beautiful and majestic range of mountains that stretch across North America'),
 Document(metadata={}, page_content='[Beras] Wow, that sounds amazing! Ive never been to the Rocky Mountains before, but Ive heard many great things about them.')]
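
The retriever returns the stored texts whose embeddings lie closest to the query embedding. A flat index such as `IndexFlatL2` (used later in this notebook) does this by comparing the query against every stored vector. The sketch below imitates that with a brute-force search in plain Python; the two-dimensional vectors and the `top_k` helper are hypothetical and only illustrate the idea.

```python
def l2_squared(a, b):
    """Squared Euclidean (L2) distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, stored, k=4):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, l2_squared(query_vec, vec)) for text, vec in stored]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Hypothetical store: (text, toy embedding) pairs
toy_store = [
    ("My name is Beras", [1.0, 0.0]),
    ("The Rocky Mountains are in North America", [0.0, 1.0]),
    ("I live in the arctic", [0.7, 0.3]),
]

# A toy query vector standing in for the embedding of "What is your name?"
results = top_k([0.9, 0.1], toy_store, k=2)
for text, dist in results:
    print(f"{dist:.2f}  {text}")
```

The smaller the distance, the more relevant the text is assumed to be; the four documents printed above correspond to `k=4`.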


In [33]:
context_prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context"
    "\n\nRetrieved Context: {context}"
    "\n\nUser Question: {question}"
    "\nAnswer the user conversationally. User is not aware of context."
)

chain = (
    {'context': vector_store.as_retriever(),
     'question': (lambda x: x)}
    | context_prompt
    | llm
    | StrOutputParser()
)

pprint(chain.invoke("Where does Beras live?"))

'Based on the conversation, Beras lives in the arctic! \n'


In [34]:
pprint(chain.invoke("Where are the Rocky Mountains?"))

("The Rocky Mountains are a mountain range in North America! They're known for "
 'being beautiful and majestic. \n')


In [35]:
pprint(chain.invoke("How far away is Beras from the Rocky Mountains?"))

("We don't have enough information to know how far Beras is from the Rocky "
 "Mountains. We know Beras lives in the arctic, but we don't know exactly "
 'where that is.  \n')
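
The dictionary at the head of `chain` runs each of its values on the same input and collects the results under the given keys, and the prompt template is then filled from that dictionary. The following is a rough plain-Python sketch of that data flow; `fake_retriever`, `build_inputs`, and `render_prompt` are hypothetical stand-ins, not LangChain APIs.

```python
def fake_retriever(question):
    """Hypothetical stand-in for vector_store.as_retriever()."""
    return ["[Beras] I live in the arctic, so I'm not used to the warm climate there."]

def build_inputs(question):
    """Mimic the dict step: every value receives the same input."""
    return {
        "context": fake_retriever(question),  # retrieved documents
        "question": question,                 # passed through unchanged
    }

TEMPLATE = (
    "Answer the question using only the context"
    "\n\nRetrieved Context: {context}"
    "\n\nUser Question: {question}"
)

def render_prompt(question):
    """Fill the template with the retrieved context and the question."""
    return TEMPLATE.format(**build_inputs(question))

prompt_text = render_prompt("Where does Beras live?")
print(prompt_text)
```

The rendered string is what the model actually sees, which is why it can answer only from the retrieved messages.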


## Build a RAG Agent
We download several AI papers and use our custom RAG agent to answer users' questions about them.

### Loading And Chunking Documents

In [42]:
import json
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import ArxivLoader

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100,
    separators=["\n\n", "\n", ".", ";", ",", " "],
)

print("Loading Documents")
docs = [
    ArxivLoader(query="1706.03762").load(),  ## Attention Is All You Need Paper
    ArxivLoader(query="1810.04805").load(),  ## BERT Paper
    ArxivLoader(query="2005.11401").load(),  ## RAG Paper
    ArxivLoader(query="2205.00445").load(),  ## MRKL Paper
    ArxivLoader(query="2310.06825").load(),  ## Mistral Paper
    ArxivLoader(query="2306.05685").load(),  ## LLM-as-a-Judge
]

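## Cut the paper short if a references section is included.
## Note: json.dumps returns a quoted, escaped copy of the text, so truncated
## documents contain literal "\n" and "\uXXXX" sequences (visible in later output).
for doc in docs:
    content = json.dumps(doc[0].page_content)
    if "References" in content:
        doc[0].page_content = content[:content.index("References")]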

## Split the documents and also filter out stubs (overly short chunks)
print("Chunking Documents\n")
docs_chunks = [ text_splitter.split_documents(doc) for doc in docs ]
docs_chunks = [[c for c in dchunks if len(c.page_content) > 200] for dchunks in docs_chunks]

## Make some custom Chunks to give big-picture details
doc_string = "Available Documents:"
doc_metadata = []
for chunks in docs_chunks:
    metadata = getattr(chunks[0], 'metadata', {})
    doc_string += "\n - " + metadata.get('Title', 'Untitled')  ## avoid a TypeError if the title is missing
    doc_metadata += [ str(metadata) ]

extra_chunks = [ doc_string ] + doc_metadata

## Printing out some summary information for reference
pprint(doc_string)
print()
for i, chunks in enumerate(docs_chunks):
    print(f"Document {i}")
    print(f" - # Chunks: {len(chunks)}")
    print(f" - Metadata: ")
    pprint(chunks[0].metadata)
    print()

Loading Documents
Chunking Documents

('Available Documents:\n'
 ' - Attention Is All You Need\n'
 ' - BERT: Pre-training of Deep Bidirectional Transformers for Language '
 'Understanding\n'
 ' - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks\n'
 ' - MRKL Systems: A modular, neuro-symbolic architecture that combines large '
 'language models, external knowledge sources and discrete reasoning\n'
 ' - Mistral 7B\n'
 ' - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena')

Document 0
 - # Chunks: 35
 - Metadata: 
{'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion '
            'Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin',
 'Published': '2023-08-02',
 'Summary': 'The dominant sequence transduction models are based on complex '
            'recurrent or\n'
            'convolutional neural networks in an encoder-decoder '
            'configuration. The best\n'
            'performing models also connect the encoder and decoder 
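
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on the listed separators); the `chunk_text` helper is a hypothetical sketch that only shows how consecutive chunks share an overlapping region.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghij" * 25  # 250 characters of toy text
chunks = chunk_text(sample, chunk_size=100, chunk_overlap=20)

for i, chunk in enumerate(chunks):
    print(i, len(chunk))
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.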

### Construct Document Vector Stores

In [43]:
%%time
print("Constructing Vector Stores")
vecstores = [FAISS.from_texts(extra_chunks, embedder)]
vecstores += [FAISS.from_documents(doc_chunks, embedder) for doc_chunks in docs_chunks]

Constructing Vector Stores
CPU times: user 251 ms, sys: 10.8 ms, total: 262 ms
Wall time: 3.67 s


In [48]:
from faiss import IndexFlatL2
from langchain_community.docstore.in_memory import InMemoryDocstore

embed_dims = len(embedder.embed_query("test"))
def default_FAISS():
    '''Useful utility for making an empty FAISS vectorstore'''
    return FAISS(
        embedding_function=embedder,
        index=IndexFlatL2(embed_dims),
        docstore=InMemoryDocstore(),
        index_to_docstore_id={},
        normalize_L2=False
    )

def aggregate_vstores(vectorstores):
    ## Initialize an empty FAISS index and merge the others into it.
    ## We use default_FAISS for simplicity, though it is tied to your embedder by reference.
    agg_vstore = default_FAISS()
    for vstore in vectorstores:
        agg_vstore.merge_from(vstore)
    return agg_vstore

## Unintuitive optimization; merge_from seems to optimize constituent vector stores away
docstore = aggregate_vstores(vecstores)

print(f"Constructed aggregate docstore with {len(docstore.docstore._dict)} chunks")

Constructed aggregate docstore with 238 chunks


### Implement RAG Chain

In [52]:
from langchain_core.runnables.passthrough import RunnableAssign

convstore = default_FAISS()

def save_memory_and_get_output(d, vstore):
    """Accepts an 'input'/'output' dictionary, saves both messages to vstore, and returns the output"""
    vstore.add_texts([
        f"User previously responded with {d.get('input')}",
        f"Agent previously responded with {d.get('output')}"
    ])
    return d.get('output')

initial_msg = (
    "Hello! I am a document chat agent here to help the user!"
    f" I have access to the following documents: {doc_string}\n\nHow can I help you?"
)

chat_prompt = ChatPromptTemplate.from_messages([("system",
    "You are a document chatbot. Help the user as they ask questions about documents."
    " User just asked: {input}\n\n"
    " From this, we have retrieved the following potentially-useful info: "
    " Conversation History Retrieval:\n{history}\n\n"
    " Document Retrieval:\n{context}\n\n"
    " (Answer only from retrieval. Only cite sources that are used. Make your response conversational.)"
), ('user', '{input}')])

stream_chain = chat_prompt | llm | StrOutputParser()
retrieval_chain = (
    {'input' : (lambda x: x)}
    | RunnableAssign({'history' : (lambda x: x['input']) | convstore.as_retriever()})
    | RunnableAssign({'context' : (lambda x: x['input']) | docstore.as_retriever()})
)


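def chat_gen(message, history=None, return_buffer=True):
    ## `history` is supplied by Gradio but unused; past turns are retrieved from convstore
    buffer = ""
    ## Retrieve relevant conversation history and document chunks for this message
    retrieval = retrieval_chain.invoke(message)

    ## Stream the answer: yield the accumulated text (for Gradio) or each new token
    for token in stream_chain.stream(retrieval):
        buffer += token
        yield buffer if return_buffer else token

    ## Save the finished exchange so later turns can retrieve it
    save_memory_and_get_output({'input':  message, 'output': buffer}, convstore)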


## Start of Agent Event Loop
test_question = "Tell me about RAG!"  ## <- modify as desired

for response in chat_gen(test_question, return_buffer=False):
    print(response, end='')

RAG stands for Retrieval-Augmented Generation. It's a pretty cool technique used in natural language processing (NLP) that combines the strengths of two different types of models:  parametric and non-parametric. 

Think of it like this: The parametric model is like a really smart person who has learned a lot from reading lots of books, but might not know everything. The non-parametric model is like a massive library, full of information but needing someone to find the right books. 

RAG brings these two together. It uses a pre-trained model (like a seq2seq model) as its "smart person" and a dense vector index of information (like Wikipedia) as its "library".  A neural retriever is used to find the most relevant information from the library, and this information is then used to help the model generate more accurate and factual responses. 

In short, RAG helps NLP models be more knowledge-intensive, making them better at tasks like answering open-domain questions and generating more fact
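
The `return_buffer` flag in `chat_gen` switches between two streaming styles: yielding only the newest token (handy for `print(..., end='')`) or yielding everything generated so far (handy for a UI that redraws the whole message on each update). The sketch below shows the difference with a hypothetical `fake_stream` generator standing in for `stream_chain.stream(...)`.

```python
def fake_stream(text):
    """Hypothetical token stream: yields one word at a time."""
    for word in text.split(" "):
        yield word + " "

def stream_response(text, return_buffer=True):
    """Yield either the accumulated buffer or just the newest token."""
    buffer = ""
    for token in fake_stream(text):
        buffer += token
        yield buffer if return_buffer else token

tokens = list(stream_response("a b c", return_buffer=False))
buffers = list(stream_response("a b c", return_buffer=True))

print(tokens)
print(buffers)
```

In both modes the full answer is available in `buffer` once the loop ends, which is what `chat_gen` saves to the conversation store.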

### Interact With Gradio Chatbot

In [53]:
import gradio as gr

chatbot = gr.Chatbot(value = [[None, initial_msg]])
demo = gr.ChatInterface(chat_gen, chatbot=chatbot).queue()

try:
    demo.launch(debug=True, share=True, show_api=False)
    demo.close()
except Exception as e:
    demo.close()
    print(e)
    raise



Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://53951c0fefc1d7f3ed.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7862 <> https://53951c0fefc1d7f3ed.gradio.live
Closing server running on port: 7862


### Save and Reload

In [54]:
docstore.save_local("docstore_index")
!tar czvf docstore_index.tgz docstore_index
!rm -rf docstore_index

docstore_index/
docstore_index/index.faiss
docstore_index/index.pkl


In [58]:
!tar xzvf docstore_index.tgz
new_db = FAISS.load_local("docstore_index", embedder, allow_dangerous_deserialization=True)
docs = new_db.similarity_search("Tell me about Mistral 7B")
print(docs[0].page_content[:1000])

docstore_index/
docstore_index/index.faiss
docstore_index/index.pkl
. We also provide a model fine-tuned to follow instructions,\nMistral 7B \u2013 Instruct, that surpasses Llama 2 13B \u2013 chat model both on human and\nautomated benchmarks. Our models are released under the Apache 2.0 license.\nCode: https://github.com/mistralai/mistral-src\nWebpage: https://mistral.ai/news/announcing-mistral-7b/\n1\nIntroduction\nIn the rapidly evolving domain of Natural Language Processing (NLP), the race towards higher model\nperformance often necessitates an escalation in model size. However, this scaling tends to increase\ncomputational costs and inference latency, thereby raising barriers to deployment in practical,\nreal-world scenarios. In this context, the search for balanced models delivering both high-level\nperformance and efficiency becomes critically essential. Our model, Mistral 7B, demonstrates that\na carefully designed language model can deliver high performance while maintaining a