# TA Session Topic 4: RAG (Retrieval Augmented Generation)

**Learning Objectives:**

* Understand the structure and code of a RAG pipeline
* Examine the impact of the chunking strategy on RAG performance

**Outline:**

1. **Simple RAG Chain**
2. **Indexing**
3. **Retrieval and Generation**

**Reference Links:**

1. **LangChain Vector Stores**: [Link](https://python.langchain.com/docs/concepts/vectorstores/)
2. **LangChain Retrievers**: [Link](https://python.langchain.com/docs/concepts/retrievers/)
3. **LangChain RAG Tutorial**: [Link](https://python.langchain.com/docs/tutorials/rag/)

## Environment Setup

In [None]:
# Install packages
!pip install --quiet --upgrade langchain langchain-community langchain-chroma

In [None]:
# Install langchain openai
!pip install -qU langchain-openai

### Set API key

In [None]:
# Set API key
OPENAI_API_KEY="your_api_key_here"

## Prepare Language Model

In [None]:
# Prepare model
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", api_key=OPENAI_API_KEY)

## 1. Simple RAG Chain Example

We will use a web document as the retrieval source.

Document source: [link](https://arxiv.org/html/2312.10997v5)

In [None]:
import bs4
from langchain import hub
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the contents of the paper
loader = WebBaseLoader(
    web_paths=("https://arxiv.org/html/2312.10997v5",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("ltx_abstract", "ltx_section")
        )
    ),
)
docs = loader.load()

# Chunk and index the document
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings(model='text-embedding-3-small', api_key=OPENAI_API_KEY))

# Retrieve and generate using the relevant snippets of the paper.
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
rag_chain.invoke("What types of RAG exist?")

In [None]:
rag_chain.invoke("Tell me about the various chunking methods")

In [None]:
# cleanup
vectorstore.delete_collection()

## 2. Indexing

*   Load: Load the data with Document Loaders.
*   Split: Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won't fit in a model's finite context window.
*   Store: Using a VectorStore and Embeddings model, store and index the splits, so that they can later be searched over.

![Indexing](https://python.langchain.com/assets/images/rag_indexing-8160f90a90a33253d0154659cf7d453f.png)

We use [document loaders](https://python.langchain.com/docs/concepts/document_loaders/) to load documents.

You can also load documents in other formats, such as [PDF](https://python.langchain.com/docs/how_to/document_loader_pdf/) and [CSV](https://python.langchain.com/docs/how_to/document_loader_csv/).

In [None]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Keep only the abstract and section elements from the full HTML.
loader = WebBaseLoader(
    web_paths=("https://arxiv.org/html/2312.10997v5",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("ltx_abstract", "ltx_section")
        )
    ),
)
docs = loader.load()

len(docs[0].page_content)

In [None]:
print(docs[0].page_content[:300])

We will split our documents into chunks of 1000 characters with 200 characters of overlap between chunks.

The overlap helps mitigate the possibility of separating a statement from important context related to it.
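
To make the effect of overlap concrete, here is a minimal sketch of fixed-size chunking with a sliding window. The function `chunk_with_overlap` is a hypothetical helper written for illustration, not part of LangChain, and it cuts purely by character position (unlike the splitter used in this notebook, which prefers natural separators).

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Synthetic 2500-character text, used only for this illustration.
text = "".join(str(i % 10) for i in range(2500))
chunks = chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200)

print(len(chunks))                         # number of chunks
print([len(c) for c in chunks])            # size of each chunk
print(chunks[0][-200:] == chunks[1][:200]) # neighbours share 200 characters
```

With a step of 800 characters, a statement that straddles a chunk boundary still appears whole in at least one of the two neighbouring chunks, as long as it is shorter than the overlap.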

We use [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/how_to/recursive_text_splitter/), which will recursively split the document using common separators (default: `["\n\n", "\n", " ", ""]`) until each chunk is the appropriate size.
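
The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: the hypothetical `recursive_split` below ignores overlap, and it assumes the separator list is ordered from coarsest to finest and ends with `""`.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified recursive splitting (no overlap): try coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]

    # Split on the current separator; parts that are still too long
    # are split again with the finer separators.
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, finer))

    # Greedily merge neighbouring pieces back together while they still fit.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the limit allows.\n\n"
    "End."
)
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

Paragraph breaks are preferred as cut points, and only the paragraph that exceeds the limit is broken further at spaces.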

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

In [None]:
len(all_splits[1].page_content)

In [None]:
all_splits[10].metadata

See the [text splitter how-to guides](https://python.langchain.com/docs/how_to/#text-splitters) for other LangChain splitters.

We can embed and store all of our document splits in a single command using the Chroma vector store and OpenAIEmbeddings model.

In [None]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings(model='text-embedding-3-small', api_key=OPENAI_API_KEY))

## 3. Retrieval and Generation

*   Retrieve: Given a user input, relevant splits are retrieved from storage using a Retriever.
*   Generate: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.

![Retrieval and Generation](https://python.langchain.com/assets/images/rag_retrieval_generation-1046a4668d6bb08786ef73c56d4f228a.png)

LangChain defines a Retriever interface which wraps an index that can return relevant Documents given a string query.
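
To illustrate what such an interface does, here is a toy retriever that needs neither an API key nor a vector store. Everything in it is an assumption made for illustration: `ToyRetriever` and `embed` are hypothetical names, and word counts stand in for a real embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyRetriever:
    """Wraps an in-memory index and returns the k most similar texts for a query."""

    def __init__(self, texts: list[str], k: int = 2):
        self.k = k
        self.index = [(text, embed(text)) for text in texts]

    def invoke(self, query: str) -> list[str]:
        query_vec = embed(query)
        ranked = sorted(
            self.index, key=lambda item: cosine(query_vec, item[1]), reverse=True
        )
        return [text for text, _ in ranked[: self.k]]


texts = [
    "query rewriting reformulates the user question",
    "chunking splits documents into smaller pieces",
    "embedding models map text to vectors",
]
toy_retriever = ToyRetriever(texts, k=2)
results = toy_retriever.invoke("how does query rewriting work")
print(results[0])
```

The `k` argument plays the same role as `search_kwargs={"k": ...}` on a vector store retriever: it caps how many documents are returned.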

In [None]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

retrieved_docs = retriever.invoke("What are the approaches to query rewriting?")

len(retrieved_docs)

In [None]:
print(retrieved_docs[0].page_content)

Let’s put it all together into a chain that takes a question, retrieves relevant documents, constructs a prompt, passes that to a model, and parses the output.
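
Before building the real chain, it may help to see the same data flow written as ordinary function calls. This is only a sketch: `retrieve`, `build_prompt`, `stub_llm`, and `parse_output` are hypothetical stand-ins for the retriever, prompt, model, and output parser, and no API is called.

```python
CORPUS = [
    "Chunking splits documents into smaller pieces.",
    "Query rewriting reformulates the user question.",
]


def retrieve(question: str) -> list[str]:
    """Hypothetical retriever: return texts that share a word with the question."""
    words = set(question.lower().split())
    return [doc for doc in CORPUS if words & set(doc.lower().split())]


def format_docs(docs: list[str]) -> str:
    """Join the retrieved texts into one context string."""
    return "\n\n".join(docs)


def build_prompt(inputs: dict) -> str:
    """Fill a fixed template with the context and the question."""
    return (
        "Answer using only the context.\n\n"
        f"Context:\n{inputs['context']}\n\n"
        f"Question: {inputs['question']}"
    )


def stub_llm(prompt: str) -> dict:
    """Hypothetical model: reports the prompt size instead of calling an API."""
    return {"content": f"[stub answer from a {len(prompt)}-character prompt]"}


def parse_output(message: dict) -> str:
    """Extract the plain string from the model's message."""
    return message["content"]


def toy_rag_chain(question: str) -> str:
    # The question is passed through unchanged and also used for retrieval.
    inputs = {"context": format_docs(retrieve(question)), "question": question}
    return parse_output(stub_llm(build_prompt(inputs)))


answer = toy_rag_chain("What is chunking")
print(answer)
```

Each function corresponds to one stage of the pipeline, which is why the stages can be composed left to right.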

In [None]:
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()

example_messages

In [None]:
print(example_messages[0].content)

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
rag_chain.invoke("Tell me about the various chunking methods.")

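LangChain also provides built-in chain constructors, `create_stuff_documents_chain` and `create_retrieval_chain`, that build the same pipeline. Note that the prompt for these chains uses the `{input}` and `{context}` variables.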

In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "Tell me about the various chunking methods."})
print(response["answer"])

In [None]:
for document in response["context"]:
    print(document)
    print()

In [None]:
# cleanup
vectorstore.delete_collection()

# Assignment 4: Naive RAG

In this assignment, you will implement a naive RAG pipeline, adjust the following strategies, and compare the results.

1.   Chunking strategy
     *  Chunk size, overlap
     *  Token-based splitter
2.   Vector indexing
3.   Prompt compression

## Source document

We will use the same document that we used in the TA session.

*   Source: [RAG Survey Paper](https://arxiv.org/html/2312.10997v5)

In [None]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Keep only the abstract and section elements from the full HTML.
loader = WebBaseLoader(
    web_paths=("https://arxiv.org/html/2312.10997v5",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("ltx_abstract", "ltx_section")
        )
    ),
)
docs = loader.load()

len(docs[0].page_content)

## TODO: Try different chunking options

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter1 = RecursiveCharacterTextSplitter(
    ###### TODO: Change chunk_size, chunk_overlap ######
    chunk_size=1000, chunk_overlap=200
    ####################################################
)
all_splits1 = text_splitter1.split_documents(docs)

print(len(all_splits1))

## TODO: Try a different splitter: split text by tokens

In [None]:
!pip install --upgrade --quiet langchain-text-splitters tiktoken

In [None]:
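from langchain_text_splitters import CharacterTextSplitter

# Note: CharacterTextSplitter splits only on its separator ("\n\n" by default) and
# measures length in tokens, so individual chunks can be larger than chunk_size.
text_splitter2 = CharacterTextSplitter.from_tiktoken_encoder(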
    ########## TODO: Change chunk_size, chunk_overlap ##########
    encoding_name="cl100k_base", chunk_size=200, chunk_overlap=0
    ############################################################
)
all_splits2 = text_splitter2.split_documents(docs)

print(len(all_splits2))

## TODO: Try another vector index

Chroma uses HNSW (Hierarchical Navigable Small World) indexing by default.

To try a different index type, we will use FAISS as the vector store.

In [None]:
!pip install -qU langchain-community faiss-cpu

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model='text-embedding-3-small', api_key=OPENAI_API_KEY)

In [None]:
dimension_size = len(embeddings.embed_query("hello world"))
print(dimension_size)

We will use `IndexFlatL2`, which performs an exact (brute-force) search by L2 distance, as the index.
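
A flat L2 index compares the query against every stored vector. The sketch below shows that idea in plain Python; `flat_l2_search` is a hypothetical helper for illustration only and is not how FAISS is implemented internally.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbour search by squared L2 distance.

    Returns a list of (squared_distance, index) pairs, closest first.
    """
    scored = []
    for index, vector in enumerate(vectors):
        squared_distance = sum((x - q) ** 2 for x, q in zip(vector, query))
        scored.append((squared_distance, index))
    scored.sort()  # smallest distance first
    return scored[:k]


# Tiny 2-dimensional example vectors, chosen only for illustration.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
neighbours = flat_l2_search(vectors, query=[0.9, 0.1], k=2)
print([index for _, index in neighbours])
```

Because every vector is scanned, the result is exact, but the cost grows linearly with the number of stored vectors. Approximate indexes such as HNSW trade some accuracy for faster search.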

In [None]:
import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_openai import OpenAIEmbeddings

# create FAISS vector store
db = FAISS(
    embedding_function=embeddings,
    index=faiss.IndexFlatL2(dimension_size),
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

In [None]:
# Add the chunks to the store created above. Calling FAISS.from_documents here
# would build a new store and discard the explicitly configured index.
_ = db.add_documents(documents=all_splits1)
# _ = db.add_documents(documents=all_splits2)

In [None]:
retriever2 = db.as_retriever()

# Test the retriever
retrieved_docs = retriever2.invoke("What are the approaches to query rewriting?")

print(retrieved_docs[0].page_content)

## TODO: Apply prompt compression
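
Contextual compression shortens each retrieved document to the parts that are relevant to the query before they reach the prompt. The sketch below imitates that idea with a keyword filter in place of an LLM. The `compress` function and the `STOPWORDS` set are hypothetical and serve only as an illustration.

```python
import re

# Hypothetical stopword list so that words like "the" do not count as matches.
STOPWORDS = {"what", "are", "the", "to", "a", "of", "in", "is"}


def compress(document: str, query: str) -> str:
    """Keep only the sentences that share a non-stopword term with the query."""
    query_terms = set(re.findall(r"\w+", query.lower())) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        sentence
        for sentence in sentences
        if query_terms & set(re.findall(r"\w+", sentence.lower()))
    ]
    return " ".join(kept)


document = (
    "Query rewriting reformulates the user question. "
    "The weather was sunny. "
    "Rewriting can also expand a query with synonyms."
)
compressed = compress(document, "What are the approaches to query rewriting?")
print(compressed)
```

The unrelated middle sentence is dropped, so fewer tokens are spent on context that does not help answer the question.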

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    # Use the FAISS retriever built in this assignment; the Chroma collection
    # behind `retriever` was deleted at the end of the TA session.
    base_retriever=retriever2
)

compressed_docs = compression_retriever.invoke(
    "What are the approaches to query rewriting?"
)

In [None]:
print(compressed_docs[0].page_content)

## Generate output

In [None]:
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

prompt = hub.pull("rlm/rag-prompt")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


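rag_chain2 = (
    # Use the FAISS retriever built in this assignment; swap in the line below
    # to compare against the compression retriever.
    {"context": retriever2 | format_docs, "question": RunnablePassthrough()}
    #{"context": compression_retriever | format_docs, "question": RunnablePassthrough()}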
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
######## Please try with your own questions ########
rag_chain2.invoke("Tell me about the various chunking methods.")