**Reference Link:** [RAG Systems Essentials (Analytics Vidhya)](https://courses.analyticsvidhya.com/courses/take/rag-systems-essentials/lessons/60148017-hands-on-deep-dive-into-rag-evaluation-metrics-generator-metrics-i)

# Project: Build a Document Retriever Search Engine on Wikipedia Data

## Install OpenAI and LangChain dependencies

In [0]:
!pip install langchain==0.3.10
!pip install langchain-openai==0.2.12
!pip install langchain-community==0.3.11
!pip install langchain-huggingface==0.1.2
!pip install jq==1.8.0
!pip install pymupdf==1.25.1

## Install Chroma Vector DB and LangChain wrapper

In [0]:
!pip install langchain-chroma==0.1.4

## Enter OpenAI API Key

In [0]:
from getpass import getpass

OPENAI_KEY = getpass('Enter OpenAI API Key: ')

## Setup Environment Variables

In [0]:
import os

os.environ['OPENAI_API_KEY'] = OPENAI_KEY

### OpenAI Embedding Models

LangChain gives us access to OpenAI's embedding models, including the newer `text-embedding-3-small` (smaller and highly efficient) and `text-embedding-3-large` (larger and more powerful).

In [0]:
from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

## Loading and Processing the Data

### Get the dataset

In [0]:
# if you can't download using the following code
# go to https://drive.google.com/file/d/1aZxZejfteVuofISodUrY2CDoyuPLYDGZ download it
# manually upload it on colab
# !gdown 1aZxZejfteVuofISodUrY2CDoyuPLYDGZ

In [0]:
# !unzip rag_docs.zip

### Load JSON Documents from Wikipedia Dump

In [0]:
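# document loaders live in langchain_community; the langchain.document_loaders path is deprecated
from langchain_community.document_loaders import JSONLoader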

# path assumes rag_docs.zip was unzipped into the working directory, as for the PDFs below
loader = JSONLoader(file_path='./rag_docs/wikidata_rag_demo.jsonl',
                    jq_schema='.',
                    text_content=False,
                    json_lines=True)
wiki_docs = loader.load()

In [0]:
len(wiki_docs)

In [0]:
wiki_docs[1500]

In [0]:
import json
from langchain_core.documents import Document
wiki_docs_processed = []

for doc in wiki_docs:
    doc = json.loads(doc.page_content)
    metadata = {
        "title": doc['title'],
        "id": doc['id'],
        "source": "Wikipedia"
    }
    data = ' '.join(doc['paragraphs'])
    wiki_docs_processed.append(Document(page_content=data, metadata=metadata))

In [0]:
wiki_docs_processed[1500]

### Create function to generate contextual summaries for chunks

Here we borrow inspiration from Anthropic's [contextual retrieval](https://www.anthropic.com/news/contextual-retrieval) strategy, which involves creating a contextual summary for each chunk and prepending it to the chunk before storing it in the vector database.

![](https://i.imgur.com/cjnB831.png)
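Stripped of the LLM call, the core of the strategy is string assembly: a short context is prepended to the chunk text, and the combined text is what gets embedded. The following is a minimal sketch of that step only. The helper name `build_contextual_chunk` and the sample strings are hypothetical and used purely for illustration.

```python
def build_contextual_chunk(context: str, chunk: str) -> str:
    """Prepend a short situating context to a chunk, separated by a newline."""
    # Strip stray whitespace so the stored text is stable across runs.
    return context.strip() + "\n" + chunk.strip()


# Hypothetical example values; in practice the context would come from an LLM.
context = "Focuses on the scaled dot-product attention used in the Transformer."
chunk = "We compute the dot products of the query with all keys..."

contextual_chunk = build_contextual_chunk(context, chunk)
print(contextual_chunk)
```

Because the context becomes part of the embedded text, a query that matches the summary can surface the chunk even when the chunk's own wording does not match.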

In [0]:
# load PDF files with langchain
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("./rag_docs/attention_paper.pdf")
doc_pages = loader.load()

In [0]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=3500,
                                          chunk_overlap=0)
doc_chunks = splitter.split_documents(doc_pages)

In [0]:
len(doc_chunks)

In [0]:
# the actual research paper
big_doc = '\n'.join([doc.page_content for doc in doc_chunks])

In [0]:
len(big_doc.split(' '))

In [0]:
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [0]:
# create a chat prompt
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


def generate_chunk_context(document, chunk):

    chunk_process_prompt = """You are an AI assistant specializing in research paper analysis.
                            Your task is to provide brief, relevant context for a chunk of text
                            based on the following research paper.

                            Here is the research paper:
                            <paper>
                            {paper}
                            </paper>

                            Here is the chunk we want to situate within the whole document:
                            <chunk>
                            {chunk}
                            </chunk>

                            Provide a concise context (3-4 sentences max) for this chunk,
                            considering the following guidelines:

                            - Give a short succinct context to situate this chunk within the overall document
                            for the purposes of improving search retrieval of the chunk.
                            - Answer only with the succinct context and nothing else.
                            - Context should be mentioned like 'Focuses on ....'
                            do not mention 'this chunk or section focuses on...'

                            Context:
                        """

    prompt_template = ChatPromptTemplate.from_template(chunk_process_prompt)

    agentic_chunk_chain = (prompt_template
                                |
                            chatgpt
                                |
                            StrOutputParser())

    context = agentic_chunk_chain.invoke({'paper': document, 'chunk': chunk})

    return context

In [0]:
print(doc_chunks[5].page_content)

In [0]:
generate_chunk_context(big_doc, doc_chunks[5].page_content)

### Load and Process PDF Documents

In [0]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

def create_contextual_chunks(file_path):

    print('Loading pages:', file_path)
    loader = PyMuPDFLoader(file_path)
    doc_pages = loader.load()

    print('Chunking pages:', file_path)
    splitter = RecursiveCharacterTextSplitter(chunk_size=3500,
                                              chunk_overlap=0)
    doc_chunks = splitter.split_documents(doc_pages)

    print('Generating contextual chunks:', file_path)
    original_doc = '\n'.join([doc.page_content for doc in doc_chunks])
    contextual_chunks = []
    for chunk in doc_chunks:
        context = generate_chunk_context(original_doc, chunk.page_content)
        contextual_chunks.append(Document(page_content=context+'\n'+chunk.page_content,
                                          metadata=chunk.metadata))
    print('Finished processing:', file_path)
    print()
    return contextual_chunks

In [0]:
from glob import glob

pdf_files = glob('./rag_docs/*.pdf')
pdf_files

In [0]:
paper_docs = []
for fp in pdf_files:
    paper_docs.extend(create_contextual_chunks(fp))

In [0]:
len(paper_docs)

In [0]:
paper_docs[0]

In [0]:
len(wiki_docs_processed)

In [0]:
total_docs = wiki_docs_processed + paper_docs
len(total_docs)

## Vector Databases

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector database takes care of storing embedded data and performing vector search for you.
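To make the embed, store, and query loop concrete, here is a minimal in-memory sketch. It is not how Chroma works internally: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is a hypothetical class that does a brute-force cosine search.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical store: keeps (text, vector) pairs and searches them linearly."""

    def __init__(self):
        self._items = []

    def add(self, text: str) -> None:
        # Embed once at insert time and keep the vector next to the text.
        self._items.append((text, embed(text)))

    def search(self, query: str, k: int = 2) -> list:
        # Embed the query, score every stored vector, return the k best texts.
        query_vec = embed(query)
        scored = [(cosine_similarity(query_vec, vec), text)
                  for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


store = ToyVectorStore()
store.add("machine learning is a field of artificial intelligence")
store.add("the eiffel tower is in paris")
store.add("deep learning uses neural networks")

results = store.search("what is machine learning", k=1)
print(results)
```

A real vector database replaces the linear scan with an approximate nearest-neighbour index and adds persistence, which is what the rest of this section relies on.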

### Chroma Vector DB

[Chroma](https://docs.trychroma.com/getting-started) is an AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

### Create a Vector DB and persist on disk

Here we create a Chroma vector DB from our documents. Because we also want to save it to disk, we pass the directory where the data should be stored via `persist_directory`.

In [0]:
from langchain_chroma import Chroma

# create vector DB of docs and embeddings - takes < 30s on Colab
chroma_db = Chroma.from_documents(documents=total_docs,
                                  collection_name='my_db',
                                  embedding=openai_embed_model,
                                  # need to set the distance function to cosine else it uses euclidean by default
                                  # check https://docs.trychroma.com/guides#changing-the-distance-function
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./my_db")
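The choice of distance function matters because cosine and Euclidean distance can rank the same vectors differently when their lengths differ. The sketch below uses made-up 2-D vectors (not real embeddings) to show one such case. For unit-length vectors the two rankings coincide.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity: depends only on the angle between u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def euclidean_distance(u, v):
    """Straight-line distance: sensitive to vector length as well as angle."""
    return math.dist(u, v)


# Illustrative vectors only, chosen so the two metrics disagree.
query = (1.0, 0.0)
docs = {
    "doc_a": (10.0, 1.0),  # nearly the same direction as the query, but long
    "doc_b": (0.5, 0.8),   # close to the query in space, but a different direction
}

nearest_by_cosine = min(docs, key=lambda name: cosine_distance(query, docs[name]))
nearest_by_euclidean = min(docs, key=lambda name: euclidean_distance(query, docs[name]))

print("cosine   :", nearest_by_cosine)
print("euclidean:", nearest_by_euclidean)
```

Cosine distance picks `doc_a` because it points the same way as the query, while Euclidean distance picks `doc_b` because it is physically closer.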

### Load Vector DB from disk

This just shows that once a vector database is persisted on disk, you can load it and connect to it at any time.

In [0]:
# load from disk
chroma_db = Chroma(persist_directory="./my_db",
                   collection_name='my_db',
                   embedding_function=openai_embed_model)

In [0]:
chroma_db

## Experiment with Vector Database Retrievers

Here we will explore the following retrieval strategies on our Vector Database:

- Similarity or Ranking based Retrieval
- Multi Query Retrieval
- Contextual Compression Retrieval
- Chained Retrieval Pipeline

### Similarity or Ranking based Retrieval

We use cosine similarity here and retrieve the five documents most similar to the user's input query.

In [0]:
similarity_retriever = chroma_db.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 5})

In [0]:
from IPython.display import display, Markdown

def display_docs(docs):
    for doc in docs:
        print('Metadata:', doc.metadata)
        print('Content Brief:')
        display(Markdown(doc.page_content[:1000]))
        print()

In [0]:
query = "what is machine learning?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is ML?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is the difference between transformers and vision transformers?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is a cnn?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is deep learning?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is nlp?"
top_docs = similarity_retriever.invoke(query)
display_docs(top_docs)

### Multi Query Retrieval

Retrieval may produce different results with subtle changes in query wording, or if the embeddings do not capture the semantics of the data well. Prompt engineering / tuning is sometimes done to manually address these problems, but can be tedious.

The [`MultiQueryRetriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.multi_query.MultiQueryRetriever.html) automates the process of prompt tuning by using an LLM to generate multiple queries from different perspectives for a given user input query. For each query, it retrieves a set of relevant documents and takes the unique union across all queries to get a larger set of potentially relevant documents.
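The "unique union" step can be sketched without an LLM or a vector store. In the sketch below, `FAKE_INDEX` and `fake_retrieve` are hypothetical stand-ins for the generated queries and the underlying retriever. Only the merging logic is the point, and it is not taken from LangChain's implementation.

```python
def unique_union(result_lists):
    """Merge several ranked result lists, keeping first occurrences in order."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical stand-in for a retriever: maps a query string to document IDs.
FAKE_INDEX = {
    "what is a cnn?": ["doc_cnn", "doc_dl"],
    "how do convolutional neural networks work?": ["doc_cnn", "doc_conv"],
    "what are convolutional networks used for?": ["doc_vision", "doc_cnn"],
}


def fake_retrieve(query):
    """Return the canned results for a query (empty list if unknown)."""
    return FAKE_INDEX.get(query, [])


# In the real retriever these query variants would be generated by the LLM.
queries = list(FAKE_INDEX)
merged = unique_union(fake_retrieve(q) for q in queries)
print(merged)
```

Each variant contributes documents the others missed, while `doc_cnn`, which every variant returns, appears only once in the merged set.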

In [0]:
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [0]:
from langchain.retrievers.multi_query import MultiQueryRetriever
# Set logging for the queries
import logging

similarity_retriever = chroma_db.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 5})

mq_retriever = MultiQueryRetriever.from_llm(
    retriever=similarity_retriever, llm=chatgpt
)

logging.basicConfig()
# so we can see what queries are generated by the LLM
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [0]:
query = "what is a cnn?"
top_docs = mq_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is nlp?"
top_docs = mq_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is ML?"
top_docs = mq_retriever.invoke(query)
display_docs(top_docs)

### Contextual Compression Retrieval

The information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this. The idea is simple: instead of immediately returning retrieved documents as-is, you can compress them using the context of the given query, so that only the relevant information is returned.

This compression can happen in the form of:

- Extraction: remove the parts of each retrieved document that are not relevant to the query, keeping only the relevant passages.

- Filtering: drop whole documents that are not relevant to the query, leaving the content of the remaining documents untouched.
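The difference between the two forms of compression can be sketched with a crude keyword match standing in for the LLM's relevance judgement. The function names and sample documents below are hypothetical. A real compressor would ask an LLM instead of comparing words.

```python
def is_relevant(text, query_terms):
    """Stand-in for an LLM judgement: does the text share a word with the query?"""
    words = set(text.lower().replace(".", "").split())
    return bool(words & query_terms)


def extract_relevant(doc, query_terms):
    """Extraction: keep only the sentences that look relevant to the query."""
    sentences = [s.strip() for s in doc.split(".") if s.strip()]
    kept = [s for s in sentences if is_relevant(s, query_terms)]
    return ". ".join(kept) + "." if kept else ""


def filter_relevant(docs, query_terms):
    """Filtering: keep or drop whole documents, never edit their content."""
    return [d for d in docs if is_relevant(d, query_terms)]


docs = [
    "Clustering groups similar points. The weather was nice that day.",
    "Paris is the capital of France.",
]
query_terms = {"clustering"}

# Extraction shortens the first document and drops the second (nothing kept).
extracted = [text for text in (extract_relevant(d, query_terms) for d in docs) if text]
# Filtering returns the first document in full and drops the second.
filtered = filter_relevant(docs, query_terms)

print(extracted)
print(filtered)
```

Extraction returns less text per document, while filtering returns fewer documents but leaves each one intact.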

Here we wrap our multi-query retriever with a `ContextualCompressionRetriever`. Then we'll add an `LLMChainExtractor`, which will iterate over the initially returned documents and extract from each only the content that is relevant to the query.

In [0]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor


# extracts from each document only the content that is relevant to the query
compressor = LLMChainExtractor.from_llm(llm=chatgpt)

# retrieves the documents similar to query and then applies the compressor
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=mq_retriever
)

In [0]:
query = "what is ML?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is nlp?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is a cnn?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is clustering?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is a neural network?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

The `LLMChainFilter` is a slightly simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which ones to return, without manipulating the document contents.

In [0]:
from langchain.retrievers.document_compressors import LLMChainFilter

#  decides which of the initially retrieved documents to filter out and which ones to return
_filter = LLMChainFilter.from_llm(llm=chatgpt)

# retrieves the documents similar to query and then applies the filter
compression_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=mq_retriever
)

In [0]:
query = "what is ML?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is NLP?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is a neural network?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is a cnn?"
top_docs = compression_retriever.invoke(query)
display_docs(top_docs)

### Chained Retrieval Pipeline

This strategy chains multiple retrievers sequentially to narrow the results down to the most relevant documents. The flow is:

Similarity Retrieval → Compression Filter → Reranker Model Retrieval

![](http://i.imgur.com/77pXxLu.gif)
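The three stages compose as plain function calls, each consuming the previous stage's output. The sketch below uses word-overlap heuristics as hypothetical stand-ins for the vector search, the LLM filter, and the cross-encoder reranker. It illustrates the data flow only, not the behaviour of the real components.

```python
def similarity_stage(query, corpus, k):
    """Stage 1: rank by distinct query words present and keep the top k."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]


def filter_stage(query, docs):
    """Stage 2: drop documents with no overlap (stand-in for an LLM yes/no filter)."""
    q = set(query.lower().split())
    return [d for d in docs if q & set(d.lower().split())]


def rerank_stage(query, docs, top_n):
    """Stage 3: rerank by total query-word occurrences (stand-in for a cross-encoder)."""
    q = set(query.lower().split())

    def score(doc):
        return sum(1 for word in doc.lower().split() if word in q)

    return sorted(docs, key=score, reverse=True)[:top_n]


corpus = [
    "neural network models learn from data",
    "a neural network is a network of artificial neurons",
    "bread is baked in an oven",
    "network cables connect computers",
]
query = "neural network"

candidates = similarity_stage(query, corpus, k=4)  # broad recall
kept = filter_stage(query, candidates)             # removes the bread document
final = rerank_stage(query, kept, top_n=2)         # reorders and truncates

print(final)
```

The early stage favours recall with a cheap score, and the later stages spend more effort on fewer documents, which is why the reranker only ever sees what survives the filter.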

In [0]:
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain.retrievers.document_compressors import LLMChainFilter
from langchain.retrievers import ContextualCompressionRetriever

# Retriever 1 - simple cosine distance based retriever
similarity_retriever = chroma_db.as_retriever(search_type="similarity",
                                              search_kwargs={"k": 5})

#  decides which of the initially retrieved documents to filter out and which ones to return
_filter = LLMChainFilter.from_llm(llm=chatgpt)
# Retriever 2 - retrieves the documents similar to query and then applies the filter
compressor_retriever = ContextualCompressionRetriever(
    base_compressor=_filter, base_retriever=similarity_retriever
)

# download an open-source reranker model - BAAI/bge-reranker-large
reranker = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-large")
reranker_compressor = CrossEncoderReranker(model=reranker, top_n=3)
# Retriever 3 - Uses a Reranker model to rerank retrieval results from the previous retriever
final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker_compressor, base_retriever=compressor_retriever
)

In [0]:
query = "what is ML?"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is a neural network?"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is a transformer model?"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is nlp?"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is clustering?"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is a vision transformer"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)

In [0]:
query = "what is statistics"
top_docs = final_retriever.invoke(query)
display_docs(top_docs)