# 🔁 Multi-Query RAGs - RAG 101

**Multi-Query RAG** is an advanced retrieval strategy where a single user query is rephrased or expanded into multiple diverse sub-queries. This helps in surfacing a wider and more comprehensive set of relevant documents from the vector store.

### ✨ Why Multi-Query RAG?

- Mitigates sparse or ambiguous queries
- Increases retrieval coverage
- Improves generation quality by exposing the model to diverse information slices

---

### ⚙️ How It Works – Multi-Stage RAG Pipeline

This workflow enhances context gathering and improves answer quality by combining multi-query expansion with staged RAG chaining:

1. **💬 User Question**  
   The process starts with a single input query from the user.

2. **🔀 Multi-Query Expansion RAG**  
   The original query is passed into a RAG chain using a custom prompt that generates multiple rephrasings or sub-queries.

3. **🔎 Sub-Query Retrieval RAG**  
   Each sub-query goes through its own retrieval chain. The system collects relevant document chunks for each sub-query and unions all the results.

4. **🤖 Final Answer Generation RAG**  
   The original question, along with the aggregated context from all sub-queries, is passed to the final RAG chain for synthesis and response generation.
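
The four stages above can be sketched in plain Python. This is only an illustration of the control flow, not the notebook's implementation: `multi_query_rag` and the three `toy_*` functions are hypothetical stand-ins for the LLM and the vector store.

```python
from typing import Callable


def multi_query_rag(
    question: str,
    expand: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    """Run the four stages with caller-supplied (hypothetical) components."""
    # Stage 2: expand the user question into several sub-queries.
    sub_queries = expand(question)

    # Stage 3: retrieve chunks per sub-query and keep each chunk once,
    # preserving first-seen order.
    context: list[str] = []
    for sub_query in sub_queries:
        for chunk in retrieve(sub_query):
            if chunk not in context:
                context.append(chunk)

    # Stage 4: answer the ORIGINAL question using the merged context.
    return generate(question, context)


# Toy stand-ins so the sketch runs without an LLM or a vector store.
def toy_expand(q: str) -> list[str]:
    return [q, q.lower(), f"Tell me about: {q}"]


def toy_retrieve(q: str) -> list[str]:
    corpus = ["warranty lasts one year", "battery must be serviced by a technician"]
    words = set(q.lower().split())
    return [c for c in corpus if words & set(c.split())]


def toy_generate(q: str, ctx: list[str]) -> str:
    return f"Q: {q} | context chunks: {len(ctx)}"


answer = multi_query_rag("Warranty terms", toy_expand, toy_retrieve, toy_generate)
print(answer)
```

All three sub-queries hit the same chunk here, so the merged context contains it only once.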

---

### 🔁 Multi-Stage RAG Flow


<div style="text-align: center;">
  <img src="multiquery_rag.png"
       alt="Multi-Query RAG flow diagram"
       style="margin-right: 10px;"
       width="500"
       height="500" />
</div>




The cell below collects the LangChain imports used throughout the notebook. I have also created two helper modules, `llm_call.py` and `embeddings.py`, to keep this notebook shorter and more readable. `llm_call.py` defines a class `LLMCall` that wraps all the chat model calls (OpenAI, the Groq Inference API, Ollama, and open-source Hugging Face models), with one function per provider. `embeddings.py`, as the name suggests, defines a class `Embeddings` that provides both the OpenAI embeddings and the Hugging Face embeddings used later in this notebook. For the details, please check out the two files.

In [1]:
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.document_loaders import PyPDFLoader
from langchain.prompts import ChatPromptTemplate
from llm_call import LLMCall
from embeddings import Embeddings
from operator import itemgetter

In [2]:
# Load PDF and split it into chunks
pdf_file = 'sample.pdf'
chunk_size = 1000
chunk_overlap = 200

loader = PyPDFLoader(pdf_file)
documents = loader.load()

# Split the document into manageable chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
texts = text_splitter.split_documents(documents)


In [3]:
texts[0]

Document(metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 19.3 (Macintosh)', 'creationdate': '2024-06-18T14:09:48-07:00', 'moddate': '2024-06-18T14:10:14-07:00', 'trapped': '/False', 'source': 'sample.pdf', 'total_pages': 4, 'page': 0, 'page_label': '1'}, page_content='Before using iPhone, review the iPhone User Guide  at  \nsupport.apple.com/guide/iphone .\nSafety and Handling\nSee “Safety, handling, and support” in the iPhone  \nUser Guide .\nExposure to Radio Frequency\nOn iPhone, go to Settings > General > Legal &  \nRegulatory > RF Exposure. Or go to apple.com/  \nlegal/rfexposure .\nBattery and Charging\nAn iPhone battery should only be repaired by a trained \ntechnician to avoid battery damage, which could cause \noverheating, fire, or injury. Batteries should be recycled \nor disposed of separately from household waste and \naccording to local environmental laws and guidelines. For \ninformation about Apple lithium-ion batteries and battery \nservice a

In [4]:
len(texts)

8
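
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping chunks using a fixed sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function name `sliding_window_chunks` is hypothetical and the sizes are shrunk for readability.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks where neighbours share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

The last four characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary retrievable from either side.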

## 🧠 Multi-Query Prompt Template

In a Multi-Query RAG system, we improve document retrieval by generating **multiple diverse versions** of a user's query. This is important because:

- Lexical variation increases the likelihood of matching relevant documents
- Similarity search can miss key chunks if the user's wording is too narrow
- Rephrasing helps cover edge cases and boost recall

This prompt is used to guide the LLM in producing **5 alternative phrasings** of the same question.  
Each version maintains the original intent but varies in vocabulary, structure, or tone — ideal for feeding into a retrieval pipeline.


In [None]:
# 🧠 Multi-Query Prompt Template for Generating Diverse Sub-Queries

multi_query_template = """

You are a question (query) generator. You create multiple questions that approach the given question from different perspectives while keeping the same meaning.
By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Please generate 5 questions similar to the given question.
Make sure to use different words and phrases to express the same idea.
The questions should be clear and concise.
The questions should be grammatically correct and easy to understand.
The questions should be in English.
Provide these alternative questions separated by newlines.

The question is: {question}

"""

In [6]:
m_query_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        ("human", multi_query_template),
    ]
)


## 🧾 `get_unique_union()` – Deduplicating Retrieved Documents

When multiple sub-queries are run in a Multi-Query RAG pipeline, there's a chance that some documents are retrieved by more than one sub-query.

To avoid duplication in the final context, this function:

1. **Flattens** the nested document list (from multiple retrievers)
2. **Serializes** each document using `langchain.load.dumps`
3. **Deduplicates** them using Python’s built-in `set`
4. **Deserializes** back to `Document` objects

This ensures only **unique context chunks** are passed into the final answer generation stage.


In [None]:
from langchain.load import dumps, loads

def get_unique_union(documents: list[list]):
    """
    🔁 Unique Union of Retrieved Documents

    This function takes a list of lists of LangChain `Document` objects,
    flattens them into a single list, and returns only unique documents.

    Args:
        documents (list[list[Document]]): Nested list of retrieved documents
                                           from multiple sub-query RAGs.

    Returns:
        list[Document]: Deduplicated list of Document objects
    """
    # 🧱 Step 1: Flatten list of lists, and convert each Document to string
    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]

    # ✅ Step 2: Get unique documents
    unique_docs = list(set(flattened_docs))

    # 🔄 Step 3: Deserialize strings back to Document objects
    return [loads(doc) for doc in unique_docs]
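
One thing to keep in mind: a Python `set` does not preserve insertion order, so the order of the deduplicated chunks can change between runs. If a stable order matters, an order-preserving variant is possible. The sketch below is an assumption-laden illustration, not part of the notebook: `Doc` is a hypothetical stand-in for a LangChain `Document`, and `unique_union_ordered` is a made-up name.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_union_ordered(documents: list[list[Doc]]) -> list[Doc]:
    """Flatten nested results and keep the first occurrence of each document."""
    seen: set[str] = set()
    unique: list[Doc] = []
    for sublist in documents:
        for doc in sublist:
            # Serialize content + metadata into a hashable key.
            key = json.dumps([doc.page_content, doc.metadata], sort_keys=True)
            if key not in seen:
                seen.add(key)
                unique.append(doc)
    return unique


a = Doc("warranty", {"page": 1})
b = Doc("battery", {"page": 0})
result = unique_union_ordered([[a, b], [b, a], [Doc("warranty", {"page": 1})]])
print([d.page_content for d in result])
```

Documents retrieved by the earliest sub-queries stay at the front, which keeps the context the same from one run to the next.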

## 🎯 Final RAG Prompt Template for Answer Generation

This is the final instruction template given to the LLM in a RAG setup.

It ensures that:
- Responses are grounded only in the retrieved context
- Irrelevant queries are safely rejected
- The tone is professional and brand-aligned
- The model avoids hallucinations or personal opinions

This template helps keep the system focused, safe, and helpful — especially in domain-specific deployments like **Apple mobile customer service**.


In [None]:
# 📝 Define the custom prompt template used in the final RAG stage

rag_template = """
You are a customer service agent for a apple mobile company. 
You have been given the following information about the customer question and the context.
Customer Query: {question}
Context: {context}

Answer: 
The answer should be based on the context provided.
Your task is to answer the customer question based on the context provided. If the question is not related to the context, please say "I don't know or Do Not Answer it just say please ask me question related to Apple Mobiles only".
Do not make up any information or provide any personal opinions or experiences.
Please answer in a friendly and professional manner.
"""

In [9]:
rag_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant."),
        ("human", rag_template),
    ]
)

print(rag_prompt)

input_variables=['context', 'question'] input_types={} partial_variables={} messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], input_types={}, partial_variables={}, template='You are a helpful assistant.'), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='\nYou are a customer service agent for a apple mobile company. \nYou have been given the following information about the customer question and the context.\nCustomer Query: {question}\nContext: {context}\n\nAnswer: \nThe answer should be based on the context provided.\nYour task is to answer the customer question based on the context provided. If the question is not related to the context, please say "I don\'t know or Do Not Answer it just say please ask me question related to Apple Mobiles only".\nDo not make up any information or provide any personal opinions or experiences.\nPlease answer in a

## ☁️ Using Azure OpenAI for Embeddings & Generation

The first part of this notebook uses **Azure OpenAI** services to power the RAG pipeline:

### 🔡 Embeddings
We use `AzureOpenAIEmbeddings` to convert documents and user queries into dense vector representations. These vectors are stored and queried using a **FAISS vector store**.

- Model: typically `text-embedding-3-small`
- Usage: Semantic search to retrieve the most relevant document chunks
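
As a rough picture of what semantic search does, the sketch below ranks chunks by distance between vectors. Everything in it is illustrative: the three-dimensional vectors are made up (real embeddings have hundreds or thousands of dimensions), `nearest_chunks` is a hypothetical helper, and Euclidean distance is used here only as an assumption, since the metric a real index uses depends on how it is configured.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_chunks(
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 2,
) -> list[str]:
    """Return the texts of the k vectors closest to the query vector."""
    ranked = sorted(index, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Made-up vectors standing in for embedded document chunks.
toy_index = [
    ("warranty terms", [0.9, 0.1, 0.0]),
    ("battery safety", [0.1, 0.9, 0.0]),
    ("radio exposure", [0.0, 0.2, 0.9]),
]
query_vector = [0.8, 0.2, 0.1]  # pretend this is the embedded user question

top = nearest_chunks(query_vector, toy_index, k=2)
print(top)
```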

### 🤖 Language Model
For final response generation, we use `AzureChatOpenAI`, a wrapper around chat models hosted on Azure infrastructure. Here it points at a deployment named `gpt-4o-mini-test`.

- Model: `gpt-4o-mini-test` (via deployment name)
- Input: Original user query + retrieved context
- Output: A grounded, domain-specific response


In [None]:
# 🧠 Initialize Azure OpenAI Embeddings
open_ai_embeddings = Embeddings.azure_openai()

In [11]:
vectorstore = FAISS.from_documents(
    texts,
    open_ai_embeddings
)

In [12]:
retriever = vectorstore.as_retriever()

In [None]:
# 🤖 Initialize Azure OpenAI Chat Model (LLM)
open_ai_llm = LLMCall.azure_openai()

## 🔄 Multi-Query Chain (Query Expansion)

This chain is responsible for generating **multiple diverse sub-queries** from a user's original question.

**Why?**  
Traditional RAG retrieves documents using the embedding of a single query. If the question is phrased narrowly, relevant information might be missed.

**How it works:**

1. A prompt instructs the LLM to generate 5 alternate versions of the query.
2. The Azure OpenAI model processes the prompt.
3. The result is parsed and split into individual questions.
4. These questions are then used to perform **multiple document retrievals**, improving recall.

This is a key step in **Multi-Query RAG** strategies that boost the quality of retrieved context.


In [None]:
# 🔁 Multi-Query Chain for Generating Rephrased Questions
# This chain uses an LLM to generate 5 diverse sub-queries from a single user question.
# It follows this sequence:
#   1. m_query_prompt: A prompt template instructing the LLM to generate alternate versions
#   2. open_ai_llm: The Azure OpenAI model used for generation
#   3. StrOutputParser(): Parses the raw LLM output into a string
#   4. lambda: Splits the result on newlines and drops blank lines,
#      so that no empty sub-query is sent to the retriever

multi_query_chain = (
    m_query_prompt
    | open_ai_llm
    | StrOutputParser()
    | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])
)
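
Chat models often number or bullet the alternatives and add blank lines between them, so a parsing step that also strips list markers can be useful. The helper below is a minimal sketch under that assumption; `parse_sub_queries` is a hypothetical name and the sample output is invented for illustration.

```python
import re


def parse_sub_queries(llm_output: str) -> list[str]:
    """Turn raw LLM text into a clean list of sub-queries."""
    queries = []
    for line in llm_output.splitlines():
        # Drop leading list markers such as "1.", "2)", "-" or "*".
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned:  # skip blank lines
            queries.append(cleaned)
    return queries


raw = (
    "1. Does the phone come with a warranty?\n"
    "\n"
    "2) Is the device covered by a guarantee?\n"
    "- What warranty applies to the phone?\n"
)
sub_queries = parse_sub_queries(raw)
print(sub_queries)
```

A function like this could replace the final lambda in the chain, since LangChain accepts a plain function as a step when it is piped after a runnable.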

## 🔍 Multi-Query Retrieval Chain

After generating multiple rephrasings of the user's question, this chain is responsible for retrieving documents based on each version.

### 🚀 Steps:
1. **Generate Sub-Queries**  
   The original question is transformed into 5 alternate forms using `multi_query_chain`.

2. **Parallel Retrieval**  
   Each sub-query is passed to the retriever (`retriever.map()`), which fetches the most relevant chunks for each.

3. **Deduplication**  
   The retrieved results are combined using `get_unique_union` to eliminate duplicate documents.

### 🎯 Result:
A clean, merged list of **diverse but relevant context documents**, ready to be passed to the final generation step.


In [None]:
# 🔍 Multi-Query Retrieval Chain
# This chain takes the sub-queries generated by the multi_query_chain and performs the following steps:
#   1. multi_query_chain — generates 5 rephrased versions of the original question
#   2. retriever.map() — runs retrieval for each sub-query in parallel
#   3. get_unique_union — flattens and deduplicates all retrieved chunks into one clean list of documents

multi_query_retrieval_chain = multi_query_chain | retriever.map() | get_unique_union
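
Conceptually, `retriever.map()` applies the retriever to every sub-query and returns one result list per sub-query, which is why the next step has to flatten a list of lists. The sketch below imitates that shape with a toy keyword retriever; `CORPUS`, `tokens`, and `keyword_retrieve` are hypothetical and stand in for the FAISS-backed retriever.

```python
import re

CORPUS = [
    "Apple offers a one-year limited warranty.",
    "Batteries should be recycled separately from household waste.",
    "See iPhone User Guide for safety information.",
]


def tokens(text: str) -> set[str]:
    """Lowercased alphabetic words of a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_retrieve(query: str, k: int = 1) -> list[str]:
    """Toy retriever: rank chunks by the number of words shared with the query."""
    q = tokens(query)
    ranked = sorted(CORPUS, key=lambda c: len(q & tokens(c)), reverse=True)
    return [c for c in ranked[:k] if q & tokens(c)]


queries = [
    "Is there a warranty?",
    "Does Apple offer a warranty?",
    "How do I recycle batteries?",
]

# Same shape as the output of retriever.map(): one list of chunks per sub-query.
nested = [keyword_retrieve(q) for q in queries]

# Flatten and deduplicate while keeping first-seen order.
unique = list(dict.fromkeys(chunk for sub in nested for chunk in sub))
print(len(nested), len(unique))
```

Two of the three sub-queries return the same chunk, so three result lists collapse to two unique chunks.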

In [None]:
# Quick check: run the retrieval chain on a sample question and
# count how many unique chunks come back.

# Retrieve
question = "Is there a warranty on the phone?"
docs = multi_query_retrieval_chain.invoke({"question":question})
len(docs)


5

## 🧠 Final RAG Chain – Context + Answer Generation

This is the last stage in the **Multi-Query RAG** pipeline.  
It takes the original question and the merged context retrieved from sub-queries and generates a final answer.

### 📦 Components:

- **`multi_query_retrieval_chain`**: Provides a deduplicated list of relevant chunks
- **`itemgetter("question")`**: Selects the original user query for use in the prompt
- **`rag_prompt`**: Structured template that instructs the model how to use the context
- **`open_ai_llm`**: Azure-hosted GPT model for generating the response
- **`StrOutputParser()`**: Cleans up the output and returns the final answer as a string

This chain ensures that the final answer is:
- Grounded in retrieved documents
- Context-aware
- Polite, accurate, and brand-aligned (as per the prompt)

In [None]:
# 🧠 Final RAG Chain – Answer Generation Step
# This chain combines the user's original question with the deduplicated retrieved context,
# feeds it into the final prompt, and uses the LLM to generate a grounded answer.

final_rag_chain = (
    {
        "context": multi_query_retrieval_chain,    # 🔍 Retrieved and deduplicated context from multi-query RAG
        "question": itemgetter("question")         # 💬 Original user question (unchanged)
    } 
    | rag_prompt                                   # 📝 Custom prompt template guiding the LLM
    | open_ai_llm                                  # 🤖 Azure OpenAI model for response generation
    | StrOutputParser()                            # 🧾 Parse the final answer as plain text
)
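
In the chain above, `context` receives a list of `Document` objects, so as far as I can tell the prompt is filled with their string representation, metadata included. A common variation is to join only the page text before it reaches the prompt. The sketch below shows that idea with a hypothetical `Doc` stand-in and a hypothetical `format_docs` helper; it is not part of the notebook.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: list[Doc], separator: str = "\n\n") -> str:
    """Join the text of each chunk, leaving the metadata out of the prompt."""
    return separator.join(doc.page_content.strip() for doc in docs)


docs_sample = [
    Doc("Apple offers a one-year limited warranty. ", {"page": 2}),
    Doc("Service is available at Apple Stores.", {"page": 3}),
]
context_text = format_docs(docs_sample)
print(context_text)
```

With real `Document` objects, a helper like this could be piped after the retrieval chain (for example `multi_query_retrieval_chain | format_docs`), which trims tokens spent on metadata.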

In [17]:
response = final_rag_chain.invoke({"question":question})

print('📦 Answer:', response)

📦 Answer: Yes, there is a warranty on the phone. Apple offers a One-Year Limited Warranty that covers defects in materials and workmanship for one year from the date of original retail purchase. However, this warranty does not cover normal wear and tear or damage caused by accident or abuse. If you need service, you can call Apple or visit an Apple Store or an Apple Authorized Service Provider. For more detailed information, you can visit apple.com/legal/warranty.


## 🤗 Using Hugging Face for Embeddings & Generation

This RAG pipeline is powered by **open-source Hugging Face models** for both embeddings and text generation.

---

### 🔡 Embeddings
We use `HuggingFaceEmbeddings` (e.g., from `sentence-transformers`) to convert documents and queries into dense vector representations.

- 📦 Common model: `sentence-transformers/all-mpnet-base-v2`
- ⚙️ Device support: `cpu` or `cuda`
- 🧮 These vectors are used for semantic search via FAISS

---

### 🧠 Language Model
Text generation is handled using Hugging Face LLMs (e.g., `gemma-2b-it`, `mistral`, `phi`, etc.), loaded via `AutoModelForCausalLM`.

- 💬 Accepts original query + retrieved context to generate a grounded response
- 🔁 Wrapped using `HuggingFacePipeline` for LangChain compatibility
- 🎛️ Supports advanced tuning (`temperature`, `top_p`, etc.)

---

### 💸 Cost-Effective & Open Source

Using Hugging Face models offers major advantages:

- ✅ **Free and open-source**: No pay-per-token costs
- ✅ **Runs locally**: No API keys, no vendor lock-in
- ✅ **Budget-friendly**: Great for research, startups, and personal projects
- ✅ **Customizable**: You can fine-tune or quantize for your specific use case

> 💡 Hugging Face models **keep costs low** while offering full control and flexibility for local, private, or large-scale deployments.

In [18]:
huggingface_embeddings = Embeddings.huggingface()


In [19]:
vectorstore = FAISS.from_documents(
    texts,
    huggingface_embeddings
)

In [20]:
retriever = vectorstore.as_retriever()

In [21]:
# Reuse the multi-query prompt template defined above;
# only the embeddings and the LLM change to Hugging Face models here.

In [None]:
huggingface_llm = LLMCall.huggingface()

In [None]:
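# Split on newlines and drop blank lines so no empty sub-query reaches the retriever
multi_query_chain = m_query_prompt | huggingface_llm | StrOutputParser() | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])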
multi_query_retrieval_chain = multi_query_chain | retriever.map() | get_unique_union

In [None]:
final_rag_chain = (
    {"context": multi_query_retrieval_chain, 
     "question": itemgetter("question")} 
    | rag_prompt
    | huggingface_llm
    | StrOutputParser()
)

In [25]:
response = final_rag_chain.invoke({"question":question})

print('📦 Answer:', response)

📦 Answer: Yes, there is a warranty on the phone. Apple offers a one-year limited warranty that covers defects in materials and workmanship for the included hardware product and accessories from the date of original retail purchase. However, this warranty does not cover normal wear and tear or damage caused by accident or abuse. If you need to obtain service, you can call Apple or visit an Apple Store or an Apple Authorized Service Provider. For more detailed information, you can visit apple.com/legal/warranty. If you have any further questions, feel free to ask!


## 🦙 Using Ollama for Local LLM Inference

In this pipeline, we're combining:

- 🤗 **Hugging Face Embeddings** (for vector similarity search)
- 🦙 **Ollama** (for fast, local LLM generation)

---

### 🔡 Embeddings
We continue to use `HuggingFaceEmbeddings` (e.g., `all-mpnet-base-v2`) for encoding both:
- Document chunks
- User queries

These embeddings are stored and queried using **FAISS**.

---

### 🧠 Language Model via Ollama

Instead of using models from OpenAI or Hugging Face Hub, we run a **locally downloaded LLM** using [Ollama](https://ollama.com/).

- ✅ In this example, we use the **`llama3`** model (v3.2)
- ✅ Ollama runs the model **on your local machine** using GPU or CPU
- ✅ It's fully private and doesn't incur per-token costs

---

### 📥 To Run with a Different Model

Ollama supports various models like `mistral`, `gemma`, `phi`, etc.  
To download and use another model:

```bash
ollama run <model_name>
```


In [26]:
ollama_llm = LLMCall.chat_ollama()

In [27]:
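# Split on newlines and drop blank lines so no empty sub-query reaches the retriever
multi_query_chain = m_query_prompt | ollama_llm | StrOutputParser() | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])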
multi_query_retrieval_chain = multi_query_chain | retriever.map() | get_unique_union

In [28]:
final_rag_chain = (
    {"context": multi_query_retrieval_chain, 
     "question": itemgetter("question")} 
    | rag_prompt
    | ollama_llm
    | StrOutputParser()
)

In [34]:
response = final_rag_chain.invoke({"question":question})

print('📦 Answer:', response)

📦 Answer: Yes, there is a warranty on the phone. According to the information provided, Apple offers a one-year limited warranty on the hardware product and accessories against defects in materials and workmanship from the date of original retail purchase. You can find more detailed information on obtaining service and the full terms of the warranty at apple.com/legal/warranty and support.apple.com.


## ⚡ Using Groq Inference API

In this example, we use the **Groq Inference API** to power the language generation step with very fast LLM inference.

---

### 🔑 API Key Access

To use Groq, you'll need an API key from [console.groq.com](https://console.groq.com).  
This key allows access to hosted LLMs via their **low-latency inference platform**.

---

### 🧠 Model in Use

We're currently using:

```
llama-3.3-70b-versatile
```

Also referred to here as:

```
llama-3.3
```

This model is known for its large context window and versatility across general-purpose tasks.

---

### 🔄 Switching Models

If you'd like to use a different model from Groq's lineup (like `mixtral-8x7b`, `gemma-7b-it`, etc.):

- ✅ Option 1: Pass the model name into the function call directly (this assumes your `chat_groq()` method accepts a `model` argument):

  ```python
  llm = LLMCall.chat_groq(model="mixtral-8x7b")
  ```

- ✅ Option 2: Update it in your custom logic:

  Modify the model name inside the `chat_groq()` method of your `LLMCall` class in `llm_call.py`.

---

### ⚡ Why Use Groq?

- 🚀 Extremely fast inference speeds
- ✅ No need to host your own models
- 💡 Ideal for low-latency, production-grade deployments

> 💡 You're still using local/affordable embeddings (e.g. Hugging Face), but generation is powered by high-speed, remote Groq-hosted LLMs.

In [30]:
groq_llm = LLMCall.chat_groq()

In [31]:
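# Split on newlines and drop blank lines so no empty sub-query reaches the retriever
multi_query_chain = m_query_prompt | groq_llm | StrOutputParser() | (lambda x: [q.strip() for q in x.split("\n") if q.strip()])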
multi_query_retrieval_chain = multi_query_chain | retriever.map() | get_unique_union

In [32]:
final_rag_chain = (
    {"context": multi_query_retrieval_chain, 
     "question": itemgetter("question")} 
    | rag_prompt
    | groq_llm
    | StrOutputParser()
)

In [33]:
response = final_rag_chain.invoke({"question":question})

print('📦 Answer:', response)

📦 Answer: Yes, there is a warranty on the phone. According to the Apple One-Year Limited Warranty Summary, Apple warrants the included hardware product and accessories against defects in materials and workmanship for one year from the date of original retail purchase. You can find more detailed information on obtaining service at apple.com/legal/warranty and support.apple.com.


<!-- Font Awesome CDN (Add in <head> if not already included) -->
<link
  rel="stylesheet" 
  href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.5.0/css/all.min.css"
/>

<!-- Social Footer Section -->
<div style="
  background-color:rgb(199, 195, 195);
  padding: 40px 30px;
  border-radius: 20px;
  box-shadow: 0 4px 12px rgba(0,0,0,0.08);
  font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
  font-size: 18px;
  max-width: 900px;
  margin: 60px auto 30px;
  text-align: center;
  color: #444;
">
<!-- End of Notebook Note -->
  <h2 style="margin-bottom: 10px;">📘 End of Notebook</h2>
  <p style="color: #666; font-size: 14px;">
    Thank you for exploring! Feel free to connect via the links below.
  </p>

  <!-- Social Icons -->
<div style="
  display: flex;
  gap: 25px;
  align-items: center;
  flex-wrap: wrap;
  justify-content: center;
  margin-bottom: 25px;
">
  <!-- LinkedIn -->
  <a href="https://www.linkedin.com/in/ChiragB254" target="_blank" style="text-decoration: none; color: #0077b5;">
    <i class="fab fa-linkedin fa-lg"></i> LinkedIn
  </a>

  <!-- GitHub -->
  <a href="https://github.com/ChiragB254" target="_blank" style="text-decoration: none; color: #333;">
    <i class="fab fa-github fa-lg"></i> GitHub
  </a>

  <!-- Instagram -->
  <a href="https://www.instagram.com/data.scientist_chirag" target="_blank" style="text-decoration: none; color: #E1306C;">
    <i class="fab fa-instagram fa-lg"></i> Instagram
  </a>

  <!-- Email -->
  <a href="mailto:devchirag27@gmail.com" style="text-decoration: none; color: #D44638;">
    <i class="fas fa-envelope fa-lg"></i> Email
  </a>

  <!-- X (Twitter) -->
  <a href="https://x.com/ChiragB254" target="_blank" style="text-decoration: none; color: #000;">
    <i class="fab fa-x-twitter fa-lg"></i> X.com
  </a>
  </div>

  <p style="font-size: 13px; color: black; font-style: italic; margin-top: 8px;">
    <strong>Made with ❤️ by Chirag Bansal</strong>
  </p>
</div>

