# RAG Assignment Part b
In this assignment, I implemented a complete Retrieval-Augmented Generation (RAG) pipeline using LangChain and Milvus for semantic search, and OpenAI’s LLM for question answering. The process begins by loading a PDF file with PyPDFLoader and splitting it into manageable text chunks with RecursiveCharacterTextSplitter. The splitter breaks the document into chunks of at most chunk_size characters (1000 here) with a chunk_overlap of 50 characters, so text that straddles a chunk boundary still appears with some of its surrounding context.
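
To make the role of chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not RecursiveCharacterTextSplitter itself, which also tries to split on separators such as paragraphs and sentences; the helper name `chunk_text` is hypothetical and only illustrates the sliding window.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # 2500 characters of dummy text
chunks = chunk_text(sample, chunk_size=1000, chunk_overlap=50)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])  # the tail of one chunk is the head of the next
```

The last 50 characters of each chunk are repeated at the start of the next one, which is what keeps a sentence cut at a boundary readable in at least one chunk.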

Next, I created a vector store in the Milvus vector database, using OpenAIEmbeddings to convert the text chunks into 768-dimensional embedding vectors (text-embedding-3-large with dimensions=768). The store is rebuilt for each index type (FLAT, IVF_FLAT, and HNSW) to compare search algorithms. For each one, I added the embedded documents to Milvus and ran a similarity search to retrieve the top k most relevant chunks for a user query. The retrieval step also reports the retrieval time and the average score as performance metrics. Because the index uses the L2 metric, each score is a distance, so lower values mean closer matches.
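
As a rough illustration of what a FLAT (brute-force) search does and how an L2 score relates to similarity, the sketch below ranks a few toy vectors by squared L2 distance. The helper names are hypothetical, the vectors are made up, and the conversion to cosine similarity assumes unit-length vectors and a squared-distance score; I have not verified either assumption against the scores Milvus returned in this notebook.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def l2_squared(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(query, vectors, k=2):
    """Brute force: score every vector, return the k smallest distances with their indices."""
    scored = [(l2_squared(query, v), i) for i, v in enumerate(vectors)]
    return sorted(scored)[:k]


vectors = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]  # toy data
query = normalize([1.0, 0.1])
hits = flat_search(query, vectors, k=2)

for distance, index in hits:
    # For unit vectors: squared L2 distance = 2 - 2 * cosine similarity
    print(index, round(distance, 4), round(1 - distance / 2, 4))
```

The smallest distance comes first, which is the opposite ordering from a cosine-similarity score where larger is better.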

To improve the quality of the retrieved documents, I applied Maximum Marginal Relevance (MMR) reranking, which balances relevance and diversity through the parameter lambda_mult (values near 1 favor relevance, values near 0 favor diversity; I used 0.5). This is done by converting the Milvus store into a retriever with search_type="mmr", which fetches a larger candidate set (fetch_k=10) and then selects from it the k=5 documents that are relevant to the query while differing from each other.
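
The greedy selection behind MMR can be sketched in a few lines. This is a simplified, pure-Python illustration with made-up 2-D vectors and a hypothetical `mmr_select` helper, not the code LangChain runs internally; it assumes cosine similarity for both the relevance and the redundancy terms.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k candidate indices, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy: similarity to the closest document already selected
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different
candidates = [[1.0, 0.05], [1.0, 0.06], [0.6, 0.8]]

print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select(query, candidates, k=2, lambda_mult=0.3))  # diversity weighted more
```

With lambda_mult=1.0 the two near-duplicates are chosen, while a lower value swaps the second one for the more distinct candidate.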

I then passed the MMR retriever and the query to a RetrievalQA chain backed by an OpenAI LLM (temperature set to 0 to keep responses as repeatable as possible). The chain fetches the context through the retriever, and the model generates a natural-language answer from it. Finally, the generated answer was saved in a .docx file using the python-docx library, with a separate output file for each index type to support comparison.

Overall, this assignment helped me understand and implement each stage of a RAG system: document preprocessing, vectorization, semantic search, reranking, LLM answering, and result saving. Together, these stages form an end-to-end approach to building question-answering systems over unstructured documents.

In [1]:
import time
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# OpenAI (the completion LLM) now lives in langchain_openai; langchain.llms is deprecated
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_milvus import Milvus
from langchain.chains import RetrievalQA
from docx import Document


# 1. Load and split PDF
def load_and_split_pdf(file_path, max_pages=200, chunk_size=1000, chunk_overlap=50):
    loader = PyPDFLoader(file_path)
    pages = loader.load()[:max_pages]

    splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    split_docs = splitter.split_documents(pages)

    return split_docs


# 2. Create vector store and add documents
def create_vector_store(index_type, split_docs, embeddings):
    store = Milvus(
        embedding_function=embeddings,
        connection_args={"uri": "./milvus_example01.db"},
        index_params={"index_type": index_type, "metric_type": "L2"},
        auto_id=True,
        drop_old=True
    )
    store.add_documents(split_docs)
    return store


# 3. Perform similarity search and return results + metrics
def perform_similarity_search(store, query, k=5):
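    # perf_counter is a monotonic clock, better suited to timing than time.time
    start = time.perf_counter()
    results = store.similarity_search_with_score(query, k=k)
    elapsed = time.perf_counter() - start

    print(f"Retrieval time: {elapsed:.4f} seconds")
    # With metric_type="L2" each score is a distance: lower means a closer match
    for i, (doc, score) in enumerate(results):
        print(f"Doc {i+1}: score = {score:.4f}")

    # Guard against an empty result set to avoid dividing by zero
    avg_score = sum(score for _, score in results) / len(results) if results else float("nan")
    print(f"Average similarity score: {avg_score:.4f}")

    return results, elapsed, avg_score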


# 4. Apply MMR reranking
def rerank_with_mmr(store, query, k=5, fetch_k=10, lambda_mult=0.5):
    retriever = store.as_retriever(
        search_type="mmr",
        search_kwargs={"k": k, "fetch_k": fetch_k, "lambda_mult": lambda_mult}
    )
    # invoke() replaces the deprecated get_relevant_documents()
    reranked_docs = retriever.invoke(query)
    print("\nMMR Reranked Docs:")
    for i, doc in enumerate(reranked_docs):
        print(f"Doc {i+1} Preview: {doc.page_content[:100]}...")
    return retriever, reranked_docs


# 5. Answer with LLM
def generate_llm_answer(retriever, query):
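    qa = RetrievalQA.from_chain_type(llm=OpenAI(temperature=0), retriever=retriever)
    # invoke() replaces the deprecated run(); the answer is under the "result" key
    answer = qa.invoke({"query": query})["result"]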
    print("\nLLM Answer:", answer)
    return answer


# 6. Save answer to DOCX
def save_to_docx(answer, filename):
    docx = Document()
    docx.add_heading("Answer from LLM", 0)
    docx.add_paragraph(answer)
    docx.save(filename)


# === Main runner function ===
def run_pipeline(file_path, query):
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=768)
    split_docs = load_and_split_pdf(file_path)

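    # Note: with a local .db URI, Milvus rejected HNSW in this run
    # (local mode only supports FLAT, IVF_FLAT, and AUTOINDEX)
    index_types = ["FLAT", "IVF_FLAT", "HNSW"]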
    for index_type in index_types:
        print(f"\n===== Testing Index Type: {index_type} =====")

        store = create_vector_store(index_type, split_docs, embeddings)
        perform_similarity_search(store, query)
        retriever, _ = rerank_with_mmr(store, query)
        answer = generate_llm_answer(retriever, query)
        save_to_docx(answer, f"answer_{index_type}.docx")


# === Run with actual inputs ===
if __name__ == "__main__":
    run_pipeline(
        file_path="/Users/ankita/Documents/Krish Naik Academy/Agentic Batch 2/RAG Assignment/ISLR.pdf",
        query="What is regularization in machine learning?"
    )



===== Testing Index Type: FLAT =====
Retrieval time: 0.5627 seconds
Doc 1: score = 1.0782
Doc 2: score = 1.1332
Doc 3: score = 1.1369
Doc 4: score = 1.1412
Doc 5: score = 1.1497
Average similarity score: 1.1278





MMR Reranked Docs:
Doc 1 Preview: 2.1 What Is Statistical Learning? 21
Y ears of Education Seniority
Income
FIGURE 2.4. A linear model...
Doc 2 Preview: rest.
Comparison to Logistic Regression
As a comparison, we can also fit a logistic regression model...
Doc 3 Preview: learning method increases, we observe a monotone decrease in the training
MSE and aU-shape in the te...
Doc 4 Preview: shortly bygeneralized additive models. Neural networksgained popularity
in the 1980s, andsupport vec...
Doc 5 Preview: known as ageneralized linear model(GLM). Thus, linear regression, logisticgeneralized
linear modelre...





LLM Answer:  Regularization in machine learning is a technique used to prevent overfitting in a model. It involves adding a penalty term to the cost function, which penalizes large values of the model parameters. This helps to reduce the complexity of the model and prevent it from fitting too closely to the training data, which can lead to poor performance on new data. Regularization is commonly used in linear regression, logistic regression, and other models to improve their generalization ability.

===== Testing Index Type: IVF_FLAT =====
Retrieval time: 0.5309 seconds
Doc 1: score = 1.0785
Doc 2: score = 1.1332
Doc 3: score = 1.1368
Doc 4: score = 1.1415
Doc 5: score = 1.1497
Average similarity score: 1.1279

MMR Reranked Docs:
Doc 1 Preview: 2.1 What Is Statistical Learning? 21
Y ears of Education Seniority
Income
FIGURE 2.4. A linear model...
Doc 2 Preview: rest.
Comparison to Logistic Regression
As a comparison, we can also fit a logistic regression model...
Doc 3 Preview: learn

RPC error: [create_index], <MilvusException: (code=65535, message=invalid index type: HNSW, local mode only support FLAT IVF_FLAT AUTOINDEX: )>, <Time:{'RPC start': '2025-06-10 00:13:50.024840', 'RPC error': '2025-06-10 00:13:50.025244'}>


Retrieval time: 0.3492 seconds
Doc 1: score = 1.0777
Doc 2: score = 1.1353
Doc 3: score = 1.1368
Doc 4: score = 1.1412
Doc 5: score = 1.1488
Average similarity score: 1.1280

MMR Reranked Docs:
Doc 1 Preview: 2.1 What Is Statistical Learning? 21
Y ears of Education Seniority
Income
FIGURE 2.4. A linear model...
Doc 2 Preview: rest.
Comparison to Logistic Regression
As a comparison, we can also fit a logistic regression model...
Doc 3 Preview: learning method increases, we observe a monotone decrease in the training
MSE and aU-shape in the te...
Doc 4 Preview: shortly bygeneralized additive models. Neural networksgained popularity
in the 1980s, andsupport vec...
Doc 5 Preview: known as ageneralized linear model(GLM). Thus, linear regression, logisticgeneralized
linear modelre...

LLM Answer:  Regularization in machine learning is a technique used to prevent overfitting in a model. It involves adding a penalty term to the cost function, which penalizes large values of the model parameter

Key Points:
* When I used a chunk size below 1000, I encountered an error. I need to investigate and resolve this issue.
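* TokenTextSplitter was not used in this implementation; chunking relied solely on RecursiveCharacterTextSplitter.
* Among the three runs, the last one (requested as HNSW) had the fastest retrieval time (0.3492 s), followed by IVF_FLAT (0.5309 s) and FLAT (0.5627 s). However, Milvus rejected the HNSW index type in local mode (only FLAT, IVF_FLAT, and AUTOINDEX are supported), so that timing does not measure a real HNSW index, and each figure comes from a single run.
* The average scores were nearly identical across all runs (1.1278 to 1.1280), indicating consistent retrieval quality. These scores are L2 distances, so lower values mean closer matches.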
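
Since a single timing per index type is easily skewed by caching and connection setup, a fairer comparison would repeat each search and report the median. The helper below is a hypothetical sketch of that idea (the name `time_call` and its parameters are my own); it is demonstrated on `sorted` so it runs without Milvus, but the same call could wrap `store.similarity_search_with_score`.

```python
import statistics
import time


def time_call(fn, *args, repeats=5, warmup=1, **kwargs):
    """Run fn several times and return (last result, median seconds, all durations)."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # warm-up runs are not timed
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return result, statistics.median(durations), durations


# Stand-in workload; in the pipeline this would be the similarity search call
result, median_s, runs = time_call(sorted, [3, 1, 2], repeats=5)
print(result, f"median = {median_s:.6f} s over {len(runs)} runs")
```

The median is less sensitive than the mean to one unusually slow run, which matters when the differences between index types are only a few tenths of a second.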