In [1]:
## Install Dependencies
!pip install -q -U openai langchain-community chromadb tiktoken pypdf


In [2]:
## Mount Google Drive at "gdrive"
from google.colab import drive
drive.mount('gdrive', force_remount=True)

import warnings
warnings.filterwarnings('ignore')

# Importing necessary libraries
import openai
import os

root_dir = os.path.join('/content', 'gdrive', 'MyDrive')

open_api_key_filepath = os.path.join(root_dir, 'OpenAI_API_Key.txt')

if os.path.isfile(open_api_key_filepath):
  with open(open_api_key_filepath, mode='r', encoding='utf-8') as f:
    openai.api_key = f.read().strip()
    os.environ['OPENAI_API_KEY'] = openai.api_key
else:
  raise FileNotFoundError('Error: API Key filepath is invalid!')

Mounted at gdrive


## Problem Statement

---

The goal of this project is to build a robust generative search system that answers questions accurately from a set of insurance policy documents. In short, we are building a Generative Search System for Insurance Policy Documents using LangChain.

## Why is LangChain the ideal framework for this project?

---

1. It enables **retrieval-augmented generation (RAG)**, ensuring more accurate answers based on policy documents.
2. Provides built-in **document loaders**, **embedding models**, **vector databases**, and **retrievers** for efficient search.
3. Supports **chaining multiple LLM operations**, enabling structured query handling.
4. Works well with **PDF processing**, allowing us to extract information from complex insurance documents.

## System Design

---

We need to create an optimal architecture for our system. Here’s a proposed high-level design:

- **Input**: User queries about an insurance policy.
- **Processing**:
  - Load and process PDF insurance documents.
  - Chunk documents for efficient retrieval.
  - Generate embeddings and store them in a vector database.
  - Retrieve relevant chunks and re-rank results.
  - Use an LLM (GPT-based) to generate final answers.
- **Output**: A **reliable response** based on relevant insurance policy text.


## Code Implementation

---

We will follow structured coding with clear documentation at each step:

- **Step 1**: Load & Process PDFs (`PyPDFDirectoryLoader` or `PyPDFLoader`)
- **Step 2**: Chunk Documents (`RecursiveCharacterTextSplitter`)
- **Step 3**: Generate Embeddings (`OpenAIEmbeddings` via LangChain) and cache them using `CacheBackedEmbeddings`
- **Step 4**: Store in Vector DB (`ChromaDB`)
- **Step 5**: Re-rank with `HuggingFaceCrossEncoder`, Implement Retrievers (`ContextualCompressionRetriever`)
- **Step 6**: Build LangChain Pipeline (`RAG Chain`)

## Documentation

---

We will ensure the project has:

1. **Detailed markdown documentation** alongside the code.
2. **A flowchart** explaining the system architecture.
3. **A README.md** file with installation, setup, and usage instructions.

### LLM to use

---

**OpenAI GPT**, integrated into the retrieval system.

### Specific insurance policy documents or formats we need to handle

---

**PDF**

### Are these documents structured (tables, sections, key-value pairs) or unstructured (plain text)?

---

**Unstructured**

### Where will your PDFs be stored?

---

**Google Drive**

## **Step 1: Loading & Processing PDFs (Extract and Process Text).**

1. Use **LangChain's** `PyPDFLoader` to read each PDF in a folder, one file at a time. The PDFs live on Google Drive and are read through the mounted drive.
2. Since the PDFs are unstructured, I extract raw text rather than structured fields and clean it by replacing line breaks with spaces and trimming surrounding whitespace.
3. Display a preview (the first 300 characters of each of the first three pages) to validate the extracted text.
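
The cleaning step in item 2 can be sketched with the standard library alone. This is a minimal illustration, not the notebook's exact code: `clean_page_text` and the sample string are hypothetical, and the sketch collapses every run of whitespace, which is slightly more aggressive than only replacing line breaks. Either way, paragraph boundaries are lost once line breaks are removed, which matters for any later splitter that looks for `"\n\n"`.

```python
import re

def clean_page_text(raw: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    return re.sub(r"\s+", " ", raw).strip()

# Hypothetical raw page text, shaped like typical PDF extraction output
sample = "F&U dated 15th October 2022   \n  UIN-101N169V02 \n\n P a g e  | 0"

# Preview the first 300 characters of the cleaned text
preview = clean_page_text(sample)[:300]
print(preview)
```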

---


I originally planned to use `PyPDFDirectoryLoader`, which is **designed to load all PDFs from a directory at once**. However, the current code is using `PyPDFLoader`, which **loads PDFs individually**.

---


### Which One Should be Used?
- **PyPDFDirectoryLoader**: Loads all PDFs at once, reducing the need for manual file iteration. More efficient for bulk processing.
- **PyPDFLoader**: Loads PDFs one at a time, useful for custom per-file handling. Gives more control over individual document processing.

---


I recommend using `PyPDFLoader` as it offers better control over individual PDF files while ensuring proper data cleaning. The script extracts **source**, **page number**, and **page content** from each document and displays a preview. The preview will show the first three pages per PDF.

In [3]:
## Import the PyPDFLoader class (provided by the langchain-community package installed above)
from langchain_community.document_loaders import PyPDFLoader

In [4]:
# Path to the folder containing all 7 PDFs (Update this based on actual path)
pdf_folder = os.path.join(root_dir, 'Policy+Documents')

# List all PDF files in the folder
pdf_files = [f for f in os.listdir(pdf_folder) if f.endswith(".pdf")]

# Function to extract and clean text from PDFs
def extract_text_from_pdfs(pdf_files, pdf_folder):
  all_documents = {}

  for pdf in pdf_files:
    pdf_path = os.path.join(pdf_folder, pdf)
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()

    pdf_data = []
    for page in pages:
      text = page.page_content.replace("\n", " ").strip()
      page_info = {
          "page_number": page.metadata.get("page", "Unknown"),
          "page_content": text
      }
      pdf_data.append(page_info)

    all_documents[pdf] = pdf_data

  return all_documents

# Extract text from all PDFs
documents = extract_text_from_pdfs(pdf_files, pdf_folder)

# Display preview of extracted data
for pdf, pages in documents.items():
  print(f"Source: {pdf}")
  for page in pages[:3]:  # Show first 3 pages as preview
    print(f"  Page: {page['page_number']}")
    print(f"  Content Preview: {page['page_content'][:300]}")  # Show first 300 characters
    print("  " + "-" * 80)
  print('\n')

Source: HDFC-Life-Group-Term-Life-Policy.pdf
  Page: 0
  Content Preview: F&U dated 15th October 2022                  UIN-101N169V02  P a g e  | 0                                      HDFC Life Group Term Life    OF      «OWNERNAME»               Based on the Proposal and the declarations and  any  statement made or referred to therein,  We will pay the Benefits mentione
  --------------------------------------------------------------------------------
  Page: 1
  Content Preview: F&U dated 15th October 2022                  UIN-101N169V02  P a g e  | 1                             PART A: Covering Letter with Policy Schedule                                                                                                                                                  <dd-mm-y
  --------------------------------------------------------------------------------
  Page: 2
  Content Preview: F&U dated 15th October 2022                  UIN-101N169V02  P a g e  | 2                           

## **Step 2: Document Chunking (Chunking the text to make it suitable for retrieval and embedding)**

---


### Why Chunking?
Since LLMs have a token limit, we must break down long insurance policy documents into meaningful chunks for:
1. Efficient retrieval – Smaller chunks ensure relevant portions are retrieved instead of entire documents.
2. Better embeddings – Improves vector search by ensuring contextually relevant text is stored.

---

### Explanation:
Chunking is a critical step in building a generative search system because it optimizes how information is stored and retrieved. Here’s why it matters:

1. **Chunking for Retrieval**: When a user asks a question, the system needs to find **relevant portions** of the document instead of searching the entire text.
  - **Without chunking**: Searching across a large document may return **irrelevant** or **less precise** results.
  - **With chunking**: Breaking text into **smaller**, **meaningful sections** makes it easier to retrieve precise answers.
  - **Example**: Instead of retrieving an entire 30-page insurance document, we can retrieve only the **most relevant policy section**.

2. **Chunking for Embeddings**: Embeddings convert text into a **numerical representation** so that similar text can be compared using vector search.
  - **Without chunking**: A long document can exceed the **embedding model's input token limit**, and a single vector for the whole document blends many topics together, which **reduces retrieval accuracy**.
  - **With chunking**: Each small chunk is embedded separately, making **search and comparison faster and more precise**.
  - **Example**: A **1000-character chunk** ensures that embeddings capture local context while remaining efficient for retrieval.

---


## How This Helps Our Insurance Search System
- **Retrieves only the necessary policy sections** when answering a user’s question.
- **Improves accuracy** by keeping embeddings focused on specific concepts (e.g., claims process, coverage exclusions).
- **Handles large documents efficiently**, ensuring the system can scale to multiple policies.

---

## How to Implement Chunking
Use **LangChain’s** `RecursiveCharacterTextSplitter`, which:
1. **Preserves semantic meaning**: Tries to split at the paragraph level (`"\n\n"`) first, then at line breaks (`"\n"`), then at words (`" "`).
2. **Maintains continuity within a bounded chunk size**: Chunk **size and overlap** can be tuned for better retrieval performance, and the overlap prevents loss of context between consecutive chunks.
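
To make the "coarsest separator first" idea concrete, here is a simplified standard-library sketch. It is not LangChain's implementation: `recursive_split` is a hypothetical helper, it ignores chunk overlap, and it drops the separator where it splits.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator; recurse with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # Separator not present, so try the next finer one
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # This piece alone is too long: split it with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

text = "Para one is short.\n\nPara two is a bit longer than the limit allows."
chunks = recursive_split(text, chunk_size=30)
print(chunks)
```

The short paragraph survives intact, while the long one is broken at word boundaries rather than mid-word.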


---

## Here's why I didn't use other splitters (or other ways to split text using LangChain):

1. **Why Not `CharacterTextSplitter`?**
  - Splits text on **a single separator** (e.g., `"\n"` or `" "`), so chunk boundaries depend entirely on where that one separator occurs, which can lead to poor chunk coherence.
  - **Example**: If splitting by `"\n"`, it may create oversized or uneven chunks when the text doesn't have enough line breaks.
2. **Why Not `TokenTextSplitter`?**
  - Splits by **number of tokens** (useful for LLM-specific tokenization like OpenAI's tiktoken).
  - This splitter **breaks text purely based on token count**, which is efficient but often disrupts contextual integrity (e.g., breaking mid-sentence).
  - It **doesn’t prioritize creating logical/meaningful pieces of text** (e.g., paragraph or sentence), making retrieval less reliable.
3. **Why Not `NLTKTextSplitter`?**
  - The **NLTK sentence tokenizer** works well on clean, well-punctuated text but may break long texts unpredictably.
  - It splits by sentence, so it may **over-split** long content like insurance documents.
  - **NLTK-based splitting** can fail on PDFs with inconsistent formatting.
4. **Why Not `SpacyTextSplitter`?**
  - Leverages **spaCy's NLP models** for sentence splitting, which is useful for NLP-heavy tasks.
  - **spaCy-based splitting** adds unnecessary processing overhead for this use case.

---

## Why Use `RecursiveCharacterTextSplitter` Instead of Other Splitters?
I chose `RecursiveCharacterTextSplitter` because it **prioritizes preserving the document’s semantic structure** while breaking it into chunks.
- It **recursively tries multiple split levels** (paragraph → line → word), ensuring the best **logical/meaningful** chunking possible.
- It **minimizes broken/lost context**, leading to **better retrieval and embeddings** for our generative search system.
- The **chunk overlap** (e.g., 200 characters) ensures continuity, preventing loss of crucial details.

---


### **Code Implementation**
Document chunking using `RecursiveCharacterTextSplitter` with:
1. **Chunk Size**: 1000 characters per chunk
  - The splitter measures length in characters by default, not tokens
2. **Chunk Overlap**: 200 characters
  - Helps maintain continuity across chunks for better retrieval
  - Prevents information loss at chunk boundaries
3. **Preview**: First 3 chunks from each PDF
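
The arithmetic behind these two settings is easiest to see with a plain sliding window. The sketch below is illustrative only: `sliding_chunks` is a hypothetical helper that cuts at fixed character offsets, whereas the real splitter prefers separator boundaries. With a size of 1000 and an overlap of 200, each new chunk starts 800 characters after the previous one.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Fixed-size character windows where consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

# 2500 characters of dummy text
dummy_text = "".join(str(i % 10) for i in range(2500))
windows = sliding_chunks(dummy_text)

print(len(windows))                            # number of chunks
print(windows[0][-200:] == windows[1][:200])   # tail of chunk 1 equals head of chunk 2
```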

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Example: 'documents' is a dictionary with extracted text from previous step
def chunk_documents(documents, chunk_size=1000, chunk_overlap=200):
  text_splitter = RecursiveCharacterTextSplitter(
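      chunk_size=chunk_size,  # Maximum chunk length in characters (len() is the default length function)
      chunk_overlap=chunk_overlap,  # Characters shared between consecutive chunks, for continuity
      separators=["\n\n", "\n", " ", ""]  # Try paragraphs, then lines, then words, then single characters
      # Note: Step 1 replaced "\n" with spaces, so on this text the splits effectively happen at spaces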
  )

  chunked_documents = {}
  for pdf, pages in documents.items():
    full_text = " ".join([page["page_content"] for page in pages])  # Combine all pages from a document
    chunks = text_splitter.split_text(full_text)
    chunked_documents[pdf] = chunks

  return chunked_documents

# Apply chunking to extracted text
chunked_documents = chunk_documents(documents)

# Display preview of chunked text
for pdf, chunks in chunked_documents.items():
  print(f"Source: {pdf}")
  for i, chunk in enumerate(chunks[:3]):  # Show first 3 chunks as preview
    print(f"  Chunk {i+1}:")
    print(f"  Content Preview: {chunk[:300]}...")  # Show first 300 characters
    print("  " + "-" * 80)
  print()

Source: HDFC-Life-Group-Term-Life-Policy.pdf
  Chunk 1:
  Content Preview: F&U dated 15th October 2022                  UIN-101N169V02  P a g e  | 0                                      HDFC Life Group Term Life    OF      «OWNERNAME»               Based on the Proposal and the declarations and  any  statement made or referred to therein,  We will pay the Benefits mentione...
  --------------------------------------------------------------------------------
  Chunk 2:
  Content Preview: __________________      Your HDFC Life <Policy Name> with Policy No. <Policy no.>    Dear Mr./Ms.___________________________,    We thank you for choosing HDFC Life Insurance as your preferred life insurance solution provider..    We are pleased to enclose your Policy Bond, which carries the followi...
  --------------------------------------------------------------------------------
  Chunk 3:
  Content Preview: you can Benefit  from    We request you to carefully go through the information given in th

##  **Step 3: Generating Embeddings using `OpenAIEmbeddings` and Caching them**

Generate **vector embeddings** for each text chunk using **OpenAI's Embeddings API**. These embeddings allow us to:
1. Compare text chunks based on meaning rather than exact words.
2. Perform semantic search to retrieve the most relevant policy sections.
3. Store embeddings in a vector database.
---

### ***Code Implementation - Generate Embeddings***
1. Use `OpenAIEmbeddings` from LangChain to generate embeddings.
  - **OpenAI model for embeddings**: `text-embedding-ada-002`
2. Process each chunk and store its corresponding vector representation.
3. Preview a few embedding values to verify the process.

---


A production-grade application should **cache embeddings** to optimize performance.

### Why Use Caching for Embeddings?
- If an embedding already exists in the cache, we reuse it instead of computing it again.
- **Improves speed and reduces cost**: Reusing existing embeddings means we don’t call the embeddings API again for unchanged chunks, so re-running the indexing step is faster.
- **Ensures consistency across stored embeddings**: Cached embeddings stay identical across runs, so search results remain stable over time.

---


### How Will Caching Work?
1. Before computing an embedding, we check whether the chunk already exists in the cache.
2. If it is not cached, we compute the embedding and add it to the cache. The chunks are indexed in ChromaDB in Step 4.
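
The check-then-compute flow can be sketched without any external service. Everything here is hypothetical and for illustration only: `HashedEmbeddingCache` is not a LangChain class, and `fake_embed` stands in for a real embeddings API call.

```python
import hashlib

class HashedEmbeddingCache:
    """Cache embeddings under a hash of the chunk text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.misses = 0  # counts how often the underlying embed_fn was called

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            # Cache miss: compute once and remember the result
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]

def fake_embed(text):
    # Stand-in for a real embedding call; returns a tiny deterministic vector
    return [float(len(text)), float(text.count(" "))]

embedding_cache = HashedEmbeddingCache(fake_embed)
first = embedding_cache.embed("Waiting period is 60 days")
second = embedding_cache.embed("Waiting period is 60 days")  # served from the cache
print(embedding_cache.misses)
```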

---


### ***Code Implementation - Optimized Embedding Strategy with Caching***
1. Uses `CacheBackedEmbeddings` to prevent redundant API calls.
2. **Hashes each chunk** so each embedding can be identified and looked up before recomputing.
3. Uses an in-memory cache (can be replaced with persistent storage so the cache survives across sessions).
4. Displays embedding **previews for validation**.

---


### Understanding Query Embeddings in Vector Search
When a user asks a question, the system must **retrieve the most relevant document chunks** from the vector database. This is done by **implicitly** embedding the query **in the same way as the document chunks**.

1. The system **implicitly** converts the user query into an embedding using the same `text-embedding-ada-002` model.
2. The system compares the query embedding with the stored document embeddings using similarity search.
3. The system returns the top-matching chunks, ensuring relevant insurance policy sections are retrieved.

This allows the system to find semantically similar text, even if the query does not exactly match the wording in the document.
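
The comparison in step 2 is commonly done with cosine similarity. The sketch below uses made-up three-dimensional vectors in place of real embeddings; `toy_chunks`, `toy_query`, and the helper names are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Made-up vectors standing in for chunk embeddings
toy_chunks = {
    "waiting_period": [0.9, 0.1, 0.0],
    "exclusions": [0.1, 0.9, 0.1],
    "premium_table": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.3, 0.0]  # stands in for the embedded user question

ranked_names = top_k(toy_query, toy_chunks, k=2)
print(ranked_names)
```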

In [6]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings.cache import CacheBackedEmbeddings
from langchain.storage import InMemoryByteStore
import hashlib

# Initialize OpenAI Embeddings
openai_embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# CacheBackedEmbeddings needs a byte store (not the LLM response cache InMemoryCache).
# The in-memory store can be swapped for a persistent byte store to survive restarts.
store = InMemoryByteStore()
cached_embeddings = CacheBackedEmbeddings.from_bytes_store(
    openai_embeddings, store, namespace=openai_embeddings.model
)

# Function to generate a unique hash for each chunk (used as a readable chunk ID)
def get_chunk_hash(text):
  return hashlib.md5(text.encode()).hexdigest()

# Function to compute and cache embeddings
def generate_embeddings(chunked_documents):
  embeddings_store = {}

  for pdf, chunks in chunked_documents.items():
    # embed_documents checks the store first and only calls the API for uncached chunks
    # (embed_query is not cached by default, so it is not used here)
    vectors = cached_embeddings.embed_documents(chunks)
    embeddings_store[pdf] = {
        get_chunk_hash(chunk): vector for chunk, vector in zip(chunks, vectors)
    }

  return embeddings_store

# Generate and cache embeddings
embeddings = generate_embeddings(chunked_documents)

# Display preview of embeddings
for pdf, emb_data in embeddings.items():
  print(f"Source: {pdf}")
  for i, (chunk_hash, emb) in enumerate(emb_data.items()):
    print(f"  Chunk {i+1}: Hash {chunk_hash}")
    print(f"  Embedding Preview: {emb[:5]}...")  # Show first 5 dimensions
    print("  " + "-" * 80)
    if i == 2:  # Show only first 3 embeddings as preview
      break
  print()



Source: HDFC-Life-Group-Term-Life-Policy.pdf
  Chunk 1: Hash fb5e01a7880d9c8f90c7119e79b435df
  Embedding Preview: [-0.004571263401156658, -0.014109511510043768, -0.019963459003402396, -0.027441227077188362, -0.024903151301958637]...
  --------------------------------------------------------------------------------
  Chunk 2: Hash 61a55baa1dce2637dcaf5947d8bfa6e8
  Embedding Preview: [-0.01738223555907964, 0.01561045589324737, 0.007879052021756485, -0.04238849722603166, -0.01723458740881404]...
  --------------------------------------------------------------------------------
  Chunk 3: Hash cfae7596a9bec858168212b45b6873a9
  Embedding Preview: [-0.016857616379333952, 0.001441142153399626, 0.018874533223599655, -0.054947379246251184, -0.00352960587444882]...
  --------------------------------------------------------------------------------

Source: HDFC-Life-Easy-Health-101N110V03-Policy-Bond-Single-Pay.pdf
  Chunk 1: Hash 9daa24a4ea08fc1bac98f5d68d4347c8
  Embedding Preview: [0.002675

## **Step 4: Storing Embeddings in ChromaDB**

Since we have **generated and cached embeddings**, we will **store them in ChromaDB** for efficient retrieval.

---

### Why ChromaDB?
1. **High-performance vector storage**: Optimized for fast similarity search.
2. **Metadata support**: Can store source PDF name and page number/chunk hash for better document retrieval.
3. **Flexible persistence**: Can be stored in-memory (faster) or persisted (for reusability across sessions).

---

### ***Code Implementation***
1. **Initialize ChromaDB as our vector store** (in-memory for speed, but configurable for persistence).
2. Attach **metadata (source PDF file and chunk hash)** to each stored embedding for better retrieval context.
3. **Runs a test query** (`"Does this insurance policy cover hospitalization expenses for accidents?"`) to verify that embeddings are stored and retrievable.

---

## Why Use `from langchain.schema import Document`?
The `Document` class in LangChain is used to structure text data along with metadata before storing it in a vector database like ChromaDB.

**Reasons for using `Document` in the code** are as follows:
- Metadata Storage:
  - Each document (insurance policy chunk) needs **source details** (PDF name, chunk hash).
  - The `Document` class allows us to store and retrieve this metadata efficiently.
- Structured Data Format:
  - ChromaDB **expects input in a structured form**, and `Document` ensures a standard format.
  - It prevents storing raw text without context.
- Easy Retrieval:
  - When performing a **similarity search**, ChromaDB returns `Document` objects, allowing easy metadata access (`doc.metadata["source"]`).

In [7]:
from langchain.vectorstores import Chroma
from langchain.schema import Document

# Initialize OpenAI Embeddings
embedding_function = OpenAIEmbeddings(model="text-embedding-ada-002")

# Function to generate a unique hash for each chunk
def get_chunk_hash(text):
    return hashlib.md5(text.encode()).hexdigest()

# Prepare documents with metadata
# (a separate name avoids overwriting the `documents` dict from Step 1, which Step 2 still needs on re-runs)
chroma_documents = []
for pdf, chunks in chunked_documents.items():
  for chunk in chunks:
    chunk_hash = get_chunk_hash(chunk)
    doc = Document(
        page_content=chunk,
        metadata={"source": pdf, "chunk_hash": chunk_hash}
    )
    chroma_documents.append(doc)

# Store documents in ChromaDB with embeddings
chroma_db = Chroma.from_documents(
    documents=chroma_documents,
    embedding=embedding_function,
    persist_directory=None,  # In-memory; set a directory path to persist across sessions
)

# Verify stored embeddings
print("Stored Embeddings in ChromaDB:")
docs = chroma_db.similarity_search("Does this insurance policy cover hospitalization expenses for accidents?", k=3)
for doc in docs:
  print(f"Source: {doc.metadata['source']}, Chunk Hash: {doc.metadata['chunk_hash']}")
  print(f"Content Preview: {doc.page_content[:300]}...")
  print("-" * 80)


Stored Embeddings in ChromaDB:
Source: HDFC-Life-Easy-Health-101N110V03-Policy-Bond-Single-Pay.pdf, Chunk Hash: 7d4a3c532f792e2dd64ebd927f0969fe
Content Preview: Part E  (Applicable charges, Fund name, fund options)                                              There are no additional charges under the Policy. Part F    1. Waiting Period     60 days waiting period  For Daily Hospital Cash Benefit and Surgical Benefit we will not pay any benefits under this P...
--------------------------------------------------------------------------------
Source: HDFC-Surgicare-Plan-101N043V01.pdf, Chunk Hash: ef3ec54b3eddbe4320d70bf293c8cb9d
Content Preview: period.     11. Loans    There is no facility of loans from us against this contract.      12. Assignments and Nominations    Any change in nomination must be notified in writing to us at our Correspondence  Address noted in your Policy schedule. The policy can not be assigned.      13. Exclusions: ...
-------------------------------------------

## **Step 5: Implementing Retrievers with Cross Encoding**

Retrievers provide an easy way to combine documents with language models.

A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store: a retriever does not need to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.

---

### ***Code Implementation***
1. Uses `ContextualCompressionRetriever` to wrap a base retriever with a re-ranking step.
2. Retrieves the top k documents using **MMR (maximal marginal relevance)** search. Note that a `score_threshold` setting only takes effect with `search_type="similarity_score_threshold"`, not with MMR.
3. Integrates `CrossEncoderReranker` (`BAAI/bge-reranker-base`) to re-rank retrieved results.
4. Returns and displays relevant documents for a given query.
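
For intuition about what MMR does before the cross-encoder re-ranks, here is a small standard-library sketch of the selection rule. It is not the library's implementation: `mmr_select`, the toy vectors, and the `lambda_mult` values are assumptions chosen for illustration. Each step picks the candidate with the best trade-off between relevance to the query and redundancy with what is already selected.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

toy_query_vec = [1.0, 0.0]
toy_doc_vecs = [
    [1.0, 0.10],  # 0: highly relevant
    [1.0, 0.12],  # 1: near-duplicate of document 0
    [0.6, 0.80],  # 2: less relevant but different
]

diverse = mmr_select(toy_query_vec, toy_doc_vecs, k=2, lambda_mult=0.3)
relevance_only = mmr_select(toy_query_vec, toy_doc_vecs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With a low `lambda_mult` the near-duplicate is skipped in favor of the more distinct document; with `lambda_mult=1.0` the rule reduces to plain relevance ranking.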

In [8]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Initialize a document retriever using the existing vector storage (chroma_db)
# The base retriever selects `topk` diverse chunks with MMR; a cross-encoder then re-ranks them.
# `score_threshold` only applies to search_type="similarity_score_threshold", so it is not passed here.
# MMR picks from `fetch_k` candidates (default 20), so raise `fetch_k` if `topk` exceeds 20.

def get_retriever(topk=20):
  search_kwargs = {"k": topk}
  retriever = chroma_db.as_retriever(search_type="mmr", search_kwargs=search_kwargs)

  # Initialize cross-encoder model
  cross_encoder = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")

  # Set up reranker
  reranker = CrossEncoderReranker(model=cross_encoder, top_n=topk)
  return ContextualCompressionRetriever(base_compressor=reranker, base_retriever=retriever)

# Function to get top-k relevant documents
def get_topk_relevant_documents(query, topk=5):
  retriever = get_retriever(topk)
  relevant_docs = retriever.invoke(query)
  return relevant_docs

# Example Query
test_query = "What is covered under the insurance policy?"
retrieved_docs = get_topk_relevant_documents(test_query, topk=5)

# Display retrieved documents
print("Retrieved Documents:")
for doc in retrieved_docs:
  print(f"Source: {doc.metadata['source']}")
  print(f"Content Preview: {doc.page_content[:300]}...")
  print("-" * 80)

Retrieved Documents:
Source: HDFC-Life-Easy-Health-101N110V03-Policy-Bond-Single-Pay.pdf
Content Preview: Premium Paying Term – means the period as stated in the Policy Schedule, in years, over which Premiums  are payable;  (31) Surgery or Surgical Procedure means manual and / or operative procedure (s) required for treatment of an  Illness or Injury, correction of deformities and defects, diagnosis and...
--------------------------------------------------------------------------------
Source: HDFC-Life-Group-Term-Life-Policy.pdf
Content Preview: Premium Amount (Base Plan)          Premium Amount (Riders)          Service Tax Amount (Base Plan)        Service Tax Amount (Riders)        Applicable Cess        Total Amount F&U dated 15th October 2022                  UIN-101N169V02  P a g e  | 6                             B1 Definitions    In...
--------------------------------------------------------------------------------
Source: HDFC-Surgicare-Plan-101N043V01.pdf
Content Preview: of

## **Step 6: Building a RAG Chain for Final Response Generation**

Now that we have a **retriever system**, we will integrate it into a **Retrieval-Augmented Generation (RAG) Chain** using LangChain.

---

### Why Use a RAG Chain?
1. **Combines retrieval with an LLM**: Ensures responses are relevant to policy documents.
2. **Enhances accuracy**: Instead of generating from scratch, the model answers based on retrieved chunks.
3. **Maintains traceability**: Each response can be linked back to its source in the policy documents.

---

### ***Code Implementation***
- Build the chain with the LangChain Expression Language (LCEL), piping the retriever output into the prompt and then into the LLM.
- **Pull a prebuilt RAG prompt** (`rlm/rag-prompt`) **from LangChain Hub**.
- Integrate **OpenAI GPT for response generation**.
- As written, the chain returns only the generated answer as a string. Returning the source documents alongside the answer would require extending the chain.
- **Integrates** `RunnablePassthrough` to forward the question unchanged and `StrOutputParser` to extract the answer text from the model output.
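
One way to keep answers traceable is to label each retrieved chunk with its source before it goes into the prompt. The sketch below is a hypothetical, standard-library illustration of that idea: `format_docs_with_sources`, `build_prompt`, the prompt wording, and the sample chunks are all assumptions, not the `rlm/rag-prompt` template or the notebook's chain.

```python
def format_docs_with_sources(docs):
    """Join retrieved chunks, prefixing each with a number and its source file."""
    return "\n\n".join(
        f"[{i}] ({doc['source']}) {doc['text']}"
        for i, doc in enumerate(docs, start=1)
    )

def build_prompt(question, docs):
    """Assemble a RAG-style prompt: instructions, numbered context, then the question."""
    context = format_docs_with_sources(docs)
    return (
        "Answer the question using only the context below. "
        "Cite passages by their [number].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical retrieved chunks (plain dicts standing in for Document objects)
sample_docs = [
    {"source": "policy_a.pdf", "text": "A 60 days waiting period applies."},
    {"source": "policy_b.pdf", "text": "Cosmetic treatments are excluded."},
]

rag_prompt_text = build_prompt("What is the waiting period?", sample_docs)
print(rag_prompt_text)
```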



In [9]:
# Helper that joins the page content of all retrieved documents into one context string
def format_docs(docs):
  return "\n\n".join(doc.page_content for doc in docs)

In [11]:
from langchain.chat_models import ChatOpenAI
from langchain import hub
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Load a prebuilt RAG prompt from LangChain Hub
prompt = hub.pull("rlm/rag-prompt")

# Initialize OpenAI chat model
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

retriever = get_retriever(50)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Example Queries
queries = [
    "What are the exclusions in the insurance policy?",
    "What is the minimum age for doing a term insurance?",
    "What are the criteria for HDFC group insurance?",
    "What are the benefits of HDFC Sampoorna-Jeevan insurance?",
    "Can a 100-year-plus person do a term insurance?",
    "What is the condition of death while not wearing a seat belt?",
]

# Execute queries
for query in queries:
  print(f"Query: {query}")
  response = rag_chain.invoke(query)
  print("Generated Answer:")
  print(response)
  print("-" * 80)

Query: What are the exclusions in the insurance policy?
Generated Answer:
The insurance policy excludes coverage for conditions such as deformities, congenital anomalies, and any surgery performed prior to the policy commencement. Additionally, it does not cover injuries or illnesses resulting from self-inflicted harm, criminal activities, or participation in hazardous sports without special endorsement. Other exclusions include hospitalization for cosmetic treatments, organ donation, and any treatment not deemed medically necessary.
--------------------------------------------------------------------------------
Query: What is the minimum age for doing a term insurance?
Generated Answer:
The minimum age for doing a term insurance policy is typically 18 years, as individuals below this age are considered minors. However, specific policies may have different entry age requirements, so it's essential to check the policy schedule for exact details.
----------------------------------------