## 📌 Problem Statement

Understanding complex insurance policies is challenging due to lengthy and jargon-heavy documents. Customers and agents struggle to find specific information quickly.

### **Objective**
- Build an AI-powered system to **answer insurance-related queries** efficiently.
- Utilize **Retrieval-Augmented Generation (RAG)** to retrieve relevant policy sections.
- Use **GPT-4** to generate precise, user-friendly responses.

### **Expected Outcome**
- **Faster query resolution** for policyholders.
- **Improved comprehension** of insurance terms.
- **Enhanced customer support** with AI-powered assistance.


# ✅ Flow for Generative Search with RAG

## 📂 Load PDF & Extract Text
- Read the PDF document.
- Extract text content from the pages.
- Extract tables (if applicable).

## ✂️ Chunk the Extracted Text
- Use **RecursiveCharacterTextSplitter** for text chunking.
- Ensure each chunk is of **optimal size** (e.g., 500 tokens with some overlap).
- Store in a **structured format** (e.g., a DataFrame).

## 🔢 Generate Embeddings
- Initialize **OpenAIEmbeddings** correctly.
- Convert text chunks into **vector embeddings**.

## 🗄️ Store in ChromaDB
- Persist embeddings and document chunks in **ChromaDB**.
- Ensure embeddings are **stored properly** without corruption.

## 🔍 Retrieve Relevant Documents using Embeddings
- Use the **query** to perform a **similarity search** in ChromaDB.
- Retrieve the **top k** most relevant document chunks.

## 📊 Rerank Results using Cross-Encoder
- Pass retrieved **document-query pairs** through a **cross-encoder**.
- Sort results by **relevance scores**.

## 📝 Generate Final Answer using GPT
- Use the **top reranked results** as **context**.
- Format the **prompt** correctly for GPT.
- Generate a **coherent, well-structured answer**.


# **1. Installing Neccessary Libraries**

In [1]:
pip install openai chromadb langchain pdfplumber pandas tiktoken sentence-transformers


Collecting chromadb
  Downloading chromadb-0.6.3-py3-none-any.whl.metadata (6.8 kB)
Collecting pdfplumber
  Downloading pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.5/42.5 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.16.0

In [50]:
pip install PyPDF

Collecting PyPDF
  Downloading pypdf-5.3.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.3.0-py3-none-any.whl (300 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/300.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m [32m297.0/300.7 kB[0m [31m11.8 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m300.7/300.7 kB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyPDF
Successfully installed PyPDF-5.3.0


In [21]:
!pip install --upgrade openai

Collecting openai
  Downloading openai-1.64.0-py3-none-any.whl.metadata (27 kB)
Downloading openai-1.64.0-py3-none-any.whl (472 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m472.3/472.3 kB[0m [31m10.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.61.1
    Uninstalling openai-1.61.1:
      Successfully uninstalled openai-1.61.1
Successfully installed openai-1.64.0


In [40]:
# Install required packages
!pip install --upgrade langchain-community langchain-core langchain

# Reinstalling dependencies just in case
!pip install openai chromadb langchain pdfplumber pandas tiktoken sentence-transformers



#**2. Importing Neccessary Libraries**

In [95]:
import pdfplumber
import chromadb
from sentence_transformers import SentenceTransformer
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os
import pandas as pd
import openai
import tiktoken
import chromadb
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import CrossEncoder


In [96]:
pdf_path = "/content/Principal-Sample-Life-Insurance-Policy.pdf"

# **3. Data Extraction**


> Text and Tables



In [97]:
def extract_text_and_tables(pdf_path, doc_name):
    data = []
    table_data = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_no, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""  # Extract text

            # Extract tables
            tables = page.extract_tables()
            for table_idx, table in enumerate(tables):
                df_table = pd.DataFrame(table)  # Convert table to DataFrame
                df_table["Page No."] = page_no  # Add metadata
                df_table["Table Index"] = table_idx
                table_data.append(df_table)  # Store extracted table

            # Store extracted text
            data.append({
                "Page No.": page_no,
                "Page_Text": text,
                "Document Name": doc_name,
                "Text_Length": len(text),
                "Has_Tables": bool(tables)
            })

    text_df = pd.DataFrame(data)  # Store extracted text
    tables_df = pd.concat(table_data, ignore_index=True) if table_data else pd.DataFrame()  # Store extracted tables

    return text_df, tables_df

# Example Usage
pdf_path = "/content/Principal-Sample-Life-Insurance-Policy.pdf"
text_df, tables_df = extract_text_and_tables(pdf_path, "Group Life Insurance Policy")

# View Extracted Text Data
print("Extracted Text Data:")
print(text_df.head())

# View Extracted Tables
if not tables_df.empty:
    print("\nExtracted Tables:")
    print(tables_df.head())
else:
    print("\nNo tables found in the document.")


Extracted Text Data:
   Page No.                                          Page_Text  \
0         1  DOROTHEA GLAUSE S655\nRHODE ISLAND JOHN DOE 01...   
1         2                 This page left blank intentionally   
2         3  POLICY RIDER\nGROUP INSURANCE\nPOLICY NO: S655...   
3         4                 This page left blank intentionally   
4         5  PRINCIPAL LIFE INSURANCE COMPANY\n(called The ...   

                 Document Name  Text_Length  Has_Tables  
0  Group Life Insurance Policy          188       False  
1  Group Life Insurance Policy           34       False  
2  Group Life Insurance Policy         1468       False  
3  Group Life Insurance Policy           34       False  
4  Group Life Insurance Policy          709       False  

No tables found in the document.


In [98]:
print("Extracted Text DataFrame:")
print(text_df.head())  # Ensure text is extracted

print("\nExtracted Chunks DataFrame:")
print(df_exploded.head())  # Ensure text chunking worked

print("\nExtracted Tables DataFrame:")
if not tables_df.empty:
    print(tables_df.head())
else:
    print("No tables found in the document.")

Extracted Text DataFrame:
   Page No.                                          Page_Text  \
0         1  DOROTHEA GLAUSE S655\nRHODE ISLAND JOHN DOE 01...   
1         2                 This page left blank intentionally   
2         3  POLICY RIDER\nGROUP INSURANCE\nPOLICY NO: S655...   
3         4                 This page left blank intentionally   
4         5  PRINCIPAL LIFE INSURANCE COMPANY\n(called The ...   

                 Document Name  Text_Length  Has_Tables  
0  Group Life Insurance Policy          188       False  
1  Group Life Insurance Policy           34       False  
2  Group Life Insurance Policy         1468       False  
3  Group Life Insurance Policy           34       False  
4  Group Life Insurance Policy          709       False  

Extracted Chunks DataFrame:
   Page No.                                          Page_Text  \
0         1  DOROTHEA GLAUSE S655\nRHODE ISLAND JOHN DOE 01...   
1         2                 This page left blank intentionally   
2 

# **4. Chunking Text**

In [99]:
def chunk_text(text, max_tokens=500):
    """
    Splits the given text into chunks based on token count while ensuring each chunk
    does not exceed the specified max_tokens limit.

    Parameters:
        text (str): The input text to be chunked.
        max_tokens (int): The maximum number of tokens allowed in each chunk.

    Returns:
        list: A list of text chunks.
    """
    encoding = tiktoken.get_encoding("cl100k_base")  # Load tokenizer
    words = text.split()  # Split text into words

    chunks, chunk = [], []  # Initialize lists for storing chunks
    token_count = 0  # Counter for tracking tokens in the current chunk

    for word in words:
        word_tokens = len(encoding.encode(word))  # Count tokens in the word

        # If adding the current word exceeds max_tokens, store the current chunk
        if token_count + word_tokens > max_tokens:
            chunks.append(" ".join(chunk))  # Add current chunk to the list
            chunk = []  # Reset the chunk
            token_count = 0  # Reset token counter

        chunk.append(word)  # Add word to the current chunk
        token_count += word_tokens  # Update token count

    # Add the last chunk if not empty
    if chunk:
        chunks.append(" ".join(chunk))

    return chunks

# Apply chunking function to the "Page_Text" column
text_df["Chunks"] = text_df["Page_Text"].apply(lambda text: chunk_text(text))

# Explode the dataframe so that each chunk gets its own row
df_exploded = text_df.explode("Chunks").reset_index(drop=True)

# Display first few rows of the exploded dataframe
print(df_exploded.head())




   Page No.                                          Page_Text  \
0         1  DOROTHEA GLAUSE S655\nRHODE ISLAND JOHN DOE 01...   
1         2                 This page left blank intentionally   
2         3  POLICY RIDER\nGROUP INSURANCE\nPOLICY NO: S655...   
3         4                 This page left blank intentionally   
4         5  PRINCIPAL LIFE INSURANCE COMPANY\n(called The ...   

                 Document Name  Text_Length  Has_Tables  \
0  Group Life Insurance Policy          188       False   
1  Group Life Insurance Policy           34       False   
2  Group Life Insurance Policy         1468       False   
3  Group Life Insurance Policy           34       False   
4  Group Life Insurance Policy          709       False   

                                              Chunks  
0  DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...  
1                 This page left blank intentionally  
2  POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...  
3                 This page 

# **5. Creating Embeddings and Storing in ChromaDB**

In [100]:
# Load OpenAI API Key
with open("helpmate_api_key.txt", "r") as file:
    openai.api_key = file.read().strip()

# Initialize the OpenAI client with the API key.
client = openai.OpenAI(api_key=openai.api_key)
from langchain.embeddings.openai import OpenAIEmbeddings

# Initialize Embeddings
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai.api_key)

# Initialize ChromaDB
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection("insurance_policy")

# Embed and store chunks
for idx, row in df_exploded.iterrows():
    chunk_text = row["Chunks"]
    vector = embedding_model.embed_documents([chunk_text])[0]
    collection.add(
        documents=[chunk_text],
        metadatas=[{"page": row["Page No."], "doc": row["Document Name"]}],
        ids=[str(idx)]
    )
print("Embedding & storage complete.")




Embedding & storage complete.


In [101]:
print(f"Total documents stored in ChromaDB: {collection.count()}")


Total documents stored in ChromaDB: 74


 # **6.Implementing Query Search**


>  To Retrieve Relevant Documents from Embeddings



In [102]:
# Initialize ChromaDB Search
vector_db = Chroma(persist_directory="./chroma_db", embedding_function=embedding_model)

def search_query(query, top_k=3):
    """
    Searches the ChromaDB vector database for the most relevant chunks for the given query.

    Parameters:
        query (str): The search query.
        top_k (int): The number of top results to retrieve.

    Returns:
        list: Retrieved document chunks along with their similarity scores.
    """
    try:
        # Generate embedding for the query
        query_embedding = embedding_model.embed_query(query)

        # Perform similarity search
        results = vector_db.similarity_search_with_score(query, k=top_k)

        # If no results are found, handle it gracefully
        if not results:
            print(f"\n No relevant documents found for query: {query}")
        return results

    except Exception as e:
        print(f" Error during search: {e}")
        return []

# Example Queries
queries = [
    "What are the eligibility criteria for this policy?",
    "How is the claim process handled?",
    "What are the exclusions in this insurance policy?"
]

# Get Top Results
for query in queries:
    results = search_query(query)

    print(f"\n Query: {query}")
    if results:
        for idx, (doc, score) in enumerate(results):
            print(f" Result {idx+1}: {doc.page_content[:200]}... (Score: {score:.4f})")  # Display first 200 chars
    else:
        print(f"No documents found for query: {query}")



 Query: What are the eligibility criteria for this policy?
 Result 1: This policy has been updated effective  January 1, 2014 
 
 
PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS 
GC 6006 Section A - Eligibility, Page 1  
 
PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS 
 ... (Score: 0.3231)
 Result 2: (1) Employees: 
 
- at least 75% of all eligible employees must enroll; 
 
(2) Dependents: 
 
- maintain a Dependent participation of at least 75% of eligible Dependents; and 
 
d. if the Member is to... (Score: 0.3375)
 Result 3: consent to the change. 
 
 
Article 3 - Policyholder Eligibility Requirements 
 
To be an eligible group and to remain an eligible group, the Policyholder must:... (Score: 0.3378)

 Query: How is the claim process handled?
 Result 1: ERISA permits up to 45 days from receipt of claim for processing the claim.  If a claim cannot 
be processed due to incomplete information, The Principal will send a Written explanation prior 
to the ... (Score: 0.3309)
 Result 2: Th

# **7.  Rerank Results using Cross-Encoder**

In [103]:
from sentence_transformers import CrossEncoder

# Load cross-encoder for reranking
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def search_query(query, top_k=10):
    # Correct embedding for OpenAIEmbeddings
    query_embedding = embedding_model.embed_query(query)  # Fix: Use `embed_query()`

    # Retrieve initial results from ChromaDB
    results = vector_db.similarity_search_with_score(query, k=top_k)

    if not results:
        print(f"\n No relevant documents found for query: {query}")
        return []

    # Extract text and similarity scores
    retrieved_texts = [doc.page_content for doc, score in results]
    scores = [score for doc, score in results]

    # Rerank using Cross-Encoder
    rerank_scores = cross_encoder.predict([[query, text] for text in retrieved_texts])

    # Sort results based on reranker scores
    sorted_results = sorted(zip(retrieved_texts, rerank_scores), key=lambda x: x[1], reverse=True)

    # Select top-k reranked results
    top_reranked = sorted_results[:3]

    print(f"\n🔎 Query: {query}")
    for idx, (doc, score) in enumerate(top_reranked):
        print(f"🔹 Retrieved Chunk {idx+1}: {doc[:500]}... (Re-Rank Score: {score:.4f})")

    return top_reranked


# **8. Generate Final Answer using GPT**

In [105]:
def generate_answer(query, context):
    """
    Generates an answer using GPT-4 based on the retrieved document context.

    Parameters:
        query (str): The user’s question.
        context (str): The retrieved relevant document text.

    Returns:
        str: The AI-generated answer.
    """
    try:
        # Ensure the context is not empty to avoid irrelevant answers
        if not context.strip():
            return " No relevant information found in the document to answer this query."

        # Define system role and user query separately for structured interaction
        messages = [
            {"role": "system", "content": "You are an AI assistant answering questions about an insurance policy document. Use the provided document context to answer accurately."},
            {"role": "user", "content": f"Context: {context}\nUser Query: {query}\n\nPlease provide a clear, detailed, and helpful response."}
        ]

        # Call GPT-4 API to generate a response
        response = client.chat.completions.create(
            model="gpt-4",
            messages=messages
        )

        return response.choices[0].message.content

    except Exception as e:
        return f"Error generating answer: {e}"

# Example Queries
queries = [
    "What are the eligibility criteria for this policy?",
    "How is the claim process handled?",
    "What are the exclusions in this insurance policy?"
]

# Generate Answers for Queries
for query in queries:
    retrieved_docs = search_query(query, top_k=3)  # Retrieve relevant documents
    # Modify this line to extract the text from the tuples:
    combined_context = " ".join([doc for doc, score in retrieved_docs]) if retrieved_docs else ""

    answer = generate_answer(query, combined_context)

    print(f"\n🔎 Query: {query}")
    print(f"💬 Answer: {answer}")


🔎 Query: What are the eligibility criteria for this policy?
🔹 Retrieved Chunk 1: consent to the change. 
 
 
Article 3 - Policyholder Eligibility Requirements 
 
To be an eligible group and to remain an eligible group, the Policyholder must:... (Re-Rank Score: 4.1368)
🔹 Retrieved Chunk 2: This policy has been updated effective  January 1, 2014 
 
 
PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS 
GC 6006 Section A - Eligibility, Page 1  
 
PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS 
 
 
Section A - Eligibility 
 
 
Article 1 - Member Life Insurance 
 
A person will be eligible for Member Life Insurance on the date the person completes 30 
consecutive days of continuous Active Work with the Policyholder as a Member.... (Re-Rank Score: 4.0769)
🔹 Retrieved Chunk 3: (1) Employees: 
 
- at least 75% of all eligible employees must enroll; 
 
(2) Dependents: 
 
- maintain a Dependent participation of at least 75% of eligible Dependents; and 
 
d. if the Member is to contribute no part of th