

**1.The Embedding Layer:** The PDF document needs to be effectively processed, cleaned, and chunked for the embeddings. Here, the choice of the chunking strategy will have a large impact on the final quality of the retrieved results.

Another important aspect in the embedding layer is the choice of the embedding model. You can choose to embed your chunks using the OpenAI embedding model or any model from the SentenceTransformers library on HuggingFace.

**2. The Search Layer:** Here, you first need to design at least 3 queries against which you will test your system. You need to understand and skim through the document, and accordingly come up with some queries, the answers to which can be found in the policy document.

Next, you need to embed the queries and search your ChromaDB vector database against each of these queries. Implementing a cache mechanism is also mandatory.

Finally, you need to implement the re-ranking block, and for this you can choose from a range of cross-encoding models on HuggingFace.

**3. The Generation Layer:** In the generation layer, the final prompt that you design is the major component. Make sure that the prompt is exhaustive in its instructions, and the relevant information is correctly passed to the prompt. You may also choose to provide some few-shot examples in an attempt to improve the LLM output.

In [1]:
# Install necessary libraries

!pip install pdfplumber
!pip install chromadb
!pip install tiktoken
!pip install openai

Collecting pdfplumber
  Downloading pdfplumber-0.11.7-py3-none-any.whl.metadata (42 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/42.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.8/42.8 kB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pdfminer.six==20250506 (from pdfplumber)
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.5/48.5 kB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0m
Downloading pdfplumber-0.11.7-py3-none-any.whl (60 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.0/60.0 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pdfminer_six-20250506-py3-none-any.whl (5.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
# Import all the required Libraries
import pdfplumber
from pathlib import Path
import pandas as pd
from operator import itemgetter
import json
import tiktoken
import chromadb
import openai

In [3]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import os
os.chdir('/content/drive/MyDrive/GenAI - Personal/HelpMate AI Codes/Week 3')
!ls

 Attendance.xlsx
 ChatGPT_Key.docx
 ChatGPT_Key.txt
 chroma
'Copy of Demo 2 _ Chunking, Embeddings & Semantic Search on Wikipedia Pages.ipynb'
'Copy of Demo 3.1 _ Embeddings & Retrieval with ChromaDB(1).ipynb'
'Copy of Demo 3.1 _ Embeddings & Retrieval with ChromaDB.ipynb'
'Copy of Demo 3.2 _ Generative Search with OpenAI and Chroma.ipynb'
'Copy of fixed_chunk_embeddings.csv'
'Copy of Food App Reviews.csv'
'Demo 1 _ Generating and Visualising Embeddings.ipynb'
'Demo 2 _ Chunking, Embeddings & Semantic Search on Wikipedia Pages.ipynb'
'Demo 3.1 _ Embeddings & Retrieval with ChromaDB.ipynb'
'Demo 3.2 _ Generative Search with OpenAI and Chroma.ipynb'
 fixed_chunk_embeddings.csv
'Food App Reviews.csv'
 Gemini_API_Key.docx
 HelpMate_AI_Live_Session.ipynb
'LangChain+Session+Materials+DS (2).zip'
 para_chunk_embeddings.csv
'Policy Documents'
 Policy+Documents.zip
 Principal-Sample-Life-Insurance-Policy.pdf
'Restaurant Reviews.xlsx'
 Semantic_Spotter_Project_with_llamaindex.ipynb


In [5]:
pdf_file_path ="/content/drive/MyDrive/GenAI - Personal/HelpMate AI Codes/Week 3/Principal-Sample-Life-Insurance-Policy.pdf"

In [6]:
# Set the API key
filepath = "/content/drive/MyDrive/GenAI - Personal/HelpMate AI Codes/Week 3/"

with open(filepath + "ChatGPT_Key.txt", "r") as f:
  openai.api_key = ' '.join(f.readlines())

In [7]:
messages = [
    {"role":"system", "content":"You are an AI assistant to user."},
    {"role":"user", "content":"What is the revenue of Zomato in 2024?"},
          ]

In [8]:
response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages)


In [9]:
response.choices[0].message.content

"I'm sorry, but as an AI assistant, I don't have access to real-time data or future information. I recommend checking the latest news updates or official reports from Zomato to find out their revenue in 2024."

In [10]:
retrieved= "In FY 2024, Zomato reported ₹12,114 crore in revenue — a 71% increase from the previous year — and posted a net profit of ₹351 crore, reversing a ₹971 crore loss in FY 2023."

In [11]:
messages=[{'role':'system','content':'You are a smart AI Assistant to user who helps him understand business reports'},
          {'role':'user','content': f"What is the revenue of Zomato in 2024?. Use the information available in {retrieved}"}]

In [12]:
response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages)
response.choices[0].message.content

'In FY 2024, Zomato reported a revenue of ₹12,114 crore.'

In [13]:
# Open the PDF file
with pdfplumber.open(pdf_file_path) as pdf:

    # Get one of the pages from the PDF and examine it
    single_page = pdf.pages[2]

    # Extract text from the first page
    text = single_page.extract_text()

    # Extract tables from the first page
    tables = single_page.extract_tables()

    # Print the extracted text
    print(text)
    print(tables)

POLICY RIDER
GROUP INSURANCE
POLICY NO: S655
COVERAGE: Life
EMPLOYER: RHODE ISLAND JOHN DOE
Effective on the later of the Date of Issue of this Group Policy or March 1, 2005, the following
will apply to your Policy:
From time to time The Principal may offer or provide certain employer groups who apply
for coverage with The Principal a Financial Services Hotline and Grief Support Services or
any other value added service for the employees of that employer group. In addition, The
Principal may arrange for third party service providers (i.e., optometrists, health clubs), to
provide discounted goods and services to those employer groups who apply for coverage
with The Principal or who become insureds/enrollees of The Principal. While The
Principal has arranged these goods, services and/or third party provider discounts, the third
party service providers are liable to the applicants/insureds/enrollees for the provision of
such goods and/or services. The Principal is not responsible for the 

Funtion to check if words boundries are inside table boundries. Returns True if word is inside table.

In [14]:
def is_word_inside_table(word, table_bbox):
    # Check whether word is inside a table bbox.
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]

In [15]:

# 1. Declare a variable p to store the iteration of the loop that will help us store page numbers alongside the text
# 2. Declare an empty list 'full_text' to store all the text files
# 3. Use pdfplumber to open the pdf pages one by one
# 4. Find the tables and their locations in the page
# 5. Extract the text from the tables in the variable 'tables'
# 6. Extract the regular words by calling the function check_bboxes() and checking whether words are present in the table or not
# 7. Use the cluster_objects utility to cluster non-table and table words together so that they retain the same chronology as in the original PDF
# 8. Declare an empty list 'lines' to store the page text
# 9. If a text element in present in the cluster, append it to 'lines', else if a table element is present, append the table
# 10. Append the page number and all lines to full_text, and increment 'p'
# 11. When the function has iterated over all pages, return the 'full_text' list

def extract_text_from_pdf(pdf_path):
    p = 0
    full_text = []


    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_no = f"Page {p+1}"
            text = page.extract_text()
            heading = text.split('\n')[0].strip() if text else None
            tables = page.find_tables()
            table_bboxes = [i.bbox for i in tables]
            tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
            non_table_words = [word for word in page.extract_words() if not any(
                [is_word_inside_table(word, table_bbox) for table_bbox in table_bboxes])]
            lines = []

            for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5):

                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join([i['text'] for i in cluster]))
                    except KeyError:
                        pass

                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))


            full_text.append([page_no,heading, " ".join(lines)])
            p +=1

    return full_text

In [16]:
extracted_text = extract_text_from_pdf(pdf_file_path)

In [17]:
 # Convert the extracted list to a PDF, and add a column to store document names
extracted_text_df = pd.DataFrame(extracted_text, columns=['Page No.', 'Heading','Page_Text'])
extracted_text_df.head()

Unnamed: 0,Page No.,Heading,Page_Text
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...
1,Page 2,This page left blank intentionally,This page left blank intentionally
2,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...
3,Page 4,This page left blank intentionally,This page left blank intentionally
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...


In [18]:
extracted_text_df['Text_Length'] = extracted_text_df['Page_Text'].apply(lambda x: len(x.split(' ')))

In [19]:
extracted_text_df.head(10)

Unnamed: 0,Page No.,Heading,Page_Text,Text_Length
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,30
1,Page 2,This page left blank intentionally,This page left blank intentionally,5
2,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,230
3,Page 4,This page left blank intentionally,This page left blank intentionally,5
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,110
5,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,153
6,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...,176
7,Page 8,Section A - Member Life Insurance,Section A - Member Life Insurance Schedule of ...,171
8,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...,387
9,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...,251


Remove empty or blank pages

In [20]:
extracted_text_df = extracted_text_df.loc[extracted_text_df['Text_Length']>=10]
extracted_text_df.head(10)

Unnamed: 0,Page No.,Heading,Page_Text,Text_Length
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,30
2,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,230
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,110
5,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,153
6,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...,176
7,Page 8,Section A - Member Life Insurance,Section A - Member Life Insurance Schedule of ...,171
8,Page 9,P ART I - DEFINITIONS,P ART I - DEFINITIONS When used in this Group ...,387
9,Page 10,T he legally recognized union of two eligible ...,T he legally recognized union of two eligible ...,251
10,Page 11,(2) has been placed with the Member or spouse ...,(2) has been placed with the Member or spouse ...,299
11,Page 12,An institution that is licensed as a Hospital ...,An institution that is licensed as a Hospital ...,352


In [21]:
print(extracted_text_df['Heading'].str.len())

0     20
2     12
4     32
5     17
6     23
7     33
8     21
9     98
10    83
11    97
12    65
13    31
14    94
15    31
16    88
17    92
18    92
19    20
20    69
21    84
22    30
23    87
24    26
25    45
26    85
27    27
28    91
29    87
30    82
31    89
32    93
33    86
34    35
35    95
36    28
37    24
38    89
39    25
40    93
41    38
42    81
43    87
44    82
45    18
46    92
47    94
48    97
49    84
50    57
51    83
52    63
53    75
54     8
55    97
56    14
57    68
58    36
59    94
60    28
61    93
Name: Heading, dtype: int64


In [22]:
extracted_text_df['Metadata'] = extracted_text_df.apply(
    lambda x: {
        'Section': (x['Heading'][:25] if x['Heading'] else ''),
        'Page_No.': x['Page No.']
    },
    axis=1
)

In [23]:
extracted_text_df.reset_index(drop=True, inplace=True)
extracted_text_df.head()

Unnamed: 0,Page No.,Heading,Page_Text,Text_Length,Metadata
0,Page 1,DOROTHEA GLAUSE S655,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,30,"{'Section': 'DOROTHEA GLAUSE S655', 'Page_No.'..."
1,Page 3,POLICY RIDER,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,230,"{'Section': 'POLICY RIDER', 'Page_No.': 'Page 3'}"
2,Page 5,PRINCIPAL LIFE INSURANCE COMPANY,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,110,"{'Section': 'PRINCIPAL LIFE INSURANCE ', 'Page..."
3,Page 6,TABLE OF CONTENTS,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,153,"{'Section': 'TABLE OF CONTENTS', 'Page_No.': '..."
4,Page 7,Section A – Eligibility,Section A – Eligibility Member Life Insurance ...,176,"{'Section': 'Section A – Eligibility', 'Page_N..."


## Generate and Store Embeddings using OpenAI and ChromaDB

In [24]:
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

In [25]:
chroma_data_path = '/content/drive/MyDrive/GenAI - Personal/HelpMate AI Codes/Week 3/insuranceDb'

In [26]:
client = chromadb.PersistentClient()

In [27]:
filepath = "/content/drive/MyDrive/GenAI - Personal/HelpMate AI Codes/Week 3/"

with open(filepath + "ChatGPT_Key.txt", "r") as f:
    openai_api_key = ' '.join(f.readlines()).strip()  # store in variable


openai.api_key = openai_api_key


In [28]:
# Set up the embedding function using the OpenAI embedding model

os.environ["OPENAI_API_KEY"] = openai_api_key


In [29]:
model = "text-embedding-ada-002"
embedding_function = OpenAIEmbeddingFunction(api_key=openai_api_key, model_name=model)

In [30]:
# Initialise a collection in chroma and pass the embedding_function to it so that it used OpenAI embeddings to embed the documents
insurance_collection = client.get_or_create_collection(name='InsurancePolicyDoc', embedding_function=embedding_function)

In [31]:
# Convert the page text and metadata from your dataframe to lists to be able to pass it to chroma

documents_list = extracted_text_df["Page_Text"].tolist()
metadata_list = extracted_text_df['Metadata'].tolist()

In [32]:
# Add the documents and metadata to the collection alongwith generic integer IDs. You can also feed the metadata information as IDs by combining the policy name and page no.

insurance_collection.add(
    documents= documents_list,
    ids = [str(i) for i in range(0, len(documents_list))],
    metadatas = metadata_list
)

In [33]:
# Let's take a look at the first few entries in the collection

insurance_collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['0', '1', '2'],
 'embeddings': array([[-2.23880261e-02,  1.87088735e-02, -2.72935610e-02, ...,
         -3.68958823e-02,  2.89472216e-03, -1.34380336e-03],
        [-1.32057490e-02,  8.82212631e-03, -4.67860838e-03, ...,
         -1.56548154e-02, -4.84764605e-05,  7.25115696e-03],
        [-1.20378779e-02,  1.40740369e-02, -3.30295507e-03, ...,
         -2.85194907e-02, -9.43796150e-03,  1.02139572e-02]]),
 'documents': ['DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/01/2014 711 HIGH STREET GEORGE RI 02903 GROUP POLICY FOR: RHODE ISLAND JOHN DOE ALL MEMBERS Group Member Life Insurance Print Date: 07/16/2014',
  'POLICY RIDER GROUP INSURANCE POLICY NO: S655 COVERAGE: Life EMPLOYER: RHODE ISLAND JOHN DOE Effective on the later of the Date of Issue of this Group Policy or March 1, 2005, the following will apply to your Policy: From time to time The Principal may offer or provide certain employer groups who apply for coverage with The Principal a Financial Services Hotline and Gri

In [34]:
cache_collection = client.get_or_create_collection(name='Insurance_Cache', embedding_function=embedding_function)
cache_collection.peek()

{'ids': ['what are the  premium rate for each Member insured for Life Insurance ?'],
 'embeddings': array([[-0.0084867 ,  0.00680118,  0.00869522, ..., -0.02214467,
         -0.01064835, -0.01481872]]),
 'documents': ['what are the  premium rate for each Member insured for Life Insurance ?'],
 'uris': None,
 'included': ['metadatas', 'documents', 'embeddings'],
 'data': None,
 'metadatas': [{'metadatas_9': '{"Section": "Section F - Individual Pu", "Page_No.": "Page 42"}',
   'included_0': 'm',
   'ids_4': '32',
   'metadatas_4': '{"Page_No.": "Page 35", "Section": "Section C - Individual Te"}',
   'distances_6': 0.3369728624820709,
   'distances_1': 0.2915889024734497,
   'distances_2': 0.30502063035964966,
   'metadatas_5': '{"Section": "PART III - INDIVIDUAL REQ", "Page_No.": "Page 26"}',
   'metadatas_8': '{"Section": "Section A - Member Life I", "Page_No.": "Page 8"}',
   'documents_4': "Section C - Individual Terminations Article 1 - Member Life Insurance A Member's insurance unde

##  Semantic Search with Cache

In [35]:
# Read the user query

query = input()

what are the  premium rate for each Member insured for Life Insurance ?


In [36]:
# Query the cache collection to check if the results are already stored
cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)

# Print the results from the cache query for debugging
cache_results



{'ids': [['what are the  premium rate for each Member insured for Life Insurance ?']],
 'embeddings': None,
 'documents': [['what are the  premium rate for each Member insured for Life Insurance ?']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'distances_8': 0.33777451515197754,
    'included_7': 'a',
    'ids_6': '49',
    'included_0': 'm',
    'documents_9': "Section F - Individual Purchase Rights Article 1 - Member Life Insurance a. Individual Policy If a Member qualifies and makes timely application, he or she may convert the group coverage by purchasing an individual policy of life insurance under these terms: (1) The Member will not be required to submit Proof of Good Health. (2) The policy will be for life insurance only. No disability or other benefits will be included. (3) The policy will be on one of the forms, other than term insurance, then issued by The Principal to persons in the risk class to which the Member bel

In [37]:
results = insurance_collection.query(
query_texts=query,
n_results=10
)


In [38]:
results

{'ids': [['18', '19', '17', '43', '32', '23', '49', '56', '5', '39']],
 'embeddings': None,
 'documents': [["b . on any date the definition of Member or Dependent is changed; and c. on any date the Policyholder's business, as specified on the Policyholder application, is changed; and d. on any date that a schedule of insurance or class of insured Members is changed; and e. on any premium due date, if the Policyholder has been receiving a multiple policy discount rate and the Policyholder drops below the minimum number of coverages to receive such discount rate; and f. on any date the premium contribution required of Members is changed; and g. with respect to Member Life Insurance, on any Policy Anniversary, if the average age, average Scheduled Benefit amount, or the male/female distribution for then insured Members has changed since the last Policy Anniversary; and h. on any Policy Anniversary, if the volume of insurance for then insured Members has increased or decreased by more than

In [39]:
results.items()

dict_items([('ids', [['18', '19', '17', '43', '32', '23', '49', '56', '5', '39']]), ('embeddings', None), ('documents', [["b . on any date the definition of Member or Dependent is changed; and c. on any date the Policyholder's business, as specified on the Policyholder application, is changed; and d. on any date that a schedule of insurance or class of insured Members is changed; and e. on any premium due date, if the Policyholder has been receiving a multiple policy discount rate and the Policyholder drops below the minimum number of coverages to receive such discount rate; and f. on any date the premium contribution required of Members is changed; and g. with respect to Member Life Insurance, on any Policy Anniversary, if the average age, average Scheduled Benefit amount, or the male/female distribution for then insured Members has changed since the last Policy Anniversary; and h. on any Policy Anniversary, if the volume of insurance for then insured Members has increased or decrease

In [40]:
for key, val in results.items():
  print(key)

ids
embeddings
documents
uris
included
data
metadatas
distances


In [41]:
for key, val in results.items():
  print(key)
  print(val)

ids
[['18', '19', '17', '43', '32', '23', '49', '56', '5', '39']]
embeddings
None
documents
[["b . on any date the definition of Member or Dependent is changed; and c. on any date the Policyholder's business, as specified on the Policyholder application, is changed; and d. on any date that a schedule of insurance or class of insured Members is changed; and e. on any premium due date, if the Policyholder has been receiving a multiple policy discount rate and the Policyholder drops below the minimum number of coverages to receive such discount rate; and f. on any date the premium contribution required of Members is changed; and g. with respect to Member Life Insurance, on any Policy Anniversary, if the average age, average Scheduled Benefit amount, or the male/female distribution for then insured Members has changed since the last Policy Anniversary; and h. on any Policy Anniversary, if the volume of insurance for then insured Members has increased or decreased by more than 25% since the

In [42]:
import json

threshold = 0.2

results_df_1 = pd.DataFrame()

# Query the cache collection to check if the results are already stored
cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)

# Print the results from the cache query for debugging
print(cache_results)

# Check if the cache is empty or if the distance exceeds the threshold
if not cache_results['distances'][0] or cache_results['distances'][0][0] > threshold:
    # Query the main collection for the top 10 results
    results = insurance_collection.query(
        query_texts=query,
        n_results=10
    )

    # Prepare keys and values for storing in cache
    cache_data = {}
    for key, val in results.items():
        if val is None:
            continue
        # Adjust the loop to match the actual number of items in val
        for i in range(min(len(val[0]), 10)):  # Ensure you only loop over existing items
            cache_data[f"{key}_{i}"] = val[0][i]

    # Flatten the metadata for storage in ChromaDB
    flat_cache_data = {}
    for k, v in cache_data.items():
        if isinstance(v, dict):
            # Convert the dictionary to a JSON string
            flat_cache_data[k] = json.dumps(v)
        else:
            flat_cache_data[k] = v

    # Store the query in cache
    cache_collection.add(
        documents=[query],
        ids=[query],  # Alternatively, you can use a unique ID
        metadatas=flat_cache_data
    )

    print("Not found in cache. Found in main collection.")

    # Convert the results to a DataFrame
    result_dict = {
        'Metadatas_1': results['metadatas'][0],
        'Documents_1': results['documents'][0],
        'Distances_1': results['distances'][0],
        'IDs': results['ids'][0]
    }
    results_df_1 = pd.DataFrame.from_dict(result_dict)

# If the distance is within the threshold, retrieve results from the cache
elif cache_results['distances'][0][0] <= threshold:
    # Extract data from the cache
    cache_result_dict = cache_results['metadatas'][0][0]
    ids = []
    documents = []
    distances = []
    metadatas = []

    # Collect data based on keys
    for key, value in cache_result_dict.items():
        if 'ids' in key:
            ids.append(value)
        elif 'documents' in key:
            documents.append(value)
        elif 'distances' in key:
            distances.append(value)
        elif 'metadatas' in key:
            metadatas.append(value)

    print("Found in cache!")

    # Convert the cache data to a DataFrame
    results_df_1 = pd.DataFrame({
        'IDs_1': ids,
        'Documents_1': documents,
        'Distances_1': distances,
        'Metadatas_1': metadatas
    })



{'ids': [['what are the  premium rate for each Member insured for Life Insurance ?']], 'embeddings': None, 'documents': [['what are the  premium rate for each Member insured for Life Insurance ?']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'metadatas_7': '{"Section": "Section C - Dependent Lif", "Page_No.": "Page 59"}', 'included_4': 'd', 'ids_8': '5', 'ids_2': '17', 'metadatas_3': '{"Section": "PART IV - BENEFITS", "Page_No.": "Page 46"}', 'distances_7': 0.3376728892326355, 'metadatas_0': '{"Section": "b . on any date the defin", "Page_No.": "Page 21"}', 'ids_3': '43', 'distances_4': 0.3289873003959656, 'documents_0': "b . on any date the definition of Member or Dependent is changed; and c. on any date the Policyholder's business, as specified on the Policyholder application, is changed; and d. on any date that a schedule of insurance or class of insured Members is changed; and e. on any premium due date, if the Policyholder has 

In [43]:
results_df_1.head()

Unnamed: 0,IDs_1,Documents_1,Distances_1,Metadatas_1
0,5,b . on any date the definition of Member or De...,0.337673,"{""Section"": ""Section C - Dependent Lif"", ""Page..."
1,17,Section B - Premiums Article 1 - Payment Respo...,0.328987,"{""Section"": ""PART IV - BENEFITS"", ""Page_No."": ..."
2,43,PART IV - BENEFITS Section A - Member Life Ins...,0.334616,"{""Section"": ""b . on any date the defin"", ""Page..."
3,49,Section C - Dependent Life Insurance Article 1...,0.273043,"{""Page_No."": ""Page 35"", ""Section"": ""Section C ..."
4,18,(1) only one Accelerated Benefit payment will ...,0.318965,"{""Section"": ""Section A - Member Life I"", ""Page..."


In [44]:
query2=input()

what happens if i fail to pay to Premium?


In [45]:
cache_results= cache_collection.query(
         query_texts=query2,
          n_results=1
     )
cache_results

{'ids': [['what are the  premium rate for each Member insured for Life Insurance ?']],
 'embeddings': None,
 'documents': [['what are the  premium rate for each Member insured for Life Insurance ?']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'metadatas_4': '{"Page_No.": "Page 35", "Section": "Section C - Individual Te"}',
    'included_5': 'a',
    'metadatas_5': '{"Section": "PART III - INDIVIDUAL REQ", "Page_No.": "Page 26"}',
    'distances_5': 0.3346160650253296,
    'ids_1': '19',
    'ids_2': '17',
    'metadatas_9': '{"Section": "Section F - Individual Pu", "Page_No.": "Page 42"}',
    'metadatas_0': '{"Section": "b . on any date the defin", "Page_No.": "Page 21"}',
    'ids_4': '32',
    'documents_0': "b . on any date the definition of Member or Dependent is changed; and c. on any date the Policyholder's business, as specified on the Policyholder application, is changed; and d. on any date that a schedule of insurance

In [46]:
threshold = 0.2

results_df_2 = pd.DataFrame()

# Query the cache collection to check if the results are already stored
cache_results = cache_collection.query(
    query_texts=query2,
    n_results=1
)

# Print the results from the cache query for debugging
print(cache_results)

# Check if the cache is empty or if the distance exceeds the threshold
if not cache_results['distances'][0] or cache_results['distances'][0][0] > threshold:
    # Query the main collection for the top 10 results
    results = insurance_collection.query(
        query_texts=query2,
        n_results=10
    )

    # Prepare keys and values for storing in cache
    cache_data = {}
    for key, val in results.items():
        if val is None:
            continue
        # Adjust the loop to match the actual number of items in val
        for i in range(min(len(val[0]), 10)):  # Ensure you only loop over existing items
            cache_data[f"{key}_{i}"] = val[0][i]

    # Flatten the metadata for storage in ChromaDB
    flat_cache_data = {}
    for k, v in cache_data.items():
        if isinstance(v, dict):
            # Convert the dictionary to a JSON string
            flat_cache_data[k] = json.dumps(v)
        else:
            flat_cache_data[k] = v

    # Store the query in cache
    cache_collection.add(
        documents=[query2],
        ids=[query2],  # Alternatively, you can use a unique ID
        metadatas=flat_cache_data
    )

    print("Not found in cache. Found in main collection.")

    # Convert the results to a DataFrame
    result_dict = {
        'Metadatas_2': results['metadatas'][0],
        'Documents_2': results['documents'][0],
        'Distances_2': results['distances'][0],
        'IDs': results['ids'][0]
    }
    results_df_2 = pd.DataFrame.from_dict(result_dict)

# If the distance is within the threshold, retrieve results from the cache
elif cache_results['distances'][0][0] <= threshold:
    # Extract data from the cache
    cache_result_dict = cache_results['metadatas'][0][0]
    ids = []
    documents = []
    distances = []
    metadatas = []

    # Collect data based on keys
    for key, value in cache_result_dict.items():
        if 'ids' in key:
            ids.append(value)
        elif 'documents' in key:
            documents.append(value)
        elif 'distances' in key:
            distances.append(value)
        elif 'metadatas' in key:
            metadatas.append(value)

    print("Found in cache!")

    # Convert the cache data to a DataFrame
    results_df_2 = pd.DataFrame({
        'IDs_2': ids,
        'Documents_2': documents,
        'Distances_2': distances,
        'Metadatas_2': metadatas
    })

{'ids': [['what are the  premium rate for each Member insured for Life Insurance ?']], 'embeddings': None, 'documents': [['what are the  premium rate for each Member insured for Life Insurance ?']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'metadatas_9': '{"Section": "Section F - Individual Pu", "Page_No.": "Page 42"}', 'ids_0': '18', 'ids_6': '49', 'metadatas_7': '{"Section": "Section C - Dependent Lif", "Page_No.": "Page 59"}', 'documents_4': "Section C - Individual Terminations Article 1 - Member Life Insurance A Member's insurance under this Group Policy will terminate on the earliest of: a. the date this Group Policy is terminated; or b. the date the last premium is paid for the Member's insurance; or c. any date desired, if requested by the Member before that date; or d. the date the Member ceases to be a Member as defined in PART I; or e. the date the Member ceases to be in a class for which Member Life Insurance is provide

In [47]:
results_df_2.head()

Unnamed: 0,Metadatas_2,Documents_2,Distances_2,IDs
0,"{'Page_No.': 'Page 23', 'Section': 'Section C ...",Section C - Policy Termination Article 1 - Fai...,0.367496,20
1,"{'Section': 'Section B - Premiums', 'Page_No.'...",Section B - Premiums Article 1 - Payment Respo...,0.382645,17
2,"{'Section': 'T he Principal may termin', 'Page...",T he Principal may terminate the Policyholder'...,0.41705,21
3,"{'Section': 'b . on any date the defin', 'Page...",b . on any date the definition of Member or De...,0.419514,18
4,"{'Section': 'a. be actively engaged in', 'Page...",a. be actively engaged in business for profit ...,0.433545,14


In [49]:
query3 = input()

what documentaion are required fpr filing a claim?


In [50]:
cache_results= cache_collection.query(
         query_texts=query3,
          n_results=1
     )
cache_results

{'ids': [['what are the  premium rate for each Member insured for Life Insurance ?']],
 'embeddings': None,
 'documents': [['what are the  premium rate for each Member insured for Life Insurance ?']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'metadatas_8': '{"Section": "Section A - Member Life I", "Page_No.": "Page 8"}',
    'documents_3': "PART IV - BENEFITS Section A - Member Life Insurance Article 1 - Schedule of Insurance Subject to the Effective Date provisions of PART III, Section B, and the qualifying provisions of this Section A, the Scheduled Benefit for an insured Member will be based on his or her class: Class *Scheduled Benefit ALL MEMBERS $10,000 However, if a Member has received any payments under the Accelerated Benefits provision as described in Section A, Article 7, the Scheduled Benefit will be reduced by the amount of such payment. *The Scheduled Benefit is subject to the Proof of Good Health requirements as

In [51]:
threshold = 0.2

results_df_3 = pd.DataFrame()

# Query the cache collection to check if the results are already stored
cache_results = cache_collection.query(
    query_texts=query3,
    n_results=1
)

# Print the results from the cache query for debugging
#print(cache_results)

# Check if the cache is empty or if the distance exceeds the threshold
if not cache_results['distances'][0] or cache_results['distances'][0][0] > threshold:
    # Query the main collection for the top 10 results
    results = insurance_collection.query(
        query_texts=query3,
        n_results=10
    )

    # Prepare keys and values for storing in cache
    cache_data = {}
    for key, val in results.items():
        if val is None:
            continue
        # Adjust the loop to match the actual number of items in val
        for i in range(min(len(val[0]), 10)):  # Ensure you only loop over existing items
            cache_data[f"{key}_{i}"] = val[0][i]

    # Flatten the metadata for storage in ChromaDB
    flat_cache_data = {}
    for k, v in cache_data.items():
        if isinstance(v, dict):
            # Convert the dictionary to a JSON string
            flat_cache_data[k] = json.dumps(v)
        else:
            flat_cache_data[k] = v

    # Store the query in cache
    cache_collection.add(
        documents=[query3],
        ids=[query3],  # Alternatively, you can use a unique ID
        metadatas=flat_cache_data
    )

    print("Not found in cache. Found in main collection.")

    # Convert the results to a DataFrame
    result_dict = {
        'Metadatas_3': results['metadatas'][0],
        'Documents_3': results['documents'][0],
        'Distances_3': results['distances'][0],
        'IDs': results['ids'][0]
    }
    results_df_3 = pd.DataFrame.from_dict(result_dict)

# If the distance is within the threshold, retrieve results from the cache
elif cache_results['distances'][0][0] <= threshold:
    # Extract data from the cache
    cache_result_dict = cache_results['metadatas'][0][0]
    ids = []
    documents = []
    distances = []
    metadatas = []

    # Collect data based on keys
    for key, value in cache_result_dict.items():
        if 'ids' in key:
            ids.append(value)
        elif 'documents' in key:
            documents.append(value)
        elif 'distances' in key:
            distances.append(value)
        elif 'metadatas' in key:
            metadatas.append(value)

    print("Found in cache!")

    # Convert the cache data to a DataFrame
    results_df_3 = pd.DataFrame({
        'IDs_3': ids,
        'Documents_3': documents,
        'Distances_3': distances,
        'Metadatas_3': metadatas
    })

Not found in cache. Found in main collection.


In [52]:
results_df_3.head()

Unnamed: 0,Metadatas_3,Documents_3,Distances_3,IDs
0,"{'Page_No.': 'Page 61', 'Section': 'Section D ...",Section D - Claim Procedures Article 1 - Notic...,0.343138,58
1,"{'Section': 'A claimant may request an', 'Page...",A claimant may request an appeal of a claim de...,0.361748,59
2,"{'Page_No.': 'Page 18', 'Section': 'c . a copy...",c . a copy of the form which contains the stat...,0.383328,15
3,"{'Page_No.': 'Page 54', 'Section': 'f . claim ...","f . claim requirements listed in PART IV, Sect...",0.396981,51
4,"{'Page_No.': 'Page 17', 'Section': 'a. be acti...",a. be actively engaged in business for profit ...,0.4148,14


**Re-Ranking with a Cross Encoder**


In [53]:
!pip install sentence_transformers



In [54]:
# Import the CrossEncoder library from sentence_transformers

from sentence_transformers import CrossEncoder, util

In [55]:
 #Initialise the cross encoder model

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/794 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

In [56]:
cross_inputs_1 = [[query, response] for response in results_df_1['Documents_1']]
cross_rerank_scores_1 = cross_encoder.predict(cross_inputs_1)
cross_rerank_scores_1

array([ 2.2175784 ,  1.9360971 ,  0.15906563, -1.6021887 ,  0.7123252 ,
       -4.113749  , -4.5288877 , -2.4783213 , -0.08958972,  2.3985848 ],
      dtype=float32)

In [58]:
cross_inputs_1

[['what are the  premium rate for each Member insured for Life Insurance ?',
  "b . on any date the definition of Member or Dependent is changed; and c. on any date the Policyholder's business, as specified on the Policyholder application, is changed; and d. on any date that a schedule of insurance or class of insured Members is changed; and e. on any premium due date, if the Policyholder has been receiving a multiple policy discount rate and the Policyholder drops below the minimum number of coverages to receive such discount rate; and f. on any date the premium contribution required of Members is changed; and g. with respect to Member Life Insurance, on any Policy Anniversary, if the average age, average Scheduled Benefit amount, or the male/female distribution for then insured Members has changed since the last Policy Anniversary; and h. on any Policy Anniversary, if the volume of insurance for then insured Members has increased or decreased by more than 25% since the last Policy An

In [57]:
results_df_1['Reranked_scores'] = cross_rerank_scores_1
results_df_1

Unnamed: 0,IDs_1,Documents_1,Distances_1,Metadatas_1,Reranked_scores
0,5,b . on any date the definition of Member or De...,0.337673,"{""Section"": ""Section C - Dependent Lif"", ""Page...",2.217578
1,17,Section B - Premiums Article 1 - Payment Respo...,0.328987,"{""Section"": ""PART IV - BENEFITS"", ""Page_No."": ...",1.936097
2,43,PART IV - BENEFITS Section A - Member Life Ins...,0.334616,"{""Section"": ""b . on any date the defin"", ""Page...",0.159066
3,49,Section C - Dependent Life Insurance Article 1...,0.273043,"{""Page_No."": ""Page 35"", ""Section"": ""Section C ...",-1.602189
4,18,(1) only one Accelerated Benefit payment will ...,0.318965,"{""Section"": ""Section A - Member Life I"", ""Page...",0.712325
5,56,PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.305021,"{""Page_No."": ""Page 52"", ""Section"": ""(1) only o...",-4.113749
6,23,Section A - Member Life Insurance Schedule of ...,0.291589,"{""Section"": ""PART III - INDIVIDUAL REQ"", ""Page...",-4.528888
7,32,Section C - Individual Terminations Article 1 ...,0.336973,"{""Page_No."": ""Page 20"", ""Section"": ""Section B ...",-2.478321
8,39,Section F - Individual Purchase Rights Article...,0.337775,"{""Page_No."": ""Page 22"", ""Section"": ""The number...",-0.08959
9,19,The number of Members insured for Dependent Li...,0.340775,"{""Section"": ""Section F - Individual Pu"", ""Page...",2.398585


In [97]:
query

'what happens if i fail to pay to Premium?'

In [75]:
top_3_semantic_1 = results_df_1.sort_values(by='Distances_1')
top_3_semantic_1[:3]

Unnamed: 0,IDs_1,Documents_1,Distances_1,Metadatas_1,Reranked_scores
3,49,Section C - Dependent Life Insurance Article 1...,0.273043,"{""Page_No."": ""Page 35"", ""Section"": ""Section C ...",-1.602189
6,23,Section A - Member Life Insurance Schedule of ...,0.291589,"{""Section"": ""PART III - INDIVIDUAL REQ"", ""Page...",-4.528888
5,56,PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.305021,"{""Page_No."": ""Page 52"", ""Section"": ""(1) only o...",-4.113749


In [76]:
top_3_rerank_1 = results_df_1.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_1[:3]


Unnamed: 0,IDs_1,Documents_1,Distances_1,Metadatas_1,Reranked_scores
9,19,The number of Members insured for Dependent Li...,0.340775,"{""Section"": ""Section F - Individual Pu"", ""Page...",2.398585
0,5,b . on any date the definition of Member or De...,0.337673,"{""Section"": ""Section C - Dependent Lif"", ""Page...",2.217578
1,17,Section B - Premiums Article 1 - Payment Respo...,0.328987,"{""Section"": ""PART IV - BENEFITS"", ""Page_No."": ...",1.936097


In [62]:
print(top_3_rerank_1[:3])

  IDs_1                                        Documents_1  Distances_1  \
9    19  The number of Members insured for Dependent Li...     0.340775   
0     5  b . on any date the definition of Member or De...     0.337673   
1    17  Section B - Premiums Article 1 - Payment Respo...     0.328987   

                                         Metadatas_1  Reranked_scores  
9  {"Section": "Section F - Individual Pu", "Page...         2.398585  
0  {"Section": "Section C - Dependent Lif", "Page...         2.217578  
1  {"Section": "PART IV - BENEFITS", "Page_No.": ...         1.936097  


In [77]:
top_3_RAG_1 = top_3_rerank_1[["Documents_1", "Metadatas_1"]][:3]
top_3_RAG_1

Unnamed: 0,Documents_1,Metadatas_1
9,The number of Members insured for Dependent Li...,"{""Section"": ""Section F - Individual Pu"", ""Page..."
0,b . on any date the definition of Member or De...,"{""Section"": ""Section C - Dependent Lif"", ""Page..."
1,Section B - Premiums Article 1 - Payment Respo...,"{""Section"": ""PART IV - BENEFITS"", ""Page_No."": ..."


In [63]:
cross_inputs_2 = [[query2, response] for response in results_df_2['Documents_2']]


In [98]:
query2

'what happens if i fail to pay to Premium?'

In [64]:
cross_rerank_scores_2 = cross_encoder.predict(cross_inputs_2)
results_df_2['Reranked_scores'] = cross_rerank_scores_2

In [73]:
top_3_semantic_2 = results_df_2.sort_values(by='Distances_2')
top_3_semantic_2[:3]

Unnamed: 0,Metadatas_2,Documents_2,Distances_2,IDs,Reranked_scores
0,"{'Page_No.': 'Page 23', 'Section': 'Section C ...",Section C - Policy Termination Article 1 - Fai...,0.367496,20,3.999949
1,"{'Section': 'Section B - Premiums', 'Page_No.'...",Section B - Premiums Article 1 - Payment Respo...,0.382645,17,-6.523407
2,"{'Section': 'T he Principal may termin', 'Page...",T he Principal may terminate the Policyholder'...,0.41705,21,-3.041489


In [74]:
top_3_rerank_2 = results_df_2.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_2[:3]

Unnamed: 0,Metadatas_2,Documents_2,Distances_2,IDs,Reranked_scores
0,"{'Page_No.': 'Page 23', 'Section': 'Section C ...",Section C - Policy Termination Article 1 - Fai...,0.367496,20,3.999949
2,"{'Section': 'T he Principal may termin', 'Page...",T he Principal may terminate the Policyholder'...,0.41705,21,-3.041489
1,"{'Section': 'Section B - Premiums', 'Page_No.'...",Section B - Premiums Article 1 - Payment Respo...,0.382645,17,-6.523407


In [78]:
top_3_RAG_2 = top_3_rerank_2[["Documents_2", "Metadatas_2"]][:3]
top_3_RAG_2

Unnamed: 0,Documents_2,Metadatas_2
0,Section C - Policy Termination Article 1 - Fai...,"{'Page_No.': 'Page 23', 'Section': 'Section C ..."
2,T he Principal may terminate the Policyholder'...,"{'Section': 'T he Principal may termin', 'Page..."
1,Section B - Premiums Article 1 - Payment Respo...,"{'Section': 'Section B - Premiums', 'Page_No.'..."


In [99]:
query3

'what documentaion are required fpr filing a claim?'

In [72]:
cross_inputs_3 = [[query3, response] for response in results_df_3['Documents_3']]
cross_rerank_scores_3 = cross_encoder.predict(cross_inputs_3)

In [69]:
results_df_3['Reranked_scores'] = cross_rerank_scores_3
top_3_semantic_3 = results_df_3.sort_values(by='Distances_3')
top_3_semantic_3[:3]

Unnamed: 0,Metadatas_3,Documents_3,Distances_3,IDs,Reranked_scores
0,"{'Page_No.': 'Page 61', 'Section': 'Section D ...",Section D - Claim Procedures Article 1 - Notic...,0.343138,58,-2.471166
1,"{'Section': 'A claimant may request an', 'Page...",A claimant may request an appeal of a claim de...,0.361748,59,-4.036301
2,"{'Page_No.': 'Page 18', 'Section': 'c . a copy...",c . a copy of the form which contains the stat...,0.383328,15,-10.063317


In [70]:
top_3_rerank_3 = results_df_3.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_3[:3]

Unnamed: 0,Metadatas_3,Documents_3,Distances_3,IDs,Reranked_scores
0,"{'Page_No.': 'Page 61', 'Section': 'Section D ...",Section D - Claim Procedures Article 1 - Notic...,0.343138,58,-2.471166
3,"{'Page_No.': 'Page 54', 'Section': 'f . claim ...","f . claim requirements listed in PART IV, Sect...",0.396981,51,-2.598758
1,"{'Section': 'A claimant may request an', 'Page...",A claimant may request an appeal of a claim de...,0.361748,59,-4.036301


In [79]:

top_3_RAG_3 = top_3_rerank_3[["Documents_3", "Metadatas_3"]][:3]
top_3_RAG_3

Unnamed: 0,Documents_3,Metadatas_3
0,Section D - Claim Procedures Article 1 - Notic...,"{'Page_No.': 'Page 61', 'Section': 'Section D ..."
3,"f . claim requirements listed in PART IV, Sect...","{'Page_No.': 'Page 54', 'Section': 'f . claim ..."
1,A claimant may request an appeal of a claim de...,"{'Section': 'A claimant may request an', 'Page..."


In [82]:
# Define the function to generate the response. Provide a comprehensive prompt that passes the user query and the top 3 results to the model

def generate_response(query, top_3_RAG):
    """
    Generate a response using GPT-3.5's ChatCompletion based on the user query and retrieved information.
    """
    messages = [
        {"role": "system", "content": "You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents."},
        {"role": "user", "content": f"""
            You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents.
            You have a question asked by the user in '{query}' and you have some search results from a corpus of insurance documents in the dataframe '{top_3_RAG}'. These search results are essentially one page of an insurance document that may be relevant to the user query.

            The column 'documents' inside this dataframe contains the actual text from the policy document and the column 'metadata' contains the policy name and source page. The text inside the document may also contain tables in the format of a list of lists where each of the nested lists indicates a row.

            Use the documents in '{top_3_RAG}' to answer the query '{query}'. Frame an informative answer and also, use the dataframe to return the relevant policy names and page numbers as citations.

            Follow the guidelines below when performing the task:
            1. Try to provide relevant/accurate numbers if available.
            2. You don’t have to necessarily use all the information in the dataframe. Only choose information that is relevant.
            3. If the document text has tables with relevant information, please reformat the table and return the final information in a tabular format.
            4. Use the 'metadata' columns in the dataframe to retrieve and cite the policy name(s) and page number(s) as citation.
            5. If you can't provide the complete answer, please also provide any information that will help the user to search specific sections in the relevant cited documents.
            6. You are a customer-facing assistant, so do not provide any information on internal workings, just answer the query directly.

            The generated response should answer the query directly addressing the user and avoiding additional information. If you think that the query is not relevant to the document, reply that the query is irrelevant. Provide the final response as a well-formatted and easily readable text along with the citation. Provide your complete response first with all information, and then provide the citations.

            ### Few-Shot Examples

            ### Example 1: Basic Query about Coverage
            **Query:**
            What does the policy say about coverage for accidental death?

            **Top 3 RAG Results:**
            - **Document 1:** "This policy provides coverage for accidental death. The insured amount for accidental death is 200% of the base coverage amount if the death occurs within 90 days of the accident..."
            - **Document 2:** "Accidental death benefits are payable under this policy if the insured dies as a result of an accident. The benefit amount equals double the coverage amount, provided the death is a direct result of the accident and occurs within a specified time frame..."
            - **Document 3:** "In the event of accidental death, the policy pays an additional benefit, which is equal to twice the original coverage amount. This benefit is contingent on the death occurring within 180 days from the date of the accident..."

            **Response:**
            The policy provides coverage for accidental death, where the benefit amount is typically 200% of the base coverage. The death must occur as a direct result of an accident and within a specified period, which varies between 90 to 180 days depending on the policy.
            **Citations:**
            Document 1: Heading/Section X, Page 5
            Document 2: Heading/Section Y, Page 12
            Document 3: Heading/Section Z, Page 7

            ### Example 2: Query about Exclusions
            **Query:**
            Are there any exclusions for pre-existing conditions in this policy?

            **Top 3 RAG Results:**
            - **Document 1:** "This policy excludes coverage for any conditions that were diagnosed or treated within 12 months prior to the policy's start date. However, if the condition remains stable for 24 months after the policy's start date, it may be eligible for coverage..."
            - **Document 2:** "Pre-existing conditions are generally not covered under this policy unless explicitly stated otherwise. Any condition that has shown symptoms or required medical attention in the 12 months before the policy start date is excluded..."
            - **Document 3:** "Exclusions apply to pre-existing conditions if they were present within a 12-month window before the policy commencement. After a waiting period of 24 months, these conditions may be reconsidered for coverage..."

            **Response:**
            The policy excludes coverage for pre-existing conditions that were diagnosed or treated within 12 months before the policy start date. However, if the condition remains stable and does not require treatment for 24 months after the policy start date, it may be eligible for coverage.
            **Citations:**
            Document 1: Heading/Section X, Page 8
            Document 2: Heading/Section Y, Page 15
            Document 3: Heading/Section Z, Page 10

            ### Example 3: Query about Beneficiaries
            **Query:**
            How can I update the beneficiary for my life insurance policy?

            **Top 3 RAG Results:**
            - **Document 1:** "To update the beneficiary, the policyholder must submit a written request to the insurance company. The request should include the policy number, current beneficiary, and the new beneficiary's details..."
            - **Document 2:** "Beneficiary changes can be made by filling out the 'Beneficiary Change Form,' which must be signed by the policyholder and submitted to the insurance provider. The change takes effect upon the company's receipt and acceptance of the form..."
            - **Document 3:** "The policyholder has the right to change the beneficiary at any time by providing written notice to the insurer. The change will be recorded and acknowledged by the insurer upon receipt..."

            **Response:**
            To update the beneficiary of your life insurance policy, you need to submit a written request or complete a 'Beneficiary Change Form' provided by your insurer. This request must be signed by you as the policyholder, and the change will take effect once the insurance company has received and acknowledged it.
            **Citations:**
            Document 1: Heading/Section X, Page 3
            Document 2: Heading/Section Y, Page 6
            Document 3: Heading/Section Z, Page 4

            ### Example 4: Complex Query about Premium Payments
            **Query:**
            What are the options if I miss a premium payment?

            **Top 3 RAG Results:**
            - **Document 1:** "If a premium payment is missed, the policyholder typically has a 30-day grace period to make the payment without penalty. If payment is not received within this period, the policy may lapse, and coverage could be lost..."
            - **Document 2:** "The policy includes a 30-day grace period for missed payments. During this time, coverage remains in force. If payment is not made by the end of the grace period, the policyholder may request reinstatement of the policy, subject to underwriting..."
            - **Document 3:** "In the event of a missed premium, a 30-day grace period is granted. If the premium is not paid within this period, the policyholder may choose to reinstate the policy, which may require proof of insurability and payment of overdue premiums..."

            **Response:**
            If you miss a premium payment, your policy provides a 30-day grace period during which you can make the payment without losing coverage. If the payment is not made within this period, the policy may lapse. However, you may have the option to reinstate the policy by providing proof of insurability and paying the overdue premiums.
            **Citations:**
            Document 1: Heading/Section X, Page 10
            Document 2: Heading/Section Y, Page 11
            Document 3: Heading/Section Z, Page 9
        """},
    ]

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages
    )

    return response.choices[0].message.content.split('\n')

In [83]:
response = generate_response(query, top_3_RAG_1)

In [84]:
print("\n".join(response))

The query "what are the premium rates for each Member insured for Life Insurance?" is not directly addressed in the provided documents. However, you can refer to specific sections within the documents to find relevant information related to premium rates for life insurance. Here are some guidelines to help you locate the information within the documents:

1. Look for sections specifically related to premium calculations, payment responsibilities, or premium rates within the documents.
2. Check for any tables or sections that outline premium rates for different types of insurance coverage, including life insurance.
3. Pay attention to details regarding individual members or insured persons to find information related to premium rates based on membership.

If you need specific premium rate details, it would be advisable to explore the sections related to premiums and calculations in the provided policy documents. Utilize the metadata provided to locate the relevant sections within the do

In [85]:
response = generate_response(query2, top_3_RAG_2)

In [86]:
print("\n".join(response))

**Response:**

If you fail to pay your premium, the policy typically includes a 30-day grace period during which you can make the payment without any penalty. However, if the payment is not received within this grace period, your policy may lapse, and you could lose coverage. In such a scenario, you may have the option to reinstate the policy, which might involve providing proof of insurability and paying any overdue premiums.

**Citations:**
- Policy Name: Section B - Premiums
- Page Number: Page 23


In [87]:
response = generate_response(query3, top_3_RAG_3)

In [88]:
print("\n".join(response))

**Query:**  
What documentation is required for filing a claim?

**Response:**  
For filing a claim, the required documentation typically includes:

1. Completed claim form containing the necessary details about the claim.
2. Proof of the incident or event that led to the claim, such as police reports, medical records, or photographs.
3. Supporting documents like invoices, receipts, estimates, or bills related to the claim.
4. Any relevant communication exchanged between the insured and the insurance company regarding the claim.
5. Any other documentation specified by the insurance policy for the particular type of claim.

**Citations:**  
- Document 1: Section D - Claim Procedures, Page 61
- Document 2: Part IV, Section f - Claim Requirements, Page 54
- Document 3: Appeal of Claim Denial, Relevant Section/Page not specified


**Final Code**

In [89]:
def get_context(query):
 threshold = 0.2

 results_df = pd.DataFrame()

 # Query the cache collection to check if the results are already stored
 cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
 )

 # Print the results from the cache query for debugging
 print(cache_results)

 # Check if the cache is empty or if the distance exceeds the threshold
 if not cache_results['distances'][0] or cache_results['distances'][0][0] > threshold:
    # Query the main collection for the top 10 results
    results = insurance_collection.query(
        query_texts=query,
        n_results=10
    )

    # Prepare keys and values for storing in cache
    cache_data = {}
    for key, val in results.items():
        if val is None:
            continue
        # Adjust the loop to match the actual number of items in val
        for i in range(min(len(val[0]), 10)):  # Ensure you only loop over existing items
            cache_data[f"{key}_{i}"] = val[0][i]

    # Flatten the metadata for storage in ChromaDB
    flat_cache_data = {}
    for k, v in cache_data.items():
        if isinstance(v, dict):
            # Convert the dictionary to a JSON string
            flat_cache_data[k] = json.dumps(v)
        else:
            flat_cache_data[k] = v

    # Store the query in cache
    cache_collection.add(
        documents=[query],
        ids=[query],  # Alternatively, you can use a unique ID
        metadatas=flat_cache_data
    )

    print("Not found in cache. Found in main collection.")

    # Convert the results to a DataFrame
    result_dict = {
        'Metadatas': results['metadatas'][0],
        'Documents': results['documents'][0],
        'Distances': results['distances'][0],
        'IDs': results['ids'][0]
    }
    results_df = pd.DataFrame.from_dict(result_dict)

 # If the distance is within the threshold, retrieve results from the cache
 elif cache_results['distances'][0][0] <= threshold:
    # Extract data from the cache
    cache_result_dict = cache_results['metadatas'][0][0]
    ids = []
    documents = []
    distances = []
    metadatas = []

    # Collect data based on keys
    for key, value in cache_result_dict.items():
        if 'ids' in key:
            ids.append(value)
        elif 'documents' in key:
            documents.append(value)
        elif 'distances' in key:
            distances.append(value)
        elif 'metadatas' in key:
            metadatas.append(value)

    print("Found in cache!")

    # Convert the cache data to a DataFrame
    results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': distances,
        'Metadatas': metadatas
    })
    return results_df

In [90]:
def rerank(results_df):
  cross_inputs = [[query, response] for response in results_df['Documents']]
  cross_rerank_scores = cross_encoder.predict(cross_inputs)
  results_df['Reranked_scores'] = cross_rerank_scores
  return results_df

In [91]:
def top_3_context(results_df):
  top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
  return top_3_rerank[:3]

In [92]:
def get_reply(query):
    results_df = get_context(query)
    results_df = rerank(results_df)
    top_3_rerank = top_3_context(results_df)
    response=generate_response(query, top_3_rerank)
    return "\n".join(response)

In [93]:
query = input()

what happens if i fail to pay to Premium?


In [94]:
print(get_reply(query))

{'ids': [['what happens if i fail to pay to Premium?']], 'embeddings': None, 'documents': [['what happens if i fail to pay to Premium?']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'ids_0': '20', 'included_5': 'a', 'distances_9': 0.44724467396736145, 'ids_5': '19', 'distances_2': 0.417049765586853, 'distances_4': 0.43354514241218567, 'ids_7': '44', 'documents_1': 'Section B - Premiums Article 1 - Payment Responsibility; Due Dates; Grace Period The Policyholder is responsible for collection and payment of all premiums due while this Group Policy is in force. Payments must be sent to the home office of The Principal in Des Moines, Iowa. The first premium is due on the Date of Issue of this Group Policy. Each premium thereafter will be due on the first of each Insurance Month. Except for the first premium, a Grace Period of 31 days will be allowed for payment of premium. "Grace Period" means the first 31-day period following a premium