### 1. Install required libraries

In [35]:
!pip install pdfplumber
!pip install chromadb
!pip install tiktoken
!pip install openai
!pip install sentence-transformers



In [36]:
import pdfplumber
from pathlib import Path
import pandas as pd
from operator import itemgetter
import json
import tiktoken
import chromadb
import openai

### 2. Read and process the PDF document

In [37]:
pdf_url="https://raw.githubusercontent.com/Kamar-p/Helpmate_AI/main/Principal-Sample-Life-Insurance-Policy.pdf"

In [38]:
## read a single pdf page through pdf plumber

import requests
from io import BytesIO

# Download the PDF
response = requests.get(pdf_url)
pdf_file = BytesIO(response.content)

with pdfplumber.open(pdf_file) as pdf:

    single_page = pdf.pages[25]
    text = single_page.extract_text()
    tables = single_page.extract_tables()

    print(text)

PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS
Section A - Eligibility
Article 1 - Member Life Insurance
A person will be eligible for Member Life Insurance on the date the person completes 30
consecutive days of continuous Active Work with the Policyholder as a Member.
In no circumstance will a person be eligible for Member Life Insurance under this Group Policy
if the person is eligible under any other Group Term Life Insurance policy underwritten by The
Principal.
Article 2 - Member Accidental Death and Dismemberment Insurance
A person will be eligible for Member Accidental Death and Dismemberment Insurance on the
latest of:
a. the date the person is eligible for Member Life Insurance; or
b. the date the person enters a class for which Member Accidental Death and Dismemberment
Insurance is provided under this Group Policy; or
c. the date Member Accidental Death and Dismemberment Insurance is added to this Group
Policy.
Article 3 - Dependent Life Insurance
A person will be eligible fo

In [39]:
# Function to check whether a word is present in a table or not for segregation of regular text and tables

def check_bboxes(word, table_bbox):
    # Check whether word is inside a table bbox.
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]

In [40]:
# Function to extract text from a PDF file.
# 1. Declare a variable p to store the iteration of the loop that will help us store page numbers alongside the text
# 2. Declare an empty list 'full_text' to store all the text files
# 3. Use pdfplumber to open the pdf pages one by one
# 4. Find the tables and their locations in the page
# 5. Extract the text from the tables in the variable 'tables'
# 6. Extract the regular words by calling the function check_bboxes() and checking whether words are present in the table or not
# 7. Use the cluster_objects utility to cluster non-table and table words together so that they retain the same chronology as in the original PDF
# 8. Declare an empty list 'lines' to store the page text
# 9. If a text element in present in the cluster, append it to 'lines', else if a table element is present, append the table
# 10. Append the page number and all lines to full_text, and increment 'p'
# 11. When the function has iterated over all pages, return the 'full_text' list

def extract_text_from_pdf(pdf_file):
    p = 0
    full_text = []


    with pdfplumber.open(pdf_file) as pdf:
        for page in pdf.pages:
            page_no = f"Page {p+1}"
            text = page.extract_text()

            tables = page.find_tables()
            table_bboxes = [i.bbox for i in tables]
            tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
            non_table_words = [word for word in page.extract_words() if not any(
                [check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
            lines = []

            for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5):

                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join([i['text'] for i in cluster]))
                    except KeyError:
                        pass

                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))


            full_text.append([page_no, " ".join(lines)])
            p +=1

    return full_text

In [41]:
# Initialize an empty list to store the extracted texts and document names
data = []

# Process the PDF file
print(f"...Processing {pdf_file}")

# Call the function to extract the text from the PDF
extracted_text = extract_text_from_pdf(pdf_file)

# Convert the extracted list to a PDF, and add a column to store document names
extracted_text_df = pd.DataFrame(extracted_text, columns=['Page No.', 'Page_Text'])

# Append the extracted text and document name to the list
data.append(extracted_text_df)

# Print a message to indicate progress
print(f"Finished processing {pdf_file}")

# Print a message to indicate all PDFs have been processed
print("PDF have been processed.")

...Processing <_io.BytesIO object at 0x7b86f73a4d60>
Finished processing <_io.BytesIO object at 0x7b86f73a4d60>
PDF have been processed.


In [42]:
data

[   Page No.                                          Page_Text
 0    Page 1  DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...
 1    Page 2                 This page left blank intentionally
 2    Page 3  POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...
 3    Page 4                 This page left blank intentionally
 4    Page 5  PRINCIPAL LIFE INSURANCE COMPANY (called The P...
 ..      ...                                                ...
 59  Page 60  I f a Dependent who was insured dies during th...
 60  Page 61  Section D - Claim Procedures Article 1 - Notic...
 61  Page 62  A claimant may request an appeal of a claim de...
 62  Page 63                 This page left blank intentionally
 63  Page 64  Principal Life Insurance Company Des Moines, I...
 
 [64 rows x 2 columns]]

In [43]:
insurance_pdfs_data = pd.concat(data, ignore_index=True)

In [44]:
insurance_pdfs_data.head()

Unnamed: 0,Page No.,Page_Text
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...
1,Page 2,This page left blank intentionally
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...
3,Page 4,This page left blank intentionally
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY (called The P...


In [45]:
insurance_pdfs_data['Text_Length']=insurance_pdfs_data['Page_Text'].apply(lambda x:len(x.split(" ")))

In [46]:
max(insurance_pdfs_data['Text_Length'])

462

In [47]:
insurance_pdfs_data['Metadata'] = insurance_pdfs_data.apply(lambda x: {'Page_No.': x['Page No.']}, axis=1)
insurance_pdfs_data

Unnamed: 0,Page No.,Page_Text,Text_Length,Metadata
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,30,{'Page_No.': 'Page 1'}
1,Page 2,This page left blank intentionally,5,{'Page_No.': 'Page 2'}
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,230,{'Page_No.': 'Page 3'}
3,Page 4,This page left blank intentionally,5,{'Page_No.': 'Page 4'}
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,110,{'Page_No.': 'Page 5'}
...,...,...,...,...
59,Page 60,I f a Dependent who was insured dies during th...,285,{'Page_No.': 'Page 60'}
60,Page 61,Section D - Claim Procedures Article 1 - Notic...,418,{'Page_No.': 'Page 61'}
61,Page 62,A claimant may request an appeal of a claim de...,322,{'Page_No.': 'Page 62'}
62,Page 63,This page left blank intentionally,5,{'Page_No.': 'Page 63'}


In [48]:
insurance_pdfs_data.Page_Text[6]

'Section A – Eligibility Member Life Insurance Article 1 Member Accidental Death and Dismemberment Insurance Article 2 Dependent Life Insurance Article 3 Section B - Effective Dates Member Life Insurance Article 1 Member Accidental Death and Dismemberment Insurance Article 2 Dependent Life Insurance Article 3 Section C - Individual Terminations Member Life Insurance Article 1 Member Accidental Death and Dismemberment Insurance Article 2 Dependent Life Insurance Article 3 Termination for Fraud Article 4 Coverage While Outside of the United States Article 5 Section D - Continuation Member Life Insurance Article 1 Dependent Insurance - Developmentally Disabled or Physically Handicapped Children Article 2 Section E - Reinstatement Reinstatement Article 1 Federal Required Family and Medical Leave Act (FMLA) Article 2 Reinstatement of Coverage for a Member or Dependent When Coverage Ends due to Living Outside of the United States Article 3 Section F - Individual Purchase Rights Member Life I

### 3.Chunking

In [49]:
# Iterating over all page titles to create the final df with individual chunks
page_nos = insurance_pdfs_data["Page No."]
page_nos

Unnamed: 0,Page No.
0,Page 1
1,Page 2
2,Page 3
3,Page 4
4,Page 5
...,...
59,Page 60
60,Page 61
61,Page 62
62,Page 63


In [50]:
# Function to split text into fixed-size chunks with overlap
def split_text_into_chunks(text, chunk_size, overlap=50):
    chunks = []
    words = text.split()  # Split the text into words

    if overlap >= chunk_size:
        raise ValueError("Overlap must be smaller than chunk size.")

    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = words[start:end]
        chunks.append(' '.join(chunk))
        start += chunk_size - overlap  # move forward with overlap

    return chunks


In [59]:
import copy  # Make sure to import this

def process_page(page_no, chunk_size=200, overlap=50):
    # Get the specific row for the page
    page_row = insurance_pdfs_data[insurance_pdfs_data['Page No.'] == page_no]

    if not page_row.empty:
        page_text = page_row.Page_Text.values[0]
        base_metadata = page_row.Metadata.values[0]

        # Split the page into overlapping chunks
        text_chunks = split_text_into_chunks(page_text, chunk_size, overlap)

        data = {'Title': [], 'Chunk Text': [], 'Metadata': []}

        for index, chunk in enumerate(text_chunks):
            data['Title'].append(page_no)
            data['Chunk Text'].append(chunk)

            #  Deep copy INSIDE loop so each chunk gets its own independent dict
            chunk_metadata = copy.deepcopy(base_metadata)
            chunk_metadata['Chunk_No.'] = index
            data['Metadata'].append(chunk_metadata)

        return pd.DataFrame(data)

    return None


In [60]:
# creating a dataframe after calling process
all_dfs = []
for page_no in page_nos:
    df = process_page(page_no)
    if df is not None:
        all_dfs.append(df)

fixed_chunk_df = pd.concat(all_dfs, ignore_index=True)
fixed_chunk_df

Unnamed: 0,Title,Chunk Text,Metadata
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,"{'Page_No.': 'Page 1', 'Chunk_No.': 0}"
1,Page 2,This page left blank intentionally,"{'Page_No.': 'Page 2', 'Chunk_No.': 0}"
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,"{'Page_No.': 'Page 3', 'Chunk_No.': 0}"
3,Page 3,Principal is not responsible for the provision...,"{'Page_No.': 'Page 3', 'Chunk_No.': 1}"
4,Page 4,This page left blank intentionally,"{'Page_No.': 'Page 4', 'Chunk_No.': 0}"
...,...,...,...
141,Page 62,A claimant may request an appeal of a claim de...,"{'Page_No.': 'Page 62', 'Chunk_No.': 0}"
142,Page 62,"Dependent, or Beneficiary. Article 5 - Medical...","{'Page_No.': 'Page 62', 'Chunk_No.': 1}"
143,Page 62,This policy has been updated effective January...,"{'Page_No.': 'Page 62', 'Chunk_No.': 2}"
144,Page 63,This page left blank intentionally,"{'Page_No.': 'Page 63', 'Chunk_No.': 0}"


# Create Embeddings

In [66]:
from sentence_transformers import SentenceTransformer

model_name = "all-MiniLM-L6-v2"
embedder = SentenceTransformer(model_name)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [67]:
# Function to generate embeddings for text
def generate_embeddings(texts):
    embeddings = embedder.encode(texts, convert_to_tensor=False)
    return embeddings

In [68]:
# function to generate embedding on dataframe
def generate_embeddings_on_df(df):
  df['Embeddings'] = df['Chunk Text'].apply(lambda x: generate_embeddings([x])[0])

In [69]:
# Create embeddings for 'Chunk Text' column on all three dataframes
generate_embeddings_on_df(fixed_chunk_df)

  return forward_call(*args, **kwargs)


In [71]:
# print the dataframe
fixed_chunk_df

Unnamed: 0,Title,Chunk Text,Metadata,Embeddings
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,"{'Page_No.': 'Page 1', 'Chunk_No.': 0}","[-0.025921896, 0.04777749, 0.05585775, 0.04239..."
1,Page 2,This page left blank intentionally,"{'Page_No.': 'Page 2', 'Chunk_No.': 0}","[0.029118983, 0.060574077, 0.046415307, 0.0377..."
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,"{'Page_No.': 'Page 3', 'Chunk_No.': 0}","[-0.10579571, -0.00053022936, 0.01657757, -0.0..."
3,Page 3,Principal is not responsible for the provision...,"{'Page_No.': 'Page 3', 'Chunk_No.': 1}","[-0.075226046, 0.08459611, 0.015716348, -0.086..."
4,Page 4,This page left blank intentionally,"{'Page_No.': 'Page 4', 'Chunk_No.': 0}","[0.029118983, 0.060574077, 0.046415307, 0.0377..."
...,...,...,...,...
141,Page 62,A claimant may request an appeal of a claim de...,"{'Page_No.': 'Page 62', 'Chunk_No.': 0}","[-0.061521724, 0.11785729, 0.077967495, -0.063..."
142,Page 62,"Dependent, or Beneficiary. Article 5 - Medical...","{'Page_No.': 'Page 62', 'Chunk_No.': 1}","[-0.1287402, 0.1252731, 0.07411633, -0.0609339..."
143,Page 62,This policy has been updated effective January...,"{'Page_No.': 'Page 62', 'Chunk_No.': 2}","[-0.12950504, 0.063508466, 0.09417583, -0.0406..."
144,Page 63,This page left blank intentionally,"{'Page_No.': 'Page 63', 'Chunk_No.': 0}","[0.029118983, 0.060574077, 0.046415307, 0.0377..."


# Store Embeddings to ChromaDB

In [72]:
# Define the path where chroma collections will be stored
chroma_data_path = "/content/drive/MyDrive/ChromaDB_Insurance"


In [106]:
# Call PersistentClient()
client = chromadb.PersistentClient(path=chroma_data_path)

In [107]:
# Create a collection to store the embeddings. Collections in Chroma are where you can store your embeddings, documents, and any additional metadata.
collection = client.get_or_create_collection(name="insurance-collection")

In [108]:
collection.add(
    embeddings = fixed_chunk_df['Embeddings'].to_list(),
    documents = fixed_chunk_df['Chunk Text'].to_list(),
    metadatas = fixed_chunk_df['Metadata'].to_list(),
    ids = [str(i) for i in range(0, len(fixed_chunk_df['Embeddings']))]
)

In [76]:
# get few of data by ids from collection
collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['0', '1', '2'],
 'embeddings': array([[-0.0259219 ,  0.04777749,  0.05585775, ..., -0.04932665,
         -0.05851149,  0.02355204],
        [ 0.02911898,  0.06057408,  0.04641531, ...,  0.05954009,
         -0.02838372,  0.00531935],
        [-0.10579571, -0.00053023,  0.01657757, ..., -0.03772428,
          0.03662254, -0.04041128]]),
 'documents': ['DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/01/2014 711 HIGH STREET GEORGE RI 02903 GROUP POLICY FOR: RHODE ISLAND JOHN DOE ALL MEMBERS Group Member Life Insurance Print Date: 07/16/2014',
  'This page left blank intentionally',
  'POLICY RIDER GROUP INSURANCE POLICY NO: S655 COVERAGE: Life EMPLOYER: RHODE ISLAND JOHN DOE Effective on the later of the Date of Issue of this Group Policy or March 1, 2005, the following will apply to your Policy: From time to time The Principal may offer or provide certain employer groups who apply for coverage with The Principal a Financial Services Hotline and Grief Support Services or any other

In [77]:
# create a cache collection
cache_collection = client.get_or_create_collection(name='insurance-collection-cache')

In [78]:
# peek few of elements from cache collection
cache_collection.peek()

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': [],
 'uris': None,
 'included': ['metadatas', 'documents', 'embeddings'],
 'data': None,
 'metadatas': []}

# Semantic Search with Cache

In [79]:
# Read the user query
query = input()

what is the life insurance coverage for death?


In [80]:
# Searh the Cache collection first
# Query the collection against the user query and return the top 20 results
cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 78.6MiB/s]


In [81]:
# get result from cache collection
cache_results

{'ids': [[]],
 'embeddings': None,
 'documents': [[]],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[]],
 'distances': [[]]}

In [82]:
# get result from main collection
results = collection.query(
query_texts=query,
n_results=10
)
print("Result size is : " + str(len(results.items())))
results.items()

Result size is : 8


dict_items([('ids', [['10', '145', '135', '37', '22', '101', '41', '74', '134', '89']]), ('embeddings', None), ('documents', [['Section A - Member Life Insurance Schedule of Insurance Article 1 Death Benefits Payable Article 2 Beneficiary Article 3 Facility of Payment Article 4 Settlement of Proceeds Article 5 Member Life Insurance - Coverage During Disability Article 6 Accelerated Benefits Article 7 Section B - Member Accidental Death and Dismemberment Insurance Schedule of Insurance Article 1 Benefit Qualification Article 2 Benefits Payable Article 3 Seat Belt Benefit Article 4 Loss of Use or Paralysis Benefit Article 5 Loss of Speech and/or Hearing Benefit Article 6 Repatriation Benefit Article 7 Educational Benefit Article 8 Limitations Article 9 Section C - Dependent Life Insurance Schedule of Insurance Article 1 Death Benefits Payable Article 2 Beneficiary Article 3 Section D - Claim Procedures Notice of Claim Article 1 Claim Forms Article 2 Proof of Loss Article 3 Payment, Denia

In [83]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids = []
documents = []
distances = []
metadatas = []
results_df = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = collection.query(
      query_texts=query,
      n_results=10
      )

      # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
      # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
      Keys = []
      Values = []

      size = len(results.items())

      for key, val in results.items():
        if val is None:
          continue
        for i in range(size):
          Keys.append(str(key)+str(i))
          Values.append(str(val[0][i]))


      cache_collection.add(
          documents= [query],
          ids = [query],  # Or if you want to assign integers as IDs 0,1,2,.., then you can use "len(cache_results['documents'])" as will return the no. of queries currently in the cache and assign the next digit to the new query."
          metadatas = dict(zip(Keys, Values))
      )

      print("Not found in cache. Found in main collection.")

      result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
      results_df = pd.DataFrame.from_dict(result_dict)
      results_df


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results['distances'][0][0] <= threshold:
      cache_result_dict = cache_results['metadatas'][0][0]

      # Loop through each inner list and then through the dictionary
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids.append(value)
          elif 'documents' in key:
              documents.append(value)
          elif 'distances' in key:
              distances.append(value)
          elif 'metadatas' in key:
              metadatas.append(value)

      print("Found in cache!")

      # Create a DataFrame
      results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': distances,
        'Metadatas': metadatas
      })


Not found in cache. Found in main collection.


In [84]:
# print the results
results_df

Unnamed: 0,Metadatas,Documents,Distances,IDs
0,"{'Chunk_No.': 0, 'Page_No.': 'Page 8'}",Section A - Member Life Insurance Schedule of ...,0.753707,10
1,"{'Chunk_No.': 0, 'Page_No.': 'Page 64'}","Principal Life Insurance Company Des Moines, I...",0.82455,145
2,"{'Page_No.': 'Page 59', 'Chunk_No.': 1}",Life benefit in excess of 50% of the Member's ...,0.827809,135
3,"{'Chunk_No.': 1, 'Page_No.': 'Page 20'}",remains in force during the Grace Period. Arti...,0.880485,37
4,"{'Chunk_No.': 0, 'Page_No.': 'Page 13'}",a . A licensed Doctor of Medicine (M.D.) or Os...,0.886265,22
5,"{'Page_No.': 'Page 46', 'Chunk_No.': 1}","the age(s) shown below, the amount of a Member...",0.890122,101
6,"{'Chunk_No.': 2, 'Page_No.': 'Page 21'}",result will then be multiplied by the premium ...,0.908242,41
7,"{'Chunk_No.': 1, 'Page_No.': 'Page 35'}",Accidental Death and Dismemberment Insurance i...,0.909882,74
8,"{'Chunk_No.': 0, 'Page_No.': 'Page 59'}",Section C - Dependent Life Insurance Article 1...,0.914557,134
9,"{'Chunk_No.': 0, 'Page_No.': 'Page 42'}",Section F - Individual Purchase Rights Article...,0.922996,89


### Re Ranking with Cross Encoder

In [85]:
from sentence_transformers import CrossEncoder, util

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

config.json:   0%|          | 0.00/794 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

In [86]:

cross_inputs = [[query, response] for response in results_df['Documents']]
cross_rerank_scores = cross_encoder.predict(cross_inputs)

  return forward_call(*args, **kwargs)


In [87]:
# pritn the cross rerank scores
cross_rerank_scores

array([-1.0131531, -9.737984 ,  0.9311433, -3.8338141, -9.035873 ,
       -1.9379551, -1.4666394, -0.8521249, -3.766788 , -6.7241006],
      dtype=float32)

In [88]:
results_df['Reranked_scores'] = cross_rerank_scores
results_df

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Chunk_No.': 0, 'Page_No.': 'Page 8'}",Section A - Member Life Insurance Schedule of ...,0.753707,10,-1.013153
1,"{'Chunk_No.': 0, 'Page_No.': 'Page 64'}","Principal Life Insurance Company Des Moines, I...",0.82455,145,-9.737984
2,"{'Page_No.': 'Page 59', 'Chunk_No.': 1}",Life benefit in excess of 50% of the Member's ...,0.827809,135,0.931143
3,"{'Chunk_No.': 1, 'Page_No.': 'Page 20'}",remains in force during the Grace Period. Arti...,0.880485,37,-3.833814
4,"{'Chunk_No.': 0, 'Page_No.': 'Page 13'}",a . A licensed Doctor of Medicine (M.D.) or Os...,0.886265,22,-9.035873
5,"{'Page_No.': 'Page 46', 'Chunk_No.': 1}","the age(s) shown below, the amount of a Member...",0.890122,101,-1.937955
6,"{'Chunk_No.': 2, 'Page_No.': 'Page 21'}",result will then be multiplied by the premium ...,0.908242,41,-1.466639
7,"{'Chunk_No.': 1, 'Page_No.': 'Page 35'}",Accidental Death and Dismemberment Insurance i...,0.909882,74,-0.852125
8,"{'Chunk_No.': 0, 'Page_No.': 'Page 59'}",Section C - Dependent Life Insurance Article 1...,0.914557,134,-3.766788
9,"{'Chunk_No.': 0, 'Page_No.': 'Page 42'}",Section F - Individual Purchase Rights Article...,0.922996,89,-6.724101


In [89]:
# Return the top 3 results from semantic search

top_3_semantic = results_df.sort_values(by='Distances')
top_3_semantic[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Chunk_No.': 0, 'Page_No.': 'Page 8'}",Section A - Member Life Insurance Schedule of ...,0.753707,10,-1.013153
1,"{'Chunk_No.': 0, 'Page_No.': 'Page 64'}","Principal Life Insurance Company Des Moines, I...",0.82455,145,-9.737984
2,"{'Page_No.': 'Page 59', 'Chunk_No.': 1}",Life benefit in excess of 50% of the Member's ...,0.827809,135,0.931143


In [90]:
# Return the top 3 results after reranking

top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
2,"{'Page_No.': 'Page 59', 'Chunk_No.': 1}",Life benefit in excess of 50% of the Member's ...,0.827809,135,0.931143
7,"{'Chunk_No.': 1, 'Page_No.': 'Page 35'}",Accidental Death and Dismemberment Insurance i...,0.909882,74,-0.852125
0,"{'Chunk_No.': 0, 'Page_No.': 'Page 8'}",Section A - Member Life Insurance Schedule of ...,0.753707,10,-1.013153


In [91]:
top_3_RAG = top_3_rerank[["Documents", "Metadatas"]][:3]
top_3_RAG

Unnamed: 0,Documents,Metadatas
2,Life benefit in excess of 50% of the Member's ...,"{'Page_No.': 'Page 59', 'Chunk_No.': 1}"
7,Accidental Death and Dismemberment Insurance i...,"{'Chunk_No.': 1, 'Page_No.': 'Page 35'}"
0,Section A - Member Life Insurance Schedule of ...,"{'Chunk_No.': 0, 'Page_No.': 'Page 8'}"


# RAG

In [95]:

openai.api_key = "******"

def generate_response(query, results_df):
    """
    Generate a response using GPT-3.5's ChatCompletion based on the user query and retrieved information.
    """
    messages = [
                {"role": "system", "content":  "You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents."},
                {"role": "user", "content": f"""You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents.
                                                You have a question asked by the user in '{query}' and you have some search results from a corpus of insurance documents in the dataframe '{top_3_RAG}'. These search results are essentially one page of an insurance document that may be relevant to the user query.

                                                The column 'documents' inside this dataframe contains the actual text from the policy document and the column 'metadata' contains the policy name and source page. The text inside the document may also contain tables in the format of a list of lists where each of the nested lists indicates a row.

                                                Use the documents in '{top_3_RAG}' to answer the query '{query}'. Frame an informative answer and also, use the dataframe to return the relevant policy names and page numbers as citations.

                                                Follow the guidelines below when performing the task.
                                                1. Try to provide relevant/accurate numbers if available.
                                                2. You don’t have to necessarily use all the information in the dataframe. Only choose information that is relevant.
                                                3. If the document text has tables with relevant information, please reformat the table and return the final information in a tabular in format.
                                                3. Use the Metadatas columns in the dataframe to retrieve and cite the policy name(s) and page numbers(s) as citation.
                                                4. If you can't provide the complete answer, please also provide any information that will help the user to search specific sections in the relevant cited documents.
                                                5. You are a customer facing assistant, so do not provide any information on internal workings, just answer the query directly.

                                                The generated response should answer the query directly addressing the user and avoiding additional information. If you think that the query is not relevant to the document, reply that the query is irrelevant. Provide the final response as a well-formatted and easily readable text along with the citation. Provide your complete response first with all information, and then provide the citations.
                                                """},
              ]

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages
    )

    return response.choices[0].message.content.split('\n')


In [96]:
response = generate_response(query, top_3_RAG)

In [97]:
print("\n".join(response))

The life insurance coverage for death varies depending on the policy you have. To provide accurate details, I would need to review the specific policy documents related to your insurance coverage. Below are the relevant policy names and page numbers from the insurance documents that may contain information about life insurance coverage for death:

1. Policy Name: Life benefit in excess of 50% of the Member's
   Page Number: Page 59

2. Policy Name: Accidental Death and Dismemberment Insurance
   Page Number: Page 35

3. Policy Name: Section A - Member Life Insurance Schedule
   Page Number: Page 8

To find details about the life insurance coverage for death, please refer to the corresponding policy sections in the cited documents above. If you need assistance in navigating the documents or interpreting specific sections, feel free to ask for further guidance.


### Queries

In [98]:
def search(query):

  # Set a threshold for cache search
  threshold = 0.2

  ids = []
  documents = []
  distances = []
  metadatas = []
  results_df = pd.DataFrame()

  # try to find from cache
  cache_results = cache_collection.query(
      query_texts=query,
      n_results=1
  )

  # If the distance is greater than the threshold, then return the results from the main collection.
  if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
        # Query the collection against the user query and return the top 10 results
        results = collection.query(
        query_texts=query,
        n_results=10
        )

        # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
        # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
        Keys = []
        Values = []

        size = len(results.items())

        for key, val in results.items():
          if val is None:
            continue
          for i in range(size):
            Keys.append(str(key)+str(i))
            Values.append(str(val[0][i]))


        cache_collection.add(
            documents= [query],
            ids = [query],  # Or if you want to assign integers as IDs 0,1,2,.., then you can use "len(cache_results['documents'])" as will return the no. of queries currently in the cache and assign the next digit to the new query."
            metadatas = dict(zip(Keys, Values))
        )

        #print("Not found in cache. Found in main collection.")

        result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
        results_df = pd.DataFrame.from_dict(result_dict)
        return results_df


  # If the distance is, however, less than the threshold, you can return the results from cache

  elif cache_results['distances'][0][0] <= threshold:
        cache_result_dict = cache_results['metadatas'][0][0]

        # Loop through each inner list and then through the dictionary
        for key, value in cache_result_dict.items():
            if 'ids' in key:
                ids.append(value)
            elif 'documents' in key:
                documents.append(value)
            elif 'distances' in key:
                distances.append(value)
            elif 'metadatas' in key:
                metadatas.append(value)

        #print("Found in cache!")

        # Create a DataFrame
        return pd.DataFrame({
          'IDs': ids,
          'Documents': documents,
          'Distances': distances,
          'Metadatas': metadatas
        })

In [99]:
def apply_cross_encoder(query, df):
  cross_inputs = [[query, response] for response in df['Documents']]
  cross_rerank_scores = cross_encoder.predict(cross_inputs)
  df['Reranked_scores'] = cross_rerank_scores
  return df

In [100]:
def get_topn(n, df):
  top_3_rerank = df.sort_values(by='Reranked_scores', ascending=False)
  return top_3_rerank[["Documents", "Metadatas"]][:n]

In [101]:
query = 'what is the life insurance coverage for disability'
df = search(query)
df = apply_cross_encoder(query, df)
df = get_topn(3, df)
response = generate_response(query, df)
print("\n".join(response))

  return forward_call(*args, **kwargs)


The query is not directly relevant to the documents provided, as the documents do not contain specific information on life insurance coverage for disability. It's recommended to refer to the policy's disability coverage section or contact the insurance provider for detailed information on life insurance coverage for disability. If you need further assistance, please let me know.

Citations:
1. Policy Name: Section A - Member Life Insurance Schedule
   Page Number: Page 8

2. Policy Name: Accidental Death and Dismemberment Insurance
   Page Number: Page 35

3. Policy Name: Life benefit in excess of 50% of the Member's...
   Page Number: Page 59


In [102]:
query = 'What are the Termination Rights of the Policyholder'
df = search(query)
df = apply_cross_encoder(query, df)
df = get_topn(3, df)
response = generate_response(query, df)
print("\n".join(response))

  return forward_call(*args, **kwargs)


The termination rights of the policyholder typically include the following:

1. Surrender Value: The policyholder may choose to surrender the policy and receive the surrender value as per the terms and conditions outlined in the policy document.
2. Free Look Period: The policyholder usually has a certain period, known as the free look period, within which they can cancel the policy without any penalties.
3. Non-Payment of Premiums: If the policyholder fails to pay the premiums within the grace period specified in the policy, the policy may lapse or be terminated.
4. Policy Expiry: Once the policy reaches its maturity date, it may terminate, and the policyholder can claim the maturity benefits.

Please refer to the following policy documents for more detailed information:

1. Policy Name: 'Life benefit in excess of 50% of the Member's ...'
   - Page Number: Page 59

2. Policy Name: 'Accidental Death and Dismemberment Insurance ...'
   - Page Number: Page 35

3. Policy Name: 'Section A -

In [103]:
query = 'What are the default benefits and provisions of the Group Policy?'
df = search(query)
df = apply_cross_encoder(query, df)
df = get_topn(3, df)
response = generate_response(query, df)
print("\n".join(response))




  return forward_call(*args, **kwargs)


The default benefits and provisions of the Group Policy include:

1. Life benefit exceeding 50% of the Member's salary.
2. Accidental Death and Dismemberment Insurance coverage.

Please find the detailed information in the relevant policy documents:

**Policy Name:** Section A - Member Life Insurance Schedule of Benefits  
**Page Number:** Page 59, Page 35  

If you need more specific details, refer to the sections related to Life benefits and Accidental Death and Dismemberment Insurance in the mentioned policy documents.


In [105]:
query = 'What are the coverages for an accident claim?'
df = search(query)
df = apply_cross_encoder(query, df)
df = get_topn(3, df)
response = generate_response(query, df)
print("\n".join(response))



  return forward_call(*args, **kwargs)


The coverages for an accident claim include:

1. **Life benefit in excess of 50% of the Member's salary**: This provides a coverage that exceeds 50% of the Member's salary in the event of an accident.
2. **Accidental Death and Dismemberment Insurance**: This insurance offers coverage in case of accidental death or dismemberment due to an accident.

Below are the policy names and page numbers for the referenced coverages:

1. Policy Name: Life benefit in excess of 50% of the Member's salary  
   - Page Number: Page 59

2. Policy Name: Accidental Death and Dismemberment Insurance  
   - Page Number: Page 35

I recommend reviewing the detailed sections mentioned in the policies on Page 59 and Page 35 for further information regarding the coverages for accident claims.
