<a href="https://colab.research.google.com/github/Aswani-ReddyKV/HelpMateAI/blob/main/Aswani_Reddy_HelpMateAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Environment readiness creation

In [1]:
# Install all the required libraries
# %pip install pdfplumber tiktoken openai chromaDB sentence-transformers


In [2]:
# Import all the required Libraries
import pdfplumber
from pathlib import Path
import pandas as pd
from operator import itemgetter
import json
import openai
import chromadb

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import pathlib
rootfolder = '/content/drive/MyDrive/Colab Notebooks/Course_6_GenAI'

# Remember to upload the file containing your OpenAI key. Make sure the file name is correct
with open(rootfolder + "/OpenAI_API_Key.txt", "r") as f:
  openai.api_key = ' '.join(f.readlines())


# # Alternatively you may also read the API key via Google Colab secrets
# from google.colab import userdata
# openai.api_key = userdata.get('OpenAI_API_Key')

if openai.api_key.startswith('sk-'):
    print('API key loaded successfully')
else:
  print('Improper API key format')

API key loaded successfully


In [5]:
# Define the path where all pdf documents are present
pdf_path = '/content/drive/MyDrive/Colab Notebooks/Course_6_GenAI/HelpMateAI_Project/'

In [6]:
# Function to check whether a word is present in a table or not for segregation of regular text and tables

def check_bboxes(word, table_bbox):
    # Check whether word is inside a table bbox.
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]

In [7]:
# Function to extract text from a PDF file.
# 1. Declare a variable p to store the iteration of the loop that will help us store page numbers alongside the text
# 2. Declare an empty list 'full_text' to store all the text files
# 3. Use pdfplumber to open the pdf pages one by one
# 4. Find the tables and their locations in the page
# 5. Extract the text from the tables in the variable 'tables'
# 6. Extract the regular words by calling the function check_bboxes() and checking whether words are present in the table or not
# 7. Use the cluster_objects utility to cluster non-table and table words together so that they retain the same chronology as in the original PDF
# 8. Declare an empty list 'lines' to store the page text
# 9. If a text element in present in the cluster, append it to 'lines', else if a table element is present, append the table
# 10. Append the page number and all lines to full_text, and increment 'p'
# 11. When the function has iterated over all pages, return the 'full_text' list

def extract_text_from_pdf(pdf_path):
    p = 0
    full_text = []


    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_no = f"Page {p+1}"
            text = page.extract_text()

            tables = page.find_tables()
            table_bboxes = [i.bbox for i in tables]
            tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
            non_table_words = [word for word in page.extract_words() if not any(
                [check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
            lines = []

            for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5):

                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join([i['text'] for i in cluster]))
                    except KeyError:
                        pass

                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))


            full_text.append([page_no, " ".join(lines)])
            p +=1

    return full_text

In [8]:
# Define the directory containing the PDF files
pdf_directory = Path(pdf_path)

# Initialize an empty list to store the extracted texts and document names
data = []

# Loop through all files in the directory
for pdf_path in pdf_directory.glob("*.pdf"):

    # Process the PDF file
    print(f"...Processing {pdf_path.name}")

    # Call the function to extract the text from the PDF
    extracted_text = extract_text_from_pdf(pdf_path)

    # Convert the extracted list to a PDF, and add a column to store document names
    extracted_text_df = pd.DataFrame(extracted_text, columns=['Page No.', 'Page_Text'])

    # Append the extracted text and document name to the list
    data.append(extracted_text_df)

    # Print a message to indicate progress
    print(f"Finished processing {pdf_path.name}")

# Print a message to indicate all PDFs have been processed
print("All PDFs have been processed.")

...Processing Principal-Sample-Life-Insurance-Policy.pdf
Finished processing Principal-Sample-Life-Insurance-Policy.pdf
All PDFs have been processed.


In [9]:
# Concatenate all the DFs in the list 'data' together

df_insurancedata = pd.concat(data, ignore_index=True)

In [10]:
df_insurancedata.shape

(64, 2)

In [11]:
df_insurancedata.head(5)

Unnamed: 0,Page No.,Page_Text
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...
1,Page 2,This page left blank intentionally
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...
3,Page 4,This page left blank intentionally
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY (called The P...


In [12]:
# Store the metadata for each page in a separate column, now we can consider the Page_No is the metadata
df_insurancedata['Metadata'] = df_insurancedata.apply(lambda x: {'Page_No.': x['Page No.']}, axis=1)
df_insurancedata.head(5)

Unnamed: 0,Page No.,Page_Text,Metadata
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,{'Page_No.': 'Page 1'}
1,Page 2,This page left blank intentionally,{'Page_No.': 'Page 2'}
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,{'Page_No.': 'Page 3'}
3,Page 4,This page left blank intentionally,{'Page_No.': 'Page 4'}
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,{'Page_No.': 'Page 5'}


In [13]:
# Check one of the extracted page texts to ensure that the text has been correctly read
df_insurancedata.Page_Text[7]

'Section A - Member Life Insurance Schedule of Insurance Article 1 Death Benefits Payable Article 2 Beneficiary Article 3 Facility of Payment Article 4 Settlement of Proceeds Article 5 Member Life Insurance - Coverage During Disability Article 6 Accelerated Benefits Article 7 Section B - Member Accidental Death and Dismemberment Insurance Schedule of Insurance Article 1 Benefit Qualification Article 2 Benefits Payable Article 3 Seat Belt Benefit Article 4 Loss of Use or Paralysis Benefit Article 5 Loss of Speech and/or Hearing Benefit Article 6 Repatriation Benefit Article 7 Educational Benefit Article 8 Limitations Article 9 Section C - Dependent Life Insurance Schedule of Insurance Article 1 Death Benefits Payable Article 2 Beneficiary Article 3 Section D - Claim Procedures Notice of Claim Article 1 Claim Forms Article 2 Proof of Loss Article 3 Payment, Denial and Review Article 4 Medical Examinations Article 5 Autopsy Article 6 Legal Action Article 7 Time Limits Article 8 This polic

In [14]:

# Let's also check the length of all the texts as there might be some empty pages or pages with very few words that we can drop
df_insurancedata['Text_Length'] = df_insurancedata['Page_Text'].apply(lambda x: len(x.split(' ')))

In [15]:

# print the page length
df_insurancedata['Text_Length']

Unnamed: 0,Text_Length
0,30
1,5
2,230
3,5
4,110
...,...
59,285
60,418
61,322
62,5


## 3. Document Chunking

We will generate embeddings for texts. But the document contains several pages and huge text, before generating the embeddings, we need to generate the chunks. Let's start with a basic chunking technique, and chunk the text with fixed size.

In [16]:
# Check the entire page's text
df_insurancedata['Page_Text']

Unnamed: 0,Page_Text
0,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...
1,This page left blank intentionally
2,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...
3,This page left blank intentionally
4,PRINCIPAL LIFE INSURANCE COMPANY (called The P...
...,...
59,I f a Dependent who was insured dies during th...
60,Section D - Claim Procedures Article 1 - Notic...
61,A claimant may request an appeal of a claim de...
62,This page left blank intentionally


In [17]:
# Iterating over all page titles to create the final df with individual chunks
page_nos = df_insurancedata["Page No."]
page_nos

Unnamed: 0,Page No.
0,Page 1
1,Page 2
2,Page 3
3,Page 4
4,Page 5
...,...
59,Page 60
60,Page 61
61,Page 62
62,Page 63


### 3.1 Fixed-Size Chunking

In fixed-size chunking, the document is split into fixed-size windows with each window representing a separate document chunk.

In [18]:
# Function to split text into fixed-size chunks
def split_text_into_chunks(text, chunk_size):
    chunks = []
    words = text.split()  # Split the text into words

    current_chunk = []  # Store words for the current chunk
    current_chunk_word_count = 0  # Count of words in the current chunk

    for word in words:
        if current_chunk_word_count + len(word) + 1 <= chunk_size:
            current_chunk.append(word)
            current_chunk_word_count += len(word) + 1
        else:
            chunks.append(' '.join(current_chunk))
            current_chunk = [word]
            current_chunk_word_count = len(word)

    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks## 3. <font color = 'red'> Generating Embeddings </font>

#### Encoding Pipeline

In [19]:
def process_page(page_no):
    page = df_insurancedata[df_insurancedata['Page No.'] == page_no].Page_Text.values[0]
    metadata = df_insurancedata[df_insurancedata['Page No.'] == page_no].Metadata.values[0]

    if page is not None:
        # setting chunk size as 500
        chunk_size = 500
        text_chunks = split_text_into_chunks(page, chunk_size)

        # Creating a DataFrame to store the chunks, page title and page metadata
        data = {'Title': [], 'Chunk Text': [], 'Metadata': []}

        for index, chunk in enumerate(text_chunks):
            data['Title'].append(page_no)
            data['Chunk Text'].append(chunk)
            # adding chunk no as part of metadata
            metadata['Chunk_No.'] = index
            data['Metadata'].append(metadata)

        return pd.DataFrame(data)

In [20]:
# creating a dataframe after calling process
all_dfs = []
for page_no in page_nos:
    df = process_page(page_no)
    if df is not None:
        all_dfs.append(df)

fixed_chunk_df = pd.concat(all_dfs, ignore_index=True)
fixed_chunk_df

Unnamed: 0,Title,Chunk Text,Metadata
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,"{'Page_No.': 'Page 1', 'Chunk_No.': 0}"
1,Page 2,This page left blank intentionally,"{'Page_No.': 'Page 2', 'Chunk_No.': 0}"
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,"{'Page_No.': 'Page 3', 'Chunk_No.': 2}"
3,Page 3,arrange for third party service providers (i.e...,"{'Page_No.': 'Page 3', 'Chunk_No.': 2}"
4,Page 3,the provision of such goods and/or services no...,"{'Page_No.': 'Page 3', 'Chunk_No.': 2}"
...,...,...,...
226,Page 62,"requested additional information, The Principa...","{'Page_No.': 'Page 62', 'Chunk_No.': 3}"
227,Page 62,may have the Member or Dependent whose loss is...,"{'Page_No.': 'Page 62', 'Chunk_No.': 3}"
228,Page 62,proof of loss has been filed and before the ap...,"{'Page_No.': 'Page 62', 'Chunk_No.': 3}"
229,Page 63,This page left blank intentionally,"{'Page_No.': 'Page 63', 'Chunk_No.': 0}"


## 4. Generating Embeddings

#### Encoding Pipeline

In [21]:
from sentence_transformers import SentenceTransformer

In [22]:
# Load the embedding model
model_name = "all-MiniLM-L6-v2"
embedder = SentenceTransformer(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [23]:
# Function to generate embeddings for text
def generate_embeddings(texts):
    embeddings = embedder.encode(texts, convert_to_tensor=False)
    return embeddings

In [24]:
# function to generate embedding on dataframe
def generate_embeddings_on_df(df):
  df['Embeddings'] = df['Chunk Text'].apply(lambda x: generate_embeddings([x])[0])

In [25]:
# Create embeddings for 'Chunk Text' column on all three dataframes
generate_embeddings_on_df(fixed_chunk_df)

In [26]:
# print the dataframe
fixed_chunk_df

Unnamed: 0,Title,Chunk Text,Metadata,Embeddings
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,"{'Page_No.': 'Page 1', 'Chunk_No.': 0}","[-0.025921902, 0.047777493, 0.05585776, 0.0423..."
1,Page 2,This page left blank intentionally,"{'Page_No.': 'Page 2', 'Chunk_No.': 0}","[0.029118974, 0.060574062, 0.046415336, 0.0377..."
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,"{'Page_No.': 'Page 3', 'Chunk_No.': 2}","[-0.06453793, 0.04319715, -8.3862076e-05, -0.0..."
3,Page 3,arrange for third party service providers (i.e...,"{'Page_No.': 'Page 3', 'Chunk_No.': 2}","[-0.102009885, -0.028467722, -0.020565063, -0...."
4,Page 3,the provision of such goods and/or services no...,"{'Page_No.': 'Page 3', 'Chunk_No.': 2}","[-0.0900084, 0.07658198, 0.004927621, -0.08307..."
...,...,...,...,...
226,Page 62,"requested additional information, The Principa...","{'Page_No.': 'Page 62', 'Chunk_No.': 3}","[-0.04767193, 0.11277696, 0.069064945, -0.0508..."
227,Page 62,may have the Member or Dependent whose loss is...,"{'Page_No.': 'Page 62', 'Chunk_No.': 3}","[-0.07941993, 0.14404444, 0.031876117, -0.0655..."
228,Page 62,proof of loss has been filed and before the ap...,"{'Page_No.': 'Page 62', 'Chunk_No.': 3}","[-0.14206009, 0.12368372, 0.120924726, -0.0114..."
229,Page 63,This page left blank intentionally,"{'Page_No.': 'Page 63', 'Chunk_No.': 0}","[0.029118974, 0.060574062, 0.046415336, 0.0377..."


5. Store Embeddings in ChromaDB


In this section we will store embedding in ChromaDB collection.

In [27]:
# Define the path where chroma collections will be stored
chroma_data_path = '/content/drive/MyDrive/Colab Notebooks/Course_6_GenAI/HelpMateAI_Project/'

In [28]:
import chromadb

# Call PersistentClient()
client = chromadb.PersistentClient(path=chroma_data_path)

In [29]:
# Create a collection to store the embeddings. Collections in Chroma are where you can store your embeddings, documents, and any additional metadata.
collection = client.get_or_create_collection(name="insurance-collection")

In [31]:
collection.add(
    embeddings = fixed_chunk_df['Embeddings'].to_list(),
    documents = fixed_chunk_df['Chunk Text'].to_list(),
    metadatas = fixed_chunk_df['Metadata'].to_list(),
    ids = [str(i) for i in range(0, len(fixed_chunk_df['Embeddings']))]
)

In [32]:
# get few of data by ids from collection
collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['0', '1', '2'],
 'embeddings': array([[-2.59219017e-02,  4.77774926e-02,  5.58577590e-02, ...,
         -4.93265763e-02, -5.85114695e-02,  2.35519279e-02],
        [ 2.91189738e-02,  6.05740622e-02,  4.64153364e-02, ...,
          5.95400855e-02, -2.83837058e-02,  5.31935133e-03],
        [-6.45379275e-02,  4.31971513e-02, -8.38620763e-05, ...,
         -3.78734320e-02,  1.79674719e-02, -7.36602070e-03]]),
 'documents': ['DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/01/2014 711 HIGH STREET GEORGE RI 02903 GROUP POLICY FOR: RHODE ISLAND JOHN DOE ALL MEMBERS Group Member Life Insurance Print Date: 07/16/2014',
  'This page left blank intentionally',
  'POLICY RIDER GROUP INSURANCE POLICY NO: S655 COVERAGE: Life EMPLOYER: RHODE ISLAND JOHN DOE Effective on the later of the Date of Issue of this Group Policy or March 1, 2005, the following will apply to your Policy: From time to time The Principal may offer or provide certain employer groups who apply for coverage with The Princi

In [33]:
# create a cache collection
cache_collection = client.get_or_create_collection(name='insurance-collection-cache')

In [34]:
# peek few of elements from cache collection
cache_collection.peek()

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': [],
 'uris': None,
 'data': None,
 'metadatas': [],
 'included': [<IncludeEnum.embeddings: 'embeddings'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

## 6. Semantic Search with Cache

In this section, we will perform a semantic search of a query in the collections embeddings to get several top semantically similar results.

In [35]:
# Read the user query
query = input()

what is the life insurance coverage for disability


In [36]:
# Searh the Cache collection first
# Query the collection against the user query and return the top 20 results
cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:02<00:00, 39.2MiB/s]


In [37]:
# get result from cache collection
cache_results

{'ids': [[]],
 'embeddings': None,
 'documents': [[]],
 'uris': None,
 'data': None,
 'metadatas': [[]],
 'distances': [[]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [38]:
# get result from main collection
results = collection.query(
query_texts=query,
n_results=10
)
print("Result size is : " + str(len(results.items())))
results.items()

Result size is : 8


dict_items([('ids', [['143', '171', '147', '151', '183', '146', '180', '85', '179', '142']]), ('embeddings', None), ('documents', [['Member Life Insurance or Coverage During Disability terminates under this Group Policy. This policy has been updated effective January 1, 2014 PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS GC 6011 Section F - Individual Purchase Rights, Page 1', "Payment of benefits will be subject to the Beneficiary and Facility of Payment provisions of this PART IV, Section A. Article 6 - Member Life Insurance - Coverage During Disability A Member may be eligible to continue his or her Member Life and Member Accidental Death and Dismemberment Insurance and Dependent Life Insurance coverage during the Member's ADL Disability or Total Disability. a. Coverage Qualification To be qualified for Coverage During Disability, a Member must: (1) become ADL", 'any Accelerated Benefit payment as described in PART IV, Section A, Article 7. Article 2 - Dependent Life Insurance a. Ind

In [39]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids = []
documents = []
distances = []
metadatas = []
results_df = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = collection.query(
      query_texts=query,
      n_results=10
      )

      # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
      # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
      Keys = []
      Values = []

      size = len(results.items())

      for key, val in results.items():
        if val is None:
          continue
        for i in range(size):
          Keys.append(str(key)+str(i))
          Values.append(str(val[0][i]))


      cache_collection.add(
          documents= [query],
          ids = [query],  # Or if you want to assign integers as IDs 0,1,2,.., then you can use "len(cache_results['documents'])" as will return the no. of queries currently in the cache and assign the next digit to the new query."
          metadatas = dict(zip(Keys, Values))
      )

      print("Not found in cache. Found in main collection.")

      result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
      results_df = pd.DataFrame.from_dict(result_dict)
      results_df


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results['distances'][0][0] <= threshold:
      cache_result_dict = cache_results['metadatas'][0][0]

      # Loop through each inner list and then through the dictionary
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids.append(value)
          elif 'documents' in key:
              documents.append(value)
          elif 'distances' in key:
              distances.append(value)
          elif 'metadatas' in key:
              metadatas.append(value)

      print("Found in cache!")

      # Create a DataFrame
      results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': distances,
        'Metadatas': metadatas
      })

Not found in cache. Found in main collection.


In [40]:
# print the results
results_df

Unnamed: 0,Metadatas,Documents,Distances,IDs
0,"{'Chunk_No.': 4, 'Page_No.': 'Page 42'}",Member Life Insurance or Coverage During Disab...,0.712412,143
1,"{'Chunk_No.': 4, 'Page_No.': 'Page 49'}",Payment of benefits will be subject to the Ben...,0.802751,171
2,"{'Chunk_No.': 4, 'Page_No.': 'Page 43'}",any Accelerated Benefit payment as described i...,0.8602,147
3,"{'Chunk_No.': 4, 'Page_No.': 'Page 44'}",Dependent's Life Insurance terminates because ...,0.872965,151
4,"{'Chunk_No.': 3, 'Page_No.': 'Page 51'}",disability that: (1) results from willful self...,0.89798,183
5,"{'Chunk_No.': 4, 'Page_No.': 'Page 43'}",be the Coverage During Disability benefit in f...,0.908463,146
6,"{'Chunk_No.': 5, 'Page_No.': 'Page 50'}",Total Disability began. Failure to give Writte...,0.910234,180
7,"{'Chunk_No.': 4, 'Page_No.': 'Page 28'}","terms of the Prior Policy, to have their premi...",0.912592,85
8,"{'Chunk_No.': 5, 'Page_No.': 'Page 50'}","Disability is in force, The Principal will pay...",0.925082,179
9,"{'Chunk_No.': 4, 'Page_No.': 'Page 42'}","Premium Waiver Period as described in PART IV,...",0.932815,142


## 7. Re-Ranking with a Cross Encoder

Re-ranking the results obtained from your semantic search can sometime significantly improve the relevance of the retrieved results. This is often done by passing the query paired with each of the retrieved responses into a cross-encoder to score the relevance of the response w.r.t. the query.

In [41]:
# Import the CrossEncoder library from sentence_transformers
from sentence_transformers import CrossEncoder, util

In [42]:
# Initialise the cross encoder model
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

config.json:   0%|          | 0.00/794 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/316 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [43]:
# Input (query, response) pairs for each of the top 20 responses received from the semantic search to the cross encoder
# Generate the cross_encoder scores for these pairs

cross_inputs = [[query, response] for response in results_df['Documents']]
cross_rerank_scores = cross_encoder.predict(cross_inputs)

In [44]:
# pritn the cross rerank scores
cross_rerank_scores

array([ 3.0535285 ,  4.8101153 ,  0.53876984,  2.9317625 , -0.90366626,
        1.4724655 ,  0.26232424, -1.3536793 ,  2.4461827 , -2.6602545 ],
      dtype=float32)

In [45]:
results_df['Reranked_scores'] = cross_rerank_scores

In [46]:
results_df

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Chunk_No.': 4, 'Page_No.': 'Page 42'}",Member Life Insurance or Coverage During Disab...,0.712412,143,3.053529
1,"{'Chunk_No.': 4, 'Page_No.': 'Page 49'}",Payment of benefits will be subject to the Ben...,0.802751,171,4.810115
2,"{'Chunk_No.': 4, 'Page_No.': 'Page 43'}",any Accelerated Benefit payment as described i...,0.8602,147,0.53877
3,"{'Chunk_No.': 4, 'Page_No.': 'Page 44'}",Dependent's Life Insurance terminates because ...,0.872965,151,2.931762
4,"{'Chunk_No.': 3, 'Page_No.': 'Page 51'}",disability that: (1) results from willful self...,0.89798,183,-0.903666
5,"{'Chunk_No.': 4, 'Page_No.': 'Page 43'}",be the Coverage During Disability benefit in f...,0.908463,146,1.472466
6,"{'Chunk_No.': 5, 'Page_No.': 'Page 50'}",Total Disability began. Failure to give Writte...,0.910234,180,0.262324
7,"{'Chunk_No.': 4, 'Page_No.': 'Page 28'}","terms of the Prior Policy, to have their premi...",0.912592,85,-1.353679
8,"{'Chunk_No.': 5, 'Page_No.': 'Page 50'}","Disability is in force, The Principal will pay...",0.925082,179,2.446183
9,"{'Chunk_No.': 4, 'Page_No.': 'Page 42'}","Premium Waiver Period as described in PART IV,...",0.932815,142,-2.660254


In [47]:
# Return the top 3 results from semantic search

top_3_semantic = results_df.sort_values(by='Distances')
top_3_semantic[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Chunk_No.': 4, 'Page_No.': 'Page 42'}",Member Life Insurance or Coverage During Disab...,0.712412,143,3.053529
1,"{'Chunk_No.': 4, 'Page_No.': 'Page 49'}",Payment of benefits will be subject to the Ben...,0.802751,171,4.810115
2,"{'Chunk_No.': 4, 'Page_No.': 'Page 43'}",any Accelerated Benefit payment as described i...,0.8602,147,0.53877


In [48]:
# Return the top 3 results after reranking

top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
1,"{'Chunk_No.': 4, 'Page_No.': 'Page 49'}",Payment of benefits will be subject to the Ben...,0.802751,171,4.810115
0,"{'Chunk_No.': 4, 'Page_No.': 'Page 42'}",Member Life Insurance or Coverage During Disab...,0.712412,143,3.053529
3,"{'Chunk_No.': 4, 'Page_No.': 'Page 44'}",Dependent's Life Insurance terminates because ...,0.872965,151,2.931762


In [49]:
top_3_RAG = top_3_rerank[["Documents", "Metadatas"]][:3]
top_3_RAG

Unnamed: 0,Documents,Metadatas
1,Payment of benefits will be subject to the Ben...,"{'Chunk_No.': 4, 'Page_No.': 'Page 49'}"
0,Member Life Insurance or Coverage During Disab...,"{'Chunk_No.': 4, 'Page_No.': 'Page 42'}"
3,Dependent's Life Insurance terminates because ...,"{'Chunk_No.': 4, 'Page_No.': 'Page 44'}"


## 8. Retrieval Augmented Generation


Now that we have the final top search results, we can pass it to an GPT 3.5 along with the user query and a well-engineered prompt, to generate a direct answer to the query along with citations, rather than returning whole pages/chunks.

In [50]:
# Define the function to generate the response. Provide a comprehensive prompt that passes the user query and the top 3 results to the model

def generate_response(query, results_df):
    """
    Generate a response using GPT-3.5's ChatCompletion based on the user query and retrieved information.
    """
    messages = [
                {"role": "system", "content":  "You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents."},
                {"role": "user", "content": f"""You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents.
                                                You have a question asked by the user in '{query}' and you have some search results from a corpus of insurance documents in the dataframe '{top_3_RAG}'. These search results are essentially one page of an insurance document that may be relevant to the user query.

                                                The column 'documents' inside this dataframe contains the actual text from the policy document and the column 'metadata' contains the policy name and source page. The text inside the document may also contain tables in the format of a list of lists where each of the nested lists indicates a row.

                                                Use the documents in '{top_3_RAG}' to answer the query '{query}'. Frame an informative answer and also, use the dataframe to return the relevant policy names and page numbers as citations.

                                                Follow the guidelines below when performing the task.
                                                1. Try to provide relevant/accurate numbers if available.
                                                2. You don’t have to necessarily use all the information in the dataframe. Only choose information that is relevant.
                                                3. If the document text has tables with relevant information, please reformat the table and return the final information in a tabular in format.
                                                3. Use the Metadatas columns in the dataframe to retrieve and cite the policy name(s) and page numbers(s) as citation.
                                                4. If you can't provide the complete answer, please also provide any information that will help the user to search specific sections in the relevant cited documents.
                                                5. You are a customer facing assistant, so do not provide any information on internal workings, just answer the query directly.

                                                The generated response should answer the query directly addressing the user and avoiding additional information. If you think that the query is not relevant to the document, reply that the query is irrelevant. Provide the final response as a well-formatted and easily readable text along with the citation. Provide your complete response first with all information, and then provide the citations.
                                                """},
              ]

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages
    )

    return response.choices[0].message.content.split('\n')

In [51]:
# Generate the response
response = generate_response(query, top_3_RAG)

In [52]:
# Print the response
print("\n".join(response))

The life insurance coverage for disability typically includes provisions for paying benefits in case the policyholder becomes disabled and is unable to work. Specific details may vary depending on the insurance policy and provider. To provide accurate information on the coverage, I would need to review the actual policy documents mentioned in the dataframe.

Given the documents provided in the dataframe, I will extract relevant details related to life insurance coverage for disability from the listed policies and cite the policy names and page numbers for reference.

**Relevant Policy Information:**
1. **Policy Name:** Member Life Insurance or Coverage During Disability
2. **Page Number:** Page 42

**Excerpt from the Policy Document (reformatted for clarity if necessary):**
- The policy provides coverage for disabled individuals, ensuring they receive benefits or coverage during their disability. More detailed information on the specific coverage for disability can be found on Page 42 

## 8. Queries

In [54]:
def search(query):

  # Set a threshold for cache search
  threshold = 0.2

  ids = []
  documents = []
  distances = []
  metadatas = []
  results_df = pd.DataFrame()

  # try to find from cache
  cache_results = cache_collection.query(
      query_texts=query,
      n_results=1
  )

  # If the distance is greater than the threshold, then return the results from the main collection.
  if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
        # Query the collection against the user query and return the top 10 results
        results = collection.query(
        query_texts=query,
        n_results=10
        )

        # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
        # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
        Keys = []
        Values = []

        size = len(results.items())

        for key, val in results.items():
          if val is None:
            continue
          for i in range(size):
            Keys.append(str(key)+str(i))
            Values.append(str(val[0][i]))


        cache_collection.add(
            documents= [query],
            ids = [query],  # Or if you want to assign integers as IDs 0,1,2,.., then you can use "len(cache_results['documents'])" as will return the no. of queries currently in the cache and assign the next digit to the new query."
            metadatas = dict(zip(Keys, Values))
        )

        #print("Not found in cache. Found in main collection.")

        result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
        results_df = pd.DataFrame.from_dict(result_dict)
        return results_df


  # If the distance is, however, less than the threshold, you can return the results from cache

  elif cache_results['distances'][0][0] <= threshold:
        cache_result_dict = cache_results['metadatas'][0][0]

        # Loop through each inner list and then through the dictionary
        for key, value in cache_result_dict.items():
            if 'ids' in key:
                ids.append(value)
            elif 'documents' in key:
                documents.append(value)
            elif 'distances' in key:
                distances.append(value)
            elif 'metadatas' in key:
                metadatas.append(value)

        #print("Found in cache!")

        # Create a DataFrame
        return pd.DataFrame({
          'IDs': ids,
          'Documents': documents,
          'Distances': distances,
          'Metadatas': metadatas
        })

In [55]:
def apply_cross_encoder(query, df):
  cross_inputs = [[query, response] for response in df['Documents']]
  cross_rerank_scores = cross_encoder.predict(cross_inputs)
  df['Reranked_scores'] = cross_rerank_scores
  return df

In [56]:
def get_topn(n, df):
  top_3_rerank = df.sort_values(by='Reranked_scores', ascending=False)
  return top_3_rerank[["Documents", "Metadatas"]][:n]

In [57]:
query = 'what is the life insurance coverage for disability'
df = search(query)
df = apply_cross_encoder(query, df)
df = get_topn(3, df)
response = generate_response(query, df)
print("\n".join(response))

The life insurance coverage for disability typically depends on the specific insurance policy you have. In general, life insurance coverage may include provisions for disability benefits which can provide financial support in case of a disability that prevents you from working.

Here are the relevant policies with information on life insurance coverage for disability and their respective citations:

1. Policy Name: Member Life Insurance or Coverage During Disability
   Page Number: Page 42
   
2. Policy Name: Dependent's Life Insurance
   Page Number: Page 44

For detailed information on the specific coverage provided for disability in the mentioned policies, you can refer to the corresponding pages cited above in the documents.

If you need more detailed information or have specific questions about the disability coverage under your life insurance policy, it is advisable to closely review the sections pertaining to disability benefits in the mentioned policy documents.


In [58]:
query = 'what is the Proof of ADL Disability or Total Disability'
df = search(query)
df = apply_cross_encoder(query, df)
df = get_topn(3, df)
response = generate_response(query, df)
print("\n".join(response))

The proof of ADL Disability or Total Disability typically refers to the documentation required to demonstrate that an individual is unable to perform Activities of Daily Living (ADLs) or is completely disabled according to the terms of the insurance policy.

In the provided insurance documents:

1. **Document: Member Life Insurance or Coverage During Disability**  
   - Relevant information about proof of ADL Disability or Total Disability may be present on **Page 42**.

2. **Document: Dependent's Life Insurance terminates because**  
   - Relevant information about proof of ADL Disability or Total Disability may be present on **Page 44**.

Unfortunately, the specific details regarding the proof of ADL Disability or Total Disability are not explicitly mentioned in the extracted text. To find more detailed information on what constitutes proof of ADL Disability or Total Disability and the documentation required, it's recommended to refer to the mentioned pages in these documents for a c

In [59]:
query = 'what is condition of death while not wearing Seat Belt'
df = search(query)
df = apply_cross_encoder(query, df)
df = get_topn(3, df)
response = generate_response(query, df)
print("\n".join(response))

The information regarding the condition of death while not wearing a seat belt is not directly available in the provided insurance documents. It's advisable to refer to sections related to coverage exclusions or conditions for death benefits in the mentioned policy documents to find specific details about the impact of not wearing a seat belt on death benefits.

Citations:
1. Policy Name: Payment of benefits will be subject to the Ben... (Page 49)
2. Policy Name: Member Life Insurance or Coverage During Disab... (Page 42)
3. Policy Name: Dependent's Life Insurance terminates because ... (Page 44)


In [60]:
query = 'What happens if a third-party service provider fails to provide the promised goods and services?'
df = search(query)
df = apply_cross_encoder(query, df)
df = get_topn(3, df)
response = generate_response(query, df)
print("\n".join(response))

When a third-party service provider fails to provide the promised goods and services, the insurance policy may have provisions for reimbursement or coverage depending on the specific circumstances outlined in the policy document. Typically, if there is a failure on the part of a third-party service provider, you may be eligible for compensation or reimbursement as stated in your insurance policy.

In referencing the documents provided:
1. "Payment of benefits will be subject to the Benefit Payments section of the policy document on Page 49."
2. "Member Life Insurance or Coverage During Disability" policy document on Page 42.
3. "Dependent's Life Insurance terminates because..." details can be found in the document on Page 44.

In summary, your insurance policies may have provisions to address situations where a third-party service provider fails to deliver promised goods or services. Please refer to the specific sections mentioned above in your policy documents for more detailed inform