# I. Requirements of the topic: Build Mr.HelpMate

<style>
  #Requirements of the topic: Build Mr.HelpMate {
    text-align: center;
    font-size: 48px;
  }
</style>

## Build Mr.HelpMate AI with:

### 1.  The Embedding Layer: 
The PDF document needs to be effectively processed, cleaned, and chunked for the embeddings. Here, the choice of chunking strategy will have a large impact on the final quality of the retrieved results, so make sure that you try out various strategies and compare their performance. Another important aspect of the embedding layer is the choice of the embedding model. You can choose to embed your chunks using the OpenAI embedding model or any model from the SentenceTransformers library on HuggingFace.
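As one way to start comparing chunking strategies, the sketch below implements a simple fixed-size word window with overlap. The function name `chunk_words` and the parameters `chunk_size` and `overlap` are illustrative assumptions and are not part of the original notebook.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most `chunk_size` words, sharing `overlap` words with the previous chunk."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input: ten placeholder words
sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=2))
```

Varying `chunk_size` and `overlap` and then re-running the same test queries is one simple way to compare strategies.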

### 2.  The Search Layer: 
Here, you first need to design at least 3 queries against which you will test your system. You need to understand and skim through the document, and accordingly come up with some queries, the answers to which can be found in the policy document. Next, you need to embed the queries and search your ChromaDB vector database against each of these queries. Implementing a cache mechanism is also mandatory. Finally, you need to implement the re-ranking block, and for this you can choose from a range of cross-encoding models on HuggingFace.
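The re-ranking step can be sketched as follows. To keep the example runnable without downloading a model, `score_fn` is a stand-in for a cross-encoder. The helper names `rerank` and `toy_overlap_score` are hypothetical, and the word-overlap scorer is only a placeholder, not a real relevance model.

```python
def rerank(query, documents, score_fn, top_k=3):
    """Score each (query, document) pair and return the top_k (document, score) tuples, best first."""
    pairs = [(query, doc) for doc in documents]
    scores = score_fn(pairs)  # one relevance score per pair
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def toy_overlap_score(pairs):
    """Placeholder scorer: number of query words that also appear in the document."""
    return [len(set(q.lower().split()) & set(d.lower().split())) for q, d in pairs]


docs = [
    "Claim procedures and notice of claim",
    "Member life insurance eligibility",
    "Premium payment and grace period",
]
top = rerank("life insurance eligibility for a member", docs, toy_overlap_score, top_k=2)
print(top)

# With a real cross-encoder (assumption: sentence-transformers is installed), one could use:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   top = rerank(query, docs, model.predict)
```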

### 3.  The Generation Layer:
In the generation layer, the final prompt that you design is the major component. Make sure that the prompt is exhaustive in its instructions, and the relevant information is correctly passed to the prompt. You may also choose to provide some few-shot examples in an attempt to improve the LLM output.
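A minimal sketch of assembling such a prompt is shown below. The function `build_prompt`, the wording of the instructions, and the sample context are assumptions for illustration. The actual prompt should be tuned against your own queries.

```python
def build_prompt(query, retrieved_docs):
    """Assemble a prompt from the user query and a list of (text, page) pairs."""
    context = "\n\n".join(f"[{page}] {text}" for text, page in retrieved_docs)
    return (
        "You are a helpful assistant for questions about an insurance policy.\n"
        "Answer using only the context below and cite the page numbers you used.\n"
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    "Who is eligible for Member Life Insurance?",
    [("Section A - Eligibility Member Life Insurance ...", "Page 7")],
)
print(prompt)
```

Few-shot examples, if used, could be added as extra question and answer pairs between the instructions and the context.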


# II. Project Goals (Objectives): HelpMate AI Search System

### Developing a Semantic Search System:

Employ the INS process encompassing three layers: Embedding Layer, Search and Rank Layer, and Generation Layer to enhance document retrieval efficiency.

### Extracting Information from PDF Documents:

Extract data from PDF documents and store it in a structured format. Subsequently, generate vector representations using the SentenceTransformers `all-MiniLM-L6-v2` model to facilitate search and analysis.

### Implementing a Cache Layer to Improve System Performance:

Integrate a cache layer to enhance system performance by storing and retrieving previous queries along with their results. Consequently, minimize processing time for recurring similar queries.


# Overall Design

<style>
  #OverallDesign {
    text-align: center;
    font-size: 48px;
  }
</style>
![Alt text](https://raw.githubusercontent.com/MrVuTuanAnh/HELPMATE_AI/main/HELPMATEAI/H1.png)


# III. Preparation: Install and Import the Required Libraries

### 1. Install libraries:
pip install -r requirements.txt
### 2. Create folders
Open a terminal and run:

1.      mkdir data
The data folder stores the downloaded PDF.

2.      mkdir chromadb
The chromadb folder stores the ChromaDB database.
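If you prefer to create the folders from Python instead of the terminal, a small sketch using the standard library is shown below. It assumes the folders should live in the current working directory.

```python
from pathlib import Path

# Create the folders used by the notebook; do nothing if they already exist
for folder in ("data", "chromadb"):
    Path(folder).mkdir(parents=True, exist_ok=True)
```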

In [1]:
# Import all the required Libraries
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
import openai
from pydantic import BaseModel
import os
from pathlib import Path
import requests
import pdfplumber
from operator import itemgetter
import json
import chromadb
import pandas as pd
from dotenv import load_dotenv
import fitz  # PyMuPDF
import numpy as np
from sentence_transformers import CrossEncoder, SentenceTransformer
from chromadb import PersistentClient
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction, SentenceTransformerEmbeddingFunction

### 3. Create the .env file with your OpenAI API key

OPENAI_API_KEY=your-openai-api-key

In [2]:
# 4. Load environment variables
load_dotenv()
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
if not OPENAI_API_KEY:
    raise ValueError("No OpenAI API key found. Please set OPENAI_API_KEY in your environment.")

# IV. Implement Embedding Layer for Mr.HelpMate 

### 1. Design

![Alt text](https://raw.githubusercontent.com/MrVuTuanAnh/HELPMATE_AI/main/HELPMATEAI/H2.png)


The Embedding Layer process is complex and involves several detailed pre-processing steps. Initially, we process the documents, which in this context are related to the insurance sector. This involves extracting text from the documents, dividing it into smaller segments, and subsequently feeding these segments into the embedding model for further processing.

### 2. Download, Read, Process, and Chunk the PDF file

For processing and reading PDF files, we'll leverage the pdfplumber library (https://pypi.org/project/pdfplumber/).

pdfplumber goes beyond simple text extraction. It excels at parsing various PDF elements, including tables and images, for more comprehensive data access. Additionally, it provides a rich set of functionalities and visual debugging features, aiding in advanced preprocessing tasks.

In [3]:
# URL of the file to download
url = "https://cdn.upgrad.com/uploads/production/585ca56a-6fe1-4b93-903c-1c1a1de74bf1/Principal-Sample-Life-Insurance-Policy.pdf"
# Path where the file will be saved after downloading
save_path = './data/Principal-Sample-Life-Insurance-Policy.pdf'
pdf_path = './data/'

# Use requests to download the file
response = requests.get(url)
response.raise_for_status()  # If there's an error, an exception will be raised

# Open a new file to save the data
with open(save_path, 'wb') as file:
    file.write(response.content)

print("File has been downloaded and saved at:", save_path)


File has been downloaded and saved at: ./data/Principal-Sample-Life-Insurance-Policy.pdf


In [4]:
# Function to check whether a word lies inside a table, used to separate regular text from tables

def check_bboxes(word, table_bbox):
    """Checks if a word's bounding box is within a table's bounding box."""
    # Word box and table box are both (x0, top, x1, bottom)
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]
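To illustrate the bounding-box test, here is a self-contained sketch with made-up coordinates. The function is restated without debug output so the block runs on its own. The sample `table_bbox` and word dictionaries are hypothetical values in the `(x0, top, x1, bottom)` layout that pdfplumber uses.

```python
def check_bboxes(word, table_bbox):
    """Return True if the word's box lies strictly inside the table's box."""
    left, top, right, bottom = word["x0"], word["top"], word["x1"], word["bottom"]
    t_left, t_top, t_right, t_bottom = table_bbox
    return left > t_left and top > t_top and right < t_right and bottom < t_bottom


table_bbox = (50, 100, 400, 300)  # hypothetical table area
word_in_table = {"text": "Premium", "x0": 60, "top": 120, "x1": 110, "bottom": 132}
word_outside = {"text": "Section", "x0": 60, "top": 40, "x1": 105, "bottom": 52}

print(check_bboxes(word_in_table, table_bbox))
print(check_bboxes(word_outside, table_bbox))
```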

Extract text from a PDF file.

In [5]:
# 1. Initialize a counter `p` to track page numbers throughout the loop.
# 2. Create an empty list `full_text` to store the processed text and corresponding page numbers.
# 3. Use `pdfplumber` to open the PDF and iterate through its pages one by one.
# 4. For each page, locate and store the positions of tables (bounding boxes).
# 5. Extract the content of the identified tables and store it in the `tables` variable.
# 6. Extract regular words (excluding those inside tables) using the `check_bboxes` function.
# 7. Initialize an empty list `lines` to accumulate processed text for the current page.
# 8. Use the `cluster_objects` function to group non-table words and tables by vertical position, preserving their original order in the PDF.
# 9. Iterate through the clustered elements:
#    - If the element is text, join its words and append the resulting line to `lines`.
#    - If the element is a table, append the entire table (serialized as JSON) to `lines`.
# 10. Combine the page number with the processed lines (`lines`) and append them as a single entry to the `full_text` list.
# 11. Increment `p` for the next iteration.
# 12. After processing all pages, return the `full_text` list containing page numbers and corresponding processed text.

def extract_text_from_pdf(pdf_path):
    p = 0
    full_text = []


    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_no = f"Page {p+1}"
            text = page.extract_text()

            tables = page.find_tables()
            table_bboxes = [i.bbox for i in tables]
            tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
            non_table_words = [word for word in page.extract_words() if not any(
                [check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
            lines = []

            for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5):

                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join([i['text'] for i in cluster]))
                    except KeyError:
                        pass

                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))


            full_text.append([page_no, " ".join(lines)])
            p +=1

    return full_text

With the text and table extraction function defined, let's iterate through all the PDFs in the `data` directory. We'll call the function for each PDF and store the extracted data (text and tables) in a list.

In [6]:
# Define the directory containing the PDF files
pdf_directory = Path(pdf_path)

# Initialize an empty list to store the extracted texts and document names
data = []

# Loop through all files in the directory
for pdf_path in pdf_directory.glob("*.pdf"):

    # Process the PDF file
    print(f"...Processing {pdf_path.name}")

    # Call the function to extract the text from the PDF
    extracted_text = extract_text_from_pdf(pdf_path)

    # Convert the extracted list to a DataFrame, and add a column to store document names
    extracted_text_df = pd.DataFrame(extracted_text, columns=['Page No.', 'Page_Text'])
    extracted_text_df['Document Name'] = pdf_path.name

    # Append the extracted text and document name to the list
    data.append(extracted_text_df)

    # Print a message to indicate progress
    print(f"Finished processing {pdf_path.name}")

# Print a message to indicate all PDFs have been processed
print("All PDFs have been processed.")

...Processing Principal-Sample-Life-Insurance-Policy.pdf
Finished processing Principal-Sample-Life-Insurance-Policy.pdf
All PDFs have been processed.


In [7]:
# Concatenate all the DFs in the list 'data' together

insurance_pdfs_data = pd.concat(data, ignore_index=True)

In [8]:
insurance_pdfs_data.shape

(64, 3)

In [9]:
insurance_pdfs_data.sample(2)

| | Page No. | Page_Text | Document Name |
|---|---|---|---|
| 49 | Page 50 | The Principal may require that a ADL Disabled ... | Principal-Sample-Life-Insurance-Policy.pdf |
| 8 | Page 9 | P ART I - DEFINITIONS When used in this Group ... | Principal-Sample-Life-Insurance-Policy.pdf |


In [10]:
# Let's also check the length of all the texts as there might be some empty pages or pages with very few words that we can drop

insurance_pdfs_data['Text_Length'] = insurance_pdfs_data['Page_Text'].apply(lambda x: len(x.split(' ')))
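One detail worth knowing about the word count above: `split(' ')` treats every single space as a separator, so runs of consecutive spaces produce empty strings that are counted as words, whereas `split()` without arguments counts only whitespace-separated words. The sample string below is made up to show the difference.

```python
text = "POLICY  RIDER   GROUP INSURANCE"  # note the repeated spaces

count_single_space = len(text.split(" "))  # empty strings between consecutive spaces are counted
count_whitespace = len(text.split())       # only non-empty words are counted

print(count_single_space, count_whitespace)
```

If the extracted text contains irregular spacing, `split()` gives the more reliable length estimate.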

In [11]:
insurance_pdfs_data.sample(2)

| | Page No. | Page_Text | Document Name | Text_Length |
|---|---|---|---|---|
| 12 | Page 13 | a . A licensed Doctor of Medicine (M.D.) or Os... | Principal-Sample-Life-Insurance-Policy.pdf | 260 |
| 44 | Page 45 | (1) If termination is as described in b. (1) a... | Principal-Sample-Life-Insurance-Policy.pdf | 179 |


In [12]:
max(insurance_pdfs_data['Text_Length'])

462

We'll filter out blank pages so that the data focuses on content-rich pages. A page is treated as blank if it has fewer than 10 words. This threshold drops pages that are empty or that contain only header or footer text, while keeping pages with even a small amount of real content.

In [13]:
# Retain only the rows with a text length of at least 10 (copy so that columns can be added later without a SettingWithCopyWarning)
insurance_pdfs_data = insurance_pdfs_data.loc[insurance_pdfs_data['Text_Length'] >= 10].copy()
insurance_pdfs_data.head()

| | Page No. | Page_Text | Document Name | Text_Length |
|---|---|---|---|---|
| 0 | Page 1 | DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/... | Principal-Sample-Life-Insurance-Policy.pdf | 30 |
| 2 | Page 3 | POLICY RIDER GROUP INSURANCE POLICY NO: S655 C... | Principal-Sample-Life-Insurance-Policy.pdf | 230 |
| 4 | Page 5 | PRINCIPAL LIFE INSURANCE COMPANY (called The P... | Principal-Sample-Life-Insurance-Policy.pdf | 110 |
| 5 | Page 6 | TABLE OF CONTENTS PART I - DEFINITIONS PART II... | Principal-Sample-Life-Insurance-Policy.pdf | 153 |
| 6 | Page 7 | Section A – Eligibility Member Life Insurance ... | Principal-Sample-Life-Insurance-Policy.pdf | 176 |


In [14]:
# Store the metadata for each page in a separate column
insurance_pdfs_data['Metadata'] = insurance_pdfs_data.apply(lambda x: {'Policy_Name': x['Document Name'][:-4], 'Page_No.': x['Page No.']}, axis=1)

In [15]:
insurance_pdfs_data.head()

| | Page No. | Page_Text | Document Name | Text_Length | Metadata |
|---|---|---|---|---|---|
| 0 | Page 1 | DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/... | Principal-Sample-Life-Insurance-Policy.pdf | 30 | {'Policy_Name': 'Principal-Sample-Life-Insuran... |
| 2 | Page 3 | POLICY RIDER GROUP INSURANCE POLICY NO: S655 C... | Principal-Sample-Life-Insurance-Policy.pdf | 230 | {'Policy_Name': 'Principal-Sample-Life-Insuran... |
| 4 | Page 5 | PRINCIPAL LIFE INSURANCE COMPANY (called The P... | Principal-Sample-Life-Insurance-Policy.pdf | 110 | {'Policy_Name': 'Principal-Sample-Life-Insuran... |
| 5 | Page 6 | TABLE OF CONTENTS PART I - DEFINITIONS PART II... | Principal-Sample-Life-Insurance-Policy.pdf | 153 | {'Policy_Name': 'Principal-Sample-Life-Insuran... |
| 6 | Page 7 | Section A – Eligibility Member Life Insurance ... | Principal-Sample-Life-Insurance-Policy.pdf | 176 | {'Policy_Name': 'Principal-Sample-Life-Insuran... |


Since most pages contain a few hundred words, with a maximum of 462, chunking the documents further isn't necessary. We can efficiently perform embeddings directly on individual pages. This approach is advantageous for two reasons:

1.  Insurance documents are typically well-structured, minimizing extraneous information within a page. This suggests a high degree of interrelation between text pieces on the same page.

2.  Larger chunks benefit the LLM (Large Language Model) during the generation layer by providing more context.

### 3.  Generate and Store Page Embeddings using OpenAI and ChromaDB

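In this section, we embed the pages in the dataframe and store them in a ChromaDB collection. Two embedding functions are set up below: OpenAI's `text-embedding-ada-002` model, which is passed to the cache collection, and the SentenceTransformers `all-MiniLM-L6-v2` model, which is passed to the main `RAG_on_Insurance` collection.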

In [16]:
# Import the OpenAI Embedding Function into chroma

from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

In [17]:
# Define the path where chroma collections will be stored
chroma_data_path = './chromadb/'


In [18]:
# Call PersistentClient()
client = chromadb.PersistentClient(path=chroma_data_path)

In [19]:
# Set up the embedding function using the OpenAI embedding model
model = "text-embedding-ada-002"
embedding_function = OpenAIEmbeddingFunction(api_key=OPENAI_API_KEY, model_name=model)

In [20]:
from chromadb.utils import embedding_functions
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

In [21]:
# Convert the page text and metadata from your dataframe to lists to be able to pass it to chroma

documents_list = insurance_pdfs_data["Page_Text"].tolist()
metadata_list = insurance_pdfs_data['Metadata'].tolist()

In [22]:
# Initialise a collection in chroma and pass an embedding function to it; here the SentenceTransformer function is passed, so the documents are embedded with all-MiniLM-L6-v2

insurance_collection = client.get_or_create_collection(name='RAG_on_Insurance', embedding_function=sentence_transformer_ef)

In [23]:
# Add the documents and metadata to the collection along with generic integer IDs. You can also use the metadata as IDs by combining the policy name and page number.

insurance_collection.add(
    documents= documents_list,
    ids = [str(i) for i in range(0, len(documents_list))],
    metadatas = metadata_list
)

Add of existing embedding ID: 0
Add of existing embedding ID: 1
Add of existing embedding ID: 2
...
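As noted in the comment above, IDs can also be derived from the metadata instead of using generic integers. A minimal sketch is shown below. The helper name `make_ids` and the `<policy>_page_<n>` format are assumptions for illustration, and the sample metadata mirrors the structure built earlier in the notebook.

```python
def make_ids(metadata_list):
    """Build one readable ID per page from its policy name and page number."""
    ids = []
    for meta in metadata_list:
        page_number = meta["Page_No."].replace("Page ", "")
        ids.append(f"{meta['Policy_Name']}_page_{page_number}")
    if len(set(ids)) != len(ids):
        raise ValueError("IDs must be unique within a collection")
    return ids


sample_metadata = [
    {"Policy_Name": "Principal-Sample-Life-Insurance-Policy", "Page_No.": "Page 1"},
    {"Policy_Name": "Principal-Sample-Life-Insurance-Policy", "Page_No.": "Page 3"},
]
print(make_ids(sample_metadata))
```

Readable IDs make it easier to trace a search result back to its source page, and they stay stable if rows are filtered differently in a later run.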

In [24]:
# Let's take a look at the first few entries in the collection

insurance_collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['0', '1', '2'],
 'embeddings': [[-0.025921911001205444,
   0.047777481377124786,
   0.05585775524377823,
   0.04239744320511818,
   0.05814303457736969,
   0.10849817842245102,
   0.02889098785817623,
   -0.00977775827050209,
   -0.08766452968120575,
   0.027251530438661575,
   0.0377378948032856,
   0.04159488528966904,
   -0.013698960654437542,
   -0.06046951189637184,
   -0.0953066349029541,
   -0.035520050674676895,
   -0.05023425444960594,
   0.013877619057893753,
   -0.03827238827943802,
   0.036519210785627365,
   0.009495572187006474,
   0.03799031674861908,
   -0.06191485747694969,
   -0.0333947017788887,
   0.032034873962402344,
   0.0003445710754022002,
   0.034971095621585846,
   -0.048203881829977036,
   0.02183268591761589,
   0.00023783583310432732,
   0.0046171932481229305,
   -0.0362711064517498,
   -0.03494366630911827,
   0.041484177112579346,
   0.0339135080575943,
   -0.008206150494515896,
   -0.04822534695267677,
   -0.0033177994191646576,
   -0.025191377

In [25]:
cache_collection = client.get_or_create_collection(name='Insurance_Cache', embedding_function=embedding_function)

In [26]:
cache_collection.peek()

{'ids': [],
 'embeddings': [],
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}

### Generating and Storing Embeddings

After pre-processing and chunking the document text, we generate vector representations using a text embedding model. The notebook sets up two options: the `all-MiniLM-L6-v2` model from the SentenceTransformers library and OpenAI's `text-embedding-ada-002` model.

OpenAI's `text-embedding-ada-002` model creates 1,536-dimensional vector representations of text. ChromaDB's utility functions wrap OpenAI's model for generating embeddings. Refer to the documentation for more details: https://docs.trychroma.com/embeddings#openai

Once generated, the embeddings are stored in ChromaDB, our vector database. A Chroma collection must exist before documents can be added, so we use the `get_or_create_collection` method, which either creates a new collection or retrieves an existing one. This ensures the collection is available for storing documents.

Because we pass an explicit embedding function instead of relying on Chroma's default, it must be supplied when the collection is created. Finally, the collection is populated with the document texts, their IDs, and their metadata. A separate Chroma collection, `Insurance_Cache`, is created for caching, which is covered in the Search Layer section.

# V. Implement Search Layer for Mr.HelpMate 

### 1. Design

![Alt text](https://raw.githubusercontent.com/MrVuTuanAnh/HELPMATE_AI/main/HELPMATEAI/H3.png)


Like any well-designed system, we need to plan for scalability when the application grows. This could involve a significant increase in documents or multiple users accessing the system concurrently. Here are key performance concerns to address:

1.  Concurrent Query Handling: How will the system manage multiple users submitting queries simultaneously? Can it efficiently handle this load without impacting response times?

2.  Search and Retrieval Performance: As the data volume increases, maintaining efficient search and retrieval becomes crucial. Are there strategies to optimize the system's performance in these operations?

### 2. Search with Caching

This section explores semantic search, where we'll query the collection's embeddings to retrieve the top semantically similar results. We'll also discuss design considerations for implementing a cache layer within the semantic search system.

#### Cache Functionality

When a document query is submitted:

1.  Cache Search: The system first searches the cache for the query.

2.  Cache Hit: If a match is found in the cache, the top k closest documents (or chunks of k documents) are retrieved from the cache and returned as the results.

3.  Cache Miss: If the query isn't found in the cache:

    - Main Vector Database Search: The system performs a search on the primary vector database.
    - Cache Update: The newly executed query and its results from the main database (top k closest documents or chunks of k documents) are then stored in the cache. This ensures faster retrieval for similar queries in the future.

#### Benefits of Caching:

Caching significantly improves search performance by reducing the load on the main vector database for frequently occurring or similar queries. By storing results in the cache, the system can return them quickly without needing to re-execute the search on the entire database every time.
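The cache flow described above (lookup, hit, miss, update) can be sketched with a toy in-memory cache. This is only an illustration under stated assumptions: `SemanticCache`, `search_with_cache`, and `fake_main_search` are hypothetical names, the two-dimensional vectors stand in for real embeddings, and cosine distance is used for simplicity rather than the distance metric of any particular vector database.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


class SemanticCache:
    """Toy cache keyed by query embeddings (stand-in for a cache collection)."""

    def __init__(self, threshold=0.2):
        self.threshold = threshold
        self.entries = []  # list of (embedding, results)

    def lookup(self, embedding):
        """Return results of the closest cached query, or None on a miss."""
        best = min(self.entries, key=lambda e: cosine_distance(e[0], embedding), default=None)
        if best is not None and cosine_distance(best[0], embedding) <= self.threshold:
            return best[1]
        return None

    def store(self, embedding, results):
        self.entries.append((embedding, results))


def search_with_cache(embedding, cache, search_main):
    """Hit: return cached results. Miss: query the main store and update the cache."""
    cached = cache.lookup(embedding)
    if cached is not None:
        return cached, "hit"
    results = search_main(embedding)
    cache.store(embedding, results)
    return results, "miss"


cache = SemanticCache(threshold=0.2)
fake_main_search = lambda emb: ["Page 16", "Page 19"]  # hypothetical main-database search
first = search_with_cache([1.0, 0.0], cache, fake_main_search)
second = search_with_cache([0.99, 0.05], cache, fake_main_search)
print(first, second)
```

The first call misses and fills the cache. The second query is close enough to the first to be served from the cache.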
    

In [50]:
# First user query

query_1 = input()
print("=" * 120)
print("======================================== First the user query ========================================")
print("=" * 120)
print()
print()
print()
print(query_1)




What are the default benefits and provisions of the Group Policy?


#### Search the Cache Collection First

We begin by searching the cache collection to see if the user's query already has results stored. Here's the process:

1.  Cache Lookup: The system queries the cache for the single closest previously stored query.

2.  Cache Hit: If the closest cached query is within the distance threshold, the system returns the results stored for it (the top 10 results computed when that query was first run). This provides a faster response for the user.

3.  Cache Miss: If the cache is empty or the closest cached query is farther away than the threshold, the system queries the main vector database and stores the new query and its results in the cache.

In [37]:
# Query the cache collection against the user query and return the single closest cached query
cache_results = cache_collection.query(
    query_texts=query_1,
    n_results=1
)

In [38]:
cache_results

{'ids': [[]],
 'distances': [[]],
 'metadatas': [[]],
 'embeddings': None,
 'documents': [[]],
 'uris': None,
 'data': None}

In [39]:
results = insurance_collection.query(
query_texts=query_1,
n_results=10
)
results.items()

dict_items([('ids', [['13', '16', '14', '39', '3', '1', '28', '30', '18', '27']]), ('distances', [[0.8645804988670251, 0.884373824503724, 0.9797316235128661, 1.0185734171896121, 1.0266769868103995, 1.0518579649442414, 1.0616972339135968, 1.1031169514343104, 1.1138286497033836, 1.1176436039058013]]), ('metadatas', [[{'Page_No.': 'Page 16', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 19', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 17', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 42', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 6', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 3', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 31', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 33', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}, {'Page_No.': 'Page 21', 'Poli
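To interpret these distances and the cache threshold of 0.2 used in this notebook, it can help to relate them to cosine similarity. The sketch below assumes two things that are not confirmed by the notebook itself: that the collections use Chroma's default `l2` space, which reports squared Euclidean distance, and that the embeddings are unit-length. Under those assumptions, d = 2 - 2·cos, so a distance of 0.2 corresponds to a cosine similarity of 0.9.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def distance_to_cosine(distance):
    """For unit-length vectors: d = 2 - 2*cos, so cos = 1 - d/2."""
    return 1.0 - distance / 2.0


# Two unit-length toy vectors
a = [1.0, 0.0]
b = [0.6, 0.8]
d = squared_l2(a, b)
print(d, cosine_similarity(a, b), distance_to_cosine(d))
```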

In [42]:
# Implementing Cache in Search layer

# Set a threshold for cache search
threshold = 0.2

ids = []
documents = []
distances = []
metadatas = []
results_df = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = insurance_collection.query(
      query_texts=query_1,
      n_results=10
      )

      # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
      # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
      Keys = []
      Values = []

      for key, val in results.items():
        if val is None:
          continue
        if key != 'embeddings':
          for i in range(10): # Top 10 variable, we can also put as 25 for top_n
            Keys.append(str(key)+str(i))
            Values.append(str(val[0][i]))


      cache_collection.add(
          documents= [query_1],
          ids = [query_1],  # Or, to assign integer IDs 0,1,2,.., use str(cache_collection.count()), which returns the number of queries currently in the cache
          metadatas = dict(zip(Keys, Values))
      )

      print("Not found in cache. Found in main collection.")

      result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
      results_df = pd.DataFrame.from_dict(result_dict)
      results_df


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results['distances'][0][0] <= threshold:
      cache_result_dict = cache_results['metadatas'][0][0]

      # Loop through each inner list and then through the dictionary
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids.append(value)
          elif 'documents' in key:
              documents.append(value)
          elif 'distances' in key:
              distances.append(value)
          elif 'metadatas' in key:
              metadatas.append(value)

      print("Found in cache!")

      # Create a DataFrame
      results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': distances,
        'Metadatas': metadatas
      })


Not found in cache. Found in main collection.
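Because the cached results are flattened into keys such as `ids0` or `documents3` and every value is stored with `str()`, reading them back relies on key order and leaves distances and metadata as strings. The sketch below shows one possible way to restore ordered, typed lists. The helper `unpack_cached_results` is hypothetical, and the sample dictionary is a shortened stand-in modeled on the cache output shown in this notebook.

```python
import ast
import re


def unpack_cached_results(cache_metadata):
    """Group flattened keys like 'documents0' or 'ids3' back into ordered lists."""
    grouped = {}
    for key, value in cache_metadata.items():
        match = re.fullmatch(r"([a-z]+)(\d+)", key)
        if match is None:
            continue  # skip keys that do not follow the '<field><index>' pattern
        field, index = match.group(1), int(match.group(2))
        grouped.setdefault(field, []).append((index, value))
    # Sort each field by its numeric index so the order does not depend on dict order
    unpacked = {field: [v for _, v in sorted(items)] for field, items in grouped.items()}
    # Values were stored with str(), so convert distances and metadata dicts back
    unpacked["distances"] = [float(d) for d in unpacked.get("distances", [])]
    unpacked["metadatas"] = [ast.literal_eval(m) for m in unpacked.get("metadatas", [])]
    return unpacked


sample = {
    "ids1": "16", "ids0": "13",
    "distances0": "0.8645804988670251", "distances1": "0.884373824503724",
    "documents0": "PART II - POLICY ADMINISTRATION ...",
    "documents1": "The Principal has complete discretion ...",
    "metadatas0": "{'Page_No.': 'Page 16'}", "metadatas1": "{'Page_No.': 'Page 19'}",
}
restored = unpack_cached_results(sample)
print(restored["ids"], restored["distances"], restored["metadatas"][0]["Page_No."])
```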


In [43]:
results_df

| | Metadatas | Documents | Distances | IDs |
|---|---|---|---|---|
| 0 | {'Page_No.': 'Page 16', 'Policy_Name': 'Princi... | PART II - POLICY ADMINISTRATION Section A - Co... | 0.86458 | 13 |
| 1 | {'Page_No.': 'Page 19', 'Policy_Name': 'Princi... | T he Principal has complete discretion to cons... | 0.884374 | 16 |
| 2 | {'Page_No.': 'Page 17', 'Policy_Name': 'Princi... | a. be actively engaged in business for profit ... | 0.979732 | 14 |
| 3 | {'Page_No.': 'Page 42', 'Policy_Name': 'Princi... | Section F - Individual Purchase Rights Article... | 1.018573 | 39 |
| 4 | {'Page_No.': 'Page 6', 'Policy_Name': 'Princip... | TABLE OF CONTENTS PART I - DEFINITIONS PART II... | 1.026677 | 3 |
| 5 | {'Page_No.': 'Page 3', 'Policy_Name': 'Princip... | POLICY RIDER GROUP INSURANCE POLICY NO: S655 C... | 1.051858 | 1 |
| 6 | {'Page_No.': 'Page 31', 'Policy_Name': 'Princi... | Scheduled Benefit in force for the Member befo... | 1.061697 | 28 |
| 7 | {'Page_No.': 'Page 33', 'Policy_Name': 'Princi... | a . In no event will Dependent Life Insurance ... | 1.103117 | 30 |
| 8 | {'Page_No.': 'Page 21', 'Policy_Name': 'Princi... | b . on any date the definition of Member or De... | 1.113829 | 18 |
| 9 | {'Page_No.': 'Page 30', 'Policy_Name': 'Princi... | (6) If, on the date a Member becomes eligible ... | 1.117644 | 27 |


In [48]:
# Second user query

query_2 = input()
print("=" * 120)
print("======================================== Second the user query ========================================")
print("=" * 120)
print()
print()
print()
print(query_2)




what does it mean by 'the later of the Date of Issue'?


In [52]:
# Query the cache collection against the user query and return the single closest cached query

cache_results2 = cache_collection.query(
    query_texts=query_2,
    n_results=1
)

In [53]:
cache_results2

{'ids': [['What are the default benefits and provisions of the Group Policy?']],
 'distances': [[0.5435725924091256]],
 'metadatas': [[{'distances0': '0.8645804988670251',
    'distances1': '0.884373824503724',
    'distances2': '0.9797316235128661',
    'distances3': '1.0185734171896121',
    'distances4': '1.0266769868103995',
    'distances5': '1.0518579649442414',
    'distances6': '1.0616972339135968',
    'distances7': '1.1031169514343104',
    'distances8': '1.1138286497033836',
    'distances9': '1.1176436039058013',
    'documents0': "PART II - POLICY ADMINISTRATION Section A - Contract Article 1 - Entire Contract This Group Policy, the current Certificate, the attached Policyholder application, and any Member applications make up the entire contract. The Principal is obligated only as provided in this Group Policy and is not bound by any trust or plan to which it is not a signatory party. Article 2 - Policy Changes Insurance under this Group Policy runs annually to the Policy

In [54]:
# Implementing Cache in Search layer

# Set a threshold for cache search
threshold = 0.2

ids2 = []
documents2 = []
distances2 = []
metadatas2 = []
results_df2 = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results2['distances'][0] == [] or cache_results2['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = insurance_collection.query(
      query_texts=query_2,
      n_results=10
      )

      # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
      # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
      Keys2 = []
      Values2 = []

      for key, val in results.items():
        if val is None:
          continue
        if key != 'embeddings':
          for i in range(10): # Top 10 variable, we can also put as 25 for top_n
            Keys2.append(str(key)+str(i))
            Values2.append(str(val[0][i]))


      cache_collection.add(
          documents= [query_2],
          ids = [query_2],  # Or, to assign integer IDs 0,1,2,.., use str(cache_collection.count()), which returns the number of queries currently in the cache
          metadatas = dict(zip(Keys2, Values2))
      )

      print("Not found in cache. Found in main collection.")

      result_dict2 = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
      results_df2 = pd.DataFrame.from_dict(result_dict2)
      results_df2


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results2['distances'][0][0] <= threshold:
      cache_result_dict2 = cache_results2['metadatas'][0][0]

      # Loop through each inner list and then through the dictionary
      for key, value in cache_result_dict2.items():
          if 'ids' in key:
              ids2.append(value)
          elif 'documents' in key:
              documents2.append(value)
          elif 'distances' in key:
              distances2.append(value)
          elif 'metadatas' in key:
              metadatas2.append(value)

      print("Found in cache!")

      # Create a DataFrame
      results_df2 = pd.DataFrame({
        'IDs': ids2,
        'Documents': documents2,
        'Distances': distances2,
        'Metadatas': metadatas2
      })


Not found in cache. Found in main collection.


In [55]:
results_df2

| | Metadatas | Documents | Distances | IDs |
|---|---|---|---|---|
| 0 | {'Page_No.': 'Page 29', 'Policy_Name': 'Princi... | Insurance for which Proof of Good Health is re... | 1.200996 | 26 |
| 1 | {'Page_No.': 'Page 27', 'Policy_Name': 'Princi... | I f a Member's Dependent is employed and is co... | 1.259137 | 24 |
| 2 | {'Page_No.': 'Page 21', 'Policy_Name': 'Princi... | b . on any date the definition of Member or De... | 1.332726 | 18 |
| 3 | {'Page_No.': 'Page 31', 'Policy_Name': 'Princi... | Scheduled Benefit in force for the Member befo... | 1.349086 | 28 |
| 4 | {'Page_No.': 'Page 36', 'Policy_Name': 'Princi... | A Member's insurance under this Group Policy f... | 1.359287 | 33 |
| 5 | {'Page_No.': 'Page 34', 'Policy_Name': 'Princi... | provided The Principal has been notified of th... | 1.387343 | 31 |
| 6 | {'Page_No.': 'Page 61', 'Policy_Name': 'Princi... | Section D - Claim Procedures Article 1 - Notic... | 1.391411 | 58 |
| 7 | {'Page_No.': 'Page 28', 'Policy_Name': 'Princi... | Section B - Effective Dates Article 1 - Member... | 1.394463 | 25 |
| 8 | {'Page_No.': 'Page 30', 'Policy_Name': 'Princi... | (6) If, on the date a Member becomes eligible ... | 1.409443 | 27 |
| 9 | {'Page_No.': 'Page 24', 'Policy_Name': 'Princi... | T he Principal may terminate the Policyholder'... | 1.421555 | 21 |


In [56]:
# Third user query

query_3 = input()
print("=" * 120)
print("======================================== Third user query ========================================")
print("=" * 120)
print()
print()
print()
print(query_3)




What happens if a third-party service provider fails to provide the promised goods and services?


In [57]:
# Query the cache collection against the user query and return the single closest cached query

cache_results3 = cache_collection.query(
    query_texts=query_3,
    n_results=1
)

In [58]:
cache_results3

{'ids': [['What are the default benefits and provisions of the Group Policy?']],
 'distances': [[0.46159295689951196]],
 'metadatas': [[{'distances0': '0.8645804988670251',
    'distances1': '0.884373824503724',
    'distances2': '0.9797316235128661',
    'distances3': '1.0185734171896121',
    'distances4': '1.0266769868103995',
    'distances5': '1.0518579649442414',
    'distances6': '1.0616972339135968',
    'distances7': '1.1031169514343104',
    'distances8': '1.1138286497033836',
    'distances9': '1.1176436039058013',
    'documents0': "PART II - POLICY ADMINISTRATION Section A - Contract Article 1 - Entire Contract This Group Policy, the current Certificate, the attached Policyholder application, and any Member applications make up the entire contract. The Principal is obligated only as provided in this Group Policy and is not bound by any trust or plan to which it is not a signatory party. Article 2 - Policy Changes Insurance under this Group Policy runs annually to the Polic

In [59]:
# Implementing Cache in Search layer

# Set a threshold for cache search
threshold = 0.2

ids3 = []
documents3 = []
distances3 = []
metadatas3 = []
results_df3 = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results3['distances'][0] == [] or cache_results3['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = insurance_collection.query(
      query_texts=query_3,
      n_results=10
      )

      # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
      # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
      Keys3 = []
      Values3 = []

      for key, val in results.items():
        if val is None:
          continue
        if key != 'embeddings':
          for i in range(len(val[0])):  # One entry per returned result (at most n_results=10)
            Keys3.append(str(key)+str(i))
            Values3.append(str(val[0][i]))


      cache_collection.add(
          documents= [query_3],
          ids = [query_3],  # The query text doubles as its ID. Integer IDs (0, 1, 2, ...) would also work, using the current number of cached queries as the next ID.
          metadatas = dict(zip(Keys3, Values3))
      )

      print("Not found in cache. Found in main collection.")

      result_dict3 = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
      results_df3 = pd.DataFrame.from_dict(result_dict3)
      results_df3


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results3['distances'][0][0] <= threshold:
      cache_result_dict3 = cache_results3['metadatas'][0][0]

      # Loop through the cached metadata dictionary and group its values by field
      for key, value in cache_result_dict3.items():
          if 'ids' in key:
              ids3.append(value)
          elif 'documents' in key:
              documents3.append(value)
          elif 'distances' in key:
              distances3.append(value)
          elif 'metadatas' in key:
              metadatas3.append(value)

      print("Found in cache!")

      # Create a DataFrame
      results_df3 = pd.DataFrame({
        'IDs': ids3,
        'Documents': documents3,
        'Distances': distances3,
        'Metadatas': metadatas3
      })


Not found in cache. Found in main collection.


In [60]:
results_df3

Unnamed: 0,Metadatas,Documents,Distances,IDs
0,"{'Page_No.': 'Page 3', 'Policy_Name': 'Princip...",POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,1.06731,1
1,"{'Page_No.': 'Page 24', 'Policy_Name': 'Princi...",T he Principal may terminate the Policyholder'...,1.369205,21
2,"{'Page_No.': 'Page 23', 'Policy_Name': 'Princi...",Section C - Policy Termination Article 1 - Fai...,1.382593,20
3,"{'Page_No.': 'Page 19', 'Policy_Name': 'Princi...",T he Principal has complete discretion to cons...,1.416428,16
4,"{'Page_No.': 'Page 47', 'Policy_Name': 'Princi...","M ember's death, the Death Benefits Payable ma...",1.425472,44
5,"{'Page_No.': 'Page 6', 'Policy_Name': 'Princip...",TABLE OF CONTENTS PART I - DEFINITIONS PART II...,1.469127,3
6,"{'Page_No.': 'Page 18', 'Policy_Name': 'Princi...",c . a copy of the form which contains the stat...,1.470377,15
7,"{'Page_No.': 'Page 17', 'Policy_Name': 'Princi...",a. be actively engaged in business for profit ...,1.498452,14
8,"{'Page_No.': 'Page 61', 'Policy_Name': 'Princi...",Section D - Claim Procedures Article 1 - Notic...,1.49882,58
9,"{'Page_No.': 'Page 36', 'Policy_Name': 'Princi...",A Member's insurance under this Group Policy f...,1.500066,33


#### Semantic Cache

A semantic cache stores each query by its meaning (its embedding) together with the results retrieved for it, rather than keying on the exact query text. A new query that is close enough in meaning to a cached one can reuse those results, which reduces the number of queries the main collection has to process.

##### Improved Performance with Cache

The cache sits in front of the semantic search layer, which can be a performance bottleneck. For queries that match an entry in the cache collection, it returns the stored results directly without searching the main collection.

##### Cache Lookup and Fallback

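When a user submits a query, the system first embeds it and searches the cache collection for the nearest previously stored query. If that query lies within the distance threshold (0.2 in this notebook), the cached results are returned directly, which gives a faster response. In the run above, the nearest cached query to the third query was at a distance of about 0.46, well above the threshold, so the lookup was a miss and the main collection was searched.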
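The lookup-and-fallback flow can be sketched without ChromaDB. This is a minimal illustration, not the notebook's implementation: `embed` is a toy bag-of-words stand-in for the real embedding model, `cosine_distance` is an assumed metric (the distance a real collection reports depends on its configuration, so a threshold of 0.2 is only meaningful relative to that metric), and `SemanticCache` and `search_with_cache` are hypothetical names.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding (a stand-in for the real embedding model)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

class SemanticCache:
    def __init__(self, threshold=0.2):
        self.threshold = threshold
        self.entries = []  # list of (query_embedding, cached_results)

    def lookup(self, query):
        """Return cached results if the nearest cached query is within the threshold, else None."""
        if not self.entries:
            return None
        q = embed(query)
        distance, results = min(
            ((cosine_distance(q, emb), res) for emb, res in self.entries),
            key=lambda pair: pair[0],
        )
        return results if distance <= self.threshold else None

    def store(self, query, results):
        self.entries.append((embed(query), results))

def search_with_cache(query, cache, search_main):
    """Try the cache first; on a miss, search the main collection and cache the results."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached, "cache"
    results = search_main(query)
    cache.store(query, results)
    return results, "main"

# Usage: the second, identical query is served from the cache.
cache = SemanticCache(threshold=0.2)
fake_search = lambda q: ["chunk about " + q]  # placeholder for the main-collection query
_, source1 = search_with_cache("what is the grace period", cache, fake_search)
_, source2 = search_with_cache("what is the grace period", cache, fake_search)
print(source1, source2)  # main cache
```

Wrapping the flow in one function also avoids repeating the same if/elif block for every query.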

##### Cache Update and Main Collection Search

If the query isn't found in the cache (cache miss), the system performs a search on the main collection to identify the top k most relevant documents or chunks for that query. These results are then returned to the user and simultaneously stored in the cache along with the original query, enriching the cache for future use.
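The notebook stores the retrieved results in the cache as flat metadata keys such as `documents0` and `distances0`. The round trip can be sketched as two small helpers. `flatten_results` and `unflatten_results` are hypothetical names, and the input is assumed to have the nested-list shape of the query results shown above. Because every value is stored as a string, distances and metadata dictionaries come back as strings and need converting before numeric use.

```python
def flatten_results(results, top_n=10):
    """Flatten a query-result dict into {'documents0': '...', 'distances0': '...', ...}."""
    flat = {}
    for key, val in results.items():
        if val is None or key == "embeddings":
            continue
        # val[0] holds the results for the first (and only) query text
        for i, item in enumerate(val[0][:top_n]):
            flat[f"{key}{i}"] = str(item)
    return flat

def unflatten_results(flat):
    """Group flattened keys back into per-field lists, ordered by rank."""
    grouped = {}
    for key, value in flat.items():
        field = key.rstrip("0123456789")  # 'documents10' -> 'documents'
        rank = int(key[len(field):])      # 'documents10' -> 10
        grouped.setdefault(field, []).append((rank, value))
    return {
        field: [value for _, value in sorted(items, key=lambda pair: pair[0])]
        for field, items in grouped.items()
    }

# Usage with a two-result example in the same shape as the query output
example = {
    "ids": [["13", "16"]],
    "distances": [[0.86, 0.88]],
    "documents": [["first chunk", "second chunk"]],
    "metadatas": [[{"Page_No.": "Page 16"}, {"Page_No.": "Page 19"}]],
    "embeddings": None,
}
flat = flatten_results(example)
restored = unflatten_results(flat)
print(restored["ids"], restored["distances"])
```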

##### Cache Optimization

The cache's behavior can be tuned, mainly through the distance threshold. A lower threshold returns cached results only for near-identical queries; a higher threshold increases the hit rate but risks returning results that were cached for a different question. Because a cache hit skips the search over the main collection, well-chosen settings translate into faster response times and a smoother user experience.

### 3. Re-Ranking with a Cross Encoder

While semantic search retrieves relevant documents, their order might not perfectly reflect their true relevance to the query.  Re-ranking can significantly improve the quality of search results.

##### Here's how it works:

1.  Query-Response Pairs: Each retrieved document (response) is paired with the original user query.

2.  Cross-Encoder Scoring: These pairs are then fed into a cross-encoder, a neural network model that assesses the semantic similarity between the query and the document.

3.  Improved Ranking: Based on the scores from the cross-encoder, the documents are re-ranked, placing the most relevant ones at the top.

This approach ensures that the final results are not only semantically similar to the query but also the most relevant within that set of similar documents.
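The three steps above can be sketched as one small function. This is an illustration under stated assumptions: `rerank` is a hypothetical helper, and `overlap_scores` is a toy scorer that counts shared tokens so the example runs without downloading a model. In the notebook, the scorer would be the cross-encoder's `predict` method, which accepts the same list of `[query, document]` pairs.

```python
def rerank(query, documents, score_fn, top_k=3):
    """Pair the query with each document, score the pairs, and return the top_k by score."""
    pairs = [[query, doc] for doc in documents]  # step 1: query-response pairs
    scores = score_fn(pairs)                      # step 2: one score per pair
    ranked = sorted(zip(documents, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]                         # step 3: most relevant first

def overlap_scores(pairs):
    """Toy stand-in scorer (not a cross-encoder): number of shared lowercase tokens."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]

docs = [
    "premium rates by age group",
    "the policy covers diabetes and heart disease",
    "claim procedures",
]
top = rerank("does the policy cover diabetes", docs, overlap_scores, top_k=2)
print(top[0])
```

Passing the scorer as an argument keeps the re-ranking logic independent of the model used.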

![Alt text](https://raw.githubusercontent.com/MrVuTuanAnh/HELPMATE_AI/main/HELPMATEAI/H4.png)

#### Design Cross Encoder

![Alt text](https://raw.githubusercontent.com/MrVuTuanAnh/HELPMATE_AI/main/HELPMATEAI/H5.png)


#### Refining Results with Re-Ranking


Re-ranking is an important step in building a robust semantic search pipeline. While our system retrieves the top K documents relevant to the user's query, the information quality within these documents can vary.  Some retrieved documents might not perfectly capture the user's intent.

#### The Role of Re-Ranking

The re-ranking stage addresses this by meticulously evaluating the top K results. It verifies how well each document aligns with the user's query and assigns an importance score to reflect its relevance. This process ensures the most relevant and informative documents are presented at the top.

#### Benefits of Re-Ranking:

1.  Enhanced Accuracy and Relevance: Re-ranking improves the overall accuracy and relevance of the retrieved results by prioritizing the most valuable documents.

2.  Reduced Noise: It minimizes the amount of irrelevant or inaccurate information presented to the user, leading to a more focused search experience.

3.  Personalized Results: Depending on the implementation, re-ranking can also weigh factors beyond strict relevance, tailoring results to specific tasks or domains.

#### Common Re-Ranking Methods

Re-ranking methods used in search include Reciprocal Rank Fusion (RRF), hybrid search approaches, and cross-encoder models. This project uses cross-encoders for re-ranking.


In [61]:
results_df.head()

Unnamed: 0,Metadatas,Documents,Distances,IDs
0,"{'Page_No.': 'Page 16', 'Policy_Name': 'Princi...",PART II - POLICY ADMINISTRATION Section A - Co...,0.86458,13
1,"{'Page_No.': 'Page 19', 'Policy_Name': 'Princi...",T he Principal has complete discretion to cons...,0.884374,16
2,"{'Page_No.': 'Page 17', 'Policy_Name': 'Princi...",a. be actively engaged in business for profit ...,0.979732,14
3,"{'Page_No.': 'Page 42', 'Policy_Name': 'Princi...",Section F - Individual Purchase Rights Article...,1.018573,39
4,"{'Page_No.': 'Page 6', 'Policy_Name': 'Princip...",TABLE OF CONTENTS PART I - DEFINITIONS PART II...,1.026677,3


In [62]:
# Initialise the cross encoder model

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [63]:
# Test the cross encoder model

scores = cross_encoder.predict([['Does the insurance cover diabetic patients?', 'The insurance policy covers some pre-existing conditions including diabetes, heart diseases, etc. The policy does not howev'],
                                ['Does the insurance cover diabetic patients?', 'The premium rates for various age groups are given as follows. Age group (<18 years): Premium rate']])

In [64]:
scores

array([  3.8467638, -11.25288  ], dtype=float32)

##### Cross-Encoders for Re-Ranking

In our semantic search pipeline, we'll utilize a cross-encoder model for re-ranking. Cross-encoders are neural networks specifically designed to assess the semantic similarity between two text pieces (query and document in our case).

##### Output Scores: Reflecting Similarity

The cross-encoder returns one score per (query, document) pair, and a higher score means the document is more relevant to the query.

The scores are not confined to the range 0 to 1. As the test output above shows (about 3.85 for the relevant passage and -11.25 for the irrelevant one), this model returns unbounded values that can be positive or negative. A negative score is not a separate "dissimilar" label; it simply indicates lower relevance than a higher score. For re-ranking, only the relative order of the scores matters.
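If a value between 0 and 1 is easier to read, a logistic sigmoid maps each score into that range without changing the order, since the function is strictly increasing. The sketch below applies it to the two scores from the test cell above. The `sigmoid` helper is an illustration, not part of the notebook.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function mapping any real score into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

raw_scores = [3.8467638, -11.25288]  # cross-encoder scores from the test cell
probs = [sigmoid(s) for s in raw_scores]

# The ordering of the documents is the same before and after the mapping.
order_raw = sorted(range(len(raw_scores)), key=lambda i: raw_scores[i], reverse=True)
order_prob = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
print(probs, order_raw == order_prob)
```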

##### Cross-Encoder Input Format

The cross-encoder expects its input as a list of lists. For re-ranking, this translates to:

1.  Outer List: One entry for each of the top K documents retrieved by the initial semantic search.

2.  Inner Lists: Each inner list is a single pair, with the original user query first and one retrieved document (response) second.

By feeding these pairs into the cross-encoder, we obtain similarity scores that guide the re-ranking process.

#### For 1st Query

In [66]:
# Input (query, response) pairs for each of the top 10 responses received from the semantic search to the cross encoder
# Generate the cross_encoder scores for these pairs

cross_inputs = [[query_1, response] for response in results_df['Documents']]
cross_rerank_scores = cross_encoder.predict(cross_inputs)

In [67]:
cross_rerank_scores

array([ -0.05854175,   2.2480361 ,  -9.004519  ,  -2.9255505 ,
       -10.979068  ,  -0.9663528 ,  -3.1405935 ,  -5.1891694 ,
        -9.256308  ,  -4.61058   ], dtype=float32)

In [68]:
# Store the rerank_scores in results_df

results_df['Reranked_scores'] = cross_rerank_scores

In [69]:
results_df

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 16', 'Policy_Name': 'Princi...",PART II - POLICY ADMINISTRATION Section A - Co...,0.86458,13,-0.058542
1,"{'Page_No.': 'Page 19', 'Policy_Name': 'Princi...",T he Principal has complete discretion to cons...,0.884374,16,2.248036
2,"{'Page_No.': 'Page 17', 'Policy_Name': 'Princi...",a. be actively engaged in business for profit ...,0.979732,14,-9.004519
3,"{'Page_No.': 'Page 42', 'Policy_Name': 'Princi...",Section F - Individual Purchase Rights Article...,1.018573,39,-2.92555
4,"{'Page_No.': 'Page 6', 'Policy_Name': 'Princip...",TABLE OF CONTENTS PART I - DEFINITIONS PART II...,1.026677,3,-10.979068
5,"{'Page_No.': 'Page 3', 'Policy_Name': 'Princip...",POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,1.051858,1,-0.966353
6,"{'Page_No.': 'Page 31', 'Policy_Name': 'Princi...",Scheduled Benefit in force for the Member befo...,1.061697,28,-3.140594
7,"{'Page_No.': 'Page 33', 'Policy_Name': 'Princi...",a . In no event will Dependent Life Insurance ...,1.103117,30,-5.189169
8,"{'Page_No.': 'Page 21', 'Policy_Name': 'Princi...",b . on any date the definition of Member or De...,1.113829,18,-9.256308
9,"{'Page_No.': 'Page 30', 'Policy_Name': 'Princi...","(6) If, on the date a Member becomes eligible ...",1.117644,27,-4.61058


In [82]:
# Return the top 3 results from search

top_3_search1 = results_df.sort_values(by='Distances')
top_3_search1[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 16', 'Policy_Name': 'Princi...",PART II - POLICY ADMINISTRATION Section A - Co...,0.86458,13,-0.058542
1,"{'Page_No.': 'Page 19', 'Policy_Name': 'Princi...",T he Principal has complete discretion to cons...,0.884374,16,2.248036
2,"{'Page_No.': 'Page 17', 'Policy_Name': 'Princi...",a. be actively engaged in business for profit ...,0.979732,14,-9.004519


In [71]:
# Return the top 3 results after reranking

top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
1,"{'Page_No.': 'Page 19', 'Policy_Name': 'Princi...",T he Principal has complete discretion to cons...,0.884374,16,2.248036
0,"{'Page_No.': 'Page 16', 'Policy_Name': 'Princi...",PART II - POLICY ADMINISTRATION Section A - Co...,0.86458,13,-0.058542
5,"{'Page_No.': 'Page 3', 'Policy_Name': 'Princip...",POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,1.051858,1,-0.966353


In [72]:
top_3_INS_q1 = top_3_rerank[["Documents", "Metadatas"]][:3]

In [73]:
top_3_INS_q1

Unnamed: 0,Documents,Metadatas
1,T he Principal has complete discretion to cons...,"{'Page_No.': 'Page 19', 'Policy_Name': 'Princi..."
0,PART II - POLICY ADMINISTRATION Section A - Co...,"{'Page_No.': 'Page 16', 'Policy_Name': 'Princi..."
5,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,"{'Page_No.': 'Page 3', 'Policy_Name': 'Princip..."
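To see how much re-ranking changed the order for the first query, the sketch below recomputes both rankings from the chunk IDs, distances, and cross-encoder scores copied from the `results_df` output above. The `rank_shift` dictionary is an illustrative addition: a positive value means the chunk moved up after re-ranking.

```python
# (chunk ID, distance, cross-encoder score) copied from the results_df output
rows = [
    (13, 0.86458, -0.058542),
    (16, 0.884374, 2.248036),
    (14, 0.979732, -9.004519),
    (39, 1.018573, -2.92555),
    (3, 1.026677, -10.979068),
    (1, 1.051858, -0.966353),
    (28, 1.061697, -3.140594),
    (30, 1.103117, -5.189169),
    (18, 1.113829, -9.256308),
    (27, 1.117644, -4.61058),
]

# Semantic search ranks by ascending distance; re-ranking by descending score.
by_distance = [r[0] for r in sorted(rows, key=lambda r: r[1])]
by_rerank = [r[0] for r in sorted(rows, key=lambda r: r[2], reverse=True)]

# Positions moved per chunk (positive = moved up after re-ranking).
rank_shift = {cid: by_distance.index(cid) - by_rerank.index(cid) for cid in by_distance}

print(by_distance[:3])  # [13, 16, 14]
print(by_rerank[:3])    # [16, 13, 1]
print(rank_shift[14])   # -5
```

Chunk 14 was third by distance but falls to eighth after re-ranking, while chunk 1 rises from sixth to third.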


#### For 2nd Query

In [74]:
results_df2.head()

Unnamed: 0,Metadatas,Documents,Distances,IDs
0,"{'Page_No.': 'Page 29', 'Policy_Name': 'Princi...",Insurance for which Proof of Good Health is re...,1.200996,26
1,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi...",I f a Member's Dependent is employed and is co...,1.259137,24
2,"{'Page_No.': 'Page 21', 'Policy_Name': 'Princi...",b . on any date the definition of Member or De...,1.332726,18
3,"{'Page_No.': 'Page 31', 'Policy_Name': 'Princi...",Scheduled Benefit in force for the Member befo...,1.349086,28
4,"{'Page_No.': 'Page 36', 'Policy_Name': 'Princi...",A Member's insurance under this Group Policy f...,1.359287,33


In [75]:
query_2

"what does it mean by 'the later of the Date of Issue'?"

In [77]:
# Input (query, response) pairs for each of the top 10 responses received from the semantic search to the cross encoder
# Generate the cross_encoder scores for these pairs

cross_inputs2 = [[query_2, response] for response in results_df2['Documents']]
cross_rerank_scores2 = cross_encoder.predict(cross_inputs2)

In [78]:
cross_rerank_scores2

array([ -3.9870455,  -8.9873085,  -6.471612 ,  -6.4148617,  -6.6036673,
        -8.850884 ,  -8.264681 ,  -8.286323 ,  -4.0928993, -10.221123 ],
      dtype=float32)

In [79]:
# Store the rerank_scores in results_df2

results_df2['Reranked_scores'] = cross_rerank_scores2

In [80]:
results_df2

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 29', 'Policy_Name': 'Princi...",Insurance for which Proof of Good Health is re...,1.200996,26,-3.987046
1,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi...",I f a Member's Dependent is employed and is co...,1.259137,24,-8.987309
2,"{'Page_No.': 'Page 21', 'Policy_Name': 'Princi...",b . on any date the definition of Member or De...,1.332726,18,-6.471612
3,"{'Page_No.': 'Page 31', 'Policy_Name': 'Princi...",Scheduled Benefit in force for the Member befo...,1.349086,28,-6.414862
4,"{'Page_No.': 'Page 36', 'Policy_Name': 'Princi...",A Member's insurance under this Group Policy f...,1.359287,33,-6.603667
5,"{'Page_No.': 'Page 34', 'Policy_Name': 'Princi...",provided The Principal has been notified of th...,1.387343,31,-8.850884
6,"{'Page_No.': 'Page 61', 'Policy_Name': 'Princi...",Section D - Claim Procedures Article 1 - Notic...,1.391411,58,-8.264681
7,"{'Page_No.': 'Page 28', 'Policy_Name': 'Princi...",Section B - Effective Dates Article 1 - Member...,1.394463,25,-8.286323
8,"{'Page_No.': 'Page 30', 'Policy_Name': 'Princi...","(6) If, on the date a Member becomes eligible ...",1.409443,27,-4.092899
9,"{'Page_No.': 'Page 24', 'Policy_Name': 'Princi...",T he Principal may terminate the Policyholder'...,1.421555,21,-10.221123


In [81]:
# Return the top 3 results from semantic search

top_3_search2_q2 = results_df2.sort_values(by='Distances')
top_3_search2_q2[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 29', 'Policy_Name': 'Princi...",Insurance for which Proof of Good Health is re...,1.200996,26,-3.987046
1,"{'Page_No.': 'Page 27', 'Policy_Name': 'Princi...",I f a Member's Dependent is employed and is co...,1.259137,24,-8.987309
2,"{'Page_No.': 'Page 21', 'Policy_Name': 'Princi...",b . on any date the definition of Member or De...,1.332726,18,-6.471612


In [83]:
# Return the top 3 results after reranking

top_3_rerank_q2 = results_df2.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_q2[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 29', 'Policy_Name': 'Princi...",Insurance for which Proof of Good Health is re...,1.200996,26,-3.987046
8,"{'Page_No.': 'Page 30', 'Policy_Name': 'Princi...","(6) If, on the date a Member becomes eligible ...",1.409443,27,-4.092899
3,"{'Page_No.': 'Page 31', 'Policy_Name': 'Princi...",Scheduled Benefit in force for the Member befo...,1.349086,28,-6.414862


In [85]:
top_3_INS_q2 = top_3_rerank_q2[["Documents", "Metadatas"]][:3]

In [86]:
top_3_INS_q2

Unnamed: 0,Documents,Metadatas
0,Insurance for which Proof of Good Health is re...,"{'Page_No.': 'Page 29', 'Policy_Name': 'Princi..."
8,"(6) If, on the date a Member becomes eligible ...","{'Page_No.': 'Page 30', 'Policy_Name': 'Princi..."
3,Scheduled Benefit in force for the Member befo...,"{'Page_No.': 'Page 31', 'Policy_Name': 'Princi..."


#### For 3rd Query

In [87]:
query_3

'What happens if a third-party service provider fails to provide the promised goods and services?'

In [88]:
results_df3.head()

Unnamed: 0,Metadatas,Documents,Distances,IDs
0,"{'Page_No.': 'Page 3', 'Policy_Name': 'Princip...",POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,1.06731,1
1,"{'Page_No.': 'Page 24', 'Policy_Name': 'Princi...",T he Principal may terminate the Policyholder'...,1.369205,21
2,"{'Page_No.': 'Page 23', 'Policy_Name': 'Princi...",Section C - Policy Termination Article 1 - Fai...,1.382593,20
3,"{'Page_No.': 'Page 19', 'Policy_Name': 'Princi...",T he Principal has complete discretion to cons...,1.416428,16
4,"{'Page_No.': 'Page 47', 'Policy_Name': 'Princi...","M ember's death, the Death Benefits Payable ma...",1.425472,44


In [89]:
# Input (query, response) pairs for each of the top 10 responses received from the semantic search to the cross encoder
# Generate the cross_encoder scores for these pairs

cross_inputs3 = [[query_3, response] for response in results_df3['Documents']]
cross_rerank_scores3 = cross_encoder.predict(cross_inputs3)

In [90]:
cross_rerank_scores3

array([ -0.472562, -10.96442 ,  -8.44989 , -11.01502 ,  -9.646118,
       -10.951803, -11.003003, -10.978257,  -8.754427, -11.057825],
      dtype=float32)

In [91]:
# Store the rerank_scores in results_df3

results_df3['Reranked_scores'] = cross_rerank_scores3

In [92]:
results_df3

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 3', 'Policy_Name': 'Princip...",POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,1.06731,1,-0.472562
1,"{'Page_No.': 'Page 24', 'Policy_Name': 'Princi...",T he Principal may terminate the Policyholder'...,1.369205,21,-10.96442
2,"{'Page_No.': 'Page 23', 'Policy_Name': 'Princi...",Section C - Policy Termination Article 1 - Fai...,1.382593,20,-8.44989
3,"{'Page_No.': 'Page 19', 'Policy_Name': 'Princi...",T he Principal has complete discretion to cons...,1.416428,16,-11.01502
4,"{'Page_No.': 'Page 47', 'Policy_Name': 'Princi...","M ember's death, the Death Benefits Payable ma...",1.425472,44,-9.646118
5,"{'Page_No.': 'Page 6', 'Policy_Name': 'Princip...",TABLE OF CONTENTS PART I - DEFINITIONS PART II...,1.469127,3,-10.951803
6,"{'Page_No.': 'Page 18', 'Policy_Name': 'Princi...",c . a copy of the form which contains the stat...,1.470377,15,-11.003003
7,"{'Page_No.': 'Page 17', 'Policy_Name': 'Princi...",a. be actively engaged in business for profit ...,1.498452,14,-10.978257
8,"{'Page_No.': 'Page 61', 'Policy_Name': 'Princi...",Section D - Claim Procedures Article 1 - Notic...,1.49882,58,-8.754427
9,"{'Page_No.': 'Page 36', 'Policy_Name': 'Princi...",A Member's insurance under this Group Policy f...,1.500066,33,-11.057825


In [93]:
# Return the top 3 results from search

top_3_search_q3 = results_df3.sort_values(by='Distances')
top_3_search_q3[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 3', 'Policy_Name': 'Princip...",POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,1.06731,1,-0.472562
1,"{'Page_No.': 'Page 24', 'Policy_Name': 'Princi...",T he Principal may terminate the Policyholder'...,1.369205,21,-10.96442
2,"{'Page_No.': 'Page 23', 'Policy_Name': 'Princi...",Section C - Policy Termination Article 1 - Fai...,1.382593,20,-8.44989


In [94]:
# Return the top 3 results after reranking

top_3_rerank_q3 = results_df3.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_q3[:3]

Unnamed: 0,Metadatas,Documents,Distances,IDs,Reranked_scores
0,"{'Page_No.': 'Page 3', 'Policy_Name': 'Princip...",POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,1.06731,1,-0.472562
2,"{'Page_No.': 'Page 23', 'Policy_Name': 'Princi...",Section C - Policy Termination Article 1 - Fai...,1.382593,20,-8.44989
8,"{'Page_No.': 'Page 61', 'Policy_Name': 'Princi...",Section D - Claim Procedures Article 1 - Notic...,1.49882,58,-8.754427


In [95]:
top_3_INS_q3 = top_3_rerank_q3[["Documents", "Metadatas"]][:3]

In [96]:
top_3_INS_q3

Unnamed: 0,Documents,Metadatas
0,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,"{'Page_No.': 'Page 3', 'Policy_Name': 'Princip..."
2,Section C - Policy Termination Article 1 - Fai...,"{'Page_No.': 'Page 23', 'Policy_Name': 'Princi..."
8,Section D - Claim Procedures Article 1 - Notic...,"{'Page_No.': 'Page 61', 'Policy_Name': 'Princi..."


# VI.   Generation Layer System Design

### 1. Design
![Alt text](https://raw.githubusercontent.com/MrVuTuanAnh/HELPMATE_AI/main/HELPMATEAI/H6.png)
