# Employee Pipeline for Enterprise Internal Chatbot: Step 1

This notebook documents the decisions and evaluations we made when choosing chunk sizes, chunk overlaps and top-k values.


## 1. Importing libraries

In [1]:
#pip installing:
%pip install langchain
%pip install langchain_community
%pip install langchain_huggingface
%pip install langchain_pinecone
%pip install python-dotenv
%pip install pymupdf

#Pip installing for milvus:
%pip install -qU  langchain_milvus
#pip installing for chunking:
%pip install --upgrade --quiet langchain-text-splitters tiktoken



import os
import langchain  # note: this import was raising ModuleNotFoundError
import langchain_community
import langchain_huggingface
import langchain_pinecone
import dotenv

# Testing Milvus imports:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_milvus import Milvus
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import CharacterTextSplitter, TokenTextSplitter
from uuid import uuid4

# For timing the retrievals:
import time

# For parsing:
import re




Collecting langchain
  Downloading langchain-0.3.5-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.13 (from langchain)
  Downloading langchain_core-0.3.13-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langchain)
  Downloading langchain_text_splitters-0.3.1-py3-none-any.whl.metadata (2.3 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.137-py3-none-any.whl.metadata (13 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain-core<0.4.0,>=0.3.13->langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting httpx<1,>=0.23.0 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading httpx-0.27.2-py3-none-any.whl.metadata (7.1 kB)
Collecting orjson<4.0.0,>=3.9.14 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading orjson-3.10.10-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (50 kB)

## Documentation/Helpful Links

Using RAG with LangChain and FastAPI:
https://www.youtube.com/watch?v=Arf7UwWjGyc

Building Milvus vector DBs with LangChain:
https://python.langchain.com/docs/integrations/vectorstores/milvus/

Timing Python programs:
https://stackoverflow.com/questions/1557571/how-do-i-get-time-of-a-python-programs-execution
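
The approach from the timing link above boils down to reading a clock before and after the call. A minimal sketch, using `time.perf_counter` rather than the `time.time` calls used in the cells of this notebook; the helper name `timed` is my own and not part of any library:

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this would be a similarity search call.
result, elapsed = timed(sorted, [3, 1, 2])
print(result, f"{elapsed:.6f} s")
```

A single measurement is noisy, so repeating the call a few times and looking at the spread would probably give a fairer comparison between chunk sizes.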


## Step 1: Determining Chunk size, chunk overlap and embedding dimensions:

We use Milvus as the vector DB for testing purposes and will revert to Pinecone for the final product.

## For Chunk Sizes:

First, I read these articles on choosing the right chunk size:

https://www.llamaindex.ai/blog/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5

https://www.mattambrogi.com/posts/chunk-size-matters/

From these I learned that 128 is usually the smallest chunk size in use, so it is one candidate; 256 is, I believe, more common. The drawback of smaller chunks is that the relevant passage may not appear in the top 2 or top 3 results returned by the vector DB, so k would need to be set higher. The benefit, according to the second article, is that these smaller sizes were the most accurate. For larger chunk sizes the concern is speed, since larger chunks are slower to fetch. My general understanding now is that smaller chunk sizes are beneficial. In our dataset of RAG documents, the smallest document is "Employee Termination Policy" in the HR folder, with about 1.5 pages of content (625 words).

For our own evaluation, since we want to optimize for accuracy and relevant chunks without losing out on speed, we will test the top 3 most relevant retrievals across 3 categories:

#### Low (chunk size: 128)
#### Medium (chunk size: 256)
#### High (chunk size: 1024)

I decided to keep the chunk overlap at 10%, following this answer: https://learn.microsoft.com/en-us/answers/questions/1551865/how-do-you-set-document-chunk-length-and-overlap-w
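
Since the overlap is expressed as a fraction of the chunk size, it has to be turned into a whole number before it is handed to a splitter. A small sketch of that conversion for the three sizes above; the helper name `overlap_for` is hypothetical, and rounding down is my choice rather than anything the linked answer prescribes:

```python
def overlap_for(chunk_size: int, fraction: float) -> int:
    """Return the overlap as a whole number of units, rounded down."""
    return int(chunk_size * fraction)

for size in (128, 256, 1024):
    print(size, overlap_for(size, 0.1))
```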

The evaluation metric is a custom one: we ask questions drawn from all 8 documents in the HR folder and check how relevant the top 3 returned chunks are to the query, and which document they come from.
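
The "which document" half of this metric can be tallied mechanically; relevance itself is still judged by reading the chunks. A rough sketch, assuming each retrieved chunk can be mapped to its source file name (for example from the loader's metadata, whose exact key is not shown in this notebook). The function name and the example file lists are hypothetical:

```python
from collections import Counter

def source_breakdown(result_sources, expected_source):
    """Summarise which files the top-k chunks came from.

    result_sources: file names of the retrieved chunks, in rank order.
    expected_source: the file the question was written from.
    """
    counts = Counter(result_sources)
    hits = counts.get(expected_source, 0)
    first_rank = None
    if expected_source in result_sources:
        first_rank = result_sources.index(expected_source) + 1
    return {
        "hit_rate": hits / len(result_sources) if result_sources else 0.0,
        "first_hit_rank": first_rank,
        "by_source": dict(counts),
    }

# Hypothetical top-3 result: only the third chunk is from the expected file.
example = source_breakdown(
    ["Employee-Handbook.pdf", "Employee-Handbook.pdf", "Employee Termination Policy.pdf"],
    "Employee Termination Policy.pdf",
)
print(example)
```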



In [None]:
#Low Chunk Code
embeddings = HuggingFaceEmbeddings()
combined_docs = []
CHUNK_SIZE = 128
CHUNK_OVERLAP = 0.1
URI = "./Low_Chunk_Size.db"

vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
)

docs_to_load = ["Code-of-conduct.pdf", "Compensation-Benefits-Guide.pdf", "Employee Termination Policy.pdf", "Employee-Handbook.pdf", "Onboarding Manual.pdf", "employee-appraisal-form.pdf", "health-and-safety-guidelines.pdf", "remote-work-policy.pdf"]
for doc in docs_to_load:
  loader = PyMuPDFLoader(doc)
  documents = loader.load()  # one Document per PDF page
  # chunk_size and chunk_overlap are counted in characters here; cast the overlap to an int
  text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=int(CHUNK_SIZE*CHUNK_OVERLAP))
  docs = text_splitter.split_documents(documents)
  combined_docs.extend(docs)

vector_store_saved = Milvus.from_documents(
    combined_docs,
    embeddings,
    collection_name="Hr_Low_Chunk_Size",
    connection_args={"uri": URI},
)

vector_store_loaded = Milvus(
    embeddings,
    connection_args={"uri": URI},
    collection_name="Hr_Low_Chunk_Size",
)

uuids = [str(uuid4()) for _ in range(len(combined_docs))]
vector_store.add_documents(documents=combined_docs, ids=uuids)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: eef10422e7ef442eb0ec8887a1e0b4b6
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 096d97bd808149fd88c6a6c9af48e8da
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 443d584b708c4fb58e5ffc818c272b11


['72409066-ee05-4872-83f5-4a9c17235b6a',
 '90dec752-2f6d-41c5-babe-594c92f7d963',
 'd2e663a1-c50a-4188-8ec5-2803434c534d',
 '2f64f22b-49d7-4763-a980-c90054c6dd78',
 'deadacc6-a15b-4589-9d2b-c2308bf675a0',
 '440f57e9-e23c-436b-975c-11ae91578987',
 'a20f207e-f0c8-4d59-8aae-b5cd63de426c',
 '23d64551-f483-42df-b2e3-17522a23f918',
 'f06a4cef-fa69-4d9f-87ab-373566667a40',
 'b8704444-0072-43e9-a046-3174637d70a5',
 'c52cd423-6442-46a7-84d4-f5ca2b6ece9f',
 '98d2c39f-33fb-4343-b68d-94b13be08f29',
 '0350476e-a50a-452b-90fb-ddcfa4175fec',
 'f454582e-f08f-4057-8b37-39eb2303b094',
 '91fc2b49-0b18-43ca-8447-8aa495e1e0b1',
 '96517ce1-6c32-483c-9e79-7b9e724b9035',
 '70660359-41be-44b3-9e4d-9ab3cf4d8ec4',
 '6c3f72df-84a1-41a7-9023-b502c1ba194b',
 '71a78305-c28a-474a-b26a-2fd38d864cda',
 'b7df3de9-8428-4469-beba-f599e7d15561',
 'b72cbf1c-43e8-4274-9c06-0377f2c09133',
 '33fd76db-5567-40c5-a6ea-8a4ff2a00264',
 '2e5a1728-44cb-415e-b8fb-b3cb2af0d827',
 '52327281-94c8-4ca2-b7ec-50492c14ca38',
 '0ab6a567-5f75-

In [None]:
#This is from Code-Of_conduct
start_time = time.time()
results = vector_store.similarity_search(
    "What are the ethical business practices?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF CODE OF CONDUCT QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")
#This is from Compensation-Benefits Guide
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Medical Insurance & Surviving Spouse Coverage",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF COMPENSATIONS-BENEFITS GUIDE QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#This is from Employee Termination policy handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Voluntary Terminations",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE TERMINATION QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#this is from employee handbook, but conflicts with compensation-benefits, so we have to see if it is captured correctly from employee handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Compensation & development?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE HANDBOOK QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

QUERY EXECUTED IN --- 0.1267240047454834 seconds ---
RESULTS OF CODE OF CONDUCT QUERY: 0
CONTENT: Page  4 
[40 whitespace-only lines omitted]
SECTION 1 
 
ETHICAL BUSINESS PRACTICES 
 
• We should conduct our business in accordance with all material applicable laws, rules and 
regulations. 
 
• We should maintain the highest standards of ethical business conduct and integrity by: 
 
 Being fair and honest in all business dealings, including our professional relationships; 
 
 Properly maintaining all information and records, recognizing errors and, when an error is 
confirmed, promptly correcting it; and 
 
 Cooperating fully with all internal and external audits and investigations initiated or 
sanctioned by Comerica. 
 
• We must protect the confidentiality and privacy of confidential customer, shareholder, 
proprietary and third-party information and records. 
 
• We must make business decisions that align with Comerica’s risk appetite

## Low Chunk Size Observations:

For query 1, responses 1 and 2 were highly relevant but included some irrelevant data. Response 3 was very poor. All 3 retrieved chunks were from the correct file.

For query 2, all 3 responses were highly relevant, with responses 1 and 2 the most relevant, and all 3 were retrieved from the correct document.

For query 3, responses 1 and 2 came from the Employee Handbook file but were about employee terminations; response 3 came from the Employee Termination Policy file itself. On inspection, the termination details are stronger in responses 1 and 2, but "voluntary" terminations are covered more in response 3.

For query 4, responses 1 and 2 were irrelevant and from the wrong file ("compensation" was deliberately chosen as a misleading term in this example); response 3 was relevant and from the correct file.

## Reflection:
We need to parse and clean the data before turning it into embeddings (specifically, remove the table-of-contents pages). Beyond that, I think retrieval and chunk size were fine for this one. Anyway, let's explore the other ones.
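
To make the cleaning idea concrete, here is a minimal sketch of the kind of cleanup I have in mind, assuming the noise is mostly "Page N" markers and layout whitespace like in the output above. The function name `clean_page` and the sample string are made up for illustration, and this does not handle table-of-contents pages:

```python
import re

def clean_page(text: str) -> str:
    """Drop 'Page N' markers and collapse layout whitespace in one page."""
    text = re.sub(r"\bPage\s+\d+\b", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text)  # newlines, tabs, repeated spaces -> one space
    return text.strip()

sample = "Page  4 \n \n \n \nSECTION 1 \n \nETHICAL BUSINESS PRACTICES \n"
print(clean_page(sample))
```

Replacing whitespace runs with a single space (rather than deleting newlines outright) keeps words on adjacent lines from being glued together.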

In [None]:
#Medium Chunk Code
embeddings = HuggingFaceEmbeddings()
combined_docs = []
CHUNK_SIZE = 256
CHUNK_OVERLAP = 0.1
URI = "./Medium_Chunk_Size.db"

vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
)

docs_to_load = ["Code-of-conduct.pdf", "Compensation-Benefits-Guide.pdf", "Employee Termination Policy.pdf", "Employee-Handbook.pdf", "Onboarding Manual.pdf", "employee-appraisal-form.pdf", "health-and-safety-guidelines.pdf", "remote-work-policy.pdf"]
for doc in docs_to_load:
  loader = PyMuPDFLoader(doc)
  documents = loader.load()  # one Document per PDF page
  # chunk_size and chunk_overlap are counted in characters here; cast the overlap to an int
  text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=int(CHUNK_SIZE*CHUNK_OVERLAP))
  docs = text_splitter.split_documents(documents)
  combined_docs.extend(docs)

vector_store_saved = Milvus.from_documents(
    combined_docs,
    embeddings,
    collection_name="Hr_Medium_Chunk_Size",
    connection_args={"uri": URI},
)

vector_store_loaded = Milvus(
    embeddings,
    connection_args={"uri": URI},
    collection_name="Hr_Medium_Chunk_Size",
)

uuids = [str(uuid4()) for _ in range(len(combined_docs))]
vector_store.add_documents(documents=combined_docs, ids=uuids)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 708a3c90127e41698ed8b39b53feea35
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: b60b57c60b124c53b6981b786cbd32c2
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: b9cdc80ff2a54131b71e0c0abfbb2170


['1317bc6a-02ea-4750-b606-63d6da016577',
 'b9f30aed-45db-4389-8913-c0b7cfd9b5d2',
 '64e5d3aa-201d-42b3-a1d9-20a6299ce351',
 '4483d73f-8234-40be-8729-e94e82d13515',
 '0438ad34-c88c-47f7-a47e-da809a9ca351',
 'b7c8c1a1-1780-4552-9605-24c14c76cad1',
 'c654dcd0-fadf-4f8a-abe8-7c44c44701dd',
 'ebc23d23-7ed5-4d33-b00f-fc33fe4da3e1',
 '0c75558c-546a-4bad-abdf-9f728ce5c0fb',
 'aed4880e-3cb6-419b-ab7a-83c38afbb49c',
 '520c5448-b107-4058-8ce0-2f8ccfe9082f',
 'feb438af-acb9-4c18-bf45-75ebbbe4ced8',
 'deba72ca-cc62-4bfc-bd91-324e563161a9',
 '81ec5d61-f1b1-46e2-b58d-a9aa7ba4e8b9',
 '02049cc1-6f8a-47f9-bbb3-99d7113bebe3',
 '0e5070bb-688f-4a03-82a1-74e0edc6066d',
 'f7b8a07b-0449-4852-848d-65c2fb6bbdfe',
 '40aba342-c9e8-4088-92b6-04005d8f0610',
 'd9a9ca36-f5d7-403a-b9f9-96bdaa25d0f2',
 'b8df7612-31ad-4bf6-b99d-a32e0d0c38d2',
 '6a79486a-c40b-4a05-b725-0321fc76c474',
 'a4b00bac-4a91-4962-8119-d979ce45f887',
 '0d421c20-5d13-4ab8-bee1-d567528ed757',
 'bd522e92-eba9-4e73-9556-1222fc42b466',
 '4fe2e37c-0da8-

In [None]:
#This is from Code-Of_conduct
start_time = time.time()
results = vector_store.similarity_search(
    "What are the ethical business practices?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF CODE OF CONDUCT QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")
#This is from Compensation-Benefits Guide
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Medical Insurance & Surviving Spouse Coverage",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF COMPENSATIONS-BENEFITS GUIDE QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#This is from Employee Termination policy handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Voluntary Terminations",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE TERMINATION QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#this is from employee handbook, but conflicts with compensation-benefits, so we have to see if it is captured correctly from employee handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Compensation & development?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE HANDBOOK QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

QUERY EXECUTED IN --- 0.09604978561401367 seconds ---
RESULTS OF CODE OF CONDUCT QUERY: 0
CONTENT: Page  4 
[40 whitespace-only lines omitted]
SECTION 1 
 
ETHICAL BUSINESS PRACTICES 
 
• We should conduct our business in accordance with all material applicable laws, rules and 
regulations. 
 
• We should maintain the highest standards of ethical business conduct and integrity by: 
 
 Being fair and honest in all business dealings, including our professional relationships; 
 
 Properly maintaining all information and records, recognizing errors and, when an error is 
confirmed, promptly correcting it; and 
 
 Cooperating fully with all internal and external audits and investigations initiated or 
sanctioned by Comerica. 
 
• We must protect the confidentiality and privacy of confidential customer, shareholder, 
proprietary and third-party information and records. 
 
• We must make business decisions that align with Comerica’s risk appetit

## Medium Chunk Size Observations:

For query 1, the returned chunks are valid: chunk 1 is highly relevant, chunk 2 seems a bit off-topic (it can be cleaned out), and chunk 3 can be used by the LLM to build background.

For query 2, chunks 1 and 2 are highly relevant; chunk 3 seems to be junk (it can be cleaned out).

For query 3, all 3 chunks seem highly relevant and come from diverse sources, so this is fine. I am just a bit concerned about the length of the returned responses.

For query 4, chunk 1 seems irrelevant (it can be cleaned out); chunks 2 and 3 are from the correct source and very highly relevant.



## Reflections:
I think this size would be fine, as relevant chunks are being extracted; the only important step is filtering out the table of contents and other irrelevant parts.


In [None]:
#High Chunk Code
embeddings = HuggingFaceEmbeddings()
combined_docs = []
CHUNK_SIZE = 1024
CHUNK_OVERLAP = 0.1
URI = "./High_Chunk_Size.db"

vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
)

docs_to_load = ["Code-of-conduct.pdf", "Compensation-Benefits-Guide.pdf", "Employee Termination Policy.pdf", "Employee-Handbook.pdf", "Onboarding Manual.pdf", "employee-appraisal-form.pdf", "health-and-safety-guidelines.pdf", "remote-work-policy.pdf"]
for doc in docs_to_load:
  loader = PyMuPDFLoader(doc)
  documents = loader.load()  # one Document per PDF page
  # chunk_size and chunk_overlap are counted in characters here; cast the overlap to an int
  text_splitter = CharacterTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=int(CHUNK_SIZE*CHUNK_OVERLAP))
  docs = text_splitter.split_documents(documents)
  combined_docs.extend(docs)

vector_store_saved = Milvus.from_documents(
    combined_docs,
    embeddings,
    collection_name="Hr_High_Chunk_Size",
    connection_args={"uri": URI},
)

vector_store_loaded = Milvus(
    embeddings,
    connection_args={"uri": URI},
    collection_name="Hr_High_Chunk_Size",
)

uuids = [str(uuid4()) for _ in range(len(combined_docs))]
vector_store.add_documents(documents=combined_docs, ids=uuids)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: f00a3ff1d5e34393a9bc03ab5f395836
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: d09f562e688046dd98d6cbe1b1be9e02
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: bf00051d034846e5ac384049f4680259


['9374dea8-4493-459c-8688-8c2609422a38',
 'f3e259e0-f3c1-496d-a351-25dd3ec239b9',
 '80bda9ce-6bf1-4109-a5ba-ee3dcb7cae6f',
 'e353e52c-11fe-495b-b8db-633b938ccf6b',
 '5df1cf28-8b0c-4658-a894-a924492470c2',
 '4db2b608-c34e-496d-bcc2-382f0cacc5b1',
 'f0bcb94a-1beb-4930-b49a-9538bed27a06',
 'd06ff55d-bbb8-4d9a-9324-2e9957a814f0',
 '95460b99-d425-4239-98d7-1d2c4aa774fd',
 '10edb731-c4e6-49e7-8b87-929e4b959927',
 '4a80dd7a-7978-44d2-8dc5-0f800d8ffb22',
 'c6bdd2df-f494-4e01-b787-3c59e453df5c',
 '967bdb64-1cea-4d5b-87c2-888ddf508185',
 'cc07de54-7b46-4ea2-b49c-b913dea59563',
 'b93c0677-cefe-4313-aa33-aa4c6c5daba8',
 '0230388d-1eb5-4748-a419-4041614c255e',
 '07b0a688-4722-4295-a3f4-59a8b8ab3188',
 'a307e4c3-0c7b-4513-b9a5-4ccb1218a606',
 '3449d900-5d32-48be-b425-16366766ebd8',
 'ea14adaf-c4c0-4314-8c80-8dd374dadf0b',
 '1ea50c3d-e116-420a-820a-6a17ce228117',
 '2ad0e2ab-4d20-4f05-9b02-f70431b9cc0d',
 '02309de3-d205-4778-ad2c-fa01dd5f2c4e',
 '117ac5ee-be61-49bf-9e0f-da56607966b7',
 'b7490622-b2af-

In [None]:
#This is from Code-Of_conduct
start_time = time.time()
results = vector_store.similarity_search(
    "What are the ethical business practices?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF CODE OF CONDUCT QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")
#This is from Compensation-Benefits Guide
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Medical Insurance & Surviving Spouse Coverage",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF COMPENSATIONS-BENEFITS GUIDE QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#This is from Employee Termination policy handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Voluntary Terminations",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE TERMINATION QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#this is from employee handbook, but conflicts with compensation-benefits, so we have to see if it is captured correctly from employee handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Compensation & development?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE HANDBOOK QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

QUERY EXECUTED IN --- 0.09062647819519043 seconds ---
RESULTS OF CODE OF CONDUCT QUERY: 0
CONTENT: Page  4 
[40 whitespace-only lines omitted]
SECTION 1 
 
ETHICAL BUSINESS PRACTICES 
 
• We should conduct our business in accordance with all material applicable laws, rules and 
regulations. 
 
• We should maintain the highest standards of ethical business conduct and integrity by: 
 
 Being fair and honest in all business dealings, including our professional relationships; 
 
 Properly maintaining all information and records, recognizing errors and, when an error is 
confirmed, promptly correcting it; and 
 
 Cooperating fully with all internal and external audits and investigations initiated or 
sanctioned by Comerica. 
 
• We must protect the confidentiality and privacy of confidential customer, shareholder, 
proprietary and third-party information and records. 
 
• We must make business decisions that align with Comerica’s risk appetit

## High Chunk Size Observations:
For query 1, we can now see that it is capturing entire pages. I drew two insights from this. First, my theory is that the loader parses the PDF page by page and the chunks are then made within each page: response 2 is too short to be a 1024-character chunk, and after some data analysis I realized that its entire content comes from a single page. Second, to make chunking and retrieval better, I need to parse the pages, concatenate their contents and remove any title or table-of-contents pages, so that I have contiguous text to make chunks from.
So I will have to get the parsing right.

However, the retrievals are mostly accurate. The only problem is that too much information is returned, which brings in some irrelevant text; currently, it is effectively returning entire pages.
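
The page-boundary effect described above can be reproduced with plain strings. This is only a sketch under simplified assumptions: it cuts on fixed character counts, whereas the real splitter also looks for separators, and the three "pages" are invented:

```python
def chunk_chars(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Three hypothetical "pages" of 300, 50 and 700 characters.
pages = ["a" * 300, "b" * 50, "c" * 700]

# Splitting each page on its own: no chunk can be longer than its page.
per_page = [piece for page in pages for piece in chunk_chars(page, 1024)]

# Concatenating first gives contiguous text that can fill a whole chunk.
joined = chunk_chars(" ".join(pages), 1024)

print([len(c) for c in per_page])
print([len(c) for c in joined])
```

With per-page splitting, a large chunk size simply returns whole pages, which matches what the high chunk size results looked like.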



## Reflections:


## Final Decisions:

I will use a chunk size of 256 tokens, as I believe it returned a sufficient amount of relevant text; and given that our shortest document has 615 words, a chunk size of 256 works. I now have to focus on improving how new PDF files are parsed and concatenated.

# For chunk overlap:

This article was helpful:
https://medium.com/@kadamsay06/chunking-strategies-for-rag-simplifying-complex-data-retrieval-1facc04f8303

It explains what chunk overlap is: "refers to the practice of having some parts of text appear in multiple chunks" and why it is important: "Overlapping ensures that important context is not lost when chunks are processed separately. It helps in maintaining the flow of information and makes it easier to understand the text when chunks are retrieved independently."

It also illustrates this with code snippets such as:

```
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    chunk_size=1000,
    separator="\n\n",
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False
)

texts = text_splitter.create_documents(texts)
```
This shows that the character splitter actually has more tunable parameters, such as separators.

It also gave ideas for other types of chunkers, such as document chunkers and semantic chunkers, which we hope to use when evaluating the final product.


Then, to evaluate the advantages of chunk overlap, I went to this thread:
https://www.reddit.com/r/LangChain/comments/1bjxvov/what_is_the_advantage_of_overlapping_in_chunking/

Here, user HaroldYardley wrote 7 months ago:
"It's something that's easier to visualize if you use longer chunks, for example:

...| chunk 1 | chunk 3 |...

.......... | chunk 2 | ...

if you're trying to get an answer for a query using a specific line of text and have minimal/no overlap, the answer could be split between 2 chunks and not be retrievable. However if you chunk with significant overlap you won't "lose" information due to splitting. "
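
The point in the quote above can be checked with a toy sliding window. This is a minimal sketch, assuming whitespace-separated words stand in for tokens; the sentence, the phrase and both function names are invented for illustration:

```python
def window_chunks(tokens, size, overlap):
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be at least 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks

def contains_phrase(chunk, phrase):
    """True if `phrase` (a list of tokens) appears contiguously in `chunk`."""
    n = len(phrase)
    return any(chunk[i:i + n] == phrase for i in range(len(chunk) - n + 1))

tokens = "notice must be given two weeks before the last working day".split()
phrase = ["two", "weeks", "before"]

no_overlap = window_chunks(tokens, size=5, overlap=0)
with_overlap = window_chunks(tokens, size=5, overlap=2)

print(any(contains_phrase(c, phrase) for c in no_overlap))
print(any(contains_phrase(c, phrase) for c in with_overlap))
```

Without overlap the phrase straddles the boundary between the first two chunks, so no single chunk contains it; with an overlap of 2 the second window picks it up whole.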

For the exact amount of chunk overlap, I used this article to get an idea of the norm: https://www.mongodb.com/developer/products/atlas/choosing-chunking-strategy-rag/

There I found: "For chunking strategies with token overlap, we set the overlap to 15% of the chunk size. While we have kept the overlap percentage constant here, you can experiment with different values if needed. A chunk overlap between 5% and 20% of the chunk size is recommended for most datasets."

So here is how I devised my 3 categories to test:

#### Low (10% chunk overlap)
#### Medium (20% chunk overlap)
#### High (50% chunk overlap)
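
To get a feel for what these three settings cost, here is a rough sketch that converts each percentage into tokens for a chunk size of 256 and estimates how many chunks a sliding window would produce. The 10,000-token corpus is a made-up figure, and `chunk_count` is my own helper, not a library function:

```python
import math

CHUNK_SIZE = 256  # tokens

def chunk_count(total_tokens: int, size: int, overlap: int) -> int:
    """Number of sliding windows needed to cover `total_tokens` tokens."""
    if total_tokens <= size:
        return 1
    step = size - overlap
    return 1 + math.ceil((total_tokens - size) / step)

for label, fraction in (("low", 0.10), ("medium", 0.20), ("high", 0.50)):
    overlap = int(CHUNK_SIZE * fraction)
    print(label, overlap, chunk_count(10_000, CHUNK_SIZE, overlap))
```

The 50% setting advances the window by only half a chunk each step, so it roughly doubles the number of chunks to embed and store compared with no overlap.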

Plus, I will also try to reconstruct and improve the parser.





# Parsing text:

I realized I would need some other way to parse and chunk the text, so I turned to the documentation:

https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/

Before going to the documentation, I was using the character splitter given in the assignment. Then I really pondered: what exactly is a chunk, and what is chunk size? I realized that it "should" be the number of tokens, and that the documents being separated by page is a major issue: I need to combine the contents of all the pages. Why was the assignment given with the character splitter? I don't know. Anyway, to figure out how to split by tokens, I followed the documentation above and saw "token splitter", which led me to:

https://python.langchain.com/v0.1/docs/modules/data_connection/document_transformers/split_by_token/

The above documentation suggested splitting by tokens. After a couple of failed attempts with:

```
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding="cl100k_base", chunk_size=100, chunk_overlap=0
)
texts = text_splitter.split_text(state_of_the_union)
```
I finally got this to work:

```
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])
```
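
To see why the unit of `chunk_size` matters, here is a crude sketch comparing a limit of 256 characters with a limit of 256 "tokens". It uses whitespace-separated words as a stand-in for tokens, which is an assumption: a real tokenizer would give different counts. The repeated-word document is invented:

```python
text = " ".join(["policy"] * 1000)  # hypothetical 1000-word document

char_chunk = text[:256]                     # chunk size counted in characters
word_chunk = " ".join(text.split()[:256])   # chunk size counted in word "tokens"

print(len(char_chunk), "characters in the character-based chunk")
print(len(word_chunk), "characters in the word-based chunk")
```

The same number, 256, covers several times more text when it counts tokens than when it counts characters, which is why the earlier character-based results are not directly comparable to token-based ones.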







In [None]:
#Parsing data analysis
combined_text = ""
CHUNK_SIZE = 256
CHUNK_OVERLAP = 0.1


docs_to_load = ["Code-of-conduct.pdf"] #, "Compensation-Benefits-Guide.pdf", "Employee Termination Policy.pdf", "Employee-Handbook.pdf", "Onboarding Manual.pdf", "employee-appraisal-form.pdf", "health-and-safety-guidelines.pdf", "remote-work-policy.pdf"]
for doc in docs_to_load:
  loader = PyMuPDFLoader(doc)
  documents = loader.load()
  # print(documents)
  for page in documents:
    combined_text += page.page_content
  # print(combined_text)
text_splitter = TokenTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=int(CHUNK_SIZE*CHUNK_OVERLAP))
texts = text_splitter.split_text(combined_text)

# for idx, text in enumerate(texts):
#  print("chunk: ", idx+1)
#  print(text)

docs = text_splitter.create_documents(texts)
print(docs)




[Document(metadata={}, page_content=' \n \n \n \n \n \n \n \n \n \n \nCODE OF BUSINESS \nCONDUCT AND ETHICS \nFOR EMPLOYEES \n \n \n \nBusiness Practices for \nEthical Employee \nConduct \n(Effective July 25, 2023) \n \n \n \n \n \n \n \n \n \n \n \nPage  2 \n \n \n \n \n \nTable of Contents \n \n \n \nLetter from Chairman Curtis C. Farmer ………………………………………… 3 \n \nSection 1: Ethical Business Practices and Business Conduct……………………. 4 \n \nSection 2: Responsibilities………………………………………………………..20 \n \nSection 3: Getting Help…………………………………………………………...22 \nPage  3 \n \n \n \n \n \nDear Colleagues: \n \nOur Code of Business Conduct and Ethics for Employees is the most important document at \nComerica.  It is the foundation on which all our business practices at Comerica are constructed and, \nfor that reason, I consider'), Document(metadata={}, page_content=' is the foundation on which all our business practices at Comerica are constructed and, \nfor that reason, I consider it a critical one for e

In [None]:
#parsing correct word count verification check:
combined_text = ""
CHUNK_SIZE = 256
CHUNK_OVERLAP = 0
URI = "./Low_Chunk_Overlap.db"

docs_to_load = ["Code-of-conduct.pdf"] #, "Compensation-Benefits-Guide.pdf", "Employee Termination Policy.pdf", "Employee-Handbook.pdf", "Onboarding Manual.pdf", "employee-appraisal-form.pdf", "health-and-safety-guidelines.pdf", "remote-work-policy.pdf"]
for doc in docs_to_load:
  loader = PyMuPDFLoader(doc)
  documents = loader.load()
  # print(documents)
  for page in documents:
    text = page.page_content
    if "contents" in text.lower():
      continue
    text = re.sub(r'\bPage\s+\d+\b', '', text, flags=re.IGNORECASE) #removing the page numbers
    text = re.sub(r'\n', '', text).strip() #removing all newlines
    # print(text)
    text = re.sub(r'[^\w\s.,?!:;\'\"()&-]', '', text)
    combined_text += text + " "
combined_text = combined_text.strip()
# print(combined_text)
text_splitter = TokenTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=int(CHUNK_SIZE*CHUNK_OVERLAP))
texts = text_splitter.split_text(combined_text)
docs = text_splitter.create_documents(texts)

print(len(combined_text.split(" ")))
print(len(docs))


8140
39
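
The page-cleaning steps in the cell above are repeated in every experiment below, so they could be factored into one helper. The following is a minimal sketch, not the notebook's actual code: `clean_page_text` and `normalize_quotes` are hypothetical names, and the quote normalization is an addition the original cell does not perform. It is included because the character filter drops typographic apostrophes, which is presumably why later outputs read "Comericas" rather than "Comerica's". Note also that, as in the original, any page containing the word "contents" anywhere is skipped, not only table-of-contents pages.

```python
import re

# Typographic quotes mapped to ASCII so the character filter below keeps them.
# This normalization is an addition; the original cell does not do it.
_QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def clean_page_text(text, normalize_quotes=True):
    """Return the cleaned page text, or None for pages that mention 'contents'."""
    if "contents" in text.lower():
        return None  # same skip rule as the cell above
    if normalize_quotes:
        text = text.translate(_QUOTE_MAP)
    # Remove page markers such as "Page 3"
    text = re.sub(r"\bPage\s+\d+\b", "", text, flags=re.IGNORECASE)
    # Remove all newlines, then trim leading/trailing whitespace
    text = re.sub(r"\n", "", text).strip()
    # Keep only word characters, whitespace and basic punctuation
    text = re.sub(r"[^\w\s.,?!:;'\"()&-]", "", text)
    return text
```

Inside the loading loop this would be used as `text = clean_page_text(page.page_content)`, skipping the page when the result is `None`.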


Building the Milvus vector store and running the retrieval experiments:

In [3]:
#Low Chunk Overlap Code
embeddings = HuggingFaceEmbeddings()
combined_text = ""
CHUNK_SIZE = 256
CHUNK_OVERLAP = 0.1
# Had to change the DB and collection names: with the same names, new chunks seemed to be
# appended to the old ones, so stale chunks were still returned even after the improved
# parsing (will look into deleting the Milvus vector DB/its contents in a bit).
URI = "./Low_Chunk_Overlap3.db"

vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
)

docs_to_load = ["Code-of-conduct.pdf", "Compensation-Benefits-Guide.pdf", "Employee Termination Policy.pdf", "Employee-Handbook.pdf", "Onboarding Manual.pdf", "employee-appraisal-form.pdf", "health-and-safety-guidelines.pdf", "remote-work-policy.pdf"]
for doc in docs_to_load:
  loader = PyMuPDFLoader(doc)
  documents = loader.load()
  # print(documents)
  for page in documents:
    text = page.page_content
    if "contents" in text.lower():
      continue
    text = re.sub(r'\bPage\s+\d+\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\n', '', text).strip() #removing all newlines
    # print(text)
    text = re.sub(r'[^\w\s.,?!:;\'\"()&-]', '', text)
    combined_text += text + " "
combined_text = combined_text.strip()
# print(combined_text)
text_splitter = TokenTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=int(CHUNK_SIZE*CHUNK_OVERLAP))
texts = text_splitter.split_text(combined_text)
docs = text_splitter.create_documents(texts)
# print(len(combined_text.split(" ")))
# print(len(docs))

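# Embed the chunks into a named collection. Note that this is separate from `vector_store`
# above, which was created without a collection_name and so uses the default collection.
vector_store_saved = Milvus.from_documents(
    docs,
    embeddings,
    collection_name="Hr_Low_Chunk_Overlap3",
    connection_args={"uri": URI},
)

# Re-open the named collection (not used by the queries below).
vector_store_loaded = Milvus(
    embeddings,
    connection_args={"uri": URI},
    collection_name="Hr_Low_Chunk_Overlap3",
)

# Add the same chunks to `vector_store`, which is the store the queries below search.
# The chunks are therefore embedded twice in this cell.
uuids = [str(uuid4()) for _ in range(len(docs))]
vector_store.add_documents(documents=docs, ids=uuids)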

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: cb824109e162423b858505dab8a0bf7a
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: e418bb9c66e344c29f749fcfe0994cb3
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 1e0879b0e7904d718053440baabf0bde


['0c53b4ad-8965-47e8-9080-00653acb2d91',
 'bf4e2e1b-caa0-4986-99c0-e2075670c778',
 '93509659-fd77-49ac-ad5d-89c66e971ccf',
 '51e2eae6-3e5c-4c27-aedd-1f92c216b92a',
 '08b9a025-c3f0-4b2b-b27b-c356d477f0ff',
 '8bf825df-1490-484f-a52e-0ab554a51f6f',
 'fcd0d710-f580-4a34-b1b8-4628794b6140',
 '388ee128-1c04-4ad4-811d-385b12a7b0c8',
 '1f19653b-c8d9-465d-b94e-bed2dc407877',
 'e568f979-52cf-4d21-b006-c55c65429d2f',
 '24250269-2f1f-4007-836d-cbd075b8bb69',
 'f7c0a0a6-e31d-4851-a843-d6a0d0e6afc0',
 '1d74eddd-0b0c-4486-9b68-0a3947c27a14',
 '831a96dd-aaef-40e4-afe2-6cbcaa511c21',
 '57fba66c-15f1-43b9-98e8-36e541a90a57',
 'dd710314-1236-4ac3-8287-5af4a3863944',
 'ec269842-07dd-485a-b63b-0f7f95408313',
 '28d06148-de8c-43a2-9e79-fd821e9f5c25',
 'aa6703a6-cd44-4fd2-9d8f-b0597421413b',
 '9de71654-fa40-4d80-98e0-f1272e57fe2e',
 '1f8c9843-6601-4fb0-aba3-b3b1fd9caa92',
 '8b1a78d0-c495-46e0-8f32-d26651a80f08',
 'd95ec413-748d-477d-b349-aa4efbdee992',
 'a446f5a6-a553-4069-b7e1-f3a89d53c60b',
 'd4ac8951-45dd-

In [4]:
#This is from Code-Of_conduct
start_time = time.time()
results = vector_store.similarity_search(
    "What are the ethical business practices?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF CODE OF CONDUCT QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")
#This is from Compensation-Benefits Guide
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Medical Insurance & Surviving Spouse Coverage",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF COMPENSATIONS-BENEFITS GUIDE QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#This is from Employee Termination policy handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Voluntary Terminations",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE TERMINATION QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#this is from employee handbook, but conflicts with compensation-benefits, so we have to see if it is captured correctly from employee handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Compensation & development?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE HANDBOOK QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

QUERY EXECUTED IN --- 0.16851353645324707 seconds ---
RESULTS OF CODE OF CONDUCT QUERY: 0
CONTENT:  In the final analysis, at Comerica each of us is personally accountable for reading and understanding the Code of Business Conduct and Ethics for Employees, thinking about the principles on which it is constructed, and then incorporating those principles into our life.  If you have questions about the Code of Business Conduct and Ethics for Employees or any ethical issue you may face, please contact your manager, or the Corporate Legal, Human Resources or Audit Departments for assistance.  Alternatively, you may report ethics-related matters confidentially through one of Comericas hotlines, as described in more detail in the Code of Business Conduct and Ethics for Employees. Thank you.       Curtis C. Farmer  Chairman, President and Chief Executive Officer SECTION 1  ETHICAL BUSINESS PRACTICES   We should conduct our business in accordance with all material applicable laws, rules and reg

## Low chunk overlap observations:

The returned responses are much better now that the parsing step is in place: they are relevant, with little to no noise or irrelevant data. I would also like to test this against a 1024-token chunk size to see how much of a difference it makes.
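
The four query blocks above repeat the same timing and printing logic, and the same blocks are reused for every configuration. A helper along the following lines could replace them. This is only a sketch: `run_queries` and `QUERIES` are hypothetical names, and the only assumption about `store` is that it exposes `similarity_search(query, k=...)` as the `vector_store` objects in this notebook do.

```python
import time


def run_queries(store, labelled_queries, k=3):
    """Run each (label, query) pair against `store` and print the results.

    Returns a dict mapping each label to the elapsed time in seconds.
    """
    timings = {}
    for label, query in labelled_queries:
        start_time = time.perf_counter()
        results = store.similarity_search(query, k=k)
        timings[label] = time.perf_counter() - start_time
        print(f"QUERY EXECUTED IN --- {timings[label]} seconds ---")
        for idx, res in enumerate(results):
            print(f"RESULTS OF {label} QUERY: {idx}")
            print(f"CONTENT: {res.page_content}")
            print(f"METADATA: {res.metadata}")
            print("== next response ==")
        print("=" * 38 + "NEXT QUERY" + "=" * 46)
    return timings


# The four test queries used throughout this notebook, labelled by source document.
QUERIES = [
    ("CODE OF CONDUCT", "What are the ethical business practices?"),
    ("COMPENSATIONS-BENEFITS GUIDE", "tell me about Medical Insurance & Surviving Spouse Coverage"),
    ("EMPLOYEE TERMINATION", "tell me about Voluntary Terminations"),
    ("EMPLOYEE HANDBOOK", "tell me about Compensation & development?"),
]
```

Each experiment would then reduce to `timings = run_queries(vector_store, QUERIES, k=3)`, which also keeps the per-query timings available for comparison across configurations.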

In [5]:
#Low Chunk Overlap with High chunk size Code
embeddings = HuggingFaceEmbeddings()
combined_text = ""
CHUNK_SIZE = 1024
CHUNK_OVERLAP = 0.1
# Had to change the DB and collection names: with the same names, new chunks seemed to be
# appended to the old ones, so stale chunks were still returned even after the improved
# parsing (will look into deleting the Milvus vector DB/its contents in a bit).
URI = "./Low_Chunk_Overlap_high_chunk_size.db"

vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
)

docs_to_load = ["Code-of-conduct.pdf", "Compensation-Benefits-Guide.pdf", "Employee Termination Policy.pdf", "Employee-Handbook.pdf", "Onboarding Manual.pdf", "employee-appraisal-form.pdf", "health-and-safety-guidelines.pdf", "remote-work-policy.pdf"]
for doc in docs_to_load:
  loader = PyMuPDFLoader(doc)
  documents = loader.load()
  # print(documents)
  for page in documents:
    text = page.page_content
    if "contents" in text.lower():
      continue
    text = re.sub(r'\bPage\s+\d+\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\n', '', text).strip() #removing all newlines
    # print(text)
    text = re.sub(r'[^\w\s.,?!:;\'\"()&-]', '', text)
    combined_text += text + " "
combined_text = combined_text.strip()
# print(combined_text)
text_splitter = TokenTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=int(CHUNK_SIZE*CHUNK_OVERLAP))
texts = text_splitter.split_text(combined_text)
docs = text_splitter.create_documents(texts)
# print(len(combined_text.split(" ")))
# print(len(docs))

vector_store_saved = Milvus.from_documents(
    docs,
    embeddings,
    collection_name="Hr_Low_Chunk_Overlap_high_chunk_size",
    connection_args={"uri": URI},
)

vector_store_loaded = Milvus(
    embeddings,
    connection_args={"uri": URI},
    collection_name="Hr_Low_Chunk_Overlap_high_chunk_size",
)

uuids = [str(uuid4()) for _ in range(len(docs))]
vector_store.add_documents(documents=docs, ids=uuids)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: edf3790a078641b6b5c001d4fca401d5
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: cbfe69c652d94724afb188e655341ea4
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 3b688f7db1aa4fa18de92d43c2b32e22


['0f5f41ce-27dc-4513-ba47-2c8878fa0483',
 '9d7c1fd8-a2f0-44d9-9d9f-6fc3103e66f7',
 'acf2c24c-cf29-4c25-9ba7-9025b904dbca',
 'a6bbf4aa-2a94-4b8e-8e5d-5b9ca2da418d',
 'fb6babef-1751-4583-9535-21a3f0b80d16',
 'b71b02e2-dad0-41cd-9e90-b19b10a13667',
 '45d29e72-92b9-41d9-9190-70db56288a75',
 '6a9be96f-1a47-425e-b9fd-b09127b17d06',
 'ed6a630b-2e68-42a7-8737-4cb994ed43a8',
 '19a516d4-0e17-4063-b78a-d28af5fb562b',
 '4827a591-628c-4111-81e7-9467abc8e3d3',
 '84a8061b-655c-4208-835c-5b1b1805fea9',
 '65863422-90b8-4433-9a8a-bd914cb8cbed',
 'ee0da272-8bfd-4a51-b585-e0a0a04607f6',
 'c864a398-0567-4160-89cf-ac9dd8eb3583',
 '304474af-7656-4c96-b230-a71f4a36ec91',
 'e6e0eae2-9173-42bb-b063-8b47b4b540f2',
 '50862161-0380-4906-a71b-6c9ba17ca752',
 '377eb9d0-8a2f-48b4-8426-d509089b3fd3',
 'cd94e823-1750-4aa4-8118-755e1592342f',
 '32de77b9-2cb0-4f2d-b346-70e640adc95f',
 'e9dec1c5-71ce-4d3a-80b7-989e6015b5c7',
 'a8f24b90-abcd-4137-977f-4b0cd73e44c7',
 'a4e14996-5503-4051-b6c4-65ca7f10a2f0',
 'b8de4212-bf47-

In [6]:
#This is from Code-Of_conduct
start_time = time.time()
results = vector_store.similarity_search(
    "What are the ethical business practices?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF CODE OF CONDUCT QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")
#This is from Compensation-Benefits Guide
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Medical Insurance & Surviving Spouse Coverage",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF COMPENSATIONS-BENEFITS GUIDE QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#This is from Employee Termination policy handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Voluntary Terminations",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE TERMINATION QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#this is from employee handbook, but conflicts with compensation-benefits, so we have to see if it is captured correctly from employee handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Compensation & development?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE HANDBOOK QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

QUERY EXECUTED IN --- 0.22192931175231934 seconds ---
RESULTS OF CODE OF CONDUCT QUERY: 0
CONTENT: CODE OF BUSINESS CONDUCT AND ETHICS FOR EMPLOYEES    Business Practices for Ethical Employee Conduct (Effective July 25, 2023) Dear Colleagues:  Our Code of Business Conduct and Ethics for Employees is the most important document at Comerica.  It is the foundation on which all our business practices at Comerica are constructed and, for that reason, I consider it a critical one for each of us to read and understand.  Our Code of Business Conduct and Ethics for Employees is a values-based document, rather than compliance-based, which means it goes beyond a simple listing of right and wrong.  As you read through, you will see that the Code of Business Conduct and Ethics for Employees explains in detail the ethical business practices and conduct that must govern our life here at Comerica.  We are one of the leading financial institutions in the United States today.  There are many, many reaso

## Low chunk overlap with high chunk size observations:

The most apparent differences are in the size of the returned chunks and in the response times: the first query returned in 0.22 seconds with the high chunk size and in about 0.17 seconds with the low chunk size, while the remaining queries average around 0.11 seconds. (Note to self: figure out why only the first query is slow while the rest are fast, possibly something to do with caching, and write it up here.)

Anyway, with the high chunk size I do observe some irrelevant information in the results. Large chunks would also be harder for an LLM to process and may lead to hallucination (especially if we use a smaller, publicly accessible LLM from Hugging Face). Hence, I think we will stick with the 256-token chunk size and k=3.
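
On the timing question raised above: one possible explanation (an assumption, not verified here) is a one-time setup cost on the first call, such as loading the collection or warming up the embedding model. A way to separate that cost from steady-state latency is to run a warm-up call first and then report the median over several repeats. The sketch below uses only the standard library; `time_call` is a hypothetical helper name.

```python
import statistics
import time


def time_call(fn, *args, warmup=1, repeats=5, **kwargs):
    """Time `fn(*args, **kwargs)`, discarding `warmup` initial calls.

    Returns (median_seconds, samples), where samples holds one
    measurement per timed repeat.
    """
    for _ in range(warmup):
        fn(*args, **kwargs)  # not timed: absorbs any one-time setup cost
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples
```

For example, `time_call(vector_store.similarity_search, "What are the ethical business practices?", k=3)` would give a latency figure that is comparable across chunk sizes without the first-call penalty.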

In [7]:
#Medium Chunk Overlap Code
embeddings = HuggingFaceEmbeddings()
combined_text = ""
CHUNK_SIZE = 256
CHUNK_OVERLAP = 0.20
# Had to change the DB and collection names: with the same names, new chunks seemed to be
# appended to the old ones, so stale chunks were still returned even after the improved
# parsing (will look into deleting the Milvus vector DB/its contents in a bit).
URI = "./Medium_Chunk_Overlap.db"

vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
)

docs_to_load = ["Code-of-conduct.pdf", "Compensation-Benefits-Guide.pdf", "Employee Termination Policy.pdf", "Employee-Handbook.pdf", "Onboarding Manual.pdf", "employee-appraisal-form.pdf", "health-and-safety-guidelines.pdf", "remote-work-policy.pdf"]
for doc in docs_to_load:
  loader = PyMuPDFLoader(doc)
  documents = loader.load()
  # print(documents)
  for page in documents:
    text = page.page_content
    if "contents" in text.lower():
      continue
    text = re.sub(r'\bPage\s+\d+\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\n', '', text).strip() #removing all newlines
    # print(text)
    text = re.sub(r'[^\w\s.,?!:;\'\"()&-]', '', text)
    combined_text += text + " "
combined_text = combined_text.strip()
# print(combined_text)
text_splitter = TokenTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=int(CHUNK_SIZE*CHUNK_OVERLAP))
texts = text_splitter.split_text(combined_text)
docs = text_splitter.create_documents(texts)
# print(len(combined_text.split(" ")))
# print(len(docs))

vector_store_saved = Milvus.from_documents(
    docs,
    embeddings,
    collection_name="Hr_Medium_Chunk_Overlap",
    connection_args={"uri": URI},
)

vector_store_loaded = Milvus(
    embeddings,
    connection_args={"uri": URI},
    collection_name="Hr_Medium_Chunk_Overlap",
)

uuids = [str(uuid4()) for _ in range(len(docs))]
vector_store.add_documents(documents=docs, ids=uuids)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 2fecbb3551cd42d998ea7aed543c9a52
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: f882be553f694720854e3959de024eca
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: cc2fe6f350e14c5f8e8b0222284923b2


['e8a3cfec-1dcf-4087-ba06-3553cd10d8d1',
 '18ed4127-4645-49ae-9111-5ab5c132c6d6',
 '1739da0c-1b91-488c-b0df-d6f9cdae1364',
 'b9ce645e-eec2-419b-9593-c52f98b3a528',
 '3be9b4e5-4423-43a0-bac2-0d31cbfdbcf3',
 '2116c710-ee6c-43cf-91cd-688eb7a68501',
 'fe862387-f667-4034-9fb5-2643a474680f',
 '7d000139-2247-4167-92ee-4a4bbd7860cd',
 '08856e72-b23b-460a-a9c0-0e9701c4c20e',
 '5120c360-6776-4877-9b56-8e9c39f12023',
 'adedb3e6-8573-4c3e-a76a-a68ecab8a758',
 'c734be9c-254d-4e11-a8a1-273d698a7108',
 '13665b4f-c58d-49a3-b608-8e10de1b15b5',
 'fcd3783b-935c-474c-add0-96db072b8f25',
 '9dd23caf-99e8-46e7-9f3c-bf75ea64d6be',
 '2f3fbfcf-e120-4657-a473-1fda1c9901b2',
 'f72ba4d8-fa9c-45b4-938a-b56d36d3d668',
 'dbda64c9-bb09-4210-81db-e2cedd1f68b5',
 '1d9026bc-7ec6-4e13-b3d0-1fae0b47b6e5',
 '275b5fef-4b1f-4937-b700-794c32fe8cdc',
 'c7be5da5-1505-41a1-af70-98c185603ac1',
 'ced125ce-a5ad-4f0a-ab03-07edabcc9e03',
 '990f9169-8029-4f13-a7aa-96de2e62bf10',
 'd863636e-5bf5-4ab1-acb1-6fef2767fae7',
 'a664e449-e9ba-

In [8]:
#This is from Code-Of_conduct
start_time = time.time()
results = vector_store.similarity_search(
    "What are the ethical business practices?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF CODE OF CONDUCT QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")
#This is from Compensation-Benefits Guide
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Medical Insurance & Surviving Spouse Coverage",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF COMPENSATIONS-BENEFITS GUIDE QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#This is from Employee Termination policy handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Voluntary Terminations",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE TERMINATION QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#this is from employee handbook, but conflicts with compensation-benefits, so we have to see if it is captured correctly from employee handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Compensation & development?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE HANDBOOK QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

QUERY EXECUTED IN --- 0.10467219352722168 seconds ---
RESULTS OF CODE OF CONDUCT QUERY: 0
CONTENT:  We should maintain the highest standards of ethical business conduct and integrity by:   Being fair and honest in all business dealings, including our professional relationships;   Properly maintaining all information and records, recognizing errors and, when an error is confirmed, promptly correcting it; and   Cooperating fully with all internal and external audits and investigations initiated or sanctioned by Comerica.   We must protect the confidentiality and privacy of confidential customer, shareholder, proprietary and third-party information and records.   We must make business decisions that align with Comericas risk appetite, are in the best interests of Comerica and without regard to personal gain.  This means that we should use good judgment and endeavor to avoid even the appearance of any conflict between our individual interests and those of Comerica.   BUSINESS CONDUCT  1.  

## Medium Chunk overlap observations:

These chunks seem much better and feel more complete. I personally prefer a chunk overlap of 20%.

In [9]:
#High Chunk Overlap Code
embeddings = HuggingFaceEmbeddings()
combined_text = ""
CHUNK_SIZE = 256
CHUNK_OVERLAP = 0.50
# Had to change the DB and collection names: with the same names, new chunks seemed to be
# appended to the old ones, so stale chunks were still returned even after the improved
# parsing (will look into deleting the Milvus vector DB/its contents in a bit).
URI = "./High_Chunk_Overlap.db"

vector_store = Milvus(
    embedding_function=embeddings,
    connection_args={"uri": URI},
)

docs_to_load = ["Code-of-conduct.pdf", "Compensation-Benefits-Guide.pdf", "Employee Termination Policy.pdf", "Employee-Handbook.pdf", "Onboarding Manual.pdf", "employee-appraisal-form.pdf", "health-and-safety-guidelines.pdf", "remote-work-policy.pdf"]
for doc in docs_to_load:
  loader = PyMuPDFLoader(doc)
  documents = loader.load()
  # print(documents)
  for page in documents:
    text = page.page_content
    if "contents" in text.lower():
      continue
    text = re.sub(r'\bPage\s+\d+\b', '', text, flags=re.IGNORECASE)
    text = re.sub(r'\n', '', text).strip() #removing all newlines
    # print(text)
    text = re.sub(r'[^\w\s.,?!:;\'\"()&-]', '', text)
    combined_text += text + " "
combined_text = combined_text.strip()
# print(combined_text)
text_splitter = TokenTextSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=int(CHUNK_SIZE*CHUNK_OVERLAP))
texts = text_splitter.split_text(combined_text)
docs = text_splitter.create_documents(texts)
# print(len(combined_text.split(" ")))
# print(len(docs))

vector_store_saved = Milvus.from_documents(
    docs,
    embeddings,
    collection_name="Hr_High_Chunk_Overlap",
    connection_args={"uri": URI},
)

vector_store_loaded = Milvus(
    embeddings,
    connection_args={"uri": URI},
    collection_name="Hr_High_Chunk_Overlap",
)

uuids = [str(uuid4()) for _ in range(len(docs))]
vector_store.add_documents(documents=docs, ids=uuids)

DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 5b2ca833428249a88d985b04c8ba6481
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: 0c21fd86c4b7492a8df7e57cf7bcd699
DEBUG:pymilvus.milvus_client.milvus_client:Created new connection using: f3cca915906d43a98940dcb3009e8284


['de3c755d-33fa-4a86-b623-a4eb25d56621',
 '17362ccf-d4cd-4853-852f-bb1e0fbc5a04',
 'fabd3cbb-403b-466e-897d-aa5cb3a8a649',
 '42962183-bfb3-4eac-b922-3dfb963f0dc9',
 'ff87a21d-7064-4fc2-8a57-40e20f7c1081',
 'c13e3903-e8d7-42f8-9edd-7bf274f5fd51',
 '8d02a198-7df5-4cc4-bfb4-ef6678907013',
 '647bbe5f-5f85-496a-97ce-0dd03b1ce915',
 '37c0f3af-a07b-416c-9834-14a02264627b',
 '8dd2e1dd-9105-4523-9608-b0217129cd39',
 '078b280f-8fac-419f-8b30-93abe202ad5d',
 'e8324e81-aae5-4e75-b47a-5d3cdd61f217',
 '3a477f70-4063-4233-9b5f-d7836d11805f',
 '1eeddfd1-ca28-464e-895b-37c9a0a125da',
 'a760766f-79aa-467d-ad6c-227259160b29',
 '3a6bd9c5-a209-4acf-a8d4-fa1b8ecb8611',
 '00dd2e1a-9d8c-4245-9d8e-c69921f9e17e',
 '1800cd36-48be-4dfd-a136-f8c3cf8535ea',
 '1af5ed9d-dcbf-40b2-9e97-182c058fb8b3',
 '9324811d-e2f4-4e55-9e69-317a5288053c',
 'd3a31935-0780-4363-b6ab-b89c15b4dd02',
 '7e320692-6421-47cd-886f-b02677633cc6',
 'bdd3f132-d599-460e-8d21-360f3cbba1ae',
 '41ae01e5-2add-47a5-a424-2df7feafd01c',
 '5190b4e6-67a7-

In [10]:
#This is from Code-Of_conduct
start_time = time.time()
results = vector_store.similarity_search(
    "What are the ethical business practices?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF CODE OF CONDUCT QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")
#This is from Compensation-Benefits Guide
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Medical Insurance & Surviving Spouse Coverage",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF COMPENSATIONS-BENEFITS GUIDE QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#This is from Employee Termination policy handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Voluntary Terminations",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE TERMINATION QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

#this is from employee handbook, but conflicts with compensation-benefits, so we have to see if it is captured correctly from employee handbook
start_time = time.time()
results = vector_store.similarity_search(
    "tell me about Compensation & development?",
    k=3,
)
print("QUERY EXECUTED IN --- %s seconds ---" % (time.time() - start_time))
for idx, res in enumerate(results):
    print(f"RESULTS OF EMPLOYEE HANDBOOK QUERY: {idx}")
    print(f"CONTENT: {res.page_content}")
    print(f"METADATA: {res.metadata}")
    print("== next response ==")

print("======================================NEXT QUERY==============================================")

QUERY EXECUTED IN --- 0.1386854648590088 seconds ---
RESULTS OF CODE OF CONDUCT QUERY: 0
CONTENT:  BUSINESS PRACTICES   We should conduct our business in accordance with all material applicable laws, rules and regulations.   We should maintain the highest standards of ethical business conduct and integrity by:   Being fair and honest in all business dealings, including our professional relationships;   Properly maintaining all information and records, recognizing errors and, when an error is confirmed, promptly correcting it; and   Cooperating fully with all internal and external audits and investigations initiated or sanctioned by Comerica.   We must protect the confidentiality and privacy of confidential customer, shareholder, proprietary and third-party information and records.   We must make business decisions that align with Comericas risk appetite, are in the best interests of Comerica and without regard to personal gain.  This means that we should use good judgment and endeavor 

## High chunk overlap observations:
The responses returned are very similar to those returned with the medium overlap. However, as I also saw earlier when parsing and splitting into chunks, the number of chunks with 50% overlap is almost 2x that with 0% overlap. This increase in the number of chunks is an obvious reason for the longer index creation time (from my observations, the 10% index was created in 4 minutes, the 20% index in 12 minutes, and the 50% index in 20 minutes).
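
The near-2x figure follows from how an overlapping window advances: each new chunk moves forward by `chunk_size - overlap` tokens, so a 50% overlap halves the step. The sketch below estimates the chunk count under the assumption of a plain sliding window; the real `TokenTextSplitter` may differ by a chunk at the tail. `estimate_chunk_count` is a hypothetical helper, and the 10,000-token corpus size is illustrative, not a measurement from this notebook.

```python
import math


def estimate_chunk_count(num_tokens, chunk_size, overlap_fraction):
    """Estimate how many chunks a sliding window produces.

    The overlap in tokens uses the same arithmetic as the cells above:
    int(chunk_size * overlap_fraction).
    """
    overlap = int(chunk_size * overlap_fraction)
    stride = chunk_size - overlap  # tokens the window advances per chunk
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    if num_tokens <= 0:
        return 0
    if num_tokens <= chunk_size:
        return 1
    # One chunk for the first window, plus one per stride needed to reach the end
    return math.ceil((num_tokens - chunk_size) / stride) + 1


for fraction in (0.0, 0.1, 0.2, 0.5):
    print(fraction, estimate_chunk_count(10_000, 256, fraction))
# 0.0 40
# 0.1 44
# 0.2 49
# 0.5 78
```

Under this model, going from 0% to 50% overlap takes the chunk count from 40 to 78, while 20% adds only about a fifth more chunks than 0%.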



# Final Decisions:
## Chunk Size: (256 tokens)
Smaller chunks return more focused, more meaningful results.
## Chunk Overlap: (20%)
Keeping index creation time in mind (we also plan to add documents in real time, on demand), I want to keep indexing time down. A 20% overlap gave more complete chunks than 10% while indexing faster than 50%, so I think a 20% chunk overlap is a fine compromise.
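
To keep these choices in one place for the later pipeline steps, they could be captured in a small settings object. This is only a sketch: the class and field names are hypothetical, and the values are the ones chosen above (256-token chunks, 20% overlap, k=3).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkingConfig:
    """Settings chosen in this notebook (class and field names are hypothetical)."""

    chunk_size: int = 256           # tokens per chunk
    overlap_fraction: float = 0.20  # fraction of chunk_size shared between neighbours
    top_k: int = 3                  # number of chunks retrieved per query

    @property
    def overlap_tokens(self) -> int:
        # Same arithmetic as the cells above: int(CHUNK_SIZE * CHUNK_OVERLAP)
        return int(self.chunk_size * self.overlap_fraction)


config = ChunkingConfig()
print(config.chunk_size, config.overlap_tokens, config.top_k)
# 256 51 3
```

The splitter would then be built as `TokenTextSplitter(chunk_size=config.chunk_size, chunk_overlap=config.overlap_tokens)`, matching the call used in the experiment cells.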
