## Window-sentence retrieval RAG
1. Focus on individual sentences.
2. Embed the query and search for the most similar sentences in the PDFs.
3. Take the top-k matches and return each matched sentence together with its surrounding sentences as context.
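
The steps above depend on pairing every sentence with its neighbours. Below is a minimal sketch of that pairing, assuming the text is already split into a list of sentences. `build_sentence_windows` is a hypothetical helper written for illustration, not a LlamaIndex function, and its dictionary keys simply mirror the metadata keys used later in this notebook.

```python
def build_sentence_windows(sentences, window_size=3):
    """Pair each sentence with up to `window_size` neighbours on each side."""
    records = []
    for i, sentence in enumerate(sentences):
        # clamp the window to the start and end of the document
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append({
            "original_sentence": sentence,               # what gets embedded and searched
            "window": " ".join(sentences[start:end]),    # what is handed to the LLM
        })
    return records


# placeholder sentences, not taken from the PDFs
sentences = ["S0.", "S1.", "S2.", "S3.", "S4."]
records = build_sentence_windows(sentences, window_size=1)
for record in records:
    print(record)
```

Searching on the single sentence keeps the match precise, while returning the window gives the LLM enough context to answer.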

### Import relevant libraries from LlamaIndex

In [None]:
# basic modules
import os
import re
from pprint import pprint

# llama index for rag system
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core import Document
from llama_index.core import Settings

# ollama model with llama index
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# evaluation metrics
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

In [45]:
documents = SimpleDirectoryReader(
    input_files=["./data/About us.pdf", 
                 "./data/contact us.pdf",
                 "./data/delivery.pdf",
                 "./data/payment options.pdf",
                 "./data/returns.pdf",
                 "./data/services.pdf",
                 "./data/testings.pdf",
                 "./data/tracking.pdf",
                 "./data/warranty.pdf",
                 "./data/whatsapp order.pdf"]
).load_data()

print(len(documents), "\n")
print(type(documents[0]))


18 

<class 'llama_index.core.schema.Document'>


In [46]:
# take a first look at the text in each document
for doc in documents:
    pprint(doc.text[:100])

('About Us  \n'
 'TRONIC.LK is an electronic component and module sourcing company under Sigma '
 'Electronics ')
('Contact us \n'
 'Electronic Shop - Kohuwala, Nugegoda  \n'
 'TRONIC.LK  \n'
 '8, 1/1, Sunethradevi Road,  \n'
 'Kohuwala')
("If the item(s) (components or modules) you need isn't listed, we can try to "
 'supply that for you. \n'
 'Yo')
('Delivery \n'
 'DELIVERY VIA COURIER SERVICE  \n'
 ' \n'
 'We deliver items island-wide via a courier service. The c')
('If you have any question regarding delivery via courier service, please '
 'contact us via \n'
 'info@tronic.')
('Payment Options \n'
 'Here are the payment options we offer. \n'
 'We prefer bank transfers to all other optio')
('Returns \n'
 'We reserve the right to accept or reject return or exchange requests other '
 'than Warranty \n'
 'C')
('Services \n'
 'As most of you know, TRONIC.LK is running by set of engineers who have years '
 'of experience')
('If you need to mould your enclosure or any other part using pla

In [47]:
# the extracted text contains newlines and repeated spaces,
# so collapse every run of whitespace into a single space
def clean_text_fn(text):
    cleaned_text = re.sub(r'\s+', ' ', text)
    return cleaned_text

# keep the cleaned text of each document (one per PDF page) in a list
clean_texts = [clean_text_fn(document.text) for document in documents]

In [48]:
# load_data() splits each PDF into pages and creates one Document object per page,
# each with its own text and metadata
for doc in documents:
    print(doc.metadata)

# concatenate the cleaned pages into one document, separated by \n\n
document = Document(text="\n\n".join(clean_texts))
print(document.text[:1000])

{'page_label': '1', 'file_name': 'About us.pdf', 'file_path': 'data\\About us.pdf', 'file_type': 'application/pdf', 'file_size': 88750, 'creation_date': '2025-02-28', 'last_modified_date': '2025-02-28'}
{'page_label': '1', 'file_name': 'contact us.pdf', 'file_path': 'data\\contact us.pdf', 'file_type': 'application/pdf', 'file_size': 104443, 'creation_date': '2025-02-28', 'last_modified_date': '2025-02-28'}
{'page_label': '2', 'file_name': 'contact us.pdf', 'file_path': 'data\\contact us.pdf', 'file_type': 'application/pdf', 'file_size': 104443, 'creation_date': '2025-02-28', 'last_modified_date': '2025-02-28'}
{'page_label': '1', 'file_name': 'delivery.pdf', 'file_path': 'data\\delivery.pdf', 'file_type': 'application/pdf', 'file_size': 103930, 'creation_date': '2025-02-28', 'last_modified_date': '2025-02-28'}
{'page_label': '2', 'file_name': 'delivery.pdf', 'file_path': 'data\\delivery.pdf', 'file_type': 'application/pdf', 'file_size': 103930, 'creation_date': '2025-02-28', 'last_mod

In [49]:
# the node parser splits the document into sentences, creates one node per sentence,
# and stores the surrounding sentences in the node metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    # how many sentences on either side to capture
    window_size=3,
    # the metadata key that holds the window of surrounding sentences
    window_metadata_key="window",
    # the metadata key that holds the original sentence
    original_text_metadata_key="original_sentence",
)

In [50]:
llm = Ollama(model="deepseek-r1:1.5b", temperature=0.1, request_timeout=120)

response = llm.complete("Who is Laurie Voss? write in 10 words")

In [31]:
print(response)

<think>
Okay, so I need to figure out who Laurie Voss is and how to answer the question in 10 words. Let me start by recalling what I know about her.

I think she's a politician. Maybe from the U.S.? I remember hearing something about her being involved in the 2008 election. Oh, right! She was a candidate for the U.S. House of Representatives. That makes sense because that's where she would run for office.

Wait, when did she get into politics? I think she ran against another candidate named Jim Cramer. They were both running for the same position, which is interesting. So Laurie Voss and Jim Cramer were two House members who were fighting it out in a race to fill a specific seat.

I should make sure about her name and the year of her election. I believe she was running against Cramer from 2008 until 2012, which is when she became U.S. Representative for New Mexico. That seems right because New Mexico has a significant population in the U.S., so it's plausible that someone would run fo
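
The response above starts with the model's reasoning inside a `<think>` block. If only the final answer is wanted (for display or for evaluation), the reasoning can be removed first. This is a minimal sketch, assuming the model closes the block with `</think>`; the closing tag is not visible in the truncated output above. `strip_reasoning` is a hypothetical helper, not part of LlamaIndex or Ollama.

```python
import re

# non-greedy match across newlines, from the opening to the closing tag
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(text):
    """Remove <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


# invented sample text, shaped like the response printed above
raw = "<think>\nLet me recall what I know...\n</think>\n\nShort final answer."
print(strip_reasoning(raw))
```

In this notebook it could be applied as `strip_reasoning(str(response))`. Text without a complete `<think>...</think>` pair is returned unchanged.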

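### Set up the embedding model with LlamaIndex

1. Services can be configured globally through `Settings`:

- Settings.llm = OpenAI(model="gpt-3.5-turbo")
- Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
- Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
- Settings.num_output = 512
- Settings.context_window = 3900

2. Services can also be configured locally, by passing them to the component that uses them, as this notebook does:

- VectorStoreIndex(nodes, embed_model=ollama_embedding)
- sentence_index.as_query_engine(llm=llm)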

In [51]:
# embedding model for the RAG application:
# nomic-embed-text:latest served by the local Ollama instance
ollama_embedding = OllamaEmbedding(
    model_name="nomic-embed-text:latest",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)

# a vector store index only needs an embed model
# base node parser is a sentence splitter
text_splitter = SentenceSplitter()


### Extract Nodes

The sentence-window node parser returns one node per sentence; each node keeps the base sentence and its surrounding context in the metadata.

For comparison, the default `SentenceSplitter()` returns larger text chunks without any window metadata.

In [52]:
# extract sentence nodes with their context windows using the window node parser
nodes = node_parser.get_nodes_from_documents([document])
# extract baseline chunks using SentenceSplitter()
base_nodes = text_splitter.get_nodes_from_documents([document])
print(len(base_nodes))
print(len(nodes))
print(nodes[1].metadata)

7
201
{'window': 'About Us TRONIC.LK is an electronic component and module sourcing company under Sigma Electronics (Pvt) Ltd.  We supply high quality electronic components, modules and tools manufactured in China, Taiwan, Hong Kong, Japan, USA, Italy, England, Germany and Australia.  Being electronic hobbyists, we strive to provide high-quality modules and components at a reasonable price.  We currently supply Arduinos, Raspberry Pis, Orange Pis, Micro:bits, NodeMCUs, Atmel & Microchip microcontrollers, ICs & other passive components, Creality 3D Printers & filaments, Pneumatic parts, CNC accessories, Inverters, Battery chargers, Multimeters, Ronix Tools, etc... We were the first to introduce Arduino development boards and modules to Sri Lanka way back in 2011 under the company named Lankatronics (Pvt) Ltd., which later evolved into TRONIC.LK.  Today, TRONIC.LK is one of the leading electronic stores in Sri Lanka. ', 'original_sentence': 'We supply high quality electronic components, 

The default sentence splitting shown above does not fit this scenario well. Try custom splitting with a regular expression instead.
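
To see what the pattern in the next cell does, it helps to run it on a short string. The sketch below reuses that pattern unchanged; the sample text is invented for illustration and does not come from the PDFs.

```python
import re

# same pattern as in the next cell:
#   (?<!\w\.\w.)       do not split after abbreviations shaped like "e.g."
#   (?<![A-Z][a-z]\.)  do not split after titles shaped like "Mr."
#   (?<=\.|\?)         only split after a period or a question mark
#   \s+                the whitespace that is consumed by the split
SENTENCE_BOUNDARY = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s+"

sample = (
    "We ship island-wide. Contact Mr. Silva for details, e.g. by email. "
    "Is the shop open?  Yes."
)
parts = re.split(SENTENCE_BOUNDARY, sample)
for part in parts:
    print(repr(part))
```

Note the limits of the pattern: a three-letter abbreviation such as `Ltd.` is still treated as a sentence end, and `!` is not treated as one.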

In [53]:
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s+', document.text.strip())
print(sentences)

['About Us TRONIC.LK is an electronic component and module sourcing company under Sigma Electronics (Pvt) Ltd.', 'We supply high quality electronic components, modules and tools manufactured in China, Taiwan, Hong Kong, Japan, USA, Italy, England, Germany and Australia.', 'Being electronic hobbyists, we strive to provide high-quality modules and components at a reasonable price.', 'We currently supply Arduinos, Raspberry Pis, Orange Pis, Micro:bits, NodeMCUs, Atmel & Microchip microcontrollers, ICs & other passive components, Creality 3D Printers & filaments, Pneumatic parts, CNC accessories, Inverters, Battery chargers, Multimeters, Ronix Tools, etc...', 'We were the first to introduce Arduino development boards and modules to Sri Lanka way back in 2011 under the company named Lankatronics (Pvt) Ltd., which later evolved into TRONIC.LK.', 'Today, TRONIC.LK is one of the leading electronic stores in Sri Lanka.', 'In our clientele, we have all the leading universities, technology instit

In [54]:
def custom_sentence_splitter(text):
    # Apply the regex pattern to split sentences
    sentences = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s+", text)
    return sentences

node_parser = SentenceWindowNodeParser.from_defaults(
    # split the sentences with the custom splitter
    sentence_splitter=custom_sentence_splitter,
    # how many sentences on either side to capture
    window_size=3,
    # the metadata key that holds the window of surrounding sentences
    window_metadata_key="window",
    # the metadata key that holds the original sentence
    original_text_metadata_key="original_sentence",
)

In [None]:
# build the sentence nodes from the whole document, which holds all company details
nodes = node_parser.get_nodes_from_documents([document])
for node in nodes[:2]:
    print(node.metadata['original_sentence'])
    pprint(node.metadata)

About Us TRONIC.LK is an electronic component and module sourcing company under Sigma Electronics (Pvt) Ltd.
{'original_sentence': 'About Us TRONIC.LK is an electronic component and '
                      'module sourcing company under Sigma Electronics (Pvt) '
                      'Ltd.',
 'window': 'About Us TRONIC.LK is an electronic component and module sourcing '
           'company under Sigma Electronics (Pvt) Ltd. We supply high quality '
           'electronic components, modules and tools manufactured in China, '
           'Taiwan, Hong Kong, Japan, USA, Italy, England, Germany and '
           'Australia. Being electronic hobbyists, we strive to provide '
           'high-quality modules and components at a reasonable price. We '
           'currently supply Arduinos, Raspberry Pis, Orange Pis, Micro:bits, '
           'NodeMCUs, Atmel & Microchip microcontrollers, ICs & other passive '
           'components, Creality 3D Printers & filaments, Pneumatic parts, CNC '
     

In [None]:
# embed each sentence node and store it in a vector store index
sentence_index = VectorStoreIndex(nodes, embed_model=ollama_embedding)
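
Conceptually, querying this index means embedding the question and ranking the stored sentence vectors by similarity. The sketch below illustrates that ranking with cosine similarity on toy two-dimensional vectors. It is an illustration of the idea only, not LlamaIndex's implementation; `cosine_similarity`, `top_k`, and the vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, sentence_vecs, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(sentence_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# toy vectors standing in for sentence embeddings
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query = [1.0, 0.1]
print(top_k(query, toy_vectors, k=2))
```

The `similarity_top_k` argument used later in the query engine plays the role of `k` here.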

In [None]:
# build the index from the document with the sentence-window parser;
# if ./sentence_index already exists, reload it instead of re-indexing
if not os.path.exists("./sentence_index"):
    sentence_index = VectorStoreIndex.from_documents(
        [document],
        # node parsers are applied through the transformations list
        transformations=[node_parser],
        embed_model=ollama_embedding
    )

    sentence_index.storage_context.persist(persist_dir="./sentence_index")
else:
    sentence_index = load_index_from_storage(
        StorageContext.from_defaults(persist_dir="./sentence_index"),
        # reuse the same embedding model so queries are embedded consistently
        embed_model=ollama_embedding
    )

In [None]:
# simple query engine: retrieve the most relevant context for a question and answer with the LLM
query_engine = sentence_index.as_query_engine(
    # use the deepseek model defined earlier as the LLM
    llm=llm,
    # retrieve the two most similar sentence nodes
    similarity_top_k=2,
    # replace each retrieved sentence with its `window` metadata
    # (the sentence plus its surrounding context), matching the node parser's window_metadata_key
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

# run the query: retrieve the sentence windows, then let the LLM answer from them
window_response = query_engine.query(
    "How to contact them?"
)
print(window_response)

<think>
Okay, so I need to figure out how to contact TRONIC.LK based on the provided context. Let me go through each section of the context step by step.

First, looking at the "Contact Us" section, it mentions that we can prepare your order and receive bank details via email. The email address is info@tronic.lk. So, if someone wants to contact us directly, they should send an email to this address.

Next, in the "Payment Options" section, there are several methods mentioned for payment processing. These include using a Frimi app, PayHere, or other credit card/debit card options. Each of these has its own fee structure, but the main point is that customers can choose how they pay, which might be helpful if someone wants to make a purchase and isn't sure about their payment method.

The "Returns" section states that we'll accept returns unless there are issues like unopened packages or sealed items. The customer gets notified within 7 days, so if someone needs to return an item, they ca
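
The answer above draws on whole sections because the post-processor swaps each retrieved sentence for its stored window before the LLM sees it. This is a minimal sketch of that swap on plain dictionaries, assuming nodes shaped like `{"text": ..., "metadata": {...}}`. `replace_with_window` is a hypothetical stand-in for illustration, not the `MetadataReplacementPostProcessor` implementation, and the sample text is invented.

```python
def replace_with_window(retrieved_nodes, target_metadata_key="window"):
    """Swap each node's text for the value stored under the metadata key."""
    replaced = []
    for node in retrieved_nodes:
        # fall back to the original text when the key is missing
        text = node["metadata"].get(target_metadata_key, node["text"])
        replaced.append({**node, "text": text})
    return replaced


# invented placeholder nodes
retrieved = [
    {
        "text": "Email us at info@example.com.",
        "metadata": {"window": "Contact us. Email us at info@example.com. We reply soon."},
    },
    {"text": "A node without a window.", "metadata": {}},
]
for node in replace_with_window(retrieved):
    print(node["text"])
```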

### Evaluate based on the experimental setup

model: "deepseek-r1:1.5b"

window_size: 3

sentence_window_retrieval