## Window-sentence retrieval RAG
1. Focus on individual sentences.
2. Embed the query and search for the most relevant sentences in the PDFs.
3. Run a top-k similarity search over the sentences and retrieve each relevant sentence together with its surrounding sentences as context.
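
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not how LlamaIndex implements it: `build_sentence_windows` and `retrieve` are hypothetical helpers, the sentence boundary regex is deliberately naive, and word overlap stands in for embedding similarity.

```python
import re


def build_sentence_windows(text, window_size=1):
    """Split text into sentences and attach a window of neighbours to each."""
    # Naive boundary: whitespace that follows ".", "?" or "!" (an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s.strip()]
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        records.append({"sentence": sentence, "window": " ".join(sentences[start:end])})
    return records


def retrieve(query, records, top_k=1):
    """Rank sentences by word overlap with the query (toy stand-in for embeddings)."""
    query_words = set(re.findall(r"\w+", query.lower()))

    def score(record):
        return len(query_words & set(re.findall(r"\w+", record["sentence"].lower())))

    return sorted(records, key=score, reverse=True)[:top_k]


text = "We ship daily. Returns are accepted within 7 days. Contact us by phone."
records = build_sentence_windows(text, window_size=1)
best = retrieve("returns accepted", records, top_k=1)[0]
print(best["sentence"])  # the matched sentence
print(best["window"])    # the matched sentence plus its neighbours
```

The match is scored on the single sentence, but the wider window is what would be handed to the LLM as context.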

### Import relevant libraries from LlamaIndex

In [28]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Document
from llama_index.core import Settings

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

from pprint import pprint
import re

In [3]:
documents = SimpleDirectoryReader(
    input_files=["./data/About us.pdf", 
                 "./data/contact us.pdf",
                 "./data/delivery.pdf",
                 "./data/payment options.pdf",
                 "./data/returns.pdf",
                 "./data/services.pdf",
                 "./data/testings.pdf",
                 "./data/tracking.pdf",
                 "./data/warranty.pdf",
                 "./data/whatsapp order.pdf"]
).load_data()

print(len(documents), "\n")
print(type(documents[0]))


18 

<class 'llama_index.core.schema.Document'>


In [None]:
# load_data() splits each PDF into pages and creates one Document object per page,
# including the page text
for doc in documents:
    pprint(doc.metadata)

# concatenate data to one document
document = Document(text="\n\n".join([doc.text for doc in documents]))
pprint(document.text)

{'creation_date': '2025-02-28',
 'file_name': 'About us.pdf',
 'file_path': 'data\\About us.pdf',
 'file_size': 88750,
 'file_type': 'application/pdf',
 'last_modified_date': '2025-02-28',
 'page_label': '1'}
{'creation_date': '2025-02-28',
 'file_name': 'contact us.pdf',
 'file_path': 'data\\contact us.pdf',
 'file_size': 104443,
 'file_type': 'application/pdf',
 'last_modified_date': '2025-02-28',
 'page_label': '1'}
{'creation_date': '2025-02-28',
 'file_name': 'contact us.pdf',
 'file_path': 'data\\contact us.pdf',
 'file_size': 104443,
 'file_type': 'application/pdf',
 'last_modified_date': '2025-02-28',
 'page_label': '2'}
{'creation_date': '2025-02-28',
 'file_name': 'delivery.pdf',
 'file_path': 'data\\delivery.pdf',
 'file_size': 103930,
 'file_type': 'application/pdf',
 'last_modified_date': '2025-02-28',
 'page_label': '1'}
{'creation_date': '2025-02-28',
 'file_name': 'delivery.pdf',
 'file_path': 'data\\delivery.pdf',
 'file_size': 103930,
 'file_type': 'application/pdf',


In [None]:
# the node parser splits the document into sentences, creates one node per sentence,
# and stores the surrounding sentences as context
node_parser = SentenceWindowNodeParser.from_defaults(
    # how many sentences on either side to capture
    window_size=3,
    # the metadata key that holds the window of surrounding sentences
    window_metadata_key="window",
    # the metadata key that holds the original sentence
    original_text_metadata_key="original_sentence",
)

In [17]:
llm = Ollama(model="deepseek-r1:1.5b", temperature=0.1, request_timeout=120)

response = llm.complete("Who is Laurie Voss? write in 10 words")

In [18]:
print(response)

<think>
Okay, so I need to figure out who Laurie Voss is and how to answer the question about her. Let me start by recalling what I know from previous knowledge.

I remember that Laurie Voss was a former U.S. Vice President. She served as the President of the United States from 1978 until she died in 2003. That's a key point because it tells me her political career and timeline.

Now, the question is asking for an answer in 10 words. So I need to condense her information into that limit. Let me think about the main points: she was Vice President, served as President, died in 2003, and was a former U.S. Vice President.

I should make sure not to include too much detail beyond what's necessary for the answer. Maybe I can mention her presidency duration or her political career but keep it concise.

Putting it all together: Laurie Voss was the former U.S. Vice President who served as President from 1978 until she died in 2003.
</think>

Laurie Voss was the former U.S. Vice President who se

### Set up the embedding model in LlamaIndex

1. We can set up services globally through `Settings`:

- Settings.llm = OpenAI(model="gpt-3.5-turbo")
- Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
- Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)
- Settings.num_output = 512
- Settings.context_window = 3900

2. We can set them up locally, by passing them directly to the component that needs them.
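
The two options can be sketched side by side with the Ollama models used in this notebook. This is a configuration sketch, assuming a local Ollama server on the default port with both models already pulled; `build_query_engine` is a hypothetical helper and `similarity_top_k=3` is an illustrative value.

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.llms.ollama import Ollama

embed_model = OllamaEmbedding(
    model_name="nomic-embed-text:latest",
    base_url="http://localhost:11434",  # assumed default Ollama address
)
llm = Ollama(model="deepseek-r1:1.5b", temperature=0.1, request_timeout=120)

# Option 1: global defaults, used by any component that is not told otherwise.
Settings.llm = llm
Settings.embed_model = embed_model
Settings.node_parser = SentenceSplitter(chunk_size=512, chunk_overlap=20)


# Option 2: local configuration, passed only to the objects that need it.
def build_query_engine(nodes):
    index = VectorStoreIndex(nodes, embed_model=embed_model)
    return index.as_query_engine(llm=llm, similarity_top_k=3)
```

Local arguments are convenient when one notebook compares several models, because no global state has to be reset between experiments.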

In [None]:
# configure the embedding model for the RAG application
# nomic-embed-text:latest, served by Ollama, is used as the embedding model
ollama_embedding = OllamaEmbedding(
    model_name="nomic-embed-text:latest",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)

# a vector store index only needs an embed model
# base node parser is a sentence splitter
text_splitter = SentenceSplitter()


### Extract Nodes

The sentence-window node parser yields one node per sentence; each node carries the base sentence plus its surrounding context in the metadata.

For comparison, the plain `SentenceSplitter()` yields the base nodes without a surrounding window.

In [27]:
# extract sentence nodes, each with its surrounding context, using the node parser
nodes = node_parser.get_nodes_from_documents([document])
# extract base nodes using SentenceSplitter()
base_nodes = text_splitter.get_nodes_from_documents([document])
print(len(base_nodes))
print(len(nodes))
pprint(nodes[1].metadata)

7
209
{'original_sentence': 'We supply high quality electronic components, modules '
                      'and tools manufactured in \n'
                      'China, Taiwan, Hong Kong, Japan, USA, Italy, England, '
                      'Germany and Australia. ',
 'window': 'About Us  \n'
           'TRONIC.LK is an electronic component and module sourcing company '
           'under Sigma Electronics  \n'
           '(Pvt) Ltd.  We supply high quality electronic components, modules '
           'and tools manufactured in \n'
           'China, Taiwan, Hong Kong, Japan, USA, Italy, England, Germany and '
           'Australia.  Being electronic \n'
           'hobbyists, we strive to provide high-quality modules and '
           'components at a reasonable price.  \n'
           '  \n'
           ' We currently supply Arduinos, Raspberry Pis, Orange Pis, '
           'Micro:bits, NodeMCUs, Atmel & \n'
           'Microchip microcontrollers, ICs & other passive components, '
         

The default sentence splitting shown above does not fit this scenario well. Try custom splitting with a regular expression instead.
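
To make the pattern used in the following cells easier to read, here it is annotated piece by piece and applied to a made-up string. The sample text is an assumption for illustration and does not come from the PDFs.

```python
import re

# Same pattern as in the notebook, split into its parts.
SENTENCE_BOUNDARY = re.compile(
    r"(?<!\w\.\w.)"       # not right after something shaped like "e.g." or "i.e."
    r"(?<![A-Z][a-z]\.)"  # not right after a title such as "Dr." or "Mr."
    r"(?<=\.|\?)"         # only right after a period or a question mark
    r"\s+"                # the whitespace run that the split consumes
)

sample = "TRONIC.LK sells e.g. Arduino boards. Ask Dr. Silva for help. Is it open?"
parts = SENTENCE_BOUNDARY.split(sample)
for part in parts:
    print(repr(part))
```

"TRONIC.LK" stays in one piece because a split needs whitespace directly after the period. Note that the pattern does not treat "!" as a sentence boundary.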

In [30]:
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s+', document.text.strip())
pprint(sentences)

['About Us  \n'
 'TRONIC.LK is an electronic component and module sourcing company under Sigma '
 'Electronics  \n'
 '(Pvt) Ltd.',
 'We supply high quality electronic components, modules and tools manufactured '
 'in \n'
 'China, Taiwan, Hong Kong, Japan, USA, Italy, England, Germany and Australia.',
 'Being electronic \n'
 'hobbyists, we strive to provide high-quality modules and components at a '
 'reasonable price.',
 'We currently supply Arduinos, Raspberry Pis, Orange Pis, Micro:bits, '
 'NodeMCUs, Atmel & \n'
 'Microchip microcontrollers, ICs & other passive components, Creality 3D '
 'Printers & filaments, \n'
 'Pneumatic parts, CNC accessories, Inverters, Battery chargers, Multimeters, '
 'Ronix Tools, etc...',
 'We were the first to introduce Arduino development boards and modules to Sri '
 'Lanka way back \n'
 'in 2011 under the company named Lankatronics (Pvt) Ltd., which later evolved '
 'into TRONIC.LK.',
 'Today, TRONIC.LK is one of the leading electronic stores in Sri La

In [31]:
def custom_sentence_splitter(text):
    # Apply the regex pattern to split sentences
    sentences = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s+", text)
    return sentences

node_parser = SentenceWindowNodeParser.from_defaults(
    # use a custom function to split the text into sentences
    sentence_splitter=custom_sentence_splitter,
    # how many sentences on either side to capture
    window_size=3,
    # the metadata key that holds the window of surrounding sentences
    window_metadata_key="window",
    # the metadata key that holds the original sentence
    original_text_metadata_key="original_sentence",
)

In [39]:
nodes = node_parser.get_nodes_from_documents([document])
for node in nodes:
    pprint(node.metadata['original_sentence'])

('About Us  \n'
 'TRONIC.LK is an electronic component and module sourcing company under Sigma '
 'Electronics  \n'
 '(Pvt) Ltd.')
('We supply high quality electronic components, modules and tools manufactured '
 'in \n'
 'China, Taiwan, Hong Kong, Japan, USA, Italy, England, Germany and Australia.')
('Being electronic \n'
 'hobbyists, we strive to provide high-quality modules and components at a '
 'reasonable price.')
('We currently supply Arduinos, Raspberry Pis, Orange Pis, Micro:bits, '
 'NodeMCUs, Atmel & \n'
 'Microchip microcontrollers, ICs & other passive components, Creality 3D '
 'Printers & filaments, \n'
 'Pneumatic parts, CNC accessories, Inverters, Battery chargers, Multimeters, '
 'Ronix Tools, etc...')
('We were the first to introduce Arduino development boards and modules to Sri '
 'Lanka way back \n'
 'in 2011 under the company named Lankatronics (Pvt) Ltd., which later evolved '
 'into TRONIC.LK.')
'Today, TRONIC.LK is one of the leading electronic stores in Sri Lan

The regular expression does a pretty good job, but the '\n' characters inside the sentences should be removed so that the sentences read cleanly.
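
One possible way to drop the embedded line breaks is to collapse every whitespace run inside a sentence to a single space after splitting. The sketch below assumes that losing the original line layout is acceptable for this data; `clean_sentence_splitter` is a hypothetical name, and the sample string is made up.

```python
import re

# Same boundary pattern as custom_sentence_splitter in the notebook.
SENTENCE_BOUNDARY = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s+")


def clean_sentence_splitter(text):
    """Split text into sentences and normalise whitespace inside each one."""
    sentences = SENTENCE_BOUNDARY.split(text.strip())
    # str.split() with no arguments splits on any whitespace, including "\n",
    # so joining with a single space removes line breaks and double spaces.
    return [" ".join(s.split()) for s in sentences if s.strip()]


sample = "Fast  \ndelivery is available.  Is pickup \npossible? Yes."
cleaned = clean_sentence_splitter(sample)
for sentence in cleaned:
    print(repr(sentence))
```

A function like this could be passed as `sentence_splitter=` in place of `custom_sentence_splitter`, since it has the same shape: it takes a string and returns a list of strings.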