# Nodes with LlamaIndex

---
*NOTE: to run the notebook:*
1. *Remove 'local:' from llm = lmql.model("local:llama.cpp:/home/dorota/models/mistral-7b-instruct-v0.2.Q6_K.gguf", tokenizer="mistralai/Mistral-7B-Instruct-v0.2")*
1. *start a service in terminal with: lmql serve-model llama.cpp:/home/dorota/models/mistral-7b-instruct-v0.2.Q6_K.gguf --verbose True --n_gpu_layers 20 --n_ctx 0*
---


CONTENT: Investigation of chunking into nodes with LlamaIndex built-in tools, using mistral-7b-instruct-Q6 on a scientific article and two patents. Additional investigation of node creation using the different node splitters provided by LlamaIndex.

RESULTS AND COMMENTS:
NOTE: very limited investigation!
* article -> 21 documents (1/page) -> 48 semantic nodes: better results with semantic nodes
* patent1 -> 17 documents (1/page) -> 26 semantic nodes: better results with documents
* patent2 -> 47 documents (1/page) -> 84 semantic nodes: better results with documents

* Semantic nodes for both the article and the patents typically start with a running header or footer stating the title/number/page, for example "7 EP 2 671 601 A1 8" or "WO 2014/076653 PCT/IB2013/060133", with another one halfway through the page. Further investigation on cleaned text and a larger sample is needed to draw any conclusions.
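
The text cleaning mentioned above could start with something like the sketch below, which drops lines that consist only of a running header or footer before the pages are split into nodes. The two regular expressions are assumptions written to match the two example strings quoted above; they are not a general grammar for patent headers, and `strip_running_headers` is a hypothetical helper, not part of LlamaIndex.

```python
import re

# Hypothetical patterns, derived only from the two example headers quoted above.
HEADER_PATTERNS = [
    # e.g. "7 EP 2 671 601 A1 8" (optional page numbers on either side)
    re.compile(r"^\d*\s*EP \d \d{3} \d{3} [AB]\d\s*\d*$"),
    # e.g. "WO 2014/076653 PCT/IB2013/060133"
    re.compile(r"^WO \d{4}/\d{6}\s+PCT/[A-Z0-9]{2}\d{4}/\d{6}$"),
]


def strip_running_headers(page_text: str) -> str:
    """Drop lines that consist solely of a running header or footer."""
    kept = [
        line
        for line in page_text.splitlines()
        if not any(pattern.match(line.strip()) for pattern in HEADER_PATTERNS)
    ]
    return "\n".join(kept)


sample = (
    "7 EP 2 671 601 A1 8\n"
    "The irrigation system comprises a pump.\n"
    "WO 2014/076653 PCT/IB2013/060133\n"
    "A twist drill is described."
)
print(strip_running_headers(sample))
```

Because the patterns are anchored to the whole line, a sentence that merely mentions a patent number in running text is left untouched. The cleaned text would then have to be put back into the page-level documents before splitting.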
 


In [1]:
import lmql
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from transformers import AutoTokenizer



In [2]:
# llama.cpp endpoint: https://lmql.ai/docs/models/llama.cpp.html#running-without-a-model-server
# tokenizer.model from https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/tree/main

llm = lmql.model("llama.cpp:/home/dorota/models/mistral-7b-instruct-v0.2.Q6_K.gguf", tokenizer="mistralai/Mistral-7B-Instruct-v0.2", n_gpu_layers=10, n_ctx=0, verbose=False) 

In [3]:
# set global variables to create vector embeddings for text nodes
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2').encode

## VectorStoreIndex with documents or semantic nodes tested on scientific article and patent
SimpleDirectoryReader: https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/ \
more readers available at https://llamahub.ai/

In [24]:
# read in all documents from assigned folder
# documents = SimpleDirectoryReader(input_files=["/home/dorota/LLM-diploma-project/00_concept_tests/data/40001_2023_Article_1364.pdf"]).load_data() 
# documents = SimpleDirectoryReader(input_files=["/home/dorota/LLM-diploma-project/00_concept_tests/data/patents/EP2671601A1.pdf"]).load_data()
documents = SimpleDirectoryReader(input_files=["/home/dorota/LLM-diploma-project/00_concept_tests/data/patents/WO2014076653A1.pdf"]).load_data()

In [None]:
len(documents)
# -> list of Document objects with 1 doc/page with metadata and tags (documents[0].text)

indexing documents...

In [None]:
index = VectorStoreIndex.from_documents(documents, show_progress=True) #[0:1] # index = VectorStoreIndex(nodes)

Settings.llm = None # =None to enable correct setting in query_engine
query_engine = index.as_query_engine(streaming=True, llm=None) # llm=None sets llm to Settings.llm thus defined as None

indexing Semantic nodes...

In [29]:
from llama_index.core.node_parser import SemanticSplitterNodeParser

splitter = SemanticSplitterNodeParser(buffer_size=1, breakpoint_percentile_threshold=99, embed_model=Settings.embed_model)
nodes = splitter.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes)
Settings.llm = None
query_engine = index.as_query_engine(streaming=True, llm=None) 

LLM is explicitly disabled. Using MockLLM.


In [None]:
len(nodes)

In [None]:
for node in nodes:
    print(node.text)
    print("----------------------------------------------------------------------------")

In [16]:
similarity_top_k = 4

@lmql.query(model=llm)
async def index_query(question: str):
    '''lmql
    "You are a QA bot that helps users answer questions.\n"
    
    # ask the question
    "Question: {question}\n"

    # look up and insert relevant information into the context
    response = query_engine.query(question)
    for s in response.source_nodes:
        print(s.node.get_text())
        print('----------------------------------------------------------')
    information = "\n\n".join([s.node.get_text() for s in response.source_nodes])
    "\nRelevant Information: {information}\n"
    
    # generate a response
    "Your response based on relevant information:[RESPONSE]" where STOPS_AT(RESPONSE, ".")
    '''


... extracting info from scientific article

In [12]:
result = await index_query("What is the main finding?", 
                   output_writer=lmql.stream(variable="RESPONSE"))

#-------------------------------------------------------------------------------------------------------------
# with documents
# The main finding of the study is that the researchers used the dual-map analysis feature in CiteSpace 6.

#-------------------------------------------------------------------------------------------------------------
# with semantic nodes
# The main finding of the study is that the analysis primarily focused on the relationship between breast cancer and protein synthesis, including gene expression, translation, and apoptosis.

Page 14 of 21 Xu et al. European Journal of Medical Research          (2023) 28:461 
and new knowledge that emerged. 
----------------------------------------------------------
Employing a segmentation process, topics exhibit -
ing akin clusters were deftly allocated to cohesive areas, 
thereby engendering a heightened sense of organization 
and a more comprehensive grasp of the underlying data 
(Fig.  8a). In this analysis, a keyword co-occurrence analy -
sis was conducted to identify the most frequently appear -
ing terms. The analysis included five keywords: “breast 
cancer” with 1339 occurrences, “expression” with 831 
occurrences, “cancer” with 407 occurrences, “protein” 
with 358 occurrences, and “translation” with 350 occur -
rences. These results suggest that the analysis primarily 
focused on the relationship between breast cancer and 
protein synthesis, including gene expression, translation, 
and apoptosis. The aim of this analysis was to identify the 
most frequent keywords

... extracting info from patent

In [38]:
result = await index_query("What is the invention?", 
                   output_writer=lmql.stream(variable="RESPONSE"))

#------------------------------------------
# with documents
# The invention described in EP 2 671 601 A1 is an irrigation system suitable for rectal irrigation, which can be used for self-administration.
# The invention described in the patent application is a twist drill and bone tap, along with a method for assessing bone quality during dental implantation procedures.

#------------------------
# with semantic nodes
# The invention described in EP 2 671 601 A1 is not explicitly stated in the provided search report.
# The invention described in the patent (Fig.

Declarations under Rule 4.17: 
as to the identity of the inventor (Rule 4.17(i)) 
as to applicant's entitlement to apply for and be granted a 
patent (Rule 4.17(ii)) 
of inventorship (Rule 4.17(iv)) 
Published: 
with international search report (Art. 21(3)) 
before the expiration of the time limit for amending the 
claims and to be republished in the event of receipt of 
amendments (Rule 48.2(h)) 
(54) Title: DRILL AND TAP AND METHOD FOR PREOPERATIVE ASSESSMENT OF BONE QUALITY 
Fig. 1 
42 
(57) Abstract: A twist drill and bone tap each monitor torque while drilling or threading to assess jaw bone quality and a method for 
accessing bone quality prior to or while tapping into the bone during a dental implantation procedure. The twist drill for assessing 
bone quality includes a shank having a proximal section and a distal section. A mounting portion is formed in the proximal section 
and is adapted to connect with a torque monitoring device. A drill bit is connected to the distal sectio

The invention described in the patent application is a twist drill and bone tap, along with a method for assessing bone quality during dental implantation procedures.

In [39]:
result = await index_query("Extract claim 1 from the patent?", 
                   output_writer=lmql.stream(variable="RESPONSE")
                   )

# -------------------------------------------------------------------------
# with documents
# To extract claim 1 from the patent, you can refer to the patent document itself.
# Claim 1 could not be extracted directly from the provided patent document as it is not mentioned in the text.

#-------------------------------------
# with semantic nodes
# To extract claim 1 from the patent, you would need to refer to the patent document itself as the relevant information provided does not contain the claim.
# Claim 1 from the patent cannot be extracted directly from the provided information as the patent claims are not explicitly stated in the text.

International application No. 
INTERNATIONAL SEARCH REPORT PCT/1B2013/060133 
Box No. Il Observations where certain claims were found unsearchable (Continuation of item 2 of first sheet) 
This international search report has not been established in respect of certain claims under Article 17(2)(a) for the following reasons: 
1. Claims Nos.: - - 
because they relate to subject matter rot required to be searched by this Authority, namely: 
Rule 39.1(iv) PCT - Method for treatment of the human or animal body by 
surgery 
2. [ | Claims Nos.: 
because they relate to parts of the international application that do not comply with the prescribed requirements to such 
an extent that no meaningful international search can be carried out, specifically: 
3. [ | Claims Nos.: 
because they are dependent claims and are not drafted in accordance with the second and third sentences of Rule 6.4(a). 
Box No. Ill Observations where unity of invention is lacking (Continuation of item 3 of first sheet) 
This

---
---

## Testing node creation with different splitters provided by LlamaIndex:

### 1. SentenceSplitter
The SentenceSplitter attempts to split text in chunks while respecting the boundaries of sentences. \
https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/
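
To make the idea of "respecting sentence boundaries" concrete, here is a minimal pure-Python sketch. It is not the LlamaIndex implementation: it counts whitespace-separated words instead of tokens, uses a naive regex to find sentence ends, and expresses the overlap in whole sentences rather than tokens. `split_sentences` and `chunk_by_sentences` are hypothetical helper names.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_by_sentences(text: str, chunk_size: int, overlap_sentences: int = 1) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly chunk_size words.

    A sentence is never cut in half, so a chunk can exceed chunk_size when a
    single sentence (plus the carried-over overlap) is longer than the limit.
    """
    chunks, current = [], []
    for sentence in split_sentences(text):
        size = sum(len(s.split()) for s in current) + len(sentence.split())
        if current and size > chunk_size:
            chunks.append(" ".join(current))
            # carry the last sentence(s) over so neighbouring chunks share context
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
for chunk in chunk_by_sentences(text, chunk_size=6, overlap_sentences=1):
    print(chunk)
```

Every chunk ends on a sentence boundary, and each chunk after the first repeats the last sentence of the previous one, which plays the role that `chunk_overlap` plays in the real splitter.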

In [None]:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20,
)
nodes = splitter.get_nodes_from_documents(documents)

In [None]:
# # can be defined globally
# Settings.text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

# # or can be defined per-index through transformations
# index = VectorStoreIndex.from_documents(
#     documents,
#     transformations=[SentenceSplitter(chunk_size=1024, chunk_overlap=20)],
# )

In [None]:
len(nodes)

In [None]:
print(nodes[2].text)

### 2. SentenceWindowNodeParser
Splits all documents into individual sentences. The resulting nodes also contain the surrounding "window" of sentences around each node in the metadata.\
https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/
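
The structure of the resulting nodes can be illustrated without LlamaIndex. The sketch below is an assumption-level model of the behaviour described above: each sentence becomes one record, and the surrounding sentences are stored under the same metadata keys used in the cell that follows ("window" and "original_sentence"). `sentence_windows` is a hypothetical helper, and plain dictionaries stand in for node objects.

```python
def sentence_windows(sentences: list[str], window_size: int = 2) -> list[dict]:
    """Build one record per sentence, with its neighbours stored as metadata."""
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)                    # clip at the first sentence
        end = min(len(sentences), i + window_size + 1)     # clip at the last sentence
        records.append(
            {
                "text": sentence,
                "metadata": {
                    "original_sentence": sentence,
                    "window": " ".join(sentences[start:end]),
                },
            }
        )
    return records


records = sentence_windows(["A.", "B.", "C.", "D.", "E."], window_size=1)
print(records[2]["metadata"]["window"])
```

The point of this layout is that the short sentence is what gets embedded and matched, while the wider window is available afterwards to give the language model more context.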

In [47]:
import nltk
from llama_index.core.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=2,  # how many sentences on either side to capture
    window_metadata_key="window", # the metadata key that holds the window of surrounding sentences
    original_text_metadata_key="original_sentence", # the metadata key that holds the original sentence
)

In [48]:
nodes = node_parser.get_nodes_from_documents(documents)

In [None]:
len(nodes)

In [None]:
print(nodes[3])

In [None]:
print(nodes[3].text)

### 3. SemanticSplitterNodeParser
https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/
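
A rough sketch of the idea behind `breakpoint_percentile_threshold`, as far as I understand it: the distance between the embeddings of neighbouring sentences is computed, and a new chunk starts wherever that distance exceeds the given percentile of all distances. The code below is a simplified pure-Python illustration with hand-made 2-D "embeddings"; it ignores the `buffer_size` grouping of sentences and is not the LlamaIndex implementation. All function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Linearly interpolated percentile (0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low, high = math.floor(rank), math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def semantic_chunks(sentences, embeddings, breakpoint_percentile_threshold=95):
    """Group consecutive sentences, splitting where the embedding distance spikes."""
    if len(sentences) < 2:
        return [list(sentences)]
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, breakpoint_percentile_threshold)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large jump in meaning: close the current chunk
            chunks.append(current)
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = ["s0", "s1", "s2", "s3"]
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]  # made-up vectors
print(semantic_chunks(sentences, toy_embeddings))
```

Under this reading, a higher threshold (99 in the indexing cell earlier, 95 here) means fewer breakpoints and therefore fewer, larger nodes.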

In [None]:
from llama_index.core.node_parser import SemanticSplitterNodeParser

splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=Settings.embed_model
)

In [None]:
nodes = splitter.get_nodes_from_documents(documents)

In [None]:
len(nodes)

In [None]:
print(nodes[2].text)

### 4. HierarchicalNodeParser
Input is chunked into several hierarchies of chunk sizes, with each node containing a reference to its parent node. https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/ \
When combined with the AutoMergingRetriever, this enables us to automatically replace retrieved nodes with their parents when a majority of children are retrieved. https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_merging_retriever/ (the tutorial concludes that output quality is similar to the non-hierarchical approach)
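
The "replace children with their parent" step can be sketched in a few lines. This is a simplified single-level model of the merging rule described above, not the AutoMergingRetriever code: the real retriever works on node objects and a docstore and may merge repeatedly up the hierarchy. `auto_merge`, the id dictionaries, and the 0.5 default (chosen here to mean "a majority") are all assumptions for illustration.

```python
from collections import defaultdict


def auto_merge(retrieved_ids, parent_of, children_of, ratio_threshold=0.5):
    """Replace retrieved leaves by their parent when enough siblings were retrieved."""
    by_parent = defaultdict(list)
    for node_id in retrieved_ids:
        by_parent[parent_of.get(node_id)].append(node_id)

    merged = []
    for parent_id, kids in by_parent.items():
        has_parent = parent_id is not None
        if has_parent and len(kids) / len(children_of[parent_id]) > ratio_threshold:
            merged.append(parent_id)   # enough children retrieved: return the parent
        else:
            merged.extend(kids)        # otherwise keep the individual leaves
    return merged


# Hypothetical two-parent hierarchy with three leaves each
parent_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B", "b3": "B"}
children_of = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]}
print(auto_merge(["a1", "a2", "b1"], parent_of, children_of))
```

Here two of the three children of "A" were retrieved, so they are replaced by "A", while the single hit under "B" stays as a leaf.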

Chunk into parent, child, grandchild (leaf) nodes

In [None]:
from llama_index.core.node_parser import HierarchicalNodeParser

splitter = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128] # chunk size parent, child, grandchild
)

In [None]:
nodes = splitter.get_nodes_from_documents(documents)

In [None]:
len(nodes)

In [None]:
nodes[10]

Isolate grandchild nodes from root nodes

In [None]:
from llama_index.core.node_parser import get_leaf_nodes, get_root_nodes

base_nodes = get_leaf_nodes(nodes)
root_nodes = get_root_nodes(nodes)

len(base_nodes), len(root_nodes)

Load all nodes into SimpleDocumentStore and only leaf nodes into VectorStore

In [None]:
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core import StorageContext

docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore) # define storage context (will include vector store by default too)

## Load index into vector index
from llama_index.core import VectorStoreIndex

base_index = VectorStoreIndex(
    base_nodes,
    storage_context=storage_context,
)

Define Retriever

In [None]:
from llama_index.core.retrievers import AutoMergingRetriever

base_retriever = base_index.as_retriever(similarity_top_k=3)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)

# query_str = ("What is the title of the article?")
query_str = ("What is the main topic of the article?")

nodes = retriever.retrieve(query_str)
base_nodes = base_retriever.retrieve(query_str)

len(nodes), len(base_nodes)

In [None]:
from llama_index.core.response.notebook_utils import display_source_node
import matplotlib

for node in base_nodes:
    display_source_node(node, source_length=10000)

In [None]:
for node in nodes:
    display_source_node(node, source_length=10000)

---


TokenTextSplitter https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_metadata_extractor/

In [None]:
# NOTE: these seem to give the same output: node.get_content(), node.text, node.get_text()