<a href="https://colab.research.google.com/github/ChiaoYunTing/ADA-Group15/blob/main/day_seven_chunking_strategies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A Comparison of Different Chunking Strategies
This notebook demonstrates some of the different chunking strategies available in [LlamaIndex](https://docs.llamaindex.ai/).

First we need some text (writing credit: ChatGPT):

In [1]:
ram_text = "Professor Ram Gopal is a prominent academic in the field of \
  Information Systems Management, currently serving as a professor at Warwick \
  Business School, which is part of the University of Warwick. He specializes \
  in Information Systems and Analytics, focusing his research on data privacy, \
  digital platforms, and the intersection of technology and business. Notably, \
  he has been recognized for his contributions to the academic community, \
  having won an award for the best paper presented at a conference hosted by \
  Warwick Business School.Before his tenure at Warwick, Professor Gopal held \
  positions at several other prestigious institutions. He was a professor at \
  the University of Connecticut in the United States, where he significantly \
  contributed to research in management and information systems. His academic \
  journey also includes a professorship at the University of Southampton, \
  enhancing his international academic experience and influence. Throughout his \
  career, Professor Gopal has been actively involved in numerous academic and \
  professional activities. He regularly presents at major conferences and \
  seminars, contributing to the broader discourse in information systems and \
  digital innovation. His work continues to influence both academic circles and \
  industry practices globally. Cats and dogs differ in several key ways, \
  including their domestication history, behavior, communication, and physical \
  attributes. Dogs, domesticated from wolves about 15,000 years ago, are \
  generally more social and trainable. They communicate through barks, growls, \
  and body language. Cats, domesticated around 9,000 years ago for pest \
  control, are more independent and use meows, purrs, and body language to \
  communicate. Physically, dogs vary widely in size and shape, while cats are \
  more uniform. Dogs are omnivores, whereas cats are obligate carnivores, \
  requiring a meat-based diet. Typically, cats live longer than dogs and have \
  different health care needs. These distinctions make each suitable for \
  different types of households and lifestyles."

You may have noticed this isn't all (semi-hallucinated) content about Ram: I've also stuffed some ChatGPT content about cats and dogs at the end. Let's see how different splitters handle it.

In [2]:
with open('ram.txt', "w") as text_file:
  text_file.write(ram_text) # save the file

Next we will load this into LlamaIndex using one of its readers. First, though, we need to install the packages:

In [3]:
!pip install llama_index.core
!pip install llama_index.readers.file

Collecting llama_index.core
  Downloading llama_index_core-0.10.35.post1-py3-none-any.whl (15.4 MB)
Collecting dataclasses-json (from llama_index.core)
  Downloading dataclasses_json-0.6.5-py3-none-any.whl (28 kB)
Collecting deprecated>=1.2.9.3 (from llama_index.core)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting dirtyjson<2.0.0,>=1.0.8 (from llama_index.core)
  Downloading dirtyjson-1.0.8-py3-none-any.whl (25 kB)
Collecting httpx (from llama_index.core)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting llamaindex-py-client<0.2.0,>=0.1.18 (from llama_index.core)
  Downloading llamaindex_py_client-0.1.19-py3-none-any.whl (141 kB)

## Sentence Splitter
Now we can read in the text and apply the SentenceSplitter:

In [4]:
from llama_index.readers.file import FlatReader
from llama_index.core.node_parser import SentenceSplitter
from pathlib import Path # for finding the file

ram_docs = FlatReader().load_data(Path("ram.txt")) # the file we saved above, relative to the working directory

# limit each chunk to roughly 100 tokens, with no overlap between chunks
parser = SentenceSplitter(chunk_size=100, chunk_overlap=0)
ram_nodes = parser.get_nodes_from_documents(ram_docs)

Let's check the first two outputs:

In [5]:
ram_nodes[0].text

'Professor Ram Gopal is a prominent academic in the field of   Information Systems Management, currently serving as a professor at Warwick   Business School, which is part of the University of Warwick. He specializes   in Information Systems and Analytics, focusing his research on data privacy,   digital platforms, and the intersection of technology and business.'

In [6]:
ram_nodes[1].text

'Notably,   he has been recognized for his contributions to the academic community,   having won an award for the best paper presented at a conference hosted by   Warwick Business School.Before his tenure at Warwick, Professor Gopal held   positions at several other prestigious institutions. He was a professor at   the University of Connecticut in the United States, where he significantly   contributed to research in management and information systems.'

We can see the SentenceSplitter packs several sentences into each chunk (how many depends on the _chunk\_size_ hyperparameter, which is measured in tokens) but prefers to end each chunk at a sentence boundary rather than mid-sentence. Let's play with the hyperparameters:

In [7]:
# raise the chunk size to 200 tokens and let consecutive chunks overlap by up to 100 tokens
parser = SentenceSplitter(chunk_size=200, chunk_overlap=100)
new_ram_nodes = parser.get_nodes_from_documents(ram_docs)

And again we'll check the first two:

In [8]:
new_ram_nodes[0].text

'Professor Ram Gopal is a prominent academic in the field of   Information Systems Management, currently serving as a professor at Warwick   Business School, which is part of the University of Warwick. He specializes   in Information Systems and Analytics, focusing his research on data privacy,   digital platforms, and the intersection of technology and business. Notably,   he has been recognized for his contributions to the academic community,   having won an award for the best paper presented at a conference hosted by   Warwick Business School.Before his tenure at Warwick, Professor Gopal held   positions at several other prestigious institutions. He was a professor at   the University of Connecticut in the United States, where he significantly   contributed to research in management and information systems. His academic   journey also includes a professorship at the University of Southampton,   enhancing his international academic experience and influence.'

In [9]:
new_ram_nodes[1].text

'He was a professor at   the University of Connecticut in the United States, where he significantly   contributed to research in management and information systems. His academic   journey also includes a professorship at the University of Southampton,   enhancing his international academic experience and influence. Throughout his   career, Professor Gopal has been actively involved in numerous academic and   professional activities. He regularly presents at major conferences and   seminars, contributing to the broader discourse in information systems and   digital innovation. His work continues to influence both academic circles and   industry practices globally. Cats and dogs differ in several key ways,   including their domestication history, behavior, communication, and physical   attributes. Dogs, domesticated from wolves about 15,000 years ago, are   generally more social and trainable. They communicate through barks, growls,   and body language.'

The larger chunk size has had an obvious effect: each chunk now holds many more sentences. You can also see the effect of _chunk\_overlap_: the first two sentences of the second chunk also appear at the end of the first chunk.
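
To build some intuition for how _chunk\_size_ and _chunk\_overlap_ interact, here is a minimal sketch of sentence-aware chunking in plain Python. It is not LlamaIndex's implementation: the helper names `split_sentences` and `pack_sentences` are made up for illustration, sentences are found with a naive regular expression, and sizes are counted in whitespace-separated words rather than tokens.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(sentences, chunk_size, chunk_overlap=0):
    """Greedily pack whole sentences into chunks of about `chunk_size` words.

    Each new chunk is seeded with trailing sentences from the previous chunk,
    as long as they fit within `chunk_overlap` words. A single sentence longer
    than `chunk_size` is kept whole, so it can exceed the limit.
    """
    def n_words(sentence):
        return len(sentence.split())

    chunks, current = [], []
    for sentence in sentences:
        if current and sum(map(n_words, current)) + n_words(sentence) > chunk_size:
            chunks.append(" ".join(current))
            # carry over as many trailing sentences as the overlap budget allows
            carried, budget = [], chunk_overlap
            for prev in reversed(current):
                if n_words(prev) > budget:
                    break
                carried.insert(0, prev)
                budget -= n_words(prev)
            # drop carried sentences that would push the new chunk over the limit
            while carried and sum(map(n_words, carried)) + n_words(sentence) > chunk_size:
                carried.pop(0)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


toy_text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
toy_sentences = split_sentences(toy_text)

no_overlap = pack_sentences(toy_sentences, chunk_size=6, chunk_overlap=0)
with_overlap = pack_sentences(toy_sentences, chunk_size=6, chunk_overlap=3)

for chunk in with_overlap:
    print(chunk)
```

With no overlap the four toy sentences fall into two chunks. With an overlap budget of three words, each chunk starts by repeating the last sentence of the previous one, which is the same pattern we saw in the LlamaIndex output.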

## Semantic Chunking
We can also chunk based on meaning.
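
To make "meaning" a little more concrete: as far as I understand it, a semantic splitter embeds neighbouring sentences and measures how far apart consecutive embeddings are, so a large jump suggests a change of topic. The sketch below illustrates only that distance step, using made-up 2-dimensional vectors in place of real embeddings. The names `cosine_distance`, `adjacent_distances` and `toy_embeddings` are mine, not LlamaIndex's.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def adjacent_distances(embeddings):
    """Distance between each embedding and the one that follows it."""
    return [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]


# hypothetical embeddings: two sentences on one topic, then two on another
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]

distances = adjacent_distances(toy_embeddings)
largest_jump = distances.index(max(distances))
print(distances)
print(f"Largest jump is between sentence {largest_jump} and sentence {largest_jump + 1}")
```

The largest distance sits between the second and third toy vectors, which is where the made-up topic changes.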

In [10]:
!pip install llama-index-embeddings-huggingface

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

# Initialize a HuggingFace Embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Specify the embedding model into LlamaIndex's settings
Settings.embed_model = embed_model

Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.2.0-py3-none-any.whl (7.1 kB)
Collecting sentence-transformers<3.0.0,>=2.6.1 (from llama-index-embeddings-huggingface)
  Downloading sentence_transformers-2.7.0-py3-none-any.whl (171 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers<3.0.0,>=2.6.1->llama-index-embeddings-huggingface)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers<3.0.0,>=2.6.1->llama-ind

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [11]:
from llama_index.core.node_parser import SemanticSplitterNodeParser

parser = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=90, embed_model=embed_model
)

semantic_ram_nodes = parser.get_nodes_from_documents(ram_docs)

And we can inspect the results:

In [None]:
semantic_ram_nodes[0].text

'Professor Ram Gopal is a prominent academic in the field of   Information Systems Management, currently serving as a professor at Warwick   Business School, which is part of the University of Warwick. He specializes   in Information Systems and Analytics, focusing his research on data privacy,   digital platforms, and the intersection of technology and business. Notably,   he has been recognized for his contributions to the academic community,   having won an award for the best paper presented at a conference hosted by   Warwick Business School.Before his tenure at Warwick, Professor Gopal held   positions at several other prestigious institutions. He was a professor at   the University of Connecticut in the United States, where he significantly   contributed to research in management and information systems. His academic   journey also includes a professorship at the University of Southampton,   enhancing his international academic experience and influence. Throughout his   career, P

In [None]:
semantic_ram_nodes[1].text

'His work continues to influence both academic circles and   industry practices globally. '

In [None]:
semantic_ram_nodes[2].text

'Cats and dogs differ in several key ways,   including their domestication history, behavior, communication, and physical   attributes. Dogs, domesticated from wolves about 15,000 years ago, are   generally more social and trainable. They communicate through barks, growls,   and body language. Cats, domesticated around 9,000 years ago for pest   control, are more independent and use meows, purrs, and body language to   communicate. Physically, dogs vary widely in size and shape, while cats are   more uniform. Dogs are omnivores, whereas cats are obligate carnivores,   requiring a meat-based diet. Typically, cats live longer than dogs and have   different health care needs. These distinctions make each suitable for   different types of households and lifestyles.'

Interestingly, it treats the last sentence of the generated Ram text as semantically distinct and gives it a chunk of its own! It has, however, correctly identified the cats and dogs text as a separate topic.

Let's see what happens if we play with the _breakpoint\_percentile\_threshold_:

In [None]:
parser = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=50, embed_model=embed_model
)

semantic_ram_nodes = parser.get_nodes_from_documents(ram_docs)

In [None]:
semantic_ram_nodes[0].text

'Professor Ram Gopal is a prominent academic in the field of   Information Systems Management, currently serving as a professor at Warwick   Business School, which is part of the University of Warwick. He specializes   in Information Systems and Analytics, focusing his research on data privacy,   digital platforms, and the intersection of technology and business. '

In [None]:
semantic_ram_nodes[1].text

'Notably,   he has been recognized for his contributions to the academic community,   having won an award for the best paper presented at a conference hosted by   Warwick Business School.Before his tenure at Warwick, Professor Gopal held   positions at several other prestigious institutions. '

In [None]:
semantic_ram_nodes[2].text

'He was a professor at   the University of Connecticut in the United States, where he significantly   contributed to research in management and information systems. '

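We can see a lower value means the algorithm will find more splits in the data (most of the time): the threshold is a percentile of the distances between neighbouring sentences, so lowering it lets more of those distances count as breakpoints.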
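
Here is a rough sketch of why the threshold behaves this way. It assumes the splitter places a breakpoint wherever the distance between neighbouring sentences exceeds the chosen percentile of all such distances. The function name `find_breakpoints` and the numbers in `toy_distances` are invented for illustration.

```python
import numpy as np


def find_breakpoints(distances, percentile_threshold):
    """Return the positions whose distance exceeds the given percentile."""
    distances = np.asarray(distances, dtype=float)
    cutoff = np.percentile(distances, percentile_threshold)
    return [i for i, d in enumerate(distances) if d > cutoff]


# hypothetical distances between consecutive sentences
toy_distances = [0.05, 0.10, 0.08, 0.60, 0.12, 0.07, 0.30, 0.09]

strict = find_breakpoints(toy_distances, 90)  # only the biggest jump survives
loose = find_breakpoints(toy_distances, 50)   # roughly half the jumps survive

print(f"threshold 90 -> breakpoints at {strict}")
print(f"threshold 50 -> breakpoints at {loose}")
```

At the 90th percentile only the single largest jump becomes a breakpoint, while at the 50th percentile about half of the gaps do, so the text is cut into more, smaller chunks.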

## Using the Chunker and a Database
Lastly we will look at using a chunker to load data into a database.

We will create an empty database in Faiss by specifying the number of dimensions. As above, we will use the "BAAI/bge-small-en-v1.5" embedding model. The [model card](https://huggingface.co/BAAI/bge-small-en-v1.5) on HuggingFace shows this model produces 384-dimensional embeddings (you need to scroll right down on the model card), so that is the value we pass to Faiss.

In [None]:
!pip install faiss-gpu
!pip install llama-index-vector-stores-faiss

import faiss

# create the empty Faiss database
d = 384 # 384 embedding dimensions
faiss_index = faiss.IndexFlatIP(d) # inner product (equals cosine similarity when the vectors are unit length)
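
A note on the metric: `IndexFlatIP` ranks vectors by inner product, which coincides with cosine similarity only when the vectors have unit length. The small NumPy check below shows the difference on two made-up vectors. The helper names `cosine_similarity` and `l2_normalize` are my own. If you are unsure whether your embeddings are normalised, it is worth checking their norms (Faiss also provides `faiss.normalize_L2` for normalising a float32 matrix in place).

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def l2_normalize(v):
    """Scale v to unit length."""
    return v / np.linalg.norm(v)


u = np.array([3.0, 4.0], dtype=np.float32)
v = np.array([1.0, 2.0], dtype=np.float32)

raw_inner_product = float(np.dot(u, v))  # depends on the vectors' lengths
unit_inner_product = float(np.dot(l2_normalize(u), l2_normalize(v)))

print(f"raw inner product:        {raw_inner_product:.4f}")
print(f"normalised inner product: {unit_inner_product:.4f}")
print(f"cosine similarity:        {cosine_similarity(u, v):.4f}")
```

The raw inner product grows with vector length, whereas the inner product of the normalised vectors matches the cosine similarity.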

Next we will export our LlamaIndex nodes into the Faiss database:

In [None]:
from llama_index.core import (
    load_index_from_storage,
    VectorStoreIndex,
    StorageContext,
)

from llama_index.vector_stores.faiss import FaissVectorStore

# create a vector store variable
vector_store = FaissVectorStore(faiss_index=faiss_index)

# set the vector database into the storage context of LlamaIndex
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# create the Faiss database
li_index = VectorStoreIndex(ram_nodes, storage_context=storage_context)

# save index to disk
li_index.storage_context.persist()

print(f"Number of vectors in the Faiss index: {faiss_index.ntotal}")

Number of vectors in the Faiss index: 6


We will need to embed a query with the same embedding model (as we did in the previous Notebook):

In [None]:
from sentence_transformers import SentenceTransformer

# Instantiate the same embedding model that was used to embed the nodes
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Q&A prompt
qna_prompt = "what do cats eat?"

# Convert Q&A prompt to vectors
rag_embedding = model.encode(qna_prompt, show_progress_bar=True)


We can now use the encoded prompt to query our database. As the database is very small (only 6 vectors), we will set k to 3:

In [None]:
import numpy as np

# Retrieve the 3 nearest neighbours (returns similarity scores and row ids)
cs_similarity, similar = faiss_index.search(np.array([rag_embedding]), k=3)
similar = similar.flatten().tolist()

# Print the result
print(f'Top results: {similar}')

Top results: [4, 5, 3]
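
A flat Faiss index performs an exact, brute-force search: it scores the query against every stored vector and keeps the best k. The NumPy sketch below mimics that idea on a tiny made-up matrix. `top_k_inner_product`, `toy_vectors` and `toy_query` are illustrative names, not part of Faiss, and the real library is far more optimised.

```python
import numpy as np


def top_k_inner_product(vectors, query, k):
    """Score every row of `vectors` against `query` and return the best k."""
    scores = vectors @ query            # one inner product per stored vector
    top_ids = np.argsort(-scores)[:k]   # highest scores first
    return scores[top_ids], top_ids


# four hypothetical unit-length "document" vectors
toy_vectors = np.array(
    [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]], dtype=np.float32
)
toy_query = np.array([0.0, 1.0], dtype=np.float32)

toy_scores, toy_ids = top_k_inner_product(toy_vectors, toy_query, k=3)
print(f"Top results: {toy_ids.tolist()}")
print(f"Scores: {toy_scores.tolist()}")
```

The returned ids are row positions in the stored matrix, which is why the notebook can use them to index straight into `ram_nodes`: the nodes were added to the index in that order.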


Let's print these results to screen:

In [None]:
for result in similar:
  print(ram_nodes[result])
  print("\n")

Node ID: 1bc7c0bb-364b-49fd-94b0-72d22149a317
Text: Cats, domesticated around 9,000 years ago for pest   control,
are more independent and use meows, purrs, and body language to
communicate. Physically, dogs vary widely in size and shape, while
cats are   more uniform. Dogs are omnivores, whereas cats are obligate
carnivores,   requiring a meat-based diet.


Node ID: 3772acec-062f-488b-954d-9ff28349bdc9
Text: Typically, cats live longer than dogs and have   different
health care needs. These distinctions make each suitable for
different types of households and lifestyles.


Node ID: 8561593b-e901-4793-8871-44299efb2405
Text: Cats and dogs differ in several key ways,   including their
domestication history, behavior, communication, and physical
attributes. Dogs, domesticated from wolves about 15,000 years ago, are
generally more social and trainable. They communicate through barks,
growls,   and body language.




Our most similar document certainly seems to answer the question!