In [None]:
%%capture
!pip install llama-index==0.10.25 llama-index-embeddings-fastembed qdrant-client llama-index-vector-stores-qdrant llama-index-llms-cohere

In [1]:
import os
import sys
from getpass import getpass
import nest_asyncio

from IPython.display import Markdown, display

from dotenv import load_dotenv

nest_asyncio.apply()

load_dotenv("../.env")

sys.path.append('../helpers')

from utils import setup_llm, setup_embed_model, setup_vector_store

In [2]:
# .get() returns None for a missing variable, so the getpass fallback can run
CO_API_KEY = os.environ.get('CO_API_KEY') or getpass("Enter your Cohere API key: ")

In [3]:
QDRANT_URL = os.environ.get('QDRANT_URL') or getpass("Enter your Qdrant URL: ")

In [4]:
QDRANT_API_KEY = os.environ.get('QDRANT_API_KEY') or getpass("Enter your Qdrant API Key: ")

In [5]:
from llama_index.core.settings import Settings
from utils import setup_llm, setup_embed_model

setup_llm(api_key=CO_API_KEY)

setup_embed_model(provider="fastembed")


In [9]:
from utils import get_documents_from_docstore

senpai_documents = get_documents_from_docstore("../data/words-of-the-senpais")

# The cells below refer to this collection as `documents`
documents = senpai_documents

In [8]:
from datasets import load_dataset

eval_dataset = load_dataset("harpreetsahota/LI_Learning_RAG_Eval_Set", split='train')

eval_dataset = eval_dataset.filter(lambda x: x['question_groundedness_score'] is not None and x['question_groundedness_score'] >= 4)


# Optimizing Chunk Size

In this lesson, we'll explore what chunking is, how it affects the indexing and retrieval process, and how you can customize chunk size and overlap to optimize your results.

> **The Chunking Commandment:** Your goal is not to chunk for chunking sake, our goal is to get our data in a format where it can be retrieved for value later.
>
> -- Greg Kamradt, [5 Levels Of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)

## Understanding Chunking

When documents are ingested into an index, `LlamaIndex` splits them into smaller pieces called "chunks."  This process is known as chunking. By default, LlamaIndex uses a *chunk size* of 1024 and a *chunk overlap* of 20. 

But what do these numbers mean, and how do they impact the indexing and retrieval process?

### Chunk Size

The chunk size sets the maximum number of tokens that each chunk can contain. A token is a word or a piece of a word, so token counts usually run somewhat higher than word counts. With a default chunk size of 1024, `LlamaIndex` will split your documents into chunks that are no longer than 1024 tokens each.

#### **🤏 Smaller Chunk Size**

*   More precise and focused embeddings

*   Beneficial for retrieving specific information

#### **👐 Larger Chunk Size**

*   More general embeddings with broader context

*   Useful for document overviews, but may miss details

### Chunk Overlap

*   Shared tokens between adjacent chunks (default: 20)

*   Maintains context and prevents information loss

I recommend taking a look at [this chunk visualizer](https://huggingface.co/spaces/m-ric/chunk_visualizer) to get an intuitive sense for chunk size and overlap.
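To make the two numbers concrete, here is a rough back-of-the-envelope sketch. It assumes a plain fixed-stride sliding window, where each new chunk starts `chunk_size - chunk_overlap` tokens after the previous one. The function name `estimate_chunk_count` is hypothetical and not part of LlamaIndex, and real splitters also respect separators and sentence boundaries, so actual counts will differ.

```python
import math

def estimate_chunk_count(n_tokens: int, chunk_size: int = 1024, chunk_overlap: int = 20) -> int:
    """Estimate how many chunks a fixed-stride sliding window produces.

    Illustrative helper only (not a LlamaIndex API).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    # Each additional chunk advances by the stride, re-using `chunk_overlap` tokens
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((n_tokens - chunk_size) / stride)

# A 3,000-token document under the defaults, then under a smaller configuration
print(estimate_chunk_count(3000))
print(estimate_chunk_count(3000, chunk_size=128, chunk_overlap=16))
```

Shrinking the chunk size multiplies the number of chunks you have to embed and store, which is worth keeping in mind before sweeping over many configurations.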

## 🤔 The Impact of Chunk Size
 
#### **📏 Relevance and Granularity**

*   Smaller chunks (e.g., 128) offer granularity but risk missing vital information or lacking sufficient context.

*   Larger chunks (e.g., 512) are more likely to capture necessary context, but also run the risk of including irrelevant information.

*   Faithfulness and Relevancy metrics help assess response quality. 

#### **🎯 Chunk Size and Use Case**

*   **Question Answering:** Shorter, specific chunks for precise answers.

*   **Summarization:** Longer chunks to capture the overall context.

#### **⏳ Response Generation Time**

*   Larger chunks provide more context but may slow down the system.

*   Balancing comprehensiveness with speed is crucial.
    
#### **⚖️ Finding the Optimal Size**

*   Testing various chunk sizes is essential for specific use cases and datasets. 

*   Balancing information capture with efficiency is key.

### Considerations When Customizing Chunk Size

When deciding on a chunk size, there are a few things to keep in mind:

| Factor | Description |
|--------|-------------|
| 📄 **Data Characteristics** | The optimal chunk size depends on the data you're indexing. Long, detailed documents may require a larger chunk size to capture more context. A smaller chunk size may be more appropriate for short, focused passages. |
| 🔍 **Retrieval Requirements** | If you need to retrieve very specific details, a smaller chunk size may be better. If you're looking for more general information, a larger chunk size may suffice. |
| 🔢 **Similarity Parameters** | When using a smaller chunk size with a `VectorStoreIndex`, you may also want to increase the `similarity_top_k` parameter. This parameter determines how many of the most similar chunks are retrieved for each query. Each smaller chunk carries less text, so you may need to retrieve more of them to cover the same amount of information. |
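One way to reason about the similarity parameter: the amount of retrieved text handed to the LLM is at most about `chunk_size × similarity_top_k` tokens. The sketch below keeps that budget roughly constant across chunk sizes. The helper name `top_k_for_budget` and the 2,048-token budget are assumptions made for illustration, not LlamaIndex defaults or APIs.

```python
import math

def top_k_for_budget(chunk_size: int, target_context_tokens: int) -> int:
    """Smallest top_k whose retrieved chunks can total the target token budget.

    Hypothetical helper for illustration; not part of LlamaIndex.
    """
    if chunk_size <= 0 or target_context_tokens <= 0:
        raise ValueError("chunk_size and target_context_tokens must be positive")
    return math.ceil(target_context_tokens / chunk_size)

# Assumed budget: about 2,048 tokens of retrieved context per query
for size in [128, 256, 512, 1024]:
    print(f"chunk_size={size:>4} -> similarity_top_k={top_k_for_budget(size, 2048)}")
```

The resulting value is what you would pass as `similarity_top_k` when building a query engine. Treat it as a starting point and let the evaluation metrics decide the final setting.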

### There are [various methods](https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules) you can use to chunk your documents. 

| Parser Type | Splitter Name | Description |
|-------------|---------------|-------------|
| 📁 File-Based Node Parsers | 📄`SimpleFileNodeParser` | The simplest flow: `FlatFileReader` + `SimpleFileNodeParser`, which automatically uses the best node parser for each type of content. You may then want to chain the file-based node parser with a text-based node parser to account for the actual length of the text. |
| | 🌐`HTMLNodeParser` | This node parser uses beautifulsoup to parse raw HTML. By default, it will parse a select subset of HTML tags, but you can override this. The default tags are: ["p", "h1", "h2", "h3", "h4", "h5", "h6", "li", "b", "i", "u", "section"] |
| | 🎭`JSONNodeParser` | The `JSONNodeParser` parses raw JSON. |
| | 📝`MarkdownNodeParser` | The `MarkdownNodeParser` parses raw markdown text. |
| ✂️ Text-Splitters | 💻`CodeSplitter` | Splits raw code-text based on the language it is written in. |
| | 🦜🔗`LangchainNodeParser` | You can also wrap any existing text splitter from langchain with a node parser. |
| | 📜`SentenceSplitter` | The `SentenceSplitter` attempts to split text while respecting the boundaries of sentences. |
| | 🪟`SentenceWindowNodeParser` | The `SentenceWindowNodeParser` splits all documents into individual sentences. The resulting nodes also contain the surrounding "window" of sentences around each node in the metadata.|
| | 🧠`SemanticSplitterNodeParser` | Instead of chunking text with a fixed chunk size, the semantic splitter adaptively picks the breakpoint in-between sentences using embedding similarity. This ensures that a "chunk" contains sentences that are semantically related to each other. |
| | 🪙`TokenTextSplitter` | The `TokenTextSplitter` attempts to split to a consistent chunk size according to raw token counts. |
| 🔗 Relation-Based Node Parsers | 🌿`HierarchicalNodeParser` | This node parser will chunk nodes into hierarchical nodes. This means a single input will be chunked into several hierarchies of chunk sizes, with each node containing a reference to its parent node. |


## We're only going to focus on a few strategies

I'll show you how to split/chunk text using each method below.

Then, I'll randomly select one configuration from the strategies below and we'll evaluate it against our evaluation set.

That's just one evaluation in total; you can do the rest on your own.

 - 🪙`TokenTextSplitter`
 
 - 📜`SentenceSplitter`

 - 🪟`SentenceWindowNodeParser`

 - 🧠`SemanticSplitterNodeParser`

# 🪙 `TokenTextSplitter`

Its primary function is to divide a given text into smaller chunks, ensuring each chunk stays within a specified token limit.

### **How it Works**

1.  **Tokenization:** It utilizes a tokenizer to break down the text into individual tokens (words or subwords).  The default tokenizer is the `tiktoken` tokenizer for GPT-3.5-Turbo.

2.  **Chunking:** It then groups these tokens into chunks, ensuring each chunk's size is within the defined `chunk_size` limit. 

3.  **Overlap Handling:** To maintain context and coherence between chunks, it can incorporate an overlap, specified by `chunk_overlap`, where the last few tokens of one chunk are repeated at the beginning of the next.
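The three steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the `TokenTextSplitter` implementation: it assumes whitespace-separated words stand in for tokens (the real splitter counts `tiktoken` tokens and also honors separators), and the function name `split_by_tokens` is hypothetical.

```python
def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Sliding-window chunking over whitespace 'tokens' (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # 1. Tokenization: whitespace split is a stand-in for a real tokenizer
    tokens = text.split()

    # 2. Chunking: take windows of at most `chunk_size` tokens
    # 3. Overlap: advance by less than a full window so neighbors share tokens
    stride = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks

for chunk in split_by_tokens("a b c d e f g h i j", chunk_size=4, chunk_overlap=1):
    print(repr(chunk))
```

Notice that each chunk begins with the last token of the previous one, and that chunk boundaries ignore sentence structure entirely.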

### Arguments you need to know

*   **`chunk_size`**: Controls the maximum token count for each chunk. Defaults to 1024.

*   **`chunk_overlap`**: Determines the number of overlapping tokens between consecutive chunks. Defaults to 20.

*   **`separator`**: Specifies the primary character used to split the text into words. Defaults to space (`" "`).

*   **`backup_separators`**: Provides additional characters for splitting if the primary separator isn't sufficient. Defaults to new line character (`"\n"`).

*   **`include_metadata`**: Enables or disables the inclusion of metadata within each chunk. Defaults to `True`.

*   **`include_prev_next_rel`**: Enables or disables tracking the relationship between nodes. Defaults to `True`.

### Usage Example

The basic usage pattern is as follows. You don't need to pass any arguments if you want to keep the default values:

```python
from llama_index.core.node_parser import TokenTextSplitter

splitter = TokenTextSplitter()

nodes = splitter.get_nodes_from_documents(documents)
```


Let's evaluate the impact that varying the chunk size has on our `ragas` metrics. I'll limit our exploration to `chunk_sizes = [128, 256, 512, 1024]` and hold `chunk_overlap` fixed at 16 tokens.

In [25]:
print(documents[42].text)

Set a very high hourly aspirational rate for yourself and stick to it. It should seem and feel absurdly high. If it doesnt, its not high enough. Whatever you picked, my advice to you would be to raise it. Like I said, for myself, even before I had money, for the longest time I used $5,000 an hour. And if you extrapolate that out into what it looks like as an annual salary, its multiple millions of dollars per year. Ironically, I actually think Ive beaten it. Im not the hardest working personIm actually a lazy person. I work through bursts of energy where Im really motivated with something. If I actually look at how much Ive earned per actual hour that Ive put in, its probably quite a bit higher than that. Can you expand on your statement, If you secretly despise wealth, it will elude you? If you get into a relative mindset, youre always going to hate people who do better than you, youre always going to be jealous or envious of them. Theyll sense those feelings when you try and do busin

In [34]:
from llama_index.core.node_parser import TokenTextSplitter
TokenTextSplitter(chunk_size=64, chunk_overlap=16).split_text(documents[42].text)

['Set a very high hourly aspirational rate for yourself and stick to it. It should seem and feel absurdly high. If it doesnt, its not high enough. Whatever you picked, my advice to you would be to raise it. Like I said, for myself, even before I had money, for the longest time',
 'I said, for myself, even before I had money, for the longest time I used $5,000 an hour. And if you extrapolate that out into what it looks like as an annual salary, its multiple millions of dollars per year. Ironically, I actually think Ive beaten it. Im not the hardest working',
 'year. Ironically, I actually think Ive beaten it. Im not the hardest working personIm actually a lazy person. I work through bursts of energy where Im really motivated with something. If I actually look at how much Ive earned per actual hour that Ive put in, its probably quite a bit higher than that. Can you',
 'that Ive put in, its probably quite a bit higher than that. Can you expand on your statement, If you secretly despise we

In [12]:
from llama_index.core.node_parser import TokenTextSplitter

def token_splitter(chunk_size, documents):
    splitter = TokenTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=16,
        )
    nodes = splitter.get_nodes_from_documents(documents)
    return nodes

In [13]:
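# Chunk sizes under comparison; chunk_overlap stays at 16 inside the splitter helpers
chunk_sizes = [128, 256, 512, 1024]

token_splitter_results = {}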

# Iterate over each chunk size and perform token splitting
for size in chunk_sizes:
    key = f"token_split_chunk_size_{size}"
    token_splitter_results[key] = token_splitter(size, documents)

In [17]:
for key, value in token_splitter_results.items():
    print(f"With {key} we get {len(value)} chunks.")

With token_split_chunk_size_128 we get 8669 chunks.
With token_split_chunk_size_256 we get 3812 chunks.
With token_split_chunk_size_512 we get 1938 chunks.
With token_split_chunk_size_1024 we get 1884 chunks.


# 📜`SentenceSplitter`

The `SentenceSplitter` class, as its name suggests, specializes in splitting text while trying to keep complete sentences and paragraphs together. This is in contrast to the `TokenTextSplitter`, which focuses on token limits.

### How it Works

1. **Initial Splitting:**

    *   The text is first divided into paragraphs using the specified `paragraph_separator` (defaults to triple newline characters `"\n\n\n"`).

    *   Each paragraph is then further split using a "chunking tokenizer" (defaults to [`PunktSentenceTokenizer`](https://www.nltk.org/api/nltk.tokenize.PunktSentenceTokenizer.html) from the `nltk` library), which looks for sentence boundaries.

    *   If these methods don't yield enough splits, it resorts to a backup regex and the default separators (`CHUNKING_REGEX = "[^,.;。？！]+[,.;。？！]?"`).

2. **Chunking with Sentence Awareness:**

    *   The resulting splits are grouped into chunks, keeping sentences together as much as possible. 

    *   It considers the `is_sentence` flag for each split during this process.

    *   Chunk size and overlap still play a role, but sentence boundaries are given preference.

3. **Overlap Handling:**

    *   Similar to `TokenTextSplitter`, it incorporates overlap between chunks to maintain context. 

    *   However, it prioritizes using the last complete sentence for overlap rather than just the last few tokens.
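The sentence-aware grouping and overlap described above can be sketched as follows. This is a simplified illustration and not the `SentenceSplitter` source: it assumes a naive regex for sentence boundaries instead of `PunktSentenceTokenizer`, counts whitespace-separated words as tokens, and lets a single over-long sentence form its own oversized chunk rather than splitting it further. The names `split_sentences` and `pack_sentences` are hypothetical.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive boundary rule: '.', '!' or '?' followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def pack_sentences(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Greedily pack whole sentences into chunks, overlapping by whole sentences."""
    def n_tokens(s: str) -> int:
        return len(s.split())  # words stand in for tokens

    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        length = n_tokens(sentence)
        if current and current_len + length > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing *whole* sentences into the next chunk as overlap
            carried, carried_len = [], 0
            for prev in reversed(current):
                if carried_len + n_tokens(prev) > chunk_overlap:
                    break
                carried.insert(0, prev)
                carried_len += n_tokens(prev)
            # Skip the overlap if it would leave no room for the new sentence
            if carried_len + length > chunk_size:
                carried, carried_len = [], 0
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += length
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "One two three. Four five. Six seven eight nine. Ten."
for chunk in pack_sentences(text, chunk_size=6, chunk_overlap=2):
    print(repr(chunk))
```

In this toy run the sentence "Four five." closes the first chunk and opens the second, while no chunk ever starts or ends mid-sentence.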

### Arguments you need to know

*   **`chunk_size`**: The target token size for each chunk.

*   **`chunk_overlap`**: The number of overlapping tokens between chunks.

*   **`separator`**: The default separator for splitting (e.g., space).

*   **`paragraph_separator`**: The string used to identify paragraph breaks.

*   **`secondary_chunking_regex`**: A backup regex for splitting if the primary methods are insufficient.

### Usage Example

```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=256, chunk_overlap=50)

nodes = splitter.get_nodes_from_documents(documents)
```

### When to Use SentenceSplitter

*   When preserving complete sentences and paragraphs is essential for understanding the context.

*   When dealing with text where sentence boundaries are meaningful (e.g., legal documents, narratives).

*   When you want to avoid having broken sentences at the beginning or end of chunks.

In [35]:
from llama_index.core.node_parser import SentenceSplitter

SentenceSplitter(chunk_size=64, chunk_overlap=16).split_text(documents[42].text)

['Set a very high hourly aspirational rate for yourself and stick to it. It should seem and feel absurdly high. If it doesnt, its not high enough. Whatever you picked, my advice to you would be to raise it.',
 'Whatever you picked, my advice to you would be to raise it. Like I said, for myself, even before I had money, for the longest time I used $5,000 an hour.',
 'And if you extrapolate that out into what it looks like as an annual salary, its multiple millions of dollars per year. Ironically, I actually think Ive beaten it. Im not the hardest working personIm actually a lazy person. I work through bursts of energy where Im really motivated with something.',
 'I work through bursts of energy where Im really motivated with something. If I actually look at how much Ive earned per actual hour that Ive put in, its probably quite a bit higher than that. Can you expand on your statement, If you secretly despise wealth, it will elude you?',
 'If you get into a relative mindset, youre always

In [36]:
def sentence_splitter(chunk_size, documents):
    splitter = SentenceSplitter(
        chunk_size=chunk_size,
        chunk_overlap=16,
        )
    nodes = splitter.get_nodes_from_documents(documents)
    return nodes

In [39]:
sentence_splitter_results = {}

# Iterate over each chunk size and perform sentence splitting
for size in chunk_sizes:
    key = f"sentence_split_chunk_size_{size}"
    sentence_splitter_results[key] = sentence_splitter(size, documents)

In [40]:
for key, value in sentence_splitter_results.items():
    print(f"With {key} we get {len(value)} chunks.")

With sentence_split_chunk_size_128 we get 9150 chunks.
With sentence_split_chunk_size_256 we get 4018 chunks.
With sentence_split_chunk_size_512 we get 1937 chunks.
With sentence_split_chunk_size_1024 we get 1884 chunks.


### Recap: `TokenTextSplitter` vs `SentenceSplitter`

`TokenTextSplitter` splits the text into chunks based on a specified number of tokens. It uses a tokenizer to break down the text into individual tokens (words or subwords), and then groups these tokens into chunks of a specified size. If the text doesn't divide evenly into the specified chunk size, the last chunk will contain the remaining tokens, which may be fewer than the specified chunk size.

`SentenceSplitter`, on the other hand, splits the text into chunks based on sentences. It uses a sentence boundary detection algorithm to identify where sentences begin and end, and then groups these sentences into chunks. The size of these chunks can vary depending on the length of the sentences.

In [None]:
import random

# Pool every configuration produced above, then randomly select one key to evaluate
chunk_size_results = {**token_splitter_results, **sentence_splitter_results}
random_key = random.choice(list(chunk_size_results.keys()))
print(f"Randomly selected key: {random_key}")
