# 3.3. Chunking Experiment

The search solution consists of both **ingestion** and **retrieval**; neither works without the other. While the other experiments focus on data retrieval, ingestion is equally important to the effectiveness of the search solution.

<!-- Certain aspects of data ingestion need to be experimented as part of the experimentation phase: -->


When processing data, splitting the source documents into chunks requires care and expertise: the chunks must be small enough to be effective during fact retrieval, yet large enough to provide sufficient context during summarization.

```{note}
Our goal here is not to identify which chunking strategy is the “best” in general but rather to demonstrate how various choices of chunking may have a non-trivial impact on the ultimate outcome from the retrieval-augmented-generation solution.
```

[Learnings from other engagements](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingSearch.md#learnings-from-engagements-1)

<!-- https://vectara.com/blog/grounded-generation-done-right-chunking/#:~:text=In%20the%20context%20of%20Grounded%20Generation%2C%20chunking%20is,find%20natural%20segments%20like%20complete%20sentences%20or%20paragraphs. -->

## Why Chunking Size Matters

As noted in the [Azure AI Search documentation](https://learn.microsoft.com/en-us/azure/search/semantic-search-overview), the models used to generate embedding vectors have maximum limits on the text fragments provided as input. For example, the maximum length of input text for the Azure OpenAI embedding models is **8,191** tokens. Given that each token is around 4 characters of text for common OpenAI models, this limit is equivalent to around 6,000 words of text. If you're using these models to generate embeddings, the input text must stay under the limit. Partitioning your content into chunks ensures that your data can be processed by the large language models (LLMs) used for indexing and queries.
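As a rough illustration of the limit described above, the following sketch estimates a token count from the character length and checks it against the 8,191-token limit. The helper names are hypothetical, and the 4-characters-per-token rule is only the approximation quoted in the text; an actual tokenizer would give exact counts.

```python
# Rough token budgeting, assuming ~4 characters per token (a heuristic, not a tokenizer).
MAX_EMBEDDING_TOKENS = 8191  # input limit quoted above for Azure OpenAI embedding models
CHARS_PER_TOKEN = 4  # approximation from the text; real counts vary with the content


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its character length."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_embedding_limit(text: str, limit: int = MAX_EMBEDDING_TOKENS) -> bool:
    """Return True if the estimated token count is within the limit."""
    return estimate_tokens(text) <= limit


short_text = "Chunking keeps each input under the embedding model's limit."
long_text = "word " * 10_000  # 50,000 characters

print(estimate_tokens(short_text), fits_embedding_limit(short_text))
print(estimate_tokens(long_text), fits_embedding_limit(long_text))
```

Text that fails this check would need to be split into chunks before it is sent to the embedding model.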

**Relevance and Granularity**: A small chunk size, such as 128 tokens, yields more granular chunks. This granularity carries a risk: vital information might not be among the top retrieved chunks, especially if the similarity `top_k` setting is as restrictive as 2. Conversely, a chunk size of 512 tokens is more likely to capture all the necessary information within the top chunks, so that answers to queries are readily available. To assess this trade-off, we use the _Faithfulness_ and _Relevancy_ metrics, which measure the absence of ‘hallucinations’ and the ‘relevancy’ of responses based on the query and the retrieved contexts, respectively.

**Response Generation Time**: As `chunk_size` increases, so does the volume of information passed to the LLM to generate an answer. While this can provide a more comprehensive context, it might also slow down the system, so it is crucial to check that the added depth does not compromise responsiveness.

In essence, determining the optimal `chunk_size` is about striking a balance: capturing all essential information without sacrificing speed. Thorough testing with various sizes is needed to find a configuration that suits the specific use case and dataset.
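The trade-off above can be made concrete with a back-of-the-envelope sketch. It assumes a hypothetical corpus of 10,000 tokens, a fixed `top_k` of 2, and simple sliding-window chunking; the function names are illustrative and not part of any library. It shows how the number of chunks shrinks and the amount of context sent to the LLM grows as the chunk size increases.

```python
import math


def count_chunks(n_tokens: int, chunk_size: int, chunk_overlap: int = 0) -> int:
    """Number of sliding-window chunks needed to cover n_tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return 1 + math.ceil((n_tokens - chunk_size) / step)


def max_context_tokens(chunk_size: int, top_k: int) -> int:
    """Upper bound on the retrieved tokens passed to the LLM per query."""
    return chunk_size * top_k


corpus_tokens = 10_000  # hypothetical corpus size
top_k = 2  # the restrictive setting mentioned above

for chunk_size in (128, 256, 512, 1024):
    n_chunks = count_chunks(corpus_tokens, chunk_size)
    context = max_context_tokens(chunk_size, top_k)
    share = context / corpus_tokens
    print(f"chunk_size={chunk_size}: {n_chunks} chunks, "
          f"up to {context} context tokens ({share:.1%} of the corpus)")
```

The numbers only bound the amount of text involved; whether the retrieved chunks actually contain the answer still has to be measured with metrics such as faithfulness and relevancy.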

Source: [Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex](https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5)

Example code: https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/data-chunking/textsplit-data-chunking-example.ipynb

Read [Common Chunking Technique](https://learn.microsoft.com/en-us/azure/search/semantic-search-overview), [Content overlap considerations](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents#content-overlap-considerations), [Simple example of how to create chunks with sentences](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents#content-overlap-considerations)

CODE: https://github.com/microsoft/rag-openai/blob/438999a5470bef7946fa1c8714ed1090e1ed40c3/samples/searchEvaluation/customskills/utils/chunker/text_chunker.py


## Chunk the Data


In [1]:
%%capture --no-display
%pip install langchain-community==0.0.18
# %pip install langchain-core==0.1.20
%pip install unstructured==0.12.3
%pip install unstructured-client==0.17.0
%pip install langchain==0.1.5

In [2]:
import tqdm
import glob
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain.text_splitter import MarkdownTextSplitter
import os

# Code also https://github.com/microsoft/rag-openai/blob/438999a5470bef7946fa1c8714ed1090e1ed40c3/samples/searchEvaluation/customskills/utils/chunker/text_chunker.py

## Create chunks


In [3]:
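def load_documents_from_folder(path, totalNumberOfDocuments=200) -> list:
    """Load up to `totalNumberOfDocuments` files matching the glob pattern `path`.

    Returns a list with one entry per file; each entry is the list of
    documents produced by `UnstructuredFileLoader.load()`.
    """
    print("Loading documents...")
    markdown_documents = []
    for file in tqdm.tqdm(glob.glob(path, recursive=True)):
        # Stop once the requested number of documents has been loaded
        if len(markdown_documents) >= totalNumberOfDocuments:
            break
        loader = UnstructuredFileLoader(file)
        markdown_documents.append(loader.load())
    # Always return the list, even when fewer files than the limit were found
    return markdown_documents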

In [4]:
import json


def create_chunks_and_save_to_file(
    path_to_output, totalNumberOfDocuments=200, chunk_size=300, chunk_overlap=30
) -> list:
    try:
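        if os.path.exists(path_to_output):
            print(f"Chunks already created at: {path_to_output} ")
            # Return the previously saved chunks instead of None
            with open(path_to_output, "r") as f:
                return json.load(f)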

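        # Build the glob pattern with os.path.join to avoid invalid escape
        # sequences and to keep the path portable across operating systems
        documents = load_documents_from_folder(
            os.path.join("..", "data", "docs", "**", "*.md"), totalNumberOfDocuments
        )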

        print("Creating chunks...")
        markdown_splitter = MarkdownTextSplitter.from_tiktoken_encoder(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap
        )
        lengths = {}
        all_chunks = []
        chunk_id = 0
        for document in tqdm.tqdm(documents):
            current_chunks_text_list = markdown_splitter.split_text(
                document[0].page_content
            )  # output = ["content chunk1", "content chunk2", ...]

            for i, chunk in enumerate(
                current_chunks_text_list
            ):  # (0, "content chunk1"), (1, "content chunk2"), ...
                current_chunk_dict = {
                    "chunkId": f"chunk{chunk_id}_{i}",
                    "chunkContent": chunk,
                    "source": document[0].metadata["source"],
                }
                all_chunks.append(current_chunk_dict)

            chunk_id += 1

            n_chunks = len(current_chunks_text_list)
            # lengths = {[Number of chunks]: [number of documents with that number of chunks]}
            if n_chunks not in lengths:
                lengths[n_chunks] = 1
            else:
                lengths[n_chunks] += 1

        with open(path_to_output, "w") as f:
            json.dump(all_chunks, f)
        print("Chunks created: ", lengths)
        return all_chunks
    except Exception as e:
        print(f"Error creating chunks: {e}")
        # all_chunks may not exist if the failure happened early
        return []

Create the chunks

Note:

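- we are only chunking the first `totalNumberOfDocuments` documents from `..\data\docs\**\*.md`
- `chunk_size` is the maximum number of tokens a chunk should have
- `chunk_overlap` is the maximum number of tokens shared between two consecutive chunks, not a percentage; with `chunk_size=300`, a value of 30 corresponds to an overlap of roughly 10%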
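To make the two parameters concrete, here is a minimal sliding-window sketch over a list of tokens. It is an illustration only: `split_with_overlap` is a hypothetical helper, and it is not how `MarkdownTextSplitter` works internally, since that splitter first breaks the text on Markdown separators and then merges the pieces, so its chunk boundaries will differ.

```python
def split_with_overlap(tokens: list, chunk_size: int, chunk_overlap: int) -> list:
    """Split `tokens` into windows of at most `chunk_size` tokens,
    where consecutive windows share `chunk_overlap` tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop as soon as the current window reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


tokens = [f"t{i}" for i in range(10)]  # stand-in for real tokenizer output
for chunk in split_with_overlap(tokens, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so the last token of one chunk is repeated as the first token of the next.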


In [5]:
totalNumberOfDocuments = 200
path_to_chunks_output = f"./output/chunks-solution-ops-{totalNumberOfDocuments}.json"
chunks = create_chunks_and_save_to_file(
    path_to_chunks_output, totalNumberOfDocuments, chunk_size=300, chunk_overlap=30
)

Chunks already created at: ./output/chunks-solution-ops-200.json 
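Once the chunks exist, it can be useful to check how they are distributed across the source documents. The sketch below works on a few hypothetical in-memory records that follow the `chunkId` / `chunkContent` / `source` shape written by the cell above; the same function could be applied to the list loaded from the JSON output file. `summarize_chunks` is an illustrative name, not part of the workshop helpers.

```python
from collections import Counter

# Hypothetical records in the same shape as the saved chunk file
sample_chunks = [
    {"chunkId": "chunk0_0", "chunkContent": "Intro ...", "source": "docs/a.md"},
    {"chunkId": "chunk0_1", "chunkContent": "Setup ...", "source": "docs/a.md"},
    {"chunkId": "chunk0_2", "chunkContent": "Usage ...", "source": "docs/a.md"},
    {"chunkId": "chunk1_0", "chunkContent": "Overview ...", "source": "docs/b.md"},
    {"chunkId": "chunk2_0", "chunkContent": "Notes ...", "source": "docs/c.md"},
]


def summarize_chunks(chunks: list) -> tuple:
    """Return (chunks per source, number of documents per chunk count)."""
    chunks_per_source = Counter(chunk["source"] for chunk in chunks)
    # Same idea as the `lengths` dictionary: {number of chunks: number of documents}
    documents_per_chunk_count = Counter(chunks_per_source.values())
    return dict(chunks_per_source), dict(documents_per_chunk_count)


per_source, histogram = summarize_chunks(sample_chunks)
print(per_source)
print(histogram)
```

A histogram dominated by documents with a single chunk may suggest that `chunk_size` is large relative to the documents, while very long tails point to a few large files that dominate the index.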


In this workshop, to keep our experiments separate, we follow the _Full Reindex_ strategy and create a new index for each experiment.
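One way to keep such indexes apart is to encode the chunking parameters in the index name, following the pattern of the `solution-ops-chunking-300-30` name used in this workshop. The helper below is a hypothetical convenience, not part of the workshop's `search.ipynb` helpers, and the second parameter pair is only an example.

```python
def index_name_for(prefix: str, chunk_size: int, chunk_overlap: int) -> str:
    """Build an index name that records the chunking configuration."""
    return f"{prefix}-chunking-{chunk_size}-{chunk_overlap}"


# Each chunking configuration gets its own index, so results never mix
for chunk_size, chunk_overlap in [(300, 30), (500, 50)]:
    print(index_name_for("solution-ops", chunk_size, chunk_overlap))
```

With one index per configuration, evaluation runs can be compared side by side, and an earlier index stays available as a baseline.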


In [6]:
%run -i ./helpers/search.ipynb

# 1. Create the new index
new_index_name = "solution-ops-chunking-300-30"
create_index(new_index_name)

# 2. Generate embeddings for the new chunks
generated_embeddings_path = f"./output/chunks-solution-ops-embedded-{totalNumberOfDocuments}.json"
generate_embeddings_for_chunks_and_save_to_file(path_to_chunks_file=path_to_chunks_output, path_to_output=generated_embeddings_path)

# 3. Upload the embeddings to the new index
upload_data(file_path=generated_embeddings_path, search_index_name=new_index_name)


In [None]:
# # Semantic chunking
# # https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-retrieval-augumented-generation?view=doc-intel-4.0.0

# # Using SDK targeting 2023-10-31-preview, make sure your resource is in one of these regions: East US, West US2, West Europe
# # pip install azure-ai-documentintelligence==1.0.0b1
# # pip install langchain langchain-community azure-ai-documentintelligence

# from azure.ai.documentintelligence import DocumentIntelligenceClient

# endpoint = "https://<my-custom-subdomain>.cognitiveservices.azure.com/"
# key = "<api_key>"

# from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
# from langchain.text_splitter import MarkdownHeaderTextSplitter

# # Initiate Azure AI Document Intelligence to load the document. You can either specify file_path or url_path to load the document.
# loader = AzureAIDocumentIntelligenceLoader(file_path="<path to your file>", api_key = key, api_endpoint = endpoint, api_model="prebuilt-layout")
# docs = loader.load()

# # Split the document into chunks based on markdown headers.
# headers_to_split_on = [
#     ("#", "Header 1"),
#     ("##", "Header 2"),
#     ("###", "Header 3"),
# ]
# text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# docs_string = docs[0].page_content
# splits = text_splitter.split_text(docs_string)
# splits
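The idea behind header-based splitting in the commented-out cell above can be sketched without any SDK. The following is a simplified, standard-library illustration, not the implementation of `MarkdownHeaderTextSplitter`: `split_by_headers` is a hypothetical function that handles only `#` to `###` headers and does not treat fenced code blocks specially.

```python
import re

# Matches Markdown headers of level 1 to 3, e.g. "## Setup"
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown_text: str) -> list:
    """Split Markdown into sections, keeping the active headers as metadata."""
    sections = []
    current_headers = {}  # header level -> header title
    current_lines = []

    def flush():
        content = "\n".join(current_lines).strip()
        if content:
            sections.append({"headers": dict(current_headers), "content": content})

    for line in markdown_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the section that ended at this header
            current_lines.clear()
            level = len(match.group(1))
            # A new header replaces headers at the same or a deeper level
            for old_level in [lvl for lvl in current_headers if lvl >= level]:
                del current_headers[old_level]
            current_headers[level] = match.group(2).strip()
        else:
            current_lines.append(line)
    flush()
    return sections


sample = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n# Appendix\nExtra."
for section in split_by_headers(sample):
    print(section)
```

Because each chunk carries the headers it sits under, the chunk text can later be enriched or filtered by section, which is the main appeal of this kind of structure-aware chunking.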

# Built-in skillset: SplitSkill


Upload the files to a storage account so that we can create an indexer:
https://github.com/microsoft/rag-openai/blob/438999a5470bef7946fa1c8714ed1090e1ed40c3/samples/searchEvaluation/upload_files.py
