# 3.2. Chunking Experiment

The search solution consists of both **ingestion** and **retrieval**; one does not exist without the other. While other experiments focus on data retrieval, ingestion is equally important to the effectiveness of the search solution. In this experiment, we will look at various chunking strategies.

## Experiment Overview

There are multiple methods of ingestion, depending on the type of data. For example, unstructured data such as documents or web pages can be split into chunks and embedded into vectors, while structured data such as tables or databases can be summarized or converted into natural language. In our case, since we are working with unstructured text data, we will look into two chunking strategies: fixed-size chunking and semantic chunking.

```{note}
Our goal here is not to identify which chunking strategy is the “best” in general but rather to demonstrate how the choice of chunking may have a non-trivial impact on the ultimate outcome from the RAG solution.
```

<!-- Certain aspects of data ingestion need to be experimented as part of the experimentation phase: -->


📝**Hypothesis**

Exploratory hypothesis: "Can introducing a new chunking strategy improve the system's performance?"

🎯 **Evaluation Metrics**

For this experiment we will look at Accuracy and Cosine Similarity to compare the performance.

📊 **Data**

In this experiment, the data that we would like to chunk consists of the first 200 documents from the Solution Ops Playbook.

<!-- The metrics used for document retrieval evaluation. -->

<!-- [Learnings fromm other engagements](https://github.com/microsoft/rag-openai/blob/main/topics/RAG_EnablingSearch.md#learnings-from-engagements-1) -->

<!-- https://vectara.com/blog/grounded-generation-done-right-chunking/#:~:text=In%20the%20context%20of%20Grounded%20Generation%2C%20chunking%20is,find%20natural%20segments%20like%20complete%20sentences%20or%20paragraphs. -->

<!-- https://towardsdatascience.com/how-to-chunk-text-data-a-comparative-analysis-3858c4a0997a -->

<!-- https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5

Example code: https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/data-chunking/textsplit-data-chunking-example.ipynb

Read [Common Chunking Technique](https://learn.microsoft.com/en-us/azure/search/semantic-search-overview), [Content overlap considerations](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents#content-overlap-considerations), [Simple example of how to create chunks with sentences](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-chunk-documents#content-overlap-considerations)

CODE: https://github.com/microsoft/rag-openai/blob/438999a5470bef7946fa1c8714ed1090e1ed40c3/samples/searchEvaluation/customskills/utils/chunker/text_chunker.py -->


## Setup

Import necessary libraries


In [38]:
%run -i ./pre-requisites.ipynb

Let's load our documents from the Solution Ops Playbook.


In [39]:
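import glob
import os

from langchain_community.document_loaders import UnstructuredFileLoader


def load_documents_from_folder(path, totalNumberOfDocuments=200) -> list:
    """Load at most `totalNumberOfDocuments` files matching the glob pattern `path`."""
    print("Loading documents...")
    markdown_documents = []
    for file in glob.glob(path, recursive=True):
        # Stop as soon as the requested number of documents has been loaded
        if len(markdown_documents) >= totalNumberOfDocuments:
            break
        loader = UnstructuredFileLoader(file)
        # loader.load() returns a list of Document objects for this file
        markdown_documents.append(loader.load())
    # Always return the list, even when fewer files than requested were found
    print("Finished loading documents")
    return markdown_documents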

In [40]:
totalNumberOfDocuments = 200
documents = load_documents_from_folder(
    r"..\data\docs\**\*.md", totalNumberOfDocuments)

Loading documents...
Finished loading documents


## 1. Fixed-size chunking strategy

This is one of the most basic forms of splitting up text. It simply divides the text into chunks of N characters (or tokens), regardless of their content or form. This method isn't recommended for any application, but it's a great starting point for us to understand the basics.

**[Why Chunking Size Matters](https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5)**

When processing data, splitting the source documents into chunks requires care and expertise: the resulting chunks must be small enough to be effective during fact retrieval, yet large enough to provide sufficient context during summarization. The models used to generate embedding vectors have maximum limits on the text fragments provided as input. For example, the maximum length of input text for the Azure OpenAI embedding models is **8,191** tokens. Given that each token is around 4 characters of text for common OpenAI models, this maximum limit is equivalent to around 6,000 words of text. If you're using these models to generate embeddings, it's critical that the input text stays under the limit. Partitioning your content into chunks ensures that your data can be processed by the large language models (LLMs) used for indexing and queries.
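
As a rough illustration of the arithmetic above, the sketch below estimates token counts with the 4-characters-per-token rule of thumb and checks them against the 8,191-token limit. The helper names `estimate_tokens` and `fits_embedding_limit` are hypothetical, and the estimate is only a heuristic; a real tokenizer would be needed for exact counts.

```python
import math

MAX_EMBEDDING_TOKENS = 8191  # limit quoted above for Azure OpenAI embedding models
CHARS_PER_TOKEN = 4          # rule-of-thumb average, not an exact figure


def estimate_tokens(text: str) -> int:
    """Approximate the token count of `text` from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_embedding_limit(text: str, limit: int = MAX_EMBEDDING_TOKENS) -> bool:
    """Return True if the estimated token count stays within `limit`."""
    return estimate_tokens(text) <= limit


short_text = "chunk " * 100     # 600 characters
long_text = "chunk " * 10_000   # 60,000 characters

print(estimate_tokens(short_text), fits_embedding_limit(short_text))
print(estimate_tokens(long_text), fits_embedding_limit(long_text))
```

Under this heuristic the short text is estimated at 150 tokens and fits, while the long text is estimated at 15,000 tokens and would have to be chunked before embedding.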

**Relevance and Granularity**: A small chunk size, like 128, yields more granular chunks. This granularity, however, presents a risk: vital information might not be among the top retrieved chunks, especially if the similarity _top_k_ setting is as restrictive as 2. Conversely, a chunk size of 512 is likely to encompass all necessary information within the top chunks, ensuring that answers to queries are readily available. To navigate this, the linked article employs the _Faithfulness and Relevancy_ metrics. These measure the absence of ‘hallucinations’ and the ‘relevancy’ of responses based on the query and the retrieved contexts respectively.

**Response Generation Time**: As the chunk_size increases, so does the volume of information directed into the LLM to generate an answer. While this can ensure a more comprehensive context, it might also slow down the system. Ensuring that the added depth doesn't compromise the system's responsiveness is crucial.

In essence, determining the optimal chunk_size is about striking a balance: capturing all essential information without sacrificing speed. It's vital to undergo thorough testing with various sizes to find a configuration that suits the specific use case and dataset.
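
To make this trade-off concrete, here is a minimal sketch that estimates, for an idealized sliding-window splitter, how many chunks a document produces and how many context tokens reach the LLM per query. The function names, the 3,000-token document length, the 10% overlap, and `top_k=2` are all illustrative assumptions; real splitters that respect Markdown structure will produce somewhat different counts.

```python
import math


def estimated_chunk_count(n_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Number of windows an idealized sliding-window splitter would produce."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if n_tokens <= chunk_size:
        return 1
    # Each window after the first contributes (chunk_size - chunk_overlap) new tokens
    return math.ceil((n_tokens - chunk_overlap) / (chunk_size - chunk_overlap))


def context_tokens(chunk_size: int, top_k: int) -> int:
    """Upper bound on retrieved-context tokens passed to the LLM per query."""
    return chunk_size * top_k


document_tokens = 3000  # hypothetical document length
for size in (128, 256, 512, 1024):
    overlap = size // 10  # assume roughly 10% overlap
    print(
        f"chunk_size={size}: "
        f"{estimated_chunk_count(document_tokens, size, overlap)} chunks, "
        f"up to {context_tokens(size, top_k=2)} context tokens"
    )
```

Smaller chunks mean more index entries and less context per query; larger chunks mean the opposite, which is the balance described above.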

- **Pros**: Easy & Simple
- **Cons**: Very rigid and doesn't take into account the structure of your text

Concepts to know:

- **Chunk Size** - The length you would like your chunks to have, for example 50, 100, or 100,000. Length is often measured in characters; the splitter used below is created with `from_tiktoken_encoder`, so it measures length in tokens.
- **Chunk Overlap** - The amount by which consecutive chunks overlap. Overlap helps avoid cutting a single piece of context into multiple pieces, at the cost of duplicating data across chunks.
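
The two concepts can be sketched in a few lines of plain Python. This is a simplified, character-based illustration with a hypothetical helper name (`fixed_size_chunks`); it is not how LangChain's splitters are implemented, since those also try to respect separators and can count tokens instead of characters.

```python
def fixed_size_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into windows of `chunk_size` characters overlapping by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in fixed_size_chunks(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

With `chunk_size=10` and `chunk_overlap=3`, each chunk starts with the last three characters of the previous one (`hij`, `opq`, `vwx`), which is the duplicated data mentioned above.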


Let's load LangChain's `MarkdownTextSplitter` to split the text for us


In [2]:
from langchain.text_splitter import MarkdownTextSplitter

Let's create the text splitter. You need to specify the `chunk_size` and `chunk_overlap`.


In [3]:
chunk_size = 5
chunk_overlap = 3
markdown_splitter = MarkdownTextSplitter.from_tiktoken_encoder(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap)

We can split a text with the `split_text` function. Let's take a sample text:


In [4]:
text = "This is the text I would like to chunk up. It is the example text for this exercise"

In [5]:
current_chunks_text_list = markdown_splitter.split_text(text)
current_chunks_text_list

['This is the text I',
 'the text I would like',
 'I would like to chunk',
 'like to chunk up.',
 'chunk up. It is',
 'It is the example text',
 'the example text for this',
 'text for this exercise']

### 👩‍💻 Create a function to chunk data from Solution-Ops using Fixed-Size Chunking Strategy


In [9]:
import json
import os

import tqdm


def create_chunks_and_save_to_file(
    documents, path_to_output, chunk_size=300, chunk_overlap=30
) -> dict:
    # lengths = {[Number of chunks]: [number of documents with that number of chunks]}
    # Defined before the try block so the final return never hits an undefined name
    lengths = {}
    try:
        if os.path.exists(path_to_output):
            print(f"Chunks already created at: {path_to_output} ")
            return lengths

        print("Creating chunks...")
        markdown_splitter = MarkdownTextSplitter.from_tiktoken_encoder(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap
        )
        all_chunks = []
        # Despite its name, chunk_id counts documents; i counts chunks within a document
        chunk_id = 0
        for document in tqdm.tqdm(documents):
            # Each entry in `documents` is the list returned by loader.load()
            current_chunks_text_list = markdown_splitter.split_text(
                document[0].page_content
            )
            source = document[0].metadata["source"]
            for i, chunk in enumerate(current_chunks_text_list):
                current_chunk_dict = {
                    "chunkId": f"chunk{chunk_id}_{i}",
                    "chunkContent": chunk,
                    "source": source,
                }
                all_chunks.append(current_chunk_dict)

            chunk_id += 1

            # Update the histogram of chunks per document
            n_chunks = len(current_chunks_text_list)
            lengths[n_chunks] = lengths.get(n_chunks, 0) + 1

        with open(path_to_output, "w") as f:
            json.dump(all_chunks, f)
        # print(f"Chunks created: ", lengths)
    except Exception as e:
        print(f"Error creating chunks: {e}")
    return lengths

Create the chunks

Note:

- we are only chunking the first `totalNumberOfDocuments` documents from `..\data\docs\**\*.md`
- `chunk_size` is the maximum number of tokens a chunk should have
- `chunk_overlap` is the number of tokens shared by two consecutive chunks (here 30, that is, 10% of `chunk_size`)


In [10]:
totalNumberOfDocuments = 200
chunk_size = 300
chunk_overlap = 30
fixed_chunks_output_prefix = "fixed-size-chunks-solution-ops"

path_to_chunks_output = f"./output/generated/{fixed_chunks_output_prefix}-{totalNumberOfDocuments}-{chunk_size}-{chunk_overlap}.json"
print(path_to_chunks_output)


chunks = create_chunks_and_save_to_file(
    documents, path_to_chunks_output, chunk_size=chunk_size, chunk_overlap=chunk_overlap)

./output/generated/fixed-size-chunks-solution-ops-200-300-30.json
Chunks already created at: ./output/generated/fixed-size-chunks-solution-ops-200-300-30.json 


<!-- In this workshop, to separate our experiments, we will take the _Full Reindex_ strategy by creating a new index -->


In [12]:
# %run -i ./helpers/search.ipynb

# # 1. Create the new index
# fixed_chunking_index_name = "solution-ops-fixed-chunking-300-30"
# create_index(fixed_chunking_index_name)

# # 2. Generate embeddings for the new chunks
# generated_embeddings_path = f"./output/generated/{fixed_chunks_output_prefix}-embedded-{totalNumberOfDocuments}-{chunk_size}-{chunk_overlap}.json"
# generate_embeddings_for_chunks_and_save_to_file(path_to_chunks_file=path_to_chunks_output, path_to_output=generated_embeddings_path)

# # 3. Upload the embeddings to the new index
# upload_data(file_path=generated_embeddings_path, search_index_name=fixed_chunking_index_name)

Index: 'solution-ops-fixed-chunking-300-30' created or updated
Embeddings were already created for chunked data at: ./output/generated/fixed-size-chunks-solution-ops-200-300-30.json 
Uploaded 991 documents to Index: solution-ops-fixed-chunking-300-30


## 2. [Semantic Chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker)

Azure: https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept-retrieval-augumented-generation?view=doc-intel-4.0.0#semantic-chunking


In the previous approach, we picked a constant chunk size more or less arbitrarily, without taking the actual content or structure of the document into account. In this section, we will look at [Semantic Chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker) from LangChain, which splits the text based on semantic similarity.

For insights on what it is doing, you can have a look at [Level 4. Semantic Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb).
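
The general idea of splitting on semantic similarity can be sketched as follows: embed each sentence, measure the distance between consecutive sentences, and start a new chunk wherever that distance is large. This is a deliberately simplified illustration, not LangChain's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the fixed `threshold` of 0.8 is an assumption chosen for this example.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(text: str, threshold: float = 0.8) -> list[str]:
    """Group consecutive sentences; break where the distance exceeds `threshold`."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine_distance(embed(previous), embed(current)) > threshold:
            chunks.append([current])    # topic shift: start a new chunk
        else:
            chunks[-1].append(current)  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sample_text = (
    "Batch ingestion loads data in groups. "
    "Batch ingestion jobs run on a schedule. "
    "Stream processing handles events continuously. "
    "Stream processing needs low latency."
)
for chunk in semantic_chunks(sample_text):
    print(chunk)
```

Here the two sentences about batch ingestion end up in one chunk and the two about stream processing in another, because the second and third sentences share no words. Unlike fixed-size chunking, the chunk boundaries follow the content rather than a length limit.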


In [13]:
%run -i ./pre-requisites.ipynb
from langchain_openai.embeddings import AzureOpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

# Embedding model used to measure the semantic similarity between sentences
embeddings = AzureOpenAIEmbeddings(
    azure_deployment="embeddings",
    openai_api_version="2023-05-15",
)

# Load a single sample document from the playbook
with open("../data/docs/code-with-dataops/capabilities/analytical-systems/data-ingestion/batch-stream-ingestion/index.md", encoding="utf-8") as f:
    sample_document = f.read()

text_splitter = SemanticChunker(embeddings)

# Split the sample document and print the second chunk
docs = text_splitter.create_documents([sample_document])
print(docs[1].page_content)

This group is either determined by a specific time interval or a certain size limit. Stream ingestion deals with continuous, unbounded datasets. That said, most of the current stream ingestion approaches use mini-batches to ingest data as it reduces the number of I/O operations. ## Batch Ingestion

Traditionally data ingestion has been done in batches due to the limitations of the legacy systems. It still remains a popular way to ingest data for the simplicity of its implementation. Almost every ETL tool supports batch ingestion. ### Batch Ingestion Architectural Patterns

#### Pull

Most batch ingestion data pipelines will connect to a source system and pull data from it at regular interval. There are two common patterns for a batch job to load data from a source system, full load and delta load. - _Full load_: The job will load all the records from the source table. - _Delta load_: The job will load only the new records added since the last execution of the job. For this approach to 

### 👩‍💻 Create a function to chunk data from Solution-Ops using Semantic Chunker


In [14]:
import json
import os


def create_semantic_chunks_and_save_to_file(documents, path_to_output) -> list:
    # Defined before the try block so the final return never hits an undefined name
    all_chunks = []
    try:
        if os.path.exists(path_to_output):
            print(f"Chunks already created at: {path_to_output} ")
            # Nothing new was created, so an empty list is returned
            return all_chunks
        chunk_id = 0
        # azure_openai_embedding_deployment is assumed to be set by ./pre-requisites.ipynb
        embeddings = AzureOpenAIEmbeddings(
            azure_deployment=azure_openai_embedding_deployment,
            openai_api_version="2023-05-15",
        )

        text_splitter = SemanticChunker(embeddings)
        for document in documents:
            # Each entry in `documents` is the list returned by loader.load()
            content = document[0].page_content
            source = document[0].metadata["source"]

            splitted_documents = text_splitter.create_documents(
                [content])
            for i, splitted_content in enumerate(splitted_documents):
                current_chunk_dict = {
                    "chunkId": f"chunk{chunk_id}_{i}",
                    "chunkContent": splitted_content.page_content,
                    "source": source,
                }
                # Here chunk_id is a running counter over all chunks
                chunk_id += 1
                all_chunks.append(current_chunk_dict)
        with open(path_to_output, "w") as f:
            json.dump(all_chunks, f)
    except Exception as e:
        print(f"Error creating chunks: {e}")
    return all_chunks

In [15]:
totalNumberOfDocuments = 200
semantic_chunks_output_prefix = "semantic-chunks-solution-ops"
path_to_semantic_chunks_output = f"./output/generated/{semantic_chunks_output_prefix}-{totalNumberOfDocuments}.json"
print(path_to_semantic_chunks_output)

chunks = create_semantic_chunks_and_save_to_file(
    documents, path_to_semantic_chunks_output)

./output/generated/semantic-chunks-solution-ops-200.json
Chunks already created at: ./output/generated/semantic-chunks-solution-ops-200.json 


<!-- # Use the built-in skillset: [SplitSkill](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-textsplit) -->


## 📈 Evaluation

In this workshop, to separate our experiments, we will take the **Full Reindex** strategy and we will create a new index per chunking strategy.

Therefore, for each chunking strategy we will:

1. Create a new index. Note: make sure to give a relevant name.
2. Embed the chunks that you have previously created.
3. Populate the index with chunks.

```{note}
You can reuse available functions from [./helpers/search.ipynb](./helpers/search.ipynb), such as: *create_index* and *upload_data*. By running the next cell, all the functions from search.ipynb will become available.
```


In [37]:
%%capture --no-display
%run -i ./helpers/search.ipynb

Sample code for creating a new index and uploading the data:


In [None]:
%run -i ./helpers/search.ipynb

# 1. Create the new index
fixed_chunking_index_name = "solution-ops-fixed-chunking-300-30"
create_index(fixed_chunking_index_name)

# 2. Generate embeddings for the new chunks
generated_embeddings_path = f"./output/generated/{fixed_chunks_output_prefix}-embedded-{totalNumberOfDocuments}-{chunk_size}-{chunk_overlap}.json"
generate_embeddings_for_chunks_and_save_to_file(path_to_chunks_file=path_to_chunks_output, path_to_output=generated_embeddings_path)

# 3. Upload the embeddings to the new index
upload_data(file_path=generated_embeddings_path, search_index_name=fixed_chunking_index_name)

In [None]:
%run -i ./helpers/search.ipynb

# 1. Create the new index
semantic_chunking_index_name = "solution-ops-semantic-chunking-200"
create_index(semantic_chunking_index_name)

# 2. Generate embeddings for the new chunks
generated_embeddings_path = f"./output/generated/{semantic_chunks_output_prefix}-embedded-{totalNumberOfDocuments}.json"
generate_embeddings_for_chunks_and_save_to_file(path_to_chunks_file=path_to_semantic_chunks_output, path_to_output=generated_embeddings_path)

# 3. Upload the embeddings to the new index
upload_data(file_path=generated_embeddings_path, search_index_name=semantic_chunking_index_name)

### Evaluation Dataset

Note: The evaluation dataset can be found at [solution-ops-200-qa.json](./output/qa/evaluation/solution-ops-200-qa.json). The format is:

```json
"user_prompt": "", # The question
"output_prompt": "", # The answer
"context": "", # The relevant piece of information from a document
"chunk_id": "", # The ID of the chunk
"source": "" # The path to the document, i.e. "..\\data\\docs\\code-with-dataops\\index.md"
```
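
Before running an evaluation, it can help to check that every record carries the fields listed above. The sketch below assumes the file holds a JSON array of such records (which is how the evaluation code iterates over it); the helper `missing_fields` and the sample record are hypothetical and not taken from the real dataset.

```python
import json

REQUIRED_FIELDS = ("user_prompt", "output_prompt", "context", "chunk_id", "source")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent from `record`."""
    return [field for field in REQUIRED_FIELDS if field not in record]


# Hypothetical record, for illustration only
raw = r"""
[
  {
    "user_prompt": "What is batch ingestion?",
    "output_prompt": "Ingesting data in groups at regular intervals.",
    "context": "Traditionally data ingestion has been done in batches.",
    "chunk_id": "chunk0_1",
    "source": "..\\data\\docs\\code-with-dataops\\index.md"
  }
]
"""

records = json.loads(raw)
for index, record in enumerate(records):
    problems = missing_fields(record)
    print(index, "OK" if not problems else f"missing: {problems}")
```

Note that backslashes in `source` must be escaped in the JSON file, and that the decoded path has to match the `source` stored with each chunk exactly for a retrieval to count as correct.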


Let us configure the path to the evaluation dataset and reload the environment variables.


In [21]:
path_to_evaulation_dataset = "./output/qa/evaluation/solution-ops-200-qa.json"
%run -i ./pre-requisites.ipynb

### Evaluation Metrics

<!-- `Retrieval_evaluation` function goes through our evaluation dataset and verifies for each question if the retrieved documents include the expected document. -->


In [41]:
%run -i ./helpers/search.ipynb

from statistics import mean, median
import os
import numpy as np
from numpy.linalg import norm

Let's define the evaluation metrics:
- Cosine similarity: we will calculate the similarity between the first retrieved chunk and the expected chunk. We will look at the mean and median cosine similarity across our evaluation dataset.
- Accuracy: we will calculate how many times the system returned the expected document, and by document we mean the actual path to the markdown file.

In [34]:
def calculate_cosine_similarity(expected_document_vector, retrieved_document_vector):
    cosine_sim = np.dot(expected_document_vector, retrieved_document_vector) / \
        (norm(expected_document_vector)*norm(retrieved_document_vector))
    return float(cosine_sim)
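
To build intuition for the values this metric produces, here is a small worked example. It is a dependency-free sketch of the same formula (dot product divided by the product of the vector norms); the function name `cosine_similarity` and the two-dimensional vectors are illustrative, whereas real embedding vectors have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of `a` and `b` divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # unrelated vectors
diagonal = cosine_similarity([1.0, 0.0], [1.0, 1.0])        # 45 degrees apart

print(round(same_direction, 4), round(orthogonal, 4), round(diagonal, 4))
```

Parallel vectors score approximately 1, orthogonal vectors score 0, and vectors 45 degrees apart score about 0.7071. The magnitude of the vectors does not matter, only their direction.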

In [28]:
def calculate_metrics(evaluation_data_path, search_index_name, embedding_function=oai_query_embedding):
    """ Evaluate the retrieval performance of the search index using the evaluation data set.
    Args:
    evaluation_data_path (str): The path to the evaluation data set.
    embedding_function (function): The function to use for embedding the question.
    search_index_name (str): The name of the search index to use for retrieval.

    Returns:
    list: The cosine similarities between the expected documents and the top retrieved documents.
    """
    if not os.path.exists(evaluation_data_path):
        print(
            f"The path to the evaluation data set {evaluation_data_path} does not exist. Please check the path and try again."
        )
        return
    nr_correctly_retrieved_documents = 0
    nr_qa = 0
    cosine_similarities = []

    with open(evaluation_data_path, "r", encoding="utf-8") as file:
        evaluation_data = json.load(file)
        for data in evaluation_data:
            user_prompt = data["user_prompt"]
            expected_document = data["source"]
            expected_document_vector = embedding_function(data["context"])

            # 1. Search in the index
            search_response = search_documents(
                search_index_name=search_index_name,
                input=user_prompt,
                embedding_function=embedding_function,
            )

            retrieved_documents = [response["source"]
                                   for response in search_response]
            top_retrieved_document = search_response[0]["chunkContentVector"]

            # 2. Calculate cosine similarity between the expected document and the top retrieved document
            cosine_similarity = calculate_cosine_similarity(
                expected_document_vector, top_retrieved_document)
            cosine_similarities.append(cosine_similarity)

            # 3. If the expected document is part of the retrieved documents,
            # we will consider it correctly retrieved
            if expected_document in retrieved_documents:
                nr_correctly_retrieved_documents += 1

            nr_qa += 1
    accuracy = (nr_correctly_retrieved_documents / nr_qa)*100
    # Use the function parameter rather than a global variable for the index name
    print(
        f"Accuracy: {accuracy}% of the documents were correctly retrieved from Index {search_index_name}.")

    return cosine_similarities
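
The accuracy part of this function boils down to a hit rate: the share of questions whose expected source appears among the retrieved sources. The sketch below isolates that calculation on made-up data so it can be run without a search index; the `hit_rate` helper, the file names, and the similarity values are all hypothetical.

```python
from statistics import mean, median


def hit_rate(expected_sources: list[str], retrieved_sources: list[list[str]]) -> float:
    """Percentage of questions whose expected source is among the retrieved sources."""
    if not expected_sources:
        raise ValueError("at least one question is required")
    hits = sum(
        expected in retrieved
        for expected, retrieved in zip(expected_sources, retrieved_sources)
    )
    return hits / len(expected_sources) * 100


# Hypothetical results for three questions (illustrative values only)
expected = ["a.md", "b.md", "c.md"]
retrieved = [["a.md", "x.md"], ["y.md", "z.md"], ["c.md", "a.md"]]
similarities = [0.9, 0.7, 0.8]

print(f"Accuracy: {hit_rate(expected, retrieved):.1f}%")
print(f"Mean: {mean(similarities):.2f}, median: {median(similarities):.2f}")
```

Two of the three expected sources are retrieved, giving an accuracy of about 66.7%. With only a handful of questions, a single hit or miss moves the accuracy by a large step, so small evaluation sets should be interpreted with care.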

### 👩‍💻 1. Evaluate the fixed-size chunking strategy


In [35]:
# TODO: Replace this with the name of the index you want to evaluate
index_name = fixed_chunking_index_name

cosine_similarities = calculate_metrics(
    evaluation_data_path=path_to_evaulation_dataset,
    search_index_name=index_name,
)

avg_score = mean(cosine_similarities)
print(f"Avg score:{avg_score}")
median_score = median(cosine_similarities)
print(f"Median score: {median_score}")

Accuracy: 33.33333333333333% of the documents were correctly retrieved from Index solution-ops-fixed-chunking-300-30.
Avg score:0.8343853292433316
Median score: 0.8167772607332786


### 👩‍💻 2. Evaluate the semantic chunking strategy


In [36]:
# TODO: Replace this with the name of the index you want to evaluate
index_name = semantic_chunking_index_name

cosine_similarities = calculate_metrics(
    evaluation_data_path=path_to_evaulation_dataset,
    search_index_name=index_name,
)

avg_score = mean(cosine_similarities)
print(f"Avg score:{avg_score}")
median_score = median(cosine_similarities)
print(f"Median score: {median_score}")

Accuracy: 66.66666666666666% of the documents were correctly retrieved from Index solution-ops-semantic-chunking-200.
Avg score:0.8404864049201174
Median score: 0.8171080900560082


## 💡 Conclusions
