# Chunking: an integral part of building a RAG system

Dividing text into chunks before embedding is an integral and highly impactful step, because it dictates what information goes into each vector and therefore what can be found during search. Chunks should be of an appropriate size: chunks that are too small lose context, while chunks that are too big are non-specific, which ultimately hurts the retrieval of query-specific information. Excessive context can also lead to hallucination and reduce LLM performance. In addition, the chunk size should not exceed the context length of the embedder, or else the excess text is cut off and its information is lost; this is called **truncation**. Chunk size therefore affects the quality and performance of both retrieval and generation.

The simplest strategy is to **fix the length of the chunk**. Fixed-length chunking divides the document into chunks based on a predetermined number of characters or tokens (common choices are 100 or 256 tokens, or 500 characters). The size should be chosen according to the type of document. This is the cheapest and easiest approach to implement.
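As a minimal sketch of fixed-length chunking by characters, the function below slices a string into consecutive pieces of at most `chunk_size` characters. The function name `fixed_size_chunks` and the chunk size of 40 are illustrative assumptions, not part of any library.

```python
def fixed_size_chunks(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sentence = (
    "Nepal is a landlocked country and therefore it has no access to ocean "
    "making it difficult to import electronic goods from China directly."
)
chunks = fixed_size_chunks(sentence, chunk_size=40)
for number, chunk in enumerate(chunks, start=1):
    print(f"Chunk {number}: {chunk!r}")
```

Because the cut points ignore word and sentence boundaries, some chunks will start or end in the middle of a word, which is the price paid for the simplicity of this strategy.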

We can use **random-size chunking** when the collection is non-homogeneous; varying the chunk length can potentially capture more semantic context.
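One possible reading of random-size chunking is sketched below: each chunk length is drawn uniformly between a minimum and a maximum number of characters. The function name, the size bounds, and the use of a seed for reproducibility are all assumptions made for illustration.

```python
import random


def random_size_chunks(text, min_size, max_size, seed=None):
    """Split text into consecutive chunks with lengths drawn uniformly from [min_size, max_size]."""
    if not 0 < min_size <= max_size:
        raise ValueError("need 0 < min_size <= max_size")
    rng = random.Random(seed)  # a local generator keeps the split reproducible for a given seed
    chunks = []
    start = 0
    while start < len(text):
        size = rng.randint(min_size, max_size)  # both bounds are inclusive
        chunks.append(text[start:start + size])
        start += size
    return chunks


sample = "Chunks of varying length can suit a collection of very different documents. " * 5
chunks = random_size_chunks(sample, min_size=50, max_size=120, seed=0)
print([len(chunk) for chunk in chunks])
```

Only the final chunk can be shorter than `min_size`, since it takes whatever text remains.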

## Chunking without overlap

For example,
```
"Nepal is a landlocked country and therefore it has no access to ocean making it difficult to import electronic goods from China directly."

is divided into 

Chunk 1: Nepal is a landlocked country
Chunk 2: and therefore it has no access
Chunk 3: to ocean making it difficult to
Chunk 4: import electronic goods from China directly
```

This works well when there are clear boundaries between chunks, for example when the context changes drastically between adjacent chunks. That is rare, however, so in most cases the lack of overlap destroys context at the chunk boundaries.

In [1]:
from transformers import BertTokenizer
from IPython.display import HTML
from langchain_text_splitters import TokenTextSplitter


In [2]:
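# BERT tokenizer, used further below only to color individual tokens.
# Note: TokenTextSplitter counts tokens with tiktoken by default, not with this tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_splitter = TokenTextSplitter(
    chunk_size=50,
    chunk_overlap=0
)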

In [3]:
def color_text_chunks(text, text_splitter):
    """Split text with text_splitter and render each chunk with its own background color."""
    docs = text_splitter.create_documents([text])
    chunks = [doc.page_content for doc in docs]
    
    colored_text = ""
    colors = ["#ff9999","#66ffff", "#99ff99", "#ff66ff"]
    
    for i, chunk in enumerate(chunks):
        color = colors[i % len(colors)]
        chunk_html = f"<span style='background-color: {color};color:black'>{chunk}</span>"
        colored_text += chunk_html + "<br/>"
    
    return HTML(colored_text)

In [4]:
hemingway = """
Today is only one day in all the days that will ever be. But what will happen in all the other days that ever come can depend on what you do today. It's been that way all this year. It's been that way so many times. All of war is that way.
"""

In [5]:
color_text_chunks(hemingway, text_splitter)


## Chunking with overlap

Chunking with overlap can be achieved with a sliding window that maintains an overlap between consecutive chunks. This ensures that the system has contextual information at the chunk boundaries, which gives better semantic context and increases the chance that relevant information is found when it spans multiple chunks. It is a more expensive strategy, because the document is divided into more chunks, which increases the number of entries in the database. The overlapping text is also stored redundantly, so the overlap should be no more than a small percentage of the chunk size.
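The sliding window can be sketched in plain Python as follows. This is a simplified illustration that works on any list of tokens, not the implementation used by `TokenTextSplitter`; the function names and the 400-token stand-in document are assumptions. The second function counts the resulting chunks, which makes the extra storage cost of overlap concrete.

```python
import math


def sliding_window_chunks(tokens, chunk_size, overlap):
    """Return windows of chunk_size tokens, each sharing `overlap` tokens with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def expected_chunk_count(n_tokens, chunk_size, overlap):
    """Number of windows needed to cover n_tokens tokens."""
    if n_tokens <= chunk_size:
        return 1 if n_tokens > 0 else 0
    return math.ceil((n_tokens - overlap) / (chunk_size - overlap))


tokens = [f"t{i}" for i in range(400)]  # stand-in for a tokenized document
with_overlap = sliding_window_chunks(tokens, chunk_size=80, overlap=20)
without_overlap = sliding_window_chunks(tokens, chunk_size=80, overlap=0)
print(len(with_overlap))     # 7
print(len(without_overlap))  # 5
```

With these numbers, a 25% overlap turns five database entries into seven for the same document, which is why the overlap is usually kept small.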

In [6]:
from collections import defaultdict

In [7]:
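def color_overlap_text_chunks(text, chunk_size, overlap):
    """Render chunks token by token, shading overlap regions with a blended color.

    The splitter counts tiktoken tokens while the display uses BERT tokens,
    so the shaded overlap is only approximate.
    """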
    text_splitter = TokenTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )
    docs = text_splitter.create_documents([text])
    chunks = [doc.page_content for doc in docs]  # Extract the text of each chunk
    
    colors = ['#ff9999', '#99ff99', '#9999ff', '#ffff99', '#99ffff', '#ff99ff', '#cccccc',
              '#ff6666', '#66ff66', '#6666ff', '#ffff66', '#66ffff', '#ff66ff', '#ccccff',
              '#996699', '#669999', '#999966', '#669966', '#966696', '#696669']

    color_positions = []
    for i in range(len(chunks)):
        chunk_tokens = tokenizer.tokenize(chunks[i])
        chunk_length = len(chunk_tokens)
        if chunk_length > overlap:
            unique_length = chunk_length - overlap
            chunk_colors = [colors[i % len(colors)]] * unique_length + \
                           [blend_colors([colors[i % len(colors)], colors[(i+1) % len(colors)]])] * overlap
        else:
            chunk_colors = [blend_colors([colors[i % len(colors)], colors[(i+1) % len(colors)]])] * chunk_length
        color_positions.append(chunk_colors)

    for i in range(1, len(color_positions)):
        overlap_color = blend_colors([colors[(i-1) % len(colors)], colors[i % len(colors)]])
        color_positions[i] = [overlap_color] * overlap + color_positions[i][overlap:]

    colored_text = ""
    for i, chunk in enumerate(chunks):
        tokens = tokenizer.tokenize(chunk)
        for j, token in enumerate(tokens):
            color = color_positions[i][j]
            token_html = f'<span style="background-color:{color}; color:black">{token}</span>'
            colored_text += token_html + ' '
        colored_text += '<br><br>'

    return HTML(colored_text)

def blend_colors(color_list):
    """Average a list of '#rrggbb' hex colors channel by channel and return the result as hex."""
    r, g, b = 0, 0, 0
    for color in color_list:
        c = int(color[1:], 16)
        r += c >> 16
        g += (c >> 8) & 0xff
        b += c & 0xff
    n = len(color_list)
    r = min(r // n, 255)
    g = min(g // n, 255)
    b = min(b // n, 255)
    return f'#{r:02x}{g:02x}{b:02x}'

In [8]:
text = ''' That night at the hotel, in our room with the long empty hall outside and our shoes outside the door, a thick carpet on the floor of the room, outside the windows the rain falling and in the room light and pleasant and cheerful, then the light out and it exciting with smooth sheets and the bed comfortable, feeling that we had come home, feeling no longer alone, waking in the night to find the other one there, and not gone away; all other things were unreal. We slept when we were tired and if we woke the other one woke so no one was not alone. Often a man wishes to be alone and a girl wishes to be alone too and if they love each other they are jealous of that in each other, but I can truly say we never felt that. We could feel alone when we were together, alone against the others. It has only happened to me like that once. I have been alone while I was with many girls and that is the way you can be most lonely. But we were never lonely and never afraid when we were together. I know that the night is not the same as the day: that all things are different, that the things of the night cannot be explained in the day, because they do not then exist, and the night can be a dreadful time for lonely people once their loneliness has started. But with Catherine there was almost no difference in the night except that it was an even better time. If people bring so much courage to this world the world has to kill them to break them, so of course it kills them. The world breaks every one and afterward many are strong at the broken places. But those that will not break it kills. It kills the very good and the very gentle and the very brave impartially. If you are none of these you can be sure it will kill you too but there will be no special hurry.'''

In [9]:
chunk_size = 80
overlap = 20

In [11]:
color_overlap_text_chunks(text, chunk_size, overlap)

## Character-based chunking: splitting on newlines up to a character limit

In [18]:
from langchain_text_splitters import CharacterTextSplitter

# Split on newlines; chunk_size is measured in characters, not tokens.
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=256,
    chunk_overlap=0
)


In [47]:
def text_based_chunk(text, text_splitter):
    docs = text_splitter.create_documents([text])
    chunks = [doc.page_content for doc in docs]
    
    colored_text_based_chunk  = ""
    colors = ["#B2EBF2","#F48FB1", "#C8E6C9","#FFCDD2", "#B39DDB", "#B2DFDB", "#F06292", "#FFAB40", "#C5E1A5", "#FF8C94"]
    
    for i, chunk in enumerate(chunks):
        color = colors[i % len(colors)]
        chunk_html = f'<span style="background-color:{color}; color:black">{chunk}</span>'
        colored_text_based_chunk += chunk_html + "<br/>"
        
    return HTML(colored_text_based_chunk)

In [48]:
a_farewell_to_arms = """That night at the hotel, in our room with the long empty hall outside and our shoes outside the door, 
a thick carpet on the floor of the room, outside the windows the rain falling and in the room light and pleasant and cheerful, then the light out and it exciting with smooth sheets and the bed comfortable, 
feeling that we had come home, feeling no longer alone, waking in the night to find the other one there, and not gone away; all other things were unreal. 
We slept when we were tired and if we woke the other one woke so no one was not alone. 
Often a man wishes to be alone and a girl wishes to be alone too and if they love each other they are jealous of that in each other, but I can truly say we never felt that. 
We could feel alone when we were together, alone against the others. 
It has only happened to me like that once. 
I have been alone while I was with many girls and that is the way you can be most lonely. 
But we were never lonely and never afraid when we were together. 
I know that the night is not the same as the day: that all things are different, that the things of the night cannot be explained in the day, 
because they do not then exist, 
and the night can be a dreadful time for lonely people once their loneliness has started. 
But with Catherine there was almost no difference in the night except that it was an even better time. 
If people bring so much courage to this world the world has to kill them to break them, so of course it kills them. 
The world breaks every one and afterward many are strong at the broken places. 
But those that will not break it kills. 
It kills the very good and the very gentle and the very brave impartially. 
If you are none of these you can be sure it will kill you too but there will be no special hurry.
"""

In [50]:
text_based_chunk(a_farewell_to_arms, text_splitter)
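To make the idea behind separator-based chunking concrete, here is a simplified sketch: the text is split on the separator and the pieces are greedily packed back together until adding another piece would exceed the character limit. It is an illustration of the general approach only and is not the `CharacterTextSplitter` implementation; the function name `split_on_separator` and the sample text are assumptions.

```python
def split_on_separator(text, separator="\n", chunk_size=256):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole as its own chunk.
    """
    pieces = [piece.strip() for piece in text.split(separator)]
    pieces = [piece for piece in pieces if piece]  # drop empty lines
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow the limit: close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\nbbbb\ncccc\n" + "d" * 20
print(split_on_separator(sample, separator="\n", chunk_size=10))
```

In this sketch the first two lines fit together within ten characters, the third starts a new chunk, and the oversized last line is emitted unchanged rather than being cut in the middle.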

## Chunking strategy

Of the many chunkers that exist, the best one is the one that suits our needs.

**Chunker based on document structure**:\
The text and its structure determine the chunk size and the chunking strategy. When all documents are of the same type (HTML, LaTeX, Markdown, etc.), a chunker specific to that format might be the best choice; if the collection is heterogeneous, we could create a pipeline that chunks each document according to its file type.
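A pipeline of this kind can be sketched as a lookup from file extension to chunking function. Everything below is hypothetical: the two toy chunkers, the `CHUNKERS` table, and the fallback to plain-text chunking for unknown extensions are assumptions chosen to keep the example short.

```python
import re
from pathlib import Path


def chunk_markdown(text):
    """Split before each Markdown heading line so a heading stays with its body."""
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [part.strip() for part in parts if part.strip()]


def chunk_plain_text(text):
    """Split on blank lines, i.e. at paragraph boundaries."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


# Hypothetical routing table: file extension -> chunking function.
CHUNKERS = {".md": chunk_markdown, ".txt": chunk_plain_text}


def chunk_by_file_type(filename, text):
    """Pick a chunker from the file extension, falling back to plain-text chunking."""
    suffix = Path(filename).suffix.lower()
    chunker = CHUNKERS.get(suffix, chunk_plain_text)
    return chunker(text)


markdown_doc = "# Intro\nSome text.\n\n## Details\nMore text.\n"
plain_doc = "First paragraph,\nstill first.\n\nSecond paragraph."
print(chunk_by_file_type("notes.md", markdown_doc))
print(chunk_by_file_type("notes.txt", plain_doc))
```

Supporting another format then means adding one function and one entry to the table, without touching the rest of the pipeline.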

**For Performance and Resource Optimization**:
- When storage space and computational cost are limited, a simple **fixed-size chunker** might be the optimal choice.
- For better information integrity and improved relevance and accuracy, a **semantic chunker** could be used. It requires knowledge of the text, though, and might be the optimal choice for a system that is used by general users.
- If we have ample computational resources, **contextual chunking** gives better performance.

**Chunker based on model context**:\
The chunk size must fit within the context length. Both the embedding model and the LLM should be taken into account when choosing it.
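The two constraints can be checked with simple arithmetic, sketched below. All numbers (a 512-token embedder limit, a 4096-token LLM context window, 300 tokens of instructions, 500 tokens reserved for the answer) are hypothetical and should be replaced with the limits of the models actually in use.

```python
def fits_embedder(chunk_tokens, embedder_max_tokens):
    """Return True when the chunk can be embedded without truncation."""
    return chunk_tokens <= embedder_max_tokens


def max_chunks_in_prompt(context_window, chunk_tokens, prompt_tokens, answer_tokens):
    """How many retrieved chunks fit next to the instructions and the reserved answer space."""
    available = context_window - prompt_tokens - answer_tokens
    if available <= 0 or chunk_tokens <= 0:
        return 0
    return available // chunk_tokens


# Hypothetical limits, for illustration only.
print(fits_embedder(chunk_tokens=256, embedder_max_tokens=512))  # True
print(fits_embedder(chunk_tokens=600, embedder_max_tokens=512))  # False
print(max_chunks_in_prompt(context_window=4096, chunk_tokens=256,
                           prompt_tokens=300, answer_tokens=500))  # 12
```

The first check guards against truncation at embedding time, while the second shows how the chunk size limits the number of retrieved chunks that can be passed to the LLM in one prompt.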

**Based on user query pattern**:\
The type of questions we expect from users should be taken into account. If users ask questions that require the model to find multiple facts, it is better to have small chunks that give direct answers; if the system is more discursive, it is better to give the chunks more context.