# The process of building a RAG system and an integral part of it: chunking

Dividing text into chunks before embedding is an impactful and integral step because it dictates what information each vector contains and, in turn, what can be found during search. Chunks should be of an appropriate size: chunks that are too small lose context, while chunks that are too big are non-specific, and both hurt the retrieval of query-specific information. Excessive context can also lead to hallucination and reduce LLM performance. In addition, the chunk size should not exceed the context length of the embedder, or else the excess is cut off and its information is lost; this is called **truncation**. Chunk size therefore affects both the quality and the performance of retrieval and generation.
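To make truncation concrete, here is a minimal sketch. It assumes a whitespace split as a stand-in for a real tokenizer, and `embed_input` and `max_tokens` are hypothetical names used only for illustration, not part of any library.

```python
def embed_input(tokens, max_tokens):
    """Mimic an embedder that silently drops tokens beyond its context length.

    Hypothetical helper: a real embedder would tokenize with its own vocabulary.
    """
    kept = tokens[:max_tokens]      # tokens that actually reach the embedding
    dropped = tokens[max_tokens:]   # tokens lost to truncation
    return kept, dropped


# Whitespace split is an assumption standing in for a real tokenizer.
chunk_tokens = "the quick brown fox jumps over the lazy dog".split()
kept, dropped = embed_input(chunk_tokens, max_tokens=6)

print("embedded:", kept)
print("lost:", dropped)
```

Anything in `dropped` never influences the vector, so a query about that part of the chunk cannot match it.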

The simplest strategy is to **fix the length of the chunk**. Fixed-length chunking divides the document into chunks of a predetermined number of characters or tokens (common choices are 100 or 256 tokens, or 500 characters). The size should be chosen according to the type of document. This is the cheapest and easiest strategy to implement.
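As a rough illustration, fixed-length chunking by characters can be sketched in a few lines of plain Python. The function name `chunk_by_characters` and the sample size of 25 are assumptions chosen for the example.

```python
def chunk_by_characters(text, chunk_size):
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Step through the text in strides of chunk_size; the last piece may be shorter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "Nepal is a landlocked country and therefore it has no access to ocean."
chunks = chunk_by_characters(sample, 25)

for n, chunk in enumerate(chunks, start=1):
    print(f"Chunk {n}: {chunk!r}")
```

Note that the cut points ignore word and sentence boundaries, which is exactly why this strategy is cheap but can split a thought in half.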

When the collection is not homogeneous, we can use **random-size chunking**, which varies the chunk length and can potentially capture more semantic context.
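One possible reading of random-size chunking is to draw each chunk length from a range. The sketch below makes that assumption; `chunk_random_sizes`, `min_size`, `max_size`, and the fixed `seed` are hypothetical choices for the example rather than an established API.

```python
import random


def chunk_random_sizes(text, min_size, max_size, seed=0):
    """Split text into consecutive chunks whose lengths are drawn from [min_size, max_size]."""
    if not 1 <= min_size <= max_size:
        raise ValueError("need 1 <= min_size <= max_size")
    rng = random.Random(seed)  # seeded so the split is reproducible
    chunks, start = [], 0
    while start < len(text):
        size = rng.randint(min_size, max_size)  # inclusive on both ends
        chunks.append(text[start:start + size])
        start += size
    return chunks


sample = "Nepal is a landlocked country and therefore it has no access to ocean."
random_chunks = chunk_random_sizes(sample, min_size=10, max_size=30)
print([len(c) for c in random_chunks])
```

Only the final chunk may fall below `min_size`, because it holds whatever text is left over.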

## Chunking without overlap

For example:
```
"Nepal is a landlocked country and therefore it has no access to ocean making it difficult to import electronic goods from China directly."

is divided into 

Chunk 1: Nepal is a landlocked country
Chunk 2: and therefore it has no access
Chunk 3: to ocean making it difficult to
Chunk 4: import electronic goods from China directly
```

This works well when there are clear boundaries between chunks, for example when the context changes drastically between adjacent chunks. Such boundaries are rare, however, so the lack of overlap usually destroys context.

In [1]:
from transformers import BertTokenizer
from IPython.display import HTML
from langchain_text_splitters import TokenTextSplitter

In [2]:
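# BERT tokenizer, used later to re-tokenize chunks for display
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Split into chunks of at most 50 tokens, with no overlap between chunks
text_splitter = TokenTextSplitter(
    chunk_size=50,
    chunk_overlap=0
)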

In [3]:
def color_text_chunks(text, text_splitter):
    """Render each chunk of `text` on its own line with a cycling background color."""
    docs = text_splitter.create_documents([text])
    chunks = [doc.page_content for doc in docs]

    colored_text = ""
    colors = ["#ff9999", "#66ffff", "#99ff99", "#ff66ff"]

    for i, chunk in enumerate(chunks):
        color = colors[i % len(colors)]  # cycle through the palette
        chunk_html = f"<span style='background-color: {color};color:black'>{chunk}</span>"
        colored_text += chunk_html + "<br/>"

    return HTML(colored_text)

In [4]:
hemmingway = """
Today is only one day in all the days that will ever be. But what will happen in all the other days that ever come can depend on what you do today. It's been that way all this year. It's been that way so many times. All of war is that way.
"""

In [5]:
color_text_chunks(hemmingway, text_splitter)


## Chunking with overlap

Chunking with overlap can be achieved using a sliding window that maintains an overlap between adjacent chunks. This ensures that the system has contextual information at the chunk boundaries, allowing better semantic context and increasing the chances that relevant information will be found when it spans multiple chunks. This strategy is more expensive because the document is divided into more chunks, which increases the number of entries in the database. It also stores some redundant information, so the overlap should be no more than a small percentage of the entire chunk.
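The sliding window itself can be sketched without any library. This is a minimal version under stated assumptions: `sliding_window_chunks` is a hypothetical helper, and a list of integers stands in for real token ids. The window advances by `chunk_size - overlap` tokens each step.

```python
def sliding_window_chunks(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens with their neighbor."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far the window moves each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = list(range(100))  # stand-in for 100 token ids
no_overlap = sliding_window_chunks(tokens, chunk_size=10, overlap=0)
with_overlap = sliding_window_chunks(tokens, chunk_size=10, overlap=2)

print(len(no_overlap), len(with_overlap))      # 10 13
print(with_overlap[0][-2:], with_overlap[1][:2])  # [8, 9] [8, 9]
```

With a 20% overlap the same 100 tokens produce 13 chunks instead of 10, which is the extra storage cost described above.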

In [6]:
from collections import defaultdict

In [28]:
def color_overlap_text_chunks(text, chunk_size, overlap):
    # Note: TokenTextSplitter counts tokens with tiktoken by default, while the
    # display below re-tokenizes each chunk with the BERT tokenizer, so the
    # highlighted overlap regions are approximate.
    text_splitter = TokenTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )
    docs = text_splitter.create_documents([text])
    chunks = [doc.page_content for doc in docs]  # text of each chunk

    colors = ["#ff9999", "#66ffff", "#99ff99", "#ff66ff"]

    # First pass: base color per chunk, with the tail blended toward the next chunk's color.
    color_positions = []
    for i in range(len(chunks)):
        chunk_tokens = tokenizer.tokenize(chunks[i])
        chunk_length = len(chunk_tokens)
        base = colors[i % len(colors)]
        tail = blend_colors([base, colors[(i+1) % len(colors)]])
        if i == len(chunks) - 1:
            # The last chunk has no successor, so its tail is not shared.
            chunk_colors = [base] * chunk_length
        elif chunk_length > overlap:
            unique_length = chunk_length - overlap
            chunk_colors = [base] * unique_length + [tail] * overlap
        else:
            chunk_colors = [tail] * chunk_length
        color_positions.append(chunk_colors)

    # Second pass: recolor the head of every chunk after the first,
    # since it repeats the tail of the previous chunk.
    for i in range(1, len(color_positions)):
        overlap_color = blend_colors([colors[(i-1) % len(colors)], colors[i % len(colors)]])
        color_positions[i] = [overlap_color] * overlap + color_positions[i][overlap:]

    # Render each token as a colored span, with a blank line between chunks.
    colored_text = ""
    for i, chunk in enumerate(chunks):
        tokens = tokenizer.tokenize(chunk)
        for j, token in enumerate(tokens):
            color = color_positions[i][j]
            token_html = f'<span style="background-color:{color}; color:black">{token}</span>'
            colored_text += token_html + ' '
        colored_text += '<br><br>'

    return HTML(colored_text)

def blend_colors(color_list):
    """Average a list of '#rrggbb' colors channel by channel."""
    r, g, b = 0, 0, 0
    for color in color_list:
        c = int(color[1:], 16)   # parse the hex digits after '#'
        r += c >> 16
        g += (c >> 8) & 0xff
        b += c & 0xff
    n = len(color_list)
    r = min(r // n, 255)
    g = min(g // n, 255)
    b = min(b // n, 255)
    return f'#{r:02x}{g:02x}{b:02x}'

In [25]:
text = '''To be or not to be, that is the question.
Whether tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles,
And by opposing, end them. To die, to sleep
No more, and by a sleep to say we end,
The heartache and the thousand natural shocks
That flesh is heir to, tis a consummation
Devoutly to be wished.'''

In [26]:
chunk_size = 80
overlap = 20

In [29]:
color_overlap_text_chunks(text, chunk_size, overlap)