# The process of building a RAG system and an integral part of it: Chunking

Dividing text into chunks before embedding is an impactful and integral step, because it dictates what information goes into each vector and therefore what can be found during search. Chunks should be of an appropriate size: chunks that are too small lose context, while chunks that are too big are non-specific, which hurts the retrieval of query-specific information. Excessive context might also lead to hallucination and reduce LLM performance. In addition, the chunk size should not exceed the context length of the embedder, or else the text beyond that limit is lost; this is called **truncation**. Chunk size therefore affects the quality and performance of both retrieval and generation.
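To make truncation concrete, here is a minimal sketch. It assumes a whitespace split as a stand-in for the embedder's real tokenizer and a hypothetical limit of 8 tokens (real embedders allow far more); `truncate_to_limit` is an illustrative helper, not part of any library.

```python
def truncate_to_limit(tokens, max_tokens):
    """Return the tokens an embedder would keep and the ones it would drop."""
    kept = tokens[:max_tokens]
    dropped = tokens[max_tokens:]
    return kept, dropped


chunk = "Nepal is a landlocked country and therefore it has no access to ocean"

# Assumption: whitespace tokens stand in for the embedder's own tokenizer
tokens = chunk.split()

# Assumption: a hypothetical context length of 8 tokens, chosen for illustration
kept, dropped = truncate_to_limit(tokens, max_tokens=8)

print("embedded:", " ".join(kept))
print("lost:    ", " ".join(dropped))
```

Everything in `dropped` never reaches the vector, so no query can retrieve it.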

The simplest strategy is to **fix the length of the chunk**. Fixed-length chunking divides the document into chunks of a predetermined number of characters or tokens (common choices are 100 or 256 tokens, or 500 characters). The size should be chosen depending on the type of document. This is the cheapest and easiest approach to implement.
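Fixed-length chunking by characters can be sketched in a few lines of plain Python. The function name `chunk_by_characters` and the chunk size of 30 are illustrative choices, not taken from any library.

```python
from typing import List


def chunk_by_characters(text: str, chunk_size: int) -> List[str]:
    """Cut `text` into consecutive pieces of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # Step through the text in strides of chunk_size; the last piece may be shorter
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample_text = (
    "Nepal is a landlocked country and therefore it has no access to ocean "
    "making it difficult to import electronic goods from China directly."
)

fixed_chunks = chunk_by_characters(sample_text, chunk_size=30)
for number, piece in enumerate(fixed_chunks, start=1):
    print(f"Chunk {number}: {piece!r}")
```

Because the cut points ignore word and sentence boundaries, some chunks start or end in the middle of a word, which is the price paid for the simplicity of this approach.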

When the collection is not homogeneous, we can instead use **random-size chunking**, which varies the chunk length from chunk to chunk and can potentially capture more semantic context.
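How the random sizes are drawn is not specified here, so the sketch below makes its own assumptions: each chunk length is drawn uniformly between a hypothetical `min_size` and `max_size` (in characters), and an optional `seed` makes the split reproducible. `chunk_random_sizes` is an illustrative name, not a library function.

```python
import random
from typing import List, Optional


def chunk_random_sizes(
    text: str, min_size: int, max_size: int, seed: Optional[int] = None
) -> List[str]:
    """Cut `text` into consecutive pieces whose lengths are drawn at random."""
    if not 1 <= min_size <= max_size:
        raise ValueError("expected 1 <= min_size <= max_size")
    rng = random.Random(seed)  # private generator, leaves global random state alone
    pieces = []
    start = 0
    while start < len(text):
        # Assumption: uniform draw; other distributions are equally possible
        size = rng.randint(min_size, max_size)
        pieces.append(text[start:start + size])
        start += size
    return pieces


mixed_text = (
    "Nepal is a landlocked country and therefore it has no access to ocean "
    "making it difficult to import electronic goods from China directly."
)

random_chunks = chunk_random_sizes(mixed_text, min_size=20, max_size=40, seed=0)
for number, piece in enumerate(random_chunks, start=1):
    print(f"Chunk {number} ({len(piece)} chars): {piece!r}")
```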

## Chunking without overlap

For example,
```
"Nepal is a landlocked country and therefore it has no access to ocean making it difficult to import electronic goods from China directly."

is divided into 

Chunk 1: Nepal is a landlocked country
Chunk 2: and therefore it has no access
Chunk 3: to ocean making it difficult to
Chunk 4: import electronic goods from China directly
```

This works well when there are clear boundaries between chunks, for example when the context changes drastically between adjacent chunks. Such clean boundaries are rare, however, so in most texts the lack of overlap cuts sentences and ideas apart and destroys context.

In [1]:
from transformers import BertTokenizer
from IPython.display import HTML
from langchain_text_splitters import TokenTextSplitter


In [2]:
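# Loaded here but not passed to the splitter below; TokenTextSplitter
# counts tokens with its own tiktoken encoding by default.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_splitter = TokenTextSplitter(
    chunk_size=50,    # maximum number of tokens per chunk
    chunk_overlap=0,  # no tokens shared between adjacent chunks
)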

In [11]:
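def color_text_chunks(text, text_splitter):
    """Split `text` with `text_splitter` and show each chunk on its own colored line."""
    from html import escape  # local import so this cell stays self-contained

    docs = text_splitter.create_documents([text])
    chunks = [doc.page_content for doc in docs]

    colored_text = ""
    colors = ["#ff9999", "#66ffff", "#99ff99", "#ff66ff"]

    for i, chunk in enumerate(chunks):
        # Cycle through the palette so adjacent chunks get different colors
        color = colors[i % len(colors)]
        # Escape the chunk so characters such as < or & are displayed literally
        chunk_html = f"<span style='background-color: {color};color:black'>{escape(chunk)}</span>"
        colored_text += chunk_html + "<br/>"

    return HTML(colored_text)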

In [12]:
hemmingway = """
Today is only one day in all the days that will ever be. But what will happen in all the other days that ever come can depend on what you do today. It's been that way all this year. It's been that way so many times. All of war is that way.
"""

In [13]:
color_text_chunks(hemmingway, text_splitter)
