## Optimizing Chunk Size

> **The Chunking Commandment:** Your goal is not to chunk for chunking sake, our goal is to get our data in a format where it can be retrieved for value later.
>
> -- Greg Kamradt, [5 Levels Of Text Splitting](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)


When building a RAG system, you must understand how documents are processed and indexed. 

Chunking is the most fundamental aspect of this process. In this lesson, we'll explore what chunking is, how it affects the indexing and retrieval process, and how you can customize chunk size and overlap to optimize your results.

### Understanding Chunking

When documents are ingested into an index, `LlamaIndex` splits them into smaller pieces called "chunks." This process is known as chunking. By default, `LlamaIndex` uses a chunk size of 1024 tokens and a chunk overlap of 20 tokens.

But what do these numbers mean, and how do they impact the indexing and retrieval process?

#### Chunk Size

The chunk size determines the maximum number of tokens that each chunk will contain. Tokens are sub-word units, so in English a token is usually somewhat shorter than a word. With the default chunk size of 1024, `LlamaIndex` splits your documents into chunks of at most 1024 tokens each.

*   **🤏 Smaller Chunk Size**
    *   More precise and focused embeddings
    *   Beneficial for retrieving specific information

*   **👐 Larger Chunk Size**
    *   More general embeddings with broader context
    *   Useful for document overviews, but may miss details

#### Chunk Overlap

*   **🔗 Chunk Overlap**
    *   Tokens shared between adjacent chunks (default: 20)
    *   Helps preserve context across chunk boundaries, so information that spans a boundary is less likely to be lost

I recommend taking a look at [this chunk visualizer](https://huggingface.co/spaces/m-ric/chunk_visualizer) to get an intuitive sense for chunk size and overlap.
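
To make chunk size and overlap concrete, here is a minimal sketch of a sliding-window splitter. It is not LlamaIndex's implementation: it assumes whitespace-separated words stand in for tokens, and `chunk_tokens` is a hypothetical helper introduced only for illustration.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Adjacent windows share chunk_overlap tokens. Hypothetical helper,
    not part of any library.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Assumption: words approximate tokens for the purpose of this demo.
words = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for i, chunk in enumerate(chunks):
    print(i, " ".join(chunk))
# 0 the quick brown fox
# 1 fox jumps over the
# 2 the lazy dog again
```

Each chunk ends with the token that starts the next one, which is the overlap at work. A real splitter would count tokens with a tokenizer and typically prefers to break on sentence boundaries rather than at fixed offsets.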

#### 🤔 The Impact of Chunk Size
 - **📏 Relevance and Granularity**
    *   Smaller chunks (e.g., 128 tokens) offer finer granularity but risk omitting vital information or lacking sufficient context.
    *   Larger chunks (e.g., 512 tokens) are more likely to capture the necessary context but risk including irrelevant information.
    *   Faithfulness and Relevancy metrics help assess response quality.

 -  **🎯 Chunk Size and Use Case**
    *   **Question Answering:** Shorter, specific chunks for precise answers.
    *   **Summarization:** Longer chunks to capture the overall context.

 - **⏳ Response Generation Time**
    *   Larger chunks provide more context but may slow down the system, since more text is passed to the LLM for each query.
    *   Balancing comprehensiveness with speed is crucial.
    
 - **🤷🏽‍♂️ Finding the Optimal Size ⚖️**
    *   Testing various chunk sizes is essential for specific use cases and datasets. 
    *   Balancing information capture with efficiency is key.
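
Testing several chunk sizes can be organized as a simple sweep. The sketch below is one possible harness, not a prescribed method: `sweep_chunk_sizes` and `count_chunks` are hypothetical names, and `count_chunks` is only a placeholder that counts how many chunks an assumed 10,000-token corpus would produce under sliding-window splitting. In practice you would swap in a function that rebuilds the index at the given chunk size and returns metrics such as faithfulness and relevancy.

```python
import math
import time
from typing import Callable, Dict, Iterable, List


def sweep_chunk_sizes(
    candidate_sizes: Iterable[int],
    evaluate: Callable[[int], Dict[str, float]],
) -> List[Dict[str, float]]:
    """Call evaluate(chunk_size) for each candidate and record its wall-clock time."""
    results = []
    for size in candidate_sizes:
        start = time.perf_counter()
        metrics = evaluate(size)
        elapsed = time.perf_counter() - start
        results.append({"chunk_size": size, "seconds": elapsed, **metrics})
    return results


def count_chunks(chunk_size: int) -> Dict[str, float]:
    """Placeholder evaluation: number of chunks for an assumed corpus.

    Assumes 10,000 tokens, an overlap of 20, and sliding-window splitting.
    """
    num_tokens, chunk_overlap = 10_000, 20  # illustrative values only
    step = chunk_size - chunk_overlap
    return {"num_chunks": math.ceil((num_tokens - chunk_overlap) / step)}


results = sweep_chunk_sizes([128, 256, 512, 1024], count_chunks)
for row in results:
    print(row["chunk_size"], row["num_chunks"])
# 128 93
# 256 43
# 512 21
# 1024 10
```

Keeping the evaluation function separate from the loop means the same sweep can report quality metrics and timing side by side, which is the trade-off the list above describes.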



### Considerations When Customizing Chunk Size

When deciding on a chunk size, there are a few things to keep in mind:

| Factor | Description |
|--------|-------------|
| 📄 **Data Characteristics** | The optimal chunk size may depend on the type of data you're indexing. For example, if you're indexing long, detailed documents, you may want to use a larger chunk size to capture more context. If you're indexing short, focused passages, a smaller chunk size may be more appropriate. |
| 🔍 **Retrieval Requirements** | Consider what kind of information you need to retrieve from your index. If you need to retrieve very specific details, a smaller chunk size may be better. If you're looking for more general information, a larger chunk size may suffice. |
| 🔢 **Similarity Parameters** | When you change the chunk size used with a `VectorStoreIndex`, you may also want to adjust the `similarity_top_k` parameter. This parameter determines how many of the most similar chunks are retrieved for each query. With smaller chunks, you may need to retrieve more chunks to cover the same amount of information; with larger chunks, fewer may suffice. |
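
The interaction between chunk size and `similarity_top_k` comes down to simple arithmetic: the amount of text retrieved per query is at most the chunk size times the number of chunks retrieved. The sketch below works through that calculation. The helper names are hypothetical, the baseline of 1024 tokens with a top-k of 2 is an assumption chosen for illustration, and overlap between chunks is ignored.

```python
import math


def retrieved_tokens(chunk_size: int, similarity_top_k: int) -> int:
    """Upper bound on tokens retrieved per query (top-k chunks of at most chunk_size)."""
    return chunk_size * similarity_top_k


def top_k_for_budget(token_budget: int, chunk_size: int) -> int:
    """Smallest top-k whose chunks can cover token_budget tokens."""
    return max(1, math.ceil(token_budget / chunk_size))


# Assumed baseline: 1024-token chunks with a top-k of 2.
budget = retrieved_tokens(chunk_size=1024, similarity_top_k=2)
for size in (1024, 512, 256, 128):
    print(size, top_k_for_budget(budget, size))
# 1024 2
# 512 4
# 256 8
# 128 16
```

Halving the chunk size doubles the top-k needed to retrieve the same volume of text, so the two parameters are best tuned together rather than in isolation.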
