# Advanced Chunking Methods for RAG

In [1]:
!python --version

Python 3.11.7


Semantic chunking splits documents (usually for RAG) in a way that is optimized for their end state as _vector embeddings_. Vector embeddings are retrieved based on semantic similarity, and _semantic chunking_ builds chunks using the same mechanism.

That means we optimize our chunks for retrieval performance. In essence, we do this by finding a chunk size that keeps each chunk's semantic meaning concise. Concision matters because each chunk is compressed into a _single_ vector embedding. If a chunk covers several meanings, that one vector has to represent all of them, and at best we get a kind of _average_ over those meanings, which in theory makes for a suboptimal embedding.
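To make the averaging effect concrete, here is a minimal sketch using made-up three-dimensional vectors in place of real embeddings. The vectors and the assumption that a mixed chunk behaves like the mean of its topics are illustrative only; real embedding models are not this simple.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d "embeddings" for two unrelated topics (made-up values).
topic_a = [1.0, 0.0, 0.0]
topic_b = [0.0, 1.0, 0.0]

# Assumption: a chunk mixing both topics behaves roughly like their average.
mixed = [(x + y) / 2 for x, y in zip(topic_a, topic_b)]

# A query about topic A matches the focused chunk better than the mixed one.
query_vec = topic_a
sim_focused = cosine(query_vec, topic_a)
sim_mixed = cosine(query_vec, mixed)
print(f"focused: {sim_focused:.3f}, mixed: {sim_mixed:.3f}")
```

In this toy setup the focused chunk scores 1.0 against the query while the mixed chunk scores about 0.707, which is the kind of dilution semantic chunking tries to avoid.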

In this example, we'll explore semantic chunking and see the full pipeline from raw data through to chunking and embedding our data, ready for RAG.

In [2]:
!pip install -qU \
    semantic-router==0.0.37 \
    pinecone-client==3.1.0 \
    datasets==2.19.0

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
langchain-openai 0.1.13 requires tiktoken<1,>=0.7, but you have tiktoken 0.6.0 which is incompatible.
pyppeteer 2.0.0 requires urllib3<2.0.0,>=1.25.8, but you have urllib3 2.2.2 which is incompatible.

In [3]:
from datasets import load_dataset

dataset = load_dataset("jamescalam/ai-arxiv2", split="train")
dataset

Downloading readme:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/217M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2673 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'content', 'references'],
    num_rows: 2673
})

In [8]:
print(type(dataset))

<class 'datasets.arrow_dataset.Dataset'>


Let's initialize our encoder which will be used to identify semantically concise splits in our dataset.

In [10]:
import os
from getpass import getpass
from semantic_router.encoders import OpenAIEncoder

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")

encoder = OpenAIEncoder(name="text-embedding-3-small")



In [11]:
from semantic_router.splitters import RollingWindowSplitter
from semantic_router.utils.logger import logger

logger.setLevel("WARNING")  # reduce logs from splitter

splitter = RollingWindowSplitter(
    encoder=encoder,
    dynamic_threshold=True,
    min_split_tokens=100,
    max_split_tokens=500,
    window_size=2,
    plot_splits=True,  # set this to true to visualize chunking
    enable_statistics=True  # to print chunking stats
)

ValidationError: 2 validation errors for RollingWindowSplitter
encoder -> name
  field required (type=value_error.missing)
encoder -> score_threshold
  field required (type=value_error.missing)
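To give some intuition for what a rolling-window splitter does, here is a toy sketch. It is not the `RollingWindowSplitter` implementation: the bag-of-words `embed` function stands in for a real encoder, the fixed `threshold` replaces the dynamic threshold, and the token limits are not modeled. The idea shown is only the core one: compare each sentence with the window of sentences before it, and start a new chunk when similarity drops.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (a stand-in for a real encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def toy_semantic_split(sentences, window_size=2, threshold=0.2):
    """Start a new chunk when a sentence is dissimilar to the preceding window."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        window = " ".join(sentences[max(0, i - window_size):i])
        if cosine(embed(window), embed(sentences[i])) < threshold:
            chunks.append([])  # similarity dropped: begin a new chunk
        chunks[-1].append(sentences[i])
    return chunks

sentences = [
    "mixtral is a sparse mixture of experts model",
    "each mixtral layer has eight experts",
    "pinecone stores vectors in an index",
    "the index returns similar vectors",
]
chunks = toy_semantic_split(sentences)
for chunk in chunks:
    print(chunk)
```

With these example sentences the split falls between the two Mixtral sentences and the two Pinecone sentences, because the third sentence shares no words with the window before it.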

In [6]:
splits = splitter([dataset["content"][0]])

2024-07-05 13:29:35 ERROR semantic_router.utils.logger Error encoding documents ['4 2 0 2', 'n a J 8 ] G L . s c [', '1 v 8 8 0 4 0 . 1 0 4 2 : v i X r a', '# Mixtral of Experts', 'Albert Q.', 'Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed', 'Abstract', 'We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model.', 'Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts).', 'For every token, at each layer, a router network selects two experts to process the current state and combine their outputs

ValueError: No embeddings returned. Error: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}

In [None]:
splits = splitter([dataset["content"][1]])

In [None]:
splits = splitter([dataset["content"][2]])

We can view a few of our splits:

In [None]:
splitter.print(splits[:3])

Each split returned by the splitter is a `DocumentSplit` object:

In [None]:
splits[:3]

Inside each `DocumentSplit`, the chunk is stored as a list of _sentence-like_ strings. To get the chunk as a single string we use the `.content` attribute:

In [None]:
splits[0].content

When creating embeddings we can include additional contextual information to improve retrieval performance. One simple but effective method is to prefix each chunk with its title or header. This works particularly well for more structured documents like PDFs.

We can define a function that will do this for us.

In [None]:
def build_chunk(title: str, content: str):
    return f"# {title}\n{content}"

# we use it like:
title = dataset[2]["title"]
for s in splits[:3]:
    print("---")
    print(build_chunk(title=title, content=s.content))

These chunks are all we need to create our embeddings, but we don't necessarily want to feed the same information into our LLM that has been fed into our embedding model. In our example, we will add a little more structure to what the LLM sees, and also a little more context.

To achieve this, we will keep the `title` and `content` fields separate in metadata so that during retrieval we can format them in a way that makes sense for us.

Additionally, we may also want to pull in some context from surrounding chunks for the LLM — for that, we must track before and after chunks which we will place in two new metadata fields, `prechunk` and `postchunk`.

Last but not least, we may want to allow for connections between different documents. To support this we will add an `arxiv_id` field that identifies _this_ paper, and a `references` field that lists the `arxiv_id` values of other papers referenced in _this_ paper.

In [None]:
arxiv_id = dataset[2]["id"]
refs = list(dataset[2]["references"].values())

metadata = []
for i, s in enumerate(splits[:3]):
    prechunk = "" if i == 0 else splits[i-1].content
    postchunk = "" if i+1 == len(splits) else splits[i+1].content
    metadata.append({
        "title": title,
        "content": s.content,
        "prechunk": prechunk,
        "postchunk": postchunk,
        "arxiv_id": arxiv_id,
        "references": refs
    })

In [None]:
metadata[1]

In [None]:
from semantic_router.schema import DocumentSplit


def build_metadata(doc: dict, doc_splits: list[DocumentSplit]):
    # get document level metadata first
    arxiv_id = doc["id"]
    title = doc["title"]
    refs = list(doc["references"].values())
    # init split level metadata list
    metadata = []
    for i, split in enumerate(doc_splits):
        # get neighboring chunks
        prechunk_id = "" if i == 0 else f"{arxiv_id}#{i-1}"
        postchunk_id = "" if i+1 == len(doc_splits) else f"{arxiv_id}#{i+1}"
        # create dict and append to metadata list
        metadata.append({
            "id": f"{arxiv_id}#{i}",
            "title": title,
            "content": split.content,
            "prechunk_id": prechunk_id,
            "postchunk_id": postchunk_id,
            "arxiv_id": arxiv_id,
            "references": refs
        })
    return metadata

In [None]:
metadata = build_metadata(
    doc=dataset[2],
    doc_splits=splits[:3]
)

In [None]:
metadata

When feeding this structure into our LLM we will be able to tweak the exact format, how much info we provide, and how we handle connected documents — but by having this info here we can quickly iterate on the later generative steps.
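As one possible way to do that formatting, here is a minimal sketch. The `format_for_llm` helper, its `context_chars` parameter, and the example record are hypothetical and only illustrate how the separate `title`, `content`, and neighbor fields could be combined at retrieval time.

```python
def format_for_llm(record: dict, prechunk: str = "", postchunk: str = "",
                   context_chars: int = 400) -> str:
    """Render one retrieved chunk, plus trimmed neighbor context, as prompt text."""
    parts = [f"# {record['title']} (arXiv: {record['arxiv_id']})"]
    if prechunk:
        # keep only the tail of the previous chunk
        parts.append(prechunk[-context_chars:])
    parts.append(record["content"])
    if postchunk:
        # keep only the head of the next chunk
        parts.append(postchunk[:context_chars])
    return "\n\n".join(parts)

# Made-up record in the same shape that build_metadata produces
record = {
    "id": "0000.00000#1",
    "title": "Example Paper",
    "content": "This is the retrieved chunk.",
    "prechunk_id": "0000.00000#0",
    "postchunk_id": "",
    "arxiv_id": "0000.00000",
    "references": [],
}
text = format_for_llm(record, prechunk="Text of the previous chunk.")
print(text)
```

Because the fields are stored separately, changing how much neighbor context the LLM sees only means changing `context_chars`, not re-embedding anything.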

## Implementation and Indexing

So far, we've seen how to process our data, but we still haven't processed our full dataset, nor have we begun embedding and storing it. Now, we will do that.

To begin, we will set up a Pinecone index where we'll store everything.

In [None]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.getenv("PINECONE_API_KEY") or getpass("Pinecone API key: ")

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which lets us define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/guides/projects/understanding-projects).

In [None]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-west-2"  # us-east-1
)

Before creating an index, we need the dimensionality of our OpenAI embedding model. We get this by embedding an example and measuring the vector dimension:

In [None]:
dims = len(encoder(["some random text"])[0])
dims

Now we create the index using our embedding dimensionality, and a metric also compatible with the model (this can be either cosine or dotproduct). We also pass our spec to index initialization.
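The reason either metric works can be sketched in a few lines: for unit-length vectors, the dot product and cosine similarity are the same number. This assumes the embeddings are normalized to length 1, which OpenAI states for its embedding models. The vectors below are made up for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: dot product divided by both vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Made-up vectors, normalized to unit length like the embeddings are assumed to be
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

dot_score = dot(a, b)
cosine_score = cosine(a, b)
print(dot_score, cosine_score)
```

Both scores agree up to floating-point error, so the ranking of results is the same under either metric.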

In [None]:
import time

index_name = "better-rag-chunking"

# check if index already exists (it shouldn't if this is first time)
if index_name not in pc.list_indexes().names():
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=dims,  # dimensionality of embed 3
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

## Populating our Index

Now our knowledge base is ready to be populated with our data. We will use our `encoder` to embed our documents and then add them to our index.

We will also include metadata from each record in the format we developed earlier.
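The batching pattern can be sketched without any external services. In this sketch `fake_embed` is a placeholder for the real encoder and `batched` is a hypothetical helper; the point is that each record pairs an ID and a vector with the metadata from the same batch.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_embed(texts):
    """Placeholder encoder: returns a 2-d vector per text (not a real model)."""
    return [[float(len(t)), 1.0] for t in texts]

# Made-up chunk metadata in a simplified version of the format used here
metadata = [
    {"id": f"doc#{i}", "title": "T", "content": f"chunk {i}"} for i in range(5)
]

records = []
for metadata_batch in batched(metadata, batch_size=2):
    ids = [m["id"] for m in metadata_batch]
    embeds = fake_embed(
        [f"# {m['title']}\n{m['content']}" for m in metadata_batch]
    )
    # zip within the batch so IDs, vectors, and metadata stay aligned
    records.extend(zip(ids, embeds, metadata_batch))

print(len(records), records[-1][0])
```

Zipping against the full metadata list instead of the current batch would silently attach the wrong metadata to every batch after the first, so keeping the three lists the same length per batch is the detail to watch.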

---

_**Note**: You can find a prechunked version of the dataset [here](https://huggingface.co/datasets/jamescalam/ai-arxiv2-semantic-chunks) — you can use this to save time and money on chunking by loading from HF datasets with:_

```python
dataset = load_dataset(
    "jamescalam/ai-arxiv2-semantic-chunks",
    split="train[:10000]"
)
```

_You can also reduce the dataset size via the `split` parameter. To use only the first 10K rows we write `split="train[:10000]"`._

---

In [None]:
from tqdm.auto import tqdm

# easier to work with dataset as pandas dataframe
data = dataset.to_pandas().iloc[:10000]
# store dataset *without* embeddings here
full_dataset = []

batch_size = 128

# adjust splitter to not display stats and visuals
splitter.enable_statistics = False
splitter.plot_splits = False

for doc in tqdm(dataset):
    # create splits
    splits = splitter([doc["content"]])
    # create IDs and metadata for all splits in doc
    metadata = build_metadata(doc=doc, doc_splits=splits)
    for i in range(0, len(splits), batch_size):
        i_end = min(len(splits), i+batch_size)
        # get batch of data
        metadata_batch = metadata[i:i_end]
        full_dataset.extend(metadata_batch)
        # generate unique ids for each chunk
        ids = [m["id"] for m in metadata_batch]
        # get text content to embed
        content = [
            build_chunk(
                title=x["title"], content=x["content"]
            ) for x in metadata_batch
        ]
        # embed text
        embeds = encoder(content)
        # add to Pinecone
        index.upsert(vectors=list(zip(ids, embeds, metadata_batch)))

Now that we have our chunks stored we can go ahead and begin querying against them...

In [None]:
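def query(text: str):
    # embed the query with the same encoder used for indexing
    xq = encoder([text])[0]
    matches = index.query(
        vector=xq,
        top_k=3,
        include_metadata=True
    )
    chunks = []
    for m in matches["matches"]:
        content = m["metadata"]["content"]
        title = m["metadata"]["title"]
        pre = m["metadata"]["prechunk_id"]
        post = m["metadata"]["postchunk_id"]
        # first/last chunks have an empty neighbor ID, so only fetch real IDs
        neighbor_ids = [x for x in (pre, post) if x]
        other_chunks = index.fetch(ids=neighbor_ids)["vectors"] if neighbor_ids else {}
        prechunk = other_chunks[pre]["metadata"]["content"] if pre in other_chunks else ""
        postchunk = other_chunks[post]["metadata"]["content"] if post in other_chunks else ""
        # build the chunk without the stray indentation of a multi-line f-string
        chunk = f"# {title}\n\n{prechunk[-400:]}\n{content}\n{postchunk[:400]}"
        chunks.append(chunk)
    return chunks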

In [None]:
query("what are large language models?")

Once finished, you can delete your Pinecone index to save resources — _**but careful, this cannot be recovered without rerunning the index process!**_

In [None]:
answer = input("Type 'y' to confirm deletion of the index...\n>> ")
if answer == "y":
    pc.delete_index(index_name)
    print("Index Deleted!")
else:
    print("Deletion Cancelled")

---