**Semantic Chunking with Chonkie and Model2Vec**

Semantic chunking is the task of identifying the semantic boundaries in a piece of text and splitting the text at those boundaries. In this tutorial, we will use the [Chonkie](https://github.com/bhavnicksm/chonkie) library to perform semantic chunking on the book War and Peace. Chonkie is a lightweight and fast library for semantic chunking with pre-trained models. It supports our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) out of the box, which we will be using in this tutorial.

After chunking our text, we will be using [Vicinity](https://github.com/MinishLab/vicinity), a lightweight nearest neighbors library, to create an index of our chunks and query them.
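
To make the idea of a semantic boundary concrete, here is a toy sketch of threshold-based chunking. It is not how Chonkie works internally: the bag-of-words vectors stand in for a real embedding model, and the helper names (`bow_vector`, `toy_semantic_chunks`) and the 0.5 threshold are assumptions chosen for illustration. A new chunk starts whenever two adjacent sentences are less similar than the threshold.

```python
import math
import re
from collections import Counter


def bow_vector(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Start a new chunk whenever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(bow_vector(previous), bow_vector(current)) >= threshold:
            chunks[-1].append(current)  # similar enough: extend the current chunk
        else:
            chunks.append([current])  # similarity dropped: treat this as a boundary
    return chunks


sentences = [
    "The army marched toward the river.",
    "The army crossed the river at dawn.",
    "Natasha danced at the ball.",
    "Everyone at the ball watched Natasha.",
]
for chunk in toy_semantic_chunks(sentences):
    print(chunk)
```

With these four sentences, the two about the army end up in one chunk and the two about the ball in another. A real chunker replaces the word counts with learned embeddings and typically adds further steps, such as chunk size limits.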

In [None]:
# Install the necessary libraries
!pip install chonkie datasets model2vec numpy requests tqdm vicinity

# Import the necessary libraries
import random
import re
import requests
from time import perf_counter
from chonkie import SDPMChunker
from model2vec import StaticModel
from vicinity import Vicinity

random.seed(0)

**Loading and pre-processing**

First, we will download War and Peace and apply some basic pre-processing.

In [None]:
# URL for War and Peace on Project Gutenberg
url = "https://www.gutenberg.org/files/2600/2600-0.txt"

# Download the book and fail early if the request did not succeed
response = requests.get(url)
response.raise_for_status()
book_text = response.text

def preprocess_text(text: str, min_length: int = 5) -> str:
    """Flatten line breaks, split into sentences, and drop sentences with fewer than min_length words."""
    text = text.replace("\n", " ")
    text = text.replace("\r", " ")
    sentences = re.findall(r'[^.!?]*[.!?]', text)
    # Filter out sentences shorter than the specified minimum length
    filtered_sentences = [sentence.strip() for sentence in sentences if len(sentence.split()) >= min_length]
    # Recombine the filtered sentences
    return ' '.join(filtered_sentences)

# Preprocess the text
book_text = preprocess_text(book_text)
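
To see what this pre-processing keeps and drops, the following sketch runs the same logic on a short sample. The function is repeated only so that the example runs on its own, and the sample string is made up for illustration.

```python
import re


def preprocess_text(text: str, min_length: int = 5) -> str:
    """Same logic as the function above, repeated so this example is self-contained."""
    text = text.replace("\n", " ").replace("\r", " ")
    # A "sentence" is any run of characters that ends in ., ! or ?
    sentences = re.findall(r"[^.!?]*[.!?]", text)
    kept = [s.strip() for s in sentences if len(s.split()) >= min_length]
    return " ".join(kept)


# Hypothetical sample text: two short sentences around one longer sentence
sample = "Well, Prince.\nSo Genoa and Lucca are now just family estates. No!"
cleaned = preprocess_text(sample)
print(cleaned)
```

Only the middle sentence survives, because the other two have fewer than five words. Note that the regular expression treats every period as a sentence end, so abbreviations are split as well, and any trailing text without closing punctuation is discarded.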

**Chunking with Chonkie**

Next, we will use Chonkie to chunk our text into semantic chunks.

In [45]:
# Initialize an SDPMChunker from Chonkie with the potion-base-8M model
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",
    similarity_threshold=0.3
)

# Chunk the text
start = perf_counter()
chunks = chunker.chunk(book_text)
print(f"Number of chunks: {len(chunks)}")
print(f"Time taken: {perf_counter() - start}")

Number of chunks: 7261
Time taken: 2.201361084007658


And that's it: we chunked the entirety of War and Peace in ~2.2 seconds. Not bad! Let's look at some example chunks.

In [49]:
# Print a few example chunks
for _ in range(5):
    chunk = random.choice(chunks)
    print(chunk.text, "\n")

 And what role is your young monarch  playing in that monstrous crowd? 

 How can you chuck it in like that or shove it under the cord  where it’ll get rubbed? 

 The general’s face clouded, his lips quivered and trembled. He took  out a notebook, hurriedly scribbled something in pencil, tore out the  leaf, gave it to Kozlóvski, stepped quickly to the window, and threw  himself into a chair, gazing at those in the room as if asking, “Why  do they look at me? ” Then he lifted his head, stretched his neck as  if he intended to say something, but immediately, with affected  indifference, began to hum to himself, producing a queer sound which  immediately broke off. 

 “I like your being businesslike about  it. ”    And patting Berg on the shoulder he got up, wishing to end the  conversation. But Berg, smiling pleasantly, explained that if he did not  know for certain how much Véra would have and did not receive at least  part of the dowry in advance, he would have to break matters off. “B

Those look good. Next, let's create a vector search index with Vicinity and Model2Vec.

**Creating a vector search index**

In [47]:
# Initialize an embedding model and encode the chunk texts
start = perf_counter()
model = StaticModel.from_pretrained("minishlab/potion-base-8M")
chunk_texts = [chunk.text for chunk in chunks]
chunk_embeddings = model.encode(chunk_texts)

# Create a Vicinity instance
vicinity = Vicinity.from_vectors_and_items(vectors=chunk_embeddings, items=chunk_texts)
print(f"Time taken: {perf_counter() - start}")

Time taken: 1.6793621249962598


Done! We embedded all our chunks and created an index in ~1.7 seconds. Now that we have our index, let's run a few queries against it.
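
Conceptually, a nearest neighbor query compares the query vector against every stored vector and returns the closest items. The sketch below shows that idea as a brute-force search in plain Python. It is not Vicinity's implementation: the function names and the tiny 2-dimensional vectors are made up for illustration, and the `(item, distance)` result shape is an assumption that mirrors how results are indexed with `result[0]` in this tutorial.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_query(query, vectors, items, k=3):
    """Return the k closest items as (item, cosine distance) pairs, nearest first."""
    scored = [
        (item, 1.0 - cosine_similarity(query, vector))
        for item, vector in zip(items, vectors)
    ]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy index with three items
items = ["war", "peace", "ball"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]

results = brute_force_query([1.0, 0.1], vectors, items, k=2)
for item, distance in results:
    print(item, round(distance, 3))
```

The query vector points almost in the same direction as "war", so that item comes first, followed by "ball". Scanning every vector like this costs time linear in the number of items, which is fine for a few thousand chunks.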

**Querying the index**

In [48]:
queries = ["Napoleon", "The battle of Austerlitz", "Paris"]
for query in queries:
    print(f"Query: {query}\n{'-' * 50}")
    query_embedding = model.encode(query)
    results = vicinity.query(query_embedding, k=3)[0]

    for result in results:
        print(result[0], "\n")

Query: Napoleon
--------------------------------------------------
 Why, that must be  Napoleon’s own. 

 That Napoleon  has left Moscow? 

 Napoleon was to enter the  town next day. 

Query: The battle of Austerlitz
--------------------------------------------------
 I remember his limited, self-satisfied face on the  field of Austerlitz. 

 That  city is taken; the Russian army suffers heavier losses than the opposing  armies had suffered in the former war from Austerlitz to Wagram. 

 Behave as you did at  Austerlitz, Friedland, Vítebsk, and Smolénsk. 

Query: Paris
--------------------------------------------------
 “I have been in Paris. 

 A man who doesn’t know Paris  is a savage. You can tell a Parisian two leagues off. Paris is Talma, la  Duchénois, Potier, the Sorbonne, the boulevards,” and noticing that  his conclusion was weaker than what had gone before, he added quickly:  “There is only one Paris in the world. You have been to Paris and have  remained Russian. 

 It rises

These indeed look like relevant chunks, nice! That's it for this tutorial. We were able to chunk and index War and Peace in under 4 seconds (about 2.2 seconds for chunking and 1.7 seconds for embedding and indexing) using Chonkie, Vicinity, and Model2Vec, and then query the index. Lightweight and fast, just how we like it.