**Semantic Chunking with Chonkie and Model2Vec**

Semantic chunking is the task of splitting a text at its semantic boundaries, so that each chunk covers one coherent topic. In this tutorial, we will use the [Chonkie](https://github.com/bhavnicksm/chonkie) library to perform semantic chunking on the book War and Peace. Chonkie is a lightweight and fast library for semantic chunking with pre-trained models. It supports our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) out of the box, which we will be using in this tutorial.
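
To make the idea of a semantic boundary concrete, here is a minimal sketch that starts a new chunk whenever a sentence is dissimilar to the one before it. It is not how Chonkie works internally: the bag-of-words `embed` function, the `toy_semantic_chunks` helper, and the example sentences are all assumptions made for illustration, and a real setup would use a proper embedding model.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_semantic_chunks(sentences: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Append a sentence to the current chunk if it resembles the previous sentence."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])  # similarity dropped: open a new chunk
    return chunks


sentences = [
    "The army crossed the river at dawn.",
    "The army halted beyond the river.",
    "A countess poured tea for her guests.",
    "Her guests praised the tea.",
]
for chunk in toy_semantic_chunks(sentences):
    print(chunk)
```

With these four sentences, the two about the army end up in one chunk and the two about the tea in another, because the third sentence shares no words with the second.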

After chunking our text, we will be using [Vicinity](https://github.com/MinishLab/vicinity), a lightweight nearest neighbors library, to create an index of our chunks and query them.

In [None]:
# Install the necessary libraries
!pip install chonkie datasets model2vec numpy requests tqdm vicinity

# Import the necessary libraries
import random
import re
import requests
from time import perf_counter
from chonkie import SDPMChunker
from model2vec import StaticModel
from vicinity import Vicinity

random.seed(0)

**Loading and pre-processing**

First, we will download War and Peace and apply some basic pre-processing.

In [None]:
# URL for War and Peace on Project Gutenberg
url = "https://www.gutenberg.org/files/2600/2600-0.txt"

# Download the book and fail early if the request did not succeed
response = requests.get(url)
response.raise_for_status()
book_text = response.text

def preprocess_text(text: str, min_length: int = 5) -> str:
    """Flatten line breaks, split into sentences, and drop sentences with fewer than min_length words."""
    text = text.replace("\n", " ")
    text = text.replace("\r", " ")
    sentences = re.findall(r'[^.!?]*[.!?]', text)
    # Filter out sentences shorter than the specified minimum length
    filtered_sentences = [sentence.strip() for sentence in sentences if len(sentence.split()) >= min_length]
    # Recombine the filtered sentences
    return ' '.join(filtered_sentences)

# Preprocess the text
book_text = preprocess_text(book_text)
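
The sentence pattern used in `preprocess_text` is deliberately simple, so it helps to see what it does on a small input. The sketch below applies the same regular expression to a made-up sample string (the sample and the `SENTENCE_PATTERN` name are assumptions for illustration) and prints the word count that the `min_length` filter would look at.

```python
import re

# Same pattern as in preprocess_text: any run of text ending in ., ! or ?
SENTENCE_PATTERN = r'[^.!?]*[.!?]'

sample = (
    "He left. Tomorrow our Emperor will send a St. George's Cross "
    "to the bravest soldier. No terminator here"
)

pieces = [piece.strip() for piece in re.findall(SENTENCE_PATTERN, sample)]
for piece in pieces:
    # Pieces with fewer than 5 words would be dropped by the default min_length
    print(f"{len(piece.split()):>2} words | {piece}")
```

Three things are worth noticing: the two-word sentence falls below the default `min_length` of 5, the abbreviation "St." is treated as a sentence end, and the trailing text without a terminator is not matched at all.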

**Chunking with Chonkie**

Next, we will use Chonkie to chunk our text into semantic chunks.

In [65]:
# Initialize an SDPMChunker (semantic double-pass merging) from Chonkie with the potion-base-8M model
chunker = SDPMChunker(
    embedding_model="minishlab/potion-base-8M",
    similarity_threshold=0.3,
    skip_window=5,
    chunk_size=256,
)

# Chunk the text and time how long it takes (in seconds)
start = perf_counter()
chunks = chunker.chunk(book_text)
print(f"Number of chunks: {len(chunks)}")
print(f"Time taken: {perf_counter() - start}")

Number of chunks: 6148
Time taken: 2.2917541670030914


And that's it: we chunked the entirety of War and Peace in about 2.3 seconds. Not bad! Let's look at some example chunks.

In [66]:
# Print a few example chunks
for _ in range(5):
    chunk = random.choice(chunks)
    print(chunk.text, "\n")

 ”    “I can’t think what the servants are about,” said the countess, turning  to her husband. 

 He was received in the best houses not merely as a doctor, but  as an equal. 

 “That’s enough, Natásha,” said Sónya. 

 Pierre did not catch what they were saying, but knew they were talking  about him. He reddened and turned away. “Well, now to the health of handsome women! ” said Dólokhov, and  with a serious expression, but with a smile lurking at the corners of  his mouth, he turned with his glass to Pierre. “Here’s to the health of lovely women, Peterkin—and their  lovers! Pierre, with downcast eyes, drank out of his glass without looking at  Dólokhov or answering him. The footman, who was distributing leaflets  with Kutúzov’s cantata, laid one before Pierre as one of the  principal guests. He was just going to take it when Dólokhov, leaning  across, snatched it from his hand and began reading it. Pierre looked  at Dólokhov and his eyes dropped, the something terrible and monstrous  
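
The samples above range from a single short sentence to a long passage, so a quick size summary can be a useful sanity check. The sketch below is one way to do that; `summarize_lengths` is a hypothetical helper, it counts whitespace-separated words as a rough proxy for tokens, and the two stand-in strings would be replaced by `[chunk.text for chunk in chunks]` in the notebook.

```python
from statistics import mean, median


def summarize_lengths(texts: list[str]) -> dict[str, float]:
    """Summarize text sizes in whitespace-separated words (a rough proxy for tokens)."""
    lengths = [len(text.split()) for text in texts]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
    }


# Stand-in strings; in the notebook this could be [chunk.text for chunk in chunks]
example_texts = [
    "“That’s enough, Natásha,” said Sónya.",
    "He was received in the best houses not merely as a doctor, but as an equal.",
]
print(summarize_lengths(example_texts))
```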

Those look good. Next, let's create a vector search index with Vicinity and Model2Vec.

**Creating a vector search index**

In [67]:
# Initialize an embedding model and encode the chunk texts
start = perf_counter()
model = StaticModel.from_pretrained("minishlab/potion-base-8M")
chunk_texts = [chunk.text for chunk in chunks]
chunk_embeddings = model.encode(chunk_texts)

# Create a Vicinity instance that maps each embedding to its chunk text
vicinity = Vicinity.from_vectors_and_items(vectors=chunk_embeddings, items=chunk_texts)
print(f"Time taken: {perf_counter() - start}")

Time taken: 1.5817922909918707


Done! We embedded all our chunks and created an index in ~1.6 seconds. Now that we have our index, let's run a few queries against it.
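
Conceptually, querying a nearest neighbor index means scoring the query vector against every stored vector and keeping the best matches. The sketch below shows that idea as an exact, brute-force cosine search in plain Python. It is an illustration under our own assumptions, not Vicinity's implementation: `brute_force_query` and the toy vectors are hypothetical, and it returns similarities, whereas a library may report distances instead.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_query(query, vectors, items, k=3):
    """Return the k (item, similarity) pairs most similar to the query, best first."""
    scored = [(item, cosine_similarity(query, vector)) for item, vector in zip(items, vectors)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional "embeddings" standing in for real chunk embeddings
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
toy_items = ["chunk A", "chunk B", "chunk C"]

for item, score in brute_force_query([1.0, 0.1], toy_vectors, toy_items, k=2):
    print(f"{score:.3f}  {item}")
```

A brute-force scan like this is exact but scales linearly with the number of chunks, which is fine for a few thousand vectors like we have here.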

**Querying the index**

In [68]:
queries = ["Emperor Napoleon", "The battle of Austerlitz", "Paris"]
for query in queries:
    print(f"Query: {query}\n{'-' * 50}")
    query_embedding = model.encode(query)
    results = vicinity.query(query_embedding, k=3)[0]

    for result in results:
        print(result[0], "\n")

Query: Emperor Napoleon
--------------------------------------------------
 In 1808 the Emperor Alexander went to Erfurt for a fresh interview with  the Emperor Napoleon, and in the upper circles of Petersburg there was  much talk of the grandeur of this important meeting. CHAPTER XXII    In 1809 the intimacy between “the world’s two arbiters,” as  Napoleon and Alexander were called, was such that when Napoleon declared  war on Austria a Russian corps crossed the frontier to co-operate with  our old enemy Bonaparte against our old ally the Emperor of Austria, and  in court circles the possibility of marriage between Napoleon and one  of Alexander’s sisters was spoken of. 

 ) “It’s in the Emperor’s  service. 

 “The day before yesterday it was ‘Napoléon, France,  bravoure’; yesterday, ‘Alexandre, Russie, grandeur. ’ One day our  Emperor gives it and next day Napoleon. Tomorrow our Emperor will send  a St. 

Query: The battle of Austerlitz
-----------------------------------------------

These indeed look like relevant chunks, nice! That's it for this tutorial. We were able to chunk and index War and Peace in under 4 seconds (about 2.3 seconds for chunking and 1.6 seconds for embedding and indexing) and then query it, using Chonkie, Vicinity, and Model2Vec. Lightweight and fast, just how we like it.
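
The timings in this tutorial come from separate `perf_counter` calls in each cell, and the query step was not timed. If you want a per-step breakdown in one place, a small context manager is one option. The sketch below is an assumption of ours rather than part of any of the libraries used: `timed` and `timings` are hypothetical names, and the summed loop is only a placeholder for a real step such as chunking or querying.

```python
from contextlib import contextmanager
from time import perf_counter

# Collects elapsed seconds per labeled step
timings: dict[str, float] = {}


@contextmanager
def timed(label: str):
    """Record how long the enclosed block takes under the given label."""
    start = perf_counter()
    try:
        yield
    finally:
        timings[label] = perf_counter() - start


# Placeholder workload; in the notebook this could wrap chunker.chunk(book_text)
with timed("example step"):
    total = sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.2f} s")
```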