**Semantic Chunking with Chonkie and Model2Vec**

Semantic chunking is the task of identifying the semantic boundaries in a piece of text and splitting the text at those boundaries. In this tutorial, we will use the [Chonkie](https://github.com/bhavnicksm/chonkie) library to perform semantic chunking on the book War and Peace. Chonkie is a lightweight, fast library for semantic chunking with pre-trained models. It supports our [potion models](https://huggingface.co/collections/minishlab/potion-6721e0abd4ea41881417f062) out of the box, which we will use in this tutorial.

After chunking our text, we will use [Vicinity](https://github.com/MinishLab/vicinity), a lightweight nearest-neighbors library, to build an index of our chunks and query it.

In [None]:
# Install the necessary libraries
!pip install chonkie model2vec numpy requests vicinity

# Import the necessary libraries
import random 
import re
import requests
from chonkie import SemanticChunker
from model2vec import StaticModel
from vicinity import Vicinity

random.seed(0)

**Loading and pre-processing**

First, we will download War and Peace and apply some basic pre-processing.

In [None]:
# URL for War and Peace on Project Gutenberg
url = "https://www.gutenberg.org/files/2600/2600-0.txt"

# Download the book and fail early on HTTP errors
response = requests.get(url, timeout=30)
response.raise_for_status()
book_text = response.text

def preprocess_text(text: str, min_length: int = 5) -> str:
    """Flatten line breaks, split into sentences, and drop sentences with fewer than min_length words."""
    text = text.replace("\n", " ")
    text = text.replace("\r", " ")
    sentences = re.findall(r'[^.!?]*[.!?]', text)
    # Filter out sentences shorter than the specified minimum length
    filtered_sentences = [sentence.strip() for sentence in sentences if len(sentence.split()) >= min_length]
    # Recombine the filtered sentences
    return ' '.join(filtered_sentences)

# Preprocess the text
book_text = preprocess_text(book_text)
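
To see what the pre-processing step keeps and drops, here is a small sketch that applies the same regex and word-count filter to a made-up sample string. The helper name `split_and_filter` and the sample text are assumptions for illustration; the helper returns the list of sentences instead of joining them so the effect is easier to inspect.

```python
import re


def split_and_filter(text: str, min_length: int = 5) -> list[str]:
    """Split on '.', '!' or '?' and keep sentences with at least min_length words."""
    text = text.replace("\n", " ").replace("\r", " ")
    # Each match is a run of non-terminator characters followed by one terminator
    sentences = re.findall(r"[^.!?]*[.!?]", text)
    return [s.strip() for s in sentences if len(s.split()) >= min_length]


# Made-up sample text, not taken from the downloaded file
sample = "Well, Prince. Genoa and Lucca are now just family estates!\nDo you not think so at all?"
kept = split_and_filter(sample)
print(kept)
# ['Genoa and Lucca are now just family estates!', 'Do you not think so at all?']
```

The two-word sentence "Well, Prince." falls below the default `min_length` of 5 and is dropped. Note that this regex treats every period as a sentence end, so abbreviations such as "Mr." are also split off as their own short fragments.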

**Chunking with Chonkie**

Next, we will use Chonkie to chunk our text into semantic chunks.
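
To build some intuition for the `similarity_threshold` parameter used in the next cell, here is a toy sketch of threshold-based grouping: consecutive sentences stay in the same chunk while their embeddings are similar enough, and a new chunk starts when the similarity drops below the threshold. This is a simplified illustration, not Chonkie's actual algorithm; the function `group_by_similarity` and the hand-made 2-D vectors are assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(
    sentences: list[str], vectors: list[list[float]], threshold: float
) -> list[list[str]]:
    """Start a new chunk whenever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) >= threshold:
            chunks[-1].append(sentence)  # similar enough: extend the current chunk
        else:
            chunks.append([sentence])  # similarity dropped: open a new chunk
    return chunks


# Hypothetical sentences with hand-made toy embeddings (not real model output)
sentences = ["The army marched.", "The soldiers were tired.", "Natasha danced at the ball."]
vectors = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]

toy_chunks = group_by_similarity(sentences, vectors, threshold=0.3)
print(toy_chunks)
```

With these toy vectors, the first two sentences end up in one chunk and the third starts a new one. In this sketch, a higher threshold produces more, smaller chunks, and a lower threshold produces fewer, larger ones.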

In [14]:
# Initialize a SemanticChunker from Chonkie with the potion-base-8M model
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",
    similarity_threshold=0.3
)

# Chunk the text
chunks = chunker.chunk(book_text)

And that's it: we chunked the entirety of War and Peace in about 3 seconds. Not bad! Let's look at some example chunks.

In [21]:
# Print a few example chunks
for _ in range(5):
    chunk = random.choice(chunks)
    print(chunk.text, "\n")

 “He is sleeping well as it is, after a sleepless night. 

 In the  yard, at the gates, at the window of the wings, wounded officers and  their orderlies were to be seen. 

 Toward dawn, Count Orlóv-Denísov, who had dozed off, was awakened by a  deserter from the French army being brought to him. This was a Polish  sergeant of Poniatowski’s corps, who explained in Polish that he had  come over because he had been slighted in the service: that he ought  long ago to have been made an officer, that he was braver than any of  them, and so he had left them and wished to pay them out. He said that  Murat was spending the night less than a mile from where they were,  and that if they would let him have a convoy of a hundred men he would  capture him alive. Count Orlóv-Denísov consulted his fellow officers. 

 But before the words were well out of his mouth, his cap flew off and a  fierce blow jerked his head to one side. 

 Any guard might arrest him, but by  strange chance no one does so and

Those look good. Next, let's create a vector search index with Vicinity and Model2Vec.

**Creating a vector search index**

In [11]:
# Initialize an embedding model and encode the chunk texts
model = StaticModel.from_pretrained("minishlab/potion-base-8M")
chunk_texts = [chunk.text for chunk in chunks]
chunk_embeddings = model.encode(chunk_texts)

# Create a Vicinity instance
vicinity = Vicinity.from_vectors_and_items(vectors=chunk_embeddings, items=chunk_texts)
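
Conceptually, a nearest-neighbor index answers one question: which stored vectors are closest to a query vector? The sketch below shows a brute-force version of that lookup with cosine distance in plain Python. It is only meant to illustrate the idea and is not how Vicinity is implemented; the function `top_k`, the item labels, and the 2-D vectors are all made up for this example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(
    query: list[float], vectors: list[list[float]], items: list[str], k: int
) -> list[tuple[str, float]]:
    """Return the k items whose vectors are closest to the query, nearest first."""
    scored = [(item, cosine_distance(query, vec)) for item, vec in zip(items, vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:k]


# Hypothetical items with hand-made toy vectors (not real embeddings)
items = ["chunk about Napoleon", "chunk about Paris", "chunk about Natasha"]
vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

results = top_k([0.9, 0.1], vectors, items, k=2)
for item, distance in results:
    print(f"{item}: {distance:.3f}")
```

Brute-force search compares the query against every stored vector, which is fine for a toy example but grows linearly with the number of chunks.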

Now that we have our index, let's run a few queries against it.

**Querying the index**

In [None]:
queries = ["Napoleon", "The battle of Austerlitz", "Paris"]
for query in queries:
    print(f"Query: {query}\n{'-' * 50}")
    query_embedding = model.encode(query)
    results = vicinity.query(query_embedding, k=3)[0]

    for result in results:
        print(result[0], "\n")

Query: Napoleon
--------------------------------------------------
 He is alive,” said Napoleon. 

 Why, that must be  Napoleon’s own. 

 Napoleon’s position is most brilliant. 

Query: The battle of Austerlitz
--------------------------------------------------
 On the first arrival of the news of the battle of Austerlitz, Moscow had  been bewildered. 

 I remember his limited, self-satisfied face on the  field of Austerlitz. 

 That  city is taken; the Russian army suffers heavier losses than the opposing  armies had suffered in the former war from Austerlitz to Wagram. 

Query: Paris
--------------------------------------------------
 “I have been in Paris. 

 A man who doesn’t know Paris  is a savage. You can tell a Parisian two leagues off. Paris is Talma, la  Duchénois, Potier, the Sorbonne, the boulevards,” and noticing that  his conclusion was weaker than what had gone before, he added quickly:  “There is only one Paris in the world. You have been to Paris and have  remained Rus

These look like relevant chunks, nice! That's it for this tutorial: we chunked, indexed, and queried War and Peace in less than 5 seconds using Chonkie, Vicinity, and Model2Vec.