# Knowledge-based chunker
In this notebook I show an experiment with a knowledge-based chunking mechanism. The available splitters create chunks per sentence or up to a maximum number of tokens. The problem is that a single chunk can contain multiple knowledge items. This can already happen within one sentence, and it happens even more often in longer chunks. A good RAG system needs chunks that each contain only one knowledge item, so that a query matches the most relevant chunks.
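To make the matching argument concrete, here is a minimal sketch. It assumes plain word counts as a toy stand-in for real embeddings, and the query and both chunk texts are made up for illustration, so the scores only show the direction of the effect: extra topics in a chunk dilute its similarity to a focused query.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    # Toy stand-in for an embedding: lowercase word counts.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "which programming language is used in the workshop"

# One knowledge item per chunk.
focused_chunk = "We prefer using Python as the programming language for the workshop."

# The same item mixed with two unrelated ones.
mixed_chunk = (
    "We prefer using Python as the programming language for the workshop. "
    "You will work with vector stores and Large Language Models. "
    "We dive into semantic search, the next step after keyword search."
)

focused_score = cosine(bag_of_words(query), bag_of_words(focused_chunk))
mixed_score = cosine(bag_of_words(query), bag_of_words(mixed_chunk))
print(f"focused: {focused_score:.2f}, mixed: {mixed_score:.2f}")
```

The focused chunk scores higher than the mixed one, even though the mixed chunk contains the same sentence.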

In [71]:
import json
import re
from typing import List

from dotenv import load_dotenv
from openai import OpenAI
from rag4p.indexing.input_document import InputDocument
from rag4p.indexing.splitter import Splitter
from rag4p.indexing.splitters.max_token_splitter import MaxTokenSplitter
from rag4p.integrations.openai.openai_embedder import OpenAIEmbedder
from rag4p.rag.model.chunk import Chunk
from rag4p.rag.store.local.internal_content_store import InternalContentStore
from rag4p.util.key_loader import KeyLoader

load_dotenv()
key_loader = KeyLoader()
print(f"OpenAI key is available: {key_loader.get_openai_api_key() is not None}")

OpenAI key is available: True


## Input text used to test the knowledge-based chunker

In [58]:
input_text = """Ever thought about building your very own question-answering system? Like the one that powers Siri, Alexa, or Google Assistant? Well, we've got something awesome lined up for you! In our hands-on workshop, we'll guide you through the ins and outs of creating a question-answering system. We prefer using Python for the workshop. We have prepared a GUI that works with python. If you prefer another language, you can still do the workshop, but you will miss the GUI to test your application.

You'll get your hands dirty with vector stores and Large Language Models, we help you combine these two in a way you've never done before. You've probably used search engines for keyword-based searches, right? Well, prepare to have your mind blown. We'll dive into something called semantic search, which is the next big thing after traditional searches. It’s like moving from asking Google to search "best pizza places" to "Where can I find a pizza place that my gluten-intolerant, vegan friend would love?" – you get the idea, right?
 
We’ll be teaching you how to build an entire pipeline, starting from collecting data from various sources, converting that into vectors (yeah, it’s more math, but it’s cool, we promise), and storing it so you can use it to answer all sorts of queries. It's like building your own mini Google!

We've got a repository ready to help you set up everything you need on your laptop. By the end of our workshop, you'll have your question-answering system ready and running. So, why wait? Grab your laptop, bring your coding hat, and let's start building something fantastic together. Trust us, it’s going to be a blast!

Some of the highlights of the workshop: 
- Use a vector store (OpenSearch, Elasticsearch, Weaviate)
- Use a Large Language Model (OpenAI, HuggingFace, Cohere, PaLM, Bedrock)
- Use a tool for content extraction (Unstructured, Llama)
- Create your pipeline (Langchain, Custom)
"""

In [52]:
splitter = MaxTokenSplitter(max_tokens=200)
chunks = splitter.split(InputDocument(document_id="input-doc", text=input_text, properties={}))

for chunk in chunks:
    print(f"Chunk: {chunk.chunk_id} \n {chunk.chunk_text}")
    print("----")

Chunk: 0 
 Ever thought about building your very own question-answering system? Like the one that powers Siri, Alexa, or Google Assistant? Well, we've got something awesome lined up for you! In our hands-on workshop, we'll guide you through the ins and outs of creating a question-answering system. We prefer using Python for the workshop. We have prepared a GUI that works with python. If you prefer another language, you can still do the workshop, but you will miss the GUI to test your application.

You'll get your hands dirty with vector stores and Large Language Models, we help you combine these two in a way you've never done before. You've probably used search engines for keyword-based searches, right? Well, prepare to have your mind blown. We'll dive into something called semantic search, which is the next big thing after traditional searches. It’s like moving from asking Google to search "best pizza places" to "Where can I find a pizza place that
----
Chunk: 1 
  my gluten-intoleran
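The output above shows the first chunk ending in the middle of a sentence. The sketch below illustrates why a fixed token budget does that. It is not the rag4p implementation: `split_by_max_tokens` is a hypothetical helper, and it assumes whitespace-separated words as tokens, whereas a real splitter presumably counts model tokens.

```python
from typing import List


def split_by_max_tokens(text: str, max_tokens: int) -> List[str]:
    # Assumption: whitespace-separated words stand in for model tokens.
    tokens = text.split()
    # Cut every max_tokens tokens, regardless of sentence boundaries.
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


sample = (
    "We prefer using Python for the workshop. "
    "We have prepared a GUI that works with Python."
)
parts = split_by_max_tokens(sample, max_tokens=10)
for part in parts:
    print(repr(part))
```

With a budget of 10 words, the second sentence is cut after "We have prepared", so neither chunk holds that knowledge item in full.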

In [56]:
class SectionSplitter(Splitter):
    def split(self, input_document: InputDocument) -> List[Chunk]:
        sections = re.split(r"\n\s*\n", input_document.text)
        print(f"Num sections: {len(sections)}")

        chunks_ = []
        for i, section in enumerate(sections):
            chunk_ = Chunk(input_document.document_id, i, len(sections), section, input_document.properties)
            chunks_.append(chunk_)

        return chunks_

    @staticmethod
    def name() -> str:
        return SectionSplitter.__name__

In [59]:
splitter = SectionSplitter()
chunks = splitter.split(InputDocument(document_id="input-doc", text=input_text, properties={}))
for chunk in chunks:
    print(f"Chunk: {chunk.chunk_id}, Num chunks: {chunk.total_chunks} \n {chunk.chunk_text}")
    print("----")

Num sections: 5
Chunk: 0, Num chunks: 5 
 Ever thought about building your very own question-answering system? Like the one that powers Siri, Alexa, or Google Assistant? Well, we've got something awesome lined up for you! In our hands-on workshop, we'll guide you through the ins and outs of creating a question-answering system. We prefer using Python for the workshop. We have prepared a GUI that works with python. If you prefer another language, you can still do the workshop, but you will miss the GUI to test your application.
----
Chunk: 1, Num chunks: 5 
 You'll get your hands dirty with vector stores and Large Language Models, we help you combine these two in a way you've never done before. You've probably used search engines for keyword-based searches, right? Well, prepare to have your mind blown. We'll dive into something called semantic search, which is the next big thing after traditional searches. It’s like moving from asking Google to search "best pizza places" to "Where can I
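The `SectionSplitter` splits on `\n\s*\n` rather than on a plain `"\n\n"`. The small sketch below, with a made-up sample string, shows the difference: a separator line that contains only a space is still treated as a section break, while single newlines inside a bullet list are not.

```python
import re

# The line between the second and third section contains a single space.
text = "First section.\n\nSecond section.\n \nThird section.\n- item one\n- item two\n"

# Same pattern as in SectionSplitter: a newline, optional whitespace, a newline.
sections = re.split(r"\n\s*\n", text)

# A plain split on two newlines misses the whitespace-only separator line.
naive_sections = text.split("\n\n")

print(len(sections), len(naive_sections))
for section in sections:
    print(repr(section))
```

The regex version yields three sections and keeps the bullet list attached to its introducing line; the plain split yields only two.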

In [75]:
openai_client = OpenAI(api_key=key_loader.get_openai_api_key())


def fetch_knowledge_chunks(orig_chunk: Chunk) -> List[Chunk]:

    prompt = f"""Task: Extract Knowledge Chunks
    
    Please extract knowledge chunks from the following text. Each chunk should capture a distinct, self-contained unit of information in a subject-description format. Return a JSON object with a single key "knowledge_chunks" that holds an array. Each element of the array must be an object with the keys "subject" and "description".
    
    Text:
    {orig_chunk.chunk_text}
    """

    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "You are an assistant that takes apart a piece of text into semantic chunks to be used in a RAG system."},
            {"role": "user", "content": prompt},
        ],
        stream=False,
    )
    
    answer = json.loads(completion.choices[0].message.content)

    knowledge_items = answer["knowledge_chunks"]
    # Keep a reference to the source chunk so every knowledge chunk can be traced back.
    properties = {
        "original_text": orig_chunk.chunk_text,
        "original_chunk_id": orig_chunk.get_id(),
        "original_total_chunks": orig_chunk.total_chunks,
    }

    chunks_ = []
    for index, kc in enumerate(knowledge_items):
        chunk_text = f'{kc["subject"]}: {kc["description"]}'
        chunk_ = Chunk(orig_chunk.get_id(), index, len(knowledge_items), chunk_text, dict(properties))
        chunks_.append(chunk_)
        
    return chunks_
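The function above reads the `knowledge_chunks`, `subject` and `description` keys straight from the model's answer. JSON mode only guarantees that the answer is valid JSON, not that it uses particular keys, so a more defensive parse can help. The sketch below is one possible approach; `parse_knowledge_chunks` is a hypothetical helper, not part of rag4p or the OpenAI client, and the sample responses are made up.

```python
import json
from typing import Dict, List


def parse_knowledge_chunks(raw: str) -> List[Dict[str, str]]:
    data = json.loads(raw)

    if isinstance(data, dict):
        items = data.get("knowledge_chunks")
        if not isinstance(items, list):
            # Fall back to the first list-valued entry, whatever the model named it.
            items = next((v for v in data.values() if isinstance(v, list)), [])
    elif isinstance(data, list):
        items = data
    else:
        items = []

    # Keep only well-formed entries that have both expected keys.
    return [
        {"subject": str(item["subject"]), "description": str(item["description"])}
        for item in items
        if isinstance(item, dict) and "subject" in item and "description" in item
    ]


expected = '{"knowledge_chunks": [{"subject": "Semantic search", "description": "The next step after keyword search."}]}'
renamed = '{"chunks": [{"subject": "GUI", "description": "Works with Python."}, {"subject": "incomplete"}]}'

print(parse_knowledge_chunks(expected))
print(parse_knowledge_chunks(renamed))
```

Entries without both keys are dropped silently here; logging them, or retrying the request, may be the better choice in a real pipeline.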

In [80]:
knowledge_chunks = fetch_knowledge_chunks(chunks[1])
for kc in knowledge_chunks:
    print(f"Chunk: {kc.get_id()}, Num chunks: {kc.total_chunks} \n {kc.chunk_text} \n Original: {kc.properties['original_text']}")
    print("----")

Chunk: input-doc_1_0, Num chunks: 5 
 Hands-on experience: You'll get your hands dirty with vector stores and Large Language Models. 
 Original: You'll get your hands dirty with vector stores and Large Language Models, we help you combine these two in a way you've never done before. You've probably used search engines for keyword-based searches, right? Well, prepare to have your mind blown. We'll dive into something called semantic search, which is the next big thing after traditional searches. It’s like moving from asking Google to search "best pizza places" to "Where can I find a pizza place that my gluten-intolerant, vegan friend would love?" – you get the idea, right?
----
Chunk: input-doc_1_1, Num chunks: 5 
 Combining vector stores and Large Language Models: We help you combine these two in a way you've never done before. 
 Original: You'll get your hands dirty with vector stores and Large Language Models, we help you combine these two in a way you've never done before. You've pr

In [None]:
from rag4p.integrations.openai import EMBEDDING_SMALL

# Create an in memory content store to hold some chunks
openai_embedder = OpenAIEmbedder(api_key=key_loader.get_openai_api_key(), embedding_model=EMBEDDING_SMALL)
content_store = InternalContentStore(embedder=openai_embedder, metadata=None)