# 3.a Basic RAG

In this notebook you will see:
- How to create a basic RAG workflow
- How to add naive query expansion

Note that here a lot of improvements can be imagined, that depend on the use case (metadata usage, query contextualization, ...).

What we did in previous sections is reused.

# Setup

In [2]:
import os
import json
import re
import shutil

from docling.document_converter import DocumentConverter

from conversational_toolkit.vectorstores.chromadb import ChromaDBVectorStore
from conversational_toolkit.llms.base import LLMMessage, Roles
from conversational_toolkit.embeddings.openai import OpenAIEmbeddings
from conversational_toolkit.llms.openai import OpenAILLM
from conversational_toolkit.chunking.base import Chunk

from utils.specific_chunker import SpecificCharChunker

  from .autonotebook import tqdm as notebook_tqdm


Consider using the pymupdf_layout package for a greatly improved page layout analysis.


In [4]:
path_to_docs = "data/docs"
path_to_document = os.path.join(path_to_docs, "alexnet_paper.pdf")

path_to_db = "data/db"
path_to_vectorstore = os.path.join(path_to_db, "example.db")

In [5]:
doc_converter = DocumentConverter()

conv_res = doc_converter.convert(path_to_document)
md = conv_res.document.export_to_markdown()

# replace \n per " ", as often just new lines
md = re.sub(r"(?<!\n)\n(?!\n)", " ", md)

doc_title_to_document = {"alexnet_paper.pdf": md}

chunker = SpecificCharChunker()
chunks = chunker.make_chunks(
    split_characters=["\n\n\n", "\n\n", "\n"],
    document_to_text=doc_title_to_document,
    max_number_of_characters=1024,
)

if os.path.exists(path_to_vectorstore):
    shutil.rmtree(path_to_vectorstore)
embedding_model = OpenAIEmbeddings(model_name="text-embedding-3-small")
embeddings = await embedding_model.get_embeddings([c.content for c in chunks])

vector_store = ChromaDBVectorStore(path_to_vectorstore)

await vector_store.insert_chunks(chunks=chunks, embedding=embeddings)

[32m[INFO] 2026-02-26 15:23:50,242 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-02-26 15:23:50,249 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-02-26 15:23:50,249 [RapidOCR] main.py:53: Using C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx[0m
[32m[INFO] 2026-02-26 15:23:50,330 [RapidOCR] base.py:22: Using engine_name: onnxruntime[0m
[32m[INFO] 2026-02-26 15:23:50,332 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\models\ch_ppocr_mobile_v2.0_cls_infer.onnx[0m
[32m[INFO] 2026-02-26 15:23:50,333 [RapidOCR] main.py:53: Using C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\mod

# Define the blocks

In [6]:
# We already have the embeddings model and the vector store

# We need the LLM
llm = OpenAILLM(model_name="gpt-4.1-mini")

2026-02-26 15:24:02.900 | DEBUG    | conversational_toolkit.llms.openai:__init__:63 - OpenAI LLM loaded: gpt-4.1-mini; temperature: 0.5; seed: 42; tools: None; tool_choice: None; response_format: {'type': 'text'}


In [7]:
# Also, a function is required to convert the chunks into texts to be used in the prompt template
def chunks_to_text(chunks: list[Chunk]) -> str:
    text = ""

    for chunk in chunks:
        text += (
            f"## Chunk {chunk.title}:\n```\n{chunk.content}\n```\n" + "-" * 30 + "\n\n"
        )

    # Remove the last separator
    text = text[:-4]

    return text

In [8]:
# We also need a system prompt for the RAG agent
system_prompt = """You are a helpful assistant that answers questions using the provided document excerpts (chunks).

At each step, you will receive several chunks that are relevant to the user's question. Your task is to produce the best possible answer using only the information contained in those chunks.

Rules you must follow:
- Use the chunks as your only source of truth. Do not rely on outside knowledge.
- Use all relevant chunks when forming your answer. Do not ignore any provided information.
- If the answer cannot be found in the chunks, clearly say that you do not know.
- Keep your answer concise and focused, without unnecessary details.
- Cite your sources from the provided chunks."""

In [9]:
# Then a prompt template for the RAG agent
prompt_template = """# User question:\n{question}\n\n# Relevant chunks:\n{chunks}\n\n# Your answer:\n\n"""

In [10]:
print(
    prompt_template.format(
        question="What is the main contribution of the paper?",
        chunks=chunks_to_text(chunks[:3]),
    )
)

# User question:
What is the main contribution of the paper?

# Relevant chunks:
## Chunk 1:
```
## ImageNet Classification with Deep Convolutional Neural Networks
```
------------------------------

## Chunk 2:
```
Alex Krizhevsky University of Toronto kriz@cs.utoronto.ca
```
------------------------------

## Chunk 3:
```
Ilya Sutskever University of Toronto ilya@cs.utoronto.ca
```
----------------------------

# Your answer:




# Merging all

In practice, this will be abstracted

In [11]:
# Get the relevant chunks
query = "What are the top-1 and top-5 scores obtained on 'ILSVRC-2010'?"
query_embedding = await embedding_model.get_embeddings([query])
retrieved_chunks = await vector_store.get_chunks_by_embedding(
    embedding=query_embedding, top_k=5
)

2026-02-26 15:24:03.124 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (1, 1024)


In [12]:
# Prepare conversation for the LLM
system_prompt_message = LLMMessage(content=system_prompt, role=Roles.SYSTEM)
filled_template = prompt_template.format(
    question=query, chunks=chunks_to_text(retrieved_chunks)
)
query_message = LLMMessage(content=filled_template, role=Roles.USER)

conversation = [system_prompt_message, query_message]

In [13]:
answer = await llm.generate(conversation)

2026-02-26 15:24:05.057 | DEBUG    | conversational_toolkit.llms.openai:generate:87 - Completion: ChatCompletion(id='chatcmpl-DDWZX7GkIhpDo6hebBjYN43uFiGFn', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The top-1 and top-5 test set error rates obtained on ILSVRC-2010 by the network are 37.5% and 17.0%, respectively. These results outperform the best performance achieved during the ILSVRC-2010 competition, which were 47.1% top-1 error and 28.2% top-5 error, as well as the best published results before this work, which were 45.7% top-1 error and 25.7% top-5 error [Chunk 67].', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1772115843, model='gpt-4.1-mini-2025-04-14', object='chat.completion', service_tier='default', system_fingerprint='fp_a391f2cee0', usage=CompletionUsage(completion_tokens=115, prompt_tokens=708, total_tokens=823, completion_tokens_details=CompletionToken

In [14]:
print(answer.content)

The top-1 and top-5 test set error rates obtained on ILSVRC-2010 by the network are 37.5% and 17.0%, respectively. These results outperform the best performance achieved during the ILSVRC-2010 competition, which were 47.1% top-1 error and 28.2% top-5 error, as well as the best published results before this work, which were 45.7% top-1 error and 25.7% top-5 error [Chunk 67].


In [15]:
# The LLM received:
print("SYSTEM PROMPT:\n")
print(system_prompt_message.content, "\n\n")
print("QUERY:\n")
print(query_message.content, "\n\n")

SYSTEM PROMPT:

You are a helpful assistant that answers questions using the provided document excerpts (chunks).

At each step, you will receive several chunks that are relevant to the user's question. Your task is to produce the best possible answer using only the information contained in those chunks.

Rules you must follow:
- Use the chunks as your only source of truth. Do not rely on outside knowledge.
- Use all relevant chunks when forming your answer. Do not ignore any provided information.
- If the answer cannot be found in the chunks, clearly say that you do not know.
- Keep your answer concise and focused, without unnecessary details.
- Cite your sources from the provided chunks. 


QUERY:

# User question:
What are the top-1 and top-5 scores obtained on 'ILSVRC-2010'?

# Relevant chunks:
## Chunk 67:
```
Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% 5 . The best performance achieved during t

# Query expansion

Asking the LLM to generate alternatives queries helps to search more broadly and improve sometimes the original query. The implementation below is a very simple version of it.

In [16]:
# The system prompt of the call to expand the query before retrieval
query_expansion_system_prompt = """You are a helpful assistant that expands user queries to be more effective for retrieving relevant information from a vector store. Your role is to take a user query and expand it by proposing several variations of the query that might be more effective for retrieval. 

Propose 5 variations of the query, ensuring that they are semantically similar but use different wording or focus on different aspects of the original query. The goal is to increase the chances of retrieving relevant information from the vector store. Here is an example:
"""

# The schema of the output of the query expansion
output_schema = {
    "type": "object",
    "name": "AnswerSchema",
    "description": "The schema for the output of the query expansion. It contains 5 variations of the original query.",
    "properties": {
        "expanded_queries": {
            "type": "array",
            "description": "List of the 5 expanded queries.",
            "items": {
                "type": "string",
            },
        },
    },
    "required": ["expanded_queries"],
    "additionalProperties": False,
}

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "AnswerSchema",
        "schema": output_schema,
    },
}

# The prompt template for the query expansion
query_expansion_prompt_template = f"""# User query:\n{query}\n\n# Expanded queries:\n"""

# The LLM instance to use for the query expansion
llm = OpenAILLM(model_name="gpt-4.1-mini", response_format=response_format)

2026-02-26 15:24:05.292 | DEBUG    | conversational_toolkit.llms.openai:__init__:63 - OpenAI LLM loaded: gpt-4.1-mini; temperature: 0.5; seed: 42; tools: None; tool_choice: None; response_format: {'type': 'json_schema', 'json_schema': {'name': 'AnswerSchema', 'schema': {'type': 'object', 'name': 'AnswerSchema', 'description': 'The schema for the output of the query expansion. It contains 5 variations of the original query.', 'properties': {'expanded_queries': {'type': 'array', 'description': 'List of the 5 expanded queries.', 'items': {'type': 'string'}}}, 'required': ['expanded_queries'], 'additionalProperties': False}}}


In [17]:
# The user query
query = "What are the top-1 and top-5 scores obtained on 'ILSVRC-2010'?"

# Call the LLM to expand the query
filled_template_query_expansion = query_expansion_prompt_template.format(query=query)
query_expansion_message = LLMMessage(
    content=filled_template_query_expansion, role=Roles.USER
)
system_prompt_query_expansion_message = LLMMessage(
    content=query_expansion_system_prompt, role=Roles.SYSTEM
)
conversation_query_expansion = [
    system_prompt_query_expansion_message,
    query_expansion_message,
]

# Get the expanded queries
query_expansion_response = await llm.generate(conversation_query_expansion)

# Extract them from the response
expanded_queries = json.loads(query_expansion_response.content)["expanded_queries"]
print(len(expanded_queries))
print(expanded_queries)

2026-02-26 15:24:07.138 | DEBUG    | conversational_toolkit.llms.openai:generate:87 - Completion: ChatCompletion(id='chatcmpl-DDWZZXiCTCed6CXsNrHeAws1zBv93', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{"expanded_queries":["What are the top-1 and top-5 accuracy scores for the ILSVRC-2010 dataset?","Can you provide the top-1 and top-5 classification results on ILSVRC-2010?","What top-1 and top-5 performance metrics were achieved in the ILSVRC 2010 challenge?","Show the best top-1 and top-5 scores reported for ILSVRC-2010.","What are the highest top-1 and top-5 scores recorded on the ImageNet Large Scale Visual Recognition Challenge 2010?"]}', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1772115845, model='gpt-4.1-mini-2025-04-14', object='chat.completion', service_tier='default', system_fingerprint='fp_a391f2cee0', usage=CompletionUsage(completion_tokens=124, prompt_to

5
['What are the top-1 and top-5 accuracy scores for the ILSVRC-2010 dataset?', 'Can you provide the top-1 and top-5 classification results on ILSVRC-2010?', 'What top-1 and top-5 performance metrics were achieved in the ILSVRC 2010 challenge?', 'Show the best top-1 and top-5 scores reported for ILSVRC-2010.', 'What are the highest top-1 and top-5 scores recorded on the ImageNet Large Scale Visual Recognition Challenge 2010?']


In [18]:
# Now let's retrieve the chunks for each of the expanded queries and see how many relevant chunks we get for each of them
retrieved_chunks: list[Chunk] = []

for expanded_query in expanded_queries:
    expanded_query_embedding = await embedding_model.get_embeddings([expanded_query])
    chunks_for_expanded_query = await vector_store.get_chunks_by_embedding(
        embedding=expanded_query_embedding, top_k=5
    )
    retrieved_chunks.extend(chunks_for_expanded_query)

    print(f"Expanded query: {expanded_query}")
    print(f"Number of retrieved chunks: {len(chunks_for_expanded_query)}")
    print("-" * 50)

# Put them all together and remove duplicates
unique_retrieved_chunks = {chunk.id: chunk for chunk in retrieved_chunks}.values()
print(f"Total number of unique retrieved chunks: {len(unique_retrieved_chunks)}")

2026-02-26 15:24:07.304 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (1, 1024)


Expanded query: What are the top-1 and top-5 accuracy scores for the ILSVRC-2010 dataset?
Number of retrieved chunks: 5
--------------------------------------------------


2026-02-26 15:24:08.089 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (1, 1024)


Expanded query: Can you provide the top-1 and top-5 classification results on ILSVRC-2010?
Number of retrieved chunks: 5
--------------------------------------------------


2026-02-26 15:24:08.625 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (1, 1024)


Expanded query: What top-1 and top-5 performance metrics were achieved in the ILSVRC 2010 challenge?
Number of retrieved chunks: 5
--------------------------------------------------


2026-02-26 15:24:08.840 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (1, 1024)
2026-02-26 15:24:09.007 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (1, 1024)


Expanded query: Show the best top-1 and top-5 scores reported for ILSVRC-2010.
Number of retrieved chunks: 5
--------------------------------------------------
Expanded query: What are the highest top-1 and top-5 scores recorded on the ImageNet Large Scale Visual Recognition Challenge 2010?
Number of retrieved chunks: 5
--------------------------------------------------
Total number of unique retrieved chunks: 10


In [19]:
# Now, we can go back to the original prompt template and prepare the conversation for the LLM with the retrieved chunks from all the expanded queries
# Prepare conversation for the LLM
system_prompt_message = LLMMessage(content=system_prompt, role=Roles.SYSTEM)
filled_template = prompt_template.format(
    question=query, chunks=chunks_to_text(unique_retrieved_chunks)
)
query_message = LLMMessage(content=filled_template, role=Roles.USER)

conversation = [system_prompt_message, query_message]

In [20]:
answer = await llm.generate(conversation)

2026-02-26 15:24:11.925 | DEBUG    | conversational_toolkit.llms.openai:generate:87 - Completion: ChatCompletion(id='chatcmpl-DDWZdObVO7A9gR96bS382w6ZBXi19', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{"expanded_queries":["What are the top-1 and top-5 error rates achieved on the ILSVRC-2010 test set?","Top-1 and top-5 scores obtained on the ILSVRC-2010 dataset?","Performance metrics for ILSVRC-2010: top-1 and top-5 error rates?","Reported top-1 and top-5 error rates for the ILSVRC-2010 competition?","What were the achieved top-1 and top-5 error percentages on ILSVRC-2010?"]}', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1772115849, model='gpt-4.1-mini-2025-04-14', object='chat.completion', service_tier='default', system_fingerprint='fp_a391f2cee0', usage=CompletionUsage(completion_tokens=119, prompt_tokens=1804, total_tokens=1923, completion_tokens_details=Completio

In [21]:
print(answer.content)

{"expanded_queries":["What are the top-1 and top-5 error rates achieved on the ILSVRC-2010 test set?","Top-1 and top-5 scores obtained on the ILSVRC-2010 dataset?","Performance metrics for ILSVRC-2010: top-1 and top-5 error rates?","Reported top-1 and top-5 error rates for the ILSVRC-2010 competition?","What were the achieved top-1 and top-5 error percentages on ILSVRC-2010?"]}


In [23]:
# The LLM received:
print("SYSTEM PROMPT:\n")
print(system_prompt_message.content, "\n\n")
print("QUERY:\n")
print(query_message.content, "\n\n")

SYSTEM PROMPT:

You are a helpful assistant that answers questions using the provided document excerpts (chunks).

At each step, you will receive several chunks that are relevant to the user's question. Your task is to produce the best possible answer using only the information contained in those chunks.

Rules you must follow:
- Use the chunks as your only source of truth. Do not rely on outside knowledge.
- Use all relevant chunks when forming your answer. Do not ignore any provided information.
- If the answer cannot be found in the chunks, clearly say that you do not know.
- Keep your answer concise and focused, without unnecessary details.
- Cite your sources from the provided chunks. 


QUERY:

# User question:
What are the top-1 and top-5 scores obtained on 'ILSVRC-2010'?

# Relevant chunks:
## Chunk 67:
```
Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% 5 . The best performance achieved during t

--------------------