# 3.b RAG as a tool

In this notebook you will see:
- How to implement RAG as a tool.
- How the LLM decides whether or not to call the tool, depending on the query.

As before, this is a basic implementation with plenty of room for improvement.

It reuses the components built in the previous sections.

# Setup

In [2]:
import os
import json
from typing import Any
import re
import shutil

from docling.document_converter import DocumentConverter

from conversational_toolkit.vectorstores.chromadb import ChromaDBVectorStore
from conversational_toolkit.llms.base import LLMMessage, Roles
from conversational_toolkit.tools.base import Tool
from conversational_toolkit.embeddings.openai import OpenAIEmbeddings
from conversational_toolkit.llms.openai import OpenAILLM
from conversational_toolkit.chunking.base import Chunk

from utils.specific_chunker import SpecificCharChunker

In [4]:
path_to_docs = "data/docs"
path_to_document = os.path.join(path_to_docs, "alexnet_paper.pdf")

path_to_db = "data/db"
path_to_vectorstore = os.path.join(path_to_db, "example.db")

In [5]:
doc_converter = DocumentConverter()

conv_res = doc_converter.convert(path_to_document)
md = conv_res.document.export_to_markdown()

# Replace single newlines with spaces: they are usually line wraps, not paragraph breaks
md = re.sub(r"(?<!\n)\n(?!\n)", " ", md)

doc_title_to_document = {"alexnet_paper.pdf": md}

chunker = SpecificCharChunker()
chunks = chunker.make_chunks(
    split_characters=["\n\n\n", "\n\n", "\n"],
    document_to_text=doc_title_to_document,
    max_number_of_characters=1024,
)

if os.path.exists(path_to_vectorstore):
    shutil.rmtree(path_to_vectorstore)
embedding_model = OpenAIEmbeddings(model_name="text-embedding-3-small")
embeddings = await embedding_model.get_embeddings([c.content for c in chunks])
vector_store = ChromaDBVectorStore(path_to_vectorstore)

await vector_store.insert_chunks(chunks=chunks, embedding=embeddings)

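The newline clean-up in the setup cell relies on a regex with a lookbehind and a lookahead. As a minimal standalone illustration (the sample text and the helper name `unwrap_lines` are made up for this example), single newlines become spaces while blank-line paragraph breaks are preserved:

```python
import re

# Same pattern as in the setup cell: a "\n" that is neither preceded
# nor followed by another "\n", i.e. a lone line wrap.
SINGLE_NEWLINE = re.compile(r"(?<!\n)\n(?!\n)")


def unwrap_lines(text: str) -> str:
    """Join hard-wrapped lines while keeping paragraph breaks untouched."""
    return SINGLE_NEWLINE.sub(" ", text)


# Hypothetical sample: two wrapped paragraphs separated by a blank line
sample = "We trained a large,\ndeep network.\n\nIt has 60 million\nparameters."
unwrapped = unwrap_lines(sample)
print(unwrapped)
```

Keeping the double newlines matters here because the chunker splits on `"\n\n"` before falling back to `"\n"`.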

# Create the tool

First, let's create a tool that, when called, returns the relevant chunks to the LLM.

In [6]:
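def chunks_to_text(chunks: list[Chunk]) -> str:
    """Format the retrieved chunks as a single text block for the LLM."""
    text = ""

    for chunk in chunks:
        # One titled, fenced block per chunk, followed by a dashed separator
        text += (
            f"## Chunk {chunk.title}:\n```\n{chunk.content}\n```\n" + "-" * 30 + "\n\n"
        )

    # Trim the trailing blank line (this slice also drops the last two dashes
    # of the final separator)
    text = text[:-4]

    return text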

In [7]:
class RetrieveRelevantChunks(Tool):
    """Tool that returns the chunks most similar to a query."""

    def __init__(
        self,
        name: str,
        description: str,
        parameters: dict[str, Any],
    ):
        self.name = name
        self.description = description
        self.parameters = parameters

    async def call(self, args: dict[str, Any]) -> dict[str, Any]:
        query = args.get("query")
        top_k = args.get("top_k", 5)

        if not query:
            raise ValueError("A non-empty 'query' is required.")
        if top_k > 10:
            raise ValueError("top_k cannot be greater than 10.")

        # Relies on the `embedding_model` and `vector_store` created in the Setup section
        query_embedding = await embedding_model.get_embeddings([query])
        retrieved_chunks = await vector_store.get_chunks_by_embedding(
            embedding=query_embedding, top_k=top_k
        )

        # Return plain text so the result can be sent back to the LLM as-is
        retrieved_chunks_as_text = chunks_to_text(retrieved_chunks)

        return {"result": retrieved_chunks_as_text}
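Conceptually, the `get_chunks_by_embedding` step ranks the stored chunks by how close their embeddings are to the query embedding. The sketch below illustrates that idea in plain Python. It is not how `ChromaDBVectorStore` is implemented, and the distance metric the vector store actually uses is not shown in this notebook: cosine similarity, the helper names `cosine_similarity` and `top_k_chunks`, and the toy 3-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: list[float],
    indexed: list[tuple[str, list[float]]],
    top_k: int = 5,
) -> list[tuple[float, str]]:
    """Return the top_k (score, text) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy index with made-up embeddings (real ones have many more dimensions)
toy_index = [
    ("results on ILSVRC-2010", [0.9, 0.1, 0.0]),
    ("dropout regularisation", [0.0, 1.0, 0.0]),
    ("GPU training details", [0.1, 0.0, 0.9]),
]
best = top_k_chunks([1.0, 0.0, 0.0], toy_index, top_k=2)
print(best)
```

A real vector store does the same kind of ranking with an index structure, so that it does not have to compare the query against every stored vector.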

In [8]:
alexnet_retriever_tool = RetrieveRelevantChunks(
    name="retrieve_relevant_chunks",
    description="Retrieves the most relevant chunks from the AlexNet paper based on a query.",
    # JSON Schema describing the arguments the tool accepts
    parameters={
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The query to retrieve relevant chunks for.",
            },
            "top_k": {
                "type": "integer",
                "description": "The number of top relevant chunks to retrieve, maximum is 10.",
            },
        },
        "required": ["query"],
        "additionalProperties": False,
    },
)
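Since the arguments are produced by the LLM, it can help to check them against the declared schema before running the tool. The helper below is a hand-rolled sketch for this particular schema; `validate_retriever_args` and the two constants are hypothetical names, not part of `conversational_toolkit`, and a full JSON Schema validator would be more general.

```python
from typing import Any

MAX_TOP_K = 10  # mirrors the limit stated in the tool description
DEFAULT_TOP_K = 5  # mirrors the default used inside the tool


def validate_retriever_args(args: dict[str, Any]) -> dict[str, Any]:
    """Return cleaned arguments or raise ValueError with a readable message."""
    # "additionalProperties": False -> reject unknown keys
    unexpected = set(args) - {"query", "top_k"}
    if unexpected:
        raise ValueError(f"Unexpected arguments: {sorted(unexpected)}")

    # "required": ["query"]
    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("'query' must be a non-empty string.")

    top_k = args.get("top_k", DEFAULT_TOP_K)
    # JSON has a single number type, so accept 3.0 as well as 3,
    # but reject booleans (a bool is an int in Python) and fractions.
    if isinstance(top_k, bool) or not isinstance(top_k, (int, float)):
        raise ValueError("'top_k' must be a number.")
    if not float(top_k).is_integer():
        raise ValueError("'top_k' must be a whole number.")
    top_k = int(top_k)
    if not 1 <= top_k <= MAX_TOP_K:
        raise ValueError(f"'top_k' must be between 1 and {MAX_TOP_K}.")

    return {"query": query, "top_k": top_k}


print(validate_retriever_args({"query": "dropout", "top_k": 3.0}))
```

Raising a `ValueError` with a clear message leaves room to send the error text back to the LLM so that it can correct its call.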

In [9]:
# Test if it works
result = await alexnet_retriever_tool.call(
    {
        "query": "What are the top-1 and top-5 scores obtained on 'ILSVRC-2010'?",
        "top_k": 1,
    }
)

print(result["result"])

2026-02-26 15:21:05.516 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (1, 1024)


## Chunk 67:
```
Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% 5 . The best performance achieved during the ILSVRC2010 competition was 47.1% and 28.2% with an approach that averages the predictions produced from six sparse-coding models trained on different features [2], and since then the best published results are 45.7% and 25.7% with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features [24].
```
----------------------------


In [10]:
# Test without top_k: it should default to 5
result = await alexnet_retriever_tool.call(
    {"query": "What are the top-1 and top-5 scores obtained on 'ILSVRC-2010'?"}
)

print(result["result"])

2026-02-26 15:21:05.834 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (1, 1024)


## Chunk 67:
```
Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% 5 . The best performance achieved during the ILSVRC2010 competition was 47.1% and 28.2% with an approach that averages the predictions produced from six sparse-coding models trained on different features [2], and since then the best published results are 45.7% and 25.7% with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features [24].
```
------------------------------

## Chunk 69:
```
Table 1: Comparison of results on ILSVRC2010 test set. In italics are best results achieved by others.
```
------------------------------

## Chunk 15:
```
ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments. Since we also entered our model in the ILSVRC-2012 competit

# Provide the tool to the LLM

In [11]:
system_prompt = """You are a helpful assistant that answers questions.

You have access to the following tool:
- retrieve_relevant_chunks: Retrieves the most relevant chunks from the AlexNet paper based on a query.

Only use the tool if it's relevant, else answer based on your own knowledge. Always try to use the tool if you think it can help you answer the question better.

If you use the tool, follow these guidelines:
- Use the chunks as your only source of truth. Do not rely on outside knowledge.
- Use all relevant chunks when forming your answer. Do not ignore any provided information.
- If the answer cannot be found in the chunks, clearly say that you do not know.
- Keep your answer concise and focused, without unnecessary details.
- Cite your sources from the provided chunks."""

prompt_message = LLMMessage(content=system_prompt, role=Roles.SYSTEM)

prompt_template = """# User question:\n{question}\n\nYour answer:\n\n"""

In [12]:
llm = OpenAILLM(tools=[alexnet_retriever_tool], tool_choice="auto")

2026-02-26 15:21:06.116 | DEBUG    | conversational_toolkit.llms.openai:__init__:63 - OpenAI LLM loaded: gpt-4o-mini; temperature: 0.5; seed: 42; tools: [<__main__.RetrieveRelevantChunks object at 0x0000023604E80B30>]; tool_choice: auto; response_format: {'type': 'text'}
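For the model to know about the tool, its name, description and parameter schema have to be sent along with each request. With the OpenAI Chat Completions API, tools are passed as a list of `{"type": "function", "function": {...}}` entries. How `OpenAILLM` builds this payload internally is not shown in this notebook, so the sketch below uses hypothetical stand-ins (`ToolInfo`, `to_openai_tool_spec`) to illustrate the shape of one entry.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolInfo:
    """Hypothetical stand-in holding the three attributes set on the tool."""

    name: str
    description: str
    parameters: dict[str, Any]


def to_openai_tool_spec(tool: ToolInfo) -> dict[str, Any]:
    """Build one entry of the `tools` list in the Chat Completions format."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description,
            "parameters": tool.parameters,
        },
    }


info = ToolInfo(
    name="retrieve_relevant_chunks",
    description="Retrieves the most relevant chunks from the AlexNet paper based on a query.",
    parameters={
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
)
spec = to_openai_tool_spec(info)
print(spec["function"]["name"])
```

With `tool_choice="auto"`, the model decides on its own whether to answer directly or to request a call to one of the listed tools, which is what the next cells test.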


# Test the tool

## General Question

In [13]:
query = "What is Einstein's theory of relativity? Answer concisely in 2-3 sentences."
user_message = LLMMessage(role=Roles.USER, content=query)

response = await llm.generate(conversation=[prompt_message, user_message])

2026-02-26 15:21:08.570 | DEBUG    | conversational_toolkit.llms.openai:generate:87 - Completion: ChatCompletion(id='chatcmpl-DDWWgBVzJaXA16SDTL9ct7BzIUhm6', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Einstein's theory of relativity comprises two main parts: special relativity and general relativity. Special relativity, introduced in 1905, establishes that the laws of physics are the same for all non-accelerating observers and that the speed of light is constant, leading to the famous equation E=mc². General relativity, published in 1915, generalizes this concept to include gravity, describing it as the curvature of spacetime caused by mass.", refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1772115666, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier='default', system_fingerprint='fp_414ba99a04', usage=CompletionUsage(completion_tokens=96, prompt_t

In [14]:
# Here the LLM should not call the tool, as it can answer based on its own knowledge
response

LLMMessage(content="Einstein's theory of relativity comprises two main parts: special relativity and general relativity. Special relativity, introduced in 1905, establishes that the laws of physics are the same for all non-accelerating observers and that the speed of light is constant, leading to the famous equation E=mc². General relativity, published in 1915, generalizes this concept to include gravity, describing it as the curvature of spacetime caused by mass.", role=<Roles.ASSISTANT: 'assistant'>, tool_calls=[], tool_call_id=None, name=None)

## AlexNet Question

In [15]:
# Let's ask a question about AlexNet
query = "What are the top-1 and top-5 scores obtained on 'ILSVRC-2010'?"
user_message = LLMMessage(role=Roles.USER, content=query)

response = await llm.generate(conversation=[prompt_message, user_message])

2026-02-26 15:21:09.666 | DEBUG    | conversational_toolkit.llms.openai:generate:87 - Completion: ChatCompletion(id='chatcmpl-DDWWiCeyqr5rcBZCBF9XHaE3Vibdq', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='call_crFhBPpXEa1zdrX0tGSX81uh', function=Function(arguments='{"query":"top-1 and top-5 scores ILSVRC-2010"}', name='retrieve_relevant_chunks'), type='function')]))], created=1772115668, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier='default', system_fingerprint='fp_b8d58b1866', usage=CompletionUsage(completion_tokens=30, prompt_tokens=260, total_tokens=290, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached

In [16]:
# Here the LLM does call the tool
# As shown in 1c, the call now has to be forwarded to the tool
# Note that the query differs from the user's question: the LLM rewrote it
response

LLMMessage(content='', role=<Roles.ASSISTANT: 'assistant'>, tool_calls=[ToolCall(id='call_crFhBPpXEa1zdrX0tGSX81uh', function=Function(name='retrieve_relevant_chunks', arguments='{"query":"top-1 and top-5 scores ILSVRC-2010"}'), type='function')], tool_call_id=None, name=None)

In [17]:
results = {}

for tool_call in response.tool_calls:
    tool_name = tool_call.function.name
    tool_args = tool_call.function.arguments  # JSON-encoded string

    # Look up the requested tool among those registered on the LLM
    tool = next((t for t in llm.tools if t.name == tool_name), None)

    if tool is not None:
        tools_args_json = json.loads(tool_args)
        tool_result = await tool.call(tools_args_json)

        # Keyed by tool name, which is enough here since there is a single call
        results[tool_name] = tool_result

print(results["retrieve_relevant_chunks"]["result"])

2026-02-26 15:21:12.735 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (1, 1024)


## Chunk 69:
```
Table 1: Comparison of results on ILSVRC2010 test set. In italics are best results achieved by others.
```
------------------------------

## Chunk 67:
```
Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% 5 . The best performance achieved during the ILSVRC2010 competition was 47.1% and 28.2% with an approach that averages the predictions produced from six sparse-coding models trained on different features [2], and since then the best published results are 45.7% and 25.7% with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features [24].
```
------------------------------

## Chunk 15:
```
ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments. Since we also entered our model in the ILSVRC-2012 competit

In [18]:
tools_answers = []

# Wrap each tool result in a TOOL message linked to its call through the call id
for tool_call in response.tool_calls:
    tool_name = tool_call.function.name
    result = results[tool_name]
    call_id = tool_call.id

    tool_answer = LLMMessage(
        role=Roles.TOOL,
        name=tool_name,
        content=json.dumps(result),
        tool_call_id=call_id,
    )
    tools_answers.append(tool_answer)

In [19]:
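# Include the system prompt again so that its answering guidelines apply to the final answer
conversation = [prompt_message, user_message, response, *tools_answers]

final_response = await llm.generate(conversation=conversation)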

2026-02-26 15:21:15.275 | DEBUG    | conversational_toolkit.llms.openai:generate:87 - Completion: ChatCompletion(id='chatcmpl-DDWWmWbmGpKVWhdYoQwfqZkCy3L3p', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The top-1 and top-5 scores obtained on the ILSVRC-2010 are as follows:\n\n- **Top-1 error rate**: 37.5%\n- **Top-5 error rate**: 17.0%\n\nThe best performance during the ILSVRC-2010 competition was 47.1% for top-1 and 28.2% for top-5.', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1772115672, model='gpt-4o-mini-2024-07-18', object='chat.completion', service_tier='default', system_fingerprint='fp_373a14eb6f', usage=CompletionUsage(completion_tokens=86, prompt_tokens=712, total_tokens=798, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDeta

In [20]:
print(final_response.content)

The top-1 and top-5 scores obtained on the ILSVRC-2010 are as follows:

- **Top-1 error rate**: 37.5%
- **Top-5 error rate**: 17.0%

The best performance during the ILSVRC-2010 competition was 47.1% for top-1 and 28.2% for top-5.
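The manual steps above (generate, run the requested tools, send the results back, generate again) can be wrapped in a single loop. The sketch below uses stand-in classes so that it runs without an API key: `ToolCall`, `Message`, `EchoTool`, `FakeLLM` and `run_with_tools` are hypothetical and only mimic the shape of the objects used in this notebook. They are not the `conversational_toolkit` API.

```python
import asyncio
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    id: str
    name: str
    arguments: str  # JSON-encoded string, as produced by the model


@dataclass
class Message:
    role: str
    content: str = ""
    tool_calls: list[ToolCall] = field(default_factory=list)
    tool_call_id: Optional[str] = None


class EchoTool:
    """Stand-in retriever that echoes the query instead of searching."""

    name = "retrieve_relevant_chunks"

    async def call(self, args: dict[str, Any]) -> dict[str, Any]:
        return {"result": f"chunks for: {args['query']}"}


class FakeLLM:
    """Stand-in model: requests the tool once, then answers from its output."""

    async def generate(self, conversation: list[Message]) -> Message:
        last = conversation[-1]
        if last.role == "tool":
            result = json.loads(last.content)["result"]
            return Message(role="assistant", content=f"Answer based on {result}")
        call = ToolCall("call_1", "retrieve_relevant_chunks", '{"query": "ILSVRC-2010 error rates"}')
        return Message(role="assistant", tool_calls=[call])


async def run_with_tools(llm, tools, conversation: list[Message], max_rounds: int = 3) -> Message:
    """Call the LLM until it answers without requesting a tool."""
    tools_by_name = {tool.name: tool for tool in tools}
    conversation = list(conversation)  # do not mutate the caller's list
    for _ in range(max_rounds):
        response = await llm.generate(conversation)
        if not response.tool_calls:
            return response
        conversation.append(response)
        for call in response.tool_calls:
            tool = tools_by_name.get(call.name)
            if tool is None:
                result = {"error": f"Unknown tool: {call.name}"}
            else:
                result = await tool.call(json.loads(call.arguments))
            # One tool message per call, linked through the call id
            conversation.append(Message(role="tool", content=json.dumps(result), tool_call_id=call.id))
    raise RuntimeError("Tool loop did not finish within max_rounds.")


start = [Message("system", "You are a helpful assistant."), Message("user", "What are the error rates?")]
final = asyncio.run(run_with_tools(FakeLLM(), [EchoTool()], start))
print(final.content)
```

Inside a notebook, where an event loop is already running, `asyncio.run(...)` would be replaced by a plain `await run_with_tools(...)`. The `max_rounds` cap is a design choice that prevents an endless loop if the model keeps requesting tools.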

