# 2.d Retrieval

In this notebook you will see:
- How to retrieve based on embeddings
- How to retrieve based on metadata
- How to retrieve based on both at once

We will use `docling` as the parser, `SpecificCharChunker` as the chunker, and `ChromaDBVectorStore` as the vector store.

Other retrieval strategies exist that are not covered here. They can be implemented by creating a new vector store class and slightly changing its retrieval logic:
- Selecting chunks by a criterion other than `top_k` (similarity threshold, ...)
- Including a keyword score (BM25, ...)
- Making use of the document structure (hierarchical retrieval, ...)
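
As an illustration of the first alternative, here is a minimal sketch of threshold-based selection. It assumes the store has already produced one similarity score per chunk (higher meaning more similar); the function name `select_by_threshold` and the scores are hypothetical and not part of `conversational_toolkit`. Some stores return distances instead, in which case the comparison would be reversed.

```python
def select_by_threshold(scored_chunks, min_score):
    """Keep every (chunk_id, score) pair scoring at least `min_score`, best first."""
    kept = [pair for pair in scored_chunks if pair[1] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical similarity scores for four chunks (higher means more similar).
scored = [("c1", 0.82), ("c2", 0.41), ("c3", 0.77), ("c4", 0.12)]

print(select_by_threshold(scored, min_score=0.7))   # only chunks above 0.7
print(select_by_threshold(scored, min_score=0.95))  # may be empty, unlike top_k
```

Unlike `top_k`, a threshold can return no chunk at all, which can be useful when the application should rather answer "not found" than use weakly related context.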

# Setup

In [2]:
import os
import re
import shutil

from docling.document_converter import DocumentConverter

from conversational_toolkit.vectorstores.chromadb import ChromaDBVectorStore
from conversational_toolkit.embeddings.openai import OpenAIEmbeddings

from utils.specific_chunker import SpecificCharChunker



In [4]:
path_to_docs = "data/docs"
path_to_document = os.path.join(path_to_docs, "alexnet_paper.pdf")

path_to_db = "data/db"
path_to_vectorstore = os.path.join(path_to_db, "example.db")

In [5]:
doc_converter = DocumentConverter()

conv_res = doc_converter.convert(path_to_document)
md = conv_res.document.export_to_markdown()

# Replace single newlines with spaces: they are usually line wraps from the PDF, not paragraph breaks
md = re.sub(r"(?<!\n)\n(?!\n)", " ", md)

doc_title_to_document = {"alexnet_paper.pdf": md}

chunker = SpecificCharChunker()
chunks = chunker.make_chunks(
    split_characters=["\n\n\n", "\n\n", "\n"],
    document_to_text=doc_title_to_document,
    max_number_of_characters=1024,
)

# Start from an empty vector store on every run
if os.path.exists(path_to_vectorstore):
    shutil.rmtree(path_to_vectorstore)

# Embed the content of every chunk
embedding_model = OpenAIEmbeddings(model_name="text-embedding-3-small")
embeddings = await embedding_model.get_embeddings([c.content for c in chunks])

vector_store = ChromaDBVectorStore(path_to_vectorstore)

await vector_store.insert_chunks(chunks=chunks, embedding=embeddings)

[INFO] 2026-02-26 15:19:53,964 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-02-26 15:19:53,977 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx
[INFO] 2026-02-26 15:19:53,977 [RapidOCR] main.py:53: Using C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\models\ch_PP-OCRv4_det_infer.onnx
[INFO] 2026-02-26 15:19:54,066 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-02-26 15:19:54,068 [RapidOCR] download_file.py:60: File exists and is valid: C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\models\ch_ppocr_mobile_v2.0_cls_infer.onnx
[INFO] 2026-02-26 15:19:54,068 [RapidOCR] main.py:53: Using C:\Users\sieverin\SDSC\Code\sme-kt-zh-collaboration-rag\rag_venv\Lib\site-packages\rapidocr\mod

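The newline clean-up in the setup cell relies on a regular expression with a lookbehind and a lookahead. The standalone sketch below shows its effect on a short, made-up string (the text is hypothetical, not taken from the converted paper): single newlines become spaces, while blank lines between paragraphs are kept.

```python
import re

# Hypothetical excerpt mimicking PDF-to-Markdown output: hard-wrapped lines
# inside a paragraph, and a blank line between paragraphs.
text = "Our network achieves\ntop-1 and top-5\nerror rates.\n\n## Results\n\nTable 1"

# A newline is replaced only when neither neighbour is a newline,
# so runs of two or more newlines (paragraph breaks) are left alone.
cleaned = re.sub(r"(?<!\n)\n(?!\n)", " ", text)

print(repr(cleaned))
```

One consequence to keep in mind: Markdown constructs that depend on single newlines, such as table rows or list items, are also merged onto one line by this pattern.
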
# Embedding Retrieval



In [6]:
query = "What are the top-1 and top-5 scores obtained on 'ILSVRC-2010'?"
query_embedding = await embedding_model.get_embeddings([query])

retrieved_chunks = await vector_store.get_chunks_by_embedding(
    embedding=query_embedding,
    # Retrieve the 5 chunks closest to the query embedding
    top_k=5,
)

2026-02-26 15:20:06.765 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (1, 1024)
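
Conceptually, `top_k` retrieval ranks every stored chunk by how close its embedding is to the query embedding and keeps the best `k`. The sketch below illustrates this with an exhaustive scan over toy vectors. It assumes cosine similarity as the metric, which this notebook does not confirm for `ChromaDBVectorStore`; the helper names and the data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k):
    """Rank chunk ids by similarity to the query and keep the k best."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine_similarity(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D vectors standing in for real embeddings (hypothetical data).
toy_index = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
print(top_k_chunks([0.9, 0.1], toy_index, k=2))
```

A real vector store typically relies on an approximate nearest-neighbour index rather than scanning every chunk, but the ranking idea is the same.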


In [7]:
for i, chunk in enumerate(retrieved_chunks):
    print(
        f"Chunk {i + 1}:\n{chunk.content}\n",
        f"Metadata: {chunk.metadata}",
        "\n",
        "-" * 30,
        "\n",
    )

Chunk 1:
Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% 5 . The best performance achieved during the ILSVRC2010 competition was 47.1% and 28.2% with an approach that averages the predictions produced from six sparse-coding models trained on different features [2], and since then the best published results are 45.7% and 25.7% with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features [24].
 Metadata: {'title': '67', 'doc_title': 'alexnet_paper.pdf', 'mime_type': 'text/markdown'} 
 ------------------------------ 

Chunk 2:
Table 1: Comparison of results on ILSVRC2010 test set. In italics are best results achieved by others.
 Metadata: {'doc_title': 'alexnet_paper.pdf', 'mime_type': 'text/markdown', 'title': '69'} 
 ------------------------------ 

Chunk 3:
ILSVRC-2010 is the only version of ILSVRC for which the test

# Metadata Retrieval

In [8]:
query = "What are the top-1 and top-5 scores obtained on 'ILSVRC-2010'?"
query_embedding = await embedding_model.get_embeddings([query])

considered_metadata_title = [str(i) for i in range(60, 80)]
# Filter on metadata with the `$in` operator; other filter operators can be applied in the same way
filters = {"title": {"$in": considered_metadata_title}}

retrieved_chunks = await vector_store.get_chunks_by_embedding(
    embedding=query_embedding,
    top_k=5,
    # Only consider chunks with a title from 60 to 79 (inclusive)
    filters=filters,
)

2026-02-26 15:20:07.034 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (1, 1024)
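
To make the effect of the filter concrete, here is a simplified pure-Python sketch of how an `$in` condition can be evaluated against chunk metadata. It is a stand-in written for illustration, not the filter engine of the vector store; `matches_in_filter` and the candidate metadata are hypothetical.

```python
def matches_in_filter(metadata, filters):
    """Return True when every filtered field holds one of its allowed values.

    Only the `$in` operator is handled; this is a simplified stand-in,
    not the vector store's real filter engine.
    """
    return all(
        metadata.get(field) in condition["$in"]
        for field, condition in filters.items()
    )


allowed_titles = [str(i) for i in range(60, 80)]  # "60" ... "79"
filters = {"title": {"$in": allowed_titles}}

# Hypothetical chunk metadata, shaped like the metadata printed in this notebook.
candidates = [
    {"title": "67", "doc_title": "alexnet_paper.pdf"},
    {"title": "80", "doc_title": "alexnet_paper.pdf"},  # outside range(60, 80)
    {"title": 67, "doc_title": "alexnet_paper.pdf"},    # int, not str
]

print([matches_in_filter(m, filters) for m in candidates])
```

Two details are worth noting: `range(60, 80)` stops at 79, and the titles in this notebook are stored as strings, so the allowed values need to be strings as well.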


In [9]:
for i, chunk in enumerate(retrieved_chunks):
    print(
        f"Chunk {i + 1}:\n{chunk.content}\n",
        f"Metadata: {chunk.metadata}",
        "\n",
        "-" * 30,
        "\n",
    )

Chunk 1:
Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% 5 . The best performance achieved during the ILSVRC2010 competition was 47.1% and 28.2% with an approach that averages the predictions produced from six sparse-coding models trained on different features [2], and since then the best published results are 45.7% and 25.7% with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features [24].
 Metadata: {'mime_type': 'text/markdown', 'title': '67', 'doc_title': 'alexnet_paper.pdf'} 
 ------------------------------ 

Chunk 2:
Table 1: Comparison of results on ILSVRC2010 test set. In italics are best results achieved by others.
 Metadata: {'title': '69', 'mime_type': 'text/markdown', 'doc_title': 'alexnet_paper.pdf'} 
 ------------------------------ 

Chunk 3:
Table 2: Comparison of error rates on ILSVRC-2012 validation
