# 2.c Embedding

In this notebook you will see:
- How to embed the chunks
- How to save them with their metadata in a vector store

As in the previous notebook, we use `docling` to convert the document and the `SpecificCharChunker` to split it into chunks.

Here we use a standard OpenAI embedding model; a specialized embedding model may be a better fit in specific situations (domain-specific vocabulary, dialects, images, ...).

# Setup

In [1]:
import os
import re

from docling.document_converter import DocumentConverter

from conversational_toolkit.vectorstores.chromadb import ChromaDBVectorStore
from conversational_toolkit.embeddings.openai import OpenAIEmbeddings

from utils.specific_chunker import SpecificCharChunker

In [3]:
path_to_docs = "data/docs"
path_to_document = os.path.join(path_to_docs, "alexnet_paper.pdf")

path_to_db = "data/db"
path_to_vectorstore = os.path.join(path_to_db, "example.db")

In [4]:
doc_converter = DocumentConverter()

conv_res = doc_converter.convert(path_to_document)
md = conv_res.document.export_to_markdown()

# Replace single newlines with spaces: they are usually just line wraps, not paragraph breaks
md = re.sub(r"(?<!\n)\n(?!\n)", " ", md)

doc_title_to_document = {"alexnet_paper.pdf": md}

chunker = SpecificCharChunker()
chunks = chunker.make_chunks(
    split_characters=["\n\n\n", "\n\n", "\n"],
    document_to_text=doc_title_to_document,
    max_number_of_characters=1024,
)
print(len(chunks))

96
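
To make the newline cleanup before chunking more concrete, here is a minimal sketch on a made-up string (the sample text is hypothetical, only the regular expression comes from the cell above). A lone `\n` is treated as a line wrap and replaced by a space, while a blank line (`\n\n`) is kept as a paragraph break, so the chunker can still split on it.

```python
import re

# Hypothetical sample: one paragraph hard-wrapped over two lines,
# then a blank line, then a second paragraph.
sample = "Deep networks are\nhard to train.\n\nWe use ReLU units."

# Same pattern as in the notebook: match a newline that is neither
# preceded nor followed by another newline, i.e. a lone line break.
unwrapped = re.sub(r"(?<!\n)\n(?!\n)", " ", sample)

print(repr(unwrapped))
# 'Deep networks are hard to train.\n\nWe use ReLU units.'
```

The two lookarounds are what protect `\n\n`: each newline of a pair has another newline next to it, so neither one matches.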


# Embedding

Below, we use `OpenAIEmbeddings`, an implementation of our abstract `EmbeddingsModel` interface, which converts each text into a vector of floats.

REMINDER: Calling OpenAI to compute embeddings incurs a (small) cost.

```python
from abc import ABC, abstractmethod
from typing import Union

import numpy as np
from numpy.typing import NDArray


class EmbeddingsModel(ABC):
    """
    Abstract base class for embeddings models.

    Attributes:
        model_name (str): The name of the embeddings model.
        embedding_size (int): The size of the embedding vector.
    """

    @abstractmethod
    async def get_embeddings(self, texts: Union[str, list[str]]) -> NDArray[np.float64]:
        """
        Retrieves the embeddings for the given text(s).

        Args:
            texts (Union[str, list[str]]): The input text, or list of texts, to embed.

        Returns:
            NDArray[np.float64]: The embedding vectors, one row per input text.
        """
        pass
```
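
To experiment without calling OpenAI, one could implement the same interface with a toy model. The sketch below is only an illustration: `HashingEmbeddings` is a hypothetical class that is not part of the toolkit, and the reduced `EmbeddingsModel` is redeclared only so that the block runs on its own. It counts words in hashed buckets, so it captures word overlap rather than meaning.

```python
import asyncio
import hashlib
from abc import ABC, abstractmethod
from typing import Union

import numpy as np
from numpy.typing import NDArray


class EmbeddingsModel(ABC):
    """Reduced copy of the interface above, so that this sketch runs on its own."""

    @abstractmethod
    async def get_embeddings(self, texts: Union[str, list[str]]) -> NDArray[np.float64]: ...


class HashingEmbeddings(EmbeddingsModel):
    """Hypothetical offline model: hashes each word into one of `embedding_size` buckets."""

    def __init__(self, embedding_size: int = 16) -> None:
        self.model_name = "hashing-demo"
        self.embedding_size = embedding_size

    async def get_embeddings(self, texts: Union[str, list[str]]) -> NDArray[np.float64]:
        if isinstance(texts, str):
            texts = [texts]
        vectors = np.zeros((len(texts), self.embedding_size), dtype=np.float64)
        for row, text in enumerate(texts):
            for word in text.lower().split():
                # md5 is used only as a stable hash (Python's hash() is salted per process)
                digest = hashlib.md5(word.encode("utf-8")).digest()
                bucket = int.from_bytes(digest[:4], "big") % self.embedding_size
                vectors[row, bucket] += 1.0
        # Scale each non-empty row to unit length; all-zero rows are left unchanged
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        return vectors / np.where(norms == 0.0, 1.0, norms)


# In a notebook cell, use `await HashingEmbeddings().get_embeddings(...)` instead of asyncio.run
demo_embeddings = asyncio.run(
    HashingEmbeddings().get_embeddings(["a deep network", "a deep network", "dropout"])
)
print(demo_embeddings.shape)  # (3, 16)
```

As with the real model, the result has one row per input text and `embedding_size` columns, and identical texts get identical rows.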

In [5]:
# Define the embedding model
embedding_model = OpenAIEmbeddings(model_name="text-embedding-3-small")

# Compute the embeddings for the chunks
embeddings = await embedding_model.get_embeddings([c.content for c in chunks])

2026-02-26 15:17:31.595 | DEBUG    | conversational_toolkit.embeddings.openai:__init__:20 - OpenAI embeddings model loaded: text-embedding-3-small
2026-02-26 15:17:32.393 | INFO     | conversational_toolkit.embeddings.openai:get_embeddings:38 - OpenAI embeddings shape: (96, 1024)


# Store them

These embeddings have to be saved for later use. This is done in a vector store, whose interface typically looks like:

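```python
from __future__ import annotations  # Chunk and ChunkMatch are toolkit types, not imported here

from abc import ABC, abstractmethod
from typing import Any, Union

import numpy as np
from numpy.typing import NDArray


class VectorStore(ABC):
    # Store the chunks together with their embeddings (one row per chunk)
    @abstractmethod
    async def insert_chunks(self, chunks: list[Chunk], embedding: NDArray[np.float64]) -> None:
        pass

    # Return the top_k stored chunks closest to the query embedding
    @abstractmethod
    async def get_chunks_by_embedding(
        self, embedding: NDArray[np.float64], top_k: int, filters: dict[str, Any] | None = None
    ) -> list[ChunkMatch]:
        pass

    # Look up stored chunks directly by their ids
    @abstractmethod
    async def get_chunks_by_ids(self, chunk_ids: Union[int, list[int]]) -> list[Chunk]:
        pass
```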
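
To illustrate what `insert_chunks` and `get_chunks_by_embedding` do conceptually, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: `DemoChunk`, `DemoMatch` and `InMemoryVectorStore` are hypothetical stand-ins and not the toolkit's classes, filters and persistence are left out, and cosine similarity is just one possible ranking (a real store may use another metric).

```python
import asyncio
from dataclasses import dataclass

import numpy as np
from numpy.typing import NDArray


@dataclass
class DemoChunk:
    """Hypothetical stand-in for the toolkit's Chunk type."""
    title: str
    content: str


@dataclass
class DemoMatch:
    """Hypothetical stand-in for ChunkMatch: a chunk plus its similarity score."""
    chunk: DemoChunk
    score: float


class InMemoryVectorStore:
    """Toy store mirroring two methods of the interface above (non-empty store assumed)."""

    def __init__(self) -> None:
        self.chunks: list[DemoChunk] = []
        self.embeddings: list[NDArray[np.float64]] = []

    async def insert_chunks(self, chunks: list[DemoChunk], embedding: NDArray[np.float64]) -> None:
        # Row i of `embedding` belongs to chunks[i]
        if len(chunks) != len(embedding):
            raise ValueError("need exactly one embedding row per chunk")
        self.chunks.extend(chunks)
        self.embeddings.extend(embedding)

    async def get_chunks_by_embedding(self, embedding: NDArray[np.float64], top_k: int) -> list[DemoMatch]:
        matrix = np.vstack(self.embeddings)
        # Cosine similarity between the query and every stored row
        scores = matrix @ embedding / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(embedding))
        best = np.argsort(scores)[::-1][:top_k]  # indices of the highest scores first
        return [DemoMatch(self.chunks[i], float(scores[i])) for i in best]


async def demo() -> list[DemoMatch]:
    store = InMemoryVectorStore()
    chunks = [DemoChunk("0", "about dropout"), DemoChunk("1", "about GPUs"), DemoChunk("2", "about ReLU")]
    vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])  # made-up 2D embeddings
    await store.insert_chunks(chunks, vectors)
    return await store.get_chunks_by_embedding(np.array([1.0, 0.1]), top_k=2)


# In a notebook cell, use `matches = await demo()` instead of asyncio.run
matches = asyncio.run(demo())
print([m.chunk.title for m in matches])  # ['0', '2']
```

The query vector points almost in the same direction as the first stored vector, so chunk `0` ranks first, followed by chunk `2`. A persistent store adds indexing, metadata filters and storage on disk on top of this basic idea.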

We will use our `ChromaDB`-based implementation, which saves the text, its embedding and the metadata.

In [6]:
vector_store = ChromaDBVectorStore(path_to_vectorstore)

await vector_store.insert_chunks(chunks=chunks, embedding=embeddings)

In [7]:
# Let's check the content
print(vector_store.collection.get().keys(), "\n")

print(vector_store.collection.get()["documents"][62], "\n")

print(vector_store.collection.get()["metadatas"][62], "\n")

vector_store.collection.get(include=["embeddings"])["embeddings"][62, :]

dict_keys(['ids', 'embeddings', 'documents', 'uris', 'included', 'data', 'metadatas']) 

Our results on ILSVRC-2010 are summarized in Table 1. Our network achieves top-1 and top-5 test set error rates of 37.5% and 17.0% 5 . The best performance achieved during the ILSVRC2010 competition was 47.1% and 28.2% with an approach that averages the predictions produced from six sparse-coding models trained on different features [2], and since then the best published results are 45.7% and 25.7% with an approach that averages the predictions of two classifiers trained on Fisher Vectors (FVs) computed from two types of densely-sampled features [24]. 

{'mime_type': 'text/markdown', 'doc_title': 'alexnet_paper.pdf', 'title': '67'} 



array([ 0.01619667, -0.0226672 ,  0.06462389, ..., -0.04083079,
       -0.0090411 ,  0.0530936 ], shape=(1024,))
