# Feature 1: Chunking Validation
## coded by Adelene Lai (Econetta AG), 24.02.2026

Chunking is validated along four aspects:

* chunk size stays below the embedding model's maximum input length (in tokens)
* all chunks are of roughly the same size
* amount of information per chunk (aim for one topic per chunk so that retrieval is targeted and effective)
* chunk relevance and context

For now, we use the local embedding model `all-MiniLM-L6-v2` via ollama, as in the `feature0_baseline_rag` notebook.

This model accepts at most 256 tokens per input (longer text is truncated), which corresponds to roughly 1000 characters. See 'Intended Uses': https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
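The token limit can be turned into a character budget with a rough rule of thumb. The sketch below assumes about 4 characters per token, which is only an approximation and varies with language and content. `CHARS_PER_TOKEN`, `estimate_tokens`, and `exceeds_limit` are hypothetical names, not part of the toolkit. An exact count would require the model's own tokenizer.

```python
import math

MAX_TOKENS = 256        # input limit of all-MiniLM-L6-v2 (longer input is truncated)
CHARS_PER_TOKEN = 4     # assumption: rough average, varies by language and content


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def exceeds_limit(text: str, max_tokens: int = MAX_TOKENS) -> bool:
    """Return True if the estimated token count is above the model limit."""
    return estimate_tokens(text) > max_tokens


print(MAX_TOKENS * CHARS_PER_TOKEN)   # character budget: 1024
print(exceeds_limit("a" * 1024))      # False: exactly at the budget
print(exceeds_limit("a" * 1025))      # True: one character over
```

Under this assumption the budget is 1024 characters, which is the threshold used for the size check later in this notebook.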



In [2]:
# Preliminary imports, taken from the feature0_baseline_rag notebook
from pathlib import Path
from typing import Any, Callable, Optional

from conversational_toolkit.chunking.base import Chunk

from conversational_toolkit.agents.base import QueryWithContext
from conversational_toolkit.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from conversational_toolkit.retriever.vectorstore_retriever import VectorStoreRetriever

from sme_kt_zh_collaboration_rag.feature0_baseline_rag import (
    load_chunks,
    inspect_chunks,
    build_vector_store,
    inspect_retrieval,
    build_agent,
    build_llm,
    ask,
    DATA_DIR,
    VS_PATH,
    EMBEDDING_MODEL,
    RETRIEVER_TOP_K,
)

# Choose your LLM backend: "ollama" (local, requires `ollama serve`) or "openai" (requires OPENAI_API_KEY)
BACKEND = "ollama"  # set this before running

if not BACKEND:
    raise ValueError(
        'BACKEND is not set. Edit the line above and set it to "ollama" or "openai".\n'
        "See Renku_README.md for setup instructions."
    )

ROOT = Path().resolve().parents[1]
print(f"Project root : {ROOT}")
print(f"Data dir     : {DATA_DIR}")
print(f"Vector store : {VS_PATH}")
print(f"LLM backend  : {BACKEND}")



Project root : /home/alai/projects/myfork/sme-kt-zh-collaboration-rag
Data dir     : /home/alai/projects/myfork/sme-kt-zh-collaboration-rag/data
Vector store : /home/alai/projects/myfork/sme-kt-zh-collaboration-rag/backend/data_vs.db
LLM backend  : ollama


In [10]:
# Load documents from DATA_DIR and split them into chunks.
chunks = load_chunks(max_files=None)
# Print a statistical summary and sampled content for visual inspection.
inspect_chunks(chunks)

# Print the size distribution.
# 1024 chars ≈ 256 tokens at ~4 chars/token (the embedding model's input limit).
MAX_CHARS = 1024
char_lengths = [len(c.content) for c in chunks]
over_limit = sum(1 for n in char_lengths if n > MAX_CHARS)
print(f"\nChunks total       : {len(chunks)}")
print(f"Mean length (chars): {sum(char_lengths) // len(char_lengths)}")
print(f"Over {MAX_CHARS}-char limit (≈256 tok embedding limit): {over_limit} / {len(chunks)}")
print("\nSuccessfully loaded and chunked the documents!")

2026-02-25 14:27:25.769 | INFO     | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:199 - Chunking 35 files from /home/alai/projects/myfork/sme-kt-zh-collaboration-rag/data
2026-02-25 14:27:26.055 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:211 -   ART_internal_procurement_policy.pdf: 12 chunks
2026-02-25 14:27:26.267 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:211 -   ART_logylight_incomplete_datasheet.pdf: 6 chunks
2026-02-25 14:27:26.385 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:211 -   ART_product_catalog.pdf: 7 chunks
2026-02-25 14:27:26.393 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:211 -   ART_product_overview.xlsx: 1 chunks
2026-02-25 14:27:26.497 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:211 -   ART_relicyc_logypal1_datasheet_2021.pdf: 5 chunks
2026-02-25 14:27:26.499 | DEBUG    | sme_kt_zh_collaboration_


Chunks total       : 374
Mean length (chars): 4095
Over 1024-char limit (≈256 tok embedding limit): 104 / 374

Successfully loaded and chunked the documents!
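A mean of 4095 characters with only 104 of 374 chunks over the limit points to a heavy tail. The 270 chunks within the limit hold at most 270 × 1024 = 276,480 characters, so the 104 oversized chunks must average roughly 12,000 characters or more. The mean alone hides this, so a median and a high percentile are more informative. The following is a minimal sketch using only the standard library; `summarize_lengths` is a hypothetical helper and the sample lengths are made up for illustration.

```python
import statistics


def summarize_lengths(lengths: list[int], limit: int = 1024) -> dict[str, float]:
    """Summarize a list of chunk lengths (in characters)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    # quantiles(n=10) returns 9 cut points; the last one is the 90th percentile.
    p90 = statistics.quantiles(ordered, n=10)[-1] if len(ordered) > 1 else ordered[0]
    return {
        "count": len(ordered),
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p90": p90,
        "max": ordered[-1],
        "over_limit": sum(1 for n in ordered if n > limit),
    }


# Made-up sample: mostly small chunks plus two very large ones.
sample = [300, 400, 500, 600, 700, 800, 900, 1000, 20000, 30000]
summary = summarize_lengths(sample)
for key, value in summary.items():
    print(f"{key:>10}: {value:,.0f}")
```

In this notebook the helper could be called as `summarize_lengths(char_lengths)`. In the made-up sample the mean (5520) is far above the median (750), the same kind of gap that the numbers above suggest for the real data.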


In [4]:
from sme_kt_zh_collaboration_rag.feature0_ingestion import (
    header_based_chunks,
    fixed_size_chunks,
    paragraph_aware_chunks,
    analyze_chunks,
    compare_strategies,
    print_comparison_table,
    char_histogram,
    ChunkStats,
)
from loguru import logger

In [None]:
# compare_chunk_size(f"{DATA_DIR}/ART_internal_procurement_policy.pdf")

2026-02-25 13:48:52.401 | INFO     | sme_kt_zh_collaboration_rag.feature0_ingestion:compare_chunk_size:221 - Validation - header_based           chunks=  12, avg=   416, min=   32, max=   958, >256tok=   0



=== header_based ===
header_based           chunks=  12, avg=   416, min=   32, max=   958, >256tok=   0

Chunk size histogram:
      32-125    | ████████████████████ 2
     125-217    | ██████████ 1
     217-310    | ██████████ 1
     310-402    | ██████████ 1
     402-495    | ████████████████████████████████████████ 4
     495-588    | ██████████ 1
     588-680    |  0
     680-773    |  0
     773-865    | ██████████ 1
     865-958    | ██████████ 1


2026-02-25 13:48:52.706 | INFO     | sme_kt_zh_collaboration_rag.feature0_ingestion:compare_chunk_size:221 - Validation - fixed_size_800         chunks=   8, avg=   712, min=   97, max=   800, >256tok=   0



=== fixed_size_800 ===
fixed_size_800         chunks=   8, avg=   712, min=   97, max=   800, >256tok=   0

Chunk size histogram:
      97-167    | █████ 1
     167-238    |  0
     238-308    |  0
     308-378    |  0
     378-448    |  0
     448-519    |  0
     519-589    |  0
     589-659    |  0
     659-730    |  0
     730-800    | ████████████████████████████████████████ 7


2026-02-25 13:48:53.022 | INFO     | sme_kt_zh_collaboration_rag.feature0_ingestion:compare_chunk_size:221 - Validation - paragraph_600          chunks=  11, avg=   450, min=   79, max=   593, >256tok=   0



=== paragraph_600 ===
paragraph_600          chunks=  11, avg=   450, min=   79, max=   593, >256tok=   0

Chunk size histogram:
      79-130    | ██████████ 1
     130-182    |  0
     182-233    |  0
     233-285    |  0
     285-336    | ██████████ 1
     336-387    |  0
     387-439    | ████████████████████ 2
     439-490    | ████████████████████ 2
     490-542    | ██████████ 1
     542-593    | ████████████████████████████████████████ 4


{'header_based': ([Chunk(title='# Supplier Sustainability Requirements', content='# Supplier Sustainability Requirements\n\nVersion: 1.2 | Approved by CEO (Andrea Frei) | Effective: 1 January 2024 Classification: Internal use only, do not share externally without management approval\n\n', mime_type='text/markdown', metadata={'chapters': ['# Supplier Sustainability Requirements']}),
   Chunk(title='## 1. Purpose and Scope', content="## 1. Purpose and Scope\n\n This document establishes the minimum sustainability requirements for all packaging product suppliers from whom PrimePack AG procures goods for resale or distribution. It applies to all supplier relationships, both new and existing.\n\nThe policy is designed to support PrimePack AG's obligations under the EU Corporate Sustainability Reporting Directive (CSRD) and to enable consistent, evidence-based responses to customer sustainability inquiries.\n\n", mime_type='text/markdown', metadata={'chapters': ['# Supplier Sustainability Re

In [6]:
results = compare_strategies(
    f"{DATA_DIR}/ART_internal_procurement_policy.pdf",
)

2026-02-25 14:25:54.156 | INFO     | sme_kt_zh_collaboration_rag.feature0_ingestion:compare_strategies:173 - Comparing chunking strategies on: ART_internal_procurement_policy.pdf
2026-02-25 14:25:54.456 | INFO     | sme_kt_zh_collaboration_rag.feature0_ingestion:compare_strategies:186 -   header_based           chunks=  12, avg=   416, min=   32, max=   958, >256tok=   0
2026-02-25 14:25:54.777 | INFO     | sme_kt_zh_collaboration_rag.feature0_ingestion:compare_strategies:186 -   fixed_size_800         chunks=   8, avg=   712, min=   97, max=   800, >256tok=   0
2026-02-25 14:25:55.048 | INFO     | sme_kt_zh_collaboration_rag.feature0_ingestion:compare_strategies:186 -   paragraph_600          chunks=  11, avg=   450, min=   79, max=   593, >256tok=   0


In [7]:
print_comparison_table(results)

Strategy                Chunks  Avg chars    Min     Max  >256tok
-----------------------------------------------------------------
header_based           chunks=  12, avg=   416, min=   32, max=   958, >256tok=   0
None
fixed_size_800         chunks=   8, avg=   712, min=   97, max=   800, >256tok=   0
None
paragraph_600          chunks=  11, avg=   450, min=   79, max=   593, >256tok=   0
None
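In the output above, the rows do not line up with the header and each row is followed by `None`. This pattern is consistent with printing the return value of a function that already prints or logs its own line, although the actual cause inside `print_comparison_table` is not shown here. Below is a sketch of a table formatter that aligns each row with the header. `Row` is a hypothetical stand-in for `ChunkStats`, whose real fields are not shown in this notebook, and the values are copied from the log lines above.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical stand-in for ChunkStats; field names are assumptions."""
    strategy: str
    n_chunks: int
    avg_chars: int
    min_chars: int
    max_chars: int
    over_limit: int


def format_table(rows: list[Row]) -> str:
    """Build the whole table as one string so nothing prints as a side effect."""
    header = f"{'Strategy':<22}{'Chunks':>8}{'Avg chars':>11}{'Min':>7}{'Max':>8}{'>256tok':>9}"
    lines = [header, "-" * len(header)]
    for r in rows:
        lines.append(
            f"{r.strategy:<22}{r.n_chunks:>8}{r.avg_chars:>11}"
            f"{r.min_chars:>7}{r.max_chars:>8}{r.over_limit:>9}"
        )
    return "\n".join(lines)


rows = [
    Row("header_based", 12, 416, 32, 958, 0),
    Row("fixed_size_800", 8, 712, 97, 800, 0),
    Row("paragraph_600", 11, 450, 79, 593, 0),
]
table = format_table(rows)
print(table)
```

Returning a string and printing it once at the call site avoids the stray `None` lines, because no function that returns `None` is ever passed to `print`.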


In [11]:
len(chunks)

374

In [14]:
chunks[0]

Chunk(title='# Supplier Sustainability Requirements', content='# Supplier Sustainability Requirements\n\nVersion: 1.2 | Approved by CEO (Andrea Frei) | Effective: 1 January 2024 Classification: Internal use only, do not share externally without management approval\n\n', mime_type='text/markdown', metadata={'chapters': ['# Supplier Sustainability Requirements'], 'source_file': 'ART_internal_procurement_policy.pdf', 'source': 'ART_internal_procurement_policy.pdf', 'title': '# Supplier Sustainability Requirements'})

In [None]:
from sme_kt_zh_collaboration_rag.feature0_ingestion import validate_chunks_one_topic
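The implementation of `validate_chunks_one_topic` is not shown in this notebook. As one possible heuristic for the one-topic-per-chunk aspect (an assumption, not necessarily what the imported function does), a Markdown chunk can be flagged when it contains more than one heading, since the header-based chunks shown earlier each start with their own section heading. `count_headings` and `flag_multi_topic` are hypothetical names, and the sample chunks are made up.

```python
import re

# A Markdown ATX heading: 1 to 6 '#' characters at line start, then a space or tab.
HEADING_RE = re.compile(r"^#{1,6}[ \t]+\S", re.MULTILINE)


def count_headings(text: str) -> int:
    """Count Markdown headings in a chunk's text."""
    return len(HEADING_RE.findall(text))


def flag_multi_topic(contents: list[str], max_headings: int = 1) -> list[int]:
    """Return the indices of chunks with more than `max_headings` headings."""
    return [i for i, text in enumerate(contents) if count_headings(text) > max_headings]


# Made-up sample chunks for illustration.
contents = [
    "## 1. Purpose and Scope\n\nThis document establishes the minimum requirements.\n",
    "## 2. First topic\n\nSome text.\n\n## 3. Second topic\n\nMore text.\n",
    "Plain paragraph without any heading.",
]
print(flag_multi_topic(contents))  # [1]
```

In this notebook the check could be run as `flag_multi_topic([c.content for c in chunks])`. Heading counts are only a structural proxy: a chunk with a single heading can still mix topics, so a semantic check would be needed for stronger guarantees.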