# Feature Track 3: Retrieval Strategies

---

The baseline retriever embeds the user query and finds the nearest chunks by cosine similarity. This works well when the query is well-formed, specific, and aligned with how the data is written. In practice, this assumption often breaks.

Retrieval failures are not only about wording. They arise whenever user intent, document structure, and retrieval mechanics are mismatched. Common failure modes include:
- Queries that rely on exact identifiers or codes
- Queries that mix semantic intent with keyword-level constraints
- Queries where many chunks are vaguely relevant, but only a few are actually useful
- Queries that should be restricted to a known document or subset, but aren’t

One visible manifestation of these failures is vocabulary mismatch:

> **Vocabulary mismatch** -> the right chunk exists but ranks too low because the phrasing differs:

| User query | Document text | Issue |
|---|---|---|
| `"FSC-C147829 certificate"` | `"complies with FSC standard C147829"` | Exact string -> semantic embedding is noisy |
| `"PFAS-free"` | `"no intentionally added per- and polyfluoroalkyl substances"` | Acronym vs. full name |
| `"Is this Blauer Engel certified?"` | `"Blauer Engel DE-UZ 14"` | Equivalent claims, different phrasing |


But the underlying issue is broader: single-shot semantic retrieval is a blunt tool. It optimizes for general semantic similarity, not precision, constraints, or intent.

This notebook compares the semantic baseline with three strategies that address this:

| # | Strategy | Approach | Extra cost |
|---|---|---|---|
| 1 | **Baseline** | Single semantic query | None |
| 2 | **BM25** | Keyword retrieval, no embedding | Corpus index at startup |
| 3 | **Hybrid** | Semantic + BM25 via Reciprocal Rank Fusion | Corpus index at startup |
| 4 | **Metadata filter** | Restrict semantic search to a known document | None |
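
As a rough illustration of the metadata filter in row 4, the sketch below narrows a candidate set to one document before any similarity search runs. It assumes chunks are plain dictionaries carrying a `source_file` metadata key; `filter_by_source` and the toy data are hypothetical and not part of the project code.

```python
def filter_by_source(chunks, source_file):
    """Keep only chunks whose metadata names the given source file."""
    return [c for c in chunks if c["metadata"].get("source_file") == source_file]


# Toy chunks for illustration only (contents are made up).
toy_chunks = [
    {"content": "FSC chain-of-custody rules", "metadata": {"source_file": "ART_internal_procurement_policy.pdf"}},
    {"content": "Pallet dimensions", "metadata": {"source_file": "ART_product_catalog.pdf"}},
    {"content": "PFAS declaration requirement", "metadata": {"source_file": "ART_internal_procurement_policy.pdf"}},
]

# Semantic search would then run over `candidates` instead of the full corpus.
candidates = filter_by_source(toy_chunks, "ART_internal_procurement_policy.pdf")
print(len(candidates), "of", len(toy_chunks), "chunks remain")
```

In a real vector store, the same restriction would typically be passed to the store's query call rather than applied in Python afterwards.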


Query-transformation strategies are covered in a separate notebook.

---

## Setup

**Prerequisites:** `conversational-toolkit` and `backend` must be installed in editable mode.
For **OpenAI**, set `OPENAI_API_KEY`. For **Ollama**, start `ollama serve` and pull the model.

This notebook reuses the vector store from `feature0_baseline_rag.ipynb`.
Run that notebook first if the store does not exist yet.

In [4]:
from conversational_toolkit.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

from sme_kt_zh_collaboration_rag.feature0_baseline_rag import (
    build_llm,
    build_vector_store,
    load_chunks,
    EMBEDDING_MODEL,
    VS_PATH,
    RETRIEVER_TOP_K,
)
from sme_kt_zh_collaboration_rag.feature3_advanced_retrieval import (
    retrieve_baseline,
    retrieve_bm25,
    retrieve_hybrid,
    compare_retrieval_strategies,
    get_corpus_from_vector_store,
    print_strategy_comparison,
)

# Choose your LLM backend (only needed for the final answer cells)
BACKEND = "openai"  # "ollama", "openai", or "qwen"

embedding_model = SentenceTransformerEmbeddings(model_name=EMBEDDING_MODEL)
print(f"Embedding model: {EMBEDDING_MODEL}")

# Load documents from DATA_DIR and split them into chunks.
chunks = load_chunks(max_files=None)

# reset=True rebuilds the store from scratch; set reset=False to reuse an existing store
vector_store = await build_vector_store(
    chunks,
    embedding_model,
    db_path=VS_PATH,
    reset=True,
)
print("Vector store ready.")

llm = build_llm(backend=BACKEND)
print(f"LLM backend: {BACKEND}")

2026-02-26 14:14:14.335 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:__init__:57 - Sentence Transformer embeddings model loaded: sentence-transformers/all-MiniLM-L6-v2 with kwargs: {}
2026-02-26 14:14:14.343 | INFO     | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:204 - Chunking 34 files from /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag/data


Embedding model: sentence-transformers/all-MiniLM-L6-v2


2026-02-26 14:14:14.606 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_internal_procurement_policy.pdf: 12 chunks
2026-02-26 14:14:14.804 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_logylight_incomplete_datasheet.pdf: 6 chunks
2026-02-26 14:14:14.881 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_product_catalog.pdf: 7 chunks
2026-02-26 14:14:14.890 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_product_overview.xlsx: 1 chunks
2026-02-26 14:14:14.979 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_relicyc_logypal1_datasheet_2021.pdf: 5 chunks
2026-02-26 14:14:14.980 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_response_inquiry_frische_felder.md: 6 chunks
2026-02-26 14:14:15.074 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_c

Vector store ready.
LLM backend: openai


---

## Build the BM25 Corpus

**BM25** (Best Match 25) is a lexical retrieval algorithm, so it requires access to the raw text of the corpus. Unlike vector search, it cannot query an external index incrementally: the corpus must be tokenized and indexed at startup.

Indexing is necessary because BM25 precomputes:
- Term frequencies per chunk
- Document frequencies across the corpus
- Normalization statistics (e.g. average document length)

At query time, scoring is fast because these statistics are already available.
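
The three precomputed statistics can be sketched in a few lines. This is a minimal illustration, assuming a lowercase whitespace tokenizer and plain strings as chunks; `tokenize`, `build_bm25_stats`, and the toy corpus are hypothetical names, not the helpers used by `retrieve_bm25`.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real setups may differ)."""
    return text.lower().split()


def build_bm25_stats(chunks):
    """Precompute the statistics BM25 needs at query time."""
    tokenized = [tokenize(c) for c in chunks]
    # Term frequencies per chunk
    term_freqs = [Counter(tokens) for tokens in tokenized]
    # Document frequencies: number of chunks containing each term
    doc_freqs = Counter(term for tokens in tokenized for term in set(tokens))
    # Normalization statistic: average chunk length in tokens
    avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)
    return term_freqs, doc_freqs, avg_len


toy_corpus = [
    "complies with FSC standard C147829",
    "FSC chain of custody policy",
    "no intentionally added PFAS",
]
term_freqs, doc_freqs, avg_len = build_bm25_stats(toy_corpus)
print(doc_freqs["fsc"], round(avg_len, 2))
```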

**Memory note:** the corpus is held in memory as a list of `ChunkRecord` objects, each containing the chunk text and metadata.

In [6]:
print("Fetching corpus from vector store for BM25 indexing...")
corpus = await get_corpus_from_vector_store(vector_store, embedding_model, n=500)
print(f"Corpus size: {len(corpus)} chunks")
print("\nSample chunk titles:")
for c in corpus[:5]:
    src = c.metadata.get("source_file", "?")
    print(f"  {src:<52}  {repr(c.title)[:48]}")

Fetching corpus from vector store for BM25 indexing...


2026-02-26 14:17:32.368 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)


Corpus size: 367 chunks

Sample chunk titles:
  ART_supplier_brochure_tesa_ECO.pdf                    '### Climate performance:'
  EPD_cardboard_grupak_corrugated.pdf                   '#### [LCA ][information]'
  EPD_cardboard_grupak_corrugated.pdf                   '#### [LCA ][information]'
  EPD_cardboard_grupak_corrugated.pdf                   '#### [LCA ][information]'
  REF_ghg_protocol_product_lca.pdf                      '# **_T_**'


---

## BM25 Retrieval

**BM25** (Best Match 25) ranks chunks using exact token matches, weighted by corpus-level statistics. There are no embeddings involved; retrieval is a pure string-based scoring operation.


BM25 combines three key ideas:
- Term Frequency Saturation: Relevance increases with term frequency, but with diminishing returns. A chunk mentioning “python” 100 times is not 10× more relevant than one mentioning it 10 times.
- Document Length Normalization: Longer chunks naturally contain more terms. BM25 corrects for this so long chunks are not unfairly favored.
- Inverse Document Frequency (IDF): Rare terms (e.g. identifiers, codes, acronyms) carry more weight than common words.

The behavior of BM25 depends on two parameters:
- `k1` (typically 1.2–2.0): controls term-frequency saturation
- `b` (0–1): controls document-length normalization
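
To show how the three ideas and the two parameters fit together, here is a minimal sketch of the textbook BM25 score. It assumes whitespace tokenization, one common non-negative IDF variant, and defaults of `k1=1.5` and `b=0.75`; `bm25_scores` is a hypothetical function and may differ from what `retrieve_bm25` does internally.

```python
import math
from collections import Counter


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score every chunk against the query with the textbook BM25 formula."""
    docs = [c.lower().split() for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freqs = Counter(t for d in docs for t in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Document length normalization, controlled by b
        length_norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # IDF: rare terms weigh more than common ones
            idf = math.log(1 + (n_docs - doc_freqs[term] + 0.5) / (doc_freqs[term] + 0.5))
            # Term frequency saturation, controlled by k1
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


toy_chunks = [
    "complies with fsc standard c147829",
    "fsc chain of custody policy",
    "water activated tape product data",
]
scores = bm25_scores("fsc c147829", toy_chunks)
print([round(s, 3) for s in scores])
```

The chunk containing both the common term and the rare identifier scores highest, the chunk with only the common term scores lower, and the chunk sharing no tokens scores zero.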

### When BM25 wins over semantic search
BM25 excels when queries depend on exact lexical matches, such as:
- Identifiers, codes, or SKUs ("FSC-C147829")
- Acronyms and abbreviations ("PFAS", "SOC 2")
- Product names, standards, or regulatory labels
- Error codes or configuration keys

In these cases, semantic embeddings may blur or down-rank the most relevant chunk, while BM25 retrieves it reliably.

### When BM25 fails
BM25 is **vocabulary-literal**. If two phrases share no tokens, BM25 treats them as unrelated, for example:
- "carbon footprint decrease"
- "CO₂ reduction"

Semantic search handles this naturally; BM25 does not.

BM25 also struggles with:
- Paraphrases
- Synonyms
- Implicit or conceptual queries
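
The failure case above comes down to an empty token intersection. The following sketch makes that visible, assuming simple lowercase whitespace tokenization; `shared_tokens` is a hypothetical helper used only for illustration.

```python
def shared_tokens(a, b):
    """Return the set of tokens two strings have in common."""
    return set(a.lower().split()) & set(b.lower().split())


# Same meaning, no shared tokens: BM25 has nothing to score on.
overlap = shared_tokens("carbon footprint decrease", "CO₂ reduction")

# Shared identifier: BM25 has an exact token to match.
id_overlap = shared_tokens("FSC-C147829 certificate", "certificate number FSC-C147829")

print(overlap, id_overlap)
```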

In [None]:
# Query where BM25 excels: exact product code / certification number
QUERY = "FSC-C147829 certificate"
KEYWORDS = ["FSC", "C147829"]
# QUERY = "Is the carbon neutrality claim for the tape product independently verified?"
# KEYWORDS = ["CO₂", "carbon", "tesa", "verified", "68%", "tesapack"]

results_bm25_exact = await retrieve_bm25(QUERY, corpus, top_k=RETRIEVER_TOP_K)
results_semantic_exact = await retrieve_baseline(
    QUERY, embedding_model, vector_store, top_k=RETRIEVER_TOP_K
)

print(f"Exact query: {QUERY!r}\n")
print("── BM25 " + "─" * 55)
for chunk in results_bm25_exact.chunks[:10]:
    src = chunk.metadata.get("source_file", "?")
    title = chunk.title or "(no title)"
    hit = any(kw.lower() in (src + title + chunk.content).lower() for kw in KEYWORDS)
    print(
        f"  {'✓' if hit else '·'}  score={chunk.score:.4f}  {src:<44}  {title[:38]!r}"
    )

print("\n── Semantic baseline " + "─" * 43)
for chunk in results_semantic_exact.chunks[:10]:
    src = chunk.metadata.get("source_file", "?")
    title = chunk.title or "(no title)"
    hit = any(kw.lower() in (src + title + chunk.content).lower() for kw in KEYWORDS)
    print(
        f"  {'✓' if hit else '·'}  score={chunk.score:.4f}  {src:<44}  {title[:38]!r}"
    )
print("\n✓ = chunk contains a relevant keyword; · = no match")

2026-02-26 15:00:47.980 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)


Exact query: 'FSC-C147829 certificate'

── BM25 ───────────────────────────────────────────────────────
  ✓  score=24.2626  ART_supplier_brochure_CPR_wood_pallet.pdf     '## Material Sourcing'
  ✓  score=16.7802  ART_internal_procurement_policy.pdf           '### 3.5 FSC and Other Chain-of-Custody'
  ✓  score=14.6716  ART_response_inquiry_frische_felder.md        '## Incoming Customer Email'
  ✓  score=13.2907  ART_product_catalog.pdf                       '## Open Action Items (January 2025)'
  ·  score=8.7897  EPD_pallet_relicyc_logypal1.pdf               '# CONTENT DECLARATION'

── Semantic baseline ───────────────────────────────────────────
  ✓  score=0.9273  ART_internal_procurement_policy.pdf           '### 3.5 FSC and Other Chain-of-Custody'
  ✓  score=1.0484  ART_product_catalog.pdf                       '## Open Action Items (January 2025)'
  ✓  score=1.2262  ART_response_inquiry_frische_felder.md        '### Response to Section 2: Cardboard P'
  ·  score=1.2539  ART_supplier

---

## Hybrid Retrieval: Semantic + BM25 via Reciprocal Rank Fusion

Neither strategy dominates. Hybrid retrieval runs both in parallel and merges
their ranked lists using **Reciprocal Rank Fusion (RRF)**:

```
Semantic retriever ──► ranked list A ─┐
                                      ├── RRF merge ──► final top-k
BM25 retriever     ──► ranked list B ─┘
```

**RRF formula:**

$$\text{RRF}(d) = \sum_{r \in \text{retrievers}} \frac{1}{k + \text{rank}_r(d)}$$

where $k = 60$ (standard default from the RRF paper). Using *ranks* rather than raw scores means L2 distances and BM25 scores, which are on incomparable scales, never need to be normalised.

Key properties:
- Chunks appearing in **both** lists get the highest scores
- Chunks appearing in **only one** list still get credit
- Sub-retrievers run **in parallel** -> latency is bounded by the slower one, not their sum
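
The formula and the first two properties can be checked with a short sketch. It assumes each retriever returns an ordered list of chunk IDs with ranks starting at 1; `reciprocal_rank_fusion` and the toy IDs are hypothetical and not the implementation behind `retrieve_hybrid`.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    """Merge ranked lists of chunk IDs using RRF scores."""
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each retriever contributes 1 / (k + rank) for every chunk it returns.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return fused[:top_k], scores


# Toy rankings (IDs are made up for illustration).
semantic_ranking = ["policy_3_3", "epd_tape_1", "epd_tape_2"]
bm25_ranking = ["policy_3_3", "customer_email", "epd_tape_1"]

fused, scores = reciprocal_rank_fusion([semantic_ranking, bm25_ranking], top_k=3)
for cid in fused:
    print(f"{cid:<16} {scores[cid]:.4f}")
```

Chunks ranked by both retrievers accumulate two terms and rise to the top, while a chunk found by only one retriever still receives a single term and can make the final list.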

In [15]:
# Exact-term query: hybrid should match BM25's strong result
QUERY = "Are any tape products free of per- and polyfluoroalkyl substances?"
KEYWORDS = ["PFAS", "per-", "polyfluoroalkyl"]
# QUERY = "Is the carbon neutrality claim for the tape product independently verified?"
# KEYWORDS = ["CO₂", "carbon", "tesa", "verified", "68%", "tesapack"]

results_exact_hybrid = await retrieve_hybrid(
    QUERY, embedding_model, vector_store, corpus, top_k=RETRIEVER_TOP_K
)

print(f"\nQuery: {QUERY!r}\n")
results = await compare_retrieval_strategies(
    QUERY, embedding_model, vector_store, corpus, top_k=RETRIEVER_TOP_K
)
print_strategy_comparison(results, relevant_keywords=KEYWORDS, top_n=5)

2026-02-26 17:25:03.707 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-26 17:25:03.788 | INFO     | sme_kt_zh_collaboration_rag.feature3_advanced_retrieval:compare_retrieval_strategies:130 - Comparing retrieval strategies for: 'Are any tape products free of per- and polyfluoroalkyl substances?'
2026-02-26 17:25:03.802 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-26 17:25:03.918 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)



Query: 'Are any tape products free of per- and polyfluoroalkyl substances?'


Relevant keywords: ['PFAS', 'per-', 'polyfluoroalkyl']

Strategy                Top-5 retrieved sources
──────────────────────────────────────────────────────────────────────────────────────────

baseline
  ✓  ART_internal_procurement_policy.pdf               '### 3.3 PFAS (Per- and Polyfluoroalkyl'
  ·  EPD_tape_IPG_wateractivated.pdf                   '###### **Product**'
  ·  EPD_tape_IPG_wateractivated.pdf                   '#### **Product**'
  ·  EPD_tape_IPG_wateractivated.pdf                   '###### 17'
  ·  EPD_tape_IPG_wateractivated.pdf                   '###### **Packaging**'

bm25
  ✓  ART_internal_procurement_policy.pdf               '### 3.3 PFAS (Per- and Polyfluoroalkyl'
  ✓  ART_response_inquiry_frische_felder.md            '## Incoming Customer Email'
  ·  EPD_tape_IPG_wateractivated.pdf                   '###### **Product**'
  ·  EPD_tape_IPG_wateractivated.pdf                   '###### 

In [17]:
from conversational_toolkit.llms.base import LLMMessage, Roles
from conversational_toolkit.utils.retriever import build_query_with_chunks
from sme_kt_zh_collaboration_rag.feature0_baseline_rag import SYSTEM_PROMPT


async def rag_answer_from_chunks(llm, chunks, query):
    """Generate a RAG answer from pre-retrieved chunks (no internal retrieval)."""
    prompt = build_query_with_chunks(query, list(chunks))
    messages = [
        LLMMessage(role=Roles.SYSTEM, content=SYSTEM_PROMPT),
        LLMMessage(role=Roles.USER, content=prompt),
    ]
    return (await llm.generate(messages)).content


for strategy in ("baseline", "bm25", "hybrid"):
    result = results[strategy]
    print(f"\n{'─' * 72}")
    print(f"Strategy: {strategy.upper()}")
    print(f"{'─' * 72}")
    print("Sources used:")
    for chunk in result.chunks:
        src = chunk.metadata.get("source_file", "?")
        hit = any(
            kw.lower() in (src + (chunk.title or "") + chunk.content).lower()
            for kw in KEYWORDS
        )
        print(f"  {'✓' if hit else '·'}  {src:<50}  {repr(chunk.title or '')[:35]}")
    answer = await rag_answer_from_chunks(llm, result.chunks, QUERY)
    print(f"\n{answer}")


────────────────────────────────────────────────────────────────────────
  Strategy: BASELINE
────────────────────────────────────────────────────────────────────────
Sources used:
  ✓  ART_internal_procurement_policy.pdf                 '### 3.3 PFAS (Per- and Polyfluoroa
  ·  EPD_tape_IPG_wateractivated.pdf                     '###### **Product**'
  ·  EPD_tape_IPG_wateractivated.pdf                     '#### **Product**'
  ·  EPD_tape_IPG_wateractivated.pdf                     '###### 17'
  ·  EPD_tape_IPG_wateractivated.pdf                     '###### **Packaging**'


As of now, there is no specific information provided in the excerpts regarding whether any tape products are confirmed to be free of per- and polyfluoroalkyl substances (PFAS). However, it is stated that effective **1 July 2024**, PrimePack AG will not accept new products containing intentionally added PFAS and that all tape and adhesive suppliers must provide an explicit PFAS declaration for each product, either confirming that the product is free of intentionally added PFAS or disclosing any PFAS content with concentration and substance identification (source: 1134b8f1-0cab-479d-bd42-bba8f5ca4dba).

Therefore, while the commitment to not accept PFAS-containing products is clear, the current status of specific tape products regarding PFAS content is not detailed in the provided excerpts. Thus, I cannot confirm if any tape products are currently free of PFAS.

────────────────────────────────────────────────────────────────────────
  Strategy: BM25
────────────────────────────────────

As of now, PrimePack AG does not have PFAS declarations from its tape suppliers, and therefore cannot confirm whether any of the tape products are free of per- and polyfluoroalkyl substances (PFAS). The company has initiated requests for these declarations and will forward them as they are received, but currently, they cannot confirm PFAS-free status for any tape products (source: c5e88c3f-8a17-4994-8d6e-a657b31d8bda).

Effective from July 1, 2024, PrimePack AG will not accept new products containing intentionally added PFAS, and all tape and adhesive suppliers are required to provide explicit PFAS declarations for each product (source: 1134b8f1-0cab-479d-bd42-bba8f5ca4dba).

────────────────────────────────────────────────────────────────────────
  Strategy: HYBRID
────────────────────────────────────────────────────────────────────────
Sources used:
  ✓  ART_internal_procurement_policy.pdf                 '### 3.3 PFAS (Per- and Polyfluoroa
  ·  EPD_tape_IPG_wateractivated.pdf      

As of now, there is no specific information provided in the excerpts regarding whether any tape products from PrimePack AG are free of per- and polyfluoroalkyl substances (PFAS). However, it is stated that effective **1 July 2024**, PrimePack AG will not accept new products containing intentionally added PFAS into their portfolio, and that all tape and adhesive suppliers must provide a PFAS declaration for each product confirming either that the product is free of intentionally added PFAS or that PFAS content is disclosed with concentration and substance identification (source: 1134b8f1-0cab-479d-bd42-bba8f5ca4dba).

If you need specific declarations for particular tape products, such as those mentioned in the customer email (Product IDs: 50-100, 50-101, 50-102), this information is not available in the provided excerpts. Therefore, I cannot confirm if any of those products are free of PFAS.


---

## Quantitative Answer Quality Evaluation

This section uses two reference-free RAGAS metrics, which need no ground truth or proxy labels:

| Metric | What it measures | Ground truth? |
|--------|------------------|---------------|
| **Faithfulness** | Fraction of answer claims directly supported by the retrieved context | No |
| **AnswerRelevancy** | Whether the answer addresses the actual question | No |

For each query, each strategy retrieves chunks and generates an answer. The RAGAS judge LLM then checks the answer against those chunks.
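
To make the Faithfulness definition concrete, here is a toy calculation of the ratio. It assumes the per-claim verdicts have already been produced (in RAGAS that is the judge LLM's job); `faithfulness_ratio` and the verdict list are illustrative and not RAGAS internals.

```python
def faithfulness_ratio(claim_verdicts):
    """Fraction of answer claims judged as supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)


# Hypothetical verdicts for an answer split into four claims:
# True = supported by the retrieved chunks, False = not supported.
verdicts = [True, True, False, True]
score = faithfulness_ratio(verdicts)
print(score)
```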

In [None]:
from langchain_openai import ChatOpenAI, OpenAIEmbeddings as LangChainOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper  # type: ignore[import-untyped]
from ragas.metrics import (  # type: ignore[attr-defined]
    Faithfulness as RagasFaithfulness,
    AnswerRelevancy as RagasAnswerRelevancy,
)

from conversational_toolkit.evaluation import EvaluationSample
from conversational_toolkit.evaluation.adapters import evaluate_with_ragas
from sme_kt_zh_collaboration_rag.feature1_evaluation import EVALUATION_QUERIES

ragas_llm = LangchainLLMWrapper(
    ChatOpenAI(model="gpt-4o-mini", max_completion_tokens=8192)
)
ragas_embeddings = LangChainOpenAIEmbeddings(model="text-embedding-3-small")

queries = [q["query"] for q in EVALUATION_QUERIES]

# For each strategy, retrieve chunks, generate an answer, and build an EvaluationSample.
strategy_samples: dict[str, list[EvaluationSample]] = {
    s: [] for s in ("baseline", "bm25", "hybrid")
}

for query in queries:
    results_q = await compare_retrieval_strategies(
        query, embedding_model, vector_store, corpus, top_k=RETRIEVER_TOP_K
    )
    for strategy in ("baseline", "bm25", "hybrid"):
        chunks = list(results_q[strategy].chunks)
        answer = await rag_answer_from_chunks(llm, chunks, query)
        strategy_samples[strategy].append(
            EvaluationSample(
                query=query,
                answer=answer,
                retrieved_chunks=chunks,
            )
        )

print(f"Built {len(queries)} samples x 3 strategies ({len(queries) * 3} total)")

2026-02-26 18:18:02.675 | INFO     | sme_kt_zh_collaboration_rag.feature3_advanced_retrieval:compare_retrieval_strategies:130 - Comparing retrieval strategies for: 'Does PrimePack AG offer a product called the Lara Pallet?'
2026-02-26 18:18:02.938 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-26 18:18:03.104 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-26 18:18:07.310 | DEBUG    | conversational_toolkit.llms.openai:generate:87 - Completion: ChatCompletion(id='chatcmpl-DDZHvILdeEYAu3MggAQS2DtY6u31R', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Based on

Built 7 samples × 3 strategies (21 total)


In [23]:
metrics = [
    RagasFaithfulness(),  # type: ignore[call-arg]
    RagasAnswerRelevancy(strictness=1),  # type: ignore[call-arg]
]

reports = {}
for strategy, samples in strategy_samples.items():
    print(f"Evaluating {strategy}...")
    reports[strategy] = evaluate_with_ragas(
        samples=samples,
        metrics=metrics,
        llm=ragas_llm,
        embeddings=ragas_embeddings,
    )

metric_names = list(next(iter(reports.values())).summary().keys())
print(f"\n{'Strategy':<12}  " + "  ".join(f"{m:>20}" for m in metric_names))
print("─" * (14 + 22 * len(metric_names)))
for strategy, report in reports.items():
    print(
        f"{strategy:<12}  "
        + "  ".join(f"{s:>20.3f}" for s in report.summary().values())
    )

Evaluating baseline...
Evaluating bm25...
Evaluating hybrid...


Strategy              faithfulness      answer_relevancy
──────────────────────────────────────────────────────────
baseline                     0.964                 0.138
bm25                         0.825                 0.327
hybrid                       0.915                 0.250


---

## Summary

| Strategy | Best for | Limitation |
|---|---|---|
| **Baseline (semantic)** | Vocabulary mismatch, paraphrases | Fails on exact terms, IDs |
| **BM25** | Exact terms, product codes, acronyms | Fails on semantic queries |
| **Hybrid** | Both; consistent across query types | BM25 corpus must fit in memory |