# Feature Track 3: Retrieval Strategies

---

The baseline retriever embeds the user query and finds the nearest chunks by cosine similarity. This works well when the query is well-formed, specific, and aligned with how the data is written. In practice, this assumption often breaks.

Retrieval failures are not only about wording; they arise whenever there is a mismatch between user intent, document structure, and retrieval mechanics. Common failure modes include:
- Queries that rely on exact identifiers or codes
- Queries that mix semantic intent with keyword-level constraints
- Queries where many chunks are vaguely relevant, but only a few are actually useful
- Queries that should be restricted to a known document or subset, but aren’t

One visible manifestation of these failures is vocabulary mismatch:

> **Vocabulary mismatch** -> the right chunk exists but ranks too low because the phrasing differs:

| User query | Document text | Issue |
|---|---|---|
| `"FSC-C147829 certificate"` | `"complies with FSC standard C147829"` | Exact string -> semantic embedding is noisy |
| `"PFAS-free"` | `"no intentionally added per- and polyfluoroalkyl substances"` | Acronym vs. full name |
| `"Is this Blauer Engel certified?"` | `"Blauer Engel DE-UZ 14"` | Equivalent claims, different phrasing |


But the underlying issue is broader: single-shot semantic retrieval is a blunt tool. It optimizes for general semantic similarity, not precision, constraints, or intent.

This notebook covers three strategies that address this, with the baseline included for reference:

| # | Strategy | Approach | Extra cost |
|---|---|---|---|
| 1 | **Baseline** | Single semantic query | None |
| 2 | **BM25** | Keyword retrieval, no embedding | Corpus index at startup |
| 3 | **Hybrid** | Semantic + BM25 via Reciprocal Rank Fusion | Corpus index at startup |
| 4 | **Metadata filter** | Restrict semantic search to a known document | None |


Query-transformation strategies are covered in a separate notebook.

---

## Setup

**Prerequisites:** `conversational-toolkit` and `backend` must be installed in editable mode.
For **OpenAI**, set `OPENAI_API_KEY`. For **Ollama**, start `ollama serve` and pull the model.

This notebook reuses the vector store from `feature0_baseline_rag.ipynb`.
Run that notebook first if the store does not exist yet.

In [4]:
from collections import Counter

from conversational_toolkit.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

from sme_kt_zh_collaboration_rag.feature0_baseline_rag import (
    build_llm,
    build_vector_store,
    load_chunks,
    EMBEDDING_MODEL,
    VS_PATH,
    RETRIEVER_TOP_K,
)
from sme_kt_zh_collaboration_rag.feature3_advanced_retrieval import (
    retrieve_baseline,
    retrieve_bm25,
    retrieve_hybrid,
    retrieve_with_metadata_filter,
    compare_retrieval_strategies,
    get_corpus_from_vector_store,
    print_strategy_comparison,
)

# Choose your LLM backend (only needed for the final answer cells)
BACKEND = "openai"  # "ollama", "openai", or "qwen"

embedding_model = SentenceTransformerEmbeddings(model_name=EMBEDDING_MODEL)
print(f"Embedding model: {EMBEDDING_MODEL}")

# Load documents from DATA_DIR and split them into chunks.
chunks = load_chunks(max_files=None)

# reset=True rebuilds the store from scratch; set reset=False to reuse an existing store
vector_store = await build_vector_store(
    chunks,
    embedding_model,
    db_path=VS_PATH,
    reset=True,
)
print("Vector store ready.")

llm = build_llm(backend=BACKEND)
print(f"LLM backend: {BACKEND}")

2026-02-26 14:14:14.335 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:__init__:57 - Sentence Transformer embeddings model loaded: sentence-transformers/all-MiniLM-L6-v2 with kwargs: {}
2026-02-26 14:14:14.343 | INFO     | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:204 - Chunking 34 files from /Users/pkoerner/Desktop/Kanton_Zurich/sme-kt-zh-collaboration-rag/data


Embedding model: sentence-transformers/all-MiniLM-L6-v2


2026-02-26 14:14:14.606 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_internal_procurement_policy.pdf: 12 chunks
2026-02-26 14:14:14.804 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_logylight_incomplete_datasheet.pdf: 6 chunks
2026-02-26 14:14:14.881 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_product_catalog.pdf: 7 chunks
2026-02-26 14:14:14.890 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_product_overview.xlsx: 1 chunks
2026-02-26 14:14:14.979 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_relicyc_logypal1_datasheet_2021.pdf: 5 chunks
2026-02-26 14:14:14.980 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_chunks:216 -   ART_response_inquiry_frische_felder.md: 6 chunks
2026-02-26 14:14:15.074 | DEBUG    | sme_kt_zh_collaboration_rag.feature0_baseline_rag:load_c

Vector store ready.
LLM backend: openai


---

## Build the BM25 Corpus

**BM25** (Best Match 25) is a lexical retrieval algorithm, so it requires access to the raw text of the corpus. Unlike vector search, it cannot query an external index incrementally; the corpus must be tokenized and indexed at startup.

Indexing is necessary because BM25 precomputes:
- Term frequencies per chunk
- Document frequencies across the corpus
- Normalization statistics (e.g. average document length)

At query time, scoring is fast because these statistics are already available.
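
The three precomputed statistics can be sketched in a few lines of plain Python. This is a minimal illustration, not the indexing code used by `retrieve_bm25`: the whitespace tokenizer, the `build_bm25_stats` helper, and the toy corpus are all assumptions made for this example.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer (an assumption; real tokenizers differ)."""
    return text.lower().split()


def build_bm25_stats(texts: list[str]) -> dict:
    """Precompute the corpus statistics that BM25 needs at query time."""
    tokenized = [tokenize(t) for t in texts]

    # Term frequencies per chunk: how often each token occurs in each chunk.
    term_freqs = [Counter(tokens) for tokens in tokenized]

    # Document frequencies: in how many chunks each token occurs at least once.
    doc_freqs: Counter = Counter()
    for tf in term_freqs:
        doc_freqs.update(tf.keys())

    # Normalization statistics: chunk lengths and their average.
    doc_lengths = [len(tokens) for tokens in tokenized]
    avg_doc_length = sum(doc_lengths) / len(doc_lengths) if doc_lengths else 0.0

    return {
        "term_freqs": term_freqs,
        "doc_freqs": doc_freqs,
        "doc_lengths": doc_lengths,
        "avg_doc_length": avg_doc_length,
    }


# Hypothetical three-chunk corpus, for illustration only.
toy_corpus = [
    "complies with FSC standard C147829",
    "FSC chain of custody policy",
    "global warming potential of the pallet",
]
stats = build_bm25_stats(toy_corpus)
print("chunks containing 'fsc':", stats["doc_freqs"]["fsc"])
print("average chunk length:", stats["avg_doc_length"])
```

Because every statistic depends on the whole corpus, adding or removing chunks means recomputing them, which is why the index is built once at startup.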

**Memory note:** the corpus is stored as a list of `ChunkRecord` objects, containing the chunk text and metadata.

In [6]:
print("Fetching corpus from vector store for BM25 indexing...")
corpus = await get_corpus_from_vector_store(vector_store, embedding_model, n=500)
print(f"Corpus size: {len(corpus)} chunks")
print("\nSample chunk titles:")
for c in corpus[:5]:
    src = c.metadata.get("source_file", "?")
    print(f"  {src:<52}  {repr(c.title)[:48]}")

Fetching corpus from vector store for BM25 indexing...


2026-02-26 14:17:32.368 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)


Corpus size: 367 chunks

Sample chunk titles:
  ART_supplier_brochure_tesa_ECO.pdf                    '### Climate performance:'
  EPD_cardboard_grupak_corrugated.pdf                   '#### [LCA ][information]'
  EPD_cardboard_grupak_corrugated.pdf                   '#### [LCA ][information]'
  EPD_cardboard_grupak_corrugated.pdf                   '#### [LCA ][information]'
  REF_ghg_protocol_product_lca.pdf                      '# **_T_**'


---

## BM25 Retrieval

**BM25** (Best Match 25) ranks chunks using exact token matches, weighted by corpus-level statistics. There are no embeddings involved; retrieval is a pure string-based scoring operation.


BM25 combines three key ideas:
- Term Frequency Saturation: Relevance increases with term frequency, but with diminishing returns. A chunk mentioning “python” 100 times is not 10× more relevant than one mentioning it 10 times.
- Document Length Normalization: Longer chunks naturally contain more terms. BM25 corrects for this so long chunks are not unfairly favored.
- Inverse Document Frequency (IDF): Rare terms (e.g. identifiers, codes, acronyms) carry more weight than common words.

The behavior of BM25 depends on two parameters:
- k1 (typically 1.2–2.0): controls term-frequency saturation
- b (0–1): controls document-length normalization
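
To make the effect of `k1` and `b` concrete, the sketch below computes only the term-frequency part of the standard Okapi BM25 formula (the IDF factor is left out). The function name `bm25_term_weight` and the defaults `k1=1.5`, `b=0.75` are illustrative assumptions and are not necessarily the values used by `retrieve_bm25`.

```python
def bm25_term_weight(
    tf: int,
    doc_len: int,
    avg_doc_len: float,
    k1: float = 1.5,
    b: float = 0.75,
) -> float:
    """Term-frequency component of Okapi BM25 for a single query term.

    k1 controls how quickly repeated occurrences stop adding weight.
    b controls how strongly chunks longer than average are penalized.
    """
    # Length normalization: equals 1.0 for a chunk of average length.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Saturation: 10x more occurrences in a chunk of average length.
weight_10 = bm25_term_weight(tf=10, doc_len=100, avg_doc_len=100)
weight_100 = bm25_term_weight(tf=100, doc_len=100, avg_doc_len=100)
print(f"tf=10  -> {weight_10:.3f}")
print(f"tf=100 -> {weight_100:.3f}")

# Length normalization: same term count, but the chunk is twice as long.
weight_long = bm25_term_weight(tf=5, doc_len=200, avg_doc_len=100)
weight_avg = bm25_term_weight(tf=5, doc_len=100, avg_doc_len=100)
print(f"long chunk {weight_long:.3f} < average chunk {weight_avg:.3f}")
```

With these settings, a term occurring 100 times scores only about 13% higher than one occurring 10 times, and the weight can never exceed `k1 + 1`. Setting `b=0` switches length normalization off entirely.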

### When BM25 wins over semantic search
BM25 excels when queries depend on exact lexical matches, such as:
- Identifiers, codes, or SKUs ("FSC-C147829")
- Acronyms and abbreviations ("PFAS", "SOC 2")
- Product names, standards, or regulatory labels
- Error codes or configuration keys

In these cases, semantic embeddings may blur or down-rank the most relevant chunk, while BM25 retrieves it reliably.

### When BM25 fails
BM25 is **vocabulary-literal**. If two phrases share no tokens, BM25 treats them as unrelated:
- "carbon footprint decrease"
- "CO₂ reduction"

Semantic search handles this naturally; BM25 does not.

BM25 also struggles with:
- Paraphrases
- Synonyms
- Implicit or conceptual queries
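
The vocabulary-literal behaviour is easy to check by looking at token overlap directly. The helper below is a hypothetical illustration that assumes a lowercase whitespace tokenizer; the tokenizer behind `retrieve_bm25` may split text differently.

```python
def shared_tokens(a: str, b: str) -> set[str]:
    """Return the tokens two strings have in common (lowercase, whitespace split)."""
    return set(a.lower().split()) & set(b.lower().split())


# Same meaning, no shared tokens: BM25 has nothing to score on.
print(shared_tokens("carbon footprint decrease", "CO₂ reduction"))

# Shared identifier and keyword: BM25 can match these.
print(shared_tokens("FSC-C147829 certificate", "certificate FSC-C147829 issued"))
```

An empty intersection means every query term contributes zero to the BM25 score, regardless of how close the two phrases are in meaning.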

In [None]:
# Query where BM25 excels: exact product code / certification number
QUERY = "FSC-C147829 certificate"
KEYWORDS = ["FSC", "C147829"]
# QUERY = "Is the carbon neutrality claim for the tape product independently verified?"
# KEYWORDS = ["CO₂", "carbon", "tesa", "verified", "68%", "tesapack"]

results_bm25_exact = await retrieve_bm25(QUERY, corpus, top_k=RETRIEVER_TOP_K)
results_semantic_exact = await retrieve_baseline(
    QUERY, embedding_model, vector_store, top_k=RETRIEVER_TOP_K
)

print(f"Exact query: {QUERY!r}\n")
print("── BM25 " + "─" * 55)
for chunk in results_bm25_exact.chunks[:10]:
    src = chunk.metadata.get("source_file", "?")
    title = chunk.title or "(no title)"
    hit = any(kw.lower() in (src + title + chunk.content).lower() for kw in KEYWORDS)
    print(
        f"  {'✓' if hit else '·'}  score={chunk.score:.4f}  {src:<44}  {title[:38]!r}"
    )

print("\n── Semantic baseline " + "─" * 43)
for chunk in results_semantic_exact.chunks[:10]:
    src = chunk.metadata.get("source_file", "?")
    title = chunk.title or "(no title)"
    hit = any(kw.lower() in (src + title + chunk.content).lower() for kw in KEYWORDS)
    print(
        f"  {'✓' if hit else '·'}  score={chunk.score:.4f}  {src:<44}  {title[:38]!r}"
    )
print("\n✓ = chunk contains a relevant keyword; · = no match")

2026-02-26 15:00:47.980 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)


Exact query: 'FSC-C147829 certificate'

── BM25 ───────────────────────────────────────────────────────
  ✓  score=24.2626  ART_supplier_brochure_CPR_wood_pallet.pdf     '## Material Sourcing'
  ✓  score=16.7802  ART_internal_procurement_policy.pdf           '### 3.5 FSC and Other Chain-of-Custody'
  ✓  score=14.6716  ART_response_inquiry_frische_felder.md        '## Incoming Customer Email'
  ✓  score=13.2907  ART_product_catalog.pdf                       '## Open Action Items (January 2025)'
  ·  score=8.7897  EPD_pallet_relicyc_logypal1.pdf               '# CONTENT DECLARATION'

── Semantic baseline ───────────────────────────────────────────
  ✓  score=0.9273  ART_internal_procurement_policy.pdf           '### 3.5 FSC and Other Chain-of-Custody'
  ✓  score=1.0484  ART_product_catalog.pdf                       '## Open Action Items (January 2025)'
  ✓  score=1.2262  ART_response_inquiry_frische_felder.md        '### Response to Section 2: Cardboard P'
  ·  score=1.2539  ART_supplier

---

## Hybrid Retrieval: Semantic + BM25 via Reciprocal Rank Fusion

Neither strategy dominates. Hybrid retrieval runs both in parallel and merges
their ranked lists using **Reciprocal Rank Fusion (RRF)**:

```
Semantic retriever ──► ranked list A 
                                   ├── RRF merge ──► final top-k
BM25 retriever     ──► ranked list B
```

**RRF formula:**

$$\text{RRF}(d) = \sum_{r \in \text{retrievers}} \frac{1}{k + \text{rank}_r(d)}$$

where $k = 60$ (standard default from the RRF paper). Using *ranks* rather than raw scores means L2 distances and BM25 scores, which are on incomparable scales, never need to be normalised.

Key properties:
- Chunks appearing in **both** lists get the highest scores
- Chunks appearing in **only one** list still get credit
- Sub-retrievers run **in parallel** -> latency is bounded by the slower one, not their sum
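
The formula above translates almost directly into code. The following is a minimal sketch assuming 1-based ranks and string chunk IDs; the function name `reciprocal_rank_fusion` and the example IDs are hypothetical, and the implementation inside `retrieve_hybrid` may differ in details.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]],
    k: int = 60,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Merge several ranked lists of chunk IDs into one list using RRF."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1, so the top hit contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; only ranks are used, never raw scores.
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:top_k]


# Hypothetical ranked lists from the two sub-retrievers.
semantic_ranking = ["policy", "catalog", "inquiry", "tesa"]
bm25_ranking = ["pallet_brochure", "policy", "inquiry", "catalog"]

fused = reciprocal_rank_fusion([semantic_ranking, bm25_ranking])
for chunk_id, score in fused:
    print(f"{chunk_id:<16} {score:.5f}")
```

In this toy example, `policy` ranks first because it sits near the top of both lists, while `pallet_brochure` (first in BM25 only) still makes the merged list with a lower score.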

In [22]:
# Exact-term query: hybrid should match BM25's strong result
QUERY = "FSC-C147829 certificate"
KEYWORDS = ["FSC", "C147829"]
# QUERY = "Is the carbon neutrality claim for the tape product independently verified?"
# KEYWORDS = ["CO₂", "carbon", "tesa", "verified", "68%", "tesapack"]

results_exact_hybrid = await retrieve_hybrid(
    QUERY, embedding_model, vector_store, corpus, top_k=RETRIEVER_TOP_K
)

print(f"\nQuery: {QUERY!r}\n")
results = await compare_retrieval_strategies(
    QUERY, embedding_model, vector_store, corpus, top_k=RETRIEVER_TOP_K
)
print_strategy_comparison(results, relevant_keywords=KEYWORDS, top_n=5)

2026-02-26 15:16:51.149 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-26 15:16:51.153 | INFO     | sme_kt_zh_collaboration_rag.feature3_advanced_retrieval:compare_retrieval_strategies:123 - Comparing retrieval strategies for: 'FSC-C147829 certificate'
2026-02-26 15:16:51.163 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-26 15:16:51.267 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)



Query: 'FSC-C147829 certificate'


Relevant keywords: ['FSC', 'C147829']

Strategy                Top-5 retrieved sources
──────────────────────────────────────────────────────────────────────────────────────────

baseline
  ✓  ART_internal_procurement_policy.pdf               '### 3.5 FSC and Other Chain-of-Custody'
  ✓  ART_product_catalog.pdf                           '## Open Action Items (January 2025)'
  ✓  ART_response_inquiry_frische_felder.md            '### Response to Section 2: Cardboard P'
  ·  ART_supplier_brochure_tesa_ECO.pdf                '### Certifications held:'
  ✓  ART_supplier_brochure_CPR_wood_pallet.pdf         '## Material Sourcing'

bm25
  ✓  ART_supplier_brochure_CPR_wood_pallet.pdf         '## Material Sourcing'
  ✓  ART_internal_procurement_policy.pdf               '### 3.5 FSC and Other Chain-of-Custody'
  ✓  ART_response_inquiry_frische_felder.md            '## Incoming Customer Email'
  ✓  ART_product_catalog.pdf                           '## Open Act

---

## Metadata Filtering

When a user's question is explicitly about a specific document, for example `"Summarise the environmental performance of the tesa tape"`, the search space can be **scoped to that document** with a metadata filter.

### ChromaDB filter syntax

Filters use MongoDB-style operators:
```python
{"source_file": {"$eq": "EPD_tesa.pdf"}} # single condition
{"$or": [{"source_file": {"$eq": "A.pdf"}},
          {"source_file": {"$eq": "B.pdf"}}]} # either of two files
{"$and": [{"source_file": {"$eq": "A.pdf"}},
           {"mime_type": {"$eq": "text/markdown"}}]} # both conditions
```
Available operators: `$eq` (equals), `$ne` (not equals), `$gt` (greater than), `$lt` (less than), `$gte` (greater than or equal to), `$lte` (less than or equal to), `$and`, `$or`.
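
When the set of target files is only known at runtime, the filter dictionary can be built programmatically. The helper below is a hypothetical convenience function (it is not part of the toolkit) and uses only the `$eq` and `$or` operators shown above. It returns a plain `$eq` condition for a single file, on the assumption that `$or` expects at least two conditions.

```python
def source_file_filter(filenames: list[str]) -> dict:
    """Build a metadata filter that matches any of the given source files."""
    if not filenames:
        raise ValueError("at least one filename is required")
    conditions = [{"source_file": {"$eq": name}} for name in filenames]
    if len(conditions) == 1:
        # A single condition needs no logical operator around it.
        return conditions[0]
    return {"$or": conditions}


print(source_file_filter(["EPD_tesa.pdf"]))
print(source_file_filter(["A.pdf", "B.pdf"]))
```

The resulting dictionary has the same shape as the hand-written filters above, so it should be usable wherever those are accepted, for example as the `filters` argument of `retrieve_with_metadata_filter`.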

In [16]:
source_counts = Counter(c.metadata.get("source_file", "?") for c in corpus)
print(f"Source files in corpus ({len(source_counts)} total):")
for fname, n in sorted(source_counts.items()):
    print(f"  {fname:<60}  {n:>3} chunks")

Source files in corpus (34 total):
  ART_internal_procurement_policy.pdf                            12 chunks
  ART_logylight_incomplete_datasheet.pdf                          6 chunks
  ART_product_catalog.pdf                                         7 chunks
  ART_product_overview.xlsx                                       1 chunks
  ART_relicyc_logypal1_datasheet_2021.pdf                         5 chunks
  ART_response_inquiry_frische_felder.md                          6 chunks
  ART_supplier_brochure_CPR_wood_pallet.pdf                       6 chunks
  ART_supplier_brochure_tesa_ECO.pdf                              8 chunks
  EPD_cardboard_grupak_corrugated.pdf                            35 chunks
  EPD_cardboard_redbox_cartonpallet.pdf                          11 chunks
  EPD_pallet_CPR_noe.pdf                                         11 chunks
  EPD_pallet_relicyc_logypal1.pdf                                17 chunks
  EPD_pallet_stabilplastik_ep08.pdf                              

In [24]:
# Scope retrieval to a single document
TARGET_FILE = "EPD_pallet_relicyc_logypal1.pdf"
FILTER_QUERY = "What is the global warming potential of this product?"

filter_def = {"source_file": {"$eq": TARGET_FILE}}

results_filtered = await retrieve_with_metadata_filter(
    FILTER_QUERY,
    embedding_model,
    vector_store,
    filters=filter_def,
    top_k=RETRIEVER_TOP_K,
)
results_unfiltered = await retrieve_baseline(
    FILTER_QUERY, embedding_model, vector_store, top_k=RETRIEVER_TOP_K
)

print(f"\nQuery: {FILTER_QUERY!r}")
print(f"Target file: {TARGET_FILE}")
print()
print("── Without filter (all documents) " + "─" * 30)
for chunk in results_unfiltered.chunks[:RETRIEVER_TOP_K]:
    src = chunk.metadata.get("source_file", "?")
    marker = "✓" if src == TARGET_FILE else "·"
    print(f"  {marker}  {src}")

print()
print("── With metadata filter (single document) " + "─" * 22)
for chunk in results_filtered.chunks[:RETRIEVER_TOP_K]:
    src = chunk.metadata.get("source_file", "?")
    print(f"  ✓  {src}  |  {chunk.title}")
print()
print("All results from the filtered search belong to the target file.")

2026-02-26 15:18:05.926 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-26 15:18:05.940 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)



Query: 'What is the global warming potential of this product?'
Target file: EPD_pallet_relicyc_logypal1.pdf

── Without filter (all documents) ──────────────────────────────
  ·  ART_supplier_brochure_tesa_ECO.pdf
  ·  EPD_pallet_stabilplastik_ep08.pdf
  ·  REF_ghg_protocol_product_lca.pdf
  ·  EPD_pallet_CPR_noe.pdf
  ·  REF_ghg_protocol_product_lca.pdf

── With metadata filter (single document) ──────────────────────
  ✓  EPD_pallet_relicyc_logypal1.pdf  |  # LCA INFORMATION
  ✓  EPD_pallet_relicyc_logypal1.pdf  |  # ADDITIONAL INFORMATION
  ✓  EPD_pallet_relicyc_logypal1.pdf  |  # ENVIRONMENTAL PERFORMANCE
  ✓  EPD_pallet_relicyc_logypal1.pdf  |  # REFERENCES
  ✓  EPD_pallet_relicyc_logypal1.pdf  |  # PRODUCT INFORMATION

All results from the filtered search belong to the target file.


---

## Side-by-Side Comparison

`compare_retrieval_strategies()` runs baseline, BM25, hybrid, and (optionally)
metadata-filtered retrieval for a single query.
`print_strategy_comparison()` marks chunks containing expected keywords
so retrieval gaps are immediately visible.

In [None]:
# Include a metadata-filter strategy in the comparison
SCOPED_QUERY = "What materials is Logypal 1 made out of?"
FILTER = {"source_file": {"$eq": TARGET_FILE}}

results = await compare_retrieval_strategies(
    SCOPED_QUERY,
    embedding_model,
    vector_store,
    corpus,
    top_k=RETRIEVER_TOP_K,
    metadata_filters=FILTER,
)
print_strategy_comparison(results, relevant_keywords=["material", "Logypal 1"], top_n=5)

2026-02-26 15:25:43.131 | INFO     | sme_kt_zh_collaboration_rag.feature3_advanced_retrieval:compare_retrieval_strategies:123 - Comparing retrieval strategies for: 'What materials is Logypal 1 made out of?'
2026-02-26 15:25:43.218 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-26 15:25:43.320 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-26 15:25:43.335 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)



Relevant keywords: ['material', 'Logypal 1']

Strategy                Top-5 retrieved sources
──────────────────────────────────────────────────────────────────────────────────────────

baseline
  ✓  ART_relicyc_logypal1_datasheet_2021.pdf           '## Overview'
  ✓  EPD_tape_IPG_wateractivated.pdf                   '###### 23'
  ·  ART_logylight_incomplete_datasheet.pdf            '## Product Overview'
  ✓  EPD_tape_IPG_wateractivated.pdf                   '###### 26'
  ✓  EPD_pallet_CPR_noe.pdf                            '#### 5. Content declaration'

bm25
  ✓  EPD_pallet_relicyc_logypal1.pdf                   '# PRODUCT INFORMATION'
  ✓  EPD_pallet_relicyc_logypal1.pdf                   '# CONTENT DECLARATION'
  ✓  SPEC_pallet_CPR_recycled_plastic.pdf              '# **NOÈ pallet made of recycled plasti'
  ✓  EPD_pallet_CPR_noe.pdf                            '#### 2. Company information'
  ✓  REF_eu_csrd.pdf                                   '# DIRECTIVES'

hybrid
  ✓  ART_relicyc

In [32]:
# Include a metadata-filter strategy in the comparison
SCOPED_QUERY = "Until when is the PCR valid for the Logypal 1 pallet?"
FILTER = {"source_file": {"$eq": TARGET_FILE}}

results = await compare_retrieval_strategies(
    SCOPED_QUERY,
    embedding_model,
    vector_store,
    corpus,
    top_k=RETRIEVER_TOP_K,
    metadata_filters=FILTER,
)
print_strategy_comparison(
    results, relevant_keywords=["PCR", "Logypal 1", "valid", "pallet"], top_n=5
)

2026-02-26 15:26:55.507 | INFO     | sme_kt_zh_collaboration_rag.feature3_advanced_retrieval:compare_retrieval_strategies:123 - Comparing retrieval strategies for: 'Until when is the PCR valid for the Logypal 1 pallet?'
2026-02-26 15:26:55.759 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-26 15:26:55.868 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)
2026-02-26 15:26:55.878 | DEBUG    | conversational_toolkit.embeddings.sentence_transformer:get_embeddings:76 - sentence-transformers/all-MiniLM-L6-v2 embeddings size: (1, 384)



Relevant keywords: ['PCR', 'Logypal 1', 'valid', 'pallet']

Strategy                Top-5 retrieved sources
──────────────────────────────────────────────────────────────────────────────────────────

baseline
  ✓  ART_relicyc_logypal1_datasheet_2021.pdf           '## Overview'
  ✓  EPD_pallet_relicyc_logypal1.pdf                   '# CONTENT DECLARATION'
  ✓  SPEC_pallet_CPR_recycled_plastic.pdf              '# **NOÈ pallet made of recycled plasti'
  ✓  ART_logylight_incomplete_datasheet.pdf            '## Product Overview'
  ✓  SPEC_pallet_CPR_plastic.pdf                       '## 1200 x 800 mm'

bm25
  ✓  EPD_pallet_relicyc_logypal1.pdf                   '### Accountabilities for PCR, LCA and '
  ✓  EPD_tape_IPG_hotmelt.pdf                          '## **EPD Programme Information**'
  ✓  EPD_tape_IPG_wateractivated.pdf                   '## **EPD Programme Information**'
  ✓  EPD_cardboard_redbox_cartonpallet.pdf             '## **REDBOX Srl**'
  ✓  EPD_pallet_relicyc_logypal1.pdf  

> **Observation:** No single strategy dominates all query types.
> - **BM25** wins for product codes and certification numbers.
> - **Semantic** wins for meaning-level queries with vocabulary mismatch.
> - **Hybrid** is the most consistent across both types — the recommended default.
> - **Metadata filter** is useful when the user has already identified the document.

---

## Summary

| Strategy | Best for | Limitation | Extra cost |
|---|---|---|---|
| **Baseline (semantic)** | Vocabulary mismatch, paraphrases | Fails on exact terms, IDs | None |
| **BM25** | Exact terms, product codes, acronyms | Fails on semantic queries | Corpus index at startup |
| **Hybrid** | Both — consistent across query types | BM25 corpus must fit in memory | Corpus index at startup |
| **Metadata filter** | Scoped / known-document queries | Requires caller to know the document | None |