Improve chunking and search retrieval quality #362

@RobertLD

Description

Problem Statement

LibScope's current chunking and search pipelines have several limitations that reduce retrieval quality, which is the core value proposition of a knowledge base. This issue tracks a set of targeted, incremental improvements informed by a codebase audit and an independent technical review.


Current State

Chunking (src/core/indexing.ts)

  • chunkContent() splits on markdown h1-h3 headings, hard-caps at 1500 chars
  • Injects <!-- context: H1 > H2 --> breadcrumbs as HTML comments (noisy for embeddings)
  • When maxChunkSize is hit, cuts at the current line — no paragraph awareness
  • Chunk embeddings contain only chunk.content — document title, library, version, and topic are never embedded (biggest quality hole)
  • No inter-chunk overlap, even though chunkContentStreaming() already uses 10% overlap

Search (src/core/search.ts)

  • Sequential fallback: vector → FTS5 → LIKE (no hybrid fusion)
  • FTS5 query uses OR logic ("React" OR "hooks") — extremely noisy for multi-word queries
  • Title never factors into ranking
  • Count query re-executes the full search wrapped in SELECT COUNT(*)
  • No retrieval quality tests exist — all tests are purely functional

Proposed Improvements

Improvement 0: Retrieval Quality Benchmark & Regression Gate — P0 (FIRST)

Create a curated test corpus (~10-15 realistic docs across overlapping topics), ~20 test queries with ground-truth expected results, and metric functions (Recall@k, MRR). Establish baseline quality metrics for the current FTS5 implementation, then use those baselines as CI regression gates — any future change that degrades retrieval quality fails CI. Thresholds ratchet upward as improvements land.

Components:

  • tests/fixtures/benchmark-corpus.ts — curated test documents
  • tests/fixtures/benchmark-queries.ts — ground-truth query set
  • tests/fixtures/benchmark-metrics.ts — Recall@k, MRR metric functions
  • tests/benchmark/retrieval-quality.test.ts — benchmark test suite

Constraints: Runs against FTS5 only (test DB has no sqlite-vec). Uses MockEmbeddingProvider. This is acceptable — FTS5 is the production fallback path.
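
As a rough sketch of what tests/fixtures/benchmark-metrics.ts could contain, the two metrics can be written as pure functions over ranked ID lists. The function names and the string-ID representation are assumptions for illustration, not existing LibScope code.

```typescript
/** Fraction of the relevant documents that appear in the top k results. */
function recallAtK(rankedIds: string[], relevantIds: string[], k: number): number {
  if (relevantIds.length === 0) return 0;
  const topK = new Set(rankedIds.slice(0, k));
  const hits = relevantIds.filter(id => topK.has(id)).length;
  return hits / relevantIds.length;
}

/** 1 / rank of the first relevant result (1-based), or 0 if none is returned. */
function reciprocalRank(rankedIds: string[], relevantIds: string[]): number {
  const relevant = new Set(relevantIds);
  const index = rankedIds.findIndex(id => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

/** Mean reciprocal rank across all benchmark queries. */
function meanReciprocalRank(
  runs: Array<{ rankedIds: string[]; relevantIds: string[] }>,
): number {
  if (runs.length === 0) return 0;
  const total = runs.reduce(
    (sum, run) => sum + reciprocalRank(run.rankedIds, run.relevantIds),
    0,
  );
  return total / runs.length;
}
```

A regression gate would then compare these values against stored baselines and fail the test when a metric drops below its threshold.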


Improvement 1: Metadata Embedding — P0

Prepend structured document metadata (title, library, version, topic) to each chunk before computing its embedding. Store original chunk content without prefix in the DB.

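```typescript
// Build the prefix from whichever metadata fields are present.
// input.version and input.topic are assumed to mirror input.library.
const metadataPrefix = [
  `Title: ${input.title}`,
  input.library ? `Library: ${input.library}` : null,
  input.version ? `Version: ${input.version}` : null,
  input.topic ? `Topic: ${input.topic}` : null,
].filter(Boolean).join("\n") + "\n\n";

// Embed the prefixed text; the DB still stores the original chunk content.
const chunksForEmbedding = chunks.map(chunk => metadataPrefix + chunk);
const embeddings = await provider.embedBatch(chunksForEmbedding);
```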

Impact: Fixes the biggest quality hole — metadata-based queries can't match semantically today.


Improvement 2: Chunk Overlap — P0

Add configurable overlap (~150 chars, ~10% of maxChunkSize) between consecutive chunks. Tail of chunk N is prepended to chunk N+1. Align chunkContentStreaming() to use the same configurable overlap.

Impact: Overlap is common practice in RAG pipelines. It should improve recall for queries whose answer spans a chunk boundary.
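
A minimal sketch of the overlap step, assuming chunks are plain strings produced before overlap is applied. The function name and the overlapChars parameter are hypothetical.

```typescript
/**
 * Prepend the tail of chunk N to chunk N+1.
 * `overlapChars` is a hypothetical option name; ~150 matches the proposal.
 */
function addOverlap(chunks: string[], overlapChars = 150): string[] {
  if (overlapChars <= 0) return [...chunks];
  return chunks.map((chunk, i) => {
    if (i === 0) return chunk;
    // Take the tail from the ORIGINAL previous chunk so overlap never compounds.
    const previous = chunks[i - 1] ?? "";
    return previous.slice(-overlapChars) + chunk;
  });
}
```

Note that in this sketch an overlapped chunk can exceed maxChunkSize by up to overlapChars, so the size cap would need to account for that.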


Improvement 3: Replace HTML Comment Breadcrumbs — P1

Replace <!-- context: H1 > H2 --> with plain text prefix H1 > H2\n. HTML comments waste embedding token budget and dilute semantic signal.
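
One way this could look, as a sketch: a formatter for new chunks plus a converter for chunks that already carry the comment form. Both function names are hypothetical, and the regex assumes the comment sits at the very start of the chunk.

```typescript
/** Format a heading path as a plain-text breadcrumb line, e.g. "Guide > Setup\n". */
function formatBreadcrumb(headings: string[]): string {
  return headings.length === 0 ? "" : headings.join(" > ") + "\n";
}

// Matches a leading breadcrumb comment such as "<!-- context: Guide > Setup -->".
const BREADCRUMB_COMMENT = /^<!--\s*context:\s*(.*?)\s*-->\n?/;

/** Rewrite a leading HTML-comment breadcrumb into the plain-text form. */
function toPlainBreadcrumb(chunk: string): string {
  return chunk.replace(BREADCRUMB_COMMENT, (_match, path: string) => `${path}\n`);
}
```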


Improvement 4: Title Boosting — P1

Boost result score when query terms appear in document title (case-insensitive, configurable factor ~1.5x). Include in scoreExplanation.boostFactors. Trivial to implement (~30 lines).
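
A sketch of the boost, assuming higher scores rank first and that any single query term in the title is enough to trigger it. The ScoredResult shape is a simplified stand-in for the real result type, with boostFactors playing the role of scoreExplanation.boostFactors.

```typescript
// Simplified, hypothetical result shape for illustration only.
interface ScoredResult {
  title: string;
  score: number;
  boostFactors: string[];
}

/** Multiply the score by `factor` when any query term occurs in the title. */
function applyTitleBoost(result: ScoredResult, query: string, factor = 1.5): ScoredResult {
  const title = result.title.toLowerCase();
  const terms = query.toLowerCase().split(/\s+/).filter(term => term.length > 0);
  const matched = terms.some(term => title.includes(term));
  if (!matched) return result;
  return {
    ...result,
    score: result.score * factor,
    boostFactors: [...result.boostFactors, `title_match:${factor}`],
  };
}
```

Substring matching is the simplest choice here; matching on whole words would avoid boosting "react" for a title containing "reactive".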


Improvement 5: FTS5 AND-by-Default — P1

Change FTS5 query from "w1" OR "w2" to "w1" "w2" (implicit AND). Support quoted phrase search. No auto-OR fallback: return zero results rather than silently switching to OR, which would make results hard to predict.
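
A possible query builder, sketched under the assumption that user input arrives as a single raw string. The function name is hypothetical. Each term or quoted phrase is emitted as an FTS5 string, and FTS5 treats whitespace-separated phrases as an implicit AND.

```typescript
/** Turn raw user input into an FTS5 MATCH expression with implicit AND. */
function buildFtsQuery(userQuery: string): string {
  const terms: string[] = [];
  // Capture either a "quoted phrase" or a bare run of non-space characters.
  const tokenPattern = /"([^"]*)"|(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = tokenPattern.exec(userQuery)) !== null) {
    // Drop stray quotes so the emitted FTS5 string is always well-formed.
    const raw = (match[1] ?? match[2] ?? "").replace(/"/g, "").trim();
    if (raw.length > 0) terms.push(`"${raw}"`);
  }
  return terms.join(" ");
}
```

An empty return value means there is nothing to search for, so the caller should skip the MATCH query rather than pass an empty expression to FTS5.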


Improvement 6: Hybrid Search (RRF) — P0-deferred

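Run vector + FTS5 search, then merge via Reciprocal Rank Fusion: each document's score is Σ 1/(60 + rank), summed over the result lists in which it appears. Add searchMode option: hybrid/vector/keyword. Fall back gracefully when one search path is unavailable. Deferred until the foundation is solid.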

Testing strategy: RRF fusion function is pure (takes two ranked lists → merged list) — fully testable without sqlite-vec. Unit tests mock vector path.
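
The fusion step described above can be sketched as a pure function over ranked ID lists. The function name and return shape are assumptions, and the sketch assumes each ID appears at most once per list.

```typescript
/** Merge ranked lists with Reciprocal Rank Fusion: score = Σ 1 / (k + rank). */
function reciprocalRankFusion(
  rankedLists: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by ID so output is deterministic.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score || a.id.localeCompare(b.id),
  );
}
```

Because the function only sees IDs and ranks, unit tests can feed it hand-written lists in place of real vector and FTS5 results.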


Improvement 7: Lazy Count — P2

Make totalCount optional, add hasMore: boolean. Skip expensive double count query by default. Opt-in via countMode: 'exact'.
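
One common way to get hasMore without a count query is to fetch one extra row. The sketch below assumes that approach. fetchRows and countAll are hypothetical callbacks standing in for the real queries, and the "none" mode name is an assumption (only 'exact' appears in the proposal).

```typescript
interface Page<T> {
  results: T[];
  hasMore: boolean;
  totalCount?: number; // only set when countMode is "exact"
}

/** Fetch limit + 1 rows to detect a next page; count only on request. */
function paginate<T>(
  fetchRows: (limit: number, offset: number) => T[],
  countAll: () => number,
  limit: number,
  offset: number,
  countMode: "none" | "exact" = "none",
): Page<T> {
  const rows = fetchRows(limit + 1, offset);
  const page: Page<T> = {
    results: rows.slice(0, limit),
    hasMore: rows.length > limit,
  };
  if (countMode === "exact") page.totalCount = countAll();
  return page;
}
```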


Improvement 8: Paragraph-Boundary Splitting — P3

When maxChunkSize is hit, scan backward for \n\n within last 200 chars. If found, split there. No code block tracking, no strategy enum. 80% of benefit for 20% of complexity.
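
The backward scan could look roughly like this. The sketch works on character offsets rather than lines for simplicity, and the function name and lookback parameter are hypothetical.

```typescript
/**
 * Return the offset at which to cut `text`: just after the last "\n\n" that
 * falls within the final `lookback` chars of the size limit, else the hard cap.
 */
function findSplitPoint(text: string, maxChunkSize = 1500, lookback = 200): number {
  if (text.length <= maxChunkSize) return text.length;
  const windowStart = Math.max(0, maxChunkSize - lookback);
  // Search backward so the whole "\n\n" ends at or before maxChunkSize.
  const breakAt = text.lastIndexOf("\n\n", maxChunkSize - 2);
  return breakAt >= windowStart ? breakAt + 2 : maxChunkSize;
}
```

The caller would emit text.slice(0, splitPoint) as one chunk and continue from text.slice(splitPoint).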


Implementation Priority & Order

| # | Improvement | Impact | Effort | Priority |
|---|---|---|---|---|
| 0 | Quality Benchmark & Regression Gate | 🔴 High | Medium | P0 (FIRST) |
| 1 | Metadata Embedding | 🔴 High | Low | P0 |
| 2 | Chunk Overlap | 🔴 High | Low | P0 |
| 3 | Text Breadcrumbs | 🟠 Medium | Low | P1 |
| 4 | Title Boosting | 🟠 Medium | Low | P1 |
| 5 | FTS5 AND-by-Default | 🟠 Medium | Low | P1 |
| 6 | Hybrid Search (RRF) | 🔴 High | Medium | P0-deferred |
| 7 | Lazy Count | 🟢 Low | Medium | P2 |
| 8 | Paragraph Splitting | 🟢 Low | Low | P3 |

Order: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8

Notes

  • Quality-first: Benchmark (0) is implemented first so every subsequent improvement is objectively measurable. Thresholds ratchet upward after each improvement.
  • Re-indexing: Chunking improvements (1-3, 8) require existing docs to be re-indexed. Consider a chunking_version field for selective re-indexing.
  • API compatibility: Improvement 7 changes SearchResponse shape — use feature flags / make totalCount optional.
  • Testing: Hybrid search RRF fusion logic should be a pure function, testable without sqlite-vec.

Labels: enhancement