Description
Problem Statement
LibScope's current chunking and search pipelines have several limitations that reduce retrieval quality, which is the core value proposition of a knowledge base. This issue tracks a set of targeted, incremental improvements informed by a codebase audit and an independent technical review.
Current State
Chunking (src/core/indexing.ts)
- `chunkContent()` splits on markdown h1-h3 headings, hard-caps at 1500 chars
- Injects `<!-- context: H1 > H2 -->` breadcrumbs as HTML comments (noisy for embeddings)
- When `maxChunkSize` is hit, cuts at the current line, with no paragraph awareness
- Chunk embeddings contain only `chunk.content`: document title, library, version, and topic are never embedded (biggest quality hole)
- No inter-chunk overlap (while `chunkContentStreaming()` has 10% overlap)
Search (src/core/search.ts)
- Sequential fallback: vector → FTS5 → LIKE (no hybrid fusion)
- FTS5 query uses OR logic (`"React" OR "hooks"`), which is extremely noisy for multi-word queries
- Title never factors into ranking
- Count query re-executes the full search wrapped in `SELECT COUNT(*)`
- No retrieval quality tests exist; all tests are purely functional
Proposed Improvements
Improvement 0: Retrieval Quality Benchmark & Regression Gate — P0 (FIRST)
Create a curated test corpus (~10-15 realistic docs across overlapping topics), ~20 test queries with ground-truth expected results, and metric functions (Recall@k, MRR). Establish baseline quality metrics for the current FTS5 implementation, then use those baselines as CI regression gates — any future change that degrades retrieval quality fails CI. Thresholds ratchet upward as improvements land.
Components:
- `tests/fixtures/benchmark-corpus.ts`: curated test documents
- `tests/fixtures/benchmark-queries.ts`: ground-truth query set
- `tests/fixtures/benchmark-metrics.ts`: Recall@k, MRR metric functions
- `tests/benchmark/retrieval-quality.test.ts`: benchmark test suite
Constraints: Runs against FTS5 only (test DB has no sqlite-vec). Uses MockEmbeddingProvider. This is acceptable — FTS5 is the production fallback path.
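As a rough sketch of what the metric functions could look like (the function names and the "ranked list of document IDs plus a set of relevant IDs" shape are assumptions for illustration, not the existing API):

```typescript
// Recall@k: fraction of the relevant documents that appear in the top k results.
function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  // De-duplicate so a document returned twice is only counted once.
  const hits = new Set(ranked.slice(0, k).filter((id) => relevant.has(id)));
  return hits.size / relevant.size;
}

// Reciprocal rank: 1 / (1-based position of the first relevant result), or 0 if none.
function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const index = ranked.findIndex((id) => relevant.has(id));
  return index === -1 ? 0 : 1 / (index + 1);
}

// MRR: mean of the reciprocal ranks over all benchmark queries.
function meanReciprocalRank(
  runs: Array<{ ranked: string[]; relevant: Set<string> }>,
): number {
  if (runs.length === 0) return 0;
  const total = runs.reduce(
    (sum, run) => sum + reciprocalRank(run.ranked, run.relevant),
    0,
  );
  return total / runs.length;
}
```

A regression gate would then assert that each metric stays at or above its recorded baseline.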
Improvement 1: Metadata Embedding — P0
Prepend structured document metadata (title, library, version, topic) to each chunk before computing its embedding. Store original chunk content without prefix in the DB.
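```ts
// Build the metadata prefix once per document; optional fields are skipped when absent.
// Version and topic would be appended the same way.
const metadataPrefix = [
  `Title: ${input.title}`,
  input.library ? `Library: ${input.library}` : null,
].filter(Boolean).join("\n") + "\n\n";

// The prefix is used only for the embedding input; the stored chunk content stays unprefixed.
const chunksForEmbedding = chunks.map(chunk => metadataPrefix + chunk);
const embeddings = await provider.embedBatch(chunksForEmbedding);
```

Impact: Fixes the biggest quality hole. Today, queries that refer to a document's title, library, version, or topic cannot match semantically because that metadata is never embedded.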
Improvement 2: Chunk Overlap — P0
Add configurable overlap (~150 chars, ~10% of maxChunkSize) between consecutive chunks. Tail of chunk N is prepended to chunk N+1. Align chunkContentStreaming() to use the same configurable overlap.
Impact: Standard practice in RAG pipelines. Expected to improve recall for queries whose answer spans a chunk boundary.
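A minimal sketch of the overlap step, assuming chunks are plain strings at this point (`addOverlap` and its `overlapChars` parameter are hypothetical names):

```typescript
// Prepend the tail of chunk N to chunk N+1.
// Tails are taken from the original chunks, so overlap does not compound.
function addOverlap(chunks: string[], overlapChars = 150): string[] {
  if (overlapChars <= 0) return [...chunks];
  return chunks.map((chunk, i) => {
    if (i === 0) return chunk; // the first chunk has no predecessor
    const previous = chunks[i - 1] ?? "";
    // slice(-n) returns the whole string when it is shorter than n.
    return previous.slice(-overlapChars) + chunk;
  });
}
```

Note that with this approach a chunk can exceed `maxChunkSize` by up to `overlapChars`; whether the cap should include the overlap is a design decision left open here.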
Improvement 3: Replace HTML Comment Breadcrumbs — P1
Replace <!-- context: H1 > H2 --> with plain text prefix H1 > H2\n. HTML comments waste embedding token budget and dilute semantic signal.
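For illustration, the new breadcrumb format and a one-off conversion of existing chunk text might look like this (both helper names are hypothetical, and the regex assumes the comment sits on its own line as in the example above):

```typescript
// Format a heading path as a plain-text breadcrumb line, e.g. "Guide > Hooks\n".
function formatBreadcrumb(headings: string[]): string {
  return headings.length === 0 ? "" : headings.join(" > ") + "\n";
}

// Matches "<!-- context: H1 > H2 -->" plus an optional trailing newline.
const BREADCRUMB_COMMENT = /<!--\s*context:\s*(.*?)\s*-->\n?/g;

// Rewrite legacy HTML-comment breadcrumbs into the plain-text form.
function toPlainBreadcrumb(chunk: string): string {
  return chunk.replace(BREADCRUMB_COMMENT, (_match, path: string) => `${path}\n`);
}
```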
Improvement 4: Title Boosting — P1
Boost result score when query terms appear in document title (case-insensitive, configurable factor ~1.5x). Include in scoreExplanation.boostFactors. Trivial to implement (~30 lines).
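One possible shape for the boost, assuming higher scores rank first and that `scoreExplanation.boostFactors` is a list of strings (the `TitleBoostable` interface and the label format are assumptions):

```typescript
// Hypothetical minimal result shape; the real search result type may differ.
interface TitleBoostable {
  title: string;
  score: number;
  scoreExplanation: { boostFactors: string[] };
}

// Multiply the score by `factor` when any query term appears in the title.
// Matching is case-insensitive substring matching, which is a simplification.
function applyTitleBoost(
  result: TitleBoostable,
  query: string,
  factor = 1.5,
): TitleBoostable {
  const title = result.title.toLowerCase();
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  if (!terms.some((term) => title.includes(term))) return result;
  return {
    ...result,
    score: result.score * factor,
    scoreExplanation: {
      boostFactors: [...result.scoreExplanation.boostFactors, `title_match:x${factor}`],
    },
  };
}
```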
Improvement 5: FTS5 AND-by-Default — P1
Change the FTS5 query from "w1" OR "w2" to "w1" "w2" (implicit AND). Support quoted phrase search. No automatic OR fallback: return zero results rather than introduce confusing, non-deterministic behavior.
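A sketch of a query builder along these lines (`buildFtsQuery` is a hypothetical name; stray double quotes are simply dropped here rather than escaped, which is a simplification):

```typescript
// Turn user input into an FTS5 MATCH expression where every term is required.
// Quoted phrases in the input are kept together as a single phrase.
function buildFtsQuery(input: string): string {
  const terms: string[] = [];
  // Either a "quoted phrase" or a run of non-whitespace characters.
  const pattern = /"([^"]+)"|(\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(input)) !== null) {
    const raw = match[1] ?? match[2] ?? "";
    const cleaned = raw.replace(/"/g, "").trim();
    if (cleaned.length > 0) terms.push(`"${cleaned}"`);
  }
  // Whitespace between quoted strings is treated by FTS5 as an implicit AND.
  return terms.join(" ");
}
```

An empty return value signals that there is nothing to search for, so the caller can return zero results without running a query.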
Improvement 6: Hybrid Search (RRF) — P0-deferred
Run vector + FTS5 search, merge via Reciprocal Rank Fusion (score = Σ 1/(60 + rank)). Add searchMode option: hybrid/vector/keyword. Fall back gracefully to keyword search when the vector path is unavailable. Deferred until foundation is solid.
Testing strategy: RRF fusion function is pure (takes two ranked lists → merged list) — fully testable without sqlite-vec. Unit tests mock vector path.
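The fusion step could be written as a pure function roughly like the following (the function name and the "lists of document IDs" input shape are assumptions; ranks are taken to be 1-based):

```typescript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) for every ID it contains.
function reciprocalRankFusion(
  rankedLists: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1; // 1-based rank within this list
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  const fused: Array<{ id: string; score: number }> = [];
  scores.forEach((score, id) => fused.push({ id, score }));
  // Highest fused score first; ties are broken by ID so the output is deterministic.
  return fused.sort((a, b) => b.score - a.score || (a.id < b.id ? -1 : 1));
}
```

Because it only takes ranked ID lists, a unit test can feed it hand-written vector and keyword rankings without touching sqlite-vec.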
Improvement 7: Lazy Count — P2
Make totalCount optional, add hasMore: boolean. Skip expensive double count query by default. Opt-in via countMode: 'exact'.
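One way to get `hasMore` without a count query is to fetch one extra row. The sketch below assumes that approach; `paginate`, `fetchRows`, `countAll`, and the `"none"` mode name are hypothetical, and only `countMode: 'exact'` comes from the proposal:

```typescript
type CountMode = "none" | "exact";

interface SearchPage<T> {
  results: T[];
  hasMore: boolean;
  totalCount?: number; // only populated when countMode is "exact"
}

function paginate<T>(
  fetchRows: (limit: number, offset: number) => T[],
  countAll: () => number,
  options: { limit: number; offset?: number; countMode?: CountMode },
): SearchPage<T> {
  const offset = options.offset ?? 0;
  // Ask for one row more than requested; its presence tells us another page exists.
  const rows = fetchRows(options.limit + 1, offset);
  const page: SearchPage<T> = {
    results: rows.slice(0, options.limit),
    hasMore: rows.length > options.limit,
  };
  // The expensive count only runs when the caller opts in.
  if (options.countMode === "exact") page.totalCount = countAll();
  return page;
}
```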
Improvement 8: Paragraph-Boundary Splitting — P3
When maxChunkSize is hit, scan backward for \n\n within last 200 chars. If found, split there. No code block tracking, no strategy enum. 80% of benefit for 20% of complexity.
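A simplified, character-based sketch of the backward scan (the current implementation works line by line, so this is an illustration of the rule rather than a drop-in change; both function names are hypothetical):

```typescript
// Return the index at which to cut `text` so the first piece fits in maxChunkSize.
// Prefers the last blank line ("\n\n") within the final `lookback` characters.
function findSplitPoint(text: string, maxChunkSize: number, lookback = 200): number {
  if (text.length <= maxChunkSize) return text.length;
  const windowStart = Math.max(0, maxChunkSize - lookback);
  // Start at maxChunkSize - 2 so the whole "\n\n" lies inside the size limit.
  const boundary = text.lastIndexOf("\n\n", maxChunkSize - 2);
  if (boundary > 0 && boundary >= windowStart) return boundary + 2;
  return maxChunkSize; // no paragraph break nearby: hard cut
}

function splitAtParagraphs(text: string, maxChunkSize: number, lookback = 200): string[] {
  if (maxChunkSize <= 0) throw new Error("maxChunkSize must be positive");
  const chunks: string[] = [];
  let rest = text;
  while (rest.length > 0) {
    const cut = findSplitPoint(rest, maxChunkSize, lookback);
    chunks.push(rest.slice(0, cut));
    rest = rest.slice(cut);
  }
  return chunks;
}
```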
Implementation Priority & Order
| # | Improvement | Impact | Effort | Priority |
|---|---|---|---|---|
| 0 | Quality Benchmark & Regression Gate | 🔴 High | Medium | P0 — FIRST |
| 1 | Metadata Embedding | 🔴 High | Low | P0 |
| 2 | Chunk Overlap | 🔴 High | Low | P0 |
| 3 | Text Breadcrumbs | 🟠 Medium | Low | P1 |
| 4 | Title Boosting | 🟠 Medium | Low | P1 |
| 5 | FTS5 AND-by-Default | 🟠 Medium | Low | P1 |
| 6 | Hybrid Search (RRF) | 🔴 High | Medium | P0-deferred |
| 7 | Lazy Count | 🟢 Low | Medium | P2 |
| 8 | Paragraph Splitting | 🟢 Low | Low | P3 |
Order: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7 → 8
Notes
- Quality-first: Benchmark (0) is implemented first so every subsequent improvement is objectively measurable. Thresholds ratchet upward after each improvement.
- Re-indexing: Chunking improvements (1-3, 8) require existing docs to be re-indexed. Consider a `chunking_version` field for selective re-indexing.
- API compatibility: Improvement 7 changes the `SearchResponse` shape; use feature flags / make `totalCount` optional.
- Testing: Hybrid search RRF fusion logic should be a pure function, testable without sqlite-vec.