1.11.0

github-actions released this 25 Jun 23:40
· 7 commits to main since this release

v1.11.0: whole-note structure-aware chunking (foundational embedding overhaul)

Notes are no longer truncated to 1500 chars and chunked by blind sentence windows.
The whole note is now split at LOGICAL boundaries into coherent idea-chunks, so a note
is represented by an overall embedding PLUS a set of section-level chunks that other
notes and queries can align with section by section.
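
To make the idea of logical boundaries concrete, here is a minimal TypeScript sketch of heading-based splitting. It is an illustration under stated assumptions, not the plugin's actual splitIntoSections(): the `Section` shape, the `splitIntoSectionsSketch` name, and the regexes are hypothetical, and only plain ATX headings, backtick or tilde fences, and `---` frontmatter are handled.

```typescript
// Hypothetical shape for illustration; the real section type is not shown in these notes.
interface Section {
  breadcrumb: string[]; // heading trail, e.g. ["Setup", "Install"]
  paragraphs: string[]; // blank-line separated blocks inside the section
}

function splitIntoSectionsSketch(raw: string): Section[] {
  const lines = raw.split(/\r?\n/);

  // Skip YAML frontmatter delimited by '---' lines at the very top of the note.
  let start = 0;
  if (lines[0].trim() === "---") {
    const close = lines.findIndex((l, idx) => idx > 0 && l.trim() === "---");
    if (close > 0) start = close + 1;
  }

  const sections: Section[] = [];
  const trail: { level: number; title: string }[] = [];
  let paragraphs: string[] = [];
  let para: string[] = [];
  let inFence = false;

  const endParagraph = (): void => {
    if (para.length > 0) {
      paragraphs.push(para.join("\n"));
      para = [];
    }
  };
  const endSection = (): void => {
    endParagraph();
    if (paragraphs.length > 0) {
      sections.push({ breadcrumb: trail.map((h) => h.title), paragraphs });
      paragraphs = [];
    }
  };

  for (const line of lines.slice(start)) {
    // A fence line toggles code mode; '#' lines inside a fence are not headings.
    if (/^\s{0,3}(`{3,}|~{3,})/.test(line)) {
      inFence = !inFence;
      para.push(line);
      continue;
    }
    if (inFence) {
      para.push(line);
      continue;
    }
    const m = /^(#{1,6})\s+(.+?)\s*$/.exec(line);
    if (m) {
      endSection(); // a heading is the primary boundary
      const level = m[1].length;
      // Drop headings at the same or a deeper level before pushing the new one.
      while (trail.length > 0 && trail[trail.length - 1].level >= level) trail.pop();
      trail.push({ level, title: m[2] });
    } else if (line.trim() === "") {
      endParagraph(); // a blank line is the secondary boundary
    } else {
      para.push(line);
    }
  }
  endSection();
  return sections;
}

// Usage: the fence marker is built at runtime so this example contains no literal fence.
const FENCE = "`".repeat(3);
const note = [
  "---", "tags: demo", "---",
  "# Setup", "Intro paragraph.", "",
  "## Install", FENCE + "sh", "# not a heading, just a shell comment", FENCE,
  "# Usage", "Run it.",
].join("\n");
const sections = splitIntoSectionsSketch(note);
console.log(sections.map((s) => s.breadcrumb.join(" > ")));
```

Keeping the breadcrumb on each section is what would later allow a chunk to be embedded with its heading context while the raw paragraph text stays untouched for snippets.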

  • splitIntoSections(): parse the raw note into sections at ATX headings (code-fence +
    frontmatter aware), carrying a heading breadcrumb; headings are the primary idea
    boundary, paragraphs the secondary one. A window never crosses a section or
    paragraph boundary.
  • splitToBudget(): hard char guard (MAX_CHUNK_CHARS=480 ≈ 120 tokens) splitting any
    window at sentence then whitespace boundaries so the model never silently truncates
    a chunk (EN + DE). TARGET_WORDS 60->80.
  • Whole-note coverage: removed embedCharLimit truncation entirely; chunk-count cap
    16->48 (adaptive tiers raised); when a note exceeds the cap, every section's first
    window is still kept.
  • Heading context (new setting, default on): the first chunk of each section embeds
    with a "Note > H1 > H2:" breadcrumb prefix (embed input only; raw text kept for
    snippets), the LLM-free contextual-retrieval trick, scoped to avoid embedding
    collapse. The window is clamped so prefix+window stays within the token budget.
  • INDEX_VERSION 4->5: one-time full re-embed on update. meanVector and biMax are
    unchanged (they only receive better inputs). Both build() and embedFile() embed via
    a shared chunkEmbedInput() helper so the full and incremental paths can't diverge.
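
The budget guard and the breadcrumb prefix from the list above can be sketched roughly as follows. This is a hedged TypeScript illustration, not the shipped code: `splitToBudgetSketch` and `chunkEmbedInputSketch` are hypothetical names and signatures, the character budget stands in for the token budget, and whatever EN + DE handling the release refers to is not reproduced. Only MAX_CHUNK_CHARS=480 and the "Note > H1 > H2:" prefix format come from the notes.

```typescript
const MAX_CHUNK_CHARS = 480; // value from the release notes (≈ 120 tokens)

// Index just past the last sentence terminator that is followed by whitespace
// and lies within the first `max` characters, or -1 if there is none.
function lastSentenceEnd(s: string, max: number): number {
  for (let i = Math.min(max, s.length) - 1; i >= 0; i--) {
    if (".!?".includes(s.charAt(i)) && /\s/.test(s.charAt(i + 1))) return i + 1;
  }
  return -1;
}

// Index of the last whitespace character at position <= max, or -1 if there is none.
function lastWhitespace(s: string, max: number): number {
  for (let i = Math.min(max, s.length - 1); i >= 1; i--) {
    if (/\s/.test(s.charAt(i))) return i;
  }
  return -1;
}

// Split text into pieces of at most `max` characters: prefer a sentence end,
// then whitespace, and hard-cut only when a single token exceeds the budget.
function splitToBudgetSketch(text: string, max: number = MAX_CHUNK_CHARS): string[] {
  const out: string[] = [];
  let rest = text.trim();
  while (rest.length > max) {
    let cut = lastSentenceEnd(rest, max);
    if (cut <= 0) cut = lastWhitespace(rest, max);
    if (cut <= 0) cut = max;
    out.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) out.push(rest);
  return out;
}

// Build the embed input for one chunk. Only a section's first chunk gets the
// breadcrumb prefix, and the window is clamped so prefix + window fits the budget.
function chunkEmbedInputSketch(
  breadcrumb: string[],
  windowText: string,
  isFirstOfSection: boolean,
  max: number = MAX_CHUNK_CHARS,
): string {
  if (!isFirstOfSection || breadcrumb.length === 0) return windowText.slice(0, max);
  const prefix = breadcrumb.join(" > ") + ": ";
  const room = Math.max(0, max - prefix.length);
  return (prefix + windowText.slice(0, room)).slice(0, max);
}

// Usage: a small budget makes the splitting behaviour easy to see.
const pieces = splitToBudgetSketch(
  "Alpha beta gamma. Delta epsilon zeta eta. Theta iota kappa lambda mu nu xi omicron",
  30,
);
console.log(pieces);
console.log(chunkEmbedInputSketch(["Note", "H1", "H2"], pieces[0], true));
```

Routing both the full and the incremental path through one function like the second helper is the kind of arrangement the last bullet describes: there is only one place where the prefix and the clamp are applied.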

Verified by a research-backed design pass + an adversarial review (fixed a high-sev
incremental-path regression and a prefix-budget truncation before shipping).