Skip to content

Conversation

enitrat
Copy link
Collaborator

@enitrat enitrat commented Jul 27, 2025

PR Summary: Replace Manual Markdown Chunking with RecursiveMarkdownSplitter

Overview

This PR refactors the markdown chunking logic in the Cairo Book and Core Library ingesters to use a new RecursiveMarkdownSplitter utility. The previous implementation relied on a simple, manual H1-header-based splitting approach, which had limitations (e.g., no support for H2 headers, potential issues with code blocks, lack of overlap, and no handling for small chunks). The new splitter provides a more robust, recursive strategy that respects header hierarchies, preserves code blocks, enforces chunk size limits with merging, and adds configurable overlap for better context retention in downstream processes (e.g., vector stores or LLMs).

This change improves chunk quality for ingestion into vector stores, reduces the risk of splitting code blocks or headers incorrectly, and makes the chunking logic reusable and testable. No breaking changes to the ingester interfaces or outputs—chunks are still produced as Document<BookChunk>[], but with potentially better granularity and metadata.

Motivation

  • Limitations of old approach: The manual H1 splitting could mishandle code blocks (e.g., treating fenced headers as real ones), didn't support multi-level headers, lacked overlap (leading to context loss between chunks), and had no min/max size enforcement with merging.
  • Benefits of new approach:
    • Recursive splitting starts with headers, falls back to paragraphs/lines if needed.
    • Preserves code blocks intact.
    • Configurable options (e.g., max/min chars, overlap, header levels) for flexibility.
    • Adds metadata like headerPath, uniqueId, and character offsets for better traceability.
    • Tested for edge cases like unclosed code blocks, tiny chunks, and reconstruction (ensuring original content can be rebuilt from chunks).
  • Future-proofing: This utility can be reused across other ingesters or markdown-based sources, and it's easier to maintain/extend.

Key Changes

  • New Utility: RecursiveMarkdownSplitter.ts

    • Implements a class-based splitter with options for max/min chars, overlap, header levels, code block preservation, ID prefix, and trimming.
    • Tokenizes markdown to extract real headers (ignoring those in code blocks) and code fences.
    • Recursively splits segments by headers > paragraphs > lines.
    • Merges small segments to meet minChars, with flexibility for final chunks (up to 1.5x maxChars to avoid tiny orphans).
    • Applies backward overlap, adjusting to avoid mid-code-block starts.
    • Attaches metadata: title (from last relevant header), chunk number (per title), unique ID, char offsets, and full header path.
    • Exports a convenience function splitMarkdownToChunks.
    • ~750 LOC, including internal interfaces and methods.
  • Ingesters Refactored:

    • CairoBookIngester.ts and CoreLibDocsIngester.ts:
      • Removed manual H1 parsing logic (line-by-line code block tracking, regex matching, section extraction).
      • Replaced with RecursiveMarkdownSplitter instantiation and usage in chunkSummaryFile / chunkCorelibSummaryFile.
      • Configured splitter with: maxChars=2048, minChars=500, overlap=256, headerLevels=[1,2], preserveCodeBlocks=true, idPrefix=('cairo-book' or 'corelib'), trim=true.
      • Updated chunk mapping to use splitter's metadata (e.g., chunk.meta.title, chunk.meta.chunkNumber).
      • Kept calculateHash for content hashing.
      • Updated logging to reflect new splitter usage.
  • Tests Added:

    • __tests__/RecursiveMarkdownSplitter.test.ts: Covers basic functionality, header/code block handling, overlap, metadata, splitting strategies, and edge cases (e.g., empty input, unclosed blocks).
    • __tests__/RecursiveMarkdownSplitter.finalChunk.test.ts: Focuses on deterministic handling of tiny final chunks and merging.
    • __tests__/RecursiveMarkdownSplitter.minChars.test.ts: Verifies minChars merging, code block respect, and specific formatting examples.
    • __tests__/RecursiveMarkdownSplitter.reconstruction.test.ts: Ensures chunks (minus overlaps) reconstruct the original markdown exactly, testing headers, code blocks, and complex docs.
  • Other:

    • Imported calculateHash from contentUtils (previously inline).
    • No changes to dependencies or build configs.

Impact

  • Performance: The new splitter is more computationally intensive (tokenization + recursion), but for typical book/docs sizes (~10-100KB), it's negligible. Tested with large code blocks and nested headers.
  • Output Changes: Chunks may be more numerous/finer-grained due to H2 support and merging. Metadata is richer (e.g., headerPath array). Existing consumers (e.g., VectorStore) should be unaffected as the Document<BookChunk> shape is preserved.
  • Backward Compatibility: Old chunks hashed differently? No—hashes are still content-based. If re-ingesting, expect updated chunks.
  • Risks: If misconfigured (e.g., low maxChars with large code blocks), could log warnings for oversized chunks. Mitigated by tests.
  • Metrics: In tests, reconstruction is 100% accurate; code blocks remain intact; tiny chunks are merged appropriately.

Testing

  • Unit Tests: 100% coverage for new splitter (basic, edges, reconstruction). Run with npm test.
  • Manual: Ingest a sample Cairo Book markdown file and verify chunks in logs (e.g., number of chunks, content integrity).
  • Integration: Run full ingester pipeline and check vector store entries for completeness.

Next Steps / Open Questions

  • Consider exposing splitter options via config/env for tuning per-ingester.
  • If needed, add benchmarks for large docs (>1MB).
  • Review for any missed edge cases in markdown variants (e.g., ATX vs Setext headers—currently only ATX supported).

This PR improves maintainability and chunk quality—ready for review!

@enitrat enitrat changed the base branch from main to feat/migrate-dspy July 27, 2025 11:39
@enitrat enitrat force-pushed the feat/better-markdown-splitter branch from 8791eb4 to 442e80f Compare July 30, 2025 15:36
@enitrat enitrat changed the base branch from feat/migrate-dspy to main July 30, 2025 15:36
@enitrat enitrat force-pushed the feat/better-markdown-splitter branch 2 times, most recently from bb95bab to eea3aa7 Compare July 30, 2025 22:30
@enitrat enitrat force-pushed the feat/better-markdown-splitter branch from eea3aa7 to bfc4058 Compare July 31, 2025 13:58
@enitrat enitrat merged commit 696f312 into main Jul 31, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant