feat(ingester): better markdown splitter #33

enitrat · 2025-07-27T11:39:16Z

PR Summary: Replace Manual Markdown Chunking with RecursiveMarkdownSplitter

Overview

This PR refactors the markdown chunking logic in the Cairo Book and Core Library ingesters to use a new RecursiveMarkdownSplitter utility. The previous implementation relied on a simple, manual H1-header-based splitting approach, which had limitations (e.g., no support for H2 headers, potential issues with code blocks, lack of overlap, and no handling for small chunks). The new splitter provides a more robust, recursive strategy that respects header hierarchies, preserves code blocks, enforces chunk size limits with merging, and adds configurable overlap for better context retention in downstream processes (e.g., vector stores or LLMs).

This change improves chunk quality for ingestion into vector stores, reduces the risk of splitting code blocks or headers incorrectly, and makes the chunking logic reusable and testable. No breaking changes to the ingester interfaces or outputs—chunks are still produced as Document<BookChunk>[], but with potentially better granularity and metadata.

Motivation

Limitations of old approach: The manual H1 splitting could mishandle code blocks (e.g., treating fenced headers as real ones), didn't support multi-level headers, lacked overlap (leading to context loss between chunks), and had no min/max size enforcement with merging.
Benefits of new approach:
- Recursive splitting starts with headers, falls back to paragraphs/lines if needed.
- Preserves code blocks intact.
- Configurable options (e.g., max/min chars, overlap, header levels) for flexibility.
- Adds metadata like headerPath, uniqueId, and character offsets for better traceability.
- Tested for edge cases like unclosed code blocks, tiny chunks, and reconstruction (ensuring original content can be rebuilt from chunks).
Future-proofing: This utility can be reused across other ingesters or markdown-based sources, and it's easier to maintain/extend.

Key Changes

New Utility: RecursiveMarkdownSplitter.ts
- Implements a class-based splitter with options for max/min chars, overlap, header levels, code block preservation, ID prefix, and trimming.
- Tokenizes markdown to extract real headers (ignoring those in code blocks) and code fences.
- Recursively splits segments by headers > paragraphs > lines.
- Merges small segments to meet minChars, with flexibility for final chunks (up to 1.5x maxChars to avoid tiny orphans).
- Applies backward overlap, adjusting to avoid mid-code-block starts.
- Attaches metadata: title (from last relevant header), chunk number (per title), unique ID, char offsets, and full header path.
- Exports a convenience function splitMarkdownToChunks.
- ~750 LOC, including internal interfaces and methods.
Ingesters Refactored:
- CairoBookIngester.ts and CoreLibDocsIngester.ts:
  - Removed manual H1 parsing logic (line-by-line code block tracking, regex matching, section extraction).
  - Replaced with RecursiveMarkdownSplitter instantiation and usage in chunkSummaryFile / chunkCorelibSummaryFile.
  - Configured splitter with: maxChars=2048, minChars=500, overlap=256, headerLevels=[1,2], preserveCodeBlocks=true, idPrefix=('cairo-book' or 'corelib'), trim=true.
  - Updated chunk mapping to use splitter's metadata (e.g., chunk.meta.title, chunk.meta.chunkNumber).
  - Kept calculateHash for content hashing.
  - Updated logging to reflect new splitter usage.
Tests Added:
- __tests__/RecursiveMarkdownSplitter.test.ts: Covers basic functionality, header/code block handling, overlap, metadata, splitting strategies, and edge cases (e.g., empty input, unclosed blocks).
- __tests__/RecursiveMarkdownSplitter.finalChunk.test.ts: Focuses on deterministic handling of tiny final chunks and merging.
- __tests__/RecursiveMarkdownSplitter.minChars.test.ts: Verifies minChars merging, code block respect, and specific formatting examples.
- __tests__/RecursiveMarkdownSplitter.reconstruction.test.ts: Ensures chunks (minus overlaps) reconstruct the original markdown exactly, testing headers, code blocks, and complex docs.
Other:
- Imported calculateHash from contentUtils (previously inline).
- No changes to dependencies or build configs.

Impact

Performance: The new splitter is more computationally intensive (tokenization + recursion), but for typical book/docs sizes (~10-100KB), it's negligible. Tested with large code blocks and nested headers.
Output Changes: Chunks may be more numerous/finer-grained due to H2 support and merging. Metadata is richer (e.g., headerPath array). Existing consumers (e.g., VectorStore) should be unaffected as the Document<BookChunk> shape is preserved.
Backward Compatibility: Old chunks hashed differently? No—hashes are still content-based. If re-ingesting, expect updated chunks.
Risks: If misconfigured (e.g., low maxChars with large code blocks), could log warnings for oversized chunks. Mitigated by tests.
Metrics: In tests, reconstruction is 100% accurate; code blocks remain intact; tiny chunks are merged appropriately.

Testing

Unit Tests: 100% coverage for new splitter (basic, edges, reconstruction). Run with npm test.
Manual: Ingest a sample Cairo Book markdown file and verify chunks in logs (e.g., number of chunks, content integrity).
Integration: Run full ingester pipeline and check vector store entries for completeness.

Next Steps / Open Questions

Consider exposing splitter options via config/env for tuning per-ingester.
If needed, add benchmarks for large docs (>1MB).
Review for any missed edge cases in markdown variants (e.g., ATX vs Setext headers—currently only ATX supported).

This PR improves maintainability and chunk quality—ready for review!

enitrat changed the base branch from main to feat/migrate-dspy July 27, 2025 11:39

enitrat force-pushed the feat/better-markdown-splitter branch from 8791eb4 to 442e80f Compare July 30, 2025 15:36

enitrat changed the base branch from feat/migrate-dspy to main July 30, 2025 15:36

enitrat force-pushed the feat/better-markdown-splitter branch 2 times, most recently from bb95bab to eea3aa7 Compare July 30, 2025 22:30

enitrat added 3 commits July 31, 2025 15:58

migrate to DSPy

8685935

dspy migration

ce433a4

feat: better markdown splitter

bfc4058

enitrat force-pushed the feat/better-markdown-splitter branch from eea3aa7 to bfc4058 Compare July 31, 2025 13:58

enitrat merged commit 696f312 into main Jul 31, 2025
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(ingester): better markdown splitter #33

feat(ingester): better markdown splitter #33

Uh oh!

enitrat commented Jul 27, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

feat(ingester): better markdown splitter #33

feat(ingester): better markdown splitter #33

Uh oh!

Conversation

enitrat commented Jul 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary: Replace Manual Markdown Chunking with RecursiveMarkdownSplitter

Overview

Motivation

Key Changes

Impact

Testing

Next Steps / Open Questions

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

enitrat commented Jul 27, 2025 •

edited

Loading