feat(ingester): better markdown splitter #33
Merged
# PR Summary: Replace Manual Markdown Chunking with RecursiveMarkdownSplitter

## Overview

This PR refactors the markdown chunking logic in the Cairo Book and Core Library ingesters to use a new `RecursiveMarkdownSplitter` utility. The previous implementation relied on a simple, manual H1-header-based splitting approach, which had limitations (e.g., no support for H2 headers, potential issues with code blocks, lack of overlap, and no handling for small chunks). The new splitter provides a more robust, recursive strategy that respects header hierarchies, preserves code blocks, enforces chunk size limits with merging, and adds configurable overlap for better context retention in downstream processes (e.g., vector stores or LLMs).

This change improves chunk quality for ingestion into vector stores, reduces the risk of splitting code blocks or headers incorrectly, and makes the chunking logic reusable and testable. There are no breaking changes to the ingester interfaces or outputs: chunks are still produced as `Document<BookChunk>[]`, but with potentially better granularity and metadata.
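As a rough sketch of how the new splitter is meant to be used (the exact option names are assumptions, except `minChars`, which the test names confirm):

```typescript
import { splitMarkdownToChunks } from './utils/RecursiveMarkdownSplitter'; // path assumed

const markdownText = '# Title\n\nSome section text.\n\n## Subsection\n\nMore text.';

// Hypothetical option names; the real splitter exposes chunk-size, minimum-size
// (minChars) and overlap controls along these lines, but names may differ.
const chunks = splitMarkdownToChunks(markdownText, {
  maxChars: 2048, // upper bound on chunk size
  minChars: 500,  // chunks smaller than this are merged into a neighbor
  overlap: 256,   // trailing characters repeated at the start of the next chunk
});

console.log(`Produced ${chunks.length} chunks`);
```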
## Motivation

- Chunks carry richer metadata (`headerPath`, `uniqueId`, and character offsets) for better traceability.
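For illustration, the per-chunk metadata can be pictured roughly as below. Only `title`, `chunkNumber`, `headerPath`, `uniqueId`, and the presence of character offsets are stated in this PR; the field grouping and the offset field names are assumptions.

```typescript
// Illustrative shape only; the exact offset field names are assumptions.
interface ChunkMeta {
  title: string;        // title of the section the chunk came from
  chunkNumber: number;  // position of the chunk within its source document
  headerPath: string[]; // header hierarchy, e.g. ["Introduction", "Installation"]
  uniqueId: string;     // stable identifier for traceability and deduplication
  startChar: number;    // character offset of the chunk start in the source markdown
  endChar: number;      // character offset of the chunk end in the source markdown
}
```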
## Key Changes

- New utility: `RecursiveMarkdownSplitter.ts`, including `splitMarkdownToChunks`.
- Ingesters refactored: `CairoBookIngester.ts` and `CoreLibDocsIngester.ts`:
  - `RecursiveMarkdownSplitter` instantiation and usage in `chunkSummaryFile` / `chunkCorelibSummaryFile` (see the sketch after this list).
  - Chunk metadata populated from the splitter output (e.g., `chunk.meta.title`, `chunk.meta.chunkNumber`).
  - `calculateHash` used for content hashing.
- Tests added:
  - `__tests__/RecursiveMarkdownSplitter.test.ts`: covers basic functionality, header/code block handling, overlap, metadata, splitting strategies, and edge cases (e.g., empty input, unclosed blocks).
  - `__tests__/RecursiveMarkdownSplitter.finalChunk.test.ts`: focuses on deterministic handling of tiny final chunks and merging.
  - `__tests__/RecursiveMarkdownSplitter.minChars.test.ts`: verifies minChars merging, code block respect, and specific formatting examples.
  - `__tests__/RecursiveMarkdownSplitter.reconstruction.test.ts`: ensures chunks (minus overlaps) reconstruct the original markdown exactly, testing headers, code blocks, and complex docs.
- Other: content hashing now uses `calculateHash` from `contentUtils` (previously inline).
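A minimal sketch of the ingester-side wiring described above; the local types, helper stand-ins, and option values are assumptions used to keep the example self-contained, not the project's actual definitions:

```typescript
import { createHash } from 'node:crypto';
import { splitMarkdownToChunks } from './utils/RecursiveMarkdownSplitter'; // path assumed

// Local stand-ins so the sketch compiles on its own; the real project defines
// Document and BookChunk elsewhere, and BookChunk likely has more fields.
interface BookChunk {
  title: string;
  chunkNumber: number;
  uniqueId: string;
  contentHash: string;
}
interface Document<T> {
  pageContent: string;
  metadata: T;
}

// Stand-in for contentUtils.calculateHash (assumed to be a plain content hash).
const calculateHash = (content: string): string =>
  createHash('sha256').update(content).digest('hex');

// Rough shape of the refactored chunkSummaryFile / chunkCorelibSummaryFile:
// split the summary markdown, then map each chunk to a Document<BookChunk>.
function chunkSummaryFile(markdown: string): Document<BookChunk>[] {
  const chunks = splitMarkdownToChunks(markdown, { maxChars: 2048, overlap: 256 }); // options assumed
  return chunks.map((chunk) => ({
    pageContent: chunk.content, // field name assumed
    metadata: {
      title: chunk.meta.title,
      chunkNumber: chunk.meta.chunkNumber,
      uniqueId: chunk.meta.uniqueId,
      contentHash: calculateHash(chunk.content),
    },
  }));
}
```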
## Impact

Chunk metadata now includes additional fields (e.g., the `headerPath` array). Existing consumers (e.g., `VectorStore`) should be unaffected, as the `Document<BookChunk>` shape is preserved.
## Testing

Tests run via `npm test`.
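To illustrate the reconstruction property checked by `RecursiveMarkdownSplitter.reconstruction.test.ts` (a sketch under assumed offset field names and options, not the actual test code):

```typescript
import { splitMarkdownToChunks } from './utils/RecursiveMarkdownSplitter'; // path assumed

// Sketch of the invariant: dropping each chunk's overlapping prefix and
// concatenating the remainders should yield the original markdown exactly.
function reconstruct(markdown: string): string {
  const chunks = splitMarkdownToChunks(markdown, { maxChars: 1024, overlap: 128 }); // options assumed
  let rebuilt = '';
  let coveredUpTo = 0; // number of source characters already reconstructed
  for (const chunk of chunks) {
    const { startChar, endChar } = chunk.meta; // offset field names assumed
    // Skip the part of this chunk that overlaps with what we already have.
    rebuilt += chunk.content.slice(Math.max(0, coveredUpTo - startChar));
    coveredUpTo = endChar;
  }
  return rebuilt; // expected to equal the original `markdown`
}
```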
## Next Steps / Open Questions
This PR improves maintainability and chunk quality—ready for review!