@enitrat enitrat commented Oct 5, 2025

Ingestion: Markdown Splitter (TypeScript)

  • RecursiveMarkdownSplitter.ts
    • Added parsing of special Sources blocks delimited by --- … Sources: … ---.
    • Computes active source ranges; sets chunk.meta.sourceLink to the first URL listed in the most recent Sources block, for every subsequent chunk until the next Sources block.
    • Extended ChunkMeta with optional sourceLink and tokenizer Tokens with sourceRanges.
    • Wired sourceRanges into metadata attachment without altering chunking behavior.
  • Tests
    • New test: ingesters/src/utils/__tests__/RecursiveMarkdownSplitter.sources.test.ts validates activeSource mapping across multiple Sources blocks.
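
For reviewers who want a concrete picture of the Sources mapping, here is a minimal sketch. It is not the code from RecursiveMarkdownSplitter.ts. The names `SourceRange`, `computeSourceRanges` and `sourceLinkAt` are hypothetical. The exact block layout (a `---` line, a `Sources:` line, one or more lines containing URLs, a closing `---` line) and the choice to look up the link by chunk start offset are assumptions made for illustration.

```typescript
// Hypothetical shape: a half-open character range [start, end) and the URL active within it.
interface SourceRange {
  start: number;
  end: number;
  url: string;
}

// Scan the document for Sources blocks and derive the ranges they govern.
// Assumption: a block without any URL is ignored and the previous source stays active.
function computeSourceRanges(markdown: string): SourceRange[] {
  const blockRe = /^---[ \t]*\r?\n[ \t]*Sources:[ \t]*\r?\n([\s\S]*?)^---[ \t]*$/gm;
  const urlRe = /https?:\/\/[^\s)>\]]+/;
  const ranges: SourceRange[] = [];
  let m: RegExpExecArray | null;
  while ((m = blockRe.exec(markdown)) !== null) {
    const urlMatch = urlRe.exec(m[1]); // first URL listed in the block
    if (!urlMatch) continue;
    // The previous source stops being active where this block begins.
    if (ranges.length > 0) ranges[ranges.length - 1].end = m.index;
    ranges.push({
      start: m.index + m[0].length, // active from the end of the block...
      end: markdown.length,         // ...until the next block or end of file
      url: urlMatch[0],
    });
  }
  return ranges;
}

// Look up the active source for a character offset (for example a chunk's start).
function sourceLinkAt(ranges: SourceRange[], offset: number): string | undefined {
  for (const r of ranges) {
    if (offset >= r.start && offset < r.end) return r.url;
  }
  return undefined;
}
```

Under these assumptions, a splitter could call `sourceLinkAt(ranges, chunkStart)` while attaching metadata, which leaves the chunk boundaries themselves untouched.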

Splitter robustness and fallback for fenced code blocks

  • Robust fenced code detection
    • Accept up to 3 leading spaces before fences.
    • Track fence char and exact length; a closing fence must use the same char, be at least as long as the opening fence, and be fence-only (no info string).
    • Fallback-close a malformed open block when a new opening fence appears while still in a block (marks previous as breakable).
    • Mark unclosed-at-EOF blocks as breakable: true.
  • Breakable blocks and size threshold
    • New SplitOptions:
      • codeBlockMaxChars (default: 2× maxChars) — closed blocks larger than this are treated as breakable (splittable).
      • fallbackCloseOnNestedOpen (default: true).
  • Splitting rules updated
    • Paragraph/line splitting skips split points only inside non-breakable blocks; allows splits inside breakable blocks.
    • Final boundary pass adjusts segment starts/ends only for non-breakable blocks; ensures monotonic, non-overlapping segments.
    • Overlap start is pushed past block end only if the enclosing block is non-breakable.
  • Tests
    • New: ingesters/src/utils/__tests__/RecursiveMarkdownSplitter.noStartInsideCodeBlock.test.ts enforces that no chunk starts inside a non-breakable code block; allows starts inside breakable (malformed/oversized) blocks.
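
The fence rules and the breakable flag above can be sketched roughly as follows. This is an illustrative approximation, not the merged implementation. The names `CodeBlock`, `FenceOptions`, `findCodeBlocks` and `canSplitAt` are hypothetical. Treating "a fence line that carries an info string" as the signal for a nested opening fence is an assumption, as is measuring block size in characters from the opening fence to the end of the closing fence.

```typescript
// Hypothetical block record: character range [start, end) plus the breakable flag.
interface CodeBlock {
  start: number;
  end: number;
  breakable: boolean;
}

// Option names follow the PR text; defaults mirror the stated ones.
interface FenceOptions {
  maxChars: number;
  codeBlockMaxChars?: number;          // default: 2 * maxChars
  fallbackCloseOnNestedOpen?: boolean; // default: true
}

// Up to 3 leading spaces, then a run of 3+ backticks or tildes, then an optional info string.
const FENCE_RE = /^ {0,3}(`{3,}|~{3,})(.*)$/;

function findCodeBlocks(markdown: string, opts: FenceOptions): CodeBlock[] {
  const maxBlock = opts.codeBlockMaxChars ?? 2 * opts.maxChars;
  const fallback = opts.fallbackCloseOnNestedOpen ?? true;
  const blocks: CodeBlock[] = [];
  let open: { start: number; char: string; len: number } | null = null;
  let offset = 0;
  for (const line of markdown.split("\n")) {
    const m = FENCE_RE.exec(line.replace(/\r$/, ""));
    if (m) {
      const char = m[1][0];
      const len = m[1].length;
      const info = m[2].trim();
      if (open === null) {
        open = { start: offset, char, len };
      } else if (char === open.char && len >= open.len && info === "") {
        // Valid close: same char, at least as long, no info string.
        const end = offset + line.length;
        blocks.push({ start: open.start, end, breakable: end - open.start > maxBlock });
        open = null;
      } else if (fallback && info !== "") {
        // Assumed nested open: fallback-close the previous block as breakable.
        blocks.push({ start: open.start, end: offset, breakable: true });
        open = { start: offset, char, len };
      }
    }
    offset += line.length + 1; // +1 for the "\n" consumed by split
  }
  // An unclosed block at EOF is kept but marked breakable.
  if (open !== null) blocks.push({ start: open.start, end: markdown.length, breakable: true });
  return blocks;
}

// A split point is rejected only when it falls strictly inside a non-breakable block.
function canSplitAt(offset: number, blocks: CodeBlock[]): boolean {
  return !blocks.some((b) => !b.breakable && offset > b.start && offset < b.end);
}
```

In this sketch a well-formed block of ordinary size blocks every split point strictly inside it, while malformed or oversized blocks allow them, which matches the behavior the new test describes.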

@enitrat enitrat merged commit 4d7bd49 into main Oct 5, 2025
3 checks passed