.Net: Fix TextChunker.SplitPlainTextLines to actually split on newlines regardless of token count #12558

Open: wants to merge 5 commits into base: main

shethaadit
Contributor

Description

Summary

Fixes issue #12556 where TextChunker.SplitPlainTextLines does not actually split text on newlines when the input token count is less than the maxTokensPerLine parameter.

Problem

The SplitPlainTextLines method had two issues:

  1. Incorrect separator: the method used "\n\r", which looks for text containing both a newline and a carriage return instead of splitting on newlines.
  2. Token-first logic: splitting was attempted only when the token count exceeded the limit, so newlines in shorter texts were ignored.

As a result, the method returned the text unsplit, with its newline characters preserved, instead of returning separate lines. This is counterintuitive given the method name.
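
To make the token-first problem concrete, here is a minimal Python sketch of the behavior described above. It is not the library's C# code: `count_tokens` and `split_lines_token_first` are hypothetical stand-ins, and whitespace-separated words are assumed to approximate tokens.

```python
def count_tokens(text: str) -> int:
    # Hypothetical token counter: whitespace-separated words stand in for tokens.
    return len(text.split())


def split_lines_token_first(text: str, max_tokens_per_line: int) -> list[str]:
    # Token-first logic: split points are only considered when the text
    # exceeds the limit, so short inputs come back untouched.
    if count_tokens(text) <= max_tokens_per_line:
        return [text]
    # Over the limit: split on newlines (simplified for illustration).
    return [line for line in text.split("\n") if line]


short_text = "First line\nSecond line\nThird line"

# Six "tokens" is well under the limit, so nothing is split.
result = split_lines_token_first(short_text, max_tokens_per_line=100)
print(result)  # a single element that still contains the embedded newlines
```

Under these assumptions the caller gets back one string with the newlines still inside it, which matches the symptom reported in the issue.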

Solution

  • Added a dedicated s_plaintextLineSplitOptions array with "\n" as the first separator specifically for line splitting
  • Modified SplitPlainTextLines to always split on newlines first, regardless of token count
  • Applied token-based splitting to individual lines only if they exceed the token limit
  • Preserved existing behavior for SplitPlainTextParagraphs by keeping the original split options
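
The two-stage approach in the list above could look roughly like the following Python sketch. It is illustrative only and not the C# change in this PR: all function names are hypothetical, words are assumed to approximate tokens, and dropping empty lines and trimming whitespace are assumptions made for the sketch.

```python
def count_tokens(text: str) -> int:
    # Hypothetical token counter: whitespace-separated words stand in for tokens.
    return len(text.split())


def split_by_tokens(line: str, max_tokens: int) -> list[str]:
    # Break one over-long line into chunks of at most max_tokens words.
    words = line.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def split_plain_text_lines(text: str, max_tokens_per_line: int) -> list[str]:
    result: list[str] = []
    # Stage 1: always split on newlines, regardless of the total token count.
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # assumption: empty lines are dropped
        # Stage 2: apply token-based splitting only to lines over the limit.
        if count_tokens(line) <= max_tokens_per_line:
            result.append(line)
        else:
            result.extend(split_by_tokens(line, max_tokens_per_line))
    return result


lines = split_plain_text_lines("First line\nSecond line\nThird line", max_tokens_per_line=100)
print(lines)  # ['First line', 'Second line', 'Third line']

long_lines = split_plain_text_lines("one two three four five\nsix", max_tokens_per_line=2)
print(long_lines)  # ['one two', 'three four', 'five', 'six']
```

The first call shows that newlines are honored even when the input is far below the token limit. The second shows that the token limit still applies, but only within each individual line.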

Changes

  • Added: New s_plaintextLineSplitOptions array for proper line splitting behavior
  • Modified: SplitPlainTextLines method to prioritize newline splitting over token limits
  • Preserved: SplitPlainTextParagraphs behavior is unchanged

Testing

Verification

The method now splits the following input into three separate lines, even when its token count is below maxTokensPerLine:

"First line\nSecond line\nThird line"

Fixes #12557

@shethaadit requested a review from a team as a code owner June 20, 2025 23:46
@markwallace-microsoft added the .NET and kernel labels Jun 20, 2025
@github-actions bot changed the title from "Fix TextChunker.SplitPlainTextLines to actually split on newlines regardless of token count" to ".Net: Fix TextChunker.SplitPlainTextLines to actually split on newlines regardless of token count" Jun 20, 2025