Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix split_text chunking bug #2088

Closed
wants to merge 17 commits into from
Closed
Changes from 10 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 15 additions & 14 deletions autogpt/processing/text.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,5 @@
"""Text processing functions"""
from typing import Dict, Generator, Optional

Pwuts marked this conversation as resolved.
Show resolved Hide resolved
from typing import Generator, Optional, Dict, List
from selenium.webdriver.remote.webdriver import WebDriver

from autogpt.config import Config
Expand All @@ -12,23 +11,25 @@


def split_text(text: str, max_length: int = 8192) -> Generator[str, None, None]:
"""Split text into chunks of a maximum length

Args:
text (str): The text to split
max_length (int, optional): The maximum length of each chunk. Defaults to 8192.

Yields:
str: The next chunk of text

Raises:
ValueError: If the text is longer than the maximum length
"""
Pwuts marked this conversation as resolved.
Show resolved Hide resolved
paragraphs = text.split("\n")
current_length = 0
current_chunk = []

def split_long_paragraph(paragraph: str, max_length: int) -> List[str]:
return [
paragraph[i : i + max_length] for i in range(0, len(paragraph), max_length)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much of a difference it makes for the performance of the LLM, but could you try splitting it on a whitespace (or other non-word) character instead of exactly on the max_length?

Copy link
Member

@Pwuts Pwuts Apr 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping ;)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after a thorough review of the split_text function I found out we can simplify it a lot using textwrap. No need to have this split_long_paragraph function anymore.

Please check the new revised split_text. Also you might want to check the tests I added

]

for paragraph in paragraphs:
if len(paragraph) > max_length:
long_paragraph_chunks = split_long_paragraph(paragraph, max_length)
for long_chunk in long_paragraph_chunks[:-1]:
if current_chunk:
yield "\n".join(current_chunk)
current_chunk = []
yield long_chunk
paragraph = long_paragraph_chunks[-1]

if current_length + len(paragraph) + 1 <= max_length:
current_chunk.append(paragraph)
current_length += len(paragraph) + 1
Expand Down