
Fix split_text chunking bug #2088

Closed
vzla0094 wants to merge 17 commits

Conversation

@vzla0094 commented Apr 17, 2023

Background

Handle long paragraphs in the split_text function by splitting them into smaller chunks, ensuring that no chunk exceeds max_length.

Fixes: #1820, #1211, #796, #38

Changes

  • Updated the split_text function to handle paragraphs longer than max_length by splitting them into smaller chunks (see the sketch below)
  • Added a while loop to process long paragraphs and create sub-paragraphs of at most max_length characters
  • Maintained consistency with the original implementation for appending chunks to current_chunk and updating current_length
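For readers skimming the thread, here is a minimal sketch of the approach described above. It is illustrative only: the real change lives in autogpt/processing/text.py and evolved over the course of the review, and the default max_length and list-based return type here are assumptions.

from typing import List


def split_text(text: str, max_length: int = 8192) -> List[str]:
    """Sketch: split text into chunks of at most max_length characters."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    chunks: List[str] = []
    current_chunk: List[str] = []
    current_length = 0

    def flush() -> None:
        nonlocal current_chunk, current_length
        if current_chunk:
            chunks.append("\n".join(current_chunk))
            current_chunk, current_length = [], 0

    for paragraph in paragraphs:
        # A paragraph longer than max_length becomes standalone chunks,
        # cut into pieces of at most max_length characters.
        if len(paragraph) > max_length:
            flush()
            chunks.extend(
                paragraph[i : i + max_length]
                for i in range(0, len(paragraph), max_length)
            )
            continue
        # Otherwise, accumulate paragraphs until adding one more would
        # push the joined chunk past max_length.
        if current_length + len(paragraph) + 1 > max_length:
            flush()
        current_chunk.append(paragraph)
        current_length += len(paragraph) + 1

    flush()
    return chunks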

Documentation

  • Added comments in the code explaining the chunk-splitting logic step by step

Test Plan

  • Manually test the updated split_text function with different input text scenarios, including long paragraphs and varying max_length values
  • Ensure that the function works as expected and no chunks exceed the specified max_length

PR Quality Checklist

  • My pull request is atomic and focuses on a single change.
  • I have thoroughly tested my changes with multiple different prompts.
  • I have considered potential risks and mitigations for my changes.
  • I have documented my changes clearly and comprehensively.
  • I have not snuck in any "extra" small tweaks or changes

@vzla0094 changed the base branch from master to stable April 17, 2023 04:12
@nponeccop added the bug (Something isn't working) and B7 labels Apr 17, 2023
@nponeccop
Contributor

Asked the team to merge out of band

@Pwuts
Member

Pwuts commented Apr 17, 2023

@vzla0094 we aren't merging into stable; can you change the base branch back to master?

@p-i-
Contributor

p-i- commented Apr 17, 2023

I'm not ready to merge this as is due to code quality. It looks unpythonic.

Code should self-document. We don't say i += 1 # add one to i.

And there is surely a Pythonic way to chunk a string, maybe using something out of itertools, or using a generator.
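For illustration only (not part of the original comment): one generator-based way to chunk a string with itertools, assuming character-count chunking; chunk_string is a hypothetical name.

from itertools import islice


def chunk_string(text: str, size: int):
    """Yield successive pieces of text, each at most size characters long."""
    it = iter(text)
    while piece := "".join(islice(it, size)):
        yield piece


# e.g. list(chunk_string("abcdefg", 3)) == ["abc", "def", "g"]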

@p-i-
Contributor

p-i- commented Apr 17, 2023

Closing this as I think #2062 is doing this better

@p-i- closed this Apr 17, 2023
@vaknin

vaknin commented Apr 17, 2023

Hey @p-i-,
neither #2062 nor #2088 fixes the mentioned issue, as is also stated inside #2062.

I've checked both solutions by applying the changes to the stable branch, and neither fixed it.
The error usually happens with large texts, especially on long URLs:

openai.error.InvalidRequestError: This model's maximum context length is 8191 tokens, however you requested 9221 tokens (9221 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

@vzla0094
Author

@p-i- The referenced #2062 doesn't address the split_text function, which is the one involved in the "max_token_limit" error. See @vaknin's message.

This one does; I can find a way to tidy it up if you'd like to re-open it.

@Pwuts
Member

Pwuts commented Apr 17, 2023

Sure, go ahead. And as @p-i- already mentioned, in rewriting the PR, using existing functionality from the standard library is preferable over DIY implementations. :)

@Pwuts reopened this Apr 17, 2023
@Pwuts changed the base branch from stable to master April 17, 2023 22:13
@nponeccop mentioned this pull request Apr 17, 2023
@github-actions bot added the conflicts label Apr 17, 2023
@github-actions
Contributor

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

@github-actions bot removed the conflicts label Apr 18, 2023
@github-actions
Contributor

Conflicts have been resolved! 🎉 A maintainer will review the pull request shortly.

@vzla0094
Author

vzla0094 commented Apr 18, 2023

Sure, go ahead. And as @p-i- already mentioned, in rewriting the PR, using existing functionality from the standard library is preferable over DIY implementations. :)

Just pushed an update removing the comments and restructuring as well. I still don't think it's really easy to understand, though. What do you guys think? Feel free to push modifications, or I could also use a third-party library for chunking, like Funcy.

I'm not a Python dev, just trying things out. The code works, but please feel free to point me in the right direction.

@Pwuts (Member) left a comment

This is looking better :)

autogpt/processing/text.py (two resolved review threads)
paragraphs = text.split("\n")
current_length = 0
current_chunk = []

def split_long_paragraph(paragraph: str, max_length: int) -> List[str]:
    return [
        paragraph[i : i + max_length] for i in range(0, len(paragraph), max_length)
    ]
Member

I'm not sure how much of a difference it makes for the performance of the LLM, but could you try splitting it on a whitespace (or other non-word) character instead of exactly on the max_length?
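One way to do that, as a rough sketch under the assumption that lengths are counted in characters (this is not the diff that was ultimately pushed), is to let textwrap handle the whitespace-aware breaking:

import textwrap
from typing import List


def split_long_paragraph(paragraph: str, max_length: int) -> List[str]:
    """Split on whitespace where possible, never exceeding max_length."""
    # break_long_words defaults to True, so a single word longer than
    # max_length is still hard-split and no piece exceeds the limit.
    return textwrap.wrap(paragraph, width=max_length)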

@Pwuts (Member) Apr 18, 2023

ping ;)

Author

After a thorough review of the split_text function, I found we can simplify it a lot using textwrap. There's no need for this split_long_paragraph function anymore.

Please check the new revised split_text. You might also want to check the tests I added.
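The added tests aren't reproduced in this thread; as a rough idea of what such a test can assert, here is a pytest-style sketch. The import path and the split_text signature are assumptions, not the code from the PR.

from autogpt.processing.text import split_text  # import path assumed


def test_split_text_respects_max_length():
    # A few short paragraphs plus one far longer than the limit.
    text = "short line\n" + "word " * 5000 + "\nanother short line"
    max_length = 100

    chunks = list(split_text(text, max_length))

    assert chunks, "expected at least one chunk"
    assert all(len(chunk) <= max_length for chunk in chunks)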

@vzla0094
Author

@s0meguy1 I'm not sure about any other failing function, but this split_text is definitely one of the causes. One thing you could do to find out is to run the original split_text from the master branch against the test I added in this PR; you'll see how buggy it is.

@vzla0094
Author

@vaknin @Pwuts @s0meguy1 I think I know the source of the confusion: I may have linked the wrong issues here, but this PR fixes the split_text function that's used by the web scraper/browser command, not file ingestion or Google searching.

@Pwuts
Member

Pwuts commented Apr 19, 2023

@vzla0094 it doesn't look too hard to refactor file_operations.py to use split_text() from processing/text.py: https://github.com/Significant-Gravitas/Auto-GPT/blob/master/autogpt/commands/file_operations.py#L52
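Roughly, the idea would be something like the sketch below. This is a heavily simplified illustration, not the actual commands/file_operations.py code, which has its own ingestion flow (overlap handling, memory writes) that this omits; the import path and helper name are assumptions.

from autogpt.processing.text import split_text  # assumed import path


def ingest_file_chunks(path: str, max_length: int = 4000):
    """Yield chunks of a file's contents via the shared split_text helper."""
    with open(path, encoding="utf-8") as f:
        content = f.read()
    # Reuse the same chunking that browse_website relies on instead of a
    # separate ad-hoc splitter.
    yield from split_text(content, max_length)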

@vzla0094
Author

@Pwuts it does look like an easy fix, haha, but I don't want to risk having to spend more time on this in case some edge case arises.

If this one is merged, I might find some time tomorrow to do the quick fix on the other one :)

Pwuts previously approved these changes Apr 19, 2023
@bszollosinagy
Contributor

This PR splits the text based on character count, not token count. It also splits in the middle of a sentence.

Can I recommend that you take a look at #2542, which solves these issues?
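For context on the character-vs-token distinction (an illustration only, not necessarily how #2542 implements it): token-aware chunking typically means encoding the text with the model's tokenizer and slicing the token sequence, e.g. with tiktoken.

from typing import List

import tiktoken


def split_by_tokens(text: str, max_tokens: int, model: str = "gpt-3.5-turbo") -> List[str]:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    # Decode each max_tokens-sized slice back into text.
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

Keeping sentences intact would additionally require splitting on sentence boundaries first and packing whole sentences into each token budget.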

@bszollosinagy
Contributor

If you want, you can just merge this PR, after all, vzla0094 put some work into it, and then I'll just adjust my PR to make the additional changes on top of it.

@vzla0094
Author

If you want, you can just merge this PR, after all, vzla0094 put some work into it, and then I'll just adjust my PR to make the additional changes on top of it.

Whatever's best for everyone 🤷‍♂️, but yeah, I think you might want to use the tests at least. Nice job on your PR, btw.

@Pwuts
Member

Pwuts commented Apr 19, 2023

@vzla0094 we'll merge #2542 for the upcoming release and cherry pick your test, probably soon after. Thanks a lot for the work, and sorry for having you do all of it before (partially) turning it down. 😅

@Pwuts dismissed their stale review April 19, 2023 21:23

Merging #2542 instead

@github-actions bot added the conflicts label Apr 19, 2023
@github-actions
Contributor

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

@Pwuts added the testing label and removed the B7 and bug labels Apr 26, 2023
@p-i-
Contributor

p-i- commented May 5, 2023

This is a mass message from the AutoGPT core team.
Our apologies for the ongoing delay in processing PRs.
This is because we are re-architecting the AutoGPT core!

For more details (and for info on joining our Discord), please refer to:
https://github.com/Significant-Gravitas/Auto-GPT/wiki/Architecting

@ntindle assigned Pwuts and unassigned BillSchumacher May 20, 2023
@kinance
Contributor

kinance commented May 24, 2023

The split_text function on the master branch has chunking. This issue should no longer exist. Please sync to the latest and retry.

@Pwuts
Member

Pwuts commented May 26, 2023

@vzla0094 sorry, we didn't get to cherry-picking the tests yet, just didn't have the time. They also don't conform to the test structure used in the rest of the project, and the rest of the PR is obsolete by now. As such, we can't merge it. :/

I'm going to close this PR, with a big thanks for your efforts and the inspiration that your solution provided. You are welcome to submit a PR implementing tests for the text processing module that is currently in master.

@Pwuts closed this May 26, 2023
Labels
conflicts (Automatically applied to PRs with merge conflicts) · function: process text · needs restructuring (PRs that should be split or restructured) · Obsolete? · testing
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Please fix the chunking issue, which makes browse_website unusable
9 participants