
Fix split_text chunking bug #2088

Closed
vzla0094 wants to merge 17 commits

Conversation

@vzla0094 commented Apr 17, 2023

Background

Handle long paragraphs in the split_text function by splitting them into smaller chunks, ensuring that no chunk exceeds max_length.

Fixes: #1820, #1211, #796, #38

Changes

  • Updated the split_text function to handle paragraphs longer than max_length by splitting them into smaller chunks (see the sketch below)
  • Added a while loop to process long paragraphs and create sub-paragraphs of at most max_length characters
  • Maintained consistency with the original implementation for appending chunks to current_chunk and updating current_length
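For readers skimming the thread, here is a minimal sketch of the approach described above. It is illustrative only: the real change lives in autogpt/processing/text.py and evolved over the course of the review, and the default max_length and list-based return type here are assumptions.

from typing import List


def split_text(text: str, max_length: int = 8192) -> List[str]:
    """Sketch: split text into chunks of at most max_length characters."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    chunks: List[str] = []
    current_chunk: List[str] = []
    current_length = 0

    def flush() -> None:
        nonlocal current_chunk, current_length
        if current_chunk:
            chunks.append("\n".join(current_chunk))
            current_chunk, current_length = [], 0

    for paragraph in paragraphs:
        # A paragraph longer than max_length becomes standalone chunks,
        # cut into pieces of at most max_length characters.
        if len(paragraph) > max_length:
            flush()
            chunks.extend(
                paragraph[i : i + max_length]
                for i in range(0, len(paragraph), max_length)
            )
            continue
        # Otherwise, accumulate paragraphs until adding one more would
        # push the joined chunk past max_length.
        if current_length + len(paragraph) + 1 > max_length:
            flush()
        current_chunk.append(paragraph)
        current_length += len(paragraph) + 1

    flush()
    return chunks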

Documentation

  • Added comments in the code explaining the chunk-splitting logic step by step

Test Plan

  • Manually test the updated split_text function with different input text scenarios, including long paragraphs and varying max_length values
  • Ensure that the function works as expected and no chunks exceed the specified max_length

PR Quality Checklist

  • My pull request is atomic and focuses on a single change.
  • I have thoroughly tested my changes with multiple different prompts.
  • I have considered potential risks and mitigations for my changes.
  • I have documented my changes clearly and comprehensively.
  • I have not snuck in any "extra" small tweaks or changes

@vzla0094 changed the base branch from master to stable April 17, 2023 04:12
@nponeccop added the bug (Something isn't working) and B7 labels Apr 17, 2023
@nponeccop
Contributor

Asked the team to merge out of band

@Pwuts
Member

Pwuts commented Apr 17, 2023

@vzla0094 we aren't merging into stable; can you change the base branch back to master?

@p-i-
Contributor

p-i- commented Apr 17, 2023

I'm not ready to merge this as is due to code quality. It looks unpythonic.

Code should self-document. We don't say i += 1 # add one to i.

And there is surely a Pythonic way to chunk a string, maybe using something out of itertools, or using a generator.
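For illustration only (not part of the original comment): one generator-based way to chunk a string with itertools, assuming character-count chunking; chunk_string is a hypothetical name.

from itertools import islice


def chunk_string(text: str, size: int):
    """Yield successive pieces of text, each at most size characters long."""
    it = iter(text)
    while piece := "".join(islice(it, size)):
        yield piece


# e.g. list(chunk_string("abcdefg", 3)) == ["abc", "def", "g"]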

@p-i-
Contributor

p-i- commented Apr 17, 2023

Closing this as I think #2062 is doing this better

@p-i- closed this Apr 17, 2023
@vaknin

vaknin commented Apr 17, 2023

Hey @p-i-,
neither #2062 nor #2088 fixes the mentioned issue, as is also stated inside #2062.

I've checked both solutions by applying the changes to the stable branch, and neither fixed it.
The error usually happens with large texts, especially on long URLs:

openai.error.InvalidRequestError: This model's maximum context length is 8191 tokens, however you requested 9221 tokens (9221 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

@vzla0094
Author

@p-i- The referenced #2062 doesn't address the split_text function, which is the one involved in the "max_token_limit" error. See @vaknin's message.

This one does; I can find a way to tidy it up if you'd like to re-open it.

@Pwuts
Member

Pwuts commented Apr 17, 2023

Sure, go ahead. And as @p-i- already mentioned, in rewriting the PR, using existing functionality from the standard library is preferable over DIY implementations. :)

@Pwuts reopened this Apr 17, 2023
@Pwuts changed the base branch from stable to master April 17, 2023 22:13
@nponeccop mentioned this pull request Apr 17, 2023
@github-actions bot added the conflicts label Apr 17, 2023
@github-actions
Contributor

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

@github-actions bot removed the conflicts label Apr 18, 2023
@github-actions
Contributor

Conflicts have been resolved! 🎉 A maintainer will review the pull request shortly.

@vzla0094
Author

vzla0094 commented Apr 18, 2023

Sure, go ahead. And as @p-i- already mentioned, in rewriting the PR, using existing functionality from the standard library is preferable over DIY implementations. :)

Just pushed an update removing the comments and restructuring as well. I still don't think it's really easy to understand, though. What do you guys think? Feel free to push modifications, or I could also use a third-party library for chunking, like Funcy.

I'm not a Python dev, just trying things out. The code works, but please feel free to point me in the right direction.

@Pwuts (Member) left a comment

This is looking better :)

autogpt/processing/text.py (two resolved review threads)
paragraphs = text.split("\n")
current_length = 0
current_chunk = []

def split_long_paragraph(paragraph: str, max_length: int) -> List[str]:
    return [
        paragraph[i : i + max_length] for i in range(0, len(paragraph), max_length)
    ]
Member

I'm not sure how much of a difference it makes for the performance of the LLM, but could you try splitting it on a whitespace (or other non-word) character instead of exactly on the max_length?
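One way to do that, as a rough sketch under the assumption that lengths are counted in characters (this is not the diff that was ultimately pushed), is to let textwrap handle the whitespace-aware breaking:

import textwrap
from typing import List


def split_long_paragraph(paragraph: str, max_length: int) -> List[str]:
    """Split on whitespace where possible, never exceeding max_length."""
    # break_long_words defaults to True, so a single word longer than
    # max_length is still hard-split and no piece exceeds the limit.
    return textwrap.wrap(paragraph, width=max_length)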

@Pwuts (Member) Apr 18, 2023

ping ;)

Author

After a thorough review of the split_text function, I found we can simplify it a lot using textwrap. There's no need for this split_long_paragraph function anymore.

Please check the new revised split_text. You might also want to check the tests I added.
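The added tests aren't reproduced in this thread; as a rough idea of what such a test can assert, here is a pytest-style sketch. The import path and the split_text signature are assumptions, not the code from the PR.

from autogpt.processing.text import split_text  # import path assumed


def test_split_text_respects_max_length():
    # A few short paragraphs plus one far longer than the limit.
    text = "short line\n" + "word " * 5000 + "\nanother short line"
    max_length = 100

    chunks = list(split_text(text, max_length))

    assert chunks, "expected at least one chunk"
    assert all(len(chunk) <= max_length for chunk in chunks)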

@vzla0094
Author

@s0meguy1 I'm not sure about any other failing function, but this split_text is definitely one of the causes. One thing you could do to find out is to run the original split_text from the master branch against the test I added in this PR; you'll see how buggy it is.

@vzla0094
Author

@vaknin @Pwuts @s0meguy1 I think I know the source of the confusion: I may have linked the wrong issues here, but this PR fixes the split_text function that's used by the web scraper/browser command, not file ingestion or Google searching.

@Pwuts
Member

Pwuts commented Apr 19, 2023

@vzla0094 it doesn't look too hard to refactor file_operations.py to use split_text() from processing/text.py: https://github.com/Significant-Gravitas/Auto-GPT/blob/master/autogpt/commands/file_operations.py#L52
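Roughly, the idea would be something like the sketch below. This is a heavily simplified illustration, not the actual commands/file_operations.py code, which has its own ingestion flow (overlap handling, memory writes) that this omits; the import path and helper name are assumptions.

from autogpt.processing.text import split_text  # assumed import path


def ingest_file_chunks(path: str, max_length: int = 4000):
    """Yield chunks of a file's contents via the shared split_text helper."""
    with open(path, encoding="utf-8") as f:
        content = f.read()
    # Reuse the same chunking that browse_website relies on instead of a
    # separate ad-hoc splitter.
    yield from split_text(content, max_length)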

@vzla0094
Author

@Pwuts it does look like an easy fix, haha, but I don't want to risk having to spend more time on this in case some edge case arises.

If this one is merged, I might find some time tomorrow to do the quick fix on the other one :)

Pwuts previously approved these changes Apr 19, 2023
@bszollosinagy
Contributor

This PR splits the text based on character count, not token count. It also splits in the middle of a sentence.

Can I recommend that you take a look at #2542, which solves these issues?
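For context on the character-vs-token distinction (an illustration only, not necessarily how #2542 implements it): token-aware chunking typically means encoding the text with the model's tokenizer and slicing the token sequence, e.g. with tiktoken.

from typing import List

import tiktoken


def split_by_tokens(text: str, max_tokens: int, model: str = "gpt-3.5-turbo") -> List[str]:
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    # Decode each max_tokens-sized slice back into text.
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

Keeping sentences intact would additionally require splitting on sentence boundaries first and packing whole sentences into each token budget.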

@bszollosinagy
Contributor

If you want, you can just merge this PR, after all, vzla0094 put some work into it, and then I'll just adjust my PR to make the additional changes on top of it.

@vzla0094
Author

If you want, you can just merge this PR, after all, vzla0094 put some work into it, and then I'll just adjust my PR to make the additional changes on top of it.

Whatever's best for everyone 🤷‍♂️, but yeah, I think you might want to use the tests at least. Nice job on your PR, btw.

@Pwuts
Member

Pwuts commented Apr 19, 2023

@vzla0094 we'll merge #2542 for the upcoming release and cherry pick your test, probably soon after. Thanks a lot for the work, and sorry for having you do all of it before (partially) turning it down. 😅

@Pwuts dismissed their stale review April 19, 2023 21:23

Merging #2542 instead

@github-actions bot added the conflicts label Apr 19, 2023
@github-actions
Contributor

This pull request has conflicts with the base branch, please resolve those so we can evaluate the pull request.

@Pwuts added the testing label and removed the B7 and bug labels Apr 26, 2023
@p-i-
Contributor

p-i- commented May 5, 2023

This is a mass message from the AutoGPT core team.
Our apologies for the ongoing delay in processing PRs.
This is because we are re-architecting the AutoGPT core!

For more details (and for info on joining our Discord), please refer to:
https://github.com/Significant-Gravitas/Auto-GPT/wiki/Architecting

@ntindle assigned Pwuts and unassigned BillSchumacher May 20, 2023
@kinance
Contributor

kinance commented May 24, 2023

The split_text function on the master branch has chunking. This issue should no longer exist. Please sync to the latest and retry.

@Pwuts
Member

Pwuts commented May 26, 2023

@vzla0094 sorry, we didn't get to cherry-picking the tests yet, just didn't have the time. They also don't conform to the test structure used in the rest of the project, and the rest of the PR is obsolete by now. As such, we can't merge it. :/

I'm going to close this PR, with a big thanks for your efforts and the inspiration that your solution provided. You are welcome to submit a PR implementing tests for the text processing module that is currently in master.

@Pwuts closed this May 26, 2023
Labels
conflicts (Automatically applied to PRs with merge conflicts) · function: process text · needs restructuring (PRs that should be split or restructured) · Obsolete? · testing
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Please fix the chunking issue, which makes browse_website unusable
9 participants