
Exponential Backoff Mechanism for RateLimit Issues in /Chat #500

Closed

Conversation

vrajroutu

Purpose

  • ...

Does this introduce a breaking change?

[ ] Yes
[x] No

Pull Request Type

What kind of change does this Pull Request introduce?

When releasing this app in production with 100+ users, there is a possibility of hitting rate limits when using the chat feature. To mitigate this, we can implement exponential backoff with the tenacity library. This approach automatically retries chat requests when a rate limit is encountered and returns the response once the rate limit has lifted. By incorporating tenacity's exponential backoff, the app will handle rate limit scenarios more gracefully and provide a smoother user experience during periods of high traffic or usage.

Reference: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb (a sketch of this pattern follows the checklist below)

[ ] Bugfix
[x] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no API changes)
[ ] Documentation content changes
[ ] Other... Please describe:
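
For context, the retry pattern from the cookbook referenced above looks roughly like the minimal sketch below. It assumes the openai 0.x SDK and the tenacity library; the wrapper name is illustrative and not necessarily the exact code in this PR.

# Minimal sketch: jittered exponential backoff on 429s, per the OpenAI cookbook.
# Assumes the openai 0.x SDK; chat_completion_with_backoff is an illustrative name.
import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

@retry(
    retry=retry_if_exception_type(openai.error.RateLimitError),  # retry only on rate limits
    wait=wait_random_exponential(min=1, max=60),                 # 1s-60s jittered exponential backoff
    stop=stop_after_attempt(6),                                  # give up (and re-raise) after 6 attempts
)
def chat_completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)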

How to Test

  • Get the code
    git clone [repo-address]
    cd [repo-name]
    git checkout [branch-name]
    npm install
  • Deploy with azd deploy
  • Test the code

What to Check

Verify that the following are valid

  • ...

Other Information

@pamelafox
Collaborator

Thanks @vrajroutu! We were just discussing this today. Did you do load testing with this in place? I'm wondering whether the backoff ends up increasing requests overall, putting users in competition with each other, or whether you find it works well to alleviate "burst" situations.

Also, I assume you're maxing your deployments out to 240K for this situation.

@vrajroutu
Author

vrajroutu commented Aug 4, 2023

@pamelafox

Firstly, enabling streaming in the environment has helped reduce rate limit issues, but even with 100+ users on the platform simultaneously, I still observed a few rate limit problems. While exploring solutions, I came across various articles, including this one: AvoidRateLimits. This change won't impact existing functionality, but it will be beneficial during high load on the environment and requests to the OAI model. A similar change was successfully implemented in the prepdocs script to address rate limit issues when indexing documents. I plan to conduct load testing and will update the results accordingly.

This change is especially beneficial for users utilizing GPT-4, as the TPM is limited to 90k per subscription. By implementing the suggested approach, we can better manage the rate limits and ensure a smoother user experience, even during peak usage periods.

@pamelafox
Collaborator

Ah, I didn't realize GPT-4 had a lower TPM limit. Yeah, the backoff technique works well in prepdocs.py, where there's a single caller to the API. My concern with using it for the per-user API calls is that it may increase load overall if there truly are more users than the deployment can handle. I could see this approach smoothing over spikes of activity, but not being the solution to a long period of high load. Other ideas: request more quota, put up a message asking users to reload the page (thus reducing their history), or remove the history from messages entirely.

Let us know how it goes in production!

@vrajroutu
Author

Absolutely, I agree with your points. Increasing the capacity of GPT-4 would certainly be a long-term solution, but since capacity increases are currently paused by the MS team, we need to find ways to optimize the current setup. Enabling streaming has already shown positive results in reducing rate limit issues, and incorporating the backoff technique will further help smooth out spikes in activity. I understand that users might value keeping their chat history and continuing conversations seamlessly. We can explore different options to manage the rate limit, like gradually requesting more quota and monitoring usage patterns. If the backoff strategy combined with streaming can alleviate the majority of the rate limit issues and provide a good user experience, that would be a positive step forward. I will keep an eye on the system's performance.

@github-actions

github-actions bot commented Oct 4, 2023

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed.

@github-actions github-actions bot added the Stale label Oct 4, 2023
@pamelafox
Collaborator

I think we may still want to merge this; I just want to do more load testing first.

@github-actions github-actions bot removed the Stale label Oct 5, 2023
@github-actions

This PR is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed.

@pamelafox
Collaborator

@vrajroutu Are you still using tenacity in this situation, now that the OpenAI SDK has built-in retries and a customizable max_retries parameter? It seems less necessary these days.
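
For reference, the built-in behavior mentioned here can be tuned roughly as in the sketch below (against the openai 1.x AzureOpenAI client; the endpoint, key, and API version are placeholders).

# Sketch only: the 1.x openai client retries rate-limit and connection errors
# with exponential backoff on its own; max_retries raises the default of 2.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",  # placeholder
    api_key="YOUR-API-KEY",                                   # placeholder
    api_version="2024-02-01",                                 # placeholder API version
    max_retries=5,
)

# A per-request override is also possible:
# client.with_options(max_retries=0).chat.completions.create(...)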

@vrajroutu
Author


Hi @pamelafox, I don't think it's necessary anymore. Users can now use API Management or an app gateway to scale Azure OpenAI across multiple instances, and it works quite effectively.
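
Illustrative only: the gateway approach described here does the load spreading server-side; a naive client-side equivalent, with hypothetical endpoints, keys, and deployment name, would look something like this.

# Hypothetical sketch: round-robin chat requests across two Azure OpenAI
# instances so each call draws on a separate TPM quota. In practice this
# is done in API Management / an app gateway rather than in the app itself.
import itertools
from openai import AzureOpenAI

clients = itertools.cycle([
    AzureOpenAI(azure_endpoint="https://instance-1.openai.azure.com",  # placeholder
                api_key="KEY-1", api_version="2024-02-01"),
    AzureOpenAI(azure_endpoint="https://instance-2.openai.azure.com",  # placeholder
                api_key="KEY-2", api_version="2024-02-01"),
])

def chat(messages, deployment="chat"):
    # deployment is the Azure OpenAI deployment name (placeholder).
    # Each call goes to the next instance, spreading load across quotas.
    return next(clients).chat.completions.create(model=deployment, messages=messages)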

@vrajroutu vrajroutu closed this Mar 28, 2024