Investigate usage limits #55

mruwnik · 2023-07-10T23:20:40Z

Find out how many requests per second the current system can handle. This applies both to the server infrastructure and to the underlying LLM system.

ccstan99 · 2023-07-10T23:26:37Z

OpenAI LLM chat rate limits
Current defaults as of July 2023:
3,500 RPM (requests per minute) and 90,000 TPM (tokens per minute)

henri123lemoine · 2023-07-13T04:31:24Z

We can apply to increase our rate limits. Should we?
Also note that, should we decide to use GPT-4, its current rate limit is 200 messages per minute. OpenAI will increase that number over time.

henri123lemoine · 2023-07-14T17:00:35Z

Current GPT-4 rate limits are 200 messages or 40,000 tokens per minute. The token limit will likely be reached in 5-6 questions if we use the full context window for every chat. That might be a problem.
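As a rough check on the 5-6 figure, the sketch below computes which question in a given minute would first push usage past the 40,000-token budget. The per-question token counts are assumptions for illustration: 8,192 is the base GPT-4 context window, and 7,000 is a guess at a slightly smaller prompt plus completion, not a number measured on this project.

```python
def first_question_over_limit(tpm_limit: int, tokens_per_question: int) -> int:
    """Return the 1-based index of the first question within a minute
    whose tokens would push total usage above the per-minute limit."""
    if tokens_per_question <= 0:
        raise ValueError("tokens_per_question must be positive")
    # This many questions fit entirely inside the budget...
    fits = tpm_limit // tokens_per_question
    # ...so the next one is the first to exceed it.
    return fits + 1


GPT4_TPM = 40_000  # tokens-per-minute limit quoted above

# Assumed token usage per question (prompt + completion); illustrative only.
for tokens in (8_192, 7_000):
    print(tokens, first_question_over_limit(GPT4_TPM, tokens))
```

Under these assumptions the limit is hit on the fifth or sixth question sent within the same minute, across all users combined.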
We need a system that handles GPT-4 rate limits by switching to ChatGPT when they are hit. Additionally, we could limit spamming by slowing down the streaming if a single user sends queries at an impossibly high rate.
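One possible shape for the fallback, as a minimal sketch rather than the project's implementation: try the models in order and move to the next one when the provider reports a rate limit. `RateLimited`, `complete_with_fallback`, and `call_model` are hypothetical names introduced here. With the 0.x OpenAI Python SDK the exception to catch would presumably be `openai.error.RateLimitError`, and `call_model` would wrap the actual chat completion call.

```python
from typing import Callable, Optional, Sequence, Tuple


class RateLimited(Exception):
    """Hypothetical stand-in for the provider's rate-limit error."""


def complete_with_fallback(
    call_model: Callable[[str, str], str],
    prompt: str,
    models: Sequence[str] = ("gpt-4", "gpt-3.5-turbo"),
) -> Tuple[str, str]:
    """Try each model in order, falling back when one is rate limited.

    Returns (model_used, response). Raises RateLimited if every model
    in the list is currently rate limited.
    """
    last_error: Optional[RateLimited] = None
    for model in models:
        try:
            return model, call_model(model, prompt)
        except RateLimited as err:
            last_error = err  # this model is throttled; try the next one
    raise last_error or RateLimited("no models configured")


# Usage with a fake backend in which GPT-4 is over its quota.
def fake_call(model: str, prompt: str) -> str:
    if model == "gpt-4":
        raise RateLimited("gpt-4 is over its per-minute quota")
    return f"[{model}] answer to: {prompt}"


model_used, answer = complete_with_fallback(fake_call, "What is AI alignment?")
print(model_used, answer)
```

Passing the backend in as a callable keeps the fallback logic testable without network access. Per-user throttling would be a separate layer in front of this.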

ishaan-jaff · 2023-11-20T22:19:47Z

@mruwnik @ccstan99 @henri123lemoine

I'm the maintainer of LiteLLM. We let you maximize your throughput and effectively raise your rate limits by load balancing between multiple deployments (Azure, OpenAI).
I believe LiteLLM can be helpful here, and I'd love your feedback if we're missing something.

Here's how to use it
Docs: https://docs.litellm.ai/docs/routing

import os

from litellm import Router

model_list = [{ # list of model deployments 
    "model_name": "gpt-3.5-turbo", # model alias 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-v-2", # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "azure/chatgpt-functioncalling", 
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo", 
    "litellm_params": { # params for litellm completion/embedding call 
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ", 
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(model="gpt-3.5-turbo", 
                messages=[{"role": "user", "content": "Hey, how's it going?"}])

print(response)

ishaan-jaff mentioned this issue Sep 13, 2023

Add Budget Manager + Support for Anthropic, Cohere, Palm (100+ LLMs using LiteLLM) #99

Open
