queueless Ragna? #204
Thanks for making this public. IMO it's the right approach to have two branches in parallel - there are so many 'breakthroughs' right now in Gen AI that it feels like a good idea to explore both paths in parallel for as long as possible.

That being said, I'm not fully sure I understand the rationale for removing the queue only from the Python API. My current understanding is that a queue is useful for CPU-bound tasks in Python. The most common CPU-bound tasks that I am familiar with for RAG applications are i) loading and chunking the documents, ii) computing the embeddings, and iii) performing a vector-similarity or lexical search. There are other, less common (for now!) tasks that I haven't mentioned, such as re-ranking, but I'll ignore these. These three sub-tasks are currently (mostly) abstracted behind the

For I/O-bound tasks, and for an application that already uses async methods, e.g. a FastAPI web server, I don't really see the benefit of using a queue (at least for workloads that I have experienced in the past). When interacting with an external API like OpenAI, I don't think there is much benefit to waiting for a response on a worker, as opposed to waiting for a response on the main thread. Disadvantages include issues like #185.

So, my point is: why restrict the queue experimentation branch only to the Python API? Surely the web server/API would also benefit from this. The key for me is distinguishing between I/O-bound and CPU-bound tasks, and then executing them in the right place, i.e. on the main thread or on a worker.

Specific Questions:
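A minimal sketch of that distinction, with hypothetical `chunk_and_embed` and `call_llm_api` functions standing in for the real components (these names are not Ragna's API): the CPU-bound step is offloaded to a worker process, while the I/O-bound step is simply awaited on the main thread.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def chunk_and_embed(document: str) -> list[list[float]]:
    """Hypothetical CPU-bound step: chunk a document and 'embed' the chunks."""
    chunks = [document[i : i + 512] for i in range(0, len(document), 512)]
    return [[float(len(chunk))] for chunk in chunks]  # stand-in for a real embedding model


async def call_llm_api(prompt: str) -> str:
    """Hypothetical I/O-bound step: waiting on an external LLM API."""
    await asyncio.sleep(0.2)  # simulates the network latency of the API call
    return f"answer to {prompt!r}"


async def answer(document: str, prompt: str) -> str:
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # CPU-bound: offload to a worker process so the event loop stays responsive
        embeddings = await loop.run_in_executor(pool, chunk_and_embed, document)
    # I/O-bound: no worker needed, just await the response on the main thread
    return await call_llm_api(f"{prompt} ({len(embeddings)} chunks retrieved)")


if __name__ == "__main__":
    print(asyncio.run(answer("some long document " * 100, "What does this say?")))
```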
@nenb You are correct that the task queue is mostly useful for CPU-bound tasks. But since we allow generic extensions, we cannot make any assumptions about the nature of the task. And since I/O-bound tasks still work correctly with a task queue, it is just the common denominator. Indeed we lose nice-to-have features like #185, but it lowers the maintenance burden by only having one way to do everything.
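To illustrate the common-denominator point with a generic sketch (none of this is Ragna's actual queue code): an I/O-bound task can always be pushed through a synchronous worker, it just ties up that worker while it waits, which is why it "still works" even though a plain await would do.

```python
import asyncio


def cpu_bound_task(n: int) -> int:
    # e.g. chunking/embedding: keeps a worker busy with actual work
    return sum(i * i for i in range(n))


async def io_bound_task(delay: float) -> str:
    # e.g. calling an external LLM API: the worker mostly waits
    await asyncio.sleep(delay)
    return "response"


def run_on_worker(task, *args):
    """Roughly what a generic queue worker does: run whatever task it is handed."""
    if asyncio.iscoroutinefunction(task):
        # correct, but the worker blocks on I/O that the caller could have awaited itself
        return asyncio.run(task(*args))
    return task(*args)


print(run_on_worker(cpu_bound_task, 10_000))
print(run_on_worker(io_bound_task, 0.1))
```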
Yes, that could work, but we would need a bigger design for this. Let's say, for example, I have a CPU-bound
Answering my own question here:
Well, I guess we can have both as discussed in #183 (comment).
Just "vomiting" my thoughts here for later:
I think this can work 🙂
We had a lengthy internal discussion about this last night and came to the following conclusion: the task queue is a premature optimization for an architecture that has changed quite a bit since we started on Ragna. As such, it has very little going for it. In addition, it causes real issues for users. And just because those are nice-to-haves from a functionality perspective, it doesn't mean that users won't turn away because of them. Thus, our current plan is to remove the task queue completely and only revisit this later if the need arises.

@nenb I think the FastAPI analogy is pretty neat here. Sync functions will be run on a separate thread and async ones are just awaited in the main thread. That makes the Python API fully async and thus directly usable by the REST API as well.

One thing that I worry about is DX while debugging multithreaded code. Not sure if that is an issue, but I would prefer being able to use a debugger if possible. Still, this is not blocking for the design.
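A sketch of the FastAPI-style dispatch described above, using made-up component functions rather than Ragna's real interfaces: async callables are awaited on the event loop, sync ones are pushed to a worker thread so they don't block it.

```python
import asyncio
import inspect
import time


def retrieve_sources(query: str) -> list[str]:
    # pretend blocking work from a user-defined (sync) component
    time.sleep(0.1)
    return [f"source for {query!r}"]


async def generate_answer(prompt: str) -> str:
    # pretend external API latency in a user-defined (async) component
    await asyncio.sleep(0.1)
    return f"answer to {prompt!r}"


async def call_component(fn, *args):
    """Await async components directly; run sync ones on a worker thread."""
    if inspect.iscoroutinefunction(fn):
        return await fn(*args)
    return await asyncio.to_thread(fn, *args)


async def main() -> None:
    sources = await call_component(retrieve_sources, "queueless Ragna")
    print(await call_component(generate_answer, sources[0]))


asyncio.run(main())
```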
Development for this is happening in #205.
@pmeier Thanks for the updates, this sounds great. I'll follow #205 with considerable interest, and will post here if I have any further comments. PS: I do think that the task queue has a place for heavier workloads, or if someone is using
Hi @nenb @pmeier, I noticed your team removed the LLM requests queue? I'm curious why. I'm the maintainer of LiteLLM (https://github.com/BerriAI/litellm) and we have a queue implementation that handles 100+ req/second for any LLM. Would it be helpful to use our request queue here? If not, I'd love your feedback on why not. Here's the quick start from the LiteLLM request queue docs: https://docs.litellm.ai/docs/routing#queuing-beta

Quick Start

```
REDIS_HOST="my-redis-endpoint"
REDIS_PORT="my-redis-port"
REDIS_PASSWORD="my-redis-password" # [OPTIONAL] if self-hosted
REDIS_USERNAME="default"           # [OPTIONAL] if self-hosted

$ litellm --config /path/to/config.yaml --use_queue
```

Here's an example config for config.yaml (this will load balance between OpenAI + Azure endpoints):

```yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: azure/chatgpt-v-2 # actual model name
      api_key:
      api_version: 2023-07-01-preview
      api_base: https://openai-gpt-4-test-v-1.openai.azure.com/
```

```
$ litellm --test_async --num_requests 100
```

Available Endpoints
@pmeier I'd love to help here - we implemented a queue for LLM requests to handle 100+ requests/second. If this is not a good enough solution, I'd love to know what's missing.
We didn't remove an "LLM requests queue", but rather the task queue that handled everything inside the RAG workflow: extracting text, embedding it, retrieving it, and finally sending it to an LLM.
You can read this thread for details. TL;DR: it was confusing to users and limited our ability to implement things like streaming API responses. I see that LiteLLM supports streaming, but that is likely only possible because you hard-depend on Redis as the queue. We didn't want to take a hard dependency on that, since it is not

Since hitting an LLM API is heavily I/O bound, we just went with vanilla async execution instead.
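For illustration, a sketch of what "vanilla async execution" can look like in practice (illustrative names, not Ragna's code): concurrent LLM requests are just awaited tasks on the event loop, and a semaphore can bound in-flight requests without any broker or worker processes.

```python
import asyncio


async def call_llm(prompt: str) -> str:
    # stand-in for an HTTP request to an LLM API; the event loop is free while we wait
    await asyncio.sleep(0.2)
    return f"answer to {prompt!r}"


async def answer_all(prompts: list[str], max_concurrency: int = 10) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrency)  # bounds in-flight requests

    async def bounded(prompt: str) -> str:
        async with semaphore:
            return await call_llm(prompt)

    return await asyncio.gather(*(bounded(p) for p in prompts))


print(asyncio.run(answer_all([f"question {i}" for i in range(25)])))
```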
I don't want to have a mechanism only for the LLMs (assistants in Ragna lingo) unless absolutely necessary. Whatever abstraction we build should work for everything. Right now we have that. What I think would be totally possible is to implement a
While responding to all the issues since the launch (thank you to everyone who took the time to get in touch!), it became somewhat evident that the queue backend is confusing and limiting.
In general, the task queue is used to scale deployments of Ragna. When auto-scaling worker nodes based on queue length, that is all there is to do to load-balance Ragna's backend, while keeping the REST API nodes constant (of course, they also need to be auto-scaled at some point if you have a lot of concurrent users). Other options like a service mesh were a lot more complicated, given that we want to have one consistent REST API that the user can hit and extend with plugins.
Still, this doesn't answer why the queue is part of the Python API. Let me give you some historical context on how it came into being. In the very beginning, Ragna had no public Python API (#17). The REST API was the only real entrypoint for users. After I rewrote the project in #28, I failed to cleanly separate the two. Back then, the Python API even still had the database baked in (#65). I removed that in #67.
Meaning, right now I can't come up with a good reason why we have a task queue as part of the Python API. Not having it could potentially simplify the implementation and enable us to implement a few nice-to-have features, such as #185.
The only downside I can currently come up with is that parallelizing tasks in the Ragna API will be harder in the future. Right now, you can simply select a non-memory task queue, start a worker, and be done with it. Without a task queue, the user has to implement their own parallelization scheme. That being said, since the Python API is mostly meant for experimentation, I'm not sure if this is a strong argument.
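As a rough illustration of what such a do-it-yourself scheme could look like for the CPU-bound parts (hypothetical `prepare_document` function, not Ragna's API), a local process pool covers the experimentation use case in a few lines:

```python
from concurrent.futures import ProcessPoolExecutor


def prepare_document(path: str) -> int:
    # placeholder for CPU-bound work such as extracting, chunking, and embedding a document
    return sum(ord(char) for char in path)


if __name__ == "__main__":
    paths = [f"doc_{i}.pdf" for i in range(8)]
    # replaces "select a non-memory task queue and start a worker" with a local pool
    with ProcessPoolExecutor(max_workers=4) as pool:
        for path, result in zip(paths, pool.map(prepare_document, paths)):
            print(path, result)
```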
I'm going to start a branch soon that removes the task queue from the Python API to see if there are more downsides that I'm currently not seeing. Will post updates here.