I know others have requested alternative backends/APIs in other threads, but I wanted to ask about supporting Ollama and/or LiteLLM. The reason I'm asking is that consumer hardware is often limited: my 4090 only has 24GB of VRAM, and Tabby takes up about 10GB when I'm using it.
Say I'm working on a project with Tabby in my code editor and want to jump over to my WebUI to ask a coding question. I'd first need to SSH into my dedicated inference server, stop Tabby to free the VRAM, ask my question in the WebUI (which uses Ollama as a backend), then restart Tabby afterwards. That is not a realistic workflow.
The benefit of a single backend like Ollama (or LiteLLM on top of Ollama) is that Ollama can swap models in and out on the fly and queue incoming requests. It would be much better if local LLM/AI projects supported such backends out of the box, enabling more efficient management of precious VRAM. If every tool relied on its own separate backend, we'd never be able to use multiple tools at once.
I've only been using Tabby for a day or so, and it seems like something I'd definitely like to integrate into my workflow. However, since I also rely heavily on Ollama in my current workflow, I can't really use both simultaneously without creating extra hassle.
I guess the other alternative would be the ability to unload models via a keybinding in the text editor (I use nvim). I've seen issue #624, but that seems to be about shutting down the whole Docker container (which runs on the same machine). Just having the option to temporarily unload the model with an API call would be more suitable.
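For comparison, Ollama already exposes this kind of "temporarily unload" behavior: its documented `keep_alive` parameter on `/api/generate` can be set to `0` to evict a model from VRAM immediately. A minimal sketch of triggering that from a script (the `localhost:11434` host is Ollama's default, and the model name is just an example):

```python
import json
import urllib.request

def unload_request(model: str, host: str = "http://localhost:11434"):
    """Build a request asking Ollama to evict `model` from VRAM.

    Ollama's /api/generate endpoint accepts keep_alive: 0, which
    unloads the model right away instead of keeping it resident.
    """
    payload = json.dumps({"model": model, "keep_alive": 0}).encode()
    return urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it requires a running Ollama instance, e.g.:
# urllib.request.urlopen(unload_request("codellama:13b"))
```

Something equivalent in Tabby (an unload endpoint, or honoring a similar idle-timeout setting) would make the keybinding idea trivial to wire up from nvim.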
Thank you for submitting such a detailed FR. I thoroughly understand your use case. Before delving into another post on why Tabby relies on the token decoding interface ...
There's a chat playground within Tabby when the --chat-model and --webserver arguments are set.