Local LLMs for TurboWarp #2486
Closed
Brackets-Coder started this conversation in Ideas
not planning anything. don't think there's a reasonable way to run a useful LLM on typical end user devices ($200 chromebooks), and i don't intend to fund anything at this time
@CubesterYT if you're interested it might be possible for NitroBolt but I don't know that it would be practical
Introduction
Many pull requests have attempted to add large language models to TurboWarp. There's clearly demand for it, but many of these have been disputed or closed because they rely on external APIs (especially with authentication keys stored inside projects) or, for self-hosted APIs, on someone else always keeping a free server available.
The clear solution to both of these problems is to bring the AI to the device, where it runs offline, keeps working without depending on any server, and never sends API keys or prompts off the machine. While this hasn't been practical in the past due to hardware requirements, Gemma 4 E4B just came out and is a fairly small but capable model designed to bring high-level reasoning to mobile devices. I'm not certain, but I think it may be able to fit within browser memory limits.
Proposal
We could use something like WebLLM, or compile llama.cpp to WASM + WebGPU for a custom implementation, and then drive everything from JS. The main hurdle is where to store the model weights, as the smallest Gemma 4 model is 7 GB. That is too large to bundle directly in the extension, and it would be impractical to ask users to install it themselves. When quantized or compressed, though, it should fit in much less memory. If we can figure this out, I think it's a feasible idea.
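To put a rough number on the quantization point, here is a back-of-the-envelope sketch. It assumes the 7 GB figure corresponds to 16-bit weights and that size scales linearly with bits per weight; both are my assumptions, and real quantized files carry extra metadata and often keep some tensors at higher precision. The function name `scaledSizeGB` is made up for illustration.

```typescript
// Rough estimate of weight storage at a lower precision.
// Assumption: size scales linearly with bits per weight.
function scaledSizeGB(baseSizeGB: number, baseBits: number, targetBits: number): number {
  return baseSizeGB * (targetBits / baseBits);
}

const baseSizeGB = 7;       // figure quoted in the proposal
const assumedBaseBits = 16; // assumption: the 7 GB download is 16-bit weights

for (const bits of [8, 4]) {
  const size = scaledSizeGB(baseSizeGB, assumedBaseBits, bits);
  console.log(`${bits}-bit: ~${size.toFixed(2)} GB`);
}
```

Under those assumptions, 8-bit weights come to about 3.5 GB and 4-bit weights to about 1.75 GB, which is still a large download for a browser extension to fetch and cache.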
Attempt
Before coming up with the idea of running local LLMs in the browser, I tried to develop a small Ollama extension for TW that sent a fetch request to localhost:11434. This approach worked for a plugin I'm writing for another application, but in the browser I seem to be running into CORS restrictions, and it would also require users to install Ollama and the appropriate open models (and Ollama is only available for desktop devices). That stopped me from looking into it further.
I just wanted to get the community's thoughts on this. It seems feasible conceptually, but I don't know whether it's practical to pursue.
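For reference, the kind of request described here looks roughly like the sketch below. This is not the extension's actual code: the helper names `buildGenerateRequest` and `askOllama` are made up, and it assumes Ollama's `/api/generate` endpoint with `stream: false`, which returns a single JSON object with a `response` field.

```typescript
interface GenerateRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the POST request for a local Ollama server (hypothetical helper).
function buildGenerateRequest(
  model: string,
  prompt: string,
  baseUrl = "http://localhost:11434",
): GenerateRequest {
  return {
    url: `${baseUrl}/api/generate`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // stream: false asks for one JSON reply instead of a chunked stream
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

// Send the request and return the generated text.
async function askOllama(model: string, prompt: string): Promise<string> {
  const { url, init } = buildGenerateRequest(model, prompt);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = (await res.json()) as { response: string };
  return data.response;
}
```

From a page served on another origin, the browser sends a CORS preflight for this request, and it only succeeds if the Ollama server allows that origin. As far as I know this is configured with the `OLLAMA_ORIGINS` environment variable, which is one more setup step each user would have to do.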