💡Fully Air-Gapped Offline Auto-GPT #348
Comments
+1 |
Would be great to use an offline TTS as well. larynx or something. |
tortoise-tts could also be an interesting option for local voice |
Dropping this here for those who don't know: You can serve any model behind an OpenAI-compatible API endpoint with Basaran: https://github.com/hyperonym/basaran "Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models."
|
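A minimal sketch of how a client could talk to a local OpenAI-compatible endpoint such as Basaran, using the legacy (pre-1.0) openai Python package. The URL, port, and model name are placeholders I've assumed for illustration, not values confirmed in this thread:

```python
# Hedged sketch: point the legacy (pre-1.0) openai Python client at a local
# OpenAI-compatible server such as Basaran. The base URL and model id below
# are assumptions/placeholders, not values from this thread.
import openai

openai.api_key = "not-needed-for-local"        # local servers typically ignore the key
openai.api_base = "http://localhost:80/v1"     # wherever the compatible server listens

response = openai.Completion.create(
    model="your-local-model",                  # placeholder model id
    prompt="Summarize why offline LLMs matter for privacy.",
    max_tokens=128,
)
print(response["choices"][0]["text"])
```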
@MarkSchmidty yeah this has to happen eventually. Another option is no embeddings at all; after all, you can do a basic BM25 search and get pretty good results. Redis supports search, it's not very scalable but will do for now, and we already have a Redis instance implemented. For the large language model:
anyone want to chime in and suggest light LLMs that can run locally? |
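A rough illustration of the embedding-free idea using the `rank_bm25` package (my own assumption; the comment doesn't name a specific library): keyword-based retrieval over the agent's stored memories instead of vector search.

```python
# Hedged sketch of embedding-free memory retrieval with BM25.
# Uses the rank_bm25 package; the corpus and tokenization are illustrative only.
from rank_bm25 import BM25Okapi

memories = [
    "The user asked for a summary of the quarterly report.",
    "Auto-GPT saved the scraped pricing data to prices.csv.",
    "The weather API key is stored in the .env file.",
]

tokenized = [m.lower().split() for m in memories]   # naive whitespace tokenizer
bm25 = BM25Okapi(tokenized)

query = "where is the pricing data saved".lower().split()
top = bm25.get_top_n(query, memories, n=2)          # most relevant memories, no embeddings needed
print(top)
```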
Vicuna-13B and other 13B fine-tunes in 4-bit are only 8GB and even run purely on CPU at useful speeds. "Open Assistant LLaMA-13B" is also highly capable, similar to Vicuna. LLaMA-33B is 20GB in 4-bit and more capable fine-tunes of it are coming. 20GB runs on a single consumer GPU at faster speeds than GPT-Turbo. So running models locally is really not an issue. You can also run 65B split across two 24GB consumer GPUs at high speeds. |
Quick question about "Open Assistant LLaMA-13B": is it as good as Vicuna when it comes to back-and-forth chat specifically?
|
Thanks for this, I'll be looking into it! |
I think Vicuna (7B for "most" people) is the best option right now, and if you don't have enough VRAM, the .cpp variants (llama.cpp and friends) are good since they use normal RAM, but from what I've seen and played with they still have limited GPU support (and CPU-only is much slower). One that can fully combine .cpp with GPU/CUDA support (though it would be nice if it could run on non-NVIDIA hardware too) would be king, but many other optimizations are also available. RAM is cheap and so is storage: just go pick up an 18TB drive (like I did) for ~$270 ($15/TB, amazing) and you're good. Next I want to add more RAM. I have 32GB, which turns out is not enough for what I want to do right now (30B+ models), so I'm planning to add 64GB more. But obviously things will get more optimized over time and there are new advancements daily. Tbh, it's hard to keep up. Also, it's worth checking out Alpaca-Electron, which is amazingly simple to get running and is basically a local ChatGPT clone. It doesn't work for my purposes because it doesn't expose an API, but it's really cool regardless. https://github.com/ItsPi3141/alpaca-electron Additionally, if you have ChatGPT API access and want more advanced features, like the $20 Plus subscription on steroids (but you only pay for what you use, and gpt-3.5-turbo is cheap), check out BetterChatGPT. It doesn't run locally, as it's basically a front-end clone of ChatGPT with extended features, but I think it would be an awesome self-hosted front-end app if it simply pointed to a local instance (and still had the ability to use the backend local API/agents for automation, etc.). |
That's amazing! I'm just not sure if I want to drop $3K+ on GPUs alone (then a higher-power PSU, cooling, etc.). Lol. If I can scale my bots and monetize AI more, maybe. Might just need another dedicated AI rig, but I'd need to rationalize $6K. Insane, but truly awesome. |
For CPU, this fork of Alpaca-Turbo exposes an OpenAI-compatible API which can be used with Auto-GPT: https://github.com/alexanderatallah/Alpaca-Turbo#using-the-api For GPU, Vicuna-7B and Vicuna-13B are fully supported and can be used with Basaran to expose an OpenAI-compatible generation API.
As mentioned above, you can run 7B and 13B models on CPU at usable speeds with just 8GB/16GB of RAM respectively. For GPU, a 24GB Nvidia P40 is $200 and can support up to 33B parameters in 4-bit at high speeds. Two $200 P40s can run 65B at fairly high speeds. You do not need a $3000 or even $800 GPU to run the largest LLaMA models. For GPU models, use Basaran to expose an OpenAI-completion-compatible API for use with Auto-GPT and/or any project made for GPT-3 or GPT-4. |
That's great, really good suggestion, thank you! I was naively only thinking of a 4090 for 24GB VRAM. I have an old T5500 (my old PC) that I didn't have a purpose for; it supports 4x GPUs (but space is limited, so a riser would be needed and I'm not sure it'll fit), has 96GB RAM and an 875W PSU. I just dropped $385 on 2x K80s and a 16TB HDD. I searched eBay, got a list of cheap used NVIDIA GPUs with 24GB+ VRAM, and compiled a simple list of specs (if anyone is interested, see the list below). I'm going to build an AI home lab and you just made my day, sir!
Sources: |
I think you'll find you need Pascal (P40) or newer to run models in 4bit. |
So basically, because the others don't list half-precision specs, the only one on the list that will likely work with 4-bit quantization is the P40? |
To my knowledge, yes. |
And I found this.. "NVIDIA Tesla P40 GPU supports mixed precision training" and the "NVIDIA Tesla K80 GPU is based on the Kepler architecture, which does not have Tensor Cores. Therefore, it does not support mixed precision training natively." Thanks. |
I cancelled the order and bought two P40s instead. Thanks, you saved me a headache. |
For me the privacy issue is a major concern. The GPT-4 TOS basically makes them zero responsible under any circumstances, so what if someone gets access to all your information going back and forth to the GPT-4 service? Auto-GPT is a great project, but I do not want my data in Microsoft's grubby little hands, and I think GPT-4 being all but in name owned by Bill Gates is enough reason to want everything run locally, thank you. |
The open PR #2594 would resolve this issue for LLaMA-based models and go a long way towards supporting all models. It adds a configurable API server URL and embeddings options for LLaMA models. |
So much this. I mean, it's inevitable that one way or another the ClosedAIs are gonna scan the shit out of us like it happened with the open/free (not anymore) internet through tracking and the likes, but still! |
"offline API" is a recurring topic here, and some other folks mentioned the lack of "learning", where Auto-GPT keeps looking for information (thinking) that it should already have. Obviously, that won't "copy" all of ChatGPT locally, but it might be a good starting point to use a proxy LLM, especially for folks having to re-run the same agent(s) over and over again, because queries/prompts and responses would likely to be pretty similar: #347 Thoughts ? |
This is the goal of Issue #25 and pull request #2594 should do this, based on a fork which already supports private/local/offline LLMs. |
#114 is one issue about Auto-GPT looking for information it should in theory already have. |
So cool. |
It's DGdev91 :) So, matatonic made basically a fork of oobabooga's WebUI with an OpenAI-like API. Pretty cool! |
Could you elaborate on the |
It should work with --extensions openai. Also, those changes have recently been merged, so you can do that in oobabooga's web UI. Tried that myself using Vicuna-13B; it tried to execute the "command name" command, which of course is wrong. The new openai extension itself works just fine on my fork (which I hope gets merged soon). |
Did the cards arrive? |
Closing based on comment #348 (comment). If the issue persists, please reopen or create a new bug. |
Why close? |
PR #2594 has recently been merged, and it is now possible to use any external service which exposes an LLM over the same API used by OpenAI (but not every local LLM is good enough to work in the same way as GPT-3.5/4). |
This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days. |
This issue was closed automatically because it has been stale for 10 days with no activity. |
Duplicates
Summary 💡
Implement "Fully Air-Gapped Offline Auto-GPT" functionality that allows users to run Auto-GPT without any internet connection, relying on local models and embeddings.
This feature depends on the completion of the feature requests #347 Support Local Embeddings and #25 Suggestion: Add ability to choose LLM endpoint.
Examples 🌈
Motivation 🔦