
💡Fully Air-Gapped Offline Auto-GPT #348

Closed
1 task done
MarkSchmidty opened this issue Apr 6, 2023 · 33 comments
Labels
enhancement New feature or request Stale

Comments

@MarkSchmidty

MarkSchmidty commented Apr 6, 2023

Duplicates

  • I have searched the existing issues

Summary 💡

Implement "Fully Air-Gapped Offline Auto-GPT" functionality that allows users to run Auto-GPT without any internet connection, relying on local models and embeddings.

This feature depends on the completion of feature requests #347 (Support Local Embeddings) and #25 (Suggestion: Add ability to choose LLM endpoint).

Examples 🌈

  1. Secure facilities with strict data access controls can utilize Auto-GPT without risking sensitive information leaks.
  2. Researchers working in remote locations without internet access can continue to experiment with and use Auto-GPT.
  3. Users who prefer not to rely on cloud services can run Auto-GPT entirely within their local environment.
  4. Robots can make decisions and communicate using LLMs even when they can't connect to the internet.

Motivation 🔦

  1. Enhanced Security: Fully air-gapped Auto-GPT ensures that no sensitive data is transmitted over the internet, protecting it from potential leaks or unauthorized access.
  2. Offline Capabilities: This feature allows users to leverage Auto-GPT's power in environments with limited or no internet access, expanding its potential use cases.
  3. Reduced Dependency: By removing reliance on external servers, users gain greater control over their data and can manage resources more efficiently.
  4. Cost Savings: By not requiring cloud-based services, users can save on costs related to storage, data transfer, and processing.
  5. Customization and Control: Users can manage their own models and embeddings, tailoring the system to their specific needs and preferences.
@MarkSchmidty MarkSchmidty changed the title Fully Air-Gapped Offline Auto-GPT 💡Fully Air-Gapped Offline Auto-GPT Apr 6, 2023
@Torantulino Torantulino added the enhancement New feature or request label Apr 6, 2023
@9cento

9cento commented Apr 8, 2023

+1

@Mikec78660

Would be great to use an offline TTS as well. Larynx or something.

@Stonedge

Stonedge commented Apr 9, 2023

Would be great to use an offline TTS as well. Larynx or something.

tortoise-tts could also be an interesting option for local voice

@MarkSchmidty
Author

MarkSchmidty commented Apr 10, 2023

Dropping this here for those who don't know: You can serve any model as an OpenAI-compatible API endpoint with Basaran: https://github.com/hyperonym/basaran

"Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models." 

 

@waynehamadi
Contributor

@MarkSchmidty yeah this has to happen eventually.
For the embeddings:

Another option is no embeddings at all; after all, you can do a basic BM25 search and get pretty good results. Redis supports search, which is not very scalable but will do for now, and we already have a Redis instance implemented.
Weaviate supports both vector search and keyword search, so this would definitely be a great addition.
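To make the no-embeddings idea concrete, here is a minimal BM25 sketch using the third-party rank_bm25 package (the "memories" are made-up placeholders, and this is not Auto-GPT's actual memory code); it shows that plain keyword scoring can surface relevant context with no embedding model at all:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy "memory" of past agent observations (placeholder text, not real Auto-GPT data).
memories = [
    "The user asked for a summary of the quarterly sales report.",
    "Downloaded dataset saved to /data/sales_q3.csv.",
    "API key for the weather service is stored in the config file.",
]

# Index the memories by their lowercased tokens.
bm25 = BM25Okapi([m.lower().split() for m in memories])

# Retrieve the single most relevant memory for a query by BM25 keyword score.
query = "where did we save the sales data".split()
print(bm25.get_top_n(query, memories, n=1))
```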

For the large language model:

  • That's the tricky part; even if you host something locally, it's going to be at least 20GB as we speak (Vicuna is the best/lightest I know of).

Anyone want to chime in and suggest light LLMs that can run locally?

@MarkSchmidty
Author

Vicuna-13B and other 13B fine-tunes in 4-bit are only 8GB and even run purely on CPU at useful speeds. "Open Assistant LLaMA-13B" is also highly capable, similar to Vicuna.

LLaMA-33B is 20GB in 4-bit, and more capable fine-tunes of it are coming. 20GB runs on a single consumer GPU at faster speeds than GPT-Turbo. So running models locally is really not an issue.

You can also run 65B split across two 24GB consumer GPUs at high speeds.

@9cento

9cento commented Apr 10, 2023 via email

@aliasfoxkde

Dropping this here for those who don't know: You can serve any model as an OpenAI-compatible API endpoint with Basaran: https://github.com/hyperonym/basaran

"Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models." 

 

Thanks for this, I'll be looking into it!

@aliasfoxkde

@MarkSchmidty yeah this has to happen eventually. For the embeddings:

Another option is no embeddings at all; after all, you can do a basic BM25 search and get pretty good results. Redis supports search, which is not very scalable but will do for now, and we already have a Redis instance implemented. Weaviate supports both vector search and keyword search, so this would definitely be a great addition.

For the large language model:

  • That's the tricky part; even if you host something locally, it's going to be at least 20GB as we speak (Vicuna is the best/lightest I know of).

Anyone want to chime in and suggest light LLMs that can run locally?

I think Vicuna (7B for "most" people) is the best option right now, and if you don't have enough VRAM, the CPP (llama.cpp-style) variants are good (they use normal RAM), but from what I've seen and played with they still have limited GPU support (and CPU-only is much slower). One that can use CPP + GPU/CUDA support fully (though it would be nice if you could run on non-NVIDIA) would be king (but many other optimizations are also available). RAM is cheap and so is storage: just go pick up an 18TB drive (like I did) for ~$270 ($15/TB, amazing) and you're good. Next I want to add more RAM. I have 32GB, which turns out not to be enough for what I want to do right now (30B+ models), so I'm planning to add 64GB more. But obviously things will get more optimized over time and there are new advancements daily. Tbh, it's hard to keep up.

Also, it's worth checking out Alpaca-Electron, which is amazingly simple to get running and is basically a local ChatGPT clone. It doesn't work for my purposes because it doesn't expose an API, but it's really cool regardless.

https://github.com/ItsPi3141/alpaca-electron
https://www.youtube.com/watch?v=KopKQDmGk_o

Additionally, if you have ChatGPT API access and want more advanced features, like the $20 Plus subscription on steroids (but you only pay for what you use, and gpt-3.5-turbo is cheap), check out BetterChatGPT. It doesn't run locally, as it's basically a front-end clone of ChatGPT with extended features, but I think it would be an awesome self-hosted front-end app if you simply pointed it to a local instance (and still had the ability to use the backend local API/agents for automation, etc.).

https://github.com/ztjhz/BetterChatGPT

@aliasfoxkde

Vicuna-13B and other 13B fine-tunes in 4-bit are only 8GB and even run purely on CPU at useful speeds. "Open Assistant LLaMA-13B" is also highly capable, similar to Vicuna.

LLaMA-33B is 20GB in 4-bit, and more capable fine-tunes of it are coming. 20GB runs on a single consumer GPU at faster speeds than GPT-Turbo. So running models locally is really not an issue.

You can also run 65B split across two 24GB consumer GPUs at high speeds.

That's amazing!

I'm just not sure if I want to drop $3K+ on GPUs alone (then a higher-power PSU, cooling, etc.). Lol.

If I can scale my bots and monetize AI more, maybe. I might just need another dedicated AI rig, but I'd need to rationalize $6K. Insane, but truly awesome.

@MarkSchmidty
Author

MarkSchmidty commented Apr 13, 2023

@MarkSchmidty yeah this has to happen eventually. For the embeddings:

Another option is no embeddings at all; after all, you can do a basic BM25 search and get pretty good results. Redis supports search, which is not very scalable but will do for now, and we already have a Redis instance implemented. Weaviate supports both vector search and keyword search, so this would definitely be a great addition.
For the large language model:

  • That's the tricky part; even if you host something locally, it's going to be at least 20GB as we speak (Vicuna is the best/lightest I know of).

Anyone want to chime in and suggest light LLMs that can run locally?

I think Vicuna (7B for "most" people) is the best option right now, and if you don't have enough VRAM, the CPP (llama.cpp-style) variants are good (they use normal RAM), but from what I've seen and played with they still have limited GPU support (and CPU-only is much slower). One that can use CPP + GPU/CUDA support fully (though it would be nice if you could run on non-NVIDIA) would be king (but many other optimizations are also available). RAM is cheap and so is storage: just go pick up an 18TB drive (like I did) for ~$270 ($15/TB, amazing) and you're good. Next I want to add more RAM. I have 32GB, which turns out not to be enough for what I want to do right now (30B+ models), so I'm planning to add 64GB more. But obviously things will get more optimized over time and there are new advancements daily. Tbh, it's hard to keep up.

Also, it's worth checking out Alpaca-Electron, which is amazingly simple to get running and is basically a local ChatGPT clone. It doesn't work for my purposes because it doesn't expose an API, but it's really cool regardless.

ItsPi3141/alpaca-electron
youtube.com/watch?v=KopKQDmGk_o

Additionally, if you have ChatGPT API access and want more advanced features, like the $20 Plus subscription on steroids (but you only pay for what you use, and gpt-3.5-turbo is cheap), check out BetterChatGPT. It doesn't run locally, as it's basically a front-end clone of ChatGPT with extended features, but I think it would be an awesome self-hosted front-end app if you simply pointed it to a local instance (and still had the ability to use the backend local API/agents for automation, etc.).

ztjhz/BetterChatGPT

For CPU, this fork of Alpaca-Turbo exposes an OpenAI-compatible API which can be used with Auto-GPT: https://github.com/alexanderatallah/Alpaca-Turbo#using-the-api

For GPU, Vicuna-7B and Vicuna-13B are fully supported and can be used with Basaran to expose an OpenAI-compatible generation API.

That's amazing!

I'm just not sure if I want to drop $3K+ on GPUs alone (then a higher-power PSU, cooling, etc.). Lol.

If I can scale my bots and monetize AI more, maybe. I might just need another dedicated AI rig, but I'd need to rationalize $6K. Insane, but truly awesome.

As mentioned above, you can run 7B and 13B models on CPU at usable speeds with just 8GB/16GB of RAM respectively.

For GPU, a 24GB Nvidia P40 is $200 and can support up to 33B parameters in 4-bit at high speeds. Two $200 P40s can run 65B at fairly high speeds. You do not need a $3000 or even $800 GPU to run the largest LLaMA models.

For GPU models, use Basaran to expose an OpenAI-compatible completion API for use with Auto-GPT and/or any project made for GPT-3 or GPT-4.

@aliasfoxkde

aliasfoxkde commented Apr 14, 2023

For GPU, a 24GB Nvidia P40 is $200 and can support up to 33B parameters in 4-bit at high speeds. Two $200 P40s can run 65B at fairly high speeds. You do not need a $3000 or even $800 GPU to run the largest LLaMA models.

For GPU models, use Basaran to expose an OpenAI-compatible completion API for use with Auto-GPT and/or any project made for GPT-3 or GPT-4.

That's great, a really good suggestion, thank you! I was naively only thinking of a 4090 for 24GB of VRAM.

I have an old T5500 (my old PC) that I didn't have a purpose for; it supports 4x GPUs (but space is limited, so a riser would be needed and I'm not sure it'll fit), has 96GB of RAM and an 875W PSU. I just dropped $385 on 2x K80s and a 16TB HDD. I searched eBay, got a list of cheap used NVIDIA GPUs with 24GB+ VRAM, and compiled a simple list of specs (if anyone is interested, see the list below). I'm going to build an AI home lab and you just made my day, sir!

Name      | VRAM | G3D Mark | FP16 (half) | FP32 (float) | FP64 (double) | Bandwidth  | TDP  | Cost
Tesla K80 | 24GB | 7,025    | ?           | 4.113 TF     | 1,371 GF      | 240.6 GB/s | 300W | $80
Tesla M40 | 24GB | 10,212   | ?           | 6.832 TF     | 213.5 GF      | 288.4 GB/s | 250W | $160
Tesla P40 | 24GB | 16,864   | 183.7 GF    | 11.76 TF     | 367.4 GF      | 694.3 GB/s | 250W | $200
Tesla M10 | 32GB | 3,490    | ?           | 1.672 TF     | 52.24 GF      | 83.20 GB/s | 225W | $250

Sources:
https://www.techpowerup.com/
https://www.videocardbenchmark.net/GPU_mega_page.html

@MarkSchmidty
Author

I think you'll find you need Pascal (P40) or newer to run models in 4-bit.

@aliasfoxkde

I think you'll find you need Pascal (P40) or newer to run models in 4-bit.

So basically, because the others don't list half-precision specs, the only one on the list that will likely work with 4-bit quantization is the P40?

@MarkSchmidty
Author

To my knowledge, yes.

@aliasfoxkde

And I found this: "NVIDIA Tesla P40 GPU supports mixed precision training" and the "NVIDIA Tesla K80 GPU is based on the Kepler architecture, which does not have Tensor Cores. Therefore, it does not support mixed precision training natively." Thanks.

@aliasfoxkde

To my knowledge, yes.

I cancelled the order and bought two P40s instead. Thanks, you saved me a headache.

@drogongod

drogongod commented Apr 21, 2023

For me, the privacy issue is a major concern. The GPT-4 TOS basically makes them zero percent responsible under any circumstances, so what if someone gets access to all of your information going back and forth to the GPT-4 service? Auto-GPT is a great project, but I do not want my data in Microsoft's grubby little hands, and I think GPT-4 being all but in name owned by Bill Gates is reason enough to want everything run locally, thank you.

@MarkSchmidty
Author

The open PR #2594 would resolve this issue for LLaMA-based models and go a long way towards supporting all models.

It adds a configurable API server URL and embedding options for LLaMA models.

@9cento

9cento commented Apr 21, 2023

For me, the privacy issue is a major concern. The GPT-4 TOS basically makes them zero percent responsible under any circumstances, so what if someone gets access to all of your information going back and forth to the GPT-4 service? Auto-GPT is a great project, but I do not want my data in Microsoft's grubby little hands, and I think GPT-4 being all but in name owned by Bill Gates is reason enough to want everything run locally, thank you.

So much this. I mean, it's inevitable that one way or another the ClosedAIs are gonna scan the shit out of us like it happened with the open/free (not anymore) internet through tracking and the like, but still!

@Boostrix
Contributor

Boostrix commented May 1, 2023

"offline API" is a recurring topic here, and some other folks mentioned the lack of "learning", where Auto-GPT keeps looking for information (thinking) that it should already have.
These two might be tackled together if a pre-trained LLM (any!) is put in between OpenAI/ChatGPT and the Auto-GPT script - that way, the intermediate LLM would be serving as a "proxy" and would be transparently trained behind the scenes, basically comparing queries with desired outputs and self-improving over time.

Obviously, that won't "copy" all of ChatGPT locally, but it might be a good starting point to use a proxy LLM, especially for folks having to re-run the same agent(s) over and over again, because queries/prompts and responses would likely be pretty similar: #347

Thoughts ?
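To make the proxy idea a bit more concrete, here is a very rough sketch of the simplest possible version: a prompt-keyed cache that only falls through to the remote API on a miss. Everything here is hypothetical (complete_remote is a stand-in for the real OpenAI call, and there is no actual training or self-improvement); it just shows where such a layer would sit:

```python
import hashlib
import json
import os

CACHE_DIR = "llm_proxy_cache"  # hypothetical on-disk cache location
os.makedirs(CACHE_DIR, exist_ok=True)

def complete_remote(prompt: str) -> str:
    # Stand-in for the real call to OpenAI/ChatGPT (or any other backend LLM).
    raise NotImplementedError("wire this up to your actual completion API")

def cached_complete(prompt: str) -> str:
    """Return a cached completion for a previously seen prompt, else ask the backend."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["completion"]
    completion = complete_remote(prompt)
    with open(path, "w") as f:
        json.dump({"prompt": prompt, "completion": completion}, f)
    return completion
```

The accumulated prompt/completion pairs are also exactly the kind of data you would need to fine-tune a local model over time, which is where the "self-improving" part of the idea would come in.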

@MarkSchmidty
Author

"offline API" is a recurring topic here, and some other folks mentioned the lack of "learning", where Auto-GPT keeps looking for information (thinking) that it should already have. These two might be tackled together if a pre-trained LLM (any!) is put in between OpenAI/ChatGPT and the Auto-GPT script - that way, the intermediate LLM would be serving as a "proxy" and would be transparently trained behind the scenes, basically comparing queries with desired outputs and self-improving over time.

Obviously, that won't "copy" all of ChatGPT locally, but it might be a good starting point to use a proxy LLM, especially for folks having to re-run the same agent(s) over and over again, because queries/prompts and responses would likely be pretty similar: #347

Thoughts ?

This is the goal of issue #25, and pull request #2594 should do this; it is based on a fork which already supports private/local/offline LLMs.

@Boostrix
Contributor

Boostrix commented May 1, 2023

#114 is one issue about Auto-GPT looking for information it should in theory already have.

@GitHub1712

GitHub1712 commented May 1, 2023

So cool.
Auto-GPT just works with a local model on text-generation-webui out of the box:
Run the matatonic/text-generation-webui server with --openai
Run Gdev91/Auto-GPT with OPENAI_API_BASE_URL=http://127.0.0.1:5001/ in .env
I used EMBED_DIM=5120 in .env with eachadea_vicuna-13b-1.1 so far.
So we are fully in control of a local model chatgroup now, cheers ;)
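For anyone using a different model and unsure what to set EMBED_DIM to, one quick way to check is to ask the local endpoint for a single embedding and look at its length. This is only a sketch and assumes the openai extension exposes /v1/embeddings on the same port; the base URL and model name are whatever you configured:

```python
import openai

# Point the client at the local text-generation-webui openai extension
# (URL, port, and the /v1 path are assumptions; adjust to your setup).
openai.api_base = "http://127.0.0.1:5001/v1"
openai.api_key = "dummy"

resp = openai.Embedding.create(model="local-model", input=["hello world"])
# The vector length is the value to use for EMBED_DIM (e.g. 5120 for a Vicuna-13B backend).
print(len(resp["data"][0]["embedding"]))
```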

@DGdev91
Contributor

DGdev91 commented May 1, 2023

So cool. Auto-GPT just works with a local model on text-generation-webui out of the box: Run the matatonic/text-generation-webui server with --openai. Run Gdev91/Auto-GPT with OPENAI_API_BASE_URL=http://127.0.0.1:5001/ in .env. I used EMBED_DIM=5120 in .env with eachadea_vicuna-13b-1.1 so far. So we are fully in control of a local model chatgroup now, cheers ;)

It's DGdev91 :)
I'm glad my fork is useful!

So, matatonic's is basically a fork of oobabooga's WebUI with an OpenAI-like API. Pretty cool!
I'm definitely going to try that!

@0xSalim

0xSalim commented May 2, 2023

So cool. Auto-GPT just works with a local model on text-generation-webui out of the box: Run the matatonic/text-generation-webui server with --openai. Run Gdev91/Auto-GPT with OPENAI_API_BASE_URL=http://127.0.0.1:5001/ in .env. I used EMBED_DIM=5120 in .env with eachadea_vicuna-13b-1.1 so far. So we are fully in control of a local model chatgroup now, cheers ;)

Could you elaborate on the --openai argument, please? I can't manage to use it.

@DGdev91
Contributor

DGdev91 commented May 4, 2023

Could you elaborate on the --openai argument, please? I can't manage to use it.

It should work with --extensions openai.
Or you can enable it manually in the "interface mode" tab.
Maybe the command changed after some development on the extension.

Also, those changes have recently been merged, so you can do that in oobabooga's web UI.

I tried that myself using Vicuna-13B; it tried to execute the "command name" command, which of course is wrong.
But that's a known problem; LLaMA-based LLMs often get confused. That's why I proposed a way to customize the prompt without touching the code: #3375

That new openai extension itself works just fine on my fork (which I hope gets merged soon).

@Subcode

Subcode commented May 15, 2023

To my knowledge, yes.

I cancelled the order and bought two P40s instead. Thanks, you saved me a headache.

Did the cards arrive?
And if yes, which seller did you use? eBay always seems a bit risky to me.

@lc0rp
Contributor

lc0rp commented Jun 13, 2023

Closing based on comment #348 (comment).

If the issue persists, please reopen or create a new bug.

@lc0rp lc0rp closed this as completed Jun 13, 2023
@jnt0rrente

Why close?

@DGdev91
Contributor

DGdev91 commented Jun 13, 2023

Why close?

PR #2594 has recently been merged, and it is now possible to use any external service which exposes an LLM over the same API used by OpenAI (but not every local LLM is good enough to work in the same way as GPT-3.5/4).

waynehamadi pushed a commit that referenced this issue Sep 5, 2023
Co-authored-by: SwiftyOS <craigswift13@gmail.com>
@ntindle ntindle reopened this Feb 5, 2024
Contributor

This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.

@github-actions github-actions bot added the Stale label Mar 27, 2024
Contributor

github-actions bot commented Apr 7, 2024

This issue was closed automatically because it has been stale for 10 days with no activity.

@github-actions github-actions bot closed this as not planned Apr 7, 2024