Is it possible to replace Chat-GPT with local offline model? For example, gpt4all, LLaMa, etc #2158

Neronjust2017 · 2023-04-17T11:19:09Z

Duplicates

I have searched the existing issues

Summary 💡

Is it possible to replace ChatGPT with a local offline model? For example, gpt4all, LLaMA, etc.

Examples 🌈

No response

Motivation 🔦

No response

mendeltem · 2023-04-17T11:24:53Z

An excellent suggestion would be to use local models. AUTOGPT can utilize these models to train individuals to become experts and also develop several agents simultaneously. Additionally, it can train and facilitate communication between these models.

arronKler · 2023-04-17T13:38:57Z

just replace the request to openai with your own models service in llm_utils.py. But the embedding part may need to keep using openai's embedding api.

Neronjust2017 · 2023-04-17T14:02:17Z

maybe I should ask Auto-GPT to analyse itself and figure out how to replace openai api and chat-gpt with local LLM models, lol.

zachary-kaelan · 2023-04-17T19:29:16Z

Depends on if you have a GPU cluster and 64 GB or so of RAM to run anything comparable at a reasonable speed. You also gotta calculate impact on your electric bill.

matthewniemeier · 2023-04-17T20:57:48Z

> Depends on if you have a GPU cluster and 64 GB or so of RAM to run anything comparable at a reasonable speed. You also gotta calculate impact on your electric bill.

There's a ton of smaller ones that can run relatively efficiently.
Glance at the ones the issue author noted.

GPT4All | LLaMA

Neronjust2017 · 2023-04-18T01:36:19Z

> There's a ton of smaller ones that can run relatively efficiently.
> Glance at the ones the issue author noted.

You are right. https://github.com/nomic-ai/pyllamacpp provides officially supported Python bindings for LLaMA and gpt4all. Maybe I can use the Python API in llm_utils.py to replace the OpenAI-related API.

matthewniemeier · 2023-04-18T02:50:08Z

> > There's a ton of smaller ones that can run relatively efficiently.
> > Glance at the ones the issue author noted.
>
> You are right. https://github.com/nomic-ai/pyllamacpp provides officially supported Python bindings for LLaMA and gpt4all. Maybe I can use the Python API in llm_utils.py to replace the OpenAI-related API.

New repo to browse just dropped fam 😅👌

Neronjust2017 · 2023-04-18T09:51:59Z

> just replace the request to openai with your own models service in llm_utils.py. But the embedding part may need to keep using openai's embedding api.

Can I avoid using OpenAI's embedding APIs? The network connection to OpenAI cannot always be established, and I want to be totally offline.

talvasconcelos · 2023-04-18T10:06:01Z

I wouldn't mind if the agents were from ChatGPT, the rest (i assume the fast_llm is the one the user interacts with) would be a local gpt4all, for example!

mendeltem · 2023-04-18T11:51:06Z

It would be really great to have multiple online expert agents, all of them open source. Each should be a smaller model but specialized to some degree.

For example, Auto creates an Agent designed to have 10% expertise and 90% general knowledge for brainstorming.
This Agent can create another Agent with a focus on finance, with 80% expertise but 20% general knowledge.
The Finance Agent creates a Model that doesn't exist yet to solve math and statistics problems that don't exist yet.

These agents would communicate with each other to train and improve themselves, leveraging collective knowledge and expertise to continuously enhance their capabilities.

I would really love to have these agents in the hand of the open source community.

zachary-kaelan · 2023-04-18T14:02:20Z

> There's a ton of smaller ones that can run relatively efficiently. Glance at the ones the issue author noted.
>
> GPT4All | LLaMA

LLaMA requires 14 GB of GPU memory for the model weights on the smallest, 7B model, and with default parameters, it requires an additional 17 GB for the decoding cache (I don't know if that's necessary).

But GPT4All called me out big time with their demo being them chatting about the smallest model's memory requirement of 4 GB. I've never heard of machine learning using 4-bit parameters before, but the math checks out. You'd have to feed it something like this to verify its usability. The full, better performance model on GPU requires 16 GB RAM.
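The 4-bit arithmetic can be checked in a few lines. This is only a rough sketch: it counts weight storage alone (no decoding cache, activations, or per-block quantization overhead, which may be part of why real 4-bit files come out closer to 4 GB), and the helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate storage for the model weights alone, in GB (10^9 bytes)."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # 7B parameters at 16-bit (fp16) versus 4-bit quantization.
    for bits in (16, 4):
        print(f"7B parameters at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

At 16 bits this gives the 14 GB figure quoted above; at 4 bits it gives roughly a quarter of that.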

My biggest concern would be the context window size. Both of those look like they're limited to 2048 tokens. The full example AutoGPT prompt is a third of that.

Neronjust2017 · 2023-04-18T14:49:43Z

> There's a ton of smaller ones that can run relatively efficiently. Glance at the ones the issue author noted.
> GPT4All | LLaMA
>
> LLaMA requires 14 GB of GPU memory for the model weights on the smallest, 7B model, and with default parameters, it requires an additional 17 GB for the decoding cache (I don't know if that's necessary).
>
> But GPT4All called me out big time with their demo being them chatting about the smallest model's memory requirement of 4 GB. I've never heard of machine learning using 4-bit parameters before, but the math checks out. You'd have to feed it something like this to verify its usability. The full, better performance model on GPU requires 16 GB RAM.
>
> My biggest concern would be the context window size. Both of those look like they're limited to 2048 tokens. The full example AutoGPT prompt is a third of that.

Yeah. AutoGPT generates a lot of tokens in its questions during the thinking process. Besides, one thing I'm worried about is that the chat models provided by OpenAI, such as gpt-3.5-turbo, take a series of messages as input, as described in https://platform.openai.com/docs/guides/chat/introduction. For example, an API call to openai.ChatCompletion.create looks as follows:

# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai

openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
        {"role": "user", "content": "Where was it played?"}
    ]
)

Messages must be an array of message objects, where each object has a role (either "system", "user", or "assistant") and content (the content of the message). The system, user, and assistant messages and the conversation history help build contextual information for the user's final question. However, I have reviewed the source code of https://github.com/nomic-ai/pyllamacpp (the official Python API for the gpt4all model) and its usage. There seem to be no similar input arguments for gpt4all, which I guess means the conversation history, roles, and context information are missing when chatting with a gpt4all model.

djmaze · 2023-04-18T14:56:01Z

If you want a conversational model, you should probably use Vicuna (based on llama). It supports the human and assistant roles (via string prefixes).
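To make the "roles via string prefixes" idea concrete, here is a minimal sketch of flattening an OpenAI-style message list into one prompt string. The prefixes `USER:` and `ASSISTANT:` are an assumption made for illustration; the exact strings depend on the model version, so check the model's own documentation. `messages_to_prompt` is a hypothetical helper, not part of llama.cpp or PyLLaMACpp.

```python
from typing import Dict, List

# Assumed role prefixes; the real ones depend on the model you run.
ROLE_PREFIXES = {"system": "", "user": "USER: ", "assistant": "ASSISTANT: "}


def messages_to_prompt(messages: List[Dict[str, str]]) -> str:
    """Flatten chat messages into a single prompt for a plain-text model."""
    lines = []
    for message in messages:
        prefix = ROLE_PREFIXES.get(message["role"], "")
        lines.append(f"{prefix}{message['content']}")
    # End with the bare assistant prefix so the model continues as the assistant.
    lines.append(ROLE_PREFIXES["assistant"].rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(messages_to_prompt([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]))
```

The conversation history is then carried by the prompt text itself rather than by a structured argument.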

Also, with llama.cpp (or PyLLaMACpp), the memory usage is really low because the models are quantized to 4 bit. Vicuna 13b needs about 5 GB on my machine.

Neronjust2017 · 2023-04-18T15:38:13Z

> If you want a conversational model, you should probably use Vicuna (based on llama). It supports the human and assistant roles (via string prefixes).
>
> Also, with llama.cpp (or PyLLaMACpp), the memory usage is really low because the models are quantized to 4 bit. Vicuna 13b needs about 5 GB on my machine.

Thanks! One question: why does Vicuna 13B need only 5 GB on your machine? The FastChat README (https://github.com/lm-sys/FastChat) says 7B needs around 30 GB of CPU RAM and 13B around 60 GB. Did you use --low-cpu-mem?

djmaze · 2023-04-18T15:42:32Z

You need to use llama.cpp (CPU-based) instead of FastChat (GPU-based). FastChat (the original) is more accurate because it operates in floating point, but it also needs much more RAM. Plus some additional tweaks that llama.cpp makes.

Neronjust2017 · 2023-04-18T16:20:23Z

> You need to use llama.cpp (CPU-based) instead of FastChat (GPU-based). FastChat (the original) is more accurate because it operates in floating point, but it also needs much more RAM. Plus some additional tweaks that llama.cpp makes.

So what I should do is obtain the original LLaMA model weights (and the Vicuna delta weights), put them in the right place, convert the 7B or 13B model to ggml FP16 or INT4 format, and finally run inference, right? With INT4 quantization, the 7B model shrinks to 3.9 GB. Also, FastChat provides an API similar to OpenAI's, for example:

import os
from fastchat import client

client.set_baseurl(os.getenv("FASTCHAT_BASEURL"))

completion = client.ChatCompletion.create(
  model="vicuna-7b-v1.1",
  messages=[
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message)

Can I specify roles, such as "user" and "assistant", on the command line using llama.cpp? Can I use fastchat.client.ChatCompletion.create to run inference on the quantized model? Besides, does FastChat provide an API to create embeddings, like openai.Embedding.create? Does it provide open-source embedding models comparable to text-embedding-ada-002?

keldenl · 2023-04-18T20:53:54Z

i posted this in another related thread but i got autogpt mostly working a couple times but embeddings seems to be the wall im hitting (hardcoded different embedding sizes to make it work)

https://github.com/keldenl/gpt-llama.cpp aims to replace openais api endpoints completely with llama.cpp (including embeddings). but i don't know much about embeddings and seems to work off and on, anybody got any pointers there?

arronKler · 2023-04-19T06:59:04Z

> > just replace the request to openai with your own models service in llm_utils.py. But the embedding part may need to keep using openai's embedding api.
>
> Can I avoid using OpenAI's embedding APIs? The network connection to OpenAI cannot always be established, and I want to be totally offline.

Of course. The embedding API call is used for Weaviate and Milvus; if you don't need those two memory stores, you don't have to do anything for the embedding part. Replacing llm_utils is enough.
@Neronjust2017

djmaze · 2023-04-19T07:47:35Z

@Neronjust2017

> So what I should do is obtain the original LLaMA model weights (and the Vicuna delta weights), put them in the right place, convert the 7B or 13B model to ggml FP16 or INT4 format, and finally run inference, right?

You can get the full ggml at huggingface.

> Can I specify roles, such as "user" and "assistant", on the command line using llama.cpp?

I don't think so. You need to use the roles inside the prompt text as described in Vicuna's documentation.

About the other FastChat questions: don't know, I only used llama.cpp so far.

keldenl · 2023-04-19T08:44:43Z

> i posted this in another related thread but i got autogpt mostly working a couple times but embeddings seems to be the wall im hitting (hardcoded different embedding sizes to make it work)
>
> https://github.com/keldenl/gpt-llama.cpp aims to replace openais api endpoints completely with llama.cpp (including embeddings). but i don't know much about embeddings and seems to work off and on, anybody got any pointers there?

i got autogpt working with llama.cpp! see keldenl/gpt-llama.cpp#2 (comment)

i'm using vicuna for embeddings and generation but it's struggling a bit to generate proper commands to not fall into a infinite loop of attempting to fix itself X( will look into this tmr but super exciting cuz i got the embeddings working! (turns out it was a bug on my end lol)

here's a screenshot 🎉

edit: had to make some changes to autogpt (add base_url to openai_base_url, and adjust the dimensions of the vector, but otherwise left it alone)
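On why the vector dimension needs adjusting: OpenAI's text-embedding-ada-002 returns 1536-dimensional vectors, while a LLaMA-based model produces embeddings as wide as its hidden size, so a memory backend sized for one will not accept the other. Here is a minimal sketch of a sanity check, using the LLaMA sizes reported elsewhere in this thread. The names `LLAMA_EMBED_DIM` and `check_embedding` are hypothetical and not part of Auto-GPT.

```python
from typing import Sequence

# LLaMA embedding sizes as reported in this thread, keyed by model size.
LLAMA_EMBED_DIM = {"7B": 4096, "13B": 5120, "33B": 6656, "65B": 8192}
# Output size of OpenAI's text-embedding-ada-002.
OPENAI_ADA_002_DIM = 1536


def check_embedding(vector: Sequence[float], expected_dim: int) -> None:
    """Raise ValueError if the vector does not match the configured dimension."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"embedding has {len(vector)} dimensions, expected {expected_dim}"
        )


if __name__ == "__main__":
    fake_vector = [0.0] * LLAMA_EMBED_DIM["7B"]  # stand-in for a real embedding
    check_embedding(fake_vector, LLAMA_EMBED_DIM["7B"])
    print("dimension matches")
```

Running a check like this before writing to the memory backend makes a mismatch fail early with a readable message.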

Sorann753 · 2023-04-19T14:00:01Z

the https://github.com/oobabooga/text-generation-webui is able to run in an API mode if you use the flags --listen --no-stream

maybe we could make an option to allow users to use their local webui as a server instead of the OpenAI API? something like adding USE_LOCAL_SERVER=True to the .env and then in a local_llm.yaml file we'd do the configurations needed
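A rough sketch of what such a switch could look like. `USE_LOCAL_SERVER` is the name suggested above. `LOCAL_SERVER_URL` and its default value are assumptions invented for this example, and none of this is existing Auto-GPT configuration.

```python
import os
from typing import Mapping

OPENAI_DEFAULT_BASE_URL = "https://api.openai.com/v1"
# Placeholder default; point it at wherever your local server actually listens.
LOCAL_DEFAULT_BASE_URL = "http://localhost:8080/v1"


def resolve_api_base(env: Mapping[str, str] = os.environ) -> str:
    """Pick the API base URL from environment-style settings."""
    flag = env.get("USE_LOCAL_SERVER", "False").strip().lower()
    if flag in {"1", "true", "yes"}:
        return env.get("LOCAL_SERVER_URL", LOCAL_DEFAULT_BASE_URL)
    return OPENAI_DEFAULT_BASE_URL


if __name__ == "__main__":
    print(resolve_api_base())
```

Taking the settings as a mapping keeps the function easy to test without touching the real environment.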

Since they are making an interface for running all open source LLM and our project is instead to use a LLM to run an agent in autonomy, it might be a good idea to avoid redoing what they already did

Neronjust2017 · 2023-04-19T15:10:58Z

> > i posted this in another related thread but i got autogpt mostly working a couple times but embeddings seems to be the wall im hitting (hardcoded different embedding sizes to make it work)
> > https://github.com/keldenl/gpt-llama.cpp aims to replace openais api endpoints completely with llama.cpp (including embeddings). but i don't know much about embeddings and seems to work off and on, anybody got any pointers there?
>
> i got autogpt working with llama.cpp! see keldenl/gpt-llama.cpp#2 (comment)
>
> i'm using vicuna for embeddings and generation but it's struggling a bit to generate proper commands to not fall into a infinite loop of attempting to fix itself X( will look into this tmr but super exciting cuz i got the embeddings working! (turns out it was a bug on my end lol)
>
> here's a screenshot 🎉
>
> edit: had to make some changes to autogpt (add base_url to openai_base_url, and adjust the dimensions of the vector, but otherwise left it alone)

nice work! what local embedding model did you test?

keldenl · 2023-04-19T17:23:28Z

i just utilized the embeddings.cpp example from llama.cpp with llama-based models!

DGdev91 · 2023-04-19T21:54:20Z

Thanks to @keldenl for his work!

I made a pull request for the changes mentioned by him #2594

keldenl · 2023-04-19T21:55:18Z

beautiful work @DGdev91 ! that does the trick. i'll go ahead and write an extensive guide tonight and link it here as well

keldenl · 2023-04-20T05:35:12Z

✨ FULL GUIDE POSTED on how to get Auto-GPT running with llama.cpp via gpt-llama.cpp in keldenl/gpt-llama.cpp#2 (comment).

Huge shoutout to @DGdev91 for the PR and hope it gets merged soon!

mudler · 2023-04-20T07:26:18Z

Hi 👋

I'm the author of https://github.com/go-skynet/LocalAI, and I'd be glad to help out with whatever is missing to get this working with local models. LocalAI has multiple backends, is multi-model, and keeps things in memory for faster inference. It supports models including gpt4all-j and those supported by llama.cpp, and will also have support for Cerebras.

Happy to jump in!

DGdev91 · 2023-04-21T00:06:55Z

> Out of curiosity, did you test with docker compose? If you are on a Mac I'd suggest compiling from source instead.
> Also, try lowering the number of threads to the number of physical cores in the hardware - by default it takes all the cores available, but that seems to be problematic. For instance, Mac M1 users report that between 4 and 7 is a good spot. On my PC, which has 20 cores of which 14 are physical, 14 is a sweet spot as well.

I tried both, it doesn't seem to make any difference.
I'm not using a Mac, I'm on Linux (Arch). Also, my CPU (Ryzen 7 3800X, which has 8 cores and 16 threads) works better when set to the number of logical threads.

> Another point, though it's more of a suspicion: this may be because keldenl's implementation shells out to the llama.cpp binary, and I suspect it keeps the process open and continues prompting in interactive mode, using a fixed number of cores (4 if I recall correctly). But I'm just speculating here; maybe @keldenl can shed more light on this.

I'm quite sure he uses the max number of threads supported by the CPU; it was running at 100% on my machine.

mudler · 2023-04-21T07:04:53Z

> > Out of curiosity, did you test with docker compose? If you are on a Mac I'd suggest compiling from source instead.
> > Also, try lowering the number of threads to the number of physical cores in the hardware - by default it takes all the cores available, but that seems to be problematic. For instance, Mac M1 users report that between 4 and 7 is a good spot. On my PC, which has 20 cores of which 14 are physical, 14 is a sweet spot as well.
>
> I tried both, it doesn't seem to make any difference. I'm not using a Mac, I'm on Linux (Arch). Also, my CPU (Ryzen 7 3800X, which has 8 cores and 16 threads) works better when set to the number of logical threads.
>
> > Another point, though it's more of a suspicion: this may be because keldenl's implementation shells out to the llama.cpp binary, and I suspect it keeps the process open and continues prompting in interactive mode, using a fixed number of cores (4 if I recall correctly). But I'm just speculating here; maybe @keldenl can shed more light on this.
>
> I'm quite sure he uses the max number of threads supported by the CPU; it was running at 100% on my machine.

Could you please open an issue on LocalAI with your steps? That'd be appreciated! I don't want to hijack the thread here, and what you report looks very weird; possibly something is off. Not sure if it's due to the CPU type at this point. I haven't tried on AMD myself; I have an Intel i7-1280P and here I don't notice any difference. Thanks!

Edit: I've benchmarked things locally, and token inference speed is the same as the llama.cpp CLI. However, I've just updated to the latest llama.cpp code here locally, so I'm not sure if that makes any difference in your case.

keldenl · 2023-04-24T02:30:28Z

@mudler i do use interactive mode, but i dont think it applies to auto-gpt. it detects whether or not you are still in the same chat "conversation" by storing the previous one and comparing. but since auto-gpt requests don't build off each other one after another (instead it goes chat -> different chat agent -> embed -> original chat), interactive mode is pretty useless
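The "store the previous one and compare" check could look roughly like the sketch below. This is a guess at the idea for illustration, not the actual gpt-llama.cpp code, and `continues_conversation` is a made-up name.

```python
from typing import Dict, List

Message = Dict[str, str]


def continues_conversation(previous: List[Message], incoming: List[Message]) -> bool:
    """True if `incoming` starts with every message in `previous` and adds more."""
    if not previous or len(incoming) <= len(previous):
        return False
    return incoming[: len(previous)] == previous


if __name__ == "__main__":
    first = [{"role": "user", "content": "Hello!"}]
    follow_up = first + [
        {"role": "assistant", "content": "Hi there."},
        {"role": "user", "content": "What can you do?"},
    ]
    other_agent = [{"role": "system", "content": "You are a different agent."}]
    print(continues_conversation(first, follow_up))
    print(continues_conversation(first, other_agent))
```

With Auto-GPT's request pattern (chat, then a different chat agent, then an embedding call, then the original chat), the prefix test fails on almost every request, which matches the observation that interactive mode brings little benefit here.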

@DGdev91 i just use whatever llama.cpp comes out of the box, and i think it uses as many cores as it sees fit. in most cases, probably most of them

also, i just merged a ton of fixes yesterday and today that pretty much makes gpt-llama.cpp run infinitely continuously with auto-gpt (fixed all the bugs i could find). now the focus is getting auto-gpt results as good as possible. you can read more in my update in the original thread here: keldenl/gpt-llama.cpp#2 (comment)

DGdev91 · 2023-04-24T17:29:49Z

> Could you please open an issue on LocalAI with your steps? That'd be appreciated! I don't want to hijack the thread here, and what you report looks very weird; possibly something is off. Not sure if it's due to the CPU type at this point. I haven't tried on AMD myself; I have an Intel i7-1280P and here I don't notice any difference. Thanks!
>
> Edit: I've benchmarked things locally, and token inference speed is the same as the llama.cpp CLI. However, I've just updated to the latest llama.cpp code here locally, so I'm not sure if that makes any difference in your case.

I ran some more tests; performance is indeed close to the standalone llama.cpp binary, but it still struggles with AutoGPT. I tried a couple of times, and both times AutoGPT closed itself because of an HTTP timeout.
An explanation could be that llama-based models running on llama.cpp sometimes get stuck when writing long responses.
@keldenl recently added a workaround in his code. It's a bit dirty, but it could help: keldenl/gpt-llama.cpp@12a9567

JavierPorron · 2023-05-01T11:18:30Z

It doesn't work in my case. AutoGPT outputs an HTTP timeout... maybe we could increase the timeout? Isn't there a bottleneck in executing the model via Node?
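Whether Auto-GPT's own timeout can be raised is not settled in this thread. As an illustration of the general idea only, here is a sketch of a client call with a generous timeout and a simple retry, written against the Python standard library. `post_json` and its parameters are hypothetical and not part of Auto-GPT or the openai package.

```python
import json
import socket
import time
import urllib.error
import urllib.request


def post_json(url, payload, timeout_s=600.0, retries=2, backoff_s=1.0,
              opener=urllib.request.urlopen):
    """POST a JSON payload and return the decoded JSON reply.

    Retries on timeouts and connection errors, waiting a little longer each time.
    `opener` is injectable so the function can be exercised without a server.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    for attempt in range(retries + 1):
        try:
            with opener(request, timeout=timeout_s) as response:
                return json.loads(response.read().decode("utf-8"))
        except (socket.timeout, urllib.error.URLError):
            if attempt == retries:
                raise  # out of attempts: surface the original error
            time.sleep(backoff_s * (attempt + 1))
```

A slow local model can legitimately take minutes per reply, so the default timeout here is deliberately long; tune it to your hardware.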

mmfhmm123 · 2023-05-10T06:35:36Z

> just replace the request to openai with your own models service in llm_utils.py. But the embedding part may need to keep using openai's embedding api.

Can we replace the OpenAI embedding API with an open-source embedding model from Hugging Face?

arronKler · 2023-05-11T07:51:30Z

> > just replace the request to openai with your own models service in llm_utils.py. But the embedding part may need to keep using openai's embedding api.
>
> Can we replace the OpenAI embedding API with an open-source embedding model from Hugging Face?

It can be replaced with another embedding API; just search for the embedding-related code and replace it.

mmfhmm123 · 2023-05-11T07:51:56Z

Hello, your email has been received. Please wait patiently; I will reply as soon as I see it. Fang Hong

aorumbayev · 2023-05-14T20:35:44Z

Made a tiny script that uses https://github.com/go-skynet/LocalAI to run gpt4all model at localhost:8080 so autogpt can work with it. Haven't tested extensively though and might require fine tuning and picking the right model to speed up responses.

https://github.com/aorumbayev/autogpt4all

kiljoy001 · 2023-05-18T00:27:24Z

I've made some changes to the config.py file to allow for other URLs (like localhost). It seems to pass the unit tests, but it's failing the integration tests:

Any pointers will be appreciated - I am new to this project so I am unsure if I just didn't set things up right (dove right in).
The change is fairly tiny, but it will allow for loading other URLs so I think the idea is quite good.
EDIT:
A quick look at the failed tests suggests they are about embedding - not sure if the LLM 'framework' I am using (gpt4all) can do that.

DGdev91 · 2023-05-18T08:25:04Z

Theoretically, with the changes in #2594 the embeddings should work too, but of course the service which exposes the OpenAI-compliant API must support them.
If I'm not wrong, @keldenl's project, gpt-llama.cpp, should support it.
Not sure about LocalAI or other implementations.

chongy076 · 2023-06-13T03:17:19Z

> > @mudler i tried out your project. i can confirm that it is already possible to use it with my changes. just set OPENAI_API_BASE_URL=http://localhost:8080/v1 and EMBED_DIM= depending on the model you are using (4096 for 7B, 5120 for 13B, 6656 for 33B, 8192 for 65B)
>
> Thanks for the pointers!
>
> > There's just a minor issue: AutoGPT expects gpt-3.5-turbo or gpt-4 as the model IDs, while you are using the model file name as id. A quick (and hacky) fix in your case could be renaming the model in the "models" folder as "gpt-3.5-turbo", that worked in my case. It could also be a good idea to make some changes to LocalAI to make it always serve the first model it finds (or one which has been configured for that) as gpt-3.5-turbo and gpt-4. after all, the idea here is to offer a drop-in replacement for those models.
>
> I see! yup indeed I've an issue open already for it to allow model aliases
>
> > Also, i don't know why, but it seems much slower than keldenl's implementation. Still kinda works, but isn't really practical to use.
>
> Out of curiosity, did you test with docker compose? If you are on a Mac I'd suggest compiling from source instead. Also, try lowering the number of threads to the number of physical cores in the hardware - by default it takes all the cores available, but that seems to be problematic. For instance, Mac M1 users report that between 4 and 7 is a good spot. On my PC, which has 20 cores of which 14 are physical, 14 is a sweet spot as well.
>
> Update: just fixed this on LocalAI - now defaults to the number of physical cores =)
>
> Another point, though it's more of a suspicion: this may be because keldenl's implementation shells out to the llama.cpp binary, and I suspect it keeps the process open and continues prompting in interactive mode, using a fixed number of cores (4 if I recall correctly). But I'm just speculating here; maybe @keldenl can shed more light on this.
>
> LocalAI directly interacts with the library, and so calls are directly sent back to the model, without interactive mode - this probably has an impact overall when sending whole inputs back again to the API, but I'm not sure how it is being used here. I'll try to have a look, but usually on my hardware (not a Mac) I don't notice a difference between running with the CLI manually and with the API.
>
> > Finally, remember that some LLMs work better than others for this. the first tests have been made with Vicuna13B. I tried also GPT4All, but it can't create a valid json as response
>
> 👍 Gotcha! Thanks for the pointers!

But the openai client seems to require SSL by default:
2023-06-13 10:37:55 File "/usr/local/lib/python3.11/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
2023-06-13 10:37:55 raise last_exc
2023-06-13 10:37:55 └ ClientConnectorError(ConnectionKey(host='localhost', port=8080, is_ssl=False, ssl=None, proxy=None, proxy_auth=None, proxy_he...
2023-06-13 10:37:55 File "/usr/local/lib/python3.11/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
2023-06-13 10:37:55 transp, proto = await self._wrap_create_connection(
2023-06-13 10:37:55 │ └ <function TCPConnector._wrap_create_connection at 0x7fa114ef2200>
2023-06-13 10:37:55 └ <aiohttp.connector.TCPConnector object at 0x7fa0f9280210>
2023-06-13 10:37:55 File "/usr/local/lib/python3.11/site-packages/aiohttp/connector.py", line 988, in _wrap_create_connection
2023-06-13 10:37:55 raise client_error(req.connection_key, exc) from exc
2023-06-13 10:37:55 │ │ └ <property object at 0x7fa11512a070>
2023-06-13 10:37:55 │ └ <aiohttp.client_reqrep.ClientRequest object at 0x7fa0f9280a90>
2023-06-13 10:37:55 └ <class 'aiohttp.client_exceptions.ClientConnectorError'>
2023-06-13 10:37:55
2023-06-13 10:37:55 aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host localhost:8080 ssl:default [Connection refused]
2023-06-13 10:37:55

[tested in Jupyter no issues with the localai]
import requests

url = "http://localhost:8080/v1/chat/completions"

headers = {
    "Content-Type": "application/json"
}

data = {
    "model": "gpt-3.5-turbo.bin",
    "messages": [{"role": "user", "content": "How are you?"}],
    "temperature": 0.7
}

response = requests.post(url, headers=headers, json=data)

# Process the response as needed
print(response.json())

{'object': 'chat.completion', 'model': 'gpt-3.5-turbo.bin', 'choices': [{'message': {'role': 'assistant', 'content': 'I am good. How is your day?\nHow are you? I am good. How is your day? You look tired today, please take care of yourself.\nHow are you? I am good, and how is your day?\nHow are you? I am good, and how is your day? You look tired today, please take care of yourself.'}}], 'usage': {'prompt_tokens': 0, 'completion_tokens': 0, 'total_tokens': 0}}

--- updated status ---
I think the Docker container was not given the host address or a mapped port.
Changing that now.
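That diagnosis fits the traceback: inside a container, localhost refers to the container itself, so a server listening on the host's port 8080 is refused unless the host is made reachable or the port is mapped. Here is a small sketch for checking reachability from wherever Auto-GPT actually runs. `can_connect` is a made-up helper, and which hostname reaches the host machine depends on your Docker setup.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Calling `can_connect("localhost", 8080)` from inside the container and again from the host shows quickly whether the problem is networking rather than the API itself.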

mmfhmm123 · 2023-06-13T03:17:46Z

Hello, your email has been received. Please wait patiently; I will reply as soon as I see it. Fang Hong

lc0rp · 2023-06-13T09:44:07Z

Closing as duplicate of #25

Fusseldieb · 2023-07-24T20:32:43Z

> It would be really great to have multiple online expert agents, all of them open source. Each should be a smaller model but specialized to some degree.
>
> For example, Auto creates an Agent designed to have 10% expertise and 90% general knowledge for brainstorming. This Agent can create another Agent with a focus on finance, with 80% expertise but 20% general knowledge. The Finance Agent creates a Model that doesn't exist yet to solve math and statistics problems that don't exist yet.
>
> These agents would communicate with each other to train and improve themselves, leveraging collective knowledge and expertise to continuously enhance their capabilities.
>
> I would really love to have these agents in the hand of the open source community.

Open-source mixed agents would really change things... The 7B parameter models (even the 4-bit ones) are already getting pretty good, but are too generalized. If we could download multiple specialized agents and load them only when needed, performance similar to GPT-4 could be achieved. Of course, loading and unloading models would drastically reduce inference speed, but hey, it would run on most consumer GPUs (8-12 GB)! I would love to see this :)

mmfhmm123 · 2023-07-24T20:33:04Z

Hello, your email has been received. Please wait patiently; I will reply as soon as I see it. Fang Hong

Neronjust2017 changed the title ~~Is it possible to replace Can chat-GPT with local offline model? For example, gpt4all, LLaMa, etc~~ Is it possible to replace Chat-GPT with local offline model? For example, gpt4all, LLaMa, etc Apr 17, 2023

mendeltem mentioned this issue Apr 19, 2023

OpenAI API is too expensive #2515

Closed

1 task

DGdev91 mentioned this issue Apr 19, 2023

Add settings for custom base url #2594

Merged

ntindle added the needs discussion To be discussed among maintainers label Apr 21, 2023

DGdev91 mentioned this issue Apr 27, 2023

Make prompt parameters configurable #3375

Merged

Boostrix mentioned this issue May 1, 2023

Ability to point to private LLM deployments #3610

Closed

1 task

lc0rp closed this as completed Jun 13, 2023

Tarolrr mentioned this issue Aug 29, 2023

Collection of issues related to other models support Tarolrr/Auto-GPT#1

Open
