LLM output stuck in a loop #542
Comments
I'm also having a similar issue...
@sirajperson can you look into this?
Sure @neelayan7. Just to be on the same page, what model are we using here? I have been experimenting with 30B-Lazarus; they recently updated it to include the SuperHOT LoRA.
Since the new SuperHOT has expanded Llama's context window to 8-16k, I'm working on getting useful JSON from 30B-Lazarus. I'll be working on the repetitive prompts and the thinking-tool issue today. On my branch I've updated the Docker files to use the latest version of the TGWUI docker-specific repo, and on my testing rig all of the containers are loading fine. I don't have the best computer and can't test GPTQ-based inference, so I've been sticking to models that I can run directly in HF format or with llama-cpp. I would really like to use MPT-30b-instruct, because when I run that model with GGML on the command line it produces great code.

Right now I'm downloading and testing 30B-Lazarus. They recently merged the SuperHOT LoRA into the repo, so I'm working with that one for now. I suspect the JSON problem is tied to the model being used. For example, with Llama_3b the application would loop and fail because Llama 3b can't yet produce the kind of responses the agents need in order to process functions controlled by JSON. If 30B-Lazarus comes close, though, it may be worth either using an additional LoRA that already lets the model produce usable JSON, or simply training a LoRA to do the job for SuperAGI. The big hurdle of the limited context window is definitely solved thanks to SuperHOT.

At the moment I'm downloading the 30B-Lazarus weights from Hugging Face. I selected this model because it has the highest score on the HF leaderboard; my reasoning was that a model not specifically fine-tuned for coding would give the agent better coverage of a broad range of problems. However, I'm not sure yet whether that creates a trade-off in its ability to produce answers in the correct JSON format. It may require changing the default prompt configurations that are in place when creating a new agent. There have been discussions on other issues regarding this looping issue.
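To make that looping failure mode concrete, here is a minimal sketch, my own illustration rather than SuperAGI's actual code, of how an agent that expects a JSON tool call ends up re-prompting indefinitely when the model can't emit valid JSON; the function and field names are assumptions:

```python
import json

MAX_RETRIES = 4  # hypothetical limit; the real agent's loop logic may differ


def get_tool_call(llm, prompt):
    """Ask the model for a JSON tool call; re-prompt if the output does not parse.

    A model that cannot emit well-formed JSON makes the agent appear 'stuck in
    a loop': every retry produces the same unparsable text.
    """
    for _attempt in range(MAX_RETRIES):
        raw = llm(prompt)              # raw completion from the local model
        try:
            call = json.loads(raw)     # agent expects e.g. {"tool": ..., "args": {...}}
            if "tool" in call:
                return call
        except json.JSONDecodeError:
            pass                       # malformed JSON -> try again with the same prompt
    raise RuntimeError("model never produced a usable JSON tool call")
```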
So I've been burning the midnight oil the last several days trying to match the best open-source models to the task agent. So far the best one that has successfully started generating instructions and following a reasoning chain of thought has been: To run it and keep it coherent, it cannot be quantized that much. My testing and development machine has 64 GB of system memory and 48 GB of VRAM. I'm not able to use GPTQ models because my accelerator cards do not support half-precision floats. You may have more luck with GPTQ models because you can easily attach LoRAs to them. In any case, using open-source models is not as versatile as using ChatGPT: the instruct models have a specific dataset that they were trained on. A large list of prompt styles can be found in the TGWUI characters folder.

superagt.txt:
analyse_task.txt:
create_task.txt:
initialize_task.txt:
prioritize_task.txt:
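The contents of those prompt files are not reproduced here; as a rough illustration (my own wording and placeholders, not the original files), an Alpaca-style instruction prompt of the kind described would look something like this:

```text
Below is an instruction that describes a task. Write a response that
appropriately completes the request.

### Instruction:
{goals}
{current_task}

### Response:
```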
It is important to note that there is a newline after '### Response:', and that the file ends with that single newline, not two or three. Again, with lower-parameter models and the use of quantization there really isn't much room for changing the prompt styles.

As for the configuration of TGWUI: it's important to make sure that the OPENAI_API_BASE setting points to wherever you are running TGWUI. For debugging, I have found it much easier to just run TGWUI from my IDE. However, if you would like to use Docker, the image needs to be configured for the run environment of the container. If the target host machine is going to use GGML models and execute inference on the CPU, then setup is straightforward. If, however, you want to use a GPU to either run the model outright or to offload layers of the model to the GPU for faster GGML inference (which happens to be my case), then you'll need to install the requirements for Docker to access the host machine's GPUs. A full tutorial is well documented in TGWUI's Docker readme.

Please be aware that getting local LLMs to execute agents is still a work in progress. The framework is there, and while it is buggy, it is coming together. I have tried working with some of the new SuperHOT models, like Vicuna-33B, but the model doesn't respond to the queries at all. Some of the models that I have experimented with so far are: guanaco-65b-merged, which required significant quantization in my testing environment and so wasn't suitable. I will continue to work on creating prompts and researching models that allow SuperAGI to perform a wide variety of tasks. The next one that I'm downloading while writing this post is

I'm still having trouble with several issues. Debugging SuperAGI has been difficult for me. If anyone has been developing the agent and is reading this post, I would love some insight into your debugging environment. I've been using PyCharm Pro. Using SuperAGI from the command line isn't working and seems to be broken, and debugging the Docker container isn't so straightforward. I'd love to work on making more tools for SuperAGI after local LLMs become more stable.

So far, I have been able to get the LLM to do step-by-step reasoning, and I have been able to get it to think that it has completed a task. I have not been able to get it to use tools. I'm not certain if this is an issue with the response the LLM is providing or if there is something else going on with the agent:
As you can see, the chain of thought is correct, but there is no response from the agent from using tools like list files or write files. Several times the agent had a similar arrangement of tasks, thought it had completed all of them, and responded "finished", but the task agent kept querying it and didn't exit. Again, anyone who could help me stop at breakpoints in the container so that I can debug SuperAGI would have my deepest gratitude.

The value in this research is that in my tests running an agent consumes about 7k tokens per minute, and that's on my slow machine. In actual usage with GPT, I could easily see that being more like 10 to 12. In a nutshell, running an agent on the token-expense model would cost about 150 to 200 dollars per day per agent. Running an LLM on a good computer would simply cost the price of electricity, somewhere in the neighborhood of 60 to 70 dollars per month.
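To make that cost estimate concrete, here is a quick back-of-the-envelope calculation; the blended per-token price is an assumption chosen only to illustrate the arithmetic, not a quoted API price:

```python
# Rough daily cost of a continuously running agent at the usage estimated above.
tokens_per_minute = 10_000       # the ~10-12k tokens/minute figure estimated above
blended_price_per_1k = 0.012     # assumed blended prompt+completion price in USD

tokens_per_day = tokens_per_minute * 60 * 24                  # 14,400,000 tokens/day
cost_per_day = tokens_per_day / 1_000 * blended_price_per_1k  # ~$172.80/day
print(f"{tokens_per_day:,} tokens/day -> ${cost_per_day:,.2f}/day")
```

At that assumed rate the estimate lands around $170 per day, in line with the 150 to 200 dollars per day quoted above.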
I wanted to ask whether you have an update on which models you would currently recommend? Given the issues you encounter, I would actually highly recommend that SuperAGI base its prompting on LMQL or guidance. These tools would dramatically increase the quality of the outputs and ensure proper JSON structures. I personally use LMQL and I really love features that, for instance, constrain outputs to a certain set of options (e.g. for tool selection) or to generally structured outputs.
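As a rough illustration of what such constraints buy you, here is a plain-Python approximation of the effect; it is not LMQL or guidance syntax, and the tool names are hypothetical. LMQL and guidance enforce this at decode time, so the model can only ever emit one of the allowed options, whereas this sketch only checks the output after the fact:

```python
import json

ALLOWED_TOOLS = {"list_files", "write_file", "finish"}  # hypothetical tool names


def parse_constrained(raw: str) -> dict:
    """Accept the model's reply only if it is JSON and 'tool' is an allowed option."""
    call = json.loads(raw)  # raises if the reply is not valid JSON
    if call.get("tool") not in ALLOWED_TOOLS:
        raise ValueError(f"tool must be one of {sorted(ALLOWED_TOOLS)}")
    return call
```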
Would development here be faster if you had access to a machine with 512 GB of RAM? Maybe I could host something.
oobabooga rewrites the prompt using jinja2 in YAML files. The llama.cpp server has a similar feature.
I don't know how prompts are handled when using a local LLM with SuperAGI.
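For reference, an instruction template in that jinja2-in-YAML style might look roughly like this; the field name and placeholders are assumptions for illustration, not the exact TGWUI or llama.cpp schema:

```yaml
# Illustrative instruction template in the spirit of oobabooga's YAML templates.
instruction_template: |
  Below is an instruction that describes a task. Write a response that
  appropriately completes the request.

  ### Instruction:
  {{ instruction }}

  ### Response:
```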
Using the current codebase with a local LLM, it seems to be stuck in a loop. After the 4th iteration the console output looks like this: