Question about some commands #430
Hi @nomopo45! Happy to answer any questions.
Yes.
The first request will always be a bit slower because the tensors are loaded into memory on demand. Some other systems opt to do this during startup, but we chose not to, to keep startup times fast. If the slowness continues past the first request, make sure you are building with the correct hardware accelerator, if you have one. Note that performance will decrease as the prompt grows longer, but that is a known side effect of chat models, and we are actually working on a feature to optimize this (#350, #366).
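For example, on an Apple Silicon Mac an accelerated build looks like this (a sketch based on the project's feature flags; verify the feature name against the README of your checkout):

```bash
# Release build with the Metal accelerator enabled (Apple Silicon).
# Omit --features metal for a plain CPU build.
cargo build --release --features metal
```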
Absolutely! I'm not sure what your hardware is, but here are a few examples (I merged the run with the build command using `cargo run`):
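A couple of sketches for an M1 Mac, assuming the `gguf` subcommand and flags from the mistral.rs README, run from the repo root (the paths, tokenizer ID, and GGUF filename below are placeholders):

```bash
# Build + run in one step; -i starts interactive chat mode.
cargo run --release --features metal -- -i gguf \
  -m /path/to/Meta-Llama-3-8B-Instruct-GGUF \
  -t meta-llama/Meta-Llama-3-8B-Instruct \
  -f Meta-Llama-3-8B-Instruct-Q4_K_M.gguf

# Same model, but serving the OpenAI-compatible HTTP API instead.
cargo run --release --features metal -- --port 1234 gguf \
  -m /path/to/Meta-Llama-3-8B-Instruct-GGUF \
  -t meta-llama/Meta-Llama-3-8B-Instruct \
  -f Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
```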
LoRA is a popular technique for fine-tuning models, which works by efficiently training small adapters. X-LoRA is a mixture of LoRA experts that uses dense gating to choose among them; see the X-LoRA paper. X-LoRA and LoRA are distinct methods, but they both improve the performance of the model (and, for LoRA, at zero added inference cost, since the adapters can be merged into the base weights).
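For intuition, here is the standard formulation from the LoRA paper (not code from this repo): the pretrained weight is frozen and only a low-rank update is trained.

```latex
% LoRA: the pretrained weight W is frozen; only the low-rank
% factors B and A are trained (r is the adapter rank).
% After training, BA can be merged into W, so inference pays
% no extra latency -- the "zero cost" mentioned above.
W' = W + \frac{\alpha}{r} B A,
\qquad B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```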
Hello, thanks a lot for your answers! So I have a MacBook M1 Pro 16GB. I gave interactive chat a try using your command; after seeing your command and some others, I feel it's not possible to have an interactive chat with a GGUF file, am I right? Then I downloaded this repo: https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/tree/main and started the following command:
So, as you see, it ends with an error, and I'm not sure what this model.embed_tokens.weight is. Now, about the curl: here is the command I used for the build:
Then here is the command I use to start the server:
And then I run my curl command; I ran the same command three times to show you the time it takes:
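For reference, the request has roughly this shape against the OpenAI-compatible endpoint (the port and "model" value here are stand-ins, not my exact values):

```bash
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```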
Completion seems quite slow, but maybe it's normal for a computer like mine; I would just like to have your opinion on this. Thanks a lot!
Hi @nomopo45!
No, you can have an interactive chat with a GGUF model:
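A sketch of the invocation, assuming the `-i` interactive flag and `gguf` subcommand from the README (the tokenizer ID and GGUF filename below are placeholders; use whatever matches your download):

```bash
cargo run --release --features metal -- -i gguf \
  -m /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/ \
  -t meta-llama/Meta-Llama-3-8B-Instruct \
  -f Meta-Llama-3-8B-Instruct-Q4_K_M.gguf
```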
The CPM model architecture (non-GGUF) is not compatible with the "plain" Llama architecture; it would be necessary to add a dedicated model implementation for it.
I have seen similar results on similar hardware. That performance is not great, though; could you please try to build and rerun on the CPU?
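Something like this should do it (a sketch; just omit the accelerator feature, and substitute your own paths and tokenizer ID):

```bash
# CPU-only comparison: build and run without any accelerator feature.
cargo build --release
cargo run --release -- --port 1234 gguf \
  -m /path/to/model/dir -t <tokenizer-id> -f <file>.gguf
```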
Hello,
Thanks for the project, it looks really nice! I'm new to this world and I'm struggling to do what I want.
I have a MacBook M1 Pro 16GB.
I managed to install it and make it run, but some things I would like to do don't work.
First, I use this command:
From my understanding, the -t refers to tokenizer_config.json? The -m would be the path of the model? And the -f the actual model GGUF file name?
Once the command runs, it says:
So I tried a curl, but it takes super long to get a reply, like ~10 sec:
Is it normal for it to be so slow?
Now I tried to have an interactive chat, but all the commands I used failed. Could someone guide me on how to get an interactive chat?
Could someone explain to me what LoRA and X-LoRA are, what they do, and why we should use them?
I'm also using LM Studio, but I wanted to give this project a try since tokens/s seem to be one of its strong points.
Anyway, thanks in advance for any replies and help!
PS: here is what's inside the /Users/xxx/.cache/lm-studio/models/lmstudio-community/Meta-Llama-3-8B-Instruct-GGUF/