
How can you run inference with a local GGUF file? #295

Closed
jett06 opened this issue May 11, 2024 · 6 comments

jett06 commented May 11, 2024

I'm trying to play around with mistralrs-server using a file I've already downloaded (from PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed), but the program arguments are very confusing. I've messed with it a lot and this is the closest I've come to what I want to do:

# tokenizer.json was downloaded from `microsoft/Phi-3-mini-128k-instruct` manually, `Phi-3-mini-128k-instruct-q3_K_S.gguf` was downloaded from https://huggingface.co/PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed/blob/main/Phi-3-mini-128k-instruct.Q3_K_S.gguf.
$ ./mistralrs-server gguf --tokenizer-json tokenizer.json --quantized-model-id ./Phi-3-mini-128k-instruct-q3_K_S.gguf --tok-model-id microsoft/Phi-3-mini-128k-instruct --quantized-filename Phi-3-mini-128k-instruct-q3_K_S.gguf
2024-05-11T16:31:04.581627Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: false
2024-05-11T16:31:04.589148Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-11T16:31:04.589898Z  INFO mistralrs_server: Loading model `microsoft/Phi-3-mini-128k-instruct` on Cpu...
2024-05-11T16:31:04.590217Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-11T16:31:04.702182Z  INFO hf_hub: Token file not found "/home/jett/.cache/huggingface/token"    
2024-05-11T16:31:04.702921Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/jett/.cache/huggingface/token", using no HF token.
2024-05-11T16:31:04.703635Z  INFO mistralrs_core::pipeline::gguf: Using tokenizer.json at `tokenizer.json`
2024-05-11T16:31:04.785844Z  INFO hf_hub: Token file not found "/home/jett/.cache/huggingface/token"    
2024-05-11T16:31:04.785904Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/jett/.cache/huggingface/token", using no HF token.
thread 'main' panicked at mistralrs-core/src/pipeline/mod.rs:943:25:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/Phi-3-mini-128k-instruct-q3_K_S.gguf/resolve/main/Phi-3-mini-128k-instruct-q3_K_S.gguf]))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[1]    29633 IOT instruction  ./mistralrs-server gguf --tokenizer-json tokenizer.json --quantized-model-id 

As you can see, this command fails. I'm not sure whether I'm missing something, but running inference on a local GGUF file that's already downloaded (the way llama.cpp can) seems unnecessarily complicated. I just want to do something like ./mistralrs-server gguf -m ./Phi-3-mini-128k-instruct-q3_K_S.gguf and have it start a server with that local file, but the program's arguments and error messages imply it is built around downloading a remote file. Thanks in advance for any responses that'll help :)

EricLBuehler added the documentation (Improvements or additions to documentation) label on May 11, 2024

sdmorrey commented May 11, 2024

Something to be aware of: PrunaAI uses a custom encoding scheme, and I can't get their models to run under llama.cpp either. I don't know whether that's related, but it's worth keeping in mind.

EricLBuehler (Owner) commented

Hi @jett06, thanks for letting me know. I think we have room to improve how verbose we are when loading local files.

As documented here, the way to load files locally is to specify a local path as the model ID. I'll try to improve the visibility and clarity of that section.

In your case, loading from a local file may look like:

./mistralrs-server gguf --tokenizer-json tokenizer.json --quantized-model-id <PATH TO GGUF FILE> --quantized-filename Phi-3-mini-128k-instruct-q3_K_S.gguf

I made a few changes:

  • Remove --tok-model-id because you specified the tokenizer.json, so it is redundant
  • Specify the path to the GGUF file as the quantized model ID

In the backend, we start by searching locally, but since ./Phi-3-mini-128k-instruct-q3_K_S.gguf is not the path to your GGUF file, we treat it as an HF ID, which causes the web-related error (https://huggingface.co/Phi-3-mini-128k-instruct-q3_K_S.gguf/resolve/main/Phi-3-mini-128k-instruct-q3_K_S.gguf does not exist). I think that in the future, to reduce misunderstandings, we should display that we could not find the local file and are now searching remotely.
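
Roughly, the lookup works like this (a simplified shell-style sketch of the behavior, not the actual mistral.rs code; the variable names are only for illustration):

# the values passed as -m / --quantized-model-id and -f / --quantized-filename
MODEL_ID="Phi-3-mini-128k-instruct-q3_K_S.gguf"
FILENAME="Phi-3-mini-128k-instruct-q3_K_S.gguf"
if [ -f "$MODEL_ID/$FILENAME" ]; then
  echo "loading local file $MODEL_ID/$FILENAME"
else
  # otherwise MODEL_ID is treated as a Hugging Face repo ID
  echo "fetching https://huggingface.co/$MODEL_ID/resolve/main/$FILENAME"
fi

The URL in your panic message is exactly that fallback request, which the Hub rejects because no such repo exists.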

For brevity, you can simplify it by using the short arguments:

./mistralrs-server gguf --tokenizer-json tokenizer.json -m <PATH TO GGUF FILE> -f Phi-3-mini-128k-instruct-q3_K_S.gguf

In general, the model ID is a "path" to the files: a local path, or a Hugging Face Model ID.
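
For instance (illustrative values only):

# Hugging Face model ID: the files are resolved on the Hub
-m microsoft/Phi-3-mini-128k-instruct
# Local path: the files are resolved on your filesystem
-m /path/to/local/model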

jett06 commented May 12, 2024

@EricLBuehler Phi-3-mini-128k-instruct-q3_K_S.gguf actually is the path to my local GGUF file. I've symlinked it from my downloads directory to the current dir (Phi-3-mini-128k-instruct-q3_K_S.gguf => /home/jett/Downloads/llms/Phi-3-mini-128k-instruct-q3_K_S.gguf). Here's what happens when I run the final command you've given me (with the GGUF file path filled in):

$ ./mistralrs-server gguf --tokenizer-json tokenizer.json -m ./Phi-3-mini-128k-instruct-q3_K_S.gguf -f ./Phi-3-mini-128k-instruct-q3_K_S.gguf
error: the following required arguments were not provided:
  --tok-model-id <TOK_MODEL_ID>

Usage: mistralrs-server gguf --tok-model-id <TOK_MODEL_ID> --quantized-model-id <QUANTIZED_MODEL_ID> --quantized-filename <QUANTIZED_FILENAME> --tokenizer-json <TOKENIZER_JSON>

For more information, try '--help'.

To rule out the symlink as the cause, I deleted it and copied the file directly into the current directory (/home/jett/Downloads/llms/Phi-3-Mini-128k-instruct-q3_K_S.gguf copied to ./Phi-3-Mini-128k-instruct-q3_K_S.gguf), which produced the same error as above:

$ ./mistralrs-server gguf --tokenizer-json tokenizer.json -m ./Phi-3-mini-128k-instruct-q3_K_S.gguf -f ./Phi-3-mini-128k-instruct-q3_K_S.gguf
error: the following required arguments were not provided:
  --tok-model-id <TOK_MODEL_ID>

Usage: mistralrs-server gguf --tok-model-id <TOK_MODEL_ID> --quantized-model-id <QUANTIZED_MODEL_ID> --quantized-filename <QUANTIZED_FILENAME> --tokenizer-json <TOKENIZER_JSON>

For more information, try '--help'.

This is with the latest commit on master.

EricLBuehler (Owner) commented

I made a small mistake with the command; I actually deleted the wrong arg:

 ./mistralrs-server gguf -m . -f Phi-3-mini-128k-instruct-q3_K_S.gguf -t microsoft/Phi-3-mini-128k-instruct

We tell mistral.rs to look in . for Phi-3-mini-128k-instruct-q3_K_S.gguf, which your filesystem will resolve whether or not it is a symlink. So when you specify the model ID as the path to the symlinked file (./...), it breaks: the model ID should be the directory to look in (here, .), not the file itself.
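
For example, pointing at your downloads directory should work too (illustrative, using the paths you mentioned above):

./mistralrs-server gguf -m /home/jett/Downloads/llms -f Phi-3-mini-128k-instruct-q3_K_S.gguf -t microsoft/Phi-3-mini-128k-instruct

Here -m is the directory to search, -f is the GGUF filename inside it, and -t is the model ID to pull the tokenizer from.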

jett06 commented May 13, 2024

@EricLBuehler I see, that makes sense! I didn't realize the command arguments were so similar conceptually regardless of whether you're using a local file or pulling from Hugging Face (i.e., either a local directory or an HF repo, then either a local file or an HF file), but I get it now. Thank you so much for your help, and for maintaining a wonderful project :)

jett06 closed this as completed on May 13, 2024
EricLBuehler (Owner) commented

Thank you! Glad to help.
