
How can you run inference with a local GGUF file? #295

Closed
jett06 opened this issue May 11, 2024 · 6 comments

jett06 commented May 11, 2024

I'm trying to play around with mistralrs-server using a file I've already downloaded (from PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed), but the program arguments are very confusing. I've messed with it a lot and this is the closest I've come to what I want to do:

# tokenizer.json was downloaded from `microsoft/Phi-3-mini-128k-instruct` manually, `Phi-3-mini-128k-instruct-q3_K_S.gguf` was downloaded from https://huggingface.co/PrunaAI/Phi-3-mini-128k-instruct-GGUF-Imatrix-smashed/blob/main/Phi-3-mini-128k-instruct.Q3_K_S.gguf.
$ ./mistralrs-server gguf --tokenizer-json tokenizer.json --quantized-model-id ./Phi-3-mini-128k-instruct-q3_K_S.gguf --tok-model-id microsoft/Phi-3-mini-128k-instruct --quantized-filename Phi-3-mini-128k-instruct-q3_K_S.gguf
2024-05-11T16:31:04.581627Z  INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: false
2024-05-11T16:31:04.589148Z  INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-11T16:31:04.589898Z  INFO mistralrs_server: Loading model `microsoft/Phi-3-mini-128k-instruct` on Cpu...
2024-05-11T16:31:04.590217Z  INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-11T16:31:04.702182Z  INFO hf_hub: Token file not found "/home/jett/.cache/huggingface/token"    
2024-05-11T16:31:04.702921Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/jett/.cache/huggingface/token", using no HF token.
2024-05-11T16:31:04.703635Z  INFO mistralrs_core::pipeline::gguf: Using tokenizer.json at `tokenizer.json`
2024-05-11T16:31:04.785844Z  INFO hf_hub: Token file not found "/home/jett/.cache/huggingface/token"    
2024-05-11T16:31:04.785904Z  INFO mistralrs_core::utils::tokens: Could not load token at "/home/jett/.cache/huggingface/token", using no HF token.
thread 'main' panicked at mistralrs-core/src/pipeline/mod.rs:943:25:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/Phi-3-mini-128k-instruct-q3_K_S.gguf/resolve/main/Phi-3-mini-128k-instruct-q3_K_S.gguf]))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[1]    29633 IOT instruction  ./mistralrs-server gguf --tokenizer-json tokenizer.json --quantized-model-id 

As you can see, this command fails. I'm not sure whether I'm missing something, but running inference on a local GGUF file that's already downloaded (the way llama.cpp can) seems unnecessarily complicated. I just want to do something like ./mistralrs-server gguf -m ./Phi-3-mini-128k-instruct-q3_K_S.gguf and have it start a server with that local file, but the program's arguments and error messages imply it is built around downloading a remote file. Thanks in advance for any responses that'll help :)

EricLBuehler added the documentation (Improvements or additions to documentation) label on May 11, 2024

sdmorrey commented May 11, 2024

Something to be aware of: PrunaAI uses a custom encoding scheme, and I can't get their models to run under llama.cpp either. I don't know whether that's related, but it's worth keeping in mind.

EricLBuehler (Owner) commented

Hi @jett06, thanks for letting me know. I think we have room to improve how verbose we are when loading local files.

As documented here, the way to load files locally is to specify a local path as the model ID. I'll try to improve the visibility and clarity of that section.

In your case, loading from a local file may look like:

./mistralrs-server gguf --tokenizer-json tokenizer.json --quantized-model-id <PATH TO GGUF FILE> --quantized-filename Phi-3-mini-128k-instruct-q3_K_S.gguf

I made a few changes:

  • Remove --tok-model-id because you specified the tokenizer.json, so it is redundant
  • Specify the path to the GGUF file as the quantized model ID

In the backend, we start by searching locally, but since ./Phi-3-mini-128k-instruct-q3_K_S.gguf is not the path to your GGUF file, we treat it as an HF ID, which causes the web-related error (https://huggingface.co/Phi-3-mini-128k-instruct-q3_K_S.gguf/resolve/main/Phi-3-mini-128k-instruct-q3_K_S.gguf does not exist). I think that in the future, to reduce misunderstandings, we should display that we could not find the local file and are now searching remotely.
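
Roughly, the lookup works like this (a simplified shell-style sketch of the behavior, not the actual mistral.rs code; the variable names are only for illustration):

# the values passed as -m / --quantized-model-id and -f / --quantized-filename
MODEL_ID="Phi-3-mini-128k-instruct-q3_K_S.gguf"
FILENAME="Phi-3-mini-128k-instruct-q3_K_S.gguf"
if [ -f "$MODEL_ID/$FILENAME" ]; then
  echo "loading local file $MODEL_ID/$FILENAME"
else
  # otherwise MODEL_ID is treated as a Hugging Face repo ID
  echo "fetching https://huggingface.co/$MODEL_ID/resolve/main/$FILENAME"
fi

The URL in your panic message is exactly that fallback request, which the Hub rejects because no such repo exists.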

For brevity, you can simplify it by using the short arguments:

./mistralrs-server gguf --tokenizer-json tokenizer.json -m <PATH TO GGUF FILE> -f Phi-3-mini-128k-instruct-q3_K_S.gguf

In general, the model ID is a "path" to the files: a local path, or a Hugging Face Model ID.
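
For instance (illustrative values only):

# Hugging Face model ID: the files are resolved on the Hub
-m microsoft/Phi-3-mini-128k-instruct
# Local path: the files are resolved on your filesystem
-m /path/to/local/model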

jett06 commented May 12, 2024

@EricLBuehler Phi-3-mini-128k-instruct-q3_K_S.gguf actually is the path to my local GGUF file. I've symlinked it from my downloads directory to the current dir (Phi-3-mini-128k-instruct-q3_K_S.gguf => /home/jett/Downloads/llms/Phi-3-mini-128k-instruct-q3_K_S.gguf). Here's what happens when I run the final command you've given me (with the GGUF file path filled in):

$ ./mistralrs-server gguf --tokenizer-json tokenizer.json -m ./Phi-3-mini-128k-instruct-q3_K_S.gguf -f ./Phi-3-mini-128k-instruct-q3_K_S.gguf
error: the following required arguments were not provided:
  --tok-model-id <TOK_MODEL_ID>

Usage: mistralrs-server gguf --tok-model-id <TOK_MODEL_ID> --quantized-model-id <QUANTIZED_MODEL_ID> --quantized-filename <QUANTIZED_FILENAME> --tokenizer-json <TOKENIZER_JSON>

For more information, try '--help'.

To rule out the symlink as the cause, I deleted it and copied the file directly into the current directory (/home/jett/Downloads/llms/Phi-3-Mini-128k-instruct-q3_K_S.gguf copied to ./Phi-3-Mini-128k-instruct-q3_K_S.gguf), which produced the same error as above:

$ ./mistralrs-server gguf --tokenizer-json tokenizer.json -m ./Phi-3-mini-128k-instruct-q3_K_S.gguf -f ./Phi-3-mini-128k-instruct-q3_K_S.gguf
error: the following required arguments were not provided:
  --tok-model-id <TOK_MODEL_ID>

Usage: mistralrs-server gguf --tok-model-id <TOK_MODEL_ID> --quantized-model-id <QUANTIZED_MODEL_ID> --quantized-filename <QUANTIZED_FILENAME> --tokenizer-json <TOKENIZER_JSON>

For more information, try '--help'.

This is with the latest commit on master.

EricLBuehler (Owner) commented

I made a small mistake with the command; I actually deleted the wrong arg:

 ./mistralrs-server gguf -m . -f Phi-3-mini-128k-instruct-q3_K_S.gguf -t microsoft/Phi-3-mini-128k-instruct

We tell mistral.rs to look in . for Phi-3-mini-128k-instruct-q3_K_S.gguf, which your filesystem will resolve whether or not it is a symlink. So when you specify the model ID as the path to the symlinked file (./...), it breaks: the model ID should be the directory to look in (here, .), not the file itself.
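
For example, pointing at your downloads directory should work too (illustrative, using the paths you mentioned above):

./mistralrs-server gguf -m /home/jett/Downloads/llms -f Phi-3-mini-128k-instruct-q3_K_S.gguf -t microsoft/Phi-3-mini-128k-instruct

Here -m is the directory to search, -f is the GGUF filename inside it, and -t is the model ID to pull the tokenizer from.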

jett06 commented May 13, 2024

@EricLBuehler I see, that makes sense! I didn't realize the command arguments were so similar conceptually regardless of whether you're using a local file or pulling from Hugging Face (i.e., either a local directory or an HF repo, then either a local file or an HF file), but I get it now. Thank you so much for your help, and for maintaining a wonderful project :)

jett06 closed this as completed on May 13, 2024
EricLBuehler (Owner) commented

Thank you! Glad to help.
