
Add support for using GGUF tokenizer #345

Merged
EricLBuehler merged 21 commits into master from gguf_to_hf_tokenizer on May 28, 2024

Conversation

EricLBuehler (Owner) commented May 25, 2024

This adds support for using a GGUF tokenizer as documented here:
https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#tokenizer

Supported tokenizer models, and the algorithm each maps to (see the dispatch sketch below):

  • llama: Unigram
  • replit: Unigram
  • gpt: BPE
  • rwkv: RWKV
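
As a minimal sketch (my illustration, not the PR's actual code), dispatching on the GGUF `tokenizer.ggml.model` metadata value could look like the following; the enum, the function name, and the `gpt2` alias are assumptions on my part:

```rust
/// Hypothetical sketch: map the GGUF `tokenizer.ggml.model` metadata value
/// to the tokenizer algorithm it implies (names here are illustrative only).
enum TokenizerKind {
    Unigram, // SentencePiece-style; used by `llama` and `replit`
    Bpe,     // byte-pair encoding; used by `gpt`
    Rwkv,    // RWKV world tokenizer
}

fn tokenizer_kind(ggml_model: &str) -> Result<TokenizerKind, String> {
    match ggml_model {
        "llama" | "replit" => Ok(TokenizerKind::Unigram),
        "gpt" | "gpt2" => Ok(TokenizerKind::Bpe), // assuming some GGUFs write `gpt2`
        "rwkv" => Ok(TokenizerKind::Rwkv),
        other => Err(format!("unsupported GGUF tokenizer model `{other}`")),
    }
}
```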

github-actions bot commented May 25, 2024

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    5            9            9            0            0
 Python                 21          741          622           21           98
 TOML                   15          388          351            1           36
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               15         1028            0          761          267
 |- BASH                 6          205          192            0           13
 |- Python               6          121          110            0           11
 |- Rust                 3          185          172            9            4
 (Total)                           1539          474          770          295
-------------------------------------------------------------------------------
 Rust                   84        27992        25630          365         1997
 |- Markdown            41          426            0          414           12
 (Total)                          28418        25630          779         2009
===============================================================================
 Total                 144        30634        27006         1148         2480
===============================================================================

Jeadie (Contributor) commented May 28, 2024

What remains on this PR? I need GGUF tokenizer support, so I'm happy to contribute.

EricLBuehler (Owner, Author)

> What remains on this PR? I need GGUF tokenizer support, so I'm happy to contribute.

Currently, it doesn't work. In this PR I tried to convert the GGUF tokenizer to an HF tokenizer for easy integration with the rest of mistral.rs, but I ran into some problems with how the decoder, post-processor, and normalizer parts of the HF tokenizer are being set up. Additionally, it looks like the Mistral GGUF doesn't contain any merges, while the HF tokenizer itself does. I'm not sure whether there are sensible defaults I can use, or a way to calculate the merges from the token types.

So, the current state of this PR is that it is half working. If you could perhaps take a look and see if you can get it to work, that would be amazing!
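
One possible direction for the missing merges (an assumption of mine, not something the GGUF spec promises): since a BPE vocabulary lists merged tokens roughly in the order they were learned, merges could be reconstructed heuristically from the rank-ordered `tokenizer.ggml.tokens` array. A self-contained sketch:

```rust
use std::collections::HashMap;

/// Heuristic sketch: recover BPE merges from a rank-ordered vocabulary,
/// assuming merge priority follows token rank. For each multi-character
/// token, pick the split into two known tokens whose worse-ranked half is
/// earliest, and record that pair as a merge.
fn merges_from_vocab(tokens: &[String]) -> Vec<(String, String)> {
    let rank: HashMap<&str, usize> =
        tokens.iter().enumerate().map(|(i, t)| (t.as_str(), i)).collect();
    let mut merges = Vec::new();
    for tok in tokens {
        if tok.chars().count() < 2 {
            continue; // single symbols are base vocabulary, never merges
        }
        let mut best: Option<(usize, &str, &str)> = None;
        for (i, _) in tok.char_indices().skip(1) {
            let (l, r) = tok.split_at(i);
            if let (Some(&lr), Some(&rr)) = (rank.get(l), rank.get(r)) {
                let score = lr.max(rr);
                if best.map_or(true, |(s, _, _)| score < s) {
                    best = Some((score, l, r));
                }
            }
        }
        if let Some((_, l, r)) = best {
            merges.push((l.to_string(), r.to_string()));
        }
    }
    merges
}
```

Whether such a reconstruction actually matches the merge list shipped with the HF tokenizer would need to be verified against the original.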

Jeadie (Contributor) commented May 28, 2024

What example GGUF are you using for Mistral? I don't see any reference to Mistral in ggerganov/ggml.

EricLBuehler (Owner, Author)

I'm using this one: https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/tree/main

Mistral uses a llama tokenizer, which ggerganov/ggml says should be a SentencePiece tokenizer. In this PR, however, I use a BPE tokenizer, because the HF tokenizer is BPE (maybe that is the problem). Perhaps we can use this crate?
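
For reference, SentencePiece's Unigram model chooses the segmentation that maximizes the sum of per-token scores, and GGUF ships exactly those scores in `tokenizer.ggml.scores`. A toy, self-contained Viterbi sketch (illustrative only; a real implementation builds a proper lattice and handles byte fallback):

```rust
use std::collections::HashMap;

/// Toy Unigram (Viterbi) segmentation over a token -> log-probability map
/// built from `tokenizer.ggml.tokens` and `tokenizer.ggml.scores`. Assumes
/// every single character is in `vocab` so a segmentation always exists.
fn unigram_segment(text: &str, vocab: &HashMap<String, f64>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    // best[i]: (best score for the prefix of length i, start of its last piece)
    let mut best = vec![(f64::NEG_INFINITY, 0usize); n + 1];
    best[0].0 = 0.0;
    for end in 1..=n {
        for start in 0..end {
            let piece: String = chars[start..end].iter().collect();
            if let Some(score) = vocab.get(&piece) {
                let cand = best[start].0 + score;
                if cand > best[end].0 {
                    best[end] = (cand, start);
                }
            }
        }
    }
    // Backtrack through the table to recover the chosen pieces.
    let (mut pieces, mut end) = (Vec::<String>::new(), n);
    while end > 0 {
        let start = best[end].1;
        pieces.push(chars[start..end].iter().collect());
        end = start;
    }
    pieces.reverse();
    pieces
}
```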

EricLBuehler (Owner, Author)

@Jeadie, I made some progress! It mostly works now, and I think there is just one small bug left. With this PR you can run models fully locally, specifying paths for the chat template and GGUF file:

./mistralrs-server --chat-template <chat_template> gguf -m . -f Phi-3-mini-128k-instruct-q4_K_M.gguf

@EricLBuehler EricLBuehler merged commit 34275f4 into master May 28, 2024
11 checks passed
@EricLBuehler EricLBuehler deleted the gguf_to_hf_tokenizer branch May 28, 2024 19:05