
Support loading tokenizer from sentencepiece model #407

Open
EricLBuehler opened this issue Jun 8, 2024 · 5 comments
Labels
new feature New feature or request

Comments

@EricLBuehler
Owner

Currently, if a sentencepiece .model file is provided, the user must run a provided script to convert it into the equivalent tokenizer.json. By supporting sentencepiece models directly, we can avoid this requirement.

Note: This requires creating a TokenizerLike trait and abstracting the tokenizer source file loading process, which is already complicated given the need to support loading from GGUF tokenizers.

Potential crate: https://docs.rs/sentencepiece/latest/sentencepiece/
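A minimal sketch of what that trait might look like, backed directly by the sentencepiece crate above. The TokenizerLike name comes from this issue, but the method set is hypothetical and would need to match mistral.rs's actual call sites:

```rust
use sentencepiece::SentencePieceProcessor;

pub trait TokenizerLike {
    fn encode(&self, text: &str) -> anyhow::Result<Vec<u32>>;
    fn decode(&self, ids: &[u32]) -> anyhow::Result<String>;
    fn vocab_size(&self) -> usize;
}

// Backing the trait with the sentencepiece crate directly would remove
// the tokenizer.json conversion step entirely.
impl TokenizerLike for SentencePieceProcessor {
    fn encode(&self, text: &str) -> anyhow::Result<Vec<u32>> {
        // Fully qualified to call the crate's inherent `encode`.
        let pieces = SentencePieceProcessor::encode(self, text)?;
        Ok(pieces.into_iter().map(|p| p.id).collect())
    }

    fn decode(&self, ids: &[u32]) -> anyhow::Result<String> {
        Ok(self.decode_piece_ids(ids)?)
    }

    fn vocab_size(&self) -> usize {
        self.len()
    }
}
```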

EricLBuehler added the new feature label Jun 8, 2024
@polarathene
Contributor

polarathene commented Jun 8, 2024

the user must run a provided script to convert it into the equivalent tokenizer.json

Something that apparently isn't very straightforward, judging by upstream advice? 🤔

An alternative option is deferring to a converter, if tokenizer.model can be reliably converted to tokenizer.json?

I find the Python tools a bit difficult to follow when I've looked at the source of some of them.


which is already complicated given the need to support loading from GGUF tokenizers.

Is it not able to leverage that? I thought I had read that Unigram is effectively SentencePiece? If the configuration can be mapped to the tokenizers logic we already have, then all we need is for the data from tokenizer.model to fit into PropsGGUF?
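If that holds, the mapping itself looks small. A sketch of the idea, assuming the (piece, score) pairs have already been parsed out of the .model protobuf's `pieces` field (that parsing step is handwaved here) and the three-argument `Unigram::from` of recent tokenizers releases:

```rust
use tokenizers::models::unigram::Unigram;
use tokenizers::Tokenizer;

fn unigram_from_spm(
    spm_vocab: Vec<(String, f64)>, // (piece, log-probability score)
    unk_id: usize,
) -> tokenizers::Result<Tokenizer> {
    // The `true` enables byte fallback, as Llama-style SentencePiece
    // models use.
    let model = Unigram::from(spm_vocab, Some(unk_id), true)?;
    Ok(Tokenizer::new(model))
}
```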


My earlier questions about the Unigram decoder/normalizer config seem to be covered by the SentencePiece protobuf descriptions:

https://github.com/google/sentencepiece/blob/6225e08edb2577757163b3f5dbba4c0b670ef445/src/sentencepiece_model.proto#L254-L263
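For instance, the NormalizerSpec fields there (add_dummy_prefix, escape_whitespaces, and so on) seem to correspond to normalizers the tokenizers crate already ships. A sketch assuming a Llama-like config, not a general converter for arbitrary specs:

```rust
use tokenizers::normalizers::{Prepend, Replace, Sequence};

fn spm_normalizer() -> tokenizers::Result<Sequence> {
    Ok(Sequence::new(vec![
        // add_dummy_prefix = true: prefix the input with a metaspace
        Prepend::new("▁".to_string()).into(),
        // escape_whitespaces = true: map spaces to U+2581 ("▁")
        Replace::new(" ", "▁")?.into(),
    ]))
}
```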

At a glance, I can see why it may be preferable to adopt the crate instead of trying to parse it into the tokenizers format.


Is there much demand for this feature? I wouldn't bother with it while mistral.rs is still in its early stages, unless user interest is sufficient to justify it? You've already got plenty to focus on as-is right now 😅

Those users may also find it acceptable to convert the tokenizer.model. The friction may be that the process doesn't appear to be well documented; at least, I know I struggled when I tried to search for it. Users interested in this support may want it for similarly popular models, which could potentially be served via docs guiding how to convert those (if the steps differ).

If that becomes a burden to maintain, then integrating support for tokenizer.model may be worthwhile?

@polarathene
Contributor

FWIW, here is exllamav2's approach to SentencePiece support, via a wrapper of their own that seems to match the HF tokenizers API (both extend from a common tokenizer class).

@EricLBuehler
Owner Author

I think this could be an interesting feature but, as you said, I also haven't received any issues about it. It is probably not critical, and we have other things to work on.

I find the Python tools a bit difficult to follow when I've looked at the source of some of them.

Would you be able to tell me which ones? I would like to improve their quality.

@polarathene
Contributor

Would you be able to tell me which ones?

It was early on in my learning about LLMs, so it may just have been unfamiliarity and my lack of Python skills.

I think it was the llama.cpp Python script for converting/quantizing to GGUF, and the equivalent for EXL2.

There was also the one referenced in the candle repo for converting tokenizer.model to tokenizer.json. One of the issues involved you and a maintainer there, IIRC, stating that a lib would be a bit tricky, but that a Rust CLI tool was doable in their opinion. They also said that that sort of conversion wasn't easy to give instructions for. I don't recall the linked source being that well documented either, so I assume you had to read through the source.

I hadn't tried any of them locally; perhaps they give decent CLI help output 🤷‍♂️ I was partially interested in the logic itself, for writing a converter tool in Rust.
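The rough shape I'd picture for such a tool. The Unigram mapping is the same idea as the sketch in my earlier comment, and parsing the (piece, score) vocab out of the .model protobuf is still the handwaved part:

```rust
use tokenizers::models::unigram::Unigram;
use tokenizers::Tokenizer;

/// Writes a tokenizer.json equivalent of a SentencePiece unigram vocab.
fn write_tokenizer_json(
    vocab: Vec<(String, f64)>, // parsed from the .model protobuf
    unk_id: usize,
    out_path: &str,
) -> tokenizers::Result<()> {
    let tokenizer = Tokenizer::new(Unigram::from(vocab, Some(unk_id), true)?);
    tokenizer.save(out_path, true) // pretty-printed JSON
}
```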

@EricLBuehler
Owner Author

Ok, great. I think the script is fine for now, but we should consider support for loading directly from sentencepiece as a longer-term goal.
