Support loading tokenizer from sentencepiece model #407
Comments
Something that isn't very straightforward, apparently, from upstream advice? 🤔

An alternative option is deferring to a converter, if you can reliably convert […]. I find the Python tools a bit difficult to follow when I've looked at the source of some of them.

Is it not able to leverage that? I thought I had read that Unigram was effectively SentencePiece? If the configuration can be mapped to the […].

My earlier questions for the Unigram decoder/normalizer config seem to be covered by the […]. At a glance I can see why it may be preferable to adopt the crate instead of trying to parse it into […].

Is there much demand for this feature? I wouldn't bother with it while […]. Those users may also find it acceptable to convert the […]. If that becomes a burden to maintain, then […].
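The "Unigram is effectively SentencePiece" point above can be sketched concretely: a SentencePiece Unigram vocabulary is, at its core, a list of (piece, log-probability score) pairs, which is the same shape the `tokenizers` crate's `Unigram` model consumes (`Vec<(String, f64)>`). The types below are illustrative stand-ins, not the real crate API; actually parsing the `.model` protobuf is out of scope here.

```rust
/// Hypothetical stand-in for one entry parsed out of a sentencepiece
/// `.model` protobuf (each entry carries a piece and a log-prob score).
#[derive(Debug)]
struct SentencePieceEntry {
    piece: String,
    score: f64,
}

/// Map parsed sentencepiece entries into the `(token, score)` vocab shape
/// that a Unigram tokenizer model consumes directly.
fn to_unigram_vocab(entries: &[SentencePieceEntry]) -> Vec<(String, f64)> {
    entries
        .iter()
        .map(|e| (e.piece.clone(), e.score))
        .collect()
}

fn main() {
    let entries = vec![
        SentencePieceEntry { piece: "<unk>".into(), score: 0.0 },
        SentencePieceEntry { piece: "▁the".into(), score: -3.2 },
    ];
    let vocab = to_unigram_vocab(&entries);
    assert_eq!(vocab[1].0, "▁the");
    println!("{} vocab entries", vocab.len());
}
```

If this mapping holds, the remaining work is translating the normalizer/decoder configuration, which is where the two formats actually diverge.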
FWIW, here is […].
I think that this could be an interesting feature, but as you said, I also haven't received any issues about this. It is probably not critical, and we have other things to work on.
Would you be able to tell me which ones? I would like to improve their quality.
It was early on into learning about LLMs, so it may have just been unfamiliarity and my lack of Python skills. I think it was the […]. There was also the one referenced in the candle repo for converting […].

I had not tried any of them locally; perhaps they gave decent CLI help output 🤷♂️ I was partially interested in the logic itself, for writing a converter tool in Rust.
Ok, great. I think the script right now is fine, but we should consider support for loading directly from sentencepiece as a longer-term goal.
Currently, if a sentencepiece `.model` file is provided, the user must run a provided script to convert it into the equivalent `tokenizer.json`. By supporting `sentencepiece` models directly, we can avoid this requirement.

Note: This requires creating a `TokenizerLike` trait and abstracting the tokenizer source file loading process, which is already complicated given the need to support loading from GGUF tokenizers.

Potential crate: https://docs.rs/sentencepiece/latest/sentencepiece/
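A minimal sketch of what the proposed `TokenizerLike` abstraction could look like, with a toy whitespace-vocab backend standing in for a loaded `tokenizer.json`. All names and signatures here are hypothetical illustrations, not the project's actual API; a sentencepiece- or GGUF-backed implementation would simply be another impl of the same trait.

```rust
use std::collections::HashMap;

/// Hypothetical common interface that every tokenizer source
/// (tokenizer.json, sentencepiece .model, GGUF-embedded vocab, ...)
/// would implement, so callers never care where the vocab came from.
trait TokenizerLike {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, ids: &[u32]) -> String;
    fn vocab_size(&self) -> usize;
}

/// Toy backend standing in for a parsed tokenizer.json; a real
/// implementation would deserialize the JSON instead of taking a slice.
struct JsonTokenizer {
    vocab: HashMap<String, u32>,
    reverse: HashMap<u32, String>,
}

impl JsonTokenizer {
    fn new(tokens: &[&str]) -> Self {
        let vocab: HashMap<String, u32> = tokens
            .iter()
            .enumerate()
            .map(|(i, t)| (t.to_string(), i as u32))
            .collect();
        let reverse = vocab.iter().map(|(t, &i)| (i, t.clone())).collect();
        Self { vocab, reverse }
    }
}

impl TokenizerLike for JsonTokenizer {
    // Naive whitespace lookup, purely for illustration.
    fn encode(&self, text: &str) -> Vec<u32> {
        text.split_whitespace()
            .filter_map(|w| self.vocab.get(w).copied())
            .collect()
    }
    fn decode(&self, ids: &[u32]) -> String {
        ids.iter()
            .filter_map(|i| self.reverse.get(i).cloned())
            .collect::<Vec<_>>()
            .join(" ")
    }
    fn vocab_size(&self) -> usize {
        self.vocab.len()
    }
}

fn main() {
    // Callers only see the trait object, so swapping in a
    // sentencepiece-backed implementation would not change downstream code.
    let tok: Box<dyn TokenizerLike> = Box::new(JsonTokenizer::new(&["hello", "world"]));
    let ids = tok.encode("hello world");
    assert_eq!(ids, vec![0, 1]);
    assert_eq!(tok.decode(&ids), "hello world");
}
```

The design point is that the loading complexity (JSON vs. protobuf vs. GGUF metadata) stays behind the trait boundary, which is what makes adding a sentencepiece loader tractable later.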