Can huggingface/tokenizers be supported? #24
Not sure yet whether this is easy to support; it needs some internal discussion on our side (the last link is very useful).
Looking forward to the support. By the way, is there a WeChat technical discussion group in China?
Not at the moment; you can vote for one here: #51
Let me look into the tokenizer.json question, one moment~
tokenizer.json
@loofahcus I tested with various Chinese and English inputs and the results match the Python side exactly. Great work! Could you publish the conversion script? I'll try using candle to adapt Yi-6B.
Could you explain what changes this converter makes compared to Llama's Converter? https://github.com/huggingface/transformers/blob/04ab5605fbb4ef207b10bf2772d88c53fc242e83/src/transformers/convert_slow_tokenizer.py#L1098
Also, when loading with transformers' fast tokenizer, which class should be used? Just PreTrainedTokenizerFast directly?
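On the first question, one visible difference from Llama's converter is the set of special tokens that get registered: Llama's standard tokens are `<s>`/`</s>`, while the Yi converter in this thread registers `<|startoftext|>`/`<|endoftext|>`. A tiny stdlib sketch of that difference (the Llama token list here is an assumption taken from its usual config, shown only for comparison):

```python
# Hypothetical comparison of the special-token sets registered by the two
# slow->fast converters; token lists are illustrative assumptions.
llama_special = {"<unk>", "<s>", "</s>"}
yi_special = {"<unk>", "<|startoftext|>", "<|endoftext|>"}

# Symmetric difference: tokens present in one converter but not the other.
print(sorted(llama_special ^ yi_special))
```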
```python
# Converter for the Yi tokenizer, built on transformers' SpmConverter.
from packaging import version
from tokenizers import AddedToken, Tokenizer, decoders, normalizers
from tokenizers.models import BPE, Unigram
from transformers.convert_slow_tokenizer import SentencePieceExtractor, SpmConverter


class YiConverter(SpmConverter):
    handle_byte_fallback = True

    def decoder(self, replacement, add_prefix_space):
        return decoders.Sequence(
            [
                decoders.Replace("▁", " "),
                decoders.ByteFallback(),
                decoders.Fuse(),
            ]
        )

    def tokenizer(self, proto):
        model_type = proto.trainer_spec.model_type
        vocab_scores = self.vocab(proto)
        if model_type == 1:
            import tokenizers

            if version.parse(tokenizers.__version__) < version.parse("0.14.0"):
                tokenizer = Tokenizer(Unigram(vocab_scores, 0))
            else:
                tokenizer = Tokenizer(Unigram(vocab_scores, 0, byte_fallback=True))
        elif model_type == 2:
            _, merges = SentencePieceExtractor(self.original_tokenizer.vocab_file).extract(vocab_scores)
            bpe_vocab = {word: i for i, (word, _score) in enumerate(vocab_scores)}
            tokenizer = Tokenizer(
                BPE(bpe_vocab, merges, unk_token=proto.trainer_spec.unk_piece, fuse_unk=True, byte_fallback=True)
            )
            tokenizer.add_special_tokens(
                [
                    AddedToken("<unk>", normalized=False, special=True),
                    AddedToken("<|startoftext|>", normalized=False, special=True),
                    AddedToken("<|endoftext|>", normalized=False, special=True),
                ]
            )
        else:
            raise Exception(
                "You're trying to run a `Unigram` model but your file was trained with a different algorithm"
            )
        return tokenizer

    def normalizer(self, proto):
        return normalizers.Sequence([normalizers.Replace(pattern=" ", content="▁")])

    def pre_tokenizer(self, replacement, add_prefix_space):
        return None
```

@Liangdi @ericzhou571 for reference, thanks.
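For readers unfamiliar with byte fallback: the `decoders.ByteFallback()` + `decoders.Fuse()` pair in the converter above turns tokens like `<0xE4>` (each carrying one raw UTF-8 byte) back into text, fusing consecutive byte tokens into whole characters. A simplified pure-Python sketch of that behavior (function name and details are illustrative, not the library's actual implementation):

```python
# Simplified model of ByteFallback + Fuse + the "▁"->" " Replace decoder.
def decode_with_byte_fallback(tokens):
    out = bytearray()
    for tok in tokens:
        if tok.startswith("<0x") and tok.endswith(">") and len(tok) == 6:
            out += bytes([int(tok[3:5], 16)])  # byte-fallback token: one raw byte
        else:
            out += tok.replace("▁", " ").encode("utf-8")  # ordinary piece
    return out.decode("utf-8", errors="replace")

# "中" is UTF-8 bytes E4 B8 AD: three byte tokens fuse into one character.
print(decode_with_byte_fallback(["▁hello", "<0xE4>", "<0xB8>", "<0xAD>"]))
```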
I will close this issue, feel free to reopen this issue or start a new one if you need any further assistance. |
Thanks! We've already started working on candle support on our side. You could also commit the corresponding tokenizer.json to the hf and modelscope repos, so that other developers can use it directly.
I've been using candle recently and want to add support for the Yi series. candle uses https://github.com/huggingface/tokenizers, which requires a tokenizer.json at load time. The Yi series repos don't include this file, while some other models, such as https://huggingface.co/bert-base-chinese and https://huggingface.co/Salesforce/blip-image-captioning-large, do ship it.
Looking at the transformers docs, this seems to be covered by the fast tokenizers module: https://huggingface.co/docs/transformers/fast_tokenizers
When I asked about ChatGLM earlier, the candle team replied as follows; could the Yi series be supported in the same way?
candle issue:
huggingface/candle#1177 (comment)
Some related code in transformers: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py
Below is the convert_slow_tokenizer.py that candle modified to support marian-mt:
https://github.com/huggingface/candle/blob/main/candle-examples/examples/marian-mt/convert_slow_tokenizer.py#L1262C32-L1262C32
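For context, a serialized tokenizer.json is a single JSON document describing the whole pipeline. A rough sketch of its top-level shape, based on published files like bert-base-chinese's (treat the exact field set and values as assumptions, not the spec):

```python
import json

# Illustrative skeleton of a tokenizer.json as consumed by the
# huggingface/tokenizers (and hence candle) loader; contents abbreviated.
sketch = {
    "version": "1.0",
    "truncation": None,
    "padding": None,
    "added_tokens": [],      # special tokens such as <|startoftext|>
    "normalizer": {"type": "Sequence", "normalizers": []},
    "pre_tokenizer": None,
    "post_processor": None,
    "decoder": {"type": "Sequence", "decoders": []},
    "model": {"type": "Unigram", "vocab": []},
}
text = json.dumps(sketch, ensure_ascii=False)
print(sorted(json.loads(text).keys()))
```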