Can huggingface/tokenizers be supported? #24

Liangdi · 2023-11-06T10:30:05Z

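I've recently been using candle and would like to add support for the Yi series. candle relies on the https://github.com/huggingface/tokenizers library, which requires a tokenizer.json file. The Yi series does not ship this file, whereas some other models, such as https://huggingface.co/bert-base-chinese and https://huggingface.co/Salesforce/blip-image-captioning-large, do.
From the transformers documentation, this appears to be the fast tokenizers module: https://huggingface.co/docs/transformers/fast_tokenizers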

When I asked about ChatGLM earlier, the candle maintainers replied as linked below. Could the Yi series support this?
candle issue:
huggingface/candle#1177 (comment)

Some related code in transformers: https://github.com/huggingface/transformers/blob/main/src/transformers/convert_slow_tokenizer.py

Below is the convert_slow_tokenizer.py that candle modified to support marian-mt:
https://github.com/huggingface/candle/blob/main/candle-examples/examples/marian-mt/convert_slow_tokenizer.py#L1262C32-L1262C32

ZhaoFancy · 2023-11-06T10:40:46Z

I'm not sure whether this is easy to support; we need to discuss it internally (the last link is very useful).

Liangdi · 2023-11-06T15:32:15Z

Looking forward to the support. By the way, is there a WeChat technical discussion group in China?

ZhaoFancy · 2023-11-07T07:32:06Z

There is no WeChat group at the moment; you can vote for one here: #51

loofahcus · 2023-11-08T01:57:32Z

I'll look into the tokenizer.json issue; please bear with me.
Thanks.

loofahcus · 2023-11-08T11:14:27Z

tokenizer.json
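@Liangdi Could you help me test whether this works with candle? I tried it briefly, and the test cases I have run so far behave as expected. However, I'm not familiar with candle, so I can't test it very thoroughly.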

Liangdi · 2023-11-08T13:08:01Z

@loofahcus I tested it with a variety of Chinese and English test data, and the results match the Python ones. Great! Could you publish the conversion script? I'll go ahead and try adapting Yi-6B to candle.
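
A consistency check like the one described above could be scripted roughly as follows. This is only a minimal sketch: `compare_tokenizers` and the two toy encoders are hypothetical stand-ins, and in practice the two callables would wrap the original Yi tokenizer and the tokenizer loaded from tokenizer.json.

```python
from typing import Callable, Iterable, List, Tuple

# An encoder maps a text to a list of token ids.
Encoder = Callable[[str], List[int]]


def compare_tokenizers(
    reference: Encoder, candidate: Encoder, samples: Iterable[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, reference_ids, candidate_ids) for each sample whose ids differ."""
    mismatches = []
    for text in samples:
        ref_ids = reference(text)
        cand_ids = candidate(text)
        if ref_ids != cand_ids:
            mismatches.append((text, ref_ids, cand_ids))
    return mismatches


# Toy stand-ins (assumptions for illustration only, not real tokenizers).
def toy_reference(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def toy_candidate(text: str) -> List[int]:
    # Deliberately drops leading/trailing whitespace to provoke a mismatch.
    return [ord(ch) for ch in text.strip()]


samples = ["你好，世界", "hello world", "  leading spaces", "tab\tand\nnewline"]
mismatches = compare_tokenizers(toy_reference, toy_candidate, samples)
for text, ref_ids, cand_ids in mismatches:
    print(f"MISMATCH for {text!r}: {len(ref_ids)} vs {len(cand_ids)} ids")
```

Including samples with leading spaces, tabs, and newlines is deliberate, since whitespace is where converted tokenizers tend to diverge first.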

ericzhou571 · 2023-11-09T02:27:41Z

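Could you tell me what changes your converter makes compared with the Llama converter? https://github.com/huggingface/transformers/blob/04ab5605fbb4ef207b10bf2772d88c53fc242e83/src/transformers/convert_slow_tokenizer.py#L1098
We started from the Llama converter and changed the special tokens in the conversion script to Yi's. Tokenization of Chinese characters is correct, but the results for whitespace still differ from those of the native Yi tokenizer.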
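
One place where whitespace handling can diverge is the normalizer. The plain-Python sketch below (it does not call the tokenizers library) contrasts two strategies: prepending "▁" before replacing spaces, which is assumed here to be what the linked LlamaConverter does, and replacing spaces only, as the YiConverter posted later in this thread does. The function names are hypothetical.

```python
SPIECE_UNDERLINE = "▁"


def normalize_with_prefix(text: str) -> str:
    """Prepend the marker, then replace every space (assumed Llama-style behaviour)."""
    return (SPIECE_UNDERLINE + text).replace(" ", SPIECE_UNDERLINE)


def normalize_replace_only(text: str) -> str:
    """Replace every space with the marker, without adding a prefix."""
    return text.replace(" ", SPIECE_UNDERLINE)


for sample in ["hello world", " hello", "你好 世界"]:
    print(
        repr(sample),
        "| with prefix:", normalize_with_prefix(sample),
        "| replace only:", normalize_replace_only(sample),
    )
```

The two strategies produce different strings for every input, so a converter that keeps the prefix step would feed the model a different sequence than one that omits it.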

ericzhou571 · 2023-11-09T02:30:45Z

One more question: when loading this with the transformers fast tokenizer, which class should be used? Should PreTrainedTokenizerFast be used directly?
🥹

loofahcus · 2023-11-09T06:46:06Z

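# Imports needed to run this snippet; the names follow transformers' convert_slow_tokenizer.py.
from packaging import version
from tokenizers import AddedToken, Tokenizer, decoders, normalizers
from tokenizers.models import BPE, Unigram
from transformers.convert_slow_tokenizer import SentencePieceExtractor, SpmConverter


class YiConverter(SpmConverter):
    handle_byte_fallback = True

    def decoder(self, replacement, add_prefix_space):
        # Map "▁" back to a space, turn byte-fallback tokens into bytes, then fuse the pieces.
        return decoders.Sequence(
            [
                decoders.Replace("▁", " "),
                decoders.ByteFallback(),
                decoders.Fuse(),
            ]
        )

    def tokenizer(self, proto):
        model_type = proto.trainer_spec.model_type
        vocab_scores = self.vocab(proto)
        if model_type == 1:
            # SentencePiece Unigram model.
            import tokenizers

            # Unigram only accepts byte_fallback from tokenizers 0.14.0 onwards.
            if version.parse(tokenizers.__version__) < version.parse("0.14.0"):
                tokenizer = Tokenizer(Unigram(vocab_scores, 0))
            else:
                tokenizer = Tokenizer(Unigram(vocab_scores, 0, byte_fallback=True))

        elif model_type == 2:
            # SentencePiece BPE model: rebuild the merges from the original vocab file.
            _, merges = SentencePieceExtractor(self.original_tokenizer.vocab_file).extract(vocab_scores)
            bpe_vocab = {word: i for i, (word, _score) in enumerate(vocab_scores)}
            tokenizer = Tokenizer(
                BPE(bpe_vocab, merges, unk_token=proto.trainer_spec.unk_piece, fuse_unk=True, byte_fallback=True)
            )
            # Register Yi's special tokens.
            tokenizer.add_special_tokens(
                [
                    AddedToken("<unk>", normalized=False, special=True),
                    AddedToken("<|startoftext|>", normalized=False, special=True),
                    AddedToken("<|endoftext|>", normalized=False, special=True),
                ]
            )
        else:
            raise ValueError(
                f"Unsupported SentencePiece model_type {model_type}; expected 1 (Unigram) or 2 (BPE)"
            )

        return tokenizer

    def normalizer(self, proto):
        # Only replace spaces with "▁"; no prefix is prepended.
        return normalizers.Sequence([normalizers.Replace(pattern=" ", content="▁")])

    def pre_tokenizer(self, replacement, add_prefix_space):
        # No pre-tokenization step.
        return None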

@Liangdi @ericzhou571 For your reference, thanks.
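
To illustrate what the decoder sequence in YiConverter is meant to achieve, here is a plain-Python sketch of its three steps: replace "▁" with a space, turn byte-fallback tokens such as `<0xE4>` back into bytes, and fuse the pieces into one string. It mimics the behaviour instead of calling the tokenizers library; `decode_pieces` is a hypothetical helper, and the handling of invalid byte sequences is simplified.

```python
import re
from typing import List

# Byte-fallback tokens look like "<0xE4>" (exactly two hex digits).
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def decode_pieces(pieces: List[str]) -> str:
    """Rough imitation of Replace("▁", " ") -> ByteFallback -> Fuse."""
    out: List[str] = []
    pending = bytearray()  # consecutive byte-fallback tokens collected so far

    def flush() -> None:
        # Decode the collected bytes as UTF-8 (simplified error handling).
        if pending:
            out.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for piece in pieces:
        piece = piece.replace("▁", " ")  # step 1: marker back to a space
        match = BYTE_TOKEN.match(piece)
        if match:
            pending.append(int(match.group(1), 16))  # step 2: collect raw bytes
        else:
            flush()
            out.append(piece)
    flush()
    return "".join(out)  # step 3: fuse everything into a single string


# "你" is E4 BD A0 and "好" is E5 A5 BD in UTF-8.
pieces = ["Hello", "▁", "<0xE4>", "<0xBD>", "<0xA0>", "<0xE5>", "<0xA5>", "<0xBD>"]
decoded = decode_pieces(pieces)
print(decoded)
```

Byte fallback matters for characters that are missing from the vocabulary: they are encoded as individual UTF-8 bytes and must be reassembled when decoding.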

loofahcus · 2023-11-10T01:28:48Z

I will close this issue. Feel free to reopen it or start a new one if you need any further assistance.

Liangdi · 2023-11-10T01:56:42Z

Thanks. We have already started working on candle support. You could also upload the corresponding tokenizer.json to the hf and modelscope repositories so that other developers can use it directly.

ZhaoFancy added the enhancement and triage labels Nov 6, 2023

loofahcus self-assigned this Nov 8, 2023

This was referenced Nov 8, 2023

Model Wishlist huggingface/candle#1177 (Open)

Can huggingface/tokenizers be supported? THUDM/ChatGLM3#122 (Closed)

jiangchengSilent pushed a commit that referenced this issue Nov 9, 2023

Merge pull request #24 from 01-ai/nxy

ec2de81

Nxy update

loofahcus closed this as completed Nov 10, 2023

loofahcus mentioned this issue Nov 13, 2023

tokenizer.json file added to the Base models #107 (Closed)

Yimi81 added the doc-not-needed label Mar 8, 2024
