Fix fast tokenizer swallows prefix space when there are too many white spaces #992

Merged: 5 commits merged into InternLM:main on Jan 31, 2024

Conversation

AllentDan (Collaborator) commented on Jan 18, 2024:

The previous detokenization logic cannot handle runs of empty tokens produced by LlamaTokenizer, e.g. with the upstage/SOLAR-0-70b-16bit model. In that case the incrementally decoded output may be missing a whitespace.

This PR is not BC-breaking, since the fix is implemented as a new function named detokenize_incrementally.
However, full tests are still required.
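
For context, here is a minimal sketch of the failure mode, assuming a SentencePiece-based Llama-family tokenizer; the checkpoint name and the piecewise-decode pattern are illustrative, not the PR's exact repro:

# Hypothetical repro: decoding a suffix of tokens on its own can swallow
# the leading space that SentencePiece encodes as a "▁" prefix piece.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('upstage/SOLAR-0-70b-16bit')
text = 'a    b'  # multiple consecutive spaces
ids = tok.encode(text, add_special_tokens=False)

whole = tok.decode(ids)
piecewise = tok.decode(ids[:1]) + tok.decode(ids[1:])
# On affected tokenizers `piecewise` ends up with fewer spaces than
# `whole`, because each decode() call strips the prefix space of its
# first token.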

AllentDan closed this on Jan 18, 2024
AllentDan reopened this on Jan 23, 2024
Merge conflicts were resolved in:
	lmdeploy/turbomind/hf_repo/modeling_lmdeploy.py
lvhan028 (Collaborator) commented:

Could you provide an example to reproduce the issue?

AllentDan (Collaborator, Author) replied:

> Could you provide an example to reproduce the issue?

Yes, the case is now covered by unit tests. Here is an example against the main branch:

def test_tokenizer(model_path, input):
    from lmdeploy.tokenizer import HuggingFaceTokenizer
    tokenizer = HuggingFaceTokenizer(model_path)
    encoded = tokenizer.encode(input, False)
    output = ''
    offset = 0
    for i in range(1, len(encoded) + 1):
        decoded = tokenizer.decode(encoded[:i], offset)
        # '�' (U+FFFD) marks an incomplete multi-byte character;
        # wait for more tokens before emitting text.
        if decoded.endswith('�'):
            continue
        output += decoded
        offset = i
    assert input == output, 'input string should equal output after encode-decode'

test_tokenizer('01-ai/Yi-34B-Chat', 'a    b')

AllentDan (Collaborator, Author) commented:

This change did not affect the performance of profile_throughput.py.

A Collaborator commented on this diff hunk:

output_tokens = prev_tokens + new_tokens
prev_tokens += new_tokens

prefix_text = self._convert_tokens_to_string_with_added_encoders(

I looked at vLLM's implementation. There is a branch in it:

if tokenizer.is_fast or not tokenizer.get_added_vocab():
    prefix_text = tokenizer.convert_tokens_to_string(
        output_tokens[prefix_offset:read_offset])
    new_text = tokenizer.convert_tokens_to_string(
        output_tokens[prefix_offset:])

But we don't have this branch here. What is the reason?

AllentDan (Collaborator, Author) replied:

It was moved into the _convert_tokens_to_string_with_added_encoders function.
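
For readers following the thread, here is a condensed sketch of the vLLM-style incremental detokenization being discussed; the function body is illustrative (the names prefix_offset and read_offset follow the snippet quoted above), not the exact code merged in this PR:

def detokenize_incrementally(tokenizer, output_tokens, prefix_offset,
                             read_offset):
    # Fast tokenizers (and slow ones without added vocab) can decode
    # token slices via convert_tokens_to_string, which keeps the "▁"
    # space pieces that a bare decode() of a suffix would swallow.
    if tokenizer.is_fast or not tokenizer.get_added_vocab():
        prefix_text = tokenizer.convert_tokens_to_string(
            output_tokens[prefix_offset:read_offset])
        new_text = tokenizer.convert_tokens_to_string(
            output_tokens[prefix_offset:])
    else:
        # Added tokens need per-token handling; in this PR that path
        # lives in _convert_tokens_to_string_with_added_encoders.
        raise NotImplementedError('added-vocab path omitted in sketch')
    if len(new_text) > len(prefix_text) and not new_text.endswith('�'):
        # Emit only the delta and slide both offsets forward.
        return (new_text[len(prefix_text):], read_offset,
                len(output_tokens))
    # Incomplete multi-byte character: emit nothing, keep the offsets.
    return '', prefix_offset, read_offset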

lvhan028 merged commit 29b74d5 into InternLM:main on Jan 31, 2024. All 4 checks passed.