SPMConverter does not always add the user defined symbol -> slow fast is thus not equivalent #30824

Open
ArthurZucker opened this issue May 15, 2024 · 0 comments · May be fixed by #30929


ArthurZucker commented May 15, 2024

The SPM converter (and every converter that inherits from or uses it) does not take user-defined symbols into account, leading to issues like this one: https://huggingface.co/01-ai/Yi-9B/discussions/11#6643844f6ac3fe108e3d5190

It also does not really take the prefix-space parameter into account, which can and should be extracted from the proto:

split_by_unicode_script: true
split_by_number: true
split_by_whitespace: true
treat_whitespace_as_suffix: false
allow_whitespace_only_pieces: true
split_digits: true

and

normalizer_spec {
  name: "identity"
  precompiled_charsmap: ""
  add_dummy_prefix: false
  remove_extra_whitespaces: false
  normalization_rule_tsv: ""
}
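
As a rough illustration of what "extracted from the proto" could look like, here is a minimal sketch. It assumes the parsed SentencePiece model exposes `pieces` (each with `piece` and `type`) and `normalizer_spec.add_dummy_prefix`, as in the dumps above. The function name `extract_converter_settings`, the `add_prefix_space` key, the mapping from `add_dummy_prefix` to a prefix-space setting, and the `<my_symbol>` token are all hypothetical and not taken from the actual converter code. A stand-in object replaces the real proto so the snippet runs without a model file.

```python
from types import SimpleNamespace

# Piece type value for user-defined symbols in sentencepiece_model.proto
# (enum Type: NORMAL=1, UNKNOWN=2, CONTROL=3, USER_DEFINED=4).
USER_DEFINED = 4


def extract_converter_settings(proto):
    """Collect the two settings this issue says the converter ignores.

    `proto` is expected to look like a parsed SentencePiece ModelProto.
    The returned keys are illustrative names, not a real transformers API.
    """
    user_defined = [p.piece for p in proto.pieces if p.type == USER_DEFINED]
    return {
        "user_defined_symbols": user_defined,
        # Assumption: the prefix-space behaviour follows add_dummy_prefix.
        "add_prefix_space": bool(proto.normalizer_spec.add_dummy_prefix),
    }


# Stand-in for a parsed proto; a real one would be loaded from the
# tokenizer's .model file instead.
fake_proto = SimpleNamespace(
    pieces=[
        SimpleNamespace(piece="<unk>", type=2),
        SimpleNamespace(piece="<my_symbol>", type=USER_DEFINED),
        SimpleNamespace(piece="\u2581hello", type=1),
    ],
    normalizer_spec=SimpleNamespace(add_dummy_prefix=False),
)

settings = extract_converter_settings(fake_proto)
print(settings)
```

With the `normalizer_spec` shown above (`add_dummy_prefix: false`), a converter following this sketch would not add a prefix space, and any piece typed as user-defined would be carried over as an added token.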

cc @itazap, on a more general converter!

itazap pushed a commit that referenced this issue May 21, 2024
@itazap itazap linked a pull request May 21, 2024 that will close this issue
itazap pushed a commit that referenced this issue May 30, 2024
itazap pushed a commit that referenced this issue May 31, 2024