
Add oov_token Argument to BytePairTokenizer #1466

Open
abuelnasr0 opened this issue Feb 24, 2024 · 1 comment
Labels
type:feature New feature or request

Comments

@abuelnasr0
Contributor

abuelnasr0 commented Feb 24, 2024

The <unk> token is not actually used by the BytePairTokenizer; instead, out-of-vocabulary tokens are mapped to -1, which causes an index error in the embedding layer.
This only occurs when the vocabulary is limited (i.e. it doesn't contain all the bytes), for example when trying an example with a small custom vocabulary rather than a preset, but adding this feature would still be an improvement.
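The failure mode above can be sketched in plain Python (a hypothetical toy model, not the KerasNLP implementation): OOV tokens fall through to -1, and an embedding table indexed by token id then receives an out-of-range id.

```python
# Toy sketch of the reported bug: OOV tokens map to -1, which is not a
# valid row index for an embedding table with ids 0..vocab_size-1.

vocab = {"hello": 0, "world": 1}  # small custom vocabulary

def tokenize(words, vocab):
    # OOV words fall through to -1, mirroring the reported behavior.
    return [vocab.get(w, -1) for w in words]

def embed(ids, table):
    # A framework embedding layer indexes rows by token id; -1 is out of
    # range. (A raw Python list would silently wrap to the last row,
    # which is equally wrong, so we check explicitly.)
    rows = []
    for i in ids:
        if not 0 <= i < len(table):
            raise IndexError(
                f"token id {i} out of range for table of size {len(table)}"
            )
        rows.append(table[i])
    return rows

embeddings = [[0.1, 0.2], [0.3, 0.4]]  # 2-row embedding table

ids = tokenize(["hello", "unseen"], vocab)
print(ids)  # [0, -1]
```

Calling `embed(ids, embeddings)` on this output raises the index error, since no row -1 exists in the table.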

@mattdangerw
Member

How would we handle this for things like GPT2, which has no unk token in the vocabulary and no index reserved for one? Seems fine to add, as long as it's an optional setting for small test vocabularies.
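One way to reconcile both comments is to make `oov_token` optional with a default of `None`, so GPT-2-style vocabularies keep the current behavior. The sketch below is a hypothetical toy tokenizer (names and signature are assumptions, not the actual KerasNLP API):

```python
# Hypothetical sketch of an optional oov_token argument: when set, OOV
# tokens map to its id instead of -1; when None (the default, suiting
# GPT-2-style vocabularies with no reserved unk index), behavior is
# unchanged.

class ToyBPETokenizer:
    def __init__(self, vocabulary, oov_token=None):
        self.vocabulary = dict(vocabulary)
        self.oov_token = oov_token
        if oov_token is not None and oov_token not in self.vocabulary:
            raise ValueError(
                f"oov_token {oov_token!r} must be in the vocabulary"
            )

    def token_to_id(self, token):
        if token in self.vocabulary:
            return self.vocabulary[token]
        if self.oov_token is not None:
            return self.vocabulary[self.oov_token]
        return -1  # existing behavior when no oov_token is configured

tok = ToyBPETokenizer({"hello": 0, "<unk>": 1}, oov_token="<unk>")
print(tok.token_to_id("unseen"))  # 1
```

With `oov_token=None` the same lookup would return -1, so presets without an unk token are unaffected.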

@sachinprasadhs sachinprasadhs added the type:feature New feature or request label Mar 6, 2024
Development

No branches or pull requests

3 participants