
Add oov_token Argument to BytePairTokenizer #1466

Open
abuelnasr0 opened this issue Feb 24, 2024 · 1 comment
Labels
type:feature New feature or request

Comments

@abuelnasr0
Contributor

abuelnasr0 commented Feb 24, 2024

The <unk> token is not actually used by the BytePairTokenizer; instead, out-of-vocabulary tokens are mapped to -1, which causes an index error in the embedding layer.
This only occurs when the vocabulary is limited (i.e. it doesn't contain all the bytes), for example when trying an example with a small custom vocabulary rather than a preset, but adding this feature would still be an improvement.
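The failure mode above can be sketched in plain Python (a hypothetical toy model, not the KerasNLP implementation): OOV tokens fall through to -1, and an embedding table indexed by token id then receives an out-of-range id.

```python
# Toy sketch of the reported bug: OOV tokens map to -1, which is not a
# valid row index for an embedding table with ids 0..vocab_size-1.

vocab = {"hello": 0, "world": 1}  # small custom vocabulary

def tokenize(words, vocab):
    # OOV words fall through to -1, mirroring the reported behavior.
    return [vocab.get(w, -1) for w in words]

def embed(ids, table):
    # A framework embedding layer indexes rows by token id; -1 is out of
    # range. (A raw Python list would silently wrap to the last row,
    # which is equally wrong, so we check explicitly.)
    rows = []
    for i in ids:
        if not 0 <= i < len(table):
            raise IndexError(
                f"token id {i} out of range for table of size {len(table)}"
            )
        rows.append(table[i])
    return rows

embeddings = [[0.1, 0.2], [0.3, 0.4]]  # 2-row embedding table

ids = tokenize(["hello", "unseen"], vocab)
print(ids)  # [0, -1]
```

Calling `embed(ids, embeddings)` on this output raises the index error, since no row -1 exists in the table.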

@mattdangerw
Member

How would we handle this for things like GPT2, which has no unk token in the vocabulary and no index reserved for one? Seems fine to add, as long as it's an optional setting for small test vocabularies.
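One way to reconcile both comments is to make `oov_token` optional with a default of `None`, so GPT-2-style vocabularies keep the current behavior. The sketch below is a hypothetical toy tokenizer (names and signature are assumptions, not the actual KerasNLP API):

```python
# Hypothetical sketch of an optional oov_token argument: when set, OOV
# tokens map to its id instead of -1; when None (the default, suiting
# GPT-2-style vocabularies with no reserved unk index), behavior is
# unchanged.

class ToyBPETokenizer:
    def __init__(self, vocabulary, oov_token=None):
        self.vocabulary = dict(vocabulary)
        self.oov_token = oov_token
        if oov_token is not None and oov_token not in self.vocabulary:
            raise ValueError(
                f"oov_token {oov_token!r} must be in the vocabulary"
            )

    def token_to_id(self, token):
        if token in self.vocabulary:
            return self.vocabulary[token]
        if self.oov_token is not None:
            return self.vocabulary[self.oov_token]
        return -1  # existing behavior when no oov_token is configured

tok = ToyBPETokenizer({"hello": 0, "<unk>": 1}, oov_token="<unk>")
print(tok.token_to_id("unseen"))  # 1
```

With `oov_token=None` the same lookup would return -1, so presets without an unk token are unaffected.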

@sachinprasadhs sachinprasadhs added the type:feature New feature or request label Mar 6, 2024
Development

No branches or pull requests

3 participants