The `<unk>` token is not actually used by the BytePairTokenizer; instead, OOV tokens are mapped to -1, which causes an index error in the embedding layer.
This only occurs when the vocabulary is limited (i.e., it doesn't contain all the bytes), for example when running an example with a small custom vocabulary rather than a preset, but adding this feature would be an improvement.
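As a rough illustration of the reported failure mode (this is a plain-Python sketch, not KerasNLP internals; the vocabulary and lookup are hypothetical), an OOV token falling back to a -1 id is an invalid index for an embedding table:

```python
# Hypothetical sketch of the OOV behavior described above; the small
# vocabulary and the tokenize() helper are illustrative, not the actual
# BytePairTokenizer implementation.

vocab = {"hello": 0, "world": 1}  # small custom vocab, no <unk> entry

def tokenize(tokens):
    # OOV tokens fall back to -1, mirroring the reported behavior
    return [vocab.get(t, -1) for t in tokens]

ids = tokenize(["hello", "unseen"])
print(ids)  # [0, -1]

# An embedding layer is conceptually a row lookup into a weight matrix
# with rows 0..vocab_size-1, so an id of -1 is out of range and a
# framework embedding layer will raise an index error for it.
embedding_rows = [[0.1, 0.2], [0.3, 0.4]]  # valid ids are 0 and 1
for i in ids:
    if not (0 <= i < len(embedding_rows)):
        print(f"id {i} is an invalid embedding index")
```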
How would we handle this for models like GPT-2, which has no unk token in its vocabulary and no index reserved for one? Seems fine to add as long as it's an optional setting for small test vocabularies.
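One possible shape for such an optional setting (all names here are hypothetical, not the actual KerasNLP API) is an `unk_token` argument that remaps the -1 sentinel only when the caller opts in, leaving GPT-2-style vocabularies untouched:

```python
# Hypothetical sketch of an opt-in unk_token setting; not the real
# BytePairTokenizer constructor.

def make_tokenizer(vocab, unk_token=None):
    # Resolve the unk id once; None means "leave OOV as -1" (GPT-2 case).
    unk_id = vocab.get(unk_token) if unk_token is not None else None

    def tokenize(tokens):
        ids = []
        for t in tokens:
            tid = vocab.get(t, -1)
            if tid == -1 and unk_id is not None:
                tid = unk_id  # remap OOV only when unk_token was configured
            ids.append(tid)
        return ids

    return tokenize

# GPT-2-like vocab with no <unk>: behavior unchanged, OOV stays -1.
gpt2_like = make_tokenizer({"a": 0, "b": 1})
print(gpt2_like(["a", "zzz"]))  # [0, -1]

# Small test vocab that reserves <unk>: OOV remapped to its id.
small = make_tokenizer({"<unk>": 0, "a": 1}, unk_token="<unk>")
print(small(["a", "zzz"]))  # [1, 0]
```

Keeping the default as `unk_token=None` preserves current behavior for presets while letting small test vocabularies opt in.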