This repository has been archived by the owner on Jul 4, 2023. It is now read-only.
Special tokens should be properly encoded by text_encoders #82
Labels: enhancement, good first issue, help wanted
Expected Behavior
encoder = MosesEncoder(["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
print(encoder.encode("<s> hello </s>"))
--CONSOLE---
tensor([3, 5, 2])
Actual Behavior
encoder = MosesEncoder(["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
print(encoder.encode("<s> hello </s>"))
--CONSOLE---
tensor([ 5, 6, 7, 8, 5, 14, 6, 7])
Explanation
Most of these tokenizers are not aware of the special tokens and end up splitting each special token into several tokens. For instance, the '<s>' token becomes '<', 's', '>'.
My solution to this problem was to create a method for masking special tokens and another one to restore them in place.
Then the encode function becomes:
I don't know if this is just a problem that I have, but if not, I believe this should be handled natively.
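For readers who want to see the idea, here is a minimal sketch of the mask-and-restore approach described above. It is not the code from this issue and not the library's API: the helper names (`mask_special_tokens`, `restore_special_tokens`, `tokenize_with_special_tokens`), the placeholder format, and the `naive_tokenize` stand-in for a punctuation-splitting tokenizer are all assumptions made for illustration.

```python
import re

# Assumed special tokens, taken from the example in this issue.
SPECIAL_TOKENS = ["<s>", "</s>"]


def naive_tokenize(text):
    """Hypothetical stand-in for a tokenizer that splits on punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)


def mask_special_tokens(text, special_tokens=SPECIAL_TOKENS):
    """Replace each special token with an alphanumeric placeholder.

    Returns the masked text and a placeholder -> special token mapping.
    Longer tokens are masked first so that a token contained in another
    one cannot clobber it.
    """
    mapping = {}
    ordered = sorted(special_tokens, key=len, reverse=True)
    for i, token in enumerate(ordered):
        # Assumed placeholder format: letters and digits only, so a
        # punctuation-splitting tokenizer keeps it in one piece.
        placeholder = f"SPECIALTOKEN{i}X"
        mapping[placeholder] = token
        text = text.replace(token, f" {placeholder} ")
    return text, mapping


def restore_special_tokens(tokens, mapping):
    """Swap placeholders back to the original special tokens."""
    return [mapping.get(token, token) for token in tokens]


def tokenize_with_special_tokens(text, tokenize=naive_tokenize,
                                 special_tokens=SPECIAL_TOKENS):
    """Mask, tokenize, then restore, so special tokens survive intact."""
    masked, mapping = mask_special_tokens(text, special_tokens)
    return restore_special_tokens(tokenize(masked), mapping)


if __name__ == "__main__":
    print(naive_tokenize("<s> hello </s>"))
    print(tokenize_with_special_tokens("<s> hello </s>"))
```

With the stand-in tokenizer, the unmasked input splits into eight pieces, which mirrors the eight ids in the actual behavior above, while the masked path yields three tokens. A placeholder that can also occur in real input text would be restored incorrectly, so a native solution would likely need a safer scheme than this sketch uses.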