This repository has been archived by the owner on Jul 4, 2023. It is now read-only.

Special tokens should be properly encoded by text_encoders #82

Closed
ricardorei opened this issue Jul 24, 2019 · 5 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@ricardorei

ricardorei commented Jul 24, 2019

Expected Behavior

encoder = MosesEncoder(["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
print(encoder.encode("<s> hello </s>"))

--CONSOLE---
tensor([3, 5, 2])

Actual Behavior

encoder = MosesEncoder(["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
print(encoder.encode("<s> hello </s>"))

--CONSOLE---
tensor([ 5, 6, 7, 8, 5, 14, 6, 7])

Explanation

Most of these tokenizers are not aware of these special tokens and end up splitting a special token into several tokens. For instance, the '<s>' token becomes '<', 's', '>'.
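For illustration (not the encoder's internal code; sacremoses's MosesTokenizer just stands in for whatever Moses tokenizer the encoder wraps), a minimal reproduction of the splitting:

# Illustration only: sacremoses stands in for the tokenizer behind MosesEncoder.
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer(lang="en")
# The markup characters are treated as punctuation, so the special tokens are
# split apart, e.g. '<s>' comes back as something like '<', 's', '>'.
print(tokenizer.tokenize("<s> hello </s>", escape=False))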

My solution to this problem was to create a method for masking special tokens and another one to restore them in place.

    def _mask_reserved_tokens(self, sequence):
        # Pull out the reserved tokens and replace each with a placeholder the
        # underlying tokenizer will not split apart. (Requires `import re`.)
        reserved_tokens = re.findall(r'\<pad\>|\<unk\>|\</s\>|\<s\>|\<copy\>', sequence)
        sequence = re.sub(r'\<pad\>|\<unk\>|\</s\>|\<s\>|\<copy\>', "RESERVEDTOKENMASK", sequence)
        return reserved_tokens, sequence

    def _restore_reserved_tokens(self, reserved_tokens, sequence):
        # Detokenize, put the reserved tokens back in order, then re-tokenize.
        sequence = _detokenize(sequence)
        for token in reserved_tokens:
            sequence = sequence.replace('RESERVEDTOKENMASK', token, 1)
        return _tokenize(sequence)

Then the encode function becomes:

    def encode(self, sequence):
        """ Encodes a ``sequence``.

        Args:
            sequence (str): String ``sequence`` to encode.

        Returns:
            torch.Tensor: Encoding of the ``sequence``.
        """
        sequence = super().encode(sequence)
        reserved_tokens, sequence = self._mask_reserved_tokens(sequence)
        sequence = self.tokenize(sequence)
        sequence = self._restore_reserved_tokens(reserved_tokens, sequence)
        vector = [self.stoi.get(token, self.unknown_index) for token in sequence]
        if self.append_eos:
            vector.append(self.eos_index)
        return torch.tensor(vector)
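A quick sanity check of the intended behaviour (hypothetical; the exact ids depend on the fitted vocabulary):

encoder = MosesEncoder(["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
# '<s>' and '</s>' now survive tokenization as single tokens, so the sentence
# encodes to three ids, as in "Expected Behavior" above.
print(encoder.encode("<s> hello </s>"))  # e.g. tensor([3, 5, 2])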

I don't know if this is just a problem that I have, but if not, I believe this should be handled natively.

@PetrochukM
Owner

PetrochukM commented Jul 25, 2019

Thanks for submitting this issue.

Is there a particular reason that the end of sentence token and the beginning of sentence token cannot be added after the fact?


By the way, your solution may be the correct approach; Google's Subword Tokenizer handles special tokens in a similar way: https://github.com/PetrochukM/PyTorch-NLP/blob/master/torchnlp/encoders/text/subword_text_tokenizer.py#L101

They use Unicode code points to escape various special characters.
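From memory, the escaping idea looks roughly like the sketch below (a hypothetical helper, not the actual subword_text_tokenizer code): characters outside the tokenizer's alphabet are replaced by an escape sequence built from their Unicode code point, which can be reversed on decode.

def escape_token(token, alphabet):
    # '\' and '_' are escaped first so the encoding stays reversible.
    token = token.replace("\\", "\\\\").replace("_", "\\u")
    # Out-of-alphabet characters become '\<code point>;'.
    return "".join(c if c in alphabet and c != "\n" else "\\%d;" % ord(c) for c in token) + "_"

print(escape_token("<s>", alphabet=set("abcdefghijklmnopqrstuvwxyz")))
# -> \60;s\62;_   ('<' and '>' written out as code points, so they cannot collide)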

@ricardorei
Author

ricardorei commented Jul 25, 2019

In my case, I'm building sequence-to-sequence models and the decoder needs the BOS and EOS tokens.

Let's suppose you have the following sample:

[{'source': 'hello world!', 'target': 'hello'}, {'source': 'Bye', 'target': 'goodbye'}]

If you collate it, you get:

{'source': ['hello world!', 'Bye' ], 'target': ['hello', 'goodbye'] }

Now if you apply batch_encode to the 'source', you get something like:

tensors = torch.tensor([[7, 19], [14, 0]])
lengths = torch.tensor([2, 1])

If I now want to add the BOS and EOS token indexes, it's harder because I have to expand the tensors and increment the lengths... I just think it's easier to add these tokens directly to the strings before vectorization.
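For example, patching the batch afterwards needs something like this (a rough sketch using the toy batch above and made-up bos_index/eos_index values):

import torch

tensors = torch.tensor([[7, 19], [14, 0]])  # padded batch from batch_encode
lengths = torch.tensor([2, 1])
bos_index, eos_index = 2, 3  # hypothetical special-token ids

# Prepending BOS is easy, but EOS has to land right after each sequence's real
# tokens, i.e. at position `length`, not at the end of the padded row.
batch_size = tensors.size(0)
bos_col = torch.full((batch_size, 1), bos_index, dtype=tensors.dtype)
pad_col = torch.zeros((batch_size, 1), dtype=tensors.dtype)
tensors = torch.cat([bos_col, tensors, pad_col], dim=1)
tensors[torch.arange(batch_size), lengths + 1] = eos_index
lengths = lengths + 2

Compare that with simply wrapping each string in '<s> ... </s>' before vectorization.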

@PetrochukM
Owner

Great. Sure thing.

There is an append_eos parameter that handles appending an EOS token. I'd be happy to accept a PR adding an append_sos or append_bos parameter.
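Usage would presumably mirror the existing flag, something like the following (hypothetical until such a PR lands):

from torchnlp.encoders.text import MosesEncoder

encoder = MosesEncoder(["hello world!", "Bye"], append_eos=True, append_sos=True)
# Each encoded sequence would then start with the SOS index and end with the EOS index.
print(encoder.encode("hello world!"))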

@PetrochukM added the enhancement, good first issue, and help wanted labels on Aug 27, 2019
@ksopyla

ksopyla commented Jun 30, 2020

I have experienced a similar problem when using CharacterEncoder:

tokenizer.encode('<s>hello</s>')

gives

tensor([1, 8, 1, 39, 14, 7, 7, 11, 1, 1, 8, 1])

where the token with id=1 is <unk>.
I prefer adding the EOS and SOS (BOS) tokens via the tokenizer; it simplifies all the processing in the seq2seq scenario.

@PetrochukM
Owner

Thanks for your contribution @wxl1999! append_sos is now supported :)
