This repository has been archived by the owner on Jul 4, 2023. It is now read-only.

Special tokens should be properly encoded by text_encoders #82

Closed
ricardorei opened this issue Jul 24, 2019 · 5 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@ricardorei

ricardorei commented Jul 24, 2019

Expected Behavior

encoder = MosesEncoder(["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
print(encoder.encode("<s> hello </s>"))

--CONSOLE---
tensor([3, 5, 2])

Actual Behavior

encoder = MosesEncoder(["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
print(encoder.encode("<s> hello </s>"))

--CONSOLE---
tensor([ 5, 6, 7, 8, 5, 14, 6, 7])

Explanation

Most of these tokenizers are not aware of these special tokens and end up splitting a special token into several tokens. For instance, the '<s>' token becomes '<', 's', '>'.
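For illustration (not the encoder's internal code; sacremoses's MosesTokenizer just stands in for whatever Moses tokenizer the encoder wraps), a minimal reproduction of the splitting:

# Illustration only: sacremoses stands in for the tokenizer behind MosesEncoder.
from sacremoses import MosesTokenizer

tokenizer = MosesTokenizer(lang="en")
# The markup characters are treated as punctuation, so the special tokens are
# split apart, e.g. '<s>' comes back as something like '<', 's', '>'.
print(tokenizer.tokenize("<s> hello </s>", escape=False))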

My solution to this problem was to create a method for masking special tokens and another one to restore them in place.

    def _mask_reserved_tokens(self, sequence):
        # Pull out the reserved tokens and replace each with a placeholder the
        # underlying tokenizer will not split apart. (Requires `import re`.)
        reserved_tokens = re.findall(r'\<pad\>|\<unk\>|\</s\>|\<s\>|\<copy\>', sequence)
        sequence = re.sub(r'\<pad\>|\<unk\>|\</s\>|\<s\>|\<copy\>', "RESERVEDTOKENMASK", sequence)
        return reserved_tokens, sequence

    def _restore_reserved_tokens(self, reserved_tokens, sequence):
        # Detokenize, put the reserved tokens back in order, then re-tokenize.
        sequence = _detokenize(sequence)
        for token in reserved_tokens:
            sequence = sequence.replace('RESERVEDTOKENMASK', token, 1)
        return _tokenize(sequence)

Then the encode function becomes:

    def encode(self, sequence):
        """ Encodes a ``sequence``.

        Args:
            sequence (str): String ``sequence`` to encode.

        Returns:
            torch.Tensor: Encoding of the ``sequence``.
        """
        sequence = super().encode(sequence)
        reserved_tokens, sequence = self._mask_reserved_tokens(sequence)
        sequence = self.tokenize(sequence)
        sequence = self._restore_reserved_tokens(reserved_tokens, sequence)
        vector = [self.stoi.get(token, self.unknown_index) for token in sequence]
        if self.append_eos:
            vector.append(self.eos_index)
        return torch.tensor(vector)
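A quick sanity check of the intended behaviour (hypothetical; the exact ids depend on the fitted vocabulary):

encoder = MosesEncoder(["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
# '<s>' and '</s>' now survive tokenization as single tokens, so the sentence
# encodes to three ids, as in "Expected Behavior" above.
print(encoder.encode("<s> hello </s>"))  # e.g. tensor([3, 5, 2])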

I don't know if this is just a problem that I have, but if not, I believe this should be handled natively.

@PetrochukM
Owner

PetrochukM commented Jul 25, 2019

Thanks for submitting this issue.

Is there a particular reason that the end of sentence token and the beginning of sentence token cannot be added after the fact?


By the way, your solution may be the correct approach; Google's Subword Tokenizer handles special tokens in a similar way: https://github.com/PetrochukM/PyTorch-NLP/blob/master/torchnlp/encoders/text/subword_text_tokenizer.py#L101

They use Unicode code points to escape various special characters.
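From memory, the escaping idea looks roughly like the sketch below (a hypothetical helper, not the actual subword_text_tokenizer code): characters outside the tokenizer's alphabet are replaced by an escape sequence built from their Unicode code point, which can be reversed on decode.

def escape_token(token, alphabet):
    # '\' and '_' are escaped first so the encoding stays reversible.
    token = token.replace("\\", "\\\\").replace("_", "\\u")
    # Out-of-alphabet characters become '\<code point>;'.
    return "".join(c if c in alphabet and c != "\n" else "\\%d;" % ord(c) for c in token) + "_"

print(escape_token("<s>", alphabet=set("abcdefghijklmnopqrstuvwxyz")))
# -> \60;s\62;_   ('<' and '>' written out as code points, so they cannot collide)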

@ricardorei
Author

ricardorei commented Jul 25, 2019

In my case, I'm building sequence-to-sequence models and the decoder needs the BOS and EOS tokens.

Let's suppose you have the following sample:

[{'source': 'hello world!', 'target': 'hello'}, {'source': 'Bye', 'target': 'goodbye'}]

If you collate it, you get:

{'source': ['hello world!', 'Bye' ], 'target': ['hello', 'goodbye'] }

Now if you apply batch_encode to the 'source', you get something like:

tensors = torch.tensor([[7, 19], [14, 0]])
lengths = torch.tensor([2, 1])

If I now want to add the BOS and EOS token indexes, it's harder because I have to expand the tensors and increment the lengths... I just think it's easier to add these tokens directly to the strings before vectorization.
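For example, patching the batch afterwards needs something like this (a rough sketch using the toy batch above and made-up bos_index/eos_index values):

import torch

tensors = torch.tensor([[7, 19], [14, 0]])  # padded batch from batch_encode
lengths = torch.tensor([2, 1])
bos_index, eos_index = 2, 3  # hypothetical special-token ids

# Prepending BOS is easy, but EOS has to land right after each sequence's real
# tokens, i.e. at position `length`, not at the end of the padded row.
batch_size = tensors.size(0)
bos_col = torch.full((batch_size, 1), bos_index, dtype=tensors.dtype)
pad_col = torch.zeros((batch_size, 1), dtype=tensors.dtype)
tensors = torch.cat([bos_col, tensors, pad_col], dim=1)
tensors[torch.arange(batch_size), lengths + 1] = eos_index
lengths = lengths + 2

Compare that with simply wrapping each string in '<s> ... </s>' before vectorization.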

@PetrochukM
Owner

Great. Sure thing.

There is an append_eos parameter that handles appending an EOS token. I'd be happy to accept a PR adding an append_sos or append_bos parameter.
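Usage would presumably mirror the existing flag, something like the following (hypothetical until such a PR lands):

from torchnlp.encoders.text import MosesEncoder

encoder = MosesEncoder(["hello world!", "Bye"], append_eos=True, append_sos=True)
# Each encoded sequence would then start with the SOS index and end with the EOS index.
print(encoder.encode("hello world!"))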

@PetrochukM added the enhancement, good first issue, and help wanted labels on Aug 27, 2019
@ksopyla

ksopyla commented Jun 30, 2020

I have experienced a similar problem when using CharacterEncoder:

tokenizer.encode('<s>hello</s>')

gives

tensor([1, 8, 1, 39, 14, 7, 7, 11, 1, 1, 8, 1])

where the token with id=1 is <unk>.
I prefer adding the EOS and SOS (BOS) tokens via the tokenizer; it simplifies all the processing in the seq2seq scenario.

@PetrochukM
Owner

Thanks for your contribution @wxl1999! append_sos is now supported :)
