This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Why not add transformer tokens to vocabulary in init phrase #5510

Open
Zessay opened this issue Dec 13, 2021 · 2 comments

Comments

@Zessay

Zessay commented Dec 13, 2021

I ran into a problem yesterday when I wanted to get the vocab_size of the vocabulary namespace transformer_tags, which I specify in PretrainedTransformerIndexer. I found that this namespace isn't defined yet when I try to use it in the Model init phase. To figure out why, I read the relevant source code and found that _add_encoding_to_vocabulary_if_needed is only called from tokens_to_indices, which isn't called until _train_epoch starts.

...
class PretrainedTransformerIndexer
...
    def _add_encoding_to_vocabulary_if_needed(self, vocab: Vocabulary) -> None:
        """
        Copies tokens from ```transformers``` model's vocab to the specified namespace.
        """
        if self._added_to_vocabulary:
            return

        vocab.add_transformer_vocab(self._tokenizer, self._namespace)

        self._added_to_vocabulary = True

    @overrides
    def count_vocab_items(self, token: Token, counter: Dict[str, Dict[str, int]]):
        # If we only use pretrained models, we don't need to do anything here.
        pass

    @overrides
    def tokens_to_indices(self, tokens: List[Token], vocabulary: Vocabulary) -> IndexedTokenList:
        self._add_encoding_to_vocabulary_if_needed(vocabulary)
...

This means the transformer_tags namespace I defined in the vocabulary can only be used in the forward pass and afterwards. That causes a problem: I can't define a linear layer in Model.__init__ that projects the model output to vocab_size logits. Of course, I can use the from_pretrained_transformer constructor to build the same namespace so that it is available in the init phase. If this behavior is intended, I wonder what the purpose of _add_encoding_to_vocabulary_if_needed in PretrainedTransformerIndexer is. Why not call _add_encoding_to_vocabulary_if_needed in the __init__ method of PretrainedTransformerIndexer, so that the specified namespace can be used from the Model init phase?
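In case it's useful to anyone else hitting this, here is a minimal sketch of the workaround I mentioned above, using the from_pretrained_transformer constructor (the model name is just a placeholder, not my actual config):

from allennlp.data import Vocabulary

# Build the namespace up front from the transformer's own vocab, so its size
# is already known before training starts.
vocab = Vocabulary.from_pretrained_transformer(
    model_name="bert-base-uncased",   # placeholder model name
    namespace="transformer_tags",     # the namespace specified in the indexer
)

# Now this works when sizing an output projection in Model.__init__:
vocab_size = vocab.get_vocab_size("transformer_tags")  # e.g. 30522 for bert-base-uncased

I believe the same thing can be done from a config file by setting the vocabulary key to the "from_pretrained_transformer" type with the same model_name and namespace.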

Looking forward to your kind reply ~~ Thanks~~

@AkshitaB self-assigned this Dec 17, 2021
@github-actions

github-actions bot commented Jan 3, 2022

@AkshitaB this is just a friendly ping to make sure you haven't forgotten about this issue 😜

@dirkgr
Member

dirkgr commented Jan 15, 2022

PretrainedTransformerIndexer doesn't get to see the vocab in its __init__ method. So we can't add it there.

In principle, the way it should work is that AllenNLP discovers the vocab automatically. That's how it worked before standardized vocabularies became common practice. But it takes a long time to discover a vocab on a big dataset, and most of the time it's not necessary anymore, so we put in this hack to short-cut vocabulary discovery.

You are right that because of this hack, you don't see the whole vocab in Model.__init__(). Usually, we pass the name of the transformer model to Model.__init__(), so it can figure out what it needs from there. But it is not that elegant.
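To make that concrete, here is a rough sketch of that pattern (the class name, registration name, and hidden_dim are made up for illustration):

import torch
from allennlp.data import Vocabulary
from allennlp.models import Model
from transformers import AutoConfig


@Model.register("my_transformer_tagger")  # hypothetical registration name
class MyTransformerTagger(Model):
    def __init__(self, vocab: Vocabulary, model_name: str, hidden_dim: int = 768) -> None:
        super().__init__(vocab)
        # The transformer's own config already knows its vocab size, so we don't
        # need the "transformer_tags" namespace to be populated yet.
        transformer_vocab_size = AutoConfig.from_pretrained(model_name).vocab_size
        self.output_projection = torch.nn.Linear(hidden_dim, transformer_vocab_size)

    # forward() omitted in this sketch.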

If you have a better idea for how to do it, feel free to make a PR. I'd be happy to review it!
