
Fix transformers vocab not being saved to files #4023

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Fix transformers vocab not being saved to files #4023

wants to merge 3 commits into from

Conversation

@MaksymDel (Contributor) commented Apr 4, 2020

Fixes #3097. Here we add a method to the vocabulary that populates a namespace from the transformers encoding (which is fetched in the indexer), with an option to re-save the vocab to files (the main problem in #3097).

I reserved another non-padded-namespace family, `*from_transformers`, to use by default for transformers instead of `tags`.

Not saving the transformers vocab to disk prevents `from_archive` model loading (because the vocab folder can be empty), with all the implications that follow.

I know this fix is not ideal, but I think it is an improvement over what we have currently. We could also open a separate issue dedicated to figuring out the transformers <-> allennlp relationship.
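A minimal sketch of what such a method could look like, assuming the signature shown in the diff excerpt further down. The method name `extend_from_transformers_encoding` and the `_retained_vocab_directory` attribute are hypothetical placeholders; `_token_to_index`, `_index_to_token`, and `save_to_files` are existing `Vocabulary` members:

```python
from typing import Dict

def extend_from_transformers_encoding(  # hypothetical name
    self,
    encoding_dictionary: Dict[str, int],
    namespace: str = "from_transformers",
    resave_to_files: bool = False,
) -> None:
    # Copy each token -> id pair from the transformers tokenizer into the
    # given (non-padded) namespace, keeping the transformer's own ids.
    for token, index in encoding_dictionary.items():
        self._token_to_index[namespace][token] = index
        self._index_to_token[namespace][index] = token
    # Optionally rewrite the vocabulary files so that `from_archive`
    # loading no longer finds an empty vocab folder.
    if resave_to_files:
        self.save_to_files(self._retained_vocab_directory)  # hypothetical attribute
```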

@matt-gardner (Contributor) left a comment

Thanks @maksym-del! I think this is a reasonable minimal fix to #3097. @dirkgr, you've been thinking about this a bunch too; what do you think?

allennlp/data/vocabulary.py — two outdated review comments (resolved)
```diff
@@ -28,7 +28,7 @@ class PretrainedTransformerIndexer(TokenIndexer):

     model_name : `str`
         The name of the `transformers` model to use.
-    namespace : `str`, optional (default=`tags`)
+    namespace : `str`, optional (default=`from_transformers`)
         We will add the tokens in the pytorch_transformer vocabulary to this vocabulary namespace.
         We use a somewhat confusing default value of `tags` so that we do not add padding or UNK
```

Suggested change:

```diff
-        We use a somewhat confusing default value of `tags` so that we do not add padding or UNK
+        We use `from_transformers` as the default namespace, which gets special treatment in our
+        `Vocabulary` class to not get padding or UNK tokens.
```

Actually, how are UNK and padding handled here? Is it possible to set them using the transformer? We probably do want UNK handling and padding tokens; we just need them set in the right way, both here and during loading. This might need some additional changes to how we save and load vocabularies.

And, while we're at it, we probably want the padding and unk tokens to be namespace-dependent.
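For reference, the transformer tokenizer does expose its special tokens, so setting them from the transformer is at least mechanically possible. The `pad_token`/`unk_token` attributes below are standard `transformers` API; feeding them into the AllenNLP vocabulary this way is only a suggestion:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The tokenizer knows its own special tokens, so the vocabulary could record
# these per namespace instead of assuming the global AllenNLP defaults.
padding_token = tokenizer.pad_token  # "[PAD]" for BERT
oov_token = tokenizer.unk_token      # "[UNK]" for BERT
```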

Excerpt from the new method's signature:

```python
    self,
    encoding_dictionary: Dict[str, int],
    namespace: str = "from_transformers",
    resave_to_files: bool = False,
```

This is getting to be a bigger change, but it's probably necessary to really fix this the right way. I think this function should take arguments for setting the padding and OOV tokens, and it should set them in a namespace-specific way. The padding and OOV tokens would then need to be serialized with the vocabulary and read back from the serialized files. Then we wouldn't need the `from_transformers` namespace added to the non-padded namespaces.

I would wait to do any work on this, though, until @dirkgr has had a chance to glance at it and see if he agrees. I think this is the right approach, but I could be missing something.
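Concretely, the suggestion might look something like this (a hypothetical sketch: the extra arguments and the per-namespace dicts are assumptions, not existing AllenNLP API):

```python
from typing import Dict, Optional

def extend_from_transformers_encoding(  # hypothetical name, as above
    self,
    encoding_dictionary: Dict[str, int],
    namespace: str = "from_transformers",
    padding_token: Optional[str] = None,  # e.g. tokenizer.pad_token
    oov_token: Optional[str] = None,      # e.g. tokenizer.unk_token
    resave_to_files: bool = False,
) -> None:
    # Record the special tokens per namespace so they can be serialized
    # alongside the token files and restored on load, removing the need
    # for the non-padded `from_transformers` namespace.
    if padding_token is not None:
        self._padding_token_by_namespace[namespace] = padding_token  # hypothetical
    if oov_token is not None:
        self._oov_token_by_namespace[namespace] = oov_token  # hypothetical
    ...
```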

@dirkgr (Member) commented Apr 6, 2020

It's a minimal fix, but I am very unhappy about the amount of magic going on. The way we do vocabularies already involves an untenable amount of magic, and this adds more. Saving state like this, especially the name of a file that gets overwritten, is just asking for trouble the next time this is used in a scenario we didn't think of.

I sketched an alternative here: #4034. I'll add some discussion over there. It is not as minimal as this is, but it has some other advantages.

@schmmd changed the base branch from master to main on December 23, 2020
Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close "Rethink vocabulary API"
3 participants