Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vocab specials #230

Merged
merged 30 commits into from
Jan 13, 2021
Merged

Vocab specials #230

merged 30 commits into from
Jan 13, 2021

Conversation

mttk
Copy link
Member

@mttk mttk commented Dec 1, 2020

Some outstanding things:

  • Add masking functionality in wrapper of vocab -> prepped for this but left for next PR
  • Disable caching in Field when numericalizer isn't deterministic (via deterministic arg)
  • Make Specials a singleton class
    • Moved to just check for uniqueness in vocab constructor
  • Add static constructors (from_itos, from_stoi)
  • Add UNK filtering
  • Apply non-core specials in Field & add tests for this
  • Fix previous tests
  • Add new tests
  • Improve code documentation
  • Update docs

Closes #216

@mttk mttk requested review from mariosasko, FilipBolt and ivansmokovic and removed request for mariosasko December 16, 2020 00:25
@FilipBolt FilipBolt linked an issue Dec 17, 2020 that may be closed by this pull request
Copy link
Collaborator

@ivansmokovic ivansmokovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
I have some questions.
Not sure about Specials subclassing str. What is the main benefit of this compared to a simple Special class?

podium/storage/field.py Outdated Show resolved Hide resolved
podium/storage/field.py Outdated Show resolved Hide resolved
podium/storage/field.py Outdated Show resolved Hide resolved
@@ -32,38 +31,121 @@ def unique(values: Iterable):
yield element


class VocabDict(dict):
class Special(str):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the user expected to implement own Specials at some point or are all Specials maintained by us?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is intended for the user to subclass specials in case of new usages (e.g., Mask is a relatively new case).

podium/storage/vocab.py Outdated Show resolved Hide resolved
podium/storage/vocab.py Outdated Show resolved Hide resolved
podium/storage/vocab.py Outdated Show resolved Hide resolved
podium/storage/vocab.py Outdated Show resolved Hide resolved
podium/storage/field.py Outdated Show resolved Hide resolved
podium/storage/vocab.py Outdated Show resolved Hide resolved
podium/storage/vocab.py Outdated Show resolved Hide resolved
podium/storage/field.py Outdated Show resolved Hide resolved
docs/source/advanced.rst Outdated Show resolved Hide resolved
2. Adds a stub ``apply`` method which accepts a sequence of tokens and adds the special token to that sequence. In its essence, the apply method is a post-tokenization hook which doesn't see the raw data whose job is to add the special token to the sequence of replace some of the existing tokens with the special token. The special tokens are applied after all post-tokenization hooks in the order they are passed to the :class:`podium.storage.vocab.Vocab` constructor. Each concrete implementation of a Special token has to implement this method.
3. Implements singleton-like hash and equality checks. The ``Special`` class overrides the default hash and equals and instead of checking for string value equality, it checks for *class name equality*. We use this type of check to ensure that each Vocab has a single instance of each Special and for simpler referencing and contains checks.

To better understand how specials work, we will walk through the implementation of one of special tokens implemented in Podium: the beginning-of-sequence (BOS) token.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you maybe happen to know a resource which contains typical Specials used in NLP we could link here? After a quick Google search I could not find one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vocabs in transformers (or tokenizers? not sure where they delegated the vocab) had quite a large number of reserved tokens.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs/source/advanced.rst Show resolved Hide resolved
docs/source/advanced.rst Outdated Show resolved Hide resolved
@@ -0,0 +1,29 @@
Special tokens
===============
.. autoclass:: podium.storage.vocab.Special
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should (eventually) get refactored to omit storage, but looks good otherwise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, IDK while docs refuse to use the shortened versions. I have an idea and might try it out soon.

docs/source/advanced.rst Outdated Show resolved Hide resolved
podium/storage/field.py Outdated Show resolved Hide resolved
podium/storage/field.py Outdated Show resolved Hide resolved
tests/storage/test_vocab.py Outdated Show resolved Hide resolved
docs/source/advanced.rst Show resolved Hide resolved
@mttk
Copy link
Member Author

mttk commented Jan 12, 2021

@FilipBolt all comments addressed

@mariosasko
Copy link
Collaborator

@mttk Can you please reference the issue that this PR will close?

Copy link
Collaborator

@FilipBolt FilipBolt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, few minor comments left, but good to go in my mind

docs/source/advanced.rst Show resolved Hide resolved
podium/vocab.py Outdated Show resolved Hide resolved
podium/vocab.py Outdated Show resolved Hide resolved
podium/vocab.py Outdated
[self.stoi[token] if token in self.stoi else unk_token for token in data]
)
else:
# Either UNK is not in Vocab or the user has requested unknown tokens
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, we might need to update the docs to reflect what happens when UNK is in/out of the vocab.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented.

@mttk mttk merged commit 5273cac into master Jan 13, 2021
@mttk mttk deleted the vocab_specials branch January 13, 2021 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: specials revamp
4 participants