Features/add unit test for token to indexers and tokenization #1
Conversation
@damien2012eng why are there changes to
Mostly cosmetic suggestions. Once these are addressed and I have some more context about what each test is supposed to be testing, I can do another round of review.
Were you able to get this set up on the Linux servers? I had to pin jsonnet at 0.13.0 to get it to build and allennlp to install correctly. Version 0.18.0 (the latest jsonnet release) seems to require newer libraries to build.

These tests, unfortunately, don't pass with allennlp 0.8.4 or 0.9. Which version were you testing against?

@desilinguist Thanks for the thoughtful review.
But upgrading AllenNLP is out of scope for this specific PR, right? The goal of this PR is to take the GECToR codebase as is and simply add unit tests. Upgrading AllenNLP can be done in a subsequent PR? |
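As an aside on the version pins discussed above: version strings like "0.13.0" and "0.9.0" compare unintuitively as plain strings, which can bite in ad-hoc version checks. A minimal sketch, using only the version numbers already mentioned in this thread:

```python
# Naive string comparison gets dotted versions wrong:
# "0.13.0" sorts before "0.9.0" lexicographically ("1" < "9"),
# even though 0.13.0 is the newer release.
assert "0.13.0" < "0.9.0"


def parse_version(v: str) -> tuple:
    """Split a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in v.split("."))


# Tuple comparison is numeric per component, so the ordering is correct:
assert parse_version("0.18.0") > parse_version("0.13.0")
assert parse_version("0.9.0") < parse_version("0.13.0")
print(parse_version("0.13.0"))  # (0, 13, 0)
```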
gector/tokenizer_indexer.py (outdated)

@@ -21,7 +21,7 @@
 # TODO(joelgrus): Figure out how to generate token_type_ids out of this token indexer.

 class TokenizerIndexer(TokenIndexer[int]):
This is a nitpick, but perhaps we should rename this to something more descriptive? I keep reading TokenIndexer instead of TokenizerIndexer. I think something like GectorTokenIndexer would work? I kind of want to point out that this is a bespoke tokenizer that GECToR uses.
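A self-contained sketch of the suggested rename. The base class here is a stand-in, since the real one comes from allennlp; the names are illustrative only, not the PR's actual code:

```python
class TokenIndexer:
    """Stand-in for allennlp's TokenIndexer base class (illustrative only)."""


class GectorTokenIndexer(TokenIndexer):
    """Bespoke wordpiece token indexer used by GECToR.

    A name prefixed with `Gector` is harder to misread as allennlp's own
    TokenIndexer than the current `TokenizerIndexer` is.
    """


# The subclass relationship is unchanged; only the name is more distinctive.
print(GectorTokenIndexer.__mro__[1].__name__)  # TokenIndexer
```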
@ksteimel Thanks for the comments! As mentioned by Nitin, I will add your suggestions in the next PR, when the actual code is modified.
gector/tokenizer_indexer.py (outdated)

@@ -21,7 +21,7 @@
 # TODO(joelgrus): Figure out how to generate token_type_ids out of this token indexer.

-class TokenizerIndexer(TokenIndexer[int]):
+class TokenizerIndexer(TokenIndexer):
     """
     A token indexer that does the wordpiece-tokenization (e.g. for BERT embeddings).
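The change above drops the `[int]` subscript on the base class. A dependency-free illustration of the difference, assuming the old base class was a `typing.Generic` (stand-in classes below; not allennlp's actual code):

```python
from typing import Generic, TypeVar

TokenType = TypeVar("TokenType")


class GenericBase(Generic[TokenType]):
    """Stand-in for a generic TokenIndexer base class (illustrative only)."""


class PlainBase:
    """Stand-in for a non-generic TokenIndexer base class (illustrative only)."""


# Old style: the subclass fixes the type parameter at int.
class SubscriptedIndexer(GenericBase[int]):
    pass


# New style: once the base class is no longer generic, the subscript must
# go, because subscripting a plain class raises a TypeError at class
# creation time.
class PlainIndexer(PlainBase):
    pass
```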
This docstring is taken from allennlp's wordpiece_indexer. I think we should change it to describe what this indexer is doing. What I gather is unique is the whole "offset_mapping" output, but I could be wrong.
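If the offset bookkeeping is indeed the distinctive part, here is a hypothetical, dependency-free sketch of what a wordpiece indexer's offsets look like. The toy tokenizer below is not GECToR's; it just splits words into fixed-size pieces with a "##" continuation marker:

```python
def wordpiece_offsets(tokens, tokenize):
    """Flatten tokens into wordpieces, recording where each token starts.

    Returns (wordpieces, offsets), where offsets[i] is the index of the
    first wordpiece produced by tokens[i] in the flattened sequence.
    """
    wordpieces, offsets = [], []
    for token in tokens:
        offsets.append(len(wordpieces))  # first wordpiece of this token
        wordpieces.extend(tokenize(token))
    return wordpieces, offsets


def toy_wordpiece(word, size=4):
    """Toy stand-in for a wordpiece tokenizer: fixed-size chunks."""
    chunks = [word[i:i + size] for i in range(0, len(word), size)]
    return [chunks[0]] + ["##" + c for c in chunks[1:]]


pieces, offsets = wordpiece_offsets(["the", "tokenizer"], toy_wordpiece)
print(pieces)   # ['the', 'toke', '##nize', '##r']
print(offsets)  # [0, 1]
```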
@damien2012eng the tests look much cleaner now! I have one specific suggestion but the rest looks good to me. I will leave it to @ksteimel to check the actual contents of each test since I am not that familiar with the tokenization workflow.
LGTM!
Team, please review this unit-testing PR.