Features/add unit test for token to indexers and tokenization #1

damien2012eng
Collaborator

Team,

Please review this unit testing PR.

@desilinguist
Member

desilinguist commented Sep 12, 2022

@damien2012eng why are there changes to tokenizer_indexer.py? I thought the point of this PR was just to add the unit tests?

Member

@desilinguist desilinguist left a comment

Mostly cosmetic suggestions. Once these are addressed and I have some more context about what each test is supposed to be testing, I can do another round of review.

@ksteimel

Were you able to get this set up on the Linux servers? I had to pin jsonnet at 0.13.0 to get it to build and to get allennlp to install correctly. Version 0.18.0 (the latest jsonnet) seems to require newer libraries to build.
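
When comparing environments, it can help to print which versions are actually installed. A minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper name `installed_version` is made up for illustration and the package list is just the three packages mentioned in this thread:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages discussed in this thread.
for name in ("jsonnet", "allennlp", "transformers"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```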

@ksteimel

These tests, unfortunately, don't pass with allennlp 0.8.4 or 0.9. Which version were you testing against?
If we're not bumping versions yet, then the line `from allennlp.common import cached_transformers` needs to be changed to use `AutoTokenizer` from Hugging Face's transformers directly.
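
One possible shape for that change, as a rough sketch rather than the code in this PR: try the AllenNLP helper first and fall back to `AutoTokenizer` when the installed AllenNLP does not provide it. The function name `load_tokenizer` is hypothetical, and the fallback assumes the `transformers` package is installed.

```python
def load_tokenizer(model_name: str):
    """Load a pretrained tokenizer, whichever AllenNLP version is installed.

    `load_tokenizer` is a hypothetical helper name used for illustration.
    """
    try:
        # Newer AllenNLP releases ship this caching helper.
        from allennlp.common import cached_transformers
    except ImportError:
        # Older AllenNLP (or none installed): use transformers directly.
        # Assumes the `transformers` package is available.
        from transformers import AutoTokenizer

        return AutoTokenizer.from_pretrained(model_name)
    return cached_transformers.get_tokenizer(model_name)
```

The imports are done inside the function so that merely importing a test module does not fail on either AllenNLP version; calling `load_tokenizer("bert-base-cased")` would still need the model files to be downloadable or cached.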

@damien2012eng
Collaborator Author

@desilinguist Thanks for the thoughtful review.
The reason for changing tokenizer_indexer.py is that bumping AllenNLP to the latest version requires modifications to some of its methods, which I explained here:

grammarly@1d57e2f#r83465995

@desilinguist
Member

> @desilinguist Thanks for the thoughtful review. The reason for making changes in the tokenizer_indexer.py is that bumping AllenNLP to latest version requires modifications in some of methods, which I explained here:
>
> grammarly@1d57e2f#r83465995

But upgrading AllenNLP is out of scope for this specific PR, right? The goal of this PR is to take the GECToR codebase as is and simply add unit tests. Upgrading AllenNLP can be done in a subsequent PR?

@@ -21,7 +21,7 @@
# TODO(joelgrus): Figure out how to generate token_type_ids out of this token indexer.


class TokenizerIndexer(TokenIndexer[int]):

This is a nitpick, but perhaps we should rename this to something more descriptive? I keep reading TokenIndexer instead of TokenizerIndexer. I think maybe something like GectorTokenIndexer would work? I kind of want to point out that this is a bespoke tokenizer that gector uses.

Collaborator Author

@ksteimel Thanks for the comments! As Nitin mentioned, I will incorporate your suggestions in the next PR, when modifying the actual code.

@@ -21,7 +21,7 @@
# TODO(joelgrus): Figure out how to generate token_type_ids out of this token indexer.


-class TokenizerIndexer(TokenIndexer[int]):
+class TokenizerIndexer(TokenIndexer):
     """
     A token indexer that does the wordpiece-tokenization (e.g. for BERT embeddings).

This docstring is taken from allennlp's wordpiece_indexer. I think we should change it to describe what this indexer is actually doing. From what I gather, the unique part is the whole "offset_mapping" output, but I could be wrong.
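
To make the "offset_mapping" idea concrete, here is a minimal sketch, assuming the offsets record the index of each word's first subword piece in the flattened piece sequence. The names `toy_wordpieces` and `pieces_with_offsets` are hypothetical, and the fixed-size chunking is only a stand-in for a real wordpiece tokenizer; this is not the indexer's actual code.

```python
from typing import Callable, List, Tuple


def toy_wordpieces(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks.

    Assumes `word` is non-empty.
    """
    chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    # Mark continuation pieces with "##", as wordpiece vocabularies do.
    return [chunks[0]] + ["##" + chunk for chunk in chunks[1:]]


def pieces_with_offsets(
    words: List[str],
    split: Callable[[str], List[str]] = toy_wordpieces,
) -> Tuple[List[str], List[int]]:
    """Flatten per-word pieces and record where each word starts."""
    pieces: List[str] = []
    offsets: List[int] = []
    for word in words:
        offsets.append(len(pieces))  # index of this word's first piece
        pieces.extend(split(word))
    return pieces, offsets


pieces, offsets = pieces_with_offsets(["unit", "testing", "rocks"])
print(pieces)   # ['unit', 'test', '##ing', 'rock', '##s']
print(offsets)  # [0, 1, 3]
```

With offsets like these, a model can pick one subword vector per original word, which is presumably what a revised docstring should spell out.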

@damien2012eng damien2012eng force-pushed the features/add_unit_test_for_tokenToIndexers_and_tokenization branch 3 times, most recently from f92937a to c94803a Compare September 14, 2022 20:15
@ksteimel ksteimel self-requested a review September 14, 2022 20:16
@damien2012eng damien2012eng force-pushed the features/add_unit_test_for_tokenToIndexers_and_tokenization branch from c94803a to 87a4997 Compare September 14, 2022 20:42
Member

@desilinguist desilinguist left a comment

@damien2012eng the tests look much cleaner now! I have one specific suggestion but the rest looks good to me. I will leave it to @ksteimel to check the actual contents of each test since I am not that familiar with the tokenization workflow.

Member

@desilinguist desilinguist left a comment

LGTM!

@damien2012eng damien2012eng merged commit 5c0cf10 into master Sep 16, 2022
@damien2012eng damien2012eng deleted the features/add_unit_test_for_tokenToIndexers_and_tokenization branch September 16, 2022 00:38