Features/add unit test for token to indexers and tokenization #1

damien2012eng · 2022-09-12T19:34:43Z

Team,

Please review this unit testing PR.

predict.py

desilinguist · 2022-09-12T19:36:58Z

@damien2012eng why are there changes to tokenizer_indexer.py? I thought the point of this PR was just to add the unit tests?

tests/test_tokenization.py

tests/test_tokenizer_indexer.py

desilinguist

Mostly cosmetic suggestions. Once these are addressed and I have some more context about what each test is supposed to be testing, I can do another round of review.

ksteimel · 2022-09-12T20:54:06Z

Were you able to get this set up on the Linux servers? I had to pin jsonnet at 0.13.0 to get it to build and allennlp to install correctly. Version 0.18.0 (the latest jsonnet) seems to require newer libraries to build.

ksteimel · 2022-09-12T21:22:29Z

These tests, unfortunately, don't pass with allennlp 0.8.4 or 0.9. Which version were you testing against?
If we're not bumping versions yet, then from allennlp.common import cached_transformers needs to be changed to use AutoTokenizer from huggingface's transformers directly.
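
For illustration, here is a minimal sketch of what replacing the cached_transformers import could look like if the code calls AutoTokenizer directly. The get_tokenizer helper, the module-level cache, and the loader parameter are hypothetical names introduced for this example and are not part of the GECToR codebase; the only real library call assumed is AutoTokenizer.from_pretrained from transformers.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical module-level cache, keyed by model name plus keyword arguments.
_TOKENIZER_CACHE: Dict[Tuple[str, Tuple[Tuple[str, Any], ...]], Any] = {}


def _default_loader(model_name: str, **kwargs: Any) -> Any:
    # Imported lazily so that this module can be imported without transformers.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name, **kwargs)


def get_tokenizer(
    model_name: str,
    loader: Callable[..., Any] = _default_loader,
    **kwargs: Any,
) -> Any:
    """Return a tokenizer for model_name, loading it at most once.

    Assumes every keyword argument value is hashable, since the
    arguments form part of the cache key.
    """
    key = (model_name, tuple(sorted(kwargs.items())))
    if key not in _TOKENIZER_CACHE:
        _TOKENIZER_CACHE[key] = loader(model_name, **kwargs)
    return _TOKENIZER_CACHE[key]
```

The loader argument exists only so the caching behavior can be unit-tested with a stub, without downloading a model.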

damien2012eng · 2022-09-13T13:49:11Z

@desilinguist Thanks for the thoughtful review.
The reason for the changes in tokenizer_indexer.py is that bumping AllenNLP to the latest version requires modifications to some of the methods, which I explained here:

grammarly@1d57e2f#r83465995

desilinguist · 2022-09-13T13:52:13Z

@desilinguist Thanks for the thoughtful review. The reason for the changes in tokenizer_indexer.py is that bumping AllenNLP to the latest version requires modifications to some of the methods, which I explained here:

grammarly@1d57e2f#r83465995

But upgrading AllenNLP is out of scope for this specific PR, right? The goal of this PR is to take the GECToR codebase as is and simply add unit tests. Upgrading AllenNLP can be done in a subsequent PR?

ksteimel · 2022-09-13T13:57:25Z

gector/tokenizer_indexer.py

@@ -21,7 +21,7 @@
 # TODO(joelgrus): Figure out how to generate token_type_ids out of this token indexer.


-class TokenizerIndexer(TokenIndexer[int]):


This is a nitpick, but perhaps we should rename this to something more descriptive? I keep reading TokenIndexer instead of TokenizerIndexer. I think maybe something like GectorTokenIndexer would work? I kind of want to point out that this is a bespoke tokenizer that gector uses.

@ksteimel Thanks for the comments! As mentioned by Nitin, I will add your suggestions in the next PR when modifying the actual code.

ksteimel · 2022-09-13T15:02:05Z

gector/tokenizer_indexer.py

@@ -21,7 +21,7 @@
 # TODO(joelgrus): Figure out how to generate token_type_ids out of this token indexer.


-class TokenizerIndexer(TokenIndexer[int]):
+class TokenizerIndexer(TokenIndexer):
    """
    A token indexer that does the wordpiece-tokenization (e.g. for BERT embeddings).


This docstring is taken from allennlp's wordpiece_indexer. I think we should change it to describe what this indexer is actually doing. From what I gather, the unique part is the "offset_mapping" output, but I could be wrong.
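
To make the offset idea concrete, here is a rough sketch of how a word-to-wordpiece offset mapping can be computed. It is not the GECToR implementation: toy_wordpiece is a hypothetical stand-in for a real subword tokenizer, and pieces_with_offsets is a name made up for this example. The assumption is that each offset records the index of the first wordpiece of the corresponding word, and that every word is non-empty.

```python
from typing import Callable, List, Tuple


def toy_wordpiece(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical splitter: cut a word into chunks of max_len characters,
    marking every chunk after the first with a '##' prefix."""
    chunks = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [c if i == 0 else "##" + c for i, c in enumerate(chunks)]


def pieces_with_offsets(
    words: List[str],
    split: Callable[[str], List[str]] = toy_wordpiece,
) -> Tuple[List[str], List[int]]:
    """Flatten the words into wordpieces and record, for each word,
    the index of its first wordpiece in the flattened list."""
    pieces: List[str] = []
    offsets: List[int] = []
    for word in words:
        offsets.append(len(pieces))  # position where this word's pieces start
        pieces.extend(split(word))
    return pieces, offsets


pieces, offsets = pieces_with_offsets(["unit", "testing", "rocks"])
print(pieces)   # ['unit', 'test', '##ing', 'rock', '##s']
print(offsets)  # [0, 1, 3]
```

With offsets like these, a model can select one wordpiece vector per original word, which would be the kind of behavior worth describing in the docstring if the indexer does work this way.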

tests/test_tokenization.py

desilinguist

@damien2012eng the tests look much cleaner now! I have one specific suggestion but the rest looks good to me. I will leave it to @ksteimel to check the actual contents of each test since I am not that familiar with the tokenization workflow.

tests/test_token_indexer.py

tests/test_tokenization.py

desilinguist

LGTM!

damien2012eng requested review from desilinguist, mulhod, Frost45 and ksteimel September 12, 2022 19:34

desilinguist requested review from slava92 and tamarl08 September 12, 2022 19:35