
Addition of TextTester class and update tests for current text metrics. #450

Merged

Conversation

karthikrangasai
Contributor

@karthikrangasai karthikrangasai commented Aug 16, 2021

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #411 .

To Do:

  • Add TextTester class.
  • Update testing for BLEU Score
  • Update testing for ROUGE Metric
  • Update testing for WER

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@codecov

codecov bot commented Aug 16, 2021

Codecov Report

Merging #450 (d5ebbfc) into master (313c868) will decrease coverage by 0%.
The diff coverage is 99%.

@@          Coverage Diff          @@
##           master   #450   +/-   ##
=====================================
- Coverage      96%    96%   -0%     
=====================================
  Files         130    130           
  Lines        4338   4357   +19     
=====================================
+ Hits         4157   4174   +17     
- Misses        181    183    +2     

@SkafteNicki SkafteNicki added this to the v0.6 milestone Aug 16, 2021
@Borda Borda added enhancement New feature or request test / CI testing or CI labels Aug 16, 2021
@Borda Borda added this to In progress in Text via automation Aug 16, 2021
@Borda Borda assigned justusschock and unassigned justusschock Aug 16, 2021
Member

@SkafteNicki SkafteNicki left a comment


okay, a mix of comments. Looking great so far and will really improve the testing of our text metrics

Review comments on: torchmetrics/text/bleu.py, tests/text/test_wer.py, tests/text/test_rouge.py, tests/text/helpers.py
@karthikrangasai karthikrangasai changed the title [wip] Addition of TextTester class. Addition of TextTester class and update tests for current text metrics. Aug 21, 2021
@karthikrangasai karthikrangasai marked this pull request as ready for review August 21, 2021 11:42
@karthikrangasai
Contributor Author

Updating the implementation of ROUGE by adding a state helped get rid of some failed tests, but the class-based implementation tests still fail when ddp=True.
Can I get insights on this issue? It would be helpful.

Member

@SkafteNicki SkafteNicki left a comment


going to take a look at the failing test

Review comments on: tests/helpers/testers.py
@Borda
Member

Borda commented Aug 26, 2021

@karthikrangasai mind checking the failing tests? 🐰

@karthikrangasai
Contributor Author

karthikrangasai commented Aug 26, 2021

@Borda I'm working on them. The ROUGE metric is failing the tests, and the implementation might have some errors. I'm currently debugging it and trying fixes.

They fail specifically for ddp=True case.

@Borda
Member

Borda commented Aug 26, 2021

The ROUGE metric is failing the tests and maybe the implementation might have some errors

hmm, that would make this test much more valuable if it helps to find some hidden bugs 🐰
cc: @stancld @SkafteNicki

@stancld
Contributor

stancld commented Aug 26, 2021

@Borda @karthikrangasai I think the root of this problem is likely in how the intermediate results are stored after calling the update method of ROUGEScore.
cc: @SkafteNicki

@karthikrangasai
Contributor Author

karthikrangasai commented Aug 26, 2021

@stancld One thing is that no state is used for storing the values.

Another thing I noticed is that we compute scores for every example pair, whereas I feel we should compute each score for all examples together.

I mean instead of

for pred, target in zip(preds, targets):
    # compute metric for this example

I think we should be doing

for rouge_key in self.rouge_keys_given:
     # compute metric for all examples together in the batch

What do you think about this ?
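The two loop structures being contrasted can be sketched side by side. This is a hypothetical illustration of the discussion, not the actual torchmetrics code; `score_fn`, `update_per_example`, and `update_per_key` are stand-in names.

```python
from typing import Callable, Dict, List


def update_per_example(
    preds: List[str], targets: List[str], score_fn: Callable[[str, str], float]
) -> List[float]:
    # Current style: compute the metric once per (pred, target) pair.
    return [score_fn(pred, target) for pred, target in zip(preds, targets)]


def update_per_key(
    preds: List[str],
    targets: List[str],
    rouge_keys_given: List[str],
    score_fn: Callable[[str, List[str], List[str]], float],
) -> Dict[str, float]:
    # Proposed style: for each ROUGE key, score the whole batch at once.
    return {key: score_fn(key, preds, targets) for key in rouge_keys_given}
```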

@SkafteNicki
Member

The root of the problem in the current implementation is that the dict returned by the _update call is not a dict of tensors but a dict of floats. Since the sync is only applied to tensors:
https://github.com/PyTorchLightning/metrics/blob/94a158c2674e197620a79010ba1d78ea57706774/torchmetrics/metric.py#L224-L229
we end up not syncing the dict, and you essentially just get the result from the local process.

The simple fix is to cast the values in the dict to tensors and then do a bit of reformatting in the compute call after the sync.
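A minimal sketch of that fix, with illustrative helper names (not the actual torchmetrics internals): cast the per-call float scores to tensors so the DDP sync, which only gathers tensors, picks them up; then reduce the gathered tensors in compute.

```python
import torch


def cast_scores_to_tensors(scores: dict) -> dict:
    # Run inside update(): plain floats would be skipped by the
    # tensor-only sync, so the dict values must become tensors.
    return {key: torch.tensor(value, dtype=torch.float32) for key, value in scores.items()}


def reduce_synced_scores(synced: dict) -> dict:
    # Run inside compute(): after the sync each value may be a stacked
    # tensor with one entry per process; reduce back to a single float.
    return {key: value.mean().item() for key, value in synced.items()}
```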

@stancld
Contributor

stancld commented Aug 26, 2021

> @stancld One thing is that no state is used for storing the values. Another thing I noticed is that we compute scores for every example pair, whereas I feel we should compute each score for all examples together. […] What do you think about this?

@karthikrangasai ROUGE is a sentence-level metric (unlike BLEU, which is designed to be a corpus-level one). The score should thus be computed separately for each sentence.
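The distinction can be illustrated with simplified stand-ins (not the real ROUGE/BLEU implementations): a sentence-level metric scores each pair separately and averages, while a corpus-level metric pools statistics over all pairs before computing a single score.

```python
from typing import Callable, List, Tuple


def sentence_level_score(
    preds: List[str], targets: List[str], score_fn: Callable[[str, str], float]
) -> float:
    # ROUGE-style: score each sentence pair on its own, then average.
    scores = [score_fn(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


def corpus_level_score(
    preds: List[str], targets: List[str], count_fn: Callable[[str, str], Tuple[int, int]]
) -> float:
    # BLEU-style: accumulate matched/total counts over the whole corpus
    # first, then compute one score from the pooled statistics.
    matched = total = 0
    for p, t in zip(preds, targets):
        m, n = count_fn(p, t)
        matched += m
        total += n
    return matched / total
```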

Member

@SkafteNicki SkafteNicki left a comment


LGTM, great to see more robust testing of the new text metrics :]

Review comments on: tests/helpers/testers.py, torchmetrics/text/rouge.py
@mergify mergify bot added the ready label Aug 26, 2021
@Borda
Member

Borda commented Aug 27, 2021

@karthikrangasai the GPU tests were killed after 45 min, while on master they run for about 20...
Seems like something is hanging there; mind having a look? :]
https://dev.azure.com/PytorchLightning/Metrics/_build/results?buildId=27573&view=logs&j=3afc50db-e620-5b81-6016-870a6976ad29&t=98354d77-e326-51ec-536f-1549451db1fa

@karthikrangasai
Contributor Author

karthikrangasai commented Aug 27, 2021

@Borda
torchmetrics.image.lpip_similarity.LPIPS is taking too long to load, I think. I hadn't made any changes to that file.
Should I add # doctest: +SKIP even though it wasn't present before?

@Borda
Member

Borda commented Aug 27, 2021

torchmetrics.image.lpip_similarity.LPIPS is taking too long to load I think. I hadn't made any change to that file.

@SkafteNicki thought?

@SkafteNicki
Member

The LPIPS test seems to be passing now, so it may have been a bad connection when it was tested last time.

@karthikrangasai
Contributor Author

karthikrangasai commented Aug 27, 2021

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

This means some of the tensors are not being placed on the GPU, right?

Edit: The compute function of the functional implementation of the BLEU metric did not create tensors on the same device, which I have fixed now.
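A hedged sketch of this class of bug: torch.tensor(...) without a device argument allocates on the CPU, which raises the "same device" RuntimeError once the metric state lives on the GPU. The function names and the min-with-one clamp are illustrative, not the exact BLEU code.

```python
import torch


def clamp_ratio_buggy(numerator: torch.Tensor, denominator: torch.Tensor) -> torch.Tensor:
    one = torch.tensor(1.0)  # always on CPU -> mismatch when inputs are on CUDA
    return torch.min(one, numerator / denominator)


def clamp_ratio_fixed(numerator: torch.Tensor, denominator: torch.Tensor) -> torch.Tensor:
    one = torch.tensor(1.0, device=numerator.device)  # follow the inputs' device
    return torch.min(one, numerator / denominator)
```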

Text automation moved this from In progress to Reviewer approved Aug 27, 2021
@Borda Borda merged commit e6ad813 into Lightning-AI:master Aug 27, 2021
Text automation moved this from Reviewer approved to Done Aug 27, 2021
Borda pushed a commit that referenced this pull request Aug 27, 2021
* Added TextTester class
* Updating TextTester class with small naming changes
* Update TextTester with input processing, appropriate concatenations.
* Updated tests for BLEU Score based on TextTester.
* Updated WER metric to use the TextTester class.
* Updated tests for ROUGEScore based on TextTester.

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

(cherry picked from commit e6ad813)
Borda pushed a commit that referenced this pull request Aug 30, 2021
* Added TextTester class
* Updating TextTester class with small naming changes
* Update TextTester with input processing, appropriate concatenations.
* Updated tests for BLEU Score based on TextTester.
* Updated WER metric to use the TextTester class.
* Updated tests for ROUGEScore based on TextTester.

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Nicki Skafte <skaftenicki@gmail.com>

(cherry picked from commit e6ad813)
Labels: enhancement (New feature or request), ready, test / CI (testing or CI), topic: Text
Projects: Text (Done)

Successfully merging this pull request may close these issues:

  • TextTester class based on the MetricTester class to robustly test text metrics

6 participants