Update label_models.py #4891
Conversation
- Removed the AUROC metric, since it is slow and costs a lot of memory to store the logits and labels.
- Fixed macro-accuracy metric usage for multiple validation manifests.
- Refactored the validation and test functions, since they share the same code and differ only in the display tag ("val_" vs. "test_").

Signed-off-by: stevehuang52 <heh@nvidia.com>
39b15e8 to 864a8a9
'val_acc_macro': macro_accuracy_score,
'val_auroc': auroc_score,
}
return {'log': tensorboard_log}
Also update this tensorboard_logs; refer to #4876.
Sure, just fixed.
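For illustration, a minimal sketch of the tag-prefixed log dict the diff above builds (the function name `build_epoch_logs` is hypothetical, and metric values are assumed to be plain floats; the AUROC entry is already removed, matching the PR):

```python
def build_epoch_logs(tag, loss_mean, acc, acc_macro):
    # Build the tensorboard log dict with a 'val'/'test' prefix,
    # mirroring the dict in the diff above (AUROC entry dropped).
    return {
        f'{tag}_loss': loss_mean,
        f'{tag}_acc': acc,
        f'{tag}_acc_macro': acc_macro,
    }

# Example: validation-epoch logs
logs = build_epoch_logs('val', 0.42, 0.91, 0.88)
```

The same function serves test epochs by passing `tag='test'`, which is the whole point of the refactor discussed below.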
'test_acc_macro': macro_accuracy_score,
'test_auroc': auroc_score,
}
return self.multi_validation_epoch_end(outputs, dataloader_idx, 'test')
Not sure if it's a good idea to merge val and test here
what do you think? @nithinraok
We can merge; why do you think it's not a good idea?
@nithinraok I was just worried it would break things such as evaluation on multiple GPUs and reduce flexibility, but I guess I'm overthinking it. Let's merge then.
I don't think so, since it's just re-use of a function, but @fayejf or @stevehuang52, can you run an experiment and verify this works as expected before merging? We have a unit test for training, but we don't check validation and test.
I can do it, but do you have any examples of how to do it quickly? I think it will work correctly, and the ASR models also do it the same way: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/models/ctc_models.py#L650
@fayejf can you provide Steve with some language ID examples? Note the validation time before and after the change.
Let me run it really quick.
What might be cleaner is to separate it out into a function, if we're worried about using the validation hook within the test hook:
def calculate_metrics(self, outputs, dataloader_idx, tag):
    ...

def multi_test_epoch_end(self, outputs, dataloader_idx):
    return self.calculate_metrics(outputs, dataloader_idx, 'test')

def multi_validation_epoch_end(self, outputs, dataloader_idx):
    return self.calculate_metrics(outputs, dataloader_idx, 'val')
Yes, this would be neat.
@stevehuang52 Since you are removing auroc, remove it from the instantiation and also the corresponding import.
It's already removed
@nithinraok @stevehuang52 I've tested validation time, but I didn't observe a time reduction with this fix compared to main, even though my validation set is about 656,809 files, which is not small.
Mostly good. Could you update with Sean's suggestion?
logging.info("val_loss: {:.3f}".format(val_loss_mean))
self.log('val_loss', val_loss_mean)
logging.info(f'{tag}_loss: {loss_mean:.3f}')
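The tag-based logging line in the diff can be sketched in standalone form as follows (hypothetical helper name, using Python's standard `logging` module rather than NeMo's logger): one f-string covers both validation and test, replacing the duplicated hard-coded `val_loss` lines.

```python
import logging

logging.basicConfig(level=logging.INFO)

def log_epoch_loss(tag, loss_mean):
    # One format string handles both 'val' and 'test' prefixes.
    msg = f'{tag}_loss: {loss_mean:.3f}'
    logging.info(msg)
    return msg
```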
Could you remove the duplicate line here?
Will do.
Did you also test with r1.10 to compare? Can you please do that as well, to find the cause of the delay in the validation time? @fayejf
I don't think we added auroc and val_macro_acc in r1.10, right?
Yes, hence I want to compare with it and test.
Done~
@nithinraok So I compared this PR with main and r1.10 (val acc, since val acc macro is not implemented there), and I didn't reinstall anything, just switched branches. Not much difference in time per epoch for the first 5 epochs, actually. I do observe a weird 16:36 run above; let me try it again. Besides that, there is no time reduction from the fix, possibly because our validation data is not large enough. But I think it's okay to merge it, to fix things and remove auroc for now.
I think it didn't run what you intended, since r1.10 runs with PTL 1.6 and throws errors with PTL 1.7. So you might need to reinstall for each test.
What kind of error do we expect to see for r1.10 with PTL 1.7? I'm on PTL 1.7.3 with r1.10 and it works fine.
It would be from the lightningmodule.trainer attribute. But I think you ran it correctly; the reverse (latest with PTL 1.6) shouldn't work.
@nithinraok I ran with PTL 1.7.3 for all of the above tests. Though I think I should run r1.10 with PTL 1.6, the run was fine with 1.7.3.
LGTM, thanks. Some more runs need to be done to find the cause of the increase in validation time.
@nithinraok I didn't see an increase in validation time, but I also didn't see the reduction expected from this commit.
LGTM, thanks! I found that using weighted cross-entropy loss at test time in speech_to_label.py is problematic. Let me patch it in another PR.
* Update label_models.py
  - Removed AUROC metric since it's slow and costs a lot of memory to store the logits and labels.
  - Fixed macro-accuracy metric usage for multiple validation manifests.
  - Refactored validation and test function since they share the same code but just different tags in display ("val_" vs. "test_")
* update
* update
* update
* update

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Matvei Novikov <mattyson.so@gmail.com>