
Update label_models.py #4891

Merged
merged 5 commits into from
Sep 8, 2022

Conversation

stevehuang52 (Collaborator)

What does this PR do?

- Removed the AUROC metric, since it is slow and costs a lot of memory to store the logits and labels.
- Fixed macro-accuracy metric usage for multiple validation manifests.
- Refactored the validation and test functions, since they share the same code and differ only in the display tag ("val_" vs. "test_").
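The refactor described in the last bullet can be sketched as follows. This is a hypothetical illustration, not the actual label_models.py code: the class and helper names (`LabelModelSketch`, `_epoch_end_metrics`) are invented for the example, and the real model aggregates more metrics than the loss shown here.

```python
# Minimal sketch (assumed names, not the actual NeMo code) of the refactor:
# the validation and test epoch-end hooks share one helper that differs
# only in the "val" / "test" prefix used for the logged metric keys.

class LabelModelSketch:
    def _epoch_end_metrics(self, outputs, dataloader_idx, tag):
        # Hypothetical aggregation; the real model also computes accuracy
        # and macro-accuracy across the dataloader outputs.
        loss_mean = sum(o["loss"] for o in outputs) / len(outputs)
        return {f"{tag}_loss": loss_mean}

    def multi_validation_epoch_end(self, outputs, dataloader_idx=0):
        return self._epoch_end_metrics(outputs, dataloader_idx, "val")

    def multi_test_epoch_end(self, outputs, dataloader_idx=0):
        return self._epoch_end_metrics(outputs, dataloader_idx, "test")
```

Keeping the shared logic in one place means a metric fix applies to both loops, at the cost of the two hooks no longer being independently customizable.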

Signed-off-by: stevehuang52 <heh@nvidia.com>
'val_acc_macro': macro_accuracy_score,
'val_auroc': auroc_score,
}
return {'log': tensorboard_log}
nithinraok (Collaborator) commented Sep 7, 2022:

also update this tensorboard_logs, refer to #4876

stevehuang52 (Collaborator, Author):

Sure, just fixed.

Signed-off-by: stevehuang52 <heh@nvidia.com>
'test_acc_macro': macro_accuracy_score,
'test_auroc': auroc_score,
}
return self.multi_validation_epoch_end(outputs, dataloader_idx, 'test')
Collaborator:

Not sure if it's a good idea to merge val and test here

Collaborator:

what do you think? @nithinraok

Collaborator:

We can merge; why do you think it's not a good idea?

Collaborator:

@nithinraok I was just worried it would break things such as eval on multi-GPU and reduce flexibility, but I guess I'm overthinking it. Then let's merge.

Collaborator:

I don't think so, since it's re-use of a function. But @fayejf or @stevehuang52, can you run an experiment and see that this works as expected before merging? We have a unit test for training, but we don't check validation and test time.

Collaborator (Author):

I can do it, but do you have any examples of how to do it quickly? I think it will work correctly, and ASR models are also doing it the same way: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/models/ctc_models.py#L650

Collaborator:

@fayejf can you provide Steve with some lang-id examples? Note the validation time before and after the change.

Collaborator:

let me run it really quick.

Collaborator:

What might be cleaner is to separate it out into a function, if we're worried about using the validation hook within the test hook:

def calculate_metrics(self, outputs, dataloader_idx, tag):
    ...

def multi_test_epoch_end(self, outputs, dataloader_idx):
    return self.calculate_metrics(outputs, dataloader_idx, 'test')

def multi_validation_epoch_end(self, outputs, dataloader_idx):
    return self.calculate_metrics(outputs, dataloader_idx, 'val')

Collaborator:

yes, this would be neat.

@nithinraok (Collaborator):

@stevehuang52 since you are removing auroc, remove it from instantiation and also corresponding import

@stevehuang52 (Collaborator, Author):

> @stevehuang52 since you are removing auroc, remove it from instantiation and also corresponding import

It's already removed

@fayejf fayejf self-requested a review September 7, 2022 20:49
fayejf (Collaborator) commented Sep 7, 2022:

@nithinraok @stevehuang52 I've tested validation time with a toy sample, but probably my sample is not large enough; I didn't observe a time reduction with this fix compared to main, even though my validation set is about 656,809 files, which is not small.

fayejf (Collaborator) left a review:

Mostly good. Could you update with Sean's suggestion?


logging.info("val_loss: {:.3f}".format(val_loss_mean))
self.log('val_loss', val_loss_mean)
logging.info(f'{tag}_loss: {loss_mean:.3f}')
Collaborator:

could you remove the duplicate line here?

Collaborator (Author):

will do

nithinraok (Collaborator) commented Sep 7, 2022:

Did you also test with r1.10 to compare? Can you please do that as well, to find the cause of the delay in the validation-time issue? @fayejf

fayejf (Collaborator) commented Sep 7, 2022:

> Did you also test it with r1.10 to compare, can you please do that as well, to find the cause of the delay in the validation time issue? @fayejf

I don't think we add auroc and val_macro_acc in r1.10, right?

@nithinraok (Collaborator):

> Did you also test it with r1.10 to compare, can you please do that as well, to find the cause of the delay in the validation time issue? @fayejf
>
> I don't think we add auroc and val_macro_acc in r1.10, right?

Yes, hence I want to compare with it and test.

Signed-off-by: stevehuang52 <heh@nvidia.com>
@stevehuang52 (Collaborator, Author):

> Mostly good. could you update with sean's suggestion?

Done~

fayejf (Collaborator) commented Sep 7, 2022:

> Did you also test it with r1.10 to compare, can you please do that as well, to find the cause of the delay in the validation time issue? @fayejf
>
> I don't think we add auroc and val_macro_acc in r1.10, right?
>
> yes, hence want to compare with it and test

@nithinraok So I compared this PR with main and r1.10 (val acc, since val acc macro is not implemented there), and I didn't reinstall anything, just switched branches.

Not much diff in terms of time, actually.

Time per epoch for the first 5 epochs:
r1.10 - total: 12.42-12.44, val: 02:04
main - total: 12.43-12.45, val: 02.05-02.07
steve - total: 12.43-16.36, val: 02.05-02.06

I do observe the weird 16.36 run above; let me try it again. Besides that, there's no time diff/reduction from the fix, possibly because our val data is not huge enough. But I think it's okay to merge it, to fix things and remove auroc for now.

@nithinraok (Collaborator):

> @nithinraok So I compared this PR with main and r1.10 (val acc since val acc macro is not implemented) and I didn't reinstall anything, just switched branch.

I think it didn't run what you intended, since r1.10 runs with PTL 1.6 and throws errors with PTL 1.7. So you might need to reinstall for each test.

fayejf (Collaborator) commented Sep 7, 2022:

> @nithinraok So I compared this PR with main and r1.10 (val acc since val acc macro is not implemented) and I didn't reinstall anything, just switched branch.
>
> I think it didn't run what you might be intended, since r1.10 runs with PTL 1.6, and throws errors with PTL 1.7. So, you might need to reinstall for each test.

What kind of error do we expect to see for r1.10 with PTL 1.7? I'm on PTL 1.7.3 with r1.10 and it works fine.

nithinraok (Collaborator) commented Sep 7, 2022:

> what kind of error do we expect to see for 1.10 with PTL1.7? I'm with PTL 1.7.3 and r1.10 and it works fine.

It would be from the LightningModule.trainer attribute. But I think you ran it correctly, I guess; the other way around shouldn't work (latest with PTL 1.6).

fayejf (Collaborator) commented Sep 7, 2022:

> what kind of error do we expect to see for 1.10 with PTL1.7? I'm with PTL 1.7.3 and r1.10 and it works fine.
>
> It would be from lightningmodule.trainer attribute, But I think you ran correctly I guess, vice versa shouldn't work (latest and PTL 1.6).

@nithinraok I ran with PTL 1.7.3 for all of the above tests. I think I should run r1.10 with PTL 1.6, but the run was fine with 1.7.3.

nithinraok (Collaborator) left a review:

LGTM, thanks. Some more runs need to be done to find the cause of the increase in validation time.

fayejf (Collaborator) commented Sep 7, 2022:

> LGTM, thanks. Some more runs need to be done to find the cause of the increase in validation time.

@nithinraok I didn't see an increase in validation time, but I didn't see the expected reduction of this commit actually.

fayejf (Collaborator) left a review:

LGTM, thanks! I found that using weighted cross-entropy loss in test in speech_to_label.py is problematic; let me patch it in another PR.

@fayejf fayejf merged commit 924b054 into main Sep 8, 2022
@fayejf fayejf deleted the stevehuang52-patch-1 branch September 8, 2022 00:16
jubick1337 pushed a commit to jubick1337/NeMo that referenced this pull request Oct 3, 2022
* Update label_models.py

- Removed AUROC metric since it's slow and cost a lot of memory to store the logits&labels.
- Fixed macro-accuracy metric usage for multiple validation manifests.
- Refactored validation and test function since they share the same code but just different tags in display ("val_" vs. "test_")

Signed-off-by: stevehuang52 <heh@nvidia.com>

* update

Signed-off-by: stevehuang52 <heh@nvidia.com>

* update

Signed-off-by: stevehuang52 <heh@nvidia.com>

* update

Signed-off-by: stevehuang52 <heh@nvidia.com>

* update

Signed-off-by: stevehuang52 <heh@nvidia.com>

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Matvei Novikov <mattyson.so@gmail.com>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 29, 2022
* Update label_models.py

- Removed AUROC metric since it's slow and cost a lot of memory to store the logits&labels. 
- Fixed macro-accuracy metric usage for multiple validation manifests.
- Refactored validation and test function since they share the same code but just different tags in display ("val_" vs. "test_")

Signed-off-by: stevehuang52 <heh@nvidia.com>

* update

Signed-off-by: stevehuang52 <heh@nvidia.com>

* update

Signed-off-by: stevehuang52 <heh@nvidia.com>

* update

Signed-off-by: stevehuang52 <heh@nvidia.com>

* update

Signed-off-by: stevehuang52 <heh@nvidia.com>

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>