[Fix] Hanging for Fully Randomized Bucketing (#4348)
* Update container to 22.05 (#4329)

* update container to 22.05

Signed-off-by: ericharper <complex451@gmail.com>

* try adding safe directory

Signed-off-by: ericharper <complex451@gmail.com>

* try env var

Signed-off-by: ericharper <complex451@gmail.com>

* printenv

Signed-off-by: ericharper <complex451@gmail.com>

* try GIT_BRANCH

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* remove debug statements

Signed-off-by: ericharper <complex451@gmail.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>

* Merge r1.9.0 main (#4331)

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* update package info

Signed-off-by: ericharper <complex451@gmail.com>

* cleaned up TN/ ITN doc (#4119)

* cleaned up TN/ ITN doc

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix typo

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix image

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix image

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* Draft: Fix restoring from checkpoint for case when `model.common_dataset_parameters.label_vocab_dir` is provided (#4136)

* Fix restoring from checkpoint with label vocab dir

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Add tests for various ways to pass label ids to model

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Fix typo

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Fix typo

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Do not create tmp directory

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Fix parameter name

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* finish cherry-pick op

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Fix labels errors

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Remove duplicate stage

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Change target branch

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* fix doc (#4146)

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* Tacotron2 retrain (#4103)

* fix yaml

Signed-off-by: treacker <emshabalin@yandex.ru>

* Fix for new TTSDataset class

Signed-off-by: treacker <emshabalin@yandex.ru>

* added wandb logging

Signed-off-by: treacker <emshabalin@yandex.ru>

* added wandb logging

Signed-off-by: treacker <emshabalin@yandex.ru>

* fix numpy version

Signed-off-by: treacker <emshabalin@yandex.ru>

* fix numpy version

Signed-off-by: treacker <emshabalin@yandex.ru>

* inference fix

Signed-off-by: treacker <emshabalin@yandex.ru>

* removed old code

Signed-off-by: treacker <emshabalin@yandex.ru>

* updated parser logic

Signed-off-by: treacker <emshabalin@yandex.ru>

* reverted version update

Signed-off-by: treacker <emshabalin@yandex.ru>

* refactored parser logic

Signed-off-by: treacker <emshabalin@yandex.ru>

* Updated Jenkinsfile

Signed-off-by: treacker <emshabalin@yandex.ru>

* Refactored tutorial for Tacotron2

Signed-off-by: treacker <emshabalin@yandex.ru>

* Made backward compatibility

Signed-off-by: treacker <emshabalin@yandex.ru>

* Made backward compatibility

Signed-off-by: treacker <emshabalin@yandex.ru>

* Update Jenkinsfile

Signed-off-by: treacker <emshabalin@yandex.ru>

* Update tacotron.yaml

Signed-off-by: treacker <emshabalin@yandex.ru>

* Refactoring

Signed-off-by: treacker <emshabalin@yandex.ru>

* cleaned up TN/ ITN doc (#4119)

* cleaned up TN/ ITN doc

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix typo

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix image

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix image

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
Signed-off-by: treacker <emshabalin@yandex.ru>

* Check implicit grad acc in GLUE dataset building (#4123)

* Check implicit grad acc in GLUE dataset building

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix jenkins test for GLUE/XNLI

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>
Signed-off-by: treacker <emshabalin@yandex.ru>

* Refactoring

Signed-off-by: treacker <emshabalin@yandex.ru>

* Refactoring

Signed-off-by: treacker <emshabalin@yandex.ru>

* Fixed jenkins

Signed-off-by: treacker <emshabalin@yandex.ru>

* Refactoring

Signed-off-by: treacker <emshabalin@yandex.ru>

* Refactoring

Signed-off-by: treacker <emshabalin@yandex.ru>

* Refactoring

Signed-off-by: treacker <emshabalin@yandex.ru>

Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>

* Multiprocess improvements (#4127)

* initial commit

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* start fix

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* improve multiprocessing speed while creating speaker dataset

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* updated scp to filelist

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* notebooks' link, typo and import fix (#4158)

* redo missing pr 4007

Signed-off-by: fayejf <fayejf07@gmail.com>

* remove extremely unreliable links

Signed-off-by: fayejf <fayejf07@gmail.com>

* update speaker docs (#4164)

* update speaker docs

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* chunks -> segments

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* Khz -> kHz

Signed-off-by: nithinraok <nithinrao.koluguri@gmail.com>

* small fix (#4180)

Signed-off-by: fayejf <fayejf07@gmail.com>

* fix the server key value problem (#4196)

Signed-off-by: Yi Dong <yidong@nvidia.com>

* Fix/punctuation/trainer required for setting test data (#4199)

* Draft of fix

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Add warnings and replace global_step with current_epoch

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Small improvements to warnings

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Error and warning messages improvements

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Replace self.trainer with self._trainer

Signed-off-by: PeganovAnton <peganoff2@mail.ru>

* Update ContextNet version (#4207)

Signed-off-by: smajumdar <smajumdar@nvidia.com>

* fix bugs for dialogue tutorial (#4211)

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* Dialogue tutorial fix (#4214)

* fix bugs for dialogue tutorial

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* update path for convert_datasets.py due to conflict PR

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* Add docs for Thutmose Tagger (#4173)

* Add docs for Thutmose Tagger

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

* add level in docs

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

* delete folder to avoid error with running when folder exists from previous run

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

Co-authored-by: Alexandra Antonova <aleksandraa@nvidia.com>
Co-authored-by: ekmb <ebakhturina@nvidia.com>

* Dialogue tutorial fix (#4218)

* fix bugs for dialogue tutorial

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* update path for convert_datasets.py due to conflict PR

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* restore previously deleted files

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* style fix

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* Dialogue tutorial fix (#4221)

* fix bugs for dialogue tutorial

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* update path for convert_datasets.py due to conflict PR

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* restore previously deleted files

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* style fix

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* update tutorial

Signed-off-by: Zhilin Wang <wangzhilin12061996@hotmail.com>

* fix syntax error in ipynb-file (#4228)

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

Co-authored-by: Alexandra Antonova <aleksandraa@nvidia.com>

* fix json serialize (#4235)

Signed-off-by: Yi Dong <yidong@nvidia.com>

* Prompt Learning Typo Fixes (#4238)

* Prompt tuning notebook typo fixes

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* Update tutorials.rst

* Update prompt_learning.rst

* Update prompt_learning.rst

* fixing bug 3642622 (#4250)

* fixing bug 3642622

Signed-off-by: Ghasem Pasandi <gpasandi@nvidia.com>

* fixing bug 3642622

Signed-off-by: Ghasem Pasandi <gpasandi@nvidia.com>

Co-authored-by: Ghasem Pasandi <gpasandi@nvidia.com>

* fix broken link in the tutorial (#4257)

Signed-off-by: Alexandra Antonova <aleksandraa@nvidia.com>

Co-authored-by: Alexandra Antonova <aleksandraa@nvidia.com>

* Typo fix, branch change, better download message (#4262)

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* Raise error if bicleaner is not installed in NMT Data preprocessing notebook (#4264)

* Raise error if bicleaner is not installed

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Clear cells

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix missing validation dataset, whitelist certain keywords for datasets (#4269)

* Fix missing validation dataset, whitelist certain keywords for datasets

Signed-off-by: smajumdar <smajumdar@nvidia.com>

* Fix missing validation dataset, whitelist certain keywords for datasets

Signed-off-by: smajumdar <smajumdar@nvidia.com>

* Update asr configs with num_workers and pin_memory (#4270)

Signed-off-by: smajumdar <smajumdar@nvidia.com>

* Fix epoch end (#4265)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Eric Harper <complex451@gmail.com>

* Set Save on train end to false (#4274)

* Set Save on train end to false

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* Update prompt_learning.rst

* Update prompt_learning.rst

* Update YAML (#4261)

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Updated config to fix CI test OOM error (#4279)

* Updated config to fix CI test issue

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* Increased num workers

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* verbose k2 install, skip if failed (#4289)

Signed-off-by: Aleksandr Laptev <alaptev@nvidia.com>

Co-authored-by: Aleksandr Laptev <alaptev@nvidia.com>

* Changed total virtual prompt tokens (#4295)

* Changed total virtual prompt tokens

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* put number of workers back

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* upper bound lightning

Signed-off-by: ericharper <complex451@gmail.com>

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* update config

Signed-off-by: ericharper <complex451@gmail.com>

* remove duplicate test

Signed-off-by: ericharper <complex451@gmail.com>

* fix tn test cases

Signed-off-by: ericharper <complex451@gmail.com>

* add another safe.directory

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com>
Co-authored-by: PeganovAnton <peganoff2@mail.ru>
Co-authored-by: treacker <36159472+treacker@users.noreply.github.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com>
Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Zhilin Wang <wangzhilin12061996@hotmail.com>
Co-authored-by: bene-ges <61418381+bene-ges@users.noreply.github.com>
Co-authored-by: Alexandra Antonova <aleksandraa@nvidia.com>
Co-authored-by: ekmb <ebakhturina@nvidia.com>
Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com>
Co-authored-by: Ghasem <35242805+pasandi20@users.noreply.github.com>
Co-authored-by: Ghasem Pasandi <gpasandi@nvidia.com>
Co-authored-by: Aleksandr Laptev <laptevsasha12@gmail.com>
Co-authored-by: Aleksandr Laptev <alaptev@nvidia.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>

* fix full_randn bucket hang

Signed-off-by: stevehuang52 <heh@nvidia.com>

* remove unused variables

Signed-off-by: stevehuang52 <heh@nvidia.com>

Co-authored-by: Eric Harper <complex451@gmail.com>
Co-authored-by: Yang Zhang <yzhang123@users.noreply.github.com>
Co-authored-by: PeganovAnton <peganoff2@mail.ru>
Co-authored-by: treacker <36159472+treacker@users.noreply.github.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com>
Co-authored-by: fayejf <36722593+fayejf@users.noreply.github.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Zhilin Wang <wangzhilin12061996@hotmail.com>
Co-authored-by: bene-ges <61418381+bene-ges@users.noreply.github.com>
Co-authored-by: Alexandra Antonova <aleksandraa@nvidia.com>
Co-authored-by: ekmb <ebakhturina@nvidia.com>
Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com>
Co-authored-by: Ghasem <35242805+pasandi20@users.noreply.github.com>
Co-authored-by: Ghasem Pasandi <gpasandi@nvidia.com>
Co-authored-by: Aleksandr Laptev <laptevsasha12@gmail.com>
Co-authored-by: Aleksandr Laptev <alaptev@nvidia.com>
19 people committed Jun 9, 2022
1 parent 5f6f5a1 commit cd8e6bf
Showing 1 changed file with 6 additions and 26 deletions.
32 changes: 6 additions & 26 deletions nemo/collections/asr/modules/rnnt.py
@@ -845,8 +845,6 @@ def forward(
 )
 
 losses = []
-wer_numer_list = []
-wer_denom_list = []
 batch_size = int(encoder_outputs.size(0))  # actual batch size
 
 # Iterate over batch using fused_batch_size steps
@@ -914,31 +912,14 @@ def forward(
 else:
     losses = None
 
-# Compute WER for sub batch
+# Update WER for sub batch
 if compute_wer:
     sub_enc = sub_enc.transpose(1, 2)  # [B, T, D] -> [B, D, T]
     sub_enc = sub_enc.detach()
     sub_transcripts = sub_transcripts.detach()
 
-    original_log_prediction = self.wer.log_prediction
-    if original_log_prediction and batch_idx == 0:
-        self.wer.log_prediction = True
-    else:
-        self.wer.log_prediction = False
-
-    # Compute the wer (with logging for just 1st sub-batch)
+    # Update WER on each process without syncing
     self.wer.update(sub_enc, sub_enc_lens, sub_transcripts, sub_transcript_lens)
-    wer, wer_num, wer_denom = self.wer.compute()
-    self.wer.reset()
-
-    wer_numer_list.append(wer_num)
-    wer_denom_list.append(wer_denom)
-
-    # Reset logging default
-    self.wer.log_prediction = original_log_prediction
-
-else:
-    wer = None
 
 del sub_enc, sub_transcripts, sub_enc_lens, sub_transcript_lens
 
@@ -951,12 +932,11 @@ def forward(
 
 # Collect sub batch wer results
 if compute_wer:
-    wer_num = torch.tensor(wer_numer_list, dtype=torch.long)
-    wer_denom = torch.tensor(wer_denom_list, dtype=torch.long)
-
-    wer_num = wer_num.sum()  # global sum of correct words/chars
-    wer_denom = wer_denom.sum()  # global sum of all words/chars
+    # Sync and all_reduce on all processes, compute global WER
+    wer, wer_num, wer_denom = self.wer.compute()
+    self.wer.reset()
 else:
+    wer = None
     wer_num = None
     wer_denom = None
 
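The change above moves the call to self.wer.compute() out of the sub-batch loop, so each process calls it once per batch instead of once per sub-batch. A plausible reading of the hang, which the commit itself does not spell out, is that compute() triggers a cross-process sync, and with fully randomized bucketing different ranks can end up with a different number of sub-batches, so the number of sync calls no longer matches across ranks. The toy sketch below illustrates only that counting argument. SyncCountingWER and both helper functions are hypothetical stand-ins, not the NeMo or torchmetrics API, and no real communication takes place.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SyncCountingWER:
    """Hypothetical stand-in for a distributed WER metric (not the NeMo API)."""

    errors: int = 0
    words: int = 0
    sync_calls: int = 0  # each compute() stands for one cross-process collective

    def update(self, errors: int, words: int) -> None:
        # Local accumulation only; nothing is exchanged with other processes.
        self.errors += errors
        self.words += words

    def compute(self) -> Tuple[float, int, int]:
        # A real distributed metric would run its all_reduce here.
        self.sync_calls += 1
        wer = self.errors / self.words if self.words else 0.0
        return wer, self.errors, self.words

    def reset(self) -> None:
        self.errors = 0
        self.words = 0


def wer_per_sub_batch(metric: SyncCountingWER, sub_batches: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Old pattern: compute() and reset() inside the loop, then sum the parts."""
    numer, denom = 0, 0
    for errors, words in sub_batches:
        metric.update(errors, words)
        _, n, d = metric.compute()
        metric.reset()
        numer += n
        denom += d
    return numer, denom


def wer_once_per_batch(metric: SyncCountingWER, sub_batches: List[Tuple[int, int]]) -> Tuple[int, int]:
    """New pattern: update() inside the loop, a single compute() afterwards."""
    for errors, words in sub_batches:
        metric.update(errors, words)
    _, numer, denom = metric.compute()
    metric.reset()
    return numer, denom


# Two simulated ranks whose batches split into different numbers of sub-batches.
rank0_sub_batches = [(1, 10), (2, 12), (0, 8)]  # (errors, words) per sub-batch
rank1_sub_batches = [(3, 20)]

old_rank0, old_rank1 = SyncCountingWER(), SyncCountingWER()
old_result_rank0 = wer_per_sub_batch(old_rank0, rank0_sub_batches)
wer_per_sub_batch(old_rank1, rank1_sub_batches)

new_rank0, new_rank1 = SyncCountingWER(), SyncCountingWER()
new_result_rank0 = wer_once_per_batch(new_rank0, rank0_sub_batches)
wer_once_per_batch(new_rank1, rank1_sub_batches)

print("old pattern sync calls per rank:", old_rank0.sync_calls, old_rank1.sync_calls)
print("new pattern sync calls per rank:", new_rank0.sync_calls, new_rank1.sync_calls)
```

In this sketch the old pattern issues as many sync calls as a rank has sub-batches, so the two ranks disagree, while the new pattern issues exactly one per rank and returns the same local numerator and denominator.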
