Fix notebook misconfiguration error #3975

Merged
merged 2 commits into from
Apr 13, 2022

Conversation

yidong72
Collaborator

What does this PR do?

The notebook raises a Trainer misconfiguration error because it tries to spawn processes from inside an interactive Python session. This PR fixes the error by using the torch elastic environment.
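
For context, a minimal sketch of the idea follows. It is not the diff from this PR, which the conversation does not show. It assumes that a torchelastic-style environment describes each worker through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) instead of having the Trainer spawn child processes. The helper `describe_elastic_env` and the single-process values are hypothetical and exist only for illustration.

```python
import os

# Variables a torchelastic-style launcher normally exports for each worker.
# These single-process values are illustrative assumptions, not taken from the PR.
SINGLE_PROCESS_ENV = {
    "RANK": "0",
    "LOCAL_RANK": "0",
    "WORLD_SIZE": "1",
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
}


def describe_elastic_env(environ):
    """Hypothetical helper: read rank info as an externally launched worker would."""
    return {
        "global_rank": int(environ["RANK"]),
        "local_rank": int(environ["LOCAL_RANK"]),
        "world_size": int(environ["WORLD_SIZE"]),
        "main_address": environ["MASTER_ADDR"],
        "main_port": int(environ["MASTER_PORT"]),
    }


# Work on a copy so the real process environment is left untouched.
env = dict(os.environ)
env.update(SINGLE_PROCESS_ENV)  # pretend a launcher already set these
info = describe_elastic_env(env)
print(info)
```

In a notebook, PyTorch Lightning's `TorchElasticEnvironment` (from `pytorch_lightning.plugins.environments`) can be passed to `Trainer(plugins=[...])` so that the Trainer treats the session as an externally launched worker and does not spawn processes itself. Whether the notebook change in this PR takes exactly that form is an assumption.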

Signed-off-by: Yi Dong <doyend@gmail.com>
Collaborator

ericharper left a comment

LGTM. Thanks!

ericharper merged commit db0b36d into r1.8.0 Apr 13, 2022
ericharper deleted the fix_misconfiguration branch April 13, 2022 17:20
ericharper pushed a commit that referenced this pull request Apr 20, 2022
Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>
ericharper added a commit that referenced this pull request Apr 20, 2022
* update version

Signed-off-by: ericharper <complex451@gmail.com>

* Stateless timer fix for PTL 1.6 (#3925)

* Stateless timer fix for PTL 1.6

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Stateless timer PTL test

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix year

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Remove unused imports

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* GPU test

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* clean import

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: ericharper <complex451@gmail.com>

* Fix issues with librosa deprecations (#3950)

Signed-off-by: smajumdar <titu1994@gmail.com>

* Fix notebook bugs for branch r1.8.0 (#3948)

* load the model from ngc

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix all biomegatron notebook

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix the typos

Signed-off-by: Yi Dong <doyend@gmail.com>

* remove output

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix isort

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix merge error

Signed-off-by: Yi Dong <doyend@gmail.com>

* change ntpath for isort workaround

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix unit test

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci bert pretraining

Signed-off-by: Yi Dong <doyend@gmail.com>

* make it compatible with main

Signed-off-by: Yi Dong <doyend@gmail.com>

* add the test for biomegatron ner

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix argument

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix usability issue

Signed-off-by: Yi Dong <doyend@gmail.com>

* work around

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

* Fix global batch fit loop (#3936)

* add lightning module hooks for global batch

Signed-off-by: ericharper <complex451@gmail.com>

* clean scripts

Signed-off-by: ericharper <complex451@gmail.com>

* style

Signed-off-by: ericharper <complex451@gmail.com>

* remove unused import

Signed-off-by: ericharper <complex451@gmail.com>

* DP=1 fix

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* set num dataset workers to 2

Signed-off-by: ericharper <complex451@gmail.com>

* update validation_loop with GlobalDataFetcher

Signed-off-by: ericharper <complex451@gmail.com>

* add test global data fetcher

Signed-off-by: ericharper <complex451@gmail.com>

* Drop last for test ds

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix test epoch end

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix eval

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix reconfigure microbatch in the complete method

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* add comments

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Set init consumed samples

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* fix shuffle

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* add save_restore_connector arg

Signed-off-by: ericharper <complex451@gmail.com>

* Fix padding for labels and loss mask

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* GLUE/XNLI CI tests

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* limit val batches in hydra fix

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Restart CI

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix unittest

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Exports 22.03 war (#3957)

* Fixed fastpitch for 22.03

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* cleanup

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Restored mask expansion; added WAR for test container images

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* style

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Refactor restorefrom (#3927)

* update package info (#3926)

Signed-off-by: ericharper <complex451@gmail.com>

* Refactor restore_from

Signed-off-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

* Move export related python files to scripts/export/

Signed-off-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

* Return state dict after modification function

* Remove Megatron legacy parameter in common.py restore_from function

Signed-off-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

* ability to set log_predictions to false (#3929)

* Bumping Python version

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>

* fixing style

Signed-off-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>

* load the model from ngc

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix all biomegatron notebook

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix the typos

Signed-off-by: Yi Dong <doyend@gmail.com>

* remove output

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix isort

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix merge error

Signed-off-by: Yi Dong <doyend@gmail.com>

* change ntpath for isort workaround

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix unit test

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci bert pretraining

Signed-off-by: Yi Dong <doyend@gmail.com>

* Rearrange export files; Style fix; Extend legacy MegatronBert conversion to NLP models nemo version updation

* Glu activation variants (#3951)

* Temp

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Add reglu and swiglu activations

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style on unrelated file

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* CI changes to test activations

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix unused import

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Style fix because of merge from main

Signed-off-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

* make it compatible with main

Signed-off-by: Yi Dong <doyend@gmail.com>

* add the test for biomegatron ner

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix argument

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix usability issue

Signed-off-by: Yi Dong <doyend@gmail.com>

* FastPitch FT notebook - Improving Speech Quality clarifications (#3954)

* FastPitch FT notebook - Improving Speech Quality clarifications

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Add pynini dependency install to FastPitch FT notebook

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Pin pynini install for FastPitch FT tutorial

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* work around

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Eric Harper <complex451@gmail.com>
Co-authored-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>
Co-authored-by: Dima Rekesh <bmwshop@gmail.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Co-authored-by: Yi Dong <doyend@gmail.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Co-authored-by: Boris Fomitchev <borisfom@users.noreply.github.com>
Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Jocelyn <jocelynh@nvidia.com>

* Bump TTS deprecation version to 1.9 (#3955)

* bump deprecation version

Signed-off-by: Jason <jasoli@nvidia.com>

* update talknet depre

Signed-off-by: Jason <jasoli@nvidia.com>

* added conformer for zh. (#3970)

Signed-off-by: Vahid <vnoroozi@nvidia.com>

* Add pinned pynini and scipy installs to TTS training tutorial (#3967)

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Fix variable name and move models to CPU in Change partition (#3972)

* fixes

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* add CI

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>

* fix misconfiguration (#3975)

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>

* Fix NMT variable passing bug (#3985)

* fix

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* stylefix

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* Compatibility override to load_state_dict for old TTS checkpoints (#3978)

* Compatibility override to load_state_dict for old TTS checkpoints

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Tacotron2 training notebook fix - add GPU argument

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Add hann window override warning for old model loading

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Notebook Bug Fixes for r1.8.0 (#3989)

* Made config related bug fixes

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* Fixed cfg.get syntax

Signed-off-by: Virginia Adams <vadams@nvidia.com>

* Fix compat override for TalkNet Aligner (#3993)

* Fix compatibility override for TalkNet Aligner

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* Remove extraneous logging import

Signed-off-by: Jocelyn Huang <jocelynh@nvidia.com>

* docs fixes (#3987)

* docs fixes

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* rename files in docs

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* docs improvement

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* arg renamed

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* Fix nemo megatron restore with artifacts (#3997)

* update config_path in register_artifact

Signed-off-by: ericharper <complex451@gmail.com>

* fix register_artifact calls

Signed-off-by: ericharper <complex451@gmail.com>

* fix register_artifact calls

Signed-off-by: ericharper <complex451@gmail.com>

* update log messages to include merges file

Signed-off-by: ericharper <complex451@gmail.com>

* add default prompts to config

Signed-off-by: ericharper <complex451@gmail.com>

* Fixes val_check_interval, skip loading train data during eval (#3968)

* Change stage check

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix bugs in megatron t5 glue eval scripts

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Fix reconfigure

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Change check

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix hasattr

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix typo in cfg structure

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Update megatron t5 glue eval config file

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Reconfigure to avoid drop last

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Fix for train step reconfigure as well

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* Update megatron t5 glue eval config file drop_last to False

Signed-off-by: Yu Yao <yuya@nvidia.com>

* Style

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

* limit test batches

Signed-off-by: MaximumEntropy <sandeep.subramanian.1@umontreal.ca>

Co-authored-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

* LogProb calculation performance fix (#3984)

* performance fix for logprob computation

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix redundant assign

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix bug to add gather from TP workers

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>

* Fix link issues in export example notebook and fix pretrained model info for MegatronBert (#4004)

Signed-off-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

Co-authored-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>

* Fix single GPU training issue + change deprecated Lightning args (#4010)

* change vars

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* style fix

Signed-off-by: Abhinav Khattar <aklife97@gmail.com>

* Fix P-Tune T5 model (#4001)

* fix ptune t5

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix ci test

Signed-off-by: Yi Dong <doyend@gmail.com>

* fix the ci fail because of the order problem

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

* Megatron work-arounds (#3998)

* WAR around Apex issue, and making sure output is FP32

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Fixing merge issues; moving dummy Trainer; adding float() casts

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Fixing ColumnParallelLinear call

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Cleanup

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* Cleanup#2

Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

* fix the broadcast shape mismatch (#4017)

Signed-off-by: Yi Dong <doyend@gmail.com>

Co-authored-by: Yi Dong <doyend@gmail.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

* add known issues (#4024)

Signed-off-by: ericharper <complex451@gmail.com>

* update readme with conda env setup instructions

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* update package info

Signed-off-by: ericharper <complex451@gmail.com>

* update branch

Signed-off-by: ericharper <complex451@gmail.com>

* update package info

Signed-off-by: ericharper <complex451@gmail.com>

* revert apex guard removal

Signed-off-by: ericharper <complex451@gmail.com>

* revert --language to --lang

Signed-off-by: ericharper <complex451@gmail.com>

* fix apex guard

Signed-off-by: ericharper <complex451@gmail.com>

* remove set_trace

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* fix apex guard

Signed-off-by: ericharper <complex451@gmail.com>

* remove unreachable statement

Signed-off-by: ericharper <complex451@gmail.com>

* remove duplicate lines

Signed-off-by: ericharper <complex451@gmail.com>

* remove duplicate lines

Signed-off-by: ericharper <complex451@gmail.com>

Co-authored-by: Sandeep Subramanian <sandeep.subramanian.1@umontreal.ca>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: Yi Dong <43824965+yidong72@users.noreply.github.com>
Co-authored-by: Yi Dong <doyend@gmail.com>
Co-authored-by: Boris Fomitchev <borisfom@users.noreply.github.com>
Co-authored-by: Ramanathan Arunachalam <ramanathan.arun@rutgers.edu>
Co-authored-by: Ramanathan Arunachalam <rarunachalam@nvidia.com>
Co-authored-by: Dima Rekesh <bmwshop@gmail.com>
Co-authored-by: Oleksii Kuchaiev <okuchaiev@nvidia.com>
Co-authored-by: Jocelyn <jocelynh@nvidia.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: Vahid Noroozi <VahidooX@users.noreply.github.com>
Co-authored-by: Abhinav Khattar <aklife97@gmail.com>
Co-authored-by: Virginia Adams <78445382+vadam5@users.noreply.github.com>
Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Co-authored-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>

3 participants