support chatglm2&3 #8528
Conversation
@@ -0,0 +1,294 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
Can you follow this PR to update your argument names? https://github.com/NVIDIA/NeMo/pull/8435/files#diff-3c80bf9f00be20ecece478699ebbd05ddf7a5a88646ef056b540be5f33622ada
Also rename to convert_chatglm_hf_to_nemo (or nemo_to_hf)
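The review above asks for the converter's argument names to follow the convention established in the referenced PR. A minimal sketch of what the renamed script's CLI might look like, assuming that convention uses `--input_name_or_path` and `--output_path` (the exact flag names, and the `--precision` flag, are illustrative assumptions, not confirmed from this PR):

```python
# Hypothetical CLI for convert_chatglm_hf_to_nemo.py, sketching the
# argument-name convention the reviewer points at; names are assumptions.
from argparse import ArgumentParser


def get_args(argv=None):
    parser = ArgumentParser(description="Convert ChatGLM2/3 HF checkpoints to NeMo.")
    parser.add_argument(
        "--input_name_or_path", type=str, required=True,
        help="Path to (or HF hub name of) the ChatGLM checkpoint.",
    )
    parser.add_argument(
        "--output_path", type=str, required=True,
        help="Destination .nemo file.",
    )
    parser.add_argument(
        "--precision", type=str, default="bf16",
        help="Precision to use during conversion (illustrative default).",
    )
    return parser.parse_args(argv)

# Usage (hypothetical):
#   python convert_chatglm_hf_to_nemo.py \
#       --input_name_or_path THUDM/chatglm2-6b --output_path chatglm2.nemo
```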
Signed-off-by: Agoniii <815244047@qq.com>
for more information, see https://pre-commit.ci Signed-off-by: Agoniii <815244047@qq.com>
* Option to set matmul precision in transcribe and chunked infer scripts Signed-off-by: Piotr Żelasko <petezor@gmail.com> * fix unsupported literal type Signed-off-by: Piotr Żelasko <petezor@gmail.com> * Fix imports Signed-off-by: Piotr Żelasko <petezor@gmail.com> --------- Signed-off-by: Piotr Żelasko <petezor@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
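The commit above adds an option to set matmul precision in the transcribe/chunked-infer scripts and fixes an unsupported-literal bug. A hedged sketch of the likely shape of that option: validate the configured string against the literals PyTorch accepts before handing it to `torch.set_float32_matmul_precision` (the helper name is hypothetical; the actual torch call is commented out so the sketch stays dependency-free):

```python
# Hypothetical helper: reject unsupported matmul-precision literals up
# front instead of letting torch raise deep inside the script.
_ALLOWED_PRECISIONS = ("highest", "high", "medium")


def set_matmul_precision(value: str) -> str:
    value = str(value).lower()
    if value not in _ALLOWED_PRECISIONS:
        raise ValueError(
            f"Unsupported matmul precision {value!r}; "
            f"expected one of {_ALLOWED_PRECISIONS}"
        )
    # In the real scripts this would call:
    # torch.set_float32_matmul_precision(value)
    return value
```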
* Revert FP8 integration * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com> Co-authored-by: Ming <111467530+Victor49152@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
* Add taurus pytorch to nemo
* Add a taurus jax to nemo conversion script and few other fixes
* Clean up code
* Bug fix
* Renaming
* Fix arguments
* Add HF Gemma converter
* Turn off `apply_rope_fusion` during inference
* Update conversion scripts
* Add exporting stuff
* Update conversion scripts
* Add readme
* Save readme
* Update jax
* Remove Gemma README_Gemma.rst
* Update import path
* Update docstring
* Revert "Add exporting stuff" (reverts commit 17d00b0)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Remove neva cyclic imports
* Remove unused vars
* Revert "Remove neva cyclic imports" (reverts commit 898d9ed)
* Fix cyclic import
* Remove neva folder
* Remove unused vars
* Update docstrings in converter
* Address comments
* Add docstring

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: Bobby Chen <bobchen@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
* check if none before encode special token Signed-off-by: Huiying Li <willwin.lee@gmail.com> * handle when pad_id does not exist for hf Autotokenizer Signed-off-by: Huiying Li <willwin.lee@gmail.com> * refactor pad_id assignment to use getattr for cleaner code readability Signed-off-by: Huiying Li <willwin.lee@gmail.com> --------- Signed-off-by: Huiying Li <willwin.lee@gmail.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
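The commit above handles the case where an HF AutoTokenizer exposes no `pad_id`, refactoring the assignment to use `getattr`. A minimal sketch of that defensive pattern, assuming a `pad_id`/`eos_id` attribute layout (the helper name and the eos fallback are illustrative, not the PR's exact code):

```python
# Hypothetical sketch: read pad_id defensively, since some HF
# AutoTokenizer wrappers do not define it, and fall back to eos_id.
def resolve_pad_id(tokenizer):
    pad_id = getattr(tokenizer, "pad_id", None)
    if pad_id is None:
        pad_id = getattr(tokenizer, "eos_id", None)
    return pad_id
```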
Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com> Co-authored-by: Ming <111467530+Victor49152@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
* Update checkpoint.rst
* Update configs.rst
* Update intro.rst
* Update intro.rst
* Update neva.rst
* Update checkpoint.rst
* Update configs.rst
* Update controlnet.rst
* Update datasets.rst
* Update dreambooth.rst
* Update imagen.rst
* Update insp2p.rst
* Update sd.rst
* Update checkpoint.rst
* Update clip.rst
* Update mcore_customization.rst
* Update retro_model.rst
* Update migration-guide.rst
* Update nemo_forced_aligner.rst
* Update checkpoints.rst
* Update datasets.rst
* Update g2p.rst
* Update checkpoint.rst
* Update configs.rst
* Update datasets.rst
* Update vit.rst
* Update core.rst
* Update export.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
* Add Kaiming uniform init for LoRA adapters Signed-off-by: Michal Futrega <mfutrega@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update parallel_adapters.py Signed-off-by: Michal Futrega <mfutrega@nvidia.com> * Set value a via function argument Signed-off-by: Michal Futrega <mfutrega@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Michal Futrega <mfutrega@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Shriya Palsamudram <69161273+ShriyaPalsamudram@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
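The commit above adds Kaiming-uniform initialization for LoRA adapters, with the `a` value settable via a function argument. The bound arithmetic can be sketched in pure Python: with the common `a = sqrt(5)` default, the Kaiming-uniform bound reduces to `1/sqrt(fan_in)`. The function names below are illustrative; the actual implementation would use `torch.nn.init.kaiming_uniform_`:

```python
# Sketch of the Kaiming-uniform bound as typically applied to a LoRA
# "A" matrix: gain = sqrt(2 / (1 + a^2)), std = gain / sqrt(fan_in),
# bound = sqrt(3) * std. With a = sqrt(5) this is 1 / sqrt(fan_in).
import math
import random


def kaiming_uniform_bound(fan_in: int, a: float = math.sqrt(5)) -> float:
    gain = math.sqrt(2.0 / (1.0 + a * a))
    std = gain / math.sqrt(fan_in)
    return math.sqrt(3.0) * std


def init_lora_a(rows: int, fan_in: int, seed: int = 0):
    # Fill a rows x fan_in matrix with U(-bound, bound) samples.
    rng = random.Random(seed)
    bound = kaiming_uniform_bound(fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(rows)]
```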
* add fsdp support for gpt fine-tuning Signed-off-by: dimapihtar <dpihtar@gmail.com> * fsdp fix for save_nemo_on_train_end Signed-off-by: dimapihtar <dpihtar@gmail.com> * add packed_seq_params param to MCoreTransformerLayerMixin Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix remove ckpt issue Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
* Update neva.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com> * Update datasets.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com> * Update vit.rst Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com> --------- Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
* fix AccessMixin Signed-off-by: stevehuang52 <heh@nvidia.com> * remove caching propagate_model_guid Signed-off-by: stevehuang52 <heh@nvidia.com> --------- Signed-off-by: stevehuang52 <heh@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Pablo Garay <palenq@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
Update mcore version in Dockerfile Signed-off-by: Pablo Garay <palenq@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
* Add NeMo Models Signed-off-by: Nithin Rao Koluguri <nithinraok> * add codec models Signed-off-by: Nithin Rao Koluguri <nithinraok> * update link Signed-off-by: Nithin Rao Koluguri <nithinraok> --------- Signed-off-by: Nithin Rao Koluguri <nithinraok> Co-authored-by: Nithin Rao Koluguri <nithinraok> Signed-off-by: Agoniii <815244047@qq.com>
* Account for amp_O2 in nemo_llama_to_hf conversion Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com> * Package converted model with new tokenizer not old Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com> * Account for variations in megatron_amp_O2 behavior Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com> * Resize the embeddings matrix Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com> * Correct precision when saving to HF folder Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com> * Fix typo in sample script Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com> * Fix typo in logging Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix O2 issue properly in peft mixin Signed-off-by: Chen Cui <chcui@nvidia.com> --------- Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com> Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
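One step in the commit above is resizing the embeddings matrix when packaging the converted model with the new tokenizer. A hedged sketch of what that resize plausibly involves: a NeMo vocab is often padded to a multiple for efficiency, so the extra rows are trimmed (or zero rows appended) to match the target tokenizer's vocab size. The function below operates on plain lists to stay dependency-free; name and zero-padding policy are assumptions, not the PR's exact code:

```python
# Hypothetical sketch: trim or zero-pad an embedding matrix (a list of
# row vectors) so its row count matches the target vocab size.
def resize_embeddings(weight, target_vocab_size):
    rows = len(weight)
    if rows >= target_vocab_size:
        # NeMo checkpoints may pad the vocab; drop the padding rows.
        return weight[:target_vocab_size]
    width = len(weight[0]) if weight else 0
    # Otherwise append zero rows for the missing token ids.
    return weight + [[0.0] * width for _ in range(target_vocab_size - rows)]
```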
* eval with ckpt Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com> * fix condition Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com> * modify ci test Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com> --------- Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
* initial commit Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add check to prevent mcore -> legacy conversion Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * key name change for legacy mlm -> mcore nemo conversion Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * alternative layer norm key names Signed-off-by: Chen Cui <chcui@nvidia.com> * fix error with expert parallel groups Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert previous change Signed-off-by: Chen Cui <chcui@nvidia.com> --------- Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
* Update transcribe calls Signed-off-by: smajumdar <titu1994@gmail.com> * Update transcribe calls Signed-off-by: smajumdar <titu1994@gmail.com> * Update transcribe calls Signed-off-by: smajumdar <titu1994@gmail.com> --------- Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
* fix the bug where hybrid TDT CTC model uses incorrect decoding class for inference Signed-off-by: Hainan Xu <hainanx@nvidia.com> * move the fix into a function and call it from different subclassees Signed-off-by: Hainan Xu <hainanx@nvidia.com> --------- Signed-off-by: Hainan Xu <hainanx@nvidia.com> Co-authored-by: Hainan Xu <hainanx@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
* Fix SpeakerDecoder doc string Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> * Fix asr RNNT doc strings Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> * Fix ctc decoding doc strings Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> * More doc string fixes Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> * RNNTDecoding doc strings fix Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> * More doc string fixes Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> * modelPT, dataset doc string fixes Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> * Fix generate, encode, decode docstrings Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> * Update generate function docstring Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> * More generate function docstring update Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> --------- Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com> Co-authored-by: Dong Hyuk Chang <donghyukc@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Gerald Shen <geshen@nvidia.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
* Add the following changes for PTL 2.1: 1) remove LightningModuleWrapperBase around the model as it's not required with PTL 2.1; 2) make precision None when using the precision plugin in MegatronTrainerBuilder; 3) change the dataloader_iter API for some Megatron models
* Change dataloader_iter API and remove val_iterator_done: 1) change the dataloader_iter API according to PTL 2.1 for BERT and GPT models; 2) comment out self._val_iterator_done for all Megatron models
* Override format_checkpoint_name and fix dataloader_iter API
* Update PTL version in requirements
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Remove unused import and comment out val_iterator_done
* Override _link_checkpoint
* Temporarily comment out CPU unit tests
* Remove precision arg from Trainer in convert_hf_llama_to_nemo.py
* Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py
* Temporarily disable NMT training TP=2 test
* Fix val_step, test_step func API in MegatronLMEncoderDecoderModel
* Enable NMT training TP=2 test
* Disable some unit tests
* Comment out CI tests
* Comment out resume part of BART
* Uncomment a few lines in the JenkinsFile
* Return len of dataloader in microbatches
* Fix _link_checkpoint: 1) add inject_model_parallel_rank; 2) override super()._link_checkpoint to remove the rank-0 condition check
* Check if using dist ckpt in _link_checkpoint
* Remove batch_idx arg from validation_step in megatron_gpt_sft_model.py
* Use PTL bug-fix branch to test unit tests against https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files
* Temporarily disable test_ema_saved_state in test_ema.py
* Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py
* Use PTL with fs.lexists
* Comment out _link_checkpoint-related overrides to test with PTL without symbolic links
* Return only batch for dataloader_iter in DFT model
* Modify get_batch in GPTModel
* Add condition checks for batch extraction from dataloader_iter
* Add missing condition check for batch extraction in GPTModel
* Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder
* Comment out test_invalid_checkpoints_removed_from_topk in test_exp_manager.py
* Fix test of invalid ckpts in test_exp_manager.py; also uncomment some commented-out tests in JenkinsFile and test_ema.py
* Fix bug in test_invalid_checkpoints_removed_from_topk
* Fix validation step of GPTModel for the fine-tuning case with multiple dataloaders
* Fix test_step_outputs for SFT in GPTModel
* Pass dataloader_idx to val_step of GPTModel (required by GPTSFTModel with multiple dataloaders to append outputs correctly) and remove the val_iterator_done check from all Megatron GPT models
* Add condition check for extraction of batch in T5SFTModel & LMEncoderDecoder
* Add condition check for extracting batch in MegatronNMTModel; also uncomment GPT PP=2 and NMT tests in the JenkinsFile
* Fix typo and uncomment multimodal tests
* Change to new dataloader_iter API for MultiModal
* Fix new dataloader API for MegatronLatentDiffusion model
* Store and restore precision value in MegatronGPTSFTModel
* Temporarily comment out Multimodal Stable Diffusion train
* Update JenkinsFile for multimodal with latest main
* Upgrade PTL to version 2.2 in reqs
* Install PTL 2.2 from fork
* Add strict arg to load_model_state_dict func in NLPDDPStrategy
* Delete megatron_t5_adapter_tuning.py and megatron_t5_ia3_tuning.py (added to the branch by mistake)
* Delete megatron_t5_prompt_learning.py (added by mistake)
* Add appropriate comments, code clean-up
* Remove PTL installation from JenkinsFile
* Update PTL version to be >= 2.2.1

Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
* Update docs version Signed-off-by: smajumdar <titu1994@gmail.com> * Update docs for NeMo Framework Signed-off-by: smajumdar <titu1994@gmail.com> * Update docs for NeMo Framework Signed-off-by: smajumdar <titu1994@gmail.com> --------- Signed-off-by: smajumdar <titu1994@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
* Update results.rst for Canary Inference Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> * Update results.rst for Canary Inference Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> --------- Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com> Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
…esent (NVIDIA#8618) Signed-off-by: Agoniii <815244047@qq.com>
* remove LoRA SP no redundant comm for all linear layers Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert to scatter in adapter module instead of scatter after add Signed-off-by: Chen Cui <chcui@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Chen Cui <chcui@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
Co-authored-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
Minor copy and instruction changes to improve tutorial viability. Signed-off-by: Chris Alexiuk <161380339+chrisalexiuk-nvidia@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
…ests (NVIDIA#8444)

* AMMO integration with Llama2 PTQ example and tests
* Jenkins megatron_llama_quantization.py test setup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* License headers
* Add AMMO to requirements_nlp.txt with --extra-index-url for pip install
* Bump AMMO version to latest
* Guards workaround on spec definition
* Save artifacts and tokenizer config at once
* Extend nemo.utils package with new tools
* Reorganize & reformat
* Tests for FP8 and INT4 AWQ
* Add load_config helper function
* Unused import removal
* Fix FP8 Jenkins test
* Fix TP=2 test cont'd: no need to use mpirun
* Allow for patches in AMMO versioning
* Drop AWQ test for now (need to debug)
* Allow for patches in AMMO versioning cont'd
* Use AMMO spec from MCore as it has been published
* Make AMMO an optional dependency and properly import-guard it
* Add Llama2 AWQ test and update some paths
* Enable specifying quantization.algorithm=null for baseline accuracy checks
* Enable exporting a qnemo tarball or just to a directory
* Drop AWQ testing for now
* Test case for export.inference_tensor_parallel=2
* Flag to export TRT-LLM config.json

Signed-off-by: Jan Lasek <janek.lasek@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
* fix FIM RNG issue * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix FIMDataset * fix seed ref * fim fix Signed-off-by: dimapihtar <dpihtar@gmail.com> * add fim test Signed-off-by: dimapihtar <dpihtar@gmail.com> * remove files Signed-off-by: dimapihtar <dpihtar@gmail.com> * remove swp Signed-off-by: dimapihtar <dpihtar@gmail.com> * remove import Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix syntax Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix Jenkins Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
…NVIDIA#8640) * Add support to perform "inference-only" without loading training data Hi, Currently, the MegatronSBERT model cannot run inference. Essentially, a user may not be able to simply load a trained .nemo checkpoint and run inference (forward()) function on it. This patch adds a try/except block to handle cases where training data is not specified Signed-off-by: Aditya Malte <aditya.malte@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Aditya Malte <aditya.malte@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
* add ctcws tutorial Signed-off-by: andrusenkoau <andrusenkoau@gmail.com> * clear cell outputs Signed-off-by: andrusenkoau <andrusenkoau@gmail.com> * fixes Signed-off-by: andrusenkoau <andrusenkoau@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by: andrusenkoau <andrusenkoau@gmail.com> * fixes Signed-off-by: andrusenkoau <andrusenkoau@gmail.com> * fixes Signed-off-by: andrusenkoau <andrusenkoau@gmail.com> --------- Signed-off-by: andrusenkoau <andrusenkoau@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: stevehuang52 <heh@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: rachitg <rachitg@nvidia.com> Co-authored-by: rachitg <rachitg@nvidia.com> Signed-off-by: Agoniii <815244047@qq.com>
* config update
* save embeddings and some refactoring
* entry point script for dumping embeddings to disk
* normalize query and pos_doc even if no soft negatives are used
* yaml for generation script
* all possible negatives
* updates
* logging
* need to update docstrings
* headers and rename
* log diff and fix cs logging
* non-standard solution to get the wandb logger to have the config
* check for rank
* cfg working for multi-GPU
* MCoreMixin changes
* using new commit of meg-LM
* default to use all layers for LoRA
* validation only uses hard negatives; val scores are batch-agnostic
* minor reorg
* metadata and bug fixes
* dump embeddings with traceable ids; disabled val logs for the moment
* val ids
* val ids by consumed samples
* don't gather if not saving embeddings
* init global step to allow consumed samples to be called at test time
* enable adapters with packed seq
* Add the following changes for PTL 2.1: 1) remove LightningModuleWrapperBase around the model as it's not required with PTL 2.1; 2) make precision None when using the precision plugin in MegatronTrainerBuilder; 3) change the dataloader_iter API for some Megatron models
* Change dataloader_iter API and remove val_iterator_done: 1) change the dataloader_iter API according to PTL 2.1 for BERT and GPT models; 2) comment out self._val_iterator_done for all Megatron models
* Override format_checkpoint_name and fix dataloader_iter API
* Update PTL version in requirements
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Remove unused import and comment out val_iterator_done
* Override _link_checkpoint
* Temporarily disable GPU unit tests
* Temporarily comment out CPU unit tests
* Remove precision arg from Trainer in convert_hf_llama_to_nemo.py
* Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py
* Temporarily disable NMT Training TP=2 test
* Fix val_step, test_step func API MegatronLMEncoderDecoderModel Signed-off-by:
Abhishree <abhishreetm@gmail.com> * Enable NMT training TP=2 test Signed-off-by: Abhishree <abhishreetm@gmail.com> * Disable some unit tests Signed-off-by: Abhishree <abhishreetm@gmail.com> * Comment CI tests Signed-off-by: Abhishree <abhishreetm@gmail.com> * Comment resume part of BART Signed-off-by: Abhishree <abhishreetm@gmail.com> * Uncomment few lines from JenkinsFile Signed-off-by: Abhishree <abhishreetm@gmail.com> * Return len of dataloader in microbatches Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix _link_checkpoint 1) Add inject_model_parallel_rank to _link_checkpoint 2) Override super._link_checkpoint to remove condition check for rank 0 Signed-off-by: Abhishree <abhishreetm@gmail.com> * Check if using dist ckpt in _link_checkpoint Signed-off-by: Abhishree <abhishreetm@gmail.com> * Temporarily disable GPT with PP=2 Signed-off-by: Abhishree <abhishreetm@gmail.com> * Remove batch_idx arg from validation_step megatron_gpt_sft_model.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Use PTL bug fix branch Test unit tests with PTL bug fix https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files Signed-off-by: Abhishree <abhishreetm@gmail.com> * Temporarily disable test_ema_saved_state in test_ema.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Use PTL with fs.lexists Signed-off-by: Abhishree <abhishreetm@gmail.com> * Comment _link_checkpoint related overrides In order to test with PTL without symbolic links Signed-off-by: Abhishree <abhishreetm@gmail.com> * Return only batch for dataloader_iter in DFT model Signed-off-by: Abhishree <abhishreetm@gmail.com> * Modify get_batch in GPTModel Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add condition checks for batch extraction from dataloader_iter Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add missing condition check for batch extraction in GPTModel 
Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder Signed-off-by: Abhishree <abhishreetm@gmail.com> * Comment test_invalid_checkpoints_removed_from_topk in test_exp_manager.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix test invalid ckpts in test_exp_manager.py Also uncomment some of the commented out tests in JenkinsFile and test_ema.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix bug in test_invalid_checkpoints_removed_from_topk Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix validation step of GPTModel for finetuning case with multi dataloaders Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * multi dataloaders for validation query and docs Signed-off-by: arendu <adithya.r@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * validation loop made more efficient with 2 dataloders Signed-off-by: arendu <adithya.r@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP test set generation Signed-off-by: arendu <adithya.r@gmail.com> * generate working for multi dataloaders Signed-off-by: arendu <adithya.r@gmail.com> * Add the following changes for PTL 2.1 1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1 2) Make precision as None when using precision plugin in MegatronTrainerBuilder 3) Change dataloader_iter API for some megatron model Signed-off-by: Abhishree <abhishreetm@gmail.com> * Change dataloader_iter API and remove val_iterator_done 1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model 2) Comment self._val_iterator_done for all megatron models Signed-off-by: Abhishree <abhishreetm@gmail.com> * Override format_checkpoint_nae and fix dataloader_iter API Signed-off-by: Abhishree 
<abhishreetm@gmail.com> * Update PTL version in requirements Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add the following changes for PTL 2.1 1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1 2) Make precision as None when using precision plugin in MegatronTrainerBuilder 3) Change dataloader_iter API for some megatron model Signed-off-by: Abhishree <abhishreetm@gmail.com> * Change dataloader_iter API and remove val_iterator_done 1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model 2) Comment self._val_iterator_done for all megatron models Signed-off-by: Abhishree <abhishreetm@gmail.com> * Override format_checkpoint_nae and fix dataloader_iter API Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unused import and comment val_iterator_done Signed-off-by: Abhishree <abhishreetm@gmail.com> * Override _link_checkpoint Signed-off-by: Abhishree <abhishreetm@gmail.com> * Temporarily disable GPU unit tests Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Temporarily comment out CPU Unit tests Signed-off-by: Abhishree <abhishreetm@gmail.com> * Remove precision arg from Trainer in convert_hf_llama_to_nemo.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Temporarily disable NMT Training TP=2 test Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix val_step, test_step func API MegatronLMEncoderDecoderModel Signed-off-by: Abhishree <abhishreetm@gmail.com> * Enable NMT training TP=2 test Signed-off-by: Abhishree <abhishreetm@gmail.com> * Disable some unit tests Signed-off-by: Abhishree <abhishreetm@gmail.com> * Comment CI tests Signed-off-by: Abhishree <abhishreetm@gmail.com> * 
Comment resume part of BART Signed-off-by: Abhishree <abhishreetm@gmail.com> * Uncomment few lines from JenkinsFile Signed-off-by: Abhishree <abhishreetm@gmail.com> * Return len of dataloader in microbatches Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix _link_checkpoint 1) Add inject_model_parallel_rank to _link_checkpoint 2) Override super._link_checkpoint to remove condition check for rank 0 Signed-off-by: Abhishree <abhishreetm@gmail.com> * Check if using dist ckpt in _link_checkpoint Signed-off-by: Abhishree <abhishreetm@gmail.com> * Temporarily disable GPT with PP=2 Signed-off-by: Abhishree <abhishreetm@gmail.com> * Remove batch_idx arg from validation_step megatron_gpt_sft_model.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Use PTL bug fix branch Test unit tests with PTL bug fix https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files Signed-off-by: Abhishree <abhishreetm@gmail.com> * Temporarily disable test_ema_saved_state in test_ema.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Use PTL with fs.lexists Signed-off-by: Abhishree <abhishreetm@gmail.com> * Comment _link_checkpoint related overrides In order to test with PTL without symbolic links Signed-off-by: Abhishree <abhishreetm@gmail.com> * Return only batch for dataloader_iter in DFT model Signed-off-by: Abhishree <abhishreetm@gmail.com> * Modify get_batch in GPTModel Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add condition checks for batch extraction from dataloader_iter Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add missing condition check for batch extraction in GPTModel Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder Signed-off-by: Abhishree <abhishreetm@gmail.com> * Comment test_invalid_checkpoints_removed_from_topk in test_exp_manager.py 
Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix test invalid ckpts in test_exp_manager.py Also uncomment some of the commented out tests in JenkinsFile and test_ema.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix bug in test_invalid_checkpoints_removed_from_topk Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix validation step of GPTModel for finetuning case with multi dataloaders Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix test_step_outputs for SFT in GPTMOdel Signed-off-by: Abhishree <abhishreetm@gmail.com> * Pass dataloader_idx for val_step of GPTModel and remove unwanted code 1) Pass dataloader_idx to val_step of GPTModel as its required for GPTSFTModel in case multi dataloaders to append the outputs correctly val/test_step_output 2) Remove val_iterator_done check from all megatron GPT models Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add condition check for extraction of batch in T5SFTModel & LMEncoderDecoder Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add condition check for extracting batch in MegatronNMTModel Also uncomment GPT PP=2 and NMT tests from JenkinsFIle Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix typo and uncomment multimodel tests Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * default names Signed-off-by: arendu <adithya.r@gmail.com> * Add the following changes for PTL 2.1 1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1 2) Make precision as None when using precision plugin in MegatronTrainerBuilder 3) Change dataloader_iter API for some megatron model Signed-off-by: Abhishree <abhishreetm@gmail.com> * Change dataloader_iter API and remove val_iterator_done 1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model 2) Comment self._val_iterator_done for all megatron models Signed-off-by: Abhishree <abhishreetm@gmail.com> * Override 
format_checkpoint_nae and fix dataloader_iter API Signed-off-by: Abhishree <abhishreetm@gmail.com> * Update PTL version in requirements Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add the following changes for PTL 2.1 1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1 2) Make precision as None when using precision plugin in MegatronTrainerBuilder 3) Change dataloader_iter API for some megatron model Signed-off-by: Abhishree <abhishreetm@gmail.com> * Change dataloader_iter API and remove val_iterator_done 1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model 2) Comment self._val_iterator_done for all megatron models Signed-off-by: Abhishree <abhishreetm@gmail.com> * Override format_checkpoint_nae and fix dataloader_iter API Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unused import and comment val_iterator_done Signed-off-by: Abhishree <abhishreetm@gmail.com> * Override _link_checkpoint Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Temporarily comment out CPU Unit tests Signed-off-by: Abhishree <abhishreetm@gmail.com> * Remove precision arg from Trainer in convert_hf_llama_to_nemo.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Temporarily disable NMT Training TP=2 test Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix val_step, test_step func API MegatronLMEncoderDecoderModel Signed-off-by: Abhishree <abhishreetm@gmail.com> * Enable NMT training TP=2 test Signed-off-by: Abhishree <abhishreetm@gmail.com> * Disable some unit tests Signed-off-by: Abhishree <abhishreetm@gmail.com> * Comment CI tests Signed-off-by: Abhishree <abhishreetm@gmail.com> * Comment resume 
part of BART Signed-off-by: Abhishree <abhishreetm@gmail.com> * Uncomment few lines from JenkinsFile Signed-off-by: Abhishree <abhishreetm@gmail.com> * Return len of dataloader in microbatches Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix _link_checkpoint 1) Add inject_model_parallel_rank to _link_checkpoint 2) Override super._link_checkpoint to remove condition check for rank 0 Signed-off-by: Abhishree <abhishreetm@gmail.com> * Check if using dist ckpt in _link_checkpoint Signed-off-by: Abhishree <abhishreetm@gmail.com> * Remove batch_idx arg from validation_step megatron_gpt_sft_model.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Use PTL bug fix branch Test unit tests with PTL bug fix https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files Signed-off-by: Abhishree <abhishreetm@gmail.com> * Temporarily disable test_ema_saved_state in test_ema.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Use PTL with fs.lexists Signed-off-by: Abhishree <abhishreetm@gmail.com> * Comment _link_checkpoint related overrides In order to test with PTL without symbolic links Signed-off-by: Abhishree <abhishreetm@gmail.com> * Return only batch for dataloader_iter in DFT model Signed-off-by: Abhishree <abhishreetm@gmail.com> * Modify get_batch in GPTModel Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add condition checks for batch extraction from dataloader_iter Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add missing condition check for batch extraction in GPTModel Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder Signed-off-by: Abhishree <abhishreetm@gmail.com> * Comment test_invalid_checkpoints_removed_from_topk in test_exp_manager.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix test invalid ckpts in test_exp_manager.py Also uncomment 
some of the commented out tests in JenkinsFile and test_ema.py Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix bug in test_invalid_checkpoints_removed_from_topk Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix validation step of GPTModel for finetuning case with multi dataloaders Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix test_step_outputs for SFT in GPTMOdel Signed-off-by: Abhishree <abhishreetm@gmail.com> * Pass dataloader_idx for val_step of GPTModel and remove unwanted code 1) Pass dataloader_idx to val_step of GPTModel as its required for GPTSFTModel in case multi dataloaders to append the outputs correctly val/test_step_output 2) Remove val_iterator_done check from all megatron GPT models Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add condition check for extraction of batch in T5SFTModel & LMEncoderDecoder Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add condition check for extracting batch in MegatronNMTModel Also uncomment GPT PP=2 and NMT tests from JenkinsFIle Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix typo and uncomment multimodel tests Signed-off-by: Abhishree <abhishreetm@gmail.com> * Change to new dataloader_iter API for MultiModal Signed-off-by: Abhishree <abhishreetm@gmail.com> * Fix new dataloader_api for MegatronLatenDiffusion Model Signed-off-by: Abhishree <abhishreetm@gmail.com> * Store and restore precision value in MegatronGPTSFTModel Signed-off-by: Abhishree <abhishreetm@gmail.com> * Temporarily comment Multimodal Stable Diffusion Train Signed-off-by: Abhishree <abhishreetm@gmail.com> * Update JenkinsFile for multimodal with latest main Signed-off-by: Abhishree <abhishreetm@gmail.com> * Upgrade PTL to version 2.2 in reqs Signed-off-by: Abhishree <abhishreetm@gmail.com> * Install PTL 2.2 from fork Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add strict arg to load_model_state_dict func in NLPDDPStrategy Signed-off-by: Abhishree <abhishreetm@gmail.com> * Delete 
megatron_t5_adapter_tuning.py, megatron_t5_ia3_tuning.py These files were added in the branch by mistake Signed-off-by: Abhishree <abhishreetm@gmail.com> * Delete megatron_t5_prompt_learning.py that got added by mistake Signed-off-by: Abhishree <abhishreetm@gmail.com> * Add appropriate comments, code clean up Signed-off-by: Abhishree <abhishreetm@gmail.com> * Remove PTL installation from JenkinsFile Signed-off-by: Abhishree <abhishreetm@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by: arendu <adithya.r@gmail.com> * llm embeddings with ptl2.2 Signed-off-by: arendu <adithya.r@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * global in batch negatives using all gather Signed-off-by: arendu <adithya.r@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove old files Signed-off-by: arendu <adithya.r@gmail.com> * remove changes in untouched files Signed-off-by: arendu <adithya.r@gmail.com> * inference for embedding model from ckpt Signed-off-by: arendu <adithya.r@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: arendu <adithya.r@gmail.com> Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com> Signed-off-by: Adi Renduchintala <adithyare@nvidia.com> Signed-off-by: Abhishree <abhishreetm@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Jiaqi Zeng <jiaqiz@nvidia.com> Co-authored-by: Tugrul Konuk <ertkonuk@gmail.com> Co-authored-by: Abhishree <abhishreetm@gmail.com> Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Eric Harper <complex451@gmail.com> Signed-off-by: Agoniii <815244047@qq.com>
* add mcore updaates Signed-off-by: dimapihtar <dpihtar@gmail.com> * update mcore version Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Agoniii <815244047@qq.com>
What does this PR do?
Adds support for ChatGLM2 and ChatGLM3.
Collection: NLP
Changelog

Usage
- Model conversion from Hugging Face to NeMo
- Inference
- Pretraining
- SFT (supervised fine-tuning)
- PEFT (parameter-efficient fine-tuning)
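At its core, Hugging Face to NeMo checkpoint conversion is a state-dict key remapping plus a config translation. A minimal sketch of the renaming step follows; the key names in the mapping table are illustrative assumptions, not the actual ChatGLM parameter mapping used by the converter script:

```python
def remap_keys(hf_state_dict, key_map):
    """Rename Hugging Face parameter keys to their NeMo counterparts.

    Keys without an entry in key_map pass through unchanged.
    """
    return {key_map.get(k, k): v for k, v in hf_state_dict.items()}


# Illustrative mapping only -- the real converter defines the actual
# ChatGLM -> NeMo parameter-name table (per-layer entries included).
KEY_MAP = {
    "transformer.embedding.word_embeddings.weight":
        "model.embedding.word_embeddings.weight",
    "transformer.output_layer.weight":
        "model.output_layer.weight",
}

hf_sd = {
    "transformer.output_layer.weight": [0.1, 0.2],
    "transformer.rotary_pos_emb.inv_freq": [1.0],  # no entry: kept as-is
}
nemo_sd = remap_keys(hf_sd, KEY_MAP)
print(sorted(nemo_sd))
# -> ['model.output_layer.weight', 'transformer.rotary_pos_emb.inv_freq']
```

The real script additionally splits or merges fused QKV weights to match the target tensor-parallel layout, which is where most converter bugs hide.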
Jenkins CI
To run Jenkins, a NeMo User with write access must comment
jenkins
on the PR.

Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information