support chatglm2&3 #8528

Closed
wants to merge 65 commits into from
Conversation

Agoniii (Contributor) commented Feb 27, 2024

What does this PR do?

Add support for ChatGLM2 and ChatGLM3.

Collection: NLP

Changelog

  • Add a script to convert ChatGLM2/3 checkpoints from Hugging Face to NeMo (convert_hf_chatglm_to_nemo.py).
  • Add ChatGLM configs for inference (megatron_chatglm_inference) and pretraining (megatron_chatglm_config).
  • Enable ChatGLM2/3 in the existing GPT pretraining, SFT, and PEFT workflows.

Usage

  • Usage examples for conversion, inference, pretraining, SFT, and PEFT are given below.

Model conversion from Hugging Face to NeMo

python ./scripts/nlp_language_modeling/convert_hf_chatglm_to_nemo.py --in-file '/mount/data/chatglm3-6b' --out-file '/mount/nemo_models/chatglm3.nemo'
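
The examples that follow reference a few shell variables; a minimal sketch with hypothetical example values (adjust the paths to your environment):

export CHATGLM_NEMO_MODEL=/mount/nemo_models/chatglm3.nemo   # output of the conversion above
export TRAIN_DS=/mount/data/train.jsonl                      # SFT/PEFT training JSONL (pretraining uses a Megatron data prefix instead, see below)
export VALID_DS=/mount/data/val.jsonl
export TP_SIZE=1
export PP_SIZE=1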

Inference

python ./examples/nlp/language_modeling/megatron_gpt_eval.py \
    --config-name=megatron_chatglm_inference \
    gpt_model_file=${CHATGLM_NEMO_MODEL} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    tensor_model_parallel_size=1 \
    pipeline_model_parallel_size=1 
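
Assuming the new megatron_chatglm_inference config follows the layout of NeMo's standard megatron_gpt_inference config (an assumption, not verified against this PR), prompts and sampling settings can also be overridden on the command line, for example:

python ./examples/nlp/language_modeling/megatron_gpt_eval.py \
    --config-name=megatron_chatglm_inference \
    gpt_model_file=${CHATGLM_NEMO_MODEL} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    tensor_model_parallel_size=1 \
    pipeline_model_parallel_size=1 \
    'prompts=["What is deep learning?"]' \
    inference.greedy=False \
    inference.top_p=0.9 \
    inference.temperature=0.8 \
    inference.tokens_to_generate=128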

Pretrain

 python3 -u examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-name=megatron_chatglm_config \
  trainer.devices=8 \
  trainer.num_nodes=1 \
  trainer.val_check_interval=100 \
  trainer.precision=bf16 \
 +model.data.data_prefix=${TRAIN_DS} \
  trainer.max_steps=1000 \
  trainer.max_epochs=1 \
  model.megatron_amp_O2=True \
  model.sequence_parallel=True \
  model.tensor_model_parallel_size=2 \
  model.pipeline_model_parallel_size=1 \
  model.global_batch_size=128 \
  model.micro_batch_size=1 \
  model.override_vocab_size=65024 \
  model.use_cpu_initialization=False
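
For pretraining, ${TRAIN_DS} is a Megatron data prefix (or weighted list of prefixes) rather than a JSONL file. A sketch of building one with NeMo's standard preprocessing script, assuming its usual flags, a hypothetical raw-text JSONL, and reusing the ChatGLM Hugging Face tokenizer:

python ./scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input=/mount/data/pretrain.jsonl \
    --json-keys=text \
    --tokenizer-library=huggingface \
    --tokenizer-type='/mount/data/chatglm3-6b' \
    --dataset-impl=mmap \
    --output-prefix=/mount/data/chatglm_pretrain \
    --append-eod \
    --workers=8

The resulting prefix could then be passed as, e.g., TRAIN_DS=[1.0,/mount/data/chatglm_pretrain_text_document].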

SFT

python ./examples/nlp/language_modeling/tuning/megatron_gpt_sft.py \
    trainer.devices=8 \
    trainer.val_check_interval=200 \
    model.restore_from_path=${CHATGLM_NEMO_MODEL} \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.answer_only_loss=True \
    model.global_batch_size=128 \
    model.micro_batch_size=1 \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.data.validation_ds.global_batch_size=128 \
    model.data.validation_ds.micro_batch_size=1 \
    model.megatron_amp_O2=True \
    exp_manager.exp_dir=exp \
    exp_manager.name=chatglm_test \
    trainer.precision=bf16
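
${TRAIN_DS} and ${VALID_DS} here point to JSONL files. Assuming the default GPT SFT prompt template (an assumption; the keys are configurable), each line carries an "input" and an "output" field, with the loss computed on the answer only (answer_only_loss=True). A hypothetical record:

{"input": "Explain tensor parallelism in one sentence.", "output": "Tensor parallelism splits each weight matrix across GPUs so that every device computes only a slice of each layer."}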

PEFT

python ./examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py \
  trainer.precision=bf16 \
  trainer.devices=8 \
  trainer.num_nodes=1 \
  trainer.max_steps=200 \
  trainer.val_check_interval=20 \
  ++trainer.limit_val_batches=10 \
  trainer.gradient_clip_val=1.0 \
  model.megatron_amp_O2=True \
  ++model.mcore_gpt=True \
  model.restore_from_path=${nemo_model} \
  model.peft.peft_scheme=${scheme} \
  model.data.train_ds.file_names=["${TRAIN_DS}"] \
  model.data.train_ds.concat_sampling_probabilities=[1.0] \
  model.data.train_ds.label_key='output' \
  model.data.train_ds.num_workers=0 \
  model.data.validation_ds.file_names=["${VALID_DS}"] \
  model.data.validation_ds.label_key='output' \
  model.data.validation_ds.num_workers=0 \
  model.tensor_model_parallel_size=${TP_SIZE} \
  model.pipeline_model_parallel_size=${PP_SIZE} \
  model.global_batch_size=${GBS} \
  model.micro_batch_size=1
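
The PEFT command reads several additional shell variables; hypothetical example values are shown below (lora is one of the schemes NeMo's PEFT tuning supports, alongside ptuning, adapter, and ia3):

export nemo_model=/mount/nemo_models/chatglm3.nemo
export scheme=lora
export GBS=32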

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

Review comment from a Collaborator on the new conversion script:

Also rename to convert_chatglm_hf_to_nemo (or nemo_to_hf)

Agoniii and others added 26 commits March 15, 2024 11:05
Signed-off-by: Agoniii <815244047@qq.com>
for more information, see https://pre-commit.ci

Signed-off-by: Agoniii <815244047@qq.com>
* Option to set matmul precision in transcribe and chunked infer scripts

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix unsupported literal type

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix imports

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

---------

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Revert FP8 integration

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com>
Co-authored-by: Ming <111467530+Victor49152@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Add taurus pytorch to nemo

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add a taurus jax to nemo conversion script and few other fixes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Clean up code

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* bug fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* renaming

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Fix arguments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add HF Gemma converter

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Turn off `apply_rope_fusion` during inference

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update conversion scripts

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add exporting stuff

* update conversion scripts

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add readme

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Save readme

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update jax

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Remove Gemma README_Gemma.rst

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update import path

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update docstring

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Revert "Add exporting stuff"

This reverts commit 17d00b0.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove neva cyclic imports

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Remove not used vars

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "Remove neva cyclic imports"

This reverts commit 898d9ed.

* Fix cyclic import

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* remove neva folder

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove not used vars

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update docstrings in converter

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Address comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add docstring

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

---------

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Bobby Chen <bobchen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* check if none before encode special token

Signed-off-by: Huiying Li <willwin.lee@gmail.com>

* handle when pad_id does not exist for hf Autotokenizer

Signed-off-by: Huiying Li <willwin.lee@gmail.com>

* refactor pad_id assignment to use getattr for cleaner code readability

Signed-off-by: Huiying Li <willwin.lee@gmail.com>

---------

Signed-off-by: Huiying Li <willwin.lee@gmail.com>
Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com>
Co-authored-by: Ming <111467530+Victor49152@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Update checkpoint.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update configs.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update intro.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update intro.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update neva.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update checkpoint.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update configs.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update controlnet.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update datasets.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update dreambooth.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update imagen.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update insp2p.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update sd.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update checkpoint.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update clip.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update mcore_customization.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update retro_model.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update migration-guide.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update nemo_forced_aligner.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update checkpoints.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update datasets.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update g2p.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update checkpoint.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update configs.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update datasets.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update vit.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update core.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update export.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

---------

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Add Kaiming uniform init for LoRA adapters

Signed-off-by: Michal Futrega <mfutrega@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update parallel_adapters.py

Signed-off-by: Michal Futrega <mfutrega@nvidia.com>

* Set value a via function argument

Signed-off-by: Michal Futrega <mfutrega@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Michal Futrega <mfutrega@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Shriya Palsamudram <69161273+ShriyaPalsamudram@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* add fsdp support for gpt fine-tuning

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* fsdp fix for save_nemo_on_train_end

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* add packed_seq_params param to MCoreTransformerLayerMixin

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* fix remove ckpt issue

Signed-off-by: dimapihtar <dpihtar@gmail.com>

---------

Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Update neva.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update datasets.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update vit.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

---------

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* fix AccessMixin

Signed-off-by: stevehuang52 <heh@nvidia.com>

* remove caching propagate_model_guid

Signed-off-by: stevehuang52 <heh@nvidia.com>

---------

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Pablo Garay <palenq@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
Update mcore version in Dockerfile

Signed-off-by: Pablo Garay <palenq@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Add NeMo Models

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add codec models

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* update link

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Agoniii <815244047@qq.com>
* Account for amp_O2 in nemo_llama_to_hf conversion

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Package converted model with new tokenizer not old

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Account for variations in megatron_amp_O2 behavior

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Resize the embeddings matrix

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Correct precision when saving to HF folder

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Fix typo in sample script

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Fix typo in logging

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix O2 issue properly in peft mixin

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
* eval with ckpt

Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>

* fix condition

Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>

* modify ci test

Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>

---------

Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
* initial commit

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add check to prevent mcore -> legacy conversion

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* key name change for legacy mlm -> mcore nemo conversion

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* alternative layer norm key names

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix error with expert parallel groups

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert previous change

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Update transcribe calls

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update transcribe calls

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update transcribe calls

Signed-off-by: smajumdar <titu1994@gmail.com>

---------

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* fix the bug where hybrid TDT CTC model uses incorrect decoding class for inference

Signed-off-by: Hainan Xu <hainanx@nvidia.com>

* move the fix into a function and call it from different subclassees

Signed-off-by: Hainan Xu <hainanx@nvidia.com>

---------

Signed-off-by: Hainan Xu <hainanx@nvidia.com>
Co-authored-by: Hainan Xu <hainanx@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Fix SpeakerDecoder doc string

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* Fix asr RNNT doc strings

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* Fix ctc decoding doc strings

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* More doc string fixes

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* RNNTDecoding doc strings fix

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* More doc string fixes

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* modelPT, dataset doc string fixes

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* Fix generate, encode, decode docstrings

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* Update generate function docstring

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* More generate function docstring update

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

---------

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
Co-authored-by: Dong Hyuk Chang <donghyukc@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
gshennvm and others added 22 commits March 15, 2024 11:06
Signed-off-by: Gerald Shen <geshen@nvidia.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update PTL version in requirements

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import and comment val_iterator_done

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Temporarily comment out CPU Unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove precision arg from Trainer in convert_hf_llama_to_nemo.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable NMT Training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix val_step, test_step func API MegatronLMEncoderDecoderModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Enable NMT training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Disable some unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment CI tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment resume part of BART

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Uncomment few lines from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return len of dataloader in microbatches

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix _link_checkpoint

1) Add inject_model_parallel_rank to _link_checkpoint
2) Override super._link_checkpoint to remove condition check for rank 0

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Check if using dist ckpt in _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove batch_idx arg from validation_step megatron_gpt_sft_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL bug fix branch

Test unit tests with PTL bug fix https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable test_ema_saved_state in test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL with fs.lexists

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment _link_checkpoint related overrides

In order to test with PTL without symbolic links

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return only batch for dataloader_iter in DFT model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Modify get_batch in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition checks for batch extraction from dataloader_iter

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add missing condition check for batch extraction in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment test_invalid_checkpoints_removed_from_topk in test_exp_manager.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test invalid ckpts in test_exp_manager.py

Also uncomment some of the commented out tests in JenkinsFile and test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix bug in test_invalid_checkpoints_removed_from_topk

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix validation step of GPTModel for finetuning case with multi dataloaders

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test_step_outputs for SFT in GPTMOdel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Pass dataloader_idx for val_step of GPTModel and remove unwanted code

1) Pass dataloader_idx to val_step of GPTModel as its required for GPTSFTModel in case multi dataloaders to append the outputs correctly val/test_step_output
2) Remove val_iterator_done check from all megatron GPT models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extraction of batch in T5SFTModel & LMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extracting batch in MegatronNMTModel

Also uncomment GPT PP=2 and NMT tests from JenkinsFIle

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix typo and uncomment multimodel tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change to new dataloader_iter API for MultiModal

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix new dataloader_api for MegatronLatenDiffusion Model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Store and restore precision value in MegatronGPTSFTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily comment Multimodal Stable Diffusion Train

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update JenkinsFile for multimodal with latest main

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Upgrade PTL to version 2.2 in reqs

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Install PTL 2.2 from fork

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add strict arg to load_model_state_dict func in NLPDDPStrategy

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Delete megatron_t5_adapter_tuning.py, megatron_t5_ia3_tuning.py

These files were added in the branch by mistake

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Delete megatron_t5_prompt_learning.py that got added by mistake

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add appropriate comments, code clean up

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove PTL installation from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update PTL version to be >= 2.2.1

Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>

---------

Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Update docs version

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update docs for NeMo Framework

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update docs for NeMo Framework

Signed-off-by: smajumdar <titu1994@gmail.com>

---------

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Update results.rst for Canary Inference

Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>

* Update results.rst for Canary Inference

Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>

---------

Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>
Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>
Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
* remove LoRA SP no redundant comm for all linear layers

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert to scatter in adapter module instead of scatter after add

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
Minor copy and instruction changes to improve tutorial viability.

Signed-off-by: Chris Alexiuk <161380339+chrisalexiuk-nvidia@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
…ests (NVIDIA#8444)

* AMMO integration with Llama2 PTQ example and tests

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Jenkins megatron_llama_quantization.py test setup

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* License headers

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Add AMMO to requirements_nlp.txt with --extra-index-url for pip install

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Bump AMMO version to latest

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Guards workaround on spec definition

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Save artifacts and tokenizer config at once

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Extend nemo.utils package with new tools

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reorganize & reformat

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Tests for FP8 and INT4 AWQ

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add load_config helper function

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Unused import removal

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Fix FP8 Jenkins test

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Fix TP=2 test cont'd: no need to use mpirun

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Allow for patches in AMMO versioning

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Drop AWQ test for now (need to debug)

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Allow for patches in AMMO versioning cont'd

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Use AMMO spec from MCore as it has been published

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Make AMMO optional dependency and properly import guard it

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add Llama2 AWQ test and update some paths

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Enable specifying quantization.algorithm=null for baseline accuracy checks

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Enable exporting qnemo tarball or just to a directory

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Drop AWQ testing for now

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Test case for export.inference_tensor_parallel=2

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Flag to export TRT-LLM config.json

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

---------

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* fix FIM RNG issue

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix FIMDataset

* fix seed ref

* fim fix

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* add fim test

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* remove files

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* remove swp

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* remove import

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* fix syntax

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* fix Jenkins

Signed-off-by: dimapihtar <dpihtar@gmail.com>

---------

Signed-off-by: dimapihtar <dpihtar@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
…NVIDIA#8640)

* Add support to perform "inference-only" without loading training data

Hi,
Currently, the MegatronSBERT model cannot run inference. Essentially, a user may not be able to simply load a trained .nemo checkpoint and run inference (forward()) function on it.

This patch adds a try/except block to handle cases where training data is not specified

Signed-off-by: Aditya Malte <aditya.malte@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Aditya Malte <aditya.malte@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* add ctcws tutorial

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

* clear cell outputs

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

* fixes

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

* fixes

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

* fixes

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

---------

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: rachitg <rachitg@nvidia.com>
Co-authored-by: rachitg <rachitg@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
* config update

Signed-off-by: arendu <adithya.r@gmail.com>

* save embeddings and some refac

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* entry point script for dumping embeddings to disk

Signed-off-by: arendu <adithya.r@gmail.com>

* normalize query and pos_doc even if no soft negatives are used

Signed-off-by: arendu <adithya.r@gmail.com>

* yaml for generation script

Signed-off-by: arendu <adithya.r@gmail.com>

* all possible negatives

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updates

Signed-off-by: arendu <adithya.r@gmail.com>

* logging

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* need to update docstrings

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* headers and rename

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* log diff and fix cs logging

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* non-standard solution to get wandb logger to have the config

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for rank

Signed-off-by: arendu <adithya.r@gmail.com>

* cfg working for multi gpu

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* MCoreMixin chages.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* using new commit of meg-LM

Signed-off-by: arendu <adithya.r@gmail.com>

* default to use all layers for lora

Signed-off-by: arendu <adithya.r@gmail.com>

* validation only uses hard negatives, val scores are batch agnostic

Signed-off-by: arendu <adithya.r@gmail.com>

* minor reorg

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* metadata and bug fixes

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* dump embeddings with tracable ids, disabled val logs for the moment

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* val ids

Signed-off-by: arendu <adithya.r@gmail.com>

* val ids by consumed samples

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* don't gather if not saving embs

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* init global step to allow consumed samples to be called in test time

Signed-off-by: arendu <adithya.r@gmail.com>

* enable adapters with packed seq

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update PTL version in requirements

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import and comment val_iterator_done

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable GPU unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Temporarily comment out CPU Unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove precision arg from Trainer in convert_hf_llama_to_nemo.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable NMT Training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix val_step, test_step func API MegatronLMEncoderDecoderModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Enable NMT training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Disable some unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment CI tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment resume part of BART

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Uncomment few lines from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return len of dataloader in microbatches

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix _link_checkpoint

1) Add inject_model_parallel_rank to _link_checkpoint
2) Override super._link_checkpoint to remove condition check for rank 0

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Check if using dist ckpt in _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable GPT with PP=2

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove batch_idx arg from validation_step megatron_gpt_sft_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL bug fix branch

Test unit tests with PTL bug fix https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable test_ema_saved_state in test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL with fs.lexists

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment _link_checkpoint related overrides

In order to test with PTL without symbolic links

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return only batch for dataloader_iter in DFT model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Modify get_batch in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition checks for batch extraction from dataloader_iter

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add missing condition check for batch extraction in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment test_invalid_checkpoints_removed_from_topk in test_exp_manager.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test invalid ckpts in test_exp_manager.py

Also uncomment some of the commented out tests in JenkinsFile and test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix bug in test_invalid_checkpoints_removed_from_topk

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix validation step of GPTModel for finetuning case with multi dataloaders

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* multi dataloaders for validation query and docs

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* validation loop made more efficient with 2 dataloders

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP test set generation

Signed-off-by: arendu <adithya.r@gmail.com>

* generate working for multi dataloaders

Signed-off-by: arendu <adithya.r@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update PTL version in requirements

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import and comment val_iterator_done

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable GPU unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Temporarily comment out CPU Unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove precision arg from Trainer in convert_hf_llama_to_nemo.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable NMT Training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix val_step, test_step func API MegatronLMEncoderDecoderModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Enable NMT training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Disable some unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment CI tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment resume part of BART

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Uncomment few lines from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return len of dataloader in microbatches

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix _link_checkpoint

1) Add inject_model_parallel_rank to _link_checkpoint
2) Override super._link_checkpoint to remove condition check for rank 0

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Check if using dist ckpt in _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable GPT with PP=2

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove batch_idx arg from validation_step megatron_gpt_sft_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL bug fix branch

Test unit tests with PTL bug fix https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable test_ema_saved_state in test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL with fs.lexists

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment _link_checkpoint related overrides

In order to test with PTL without symbolic links

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return only batch for dataloader_iter in DFT model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Modify get_batch in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition checks for batch extraction from dataloader_iter

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add missing condition check for batch extraction in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment test_invalid_checkpoints_removed_from_topk in test_exp_manager.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test invalid ckpts in test_exp_manager.py

Also uncomment some of the commented out tests in JenkinsFile and test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix bug in test_invalid_checkpoints_removed_from_topk

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix validation step of GPTModel for finetuning case with multi dataloaders

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test_step_outputs for SFT in GPTMOdel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Pass dataloader_idx for val_step of GPTModel and remove unwanted code

1) Pass dataloader_idx to val_step of GPTModel as its required for GPTSFTModel in case multi dataloaders to append the outputs correctly val/test_step_output
2) Remove val_iterator_done check from all megatron GPT models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extraction of batch in T5SFTModel & LMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extracting batch in MegatronNMTModel

Also uncomment GPT PP=2 and NMT tests from JenkinsFIle

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix typo and uncomment multimodel tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* default names

Signed-off-by: arendu <adithya.r@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update PTL version in requirements

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import and comment val_iterator_done

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Temporarily comment out CPU Unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove precision arg from Trainer in convert_hf_llama_to_nemo.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable NMT Training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix val_step, test_step func API MegatronLMEncoderDecoderModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Enable NMT training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Disable some unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment CI tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment resume part of BART

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Uncomment few lines from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return len of dataloader in microbatches

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix _link_checkpoint

1) Add inject_model_parallel_rank to _link_checkpoint
2) Override super._link_checkpoint to remove condition check for rank 0

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Check if using dist ckpt in _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove batch_idx arg from validation_step megatron_gpt_sft_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL bug fix branch

Test unit tests with PTL bug fix https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable test_ema_saved_state in test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL with fs.lexists

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment _link_checkpoint related overrides

In order to test with PTL without symbolic links

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return only batch for dataloader_iter in DFT model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Modify get_batch in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition checks for batch extraction from dataloader_iter

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add missing condition check for batch extraction in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment test_invalid_checkpoints_removed_from_topk in test_exp_manager.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test invalid ckpts in test_exp_manager.py

Also uncomment some of the commented out tests in JenkinsFile and test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix bug in test_invalid_checkpoints_removed_from_topk

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix validation step of GPTModel for finetuning case with multi dataloaders

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test_step_outputs for SFT in GPTMOdel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Pass dataloader_idx for val_step of GPTModel and remove unwanted code

1) Pass dataloader_idx to val_step of GPTModel as its required for GPTSFTModel in case multi dataloaders to append the outputs correctly val/test_step_output
2) Remove val_iterator_done check from all megatron GPT models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extraction of batch in T5SFTModel & LMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extracting batch in MegatronNMTModel

Also uncomment GPT PP=2 and NMT tests from JenkinsFIle

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix typo and uncomment multimodel tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change to new dataloader_iter API for MultiModal

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix new dataloader_api for MegatronLatenDiffusion Model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Store and restore precision value in MegatronGPTSFTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily comment Multimodal Stable Diffusion Train

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update JenkinsFile for multimodal with latest main

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Upgrade PTL to version 2.2 in reqs

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Install PTL 2.2 from fork

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add strict arg to load_model_state_dict func in NLPDDPStrategy

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Delete megatron_t5_adapter_tuning.py, megatron_t5_ia3_tuning.py

These files were added in the branch by mistake

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Delete megatron_t5_prompt_learning.py that got added by mistake

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add appropriate comments, code clean up

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove PTL installation from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <adithya.r@gmail.com>

* llm embeddings with ptl2.2

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* global in batch negatives using all gather

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove old files

Signed-off-by: arendu <adithya.r@gmail.com>

* remove changes in untouched files

Signed-off-by: arendu <adithya.r@gmail.com>

* inference for embedding model from ckpt

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <adithya.r@gmail.com>
Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>
Signed-off-by: Adi Renduchintala <adithyare@nvidia.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jiaqi Zeng <jiaqiz@nvidia.com>
Co-authored-by: Tugrul Konuk <ertkonuk@gmail.com>
Co-authored-by: Abhishree <abhishreetm@gmail.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* add mcore updates

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* update mcore version

Signed-off-by: dimapihtar <dpihtar@gmail.com>

---------

Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Agoniii <815244047@qq.com>