support chatglm2&3 #8528

Closed
wants to merge 65 commits into from
Conversation

Agoniii (Contributor) commented Feb 27, 2024

What does this PR do?

Add support for ChatGLM2 and ChatGLM3.

Collection: NLP

Changelog

  • Add a script to convert ChatGLM2/3 checkpoints from Hugging Face to NeMo (convert_hf_chatglm_to_nemo.py).
  • Add ChatGLM configs for inference (megatron_chatglm_inference) and pretraining (megatron_chatglm_config).
  • Enable ChatGLM2/3 in the existing GPT pretraining, SFT, and PEFT workflows.

Usage

  • Usage examples for conversion, inference, pretraining, SFT, and PEFT are given below.

Model conversion from Hugging Face to NeMo

python ./scripts/nlp_language_modeling/convert_hf_chatglm_to_nemo.py --in-file '/mount/data/chatglm3-6b' --out-file '/mount/nemo_models/chatglm3.nemo'
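
The examples that follow reference a few shell variables; a minimal sketch with hypothetical example values (adjust the paths to your environment):

export CHATGLM_NEMO_MODEL=/mount/nemo_models/chatglm3.nemo   # output of the conversion above
export TRAIN_DS=/mount/data/train.jsonl                      # SFT/PEFT training JSONL (pretraining uses a Megatron data prefix instead, see below)
export VALID_DS=/mount/data/val.jsonl
export TP_SIZE=1
export PP_SIZE=1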

Inference

python ./examples/nlp/language_modeling/megatron_gpt_eval.py \
    --config-name=megatron_chatglm_inference \
    gpt_model_file=${CHATGLM_NEMO_MODEL} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    tensor_model_parallel_size=1 \
    pipeline_model_parallel_size=1 
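
Assuming the new megatron_chatglm_inference config follows the layout of NeMo's standard megatron_gpt_inference config (an assumption, not verified against this PR), prompts and sampling settings can also be overridden on the command line, for example:

python ./examples/nlp/language_modeling/megatron_gpt_eval.py \
    --config-name=megatron_chatglm_inference \
    gpt_model_file=${CHATGLM_NEMO_MODEL} \
    trainer.devices=1 \
    trainer.num_nodes=1 \
    tensor_model_parallel_size=1 \
    pipeline_model_parallel_size=1 \
    'prompts=["What is deep learning?"]' \
    inference.greedy=False \
    inference.top_p=0.9 \
    inference.temperature=0.8 \
    inference.tokens_to_generate=128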

Pretrain

 python3 -u examples/nlp/language_modeling/megatron_gpt_pretraining.py \
  --config-name=megatron_chatglm_config \
  trainer.devices=8 \
  trainer.num_nodes=1 \
  trainer.val_check_interval=100 \
  trainer.precision=bf16 \
 +model.data.data_prefix=${TRAIN_DS} \
  trainer.max_steps=1000 \
  trainer.max_epochs=1 \
  model.megatron_amp_O2=True \
  model.sequence_parallel=True \
  model.tensor_model_parallel_size=2 \
  model.pipeline_model_parallel_size=1 \
  model.global_batch_size=128 \
  model.micro_batch_size=1 \
  model.override_vocab_size=65024 \
  model.use_cpu_initialization=False
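
For pretraining, ${TRAIN_DS} is a Megatron data prefix (or weighted list of prefixes) rather than a JSONL file. A sketch of building one with NeMo's standard preprocessing script, assuming its usual flags, a hypothetical raw-text JSONL, and reusing the ChatGLM Hugging Face tokenizer:

python ./scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input=/mount/data/pretrain.jsonl \
    --json-keys=text \
    --tokenizer-library=huggingface \
    --tokenizer-type='/mount/data/chatglm3-6b' \
    --dataset-impl=mmap \
    --output-prefix=/mount/data/chatglm_pretrain \
    --append-eod \
    --workers=8

The resulting prefix could then be passed as, e.g., TRAIN_DS=[1.0,/mount/data/chatglm_pretrain_text_document].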

SFT

python ./examples/nlp/language_modeling/tuning/megatron_gpt_sft.py \
    trainer.devices=8 \
    trainer.val_check_interval=200 \
    model.restore_from_path=${CHATGLM_NEMO_MODEL} \
    model.tensor_model_parallel_size=${TP_SIZE} \
    model.pipeline_model_parallel_size=${PP_SIZE} \
    model.answer_only_loss=True \
    model.global_batch_size=128 \
    model.micro_batch_size=1 \
    model.data.train_ds.file_names=${TRAIN_DS} \
    model.data.train_ds.concat_sampling_probabilities=[1.0] \
    model.data.validation_ds.file_names=${VALID_DS} \
    model.data.validation_ds.global_batch_size=128 \
    model.data.validation_ds.micro_batch_size=1 \
    model.megatron_amp_O2=True \
    exp_manager.exp_dir=exp \
    exp_manager.name=chatglm_test \
    trainer.precision=bf16
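
${TRAIN_DS} and ${VALID_DS} here point to JSONL files. Assuming the default GPT SFT prompt template (an assumption; the keys are configurable), each line carries an "input" and an "output" field, with the loss computed on the answer only (answer_only_loss=True). A hypothetical record:

{"input": "Explain tensor parallelism in one sentence.", "output": "Tensor parallelism splits each weight matrix across GPUs so that every device computes only a slice of each layer."}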

PEFT

python ./examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py \
  trainer.precision=bf16 \
  trainer.devices=8 \
  trainer.num_nodes=1 \
  trainer.max_steps=200 \
  trainer.val_check_interval=20 \
  ++trainer.limit_val_batches=10 \
  trainer.gradient_clip_val=1.0 \
  model.megatron_amp_O2=True \
  ++model.mcore_gpt=True \
  model.restore_from_path=${nemo_model} \
  model.peft.peft_scheme=${scheme} \
  model.data.train_ds.file_names=["${TRAIN_DS}"] \
  model.data.train_ds.concat_sampling_probabilities=[1.0] \
  model.data.train_ds.label_key='output' \
  model.data.train_ds.num_workers=0 \
  model.data.validation_ds.file_names=["${VALID_DS}"] \
  model.data.validation_ds.label_key='output' \
  model.data.validation_ds.num_workers=0 \
  model.tensor_model_parallel_size=${TP_SIZE} \
  model.pipeline_model_parallel_size=${PP_SIZE} \
  model.global_batch_size=${GBS} \
  model.micro_batch_size=1
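
The PEFT command reads several additional shell variables; hypothetical example values are shown below (lora is one of the schemes NeMo's PEFT tuning supports, alongside ptuning, adapter, and ia3):

export nemo_model=/mount/nemo_models/chatglm3.nemo
export scheme=lora
export GBS=32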

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

Review comment from a Collaborator on the new conversion script:

Also rename to convert_chatglm_hf_to_nemo (or nemo_to_hf)

Agoniii and others added 26 commits March 15, 2024 11:05
Signed-off-by: Agoniii <815244047@qq.com>
for more information, see https://pre-commit.ci

Signed-off-by: Agoniii <815244047@qq.com>
* Option to set matmul precision in transcribe and chunked infer scripts

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix unsupported literal type

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Fix imports

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

---------

Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Revert FP8 integration

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com>
Co-authored-by: Ming <111467530+Victor49152@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Add taurus pytorch to nemo

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add a taurus jax to nemo conversion script and few other fixes

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Clean up code

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* bug fix

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* renaming

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Fix arguments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add HF Gemma converter

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Turn off `apply_rope_fusion` during inference

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update conversion scripts

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add exporting stuff

* update conversion scripts

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add readme

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Save readme

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update jax

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Remove Gemma README_Gemma.rst

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update import path

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Update docstring

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Revert "Add exporting stuff"

This reverts commit 17d00b0.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove neva cyclic imports

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Remove not used vars

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "Remove neva cyclic imports"

This reverts commit 898d9ed.

* Fix cyclic import

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* remove neva folder

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove not used vars

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* update docstrings in converter

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Address comments

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Add docstring

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

---------

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Bobby Chen <bobchen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* check if none before encode special token

Signed-off-by: Huiying Li <willwin.lee@gmail.com>

* handle when pad_id does not exist for hf Autotokenizer

Signed-off-by: Huiying Li <willwin.lee@gmail.com>

* refactor pad_id assignment to use getattr for cleaner code readability

Signed-off-by: Huiying Li <willwin.lee@gmail.com>

---------

Signed-off-by: Huiying Li <willwin.lee@gmail.com>
Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Mingyuan Ma <mingyuanm@nvidia.com>
Co-authored-by: Ming <111467530+Victor49152@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Neal Vaidya <nealv@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Update checkpoint.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update configs.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update intro.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update intro.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update neva.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update checkpoint.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update configs.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update controlnet.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update datasets.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update dreambooth.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update imagen.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update insp2p.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update sd.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update checkpoint.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update clip.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update mcore_customization.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update retro_model.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update migration-guide.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update nemo_forced_aligner.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update checkpoints.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update datasets.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update g2p.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update checkpoint.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update configs.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update datasets.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update vit.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update core.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update export.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

---------

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Add Kaiming uniform init for LoRA adapters

Signed-off-by: Michal Futrega <mfutrega@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update parallel_adapters.py

Signed-off-by: Michal Futrega <mfutrega@nvidia.com>

* Set value a via function argument

Signed-off-by: Michal Futrega <mfutrega@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Michal Futrega <mfutrega@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Shriya Palsamudram <69161273+ShriyaPalsamudram@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* add fsdp support for gpt fine-tuning

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* fsdp fix for save_nemo_on_train_end

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* add packed_seq_params param to MCoreTransformerLayerMixin

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* fix remove ckpt issue

Signed-off-by: dimapihtar <dpihtar@gmail.com>

---------

Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Update neva.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update datasets.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

* Update vit.rst

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>

---------

Signed-off-by: Andrew Schilling <85314306+aschilling-nv@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* fix AccessMixin

Signed-off-by: stevehuang52 <heh@nvidia.com>

* remove caching propagate_model_guid

Signed-off-by: stevehuang52 <heh@nvidia.com>

---------

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Pablo Garay <palenq@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
Update mcore version in Dockerfile

Signed-off-by: Pablo Garay <palenq@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Add NeMo Models

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add codec models

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* update link

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: Agoniii <815244047@qq.com>
* Account for amp_O2 in nemo_llama_to_hf conversion

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Package converted model with new tokenizer not old

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Account for variations in megatron_amp_O2 behavior

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Resize the embeddings matrix

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Correct precision when saving to HF folder

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Fix typo in sample script

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* Fix typo in logging

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix O2 issue properly in peft mixin

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Teodor-Dumitru Ene <teodord.ene@gmail.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
* eval with ckpt

Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>

* fix condition

Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>

* modify ci test

Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>

---------

Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
* initial commit

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add check to prevent mcore -> legacy conversion

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* key name change for legacy mlm -> mcore nemo conversion

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* alternative layer norm key names

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix error with expert parallel groups

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert previous change

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Update transcribe calls

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update transcribe calls

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update transcribe calls

Signed-off-by: smajumdar <titu1994@gmail.com>

---------

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* fix the bug where hybrid TDT CTC model uses incorrect decoding class for inference

Signed-off-by: Hainan Xu <hainanx@nvidia.com>

* move the fix into a function and call it from different subclassees

Signed-off-by: Hainan Xu <hainanx@nvidia.com>

---------

Signed-off-by: Hainan Xu <hainanx@nvidia.com>
Co-authored-by: Hainan Xu <hainanx@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Fix SpeakerDecoder doc string

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* Fix asr RNNT doc strings

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* Fix ctc decoding doc strings

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* More doc string fixes

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* RNNTDecoding doc strings fix

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* More doc string fixes

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* modelPT, dataset doc string fixes

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* Fix generate, encode, decode docstrings

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* Update generate function docstring

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* More generate function docstring update

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>

---------

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
Co-authored-by: Dong Hyuk Chang <donghyukc@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
gshennvm and others added 22 commits March 15, 2024 11:06
Signed-off-by: Gerald Shen <geshen@nvidia.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update PTL version in requirements

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import and comment val_iterator_done

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Temporarily comment out CPU Unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove precision arg from Trainer in convert_hf_llama_to_nemo.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable NMT Training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix val_step, test_step func API MegatronLMEncoderDecoderModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Enable NMT training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Disable some unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment CI tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment resume part of BART

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Uncomment few lines from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return len of dataloader in microbatches

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix _link_checkpoint

1) Add inject_model_parallel_rank to _link_checkpoint
2) Override super._link_checkpoint to remove condition check for rank 0

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Check if using dist ckpt in _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove batch_idx arg from validation_step megatron_gpt_sft_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL bug fix branch

Test unit tests with PTL bug fix https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable test_ema_saved_state in test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL with fs.lexists

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment _link_checkpoint related overrides

In order to test with PTL without symbolic links

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return only batch for dataloader_iter in DFT model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Modify get_batch in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition checks for batch extraction from dataloader_iter

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add missing condition check for batch extraction in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment test_invalid_checkpoints_removed_from_topk in test_exp_manager.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test invalid ckpts in test_exp_manager.py

Also uncomment some of the commented out tests in JenkinsFile and test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix bug in test_invalid_checkpoints_removed_from_topk

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix validation step of GPTModel for finetuning case with multi dataloaders

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test_step_outputs for SFT in GPTMOdel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Pass dataloader_idx for val_step of GPTModel and remove unwanted code

1) Pass dataloader_idx to val_step of GPTModel as its required for GPTSFTModel in case multi dataloaders to append the outputs correctly val/test_step_output
2) Remove val_iterator_done check from all megatron GPT models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extraction of batch in T5SFTModel & LMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extracting batch in MegatronNMTModel

Also uncomment GPT PP=2 and NMT tests from JenkinsFIle

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix typo and uncomment multimodel tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change to new dataloader_iter API for MultiModal

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix new dataloader_api for MegatronLatenDiffusion Model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Store and restore precision value in MegatronGPTSFTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily comment Multimodal Stable Diffusion Train

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update JenkinsFile for multimodal with latest main

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Upgrade PTL to version 2.2 in reqs

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Install PTL 2.2 from fork

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add strict arg to load_model_state_dict func in NLPDDPStrategy

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Delete megatron_t5_adapter_tuning.py, megatron_t5_ia3_tuning.py

These files were added in the branch by mistake

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Delete megatron_t5_prompt_learning.py that got added by mistake

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add appropriate comments, code clean up

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove PTL installation from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update PTL version to be >= 2.2.1

Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>

---------

Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Update docs version

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update docs for NeMo Framework

Signed-off-by: smajumdar <titu1994@gmail.com>

* Update docs for NeMo Framework

Signed-off-by: smajumdar <titu1994@gmail.com>

---------

Signed-off-by: smajumdar <titu1994@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* Update results.rst for Canary Inference

Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>

* Update results.rst for Canary Inference

Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>

---------

Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>
Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Krishna Puvvada <kpuvvada@nvidia.com>
Co-authored-by: Krishna Puvvada <kpuvvada@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
* remove LoRA SP no redundant comm for all linear layers

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert to scatter in adapter module instead of scatter after add

Signed-off-by: Chen Cui <chcui@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
Minor copy and instruction changes to improve tutorial viability.

Signed-off-by: Chris Alexiuk <161380339+chrisalexiuk-nvidia@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
…ests (NVIDIA#8444)

* AMMO integration with Llama2 PTQ example and tests

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Jenkins megatron_llama_quantization.py test setup

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* License headers

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Add AMMO to requirements_nlp.txt with --extra-index-url for pip install

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Bump AMMO version to latest

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Guards workaround on spec definition

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Save artifacts and tokenizer config at once

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Extend nemo.utils package with new tools

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Reorganize & reformat

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Tests for FP8 and INT4 AWQ

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add load_config helper function

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Unused import removal

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Fix FP8 Jenkins test

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Fix TP=2 test cont'd: no need to use mpirun

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Allow for patches in AMMO versioning

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Drop AWQ test for now (need to debug)

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Allow for patches in AMMO versioning cont'd

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Use AMMO spec from MCore as it has been published

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Make AMMO optional dependency and properly import guard it

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add Llama2 AWQ test and update some paths

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Enable specifying quantization.algorithm=null for baseline accuracy checks

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Enable exporting qnemo tarball or just to a directory

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Drop AWQ testing for now

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Test case for export.inference_tensor_parallel=2

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Flag to export TRT-LLM config.json

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

---------

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* fix FIM RNG issue

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix FIMDataset

* fix seed ref

* fim fix

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* add fim test

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* remove files

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* remove swp

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* remove import

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* fix syntax

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* fix Jenkins

Signed-off-by: dimapihtar <dpihtar@gmail.com>

---------

Signed-off-by: dimapihtar <dpihtar@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
…NVIDIA#8640)

* Add support to perform "inference-only" without loading training data

Hi,
Currently, the MegatronSBERT model cannot run inference. Essentially, a user may not be able to simply load a trained .nemo checkpoint and run inference (forward()) function on it.

This patch adds a try/except block to handle cases where training data is not specified

Signed-off-by: Aditya Malte <aditya.malte@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Aditya Malte <aditya.malte@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
* add ctcws tutorial

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

* clear cell outputs

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

* fixes

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

* fixes

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

* fixes

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>

---------

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: rachitg <rachitg@nvidia.com>
Co-authored-by: rachitg <rachitg@nvidia.com>
Signed-off-by: Agoniii <815244047@qq.com>
* config update

Signed-off-by: arendu <adithya.r@gmail.com>

* save embeddings and some refac

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* entry point script for dumping embeddings to disk

Signed-off-by: arendu <adithya.r@gmail.com>

* normalize query and pos_doc even if no soft negatives are used

Signed-off-by: arendu <adithya.r@gmail.com>

* yaml for generation script

Signed-off-by: arendu <adithya.r@gmail.com>

* all possible negatives

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updates

Signed-off-by: arendu <adithya.r@gmail.com>

* logging

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* need to update docstrings

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* headers and rename

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* log diff and fix cs logging

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* non-standard solution to get wandb logger to have the config

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for rank

Signed-off-by: arendu <adithya.r@gmail.com>

* cfg working for multi gpu

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* MCoreMixin chages.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* using new commit of meg-LM

Signed-off-by: arendu <adithya.r@gmail.com>

* default to use all layers for lora

Signed-off-by: arendu <adithya.r@gmail.com>

* validation only uses hard negatives, val scores are batch agnostic

Signed-off-by: arendu <adithya.r@gmail.com>

* minor reorg

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* metadata and bug fixes

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* dump embeddings with tracable ids, disabled val logs for the moment

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* val ids

Signed-off-by: arendu <adithya.r@gmail.com>

* val ids by consumed samples

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* don't gather if not saving embs

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* init global step to allow consumed samples to be called in test time

Signed-off-by: arendu <adithya.r@gmail.com>

* enable adapters with packed seq

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update PTL version in requirements

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import and comment val_iterator_done

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable GPU unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Temporarily comment out CPU Unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove precision arg from Trainer in convert_hf_llama_to_nemo.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable NMT Training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix val_step, test_step func API MegatronLMEncoderDecoderModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Enable NMT training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Disable some unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment CI tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment resume part of BART

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Uncomment few lines from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return len of dataloader in microbatches

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix _link_checkpoint

1) Add inject_model_parallel_rank to _link_checkpoint
2) Override super._link_checkpoint to remove condition check for rank 0

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Check if using dist ckpt in _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable GPT with PP=2

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove batch_idx arg from validation_step megatron_gpt_sft_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL bug fix branch

Test unit tests with PTL bug fix https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable test_ema_saved_state in test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL with fs.lexists

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment _link_checkpoint related overrides

In order to test with PTL without symbolic links

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return only batch for dataloader_iter in DFT model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Modify get_batch in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition checks for batch extraction from dataloader_iter

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add missing condition check for batch extraction in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment test_invalid_checkpoints_removed_from_topk in test_exp_manager.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test invalid ckpts in test_exp_manager.py

Also uncomment some of the commented out tests in JenkinsFile and test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix bug in test_invalid_checkpoints_removed_from_topk

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix validation step of GPTModel for finetuning case with multi dataloaders

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* multi dataloaders for validation query and docs

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* validation loop made more efficient with 2 dataloders

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* WIP test set generation

Signed-off-by: arendu <adithya.r@gmail.com>

* generate working for multi dataloaders

Signed-off-by: arendu <adithya.r@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update PTL version in requirements

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import and comment val_iterator_done

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable GPU unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Temporarily comment out CPU Unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove precision arg from Trainer in convert_hf_llama_to_nemo.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable NMT Training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix val_step, test_step func API MegatronLMEncoderDecoderModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Enable NMT training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Disable some unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment CI tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment resume part of BART

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Uncomment few lines from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return len of dataloader in microbatches

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix _link_checkpoint

1) Add inject_model_parallel_rank to _link_checkpoint
2) Override super._link_checkpoint to remove condition check for rank 0

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Check if using dist ckpt in _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable GPT with PP=2

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove batch_idx arg from validation_step megatron_gpt_sft_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL bug fix branch

Test unit tests with PTL bug fix https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable test_ema_saved_state in test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL with fs.lexists

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment _link_checkpoint related overrides

In order to test with PTL without symbolic links

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return only batch for dataloader_iter in DFT model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Modify get_batch in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition checks for batch extraction from dataloader_iter

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add missing condition check for batch extraction in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment test_invalid_checkpoints_removed_from_topk in test_exp_manager.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test invalid ckpts in test_exp_manager.py

Also uncomment some of the commented out tests in JenkinsFile and test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix bug in test_invalid_checkpoints_removed_from_topk

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix validation step of GPTModel for finetuning case with multi dataloaders

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test_step_outputs for SFT in GPTMOdel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Pass dataloader_idx for val_step of GPTModel and remove unwanted code

1) Pass dataloader_idx to val_step of GPTModel as its required for GPTSFTModel in case multi dataloaders to append the outputs correctly val/test_step_output
2) Remove val_iterator_done check from all megatron GPT models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extraction of batch in T5SFTModel & LMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extracting batch in MegatronNMTModel

Also uncomment GPT PP=2 and NMT tests from JenkinsFIle

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix typo and uncomment multimodel tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* default names

Signed-off-by: arendu <adithya.r@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update PTL version in requirements

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add the following changes for PTL 2.1

1) Remove LightningModuleWrapperBase around model as its not required with PTL 2.1
2) Make precision as None when using precision plugin in MegatronTrainerBuilder
3) Change dataloader_iter API for some megatron model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change dataloader_iter API and remove val_iterator_done

1) Change dataloader_iter API according to PTl 2.1 for bert and gpt model
2) Comment self._val_iterator_done for all megatron models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override format_checkpoint_nae and fix dataloader_iter API

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused import and comment val_iterator_done

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Override _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Temporarily comment out CPU Unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove precision arg from Trainer in convert_hf_llama_to_nemo.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix dataloader_iter API for megatron_lm_encoder_decoder_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable NMT Training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix val_step, test_step func API MegatronLMEncoderDecoderModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Enable NMT training TP=2 test

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Disable some unit tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment CI tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment resume part of BART

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Uncomment few lines from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return len of dataloader in microbatches

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix _link_checkpoint

1) Add inject_model_parallel_rank to _link_checkpoint
2) Override super._link_checkpoint to remove condition check for rank 0

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Check if using dist ckpt in _link_checkpoint

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove batch_idx arg from validation_step megatron_gpt_sft_model.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL bug fix branch

Test unit tests with PTL bug fix https://github.com/Lightning-AI/pytorch-lightning/pull/19344/files

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily disable test_ema_saved_state in test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Skip test_beam_decoding_preserve_alignments in test_rnnt_decoding.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Use PTL with fs.lexists

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment _link_checkpoint related overrides

In order to test with PTL without symbolic links

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Return only batch for dataloader_iter in DFT model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Modify get_batch in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition checks for batch extraction from dataloader_iter

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add missing condition check for batch extraction in GPTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for dataloader_iter extraction in MegatronLMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Comment test_invalid_checkpoints_removed_from_topk in test_exp_manager.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test invalid ckpts in test_exp_manager.py

Also uncomment some of the commented out tests in JenkinsFile and test_ema.py

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix bug in test_invalid_checkpoints_removed_from_topk

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix validation step of GPTModel for finetuning case with multi dataloaders

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix test_step_outputs for SFT in GPTMOdel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Pass dataloader_idx for val_step of GPTModel and remove unwanted code

1) Pass dataloader_idx to val_step of GPTModel as its required for GPTSFTModel in case multi dataloaders to append the outputs correctly val/test_step_output
2) Remove val_iterator_done check from all megatron GPT models

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extraction of batch in T5SFTModel & LMEncoderDecoder

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add condition check for extracting batch in MegatronNMTModel

Also uncomment GPT PP=2 and NMT tests from JenkinsFIle

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix typo and uncomment multimodel tests

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Change to new dataloader_iter API for MultiModal

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Fix new dataloader_api for MegatronLatenDiffusion Model

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Store and restore precision value in MegatronGPTSFTModel

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Temporarily comment Multimodal Stable Diffusion Train

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Update JenkinsFile for multimodal with latest main

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Upgrade PTL to version 2.2 in reqs

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Install PTL 2.2 from fork

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add strict arg to load_model_state_dict func in NLPDDPStrategy

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Delete megatron_t5_adapter_tuning.py, megatron_t5_ia3_tuning.py

These files were added in the branch by mistake

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Delete megatron_t5_prompt_learning.py that got added by mistake

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Add appropriate comments, code clean up

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* Remove PTL installation from JenkinsFile

Signed-off-by: Abhishree <abhishreetm@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <adithya.r@gmail.com>

* llm embeddings with ptl2.2

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* global in batch negatives using all gather

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove old files

Signed-off-by: arendu <adithya.r@gmail.com>

* remove changes in untouched files

Signed-off-by: arendu <adithya.r@gmail.com>

* inference for embedding model from ckpt

Signed-off-by: arendu <adithya.r@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <adithya.r@gmail.com>
Signed-off-by: Jiaqi Zeng <jiaqiz@nvidia.com>
Signed-off-by: Adi Renduchintala <adithyare@nvidia.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jiaqi Zeng <jiaqiz@nvidia.com>
Co-authored-by: Tugrul Konuk <ertkonuk@gmail.com>
Co-authored-by: Abhishree <abhishreetm@gmail.com>
Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Agoniii <815244047@qq.com>
* add mcore updates

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* update mcore version

Signed-off-by: dimapihtar <dpihtar@gmail.com>

---------

Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Signed-off-by: Agoniii <815244047@qq.com>
Signed-off-by: Agoniii <815244047@qq.com>