Support NeMo MegatronGPTModel #344

Open
athitten opened this issue May 1, 2024 · 0 comments
Labels: enhancement (New feature or request), nemo (Issues needed to support NVIDIA NeMo models)

athitten commented May 1, 2024

🚀 Feature

Support NeMo's LLM MegatronGPTModel

Initial examination:
Found 44 distinct operations, of which 33 (75%) are supported
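
For reference, a coverage report in this format can be generated with thunder's examine utility. The snippet below is only a sketch: it assumes thunder.examine.examine accepts a module plus example inputs, and it uses a tiny stand-in module rather than the actual MegatronGPTModel.

python3 - <<'PY'
import torch
from thunder.examine import examine

# Stand-in module for illustration only; for this issue the report was
# produced against NeMo's MegatronGPTModel.
model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)

# Prints a summary like:
#   Found N distinct operations, of which M (X%) are supported
examine(model, x)
PY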

Motivation

Pitch

Work items

Running the model

Required data

Download the data tarball and extract it into the root of your NeMo clone.
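
The sketch below shows the intended layout only; the tarball's actual URL/filename is not given in this issue, so <data-tarball> is a placeholder.

cd /path/to/NeMo        # root of your NeMo clone
tar -xf <data-tarball>  # creates the ./data/... paths referenced by the run command below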

NeMo installation

To keep the whole thunder team on the same NeMo revision, and to avoid a pile of "modify this file to call thunder.jit()" instructions, we temporarily maintain our own branch of NeMo. Clone https://github.com/tfogal/NeMo.git and make sure you have checked out the tfogal/thunder-nemo branch.

To install NeMo, run python3 -m pip install -e . from the root of the checked-out directory.
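
Putting the above together, the setup looks like:

git clone https://github.com/tfogal/NeMo.git
cd NeMo
git checkout tfogal/thunder-nemo
python3 -m pip install -e .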

Running the network

TMPDIR=./foo-mgpt-train
rm -fr ${TMPDIR}; mkdir -p ${TMPDIR}
HYDRA_FULL_ERROR=1 \
THUNDER_ANNOTATE_TRACES=1 \
NEMO_THUNDER_MEGATRON_GPT=1 \
python3 \
  examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
  trainer.devices=1 \
  trainer.num_nodes=1 \
  trainer.precision=32 \
  trainer.max_steps=4 \
  trainer.val_check_interval=4 \
  trainer.enable_checkpointing=False \
  +trainer.limit_val_batches=2 \
  +trainer.limit_test_batches=2 \
  exp_manager.checkpoint_callback_params.save_best_model=False \
  exp_manager.exp_dir=examples/nlp/language_modeling/gpt_sft_results \
  model.peft.peft_scheme=none \
  model.optim.name=distributed_fused_adam \
  model.restore_from_path=./data/nlp/megatron_gpt/starcoder-ci-nemo/megatron_starcoder_tp1_pp1.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=1 \
  model.data.train_ds.file_names=[./data/nlp/megatron_sft/quarel.jsonl] \
  model.data.train_ds.num_workers=0 \
  model.data.test_ds.file_names=[./data/nlp/megatron_sft/quarel.jsonl] \
  model.data.validation_ds.num_workers=0 \
  model.data.validation_ds.file_names=[./data/nlp/megatron_sft/quarel.jsonl] \
  model.data.test_ds.num_workers=0 \
  model.data.train_ds.concat_sampling_probabilities=[1.0]

The above command was pulled from NeMo's CI tests for this model and lightly modified to adjust paths for our directory layout and to set the appropriate environment variables.

cc @tfogal

athitten added the enhancement label on May 1, 2024
tfogal added the nemo label on May 2, 2024
tfogal added a commit to tfogal/NeMo that referenced this issue on Jul 10, 2024:
  "… to turn on thunder for the Megatron GPT network. see: Lightning-AI/lightning-thunder#344"