Support NeMo MegatronGPTModel #344

Open
athitten opened this issue May 1, 2024 · 0 comments
Labels: enhancement (New feature or request), nemo (Issues needed to support NVIDIA NeMo models)

athitten commented May 1, 2024

🚀 Feature

Support NeMo's LLM MegatronGPTModel

Initial examination:
Found 44 distinct operations, of which 33 (75%) are supported
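
For reference, a coverage report in this format can be generated with thunder's examine utility. The snippet below is only a sketch: it assumes thunder.examine.examine accepts a module plus example inputs, and it uses a tiny stand-in module rather than the actual MegatronGPTModel.

python3 - <<'PY'
import torch
from thunder.examine import examine

# Stand-in module for illustration only; for this issue the report was
# produced against NeMo's MegatronGPTModel.
model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)

# Prints a summary like:
#   Found N distinct operations, of which M (X%) are supported
examine(model, x)
PY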

Motivation

Pitch

Work items

Running the model

Required data

Download the data tarball and extract it into the root of your NeMo clone.
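
The sketch below shows the intended layout only; the tarball's actual URL/filename is not given in this issue, so <data-tarball> is a placeholder.

cd /path/to/NeMo        # root of your NeMo clone
tar -xf <data-tarball>  # creates the ./data/... paths referenced by the run command below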

NeMo installation

To keep the whole thunder team on the same NeMo revision, and to avoid a pile of "modify this file to call thunder.jit()" instructions, we temporarily maintain our own branch of NeMo. Clone https://github.com/tfogal/NeMo.git and make sure you have checked out the tfogal/thunder-nemo branch.

To install NeMo, run python3 -m pip install -e . from the root of the checked-out directory.
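
Putting the above together, the setup looks like:

git clone https://github.com/tfogal/NeMo.git
cd NeMo
git checkout tfogal/thunder-nemo
python3 -m pip install -e .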

Running the network

TMPDIR=./foo-mgpt-train
rm -fr ${TMPDIR}; mkdir -p ${TMPDIR}
HYDRA_FULL_ERROR=1 \
THUNDER_ANNOTATE_TRACES=1 \
NEMO_THUNDER_MEGATRON_GPT=1 \
python3 \
  examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
  trainer.devices=1 \
  trainer.num_nodes=1 \
  trainer.precision=32 \
  trainer.max_steps=4 \
  trainer.val_check_interval=4 \
  trainer.enable_checkpointing=False \
  +trainer.limit_val_batches=2 \
  +trainer.limit_test_batches=2 \
  exp_manager.checkpoint_callback_params.save_best_model=False \
  exp_manager.exp_dir=examples/nlp/language_modeling/gpt_sft_results \
  model.peft.peft_scheme=none \
  model.optim.name=distributed_fused_adam \
  model.restore_from_path=./data/nlp/megatron_gpt/starcoder-ci-nemo/megatron_starcoder_tp1_pp1.nemo \
  model.tensor_model_parallel_size=1 \
  model.pipeline_model_parallel_size=1 \
  model.data.train_ds.file_names=[./data/nlp/megatron_sft/quarel.jsonl] \
  model.data.train_ds.num_workers=0 \
  model.data.test_ds.file_names=[./data/nlp/megatron_sft/quarel.jsonl] \
  model.data.validation_ds.num_workers=0 \
  model.data.validation_ds.file_names=[./data/nlp/megatron_sft/quarel.jsonl] \
  model.data.test_ds.num_workers=0 \
  model.data.train_ds.concat_sampling_probabilities=[1.0]

The above command was pulled from NeMo's CI tests for this model and lightly modified to adjust paths for our directory layout and to set the appropriate environment variables.

cc @tfogal

athitten added the enhancement label on May 1, 2024
tfogal added the nemo label on May 2, 2024
tfogal added a commit to tfogal/NeMo that referenced this issue on Jul 10, 2024:
  "… to turn on thunder for the Megatron GPT network. see: Lightning-AI/lightning-thunder#344"