Conversation

@srivatsankrishnan srivatsankrishnan commented Mar 26, 2025

Summary

This PR introduces SFT/LoRA support in CloudAI Nemo2.0. The main feature for performance tracking is a null tokenizer for benchmarking tasks. The entry point script (cloudai_nemorun.py) is modified to support both the null tokenizer and the Hugging Face tokenizer (hf_tokenizer). It also defines a new environment variable, CLOUDAI_NEMO_TASK, which can take the value pretrain or finetune and is set from the TOML file.
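As a rough sketch of the intended control flow (the function names and the --tokenizer flag below are illustrative placeholders, not the actual cloudai_nemorun.py implementation or NeMo API), the entry point could branch on the new environment variable like this:

```python
import argparse
import os


def run_pretrain(tokenizer: str) -> None:
    # Stand-in for launching the NeMo pretraining recipe.
    print(f"pretrain with {tokenizer} tokenizer")


def run_finetune(tokenizer: str) -> None:
    # Stand-in for launching the SFT/LoRA fine-tuning recipe.
    print(f"finetune with {tokenizer} tokenizer")


def main() -> None:
    parser = argparse.ArgumentParser(description="CloudAI Nemo2.0 entry point (sketch)")
    # Hypothetical flag: pick the benchmarking-only "null" tokenizer or a real
    # Hugging Face tokenizer.
    parser.add_argument("--tokenizer", choices=["null", "hf"], default="null")
    args = parser.parse_args()

    # CLOUDAI_NEMO_TASK is the environment variable this PR introduces; it is
    # populated from cmd_args.task in the test TOML.
    task = os.environ.get("CLOUDAI_NEMO_TASK", "pretrain")

    if task == "finetune":
        run_finetune(args.tokenizer)
    else:
        run_pretrain(args.tokenizer)


if __name__ == "__main__":
    main()
```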

We hijack the cmd_args.task variable to set the new environment variable, which then controls whether the pretrain or finetune path is taken.
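For illustration, a test TOML could select the fine-tuning path as shown below; apart from cmd_args.task, the field names are hypothetical and may not match the real CloudAI test schema:

```toml
# Hypothetical test definition; only cmd_args.task is described in this PR,
# the remaining fields are illustrative.
name = "nemo_run_llama3_8b_sft"

[cmd_args]
task = "finetune"   # exported as CLOUDAI_NEMO_TASK; "pretrain" is the other value
```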

This is a duplicate of an earlier PR: I accidentally deleted some tests and recommitted, and the copyright-year unit test wasn't letting me use the original dates.

Dependencies

  • The Nemo Docker image may not include the required changes until this Nemo-Run PR is merged. Contact me for a local container build if you want to test this.
  • The mount paths in the test files have been obfuscated. Update them before running (a placeholder example is sketched after this list).
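A hypothetical example of the kind of value to update is shown below; the key name and host path are placeholders, not the actual CloudAI schema or the obfuscated values from this PR:

```toml
# Illustrative only: update the host-side path to match your cluster.
container_mounts = ["/host/path/to/datasets:/workspace/datasets"]
```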

Test Plan

CI/CD
Dry Run

$ cloudai dry-run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml
[INFO] System Name: coreweave
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b_lora
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b_lora

Section Name: nemo_run_llama3_8b_sft
  Test Name: nemo_run_llama3_8b_sft
  Description: nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: nemo_run_llama3_8b_sft
[INFO] Running test: nemo_run_llama3_8b_sft
[INFO] Submitted slurm job: 0
[INFO] Job completed: nemo_run_llama3_8b_sft

Real System

$ cloudai run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml
[INFO] System Name: xxxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b_lora
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b_lora

Section Name: nemo_run_llama3_8b_sft
  Test Name: nemo_run_llama3_8b_sft
  Description: nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: nemo_run_llama3_8b_sft
[INFO] Running test: nemo_run_llama3_8b_sft
[INFO] Submitted slurm job: 2052541
[INFO] Job completed: nemo_run_llama3_8b_sft
[INFO] All test scenario results stored at: results/nemo_run_llama3_8b_lora_2025-03-20_02-02-15
[INFO] Generated scenario report at results/nemo_run_llama3_8b_lora_2025-03-20_02-02-15/nemo_run_llama3_8b_lora.html
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.

Results (all numbers redacted). Please ping me on Slack for logs.

Training epoch , iteration / | lr: e- | global_batch_size:  | global_step:  | peak_memory_usage:  | memory_allocated:  | reduced_train_loss: . | train_step_timing in s: . | consumed_samples:  | val_loss: .
Training epoch , iteration / | lr: e- | global_batch_size:  | global_step:  | peak_memory_usage:  | memory_allocated:  | reduced_train_loss: . | train_step_timing in s: . | consumed_samples:  | val_loss: .
... (remaining iterations follow the same format, all values redacted)

Additional Notes

Includes the changes from PR 419, but without the copyright-year mess I created due to the accidental deletion of some test files.

@srivatsankrishnan changed the title from "Nemo Lora 2" to "Nemo2.0 Lora" on Mar 26, 2025
@srivatsankrishnan changed the title from "Nemo2.0 Lora" to "Nemo2.0 Lora + Null tokenizer" on Mar 26, 2025
@srivatsankrishnan marked this pull request as ready for review on March 26, 2025 01:09
@TaekyungHeo previously approved these changes on Mar 26, 2025
@TaekyungHeo added the enhancement (New feature or request) label on Mar 26, 2025
@amaslenn previously approved these changes on Mar 26, 2025
@srivatsankrishnan merged commit 1eb323d into NVIDIA:main on Mar 26, 2025
2 checks passed