Conversation

@srivatsankrishnan srivatsankrishnan commented Mar 26, 2025

Summary

This PR introduces SFT/LoRA support in CloudAI Nemo2.0. The main feature for performance tracking is a null tokenizer for benchmarking tasks. The entry point script (cloudai_nemorun.py) is modified to support both the null tokenizer and the Hugging Face tokenizer (hf_tokenizer). It also defines a new environment variable, CLOUDAI_NEMO_TASK, which can take the value pretrain or finetune and is set from the TOML file.
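As a rough sketch of the intended control flow (the function names and the --tokenizer flag below are illustrative placeholders, not the actual cloudai_nemorun.py implementation or NeMo API), the entry point could branch on the new environment variable like this:

```python
import argparse
import os


def run_pretrain(tokenizer: str) -> None:
    # Stand-in for launching the NeMo pretraining recipe.
    print(f"pretrain with {tokenizer} tokenizer")


def run_finetune(tokenizer: str) -> None:
    # Stand-in for launching the SFT/LoRA fine-tuning recipe.
    print(f"finetune with {tokenizer} tokenizer")


def main() -> None:
    parser = argparse.ArgumentParser(description="CloudAI Nemo2.0 entry point (sketch)")
    # Hypothetical flag: pick the benchmarking-only "null" tokenizer or a real
    # Hugging Face tokenizer.
    parser.add_argument("--tokenizer", choices=["null", "hf"], default="null")
    args = parser.parse_args()

    # CLOUDAI_NEMO_TASK is the environment variable this PR introduces; it is
    # populated from cmd_args.task in the test TOML.
    task = os.environ.get("CLOUDAI_NEMO_TASK", "pretrain")

    if task == "finetune":
        run_finetune(args.tokenizer)
    else:
        run_pretrain(args.tokenizer)


if __name__ == "__main__":
    main()
```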

We hijack the cmd_args.task variable to set the new environment variable, which then controls whether the pretrain or finetune path is taken.
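For illustration, a test TOML could select the fine-tuning path as shown below; apart from cmd_args.task, the field names are hypothetical and may not match the real CloudAI test schema:

```toml
# Hypothetical test definition; only cmd_args.task is described in this PR,
# the remaining fields are illustrative.
name = "nemo_run_llama3_8b_sft"

[cmd_args]
task = "finetune"   # exported as CLOUDAI_NEMO_TASK; "pretrain" is the other value
```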

This is a duplicate of an earlier PR: I accidentally deleted some tests and recommitted, and the copyright-year unit test wasn't letting me use the original dates.

Dependencies

  • The Nemo Docker image may not include the required changes until this Nemo-Run PR is merged. Contact me for a local container build if you want to test this.
  • The mount paths in the test files have been obfuscated. Update them before running (a placeholder example is sketched after this list).
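A hypothetical example of the kind of value to update is shown below; the key name and host path are placeholders, not the actual CloudAI schema or the obfuscated values from this PR:

```toml
# Illustrative only: update the host-side path to match your cluster.
container_mounts = ["/host/path/to/datasets:/workspace/datasets"]
```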

Test Plan

CI/CD
Dry Run

$ cloudai dry-run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml
[INFO] System Name: coreweave
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b_lora
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b_lora

Section Name: nemo_run_llama3_8b_sft
  Test Name: nemo_run_llama3_8b_sft
  Description: nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: nemo_run_llama3_8b_sft
[INFO] Running test: nemo_run_llama3_8b_sft
[INFO] Submitted slurm job: 0
[INFO] Job completed: nemo_run_llama3_8b_sft

Real System

$ cloudai run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/nemo_run_llama3_8b.toml
[INFO] System Name: xxxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b_lora
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b_lora

Section Name: nemo_run_llama3_8b_sft
  Test Name: nemo_run_llama3_8b_sft
  Description: nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test: nemo_run_llama3_8b_sft
[INFO] Running test: nemo_run_llama3_8b_sft
[INFO] Submitted slurm job: 2052541
[INFO] Job completed: nemo_run_llama3_8b_sft
[INFO] All test scenario results stored at: results/nemo_run_llama3_8b_lora_2025-03-20_02-02-15
[INFO] Generated scenario report at results/nemo_run_llama3_8b_lora_2025-03-20_02-02-15/nemo_run_llama3_8b_lora.html
[INFO] All test scenario execution attempts are complete. Please review the 'debug.log' file to confirm successful completion or to identify any issues.

Results (all numbers redacted). Please ping me on Slack for logs.

Training epoch , iteration / | lr: e- | global_batch_size:  | global_step:  | peak_memory_usage:  | memory_allocated:  | reduced_train_loss: . | train_step_timing in s: . | consumed_samples:  | val_loss: .
Training epoch , iteration / | lr: e- | global_batch_size:  | global_step:  | peak_memory_usage:  | memory_allocated:  | reduced_train_loss: . | train_step_timing in s: . | consumed_samples:  | val_loss: .
... (remaining iterations follow the same format, all values redacted)

Additional Notes

Includes the changes from PR 419, but without the copyright-year mess I created due to the accidental deletion of some test files.

@srivatsankrishnan changed the title from "Nemo Lora 2" to "Nemo2.0 Lora" on Mar 26, 2025
@srivatsankrishnan changed the title from "Nemo2.0 Lora" to "Nemo2.0 Lora + Null tokenizer" on Mar 26, 2025
@srivatsankrishnan marked this pull request as ready for review on March 26, 2025 01:09
@TaekyungHeo previously approved these changes on Mar 26, 2025
@TaekyungHeo added the enhancement (New feature or request) label on Mar 26, 2025
@amaslenn previously approved these changes on Mar 26, 2025
@srivatsankrishnan merged commit 1eb323d into NVIDIA:main on Mar 26, 2025
2 checks passed