
add torch profiler plugin and call it in profiler scripts#14779

Merged
nv-mollys merged 5 commits into NVIDIA-NeMo:llmb-nemo-r2.5.0 from briancoutinho:bcoutinho/add_torch_profiler_plugin_llmb1
Sep 26, 2025

Conversation

@briancoutinho

@briancoutinho briancoutinho commented Sep 22, 2025

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Adds a Torch profiler NeMo Run plugin and updates the Torch profiler callback. The performance scripts are updated to give ready access to the torch profiler.

Changelog

  1. Update Torch profiler callbacks so we can optionally enable chakra execution trace.
  2. Add a Torch profiler NeMo Run plugin (see nemo/lightning/run/plugins.py).
  3. Update performance scripts to include torch profiler plugin (see nemo/scripts/performance/...)

Usage

We added a new NeMo Run plugin that enables PyTorch profiling.
The plugin can be added like so:

plugins = []
...
plugins.append(PyTorchProfilerPlugin(
    start_step=start_iter,
    end_step=end_iter,
    output_path=log_dir,  # a subdir torch_profiles will be created here
    profiler_kwargs={
        "with_stack": os.environ.get('TORCH_PROFILER_WITH_STACK', '0') == '1',
    },
))
...
with run.Experiment("llama3_8b_nsys_profiling") as exp:
    exp.add(
        recipe,
        executor=executor,
        plugins=plugins,
    )
    exp.run()
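The snippet above gates `with_stack` on an environment variable. As a reusable generalization of that pattern (this helper and the `TORCH_PROFILER_RECORD_SHAPES` variable are hypothetical, not part of NeMo), one could write:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag.

    Returns True for '1'/'true'/'yes' (case-insensitive), False for
    anything else that is set; `default` applies only when unset.
    """
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")

# Build profiler kwargs from the environment (variable names are examples).
profiler_kwargs = {
    "with_stack": env_flag("TORCH_PROFILER_WITH_STACK"),
    "record_shapes": env_flag("TORCH_PROFILER_RECORD_SHAPES"),
}
```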

In the NeMo performance scripts (scripts/performance), you can use the following helper function:

    if torch_profiler_plugin := build_torch_profiler_plugin(args):
        plugins.append(torch_profiler_plugin)
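The walrus-operator pattern above works because the helper returns a plugin when profiling was requested and None otherwise. A minimal sketch of that shape (the attribute names and the `TorchProfilerSettings` stand-in are hypothetical; the real `build_torch_profiler_plugin` defines its own flags):

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Optional

@dataclass
class TorchProfilerSettings:
    """Stand-in for the real plugin's configuration (illustrative only)."""
    start_step: int
    end_step: int
    output_path: str
    profiler_kwargs: dict = field(default_factory=dict)

def build_profiler_settings(args) -> Optional[TorchProfilerSettings]:
    """Return settings when profiling was requested on the CLI, else None."""
    if not getattr(args, "enable_torch_profiler", False):
        return None
    return TorchProfilerSettings(
        start_step=args.profiler_start_step,
        end_step=args.profiler_end_step,
        output_path=args.log_dir,
    )

# Demo with namespaces standing in for parsed CLI args.
enabled = build_profiler_settings(SimpleNamespace(
    enable_torch_profiler=True, profiler_start_step=36,
    profiler_end_step=40, log_dir="/tmp/logs"))
disabled = build_profiler_settings(SimpleNamespace(enable_torch_profiler=False))
```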

Sample output

In the logs you should see the profiling configuration:

[default0]:[NeMo I 2025-09-25 15:22:14 nemo_logging:393] PyTorch profiling initialized:
[default0]:     - Start Step: 36
[default0]:     - End Step: 40
[default0]:     - Warmup Steps: 2
[default0]:     - Active Steps: 4
[default0]:     - Trace Directory: /home/bcoutinho/nemo_experiments/torch_and_nsys
[default0]:     - Collect Execution Trace: False
[default0]:     - Extra profiler kwargs: {'with_stack': False}
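The log above reports warmup 2 and active 4 for a start step of 36 and an end step of 40. One plausible mapping of that step window onto `torch.profiler.schedule`-style counts is sketched below (an assumption about the arithmetic, not the plugin's actual code):

```python
def schedule_counts(start_step: int, end_step: int, warmup_steps: int = 2) -> dict:
    """Map a [start_step, end_step) profiling window onto
    torch.profiler.schedule-style counts.

    wait   -- steps skipped before warmup begins
    warmup -- steps run under the profiler but discarded
    active -- steps whose events are kept in the trace
    """
    active = end_step - start_step
    wait = max(start_step - warmup_steps, 0)
    return {"wait": wait, "warmup": warmup_steps, "active": active}

counts = schedule_counts(36, 40)  # consistent with the sample log above
```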

After the configured iterations complete, you will see the trace being dumped:

[default0]:Training epoch 0, iteration 38/49 | lr: 2.335e-05 | global_batch_size: 32 | global_step: 38 | reduced_train_loss: 10.88 | train_step_timing in s: 3.106 | consumed_samples: 1248
[default0]:[NeMo I 2025-09-25 15:24:58 nemo_logging:393] Kineto trace saved: /home/bcoutinho/nemo_experiments/torch_and_nsys/torch_profiler/rank-0.json.gz
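The dumped Kineto trace (`rank-0.json.gz` above) is gzipped JSON, so it can be inspected with the standard library alone. The sketch below synthesizes a tiny trace so it is self-contained; real traces carry many more fields and events:

```python
import gzip
import json
import os
import tempfile

# Synthesize a minimal Kineto-like trace (real traces are much larger).
trace = {"traceEvents": [
    {"name": "aten::mm", "ph": "X", "ts": 100, "dur": 250},
    {"name": "aten::relu", "ph": "X", "ts": 360, "dur": 40},
]}
path = os.path.join(tempfile.mkdtemp(), "rank-0.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(trace, f)

# Load it back and summarize op durations, as one might for a real trace.
with gzip.open(path, "rt", encoding="utf-8") as f:
    events = json.load(f)["traceEvents"]
total_us = sum(e["dur"] for e in events)
```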

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • [N/A] Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Tested using the simple Nemotron training script below.

def configure_recipe(nodes: int = 1, gpus_per_node: int = 2, add_torch_profiler = False):
    recipe = llm.nemotron3_4b.pretrain_recipe(
        dir="/checkpoints/nemotron", # Path to store checkpoints
        name="nemotron_pretraining",
        tensor_parallelism=2,
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        max_steps=50, # Setting a small value for the quickstart
    )
    recipe.trainer.val_check_interval = 10000
    return recipe
    
def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }
    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor
    
def run_pretraining(args):
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    plugins = []
    if torch_profiler_plugin := build_torch_profiler_plugin(args):
        plugins.append(torch_profiler_plugin)
        
    with run.Experiment("llama3_8b_nsys_profiling") as exp:
        exp.add(
            recipe,
            executor=executor,
            plugins=plugins,
        )
        exp.run()

@briancoutinho briancoutinho force-pushed the bcoutinho/add_torch_profiler_plugin_llmb1 branch from 2f61e18 to 835b7ec Compare September 22, 2025 19:47
@briancoutinho briancoutinho marked this pull request as ready for review September 25, 2025 22:56
@briancoutinho briancoutinho force-pushed the bcoutinho/add_torch_profiler_plugin_llmb1 branch 3 times, most recently from 55226fb to e2782d2 Compare September 26, 2025 00:18
Brian Coutinho and others added 4 commits September 25, 2025 17:21
Signed-off-by: Brian Coutinho <bcoutinho@nvidia.com>
Signed-off-by: Brian Coutinho <bcoutinho@nvidia.com>
Signed-off-by: Brian Coutinho <bcoutinho@nvidia.com>
Signed-off-by: Brian Coutinho <bcoutinho@nvidia.com>
@briancoutinho briancoutinho force-pushed the bcoutinho/add_torch_profiler_plugin_llmb1 branch from e2782d2 to 209bff4 Compare September 26, 2025 00:21
Signed-off-by: briancoutinho <briancoutinho@users.noreply.github.com>
@nv-mollys nv-mollys merged commit 51c87a7 into NVIDIA-NeMo:llmb-nemo-r2.5.0 Sep 26, 2025
4 checks passed
briancoutinho added a commit to briancoutinho/NeMo that referenced this pull request Jan 14, 2026
…o#14779)

Signed-off-by: Brian Coutinho <bcoutinho@nvidia.com>