
add torch profiler plugin and call it in profiler scripts#14779

Merged
nv-mollys merged 5 commits into NVIDIA-NeMo:llmb-nemo-r2.5.0 from briancoutinho:bcoutinho/add_torch_profiler_plugin_llmb1
Sep 26, 2025

Conversation

@briancoutinho

@briancoutinho briancoutinho commented Sep 22, 2025

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Adds a Torch profiler NeMo Run plugin and updates the Torch profiler callback. The performance scripts are updated to give ready access to the torch profiler.

Changelog

  1. Update Torch profiler callbacks so we can optionally enable chakra execution trace.
  2. Add a Torch profiler NeMo Run plugin (see nemo/lightning/run/plugins.py).
  3. Update performance scripts to include torch profiler plugin (see nemo/scripts/performance/...)

Usage

We added a new NeMo Run plugin that enables PyTorch profiling.
The plugin can be added like so:

plugins = []
...
plugins.append(PyTorchProfilerPlugin(
    start_step=start_iter,
    end_step=end_iter,
    output_path=log_dir,  # a subdir torch_profiles will be created here
    profiler_kwargs={
        "with_stack": os.environ.get('TORCH_PROFILER_WITH_STACK', '0') == '1',
    },
))
...
with run.Experiment("llama3_8b_nsys_profiling") as exp:
    exp.add(
        recipe,
        executor=executor,
        plugins=plugins,
    )
    exp.run()
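The snippet above gates `with_stack` on an environment variable. As a reusable generalization of that pattern (this helper and the `TORCH_PROFILER_RECORD_SHAPES` variable are hypothetical, not part of NeMo), one could write:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag.

    Returns True for '1'/'true'/'yes' (case-insensitive), False for
    anything else that is set; `default` applies only when unset.
    """
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes")

# Build profiler kwargs from the environment (variable names are examples).
profiler_kwargs = {
    "with_stack": env_flag("TORCH_PROFILER_WITH_STACK"),
    "record_shapes": env_flag("TORCH_PROFILER_RECORD_SHAPES"),
}
```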

In the NeMo performance scripts (scripts/performance), you can use the following helper function:

    if torch_profiler_plugin := build_torch_profiler_plugin(args):
        plugins.append(torch_profiler_plugin)
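The walrus-operator pattern above works because the helper returns a plugin when profiling was requested and None otherwise. A minimal sketch of that shape (the attribute names and the `TorchProfilerSettings` stand-in are hypothetical; the real `build_torch_profiler_plugin` defines its own flags):

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Optional

@dataclass
class TorchProfilerSettings:
    """Stand-in for the real plugin's configuration (illustrative only)."""
    start_step: int
    end_step: int
    output_path: str
    profiler_kwargs: dict = field(default_factory=dict)

def build_profiler_settings(args) -> Optional[TorchProfilerSettings]:
    """Return settings when profiling was requested on the CLI, else None."""
    if not getattr(args, "enable_torch_profiler", False):
        return None
    return TorchProfilerSettings(
        start_step=args.profiler_start_step,
        end_step=args.profiler_end_step,
        output_path=args.log_dir,
    )

# Demo with namespaces standing in for parsed CLI args.
enabled = build_profiler_settings(SimpleNamespace(
    enable_torch_profiler=True, profiler_start_step=36,
    profiler_end_step=40, log_dir="/tmp/logs"))
disabled = build_profiler_settings(SimpleNamespace(enable_torch_profiler=False))
```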

Sample output

In the logs you should see the profiling configuration:

[default0]:[NeMo I 2025-09-25 15:22:14 nemo_logging:393] PyTorch profiling initialized:
[default0]:     - Start Step: 36
[default0]:     - End Step: 40
[default0]:     - Warmup Steps: 2
[default0]:     - Active Steps: 4
[default0]:     - Trace Directory: /home/bcoutinho/nemo_experiments/torch_and_nsys
[default0]:     - Collect Execution Trace: False
[default0]:     - Extra profiler kwargs: {'with_stack': False}
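The log above reports warmup 2 and active 4 for a start step of 36 and an end step of 40. One plausible mapping of that step window onto `torch.profiler.schedule`-style counts is sketched below (an assumption about the arithmetic, not the plugin's actual code):

```python
def schedule_counts(start_step: int, end_step: int, warmup_steps: int = 2) -> dict:
    """Map a [start_step, end_step) profiling window onto
    torch.profiler.schedule-style counts.

    wait   -- steps skipped before warmup begins
    warmup -- steps run under the profiler but discarded
    active -- steps whose events are kept in the trace
    """
    active = end_step - start_step
    wait = max(start_step - warmup_steps, 0)
    return {"wait": wait, "warmup": warmup_steps, "active": active}

counts = schedule_counts(36, 40)  # consistent with the sample log above
```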

After the configured iterations complete, you will see the trace being dumped:

[default0]:Training epoch 0, iteration 38/49 | lr: 2.335e-05 | global_batch_size: 32 | global_step: 38 | reduced_train_loss: 10.88 | train_step_timing in s: 3.106 | consumed_samples: 1248
[default0]:[NeMo I 2025-09-25 15:24:58 nemo_logging:393] Kineto trace saved: /home/bcoutinho/nemo_experiments/torch_and_nsys/torch_profiler/rank-0.json.gz
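The dumped Kineto trace (`rank-0.json.gz` above) is gzipped JSON, so it can be inspected with the standard library alone. The sketch below synthesizes a tiny trace so it is self-contained; real traces carry many more fields and events:

```python
import gzip
import json
import os
import tempfile

# Synthesize a minimal Kineto-like trace (real traces are much larger).
trace = {"traceEvents": [
    {"name": "aten::mm", "ph": "X", "ts": 100, "dur": 250},
    {"name": "aten::relu", "ph": "X", "ts": 360, "dur": 40},
]}
path = os.path.join(tempfile.mkdtemp(), "rank-0.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(trace, f)

# Load it back and summarize op durations, as one might for a real trace.
with gzip.open(path, "rt", encoding="utf-8") as f:
    events = json.load(f)["traceEvents"]
total_us = sum(e["dur"] for e in events)
```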

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • [N/A] Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Tested using the simple Nemotron training script below.

def configure_recipe(nodes: int = 1, gpus_per_node: int = 2, add_torch_profiler = False):
    recipe = llm.nemotron3_4b.pretrain_recipe(
        dir="/checkpoints/nemotron", # Path to store checkpoints
        name="nemotron_pretraining",
        tensor_parallelism=2,
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        max_steps=50, # Setting a small value for the quickstart
    )
    recipe.trainer.val_check_interval = 10000
    return recipe
    
def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0",
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }
    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor
    
def run_pretraining(args):
    recipe = configure_recipe()
    executor = local_executor_torchrun(nodes=recipe.trainer.num_nodes, devices=recipe.trainer.devices)

    plugins = []
    if torch_profiler_plugin := build_torch_profiler_plugin(args):
        plugins.append(torch_profiler_plugin)
        
    with run.Experiment("llama3_8b_nsys_profiling") as exp:
        exp.add(
            recipe,
            executor=executor,
            plugins=plugins,
        )
        exp.run()

@briancoutinho briancoutinho force-pushed the bcoutinho/add_torch_profiler_plugin_llmb1 branch from 2f61e18 to 835b7ec Compare September 22, 2025 19:47
@briancoutinho briancoutinho marked this pull request as ready for review September 25, 2025 22:56
@briancoutinho briancoutinho force-pushed the bcoutinho/add_torch_profiler_plugin_llmb1 branch 3 times, most recently from 55226fb to e2782d2 Compare September 26, 2025 00:18
Brian Coutinho and others added 4 commits September 25, 2025 17:21
Signed-off-by: Brian Coutinho <bcoutinho@nvidia.com>
Signed-off-by: Brian Coutinho <bcoutinho@nvidia.com>
Signed-off-by: Brian Coutinho <bcoutinho@nvidia.com>
Signed-off-by: Brian Coutinho <bcoutinho@nvidia.com>
@briancoutinho briancoutinho force-pushed the bcoutinho/add_torch_profiler_plugin_llmb1 branch from e2782d2 to 209bff4 Compare September 26, 2025 00:21
Signed-off-by: briancoutinho <briancoutinho@users.noreply.github.com>
@nv-mollys nv-mollys merged commit 51c87a7 into NVIDIA-NeMo:llmb-nemo-r2.5.0 Sep 26, 2025
4 checks passed
briancoutinho added a commit to briancoutinho/NeMo that referenced this pull request Jan 14, 2026
…o#14779)

Signed-off-by: Brian Coutinho <bcoutinho@nvidia.com>