# Resilient LLM Training with NeMo Framework

This notebook demonstrates how to use NeMo's resiliency features for robust LLM training. It covers:

1. **Crash Recovery**: Using in-job restart capabilities to automatically recover from failures during training
2. **Straggler Detection**: Identifying and handling slow/stuck processes in distributed training
3. **Checkpointing**: Implementing asynchronous checkpointing for efficient model saving

The demo uses a small LLaMA model and simulated crashes to showcase these features in action. We'll walk through:
- Setting up a local executor with fault tolerance enabled
- Configuring the straggler detection callbacks
- Launching distributed training with resiliency features
- Monitoring training progress and recovery from failures
- Analyzing logs and checkpoints

This demonstrates how NeMo makes LLM training more robust and production-ready by handling common failure modes automatically.


In [1]:
# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

# Required Libraries
import argparse
import math
import os
from functools import partial
from typing import Any
import torch

import nemo_run as run
from lightning.pytorch.callbacks import Callback

from nemo.collections import llm
from nemo.collections.llm.recipes.callbacks.common import straggler_det_callback
from nemo.lightning.run import plugins

from crash_simulator import CrashSimulationCallback

print("Required libraries loaded.")



Required libraries loaded.


### Define the executor

Define and initialize a local executor, which is used to manage distributed computing tasks. The executor encapsulates configurations for launching jobs (e.g. number of devices, environment variables, task distribution).

The executor uses the [ft launcher](https://github.com/NVIDIA/NeMo-Run/blob/main/docs/source/guides/execution.md#launchers) to enable fault tolerance capabilities.

In [2]:
def local_executor(devices: int = 8) -> run.LocalExecutor:
    """
    Factory method for creating a LocalExecutor instance. 
    This sets up environment variables and configures the number of devices.

    Args:
        devices (int): Number of devices to be used per node.

    Returns:
        run.LocalExecutor: Configured local executor object.
    """
    env_vars = {
        "TRANSFORMERS_OFFLINE": "1",   # Run Transformer models offline
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",  # Optimize PyTorch NCCL
        "NCCL_NVLS_ENABLE": "0",      # Experimental NCCL environment variable
        "NVTE_DP_AMAX_REDUCE_INTERVAL": "0", 
        "NVTE_ASYNC_AMAX_REDUCTION": "1",
    }
    # Create LocalExecutor with the `ft` launcher
    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="ft", env_vars=env_vars)
    return executor

# Initialize the executor based on the arguments
executor = local_executor(devices=8)

print("Executor setup complete.")

Executor setup complete.


### Model setup
Load and configure a Llama pretrain recipe. We choose a small 54M-parameter Llama 3-based model for faster execution. This model is obtained by reducing the sequence length, number of layers, hidden size, feed-forward hidden size, and number of attention heads from the original Llama 3 8B model configuration as defined in the [Llama3Config8B class](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/model/llama.py).

In [3]:
# Create a small LLAMA3 model configuration
def small_llama_cfg() -> llm.GPTConfig:
    """Small 54M parameter model"""
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=128,
        num_layers=4,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )


### Modify the training recipe
`pretrain` is the recipe object returned by calling a partial of `llm.llama31_8b.pretrain_recipe` with the experiment name and checkpoint directory. It is set up to use `num_nodes=1` and `num_gpus_per_node=8`, which can be changed by modifying the `num_nodes` and `num_gpus_per_node` arguments. This demo uses the Llama 3 8B pretrain recipe as defined in the `llama31_8b.pretrain_recipe` [module](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/llama3_8b.py). The recipe defaults to a mock dataset, [MockDataModule](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/gpt/data/mock.py); refer to the [Llama3_8b recipe](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/llm/recipes/llama3_8b.py) for instructions on how to use a custom dataset. Since we are using a mock dataset, we set `max_steps` to 20 so the experiment runs in a reasonable time.

We also disable validation sanity checks to reduce startup time, and set tensor model parallel size to 2 and context parallel size to 1.

In [4]:
# Experiment name
exp_name = "resiliency-in-pretraining-demo"

# Preliminary setup for the LLAMA pretrain recipe
pretrain = partial(llm.llama31_8b.pretrain_recipe, num_nodes=1, num_gpus_per_node=8)(
    name=exp_name, dir="/tmp/nemo_run/checkpoints"
)
pretrain.model = run.Config(llm.LlamaModel, small_llama_cfg())
pretrain.trainer.strategy.tensor_model_parallel_size = 2
pretrain.trainer.strategy.context_parallel_size = 1
pretrain.trainer.num_sanity_val_steps = 0
pretrain.broadcast(max_steps=20)
pretrain.trainer.limit_val_batches = 2
pretrain.trainer.log_every_n_steps = 1
pretrain.trainer.val_check_interval = 10
print("Model recipe setup complete.")

Model recipe setup complete.


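### Straggler detection callback and runtime plugins

The next cell attaches a straggler detection callback to the trainer, configured with `straggler_report_time_interval=1`, and collects the `PreemptionPlugin` and `FaultTolerancePlugin` runtime plugins, which are passed to the experiment when it is launched.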
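
The general idea behind straggler detection can be illustrated with a toy sketch. This is not how NeMo's callback is implemented; it only shows one plausible way to compare ranks. The function name `find_stragglers`, the `threshold` parameter, the scoring rule, and the timing numbers are all hypothetical.

```python
from statistics import median


def find_stragglers(step_times_by_rank: dict[int, list[float]], threshold: float = 0.7) -> list[int]:
    """Return ranks whose relative performance score falls below `threshold`.

    Hypothetical scoring rule: score = median_avg_step_time / rank_avg_step_time,
    so a rank at the median speed scores 1.0 and a rank twice as slow scores 0.5.
    """
    avg_times = {rank: sum(times) / len(times) for rank, times in step_times_by_rank.items()}
    reference = median(avg_times.values())
    scores = {rank: reference / avg for rank, avg in avg_times.items()}
    return sorted(rank for rank, score in scores.items() if score < threshold)


# Made-up per-rank step times in seconds; rank 3 is deliberately slow.
timings = {
    0: [1.00, 1.02, 0.98],
    1: [1.01, 0.99, 1.00],
    2: [0.97, 1.03, 1.00],
    3: [2.10, 1.90, 2.00],
}
stragglers = find_stragglers(timings)
print(stragglers)
```

Scoring against the median rather than the fastest rank keeps a single unusually fast rank from making every other rank look slow.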

In [5]:
# Automatically detect and mitigate stragglers during training
pretrain.trainer.callbacks.append(straggler_det_callback(straggler_report_time_interval=1))

# Add runtime plugins (e.g., preemption and fault tolerance)
run_plugins = [plugins.PreemptionPlugin()]
run_plugins.append(plugins.FaultTolerancePlugin())


In [6]:
# TODO: Add info on what these env variables are for
# Setup ENV
os.environ["FAULT_TOL_CFG_PATH"] = "/tmp/sample_job_ft_cfg.yml"
os.environ["FAULT_TOL_FINISHED_FLAG_FILE"] = "/tmp/sample_job_finished_flag"

### Running the Experiment
Run the entire pretraining experiment. Depending on the arguments passed:
- If `dryrun` is True, it performs a dry run (to validate configurations).
- Otherwise, it launches the actual training run locally.

In [7]:
def run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False):
    """
    Run the pretraining experiment either as a dry run or actual training.
    
    Args:
        exp_name: Name of the experiment
        pretrain: Pretrain configuration object
        executor: Executor to run the experiment
        run_plugins: List of runtime plugins
        dryrun: Boolean flag to perform a dry run
    """
    with run.Experiment(f"{exp_name}") as exp:
        # Add the pretrain job to the experiment
        exp.add(
            pretrain,
            executor=executor,
            name=exp_name,
            plugins=run_plugins,
            tail_logs=True,
        )

        # Execute the experiment based on the dryrun flag
        if dryrun:
            print("Performing dry run ...")
            exp.dryrun()
        else:
            print("Launching training run ...")
            exp.run(sequential=True, detach=True)
            print("Experiment executed successfully.")

In [8]:
%%bash
# delete old checkpoints
rm -rf /tmp/nemo_run/checkpoints/

In [9]:
# run the experiment
run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)

Launching training run ...


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741037640/resiliency-in-pretraining-demo
Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-vl040cxsvlrxrd
AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741037640/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-vl040cxsvlrxrd
    


Experiment executed successfully.


Waiting for job resiliency-in-pretraining-demo-vl040cxsvlrxrd to finish [log=True]...


ining-demo/0 *****************************************
ining-demo/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
ining-demo/0 *****************************************
ining-demo/0 [2025-03-03 21:34:02,055] [INFO] [ft_launcher@ec00c0d0158b] [default] starting workers for entrypoint: python
ining-demo/0 [2025-03-03 21:34:02,055] [INFO] [ft_launcher@ec00c0d0158b] [default] Rendezvous'ing worker group
ining-demo/0 [2025-03-03 21:34:02,297] [INFO] [ft_launcher@ec00c0d0158b] [default] Rendezvous complete for workers. Result:
ining-demo/0   restart_count=0
ining-demo/0   master_addr=ec00c0d0158b
ining-demo/0   master_port=54883
ining-demo/0   group_rank=0
ining-demo/0   group_world_size=1
ining-demo/0   local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
ining-demo/0   role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
ining-demo/0   global_ranks=[0, 1, 2, 3,

Job resiliency-in-pretraining-demo-vl040cxsvlrxrd finished: SUCCEEDED


In [10]:
%%bash
# delete old checkpoints
rm -rf /tmp/nemo_run/checkpoints/

### Demonstrate in-job restart with a crash simulator
We use the `CrashSimulationCallback` to simulate a crash during training. This callback is configured to crash the process at step 17 if a crash has not already occurred.

Expected workflow:
- Start training: Trainer Step counter = 0
- After 10 trainer steps: Trainer Step counter = 10 -> save checkpoint
- After 17 trainer steps: Trainer Step counter = 17 -> crash simulated, set `has_simulated_crash_happened` to `True`
- Automatic in-job restart from checkpoint at step 10: Trainer Step counter = 10
- After 17 trainer steps: Trainer Step counter = 17 -> no crash simulated as `has_simulated_crash_happened == True`
- After 20 trainer steps: Trainer Step counter = 20 -> successfully completes training
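
The workflow above can be traced with a small, self-contained simulation. This is a toy model, not the `CrashSimulationCallback` or the `ft` launcher: the function `train_with_restart` and its parameters are hypothetical, and a plain loop stands in for both the trainer and the process that restarts it.

```python
def train_with_restart(max_steps=20, ckpt_every=10, crash_step=17):
    """Toy model of the workflow: checkpoint periodically, crash once, resume."""
    last_checkpoint = 0   # step stored in the most recent checkpoint
    # Stand-in for `has_simulated_crash_happened`. In a real in-job restart the
    # worker processes are relaunched, so this flag must persist outside the
    # crashed process; a local variable in the supervising loop plays that role here.
    has_crashed = False
    restarts = 0
    events = []

    while True:
        step = last_checkpoint  # (re)start from the latest checkpoint
        try:
            while step < max_steps:
                step += 1
                if step == crash_step and not has_crashed:
                    has_crashed = True
                    raise RuntimeError(f"simulated crash at step {step}")
                if step % ckpt_every == 0:
                    last_checkpoint = step
                    events.append(f"checkpoint@{step}")
            events.append(f"finished@{step}")
            return events, restarts
        except RuntimeError as err:
            events.append(str(err))
            restarts += 1
            events.append(f"restart from step {last_checkpoint}")


events, restarts = train_with_restart()
for event in events:
    print(event)
print("restarts:", restarts)
```

Steps 11 through 16 are executed twice in this trace, which is the work lost between the last checkpoint and the crash.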

In [11]:
# Enable a crash simulation callback
pretrain.trainer.callbacks.append(run.Config(CrashSimulationCallback, crash_step=17))

# run the experiment
run_experiment(exp_name, pretrain, executor, run_plugins, dryrun=False)

Launching training run ...


Log directory is: /root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741037843/resiliency-in-pretraining-demo
Launched app: local_persistent://nemo_run/resiliency-in-pretraining-demo-th70tx1qsh43kc
AppStatus:
    State: RUNNING
    Num Restarts: 0
    Roles: 
    Msg: <NONE>
    Structured Error Msg: <NONE>
    UI URL: file:///root/.nemo_run/experiments/resiliency-in-pretraining-demo/resiliency-in-pretraining-demo_1741037843/resiliency-in-pretraining-demo/nemo_run/resiliency-in-pretraining-demo-th70tx1qsh43kc
    


Experiment executed successfully.


Waiting for job resiliency-in-pretraining-demo-th70tx1qsh43kc to finish [log=True]...


ining-demo/0 *****************************************
ining-demo/0 Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
ining-demo/0 *****************************************
ining-demo/0 [2025-03-03 21:37:24,817] [INFO] [ft_launcher@ec00c0d0158b] [default] starting workers for entrypoint: python
ining-demo/0 [2025-03-03 21:37:24,817] [INFO] [ft_launcher@ec00c0d0158b] [default] Rendezvous'ing worker group
ining-demo/0 [2025-03-03 21:37:25,045] [INFO] [ft_launcher@ec00c0d0158b] [default] Rendezvous complete for workers. Result:
ining-demo/0   restart_count=0
ining-demo/0   master_addr=ec00c0d0158b
ining-demo/0   master_port=50919
ining-demo/0   group_rank=0
ining-demo/0   group_world_size=1
ining-demo/0   local_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
ining-demo/0   role_ranks=[0, 1, 2, 3, 4, 5, 6, 7]
ining-demo/0   global_ranks=[0, 1, 2, 3,

Job resiliency-in-pretraining-demo-th70tx1qsh43kc finished: SUCCEEDED


### Asynchronous Checkpointing
Checkpointing is important for recovering from failures, but traditional checkpointing has drawbacks:

1. Training pauses while saving checkpoints
2. To minimize these pauses, checkpoints are usually only saved once per epoch
3. If training fails between checkpoints, work must be redone from the last checkpoint

For example, with:
- 500 steps per epoch
- 10 seconds per step
- 3 epochs total

Best case (no failures):
- Training time = 15,000 seconds (500 steps × 10 seconds × 3 epochs)

Worst case (failure at step 999, just before the second end-of-epoch checkpoint):
- Must redo nearly one full epoch
- Training time ≈ 20,000 seconds (nearly 5,000 seconds wasted)
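
The figures above can be reproduced with a short calculation. The helper below is a hypothetical sketch: it assumes a checkpoint is written every `ckpt_interval` steps, that the worst case is a failure on the last step before the next checkpoint, and that restart overhead is ignored. The 50-step interval is an arbitrary comparison point.

```python
def worst_case_lost_seconds(ckpt_interval: int, seconds_per_step: float) -> float:
    """Work lost if training fails one step before the next checkpoint."""
    return (ckpt_interval - 1) * seconds_per_step


steps_per_epoch, seconds_per_step, epochs = 500, 10, 3
best_case = steps_per_epoch * seconds_per_step * epochs

# Checkpoint once per epoch vs. every 50 steps.
lost_per_epoch_ckpt = worst_case_lost_seconds(steps_per_epoch, seconds_per_step)
lost_frequent_ckpt = worst_case_lost_seconds(50, seconds_per_step)

print(f"best case: {best_case} s")
print(f"worst-case loss, checkpoint every {steps_per_epoch} steps: {lost_per_epoch_ckpt} s")
print(f"worst-case loss, checkpoint every 50 steps: {lost_frequent_ckpt} s")
```

The lost work scales with the checkpoint interval, which is why cheaper checkpoints translate directly into less repeated training after a failure.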

Asynchronous checkpointing solves these problems by:
- Saving checkpoints without pausing training
- Using fast distributed checkpointing via Megatron-Core
- Allowing frequent checkpoints with minimal overhead

This means you can checkpoint often to minimize lost work, without slowing down training.
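
As a conceptual sketch only (this is not how Megatron-Core implements it), the pattern can be illustrated with the standard library: the training loop takes a quick in-memory snapshot of its state and hands the slow write to a background thread, so later steps proceed while the file is written. The `save_async` helper, the JSON format, and the toy `state` dictionary are assumptions made for illustration.

```python
import copy
import json
import tempfile
import threading
from pathlib import Path


def save_async(state: dict, path: Path) -> threading.Thread:
    """Snapshot `state` now, then write it to `path` in a background thread."""
    snapshot = copy.deepcopy(state)  # blocking part: freeze the current values

    def _write() -> None:
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(snapshot))
        tmp.replace(path)  # rename last so readers never see a partial file

    thread = threading.Thread(target=_write)
    thread.start()
    return thread


ckpt_dir = Path(tempfile.mkdtemp())
state = {"step": 0, "weights": [0.0, 0.0]}
pending = []

for step in range(1, 21):
    state["step"] = step
    state["weights"] = [w + 0.1 for w in state["weights"]]  # stand-in for a training step
    if step % 10 == 0:
        pending.append(save_async(state, ckpt_dir / f"step_{step}.json"))

for thread in pending:
    thread.join()  # make sure every write has finished before exiting

saved_steps = sorted(json.loads(p.read_text())["step"] for p in ckpt_dir.glob("*.json"))
print(saved_steps)
```

Taking the snapshot before returning matters: without the copy, the background thread could serialize state that the training loop has already modified.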

For more details, see:
- [Megatron-Core distributed checkpointing](https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/dist_checkpointing.html)
- [NeMo documentation](https://github.com/NVIDIA/NeMo/blob/main/docs/source/checkpoints/dist_ckpt.rst)

Note: NeMo enables asynchronous and parallel checkpointing by default through `MegatronStrategy`'s `ckpt_async_save` and `ckpt_parallel_save` options, so users get these benefits without any additional configuration.
