# Hero Notebook: TorchTitan Multi-Node Training with Monarch & Lightning SDK

This notebook demonstrates how to run TorchTitan training using Monarch for distributed multi-node training on Lightning AI infrastructure.

<div align="center">
  <img src="./assets/NB_Monarch_Lightning.svg" alt="Monarch Lightning Architecture" width="800"/>
</div>


## Table of Contents

This notebook provides a comprehensive guide to running distributed multi-node training using **Monarch** (Meta's distributed actor framework) with **TorchTitan** (PyTorch's large-scale LLM training library) on **Lightning AI** infrastructure. You'll learn how to set up, execute, debug, and manage distributed training workflows across multiple GPU nodes. 

Parts I and II are the core of this notebook and cover setup and training. Part III is for readers interested in Monarch's advanced features, such as interactive distributed debugging, environment variable management, and workspace code synchronization between the local node and remote nodes.

### What You'll Learn

**Part I: Environment Setup** *(Essential Prerequisites)*
- Install TorchTitan - Set up PyTorch and TorchTitan for LLM training
- Download Llama-3.1-8B Model Assets - Get model tokenizers from Hugging Face
- Install Monarch - Install Meta's distributed actor framework
- Setup Weights & Biases - Configure experiment tracking
- Update Lightning SDK - Get the latest Lightning SDK features
- Verify Installations - Confirm all dependencies are ready

**Part II: Multi-Node Training** *(Core Training Workflow)*
- Import Lightning SDK Components - Import required classes for multi-machine training
- Configure Training Job Parameters - Set up nodes, GPUs, and network settings
- Launch Multi-Node Training Job - Start distributed infrastructure on Lightning AI
- Set Up Process Mesh - Initialize Monarch's distributed computing mesh
- Define TorchTitan Trainer Actor - Create distributed training actor
- Run TorchTitan Training - Execute Llama 3-8B training across nodes

**Part III: Advanced Features** *(Distributed Development & Debugging)*

1. **Environment Variable Management**
   - Spawn Environment Variable Actor - Manage env vars across nodes
   - Get/Set Environment Variables - Inspect and modify remote environments
   - List Environment Variables - Query env vars by prefix

2. **Workspace Synchronization** *(Hot-Reload Code & Configs)*
   - Introduction to sync_workspace - Understanding workspace sync
   - File Checker Actor - Define an actor to verify file contents
   - Create Local Configuration - Set up training configs
   - Sync to Remote Nodes - Propagate changes to workers
   - Verify Synchronization - Confirm files are synced

3. **Interactive Debugging with Breakpoints**
   - Debugging Overview - Using pdb with distributed actors
   - Define Debug Trainer - Create actor with breakpoints
   - Spawn and Debug - Run interactive debugging session
   - Debugger Commands - Learn monarch debug CLI commands

**Part IV: Cleanup**
- Stop Process Mesh - Gracefully shutdown distributed resources

---

### Key Concepts

- **Monarch Actor**: Distributed computation unit that runs on remote nodes
- **Process Mesh (ProcMesh)**: Network of processes across multiple nodes for distributed computing
- **Endpoint**: Method decorator that makes actor methods callable remotely
- **Workspace Sync**: Synchronize local code/config changes to remote worker nodes without restart
- **Lightning MMT**: Multi-Machine Training orchestration on Lightning AI

### Prerequisites
- Lightning AI account with access to GPU machines (L40S recommended)
- Hugging Face account with Llama model access
- Basic understanding of distributed training concepts

---

# Part I: Environment Setup

Before running the notebook cells, ensure all dependencies are properly installed by following the steps below.

## Install TorchTitan

Clone the TorchTitan repository, install the nightly PyTorch build with CUDA 12.6 support, and install TorchTitan:

```bash
git clone https://github.com/pytorch/torchtitan.git
cd torchtitan
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu126 --force-reinstall
pip install .
```

## Download Llama-3.1-8B Model Assets

Download the Llama-3.1-8B tokenizer from Hugging Face. You'll need a Hugging Face token with access to the Llama models:

```bash
python scripts/download_hf_assets.py \
    --repo_id meta-llama/Llama-3.1-8B \
    --assets tokenizer \
    --hf_token=YOUR_HUGGINGFACE_TOKEN_KEY
```

Replace `YOUR_HUGGINGFACE_TOKEN_KEY` with your actual Hugging Face token.

## Install Monarch

Install Monarch from the GitHub repository following the Ubuntu installation instructions:

```bash
git clone https://github.com/meta-pytorch/monarch.git
cd monarch
# Follow the Ubuntu installation instructions from the repository
```

For detailed installation steps, visit: https://github.com/meta-pytorch/monarch

## Setup Weights & Biases

Check whether wandb is installed. If not, install it and log in:

```bash
pip install wandb
wandb login
```

Follow the prompts to authenticate with your wandb account.

## Update the Lightning SDK

The latest version of the Lightning SDK supports IP sharing between the client host and remote nodes. This notebook relies on that feature.

```bash
pip install -U lightning_sdk
```

## Verify Installations

After completing the installation steps above, verify that TorchTitan and Monarch are properly installed:

```python
# Verify TorchTitan installation
import torchtitan
print("TorchTitan is installed successfully")

# Verify Monarch installation
import monarch
print("Monarch is installed successfully")

# Verify PyTorch and CUDA
import torch
print(f"PyTorch version: {torch.__version__}")
```

If all imports succeed, you're ready to proceed with the training workflow below.
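
As a complementary check, the sketch below reports which packages are discoverable without actually importing them, so one missing package does not hide the status of the others. The helper name `check_packages` is hypothetical and not part of any of the libraries used here; it relies only on the standard library.

```python
import importlib.util


def check_packages(names):
    """Return {package_name: is_importable} without importing the packages."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Top-level package names used in this notebook.
status = check_packages(["torch", "torchtitan", "monarch", "wandb", "lightning_sdk"])
for name, found in status.items():
    print(f"{name:15s} {'found' if found else 'MISSING'}")
```

A package reported as found can still fail at import time (for example, because of a CUDA mismatch), so the import cell above remains the definitive test.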

---

# Part II: Multi-Node Training with Monarch and Lightning

Now that the environment is set up, we can proceed with configuring and launching the distributed training job.

## Import Lightning SDK Components

Import the necessary classes from Lightning SDK to manage multi-machine training jobs, including `Machine` for hardware specifications, `MMT` for multi-machine training orchestration, and `Studio` for workspace management.

In [None]:
from lightning_sdk import Machine, MMT, Studio

## Configure Training Job Parameters

Set up the configuration for the multi-node training job, including the number of nodes (2), GPUs per node (8), teamspace name, username, and port range for worker node communication.

In [None]:
# Configuration
import os
NUM_NODES = 2
NUM_GPUS = 8
TEAMSPACE = "general"  # Replace with your teamspace
USER = "meta-ai"  # Replace with your username
MMT_JOB_NAME = f"Monarch-v0-MMT-{NUM_NODES}-nodes"

# Remote allowed port range for worker nodes
REMOTE_ALLOWED_PORT_RANGE = "26601..26611"

# To force Monarch to use V0 for this Notebook (This will be removed in the future)
os.environ["MONARCH_V0_WORKAROUND_DO_NOT_USE"] = "1"

## Define MMT Job Launch Function

Create a function to launch a multi-machine training (MMT) job using Lightning SDK. This function installs the MMT plugin, configures the machine type (L40S GPUs), sets environment variables for CUDA devices and Monarch configurations, and returns the job handle and studio instance.

In [None]:
def launch_mmt_job(num_nodes=2, teamspace="my-teamspace", user="my-user"):
    """
    Launch a multi-machine training job using Lightning SDK's MMT API.

    `teamspace` and `user` are currently unused: `Studio()` is created without arguments.
    """

    studio = Studio()

    # Install the MMT plugin before running the actual job
    studio.install_plugin("multi-machine-training")

    print(f"Launching MMT job with {num_nodes} nodes...")

    # Machine with T4 GPUs
    # machine_type = getattr(Machine, f"T4_X_{NUM_GPUS}")

    # Machine with L4 GPUs
    # machine_type = getattr(Machine, f"L4_X_{NUM_GPUS}")

    # Machine with L40S GPUs
    machine_type = getattr(Machine, f"L40S_X_{NUM_GPUS}")

    job = MMT.run(
        command="process_allocator",
        name=f"Multi-Node-Monarch-Titan-Scale-{num_nodes}_nodes-port_override",
        machine=machine_type,
        studio=studio,
        num_machines=num_nodes,
        env={
            "CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",  # Make all GPUs visible. TODO: derive this from NUM_GPUS
            "MONARCH_FILE_LOG": "debug",
            "HYPERACTOR_REMOTE_ALLOC_ALLOWED_PORT_RANGE": REMOTE_ALLOWED_PORT_RANGE,
            "HYPERACTOR_REMOTE_ALLOC_BIND_TO_INADDR_ANY": "true",
            "WORKSPACE_DIR": "/home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages",
        },
    )

    print(f"Job started with ID: {job.name}")
    print(f"Job status: {job.status}")

    # Return the job handle and Studio so the caller can monitor or stop them
    return job, studio
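
The job environment above hardcodes the device list for eight GPUs and passes the port range as a `start..end` string. A minimal sketch of how both values could be derived and sanity-checked is shown below. The helper names `cuda_visible_devices` and `parse_port_range` are hypothetical, and the validation rules are assumptions rather than anything Monarch or Lightning enforces.

```python
def cuda_visible_devices(num_gpus: int) -> str:
    """Build a CUDA_VISIBLE_DEVICES value such as "0,1,2,3" for num_gpus devices."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return ",".join(str(i) for i in range(num_gpus))


def parse_port_range(spec: str) -> tuple[int, int]:
    """Split a "start..end" string into integer bounds and check they are ordered."""
    start_text, end_text = spec.split("..")
    start, end = int(start_text), int(end_text)
    if not (0 < start <= end <= 65535):
        raise ValueError(f"invalid port range: {spec!r}")
    return start, end


print(cuda_visible_devices(8))           # 0,1,2,3,4,5,6,7
print(parse_port_range("26601..26611"))  # (26601, 26611)
```

With a helper like this, the `env` entry could read `"CUDA_VISIBLE_DEVICES": cuda_visible_devices(NUM_GPUS)`, which keeps the device list in step with the machine type.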

## Launch the Multi-Node Training Job

Execute the `launch_mmt_job` function with the specified number of nodes, teamspace, and user credentials. This starts the distributed training infrastructure and provides commands for monitoring and stopping the job.

In [None]:
# Launch the job
job, studio = launch_mmt_job(
    num_nodes=NUM_NODES, teamspace=TEAMSPACE, user=USER
)

print(f"Job launched. You can monitor it using: job.status")
print(f"To stop the job: job.stop()")
print(f"To clean up: studio.stop()")

Launching MMT job with 2 nodes...


INFO - Multi-Machine Job was successfully launched. View it at https://lightning.ai/meta-ai/general/jobs/Multi-Node-Monarch-Titan-Scale-2_nodes-port_override-7rxo2?app_id=mmt


Job started with ID: Multi-Node-Monarch-Titan-Scale-2_nodes-port_override-7rxo2
Job status: Pending
Job launched. You can monitor it using: job.status
To stop the job: job.stop()
To clean up: studio.stop()
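
The job starts in the `Pending` state, so the process mesh cannot be set up until the machines are running. The sketch below polls a status callable until the job is ready. The target and failure status names are assumptions inferred from the output above, and `wait_for_status` is a hypothetical helper, not part of the Lightning SDK. A fake status sequence stands in for `job.status` so the example runs without a live job.

```python
import time


def wait_for_status(get_status, target="Running", failed=("Failed", "Stopped"),
                    timeout_s=1800.0, poll_s=10.0):
    """Poll get_status() until it returns target; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = str(get_status())
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"job ended with status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_s} seconds")
        time.sleep(poll_s)


# Stand-in for `lambda: job.status`, so the sketch runs without a live job.
fake_statuses = iter(["Pending", "Pending", "Running"])
print(wait_for_status(lambda: next(fake_statuses), poll_s=0.0))  # Running
```

In the notebook, the call would look like `wait_for_status(lambda: job.status)`, assuming the status converts to the plain names used here.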


## Set Up Process Mesh from Job

Initialize the Monarch process mesh using the launched Lightning job. This creates the distributed computing mesh that connects all nodes and GPUs for coordinated training.

In [None]:
from utils.mesh_utils import setup_proc_mesh_from_job

proc_mesh = setup_proc_mesh_from_job(job, NUM_NODES, NUM_GPUS)

# Example Hero - Run TorchTitan using Monarch for Llama 3 - 8B

## Generate Job Name Helper

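Define a utility function to generate a descriptive job name based on the username, number of hosts, and GPUs per host. This helps identify and track different training runs. Note that runs by the same user on the same topology share a name, and therefore the same output folder.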

In [None]:
import getpass
def get_job_name(num_hosts: int, num_gpus_per_host: int):
    return f"monarch-{getpass.getuser()}-hosts{num_hosts}-gpus{num_gpus_per_host}"
print(get_job_name(num_hosts=NUM_NODES, num_gpus_per_host=NUM_GPUS))

## Define TorchTitan Trainer Actor

Create the `TitanTrainerWrapper` class, a Monarch Actor that wraps TorchTitan's training functionality. This actor handles initialization, training execution, checkpointing, and cleanup of the distributed training process across all nodes.

In [None]:
import os
import sys
import logging
from monarch.actor import ProcMesh, Actor, endpoint, current_rank
import socket
from torchtitan.tools.logging import init_logger, logger
from torchtitan.train import Trainer
from typing import Optional
import torch
from torchtitan.config import JobConfig


class TitanTrainerWrapper(Actor):
    def __init__(self, job_config: JobConfig):
        self.rank = current_rank().rank
        self.job_config = job_config

    def _rprint(self, msg):
        """Helper method to print with rank information."""
        print(f"{self.rank=} {msg}")

    @endpoint
    def init(self):
        logging.getLogger().addHandler(logging.StreamHandler(sys.stderr))
        print(f"Initializing actor: {self.rank} {current_rank()=} {socket.gethostname()=}")


    @endpoint
    def train(self):
        logger.info("Starting training")
        config = self.job_config
        trainer: Optional[Trainer] = None

        try:
            trainer = Trainer(config)

            if config.checkpoint.create_seed_checkpoint:
                # A seed checkpoint is saved instead of training, so train()
                # must not run before this branch.
                assert (
                    int(os.environ["WORLD_SIZE"]) == 1
                ), "Must create seed checkpoint using a single device, to disable sharding."
                assert (
                    # config.checkpoint.enable_checkpoint
                    config.checkpoint.enable
                ), "Must enable checkpointing when creating a seed checkpoint."
                trainer.checkpointer.save(curr_step=0)
                logger.info("Created seed checkpoint")
            else:
                trainer.train()
        finally:
            if trainer:
                trainer.close()

            if torch.distributed.is_initialized():
                torch.distributed.destroy_process_group()
                logger.info("Process group destroyed.")
        print("Done training")

## Define Async Main Training Function

Set up the main asynchronous function that orchestrates the distributed training. This function configures the environment for distributed execution, spawns trainer actors across the process mesh, and initiates the training workflow. The function is defined as `async` because the endpoint calls it makes must be awaited. This lets operations across multiple machines be coordinated asynchronously rather than blocking the main thread.

In [None]:
from torchtitan.config import ConfigManager, JobConfig
from monarch.utils import setup_env_for_distributed
# Assumption: AddrType is importable from monarch.tools.network; adjust the
# path if your Monarch version exposes it elsewhere.
from monarch.tools.network import AddrType

async def async_main(job_config: JobConfig):
    torch.use_deterministic_algorithms(True)
    job_name = get_job_name(NUM_NODES, NUM_GPUS)

    # If use_ipaddr is not passed, MASTER_ADDR defaults to IPv6.
    await setup_env_for_distributed(proc_mesh, use_ipaddr=AddrType.IPv4)

    await proc_mesh.logging_option(stream_to_client=True, aggregate_window_sec=3)

    print(job_config)
    print(f"Spawning meshes on {job_name}")

    trainer_actor = proc_mesh.spawn("trainer_actor", TitanTrainerWrapper, job_config)

    await trainer_actor.init.call()
    await trainer_actor.train.call()
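
To illustrate why awaiting matters, the sketch below uses plain `asyncio` rather than Monarch. The coroutine `fake_endpoint_call` is a hypothetical stand-in for a remote endpoint call and only sleeps. It shows the general pattern: while one call is waiting, the event loop is free to make progress on the others.

```python
import asyncio


async def fake_endpoint_call(rank: int, delay_s: float) -> str:
    """Stand-in for an awaited remote call; sleeps instead of doing real work."""
    await asyncio.sleep(delay_s)
    return f"rank {rank} done"


async def main() -> list[str]:
    # gather() runs the coroutines concurrently and preserves argument order,
    # so the total wait is roughly the slowest call, not the sum of all calls.
    return await asyncio.gather(*(fake_endpoint_call(r, 0.01) for r in range(4)))


results = asyncio.run(main())
print(results)
```

Inside a Jupyter cell an event loop is already running, so there you would write `results = await main()` instead of `asyncio.run(main())`, which is the same reason the notebook calls `await async_main(config)` directly.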

## Initialize Logger and Run Training

Configure the TorchTitan logger and parse training arguments including model configuration file, tokenizer path, dataset location, number of training steps, and output directory. Then execute the asynchronous training pipeline.

In [None]:
init_logger()
config_manager = ConfigManager()

job_name = get_job_name(NUM_NODES, NUM_GPUS)

manual_args = [
        "--job.config_file",
        os.path.expanduser("/teamspace/studios/this_studio/torchtitan/torchtitan/models/llama3/train_configs/llama3_8b.toml"),
        "--model.tokenizer-path",
        "/teamspace/studios/this_studio/torchtitan/assets/hf/Llama-3.1-8B",
        "--training.steps",
        "25",
        "--training.dataset_path",
        "/teamspace/studios/this_studio/torchtitan/tests/assets/c4_test",
        "--job.dump_folder",
        "/teamspace/studios/this_studio/torchtitan/outputs/" + job_name,
        "--training.seq_len",
        "1024",
        # "8192",
    ]
config = config_manager.parse_args(manual_args)
await async_main(config)

**🎉🎉 Congratulations! 🎉🎉 You just ran interactive distributed training for a Llama-3 model in a notebook using Monarch actors on a Lightning setup!**

This already gives you a lot of flexibility: you can change the configuration and launch another training run without starting a new job or allocating a new set of nodes, and you get aggregated logs from all ranks through Monarch.

Part III digs into more advanced Monarch features. One is interactive distributed debugging while your training runs across multiple nodes and ranks. Another is `sync_workspace`, which lets you update packages, environments, and files and sync them to the remote nodes. Without Monarch, such changes typically require relaunching the job, which can take a long time.



--- 

# Part III: Advanced Features (Distributed Development & Debugging)

## Environment Variable Management with Remote Actors

Spawn an actor that can interact with environment variables on remote nodes. This is useful for debugging, configuration management, and runtime environment inspection across the distributed system.

In [None]:
from monarch.actor import Actor, endpoint, current_rank
import os
import socket

class EnvVarActor(Actor):
    """Actor for managing environment variables on remote nodes."""

    def __init__(self):
        self.rank = current_rank().rank
        self.hostname = socket.gethostname()

    @endpoint
    def get_env(self, var_name: str) -> dict:
        """Get an environment variable value from the remote node."""
        value = os.environ.get(var_name)
        return {
            "rank": self.rank,
            "hostname": self.hostname,
            "var_name": var_name,
            "value": value
        }

    @endpoint
    def set_env(self, var_name: str, var_value: str) -> dict:
        """Set an environment variable on the remote node."""
        os.environ[var_name] = var_value
        return {
            "rank": self.rank,
            "hostname": self.hostname,
            "var_name": var_name,
            "value": var_value,
            "status": "set"
        }

    @endpoint
    def list_env_vars(self, prefix: str = "") -> dict:
        """List all environment variables matching a prefix."""
        matching_vars = {k: v for k, v in os.environ.items() if k.startswith(prefix)}
        return {
            "rank": self.rank,
            "hostname": self.hostname,
            "matching_vars": matching_vars,
            "count": len(matching_vars)
        }

### Spawn the Environment Variable Actor

Spawn the `EnvVarActor` across all nodes in the process mesh. Each node will have an instance that can be used to inspect and modify its local environment.

In [None]:
# Spawn the environment variable actor across all nodes
env_actor = proc_mesh.spawn("env_actor", EnvVarActor)
print("EnvVarActor spawned across all nodes")

### Get Environment Variables from Remote Nodes

Query environment variables from all remote nodes. This example retrieves the `CUDA_VISIBLE_DEVICES` variable that was set during job initialization.

In [None]:
# Get an environment variable from all nodes
results = await env_actor.get_env.call("CUDA_VISIBLE_DEVICES")
print("\nCUDA_VISIBLE_DEVICES on all nodes:")
for result in results:
    print(f"  Rank {result['rank']} ({result['hostname']}): {result['value']}")

### Set Environment Variables on Remote Nodes

Set a custom environment variable on all remote nodes and verify it was set correctly.

In [None]:
# Set a custom environment variable on all nodes
set_results = await env_actor.set_env.call("CUSTOM_VAR", "test_value_123")
print("\nSetting CUSTOM_VAR on all nodes:")
for result in set_results:
    print(f"  Rank {result['rank']} ({result['hostname']}): {result['status']} - {result['value']}")

# Verify the variable was set by reading it back
verify_results = await env_actor.get_env.call("CUSTOM_VAR")
print("\nVerifying CUSTOM_VAR on all nodes:")
for result in verify_results:
    print(f"  Rank {result['rank']} ({result['hostname']}): {result['value']}")
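
When many ranks reply, it is often easier to see which ranks disagree than to read every line. The sketch below groups replies by value. The sample data is hypothetical and only mirrors the dictionary shape returned by `EnvVarActor.get_env`, and `group_ranks_by_value` is an illustrative helper rather than a Monarch API.

```python
from collections import defaultdict


def group_ranks_by_value(results):
    """Map each observed value to the sorted list of ranks that reported it."""
    groups = defaultdict(list)
    for item in results:
        groups[item["value"]].append(item["rank"])
    return {value: sorted(ranks) for value, ranks in groups.items()}


# Hypothetical replies shaped like the dicts returned by EnvVarActor.get_env.
sample = [
    {"rank": 0, "hostname": "node-a", "var_name": "CUSTOM_VAR", "value": "test_value_123"},
    {"rank": 1, "hostname": "node-a", "var_name": "CUSTOM_VAR", "value": "test_value_123"},
    {"rank": 8, "hostname": "node-b", "var_name": "CUSTOM_VAR", "value": None},
]
print(group_ranks_by_value(sample))  # {'test_value_123': [0, 1], None: [8]}
```

A single key in the result means every rank agrees. A `None` key points at ranks where the variable is not set.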

### List Environment Variables with Prefix

List all environment variables that match a specific prefix (e.g., all CUDA-related or MONARCH-related variables).

In [None]:
# List all environment variables starting with "CUDA"
list_results = await env_actor.list_env_vars.call("CUDA")
print("\nCUDA-related environment variables on all nodes:")
for result in list_results:
    print(f"\n  Rank {result['rank']} ({result['hostname']}) - {result['count']} variables:")
    for var_name, var_value in result['matching_vars'].items():
        print(f"    {var_name}={var_value}")

---

## Workspace Synchronization with `sync_workspace`

When working with distributed training, you often need to modify configuration files, training scripts, or other code locally and sync those changes to remote worker nodes without restarting the entire job. Monarch's `proc_mesh.sync_workspace()` enables this workflow.

### How it works:

1. **Make changes locally** - Edit files in your local workspace (e.g., configuration files, training scripts)
2. **Call `sync_workspace()`** - Synchronize changes to all remote worker nodes
3. **Continue execution** - The updated files are immediately available on all nodes

This is particularly useful for:
- Tweaking hyperparameters in configuration files
- Updating training schedules
- Modifying data processing logic
- Hot-reloading code changes without job restart

Let's see a practical example using TorchTitan training configurations.

### Define Actor to Check File Contents

First, create an actor that can read and verify file contents on remote nodes. This will help us confirm that files are properly synchronized across the cluster.

In [None]:
class FileCheckerActor(Actor):
    """Actor to read and verify file contents on remote nodes."""

    def __init__(self):
        self.rank = current_rank().rank
        self.hostname = socket.gethostname()

    @endpoint
    def read_file(self, file_path: str) -> dict:
        """Read a file and return its contents."""
        try:
            with open(file_path, 'r') as f:
                content = f.read()
            return {
                "rank": self.rank,
                "hostname": self.hostname,
                "file_path": file_path,
                "content": content,
                "exists": True,
                "size": len(content)
            }
        except FileNotFoundError:
            return {
                "rank": self.rank,
                "hostname": self.hostname,
                "file_path": file_path,
                "exists": False,
                "error": "File not found"
            }
        except Exception as e:
            return {
                "rank": self.rank,
                "hostname": self.hostname,
                "file_path": file_path,
                "exists": False,
                "error": str(e)
            }

    @endpoint
    def file_exists(self, file_path: str) -> dict:
        """Check if a file exists on the remote node."""
        exists = os.path.exists(file_path)
        return {
            "rank": self.rank,
            "hostname": self.hostname,
            "file_path": file_path,
            "exists": exists
        }

### Spawn File Checker Actor

Spawn the file checker actor across all nodes to verify file synchronization.

In [None]:
# Spawn the file checker actor
file_checker = proc_mesh.spawn("file_checker", FileCheckerActor)
print("FileCheckerActor spawned across all nodes")

### Create a Local Configuration File

Create a local training configuration file that we'll later modify and sync to worker nodes. This simulates a common workflow where you want to tweak hyperparameters or training settings.

In [None]:
# Create a local workspace directory for our custom config
local_workspace = "/teamspace/studios/this_studio/monarch_sync_example"
os.makedirs(local_workspace, exist_ok=True)

# Create a custom training configuration file
config_file_name = "custom_training_config.toml"
local_config_path = os.path.join(local_workspace, config_file_name)

# Write initial configuration
with open(local_config_path, 'w') as f:
    f.write("""# TorchTitan Custom Training Configuration
# This file demonstrates workspace synchronization

[training]
batch_size = 32
learning_rate = 0.001
max_steps = 100
warmup_steps = 10

[model]
model_type = "llama3_8b"
seq_len = 1024

[optimizer]
optimizer_type = "AdamW"
weight_decay = 0.01
""")

print(f"Created local config file: {local_config_path}")
with open(local_config_path, 'r') as f:
    print(f"\nInitial configuration:\n{f.read()}")

### Setup Workspace and Perform Initial Sync

Create a Monarch `Workspace` object and perform the initial synchronization to all remote worker nodes.

In [None]:
from monarch.tools.config.workspace import Workspace
from pathlib import Path

# Create a Workspace object pointing to our local directory
workspace = Workspace(dirs=[Path(local_workspace)])

print(f"Workspace configured: {workspace.dirs}")
print(f"\nSyncing workspace to remote nodes...")

# Perform initial sync
await proc_mesh.sync_workspace(workspace=workspace, conda=False, auto_reload=False)

print("Initial workspace sync completed!")

### Verify File on Remote Nodes

Check that the configuration file was successfully synced to all remote worker nodes by reading it from each node.

In [None]:
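# Construct the remote file path (files are synced under WORKSPACE_DIR on the workers).
# Note: WORKSPACE_DIR was set in the job's env, so it may be unset in this local
# process, in which case the "/workspace" fallback below is used.
remote_workspace_root = os.environ.get("WORKSPACE_DIR", "/workspace")
remote_config_path = os.path.join(remote_workspace_root, "monarch_sync_example", config_file_name)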

print(f"Checking file on remote nodes: {remote_config_path}\n")

# Check file existence on all nodes
exists_results = await file_checker.file_exists.call(remote_config_path)
for result in exists_results:
    status = "EXISTS" if result['exists'] else "NOT FOUND"
    print(f"  Rank {result['rank']} ({result['hostname']}): {status}")

# Read file content from rank 0 to verify
print(f"\nReading config from rank 0:")
read_results = await file_checker.read_file.call(remote_config_path)
if read_results[0]['exists']:
    print(f"\n{read_results[0]['content']}")
else:
    print(f"Error: {read_results[0].get('error', 'Unknown error')}")
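
Beyond checking that the file exists, you may want to confirm that every rank holds the same content as the local copy. The sketch below compares SHA-256 digests. The sample replies are hypothetical and only mirror the shape returned by `FileCheckerActor.read_file`, and `find_stale_ranks` is an illustrative helper, not part of Monarch.

```python
import hashlib


def sha256_text(text: str) -> str:
    """Hex SHA-256 digest of a string, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_ranks(local_text, remote_results):
    """Return ranks whose file is missing or differs from the local content."""
    expected = sha256_text(local_text)
    stale = []
    for item in remote_results:
        if not item.get("exists") or sha256_text(item["content"]) != expected:
            stale.append(item["rank"])
    return stale


local_text = "[training]\nbatch_size = 32\n"
remote_results = [  # hypothetical replies shaped like read_file output
    {"rank": 0, "exists": True, "content": "[training]\nbatch_size = 32\n"},
    {"rank": 1, "exists": True, "content": "[training]\nbatch_size = 64\n"},
    {"rank": 2, "exists": False, "error": "File not found"},
]
print(find_stale_ranks(local_text, remote_results))  # [1, 2]
```

Comparing digests rather than raw text also suggests a cheaper variant: an endpoint that returns only the digest would avoid sending full file contents back from every rank.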

---

## Debugging with Breakpoints in Monarch

Monarch supports interactive debugging of distributed actors using Python's built-in `pdb` debugger. You can set breakpoints in your actors, attach to specific ranks, and inspect their state during execution.

### How to Debug:

1. **Add breakpoints** to your actor endpoints using `breakpoint()`
2. **Run your training** as usual - execution will pause when breakpoints are hit
3. **Open a separate terminal** and run: `monarch debug`
4. **Use debugger commands**:
   - `list` - Show all active breakpoints across ranks
   - `attach <actor_name> <rank>` - Attach to a specific actor/rank for interactive debugging
   - `cast <actor_name> ranks(<ranks>) <pdb_command>` - Send pdb commands to multiple ranks
   - `continue` - Resume execution

Let's create a debugging example using a TorchTitan trainer with breakpoints.

### Define TitanTrainerDebug with Breakpoints

Create a TorchTitan trainer actor with breakpoints at key stages. This allows you to inspect the training state, configuration, and execution flow interactively.

In [None]:
class TitanTrainerDebug(Actor):
    """TorchTitan Trainer Actor with debugging breakpoints."""

    def __init__(self, job_config: JobConfig):
        self.rank = current_rank().rank
        self.job_config = job_config
        self.trainer: Optional[Trainer] = None

    def _rprint(self, msg):
        """Helper method to print with rank information."""
        print(f"{self.rank=} {msg}")

    @endpoint
    def init(self):
        logging.getLogger().addHandler(logging.StreamHandler(sys.stderr))
        self._rprint(f"Initializing debug actor: {current_rank()=} {socket.gethostname()=}")

        # Breakpoint 1: After initialization
        breakpoint()  # Debug: Inspect actor initialization state

    @endpoint
    def setup_trainer(self):
        """Setup the trainer with a breakpoint to inspect configuration."""
        logger.info(f"Setting up trainer on rank {self.rank}")
        config = self.job_config

        # Breakpoint 2: Before trainer creation
        if self.rank == 0:  # Only break on rank 0 for simplicity
            breakpoint()  # Debug: Inspect job config before trainer creation

        self.trainer = Trainer(config)
        self._rprint("Trainer setup complete")

    @endpoint
    def train_step(self, num_steps: int = 5):
        """Run a few training steps with breakpoints."""
        if not self.trainer:
            raise RuntimeError("Trainer not initialized. Call setup_trainer first.")

        logger.info(f"Starting training for {num_steps} steps on rank {self.rank}")

        # Breakpoint 3: Before training starts
        if self.rank == 0:
            breakpoint()  # Debug: Inspect trainer state before training

        # In a real scenario, you'd call trainer.train()
        # For debugging purposes, we'll just simulate a few steps
        for step in range(num_steps):
            if step == 2 and self.rank == 0:  # Break mid-training on rank 0
                breakpoint()  # Debug: Inspect mid-training state

            self._rprint(f"Processing step {step + 1}/{num_steps}")

        self._rprint(f"Completed {num_steps} training steps")

    @endpoint
    def cleanup(self):
        """Cleanup resources."""
        logger.info(f"Cleaning up trainer on rank {self.rank}")

        if self.trainer:
            self.trainer.close()

        if torch.distributed.is_initialized():
            torch.distributed.destroy_process_group()
            logger.info("Process group destroyed.")

        self._rprint("Cleanup complete")
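
The actor above hardcodes `self.rank == 0` as the condition for most breakpoints. One possible alternative, sketched here under stated assumptions, is to read the ranks to break on from an environment variable, which could then be changed at runtime with the `EnvVarActor` from earlier. The variable name `DEBUG_RANKS` and both helpers are hypothetical. The `start:stop` syntax in this helper follows Python slice semantics (end exclusive) and is not claimed to match the `monarch debug` CLI.

```python
import os


def parse_ranks(spec: str) -> set[int]:
    """Parse "0,3" or "0:4" (end exclusive, like Python slices) into a set of ranks."""
    ranks = set()
    for part in filter(None, (p.strip() for p in spec.split(","))):
        if ":" in part:
            start, stop = part.split(":")
            ranks.update(range(int(start), int(stop)))
        else:
            ranks.add(int(part))
    return ranks


def should_break(rank: int, env_var: str = "DEBUG_RANKS") -> bool:
    """True when this rank is listed in the (hypothetical) DEBUG_RANKS variable."""
    return rank in parse_ranks(os.environ.get(env_var, ""))


os.environ["DEBUG_RANKS"] = "0,8:10"
print([r for r in range(12) if should_break(r)])  # [0, 8, 9]
```

Inside an endpoint this would read `if should_break(self.rank): breakpoint()`. Separately, Python itself lets you disable every `breakpoint()` call by setting `PYTHONBREAKPOINT=0` in the process environment.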

### Spawn Debug Trainer Actor

Spawn the debug trainer actor across the process mesh. When you run the following cells, execution will pause at breakpoints, allowing you to debug interactively.

In [None]:
# Spawn the debug trainer actor
debug_trainer = proc_mesh.spawn("debug_trainer", TitanTrainerDebug, config)
print("Debug trainer actor spawned across all nodes")
print("When breakpoints are hit, run 'monarch debug' in a separate terminal")

### Run Debug Training Session

Execute the training endpoints. When breakpoints are hit:
1. Open a separate terminal
2. Run `monarch debug`
3. Use `list` to see all active breakpoints
4. Use `attach debug_trainer 0` to attach to rank 0
5. Use standard pdb commands (`n`, `s`, `p <var>`, `l`, etc.)
6. Use `continue` to resume execution

In [None]:
# Initialize actors (will hit first breakpoint)
await debug_trainer.init.call()

# Setup trainer (will hit second breakpoint on rank 0)
await debug_trainer.setup_trainer.call()

# Run training steps (will hit breakpoints during training)
await debug_trainer.train_step.call(num_steps=5)

# Cleanup
await debug_trainer.cleanup.call()

print("Debug training session completed")

### Example Debugger Commands

Once in the Monarch debugger, try these commands:

```bash
# List all active breakpoints
monarch_dbg> list

# Attach to rank 0 for interactive debugging
monarch_dbg> attach debug_trainer 0

# Standard pdb commands when attached:
(Pdb) n              # Next line
(Pdb) s              # Step into function
(Pdb) p self.rank    # Print variable
(Pdb) l              # List source code
(Pdb) c              # Continue execution

# Cast commands to multiple ranks (without attaching)
monarch_dbg> cast debug_trainer ranks(0,1) n
monarch_dbg> cast debug_trainer ranks(0:4) c

# Continue all breakpoints
monarch_dbg> continue
```

---

# Part IV: Cleanup

## Stop Process Mesh

Gracefully stop the Monarch process mesh, cleaning up all distributed resources and shutting down the actors across all nodes.

In [None]:
await proc_mesh.stop()