# Studio 3: Interactive Debugging for Distributed Training

Welcome to Studio 3! In this notebook, you'll master **interactive debugging** techniques for distributed systems using Monarch.

> **Note:** This notebook uses **CPU machines** since debugging techniques don't require GPU resources.

## The Challenge

Debugging distributed training is notoriously difficult:
- Issues may only appear on specific ranks or nodes
- Traditional debuggers don't work across multiple processes
- Environment differences between nodes are hard to inspect
- Logs from 128+ processes are overwhelming

## Monarch's Solution

Monarch provides powerful debugging capabilities:
1. **Interactive breakpoints** - Use `pdb` with distributed actors
2. **Selective debugging** - Attach to specific ranks
3. **Environment inspection** - Query env vars across all nodes
4. **Monarch debug CLI** - Unified interface for distributed debugging

## What You'll Learn

### Environment Variable Management
- Inspect environment variables across nodes
- Set and modify env vars remotely
- Query variables by prefix (CUDA, NCCL, etc.)

### Interactive Debugging with Breakpoints
- Add breakpoints to actor methods
- Use `monarch debug` CLI
- Attach to specific ranks for interactive debugging
- Send debugger commands to multiple ranks

## Prerequisites

**Recommended:** Complete [Studio 1: Getting Started](./studio_1_getting_started.ipynb) and [Studio 2: Workspace Sync](./studio_2_workspace_sync.ipynb) first!

You should have:
- Basic understanding of Monarch actors and endpoints

**New to Monarch?** Start with [Studio 0: Monarch Basics](./studio_0_monarch_basics.ipynb) to learn about Actors, Endpoints, and Process Meshes!

## Lightning Studios Series

This is **Studio 3** of the series:

- **[Studio 0: Monarch Basics](./studio_0_monarch_basics.ipynb)** - Learn Monarch fundamentals
- **[Studio 1: Getting Started](./studio_1_getting_started.ipynb)** - Multi-node training (GPU)
- **[Studio 2: Workspace Sync](./studio_2_workspace_sync.ipynb)** - Hot-reload configs (CPU)
- **Studio 3: Interactive Debugging** - Debug distributed systems (YOU ARE HERE - CPU)

Let's dive in!

---

# Retrieve the Job or Set Up (If Starting Fresh)

If you're continuing from Studio 1 or 2, the cells below retrieve the existing job; otherwise they create a new one.

In [None]:
import os
import sys
import logging
# Need to set before importing monarch
os.environ["MONARCH_FILE_LOG"] = "debug"
os.environ["HYPERACTOR_MESH_ENABLE_LOG_FORWARDING"] = "true"
os.environ["HYPERACTOR_MESH_ENABLE_FILE_CAPTURE"] = "true"
os.environ["HYPERACTOR_MESH_TAIL_LOG_LINES"] = "100"

import socket
import time

from lightning_sdk import Status
from utils import get_host_ip_addr, bootstrap_addr
from monarch.actor import Actor, enable_transport, endpoint, current_rank
from monarch._src.actor.bootstrap import attach_to_workers

# Configuration - Using CPU machines for debugging demo
NUM_NODES = 2
NUM_CPUS = 4  # CPU_X_4 machines have 4 CPUs
USE_CPU = True  # Use CPU machines instead of GPU
port = 26600

# Enable client transport
host_ip_addr = get_host_ip_addr(addr_type="public")
enable_transport(f"tcp://{host_ip_addr}:{port}@tcp://0.0.0.0:{port}")
print(f"Client transport enabled at {host_ip_addr}:{port}")

In [None]:
from mmt_utils import launch_mmt_job

MMT_JOB_NAME = f"Monarch-v0.2.0-CPU-MMT-{NUM_NODES}-nodes"

# Launch or retrieve the job with CPU machines
job, studio = launch_mmt_job(
    num_nodes=NUM_NODES,
    mmt_job_name=MMT_JOB_NAME,
    port=port,
    use_cpu=USE_CPU,  # Use CPU machines
)

print("Job launched. You can monitor it using: job.status")
print("To stop the job: job.stop()")
print("To clean up: studio.stop()")

In [None]:
if job.status == Status('Running'):
    # Get worker IP addresses from the job
    ip_addresses_list_public = [machine.public_ip for machine in job.machines]
    print(f"Worker IPs: {ip_addresses_list_public}")

    # Create worker addresses
    worker_addrs = [f"tcp://{ip}:{port}@tcp://0.0.0.0:{port}" for ip in ip_addresses_list_public]
    print(f"Worker addresses: {worker_addrs}")

    # Attach to workers and create process mesh
    host_mesh = attach_to_workers(
        name="host_mesh", ca="trust_all_connections", workers=worker_addrs
    )

    # Use cpus instead of gpus for CPU machines
    proc_mesh = host_mesh.spawn_procs(per_host={"cpus": NUM_CPUS})
    await proc_mesh.logging_option(stream_to_client=True, aggregate_window_sec=3)

    print(f"\nProcess mesh initialized successfully!")
    print(f"Using {NUM_NODES} CPU nodes with {NUM_CPUS} CPUs each")
else:
    raise RuntimeError(
        f"Job status is {job.status}; however the status should be {Status('Running')} to initiate the mesh"
    )

---

# Part 1: Environment Variable Management

Let's start by creating an actor that can inspect and manage environment variables across all nodes.

## Define Environment Variable Actor

This actor provides methods to get, set, and list environment variables on remote nodes.

In [None]:
class EnvVarActor(Actor):
    """Actor for managing environment variables on remote nodes."""

    def __init__(self):
        self.rank = current_rank().rank
        self.hostname = socket.gethostname()

    @endpoint
    def get_env(self, var_name: str) -> dict:
        """Get an environment variable value from the remote node."""
        value = os.environ.get(var_name)
        return {
            "rank": self.rank,
            "hostname": self.hostname,
            "var_name": var_name,
            "value": value
        }

    @endpoint
    def set_env(self, var_name: str, var_value: str) -> dict:
        """Set an environment variable on the remote node."""
        os.environ[var_name] = var_value
        return {
            "rank": self.rank,
            "hostname": self.hostname,
            "var_name": var_name,
            "value": var_value,
            "status": "set"
        }

    @endpoint
    def list_env_vars(self, prefix: str = "") -> dict:
        """List all environment variables matching a prefix."""
        matching_vars = {k: v for k, v in os.environ.items() if k.startswith(prefix)}
        return {
            "rank": self.rank,
            "hostname": self.hostname,
            "matching_vars": matching_vars,
            "count": len(matching_vars)
        }

## Spawn Environment Variable Actor

Spawn the actor on every process in the mesh. With `NUM_NODES = 2` and `NUM_CPUS = 4`, that is 8 ranks in total.

In [None]:
# Spawn the environment variable actor across all nodes
env_actor = proc_mesh.spawn("env_actor", EnvVarActor)
print("EnvVarActor spawned across all nodes")

## Query Environment Variables

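Let's inspect CUDA-related environment variables across all nodes. On the CPU machines used here, `CUDA_VISIBLE_DEVICES` is typically unset, so expect a value of `None`.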

In [None]:
# Get CUDA_VISIBLE_DEVICES from all nodes
results = await env_actor.get_env.call("CUDA_VISIBLE_DEVICES")

print("\nCUDA_VISIBLE_DEVICES on all nodes:")
print(f"{'-'*70}")

# Show unique values by node
seen_nodes = set()
for result in results:
    # Results may arrive as (point, value) pairs; unwrap to the value dict
    info = result[1] if isinstance(result, tuple) else result
    rank = info.get('rank', '?')
    hostname = info.get('hostname', '?')
    value = info.get('value', '?')

    if hostname not in seen_nodes:
        print(f"  Node {hostname} (Rank {rank}): {value}")
        seen_nodes.add(hostname)

print(f"{'-'*70}")
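The unwrapping and per-node grouping in the cell above can be factored into small helpers. The following is a minimal sketch that runs on plain Python data, assuming each result is either a `(point, value)` pair or a bare dict shaped like the return value of `get_env`. The names `unwrap` and `values_by_host`, and the sample data, are illustrative and not part of Monarch.

```python
from collections import defaultdict


def unwrap(result):
    """Return the value dict from a (point, value) pair or a bare dict."""
    return result[1] if isinstance(result, tuple) else result


def values_by_host(results):
    """Map each hostname to the set of values its ranks reported."""
    grouped = defaultdict(set)
    for result in results:
        info = unwrap(result)
        grouped[info["hostname"]].add(info["value"])
    return dict(grouped)


# Hand-written sample data standing in for the output of get_env.call(...)
sample_results = [
    ("point-0", {"rank": 0, "hostname": "node-a", "value": "0,1"}),
    ("point-1", {"rank": 1, "hostname": "node-a", "value": "0,1"}),
    ("point-2", {"rank": 2, "hostname": "node-b", "value": None}),
]

grouped = values_by_host(sample_results)
for hostname, values in sorted(grouped.items()):
    print(f"{hostname}: {values}")
```

A host whose set contains more than one value has ranks that disagree with each other, which is usually worth investigating first.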

## Set Custom Environment Variables

You can set environment variables remotely for debugging purposes.

In [None]:
# Set a custom environment variable on all nodes
print("Setting CUSTOM_DEBUG_VAR on all nodes...")
set_results = await env_actor.set_env.call("CUSTOM_DEBUG_VAR", "debug_enabled")

print(f"\nSet CUSTOM_DEBUG_VAR on {len(set_results)} ranks")

# Verify the variable was set
verify_results = await env_actor.get_env.call("CUSTOM_DEBUG_VAR")
print("\nVerification (first 3 ranks):")
for result in list(verify_results)[:3]:
    # Results may arrive as (point, value) pairs; unwrap to the value dict
    info = result[1] if isinstance(result, tuple) else result
    rank = info['rank']
    value = info['value']
    print(f"  Rank {rank}: CUSTOM_DEBUG_VAR = {value}")
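Printing the first three ranks is a quick spot check. To confirm that every rank picked up the new value, a helper like the one below can report the ranks that did not. This is a sketch on hand-written sample data, assuming results shaped like the return value of `get_env`; `ranks_not_matching` is a hypothetical name, not a Monarch function.

```python
def ranks_not_matching(results, expected):
    """Return the sorted ranks whose reported value differs from `expected`."""
    bad = []
    for result in results:
        # Accept either (point, value) pairs or bare dicts
        info = result[1] if isinstance(result, tuple) else result
        if info["value"] != expected:
            bad.append(info["rank"])
    return sorted(bad)


# Sample data: rank 2 never received the variable
sample_results = [
    {"rank": 0, "value": "debug_enabled"},
    {"rank": 1, "value": "debug_enabled"},
    {"rank": 2, "value": None},
]

stale = ranks_not_matching(sample_results, "debug_enabled")
print(f"Ranks without the expected value: {stale}")
```

An empty list means the variable is set consistently on all ranks.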

## List Variables by Prefix

Query all environment variables matching a specific prefix - useful for debugging CUDA, NCCL, or PyTorch settings.

In [None]:
# List all CUDA-related environment variables
list_results = await env_actor.list_env_vars.call("CUDA")

print("\nCUDA-related environment variables (Rank 0):")
print(f"{'-'*70}")

# Take the first result; it may arrive as a (point, value) pair
first = next(iter(list_results), None)
if first:
    result = first[1] if isinstance(first, tuple) else first
    matching_vars = result.get('matching_vars', {})

    if matching_vars:
        for var_name, var_value in matching_vars.items():
            # Truncate long values
            display_value = var_value if len(var_value) < 60 else var_value[:57] + "..."
            print(f"  {var_name} = {display_value}")
    else:
        print("  No CUDA variables found")

print(f"{'-'*70}")

# Try other prefixes
print("\nTip: Try querying other prefixes like:")
print("  - 'NCCL' - NCCL communication settings")
print("  - 'TORCH' - PyTorch settings")
print("  - 'MONARCH' - Monarch-specific configs")
print("  - 'MASTER' - Distributed training master node info")
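To check several of these prefixes in one pass, the filtering logic can accept a collection of prefixes instead of a single string. The sketch below uses a local dictionary as a stand-in environment; `filter_env` is an illustrative helper, not part of `EnvVarActor` or Monarch, though the same expression could be used inside a `list_env_vars`-style endpoint.

```python
import os


def filter_env(env, prefixes):
    """Return the entries of `env` whose names start with any of `prefixes`."""
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of candidates
    return {name: value for name, value in env.items() if name.startswith(prefixes)}


# A stand-in environment so the example does not depend on the local machine
fake_env = {
    "NCCL_DEBUG": "INFO",
    "TORCH_HOME": "/tmp/torch",
    "MASTER_ADDR": "10.0.0.1",
    "PATH": "/usr/bin",
}

selected = filter_env(fake_env, ["NCCL", "TORCH", "MASTER"])
print(selected)

# The same helper works on the real environment of the current process
local_cuda = filter_env(os.environ, ["CUDA"])
```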

---

# Part 2: Interactive Debugging with Breakpoints

Now let's explore Monarch's most powerful debugging feature: **interactive breakpoints** in distributed actors!

## How Monarch Debugging Works

### The Workflow

1. **Add `breakpoint()`** to your actor methods
2. **Run your code** - execution pauses when breakpoint is hit
3. **Open a terminal** and run `monarch debug`
4. **Use debugger commands**:
   - `list` - Show all active breakpoints
   - `attach <actor> <rank>` - Attach to a specific rank
   - Standard pdb commands: `n`, `s`, `p`, `l`, `c`
   - `cast <actor> ranks(<ranks>) <cmd>` - Send commands to multiple ranks
   - `continue` - Resume all paused processes

### Key Features

- Debug specific ranks (e.g., only rank 0 or only GPU 3)
- Inspect local variables and actor state
- Step through code interactively
- Send commands to multiple ranks simultaneously
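One way to debug specific ranks without editing the actor each time is to read the target ranks from an environment variable and guard `breakpoint()` with it. The sketch below is a minimal illustration: `DEBUG_RANKS` is a hypothetical variable name chosen for this example, and `parse_debug_ranks` and `should_break` are not Monarch APIs.

```python
import os


def parse_debug_ranks(raw):
    """Parse a comma-separated string such as "0,5" into a set of ints."""
    return {int(token) for token in raw.split(",") if token.strip()}


def should_break(rank, env=os.environ, var_name="DEBUG_RANKS"):
    """Return True when `rank` is listed in the (hypothetical) DEBUG_RANKS variable."""
    return rank in parse_debug_ranks(env.get(var_name, ""))


# Inside an actor method the guard would look like:
#     if should_break(self.rank):
#         breakpoint()

demo_env = {"DEBUG_RANKS": "0, 5"}
for rank in range(8):
    if should_break(rank, env=demo_env):
        print(f"rank {rank} would pause here")
```

Because the guard reads the variable at call time, an actor like `EnvVarActor` from Part 1 could in principle change which ranks pause between calls, assuming both actors run in the same processes.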

## Define Debug Worker Actor

Let's create a simple worker actor with strategic breakpoints. This actor simulates a distributed computation workflow without requiring GPU or TorchTitan.

In [None]:
import math
import random


class DebugWorkerActor(Actor):
    """A simple worker actor with debugging breakpoints for CPU-based debugging demo."""

    def __init__(self, worker_name: str = "worker"):
        self.rank = current_rank().rank
        self.worker_name = worker_name
        self.hostname = socket.gethostname()
        self.step_count = 0
        self.data = []

    def _rprint(self, msg):
        """Helper method to print with rank information."""
        print(f"[Rank {self.rank}] {msg}")

    @endpoint
    def init(self):
        """Initialize the worker with a breakpoint."""
        logging.getLogger().addHandler(logging.StreamHandler(sys.stderr))
        self._rprint(f"Initializing worker: {self.worker_name} on {self.hostname}")

        # Breakpoint 1: After initialization (only on rank 0)
        if self.rank == 0:
            self._rprint("Breakpoint 1: Initialization complete")
            breakpoint()  # Debug: Inspect actor initialization state

    @endpoint
    def setup(self, data_size: int = 100):
        """Setup the worker with some data, with a breakpoint to inspect configuration."""
        self._rprint(f"Setting up worker with data_size={data_size}")

        # Generate some random data for this worker
        self.data = [random.random() for _ in range(data_size)]

        # Breakpoint 2: After data setup (only on rank 0)
        if self.rank == 0:
            self._rprint(f"Breakpoint 2: Data setup complete, data length={len(self.data)}")
            breakpoint()  # Debug: Inspect data after setup

        self._rprint(f"Setup complete with {len(self.data)} data points")

    @endpoint
    def process_step(self, num_steps: int = 5):
        """Run a few processing steps with breakpoints."""
        if not self.data:
            raise RuntimeError("Worker not initialized. Call setup first.")

        self._rprint(f"Starting processing for {num_steps} steps")

        # Breakpoint 3: Before processing starts (only on rank 0)
        if self.rank == 0:
            self._rprint("Breakpoint 3: About to start processing")
            breakpoint()  # Debug: Inspect state before processing

        results = []
        for step in range(num_steps):
            self.step_count += 1

            # Simulate some computation
            step_result = sum(self.data) / len(self.data)
            results.append(step_result)

            # Breakpoint 4: Mid-processing on rank 0 at step 2
            if step == 2 and self.rank == 0:
                self._rprint(f"Breakpoint 4: Mid-processing (step {self.step_count})")
                self._rprint(f"Current result: {step_result:.4f}")
                breakpoint()  # Debug: Inspect mid-processing state

            self._rprint(f"Processing step {step + 1}/{num_steps}, result: {step_result:.4f}")

        self._rprint(f"Completed {num_steps} processing steps")
        return {"rank": self.rank, "steps": num_steps, "final_result": results[-1]}

    @endpoint
    def get_status(self) -> dict:
        """Get current worker status."""
        return {
            "rank": self.rank,
            "hostname": self.hostname,
            "worker_name": self.worker_name,
            "step_count": self.step_count,
            "data_size": len(self.data)
        }

    @endpoint
    def cleanup(self):
        """Cleanup resources."""
        self._rprint("Cleaning up worker")
        self.data = []
        self.step_count = 0
        self._rprint("Cleanup complete")

## Spawn Debug Worker

Spawn the debug worker actor. When you run the cells below, execution will pause at breakpoints.

In [None]:
# Spawn the debug worker actor
debug_worker = proc_mesh.spawn("debug_worker", DebugWorkerActor, "demo_worker")
print("Debug worker actor spawned across all nodes")
print("\nWhen breakpoints are hit, execution will pause.")
print("Open a separate terminal and run: monarch debug")

## Run Debug Session

Now let's run the worker methods. When breakpoints are hit:

### In This Notebook
- Execution will pause
- You'll see `Breakpoint X: ...` messages

### In a Separate Terminal
1. Run: `monarch debug`
2. Use `list` to see all active breakpoints
3. Use `attach debug_worker 0` to attach to rank 0
4. Use standard pdb commands or `continue` to resume

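**Note:** The endpoint calls in the cells below are commented out so that the notebook runs without pausing. To try interactive debugging, uncomment them and keep a separate terminal open for `monarch debug`.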

In [None]:
# Initialize workers (will hit breakpoint 1)
print("Step 1: Initializing workers...")
print("   (Breakpoint 1 will trigger on rank 0)\n")

# In a real scenario, this would pause at the breakpoint
# await debug_worker.init.call()

In [None]:
# Setup workers with data (will hit breakpoint 2)
print("Step 2: Setting up workers with data...")
print("   (Breakpoint 2 will trigger on rank 0)\n")

# await debug_worker.setup.call(data_size=100)

In [None]:
# Run processing steps (will hit breakpoints 3 and 4)
print("Step 3: Running processing steps...")
print("   (Breakpoints 3 and 4 will trigger on rank 0)\n")

# await debug_worker.process_step.call(num_steps=5)

## Monarch Debug CLI Commands

Here's a quick reference for the `monarch debug` CLI:

### Listing Breakpoints
```bash
monarch_dbg> list
# Shows all active breakpoints across ranks
# Example output:
#   debug_worker (rank 0): /path/to/file.py:42
#   debug_worker (rank 0): /path/to/file.py:58
```

### Attaching to a Rank
```bash
monarch_dbg> attach debug_worker 0
# Enters interactive pdb session for rank 0

(Pdb) n              # Next line
(Pdb) s              # Step into function
(Pdb) p self.rank    # Print variable
(Pdb) l              # List source code
(Pdb) pp self.data   # Pretty-print object
(Pdb) c              # Continue execution
```

### Casting Commands to Multiple Ranks
```bash
# Send "next" command to ranks 0 and 1
monarch_dbg> cast debug_worker ranks(0,1) n

# Send "continue" to ranks 0 through 7
monarch_dbg> cast debug_worker ranks(0:8) c

# Print a variable on multiple ranks
monarch_dbg> cast debug_worker ranks(0,1,2,3) p self.step_count
```
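As the examples above indicate, `ranks(0:8)` addresses ranks 0 through 7, so the upper bound is excluded in the same way as Python's `range`. The sketch below illustrates that reading for the two forms shown here. It is not Monarch's parser, and the CLI may accept forms this helper does not handle; `expand_ranks` is a hypothetical name.

```python
def expand_ranks(spec):
    """Expand a spec such as "0,1,3" or "0:8" into a list of rank indices.

    A "start:stop" part is treated as half-open, like Python's range().
    """
    ranks = []
    for part in spec.split(","):
        if ":" in part:
            start, stop = part.split(":")
            ranks.extend(range(int(start), int(stop)))
        else:
            ranks.append(int(part))
    return ranks


print(expand_ranks("0,1"))
print(expand_ranks("0:8"))
```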

### Continuing All
```bash
monarch_dbg> continue
# Resumes execution on all paused ranks
```

### Getting Help
```bash
monarch_dbg> help
# Shows all available commands
```

## Common Debugging Scenarios

### Scenario 1: Rank-Specific Bug
```python
# Problem: Training fails on rank 5 but works on other ranks

@endpoint
def train(self):
    if self.rank == 5:
        breakpoint()  # Only pause rank 5
    # ... training code
```

Then in terminal:
```bash
monarch debug
monarch_dbg> attach trainer_actor 5
(Pdb) p self.data_batch  # Inspect what's different on rank 5
```

### Scenario 2: Collective Operation Hang
```python
# Problem: All-reduce hangs, need to check all ranks

@endpoint
def sync_gradients(self):
    breakpoint()  # Pause all ranks before all-reduce
    torch.distributed.all_reduce(self.gradients)
```

Then:
```bash
monarch_dbg> list  # Check which ranks hit the breakpoint
monarch_dbg> cast trainer_actor ranks(0:8) p self.gradients.shape
# Verify all ranks have same shape
```
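When a collective hangs, the useful facts are which ranks never reached the breakpoint and whether the ranks that did reach it agree on tensor shapes. The sketch below works on values copied by hand from the debugger session; it does not parse `monarch debug` output, and both helper names are illustrative.

```python
def missing_ranks(world_size, paused_ranks):
    """Return the ranks in [0, world_size) that are not in `paused_ranks`."""
    return sorted(set(range(world_size)) - set(paused_ranks))


def shapes_disagree(shapes_by_rank):
    """Return True when ranks report more than one distinct shape."""
    return len(set(shapes_by_rank.values())) > 1


# Ranks noted from `list`: rank 5 never hit the breakpoint
paused = [0, 1, 2, 3, 4, 6, 7]
print(f"Ranks that never paused: {missing_ranks(8, paused)}")

# Shapes noted from `cast ... p self.gradients.shape`
shapes = {0: (1024,), 1: (1024,), 2: (512,)}
print(f"Shape mismatch: {shapes_disagree(shapes)}")
```

A rank missing from the paused set is likely stuck or crashed before the collective, while a shape mismatch points at inconsistent data or model configuration.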

### Scenario 3: Environment Mismatch
```python
# Problem: Different NCCL settings causing issues

# Use EnvVarActor to inspect
results = await env_actor.list_env_vars.call("NCCL")
# Compare NCCL settings across all ranks
```
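Comparing the settings by eye gets tedious beyond a few ranks. The following sketch reports only the variables whose values differ across ranks, treating a variable that is absent on a rank as `None`. It assumes results shaped like the return value of `list_env_vars`; `find_mismatches` and the sample data are illustrative.

```python
def find_mismatches(results):
    """Return {var_name: {rank: value}} for variables that differ across ranks."""
    # Accept either (point, value) pairs or bare dicts
    infos = [r[1] if isinstance(r, tuple) else r for r in results]

    names = set()
    for info in infos:
        names.update(info["matching_vars"])

    mismatches = {}
    for name in sorted(names):
        # A variable missing on a rank is recorded as None
        by_rank = {info["rank"]: info["matching_vars"].get(name) for info in infos}
        if len(set(by_rank.values())) > 1:
            mismatches[name] = by_rank
    return mismatches


sample_results = [
    {"rank": 0, "matching_vars": {"NCCL_DEBUG": "INFO", "NCCL_SOCKET_IFNAME": "eth0"}},
    {"rank": 1, "matching_vars": {"NCCL_DEBUG": "INFO", "NCCL_SOCKET_IFNAME": "ens5"}},
    {"rank": 2, "matching_vars": {"NCCL_DEBUG": "INFO"}},
]

diff = find_mismatches(sample_results)
for name, by_rank in diff.items():
    print(name, by_rank)
```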

---

# Congratulations!

You've mastered **interactive debugging** for distributed training with Monarch!

## What You Learned

### Environment Variable Management
- Query env vars across all nodes
- Set and modify env vars remotely
- List variables by prefix (CUDA, NCCL, etc.)

### Interactive Debugging
- Add breakpoints to distributed actors
- Use `monarch debug` CLI
- Attach to specific ranks
- Send commands to multiple ranks
- Common debugging scenarios

## Key Takeaways

- **Debug like local code** - Use familiar pdb commands in distributed settings
- **Selective debugging** - Focus on problematic ranks without noise from others
- **Environment inspection** - Quickly identify configuration mismatches
- **No more print debugging** - Interactive inspection is much more powerful

## The Complete Monarch Workflow

You've now learned the three pillars of efficient distributed development:

1. **Studio 1: Getting Started** - Launch multi-node training
2. **Studio 2: Workspace Sync** - Hot-reload configs and code
3. **Studio 3: Interactive Debugging** - Debug efficiently (YOU ARE HERE!)

Together, these enable:
- **Faster iteration** (no job restarts)
- **Easier debugging** (interactive breakpoints)
- **Better observability** (env var inspection, log aggregation)

## Next Steps

### Put It Into Practice
Try debugging your own training code:
1. Add strategic breakpoints
2. Run `monarch debug` when they're hit
3. Inspect state and identify issues

### Explore More
- Review [Studio 1: Getting Started](./studio_1_getting_started.ipynb)
- Review [Studio 2: Workspace Sync](./studio_2_workspace_sync.ipynb)
- Check out the [Monarch documentation](https://github.com/meta-pytorch/monarch)

---

## Pro Tips

### Debugging Best Practices
1. **Use conditional breakpoints** - Only pause specific ranks
2. **Check env vars first** - Many issues are configuration mismatches
3. **Use `cast` for comparison** - Check variables across multiple ranks
4. **Don't forget `continue`** - Resume execution when done debugging

### Performance Tip
Remove or comment out `breakpoint()` calls before production runs. A rank-guarded breakpoint costs almost nothing while its condition is false, but one that fires pauses that rank until someone resumes it.
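As an alternative to deleting the calls, standard Python lets you disable `breakpoint()` through the `PYTHONBREAKPOINT` environment variable: when it is set to `0`, the default `sys.breakpointhook` returns immediately. The sketch below shows this in a plain local process. Whether Monarch's actor processes honor the variable depends on how Monarch installs its own breakpoint hook, so treat that as an assumption to verify before relying on it.

```python
import os

# With PYTHONBREAKPOINT=0, Python's default breakpoint hook is a no-op.
os.environ["PYTHONBREAKPOINT"] = "0"


def step(x):
    """A toy computation with a leftover breakpoint."""
    breakpoint()  # skipped by the default hook because PYTHONBREAKPOINT is "0"
    return x * 2


result = step(21)
print(result)
```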

Happy debugging!