Merged
17 commits
7 changes: 6 additions & 1 deletion README.md
@@ -30,6 +30,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
- [Reporting and Database](#-reporting-and-database)
- [Installation](#-installation)
- [Tips & Best Practices](#-tips--best-practices)
+- [Exit codes and CI](#exit-codes-and-ci)
- [Contributing](#-contributing)
- [License](#-license)
- [Links & Resources](#-links--resources)
@@ -71,7 +72,7 @@ madengine run --tags dummy --rocm-path /path/to/rocm \
# or: export ROCM_PATH=/path/to/rocm && madengine run --tags dummy ...
```

-**Results saved to `perf_entry.csv`**
+**Results:** Performance data is written to `perf.csv` (and optionally `perf_entry.csv`). The file is created automatically if missing. Failed runs (including pre-run setup failures) are recorded with status `FAILURE` so every attempted model appears in the table. See [Exit Codes](docs/cli-reference.md#exit-codes) for CI/script usage.

## 📋 Commands

@@ -561,6 +562,10 @@ See [Installation Guide](docs/installation.md) for detailed instructions.
- **Monitor GPU utilization** with `gpu_info_power_profiler`
- **Profile library calls** with rocBLAS/MIOpen tracing

+### Exit codes and CI
+
+madengine uses consistent exit codes for scripts and CI (e.g. Jenkins): `0` = success, `1` = general failure, `2` = build failure, `3` = one or more run failures, `4` = invalid arguments. Failed runs are still written to `perf.csv` with status `FAILURE`. See [CLI Reference — Exit Codes](docs/cli-reference.md#exit-codes) for the full table and examples.
+
### Troubleshooting

```bash
14 changes: 8 additions & 6 deletions docs/cli-reference.md
@@ -470,17 +470,19 @@ madengine database \

## Exit Codes

-madengine uses standard exit codes to indicate success or failure:
+madengine uses standard exit codes so scripts and CI (e.g. Jenkins) can detect success or failure:

| Code | Constant | Description |
|------|----------|-------------|
| `0` | `SUCCESS` | Command completed successfully |
| `1` | `FAILURE` | General failure |
-| `2` | `INVALID_ARGS` | Invalid command-line arguments or configuration |
-| `3` | `BUILD_FAILURE` | One or more builds failed |
-| `4` | `RUN_FAILURE` | One or more model executions failed |
+| `2` | `BUILD_FAILURE` | One or more image builds failed (e.g. Docker build error) |
+| `3` | `RUN_FAILURE` | One or more model executions failed |
+| `4` | `INVALID_ARGS` | Invalid command-line arguments or configuration |
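
A CI helper could mirror the renumbered codes (`2` = `BUILD_FAILURE`, `3` = `RUN_FAILURE`, `4` = `INVALID_ARGS`) along the following lines. This is only a sketch. The constant names and values come from the table, while the `describe` helper and its fallback to `FAILURE` for unknown codes are assumptions for illustration, not madengine's implementation.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes as listed in the updated table (illustrative re-declaration)."""

    SUCCESS = 0
    FAILURE = 1
    BUILD_FAILURE = 2
    RUN_FAILURE = 3
    INVALID_ARGS = 4


def describe(code: int) -> str:
    """Map a raw process exit code to its constant name.

    Unknown codes fall back to FAILURE (an assumption of this sketch).
    """
    try:
        return ExitCode(code).name
    except ValueError:
        return ExitCode.FAILURE.name


print(describe(3))
```

Because `IntEnum` members compare equal to plain integers, a value such as `subprocess.run(...).returncode` can be checked against `ExitCode.RUN_FAILURE` directly.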

-**Example Usage in Scripts:**
+**Failure recording:** Pre-run failures (e.g. image pull, setup) and run failures are recorded in the performance table (`perf.csv`) with status `FAILURE`, so all attempted models appear in the CSV. The file is created automatically if missing.
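
Since failed attempts stay in the table, a post-run step can list them without parsing logs. The sketch below uses only the standard library. The column names `model` and `status` are assumptions for illustration, so check the actual `perf.csv` header before relying on them.

```python
import csv
import io


def failed_models(csv_text: str, model_col: str = "model", status_col: str = "status") -> list[str]:
    """Return the entries whose status column equals FAILURE."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[model_col] for row in reader if row.get(status_col) == "FAILURE"]


# Hypothetical two-row table; the real perf.csv has more columns.
sample = "model,status\ndummy,SUCCESS\ndummy2,FAILURE\n"
print(failed_models(sample))
```

For a file on disk, pass `open("perf.csv", newline="").read()` as `csv_text`.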
+
+**Example usage in scripts / CI:**

```bash
#!/bin/bash
@@ -491,7 +493,7 @@ if [ $? -eq 0 ]; then
madengine run --manifest-file build_manifest.json
else
echo "Build failed with exit code $?"
-exit 1
+exit $?
fi
```
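
One caveat when propagating codes from a shell script: `$?` always holds the status of the most recent command, so after the `[ $? -eq 0 ]` test and the `echo`, it no longer reflects the build's exit code. A wrapper has to store the code once and reuse it. The sketch below shows that pattern in Python with `subprocess`. The `run_steps` helper is hypothetical, and the commands are stand-ins that exit with fixed codes so the example runs anywhere.

```python
import subprocess
import sys


def run_steps(steps: list[list[str]]) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for cmd in steps:
        code = subprocess.run(cmd).returncode  # captured once and reused below
        if code != 0:
            print(f"Step failed with exit code {code}: {' '.join(cmd)}", file=sys.stderr)
            return code
    return 0


# Stand-in commands: the first succeeds, the second exits with 3 (RUN_FAILURE).
# In CI these would be the madengine build step followed by
# `madengine run --manifest-file build_manifest.json`.
steps = [
    [sys.executable, "-c", "raise SystemExit(0)"],
    [sys.executable, "-c", "raise SystemExit(3)"],
]
exit_code = run_steps(steps)
print(exit_code)
```

A real wrapper would finish with `sys.exit(exit_code)` so the CI system sees the same code the failing step produced.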

8 changes: 5 additions & 3 deletions docs/usage.md
@@ -664,6 +664,8 @@ madengine discover --tags model --verbose

### CI/CD Integration

+madengine uses consistent exit codes (0=success, 2=build failure, 3=run failure, 4=invalid args). Failed runs are still written to `perf.csv` with status `FAILURE`. See [CLI Reference — Exit Codes](cli-reference.md#exit-codes) for the full table.
+
```bash
#!/bin/bash
# Example CI script
@@ -679,19 +681,19 @@ madengine build --batch-manifest batch.json \
madengine run --manifest-file build_manifest.json \
--timeout 3600

-# Check exit code
+# Check exit code (0=success, 2=build failure, 3=run failure; see CLI Reference)
if [ $? -eq 0 ]; then
echo "✅ Tests passed"

# Generate and upload results
madengine report to-email --output ci_results.html
madengine database \
---csv-file perf_entry.csv \
+--csv-file perf.csv \
--db ci_results \
--collection ${CI_BUILD_ID}
else
echo "❌ Tests failed"
-exit 1
+exit $?
fi
```

2 changes: 1 addition & 1 deletion examples/profiling-configs/rocprofv3_multi_gpu.json
@@ -12,7 +12,7 @@
"nodes": 1,
"gpus_per_node": 4,
"time": "02:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": false
},

2 changes: 1 addition & 1 deletion examples/profiling-configs/rocprofv3_multi_node.json
@@ -12,7 +12,7 @@
"nodes": 2,
"gpus_per_node": 4,
"time": "04:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": true
},

14 changes: 7 additions & 7 deletions examples/slurm-configs/README.md
@@ -379,7 +379,7 @@ madengine uses intelligent multi-layer configuration merging:
"nodelist": "node01,node02", // Optional: restrict job to these nodes (skips node health preflight)
"gpus_per_node": 8, // GPUs per node
"time": "24:00:00", // Wall time (HH:MM:SS)
-"output_dir": "./slurm_output", // Local output directory
+"output_dir": "./slurm_results", // Local output directory
"results_dir": "/shared/results", // Shared results collection
"shared_workspace": "/shared/workspace", // Shared workspace (NFS/Lustre)
"exclusive": true, // Exclusive node access
@@ -584,10 +584,10 @@ squeue -u $USER
scontrol show job <job_id>

# View output logs (real-time)
-tail -f slurm_output/madengine-*_<job_id>_*.out
+tail -f slurm_results/madengine-*_<job_id>_*.out

# View error logs
-tail -f slurm_output/madengine-*_<job_id>_*.err
+tail -f slurm_results/madengine-*_<job_id>_*.err

# Cancel job if needed
scancel <job_id>
@@ -600,7 +600,7 @@ scancel <job_id>
- Check SLURM partition exists: `sinfo`
- Verify GPU resources available: `sinfo -o "%P %.5a %.10l %.6D %.6t %N %G"`
- Check SLURM account/QoS settings
-- Review job script: `slurm_output/madengine_*.sh`
+- Review job script: `slurm_results/madengine_*.sh`

### Out of Memory Errors

@@ -715,8 +715,8 @@ MODEL_DIR=models/my-model madengine run \
watch squeue -u $USER

# 6. Check logs when complete
-ls -lh slurm_output/
-tail -f slurm_output/madengine-*_<job_id>_*.out
+ls -lh slurm_results/
+tail -f slurm_results/madengine-*_<job_id>_*.out
```

### vLLM Inference Workflow
@@ -739,7 +739,7 @@ MODEL_DIR=models/llama2-70b madengine run \
--manifest-file build_manifest.json

# 5. Monitor for OOM errors
-tail -f slurm_output/madengine-*_<job_id>_*.err | grep -i "memory"
+tail -f slurm_results/madengine-*_<job_id>_*.err | grep -i "memory"

# 6. If OOM occurs, adjust config and rebuild
# Edit your config file to set VLLM_KV_CACHE_SIZE to 0.6 or 0.7
@@ -12,7 +12,7 @@
"nodes": 1,
"gpus_per_node": 1,
"time": "01:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": false
},

2 changes: 1 addition & 1 deletion examples/slurm-configs/basic/02-single-node-multi-gpu.json
@@ -12,7 +12,7 @@
"nodes": 1,
"gpus_per_node": 8,
"time": "12:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": true
},

@@ -10,7 +10,7 @@
"nodelist": "node01,node02",
"gpus_per_node": 8,
"time": "24:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": true,
"network_interface": "eth0"
},
2 changes: 1 addition & 1 deletion examples/slurm-configs/basic/03-multi-node-basic.json
@@ -12,7 +12,7 @@
"nodes": 2,
"gpus_per_node": 8,
"time": "24:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": true,
"network_interface": "eth0"
},
2 changes: 1 addition & 1 deletion examples/slurm-configs/basic/04-multi-node-advanced.json
@@ -12,7 +12,7 @@
"nodes": 4,
"gpus_per_node": 8,
"time": "48:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"results_dir": "/shared/results",
"shared_workspace": "/shared/workspace",
"exclusive": true,
2 changes: 1 addition & 1 deletion examples/slurm-configs/basic/05-vllm-single-node.json
@@ -12,7 +12,7 @@
"nodes": 1,
"gpus_per_node": 4,
"time": "02:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": true
},

2 changes: 1 addition & 1 deletion examples/slurm-configs/basic/06-vllm-multi-node.json
@@ -19,7 +19,7 @@
"nodes": 2,
"gpus_per_node": 4,
"time": "00:45:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": true,

"_comment_node_check": "Preflight GPU health check (helps avoid OOM from stale processes)",
2 changes: 1 addition & 1 deletion examples/slurm-configs/basic/07-sglang-single-node.json
@@ -12,7 +12,7 @@
"nodes": 1,
"gpus_per_node": 4,
"time": "02:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": true
},

2 changes: 1 addition & 1 deletion examples/slurm-configs/basic/08-sglang-multi-node.json
@@ -12,7 +12,7 @@
"nodes": 2,
"gpus_per_node": 4,
"time": "04:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": true
},

2 changes: 1 addition & 1 deletion examples/slurm-configs/basic/cluster-amd-rccl.json
@@ -19,7 +19,7 @@
"nodes": 1,
"gpus_per_node": 8,
"time": "12:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": true
},

@@ -20,7 +20,7 @@
"nodes": 7,
"gpus_per_node": 8,
"time": "04:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": true
},

2 changes: 1 addition & 1 deletion examples/slurm-configs/basic/sglang-disagg-multi-node.json
@@ -19,7 +19,7 @@
"nodes": 5,
"gpus_per_node": 8,
"time": "04:00:00",
-"output_dir": "./slurm_output",
+"output_dir": "./slurm_results",
"exclusive": true
},

3 changes: 3 additions & 0 deletions src/madengine/cli/app.py
@@ -71,6 +71,9 @@ def cli_main() -> None:
     """Entry point for the CLI application."""
     try:
         app()
+    except typer.Exit:
+        # Preserve exit code from run/build/other commands for Jenkins and scripts
+        raise
     except KeyboardInterrupt:
         console.print("\n🛑 [yellow]Operation cancelled by user[/yellow]")
         sys.exit(ExitCode.FAILURE)
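
The ordering of the `except` clauses matters here. `typer.Exit` is click's `Exit` exception, which derives from `RuntimeError`, so a broader handler placed later would otherwise catch it and replace the command's code with a generic failure. The sketch below illustrates the effect with a stand-in `Exit` class. It assumes a broad `except Exception` fallback exists further down in `cli_main`, which this hunk does not show, and it returns the code instead of re-raising so the result is easy to inspect.

```python
class Exit(RuntimeError):
    """Stand-in for typer.Exit: an exception that carries an exit code."""

    def __init__(self, code: int = 0) -> None:
        super().__init__(code)
        self.exit_code = code


def cli_main_sketch(app) -> int:
    """Return the exit code a wrapper around `app` would report."""
    try:
        app()
        return 0
    except Exit as exc:
        # Handled first so the command's own exit code survives.
        return exc.exit_code
    except Exception:
        # Broad fallback (assumed): without the clause above, Exit(2) would end up here as 1.
        return 1


def failing_build() -> None:
    raise Exit(2)


print(cli_main_sketch(failing_build))
```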
8 changes: 8 additions & 0 deletions src/madengine/cli/commands/run.py
@@ -21,6 +21,7 @@

from madengine.orchestration.run_orchestrator import RunOrchestrator
from madengine.core.errors import (
+    BuildError,
     ConfigurationError,
     RuntimeError as MADRuntimeError,
)
@@ -419,6 +420,13 @@ def run(

     except typer.Exit:
         raise
+    except BuildError as e:
+        console.print(f"🔨 [bold red]Build error: {e}[/bold red]")
+        if hasattr(e, "suggestions") and e.suggestions:
+            console.print("\n💡 [cyan]Suggestions:[/cyan]")
+            for suggestion in e.suggestions:
+                console.print(f" • {suggestion}")
+        raise typer.Exit(ExitCode.BUILD_FAILURE)
     except MADRuntimeError as e:
         # Runtime execution errors
         console.print(f"💥 [bold red]Runtime error: {e}[/bold red]")
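
The new handler's behavior (print the error, list any suggestions, exit with the build-failure code) can be exercised without a console by separating message construction from output. The following is a sketch under assumptions: this `BuildError` is a hypothetical stand-in for `madengine.core.errors.BuildError`, `report_build_error` is not part of madengine, and the value `2` is taken from the exit-code table.

```python
BUILD_FAILURE = 2  # value from the documented exit-code table


class BuildError(Exception):
    """Hypothetical stand-in for madengine.core.errors.BuildError."""

    def __init__(self, message: str, suggestions=None) -> None:
        super().__init__(message)
        self.suggestions = suggestions or []


def report_build_error(err: Exception) -> tuple[list[str], int]:
    """Build the lines a handler would print, plus the exit code to raise."""
    lines = [f"Build error: {err}"]
    # getattr with a default covers errors that carry no `suggestions` attribute.
    suggestions = getattr(err, "suggestions", None)
    if suggestions:
        lines.append("Suggestions:")
        lines.extend(f" • {s}" for s in suggestions)
    return lines, BUILD_FAILURE


lines, code = report_build_error(BuildError("docker build failed", ["Check the Dockerfile path"]))
print("\n".join(lines))
```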
14 changes: 8 additions & 6 deletions src/madengine/cli/utils.py
@@ -352,7 +352,6 @@ def display_performance_table(perf_csv_path: str = "perf.csv", session_start_row
console.print("[yellow]⚠️ Performance CSV is empty[/yellow]")
return

-# Get session_start_row to mark current runs (don't filter, just mark)
total_rows = len(df)

# Try parameter first, then fall back to marker file
@@ -378,7 +377,7 @@ def display_performance_table(perf_csv_path: str = "perf.csv", session_start_row
perf_table.add_column("Index", justify="right", style="dim")
perf_table.add_column("Model", style="cyan")
perf_table.add_column("Topology", justify="center", style="blue")
-perf_table.add_column("Launcher", justify="center", style="magenta") # Distributed launcher
+perf_table.add_column("Launcher", justify="center", style="magenta")
perf_table.add_column("Deployment", justify="center", style="cyan")
perf_table.add_column("GPU Arch", style="yellow")
perf_table.add_column("Performance", justify="right", style="green")
@@ -388,20 +387,23 @@ def display_performance_table(perf_csv_path: str = "perf.csv", session_start_row
     perf_table.add_column("Data Name", style="magenta")
     perf_table.add_column("Data Provider", style="magenta")

-    # Helper function to format duration
+    # Helper function to format duration (accepts float seconds or "Xs" string)
     def format_duration(duration):
-        if pd.isna(duration) or duration == "":
+        if pd.isna(duration) or duration == "" or duration is None:
             return "N/A"
         try:
-            dur = float(duration)
+            if isinstance(duration, str) and duration.strip().endswith("s"):
+                dur = float(duration.strip()[:-1])
+            else:
+                dur = float(duration)
             if dur < 1:
                 return f"{dur*1000:.0f}ms"
             elif dur < 60:
                 return f"{dur:.2f}s"
             else:
                 return f"{dur/60:.1f}m"
         except (ValueError, TypeError):
-            return "N/A"
+            return str(duration) if duration else "N/A"

     # Helper function to format performance
     def format_performance(perf):
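
To check the updated `format_duration` rules in isolation, the helper can be reproduced outside `display_performance_table`. The version below is a standalone sketch, not the code in `utils.py`. It replaces `pd.isna` with a `math.isnan` check so it runs without pandas, which is an assumption that only covers `None` and float NaN.

```python
import math


def format_duration(duration) -> str:
    """Format seconds given as a number or an "Xs" string; echo unparseable input."""
    if duration is None or duration == "":
        return "N/A"
    if isinstance(duration, float) and math.isnan(duration):
        return "N/A"
    try:
        if isinstance(duration, str) and duration.strip().endswith("s"):
            seconds = float(duration.strip()[:-1])  # drop the trailing "s"
        else:
            seconds = float(duration)
    except (ValueError, TypeError):
        # Mirrors the new fallback: show the raw value instead of hiding it.
        return str(duration) if duration else "N/A"
    if seconds < 1:
        return f"{seconds * 1000:.0f}ms"
    if seconds < 60:
        return f"{seconds:.2f}s"
    return f"{seconds / 60:.1f}m"


for value in ("2.5s", 0.25, 90, None, "n/a"):
    print(repr(value), "->", format_duration(value))
```

The fallback change matters for rows written by failed runs: a non-numeric duration is displayed as-is instead of collapsing to `N/A`.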