ROCm · coketaste · Mar 18, 2026 · Mar 8, 2026
diff --git a/README.md b/README.md
@@ -30,6 +30,7 @@ madengine is a modern CLI tool for running Large Language Models (LLMs) and Deep
 - [Reporting and Database](#-reporting-and-database)
 - [Installation](#-installation)
 - [Tips & Best Practices](#-tips--best-practices)
+  - [Exit codes and CI](#exit-codes-and-ci)
 - [Contributing](#-contributing)
 - [License](#-license)
 - [Links & Resources](#-links--resources)
@@ -71,7 +72,7 @@ madengine run --tags dummy --rocm-path /path/to/rocm \
 # or: export ROCM_PATH=/path/to/rocm && madengine run --tags dummy ...
 ```
 
-**Results saved to `perf_entry.csv`**
+**Results:** Performance data is written to `perf.csv` (and optionally `perf_entry.csv`). The file is created automatically if missing. Failed runs (including pre-run setup failures) are recorded with status `FAILURE` so every attempted model appears in the table. See [Exit Codes](docs/cli-reference.md#exit-codes) for CI/script usage.
 
 ## 📋 Commands
 
@@ -561,6 +562,10 @@ See [Installation Guide](docs/installation.md) for detailed instructions.
 - **Monitor GPU utilization** with `gpu_info_power_profiler`
 - **Profile library calls** with rocBLAS/MIOpen tracing
 
+### Exit codes and CI
+
+madengine uses consistent exit codes for scripts and CI (e.g. Jenkins): `0` = success, `1` = general failure, `2` = build failure, `3` = one or more run failures, `4` = invalid arguments. Failed runs are still written to `perf.csv` with status `FAILURE`. See [CLI Reference — Exit Codes](docs/cli-reference.md#exit-codes) for the full table and examples.
+
 ### Troubleshooting
 
 ```bash

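The failure-recording behavior described in the README hunk above (file created if missing, one row per attempted model, status `FAILURE` for failed runs) can be illustrated with a small stand-alone sketch. It is not madengine's implementation: `record_result` and the `model` column name are hypothetical, and the real `perf.csv` has more columns than shown here.

```python
import csv
from pathlib import Path

# Hypothetical subset of columns; only "status" is named in the README text.
FIELDS = ["model", "status"]


def record_result(csv_path: str, model: str, status: str) -> None:
    """Append one row, writing the header first if the file does not exist yet."""
    path = Path(csv_path)
    is_new = not path.exists()
    # Append mode creates the file when it is missing.
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"model": model, "status": status})
```

Because every attempt appends a row, a model that fails during setup still shows up in the table instead of silently disappearing.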
diff --git a/docs/cli-reference.md b/docs/cli-reference.md
@@ -470,17 +470,19 @@ madengine database \
 
 ## Exit Codes
 
-madengine uses standard exit codes to indicate success or failure:
+madengine uses consistent exit codes so scripts and CI (e.g. Jenkins) can tell success apart from each kind of failure:
 
 | Code | Constant | Description |
 |------|----------|-------------|
 | `0` | `SUCCESS` | Command completed successfully |
 | `1` | `FAILURE` | General failure |
-| `2` | `INVALID_ARGS` | Invalid command-line arguments or configuration |
-| `3` | `BUILD_FAILURE` | One or more builds failed |
-| `4` | `RUN_FAILURE` | One or more model executions failed |
+| `2` | `BUILD_FAILURE` | One or more image builds failed (e.g. Docker build error) |
+| `3` | `RUN_FAILURE` | One or more model executions failed |
+| `4` | `INVALID_ARGS` | Invalid command-line arguments or configuration |
 
-**Example Usage in Scripts:**
+**Failure recording:** Pre-run failures (e.g. image pull, setup) and run failures are recorded in the performance table (`perf.csv`) with status `FAILURE`, so all attempted models appear in the CSV. The file is created automatically if missing.
+
+**Example usage in scripts / CI:**
 
 ```bash
 #!/bin/bash
@@ -491,7 +493,7 @@ if [ $? -eq 0 ]; then
   madengine run --manifest-file build_manifest.json
 else
   echo "Build failed with exit code $?"
-  exit 1
+  exit 2  # BUILD_FAILURE; a bare "exit $?" here would return the status of echo (0)
 fi
 ```
 

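For callers that invoke madengine from Python rather than bash, the exit-code table in this hunk can be mirrored in a small enum. This is a minimal sketch: the `ExitCode` class below is a local copy of the documented values, not an import from madengine, and `describe_exit` is a hypothetical helper.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Local mirror of the documented exit codes (not imported from madengine)."""
    SUCCESS = 0
    FAILURE = 1
    BUILD_FAILURE = 2
    RUN_FAILURE = 3
    INVALID_ARGS = 4


def describe_exit(returncode: int) -> str:
    """Return the constant name for a documented code, or a fallback otherwise."""
    try:
        return ExitCode(returncode).name
    except ValueError:
        # Codes outside the table (e.g. 127 from a shell) are reported as unknown.
        return f"UNKNOWN({returncode})"
```

A CI wrapper could log `describe_exit(proc.returncode)` next to the raw number so that a `3` in the job log reads as `RUN_FAILURE`.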
diff --git a/docs/usage.md b/docs/usage.md
@@ -664,6 +664,8 @@ madengine discover --tags model --verbose
 
 ### CI/CD Integration
 
+madengine uses consistent exit codes (0=success, 1=general failure, 2=build failure, 3=run failure, 4=invalid args). Failed runs are still written to `perf.csv` with status `FAILURE`. See [CLI Reference — Exit Codes](cli-reference.md#exit-codes) for the full table.
+
 ```bash
 #!/bin/bash
 # Example CI script
@@ -679,19 +681,19 @@ madengine build --batch-manifest batch.json \
 madengine run --manifest-file build_manifest.json \
   --timeout 3600
 
-# Check exit code
-if [ $? -eq 0 ]; then
+rc=$?  # Save the exit code now (0=success, 2=build failure, 3=run failure; see CLI Reference)
+if [ "$rc" -eq 0 ]; then
   echo "✅ Tests passed"
 
   # Generate and upload results
   madengine report to-email --output ci_results.html
   madengine database \
-    --csv-file perf_entry.csv \
+    --csv-file perf.csv \
     --db ci_results \
     --collection ${CI_BUILD_ID}
 else
   echo "❌ Tests failed"
-  exit 1
+  exit "$rc"
 fi
 ```
 

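The same CI check can be written in Python, where the return code is an ordinary value and cannot be overwritten by an intervening command. A minimal sketch, assuming the steps are given as argument lists; `run_steps` is a hypothetical helper and not part of madengine.

```python
import subprocess
import sys
from typing import List, Sequence


def run_steps(steps: Sequence[List[str]]) -> int:
    """Run each command in order; stop at the first failure and return its exit code."""
    for cmd in steps:
        # Read the return code straight from the completed process.
        returncode = subprocess.run(cmd).returncode
        if returncode != 0:
            print(
                f"step failed with exit code {returncode}: {' '.join(cmd)}",
                file=sys.stderr,
            )
            return returncode
    return 0
```

A CI job would end with something like `sys.exit(run_steps([[...build args...], [...run args...]]))`, so Jenkins sees the exit code of the step that failed rather than a generic `1`.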
diff --git a/examples/profiling-configs/rocprofv3_multi_gpu.json b/examples/profiling-configs/rocprofv3_multi_gpu.json
@@ -12,7 +12,7 @@
     "nodes": 1,
     "gpus_per_node": 4,
     "time": "02:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": false
   },
 

diff --git a/examples/profiling-configs/rocprofv3_multi_node.json b/examples/profiling-configs/rocprofv3_multi_node.json
@@ -12,7 +12,7 @@
     "nodes": 2,
     "gpus_per_node": 4,
     "time": "04:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": true
   },
 

diff --git a/examples/slurm-configs/README.md b/examples/slurm-configs/README.md
@@ -379,7 +379,7 @@ madengine uses intelligent multi-layer configuration merging:
     "nodelist": "node01,node02",     // Optional: restrict job to these nodes (skips node health preflight)
     "gpus_per_node": 8,             // GPUs per node
     "time": "24:00:00",             // Wall time (HH:MM:SS)
-    "output_dir": "./slurm_output", // Local output directory
+    "output_dir": "./slurm_results", // Local output directory
     "results_dir": "/shared/results", // Shared results collection
     "shared_workspace": "/shared/workspace", // Shared workspace (NFS/Lustre)
     "exclusive": true,               // Exclusive node access
@@ -584,10 +584,10 @@ squeue -u $USER
 scontrol show job <job_id>
 
 # View output logs (real-time)
-tail -f slurm_output/madengine-*_<job_id>_*.out
+tail -f slurm_results/madengine-*_<job_id>_*.out
 
 # View error logs
-tail -f slurm_output/madengine-*_<job_id>_*.err
+tail -f slurm_results/madengine-*_<job_id>_*.err
 
 # Cancel job if needed
 scancel <job_id>
@@ -600,7 +600,7 @@ scancel <job_id>
 - Check SLURM partition exists: `sinfo`
 - Verify GPU resources available: `sinfo -o "%P %.5a %.10l %.6D %.6t %N %G"`
 - Check SLURM account/QoS settings
-- Review job script: `slurm_output/madengine_*.sh`
+- Review job script: `slurm_results/madengine_*.sh`
 
 ### Out of Memory Errors
 
@@ -715,8 +715,8 @@ MODEL_DIR=models/my-model madengine run \
 watch squeue -u $USER
 
 # 6. Check logs when complete
-ls -lh slurm_output/
-tail -f slurm_output/madengine-*_<job_id>_*.out
+ls -lh slurm_results/
+tail -f slurm_results/madengine-*_<job_id>_*.out
 ```
 
 ### vLLM Inference Workflow
@@ -739,7 +739,7 @@ MODEL_DIR=models/llama2-70b madengine run \
   --manifest-file build_manifest.json
 
 # 5. Monitor for OOM errors
-tail -f slurm_output/madengine-*_<job_id>_*.err | grep -i "memory"
+tail -f slurm_results/madengine-*_<job_id>_*.err | grep -i "memory"
 
 # 6. If OOM occurs, adjust config and rebuild
 # Edit your config file to set VLLM_KV_CACHE_SIZE to 0.6 or 0.7

diff --git a/examples/slurm-configs/basic/01-single-node-single-gpu.json b/examples/slurm-configs/basic/01-single-node-single-gpu.json
@@ -12,7 +12,7 @@
     "nodes": 1,
     "gpus_per_node": 1,
     "time": "01:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": false
   },
 

diff --git a/examples/slurm-configs/basic/02-single-node-multi-gpu.json b/examples/slurm-configs/basic/02-single-node-multi-gpu.json
@@ -12,7 +12,7 @@
     "nodes": 1,
     "gpus_per_node": 8,
     "time": "12:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": true
   },
 

diff --git a/examples/slurm-configs/basic/03-multi-node-basic-nodelist.json b/examples/slurm-configs/basic/03-multi-node-basic-nodelist.json
@@ -10,7 +10,7 @@
     "nodelist": "node01,node02",
     "gpus_per_node": 8,
     "time": "24:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": true,
     "network_interface": "eth0"
   },

diff --git a/examples/slurm-configs/basic/03-multi-node-basic.json b/examples/slurm-configs/basic/03-multi-node-basic.json
@@ -12,7 +12,7 @@
     "nodes": 2,
     "gpus_per_node": 8,
     "time": "24:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": true,
     "network_interface": "eth0"
   },

diff --git a/examples/slurm-configs/basic/04-multi-node-advanced.json b/examples/slurm-configs/basic/04-multi-node-advanced.json
@@ -12,7 +12,7 @@
     "nodes": 4,
     "gpus_per_node": 8,
     "time": "48:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "results_dir": "/shared/results",
     "shared_workspace": "/shared/workspace",
     "exclusive": true,

diff --git a/examples/slurm-configs/basic/05-vllm-single-node.json b/examples/slurm-configs/basic/05-vllm-single-node.json
@@ -12,7 +12,7 @@
     "nodes": 1,
     "gpus_per_node": 4,
     "time": "02:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": true
   },
 

diff --git a/examples/slurm-configs/basic/06-vllm-multi-node.json b/examples/slurm-configs/basic/06-vllm-multi-node.json
@@ -19,7 +19,7 @@
     "nodes": 2,
     "gpus_per_node": 4,
     "time": "00:45:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": true,
 
     "_comment_node_check": "Preflight GPU health check (helps avoid OOM from stale processes)",

diff --git a/examples/slurm-configs/basic/07-sglang-single-node.json b/examples/slurm-configs/basic/07-sglang-single-node.json
@@ -12,7 +12,7 @@
     "nodes": 1,
     "gpus_per_node": 4,
     "time": "02:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": true
   },
 

diff --git a/examples/slurm-configs/basic/08-sglang-multi-node.json b/examples/slurm-configs/basic/08-sglang-multi-node.json
@@ -12,7 +12,7 @@
     "nodes": 2,
     "gpus_per_node": 4,
     "time": "04:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": true
   },
 

diff --git a/examples/slurm-configs/basic/cluster-amd-rccl.json b/examples/slurm-configs/basic/cluster-amd-rccl.json
@@ -19,7 +19,7 @@
     "nodes": 1,
     "gpus_per_node": 8,
     "time": "12:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": true
   },
 

diff --git a/examples/slurm-configs/basic/sglang-disagg-custom-split.json b/examples/slurm-configs/basic/sglang-disagg-custom-split.json
@@ -20,7 +20,7 @@
     "nodes": 7,
     "gpus_per_node": 8,
     "time": "04:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": true
   },
 

diff --git a/examples/slurm-configs/basic/sglang-disagg-multi-node.json b/examples/slurm-configs/basic/sglang-disagg-multi-node.json
@@ -19,7 +19,7 @@
     "nodes": 5,
     "gpus_per_node": 8,
     "time": "04:00:00",
-    "output_dir": "./slurm_output",
+    "output_dir": "./slurm_results",
     "exclusive": true
   },
 

diff --git a/src/madengine/cli/app.py b/src/madengine/cli/app.py
@@ -71,6 +71,9 @@ def cli_main() -> None:
     """Entry point for the CLI application."""
     try:
         app()
+    except typer.Exit:
+        # Preserve exit code from run/build/other commands for Jenkins and scripts
+        raise
     except KeyboardInterrupt:
         console.print("\n🛑 [yellow]Operation cancelled by user[/yellow]")
         sys.exit(ExitCode.FAILURE)

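Why an explicit pass-through clause can matter is easiest to see in a stand-alone sketch. It uses a hypothetical `CommandExit` exception as a stand-in for `typer.Exit`, returns the code instead of re-raising so the result is easy to inspect, and assumes that `cli_main` also has a broad `except Exception` handler further down, which this hunk does not show. Whether `typer.Exit` actually reaches `cli_main` depends on how `app()` is invoked.

```python
class CommandExit(Exception):
    """Stand-in for typer.Exit: carries the exit code chosen by a command."""

    def __init__(self, code: int) -> None:
        super().__init__(code)
        self.exit_code = code


def main_without_passthrough(command) -> int:
    """Only a broad handler: every exception collapses to a general failure."""
    try:
        command()
    except Exception:
        return 1
    return 0


def main_with_passthrough(command) -> int:
    """The specific clause comes first, so the command's own code survives."""
    try:
        command()
    except CommandExit as exc:
        return exc.exit_code
    except Exception:
        return 1
    return 0


def failing_run() -> None:
    """Simulate a run command that reports a run failure (code 3)."""
    raise CommandExit(3)
```

With these definitions, `main_without_passthrough(failing_run)` yields `1` while `main_with_passthrough(failing_run)` yields `3`; the ordering of the `except` clauses is what preserves the distinction.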
diff --git a/src/madengine/cli/commands/run.py b/src/madengine/cli/commands/run.py
@@ -21,6 +21,7 @@
 
 from madengine.orchestration.run_orchestrator import RunOrchestrator
 from madengine.core.errors import (
+    BuildError,
     ConfigurationError,
     RuntimeError as MADRuntimeError,
 )
@@ -419,6 +420,13 @@ def run(
 
     except typer.Exit:
         raise
+    except BuildError as e:
+        console.print(f"🔨 [bold red]Build error: {e}[/bold red]")
+        if hasattr(e, "suggestions") and e.suggestions:
+            console.print("\n💡 [cyan]Suggestions:[/cyan]")
+            for suggestion in e.suggestions:
+                console.print(f"  • {suggestion}")
+        raise typer.Exit(ExitCode.BUILD_FAILURE)
     except MADRuntimeError as e:
         # Runtime execution errors
         console.print(f"💥 [bold red]Runtime error: {e}[/bold red]")

diff --git a/src/madengine/cli/utils.py b/src/madengine/cli/utils.py
@@ -352,7 +352,6 @@ def display_performance_table(perf_csv_path: str = "perf.csv", session_start_row
             console.print("[yellow]⚠️  Performance CSV is empty[/yellow]")
             return
 
-        # Get session_start_row to mark current runs (don't filter, just mark)
         total_rows = len(df)
 
         # Try parameter first, then fall back to marker file
@@ -378,7 +377,7 @@ def display_performance_table(perf_csv_path: str = "perf.csv", session_start_row
         perf_table.add_column("Index", justify="right", style="dim")
         perf_table.add_column("Model", style="cyan")
         perf_table.add_column("Topology", justify="center", style="blue")
-        perf_table.add_column("Launcher", justify="center", style="magenta")  # Distributed launcher
+        perf_table.add_column("Launcher", justify="center", style="magenta")
         perf_table.add_column("Deployment", justify="center", style="cyan")
         perf_table.add_column("GPU Arch", style="yellow")
         perf_table.add_column("Performance", justify="right", style="green")
@@ -388,20 +387,23 @@ def display_performance_table(perf_csv_path: str = "perf.csv", session_start_row
         perf_table.add_column("Data Name", style="magenta")
         perf_table.add_column("Data Provider", style="magenta")        
 
-        # Helper function to format duration
+        # Helper function to format duration (accepts float seconds or "Xs" string)
         def format_duration(duration):
-            if pd.isna(duration) or duration == "":
+            if duration is None or pd.isna(duration) or duration == "":
                 return "N/A"
             try:
-                dur = float(duration)
+                if isinstance(duration, str) and duration.strip().endswith("s"):
+                    dur = float(duration.strip()[:-1])
+                else:
+                    dur = float(duration)
                 if dur < 1:
                     return f"{dur*1000:.0f}ms"
                 elif dur < 60:
                     return f"{dur:.2f}s"
                 else:
                     return f"{dur/60:.1f}m"
             except (ValueError, TypeError):
-                return "N/A"
+                return str(duration) if duration else "N/A"
 
         # Helper function to format performance
         def format_performance(perf):
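
The behavior of the patched `format_duration` can be checked in isolation. The sketch below is a stand-alone adaptation, not the code from `utils.py`: it replaces `pd.isna` with a standard-library check so it runs without pandas, and otherwise follows the same branches as the hunk above.

```python
import math


def format_duration(duration):
    """Format seconds (a number or an "Xs" string) as ms, s, or minutes."""
    # Stdlib stand-in for pd.isna: treat None, "" and float NaN as missing.
    if duration is None or duration == "" or (
        isinstance(duration, float) and math.isnan(duration)
    ):
        return "N/A"
    try:
        if isinstance(duration, str) and duration.strip().endswith("s"):
            # Strip the trailing unit, e.g. "12.5s" -> 12.5
            dur = float(duration.strip()[:-1])
        else:
            dur = float(duration)
    except (ValueError, TypeError):
        # Unparseable values are shown as-is instead of being hidden.
        return str(duration) if duration else "N/A"
    if dur < 1:
        return f"{dur * 1000:.0f}ms"
    if dur < 60:
        return f"{dur:.2f}s"
    return f"{dur / 60:.1f}m"
```

Under this sketch, `"12.5s"` is parsed like the float `12.5`, while a value such as `"500ms"` fails to parse after the trailing `s` is stripped and is therefore returned unchanged.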