Merged
26 changes: 13 additions & 13 deletions README.md
@@ -16,19 +16,19 @@ These schemas enable CloudAI to be flexible and compatible with different system


## Support matrix
|Test|Slurm|Kubernetes (experimental)|Standalone|
|---|---|---|---|
|ChakraReplay|✅|❌|❌|
|GPT|✅|❌|❌|
|Grok|✅|❌|❌|
|NCCL|✅|✅|❌|
|NeMo Launcher|✅|❌|❌|
|NeMo Run|✅|❌|❌|
|Nemotron|✅|❌|❌|
|Sleep|✅|✅|✅|
|UCC|✅|❌|❌|
|SlurmContainer|✅|❌|❌|
|MegatronRun (experimental)|✅|❌|❌|
|Test|Slurm|Kubernetes|RunAI|Standalone|
|---|---|---|---|---|
|ChakraReplay|✅|❌|❌|❌|
|GPT|✅|❌|❌|❌|
|Grok|✅|❌|❌|❌|
|NCCL|✅|✅|✅|❌|
|NeMo Launcher|✅|❌|❌|❌|
|NeMo Run|✅|❌|❌|❌|
|Nemotron|✅|❌|❌|❌|
|Sleep|✅|✅|❌|✅|
|UCC|✅|❌|❌|❌|
|SlurmContainer|✅|❌|❌|❌|
|MegatronRun (experimental)|✅|❌|❌|❌|


## Set Up Access to the Private NGC Registry
25 changes: 23 additions & 2 deletions USER_GUIDE.md
@@ -89,7 +89,7 @@ ntasks_per_node = 8
[partitions.<YOUR PARTITION NAME>]
name = "<YOUR PARTITION NAME>"
```
Replace `<YOUR PARTITION NAME>` with the name of the partition you want to use. You can find the partition name by running `sinfo` on the cluster.

#### Step 4: Install Test Requirements
Once all configs are ready, install the test requirements. This is done once, so you can run multiple experiments without reinstalling them. This step requires the system config file from step 3.
@@ -237,12 +237,33 @@ cache_docker_images_locally = true
- **output_path**: Defines the default path where outputs are stored. Whenever a user runs a test scenario, a new subdirectory will be created under this path.
- **default_partition**: Specifies the default partition where jobs are scheduled.
- **partitions**: Describes the available partitions and nodes within those partitions.
- **[optional] groups**: Within the same partition, users can define groups of nodes. The group concept can be used to allocate nodes from specific groups in a test scenario schema. For instance, this feature is useful for specifying topology awareness. Groups represent a logical partitioning of nodes, and users are responsible for ensuring there is no overlap across groups.
- **mpi**: Indicates the Process Management Interface (PMI) implementation to be used for inter-process communication.
- **gpus_per_node** and **ntasks_per_node**: These are Slurm arguments passed to the `sbatch` script and `srun`.
- **cache_docker_images_locally**: Specifies whether CloudAI should cache remote Docker images locally during installation. If set to `true`, CloudAI will cache the Docker images, enabling local access without needing to download them each time a test is run. This approach saves network bandwidth but requires more disk capacity. If set to `false`, CloudAI will allow Slurm to download the Docker images as needed when they are not cached locally by Slurm.
- **global_env_vars**: Lists all global environment variables that will be applied globally whenever tests are run.
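
The fields above can be combined in a single system schema fragment. The sketch below is illustrative only: the cluster, partition, group, and node names are placeholders, and the exact `groups` table syntax is an assumption, not a verbatim copy of CloudAI's schema.

```toml
# Hypothetical Slurm system schema fragment illustrating the fields above.
name = "example-slurm-cluster"
scheduler = "slurm"
install_path = "./install"
output_path = "./results"
default_partition = "main"
cache_docker_images_locally = true

[partitions.main]
name = "main"

# Optional: a logical, non-overlapping group of nodes within the partition
# (group syntax assumed for illustration).
[partitions.main.groups.rack1]
name = "rack1"
nodes = ["node-[001-004]"]

[global_env_vars]
NCCL_IB_TIMEOUT = "20"
```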

## Describing a System for RunAI Scheduler
When using RunAI as the scheduler, you need to specify additional fields in the system schema TOML file. Below is the list of required fields and how to set them:

```toml
name = "runai-cluster"
scheduler = "runai"

install_path = "./install"
output_path = "./results"

base_url = "http://runai.example.com" # The URL of your RunAI system, typically the same as used for the web interface.
user_email = "your_email" # The email address used to log into the RunAI system.
app_id = "your_app_id" # Obtained by creating an application in the RunAI web interface.
app_secret = "your_app_secret" # Obtained together with the app_id.
project_id = "your_project_id" # Project ID assigned or created in the RunAI system (usually an integer).
cluster_id = "your_cluster_id" # Cluster ID in UUID format (e.g., a69928cc-ccaa-48be-bda9-482440f4d855).
```
- After logging into the RunAI web interface, navigate to Access → Applications and create a new application to obtain the `app_id` and `app_secret`.
- Use your assigned project and cluster IDs; contact your administrator if they are not available.
- All other fields follow the same semantics as in the Slurm system schema (e.g., `install_path`, `output_path`).

## Describing a Test Scenario in the Test Scenario Schema
A test scenario is a set of tests with specific dependencies between them. A test scenario is described in a TOML schema file. This is an example of a test scenario file:
```toml
34 changes: 34 additions & 0 deletions conf/common/system/example_runai_cluster.toml
@@ -0,0 +1,34 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name = "example-runai-cluster"
scheduler = "runai"

install_path = "./install_dir"
output_path = "./results"
monitor_interval = 1

base_url = "http://runai.example.com"
user_email = "your_email"
app_id = "your_app_id"
app_secret = "your_app_secret"
project_id = "your_project_id"
cluster_id = "your_cluster_id"

[global_env_vars]
NCCL_IB_GID_INDEX = "3"
NCCL_IB_TIMEOUT = "20"
NCCL_IB_QPS_PER_CONNECTION = "4"
1 change: 1 addition & 0 deletions pyproject.toml
@@ -25,6 +25,7 @@ dependencies = [
"kubernetes==30.1.0",
"pydantic==2.8.2",
"jinja2==3.1.6",
"websockets==15.0.1",
]
[project.scripts]
cloudai = "cloudai.__main__:main"
3 changes: 2 additions & 1 deletion requirements.txt
@@ -4,4 +4,5 @@ tbparse==0.0.8
toml==0.10.2
kubernetes==30.1.0
pydantic==2.8.2
jinja2==3.1.6
websockets==15.0.1
20 changes: 18 additions & 2 deletions src/cloudai/__init__.py
@@ -48,15 +48,18 @@
from ._core.test_template_strategy import TestTemplateStrategy
from .installer.kubernetes_installer import KubernetesInstaller
from .installer.lsf_installer import LSFInstaller
from .installer.runai_installer import RunAIInstaller
from .installer.slurm_installer import SlurmInstaller
from .installer.standalone_installer import StandaloneInstaller
from .parser import Parser
from .runner.kubernetes.kubernetes_runner import KubernetesRunner
from .runner.lsf.lsf_runner import LSFRunner
from .runner.runai.runai_runner import RunAIRunner
from .runner.slurm.slurm_runner import SlurmRunner
from .runner.standalone.standalone_runner import StandaloneRunner
from .systems.kubernetes.kubernetes_system import KubernetesSystem
from .systems.lsf.lsf_system import LSFSystem
from .systems.runai.runai_system import RunAISystem
from .systems.slurm.slurm_system import SlurmSystem
from .systems.standalone_system import StandaloneSystem
from .workloads.chakra_replay import (
@@ -91,6 +94,7 @@
NcclTestJobStatusRetrievalStrategy,
NcclTestKubernetesJsonGenStrategy,
NcclTestPerformanceReportGenerationStrategy,
NcclTestRunAIJsonGenStrategy,
NcclTestSlurmCommandGenStrategy,
)
from .workloads.nemo_launcher import (
@@ -126,6 +130,7 @@
Registry().add_runner("kubernetes", KubernetesRunner)
Registry().add_runner("standalone", StandaloneRunner)
Registry().add_runner("lsf", LSFRunner)
Registry().add_runner("runai", RunAIRunner)

Registry().add_strategy(
CommandGenStrategy, [StandaloneSystem], [SleepTestDefinition], SleepStandaloneCommandGenStrategy
@@ -134,6 +139,7 @@
Registry().add_strategy(CommandGenStrategy, [SlurmSystem], [SleepTestDefinition], SleepSlurmCommandGenStrategy)
Registry().add_strategy(JsonGenStrategy, [KubernetesSystem], [SleepTestDefinition], SleepKubernetesJsonGenStrategy)
Registry().add_strategy(JsonGenStrategy, [KubernetesSystem], [NCCLTestDefinition], NcclTestKubernetesJsonGenStrategy)
Registry().add_strategy(JsonGenStrategy, [RunAISystem], [NCCLTestDefinition], NcclTestRunAIJsonGenStrategy)
Registry().add_strategy(GradingStrategy, [SlurmSystem], [NCCLTestDefinition], NcclTestGradingStrategy)

Registry().add_strategy(
@@ -164,6 +170,7 @@
[GPTTestDefinition, GrokTestDefinition, NemotronTestDefinition],
JaxToolboxSlurmCommandGenStrategy,
)

Registry().add_strategy(
JobIdRetrievalStrategy,
[SlurmSystem],
@@ -184,8 +191,8 @@
Registry().add_strategy(
JobIdRetrievalStrategy, [StandaloneSystem], [SleepTestDefinition], StandaloneJobIdRetrievalStrategy
)

Registry().add_strategy(JobIdRetrievalStrategy, [LSFSystem], [SleepTestDefinition], LSFJobIdRetrievalStrategy)

Registry().add_strategy(
JobStatusRetrievalStrategy,
[KubernetesSystem],
@@ -221,10 +228,16 @@
Registry().add_strategy(
JobStatusRetrievalStrategy, [StandaloneSystem], [SleepTestDefinition], DefaultJobStatusRetrievalStrategy
)

Registry().add_strategy(
JobStatusRetrievalStrategy, [LSFSystem], [SleepTestDefinition], DefaultJobStatusRetrievalStrategy
)
Registry().add_strategy(
JobStatusRetrievalStrategy,
[RunAISystem],
[NCCLTestDefinition],
DefaultJobStatusRetrievalStrategy,
)

Registry().add_strategy(CommandGenStrategy, [SlurmSystem], [UCCTestDefinition], UCCTestSlurmCommandGenStrategy)

Registry().add_strategy(GradingStrategy, [SlurmSystem], [ChakraReplayTestDefinition], ChakraReplayGradingStrategy)
@@ -239,11 +252,13 @@
Registry().add_installer("standalone", StandaloneInstaller)
Registry().add_installer("kubernetes", KubernetesInstaller)
Registry().add_installer("lsf", LSFInstaller)
Registry().add_installer("runai", RunAIInstaller)

Registry().add_system("slurm", SlurmSystem)
Registry().add_system("standalone", StandaloneSystem)
Registry().add_system("kubernetes", KubernetesSystem)
Registry().add_system("lsf", LSFSystem)
Registry().add_system("runai", RunAISystem)

Registry().add_test_definition("UCCTest", UCCTestDefinition)
Registry().add_test_definition("NcclTest", NCCLTestDefinition)
@@ -298,6 +313,7 @@
"PythonExecutable",
"ReportGenerationStrategy",
"Reporter",
"RunAISystem",
"Runner",
"System",
"SystemConfigParsingError",
48 changes: 48 additions & 0 deletions src/cloudai/installer/runai_installer.py
@@ -0,0 +1,48 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging

from cloudai import BaseInstaller, Installable, InstallStatusResult
from cloudai.systems.runai.runai_system import RunAISystem


class RunAIInstaller(BaseInstaller):
"""Installer for RunAI systems."""

def __init__(self, system: RunAISystem):
"""Initialize the RunAIInstaller with a system object."""
super().__init__(system)

def _check_prerequisites(self) -> InstallStatusResult:
logging.info("Checking prerequisites for RunAI installation.")
return InstallStatusResult(True)

def install_one(self, item: Installable) -> InstallStatusResult:
logging.info(f"Installing {item} for RunAI.")
return InstallStatusResult(True)

def uninstall_one(self, item: Installable) -> InstallStatusResult:
logging.info(f"Uninstalling {item} for RunAI.")
return InstallStatusResult(True)

def is_installed_one(self, item: Installable) -> InstallStatusResult:
logging.info(f"Checking if {item} is installed for RunAI.")
return InstallStatusResult(True)

def mark_as_installed_one(self, item: Installable) -> InstallStatusResult:
logging.info(f"Marking {item} as installed for RunAI.")
return InstallStatusResult(True)
15 changes: 15 additions & 0 deletions src/cloudai/runner/runai/__init__.py
@@ -0,0 +1,15 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
27 changes: 27 additions & 0 deletions src/cloudai/runner/runai/runai_job.py
@@ -0,0 +1,27 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass

from cloudai import BaseJob
from cloudai.systems.runai.runai_training import ActualPhase


@dataclass
class RunAIJob(BaseJob):
"""A job class for execution on an RunAI system."""

status: ActualPhase
54 changes: 54 additions & 0 deletions src/cloudai/runner/runai/runai_runner.py
@@ -0,0 +1,54 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
from typing import cast

from cloudai import BaseJob, BaseRunner, TestRun
from cloudai.systems.runai.runai_system import RunAISystem

from .runai_job import RunAIJob


class RunAIRunner(BaseRunner):
"""Class to manage and execute workloads using the RunAI platform."""

def _submit_test(self, tr: TestRun) -> RunAIJob:
logging.info(f"Running test: {tr.name}")
tr.output_path = self.get_job_output_path(tr)
job_spec = tr.test.test_template.gen_json(tr)
logging.debug(f"Generated JSON for test {tr.name}: {job_spec}")

if self.mode == "run":
runai_system = cast(RunAISystem, self.system)
training = runai_system.create_training(job_spec)
job = RunAIJob(test_run=tr, id=training.workload_id, status=training.actual_phase)
logging.info(f"Submitted RunAI job: {job.id}")
return job
else:
raise RuntimeError("Invalid mode for submitting a test.")

async def job_completion_callback(self, job: BaseJob) -> None:
runai_system = cast(RunAISystem, self.system)
job = cast(RunAIJob, job)
workload_id = str(job.id)
runai_system.get_workload_events(workload_id, job.test_run.output_path / "events.txt")
await runai_system.store_logs(workload_id, job.test_run.output_path / "stdout.txt")

def kill_job(self, job: BaseJob) -> None:
runai_system = cast(RunAISystem, self.system)
job = cast(RunAIJob, job)
runai_system.delete_training(str(job.id))
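
The runner above submits a training workload and later reacts to its terminal phase. The snippet below sketches the polling loop such a runner ultimately depends on; it is self-contained and illustrative — the `ActualPhase` values shown are an assumed subset of RunAI's workload phases, and `wait_for_completion` is our name, not CloudAI's API.

```python
import time
from enum import Enum


class ActualPhase(Enum):
    # Illustrative subset of the phases a RunAI workload may report.
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Phases after which the workload will not change state again.
TERMINAL_PHASES = {ActualPhase.COMPLETED, ActualPhase.FAILED}


def wait_for_completion(get_phase, poll_interval: float = 1.0, max_polls: int = 100) -> ActualPhase:
    """Poll `get_phase` until the workload reaches a terminal phase or time out."""
    for _ in range(max_polls):
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        time.sleep(poll_interval)
    raise TimeoutError("workload did not reach a terminal phase in time")
```

In the real runner, `get_phase` would wrap a RunAI status query keyed by the workload ID returned at submission; once a terminal phase is observed, `job_completion_callback` fetches events and logs as shown above.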
2 changes: 2 additions & 0 deletions src/cloudai/systems/__init__.py
@@ -16,12 +16,14 @@

from .kubernetes.kubernetes_system import KubernetesSystem
from .lsf.lsf_system import LSFSystem
from .runai import RunAISystem
from .slurm.slurm_system import SlurmSystem
from .standalone_system import StandaloneSystem

__all__ = [
"KubernetesSystem",
"LSFSystem",
"RunAISystem",
"SlurmSystem",
"StandaloneSystem",
]
Loading