Merged
26 changes: 13 additions & 13 deletions README.md
@@ -16,19 +16,19 @@ These schemas enable CloudAI to be flexible and compatible with different system


## Support matrix
|Test|Slurm|Kubernetes (experimental)|Standalone|
|---|---|---|---|
|ChakraReplay|✅|❌|❌|
|GPT|✅|❌|❌|
|Grok|✅|❌|❌|
|NCCL|✅|✅|❌|
|NeMo Launcher|✅|❌|❌|
|NeMo Run|✅|❌|❌|
|Nemotron|✅|❌|❌|
|Sleep|✅|✅|✅|
|UCC|✅|❌|❌|
|SlurmContainer|✅|❌|❌|
|MegatronRun (experimental)|✅|❌|❌|
|Test|Slurm|Kubernetes|RunAI|Standalone|
|---|---|---|---|---|
|ChakraReplay|✅|❌|❌|❌|
|GPT|✅|❌|❌|❌|
|Grok|✅|❌|❌|❌|
|NCCL|✅|✅|✅|❌|
|NeMo Launcher|✅|❌|❌|❌|
|NeMo Run|✅|❌|❌|❌|
|Nemotron|✅|❌|❌|❌|
|Sleep|✅|✅|❌|✅|
|UCC|✅|❌|❌|❌|
|SlurmContainer|✅|❌|❌|❌|
|MegatronRun (experimental)|✅|❌|❌|❌|


## Set Up Access to the Private NGC Registry
25 changes: 23 additions & 2 deletions USER_GUIDE.md
@@ -89,7 +89,7 @@ ntasks_per_node = 8
[partitions.<YOUR PARTITION NAME>]
name = "<YOUR PARTITION NAME>"
```
Replace `<YOUR PARTITION NAME>` with the name of the partition you want to use. You can find the partition name by running `sinfo` on the cluster.

#### Step 4: Install Test Requirements
Once all configs are ready, install the test requirements. This is done once, so you can run multiple experiments without reinstalling them. This step requires the system config file from step 3.
@@ -237,12 +237,33 @@ cache_docker_images_locally = true
- **output_path**: Defines the default path where outputs are stored. Whenever a user runs a test scenario, a new subdirectory will be created under this path.
- **default_partition**: Specifies the default partition where jobs are scheduled.
- **partitions**: Describes the available partitions and nodes within those partitions.
- **[optional] groups**: Within the same partition, users can define groups of nodes. The group concept can be used to allocate nodes from specific groups in a test scenario schema. For instance, this feature is useful for specifying topology awareness. Groups represent a logical partitioning of nodes, and users are responsible for ensuring there is no overlap across groups.
- **mpi**: Indicates the Process Management Interface (PMI) implementation to be used for inter-process communication.
- **gpus_per_node** and **ntasks_per_node**: These are Slurm arguments passed to the `sbatch` script and `srun`.
- **cache_docker_images_locally**: Specifies whether CloudAI should cache remote Docker images locally during installation. If set to `true`, CloudAI will cache the Docker images, enabling local access without needing to download them each time a test is run. This approach saves network bandwidth but requires more disk capacity. If set to `false`, CloudAI will allow Slurm to download the Docker images as needed when they are not cached locally by Slurm.
- **global_env_vars**: Lists all global environment variables that will be applied globally whenever tests are run.
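
The fields above can be combined in a single system schema fragment. The sketch below is illustrative only: the cluster, partition, group, and node names are placeholders, and the exact `groups` table syntax is an assumption, not a verbatim copy of CloudAI's schema.

```toml
# Hypothetical Slurm system schema fragment illustrating the fields above.
name = "example-slurm-cluster"
scheduler = "slurm"
install_path = "./install"
output_path = "./results"
default_partition = "main"
cache_docker_images_locally = true

[partitions.main]
name = "main"

# Optional: a logical, non-overlapping group of nodes within the partition
# (group syntax assumed for illustration).
[partitions.main.groups.rack1]
name = "rack1"
nodes = ["node-[001-004]"]

[global_env_vars]
NCCL_IB_TIMEOUT = "20"
```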

## Describing a System for RunAI Scheduler
When using RunAI as the scheduler, you need to specify additional fields in the system schema TOML file. Below is the list of required fields and how to set them:

```toml
name = "runai-cluster"
scheduler = "runai"

install_path = "./install"
output_path = "./results"

base_url = "http://runai.example.com" # The URL of your RunAI system, typically the same as used for the web interface.
user_email = "your_email" # The email address used to log into the RunAI system.
app_id = "your_app_id" # Obtained by creating an application in the RunAI web interface.
app_secret = "your_app_secret" # Obtained together with the app_id.
project_id = "your_project_id" # Project ID assigned or created in the RunAI system (usually an integer).
cluster_id = "your_cluster_id" # Cluster ID in UUID format (e.g., a69928cc-ccaa-48be-bda9-482440f4d855).
```
- After logging into the RunAI web interface, navigate to Access → Applications and create a new application to obtain the `app_id` and `app_secret`.
- Use your assigned project and cluster IDs; contact your administrator if they are not available.
- All other fields follow the same semantics as in the Slurm system schema (e.g., `install_path`, `output_path`).

## Describing a Test Scenario in the Test Scenario Schema
A test scenario is a set of tests with specific dependencies between them. A test scenario is described in a TOML schema file. This is an example of a test scenario file:
```toml
34 changes: 34 additions & 0 deletions conf/common/system/example_runai_cluster.toml
@@ -0,0 +1,34 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name = "example-runai-cluster"
scheduler = "runai"

install_path = "./install_dir"
output_path = "./results"
monitor_interval = 1

base_url = "http://runai.example.com"
user_email = "your_email"
app_id = "your_app_id"
app_secret = "your_app_secret"
project_id = "your_project_id"
cluster_id = "your_cluster_id"

[global_env_vars]
NCCL_IB_GID_INDEX = "3"
NCCL_IB_TIMEOUT = "20"
NCCL_IB_QPS_PER_CONNECTION = "4"
1 change: 1 addition & 0 deletions pyproject.toml
@@ -25,6 +25,7 @@ dependencies = [
"kubernetes==30.1.0",
"pydantic==2.8.2",
"jinja2==3.1.6",
"websockets==15.0.1",
]
[project.scripts]
cloudai = "cloudai.__main__:main"
3 changes: 2 additions & 1 deletion requirements.txt
@@ -4,4 +4,5 @@ tbparse==0.0.8
toml==0.10.2
kubernetes==30.1.0
pydantic==2.8.2
jinja2==3.1.6
websockets==15.0.1
20 changes: 18 additions & 2 deletions src/cloudai/__init__.py
@@ -48,15 +48,18 @@
from ._core.test_template_strategy import TestTemplateStrategy
from .installer.kubernetes_installer import KubernetesInstaller
from .installer.lsf_installer import LSFInstaller
from .installer.runai_installer import RunAIInstaller
from .installer.slurm_installer import SlurmInstaller
from .installer.standalone_installer import StandaloneInstaller
from .parser import Parser
from .runner.kubernetes.kubernetes_runner import KubernetesRunner
from .runner.lsf.lsf_runner import LSFRunner
from .runner.runai.runai_runner import RunAIRunner
from .runner.slurm.slurm_runner import SlurmRunner
from .runner.standalone.standalone_runner import StandaloneRunner
from .systems.kubernetes.kubernetes_system import KubernetesSystem
from .systems.lsf.lsf_system import LSFSystem
from .systems.runai.runai_system import RunAISystem
from .systems.slurm.slurm_system import SlurmSystem
from .systems.standalone_system import StandaloneSystem
from .workloads.chakra_replay import (
@@ -91,6 +94,7 @@
NcclTestJobStatusRetrievalStrategy,
NcclTestKubernetesJsonGenStrategy,
NcclTestPerformanceReportGenerationStrategy,
NcclTestRunAIJsonGenStrategy,
NcclTestSlurmCommandGenStrategy,
)
from .workloads.nemo_launcher import (
@@ -126,6 +130,7 @@
Registry().add_runner("kubernetes", KubernetesRunner)
Registry().add_runner("standalone", StandaloneRunner)
Registry().add_runner("lsf", LSFRunner)
Registry().add_runner("runai", RunAIRunner)

Registry().add_strategy(
CommandGenStrategy, [StandaloneSystem], [SleepTestDefinition], SleepStandaloneCommandGenStrategy
@@ -134,6 +139,7 @@
Registry().add_strategy(CommandGenStrategy, [SlurmSystem], [SleepTestDefinition], SleepSlurmCommandGenStrategy)
Registry().add_strategy(JsonGenStrategy, [KubernetesSystem], [SleepTestDefinition], SleepKubernetesJsonGenStrategy)
Registry().add_strategy(JsonGenStrategy, [KubernetesSystem], [NCCLTestDefinition], NcclTestKubernetesJsonGenStrategy)
Registry().add_strategy(JsonGenStrategy, [RunAISystem], [NCCLTestDefinition], NcclTestRunAIJsonGenStrategy)
Registry().add_strategy(GradingStrategy, [SlurmSystem], [NCCLTestDefinition], NcclTestGradingStrategy)

Registry().add_strategy(
@@ -164,6 +170,7 @@
[GPTTestDefinition, GrokTestDefinition, NemotronTestDefinition],
JaxToolboxSlurmCommandGenStrategy,
)

Registry().add_strategy(
JobIdRetrievalStrategy,
[SlurmSystem],
@@ -184,8 +191,8 @@
Registry().add_strategy(
JobIdRetrievalStrategy, [StandaloneSystem], [SleepTestDefinition], StandaloneJobIdRetrievalStrategy
)

Registry().add_strategy(JobIdRetrievalStrategy, [LSFSystem], [SleepTestDefinition], LSFJobIdRetrievalStrategy)

Registry().add_strategy(
JobStatusRetrievalStrategy,
[KubernetesSystem],
@@ -221,10 +228,16 @@
Registry().add_strategy(
JobStatusRetrievalStrategy, [StandaloneSystem], [SleepTestDefinition], DefaultJobStatusRetrievalStrategy
)

Registry().add_strategy(
JobStatusRetrievalStrategy, [LSFSystem], [SleepTestDefinition], DefaultJobStatusRetrievalStrategy
)
Registry().add_strategy(
JobStatusRetrievalStrategy,
[RunAISystem],
[NCCLTestDefinition],
DefaultJobStatusRetrievalStrategy,
)

Registry().add_strategy(CommandGenStrategy, [SlurmSystem], [UCCTestDefinition], UCCTestSlurmCommandGenStrategy)

Registry().add_strategy(GradingStrategy, [SlurmSystem], [ChakraReplayTestDefinition], ChakraReplayGradingStrategy)
@@ -239,11 +252,13 @@
Registry().add_installer("standalone", StandaloneInstaller)
Registry().add_installer("kubernetes", KubernetesInstaller)
Registry().add_installer("lsf", LSFInstaller)
Registry().add_installer("runai", RunAIInstaller)

Registry().add_system("slurm", SlurmSystem)
Registry().add_system("standalone", StandaloneSystem)
Registry().add_system("kubernetes", KubernetesSystem)
Registry().add_system("lsf", LSFSystem)
Registry().add_system("runai", RunAISystem)

Registry().add_test_definition("UCCTest", UCCTestDefinition)
Registry().add_test_definition("NcclTest", NCCLTestDefinition)
@@ -298,6 +313,7 @@
"PythonExecutable",
"ReportGenerationStrategy",
"Reporter",
"RunAISystem",
"Runner",
"System",
"SystemConfigParsingError",
48 changes: 48 additions & 0 deletions src/cloudai/installer/runai_installer.py
@@ -0,0 +1,48 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging

from cloudai import BaseInstaller, Installable, InstallStatusResult
from cloudai.systems.runai.runai_system import RunAISystem


class RunAIInstaller(BaseInstaller):
"""Installer for RunAI systems."""

def __init__(self, system: RunAISystem):
"""Initialize the RunAIInstaller with a system object."""
super().__init__(system)

def _check_prerequisites(self) -> InstallStatusResult:
logging.info("Checking prerequisites for RunAI installation.")
return InstallStatusResult(True)

def install_one(self, item: Installable) -> InstallStatusResult:
logging.info(f"Installing {item} for RunAI.")
return InstallStatusResult(True)

def uninstall_one(self, item: Installable) -> InstallStatusResult:
logging.info(f"Uninstalling {item} for RunAI.")
return InstallStatusResult(True)

def is_installed_one(self, item: Installable) -> InstallStatusResult:
logging.info(f"Checking if {item} is installed for RunAI.")
return InstallStatusResult(True)

def mark_as_installed_one(self, item: Installable) -> InstallStatusResult:
logging.info(f"Marking {item} as installed for RunAI.")
return InstallStatusResult(True)
15 changes: 15 additions & 0 deletions src/cloudai/runner/runai/__init__.py
@@ -0,0 +1,15 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
27 changes: 27 additions & 0 deletions src/cloudai/runner/runai/runai_job.py
@@ -0,0 +1,27 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from dataclasses import dataclass

from cloudai import BaseJob
from cloudai.systems.runai.runai_training import ActualPhase


@dataclass
class RunAIJob(BaseJob):
"""A job class for execution on an RunAI system."""

status: ActualPhase
54 changes: 54 additions & 0 deletions src/cloudai/runner/runai/runai_runner.py
@@ -0,0 +1,54 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
from typing import cast

from cloudai import BaseJob, BaseRunner, TestRun
from cloudai.systems.runai.runai_system import RunAISystem

from .runai_job import RunAIJob


class RunAIRunner(BaseRunner):
"""Class to manage and execute workloads using the RunAI platform."""

def _submit_test(self, tr: TestRun) -> RunAIJob:
logging.info(f"Running test: {tr.name}")
tr.output_path = self.get_job_output_path(tr)
job_spec = tr.test.test_template.gen_json(tr)
logging.debug(f"Generated JSON for test {tr.name}: {job_spec}")

if self.mode == "run":
runai_system = cast(RunAISystem, self.system)
training = runai_system.create_training(job_spec)
job = RunAIJob(test_run=tr, id=training.workload_id, status=training.actual_phase)
logging.info(f"Submitted RunAI job: {job.id}")
return job
else:
raise RuntimeError("Invalid mode for submitting a test.")

async def job_completion_callback(self, job: BaseJob) -> None:
runai_system = cast(RunAISystem, self.system)
job = cast(RunAIJob, job)
workload_id = str(job.id)
runai_system.get_workload_events(workload_id, job.test_run.output_path / "events.txt")
await runai_system.store_logs(workload_id, job.test_run.output_path / "stdout.txt")

def kill_job(self, job: BaseJob) -> None:
runai_system = cast(RunAISystem, self.system)
job = cast(RunAIJob, job)
runai_system.delete_training(str(job.id))
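
The runner above submits a training workload and later reacts to its terminal phase. The snippet below sketches the polling loop such a runner ultimately depends on; it is self-contained and illustrative — the `ActualPhase` values shown are an assumed subset of RunAI's workload phases, and `wait_for_completion` is our name, not CloudAI's API.

```python
import time
from enum import Enum


class ActualPhase(Enum):
    # Illustrative subset of the phases a RunAI workload may report.
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Phases after which the workload will not change state again.
TERMINAL_PHASES = {ActualPhase.COMPLETED, ActualPhase.FAILED}


def wait_for_completion(get_phase, poll_interval: float = 1.0, max_polls: int = 100) -> ActualPhase:
    """Poll `get_phase` until the workload reaches a terminal phase or time out."""
    for _ in range(max_polls):
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        time.sleep(poll_interval)
    raise TimeoutError("workload did not reach a terminal phase in time")
```

In the real runner, `get_phase` would wrap a RunAI status query keyed by the workload ID returned at submission; once a terminal phase is observed, `job_completion_callback` fetches events and logs as shown above.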
2 changes: 2 additions & 0 deletions src/cloudai/systems/__init__.py
@@ -16,12 +16,14 @@

from .kubernetes.kubernetes_system import KubernetesSystem
from .lsf.lsf_system import LSFSystem
from .runai import RunAISystem
from .slurm.slurm_system import SlurmSystem
from .standalone_system import StandaloneSystem

__all__ = [
"KubernetesSystem",
"LSFSystem",
"RunAISystem",
"SlurmSystem",
"StandaloneSystem",
]
Loading