In [17]:
# Import pieces from codeflare-sdk
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
from codeflare_sdk.cluster.auth import TokenAuthentication

In [18]:
# Create authentication object for oc user permissions
auth = TokenAuthentication(
    token="XXXX",
    server="XXXX",
    skip_tls=True,
)
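
To keep a real token out of the notebook, the credentials could be read from the environment instead of being typed into the cell. The following is a minimal sketch under that assumption: `load_cluster_credentials` is a hypothetical helper, and `OC_TOKEN` and `OC_SERVER` are illustrative variable names, not something codeflare-sdk defines.

```python
import os


def load_cluster_credentials(env=os.environ):
    """Return (token, server) read from environment variables.

    OC_TOKEN and OC_SERVER are hypothetical names chosen for this sketch;
    they are not defined by codeflare-sdk.
    """
    required = ("OC_TOKEN", "OC_SERVER")
    # Fail early with a clear message if either value is missing or empty.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["OC_TOKEN"], env["OC_SERVER"]
```

The returned values could then be passed to the cell above as `TokenAuthentication(token=token, server=server, skip_tls=True)`.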

Here, we define our cluster by specifying the resources our batch workload requires. Creating the cluster object below also generates a corresponding AppWrapper and writes it to a YAML file.

In [19]:
# Create our cluster object and generate its AppWrapper (submitted later by up())
cluster = Cluster(ClusterConfiguration(
    name='mnisttest',
    min_worker=2,
    max_worker=2,
    min_cpus=8,
    max_cpus=8,
    min_memory=16,
    max_memory=16,
    gpu=4,
    instascale=True,
    machine_types=["m5.xlarge", "p3.8xlarge"],
    auth=auth,
))

Written to: mnisttest.yaml


Next, we bring the cluster up. Calling `up()` submits the cluster's AppWrapper YAML to the MCAD queue and starts the process of obtaining our resource cluster.

In [13]:
# Bring up the cluster
cluster.up()

Now we check the status of our resource cluster, repeating the check until it is ready for use.

In [14]:
cluster.is_ready()

(False, <CodeFlareClusterStatus.QUEUED: 2>)

In [15]:
cluster.status()

<RayClusterStatus.FAILED: 'failed'>
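
Rather than re-running the readiness cell by hand, the check can be repeated in a loop. The following is a minimal sketch, assuming only that `is_ready()` returns a `(ready, status)` tuple as in the output above. `wait_until_ready` and its `timeout`, `interval`, and `sleep` parameters are hypothetical and not part of codeflare-sdk.

```python
import time


def wait_until_ready(cluster, timeout=600, interval=10, sleep=time.sleep):
    """Poll cluster.is_ready() until it reports ready or the timeout elapses.

    Assumes is_ready() returns a (ready, status) tuple. Returns the last
    (ready, status) pair observed, so the caller can inspect the status.
    """
    waited = 0
    ready, status = cluster.is_ready()
    while not ready and waited < timeout:
        sleep(interval)          # wait before asking again
        waited += interval
        ready, status = cluster.is_ready()
    return ready, status
```

A call such as `ready, status = wait_until_ready(cluster)` would block for up to ten minutes. Note that this sketch does not stop early on a failed status; it keeps polling until the timeout, so the returned status should still be checked.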

Once our resource cluster is ready, we can submit our batch job (model training on two workers with four GPUs each) directly to the cluster via TorchX.

In [5]:
from codeflare_sdk.jobs.jobs import TorchXJobDefinition
from codeflare_sdk.jobs.config import JobConfiguration

In [6]:
job = TorchXJobDefinition(JobConfiguration(script="mnist.py", requirements="requirements.txt")).submit(cluster)

The Ray scheduler does not support port mapping.


Now we can check the status and logs of our batch job.

In [6]:
job.status()

AppStatus:
  msg: !!python/object/apply:ray.dashboard.modules.job.common.JobStatus
  - RUNNING
  num_restarts: -1
  roles:
  - replicas:
    - hostname: <NONE>
      id: 0
      role: ray
      state: !!python/object/apply:torchx.specs.api.AppState
      - 3
      structured_error_msg: <NONE>
    role: ray
  state: RUNNING (3)
  structured_error_msg: <NONE>
  ui_url: null

In [8]:
print(job.logs())

[RayActor(name='mnist', command=['bash', '-c', "python -m torch.distributed.run --rdzv_backend static --rdzv_endpoint $TORCHX_RANK0_HOST:49782 --rdzv_id 'mnist-fljvj1cqqnsqz' --nnodes 2 --nproc_per_node 1 --node_rank '0' --tee 3 --role '' mnist.py"], env={'LOGLEVEL': 'DEBUG', 'TORCH_DISTRIBUTED_DEBUG': 'DETAIL', 'TORCHX_JOB_ID': 'ray://torchx/mnist-fljvj1cqqnsqz'}, num_cpus=1, num_gpus=0, min_replicas=2), RayActor(name='mnist', command=['bash', '-c', "python -m torch.distributed.run --rdzv_backend static --rdzv_endpoint $TORCHX_RANK0_HOST:49782 --rdzv_id 'mnist-fljvj1cqqnsqz' --nnodes 2 --nproc_per_node 1 --node_rank '1' --tee 3 --role '' mnist.py"], env={'LOGLEVEL': 'DEBUG', 'TORCH_DISTRIBUTED_DEBUG': 'DETAIL', 'TORCHX_JOB_ID': 'ray://torchx/mnist-fljvj1cqqnsqz'}, num_cpus=1, num_gpus=0, min_replicas=2)]
2023-02-20 14:22:16,059	INFO worker.py:1230 -- Using address 10.129.2.23:6379 set in the environment variable RAY_ADDRESS
2023-02-20 14:22:16,059	INFO worker.py:1342 -- Connecting to 

Finally, we bring our resource cluster down, releasing the associated resources and returning everything to the state it was in before the cluster was brought up.

In [20]:
cluster.down()
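
If job submission or training raises an exception, the `cluster.down()` cell may never run, leaving the resources allocated. One way to guard against that outside a notebook is a context manager. This is a minimal sketch: `running_cluster` is a hypothetical helper, not part of codeflare-sdk, and it relies only on the `up()` and `down()` methods used above.

```python
from contextlib import contextmanager


@contextmanager
def running_cluster(cluster):
    """Bring the cluster up and guarantee down() is called on exit."""
    cluster.up()
    try:
        yield cluster
    finally:
        # Runs whether the body finished normally or raised.
        cluster.down()
```

The job steps would then go inside a `with running_cluster(cluster):` block, so teardown happens even when a step fails.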