# Setting up a Ray cluster with SmartSim

## 1. Start the cluster
We create a SmartSim experiment, which handles the launch of the Ray cluster.

First, we import the relevant modules and set the number of worker nodes.

In [1]:
from smartsim import Experiment, slurm
from smartsim.ray import RayCluster

NUM_WORKERS = 2
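alloc = None
# To request a Slurm allocation from the notebook (one head node plus NUM_WORKERS workers), uncomment:
# alloc = slurm.get_allocation(nodes=1+NUM_WORKERS, time="12:00:00",
#                              options={"ntasks": str(1+NUM_WORKERS), "partition": "spider", "C": "V100"})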
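
The commented-out allocation request in the cell above sizes the allocation as one head node plus `NUM_WORKERS` worker nodes, with one task per node. As a minimal sketch, that arithmetic can be kept in one place. The helper name `allocation_request` is hypothetical and not part of SmartSim; the partition and constraint defaults are simply the values from the cell above and will differ on other systems.

```python
def allocation_request(num_workers, time="12:00:00", partition="spider", constraint="V100"):
    """Build keyword arguments for an allocation of one head node plus workers.

    Hypothetical helper: it only assembles a dictionary and does not talk to Slurm.
    """
    nodes = 1 + num_workers  # one node for the Ray head, one per worker
    return {
        "nodes": nodes,
        "time": time,
        "options": {"ntasks": str(nodes), "partition": partition, "C": constraint},
    }


request = allocation_request(2)
print(request)
# With SmartSim and Slurm available, the result could be passed on as:
# alloc = slurm.get_allocation(**request)
```

Keeping the node count and `ntasks` derived from the same variable avoids the two drifting apart when `NUM_WORKERS` changes.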

In [2]:
exp = Experiment("ray-cluster", launcher='slurm')
cluster = RayCluster(name="ray-cluster", run_args={"time":"06:00:00", "unbuffered": None}, path='',
                     launcher='slurm', workers=NUM_WORKERS, alloc=alloc, batch=False, ray_num_cpus=56)

if cluster.batch:
    # Activate the conda environment at the top of each batch script
    preamble = ["source ~/.bashrc", "conda activate smartsim"]
    cluster.head_model.batch_settings._preamble = list(preamble)
    if NUM_WORKERS:
        cluster.worker_model.batch_settings._preamble = list(preamble)

exp.generate(cluster, overwrite=True)

11:52:03 nid00000 SmartSim[29596] INFO Working in previously created experiment


In [3]:
exp.start(cluster, block=False, summary=False)

11:52:15 nid00000 SmartSim[29596] INFO Ray cluster launched on nodes: ['nid00000', 'nid00002', 'nid00001']


## 2. Start the Ray driver script

In [4]:
cluster.start_ray_job('/lus/scratch/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/templates/ppo_tune.py')

RayWorker workers produced the following error 
Error: srun: error: nid00002: task 1: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=1292874.32
srun: error: nid00001: task 0: Exited with exit code 1
 
Job status at failure: Failed 
Launcher status at failure: FAILED 
Job returncode: 1 
Error and output file located at: /lus/cls01029/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/ray-cluster/workers
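
The failure message above points to a directory holding the workers' error and output files. A small sketch for inspecting the end of those files from the notebook follows. The helper names `tail_logs` and `print_tails` are hypothetical, and the `*.err` / `*.out` file patterns are an assumption about how the files are named; adjust them to whatever the directory actually contains.

```python
from pathlib import Path


def tail_logs(directory, patterns=("*.err", "*.out"), lines=20):
    """Return the last `lines` lines of each file in `directory` matching `patterns`.

    The default patterns are an assumption about the log file names.
    """
    tails = {}
    for pattern in patterns:
        for log_file in sorted(Path(directory).glob(pattern)):
            text = log_file.read_text(errors="replace")
            tails[log_file.name] = text.splitlines()[-lines:]
    return tails


def print_tails(directory, **kwargs):
    """Print each log tail under a header with the file name."""
    for name, tail in tail_logs(directory, **kwargs).items():
        print(f"==> {name} <==")
        print("\n".join(tail))


# Usage: pass the directory reported in the failure message, for example
# print_tails("<directory from the failure message>")
```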


In [30]:
cluster.start_ray_job('/lus/scratch/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/templates/ppo_train.py')

In [5]:
cluster.start_ray_job('/lus/scratch/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/templates/mnist_pytorch_trainable.py')

## 3. Stop the cluster and release the allocation

In [5]:
exp.stop(cluster)

12:30:50 nid00000 SmartSim[29596] INFO Stopping model head with job name head-CBD4HIKDML7J


In [6]:
if alloc:
    slurm.release_allocation(alloc)

13:49:36 osprey.us.cray.com SmartSim[119244] INFO Releasing allocation: 242490
13:49:36 osprey.us.cray.com SmartSim[119244] INFO Successfully freed allocation 242490


In [6]:
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            242403     bdw18  Chpl-ep  chapelu  R       6:04     16 prod-[0001-0016]
            242404     bdw18 head-CBC arigazzi  R       2:57      1 prod-0017
            242394     clx28 sstsim.x visharma  R      40:31     32 prod-[0065-0096]


In [7]:
!scancel 242404


11:26:17 osprey.us.cray.com SmartSim[75449] INFO head(242404): Failed
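
The manual lookup above (read the `squeue` table, spot the leftover `head-` job, then `scancel` it) can also be sketched in Python. The function `find_job_ids` is a hypothetical helper, not part of SmartSim. It assumes the default `squeue` column layout shown above and job names without whitespace.

```python
def find_job_ids(squeue_output, user, name_prefix=""):
    """Return job IDs owned by `user` whose job name starts with `name_prefix`.

    Assumes the default squeue columns: JOBID PARTITION NAME USER ...
    """
    job_ids = []
    for line in squeue_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        job_id, _partition, name, owner = fields[:4]
        if owner == user and name.startswith(name_prefix):
            job_ids.append(job_id)
    return job_ids


# The squeue table from the cell above, used here as sample input
sample = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            242403     bdw18  Chpl-ep  chapelu  R       6:04     16 prod-[0001-0016]
            242404     bdw18 head-CBC arigazzi  R       2:57      1 prod-0017
            242394     clx28 sstsim.x visharma  R      40:31     32 prod-[0065-0096]
"""

leftover = find_job_ids(sample, user="arigazzi", name_prefix="head-")
print(leftover)  # prints ['242404']
```

On a live system, the sample text could be replaced by the captured output of `squeue`, and each returned ID passed to `scancel`.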
