# Setting up a Ray cluster with SmartSim

This notebook shows how to set up a Ray cluster on a system that uses SLURM as its workload manager (WLM).

This notebook is based on [this GitHub repository by NERSC](https://github.com/NERSC/slurm-ray-cluster/blob/master/submit-ray-cluster.sbatch) and on [this comment on a Ray GitHub issue](https://github.com/ray-project/ray/issues/826#issuecomment-522116599).

The setup takes four steps:
1. Start the head node
2. Start the workers
3. Run a test workload
4. Stop all nodes

## 1. Start the head node
We set up a SmartSim experiment, which handles the launch of the Ray head node.

In [20]:
import os
from smartsim import Experiment, constants
from smartsim.settings import SbatchSettings, SrunSettings, RunSettings
import time
import uuid

In [21]:
!pwd

/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray


In [36]:
exp = Experiment("ray_head_exp", launcher='slurm')
# os.makedirs returns None, so keep the path in its own variable before creating it.
head_dir = '/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/head_exp/head_exp'
os.makedirs(head_dir, exist_ok=True)

RAY_PORT=6379

head_run_args = {"nodes": 1,
                 "ntasks": 1,  # Ray will take care of resources.
                 "time": "00:10:00",
                 "unbuffered": None}

conda_settings_1 = RunSettings("source", "~/.bashrc", block_in_batch=True, expand_exe=False)
conda_settings_2 = RunSettings("conda", "activate ray", block_in_batch=True, expand_exe=False)
head_settings = SrunSettings("bash", "start-head.sh",
                             expand_exe=False, block_in_batch=False, run_args=head_run_args)
sleep_settings = RunSettings("sleep", "infinity")

conda_model_1 = exp.create_model("conda_sh", path="/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray",
                                run_settings=conda_settings_1)
conda_model_2 = exp.create_model("conda_switch", path="/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray",
                                 run_settings=conda_settings_2)
# Pass the password as a plain string rather than a UUID object.
head_node_params = {"RAY_PORT": RAY_PORT, "REDIS_PASSWORD": str(uuid.uuid4())}
head_node_model = exp.create_model("head_node", path='/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/head_exp',
                                   run_settings=head_settings, params=head_node_params)
head_node_model.attach_generator_files(to_configure=['./templates/start-head.sh'])
exp.generate(head_node_model, overwrite=True)

sleep_model = exp.create_model("head_sleep", path='/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/head_exp',
                               run_settings=sleep_settings)
head_batch = SbatchSettings(nodes=1, time="00:15:00")

head_ensemble = exp.create_ensemble("head-ens", batch_settings=head_batch)
head_ensemble.add_model(conda_model_1)
head_ensemble.add_model(conda_model_2)
head_ensemble.add_model(head_node_model)
head_ensemble.add_model(sleep_model)

exp.start(head_ensemble, block=False, summary=False)

# `run_args` was never defined in this notebook; reuse the head node's run arguments.
head_stop_settings = SrunSettings("/lus/scratch/arigazzi/anaconda3/envs/ray/bin/ray", "stop", run_args=head_run_args)



14:09:23 spider-0001 SmartSim[72234] INFO Working in previously created experiment
14:09:23 spider-0001 SmartSim[72234] INFO Empty ensemble created for batch launch
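
The cell above passes `RAY_PORT` and `REDIS_PASSWORD` as `params` and registers `./templates/start-head.sh` as a file to configure, so `exp.generate` writes a copy of the script with those values filled in. The sketch below imitates that substitution in plain Python to make the mechanism concrete. It assumes the `;NAME;` tag convention, and the template text is hypothetical, since the real `start-head.sh` is not shown in this notebook; `fill_template` is an illustrative helper, not a SmartSim function.

```python
import re
import uuid

# Hypothetical template text: the real ./templates/start-head.sh is not shown
# in this notebook, so this content is an assumption for illustration only.
TEMPLATE = """#!/bin/bash
ray start --head --port=;RAY_PORT; --redis-password=;REDIS_PASSWORD; --block
"""


def fill_template(text, params, tag=";"):
    """Replace every <tag>NAME<tag> marker in text with str(params[NAME])."""
    pattern = re.compile(re.escape(tag) + r"(\w+)" + re.escape(tag))

    def substitute(match):
        name = match.group(1)
        if name not in params:
            # Fail loudly rather than leave an unfilled marker in the script.
            raise KeyError(f"no value supplied for template tag {name!r}")
        return str(params[name])

    return pattern.sub(substitute, text)


params = {"RAY_PORT": 6379, "REDIS_PASSWORD": str(uuid.uuid4())}
script = fill_template(TEMPLATE, params)
print(script)
```

Raising on an unknown tag is a design choice of this sketch: a leftover `;NAME;` marker in a shell script would otherwise be read as a command separator and fail in a less obvious way.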


In [4]:
exp.poll()

11:17:06 spider-0001 SmartSim[72234] INFO head-ens(195016): New
11:17:16 spider-0001 SmartSim[72234] INFO head-ens(195016): Running
11:17:26 spider-0001 SmartSim[72234] INFO head-ens(195016): Running
11:17:36 spider-0001 SmartSim[72234] INFO head-ens(195016): Running


KeyboardInterrupt: 
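
`exp.poll()` keeps reporting the job status until it is interrupted, which is why the cell above ends in a `KeyboardInterrupt`. When the goal is only to wait until the head node is running, a bounded wait may be more convenient. The following is a minimal sketch, not part of SmartSim: `wait_for_status` and its `get_status` argument are hypothetical names, and the status strings are the ones that appear in the log above.

```python
import time


def wait_for_status(get_status, wanted, timeout=120.0, interval=10.0, sleep=time.sleep):
    """Call get_status() until it returns a value in `wanted` or `timeout` seconds pass.

    Returns the last status seen, whether or not it is in `wanted`.
    """
    deadline = time.monotonic() + timeout
    status = get_status()
    while status not in wanted and time.monotonic() < deadline:
        sleep(interval)
        status = get_status()
    return status


# Stand-in for a real status query, replaying the sequence seen in the log above.
statuses = iter(["New", "Running", "Running"])
result = wait_for_status(lambda: next(statuses), {"Running"}, timeout=5.0, interval=0.0)
print(result)  # prints: Running
```

In this notebook, `get_status` could wrap a call such as `exp.get_status(head_ensemble)`. The exact shape of its return value depends on the SmartSim version, so check it before relying on this pattern.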

In [37]:
exp.stop(head_ensemble)

14:10:58 spider-0001 SmartSim[72234] INFO Stopping model head-ens with job name head-ens-CAPDVE1BKQ0B


In [38]:
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            195103     clx28 sstsim.x visharma  R    1:42:11      2 prod-[0065-0066]
            193214      full sstsim.x     tshi  R   14:37:23     16 prod-[0003-0018]
            193544    spider interact arigazzi  R    9:46:05      1 spider-0001
            194213    spider pharml-b   jbalma  R    6:34:48      8 spider-[0005-0010,0014-0015]
