# Setting up a Ray cluster with SmartSim

In this notebook we will see how to set up a Ray cluster on a system that uses SLURM as its workload manager (WLM).

This notebook is based on [this GitHub repo by NERSC](https://github.com/NERSC/slurm-ray-cluster/blob/master/submit-ray-cluster.sbatch) and on [this comment on a Ray GitHub issue](https://github.com/ray-project/ray/issues/826#issuecomment-522116599).

We will go through four steps:
1. Start the head node
2. Start the workers
3. Run a test workload
4. Stop all nodes
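
As a rough sketch of what steps 1, 2 and 4 look like at the level of the `ray` command line, the helpers below build the argument strings in the same form that is passed to `SrunSettings` later in this notebook. The helper names and the host name `head-node-hostname` are hypothetical, and the `--address` flag is assumed to match the installed Ray release (older releases used a different flag to point workers at the head). Step 3, the test workload, is not covered here.

```python
def ray_head_args(port: int = 6379) -> str:
    """Arguments for `ray start` on the head node (step 1)."""
    return f"start --head --port={port}"


def ray_worker_args(head_host: str, port: int = 6379) -> str:
    """Arguments for `ray start` on a worker that joins the head (step 2)."""
    return f"start --address={head_host}:{port}"


# Step 4: `ray stop` shuts down the Ray processes on the node where it runs.
RAY_STOP_ARGS = "stop"

# "head-node-hostname" is a placeholder, not a real host.
print("ray", ray_head_args())
print("ray", ray_worker_args("head-node-hostname"))
print("ray", RAY_STOP_ARGS)
```

Keeping the argument strings in small functions makes it easy to reuse the same port for the head and for every worker.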

## 1. Start the head node
We set up a SmartSim experiment, which will handle the launch of the Ray head node.

In [1]:
import os
from smartsim import Experiment, constants
from smartsim.settings import SbatchSettings, SrunSettings, RunSettings

ImportError: cannot import name 'bytes' from 'redis._compat' (/lus/scratch/arigazzi/anaconda3/envs/smartsim/lib/python3.7/site-packages/redis/_compat.py)

In [13]:
exp = Experiment("ray_head_exp", launcher='slurm')
# os.makedirs returns None, so keep the path in its own variable.
head_dir = './head_exp/head_exp'
os.makedirs(head_dir, exist_ok=True)

RAY_PORT=6379

run_args = {"nodes": 1,
           "ntasks": 1, # Ray will take care of resources.
           "time": "00:10:00"}
head_settings = SrunSettings("ray", f"start --head --port={RAY_PORT}", run_args=run_args)
sleep_settings = SrunSettings("sleep", "infinity")

head_node_model = exp.create_model("head_node", path='./head_exp', run_settings=head_settings)
sleep_model     = exp.create_model("head_sleep", path='./head_exp', run_settings=sleep_settings)  
head_batch = SbatchSettings(nodes=1, time="00:15:00")

head_ensemble = exp.create_ensemble("head-ens", batch_settings=head_batch)
head_ensemble.add_model(head_node_model)
head_ensemble.add_model(sleep_model)
exp.start(head_ensemble, block=False)

10:54:17 spider-0001 SmartSim[22226] INFO Empty ensemble created for batch launch
10:54:49 spider-0001 SmartSim[22226] INFO head-ens(189221): Completed
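
Before starting workers, it can be useful to confirm that the head node is actually listening on the chosen port. The following is a minimal sketch using only the Python standard library; it is not part of SmartSim or Ray, the function names are hypothetical, and the head node's host name has to be obtained separately (for example from the SLURM job information).

```python
import socket
import time


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host could not be resolved.
        return False


def wait_for_port(host: str, port: int, retries: int = 30, delay: float = 2.0) -> bool:
    """Poll host:port up to `retries` times, sleeping `delay` seconds after each failure."""
    for _ in range(retries):
        if port_is_open(host, port):
            return True
        time.sleep(delay)
    return False


# Example (placeholder host name, not a real node):
# wait_for_port("head-node-hostname", 6379)
```

A successful TCP connection only shows that something is listening on the port; it does not prove that the listener is a healthy Ray head process.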


In [14]:
exp.stop(head_ensemble)

In [None]:
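# Launch a two-node batch ensemble and block until it finishes.
# M1, M2 and test_dir are not defined earlier in this notebook; they must
# be created (e.g. with exp.create_model) before this cell can run.
batch = SbatchSettings(nodes=2, time="00:01:00")
ensemble = exp.create_ensemble("batch-ens", batch_settings=batch)
ensemble.add_model(M1)
ensemble.add_model(M2)
ensemble.set_path(test_dir)

exp.start(ensemble, block=True)
statuses = exp.get_status(ensemble)
assert all(stat == constants.STATUS_COMPLETED for stat in statuses)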