# Setting up a Ray cluster with SmartSim

In this notebook we set up a Ray cluster on a system that uses SLURM as its workload manager (WLM).

This notebook is based on [this GitHub repo by NERSC](https://github.com/NERSC/slurm-ray-cluster/blob/master/submit-ray-cluster.sbatch) and on [this GitHub issue comment in the Ray repository](https://github.com/ray-project/ray/issues/826#issuecomment-522116599).

We will go through four steps:
1. Start the head node
2. Start the workers
3. Run a test workload
4. Stop all nodes

## 1. Start the head node
We set up a SmartSim experiment, which will handle the launch of the Ray head node.
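The head node is started from a shell-script template whose placeholders are filled in from a parameter dictionary when the experiment is generated. The sketch below illustrates that kind of substitution in plain Python. It assumes placeholders are written as `;NAME;`. The `fill_template` helper and the template text are hypothetical, since the real `start-head.sh` is not shown in this notebook, and this is not SmartSim's own implementation.

```python
import re


def fill_template(text, params, tag=";"):
    """Replace every <tag>NAME<tag> placeholder whose NAME is a key of params."""
    pattern = re.compile(re.escape(tag) + r"(\w+)" + re.escape(tag))

    def substitute(match):
        name = match.group(1)
        # Unknown placeholders are left untouched so they are easy to spot.
        return str(params[name]) if name in params else match.group(0)

    return pattern.sub(substitute, text)


# Hypothetical template content, for illustration only.
template = "conda activate ;CONDA_ENV;\nray start --head --port=;RAY_PORT;"
filled = fill_template(template, {"CONDA_ENV": "smartsim", "RAY_PORT": 6379})
print(filled)
```

Leaving unknown placeholders in place, rather than raising an error, makes a missing parameter visible in the generated script.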

In [1]:
import os
from smartsim import Experiment
from smartsim.settings import SbatchSettings, SrunSettings, RunSettings
import time
import uuid
import re

In [2]:
# The experiment is local because we are on a compute node
exp = Experiment("ray_head_exp", launcher='local')

RAY_PORT = 6379
REDIS_PW = str(uuid.uuid4())  # random password, stored as a plain string

shell_script = "start-head.sh"

head_settings = RunSettings("bash", shell_script)

head_node_params = {"RAY_PORT": RAY_PORT, "REDIS_PASSWORD": REDIS_PW, "CONDA_ENV": "smartsim"}
head_node_model = exp.create_model("head_node", path='/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/head_exp',
                                   run_settings=head_settings, params=head_node_params)
head_node_model.attach_generator_files(to_configure=['./templates/'+shell_script])
exp.generate(head_node_model, overwrite=True)

exp.start(head_node_model, block=False, summary=False)

time.sleep(1)
head_log = os.path.join(head_node_model.path, "head_node.out")
while not os.path.isfile(head_log):
    time.sleep(1)

head_ip = None
while head_ip is None:
    time.sleep(5)
    with open(head_log) as fp:
        for line in fp:
            # Strip ANSI color codes before searching the log line
            plain_line = re.sub('\033\\[([0-9]+)(;[0-9]+)*m', '', line)
            if "Local node IP:" in plain_line:
                matches = re.search(r'(?<=Local node IP: ).*', plain_line)
                head_ip = matches.group()
                print(f"Ray cluster's head is running at {head_ip}")
                break

08:56:40 prod-0001 SmartSim[21136] INFO Working in previously created experiment
Ray cluster's head is running at 10.10.1.5
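The log-parsing step above can also be written as a small function, which makes it easy to check on sample text without a running cluster. This is a minimal sketch. The name `find_head_ip` and the sample log lines are made up for illustration. Unlike the cell above, it captures only the non-whitespace token after the label rather than the rest of the line.

```python
import re

# Matches ANSI color sequences such as "\x1b[32m" or "\x1b[36;1m".
ANSI_COLOR = re.compile(r"\x1b\[[0-9]+(?:;[0-9]+)*m")
HEAD_IP = re.compile(r"Local node IP: (\S+)")


def find_head_ip(lines):
    """Return the address logged after 'Local node IP:', or None if absent."""
    for line in lines:
        plain = ANSI_COLOR.sub("", line)
        match = HEAD_IP.search(plain)
        if match:
            return match.group(1)
    return None


# Made-up log lines with color codes, for illustration only.
sample_log = [
    "\x1b[36;1mStarting head node\x1b[0m\n",
    "\x1b[32mLocal node IP: \x1b[1m10.10.1.5\x1b[0m\n",
]
print(find_head_ip(sample_log))
```

Returning `None` when the line has not been written yet lets the caller keep polling the file.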


In [36]:
# Run this cell only when the head node should be shut down
exp.stop(head_node_model)

13:52:12 prod-0125 SmartSim[3038] INFO Stopping model head_node with job name head_node-CAYQ3OBWQN8H


The head node is now running. The next step is to start the workers!

## 2. Start the worker nodes

We will start the workers as a batch. Each worker has to start on a different node, as a single task. We will rely on `srun` for this.
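To see what the run arguments in the next cell amount to, the sketch below turns such a dictionary into `srun` long options, treating a value of `None` as a flag without an argument. The helper `to_srun_flags` is hypothetical and only illustrates the convention. SmartSim formats the command itself, and its exact output may differ.

```python
def to_srun_flags(run_args):
    """Format a dict of long options; a value of None means a bare flag."""
    flags = []
    for name, value in run_args.items():
        if value is None:
            flags.append(f"--{name}")
        else:
            flags.append(f"--{name}={value}")
    return flags


# A reduced version of the worker arguments, for two nodes.
example_args = {
    "nodes": 2,
    "ntasks-per-node": 1,
    "ntasks": 2,
    "overcommit": None,
    "unbuffered": None,
}
command = ["srun", *to_srun_flags(example_args), "bash", "start-worker.sh"]
print(" ".join(command))
```

With one task per node and as many tasks as nodes, each worker process gets a node to itself, and Ray manages the cores inside that node.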

In [11]:
exp_workers = Experiment("ray_worker_exp", launcher='slurm')

num_worker_nodes = 16
worker_run_args = {"nodes": num_worker_nodes,
                   "ntasks-per-node": 1, # Ray will take care of resources.
                   "ntasks": num_worker_nodes,
                   "oversubscribe": None,
                   "overcommit": None,
                   "time": "01:00:00",
                   "unbuffered": None,
                   "cpus-per-task": 32}

conda_settings_1 = RunSettings("source", "~/.bashrc", block_in_batch=True, expand_exe=False)
worker_settings = SrunSettings("bash", "start-worker.sh",
                               expand_exe=False, block_in_batch=False, run_args=worker_run_args)

conda_model_1 = exp_workers.create_model("conda_sh", path="/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray",
                                run_settings=conda_settings_1)


worker_node_params = {"HEAD_ADDRESS": head_ip+":"+str(RAY_PORT), "REDIS_PASSWORD": REDIS_PW, "CONDA_ENV": "smartsim"}
worker_node_model = exp_workers.create_model("worker_nodes", path='/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/head_exp',
                                   run_settings=worker_settings, params=worker_node_params)
worker_node_model.attach_generator_files(to_configure=['./templates/start-worker.sh'])
exp_workers.generate(worker_node_model, overwrite=True)
    
worker_batch = SbatchSettings(nodes=num_worker_nodes, time="01:00:00")
worker_ensemble = exp_workers.create_ensemble("worker-ens", batch_settings=worker_batch)
worker_ensemble.add_model(conda_model_1)
worker_ensemble.add_model(worker_node_model)

exp_workers.start(worker_ensemble, block=False, summary=False)

10:27:41 prod-0001 SmartSim[21136] INFO Working in previously created experiment
10:27:41 prod-0001 SmartSim[21136] INFO Empty ensemble created for batch launch


In [10]:
# Run this cell only when the workers should be shut down
exp_workers.stop(worker_ensemble)

10:27:04 prod-0001 SmartSim[21136] INFO Stopping model worker-ens with job name worker-ens-CAZFJNPKABQ2


The workers are now running. Let's run a test script!

## 3. Execute script

The script will run an MNIST training. It starts locally and distributes the work across the workers. We only need to supply the cluster address and the Redis password.
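On the script side, the two values arrive as command-line options. The sketch below shows one way a script could read them with `argparse`. The function name `parse_cluster_args` and the example values are assumptions, and the actual `ppo_tune.py` may handle its arguments differently.

```python
import argparse


def parse_cluster_args(argv=None):
    """Parse the cluster address and password passed on the command line."""
    parser = argparse.ArgumentParser(description="Connect to a running Ray cluster")
    parser.add_argument("--ray-address", required=True, help="head node as host:port")
    parser.add_argument("--redis-password", required=True)
    return parser.parse_args(argv)


# Example values only; in the notebook they come from head_ip, RAY_PORT and REDIS_PW.
args = parse_cluster_args(["--ray-address=10.10.1.5:6379", "--redis-password=example"])
host, port = args.ray_address.rsplit(":", 1)
print(host, int(port))

# A real script would now connect to the cluster by passing args.ray_address to
# ray.init(); the keyword used for the password depends on the Ray version.
```

Passing `argv` explicitly keeps the parser testable from a notebook cell, where `sys.argv` belongs to the kernel.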

In [12]:
mnist_exp = Experiment("MNIST", launcher='local')


mnist_exe_args = f"ppo_tune.py --ray-address={head_ip}:{RAY_PORT} --redis-password={REDIS_PW}"

# mnist_run_args = {"nodes": 1,
#                    "ntasks-per-node": 1, # Ray will take care of resources.
#                    "ntasks": 1,
#                    "oversubscribe": None,
#                    "overcommit": None,
#                    "time": "01:00:00",
#                    "unbuffered": None,
#                    "cpus-per-task": 36}

#mnist_params = {"HEAD_ADDRESS": head_ip+":"+str(RAY_PORT), "REDIS_PASSWORD": REDIS_PW}
mnist_settings = RunSettings("/lus/scratch/arigazzi/anaconda3/envs/smartsim/bin/python", mnist_exe_args, expand_exe=False)#, run_args=mnist_run_args)
mnist_model = mnist_exp.create_model("MNIST-test", path='./mnist_test',
                                     run_settings = mnist_settings)#, params=mnist_params)
mnist_model.attach_generator_files(to_copy=['./templates/ppo_tune.py'])#, to_configure=['./templates/start-script.sh'])
mnist_exp.generate(mnist_model, overwrite=True)
mnist_exp.start(mnist_model, summary=True, block=False)

10:27:56 prod-0001 SmartSim[21136] INFO Working in previously created experiment


=== LAUNCH SUMMARY ===
Experiment: MNIST
Experiment Path: /lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/MNIST
Launching with: local
# of Ensembles: 0
# of Models: 1
Database: no

=== MODELS ===
MNIST-test
Model Parameters: 
{}
Model Run Settings: 
Executable: /lus/scratch/arigazzi/anaconda3/envs/smartsim/bin/python
Executable arguments: ['ppo_tune.py', '--ray-address=10.10.1.5:6379', '--redis-password=6389b11d-fcff-4125-ad82-13a3db92eefb']

10:58:48 prod-0001 SmartSim[21136] INFO MNIST-test(34525): Completed


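## 4. Stop all nodes

Once the workload has completed, we stop the experiments and check the queue for leftover jobs.

In [11]: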
mnist_exp.stop(mnist_model)

# exp_workers.stop(worker_ensemble)
# exp.stop(head_node_model)

In [5]:
mnist_exp.poll()

In [6]:
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            215589     bdw18 interact arigazzi  R      40:50      1 prod-0001
            215592     bdw18 worker-e arigazzi  R      37:35      2 prod-[0018-0019]
            215597    spider     make alazzaro  R      11:31      1 spider-0002


In [8]:
!scancel 214555
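Before cancelling by hand, it helps to list which queued jobs belong to you. The sketch below parses `squeue` text like the output shown above and returns the job IDs of one user. The helper `jobs_for_user` is hypothetical, and it assumes no column value contains spaces.

```python
def jobs_for_user(squeue_output, user):
    """Return the JOBID values of all rows whose USER column equals user."""
    lines = squeue_output.strip().splitlines()
    header = lines[0].split()
    id_col, user_col = header.index("JOBID"), header.index("USER")
    jobs = []
    for line in lines[1:]:
        fields = line.split()
        # Skip short or malformed rows instead of failing on them.
        if len(fields) > max(id_col, user_col) and fields[user_col] == user:
            jobs.append(fields[id_col])
    return jobs


# The queue listing shown above, pasted as text.
listing = """
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            215589     bdw18 interact arigazzi  R      40:50      1 prod-0001
            215592     bdw18 worker-e arigazzi  R      37:35      2 prod-[0018-0019]
            215597    spider     make alazzaro  R      11:31      1 spider-0002
"""
print(jobs_for_user(listing, "arigazzi"))
```

In this listing one of the two jobs is the interactive session that hosts the notebook, so the IDs should be checked before they are passed to `scancel`. On the command line, `squeue -u <user> -h -o %i` prints the same IDs directly.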