# Setting up a Ray cluster with SmartSim

This notebook shows how to set up a Ray cluster on a system that uses Slurm as its workload manager (WLM).

This notebook is based on [this batch script from a GitHub repository by NERSC](https://github.com/NERSC/slurm-ray-cluster/blob/master/submit-ray-cluster.sbatch) and on [this comment on a Ray GitHub issue](https://github.com/ray-project/ray/issues/826#issuecomment-522116599).

The setup consists of four steps:
1. Start the head node
2. Start the workers
3. Run a test workload
4. Stop all nodes

## 1. Start the head node
We set up a SmartSim experiment, which will handle the launch of the Ray head node.
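The head node is launched from a template script (`start-head.sh` in this notebook) whose placeholders SmartSim fills in from the model parameters before launch. The template itself is not reproduced here, so the following is only a minimal sketch of the idea: the template text, the `;NAME;` placeholder syntax, and the helper `render_template` are all assumptions made for illustration, not SmartSim's actual implementation.

```python
import re

# Hypothetical template text; the real start-head.sh is not shown in this notebook.
TEMPLATE = """#!/bin/bash
conda activate ;CONDA_ENV;
ray start --head --port=;RAY_PORT; --redis-password=;REDIS_PASSWORD; --block
"""


def render_template(text, params, tag=";"):
    """Replace every <tag>NAME<tag> placeholder with str(params[NAME])."""
    pattern = re.compile(re.escape(tag) + r"(\w+)" + re.escape(tag))

    def substitute(match):
        name = match.group(1)
        # Leave unknown placeholders untouched so mistakes stay visible.
        return str(params[name]) if name in params else match.group(0)

    return pattern.sub(substitute, text)


# Illustrative values only; the notebook generates the password with uuid.
example_params = {"RAY_PORT": 6379, "REDIS_PASSWORD": "secret", "CONDA_ENV": "ray"}
rendered = render_template(TEMPLATE, example_params)
print(rendered)
```

Keeping the port and password as parameters means the same template can be reused for every run, with only the `params` dictionary changing.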

In [1]:
import os
from smartsim import Experiment
from smartsim.settings import SbatchSettings, SrunSettings, RunSettings
import time
import uuid
import re

In [2]:
# The experiment is local because we are on a compute node
exp = Experiment("ray_head_exp", launcher='local')

RAY_PORT = 6379
# Keep the password as a plain string so it can be reused in commands later
REDIS_PW = str(uuid.uuid4())

head_settings = RunSettings("bash", "start-head.sh")

head_node_params = {"RAY_PORT": RAY_PORT, "REDIS_PASSWORD": REDIS_PW, "CONDA_ENV": "ray"}
head_node_model = exp.create_model("head_node", path='/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/head_exp',
                                   run_settings=head_settings, params=head_node_params)
head_node_model.attach_generator_files(to_configure=['./templates/start-head.sh'])
exp.generate(head_node_model, overwrite=True)

exp.start(head_node_model, block=False, summary=False)

time.sleep(1)
head_log = os.path.join(head_node_model.path, "head_node.out")
while not os.path.isfile(head_log):
    time.sleep(1)

head_ip = None
while head_ip is None:
    time.sleep(5)
    with open(head_log) as fp:
        line = fp.readline()
        while line:
            plain_line = re.sub('\033\\[([0-9]+)(;[0-9]+)*m', '', line) 
            if "Local node IP:" in plain_line:
                matches=re.search(r'(?<=Local node IP: ).*', plain_line)
                head_ip = matches.group()
                print(f"Ray cluster's head is running at {head_ip}")
            line = fp.readline()

07:54:45 prod-0127 SmartSim[49455] INFO Working in previously created experiment
Ray cluster's head is running at 10.10.2.67
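The polling loops above wait indefinitely if the head node never starts. As a hedged alternative, the sketch below does the same log parsing with a timeout. The helper names `find_head_ip` and `wait_for_head_ip` are hypothetical, and the sketch assumes the log contains a line of the form `Local node IP: <address>`, as in the cell above. Unlike the original regular expression, it captures only the first whitespace-delimited token after the label.

```python
import os
import re
import time

# Matches ANSI color sequences such as ESC[32m or ESC[36;1m
ANSI_COLOR = re.compile(r"\x1b\[[0-9]+(?:;[0-9]+)*m")
IP_LINE = re.compile(r"Local node IP: (\S+)")


def find_head_ip(log_text):
    """Return the address following 'Local node IP: ', or None if absent."""
    for line in log_text.splitlines():
        match = IP_LINE.search(ANSI_COLOR.sub("", line))
        if match:
            return match.group(1)
    return None


def wait_for_head_ip(log_path, timeout=120.0, interval=1.0):
    """Poll log_path until the address appears; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.isfile(log_path):
            with open(log_path) as fp:
                ip = find_head_ip(fp.read())
            if ip is not None:
                return ip
        time.sleep(interval)
    raise TimeoutError(f"no 'Local node IP:' line in {log_path} after {timeout} s")
```

In the notebook, this could be called as `head_ip = wait_for_head_ip(head_log)` once the model has been started.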


In [24]:
# Stops the head node (step 4). Run this only after the test workload has finished.
exp.stop(head_node_model)


The head node is now running. The next step is to start the workers.

## 2. Start the worker nodes

We start the workers as a single batch job. Each worker must run on its own node as a single task, so we launch them with `srun`.

In [26]:
exp_workers = Experiment("ray_worker_exp", launcher='slurm')

worker_run_args = {"nodes": 3,
                   "ntasks-per-node": 1, # Ray will take care of resources.
                   "ntasks": 3,
                   "oversubscribe": None,
                   "overcommit": None,
                   "time": "01:00:00",
                   "unbuffered": None,
                   "cpus-per-task": 36}

conda_settings_1 = RunSettings("source", "~/.bashrc", block_in_batch=True, expand_exe=False)
conda_settings_2 = RunSettings("conda", "activate ray", block_in_batch=True, expand_exe=False)
worker_settings = SrunSettings("bash", "start-worker.sh",
                               expand_exe=False, block_in_batch=False, run_args=worker_run_args)

conda_model_1 = exp_workers.create_model("conda_sh", path="/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray",
                                run_settings=conda_settings_1)
conda_model_2 = exp_workers.create_model("conda_switch", path="/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray",
                                 run_settings=conda_settings_2)

worker_node_params = {"HEAD_ADDRESS": head_ip+":"+str(RAY_PORT), "REDIS_PASSWORD": REDIS_PW}
worker_node_model = exp_workers.create_model("worker_nodes", path='/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/head_exp',
                                   run_settings=worker_settings, params=worker_node_params)
worker_node_model.attach_generator_files(to_configure=['./templates/start-worker.sh'])
exp_workers.generate(worker_node_model, overwrite=True)
    
worker_batch = SbatchSettings(nodes=3, time="01:00:00")
worker_ensemble = exp_workers.create_ensemble("worker-ens", batch_settings=worker_batch)
worker_ensemble.add_model(conda_model_1)
worker_ensemble.add_model(conda_model_2)
worker_ensemble.add_model(worker_node_model)

exp_workers.start(worker_ensemble, block=False, summary=False)


09:22:03 prod-0127 SmartSim[49455] INFO Working in previously created experiment
09:22:03 prod-0127 SmartSim[49455] INFO Empty ensemble created for batch launch
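To make the run arguments of the worker step concrete, the sketch below shows how such a dictionary could map to an `srun` command line: keys with a value become `--key=value`, and keys set to `None` become bare flags. The helper `format_srun_args` is hypothetical and written only for illustration; SmartSim's own formatting may differ in detail.

```python
def format_srun_args(run_args):
    """Turn a run_args-style dict into a list of srun long options."""
    flags = []
    for name, value in run_args.items():
        if value is None:
            flags.append(f"--{name}")          # bare flag, e.g. --overcommit
        else:
            flags.append(f"--{name}={value}")  # option with a value
    return flags


# Same values as the worker run arguments used in this step
run_args = {"nodes": 3,
            "ntasks-per-node": 1,
            "ntasks": 3,
            "oversubscribe": None,
            "overcommit": None,
            "time": "01:00:00",
            "unbuffered": None,
            "cpus-per-task": 36}

command = ["srun", *format_srun_args(run_args), "bash", "start-worker.sh"]
print(" ".join(command))
# prints: srun --nodes=3 --ntasks-per-node=1 --ntasks=3 --oversubscribe --overcommit --time=01:00:00 --unbuffered --cpus-per-task=36 bash start-worker.sh
```

With three nodes and one task per node, each worker script runs exactly once per node, and Ray manages the 36 CPUs inside each task.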


In [27]:
# Stops the workers (step 4). Run this only after the test workload has finished.
exp_workers.stop(worker_ensemble)

09:23:05 prod-0127 SmartSim[49455] INFO Stopping model worker-ens with job name worker-ens-CAV657IHT3WI


The workers are now running. Next, we run a test script.

## 3. Execute script

The script runs an MNIST training. It starts as a single task and distributes the work across the workers; we only need to pass it the cluster address and the Redis password.
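The test script itself is not reproduced in this notebook. As a rough sketch of how a script might accept the two options passed to it, the example below parses `--ray-address` and `--redis-password` with `argparse`. The function names are hypothetical, and the `ray.init` call assumes a Ray release that accepts the `_redis_password` keyword (older releases used `redis_password`).

```python
import argparse


def parse_cluster_args(argv=None):
    """Parse the connection options for an already running Ray cluster."""
    parser = argparse.ArgumentParser(description="Connect to an existing Ray cluster")
    parser.add_argument("--ray-address", required=True,
                        help="head node address as host:port")
    parser.add_argument("--redis-password", required=True,
                        help="password chosen when the head node was started")
    return parser.parse_args(argv)


def connect(args):
    """Attach to the cluster instead of starting a new local Ray instance."""
    # Imported lazily so the parsing part works without Ray installed.
    import ray
    # Assumption: this Ray version accepts the _redis_password keyword.
    ray.init(address=args.ray_address, _redis_password=args.redis_password)


if __name__ == "__main__":
    connect(parse_cluster_args())
```

Passing an explicit address is what makes the script join the cluster started in the previous steps rather than spawn its own single-node Ray instance.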

In [23]:
mnist_exp = Experiment("MNIST", launcher='slurm')
mnist_exe_args = "activate ray && python ppo_tune.py --ray-address="+head_ip+":"+str(RAY_PORT)+" --redis-password="+str(REDIS_PW)

mnist_run_args = {"nodes": 1,
                  "ntasks-per-node": 1, # Ray will take care of resources.
                  "ntasks": 1,
                  "oversubscribe": None,
                  "overcommit": None,
                  "time": "01:00:00",
                  "unbuffered": None,
                  "cpus-per-task": 36}

mnist_settings = SrunSettings("conda", mnist_exe_args, expand_exe=False, run_args=mnist_run_args)
mnist_model = mnist_exp.create_model("MNIST-test", path='./mnist_test',
                                     run_settings = mnist_settings)
mnist_model.attach_generator_files(to_copy=['./templates/ppo_tune.py'])
mnist_exp.generate(mnist_model, overwrite=True)
mnist_exp.start(mnist_model, summary=True)

09:06:52 prod-0127 SmartSim[49455] INFO Working in previously created experiment

=== LAUNCH SUMMARY ===
Experiment: MNIST
Experiment Path: /lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/MNIST
Launching with: slurm
# of Ensembles: 0
# of Models: 1
Database: no

=== MODELS ===
MNIST-test
Model Parameters: 
{}
Model Run Settings: 
Executable: conda
Executable arguments: ['activate', 'ray', '&&', 'python', 'ppo_tune.py', '--ray-address=10.10.2.67:6379', '--redis-password=020c2b89-3506-42a8-b14d-2c448bc51ee7']
Run Command: srun
Run arguments: {'cpus-per-task': 36,
 'nodes': 1,
 'ntasks': 1,
 'ntasks-per-node': 1,
 'overcommit': None,
 'oversubscribe': None,
 'time': '01:00:00',
 'unbuffered': None}

09:07:10 prod-0127 SmartSim[49455] INFO MNIST-test(206374.3): New
09:07:15 prod-0127 SmartSim[49455] INFO MNIST-test(206374.3): New
Model MNIST-test produced the following error 
Error: srun: error: prod-0127: task 0: Exited with exit code 1
 
Job status at failure: Failed 
Launcher status at failure: FAILED 
Job returncode: 1 
Error and output file located at: /lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/MNIST/MNIST-test
09:07:21 prod-0127 SmartSim[49455] INFO MNIST-test(206374.3): Failed


In [19]:
mnist_exp.stop(mnist_model)

# To shut down the rest of the cluster as well (step 4):
# exp_workers.stop(worker_ensemble)
# exp.stop(head_node_model)

09:03:15 prod-0127 SmartSim[49455] INFO Stopping model MNIST-test with job name MNIST-test-CAV5Q759K9CR


In [22]:
exp.poll()

12:36:03 spider-0001 SmartSim[31493] INFO head-ens(201152): Running
12:36:03 spider-0001 SmartSim[31493] INFO worker-ens(201153): Running
12:36:13 spider-0001 SmartSim[31493] INFO head-ens(201152): Running
12:36:13 spider-0001 SmartSim[31493] INFO worker-ens(201153): Running
12:36:23 spider-0001 SmartSim[31493] INFO head-ens(201152): Running
12:36:23 spider-0001 SmartSim[31493] INFO worker-ens(201153): Running
12:36:33 spider-0001 SmartSim[31493] INFO head-ens(201152): Running
12:36:33 spider-0001 SmartSim[31493] INFO worker-ens(201153): Running
12:36:43 spider-0001 SmartSim[31493] INFO head-ens(201152): Running
12:36:43 spider-0001 SmartSim[31493] INFO worker-ens(201153): Running
12:36:53 spider-0001 SmartSim[31493] INFO head-ens(201152): Running
12:36:53 spider-0001 SmartSim[31493] INFO worker-ens(201153): Running
12:37:03 spider-0001 SmartSim[31493] INFO head-ens(201152): Running
12:37:03 spider-0001 SmartSim[31493] INFO worker-ens(201153): Running
12:37:13 spider-0001 SmartSim[3149

KeyboardInterrupt: 

In [53]:
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            201164     bdw18 head-ens arigazzi  R      14:57      1 prod-0001
            201169     bdw18 worker-e arigazzi  R      10:22      3 prod-[0002-0004]
            201175     bdw18  Chpl-mg  chapelu  R       0:14     16 prod-[0010-0025]
            199735      full sstsim.x visharma  R    9:17:41      1 prod-0009
            200628    spider interact arigazzi  R    5:43:57      1 spider-0001
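The `squeue` listing above can also be inspected from Python, for example to confirm that the head and worker jobs are in the running state before submitting a workload. The sketch below parses the default column layout shown above; the helper names `parse_squeue` and `running_job_names` are hypothetical, and the sketch assumes job names contain no spaces.

```python
# Sample text copied from the squeue output above (first two jobs only)
SAMPLE = """\
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            201164     bdw18 head-ens arigazzi  R      14:57      1 prod-0001
            201169     bdw18 worker-e arigazzi  R      10:22      3 prod-[0002-0004]
"""


def parse_squeue(text):
    """Parse default-format squeue output into a list of dicts keyed by column."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Limit the split so the last column stays intact even if it has spaces.
        fields = line.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


def running_job_names(text, user):
    """Return the names of the jobs owned by `user` that are in state R."""
    return [job["NAME"] for job in parse_squeue(text)
            if job["USER"] == user and job["ST"] == "R"]


print(running_job_names(SAMPLE, "arigazzi"))
# prints: ['head-ens', 'worker-e']
```

Note that `squeue` truncates long job names in its default format, which is why the worker ensemble appears as `worker-e`.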
