# Setting up a Ray cluster with SmartSim

This notebook shows how to set up a Ray cluster on a system that uses SLURM as its workload manager (WLM).

This notebook is based on [this GitHub repository by NERSC](https://github.com/NERSC/slurm-ray-cluster/blob/master/submit-ray-cluster.sbatch) and on [this comment on a Ray GitHub issue](https://github.com/ray-project/ray/issues/826#issuecomment-522116599).

The workflow has four steps:
1. Start the head node
2. Start the workers
3. Run a test workload
4. Stop all nodes

## 1. Start the head node
We set up a SmartSim experiment, which handles the launch of the Ray head node.
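
For orientation, starting a head node ultimately comes down to a `ray start --head` invocation. The sketch below assembles such a command in plain Python. The helper name `head_start_command` is hypothetical and not part of SmartSim, the port and CPU count mirror the values in the log output of the next cells, and the flags assume a Ray 1.x command line.

```python
import shlex
import uuid


def head_start_command(port, num_cpus, password=None):
    """Build a `ray start --head` command line (hypothetical helper)."""
    # A random UUID serves as a throwaway cluster password when none is given.
    password = password or str(uuid.uuid4())
    cmd = [
        "ray", "start", "--head",
        f"--port={port}",
        f"--redis-password={password}",
        f"--num-cpus={num_cpus}",
        "--block",  # stay in the foreground so the WLM can track the process
    ]
    return cmd, password


cmd, password = head_start_command(port=6780, num_cpus=12)
print(shlex.join(cmd))
```

The password is returned alongside the command because the workers and the test script need the same value to join the cluster.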

In [1]:
import os
from smartsim import Experiment
from smartsim.settings import SbatchSettings, SrunSettings, RunSettings
from smartsim.ray import RayCluster
import time
import uuid
import re

In [2]:
# The experiment uses the Slurm launcher, so the head node is started through the WLM
exp = Experiment("ray-cluster", launcher='slurm')

cluster = RayCluster(name="ray-cluster", path='./RAYCLUSTER', launcher='slurm')

16:54:28 osprey.us.cray.com SmartSim[36185] INFO /lus/sonexion/arigazzi/smartsim-dev/SmartSim/smartsim/ray/rayserverstarter.py --ray-num-cpus=12 --ray-port=6780 --ray-password=6462a2f1-fecf-47b3-bb3a-f0d4c420216c --zmq-port=6788


In [3]:
exp.generate(cluster, overwrite=True)
exp.start(cluster, block=False, summary=False)

16:54:29 osprey.us.cray.com SmartSim[36185] INFO Working in previously created experiment
16:54:35 osprey.us.cray.com SmartSim[36185] INFO ray-cluster(227845): New
16:54:40 osprey.us.cray.com SmartSim[36185] INFO ray-cluster(227845): New
16:54:46 osprey.us.cray.com SmartSim[36185] INFO ray-cluster(227845): Running
16:54:51 osprey.us.cray.com SmartSim[36185] INFO ray-cluster(227845): Running
16:54:56 osprey.us.cray.com SmartSim[36185] INFO ray-cluster(227845): Running
16:55:01 osprey.us.cray.com SmartSim[36185] INFO ray-cluster(227845): Running
16:55:06 osprey.us.cray.com SmartSim[36185] INFO ray-cluster(227845): Running
16:55:11 osprey.us.cray.com SmartSim[36185] INFO ray-cluster(227845): Running
16:55:16 osprey.us.cray.com SmartSim[36185] INFO ray-cluster(227845): Running
16:55:21 osprey.us.cray.com SmartSim[36185] INFO ray-cluster(227845): Running
16:55:26 osprey.us.cray.com SmartSim[36185] INFO ray-cluster(227845): Running
16:55:31 osprey.us.cray.com SmartSim[36185] INFO ray-cluster

KeyboardInterrupt: 

In [4]:
exp.stop(cluster)

17:05:54 osprey.us.cray.com SmartSim[36185] INFO Stopping model ray-cluster with job name ray-cluster-CB5NA5J8OC8R


The head node is now running. The next step is to start the workers.

## 2. Start the worker nodes

We will start the workers as a batch. Each worker has to start on a different node, as a single task. We will rely on `srun` for this.
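
To make the mapping between the run arguments and `srun` concrete, here is a minimal sketch of how a dictionary of arguments could translate into command-line flags. The helper `srun_flags` is hypothetical and only illustrates the idea; how SmartSim formats the arguments internally is not shown in this notebook. A value of `None` is treated as a flag that takes no argument.

```python
def srun_flags(run_args):
    """Translate a run-args dict into srun long options (illustrative only)."""
    flags = []
    for key, value in run_args.items():
        if value is None:
            # Options such as --unbuffered take no value.
            flags.append(f"--{key}")
        else:
            flags.append(f"--{key}={value}")
    return flags


example_args = {"nodes": 2, "ntasks-per-node": 1, "ntasks": 2, "unbuffered": None}
command = ["srun"] + srun_flags(example_args) + ["bash", "start-worker.sh"]
print(" ".join(command))
```

With `nodes` equal to `ntasks` and `ntasks-per-node` set to 1, each worker lands on its own node as a single task.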

In [3]:
exp_workers = Experiment("ray-worker", launcher='slurm')

num_worker_nodes = 2
worker_run_args = {"nodes": num_worker_nodes,
                   "ntasks-per-node": 1, # Ray will take care of resources.
                   "ntasks": num_worker_nodes,
                   "oversubscribe": None,
                   "overcommit": None,
                   "time": "01:00:00",
                   "unbuffered": None,
                   "cpus-per-task": 26}

worker_settings = SrunSettings("bash", "start-worker.sh",
                               block_in_batch=False, run_args=worker_run_args)

worker_node_params = {"HEAD_ADDRESS": head_ip+":"+str(RAY_PORT), "REDIS_PASSWORD": REDIS_PW, "CONDA_ENV": "smartsim"}
worker_node_model = exp_workers.create_model("worker_nodes", path='/lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/head_exp',
                                   run_settings=worker_settings, params=worker_node_params)
worker_node_model.attach_generator_files(to_configure=['./templates/start-worker.sh'])
exp_workers.generate(worker_node_model, overwrite=True)
    
worker_batch = SbatchSettings(nodes=num_worker_nodes, time="01:00:00")
worker_ensemble = exp.create_ensemble("worker-ens", batch_settings=worker_batch)
worker_ensemble.add_model(worker_node_model)

exp_workers.start(worker_ensemble, block=False, summary=False)

12:02:01 osprey.us.cray.com SmartSim[83421] INFO Working in previously created experiment
12:02:01 osprey.us.cray.com SmartSim[83421] INFO Empty ensemble created for batch launch
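
The contents of `start-worker.sh` are not shown in this notebook. As a rough sketch of what configuring a template with `worker_node_params` amounts to, the code below fills placeholders in a made-up script. It assumes placeholders are delimited by `;`, and both the template text and the `render_template` helper are hypothetical. The `ray start` flags assume a Ray 1.x command line.

```python
import re


def render_template(text, params, tag=";"):
    """Replace tagged placeholders such as ;NAME; with values from params."""
    pattern = re.compile(re.escape(tag) + r"(\w+)" + re.escape(tag))

    def substitute(match):
        key = match.group(1)
        # Unknown placeholders are left untouched so they are easy to spot.
        return str(params.get(key, match.group(0)))

    return pattern.sub(substitute, text)


# Hypothetical template; the real start-worker.sh may look different.
template = """#!/bin/bash
conda activate ;CONDA_ENV;
ray start --address=;HEAD_ADDRESS; --redis-password=;REDIS_PASSWORD; --block
"""

params = {
    "HEAD_ADDRESS": "10.10.1.21:6379",
    "REDIS_PASSWORD": "example-password",
    "CONDA_ENV": "smartsim",
}
rendered = render_template(template, params)
print(rendered)
```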


In [15]:
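# Likely cause of the KeyError below: the ensemble was launched with
# exp_workers.start(), so `exp` has no record of its job;
# exp_workers.stop(worker_ensemble) would be the matching call.
exp.stop(worker_ensemble)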

KeyError: 

In [27]:
print(dir(head_ensemble))
for job in exp._control._jobs.jobs.values():
    print(job.jid)

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_initialize_entities', '_key_prefixing_enabled', '_read_model_parameters', '_set_strategy', 'add_model', 'attach_generator_files', 'batch', 'batch_settings', 'enable_key_prefixing', 'entities', 'models', 'name', 'params', 'path', 'query_key_prefixing', 'register_incoming_entity', 'run_settings', 'set_path', 'type']
218778


The workers are now running. Let's run a test script!

## 3. Execute script

The script runs an MNIST training job. It starts locally and distributes the work across the workers. We only need to supply the cluster address and the Redis password.
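
On the script side, connecting to the cluster could look like the sketch below. This is not the actual `ppo_tune.py`, whose contents are not shown here; the function names are hypothetical, and the `_redis_password` keyword assumes Ray 1.x. Only the two command-line options match the invocation used in the next cell.

```python
import argparse


def parse_cluster_args(argv=None):
    """Parse the two options needed to reach an existing Ray cluster."""
    parser = argparse.ArgumentParser(description="Connect to a running Ray cluster")
    parser.add_argument("--ray-address", required=True,
                        help="head node address as host:port")
    parser.add_argument("--redis-password", required=True,
                        help="password chosen when the head node was started")
    return parser.parse_args(argv)


def connect(args):
    """Attach to the cluster and report its resources (needs Ray installed)."""
    # Imported lazily so argument parsing can be tried without Ray.
    import ray

    # `_redis_password` is the Ray 1.x keyword; treat it as an assumption
    # for other versions.
    ray.init(address=args.ray_address, _redis_password=args.redis_password)
    return ray.cluster_resources()


args = parse_cluster_args(
    ["--ray-address=10.10.1.21:6379", "--redis-password=example-password"]
)
print(args.ray_address)
```

Calling `connect(args)` on a machine that can reach the head node should then attach the local process to the cluster instead of starting a new local Ray instance.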

In [17]:
!/lus/scratch/arigazzi/anaconda3/envs/smartsim/bin/python MNIST/MNIST-test/ppo_tune.py --ray-address=10.10.1.21:6379 --redis-password=2ad6b162-5008-45ae-a15e-e6891d1b1a06

Ready to import ray
2021-05-02 14:17:47,744	INFO worker.py:655 -- Connecting to existing Ray cluster at address: 10.10.1.21:6379


In [36]:
mnist_exp = Experiment("MNIST", launcher='local')


# Arguments passed to the Python interpreter: the script plus the cluster connection details
mnist_exe_args = f"ppo_tune.py --ray-address={head_ip}:{RAY_PORT} --redis-password={REDIS_PW}"

#ppo_exe_args = f"--jobid 218778 /lus/scratch/arigazzi/anaconda3/envs/smartsim/bin/python ppo_tune.py --ray-address={head_ip}:{RAY_PORT} --redis-password={REDIS_PW}"
    
# mnist_run_args = {"nodes": 1,
#                    "ntasks-per-node": 1, # Ray will take care of resources.
#                    "ntasks": 1,
#                    "oversubscribe": None,
#                    "overcommit": None,
#                    "time": "01:00:00",
#                    "unbuffered": None,
#                    "cpus-per-task": 36}

#mnist_params = {"HEAD_ADDRESS": head_ip+":"+str(RAY_PORT), "REDIS_PASSWORD": REDIS_PW}
mnist_settings = RunSettings("/lus/scratch/arigazzi/anaconda3/envs/smartsim/bin/python", mnist_exe_args, expand_exe=False)#, run_args=mnist_run_args)
mnist_model = mnist_exp.create_model("MNIST-test", path='./mnist_test',
                                     run_settings = mnist_settings)#, params=mnist_params)
mnist_model.attach_generator_files(to_copy=['./templates/ppo_tune.py'])#, to_configure=['./templates/start-script.sh'])
mnist_exp.generate(mnist_model, overwrite=True)
mnist_exp.start(mnist_model, summary=True, block=True)

14:58:56 osprey.us.cray.com SmartSim[111781] INFO Working in previously created experiment


=== LAUNCH SUMMARY ===
Experiment: MNIST
Experiment Path: /lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/MNIST
Launching with: local
# of Ensembles: 0
# of Models: 1
Database: no

=== MODELS ===
MNIST-test
Model Parameters: 
{}
Model Run Settings: 
Executable: /lus/scratch/arigazzi/anaconda3/envs/smartsim/bin/python
Executable arguments: ['ppo_tune.py', '--ray-address=10.10.4.206:6380', '--redis-password=ALABAMA']

14:59:12 osprey.us.cray.com SmartSim[111781] INFO MNIST-test(66593): Running
14:59:17 osprey.us.cray.com SmartSim[111781] INFO MNIST-test(66593): Running
14:59:22 osprey.us.cray.com SmartSim[111781] INFO MNIST-test(66593): Running
Job status at failure: Failed 
Launcher status at failure: Failed 
Job returncode: -6 
Error and output file located at: /lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/MNIST/MNIST-test


In [32]:
mnist_exp.stop(mnist_model)

# exp.stop(worker_ensemble)
# exp.stop(head_ensemble)

12:24:47 osprey.us.cray.com SmartSim[83421] INFO Stopping model MNIST-test with job name MNIST-test-CB0DLJAMTB06


In [36]:
mnist_exp.poll()

12:52:43 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:52:53 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:53:03 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:53:13 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:53:23 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:53:33 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:53:43 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:53:53 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:54:03 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:54:13 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:54:23 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:54:33 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:54:43 osprey.us.cray.com SmartSim[83421] INFO MNIST-test(61071): Running
12:54:53 osp

In [37]:
!squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            218778     bdw18 head-ens arigazzi  R    1:09:18      1 prod-0017
            218804     bdw18 Chpl-isx  chapelu  R      24:50     16 prod-[0001-0016]


In [8]:
!scancel 214555