# Setting up a Ray cluster with SmartSim

In [None]:
%set_env SMARTSIM_LOG_LEVEL debug

## 1. Start the cluster
We set up a SmartSim experiment, which will handle the launch of the Ray cluster.

First we import the relevant modules and set up variables. `NUM_WORKERS` is the number of worker nodes: in total, we will spin up a Ray cluster of `NUM_WORKERS+1` nodes (one node is the head node).

In [None]:
import numpy as np

import ray
from ray import tune
import ray.util

from smartsim import Experiment
from smartsim.ext.ray import RayCluster

NUM_WORKERS = 3
CPUS_PER_WORKER = 18
alloc = None
launcher = "slurm"

Now we define a SmartSim experiment which will spin up the Ray cluster. The output files will be located in the `ray-cluster` directory (relative to the directory from which we are executing this notebook). We limit the number of CPUs each Ray node can use to `CPUS_PER_WORKER`; to let each node use all of its CPUs, it would suffice not to pass `ray_args`.
Notice that the cluster will be password-protected (the password, generated internally, will be shared with worker nodes).

If the hosts are attached to multiple network interfaces (e.g. `ib`, `eth0`, ...), we can specify which one the Ray nodes should bind to: it is recommended to always choose the one offering the best performance. On an XC, for example, this will be `ipogif0`.
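To decide which interface name to pass, it can help to check which interfaces a host actually exposes. The following is a minimal sketch using only the Python standard library; `available_interfaces` and `pick_interface` are hypothetical helper names (not part of SmartSim or Ray), and the interface names in the example are illustrative.

```python
import socket


def available_interfaces():
    """Return the names of the network interfaces visible on this host."""
    # socket.if_nameindex() yields (index, name) pairs.
    return [name for _, name in socket.if_nameindex()]


def pick_interface(preferred, available):
    """Return the first name in `preferred` that also appears in `available`.

    Hypothetical helper: it only compares names, it does not measure performance.
    """
    for name in preferred:
        if name in available:
            return name
    raise ValueError(f"none of {preferred} found among {available}")


# Example with a hard-coded list; on a real host, pass available_interfaces() instead.
chosen = pick_interface(["ipogif0", "ib0", "eth0"], ["lo", "eth0", "ib0"])
print(chosen)  # ib0
```

The preference order is something we choose ourselves, fastest interface first; the selected name is then what would be passed as `interface=` to `RayCluster`.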

In [3]:
exp = Experiment("ray-cluster", launcher=launcher)
cluster = RayCluster(name="ray-cluster", run_args={}, ray_args={"num-cpus": CPUS_PER_WORKER},
                     launcher=launcher, workers=NUM_WORKERS, alloc=alloc, batch=False, interface="ipogif0")

If the cluster has to be run as a batch job, we might want to pass some preamble lines to the batch files, to set up modules and environments. If we are running this in an internal allocation, the environment will be automatically propagated.

In [3]:
if cluster.batch:
    cluster.head_model.batch_settings.add_preamble(["source ~/.bashrc", "conda activate smartsim"])
    if NUM_WORKERS:
        cluster.worker_model.batch_settings.add_preamble(["source ~/.bashrc", "conda activate smartsim"])

We now generate the needed directories. By default, this call fails if an experiment with the same name already exists, to avoid overwriting existing results. Here we pass `overwrite=True` to `exp.generate()`, so existing files are overwritten.

In [4]:
exp.generate(cluster, overwrite=True)

12:59:09 horizon SmartSim[34456] INFO Working in previously created experiment


Now we are ready to start the cluster!

In [5]:
exp.start(cluster, block=False, summary=True)



=== LAUNCH SUMMARY ===
Experiment: ray-cluster
Experiment Path: /lus/cls01029/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/ray-cluster
Launching with: slurm
# of Ensembles: 0
# of Models: 0
Database: no

=== RAY CLUSTERS ===
ray-cluster
# of workers: 3
Launching as batch: False
Head run Settings: 
Executable: /lus/scratch/arigazzi/anaconda3/envs/smartsim/bin/python
Executable arguments: ['/lus/cls01029/arigazzi/smartsim-dev/SmartSim/smartsim/ext/ray/raystarter.py', '--port=6789', '--redis-password=0af0ae16-791b-4c5d-8613-580f319e3fc8', '--ifname=ipogif0', '--ray-exe=/lus/scratch/arigazzi/anaconda3/envs/smartsim/bin/ray', '--head', '--dashboard-port=8265', '--ray-args="--num-cpus=18"']
Run Command: srun
Run arguments: {'exclusive': None,
 'nodes': 1,
 'ntasks': 1,
 'ntasks-per-node': 1,
 'unbuffered': None}
Workers run Settings: 
Executable: /lus/scratch

                                                                                

12:59:22 horizon SmartSim[34456] DEBUG Running on allocation 1503276 gleaned from user environment
12:59:22 horizon SmartSim[34456] DEBUG Running on allocation 1503276 gleaned from user environment
12:59:22 horizon SmartSim[34456] DEBUG Running on allocation 1503276 gleaned from user environment
12:59:22 horizon SmartSim[34456] DEBUG Running on allocation 1503276 gleaned from user environment




12:59:26 horizon SmartSim[34456] DEBUG Launching ray_head
12:59:29 horizon SmartSim[34456] DEBUG Launching ray_worker_0
12:59:32 horizon SmartSim[34456] DEBUG Launching ray_worker_1
12:59:35 horizon SmartSim[34456] DEBUG Launching ray_worker_2
12:59:36 horizon SmartSim[34456] INFO Ray cluster launched on nodes: ['nid00001', 'nid00002', 'nid00003', 'nid00004']
12:59:36 horizon SmartSim[34456] DEBUG Starting Job Manager


## 2. Start the Ray driver script

Now we can connect to the running cluster. The address is that of the head node, and 10001 is the default port of the Ray client server.

In [6]:
ray.util.connect(cluster.get_head_address()+":10001")


{'num_clients': 1,
 'python_version': '3.8.8',
 'ray_version': '1.3.0',
 'ray_commit': '9f45548488c4fa288f3cecb556801f97958eae8b',
 'protocol_version': '2020-03-12'}
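If the connection is attempted before anything is listening on the head node's client port, it fails. One possible guard, sketched here with the standard library only, is to poll the port until it accepts TCP connections. `wait_for_port` is a hypothetical helper, not a SmartSim or Ray function, and the timeout values are arbitrary.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline has passed.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

In this notebook, one could call `wait_for_port(cluster.get_head_address(), 10001)` and only proceed to `ray.util.connect` if it returns `True`. A reachable port only shows that a process is listening, not that the whole cluster is ready.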

Now we check that all resources are set properly.

In [8]:
print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))


This cluster consists of
    4 nodes in total
    72.0 CPU resources in total

10.128.0.2:8265
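The numbers reported above follow from the setup: one head node plus `NUM_WORKERS` workers, each limited to `CPUS_PER_WORKER` CPUs (the launch summary shows the head node started with `--num-cpus=18` as well). As a small sketch of that arithmetic, with `expected_cluster_resources` being a hypothetical helper name:

```python
def expected_cluster_resources(num_workers, cpus_per_node):
    """Return (nodes, cpus) expected for one head node plus `num_workers` workers."""
    nodes = num_workers + 1  # the head node counts as a Ray node too
    # Ray reports CPU resources as a float, so we return a float as well.
    return nodes, float(nodes * cpus_per_node)


nodes, cpus = expected_cluster_resources(num_workers=3, cpus_per_node=18)
print(nodes, cpus)  # 4 72.0
```

In the notebook, these values could be compared against `len(ray.nodes())` and `ray.cluster_resources()['CPU']` to detect a worker that failed to join.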


In [7]:
tune.run(
    "PPO",
    stop={"episode_reward_max": 200},
    config={
        "framework": "torch",
        "env": "CartPole-v0",
        "num_gpus": 0,
        "lr": tune.grid_search(np.linspace(0.001, 0.01, 50).tolist()),
        "log_level": "ERROR",
    },
    local_dir="/lus/scratch/arigazzi/ray_local/",
    verbose=0,
    fail_fast=True,
    log_to_file=True,
)

(pid=40860) Instructions for updating:
(pid=40860) non-resource variables are not supported in the long term
(pid=40880) Instructions for updating:
(pid=40880) non-resource variables are not supported in the long term
(pid=40876) Instructions for updating:
(pid=40876) non-resource variables are not supported in the long term
(pid=40873) Instructions for updating:
(pid=40873) non-resource variables are not supported in the long term
(pid=40882) Instructions for updating:
(pid=40882) non-resource variables are not supported in the long term
(pid=40881) Instructions for updating:
(pid=40881) non-resource variables are not supported in the long term
(pid=40879) Instructions for updating:
(pid=40879) non-resource variables are not supported in the long term
(pid=40880) 2021-08-06 13:04:10,858	INFO t

Instructions for updating:
non-resource variables are not supported in the long term


<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7f491c9cc700>
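The run above sweeps the learning rate over a grid of 50 evenly spaced values, so it launches one trial per value. The sketch below counts the trials for such a configuration without needing Ray or NumPy. It rests on assumptions: `tune.grid_search(values)` is represented as `{"grid_search": values}`, each grid point is sampled once, and `linspace` and `count_grid_trials` are hypothetical helpers written for illustration.

```python
def linspace(start, stop, num):
    """Pure-Python stand-in for np.linspace(start, stop, num).tolist()."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def count_grid_trials(config):
    """Multiply the sizes of all {"grid_search": [...]} entries in a flat config."""
    total = 1
    for value in config.values():
        if isinstance(value, dict) and "grid_search" in value:
            total *= len(value["grid_search"])
    return total


config = {
    "framework": "torch",
    "env": "CartPole-v0",
    # Assumption: this dict mirrors what tune.grid_search(values) produces.
    "lr": {"grid_search": linspace(0.001, 0.01, 50)},
}
print(count_grid_trials(config))  # 50
```

Adding a second grid-searched parameter multiplies the count, which is worth checking against the CPU resources available in the cluster before starting a sweep.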

## 3. Stop cluster and release allocation

In [8]:
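ray.util.disconnect()
exp.stop(cluster)
if alloc:
    # `slurm` is not imported earlier in this notebook: import SmartSim's slurm
    # module (its import path depends on the SmartSim version) before this runs.
    slurm.release_allocation(alloc)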

13:23:04 horizon SmartSim[34456] INFO Stopping model ray_head with job name ray_head-CDCMIST283BS
13:23:04 horizon SmartSim[34456] DEBUG Process terminated with kill 34517
13:23:04 horizon SmartSim[34456] DEBUG Process terminated with kill 34518
13:23:04 horizon SmartSim[34456] INFO Stopping model ray_worker_0 with job name ray_worker_0-CDCMIST2KZS9
13:23:04 horizon SmartSim[34456] DEBUG Process terminated with kill 34528
13:23:04 horizon SmartSim[34456] DEBUG Process terminated with kill 34527
13:23:04 horizon SmartSim[34456] INFO Stopping model ray_worker_1 with job name ray_worker_1-CDCMIST2XJM6
13:23:04 horizon SmartSim[34456] DEBUG Process terminated with kill 34538
13:23:04 horizon SmartSim[34456] DEBUG Process terminated with kill 34537
13:23:04 horizon SmartSim[34456] INFO Stopping model ray_worker_2 with job name ray_worker_2-CDCMIST38KK3
13:23:04 horizon SmartSim[34456] DEBUG Process terminated with kill 34545
13:23:04 horizon SmartSim[34456] DEBUG Process terminated with kil