# Setting up a Ray cluster with SmartSim

## 1. Start the cluster
We set up a SmartSim experiment, which will handle the launch of the Ray cluster.

First we import the relevant modules and set up variables. `NUM_WORKERS` is the number of worker nodes: in total, we will spin up a Ray cluster of `NUM_WORKERS+1` nodes (one node is the head node).

In [1]:
import numpy as np

from ray.tune.progress_reporter import JupyterNotebookReporter
import ray
from ray import tune
import ray.util

from smartsim import Experiment
from smartsim.ext.ray import RayCluster

NUM_WORKERS = 3
CPUS_PER_WORKER = 18
alloc=None
launcher='slurm'

Now we define a SmartSim experiment, which will spin up the Ray cluster. The output files will be located in the `ray-cluster` directory (relative to the directory from which we are executing this notebook). We limit the number of CPUs each Ray node can use to `CPUS_PER_WORKER`; to let each node use all of its CPUs, it suffices to omit `ray_args`.
Notice that the cluster will be password-protected (the password, generated internally, will be shared with worker nodes).

In [2]:
exp = Experiment("ray-cluster", launcher=launcher)
cluster = RayCluster(name="ray-cluster", run_args={}, ray_args={"num-cpus": CPUS_PER_WORKER},
                     launcher=launcher, workers=NUM_WORKERS, alloc=alloc, batch=True)
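
To make the role of `ray_args` concrete, here is a minimal sketch of how such a dictionary could end up as `ray start` command-line flags. The helper `ray_args_to_flags` is hypothetical and not part of SmartSim, and the `--key=value` rendering is an assumption about the eventual effect; SmartSim's internal translation may differ.

```python
def ray_args_to_flags(ray_args):
    """Hypothetical helper: render a dict of Ray arguments as CLI flags."""
    return [f"--{key}={value}" for key, value in ray_args.items()]


# Same dictionary as passed to RayCluster above (CPUS_PER_WORKER = 18).
flags = ray_args_to_flags({"num-cpus": 18})
print(" ".join(["ray", "start"] + flags))
```

An empty dictionary yields no flags, which matches the statement that omitting `ray_args` leaves the CPU count unrestricted.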

If the cluster is run as a batch job, we may want to add some preamble lines to the batch files to set up modules and environments. If we are running this in an internal allocation, the environment will be automatically propagated.

In [3]:
if cluster.batch:
    cluster.head_model.batch_settings.add_preamble(["source ~/.bashrc", "conda activate smartsim"])
    if NUM_WORKERS:
        cluster.worker_model.batch_settings.add_preamble(["source ~/.bashrc", "conda activate smartsim"])

We now generate the needed directories. By default, if an experiment with the same name already exists, this call fails to avoid overwriting existing results. Here we pass `overwrite=True` to `exp.generate()` to allow overwriting.

In [4]:
exp.generate(cluster, overwrite=True)

08:34:17 osprey.us.cray.com SmartSim[17173] INFO Working in previously created experiment
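
The fail-unless-overwrite behavior described above can be illustrated with a small standalone sketch. The helper `make_output_dir` is hypothetical and is not how SmartSim implements `exp.generate()`; in particular, deleting the previous contents on overwrite is an assumption made only for this example.

```python
import shutil
import tempfile
from pathlib import Path


def make_output_dir(path, overwrite=False):
    """Hypothetical helper mimicking a fail-unless-overwrite policy."""
    path = Path(path)
    if path.exists():
        if not overwrite:
            raise FileExistsError(f"{path} already exists; pass overwrite=True")
        shutil.rmtree(path)  # assumption: overwriting discards old contents
    path.mkdir(parents=True)
    return path


target = Path(tempfile.mkdtemp()) / "ray-cluster"
make_output_dir(target)                  # first call creates the directory
try:
    make_output_dir(target)              # second call refuses to overwrite
except FileExistsError as err:
    print("refused:", type(err).__name__)
make_output_dir(target, overwrite=True)  # explicit overwrite succeeds
```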


Now we are ready to start the cluster!

In [5]:
exp.start(cluster, block=False, summary=True)

=== LAUNCH SUMMARY ===
Experiment: ray-cluster
Experiment Path: /lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/ray-cluster
Launching with: slurm
# of Ensembles: 0
# of Models: 0
Database: no

=== RAY CLUSTERS ===
ray-cluster
# of workers: 3
Launching as batch: True
Batch Settings: 
None

08:34:43 osprey.us.cray.com SmartSim[17173] INFO Ray cluster launched on nodes: ['prod-0010']


## 2. Start the Ray driver script

Now we can connect to the running cluster through the Ray client server on the head node (port 10001 is Ray's default client port).

In [6]:
ray.util.connect(cluster.head_model.address+":10001")


{'num_clients': 1,
 'python_version': '3.7.10',
 'ray_version': '1.3.0',
 'ray_commit': '9f45548488c4fa288f3cecb556801f97958eae8b',
 'protocol_version': '2020-03-12'}
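
If the client server is not yet accepting connections when the connect call runs, one option is to poll the port first. The sketch below uses only the standard library; `wait_for_port` is a hypothetical helper, not part of Ray or SmartSim, and the timeout values are arbitrary.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connection means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Possible use in this notebook (not executed here):
# if wait_for_port(cluster.head_model.address, 10001):
#     ray.util.connect(cluster.head_model.address + ":10001")
```

A successful TCP connection only shows that the port is open; it does not guarantee that the Ray client handshake will succeed.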

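Now we check that all resources are set properly. We expect `NUM_WORKERS+1 = 4` nodes and, with `CPUS_PER_WORKER = 18` CPUs on each node, 72 CPUs in total.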

In [7]:
print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

This cluster consists of
    4 nodes in total
    72.0 CPU resources in total
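
The reported totals can also be checked programmatically rather than by eye. This is a minimal sketch, assuming every node (head included) is limited to `CPUS_PER_WORKER` CPUs, which is consistent with the 72 CPUs reported above; `expected_resources` and `check_cluster` are hypothetical helper names.

```python
NUM_WORKERS = 3
CPUS_PER_WORKER = 18


def expected_resources(num_workers, cpus_per_node):
    """Expected node and CPU totals: the workers plus one head node."""
    nodes = num_workers + 1
    return nodes, float(nodes * cpus_per_node)


def check_cluster(actual_nodes, actual_cpus, num_workers, cpus_per_node):
    """Raise if the reported totals differ from the expected ones."""
    nodes, cpus = expected_resources(num_workers, cpus_per_node)
    if (actual_nodes, actual_cpus) != (nodes, cpus):
        raise RuntimeError(
            f"expected {nodes} nodes / {cpus} CPUs, "
            f"got {actual_nodes} nodes / {actual_cpus} CPUs"
        )


# Values taken from the output above; in the notebook they would come from
# len(ray.nodes()) and ray.cluster_resources()["CPU"].
check_cluster(4, 72.0, NUM_WORKERS, CPUS_PER_WORKER)
print(expected_resources(NUM_WORKERS, CPUS_PER_WORKER))
```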



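With the cluster ready, we run a Ray Tune grid search over the learning rate of a PPO agent on `CartPole-v0`. Each trial stops once the maximum episode reward reaches 200.

In [8]: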
tune.run(
    "PPO",
    stop={"episode_reward_max": 200},
    config={
        "framework": "torch",
        "env": "CartPole-v0",
        "num_gpus": 0,
        "lr": tune.grid_search(np.linspace(0.001, 0.01, 50).tolist()),
        "log_level": "ERROR",
    },
    local_dir="/lus/scratch/arigazzi/ray_local/",
    verbose=0,
    fail_fast=True,
    log_to_file=True,
)

(pid=68106) 2021-07-29 08:35:45,729 INFO trainer.py:696 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=68100) 2021-07-29 08:35:46,938 INFO trainer.py:696 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=68104) 2021-07-29 08:35:47,383 INFO trainer.py:696 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=68099) 2021-07-29 08:35:47,416 INFO trainer.py:696 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=68095) 2021-07-29 08:35:47,482 INFO trainer.py:696 -- Current log_level is ERROR. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(pid=68098) 2021-07-29 08:35:47,555 INFO trainer.py:696 -- Cur

<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7ff9097f5f10>
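
To see how many trials the grid search in the `tune.run` call generates, the learning-rate grid can be reproduced without NumPy. The `linspace` function below is a pure-Python stand-in for `np.linspace(...).tolist()` written for this sketch; its values may differ from NumPy's in the last floating-point digits.

```python
def linspace(start, stop, num):
    """Pure-Python stand-in for np.linspace(start, stop, num).tolist() (num >= 2)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


learning_rates = linspace(0.001, 0.01, 50)

# tune.grid_search creates one trial per listed value, so this grid
# corresponds to 50 PPO trials, one per learning rate.
print(len(learning_rates))
print(learning_rates[0])
```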

## 3. Stop cluster and release allocation

In [10]:
ray.util.disconnect()
exp.stop(cluster)
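if alloc:
    # slurm must be imported before use; the import path may vary across SmartSim versions
    from smartsim import slurm
    slurm.release_allocation(alloc)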

08:56:41 osprey.us.cray.com SmartSim[17173] INFO Stopping model workers with job name workers-CD5NVSH5WO1S
08:56:41 osprey.us.cray.com SmartSim[17173] INFO Stopping model head with job name head-CD5NVR06QEKH
