# Setting up a Ray cluster with SmartSim

In [None]:
%set_env SMARTSIM_LOG_LEVEL debug

## 1. Start the cluster
We set up a SmartSim experiment, which will handle the launch of the Ray cluster.

First we import the relevant modules and set up variables. `NUM_WORKERS` is the number of worker nodes: in total, we will spin up a Ray cluster of `NUM_WORKERS+1` nodes (one node is the head node).

In [2]:
import numpy as np

import ray
from ray import tune
import ray.util

# slurm is needed in the last step, to release the allocation (if any)
from smartsim import Experiment, slurm
from smartsim.ext.ray import RayCluster

NUM_WORKERS = 3
CPUS_PER_WORKER = 18
alloc = None
launcher = "slurm"

Now we define a SmartSim experiment which will spin up the Ray cluster. The output files will be located in the `ray-cluster` directory (relative to the path from which we are executing this notebook). We limit the number of CPUs each Ray node can use to `CPUS_PER_WORKER`; to let each node use all of its CPUs, simply omit `ray_args`.
Note that the cluster will be password-protected (the password, generated internally, will be shared with worker nodes).

If the hosts are attached to multiple network interfaces (e.g. `ib`, `eth0`, ...), we can specify which one the Ray nodes should bind to. It is recommended to always choose the one offering the best performance. On an XC system, for example, this will be `ipogif0`.
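Before hard-coding an interface name, it can help to check which interfaces the host actually exposes. The following is a minimal sketch using only the standard library; `list_interfaces`, `pick_interface`, and the preference order are illustrative assumptions, not part of SmartSim.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces visible on this host."""
    return [name for _, name in socket.if_nameindex()]


def pick_interface(preferred, available):
    """Return the first name in `preferred` that is in `available`, else None."""
    available = set(available)
    for name in preferred:
        if name in available:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical preference order: fastest interconnect first.
    print(pick_interface(["ipogif0", "ib0", "eth0"], list_interfaces()))
```

The returned name could then be passed as the `interface` argument when creating the cluster. Keeping the selection logic separate from the system call makes it easy to check against a fixed list of names.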

In [3]:
exp = Experiment("ray-cluster", launcher=launcher)
cluster = RayCluster(name="ray-cluster", run_args={}, ray_args={"num-cpus": CPUS_PER_WORKER},
                     launcher=launcher, workers=NUM_WORKERS, alloc=alloc, batch=False, interface="ipogif0")

If the cluster has to be run as a batch job, we might want to add some preamble lines to the batch files to set up modules and environments. If we are running this in an internal allocation, the environment will be propagated automatically.

In [3]:
if cluster.batch:
    cluster.head_model.batch_settings.add_preamble(["source ~/.bashrc", "conda activate smartsim"])
    if NUM_WORKERS:
        cluster.worker_model.batch_settings.add_preamble(["source ~/.bashrc", "conda activate smartsim"])

We now generate the needed directories. If an experiment with the same name already exists, this call will fail, to avoid overwriting existing results. If we want to overwrite, we can simply pass `overwrite=True` to `exp.generate()`.

In [4]:
exp.generate(cluster, overwrite=True)

10:02:19 nid00000 SmartSim[96667] INFO Working in previously created experiment


Now we are ready to start the cluster!

In [5]:
exp.start(cluster, block=False, summary=False)

10:02:36 nid00000 SmartSim[96667] INFO Ray cluster launched.


## 2. Start the Ray driver script

Now we can connect to the Ray client server running on the head node.

In [6]:
ray.util.connect(cluster.get_head_address()+":10001")


{'num_clients': 1,
 'python_version': '3.8.8',
 'ray_version': '1.3.0',
 'ray_commit': '9f45548488c4fa288f3cecb556801f97958eae8b',
 'protocol_version': '2020-03-12'}
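The address passed to `ray.util.connect` is just the head node address followed by the client port. A small helper can make that explicit and fail early if the head address is missing. This is a sketch: `ray_client_address` is a hypothetical name, and the port is assumed to be the `10001` used in the cell above.

```python
RAY_CLIENT_PORT = 10001  # port used in the connect call above


def ray_client_address(head_address, port=RAY_CLIENT_PORT):
    """Build the 'host:port' string passed to ray.util.connect."""
    if not head_address:
        raise ValueError("head address is empty; is the cluster running?")
    return f"{head_address}:{port}"


print(ray_client_address("10.128.0.1"))  # 10.128.0.1:10001
```

In this notebook, the call would then read `ray.util.connect(ray_client_address(cluster.get_head_address()))`.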

Now we check that all resources are set properly.

In [7]:
print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

This cluster consists of
    4 nodes in total
    72.0 CPU resources in total
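These numbers can be checked against the settings chosen earlier: one head node plus `NUM_WORKERS` workers, each limited to `CPUS_PER_WORKER` CPUs. The helper below is a minimal sketch of that arithmetic; `expected_resources` is an illustrative name, and it assumes the head node is subject to the same CPU limit as the workers, which is consistent with the output above.

```python
def expected_resources(num_workers, cpus_per_node):
    """Expected node and CPU totals for a cluster with one head node."""
    nodes = num_workers + 1  # the workers plus the head node
    return {"nodes": nodes, "CPU": float(nodes * cpus_per_node)}


expected = expected_resources(num_workers=3, cpus_per_node=18)
print(expected)  # {'nodes': 4, 'CPU': 72.0}
```

Comparing `expected["nodes"]` with `len(ray.nodes())` and `expected["CPU"]` with `ray.cluster_resources()["CPU"]` would turn the visual check into an assertion.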



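With the cluster connected, we can run a Ray Tune experiment on it. Here we train PPO on `CartPole-v0` and grid-search the learning rate over 50 values, stopping each trial once the maximum episode reward reaches 200.

In [20]: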
tune.run(
    "PPO",
    stop={"episode_reward_max": 200},
    config={
        "framework": "torch",
        "env": "CartPole-v0",
        "num_gpus": 0,
        "lr": tune.grid_search(np.linspace(0.001, 0.01, 50).tolist()),
        "log_level": "ERROR",
    },
    local_dir="/lus/scratch/arigazzi/ray_local/",
    verbose=0,
    fail_fast=True,
    log_to_file=True,
)

[2m[36m(pid=80710)[0m Instructions for updating:
[2m[36m(pid=80710)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80710)[0m Instructions for updating:
[2m[36m(pid=80710)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80710)[0m Instructions for updating:
[2m[36m(pid=80710)[0m non-resource variables are not supported in the long term
[2m[36m(pid=80710)[0m Instructions for updating:
[2m[36m(pid=80710)[0m non-resource variables are not supported in the long term
[2m[33m(raylet)[0m E0806 09:54:10.383549949   81183 server_chttp2.cc:40]        {"created":"@1628261650.383451240","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1628261650.383447113","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server

<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7ffdb220fe20>
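The grid search over `lr` creates one trial per learning rate, so it is worth knowing how many values the grid contains before launching. The sketch below reproduces the grid with a pure-Python stand-in for `np.linspace(...).tolist()` so that it runs without NumPy; the `linspace` helper is an assumption introduced for illustration only.

```python
def linspace(start, stop, num):
    """Pure-Python stand-in for np.linspace(start, stop, num).tolist()."""
    if num == 1:
        return [float(start)]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


learning_rates = linspace(0.001, 0.01, 50)
print(len(learning_rates))  # 50
print(learning_rates[:3])   # the first few learning rates in the grid
```

With 50 values and no other searched parameter, the call above schedules 50 trials, which Ray distributes over the CPUs available in the cluster.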

## 3. Stop cluster and release allocation

In [8]:
ray.util.disconnect()
exp.stop(cluster)
if alloc:
    slurm.release_allocation(alloc)

10:03:08 nid00000 SmartSim[96667] INFO Stopping model ray_head with job name ray_head-CDCIRA6W7ACG
10:03:08 nid00000 SmartSim[96667] INFO Stopping model ray_worker_0 with job name ray_worker_0-CDCIRA6W85MT
10:03:08 nid00000 SmartSim[96667] INFO Stopping model ray_worker_1 with job name ray_worker_1-CDCIRA6W8DP5
10:03:08 nid00000 SmartSim[96667] INFO Stopping model ray_worker_2 with job name ray_worker_2-CDCIRA6W8JQN
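When the same steps are run as a script rather than a notebook, the cleanup should happen even if the driver code raises. The following is a minimal sketch of that pattern using `contextlib.ExitStack`; `run_with_cleanup` is a hypothetical helper and the lambdas are stand-ins for the real disconnect and stop calls.

```python
from contextlib import ExitStack


def run_with_cleanup(work, cleanup_steps):
    """Run `work()`, then every cleanup step in order, even if `work` raises."""
    with ExitStack() as stack:
        # ExitStack unwinds callbacks last-in, first-out, so register in reverse.
        for step in reversed(cleanup_steps):
            stack.callback(step)
        return work()


log = []
run_with_cleanup(
    work=lambda: log.append("driver script"),
    cleanup_steps=[
        lambda: log.append("disconnect"),    # stand-in for ray.util.disconnect()
        lambda: log.append("stop cluster"),  # stand-in for exp.stop(cluster)
    ],
)
print(log)  # ['driver script', 'disconnect', 'stop cluster']
```

In a real script, the cleanup steps would be `ray.util.disconnect` and `lambda: exp.stop(cluster)`, followed by the allocation release when `alloc` is set.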
