# Setting up a Ray cluster with SmartSim

## 1. Start the cluster
We set up a SmartSim experiment, which will handle the launch of the Ray cluster.

First we import the relevant modules and set up variables. `NUM_WORKERS` is the number of worker nodes: in total, we will spin up a Ray cluster of `NUM_WORKERS+1` nodes (one node is the head node).

In [None]:
import numpy as np
import ray
from ray import tune
import ray.util

from smartsim import Experiment
from smartsim.exp.ray import RayCluster

NUM_WORKERS = 3
CPUS_PER_WORKER = 18
launcher='slurm'

Now we define a SmartSim experiment which will spin up the Ray cluster. The output files will be located in the `ray-cluster` directory (relative to the path from which we are executing this notebook). We limit the number of CPUs each Ray node can use to `CPUS_PER_WORKER`: to let each node use all of its CPUs, it would suffice not to pass `ray_args`.
Notice that the cluster will be password-protected (the password, generated internally, will be shared with worker nodes).

If the hosts are attached to multiple interfaces (e.g. `ib`, `eth0`, ...) we can specify which one the Ray nodes should bind to: it is recommended to always choose the one offering the best performance. On a Cray XC, for example, this will be `ipogif0`.

To connect to the cluster, we will use the Ray client. Note that this approach only works with `ray>=1.6`. For earlier versions, one has to add `password=None` to the `RayCluster` constructor.

In [None]:
exp = Experiment("ray-cluster", launcher=launcher)
cluster = RayCluster(name="ray-cluster", run_args={}, ray_args={"num-cpus": CPUS_PER_WORKER},
                     launcher=launcher, workers=NUM_WORKERS, batch=False, interface="ipogif0")
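The two ways of setting `ray_args` described above (limit the CPUs, or pass nothing) can be made explicit with a small helper. This is only a sketch: `make_ray_args` is a hypothetical name that is not part of SmartSim, and it assumes `num-cpus` is the only option we want to forward to Ray.

```python
from typing import Dict, Optional


def make_ray_args(cpus_per_worker: Optional[int] = None) -> Dict[str, int]:
    """Build the ray_args dictionary (hypothetical helper, not a SmartSim API)."""
    if cpus_per_worker is None:
        # No limit requested: pass no arguments, so each node uses all its CPUs.
        return {}
    if cpus_per_worker < 1:
        raise ValueError("cpus_per_worker must be a positive integer")
    # Same key as in the RayCluster call above.
    return {"num-cpus": cpus_per_worker}


limited = make_ray_args(18)   # same limit as CPUS_PER_WORKER in this notebook
unlimited = make_ray_args()   # empty dict: no CPU limit
```

The result could then be passed as `ray_args=make_ray_args(CPUS_PER_WORKER)` in place of the literal dictionary.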

We now generate the needed directories. If an experiment with the same name already exists, this call will fail, to avoid overwriting existing results. If we want to overwrite, we can simply pass `overwrite=True` to `exp.generate()`.

In [None]:
exp.generate(cluster, overwrite=True)

Now we are ready to start the cluster!

In [None]:
exp.start(cluster, block=False, summary=True)

## 2. Start the Ray driver script

Now we can connect to the running cluster through the Ray client, using the address of the head node.

In [None]:
ray.init("ray://"+cluster.get_head_address()+":10001")
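The address passed to `ray.init` follows the `ray://<host>:<port>` scheme of the Ray client. A minimal sketch of how that string is assembled is shown below; `ray_client_url` is a hypothetical helper, the IP address is a placeholder, and the default port simply mirrors the `10001` used in the cell above.

```python
def ray_client_url(head_address: str, port: int = 10001) -> str:
    """Return a Ray client URL for the given head node (hypothetical helper)."""
    if not head_address:
        raise ValueError("head_address must not be empty")
    return f"ray://{head_address}:{port}"


# Placeholder address; in the notebook it would come from cluster.get_head_address().
url = ray_client_url("10.128.0.5")
print(url)
```

Building the URL separately makes it easy to print or log the address before connecting.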

Now we check that all resources are set properly.

In [None]:
print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))
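To know what the printed numbers should be, we can compute the expected values from the variables defined at the top of the notebook. The sketch below assumes that the `num-cpus` limit applies to the head node as well as to the workers, as the description of `ray_args` ("each Ray node") suggests; `expected_resources` is a hypothetical helper.

```python
def expected_resources(num_workers: int, cpus_per_node: int):
    """Return (nodes, cpus) expected for a cluster with one head node."""
    nodes = num_workers + 1          # workers plus the head node
    cpus = nodes * cpus_per_node     # assumes the CPU limit applies to every node
    return nodes, cpus


# Values used in this notebook: NUM_WORKERS = 3, CPUS_PER_WORKER = 18
nodes, cpus = expected_resources(3, 18)
print(f"Expecting {nodes} nodes and {cpus} CPUs")
```

If the values reported by `ray.nodes()` and `ray.cluster_resources()` are lower, some workers may not have joined the cluster yet.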


And we can run a Ray Tune example, to see that everything is working!

In [None]:
tune.run(
    "PPO",
    stop={"episode_reward_max": 200},
    config={
        "framework": "torch",
        "env": "CartPole-v0",
        "num_gpus": 0,
        "lr": tune.grid_search(np.linspace(0.001, 0.01, 50).tolist()),
        "log_level": "ERROR",
    },
    local_dir="/lus/scratch/arigazzi/ray_local/",
    verbose=0,
    fail_fast=True,
    log_to_file=True,
)
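Since `lr` is the only grid-searched parameter, the number of trials equals the number of learning rates in the grid. The sketch below reproduces the grid in pure Python to make its size and end points explicit; `linspace` here is a stand-in written for illustration, not the NumPy function itself.

```python
def linspace(start: float, stop: float, num: int):
    """Pure-Python stand-in for np.linspace(start, stop, num).tolist()."""
    if num < 1:
        raise ValueError("num must be at least 1")
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    # Evenly spaced values, with both end points included.
    return [start + i * step for i in range(num)]


learning_rates = linspace(0.001, 0.01, 50)
num_trials = len(learning_rates)  # one trial per grid value
print(num_trials, learning_rates[0], learning_rates[-1])
```

Reducing the third argument is a simple way to shorten the run when only a quick check of the cluster is needed.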

## 3. Stop cluster and release allocation

In [None]:
ray.shutdown()
ray.disconnect()
exp.stop(cluster)