# Setting up a Ray cluster with SmartSim

## 1. Start the cluster
We set up a SmartSim experiment, which will handle the launch of the Ray cluster.

First we import the relevant modules and set up variables. `NUM_WORKERS` is the number of worker nodes: in total, we will spin up a Ray cluster of `NUM_WORKERS+1` nodes (one node is the head node).

In [1]:
import numpy as np
import ray
from ray import tune
import ray.util

from smartsim import Experiment
from smartsim.ext.ray import RayCluster

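NUM_WORKERS = 3        # worker nodes; the head node is launched in addition
CPUS_PER_WORKER = 18   # CPUs each Ray node may use
alloc = None           # Slurm allocation ID, if any (none in this example)
launcher = 'slurm'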

Now we define a SmartSim experiment which will spin up the Ray cluster. The output files will be located in the `ray-cluster` directory (relative to the directory from which we are executing this notebook). We limit the number of CPUs each Ray node can use to `CPUS_PER_WORKER` through `ray_args`; to let each node use all of its CPUs, it suffices not to pass `ray_args`.
Notice that the cluster will be password-protected (the password, generated internally, will be shared with worker nodes).

If the hosts are attached to multiple network interfaces (e.g. `ib`, `eth0`, ...), we can specify which one the Ray nodes should bind to: it is recommended to always choose the one offering the best performance. On an XC, for example, this will be `ipogif0`; the cell below uses `ib0`.
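To decide which value to pass as `interface`, it can help to list the interfaces present on the host first. The following is a minimal sketch using only the Python standard library; `list_interfaces` is a hypothetical helper, not part of SmartSim or Ray, and it only reports interface names, not their speed.

```python
import socket


def list_interfaces():
    """Return the names of the network interfaces visible on this host.

    Hypothetical helper for illustration; not a SmartSim or Ray function.
    """
    # socket.if_nameindex() returns (index, name) pairs, e.g. (1, 'lo')
    return [name for _index, name in socket.if_nameindex()]


print(list_interfaces())
```

The names printed depend on the machine; pick the one that corresponds to the high-speed network and pass it as `interface`.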

In [2]:
exp = Experiment("ray-cluster", launcher=launcher)
cluster = RayCluster(name="ray-cluster", run_args={}, ray_args={"num-cpus": CPUS_PER_WORKER},
                     launcher=launcher, workers=NUM_WORKERS, alloc=alloc, batch=False, interface="ib0", password=None)

We now generate the needed directories. If an experiment with the same name already exists, this call will fail, to avoid overwriting existing results. If we want to overwrite, we can simply pass `overwrite=True` to `exp.generate()`.

In [3]:
exp.generate(cluster, overwrite=True)

04:51:26 osprey.us.cray.com SmartSim[130213] INFO Working in previously created experiment


Now we are ready to start the cluster!

In [4]:
exp.start(cluster, block=False, summary=True)


    === LAUNCH SUMMARY ===
    Experiment: ray-cluster
    Experiment Path: /lus/sonexion/arigazzi/smartsim-dev/SmartSim/tutorials/05_starting_ray/ray-cluster
    Launching with: slurm
    # of Ensembles: 0
    # of Models: 0
    Database: no

    === RAY CLUSTERS ===
    ray-cluster
    # of workers: 3
    Launching as batch: False
    Head run Settings:
    Executable: /lus/scratch/arigazzi/anaconda3/envs/smartsim/bin/python
    Executable arguments: ['/lus/sonexion/arigazzi/smartsim-dev/SmartSim/smartsim/ext/ray/raystarter.py', '--port=6789', '--ifname=ib0', '--ray-exe=/lus/scratch/arigazzi/anaconda3/envs/smartsim/bin/ray', '--head', '--dashboard-port=8265', '--ray-args="--num-cpus=18"']
    Run Command: srun
    Run arguments: {'nodes': 1, 'ntasks': 1, 'ntasks-per-node': 1, 'unbuffered': None}
    Worker run Settings:
    Executable: /lus/scratch/arigazzi/anaconda3/envs/smartsim/bin/python
    Executable arguments: ['/lus/sonexion/ar

04:51:52 osprey.us.cray.com SmartSim[130213] INFO Ray cluster launched on nodes: ['prod-0002', 'prod-0003', 'prod-0004', 'prod-0005']


## 2. Start the Ray driver script

Now we can connect to the running cluster through the Ray client server on the head node, which this example reaches on port 10001.

In [5]:
ray.init("ray://"+cluster.get_head_address()+":10001")

ClientContext(dashboard_url='127.0.0.1:8265', python_version='3.7.10', ray_version='1.5.1', ray_commit='7d69ebb9e239fc073edc1a3a01c581e2bddfcbbf', protocol_version='2021-05-20', _num_clients=1)
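The address passed to `ray.init` above is the head node address with the `ray://` scheme prepended and the client port appended. A small sketch of that string construction follows; `ray_client_url` is a hypothetical helper introduced here for illustration, and the default port of 10001 is simply the value used in this notebook.

```python
def ray_client_url(head_address, port=10001):
    """Build a Ray client URL of the form ray://<host>:<port>.

    Hypothetical helper; 10001 is the port used in this notebook.
    """
    return f"ray://{head_address}:{port}"


# Example with a placeholder address (not a real node of this cluster)
print(ray_client_url("10.0.0.1"))  # ray://10.0.0.1:10001
```

With such a helper, the connection cell could be written as `ray.init(ray_client_url(cluster.get_head_address()))`.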

Now we check that all resources are set properly.

In [6]:
print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))


This cluster consists of
    4 nodes in total
    72.0 CPU resources in total
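These numbers can be checked against the configuration: the cluster has `NUM_WORKERS+1` nodes, and the launch summary shows that the head node is also started with `--num-cpus=18`. A minimal sketch of the arithmetic, where `expected_resources` is a hypothetical helper and not part of SmartSim or Ray:

```python
NUM_WORKERS = 3
CPUS_PER_WORKER = 18


def expected_resources(num_workers, cpus_per_node):
    """Return (nodes, cpus) expected for a cluster with one head node.

    Assumes the head node is limited to the same CPU count as the workers.
    """
    nodes = num_workers + 1  # workers plus the head node
    return nodes, nodes * cpus_per_node


nodes, cpus = expected_resources(NUM_WORKERS, CPUS_PER_WORKER)
print(f"{nodes} nodes, {cpus} CPUs")  # 4 nodes, 72 CPUs
```

Ray reports the CPU count as a float, which is why the cell output shows `72.0`.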



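Finally, we can run a Ray Tune example to see that everything is working. It trains PPO on `CartPole-v0`, running a grid search over 50 learning rates between 0.001 and 0.01, and stops each trial once the maximum episode reward reaches 200.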

In [None]:
tune.run(
    "PPO",
    stop={"episode_reward_max": 200},
    config={
        "framework": "torch",
        "env": "CartPole-v0",
        "num_gpus": 0,
        "lr": tune.grid_search(np.linspace(0.001, 0.01, 50).tolist()),
        "log_level": "ERROR",
    },
    local_dir="/lus/scratch/arigazzi/ray_local/",
    verbose=0,
    fail_fast=True,
    log_to_file=True,
)

## 3. Stop cluster and release allocation

In [7]:
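ray.shutdown()      # disconnect the Ray client
exp.stop(cluster)   # stop the head and worker jobs
if alloc:
    # `slurm` is not imported earlier in this notebook. Assumption: in this
    # SmartSim version it is available as `from smartsim.launcher import slurm`.
    from smartsim.launcher import slurm
    slurm.release_allocation(alloc)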

04:52:31 osprey.us.cray.com SmartSim[130213] INFO Stopping model ray_head with job name ray_head-CDFQNJIKMLYB
04:52:31 osprey.us.cray.com SmartSim[130213] INFO Stopping model ray_worker_0 with job name ray_worker_0-CDFQNJIKOCDV
04:52:32 osprey.us.cray.com SmartSim[130213] INFO Stopping model ray_worker_1 with job name ray_worker_1-CDFQNJIKOQDM
04:52:32 osprey.us.cray.com SmartSim[130213] INFO Stopping model ray_worker_2 with job name ray_worker_2-CDFQNJIKP23M
