Ray Cluster features (#50)
* Adds notebook to run head_node.

* Add Ray notebook which can start cluster head node.

* Workarounds for Ray/Jupyter resource problems.

* Working setup for XC.

* Insert CPU control for SLURM Ray.

* Adds Ray cluster setup.

* Refactor RayWorker and RayHead.

* Adds batch capabilities to Ray Slurm launcher.

* Add preamble to batch_settings

* Fixes some issues on CCM for Ray

* Redirect cmd server output for Ray.

* Avoid Ray workers overlapping with head.

* Change Slurm configuration for Ray.

* Remove command server and use Ray client.

* Add ray cluster for PBS batch.

* Adds PBS functionalities for Ray cluster.

* Adds Ray starter script.

* Add new starter for Ray

* Fix batch args bug for RayCluster

* Add preamble mechanism to BatchSettings.

* Remove old ray starter script.

* Fixes tests for new interface.

* Fix entity utils test.

* Delete issues.md

* Add local launcher and test for Ray.

* Delete 05_starting_ray.ipynb

Removes unused tutorial.

* Delete manual-start.sh

Remove unused script.

* Delete start-head.sh

Remove unused script.

* Delete start-worker.sh

Remove unused script.

* Update requirements.

* Remove check for slurm launcher and rely on exception.

* Address reviewer's comments. Add ray_args.

* Modify Ray tutorial.

* Merge branch.

* Adds Ray to manifest.

* Fix to raystarter.py

* Add manifest.ray_clusters to exp.stop()

* Removes egg files.

* Fixes wrong option for Ray and aprun

* Address review, add flexibility to ray cluster

* Add API functions to RayCluster

* Fix for internal ray args

* Add dashboard port to raystarter args.

* Apply styling.

* Add tests for ray on slurm

* Fix slurm in alloc ray test

* Add new information to the README and OA tutorial

The README has been updated with new examples and
usage patterns.

A new Online Analysis example has been created with
a Lattice Boltzmann simulation that shows users
how to perform streaming analysis with SmartSim.

More to come.

* Link Online Analysis example into docs

Fixed formatting in the README, added the OA
example to the docs, and converted it to RST.

* Add visualization to README

* Add PBS tests and pass ray_exe to raystarter

* Remove expand_exe option

* Remove duplicate function

* Remove unused ray template

* Add egg-info to gitignore

* Remove expand_exe from RayCluster

* Fix exe_path

* Adds ray API to docs

* Move set_cpus out of launch tests.

* Fix characters in options for PBSOrchestrator

* Fix non-UTF-8 chars in options

* Fix Cobalt options

* Fix options in AprunSettings.

* Allow multiple trials for Ray tests

* Fix ray launch summary

* Address local launcher for Ray

The local launcher for Ray was broken
on Mac because of how Ray does IP lookups
for local addresses.

The local launcher functionality has been
taken out, and Slurm and PBS are now the
only supported launchers.

Minor cleanup of the RayCluster classes as
well. Changed the inheritance structure from
Model to SmartSimEntity.

* Make RayCluster closer to SmartSim paradigm

* Modifies starter to bind dashboard to all interfaces

* Revert port for dashboard, update github wf

* Add password option to RayCluster

* Remove output from Notebook

* Remove log level env from ray notebook

* Adapt ray tests.

* Bump up ray version

* Remove unused launch branch, fix ray summary

* Remove unused launcher branch

* Apply styling

* Remove notebook output

* Remove block_in_batch feature

* Fix ALPS regression introduced in this branch

* Fix settings

* Fix docstring

* Add ignore flag for ray batch tests.

* Change ray_started args

* Update docstrings

* Update TODO list in raycluster.py

* Add disclaimer to notebook, license to raycluster

* Make RayCluster error more useful

* Add ray.shutdown to tests.

* Add interface to Ray PBS tests. Apply styling

* Remove useless _vars from RayCluster

* Remove unused attributes from RayCluster

* Remove ray_head variable

* Fix new variables for RayCluster

* Make RayCluster functions static

* Modify notebook

* Update Ray path to exp

* Update docs, removed unused function

* Extend wait time for Ray head log

* Remove node retrieval for Ray

* Update notebook and summary for RayCluster

* Fixes to Ray docs

* Add interface to Ray tests

* Apply styling

* Address reviewer's comments

* Apply styling

* Add SSH tunneling instructions to Ray notebook

* Change workers parameter to num_nodes for clarity

* Update Ray tests

* Add reviewer's suggestions and dashboard host fix

* Update Ray tutorial

* Restrict dashboard option to head only

* Correct typo in Ray notebook

* Add some info to notebook.

* Fix typo in notebook

Co-authored-by: Sam Partee <spartee@hpe.com>
al-rigazzi and Sam Partee committed Oct 11, 2021
1 parent 266efb8 commit ebd72b0
Showing 37 changed files with 1,584 additions and 41 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/run_local_tests.yml
@@ -47,13 +47,13 @@ jobs:
if: matrix.python-version != '3.9'
run: |
echo "$(brew --prefix)/opt/make/libexec/gnubin" >> $GITHUB_PATH
python -m pip install -vvv .[dev,ml]
python -m pip install -vvv .[dev,ml,ray]
- name: Install SmartSim
if: matrix.python-version == '3.9'
run: |
echo "$(brew --prefix)/opt/make/libexec/gnubin" >> $GITHUB_PATH
python -m pip install -vvv .[dev]
python -m pip install -vvv .[dev,ray]
- name: Install ML Runtimes with Smart (with pt and tf)
if: contains(matrix.os, 'macos')
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@ __pycache__
.pytest_cache/
.coverage*
htmlcov
smartsim.egg-info

# Dependencies
third-party
78 changes: 78 additions & 0 deletions README.md
@@ -81,6 +81,9 @@ independently.
- [Local Launch](#local-launch)
- [Interactive Launch](#interactive-launch)
- [Batch Launch](#batch-launch)
- [Ray](#ray)
- [Ray on Slurm](#ray-on-slurm)
- [Ray on PBS](#ray-on-pbs)
- [SmartRedis](#smartredis)
- [Tensors](#tensors)
- [Datasets](#datasets)
@@ -288,6 +291,7 @@ python hello_ensemble_pbs.py

# Infrastructure Library Applications
- Orchestrator - In-memory data store and Machine Learning Inference (Redis + RedisAI)
- Ray - Distributed Reinforcement Learning (RL), Hyperparameter Optimization (HPO)

## Redis + RedisAI

@@ -415,6 +419,80 @@ exp.stop(db_cluster)
python run_db_pbs_batch.py
```

-----
## Ray

Ray is a distributed computation framework that supports a number of applications:
- RLlib - Distributed Reinforcement Learning (RL)
- RaySGD - Distributed Training
- Ray Tune - Hyperparameter Optimization (HPO)
- Ray Serve - ML/DL inference

Ray also integrates with other frameworks such as Modin, Mars, Dask, and Spark.

Historically, Ray has not been well supported on HPC systems. A few examples exist,
but none are well maintained. Because SmartSim already has launchers for HPC systems,
launching Ray through SmartSim is a relatively simple task.

### Ray on Slurm

Below is an example of how to launch a Ray cluster on a Slurm system and connect to it.
In this example, we set `batch=True`, which means that the cluster will be started
by requesting an allocation through Slurm. If this code is run within a sufficiently
large interactive allocation, setting `batch=False` will spin up the Ray cluster on
the allocated nodes.

```Python
import ray

from smartsim import Experiment
from smartsim.exp.ray import RayCluster

exp = Experiment("ray-cluster", launcher='slurm')
# 3 workers + 1 head node = 4 node-cluster
cluster = RayCluster(name="ray-cluster", run_args={},
                     ray_args={"num-cpus": 24},
                     launcher='slurm', workers=3, batch=True)

exp.generate(cluster, overwrite=True)
exp.start(cluster, block=False, summary=True)

# Connect to the Ray cluster
ray.util.connect(cluster.head_model.address+":10001")

# <run Ray tune, RLlib, HPO...>
```
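
Newer Ray releases also accept a `ray://` address passed directly to `ray.init`,
which should be available with the `ray>=1.6` pin added in this commit. A minimal
sketch, assuming the head node exposes Ray's default client port of 10001:

```Python
import ray

# Connect the Ray client via the ray:// URI scheme,
# assuming the default client server port 10001 on the head node.
ray.init(f"ray://{cluster.head_model.address}:10001")

# Quick sanity check that the workers have joined the cluster.
print(ray.cluster_resources())
```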


### Ray on PBS

Below is an example of how to launch a Ray cluster on a PBS system and connect to it.
In this example, we set `batch=True`, which means that the cluster will be started
by requesting an allocation through PBS. If this code is run within a sufficiently
large interactive allocation, setting `batch=False` will spin up the Ray cluster on
the allocated nodes.

```Python
import ray

from smartsim import Experiment
from smartsim.exp.ray import RayCluster

exp = Experiment("ray-cluster", launcher='pbs')
# 3 workers + 1 head node = 4 node-cluster
cluster = RayCluster(name="ray-cluster", run_args={},
                     ray_args={"num-cpus": 24},
                     launcher='pbs', workers=3, batch=True)

exp.generate(cluster, overwrite=True)
exp.start(cluster, block=False, summary=True)

# Connect to the Ray cluster
ray.util.connect(cluster.head_model.address+":10001")

# <run Ray tune, RLlib, HPO...>
```
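
When the workload is finished, the Ray client can be disconnected and the cluster
released through the experiment, since the `RayCluster` is registered in the
experiment's manifest. A minimal teardown sketch, assuming the `cluster` and `exp`
objects from the examples above:

```Python
import ray

# Disconnect the Ray client from the cluster.
ray.shutdown()

# Stop the RayCluster entity (and release the batch allocation, if any).
exp.stop(cluster)

# Print a summary of what the experiment launched.
print(exp.summary())
```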


------
# SmartRedis

18 changes: 18 additions & 0 deletions doc/api/smartsim_api.rst
@@ -490,3 +490,21 @@ Slurm

.. automodule:: smartsim.launcher.slurm.slurm
:members:


Ray
===

.. currentmodule:: smartsim.exp.ray

.. _ray_api:

``RayCluster`` is used to launch a Ray cluster,
either as a batch job or within an interactive allocation.

.. autoclass:: RayCluster
:show-inheritance:
:members:
:inherited-members:
:undoc-members:
:exclude-members: batch set_path type
2 changes: 2 additions & 0 deletions doc/index.rst
@@ -14,6 +14,8 @@
:caption: Tutorials

tutorials/01_getting_started/01_getting_started
tutorials/03_online_analysis/03_online_analysis
tutorials/05_starting_ray/05_starting_ray
tutorials/using_clients
tutorials/lattice_boltz_analysis
tutorials/inference
2 changes: 2 additions & 0 deletions setup.cfg
@@ -66,6 +66,8 @@ doc=
sphinx-fortran==1.1.1
nbsphinx>=0.8.2

ray=
ray>=1.6

[options.packages.find]
exclude =
2 changes: 1 addition & 1 deletion smart
@@ -245,7 +245,7 @@ def clean(install_path, _all=False):
    rai_path = lib_path.joinpath("redisai.so")
    if rai_path.is_file():
        rai_path.unlink()
        print("Succesfully removed existing RedisAI installation")
        print("Successfully removed existing RedisAI installation")

    backend_path = lib_path.joinpath("backends")
    if backend_path.is_dir():
11 changes: 8 additions & 3 deletions smartsim/control/controller.py
@@ -247,9 +247,11 @@ def init_launcher(self, launcher):
        elif launcher == "pbs":
            self._launcher = PBSLauncher()
            self._jobs.set_launcher(self._launcher)
        # Init Cobalt launcher
        elif launcher == "cobalt":
            self._launcher = CobaltLauncher()
            self._jobs.set_launcher(self._launcher)
        # Init LSF launcher
        elif launcher == "lsf":
            self._launcher = LSFLauncher()
            self._jobs.set_launcher(self._launcher)
@@ -275,10 +277,13 @@ def _launch(self, manifest):
                raise SmartSimError(msg)
            self._launch_orchestrator(orchestrator)

        for rc in manifest.ray_clusters:
            rc._update_workers()

        # create all steps prior to launch
        steps = []

        for elist in manifest.ensembles:
        all_entity_lists = manifest.ensembles + manifest.ray_clusters
        for elist in all_entity_lists:
            if elist.batch:
                batch_step = self._create_batch_job_step(elist)
                steps.append((batch_step, elist))
@@ -498,7 +503,7 @@ def _orchestrator_launch_wait(self, orchestrator):
            if not self._jobs.actively_monitoring:
                self._jobs.check_jobs()

            # _jobs.get_status aquires JM lock for main thread, no need for locking
            # _jobs.get_status acquires JM lock for main thread, no need for locking
            statuses = self.get_entity_list_status(orchestrator)
            if all([stat == STATUS_RUNNING for stat in statuses]):
                ready = True
23 changes: 22 additions & 1 deletion smartsim/control/manifest.py
@@ -28,11 +28,12 @@
from ..entity import EntityList, SmartSimEntity
from ..error import SmartSimError
from ..error.errors import SmartSimError
from ..exp.ray import RayCluster

# List of types derived from EntityList which require specific behavior
# A corresponding property needs to exist (like db for Orchestrator),
# otherwise they will not be accessible
entity_list_exception_types = [Orchestrator]
entity_list_exception_types = [Orchestrator, RayCluster]


class Manifest:
@@ -83,6 +84,26 @@ def ensembles(self):

        return _ensembles

    @property
    def ray_clusters(self):
        _ray_cluster = []
        for deployable in self._deployables:
            if isinstance(deployable, RayCluster):
                _ray_cluster.append(deployable)
        return _ray_cluster

    @property
    def all_entity_lists(self):
        """All entity lists, including ensembles and
        exceptional ones like Orchestrator and Ray Clusters
        """
        _all_entity_lists = self.ray_clusters + self.ensembles
        db = self.db
        if db is not None:
            _all_entity_lists.append(db)

        return _all_entity_lists

    def _check_names(self, deployables):
        used = []
        for deployable in deployables:
2 changes: 0 additions & 2 deletions smartsim/entity/entity.py
@@ -24,8 +24,6 @@
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import os.path as osp


class SmartSimEntity:
    def __init__(self, name, path, run_settings):
3 changes: 1 addition & 2 deletions smartsim/entity/entityList.py
@@ -43,8 +43,7 @@ def batch(self):
        try:
            if self.batch_settings:
                return True
            else:
                return False
            return False
        # local orchestrator cannot launch with batches
        except AttributeError:
            return False
1 change: 1 addition & 0 deletions smartsim/exp/ray/__init__.py
@@ -0,0 +1 @@
from .raycluster import RayCluster, parse_ray_head_node_address
