Ray Cluster features (#50)
* Adds notebook to run head_node.

* Add Ray notebook which can start cluster head node.

* Workarounds for Ray/Jupyter resource problems.

* Working setup for XC.

* Insert CPU control for SLURM Ray.

* Adds Ray cluster setup.

* Refactor RayWorker and RayHead.

* Adds batch capabilities to Ray Slurm launcher.

* Add preamble to batch_settings

* Fixes some issues on CCM for Ray

* Redirect cmd server output for Ray.

* Avoid Ray workers overlapping with head.

* Change Slurm configuration for Ray.

* Remove command server and use Ray client.

* Add ray cluster for PBS batch.

* Adds PBS functionalities for Ray cluster.

* Adds Ray starter script.

* Add new starter for Ray

* Fix batch args bug for RayCluster

* Add preamble mechanism to BatchSettings.

* Remove old ray starter script.

* Fixes tests for new interface.

* Fix entity utils test.

* Delete issues.md

* Add local launcher and test for Ray.

* Delete 05_starting_ray.ipynb

Removes unused tutorial.

* Delete manual-start.sh

Remove unused script.

* Delete start-head.sh

Remove unused script.

* Delete start-worker.sh

Remove unused script.

* Update requirements.

* Remove check for slurm launcher and rely on exception.

* Address reviewer's comments. Add ray_args.

* Modify Ray tutorial.

* Merge branch.

* Adds Ray to manifest.

* Fix to raystarter.py

* Add manifest.ray_clusters to exp.stop()

* Removes egg files.

* Fixes wrong option for Ray and aprun

* Address review, add flexibility to ray cluster

* Add API functions to RayCluster

* Fix for internal ray args

* Add dashboard port to raystarter args.

* Apply styling.

* Add tests for ray on slurm

* Fix slurm in alloc ray test

* Add new information to the README and OA tutorial

The README has been updated with new examples and
usage patterns.

A new Online Analysis example has been created with
a Lattice Boltzmann simulation that shows users
how to perform streaming analysis with SmartSim.

More to come.

* Link Online Analysis example into docs

Fixed formatting in the README, added the OA
example to the docs, and converted it to RST.

* Add visualization to README

* Add PBS tests and pass ray_exe to raystarter

* Remove expand_exe option

* Remove duplicate function

* Remove unused ray template

* Add egg-info to gitignore

* Remove expand_exe from RayCluster

* Fix exe_path

* Adds ray API to docs

* Move set_cpus out of launch tests.

* Fix characters in options for PBSOrchestrator

* Fix non-UTF-8 chars in options

* Fix Cobalt options

* Fix options in AprunSettings.

* Allow multiple trials for Ray tests

* Fix ray launch summary

* Address local launcher for Ray

The local launcher for Ray was broken
on Mac because of how Ray does IP lookups
for local addresses.

The local launcher functionality has been
taken out, and Slurm and PBS are now the
only supported launchers.

Minor cleanup of the RayCluster classes as
well. Changed the inheritance structure from
Model to SmartSimEntity.

* Make RayCluster closer to SmartSim paradigm

* Modifies starter to bind dashboard to all interfaces

* Revert port for dashboard, update github wf

* Add password option to RayCluster

* Remove output from Notebook

* Remove log level env from ray notebook

* Adapt ray tests.

* Bump up ray version

* Remove unused launch branch, fix ray summary

* Remove unused launcher branch

* Apply styling

* Remove notebook output

* Remove block_in_batch feature

* Fix ALPS regression introduced in this branch

* Fix settings

* Fix docstring

* Add ignore flag for ray batch tests.

* Change ray_started args

* Update docstrings

* Update TODO list in raycluster.py

* Add disclaimer to notebook, license to raycluster

* Make RayCluster error more useful

* Add ray.shutdown to tests.

* Add interface to Ray PBS tests. Apply styling

* Remove useless _vars from RayCluster

* Remove unused attributes from RayCluster

* Remove ray_head variable

* Fix new variables for RayCluster

* Make RayCluster functions static

* Modify notebook

* Update Ray path to exp

* Update docs, removed unused function

* Extend wait time for Ray head log

* Remove node retrieval for Ray

* Update notebook and summary for RayCluster

* Fixes to Ray docs

* Add interface to Ray tests

* Apply styling

* Address reviewer's comments

* Apply styling

* Add SSH tunneling instructions to Ray notebook

* Change workers parameter to num_nodes for clarity

* Update Ray tests

* Add reviewer's suggestions and dashboard host fix

* Update Ray tutorial

* Restrict dashboard option to head only

* Correct typo in Ray notebook

* Add some info to notebook.

* Fix typo in notebook

Co-authored-by: Sam Partee <spartee@hpe.com>
al-rigazzi and Sam Partee committed Oct 11, 2021
1 parent 266efb8 commit ebd72b0
Showing 37 changed files with 1,584 additions and 41 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/run_local_tests.yml
@@ -47,13 +47,13 @@ jobs:
if: matrix.python-version != '3.9'
run: |
echo "$(brew --prefix)/opt/make/libexec/gnubin" >> $GITHUB_PATH
python -m pip install -vvv .[dev,ml]
python -m pip install -vvv .[dev,ml,ray]
- name: Install SmartSim
if: matrix.python-version == '3.9'
run: |
echo "$(brew --prefix)/opt/make/libexec/gnubin" >> $GITHUB_PATH
python -m pip install -vvv .[dev]
python -m pip install -vvv .[dev,ray]
- name: Install ML Runtimes with Smart (with pt and tf)
if: contains(matrix.os, 'macos')
1 change: 1 addition & 0 deletions .gitignore
@@ -4,6 +4,7 @@ __pycache__
.pytest_cache/
.coverage*
htmlcov
smartsim.egg-info

# Dependencies
third-party
78 changes: 78 additions & 0 deletions README.md
@@ -81,6 +81,9 @@ independently.
- [Local Launch](#local-launch)
- [Interactive Launch](#interactive-launch)
- [Batch Launch](#batch-launch)
- [Ray](#ray)
- [Ray on Slurm](#ray-on-slurm)
- [Ray on PBS](#ray-on-pbs)
- [SmartRedis](#smartredis)
- [Tensors](#tensors)
- [Datasets](#datasets)
@@ -288,6 +291,7 @@ python hello_ensemble_pbs.py

# Infrastructure Library Applications
- Orchestrator - In-memory data store and Machine Learning Inference (Redis + RedisAI)
- Ray - Distributed Reinforcement Learning (RL), Hyperparameter Optimization (HPO)

## Redis + RedisAI

@@ -415,6 +419,80 @@ exp.stop(db_cluster)
python run_db_pbs_batch.py
```

-----
## Ray

Ray is a distributed computation framework that supports a number of applications:
- RLlib - Distributed Reinforcement Learning (RL)
- RaySGD - Distributed Training
- Ray Tune - Hyperparameter Optimization (HPO)
- Ray Serve - ML/DL inference

Ray also integrates with other frameworks such as Modin, Mars, Dask, and Spark.

Historically, Ray has not been well supported on HPC systems. A few examples exist,
but none are well maintained. Because SmartSim already has launchers for HPC systems,
launching Ray through SmartSim is a relatively simple task.

### Ray on Slurm

Below is an example of how to launch a Ray cluster on a Slurm system and connect to it.
In this example, we set `batch=True`, which means that the cluster will be started
by requesting an allocation through Slurm. If this code is run within a sufficiently
large interactive allocation, setting `batch=False` will spin up the Ray cluster on
the allocated nodes.

```Python
import ray

from smartsim import Experiment
from smartsim.exp.ray import RayCluster

exp = Experiment("ray-cluster", launcher='slurm')
# 3 workers + 1 head node = 4 node-cluster
cluster = RayCluster(name="ray-cluster", run_args={},
                     ray_args={"num-cpus": 24},
                     launcher='slurm', workers=3, batch=True)

exp.generate(cluster, overwrite=True)
exp.start(cluster, block=False, summary=True)

# Connect to the Ray cluster
ray.util.connect(cluster.head_model.address+":10001")

# <run Ray tune, RLlib, HPO...>
```
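
Newer Ray releases also accept a `ray://` address passed directly to `ray.init`,
which should be available with the `ray>=1.6` pin added in this commit. A minimal
sketch, assuming the head node exposes Ray's default client port of 10001:

```Python
import ray

# Connect the Ray client via the ray:// URI scheme,
# assuming the default client server port 10001 on the head node.
ray.init(f"ray://{cluster.head_model.address}:10001")

# Quick sanity check that the workers have joined the cluster.
print(ray.cluster_resources())
```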


### Ray on PBS

Below is an example of how to launch a Ray cluster on a PBS system and connect to it.
In this example, we set `batch=True`, which means that the cluster will be started
by requesting an allocation through PBS. If this code is run within a sufficiently
large interactive allocation, setting `batch=False` will spin up the Ray cluster on
the allocated nodes.

```Python
import ray

from smartsim import Experiment
from smartsim.exp.ray import RayCluster

exp = Experiment("ray-cluster", launcher='pbs')
# 3 workers + 1 head node = 4 node-cluster
cluster = RayCluster(name="ray-cluster", run_args={},
                     ray_args={"num-cpus": 24},
                     launcher='pbs', workers=3, batch=True)

exp.generate(cluster, overwrite=True)
exp.start(cluster, block=False, summary=True)

# Connect to the Ray cluster
ray.util.connect(cluster.head_model.address+":10001")

# <run Ray tune, RLlib, HPO...>
```
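
When the workload is finished, the Ray client can be disconnected and the cluster
released through the experiment, since the `RayCluster` is registered in the
experiment's manifest. A minimal teardown sketch, assuming the `cluster` and `exp`
objects from the examples above:

```Python
import ray

# Disconnect the Ray client from the cluster.
ray.shutdown()

# Stop the RayCluster entity (and release the batch allocation, if any).
exp.stop(cluster)

# Print a summary of what the experiment launched.
print(exp.summary())
```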


------
# SmartRedis

18 changes: 18 additions & 0 deletions doc/api/smartsim_api.rst
@@ -490,3 +490,21 @@ Slurm

.. automodule:: smartsim.launcher.slurm.slurm
:members:


Ray
===

.. currentmodule:: smartsim.exp.ray

.. _ray_api:

``RayCluster`` is used to launch a Ray cluster,
either as a batch job or within an interactive allocation.

.. autoclass:: RayCluster
:show-inheritance:
:members:
:inherited-members:
:undoc-members:
:exclude-members: batch set_path type
2 changes: 2 additions & 0 deletions doc/index.rst
@@ -14,6 +14,8 @@
:caption: Tutorials

tutorials/01_getting_started/01_getting_started
tutorials/03_online_analysis/03_online_analysis
tutorials/05_starting_ray/05_starting_ray
tutorials/using_clients
tutorials/lattice_boltz_analysis
tutorials/inference
2 changes: 2 additions & 0 deletions setup.cfg
@@ -66,6 +66,8 @@ doc=
sphinx-fortran==1.1.1
nbsphinx>=0.8.2

ray=
ray>=1.6

[options.packages.find]
exclude =
2 changes: 1 addition & 1 deletion smart
@@ -245,7 +245,7 @@ def clean(install_path, _all=False):
    rai_path = lib_path.joinpath("redisai.so")
    if rai_path.is_file():
        rai_path.unlink()
        print("Succesfully removed existing RedisAI installation")
        print("Successfully removed existing RedisAI installation")

    backend_path = lib_path.joinpath("backends")
    if backend_path.is_dir():
11 changes: 8 additions & 3 deletions smartsim/control/controller.py
@@ -247,9 +247,11 @@ def init_launcher(self, launcher):
        elif launcher == "pbs":
            self._launcher = PBSLauncher()
            self._jobs.set_launcher(self._launcher)
        # Init Cobalt launcher
        elif launcher == "cobalt":
            self._launcher = CobaltLauncher()
            self._jobs.set_launcher(self._launcher)
        # Init LSF launcher
        elif launcher == "lsf":
            self._launcher = LSFLauncher()
            self._jobs.set_launcher(self._launcher)
@@ -275,10 +277,13 @@ def _launch(self, manifest):
                raise SmartSimError(msg)
            self._launch_orchestrator(orchestrator)

        for rc in manifest.ray_clusters:
            rc._update_workers()

        # create all steps prior to launch
        steps = []

        for elist in manifest.ensembles:
        all_entity_lists = manifest.ensembles + manifest.ray_clusters
        for elist in all_entity_lists:
            if elist.batch:
                batch_step = self._create_batch_job_step(elist)
                steps.append((batch_step, elist))
@@ -498,7 +503,7 @@ def _orchestrator_launch_wait(self, orchestrator):
            if not self._jobs.actively_monitoring:
                self._jobs.check_jobs()

            # _jobs.get_status aquires JM lock for main thread, no need for locking
            # _jobs.get_status acquires JM lock for main thread, no need for locking
            statuses = self.get_entity_list_status(orchestrator)
            if all([stat == STATUS_RUNNING for stat in statuses]):
                ready = True
23 changes: 22 additions & 1 deletion smartsim/control/manifest.py
@@ -28,11 +28,12 @@
from ..entity import EntityList, SmartSimEntity
from ..error import SmartSimError
from ..error.errors import SmartSimError
from ..exp.ray import RayCluster

# List of types derived from EntityList which require specific behavior
# A corresponding property needs to exist (like db for Orchestrator),
# otherwise they will not be accessible
entity_list_exception_types = [Orchestrator]
entity_list_exception_types = [Orchestrator, RayCluster]


class Manifest:
@@ -83,6 +84,26 @@ def ensembles(self):

        return _ensembles

    @property
    def ray_clusters(self):
        _ray_cluster = []
        for deployable in self._deployables:
            if isinstance(deployable, RayCluster):
                _ray_cluster.append(deployable)
        return _ray_cluster

    @property
    def all_entity_lists(self):
        """All entity lists, including ensembles and
        exceptional ones like Orchestrator and Ray Clusters
        """
        _all_entity_lists = self.ray_clusters + self.ensembles
        db = self.db
        if db is not None:
            _all_entity_lists.append(db)

        return _all_entity_lists

    def _check_names(self, deployables):
        used = []
        for deployable in deployables:
2 changes: 0 additions & 2 deletions smartsim/entity/entity.py
@@ -24,8 +24,6 @@
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

import os.path as osp


class SmartSimEntity:
    def __init__(self, name, path, run_settings):
3 changes: 1 addition & 2 deletions smartsim/entity/entityList.py
@@ -43,8 +43,7 @@ def batch(self):
        try:
            if self.batch_settings:
                return True
            else:
                return False
            return False
        # local orchestrator cannot launch with batches
        except AttributeError:
            return False
1 change: 1 addition & 0 deletions smartsim/exp/ray/__init__.py
@@ -0,0 +1 @@
from .raycluster import RayCluster, parse_ray_head_node_address
