Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ray Cluster features #50

Merged
merged 131 commits into from
Oct 11, 2021
Merged

Ray Cluster features #50

merged 131 commits into from
Oct 11, 2021

Conversation

al-rigazzi
Copy link
Collaborator

@al-rigazzi al-rigazzi commented May 26, 2021

This draft PR adds Ray cluster setup and deployment to SmartSim.

The ray.raystarter script file allows us to bypass some IP inconsistencies we encountered when starting the head node in the standard way. We need to be on the node already to correctly get the head node IP.

  • Ray works on SLURM and PBS systems
  • RayCluster is in smartsim.exp.ray
  • Ray clusters can be launched in batch and on/in allocation where possible
  • API functions to get addresses (head node, dashboard)

Missing:

  • Tests on WLM
  • Docs
  • Tutorial cleanup

@al-rigazzi al-rigazzi requested a review from Spartee May 26, 2021 15:20
Copy link
Contributor

@Spartee Spartee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Some small things primarily in testing and defaults.

setup.cfg Outdated
@@ -65,7 +65,10 @@ doc=
sphinx_rtd_theme>=0.5.0
sphinx-fortran==1.1.1
nbsphinx>=0.8.2
myst-parser>=0.15.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed anymore

del dictionary[key]


### TODO
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be taken out?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I left it till the last moment to make sure we did not forget any suggestion.

line = fp.readline()
while line:
plain_line = re.sub("\033\\[([0-9]+)(;[0-9]+)*m", "", line)
if "Local node IP:" in plain_line:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this parse from the raystarter?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this parses from the head node log.

ray_port,
run_args=None,
ray_args=None,
interface="eth0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if the default interface should be ipogif0 which is the high speed network on XC.

eth0 would be cloud and local....

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

cliargs += [f"--redis-password={args.redis_password}"]

# On some systems, ssh to compute nodes (and port forwarding) is not allowed.
# If that's the case, the user should bind the dashboard to 0.0.0.0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment seems to be about the host address but the following lines are about the port number. Im I missing something?

tests/with_ray/full_wlm/test_ray_slurm_batch.py Outdated Show resolved Hide resolved
tests/with_ray/on_wlm/test_ray_pbs.py Outdated Show resolved Hide resolved
tests/with_ray/on_wlm/test_ray_slurm.py Outdated Show resolved Hide resolved

pytestmark = pytest.mark.skip(reason="Local launch is currently disabled for Ray")

# pytestmark = pytest.mark.skipif(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we are leaving this in as a future?

"cells": [
{
"cell_type": "markdown",
"source": [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include a simple example of port forwarding here? I knew how to do it, but others might not.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I write it as text? Because I usually do this with ssh.

Copy link
Contributor

@Spartee Spartee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, minor changes and optimizations but I think this is finally ready to go in!!

Great work!! 😄

Makefile Outdated
@@ -103,23 +103,23 @@ cov:
# help: test - Build and run all tests
.PHONY: test
test:
@cd ./tests/; python -m pytest --ignore=full_wlm/
@cd ./tests/; python -m pytest --ignore=full_wlm/ --ignore=with_ray/full_wlm/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put ray with full_wlm

doc/conf.py Outdated
@@ -41,7 +41,8 @@
'sphinxfortran.fortran_domain',
'sphinxfortran.fortran_autodoc',
'breathe',
'nbsphinx'
'nbsphinx',
'myst_parser'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont need this

@@ -345,6 +350,31 @@ def _launch_orchestrator(self, orchestrator):
self._save_orchestrator(orchestrator)
logger.debug(f"Orchestrator launched on nodes: {orchestrator.hosts}")

def _launch_ray_cluster(self, ray_cluster):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could this be the normal launching function?

if on WLM, find the nodes where it was launched and
set them in the JobManager

:param orchestrator: ray cluster to launch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

orchestrator?

logger = get_logger(__name__)


def delete_elements(dictionary, key_list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets put this in utilities

@@ -120,6 +120,8 @@ def stop(self, *args):
self._control.stop_entity(entity)
for entity_list in stop_manifest.ensembles:
self._control.stop_entity_list(entity_list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possible optimization to make in manifest so that we don't have switch cases here. add method to manifest to obtain all EntityList types despite execptions. then just loop over all of them and call _control.stop_entity_list

Copy link
Contributor

@Spartee Spartee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing these. This looks great!

@al-rigazzi al-rigazzi merged commit ebd72b0 into CrayLabs:develop Oct 11, 2021
@al-rigazzi al-rigazzi deleted the rl branch October 11, 2021 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants