Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redis starter script for Orchestrator #73

Merged
merged 37 commits into from
Aug 11, 2021
Merged

Conversation

Spartee
Copy link
Contributor

@Spartee Spartee commented Aug 6, 2021

Recently it was discovered that redis will often change the IP address of a shard unless it is bound to a specific IP address.

To address this, each Orchestrator has been adapted to use the redis_starter.py script that binds the redis instance to a specific address. Essentially this is a Python script that takes in a command and does a lookup on the network interface specified by the user in order to bind the server to a specific address.

Also in this PR is a change back to redis-py and redis-py-cluster for the orchestrator functions is_active and check_cluster_status. This was after a bug was discovered in clang/pybind/cray PrgEnv. As there isn't much we can do about the bug at the moment we now require both dependencies in the library.

TODO

  • Adapt LSF
  • Test on Theta
  • Test on Swan
  • Test on Osprey (have to specific "ib0" for orchestrator tests)
  • Test on Cheyenne/Casper

All tests pass (gpu and cpu) on Horizon and MacOS.

al-rigazzi and others added 11 commits July 30, 2021 00:59
Co-authored-by: Eric Gustin <eric.gustin@hpe.com>
SmartRedis was prematurely updated to 0.2.0
that has been reverted

Also added back in is redis-py-cluster and redis-py
for the orchestrator is_active and check_cluster_status
functions.
@Spartee Spartee added area: orchestrator Issues related to the Ochestrator API, launch, and runtime type: feature Issues that include feature request or feature idea type: refactor Issues focused on refactoring existing code labels Aug 6, 2021
al-rigazzi and others added 12 commits August 9, 2021 07:42
Since the redis_starter.py script provides
the IP address given a network interface
specified by the user, we no longer need
the RedisIP module built at startup.

The IP address is now parsed from the
output of that script.
Prefixed SmartSim hyperlink targets in order to avoid
duplicate target name conflicts with SmartRedis
@al-rigazzi
Copy link
Collaborator

al-rigazzi commented Aug 10, 2021

Would it be possible to add an interface to be used for tests as we do for launchers and accounts? So that a user could do something along the lines of export SMARTSIM_TEST_INTERFACE=ib0 and run tests?

(also: it might already be in the code base and I might have just overlooked it)

al-rigazzi and others added 9 commits August 10, 2021 13:23
The database now supports binding to a network
interface through the interface argument on the
Orchestrator init

this commit adds SMARTSIM_TEST_INTERFACE to the
config so that the tests can switch interfaces
between machines
Edit Experiment API
Edit Entity API
Edit Orchestrator API"
Edit RunSettings
Edit Slurm
Some slurm systems with infiniband networks
cause problems with the socket library IP
lookup due to the listing in /etc/hosts

to get around this, the slurm launcher is no
longer responsible for obtaining hostnames for
the database. SmartSim relies on the redis_starter
script and output files for all launchers
now (was the case for all except slurm before)
@Spartee Spartee mentioned this pull request Aug 10, 2021
@Spartee Spartee merged commit 35e50c7 into CrayLabs:develop Aug 11, 2021
@Spartee Spartee deleted the temp branch August 11, 2021 17:34
@Spartee Spartee mentioned this pull request Aug 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: orchestrator Issues related to the Ochestrator API, launch, and runtime type: feature Issues that include feature request or feature idea type: refactor Issues focused on refactoring existing code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants