Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reconnect SSH after disconnect #71

Draft
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

jackbaker1001
Copy link

This PR implements feature request #64 .

When in _poll_slurm, we now do two things:

  1. When get_status is called, any conn.run() commands are run through a new method run_remote_command_with_reconnect.

  2. A new method get_ssh_conn_status is run in async with get_status. The new method polls the ssh connection with a test command and reconnects on the test command's failure.

I have checked this rather manually on my laptop (2021 Mac M1 Pro, running MacOS Ventura 13.3.1 [22E261] ) by (i) running a trivial sample workflow which is disptached to Perlmutter and (ii) turning off my wifi connection once in poll slurm and (iii) Turning it back on again once the debug level logger message:

app_log.debug( f"SSH connection error {e}, reconnecting, attempt {self.ssh_connect_retries}...")

appears. I can confirm that the ssh connection becomes reestablished and the workflow runs to completion without error.

Some things to note are

  1. This code only checks for connection drops in _poll_slurm while connection can drop and cause hangs at any call to conn.run(). If we wanted to be super tight, we should use the new method run_remote_command_with_reconnect in place of any conn.run() command

  2. Does this code also check out with the plain ssh executor? Let's discuss.

  • I have added the tests to cover my changes.
  • I have updated the documentation, VERSION, and CHANGELOG accordingly.
  • I have read the CONTRIBUTING document.

@CLAassistant
Copy link

CLAassistant commented Jun 8, 2023

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants