iHPC sessions fail silently #171

Open
ericfranz opened this issue Jun 16, 2017 · 2 comments

Comments

@ericfranz
Contributor

commented Jun 16, 2017

When an iHPC app runs into a problem and crashes after it starts running, the batch connect views treat the session as "complete" and then just delete the session.

This is a bad user experience.

If the session ends without the user explicitly pressing "delete", we should not remove the session from the list. Rather, the session should appear as "completed". If there were errors, it should give access to the output of the session to the user.
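A rough sketch of the proposed behavior, in Python for illustration only (the dashboard itself is not written this way). All names here (`Session`, `on_job_ended`, `on_user_delete`, `needs_output_link`, `exit_code`, `output_path`) are hypothetical and not taken from the codebase; using the exit code to detect errors is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    RUNNING = "running"
    COMPLETED = "completed"


@dataclass
class Session:
    id: str
    state: State = State.RUNNING
    exit_code: Optional[int] = None    # hypothetical: how an error is detected
    output_path: Optional[str] = None  # hypothetical: where the session output lives


def on_job_ended(sessions: Dict[str, Session], session_id: str, exit_code: int) -> None:
    """The job ended on its own: keep the session in the list, marked completed."""
    session = sessions[session_id]
    session.state = State.COMPLETED
    session.exit_code = exit_code


def on_user_delete(sessions: Dict[str, Session], session_id: str) -> None:
    """Only an explicit delete by the user removes the session from the list."""
    sessions.pop(session_id, None)


def needs_output_link(session: Session) -> bool:
    """Offer the session output when a completed session ended with an error."""
    return session.state is State.COMPLETED and session.exit_code not in (None, 0)
```

The point of the split is that a crash and a user delete go through different paths, so a crashed session can never disappear without the user asking for it.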

@ericfranz ericfranz added the bug label Jun 16, 2017

@nickjer

Contributor

commented Jun 19, 2017

Sessions can also be successfully closed by a user "Logging out" of the desktop or closing the GUI app from within the VNC session.

Also, do we expect users to debug issues with their sessions? The logs should be part of the documentation so that developers can debug these issues.

@ericfranz ericfranz added this to the OOD1.2 milestone Oct 15, 2017

@ericfranz ericfranz modified the milestones: OOD1.2, NEXT Nov 16, 2017

@KristinaPlazonic


commented Aug 14, 2018

I have two issues when I launch Jupyter notebook as an interactive app.

  1. If slurmctld times out while the app is polling (with squeue) to check whether the session is running, the session is deleted from the list of interactive sessions, even though the notebook still exists and is running. If you closed the browser tab with the Jupyter notebook tree, there is no way to reconnect to Jupyter unless you copied the notebook URL when it started. This timeout can happen when slurmctld is under high load. Here is the output of strace when that has happened:
poll([{fd=3, events=POLLIN}], 1, 10000) = 0 (Timeout)
fcntl(3, F_SETFL, O_RDWR)               = 0
close(3)                                = 0
write(2, "slurm_load_jobs error: Socket ti"..., 63) = 63
write(1, "CLUSTER: amarel\n", 16)       = 16
exit_group(0)                           = ?
  2. If the Jupyter notebook gets preempted by another job, both the notebook and the session are deleted without explanation, and users wonder what just happened.
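One way to avoid the first problem might be to treat a failed status query as "unknown" instead of "job is gone". Note that in the strace output above the command writes an error to stderr but still exits with status 0, so the exit code alone is not enough. The following is only a sketch in Python: `classify_poll` and `JobStatus` are made-up names, and it assumes the job ID is the first whitespace-separated field on each line of the query output.

```python
from enum import Enum


class JobStatus(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    UNKNOWN = "unknown"  # the query itself failed; nothing was learned about the job


def classify_poll(exit_code: int, stdout: str, stderr: str, job_id: str) -> JobStatus:
    """Classify one status poll from its exit code, stdout, and stderr."""
    # The query can exit 0 and still report an error on stderr,
    # so check stderr before trusting the (empty) job listing.
    if exit_code != 0 or "error" in stderr.lower():
        return JobStatus.UNKNOWN
    # Assumption: the job ID is the first field of its line in the listing.
    listed = any(line.split()[:1] == [job_id] for line in stdout.splitlines())
    return JobStatus.RUNNING if listed else JobStatus.COMPLETED
```

With something like this, the session would only be removed or marked completed on `COMPLETED`, and an `UNKNOWN` result would leave the session in the list until a later poll succeeds.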