launching a daemon process #89
Comments
[comment by blelbach] [Trac time Tue Aug 16 17:32:27 2011] The hpx_invoke.py script will time out the running process. The default timeout is an hour. I would tell them that hpx is not a daemon, and that hung jobs will be garbage collected. I will change the default timeout to 10 minutes. Please make it clear to the sysadmin that this is /not/ managing processing. The hpx_invoke.py script merely ensures cleanup of bad HPX runs.
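For readers unfamiliar with this kind of wrapper, the behaviour described above (start a program, kill it if it outlives a timeout) can be sketched roughly as follows. This is not the actual hpx_invoke.py source. The function name `run_with_timeout` is hypothetical, and the sketch uses the modern Python 3 `subprocess` API rather than whatever the 2011 script used.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_seconds):
    """Run argv as a child process and kill it if it exceeds the timeout.

    Hypothetical helper for illustration only; returns (exit_code, timed_out).
    """
    proc = subprocess.Popen(argv)
    try:
        # Block until the child exits or the timeout elapses.
        return proc.wait(timeout=timeout_seconds), False
    except subprocess.TimeoutExpired:
        proc.kill()   # forcibly terminate the hung child
        proc.wait()   # reap it so no zombie is left behind
        return proc.returncode, True


if __name__ == "__main__":
    # Demo: a child that sleeps far longer than the 1 second it is allowed.
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(sleeper, 1))
```

Note that a wrapper like this only bounds how long a stray process can live. It does not make the process a child of the batch system, so the node still sees it until the timeout fires.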
[comment by manderson] [Trac time Tue Aug 16 18:35:57 2011] Here's the latest from the sysadmin: "Well, I can't speak to how the processing is managed internally. What I do know, is that I saw processes like the ones I So, as I see it, this illustrates two problems:
In a nutshell, in some fashion your processes are staying resident, and that's a problem. We need to know that when your We are willing to work with you on this, to try to determine what's going on and what we can do about it. However, we Any suggestions?
[comment by manderson] [Trac time Tue Aug 16 19:56:02 2011] Follow-up from the sysadmins: Well, if you're using SSH to spawn the remote instances of the "hpx_invoke.py", that would make a lot of sense. You see, in order to properly manage the resource usage, etc., the remote processes need to be children of the "pbs_mom" management tool. But if you use ssh to launch the remote processes, that's not the case; they end up as children of the SSH daemon (sshd) instead. It's possible that if you delete a job, and it doesn't actively kill the instances of the remote processes, those processes could get orphaned and adopted by init. We may have an alternative, though. There exists a program called "pbsdsh" that can be used to launch remote processes as children of pbs_mom. It might take some work, but I think it could work. What level of variation exists between the command-line syntax of the various instances? Are they all the same, or is there some variation? How feasible would it be to build some variation on that launcher script to handle this?
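To make the sysadmins' suggestion concrete, here is a rough sketch of how a launcher could build its per-host command line for either transport. Everything in it is an assumption rather than the HPX launcher's real code: the helper names are hypothetical, and it assumes the site's pbsdsh accepts `-h hostname` (as TORQUE's does; check the local man page). It also relies on the node file that PBS exposes through `$PBS_NODEFILE`, which typically lists a host once per allocated core.

```python
def unique_hosts(nodefile_text):
    """Return the distinct hosts from PBS node file contents, in first-seen order."""
    hosts = []
    for line in nodefile_text.splitlines():
        host = line.strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def build_launch_argv(host, program, args, use_pbsdsh=True):
    """Build the argv that starts `program` on `host`.

    With use_pbsdsh=True the remote process is started through pbsdsh, so it
    becomes a child of pbs_mom. Otherwise it is started through ssh and ends
    up as a child of sshd. The `-h` flag is assumed, not verified for this site.
    """
    if use_pbsdsh:
        return ["pbsdsh", "-h", host, program] + list(args)
    return ["ssh", host, program] + list(args)
```

Because only the leading elements of the argv differ, any per-instance variation in the HPX arguments can stay in `args` and is unaffected by the choice of transport.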
[comment by blelbach] [Trac time Thu Aug 18 13:35:21 2011] Closing this - I opened a meta-ticket for this, #95.
[reported by manderson] [Trac time Tue Aug 16 17:20:33 2011] The supercomputer administrators just sent me an email with the following concern:
We've begun to see some rogue processes running as your user on some of our compute nodes. For example:
mwa2 2718 1 0 10:48 ? 00:00:00 python /fslhome/mwa2/parallex/install/bin/hpx_invoke.py --timeout=3600 --program=/fslhome/mwa2/parallex/install/bin/adaptive1d -p /fslhome/mwa2/compute/1dtest/idadaptive1d -t8 -Ihpx.agas.address=192.168.220.102 -Ihpx.agas.port=7910 -l2 -x192.168.220.101:7910 -w
mwa2 2719 2718 99 10:48 ? 02:19:31 /fslhome/mwa2/parallex/install/bin/adaptive1d -p /fslhome/mwa2/compute/1dtest/idadaptive1d -t8 -Ihpx.agas.address=192.168.220.102 -Ihpx.agas.port=7910 -l2 -x192.168.220.101:7910 -w
I suspect this is related to the ParalleX software that you were testing. As I was afraid of, this appears to be launching a daemon process, that manages its processing outside the realm of a normal scheduled job.
Is ParalleX really launching a daemon process that isn't shutting down properly? Or is this something else?
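In the process listing quoted in the report, the hpx_invoke.py wrapper (PID 2718) has parent PID 1, meaning it has been adopted by init, while adaptive1d (PID 2719) is still its child. A small sketch of how such orphans could be picked out of a listing, assuming the columns follow the usual `ps -ef` layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD); the function names are hypothetical:

```python
SAMPLE = [
    "mwa2 2718 1 0 10:48 ? 00:00:00 python /fslhome/mwa2/parallex/install/bin/hpx_invoke.py --timeout=3600 --program=/fslhome/mwa2/parallex/install/bin/adaptive1d -p /fslhome/mwa2/compute/1dtest/idadaptive1d -t8 -Ihpx.agas.address=192.168.220.102 -Ihpx.agas.port=7910 -l2 -x192.168.220.101:7910 -w",
    "mwa2 2719 2718 99 10:48 ? 02:19:31 /fslhome/mwa2/parallex/install/bin/adaptive1d -p /fslhome/mwa2/compute/1dtest/idadaptive1d -t8 -Ihpx.agas.address=192.168.220.102 -Ihpx.agas.port=7910 -l2 -x192.168.220.101:7910 -w",
]


def parse_ps_line(line):
    """Split one `ps -ef` style line into the fields needed here."""
    # The first seven columns are whitespace separated; the rest is the command.
    uid, pid, ppid, _c, _stime, _tty, _time, cmd = line.split(None, 7)
    return {"uid": uid, "pid": int(pid), "ppid": int(ppid), "cmd": cmd}


def find_orphans(lines):
    """Return the processes whose parent is init (PPID 1)."""
    return [p for p in map(parse_ps_line, lines) if p["ppid"] == 1]


if __name__ == "__main__":
    for proc in find_orphans(SAMPLE):
        print(proc["pid"], proc["cmd"].split()[1])
```

A PPID of 1 only shows that the original parent is gone. It does not by itself show that the process was designed as a daemon.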