launching a daemon process #89

Closed
brycelelbach opened this issue Jul 10, 2012 · 4 comments


@brycelelbach
Member

[reported by manderson] [Trac time Tue Aug 16 17:20:33 2011] The supercomputer administrators just sent me an email with the following concern:

We've begun to see some rogue processes running as your user on some of our compute nodes. For example:

mwa2 2718 1 0 10:48 ? 00:00:00 python /fslhome/mwa2/parallex/install/bin/hpx_invoke.py --timeout=3600 --program=/fslhome/mwa2/parallex/install/bin/adaptive1d -p /fslhome/mwa2/compute/1dtest/idadaptive1d -t8 -Ihpx.agas.address=192.168.220.102 -Ihpx.agas.port=7910 -l2 -x192.168.220.101:7910 -w
mwa2 2719 2718 99 10:48 ? 02:19:31 /fslhome/mwa2/parallex/install/bin/adaptive1d -p /fslhome/mwa2/compute/1dtest/idadaptive1d -t8 -Ihpx.agas.address=192.168.220.102 -Ihpx.agas.port=7910 -l2 -x192.168.220.101:7910 -w

I suspect this is related to the ParalleX software that you were testing. As I feared, this appears to be launching a daemon process that manages its processing outside the realm of a normal scheduled job.

Is ParalleX really launching a daemon process that isn't shutting down properly? Or is this something else?

@brycelelbach
Member Author

[comment by blelbach] [Trac time Tue Aug 16 17:32:27 2011] The hpx_invoke.py script will time out the running process. The default timeout is an hour. I would tell them that hpx is not a daemon, and that hung jobs will be garbage collected. I will change the default timeout to 10 minutes. Please make it clear to the sys-admin that this is /not/ managing processing. The hpx_invoke.py script merely ensures cleanup of bad HPX runs.
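
For illustration only, a minimal sketch of the kind of "start the program, kill it if it exceeds the timeout" wrapper described above. This is not the actual hpx_invoke.py; the names, argument handling, and flags here are assumptions.

```python
#!/usr/bin/env python
# Minimal sketch of a timeout wrapper (not the real hpx_invoke.py):
# launch the program, and kill it if it outlives the given timeout so
# that no HPX process lingers after a bad run.
import subprocess
import sys
import time

def run_with_timeout(argv, timeout_seconds):
    """Launch argv and terminate it if it runs longer than timeout_seconds."""
    child = subprocess.Popen(argv)
    deadline = time.time() + timeout_seconds
    while child.poll() is None:
        if time.time() > deadline:
            child.kill()   # hung run: clean it up so nothing stays resident
            child.wait()
            return 1
        time.sleep(1)
    return child.returncode

if __name__ == "__main__":
    # e.g. python invoke_sketch.py 600 /path/to/adaptive1d -t8 -w
    sys.exit(run_with_timeout(sys.argv[2:], int(sys.argv[1])))
```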

@brycelelbach
Member Author

[comment by manderson] [Trac time Tue Aug 16 18:35:57 2011] Here's the latest from the sysadmin:

"Well, I can't speak to how the processing is managed internally. What I do know, is that I saw processes like the ones I
cited, that were using the full number of processors on the host. The output I pasted, was from "ps -ef", so the third
field is the parent-processID, and you can see that it was branching from PID 1, or the "init" process. Definitely not
the "sshd" or the "pbs_mom" process, which was what I was afraid of.

So, as I see it, this illustrates two problems:

  • Processing was occurring outside of a job - if this occurs, then you're acquiring processing time outside of the control of the scheduling system.
  • The processes that were doing the processing were not children of the "pbs_mom" management daemon - if this occurs, then the scheduling system cannot properly account for the resources being used, nor kill processes when the job ends.

In a nutshell, in some fashion your processes are staying resident, and that's a problem. We need to know that when your jobs terminate, they will clean up after themselves. Unfortunately, this needs to occur immediately. No amount of processing outside of a job, whether it's 10 minutes, 1 hour, or 1 day, is acceptable practice on our cluster.

We are willing to work with you on this, to try to determine what's going on and what we can do about it. However, we will need more information about the launching mechanism, etc., as I was asking about before."

Any suggestions?

@brycelelbach
Member Author

[comment by manderson] [Trac time Tue Aug 16 19:56:02 2011] Follow up from sysadmins:

Well, if you're using SSH to spawn the remote instances of "hpx_invoke.py", that would make a lot of sense. You see, in order to properly manage the resource usage, etc., the remote processes need to be children of the "pbs_mom" management tool. But if you use ssh to launch the remote processes, that's not the case; they end up as children of the SSH daemon (sshd) instead. It's possible that if you delete a job and it doesn't actively kill the remote processes, those processes get orphaned and adopted by init in the process.

We may have an alternative, though. There exists a program called "pbsdsh" that can be used to launch remote processes as children of pbs_mom. It might take some work, but I think it could work. What level of variation exists between the command-line syntax of the various instances? Are they all the same, or is there some variation? How feasible would it be to build some variation on that launcher script to handle this?
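
A rough sketch of how a launcher along these lines might drive pbsdsh from a job script, assuming a typical Torque/PBS environment. The -n flag usage, the node counting via $PBS_NODEFILE, and the HPX command line shown are assumptions to check against the cluster's installation, not a tested recipe.

```python
#!/usr/bin/env python
# Illustrative sketch only: launch one HPX locality per allocated node through
# pbsdsh, so every remote process is a child of pbs_mom rather than sshd.
import os
import subprocess

def count_nodes(nodefile):
    """Count distinct hosts listed in $PBS_NODEFILE (one line per allocated slot)."""
    with open(nodefile) as f:
        return len(set(line.strip() for line in f if line.strip()))

def launch_localities(program, args, num_nodes):
    """Start `program` on each node of the allocation via pbsdsh and wait for all."""
    children = [subprocess.Popen(["pbsdsh", "-n", str(i), program] + args)
                for i in range(num_nodes)]
    return [child.wait() for child in children]

if __name__ == "__main__":
    nodes = count_nodes(os.environ["PBS_NODEFILE"])
    launch_localities("/fslhome/mwa2/parallex/install/bin/adaptive1d",
                      ["-t8", "-w"], nodes)
```

Because pbsdsh runs the commands under pbs_mom, the scheduler can account for and kill them when the job ends, which is the behavior the sysadmin is asking for.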

@brycelelbach
Member Author

[comment by blelbach] [Trac time Thu Aug 18 13:35:21 2011] Closing this - I opened a meta-ticket for this, #95.
