launching a daemon process #89

Closed
brycelelbach opened this issue Jul 10, 2012 · 4 comments


@brycelelbach
Member

[reported by manderson] [Trac time Tue Aug 16 17:20:33 2011] The supercomputer administrators just sent me an email with the following concern:

We've begun to see some rogue processes running as your user on some of our compute nodes. For example:

mwa2 2718 1 0 10:48 ? 00:00:00 python /fslhome/mwa2/parallex/install/bin/hpx_invoke.py --timeout=3600 --program=/fslhome/mwa2/parallex/install/bin/adaptive1d -p /fslhome/mwa2/compute/1dtest/idadaptive1d -t8 -Ihpx.agas.address=192.168.220.102 -Ihpx.agas.port=7910 -l2 -x192.168.220.101:7910 -w
mwa2 2719 2718 99 10:48 ? 02:19:31 /fslhome/mwa2/parallex/install/bin/adaptive1d -p /fslhome/mwa2/compute/1dtest/idadaptive1d -t8 -Ihpx.agas.address=192.168.220.102 -Ihpx.agas.port=7910 -l2 -x192.168.220.101:7910 -w

I suspect this is related to the ParalleX software that you were testing. As I feared, this appears to be launching a daemon process that manages its processing outside the realm of a normal scheduled job.

Is ParalleX really launching a daemon process that isn't shutting down properly? Or is this something else?

@brycelelbach
Member Author

[comment by blelbach] [Trac time Tue Aug 16 17:32:27 2011] The hpx_invoke.py script will time out the running process. The default timeout is an hour. I would tell them that hpx is not a daemon, and that hung jobs will be garbage collected. I will change the default timeout to 10 minutes. Please make it clear to the sys-admin that this is /not/ managing processing. The hpx_invoke.py script merely ensures cleanup of bad HPX runs.
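
For illustration only, a minimal sketch of the kind of "start the program, kill it if it exceeds the timeout" wrapper described above. This is not the actual hpx_invoke.py; the names, argument handling, and flags here are assumptions.

```python
#!/usr/bin/env python
# Minimal sketch of a timeout wrapper (not the real hpx_invoke.py):
# launch the program, and kill it if it outlives the given timeout so
# that no HPX process lingers after a bad run.
import subprocess
import sys
import time

def run_with_timeout(argv, timeout_seconds):
    """Launch argv and terminate it if it runs longer than timeout_seconds."""
    child = subprocess.Popen(argv)
    deadline = time.time() + timeout_seconds
    while child.poll() is None:
        if time.time() > deadline:
            child.kill()   # hung run: clean it up so nothing stays resident
            child.wait()
            return 1
        time.sleep(1)
    return child.returncode

if __name__ == "__main__":
    # e.g. python invoke_sketch.py 600 /path/to/adaptive1d -t8 -w
    sys.exit(run_with_timeout(sys.argv[2:], int(sys.argv[1])))
```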

@brycelelbach
Member Author

[comment by manderson] [Trac time Tue Aug 16 18:35:57 2011] Here's the latest from the sysadmin:

"Well, I can't speak to how the processing is managed internally. What I do know, is that I saw processes like the ones I
cited, that were using the full number of processors on the host. The output I pasted, was from "ps -ef", so the third
field is the parent-processID, and you can see that it was branching from PID 1, or the "init" process. Definitely not
the "sshd" or the "pbs_mom" process, which was what I was afraid of.

So, as I see it, this illustrates two problems:

  • Processing was occurring outside of a job - if this occurs, then you're acquiring processing time outside of the control of the scheduling system.
  • The processes that were doing the processing were not children of the "pbs_mom" management daemon - if this occurs, then the scheduling system cannot properly account for the resources being used, nor kill processes when the job ends.

In a nutshell, in some fashion your processes are staying resident, and that's a problem. We need to know that when your jobs terminate, they will clean up after themselves. Unfortunately, this needs to occur immediately. No amount of processing outside of a job, whether it's 10 minutes, 1 hour, or 1 day, is acceptable practice on our cluster.

We are willing to work with you on this, to try to determine what's going on and what we can do about it. However, we will need more information about the launching mechanism, etc., as I was asking about before."

Any suggestions?

@brycelelbach
Member Author

[comment by manderson] [Trac time Tue Aug 16 19:56:02 2011] Follow up from sysadmins:

Well, if you're using SSH to spawn the remote instances of "hpx_invoke.py", that would make a lot of sense. You see, in order to properly manage the resource usage, etc., the remote processes need to be children of the "pbs_mom" management tool. But if you use ssh to launch the remote processes, that's not the case; they end up as children of the SSH daemon (sshd) instead. It's possible that if you delete a job and it doesn't actively kill the remote processes, those processes get orphaned and adopted by init in the process.

We may have an alternative, though. There exists a program called "pbsdsh" that can be used to launch remote processes as children of pbs_mom. It might take some work, but I think it could work. What level of variation exists between the command-line syntax of the various instances? Are they all the same, or is there some variation? How feasible would it be to build some variation on that launcher script to handle this?
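
A rough sketch of how a launcher along these lines might drive pbsdsh from a job script, assuming a typical Torque/PBS environment. The -n flag usage, the node counting via $PBS_NODEFILE, and the HPX command line shown are assumptions to check against the cluster's installation, not a tested recipe.

```python
#!/usr/bin/env python
# Illustrative sketch only: launch one HPX locality per allocated node through
# pbsdsh, so every remote process is a child of pbs_mom rather than sshd.
import os
import subprocess

def count_nodes(nodefile):
    """Count distinct hosts listed in $PBS_NODEFILE (one line per allocated slot)."""
    with open(nodefile) as f:
        return len(set(line.strip() for line in f if line.strip()))

def launch_localities(program, args, num_nodes):
    """Start `program` on each node of the allocation via pbsdsh and wait for all."""
    children = [subprocess.Popen(["pbsdsh", "-n", str(i), program] + args)
                for i in range(num_nodes)]
    return [child.wait() for child in children]

if __name__ == "__main__":
    nodes = count_nodes(os.environ["PBS_NODEFILE"])
    launch_localities("/fslhome/mwa2/parallex/install/bin/adaptive1d",
                      ["-t8", "-w"], nodes)
```

Because pbsdsh runs the commands under pbs_mom, the scheduler can account for and kill them when the job ends, which is the behavior the sysadmin is asking for.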

@brycelelbach
Member Author

[comment by blelbach] [Trac time Thu Aug 18 13:35:21 2011] Closing this - I opened a meta-ticket for this, #95.
