Clarify intended usage of --container-name #30

Closed
sfeltman opened this issue Oct 22, 2020 · 17 comments

@sfeltman

We had been attempting to use --container-name to share enroot containers across Slurm job arrays. This ended up causing a lot of issues due to container PID sharing between array jobs running on the same machine (we didn't realize Pyxis did this until reading its code). While that could be fixed with some sort of option to disable PID sharing, commit a35027c added a "pyxis_$JOBID" prefix to the container name, which breaks this approach anyway.

Please clarify the intended usage of --container-name. We had been hoping to use it to speed up array jobs that use large containers on the same machine, while manually managing the enroot container import directory before and after the job array.
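Roughly, the pattern we were going for looks like the following (the image path, container name, and per-task script are illustrative placeholders rather than our exact setup):

#!/bin/bash
#SBATCH --array=0-31
# Every array task landing on a node where the named container already exists
# was expected to reuse it instead of unpacking the image again.
srun --container-image=/shared/images/bigimage.sqsh \
     --container-name=bigimage_ws \
     ./process_shard.sh "$SLURM_ARRAY_TASK_ID"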
 
Thanks

@flx42 (Member) commented Oct 23, 2020

Hello @sfeltman, the intent was to save container state across job steps, for example within an sbatch script or a salloc. On our cluster we had a Slurm epilog to manually clean up the named containers at the end of the job, and the commit above was part of a change to move this cleanup logic into Pyxis directly. We didn't want to allow named containers to be shared across different jobs, since it's usually challenging to make sure you land on the same nodes across jobs.
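A minimal sketch of that step-level pattern (the image and commands are just examples):

#!/bin/bash
# The first step imports the image and creates the named container.
srun --container-image=ubuntu:20.04 --container-name=work apt-get update
# Later steps of the same job reuse the container state, with no re-import.
srun --container-name=work apt-get install --yes file
srun --container-name=work file /etc/os-release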

I need to look more into what happens when job arrays are involved; I haven't tested that use case yet. Perhaps there is an unexpected interaction with the SPANK API.

By the way, I don't quite understand what you mean by "PID sharing", could you explain?

@flx42 (Member) commented Oct 23, 2020

@sfeltman I see that the Slurm epilog is called for each job of the job array. So how were you planning to clean up the named containers for this use case? I don't see any way to know when the job array is entirely finished on one node.

@sfeltman (Author)

Hi @flx42,

Thanks for the explanation. I think some of my confusion stems from the command-line help, which makes it seem like the feature is more general-purpose.

With regards to "PID sharing", I meant container PID re-use from a running container with the same name.

As for job arrays and cleanup: the idea is to use a new job, dependent on the array job's completion, to do cleanup on any node the array might have run on.
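Something along these lines (the script names, container name, and node list are hypothetical, and the exact name to remove depends on the pyxis_ prefix in use):

# Submit the array, then a cleanup job that only starts once the whole array has finished.
ARRAY_ID=$(sbatch --parsable --array=0-31 array_job.sh)
sbatch --dependency=afterany:${ARRAY_ID} --nodelist=<nodes the array could run on> \
       --wrap 'enroot remove --force bigimage_ws'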

@flx42 (Member) commented Oct 23, 2020

> As for job arrays and cleanup: the idea is to use a new job, dependent on the array job's completion, to do cleanup on any node the array might have run on.

That seems tricky, making sure the follow-up job runs on exactly the same nodes.

But at the same time, that use case seems similar to #28, so I'll consider changing the epilog pyxis config flag to trigger both the pyxis_$jobid container name prefix and the epilog cleanup. In that case you will be able to get the previous behavior by disabling this option.

@flx42 (Member) commented Oct 23, 2020

@sfeltman I just pushed 5a7d900

You should be able to get the previous behavior with a config flag like the following:

$ cat /etc/slurm-llnl/plugstack.conf.d/pyxis.conf
required /usr/local/lib/slurm/spank_pyxis.so container_scope=global

There will still be a pyxis_ prefix, but it won't use the job_id in the prefix anymore.
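In other words, cross-job reuse on the same node should work again, roughly like this (the image path and container name are illustrative):

# job A creates the named container on the node it lands on
srun --container-image=/shared/images/bigimage.sqsh --container-name=bigimage_ws ./task.sh
# job B, if scheduled on the same node, reuses the existing enroot container
srun --container-name=bigimage_ws ./task.sh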

@flx42 (Member) commented Oct 23, 2020

Could you also describe the kind of problems you've seen with containers reusing existing PIDs? It just means the job will share the container namespaces; is that an issue? I'm wondering if there is a bug lurking here.

@sfeltman (Author)

Hi Felix,

Thanks for the update. Below I've pasted records of some of the errors we were running into. This was with Pyxis version 0.8.1 and enroot 3.1.0. I played with adding a --no-container-pid-reuse option, which fixed the issue. However, this was on top of the master branch, so it may also have been conflated with other changes since 0.8.1...

pyxis: reusing existing container PID
No devices found.
pyxis: reusing existing container PID
slurmstepd: error: pyxis: couldn't join cgroup namespace: Operation not permitted
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1
slurmstepd: error: pyxis: child 57362 failed with error code: 1
slurmstepd: error: pyxis: couldn't get list of existing container filesystems
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis:     NAME  PID  STATE  STARTED  TIME  MNTNS  USERNS  COMMAND
slurmstepd: error: pyxis: couldn't get list of containers
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1
pyxis: reusing existing container PID
slurmstepd: error: pyxis: unable to open mount namespace file: No such file or directory
slurmstepd: error: pyxis: couldn't get container attributes
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1
pyxis: reusing existing container PID
slurmstepd: error: pyxis: unable to open user namespace file: No such file or directory
slurmstepd: error: pyxis: couldn't get container attributes
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1

@flx42 (Member) commented Oct 23, 2020

OK, it's probably a race condition between the different jobs here, for instance if the job being joined terminates while the new one is starting up.

@sfeltman (Author)

Regarding sharing the container namespace, does this mean the cgroup resources are actually shared, or are the limit values just copied? With array jobs, each job in the array is an independent job using the same limit values, but would have its own CPU/GPU/memory allocation.

@flx42 (Member) commented Oct 23, 2020

The cgroups should still be per-job, but it will get a bit weird for the jobs reusing the initial container, since they will join the cgroup namespace while being under a cgroup outside of this namespace.
I don't think this has a functional impact.
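If you want to see what is shared, one way is to compare namespace IDs across steps (purely illustrative; assumes a named container called work):

# The first step creates the container; later steps join its namespaces.
srun --container-image=ubuntu:20.04 --container-name=work readlink /proc/self/ns/mnt /proc/self/ns/user /proc/self/ns/cgroup
srun --container-name=work readlink /proc/self/ns/mnt /proc/self/ns/user /proc/self/ns/cgroup
# Matching IDs indicate the steps share those namespaces, while the cgroup each
# task is placed in is still the per-job cgroup created by Slurm.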

@sfeltman (Author)

I just confirmed that the current HEAD, 5a7d900, without any of my changes, still exhibits the problems I mentioned when sharing the container name between array jobs (using the container_scope=global option).

@flx42 (Member) commented Oct 28, 2020

Yes, this aspect is trickier, and for now I'm not too keen on adding another command-line argument for it, since the main intended use case is named containers with job-level scope.
So you should probably continue carrying your patch for disabling PID sharing, for now :)

@avolkov1

Is it possible to just specify a path to the sqsh files?

$ ls ~/enroot_images/
nvcr.io+nvidia+cuda+11.0.3-base-ubuntu18.04.sqsh

I just want to run srun with an option to pyxis/enroot to use that sqsh file.

@3XX0 (Member) commented Feb 12, 2021

@avolkov1

Oops, sorry. That's simple. I overlooked that part in the docs. Thank you.
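That is, --container-image also takes a local squashfs path directly, so something like this (the command after the image flag is just an example):

$ srun --container-image=$HOME/enroot_images/nvcr.io+nvidia+cuda+11.0.3-base-ubuntu18.04.sqsh grep PRETTY /etc/os-release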

@flx42 (Member) commented May 14, 2021

I think this is solved now, closing.

flx42 closed this as completed May 14, 2021
@flx42 (Member) commented May 14, 2021

I mean that we're probably not going to add a knob for disabling PID sharing when a container exists, at least not right now.
