Clarify intended usage of --container-name #30

Closed
sfeltman opened this issue Oct 22, 2020 · 17 comments

@sfeltman

We had been attempting to use --container-name to share enroot containers across Slurm job arrays. This ended up causing a lot of issues due to container PID sharing between array jobs running on the same machine (we didn't realize Pyxis did this until reading its code). While that could be fixed with some sort of option to disable PID sharing, commit a35027c added a "pyxis_$JOBID" prefix to the container name, which breaks this approach anyway.

Please clarify the intended usage of --container-name. We had been hoping to use it to speed up array jobs that use large containers on the same machine, while manually managing the enroot container import directory before and after the job array.
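Roughly, the pattern we were going for looks like the following (the image path, container name, and per-task script are illustrative placeholders rather than our exact setup):

#!/bin/bash
#SBATCH --array=0-31
# Every array task landing on a node where the named container already exists
# was expected to reuse it instead of unpacking the image again.
srun --container-image=/shared/images/bigimage.sqsh \
     --container-name=bigimage_ws \
     ./process_shard.sh "$SLURM_ARRAY_TASK_ID"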
 
Thanks

@flx42 (Member) commented Oct 23, 2020

Hello @sfeltman, the intent was to save container state across job steps, for example within an sbatch script or a salloc. On our cluster we had a Slurm epilog to manually clean up the named containers at the end of the job, and the commit above was part of a change to move this cleanup logic into Pyxis directly. We didn't want to allow named containers to be shared across different jobs, since it's usually challenging to make sure you land on the same nodes across jobs.
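A minimal sketch of that step-level pattern (the image and commands are just examples):

#!/bin/bash
# The first step imports the image and creates the named container.
srun --container-image=ubuntu:20.04 --container-name=work apt-get update
# Later steps of the same job reuse the container state, with no re-import.
srun --container-name=work apt-get install --yes file
srun --container-name=work file /etc/os-release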

I need to look more into what happens when job arrays are involved; I haven't tested that use case yet. Perhaps there is an unexpected interaction with the SPANK API.

By the way, I don't quite understand what you mean by "PID sharing", could you explain?

@flx42 (Member) commented Oct 23, 2020

@sfeltman I see that the Slurm epilog is called for each job of the job array. So how were you planning to clean up the named containers for this use case? I don't see any way to know when the job array is entirely finished on one node.

@sfeltman (Author)

Hi @flx42,

Thanks for the explanation. I think some of my confusion stems from the command-line help, which makes it seem like the feature is more general-purpose.

With regards to "PID sharing", I meant container PID re-use from a running container with the same name.

As for job arrays and cleanup: the idea is to use a new job, dependent on the array job's completion, to do cleanup on any node the array might have run on.
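Something along these lines (the script names, container name, and node list are hypothetical, and the exact name to remove depends on the pyxis_ prefix in use):

# Submit the array, then a cleanup job that only starts once the whole array has finished.
ARRAY_ID=$(sbatch --parsable --array=0-31 array_job.sh)
sbatch --dependency=afterany:${ARRAY_ID} --nodelist=<nodes the array could run on> \
       --wrap 'enroot remove --force bigimage_ws'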

@flx42 (Member) commented Oct 23, 2020

> As for job arrays and cleanup: the idea is to use a new job, dependent on the array job's completion, to do cleanup on any node the array might have run on.

That seems tricky, making sure the follow-up job runs on exactly the same nodes.

But at the same time, that use case seems similar to #28, so I'll consider changing the epilog pyxis config flag to trigger both the pyxis_$jobid container name prefix and the epilog cleanup. In that case you will be able to get the previous behavior by disabling this option.

@flx42 (Member) commented Oct 23, 2020

@sfeltman I just pushed 5a7d900

You should be able to get the previous behavior with a config flag like the following:

$ cat /etc/slurm-llnl/plugstack.conf.d/pyxis.conf
required /usr/local/lib/slurm/spank_pyxis.so container_scope=global

There will still be a pyxis_ prefix, but it won't use the job_id in the prefix anymore.
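In other words, cross-job reuse on the same node should work again, roughly like this (the image path and container name are illustrative):

# job A creates the named container on the node it lands on
srun --container-image=/shared/images/bigimage.sqsh --container-name=bigimage_ws ./task.sh
# job B, if scheduled on the same node, reuses the existing enroot container
srun --container-name=bigimage_ws ./task.sh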

@flx42 (Member) commented Oct 23, 2020

Could you also describe the kind of problems you've seen with containers reusing existing PIDs? It just means the job will share the container namespaces; is that an issue? I'm wondering if there is a bug lurking here.

@sfeltman (Author)

Hi Felix,

Thanks for the update. Below I've pasted records of some of the errors we were running into. This was with Pyxis version 0.8.1 and enroot 3.1.0. I played with adding a --no-container-pid-reuse option, which fixed the issue. However, this was on top of the master branch, so it may also have been conflated with other changes since 0.8.1...

pyxis: reusing existing container PID
No devices found.
pyxis: reusing existing container PID
slurmstepd: error: pyxis: couldn't join cgroup namespace: Operation not permitted
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1
slurmstepd: error: pyxis: child 57362 failed with error code: 1
slurmstepd: error: pyxis: couldn't get list of existing container filesystems
slurmstepd: error: pyxis: printing contents of log file ...
slurmstepd: error: pyxis:     NAME  PID  STATE  STARTED  TIME  MNTNS  USERNS  COMMAND
slurmstepd: error: pyxis: couldn't get list of containers
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1
pyxis: reusing existing container PID
slurmstepd: error: pyxis: unable to open mount namespace file: No such file or directory
slurmstepd: error: pyxis: couldn't get container attributes
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1
pyxis: reusing existing container PID
slurmstepd: error: pyxis: unable to open user namespace file: No such file or directory
slurmstepd: error: pyxis: couldn't get container attributes
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: <nodename>: task 0: Exited with exit code 1

@flx42 (Member) commented Oct 23, 2020

OK, it's probably a race condition between the different jobs here, for instance if the job being joined terminates while the new one is starting up.

@sfeltman (Author)

Regarding sharing the container namespace, does this mean the cgroup resources are actually shared, or are the limit values just copied? With array jobs, each job in the array is an independent job using the same limit values, but would have its own CPU/GPU/memory allocation.

@flx42 (Member) commented Oct 23, 2020

The cgroups should still be per-job, but it will get a bit weird for the jobs reusing the initial container, since they will join the cgroup namespace while being under a cgroup outside of this namespace.
I don't think this has a functional impact.
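If you want to see what is shared, one way is to compare namespace IDs across steps (purely illustrative; assumes a named container called work):

# The first step creates the container; later steps join its namespaces.
srun --container-image=ubuntu:20.04 --container-name=work readlink /proc/self/ns/mnt /proc/self/ns/user /proc/self/ns/cgroup
srun --container-name=work readlink /proc/self/ns/mnt /proc/self/ns/user /proc/self/ns/cgroup
# Matching IDs indicate the steps share those namespaces, while the cgroup each
# task is placed in is still the per-job cgroup created by Slurm.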

@sfeltman (Author)

I just confirmed that the current HEAD, 5a7d900, without any of my changes, still exhibits the problems I mentioned when sharing the container name between array jobs (using the container_scope=global option).

@flx42 (Member) commented Oct 28, 2020

Yes, this aspect is trickier, and for now I'm not too keen on adding another command-line argument for it, since the main intended use case is named containers with job-level scope.
So you should probably continue carrying your patch for disabling PID sharing, for now :)

@avolkov1

Is it possible to just specify a path to the sqsh files?

$ ls ~/enroot_images/
nvcr.io+nvidia+cuda+11.0.3-base-ubuntu18.04.sqsh

I just want to run srun with an option to pyxis/enroot to use that sqsh file.

@3XX0 (Member) commented Feb 12, 2021

@avolkov1

Oops, sorry. That's simple. I overlooked that part in the docs. Thank you.
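That is, --container-image also takes a local squashfs path directly, so something like this (the command after the image flag is just an example):

$ srun --container-image=$HOME/enroot_images/nvcr.io+nvidia+cuda+11.0.3-base-ubuntu18.04.sqsh grep PRETTY /etc/os-release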

@flx42 (Member) commented May 14, 2021

I think this is solved now, closing.

flx42 closed this as completed May 14, 2021
@flx42 (Member) commented May 14, 2021

I mean that we're probably not going to add a knob for disabling PID sharing when a container exists, at least not right now.
