Reusing containers created with Pyxis not working #28
Hi @Juanjdurillo, are you doing different [...] within the same job? However, if you do two separate jobs, the named container from the first job will not be reused. If you want to go back to the old behavior, you can set [...].
Hi @flx42, thanks for your answer. The [...]
I am not sure if this is the case. These containers created with the suffix are never cleaned up (they persist within my enroot data folder) and are never reused.
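For illustration, one way to see these leftover per-job container filesystems is with enroot itself, assuming pyxis prefixes enroot container names with pyxis_ plus a per-job id (the names below are hypothetical):

$ enroot list
pyxis_1234_myctr
pyxis_1235_myctr    # same --container-name, but a new id for each job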
I'm not able to reproduce the problem you are describing right now. You can [...]. Could you list the commands you are using and the error message (if any) that you are seeing? And what version of Slurm are you using?
My workflow is as follows (a plausible reconstruction is sketched after this comment): I do [...] and afterwards I do [...]. Sometimes I might finish and, within the same allocation, issue another [...]. But sometimes what I want to do is to get an allocation at another point in time and repeat these steps.

If I look here, https://github.com/NVIDIA/pyxis, reusing the container should be possible even across allocations (at least the documentation does not state the opposite). However, if I look into the code in master, slurm_spank_user_init (lines 815 and 834 in f3ea9a7) suggests that reusing containers is only possible within the same Slurm job now (which is different from what the previous documentation states). Is this the case?

However, even if my new assumption is correct, this does not solve my use case, as I am experiencing problems reusing containers even within the same Slurm job id (i.e., within a single salloc). This would make sense only if [...] (line 159 in f3ea9a7).
Unfortunately, as much as I would like to test this and provide a patch if an error is found, I cannot, because I am simply a user of a system which provides that configuration.
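The exact commands were lost in the formatting above; the following is a plausible reconstruction of the workflow being described, using myctr as a hypothetical container name:

$ salloc -N1
$ srun --container-image ubuntu --container-name myctr bash -c "apt-get update && apt-get install -y <some-tools>"
$ srun --container-name myctr <command>    # within the same allocation: should reuse
$ exit                                     # release the allocation
$ salloc -N1                               # later, a new allocation
$ srun --container-name myctr <command>    # across jobs: this is the reuse in question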
The errors were happening with: [...]
Thanks for the detailed feedback. I'm still unsure about what your sequence of commands looks like; does the following work for you?

$ salloc -N1
salloc: Granted job allocation 4
salloc: Waiting for resource configuration
salloc: Nodes ioctl are ready for job
$ srun --container-image ubuntu which vmtouch
pyxis: importing docker image ...
pyxis: creating container filesystem ...
pyxis: starting container ...
srun: error: ioctl: task 0: Exited with exit code 1
$ srun --container-image ubuntu --container-name ctr bash -c "apt-get update && apt-get install -y vmtouch"
pyxis: importing docker image ...
pyxis: creating container filesystem ...
pyxis: starting container ...
[...]
$ srun --container-name ctr which vmtouch
pyxis: reusing existing container filesystem
pyxis: starting container ...
/usr/bin/vmtouch

Perhaps you are doing things differently: are you allocating multiple nodes? Or perhaps you were actually using a different job but landing on the same node? You can [...]
Yes, it's a change in the latest code (after 0.8.1, so not in a tagged release yet). In our cluster we do the cleanup of named containers manually with a Slurm epilog; not doing so would quickly fill the local storage. So the latest change is to integrate this cleanup into pyxis directly, with the cleanup done in a job epilog.
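A minimal sketch of what such an epilog-based cleanup could look like, assuming containers are named with a pyxis_<jobid>_ prefix and the epilog runs in a context where enroot can see the job user's containers (this is not the actual pyxis or site implementation):

#!/bin/bash
# Hypothetical Slurm job epilog: remove the enroot container filesystems
# that pyxis created for the job that just finished.
for ctr in $(enroot list | grep "^pyxis_${SLURM_JOB_ID}_"); do
    enroot remove --force "$ctr"
done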
I understand, thanks for taking the time to look at this. You should recommend that your admin go back to pyxis 0.8.1 in the meantime, then.
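For an admin, pinning the tagged release could look like the following sketch, assuming a source build (package-based installs would differ):

git clone https://github.com/NVIDIA/pyxis.git
cd pyxis
git checkout v0.8.1
sudo make install    # installs the spank_pyxis.so plugin; plugstack.conf must reference it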
Thanks to you for having a look!
Here is the thing: if I try a sequence of commands as you suggested, the last [...]
Yes, this has been the adopted solution. I very much appreciate your help on the matter. I would also appreciate it if reusing containers across Slurm jobs were still possible in the future :-)
I'm closing the issue as resolved, as the suggested tagged version works fine for us.
I'm not aware of anything like that, weird! I'll do further research soon.
@Juanjdurillo FYI, this might also solve your issue: 5a7d900. See #30.
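Later pyxis versions document a container_scope plugin argument controlling whether named containers persist beyond the job; assuming that is what the commit above introduces, a plugstack.conf entry restoring cross-job reuse might look like this (verify the exact option name and accepted values against the pyxis README):

required /usr/local/lib/slurm/spank_pyxis.so container_scope=global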
Original issue description: Reusing a container launched with srun and the Pyxis plugin does not seem to work with the latest version. When using the --container-name flag of srun, the container filesystem name is prefixed with pyxis and an id. Every new srun command results in a new container filesystem (with a different id) despite using the same name. Using that prefix directly does not work either.