Add options to disable concurrent image pulling and saving squashfs files to the dedicated file/directory, config to expose enroot logs #155
I like the idea of being able to cache a squashfs in a shared location, but I think this is a bit too complex for a starting point. I will need to think a little more about the problem, but here is my early feedback:
@flx42
sure, I can move it to a separate PR
The reason we decided to add it not only to the command line but also to the pyxis config file is that we can easily update the config file for the whole Slurm cluster, and every existing script will automatically work with the new functionality without any changes. If we remove it from the config, every Slurm user will have to adapt their scripts. In our case we already have plenty of them, and I think users would prefer the option to be turned on implicitly in the pyxis config rather than having to work out where they need to specify it.
I see your point. Maybe it's better to change like it's done with |
@flx42 If you're OK with that, I'd like to discuss how you would prefer this PR to be split, so it'll be easier for you to review in more detail :)
@flx42 do you have any updates? :)
Apologies, with GTC it's been a bit crazy here; I will try to take a look at it soon.
@flx42 Hello Felix!
After discussing with internal users that have similar but also slightly different requirements, I will likely implement a different approach in |
Here is the feature I have added: https://github.com/NVIDIA/pyxis/tree/main/importers
Closing as I'm not planning to merge this MR, please do open new issues if the |
Description
Added new srun argument:
`--container-image-save` (PATH): where to save the squashfs file of your image (either a file or a directory). Squashfs files are not deleted after `srun`; subsequent `srun` calls with the same argument value reuse the existing squashfs file.
For this option, the default behaviour can be configured through the pyxis configuration. The configuration is also extended with a new option:
`expose_enroot_logs` (BOOLEAN): redirects enroot logs from the in-memory file to `stderr`, so every pyxis step is visible.
Tests
Performed 2 runs for different setups on the same Slurm cluster.
first run: https://gist.github.com/itechdima/70cf20bbb61134f7801f3cde42271376
second run: https://gist.github.com/itechdima/385c0e77b4934711c181e796d588ab9a
Note: the `squashfs_image.bats` tests were fixed later, see the 3rd option and 3rd run.
`--container-image-shared` and `--container-image-save` options in config:
first run: https://gist.github.com/itechdima/f9dc24a87c8cdcf6d45f37b3d9b9a6a9
second run: https://gist.github.com/itechdima/bba91cf76d212ff694ecb5a4ebafd418
I'm not sure about all the test failures. Maybe some of them are flaky, and some of them are related to this caveat:
"Some tests assume a specific enroot configuration (such as PMIx/PyTorch hooks), so they might not pass on all systems"
Overall, though, the changes speed up even the tests: every test run except the first takes ~20 min instead of ~40 min.
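The speed-up after the first run follows from the reuse rule stated for `--container-image-save`: if a squashfs file already exists at the given path, it is reused instead of being imported again. Below is a minimal Python sketch of that rule only, not the PR's actual C implementation. The helper names, the `.sqsh` suffix, the file-naming scheme for the directory case, and the `import_image` callback are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable, Tuple


def resolve_squashfs_path(save_path: str, image: str) -> Path:
    """Map the PATH argument to a concrete squashfs file.

    Assumption: when PATH is an existing directory, the file name is
    derived from the image name (hypothetical naming scheme).
    """
    path = Path(save_path)
    if path.is_dir():
        safe_name = image.replace("/", "_").replace(":", "_")
        return path / f"{safe_name}.sqsh"
    return path


def get_squashfs(
    save_path: str,
    image: str,
    import_image: Callable[[str, Path], None],
) -> Tuple[Path, bool]:
    """Return (squashfs path, reused flag).

    The image is imported only when no file exists at the target yet;
    the file is never deleted here, so later calls can reuse it.
    """
    target = resolve_squashfs_path(save_path, image)
    if target.exists():
        return target, True
    import_image(image, target)  # hypothetical stand-in for the real import step
    return target, False


if __name__ == "__main__":
    def fake_import(image: str, target: Path) -> None:
        # Stand-in for the expensive pull + squashfs creation.
        target.write_bytes(b"squashfs placeholder")

    with tempfile.TemporaryDirectory() as cache_dir:
        for attempt in (1, 2):
            path, reused = get_squashfs(cache_dir, "nvidia/cuda:12.0", fake_import)
            print(f"run {attempt}: {path.name} reused={reused}")
```

The sketch deliberately omits locking: two jobs that start at the same time with the same path could both try to import, so a real implementation would need some coordination around the existence check.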
Documentation
I didn't find a way to update the GitHub wiki with the new pyxis config parameters; maybe you can help me with that.