Skip to content

Conversation

@itechdima
Copy link

@itechdima itechdima commented Jan 23, 2025

Description

Added new srun argument:

  • --container-image-save - (PATH) to indicate, where you can save squashfs file of your image (could be either file or directory). Squashfs files are not deleted after srun, following srun calls with the same argument value will reuse existing squashfs file

For this option it's possible to configure default behaviour through pyxis configuration. Also, extended configuration with option:

  • expose_enroot_logs - (BOOLEAN) changes enroot logs from the in-memory file to stderr, so now every pyxis step is visible.

Tests

Performed 2 runs for different setups on same slurm cluster.

  1. pyxis 0.20:
  2. patched pyxis without any additional options: Note: squashfs_image.bats tests where fixed later, see 3d option and 3d run
  3. patched pyxis with --container-image-shared and --container-image-save options in config:

I'm not sure about all the tests failures. Maybe some of them are flacky, and some of them are related to the:
"Some tests assume a specific enroot configuration (such as PMIx/PyTorch hooks), so they might not pass on all systems"
but overall, changes speed up even tests (instead of ~40min, all the test runs, except of the first one, will take ~20min)

Documentation

I didn't find a way to update github wiki with new pyxis config parameters, maybe you can help me with that.

@itechdima itechdima force-pushed the disable-concurrent-pull-and-keep-squashfs branch 2 times, most recently from 12a036b to 73c5f18 Compare January 23, 2025 16:27
@itechdima itechdima force-pushed the disable-concurrent-pull-and-keep-squashfs branch from 73c5f18 to 9bc2c3d Compare January 23, 2025 16:59
@itechdima itechdima force-pushed the disable-concurrent-pull-and-keep-squashfs branch 4 times, most recently from c885be0 to e7c9f58 Compare January 27, 2025 17:24
@itechdima itechdima force-pushed the disable-concurrent-pull-and-keep-squashfs branch from e7c9f58 to 4234903 Compare January 28, 2025 09:43
@flx42
Copy link
Member

flx42 commented Jan 28, 2025

I like the idea of being able to cache a squashfs in a shared location, but I think this is a bit too complex for a starting point, I will need to think a little more about the problem but here is my early feedback:

  • expose_enroot_logs is not really related to the main problem you are trying to solve, so it should be removed from this PR.
  • I think the feature should only be exposed on the command-line for now, not in the pyxis config file.
  • Given the previous point, the cache folder should stay entirely under the user's control and we should not have pyxis create per-user folders automatically. Each user would specify a directory of their choice.

@itechdima
Copy link
Author

@flx42
Thanks for the early reply (I wan't even able to complete my PR description before you started reviewing it :D )

expose_enroot_logs is not really related to the main problem you are trying to solve, so it should be removed from this PR.

sure, I can move it to a separate PR

I think the feature should only be exposed on the command-line for now, not in the pyxis config file.

The reason we decided to add not only to the command-line, but also to the pyxis config file, is that we can easily update config file for the whole Slurm cluster, and every existing script will automatically work with new functionality without any changes. If we remove it from the config, then it'll require every Slurm user to adapt their scripts (as for our case, we already have plenty of them and I think they would prefer it implicitly turned on in pyxis config, rather then try to understand where they need to specify that options)

Given the previous point, the cache folder should stay entirely under the user's control and we should not have pyxis create per-user folders automatically. Each user would specify a directory of their choice.

I see your point. Maybe it's better to change like it's done with pyxis runtime? Create only subdir with uid?

@itechdima itechdima marked this pull request as ready for review January 28, 2025 15:44
@itechdima itechdima force-pushed the disable-concurrent-pull-and-keep-squashfs branch 11 times, most recently from 8574995 to 046a12a Compare January 30, 2025 14:18
@itechdima itechdima force-pushed the disable-concurrent-pull-and-keep-squashfs branch from 046a12a to e7937d9 Compare January 30, 2025 15:37
@itechdima
Copy link
Author

@flx42
I've made some changes
Now, if the container image save path passed in the config, then the logic is similar to the pyxis runtime path: it doesn't create the whole path but only uid folder for the particular user.
If the container image save path passed through the srun argument, then it doesn't create any folder in that path

If you're OK with that, I'd like to discuss how would you prefer this PR to be split, so it'll be easier for you to review in more details :)

@itechdima itechdima force-pushed the disable-concurrent-pull-and-keep-squashfs branch 3 times, most recently from b431d84 to 63ac0e9 Compare February 4, 2025 15:49
@itechdima itechdima force-pushed the disable-concurrent-pull-and-keep-squashfs branch from 63ac0e9 to ee8c3c7 Compare February 4, 2025 15:51
@itechdima
Copy link
Author

@flx42 do you have any updates :)

@flx42
Copy link
Member

flx42 commented Mar 20, 2025

Apologies, with GTC it's been a bit crazy here, will try to take a look at it soon.

@asteny
Copy link

asteny commented Jul 24, 2025

@flx42 Hello Felix!
Can you please check PR of my colleague @itechdima ? We (in Nebius) want to use this feature in Soperator from NVIDIA upstream, but right now we are using forked code.

@flx42
Copy link
Member

flx42 commented Aug 20, 2025

@flx42 Hello Felix! Can you please check PR of my colleague @itechdima ? We (in Nebius) want to use this feature in Soperator from NVIDIA upstream, but right now we are using forked code.

After discussing with internal users that have similar but also slightly different requirements, I will likely implement a different approach in pyxis in the future.

@flx42
Copy link
Member

flx42 commented Sep 30, 2025

Here is the feature I have added: https://github.com/NVIDIA/pyxis/tree/main/importers
Let me know if that would work for your use case, or if you see something missing with this API, it can still evolve.

@flx42
Copy link
Member

flx42 commented Nov 11, 2025

Closing as I'm not planning to merge this MR, please do open new issues if the importers mechanism is insufficient for your use case or if you find any bug.

@flx42 flx42 closed this Nov 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants