Apptainer Setup#415

Merged
rrutmann merged 4 commits into main from apptainer
Nov 7, 2025

Collaborator

@rrutmann rrutmann commented Nov 4, 2025

What does this PR do?

This PR adds support for running Modalities with Apptainer/Singularity.

General Changes

  • adds a def file for Apptainer
  • documents the setup in README.md

Breaking Changes

  • ..

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@rrutmann rrutmann self-assigned this Nov 4, 2025
@rrutmann rrutmann requested a review from le1nux November 4, 2025 12:30
Member

@le1nux le1nux left a comment

left a few small remarks.
As a side-note, I would not have these files at the top level. Should we create a deployment folder?

Comment thread container/modalities.def
python -m pip install --upgrade pip || true

# remove pytorch and install pytorch nightly
rm -rf /usr/local/lib/python3.12/dist-packages/torch* /usr/local/lib/python3.12/dist-packages/pytorch_triton* || true
Member

this looks a bit like a hack to me. Why not use a UV venv?

Collaborator Author

I first tried a venv, but parts of torch were not updated properly, which led to a broken installation. Explicitly removing the folders mentioned above solved the problem.

Comment thread container/modalities.def

# Install modalities
cd /opt/repos/modalities
pip install -e .
Member

do we need -e here?

Collaborator Author

Yes, if we want to be able to update modalities without rebuilding the container. With -e, we can bind our modalities repo from the host and use the latest changes. This is also documented in the README.
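
To illustrate the bind-mount workflow described in this reply, here is a minimal sketch. It assumes the editable install lives at /opt/repos/modalities (as in the def file) and that the image is called modalities.sif; the host checkout path and the helper name `build_bind_cmd` are hypothetical. The script only prints the command so it can be inspected before running it.

```shell
#!/usr/bin/env bash
# Sketch: overlay a host checkout onto the editable install inside the container.
# Assumptions: image name and host path are placeholders; CONTAINER_REPO is the
# install location used in the def file.
set -euo pipefail

CONTAINER_REPO="/opt/repos/modalities"

# Print an apptainer command that binds the host repo over the container copy
# and reports which modalities package Python actually imports.
build_bind_cmd() {
  local host_repo="$1" image="$2"
  local cmd=(apptainer exec --nv
             --bind "${host_repo}:${CONTAINER_REPO}"
             "${image}"
             python -c "import modalities; print(modalities.__file__)")
  printf '%q ' "${cmd[@]}"   # shell-quoted, safe to copy and paste
  printf '\n'
}

# Defaults are hypothetical; override HOST_REPO and IMAGE as needed.
build_bind_cmd "${HOST_REPO:-$PWD/modalities}" "${IMAGE:-modalities.sif}"
```

If the printed import path points into the bound directory, changes made on the host should be picked up without rebuilding the image.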

Comment thread apptainer.def Outdated
Comment on lines +28 to +30
%runscript
echo "Run training:"
echo "torchrun --nnodes 1 --nproc_per_node 1 --rdzv-endpoint=0.0.0.0:29503 src/modalities/__main__.py run --config_file_path /opt/modalities/config_files/training/config_lorem_ipsum_long_fsdp2_pp_tp.yaml --test_comm"
Member

Is this a test? Otherwise, I guess we can remove it?

Collaborator Author

@rrutmann rrutmann Nov 5, 2025

It is a usage example. The test is in %test. I removed this part.

Comment thread README.md Outdated
Comment on lines +130 to +135
apptainer exec --nv modalities.sif bash -lc '\
cd /opt/repos/modalities && \
torchrun --nnodes 1 --nproc_per_node 1 --rdzv-endpoint=0.0.0.0:29503 \
src/modalities/__main__.py run \
--config_file_path config_files/training/config_lorem_ipsum_long_fsdp2_pp_tp.yaml --test_comm'
```
Member

Should we deploy the container on Leonardo and also try out the multi-node setting and document it here?
I assume you're running the discussed ablations within this container, right?

Collaborator Author

As discussed offline, I have now provided an example Slurm script.
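
For readers without access to that script, a multi-node launch could look roughly like the sketch below. This is not the script added in this PR: the resource values, the GPU count, the c10d rendezvous settings, and the helper name `build_torchrun_cmd` are assumptions, while the image name, repo path, port, and config file are taken from the README command above.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=modalities-apptainer
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=4
#SBATCH --time=00:30:00
# Hypothetical multi-node sketch; the #SBATCH values above are placeholders.
set -euo pipefail

IMAGE="${IMAGE:-modalities.sif}"
CONFIG="config_files/training/config_lorem_ipsum_long_fsdp2_pp_tp.yaml"
PORT=29503
GPUS_PER_NODE=4   # assumption: keep in sync with --gpus-per-node

# Build the torchrun invocation that every node runs inside the container.
# The c10d rendezvous backend lets nodes join without explicit node ranks.
build_torchrun_cmd() {
  local nnodes="$1" gpus_per_node="$2" head_node="$3"
  echo "cd /opt/repos/modalities && torchrun --nnodes ${nnodes} --nproc_per_node ${gpus_per_node} --rdzv-backend=c10d --rdzv-id=${SLURM_JOB_ID:-0} --rdzv-endpoint=${head_node}:${PORT} src/modalities/__main__.py run --config_file_path ${CONFIG} --test_comm"
}

# Launch only inside a Slurm allocation: one srun task per node.
if [[ -n "${SLURM_JOB_ID:-}" ]]; then
  head_node="$(scontrol show hostnames "${SLURM_JOB_NODELIST}" | head -n 1)"
  srun apptainer exec --nv "${IMAGE}" bash -lc \
    "$(build_torchrun_cmd "${SLURM_JOB_NUM_NODES}" "${GPUS_PER_NODE}" "${head_node}")"
fi
```

The first host in the allocation serves as the rendezvous endpoint, replacing the 0.0.0.0 address used in the single-node command.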

Member

@le1nux le1nux left a comment

LGTM :)

@rrutmann rrutmann merged commit e7e8ce5 into main Nov 7, 2025
3 checks passed
@BlueCrescent BlueCrescent deleted the apptainer branch November 7, 2025 17:27