le1nux left a comment
Left a few small remarks.
As a side-note, I would not have these files at the top level. Should we create a deployment folder?
python -m pip install --upgrade pip || true

# remove pytorch and install pytorch nightly
rm -rf /usr/local/lib/python3.12/dist-packages/torch* /usr/local/lib/python3.12/dist-packages/pytorch_triton* || true
This looks a bit like a hack to me. Why not use a uv venv?
I first tried a venv, but parts of torch were not updated properly, which led to a broken installation. Explicitly removing the mentioned folders solved the problem.
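One way to confirm that the explicit removal left nothing behind, before the nightly build is installed, is sketched below. The helper name `list_torch_leftovers` is hypothetical and not part of this PR; the default directory and glob patterns are copied from the `rm -rf` line above.

```shell
#!/usr/bin/env bash
# Sketch: list leftover torch-related entries in a dist-packages directory.
# list_torch_leftovers is a hypothetical helper, not part of the PR.

list_torch_leftovers() {
    local dir
    dir="${1:-/usr/local/lib/python3.12/dist-packages}"
    local entry
    # Same glob patterns as the rm -rf call; a pattern that matches nothing
    # stays literal and is skipped by the existence check.
    for entry in "$dir"/torch* "$dir"/pytorch_triton*; do
        if [ -e "$entry" ]; then
            basename "$entry"
        fi
    done
    return 0
}

leftovers="$(list_torch_leftovers)"
if [ -n "$leftovers" ]; then
    echo "Leftover torch files found:"
    echo "$leftovers"
else
    echo "No leftover torch files."
fi
```

Note that `torch*` also matches related packages such as torchvision, exactly as the `rm -rf` pattern does.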
# Install modalities
cd /opt/repos/modalities
pip install -e .
Yes, if we want to be able to update modalities without rebuilding the container. With -e, we can bind-mount our modalities repo from the host and use the latest changes. This is also documented in the README.
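The bind-mount workflow described here could look roughly like the following sketch. The host path `$HOME/modalities` and the helper name `build_bind_cmd` are hypothetical placeholders; `/opt/repos/modalities` and `modalities.sif` are taken from this PR. The helper only prints the command so it can be inspected before it is run.

```shell
#!/usr/bin/env bash
# Sketch: compose an apptainer command that bind-mounts a host checkout of
# modalities over the editable install location inside the container.
# build_bind_cmd and the host path used below are assumptions.

build_bind_cmd() {
    local host_repo
    local image
    host_repo="$1"   # host checkout of the modalities repo (assumed path)
    image="$2"       # container image, e.g. modalities.sif
    shift 2
    # --bind src:dest mounts the host directory at dest inside the container.
    # "$*" is joined with spaces for display only, so quoting is not preserved.
    echo "apptainer exec --nv --bind ${host_repo}:/opt/repos/modalities ${image} $*"
}

build_bind_cmd "$HOME/modalities" modalities.sif pip show modalities
```

Because the package is installed with `-e`, Python resolves it from `/opt/repos/modalities` at import time, so the mounted host checkout should take effect without a rebuild.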
%runscript
echo "Run training:"
echo "torchrun --nnodes 1 --nproc_per_node 1 --rdzv-endpoint=0.0.0.0:29503 src/modalities/__main__.py run --config_file_path /opt/modalities/config_files/training/config_lorem_ipsum_long_fsdp2_pp_tp.yaml --test_comm"
Is this a test? Otherwise, I guess we can remove it?
It is a usage example. The test is in %test. I removed this part.
apptainer exec --nv modalities.sif bash -lc '\
cd /opt/repos/modalities && \
torchrun --nnodes 1 --nproc_per_node 1 --rdzv-endpoint=0.0.0.0:29503 \
src/modalities/__main__.py run \
--config_file_path config_files/training/config_lorem_ipsum_long_fsdp2_pp_tp.yaml --test_comm'
Should we deploy the container on Leonardo, also try out the multi-node setting, and document it here?
I assume you're running the discussed ablations within this container, right?
As discussed offline, I have now provided an example Slurm script.
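For readers without access to that script, a multi-node launch might be sketched as follows. This is not the script from the PR: the node count, the 4 GPUs per node, the rendezvous id, and the helper name `build_node_cmd` are all assumptions, while the image name, port, working directory, and config path are copied from the single-node example.

```shell
#!/usr/bin/env bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4
# Sketch only: the resource values above are assumptions, not the PR's settings.

build_node_cmd() {
    local head_node
    local nnodes
    local gpus_per_node
    head_node="$1"
    nnodes="$2"
    gpus_per_node="$3"
    # Every node runs the same torchrun command and meets at the head node.
    echo "apptainer exec --nv --pwd /opt/repos/modalities modalities.sif torchrun" \
        "--nnodes ${nnodes} --nproc_per_node ${gpus_per_node}" \
        "--rdzv-id ${SLURM_JOB_ID:-local} --rdzv-backend c10d" \
        "--rdzv-endpoint=${head_node}:29503" \
        "src/modalities/__main__.py run" \
        "--config_file_path config_files/training/config_lorem_ipsum_long_fsdp2_pp_tp.yaml"
}

if [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    # The first hostname in the allocation serves as the rendezvous endpoint.
    head_node="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
    # srun starts one task per node; each task launches torchrun for its GPUs.
    srun bash -c "$(build_node_cmd "$head_node" "$SLURM_JOB_NUM_NODES" 4)"
else
    # Outside a Slurm allocation, only print the command for inspection.
    build_node_cmd localhost 1 1
fi
```

Submitted with `sbatch`, the script takes the first branch; run directly on a workstation, it prints the single-node command instead.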
What does this PR do?
This PR adds support for using Modalities with apptainer/singularity.
General Changes
Breaking Changes
Checklist before submitting final PR
Tests run (python tests/tests.py)
Changelog updated (CHANGELOG_DEV.md)