Apptainer Setup #415
Changes from all commits: 584288b, 392ce0f, d3e8680, 5bec93c
New file: Apptainer definition (+34 lines)

```
Bootstrap: docker
From: nvcr.io/nvidia/nemo:25.09

%environment
    export PYTHONNOUSERSITE=1

%post
    set -e
    mkdir -p /opt/repos /opt/modalities/config_files/training /e /p /etc/FZJ
    python -m pip install --upgrade pip || true

    # remove pytorch and install pytorch nightly
    rm -rf /usr/local/lib/python3.12/dist-packages/torch* /usr/local/lib/python3.12/dist-packages/pytorch_triton* || true
    python -m pip install --pre --no-cache-dir --index-url https://download.pytorch.org/whl/nightly/cu129 torch torchvision

    # Clone repos (network required at build time)
    git clone --depth 1 https://github.com/Dao-AILab/flash-attention.git /opt/repos/flash-attention
    git clone --branch main --depth 1 https://github.com/Modalities/modalities.git /opt/repos/modalities

    # Install flash-attention
    cd /opt/repos/flash-attention
    MAX_JOBS=4 python setup.py install

    # Install modalities
    cd /opt/repos/modalities
    pip install -e .
```
Member: do we need `-e` here?

Author (Collaborator): Yes, if we want to be able to update modalities without rebuilding the container. With `-e`, we can bind our modalities repo from the host and use the latest changes. This is also documented in the README.
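A minimal sketch of the workflow this enables, assuming a host checkout path and an image named `modalities.sif` (both placeholders, not taken from this PR): with the editable install baked into the image, binding the host repository over the path the install points at lets the container pick up local changes without a rebuild.

```bash
# Sketch only: MODALITIES_DIR and modalities.sif are assumed names.
# The bind target should match the location the editable install points at
# inside the image (/opt/repos/modalities from the %post step above).
MODALITIES_DIR=/path/to/modalities/on/host
apptainer exec --nv \
    -B "$MODALITIES_DIR":/opt/repos/modalities \
    modalities.sif \
    python -c "import modalities; print(modalities.__file__)"
```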
```
%test
    python - <<'EOF'
import torch
print("Torch import OK")
import modalities
print("Modalities import OK")
EOF
```
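For reference, a definition file like the one above would typically be built into a SIF image along these lines; the file and image names are assumptions, since the PR does not show them here.

```bash
# Assumed filenames for the definition file and the resulting image.
apptainer build modalities.sif modalities.def
# On systems that still ship Singularity instead of Apptainer:
# singularity build modalities.sif modalities.def
```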
New file: SLURM submission script (+69 lines)

```bash
#!/bin/bash
# SLURM SUBMIT SCRIPT
#SBATCH --exclusive
#SBATCH --account=your_account
#SBATCH --partition=your_partition
#SBATCH --qos=your_qos
#SBATCH --job-name=modalities
#SBATCH --output=/path/to/logs/log_%j.out
#SBATCH --error=/path/to/logs/log_%j.err
#SBATCH --time=00:10:00
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --mem=0

# Paths (set these to real host locations before submitting):
SINGULARITY_IMAGE=./modalities.sif              # Singularity image file.
CONTAINER_HOME=/path/to/container/home/on/host  # Acts as $HOME inside container (-H).
MODALITIES_DIR=/path/to/modalities/on/host      # Host repo path bind-mounted into container.

#### Environment variables ####
export CXX=g++
export CC=gcc

# NCCL/UCX settings
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_IB_TIMEOUT=50
export UCX_RC_TIMEOUT=4s
export NCCL_SOCKET_IFNAME=ib0
export GLOO_SOCKET_IFNAME=ib0
export NCCL_IB_RETRY_CNT=10
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_DEBUG=INFO
export NCCL_ASYNC_ERROR_HANDLING=1

# Enable logging
set -x -e
echo "START TIME: $(date)"

##### Network parameters #####
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

echo "START TIME: $(date)"

# Launch the job via srun; Singularity provides GPUs with --nv.
# Additional bind mounts (-B) can be added as needed, e.g. for datasets,
# scratch space, or other data on the host system.
srun singularity exec --nv \
    -H $CONTAINER_HOME \
    -B /dev/infiniband:/dev/infiniband \
    -B $MODALITIES_DIR:/opt/modalities \
    "$SINGULARITY_IMAGE" bash -lc "
        cd /opt/modalities
        torchrun \
            --node_rank=$SLURM_NODEID \
            --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
            --rdzv-id test_pp \
            --nnodes $SLURM_JOB_NUM_NODES \
            --nproc_per_node 4 \
            --rdzv_backend c10d \
            src/modalities/__main__.py run \
            --config_file_path config_files/training/config_lorem_ipsum_long_fsdp2.yaml \
            --test_comm
    "

echo "END TIME: $(date)"
echo "=== FINISHED ==="
```
Review comment: this looks a bit like a hack to me. Why not use a UV venv?

Author reply: I tried first with a venv, but had the problem that parts of torch were not updated properly, which led to a broken installation. Removing the mentioned folders explicitly solved the problem.
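One hedged way to sanity-check the approach described above, assuming the built image is named `modalities.sif`: confirm that only the nightly torch wheel is visible inside the container after the `%post` step.

```bash
# Assumed image name; checks that the nightly wheel replaced the NGC-provided
# torch rather than coexisting with it (the failure mode seen with the venv).
apptainer exec modalities.sif python -c "import torch; print(torch.__version__, torch.version.cuda)"
apptainer exec modalities.sif python -m pip list | grep -i torch
```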