Apptainer Setup #415
Changes from all commits: 584288b, 392ce0f, d3e8680, 5bec93c
New file: Apptainer definition (+34 lines)

```
Bootstrap: docker
From: nvcr.io/nvidia/nemo:25.09

%environment
    export PYTHONNOUSERSITE=1

%post
    set -e
    mkdir -p /opt/repos /opt/modalities/config_files/training /e /p /etc/FZJ
    python -m pip install --upgrade pip || true

    # remove pytorch and install pytorch nightly
    rm -rf /usr/local/lib/python3.12/dist-packages/torch* /usr/local/lib/python3.12/dist-packages/pytorch_triton* || true
    python -m pip install --pre --no-cache-dir --index-url https://download.pytorch.org/whl/nightly/cu129 torch torchvision

    # Clone repos (network required at build time)
    git clone --depth 1 https://github.com/Dao-AILab/flash-attention.git /opt/repos/flash-attention
    git clone --branch main --depth 1 https://github.com/Modalities/modalities.git /opt/repos/modalities

    # Install flash-attention
    cd /opt/repos/flash-attention
    MAX_JOBS=4 python setup.py install

    # Install modalities
    cd /opt/repos/modalities
    pip install -e .
```
Member: do we need `-e` here?

Author (Collaborator): Yes, if we want to be able to update modalities without rebuilding the container. With `-e`, we can bind our modalities repo from the host and use the latest changes. This is also documented in the README.
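A minimal sketch of the workflow this enables, assuming a host checkout path and an image named `modalities.sif` (both placeholders, not taken from this PR): with the editable install baked into the image, binding the host repository over the path the install points at lets the container pick up local changes without a rebuild.

```bash
# Sketch only: MODALITIES_DIR and modalities.sif are assumed names.
# The bind target should match the location the editable install points at
# inside the image (/opt/repos/modalities from the %post step above).
MODALITIES_DIR=/path/to/modalities/on/host
apptainer exec --nv \
    -B "$MODALITIES_DIR":/opt/repos/modalities \
    modalities.sif \
    python -c "import modalities; print(modalities.__file__)"
```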
```
%test
    python - <<'EOF'
import torch
print("Torch import OK")
import modalities
print("Modalities import OK")
EOF
```
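For reference, a definition file like the one above would typically be built into a SIF image along these lines; the file and image names are assumptions, since the PR does not show them here.

```bash
# Assumed filenames for the definition file and the resulting image.
apptainer build modalities.sif modalities.def
# On systems that still ship Singularity instead of Apptainer:
# singularity build modalities.sif modalities.def
```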
New file: SLURM submission script (+69 lines)

```bash
#!/bin/bash
# SLURM SUBMIT SCRIPT
#SBATCH --exclusive
#SBATCH --account=your_account
#SBATCH --partition=your_partition
#SBATCH --qos=your_qos
#SBATCH --job-name=modalities
#SBATCH --output=/path/to/logs/log_%j.out
#SBATCH --error=/path/to/logs/log_%j.err
#SBATCH --time=00:10:00
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --mem=0

# Paths (set these to real host locations before submitting):
SINGULARITY_IMAGE=./modalities.sif              # Singularity image file.
CONTAINER_HOME=/path/to/container/home/on/host  # Acts as $HOME inside container (-H).
MODALITIES_DIR=/path/to/modalities/on/host      # Host repo path bind-mounted into container.

#### Environment variables ####
export CXX=g++
export CC=gcc

# NCCL/UCX settings
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_IB_TIMEOUT=50
export UCX_RC_TIMEOUT=4s
export NCCL_SOCKET_IFNAME=ib0
export GLOO_SOCKET_IFNAME=ib0
export NCCL_IB_RETRY_CNT=10
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_DEBUG=INFO
export NCCL_ASYNC_ERROR_HANDLING=1

# Enable logging
set -x -e
echo "START TIME: $(date)"

##### Network parameters #####
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

echo "START TIME: $(date)"

# Launch the job via srun; Singularity provides GPUs with --nv.
# Additional bind mounts (-B) can be added as needed, e.g. for datasets,
# scratch space, or other data on the host system.
srun singularity exec --nv \
    -H $CONTAINER_HOME \
    -B /dev/infiniband:/dev/infiniband \
    -B $MODALITIES_DIR:/opt/modalities \
    "$SINGULARITY_IMAGE" bash -lc "
        cd /opt/modalities
        torchrun \
            --node_rank=$SLURM_NODEID \
            --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
            --rdzv-id test_pp \
            --nnodes $SLURM_JOB_NUM_NODES \
            --nproc_per_node 4 \
            --rdzv_backend c10d \
            src/modalities/__main__.py run \
            --config_file_path config_files/training/config_lorem_ipsum_long_fsdp2.yaml \
            --test_comm
    "

echo "END TIME: $(date)"
echo "=== FINISHED ==="
```
Review comment: this looks a bit like a hack to me. Why not use a UV venv?

Author reply: I tried first with a venv, but had the problem that parts of torch were not updated properly, which led to a broken installation. Removing the mentioned folders explicitly solved the problem.
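One hedged way to sanity-check the approach described above, assuming the built image is named `modalities.sif`: confirm that only the nightly torch wheel is visible inside the container after the `%post` step.

```bash
# Assumed image name; checks that the nightly wheel replaced the NGC-provided
# torch rather than coexisting with it (the failure mode seen with the venv).
apptainer exec modalities.sif python -c "import torch; print(torch.__version__, torch.version.cuda)"
apptainer exec modalities.sif python -m pip list | grep -i torch
```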