54 changes: 53 additions & 1 deletion README.md
@@ -30,7 +30,7 @@ For training and evaluation of a model, feel free to checkout [this](https://git

## Installation

There are two ways to install Modalities. If you want to use the latest nightly version, or if you want to modify the code base itself, we recommend installing Modalities directly from source.
There are multiple ways to install Modalities. If you want to use the latest nightly version, or if you want to modify the code base itself, we recommend installing Modalities directly from source.

If you want to use Modalities as a library and register your custom components with Modalities, you can install it directly via pip which provides you with the latest stable version.

@@ -88,6 +88,58 @@ uv pip install -e .[tests,linting]
pre-commit install --install-hooks
```

### Option 4: Containerized Setup via Singularity / Apptainer

If you prefer an isolated, reproducible environment, or you are deploying to an HPC center that already supports Apptainer / Singularity, you can build and run Modalities using the provided `modalities.def` file in the `container` folder.

Note: Commands shown with `singularity` work the same with `apptainer`; just substitute the command name (e.g. `apptainer build ...`, `apptainer exec ...`, `apptainer test ...`). If both are installed, pick one and use it consistently.
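
If your scripts need to run on systems that ship either tool, a small wrapper such as the following can help. This is only an illustrative sketch; the variable name `CTR` is an arbitrary choice and the commands assume you are in the `container` folder, as in the build step below.

```sh
# Pick whichever runtime is available (apptainer preferred); fails if neither is on PATH.
CTR=$(command -v apptainer || command -v singularity)
"$CTR" build modalities.sif modalities.def
"$CTR" test modalities.sif
```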

#### 1. Build the image

Use `--fakeroot` if you don't have root but your system enables user namespaces; otherwise omit it.

```sh
singularity build modalities.sif modalities.def # standard build
# or (if allowed / required on your system)
singularity build --fakeroot modalities.sif modalities.def
```

This will:
* Pull the base image `nvcr.io/nvidia/nemo:25.09`.
* Install nightly PyTorch (per the definition file) and flash-attention.
* Clone and install `modalities` inside the container.
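
After the build finishes, an optional quick spot check is to print the PyTorch build that ended up inside the image:

```sh
# Optional: confirm which PyTorch version and CUDA build the image contains
singularity exec modalities.sif python -c "import torch; print(torch.__version__, torch.version.cuda)"
```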

#### 2. Run the built-in smoke test

The `%test` section of the definition file is executed with:

```sh
singularity test modalities.sif
```

Expected output contains lines similar to:

```
Torch import OK
Modalities import OK
```

If this step fails, the container is not usable yet—inspect the earlier build logs.
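
For interactive debugging, one option (not required by the regular workflow) is to build a writable sandbox instead of a `.sif` file and open a shell inside it:

```sh
# Sandbox build for debugging; needs enough local disk for the unpacked image
singularity build --sandbox modalities_sandbox/ modalities.def
singularity shell --writable modalities_sandbox/
```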

#### 3. Launch training inside the container

```sh
singularity exec --nv modalities.sif bash -lc '\
cd /opt/repos/modalities && \
torchrun --nnodes 1 --nproc_per_node 1 --rdzv-endpoint=0.0.0.0:29503 \
src/modalities/__main__.py run \
--config_file_path config_files/training/config_lorem_ipsum_long_fsdp2_pp_tp.yaml --test_comm'
```

To iterate on local code without rebuilding the image, bind-mount your checkout, e.g. `singularity exec --nv --bind $PWD:/opt/repos/modalities modalities.sif bash` (the host repo then shadows the copy cloned inside the container).
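
Putting the two pieces together, a single-GPU debug run against your working copy might look like this (same paths and config as the example above):

```sh
# Bind the host checkout over the clone inside the image, then launch a single-process test run
singularity exec --nv --bind "$PWD":/opt/repos/modalities modalities.sif bash -lc '
  cd /opt/repos/modalities &&
  torchrun --nnodes 1 --nproc_per_node 1 --rdzv-endpoint=0.0.0.0:29503 \
    src/modalities/__main__.py run \
    --config_file_path config_files/training/config_lorem_ipsum_long_fsdp2_pp_tp.yaml --test_comm'
```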

For multi-node training with Slurm, see the example sbatch file `container/slurm_singularity.sbatch`.
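
After editing the account, partition, and path placeholders in that file, the job is submitted as usual:

```sh
sbatch container/slurm_singularity.sbatch
```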

## Usage
Modalities provides several entry points to interact with the framework. The following section lists the available entry points and their respective functionalities.

34 changes: 34 additions & 0 deletions container/modalities.def
@@ -0,0 +1,34 @@
Bootstrap: docker
From: nvcr.io/nvidia/nemo:25.09

%environment
# Keep host user site-packages (~/.local) out of the container's Python path
export PYTHONNOUSERSITE=1

%post
set -e
mkdir -p /opt/repos /opt/modalities/config_files/training /e /p /etc/FZJ
python -m pip install --upgrade pip || true

# remove pytorch and install pytorch nightly
rm -rf /usr/local/lib/python3.12/dist-packages/torch* /usr/local/lib/python3.12/dist-packages/pytorch_triton* || true
Member: this looks a bit like a hack to me. Why not use a UV venv?

Collaborator Author: I tried first with a venv, but had the problem that parts of torch were not updated properly, which led to a broken installation. Removing the mentioned folders explicitly solved the problem.

python -m pip install --pre --no-cache-dir --index-url https://download.pytorch.org/whl/nightly/cu129 torch torchvision

# Clone repos (network required at build time)
git clone --depth 1 https://github.com/Dao-AILab/flash-attention.git /opt/repos/flash-attention
git clone --branch main --depth 1 https://github.com/Modalities/modalities.git /opt/repos/modalities

# Install flash-attention
cd /opt/repos/flash-attention
MAX_JOBS=4 python setup.py install

# Install modalities
cd /opt/repos/modalities
pip install -e .
Member: do we need -e here?

Collaborator Author: Yes, if we want to be able to update modalities without rebuilding the container again. With -e, we can bind our modalities repo from the host and use the latest changes. This is also documented in the README.


%test
python - <<'EOF'
import torch
print("Torch import OK")
import modalities
print("Modalities import OK")
EOF
69 changes: 69 additions & 0 deletions container/slurm_singularity.sbatch
@@ -0,0 +1,69 @@
#!/bin/bash
# SLURM SUBMIT SCRIPT
#SBATCH --exclusive
#SBATCH --account=your_account
#SBATCH --partition=your_partition
#SBATCH --qos=your_qos
#SBATCH --job-name=modalities
#SBATCH --output=/path/to/logs/log_%j.out
#SBATCH --error=/path/to/logs/log_%j.err
#SBATCH --time=00:10:00
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --nodes=32
#SBATCH --gres=gpu:4
#SBATCH --mem=0

# Paths (set these to real host locations before submitting):
SINGULARITY_IMAGE=./modalities.sif # Singularity image file.
CONTAINER_HOME=/path/to/container/home/on/host # Acts as $HOME inside container (-H).
MODALITIES_DIR=/path/to/modalities/on/host # Host repo path bind-mounted into container.

#### Environment variables ####
export CXX=g++
export CC=gcc

# NCCL/UCX settings
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_IB_TIMEOUT=50
export UCX_RC_TIMEOUT=4s
export NCCL_SOCKET_IFNAME=ib0
export GLOO_SOCKET_IFNAME=ib0
export NCCL_IB_RETRY_CNT=10
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_DEBUG=INFO
export NCCL_ASYNC_ERROR_HANDLING=1

# Enable logging
set -x -e
echo "START TIME: $(date)"

##### Network parameters #####
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000

echo "START TIME: $(date)"

# Launch the job via srun; Singularity provides GPUs with --nv.
# Additional bind mounts (-B) can be added for datasets, scratch, etc.,
# e.g. to access data on the host system.
# \$SLURM_NODEID is escaped so that each srun task expands its own node id inside the container.
srun singularity exec --nv \
    -H $CONTAINER_HOME \
    -B /dev/infiniband:/dev/infiniband \
    -B $MODALITIES_DIR:/opt/modalities \
    "$SINGULARITY_IMAGE" bash -lc "
    cd /opt/modalities
    torchrun \
        --node_rank=\$SLURM_NODEID \
        --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
        --rdzv-id test_pp \
        --nnodes $SLURM_JOB_NUM_NODES \
        --nproc_per_node 4 \
        --rdzv_backend c10d \
        src/modalities/__main__.py run \
        --config_file_path config_files/training/config_lorem_ipsum_long_fsdp2.yaml \
        --test_comm
    "

echo "END TIME: $(date)"
echo "=== FINISHED ==="