Merged
65 changes: 65 additions & 0 deletions backend/kuberay/README.md
VassilisVassiliadis marked this conversation as resolved.
@@ -134,6 +134,8 @@ with 4 Nodes each with 8 NVIDIA-A100-SXM4-80GB GPUs, 64 CPU cores, and 1TB memor
num-gpus: '1'
resources: '"{\"NVIDIA-A100-SXM4-80GB\": 1}"'
containerEnv:
- name: RAY_RUNTIME_ENV_PLUGINS
value: '[{"class":"orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"}]'
- name: OMP_NUM_THREADS
value: "1"
- name: OPENBLAS_NUM_THREADS
@@ -173,6 +175,8 @@ with 4 Nodes each with 8 NVIDIA-A100-SXM4-80GB GPUs, 64 CPU cores, and 1TB memor
num-gpus: '2'
resources: '"{\"NVIDIA-A100-SXM4-80GB\": 2}"'
containerEnv:
- name: RAY_RUNTIME_ENV_PLUGINS
value: '[{"class":"orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"}]'
- name: OMP_NUM_THREADS
value: "1"
- name: OPENBLAS_NUM_THREADS
@@ -212,6 +216,8 @@ with 4 Nodes each with 8 NVIDIA-A100-SXM4-80GB GPUs, 64 CPU cores, and 1TB memor
num-gpus: '4'
resources: '"{\"NVIDIA-A100-SXM4-80GB\": 4}"'
containerEnv:
- name: RAY_RUNTIME_ENV_PLUGINS
value: '[{"class":"orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"}]'
- name: OMP_NUM_THREADS
value: "1"
- name: OPENBLAS_NUM_THREADS
@@ -251,6 +257,8 @@ with 4 Nodes each with 8 NVIDIA-A100-SXM4-80GB GPUs, 64 CPU cores, and 1TB memor
num-gpus: '8'
resources: '"{\"NVIDIA-A100-SXM4-80GB\": 8, \"full-worker\": 1}"'
containerEnv:
- name: RAY_RUNTIME_ENV_PLUGINS
value: '[{"class":"orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"}]'
- name: OMP_NUM_THREADS
value: "1"
- name: OPENBLAS_NUM_THREADS
@@ -302,3 +310,60 @@ with 4 Nodes each with 8 NVIDIA-A100-SXM4-80GB GPUs, 64 CPU cores, and 1TB memor
> your HuggingFace home directory. On Kubernetes with RayClusters, avoid S3-like
> filesystems as that is known to cause failures in **transformers**. Use an NFS
> or GPFS-backed PersistentVolumeClaim instead.

## Using the OrderedPip Ray Runtime Environment Plugin

The `OrderedPipPlugin` is a Ray RuntimeEnvPlugin bundled with `ado-core` that
lets you control the order in which Python packages are installed and built.
This is useful when installing packages with build-time dependencies, such as
`mamba-ssm`, which requires `torch` to be installed before it can be built.

### Enabling the Plugin

To enable the `OrderedPipPlugin`, set the `RAY_RUNTIME_ENV_PLUGINS` environment
variable before starting the Ray head node and workers:

```bash
export RAY_RUNTIME_ENV_PLUGINS='[{"class":"orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"}]'
```
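When Ray is started from a Python driver rather than a shell, building the value
with `json.dumps` avoids hand-quoting the JSON. This is a minimal sketch:
`plugin_env_value` is a hypothetical helper, and it assumes that Ray processes
launched afterwards from the same process (for example a local cluster started
by `ray.init()`) inherit its environment variables. On an existing cluster the
variable still has to be set wherever the head and workers are started.

```python
import json
import os

# Fully qualified class path of the plugin, as given in this README.
ORDERED_PIP_CLASS = "orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"


def plugin_env_value(*class_paths: str) -> str:
    """Build the JSON list used as the value of RAY_RUNTIME_ENV_PLUGINS.

    Hypothetical helper: each entry is a dict with a single "class" key,
    matching the export command above. Compact separators reproduce that
    exact string.
    """
    return json.dumps([{"class": path} for path in class_paths], separators=(",", ":"))


if __name__ == "__main__":
    # Assumption: this must happen before any Ray processes are started
    # from this interpreter, so that they inherit the variable.
    os.environ["RAY_RUNTIME_ENV_PLUGINS"] = plugin_env_value(ORDERED_PIP_CLASS)
    print(os.environ["RAY_RUNTIME_ENV_PLUGINS"])
```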

When deploying a RayCluster via KubeRay, add this environment variable to both
head and worker node configurations (see the configuration examples above).

### Documentation and Usage

For detailed documentation, configuration details, and usage examples, see the
[OrderedPip Plugin README](https://github.com/IBM/ado/blob/main/orchestrator/utilities/ray_env/README.md).

### Example: Using ordered_pip with ray job submit

You can use `ordered_pip` with `ray job submit` by providing a runtime
environment YAML file:

```yaml
# ray_runtime_env.yaml
ordered_pip:
  phases:
    # Phase 1: Install PyTorch first
    - packages:
        - torch==2.6.0
    # Phase 2: Install packages that depend on PyTorch during build
    - packages:
        - mamba-ssm==2.2.5
      pip_install_options:
        # IMPORTANT.
        # --no-build-isolation tells pip to build the wheel
        # in the same venv where torch is already installed
        - --no-build-isolation
```

Then submit your job with:

```bash
ray job submit --runtime-env ray_runtime_env.yaml -- python my_script.py
```
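`ray job submit` also accepts the runtime environment inline, as a
JSON-serialized dictionary, through `--runtime-env-json`. The sketch below
builds the Python equivalent of `ray_runtime_env.yaml` and prints a shell-quoted
command; `submit_command` is a hypothetical helper and `my_script.py` is the
placeholder script name used above.

```python
from __future__ import annotations

import json
import shlex

# Python equivalent of ray_runtime_env.yaml above.
RUNTIME_ENV = {
    "ordered_pip": {
        "phases": [
            # Phase 1: install PyTorch first
            {"packages": ["torch==2.6.0"]},
            # Phase 2: build mamba-ssm against the torch installed in phase 1
            {
                "packages": ["mamba-ssm==2.2.5"],
                "pip_install_options": ["--no-build-isolation"],
            },
        ]
    }
}


def submit_command(runtime_env: dict, entrypoint: list[str]) -> str:
    """Return a shell-quoted `ray job submit` command that passes the env as JSON."""
    argv = [
        "ray", "job", "submit",
        "--runtime-env-json", json.dumps(runtime_env),
        "--", *entrypoint,
    ]
    # shlex.join single-quotes the JSON so a POSIX shell passes it through intact.
    return shlex.join(argv)


if __name__ == "__main__":
    print(submit_command(RUNTIME_ENV, ["python", "my_script.py"]))
```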

> [!NOTE]
>
> Actuators like `SFTTrainer` automatically use `OrderedPipPlugin` when available
> to ensure correct installation of their dependencies.
11 changes: 8 additions & 3 deletions backend/kuberay/vanilla-ray.yaml
@@ -18,14 +18,17 @@ head:
requests:
cpu: "500m"
memory: "512Mi"
containerEnv:
- name: RAY_RUNTIME_ENV_PLUGINS
value: '[{"class":"orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"}]'
lifecycle: #https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html#pod-and-container-lifecyle-prestophook
preStop:
exec:
command: ["/bin/sh", "-c", "ray stop"]
rayStartParams:
dashboard-host: '0.0.0.0'
dashboard-host: "0.0.0.0"
num-cpus: "0"
block: 'true'
block: "true"
resources:
limits:
cpu: 4
@@ -39,8 +42,10 @@ worker:
minReplicas: 1
maxReplicas: 4
rayStartParams:
block: 'true'
block: "true"
containerEnv:
- name: RAY_RUNTIME_ENV_PLUGINS
value: '[{"class":"orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"}]'
- name: OMP_NUM_THREADS
value: "1"
- name: OPENBLAS_NUM_THREADS
147 changes: 147 additions & 0 deletions orchestrator/utilities/ray_env/README.md
@@ -0,0 +1,147 @@
# OrderedPip Ray Runtime Environment Plugin

The `OrderedPipPlugin` is a Ray RuntimeEnvPlugin that lets you control the order
in which Python packages are installed and built. This is useful when installing
packages with build-time dependencies. For example, `mamba-ssm` requires `torch`
to be installed before it can be built. For such packages we suggest setting
`pip_install_options: ["--no-build-isolation"]`, which makes `pip` build wheels
in the same virtual environment it installs them into, instead of in an isolated
build environment.

## Overview

The plugin lets you define multiple installation phases, which are executed
sequentially in the order given. All phases install wheels into the same virtual
environment, so packages installed in earlier phases are available when later
phases build their packages, provided those later phases also use
`--no-build-isolation`.
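Conceptually, the phases behave like a series of `pip install` invocations run
one after another with the same interpreter. The sketch below only builds those
command lines to illustrate the ordering; it is not the plugin's implementation
(see `ordered_pip.py` for that), and both `phase_commands` and the venv path are
hypothetical.

```python
from __future__ import annotations

import shlex


def phase_commands(python_exe: str, phases: list) -> list[list[str]]:
    """Turn ordered_pip-style phases into one `pip install` argv per phase.

    Every command uses the same interpreter, i.e. the same virtual environment,
    so packages installed by one phase are visible when a later phase builds
    wheels, as long as that later phase passes --no-build-isolation.
    """
    commands = []
    for phase in phases:
        if isinstance(phase, dict):
            packages = list(phase["packages"])
            options = list(phase.get("pip_install_options", []))
        else:
            # Plain list form, e.g. ["torch==2.6.0"]
            packages, options = list(phase), []
        commands.append([python_exe, "-m", "pip", "install", *options, *packages])
    return commands


if __name__ == "__main__":
    phases = [
        ["torch==2.6.0"],
        {"packages": ["mamba-ssm==2.2.5"], "pip_install_options": ["--no-build-isolation"]},
    ]
    # Hypothetical venv path; the plugin manages its own environment location.
    for argv in phase_commands("/tmp/venv/bin/python", phases):
        print(shlex.join(argv))
```

Running these commands strictly one after another (for example with
`subprocess.run(argv, check=True)`) gives the sequential, shared-environment
behavior described above.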

## Availability

The `OrderedPipPlugin` is pre-installed in ado Docker images and bundled with
`ado-core`.

## Enabling the Plugin

To enable the `OrderedPipPlugin`, set the `RAY_RUNTIME_ENV_PLUGINS` environment
variable before starting the Ray head and workers.

```bash
export RAY_RUNTIME_ENV_PLUGINS='[{"class":"orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"}]'
```

### Enabling in KubeRay

When deploying a RayCluster via KubeRay, add the environment variable to both
head and worker node configurations:

```yaml
head:
  containerEnv:
    - name: RAY_RUNTIME_ENV_PLUGINS
      value: '[{"class":"orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"}]'

worker:
  containerEnv:
    - name: RAY_RUNTIME_ENV_PLUGINS
      value: '[{"class":"orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"}]'
```
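Because the variable is needed on both the head and the workers, a quick
pre-deploy check can catch a values file where it was added to only one group.
This sketch assumes the layout shown above (`head.containerEnv`,
`worker.containerEnv`) has already been parsed into a Python dict, for example
with PyYAML; `missing_plugin_env` is a hypothetical helper.

```python
from __future__ import annotations

ENV_NAME = "RAY_RUNTIME_ENV_PLUGINS"
PLUGIN_CLASS = "orchestrator.utilities.ray_env.ordered_pip.OrderedPipPlugin"


def missing_plugin_env(values: dict, groups: tuple[str, ...] = ("head", "worker")) -> list[str]:
    """Return the node groups whose containerEnv does not register OrderedPipPlugin."""
    missing = []
    for group in groups:
        # Treat an absent or empty group/containerEnv as "not registered".
        env = (values.get(group) or {}).get("containerEnv") or []
        registered = any(
            item.get("name") == ENV_NAME and PLUGIN_CLASS in item.get("value", "")
            for item in env
        )
        if not registered:
            missing.append(group)
    return missing


if __name__ == "__main__":
    values = {
        "head": {"containerEnv": [{"name": ENV_NAME, "value": '[{"class":"%s"}]' % PLUGIN_CLASS}]},
        # The worker group forgot the variable.
        "worker": {"containerEnv": [{"name": "OMP_NUM_THREADS", "value": "1"}]},
    }
    print(missing_plugin_env(values))
```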

## Configuration Details

> [!IMPORTANT]
>
> Each entry in `phases` uses the **same schema** as Ray's standard `pip`
> runtime environment field. If you know how to configure `pip`, you already
> know how to configure each phase in `ordered_pip`.

The `ordered_pip` runtime environment accepts a dictionary with a `phases` key:

- **`phases`**: A list of phases, installed in the order given. Each phase can
  be one of:
  - A list of package names (e.g., `["torch==2.6.0"]`)
  - A dictionary with `packages` and optional `pip_install_options` fields
  - Any other valid `pip` specification format
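The accepted shapes can be summarized by normalizing each phase into the
dictionary form. This is a hypothetical helper for illustration only, not part of
the plugin. It assumes that "any other valid `pip` specification" includes a
string path to a `requirements.txt` file (one of the forms Ray's `pip` field
accepts), and reads such a file with a deliberately simplified parser.

```python
from __future__ import annotations

from pathlib import Path


def normalize_phase(phase) -> dict:
    """Normalize one ordered_pip phase into {"packages": [...], ...} form.

    Hypothetical helper covering the shapes listed above:
      - list of requirement strings     -> {"packages": [...]}
      - str (path to requirements.txt)  -> {"packages": <non-blank, non-comment lines>}
      - dict with a "packages" key      -> shallow copy, other keys kept as-is
    """
    if isinstance(phase, list):
        return {"packages": list(phase)}
    if isinstance(phase, str):
        lines = (ln.strip() for ln in Path(phase).read_text().splitlines())
        return {"packages": [ln for ln in lines if ln and not ln.startswith("#")]}
    if isinstance(phase, dict) and "packages" in phase:
        return dict(phase)
    raise ValueError(f"Unsupported phase specification: {phase!r}")


if __name__ == "__main__":
    phases = [
        ["torch==2.6.0"],
        {"packages": ["mamba-ssm==2.2.5"], "pip_install_options": ["--no-build-isolation"]},
    ]
    for p in phases:
        print(normalize_phase(p))
```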

## Usage Examples

### Using ordered_pip in Python Code

Here's a complete example showing how to use `ordered_pip` in a Ray task:

```python
import ray

@ray.remote(
    runtime_env={
        "ordered_pip": {
            "phases": [
                # Phase 1: Install PyTorch first
                ["torch==2.6.0"],
                # Phase 2: Install packages that depend on PyTorch during build
                {
                    "packages": ["mamba-ssm==2.2.5"],
                    # IMPORTANT.
                    # --no-build-isolation tells pip to build the wheel
                    # in the same venv where torch is already installed
                    "pip_install_options": ["--no-build-isolation"],
                },
            ]
        }
    }
)
def my_task():
    import torch
    import mamba_ssm  # noqa: F401 (importing it checks that the phase-2 build worked)

    return torch.__version__


result = ray.get(my_task.remote())
print(f"PyTorch version: {result}")
```

### Using ordered_pip with ray job submit

You can also use `ordered_pip` with `ray job submit` by providing a runtime
environment YAML file:

```yaml
# ray_runtime_env.yaml
ordered_pip:
  phases:
    # Phase 1: Install PyTorch first
    - packages:
        - torch==2.6.0
    # Phase 2: Install packages that depend on PyTorch during build
    - packages:
        - mamba-ssm==2.2.5
      pip_install_options:
        # IMPORTANT.
        # --no-build-isolation tells pip to build the wheel
        # in the same venv where torch is already installed
        - --no-build-isolation
```

Then submit your job with:

```bash
ray job submit --runtime-env ray_runtime_env.yaml -- python my_script.py
```

## Key Points

- **Sequential Execution**: Phases execute sequentially in the order specified
- **Shared Environment**: All phases reuse the same virtual environment
- **Build Isolation**: The `--no-build-isolation` flag is critical for packages
that need build-time dependencies. It instructs pip to build wheels in the
existing virtual environment rather than in an isolated one
- **Phase Order Matters**: Package order within a phase doesn't matter, but the
order of phases does

## Integration with ado Actuators

Actuators like `SFTTrainer` automatically use `OrderedPipPlugin` when available
to ensure correct installation of their dependencies.

## Technical Details

For implementation details, see the source code in
[`ordered_pip.py`](./ordered_pip.py).