diff --git a/.secrets.baseline b/.secrets.baseline
index eca42867..bf0bc6aa 100644
--- a/.secrets.baseline
+++ b/.secrets.baseline
@@ -3,7 +3,7 @@
"files": "requirements.txt|^.secrets.baseline$",
"lines": null
},
- "generated_at": "2025-11-05T16:16:55Z",
+ "generated_at": "2025-11-10T08:32:10Z",
"plugins_used": [
{
"name": "AWSKeyDetector"
@@ -414,7 +414,7 @@
}
]
},
- "version": "0.13.1+ibm.64.dss",
+ "version": "0.13.1+ibm.62.dss",
"word_list": {
"file": null,
"hash": null
diff --git a/backend/kuberay/README.md b/backend/kuberay/README.md
index 7e4e462c..e0a2f9b4 100644
--- a/backend/kuberay/README.md
+++ b/backend/kuberay/README.md
@@ -18,21 +18,26 @@ the
## Deploying a RayCluster
-> [!WARNING]
+> [!WARNING] Ray version compatibility
>
-> The `ray` versions must be compatible. For a more in depth guide refer to the
-> [RayCluster configuration](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html)
+> The `ray` version set in the KubeRay YAML and the one
+> used in the Ray head and worker containers must be compatible.
+> For a more in-depth guide, refer to the [RayCluster configuration](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html)
> page.
-!!! note
+We provide [an example set of values](vanilla-ray.yaml) for deploying a
+RayCluster via KubeRay. To deploy it, run:
+
+```commandline
+helm upgrade --install ado-ray kuberay/ray-cluster --version 1.1.0 --values backend/kuberay/vanilla-ray.yaml
+```
- When running multi-node measurement make sure that
- all nodes in your multi-node setup have read and write access
- to your HuggingFace home directory. On Kubernetes with RayCluster,
- avoid S3-like filesystems as that is known to cause failures
- in **transformers**. Use a NFS or GPFS-backed PersistentVolumeClaim instead.
+Feel free to customize the example file provided to suit your cluster,
+such as uncommenting GPU-enabled workers.
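+
+To verify the deployment, you can check that the RayCluster resource and its
+pods were created. These are generic `kubectl` queries; the exact names you see
+will depend on your Helm release name:
+
+```commandline
+kubectl get rayclusters
+kubectl get pods
+```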
-### Configuring a Kubernetes ServiceAccount for the RayCluster
+### Enabling ado actuators to create K8s resources
+
+#### Configuring a ServiceAccount for the RayCluster
The default Kubernetes ServiceAccount created for a RayCluster does not
have enough permissions for an ado actuator to create Kubernetes resources
@@ -46,46 +51,14 @@ It also provides access to the RayCluster resources.
```yaml
-apiVersion: v1
-kind: ServiceAccount
-metadata:
- name: ray-deployer
----
-apiVersion: rbac.authorization.k8s.io/v1
-kind: RoleBinding
-metadata:
- name: ray-deployer
-roleRef:
- apiGroup: rbac.authorization.k8s.io
- kind: Role
- name: ray-deployer
-subjects:
- - kind: ServiceAccount
- name: ray-deployer
----
-apiVersion: rbac.authorization.k8s.io/v1
-kind: Role
-metadata:
- name: ray-deployer
-rules:
- - apiGroups: ["ray.io"]
- resources:
- - rayclusters
- verbs: ["get", "patch"]
- - apiGroups: ["apps"]
- resources:
- - pods
- - deployments
- verbs: ["get", "create", "delete", "list", "watch", "update"]
- - apiGroups: [""]
- resources:
- - services
- verbs: ["get", "create", "delete", "list", "watch", "update"]
+{% include "./service-account.yaml" %}
```
From the root of the ado project run the below command:
- kubectl apply -f backend/kuberay/service-account.yaml
+```commandline
+kubectl apply -f backend/kuberay/service-account.yaml
+```
This will create a ServiceAccount named `ray-deployer`.
We will reference this name later when
@@ -94,6 +67,19 @@ We will reference this name later when
More information about ServiceAccount, Role, and RoleBinding objects can be found
in the [official Kubernetes RBAC documentation](https://kubernetes.io/docs/reference/access-authn-authz/rbac/).
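+
+After applying the manifest, you can sanity-check the ServiceAccount and its
+permissions with `kubectl auth can-i` (substitute your namespace for
+`<namespace>`; this check is illustrative and not part of the required setup):
+
+```commandline
+kubectl get serviceaccount ray-deployer
+kubectl auth can-i create deployments --as=system:serviceaccount:<namespace>:ray-deployer
+```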
+#### Associating a RayCluster with the ServiceAccount
+
+The command below shows how to set the `serviceAccountName` property for the head
+and worker nodes.
+
+
+```bash
+helm upgrade --install ado-ray kuberay/ray-cluster --version 1.1.0 \
+ --values backend/kuberay/vanilla-ray-service-account.yaml \
+ --set head.serviceAccountName=ray-deployer \
+ --set worker.serviceAccountName=ray-deployer
+```
+
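+You can confirm the pods picked up the ServiceAccount by inspecting the
+`serviceAccountName` field of a Ray pod (replace `<ray-head-pod>` with a pod
+name from `kubectl get pods`; shown here only as an illustration):
+
+```commandline
+kubectl get pod <ray-head-pod> -o jsonpath='{.spec.serviceAccountName}'
+```
+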
### Best Practices for Efficient GPU Resource Utilization
To maximize the efficiency of your RayCluster and minimize GPU resource
@@ -124,12 +110,13 @@ Recommended worker setup:
- 4 replicas of a worker with **8 GPUs**
+
Example: The contents of the additionalWorkerGroups field of a RayCluster
with 4 Nodes each with 8 NVIDIA-A100-SXM4-80GB GPUs, 64 CPU cores, and 1TB memory
-
+
```yaml
one-A100-80G-gpu-WG:
replicas: 0
@@ -288,34 +275,24 @@ with 4 Nodes each with 8 NVIDIA-A100-SXM4-80GB GPUs, 64 CPU cores, and 1TB memor
# volumes: ...
# volumeMounts: ....
```
-
+
-!!! note
-
- Notice that the only variant with a **full-worker** custom resource
- is the one with 8 GPUs. Some actuators, like SFTTrainer, use this
- custom resource for measurements that involve reserving an entire GPU node.
-
-We provide [an example set of values](vanilla-ray.yaml) for deploying a
-RayCluster via KubeRay. To deploy it, simply run:
-
- helm upgrade --install ado-ray kuberay/ray-cluster --version 1.1.0 --values backend/kuberay/vanilla-ray.yaml
-
-In the case the ado operation to be executed requires creating Kubernetes
-resources, the RayCluster to be deployed must be associated with a properly
-configured ServiceAccount like the one described [above](#configuring-a-kubernetes-serviceaccount-for-the-raycluster).
-The below command shows how to set the `serviceAccountName` property for head
-and worker nodes.
+> [!IMPORTANT] full-worker custom resource
+>
+> Notice that the only variant with a **full-worker** custom resource
+> is the one with 8 GPUs. Some actuators, like SFTTrainer, use this
+> custom resource for measurements that involve reserving an entire GPU node.
-
-```bash
-helm upgrade --install ado-ray kuberay/ray-cluster --version 1.1.0 \
- --values backend/kuberay/vanilla-ray-service-account.yaml \
- --set head.serviceAccountName=ray-deployer \
- --set worker.serviceAccountName=ray-deployer
-```
+### RayClusters and SFTTrainer
-Feel free to customize the example file provided to suit your cluster,
-such as uncommenting GPU-enabled workers.
+> [!IMPORTANT] HuggingFace home directory
+>
+> If you want to run multi-node measurements with
+> the SFTTrainer actuator make sure that
+> all nodes in your multi-node setup have read and write access
+> to your HuggingFace home directory. On Kubernetes with RayClusters,
+> avoid S3-like filesystems as that is known to cause failures
+> in **transformers**.
+> Use an NFS or GPFS-backed PersistentVolumeClaim instead.
diff --git a/backend/kuberay/service-account.yaml b/backend/kuberay/service-account.yaml
index 3da627ee..cb121a51 100644
--- a/backend/kuberay/service-account.yaml
+++ b/backend/kuberay/service-account.yaml
@@ -36,4 +36,5 @@ rules:
- apiGroups: [""]
resources:
- services
+ - persistentvolumeclaims
verbs: ["get", "create", "delete", "list", "watch", "update"]
\ No newline at end of file
diff --git a/plugins/actuators/vllm_performance/yamls/vllm_actuator_configuration.yaml b/plugins/actuators/vllm_performance/yamls/vllm_actuator_configuration.yaml
new file mode 100644
index 00000000..f601151a
--- /dev/null
+++ b/plugins/actuators/vllm_performance/yamls/vllm_actuator_configuration.yaml
@@ -0,0 +1,16 @@
+# Copyright (c) IBM Corporation
+# SPDX-License-Identifier: MIT
+actuatorIdentifier: vllm_performance
+metadata:
+ name: "Test actuator deployment"
+parameters:
+ benchmark_retries: 3
+ hf_token: 'test' # Set if you need to access a gated model
+ image_secret: ''
+ in_cluster: false
+ interpreter: python3
+ max_environments: 1
+  namespace: null # Must be set to the namespace in which to create deployments
+ node_selector: {}
+ retries_timeout: 5
+ verify_ssl: false
diff --git a/plugins/actuators/vllm_performance/yamls/discoveryspace_override_defaults.yaml b/plugins/actuators/vllm_performance/yamls/vllm_deployment_space.yaml
similarity index 60%
rename from plugins/actuators/vllm_performance/yamls/discoveryspace_override_defaults.yaml
rename to plugins/actuators/vllm_performance/yamls/vllm_deployment_space.yaml
index 898f0947..c382f2f4 100644
--- a/plugins/actuators/vllm_performance/yamls/discoveryspace_override_defaults.yaml
+++ b/plugins/actuators/vllm_performance/yamls/vllm_deployment_space.yaml
@@ -1,7 +1,5 @@
# Copyright (c) IBM Corporation
# SPDX-License-Identifier: MIT
-
-sampleStoreIdentifier: 2963a5
entitySpace:
- identifier: model
propertyDomain:
@@ -11,36 +9,26 @@ entitySpace:
propertyDomain:
values:
- quay.io/dataprep1/data-prep-kit/vllm_image:0.1
- - identifier: n_cpus
- propertyDomain:
- values: [8]
- - identifier: memory
- propertyDomain:
- values: ["128Gi"]
- - identifier: dtype
+ - identifier: "number_input_tokens"
propertyDomain:
- values: ["auto"]
- - identifier: "num_prompts"
- propertyDomain:
- values: [500]
+ values: [1024, 2048, 4096]
- identifier: "request_rate"
propertyDomain:
- values: [-1]
- - identifier: "max_concurrency"
- propertyDomain:
- values: [-1]
- - identifier: "gpu_memory_utilization"
+ domainRange: [1,10]
+ interval: 1
+ - identifier: n_cpus
propertyDomain:
- values: [.9]
- - identifier: "cpu_offload"
+ domainRange: [2,16]
+ interval: 2
+ - identifier: memory
propertyDomain:
- values: [0]
+ values: ["128Gi", "256Gi"]
- identifier: "max_batch_tokens"
propertyDomain:
- values: [16384]
+ values: [1024, 2048, 4096, 8192, 16384, 32768]
- identifier: "max_num_seq"
propertyDomain:
- values: [256]
+ values: [16,32,64]
- identifier: "n_gpus"
propertyDomain:
values: [1]
@@ -51,4 +39,5 @@ experiments:
- actuatorIdentifier: vllm_performance
experimentIdentifier: performance-testing-full
metadata:
- description: Parameters for VLLM performance testing
+ description: A space of vllm deployment configurations
+ name: vllm_deployments
diff --git a/website/docs/actuators/vllm_performance.md b/website/docs/actuators/vllm_performance.md
new file mode 100644
index 00000000..29060ee9
--- /dev/null
+++ b/website/docs/actuators/vllm_performance.md
@@ -0,0 +1,393 @@
+# The `vllm_performance` actuator
+
+> [!TIP] Overview
+> The `vllm_performance` actuator **can
+> automatically create and benchmark [vLLM](https://github.com/vllm-project/vllm) inference deployments on Kubernetes and OpenShift clusters**.
+>
+> It is designed for robust, repeatable, and configurable experiment execution.
+> It is suitable for both simple one-off benchmarks and large parameter sweeps.
+
+
+## Key Capabilities
+
+- **Automated LLM benchmarking:** Deploys vLLM serving endpoints
+on NVIDIA GPU-enabled OpenShift/Kubernetes clusters and runs
+standardized serving benchmarks.
+- **Cluster integration:** Handles deployments and clean-up of vLLM inference
+pods on OpenShift/Kubernetes, with configurable resource selection via namespace,
+node selector, and PVC/service templates.
+- **Scenario configurability:** Supports customizing models, NVIDIA GPU types,
+node selection, retry behavior, concurrent deployments, and more.
+- **Efficient sampling:** Supports grouped sampling, which maximizes reuse
+of vLLM deployments, minimizing the time spent creating them.
+- **Endpoint benchmarking:** Can also be used to benchmark existing
+OpenAI-compatible endpoints.
+
+### Available experiments
+
+The `vllm_performance` actuator implements two experiments:
+
+- `performance-testing-full`: This experiment can test the full vLLM workload configuration,
+including resource requests and server deployment configuration. It deploys
+servers with the given configuration on Kubernetes and runs `vllm bench serve` on them
+with the given parameters.
+- `performance-testing-endpoint`: This experiment is equivalent to running
+`vllm bench serve` against an endpoint.
+
+---
+
+## Running single experiments: Quick endpoint and deployment tests
+
+For rapid testing and debugging, you can use the [`run_experiment`](run_experiment.md)
+tool to execute individual experiments on a single point (entity).
+This is ideal when you want to:
+
+- Quickly check if your actuator installation and configuration works
+- Debug a deployment scenario or endpoint using the vllm_performance actuator
+
+### Running an endpoint test
+
+To test the throughput or limits of an existing vLLM-compatible endpoint, create
+a `point.yaml` file like this:
+
+```yaml
+entity:
+ model: openai/gpt-oss-20b
+ endpoint: http://localhost:8000
+ request_rate: 50
+experiments:
+- actuatorIdentifier: vllm_performance
+ experimentIdentifier: performance-testing-endpoint
+```
+
+Then run:
+
+```shell
+run_experiment point.yaml
+```
+
+This will assess how many requests per second the endpoint can handle for the given
+model and configuration.
+
+> [!TIP] Inference endpoint testing example
+>
+> See [the detailed endpoint scenario](../examples/vllm-performance-endpoint.md)
+> for a production-style workflow exploring inference endpoint throughput.
+
+### Running a deployment test
+
+To launch and benchmark a temporary vLLM deployment
+(including provisioning on Kubernetes/OpenShift), you must provide both:
+
+
+- An entity definition (as before)
+- The identifier of a valid `actuatorconfiguration` resource
+ - This contains information necessary for accessing and creating
+ deployments on the Kubernetes/OpenShift cluster
+ - See [configuring the vllm_performance actuator](#configuring-the-vllm_performance-actuator)
+ for details.
+
+
+Example `point.yaml`:
+
+```yaml
+entity:
+ model: ibm-granite/granite-3.3-8b-instruct
+ n_cpus: 8
+ memory: 128Gi
+ gpu_type: NVIDIA-A100-80GB-PCIe
+ max_batch_tokens: 8192
+ max_num_seq: 32
+ n_gpus: 1
+experiments:
+- actuatorIdentifier: vllm_performance
+ experimentIdentifier: performance-testing-full
+```
+
+Then run:
+
+```shell
+run_experiment point.yaml --actuator-config-id my-vllm-performance-config
+```
+
+Here `my-vllm-performance-config` is the ID of an `actuatorconfiguration` resource
+containing the details for accessing and running on your target cluster.
+See [configuring the vllm_performance actuator](#configuring-the-vllm_performance-actuator)
+for more.
+
+This command will provision the deployment for the specified entity, using your indicated
+actuator configuration, run the benchmark, and print results.
+
+> [!TIP] vLLM deployment example
+>
+> See [the vLLM deployment exploration example](../examples/vllm-performance-full.md)
+> for details on how to explore many deployment configurations.
+
+---
+
+## Configuring the vllm_performance actuator
+
+You can configure how the `vllm_performance` actuator creates,
+manages, and monitors vLLM deployments on a Kubernetes/OpenShift
+cluster.
+This configuration covers several needs:
+
+- **Cluster targeting and permissions**: Specify the OpenShift/Kubernetes namespace
+and optionally node selectors, secrets, and templates to match your cluster resources.
+- **Secure access**: Pass required HuggingFace tokens, set up image pull secrets,
+control in-cluster or remote execution, and toggle SSL verification.
+- **Experiment protocol and retries**: Choose how benchmarks are run, including interpreter,
+retry logic, and YAML templates for deployments/services used.
+- **Deployment resource management**: Limit the number of concurrent deployments
+and control automated clean-up.
+
+You supply this configuration information as an `ado`
+[`actuatorconfiguration` resource](../resources/actuatorconfig.md),
+which is a YAML file with the configuration options.
+An example is:
+
+
+```yaml
+actuatorIdentifier: vllm_performance # The actuator the configuration is for
+metadata:
+ description: "Actuator config for vLLM LLM benchmarking"
+ name: demo-vllm-perf
+parameters:
+ benchmark_retries: 3 # Number of benchmark attempts (see Failure Handling)
+ hf_token: "" # Required for pulling some models
+ image_secret: "" # Optional image pull secret
+ in_cluster: false # Set to true if running from within the cluster
+ interpreter: python3 # Language for test drivers/benchmarks
+ max_environments: 1 # Max concurrent vLLM deployments
+ namespace: "mynamespace" # OpenShift/K8s namespace to deploy into
+ node_selector: # A dictionary of Kubernetes node_selector key:value pairs
+ "kubernetes.io/hostname":"gpunode01"
+ pvc_name: null # Name of existing PVC to use. If null/omitted a temporary PVC is created
+ retries_timeout: 5 # Seconds between retries (exponential backoff)
+ verify_ssl: false # Whether to verify HTTPS endpoints
+```
+
+
+If the above YAML was saved to a file called `vllm_config.yaml`, you would create
+the configuration using:
+
+```commandline
+ado create actuatorconfiguration -f vllm_config.yaml
+```
+
+> [!WARNING] namespace
+>
+> The critical parameter you must set in the configuration is `namespace`.
+
+
+> [!WARNING] GPU type
+>
+> The GPU type to use in an experiment is set via the experiment itself (`performance-testing-full`).
+> **Do not** set this via the `node_selector` parameter of the configuration.
+
+
+> [!TIP] Further details
+>
+> For further details on specific options and advanced behavior see:
+>
+> - [Maximum number of deployments](#maximum-number-of-deployments) (details on `max_environments`)
+> - [Handling benchmark failures](#handling-benchmark-failures) and [Deployment Clean-Up](#deployment-clean-up)
+> - [Grouped sampling for efficient deployment usage](#grouped-sampling-for-efficient-deployment-usage)
+
+### Multiple configurations
+
+You can create multiple `actuatorconfiguration`s for the `vllm_performance` actuator.
+Each configuration captures
+the cluster-specific, security-sensitive, and experiment-relevant settings necessary
+for the actuator to operate in a given environment.
+Each configuration will have a different id and you can choose the one to use
+when submitting an operation or single experiment that uses the `vllm_performance`
+actuator.
+
+> [!TIP] Getting a default configuration
+>
+> You can generate a default configuration via the ado CLI:
+>
+> ```shell
+> ado template actuatorconfiguration --actuator-identifier vllm_performance -o actuatorconfiguration.yaml
+> ```
+
+---
+
+## vLLM deployment management
+
+### The `in_cluster` configuration option
+
+The `in_cluster` option in your `actuatorconfiguration` tells the `vllm_performance`
+actuator how to communicate with the target Kubernetes or OpenShift cluster when
+running `performance-testing-full`.
+
+If running `ado` from outside the Kubernetes/OpenShift cluster where
+the deployments will be created, leave `in_cluster: false` (the default).
+
+Set `in_cluster: true` if your `ado` operation will be run on a
+**remote Ray cluster that is in the same Kubernetes/OpenShift cluster** as your
+vLLM deployments.
+This configuration maximizes efficiency for large-scale, distributed benchmarking.
+For a detailed guide on running `ado` remotely on a Ray cluster, including environment
+and package setup, see [Running ado remotely](../getting-started/remote_run.md).
+
+> [!IMPORTANT] RayCluster permissions
+>
+> If running with `in_cluster: true`, your RayCluster **must** be configured so that
+> jobs launched by `ado` have permissions to create and manage Kubernetes deployments,
+> pods, and services.
+> For configuring the necessary ServiceAccount, roles, and permissions,
+> see our [documentation on deploying RayClusters for `ado`](../getting-started/installing-backend-services.md).
+
+
+> [!TIP] Installing the `vllm_performance` actuator on a remote RayCluster
+>
+> If the `ado-vllm-performance` actuator is not installed in the
+> image used by the RayCluster, you can have [Ray install it by following
+> this guide](../getting-started/remote_run.md).
+>
+> In particular, if a compatible version of vLLM is not installed
+> in the image this step will require installing vLLM on each RayCluster node
+> (so `vllm bench serve` is available).
+> This can take some time so you may see the `ado` `operation` output "hang"
+> while this is happening.
+
+### Maximum number of deployments
+
+The actuator configuration parameter `max_environments` controls how
+many concurrent vLLM deployments will be created. The default is 1.
+
+When experiments are requested, if an existing deployment cannot
+be used, a new environment is created, as long as `max_environments` has
+not been reached.
+If it has been reached, then the actuator waits for an existing
+environment to become idle, at which point it is deleted and
+the new environment is created.
+
+Some notes:
+
+
+- `max_environments` deployments are always created before any are deleted
+  - This means idle environments will remain until there is a need to delete them
+  - This increases the chances they can be reused and minimizes the cost of redeploying
+- Environment creation is serialized
+ - If `max_environments` is reached and all are active, the first experiment
+ that requires a new environment will block. Subsequent experiment
+ requests will queue behind it in FIFO order until it can proceed (i.e. delete
+ an existing environment and create the one it needs)
+
+
+### Handling benchmark failures
+
+Once a deployment is created and the vLLM health endpoint is responding to requests
+(pod running, container ready), or 20 minutes have elapsed, the actuator runs
+`vllm bench serve` against it.
+The 20-minute timeout ensures the wait does not block forever if something
+goes wrong in Kubernetes and the health check can never pass.
+
+When running the benchmark, the actuator will try up to `benchmark_retries` times,
+backing off exponentially based on `retries_timeout`, to run the benchmark successfully.
+Retries may be required because, for large models, 20 minutes may not be
+sufficient to download and load the model for serving.
+Since `vllm bench serve` itself waits 10 minutes for the endpoint to come up,
+with `benchmark_retries: 3` (the default) there is roughly a 50-minute to 1-hour
+timeout for the endpoint to become available.
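+
+As an illustration of how these settings combine, the sketch below computes the
+cumulative back-off for the defaults, assuming the delay doubles on each attempt
+starting from `retries_timeout` seconds (the exact schedule is an assumption for
+illustration):
+
+```bash
+retries_timeout=5
+benchmark_retries=3
+total=0
+for ((attempt = 0; attempt < benchmark_retries; attempt++)); do
+  # Assumed schedule: retries_timeout * 2^attempt seconds per attempt
+  total=$((total + retries_timeout * (2 ** attempt)))
+done
+echo "total back-off: ${total}s" # 5 + 10 + 20 = 35 seconds
+```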
+
+### PVCs
+
+#### `pvc_name` not given
+
+If no `pvc_name` is set in the `actuatorconfiguration`, when an actuator
+instance is created with this configuration, e.g., via `create operation` or `run_experiment`,
+it creates a PVC called `vllm-support-$UUID` that is shared by all deployments
+it creates.
+The `$UUID` is a randomly generated string that will vary each time the
+actuator is created.
+When the `operation` or `run_experiment` exits, this PVC will be deleted.
+
+#### `pvc_name` given
+
+If a `pvc_name` is set in the `actuatorconfiguration`, when an actuator
+instance is created with this configuration, e.g., via `create operation`
+or `run_experiment`,
+it will look for an existing PVC with the given name.
+If the PVC exists it will be used for all deployments the actuator instance
+creates.
+When the `operation` or `run_experiment` exits, this PVC will NOT be deleted.
+If the PVC does not exist, the actuator will exit with an error.
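+
+As an illustration, a minimal PVC you could pre-create and then reference via
+`pvc_name` might look like this (the name, access mode, and size are
+placeholders to adapt to your cluster):
+
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: vllm-shared-cache # then set pvc_name: vllm-shared-cache
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 100Gi
+```
+
+Create it with `kubectl apply -f` before starting the operation.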
+
+### Deployment Clean-Up
+
+The `vllm_performance` actuator will automatically clean up
+all Kubernetes resources associated with the vLLM deployments as it proceeds,
+leaving at most `max_environments` active at a time.
+On a graceful shutdown of the `ado` process running the operation
+(CTRL-C, SIGTERM, SIGINT) active deployments will be deleted
+before exit.
+On an uncontrolled shutdown (SIGKILL) you will need to manually
+clean up any K8s deployments that were running at the time.
+
+> [!IMPORTANT] PVC Deletion
+>
+> If the actuator created a PVC (i.e. `vllm-support-$UUID`) it will be deleted.
+>
+> If the actuator used an existing PVC it will not be deleted.
+
+### Kubernetes resource templates
+
+The `vllm_performance` actuator creates Kubernetes resources
+based on a set of template YAML files
+that are distributed with the actuator.
+The templates are for:
+
+- vLLM deployment
+- PVC used by deployment pod
+- vLLM service
+
+You can use your own templates by creating a `vllm_performance`
+`actuatorconfiguration` resource with the following
+fields set to the paths of your templates:
+
+```yaml
+deployment_template: $PATH_RELATIVE_TO_WORKING_DIR
+service_template: $PATH_RELATIVE_TO_WORKING_DIR
+pvc_template: $PATH_RELATIVE_TO_WORKING_DIR
+```
+
+Then use this `actuatorconfiguration` resource
+when running operations with the actuator.
+
+The paths given are always interpreted relative to the
+working directory of the process using the actuator
+(where `ado create operation` or `run_experiment` is executed).
+
+> [!IMPORTANT] Custom templates and executing on remote RayClusters
+>
+> The template path must be accessible where the actuator is running.
+> This is important to consider when running operations using
+> `vllm_performance` on a remote RayCluster.
+> To handle this we recommend:
+>
+> - Put custom templates in the working directory (or a subdirectory of it)
+> that you will
+> [send to the RayCluster](../getting-started/remote_run.md#other-options)
+> - Create an `actuatorconfiguration` with the relative paths to the
+> templates from this working directory
+>
+
+### Grouped sampling for efficient deployment usage
+
+Creating and deleting vLLM deployments takes time.
+If you have a limited number of vLLM deployments that can be
+created concurrently, say one, then this can add significant
+overhead if consecutive points being sampled require
+different deployments.
+The [grouped sampling](../operators/random-walk.md#enabling-grouping)
+feature of the `random_walk` operator can be useful in this case.
+This allows configuring the sampling so that points which
+require a given vLLM deployment are submitted in a batch.
diff --git a/website/docs/examples/vllm-performance-endpoint.md b/website/docs/examples/vllm-performance-endpoint.md
index a6bc0452..8eb7cd22 100644
--- a/website/docs/examples/vllm-performance-endpoint.md
+++ b/website/docs/examples/vllm-performance-endpoint.md
@@ -2,7 +2,9 @@
> [!NOTE] The scenario
>
-> **In this example, the _vllm_performance_ actuator is used to find
+> **In this example,
+> the [_vllm_performance_ actuator](../actuators/vllm_performance.md)
+> is used to find
> the maximum requests per second a server can handle while maintaining
> stable maximum throughput.**
>
@@ -16,12 +18,13 @@
> To explore this space, you will:
>
> - define an endpoint, model and range of requests per second to test
-> - use an optimizer to efficiently find the maximum requests per second
+> - use [an optimizer](../operators/optimisation-with-ray-tune.md)
+> to efficiently find the maximum requests per second
> [!IMPORTANT] Prerequisites
>
-> - An endpoint serving an LLM in an OpenAI API-compatible format
+> - An endpoint serving an LLM via an OpenAI API-compatible API
> - Install the following Python packages:
>
> ```bash
@@ -37,11 +40,15 @@
> from [our repository](https://github.com/IBM/ado/tree/main/plugins/actuators/vllm_performance/yamls).
>
> - `vllm_request_rate_space.yaml`: this file defines the _endpoint_, _model_,
-> and _request_ _range_ to explore.
-> - **You must edit the _model_ and _endpoint_ fields in this file
-> to match your own.**
+> and _request_ _range_ to explore.
+>
+>
+> - **You must edit the _model_ and _endpoint_ fields in this file
+> to match your own.**
+>
+>
> - `operation_hyperopt.yaml`: this file contains the optimization parameters.
-> You do not need to edit it.
+> You do not need to edit it.
>
> Then, in a directory with these files, execute:
>
@@ -213,12 +220,16 @@ and the best region is unlikely to be visited.
## Next steps
+
- Use `ado describe experiment vllm_performance_endpoint` to see what
other parameters can be explored
- Try varying **`burstiness`** or **`number_input_tokens`**, or adding
them as dimensions of the `entityspace`, to explore their impact on throughput
- Try varying `num_samples`, `gamma` and `n_initial_points` parameters of hyperopt
- - You can keep running the optimization on the same `discoveryspace`.
- The previous runs will not influence new runs, but their results will
- be reused, speeding experimentation up
+ - You can keep running the optimization on the same `discoveryspace`.
+ The previous runs will not influence new runs, but their results will
+ be reused, speeding experimentation up
- Measure the [performance of vLLM deployment configurations](vllm-performance-full.md)
+- Check the [`vllm_performance` actuator documentation](../actuators/vllm_performance.md)
diff --git a/website/docs/examples/vllm-performance-full.md b/website/docs/examples/vllm-performance-full.md
index 161fe516..c6b83230 100644
--- a/website/docs/examples/vllm-performance-full.md
+++ b/website/docs/examples/vllm-performance-full.md
@@ -1,67 +1,67 @@
# Exploring vLLM deployment configurations
-> [!NOTE]
+> [!NOTE] The scenario
+>
+> **In this example,
+> the [_vllm_performance_ actuator](../actuators/vllm_performance.md)
+> is used to evaluate
+> different vLLM server deployment configurations on Kubernetes/OpenShift.**
+>
+> When deploying vLLM, you must choose values for parameters like GPU type, batch
+> size, and memory limits. These choices directly affect performance, cost, and
+> scalability. To find the best configuration for your workload, whether you are
+> optimizing for latency, throughput, or cost, you need to explore the deployment
+> parameter space. In this example:
>
-> This example illustrates using the vllm-performance actuator to discover
-> how best to deploy vLLM for a given use-case
+> - We will define a space of vLLM deployment configurations to test with
+> the `vllm_performance` actuator's `performance-testing-full` experiment
+> - This experiment can create and characterize a vLLM deployment on Kubernetes
+> - Use the [`random_walk` operator](../operators/random-walk.md) to
+> explore the space
-> [!IMPORTANT]
+> [!IMPORTANT] Prerequisites
>
-> **Prerequisites**
+> - Be logged-in to your Kubernetes/OpenShift cluster
+> - Have access to a namespace where you can create vLLM deployments
+> - Install the following Python packages locally:
>
-> - Access to a k8s namespace where you can deploy vLLM
-
-## The scenario
-
-When deploying vLLM, you must choose values for parameters like GPU type, batch
-size, and memory limits. These choices directly affect performance, cost, and
-scalability. To find the best configuration for your workload, whether you are
-optimizing for latency, throughput, or cost—you need to explore the deployment
-parameter space.
-
-In this example:
-
-- We will define a space of vLLM deployment configurations to test with
-the `vllm_performance` actuator's `performance_testing_full` experiment
- - This experiment can create and characterize a vLLM deployment on Kubernetes
-- Use the `random_walk` operator to explore the space
-
-## Install the actuator
-
-[//]: # (If you haven't already:)
-
-[//]: # ()
-[//]: # (```commandline)
-
-[//]: # (pip install ado-vllm-performance)
-
-[//]: # (```)
-
-[//]: # ()
-[//]: # (If you have cloned the `ado` source repository you can also do:)
-
-[//]: # ()
-[//]: # (```commandline)
-
-[//]: # (# From the root of this repository )
-
-[//]: # (pip install -e plugins/actuators/vllm_performance)
-
-[//]: # (```)
-
-Execute:
-
-```commandline
-pip install -e plugins/actuators/vllm_performance
-```
-
-in the root of the `ado` source repository.
-You can clone the repository with
+> ```bash
+> pip install ado-vllm-performance
+> ```
+
-```commandline
-git clone https://github.com/IBM/ado.git
-```
+> [!TIP] TL;DR
+>
+> Get the files `vllm_deployment_space.yaml`, `vllm_actuator_configuration.yaml`
+> and `operation_random_walk.yaml` from
+>
+> [our repository](https://github.com/IBM/ado/tree/main/plugins/actuators/vllm_performance/yamls).
+>
+>
+> **You must edit `vllm_actuator_configuration.yaml` with your details.**
+> In particular the following two fields are important:
+>
+> ```yaml
+> hf_token: # Required to access gated models
+> namespace: vllm-testing # you MUST set this to a namespace where you can create vLLM deployments
+> ```
+>
+> Then, in a directory with these files, execute:
+>
+> ```bash
+> # Define the configurations to explore
+> ado create space -f vllm_deployment_space.yaml
+> # Create a configuration for the actuator - normally just once, as it can be reused
+> ado create actuatorconfiguration -f vllm_actuator_configuration.yaml
+> # Explore!
+> ado create operation -f random_walk_operation_grouped.yaml --use-latest space --use-latest actuatorconfiguration
+> ```
+>
+> See [configuring the `vllm_performance` actuator](../actuators/vllm_performance.md#configuring-the-vllm_performance-actuator)
+> for more configuration options.
+
+## Verify the installation
+
Verify the installation with:
@@ -69,43 +69,42 @@ Verify the installation with:
ado get actuators --details
```
-The actuator `vllm_performance` will appear in the list of available actuators.
+The actuator `vllm_performance` should appear in the list of available actuators
+if installation completed successfully.
## Create an actuator configuration
-The vllm-performance actuator needs some information the target cluster to
+The vllm-performance actuator needs some information about the target cluster to
deploy on. This is provided via an `actuatorconfiguration`.
-First execute,
+First execute:
```commandline
-# Generate the template file
-ado template actuatorconfiguration --actuator-identifier vllm_performance -o actuatorconfiguration.yaml
+ado template actuatorconfiguration --actuator-identifier vllm_performance -o vllm_actuator_configuration.yaml
```
-This will create a file called `vllm_performance_actuatorconfiguration.yaml`
-
-Edit the file and set correct values for the following fields:
+This will create a file called `vllm_actuator_configuration.yaml`.
+
+Edit the file and set correct values for at least the `namespace` field.
+Also consider whether you need to supply a value for `hf_token`:
```yaml
-hf_token:
-namespace: vllm-testing # OpenShift namespace you have write access to
-node_selector: '{"kubernetes.io/hostname":""}' # JSON string selecting a node that owns GPU
+hf_token: # Required to access gated models
+namespace: vllm-testing # you MUST set this to a namespace where you can create vLLM deployments
```
Then save this configuration as an `actuatorconfiguration` resource:
```bash
-ado create actuatorconfiguration -f vllm_performance_actuatorconfiguration.yaml
+ado create actuatorconfiguration -f vllm_actuator_configuration.yaml
```
> [!TIP]
>
> You can create multiple actuator configurations corresponding
-> to different clusters/target environments.
-> You choose the one to use when you launch an operation requiring the actuator
+> to different target environments.
+> You choose the one to use when you launch an operation requiring the actuator.
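For example, you could keep one configuration per environment and list them to find the identifier to use later. A minimal sketch (the file names here are hypothetical):

``` commandline
# One configuration per target environment (hypothetical file names)
ado create actuatorconfiguration -f vllm_actuator_configuration_dev.yaml
ado create actuatorconfiguration -f vllm_actuator_configuration_prod.yaml
# List the stored configurations to find the identifier to reference later
ado get actuatorconfigurations
```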
## Define the configurations to test
@@ -120,8 +119,6 @@ deployment parameters, including `max_num_seq` and `max_batch_tokens`, for a
scenario where requests arrive between 1 and 10 per second with sizes
around 2000 tokens.
-Save the following as `vllm_discoveryspace.yaml`:
-
```yaml
entitySpace:
- identifier: model
@@ -166,11 +163,11 @@ metadata:
name: vllm_deployments
```
-Save the above as `vllm_discoveryspace.yaml`.
+Save the above as `vllm_deployment_space.yaml`.
Then run:
```bash
-ado create space -f vllm_discoveryspace.yaml
+ado create space -f vllm_deployment_space.yaml
```
## Explore the space with random_walk
@@ -180,7 +177,7 @@ efficiency. The `grouped` sampler ensures we explore all the different benchmark
configurations for a given vLLM deployment before creating a new deployment -
minimizing the number of deployment creations.
-Save the following as `random_walk.yaml`:
+Save the following as `random_walk_operation_grouped.yaml`:
```yaml
metadata:
@@ -212,11 +209,12 @@ operation:
Then, start the operation with:
```commandline
-ado create operation -f random_walk.yaml \
+ado create operation -f random_walk_operation_grouped.yaml \
--use-latest space --use-latest actuatorconfiguration
```
-Results will appear as they are measured.
+As the operation runs, a table of the results is updated live in the terminal.
### Monitor the optimization
@@ -227,10 +225,7 @@ While the operation is running you can monitor the deployment:
oc get deployments --watch -n vllm-testing
```
-As it runs a table of the results is updated
-live in the terminal as they come in.
-
-You can also get the table be executing (in another terminal)
+You can also get the results table by executing (in another terminal):
```commandline
ado show entities operation --use-latest
@@ -253,6 +248,7 @@ ado show entities space --output-format csv --use-latest
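Once exported, the CSV can be post-processed with standard tools. A minimal sketch using only the Python standard library; the column names below (`max_num_seq`, `throughput`) and the sample values are hypothetical stand-ins for whatever your discovery space actually produces:

```python
# Sketch: find the best-performing configuration in an exported results CSV.
# Column names and values are illustrative; real ones depend on your space.
import csv
import io

# Stand-in for the file written by `ado show entities space --output-format csv ...`
sample = io.StringIO(
    "max_num_seq,max_batch_tokens,throughput\n"
    "64,2048,1250.5\n"
    "128,4096,1410.2\n"
    "256,8192,1322.7\n"
)

rows = list(csv.DictReader(sample))
# Pick the row with the highest throughput
best = max(rows, key=lambda r: float(r["throughput"]))
print(best["max_num_seq"], best["throughput"])  # → 128 1410.2
```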
## Next steps
+
- Try varying **`max_batch_tokens`** or **`gpu_memory_utilization`** to
explore the impact on throughput.
- Try creating a different `actuatorconfiguration` with more
@@ -261,3 +257,7 @@ explore the impact on throughput.
- Use **RayTune**
(see the [vLLM endpoint performance](vllm-performance-endpoint.md) example)
to optimise the hyper‑parameters of the benchmark.
+- Run [the exploration on the OpenShift/Kubernetes cluster](../actuators/vllm_performance.md#the-in_cluster-configuration-option)
+  you create the deployments on, so you don't have to keep your laptop open.
+- Check the [`vllm_performance` actuator documentation](../actuators/vllm_performance.md) for more details.
+
\ No newline at end of file
diff --git a/website/docs/getting-started/service-account.yaml b/website/docs/getting-started/service-account.yaml
new file mode 120000
index 00000000..3952243f
--- /dev/null
+++ b/website/docs/getting-started/service-account.yaml
@@ -0,0 +1 @@
+../../../backend/kuberay/service-account.yaml
\ No newline at end of file
diff --git a/website/docs/resources/actuatorconfig.md b/website/docs/resources/actuatorconfig.md
index ebce72bf..8da0aba7 100644
--- a/website/docs/resources/actuatorconfig.md
+++ b/website/docs/resources/actuatorconfig.md
@@ -88,11 +88,13 @@ the `operation` resource documentation for details.
### Other ado commands that work with actuatorconfiguration
+
- `ado get actuatorconfigurations`
- - list stored `actuatorconfiguration`s or retrieve their representations
+ - list stored `actuatorconfiguration`s or retrieve their representations
- `ado show related actuatorconfiguration ID`
- - show operations using an `actuatorconfiguration`
+ - show operations using an `actuatorconfiguration`
- `ado edit actuatorconfiguration ID`
- - set the name, description, and labels for an `actuatorconfiguration`
+ - set the name, description, and labels for an `actuatorconfiguration`
- `ado delete actuatorconfiguration ID`
- - delete an `actuatorconfiguration`
+ - delete an `actuatorconfiguration`
+
diff --git a/website/docs/resources/resources.md b/website/docs/resources/resources.md
index ce9e9085..950eddc5 100644
--- a/website/docs/resources/resources.md
+++ b/website/docs/resources/resources.md
@@ -59,26 +59,28 @@ metastore.
Here is a list of common `ado` CLI commands for interacting with resources. See
the [ado CLI guide](../getting-started/ado.md) for more details
+
- `ado get [resource type]`
- - Lists all resources of the requested type
+ - Lists all resources of the requested type
- `ado get [resource type] [$identifier] -o yaml`
- - Outputs the YAML of resource `$identifier`
+ - Outputs the YAML of resource `$identifier`
- `ado create [resource type] -f [YAMLFILE]`
- - Creates the resource of the specified type from the definition in "YAMLFILE"
+ - Creates the resource of the specified type from the definition in "YAMLFILE"
- `ado delete [resource type] [$identifier]`
- - Deletes the resource of the specified type with the provided identifier from
+ - Deletes the resource of the specified type with the provided identifier from
the database. See the [deleting resources](#deleting-resources) section for
more information and considerations to keep in mind.
- `ado describe [resource type] [$identifier]`
- - Outputs a human-readable description of resource `$identifier`
+ - Outputs a human-readable description of resource `$identifier`
- `ado show related [resource type] [$identifier]`
- - List ids of resources related to resource `$identifier`
+ - List ids of resources related to resource `$identifier`
- `ado show details [resource type] [$identifier]`
- - Outputs some details on the resource. Usually these are quantities that have
+ - Outputs some details on the resource. Usually these are quantities that have
to be computed.
- `ado template [resource type] --include-schema`
- - Outputs a default YAML for the given resource along with a schema file
+ - Outputs a default YAML for the given resource along with a schema file
explaining the fields.
+
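+Combining the commands above, a typical resource lifecycle might look like the
+following sketch (using `space` as the resource type; `space.yaml` and
+`$IDENTIFIER` are placeholders for your own file and resource identifier):
+
+``` commandline
+# Generate a template YAML (with schema) for the resource type
+ado template space --include-schema
+# Create the resource from an edited copy of the template
+ado create space -f space.yaml
+# List resources of this type and inspect one
+ado get spaces
+ado describe space $IDENTIFIER
+# Remove the resource when it is no longer needed
+ado delete space $IDENTIFIER
+```
+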
### Deleting resources
diff --git a/website/mkdocs.yml b/website/mkdocs.yml
index f577134e..54ec6e1a 100644
--- a/website/mkdocs.yml
+++ b/website/mkdocs.yml
@@ -191,6 +191,7 @@ nav:
- Adding custom experiments: actuators/creating-custom-experiments.md
- Running experiments on single entities: actuators/run_experiment.md
- Using externally obtained data: actuators/replay.md
+ - vllm_performance - measure inference performance: actuators/vllm_performance.md
#- ST4SD: actuators/st4sd.md
- SFTTrainer - measure fine-tuning performance : actuators/sft-trainer.md
#- Molformer: actuators/molformer.md