diff --git a/.secrets.baseline b/.secrets.baseline index eca42867..bf0bc6aa 100644 --- a/.secrets.baseline +++ b/.secrets.baseline @@ -3,7 +3,7 @@ "files": "requirements.txt|^.secrets.baseline$", "lines": null }, - "generated_at": "2025-11-05T16:16:55Z", + "generated_at": "2025-11-10T08:32:10Z", "plugins_used": [ { "name": "AWSKeyDetector" @@ -414,7 +414,7 @@ } ] }, - "version": "0.13.1+ibm.64.dss", + "version": "0.13.1+ibm.62.dss", "word_list": { "file": null, "hash": null diff --git a/backend/kuberay/README.md b/backend/kuberay/README.md index 7e4e462c..e0a2f9b4 100644 --- a/backend/kuberay/README.md +++ b/backend/kuberay/README.md @@ -18,21 +18,26 @@ the ## Deploying a RayCluster -> [!WARNING] +> [!WARNING] Ray version compatibility > -> The `ray` versions must be compatible. For a more in depth guide refer to the -> [RayCluster configuration](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html) +> The `ray` version set in KubeRay YAML and the one +> used in the ray head and worker containers must be compatible. +> For a more in depth guide refer to the [RayCluster configuration](https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html) > page. -!!! note +We provide [an example set of values](vanilla-ray.yaml) for deploying a +RayCluster via KubeRay. To deploy it run: + +``` commandline +helm upgrade --install ado-ray kuberay/ray-cluster --version 1.1.0 --values backend/kuberay/vanilla-ray.yaml +``` - When running multi-node measurement make sure that - all nodes in your multi-node setup have read and write access - to your HuggingFace home directory. On Kubernetes with RayCluster, - avoid S3-like filesystems as that is known to cause failures - in **transformers**. Use a NFS or GPFS-backed PersistentVolumeClaim instead. +Feel free to customize the example file provided to suit your cluster, +such as uncommenting GPU-enabled workers. 
-### Configuring a Kubernetes ServiceAccount for the RayCluster +### Enabling ado actuators to create K8s resources + +#### Configuring a ServiceAccount for the RayCluster The default Kubernetes ServiceAccount created for a RayCluster does not have enough permissions for an ado actuator to create Kubernetes resources @@ -46,46 +51,14 @@ It also provides access to the RayCluster resources. ```yaml -apiVersion: v1 -kind: ServiceAccount -metadata: - name: ray-deployer ---- -apiVersion: rbac.authorization.k8s.io/v1 -kind: RoleBinding -metadata: - name: ray-deployer -roleRef: - apiGroup: rbac.authorization.k8s.io - kind: Role - name: ray-deployer -subjects: - - kind: ServiceAccount - name: ray-deployer ---- -apiVersion: rbac.authorization.k8s.io/v1 -kind: Role -metadata: - name: ray-deployer -rules: - - apiGroups: ["ray.io"] - resources: - - rayclusters - verbs: ["get", "patch"] - - apiGroups: ["apps"] - resources: - - pods - - deployments - verbs: ["get", "create", "delete", "list", "watch", "update"] - - apiGroups: [""] - resources: - - services - verbs: ["get", "create", "delete", "list", "watch", "update"] +{% include "./service-account.yaml" %} ``` From the root of the ado project run the below command: - kubectl apply -f backend/kuberay/service-account.yaml +```commandline +kubectl apply -f backend/kuberay/service-account.yaml +``` This will create a ServiceAccount named `ray-deployer`. We will reference this name later when @@ -94,6 +67,19 @@ We will reference this name later when More information about ServiceAccount, Role, and RoleBinding objects can be found in the [official Kubernetes RBAC documentation](https://kubernetes.io/docs/reference/access-authn-authz/rbac/). +#### Associating a RayCluster with the ServiceAccount + +The below command shows how to set the `serviceAccountName` property for head +and worker nodes. 
+ + +```bash +helm upgrade --install ado-ray kuberay/ray-cluster --version 1.1.0 \ + --values backend/kuberay/vanilla-ray-service-account.yaml \ + --set head.serviceAccountName=ray-deployer \ + --set worker.serviceAccountName=ray-deployer +``` + ### Best Practices for Efficient GPU Resource Utilization To maximize the efficiency of your RayCluster and minimize GPU resource @@ -124,12 +110,13 @@ Recommended worker setup: - 4 replicas of a worker with **8 GPUs** +
Example: The contents of the additionalWorkerGroups field of a RayCluster with 4 Nodes each with 8 NVIDIA-A100-SXM4-80GB GPUs, 64 CPU cores, and 1TB memory - + ```yaml one-A100-80G-gpu-WG: replicas: 0 @@ -288,34 +275,24 @@ with 4 Nodes each with 8 NVIDIA-A100-SXM4-80GB GPUs, 64 CPU cores, and 1TB memor # volumes: ... # volumeMounts: .... ``` - +
-!!! note - - Notice that the only variant with a **full-worker** custom resource - is the one with 8 GPUs. Some actuators, like SFTTrainer, use this - custom resource for measurements that involve reserving an entire GPU node. - -We provide [an example set of values](vanilla-ray.yaml) for deploying a -RayCluster via KubeRay. To deploy it, simply run: - - helm upgrade --install ado-ray kuberay/ray-cluster --version 1.1.0 --values backend/kuberay/vanilla-ray.yaml - -In the case the ado operation to be executed requires creating Kubernetes -resources, the RayCluster to be deployed must be associated with a properly -configured ServiceAccount like the one described [above](#configuring-a-kubernetes-serviceaccount-for-the-raycluster). -The below command shows how to set the `serviceAccountName` property for head -and worker nodes. +> [!IMPORTANT] full-worker custom resource +> +> Notice that the only variant with a **full-worker** custom resource +> is the one with 8 GPUs. Some actuators, like SFTTrainer, use this +> custom resource for measurements that involve reserving an entire GPU node. - -```bash -helm upgrade --install ado-ray kuberay/ray-cluster --version 1.1.0 \ - --values backend/kuberay/vanilla-ray-service-account.yaml \ - --set head.serviceAccountName=ray-deployer \ - --set worker.serviceAccountName=ray-deployer -``` +### RayClusters and SFTTrainer -Feel free to customize the example file provided to suit your cluster, -such as uncommenting GPU-enabled workers. +> [!IMPORTANT] HuggingFace home directory +> +> If you want to run multi-node measurements with +> the SFTTrainer actuator make sure that +> all nodes in your multi-node setup have read and write access +> to your HuggingFace home directory. On Kubernetes with RayClusters, +> avoid S3-like filesystems as that is known to cause failures +> in **transformers**. +> Use a NFS or GPFS-backed PersistentVolumeClaim instead. 
diff --git a/backend/kuberay/service-account.yaml b/backend/kuberay/service-account.yaml index 3da627ee..cb121a51 100644 --- a/backend/kuberay/service-account.yaml +++ b/backend/kuberay/service-account.yaml @@ -36,4 +36,5 @@ rules: - apiGroups: [""] resources: - services + - persistentvolumeclaims verbs: ["get", "create", "delete", "list", "watch", "update"] \ No newline at end of file diff --git a/plugins/actuators/vllm_performance/yamls/vllm_actuator_configuration.yaml b/plugins/actuators/vllm_performance/yamls/vllm_actuator_configuration.yaml new file mode 100644 index 00000000..f601151a --- /dev/null +++ b/plugins/actuators/vllm_performance/yamls/vllm_actuator_configuration.yaml @@ -0,0 +1,16 @@ +# Copyright (c) IBM Corporation +# SPDX-License-Identifier: MIT +actuatorIdentifier: vllm_performance +metadata: + name: "Test actuator deployment" +parameters: + benchmark_retries: 3 + hf_token: 'test' # Set if you need to access a gated model + image_secret: '' + in_cluster: false + interpreter: python3 + max_environments: 1 + namespace: null # Must set to the namespace to create deployments + node_selector: {} + retries_timeout: 5 + verify_ssl: false diff --git a/plugins/actuators/vllm_performance/yamls/discoveryspace_override_defaults.yaml b/plugins/actuators/vllm_performance/yamls/vllm_deployment_space.yaml similarity index 60% rename from plugins/actuators/vllm_performance/yamls/discoveryspace_override_defaults.yaml rename to plugins/actuators/vllm_performance/yamls/vllm_deployment_space.yaml index 898f0947..c382f2f4 100644 --- a/plugins/actuators/vllm_performance/yamls/discoveryspace_override_defaults.yaml +++ b/plugins/actuators/vllm_performance/yamls/vllm_deployment_space.yaml @@ -1,7 +1,5 @@ # Copyright (c) IBM Corporation # SPDX-License-Identifier: MIT - -sampleStoreIdentifier: 2963a5 entitySpace: - identifier: model propertyDomain: @@ -11,36 +9,26 @@ entitySpace: propertyDomain: values: - quay.io/dataprep1/data-prep-kit/vllm_image:0.1 - - identifier: n_cpus 
- propertyDomain: - values: [8] - - identifier: memory - propertyDomain: - values: ["128Gi"] - - identifier: dtype + - identifier: "number_input_tokens" propertyDomain: - values: ["auto"] - - identifier: "num_prompts" - propertyDomain: - values: [500] + values: [1024, 2048, 4096] - identifier: "request_rate" propertyDomain: - values: [-1] - - identifier: "max_concurrency" - propertyDomain: - values: [-1] - - identifier: "gpu_memory_utilization" + domainRange: [1,10] + interval: 1 + - identifier: n_cpus propertyDomain: - values: [.9] - - identifier: "cpu_offload" + domainRange: [2,16] + interval: 2 + - identifier: memory propertyDomain: - values: [0] + values: ["128Gi", "256Gi"] - identifier: "max_batch_tokens" propertyDomain: - values: [16384] + values: [1024, 2048, 4096, 8192, 16384, 32768] - identifier: "max_num_seq" propertyDomain: - values: [256] + values: [16,32,64] - identifier: "n_gpus" propertyDomain: values: [1] @@ -51,4 +39,5 @@ experiments: - actuatorIdentifier: vllm_performance experimentIdentifier: performance-testing-full metadata: - description: Parameters for VLLM performance testing + description: A space of vllm deployment configurations + name: vllm_deployments diff --git a/website/docs/actuators/vllm_performance.md b/website/docs/actuators/vllm_performance.md new file mode 100644 index 00000000..29060ee9 --- /dev/null +++ b/website/docs/actuators/vllm_performance.md @@ -0,0 +1,393 @@ +# The `vllm_performance` actuator + + + +> [!TIP] Overview +> The `vllm_performance` actuator **can +> automatically create and benchmark [vLLM](https://github.com/vllm-project/vllm) inference deployments on Kubernetes and OpenShift clusters**. +> +> It is designed for robust, repeatable, and configurable experiment execution. +> It is suitable for both simple one-off benchmarks and large parameter sweeps. 
## Key Capabilities

- **Automated LLM benchmarking:** Deploys vLLM serving endpoints
on NVIDIA GPU-enabled OpenShift/Kubernetes clusters and runs
standardized serving benchmarks.
- **Cluster integration:** Handles deployment and clean-up of vLLM inference
pods on OpenShift/Kubernetes, with configurable resource selection via namespace,
node selector, and PVC/service templates.
- **Scenario configurability:** Supports customizing models, NVIDIA GPU types,
node selection, retry behavior, concurrent deployments, and more.
- **Efficient sampling:** Supports grouped sampling, which maximises reuse
of vLLM deployments and hence minimises time spent creating them.
- **Endpoint benchmarking:** Can also be used to benchmark existing
OpenAI-compatible endpoints.

### Available experiments

The `vllm_performance` actuator implements two experiments:

- `performance-testing-full`: This experiment can test the full vLLM workload configuration,
including resource requests and server deployment configuration. It deploys
servers with the given configuration on Kubernetes and runs `vllm bench serve` on them
with the given parameters.
- `performance-testing-endpoint`: This experiment is equivalent to running
`vllm bench serve` against an endpoint.

---

## Running single experiments: Quick endpoint and deployment tests

For rapid testing and debugging, you can use the [`run_experiment`](run_experiment.md)
tool to execute individual experiments on a single point (entity).
This is ideal when you want to:

- Quickly check if your actuator installation and configuration works
- Debug a deployment scenario or endpoint using the vllm_performance actuator

### Running an endpoint test

To test the throughput or limits of an existing vLLM-compatible endpoint, create
a `point.yaml` file like this:

```yaml
entity:
  model: openai/gpt-oss-20b
  endpoint: http://localhost:8000
  request_rate: 50
experiments:
- actuatorIdentifier: vllm_performance
  experimentIdentifier: performance-testing-endpoint
```

Then run:

```shell
run_experiment point.yaml
```

This will assess how many requests per second the endpoint can handle for the given
model and configuration.

> [!TIP] Inference endpoint testing example
>
> See [the detailed endpoint scenario](../examples/vllm-performance-endpoint.md)
> for a production-style workflow exploring inference endpoint throughput.

### Running a deployment test

To launch and benchmark a temporary vLLM deployment
(including provisioning on Kubernetes/OpenShift), you must provide both:

- An entity definition (as before)
- The identifier of a valid `actuatorconfiguration` resource
  - This contains information necessary for accessing and creating
    deployments on the Kubernetes/OpenShift cluster
  - See [configuring the vllm_performance actuator](#configuring-the-vllm_performance-actuator)
    for details.

Example `point.yaml`:

```yaml
entity:
  model: ibm-granite/granite-3.3-8b-instruct
  n_cpus: 8
  memory: 128Gi
  gpu_type: NVIDIA-A100-80GB-PCIe
  max_batch_tokens: 8192
  max_num_seq: 32
  n_gpus: 1
experiments:
- actuatorIdentifier: vllm_performance
  experimentIdentifier: performance-testing-full
```

Then run:

```shell
run_experiment point.yaml --actuator-config-id my-vllm-performance-config
```

Here `my-vllm-performance-config` is the ID of an `actuatorconfiguration` resource
containing the details for accessing and running on your target cluster.
See [configuring the vllm_performance actuator](#configuring-the-vllm_performance-actuator)
for more.

This command will provision the deployment for the specified entity, using your indicated
actuator configuration, run the benchmark, and print results.

> [!TIP] vLLM deployment example
>
> See [the vLLM deployment exploration example](../examples/vllm-performance-full.md)
> for details on how to explore many deployment configurations.

---

## Configuring the vllm_performance actuator

You can configure how the `vllm_performance` actuator creates,
manages, and monitors vLLM deployments on a Kubernetes/OpenShift
cluster.
This configuration covers several needs:

- **Cluster targeting and permissions**: Specify the OpenShift/Kubernetes namespace
and optionally node selectors, secrets, and templates to match your cluster resources.
- **Secure access**: Pass required HuggingFace tokens, set up image pull secrets,
control in-cluster or remote execution, and toggle SSL verification.
- **Experiment protocol and retries**: Choose how benchmarks are run, including the interpreter,
retry logic, and the YAML templates used for deployments/services.
- **Deployment resource management**: Limit the number of concurrent deployments
and control automated clean-up.

You supply this configuration information as an `ado`
[`actuatorconfiguration` resource](../resources/actuatorconfig.md),
which is a YAML file with the configuration options.
+An example is: + + +```yaml +actuatorIdentifier: vllm_performance #The actuator the configuration is for +metadata: + description: "Actuator config for vLLM LLM benchmarking" + name: demo-vllm-perf +parameters: + benchmark_retries: 3 # Number of benchmark attempts (see Failure Handling) + hf_token: "" # Required for pulling some models + image_secret: "" # Optional image pull secret + in_cluster: false # Set to true if running from within the cluster + interpreter: python3 # Language for test drivers/benchmarks + max_environments: 1 # Max concurrent vLLM deployments + namespace: "mynamespace" # OpenShift/K8s namespace to deploy into + node_selector: # A dictionary of Kubernetes node_selector key:value pairs + "kubernetes.io/hostname":"gpunode01" + pvc_name: null # Name of existing PVC to use. If null/omitted a temporary PVC is created + retries_timeout: 5 # Seconds between retries (exponential backoff) + verify_ssl: false # Whether to verify HTTPS endpoints +``` + + +If the above YAML was saved to a file called `vllm_config.yaml` you would create +the configuration using + +```commandline +ado create actuatorconfiguration -f vllm_config.yaml +``` + +> [!WARNING] namespace +> +> The critical parameter you must set in the configuration is `namespace` + + +> [!WARNING] GPU type +> +> The GPU type to use in an experiment is set via the experiment itself (performance-testing-full). +> **Do not** set this via the `node_selector` parameter of the configuration. 
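As the GPU type warning above notes, the GPU type belongs in the experiment input rather than in the actuator configuration. For illustration, a sketch of an entity that selects the GPU type for `performance-testing-full` (the values mirror the deployment-test example earlier on this page):

```yaml
entity:
  model: ibm-granite/granite-3.3-8b-instruct
  gpu_type: NVIDIA-A100-80GB-PCIe  # GPU selection happens here, not via node_selector
  n_gpus: 1
  n_cpus: 8
  memory: 128Gi
experiments:
- actuatorIdentifier: vllm_performance
  experimentIdentifier: performance-testing-full
```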
+ + +> [!TIP] Further details +> +> For further details on specific options and advanced behavior see: +> +> - [Maximum number of deployments](#maximum-number-of-deployments) (details on `max_environments`) +> - [Handling benchmark failures](#handling-benchmark-failures) and [Deployment Clean-Up](#deployment-clean-up) +> - [Grouped sampling for efficient deployment usage](#grouped-sampling-for-efficient-deployment-usage) + +### Multiple configurations + +You can create multiple `actuatorconfiguration`s for the `vllm_performance` actuator. +Each configuration captures +the cluster-specific, security-sensitive, and experiment-relevant settings necessary +for the actuator to operate in a given environment. +Each configuration will have a different id and you can choose the one to use +when submitting an operation or single experiment that uses the `vllm_performance` +actuator. + +> [!TIP] Getting a default configuration +> +> You can generate a default configuration via the ado CLI: +> +> ```shell +> ado template actuatorconfiguration --actuator-identifier vllm_performance -o actuatorconfiguration.yaml +> ``` + +--- + +## vLLM deployment management + +### The `in_cluster` configuration option + +The `in_cluster` option in your `actuatorconfiguration` tells the `vllm_performance` +actuator how to communicate with the target Kubernetes or OpenShift cluster when +running `performance-testing-full`. + +If running `ado` from outside the Kubernetes/OpenShift cluster where +the deployments will be created, leave `in_cluster: false` (the default). + +Set `in_cluster: true` if your `ado` operation will be run on a +**remote Ray cluster that is in the same Kubernetes/OpenShift cluster** as your +vLLM deployments. +This configuration maximizes efficiency for large-scale, distributed benchmarking. +For a detailed guide on running `ado` remotely on a Ray cluster, including environment +and package setup, see [Running ado remotely](../getting-started/remote_run.md). 
> [!IMPORTANT] RayCluster permissions
>
> If running with `in_cluster: true`, your RayCluster **must** be configured so that
> jobs launched by `ado` have permissions to create and manage Kubernetes deployments,
> pods, and services.
> For configuring the necessary ServiceAccount, roles, and permissions,
> see our [documentation on deploying RayClusters for `ado`](../getting-started/installing-backend-services.md).

> [!TIP] Installing the `vllm_performance` actuator on a remote RayCluster
>
> If the `ado-vllm-performance` actuator is not installed in the
> image used by the RayCluster you can have [Ray install it by following
> this guide](../getting-started/remote_run.md).
>
> In particular, if a compatible version of vLLM is not installed
> in the image this step will require installing vLLM on each RayCluster node
> (so `vllm bench serve` is available).
> This can take some time, so you may see the `ado` `operation` output "hang"
> while this is happening.

### Maximum number of deployments

The actuator configuration parameter `max_environments` controls how
many concurrent vLLM deployments will be created. The default is 1.

When experiments are requested, if an existing deployment cannot
be used a new environment is created, as long as `max_environments` has
not been reached.
If it has been reached, the actuator waits for an existing
environment to become idle, at which point it is deleted and
the new environment is created.

Some notes:

- `max_environments` deployments are always created before any are deleted
  - This means idle environments will remain until there is a need to delete them
  - This increases the chances they can be reused and minimises the cost of redeploying
- Environment creation is serialized
  - If `max_environments` is reached and all are active, the first experiment
    that requires a new environment will block.
Subsequent experiment requests will queue behind it in FIFO order until it can
proceed (i.e. delete an existing environment and create the one it needs).

### Handling benchmark failures

Once a deployment is created, the actuator waits until the vLLM health endpoint
responds to requests (pod running, container ready), or until 20 minutes have
elapsed, and then runs `vllm bench serve` against it.
The 20-minute timeout ensures the wait cannot hang forever if something goes
wrong in Kubernetes and the health check never passes.

When running the benchmark the actuator will try up to `benchmark_retries` times,
backing off exponentially based on `retries_timeout`, to run the benchmark successfully.
Retries may be required because, for large models, 20 minutes is sometimes not
sufficient to download the model and load it for serving.
Since `vllm bench` itself waits 10 minutes for the endpoint to come up, with
`benchmark_retries=3` (the default) there is roughly a 50-minute to 1-hour window
for the endpoint to become available.

### PVCs

#### `pvc_name` not given

If no `pvc_name` is set in the `actuatorconfiguration`, when an actuator
instance is created with this configuration, e.g., via `create operation` or `run_experiment`,
it creates a PVC called `vllm-support-$UUID` that is shared by all deployments
it creates.
The `$UUID` is a randomly generated string that will vary each time the
actuator is created.
When the `operation` or `run_experiment` exits this PVC will be deleted.

#### `pvc_name` given

If a `pvc_name` is set in the `actuatorconfiguration`, when an actuator
instance is created with this configuration, e.g., via `create operation`
or `run_experiment`,
it will look for an existing PVC with the given name.
If the PVC exists it will be used for all deployments the actuator instance
creates.
When the `operation` or `run_experiment` exits this PVC will NOT be deleted.
If the PVC does not exist the actuator will exit with an error.
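The retry behavior described under "Handling benchmark failures" above can be sketched as follows. This is a simplified illustration of the documented `benchmark_retries`/`retries_timeout` semantics, not the actuator's actual implementation; `run_benchmark` is a hypothetical callable standing in for launching `vllm bench serve`:

```python
import time


def run_with_retries(run_benchmark, benchmark_retries=3, retries_timeout=5):
    """Retry a benchmark callable with exponential backoff.

    Sketch only: waits retries_timeout * 2**attempt seconds between
    attempts and re-raises the last failure once benchmark_retries
    attempts are exhausted.
    """
    for attempt in range(benchmark_retries):
        try:
            return run_benchmark()
        except Exception:
            if attempt == benchmark_retries - 1:
                raise  # out of retries: surface the failure
            time.sleep(retries_timeout * 2**attempt)
```

With the defaults this sleeps 5 then 10 seconds between the three attempts; each attempt additionally includes `vllm bench`'s own 10-minute wait for the endpoint described above.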
### Deployment Clean-Up

The `vllm_performance` actuator will automatically clean up
all Kubernetes resources associated with the vLLM deployments as it proceeds,
leaving at most `max_environments` active at a time.
On a graceful shutdown of the `ado` process running the operation
(CTRL-C, SIGTERM, SIGINT) active deployments will be deleted
before exit.
On an uncontrolled shutdown (SIGKILL) you will need to manually
clean up any K8s deployments that were running at the time.

> [!IMPORTANT] PVC Deletion
>
> If the actuator created a PVC (i.e. `vllm-support-$UUID`) it will be deleted.
>
> If the actuator used an existing PVC it will not be deleted.

### Kubernetes resource templates

The `vllm_performance` actuator creates Kubernetes resources
based on a set of template YAML files
that are distributed with the actuator.
The templates are for:

- vLLM deployment
- PVC used by deployment pod
- vLLM service

You can use your own templates by creating a vllm_performance
`actuatorconfiguration` resource with the following
fields set to the paths to your templates:

```yaml
deployment_template: $PATH_RELATIVE_TO_WORKING_DIR
service_template: $PATH_RELATIVE_TO_WORKING_DIR
pvc_template: $PATH_RELATIVE_TO_WORKING_DIR
```

Then use this `actuatorconfiguration` resource
when running operations with the actuator.

The paths given are always interpreted relative to the
working directory of the process using the actuator
(where `ado create operation` or `run_experiment` is executed).

> [!IMPORTANT] Custom templates and executing on remote RayClusters
>
> The template path must be accessible where the actuator is running.
> This is important to consider when running operations using
> `vllm_performance` on a remote RayCluster.
> To handle this we recommend:
>
> - Put custom templates in the working directory (or a subdirectory of it)
>   that you will
>   [send to the RayCluster](../getting-started/remote_run.md#other-options)
> - Create an `actuatorconfiguration` with the relative paths to the
>   templates from this working directory
>

### Grouped sampling for efficient deployment usage

Creating and deleting vLLM deployments takes time.
If you have a limited number of vLLM deployments that can be
created concurrently, say one, this can add significant
overhead if consecutive points being sampled require
different deployments.
The [grouped sampling](../operators/random-walk.md#enabling-grouping)
feature of the `random_walk` operator can be useful in this case.
This allows configuring the sampling so that points which
require a given vLLM deployment are submitted in a batch.
diff --git a/website/docs/examples/vllm-performance-endpoint.md b/website/docs/examples/vllm-performance-endpoint.md
index a6bc0452..8eb7cd22 100644
--- a/website/docs/examples/vllm-performance-endpoint.md
+++ b/website/docs/examples/vllm-performance-endpoint.md
@@ -2,7 +2,9 @@
> [!NOTE] The scenario
>
-> **In this example, the _vllm_performance_ actuator is used to find
+> **In this example,
+> the [_vllm_performance_ actuator](../actuators/vllm_performance.md)
+> is used to find
> the maximum requests per second a server can handle while maintaining
> stable maximum throughput.**
>
@@ -16,12 +18,13 @@
> To explore this space, you will:
>
> - define an endpoint, model and range of requests per second to test
-> - use an optimizer to efficiently find the maximum requests per second
+> - use [an optimizer](../operators/optimisation-with-ray-tune.md)
+> to efficiently find the maximum requests per second

> [!IMPORTANT] Prerequisites
>
-> - An endpoint serving an LLM in an OpenAI API-compatible format
+> - An endpoint serving an LLM via an OpenAI-compatible API
> - Install the following Python packages:
> > ```bash @@ -37,11 +40,15 @@ > from [our repository](https://github.com/IBM/ado/tree/main/plugins/actuators/vllm_performance/yamls). > > - `vllm_request_rate_space.yaml`: this file defines the _endpoint_, _model_, -> and _request_ _range_ to explore. -> - **You must edit the _model_ and _endpoint_ fields in this file -> to match your own.** +> and _request_ _range_ to explore. +> +> +> - **You must edit the _model_ and _endpoint_ fields in this file +> to match your own.** +> +> > - `operation_hyperopt.yaml`: this file contains the optimization parameters. -> You do not need to edit it. +> You do not need to edit it. > > Then, in a directory with these files, execute: > @@ -213,12 +220,16 @@ and the best region is unlikely to be visited. ## Next steps + - Use `ado describe experiment vllm_performance_endpoint` to see what other parameters can be explored - Try varying **`burstiness`** or **`number_input_tokens`**, or adding them as dimensions of the `entityspace`, to explore their impact on throughput - Try varying `num_samples`, `gamma` and `n_initial_points` parameters of hyperopt - - You can keep running the optimization on the same `discoveryspace`. - The previous runs will not influence new runs, but their results will - be reused, speeding experimentation up + - You can keep running the optimization on the same `discoveryspace`. 
+ The previous runs will not influence new runs, but their results will + be reused, speeding experimentation up - Measure the [performance of vLLM deployment configurations](vllm-performance-full.md) +- Check the [`vllm_performance` actuator documentation](../actuators/vllm_performance.md) + + diff --git a/website/docs/examples/vllm-performance-full.md b/website/docs/examples/vllm-performance-full.md index 161fe516..c6b83230 100644 --- a/website/docs/examples/vllm-performance-full.md +++ b/website/docs/examples/vllm-performance-full.md @@ -1,67 +1,67 @@ # Exploring vLLM deployment configurations -> [!NOTE] +> [!NOTE] The scenario +> +> **In this example, +> the [_vllm_performance_ actuator](../actuators/vllm_performance.md) +> is used to evaluate +> different vLLM server deployment configurations on Kubernetes/OpenShift.** +> +> When deploying vLLM, you must choose values for parameters like GPU type, batch +> size, and memory limits. These choices directly affect performance, cost, and +> scalability. To find the best configuration for your workload, whether you are +> optimizing for latency, throughput, or cost, you need to explore the deployment +> parameter space. 
In this example:
>
-> This example illustrates using the vllm-performance actuator to discover
-> how best to deploy vLLM for a given use-case
+> - We will define a space of vLLM deployment configurations to test with
+> the `vllm_performance` actuator's `performance-testing-full` experiment
+> - This experiment can create and characterize a vLLM deployment on Kubernetes
+> - Use the [`random_walk` operator](../operators/random-walk.md) to
+> explore the space

-> [!IMPORTANT]
+> [!IMPORTANT] Prerequisites
>
-> **Prerequisites**
+> - Be logged in to your Kubernetes/OpenShift cluster
+> - Have access to a namespace where you can create vLLM deployments
+> - Install the following Python packages locally:
>
-> - Access to a k8s namespace where you can deploy vLLM
-
-## The scenario
-
-When deploying vLLM, you must choose values for parameters like GPU type, batch
-size, and memory limits. These choices directly affect performance, cost, and
-scalability. To find the best configuration for your workload, whether you are
-optimizing for latency, throughput, or cost—you need to explore the deployment
-parameter space.
-
-In this example:
-
-- We will define a space of vLLM deployment configurations to test with
-the `vllm_performance` actuator's `performance_testing_full` experiment
-  - This experiment can create and characterize a vLLM deployment on Kubernetes
-- Use the `random_walk` operator to explore the space
-
-## Install the actuator
-
-[//]: # (If you haven't already:)
-
-[//]: # ()
-[//]: # (```commandline)
-
-[//]: # (pip install ado-vllm-performance)
-
-[//]: # (```)
-
-[//]: # ()
-[//]: # (If you have cloned the `ado` source repository you can also do:)
-
-[//]: # ()
-[//]: # (```commandline)
-
-[//]: # (# From the root of this repository )
-
-[//]: # (pip install -e plugins/actuators/vllm_performance)
-
-[//]: # (```)
-
-Execute:
-
-```commandline
-pip install -e plugins/actuators/vllm_performance
-```
-
-in the root of the `ado` source repository.
-You can clone the repository with
+> ```bash
+> pip install ado-vllm-performance
+> ```
+

-```commandline
-git clone https://github.com/IBM/ado.git
-```

+> [!TIP] TL;DR
+>
+> Get the files `vllm_deployment_space.yaml`, `vllm_actuator_configuration.yaml`
+> and `random_walk_operation_grouped.yaml` from
+> [our repository](https://github.com/IBM/ado/tree/main/plugins/actuators/vllm_performance/yamls).
+>
+> **You must edit `vllm_actuator_configuration.yaml` with your details.**
+> In particular the following two fields are important:
+>
+> ```yaml
+> hf_token: # Required to access gated models
+> namespace: vllm-testing # you MUST set this to a namespace where you can create vLLM deployments
+> ```
+>
+> Then, in a directory with these files, execute:
+>
+> ```bash
+> : # Define the configurations to explore
+> ado create space -f vllm_deployment_space.yaml
+> : # Create a configuration for the actuator - normally just once as it can be reused
+> ado create actuatorconfiguration -f vllm_actuator_configuration.yaml
+> : # Explore!
+> ado create operation -f random_walk_operation_grouped.yaml --use-latest space --use-latest actuatorconfiguration
+> ```
+>
+> See [configuring the `vllm_performance` actuator](../actuators/vllm_performance.md#configuring-the-vllm_performance-actuator)
+> for more configuration options.
+
+## Verify the installation

Verify the installation with:

```
ado get actuators --details
```

-The actuator `vllm_performance` will appear in the list of available actuators.
+The actuator `vllm_performance` should appear in the list of available actuators
+if installation completed successfully.

## Create an actuator configuration

-The vllm-performance actuator needs some information the target cluster to
+The vllm-performance actuator needs some information about the target cluster to
deploy on. This is provided via an `actuatorconfiguration`.
-First execute,
+First execute:

 ```commandline
-# Generate the template file
-ado template actuatorconfiguration --actuator-identifier vllm_performance -o actuatorconfiguration.yaml
+ado template actuatorconfiguration --actuator-identifier vllm_performance -o vllm_actuator_configuration.yaml
 ```

-This will create a file called `vllm_performance_actuatorconfiguration.yaml`
-
-Edit the file and set correct values for the following fields:
+This will create a file called `vllm_actuator_configuration.yaml`.
+Edit the file and set correct values for at least the `namespace` field.
+Also consider if you need to supply a value for `hf_token`:

 ```yaml
-hf_token:
-namespace: vllm-testing # OpenShift namespace you have write access to
-node_selector: '{"kubernetes.io/hostname":""}' # JSON string selecting a node that owns GPU
+hf_token: # Required to access gated models
+namespace: vllm-testing # you MUST set this to a namespace where you can create vLLM deployments
 ```

 Then save this configuration as an `actuatorconfiguration` resource:

 ```bash
-ado create actuatorconfiguration -f vllm_performance_actuatorconfiguration.yaml
+ado create actuatorconfiguration -f vllm_actuator_configuration.yaml
 ```

 > [!TIP]
 >
 > You can create multiple actuator configurations corresponding
-> to different clusters/target environments.
-> You choose the one to use when you launch an operation requiring the actuator
+> to different target environments.
+> You choose the one to use when you launch an operation requiring the actuator.

 ## Define the configurations to test

@@ -120,8 +119,6 @@ deployment parameters, including `max_num_seq` and
 `max_batch_tokens`, for a scenario where requests arrive between 1 and 10 per
 second with sizes around 2000 tokens.

-Save the following as `vllm_discoveryspace.yaml`:
-
 ```yaml
 entitySpace:
   - identifier: model
@@ -166,11 +163,11 @@ metadata:
   name: vllm_deployments
 ```

-Save the above as `vllm_discoveryspace.yaml`.
+Save the above as `vllm_deployment_space.yaml`.

 Then run:

 ```bash
-ado create space -f vllm_discoveryspace.yaml
+ado create space -f vllm_deployment_space.yaml
 ```

 ## Explore the space with random_walk

@@ -180,7 +177,7 @@ efficiency. The `grouped` sampler ensures we explore all
 the different benchmark configurations for a given vLLM deployment before
 creating a new deployment - minimizing the number of deployment creations.

-Save the following as `random_walk.yaml`:
+Save the following as `random_walk_operation_grouped.yaml`:

 ```yaml
 metadata:
@@ -212,11 +209,12 @@ operation:

 Then, start the operation with:

 ```commandline
-ado create operation -f random_walk.yaml \
+ado create operation -f random_walk_operation_grouped.yaml \
   --use-latest space --use-latest actuatorconfiguration
 ```

-Results will appear as they are measured.
+As the operation runs, a table of the results
+is updated live in the terminal.

 ### Monitor the optimization

@@ -227,10 +225,7 @@ While the operation is running you can monitor the deployment:

 oc get deployments --watch -n vllm-testing
 ```

-As it runs a table of the results is updated
-live in the terminal as they come in.
-
-You can also get the table be executing (in another terminal)
+You can also get the results table by executing (in another terminal)

 ```commandline
 ado show entities operation --use-latest
@@ -253,6 +248,7 @@ ado show entities space --output-format csv --use-latest

 ## Next steps

+
 - Try varying **`max_batch_tokens`** or **`gpu_memory_utilization`** to
 explore the impact on throughput.
 - Try creating a different `actuatorconfiguration` with more
@@ -261,3 +257,7 @@ explore the impact on throughput.
 - Use **RayTune** (see the
 [vLLM endpoint performance](vllm-performance-endpoint.md) example) to
 optimise the hyper‑parameters of the benchmark.
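As a sketch of how a new dimension such as `gpu_memory_utilization` could be swept, an extra entry could be added to the entity space along these lines. The values and the exact schema keys here are assumptions; mirror the structure of the `entitySpace` entries in your own `vllm_deployment_space.yaml`.

```yaml
# Hypothetical additional entity-space dimension - illustrative only
entitySpace:
  - identifier: gpu_memory_utilization
    # Values to sweep; adjust to your GPUs and model sizes
    values: [0.80, 0.90, 0.95]
```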
+- Run [the exploration on the OpenShift/Kubernetes cluster](../actuators/vllm_performance.md#the-in_cluster-configuration-option)
+you create the deployments on, so you don't have to keep your laptop open.
+- Check the [`vllm_performance` actuator documentation](../actuators/vllm_performance.md)
+
\ No newline at end of file
diff --git a/website/docs/getting-started/service-account.yaml b/website/docs/getting-started/service-account.yaml
new file mode 120000
index 00000000..3952243f
--- /dev/null
+++ b/website/docs/getting-started/service-account.yaml
@@ -0,0 +1 @@
+../../../backend/kuberay/service-account.yaml
\ No newline at end of file
diff --git a/website/docs/resources/actuatorconfig.md b/website/docs/resources/actuatorconfig.md
index ebce72bf..8da0aba7 100644
--- a/website/docs/resources/actuatorconfig.md
+++ b/website/docs/resources/actuatorconfig.md
@@ -88,11 +88,13 @@ the `operation` resource documentation for details.

 ### Other ado commands that work with actuatorconfiguration

+
 - `ado get actuatorconfigurations`
-  - list stored `actuatorconfiguration`s or retrieve their representations
+    - list stored `actuatorconfiguration`s or retrieve their representations
 - `ado show related actuatorconfiguration ID`
-  - show operations using an `actuatorconfiguration`
+    - show operations using an `actuatorconfiguration`
 - `ado edit actuatorconfiguration ID`
-  - set the name, description, and labels for an `actuatorconfiguration`
+    - set the name, description, and labels for an `actuatorconfiguration`
 - `ado delete actuatorconfiguration ID`
-  - delete an `actuatorconfiguration`
+    - delete an `actuatorconfiguration`
+
diff --git a/website/docs/resources/resources.md b/website/docs/resources/resources.md
index ce9e9085..950eddc5 100644
--- a/website/docs/resources/resources.md
+++ b/website/docs/resources/resources.md
@@ -59,26 +59,28 @@ metastore.

 Here is a list of common `ado` CLI commands for interacting with resources.
 See the [ado CLI guide](../getting-started/ado.md) for more details

+
 - `ado get [resource type]`
-  - Lists all resources of the requested type
+    - Lists all resources of the requested type
 - `ado get [resource type] [$identifier] -o yaml`
-  - Outputs the YAML of resource `$identifier`
+    - Outputs the YAML of resource `$identifier`
 - `ado create [resource type] -f [YAMLFILE]`
-  - Creates the resource of the specified type from the definition in "YAMLFILE"
+    - Creates the resource of the specified type from the definition in "YAMLFILE"
 - `ado delete [resource type] [$identifier]`
-  - Deletes the resource of the specified type with the provided identifier from
+    - Deletes the resource of the specified type with the provided identifier from
     the database. See the [deleting resources](#deleting-resources) section for
     more information and considerations to keep in mind.
 - `ado describe [resource type] [$identifier]`
-  - Outputs a human-readable description of resource `$identifier`
+    - Outputs a human-readable description of resource `$identifier`
 - `ado show related [resource type] [$identifier]`
-  - List ids of resources related to resource `$identifier`
+    - List ids of resources related to resource `$identifier`
 - `ado show details [resource type] [$identifier]`
-  - Outputs some details on the resource. Usually these are quantities that have
+    - Outputs some details on the resource. Usually these are quantities that have
     to be computed.
 - `ado template [resource type] --include-schema`
-  - Outputs a default YAML for the given resource along with a schema file
+    - Outputs a default YAML for the given resource along with a schema file
     explaining the fields.`
+

 ### Deleting resources

diff --git a/website/mkdocs.yml b/website/mkdocs.yml
index f577134e..54ec6e1a 100644
--- a/website/mkdocs.yml
+++ b/website/mkdocs.yml
@@ -191,6 +191,7 @@ nav:
       - Adding custom experiments: actuators/creating-custom-experiments.md
       - Running experiments on single entities: actuators/run_experiment.md
       - Using externally obtained data: actuators/replay.md
+      - vllm_performance - measure inference performance: actuators/vllm_performance.md
       #- ST4SD: actuators/st4sd.md
       - SFTTrainer - measure fine-tuning performance : actuators/sft-trainer.md
       #- Molformer: actuators/molformer.md