# Deploy a vision LLM

<div align="left">
<a target="_blank" href="https://console.anyscale.com/template-preview/deployment-serve-llm?file=%252Ffiles%252Fvision-llm"><img src="https://img.shields.io/badge/🚀 Run_on-Anyscale-9hf"></a>&nbsp;
<a href="https://github.com/ray-project/ray/tree/master/doc/source/serve/tutorials/deployment-serve-llm/vision-llm" role="button"><img src="https://img.shields.io/static/v1?label=&amp;message=View%20On%20GitHub&amp;color=586069&amp;logo=github&amp;labelColor=2f363d"></a>&nbsp;
</div>

A vision LLM can interpret images as well as text, enabling tasks like answering questions about charts, analyzing photos, or combining visuals with instructions. It extends LLMs beyond language to support multimodal reasoning and richer applications.  

This tutorial deploys a vision LLM using Ray Serve LLM.  

---

## Configure Ray Serve LLM

Make sure to set your Hugging Face token in the config file to access gated models.

Ray Serve LLM provides multiple [Python APIs](https://docs.ray.io/en/latest/serve/api/index.html#llm-api) for defining your application. Use [`build_openai_app`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.build_openai_app.html#ray.serve.llm.build_openai_app) to build a full application from your [`LLMConfig`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) object.

```python
# serve_qwen_VL.py
from ray.serve.llm import LLMConfig, build_openai_app
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-qwen-VL",
        model_source="qwen/Qwen2.5-VL-7B-Instruct",
    ),
    experimental_configs=dict(num_ingress_replicas=1),
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=1,
            max_replicas=1,
        )
    ),
    # Uncomment if your model is gated and needs your Hugging Face token to access it.
    # runtime_env=dict(env_vars={"HF_TOKEN": os.environ.get("HF_TOKEN")}),
    engine_kwargs=dict(max_model_len=8192),
)

app = build_openai_app({"llm_configs": [llm_config]})
```



**Note:** Before moving to a production setup, migrate to a [Serve config file](https://docs.ray.io/en/latest/serve/production-guide/config.html) to make your deployment version-controlled, reproducible, and easier to maintain for CI/CD pipelines. See [Serving LLMs - Quickstart Examples: Production Guide](https://docs.ray.io/en/latest/serve/llm/quick-start.html#production-deployment) for an example.

---

## Deploy locally

**Prerequisites**

* Access to GPU compute.
* (Optional) A **Hugging Face token** if using gated models. Set it in your shell with `export HF_TOKEN=<YOUR-TOKEN-HERE>`.

**Note:** Depending on the organization, you can usually request access on the model's Hugging Face page. For example, approval for Meta's Llama models can take anywhere from a few hours to several weeks.
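If the model is gated, the exported token also has to reach the Serve workers through `runtime_env`. As a minimal sketch, the hypothetical helper below (`hf_runtime_env` is an illustrative name, not part of Ray) builds the `runtime_env` dictionary only when `HF_TOKEN` is actually set, so an unset variable never ends up as a `None` value in `env_vars`:

```python
import os
from typing import Mapping, Optional


def hf_runtime_env(environ: Optional[Mapping[str, str]] = None) -> Optional[dict]:
    """Return a runtime_env dict carrying HF_TOKEN, or None if it is unset.

    Hypothetical helper for illustration; not part of the Ray API.
    """
    environ = os.environ if environ is None else environ
    token = environ.get("HF_TOKEN", "").strip()
    if not token:
        # No token exported: return None instead of {"HF_TOKEN": None}.
        return None
    return {"env_vars": {"HF_TOKEN": token}}


if __name__ == "__main__":
    # Report only whether a token is present; never print the token itself.
    print("HF_TOKEN set:", hf_runtime_env() is not None)
```

The result is meant to be passed as `LLMConfig(..., runtime_env=hf_runtime_env())` in place of the commented-out line in `serve_qwen_VL.py`.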

**Dependencies:**  
```bash
pip install "ray[serve,llm]"
```

---

### Launch

Follow the instructions at [Configure Ray Serve LLM](#configure-ray-serve-llm) to define your app in a Python module `serve_qwen_VL.py`.  

In a terminal, run:

```bash
serve run serve_qwen_VL:app --non-blocking
```


Deployment typically takes a few minutes as the cluster is provisioned, the vLLM server starts, and the model is downloaded. 
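Because startup can take several minutes, it can help to wait for the endpoint before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` route until the model ID appears. The function names, the timeout, and the polling interval are illustrative assumptions, and a listed model is only a rough readiness signal: the first real request may still be slow.

```python
import json
import time
import urllib.request
from typing import Callable, List


def fetch_model_ids(base_url: str) -> List[str]:
    """Return the model IDs listed by the /v1/models route."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def wait_until_ready(
    model_id: str,
    base_url: str = "http://localhost:8000",
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    fetch: Callable[[str], List[str]] = fetch_model_ids,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until model_id is listed; return False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if model_id in fetch(base_url):
                return True
        except (OSError, ValueError):
            # Server not reachable yet, or it returned a non-JSON body.
            pass
        sleep(interval_s)
    return False
```

For this tutorial's deployment, the call would be `wait_until_ready("my-qwen-VL")`. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running server.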

---

### Sending requests with images

Your endpoint is available locally at `http://localhost:8000`. The local deployment doesn't check credentials, so you can pass a placeholder authentication token to the OpenAI client, for example `"FAKE_KEY"`.

Example curl with image URL:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer FAKE_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-qwen-VL",
        "messages": [
          {
            "role": "user",
            "content": [
              {"type": "text", "text": "What do you see in this image?"},
              {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}}
            ]
          }
        ]
      }'
```

Example Python with image URL:

```python
# client_url_image.py
from urllib.parse import urljoin
from openai import OpenAI

API_KEY = "FAKE_KEY"
BASE_URL = "http://localhost:8000"

client = OpenAI(base_url=urljoin(BASE_URL, "v1"), api_key=API_KEY)

response = client.chat.completions.create(
    model="my-qwen-VL",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}}
            ]
        }
    ],
    temperature=0.5,
    stream=True
)

# Print the streamed tokens as they arrive.
for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```

Example response:

This image shows two tabby cats lying on a pink couch, appearing to be asleep or resting. Between them are two remote controls placed on the couch. The cats are curled up in comfortable positions, and their relaxed posture suggests they are at ease in their environment.

Example Python with local image:

```python
# client_local_image.py
from urllib.parse import urljoin
import base64
from openai import OpenAI

API_KEY = "FAKE_KEY"
BASE_URL = "http://localhost:8000"

client = OpenAI(base_url=urljoin(BASE_URL, "v1"), api_key=API_KEY)

# Load an image saved locally as `example.jpg` and encode it as base64.
with open("example.jpg", "rb") as f:
    img_base64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="my-qwen-VL",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"}}
            ]
        }
    ],
    temperature=0.5,
    stream=True
)

# Print the streamed tokens as they arrive.
for chunk in response:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
```

Example response:

This image shows a small dog, possibly a Jack Russell Terrier, standing on its hind legs and giving a high-five to a person's hand. The dog appears happy and engaged, with its mouth open as if smiling or panting. The background features an outdoor setting with greenery, indicating that the scene takes place in a park or a grassy area on a sunny day.
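The local-image client hardcodes `image/jpeg` in the data URL. For other formats, one option is to derive the MIME type from the file extension. The helper below is a minimal sketch using only the standard library; `image_to_data_url` and its `default_mime` fallback are illustrative names, not part of Ray Serve or the OpenAI client.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str, default_mime: str = "image/jpeg") -> str:
    """Encode a local image file as a base64 data URL.

    The MIME type is guessed from the file extension; if the guess is
    missing or not an image type, default_mime is used instead.
    """
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        mime = default_mime
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be used as the `url` value of an `image_url` content part, for example `{"type": "image_url", "image_url": {"url": image_to_data_url("example.png")}}`.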


---

### Shutdown 

Shut down your LLM service:

```bash
serve shutdown -y
```


---

## Deploy to production with Anyscale services

For production, use Anyscale services to deploy your Ray Serve app on a dedicated cluster without code changes. Anyscale provides scalability, fault tolerance, and load balancing, which keeps the service resilient to node failures, high traffic, and rolling updates. See [Deploy a small-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/small-size-llm/README.html#deploy-to-production-with-anyscale-services) for an example with a small-sized model like the *Qwen2.5-VL-7B-Instruct* used in this tutorial.

---

## Limiting images per prompt

Ray Serve LLM uses [vLLM](https://docs.vllm.ai/en/stable/) as its backend engine. You can configure vLLM by passing parameters through the `engine_kwargs` section of your Serve LLM configuration. For a full list of supported options, see the [vLLM documentation](https://docs.vllm.ai/en/stable/configuration/engine_args.html#multimodalconfig).  

In particular, you can limit the number of images per request by setting `limit_mm_per_prompt` in your configuration.  
```yaml
applications:
- ...
  args:
    llm_configs:
        ...
        engine_kwargs:
          ...
          limit_mm_per_prompt: {"image": 3}
```
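When the app is defined in Python, as in `serve_qwen_VL.py`, the same setting goes into the `engine_kwargs` dictionary. The sketch below shows that dictionary, plus an optional client-side check that counts `image_url` parts before a request is sent. `MAX_IMAGES`, `count_images`, and `check_image_limit` are hypothetical names, and the check assumes the limit applies to all images across the messages of one request.

```python
# Python counterpart of the YAML fragment: pass this dict as
# LLMConfig(..., engine_kwargs=engine_kwargs) in serve_qwen_VL.py.
MAX_IMAGES = 3
engine_kwargs = dict(
    max_model_len=8192,
    limit_mm_per_prompt={"image": MAX_IMAGES},
)


def count_images(messages: list) -> int:
    """Count image_url parts across all messages of a chat request."""
    total = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):  # plain-string content carries no images
            total += sum(1 for part in content if part.get("type") == "image_url")
    return total


def check_image_limit(messages: list, limit: int = MAX_IMAGES) -> None:
    """Raise ValueError if the request carries more images than the limit."""
    found = count_images(messages)
    if found > limit:
        raise ValueError(f"request has {found} images, limit is {limit}")
```

Calling `check_image_limit(messages)` before `client.chat.completions.create(...)` surfaces an oversized request locally instead of as a server-side error.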

---

## Summary

In this tutorial, you deployed a vision LLM with Ray Serve LLM, from development to production. You learned how to configure Ray Serve LLM, deploy your service on your Ray cluster, and send requests with images.