# Deploy a hybrid reasoning LLM

<div align="left">
<a target="_blank" href="https://console.anyscale.com/template-preview/deployment-serve-llm?file=%252Ffiles%252Fhybrid-reasoning-llm"><img src="https://img.shields.io/badge/🚀 Run_on-Anyscale-9hf"></a>&nbsp;
<a href="https://github.com/ray-project/ray/tree/master/doc/source/serve/tutorials/deployment-serve-llm/hybrid-reasoning-llm" role="button"><img src="https://img.shields.io/static/v1?label=&amp;message=View%20On%20GitHub&amp;color=586069&amp;logo=github&amp;labelColor=2f363d"></a>&nbsp;
</div>

A hybrid reasoning model provides flexibility by allowing you to enable or disable reasoning as needed. You can use structured, step-by-step thinking for complex queries while skipping it for simpler ones, balancing accuracy with efficiency depending on the task.

This tutorial deploys a hybrid reasoning LLM using Ray Serve LLM.  

---

## Distinction with purely reasoning models

*Hybrid reasoning models* are reasoning-capable models that allow you to toggle the thinking process on and off. You can enable structured, step-by-step reasoning when needed but skip it for simpler queries to reduce latency. Purely reasoning models always apply their reasoning behavior, while hybrid models give you fine-grained control over when to use reasoning.
<!-- vale Google.Acronyms = NO -->
| **Mode**         | **Core behavior**                            | **Use case examples**                                               | **Limitation**                                    |
| ---------------- | -------------------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------- |
| **Thinking ON**  | Explicit multi-step thinking process | Math, coding, logic puzzles, multi-hop QA, CoT prompting | Slower response time, more tokens used.      |
| **Thinking OFF** | Direct answer generation                   | Casual queries, short instructions, single-step answers              | May struggle with complex reasoning or interpretability. |
<!-- vale Google.Acronyms = YES -->
**Note:** Reasoning often benefits from long context windows (32K to over 1M tokens), high token throughput, low-temperature decoding (such as greedy sampling), and strong instruction tuning or scratchpad-style reasoning.

To see an example of deploying a purely reasoning model like *QwQ-32B*, see [Deploy a reasoning LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/reasoning-llm/README.html).

---

## Enable or disable thinking

Some hybrid reasoning models let you toggle their "thinking" mode on or off. This section explains when to use thinking mode versus skipping it, and shows how to control the setting in practice.

---
<!-- vale Vale.Terms = NO -->
### When to enable or disable thinking mode
<!-- vale Vale.Terms = YES -->
**Enable thinking mode for:**
- Complex, multi-step tasks that require reasoning, such as math, physics, or logic problems.
- Ambiguous queries or situations with incomplete information.
- Planning, workflow orchestration, or when the model needs to act as an "agent" coordinating other tools or models.
- Analyzing intricate data, images, or charts.
- In-depth code reviews or evaluating outputs from other AI systems (LLM as Judge approach).

**Disable thinking mode for:**
- Simple, well-defined, or routine tasks.
- Cases where low latency and fast responses are the priority.
- Repetitive, straightforward steps within a larger automated workflow.

---

### How to enable or disable thinking mode

How you toggle thinking mode varies by model and framework. Consult the documentation for the model to see how it structures and controls thinking.

For example, to [control reasoning in Qwen-3](https://huggingface.co/Qwen/Qwen3-32B#switching-between-thinking-and-non-thinking-mode), you can:
* Add `"/think"` or `"/no_think"` in the prompt.
* Set `enable_thinking` in the request:
  `extra_body={"chat_template_kwargs": {"enable_thinking": ...}}`.
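
The two Qwen-3 switches above can be wrapped in a small request builder. The following is a minimal sketch, not part of Ray Serve LLM or the OpenAI client: `build_chat_request` is a hypothetical helper name, and the default `model` value assumes the `model_id` used later in this tutorial.

```python
from typing import Any, Dict, Optional


def build_chat_request(
    prompt: str,
    model: str = "my-qwen-3-32b",  # assumed model_id, taken from this tutorial
    enable_thinking: Optional[bool] = None,
    soft_switch: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create().

    enable_thinking: forwarded through chat_template_kwargs when not None.
    soft_switch: "/think" or "/no_think", appended to the prompt when given.
    """
    if soft_switch not in (None, "/think", "/no_think"):
        raise ValueError(f"Unsupported soft switch: {soft_switch!r}")

    # Option 1: append the soft switch to the user prompt.
    content = f"{prompt} {soft_switch}" if soft_switch else prompt
    request: Dict[str, Any] = {
        "model": model,
        "messages": [{"role": "user", "content": content}],
    }

    # Option 2: forward enable_thinking to the chat template.
    if enable_thinking is not None:
        request["extra_body"] = {
            "chat_template_kwargs": {"enable_thinking": enable_thinking}
        }
    return request
```

With an OpenAI client, you could then call `client.chat.completions.create(**build_chat_request("What's the capital of France?", enable_thinking=False))`.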

See [Send request with thinking enabled](#send-request-with-thinking-enabled) or [Send request with thinking disabled](#send-request-with-thinking-disabled) for practical examples.

---

## Parse reasoning outputs

In thinking mode, hybrid models often separate _reasoning_ from the _final answer_ using tags like `<think>...</think>`. Without a proper parser, this reasoning may end up in the `content` field instead of the dedicated `reasoning_content` field.  

To ensure that Ray Serve LLM correctly parses the reasoning output, configure a `reasoning_parser` in your Ray Serve LLM deployment. This tells vLLM how to isolate the model's thought process from the rest of the output.  
**Note:** For example, *Qwen-3* uses the `qwen3` parser. See the [vLLM docs](https://docs.vllm.ai/en/stable/features/reasoning_outputs.html#supported-models) or your model's documentation to find a supported parser, or [build your own](https://docs.vllm.ai/en/stable/features/reasoning_outputs.html#how-to-support-a-new-reasoning-model) if needed.

```yaml
applications:
- ...
  args:
    llm_configs:
      - model_loading_config:
          model_id: my-qwen-3-32b
          model_source: Qwen/Qwen3-32B
        ...
        engine_kwargs:
          ...
          reasoning_parser: qwen3 # <-- for Qwen-3 models
```

See [Configure Ray Serve LLM](#configure-ray-serve-llm) for a complete example.

**Example response**  
When using a reasoning parser, the response is typically structured like this:

```python
ChatCompletionMessage(
    content="The temperature is...",
    ...,
    reasoning_content="Okay, the user is asking for the temperature today and tomorrow..."
)
```
You can extract the content and reasoning like this:
```python
response = client.chat.completions.create(
  ...
)

print(f"Content: {response.choices[0].message.content}")
print(f"Reasoning: {response.choices[0].message.reasoning_content}")
```
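
If no reasoning parser is configured, the reasoning may arrive inside `content` wrapped in `<think>...</think>` tags. The following is a minimal client-side fallback sketch, assuming at most one `<think>` block per response. `split_reasoning` is a hypothetical helper, not a Ray Serve LLM or vLLM API.

```python
import re
from typing import Optional, Tuple

# Matches one <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(
    content: str, reasoning_content: Optional[str] = None
) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer) from a chat completion message.

    Prefers the parsed reasoning_content field. Falls back to extracting a
    <think>...</think> block from content when that field is empty.
    """
    if reasoning_content and reasoning_content.strip():
        return reasoning_content.strip(), content.strip()

    match = THINK_BLOCK.search(content)
    if match is None:
        return None, content.strip()

    # Treat an empty block such as "<think>\n\n</think>" as no reasoning.
    reasoning = match.group(1).strip() or None
    answer = (content[: match.start()] + content[match.end():]).strip()
    return reasoning, answer
```

For a response message, you could call `split_reasoning(message.content or "", getattr(message, "reasoning_content", None))`.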

---

## Configure Ray Serve LLM

Set your Hugging Face token in the config file to access gated models.

Ray Serve LLM provides multiple [Python APIs](https://docs.ray.io/en/latest/serve/api/index.html#llm-api) for defining your application. Use [`build_openai_app`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.build_openai_app.html#ray.serve.llm.build_openai_app) to build a full application from your [`LLMConfig`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) object.

Set `tensor_parallel_size` to distribute the model's weights across the GPUs in the node. This example uses 4 GPUs.
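
As a rough guide for choosing `tensor_parallel_size`, you can estimate how much weight memory each GPU holds. The sketch below is a back-of-the-envelope calculation only: it assumes about 32 billion parameters stored as 16-bit values (2 bytes each), and it ignores the KV cache, activations, and framework overhead, which also need GPU memory. `weights_per_gpu_gb` is a hypothetical helper.

```python
def weights_per_gpu_gb(
    num_params: float, bytes_per_param: int, tensor_parallel_size: int
) -> float:
    """Estimate the weight memory each GPU holds, in GB (1 GB = 1e9 bytes)."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    total_bytes = num_params * bytes_per_param
    return total_bytes / tensor_parallel_size / 1e9


# Assumed inputs: ~32e9 parameters, 2 bytes per parameter, 4 GPUs.
print(weights_per_gpu_gb(32e9, 2, 4))  # 16.0
```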

```python
# serve_qwen_3_32b.py
from ray.serve.llm import LLMConfig, build_openai_app
import os

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-qwen-3-32b",
        model_source="Qwen/Qwen3-32B",
    ),
    experimental_configs=dict(num_ingress_replicas=1),
    deployment_config=dict(
        autoscaling_config=dict(
            # Increase number of replicas for higher throughput/concurrency.
            min_replicas=1,
            max_replicas=1,
        )
    ),
    ### Uncomment if your model is gated and needs your Hugging Face token to access it.
    # runtime_env=dict(env_vars={"HF_TOKEN": os.environ.get("HF_TOKEN")}),
    engine_kwargs=dict(
        # 4 GPUs is enough but you can increase tensor_parallel_size to fit larger models.
        tensor_parallel_size=4, max_model_len=32768, reasoning_parser="qwen3"
    ),
)
app = build_openai_app({"llm_configs": [llm_config]})
```



**Note:** Before moving to a production setup, migrate your settings to a [Serve config file](https://docs.ray.io/en/latest/serve/production-guide/config.html) to make your deployment version-controlled, reproducible, and easier to maintain for CI/CD pipelines. See [Serving LLMs - Quickstart Examples: Production Guide](https://docs.ray.io/en/latest/serve/llm/quick-start.html#production-deployment) for an example.

---

## Deploy locally

**Prerequisites**

* Access to GPU compute.
* (Optional) A **Hugging Face token** if using gated models. Export it as an environment variable: `export HF_TOKEN=<YOUR-TOKEN-HERE>`.

**Note:** Depending on the organization, you can usually request access on the model's Hugging Face page. For example, approval for Meta's Llama models can take anywhere from a few hours to several weeks.

**Dependencies:**  
```bash
pip install "ray[serve,llm]"
```

---

### Launch

Follow the instructions at [Configure Ray Serve LLM](#configure-ray-serve-llm) to define your app in a Python module `serve_qwen_3_32b.py`.  

In a terminal, run:  

```bash
serve run serve_qwen_3_32b:app --non-blocking
```


Deployment typically takes a few minutes as the cluster is provisioned, the vLLM server starts, and the model is downloaded. 

Your endpoint is available locally at `http://localhost:8000`. You can use a placeholder authentication token for the OpenAI client, for example `"FAKE_KEY"`.

Use the `model_id` defined in your config (here, `my-qwen-3-32b`) to query your model. Below are some examples of how to send a request to a Qwen-3 deployment with thinking enabled or disabled.

---

### Send request with thinking disabled

You can disable thinking in Qwen-3 by either adding a `/no_think` tag in the prompt or by forwarding `enable_thinking: False` to the vLLM inference engine.  

Example curl with `/no_think`:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer FAKE_KEY" \
  -d '{ "model": "my-qwen-3-32b", "messages": [{"role": "user", "content": "What is greater between 7.8 and 7.11 ? /no_think"}] }'
```

Example output:

```json
{"id":"chatcmpl-6397590a-811d-4582-bfde-440b239e42b3","object":"chat.completion","created":1770409555,"model":"my-qwen-3-32b","choices":[{"index":0,"message":{"role":"assistant","content":"\n\nTo determine which number is greater between **7.8** and **7.11**, follow this comparison:\n\n- **7.8** is the same as **7.80**.\n- **7.11** is already in two decimal places.\n\nNow compare:\n\n- **7.80** vs. **7.11**\n\nSince **80 > 11**, **7.80 > 7.11**\n\n✅ **Answer: 7.8 is greater than 7.11**.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":"\n\n"},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":27,"total_tokens":143,"completion_tokens":116,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```

Example Python with `enable_thinking: False`:

```python
# client_thinking_disabled.py
from urllib.parse import urljoin
from openai import OpenAI

API_KEY = "FAKE_KEY"
BASE_URL = "http://localhost:8000"

client = OpenAI(base_url=urljoin(BASE_URL, "v1"), api_key=API_KEY)

# Example: Simple query with thinking disabled
response = client.chat.completions.create(
    model="my-qwen-3-32b",
    messages=[
        {"role": "user", "content": "What's the capital of France ?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

print(f"Reasoning: \n{response.choices[0].message.reasoning_content}\n\n")
print(f"Answer: \n {response.choices[0].message.content}")
```

Example output:

```text
Reasoning: 
None


Answer: 
 The capital of France is Paris.
```


Notice that `reasoning_content` is empty here.  
**Note:** Depending on your model's documentation, empty could mean `None`, an empty string, or even empty tags `"<think></think>"`.
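
Because "empty" can take several forms, client code may want to normalize the check before displaying or logging reasoning. The following is a minimal sketch that treats `None`, whitespace-only strings (such as the `"\n\n"` seen in the curl response above), and empty `<think></think>` tags as no reasoning. `has_reasoning` is a hypothetical helper name.

```python
import re
from typing import Optional

# Matches a string that contains only an empty <think></think> pair.
EMPTY_THINK_TAGS = re.compile(r"^\s*<think>\s*</think>\s*$")


def has_reasoning(reasoning_content: Optional[str]) -> bool:
    """Return True only when the reasoning field carries actual text."""
    if reasoning_content is None:
        return False
    if not reasoning_content.strip():
        return False
    return EMPTY_THINK_TAGS.match(reasoning_content) is None
```

For example, `has_reasoning(response.choices[0].message.reasoning_content)` lets you skip printing the reasoning section when thinking is disabled.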

---

### Send request with thinking enabled
 
You can enable thinking in Qwen-3 by either adding a `/think` tag in the prompt or by forwarding `enable_thinking: True` to the vLLM inference engine.  

Example curl with `/think`:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer FAKE_KEY" \
  -d '{ "model": "my-qwen-3-32b", "messages": [{"role": "user", "content": "What is greater between 7.8 and 7.11 ? /think"}] }'
```

Example output (truncated):

```text
{"id":"chatcmpl-17ae842c-0bc5-43d7-9ac1-ff5e43450095","object":"chat.completion","created":1770409563,"model":"my-qwen-3-32b","choices":[{"index":0,"message":{"role":"assistant","content":"\n\nWhen comparing the numbers **7.8** and **7.11**, it's crucial to understand how place value works in decimal numbers. Here's a clear breakdown:\n\n---\n\n### Step-by-Step Comparison\n\n1. **Align the Decimal Places**  \n   To make the comparison easier, we convert both numbers to have the same number of decimal places:\n   - **7.8** becomes **7.80** (adding a zero at the end does not change the value).\n   - **7.11** remains **7.11**.\n\n2. **Break Down the Place Values**  \n   Now we compare:\n   - **7.80** = 7 (ones) + 8 (tenths) + 0 (hundredths)\n   - **7.11** = 7 (ones) + 1 (tenths) + 1 (hundredths)\n\n3. **Compare from Left to Right**  \n   - The **ones** place is the same in both numbers: **7**.\n   - The **tenths** place is where the difference occurs: **8** (in 7.80) vs. **1** (in 7.11).\
```

Example Python with `enable_thinking: True`:

```python
# client_thinking_enabled.py
from urllib.parse import urljoin
from openai import OpenAI

API_KEY = "FAKE_KEY"
BASE_URL = "http://localhost:8000"

client = OpenAI(base_url=urljoin(BASE_URL, "v1"), api_key=API_KEY)

# Example: Simple query with thinking enabled
response = client.chat.completions.create(
    model="my-qwen-3-32b",
    messages=[
        {"role": "user", "content": "What's the capital of France ?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)

print(f"Reasoning: \n{response.choices[0].message.reasoning_content}\n\n")
print(f"Answer: \n {response.choices[0].message.content}")
```

Example output:

```text
Reasoning: 

Okay, so the user is asking for the capital of France. Let me think. I remember that France is a country in Europe, and I think the capital is Paris. But wait, let me make sure. Sometimes people might confuse it with Lyon or Marseille, but those are other major cities. Paris is definitely the capital. I can recall that the Eiffel Tower is in Paris, and it's a major city known for art and fashion. Also, the French government is based there. Yeah, I'm pretty confident it's Paris. Let me double-check in my mind. No, I don't think I'm mixing it up with another country. France's capital is indeed Paris. So the answer should be Paris.



Answer: 
 

The capital of France is **Paris**. It is a major global city known for its cultural landmarks, such as the Eiffel Tower and the Louvre Museum, and serves as the political, economic, and administrative center of the country. 

**Answer:** Paris.
```


If you configure a valid reasoning parser, the reasoning output should appear in the `reasoning_content` field of the response message. Otherwise, it may be included in the main `content` field, typically wrapped in `<think>...</think>` tags. See [Parse reasoning outputs](#parse-reasoning-outputs) for more information.

---

### Shutdown 

Shut down your LLM service:

```bash
serve shutdown -y
```


---

## Deploy to production with Anyscale services

For production, use Anyscale services to deploy your Ray Serve app on a dedicated cluster without any code changes. Anyscale provides scalability, fault tolerance, and load balancing, ensuring resilience against node failures, high traffic, and rolling updates. See [Deploy a medium-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/medium-size-llm/README.html#deploy-to-production-with-anyscale-services) for an example with a medium-sized model like the *Qwen3-32B* from this tutorial.

---

## Stream reasoning content

In thinking mode, hybrid reasoning models may take longer to begin generating the main content. You can stream intermediate reasoning output in the same way as the main content.  

```python
# client_streaming.py
from urllib.parse import urljoin
from openai import OpenAI

API_KEY = "FAKE_KEY"
BASE_URL = "http://localhost:8000"

client = OpenAI(base_url=urljoin(BASE_URL, "v1"), api_key=API_KEY)

# Example: Complex query with thinking process
response = client.chat.completions.create(
    model="my-qwen-3-32b",
    messages=[
        {"role": "user", "content": "I need to plan a trip to Paris from Seattle. Can you help me research flight costs, create an itinerary for 3 days, and suggest restaurants based on my dietary restrictions (vegetarian)?"}
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    stream=True
)

# Stream
for chunk in response:
    # Stream reasoning content
    if hasattr(chunk.choices[0].delta, "reasoning_content"):
        data_reasoning = chunk.choices[0].delta.reasoning_content
        if data_reasoning:
            print(data_reasoning, end="", flush=True)
    # Later, stream the final answer
    if hasattr(chunk.choices[0].delta, "content"):
        data_content = chunk.choices[0].delta.content
        if data_content:
            print(data_content, end="", flush=True)
```

Example output (truncated):

```text
Okay, the user wants to plan a trip to Paris from Seattle. Let me break down what they need. First, flight costs. They might be looking for the best times to fly or budget-friendly options. I should check current prices and maybe suggest flexible dates. Then, a 3-day itinerary. They might want to see the major attractions but also have some downtime. I need to balance popular spots with less crowded areas. Also, vegetarian restaurants. I should make sure the suggestions are up-to-date and consider different areas of Paris for each day. Let me start with flights.

For flights, I remember that prices can vary a lot. Maybe suggest using Google Flights or Skyscanner to track prices. Also, mention that flying mid-week might be cheaper. The average cost from Seattle to Paris is around $1,000-$1,500, but it depends on the time of year. If they're flexible, they might find cheaper options. Maybe highlight some airlines that offer good deals, like Air France or low-cost carriers if available.

Next, the
```

---

## Summary

In this tutorial, you deployed a hybrid reasoning LLM with Ray Serve LLM, from development to production. You learned how to configure Ray Serve LLM with the right reasoning parser, deploy your service on your Ray cluster, send requests, and parse reasoning outputs in the response.