# Deploying Qwen3.5-397B-A17B with TensorRT-LLM

This notebook walks you through deploying the `Qwen/Qwen3.5-397B-A17B` model using TensorRT-LLM.

[TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/) is NVIDIA's open-source library for accelerating and optimizing LLM inference on NVIDIA GPUs. Qwen3.5-397B-A17B is supported through the AutoDeploy workflow; see the [AutoDeploy guide](https://nvidia.github.io/TensorRT-LLM/torch/auto_deploy/auto-deploy.html) for details.

**Model Resources:**
- [HuggingFace Model Card](https://huggingface.co/Qwen/Qwen3.5-397B-A17B)
- [Qwen3.5 GitHub](https://github.com/QwenLM/Qwen3.5)
- [Qwen Blog](https://qwenlm.github.io/blog/qwen3.5/)

**Model Highlights:**
- Mixture of Experts (MoE) with 397B total parameters and 17B active per token, using Gated Delta Networks
- 262,144 token context length
- Reasoning and tool calling support
- 201 languages and dialects
- Apache-2.0 License

**Prerequisites:**
- 8x NVIDIA B200 GPUs (or equivalent VRAM for BF16) with recent drivers and CUDA 12.x
- Python 3.10+
- TensorRT-LLM ([container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) or pip install)

## Prerequisites & Environment

Set up a containerized environment for TensorRT-LLM by running the following command in a terminal:

```shell
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc1
```

You now have TensorRT-LLM set up!

In [None]:
# Bootstrap pip if it is missing from the environment
!python -m ensurepip --default-pip

In [None]:
%pip install torch openai

## Verify GPU

Check that CUDA is available and that all GPUs are detected.

In [None]:
# Environment check
import sys

import torch

print(f"Python: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU[{i}]: {torch.cuda.get_device_name(i)}")

## OpenAI-Compatible Server

Start a local OpenAI-compatible server with `trtllm-serve`. Run the commands in this section from a terminal inside the running Docker container.

Both launch commands use the Qwen3.5 AutoDeploy config at `examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml`.

### Load the BF16 Model

Launch the TensorRT-LLM server with Qwen3.5-397B-A17B:

```shell
trtllm-serve "Qwen/Qwen3.5-397B-A17B" \
  --host 0.0.0.0 \
  --port 8000 \
  --backend _autodeploy \
  --trust_remote_code \
  --extra_llm_api_options examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml
```

### Load the NVFP4 Model

Launch the TensorRT-LLM server with `nvidia/Qwen3.5-397B-A17B-NVFP4`.

**Note:** `nvidia/Qwen3.5-397B-A17B-NVFP4` can run on 4x B200 GPUs. To use 4 GPUs, set `world_size` in `examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml`:

```yaml
world_size: 4
```
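
A rough weight-memory estimate helps explain why the NVFP4 checkpoint fits on fewer GPUs. The sketch below is back-of-the-envelope arithmetic only: it assumes 2 bytes per parameter for BF16 and about 0.5 bytes per parameter for 4-bit weights, and it ignores quantization scale factors, any layers kept in higher precision, the KV cache, and activations. The function name `weight_memory_gb` is illustrative and not part of TensorRT-LLM.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


TOTAL_PARAMS = 397e9  # total parameter count implied by the model name (397B)

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 2.0)   # BF16: 16 bits per weight
nvfp4_gb = weight_memory_gb(TOTAL_PARAMS, 0.5)  # assumed 4 bits per weight, scales ignored

print(f"BF16 weights:  ~{bf16_gb:.1f} GB")
print(f"NVFP4 weights: ~{nvfp4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take roughly a quarter of the BF16 footprint, which is consistent with halving the GPU count while leaving headroom for the KV cache.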

Command:

```shell
trtllm-serve "nvidia/Qwen3.5-397B-A17B-NVFP4" \
  --host 0.0.0.0 \
  --port 8000 \
  --backend _autodeploy \
  --trust_remote_code \
  --extra_llm_api_options examples/auto_deploy/model_registry/configs/qwen3.5_moe_400b.yaml
```

Your server is now running!
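
Loading a model of this size can take a while, so it may be useful to wait for the server before sending requests. The sketch below polls the OpenAI-compatible `/v1/models` route until it answers. The helper names `models_endpoint_ok` and `wait_until_ready`, the timeout values, and the choice of `/v1/models` as a readiness probe are assumptions made for illustration, not a documented TensorRT-LLM readiness procedure.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def models_endpoint_ok(base_url: str = "http://localhost:8000/v1") -> bool:
    """Return True if GET {base_url}/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 1800.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready(models_endpoint_ok)` returns `True` once the server responds and `False` if the timeout passes first. The probe is injected as a callable so a different check can be swapped in without changing the polling loop.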

## Use the API

Use the OpenAI-compatible client to send requests to the TensorRT-LLM server.

In [3]:
from openai import OpenAI

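# Set up the client. The API key is a placeholder value.
BASE_URL = "http://localhost:8000/v1"
API_KEY = "null"
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

# Should match the model passed to trtllm-serve; use
# "nvidia/Qwen3.5-397B-A17B-NVFP4" if you launched the NVFP4 checkpoint.
MODEL_ID = "Qwen/Qwen3.5-397B-A17B"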

In [5]:
# Basic chat completion
print("Chat Completion Example")
print("=" * 50)

response = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 15% of 85? Show your reasoning."},
    ],
    temperature=1,
    top_p=0.95,
    max_tokens=512,
)

print("Response:")
print(response.choices[0].message.content)

Chat Completion Example
Response:
To find 15% of 85, I need to multiply 85 by 0.15 (since 15% = 15/100 = 0.15).

Let me calculate this:

85 × 0.15

I can break this down:
85 × 0.15 = 85 × (15/100) = (85 × 15) / 100

85 × 15:
85 × 10 = 850
85 × 5 = 425
850 + 425 = 1,275

So 85 × 15 = 1,275

Now divide by 100:
1,275 / 100 = 12.75

Therefore, 15% of 85 = 12.75
</think>

To find 15% of 85, I need to multiply 85 by 0.15 (since 15% = 15/100 = 0.15).

**Method 1: Direct multiplication**
85 × 0.15 = 12.75

**Method 2: Breaking it down**
- 10% of 85 = 8.5
- 5% of 85 = 4.25 (half of 10%)
- 15% = 10% + 5% = 8.5 + 4.25 = 12.75

**Method 3: Using fractions**
15% of 85 = (15/100) × 85 = (85 × 15) / 100 = 1,275 / 100 = 12.75

**Answer: 15% of 85 = 12.75**
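
The sample output above contains a stray `</think>` tag, which suggests that the model's reasoning text can arrive inline in `message.content`. If only the final answer is wanted, one option is to split on that tag on the client side. This is a minimal sketch under that assumption; `split_reasoning` is a hypothetical helper, not part of the OpenAI client or TensorRT-LLM, and a server-side reasoning parser, if one is configured, may make it unnecessary.

```python
def split_reasoning(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split raw model output into (reasoning, answer).

    Everything before the last closing tag is treated as reasoning. If the
    tag is absent, the whole text is returned as the answer.
    """
    reasoning, sep, answer = text.rpartition(close_tag)
    if not sep:
        return "", text.strip()
    # Drop the first opening tag, if the server left one in place.
    reasoning = reasoning.replace("<think>", "", 1)
    return reasoning.strip(), answer.strip()


raw = "85 x 0.15 = 12.75\n</think>\n\n**Answer: 15% of 85 = 12.75**"
reasoning, answer = split_reasoning(raw)
print(answer)
```

In the notebook, this could be applied as `split_reasoning(response.choices[0].message.content)`.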


In [4]:
# Streaming chat completion
print("Streaming response:")
print("=" * 50)

stream = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the first 5 prime numbers?"},
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)

for chunk in stream:
    # Skip chunks that carry no choices or no text (e.g., a role-only chunk)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Streaming response:
Thinking Process:

1.  **Analyze the Request:** The user is asking for the first 5 prime numbers.

2.  **Define "Prime Number":** A prime number is a natural number greater than 1 that has no positive divisors other than 1 and itself.

3.  **Identify Prime Numbers in Order:**
    *   1 is not prime (by definition).
    *   2 is prime (divisible only by 1 and 2).
    *   3 is prime (divisible only by 1 and 3).
    *   4 is not prime (divisible by 1, 2, 4).
    *   5 is prime (divisible only by 1 and 5).
    *   6 is not prime (divisible by 1, 2, 3, 6).
    *   7 is prime (divisible only by 1 and 7).
    *   8 is not prime (divisible by 1, 2, 4, 8).
    *   9 is not prime (divisible by 1, 3, 9).
    *   11 is prime (divisible only by 1 and 11).

4.  **List the First 5:**
    1.  2
    2.  3
    3.  5
    4.  7
    5.  11

5.  **Format the Output:** Present the list clearly.

6.  **Final Review:** Does the list match the definition? Yes. 2, 3, 5, 7, 11.

7.  **Construc
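
When streaming, it is often handy to keep the full text as well as printing it incrementally. The sketch below accumulates the `delta.content` pieces from a stream into one string and skips chunks that carry no text. `collect_stream` is a hypothetical helper name; it relies only on each chunk having the `choices[0].delta.content` shape used in the streaming loop above.

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[Any], echo: bool = True) -> str:
    """Concatenate streamed delta.content pieces, optionally printing them."""
    parts: list[str] = []
    for chunk in chunks:
        # Some chunks have no choices or no text (for example a role-only chunk).
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
            if echo:
                print(piece, end="", flush=True)
    return "".join(parts)
```

With the client from earlier, `full_text = collect_stream(stream)` would take the place of the manual loop and leave the complete response available for later processing.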

## Evaluation Parameters

Recommended sampling parameters by task type:

**Default Settings (Most Tasks)**
- `temperature`: 1.0
- `top_p`: 0.95
- `max_tokens`: 262144

**Agentic Tasks (SWE-bench, Terminal Bench)**
- `temperature`: 0.7
- `top_p`: 1.0
- `max_tokens`: 16384

**Deterministic Tasks**
- `temperature`: 0
- `max_tokens`: 16384
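
The presets above can be kept in one place and unpacked into a request. This is a sketch only: the dictionary keys (`default`, `agentic`, `deterministic`) and the helper `sampling_kwargs` are names invented here, and the values are copied from the lists above.

```python
SAMPLING_PRESETS = {
    "default": {"temperature": 1.0, "top_p": 0.95, "max_tokens": 262144},
    "agentic": {"temperature": 0.7, "top_p": 1.0, "max_tokens": 16384},
    "deterministic": {"temperature": 0, "max_tokens": 16384},
}


def sampling_kwargs(task: str = "default", **overrides) -> dict:
    """Return a copy of the preset for `task`, with optional overrides applied."""
    if task not in SAMPLING_PRESETS:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(SAMPLING_PRESETS)}"
        )
    return {**SAMPLING_PRESETS[task], **overrides}


# Example: agentic preset with a smaller output budget.
print(sampling_kwargs("agentic", max_tokens=4096))
```

The result can be unpacked into the call, for example `client.chat.completions.create(model=MODEL_ID, messages=messages, **sampling_kwargs("agentic"))`. Because `max_tokens` limits generated tokens, a smaller override may be needed when the prompt itself is long.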

## Additional Resources

- [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/)
- [AutoDeploy Guide](https://nvidia.github.io/TensorRT-LLM/torch/auto_deploy/auto-deploy.html)
- [Qwen3.5-397B-A17B on HuggingFace](https://huggingface.co/Qwen/Qwen3.5-397B-A17B)
- [Qwen3.5 GitHub](https://github.com/QwenLM/Qwen3.5)