# Deploying GLM-4.7-Flash with TensorRT-LLM

This notebook walks you through deploying the `zai-org/GLM-4.7-Flash` model using TensorRT-LLM.

[TensorRT-LLM](https://nvidia.github.io/TensorRT-LLM/) is NVIDIA's open-source library for accelerating and optimizing LLM inference on NVIDIA GPUs. GLM-4.7-Flash is supported through the AutoDeploy workflow; see the [AutoDeploy guide](https://nvidia.github.io/TensorRT-LLM/torch/auto_deploy/auto-deploy.html) for details.

**Model Resources:**
- [HuggingFace Model Card](https://huggingface.co/zai-org/GLM-4.7-Flash)
- [Technical Blog](https://z.ai/blog/glm-4.7)
- [Technical Report (GLM-4.5)](https://arxiv.org/abs/2508.06471)
- [Z.ai API Platform](https://docs.z.ai/guides/llm/glm-4.7)

**Model Highlights:**
- 30B-A3B Mixture of Experts (MoE) architecture
- 131,072 token context length
- Tool calling support
- MIT License

**Prerequisites:**
- NVIDIA GPU with recent drivers (≥ 64 GB VRAM for BF16) and CUDA 12.x
- Python 3.10+
- TensorRT-LLM ([container](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) or pip install)
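
The BF16 memory figure above can be sanity-checked with a rough estimate. The sketch below assumes a nominal 30 billion total parameters (read off the "30B-A3B" name; the exact count may differ) and 2 bytes per BF16 parameter. It counts weights only, so KV cache and activations come on top of this number.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight memory in decimal gigabytes (weights only)."""
    return num_params * bytes_per_param / 1e9


# Assumption: nominal total parameter count taken from the "30B" in the model name.
NOMINAL_PARAMS = 30e9
BF16_BYTES = 2  # BF16 stores each parameter in 16 bits

print(f"BF16 weights: ~{weight_memory_gb(NOMINAL_PARAMS, BF16_BYTES):.0f} GB")
```

In a MoE model all experts must be resident in memory even though only about 3B parameters are active per token, which is why the estimate uses the total count rather than the active one.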

## Prerequisites & Environment

Set up a containerized environment for TensorRT-LLM by running the following command in a terminal:

```shell
docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all -p 8000:8000 nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc1
```

You now have TensorRT-LLM set up!

In [None]:
# Run this only if pip is not available in the environment
!python -m ensurepip --default-pip

In [None]:
%pip install torch openai

## Verify GPU

Check that CUDA is available and the GPU is detected correctly.

In [1]:
# Environment check
import sys

import torch

print(f"Python: {sys.version}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Num GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(f"GPU[{i}]: {torch.cuda.get_device_name(i)}")

Python: 3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0]
CUDA available: True
Num GPUs: 8
GPU[0]: NVIDIA H100 80GB HBM3
GPU[1]: NVIDIA H100 80GB HBM3
GPU[2]: NVIDIA H100 80GB HBM3
GPU[3]: NVIDIA H100 80GB HBM3
GPU[4]: NVIDIA H100 80GB HBM3
GPU[5]: NVIDIA H100 80GB HBM3
GPU[6]: NVIDIA H100 80GB HBM3
GPU[7]: NVIDIA H100 80GB HBM3
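
Beyond the device names, it can help to compare each GPU's total memory with the BF16 requirement listed in the prerequisites. The following is a minimal sketch: `gpus_meeting_requirement` is a hypothetical helper, and the 64 GB threshold is interpreted as decimal gigabytes, which is an assumption.

```python
def gpus_meeting_requirement(total_bytes_per_gpu, min_gb=64.0):
    """Return indices of GPUs whose total memory is at least min_gb (decimal GB)."""
    return [i for i, total in enumerate(total_bytes_per_gpu) if total / 1e9 >= min_gb]


try:
    import torch

    # total_memory is reported in bytes for each visible CUDA device.
    totals = [
        torch.cuda.get_device_properties(i).total_memory
        for i in range(torch.cuda.device_count())
    ]
except ImportError:
    totals = []  # torch is not installed, so there is nothing to inspect

print(f"GPUs with >= 64 GB: {gpus_meeting_requirement(totals)}")
```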


## OpenAI-Compatible Server

Start a local OpenAI-compatible server with TensorRT-LLM from a terminal inside the running Docker container.

The server uses the GLM-4.7-Flash YAML config at `examples/auto_deploy/model_registry/configs/glm-4.7-flash.yaml` (a path in the TensorRT-LLM repository), so run the commands from a directory where that relative path resolves.

### Load the Model

Launch the TensorRT-LLM server with GLM-4.7-Flash:

```shell
trtllm-serve "zai-org/GLM-4.7-Flash" \
  --host 0.0.0.0 \
  --port 8000 \
  --backend _autodeploy \
  --trust_remote_code \
  --extra_llm_api_options examples/auto_deploy/model_registry/configs/glm-4.7-flash.yaml
```

Once the model has finished loading, the server accepts requests on port 8000.
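
Loading the weights can take a while, so a client may want to wait until the server responds before sending requests. The sketch below polls an endpoint until it returns HTTP 200. `http_probe` and `wait_until_ready` are hypothetical helpers, and the `/v1/models` route in the usage comment is assumed to be served, as is typical for OpenAI-compatible servers.

```python
import http.client
import time
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(url, probe=http_probe, attempts=60, delay_s=10.0, sleep=time.sleep):
    """Poll url until probe succeeds; return the attempt number, or None on give-up."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        if attempt < attempts:
            sleep(delay_s)
    return None


# Example usage (run once the trtllm-serve command has been started):
# wait_until_ready("http://localhost:8000/v1/models")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.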

## Use the API

Use the OpenAI-compatible client to send requests to the TensorRT-LLM server.

In [11]:
from openai import OpenAI

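# Set up the client. The server binds to 0.0.0.0; connect to it via localhost.
BASE_URL = "http://localhost:8000/v1"
API_KEY = "null"  # Placeholder value; the OpenAI client needs an api_key to be set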
client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

MODEL_ID = "zai-org/GLM-4.7-Flash"

In [None]:
# Basic chat completion
print("Chat Completion Example")
print("=" * 50)

response = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 15% of 85? Show your reasoning."},
    ],
    temperature=1,
    top_p=0.95,
    max_tokens=512,
)

print("Response:")
print(response.choices[0].message.content)

Chat Completion Example
Response:
1.  **Analyze the Request:** The user wants to know 15% of 85 and wants to see the reasoning behind the calculation.

2.  **Identify the Core Task:** Calculate $15\% \times 85$.

3.  **Determine the Mathematical Approach:** There are several ways to solve this:
    *   *Method 1: Fraction multiplication.* Convert 15% to a fraction ($\frac{15}{100}$), then multiply by 85.
    *   *Method 2: Decimal multiplication.* Convert 15% to a decimal ($0.15$), then multiply by 85.
    *   *Method 3: Decomposition (Breaking it down).* $15\% = 10\% + 5\%$.
        *   $10\%$ of $85 = 8.5$
        *   $5\%$ of $85 = \frac{8.5}{2} = 4.25$
        *   Sum: $8.5 + 4.25 = 12.75$

4.  **Select the Best Approach for Explanation:** Method 3 is often easiest for a general audience to follow step-by-step because it avoids dealing with decimals until the end or simplifies large multiplications. Method 2 is the most direct standard school method. I will use Method 3 (Splitting 
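
The sample output above stops mid-sentence, most likely because generation reached the `max_tokens=512` limit. In the OpenAI chat completions format, a choice that was cut off this way reports `finish_reason == "length"`. The sketch below checks for that; `was_truncated` is a hypothetical helper, and the `SimpleNamespace` object is only a stand-in shaped like a real response.

```python
from types import SimpleNamespace


def was_truncated(response):
    """True if the first choice stopped because it hit the max_tokens limit."""
    return response.choices[0].finish_reason == "length"


# Stand-in object shaped like a chat completion response, for illustration only.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(content="I will use Method 3 (Splitting "),
        )
    ]
)

if was_truncated(fake_response):
    print("Output hit max_tokens; consider raising the limit.")
```

With a live server, pass the `response` object returned by `client.chat.completions.create(...)` instead of the stand-in.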

In [None]:
# Streaming chat completion
print("Streaming response:")
print("=" * 50)

stream = client.chat.completions.create(
    model=MODEL_ID,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What are the first 5 prime numbers?"},
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)

for chunk in stream:
    # Some chunks may carry no choices or no text; skip those.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Streaming response:
1.  **Analyze the Request:** The user is asking for the "first 5 prime numbers".

2.  **Define "Prime Number":** A prime number is a natural number greater than 1 that is not a product of two smaller natural numbers. In other words, it has exactly two distinct positive divisors: 1 and itself.

3.  **Identify the First Numbers:**
    *   Start checking from 1 (exclusive).
    *   Check 2: Divisors are 1 and 2. Prime. (1st)
    *   Check 3: Divisors are 1 and 3. Prime. (2nd)
    *   Check 4: Divisors are 1, 2, 4. Not prime (2 * 2).
    *   Check 5: Divisors are 1 and 5. Prime. (3rd)
    *   Check 6: Divisors are 1, 2, 3, 6. Not prime.
    *   Check 7: Divisors are 1 and 7. Prime. (4th)
    *   Check 8: Divisors are 1, 2, 4, 8. Not prime.
    *   Check 9: Divisors are 1, 3, 9. Not prime.
    *   Check 10: Divisors are 1, 2, 5, 10. Not prime.
    *   Check 11: Divisors are 1 and 11. Prime. (5th)

4.  **Compile the List:** 2, 3, 5, 7, 11.

5.  **Formulate the Output:** P
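
When the streamed text is needed afterwards (for logging or post-processing), the deltas can be accumulated rather than only printed. This is a minimal sketch; `collect_stream` is a hypothetical helper, and the fake chunks merely mimic the shape of streamed chat completion chunks.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of streamed chunks into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks may carry no choices
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
    return "".join(parts)


def _fake_chunk(text):
    """Build a stand-in chunk with a single choice (illustration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo_chunks = [_fake_chunk("2, 3, "), _fake_chunk(None), SimpleNamespace(choices=[]), _fake_chunk("5, 7, 11")]
print(collect_stream(demo_chunks))
```

With a live server, pass the `stream` object returned by `client.chat.completions.create(..., stream=True)` in place of `demo_chunks`.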

## Evaluation Parameters

For optimal results, use the following parameters based on your task:

**Default Settings (Most Tasks)**
- `temperature`: 1.0
- `top_p`: 0.95
- `max_tokens`: 131072

**Agentic Tasks (SWE-bench, Terminal Bench)**
- `temperature`: 0.7
- `top_p`: 1.0
- `max_tokens`: 16384

**Deterministic Tasks**
- `temperature`: 0
- `max_tokens`: 16384
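
The presets above can be kept in one place and unpacked into a request. The sketch below is one possible arrangement; `SAMPLING_PRESETS` and `sampling_kwargs` are hypothetical names, and the values are copied from the lists above.

```python
SAMPLING_PRESETS = {
    "default": {"temperature": 1.0, "top_p": 0.95, "max_tokens": 131072},
    "agentic": {"temperature": 0.7, "top_p": 1.0, "max_tokens": 16384},
    "deterministic": {"temperature": 0.0, "max_tokens": 16384},
}


def sampling_kwargs(task="default", **overrides):
    """Return the preset for task, with any keyword overrides applied on top."""
    if task not in SAMPLING_PRESETS:
        raise ValueError(f"Unknown task {task!r}; expected one of {sorted(SAMPLING_PRESETS)}")
    return {**SAMPLING_PRESETS[task], **overrides}


print(sampling_kwargs("agentic"))
print(sampling_kwargs("default", max_tokens=512))
```

A request would then look like `client.chat.completions.create(model=MODEL_ID, messages=messages, **sampling_kwargs("agentic"))`. Because the default preset's `max_tokens` equals the model's full context length, a smaller override may be needed when the prompt is long or the server is configured with a shorter maximum sequence length.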

## Additional Resources

- [TensorRT-LLM Documentation](https://nvidia.github.io/TensorRT-LLM/)
- [AutoDeploy Guide](https://nvidia.github.io/TensorRT-LLM/torch/auto_deploy/auto-deploy.html)
- [GLM-4.7-Flash on HuggingFace](https://huggingface.co/zai-org/GLM-4.7-Flash)
- [Z.ai Discord Community](https://discord.gg/QR7SARHRxK)