# vLLM Deployment Tutorial for ERNIE-4.5-0.3B

This tutorial shows how to deploy Baidu's ERNIE-4.5-0.3B model with vLLM as a high-performance inference service. ERNIE-4.5 is Baidu's next-generation large language model, with strong Chinese language understanding and generation capabilities. vLLM is a high-performance inference engine for large language models, designed for production use and supporting both high-throughput batch inference and online serving.

## Environment Preparation

### Hardware Requirements

- **GPU**: RTX 4090 24GB recommended (ERNIE-4.5-0.3B itself needs only about 2-4GB of GPU memory; the RTX 4090 simply provides faster inference)
- **Memory**: At least 16GB of RAM (32GB recommended for better performance)
- **Storage**: At least 10GB of free space (for model files and cache)
- **Operating System**: Linux (Ubuntu 20.04+ recommended)
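
As a rough sanity check on the GPU memory figure above, the weight footprint can be estimated from the parameter count and the bytes per parameter. The sketch below assumes about 0.36 billion parameters (inferred from the 722 MB FP16 checkpoint that appears in the download log later in this tutorial) and ignores activations, KV cache, and CUDA graph overhead. The helper names are illustrative, not part of vLLM.

```python
GIB = 1024 ** 3


def weight_footprint_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GiB (FP16/BF16 use 2 bytes each)."""
    return num_params * bytes_per_param / GIB


def vllm_budget_mib(total_mib: float, gpu_memory_utilization: float) -> float:
    """Upper bound on the GPU memory vLLM may claim at a given utilization fraction."""
    return total_mib * gpu_memory_utilization


# Assumption: ~0.361e9 parameters, inferred from the 722 MB FP16 checkpoint.
weights = weight_footprint_gib(0.361e9)
# RTX 4090 total memory in MiB, with the 0.8 utilization used later in this tutorial.
budget = vllm_budget_mib(24564, 0.8)
print(f"Estimated weights: {weights:.2f} GiB")
print(f"vLLM memory budget: {budget:.0f} MiB")
```

The weight estimate is close to the 0.7042 GiB that vLLM reports in the model loading log shown later, which suggests the weights account for only a small part of the 2-4GB figure.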

### Software Requirements

- **Python**: 3.9-3.12 (3.10 recommended)
- **CUDA**: 11.8+ (if using a GPU)
- **PyTorch**: 2.0+
- **Network**: A stable internet connection (model download required for the first run)
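
The Python requirement can also be checked programmatically before installing anything. This is a minimal sketch using only the standard library; the range constants mirror the 3.9-3.12 requirement listed above, and `python_supported` is a hypothetical helper name.

```python
import sys

# Supported range as stated in the requirements above (inclusive).
MIN_VERSION = (3, 9)
MAX_VERSION = (3, 12)


def python_supported(version=None) -> bool:
    """Return True if the (major, minor) version lies within the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if python_supported():
    print("Python version OK")
else:
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Warning: Python {current} is outside the 3.9-3.12 range")
```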

In [None]:
# Check Python version
!python --version

Python 3.10.18


In [None]:
# Check CUDA version
!nvidia-smi

Thu Aug 21 15:25:33 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 572.83         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  |   00000000:01:00.0  On |                  Off |
|  0%   46C    P8             20W /  450W |    2125MiB /  24564MiB |     10%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
# Check available video memory
!nvidia-smi --query-gpu=memory.total,memory.free --format=csv

memory.total [MiB], memory.free [MiB]
24564 MiB, 22029 MiB
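
If the same check needs to run inside a script, the CSV output can be parsed directly. The sketch below works on the sample output shown above; the 4096 MiB threshold is an assumption chosen to leave headroom over the 2-4GB estimate, and `parse_gpu_memory` is an illustrative helper.

```python
def parse_gpu_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse `nvidia-smi --query-gpu=memory.total,memory.free --format=csv` output.

    Returns one (total_mib, free_mib) tuple per GPU.
    """
    rows = []
    for line in csv_text.strip().splitlines()[1:]:  # skip the header row
        total, free = (field.strip().split()[0] for field in line.split(","))
        rows.append((int(total), int(free)))
    return rows


sample = """memory.total [MiB], memory.free [MiB]
24564 MiB, 22029 MiB"""

for index, (total, free) in enumerate(parse_gpu_memory(sample)):
    # Assumption: 4096 MiB of free memory is a comfortable margin for this model.
    status = "OK" if free >= 4096 else "low"
    print(f"GPU {index}: {free}/{total} MiB free ({status})")
```

To use live data instead of the sample, the text could be obtained with `subprocess.run([...], capture_output=True, text=True).stdout`, passing the same `nvidia-smi` arguments as above.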


## vLLM Installation

Pay attention to version compatibility when installing vLLM: ERNIE-4.5 support was added only in recent releases, so an older installation may not recognize the model. The run recorded in this tutorial uses vLLM 0.10.1.

### Creating a Virtual Environment

```bash
# Create a conda environment (Python 3.10 is recommended)
conda create -n vllm-ernie python=3.10 -y
conda activate vllm-ernie

# Update pip to the latest version
pip install --upgrade pip
```

### Install vLLM

```bash
# Install the latest vLLM release (ERNIE-4.5 support requires a recent version)
pip install vllm
```

### Verify Installation


In [None]:
import vllm
print(f"vLLM version: {vllm.__version__}")

# Check if CUDA is available
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device count: {torch.cuda.device_count()}")
    print(f"Current CUDA device: {torch.cuda.current_device()}")
    print(f"CUDA device name: {torch.cuda.get_device_name()}")

vLLM version: 0.10.1
CUDA available: True
CUDA device count: 1
Current CUDA device: 0
CUDA device name: NVIDIA GeForce RTX 4090


### Common Installation Issues

1. **CUDA Version Mismatch**: Ensure your CUDA version is compatible with your PyTorch version.
2. **Insufficient Memory**: Installation can use a significant amount of RAM; close other programs if it fails or stalls.
3. **Network Issues**: If downloads are slow, use a package mirror closer to you (for example, a PyPI mirror in mainland China).
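
When diagnosing a version mismatch, it can help to read the installed package versions without importing the packages themselves, since importing `vllm` or `torch` is slow and may fail outright in a broken environment. A minimal sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("vllm", "torch"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```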

## SDK Call Examples

### Basic Text Generation

The following example demonstrates how to use the vLLM SDK for basic text generation:

In [None]:
import time

from vllm import LLM, SamplingParams


# Initialize the model
print("Loading ERNIE-4.5-0.3B model...")
start_time = time.time()

llm = LLM(
    model="baidu/ERNIE-4.5-0.3B-PT",
    trust_remote_code=True,
    gpu_memory_utilization=0.8,
    max_model_len=2048,
    dtype="float16"
)

load_time = time.time() - start_time
print(f"Model loading completed, took: {load_time:.2f} seconds")

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,      # Sampling randomness (typically 0.0-2.0); higher values give more varied output
    top_p=0.9,           # Nucleus sampling threshold (0.0-1.0); only the most probable tokens up to this cumulative probability are considered
    max_tokens=512,      # Maximum number of tokens to generate
    repetition_penalty=1.1  # Values above 1.0 penalize tokens that have already appeared, reducing repetition
)

# Generate from a single prompt
prompt = "Please introduce the development history of artificial intelligence"
print(f"\nInput prompt word: {prompt}")
print("Generating reply...")

start_time = time.time()
outputs = llm.generate([prompt], sampling_params)
generate_time = time.time() - start_time

generated_text = outputs[0].outputs[0].text
print(f"\nGenerated result: {generated_text}")
print(f"Generation time: {generate_time:.2f} seconds")
print(f"Generated length: {len(generated_text)} characters")


INFO 08-21 15:44:09 [__init__.py:241] Automatically detected platform cuda.
Loading ERNIE-4.5-0.3B model...
INFO 08-21 15:44:10 [utils.py:326] non-default args: {'model': 'baidu/ERNIE-4.5-0.3B-PT', 'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 2048, 'gpu_memory_utilization': 0.8, 'disable_log_stats': True}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


config.json: 0.00B [00:00, ?B/s]

INFO 08-21 15:44:16 [__init__.py:711] Resolved architecture: Ernie4_5ForCausalLM
INFO 08-21 15:44:16 [__init__.py:1750] Using max model len 2048
INFO 08-21 15:44:18 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.


tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/1.61M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.2M [00:00<?, ?B/s]

added_tokens.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

generation_config.json:   0%|          | 0.00/226 [00:00<?, ?B/s]

(EngineCore_0 pid=2440) INFO 08-21 15:44:30 [core.py:636] Waiting for init message from front-end.
(EngineCore_0 pid=2440) INFO 08-21 15:44:30 [core.py:74] Initializing a V1 LLM engine (v0.10.1) with config: model='baidu/ERNIE-4.5-0.3B-PT', speculative_config=None, tokenizer='baidu/ERNIE-4.5-0.3B-PT', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=N

model.safetensors:   0%|          | 0.00/722M [00:00<?, ?B/s]

(EngineCore_0 pid=2440) INFO 08-21 15:54:05 [weight_utils.py:312] Time spent downloading weights for baidu/ERNIE-4.5-0.3B-PT: 587.481343 seconds
(EngineCore_0 pid=2440) INFO 08-21 15:54:06 [weight_utils.py:349] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


(EngineCore_0 pid=2440) INFO 08-21 15:54:06 [default_loader.py:262] Loading weights took 0.57 seconds
(EngineCore_0 pid=2440) INFO 08-21 15:54:07 [gpu_model_runner.py:2007] Model loading took 0.7042 GiB and 589.396855 seconds
(EngineCore_0 pid=2440) INFO 08-21 15:54:10 [backends.py:548] Using cache directory: /home/xigan/.cache/vllm/torch_compile_cache/5074a8da94/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=2440) INFO 08-21 15:54:10 [backends.py:559] Dynamo bytecode transform time: 3.32 s
(EngineCore_0 pid=2440) INFO 08-21 15:54:12 [backends.py:194] Cache the graph for dynamic shape for later use
(EngineCore_0 pid=2440) INFO 08-21 15:54:21 [backends.py:215] Compiling a graph for dynamic shape takes 10.89 s
(EngineCore_0 pid=2440) INFO 08-21 15:54:26 [monitor.py:34] torch.compile takes 14.21 s in total
(EngineCore_0 pid=2440) INFO 08-21 15:54:27 [gpu_worker.py:276] Ava

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████| 67/67 [00:01<00:00, 64.09it/s]


(EngineCore_0 pid=2440) INFO 08-21 15:54:29 [gpu_model_runner.py:2708] Graph capturing finished in 1 secs, took 0.81 GiB
(EngineCore_0 pid=2440) INFO 08-21 15:54:29 [core.py:214] init engine (profile, create kv cache, warmup model) took 21.78 seconds
INFO 08-21 15:54:30 [llm.py:298] Supported_tasks: ['generate']
Model loading completed, took: 620.29 seconds

Input prompt word: Please introduce the development history of artificial intelligence
Generating reply...


Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|                      | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.…


Generated result: .
<|SpecialToken:PPL|>
**Phase 1: Embryonic Period (1950s-1970s)**
- **Starting Stage**: In 1956, the US Navy's "flying experimental aircraft," the Dornier 224 prototype, successfully completed its first aerial refueling mission. This event marked the transition of AI technology from theory to practice, laying the foundation for subsequent development.
- **Early Applications**: In 1958, IBM developed the first logic-based computer program, "Intelligent System," which analyzed vast amounts of data and performed complex algorithms to solve problems. This opened up the field of AI applications.
- **Technical Breakthrough**: In 1963, "The Bell Laboratories" proposed the "Rule-Based System" theory, which explained the essence of AI as simulating human intelligent behavior patterns, laying the theoretical foundation for later deep learning models.
**Phase 2: Explosive Period (1980s-2000s)**
- **Machine Learning Emerges**: In the late 1970s, Stanford University professors R
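
To make the `temperature` and `top_p` settings used above more concrete, here is a simplified, pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy distribution. It illustrates the general technique only and is not vLLM's actual sampler; the logits are made up, and `temperature` is assumed to be greater than zero.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
for t in (0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    survivors = sorted(top_p_filter(probs, 0.9))
    print(f"temperature={t}: probs={[round(p, 3) for p in probs]}, kept tokens={survivors}")
```

A token is then drawn at random from the surviving candidates according to their renormalized probabilities, so a lower `top_p` or `temperature` concentrates the choice on the most likely tokens.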

### Conversation Mode Generation

In [25]:
# Conversation mode example
def format_conversation(messages):
    """Format conversation messages into a format that the model can understand"""
    conversation = ""
    for message in messages:
        role = message["role"]
        content = message["content"]
        if role == "user":
            conversation += f"User: {content}\n"
        elif role == "assistant":
            conversation += f"Assistant: {content}\n"
        elif role == "system":
            conversation += f"System: {content}\n"
    conversation += "Assistant: "  # Prompt the model to start replying
    return conversation

# Conversation example
messages = [
    # {"role": "system", "content": "You are a helpful AI assistant, please answer questions in Chinese."},
    {"role": "user", "content": "Hello, please introduce yourself"}
]

conversation_prompt = format_conversation(messages)
print(f"Conversation prompt word:\n{conversation_prompt}")

# Generate conversation reply
outputs = llm.generate([conversation_prompt], sampling_params)
response = outputs[0].outputs[0].text

print(f"\nConversation reply: {response}")

Conversation prompt word:
User: Hello, please introduce yourself
Assistant: 


Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|                      | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.…


Conversation reply: Hello! I am a Baidu R&D AI intelligent assistant, named Wenxin Yiyan. My name has the words "smart" and "language" hidden in it, which means I have both language capabilities and a deep understanding of knowledge, information, and social issues. I am glad to help you.
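
The hand-written `User:`/`Assistant:` format above is a simplification. The model repository ships a `chat_template.jinja` file (visible in the download log earlier), and vLLM's `LLM.chat` method applies the model's chat template to a list of messages. The following is a minimal sketch, assuming the `llm` and `sampling_params` objects created earlier; `chat_once` is a hypothetical helper name.

```python
def chat_once(llm, messages, sampling_params=None):
    """Send role/content messages through the model's chat template and return the reply.

    `llm` is expected to be a vllm.LLM instance. LLM.chat returns one
    RequestOutput per conversation, each holding the generated completions.
    """
    outputs = llm.chat(messages, sampling_params)
    return outputs[0].outputs[0].text


messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Hello, please introduce yourself"},
]

# With the objects from the cells above:
# print(chat_once(llm, messages, sampling_params))
```

Using the template the model was trained with may give more reliable replies than a custom prompt format, and it keeps offline results closer to what the Chat Completions API returns.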


## Start vLLM Service

### Basic Service Startup

```bash
# Start vLLM server (basic configuration)
vllm serve baidu/ERNIE-4.5-0.3B-PT \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code
```

### Command-Line Operations

#### 1. Check Service Status

Once the service has started, list the served models to confirm that it is responding:

```bash
# Check model list
curl http://localhost:8000/v1/models
```

#### 2. Use the Completions API

```bash
# Basic text completion
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "baidu/ERNIE-4.5-0.3B-PT",
        "prompt": "The future development trends of artificial intelligence are",
        "max_tokens": 256,
        "temperature": 0.7
    }'
```

#### 3. Use the Chat Completions API

```bash
# Chat completions API
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "baidu/ERNIE-4.5-0.3B-PT",
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "Please introduce the Python programming language"}
        ],
        "max_tokens": 512,
        "temperature": 0.7
    }'
```
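
The same request can be issued from Python without extra dependencies. This sketch builds the request with the standard library's `urllib` and extracts the reply text from the JSON response. The helper names are illustrative; the optional `api_key` argument is only relevant if the server was started with `--api-key`, in which case the key is sent as a bearer token.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, max_tokens=512,
                       temperature=0.7, api_key=None):
    """Build a POST request for the OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:  # only needed when the server enforces an API key
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of a chat completions JSON response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000",
    "baidu/ERNIE-4.5-0.3B-PT",
    [{"role": "user", "content": "Please introduce the Python programming language"}],
)

# Requires a running server:
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(extract_reply(response.read()))
```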

#### 4. Advanced Service Startup Options

```bash
# Start server (with API key authentication)
vllm serve baidu/ERNIE-4.5-0.3B-PT \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --api-key your-api-key-here

# Start server (specify GPU and memory configuration)
vllm serve baidu/ERNIE-4.5-0.3B-PT \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --gpu-memory-utilization 0.8 \
    --max-model-len 4096

# Start server (use ModelScope to download model)
export VLLM_USE_MODELSCOPE=True
vllm serve baidu/ERNIE-4.5-0.3B-PT \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code
```
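
When the server is launched from a script, the variants above can be assembled from one set of options. The sketch below only builds the argument list and environment, using the flags and the `VLLM_USE_MODELSCOPE` variable shown above; `build_serve_command` and `build_env` are hypothetical helpers, not part of vLLM.

```python
import os


def build_serve_command(model, host="0.0.0.0", port=8000, trust_remote_code=True,
                        gpu_memory_utilization=None, max_model_len=None, api_key=None):
    """Assemble the `vllm serve` argument list from the options shown above."""
    command = ["vllm", "serve", model, "--host", host, "--port", str(port)]
    if trust_remote_code:
        command.append("--trust-remote-code")
    if gpu_memory_utilization is not None:
        command += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    if max_model_len is not None:
        command += ["--max-model-len", str(max_model_len)]
    if api_key:
        command += ["--api-key", api_key]
    return command


def build_env(use_modelscope=False):
    """Copy the current environment, optionally enabling ModelScope downloads."""
    env = dict(os.environ)
    if use_modelscope:
        env["VLLM_USE_MODELSCOPE"] = "True"
    return env


command = build_serve_command("baidu/ERNIE-4.5-0.3B-PT",
                              gpu_memory_utilization=0.8, max_model_len=4096)
print(" ".join(command))
```

The result can be passed to `subprocess.Popen(command, env=build_env())` to start the server. Passing the arguments as a list avoids shell quoting problems with values such as API keys.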

#### 5. Service Management Commands

```bash
# View vLLM help information
vllm serve --help

# Check vLLM version
vllm --version

# Stop service (if running in foreground, use Ctrl+C)
# If running in background, use the following command to find and stop the process:
ps aux | grep vllm
kill <process_id>
```
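
If the server was launched from Python, the `ps`/`kill` steps above can be replaced by holding on to the process handle. This is a minimal sketch of a graceful stop: it sends a termination request first and kills the process only if that request is ignored. A short-lived stand-in child process is used so the example runs without vLLM; the 10-second grace period is an assumption.

```python
import subprocess
import sys


def stop_process(process: subprocess.Popen, grace_seconds: float = 10.0) -> int:
    """Ask a child process to exit, escalating to a forced kill if it does not."""
    process.terminate()
    try:
        return process.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        process.kill()
        return process.wait()


# Stand-in child process so the sketch runs without vLLM installed;
# in practice this would be the Popen object for `vllm serve ...`.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(child)
print(f"Child exited with code {exit_code}")
```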

#### 6. Test the Connection

```bash
# Simple health check
curl -f http://localhost:8000/health || echo "Service not started"

# Test basic response
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "baidu/ERNIE-4.5-0.3B-PT",
        "messages": [{"role": "user", "content": "你好"}],
        "max_tokens": 50
    }' | jq '.choices[0].message.content'
```

> **Note**:
> - Make sure vLLM is installed: `pip install vllm`
> - Startup may take a few minutes while the model loads
> - In production, run the service under a process manager (e.g., systemd or supervisor)
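
Because startup can take minutes, scripts that depend on the service usually need to wait for it. The sketch below polls the `/health` endpoint used above until it answers or a time budget runs out. The default 600-second budget and 5-second interval are assumptions, and `probe` and `sleep` are injectable so the loop can be exercised without a server.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` succeeds with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, max_wait=600.0, interval=5.0, probe=http_ok, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `max_wait` seconds have been spent waiting."""
    waited = 0.0
    while True:
        if probe(url):
            return True
        if waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
```

A typical call would be `wait_until_ready("http://localhost:8000/health")`, which returns `True` once the server responds and `False` if the budget is exhausted first.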

## References

- [vLLM official documentation](https://docs.vllm.ai/)
- [ERNIE-4.5-0.3B model page](https://huggingface.co/baidu/ERNIE-4.5-0.3B-PT)
- [ERNIE-4.5 official blog](https://yiyan.baidu.com/blog/posts/ernie4.5/)
- [vLLM ERNIE support PR](https://github.com/vllm-project/vllm/pull/20220)
- [OpenAI API documentation](https://platform.openai.com/docs/api-reference)

---

*Note: vLLM support for ERNIE-4.5 models may require the latest version of vLLM. If you encounter compatibility issues, please check the vLLM version or refer to the official documentation for the latest information.*