# Tutorial on Deploying the ERNIE-4.5-0.3B Model with SGLang

## Introduction to SGLang

SGLang (Structured Generation Language) is a high-performance serving framework for large language models, jointly developed by UC Berkeley and Stanford. By co-designing the backend runtime and the frontend programming language, it aims for faster inference and more flexible control than conventional serving frameworks. SGLang is widely recognized in both academia and industry and is used in production by many well-known companies.

| Feature Category | Specific Functions | Technical Advantages |
|---------|---------|----------|
| **Backend Runtime Optimization** | RadixAttention Prefix Cache | Reduces duplicate computations by sharing prefixes, improving batch processing efficiency |
| | Zero-Overhead CPU Scheduler | Eliminates scheduling delays and improves GPU utilization |
| | Prefill-Decode Separation | Runs the compute-intensive prefill phase and the memory-intensive decode phase separately to optimize resource utilization |
| | Speculative Decoding | Drafts multiple candidate tokens and verifies them in parallel, accelerating generation |
| | Continuous Batching | Dynamic batch management to improve throughput |
| **Frontend Programming API** | Structured Generation | Supports structured output formats such as JSON and regular expressions |
| | Chained Calls | Simplifies programming for complex conversations and multi-turn interactions |
| | Parallel Execution | Native support for concurrent request processing |
| **Model Support** | Generative Models | Mainstream models such as ERNIE, Qwen, and DeepSeek |
| | Quantization Support | Quantization methods such as FP4/FP8/INT4/AWQ/GPTQ |
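
To make the prefix-cache row more concrete, here is a toy sketch of prefix sharing. It is not SGLang's implementation: RadixAttention manages KV-cache entries in a radix tree, whereas the hypothetical `PrefixCache` class below only counts how many leading tokens of a new request were already seen in earlier requests, using a plain trie over made-up token IDs.

```python
class PrefixCache:
    """Toy trie over token IDs; illustrative only, not SGLang's RadixAttention."""

    def __init__(self):
        self.root = {}

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node = self.root
        reused = 0
        for tok in tokens:
            if tok in node:
                reused += 1      # this token's work could be reused
            else:
                node[tok] = {}   # first time this prefix is seen: store it
            node = node[tok]
        return reused


cache = PrefixCache()
system = [101, 7, 7, 42]  # hypothetical token IDs for a shared system prompt
first = cache.match_and_insert(system + [5, 6])   # nothing cached yet
second = cache.match_and_insert(system + [8, 9])  # shares the 4-token prefix
print(first, second)  # prints: 0 4
```

In a real server the reused prefix corresponds to attention state that does not have to be recomputed, which is where the savings for requests with a common system prompt come from.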

## ERNIE-4.5 Model Support

SGLang natively supports the ERNIE-4.5 series of models, including:
- **ERNIE-4.5-0.3B** - Lightweight, dense model
- **ERNIE-4.5-21B-A3B** - MoE model
- **ERNIE-4.5-300B-A47B** - Large MoE model

## Environment Preparation

### Installing SGLang
```bash
git clone -b v0.5.0rc2 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"
```

## Deployment Method Comparison

| Deployment Method | Applicable Scenarios | Advantages | Disadvantages |
|---------|---------|------|------|
| **Command Line Server** | Production deployment, API service | High performance, good stability, concurrency support | Requires server management |
| **Python Script** | Batch processing, automated tasks | High flexibility, easy integration | Requires programming knowledge |

## Deploying the ERNIE-4.5-0.3B Model

### 1. Environment Preparation and Installation

```python
# Check the installation
import sglang as sgl
import torch

print(f"SGLang Version: {sgl.__version__}")
print(f"PyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name()}")
```

Output:

```text
SGLang Version: 0.5.0rc2
PyTorch Version: 2.8.0+cu128
CUDA Available: True
GPU Device: NVIDIA GeForce RTX 4090
```


### 2. Launching the Local SGLang Server

We use `baidu/ERNIE-4.5-0.3B-PT` as an example to deploy the model on a local server.

```bash
python -m sglang.launch_server \
    --model-path baidu/ERNIE-4.5-0.3B-PT \
    --host 0.0.0.0 \
    --port 30000 \
    --trust-remote-code
```

**Note**:
- `--host 0.0.0.0` makes the server listen on all network interfaces, so it can be reached from other machines, such as a Windows host
- The first run automatically downloads the model from Hugging Face; configuring a mirror can significantly speed up the download
- After startup, you will see "Launched server at http://0.0.0.0:30000" in the console

```python
import requests

# Query the running server for basic model information
model_info = requests.get("http://127.0.0.1:30000/get_model_info")
if model_info.status_code == 200:
    info = model_info.json()
    print("\n📋 Model Information:")
    print(f"Model Path: {info.get('model_path', 'N/A')}")
```

Output:

```text
📋 Model Information:
Model Path: baidu/ERNIE-4.5-0.3B-PT
```
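
Because the model has to be downloaded and loaded before the server answers, a script that starts right after launch may hit a connection error. A minimal sketch of a readiness check follows, assuming the same `/get_model_info` endpoint used above. The helper name `wait_for_server` and its defaults are hypothetical, and it uses the standard library's `urllib` instead of `requests` so that it has no extra dependencies.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url, timeout_s=120.0, interval_s=2.0):
    """Poll GET {base_url}/get_model_info until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    url = base_url.rstrip("/") + "/get_model_info"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet (connection refused, timeout, ...)
        time.sleep(interval_s)
    return False
```

Calling `wait_for_server("http://127.0.0.1:30000")` before sending the first request returns `True` once the server responds and `False` if it does not come up within the timeout.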


### 3. Using OpenAI-Compatible API Calls

```python
import openai

# Configure the client
client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # SGLang does not require an API key
)

# Send a chat request
response = client.chat.completions.create(
    model="baidu/ERNIE-4.5-0.3B-PT",
    messages=[
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    max_tokens=100,
    temperature=0.3
)

print(f"Model response: {response.choices[0].message.content}")
```

Output:

```text
Model response: Hello! I'm an AI assistant, and I'm happy to help you. Is there anything I can do for you? Like answering questions, solving problems, or just chatting about everyday things? 😊
```
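
For longer replies it is often nicer to show tokens as they arrive. The sketch below assumes the server honors the standard `stream=True` option of the OpenAI chat completions API. The helper `stream_chat` is a hypothetical name; it takes an already configured client, such as the `client` object created above, rather than creating its own.

```python
def stream_chat(client, model, prompt, max_tokens=100, temperature=0.3):
    """Stream a chat completion, echo it as it arrives, and return the full text."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        temperature=temperature,
        stream=True,  # ask the server to send the reply incrementally
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta.content
        if delta:  # role-only and final chunks carry no text
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)
```

With the server from step 2 running, `stream_chat(client, "baidu/ERNIE-4.5-0.3B-PT", "Hello, please introduce yourself.")` prints the reply piece by piece and returns the assembled string.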


### 4. Using the SGLang Native API

```python
import sglang as sgl

# Configure the backend
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Define the generation function
@sgl.function
def chat_with_ernie(s, user_message):
    s += sgl.user(user_message)
    s += sgl.assistant(sgl.gen("response", max_tokens=100))

# Call the model
state = chat_with_ernie.run(user_message="Please introduce Beijing to us!")
print(f"Model response: {state['response']}")
```

Output:

```text
Model response: Welcome to Beijing! As China's capital, Beijing has many unique and bustling places. (Gestures) Here, you can visit famous attractions such as the Palace Museum, the Temple of Heaven, and Beihai Park. You can see historical sites while experiencing the ancient and modern charm of Beijing.
```
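
The same server can also be called over plain HTTP without the `sglang` Python package. The sketch below assumes SGLang's native `/generate` endpoint, which takes a `text` field plus a `sampling_params` object and returns JSON with a `text` field. The helper names `build_generate_payload` and `generate` are hypothetical. Note that this endpoint works on raw text, so, as far as I understand it, no chat template is applied to the prompt.

```python
import json
import urllib.request


def build_generate_payload(prompt, max_new_tokens=100, temperature=0.3):
    """Build the JSON body for the native /generate endpoint."""
    return {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def generate(base_url, prompt, **sampling):
    """POST the payload to {base_url}/generate and return the generated text."""
    body = json.dumps(build_generate_payload(prompt, **sampling)).encode("utf-8")
    req = urllib.request.Request(
        base_url.rstrip("/") + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))["text"]
```

For example, `generate("http://localhost:30000", "The capital of China is", max_new_tokens=16)` sends a raw completion request to the server launched in step 2.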


## Summary

This tutorial showed how to deploy and call an ERNIE-4.5 model with SGLang, using the 0.3B model as an example: installing SGLang, launching a local server, and querying it through both the OpenAI-compatible API and the SGLang native API. As an LLM serving framework, SGLang combines a fast runtime with a simple programming interface, which makes it a practical way to use ERNIE models in your own projects.

### Next Steps

| Learning Directions | Recommended Resources | Application Scenarios |
|---------|---------|----------|
| **Advanced Features** | [SGLang Official Documentation](https://docs.sglang.ai/) | Production Deployment and Performance Optimization |
| **Model Fine-tuning** | ERNIE-Tutorial Training Tutorial | Customized Model Development |
| **Structured Output** | SGLang Structured Document Generation | JSON Generation and Code Generation |

### Contact Us

If you encounter any issues or have any suggestions, please contact us through the following channels:

- **GitHub Issues**: Submit issues in the project repository
- **WeChat**: G_Fuji
- **Community Forum**: Participate in technical discussions