25 changes: 4 additions & 21 deletions docs/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -18,15 +18,10 @@ The minimum number of cards required for deployment on the following hardware is

Installation process reference documentation [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md)

> ⚠️ Precautions:
> - FastDeploy only supports models in Paddle format – please ensure to download models with the `-Paddle` file extension.
> - The model name will trigger an automatic download. If the model has already been downloaded, you can directly use the absolute path to the model's download location.
## 2. How to Use
### 2.1 Basic: Launching the Service
**Example 1:** Deploying a 32K Context Service on a Single RTX 4090 GPU
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
@@ -38,14 +33,11 @@ python -m fastdeploy.entrypoints.openai.api_server \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
--quantization wint4
```
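
Once the service is up, it can be exercised with an OpenAI-style chat completion request. The sketch below is illustrative only: it assumes the server exposes the standard `/v1/chat/completions` route on the port configured above, and the image URL is a placeholder to replace with your own.

```shell
# Illustrative request against the service launched above; adjust host, port, and image URL as needed.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }]
  }'
```
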
**Example 2:** Deploying a 128K Context Service on Dual H800 GPUs
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
@@ -57,14 +49,10 @@ python -m fastdeploy.entrypoints.openai.api_server \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
--quantization wint4
```

> ⚠️ For versions 2.1 and above, the new scheduler needs to be enabled via an environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`. Otherwise, some requests may be truncated before reaching the maximum length or return empty results.
The example above is a configuration that runs stably while also delivering relatively good performance. If you have further requirements for precision or performance, please continue reading the content below.
### 2.2 Advanced: How to Achieve Better Performance

@@ -92,8 +80,8 @@ An example is a set of configurations that can run stably while also delivering

#### 2.2.2 Chunked Prefill
- **Parameters:** `--enable-chunked-prefill`
- **Description:** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
- **Other relevant configurations**:
- **Description:** Enabling `chunked prefill` can reduce peak GPU memory usage and improve service throughput. It is **enabled by default** since version 2.2; for versions prior to 2.2, enable it manually (see the 2.1 best practices documentation; a minimal pre-2.2 launch sketch follows the configuration notes below).
- **Relevant configurations**:

`--max-num-batched-tokens`: Limit the maximum number of tokens per chunk, with a recommended setting of 384.
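
For FastDeploy versions earlier than 2.2, where chunked prefill is not enabled by default, the following minimal sketch shows how to turn it on at launch; every flag comes from Example 1 above, and only the last two are specific to chunked prefill.

```shell
# Sketch for FastDeploy < 2.2 only: chunked prefill must be enabled explicitly.
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --quantization wint4 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 384
```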

@@ -115,12 +103,7 @@ An example is a set of configurations that can run stably while also delivering
- **Description:** Rejection sampling draws samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby speeding up sampling, which can improve inference performance.
- **Recommendation:** This is a relatively aggressive optimization strategy that affects the results, and we are still conducting comprehensive validation of its impact. If you have high performance requirements and can accept potential compromises in results, you may consider enabling this strategy.

> **Attention Hyperparameter:**`FLAGS_max_partition_size=1024`
- **Description:** The hyperparameters for the Append Attention (default) backend have been tested on commonly used datasets, and our results show that setting it to 1024 can significantly improve decoding speed, especially in long-text scenarios.
- **Recommendation:** In the future, it will be modified to an automatic adjustment mechanism. If you have high performance requirements, you may consider enabling it.

## 3. FAQ
**Note:** Deploying multimodal services requires adding parameters to the configuration `--enable-mm`.

### 3.1 Out of Memory
If the service prompts "Out of Memory" during startup, please try the following solutions:
22 changes: 4 additions & 18 deletions docs/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
@@ -15,15 +15,10 @@ The minimum number of cards required for deployment on the following hardware is

Installation process reference documentation [FastDeploy GPU Install](../get_started/installation/nvidia_gpu.md)

> ⚠️ Precautions:
> - FastDeploy only supports models in Paddle format – please ensure to download models with the `-Paddle` file extension.
> - The model name will trigger an automatic download. If the model has already been downloaded, you can directly use the absolute path to the model's download location.
## 2. How to Use
### 2.1 Basic: Launching the Service
**Example 1:** Deploying a 128K context service on 8x H800 GPUs.
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
--port 8180 \
@@ -34,15 +29,11 @@ python -m fastdeploy.entrypoints.openai.api_server \
--max-num-seqs 16 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.8 \
--enable-chunked-prefill \
--gpu-memory-utilization 0.85 \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
--quantization wint4
```
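
As with the smaller model, the deployed endpoint can be queried through the OpenAI-compatible API. The streaming sketch below is illustrative only: it assumes the default `/v1/chat/completions` route on the port configured above, and the image URL is a placeholder.

```shell
# Illustrative streaming request; replace the image URL and adjust host/port for your deployment.
curl -sN http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "baidu/ERNIE-4.5-VL-424B-A47B-Paddle",
    "stream": true,
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
        {"type": "text", "text": "What is in this image?"}
      ]
    }]
  }'
```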

> ⚠️ For versions 2.1 and above, the new scheduler needs to be enabled via an environment variable `ENABLE_V1_KVCACHE_SCHEDULER=1`. Otherwise, some requests may be truncated before reaching the maximum length or return empty results.
The example above is a configuration that runs stably while also delivering relatively good performance. If you have further requirements for precision or performance, please continue reading the content below.
### 2.2 Advanced: How to Achieve Better Performance

@@ -70,8 +61,8 @@ An example is a set of configurations that can run stably while also delivering

#### 2.2.2 Chunked Prefill
- **Parameters:** `--enable-chunked-prefill`
- **Description:** Enabling `chunked prefill` can **reduce peak GPU memory usage** and **improve service throughput**.
- **Other relevant configurations**:
- **Description:** Enabling `chunked prefill` can reduce peak GPU memory usage and improve service throughput. It is **enabled by default** since version 2.2; for versions prior to 2.2, enable it manually (see the 2.1 best practices documentation; a minimal pre-2.2 sketch follows the configuration notes below).
- **Relevant configurations**:

`--max-num-batched-tokens`: Limit the maximum number of tokens per chunk, with a recommended setting of 384.
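
For FastDeploy versions earlier than 2.2, a minimal sketch of enabling chunked prefill at launch is shown below; combine these flags with the full launch command from Section 2.1, whose other settings (parallelism, context length, etc.) are unchanged.

```shell
# Sketch for FastDeploy < 2.2 only: chunked prefill must be enabled explicitly.
# Use together with the remaining flags from the Section 2.1 example.
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
  --enable-chunked-prefill \
  --max-num-batched-tokens 384 \
  --quantization wint4
```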

@@ -93,12 +84,7 @@ An example is a set of configurations that can run stably while also delivering
- **Description:** Rejection sampling draws samples from a proposal distribution that is easy to sample from, avoiding explicit sorting and thereby speeding up sampling, which can improve inference performance.
- **Recommendation:** This is a relatively aggressive optimization strategy that affects the results, and we are still conducting comprehensive validation of its impact. If you have high performance requirements and can accept potential compromises in results, you may consider enabling this strategy.

> **Attention Hyperparameter:**`FLAGS_max_partition_size=1024`
- **Description:** The hyperparameters for the Append Attention (default) backend have been tested on commonly used datasets, and our results show that setting it to 1024 can significantly improve decoding speed, especially in long-text scenarios.
- **Recommendation:** In the future, it will be modified to an automatic adjustment mechanism. If you have high performance requirements, you may consider enabling it.

## 3. FAQ
**Note:** Deploying multimodal services requires adding parameters to the configuration `--enable-mm`.

### 3.1 Out of Memory
If the service prompts "Out of Memory" during startup, please try the following solutions:
24 changes: 4 additions & 20 deletions docs/zh/best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md
@@ -17,15 +17,10 @@

安装流程参考文档 [FastDeploy GPU 安装](../get_started/installation/nvidia_gpu.md)

> ⚠️ 注意事项
> - FastDeploy只支持Paddle格式的模型,注意下载Paddle后缀的模型
> - 使用模型名称会自动下载模型,如果已经下载过模型,可以直接使用模型下载位置的绝对路径

## 二、如何使用
### 2.1 基础:启动服务
**示例1:** 4090上单卡部署32K上下文的服务
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
@@ -37,14 +32,11 @@ python -m fastdeploy.entrypoints.openai.api_server \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \

Collaborator: Could the docs note that chunked prefill is enabled by default in version 2.2, and recommend enabling it explicitly for versions earlier than 2.2?

Collaborator (Author): Fixed.

--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
--quantization wint4
```
**示例2:** H800上双卡部署128K上下文的服务
```shell
export ENABLE_V1_KVCACHE_SCHEDULER=1
python -m fastdeploy.entrypoints.openai.api_server \
--model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
--port 8180 \
@@ -56,12 +48,9 @@ python -m fastdeploy.entrypoints.openai.api_server \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.9 \
--enable-chunked-prefill \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
--quantization wint4
```
> ⚠️ 2.1及以上版本需要通过环境变量开启新调度器 `ENABLE_V1_KVCACHE_SCHEDULER=1`,否则可能会有部分请求最大长度前截断或返空。

示例是可以稳定运行的一组配置,同时也能得到比较好的性能。
如果对精度、性能有进一步的要求,请继续阅读下面的内容。
@@ -91,9 +80,9 @@ python -m fastdeploy.entrypoints.openai.api_server \

#### 2.2.2 Chunked Prefill
- **参数:** `--enable-chunked-prefill`
- **用处:** 开启 `chunked prefill` 可**降低显存峰值**并**提升服务吞吐**
- **用处:** 开启 `chunked prefill` 可降低显存峰值并提升服务吞吐。2.2版本已经**默认开启**,2.2之前需要手动开启,参考2.1的最佳实践文档

- **其他相关配置**:
- **相关配置**:

`--max-num-batched-tokens`:限制每个chunk的最大token数量。多模场景下每个chunk会向上取整保持图片的完整性,因此实际每次推理的总token数会大于该值。我们推荐设置为384。

@@ -115,12 +104,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
- **描述**:拒绝采样即从一个易于采样的提议分布(proposal distribution)中生成样本,避免显式排序从而达到提升采样速度的效果,可以提升推理性能。
- **推荐**:这是一种影响效果的较为激进的优化策略,我们还在全面验证影响。如果对性能有较高要求,也可以接受对效果的影响时可以尝试开启。

> **Attention超参:**`FLAGS_max_partition_size=1024`
- **描述**:Append Attntion(默认)后端的超参,我们在常用数据集上的测试结果表明,设置为1024后可以大幅提升解码速度,尤其是长文场景。
- **推荐**:未来会修改为自动调整的机制。如果对性能有较高要求可以尝试开启。

## 三、常见问题FAQ
**注意:** 使用多模服务部署需要在配置中添加参数 `--enable-mm`。

### 3.1 显存不足(OOM)
如果服务启动时提示显存不足,请尝试以下方法:
19 changes: 4 additions & 15 deletions docs/zh/best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md
@@ -15,10 +15,6 @@

安装流程参考文档 [FastDeploy GPU 安装](../get_started/installation/nvidia_gpu.md)

> ⚠️ 注意事项
> - FastDeploy只支持Paddle格式的模型,注意下载Paddle后缀的模型
> - 使用模型名称会自动下载模型,如果已经下载过模型,可以直接使用模型下载位置的绝对路径

## 二、如何使用
### 2.1 基础:启动服务
**示例1:** H800上8卡部署128K上下文的服务
@@ -33,13 +29,10 @@ python -m fastdeploy.entrypoints.openai.api_server \
--max-num-seqs 16 \
--limit-mm-per-prompt '{"image": 100, "video": 100}' \
--reasoning-parser ernie-45-vl \
--gpu-memory-utilization 0.8 \
--enable-chunked-prefill \
--gpu-memory-utilization 0.85 \
--max-num-batched-tokens 384 \
--quantization wint4 \
--enable-mm
--quantization wint4
```
> ⚠️ 2.1及以上版本需要通过环境变量开启新调度器 `ENABLE_V1_KVCACHE_SCHEDULER=1`,否则可能会有部分请求最大长度前截断或返空。

示例是可以稳定运行的一组配置,同时也能得到比较好的性能。
如果对精度、性能有进一步的要求,请继续阅读下面的内容。
@@ -68,9 +61,9 @@ python -m fastdeploy.entrypoints.openai.api_server \

#### 2.2.2 Chunked Prefill
- **参数:** `--enable-chunked-prefill`
- **用处:** 开启 `chunked prefill` 可**降低显存峰值**并**提升服务吞吐**
- **用处:** 开启 `chunked prefill` 可降低显存峰值并提升服务吞吐。2.2版本已经**默认开启**,2.2之前需要手动开启,参考2.1的最佳实践文档

- **其他相关配置**:
- **相关配置**:

`--max-num-batched-tokens`:限制每个chunk的最大token数量。多模场景下每个chunk会向上取整保持图片的完整性,因此实际每次推理的总token数会大于该值。推荐设置为384。

@@ -92,10 +85,6 @@ python -m fastdeploy.entrypoints.openai.api_server \
- **描述**:拒绝采样即从一个易于采样的提议分布(proposal distribution)中生成样本,避免显式排序从而达到提升采样速度的效果,可以提升推理性能。
- **推荐**:这是一种影响效果的较为激进的优化策略,我们还在全面验证影响。如果对性能有较高要求,也可以接受对效果的影响时可以尝试开启。

> **Attention超参:**`FLAGS_max_partition_size=1024`
- **描述**:Append Attntion(默认)后端的超参,我们在常用数据集上的测试结果表明,设置为1024后可以大幅提升解码速度,尤其是长文场景。
- **推荐**:未来会修改为自动调整的机制。如果对性能有较高要求可以尝试开启。

## 三、常见问题FAQ
**注意:** 使用多模服务部署需要在配置中添加参数 `--enable-mm`。
