2 changes: 1 addition & 1 deletion README.md
@@ -73,7 +73,7 @@ Learn how to use FastDeploy through our documentation:

## Supported Models

-Learn how to download models, enable support for Torch weights, and calculate minimum resource requirements, and more:
+Learn how to download models, enable using the torch format, and more:
- [Full Supported Models List](./docs/supported_models.md)

## Advanced Usage
2 changes: 1 addition & 1 deletion README_CN.md
@@ -71,7 +71,7 @@ FastDeploy supports deployment on **NVIDIA GPU**, **Kunlunxin XPU**

## Supported Models

-Learn from our documentation how to download models, how to enable Torch weights, how to calculate minimum deployment resources, and more
+Learn from our documentation how to download models, how to enable the torch format, and more
- [Supported Models List](./docs/zh/supported_models.md)

## Advanced Usage
6 changes: 3 additions & 3 deletions docs/get_started/installation/nvidia_gpu.md
@@ -13,14 +13,14 @@ The following installation methods are available when your environment meets the
**Notice**: The pre-built image only supports SM80/90 GPUs (e.g. H800/A800); if you are deploying on SM86/89 GPUs (L40/4090/L20), please reinstall ```fastdeploy-gpu``` after you create the container.

```shell
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.1.1
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.0
```

## 2. Pre-built Pip Installation

First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
-python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```

Then install fastdeploy. **Do not install from PyPI**. Use the following methods instead:
@@ -58,7 +58,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .

First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
```shell
-python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```

Then clone the source code and build:
4 changes: 2 additions & 2 deletions docs/index.md
@@ -24,8 +24,8 @@
|QWEN3|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
|QWEN-VL|BF16/WINT8/FP8|⛔|✅|✅|🚧|⛔|128K|
|QWEN2|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
-|DEEPSEEK-V3|BF16/WINT4|⛔|✅||🚧|✅|128K|
-|DEEPSEEK-R1|BF16/WINT4|⛔|✅||🚧|✅|128K|
+|DEEPSEEK-V3|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|
+|DEEPSEEK-R1|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|

```
✅ Supported 🚧 In Progress ⛔ No Plan
23 changes: 1 addition & 22 deletions docs/supported_models.md
@@ -38,7 +38,7 @@ These models accept text input.
|⭐QWEN3|BF16/WINT8/FP8|Qwen/qwen3-32B;<br>Qwen/qwen3-14B;<br>Qwen/qwen3-8B;<br>Qwen/qwen3-4B;<br>Qwen/qwen3-1.7B;<br>[Qwen/qwen3-0.6B](./get_started/quick_start_qwen.md), etc.|
|⭐QWEN2.5|BF16/WINT8/FP8|Qwen/qwen2.5-72B;<br>Qwen/qwen2.5-32B;<br>Qwen/qwen2.5-14B;<br>Qwen/qwen2.5-7B;<br>Qwen/qwen2.5-3B;<br>Qwen/qwen2.5-1.5B;<br>Qwen/qwen2.5-0.5B, etc.|
|⭐QWEN2|BF16/WINT8/FP8|Qwen/qwen2-72B;<br>Qwen/qwen2-7B;<br>Qwen/qwen2-1.5B;<br>Qwen/qwen2-0.5B;<br>Qwen/QwQ-32, etc.|
-|DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
+|DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|

## Multimodal Language Models

@@ -49,25 +49,4 @@ These models accept multi-modal inputs (e.g., images and text).
| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br>&emsp;[quick start](./get_started/ernie-4.5-vl.md) &emsp; [best practice](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br>&emsp;[quick start](./get_started/quick_start_vl.md) &emsp; [best practice](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;|
| QWEN-VL |BF16/WINT4/FP8| Qwen/Qwen2.5-VL-72B-Instruct;<br>Qwen/Qwen2.5-VL-32B-Instruct;<br>Qwen/Qwen2.5-VL-7B-Instruct;<br>Qwen/Qwen2.5-VL-3B-Instruct|

-## Minimum Resource Deployment Instruction
-
-There is no universal formula for minimum deployment resources; it depends on both context length and quantization method. We recommend estimating the required GPU memory using the following formula:
-```
-Required GPU Memory = Number of Parameters × Quantization Byte factor
-```
-> (The factor list is provided below.)
-
-And the final number of GPUs depends on:
-```
-Number of GPUs = Total Memory Requirement ÷ Memory per GPU
-```
-
-| Quantization Method | Bytes per Parameter factor |
-| :--- | :--- |
-|BF16 |2 |
-|FP8 |1 |
-|WINT8 |1 |
-|WINT4 |0.5 |
-|W4A8C8 |0.5 |
-
More models are being supported. You can submit requests for new model support via [Github Issues](https://github.com/PaddlePaddle/FastDeploy/issues).
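The rule of thumb removed above is still handy as a quick estimate. A minimal sketch in Python; the function name and the example model/GPU values are illustrative assumptions, not part of the docs:

```python
import math

def estimate_gpus(num_params_billion: float, bytes_per_param: float, gpu_memory_gb: float) -> int:
    """Required memory (GB) = parameters (in billions) x quantization byte factor;
    GPU count = total requirement / per-GPU memory, rounded up.
    KV cache and activation overhead are not included in this estimate."""
    required_gb = num_params_billion * bytes_per_param
    return math.ceil(required_gb / gpu_memory_gb)

# Illustrative: a 300B-parameter model in WINT4 (0.5 bytes/param) on 80 GB GPUs.
print(estimate_gpus(300, 0.5, 80))  # 300 x 0.5 = 150 GB -> 2 GPUs
```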
6 changes: 3 additions & 3 deletions docs/zh/get_started/installation/nvidia_gpu.md
@@ -15,15 +15,15 @@
**Note**: The image below only supports SM 80/90 architecture GPUs (A800/H800, etc.); if you are deploying on SM 86/89 architecture GPUs such as L20/L40/4090, uninstall ```fastdeploy-gpu``` after creating the container and reinstall the `fastdeploy-gpu` package built for SM 86/89 architectures as specified in the documentation below.

``` shell
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.1.1
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.0
```

## 2. Pre-built Pip Installation

First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)

``` shell
-python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```

Then install fastdeploy. **Do not install from the PyPI source**; use the following methods instead
@@ -64,7 +64,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/)

``` shell
-python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
```

Then clone the source code and build:
12 changes: 4 additions & 8 deletions docs/zh/get_started/quick_start_qwen.md
@@ -7,13 +7,12 @@
- CUDNN >= 9.5
- Linux X86_64
- Python >= 3.10
-- The hardware meets the minimum configuration required to run the model; see the [Supported Models document](supported_models.md)

To enable quick deployment across different hardware, this document uses the ```Qwen3-0.6b``` model as its example; it can be deployed on most hardware.

For FastDeploy installation, see the [Installation documentation](./installation/README.md).
## 1. Launch the Service
-After installing FastDeploy, run the following command in the terminal to start the service; for launch-command configuration, see the [Parameter Reference](parameters.md)
+After installing FastDeploy, run the following command in the terminal to start the service; for launch-command configuration, see the [Parameter Reference](../parameters.md)

> ⚠️ **Note:**
> When using a HuggingFace model (torch format), you need to enable `--load_choices "default_v1"`
@@ -30,14 +29,14 @@ python -m fastdeploy.entrypoints.openai.api_server \
--load_choices "default_v1"
```
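For orientation, a full launch command assembled from the flags discussed on this page might look like the sketch below; the port and the length/concurrency values are illustrative assumptions, not the collapsed diff content:

```shell
python -m fastdeploy.entrypoints.openai.api_server \
    --model Qwen/Qwen3-0.6B \
    --port 8180 \
    --max-model-len 32768 \
    --max-num-seqs 32 \
    --load_choices "default_v1"
```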

->💡 Note: if the path given by ```--model``` does not exist as a subdirectory of the current directory, FastDeploy will check AIStudio for a prebuilt model with the given name (e.g. ```Qwen/Qwen3-0.6B```) and, if one exists, download it automatically. The default download path is ```~/xx```. For details and configuration of automatic model downloads, see [Model Download](supported_models.md).
+>💡 Note: if the path given by ```--model``` does not exist as a subdirectory of the current directory, FastDeploy will check AIStudio for a prebuilt model with the given name (e.g. ```Qwen/Qwen3-0.6B```) and, if one exists, download it automatically. The default download path is ```~/xx```. For details and configuration of automatic model downloads, see [Model Download](../supported_models.md).
```--max-model-len``` is the maximum number of tokens the deployed service supports.
```--max-num-seqs``` is the maximum number of concurrent requests the deployed service supports.

**Related Documentation**

-- [Service deployment configuration](online_serving/README.md)
-- [Service monitoring metrics](online_serving/metrics.md)
+- [Service deployment configuration](../online_serving/README.md)
+- [Service monitoring metrics](../online_serving/metrics.md)

## 2. Send a Request to the Service

@@ -92,6 +91,3 @@ for chunk in response:
print(chunk.choices[0].delta.content, end='')
print('\n')
```
-📌
-⚙️
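The streaming snippet above shows only the tail of the client example. A self-contained version might look like the following sketch; the base_url, port, api_key, and model values are illustrative assumptions for a locally launched service, not the collapsed content:

```python
import openai

# Assumed local endpoint; FastDeploy exposes an OpenAI-compatible API.
client = openai.Client(base_url="http://127.0.0.1:8180/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",  # placeholder; the server answers with its loaded model
    messages=[{"role": "user", "content": "Introduce the Qwen3 model family."}],
    stream=True,  # yield chunks as they are generated
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end='')
print('\n')
```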
4 changes: 2 additions & 2 deletions docs/zh/index.md
@@ -24,8 +24,8 @@
|QWEN3|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
|QWEN-VL|BF16/WINT8/FP8|⛔|✅|✅|🚧|⛔|128K|
|QWEN2|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
-|DEEPSEEK-V3|BF16/WINT4|⛔|✅||🚧|✅|128K|
-|DEEPSEEK-R1|BF16/WINT4|⛔|✅||🚧|✅|128K|
+|DEEPSEEK-V3|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|
+|DEEPSEEK-R1|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|

```
✅ Supported 🚧 In Progress ⛔ No Plan
15 changes: 1 addition & 14 deletions docs/zh/supported_models.md
@@ -36,7 +36,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
|⭐QWEN3|BF16/WINT8/FP8|Qwen/qwen3-32B;<br>Qwen/qwen3-14B;<br>Qwen/qwen3-8B;<br>Qwen/qwen3-4B;<br>Qwen/qwen3-1.7B;<br>[Qwen/qwen3-0.6B](./get_started/quick_start_qwen.md), etc.|
|⭐QWEN2.5|BF16/WINT8/FP8|Qwen/qwen2.5-72B;<br>Qwen/qwen2.5-32B;<br>Qwen/qwen2.5-14B;<br>Qwen/qwen2.5-7B;<br>Qwen/qwen2.5-3B;<br>Qwen/qwen2.5-1.5B;<br>Qwen/qwen2.5-0.5B, etc.|
|⭐QWEN2|BF16/WINT8/FP8|Qwen/qwen2-72B;<br>Qwen/qwen2-7B;<br>Qwen/qwen2-1.5B;<br>Qwen/qwen2-0.5B;<br>Qwen/QwQ-32, etc.|
-|DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
+|DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|

## Multimodal Language Models

@@ -47,17 +47,4 @@ python -m fastdeploy.entrypoints.openai.api_server \
| ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br>&emsp;[quick start](./get_started/ernie-4.5-vl.md) &emsp; [best practice](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br>&emsp;[quick start](./get_started/quick_start_vl.md) &emsp; [best practice](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;|
| QWEN-VL |BF16/WINT4/FP8| Qwen/Qwen2.5-VL-72B-Instruct;<br>Qwen/Qwen2.5-VL-32B-Instruct;<br>Qwen/Qwen2.5-VL-7B-Instruct;<br>Qwen/Qwen2.5-VL-3B-Instruct|

-## Minimum Resource Deployment Instructions
-
-There is no universal formula for minimum deployment resources; it depends on the context length and the quantization method.
-We recommend estimating required GPU memory as: number of parameters × quantization byte factor (factor table below); the final GPU count is total memory requirement ÷ per-GPU memory.
-
-|Quantization Method |Bytes per Parameter |
-| :--- | :--- |
-|BF16 |2 |
-|FP8 |1 |
-|WINT8 |1 |
-|WINT4 |0.5 |
-|W4A8C8 |0.5 |
-
More models are being supported; you can submit requests for new model support via [Github Issues](https://github.com/PaddlePaddle/FastDeploy/issues).
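As a worked instance of the estimate removed above (the model size and per-card memory are illustrative assumptions):

```
300B parameters × 0.5 bytes/parameter (WINT4) = 150 GB
150 GB ÷ 80 GB per GPU = 1.875 → 2 GPUs (KV cache not included)
```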