diff --git a/README.md b/README.md
index cc78f6fd250..a93840df139 100644
--- a/README.md
+++ b/README.md
@@ -73,7 +73,7 @@ Learn how to use FastDeploy through our documentation:
 ## Supported Models
 
-Learn how to download models, enable support for Torch weights, and calculate minimum resource requirements, and more:
+Learn how to download models, enable loading Torch-format weights, and more:
 
 - [Full Supported Models List](./docs/supported_models.md)
 
 ## Advanced Usage
diff --git a/README_CN.md b/README_CN.md
index 944ece19d7c..defcada1736 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -71,7 +71,7 @@ FastDeploy 支持在**英伟达(NVIDIA)GPU**、**昆仑芯(Kunlunxin)XPU
 ## 支持模型列表
 
-通过我们的文档了解如何下载模型,如何支持Torch 权重,如何计算最小资源部署等:
+通过我们的文档了解如何下载模型,如何支持torch格式等:
 
 - [模型支持列表](./docs/zh/supported_models.md)
 
 ## 进阶用法
diff --git a/docs/get_started/installation/nvidia_gpu.md b/docs/get_started/installation/nvidia_gpu.md
index 31dedb3dc09..706162151eb 100644
--- a/docs/get_started/installation/nvidia_gpu.md
+++ b/docs/get_started/installation/nvidia_gpu.md
@@ -13,14 +13,14 @@ The following installation methods are available when your environment meets the
 **Notice**: The pre-built image only supports SM80/90 GPUs (e.g. H800/A800). If you are deploying on SM86/89 GPUs (L40/4090/L20), please reinstall ```fastdeploy-gpu``` after you create the container.
 
 ```shell
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.1.1
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.0
 ```
 
 ## 2. Pre-built Pip Installation
 
 First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
 
 ```shell
-python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
 ```
 
 Then install fastdeploy. **Do not install from PyPI**. Use the following methods instead:
@@ -58,7 +58,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
 First install paddlepaddle-gpu. For detailed instructions, refer to [PaddlePaddle Installation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
 
 ```shell
-python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
 ```
 
 Then clone the source code and build:
diff --git a/docs/index.md b/docs/index.md
index bff311362e5..d991b2f933c 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -24,8 +24,8 @@
 |QWEN3|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
 |QWEN-VL|BF16/WINT8/FP8|⛔|✅|✅|🚧|⛔|128K|
 |QWEN2|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
-|DEEPSEEK-V3|BF16/WINT4|⛔|✅|✅|🚧|✅|128K|
-|DEEPSEEK-R1|BF16/WINT4|⛔|✅|✅|🚧|✅|128K|
+|DEEPSEEK-V3|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|
+|DEEPSEEK-R1|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|
 ```
 ✅ Supported 🚧 In Progress ⛔ No Plan
diff --git a/docs/supported_models.md b/docs/supported_models.md
index 37548225340..ff32b5820c6 100644
--- a/docs/supported_models.md
+++ b/docs/supported_models.md
@@ -38,7 +38,7 @@ These models accept text input.
 |⭐QWEN3|BF16/WINT8/FP8|Qwen/qwen3-32B;<br>Qwen/qwen3-14B;<br>Qwen/qwen3-8B;<br>Qwen/qwen3-4B;<br>Qwen/qwen3-1.7B;<br>[Qwen/qwen3-0.6B](./get_started/quick_start_qwen.md), etc.|
 |⭐QWEN2.5|BF16/WINT8/FP8|Qwen/qwen2.5-72B;<br>Qwen/qwen2.5-32B;<br>Qwen/qwen2.5-14B;<br>Qwen/qwen2.5-7B;<br>Qwen/qwen2.5-3B;<br>Qwen/qwen2.5-1.5B;<br>Qwen/qwen2.5-0.5B, etc.|
 |⭐QWEN2|BF16/WINT8/FP8|Qwen/qwen2-72B;<br>Qwen/qwen2-7B;<br>Qwen/qwen2-1.5B;<br>Qwen/qwen2-0.5B;<br>Qwen/QwQ-32B, etc.|
-|DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
+|⭐DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
 
 ## Multimodal Language Models
 
@@ -49,25 +49,4 @@ These models accept multi-modal inputs (e.g., images and text).
 | ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br> [quick start](./get_started/ernie-4.5-vl.md)   [best practice](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br> [quick start](./get_started/quick_start_vl.md)   [best practice](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;|
 | QWEN-VL |BF16/WINT4/FP8| Qwen/Qwen2.5-VL-72B-Instruct;<br>Qwen/Qwen2.5-VL-32B-Instruct;<br>Qwen/Qwen2.5-VL-7B-Instruct;<br>Qwen/Qwen2.5-VL-3B-Instruct|
 
-## Minimum Resource Deployment Instruction
-
-There is no universal formula for minimum deployment resources; it depends on both context length and quantization method. We recommend estimating the required GPU memory using the following formula:
-```
-Required GPU Memory = Number of Parameters × Quantization Byte factor
-```
-> (The factor list is provided below.)
-
-And the final number of GPUs depends on:
-```
-Number of GPUs = Total Memory Requirement ÷ Memory per GPU
-```
-
-| Quantization Method | Bytes per Parameter factor |
-| :--- | :--- |
-|BF16 |2 |
-|FP8 |1 |
-|WINT8 |1 |
-|WINT4 |0.5 |
-|W4A8C8 |0.5 |
-
 More models are being supported. You can submit requests for new model support via [Github Issues](https://github.com/PaddlePaddle/FastDeploy/issues).
diff --git a/docs/zh/get_started/installation/nvidia_gpu.md b/docs/zh/get_started/installation/nvidia_gpu.md
index 154430bf8dc..562f112e4f8 100644
--- a/docs/zh/get_started/installation/nvidia_gpu.md
+++ b/docs/zh/get_started/installation/nvidia_gpu.md
@@ -15,7 +15,7 @@
 **注意**: 如下镜像仅支持SM 80/90架构GPU(A800/H800等),如果你是在L20/L40/4090等SM 86/89架构的GPU上部署,请在创建容器后,卸载```fastdeploy-gpu```再重新安装如下文档指定支持86/89架构的`fastdeploy-gpu`包。
 
 ``` shell
-docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.1.1
+docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12.6:2.2.0
 ```
 
 ## 2. 预编译Pip安装
@@ -23,7 +23,7 @@ docker pull ccr-2vdh3abv-pub.cnc.bj.baidubce.com/paddlepaddle/fastdeploy-cuda-12
 首先安装 paddlepaddle-gpu,详细安装方式参考 [PaddlePaddle安装](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/develop/install/pip/linux-pip_en.html)
 
 ``` shell
-python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
 ```
 
 再安装 fastdeploy,**注意不要通过pypi源安装**,需要通过如下方式安装
@@ -64,7 +64,7 @@ docker build -f dockerfiles/Dockerfile.gpu -t fastdeploy:gpu .
 首先安装 paddlepaddle-gpu,详细安装方式参考 [PaddlePaddle安装](https://www.paddlepaddle.org.cn/)
 
 ``` shell
-python -m pip install paddlepaddle-gpu==3.1.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
+python -m pip install paddlepaddle-gpu==3.2.0 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
 ```
 
 接着克隆源代码,编译安装
diff --git a/docs/zh/get_started/quick_start_qwen.md b/docs/zh/get_started/quick_start_qwen.md
index 70127b52e04..afb13e30b05 100644
--- a/docs/zh/get_started/quick_start_qwen.md
+++ b/docs/zh/get_started/quick_start_qwen.md
@@ -7,13 +7,12 @@
 - CUDNN >= 9.5
 - Linux X86_64
 - Python >= 3.10
-- 运行模型满足最低硬件配置要求,参考[支持模型列表文档](supported_models.md)
 
 为了快速在各类硬件部署,本文档采用 ```Qwen3-0.6B``` 模型作为示例,可在大部分硬件上完成部署。
 
 安装FastDeploy方式参考[安装文档](./installation/README.md)。
 
 ## 1. 启动服务
-安装FastDeploy后,在终端执行如下命令,启动服务,其中启动命令配置方式参考[参数说明](parameters.md)
+安装FastDeploy后,在终端执行如下命令,启动服务,其中启动命令配置方式参考[参数说明](../parameters.md)
 
 > ⚠️ **注意:**
 > 当使用HuggingFace 模型(torch格式)时,需要开启 `--load_choices "default_v1"`
@@ -30,14 +29,14 @@ python -m fastdeploy.entrypoints.openai.api_server \
        --load_choices "default_v1"
 ```
 
->💡 注意:在 ```--model``` 指定的路径中,若当前目录下不存在该路径对应的子目录,则会尝试根据指定的模型名称(如 ```Qwen/Qwen3-0.6B```)查询AIStudio是否存在预置模型,若存在,则自动启动下载。默认的下载路径为:```~/xx```。关于模型自动下载的说明和配置参阅[模型下载](supported_models.md)。
+>💡 注意:在 ```--model``` 指定的路径中,若当前目录下不存在该路径对应的子目录,则会尝试根据指定的模型名称(如 ```Qwen/Qwen3-0.6B```)查询AIStudio是否存在预置模型,若存在,则自动启动下载。默认的下载路径为:```~/xx```。关于模型自动下载的说明和配置参阅[模型下载](../supported_models.md)。
 
 ```--max-model-len``` 表示当前部署的服务所支持的最长Token数量。
 ```--max-num-seqs``` 表示当前部署的服务所支持的最大并发处理数量。
 
 **相关文档**
-- [服务部署配置](online_serving/README.md)
-- [服务监控metrics](online_serving/metrics.md)
+- [服务部署配置](../online_serving/README.md)
+- [服务监控metrics](../online_serving/metrics.md)
 
 ## 2. 用户发起服务请求
 
@@ -92,6 +91,3 @@ for chunk in response:
     print(chunk.choices[0].delta.content, end='')
 print('\n')
 ```
-📌
-⚙️
-✕
diff --git a/docs/zh/index.md b/docs/zh/index.md
index 73721bbc0b5..b4f44cfd851 100644
--- a/docs/zh/index.md
+++ b/docs/zh/index.md
@@ -24,8 +24,8 @@
 |QWEN3|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
 |QWEN-VL|BF16/WINT8/FP8|⛔|✅|✅|🚧|⛔|128K|
 |QWEN2|BF16/WINT8/FP8|⛔|✅|✅|🚧|✅|128K|
-|DEEPSEEK-V3|BF16/WINT4|⛔|✅|✅|🚧|✅|128K|
-|DEEPSEEK-R1|BF16/WINT4|⛔|✅|✅|🚧|✅|128K|
+|DEEPSEEK-V3|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|
+|DEEPSEEK-R1|BF16/WINT4|⛔|✅|🚧|🚧|✅|128K|
 ```
 ✅ 已支持 🚧 适配中 ⛔ 暂无计划
diff --git a/docs/zh/supported_models.md b/docs/zh/supported_models.md
index 61f353b334f..209852343c9 100644
--- a/docs/zh/supported_models.md
+++ b/docs/zh/supported_models.md
@@ -36,7 +36,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 |⭐QWEN3|BF16/WINT8/FP8|Qwen/qwen3-32B;<br>Qwen/qwen3-14B;<br>Qwen/qwen3-8B;<br>Qwen/qwen3-4B;<br>Qwen/qwen3-1.7B;<br>[Qwen/qwen3-0.6B](./get_started/quick_start_qwen.md), etc.|
 |⭐QWEN2.5|BF16/WINT8/FP8|Qwen/qwen2.5-72B;<br>Qwen/qwen2.5-32B;<br>Qwen/qwen2.5-14B;<br>Qwen/qwen2.5-7B;<br>Qwen/qwen2.5-3B;<br>Qwen/qwen2.5-1.5B;<br>Qwen/qwen2.5-0.5B, etc.|
 |⭐QWEN2|BF16/WINT8/FP8|Qwen/qwen2-72B;<br>Qwen/qwen2-7B;<br>Qwen/qwen2-1.5B;<br>Qwen/qwen2-0.5B;<br>Qwen/QwQ-32B, etc.|
-|DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
+|⭐DEEPSEEK|BF16/WINT4|unsloth/DeepSeek-V3.1-BF16;<br>unsloth/DeepSeek-V3-0324-BF16;<br>unsloth/DeepSeek-R1-BF16, etc.|
 
 ## 多模态语言模型列表
 
@@ -47,17 +47,4 @@
 | ERNIE-VL |BF16/WINT4/WINT8| baidu/ERNIE-4.5-VL-424B-A47B-Paddle<br> [快速部署](./get_started/ernie-4.5-vl.md)   [最佳实践](./best_practices/ERNIE-4.5-VL-424B-A47B-Paddle.md) ;<br>baidu/ERNIE-4.5-VL-28B-A3B-Paddle<br> [快速部署](./get_started/quick_start_vl.md)   [最佳实践](./best_practices/ERNIE-4.5-VL-28B-A3B-Paddle.md) ;|
 | QWEN-VL |BF16/WINT4/FP8| Qwen/Qwen2.5-VL-72B-Instruct;<br>Qwen/Qwen2.5-VL-32B-Instruct;<br>Qwen/Qwen2.5-VL-7B-Instruct;<br>Qwen/Qwen2.5-VL-3B-Instruct|
 
-## 最小资源部署说明
-
-最小部署资源没有普适公式,需要根据上下文长度 和 量化方式
-我们推荐计算显存需求 = 参数量 × 量化方式字节系数(系数列表如下),最终 GPU 数量取决于 总显存需求 ÷ 单卡显存
-
-|量化方式 |对应每参数字节系数 |
-| :--- | :--- |
-|BF16 |2 |
-|FP8 |1 |
-|WINT8 |1 |
-|WINT4 |0.5 |
-|W4A8C8 |0.5 |
-
 更多模型同步支持中,你可以通过[Github Issues](https://github.com/PaddlePaddle/FastDeploy/issues)向我们提交新模型的支持需求。
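
The installation hunks above bump the pinned `paddlepaddle-gpu` to 3.2.0, and a wheel that does not match the local CUDA runtime is the most common failure at that step. Before installing `fastdeploy-gpu` on top of it, a quick sanity check can confirm the new pin works. This is a minimal sketch that assumes only PaddlePaddle's standard self-check helper `paddle.utils.run_check()`; it is a suggested verification step, not something added by this patch:

```shell
# Print the installed Paddle version (expect 3.2.0) and run the built-in
# GPU self-check, which reports clearly on a CUDA/driver mismatch.
python -c "import paddle; print(paddle.__version__); paddle.utils.run_check()"
```

If the check prints that PaddlePaddle is installed successfully, the `fastdeploy-gpu` steps described in the patched documents can proceed unchanged.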