diff --git a/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md b/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
index 7204e1b1424..dbd244cb9b1 100644
--- a/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-0.3B-Paddle.md
@@ -31,11 +31,11 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --quantization wint4 \
     --max-model-len 32768 \
     --max-num-seqs 128 \
-    --load_choices "default_v1"
+    --load-choices "default_v1"
 ```
 - `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8`(Hopper is needed).
 - `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency.
-- `--load_choices`: indicates the version of the loader. "default_v1" means enabling the v1 version of the loader, which has faster loading speed and less memory usage.
+- `--load-choices`: indicates the version of the loader. "default_v1" means enabling the v1 version of the loader, which has faster loading speed and less memory usage.
 
 For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)。
 
diff --git a/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md b/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
index 9de142d85b9..306b4e71505 100644
--- a/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
@@ -31,11 +31,11 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --quantization wint4 \
     --max-model-len 32768 \
     --max-num-seqs 128 \
-    --load_choices "default_v1"
+    --load-choices "default_v1"
 ```
 - `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8`(Hopper is needed).
 - `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency.
-- `--load_choices`: indicates the version of the loader. "default_v1" means enabling the v1 version of the loader, which has faster loading speed and less memory usage.
+- `--load-choices`: indicates the version of the loader. "default_v1" means enabling the v1 version of the loader, which has faster loading speed and less memory usage.
 
 For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)。
 
diff --git a/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md b/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
index 92b620ddf4c..639ac60634e 100644
--- a/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
+++ b/docs/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
@@ -27,7 +27,7 @@ Start the service by following command:
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
     --model baidu/ERNIE-4.5-21B-A3B-Thinking \
-    --load_choices "default_v1" \
+    --load-choices "default_v1" \
     --tensor-parallel-size 1 \
     --max-model-len 131072 \
     --quantization wint8 \
@@ -37,7 +37,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 ```
 - `--quantization`: Indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8`(Hopper is needed).
 - `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency.
-- `--load_choices`: Indicates the version of the loader. "default_v1" means enabling the v1 version of the loader, which has faster loading speed and less memory usage.
+- `--load-choices`: Indicates the version of the loader. "default_v1" means enabling the v1 version of the loader, which has faster loading speed and less memory usage.
 - `--reasoning-parser`, `--tool-call-parser`: Indicates the corresponding reasoning content and tool call parser.
 
 For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)。
diff --git a/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md b/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
index debb54fc0b9..1aba169a092 100644
--- a/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
+++ b/docs/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
@@ -28,11 +28,11 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --quantization wint4 \
     --max-model-len 32768 \
     --max-num-seqs 128 \
-    --load_choices "default_v1"
+    --load-choices "default_v1"
 ```
 - `--quantization`: indicates the quantization strategy used by the model. Different quantization strategies will result in different performance and accuracy of the model. It could be one of `wint8` / `wint4` / `block_wise_fp8`(Hopper is needed).
 - `--max-model-len`: Indicates the maximum number of tokens supported by the currently deployed service. The larger the value, the longer the context length the model can support, but the more GPU memory is occupied, which may affect the concurrency.
-- `--load_choices`: indicates the version of the loader. "default_v1" means enabling the v1 version of the loader, which has faster loading speed and less memory usage.
+- `--load-choices`: indicates the version of the loader. "default_v1" means enabling the v1 version of the loader, which has faster loading speed and less memory usage.
 
 For more parameter meanings and default settings, see [FastDeploy Parameter Documentation](../parameters.md)。
 
@@ -91,7 +91,7 @@ Just specify the corresponding model name in the startup command, `baidu/ERNIE-4
 ```
 
 Note:
-- W4A8C8 quantized models are not supported when loaded via `--load_choices "default_v1"`.
+- W4A8C8 quantized models are not supported when loaded via `--load-choices "default_v1"`.
 
 #### 2.2.6 Rejection Sampling
 **Idea:**
diff --git a/docs/get_started/quick_start_qwen.md b/docs/get_started/quick_start_qwen.md
index 95a31d71696..c0510fb6f5b 100644
--- a/docs/get_started/quick_start_qwen.md
+++ b/docs/get_started/quick_start_qwen.md
@@ -16,7 +16,7 @@ For more information about how to install FastDeploy, refer to the [installation
 After installing FastDeploy, execute the following command in the terminal to start the service. For the configuration method of the startup command, refer to [Parameter Description](../parameters.md)
 
 > ⚠️ **Note:**
-> When using HuggingFace models (torch format), you need to enable `--load_choices "default_v1"`.
+> When using HuggingFace models (torch format), you need to enable `--load-choices "default_v1"`.
 
 ```
 export ENABLE_V1_KVCACHE_SCHEDULER=1
@@ -27,7 +27,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --engine-worker-queue-port 8182 \
     --max-model-len 32768 \
     --max-num-seqs 32 \
-    --load_choices "default_v1"
+    --load-choices "default_v1"
 ```
 
 > 💡 Note: In the path specified by ```--model```, if the subdirectory corresponding to the path does not exist in the current directory, it will try to query whether AIStudio has a preset model based on the specified model name (such as ```Qwen/QWEN3-0.6b```). If it exists, it will automatically start downloading. The default download path is: ```~/xx```. For instructions and configuration on automatic model download, see [Model Download](../supported_models.md).
diff --git a/docs/supported_models.md b/docs/supported_models.md
index 7849a9f7b37..e6823c0ff91 100644
--- a/docs/supported_models.md
+++ b/docs/supported_models.md
@@ -13,7 +13,7 @@ export FD_MODEL_SOURCE=AISTUDIO # "AISTUDIO", "MODELSCOPE" or "HUGGINGFACE"
 export FD_MODEL_CACHE=/ssd1/download_models
 ```
 
-> ⭐ **Note**: Models marked with an asterisk can directly use **HuggingFace Torch weights** and support **FP8/WINT8/WINT4** as well as **BF16**. When running inference, you need to enable **`--load_choices "default_v1"`**.
+> ⭐ **Note**: Models marked with an asterisk can directly use **HuggingFace Torch weights** and support **FP8/WINT8/WINT4** as well as **BF16**. When running inference, you need to enable **`--load-choices "default_v1"`**.
 > Example launch Command using baidu/ERNIE-4.5-21B-A3B-PT:
 
 ```
@@ -24,7 +24,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --engine-worker-queue-port 8182 \
     --max-model-len 32768 \
     --max-num-seqs 32 \
-    --load_choices "default_v1"
+    --load-choices "default_v1"
 ```
 
 ## Large Language Models
diff --git a/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
index c4a82d6414d..498f46e9c37 100644
--- a/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-0.3B-Paddle.md
@@ -31,12 +31,12 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --quantization wint4 \
     --max-model-len 32768 \
     --max-num-seqs 128 \
-    --load_choices "default_v1"
+    --load-choices "default_v1"
 ```
 其中:
 - `--quantization`: 表示模型采用的量化策略。不同量化策略,模型的性能和精度也会不同。可选值包括:`wint8` / `wint4` / `block_wise_fp8`(需要Hopper架构)。
 - `--max-model-len`:表示当前部署的服务所支持的最长Token数量。设置得越大,模型可支持的上下文长度也越大,但相应占用的显存也越多,可能影响并发数。
-- `--load_choices`: 表示loader的版本,"default_v1"表示启用v1版本的loader,具有更快的加载速度和更少的内存使用。
+- `--load-choices`: 表示loader的版本,"default_v1"表示启用v1版本的loader,具有更快的加载速度和更少的内存使用。
 
 更多的参数含义与默认设置,请参见[FastDeploy参数说明](../parameters.md)。
 
diff --git a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
index 36ee050c2b2..a0f649b2e94 100644
--- a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Paddle.md
@@ -31,12 +31,12 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --quantization wint4 \
     --max-model-len 32768 \
     --max-num-seqs 128 \
-    --load_choices "default_v1"
+    --load-choices "default_v1"
 ```
 其中:
 - `--quantization`: 表示模型采用的量化策略。不同量化策略,模型的性能和精度也会不同。可选值包括:`wint8` / `wint4` / `block_wise_fp8`(需要Hopper架构)。
 - `--max-model-len`:表示当前部署的服务所支持的最长Token数量。设置得越大,模型可支持的上下文长度也越大,但相应占用的显存也越多,可能影响并发数。
-- `--load_choices`: 表示loader的版本,"default_v1"表示启用v1版本的loader,具有更快的加载速度和更少的内存使用。
+- `--load-choices`: 表示loader的版本,"default_v1"表示启用v1版本的loader,具有更快的加载速度和更少的内存使用。
 
 更多的参数含义与默认设置,请参见[FastDeploy参数说明](../parameters.md)。
 
diff --git a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
index 36ced7cbcc3..cc4cc9a5bff 100644
--- a/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
+++ b/docs/zh/best_practices/ERNIE-4.5-21B-A3B-Thinking.md
@@ -27,7 +27,7 @@ ERNIE-4.5-21B-A3B 各量化精度,在下列硬件上部署所需要的最小
 ```bash
 python -m fastdeploy.entrypoints.openai.api_server \
     --model baidu/ERNIE-4.5-21B-A3B-Thinking \
-    --load_choices "default_v1" \
+    --load-choices "default_v1" \
     --tensor-parallel-size 1 \
     --max-model-len 131072 \
     --quantization wint8 \
@@ -38,7 +38,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 其中:
 - `--quantization`: 表示模型采用的量化策略。不同量化策略,模型的性能和精度也会不同。可选值包括:`wint8` / `wint4` / `block_wise_fp8`(需要Hopper架构)。
 - `--max-model-len`:表示当前部署的服务所支持的最长Token数量。设置得越大,模型可支持的上下文长度也越大,但相应占用的显存也越多,可能影响并发数。
-- `--load_choices`: 表示loader的版本,"default_v1"表示启用v1版本的loader,具有更快的加载速度和更少的内存使用。
+- `--load-choices`: 表示loader的版本,"default_v1"表示启用v1版本的loader,具有更快的加载速度和更少的内存使用。
 - `--reasoning-parser` 、 `--tool-call-parser`: 表示对应调用的思考内容和工具调用解析器
 
 更多的参数含义与默认设置,请参见[FastDeploy参数说明](../parameters.md)。
diff --git a/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md b/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
index 37e1ad7a5a0..43f1b261cb5 100644
--- a/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
+++ b/docs/zh/best_practices/ERNIE-4.5-300B-A47B-Paddle.md
@@ -28,12 +28,12 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --quantization wint4 \
     --max-model-len 32768 \
     --max-num-seqs 128 \
-    --load_choices "default_v1"
+    --load-choices "default_v1"
 ```
 其中:
 - `--quantization`: 表示模型采用的量化策略。不同量化策略,模型的性能和精度也会不同。可选值包括:`wint8` / `wint4` / `block_wise_fp8`(需要Hopper架构)。
 - `--max-model-len`:表示当前部署的服务所支持的最长Token数量。设置得越大,模型可支持的上下文长度也越大,但相应占用的显存也越多,可能影响并发数。
-- `--load_choices`: 表示loader的版本,"default_v1"表示启用v1版本的loader,具有更快的加载速度和更少的内存使用。
+- `--load-choices`: 表示loader的版本,"default_v1"表示启用v1版本的loader,具有更快的加载速度和更少的内存使用。
 
 更多的参数含义与默认设置,请参见[FastDeploy参数说明](../parameters.md)。
 
@@ -92,7 +92,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
 ```
 
 注:
-- W4A8C8量化的模型不支持通过`--load_choices "default_v1"`载入。
+- W4A8C8量化的模型不支持通过`--load-choices "default_v1"`载入。
 
 #### 2.2.6 拒绝采样
 **原理:**
diff --git a/docs/zh/get_started/quick_start_qwen.md b/docs/zh/get_started/quick_start_qwen.md
index afb13e30b05..ee22650e712 100644
--- a/docs/zh/get_started/quick_start_qwen.md
+++ b/docs/zh/get_started/quick_start_qwen.md
@@ -15,7 +15,7 @@
 安装FastDeploy后,在终端执行如下命令,启动服务,其中启动命令配置方式参考[参数说明](../parameters.md)
 
 > ⚠️ **注意:**
-> 当使用HuggingFace 模型(torch格式)时, 需要开启 `--load_choices "default_v1"`
+> 当使用HuggingFace 模型(torch格式)时, 需要开启 `--load-choices "default_v1"`
 
 ```shell
 export ENABLE_V1_KVCACHE_SCHEDULER=1
@@ -26,7 +26,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --engine-worker-queue-port 8182 \
     --max-model-len 32768 \
     --max-num-seqs 32 \
-    --load_choices "default_v1"
+    --load-choices "default_v1"
 ```
 
 >💡 注意:在 ```--model``` 指定的路径中,若当前目录下不存在该路径对应的子目录,则会尝试根据指定的模型名称(如 ```Qwen/Qwen3-0.6B```)查询AIStudio是否存在预置模型,若存在,则自动启动下载。默认的下载路径为:```~/xx```。关于模型自动下载的说明和配置参阅[模型下载](../supported_models.md)。
diff --git a/docs/zh/supported_models.md b/docs/zh/supported_models.md
index 13db10a3b64..8507765fb8b 100644
--- a/docs/zh/supported_models.md
+++ b/docs/zh/supported_models.md
@@ -13,7 +13,7 @@ export FD_MODEL_SOURCE=AISTUDIO # "AISTUDIO", "MODELSCOPE" or "HUGGINGFACE"
 export FD_MODEL_CACHE=/ssd1/download_models
 ```
 
-> ⭐ **说明**:带星号的模型可直接使用 **HuggingFace Torch 权重**,支持 **FP8/WINT8/WINT4 动态量化** 和 **BF16 精度** 推理,推理时需启用 **`--load_choices "default_v1"`**。
+> ⭐ **说明**:带星号的模型可直接使用 **HuggingFace Torch 权重**,支持 **FP8/WINT8/WINT4 动态量化** 和 **BF16 精度** 推理,推理时需启用 **`--load-choices "default_v1"`**。
 > 以baidu/ERNIE-4.5-21B-A3B-PT为例启动命令如下
 
 ```
@@ -24,7 +24,7 @@ python -m fastdeploy.entrypoints.openai.api_server \
     --engine-worker-queue-port 8182 \
     --max-model-len 32768 \
     --max-num-seqs 32 \
-    --load_choices "default_v1"
+    --load-choices "default_v1"
 ```
 
 ## 纯文本模型列表
diff --git a/fastdeploy/engine/args_utils.py b/fastdeploy/engine/args_utils.py
index 82dacc1c2e9..ece32fef346 100644
--- a/fastdeploy/engine/args_utils.py
+++ b/fastdeploy/engine/args_utils.py
@@ -712,7 +712,7 @@ def add_cli_args(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
     # Load group
     load_group = parser.add_argument_group("Load Configuration")
     load_group.add_argument(
-        "--load_choices",
+        "--load-choices",
         type=str,
         default=EngineArgs.load_choices,
         help="The format of the model weights to load.\
diff --git a/fastdeploy/model_executor/load_weight_utils.py b/fastdeploy/model_executor/load_weight_utils.py
index ca20adf4fd9..4bc30b779f1 100644
--- a/fastdeploy/model_executor/load_weight_utils.py
+++ b/fastdeploy/model_executor/load_weight_utils.py
@@ -54,7 +54,7 @@ def pdparams_weight_iterator(paddle_file_list: list[str]):
         del state_dict
 
 
-def load_weights_form_cache(model, weights_iterator):
+def load_weights_from_cache(model, weights_iterator):
     params_dict = dict(model.named_parameters())
     for loaded_weight_name, loaded_weight in weights_iterator:
         param = params_dict[loaded_weight_name]
diff --git a/fastdeploy/model_executor/model_loader/default_loader_v1.py b/fastdeploy/model_executor/model_loader/default_loader_v1.py
index 3e700ca2747..09b85cbe7e9 100644
--- a/fastdeploy/model_executor/model_loader/default_loader_v1.py
+++ b/fastdeploy/model_executor/model_loader/default_loader_v1.py
@@ -22,7 +22,7 @@
 from fastdeploy.model_executor.load_weight_utils import (
     get_weight_iterator,
     is_weight_cache_enabled,
-    load_weights_form_cache,
+    load_weights_from_cache,
     measure_time,
     save_model,
 )
@@ -52,7 +52,7 @@ def clean_memory_fragments(self) -> None:
     def load_weights(self, model, fd_config: FDConfig, enable_cache: bool = False) -> None:
         weights_iterator = get_weight_iterator(fd_config.model_config.model)
         if enable_cache:
-            load_weights_form_cache(model, weights_iterator)
+            load_weights_from_cache(model, weights_iterator)
         else:
             model.load_weights(weights_iterator)
 
diff --git a/fastdeploy/model_executor/models/glm4_moe.py b/fastdeploy/model_executor/models/glm4_moe.py
index a2a4c4bdac3..18d0420a811 100644
--- a/fastdeploy/model_executor/models/glm4_moe.py
+++ b/fastdeploy/model_executor/models/glm4_moe.py
@@ -478,7 +478,7 @@ def set_state_dict(self, state_dict):
         """
         glm4_moe only support loader_v1.
         """
-        assert False, "glm4_moe only support --load_choices default_v1."
+        assert False, "glm4_moe only support --load-choices default_v1."
 
     def compute_logits(self, hidden_states: paddle.Tensor):
         """ """
diff --git a/tests/e2e/test_fake_Glm45_AIR_serving.py b/tests/e2e/test_fake_Glm45_AIR_serving.py
index ff0a3f5be07..46ad9dd8e0e 100644
--- a/tests/e2e/test_fake_Glm45_AIR_serving.py
+++ b/tests/e2e/test_fake_Glm45_AIR_serving.py
@@ -118,7 +118,7 @@ def setup_and_run_server():
         "32",
         "--graph-optimization-config",
         '{"use_cudagraph":true}',
-        "--load_choices",
+        "--load-choices",
         "default_v1",
         "--lm_head-fp32",
         "--quantization",
diff --git a/tests/model_loader/test_load_ernie_vl.py b/tests/model_loader/test_load_ernie_vl.py
index 9442e2ba660..68ef55c314e 100644
--- a/tests/model_loader/test_load_ernie_vl.py
+++ b/tests/model_loader/test_load_ernie_vl.py
@@ -124,7 +124,7 @@ def setup_and_run_server():
         "0.71",
         "--reasoning-parser",
         "ernie-45-vl",
-        "--load_choices",
+        "--load-choices",
         "default_v1",
     ]
 
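A note on why only the CLI spelling changes while every Python identifier in the diff (`EngineArgs.load_choices`, the `load_choices` default referenced by `add_argument`) keeps its underscore: argparse derives an option's attribute name by replacing hyphens with underscores, so `--load-choices` still binds to `load_choices`. The sketch below demonstrates that behavior with plain `argparse`; it is illustrative only and assumes FastDeploy's `FlexibleArgumentParser` preserves this standard convention.

```python
import argparse

# Illustrative sketch, not part of this diff: argparse replaces hyphens in a
# long option with underscores when deriving the attribute name ("dest"), so
# renaming --load_choices to --load-choices keeps args.load_choices working.
parser = argparse.ArgumentParser()
parser.add_argument("--load-choices", type=str, default="default")

args = parser.parse_args(["--load-choices", "default_v1"])
assert args.load_choices == "default_v1"  # attribute name unchanged by the rename
print(args.load_choices)  # default_v1
```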