[Loader] add multi-thread model loading #6877

Merged
yuanlehome merged 2 commits into PaddlePaddle:develop from bukejiyu:multi-thread-loader
Apr 10, 2026
Conversation

@bukejiyu (Collaborator) commented Mar 16, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Tested with Deepseek-V3, tp8, fp8 dynamic quantization on a single NVMe SSD over PCIe Gen4 (Speed 16GT/s, Width x4): loading performance improves significantly, at the cost of some additional memory usage.

| Scenario | Time before optimization | Time after optimization | Speed up |
|---|---|---|---|
| Weight in single NVMe SSD | 232s | 145s | 1.6× |

Usage or Command

```shell
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export FD_ATTENTION_BACKEND="MLA_ATTN"
export FLAGS_mla_use_tensorcore=1
export FLAGS_flash_attn_version=3
python -m fastdeploy.entrypoints.openai.api_server \
    --model $model_name_or_path \
    --port 8380 \
    --metrics-port 8381 \
    --tensor-parallel-size 8 \
    --engine-worker-queue-port 9382 \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --graph-optimization-config '{"use_cudagraph": false}' \
    --load-choices "default_v1" \
    --quantization "block_wise_fp8" \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
```

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests, or explain in this PR why they are omitted.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot (bot) commented Mar 16, 2026

Thanks for your contribution!

@codecov-commenter commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 46.15385% with 21 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@924690b). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| fastdeploy/model_executor/load_weight_utils.py | 36.36% | 20 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6877   +/-   ##
==========================================
  Coverage           ?   74.16%           
==========================================
  Files              ?      383           
  Lines              ?    53593           
  Branches           ?     8399           
==========================================
  Hits               ?    39748           
  Misses             ?    11153           
  Partials           ?     2692           
| Flag | Coverage Δ |
|---|---|
| GPU | 74.16% <46.15%> (?) |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

```python
    help="The format of the model weights to load. default/default_v1/dummy.",
)

parser.add_argument(
```
Collaborator:
Can this be enabled by default? Would the earlier GPU memory fragmentation issue come back?

Collaborator (Author):
It does cause GPU memory fragmentation, and it places requirements on the hardware, so it cannot be enabled by default.

```python
from fastdeploy.model_executor.layers.linear import KVBatchLinear
from fastdeploy.model_executor.utils import multi_switch_config_context

DEFAULT_NUM_THREADS = 8
```
Collaborator:
Why is the default 8?

Copilot AI (Contributor) left a comment:
Pull request overview

This PR adds optional multi-threaded loading for safetensors-based model weights and passes the relevant configuration from the API Server/Engine down to the Worker and the weight-loading logic, speeding up startup for large models.

Changes:

  • Added the model_loader_extra_config option and CLI argument, passed through when the Engine launches Workers
  • Extended get_weight_iterator to read LoadConfig and enable multi-threaded safetensors shard loading when configured
  • Added multi_thread_safetensors_weights_iterator to read multiple safetensors shards in parallel

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
|---|---|
| fastdeploy/worker/worker_process.py | Adds worker-side parsing of the `--model_loader_extra_config` argument (JSON) |
| fastdeploy/model_executor/model_loader/default_loader_v1.py | Passes fd_config.load_config into the weight iterator to support extended loading strategies |
| fastdeploy/model_executor/load_weight_utils.py | Introduces the multi-threaded safetensors shard iterator and selects the load path in get_weight_iterator based on config |
| fastdeploy/engine/engine.py | Passes model_loader_extra_config through when launching Worker processes |
| fastdeploy/engine/common_engine.py | Same pass-through in the alternate engine startup path |
| fastdeploy/engine/args_utils.py | Adds the `--model-loader-extra-config` CLI argument for the API Server/Engine and wires it into EngineArgs |
| fastdeploy/config.py | Adds the model_loader_extra_config field to LoadConfig |
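The shard-parallel iterator described above can be sketched as follows. This is a minimal illustration under assumptions, not the PR's actual implementation: `load_file` stands in for the real safetensors shard reader (`safetensors.paddle.load_file` in FastDeploy), and the function fans shard reads out across a thread pool, yielding (name, tensor) pairs as shards complete.

```python
import concurrent.futures
from typing import Callable, Dict, Iterator, List, Tuple


def multi_thread_weights_iterator(
    files_list: List[str],
    load_file: Callable[[str], Dict[str, object]],
    max_workers: int = 8,
) -> Iterator[Tuple[str, object]]:
    """Read weight shards in parallel and yield (name, tensor) pairs.

    `load_file` is a stand-in for the real safetensors reader; each call
    returns a dict mapping weight names to tensors for one shard file.
    """
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=max_workers, thread_name_prefix="safetensors_loader"
    ) as executor:
        # executor.map preserves shard order even though reads overlap.
        for state_dict in executor.map(load_file, files_list):
            yield from state_dict.items()


# Toy usage with an in-memory "shard" table instead of real files.
shards = {
    "model-00001.safetensors": {"layer0.weight": [1.0]},
    "model-00002.safetensors": {"layer1.weight": [2.0]},
}
weights = dict(multi_thread_weights_iterator(list(shards), shards.get, max_workers=2))
```

Because the iterator yields as it drains `executor.map`, downstream consumers can start copying weights to device while later shards are still being read, which is where the reported 1.6× speedup on NVMe comes from.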

Comment thread fastdeploy/model_executor/load_weight_utils.py Outdated
Comment on lines +123 to +129
```python
        extra_config = load_config.model_loader_extra_config if load_config else None
        if extra_config is not None and extra_config.get("enable_multithread_load", False):
            weights_iterator = multi_thread_safetensors_weights_iterator(
                files_list,
                max_workers=extra_config.get("num_threads", DEFAULT_NUM_THREADS),
                disable_mmap=extra_config.get("disable_mmap", False),
            )
```
Copilot AI commented Apr 8, 2026:

num_threads is passed straight into ThreadPoolExecutor(max_workers=...), so a 0, negative, or non-integer value raises ValueError and the whole load fails. Consider validating the thread count when reading extra_config (e.g. coerce it to int and clamp it to >= 1, optionally capping it at len(files_list)).

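A minimal sketch of the suggested clamping (`resolve_num_threads` is a hypothetical helper for illustration, not part of the PR):

```python
def resolve_num_threads(extra_config: dict, num_files: int, default: int = 8) -> int:
    """Coerce the configured thread count to an int and clamp it to the
    range [1, num_files], falling back to the default on bad input."""
    try:
        n = int(extra_config.get("num_threads", default))
    except (TypeError, ValueError):
        n = default
    return max(1, min(n, num_files))
```

Capping at the number of shard files avoids spawning idle threads; flooring at 1 keeps a misconfigured value from aborting the load.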
Comment on lines +120 to +129
```python
def get_weight_iterator(model_path: str, load_config: Optional[LoadConfig] = None):
    files_list, ordered_weight_map, use_safetensors, is_layers_are_grouped = get_all_weights_file(model_path)
    if use_safetensors:
        if is_layers_are_grouped:
            weights_iterator = safetensors_weights_iterator(files_list)
        extra_config = load_config.model_loader_extra_config if load_config else None
        if extra_config is not None and extra_config.get("enable_multithread_load", False):
            weights_iterator = multi_thread_safetensors_weights_iterator(
                files_list,
                max_workers=extra_config.get("num_threads", DEFAULT_NUM_THREADS),
                disable_mmap=extra_config.get("disable_mmap", False),
            )
```
Copilot AI commented Apr 8, 2026:

The new multi-threaded safetensors load branch (enable_multithread_load / disable_mmap) has no unit-test coverage yet. The repository already has tests/model_executor/test_load_weight_utils.py covering the other iterator behaviors; consider adding cases that verify the multi-threaded branch yields the correct weights and that cover both the disable_mmap=True and disable_mmap=False paths.

Copilot generated this review using guidance from repository custom instructions.
Comment thread fastdeploy/engine/args_utils.py Outdated
PaddlePaddle-bot posted four comments that were marked as outdated.

@PaddlePaddle-bot left a comment:

🤖 AI Code Review | 2026-04-10 02:57 CST

📋 Review Summary

PR overview: implements multi-threaded model loading to improve weight-loading performance from NVMe SSDs

Scope: load_weight_utils.py (core implementation), the config pass-through chain, docs, and tests

Impact tags: [Loader] [Optimization]

📝 PR Convention Check

✓ PR title carries the [Loader] tag
✓ Description includes Motivation and Modifications
✓ Includes a usage example and performance test data

No PR convention issues.

Issues

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | load_weight_utils.py:433 | Exception handling is too broad |
| 🟡 Suggestion | load_weight_utils.py:441 | `_load_file` lacks exception handling |
| 🟡 Suggestion | load_weight_utils.py:439 | `disable_mmap=True` may cause OOM |
| 🟡 Suggestion | load_weight_utils.py:445 | `ThreadPoolExecutor` lacks `thread_name_prefix` |
| 🟡 Suggestion | load_weight_utils.py:127 | Missing parameter validation |

Overall assessment

The implementation is correct and the performance gain is clear (1.6× speedup). The code is well structured and the config pass-through chain is complete, but stronger exception handling and parameter validation would improve robustness.

```python
    """
    try:
        enable_tqdm = dist.get_rank() == 0
    except Exception:
```

🟡 Suggestion: exception handling is too broad

The current except Exception catches every exception without handling or logging it. Catch specific exception types (e.g. RuntimeError), or log when an exception occurs, to make problems easier to debug in distributed environments.
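A sketch of the narrower handling, with the rank lookup injected as a callable so the pattern is testable in isolation (the helper name and the injected `get_rank` parameter are illustrative, not FastDeploy API):

```python
import logging

logger = logging.getLogger("load_weight_utils")


def enable_tqdm_for_rank(get_rank) -> bool:
    """Show the progress bar only on rank 0; if the distributed context
    is unavailable, log the reason and default to showing it."""
    try:
        return get_rank() == 0
    except (RuntimeError, ValueError) as e:
        # Catch only the failures a rank query is known to raise, and
        # record why we fell back, instead of swallowing everything.
        logger.debug("Could not determine rank, assuming single process: %s", e)
        return True
```

In the real code, `get_rank` would be `dist.get_rank`; narrowing the except clause keeps genuine bugs (e.g. NameError) from being silently masked.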

```python
        with open(st_file, "rb") as f:
            result = safetensors.paddle.load(f.read())
    else:
        result = safetensors.paddle.load_file(st_file, device="cpu")
```

🟡 Suggestion: _load_file lacks exception handling

If the file is missing, the read fails, or safetensors parsing fails, the entire load aborts. Add a try/except with a meaningful error message:

```python
def _load_file(st_file: str):
    try:
        if disable_mmap:
            with open(st_file, "rb") as f:
                result = safetensors.paddle.load(f.read())
        else:
            result = safetensors.paddle.load_file(st_file, device="cpu")
        return result
    except FileNotFoundError:
        logger.error(f"Weight file not found: {st_file}")
        raise
    except Exception as e:
        logger.error(f"Failed to load weight file {st_file}: {e}")
        raise
```

```python
def _load_file(st_file: str):
    if disable_mmap:
        with open(st_file, "rb") as f:
            result = safetensors.paddle.load(f.read())
```

🟡 Suggestion: disable_mmap=True may cause OOM

f.read() loads the whole file into memory; for large model files (100 GB+) this can exhaust memory. Suggestions:

  1. Add a file-size check and warning.
  2. Document that disable_mmap is only suitable for small files or scenarios where mmap is unsupported.
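Point 1 could be sketched like this (a minimal illustration; the helper name and the 10 GiB threshold are assumptions, not values from the PR):

```python
import logging
import os

# Warn when fully buffering a shard this large; the 10 GiB threshold
# is an illustrative choice, not a value from the PR.
LARGE_SHARD_BYTES = 10 * 2**30


def warn_if_large(st_file: str) -> int:
    """Return the shard size in bytes, warning when f.read() would
    buffer a very large file entirely in memory (the disable_mmap path)."""
    size = os.path.getsize(st_file)
    if size > LARGE_SHARD_BYTES:
        logging.warning(
            "disable_mmap will buffer %s (%.1f GiB) fully in memory",
            st_file,
            size / 2**30,
        )
    return size
```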


```python
        return result

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
```

🟡 Suggestion: add thread_name_prefix to ThreadPoolExecutor

Passing thread_name_prefix="safetensors_loader" makes the loader threads easy to identify when debugging.
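As a standalone sketch (not PR code), the prefix shows up in worker thread names like this:

```python
import concurrent.futures
import threading

with concurrent.futures.ThreadPoolExecutor(
    max_workers=2, thread_name_prefix="safetensors_loader"
) as executor:
    # Each worker thread reports its own name, e.g. "safetensors_loader_0",
    # which then appears in log lines and py-spy/thread dumps.
    names = set(executor.map(lambda _: threading.current_thread().name, range(4)))
```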

```python
        if extra_config is not None and extra_config.get("enable_multithread_load", False):
            weights_iterator = multi_thread_safetensors_weights_iterator(
                files_list,
                max_workers=extra_config.get("num_threads", DEFAULT_NUM_THREADS),
```

🟡 Suggestion: missing parameter validation

Consider validating the parameters in get_weight_iterator:

```python
num_threads = extra_config.get("num_threads", DEFAULT_NUM_THREADS)
if not isinstance(num_threads, int) or num_threads <= 0:
    raise ValueError(f"num_threads must be a positive integer, got {num_threads}")
if num_threads > 32:
    logger.warning(f"num_threads={num_threads} is unusually high, consider reducing it")
```

@yuanlehome yuanlehome merged commit 14d4618 into PaddlePaddle:develop Apr 10, 2026
57 of 64 checks passed


6 participants