
feat(kb): add knowledge base package import and export#6488

Open
catDforD wants to merge 3 commits into AstrBotDevs:master from catDforD:feat/4339-kb-import-export

Conversation


@catDforD catDforD commented Mar 17, 2026

Closes #4339

Adds "knowledge base package" import and export to the AstrBot knowledge base. The goal is to migrate a knowledge base as completely as possible so that it is directly usable after import, rather than migrating only the basic configuration and then rebuilding all runtime data.

Modifications / 改动点

  • Add a standalone knowledge base package export/import implementation; the exported package is as complete as possible and contains:
    • manifest.json
    • kb_metadata.json
    • runtime/doc.db
    • runtime/index.faiss
    • runtime/medias/...
    • runtime/files/...
  • Add backend endpoints for knowledge base packages:
    • export
    • upload
    • pre-check
    • import
    • progress query
    • download
  • Add frontend entry points:
    • an "Export knowledge base package" entry on the knowledge base detail/settings page
    • an "Import knowledge base package" entry on the knowledge base list page
  • Add a pre-check before import that covers:
    • whether the package format is valid
    • whether the AstrBot version is compatible
    • whether a local embedding provider with a compatible dimension exists
    • whether a rerank provider mapping is optionally available
  • The import strategy is fixed to "rename import":
    • never overwrites an existing knowledge base
    • always creates a new knowledge base
    • rewrites kb_id, document IDs, metadata references in doc.db, and media associations during import
  • The import prioritizes being "usable as soon as possible after import":
    • reuses knowledge base runtime data wherever possible
    • avoids restoring only metadata and then requiring a full rebuild
  • Add backend tests covering package export, import, and metadata rewriting
  • Add localized strings for:
    • zh-CN
    • en-US
    • ru-RU
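The "rename import" ID rewrite can be pictured with a minimal sketch. The `documents` table and column names below are hypothetical placeholders; the actual doc.db schema handled by package_io.py may differ:

```python
import sqlite3


def rewrite_doc_metadata(db_path: str, old_kb_id: str, new_kb_id: str,
                         doc_id_map: dict[str, str]) -> None:
    """Rewrite kb_id and document IDs inside a restored doc.db copy.

    Hypothetical schema: a `documents` table with `doc_id` and `kb_id`
    columns. All rewrites run in a single transaction so a failed
    import never leaves the copy half-rewritten.
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "UPDATE documents SET kb_id = ? WHERE kb_id = ?",
                (new_kb_id, old_kb_id),
            )
            for old_id, new_id in doc_id_map.items():
                conn.execute(
                    "UPDATE documents SET doc_id = ? WHERE doc_id = ?",
                    (new_id, old_id),
                )
    finally:
        conn.close()
```

Because the rewrite targets a fresh copy of doc.db extracted from the package, the original knowledge base is never touched, which is what makes the strategy non-destructive.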

Main files involved:

  • astrbot/core/knowledge_base/package_io.py
  • astrbot/dashboard/routes/knowledge_base.py
  • astrbot/dashboard/server.py
  • dashboard/src/views/knowledge-base/KBList.vue
  • dashboard/src/views/knowledge-base/components/SettingsTab.vue
  • dashboard/src/views/knowledge-base/components/KBPackageImportDialog.vue
  • dashboard/src/views/knowledge-base/components/KBPackageExportCard.vue
  • tests/test_kb_package.py

Known limitations:

  • Package upload still goes through the single-file upload flow and is therefore subject to the Dashboard's global 128 MB upload limit

  • This is NOT a breaking change.

Screenshots or Test Results / 运行截图或测试结果

Verification steps:

  1. Export from the knowledge base detail page to generate and download a package
  2. Upload the package on the knowledge base list page and complete the pre-check
  3. Select a compatible local embedding provider, and optionally a rerank provider
  4. Run the import with a new knowledge base name
  5. After the import finishes, verify that the new knowledge base exists and that documents and retrieval work as expected

Automated checks:

```shell
uv run ruff check astrbot/core/knowledge_base/package_io.py astrbot/dashboard/routes/knowledge_base.py astrbot/dashboard/server.py tests/test_kb_package.py
cd dashboard && pnpm typecheck
uv run pytest -q tests/test_kb_package.py
cd dashboard && pnpm build
```

Results:

- uv run ruff check ... passed
- cd dashboard && pnpm typecheck passed
- uv run pytest -q tests/test_kb_package.py passed
- cd dashboard && pnpm build passed

Main feature test videos:

  • Export flow, pre-check and provider mapping, import success and post-import usability:
freecompress-.mp4
  • Edge/error cases, including an invalid zip, a missing manifest, and major-version incompatibility:
default.mp4

Selected test screenshots:

  • Export entry on the knowledge base detail page
    • image
  • Export progress and download result
    • image
  • Import dialog on the knowledge base list page
    • image
  • Pre-check summary and provider compatibility display
    • image

Checklist / 检查清单

  • 😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
  • 👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
  • 🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.
  • 😮 My changes do not introduce malicious code.

Summary by Sourcery

Introduce import/export support for knowledge base packages, including backend APIs, runtime data packaging, and UI entry points.

New Features:

  • Add knowledge base package exporter/importer that bundles metadata, vector index, documents, and media into a portable ZIP.
  • Expose REST endpoints to upload, pre-check, import, export, track progress, and download knowledge base packages.
  • Add dashboard UI for exporting a knowledge base package from the settings page and importing a package from the knowledge base list.

Enhancements:

  • Add version and provider compatibility pre-check for knowledge base packages, including embedding dimension and rerank provider mapping.
  • Secure filenames and temp storage for uploaded and exported knowledge base package files, and reuse dashboard JWT-based download flow for package downloads.

Tests:

  • Add integration-style tests covering knowledge base package export, import, and metadata rewriting for documents and media.

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Mar 17, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive knowledge base package import and export for AstrBot, letting users migrate a knowledge base in full, including its configuration, runtime data (such as the vector index and document store), and associated media files, so that an imported knowledge base is immediately usable without rebuilding its data. It adds new backend APIs and frontend UI components to manage these operations, along with robust pre-check and import logic to handle compatibility and data integrity.

Highlights

  • Knowledge base package import/export: adds a standalone import/export feature aimed at migrating a knowledge base together with its runtime data.
  • Package contents: the exported package contains manifest.json, kb_metadata.json, runtime/doc.db, runtime/index.faiss, and the runtime/medias and runtime/files directories.
  • Backend API: new endpoints for package export, upload, pre-check, import, progress query, and download.
  • Frontend UI: an "Export knowledge base package" entry on the knowledge base detail/settings page and an "Import knowledge base package" entry on the knowledge base list page.
  • Import pre-check: validates the package format, AstrBot version compatibility, and local embedding and rerank provider compatibility before import.
  • Import strategy: fixed to "rename import", which automatically creates a new knowledge base and rewrites kb_id, document IDs, metadata references in doc.db, and media associations so the imported knowledge base is directly usable.
  • Test coverage: adds backend tests covering package export, import, and metadata rewriting.
  • Localization: adds zh-CN, en-US, and ru-RU strings.

Changelog

  • astrbot/core/knowledge_base/package_io.py
    • Adds exporter and importer classes implementing package packing/unpacking, metadata collection, runtime data handling, version compatibility checks, local provider matching, and ID rewriting and data restoration on import.
  • astrbot/dashboard/routes/knowledge_base.py
    • Adds API endpoints for package export, upload, pre-check, import, progress query, and download, plus the background tasks and progress management handling these operations.
  • astrbot/dashboard/server.py
    • Updates the auth middleware to allow the package download endpoint to be accessed without authentication under specific conditions.
  • dashboard/src/i18n/locales/en-US/features/knowledge-base/detail.json
    • Adds English strings for package export.
  • dashboard/src/i18n/locales/en-US/features/knowledge-base/index.json
    • Adds English strings for package import and the import success message.
  • dashboard/src/i18n/locales/ru-RU/features/knowledge-base/detail.json
    • Adds Russian strings for package export.
  • dashboard/src/i18n/locales/ru-RU/features/knowledge-base/index.json
    • Adds Russian strings for package import and the import success message.
  • dashboard/src/i18n/locales/zh-CN/features/knowledge-base/detail.json
    • Adds Chinese strings for package export.
  • dashboard/src/i18n/locales/zh-CN/features/knowledge-base/index.json
    • Adds Chinese strings for package import and the import success message.
  • dashboard/src/views/knowledge-base/KBList.vue
    • Adds an "Import knowledge base" button to the list page and integrates the package import dialog component.
  • dashboard/src/views/knowledge-base/components/KBPackageExportCard.vue
    • New Vue component for displaying and managing package export on the knowledge base settings page.
  • dashboard/src/views/knowledge-base/components/KBPackageImportDialog.vue
    • New Vue dialog component handling package upload, pre-check, and import.
  • dashboard/src/views/knowledge-base/components/SettingsTab.vue
    • Imports and uses the KBPackageExportCard component in the settings tab.
  • tests/test_kb_package.py
    • New test file with unit tests for package export and import, verifying packing, metadata rewriting, and provider matching.

Activity

  • The author provided detailed verification steps: exporting from the detail page, uploading and pre-checking on the list page, selecting compatible providers, running the import, and verifying the new knowledge base works.
  • Automated checks passed: ruff check, pnpm typecheck, pytest, and pnpm build.
  • Test videos cover the export flow, pre-check, provider mapping, import success, and post-import usability.
  • Edge-case recordings cover invalid zips, a missing manifest, and major-version incompatibility.
  • Screenshots show the export entry on the detail page, export progress and download result, the import dialog on the list page, and the pre-check summary with provider compatibility.
  • The author confirmed the checklist items: the feature was discussed with the maintainers, the changes are well tested with verification steps and screenshots provided, no new dependencies are introduced (or they are correctly declared), and no malicious code is included.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature              | Command             | Description                                                                                     |
| -------------------- | ------------------- | ----------------------------------------------------------------------------------------------- |
| Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.                        |
| Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.                             |
| Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.  |
| Help                 | /gemini help        | Displays a list of available commands.                                                           |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@dosubot dosubot bot added area:core The bug / feature is about astrbot's core, backend area:webui The bug / feature is about webui(dashboard) of astrbot. feature:knowledge-base The bug / feature is about knowledge base labels Mar 17, 2026

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 4 issues, and left some high level feedback:

  • In KBPackageImportDialog.vue, when version_status is major_diff (can_import = false) the UI still shows the same informational versionAlertMessage as a version match and proceeds to the confirm screen with a disabled button; consider explicitly surfacing an incompatibility error and stopping at the check step instead of showing a normal confirm form.
  • Both exported zip files and uploaded package files are accumulated under kb_package_exports / kb_package_uploads with no cleanup strategy; consider adding periodic pruning or a TTL-based cleanup mechanism to prevent unbounded growth of temp storage.
  • The formatDate helper in KBPackageImportDialog.vue hardcodes the 'zh-CN' locale, which can be confusing for non‑Chinese users; consider using the current UI locale or a more generic formatting strategy.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `KBPackageImportDialog.vue`, when `version_status` is `major_diff` (can_import = false) the UI still shows the same informational `versionAlertMessage` as a version match and proceeds to the confirm screen with a disabled button; consider explicitly surfacing an incompatibility error and stopping at the check step instead of showing a normal confirm form.
- Both exported zip files and uploaded package files are accumulated under `kb_package_exports` / `kb_package_uploads` with no cleanup strategy; consider adding periodic pruning or a TTL-based cleanup mechanism to prevent unbounded growth of temp storage.
- The `formatDate` helper in `KBPackageImportDialog.vue` hardcodes the `'zh-CN'` locale, which can be confusing for non‑Chinese users; consider using the current UI locale or a more generic formatting strategy.

## Individual Comments

### Comment 1
<location path="astrbot/core/knowledge_base/package_io.py" line_range="181-190" />
<code_context>
+            )
+            new_media.append(new_item)
+
+        async with self.kb_manager.kb_db.get_db() as session:
+            async with session.begin():
+                for doc in new_documents:
+                    session.add(doc)
+                for media in new_media:
+                    session.add(media)
+                await session.commit()
+
+    def _rewrite_runtime_path(
</code_context>
<issue_to_address>
**issue:** Avoid manual commit inside `async with session.begin()` context manager.

Inside `_restore_kb_metadata`, `async with session.begin()` already handles commit/rollback, so `await session.commit()` here is redundant and potentially confusing. Please either:
- keep `async with session.begin()` and remove `await session.commit()`, or
- drop `session.begin()` and manage `commit`/`rollback` explicitly.

For this bulk-insert pattern, the first option is preferable.
</issue_to_address>
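The reviewer's preferred fix can be illustrated with a minimal stand-in for the commit-on-exit semantics. `FakeSession` below is a toy mimicking the behavior, not SQLAlchemy's real `AsyncSession` API:

```python
import asyncio
from contextlib import asynccontextmanager


class FakeSession:
    """Toy stand-in mimicking the commit-on-exit behavior of
    SQLAlchemy's AsyncSession.begin(); not the real API surface."""

    def __init__(self):
        self.committed = 0
        self.items = []

    def add(self, item):
        self.items.append(item)

    async def commit(self):
        self.committed += 1

    @asynccontextmanager
    async def begin(self):
        yield self
        await self.commit()  # begin() commits once on clean exit


async def restore(session, new_documents, new_media):
    async with session.begin():
        for doc in new_documents:
            session.add(doc)
        for media in new_media:
            session.add(media)
        # no explicit session.commit(): adding one here would commit twice


session = FakeSession()
asyncio.run(restore(session, ["doc1"], ["media1"]))
print(session.committed)  # 1 — the single commit comes from begin()'s exit
```

This demonstrates why the explicit `await session.commit()` inside the `begin()` block is redundant: the context manager already commits on successful exit and rolls back on exception.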

### Comment 2
<location path="astrbot/dashboard/routes/knowledge_base.py" line_range="47-52" />
<code_context>
         self.retrieval_manager = None
         self.upload_progress = {}  # 存储上传进度 {task_id: {status, file_index, file_total, stage, current, total}}
         self.upload_tasks = {}  # 存储后台上传任务 {task_id: {"status", "result", "error"}}
+        self.package_tasks = {}
+        self.package_progress = {}
+        self.package_export_dir = Path(get_astrbot_temp_path()) / "kb_package_exports"
+        self.package_upload_dir = Path(get_astrbot_temp_path()) / "kb_package_uploads"
+        self.package_export_dir.mkdir(parents=True, exist_ok=True)
+        self.package_upload_dir.mkdir(parents=True, exist_ok=True)

         # 注册路由
</code_context>
<issue_to_address>
**suggestion (performance):** Consider lifecycle/cleanup for package task and temp file state to avoid unbounded growth.

The new package-related state (`package_tasks`, `package_progress`) and directories (`kb_package_exports`, `kb_package_uploads`) only ever grow and are never cleaned up, which can cause unbounded memory and disk usage on long-lived instances. Please add some form of lifecycle management (e.g., TTL-based pruning of finished/failed tasks, periodic deletion of old export/upload files, or cleanup once a client retrieves the result), or at least a simple size/age cap to bound resource usage.

Suggested implementation:

```python
import os
import time
from pathlib import Path

from astrbot.core.knowledge_base.package_io import (
    KnowledgeBasePackageExporter,
    KnowledgeBasePackageImporter,
)
from astrbot.core.provider.provider import EmbeddingProvider, RerankProvider
from astrbot.core.utils.astrbot_path import get_astrbot_temp_path

```

```python
        self.retrieval_manager = None
        self.upload_progress = {}  # upload progress store {task_id: {status, file_index, file_total, stage, current, total}}
        self.upload_tasks = {}  # background upload tasks {task_id: {"status", "result", "error"}}

        # Package task and temp directory state
        # {task_id: {"status", "result", "error", "created_at"}}
        self.package_tasks = {}
        # {task_id: {status, file_index, file_total, stage, current, total, "updated_at"}}
        self.package_progress = {}

        # Temp directories on disk (for exported/uploaded knowledge base packages)
        self.package_export_dir = Path(get_astrbot_temp_path()) / "kb_package_exports"
        self.package_upload_dir = Path(get_astrbot_temp_path()) / "kb_package_uploads"
        self.package_export_dir.mkdir(parents=True, exist_ok=True)
        self.package_upload_dir.mkdir(parents=True, exist_ok=True)

        # Simple lifecycle settings to keep task state and temp files bounded
        # retention for in-memory tasks and progress (seconds)
        self.package_task_retention_seconds = 24 * 60 * 60  # 24 hours
        # retention for files on disk (seconds)
        self.package_file_retention_seconds = 7 * 24 * 60 * 60  # 7 days
        # prune old files once per instantiation so the directories stay bounded
        def _cleanup_old_package_files():
            now = time.time()

            def _cleanup_dir(path: Path, ttl_seconds: int):
                if not path.exists():
                    return
                try:
                    for entry in path.iterdir():
                        try:
                            # only clean regular files and empty directories
                            stat = entry.stat()
                            age = now - stat.st_mtime
                            if age > ttl_seconds:
                                if entry.is_file():
                                    entry.unlink(missing_ok=True)
                                elif entry.is_dir():
                                    # only try to remove empty directories to avoid accidental deletion
                                    try:
                                        entry.rmdir()
                                    except OSError:
                                        # skip non-empty directories; handled by more specific logic
                                        pass
                        except FileNotFoundError:
                            # the file may already be gone in a race
                            continue
                except FileNotFoundError:
                    # the directory itself may be removed concurrently
                    return

            _cleanup_dir(self.package_export_dir, self.package_file_retention_seconds)
            _cleanup_dir(self.package_upload_dir, self.package_file_retention_seconds)

        _cleanup_old_package_files()

        # register routes

```
</issue_to_address>

### Comment 3
<location path="dashboard/src/views/knowledge-base/components/KBPackageImportDialog.vue" line_range="364-367" />
<code_context>
+  }
+}
+
+const formatDate = (value?: string) => {
+  if (!value) return '-'
+  return new Date(value).toLocaleString('zh-CN')
+}
+</script>
</code_context>
<issue_to_address>
**suggestion:** Consider using the current UI locale instead of hardcoded 'zh-CN' when formatting dates.

`formatDate` always uses `toLocaleString('zh-CN')`, so dates won’t match the UI language when the dashboard is localized. Instead, derive the locale from your i18n configuration or the browser (or omit the locale argument) so the date format follows the active language.

```suggestion
const formatDate = (value?: string) => {
  if (!value) return '-'
  // Use the environment's current locale instead of hardcoding 'zh-CN'
  return new Date(value).toLocaleString()
}
```
</issue_to_address>

### Comment 4
<location path="astrbot/core/knowledge_base/package_io.py" line_range="281" />
<code_context>
+                zf.write(file_path, f"{archive_prefix}/{rel_path}")
+
+
+class KnowledgeBasePackageImporter:
+    def __init__(self, kb_manager: "KnowledgeBaseManager") -> None:
+        self.kb_manager = kb_manager
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring `KnowledgeBasePackageImporter` by introducing small typed result/match/path helpers and sync core functions so its logic is decomposed into simpler, testable pieces instead of one large, multi-responsibility class.

The main complexity comes from `KnowledgeBasePackageImporter` mixing a lot of concerns and having several “smart” helpers that hide non-trivial behavior. You can reduce cognitive load without changing behavior by:

---

### 1. Make version compatibility strongly typed

`_check_version_compatibility` returns a loosely-typed `dict` that is immediately unpacked into a dataclass. This can be simplified with a small result type:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VersionCheckResult:
    status: str
    can_import: bool
    warning: Optional[str] = None
```

Refactor `_check_version_compatibility`:

```python
def _check_version_compatibility(self, backup_version: str) -> VersionCheckResult:
    if not backup_version:
        return VersionCheckResult(status="major_diff", can_import=False)

    backup_major = _get_major_version(backup_version)
    current_major = _get_major_version(VERSION)
    if VersionComparator.compare_version(backup_major, current_major) != 0:
        return VersionCheckResult(status="major_diff", can_import=False)

    if VersionComparator.compare_version(backup_version, VERSION) != 0:
        return VersionCheckResult(
            status="minor_diff",
            can_import=True,
            warning=f"包版本为 {backup_version},当前版本为 {VERSION}",
        )

    return VersionCheckResult(status="match", can_import=True)
```

Then in `pre_check`:

```python
version_check = self._check_version_compatibility(result.backup_version)
result.version_status = version_check.status
result.can_import = version_check.can_import
if version_check.warning:
    result.warnings.append(version_check.warning)
```

This keeps behavior intact but removes dict string keys and makes the logic easier to follow and test.

---

### 2. Separate provider matching logic from building the result shape

`_collect_local_provider_matches` both discovers matches and builds a nested dict. You can split those concerns, making the discovery part a pure function and leaving the dict assembly very transparent.

```python
@dataclass
class ProviderMatch:
    source_id: str | None
    required_dimensions: int | None = None
    compatible_ids: list[str] = None
    preselected_id: str | None = None
```

Pure helper:

```python
def _find_embedding_matches(
    self, required_dim: int | None, source_id: str | None
) -> ProviderMatch:
    compatible: list[str] = []
    preselected: str | None = None

    for provider in self.kb_manager.provider_manager.embedding_provider_insts:
        if required_dim is not None and provider.get_dim() == int(required_dim):
            pid = provider.provider_config.get("id", "")
            compatible.append(pid)
            if pid == source_id:
                preselected = pid

    if preselected is None and compatible:
        preselected = compatible[0]

    return ProviderMatch(
        source_id=source_id,
        required_dimensions=required_dim,
        compatible_ids=compatible,
        preselected_id=preselected,
    )
```

Then `_collect_local_provider_matches` becomes mostly mapping:

```python
def _collect_local_provider_matches(
    self, provider_summary: dict[str, Any]
) -> dict[str, Any]:
    embedding_required_dim = provider_summary.get("embedding", {}).get("dimensions")
    embedding_source_id = provider_summary.get("embedding", {}).get("provider_id")
    rerank_source_id = provider_summary.get("rerank", {}).get("provider_id")

    emb_match = self._find_embedding_matches(embedding_required_dim, embedding_source_id)
    rerank_match = self._find_rerank_matches(rerank_source_id)  # similar helper

    return {
        "embedding": {
            "required_dimensions": emb_match.required_dimensions,
            "source_provider_id": emb_match.source_id,
            "compatible_provider_ids": emb_match.compatible_ids,
            "preselected_provider_id": emb_match.preselected_id,
        },
        "rerank": {
            "source_provider_id": rerank_match.source_id,
            "compatible_provider_ids": rerank_match.compatible_ids,
            "preselected_provider_id": rerank_match.preselected_id,
        },
    }
```

This keeps the external dict format the same, but localizes the matching logic into small, testable units.

---

### 3. Split “compute path” from “copy/extract file”

`_restore_runtime_tree` currently both computes rewritten paths and performs extraction. You already have `_rewrite_runtime_path` for metadata; you can reuse that pattern for files to make the flow clearer.

Introduce a pure path-rewrite helper specific to runtime trees:

```python
def _rewrite_runtime_tree_path(
    self,
    archive_name: str,
    prefix: str,
    target_root: Path,
    old_kb_id: str,
    new_kb_id: str,
) -> Path | None:
    if not archive_name.startswith(prefix) or archive_name == prefix:
        return None

    rel_path = PurePosixPath(archive_name[len(prefix):])
    parts = list(rel_path.parts)
    if len(parts) >= 2 and parts[0] in {"medias", "files"} and parts[1] == old_kb_id:
        parts[1] = new_kb_id
    elif len(parts) >= 1 and parts[0] == old_kb_id:
        parts[0] = new_kb_id

    return target_root / Path(*parts)
```

Then `_restore_runtime_tree` becomes:

```python
def _restore_runtime_tree(
    self,
    zf: zipfile.ZipFile,
    prefix: str,
    target_root: Path,
    old_kb_id: str,
    new_kb_id: str,
) -> None:
    for name in zf.namelist():
        target_path = self._rewrite_runtime_tree_path(
            name, prefix, target_root, old_kb_id, new_kb_id
        )
        if target_path is None:
            continue
        target_path.parent.mkdir(parents=True, exist_ok=True)
        with zf.open(name) as src, open(target_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
```

This doesn’t change behavior, but it pulls non-trivial path rules into a pure function that can be tested independently, and makes `_restore_runtime_tree` “just” about iterating and copying.

---

### 4. Reduce async surface area for pure/sync helpers

Several methods are `async` but do synchronous work only (`_write_runtime_file`, `_write_runtime_tree`, `_rewrite_doc_store_metadata` except DB open, `_restore_kb_metadata` except ORM usage). You can keep the public async API but make internal helpers synchronous and wrap them from async orchestrators.

Example for `_write_runtime_file` / `_write_runtime_tree`:

```python
def _write_runtime_file_sync(
    self,
    zf: zipfile.ZipFile,
    file_path: Path,
    archive_path: str,
) -> None:
    if file_path.exists():
        zf.write(file_path, archive_path)

async def _write_runtime_file(
    self,
    zf: zipfile.ZipFile,
    file_path: Path,
    archive_path: str,
) -> None:
    # thin async wrapper; easier to test _write_runtime_file_sync
    self._write_runtime_file_sync(zf, file_path, archive_path)
```

Same pattern for `_write_runtime_tree` and `_rewrite_doc_store_metadata` (core logic sync, async wrapper used by higher-level flow). This keeps your current async usage but makes the logic within each step simpler and easier to test without an event loop.

---

These refactors are incremental and do not change the external behavior or data formats, but they make the code less monolithic and reduce the cognitive complexity of the importer.
</issue_to_address>


Comment thread astrbot/core/knowledge_base/package_io.py
Comment on lines +47 to +52
self.package_tasks = {}
self.package_progress = {}
self.package_export_dir = Path(get_astrbot_temp_path()) / "kb_package_exports"
self.package_upload_dir = Path(get_astrbot_temp_path()) / "kb_package_uploads"
self.package_export_dir.mkdir(parents=True, exist_ok=True)
self.package_upload_dir.mkdir(parents=True, exist_ok=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (performance): Consider lifecycle/cleanup for package task and temp file state to avoid unbounded growth.

The new package-related state (package_tasks, package_progress) and directories (kb_package_exports, kb_package_uploads) only ever grow and are never cleaned up, which can cause unbounded memory and disk usage on long-lived instances. Please add some form of lifecycle management (e.g., TTL-based pruning of finished/failed tasks, periodic deletion of old export/upload files, or cleanup once a client retrieves the result), or at least a simple size/age cap to bound resource usage.

Suggested implementation:

import os
import time
from pathlib import Path

from astrbot.core.knowledge_base.package_io import (
    KnowledgeBasePackageExporter,
    KnowledgeBasePackageImporter,
)
from astrbot.core.provider.provider import EmbeddingProvider, RerankProvider
from astrbot.core.utils.astrbot_path import get_astrbot_temp_path
        self.retrieval_manager = None
        self.upload_progress = {}  # 存储上传进度 {task_id: {status, file_index, file_total, stage, current, total}}
        self.upload_tasks = {}  # 存储后台上传任务 {task_id: {"status", "result", "error"}}

        # 包相关任务与临时目录状态
        # {task_id: {"status", "result", "error", "created_at"}}
        self.package_tasks = {}
        # {task_id: {status, file_index, file_total, stage, current, total, "updated_at"}}
        self.package_progress = {}

        # 磁盘临时目录(用于导出/上传知识库包)
        self.package_export_dir = Path(get_astrbot_temp_path()) / "kb_package_exports"
        self.package_upload_dir = Path(get_astrbot_temp_path()) / "kb_package_uploads"
        self.package_export_dir.mkdir(parents=True, exist_ok=True)
        self.package_upload_dir.mkdir(parents=True, exist_ok=True)

        # 简单生命周期管理配置,防止任务状态和临时文件无限增长
        # 内存中任务和进度保留时间(秒)
        self.package_task_retention_seconds = 24 * 60 * 60  # 24 小时
        # 磁盘文件保留时间(秒)
        self.package_file_retention_seconds = 7 * 24 * 60 * 60  # 7 天
        # 每次实例化时做一次旧文件清理,避免目录无限增长
        def _cleanup_old_package_files():
            now = time.time()

            def _cleanup_dir(path: Path, ttl_seconds: int):
                if not path.exists():
                    return
                try:
                    for entry in path.iterdir():
                        try:
                            # 只清理普通文件和空目录
                            stat = entry.stat()
                            age = now - stat.st_mtime
                            if age > ttl_seconds:
                                if entry.is_file():
                                    entry.unlink(missing_ok=True)
                                elif entry.is_dir():
                                    # 只尝试删除空目录,避免误删
                                    try:
                                        entry.rmdir()
                                    except OSError:
                                        # 非空目录忽略,由更具体逻辑处理
                                        pass
                        except FileNotFoundError:
                            # 竞态情况下文件可能已经被删除
                            continue
                except FileNotFoundError:
                    # 目录本身在并发场景可能被删除
                    return

            _cleanup_dir(self.package_export_dir, self.package_file_retention_seconds)
            _cleanup_dir(self.package_upload_dir, self.package_file_retention_seconds)

        _cleanup_old_package_files()

        # 注册路由

Comment on lines +364 to +367
const formatDate = (value?: string) => {
if (!value) return '-'
return new Date(value).toLocaleString('zh-CN')
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Consider using the current UI locale instead of hardcoded 'zh-CN' when formatting dates.

formatDate always uses toLocaleString('zh-CN'), so dates won’t match the UI language when the dashboard is localized. Instead, derive the locale from your i18n configuration or the browser (or omit the locale argument) so the date format follows the active language.

Suggested change
const formatDate = (value?: string) => {
if (!value) return '-'
return new Date(value).toLocaleString('zh-CN')
}
const formatDate = (value?: string) => {
if (!value) return '-'
// Use the environment's current locale instead of hardcoding 'zh-CN'
return new Date(value).toLocaleString()
}

zf.write(file_path, f"{archive_prefix}/{rel_path}")


class KnowledgeBasePackageImporter:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (complexity): Consider refactoring KnowledgeBasePackageImporter by introducing small typed result/match/path helpers and sync core functions so its logic is decomposed into simpler, testable pieces instead of one large, multi-responsibility class.

The main complexity comes from KnowledgeBasePackageImporter mixing a lot of concerns and having several “smart” helpers that hide non-trivial behavior. You can reduce cognitive load without changing behavior by:


1. Make version compatibility strongly typed

_check_version_compatibility returns a loosely-typed dict that is immediately unpacked into a dataclass. This can be simplified with a small result type:

from dataclasses import dataclass
from typing import Optional

@dataclass
class VersionCheckResult:
    status: str
    can_import: bool
    warning: Optional[str] = None

Refactor _check_version_compatibility:

```python
def _check_version_compatibility(self, backup_version: str) -> VersionCheckResult:
    if not backup_version:
        return VersionCheckResult(status="major_diff", can_import=False)

    backup_major = _get_major_version(backup_version)
    current_major = _get_major_version(VERSION)
    if VersionComparator.compare_version(backup_major, current_major) != 0:
        return VersionCheckResult(status="major_diff", can_import=False)

    if VersionComparator.compare_version(backup_version, VERSION) != 0:
        return VersionCheckResult(
            status="minor_diff",
            can_import=True,
            warning=f"包版本为 {backup_version}，当前版本为 {VERSION}。",
        )

    return VersionCheckResult(status="match", can_import=True)
```

Then in pre_check:

```python
version_check = self._check_version_compatibility(result.backup_version)
result.version_status = version_check.status
result.can_import = version_check.can_import
if version_check.warning:
    result.warnings.append(version_check.warning)
```

This keeps behavior intact but removes dict string keys and makes the logic easier to follow and test.


2. Separate provider matching logic from building the result shape

_collect_local_provider_matches both discovers matches and builds a nested dict. You can split those concerns, making the discovery part a pure function and leaving the dict assembly very transparent.

```python
from dataclasses import dataclass, field

@dataclass
class ProviderMatch:
    source_id: str | None
    required_dimensions: int | None = None
    # use a default_factory rather than a bare None so the field is a real list
    compatible_ids: list[str] = field(default_factory=list)
    preselected_id: str | None = None
```

Pure helper:

```python
def _find_embedding_matches(
    self, required_dim: int | None, source_id: str | None
) -> ProviderMatch:
    compatible: list[str] = []
    preselected: str | None = None

    for provider in self.kb_manager.provider_manager.embedding_provider_insts:
        if required_dim is not None and provider.get_dim() == int(required_dim):
            pid = provider.provider_config.get("id", "")
            compatible.append(pid)
            if pid == source_id:
                preselected = pid

    if preselected is None and compatible:
        preselected = compatible[0]

    return ProviderMatch(
        source_id=source_id,
        required_dimensions=required_dim,
        compatible_ids=compatible,
        preselected_id=preselected,
    )
```

Then _collect_local_provider_matches becomes mostly mapping:

```python
def _collect_local_provider_matches(
    self, provider_summary: dict[str, Any]
) -> dict[str, Any]:
    embedding_required_dim = provider_summary.get("embedding", {}).get("dimensions")
    embedding_source_id = provider_summary.get("embedding", {}).get("provider_id")
    rerank_source_id = provider_summary.get("rerank", {}).get("provider_id")

    emb_match = self._find_embedding_matches(embedding_required_dim, embedding_source_id)
    rerank_match = self._find_rerank_matches(rerank_source_id)  # similar helper

    return {
        "embedding": {
            "required_dimensions": emb_match.required_dimensions,
            "source_provider_id": emb_match.source_id,
            "compatible_provider_ids": emb_match.compatible_ids,
            "preselected_provider_id": emb_match.preselected_id,
        },
        "rerank": {
            "source_provider_id": rerank_match.source_id,
            "compatible_provider_ids": rerank_match.compatible_ids,
            "preselected_provider_id": rerank_match.preselected_id,
        },
    }
```

This keeps the external dict format the same, but localizes the matching logic into small, testable units.


3. Split “compute path” from “copy/extract file”

_restore_runtime_tree currently both computes rewritten paths and performs extraction. You already have _rewrite_runtime_path for metadata; you can reuse that pattern for files to make the flow clearer.

Introduce a pure path-rewrite helper specific to runtime trees:

```python
def _rewrite_runtime_tree_path(
    self,
    archive_name: str,
    prefix: str,
    target_root: Path,
    old_kb_id: str,
    new_kb_id: str,
) -> Path | None:
    if not archive_name.startswith(prefix) or archive_name == prefix:
        return None

    rel_path = PurePosixPath(archive_name[len(prefix):])
    parts = list(rel_path.parts)
    if len(parts) >= 2 and parts[0] in {"medias", "files"} and parts[1] == old_kb_id:
        parts[1] = new_kb_id
    elif len(parts) >= 1 and parts[0] == old_kb_id:
        parts[0] = new_kb_id

    return target_root / Path(*parts)
```

Then _restore_runtime_tree becomes:

```python
def _restore_runtime_tree(
    self,
    zf: zipfile.ZipFile,
    prefix: str,
    target_root: Path,
    old_kb_id: str,
    new_kb_id: str,
) -> None:
    for name in zf.namelist():
        target_path = self._rewrite_runtime_tree_path(
            name, prefix, target_root, old_kb_id, new_kb_id
        )
        if target_path is None:
            continue
        target_path.parent.mkdir(parents=True, exist_ok=True)
        with zf.open(name) as src, open(target_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
```

This doesn’t change behavior, but it pulls non-trivial path rules into a pure function that can be tested independently, and makes _restore_runtime_tree “just” about iterating and copying.


4. Reduce async surface area for pure/sync helpers

Several methods are async but do synchronous work only (_write_runtime_file, _write_runtime_tree, _rewrite_doc_store_metadata except DB open, _restore_kb_metadata except ORM usage). You can keep the public async API but make internal helpers synchronous and wrap them from async orchestrators.

Example for _write_runtime_file / _write_runtime_tree:

```python
def _write_runtime_file_sync(
    self,
    zf: zipfile.ZipFile,
    file_path: Path,
    archive_path: str,
) -> None:
    if file_path.exists():
        zf.write(file_path, archive_path)

async def _write_runtime_file(
    self,
    zf: zipfile.ZipFile,
    file_path: Path,
    archive_path: str,
) -> None:
    # thin async wrapper; easier to test _write_runtime_file_sync
    self._write_runtime_file_sync(zf, file_path, archive_path)
```

Same pattern for _write_runtime_tree and _rewrite_doc_store_metadata (core logic sync, async wrapper used by higher-level flow). This keeps your current async usage but makes the logic within each step simpler and easier to test without an event loop.


These refactors are incremental and do not change the external behavior or data formats, but they make the code less monolithic and reduce the cognitive complexity of the importer.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4185498ae5


Comment on lines +617 to +620
```python
target_path = target_root / Path(*parts)
target_path.parent.mkdir(parents=True, exist_ok=True)
with zf.open(name) as src, open(target_path, "wb") as dst:
    shutil.copyfileobj(src, dst)
```

P1: Reject zip entries that escape runtime directories

The runtime restore code writes every archive member under runtime/medias/ and runtime/files/ directly to target_root / Path(*parts) without validating that the resolved path stays inside target_root. A crafted package containing entries like runtime/files/../../../../tmp/pwned will pass prefix checks and be written outside the KB directory during import, allowing an authenticated importer to overwrite arbitrary files on the server.

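One way to close this hole is to resolve each candidate path and verify it stays inside the target root before extracting. A minimal sketch (hypothetical helper, not the PR's actual code), assuming Python 3.9+ for `Path.is_relative_to`:

```python
from pathlib import Path


def safe_target_path(target_root: Path, *parts: str) -> Path:
    """Resolve parts under target_root, rejecting entries that escape it."""
    root = target_root.resolve()
    candidate = (root / Path(*parts)).resolve()
    # A crafted entry like "../../tmp/pwned" resolves outside root and is rejected.
    if not candidate.is_relative_to(root):
        raise ValueError(f"zip entry escapes target directory: {'/'.join(parts)}")
    return candidate
```

The importer would call something like this for every archive member before opening the destination file, failing the whole import on the first bad entry.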

Contributor Author

Addressed in 355cc70, thanks.

Comment on lines +347 to +348
```python
if not check_result.valid:
    raise ValueError(check_result.error or "知识库包无效")
```

P2: Enforce pre-check importability before running import

import_kb only checks check_result.valid and ignores check_result.can_import, so packages flagged as non-importable (for example, version-incompatible ones where pre-check returns major_diff) can still be imported by calling the API directly. This bypasses the server-side compatibility gate that the pre-check computes and can lead to importing runtime data the current version is not meant to accept.

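A small guard at the top of import_kb would close this gap. A sketch under the assumption that the pre-check result exposes `valid`, `can_import`, and `error` fields (names taken from the surrounding diff; the helper itself is hypothetical):

```python
def ensure_importable(check_result) -> None:
    # Mirror the pre-check gate server-side: reject invalid packages and
    # packages the pre-check marked non-importable (e.g. a major version diff).
    if not check_result.valid:
        raise ValueError(check_result.error or "Invalid knowledge base package")
    if not check_result.can_import:
        raise ValueError("Package failed the compatibility pre-check and cannot be imported")
```

Calling this instead of checking only `valid` means a direct API call can no longer bypass the version gate computed by the pre-check.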

Contributor Author

Addressed in 355cc70, thanks.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces import and export of knowledge base packages. The implementation covers backend APIs, background tasks for long-running operations, and frontend UI components. The code is well structured, with a clear separation of concerns. I found one high-severity issue related to path rewriting during import that can break media file links, plus several medium-severity issues around error handling and internationalization. Overall this is a great new feature, and once these issues are fixed it will be very solid.

Comment on lines +680 to +696
```python
new_item = KBMedia(
    media_id=str(uuid.uuid4()),
    doc_id=doc_id_map[media["doc_id"]],
    kb_id=new_kb.kb_id,
    media_type=media["media_type"],
    file_name=media["file_name"],
    file_path=self._rewrite_runtime_path(
        media.get("file_path", ""),
        kb_dir=kb_dir,
        storage_kind="medias",
        old_kb_id=old_kb_id,
        new_kb_id=new_kb.kb_id,
    ),
    file_size=media["file_size"],
    mime_type=media["mime_type"],
    created_at=self._parse_datetime(media.get("created_at")),
)
```

high

During import, media item file paths are not rewritten correctly. According to kb_helper.py, the doc_id is part of the media file path. Although doc_id is remapped on import, that remapping is not reflected in the file paths stored in the database.

The _rewrite_runtime_path function is called here to update file_path, but it does not account for the doc_id remapping, which will break media file links in the imported knowledge base.

A similar problem exists in _restore_runtime_tree: when restoring files, the doc_id component of the directory structure is not remapped, producing a mismatch between filesystem paths and database records.

To fix this, you should pass doc_id_map to both _rewrite_runtime_path and _restore_runtime_tree, and use it to replace the old doc_id in file paths with the new one.
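A sketch of what threading the doc_id remap through the path rewrite could look like, as this comment proposes (hypothetical standalone helper; the actual methods take more context):

```python
from pathlib import PurePosixPath


def rewrite_media_path(raw_path: str, new_kb_id: str, doc_id_map: dict[str, str]) -> str:
    # Rewrite both the kb_id and doc_id segments of a
    # .../medias/<kb_id>/<doc_id>/<file> path (illustrative only).
    parts = list(PurePosixPath(raw_path).parts)
    if "medias" not in parts:
        raise ValueError(f"unexpected media path: {raw_path}")
    index = parts.index("medias")
    if len(parts) < index + 4:
        raise ValueError(f"unexpected media path: {raw_path}")
    parts[index + 1] = new_kb_id
    parts[index + 2] = doc_id_map.get(parts[index + 2], parts[index + 2])
    return str(PurePosixPath(*parts))
```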

Contributor Author

The actual media save path is `medias/<kb_id>/<doc_id>/<media_id>.` to begin with; see kb_helper.py:452. On import, both the file_path stored in the database and the on-disk restore consistently keep the old doc_id directory level and only rewrite the kb_id, so the two stay in sync; see package_io.py:686 and package_io.py:602. It's not elegant, but it doesn't mean the links are guaranteed to break.

```python
    suffix = parts[index + 2 :]
    return (kb_dir / storage_kind / new_kb_id / Path(*suffix)).as_posix()

return ""
```

medium

When _rewrite_runtime_path cannot find the expected path structure to rewrite, it returns an empty string. This can silently corrupt data: the file path of a media item or document ends up blanked out in the database without any warning.

A more robust approach here is to raise a ValueError. That makes the import fail, which is better than a partially successful import with corrupted data. The calling function import_kb already has error handling and cleanup logic for this scenario.

Suggested change:

```python
raise ValueError(f"Could not find expected structure in path to rewrite: {raw_path}")
```


```typescript
const formatDate = (value?: string) => {
  if (!value) return '-'
  return new Date(value).toLocaleString('zh-CN')
}
```

medium

Date formatting uses toLocaleString('zh-CN'), which hardcodes the locale to Chinese. In an internationalized application this does not respect the user's chosen language.

You should use the date formatting features of the vue-i18n library already used in the project, which ensures dates are displayed according to the user's locale.

For example, you can use the d function from useI18n:

```typescript
import { useI18n } from 'vue-i18n'
// ...
const { d } = useI18n()
// ...
return d(new Date(value), 'short')
```

@Soulter Soulter force-pushed the master branch 2 times, most recently from faf411f to 0068960 Compare April 19, 2026 09:50

Labels

- area:core: The bug / feature is about astrbot's core, backend
- area:webui: The bug / feature is about webui (dashboard) of astrbot.
- feature:knowledge-base: The bug / feature is about knowledge base
- size:XXL: This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Add knowledge base import/export functionality

1 participant