feat(kb): add knowledge base package import and export #6488
catDforD wants to merge 3 commits into AstrBotDevs:master
Conversation
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces comprehensive knowledge base package import and export for AstrBot, enabling users to migrate a knowledge base in full, including its configuration, runtime data (such as vector indexes and document stores), and associated media files, so that an imported knowledge base is immediately usable without rebuilding its data. It adds new backend APIs and frontend UI components to manage these operations, along with robust pre-checks and import logic to handle compatibility and data integrity.
Hey - I've found 4 issues, and left some high level feedback:

- In `KBPackageImportDialog.vue`, when `version_status` is `major_diff` (`can_import = false`) the UI still shows the same informational `versionAlertMessage` as a version match and proceeds to the confirm screen with a disabled button; consider explicitly surfacing an incompatibility error and stopping at the check step instead of showing a normal confirm form.
- Both exported zip files and uploaded package files are accumulated under `kb_package_exports` / `kb_package_uploads` with no cleanup strategy; consider adding periodic pruning or a TTL-based cleanup mechanism to prevent unbounded growth of temp storage.
- The `formatDate` helper in `KBPackageImportDialog.vue` hardcodes the `'zh-CN'` locale, which can be confusing for non-Chinese users; consider using the current UI locale or a more generic formatting strategy.
## Individual Comments
### Comment 1
<location path="astrbot/core/knowledge_base/package_io.py" line_range="181-190" />
<code_context>
+ )
+ new_media.append(new_item)
+
+ async with self.kb_manager.kb_db.get_db() as session:
+ async with session.begin():
+ for doc in new_documents:
+ session.add(doc)
+ for media in new_media:
+ session.add(media)
+ await session.commit()
+
+ def _rewrite_runtime_path(
</code_context>
<issue_to_address>
**issue:** Avoid manual commit inside `async with session.begin()` context manager.
Inside `_restore_kb_metadata`, `async with session.begin()` already handles commit/rollback, so `await session.commit()` here is redundant and potentially confusing. Please either:
- keep `async with session.begin()` and remove `await session.commit()`, or
- drop `session.begin()` and manage `commit`/`rollback` explicitly.
For this bulk-insert pattern, the first option is preferable.
</issue_to_address>
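The commit-on-success, rollback-on-error semantics described above can be seen in a minimal stdlib sketch: Python's `sqlite3` connection behaves as a transaction context manager much like SQLAlchemy's `session.begin()`, so an explicit `commit()` inside the block is redundant. This is an illustrative analogy, not code from the PR:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, name TEXT)")

# The connection works as a transaction context manager: it commits on
# normal exit and rolls back if the block raises, so an explicit
# conn.commit() inside the block would be redundant -- the same shape
# as `async with session.begin()` in SQLAlchemy.
with conn:
    for name in ("a.txt", "b.txt"):
        conn.execute("INSERT INTO docs (name) VALUES (?)", (name,))

row_count = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
```

The same reasoning applies to the bulk-insert loop in `_restore_kb_metadata`: keep the `begin()` block and drop the manual commit.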
### Comment 2
<location path="astrbot/dashboard/routes/knowledge_base.py" line_range="47-52" />
<code_context>
self.retrieval_manager = None
self.upload_progress = {}  # upload progress {task_id: {status, file_index, file_total, stage, current, total}}
self.upload_tasks = {}  # background upload tasks {task_id: {"status", "result", "error"}}
+ self.package_tasks = {}
+ self.package_progress = {}
+ self.package_export_dir = Path(get_astrbot_temp_path()) / "kb_package_exports"
+ self.package_upload_dir = Path(get_astrbot_temp_path()) / "kb_package_uploads"
+ self.package_export_dir.mkdir(parents=True, exist_ok=True)
+ self.package_upload_dir.mkdir(parents=True, exist_ok=True)
# Register routes
</code_context>
<issue_to_address>
**suggestion (performance):** Consider lifecycle/cleanup for package task and temp file state to avoid unbounded growth.
The new package-related state (`package_tasks`, `package_progress`) and directories (`kb_package_exports`, `kb_package_uploads`) only ever grow and are never cleaned up, which can cause unbounded memory and disk usage on long-lived instances. Please add some form of lifecycle management (e.g., TTL-based pruning of finished/failed tasks, periodic deletion of old export/upload files, or cleanup once a client retrieves the result), or at least a simple size/age cap to bound resource usage.
Suggested implementation:
```python
import os
import time
from pathlib import Path
from astrbot.core.knowledge_base.package_io import (
KnowledgeBasePackageExporter,
KnowledgeBasePackageImporter,
)
from astrbot.core.provider.provider import EmbeddingProvider, RerankProvider
from astrbot.core.utils.astrbot_path import get_astrbot_temp_path
```
```python
self.retrieval_manager = None
self.upload_progress = {}  # upload progress {task_id: {status, file_index, file_total, stage, current, total}}
self.upload_tasks = {}  # background upload tasks {task_id: {"status", "result", "error"}}
# Package-related task and temp directory state
# {task_id: {"status", "result", "error", "created_at"}}
self.package_tasks = {}
# {task_id: {status, file_index, file_total, stage, current, total, "updated_at"}}
self.package_progress = {}
# On-disk temp directories for exported/uploaded knowledge base packages
self.package_export_dir = Path(get_astrbot_temp_path()) / "kb_package_exports"
self.package_upload_dir = Path(get_astrbot_temp_path()) / "kb_package_uploads"
self.package_export_dir.mkdir(parents=True, exist_ok=True)
self.package_upload_dir.mkdir(parents=True, exist_ok=True)

# Simple lifecycle settings to keep task state and temp files bounded
# In-memory task/progress retention (seconds)
self.package_task_retention_seconds = 24 * 60 * 60  # 24 hours
# On-disk file retention (seconds)
self.package_file_retention_seconds = 7 * 24 * 60 * 60  # 7 days

# Prune old files once per instantiation so the directories stay bounded
def _cleanup_old_package_files():
    now = time.time()

    def _cleanup_dir(path: Path, ttl_seconds: int):
        if not path.exists():
            return
        try:
            for entry in path.iterdir():
                try:
                    # Only remove regular files and empty directories
                    stat = entry.stat()
                    age = now - stat.st_mtime
                    if age > ttl_seconds:
                        if entry.is_file():
                            entry.unlink(missing_ok=True)
                        elif entry.is_dir():
                            # Only try to remove empty directories,
                            # to avoid deleting data by mistake
                            try:
                                entry.rmdir()
                            except OSError:
                                # Non-empty directories are left for
                                # more specific cleanup logic
                                pass
                except FileNotFoundError:
                    # Entry may already be gone in a race
                    continue
        except FileNotFoundError:
            # The directory itself may be removed concurrently
            return

    _cleanup_dir(self.package_export_dir, self.package_file_retention_seconds)
    _cleanup_dir(self.package_upload_dir, self.package_file_retention_seconds)

_cleanup_old_package_files()
# Register routes
```
</issue_to_address>
### Comment 3
<location path="dashboard/src/views/knowledge-base/components/KBPackageImportDialog.vue" line_range="364-367" />
<code_context>
+ }
+}
+
+const formatDate = (value?: string) => {
+ if (!value) return '-'
+ return new Date(value).toLocaleString('zh-CN')
+}
+</script>
</code_context>
<issue_to_address>
**suggestion:** Consider using the current UI locale instead of hardcoded 'zh-CN' when formatting dates.
`formatDate` always uses `toLocaleString('zh-CN')`, so dates won’t match the UI language when the dashboard is localized. Instead, derive the locale from your i18n configuration or the browser (or omit the locale argument) so the date format follows the active language.
```suggestion
const formatDate = (value?: string) => {
if (!value) return '-'
// Use the environment's current locale instead of hardcoding 'zh-CN'
return new Date(value).toLocaleString()
}
```
</issue_to_address>
### Comment 4
<location path="astrbot/core/knowledge_base/package_io.py" line_range="281" />
<code_context>
+ zf.write(file_path, f"{archive_prefix}/{rel_path}")
+
+
+class KnowledgeBasePackageImporter:
+ def __init__(self, kb_manager: "KnowledgeBaseManager") -> None:
+ self.kb_manager = kb_manager
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring `KnowledgeBasePackageImporter` by introducing small typed result/match/path helpers and sync core functions so its logic is decomposed into simpler, testable pieces instead of one large, multi-responsibility class.
The main complexity comes from `KnowledgeBasePackageImporter` mixing a lot of concerns and having several “smart” helpers that hide non-trivial behavior. You can reduce cognitive load without changing behavior by:
---
### 1. Make version compatibility strongly typed
`_check_version_compatibility` returns a loosely-typed `dict` that is immediately unpacked into a dataclass. This can be simplified with a small result type:
```python
from dataclasses import dataclass
from typing import Optional
@dataclass
class VersionCheckResult:
status: str
can_import: bool
warning: Optional[str] = None
```
Refactor `_check_version_compatibility`:
```python
def _check_version_compatibility(self, backup_version: str) -> VersionCheckResult:
if not backup_version:
return VersionCheckResult(status="major_diff", can_import=False)
backup_major = _get_major_version(backup_version)
current_major = _get_major_version(VERSION)
if VersionComparator.compare_version(backup_major, current_major) != 0:
return VersionCheckResult(status="major_diff", can_import=False)
if VersionComparator.compare_version(backup_version, VERSION) != 0:
return VersionCheckResult(
status="minor_diff",
can_import=True,
warning=f"包版本为 {backup_version},当前版本为 {VERSION}。",
)
return VersionCheckResult(status="match", can_import=True)
```
Then in `pre_check`:
```python
version_check = self._check_version_compatibility(result.backup_version)
result.version_status = version_check.status
result.can_import = version_check.can_import
if version_check.warning:
result.warnings.append(version_check.warning)
```
This keeps behavior intact but removes dict string keys and makes the logic easier to follow and test.
---
### 2. Separate provider matching logic from building the result shape
`_collect_local_provider_matches` both discovers matches and builds a nested dict. You can split those concerns, making the discovery part a pure function and leaving the dict assembly very transparent.
```python
@dataclass
class ProviderMatch:
source_id: str | None
required_dimensions: int | None = None
    compatible_ids: list[str] | None = None
preselected_id: str | None = None
```
Pure helper:
```python
def _find_embedding_matches(
self, required_dim: int | None, source_id: str | None
) -> ProviderMatch:
compatible: list[str] = []
preselected: str | None = None
for provider in self.kb_manager.provider_manager.embedding_provider_insts:
if required_dim is not None and provider.get_dim() == int(required_dim):
pid = provider.provider_config.get("id", "")
compatible.append(pid)
if pid == source_id:
preselected = pid
if preselected is None and compatible:
preselected = compatible[0]
return ProviderMatch(
source_id=source_id,
required_dimensions=required_dim,
compatible_ids=compatible,
preselected_id=preselected,
)
```
Then `_collect_local_provider_matches` becomes mostly mapping:
```python
def _collect_local_provider_matches(
self, provider_summary: dict[str, Any]
) -> dict[str, Any]:
embedding_required_dim = provider_summary.get("embedding", {}).get("dimensions")
embedding_source_id = provider_summary.get("embedding", {}).get("provider_id")
rerank_source_id = provider_summary.get("rerank", {}).get("provider_id")
emb_match = self._find_embedding_matches(embedding_required_dim, embedding_source_id)
rerank_match = self._find_rerank_matches(rerank_source_id) # similar helper
return {
"embedding": {
"required_dimensions": emb_match.required_dimensions,
"source_provider_id": emb_match.source_id,
"compatible_provider_ids": emb_match.compatible_ids,
"preselected_provider_id": emb_match.preselected_id,
},
"rerank": {
"source_provider_id": rerank_match.source_id,
"compatible_provider_ids": rerank_match.compatible_ids,
"preselected_provider_id": rerank_match.preselected_id,
},
}
```
This keeps the external dict format the same, but localizes the matching logic into small, testable units.
---
### 3. Split “compute path” from “copy/extract file”
`_restore_runtime_tree` currently both computes rewritten paths and performs extraction. You already have `_rewrite_runtime_path` for metadata; you can reuse that pattern for files to make the flow clearer.
Introduce a pure path-rewrite helper specific to runtime trees:
```python
def _rewrite_runtime_tree_path(
self,
archive_name: str,
prefix: str,
target_root: Path,
old_kb_id: str,
new_kb_id: str,
) -> Path | None:
if not archive_name.startswith(prefix) or archive_name == prefix:
return None
rel_path = PurePosixPath(archive_name[len(prefix):])
parts = list(rel_path.parts)
if len(parts) >= 2 and parts[0] in {"medias", "files"} and parts[1] == old_kb_id:
parts[1] = new_kb_id
elif len(parts) >= 1 and parts[0] == old_kb_id:
parts[0] = new_kb_id
return target_root / Path(*parts)
```
Then `_restore_runtime_tree` becomes:
```python
def _restore_runtime_tree(
self,
zf: zipfile.ZipFile,
prefix: str,
target_root: Path,
old_kb_id: str,
new_kb_id: str,
) -> None:
for name in zf.namelist():
target_path = self._rewrite_runtime_tree_path(
name, prefix, target_root, old_kb_id, new_kb_id
)
if target_path is None:
continue
target_path.parent.mkdir(parents=True, exist_ok=True)
with zf.open(name) as src, open(target_path, "wb") as dst:
shutil.copyfileobj(src, dst)
```
This doesn’t change behavior, but it pulls non-trivial path rules into a pure function that can be tested independently, and makes `_restore_runtime_tree` “just” about iterating and copying.
---
### 4. Reduce async surface area for pure/sync helpers
Several methods are `async` but do synchronous work only (`_write_runtime_file`, `_write_runtime_tree`, `_rewrite_doc_store_metadata` except DB open, `_restore_kb_metadata` except ORM usage). You can keep the public async API but make internal helpers synchronous and wrap them from async orchestrators.
Example for `_write_runtime_file` / `_write_runtime_tree`:
```python
def _write_runtime_file_sync(
self,
zf: zipfile.ZipFile,
file_path: Path,
archive_path: str,
) -> None:
if file_path.exists():
zf.write(file_path, archive_path)
async def _write_runtime_file(
self,
zf: zipfile.ZipFile,
file_path: Path,
archive_path: str,
) -> None:
# thin async wrapper; easier to test _write_runtime_file_sync
self._write_runtime_file_sync(zf, file_path, archive_path)
```
Same pattern for `_write_runtime_tree` and `_rewrite_doc_store_metadata` (core logic sync, async wrapper used by higher-level flow). This keeps your current async usage but makes the logic within each step simpler and easier to test without an event loop.
---
These refactors are incremental and do not change the external behavior or data formats, but they make the code less monolithic and reduce the cognitive complexity of the importer.
</issue_to_address>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4185498ae5
```python
target_path = target_root / Path(*parts)
target_path.parent.mkdir(parents=True, exist_ok=True)
with zf.open(name) as src, open(target_path, "wb") as dst:
    shutil.copyfileobj(src, dst)
```
Reject zip entries that escape runtime directories
The runtime restore code writes every archive member under runtime/medias/ and runtime/files/ directly to target_root / Path(*parts) without validating that the resolved path stays inside target_root. A crafted package containing entries like runtime/files/../../../../tmp/pwned will pass prefix checks and be written outside the KB directory during import, allowing an authenticated importer to overwrite arbitrary files on the server.
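A common way to close this hole is to resolve each candidate destination and require it to stay under the extraction root before writing anything. A minimal sketch of that check (the helper name is hypothetical, not from the PR):

```python
from pathlib import Path


def resolve_safe_target(member_name: str, target_root: Path) -> Path:
    """Resolve an archive member's destination, rejecting zip-slip entries.

    member_name is the (possibly attacker-controlled) relative path taken
    from the archive; the resolved destination must stay inside target_root.
    """
    dest = (target_root / member_name).resolve()
    if not dest.is_relative_to(target_root.resolve()):  # Python 3.9+
        raise ValueError(f"zip entry escapes extraction root: {member_name}")
    return dest


# A benign entry resolves inside the root; a '../' chain would be rejected.
root = Path("/tmp/kb_import_demo")
safe = resolve_safe_target("runtime/files/doc1/a.bin", root)
```

Calling such a helper before `zf.open(name)` in `_restore_runtime_tree` would reject the `runtime/files/../../../../tmp/pwned` style entries described above.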
```python
if not check_result.valid:
    raise ValueError(check_result.error or "知识库包无效")
```
Enforce pre-check importability before running import
import_kb only checks check_result.valid and ignores check_result.can_import, so packages flagged as non-importable (for example, version-incompatible ones where pre-check returns major_diff) can still be imported by calling the API directly. This bypasses the server-side compatibility gate that the pre-check computes and can lead to importing runtime data the current version is not meant to accept.
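A guard honoring both flags could look like the following sketch; the result shape below only mirrors the pre-check fields named in this review (`valid`, `can_import`, `error`), and the helper itself is hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PreCheckResult:
    # Illustrative stand-in for the fields referenced in this review.
    valid: bool
    can_import: bool
    error: Optional[str] = None


def ensure_importable(check_result: PreCheckResult) -> None:
    """Reject invalid packages and valid-but-incompatible ones alike."""
    if not check_result.valid:
        raise ValueError(check_result.error or "invalid knowledge base package")
    if not check_result.can_import:
        raise ValueError("package version is incompatible with this instance")
```

With a guard like this at the top of `import_kb`, a `major_diff` package is rejected server-side even when the API is called directly.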
```python
new_item = KBMedia(
    media_id=str(uuid.uuid4()),
    doc_id=doc_id_map[media["doc_id"]],
    kb_id=new_kb.kb_id,
    media_type=media["media_type"],
    file_name=media["file_name"],
    file_path=self._rewrite_runtime_path(
        media.get("file_path", ""),
        kb_dir=kb_dir,
        storage_kind="medias",
        old_kb_id=old_kb_id,
        new_kb_id=new_kb.kb_id,
    ),
    file_size=media["file_size"],
    mime_type=media["mime_type"],
    created_at=self._parse_datetime(media.get("created_at")),
)
```
During import, media item file paths are not rewritten correctly. As defined in kb_helper.py, the doc_id is part of a media file's path. Although doc_id is remapped during import, that remapping is not reflected in the file paths stored in the database.
The _rewrite_runtime_path call here updates file_path, but it does not account for the doc_id remapping, which will leave media file links broken in the imported knowledge base.
A similar problem exists in _restore_runtime_tree: when restoring files, the doc_id segment of the directory structure is not remapped, causing a mismatch between filesystem paths and database records.
To fix this, pass doc_id_map into _rewrite_runtime_path and _restore_runtime_tree and use it to replace the old doc_id in file paths with the new one.
The actual media save path is already medias/<kb_id>/<doc_id>/<media_id>.; see kb_helper.py:452. On import, both the file_path stored in the database and the on-disk restore uniformly keep the old doc_id directory level and rewrite only the kb_id, so the two sides stay consistent; see package_io.py:686 and package_io.py:602. It is not elegant, but the links are not necessarily broken.
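To make the exchange above concrete, the rewrite being discussed swaps only the kb_id segment of a media path and deliberately leaves the doc_id level untouched on both the database and filesystem sides. A simplified sketch of that behavior (not the PR's actual implementation):

```python
from pathlib import PurePosixPath


def rewrite_kb_segment(path: str, old_kb_id: str, new_kb_id: str) -> str:
    """Replace only the kb_id directory level in a media path.

    Simplified model of the layout medias/<kb_id>/<doc_id>/<media_id>;
    the doc_id level is deliberately preserved, matching the behavior
    described in the reply above.
    """
    parts = [new_kb_id if p == old_kb_id else p for p in PurePosixPath(path).parts]
    return str(PurePosixPath(*parts))


rewritten = rewrite_kb_segment("medias/kb-old/doc-1/m-1.png", "kb-old", "kb-new")
```

As long as the same rule is applied to both the stored file_path and the extracted directory tree, the two stay consistent even though the old doc_id survives.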
```python
suffix = parts[index + 2 :]
return (kb_dir / storage_kind / new_kb_id / Path(*suffix)).as_posix()

return ""
```
When _rewrite_runtime_path cannot find the expected path structure to rewrite, it returns an empty string. This can silently corrupt data, because the file path of a media item or document is blanked in the database without any warning.
In this situation it is more robust to raise a ValueError, causing the import to fail, which is better than a partially successful import with corrupted data. The calling function import_kb already has error handling and cleanup logic for this scenario.
```suggestion
raise ValueError(f"Could not find expected structure in path to rewrite: {raw_path}")
```
Closes #4339
Add knowledge base package import and export to the AstrBot knowledge base. The goal is to migrate a knowledge base as completely as possible so that it is directly usable after import, rather than migrating only the basic configuration and rebuilding all runtime data.
Modifications / 改动点
Package contents:

- manifest.json
- kb_metadata.json
- runtime/doc.db
- runtime/index.faiss
- runtime/medias/...
- runtime/files/...

On import, the kb_id, document IDs, metadata references inside doc.db, and media associations are rewritten. UI strings are provided for the zh-CN, en-US, and ru-RU locales.

Main files involved:
- astrbot/core/knowledge_base/package_io.py
- astrbot/dashboard/routes/knowledge_base.py
- astrbot/dashboard/server.py
- dashboard/src/views/knowledge-base/KBList.vue
- dashboard/src/views/knowledge-base/components/SettingsTab.vue
- dashboard/src/views/knowledge-base/components/KBPackageImportDialog.vue
- dashboard/src/views/knowledge-base/components/KBPackageExportCard.vue
- tests/test_kb_package.py

Known limitations:
Package upload currently still goes through the single-file upload flow and is subject to the dashboard's global 128MB upload limit.

This is NOT a breaking change. / 这不是一个破坏性变更。
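As a sketch of how a consumer might inspect such a package, the following reads manifest.json from the archive root. Only the file layout comes from the list above; the manifest field name used in the demo is borrowed from the review discussion and may differ from the real schema:

```python
import io
import json
import zipfile


def read_manifest(package_bytes: bytes) -> dict:
    """Read manifest.json from a knowledge base package archive.

    Assumes only the layout listed above: manifest.json at the archive
    root, with runtime data under runtime/.
    """
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zf:
        with zf.open("manifest.json") as f:
            return json.load(f)


# Build a tiny in-memory package to demonstrate the round trip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("manifest.json", json.dumps({"backup_version": "4.0.0"}))
    zf.writestr("runtime/doc.db", b"")

manifest = read_manifest(buf.getvalue())
```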
Screenshots or Test Results / 运行截图或测试结果
Verification steps:
Automated verification:
Verification results:
Main feature test video:
freecompress-.mp4
default.mp4
Selected test screenshots:
Checklist / 检查清单

- If this PR adds new features, they have been discussed with the author via Issue, emails, etc. / 如果 PR 中有新加入的功能,已经通过 Issue / 邮件等方式和作者讨论过。
- My changes are well tested, and the "verification steps" and "screenshots" are provided above. / 我的更改经过了良好的测试,并已在上方提供了"验证步骤"和"运行截图"。
- I have ensured that no new dependencies are introduced, or if they are, they have been added to the appropriate locations in requirements.txt and pyproject.toml. / 我确保没有引入新依赖库,或者引入了新依赖库的同时将其添加到 requirements.txt 和 pyproject.toml 文件相应位置。
- My changes do not introduce malicious code. / 我的更改没有引入恶意代码。
Summary by Sourcery
Introduce import/export support for knowledge base packages, including backend APIs, runtime data packaging, and UI entry points.
New Features:
Enhancements:
Tests: