feat: support zip upload with error-state recovery and recursive file scanning#522

Closed
xiongjnu wants to merge 2 commits into HKUDS:main from xiongjnu:feat/zip-upload

Conversation

@xiongjnu
Contributor

Summary

Three related improvements to the knowledge base file upload pipeline:

  1. ZIP archive upload support — Users can now upload .zip files containing knowledge base documents. The archive is extracted server-side, files are validated against allowed extensions, and individual files are processed normally. Path traversal attempts (../, absolute paths) in zip entries are rejected with a 400 error.

  2. Error-state recovery — When uploading files to a knowledge base that is in an error state, the upload endpoint now falls back to a full KnowledgeBaseInitializer.process_documents() run instead of failing on the incremental DocumentAdder path.

  3. Recursive file scanning — FileTypeRouter.collect_supported_files() calls are updated to use recursive=True, so files in nested subdirectories inside raw/ are discovered during indexing, metadata collection, and reindexing.
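
The path traversal rule in item 1 can be illustrated with a minimal sketch. This is not the code from the PR: `validate_entry_name`, `list_safe_entries`, and `UnsafeZipEntry` are hypothetical names, and the check is purely lexical (it inspects entry names without extracting anything). In an API handler, the raised exception would be mapped to the 400 response.

```python
import io
import zipfile
from pathlib import PurePosixPath


class UnsafeZipEntry(ValueError):
    """Hypothetical error: an entry name could escape the extraction directory."""


def validate_entry_name(name: str) -> PurePosixPath:
    # Zip entry names normally use forward slashes; normalise backslashes defensively.
    normalised = name.replace("\\", "/")
    path = PurePosixPath(normalised)
    # Reject absolute paths and any ".." component.
    if path.is_absolute() or ".." in path.parts:
        raise UnsafeZipEntry(name)
    # Reject Windows drive prefixes such as "C:/...".
    if len(normalised) >= 2 and normalised[1] == ":":
        raise UnsafeZipEntry(name)
    return path


def list_safe_entries(data: bytes, allowed_extensions: set[str]) -> list[str]:
    """Return entry names that pass the name check and the extension filter."""
    accepted = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            path = validate_entry_name(info.filename)  # raises on traversal
            if path.suffix.lower() in allowed_extensions:
                accepted.append(str(path))  # disallowed extensions are skipped
    return accepted
```

Rejecting the whole archive on the first unsafe name, rather than skipping that entry, matches the behaviour described above, where a traversal attempt fails the upload while a disallowed extension is skipped.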

Changes

| File | Change |
| --- | --- |
| deeptutor/api/routers/knowledge.py | Zip extraction in _save_uploaded_files(); error-state fallback in run_upload_processing_task(); .zip added to allowed extensions; recursive=True for file collection |
| deeptutor/knowledge/initializer.py | recursive=True for 3 collect_supported_files() calls |
| web/lib/knowledge-helpers.ts | kbIsUploadable now excludes initializing/processing states instead of requiring === ready, allowing upload to error-state KBs |

Test plan

  • Upload a .zip containing valid PDF/markdown files to a KB -> files are extracted and indexed
  • Upload a .zip with path traversal entries -> 400 error, safe extraction blocked
  • Upload a .zip containing a disallowed .exe file -> silently skipped (not in allowed extensions)
  • Upload files to a KB in error state -> auto-recovery via full initialization
  • Upload files to a KB with nested subdirectories in raw/ -> all files found during indexing

… scanning

Allow users to upload .zip archives containing knowledge base files
(e.g., PDF, markdown, images). Zip entries with path traversal
attempts ("../" or absolute paths) are rejected. The zip file is
extracted and removed after processing.

When a knowledge base is in an error state (e.g., from a failed
initialization), the upload endpoint now recovers automatically by
running a full re-initialization instead of failing the append-only
DocumentAdder path.
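
The fallback described above amounts to a dispatch on the knowledge base status. A minimal sketch follows, assuming status strings `"ready"`, `"error"`, `"initializing"`, and `"processing"`. The function names are hypothetical, and the two callables stand in for the real KnowledgeBaseInitializer and DocumentAdder runs, whose signatures are not shown in this PR. `kb_is_uploadable` mirrors in Python the frontend rule described for kbIsUploadable.

```python
from typing import Callable

# Assumed status values; only these two block an upload.
BUSY_STATES = {"initializing", "processing"}


def kb_is_uploadable(status: str) -> bool:
    # Uploads are allowed unless the KB is busy, so "error" is uploadable.
    return status not in BUSY_STATES


def process_upload(
    status: str,
    run_full_init: Callable[[], None],
    run_incremental: Callable[[], None],
) -> str:
    """Pick the processing path for an upload and report which one ran."""
    if not kb_is_uploadable(status):
        raise RuntimeError(f"knowledge base is busy: {status}")
    if status == "error":
        # Recover by rebuilding from scratch instead of appending.
        run_full_init()
        return "full_init"
    run_incremental()
    return "incremental"
```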

FileTypeRouter.collect_supported_files() calls are updated to use
recursive=True, so files in nested subdirectories inside raw/ are
discovered during indexing and metadata collection.
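
To show what the recursive flag changes, here is a stand-in written with pathlib. It is not the actual FileTypeRouter.collect_supported_files() implementation, whose signature is not shown here; the function below only borrows the name and the `recursive` parameter for illustration.

```python
from pathlib import Path


def collect_supported_files(
    root: Path, extensions: set[str], recursive: bool = False
) -> list[Path]:
    """Illustrative stand-in: list files under root with an allowed suffix."""
    # rglob descends into subdirectories; glob only looks at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates if p.is_file() and p.suffix.lower() in extensions
    )
```

With a layout such as `raw/a.pdf` and `raw/sub/b.md`, the non-recursive call returns only `a.pdf`, which is why files extracted from a zip with nested folders would otherwise be missed.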

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Fix file count (raw_documents) in KB manager to recurse into subdirectories
- Fix files API endpoint to list files recursively with relative paths
- Add embedding chunk truncation (2000 chars) to prevent SiliconFlow 413 error
- Add diagnostic logging for oversized chunks in embedding adapter
- Add missing list_kb_versions import in knowledge router
- Add CLAUDE.md with project documentation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
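
The chunk truncation and diagnostic logging mentioned in this commit could look roughly like the sketch below. The 2000-character limit comes from the commit message; the function name, logger name, and log wording are assumptions, not the adapter's actual code.

```python
import logging

logger = logging.getLogger("embedding_sketch")  # hypothetical logger name

MAX_CHUNK_CHARS = 2000  # limit stated in the commit message


def truncate_chunks(chunks: list[str], limit: int = MAX_CHUNK_CHARS) -> list[str]:
    """Cut each chunk to at most `limit` characters, logging oversized ones."""
    result = []
    for index, chunk in enumerate(chunks):
        if len(chunk) > limit:
            logger.warning(
                "chunk %d has %d chars; truncating to %d", index, len(chunk), limit
            )
            chunk = chunk[:limit]
        result.append(chunk)
    return result
```

Truncating by characters is a blunt guard: it keeps the request under the provider's size limit but drops the tail of the chunk, so splitting oversized chunks instead may be preferable where recall matters.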
@pancacake
Collaborator

Thanks for your contribution! The zip-upload direction here is genuinely useful. We've taken it forward with a dedicated, safer implementation — per-entry extraction with Zip-Slip, zip-bomb (size/count/ratio) and extension-whitelist guards rather than a blanket extractall — plus the error-state recovery idea. Closing this PR in favour of that implementation, but really appreciate you surfacing the need and the approach.
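
The guards named in this comment can be sketched as a per-entry extraction loop. This is an illustration under assumed limits, not the maintainers' implementation: `safe_extract`, `ZipRejected`, and the default thresholds are all hypothetical. The Zip-Slip guard here is a containment check on the resolved target path, and the size and ratio checks rely on the sizes declared in the archive headers.

```python
import io
import shutil
import zipfile
from pathlib import Path, PurePosixPath


class ZipRejected(ValueError):
    """Hypothetical error raised when an archive violates a guard."""


def safe_extract(
    data: bytes,
    dest: Path,
    allowed_extensions: set[str],
    max_entries: int = 1000,                   # assumed limit
    max_total_bytes: int = 200 * 1024 * 1024,  # assumed limit
    max_ratio: float = 100.0,                  # assumed limit
) -> list[str]:
    """Extract allowed entries one by one; return paths relative to dest."""
    dest = dest.resolve()
    written: list[str] = []
    total = 0
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        infos = [i for i in archive.infolist() if not i.is_dir()]
        if len(infos) > max_entries:
            raise ZipRejected("too many entries")
        for info in infos:
            rel = PurePosixPath(info.filename.replace("\\", "/"))
            # Zip-Slip guard: the resolved target must stay inside dest.
            target = (dest / Path(*rel.parts)).resolve()
            if not target.is_relative_to(dest):
                raise ZipRejected(f"entry escapes destination: {info.filename}")
            if rel.suffix.lower() not in allowed_extensions:
                continue  # extension whitelist: skip, do not fail
            # Zip-bomb guards based on declared sizes.
            if info.compress_size and info.file_size / info.compress_size > max_ratio:
                raise ZipRejected(f"compression ratio too high: {info.filename}")
            total += info.file_size
            if total > max_total_bytes:
                raise ZipRejected("archive expands beyond size limit")
            target.parent.mkdir(parents=True, exist_ok=True)
            with archive.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append(target.relative_to(dest).as_posix())
    return written
```

Unlike `ZipFile.extractall`, this loop decides entry by entry whether to write, skip, or abort. A hardened version might also count the bytes actually written instead of trusting the header sizes.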

@pancacake pancacake closed this May 27, 2026
pancacake added a commit that referenced this pull request May 27, 2026
Security: lock down the TutorBot tool sandbox (shell exec is opt-in, all
filesystem/shell access confined to the bot workspace) and isolate per-user
resources, closing #518, #517, #516, #515, #514 and #506 (first hardened in
#507).

Bug fixes: chat input disabled after the first turn (#520), KB embedding
failure on long documents (#521 / #509), profile creation under Docker
(#512 / #513), Qwen reasoning models failing native tool calling (#527 / #528),
the GPT-5 init-wizard token parameter (#508), and oversized session-event
truncation (#524).

Features: HTTP/SSE API for multi-turn chat with a specific TutorBot (#511),
multimodal image fallback for vision-capable providers without a capability
entry, safe ZIP knowledge upload, and a /settings/network page with model
fetching (community PRs #522 and #523 reimplemented locally).

Also bumps __version__ to 1.4.1, adds the v1.4.1 release notes, updates the
README Releases section, and ships the Astro + Starlight docs site under site/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
