Conversation

@albertyosef (Collaborator) commented Sep 25, 2025

Convert /upload to an async, safe, and well-observed endpoint. This change offloads blocking file IO to a thread pool, normalizes user-supplied filenames, enforces strict content-type and size limits, and adds per-request structured logging (with request_id) for traceability.


Features

  • Async upload pipeline

    • Switch handler to async with thread offload for disk writes (prevents event-loop starvation); a rough handler sketch follows this list.
    • Streamed reads with bounded chunks for predictable memory usage.
    • Early rejection when Content-Length exceeds configured max.
  • Filename sanitization

    • Rejects path traversals (.., absolute paths) and strips control/special chars.
    • Enforces safe charset + length cap (configurable).
    • Optional slugification to stable, deterministic names; preserves extension when allowed.
  • Type & size validation

    • Allow-list of MIME types (server-side detection, not only client header).
    • Max size limit (defaults provided; overridable via config/env).
    • Consistent 4xx errors with machine-readable error codes.
  • Structured logging (per request)

    • Attach request_id to all logs in the request scope.
    • Emits start/finish events with bytes received, duration, result status, and validation outcomes.
    • Error logs include sanitized filename and error code (no PII or raw paths).
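
For orientation, here is a minimal sketch of the handler shape these bullets describe, assuming FastAPI. The helper bodies, defaults, and storage path below are illustrative stand-ins; the real endpoint delegates to save_upload_file and the set_log_context/clear_log_context helpers in the services layer, and also streams with bounded chunks and checks Content-Length early.

import asyncio
import os
import pathlib
import uuid

from fastapi import FastAPI, File, HTTPException, Request, UploadFile, status

app = FastAPI()

MAX_UPLOAD_BYTES = 100 * 1024 * 1024                     # illustrative default (100 MB)
ALLOWED_MIME_TYPES = {"application/pdf", "text/plain"}   # illustrative allow-list
UPLOAD_DIR = "/tmp/uploads"                              # illustrative storage root


def _secure_filename(name: str) -> str:
    # Keep only the base name and a conservative character set, capped in length.
    base = pathlib.Path(name or "upload").name
    allowed = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_. ")
    return ("".join(c for c in base if c in allowed) or "upload")[:128]


def _save_to_disk(fileobj, dest: str, max_bytes: int) -> int:
    # Blocking, chunked copy; runs in a worker thread so the event loop is not starved.
    total = 0
    with open(dest, "wb") as out:
        for chunk in iter(lambda: fileobj.read(8192), b""):
            total += len(chunk)
            if total > max_bytes:
                raise ValueError("file exceeds allowed size")
            out.write(chunk)
    return total


@app.post("/upload", status_code=status.HTTP_201_CREATED)
async def upload(request: Request, file: UploadFile = File(...)):
    request_id = request.headers.get("X-Request-Id") or uuid.uuid4().hex
    content_type = (file.content_type or "").split(";", 1)[0].lower()
    if content_type not in ALLOWED_MIME_TYPES:
        raise HTTPException(status.HTTP_415_UNSUPPORTED_MEDIA_TYPE, "Unsupported content type")

    filename = _secure_filename(file.filename or "upload")
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    try:
        # Offload the blocking write so other requests keep being served.
        size = await asyncio.to_thread(
            _save_to_disk, file.file, os.path.join(UPLOAD_DIR, filename), MAX_UPLOAD_BYTES
        )
    except ValueError as exc:
        raise HTTPException(status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, "File too large") from exc
    finally:
        await file.close()
    # Simplified response; the real endpoint also returns sha256 and (optionally) url.
    return {"id": request_id, "filename": filename, "contentType": content_type, "size": size}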

Fixes

  • Mitigates path traversal and filename injection risks.
  • Prevents event-loop blocking caused by synchronous disk IO.
  • Corrects content-type detection drift (falls back to server-side sniffing plus extension matching).
  • Normalizes inconsistent HTTP status codes (now: 413 for size, 415 for type, 400 for validation, 422 for parse errors).

Refactors

  • Extracted validation into upload_validation.py (pure functions, unit-tested).
  • Centralized logging fields via RequestLoggerAdapter (adds request_id, route, ip); a sketch of the adapter follows this list.
  • Introduced storage.py for pluggable backends (local fs for now; S3/GCS ready).
  • Consolidated error mapping to errors.py with typed exceptions → HTTP responses.
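
A possible shape for RequestLoggerAdapter, as a sketch only; the constructor arguments and the merge behavior shown are assumptions beyond what the bullet above states.

import logging


class RequestLoggerAdapter(logging.LoggerAdapter):
    # Merges per-request fields into every record's `extra`; caller-provided keys win.
    def process(self, msg, kwargs):
        kwargs["extra"] = {**self.extra, **kwargs.get("extra", {})}
        return msg, kwargs


# Usage (values illustrative):
log = RequestLoggerAdapter(
    logging.getLogger("upload"),
    {"request_id": "abc123", "route": "/upload", "ip": "203.0.113.7"},
)
log.info("upload started")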

API Changes

  • Endpoint: POST /upload

  • Request headers:

    • Content-Type: must be in the configured allow-list.
    • Content-Length: validated against MAX_UPLOAD_BYTES (if present).
  • Response:

    • 201 Created with { id, filename, contentType, size, sha256, url? } (an illustrative client call follows this list)
    • Errors return { error: { code, message, details? } }
  • Idempotency: Optional Idempotency-Key header supported (dedup on hash + key).
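
An illustrative client call against the contract above, using the third-party requests library; the host, header values, and file are placeholders, and the url field may be absent depending on the storage backend.

import requests

resp = requests.post(
    "https://api.example.com/upload",                       # placeholder host
    headers={"X-Request-Id": "demo-123", "Idempotency-Key": "demo-key-1"},
    files={"file": ("report.pdf", open("report.pdf", "rb"), "application/pdf")},
)
if resp.status_code == 201:
    body = resp.json()
    print(body["id"], body["filename"], body["size"])        # sha256 and url may also be present
else:
    err = resp.json()["error"]
    print(err["code"], err["message"])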


Security & Compliance

  • Deny traversal, device paths, and hidden multi-dot tricks (e.g., file.jpg.exe).
  • Enforce MIME allow-list (e.g., image/*, video/mp4, etc. per config).
  • Limits on filename length, path depth, and total bytes.
  • Redacts user paths from logs; logs store only sanitized name + short hash prefix.

Performance

  • Async handler + thread offload eliminates upload-time stalls for other requests.
  • Streamed hashing (SHA-256) computed during write; no second pass (see the sketch after this list).
  • Bounded chunk size prevents unbounded memory spikes.
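
A small sketch of the single-pass hash-while-writing idea; the function name and the 8 KiB chunk size are illustrative, not the PR's actual implementation.

import hashlib

def write_and_hash(src, dest_path: str, chunk_size: int = 8192) -> tuple[int, str]:
    # Stream bounded chunks to disk while updating the digest, so no second read pass is needed.
    digest = hashlib.sha256()
    total = 0
    with open(dest_path, "wb") as out:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            digest.update(chunk)
            out.write(chunk)
            total += len(chunk)
    return total, digest.hexdigest()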

Observability

  • Logs (JSON):

    • request_id, route, filename_sanitized, bytes_received, duration_ms, result.
    • Validation events (type_checked, size_checked, sanitized).
  • Metrics (if enabled):

    • upload_bytes_total, upload_duration_ms, upload_failures{code}, upload_inflight.
  • Trace hooks: Span around validation + write stages (no payload capture).


Configuration

  • MAX_UPLOAD_BYTES (e.g., 104857600 for 100 MB)
  • ALLOWED_MIME_TYPES (CSV or JSON)
  • FILENAME_STRATEGY (sanitize|slug|reject-unsafe)
  • STORAGE_ROOT (local path) / STORAGE_BACKEND (local|s3|gcs reserved)
  • UPLOAD_THREAD_WORKERS (defaults to a CPU-based heuristic); an example loader follows this list.
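
One way these settings could be read from the environment; a hedged sketch with illustrative defaults rather than the shipped ones.

import json
import os

def _mime_allow_list(raw: str) -> set[str]:
    # Accept either a JSON array or a comma-separated string.
    if not raw:
        return {"application/pdf", "text/plain"}   # illustrative default
    if raw.lstrip().startswith("["):
        return set(json.loads(raw))
    return {item.strip() for item in raw.split(",") if item.strip()}

MAX_UPLOAD_BYTES = int(os.getenv("MAX_UPLOAD_BYTES", str(100 * 1024 * 1024)))   # 100 MB default
ALLOWED_MIME_TYPES = _mime_allow_list(os.getenv("ALLOWED_MIME_TYPES", ""))
FILENAME_STRATEGY = os.getenv("FILENAME_STRATEGY", "sanitize")                   # sanitize|slug|reject-unsafe
STORAGE_ROOT = os.getenv("STORAGE_ROOT", "./uploads")
STORAGE_BACKEND = os.getenv("STORAGE_BACKEND", "local")                          # s3/gcs reserved
UPLOAD_THREAD_WORKERS = int(os.getenv("UPLOAD_THREAD_WORKERS", str(os.cpu_count() or 4)))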

Testing

  • Unit tests

    • Filename sanitization (unicode, emojis, path traversal, reserved names).
    • MIME/type sniffing vs extension matrix.
    • Size guard (boundary ±1 byte); a pytest sketch follows this list.
  • Integration tests

    • Happy path: small and large files under limit.
    • Rejections: 413 size, 415 type, 400 invalid filename, 422 malformed multipart.
    • Concurrency: multiple parallel uploads without throughput degradation.
  • Property-based tests for random filenames and mixed headers.
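
For the size-guard boundary case specifically, a minimal pytest sketch; validate_size and FileTooLargeError are stand-ins for whatever upload_validation.py actually exposes.

import pytest

MAX_UPLOAD_BYTES = 1024  # small limit for the test

class FileTooLargeError(Exception):
    pass

def validate_size(size: int, max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    # Stand-in for the real guard in upload_validation.py.
    if size > max_bytes:
        raise FileTooLargeError(size)

@pytest.mark.parametrize("size,ok", [
    (MAX_UPLOAD_BYTES - 1, True),   # just under the limit
    (MAX_UPLOAD_BYTES, True),       # exactly at the limit
    (MAX_UPLOAD_BYTES + 1, False),  # one byte over
])
def test_size_boundary(size, ok):
    if ok:
        validate_size(size)
    else:
        with pytest.raises(FileTooLargeError):
            validate_size(size)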


Docs

  • Updated API reference for /upload with examples (curl + JS/TS).
  • Deployment note on reverse proxy client_max_body_size/max_request_body_size.
  • Runbook: interpreting structured logs and common error codes.

Breaking Changes

  • Requests with disallowed MIME types or oversize payloads now fail early with strict codes.
  • Filenames may be altered by the sanitizer/slugifier; callers should rely on returned filename or id.

Migration Notes

  • Review and set ALLOWED_MIME_TYPES and MAX_UPLOAD_BYTES for your environment.
  • If callers depended on “as-sent” filenames, switch to server-returned filename or id.
  • Ensure proxy/body size limits ≥ MAX_UPLOAD_BYTES.

Risk & Rollout

  • Risk: Medium (touches request path + IO).

  • Rollout plan:

    1. Deploy behind flag UPLOAD_V2_ENABLED (shadow logs only).
    2. Canary to 5–10% traffic; watch upload_failures and latency P95/P99.
    3. Ramp to 100%; remove flag after 48h stable.

Summary by CodeRabbit

  • New Features

    • Safer uploads with content-type whitelisting, optional size limits, and secure filename handling.
    • Clearer, standardized error responses for invalid files and processing issues.
    • More reliable preview generation, including improved PDF handling and robust plain text previews.
    • Smoother, more responsive uploads under load.
  • Bug Fixes

    • Reduced failed uploads due to improved validation and error mapping to user-visible messages.
    • Fixed intermittent preview errors by strengthening parsing and post-save checks.

…d improved error handling; refactor for async processing and logging context management.

coderabbitai bot commented Sep 25, 2025

Walkthrough

An asynchronous upload endpoint replaces the synchronous handler, adding request-scoped logging context, content-type and size validations, filename sanitization, thread-pooled file saving, a post-save size check, refined parsing for previews, and expanded error handling that maps domain errors to specific HTTP statuses. The function signature now includes Request and is async.

Changes

Cohort / File(s): Upload endpoint and lifecycle (api/endpoints/upload.py)
Summary: Converted to async with Request param; added per-request structured logging (set/clear); enforced content-type whitelist and optional Content-Length cap (MAX_UPLOAD_BYTES); sanitized filenames via _secure_filename; offloaded file save to thread pool; added post-save size verification; improved PDF/plain-text preview parsing with path-aware logging; standardized error mapping: DocumentSaveError → 400, DocumentParseError/DocumentChunkError/DocumentEmbeddingError → 422; ensured logging context cleanup.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant API as Upload Endpoint
  participant Val as Validation
  participant FS as ThreadPool Save
  participant Parser as Parser/Preview
  participant Log as Log Context

  Client->>API: POST /upload (file)
  API->>Log: set_log_context(request)
  rect rgba(200,230,255,0.3)
    API->>Val: Check content-type & Content-Length
    Val-->>API: Valid or raise
  end
  API->>API: _secure_filename(name)

  rect rgba(220,255,220,0.3)
    API->>FS: Save file (run_in_executor)
    FS-->>API: Saved path
    API->>Val: Post-save size check
  end

  rect rgba(255,245,200,0.4)
    API->>Parser: Generate preview (PDF/text)
    Parser-->>API: Preview/metadata
  end

  API-->>Client: 200 OK (result)

  alt DocumentSaveError
    API-->>Client: 400 Bad Request
  else Parse/Chunk/Embedding Error
    API-->>Client: 422 Unprocessable Entity
  else Other Error
    API-->>Client: 500 Internal Server Error
  end

  API->>Log: clear_log_context()

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • fehranbit

Poem

A hare with logs in tidy rows,
Async winds where the upload flows.
We whisk the bytes, then safely save,
Preview the tales the pages gave.
If errors hop—422 or 400—fine,
We clear our tracks and call it time.
(_/)> 🗂️📄

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 50.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly identifies the key enhancements introduced by the pull request, namely conversion to an asynchronous upload endpoint alongside filename sanitization, size/type validation, and structured logging. It is concise, specific, and directly reflects the primary functionality changes without unrelated details. Using a conventional commit style prefix also aids clarity and consistency. Overall, it accurately communicates the main purpose of the changeset.
✨ Finishing touches
  • 📝 Generate Docstrings
  • 🧪 Generate unit tests
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch feat/upload-async-sanitize-logging

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
api/endpoints/upload.py (6)

5-10: Add required imports; drop per-request ThreadPoolExecutor

Use asyncio.to_thread and cleanup helpers.

Apply this diff:

 from fastapi import APIRouter, UploadFile, File, HTTPException, Request, status
 import os
 import pathlib
-from concurrent.futures import ThreadPoolExecutor
 from typing import Optional
+import asyncio
+import contextlib
+import uuid

89-89: Avoid logging absolute filesystem paths

Log only sanitized basename to reduce sensitive path exposure.

Apply this diff:

-                logger.exception("Error parsing PDF file for preview", extra={"path": saved_path})
+                logger.exception("Error parsing PDF file for preview", extra={"filename": os.path.basename(saved_path)})

45-48: Normalize Content-Type by stripping parameters (e.g., charset)

Clients may send "text/plain; charset=utf-8". Split off parameters before allow-list check.

Apply this diff:

-        content_type = (file.content_type or "").lower()
+        content_type = (file.content_type or "").split(";", 1)[0].lower()

39-43: Always set a request_id; support common header casing

Fallback to X-Request-ID and generate one if absent.

Apply this diff:

-    request_id = request.headers.get("X-Request-Id") or None
-    if request_id:
-        set_log_context(request_id=request_id)
+    request_id = (
+        request.headers.get("X-Request-Id")
+        or request.headers.get("X-Request-ID")
+        or uuid.uuid4().hex
+    )
+    set_log_context(request_id=request_id)

26-31: Tighten filename sanitization (optional; file_service also validates)

Prefer a restricted ASCII set and length cap to reduce surprises.

Apply this diff:

 def _secure_filename(name: str) -> str:
-    # Simple sanitization: take only base name and strip suspicious characters
-    base = pathlib.Path(name).name
-    # remove path separators and control chars
-    return "".join(c for c in base if c.isprintable())
+    # Take base name; restrict to a safe ASCII set; cap length
+    base = pathlib.Path(name or "upload").name
+    allowed = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_. ")
+    cleaned = "".join(c for c in base if c in allowed) or "upload"
+    return cleaned[:128]

81-91: Server-side type validation (optional)

Current allow-list trusts client-provided Content-Type. Consider lightweight sniffing (e.g., check for "%PDF-" magic for PDFs) to harden type checks.

Would you like me to add a small header-sniff (no extra deps) for PDFs and fallback text detection?

Also applies to: 92-102

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77ca247 and 1c6ca9c.

📒 Files selected for processing (1)
  • api/endpoints/upload.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
api/endpoints/upload.py (5)
services/file_service.py (1)
  • save_upload_file (19-57)
services/parsing_service.py (1)
  • extract_text_from_pdf (10-24)
models/schemas.py (1)
  • UploadResponse (7-11)
services/exceptions.py (4)
  • DocumentSaveError (13-15)
  • DocumentParseError (9-11)
  • DocumentChunkError (17-19)
  • DocumentEmbeddingError (21-23)
services/logging_config.py (3)
  • get_logger (171-174)
  • set_log_context (157-160)
  • clear_log_context (167-168)
🪛 Ruff (0.13.1)
api/endpoints/upload.py

86-86: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


98-98: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


101-101: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


110-110: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


113-113: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


114-114: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

Comment on lines +50 to +60
        # If client provided Content-Length header, check early
        content_length = request.headers.get("content-length")
        if content_length:
            try:
                if int(content_length) > MAX_UPLOAD_BYTES:
                    logger.warning("Rejected upload due to size header too large", extra={"size": content_length})
                    raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="File too large")
            except ValueError:
                # ignore invalid header and continue with streaming checks
                pass

⚠️ Potential issue | 🔴 Critical

Block oversized uploads during streaming; avoid saving then rejecting

Header-based checks aren’t enough. As written, a client can omit/misstate Content-Length and force writing arbitrarily large files to disk. Enforce the size limit during the write, and delete the file if a post-save check fails. Also, avoid per-request ThreadPoolExecutor and use asyncio.to_thread.

Apply this diff:

@@
-        # Offload blocking save to threadpool
-        loop = __import__("asyncio").get_running_loop()
-        with ThreadPoolExecutor(max_workers=1) as ex:
-            saved_path = await loop.run_in_executor(ex, save_upload_file, file.file, filename)
+        # Offload blocking save, enforcing size during save
+        saved_path = await asyncio.to_thread(save_upload_file, file.file, filename, MAX_UPLOAD_BYTES)
@@
-        # Quick size check after save
+        # Quick size check after save (defense-in-depth)
         try:
             size = os.path.getsize(saved_path)
             if size > MAX_UPLOAD_BYTES:
-                logger.warning("Saved file exceeds max size", extra={"size": size, "path": saved_path})
-                raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="File too large")
-        except OSError:
-            logger.exception("Failed to stat saved file", extra={"path": saved_path})
+                logger.warning("Saved file exceeds max size", extra={"filename": os.path.basename(saved_path), "size": size})
+                with contextlib.suppress(Exception):
+                    os.remove(saved_path)
+                raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="File too large")
+        except OSError as e:
+            logger.exception("Failed to stat saved file", extra={"filename": os.path.basename(saved_path)})
+            with contextlib.suppress(Exception):
+                os.remove(saved_path)
+            raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail="Failed to validate saved file") from e

And update the saver to enforce the limit while writing (outside this file):

diff --git a/services/file_service.py b/services/file_service.py
@@
-def save_upload_file(upload_file: IO, filename: str) -> str:
+def save_upload_file(upload_file: IO, filename: str, max_bytes: int | None = None) -> str:
@@
-        with tempfile.NamedTemporaryFile(dir=UPLOAD_DIR, delete=False) as tmp:
-            for chunk in iter(lambda: upload_file.read(8192), b""):
-                tmp.write(chunk)
-            temp_path = tmp.name
+        with tempfile.NamedTemporaryFile(dir=UPLOAD_DIR, delete=False) as tmp:
+            total = 0
+            for chunk in iter(lambda: upload_file.read(8192), b""):
+                tmp.write(chunk)
+                total += len(chunk)
+                if max_bytes is not None and total > max_bytes:
+                    temp_path = tmp.name
+                    raise DocumentSaveError("File exceeds allowed size.")
+            temp_path = tmp.name
@@
-    except OSError as e:
+    except OSError as e:
         logger.exception("OSError during file save: %s", filename)
         with contextlib.suppress(Exception):
             if 'temp_path' in locals() and os.path.exists(temp_path):
                 os.remove(temp_path)
         raise DocumentSaveError("Failed to save file securely.") from e
+    except DocumentSaveError as e:
+        logger.warning("Upload aborted due to size limit")
+        with contextlib.suppress(Exception):
+            if 'temp_path' in locals() and os.path.exists(temp_path):
+                os.remove(temp_path)
+        raise

Also applies to: 64-68, 69-77

🤖 Prompt for AI Agents
In api/endpoints/upload.py around lines 50-60 (and similarly 64-68, 69-77), the
current logic only checks Content-Length header and may allow oversized uploads
to be written to disk if the header is missing or incorrect; update the upload
flow to enforce MAX_UPLOAD_BYTES during the streaming/save operation (move
size-check logic into the saver so it raises once the cumulative bytes exceed
the limit), ensure that if any post-save size check fails the partially written
file is deleted, and replace any per-request ThreadPoolExecutor usage with
asyncio.to_thread for disk I/O offloading; adjust calling code here to call the
updated saver that raises on overflow and handle that exception to return HTTP
413 and perform file cleanup.

                 text_preview = text[:500] if text else None
             except DocumentParseError:
-                logger.error("Document parse error")
+                logger.error("Document parse error", extra={"path": saved_path})

⚠️ Potential issue | 🟠 Major

Use logger.exception in except blocks and chain exceptions

Align with Ruff TRY400 and B904; also avoid logging full paths elsewhere.

Apply this diff:

-                logger.error("Document parse error", extra={"path": saved_path})
+                logger.exception("Document parse error", extra={"filename": os.path.basename(saved_path)})
@@
-                logger.error("Unicode decode error while reading file for preview", extra={"path": saved_path})
+                logger.exception("Unicode decode error while reading file for preview", extra={"filename": os.path.basename(saved_path)})
@@
-                logger.error("OS error while reading file for preview", extra={"path": saved_path})
+                logger.exception("OS error while reading file for preview", extra={"filename": os.path.basename(saved_path)})
@@
-        logger.error("Document save error", extra={"error": str(dse)})
+        logger.exception("Document save error", extra={"error": str(dse)})
         raise HTTPException(status_code=400, detail="Failed to save uploaded document") from dse
@@
-        logger.error("Document processing error", extra={"error": str(de)})
-        raise HTTPException(status_code=422, detail="Error processing document")
+        logger.exception("Document processing error", extra={"error": str(de)})
+        raise HTTPException(status_code=422, detail="Error processing document") from de

Also applies to: 98-98, 101-101, 110-110, 113-114

🧰 Tools
🪛 Ruff (0.13.1)

86-86: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In api/endpoints/upload.py around lines 86, 98, 101, 110 and 113-114, replace
logger.error(...) calls inside except blocks with logger.exception(...) so the
stack trace is included, avoid logging full filesystem paths by logging only the
filename or a redacted path (e.g., os.path.basename(saved_path) or
"<redacted>"), and when re-raising exceptions ensure you use exception chaining
(raise NewError(...) from e) rather than raising without the original exception;
do this for each listed location.

Comment on lines +115 to +117
    finally:
        # Clear per-request logging context
        clear_log_context()

⚠️ Potential issue | 🟠 Major

Close the uploaded file in finally to avoid resource leaks

Starlette’s UploadFile exposes an async close(). Ensure cleanup.

Apply this diff:

     finally:
-        # Clear per-request logging context
-        clear_log_context()
+        with contextlib.suppress(Exception):
+            await file.close()
+        # Clear per-request logging context
+        clear_log_context()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    finally:
-        # Clear per-request logging context
-        clear_log_context()
+    finally:
+        with contextlib.suppress(Exception):
+            await file.close()
+        # Clear per-request logging context
+        clear_log_context()
🤖 Prompt for AI Agents
In api/endpoints/upload.py around lines 115 to 117, the finally block only
clears per-request logging context and does not close the Starlette UploadFile,
risking a resource leak; modify the finally block to also await the uploaded
file's async close() (e.g., await uploaded_file.close()) so the uploaded file is
properly cleaned up before exiting the request handler, ensuring the call is
awaited and wrapped in a safe try/except if needed to avoid masking original
exceptions.

@fehranbit self-requested a review on September 30, 2025, 05:43
@fehranbit merged commit 4ae0082 into main on Sep 30, 2025
1 check passed