Conversation

@albertyosef (Collaborator) commented Sep 25, 2025

Convert /upload to an async, safe, and well-observed endpoint. This change offloads blocking file IO to a thread pool, normalizes user-supplied filenames, enforces strict content-type and size limits, and adds per-request structured logging (with request_id) for traceability.


Features

  • Async upload pipeline

    • Switch handler to async with thread offload for disk writes (prevents event-loop starvation); a rough handler sketch follows this list.
    • Streamed reads with bounded chunks for predictable memory usage.
    • Early rejection when Content-Length exceeds configured max.
  • Filename sanitization

    • Rejects path traversals (.., absolute paths) and strips control/special chars.
    • Enforces safe charset + length cap (configurable).
    • Optional slugification to stable, deterministic names; preserves extension when allowed.
  • Type & size validation

    • Allow-list of MIME types (server-side detection, not only client header).
    • Max size limit (defaults provided; overridable via config/env).
    • Consistent 4xx errors with machine-readable error codes.
  • Structured logging (per request)

    • Attach request_id to all logs in the request scope.
    • Emits start/finish events with bytes received, duration, result status, and validation outcomes.
    • Error logs include sanitized filename and error code (no PII or raw paths).
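
For orientation, here is a minimal sketch of the handler shape these bullets describe, assuming FastAPI. The helper bodies, defaults, and storage path below are illustrative stand-ins; the real endpoint delegates to save_upload_file and the set_log_context/clear_log_context helpers in the services layer, and also streams with bounded chunks and checks Content-Length early.

import asyncio
import os
import pathlib
import uuid

from fastapi import FastAPI, File, HTTPException, Request, UploadFile, status

app = FastAPI()

MAX_UPLOAD_BYTES = 100 * 1024 * 1024                     # illustrative default (100 MB)
ALLOWED_MIME_TYPES = {"application/pdf", "text/plain"}   # illustrative allow-list
UPLOAD_DIR = "/tmp/uploads"                              # illustrative storage root


def _secure_filename(name: str) -> str:
    # Keep only the base name and a conservative character set, capped in length.
    base = pathlib.Path(name or "upload").name
    allowed = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_. ")
    return ("".join(c for c in base if c in allowed) or "upload")[:128]


def _save_to_disk(fileobj, dest: str, max_bytes: int) -> int:
    # Blocking, chunked copy; runs in a worker thread so the event loop is not starved.
    total = 0
    with open(dest, "wb") as out:
        for chunk in iter(lambda: fileobj.read(8192), b""):
            total += len(chunk)
            if total > max_bytes:
                raise ValueError("file exceeds allowed size")
            out.write(chunk)
    return total


@app.post("/upload", status_code=status.HTTP_201_CREATED)
async def upload(request: Request, file: UploadFile = File(...)):
    request_id = request.headers.get("X-Request-Id") or uuid.uuid4().hex
    content_type = (file.content_type or "").split(";", 1)[0].lower()
    if content_type not in ALLOWED_MIME_TYPES:
        raise HTTPException(status.HTTP_415_UNSUPPORTED_MEDIA_TYPE, "Unsupported content type")

    filename = _secure_filename(file.filename or "upload")
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    try:
        # Offload the blocking write so other requests keep being served.
        size = await asyncio.to_thread(
            _save_to_disk, file.file, os.path.join(UPLOAD_DIR, filename), MAX_UPLOAD_BYTES
        )
    except ValueError as exc:
        raise HTTPException(status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, "File too large") from exc
    finally:
        await file.close()
    # Simplified response; the real endpoint also returns sha256 and (optionally) url.
    return {"id": request_id, "filename": filename, "contentType": content_type, "size": size}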

Fixes

  • Mitigates path traversal and filename injection risks.
  • Prevents event-loop blocking caused by synchronous disk IO.
  • Corrects content-type detection drift (falls back to server-side sniffing plus extension matching).
  • Normalizes inconsistent HTTP status codes (now: 413 for size, 415 for type, 400 for validation, 422 for parse errors).

Refactors

  • Extracted validation into upload_validation.py (pure functions, unit-tested).
  • Centralized logging fields via RequestLoggerAdapter (adds request_id, route, ip); a sketch of the adapter follows this list.
  • Introduced storage.py for pluggable backends (local fs for now; S3/GCS ready).
  • Consolidated error mapping to errors.py with typed exceptions → HTTP responses.
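
A possible shape for RequestLoggerAdapter, as a sketch only; the constructor arguments and the merge behavior shown are assumptions beyond what the bullet above states.

import logging


class RequestLoggerAdapter(logging.LoggerAdapter):
    # Merges per-request fields into every record's `extra`; caller-provided keys win.
    def process(self, msg, kwargs):
        kwargs["extra"] = {**self.extra, **kwargs.get("extra", {})}
        return msg, kwargs


# Usage (values illustrative):
log = RequestLoggerAdapter(
    logging.getLogger("upload"),
    {"request_id": "abc123", "route": "/upload", "ip": "203.0.113.7"},
)
log.info("upload started")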

API Changes

  • Endpoint: POST /upload

  • Request headers:

    • Content-Type: must be in the configured allow-list.
    • Content-Length: validated against MAX_UPLOAD_BYTES (if present).
  • Response:

    • 201 Created with { id, filename, contentType, size, sha256, url? } (an illustrative client call follows this list)
    • Errors return { error: { code, message, details? } }
  • Idempotency: Optional Idempotency-Key header supported (dedup on hash + key).
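
An illustrative client call against the contract above, using the third-party requests library; the host, header values, and file are placeholders, and the url field may be absent depending on the storage backend.

import requests

resp = requests.post(
    "https://api.example.com/upload",                       # placeholder host
    headers={"X-Request-Id": "demo-123", "Idempotency-Key": "demo-key-1"},
    files={"file": ("report.pdf", open("report.pdf", "rb"), "application/pdf")},
)
if resp.status_code == 201:
    body = resp.json()
    print(body["id"], body["filename"], body["size"])        # sha256 and url may also be present
else:
    err = resp.json()["error"]
    print(err["code"], err["message"])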


Security & Compliance

  • Deny traversal, device paths, and hidden multi-dot tricks (e.g., file.jpg.exe).
  • Enforce MIME allow-list (e.g., image/*, video/mp4, etc. per config).
  • Limits on filename length, path depth, and total bytes.
  • Redacts user paths from logs; logs store only sanitized name + short hash prefix.

Performance

  • Async handler + thread offload eliminates upload-time stalls for other requests.
  • Streamed hashing (SHA-256) computed during write; no second pass (see the sketch after this list).
  • Bounded chunk size prevents unbounded memory spikes.
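
A small sketch of the single-pass hash-while-writing idea; the function name and the 8 KiB chunk size are illustrative, not the PR's actual implementation.

import hashlib

def write_and_hash(src, dest_path: str, chunk_size: int = 8192) -> tuple[int, str]:
    # Stream bounded chunks to disk while updating the digest, so no second read pass is needed.
    digest = hashlib.sha256()
    total = 0
    with open(dest_path, "wb") as out:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            digest.update(chunk)
            out.write(chunk)
            total += len(chunk)
    return total, digest.hexdigest()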

Observability

  • Logs (JSON):

    • request_id, route, filename_sanitized, bytes_received, duration_ms, result.
    • Validation events (type_checked, size_checked, sanitized).
  • Metrics (if enabled):

    • upload_bytes_total, upload_duration_ms, upload_failures{code}, upload_inflight.
  • Trace hooks: Span around validation + write stages (no payload capture).


Configuration

  • MAX_UPLOAD_BYTES (e.g., 104857600 for 100 MB)
  • ALLOWED_MIME_TYPES (CSV or JSON)
  • FILENAME_STRATEGY (sanitize|slug|reject-unsafe)
  • STORAGE_ROOT (local path) / STORAGE_BACKEND (local|s3|gcs reserved)
  • UPLOAD_THREAD_WORKERS (defaults to a CPU-based heuristic); an example loader follows this list.
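
One way these settings could be read from the environment; a hedged sketch with illustrative defaults rather than the shipped ones.

import json
import os

def _mime_allow_list(raw: str) -> set[str]:
    # Accept either a JSON array or a comma-separated string.
    if not raw:
        return {"application/pdf", "text/plain"}   # illustrative default
    if raw.lstrip().startswith("["):
        return set(json.loads(raw))
    return {item.strip() for item in raw.split(",") if item.strip()}

MAX_UPLOAD_BYTES = int(os.getenv("MAX_UPLOAD_BYTES", str(100 * 1024 * 1024)))   # 100 MB default
ALLOWED_MIME_TYPES = _mime_allow_list(os.getenv("ALLOWED_MIME_TYPES", ""))
FILENAME_STRATEGY = os.getenv("FILENAME_STRATEGY", "sanitize")                   # sanitize|slug|reject-unsafe
STORAGE_ROOT = os.getenv("STORAGE_ROOT", "./uploads")
STORAGE_BACKEND = os.getenv("STORAGE_BACKEND", "local")                          # s3/gcs reserved
UPLOAD_THREAD_WORKERS = int(os.getenv("UPLOAD_THREAD_WORKERS", str(os.cpu_count() or 4)))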

Testing

  • Unit tests

    • Filename sanitization (unicode, emojis, path traversal, reserved names).
    • MIME/type sniffing vs extension matrix.
    • Size guard (boundary ±1 byte); a pytest sketch follows this list.
  • Integration tests

    • Happy path: small and large files under limit.
    • Rejections: 413 size, 415 type, 400 invalid filename, 422 malformed multipart.
    • Concurrency: multiple parallel uploads without throughput degradation.
  • Property-based tests for random filenames and mixed headers.
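
For the size-guard boundary case specifically, a minimal pytest sketch; validate_size and FileTooLargeError are stand-ins for whatever upload_validation.py actually exposes.

import pytest

MAX_UPLOAD_BYTES = 1024  # small limit for the test

class FileTooLargeError(Exception):
    pass

def validate_size(size: int, max_bytes: int = MAX_UPLOAD_BYTES) -> None:
    # Stand-in for the real guard in upload_validation.py.
    if size > max_bytes:
        raise FileTooLargeError(size)

@pytest.mark.parametrize("size,ok", [
    (MAX_UPLOAD_BYTES - 1, True),   # just under the limit
    (MAX_UPLOAD_BYTES, True),       # exactly at the limit
    (MAX_UPLOAD_BYTES + 1, False),  # one byte over
])
def test_size_boundary(size, ok):
    if ok:
        validate_size(size)
    else:
        with pytest.raises(FileTooLargeError):
            validate_size(size)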


Docs

  • Updated API reference for /upload with examples (curl + JS/TS).
  • Deployment note on reverse proxy client_max_body_size/max_request_body_size.
  • Runbook: interpreting structured logs and common error codes.

Breaking Changes

  • Requests with disallowed MIME types or oversize payloads now fail early with strict codes.
  • Filenames may be altered by the sanitizer/slugifier; callers should rely on returned filename or id.

Migration Notes

  • Review and set ALLOWED_MIME_TYPES and MAX_UPLOAD_BYTES for your environment.
  • If callers depended on “as-sent” filenames, switch to server-returned filename or id.
  • Ensure proxy/body size limits ≥ MAX_UPLOAD_BYTES.

Risk & Rollout

  • Risk: Medium (touches request path + IO).

  • Rollout plan:

    1. Deploy behind flag UPLOAD_V2_ENABLED (shadow logs only).
    2. Canary to 5–10% traffic; watch upload_failures and latency P95/P99.
    3. Ramp to 100%; remove flag after 48h stable.

Summary by CodeRabbit

  • New Features

    • Safer uploads with content-type whitelisting, optional size limits, and secure filename handling.
    • Clearer, standardized error responses for invalid files and processing issues.
    • More reliable preview generation, including improved PDF handling and robust plain text previews.
    • Smoother, more responsive uploads under load.
  • Bug Fixes

    • Reduced failed uploads due to improved validation and error mapping to user-visible messages.
    • Fixed intermittent preview errors by strengthening parsing and post-save checks.

…d improved error handling; refactor for async processing and logging context management.

coderabbitai bot commented Sep 25, 2025

Walkthrough

An asynchronous upload endpoint replaces the synchronous handler, adding request-scoped logging context, content-type and size validations, filename sanitization, thread-pooled file saving, a post-save size check, refined parsing for previews, and expanded error handling that maps domain errors to specific HTTP statuses. The function signature now includes Request and is async.

Changes

Cohort / File(s): Upload endpoint and lifecycle (api/endpoints/upload.py)
Summary: Converted to async with Request param; added per-request structured logging (set/clear); enforced content-type whitelist and optional Content-Length cap (MAX_UPLOAD_BYTES); sanitized filenames via _secure_filename; offloaded file save to thread pool; added post-save size verification; improved PDF/plain-text preview parsing with path-aware logging; standardized error mapping: DocumentSaveError → 400, DocumentParseError/DocumentChunkError/DocumentEmbeddingError → 422; ensured logging context cleanup.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant API as Upload Endpoint
  participant Val as Validation
  participant FS as ThreadPool Save
  participant Parser as Parser/Preview
  participant Log as Log Context

  Client->>API: POST /upload (file)
  API->>Log: set_log_context(request)
  rect rgba(200,230,255,0.3)
    API->>Val: Check content-type & Content-Length
    Val-->>API: Valid or raise
  end
  API->>API: _secure_filename(name)

  rect rgba(220,255,220,0.3)
    API->>FS: Save file (run_in_executor)
    FS-->>API: Saved path
    API->>Val: Post-save size check
  end

  rect rgba(255,245,200,0.4)
    API->>Parser: Generate preview (PDF/text)
    Parser-->>API: Preview/metadata
  end

  API-->>Client: 200 OK (result)

  alt DocumentSaveError
    API-->>Client: 400 Bad Request
  else Parse/Chunk/Embedding Error
    API-->>Client: 422 Unprocessable Entity
  else Other Error
    API-->>Client: 500 Internal Server Error
  end

  API->>Log: clear_log_context()

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • fehranbit

Poem

A hare with logs in tidy rows,
Async winds where the upload flows.
We whisk the bytes, then safely save,
Preview the tales the pages gave.
If errors hop—422 or 400—fine,
We clear our tracks and call it time.
(_/)> 🗂️📄

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 50.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly identifies the key enhancements introduced by the pull request, namely conversion to an asynchronous upload endpoint alongside filename sanitization, size/type validation, and structured logging. It is concise, specific, and directly reflects the primary functionality changes without unrelated details. Using a conventional commit style prefix also aids clarity and consistency. Overall, it accurately communicates the main purpose of the changeset.
✨ Finishing touches
  • 📝 Generate Docstrings
  • 🧪 Generate unit tests
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch feat/upload-async-sanitize-logging

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
api/endpoints/upload.py (6)

5-10: Add required imports; drop per-request ThreadPoolExecutor

Use asyncio.to_thread and cleanup helpers.

Apply this diff:

 from fastapi import APIRouter, UploadFile, File, HTTPException, Request, status
 import os
 import pathlib
-from concurrent.futures import ThreadPoolExecutor
 from typing import Optional
+import asyncio
+import contextlib
+import uuid

89-89: Avoid logging absolute filesystem paths

Log only sanitized basename to reduce sensitive path exposure.

Apply this diff:

-                logger.exception("Error parsing PDF file for preview", extra={"path": saved_path})
+                logger.exception("Error parsing PDF file for preview", extra={"filename": os.path.basename(saved_path)})

45-48: Normalize Content-Type by stripping parameters (e.g., charset)

Clients may send "text/plain; charset=utf-8". Split off parameters before allow-list check.

Apply this diff:

-        content_type = (file.content_type or "").lower()
+        content_type = (file.content_type or "").split(";", 1)[0].lower()

39-43: Always set a request_id; support common header casing

Fallback to X-Request-ID and generate one if absent.

Apply this diff:

-    request_id = request.headers.get("X-Request-Id") or None
-    if request_id:
-        set_log_context(request_id=request_id)
+    request_id = (
+        request.headers.get("X-Request-Id")
+        or request.headers.get("X-Request-ID")
+        or uuid.uuid4().hex
+    )
+    set_log_context(request_id=request_id)

26-31: Tighten filename sanitization (optional; file_service also validates)

Prefer a restricted ASCII set and length cap to reduce surprises.

Apply this diff:

 def _secure_filename(name: str) -> str:
-    # Simple sanitization: take only base name and strip suspicious characters
-    base = pathlib.Path(name).name
-    # remove path separators and control chars
-    return "".join(c for c in base if c.isprintable())
+    # Take base name; restrict to a safe ASCII set; cap length
+    base = pathlib.Path(name or "upload").name
+    allowed = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_. ")
+    cleaned = "".join(c for c in base if c in allowed) or "upload"
+    return cleaned[:128]

81-91: Server-side type validation (optional)

Current allow-list trusts client-provided Content-Type. Consider lightweight sniffing (e.g., check for "%PDF-" magic for PDFs) to harden type checks.

Would you like me to add a small header-sniff (no extra deps) for PDFs and fallback text detection?

Also applies to: 92-102

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77ca247 and 1c6ca9c.

📒 Files selected for processing (1)
  • api/endpoints/upload.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
api/endpoints/upload.py (5)
services/file_service.py (1)
  • save_upload_file (19-57)
services/parsing_service.py (1)
  • extract_text_from_pdf (10-24)
models/schemas.py (1)
  • UploadResponse (7-11)
services/exceptions.py (4)
  • DocumentSaveError (13-15)
  • DocumentParseError (9-11)
  • DocumentChunkError (17-19)
  • DocumentEmbeddingError (21-23)
services/logging_config.py (3)
  • get_logger (171-174)
  • set_log_context (157-160)
  • clear_log_context (167-168)
🪛 Ruff (0.13.1)
api/endpoints/upload.py

86-86: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


98-98: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


101-101: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


110-110: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


113-113: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


114-114: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

Comment on lines +50 to +60
        # If client provided Content-Length header, check early
        content_length = request.headers.get("content-length")
        if content_length:
            try:
                if int(content_length) > MAX_UPLOAD_BYTES:
                    logger.warning("Rejected upload due to size header too large", extra={"size": content_length})
                    raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="File too large")
            except ValueError:
                # ignore invalid header and continue with streaming checks
                pass

⚠️ Potential issue | 🔴 Critical

Block oversized uploads during streaming; avoid saving then rejecting

Header-based checks aren’t enough. As written, a client can omit/misstate Content-Length and force writing arbitrarily large files to disk. Enforce the size limit during the write, and delete the file if a post-save check fails. Also, avoid per-request ThreadPoolExecutor and use asyncio.to_thread.

Apply this diff:

@@
-        # Offload blocking save to threadpool
-        loop = __import__("asyncio").get_running_loop()
-        with ThreadPoolExecutor(max_workers=1) as ex:
-            saved_path = await loop.run_in_executor(ex, save_upload_file, file.file, filename)
+        # Offload blocking save, enforcing size during save
+        saved_path = await asyncio.to_thread(save_upload_file, file.file, filename, MAX_UPLOAD_BYTES)
@@
-        # Quick size check after save
+        # Quick size check after save (defense-in-depth)
         try:
             size = os.path.getsize(saved_path)
             if size > MAX_UPLOAD_BYTES:
-                logger.warning("Saved file exceeds max size", extra={"size": size, "path": saved_path})
-                raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="File too large")
-        except OSError:
-            logger.exception("Failed to stat saved file", extra={"path": saved_path})
+                logger.warning("Saved file exceeds max size", extra={"filename": os.path.basename(saved_path), "size": size})
+                with contextlib.suppress(Exception):
+                    os.remove(saved_path)
+                raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="File too large")
+        except OSError as e:
+            logger.exception("Failed to stat saved file", extra={"filename": os.path.basename(saved_path)})
+            with contextlib.suppress(Exception):
+                os.remove(saved_path)
+            raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail="Failed to validate saved file") from e

And update the saver to enforce the limit while writing (outside this file):

diff --git a/services/file_service.py b/services/file_service.py
@@
-def save_upload_file(upload_file: IO, filename: str) -> str:
+def save_upload_file(upload_file: IO, filename: str, max_bytes: int | None = None) -> str:
@@
-        with tempfile.NamedTemporaryFile(dir=UPLOAD_DIR, delete=False) as tmp:
-            for chunk in iter(lambda: upload_file.read(8192), b""):
-                tmp.write(chunk)
-            temp_path = tmp.name
+        with tempfile.NamedTemporaryFile(dir=UPLOAD_DIR, delete=False) as tmp:
+            total = 0
+            for chunk in iter(lambda: upload_file.read(8192), b""):
+                tmp.write(chunk)
+                total += len(chunk)
+                if max_bytes is not None and total > max_bytes:
+                    temp_path = tmp.name
+                    raise DocumentSaveError("File exceeds allowed size.")
+            temp_path = tmp.name
@@
-    except OSError as e:
+    except OSError as e:
         logger.exception("OSError during file save: %s", filename)
         with contextlib.suppress(Exception):
             if 'temp_path' in locals() and os.path.exists(temp_path):
                 os.remove(temp_path)
         raise DocumentSaveError("Failed to save file securely.") from e
+    except DocumentSaveError as e:
+        logger.warning("Upload aborted due to size limit")
+        with contextlib.suppress(Exception):
+            if 'temp_path' in locals() and os.path.exists(temp_path):
+                os.remove(temp_path)
+        raise

Also applies to: 64-68, 69-77

🤖 Prompt for AI Agents
In api/endpoints/upload.py around lines 50-60 (and similarly 64-68, 69-77), the
current logic only checks Content-Length header and may allow oversized uploads
to be written to disk if the header is missing or incorrect; update the upload
flow to enforce MAX_UPLOAD_BYTES during the streaming/save operation (move
size-check logic into the saver so it raises once the cumulative bytes exceed
the limit), ensure that if any post-save size check fails the partially written
file is deleted, and replace any per-request ThreadPoolExecutor usage with
asyncio.to_thread for disk I/O offloading; adjust calling code here to call the
updated saver that raises on overflow and handle that exception to return HTTP
413 and perform file cleanup.

                 text_preview = text[:500] if text else None
             except DocumentParseError:
-                logger.error("Document parse error")
+                logger.error("Document parse error", extra={"path": saved_path})

⚠️ Potential issue | 🟠 Major

Use logger.exception in except blocks and chain exceptions

Align with Ruff TRY400 and B904; also avoid logging full paths elsewhere.

Apply this diff:

-                logger.error("Document parse error", extra={"path": saved_path})
+                logger.exception("Document parse error", extra={"filename": os.path.basename(saved_path)})
@@
-                logger.error("Unicode decode error while reading file for preview", extra={"path": saved_path})
+                logger.exception("Unicode decode error while reading file for preview", extra={"filename": os.path.basename(saved_path)})
@@
-                logger.error("OS error while reading file for preview", extra={"path": saved_path})
+                logger.exception("OS error while reading file for preview", extra={"filename": os.path.basename(saved_path)})
@@
-        logger.error("Document save error", extra={"error": str(dse)})
+        logger.exception("Document save error", extra={"error": str(dse)})
         raise HTTPException(status_code=400, detail="Failed to save uploaded document") from dse
@@
-        logger.error("Document processing error", extra={"error": str(de)})
-        raise HTTPException(status_code=422, detail="Error processing document")
+        logger.exception("Document processing error", extra={"error": str(de)})
+        raise HTTPException(status_code=422, detail="Error processing document") from de

Also applies to: 98-98, 101-101, 110-110, 113-114

🧰 Tools
🪛 Ruff (0.13.1)

86-86: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

🤖 Prompt for AI Agents
In api/endpoints/upload.py around lines 86, 98, 101, 110 and 113-114, replace
logger.error(...) calls inside except blocks with logger.exception(...) so the
stack trace is included, avoid logging full filesystem paths by logging only the
filename or a redacted path (e.g., os.path.basename(saved_path) or
"<redacted>"), and when re-raising exceptions ensure you use exception chaining
(raise NewError(...) from e) rather than raising without the original exception;
do this for each listed location.

Comment on lines +115 to +117
    finally:
        # Clear per-request logging context
        clear_log_context()

⚠️ Potential issue | 🟠 Major

Close the uploaded file in finally to avoid resource leaks

Starlette’s UploadFile exposes an async close(). Ensure cleanup.

Apply this diff:

     finally:
-        # Clear per-request logging context
-        clear_log_context()
+        with contextlib.suppress(Exception):
+            await file.close()
+        # Clear per-request logging context
+        clear_log_context()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    finally:
-        # Clear per-request logging context
-        clear_log_context()
+    finally:
+        with contextlib.suppress(Exception):
+            await file.close()
+        # Clear per-request logging context
+        clear_log_context()
🤖 Prompt for AI Agents
In api/endpoints/upload.py around lines 115 to 117, the finally block only
clears per-request logging context and does not close the Starlette UploadFile,
risking a resource leak; modify the finally block to also await the uploaded
file's async close() (e.g., await uploaded_file.close()) so the uploaded file is
properly cleaned up before exiting the request handler, ensuring the call is
awaited and wrapped in a safe try/except if needed to avoid masking original
exceptions.

@fehranbit self-requested a review on September 30, 2025, 05:43
@fehranbit merged commit 4ae0082 into main on Sep 30, 2025
1 check passed