feat(upload): async upload + sanitization + size/type checks + structured logging #32
Conversation
…d improved error handling; refactor for async processing and logging context management.
Walkthrough
An asynchronous upload endpoint replaces the synchronous handler, adding request-scoped logging context, content-type and size validation, filename sanitization, thread-pooled file saving, a post-save size check, refined parsing for previews, and expanded error handling that maps domain errors to specific HTTP statuses. The function signature now includes Request and is async.

Changes
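For orientation, a minimal sketch of the handler shape the walkthrough describes; the route decorator, parameter names, and response model below follow the files referenced later in this review but are illustrative, not a verbatim copy of the PR.

```python
from fastapi import APIRouter, File, Request, UploadFile

from models.schemas import UploadResponse  # response model referenced in the code graph below

router = APIRouter()


@router.post("/upload", response_model=UploadResponse)
async def upload_document(request: Request, file: UploadFile = File(...)) -> UploadResponse:
    """Async handler: validate, sanitize, save in a worker thread, build a preview."""
    ...
```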
Sequence Diagram(s)
```mermaid
sequenceDiagram
autonumber
actor Client
participant API as Upload Endpoint
participant Val as Validation
participant FS as ThreadPool Save
participant Parser as Parser/Preview
participant Log as Log Context
Client->>API: POST /upload (file)
API->>Log: set_log_context(request)
rect rgba(200,230,255,0.3)
API->>Val: Check content-type & Content-Length
Val-->>API: Valid or raise
end
API->>API: _secure_filename(name)
rect rgba(220,255,220,0.3)
API->>FS: Save file (run_in_executor)
FS-->>API: Saved path
API->>Val: Post-save size check
end
rect rgba(255,245,200,0.4)
API->>Parser: Generate preview (PDF/text)
Parser-->>API: Preview/metadata
end
API-->>Client: 200 OK (result)
alt DocumentSaveError
API-->>Client: 400 Bad Request
else Parse/Chunk/Embedding Error
API-->>Client: 422 Unprocessable Entity
else Other Error
API-->>Client: 500 Internal Server Error
end
API->>Log: clear_log_context()
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 3
🧹 Nitpick comments (6)
api/endpoints/upload.py (6)
5-10: Add required imports; drop the per-request ThreadPoolExecutor. Use asyncio.to_thread and cleanup helpers.
Apply this diff:
from fastapi import APIRouter, UploadFile, File, HTTPException, Request, status
import os
import pathlib
-from concurrent.futures import ThreadPoolExecutor
from typing import Optional
+import asyncio
+import contextlib
+import uuid
89-89: Avoid logging absolute filesystem paths. Log only the sanitized basename to reduce sensitive path exposure.
Apply this diff:
- logger.exception("Error parsing PDF file for preview", extra={"path": saved_path})
+ logger.exception("Error parsing PDF file for preview", extra={"filename": os.path.basename(saved_path)})
45-48: Normalize Content-Type by stripping parameters (e.g., charset). Clients may send "text/plain; charset=utf-8"; split off parameters before the allow-list check.
Apply this diff:
- content_type = (file.content_type or "").lower()
+ content_type = (file.content_type or "").split(";", 1)[0].lower()
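If more robust parsing is ever wanted, the standard library can also split the media type from its parameters; a small sketch (not part of the suggested diff, function name is illustrative):

```python
from email.message import Message


def media_type(content_type: str) -> str:
    # Parses "text/plain; charset=utf-8" and returns just "text/plain".
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_type()


assert media_type("text/plain; charset=utf-8") == "text/plain"
```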
39-43: Always set a request_id; support common header casing. Fall back to X-Request-ID and generate one if absent.
Apply this diff:
- request_id = request.headers.get("X-Request-Id") or None
- if request_id:
-     set_log_context(request_id=request_id)
+ request_id = (
+     request.headers.get("X-Request-Id")
+     or request.headers.get("X-Request-ID")
+     or uuid.uuid4().hex
+ )
+ set_log_context(request_id=request_id)
26-31: Tighten filename sanitization (optional; file_service also validates). Prefer a restricted ASCII set and a length cap to reduce surprises.
Apply this diff:
 def _secure_filename(name: str) -> str:
-    # Simple sanitization: take only base name and strip suspicious characters
-    base = pathlib.Path(name).name
-    # remove path separators and control chars
-    return "".join(c for c in base if c.isprintable())
+    # Take base name; restrict to a safe ASCII set; cap length
+    base = pathlib.Path(name or "upload").name
+    allowed = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_. ")
+    cleaned = "".join(c for c in base if c in allowed) or "upload"
+    return cleaned[:128]
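To make the intent concrete, a few illustrative expectations for the tightened sanitizer proposed above (the suggested function body is restated so the snippet runs on its own):

```python
import pathlib


def _secure_filename(name: str) -> str:
    # Take base name; restrict to a safe ASCII set; cap length
    base = pathlib.Path(name or "upload").name
    allowed = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_. ")
    cleaned = "".join(c for c in base if c in allowed) or "upload"
    return cleaned[:128]


assert _secure_filename("../../etc/passwd") == "passwd"    # path traversal collapsed to base name
assert _secure_filename("répört.pdf") == "rprt.pdf"        # characters outside the allow-list dropped
assert _secure_filename("") == "upload"                    # empty input falls back to a default
assert len(_secure_filename("a" * 500 + ".txt")) == 128    # length capped
```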
81-91: Server-side type validation (optional). The current allow-list trusts the client-provided Content-Type. Consider lightweight sniffing (e.g., check for the "%PDF-" magic for PDFs) to harden type checks.
Would you like me to add a small header-sniff (no extra deps) for PDFs and fallback text detection?
Also applies to: 92-102
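A minimal sketch of what such a dependency-free sniff could look like; function names are illustrative and the text heuristic is only one possible choice:

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path: str) -> bool:
    # A real PDF starts with the "%PDF-" magic bytes.
    with open(path, "rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC


def looks_like_text(path: str, sample_size: int = 1024) -> bool:
    # Heuristic fallback: no NUL bytes and the sample decodes as UTF-8.
    with open(path, "rb") as fh:
        sample = fh.read(sample_size)
    if b"\x00" in sample:
        return False
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```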
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
api/endpoints/upload.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
api/endpoints/upload.py (5)
services/file_service.py (1)
  save_upload_file (19-57)
services/parsing_service.py (1)
  extract_text_from_pdf (10-24)
models/schemas.py (1)
  UploadResponse (7-11)
services/exceptions.py (4)
  DocumentSaveError (13-15)
  DocumentParseError (9-11)
  DocumentChunkError (17-19)
  DocumentEmbeddingError (21-23)
services/logging_config.py (3)
  get_logger (171-174)
  set_log_context (157-160)
  clear_log_context (167-168)
🪛 Ruff (0.13.1)
api/endpoints/upload.py
86-86: Use logging.exception instead of logging.error; replace with exception (TRY400)
98-98: Use logging.exception instead of logging.error; replace with exception (TRY400)
101-101: Use logging.exception instead of logging.error; replace with exception (TRY400)
110-110: Use logging.exception instead of logging.error; replace with exception (TRY400)
113-113: Use logging.exception instead of logging.error; replace with exception (TRY400)
114-114: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling (B904)
    # If client provided Content-Length header, check early
    content_length = request.headers.get("content-length")
    if content_length:
        try:
            if int(content_length) > MAX_UPLOAD_BYTES:
                logger.warning("Rejected upload due to size header too large", extra={"size": content_length})
                raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="File too large")
        except ValueError:
            # ignore invalid header and continue with streaming checks
            pass
Block oversized uploads during streaming; avoid saving then rejecting
Header-based checks aren’t enough. As written, a client can omit/misstate Content-Length and force writing arbitrarily large files to disk. Enforce the size limit during the write, and delete the file if a post-save check fails. Also, avoid per-request ThreadPoolExecutor and use asyncio.to_thread.
Apply this diff:
@@
- # Offload blocking save to threadpool
- loop = __import__("asyncio").get_running_loop()
- with ThreadPoolExecutor(max_workers=1) as ex:
- saved_path = await loop.run_in_executor(ex, save_upload_file, file.file, filename)
+ # Offload blocking save, enforcing size during save
+ saved_path = await asyncio.to_thread(save_upload_file, file.file, filename, MAX_UPLOAD_BYTES)
@@
- # Quick size check after save
+ # Quick size check after save (defense-in-depth)
try:
size = os.path.getsize(saved_path)
if size > MAX_UPLOAD_BYTES:
- logger.warning("Saved file exceeds max size", extra={"size": size, "path": saved_path})
- raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="File too large")
- except OSError:
- logger.exception("Failed to stat saved file", extra={"path": saved_path})
+ logger.warning("Saved file exceeds max size", extra={"filename": os.path.basename(saved_path), "size": size})
+ with contextlib.suppress(Exception):
+ os.remove(saved_path)
+ raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="File too large")
+ except OSError as e:
+ logger.exception("Failed to stat saved file", extra={"filename": os.path.basename(saved_path)})
+ with contextlib.suppress(Exception):
+ os.remove(saved_path)
+ raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail="Failed to validate saved file") from e
And update the saver to enforce the limit while writing (outside this file):
diff --git a/services/file_service.py b/services/file_service.py
@@
-def save_upload_file(upload_file: IO, filename: str) -> str:
+def save_upload_file(upload_file: IO, filename: str, max_bytes: int | None = None) -> str:
@@
- with tempfile.NamedTemporaryFile(dir=UPLOAD_DIR, delete=False) as tmp:
- for chunk in iter(lambda: upload_file.read(8192), b""):
- tmp.write(chunk)
- temp_path = tmp.name
+ with tempfile.NamedTemporaryFile(dir=UPLOAD_DIR, delete=False) as tmp:
+ total = 0
+ for chunk in iter(lambda: upload_file.read(8192), b""):
+ tmp.write(chunk)
+ total += len(chunk)
+ if max_bytes is not None and total > max_bytes:
+ temp_path = tmp.name
+ raise DocumentSaveError("File exceeds allowed size.")
+ temp_path = tmp.name
@@
- except OSError as e:
+ except OSError as e:
logger.exception("OSError during file save: %s", filename)
with contextlib.suppress(Exception):
if 'temp_path' in locals() and os.path.exists(temp_path):
os.remove(temp_path)
raise DocumentSaveError("Failed to save file securely.") from e
+ except DocumentSaveError as e:
+ logger.warning("Upload aborted due to size limit")
+ with contextlib.suppress(Exception):
+ if 'temp_path' in locals() and os.path.exists(temp_path):
+ os.remove(temp_path)
+ raise
Also applies to: 64-68, 69-77
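A small pytest sketch of the behaviour the suggested saver change is meant to guarantee; it assumes the proposed max_bytes parameter has been added to save_upload_file and is otherwise illustrative:

```python
import io

import pytest

from services.exceptions import DocumentSaveError
from services.file_service import save_upload_file


def test_oversized_stream_is_rejected_during_write():
    # A 10 kB payload against a 1 kB limit should abort mid-stream,
    # not after the whole file has been written to disk.
    payload = io.BytesIO(b"x" * 10_000)
    with pytest.raises(DocumentSaveError):
        save_upload_file(payload, "big.bin", max_bytes=1_000)
```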
🤖 Prompt for AI Agents
In api/endpoints/upload.py around lines 50-60 (and similarly 64-68, 69-77), the
current logic only checks Content-Length header and may allow oversized uploads
to be written to disk if the header is missing or incorrect; update the upload
flow to enforce MAX_UPLOAD_BYTES during the streaming/save operation (move
size-check logic into the saver so it raises once the cumulative bytes exceed
the limit), ensure that if any post-save size check fails the partially written
file is deleted, and replace any per-request ThreadPoolExecutor usage with
asyncio.to_thread for disk I/O offloading; adjust calling code here to call the
updated saver that raises on overflow and handle that exception to return HTTP
413 and perform file cleanup.
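Sketch of how the calling code could map a size overrun to 413 while keeping other save failures at 400. It assumes the max_bytes parameter proposed above plus a hypothetical DocumentTooLargeError subclass (not in the current services/exceptions.py) so the endpoint does not have to match on message text:

```python
import asyncio
from typing import IO

from fastapi import HTTPException, status

from services.exceptions import DocumentSaveError
from services.file_service import save_upload_file


class DocumentTooLargeError(DocumentSaveError):
    """Hypothetical subclass raised by the saver when max_bytes is exceeded."""


async def save_or_http_error(fileobj: IO, filename: str, max_bytes: int) -> str:
    # Offload the blocking save; translate saver errors into HTTP errors.
    try:
        return await asyncio.to_thread(save_upload_file, fileobj, filename, max_bytes)
    except DocumentTooLargeError as e:
        raise HTTPException(
            status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE, detail="File too large"
        ) from e
    except DocumentSaveError as e:
        raise HTTPException(
            status_code=400, detail="Failed to save uploaded document"
        ) from e
```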
        text_preview = text[:500] if text else None
    except DocumentParseError:
-        logger.error("Document parse error")
+        logger.error("Document parse error", extra={"path": saved_path})
Use logger.exception in except blocks and chain exceptions
Align with Ruff TRY400 and B904; also avoid logging full paths elsewhere.
Apply this diff:
- logger.error("Document parse error", extra={"path": saved_path})
+ logger.exception("Document parse error", extra={"filename": os.path.basename(saved_path)})
@@
- logger.error("Unicode decode error while reading file for preview", extra={"path": saved_path})
+ logger.exception("Unicode decode error while reading file for preview", extra={"filename": os.path.basename(saved_path)})
@@
- logger.error("OS error while reading file for preview", extra={"path": saved_path})
+ logger.exception("OS error while reading file for preview", extra={"filename": os.path.basename(saved_path)})
@@
- logger.error("Document save error", extra={"error": str(dse)})
+ logger.exception("Document save error", extra={"error": str(dse)})
raise HTTPException(status_code=400, detail="Failed to save uploaded document") from dse
@@
- logger.error("Document processing error", extra={"error": str(de)})
- raise HTTPException(status_code=422, detail="Error processing document")
+ logger.exception("Document processing error", extra={"error": str(de)})
+ raise HTTPException(status_code=422, detail="Error processing document") from de
Also applies to: 98-98, 101-101, 110-110, 113-114
🧰 Tools
🪛 Ruff (0.13.1)
86-86: Use logging.exception instead of logging.error; replace with exception (TRY400)
🤖 Prompt for AI Agents
In api/endpoints/upload.py around lines 86, 98, 101, 110 and 113-114, replace
logger.error(...) calls inside except blocks with logger.exception(...) so the
stack trace is included, avoid logging full filesystem paths by logging only the
filename or a redacted path (e.g., os.path.basename(saved_path) or
"<redacted>"), and when re-raising exceptions ensure you use exception chaining
(raise NewError(...) from e) rather than raising without the original exception;
do this for each listed location.
    finally:
        # Clear per-request logging context
        clear_log_context()
Close the uploaded file in finally to avoid resource leaks
Starlette’s UploadFile exposes an async close(). Ensure cleanup.
Apply this diff:
finally:
- # Clear per-request logging context
- clear_log_context()
+ with contextlib.suppress(Exception):
+ await file.close()
+ # Clear per-request logging context
+ clear_log_context()
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    finally:
        with contextlib.suppress(Exception):
            await file.close()
        # Clear per-request logging context
        clear_log_context()
🤖 Prompt for AI Agents
In api/endpoints/upload.py around lines 115 to 117, the finally block only
clears per-request logging context and does not close the Starlette UploadFile,
risking a resource leak; modify the finally block to also await the uploaded
file's async close() (e.g., await uploaded_file.close()) so the uploaded file is
properly cleaned up before exiting the request handler, ensuring the call is
awaited and wrapped in a safe try/except if needed to avoid masking original
exceptions.
Convert `/upload` to an async, safe, and well-observed endpoint. This change offloads blocking file IO to a thread pool, normalizes user-supplied filenames, enforces strict content-type and size limits, and adds per-request structured logging (with `request_id`) for traceability.

Features
- Async upload pipeline: handler is now `async` with thread offload for disk writes (prevents event-loop starvation); requests are rejected early when `Content-Length` exceeds the configured max.
- Filename sanitization: blocks path traversal (`..`, absolute paths) and strips control/special chars.
- Type & size validation (details under API Changes and Configuration below).
- Structured logging (per request): attaches a `request_id` to all logs in the request scope (a minimal sketch of such a logging context follows below).
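The repo's `services/logging_config.py` already provides `set_log_context`, `clear_log_context`, and `get_logger`; purely as an illustration of the idea (not the project's actual implementation), a request-scoped logging context can be built on `contextvars` roughly like this:

```python
import contextvars
import logging

_log_context: contextvars.ContextVar[dict] = contextvars.ContextVar("log_context", default={})


class ContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Copy the per-request fields (e.g. request_id, route) onto every record.
        for key, value in _log_context.get().items():
            setattr(record, key, value)
        return True


def set_log_context(**fields) -> None:
    # Store a new dict each time so concurrent requests never share state.
    _log_context.set({**_log_context.get(), **fields})


def clear_log_context() -> None:
    _log_context.set({})
```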
Fixes
- Correct, specific error responses: `413` for size, `415` for type, `400` for validation, `422` for parse errors.

Refactors
- Validation helpers extracted to `upload_validation.py` (pure functions, unit-tested).
- `RequestLoggerAdapter` (adds `request_id`, `route`, `ip`).
- `storage.py` for pluggable backends (local fs for now; S3/GCS ready).
- `errors.py` with typed exceptions → HTTP responses.

API Changes
Endpoint: `POST /upload`
Request headers:
- `Content-Type`: must be one of the configured allow-list.
- `Content-Length`: validated against `MAX_UPLOAD_BYTES` (if present).
Response:
- `201 Created` with `{ id, filename, contentType, size, sha256, url? }`
- Errors use `{ error: { code, message, details? } }`
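As a usage illustration only (the application module name is an assumption, and the response fields follow the shape described above), an upload via FastAPI's test client might look like:

```python
from fastapi.testclient import TestClient

from main import app  # hypothetical application entry point

client = TestClient(app)

with open("report.pdf", "rb") as fh:
    resp = client.post(
        "/upload",
        files={"file": ("report.pdf", fh, "application/pdf")},
    )

print(resp.status_code)  # 201 on success per this description
print(resp.json())       # e.g. {"id": "...", "filename": "report.pdf", "sha256": "...", ...}
```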
Idempotency: optional `Idempotency-Key` header supported (dedup on hash + key).

Security & Compliance
- Filename checks reject risky patterns (e.g., `file.jpg.exe`).
- Content-type allow-list (`image/*`, `video/mp4`, etc. per config).

Performance

Observability
- Logs (JSON): `request_id`, `route`, `filename_sanitized`, `bytes_received`, `duration_ms`, `result`, plus validation flags (`type_checked`, `size_checked`, `sanitized`).
- Metrics (if enabled): `upload_bytes_total`, `upload_duration_ms`, `upload_failures{code}`, `upload_inflight`.
- Trace hooks: span around validation + write stages (no payload capture).

Configuration
- `MAX_UPLOAD_BYTES` (e.g., `104857600` for 100 MB)
- `ALLOWED_MIME_TYPES` (CSV or JSON)
- `FILENAME_STRATEGY` (`sanitize|slug|reject-unsafe`)
- `STORAGE_ROOT` (local path) / `STORAGE_BACKEND` (`local|s3|gcs` reserved)
- `UPLOAD_THREAD_WORKERS` (defaults to CPU-bound heuristic)

Testing
- Unit tests
- Integration tests covering `413` size, `415` type, `400` invalid filename, `422` malformed multipart.
- Property-based tests for random filenames and mixed headers.

Docs
- Documented `/upload` with examples (curl + JS/TS).
- Proxy body-size limits: `client_max_body_size` / `max_request_body_size`.

Breaking Changes
- … `filename` or `id`.

Migration Notes
- Set `ALLOWED_MIME_TYPES` and `MAX_UPLOAD_BYTES` for your environment.
- … `filename` or `id`.
- … `MAX_UPLOAD_BYTES`.

Risk & Rollout
Risk: Medium (touches request path + IO).
Rollout plan:
- Feature flag `UPLOAD_V2_ENABLED` (shadow logs only).
- Monitor `upload_failures` and latency P95/P99.

Summary by CodeRabbit
New Features
Bug Fixes