feat(file_service): safer uploads, metadata returns, async wrappers, and management helpers #34
Conversation
Walkthrough

Major refactor of the file upload service, adding size/content-type limits, safer path handling, checksum and MIME detection, atomic writes, a metadata-returning save, async offloading via a shared thread pool, and administrative utilities for listing, deleting, and cleaning up uploads. New constants and multiple public functions are introduced.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant FileSvc as FileService
    participant FS as Filesystem
    rect rgba(230,240,255,0.4)
        Client->>FileSvc: save_upload_file_with_meta(upload_file, filename, max_bytes?)
        Note right of FileSvc: Sanitize filename, validate path
        FileSvc->>FileSvc: Enforce size limit (DEFAULT_MAX_UPLOAD_BYTES or override)
        FileSvc->>FS: Write to temp file atomically
        FileSvc-->>FS: Truncate/abort on overflow (error)
        FileSvc->>FileSvc: calculate_checksum(temp_path)
        FileSvc->>FileSvc: detect_mime_type(temp_path)
        FileSvc->>FS: Move temp -> final path (atomic rename)
        FileSvc-->>Client: {path, original_name, size, sha256, mime_type, saved_at}
    end
    alt Error (size/path/I-O)
        FileSvc->>FS: Cleanup temp file
        FileSvc-->>Client: DocumentSaveError
    end
```
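The write-to-temp, hash, then atomic-rename flow in the diagram can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code; `atomic_save` and the error type are placeholder names, and the 50 MiB default is taken from the PR description.

```python
import hashlib
import os
import tempfile

DEFAULT_MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # assumed default cap from the PR description

def atomic_save(source, final_path, max_bytes=DEFAULT_MAX_UPLOAD_BYTES):
    """Stream `source` to a temp file, hash it, then atomically rename into place."""
    sha256 = hashlib.sha256()
    total = 0
    directory = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # temp file on the same filesystem
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in iter(lambda: source.read(8192), b""):
                total += len(chunk)
                if total > max_bytes:
                    raise ValueError("upload exceeds maximum allowed size")
                tmp.write(chunk)
                sha256.update(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, final_path)  # atomic within a single filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # best-effort cleanup of the partial file
        raise
    return {"path": final_path, "size": total, "sha256": sha256.hexdigest()}
```

Because the rename happens last, readers never observe a half-written file: they see either no file or the complete one.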
```mermaid
sequenceDiagram
    autonumber
    actor Client
    participant API as Async API
    participant Pool as ThreadPoolExecutor
    participant FileSvc as FileService (sync)
    rect rgba(240,255,240,0.5)
        Client->>API: save_upload_file_with_meta_async(...)
        API->>Pool: offload to thread
        Pool->>FileSvc: save_upload_file_with_meta(...)
        FileSvc-->>Pool: result / error
        Pool-->>API: result / error
        API-->>Client: await result (propagate errors)
    end
```
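The offloading pattern in the second diagram is standard `run_in_executor` usage. A minimal sketch, with `blocking_save` as a stand-in for the synchronous save function:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def blocking_save(data: bytes) -> int:
    # Stand-in for the synchronous save; just reports the byte count.
    return len(data)

async def save_async(data: bytes) -> int:
    # Offload the blocking call so the event loop stays responsive;
    # exceptions raised in the worker thread propagate to the awaiting caller.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_pool, blocking_save, data)

result = asyncio.run(save_async(b"hello"))
print(result)  # → 5
```

If `blocking_save` raises, the exception surfaces at the `await`, which is why the diagram shows "result / error" flowing back unchanged through the pool.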
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–90 minutes
Actionable comments posted: 5
🧹 Nitpick comments (5)
services/file_service.py (5)
Lines 65-79: Durability: fsync temp file before atomic replace

Atomic rename is good, but without fsync, bytes may not hit disk on a crash.
Apply this diff:
```diff
         for chunk in iter(lambda: source.read(8192), b""):
-            if not chunk:
-                break
             tmp.write(chunk)
             total += len(chunk)
             sha256.update(chunk)
             if total > max_bytes:
                 logger.warning("Upload exceeds max allowed size: %d bytes", total)
                 raise DocumentSaveError("Uploaded file exceeds maximum allowed size.")
+        # Ensure bytes are on disk before atomic replace
+        tmp.flush()
+        os.fsync(tmp.fileno())
```
Lines 132-145: Durability: fsync temp file in metadata save path as well

Apply this diff:

```diff
         for chunk in iter(lambda: source.read(8192), b""):
-            if not chunk:
-                break
             tmp.write(chunk)
             total += len(chunk)
             sha256.update(chunk)
             if total > max_bytes:
                 logger.warning("Upload exceeds max allowed size: %d bytes", total)
                 raise DocumentSaveError("Uploaded file exceeds maximum allowed size.")
+        # Ensure bytes are on disk before atomic replace
+        tmp.flush()
+        os.fsync(tmp.fileno())
```
Lines 33-37: Make thread pool size configurable

Support optional worker tuning via env (e.g., FILE_SERVICE_MAX_WORKERS).

Apply this diff:

```diff
     if _THREAD_POOL is None:
-        _THREAD_POOL = ThreadPoolExecutor(max_workers=4)
+        _THREAD_POOL = ThreadPoolExecutor(
+            max_workers=int(os.environ.get("FILE_SERVICE_MAX_WORKERS", "4"))
+        )
```
Lines 61-76: Minor: drop unused checksum in save_upload_file, or return it

save_upload_file computes sha256 but never uses it; remove it to save CPU, or return it in logs/metadata.
Lines 69-73: Minor: redundant empty-chunk check

`iter(lambda: read(...), b"")` already stops on `b""`. The explicit `if not chunk: break` is redundant.

Apply this diff:

```diff
 for chunk in iter(lambda: source.read(8192), b""):
-    if not chunk:
-        break
```

Also applies to: lines 136-139
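A quick check of the claim behind this nitpick: the two-argument form of `iter()` calls the function until it returns the sentinel, so the loop already ends at EOF.

```python
import io

buf = io.BytesIO(b"abcdefghij")
# read() returns b"" at EOF, which matches the sentinel and ends iteration,
# so an explicit `if not chunk: break` inside the loop is dead code.
chunks = list(iter(lambda: buf.read(4), b""))
print(chunks)  # → [b'abcd', b'efgh', b'ij']
```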
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
services/file_service.py
(2 hunks)
🔇 Additional comments (1)
services/file_service.py (1)
Lines 26-29: ALLOWED_CONTENT_TYPES is never used

After adding allowed-type enforcement above, this constant becomes effective. If not enforcing now, remove the env var to avoid a false sense of security.
```python
def save_upload_file(upload_file: IO, filename: str) -> str:
    # Sanitize filename: remove path, reject empty/unsafe
    logger = logging.getLogger(__name__)
    base = os.path.basename(filename)
    if not base or base in {'.', '..'} or any(c in base for c in '\\/:*?"<>|'):
        logger.error("Invalid filename for upload: %s", filename)
        raise DocumentSaveError("Invalid filename.")
```
Fix unused max_bytes: plumb through async wrapper to sync save
max_bytes in save_upload_file_async is ignored; wire it through and make sync save accept it.
Apply this diff:
```diff
-def save_upload_file(upload_file: IO, filename: str) -> str:
+def save_upload_file(upload_file: IO, filename: str, max_bytes: Optional[int] = None) -> str:
@@
-    # Default max bytes
-    max_bytes = DEFAULT_MAX_UPLOAD_BYTES
+    # Default max bytes
+    if max_bytes is None:
+        max_bytes = DEFAULT_MAX_UPLOAD_BYTES
@@
 async def save_upload_file_async(upload_file: IO, filename: str, max_bytes: Optional[int] = None) -> str:
     """Async wrapper that offloads the blocking save call to a thread pool."""
     loop = asyncio.get_running_loop()
-    return await loop.run_in_executor(_get_thread_pool(), save_upload_file, upload_file, filename)
+    return await loop.run_in_executor(_get_thread_pool(), save_upload_file, upload_file, filename, max_bytes)
```
Also applies to: 58-60, 179-183
🤖 Prompt for AI Agents
In services/file_service.py around lines 39-46 (and also update usages at 58-60
and 179-183), the async wrapper save_upload_file_async currently ignores the
max_bytes argument; update the sync function signature to accept an optional
max_bytes parameter, plumb max_bytes from the async wrapper into the sync call,
and enforce the byte limit inside the sync save (validate file size or stream
read versus max_bytes and raise DocumentSaveError on overflow). Also update all
callsites and tests to pass max_bytes through and ensure logging includes the
filename and reason when rejecting due to size.
```python
    # Optionally detect mime type (by extension); more accurate detection can be added
    try:
        mime, _ = mimetypes.guess_type(final_path)
    except Exception:
        mime = None

    logger.info("Saved upload %s as %s (%d bytes, mime=%s)", base, final_path, total, mime)
    return final_path
```
Enforce allowed MIME types and use detect_mime_type()
ALLOWED_CONTENT_TYPES is unused. Detect MIME and reject disallowed types; clean up the saved file if rejected.
Apply this diff:

```diff
-    # Optionally detect mime type (by extension); more accurate detection can be added
-    try:
-        mime, _ = mimetypes.guess_type(final_path)
-    except Exception:
-        mime = None
-
-    logger.info("Saved upload %s as %s (%d bytes, mime=%s)", base, final_path, total, mime)
+    # Detect MIME (with python-magic fallback) and enforce allowed types if configured
+    mime = detect_mime_type(final_path)
+    if ALLOWED_CONTENT_TYPES is not None and mime not in ALLOWED_CONTENT_TYPES:
+        logger.warning("Disallowed content type: %s for %s", mime, base)
+        with contextlib.suppress(Exception):
+            os.remove(final_path)
+        raise DocumentSaveError("Uploaded file type is not allowed.")
+
+    logger.info("Saved upload %s as %s (%d bytes, mime=%s)", base, final_path, total, mime)
```
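A plausible shape for the `detect_mime_type()` helper this suggestion relies on (the PR's actual implementation may differ): content-sniff with the optional python-magic package when installed, falling back to extension-based guessing.

```python
import mimetypes
from typing import Optional

def detect_mime_type(path: str) -> Optional[str]:
    """Content-sniff with python-magic when available; else guess by extension."""
    try:
        import magic  # optional dependency: python-magic
        return magic.from_file(path, mime=True)
    except Exception:
        # ImportError (magic not installed) or sniffing failure: fall back
        mime, _ = mimetypes.guess_type(path)
        return mime
```

Content sniffing is the stronger check for an allow-list, since the extension is attacker-controlled while the file's magic bytes are harder to fake usefully.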
```python
    base = os.path.basename(filename)
    if not base:
        raise DocumentSaveError("Invalid filename.")

    ext = os.path.splitext(base)[1]
    safe_name = f"{uuid.uuid4().hex}{ext}"
    final_path = os.path.abspath(os.path.join(UPLOAD_DIR, safe_name))

    upload_dir_abs = os.path.abspath(UPLOAD_DIR)
    if not final_path.startswith(upload_dir_abs + os.sep):
        logger.error("Unsafe file path detected: %s", final_path)
        raise DocumentSaveError("Unsafe file path.")
```
Apply same filename sanitization here as in save_upload_file
Inconsistent checks; mirror the stricter validation to avoid edge-case surprises.
Apply this diff:
```diff
-    base = os.path.basename(filename)
-    if not base:
-        raise DocumentSaveError("Invalid filename.")
+    base = os.path.basename(filename)
+    if not base or base in {'.', '..'} or any(c in base for c in '\\/:*?"<>|'):
+        logger.error("Invalid filename for upload: %s", filename)
+        raise DocumentSaveError("Invalid filename.")
```
🤖 Prompt for AI Agents
In services/file_service.py around lines 116 to 128, the current filename
handling builds a UUID-based safe_name but lacks the stricter checks used in
save_upload_file; update this block to reuse or mirror the same
sanitization/validation: validate and normalize the incoming filename (strip
path components, reject names containing os.sep or leading dots), enforce the
allowed/extensions rule and length limits used by save_upload_file, derive the
extension safely (lowercased) and build safe_name from a validated ext, then
compute final_path and re-check that final_path resides inside the absolute
UPLOAD_DIR before writing; if any check fails, log the full context and raise
DocumentSaveError.
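The stricter validation the review asks to mirror can be factored into a small helper along these lines (a sketch; the function name and `ValueError` are illustrative, where the PR would raise `DocumentSaveError`):

```python
import os

UNSAFE_CHARS = set('\\/:*?"<>|')

def sanitize_filename(filename: str) -> str:
    """Strip path components, then reject empty, dot-only, or metacharacter names."""
    base = os.path.basename(filename)
    if not base or base in {".", ".."} or any(c in UNSAFE_CHARS for c in base):
        raise ValueError(f"invalid filename: {filename!r}")
    return base
```

Sharing one helper between `save_upload_file` and the metadata-returning save is what removes the inconsistency flagged above.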
```python
    try:
        mime, _ = mimetypes.guess_type(final_path)
    except Exception:
        mime = None

    meta = {
        'path': final_path,
        'original_name': base,
        'size': total,
        'sha256': sha256.hexdigest(),
        'mime_type': mime,
        'saved_at': datetime.now(timezone.utc).isoformat(),
    }
    logger.info("Saved upload metadata: %s", {k: meta[k] for k in ('original_name', 'path', 'size', 'mime_type')})
    return meta
```
Return accurate MIME and enforce allowed types in metadata save
Use detect_mime_type() and enforce ALLOWED_CONTENT_TYPES with cleanup on violation.
Apply this diff:
```diff
-    try:
-        mime, _ = mimetypes.guess_type(final_path)
-    except Exception:
-        mime = None
+    mime = detect_mime_type(final_path)
+    if ALLOWED_CONTENT_TYPES is not None and mime not in ALLOWED_CONTENT_TYPES:
+        logger.warning("Disallowed content type: %s for %s", mime, base)
+        with contextlib.suppress(Exception):
+            os.remove(final_path)
+        raise DocumentSaveError("Uploaded file type is not allowed.")
     meta = {
         'path': final_path,
         'original_name': base,
         'size': total,
         'sha256': sha256.hexdigest(),
         'mime_type': mime,
         'saved_at': datetime.now(timezone.utc).isoformat(),
     }
```
```python
def delete_upload(name_or_path: str) -> bool:
    """Delete an upload by filename or absolute path. Returns True if deleted."""
    if os.path.isabs(name_or_path):
        path = name_or_path
    else:
        path = os.path.join(UPLOAD_DIR, name_or_path)
    try:
        if os.path.exists(path):
            os.remove(path)
            return True
    except Exception:
        logging.getLogger(__name__).exception("Failed to delete upload: %s", path)
    return False
```
Block path traversal: restrict deletes to UPLOAD_DIR and files only
Allowing absolute paths enables deleting arbitrary files. Constrain deletions to files under UPLOAD_DIR.
Apply this diff:
```diff
 def delete_upload(name_or_path: str) -> bool:
-    """Delete an upload by filename or absolute path. Returns True if deleted."""
-    if os.path.isabs(name_or_path):
-        path = name_or_path
-    else:
-        path = os.path.join(UPLOAD_DIR, name_or_path)
-    try:
-        if os.path.exists(path):
-            os.remove(path)
-            return True
-    except Exception:
-        logging.getLogger(__name__).exception("Failed to delete upload: %s", path)
-    return False
+    """Delete an upload by filename or absolute path within UPLOAD_DIR. Returns True if deleted."""
+    logger = logging.getLogger(__name__)
+    if os.path.isabs(name_or_path):
+        path = name_or_path
+    else:
+        path = os.path.join(UPLOAD_DIR, name_or_path)
+    try:
+        path = os.path.abspath(path)
+        upload_dir_abs = os.path.abspath(UPLOAD_DIR)
+        # Ensure path is inside UPLOAD_DIR and is a regular file
+        if not path.startswith(upload_dir_abs + os.sep) or not os.path.isfile(path):
+            logger.warning("Refusing to delete path outside UPLOAD_DIR or non-file: %s", path)
+            return False
+        os.remove(path)
+        return True
+    except Exception:
+        logger.exception("Failed to delete upload: %s", path)
+        return False
```
🤖 Prompt for AI Agents
In services/file_service.py around lines 230-243, the delete_upload function
currently accepts absolute paths and can delete files outside UPLOAD_DIR; change
it to resolve the requested path against UPLOAD_DIR only, compute both
UPLOAD_DIR and target via Path.resolve(), ensure the resolved target starts with
the resolved UPLOAD_DIR path (reject if not), verify the target is a regular
file (is_file()) and not a directory or symlink pointing outside, then remove it
and return True; on any check failure return False and keep the existing
exception logging for actual removal errors.
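The containment check the AI prompt describes can also be built on `os.path.realpath` plus `os.path.commonpath`, which resolves symlinks and `..` segments and avoids prefix-string pitfalls (e.g., `/var/uploads-evil` passing a naive `startswith("/var/uploads")`). A sketch; `is_within` is an illustrative name:

```python
import os

def is_within(directory: str, target: str) -> bool:
    """True if `target` resolves to a path strictly inside `directory`."""
    directory = os.path.realpath(directory)
    target = os.path.realpath(target)  # resolves symlinks and ".." segments
    return target != directory and os.path.commonpath([directory, target]) == directory

print(is_within("/var/uploads", "/var/uploads/a.txt"))      # → True
print(is_within("/var/uploads", "/var/uploads/../secret"))  # → False
```

A `delete_upload` guard would then be `if not is_within(UPLOAD_DIR, path): return False` before the `os.path.isfile` check.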
Upgrades file_service.py to make uploads safer, more observable, and non-blocking. Adds a configurable upload directory (via env), enforces a server-side size cap (default 50 MiB), writes files atomically, and computes SHA-256 checksums. Introduces a metadata-returning save that includes path, size, hash, MIME type, and timestamp. Provides async wrappers to offload blocking I/O and hashing to a thread pool. Adds MIME detection with an optional python-magic fallback, plus lifecycle helpers to list, delete, and clean up old uploads. Strengthens filename/path validation and improves structured logging for start/finish/error events. Configuration knobs cover the upload directory, max bytes, and optional thread worker tuning.