
Conversation

@enfayz enfayz (Collaborator) commented Sep 25, 2025

Upgrades file_service.py to make uploads safer, more observable, and non-blocking. Adds a configurable upload directory (via env), enforces a server-side size cap (default 50 MiB), writes files atomically, and computes SHA-256 checksums. Introduces a metadata-returning save that includes path, size, hash, MIME type, and timestamp. Provides async wrappers to offload blocking IO and hashing to a thread pool. Adds MIME detection with optional python-magic fallback, plus lifecycle helpers to list, delete, and clean up old uploads. Strengthens filename/path validation and improves structured logging for start/finish/error events. Configuration knobs cover upload directory, max bytes, and optional thread worker tuning.
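For orientation, the core save path described above can be sketched roughly as follows. This is a minimal illustration based on the summary, not the PR's exact code; names like `save_upload_file_with_meta` and the `UPLOAD_DIR` env default are assumptions:

```python
import hashlib
import os
import tempfile
import uuid
from datetime import datetime, timezone

UPLOAD_DIR = os.environ.get("UPLOAD_DIR", "uploads")
DEFAULT_MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50 MiB server-side cap

def save_upload_file_with_meta(source, filename, max_bytes=None):
    if max_bytes is None:
        max_bytes = DEFAULT_MAX_UPLOAD_BYTES
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    ext = os.path.splitext(os.path.basename(filename))[1]
    final_path = os.path.join(UPLOAD_DIR, f"{uuid.uuid4().hex}{ext}")
    sha256, total = hashlib.sha256(), 0
    fd, tmp_path = tempfile.mkstemp(dir=UPLOAD_DIR)
    try:
        with os.fdopen(fd, "wb") as tmp:
            # Stream in chunks so the size cap is enforced without buffering the file
            for chunk in iter(lambda: source.read(8192), b""):
                total += len(chunk)
                if total > max_bytes:
                    raise ValueError("upload exceeds maximum allowed size")
                tmp.write(chunk)
                sha256.update(chunk)
        os.replace(tmp_path, final_path)  # atomic rename within the same filesystem
    except Exception:
        # Clean up the partial temp file on any failure
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return {
        "path": final_path,
        "size": total,
        "sha256": sha256.hexdigest(),
        "saved_at": datetime.now(timezone.utc).isoformat(),
    }
```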

Summary by CodeRabbit

  • New Features

    • Configurable upload size limits and allowed file types via settings.
    • Asynchronous uploads that return rich metadata (name, size, checksum, type, saved time).
    • Automatic checksum and MIME type detection.
    • Administrative tools to list uploads, delete files, and clean up old uploads.
  • Improvements

    • Safer file handling with sanitized names, safe paths, and atomic writes.
    • More informative errors with automatic cleanup on failures.
    • Performance boosts by offloading blocking I/O to background workers.
    • Enhanced logging for saved uploads.


coderabbitai bot commented Sep 25, 2025

Walkthrough

Major refactor of the file upload service adding size/content-type limits, safer path handling, checksum and MIME detection, atomic writes, metadata-returning save, async offloading via a shared thread pool, and administrative utilities for listing, deleting, and cleaning up uploads. New constants and multiple public functions are introduced.

Changes

Cohort / File(s) Summary of changes
Upload service refactor and utilities
services/file_service.py
Introduces MAX size and allowed content-type config; filename sanitization and safe path checks; atomic write with byte-limit enforcement; SHA-256 checksum and MIME detection; new save with metadata; async wrappers using shared ThreadPoolExecutor; functions for listing uploads, deleting by name/path, and time-based cleanup; new public exports and constants.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant FileSvc as FileService
  participant FS as Filesystem

  rect rgba(230,240,255,0.4)
  Client->>FileSvc: save_upload_file_with_meta(upload_file, filename, max_bytes?)
  Note right of FileSvc: Sanitize filename, validate path
  FileSvc->>FileSvc: Enforce size limit (DEFAULT_MAX_UPLOAD_BYTES or override)
  FileSvc->>FS: Write to temp file atomically
  FileSvc-->>FS: Truncate/abort on overflow (error)
  FileSvc->>FileSvc: calculate_checksum(temp_path)
  FileSvc->>FileSvc: detect_mime_type(temp_path)
  FileSvc->>FS: Move temp -> final path (atomic rename)
  FileSvc-->>Client: {path, original_name, size, sha256, mime_type, saved_at}
  end

  alt Error (size/path/I/O)
    FileSvc->>FS: Cleanup temp file
    FileSvc-->>Client: DocumentSaveError
  end
sequenceDiagram
  autonumber
  actor Client
  participant API as Async API
  participant Pool as ThreadPoolExecutor
  participant FileSvc as FileService (sync)

  rect rgba(240,255,240,0.5)
  Client->>API: save_upload_file_with_meta_async(...)
  API->>Pool: offload to thread
  Pool->>FileSvc: save_upload_file_with_meta(...)
  FileSvc-->>Pool: result / error
  Pool-->>API: result / error
  API-->>Client: await result (propagate errors)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Poem

I thump my paws—new files arrive!
With checksums, MIME, and tidy hive.
I stash them safe, then hop away,
Async breezes make my day.
Old crumbs cleaned, bad bits denied—
Carrot logs show where they hide. 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 70.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title succinctly summarizes the key enhancements to the file_service module (safer uploads, metadata returns, async wrappers, and management helpers) and directly reflects the main changes introduced in the pull request.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/file-service-enhancements

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (5)
services/file_service.py (5)

65-79: Durability: fsync temp file before atomic replace

Atomic rename is good, but without fsync bytes may not hit disk on crash.

Apply this diff:

             for chunk in iter(lambda: source.read(8192), b""):
-                if not chunk:
-                    break
                 tmp.write(chunk)
                 total += len(chunk)
                 sha256.update(chunk)
                 if total > max_bytes:
                     logger.warning("Upload exceeds max allowed size: %d bytes", total)
                     raise DocumentSaveError("Uploaded file exceeds maximum allowed size.")
+            # Ensure bytes are on disk before atomic replace
+            tmp.flush()
+            os.fsync(tmp.fileno())
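Taken in isolation, the write/flush/fsync/replace sequence the reviewer is asking for looks roughly like this (a generic sketch of the pattern, not the service's code):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    # Create the temp file in the destination directory so os.replace stays atomic
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # push Python's buffers to the OS
            os.fsync(tmp.fileno())   # push the OS buffers to disk
        os.replace(tmp_path, path)   # atomic rename on POSIX and Windows
    except Exception:
        # Remove the partial temp file if anything failed before the rename
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Without the `fsync`, a crash after the rename can leave the final name pointing at data that never reached disk; with it, readers see either the old file or the fully written new one.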

132-145: Durability: fsync temp file in metadata save path as well

Apply this diff:

             for chunk in iter(lambda: source.read(8192), b""):
-                if not chunk:
-                    break
                 tmp.write(chunk)
                 total += len(chunk)
                 sha256.update(chunk)
                 if total > max_bytes:
                     logger.warning("Upload exceeds max allowed size: %d bytes", total)
                     raise DocumentSaveError("Uploaded file exceeds maximum allowed size.")
+            # Ensure bytes are on disk before atomic replace
+            tmp.flush()
+            os.fsync(tmp.fileno())

33-37: Make thread pool size configurable

Support optional worker tuning via env (e.g., FILE_SERVICE_MAX_WORKERS).

Apply this diff:

     if _THREAD_POOL is None:
-        _THREAD_POOL = ThreadPoolExecutor(max_workers=4)
+        _THREAD_POOL = ThreadPoolExecutor(
+            max_workers=int(os.environ.get("FILE_SERVICE_MAX_WORKERS", "4"))
+        )

61-76: Minor: drop unused checksum in save_upload_file or return it

save_upload_file computes sha256 but never uses it; remove to save CPU or return it in logs/metadata.


69-73: Minor: redundant empty-chunk check

iter(lambda: read(...), b"") already stops on b"". The explicit if not chunk: break is redundant.

Apply this diff:

-            for chunk in iter(lambda: source.read(8192), b""):
-                if not chunk:
-                    break
+            for chunk in iter(lambda: source.read(8192), b""):

Also applies to: 136-139
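For reference, the two-argument form of `iter()` calls the callable until it returns the sentinel, so an empty read already terminates the loop:

```python
import io

source = io.BytesIO(b"abcdefgh")
# read(3) yields b"abc", b"def", b"gh"; the next call returns b"", the sentinel,
# which stops iteration without any explicit break
chunks = list(iter(lambda: source.read(3), b""))
```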

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 77ca247 and 60ae223.

📒 Files selected for processing (1)
  • services/file_service.py (2 hunks)
🔇 Additional comments (1)
services/file_service.py (1)

26-29: ALLOWED_CONTENT_TYPES is never used

After adding allowed-type enforcement above, this becomes effective. If not enforcing now, remove the env to avoid a false sense of security.

Comment on lines 39 to 46
 def save_upload_file(upload_file: IO, filename: str) -> str:
     # Sanitize filename: remove path, reject empty/unsafe
     logger = logging.getLogger(__name__)
     base = os.path.basename(filename)
-    if not base or base in {'.', '..'} or any(c in base for c in '\/:*?"<>|'):
+    if not base or base in {'.', '..'} or any(c in base for c in '\\/:*?"<>|'):
         logger.error("Invalid filename for upload: %s", filename)
         raise DocumentSaveError("Invalid filename.")


⚠️ Potential issue | 🔴 Critical

Fix unused max_bytes: plumb through async wrapper to sync save

max_bytes in save_upload_file_async is ignored; wire it through and make sync save accept it.

Apply this diff:

-def save_upload_file(upload_file: IO, filename: str) -> str:
+def save_upload_file(upload_file: IO, filename: str, max_bytes: Optional[int] = None) -> str:
@@
-    # Default max bytes
-    max_bytes = DEFAULT_MAX_UPLOAD_BYTES
+    # Default max bytes
+    if max_bytes is None:
+        max_bytes = DEFAULT_MAX_UPLOAD_BYTES
@@
 async def save_upload_file_async(upload_file: IO, filename: str, max_bytes: Optional[int] = None) -> str:
     """Async wrapper that offloads the blocking save call to a thread pool."""
     loop = asyncio.get_running_loop()
-    return await loop.run_in_executor(_get_thread_pool(), save_upload_file, upload_file, filename)
+    return await loop.run_in_executor(_get_thread_pool(), save_upload_file, upload_file, filename, max_bytes)

Also applies to: 58-60, 179-183
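As background for the fix: `run_in_executor` forwards positional arguments to the callable, which is what makes the one-line change above work (keyword arguments would need `functools.partial`). A self-contained sketch with stand-in names:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

_POOL = ThreadPoolExecutor(max_workers=2)

def blocking_save(data: bytes, max_bytes: Optional[int] = None) -> int:
    # Stand-in for the synchronous save; enforces whatever limit it is given
    limit = max_bytes if max_bytes is not None else 50 * 1024 * 1024
    if len(data) > limit:
        raise ValueError("too large")
    return len(data)

async def save_async(data: bytes, max_bytes: Optional[int] = None) -> int:
    loop = asyncio.get_running_loop()
    # Positional args after the callable are forwarded to blocking_save,
    # so max_bytes is no longer silently dropped
    return await loop.run_in_executor(_POOL, blocking_save, data, max_bytes)
```

Exceptions raised in the worker thread propagate through the awaited future, so `DocumentSaveError` would surface to the async caller unchanged.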

🤖 Prompt for AI Agents
In services/file_service.py around lines 39-46 (and also update usages at 58-60
and 179-183), the async wrapper save_upload_file_async currently ignores the
max_bytes argument; update the sync function signature to accept an optional
max_bytes parameter, plumb max_bytes from the async wrapper into the sync call,
and enforce the byte limit inside the sync save (validate file size or stream
read versus max_bytes and raise DocumentSaveError on overflow). Also update all
callsites and tests to pass max_bytes through and ensure logging includes the
filename and reason when rejecting due to size.

Comment on lines +96 to 103
    # Optionally detect mime type (by extension); more accurate detection can be added
    try:
        mime, _ = mimetypes.guess_type(final_path)
    except Exception:
        mime = None

    logger.info("Saved upload %s as %s (%d bytes, mime=%s)", base, final_path, total, mime)
    return final_path

⚠️ Potential issue | 🟠 Major

Enforce allowed MIME types and use detect_mime_type()

ALLOWED_CONTENT_TYPES is unused. Detect MIME and reject disallowed types; clean up the saved file if rejected.

Apply this diff:

-    # Optionally detect mime type (by extension); more accurate detection can be added
-    try:
-        mime, _ = mimetypes.guess_type(final_path)
-    except Exception:
-        mime = None
-
-    logger.info("Saved upload %s as %s (%d bytes, mime=%s)", base, final_path, total, mime)
+    # Detect MIME (with python-magic fallback) and enforce allowed types if configured
+    mime = detect_mime_type(final_path)
+    if ALLOWED_CONTENT_TYPES is not None and mime not in ALLOWED_CONTENT_TYPES:
+        logger.warning("Disallowed content type: %s for %s", mime, base)
+        with contextlib.suppress(Exception):
+            os.remove(final_path)
+        raise DocumentSaveError("Uploaded file type is not allowed.")
+
+    logger.info("Saved upload %s as %s (%d bytes, mime=%s)", base, final_path, total, mime)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    # Optionally detect mime type (by extension); more accurate detection can be added
-    try:
-        mime, _ = mimetypes.guess_type(final_path)
-    except Exception:
-        mime = None
+    # Detect MIME (with python-magic fallback) and enforce allowed types if configured
+    mime = detect_mime_type(final_path)
+    if ALLOWED_CONTENT_TYPES is not None and mime not in ALLOWED_CONTENT_TYPES:
+        logger.warning("Disallowed content type: %s for %s", mime, base)
+        with contextlib.suppress(Exception):
+            os.remove(final_path)
+        raise DocumentSaveError("Uploaded file type is not allowed.")

     logger.info("Saved upload %s as %s (%d bytes, mime=%s)", base, final_path, total, mime)
     return final_path
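The `detect_mime_type` helper referenced here is described in the PR summary as extension-based guessing with an optional python-magic fallback; one plausible shape (an assumption, not the PR's actual code):

```python
import mimetypes
from typing import Optional

def detect_mime_type(path: str) -> Optional[str]:
    # Prefer content-based detection when python-magic is installed
    try:
        import magic  # optional dependency
        return magic.from_file(path, mime=True)
    except Exception:
        pass
    # Fall back to extension-based guessing from the standard library
    mime, _ = mimetypes.guess_type(path)
    return mime
```

Content-based detection matters for enforcement: extension guessing trusts the client-supplied filename, while `magic` inspects the actual bytes.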

Comment on lines +116 to +128
    base = os.path.basename(filename)
    if not base:
        raise DocumentSaveError("Invalid filename.")

    ext = os.path.splitext(base)[1]
    safe_name = f"{uuid.uuid4().hex}{ext}"
    final_path = os.path.abspath(os.path.join(UPLOAD_DIR, safe_name))

    upload_dir_abs = os.path.abspath(UPLOAD_DIR)
    if not final_path.startswith(upload_dir_abs + os.sep):
        logger.error("Unsafe file path detected: %s", final_path)
        raise DocumentSaveError("Unsafe file path.")


⚠️ Potential issue | 🟠 Major

Apply same filename sanitization here as in save_upload_file

Inconsistent checks; mirror the stricter validation to avoid edge-case surprises.

Apply this diff:

-    base = os.path.basename(filename)
-    if not base:
-        raise DocumentSaveError("Invalid filename.")
+    base = os.path.basename(filename)
+    if not base or base in {'.', '..'} or any(c in base for c in '\\/:*?"<>|'):
+        logger.error("Invalid filename for upload: %s", filename)
+        raise DocumentSaveError("Invalid filename.")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    base = os.path.basename(filename)
-    if not base:
-        raise DocumentSaveError("Invalid filename.")
+    base = os.path.basename(filename)
+    if not base or base in {'.', '..'} or any(c in base for c in '\\/:*?"<>|'):
+        logger.error("Invalid filename for upload: %s", filename)
+        raise DocumentSaveError("Invalid filename.")

     ext = os.path.splitext(base)[1]
     safe_name = f"{uuid.uuid4().hex}{ext}"
     final_path = os.path.abspath(os.path.join(UPLOAD_DIR, safe_name))

     upload_dir_abs = os.path.abspath(UPLOAD_DIR)
     if not final_path.startswith(upload_dir_abs + os.sep):
         logger.error("Unsafe file path detected: %s", final_path)
         raise DocumentSaveError("Unsafe file path.")
🤖 Prompt for AI Agents
In services/file_service.py around lines 116 to 128, the current filename
handling builds a UUID-based safe_name but lacks the stricter checks used in
save_upload_file; update this block to reuse or mirror the same
sanitization/validation: validate and normalize the incoming filename (strip
path components, reject names containing os.sep or leading dots), enforce the
allowed/extensions rule and length limits used by save_upload_file, derive the
extension safely (lowercased) and build safe_name from a validated ext, then
compute final_path and re-check that final_path resides inside the absolute
UPLOAD_DIR before writing; if any check fails, log the full context and raise
DocumentSaveError.
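A note on the containment check itself: plain `startswith` needs the `+ os.sep` guard to avoid sibling-prefix matches (e.g. `uploads_evil` matching `uploads`); `os.path.commonpath` expresses the same intent more robustly. A small sketch:

```python
import os

def is_within(directory: str, path: str) -> bool:
    # Normalize both to absolute form before comparing
    directory = os.path.abspath(directory)
    path = os.path.abspath(path)
    # commonpath normalizes separators and rejects sibling-prefix paths
    return os.path.commonpath([directory, path]) == directory
```

`os.path.abspath` also collapses `..` segments, so traversal attempts resolve to their real location before the comparison runs.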

Comment on lines +162 to +176
    try:
        mime, _ = mimetypes.guess_type(final_path)
    except Exception:
        mime = None

    meta = {
        'path': final_path,
        'original_name': base,
        'size': total,
        'sha256': sha256.hexdigest(),
        'mime_type': mime,
        'saved_at': datetime.now(timezone.utc).isoformat(),
    }
    logger.info("Saved upload metadata: %s", {k: meta[k] for k in ('original_name','path','size','mime_type')})
    return meta

⚠️ Potential issue | 🟠 Major

Return accurate MIME and enforce allowed types in metadata save

Use detect_mime_type() and enforce ALLOWED_CONTENT_TYPES with cleanup on violation.

Apply this diff:

-    try:
-        mime, _ = mimetypes.guess_type(final_path)
-    except Exception:
-        mime = None
+    mime = detect_mime_type(final_path)
+    if ALLOWED_CONTENT_TYPES is not None and mime not in ALLOWED_CONTENT_TYPES:
+        logger.warning("Disallowed content type: %s for %s", mime, base)
+        with contextlib.suppress(Exception):
+            os.remove(final_path)
+        raise DocumentSaveError("Uploaded file type is not allowed.")
 
     meta = {
         'path': final_path,
         'original_name': base,
         'size': total,
         'sha256': sha256.hexdigest(),
         'mime_type': mime,
         'saved_at': datetime.now(timezone.utc).isoformat(),
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    try:
-        mime, _ = mimetypes.guess_type(final_path)
-    except Exception:
-        mime = None
+    mime = detect_mime_type(final_path)
+    if ALLOWED_CONTENT_TYPES is not None and mime not in ALLOWED_CONTENT_TYPES:
+        logger.warning("Disallowed content type: %s for %s", mime, base)
+        with contextlib.suppress(Exception):
+            os.remove(final_path)
+        raise DocumentSaveError("Uploaded file type is not allowed.")

     meta = {
         'path': final_path,
         'original_name': base,
         'size': total,
         'sha256': sha256.hexdigest(),
         'mime_type': mime,
         'saved_at': datetime.now(timezone.utc).isoformat(),
     }
     logger.info("Saved upload metadata: %s", {k: meta[k] for k in ('original_name','path','size','mime_type')})
     return meta

Comment on lines +230 to +243
def delete_upload(name_or_path: str) -> bool:
    """Delete an upload by filename or absolute path. Returns True if deleted."""
    if os.path.isabs(name_or_path):
        path = name_or_path
    else:
        path = os.path.join(UPLOAD_DIR, name_or_path)
    try:
        if os.path.exists(path):
            os.remove(path)
            return True
    except Exception:
        logging.getLogger(__name__).exception("Failed to delete upload: %s", path)
    return False


⚠️ Potential issue | 🔴 Critical

Block path traversal: restrict deletes to UPLOAD_DIR and files only

Allowing absolute paths enables deleting arbitrary files. Constrain deletions to files under UPLOAD_DIR.

Apply this diff:

 def delete_upload(name_or_path: str) -> bool:
-    """Delete an upload by filename or absolute path. Returns True if deleted."""
-    if os.path.isabs(name_or_path):
-        path = name_or_path
-    else:
-        path = os.path.join(UPLOAD_DIR, name_or_path)
-    try:
-        if os.path.exists(path):
-            os.remove(path)
-            return True
-    except Exception:
-        logging.getLogger(__name__).exception("Failed to delete upload: %s", path)
-    return False
+    """Delete an upload by filename or absolute path within UPLOAD_DIR. Returns True if deleted."""
+    logger = logging.getLogger(__name__)
+    if os.path.isabs(name_or_path):
+        path = name_or_path
+    else:
+        path = os.path.join(UPLOAD_DIR, name_or_path)
+    try:
+        path = os.path.abspath(path)
+        upload_dir_abs = os.path.abspath(UPLOAD_DIR)
+        # Ensure path is inside UPLOAD_DIR and is a regular file
+        if not path.startswith(upload_dir_abs + os.sep) or not os.path.isfile(path):
+            logger.warning("Refusing to delete path outside UPLOAD_DIR or non-file: %s", path)
+            return False
+        os.remove(path)
+        return True
+    except Exception:
+        logger.exception("Failed to delete upload: %s", path)
+        return False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

 def delete_upload(name_or_path: str) -> bool:
-    """Delete an upload by filename or absolute path. Returns True if deleted."""
-    if os.path.isabs(name_or_path):
-        path = name_or_path
-    else:
-        path = os.path.join(UPLOAD_DIR, name_or_path)
-    try:
-        if os.path.exists(path):
-            os.remove(path)
-            return True
-    except Exception:
-        logging.getLogger(__name__).exception("Failed to delete upload: %s", path)
-    return False
+    """Delete an upload by filename or absolute path within UPLOAD_DIR. Returns True if deleted."""
+    logger = logging.getLogger(__name__)
+    if os.path.isabs(name_or_path):
+        path = name_or_path
+    else:
+        path = os.path.join(UPLOAD_DIR, name_or_path)
+    try:
+        path = os.path.abspath(path)
+        upload_dir_abs = os.path.abspath(UPLOAD_DIR)
+        # Ensure path is inside UPLOAD_DIR and is a regular file
+        if not path.startswith(upload_dir_abs + os.sep) or not os.path.isfile(path):
+            logger.warning("Refusing to delete path outside UPLOAD_DIR or non-file: %s", path)
+            return False
+        os.remove(path)
+        return True
+    except Exception:
+        logger.exception("Failed to delete upload: %s", path)
+        return False
🤖 Prompt for AI Agents
In services/file_service.py around lines 230-243, the delete_upload function
currently accepts absolute paths and can delete files outside UPLOAD_DIR; change
it to resolve the requested path against UPLOAD_DIR only, compute both
UPLOAD_DIR and target via Path.resolve(), ensure the resolved target starts with
the resolved UPLOAD_DIR path (reject if not), verify the target is a regular
file (is_file()) and not a directory or symlink pointing outside, then remove it
and return True; on any check failure return False and keep the existing
exception logging for actual removal errors.
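Following the `Path.resolve()` approach the prompt suggests, a hardened delete might look like this (a sketch with assumed names; `resolve()` follows symlinks before the containment check runs):

```python
from pathlib import Path

def safe_delete(upload_dir: str, name_or_path: str) -> bool:
    root = Path(upload_dir).resolve()
    candidate = Path(name_or_path)
    # Relative names are interpreted inside the upload root
    target = (candidate if candidate.is_absolute() else root / candidate).resolve()
    # Reject anything whose resolved location is not strictly inside the root
    if root not in target.parents:
        return False
    # Only delete regular files, never directories
    if not target.is_file():
        return False
    target.unlink()
    return True
```

Because `resolve()` canonicalizes symlinks and `..` segments first, a link inside the upload directory that points elsewhere fails the `parents` check rather than escaping it.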

@fehranbit fehranbit self-requested a review September 28, 2025 15:08
@fehranbit fehranbit merged commit e5677b6 into main Sep 28, 2025
1 check passed