Skip to content

feat: Implement PDF Report Streaming from Storage#95

Merged
klingonaston merged 1 commit intomainfrom
feat/stream-pdf-reports
Dec 14, 2025
Merged

feat: Implement PDF Report Streaming from Storage#95
klingonaston merged 1 commit intomainfrom
feat/stream-pdf-reports

Conversation

@felixjordandev
Copy link
Collaborator

@felixjordandev felixjordandev commented Dec 14, 2025

Overview

This PR updates the report download functionality to stream pre-generated PDF reports directly from a designated storage directory, rather than generating them on-the-fly from HTML.

Changes

  • Added a new PDF_REPORT_STORAGE_PATH setting to backend/app/core/config.py to specify the location of stored PDF reports (e.g., storage/reports).
  • Modified the get_report_pdf endpoint in backend/app/api/v1/report_download.py to use FileResponse to serve existing PDF files.
  • The endpoint now constructs the PDF file path using the report_id and the new PDF_REPORT_STORAGE_PATH setting.
  • Removed the dependency on weasyprint and tempfile for PDF generation in report_download.py since files are now streamed directly.

Summary by CodeRabbit

  • New Features

    • PDF reports are now cached and served instantly on subsequent requests, improving load times.
  • Chores

    • Updated directory structure for storing generated reports.
    • Removed automatic cleanup of previously generated report files.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Dec 14, 2025

Walkthrough

The changes implement PDF caching for report downloads, storing generated PDFs at a deterministic path based on report_id and serving cached copies without regeneration. A cleanup helper function and related cleanup logic are removed. The reports output directory path is updated from reports_output to storage/reports.

Changes

Cohort / File(s) Summary
PDF caching and cleanup removal
backend/app/api/v1/report_download.py
Removed cleanup_file helper function and post-generation cleanup logic. Added conditional check to serve cached PDFs if they exist. Introduced deterministic pdf_filepath based on report_id and ensured output directory exists before generation. Removed temporary file handling. Added logging to indicate whether cached or newly generated PDF is served.
Configuration path update
backend/app/core/config.py
Updated REPORT_OUTPUT_DIR default path from BASE_DIR / "reports_output" to BASE_DIR / "storage" / "reports".

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Review the caching logic in report_download.py to ensure the conditional check correctly handles both cached and fresh PDF scenarios
  • Verify that directory creation (ensureDir) is robust and the deterministic path calculation is correct
  • Confirm that the path change in config.py aligns with project directory structure and any deployment configurations

Possibly related PRs

  • Feature: Add PDF download endpoint for reports #93: Directly inverse relationship—that PR added cleanup_file helper and background cleanup tasks around PDF generation, while this PR removes that helper/cleanup and replaces temporary file handling with deterministic cached PDF paths.

Poem

🐰 A cache is born, PDFs rest,
No cleanup haunt, just paths that nest,
Storage/reports, organized and spry,
Reports fly fast without a sigh! ✨📄

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: implementing PDF report streaming from storage, which aligns with the removal of on-demand PDF generation and the addition of caching/serving pre-generated PDFs from a designated storage directory.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/stream-pdf-reports

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/api/v1/report_download.py (2)

1-1: Remove unused tempfile import.

The PR description states that dependencies on tempfile were removed, and indeed the code no longer uses temporary files. This import should be removed.

Apply this diff:

-import tempfile
 import os

105-105: Remove unused background_tasks parameter.

The background_tasks parameter is declared but never used in the function. Since PDFs are now cached indefinitely without cleanup, this parameter serves no purpose.

Apply this diff:

 async def get_report_pdf(
     report_id: str,
-    background_tasks: BackgroundTasks,
     db_session: AsyncSession = Depends(get_db),

And remove the unused import:

-from fastapi import APIRouter, Depends, HTTPException, status, BackgroundTasks
+from fastapi import APIRouter, Depends, HTTPException, status
🧹 Nitpick comments (1)
backend/app/api/v1/report_download.py (1)

135-146: No cleanup strategy for accumulated PDFs.

PDFs are cached indefinitely without any cleanup mechanism. Over time, this will consume increasing amounts of disk space, especially if reports are frequently generated or if old reports are no longer needed.

Consider implementing one of the following strategies:

  1. LRU cache with size limits: Remove least recently accessed PDFs when storage reaches a threshold
  2. Time-based expiration: Delete PDFs older than a configured retention period (e.g., 30 days)
  3. On-demand regeneration: Add a query parameter or separate endpoint to force regeneration, allowing clients to refresh stale reports

You might also want to add monitoring/alerting for disk usage on the REPORT_OUTPUT_DIR volume.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4468bb8 and 961aba6.

📒 Files selected for processing (2)
  • backend/app/api/v1/report_download.py (1 hunks)
  • backend/app/core/config.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
backend/app/core/config.py

54-54: Undefined name Path

(F821)

Comment on lines +139 to 157
if pdf_filepath.exists():
logger.info(f"Serving existing PDF report for report_id: {report_id}")
return FileResponse(
path=pdf_filepath,
media_type="application/pdf",
filename=f"report_{report_id}.pdf",
status_code=status.HTTP_200_OK
)

logger.info(f"Generating new PDF report for report_id: {report_id}")
html_content = render_report_html(final_report_json)

pdf_path = None
try:
# Offload blocking PDF generation to a thread pool
loop = asyncio.get_event_loop()
pdf_file = await loop.run_in_executor(
None, # Use default thread pool executor
lambda: tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
)
pdf_path = pdf_file.name
await loop.run_in_executor(
None,
lambda: HTML(string=html_content).write_pdf(pdf_path)
lambda: HTML(string=html_content).write_pdf(pdf_filepath)
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Race condition: concurrent PDF generation for the same report.

When multiple requests arrive simultaneously for a report that doesn't have a cached PDF, all requests will pass the pdf_filepath.exists() check and attempt to generate the PDF concurrently. This can lead to:

  • Corrupted PDF files due to simultaneous writes
  • Wasted compute resources
  • Inconsistent user experience

Consider using file-based locking or an in-memory lock (e.g., using asyncio.Lock stored in a dictionary keyed by report_id) to ensure only one request generates the PDF at a time, while others wait for the generation to complete.

Example approach using asyncio locks:

# At module level
_pdf_generation_locks: dict[str, asyncio.Lock] = {}

# In get_report_pdf function, before checking if PDF exists:
if report_id not in _pdf_generation_locks:
    _pdf_generation_locks[report_id] = asyncio.Lock()

async with _pdf_generation_locks[report_id]:
    # Re-check existence after acquiring lock
    if pdf_filepath.exists():
        logger.info(f"Serving existing PDF report for report_id: {report_id}")
        return FileResponse(...)
    
    # Generate PDF...
🤖 Prompt for AI Agents
In backend/app/api/v1/report_download.py around lines 139 to 157, multiple
concurrent requests can pass the pdf_filepath.exists() check and start PDF
generation simultaneously; add a per-report asyncio lock to serialize
generation: create a module-level dict mapping report_id to asyncio.Lock,
acquire the lock before re-checking pdf_filepath.exists(), return the
FileResponse if rediscovered, otherwise generate the PDF inside the lock (using
the existing executor call), then release the lock and optionally remove the
lock from the dict to avoid growth; this ensures only one coroutine writes the
PDF while others wait and then serve the cached file.


BASE_DIR: Path = Path(__file__).parent.parent.parent
REPORT_OUTPUT_DIR: Path = BASE_DIR / "reports_output"
REPORT_OUTPUT_DIR: Path = BASE_DIR / "storage" / "reports"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: Missing import for Path.

The Path class is used on lines 53-54 but is not imported. This will cause a NameError at runtime when the Settings class is instantiated.

Add the missing import at the top of the file:

 from pydantic_settings import BaseSettings, SettingsConfigDict
 from typing import Dict, List
+from pathlib import Path
🧰 Tools
🪛 Ruff (0.14.8)

54-54: Undefined name Path

(F821)

🤖 Prompt for AI Agents
In backend/app/core/config.py around line 54, the Settings class uses Path
(REPORT_OUTPUT_DIR) but Path is not imported, causing a NameError; add the
missing import by importing Path from the pathlib module at the top of the file
(for example: from pathlib import Path) so the REPORT_OUTPUT_DIR assignment
resolves correctly.

@klingonaston klingonaston merged commit b71ada6 into main Dec 14, 2025
1 check passed
@klingonaston klingonaston deleted the feat/stream-pdf-reports branch December 14, 2025 12:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants