Skip to content

Conversation

@avirajsingh7
Copy link
Collaborator

@avirajsingh7 avirajsingh7 commented Sep 1, 2025

Summary

Partners often deal with scanned PDFs and document formats that are not amenable to AI services like OpenAI's vector stores. This results in poor RAG (Retrieval-Augmented Generation) performance and limits the platform's utility for AI-powered document processing.

This PR implements a foundational document transformation pipeline that enhances the /documents/upload endpoint with on-demand, pluggable document conversion capabilities. Users can now prepare documents for optimal RAG performance directly within the platform by converting documents to LLM-friendly formats.

Key Features Changes

  • Enhanced /documents/upload endpoint with optional target_format and transformerparameter
  • RESTful APIs for job status monitoring and batch queries
  • Transformer Registry System: Pluggable architecture for different document transformers
  • Docker Support: Added Poppler utilities for PDF processing

Checklist

Before submitting a pull request, please ensure that you mark these task.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and test.
  • If you've fixed a bug or added code that is tested and has test cases.

Summary by CodeRabbit

  • New Features

    • Upload documents with optional background transformations (e.g., pdf→markdown); responses include transformation job metadata, signed access URL, and endpoints to fetch single or multiple transformation job statuses.
  • Documentation

    • Upload API docs expanded with modes, supported conversions, and transformer details; README adds Poppler prerequisite.
  • Chores

    • Docker installs poppler-utils; added py-zerox dependency; backend now targets Python 3.11.
  • Tests

    • Extensive unit, integration, and error/retry tests covering registry, service, routes, CRUD, and upload flows.

@avirajsingh7 avirajsingh7 marked this pull request as draft September 1, 2025 08:16
@coderabbitai
Copy link

coderabbitai bot commented Sep 1, 2025

Walkthrough

Adds an asynchronous document transformation subsystem: new DB migrations, models, CRUD, API routes and docs; transformer registry with a Zerox PDF→Markdown transformer; background job service with retries; Docker and pyproject dependency updates (poppler, py-zerox); comprehensive unit and integration tests.

Changes

Cohort / File(s) Summary
Docs
README.md, backend/app/api/docs/documents/upload.md
Document Poppler prerequisite; update upload docs with optional transformation flow, supported mappings, and transformer names.
Container & Dependencies
backend/Dockerfile, backend/pyproject.toml
Install poppler-utils in the Docker image; bump Python requirement to >=3.11 and add py-zerox dependency.
DB Migrations
backend/app/alembic/versions/...b5b9412d3d2a_add_source_document_id_to_document_table.py, backend/app/alembic/versions/...9f8a4af9d6fd_create_doc_transformation_job_table.py
Add nullable document.source_document_id (self-FK) and create doc_transformation_job table with enum status, timestamps, FKs.
Models & Exports
backend/app/models/document.py, backend/app/models/doc_transformation_job.py, backend/app/models/__init__.py
Add source_document_id to Document; new TransformationStatus, DocTransformationJob, DocTransformationJobs, TransformationJobInfo, DocumentUploadResponse; re-export models.
Core Transform Framework
backend/app/core/doctransform/transformer.py, .../registry.py, .../service.py, .../zerox_transformer.py
Add Transformer ABC; registry with mappings/resolution and convert_document; ZeroxTransformer (py-zerox-based PDF→text) and async service implementing start_job/execute_job with retries, S3 download/upload, temp files, and job status updates.
CRUD
backend/app/crud/doc_transformation_job.py, backend/app/crud/__init__.py
New project-scoped DocTransformationJobCrud (create/read_one/read_each/update_status); exported from crud package.
API & Routes
backend/app/api/main.py, backend/app/api/routes/documents.py, backend/app/api/routes/doc_transformation_job.py
Register job router; /upload becomes async, accepts target_format and transformer, schedules background transformation jobs and returns DocumentUploadResponse; add endpoints to fetch single/multiple job statuses.
Logging
backend/app/core/logger.py
Suppress LiteLLM logs by setting its logger level to WARNING.
Tests — API & Upload
backend/app/tests/api/routes/documents/test_route_document_upload.py, backend/app/tests/api/routes/test_doc_transformation_job.py
Update test uploader to include transform params; add tests for uploads with/without transformations, validation errors, and job retrieval (single/multiple, cross-project).
Tests — Core Transform Service
backend/app/tests/core/doctransformer/test_service/*, backend/app/tests/core/doctransformer/test_registry.py, backend/app/tests/core/doctransformer/test_service/utils.py, backend/app/tests/core/doctransformer/test_service/conftest.py
Add fixtures, utilities, and comprehensive tests for start_job, execute_job (success, retries, error handling), registry behavior, and integration flows with S3 (moto).
Tests — CRUD & Utilities
backend/app/tests/crud/test_doc_transformation_job.py, backend/app/tests/utils/document.py
Add CRUD unit tests for job lifecycle and change DocumentMaker to eager project resolution.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant API as Documents API (/upload)
  participant DB as DB (Session)
  participant Storage as Cloud Storage
  participant BG as BackgroundTasks
  participant Svc as Transform Service
  participant Reg as Transformer Registry

  Client->>API: POST /documents/upload (file[, target_format, transformer])
  API->>API: get_file_format(filename)
  API->>Storage: put(file)
  API->>DB: Create Document
  alt target_format provided
    API->>Svc: start_job(db, user, doc_id, transformer, target_format, BG)
    Svc-->>API: job_id
    API-->>Client: DocumentUploadResponse + transformation_job
    note over BG,Svc: Async execution
    BG->>Svc: execute_job(project_id, job_id, transformer, target_format)
    Svc->>DB: Mark job PROCESSING
    Svc->>Storage: stream(source) -> tmp_in
    Svc->>Reg: convert_document(tmp_in, tmp_out, transformer)
    Reg->>Svc: transformed file
    Svc->>Storage: put(transformed)
    Svc->>DB: Create transformed Document (source_document_id link)
    Svc->>DB: Mark job COMPLETED (transformed_document_id)
  else no target_format
    API-->>Client: DocumentUploadResponse (transformation_job = null)
  end
Loading
sequenceDiagram
  autonumber
  actor Client
  participant API as Jobs API (/documents/transformations)
  participant DB as DB (Session)

  Client->>API: GET /{job_id}
  API->>DB: DocTransformationJobCrud.read_one(job_id)
  DB-->>API: Job
  API-->>Client: APIResponse(Job)

  Client->>API: GET /?job_ids=uuid1,uuid2
  API->>API: Parse/validate UUIDs
  API->>DB: read_each({uuid1, uuid2})
  DB-->>API: [Jobs]
  API-->>Client: APIResponse({jobs, jobs_not_found})
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • AkhileshNegi
  • kartpop
  • nishika26

Poem

A rabbit taps keys in the moonlit fog,
Queues a job to turn PDF into log.
Poppler pops, Zerox hums along,
Background hops finish tasks so strong.
Hop—markdown lands, tidy and bright. 🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 76f8757 and e06042f.

📒 Files selected for processing (1)
  • backend/app/api/routes/doc_transformation_job.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/api/routes/doc_transformation_job.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/doc-transformer-main

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/pyproject.toml (1)

6-33: Trim runtime deps; move test/dev-only packages to dev-dependencies.

pytest, pre-commit, moto, openai_responses shouldn’t ship in prod images. Keep them in dev deps and exclude on image build.

 [project]
 dependencies = [
@@
-    "moto[s3]>=5.1.1",
@@
-    "pytest>=7.4.4",
-    "pre-commit>=3.8.0",
-    "openai_responses",
@@
 ]
 
 [tool.uv]
 dev-dependencies = [
     "pytest<8.0.0,>=7.4.3",
     "mypy<2.0.0,>=1.8.0",
     "ruff<1.0.0,>=0.2.2",
     "pre-commit<4.0.0,>=3.6.2",
     "types-passlib<2.0.0.0,>=1.7.7.20240106",
     "coverage<8.0.0,>=7.4.3",
+    "moto[s3]>=5.1.1",
+    "openai_responses",
 ]

Also consider pinning upper bounds for fast-moving libs (openai, boto3) if you need reproducible builds.

backend/app/api/routes/documents.py (1)

95-103: Separate create vs update in DocumentCrud
DocumentCrud.update unconditionally does session.add() on a transient Document, causing an INSERT every time (and duplicate‐key errors when the record already exists). Implement a create() that adds new rows and refactor update() to first load the existing document (e.g. via read_one or session.merge) before applying changes. backend/app/crud/document.py: def update (±lines 98–106)

🧹 Nitpick comments (63)
backend/app/tests/utils/document.py (2)

35-35: Remove stray blank line to avoid churn with Black.

Minor style nit; keeps diffs cleaner.

     def __iter__(self):
         return self
-    

46-51: Verify intentional extension mismatch (.xyz fname vs .txt object key).

fname uses .xyz while object_store_url uses .txt. If not intentional, align them to prevent brittle tests.

Possible fix (choose one):

-            fname=f"{doc_id}.xyz",
+            fname=f"{doc_id}.txt",

or

-        key = f"{self.project.storage_path}/{doc_id}.txt"
+        key = f"{self.project.storage_path}/{doc_id}.xyz"
backend/Dockerfile (1)

10-14: Slimmer image; avoid bringing dev deps.

  • Use --no-install-recommends for apt and drop curl unless required.
  • Exclude dev deps at sync time.
-RUN apt-get update && apt-get install -y \
-    curl \
-    poppler-utils \
- && rm -rf /var/lib/apt/lists/*
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    poppler-utils \
+ && rm -rf /var/lib/apt/lists/*
@@
-RUN --mount=type=cache,target=/root/.cache/uv \
-    uv sync --frozen --no-install-project
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv sync --frozen --no-install-project --no-dev

If curl is needed (healthchecks/debug), keep it; otherwise omit.

Also applies to: 30-31

backend/app/core/doctransform/transformer.py (1)

7-13: Define error contract in the interface.

Document expected exception type (e.g., TransformationError) to standardize handling upstream (registry/service).

     def transform(self, input_path: Path, output_path: Path) -> Path:
         """
         Transform the document at input_path and write the result to output_path.
         Returns the path to the transformed file.
+        
+        Raises:
+            TransformationError: if the transformation fails.
         """
         pass
backend/app/api/docs/documents/upload.md (1)

3-5: Tighten wording and fix minor grammar.

Short, clear phrasing; fix articles.

-- If a target format is specified, a transformation job will also be created to transform document into target format in the background. The response will include both the uploaded document details and information about the transformation job.
+- If a target format is specified, a background transformation job will be created to transform the document into the target format. The response will include the uploaded document details and information about the transformation job.
@@
-  - zerox 
+  - zerox
@@
-Available transformer names and their implementations, default transformer is zerox:
+Available transformer names and their implementations; the default transformer is zerox:

for the structural doc update.

Also applies to: 10-17

backend/app/core/doctransform/zerox_transformer.py (2)

23-31: Stronger result checks and robust write.

Treat empty page contents as failure, and be tolerant to encoding glitches.

-            if result is None or not hasattr(result, "pages") or result.pages is None:
+            if result is None or not hasattr(result, "pages") or result.pages is None:
                 raise RuntimeError("Zerox returned no pages. This may indicate a PDF/image conversion failure (is Poppler installed and in PATH?)")
 
-            with output_path.open("w", encoding="utf-8") as output_file:
+            pages = list(result.pages or [])
+            if not any(getattr(p, "content", None) for p in pages):
+                raise RuntimeError("Zerox returned pages without content. PDF/image conversion may have failed.")
+
+            with output_path.open("w", encoding="utf-8", errors="replace") as output_file:
-                for page in result.pages:
+                for page in pages:
                     if not getattr(page, "content", None):
                         continue    
                     output_file.write(page.content)
                     output_file.write("\n\n")

12-14: Make model configurable.

Default is fine, but allow env/config override to avoid hardcoding OpenAI model in code.

-    def __init__(self, model: str = "gpt-4o"):
-        self.model = model
+    def __init__(self, model: str | None = None):
+        # e.g., read from settings if not provided
+        self.model = model or os.getenv("ZER0X_MODEL", "gpt-4o")

(Remember to import os.)

README.md (1)

14-14: Add install commands for Poppler (per OS) and Docker note
Make it copy-pasteable and clarify Docker images already install it (if true).

Apply:

-- **Poppler** – Install Poppler, required for PDF processing.
+- **Poppler** – required for PDF processing.
+  - macOS: `brew install poppler`
+  - Ubuntu/Debian: `sudo apt-get update && sudo apt-get install -y poppler-utils`
+  - Fedora: `sudo dnf install -y poppler-utils`
+  - Windows (choco): `choco install poppler`
+  - Note: If you use our Docker setup, ensure the image includes Poppler (pdftoppm). Otherwise, add it to the Dockerfile.
backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1)

15-15: Unused import: get_project_by_id
This import isn’t used in this test; remove to keep tests tidy.

-from app.crud import get_project_by_id
backend/app/tests/crud/collections/test_crud_collection_read_all.py (1)

6-6: Unused import: get_project_by_id
Not referenced in this module; drop it.

-from app.crud import CollectionCrud, get_project_by_id
+from app.crud import CollectionCrud
backend/app/tests/api/routes/documents/test_route_document_info.py (1)

4-4: Unused import: get_project_by_id
Remove to avoid lint noise.

-from app.crud import get_project_by_id
backend/app/core/doctransform/test_transformer.py (1)

9-15: Harden write step by ensuring parent dir exists.

Preempt failures when the parent directory hasn’t been created by the caller.

Apply this diff:

     def transform(self, input_path: Path, output_path: Path) -> Path:
-        content = (
+        content = (
             "Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
             "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
         )
-        output_path.write_text(content, encoding='utf-8')
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+        output_path.write_text(content, encoding="utf-8")
         return output_path
backend/app/tests/crud/test_doc_transformation_job.py (2)

37-46: Prefer uuid4() for nonexistent IDs in tests.

SequentialUuidGenerator always starts at 0 when newly instantiated; uuid4() better reflects real IDs and avoids accidental collisions.

-        invalid_id = next(SequentialUuidGenerator())
+        from uuid import uuid4
+        invalid_id = uuid4()

Apply similarly to other tests expecting a non-existent UUID.

Also applies to: 74-83, 208-217


223-231: Clarify error_message retention semantics on state change.

Test asserts error persists when moving FAILED → PROCESSING. If intended, also add a test that explicitly clears the error when an empty string is passed, and one that preserves when None is passed.

-        updated_job = crud.update_status(job.id, TransformationStatus.PROCESSING)
+        # Explicitly clear the error if desired behavior is to reset
+        # updated_job = crud.update_status(job.id, TransformationStatus.PROCESSING, error_message="")
+        updated_job = crud.update_status(job.id, TransformationStatus.PROCESSING)

If the desired behavior is to clear on retry, adjust CRUD accordingly and update this test.

backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (2)

30-33: Add helpful indexes and clean enum teardown.

  • Index by source_document_id and status/created_at for common lookups.
  • Drop the enum type on downgrade to avoid orphaned types (Postgres).
 def upgrade():
@@
-    )
+    )
+    op.create_index('ix_doc_transformation_job_source_document_id', 'doc_transformation_job', ['source_document_id'])
+    op.create_index('ix_doc_transformation_job_status_created_at', 'doc_transformation_job', ['status', 'created_at'])
@@
 def downgrade():
-    # ### commands auto generated by Alembic - please adjust! ###
-    op.drop_table('doc_transformation_job')
+    op.drop_index('ix_doc_transformation_job_status_created_at', table_name='doc_transformation_job')
+    op.drop_index('ix_doc_transformation_job_source_document_id', table_name='doc_transformation_job')
+    op.drop_table('doc_transformation_job')
+    # Best-effort cleanup for Postgres enum
+    try:
+        op.execute("DROP TYPE IF EXISTS transformationstatus")
+    except Exception:
+        pass

Also applies to: 37-40


30-31: Consider explicit ON DELETE behavior on FKs.

  • source_document_id: RESTRICT (prevents deleting source while jobs exist) or CASCADE if you want jobs removed with the source.
  • transformed_document_id: SET NULL is typical if the transformed doc is deleted independently.
-    sa.ForeignKeyConstraint(['source_document_id'], ['document.id'], ),
-    sa.ForeignKeyConstraint(['transformed_document_id'], ['document.id'], ),
+    sa.ForeignKeyConstraint(['source_document_id'], ['document.id'], ondelete='RESTRICT'),
+    sa.ForeignKeyConstraint(['transformed_document_id'], ['document.id'], ondelete='SET NULL'),

Validate with your deletion semantics.

Also applies to: 24-25

backend/app/tests/core/doctransformer/test_service/test_start_job.py (2)

4-4: Replace typing.Tuple with built-in tuple.

Aligns with Ruff UP035 and modern typing.

-from typing import Any, Tuple
+from typing import Any
+from typing import Tuple as _Tuple  # temporary to avoid rename churn

Or update annotations directly:

-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],

Apply across this file.


29-36: Standardize test transformer names to avoid confusion.

Registry test transformer is “test”; in one place you use “test-transformer”. Since execute_job isn’t run here, it won’t fail, but consistency helps readability.

-            transformer_name="test-transformer",
+            transformer_name="test",

Also consider parameterizing formats using TestDataProvider.get_format_test_cases() to avoid drift.

Also applies to: 112-129, 130-156

backend/app/tests/api/routes/test_doc_transformation_job.py (2)

37-38: Assert exact ID, not just non-null.

Stronger check improves signal and guards serialization issues.

-        assert data["data"]["id"] is not None
+        assert data["data"]["id"] == str(created_job.id)

1-1: Pre-commit/Black touched this file. Run hooks locally.

Pipeline shows trailing whitespace/formatting fixes. Please run pre-commit locally to avoid CI churn.

backend/app/models/doc_transformation_job.py (3)

21-25: Make nullability explicit for optional DB fields.

Being explicit helps migrations and DB introspection tools.

-    transformed_document_id: Optional[UUID] = Field(default=None, foreign_key="document.id")
+    transformed_document_id: Optional[UUID] = Field(default=None, foreign_key="document.id", nullable=True)
@@
-    error_message: Optional[str] = Field(default=None)
+    error_message: Optional[str] = Field(default=None, nullable=True)

16-26: Consider indexes for frequent filters/joins.

Given frequent joins on source_document_id and filters by status, add DB indexes (and Alembic migration) for (source_document_id), (status), and optionally (created_at DESC) for recent jobs.


1-1: EOF newline/formatting were auto-fixed by CI.

Keep pre-commit enabled locally to prevent reformat diffs.

backend/app/tests/core/doctransformer/test_service/test_execute_job.py (5)

4-4: Modernize typing: use collections.abc.Callable and builtin tuple.

Aligns with Ruff UP035 and current typing best practices.

-from typing import Any, Callable, Tuple
+from typing import Any
+from collections.abc import Callable
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project], 
@@
-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project], 
@@
-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project], 
@@
-        test_document: Tuple[Document, Project]
+        test_document: tuple[Document, Project]
@@
-        test_document: Tuple[Document, Project]
+        test_document: tuple[Document, Project]

Also applies to: 33-33, 84-84, 110-110, 145-145, 185-185, 218-218


126-126: Avoid catching broad Exception in tests.

Assert the specific failure kinds to reduce flakes and satisfy Ruff B017.

-            with pytest.raises(Exception):
+            with pytest.raises((RetryError, HTTPException)):

225-251: Use expected_content_type by asserting the S3 ContentType.

Prevents unused-var warning (B007) and verifies correct headers are stored.

+from urllib.parse import urlparse
@@
+from app.core.config import settings
@@
         for target_format, expected_content_type, expected_extension in format_extensions:
@@
             transformed_doc = document_crud.read_one(job.transformed_document_id)
             assert transformed_doc is not None
             assert transformed_doc.fname.endswith(expected_extension)
+            # Assert S3 ContentType
+            parsed = urlparse(transformed_doc.object_store_url)
+            key = parsed.path.lstrip("/")
+            head = aws.client.head_object(Bucket=settings.AWS_S3_BUCKET, Key=key)
+            assert head["ContentType"] == expected_content_type

Also applies to: 8-8, 13-13


47-47: Remove redundant db.commit() calls.

DocTransformationJobCrud.create commits already; extra commits slow tests.

-        db.commit()

Also applies to: 121-121, 156-156, 196-196, 231-231


1-1: Pre-commit/Black formatting tweaks were applied by CI.

Run hooks locally to keep CI green and diffs minimal.

backend/app/models/document.py (2)

35-35: Make deleted_at explicitly optional with default

Be explicit to avoid accidental “required field” interpretation in pydantic/SQLModel.

Apply this diff:

-    deleted_at: datetime | None
+    deleted_at: datetime | None = None

36-40: Add ondelete=SET NULL and index for source_document_id in the migration

  • In backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py, change the FK creation to
    op.create_foreign_key(None, 'document', 'document', ['source_document_id'], ['id'], ondelete='SET NULL').
  • Add an index for faster lookups, e.g.
    op.create_index('ix_document_source_document_id', 'document', ['source_document_id']).
backend/app/api/routes/doc_transformation_job.py (1)

36-47: Leverage FastAPI validation for UUID lists (optional)

Accept job_ids as list[UUID] to drop manual parsing and 422 handling.

Example:

-    job_ids: str = Query(..., description="Comma-separated list of transformation job IDs"),
+    job_ids: list[UUID] = Query(..., description="Repeat ?job_ids=<uuid> per id"),
@@
-    job_id_list = []
-    invalid_ids = []
-    for jid in job_ids.split(","):
-        ...
-    if invalid_ids:
-        raise HTTPException(...)
+    job_id_list = job_ids
backend/app/tests/core/doctransformer/test_service/conftest.py (3)

5-5: Modernize typing imports

Use collections.abc for Callable/Generator and built-in tuple syntax.

Apply this diff:

-from typing import Any, Callable, Generator, Tuple
+from typing import Any
+from collections.abc import Callable, Generator

67-71: Use built-in tuple type annotation

Aligns with modern typing and Ruff UP035.

Apply this diff:

-def test_document(db: Session, current_user: UserProjectOrg) -> Tuple[Document, Project]:
+def test_document(db: Session, current_user: UserProjectOrg) -> tuple[Document, Project]:

22-30: Prefer monkeypatch for env vars in tests (optional)

Using pytest’s monkeypatch keeps env scoped to the test run.

backend/app/tests/core/doctransformer/test_service/test_integration.py (6)

4-4: Drop deprecated Tuple import

Use built-in tuple annotations.

Apply this diff:

-from typing import Tuple

25-27: Update annotation to tuple[...]

Apply this diff:

-        test_document: Tuple[Document, Project]
+        test_document: tuple[Document, Project]

82-85: Update annotation to tuple[...]

Apply this diff:

-        test_document: Tuple[Document, Project]
+        test_document: tuple[Document, Project]

120-123: Update annotation to tuple[...]

Apply this diff:

-        test_document: Tuple[Document, Project]
+        test_document: tuple[Document, Project]

93-96: Unused loop variable and redundant commit

Rename i to _ and drop extra commit; create() already commits.

Apply this diff:

-        for i in range(3):
+        for _ in range(3):
             job = job_crud.create(source_document_id=document.id)
             jobs.append(job)
-        db.commit()

154-154: Unused loop index

Rename i to _i or remove enumerate.

Apply this diff:

-        for i, (job, target_format) in enumerate(jobs):
+        for _i, (job, target_format) in enumerate(jobs):
backend/app/core/doctransform/service.py (2)

66-72: Close streaming body after copy (optional)

Avoid leaking connections/file descriptors from storage.stream.

Apply this diff:

-        body = storage.stream(source_doc_object_store_url)
-        tmp_dir = Path(tempfile.mkdtemp())
-        tmp_in = tmp_dir / f"{source_doc_id}"
-        with open(tmp_in, "wb") as f:
-            shutil.copyfileobj(body, f)
+        body = storage.stream(source_doc_object_store_url)
+        tmp_dir = Path(tempfile.mkdtemp())
+        tmp_in = tmp_dir / f"{source_doc_id}"
+        try:
+            with open(tmp_in, "wb") as f:
+                shutil.copyfileobj(body, f)
+        finally:
+            try:
+                body.close()
+            except Exception:
+                pass

76-78: Idempotency and partial-failure risks (optional)

On retry, a new transformed_doc_id is generated; failures after upload but before job status update can orphan objects. Consider deriving object key from job_id and target_format, writing to a temp key then moving/overwriting on completion.

Also applies to: 98-114

backend/app/crud/doc_transformation_job.py (3)

3-4: Tighten imports; modernize typing and drop unused symbols.

  • List is unused; prefer built-in list[...] types.
  • join is imported but never used.

Apply:

-from typing import List, Optional
-from sqlmodel import Session, select, and_, join
+from typing import Optional
+from sqlmodel import Session, select, and_

29-46: Clarify behavior when the source document is soft-deleted.

read_one filters Document.is_deleted.is_(False). If a source doc is soft-deleted after job creation, this method will 404 the job. Is that intentional for readers of job status? If not, drop the is_deleted filter here (or only for internal callers) so jobs remain discoverable.

Example:

-                    Document.project_id == self.project_id,
-                    Document.is_deleted.is_(False)
+                    Document.project_id == self.project_id,

48-63: Guard against empty IN-clause and consider input type.

  • If job_ids is empty, many dialects render IN () which is invalid; return [] early.
  • Consider accepting Iterable[UUID] for flexibility.
-    def read_each(self, job_ids: set[UUID]) -> list[DocTransformationJob]:
+    def read_each(self, job_ids: set[UUID]) -> list[DocTransformationJob]:
+        if not job_ids:
+            return []
         statement = (
             select(DocTransformationJob)
backend/app/tests/api/routes/documents/test_route_document_upload.py (4)

25-40: Close the file handle to avoid descriptor leaks in tests.

Wrap the file open in a context manager so the descriptor is closed even if the request fails.

-    def put(self, route: Route, scratch: Path, target_format: str = None, transformer: str = None):
+    def put(self, route: Route, scratch: Path, target_format: str = None, transformer: str = None):
         (mtype, _) = mimetypes.guess_type(str(scratch))
-        files = {"src": (str(scratch), scratch.open("rb"), mtype)}
-        
-        data = {}
+        data = {}
         if target_format:
             data["target_format"] = target_format
         if transformer:
             data["transformer"] = transformer
-        
-        return self.client.post(
-            str(route),
-            headers={"X-API-KEY": self.user_api_key.key},
-            files=files,
-            data=data,
-        )
+        with scratch.open("rb") as fh:
+            files = {"src": (scratch.name, fh, mtype)}
+            return self.client.post(
+                str(route),
+                headers={"X-API-KEY": self.user_api_key.key},
+                files=files,
+                data=data,
+            )

200-216: Unsupported transformation error path looks good but message is brittle.

If server error messages change, this may become flaky. Consider matching a stable prefix or error code if available.


217-236: Invalid transformer case covered.

Assertions look good, same note about brittle full-string match applies.


261-285: Fix unused variable flagged by Ruff (F841).

response is assigned but unused (Line 277). Either remove the assignment or assert on it to strengthen the test.

Option A — remove assignment:

-        response = httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown"))
+        httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown"))

Option B — use the response:

-        response = httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown"))
+        response = httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown"))
+        assert response.success is True
+        assert response.data["transformation_job"]["job_id"] == mock_job_id
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (6)

5-5: Modernize typing imports (UP035).

Import Callable from collections.abc and prefer built-in generics.

-from typing import Any, Callable, Tuple
+from typing import Any
+from collections.abc import Callable

11-11: Remove unused import.

RetryError is not used.

-from tenacity import RetryError
+# from tenacity import RetryError  # unused

14-14: Remove unused import.

execute_job is not referenced directly.

-from app.core.doctransform.service import execute_job
+# from app.core.doctransform.service import execute_job  # unused

24-63: Avoid blind exception assertions; match expected message.

Strengthen the test by asserting on the error text (Ruff B017).

-            with pytest.raises(Exception):
+            with pytest.raises(Exception, match="S3 upload failed"):
                 fast_execute_job(
                     project_id=project.id,
                     job_id=job.id,
                     transformer_name="test",
                     target_format="markdown"
                 )

Also, update annotations to use built-in tuple:

-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project],

101-139: Assert specific error on exhausted retries.

Don't catch Exception blindly; verify the message.

-            with pytest.raises(Exception):
+            with pytest.raises(Exception, match="Persistent error"):
                 fast_execute_job(
                     project_id=project.id,
                     job_id=job.id,
                     transformer_name="test",
                     target_format="markdown"
                 )

And modernize the annotation:

-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project],

140-179: Also match message for DB error case.

Tighten the assertion to the expected failure text; update tuple typing.

-                with pytest.raises(Exception):
+                with pytest.raises(Exception, match="Database error during document creation"):
                     fast_execute_job(
                         project_id=project.id,
                         job_id=job.id,
                         transformer_name="test",
                         target_format="markdown"
                     )
-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project],
backend/app/api/routes/documents.py (4)

3-3: Clean imports and modernize typing (fix Ruff F811/UP035).

  • Drop duplicate HTTPException import and unused JSONResponse.
  • Use built-in generics instead of typing.List; remove the typing import.
-from typing import List, Optional
+# typing imports not required; use built-in generics

-from fastapi import APIRouter, File, UploadFile, Query, Form, BackgroundTasks, HTTPException
+from fastapi import APIRouter, File, UploadFile, Query, Form, BackgroundTasks, HTTPException
-from fastapi.responses import JSONResponse
-from fastapi import HTTPException

-    response_model=APIResponse[List[DocumentPublic]],
+    response_model=APIResponse[list[DocumentPublic]],

Also applies to: 6-6, 8-9, 32-32


70-77: Guard against no-op transforms (source == target).

Rejects unnecessary jobs early.

     # validate if transformation is possible or not
     if target_format:
+        if source_format == target_format:
+            raise HTTPException(
+                status_code=400,
+                detail=f"Source and target formats are the same ('{source_format}'); no transformation needed."
+            )
         if not is_transformation_supported(source_format, target_format):
             raise HTTPException(
                 status_code=400,
                 detail=f"Transformation from {source_format} to {target_format} is not supported"
             )

114-121: Pass UUID, not str, to TransformationJobInfo.job_id.

The model field is UUID; avoid implicit coercion.

         job_info = TransformationJobInfo(
             message=f"Document accepted for transformation from {source_format} to {target_format}.",
-            job_id=str(job_id),
+            job_id=job_id,
             source_format=source_format,
             target_format=target_format,
             transformer=actual_transformer,
             status_check_url=f"/documents/transformations/{job_id}"
         )

45-49: Return 201 Created for uploads.

Aligns HTTP semantics with resource creation.

 @router.post(
     "/upload",
     description=load_description("documents/upload.md"),
-    response_model=APIResponse[DocumentUploadResponse],
+    response_model=APIResponse[DocumentUploadResponse],
+    status_code=201,
 )
backend/app/core/doctransform/registry.py (2)

2-2: Adopt built-in generics and union syntax (fix Ruff UP035).

Modernize type hints; drop typing imports.

-from typing import Type, Dict, Set, Tuple, Optional
+# Built-in generics and | None are used; no typing imports needed.

-TRANSFORMERS: Dict[str, Type[Transformer]] = {
+TRANSFORMERS: dict[str, type[Transformer]] = {
@@
-SUPPORTED_TRANSFORMATIONS: Dict[Tuple[str, str], Dict[str, str]] = {
+SUPPORTED_TRANSFORMATIONS: dict[tuple[str, str], dict[str, str]] = {
@@
-EXTENSION_TO_FORMAT: Dict[str, str] = {
+EXTENSION_TO_FORMAT: dict[str, str] = {
@@
-FORMAT_TO_EXTENSION: Dict[str, str] = {
+FORMAT_TO_EXTENSION: dict[str, str] = {
@@
-def get_supported_transformations() -> Dict[Tuple[str, str], Set[str]]:
+def get_supported_transformations() -> dict[tuple[str, str], set[str]]:
@@
-def get_available_transformers(source_format: str, target_format: str) -> Dict[str, str]:
+def get_available_transformers(source_format: str, target_format: str) -> dict[str, str]:
@@
-def resolve_transformer(source_format: str, target_format: str, transformer_name: Optional[str] = None) -> str:
+def resolve_transformer(source_format: str, target_format: str, transformer_name: str | None = None) -> str:

Also applies to: 12-16, 19-27, 30-39, 42-49, 59-65, 70-73, 74-79


98-111: Ensure output directory exists before writing.

Prevents failures when output_path parents are missing.

 def convert_document(input_path: Path, output_path: Path, transformer_name: str = "default") -> Path:
@@
-    transformer = transformer_cls()
+    # Make sure output directory exists
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    transformer = transformer_cls()
backend/app/tests/core/doctransformer/test_service/base.py (2)

13-13: Use built-in generics in tests (fix Ruff UP035).

Replace typing.List with list[...] in return annotations; drop typing import.

-from typing import List
+# typing.List not needed; use built-in generics
@@
-    def get_format_test_cases() -> List[tuple]:
+    def get_format_test_cases() -> list[tuple]:
@@
-    def get_content_type_test_cases() -> List[tuple]:
+    def get_content_type_test_cases() -> list[tuple]:
@@
-    def get_test_transformer_names() -> List[str]:
+    def get_test_transformer_names() -> list[str]:

Also applies to: 75-81, 84-91, 93-97


124-128: Silence unused args warnings in persistent failing mock.

Prefix with underscores.

-    def create_persistent_failing_convert_document(error_message: str = "Persistent error"):
+    def create_persistent_failing_convert_document(error_message: str = "Persistent error"):
         """Create a side effect function that always fails."""
-        def persistent_failing_convert_document(*args, **kwargs):
+        def persistent_failing_convert_document(*_args, **_kwargs):
             raise Exception(error_message)
         return persistent_failing_convert_document
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f24f4c5 and b615d5a.

⛔ Files ignored due to path filters (1)
  • backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (35)
  • README.md (1 hunks)
  • backend/Dockerfile (1 hunks)
  • backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (1 hunks)
  • backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py (1 hunks)
  • backend/app/api/docs/documents/upload.md (1 hunks)
  • backend/app/api/main.py (2 hunks)
  • backend/app/api/routes/doc_transformation_job.py (1 hunks)
  • backend/app/api/routes/documents.py (3 hunks)
  • backend/app/core/doctransform/registry.py (1 hunks)
  • backend/app/core/doctransform/service.py (1 hunks)
  • backend/app/core/doctransform/test_transformer.py (1 hunks)
  • backend/app/core/doctransform/transformer.py (1 hunks)
  • backend/app/core/doctransform/zerox_transformer.py (1 hunks)
  • backend/app/crud/__init__.py (1 hunks)
  • backend/app/crud/doc_transformation_job.py (1 hunks)
  • backend/app/models/__init__.py (1 hunks)
  • backend/app/models/doc_transformation_job.py (1 hunks)
  • backend/app/models/document.py (3 hunks)
  • backend/app/tests/api/routes/documents/test_route_document_info.py (1 hunks)
  • backend/app/tests/api/routes/documents/test_route_document_list.py (1 hunks)
  • backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1 hunks)
  • backend/app/tests/api/routes/documents/test_route_document_remove.py (1 hunks)
  • backend/app/tests/api/routes/documents/test_route_document_upload.py (4 hunks)
  • backend/app/tests/api/routes/test_doc_transformation_job.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/base.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/conftest.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_execute_job.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_integration.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_start_job.py (1 hunks)
  • backend/app/tests/crud/collections/test_crud_collection_create.py (1 hunks)
  • backend/app/tests/crud/collections/test_crud_collection_read_all.py (1 hunks)
  • backend/app/tests/crud/test_doc_transformation_job.py (1 hunks)
  • backend/app/tests/utils/document.py (1 hunks)
  • backend/pyproject.toml (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (30)
backend/app/core/doctransform/test_transformer.py (2)
backend/app/core/doctransform/transformer.py (2)
  • Transformer (4-13)
  • transform (8-13)
backend/app/core/doctransform/zerox_transformer.py (1)
  • transform (15-44)
backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1)
backend/app/crud/project.py (1)
  • get_project_by_id (37-38)
backend/app/tests/api/routes/documents/test_route_document_list.py (2)
backend/app/crud/project.py (1)
  • get_project_by_id (37-38)
backend/app/tests/crud/test_project.py (1)
  • test_get_project_by_id (53-60)
backend/app/crud/doc_transformation_job.py (3)
backend/app/crud/document.py (2)
  • DocumentCrud (14-135)
  • __init__ (15-17)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-25)
  • TransformationStatus (9-13)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/tests/utils/document.py (3)
backend/app/models/project.py (1)
  • Project (29-60)
backend/app/crud/project.py (1)
  • get_project_by_id (37-38)
backend/app/tests/utils/utils.py (1)
  • SequentialUuidGenerator (140-153)
backend/app/tests/api/routes/documents/test_route_document_remove.py (2)
backend/app/tests/utils/document.py (1)
  • project (67-68)
backend/app/crud/project.py (1)
  • get_project_by_id (37-38)
backend/app/tests/crud/test_doc_transformation_job.py (5)
backend/app/crud/doc_transformation_job.py (5)
  • DocTransformationJobCrud (13-84)
  • create (18-27)
  • read_one (29-46)
  • read_each (48-62)
  • update_status (64-84)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-25)
  • TransformationStatus (9-13)
backend/app/tests/utils/document.py (4)
  • DocumentStore (55-82)
  • project (67-68)
  • put (70-75)
  • fill (81-82)
backend/app/tests/utils/utils.py (2)
  • get_project (70-89)
  • SequentialUuidGenerator (140-153)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/api/routes/documents/test_route_document_info.py (1)
backend/app/crud/project.py (1)
  • get_project_by_id (37-38)
backend/app/crud/__init__.py (2)
backend/app/crud/doc_transformation_job.py (1)
  • DocTransformationJobCrud (13-84)
backend/app/crud/document.py (1)
  • DocumentCrud (14-135)
backend/app/tests/core/doctransformer/test_service/test_start_job.py (7)
backend/app/core/doctransform/service.py (2)
  • execute_job (42-128)
  • start_job (24-39)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-25)
  • TransformationStatus (9-13)
backend/app/models/user.py (1)
  • UserProjectOrg (65-66)
backend/app/tests/core/doctransformer/test_service/base.py (3)
  • DocTransformTestBase (21-68)
  • TestDataProvider (71-101)
  • get_test_transformer_names (94-96)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/core/doctransformer/test_service/conftest.py (3)
  • current_user (49-57)
  • test_document (67-71)
  • background_tasks (61-63)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (7)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (13-84)
  • create (18-27)
  • read_one (29-46)
backend/app/crud/document.py (1)
  • DocumentCrud (14-135)
backend/app/core/doctransform/registry.py (1)
  • TransformationError (8-9)
backend/app/core/doctransform/service.py (1)
  • execute_job (42-128)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-25)
  • TransformationStatus (9-13)
backend/app/tests/core/doctransformer/test_service/base.py (8)
  • DocTransformTestBase (21-68)
  • TestDataProvider (71-101)
  • get_format_test_cases (75-81)
  • setup_aws_s3 (24-28)
  • get_sample_document_content (99-101)
  • create_s3_document_content (30-46)
  • verify_s3_content (48-68)
  • get_content_type_test_cases (84-91)
backend/app/tests/api/routes/documents/test_route_document_upload.py (4)
backend/app/tests/utils/document.py (5)
  • Route (85-112)
  • _ (143-144)
  • _ (148-149)
  • delete (126-130)
  • httpx_to_standard (22-23)
backend/app/core/cloud/storage.py (5)
  • client (30-42)
  • delete (140-142)
  • delete (244-259)
  • AmazonCloudStorageClient (28-84)
  • create (44-84)
backend/app/tests/conftest.py (2)
  • client (52-55)
  • db (24-41)
backend/app/crud/doc_transformation_job.py (1)
  • create (18-27)
backend/app/tests/crud/collections/test_crud_collection_create.py (1)
backend/app/crud/project.py (1)
  • get_project_by_id (37-38)
backend/app/models/doc_transformation_job.py (2)
backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
  • upgrade (20-36)
backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py (1)
  • upgrade (19-66)
backend/app/tests/core/doctransformer/test_service/conftest.py (6)
backend/app/crud/project.py (1)
  • get_project_by_id (37-38)
backend/app/models/user.py (2)
  • User (48-57)
  • UserProjectOrg (65-66)
backend/app/tests/utils/document.py (4)
  • DocumentStore (55-82)
  • get (120-124)
  • project (67-68)
  • put (70-75)
backend/app/tests/utils/test_data.py (1)
  • create_test_api_key (53-67)
backend/app/core/doctransform/service.py (1)
  • execute_job (42-128)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/core/doctransformer/test_service/test_integration.py (8)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (13-84)
  • read_one (29-46)
  • create (18-27)
backend/app/crud/document.py (1)
  • DocumentCrud (14-135)
backend/app/core/doctransform/service.py (2)
  • execute_job (42-128)
  • start_job (24-39)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-25)
  • TransformationStatus (9-13)
backend/app/tests/core/doctransformer/test_service/base.py (3)
  • DocTransformTestBase (21-68)
  • setup_aws_s3 (24-28)
  • create_s3_document_content (30-46)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/core/doctransformer/test_service/conftest.py (3)
  • test_document (67-71)
  • current_user (49-57)
  • background_tasks (61-63)
backend/app/tests/crud/collections/test_crud_collection_read_all.py (2)
backend/app/crud/project.py (1)
  • get_project_by_id (37-38)
backend/app/tests/crud/test_project.py (1)
  • test_get_project_by_id (53-60)
backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (2)
backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py (2)
  • upgrade (20-24)
  • downgrade (27-31)
backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
  • upgrade (20-36)
backend/app/core/doctransform/transformer.py (2)
backend/app/core/doctransform/test_transformer.py (1)
  • transform (9-15)
backend/app/core/doctransform/zerox_transformer.py (1)
  • transform (15-44)
backend/app/core/doctransform/registry.py (3)
backend/app/core/doctransform/transformer.py (2)
  • Transformer (4-13)
  • transform (8-13)
backend/app/core/doctransform/test_transformer.py (2)
  • TestTransformer (4-15)
  • transform (9-15)
backend/app/core/doctransform/zerox_transformer.py (2)
  • ZeroxTransformer (7-44)
  • transform (15-44)
backend/app/models/__init__.py (2)
backend/app/models/document.py (4)
  • Document (20-40)
  • DocumentPublic (43-61)
  • DocumentUploadResponse (83-87)
  • TransformationJobInfo (64-80)
backend/app/models/doc_transformation_job.py (3)
  • DocTransformationJob (16-25)
  • DocTransformationJobs (28-30)
  • TransformationStatus (9-13)
backend/app/tests/api/routes/test_doc_transformation_job.py (7)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (13-84)
  • create (18-27)
  • update_status (64-84)
backend/app/models/doc_transformation_job.py (1)
  • TransformationStatus (9-13)
backend/app/tests/utils/document.py (5)
  • DocumentStore (55-82)
  • project (67-68)
  • put (70-75)
  • get (120-124)
  • fill (81-82)
backend/app/tests/utils/utils.py (1)
  • get_project (70-89)
backend/app/models/api_key.py (1)
  • APIKeyPublic (23-25)
backend/app/crud/project.py (1)
  • get_project_by_id (37-38)
backend/app/tests/conftest.py (3)
  • db (24-41)
  • user_api_key (89-91)
  • superuser_api_key (83-85)
backend/app/tests/core/doctransformer/test_service/base.py (4)
backend/app/core/cloud/storage.py (1)
  • AmazonCloudStorageClient (28-84)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/models/project.py (1)
  • Project (29-60)
backend/app/core/config.py (1)
  • AWS_S3_BUCKET (76-77)
backend/app/api/routes/documents.py (5)
backend/app/models/document.py (4)
  • Document (20-40)
  • DocumentPublic (43-61)
  • DocumentUploadResponse (83-87)
  • TransformationJobInfo (64-80)
backend/app/utils.py (2)
  • APIResponse (27-48)
  • success_response (34-37)
backend/app/core/cloud/storage.py (2)
  • get_signed_url (135-137)
  • get_signed_url (217-242)
backend/app/core/doctransform/registry.py (4)
  • get_file_format (51-57)
  • is_transformation_supported (66-68)
  • get_available_transformers (70-72)
  • resolve_transformer (74-96)
backend/app/core/doctransform/service.py (1)
  • start_job (24-39)
backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py (2)
backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (1)
  • upgrade (20-34)
backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
  • upgrade (20-36)
backend/app/core/doctransform/zerox_transformer.py (2)
backend/app/core/doctransform/transformer.py (2)
  • Transformer (4-13)
  • transform (8-13)
backend/app/core/doctransform/test_transformer.py (1)
  • transform (9-15)
backend/app/models/document.py (3)
backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
  • upgrade (20-36)
backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py (1)
  • upgrade (19-66)
backend/app/models/document_collection.py (1)
  • DocumentCollection (9-23)
backend/app/core/doctransform/service.py (7)
backend/app/crud/doc_transformation_job.py (4)
  • DocTransformationJobCrud (13-84)
  • create (18-27)
  • update_status (64-84)
  • read_one (29-46)
backend/app/crud/document.py (1)
  • DocumentCrud (14-135)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/models/doc_transformation_job.py (1)
  • TransformationStatus (9-13)
backend/app/core/cloud/storage.py (1)
  • get_cloud_storage (262-272)
backend/app/core/doctransform/registry.py (1)
  • convert_document (98-115)
backend/app/tests/core/doctransformer/test_service/conftest.py (2)
  • current_user (49-57)
  • background_tasks (61-63)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (5)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (13-84)
  • create (18-27)
  • read_one (29-46)
backend/app/core/doctransform/service.py (1)
  • execute_job (42-128)
backend/app/tests/core/doctransformer/test_service/base.py (8)
  • DocTransformTestBase (21-68)
  • MockHelpers (104-128)
  • setup_aws_s3 (24-28)
  • create_s3_document_content (30-46)
  • failing_convert_document (111-120)
  • create_failing_convert_document (108-121)
  • persistent_failing_convert_document (126-127)
  • create_persistent_failing_convert_document (124-128)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/core/doctransformer/test_service/conftest.py (1)
  • fast_execute_job (33-45)
backend/app/api/routes/doc_transformation_job.py (3)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-25)
  • DocTransformationJobs (28-30)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (13-84)
  • read_one (29-46)
  • read_each (48-62)
backend/app/utils.py (2)
  • APIResponse (27-48)
  • success_response (34-37)
🪛 GitHub Actions: AI Platform CI
backend/app/core/doctransform/test_transformer.py

[error] 1-1: Black formatting applied; file reformatted.

backend/app/crud/doc_transformation_job.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/tests/utils/document.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/tests/crud/test_doc_transformation_job.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/api/docs/documents/upload.md

[error] 1-1: End-of-file fixer updated file to ensure trailing newline.


[error] 1-1: Trailing whitespace removed by pre-commit hook.

backend/app/tests/core/doctransformer/test_service/test_start_job.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/tests/core/doctransformer/test_service/test_execute_job.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/tests/api/routes/documents/test_route_document_upload.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/models/doc_transformation_job.py

[error] 1-1: End-of-file fixer updated file to ensure trailing newline.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/tests/core/doctransformer/test_service/conftest.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/tests/core/doctransformer/test_service/test_integration.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py

[error] 1-1: Black formatting applied; file reformatted.

backend/app/core/doctransform/transformer.py

[error] 1-1: Black formatting applied; file reformatted.

backend/app/core/doctransform/registry.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/models/__init__.py

[error] 1-1: Black formatting applied; file reformatted.

backend/app/tests/api/routes/test_doc_transformation_job.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/tests/core/doctransformer/test_service/base.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/api/routes/documents.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py

[error] 1-1: Black formatting applied; file reformatted.

backend/app/core/doctransform/zerox_transformer.py

[error] 1-1: End-of-file fixer updated file to ensure trailing newline.


[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/models/document.py

[error] 1-1: Black formatting applied; file reformatted.

backend/app/core/doctransform/service.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py

[error] 1-1: Trailing whitespace removed by pre-commit hook.


[error] 1-1: Black formatting applied; file reformatted.

backend/app/api/routes/doc_transformation_job.py

[error] 1-1: End-of-file fixer updated file to ensure trailing newline.


[error] 1-1: Black formatting applied; file reformatted.

🪛 Ruff (0.12.2)
backend/app/crud/doc_transformation_job.py

3-3: typing.List is deprecated, use list instead

(UP035)

backend/app/tests/core/doctransformer/test_service/test_start_job.py

4-4: typing.Tuple is deprecated, use tuple instead

(UP035)

backend/app/tests/core/doctransformer/test_service/test_execute_job.py

4-4: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


4-4: typing.Tuple is deprecated, use tuple instead

(UP035)


126-126: Do not assert blind exception: Exception

(B017)


227-227: Loop control variable expected_content_type not used within loop body

Rename unused expected_content_type to _expected_content_type

(B007)

backend/app/tests/api/routes/documents/test_route_document_upload.py

277-277: Local variable response is assigned to but never used

Remove assignment to unused variable response

(F841)

backend/app/models/doc_transformation_job.py

30-30: No newline at end of file

Add trailing newline

(W292)

backend/app/tests/core/doctransformer/test_service/conftest.py

5-5: Import from collections.abc instead: Callable, Generator

Import from collections.abc

(UP035)


5-5: typing.Tuple is deprecated, use tuple instead

(UP035)

backend/app/tests/core/doctransformer/test_service/test_integration.py

4-4: typing.Tuple is deprecated, use tuple instead

(UP035)


93-93: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


154-154: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

backend/app/core/doctransform/registry.py

2-2: typing.Type is deprecated, use type instead

(UP035)


2-2: typing.Dict is deprecated, use dict instead

(UP035)


2-2: typing.Set is deprecated, use set instead

(UP035)


2-2: typing.Tuple is deprecated, use tuple instead

(UP035)

backend/app/tests/core/doctransformer/test_service/base.py

13-13: typing.List is deprecated, use list instead

(UP035)


126-126: Unused function argument: args

(ARG001)


126-126: Unused function argument: kwargs

(ARG001)

backend/app/api/routes/documents.py

3-3: typing.List is deprecated, use list instead

(UP035)


9-9: Redefinition of unused HTTPException from line 6

Remove definition: HTTPException

(F811)

backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py

5-5: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


5-5: typing.Tuple is deprecated, use tuple instead

(UP035)


50-50: Do not assert blind exception: Exception

(B017)


127-127: Do not assert blind exception: Exception

(B017)


167-167: Do not assert blind exception: Exception

(B017)

backend/app/api/routes/doc_transformation_job.py

58-58: No newline at end of file

Add trailing newline

(W292)

🪛 LanguageTool
backend/app/api/docs/documents/upload.md

[grammar] ~4-~4: There might be a mistake here.
Context: ...nsformation job will also be created to transform document into target format in the back...

(QB_NEW_EN)


[grammar] ~4-~4: There might be a mistake here.
Context: ...l also be created to transform document into target format in the background. The re...

(QB_NEW_EN)


[grammar] ~10-~10: Ensure spelling is correct
Context: ...mations are supported: - pdf → markdown - zerox ### Transformers Available transformer name...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

@avirajsingh7 avirajsingh7 changed the title Squash feature/doc-transform Implement document transformation pipeline to improve RAG performance Sep 1, 2025
@avirajsingh7 avirajsingh7 marked this pull request as ready for review September 1, 2025 10:13
@codecov
Copy link

codecov bot commented Sep 1, 2025

Codecov Report

❌ Patch coverage is 97.88136% with 25 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
backend/app/core/doctransform/zerox_transformer.py 28.12% 23 Missing ⚠️
backend/app/core/doctransform/transformer.py 83.33% 1 Missing ⚠️
...pp/tests/core/doctransformer/test_service/utils.py 97.77% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@avirajsingh7 avirajsingh7 self-assigned this Sep 1, 2025
@avirajsingh7 avirajsingh7 linked an issue Sep 1, 2025 that may be closed by this pull request
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/api/docs/documents/upload.md (1)

1-18: Fix grammar, clarify parameters, and separate transformations vs. transformers.

Current text has grammatical issues and mixes a transformer name under “Supported Transformations”.

-Upload a document to the AI platform.
-
-- If only a file is provided, the document will be uploaded and stored, and its ID will be returned.
-- If a target format is specified, a transformation job will also be created to transform document into target format in the background. The response will include both the uploaded document details and information about the transformation job.
-
-### Supported Transformations
-
-The following (source_format → target_format) transformations are supported:
-
-- pdf → markdown
-  - zerox
-
-### Transformers
-
-Available transformer names and their implementations, default transformer is zerox:
-
-- `zerox`
+Upload a document to the AI platform.
+
+- If only a file is provided, the document is uploaded and stored; its ID is returned.
+- If `target_format` is provided, a background transformation job is created to transform the document into the target format. The response includes the uploaded document details and the transformation job info.
+
+### Supported transformations
+
+The following (source_format → target_format) pairs are supported:
+
+- pdf → markdown (via the `zerox` transformer)
+
+### Transformers
+
+Available transformer names (default: `zerox`):
+
+- `zerox`
♻️ Duplicate comments (5)
backend/app/tests/utils/document.py (1)

30-32: Fail fast if project is missing to avoid AttributeError in next.

get_project_by_id returns Optional[Project]; without a guard, Line 40/45 can crash on None. Assert early to keep the fixture deterministic.

         self.project: Project = get_project_by_id(
             session=self.session, project_id=self.project_id
         )
+        if self.project is None:
+            raise AssertionError(
+                f"Test setup error: Project id={self.project_id} not found"
+            )
backend/app/core/doctransform/zerox_transformer.py (1)

1-1: Replace asyncio.Runner with thread-safe execution strategy.

Runner (or asyncio.run) inside an active event loop can raise RuntimeError in FastAPI contexts. Offload the coroutine to a dedicated thread when a loop is running; otherwise run normally.

-from asyncio import Runner
+import asyncio
+import concurrent.futures

Verification (search for remaining Runner usages and call-sites):

#!/bin/bash
rg -nP -C2 '\bRunner\(' backend
rg -nP -C3 'ZeroxTransformer|transform\(' backend/app/core/doctransform
backend/app/crud/doc_transformation_job.py (1)

69-91: Make update_status robust to source-doc soft-deletion.

Calling self.read_one applies Document.is_deleted == False, so status updates can 404 after the source doc is soft-deleted. Query by job_id + project scope without the deletion predicate to allow finalization (COMPLETED/FAILED). This was flagged earlier.

     def update_status(
         self,
         job_id: UUID,
         status: TransformationStatus,
         *,
         error_message: Optional[str] = None,
         transformed_document_id: Optional[UUID] = None,
     ) -> DocTransformationJob:
-        job = self.read_one(job_id)
+        statement = (
+            select(DocTransformationJob)
+            .join(
+                Document, DocTransformationJob.source_document_id == Document.id
+            )
+            .where(
+                and_(
+                    DocTransformationJob.id == job_id,
+                    Document.project_id == self.project_id,
+                )
+            )
+        )
+        job = self.session.exec(statement).one_or_none()
+        if job is None:
+            logger.warning(
+                f"[DocTransformationJobCrud.update_status] Job not found | id: {job_id}, project_id: {self.project_id}"
+            )
+            raise HTTPException(status_code=404, detail="Transformation job not found")
         job.status = status
         job.updated_at = now()
         if error_message is not None:
             job.error_message = error_message
         if transformed_document_id is not None:
             job.transformed_document_id = transformed_document_id
backend/app/core/doctransform/service.py (1)

60-67: Minor: PROCESSING update depends on soft-delete-sensitive read.

If the source doc gets soft-deleted between create and run, update_status may 404 (see CRUD comment). After applying the CRUD fix, this call will be safe.

backend/app/api/routes/documents.py (1)

83-106: Predeclare actual_transformer for type-checkers/readability.

Avoid potential "possibly-unbound" warnings.

@@
-    # validate if transformation is possible or not
+    # validate if transformation is possible or not
+    actual_transformer: str | None = None
     if target_format:
@@
-        try:
-            actual_transformer = resolve_transformer(
+        try:
+            actual_transformer = resolve_transformer(
                 source_format, target_format, transformer
             )
🧹 Nitpick comments (31)
backend/app/models/doc_transformation_job.py (2)

16-27: Index FKs and status; define deletion semantics.

  • Add DB indexes on source_document_id, transformed_document_id, and status; these will be hot query predicates.
  • Consider FK ondelete: SET NULL for transformed_document_id and RESTRICT (or CASCADE) for source_document_id to avoid orphaned rows or unexpected deletions. Requires a migration.
-    source_document_id: UUID = Field(foreign_key="document.id")
-    transformed_document_id: Optional[UUID] = Field(
-        default=None, foreign_key="document.id"
-    )
-    status: TransformationStatus = Field(default=TransformationStatus.PENDING)
+    source_document_id: UUID = Field(foreign_key="document.id", index=True)
+    transformed_document_id: Optional[UUID] = Field(
+        default=None, foreign_key="document.id", index=True
+    )
+    status: TransformationStatus = Field(default=TransformationStatus.PENDING, index=True)

If you align FK ondelete behavior, add an Alembic migration to alter constraints accordingly.


30-32: Provide defaults for response container lists.

Avoid forcing callers to always supply both fields; safe defaults improve ergonomics.

-class DocTransformationJobs(SQLModel):
-    jobs: list[DocTransformationJob]
-    jobs_not_found: list[UUID]
+class DocTransformationJobs(SQLModel):
+    jobs: list[DocTransformationJob] = Field(default_factory=list)
+    jobs_not_found: list[UUID] = Field(default_factory=list)
backend/app/core/doctransform/test_transformer.py (1)

10-16: Ensure output directory exists before writing.

Prevents failures when the parent directory is missing.

     def transform(self, input_path: Path, output_path: Path) -> Path:
+        output_path.parent.mkdir(parents=True, exist_ok=True)
         content = (
             "Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
             "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
         )
         output_path.write_text(content, encoding="utf-8")
         return output_path
backend/app/tests/core/doctransformer/test_service/test_start_job.py (1)

4-4: Switch to builtin tuple and drop unused typing imports (py3.12).

Use builtin tuple[...] and remove unused Any/Tuple to satisfy Ruff UP035 and keep types modern.

-from typing import Any, Tuple
+# no typing imports needed

@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],

Also applies to: 33-33, 88-88, 116-116, 146-146

backend/app/tests/core/doctransformer/test_service/test_execute_job.py (3)

4-4: Modernize typing: use collections.abc and builtin tuple.

Address Ruff UP035 and keep annotations idiomatic.

-from typing import Any, Callable, Tuple
+from typing import Any
+from collections.abc import Callable
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        self, db: Session, test_document: Tuple[Document, Project]
+        self, db: Session, test_document: tuple[Document, Project]
@@
-        self, db: Session, test_document: Tuple[Document, Project]
+        self, db: Session, test_document: tuple[Document, Project]

Also applies to: 36-36, 87-87, 114-114, 149-149, 189-189, 220-220


129-136: Avoid blind exception assertions.

Be explicit to prevent false positives under different environments/retry paths.

-            with pytest.raises(Exception):
+            with pytest.raises((RetryError, HTTPException)):
                 fast_execute_job(
                     project_id=project.id,
                     job_id=job.id,
                     transformer_name="test",
                     target_format="markdown",
                 )

228-233: Rename unused loop variable to underscore to satisfy linters.

expected_content_type isn’t used; keep intent clear.

-        for (
-            target_format,
-            expected_content_type,
-            expected_extension,
-        ) in format_extensions:
+        for (
+            target_format,
+            _expected_content_type,
+            expected_extension,
+        ) in format_extensions:
backend/app/tests/core/doctransformer/test_service/conftest.py (1)

5-5: Modernize typing imports and tuple annotation.

Use collections.abc for Callable/Generator and builtin tuple.

-from typing import Any, Callable, Generator, Tuple
+from typing import Any
+from collections.abc import Callable, Generator
@@
-) -> Tuple[Document, Project]:
+) -> tuple[Document, Project]:

Also applies to: 73-80

backend/app/tests/core/doctransformer/test_service/test_integration.py (3)

4-4: Use builtin tuple for annotations (py3.12).

-from typing import Tuple
+# no typing Tuple needed
@@
-        self, db: Session, test_document: Tuple[Document, Project]
+        self, db: Session, test_document: tuple[Document, Project]
@@
-        self, db: Session, test_document: Tuple[Document, Project]
+        self, db: Session, test_document: tuple[Document, Project]
@@
-        self, db: Session, test_document: Tuple[Document, Project]
+        self, db: Session, test_document: tuple[Document, Project]

Also applies to: 30-30, 85-85, 122-122


95-98: Rename unused loop variable.

i isn’t used; use underscore to quiet linters.

-        for i in range(3):
+        for _ in range(3):
             job = job_crud.create(source_document_id=document.id)
             jobs.append(job)

154-161: Drop enumerate since index isn’t used.

-        for i, (job, target_format) in enumerate(jobs):
+        for job, target_format in jobs:
             db.refresh(job)
             assert job.status == TransformationStatus.COMPLETED
             assert job.transformed_document_id is not None

             transformed_doc = document_crud.read_one(job.transformed_document_id)
backend/app/crud/doc_transformation_job.py (1)

3-4: Drop deprecated/unused imports.

  • UP035: avoid typing.List; you're already using list[...].
  • join is imported but unused.

Apply:

-from typing import List, Optional
-from sqlmodel import Session, select, and_, join
+from typing import Optional
+from sqlmodel import Session, select, and_
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (5)

5-6: Modernize typing and remove unused import.

  • Use collections.abc.Callable and built-in tuple[...].
  • Drop unused execute_job import.
-from typing import Any, Callable, Tuple
+from typing import Any
+from collections.abc import Callable
@@
-from app.core.doctransform.service import execute_job
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],

Also applies to: 14-14, 30-31, 74-75, 116-117, 160-161


40-41: Remove redundant commits in tests.

create() already commits; extra db.commit()s are noise.

-        db.commit()

Also applies to: 84-85, 126-127, 170-171


55-62: Assert specific error to avoid blind exception.

Tighten the expectation with a message match.

-            with pytest.raises(Exception):
+            with pytest.raises(Exception, match="S3 upload failed"):
                 fast_execute_job(

142-149: Tighten raised error assertion (persistent failure).

-            with pytest.raises(Exception):
+            with pytest.raises(Exception, match="Persistent error"):
                 fast_execute_job(

188-195: Tighten raised error assertion (DB error).

-                with pytest.raises(Exception):
+                with pytest.raises(Exception, match="Database error during document creation"):
                     fast_execute_job(
backend/app/tests/api/routes/documents/test_route_document_upload.py (3)

24-46: Safer default for content type when mimetype is unknown.

mimetypes.guess_type may return None; provide a fallback.

-        (mtype, _) = mimetypes.guess_type(str(scratch))
+        (mtype, _) = mimetypes.guess_type(str(scratch))
+        if not mtype:
+            mtype = "application/octet-stream"

56-72: Write PDF fixture in binary mode to avoid platform-dependent newline/encoding issues.

Text mode can corrupt the PDF header on some platforms.

-@pytest.fixture
-def pdf_scratch():
-    # Create a test PDF file for transformation tests
-    with NamedTemporaryFile(mode="w", suffix=".pdf", delete=False) as fp:
-        fp.write("%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj\n")
-        fp.write("2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>endobj\n")
-        fp.write("3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]>>endobj\n")
-        fp.write("xref\n0 4\n0000000000 65535 f \n0000000009 00000 n \n")
-        fp.write(
-            "0000000074 00000 n \n0000000120 00000 n \ntrailer<</Size 4/Root 1 0 R>>\n"
-        )
-        fp.write("startxref\n202\n%%EOF")
-        fp.flush()
-        yield Path(fp.name)
-        # Clean up the temporary file
-        Path(fp.name).unlink()
+@pytest.fixture
+def pdf_scratch():
+    # Create a test PDF file for transformation tests
+    with NamedTemporaryFile(mode="wb", suffix=".pdf", delete=False) as fp:
+        fp.write(b"%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj\n")
+        fp.write(b"2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>endobj\n")
+        fp.write(b"3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]>>endobj\n")
+        fp.write(b"xref\n0 4\n0000000000 65535 f \n0000000009 00000 n \n")
+        fp.write(
+            b"0000000074 00000 n \n0000000120 00000 n \ntrailer<</Size 4/Root 1 0 R>>\n"
+        )
+        fp.write(b"startxref\n202\n%%EOF")
+        fp.flush()
+        yield Path(fp.name)
+    # Clean up the temporary file
+    Path(fp.name).unlink(missing_ok=True)

295-297: Fix unused variable.

response isn’t used; avoid the assignment to satisfy Ruff (F841).

-        response = httpx_to_standard(
-            uploader.put(route, pdf_scratch, target_format="markdown")
-        )
+        uploader.put(route, pdf_scratch, target_format="markdown")
backend/app/core/doctransform/service.py (1)

16-16: Remove unused import.

User isn’t used.

-from app.models import User
backend/app/api/routes/documents.py (5)

68-75: Fix target_format docs; add normalization for common aliases (md/txt).

Description suggests "txt", but the registry uses "text" (and "markdown"). Normalize aliases to avoid 400s and sync the docs.

-        description="Desired output format for the uploaded document (e.g., pdf, docx, txt). ",
+        description="Desired output format for the uploaded document (e.g., pdf, docx, markdown, text). ",

Add normalization before validation:

@@
-    if target_format:
+    if target_format:
+        alias_map = {"md": "markdown", "txt": "text", "htm": "html"}
+        target_format = alias_map.get(target_format.lower(), target_format.lower())
         if not is_transformation_supported(source_format, target_format):

120-129: Guard on explicit None check.

Be explicit to avoid truthiness confusion if transformer names ever become falsy.

-    if target_format and actual_transformer:
+    if target_format and actual_transformer is not None:

130-137: Pass UUID directly; avoid string coercion.

Model expects UUID; Pydantic will coerce, but keeping it UUID is cleaner.

-            job_id=str(job_id),
+            job_id=job_id,

136-137: Avoid hard-coded URL; use router.url_path_for.

Prevents drift if the route prefix changes.

Example:

status_check_url = router.url_path_for("get_transformation_job", job_id=str(job_id))

(Adjust handler name as appropriate.)


44-55: Ensure response_model matches payload types for /list.

You return ORM models; response_model expects DocumentPublic. Consider mapping for consistent schema.

 def list_docs(
@@
-    data = crud.read_many(skip, limit)
-    return APIResponse.success_response(data)
+    docs = crud.read_many(skip, limit)
+    data = [DocumentPublic.model_validate(d, from_attributes=True) for d in docs]
+    return APIResponse.success_response(data)
backend/app/core/doctransform/registry.py (3)

2-2: Use builtin generics; address Ruff UP035.

Switch from typing.Dict/Set/Tuple/Type to builtin dict/set/tuple/type.

-from typing import Type, Dict, Set, Tuple, Optional
+from typing import Optional

And update annotations:

-TRANSFORMERS: Dict[str, Type[Transformer]] = {
+TRANSFORMERS: dict[str, type[Transformer]] = {
@@
-SUPPORTED_TRANSFORMATIONS: Dict[Tuple[str, str], Dict[str, str]] = {
+SUPPORTED_TRANSFORMATIONS: dict[tuple[str, str], dict[str, str]] = {
@@
-EXTENSION_TO_FORMAT: Dict[str, str] = {
+EXTENSION_TO_FORMAT: dict[str, str] = {
@@
-FORMAT_TO_EXTENSION: Dict[str, str] = {
+FORMAT_TO_EXTENSION: dict[str, str] = {
@@
-def get_supported_transformations() -> Dict[Tuple[str, str], Set[str]]:
+def get_supported_transformations() -> dict[tuple[str, str], set[str]]:

71-74: Normalize format names (aliases) within the registry.

Handle md/txt/htm gracefully across helpers to reduce API churn.

@@
+def _normalize_format(name: str) -> str:
+    aliases = {"md": "markdown", "txt": "text", "htm": "html"}
+    return aliases.get(name.lower(), name.lower())
@@
-def is_transformation_supported(source_format: str, target_format: str) -> bool:
+def is_transformation_supported(source_format: str, target_format: str) -> bool:
     """Check if a transformation from source_format to target_format is supported."""
-    return (source_format, target_format) in SUPPORTED_TRANSFORMATIONS
+    sf, tf = _normalize_format(source_format), _normalize_format(target_format)
+    return (sf, tf) in SUPPORTED_TRANSFORMATIONS
@@
-def get_available_transformers(
-    source_format: str, target_format: str
-) -> Dict[str, str]:
+def get_available_transformers(
+    source_format: str, target_format: str
+) -> dict[str, str]:
     """Get available transformers for a specific transformation."""
-    return SUPPORTED_TRANSFORMATIONS.get((source_format, target_format), {})
+    sf, tf = _normalize_format(source_format), _normalize_format(target_format)
+    return SUPPORTED_TRANSFORMATIONS.get((sf, tf), {})
@@
-def resolve_transformer(
-    source_format: str, target_format: str, transformer_name: Optional[str] = None
-) -> str:
+def resolve_transformer(
+    source_format: str, target_format: str, transformer_name: Optional[str] = None
+) -> str:
@@
-    available_transformers = get_available_transformers(source_format, target_format)
+    available_transformers = get_available_transformers(source_format, target_format)

Also applies to: 76-81, 83-108


14-18: Gate test-only transformer in non-test environments.

Avoid exposing "test" in production registries.

Option: register "test" only if settings.ENVIRONMENT in {"test","dev"}.

Also applies to: 21-29

backend/app/tests/core/doctransformer/test_service/base.py (2)

13-13: Adopt builtin generics; remove typing.List.

Modernize annotations per Ruff UP035.

-from typing import List
+# no need for typing.List on Python 3.10+
@@
-    def get_format_test_cases() -> List[tuple]:
+    def get_format_test_cases() -> list[tuple]:
@@
-    def get_content_type_test_cases() -> List[tuple]:
+    def get_content_type_test_cases() -> list[tuple]:
@@
-    def get_test_transformer_names() -> List[str]:
+    def get_test_transformer_names() -> list[str]:

Also applies to: 69-76, 78-86, 88-96


120-129: Silence unused-args warning in persistent failing mock.

Name args with underscores.

-        def persistent_failing_convert_document(*args, **kwargs):
+        def persistent_failing_convert_document(*_args, **_kwargs):
             raise Exception(error_message)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b615d5a and af8ab26.

📒 Files selected for processing (24)
  • backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (1 hunks)
  • backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py (1 hunks)
  • backend/app/api/docs/documents/upload.md (1 hunks)
  • backend/app/api/routes/doc_transformation_job.py (1 hunks)
  • backend/app/api/routes/documents.py (4 hunks)
  • backend/app/core/doctransform/registry.py (1 hunks)
  • backend/app/core/doctransform/service.py (1 hunks)
  • backend/app/core/doctransform/test_transformer.py (1 hunks)
  • backend/app/core/doctransform/transformer.py (1 hunks)
  • backend/app/core/doctransform/zerox_transformer.py (1 hunks)
  • backend/app/crud/doc_transformation_job.py (1 hunks)
  • backend/app/models/__init__.py (1 hunks)
  • backend/app/models/doc_transformation_job.py (1 hunks)
  • backend/app/models/document.py (3 hunks)
  • backend/app/tests/api/routes/documents/test_route_document_upload.py (4 hunks)
  • backend/app/tests/api/routes/test_doc_transformation_job.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/base.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/conftest.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_execute_job.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_integration.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_start_job.py (1 hunks)
  • backend/app/tests/crud/test_doc_transformation_job.py (1 hunks)
  • backend/app/tests/utils/document.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py
  • backend/app/tests/api/routes/test_doc_transformation_job.py
  • backend/app/tests/crud/test_doc_transformation_job.py
  • backend/app/core/doctransform/transformer.py
  • backend/app/models/document.py
  • backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py
  • backend/app/api/routes/doc_transformation_job.py
🧰 Additional context used
🧬 Code graph analysis (16)
backend/app/tests/utils/document.py (3)
backend/app/models/project.py (1)
  • Project (29-60)
backend/app/crud/project.py (1)
  • get_project_by_id (37-38)
backend/app/tests/crud/documents/test_crud_document_update.py (1)
  • documents (12-15)
backend/app/models/__init__.py (2)
backend/app/models/document.py (4)
  • Document (20-40)
  • DocumentPublic (43-60)
  • DocumentUploadResponse (74-76)
  • TransformationJobInfo (63-71)
backend/app/models/doc_transformation_job.py (3)
  • DocTransformationJob (16-27)
  • DocTransformationJobs (30-32)
  • TransformationStatus (9-13)
backend/app/core/doctransform/test_transformer.py (2)
backend/app/core/doctransform/transformer.py (2)
  • Transformer (5-14)
  • transform (9-14)
backend/app/core/doctransform/zerox_transformer.py (1)
  • transform (18-56)
backend/app/api/routes/documents.py (4)
backend/app/core/cloud/storage.py (3)
  • get_cloud_storage (262-272)
  • get_signed_url (135-137)
  • get_signed_url (217-242)
backend/app/core/doctransform/registry.py (4)
  • get_available_transformers (76-80)
  • get_file_format (54-60)
  • is_transformation_supported (71-73)
  • resolve_transformer (83-107)
backend/app/models/document.py (4)
  • Document (20-40)
  • DocumentPublic (43-60)
  • DocumentUploadResponse (74-76)
  • TransformationJobInfo (63-71)
backend/app/core/doctransform/service.py (1)
  • start_job (25-44)
backend/app/tests/api/routes/documents/test_route_document_upload.py (4)
backend/app/tests/utils/document.py (5)
  • Route (82-109)
  • _ (140-141)
  • _ (145-146)
  • delete (123-127)
  • httpx_to_standard (22-23)
backend/app/core/cloud/storage.py (5)
  • client (30-42)
  • delete (140-142)
  • delete (244-259)
  • AmazonCloudStorageClient (28-84)
  • create (44-84)
backend/app/tests/conftest.py (3)
  • client (52-55)
  • user_api_key (89-91)
  • db (24-41)
backend/app/crud/doc_transformation_job.py (1)
  • create (19-30)
backend/app/models/doc_transformation_job.py (1)
backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py (1)
  • upgrade (19-66)
backend/app/core/doctransform/zerox_transformer.py (2)
backend/app/core/doctransform/transformer.py (2)
  • Transformer (5-14)
  • transform (9-14)
backend/app/core/doctransform/test_transformer.py (1)
  • transform (10-16)
backend/app/core/doctransform/service.py (5)
backend/app/crud/doc_transformation_job.py (4)
  • DocTransformationJobCrud (14-91)
  • create (19-30)
  • update_status (69-91)
  • read_one (32-51)
backend/app/crud/document.py (1)
  • DocumentCrud (14-135)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/core/cloud/storage.py (1)
  • get_cloud_storage (262-272)
backend/app/core/doctransform/registry.py (1)
  • convert_document (110-131)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (5)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (14-91)
  • create (19-30)
  • read_one (32-51)
backend/app/core/doctransform/service.py (1)
  • execute_job (48-150)
backend/app/tests/core/doctransformer/test_service/base.py (8)
  • DocTransformTestBase (21-63)
  • MockHelpers (99-129)
  • setup_aws_s3 (24-28)
  • create_s3_document_content (30-42)
  • failing_convert_document (107-116)
  • create_failing_convert_document (103-118)
  • persistent_failing_convert_document (126-127)
  • create_persistent_failing_convert_document (121-129)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/core/doctransformer/test_service/conftest.py (1)
  • fast_execute_job (33-51)
backend/app/core/doctransform/registry.py (3)
backend/app/core/doctransform/transformer.py (2)
  • Transformer (5-14)
  • transform (9-14)
backend/app/core/doctransform/test_transformer.py (2)
  • TestTransformer (5-16)
  • transform (10-16)
backend/app/core/doctransform/zerox_transformer.py (2)
  • ZeroxTransformer (10-56)
  • transform (18-56)
backend/app/crud/doc_transformation_job.py (3)
backend/app/crud/document.py (2)
  • DocumentCrud (14-135)
  • __init__ (15-17)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-27)
  • TransformationStatus (9-13)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/tests/core/doctransformer/test_service/base.py (4)
backend/app/core/cloud/storage.py (1)
  • AmazonCloudStorageClient (28-84)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/models/project.py (1)
  • Project (29-60)
backend/app/core/config.py (1)
  • AWS_S3_BUCKET (76-77)
backend/app/tests/core/doctransformer/test_service/conftest.py (6)
backend/app/crud/project.py (1)
  • get_project_by_id (37-38)
backend/app/models/user.py (2)
  • User (48-57)
  • UserProjectOrg (65-66)
backend/app/tests/utils/document.py (4)
  • DocumentStore (52-79)
  • get (117-121)
  • project (64-65)
  • put (67-72)
backend/app/tests/utils/test_data.py (1)
  • create_test_api_key (53-67)
backend/app/core/doctransform/service.py (1)
  • execute_job (48-150)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (8)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (14-91)
  • create (19-30)
  • read_one (32-51)
backend/app/crud/document.py (1)
  • DocumentCrud (14-135)
backend/app/core/doctransform/registry.py (1)
  • TransformationError (9-10)
backend/app/core/doctransform/service.py (1)
  • execute_job (48-150)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-27)
  • TransformationStatus (9-13)
backend/app/tests/core/doctransformer/test_service/base.py (8)
  • DocTransformTestBase (21-63)
  • TestDataProvider (66-96)
  • get_format_test_cases (70-76)
  • setup_aws_s3 (24-28)
  • get_sample_document_content (94-96)
  • create_s3_document_content (30-42)
  • verify_s3_content (44-63)
  • get_content_type_test_cases (79-86)
backend/app/tests/core/doctransformer/test_service/conftest.py (2)
  • test_document (73-79)
  • fast_execute_job (33-51)
backend/app/tests/core/doctransformer/test_service/test_integration.py (8)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (14-91)
  • read_one (32-51)
  • create (19-30)
backend/app/crud/document.py (1)
  • DocumentCrud (14-135)
backend/app/core/doctransform/service.py (2)
  • execute_job (48-150)
  • start_job (25-44)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-27)
  • TransformationStatus (9-13)
backend/app/models/user.py (1)
  • UserProjectOrg (65-66)
backend/app/tests/core/doctransformer/test_service/base.py (3)
  • DocTransformTestBase (21-63)
  • setup_aws_s3 (24-28)
  • create_s3_document_content (30-42)
backend/app/tests/core/doctransformer/test_service/conftest.py (3)
  • test_document (73-79)
  • current_user (55-63)
  • background_tasks (67-69)
backend/app/tests/core/doctransformer/test_service/test_start_job.py (7)
backend/app/core/doctransform/service.py (2)
  • execute_job (48-150)
  • start_job (25-44)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-27)
  • TransformationStatus (9-13)
backend/app/models/user.py (1)
  • UserProjectOrg (65-66)
backend/app/tests/core/doctransformer/test_service/base.py (2)
  • TestDataProvider (66-96)
  • get_test_transformer_names (89-91)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/core/doctransformer/test_service/conftest.py (3)
  • current_user (55-63)
  • test_document (73-79)
  • background_tasks (67-69)
🪛 Ruff (0.12.2)
backend/app/tests/api/routes/documents/test_route_document_upload.py

295-295: Local variable response is assigned to but never used

Remove assignment to unused variable response

(F841)

backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py

5-5: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


5-5: typing.Tuple is deprecated, use tuple instead

(UP035)


55-55: Do not assert blind exception: Exception

(B017)


142-142: Do not assert blind exception: Exception

(B017)


188-188: Do not assert blind exception: Exception

(B017)

backend/app/core/doctransform/registry.py

2-2: typing.Type is deprecated, use type instead

(UP035)


2-2: typing.Dict is deprecated, use dict instead

(UP035)


2-2: typing.Set is deprecated, use set instead

(UP035)


2-2: typing.Tuple is deprecated, use tuple instead

(UP035)

backend/app/crud/doc_transformation_job.py

3-3: typing.List is deprecated, use list instead

(UP035)

backend/app/tests/core/doctransformer/test_service/base.py

13-13: typing.List is deprecated, use list instead

(UP035)


126-126: Unused function argument: args

(ARG001)


126-126: Unused function argument: kwargs

(ARG001)

backend/app/tests/core/doctransformer/test_service/conftest.py

5-5: Import from collections.abc instead: Callable, Generator

Import from collections.abc

(UP035)


5-5: typing.Tuple is deprecated, use tuple instead

(UP035)

backend/app/tests/core/doctransformer/test_service/test_execute_job.py

4-4: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


4-4: typing.Tuple is deprecated, use tuple instead

(UP035)


129-129: Do not assert blind exception: Exception

(B017)


230-230: Loop control variable expected_content_type not used within loop body

Rename unused expected_content_type to _expected_content_type

(B007)

backend/app/tests/core/doctransformer/test_service/test_integration.py

4-4: typing.Tuple is deprecated, use tuple instead

(UP035)


95-95: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


154-154: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

backend/app/tests/core/doctransformer/test_service/test_start_job.py

4-4: typing.Tuple is deprecated, use tuple instead

(UP035)

🪛 LanguageTool
backend/app/api/docs/documents/upload.md

[grammar] ~4-~4: There might be a mistake here.
Context: ...nsformation job will also be created to transform document into target format in the back...

(QB_NEW_EN)


[grammar] ~4-~4: There might be a mistake here.
Context: ...l also be created to transform document into target format in the background. The re...

(QB_NEW_EN)


[grammar] ~10-~10: Ensure spelling is correct
Context: ...mations are supported: - pdf → markdown - zerox ### Transformers Available transformer name...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (4)
backend/app/tests/core/doctransformer/test_service/test_start_job.py (2)

54-61: Nice: asserts validate BackgroundTasks wiring.

Verifying func and args for the scheduled task gives strong coupling between API and worker entrypoint; looks good.


11-13: No HTTPException inconsistencies detected. All raised exceptions use the same fastapi.HTTPException class (including via the alias in app.core.exception_handlers).

backend/app/models/__init__.py (1)

5-15: Re-exports look good.

Consolidating model exports here simplifies imports in routes and tests.

backend/app/api/routes/documents.py (1)

118-119: Remove suggestion to use createDocumentCrud has no create method and update() calls session.add(), so it correctly handles inserting new documents.

Likely an incorrect or invalid review comment.

@avirajsingh7 avirajsingh7 added the enhancement New feature or request label Sep 1, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/tests/core/doctransformer/test_registry.py (1)

1-95: Organize imports and use the fixture return value to satisfy linting
CI is failing due to an unsorted import block and unused fixture arguments. In backend/app/tests/core/doctransformer/test_registry.py:

  • Run isort --apply (or ruff --fix) to sort/format the import block.
  • In each test that accepts patched_transformations, assign it to a dummy variable (e.g. mapping = patched_transformations) before calling the API to avoid ARG001.
  • Keep two blank lines between top-level defs for Black.
🧹 Nitpick comments (9)
backend/app/core/logger.py (3)

24-26: Harden LiteLLM log suppression: cover correct logger name(s) and make level configurable.

Many installs use logger name "litellm" (lowercase). Apply to both and allow override via env to avoid hardcoding WARNING.

-# Suppress info logs from LiteLLM
-logging.getLogger("LiteLLM").setLevel(logging.WARNING)
+# Suppress LiteLLM logs (supports both logger name casings)
+_litellm_level_name = os.getenv("LITELLM_LOG_LEVEL", "WARNING").upper()
+_litellm_level = getattr(logging, _litellm_level_name, logging.WARNING)
+for _name in ("litellm", "LiteLLM"):
+    logging.getLogger(_name).setLevel(_litellm_level)

41-46: Specify encoding and delay file creation on rotating file handler.

Prevents Unicode issues on non-UTF-8 locales and avoids touching disk until first write.

-file_handler = RotatingFileHandler(
-    LOG_FILE_PATH, maxBytes=10 * 1024 * 1024, backupCount=5
-)
+file_handler = RotatingFileHandler(
+    LOG_FILE_PATH,
+    maxBytes=10 * 1024 * 1024,
+    backupCount=5,
+    encoding="utf-8",
+    delay=True,
+)

27-39: Avoid duplicate handlers under uvicorn --reload.

Without guarding, reload can attach multiple handlers and duplicate log lines.

-# Stream handler (console)
-stream_handler = logging.StreamHandler()
-stream_handler.setFormatter(formatter)
-stream_handler.addFilter(CorrelationIdFilter())
-logger.addHandler(stream_handler)
+if not any(isinstance(h, logging.StreamHandler) for h in logger.handlers):
+    stream_handler = logging.StreamHandler()
+    stream_handler.setFormatter(formatter)
+    stream_handler.addFilter(CorrelationIdFilter())
+    logger.addHandler(stream_handler)
 
-# Rotating file handler
-file_handler = RotatingFileHandler(
-    LOG_FILE_PATH, maxBytes=10 * 1024 * 1024, backupCount=5
-)
-file_handler.setFormatter(formatter)
-file_handler.addFilter(CorrelationIdFilter())
-logger.addHandler(file_handler)
+if not any(isinstance(h, RotatingFileHandler) for h in logger.handlers):
+    file_handler = RotatingFileHandler(
+        LOG_FILE_PATH, maxBytes=10 * 1024 * 1024, backupCount=5, encoding="utf-8", delay=True
+    )
+    file_handler.setFormatter(formatter)
+    file_handler.addFilter(CorrelationIdFilter())
+    logger.addHandler(file_handler)

Also applies to: 41-46

backend/app/tests/core/doctransformer/test_registry.py (6)

25-30: Parametrize to reduce repetition in format tests

Consolidates assertions and makes it easy to extend.

-def test_get_file_format_valid():
-    assert get_file_format("file.pdf") == "pdf"
-    assert get_file_format("file.docx") == "docx"
-    assert get_file_format("file.md") == "markdown"
-    assert get_file_format("file.html") == "html"
+@pytest.mark.parametrize(
+    "filename,expected",
+    [
+        ("file.pdf", "pdf"),
+        ("file.docx", "docx"),
+        ("file.md", "markdown"),
+        ("file.html", "html"),
+    ],
+)
+def test_get_file_format_valid(filename, expected):
+    assert get_file_format(filename) == expected

32-35: Broaden negative coverage and assert error message

Covers no-extension case and checks the error message substring.

-def test_get_file_format_invalid():
-    with pytest.raises(ValueError):
-        get_file_format("file.unknown")
+def test_get_file_format_invalid():
+    for name in ("file.unknown", "file"):
+        with pytest.raises(ValueError, match="Unsupported file extension"):
+            get_file_format(name)

37-43: Silence Ruff ARG001: mark unused fixture param with underscore

The fixture is used for setup; the argument itself isn’t used.

-def test_get_supported_transformations(patched_transformations):
+def test_get_supported_transformations(_patched_transformations):

54-59: Silence Ruff ARG001 here as well

Rename the unused fixture arg.

-def test_get_available_transformers(patched_transformations):
+def test_get_available_transformers(_patched_transformations):

61-68: Tighten assertions and silence Ruff ARG001

Match error substrings to ensure correct failure mode and rename unused arg.

-def test_resolve_transformer(patched_transformations):
+def test_resolve_transformer(_patched_transformations):
     assert resolve_transformer("docx", "pdf") == "pandoc"
     assert resolve_transformer("docx", "pdf", "pandoc") == "pandoc"
-    with pytest.raises(ValueError):
+    with pytest.raises(ValueError, match="not available"):
         resolve_transformer("docx", "pdf", "notfound")
-    with pytest.raises(ValueError):
+    with pytest.raises(ValueError, match="is not supported"):
         resolve_transformer("pdf", "docx")

70-95: Minor: also assert the returned path equals output_file

Keeps the contract explicit in addition to content check.

     result = convert_document(input_file, output_file, transformer_name="dummy")
     assert result.read_text() == "transformed"
+    assert result == output_file
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between af8ab26 and 194ce2d.

📒 Files selected for processing (2)
  • backend/app/core/logger.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_registry.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/tests/core/doctransformer/test_registry.py (1)
backend/app/core/doctransform/registry.py (7)
  • get_file_format (54-60)
  • get_supported_transformations (63-68)
  • is_transformation_supported (71-73)
  • get_available_transformers (76-80)
  • resolve_transformer (83-107)
  • convert_document (110-131)
  • TransformationError (9-10)
🪛 Ruff (0.12.2)
backend/app/tests/core/doctransformer/test_registry.py

37-37: Unused function argument: patched_transformations

(ARG001)


54-54: Unused function argument: patched_transformations

(ARG001)


61-61: Unused function argument: patched_transformations

(ARG001)

🪛 GitHub Actions: AI Platform CI
backend/app/tests/core/doctransformer/test_registry.py

[error] 1-1: Black formatting hook reformatted the file during pre-commit run --all-files (hook: black); exit code 1.

🔇 Additional comments (2)
backend/app/tests/core/doctransformer/test_registry.py (2)

15-23: Fixture patching looks good

Correct use of monkeypatch to stub SUPPORTED_TRANSFORMATIONS for tests.


45-52: LGTM: clear positive/negative path check

Accurately validates presence/absence in SUPPORTED_TRANSFORMATIONS.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (19)
backend/app/tests/core/doctransformer/test_registry.py (4)

27-32: Parametrize positive cases to reduce repetition and add case-insensitivity coverage.

Parametrization makes failures clearer and adds value by checking upper-case extensions.

Apply:

- def test_get_file_format_valid():
-     assert get_file_format("file.pdf") == "pdf"
-     assert get_file_format("file.docx") == "docx"
-     assert get_file_format("file.md") == "markdown"
-     assert get_file_format("file.html") == "html"
+import pytest
+
+@pytest.mark.parametrize(
+    "name,expected",
+    [
+        ("file.pdf", "pdf"),
+        ("file.PDF", "pdf"),
+        ("file.docx", "docx"),
+        ("file.md", "markdown"),
+        ("file.html", "html"),
+    ],
+)
+def test_get_file_format_valid(name, expected):
+    assert get_file_format(name) == expected

34-37: Add edge cases for files without a usable extension.

Catches common user inputs like “README” or “file.”.

Apply:

-def test_get_file_format_invalid():
-    with pytest.raises(ValueError):
-        get_file_format("file.unknown")
+@pytest.mark.parametrize("name", ["file.unknown", "file", "file."])
+def test_get_file_format_invalid(name):
+    with pytest.raises(ValueError):
+        get_file_format(name)

39-39: Silence Ruff ARG001 for fixture-only parameters.

The fixture is intentionally unused in-body; annotate to pass lint without adding no-op assertions.

Apply:

-def test_get_supported_transformations(patched_transformations):
+def test_get_supported_transformations(patched_transformations):  # noqa: ARG001
@@
-def test_get_available_transformers(patched_transformations):
+def test_get_available_transformers(patched_transformations):  # noqa: ARG001
@@
-def test_resolve_transformer(patched_transformations):
+def test_resolve_transformer(patched_transformations):  # noqa: ARG001

Also applies to: 56-56, 63-63


72-96: Strengthen convert_document assertions and cover default/missing-default flows.

Verify the function returns the exact output path and add tests for default resolution and absence of default.

Apply:

@@
     input_file.write_text("test")
     result = convert_document(input_file, output_file, transformer_name="dummy")
-    assert result.read_text() == "transformed"
+    assert result == output_file
+    assert result.read_text() == "transformed"

Add the following tests after this block:

def test_convert_document_uses_default_transformer(tmp_path, monkeypatch):
    class DummyTransformer:
        def transform(self, input_path, output_path):
            output_path.write_text("transformed")
            return output_path

    # Wire default -> dummy for docx->pdf and register transformer
    monkeypatch.setitem(TRANSFORMERS, "dummy", DummyTransformer)
    monkeypatch.setattr(
        "app.core.doctransform.registry.SUPPORTED_TRANSFORMATIONS",
        {("docx", "pdf"): {"default": "dummy", "dummy": "dummy"}},
    )

    src = tmp_path / "in.docx"
    dst = tmp_path / "out.pdf"
    src.write_text("x")
    result = convert_document(src, dst)
    assert result == dst
    assert result.read_text() == "transformed"


def test_convert_document_missing_default_raises(tmp_path, monkeypatch):
    # Pair exists but lacks a default entry
    monkeypatch.setattr(
        "app.core.doctransform.registry.SUPPORTED_TRANSFORMATIONS",
        {("docx", "pdf"): {}},
    )
    src = tmp_path / "in.docx"
    dst = tmp_path / "out.pdf"
    src.write_text("x")
    with pytest.raises(ValueError):
        convert_document(src, dst)
backend/app/tests/core/doctransformer/test_service/test_integration.py (4)

4-4: Prefer built-in tuple over typing.Tuple.

PEP 585 style avoids deprecation warnings and keeps typing modern.

-from typing import Tuple
+from __future__ import annotations
-    def test_execute_job_end_to_end_workflow(
-        self, db: Session, test_document: Tuple[Document, Project]
+    def test_execute_job_end_to_end_workflow(
+        self, db: Session, test_document: tuple[Document, Project]
     ) -> None:
-    def test_execute_job_concurrent_jobs(
-        self, db: Session, test_document: Tuple[Document, Project]
+    def test_execute_job_concurrent_jobs(
+        self, db: Session, test_document: tuple[Document, Project]
     ) -> None:
-    def test_multiple_format_transformations(
-        self, db: Session, test_document: Tuple[Document, Project]
+    def test_multiple_format_transformations(
+        self, db: Session, test_document: tuple[Document, Project]
     ) -> None:

Also applies to: 30-31, 85-87, 122-124


95-98: Silence unused loop index.

Index isn’t used; switch to underscore.

-        for i in range(3):
+        for _ in range(3):
             job = job_crud.create(source_document_id=document.id)
             jobs.append(job)

154-156: Remove unused enumerate index.

-        for i, (job, target_format) in enumerate(jobs):
+        for job, target_format in jobs:
             with patch("app.core.doctransform.service.Session") as mock_session_class:

84-87: Test name suggests concurrency but executes sequentially.

Either rename or actually run in parallel to validate concurrency.

-def test_execute_job_concurrent_jobs(
+def test_execute_job_multiple_jobs(
     self, db: Session, test_document: tuple[Document, Project]
 ) -> None:
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (3)

4-4: Modernize typing: use collections.abc.Callable and built-in tuple.

-from typing import Any, Callable, Tuple
+from typing import Any
+from collections.abc import Callable
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
-        self, db: Session, test_document: Tuple[Document, Project]
+        self, db: Session, test_document: tuple[Document, Project]
-        self, db: Session, test_document: Tuple[Document, Project]
+        self, db: Session, test_document: tuple[Document, Project]

Also applies to: 37-40, 88-90, 114-116, 149-151, 189-191, 220-222


130-136: Avoid blind Exception assertion; assert expected types.

Aligns with Ruff B017 and makes failures clearer.

-            with pytest.raises(Exception):
+            with pytest.raises((HTTPException, RetryError)):
                 fast_execute_job(
                     project_id=project.id,
                     job_id=job.id,
                     transformer_name="test",
                     target_format="markdown",
                 )

8-17: You compute expected content-types; assert them from S3.

Currently expected_content_type is unused. Validate via S3 HeadObject.

 from moto import mock_aws
 from sqlmodel import Session
 from tenacity import RetryError
 
 from app.crud import DocTransformationJobCrud, DocumentCrud
 from app.core.doctransform.registry import TransformationError
 from app.core.doctransform.service import execute_job
 from app.core.exception_handlers import HTTPException
 from app.models import Document, Project, TransformationStatus
 from app.tests.core.doctransformer.test_service.utils import DocTransformTestBase
+from urllib.parse import urlparse
+from app.core.config import settings
             transformed_doc = document_crud.read_one(job.transformed_document_id)
             assert transformed_doc is not None
             assert transformed_doc.fname.endswith(expected_extension)
+            # Assert S3 content-type
+            parsed = urlparse(transformed_doc.object_store_url)
+            key = parsed.path.lstrip("/")
+            head = aws.client.head_object(Bucket=settings.AWS_S3_BUCKET, Key=key)
+            assert head["ContentType"] == expected_content_type

Also applies to: 254-262

backend/app/tests/core/doctransformer/test_service/test_start_job.py (1)

4-4: Prefer built-in tuple over typing.Tuple.

-from typing import Tuple
+from __future__ import annotations
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],

Also applies to: 30-33, 85-88, 112-115, 141-145

backend/app/tests/core/doctransformer/test_service/utils.py (3)

33-34: Derive S3 bucket from the document URL instead of settings.

Reduces coupling and makes tests resilient to non-default buckets.

-        aws.client.put_object(Bucket=settings.AWS_S3_BUCKET, Key=s3_key, Body=content)
+        bucket = parsed_url.netloc or settings.AWS_S3_BUCKET
+        aws.client.put_object(Bucket=bucket, Key=s3_key, Body=content)
-        response = aws.client.get_object(
-            Bucket=settings.AWS_S3_BUCKET, Key=transformed_key
-        )
+        bucket = parsed_url.netloc or settings.AWS_S3_BUCKET
+        response = aws.client.get_object(Bucket=bucket, Key=transformed_key)

Also applies to: 49-53


79-81: Rename unused args/kwargs to underscore to satisfy linters.

-    def persistent_failing_convert_document(*args, **kwargs):
+    def persistent_failing_convert_document(*_args, **_kwargs):
         raise Exception(error_message)

60-67: Optional: raise a domain-specific error for clarity.

If convenient, raise a TransformationError (or a test-local subclass) instead of a bare Exception to make assertions more precise.

backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (4)

5-6: Modernize typing and tighten fast_execute_job signature.

-from typing import Any, Callable, Tuple
+from typing import Any
+from collections.abc import Callable
+from uuid import UUID
-        fast_execute_job: Callable[[int, Any, str, str], Any],
+        fast_execute_job: Callable[[int, UUID, str, str], Any],
-        fast_execute_job: Callable[[int, Any, str, str], Any],
+        fast_execute_job: Callable[[int, UUID, str, str], Any],
-        fast_execute_job: Callable[[int, Any, str, str], Any],
+        fast_execute_job: Callable[[int, UUID, str, str], Any],
-        fast_execute_job: Callable[[int, Any, str, str], Any],
+        fast_execute_job: Callable[[int, UUID, str, str], Any],

Also applies to: 30-32, 74-76, 114-116, 158-160


8-11: Use specific exception types for retry exhaustion.

Tenacity should raise RetryError when retries are exhausted.

-from moto import mock_aws
-from sqlmodel import Session
+from moto import mock_aws
+from sqlmodel import Session
+from tenacity import RetryError
-            with pytest.raises(Exception):
+            with pytest.raises(RetryError):
                 fast_execute_job(
                     project_id=project.id,
                     job_id=job.id,
                     transformer_name="test",
                     target_format="markdown",
                 )

Also applies to: 139-146


54-61: If exact exception type varies, assert on message and silence B017.

Keep the behavioral check but avoid linter noise.

-            with pytest.raises(Exception):
+            with pytest.raises(Exception) as exc_info:  # noqa: B017
                 fast_execute_job(
                     project_id=project.id,
                     job_id=job.id,
                     transformer_name="test",
                     target_format="markdown",
                 )
+            assert "S3 upload failed" in str(exc_info.value)

185-192: Same here: assert details or switch to a specific type.

-                with pytest.raises(Exception):
+                with pytest.raises(Exception) as exc_info:  # noqa: B017
                     fast_execute_job(
                         project_id=project.id,
                         job_id=job.id,
                         transformer_name="test",
                         target_format="markdown",
                     )
+                assert "Database error during document creation" in str(exc_info.value)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 194ce2d and de4a89d.

📒 Files selected for processing (6)
  • backend/app/tests/core/doctransformer/test_registry.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_execute_job.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_integration.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_start_job.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
backend/app/tests/core/doctransformer/test_service/utils.py (2)
backend/app/tests/api/routes/documents/test_route_document_upload.py (2)
  • TestDocumentRouteUpload (64-103)
  • test_adds_to_S3 (82-103)
backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (2)
  • TestDocumentRoutePermanentRemove (43-99)
  • aws_credentials (33-38)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (2)
backend/app/tests/api/routes/documents/test_route_document_upload.py (2)
  • TestDocumentRouteUpload (64-103)
  • test_adds_to_S3 (82-103)
backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1)
  • TestDocumentRoutePermanentRemove (43-99)
backend/app/tests/core/doctransformer/test_service/test_integration.py (2)
backend/app/tests/api/routes/documents/test_route_document_upload.py (2)
  • TestDocumentRouteUpload (64-103)
  • test_adds_to_S3 (82-103)
backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1)
  • TestDocumentRoutePermanentRemove (43-99)
backend/app/tests/core/doctransformer/test_service/test_start_job.py (2)
backend/app/tests/api/routes/documents/test_route_document_upload.py (1)
  • TestDocumentRouteUpload (64-103)
backend/app/tests/api/routes/documents/test_route_document_remove.py (1)
  • TestDocumentRouteRemove (23-88)
🪛 Ruff (0.12.2)
backend/app/tests/core/doctransformer/test_registry.py

39-39: Unused function argument: patched_transformations

(ARG001)


56-56: Unused function argument: patched_transformations

(ARG001)


63-63: Unused function argument: patched_transformations

(ARG001)

backend/app/tests/core/doctransformer/test_service/utils.py

79-79: Unused function argument: args

(ARG001)


79-79: Unused function argument: kwargs

(ARG001)

backend/app/tests/core/doctransformer/test_service/test_execute_job.py

4-4: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


4-4: typing.Tuple is deprecated, use tuple instead

(UP035)


130-130: Do not assert blind exception: Exception

(B017)


236-236: Loop control variable expected_content_type not used within loop body

Rename unused expected_content_type to _expected_content_type

(B007)

backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py

5-5: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


5-5: typing.Tuple is deprecated, use tuple instead

(UP035)


54-54: Do not assert blind exception: Exception

(B017)


139-139: Do not assert blind exception: Exception

(B017)


185-185: Do not assert blind exception: Exception

(B017)

backend/app/tests/core/doctransformer/test_service/test_integration.py

4-4: typing.Tuple is deprecated, use tuple instead

(UP035)


95-95: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


154-154: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

backend/app/tests/core/doctransformer/test_service/test_start_job.py

4-4: typing.Tuple is deprecated, use tuple instead

(UP035)

🔇 Additional comments (1)
backend/app/tests/core/doctransformer/test_registry.py (1)

1-11: LGTM: import surface matches the public registry API.

Imports are precise and scoped to the functions/types under test.

Copy link
Collaborator

@kartpop kartpop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving this, but please take a loot at the comments first before merging


try:
with Runner() as runner:
result = runner.run(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if we should have a timeout or error handling here as this will be a long running process

@avirajsingh7 I am not entirely sure how to do it with runner, can we explore a little? coderabbit seems to have a solution in the above comment, maybe worth exploring


def upgrade():
# ### commands auto generated by Alembic - please adjust! ###
op.add_column("document", sa.Column("source_document_id", sa.Uuid(), nullable=True))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the "source document id" column in this table and doc transformation table are different, like in the context of doc transformation table, I understand what the source document id could mean, but what is the role of this column in this table

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So first you upload doc to s3 and than you will use that doc for transformation, so this id of document you uploaded to s3 and from which the transformed document got created.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still a little confused

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/core/doctransform/registry.py (1)

108-130: Plumb transformer parameters through the registry (aligns with PR’s “transformerparameter”).

Currently, there’s no way to pass per-job transformer params (e.g., model). Accept kwargs and forward to the transformer constructor.

-def convert_document(
-    input_path: Path, output_path: Path, transformer_name: str = "default"
-) -> Path:
+def convert_document(
+    input_path: Path,
+    output_path: Path,
+    transformer_name: str = "default",
+    transformer_kwargs: Optional[dict] = None,
+) -> Path:
@@
-    transformer = transformer_cls()
+    transformer = transformer_cls(**(transformer_kwargs or {}))

Follow-up: Thread these transformer_kwargs from the API/service layer (e.g., persist in job payload and supply here).

🧹 Nitpick comments (31)
backend/app/core/doctransform/registry.py (2)

2-2: Modernize type hints to built-in generics (ruff UP035).

Use type, dict, set, and tuple to silence UP035 and keep code forward-compatible.

- from typing import Type, Dict, Set, Tuple, Optional
+ from typing import Optional
@@
-TRANSFORMERS: Dict[str, Type[Transformer]] = {
+TRANSFORMERS: dict[str, type[Transformer]] = {
@@
-SUPPORTED_TRANSFORMATIONS: Dict[Tuple[str, str], Dict[str, str]] = {
+SUPPORTED_TRANSFORMATIONS: dict[tuple[str, str], dict[str, str]] = {
@@
-EXTENSION_TO_FORMAT: Dict[str, str] = {
+EXTENSION_TO_FORMAT: dict[str, str] = {
@@
-FORMAT_TO_EXTENSION: Dict[str, str] = {
+FORMAT_TO_EXTENSION: dict[str, str] = {
@@
-def get_supported_transformations() -> Dict[Tuple[str, str], Set[str]]:
+def get_supported_transformations() -> dict[tuple[str, str], set[str]]:
@@
-def get_available_transformers(
+def get_available_transformers(
     source_format: str, target_format: str
-) -> Dict[str, str]:
+) -> dict[str, str]:

Also applies to: 13-16, 18-27, 29-39, 41-49, 61-67, 74-79


18-27: Add pdf→text/html to SUPPORTED_TRANSFORMATIONS to match service/tests.

Service allows "text" and "html" (extensions and content types handled), and tests exercise these formats. Add entries so API-level validation doesn’t reject valid requests.

 SUPPORTED_TRANSFORMATIONS: dict[tuple[str, str], dict[str, str]] = {
     ("pdf", "markdown"): {
         "default": "zerox",
         "zerox": "zerox",
     },
+    ("pdf", "text"): {
+        "default": "zerox",
+        "zerox": "zerox",
+    },
+    ("pdf", "html"): {
+        "default": "zerox",
+        "zerox": "zerox",
+    },
     # Future transformations can be added here
     # ("docx", "markdown"): {"default": "pandoc", "pandoc": "pandoc"},
     # ("html", "markdown"): {"default": "pandoc", "pandoc": "pandoc"},
 }
backend/app/tests/core/doctransformer/test_service/test_integration.py (4)

4-4: Use built-in generics (ruff UP035).

Replace typing.Tuple with tuple and drop the import.

-from typing import Tuple
@@
-    def test_execute_job_end_to_end_workflow(
-        self, db: Session, test_document: Tuple[Document, Project]
+    def test_execute_job_end_to_end_workflow(
+        self, db: Session, test_document: tuple[Document, Project]
@@
-    def test_execute_job_concurrent_jobs(
-        self, db: Session, test_document: Tuple[Document, Project]
+    def test_execute_job_concurrent_jobs(
+        self, db: Session, test_document: tuple[Document, Project]
@@
-    def test_multiple_format_transformations(
-        self, db: Session, test_document: Tuple[Document, Project]
+    def test_multiple_format_transformations(
+        self, db: Session, test_document: tuple[Document, Project]

Also applies to: 33-35, 92-94, 133-136


102-104: Rename unused loop variable (ruff B007).

-        for i in range(3):
+        for _ in range(3):
             job = job_crud.create(source_document_id=document.id)
             jobs.append(job)

83-88: Assert transformed content in S3 as part of end-to-end validation.

You already have helpers; add the content check for stronger E2E coverage.

         transformed_doc = document_crud.read_one(job.transformed_document_id)
         assert transformed_doc.source_document_id == document.id
         assert "<transformed>" in transformed_doc.fname
+        # Verify content uploaded to S3
+        self.verify_s3_content(aws, transformed_doc)

99-124: Optional: run the “concurrent jobs” test in parallel to truly exercise concurrency.

Use ThreadPoolExecutor to dispatch jobs simultaneously; today they run sequentially.

backend/app/tests/core/doctransformer/test_service/test_execute_job.py (6)

4-4: Import Callable from collections.abc and drop deprecated Tuple (ruff UP035).

-from typing import Any, Callable, Tuple
+from typing import Any
+from collections.abc import Callable

7-7: Add imports for content-type verification and specific S3 error type.

 from uuid import uuid4, UUID
+from urllib.parse import urlparse
@@
 from app.core.doctransform.service import execute_job
+from app.core.config import settings

Also applies to: 13-13


8-12: Import botocore ClientError to narrow exception expectations.

 import pytest
 from moto import mock_aws
 from sqlmodel import Session
 from tenacity import RetryError
+from botocore.exceptions import ClientError

39-43: Use built-in generics for tuple annotations (ruff UP035).

-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        self, db: Session, test_document: Tuple[Document, Project]
+        self, db: Session, test_document: tuple[Document, Project]
@@
-        self, db: Session, test_document: Tuple[Document, Project]
+        self, db: Session, test_document: tuple[Document, Project]

Also applies to: 94-98, 124-128, 206-207, 241-242


145-151: Avoid catching bare Exception; assert expected S3 error types.

Missing source objects typically raise botocore.exceptions.ClientError. Include RetryError due to tenacity.

-            with pytest.raises(Exception):
+            with pytest.raises((ClientError, RetryError)):
                 fast_execute_job(
                     project_id=project.id,
                     job_id=job.id,
                     transformer_name="test",
                     target_format="markdown",
                 )

248-253: Actually assert expected content types in S3 (and remove B007 warning).

You compute expected_content_type but never assert it; check S3 object metadata.

         format_extensions = [
-            ("markdown", "text/markdown", ".md"),
-            ("text", "text/plain", ".txt"),
-            ("html", "text/html", ".html"),
-            ("unknown", "text/plain", ".unknown"),  # Default fallback
+            ("markdown", "text/markdown", ".md"),
+            ("text", "text/plain", ".txt"),
+            ("html", "text/html", ".html"),
+            ("unknown", "text/plain", ".unknown"),  # Default fallback
         ]
@@
             assert job.transformed_document_id is not None
             transformed_doc = document_crud.read_one(job.transformed_document_id)
             assert transformed_doc is not None
             assert transformed_doc.fname.endswith(expected_extension)
+            # Verify ContentType stored in S3 matches expectation
+            parsed = urlparse(transformed_doc.object_store_url)
+            key = parsed.path.lstrip("/")
+            head = aws.client.head_object(Bucket=settings.AWS_S3_BUCKET, Key=key)
+            assert head["ContentType"] == expected_content_type

Also applies to: 280-288

backend/app/tests/core/doctransformer/test_service/test_start_job.py (8)

4-4: Prefer built-in tuple[...] over deprecated typing.Tuple.

Modern typing: use tuple[Document, Project] and drop the Tuple import.

-from typing import Tuple
+# (no import needed for tuple)

-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
...
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
...
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
...
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],

Also applies to: 34-34, 89-89, 116-116, 149-149


13-13: Import HTTPException directly from FastAPI.

Reduces coupling to app internals and matches the raised exception type.

-from app.core.exception_handlers import HTTPException
+from fastapi import HTTPException

82-84: Also assert no background task is scheduled on failure.

Prevents accidental scheduling when the document is invalid.

         assert exc_info.value.status_code == 404
         assert "Document not found" in str(exc_info.value.detail)
+        assert len(background_tasks.tasks) == 0

109-111: Likewise, ensure no background task is scheduled for deleted documents.

         assert exc_info.value.status_code == 404
         assert "Document not found" in str(exc_info.value.detail)
+        assert len(background_tasks.tasks) == 0

121-123: Drop transformer registry patch here; start_job doesn't validate transformers.

Keeps the test scoped to scheduling concerns and removes an unnecessary dependency on the registry.

-        # Add the test transformer to the registry for this test
-        monkeypatch.setitem(TRANSFORMERS, "test", MockTestTransformer)

118-118: Remove unused monkeypatch parameter if you drop the registry patch.

-        monkeypatch,

155-157: Same: no need to patch the registry for start_job.

-        # Add the test transformer to the registry for this test
-        monkeypatch.setitem(TRANSFORMERS, "test", MockTestTransformer)

152-152: Remove monkeypatch param here if patching is dropped.

-        monkeypatch,
backend/app/tests/core/doctransformer/test_service/utils.py (5)

17-21: Clarify docstring to reflect file-writing behavior.

-    A mock transformer for testing that returns a hardcoded lorem ipsum string.
+    A mock transformer for testing that writes a hardcoded lorem ipsum string
+    to the provided output_path and returns that path.

49-49: Use the bucket from the document URL to avoid config drift.

This makes tests resilient if AWS_S3_BUCKET changes or documents point to a different bucket.

-        aws.client.put_object(Bucket=settings.AWS_S3_BUCKET, Key=s3_key, Body=content)
+        aws.client.put_object(Bucket=parsed_url.netloc, Key=s3_key, Body=content)

66-66: Same: verify content against the URL’s bucket.

-            Bucket=settings.AWS_S3_BUCKET, Key=transformed_key
+            Bucket=parsed_url.netloc, Key=transformed_key

76-85: Silence unused-args warnings and keep intent clear.

-    def failing_convert_document(*args, **kwargs):
+    def failing_convert_document(*_args, **_kwargs):
         nonlocal call_count
         call_count += 1
         if call_count <= fail_count:
             raise Exception("Transient error")
-        output_path = args[1] if len(args) > 1 else kwargs.get("output_path")
+        output_path = _args[1] if len(_args) > 1 else _kwargs.get("output_path")
         if output_path:
             output_path.write_text("Success after retries", encoding="utf-8")
             return output_path
         raise ValueError("output_path is required")

95-96: Same for persistent failing helper.

-    def persistent_failing_convert_document(*args, **kwargs):
+    def persistent_failing_convert_document(*_args, **_kwargs):
         raise Exception(error_message)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (6)

5-5: Modernize typing imports and annotations.

  • Import Callable from collections.abc.
  • Use built-in tuple[...] instead of typing.Tuple.
-from typing import Any, Callable, Tuple
+from typing import Any
+from collections.abc import Callable
...
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
...
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
...
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
...
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],

Also applies to: 32-32, 78-78, 120-120, 166-166


59-60: Avoid blind exception assertions; match the message.

Improves signal and prevents false positives.

-            with pytest.raises(Exception):
+            with pytest.raises(Exception, match="S3 upload failed"):
                 fast_execute_job(
                     project_id=project.id,
                     job_id=job.id,
                     transformer_name="test",
                     target_format="markdown",
                 )

148-154: Same here: assert the error message for exhausted retries.

-            with pytest.raises(Exception):
+            with pytest.raises(Exception, match="Persistent error"):
                 fast_execute_job(
                     project_id=project.id,
                     job_id=job.id,
                     transformer_name="test",
                     target_format="markdown",
                 )

198-204: Same for DB error during completion.

-                with pytest.raises(Exception):
+                with pytest.raises(Exception, match="Database error during document creation"):
                     fast_execute_job(
                         project_id=project.id,
                         job_id=job.id,
                         transformer_name="test",
                         target_format="markdown",
                     )

50-51: Prefer patch.dict when overriding registry entries.

Makes intent explicit and avoids replacing the whole mapping accidentally.

-        ), patch(
-            "app.core.doctransform.registry.TRANSFORMERS", {"test": MockTestTransformer}
+        ), patch.dict(
+            "app.core.doctransform.registry.TRANSFORMERS", {"test": MockTestTransformer}, clear=True
         ):

99-100: Apply the same patch.dict(..., clear=True) pattern in other tests.

-        ), patch(
-            "app.core.doctransform.registry.TRANSFORMERS", {"test": MockTestTransformer}
+        ), patch.dict(
+            "app.core.doctransform.registry.TRANSFORMERS", {"test": MockTestTransformer}, clear=True
         ):

Also applies to: 143-144

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between de4a89d and f12cba0.

📒 Files selected for processing (6)
  • backend/app/core/doctransform/registry.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_execute_job.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_integration.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/test_start_job.py (1 hunks)
  • backend/app/tests/core/doctransformer/test_service/utils.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (5)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (14-91)
  • create (19-30)
  • read_one (32-51)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/tests/core/doctransformer/test_service/utils.py (8)
  • DocTransformTestBase (30-69)
  • MockTestTransformer (16-27)
  • create_failing_convert_document (72-87)
  • create_persistent_failing_convert_document (90-98)
  • setup_aws_s3 (33-37)
  • create_s3_document_content (39-50)
  • failing_convert_document (76-85)
  • persistent_failing_convert_document (95-96)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/core/doctransformer/test_service/conftest.py (2)
  • test_document (73-79)
  • fast_execute_job (33-51)
backend/app/tests/core/doctransformer/test_service/utils.py (5)
backend/app/core/cloud/storage.py (1)
  • AmazonCloudStorageClient (28-84)
backend/app/core/doctransform/transformer.py (1)
  • Transformer (5-14)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/core/config.py (1)
  • AWS_S3_BUCKET (76-77)
backend/app/tests/api/routes/documents/test_route_document_upload.py (2)
  • TestDocumentRouteUpload (64-103)
  • test_adds_to_S3 (82-103)
backend/app/core/doctransform/registry.py (3)
backend/app/core/doctransform/transformer.py (2)
  • Transformer (5-14)
  • transform (9-14)
backend/app/core/doctransform/zerox_transformer.py (2)
  • ZeroxTransformer (10-56)
  • transform (18-56)
backend/app/tests/core/doctransformer/test_service/utils.py (1)
  • transform (21-27)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (7)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (14-91)
  • create (19-30)
  • read_one (32-51)
backend/app/crud/document.py (1)
  • DocumentCrud (14-135)
backend/app/core/doctransform/registry.py (1)
  • TransformationError (8-9)
backend/app/core/doctransform/service.py (1)
  • execute_job (48-150)
backend/app/tests/core/doctransformer/test_service/utils.py (5)
  • DocTransformTestBase (30-69)
  • MockTestTransformer (16-27)
  • setup_aws_s3 (33-37)
  • create_s3_document_content (39-50)
  • verify_s3_content (52-69)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/core/doctransformer/test_service/conftest.py (1)
  • fast_execute_job (33-51)
backend/app/tests/core/doctransformer/test_service/test_integration.py (8)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (14-91)
  • read_one (32-51)
  • create (19-30)
backend/app/crud/document.py (1)
  • DocumentCrud (14-135)
backend/app/core/doctransform/service.py (2)
  • execute_job (48-150)
  • start_job (25-44)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-27)
  • TransformationStatus (9-13)
backend/app/tests/core/doctransformer/test_service/utils.py (4)
  • DocTransformTestBase (30-69)
  • MockTestTransformer (16-27)
  • setup_aws_s3 (33-37)
  • create_s3_document_content (39-50)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/core/doctransformer/test_service/conftest.py (3)
  • test_document (73-79)
  • current_user (55-63)
  • background_tasks (67-69)
backend/app/tests/core/doctransformer/test_service/test_start_job.py (7)
backend/app/core/doctransform/service.py (2)
  • execute_job (48-150)
  • start_job (25-44)
backend/app/models/document.py (1)
  • Document (20-40)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-27)
  • TransformationStatus (9-13)
backend/app/models/user.py (1)
  • UserProjectOrg (65-66)
backend/app/tests/core/doctransformer/test_service/utils.py (2)
  • DocTransformTestBase (30-69)
  • MockTestTransformer (16-27)
backend/app/tests/conftest.py (1)
  • db (24-41)
backend/app/tests/core/doctransformer/test_service/conftest.py (3)
  • current_user (55-63)
  • test_document (73-79)
  • background_tasks (67-69)
🪛 Ruff (0.12.2)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py

5-5: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


5-5: typing.Tuple is deprecated, use tuple instead

(UP035)


59-59: Do not assert blind exception: Exception

(B017)


148-148: Do not assert blind exception: Exception

(B017)


198-198: Do not assert blind exception: Exception

(B017)

backend/app/tests/core/doctransformer/test_service/utils.py

95-95: Unused function argument: args

(ARG001)


95-95: Unused function argument: kwargs

(ARG001)

backend/app/core/doctransform/registry.py

2-2: typing.Type is deprecated, use type instead

(UP035)


2-2: typing.Dict is deprecated, use dict instead

(UP035)


2-2: typing.Set is deprecated, use set instead

(UP035)


2-2: typing.Tuple is deprecated, use tuple instead

(UP035)

backend/app/tests/core/doctransformer/test_service/test_execute_job.py

4-4: Import from collections.abc instead: Callable

Import from collections.abc

(UP035)


4-4: typing.Tuple is deprecated, use tuple instead

(UP035)


145-145: Do not assert blind exception: Exception

(B017)


257-257: Loop control variable expected_content_type not used within loop body

Rename unused expected_content_type to _expected_content_type

(B007)

backend/app/tests/core/doctransformer/test_service/test_integration.py

4-4: typing.Tuple is deprecated, use tuple instead

(UP035)


102-102: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


171-171: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

backend/app/tests/core/doctransformer/test_service/test_start_job.py

4-4: typing.Tuple is deprecated, use tuple instead

(UP035)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
backend/app/api/routes/doc_transformation_job.py (1)

41-43: Previous issue resolved: jobs_not_found is now a list

Conversion to list fixes the response model mismatch flagged earlier.

🧹 Nitpick comments (2)
backend/app/api/routes/doc_transformation_job.py (2)

33-43: Preserve input order for jobs_not_found and dedupe deterministically

Set subtraction returns an arbitrary order. If clients compare arrays, this can be flaky. Keep caller order while deduping.

Apply:

-def get_multiple_transformation_jobs(
+def get_multiple_transformation_jobs(
     session: SessionDep,
     current_user: CurrentUserOrgProject,
-    job_ids: list[UUID] = Query(description="List of transformation job IDs"),
+    job_ids: list[UUID] = Query(
+        description="List of transformation job IDs",
+        min_items=1
+    ),
 ):
     crud = DocTransformationJobCrud(session, project_id=current_user.project_id)
-    jobs = crud.read_each(set(job_ids))
-    jobs_not_found = set(job_ids) - {job.id for job in jobs}
+    # stable order + dedupe
+    seen: set[UUID] = set()
+    ordered_unique_ids: list[UUID] = []
+    for jid in job_ids:
+        if jid not in seen:
+            seen.add(jid)
+            ordered_unique_ids.append(jid)
+    jobs = crud.read_each(seen)
+    found_ids = {job.id for job in jobs}
+    jobs_not_found = [jid for jid in ordered_unique_ids if jid not in found_ids]
     return APIResponse.success_response(
-        DocTransformationJobs(jobs=jobs, jobs_not_found=list(jobs_not_found))
+        DocTransformationJobs(jobs=jobs, jobs_not_found=jobs_not_found)
     )

33-37: Optional: add an upper bound to job_ids to protect the DB

Large IN clauses can hurt performance. Consider enforcing a sane max (e.g., 500–1000) and returning 422 if exceeded.

Example:

-    job_ids: list[UUID] = Query(
+    job_ids: list[UUID] = Query(
         description="List of transformation job IDs",
-        min_items=1
+        min_items=1,
+        max_items=1000
     ),
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9f68c0f and 76f8757.

📒 Files selected for processing (2)
  • backend/app/api/routes/doc_transformation_job.py (1 hunks)
  • backend/app/tests/api/routes/test_doc_transformation_job.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/tests/api/routes/test_doc_transformation_job.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/api/routes/doc_transformation_job.py (3)
backend/app/crud/doc_transformation_job.py (3)
  • DocTransformationJobCrud (14-91)
  • read_one (32-51)
  • read_each (53-67)
backend/app/models/doc_transformation_job.py (2)
  • DocTransformationJob (16-27)
  • DocTransformationJobs (30-32)
backend/app/utils.py (2)
  • APIResponse (27-48)
  • success_response (34-37)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (1)
backend/app/api/routes/doc_transformation_job.py (1)

13-25: Single job endpoint looks good

CRUD enforces project scoping and 404s; response wrapping is correct.

@avirajsingh7 avirajsingh7 merged commit 98295d6 into main Sep 4, 2025
3 checks passed
@avirajsingh7 avirajsingh7 deleted the feature/doc-transformer-main branch September 4, 2025 08:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Doc-transform: Implement document transformation

3 participants