Implement document transformation pipeline to improve RAG performance #363

avirajsingh7 · 2025-09-01T08:16:20Z

Summary

Partners often deal with scanned PDFs and document formats that are not amenable to AI services like OpenAI's vector stores. This results in poor RAG (Retrieval-Augmented Generation) performance and limits the platform's utility for AI-powered document processing.

This PR implements a foundational document transformation pipeline that enhances the /documents/upload endpoint with on-demand, pluggable document conversion capabilities. Users can now prepare documents for optimal RAG performance directly within the platform by converting documents to LLM-friendly formats.

Key Features and Changes

Enhanced /documents/upload endpoint with optional target_format and transformer parameters
RESTful APIs for job status monitoring and batch queries
Transformer Registry System: Pluggable architecture for different document transformers
Docker Support: Added Poppler utilities for PDF processing
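
As a rough illustration of the new optional parameters, the sketch below assembles the form fields a client might send to /documents/upload. The field names src, target_format, and transformer mirror the test uploader later in this review; the helper name build_upload_form and its return shape are hypothetical, and nothing is sent over the network.

```python
from __future__ import annotations

import mimetypes
from pathlib import Path


def build_upload_form(
    path: Path,
    target_format: str | None = None,
    transformer: str | None = None,
) -> dict:
    """Assemble (but do not send) the multipart fields for /documents/upload.

    Hypothetical helper: only the field names come from the PR's test uploader.
    """
    mtype, _ = mimetypes.guess_type(str(path))
    data: dict[str, str] = {}
    # Both transformation fields are optional; omit them for a plain upload.
    if target_format:
        data["target_format"] = target_format
    if transformer:
        data["transformer"] = transformer
    return {
        "file_field": "src",
        "filename": path.name,
        "content_type": mtype,
        "data": data,
    }
```

Leaving out target_format yields an empty data dict, which corresponds to the plain upload path where no transformation job is created.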

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
If you've fixed a bug or added code, it is tested and has test cases.

Summary by CodeRabbit

New Features
- Upload documents with optional background transformations (e.g., pdf→markdown); responses include transformation job metadata, signed access URL, and endpoints to fetch single or multiple transformation job statuses.
Documentation
- Upload API docs expanded with modes, supported conversions, and transformer details; README adds Poppler prerequisite.
Chores
- Docker installs poppler-utils; added py-zerox dependency; backend now targets Python 3.11.
Tests
- Extensive unit, integration, and error/retry tests covering registry, service, routes, CRUD, and upload flows.

coderabbitai · 2025-09-01T08:16:28Z

Walkthrough

Adds an asynchronous document transformation subsystem: new DB migrations, models, CRUD, API routes and docs; transformer registry with a Zerox PDF→Markdown transformer; background job service with retries; Docker and pyproject dependency updates (poppler, py-zerox); comprehensive unit and integration tests.

Changes

Cohort / File(s)	Summary
Docs `README.md`, `backend/app/api/docs/documents/upload.md`	Document Poppler prerequisite; update upload docs with optional transformation flow, supported mappings, and transformer names.
Container & Dependencies `backend/Dockerfile`, `backend/pyproject.toml`	Install `poppler-utils` in the Docker image; bump Python requirement to >=3.11 and add `py-zerox` dependency.
DB Migrations `backend/app/alembic/versions/...b5b9412d3d2a_add_source_document_id_to_document_table.py`, `backend/app/alembic/versions/...9f8a4af9d6fd_create_doc_transformation_job_table.py`	Add nullable `document.source_document_id` (self-FK) and create `doc_transformation_job` table with enum status, timestamps, FKs.
Models & Exports `backend/app/models/document.py`, `backend/app/models/doc_transformation_job.py`, `backend/app/models/__init__.py`	Add `source_document_id` to Document; new `TransformationStatus`, `DocTransformationJob`, `DocTransformationJobs`, `TransformationJobInfo`, `DocumentUploadResponse`; re-export models.
Core Transform Framework `backend/app/core/doctransform/transformer.py`, `.../registry.py`, `.../service.py`, `.../zerox_transformer.py`	Add Transformer ABC; registry with mappings/resolution and convert_document; ZeroxTransformer (py-zerox-based PDF→text) and async service implementing start_job/execute_job with retries, S3 download/upload, temp files, and job status updates.
CRUD `backend/app/crud/doc_transformation_job.py`, `backend/app/crud/__init__.py`	New project-scoped `DocTransformationJobCrud` (create/read_one/read_each/update_status); exported from crud package.
API & Routes `backend/app/api/main.py`, `backend/app/api/routes/documents.py`, `backend/app/api/routes/doc_transformation_job.py`	Register job router; `/upload` becomes async, accepts `target_format` and `transformer`, schedules background transformation jobs and returns `DocumentUploadResponse`; add endpoints to fetch single/multiple job statuses.
Logging `backend/app/core/logger.py`	Suppress LiteLLM logs by setting its logger level to WARNING.
Tests — API & Upload `backend/app/tests/api/routes/documents/test_route_document_upload.py`, `backend/app/tests/api/routes/test_doc_transformation_job.py`	Update test uploader to include transform params; add tests for uploads with/without transformations, validation errors, and job retrieval (single/multiple, cross-project).
Tests — Core Transform Service `backend/app/tests/core/doctransformer/test_service/*`, `backend/app/tests/core/doctransformer/test_registry.py`, `backend/app/tests/core/doctransformer/test_service/utils.py`, `backend/app/tests/core/doctransformer/test_service/conftest.py`	Add fixtures, utilities, and comprehensive tests for `start_job`, `execute_job` (success, retries, error handling), registry behavior, and integration flows with S3 (moto).
Tests — CRUD & Utilities `backend/app/tests/crud/test_doc_transformation_job.py`, `backend/app/tests/utils/document.py`	Add CRUD unit tests for job lifecycle and change DocumentMaker to eager project resolution.
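
To make the "Core Transform Framework" row more concrete, here is a minimal sketch of a pluggable registry. Only the Transformer interface (transform(input_path, output_path) returning a Path) and the convert_document name come from the PR; the registry layout, resolve_transformer, and UppercaseTransformer are assumptions made so the example runs without py-zerox.

```python
from __future__ import annotations

import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class Transformer(ABC):
    """Interface from the PR: read input_path, write output_path, return it."""

    @abstractmethod
    def transform(self, input_path: Path, output_path: Path) -> Path:
        ...


class UppercaseTransformer(Transformer):
    """Hypothetical stand-in so the sketch runs without py-zerox."""

    def transform(self, input_path: Path, output_path: Path) -> Path:
        text = input_path.read_text(encoding="utf-8")
        output_path.write_text(text.upper(), encoding="utf-8")
        return output_path


# Assumed layout: name -> class, and (source, target) -> default transformer name.
TRANSFORMERS: dict[str, type[Transformer]] = {"upper": UppercaseTransformer}
DEFAULTS: dict[tuple[str, str], str] = {("text", "shouted-text"): "upper"}


def resolve_transformer(
    source_format: str, target_format: str, name: str | None = None
) -> str:
    """Pick the requested transformer, or the default for this format pair."""
    key = (source_format, target_format)
    if key not in DEFAULTS:
        raise ValueError(f"Unsupported transformation: {source_format} -> {target_format}")
    chosen = name or DEFAULTS[key]
    if chosen not in TRANSFORMERS:
        raise ValueError(f"Unknown transformer: {chosen}")
    return chosen


def convert_document(input_path: Path, output_path: Path, transformer_name: str) -> Path:
    """Instantiate the named transformer and run it."""
    return TRANSFORMERS[transformer_name]().transform(input_path, output_path)


def demo() -> str:
    """Round-trip a small file through the registry."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "in.txt"
        dst = Path(tmp) / "out.txt"
        src.write_text("hello", encoding="utf-8")
        name = resolve_transformer("text", "shouted-text")
        return convert_document(src, dst, name).read_text(encoding="utf-8")
```

Keeping resolution separate from conversion lets the upload route reject an unsupported format pair or an unknown transformer name before any background job is scheduled.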

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant API as Documents API (/upload)
  participant DB as DB (Session)
  participant Storage as Cloud Storage
  participant BG as BackgroundTasks
  participant Svc as Transform Service
  participant Reg as Transformer Registry

  Client->>API: POST /documents/upload (file[, target_format, transformer])
  API->>API: get_file_format(filename)
  API->>Storage: put(file)
  API->>DB: Create Document
  alt target_format provided
    API->>Svc: start_job(db, user, doc_id, transformer, target_format, BG)
    Svc-->>API: job_id
    API-->>Client: DocumentUploadResponse + transformation_job
    note over BG,Svc: Async execution
    BG->>Svc: execute_job(project_id, job_id, transformer, target_format)
    Svc->>DB: Mark job PROCESSING
    Svc->>Storage: stream(source) -> tmp_in
    Svc->>Reg: convert_document(tmp_in, tmp_out, transformer)
    Reg->>Svc: transformed file
    Svc->>Storage: put(transformed)
    Svc->>DB: Create transformed Document (source_document_id link)
    Svc->>DB: Mark job COMPLETED (transformed_document_id)
  else no target_format
    API-->>Client: DocumentUploadResponse (transformation_job = null)
  end
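
The status transitions in the diagram above can be sketched with an in-memory job and a plain retry loop. This is an illustration under stated assumptions: the PENDING member, the enum string values, the Job dataclass, and the max_attempts parameter are hypothetical, and the loop stands in for whatever retry mechanism the service actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Callable


class TransformationStatus(str, Enum):
    PENDING = "pending"        # assumed initial state
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class Job:
    """Hypothetical in-memory stand-in for a doc_transformation_job row."""

    status: TransformationStatus = TransformationStatus.PENDING
    error_message: str | None = None
    result: str | None = None


def execute_job(job: Job, work: Callable[[], str], max_attempts: int = 3) -> Job:
    """Run work() up to max_attempts times, recording the outcome on the job."""
    job.status = TransformationStatus.PROCESSING
    last_error: Exception | None = None
    for _ in range(max_attempts):
        try:
            job.result = work()
        except Exception as exc:  # sketch only: retry on any failure
            last_error = exc
            continue
        job.status = TransformationStatus.COMPLETED
        job.error_message = None
        return job
    # Every attempt failed: persist the last error for status queries.
    job.status = TransformationStatus.FAILED
    job.error_message = str(last_error)
    return job
```

Because the job is marked FAILED only after the final attempt, a client polling the status endpoint would see PROCESSING throughout the retries in this sketch.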

sequenceDiagram
  autonumber
  actor Client
  participant API as Jobs API (/documents/transformations)
  participant DB as DB (Session)

  Client->>API: GET /{job_id}
  API->>DB: DocTransformationJobCrud.read_one(job_id)
  DB-->>API: Job
  API-->>Client: APIResponse(Job)

  Client->>API: GET /?job_ids=uuid1,uuid2
  API->>API: Parse/validate UUIDs
  API->>DB: read_each({uuid1, uuid2})
  DB-->>API: [Jobs]
  API-->>Client: APIResponse({jobs, jobs_not_found})
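
The batch lookup in the second diagram can be sketched as two small steps: parse the comma-separated job_ids string, then split the requested IDs into found and not found. The function names and the dict-based job store are hypothetical; only the jobs and jobs_not_found response keys come from the diagram.

```python
from __future__ import annotations

from uuid import UUID


def parse_job_ids(raw: str) -> tuple[set[UUID], list[str]]:
    """Split a comma-separated string into valid UUIDs and rejected tokens."""
    valid: set[UUID] = set()
    invalid: list[str] = []
    for part in raw.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate trailing commas and stray whitespace
        try:
            valid.add(UUID(part))
        except ValueError:
            invalid.append(part)
    return valid, invalid


def partition_jobs(requested: set[UUID], found: dict[UUID, dict]) -> dict:
    """Build the response body: jobs that exist plus the IDs that do not."""
    jobs = [found[job_id] for job_id in requested if job_id in found]
    jobs_not_found = sorted(str(job_id) for job_id in requested if job_id not in found)
    return {"jobs": jobs, "jobs_not_found": jobs_not_found}
```

In a route handler, a non-empty invalid list would typically map to a validation error before any database query is made.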

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Refactor Document Module and Cloud Storage to use project-based organization and add signed URL support #346 — Overlaps on document upload, per-project cloud/storage and signed-URL handling that this transformation flow depends on.

Suggested reviewers

AkhileshNegi
kartpop
nishika26

Poem

A rabbit taps keys in the moonlit fog,
Queues a job to turn PDF into log.
Poppler pops, Zerox hums along,
Background hops finish tasks so strong.
Hop—markdown lands, tidy and bright. 🐇✨

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 76f8757 and e06042f.

📒 Files selected for processing (1)

backend/app/api/routes/doc_transformation_job.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

backend/app/api/routes/doc_transformation_job.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: checks (3.11.7, 6)

coderabbitai

Actionable comments posted: 19

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

backend/pyproject.toml (1)
6-33: Trim runtime deps; move test/dev-only packages to dev-dependencies.

pytest, pre-commit, moto, openai_responses shouldn’t ship in prod images. Keep them in dev deps and exclude on image build.
 [project]
 dependencies = [
@@
-    "moto[s3]>=5.1.1",
@@
-    "pytest>=7.4.4",
-    "pre-commit>=3.8.0",
-    "openai_responses",
@@
 ]
 
 [tool.uv]
 dev-dependencies = [
     "pytest<8.0.0,>=7.4.3",
     "mypy<2.0.0,>=1.8.0",
     "ruff<1.0.0,>=0.2.2",
     "pre-commit<4.0.0,>=3.6.2",
     "types-passlib<2.0.0.0,>=1.7.7.20240106",
     "coverage<8.0.0,>=7.4.3",
+    "moto[s3]>=5.1.1",
+    "openai_responses",
 ]
Also consider pinning upper bounds for fast-moving libs (openai, boto3) if you need reproducible builds.
backend/app/api/routes/documents.py (1)

95-103: Separate create vs update in DocumentCrud
DocumentCrud.update unconditionally calls session.add() on a transient Document, causing an INSERT every time (and duplicate-key errors when the record already exists). Implement a create() that adds new rows, and refactor update() to first load the existing document (e.g. via read_one or session.merge) before applying changes.
Location: backend/app/crud/document.py, def update (around lines 98-106).

🧹 Nitpick comments (63)

backend/app/tests/utils/document.py (2)
35-35: Remove stray blank line to avoid churn with Black.

Minor style nit; keeps diffs cleaner.
     def __iter__(self):
         return self
-    
46-51: Verify intentional extension mismatch (.xyz fname vs .txt object key).

fname uses .xyz while object_store_url uses .txt. If not intentional, align them to prevent brittle tests.

Possible fix (choose one):
-            fname=f"{doc_id}.xyz",
+            fname=f"{doc_id}.txt",
or
-        key = f"{self.project.storage_path}/{doc_id}.txt"
+        key = f"{self.project.storage_path}/{doc_id}.xyz"
backend/Dockerfile (1)
10-14: Slimmer image; avoid bringing dev deps.

Use --no-install-recommends for apt and drop curl unless required.

Exclude dev deps at sync time.
-RUN apt-get update && apt-get install -y \
-    curl \
-    poppler-utils \
- && rm -rf /var/lib/apt/lists/*
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    poppler-utils \
+ && rm -rf /var/lib/apt/lists/*
@@
-RUN --mount=type=cache,target=/root/.cache/uv \
-    uv sync --frozen --no-install-project
+RUN --mount=type=cache,target=/root/.cache/uv \
+    uv sync --frozen --no-install-project --no-dev
If curl is needed (healthchecks/debug), keep it; otherwise omit.

Also applies to: 30-31
backend/app/core/doctransform/transformer.py (1)
7-13: Define error contract in the interface.

Document expected exception type (e.g., TransformationError) to standardize handling upstream (registry/service).
     def transform(self, input_path: Path, output_path: Path) -> Path:
         """
         Transform the document at input_path and write the result to output_path.
         Returns the path to the transformed file.
+        
+        Raises:
+            TransformationError: if the transformation fails.
         """
         pass
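
One way to honor such an error contract upstream is a thin wrapper that normalizes any failure into a single exception type. This is a sketch, not the PR's code: TransformationError is the name proposed in the comment above, and run_transformer is a hypothetical helper.

```python
from pathlib import Path


class TransformationError(Exception):
    """Single exception type that callers (registry, service) can rely on."""


def run_transformer(transformer, input_path: Path, output_path: Path) -> Path:
    """Call transformer.transform and re-raise any failure as TransformationError."""
    try:
        return transformer.transform(input_path, output_path)
    except TransformationError:
        raise  # already in the agreed shape
    except Exception as exc:
        # Chain the original exception so the root cause stays in the traceback.
        name = type(transformer).__name__
        raise TransformationError(f"{name} failed: {exc}") from exc
```

With this in place, the service layer only needs to catch TransformationError when deciding whether to retry or mark the job as failed.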
backend/app/api/docs/documents/upload.md (1)
3-5: Tighten wording and fix minor grammar.

Short, clear phrasing; fix articles.
-- If a target format is specified, a transformation job will also be created to transform document into target format in the background. The response will include both the uploaded document details and information about the transformation job.
+- If a target format is specified, a background transformation job will be created to transform the document into the target format. The response will include the uploaded document details and information about the transformation job.
@@
-  - zerox 
+  - zerox
@@
-Available transformer names and their implementations, default transformer is zerox:
+Available transformer names and their implementations; the default transformer is zerox:

Also applies to: 10-17
backend/app/core/doctransform/zerox_transformer.py (2)
23-31: Stronger result checks and robust write.

Treat empty page contents as failure, and be tolerant to encoding glitches.
-            if result is None or not hasattr(result, "pages") or result.pages is None:
+            if result is None or not hasattr(result, "pages") or result.pages is None:
                 raise RuntimeError("Zerox returned no pages. This may indicate a PDF/image conversion failure (is Poppler installed and in PATH?)")
 
-            with output_path.open("w", encoding="utf-8") as output_file:
+            pages = list(result.pages or [])
+            if not any(getattr(p, "content", None) for p in pages):
+                raise RuntimeError("Zerox returned pages without content. PDF/image conversion may have failed.")
+
+            with output_path.open("w", encoding="utf-8", errors="replace") as output_file:
-                for page in result.pages:
+                for page in pages:
                     if not getattr(page, "content", None):
                         continue    
                     output_file.write(page.content)
                     output_file.write("\n\n")
12-14: Make model configurable.

Default is fine, but allow env/config override to avoid hardcoding OpenAI model in code.
-    def __init__(self, model: str = "gpt-4o"):
-        self.model = model
+    def __init__(self, model: str | None = None):
+        # e.g., read from settings if not provided
+        self.model = model or os.getenv("ZEROX_MODEL", "gpt-4o")
(Remember to import os.)
README.md (1)
14-14: Add install commands for Poppler (per OS) and Docker note
Make it copy-pasteable and clarify Docker images already install it (if true).

Apply:
-- **Poppler** – Install Poppler, required for PDF processing.
+- **Poppler** – required for PDF processing.
+  - macOS: `brew install poppler`
+  - Ubuntu/Debian: `sudo apt-get update && sudo apt-get install -y poppler-utils`
+  - Fedora: `sudo dnf install -y poppler-utils`
+  - Windows (choco): `choco install poppler`
+  - Note: If you use our Docker setup, ensure the image includes Poppler (pdftoppm). Otherwise, add it to the Dockerfile.
backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1)
15-15: Unused import: get_project_by_id
This import isn’t used in this test; remove to keep tests tidy.
-from app.crud import get_project_by_id
backend/app/tests/crud/collections/test_crud_collection_read_all.py (1)
6-6: Unused import: get_project_by_id
Not referenced in this module; drop it.
-from app.crud import CollectionCrud, get_project_by_id
+from app.crud import CollectionCrud
backend/app/tests/api/routes/documents/test_route_document_info.py (1)
4-4: Unused import: get_project_by_id
Remove to avoid lint noise.
-from app.crud import get_project_by_id
backend/app/core/doctransform/test_transformer.py (1)
9-15: Harden write step by ensuring parent dir exists.

Preempt failures when the parent directory hasn’t been created by the caller.

Apply this diff:
     def transform(self, input_path: Path, output_path: Path) -> Path:
-        content = (
+        content = (
             "Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
             "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
         )
-        output_path.write_text(content, encoding='utf-8')
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+        output_path.write_text(content, encoding="utf-8")
         return output_path
backend/app/tests/crud/test_doc_transformation_job.py (2)
37-46: Prefer uuid4() for nonexistent IDs in tests.

SequentialUuidGenerator always starts at 0 when newly instantiated; uuid4() better reflects real IDs and avoids accidental collisions.
-        invalid_id = next(SequentialUuidGenerator())
+        from uuid import uuid4
+        invalid_id = uuid4()
Apply similarly to other tests expecting a non-existent UUID.

Also applies to: 74-83, 208-217

223-231: Clarify error_message retention semantics on state change.

The test asserts that the error persists when moving FAILED → PROCESSING. If intended, also add a test that explicitly clears the error when an empty string is passed, and one that preserves it when None is passed.
-        updated_job = crud.update_status(job.id, TransformationStatus.PROCESSING)
+        # Explicitly clear the error if desired behavior is to reset
+        # updated_job = crud.update_status(job.id, TransformationStatus.PROCESSING, error_message="")
+        updated_job = crud.update_status(job.id, TransformationStatus.PROCESSING)
If the desired behavior is to clear on retry, adjust CRUD accordingly and update this test.
backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (2)
30-33: Add helpful indexes and clean enum teardown.

Index by source_document_id and status/created_at for common lookups.

Drop the enum type on downgrade to avoid orphaned types (Postgres).
 def upgrade():
@@
-    )
+    )
+    op.create_index('ix_doc_transformation_job_source_document_id', 'doc_transformation_job', ['source_document_id'])
+    op.create_index('ix_doc_transformation_job_status_created_at', 'doc_transformation_job', ['status', 'created_at'])
@@
 def downgrade():
-    # ### commands auto generated by Alembic - please adjust! ###
-    op.drop_table('doc_transformation_job')
+    op.drop_index('ix_doc_transformation_job_status_created_at', table_name='doc_transformation_job')
+    op.drop_index('ix_doc_transformation_job_source_document_id', table_name='doc_transformation_job')
+    op.drop_table('doc_transformation_job')
+    # Best-effort cleanup for Postgres enum
+    try:
+        op.execute("DROP TYPE IF EXISTS transformationstatus")
+    except Exception:
+        pass
Also applies to: 37-40

30-31: Consider explicit ON DELETE behavior on FKs.

source_document_id: RESTRICT (prevents deleting source while jobs exist) or CASCADE if you want jobs removed with the source.

transformed_document_id: SET NULL is typical if the transformed doc is deleted independently.
-    sa.ForeignKeyConstraint(['source_document_id'], ['document.id'], ),
-    sa.ForeignKeyConstraint(['transformed_document_id'], ['document.id'], ),
+    sa.ForeignKeyConstraint(['source_document_id'], ['document.id'], ondelete='RESTRICT'),
+    sa.ForeignKeyConstraint(['transformed_document_id'], ['document.id'], ondelete='SET NULL'),
Validate with your deletion semantics.

Also applies to: 24-25
backend/app/tests/core/doctransformer/test_service/test_start_job.py (2)
4-4: Replace typing.Tuple with built-in tuple.

Aligns with Ruff UP035 and modern typing.
-from typing import Any, Tuple
+from typing import Any
+from typing import Tuple as _Tuple  # temporary to avoid rename churn
Or update annotations directly:
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
Apply across this file.

29-36: Standardize test transformer names to avoid confusion.

Registry test transformer is “test”; in one place you use “test-transformer”. Since execute_job isn’t run here, it won’t fail, but consistency helps readability.
-            transformer_name="test-transformer",
+            transformer_name="test",
Also consider parameterizing formats using TestDataProvider.get_format_test_cases() to avoid drift.

Also applies to: 112-129, 130-156
backend/app/tests/api/routes/test_doc_transformation_job.py (2)
37-38: Assert exact ID, not just non-null.

Stronger check improves signal and guards serialization issues.
-        assert data["data"]["id"] is not None
+        assert data["data"]["id"] == str(created_job.id)
1-1: Pre-commit/Black touched this file. Run hooks locally.

Pipeline shows trailing whitespace/formatting fixes. Please run pre-commit locally to avoid CI churn.
backend/app/models/doc_transformation_job.py (3)
21-25: Make nullability explicit for optional DB fields.

Being explicit helps migrations and DB introspection tools.
-    transformed_document_id: Optional[UUID] = Field(default=None, foreign_key="document.id")
+    transformed_document_id: Optional[UUID] = Field(default=None, foreign_key="document.id", nullable=True)
@@
-    error_message: Optional[str] = Field(default=None)
+    error_message: Optional[str] = Field(default=None, nullable=True)
16-26: Consider indexes for frequent filters/joins.

Given frequent joins on source_document_id and filters by status, add DB indexes (and Alembic migration) for (source_document_id), (status), and optionally (created_at DESC) for recent jobs.

1-1: EOF newline/formatting were auto-fixed by CI.

Keep pre-commit enabled locally to prevent reformat diffs.
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (5)
4-4: Modernize typing: use collections.abc.Callable and builtin tuple.

Aligns with Ruff UP035 and current typing best practices.
-from typing import Any, Callable, Tuple
+from typing import Any
+from collections.abc import Callable
@@
-        test_document: Tuple[Document, Project],
+        test_document: tuple[Document, Project],
@@
-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project], 
@@
-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project], 
@@
-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project], 
@@
-        test_document: Tuple[Document, Project]
+        test_document: tuple[Document, Project]
@@
-        test_document: Tuple[Document, Project]
+        test_document: tuple[Document, Project]
Also applies to: 33-33, 84-84, 110-110, 145-145, 185-185, 218-218

126-126: Avoid catching broad Exception in tests.

Assert the specific failure kinds to reduce flakes and satisfy Ruff B017.
-            with pytest.raises(Exception):
+            with pytest.raises((RetryError, HTTPException)):
225-251: Use expected_content_type by asserting the S3 ContentType.

Prevents unused-var warning (B007) and verifies correct headers are stored.
+from urllib.parse import urlparse
@@
+from app.core.config import settings
@@
         for target_format, expected_content_type, expected_extension in format_extensions:
@@
             transformed_doc = document_crud.read_one(job.transformed_document_id)
             assert transformed_doc is not None
             assert transformed_doc.fname.endswith(expected_extension)
+            # Assert S3 ContentType
+            parsed = urlparse(transformed_doc.object_store_url)
+            key = parsed.path.lstrip("/")
+            head = aws.client.head_object(Bucket=settings.AWS_S3_BUCKET, Key=key)
+            assert head["ContentType"] == expected_content_type
Also applies to: 8-8, 13-13

47-47: Remove redundant db.commit() calls.

DocTransformationJobCrud.create commits already; extra commits slow tests.
-        db.commit()
Also applies to: 121-121, 156-156, 196-196, 231-231

1-1: Pre-commit/Black formatting tweaks were applied by CI.

Run hooks locally to keep CI green and diffs minimal.
backend/app/models/document.py (2)
35-35: Make deleted_at explicitly optional with default

Be explicit to avoid accidental “required field” interpretation in pydantic/SQLModel.

Apply this diff:
-    deleted_at: datetime | None
+    deleted_at: datetime | None = None
36-40: Add ondelete=SET NULL and index for source_document_id in the migration

In backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py, change the FK creation to
op.create_foreign_key(None, 'document', 'document', ['source_document_id'], ['id'], ondelete='SET NULL').

Add an index for faster lookups, e.g.
op.create_index('ix_document_source_document_id', 'document', ['source_document_id']).
backend/app/api/routes/doc_transformation_job.py (1)
36-47: Leverage FastAPI validation for UUID lists (optional)

Accept job_ids as list[UUID] to drop manual parsing and 422 handling.

Example:
-    job_ids: str = Query(..., description="Comma-separated list of transformation job IDs"),
+    job_ids: list[UUID] = Query(..., description="Repeat ?job_ids=<uuid> per id"),
@@
-    job_id_list = []
-    invalid_ids = []
-    for jid in job_ids.split(","):
-        ...
-    if invalid_ids:
-        raise HTTPException(...)
+    job_id_list = job_ids
backend/app/tests/core/doctransformer/test_service/conftest.py (3)
5-5: Modernize typing imports

Use collections.abc for Callable/Generator and built-in tuple syntax.

Apply this diff:
-from typing import Any, Callable, Generator, Tuple
+from typing import Any
+from collections.abc import Callable, Generator
67-71: Use built-in tuple type annotation

Aligns with modern typing and Ruff UP035.

Apply this diff:
-def test_document(db: Session, current_user: UserProjectOrg) -> Tuple[Document, Project]:
+def test_document(db: Session, current_user: UserProjectOrg) -> tuple[Document, Project]:
22-30: Prefer monkeypatch for env vars in tests (optional)

Using pytest’s monkeypatch keeps env scoped to the test run.
backend/app/tests/core/doctransformer/test_service/test_integration.py (6)
4-4: Drop deprecated Tuple import

Use built-in tuple annotations.

Apply this diff:
-from typing import Tuple
25-27: Update annotation to tuple[...]

Apply this diff:
-        test_document: Tuple[Document, Project]
+        test_document: tuple[Document, Project]
82-85: Update annotation to tuple[...]

Apply this diff:
-        test_document: Tuple[Document, Project]
+        test_document: tuple[Document, Project]
120-123: Update annotation to tuple[...]

Apply this diff:
-        test_document: Tuple[Document, Project]
+        test_document: tuple[Document, Project]
93-96: Unused loop variable and redundant commit

Rename i to _ and drop extra commit; create() already commits.

Apply this diff:
-        for i in range(3):
+        for _ in range(3):
             job = job_crud.create(source_document_id=document.id)
             jobs.append(job)
-        db.commit()
154-154: Unused loop index

Rename i to _i or remove enumerate.

Apply this diff:
-        for i, (job, target_format) in enumerate(jobs):
+        for _i, (job, target_format) in enumerate(jobs):
backend/app/core/doctransform/service.py (2)
66-72: Close streaming body after copy (optional)

Avoid leaking connections/file descriptors from storage.stream.

Apply this diff:
-        body = storage.stream(source_doc_object_store_url)
-        tmp_dir = Path(tempfile.mkdtemp())
-        tmp_in = tmp_dir / f"{source_doc_id}"
-        with open(tmp_in, "wb") as f:
-            shutil.copyfileobj(body, f)
+        body = storage.stream(source_doc_object_store_url)
+        tmp_dir = Path(tempfile.mkdtemp())
+        tmp_in = tmp_dir / f"{source_doc_id}"
+        try:
+            with open(tmp_in, "wb") as f:
+                shutil.copyfileobj(body, f)
+        finally:
+            try:
+                body.close()
+            except Exception:
+                pass
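
A more compact alternative to the nested try/finally is contextlib.closing from the standard library. The sketch below assumes the object returned by storage.stream exposes a close() method; download_to_temp is a hypothetical helper name. Unlike the diff above, errors raised by close() propagate instead of being swallowed.

```python
import shutil
import tempfile
from contextlib import closing
from pathlib import Path


def download_to_temp(body, name: str) -> Path:
    """Copy a readable, closeable stream into a fresh temp directory."""
    tmp_dir = Path(tempfile.mkdtemp())
    tmp_in = tmp_dir / name
    # closing() calls body.close() on exit, even if the copy raises.
    with closing(body), open(tmp_in, "wb") as f:
        shutil.copyfileobj(body, f)
    return tmp_in
```

The caller remains responsible for removing the temporary directory once the transformation finishes.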
76-78: Idempotency and partial-failure risks (optional)

On retry, a new transformed_doc_id is generated; failures after upload but before job status update can orphan objects. Consider deriving object key from job_id and target_format, writing to a temp key then moving/overwriting on completion.

Also applies to: 98-114
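
A deterministic key, as suggested above, could be derived along these lines. The key layout and the function name transformed_object_key are assumptions for illustration; the point is only that a retry of the same job overwrites the same object instead of creating a new one.

```python
from uuid import UUID


def transformed_object_key(
    storage_path: str, job_id: UUID, target_format: str, extension: str
) -> str:
    """Return the same object key for every attempt of a given job and format."""
    base = storage_path.rstrip("/")
    return f"{base}/transformed/{job_id}/{target_format}{extension}"
```

Since the key depends only on the job and the target format, a failure between upload and status update leaves at most one object per job to reconcile.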
backend/app/crud/doc_transformation_job.py (3)
3-4: Tighten imports; modernize typing and drop unused symbols.

List is unused; prefer built-in list[...] types.

join is imported but never used.

Apply:
-from typing import List, Optional
-from sqlmodel import Session, select, and_, join
+from typing import Optional
+from sqlmodel import Session, select, and_
29-46: Clarify behavior when the source document is soft-deleted.

read_one filters Document.is_deleted.is_(False). If a source doc is soft-deleted after job creation, this method will 404 the job. Is that intentional for readers of job status? If not, drop the is_deleted filter here (or only for internal callers) so jobs remain discoverable.

Example:
-                    Document.project_id == self.project_id,
-                    Document.is_deleted.is_(False)
+                    Document.project_id == self.project_id,
48-63: Guard against empty IN-clause and consider input type.

If job_ids is empty, many dialects render IN () which is invalid; return [] early.

Consider accepting Iterable[UUID] for flexibility.
-    def read_each(self, job_ids: set[UUID]) -> list[DocTransformationJob]:
+    def read_each(self, job_ids: set[UUID]) -> list[DocTransformationJob]:
+        if not job_ids:
+            return []
         statement = (
             select(DocTransformationJob)
backend/app/tests/api/routes/documents/test_route_document_upload.py (4)
25-40: Close the file handle to avoid descriptor leaks in tests.

Wrap the file open in a context manager so the descriptor is closed even if the request fails.
     def put(self, route: Route, scratch: Path, target_format: str = None, transformer: str = None):
         (mtype, _) = mimetypes.guess_type(str(scratch))
-        files = {"src": (str(scratch), scratch.open("rb"), mtype)}
-        
-        data = {}
+        data = {}
         if target_format:
             data["target_format"] = target_format
         if transformer:
             data["transformer"] = transformer
-        
-        return self.client.post(
-            str(route),
-            headers={"X-API-KEY": self.user_api_key.key},
-            files=files,
-            data=data,
-        )
+        with scratch.open("rb") as fh:
+            files = {"src": (scratch.name, fh, mtype)}
+            return self.client.post(
+                str(route),
+                headers={"X-API-KEY": self.user_api_key.key},
+                files=files,
+                data=data,
+            )
200-216: Unsupported transformation error path looks good but message is brittle.

If server error messages change, this may become flaky. Consider matching a stable prefix or error code if available.
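
One possible shape for a less brittle assertion, as a sketch only. It assumes the parsed response body is a dict with an `error` string, which is not confirmed by this PR; `assert_error_prefix` is a hypothetical helper.

```python
def assert_error_prefix(body: dict, prefix: str) -> None:
    """Check only the stable leading part of the error message."""
    # "error" is an assumed field name for the response payload.
    error = body.get("error") or ""
    assert error.startswith(prefix), f"unexpected error: {error!r}"
```

A test would then call, for example, `assert_error_prefix(body, "Transformation from")`, so rewording the tail of the message does not break it.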

217-236: Invalid transformer case covered.

Assertions look good; the same note about the brittle full-string match applies.

261-285: Fix unused variable flagged by Ruff (F841).

response is assigned but unused (Line 277). Either remove the assignment or assert on it to strengthen the test.

Option A — remove assignment:
-        response = httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown"))
+        httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown"))
Option B — use the response:
-        response = httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown"))
+        response = httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown"))
+        assert response.success is True
+        assert response.data["transformation_job"]["job_id"] == mock_job_id
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (6)
5-5: Modernize typing imports (UP035).

Import Callable from collections.abc and prefer built-in generics.
-from typing import Any, Callable, Tuple
+from typing import Any
+from collections.abc import Callable
11-11: Remove unused import.

RetryError is not used.
-from tenacity import RetryError
14-14: Remove unused import.

execute_job is not referenced directly.
-from app.core.doctransform.service import execute_job
24-63: Avoid blind exception assertions; match expected message.

Strengthen the test by asserting on the error text (Ruff B017).
-            with pytest.raises(Exception):
+            with pytest.raises(Exception, match="S3 upload failed"):
                 fast_execute_job(
                     project_id=project.id,
                     job_id=job.id,
                     transformer_name="test",
                     target_format="markdown"
                 )
Also, update annotations to use built-in tuple:
-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project],
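
One caveat when adding match=: pytest applies the pattern with re.search against the string form of the exception, so it matches substrings and treats regex metacharacters specially. The sketch below mimics that check with the standard library only; `message_matches` is a hypothetical helper, not a pytest API.

```python
import re


def message_matches(pattern: str, exc: Exception) -> bool:
    """Mimic the match= check: a regex search over str(exc)."""
    return re.search(pattern, str(exc)) is not None


# A plain phrase matches anywhere in the message.
plain = message_matches("S3 upload failed", Exception("S3 upload failed: timeout"))

# Parentheses form a regex group, so a literal "(...)" needs re.escape.
unescaped = message_matches("failed (retry 3)", Exception("failed (retry 3)"))
escaped = message_matches(re.escape("failed (retry 3)"), Exception("failed (retry 3)"))
```

The messages used in these suggestions contain no metacharacters, so they can be passed to match= as they are.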
101-139: Assert specific error on exhausted retries.

Don't catch Exception blindly; verify the message.
-            with pytest.raises(Exception):
+            with pytest.raises(Exception, match="Persistent error"):
                 fast_execute_job(
                     project_id=project.id,
                     job_id=job.id,
                     transformer_name="test",
                     target_format="markdown"
                 )
And modernize the annotation:
-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project],
140-179: Also match message for DB error case.

Tighten the assertion to the expected failure text; update tuple typing.
-                with pytest.raises(Exception):
+                with pytest.raises(Exception, match="Database error during document creation"):
                     fast_execute_job(
                         project_id=project.id,
                         job_id=job.id,
                         transformer_name="test",
                         target_format="markdown"
                     )
-        test_document: Tuple[Document, Project], 
+        test_document: tuple[Document, Project],
backend/app/api/routes/documents.py (4)
3-3: Clean imports and modernize typing (fix Ruff F811/UP035).

Drop duplicate HTTPException import and unused JSONResponse.

Use built-in generics instead of typing.List; remove the typing import.
-from typing import List, Optional
+# typing imports not required; use built-in generics

 from fastapi import APIRouter, File, UploadFile, Query, Form, BackgroundTasks, HTTPException
-from fastapi.responses import JSONResponse
-from fastapi import HTTPException

-    response_model=APIResponse[List[DocumentPublic]],
+    response_model=APIResponse[list[DocumentPublic]],
Also applies to: 6-6, 8-9, 32-32

70-77: Guard against no-op transforms (source == target).

Rejects unnecessary jobs early.
     # validate if transformation is possible or not
     if target_format:
+        if source_format == target_format:
+            raise HTTPException(
+                status_code=400,
+                detail=f"Source and target formats are the same ('{source_format}'); no transformation needed."
+            )
         if not is_transformation_supported(source_format, target_format):
             raise HTTPException(
                 status_code=400,
                 detail=f"Transformation from {source_format} to {target_format} is not supported"
             )
114-121: Pass UUID, not str, to TransformationJobInfo.job_id.

The model field is UUID; avoid implicit coercion.
         job_info = TransformationJobInfo(
             message=f"Document accepted for transformation from {source_format} to {target_format}.",
-            job_id=str(job_id),
+            job_id=job_id,
             source_format=source_format,
             target_format=target_format,
             transformer=actual_transformer,
             status_check_url=f"/documents/transformations/{job_id}"
         )
45-49: Return 201 Created for uploads.

Aligns HTTP semantics with resource creation.
 @router.post(
     "/upload",
     description=load_description("documents/upload.md"),
     response_model=APIResponse[DocumentUploadResponse],
+    status_code=201,
 )
backend/app/core/doctransform/registry.py (2)
2-2: Adopt built-in generics and union syntax (fix Ruff UP035).

Modernize type hints; drop typing imports.
-from typing import Type, Dict, Set, Tuple, Optional
+# Built-in generics and | None are used; no typing imports needed.

-TRANSFORMERS: Dict[str, Type[Transformer]] = {
+TRANSFORMERS: dict[str, type[Transformer]] = {
@@
-SUPPORTED_TRANSFORMATIONS: Dict[Tuple[str, str], Dict[str, str]] = {
+SUPPORTED_TRANSFORMATIONS: dict[tuple[str, str], dict[str, str]] = {
@@
-EXTENSION_TO_FORMAT: Dict[str, str] = {
+EXTENSION_TO_FORMAT: dict[str, str] = {
@@
-FORMAT_TO_EXTENSION: Dict[str, str] = {
+FORMAT_TO_EXTENSION: dict[str, str] = {
@@
-def get_supported_transformations() -> Dict[Tuple[str, str], Set[str]]:
+def get_supported_transformations() -> dict[tuple[str, str], set[str]]:
@@
-def get_available_transformers(source_format: str, target_format: str) -> Dict[str, str]:
+def get_available_transformers(source_format: str, target_format: str) -> dict[str, str]:
@@
-def resolve_transformer(source_format: str, target_format: str, transformer_name: Optional[str] = None) -> str:
+def resolve_transformer(source_format: str, target_format: str, transformer_name: str | None = None) -> str:
Also applies to: 12-16, 19-27, 30-39, 42-49, 59-65, 70-73, 74-79

98-111: Ensure output directory exists before writing.

Prevents failures when output_path parents are missing.
 def convert_document(input_path: Path, output_path: Path, transformer_name: str = "default") -> Path:
@@
-    transformer = transformer_cls()
+    # Make sure output directory exists
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    transformer = transformer_cls()
backend/app/tests/core/doctransformer/test_service/base.py (2)
13-13: Use built-in generics in tests (fix Ruff UP035).

Replace typing.List with list[...] in return annotations; drop typing import.
-from typing import List
+# typing.List not needed; use built-in generics
@@
-    def get_format_test_cases() -> List[tuple]:
+    def get_format_test_cases() -> list[tuple]:
@@
-    def get_content_type_test_cases() -> List[tuple]:
+    def get_content_type_test_cases() -> list[tuple]:
@@
-    def get_test_transformer_names() -> List[str]:
+    def get_test_transformer_names() -> list[str]:
Also applies to: 75-81, 84-91, 93-97

124-128: Silence unused args warnings in persistent failing mock.

Prefix with underscores.
     def create_persistent_failing_convert_document(error_message: str = "Persistent error"):
         """Create a side effect function that always fails."""
-        def persistent_failing_convert_document(*args, **kwargs):
+        def persistent_failing_convert_document(*_args, **_kwargs):
             raise Exception(error_message)
         return persistent_failing_convert_document

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f24f4c5 and b615d5a.

⛔ Files ignored due to path filters (1)

backend/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (35)

README.md (1 hunks)
backend/Dockerfile (1 hunks)
backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (1 hunks)
backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py (1 hunks)
backend/app/api/docs/documents/upload.md (1 hunks)
backend/app/api/main.py (2 hunks)
backend/app/api/routes/doc_transformation_job.py (1 hunks)
backend/app/api/routes/documents.py (3 hunks)
backend/app/core/doctransform/registry.py (1 hunks)
backend/app/core/doctransform/service.py (1 hunks)
backend/app/core/doctransform/test_transformer.py (1 hunks)
backend/app/core/doctransform/transformer.py (1 hunks)
backend/app/core/doctransform/zerox_transformer.py (1 hunks)
backend/app/crud/__init__.py (1 hunks)
backend/app/crud/doc_transformation_job.py (1 hunks)
backend/app/models/__init__.py (1 hunks)
backend/app/models/doc_transformation_job.py (1 hunks)
backend/app/models/document.py (3 hunks)
backend/app/tests/api/routes/documents/test_route_document_info.py (1 hunks)
backend/app/tests/api/routes/documents/test_route_document_list.py (1 hunks)
backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1 hunks)
backend/app/tests/api/routes/documents/test_route_document_remove.py (1 hunks)
backend/app/tests/api/routes/documents/test_route_document_upload.py (4 hunks)
backend/app/tests/api/routes/test_doc_transformation_job.py (1 hunks)
backend/app/tests/core/doctransformer/test_service/base.py (1 hunks)
backend/app/tests/core/doctransformer/test_service/conftest.py (1 hunks)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (1 hunks)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (1 hunks)
backend/app/tests/core/doctransformer/test_service/test_integration.py (1 hunks)
backend/app/tests/core/doctransformer/test_service/test_start_job.py (1 hunks)
backend/app/tests/crud/collections/test_crud_collection_create.py (1 hunks)
backend/app/tests/crud/collections/test_crud_collection_read_all.py (1 hunks)
backend/app/tests/crud/test_doc_transformation_job.py (1 hunks)
backend/app/tests/utils/document.py (1 hunks)
backend/pyproject.toml (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (30)

backend/app/core/doctransform/test_transformer.py (2)

backend/app/core/doctransform/transformer.py (2)

Transformer (4-13)

transform (8-13)

backend/app/core/doctransform/zerox_transformer.py (1)

transform (15-44)

backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/tests/api/routes/documents/test_route_document_list.py (2)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/tests/crud/test_project.py (1)

test_get_project_by_id (53-60)

backend/app/crud/doc_transformation_job.py (3)

backend/app/crud/document.py (2)

DocumentCrud (14-135)

__init__ (15-17)

backend/app/models/doc_transformation_job.py (2)

DocTransformationJob (16-25)

TransformationStatus (9-13)

backend/app/models/document.py (1)

Document (20-40)

backend/app/tests/utils/document.py (3)

backend/app/models/project.py (1)

Project (29-60)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/tests/utils/utils.py (1)

SequentialUuidGenerator (140-153)

backend/app/tests/api/routes/documents/test_route_document_remove.py (2)

backend/app/tests/utils/document.py (1)

project (67-68)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/tests/crud/test_doc_transformation_job.py (5)

backend/app/crud/doc_transformation_job.py (5)

DocTransformationJobCrud (13-84)

create (18-27)

read_one (29-46)

read_each (48-62)

update_status (64-84)

backend/app/models/doc_transformation_job.py (2)

DocTransformationJob (16-25)

TransformationStatus (9-13)

backend/app/tests/utils/document.py (4)

DocumentStore (55-82)

project (67-68)

put (70-75)

fill (81-82)

backend/app/tests/utils/utils.py (2)

get_project (70-89)

SequentialUuidGenerator (140-153)

backend/app/tests/conftest.py (1)

db (24-41)

backend/app/tests/api/routes/documents/test_route_document_info.py (1)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/crud/__init__.py (2)

backend/app/crud/doc_transformation_job.py (1)

DocTransformationJobCrud (13-84)

backend/app/crud/document.py (1)

DocumentCrud (14-135)

backend/app/tests/core/doctransformer/test_service/test_start_job.py (7)

backend/app/core/doctransform/service.py (2)

execute_job (42-128)

start_job (24-39)

backend/app/models/document.py (1)

Document (20-40)

backend/app/models/doc_transformation_job.py (2)

DocTransformationJob (16-25)

TransformationStatus (9-13)

backend/app/models/user.py (1)

UserProjectOrg (65-66)

backend/app/tests/core/doctransformer/test_service/base.py (3)

DocTransformTestBase (21-68)

TestDataProvider (71-101)

get_test_transformer_names (94-96)

backend/app/tests/conftest.py (1)

db (24-41)

backend/app/tests/core/doctransformer/test_service/conftest.py (3)

current_user (49-57)

test_document (67-71)

background_tasks (61-63)

backend/app/tests/core/doctransformer/test_service/test_execute_job.py (7)

backend/app/crud/doc_transformation_job.py (3)

DocTransformationJobCrud (13-84)

create (18-27)

read_one (29-46)

backend/app/crud/document.py (1)

DocumentCrud (14-135)

backend/app/core/doctransform/registry.py (1)

TransformationError (8-9)

backend/app/core/doctransform/service.py (1)

execute_job (42-128)

backend/app/models/document.py (1)

Document (20-40)

backend/app/models/doc_transformation_job.py (2)

DocTransformationJob (16-25)

TransformationStatus (9-13)

backend/app/tests/core/doctransformer/test_service/base.py (8)

DocTransformTestBase (21-68)

TestDataProvider (71-101)

get_format_test_cases (75-81)

setup_aws_s3 (24-28)

get_sample_document_content (99-101)

create_s3_document_content (30-46)

verify_s3_content (48-68)

get_content_type_test_cases (84-91)

backend/app/tests/api/routes/documents/test_route_document_upload.py (4)

backend/app/tests/utils/document.py (5)

Route (85-112)

_ (143-144)

_ (148-149)

delete (126-130)

httpx_to_standard (22-23)

backend/app/core/cloud/storage.py (5)

client (30-42)

delete (140-142)

delete (244-259)

AmazonCloudStorageClient (28-84)

create (44-84)

backend/app/tests/conftest.py (2)

client (52-55)

db (24-41)

backend/app/crud/doc_transformation_job.py (1)

create (18-27)

backend/app/tests/crud/collections/test_crud_collection_create.py (1)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/models/doc_transformation_job.py (2)

backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)

upgrade (20-36)

backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py (1)

upgrade (19-66)

backend/app/tests/core/doctransformer/test_service/conftest.py (6)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/models/user.py (2)

User (48-57)

UserProjectOrg (65-66)

backend/app/tests/utils/document.py (4)

DocumentStore (55-82)

get (120-124)

project (67-68)

put (70-75)

backend/app/tests/utils/test_data.py (1)

create_test_api_key (53-67)

backend/app/core/doctransform/service.py (1)

execute_job (42-128)

backend/app/tests/conftest.py (1)

db (24-41)

backend/app/tests/core/doctransformer/test_service/test_integration.py (8)

backend/app/crud/doc_transformation_job.py (3)

DocTransformationJobCrud (13-84)

read_one (29-46)

create (18-27)

backend/app/crud/document.py (1)

DocumentCrud (14-135)

backend/app/core/doctransform/service.py (2)

execute_job (42-128)

start_job (24-39)

backend/app/models/document.py (1)

Document (20-40)

backend/app/models/doc_transformation_job.py (2)

DocTransformationJob (16-25)

TransformationStatus (9-13)

backend/app/tests/core/doctransformer/test_service/base.py (3)

DocTransformTestBase (21-68)

setup_aws_s3 (24-28)

create_s3_document_content (30-46)

backend/app/tests/conftest.py (1)

db (24-41)

backend/app/tests/core/doctransformer/test_service/conftest.py (3)

test_document (67-71)

current_user (49-57)

background_tasks (61-63)

backend/app/tests/crud/collections/test_crud_collection_read_all.py (2)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/tests/crud/test_project.py (1)

test_get_project_by_id (53-60)

backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (2)

backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py (2)

upgrade (20-24)

downgrade (27-31)

backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)

upgrade (20-36)

backend/app/core/doctransform/transformer.py (2)

backend/app/core/doctransform/test_transformer.py (1)

transform (9-15)

backend/app/core/doctransform/zerox_transformer.py (1)

transform (15-44)

backend/app/core/doctransform/registry.py (3)

backend/app/core/doctransform/transformer.py (2)

Transformer (4-13)

transform (8-13)

backend/app/core/doctransform/test_transformer.py (2)

TestTransformer (4-15)

transform (9-15)

backend/app/core/doctransform/zerox_transformer.py (2)

ZeroxTransformer (7-44)

transform (15-44)

backend/app/models/__init__.py (2)

backend/app/models/document.py (4)

Document (20-40)

DocumentPublic (43-61)

DocumentUploadResponse (83-87)

TransformationJobInfo (64-80)

backend/app/models/doc_transformation_job.py (3)

DocTransformationJob (16-25)

DocTransformationJobs (28-30)

TransformationStatus (9-13)

backend/app/tests/api/routes/test_doc_transformation_job.py (7)

backend/app/crud/doc_transformation_job.py (3)

DocTransformationJobCrud (13-84)

create (18-27)

update_status (64-84)

backend/app/models/doc_transformation_job.py (1)

TransformationStatus (9-13)

backend/app/tests/utils/document.py (5)

DocumentStore (55-82)

project (67-68)

put (70-75)

get (120-124)

fill (81-82)

backend/app/tests/utils/utils.py (1)

get_project (70-89)

backend/app/models/api_key.py (1)

APIKeyPublic (23-25)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/tests/conftest.py (3)

db (24-41)

user_api_key (89-91)

superuser_api_key (83-85)

backend/app/tests/core/doctransformer/test_service/base.py (4)

backend/app/core/cloud/storage.py (1)

AmazonCloudStorageClient (28-84)

backend/app/models/document.py (1)

Document (20-40)

backend/app/models/project.py (1)

Project (29-60)

backend/app/core/config.py (1)

AWS_S3_BUCKET (76-77)

backend/app/api/routes/documents.py (5)

backend/app/models/document.py (4)

Document (20-40)

DocumentPublic (43-61)

DocumentUploadResponse (83-87)

TransformationJobInfo (64-80)

backend/app/utils.py (2)

APIResponse (27-48)

success_response (34-37)

backend/app/core/cloud/storage.py (2)

get_signed_url (135-137)

get_signed_url (217-242)

backend/app/core/doctransform/registry.py (4)

get_file_format (51-57)

is_transformation_supported (66-68)

get_available_transformers (70-72)

resolve_transformer (74-96)

backend/app/core/doctransform/service.py (1)

start_job (24-39)

backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py (2)

backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (1)

upgrade (20-34)

backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)

upgrade (20-36)

backend/app/core/doctransform/zerox_transformer.py (2)

backend/app/core/doctransform/transformer.py (2)

Transformer (4-13)

transform (8-13)

backend/app/core/doctransform/test_transformer.py (1)

transform (9-15)

backend/app/models/document.py (3)

backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)

upgrade (20-36)

backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py (1)

upgrade (19-66)

backend/app/models/document_collection.py (1)

DocumentCollection (9-23)

backend/app/core/doctransform/service.py (7)

backend/app/crud/doc_transformation_job.py (4)

DocTransformationJobCrud (13-84)

create (18-27)

update_status (64-84)

read_one (29-46)

backend/app/crud/document.py (1)

DocumentCrud (14-135)

backend/app/models/document.py (1)

Document (20-40)

backend/app/models/doc_transformation_job.py (1)

TransformationStatus (9-13)

backend/app/core/cloud/storage.py (1)

get_cloud_storage (262-272)

backend/app/core/doctransform/registry.py (1)

convert_document (98-115)

backend/app/tests/core/doctransformer/test_service/conftest.py (2)

current_user (49-57)

background_tasks (61-63)

backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (5)

backend/app/crud/doc_transformation_job.py (3)

DocTransformationJobCrud (13-84)

create (18-27)

read_one (29-46)

backend/app/core/doctransform/service.py (1)

execute_job (42-128)

backend/app/tests/core/doctransformer/test_service/base.py (8)

DocTransformTestBase (21-68)

MockHelpers (104-128)

setup_aws_s3 (24-28)

create_s3_document_content (30-46)

failing_convert_document (111-120)

create_failing_convert_document (108-121)

persistent_failing_convert_document (126-127)

create_persistent_failing_convert_document (124-128)

backend/app/tests/conftest.py (1)

db (24-41)

backend/app/tests/core/doctransformer/test_service/conftest.py (1)

fast_execute_job (33-45)

backend/app/api/routes/doc_transformation_job.py (3)

backend/app/models/doc_transformation_job.py (2)

DocTransformationJob (16-25)

DocTransformationJobs (28-30)

backend/app/crud/doc_transformation_job.py (3)

DocTransformationJobCrud (13-84)

read_one (29-46)

read_each (48-62)

backend/app/utils.py (2)

APIResponse (27-48)

success_response (34-37)

🪛 GitHub Actions: AI Platform CI

backend/app/core/doctransform/test_transformer.py