-
Couldn't load subscription status.
- Fork 5
Implement document transformation pipeline to improve RAG performance #363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughAdds an asynchronous document transformation subsystem: new DB migrations, models, CRUD, API routes and docs; transformer registry with a Zerox PDF→Markdown transformer; background job service with retries; Docker and pyproject dependency updates (poppler, py-zerox); comprehensive unit and integration tests. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant API as Documents API (/upload)
participant DB as DB (Session)
participant Storage as Cloud Storage
participant BG as BackgroundTasks
participant Svc as Transform Service
participant Reg as Transformer Registry
Client->>API: POST /documents/upload (file[, target_format, transformer])
API->>API: get_file_format(filename)
API->>Storage: put(file)
API->>DB: Create Document
alt target_format provided
API->>Svc: start_job(db, user, doc_id, transformer, target_format, BG)
Svc-->>API: job_id
API-->>Client: DocumentUploadResponse + transformation_job
note over BG,Svc: Async execution
BG->>Svc: execute_job(project_id, job_id, transformer, target_format)
Svc->>DB: Mark job PROCESSING
Svc->>Storage: stream(source) -> tmp_in
Svc->>Reg: convert_document(tmp_in, tmp_out, transformer)
Reg->>Svc: transformed file
Svc->>Storage: put(transformed)
Svc->>DB: Create transformed Document (source_document_id link)
Svc->>DB: Mark job COMPLETED (transformed_document_id)
else no target_format
API-->>Client: DocumentUploadResponse (transformation_job = null)
end
sequenceDiagram
autonumber
actor Client
participant API as Jobs API (/documents/transformations)
participant DB as DB (Session)
Client->>API: GET /{job_id}
API->>DB: DocTransformationJobCrud.read_one(job_id)
DB-->>API: Job
API-->>Client: APIResponse(Job)
Client->>API: GET /?job_ids=uuid1,uuid2
API->>API: Parse/validate UUIDs
API->>DB: read_each({uuid1, uuid2})
DB-->>API: [Jobs]
API-->>Client: APIResponse({jobs, jobs_not_found})
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 💡 Knowledge Base configuration:
You can enable these sources in your CodeRabbit configuration. 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 19
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
backend/pyproject.toml (1)
6-33: Trim runtime deps; move test/dev-only packages to dev-dependencies.pytest, pre-commit, moto, openai_responses shouldn’t ship in prod images. Keep them in dev deps and exclude on image build.
[project] dependencies = [ @@ - "moto[s3]>=5.1.1", @@ - "pytest>=7.4.4", - "pre-commit>=3.8.0", - "openai_responses", @@ ] [tool.uv] dev-dependencies = [ "pytest<8.0.0,>=7.4.3", "mypy<2.0.0,>=1.8.0", "ruff<1.0.0,>=0.2.2", "pre-commit<4.0.0,>=3.6.2", "types-passlib<2.0.0.0,>=1.7.7.20240106", "coverage<8.0.0,>=7.4.3", + "moto[s3]>=5.1.1", + "openai_responses", ]Also consider pinning upper bounds for fast-moving libs (openai, boto3) if you need reproducible builds.
backend/app/api/routes/documents.py (1)
95-103: Separate create vs update in DocumentCrud
DocumentCrud.update unconditionally doessession.add()on a transient Document, causing an INSERT every time (and duplicate‐key errors when the record already exists). Implement acreate()that adds new rows and refactorupdate()to first load the existing document (e.g. viaread_oneorsession.merge) before applying changes. backend/app/crud/document.py: def update (±lines 98–106)
🧹 Nitpick comments (63)
backend/app/tests/utils/document.py (2)
35-35: Remove stray blank line to avoid churn with Black.Minor style nit; keeps diffs cleaner.
def __iter__(self): return self -
46-51: Verify intentional extension mismatch (.xyz fname vs .txt object key).
fnameuses.xyzwhileobject_store_urluses.txt. If not intentional, align them to prevent brittle tests.Possible fix (choose one):
- fname=f"{doc_id}.xyz", + fname=f"{doc_id}.txt",or
- key = f"{self.project.storage_path}/{doc_id}.txt" + key = f"{self.project.storage_path}/{doc_id}.xyz"backend/Dockerfile (1)
10-14: Slimmer image; avoid bringing dev deps.
- Use --no-install-recommends for apt and drop curl unless required.
- Exclude dev deps at sync time.
-RUN apt-get update && apt-get install -y \ - curl \ - poppler-utils \ - && rm -rf /var/lib/apt/lists/* +RUN apt-get update && apt-get install -y --no-install-recommends \ + poppler-utils \ + && rm -rf /var/lib/apt/lists/* @@ -RUN --mount=type=cache,target=/root/.cache/uv \ - uv sync --frozen --no-install-project +RUN --mount=type=cache,target=/root/.cache/uv \ + uv sync --frozen --no-install-project --no-devIf curl is needed (healthchecks/debug), keep it; otherwise omit.
Also applies to: 30-31
backend/app/core/doctransform/transformer.py (1)
7-13: Define error contract in the interface.Document expected exception type (e.g., TransformationError) to standardize handling upstream (registry/service).
def transform(self, input_path: Path, output_path: Path) -> Path: """ Transform the document at input_path and write the result to output_path. Returns the path to the transformed file. + + Raises: + TransformationError: if the transformation fails. """ passbackend/app/api/docs/documents/upload.md (1)
3-5: Tighten wording and fix minor grammar.Short, clear phrasing; fix articles.
-- If a target format is specified, a transformation job will also be created to transform document into target format in the background. The response will include both the uploaded document details and information about the transformation job. +- If a target format is specified, a background transformation job will be created to transform the document into the target format. The response will include the uploaded document details and information about the transformation job. @@ - - zerox + - zerox @@ -Available transformer names and their implementations, default transformer is zerox: +Available transformer names and their implementations; the default transformer is zerox:for the structural doc update.
Also applies to: 10-17
backend/app/core/doctransform/zerox_transformer.py (2)
23-31: Stronger result checks and robust write.Treat empty page contents as failure, and be tolerant to encoding glitches.
- if result is None or not hasattr(result, "pages") or result.pages is None: + if result is None or not hasattr(result, "pages") or result.pages is None: raise RuntimeError("Zerox returned no pages. This may indicate a PDF/image conversion failure (is Poppler installed and in PATH?)") - with output_path.open("w", encoding="utf-8") as output_file: + pages = list(result.pages or []) + if not any(getattr(p, "content", None) for p in pages): + raise RuntimeError("Zerox returned pages without content. PDF/image conversion may have failed.") + + with output_path.open("w", encoding="utf-8", errors="replace") as output_file: - for page in result.pages: + for page in pages: if not getattr(page, "content", None): continue output_file.write(page.content) output_file.write("\n\n")
12-14: Make model configurable.Default is fine, but allow env/config override to avoid hardcoding OpenAI model in code.
- def __init__(self, model: str = "gpt-4o"): - self.model = model + def __init__(self, model: str | None = None): + # e.g., read from settings if not provided + self.model = model or os.getenv("ZER0X_MODEL", "gpt-4o")(Remember to import os.)
README.md (1)
14-14: Add install commands for Poppler (per OS) and Docker note
Make it copy-pasteable and clarify Docker images already install it (if true).Apply:
-- **Poppler** – Install Poppler, required for PDF processing. +- **Poppler** – required for PDF processing. + - macOS: `brew install poppler` + - Ubuntu/Debian: `sudo apt-get update && sudo apt-get install -y poppler-utils` + - Fedora: `sudo dnf install -y poppler-utils` + - Windows (choco): `choco install poppler` + - Note: If you use our Docker setup, ensure the image includes Poppler (pdftoppm). Otherwise, add it to the Dockerfile.backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1)
15-15: Unused import: get_project_by_id
This import isn’t used in this test; remove to keep tests tidy.-from app.crud import get_project_by_idbackend/app/tests/crud/collections/test_crud_collection_read_all.py (1)
6-6: Unused import: get_project_by_id
Not referenced in this module; drop it.-from app.crud import CollectionCrud, get_project_by_id +from app.crud import CollectionCrudbackend/app/tests/api/routes/documents/test_route_document_info.py (1)
4-4: Unused import: get_project_by_id
Remove to avoid lint noise.-from app.crud import get_project_by_idbackend/app/core/doctransform/test_transformer.py (1)
9-15: Harden write step by ensuring parent dir exists.Preempt failures when the parent directory hasn’t been created by the caller.
Apply this diff:
def transform(self, input_path: Path, output_path: Path) -> Path: - content = ( + content = ( "Lorem ipsum dolor sit amet, consectetur adipiscing elit, " "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua." ) - output_path.write_text(content, encoding='utf-8') + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text(content, encoding="utf-8") return output_pathbackend/app/tests/crud/test_doc_transformation_job.py (2)
37-46: Prefer uuid4() for nonexistent IDs in tests.SequentialUuidGenerator always starts at 0 when newly instantiated; uuid4() better reflects real IDs and avoids accidental collisions.
- invalid_id = next(SequentialUuidGenerator()) + from uuid import uuid4 + invalid_id = uuid4()Apply similarly to other tests expecting a non-existent UUID.
Also applies to: 74-83, 208-217
223-231: Clarify error_message retention semantics on state change.Test asserts error persists when moving FAILED → PROCESSING. If intended, also add a test that explicitly clears the error when an empty string is passed, and one that preserves when None is passed.
- updated_job = crud.update_status(job.id, TransformationStatus.PROCESSING) + # Explicitly clear the error if desired behavior is to reset + # updated_job = crud.update_status(job.id, TransformationStatus.PROCESSING, error_message="") + updated_job = crud.update_status(job.id, TransformationStatus.PROCESSING)If the desired behavior is to clear on retry, adjust CRUD accordingly and update this test.
backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (2)
30-33: Add helpful indexes and clean enum teardown.
- Index by source_document_id and status/created_at for common lookups.
- Drop the enum type on downgrade to avoid orphaned types (Postgres).
def upgrade(): @@ - ) + ) + op.create_index('ix_doc_transformation_job_source_document_id', 'doc_transformation_job', ['source_document_id']) + op.create_index('ix_doc_transformation_job_status_created_at', 'doc_transformation_job', ['status', 'created_at']) @@ def downgrade(): - # ### commands auto generated by Alembic - please adjust! ### - op.drop_table('doc_transformation_job') + op.drop_index('ix_doc_transformation_job_status_created_at', table_name='doc_transformation_job') + op.drop_index('ix_doc_transformation_job_source_document_id', table_name='doc_transformation_job') + op.drop_table('doc_transformation_job') + # Best-effort cleanup for Postgres enum + try: + op.execute("DROP TYPE IF EXISTS transformationstatus") + except Exception: + passAlso applies to: 37-40
30-31: Consider explicit ON DELETE behavior on FKs.
- source_document_id: RESTRICT (prevents deleting source while jobs exist) or CASCADE if you want jobs removed with the source.
- transformed_document_id: SET NULL is typical if the transformed doc is deleted independently.
- sa.ForeignKeyConstraint(['source_document_id'], ['document.id'], ), - sa.ForeignKeyConstraint(['transformed_document_id'], ['document.id'], ), + sa.ForeignKeyConstraint(['source_document_id'], ['document.id'], ondelete='RESTRICT'), + sa.ForeignKeyConstraint(['transformed_document_id'], ['document.id'], ondelete='SET NULL'),Validate with your deletion semantics.
Also applies to: 24-25
backend/app/tests/core/doctransformer/test_service/test_start_job.py (2)
4-4: Replace typing.Tuple with built-in tuple.Aligns with Ruff UP035 and modern typing.
-from typing import Any, Tuple +from typing import Any +from typing import Tuple as _Tuple # temporary to avoid rename churnOr update annotations directly:
- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],Apply across this file.
29-36: Standardize test transformer names to avoid confusion.Registry test transformer is “test”; in one place you use “test-transformer”. Since execute_job isn’t run here, it won’t fail, but consistency helps readability.
- transformer_name="test-transformer", + transformer_name="test",Also consider parameterizing formats using TestDataProvider.get_format_test_cases() to avoid drift.
Also applies to: 112-129, 130-156
backend/app/tests/api/routes/test_doc_transformation_job.py (2)
37-38: Assert exact ID, not just non-null.Stronger check improves signal and guards serialization issues.
- assert data["data"]["id"] is not None + assert data["data"]["id"] == str(created_job.id)
1-1: Pre-commit/Black touched this file. Run hooks locally.Pipeline shows trailing whitespace/formatting fixes. Please run pre-commit locally to avoid CI churn.
backend/app/models/doc_transformation_job.py (3)
21-25: Make nullability explicit for optional DB fields.Being explicit helps migrations and DB introspection tools.
- transformed_document_id: Optional[UUID] = Field(default=None, foreign_key="document.id") + transformed_document_id: Optional[UUID] = Field(default=None, foreign_key="document.id", nullable=True) @@ - error_message: Optional[str] = Field(default=None) + error_message: Optional[str] = Field(default=None, nullable=True)
16-26: Consider indexes for frequent filters/joins.Given frequent joins on
source_document_idand filters bystatus, add DB indexes (and Alembic migration) for(source_document_id),(status), and optionally(created_at DESC)for recent jobs.
1-1: EOF newline/formatting were auto-fixed by CI.Keep pre-commit enabled locally to prevent reformat diffs.
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (5)
4-4: Modernize typing: use collections.abc.Callable and builtin tuple.Aligns with Ruff UP035 and current typing best practices.
-from typing import Any, Callable, Tuple +from typing import Any +from collections.abc import Callable @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project] + test_document: tuple[Document, Project] @@ - test_document: Tuple[Document, Project] + test_document: tuple[Document, Project]Also applies to: 33-33, 84-84, 110-110, 145-145, 185-185, 218-218
126-126: Avoid catching broad Exception in tests.Assert the specific failure kinds to reduce flakes and satisfy Ruff B017.
- with pytest.raises(Exception): + with pytest.raises((RetryError, HTTPException)):
225-251: Use expected_content_type by asserting the S3 ContentType.Prevents unused-var warning (B007) and verifies correct headers are stored.
+from urllib.parse import urlparse @@ +from app.core.config import settings @@ for target_format, expected_content_type, expected_extension in format_extensions: @@ transformed_doc = document_crud.read_one(job.transformed_document_id) assert transformed_doc is not None assert transformed_doc.fname.endswith(expected_extension) + # Assert S3 ContentType + parsed = urlparse(transformed_doc.object_store_url) + key = parsed.path.lstrip("/") + head = aws.client.head_object(Bucket=settings.AWS_S3_BUCKET, Key=key) + assert head["ContentType"] == expected_content_typeAlso applies to: 8-8, 13-13
47-47: Remove redundant db.commit() calls.
DocTransformationJobCrud.createcommits already; extra commits slow tests.- db.commit()Also applies to: 121-121, 156-156, 196-196, 231-231
1-1: Pre-commit/Black formatting tweaks were applied by CI.Run hooks locally to keep CI green and diffs minimal.
backend/app/models/document.py (2)
35-35: Make deleted_at explicitly optional with defaultBe explicit to avoid accidental “required field” interpretation in pydantic/SQLModel.
Apply this diff:
- deleted_at: datetime | None + deleted_at: datetime | None = None
36-40: Add ondelete=SET NULL and index for source_document_id in the migration
- In backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py, change the FK creation to
op.create_foreign_key(None, 'document', 'document', ['source_document_id'], ['id'], ondelete='SET NULL').- Add an index for faster lookups, e.g.
op.create_index('ix_document_source_document_id', 'document', ['source_document_id']).backend/app/api/routes/doc_transformation_job.py (1)
36-47: Leverage FastAPI validation for UUID lists (optional)Accept job_ids as list[UUID] to drop manual parsing and 422 handling.
Example:
- job_ids: str = Query(..., description="Comma-separated list of transformation job IDs"), + job_ids: list[UUID] = Query(..., description="Repeat ?job_ids=<uuid> per id"), @@ - job_id_list = [] - invalid_ids = [] - for jid in job_ids.split(","): - ... - if invalid_ids: - raise HTTPException(...) + job_id_list = job_idsbackend/app/tests/core/doctransformer/test_service/conftest.py (3)
5-5: Modernize typing importsUse collections.abc for Callable/Generator and built-in tuple syntax.
Apply this diff:
-from typing import Any, Callable, Generator, Tuple +from typing import Any +from collections.abc import Callable, Generator
67-71: Use built-in tuple type annotationAligns with modern typing and Ruff UP035.
Apply this diff:
-def test_document(db: Session, current_user: UserProjectOrg) -> Tuple[Document, Project]: +def test_document(db: Session, current_user: UserProjectOrg) -> tuple[Document, Project]:
22-30: Prefer monkeypatch for env vars in tests (optional)Using pytest’s monkeypatch keeps env scoped to the test run.
backend/app/tests/core/doctransformer/test_service/test_integration.py (6)
4-4: Drop deprecated Tuple importUse built-in tuple annotations.
Apply this diff:
-from typing import Tuple
25-27: Update annotation to tuple[...]Apply this diff:
- test_document: Tuple[Document, Project] + test_document: tuple[Document, Project]
82-85: Update annotation to tuple[...]Apply this diff:
- test_document: Tuple[Document, Project] + test_document: tuple[Document, Project]
120-123: Update annotation to tuple[...]Apply this diff:
- test_document: Tuple[Document, Project] + test_document: tuple[Document, Project]
93-96: Unused loop variable and redundant commitRename i to _ and drop extra commit; create() already commits.
Apply this diff:
- for i in range(3): + for _ in range(3): job = job_crud.create(source_document_id=document.id) jobs.append(job) - db.commit()
154-154: Unused loop indexRename i to _i or remove enumerate.
Apply this diff:
- for i, (job, target_format) in enumerate(jobs): + for _i, (job, target_format) in enumerate(jobs):backend/app/core/doctransform/service.py (2)
66-72: Close streaming body after copy (optional)Avoid leaking connections/file descriptors from storage.stream.
Apply this diff:
- body = storage.stream(source_doc_object_store_url) - tmp_dir = Path(tempfile.mkdtemp()) - tmp_in = tmp_dir / f"{source_doc_id}" - with open(tmp_in, "wb") as f: - shutil.copyfileobj(body, f) + body = storage.stream(source_doc_object_store_url) + tmp_dir = Path(tempfile.mkdtemp()) + tmp_in = tmp_dir / f"{source_doc_id}" + try: + with open(tmp_in, "wb") as f: + shutil.copyfileobj(body, f) + finally: + try: + body.close() + except Exception: + pass
76-78: Idempotency and partial-failure risks (optional)On retry, a new transformed_doc_id is generated; failures after upload but before job status update can orphan objects. Consider deriving object key from job_id and target_format, writing to a temp key then moving/overwriting on completion.
Also applies to: 98-114
backend/app/crud/doc_transformation_job.py (3)
3-4: Tighten imports; modernize typing and drop unused symbols.
Listis unused; prefer built-inlist[...]types.joinis imported but never used.Apply:
-from typing import List, Optional -from sqlmodel import Session, select, and_, join +from typing import Optional +from sqlmodel import Session, select, and_
29-46: Clarify behavior when the source document is soft-deleted.
read_onefiltersDocument.is_deleted.is_(False). If a source doc is soft-deleted after job creation, this method will 404 the job. Is that intentional for readers of job status? If not, drop theis_deletedfilter here (or only for internal callers) so jobs remain discoverable.Example:
- Document.project_id == self.project_id, - Document.is_deleted.is_(False) + Document.project_id == self.project_id,
48-63: Guard against empty IN-clause and consider input type.
- If
job_idsis empty, many dialects renderIN ()which is invalid; return[]early.- Consider accepting
Iterable[UUID]for flexibility.- def read_each(self, job_ids: set[UUID]) -> list[DocTransformationJob]: + def read_each(self, job_ids: set[UUID]) -> list[DocTransformationJob]: + if not job_ids: + return [] statement = ( select(DocTransformationJob)backend/app/tests/api/routes/documents/test_route_document_upload.py (4)
25-40: Close the file handle to avoid descriptor leaks in tests.Wrap the file open in a context manager so the descriptor is closed even if the request fails.
- def put(self, route: Route, scratch: Path, target_format: str = None, transformer: str = None): + def put(self, route: Route, scratch: Path, target_format: str = None, transformer: str = None): (mtype, _) = mimetypes.guess_type(str(scratch)) - files = {"src": (str(scratch), scratch.open("rb"), mtype)} - - data = {} + data = {} if target_format: data["target_format"] = target_format if transformer: data["transformer"] = transformer - - return self.client.post( - str(route), - headers={"X-API-KEY": self.user_api_key.key}, - files=files, - data=data, - ) + with scratch.open("rb") as fh: + files = {"src": (scratch.name, fh, mtype)} + return self.client.post( + str(route), + headers={"X-API-KEY": self.user_api_key.key}, + files=files, + data=data, + )
200-216: Unsupported transformation error path looks good but message is brittle.If server error messages change, this may become flaky. Consider matching a stable prefix or error code if available.
217-236: Invalid transformer case covered.Assertions look good, same note about brittle full-string match applies.
261-285: Fix unused variable flagged by Ruff (F841).
responseis assigned but unused (Line 277). Either remove the assignment or assert on it to strengthen the test.Option A — remove assignment:
- response = httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown")) + httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown"))Option B — use the response:
- response = httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown")) + response = httpx_to_standard(uploader.put(route, pdf_scratch, target_format="markdown")) + assert response.success is True + assert response.data["transformation_job"]["job_id"] == mock_job_idbackend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (6)
5-5: Modernize typing imports (UP035).Import
Callablefromcollections.abcand prefer built-in generics.-from typing import Any, Callable, Tuple +from typing import Any +from collections.abc import Callable
11-11: Remove unused import.
RetryErroris not used.-from tenacity import RetryError +# from tenacity import RetryError # unused
14-14: Remove unused import.
execute_jobis not referenced directly.-from app.core.doctransform.service import execute_job +# from app.core.doctransform.service import execute_job # unused
24-63: Avoid blind exception assertions; match expected message.Strengthen the test by asserting on the error text (Ruff B017).
- with pytest.raises(Exception): + with pytest.raises(Exception, match="S3 upload failed"): fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown" )Also, update annotations to use built-in
tuple:- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],
101-139: Assert specific error on exhausted retries.Don't catch
Exceptionblindly; verify the message.- with pytest.raises(Exception): + with pytest.raises(Exception, match="Persistent error"): fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown" )And modernize the annotation:
- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],
140-179: Also match message for DB error case.Tighten the assertion to the expected failure text; update tuple typing.
- with pytest.raises(Exception): + with pytest.raises(Exception, match="Database error during document creation"): fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown" )- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],backend/app/api/routes/documents.py (4)
3-3: Clean imports and modernize typing (fix Ruff F811/UP035).
- Drop duplicate HTTPException import and unused JSONResponse.
- Use built-in generics instead of typing.List; remove the typing import.
-from typing import List, Optional +# typing imports not required; use built-in generics -from fastapi import APIRouter, File, UploadFile, Query, Form, BackgroundTasks, HTTPException +from fastapi import APIRouter, File, UploadFile, Query, Form, BackgroundTasks, HTTPException -from fastapi.responses import JSONResponse -from fastapi import HTTPException - response_model=APIResponse[List[DocumentPublic]], + response_model=APIResponse[list[DocumentPublic]],Also applies to: 6-6, 8-9, 32-32
70-77: Guard against no-op transforms (source == target).Rejects unnecessary jobs early.
# validate if transformation is possible or not if target_format: + if source_format == target_format: + raise HTTPException( + status_code=400, + detail=f"Source and target formats are the same ('{source_format}'); no transformation needed." + ) if not is_transformation_supported(source_format, target_format): raise HTTPException( status_code=400, detail=f"Transformation from {source_format} to {target_format} is not supported" )
114-121: Pass UUID, not str, to TransformationJobInfo.job_id.The model field is UUID; avoid implicit coercion.
job_info = TransformationJobInfo( message=f"Document accepted for transformation from {source_format} to {target_format}.", - job_id=str(job_id), + job_id=job_id, source_format=source_format, target_format=target_format, transformer=actual_transformer, status_check_url=f"/documents/transformations/{job_id}" )
45-49: Return 201 Created for uploads.Aligns HTTP semantics with resource creation.
@router.post( "/upload", description=load_description("documents/upload.md"), - response_model=APIResponse[DocumentUploadResponse], + response_model=APIResponse[DocumentUploadResponse], + status_code=201, )backend/app/core/doctransform/registry.py (2)
2-2: Adopt built-in generics and union syntax (fix Ruff UP035).Modernize type hints; drop typing imports.
-from typing import Type, Dict, Set, Tuple, Optional +# Built-in generics and | None are used; no typing imports needed. -TRANSFORMERS: Dict[str, Type[Transformer]] = { +TRANSFORMERS: dict[str, type[Transformer]] = { @@ -SUPPORTED_TRANSFORMATIONS: Dict[Tuple[str, str], Dict[str, str]] = { +SUPPORTED_TRANSFORMATIONS: dict[tuple[str, str], dict[str, str]] = { @@ -EXTENSION_TO_FORMAT: Dict[str, str] = { +EXTENSION_TO_FORMAT: dict[str, str] = { @@ -FORMAT_TO_EXTENSION: Dict[str, str] = { +FORMAT_TO_EXTENSION: dict[str, str] = { @@ -def get_supported_transformations() -> Dict[Tuple[str, str], Set[str]]: +def get_supported_transformations() -> dict[tuple[str, str], set[str]]: @@ -def get_available_transformers(source_format: str, target_format: str) -> Dict[str, str]: +def get_available_transformers(source_format: str, target_format: str) -> dict[str, str]: @@ -def resolve_transformer(source_format: str, target_format: str, transformer_name: Optional[str] = None) -> str: +def resolve_transformer(source_format: str, target_format: str, transformer_name: str | None = None) -> str:Also applies to: 12-16, 19-27, 30-39, 42-49, 59-65, 70-73, 74-79
98-111: Ensure output directory exists before writing.Prevents failures when output_path parents are missing.
def convert_document(input_path: Path, output_path: Path, transformer_name: str = "default") -> Path: @@ - transformer = transformer_cls() + # Make sure output directory exists + output_path.parent.mkdir(parents=True, exist_ok=True) + transformer = transformer_cls()backend/app/tests/core/doctransformer/test_service/base.py (2)
13-13: Use built-in generics in tests (fix Ruff UP035).Replace typing.List with list[...] in return annotations; drop typing import.
-from typing import List +# typing.List not needed; use built-in generics @@ - def get_format_test_cases() -> List[tuple]: + def get_format_test_cases() -> list[tuple]: @@ - def get_content_type_test_cases() -> List[tuple]: + def get_content_type_test_cases() -> list[tuple]: @@ - def get_test_transformer_names() -> List[str]: + def get_test_transformer_names() -> list[str]:Also applies to: 75-81, 84-91, 93-97
124-128: Silence unused args warnings in persistent failing mock.Prefix with underscores.
- def create_persistent_failing_convert_document(error_message: str = "Persistent error"): + def create_persistent_failing_convert_document(error_message: str = "Persistent error"): """Create a side effect function that always fails.""" - def persistent_failing_convert_document(*args, **kwargs): + def persistent_failing_convert_document(*_args, **_kwargs): raise Exception(error_message) return persistent_failing_convert_document
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
backend/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (35)
README.md(1 hunks)backend/Dockerfile(1 hunks)backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py(1 hunks)backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py(1 hunks)backend/app/api/docs/documents/upload.md(1 hunks)backend/app/api/main.py(2 hunks)backend/app/api/routes/doc_transformation_job.py(1 hunks)backend/app/api/routes/documents.py(3 hunks)backend/app/core/doctransform/registry.py(1 hunks)backend/app/core/doctransform/service.py(1 hunks)backend/app/core/doctransform/test_transformer.py(1 hunks)backend/app/core/doctransform/transformer.py(1 hunks)backend/app/core/doctransform/zerox_transformer.py(1 hunks)backend/app/crud/__init__.py(1 hunks)backend/app/crud/doc_transformation_job.py(1 hunks)backend/app/models/__init__.py(1 hunks)backend/app/models/doc_transformation_job.py(1 hunks)backend/app/models/document.py(3 hunks)backend/app/tests/api/routes/documents/test_route_document_info.py(1 hunks)backend/app/tests/api/routes/documents/test_route_document_list.py(1 hunks)backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py(1 hunks)backend/app/tests/api/routes/documents/test_route_document_remove.py(1 hunks)backend/app/tests/api/routes/documents/test_route_document_upload.py(4 hunks)backend/app/tests/api/routes/test_doc_transformation_job.py(1 hunks)backend/app/tests/core/doctransformer/test_service/base.py(1 hunks)backend/app/tests/core/doctransformer/test_service/conftest.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_execute_job.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_integration.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_start_job.py(1 hunks)backend/app/tests/crud/collections/test_crud_collection_create.py(1 hunks)backend/app/tests/crud/collections/test_crud_collection_read_all.py(1 hunks)backend/app/tests/crud/test_doc_transformation_job.py(1 hunks)backend/app/tests/utils/document.py(1 hunks)backend/pyproject.toml(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (30)
backend/app/core/doctransform/test_transformer.py (2)
backend/app/core/doctransform/transformer.py (2)
Transformer(4-13)transform(8-13)backend/app/core/doctransform/zerox_transformer.py (1)
transform(15-44)
backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1)
backend/app/crud/project.py (1)
get_project_by_id(37-38)
backend/app/tests/api/routes/documents/test_route_document_list.py (2)
backend/app/crud/project.py (1)
get_project_by_id(37-38)backend/app/tests/crud/test_project.py (1)
test_get_project_by_id(53-60)
backend/app/crud/doc_transformation_job.py (3)
backend/app/crud/document.py (2)
DocumentCrud(14-135)__init__(15-17)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-25)TransformationStatus(9-13)backend/app/models/document.py (1)
Document(20-40)
backend/app/tests/utils/document.py (3)
backend/app/models/project.py (1)
Project(29-60)backend/app/crud/project.py (1)
get_project_by_id(37-38)backend/app/tests/utils/utils.py (1)
SequentialUuidGenerator(140-153)
backend/app/tests/api/routes/documents/test_route_document_remove.py (2)
backend/app/tests/utils/document.py (1)
project(67-68)backend/app/crud/project.py (1)
get_project_by_id(37-38)
backend/app/tests/crud/test_doc_transformation_job.py (5)
backend/app/crud/doc_transformation_job.py (5)
DocTransformationJobCrud(13-84)create(18-27)read_one(29-46)read_each(48-62)update_status(64-84)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-25)TransformationStatus(9-13)backend/app/tests/utils/document.py (4)
DocumentStore(55-82)project(67-68)put(70-75)fill(81-82)backend/app/tests/utils/utils.py (2)
get_project(70-89)SequentialUuidGenerator(140-153)backend/app/tests/conftest.py (1)
db(24-41)
backend/app/tests/api/routes/documents/test_route_document_info.py (1)
backend/app/crud/project.py (1)
get_project_by_id(37-38)
backend/app/crud/__init__.py (2)
backend/app/crud/doc_transformation_job.py (1)
DocTransformationJobCrud(13-84)backend/app/crud/document.py (1)
DocumentCrud(14-135)
backend/app/tests/core/doctransformer/test_service/test_start_job.py (7)
backend/app/core/doctransform/service.py (2)
execute_job(42-128)start_job(24-39)backend/app/models/document.py (1)
Document(20-40)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-25)TransformationStatus(9-13)backend/app/models/user.py (1)
UserProjectOrg(65-66)backend/app/tests/core/doctransformer/test_service/base.py (3)
DocTransformTestBase(21-68)TestDataProvider(71-101)get_test_transformer_names(94-96)backend/app/tests/conftest.py (1)
db(24-41)backend/app/tests/core/doctransformer/test_service/conftest.py (3)
current_user(49-57)test_document(67-71)background_tasks(61-63)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (7)
backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(13-84)create(18-27)read_one(29-46)backend/app/crud/document.py (1)
DocumentCrud(14-135)backend/app/core/doctransform/registry.py (1)
TransformationError(8-9)backend/app/core/doctransform/service.py (1)
execute_job(42-128)backend/app/models/document.py (1)
Document(20-40)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-25)TransformationStatus(9-13)backend/app/tests/core/doctransformer/test_service/base.py (8)
DocTransformTestBase(21-68)TestDataProvider(71-101)get_format_test_cases(75-81)setup_aws_s3(24-28)get_sample_document_content(99-101)create_s3_document_content(30-46)verify_s3_content(48-68)get_content_type_test_cases(84-91)
backend/app/tests/api/routes/documents/test_route_document_upload.py (4)
backend/app/tests/utils/document.py (5)
Route(85-112)_(143-144)_(148-149)delete(126-130)httpx_to_standard(22-23)backend/app/core/cloud/storage.py (5)
client(30-42)delete(140-142)delete(244-259)AmazonCloudStorageClient(28-84)create(44-84)backend/app/tests/conftest.py (2)
client(52-55)db(24-41)backend/app/crud/doc_transformation_job.py (1)
create(18-27)
backend/app/tests/crud/collections/test_crud_collection_create.py (1)
backend/app/crud/project.py (1)
get_project_by_id(37-38)
backend/app/models/doc_transformation_job.py (2)
backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
upgrade(20-36)backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py (1)
upgrade(19-66)
backend/app/tests/core/doctransformer/test_service/conftest.py (6)
backend/app/crud/project.py (1)
get_project_by_id(37-38)backend/app/models/user.py (2)
User(48-57)UserProjectOrg(65-66)backend/app/tests/utils/document.py (4)
DocumentStore(55-82)get(120-124)project(67-68)put(70-75)backend/app/tests/utils/test_data.py (1)
create_test_api_key(53-67)backend/app/core/doctransform/service.py (1)
execute_job(42-128)backend/app/tests/conftest.py (1)
db(24-41)
backend/app/tests/core/doctransformer/test_service/test_integration.py (8)
backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(13-84)read_one(29-46)create(18-27)backend/app/crud/document.py (1)
DocumentCrud(14-135)backend/app/core/doctransform/service.py (2)
execute_job(42-128)start_job(24-39)backend/app/models/document.py (1)
Document(20-40)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-25)TransformationStatus(9-13)backend/app/tests/core/doctransformer/test_service/base.py (3)
DocTransformTestBase(21-68)setup_aws_s3(24-28)create_s3_document_content(30-46)backend/app/tests/conftest.py (1)
db(24-41)backend/app/tests/core/doctransformer/test_service/conftest.py (3)
test_document(67-71)current_user(49-57)background_tasks(61-63)
backend/app/tests/crud/collections/test_crud_collection_read_all.py (2)
backend/app/crud/project.py (1)
get_project_by_id(37-38)backend/app/tests/crud/test_project.py (1)
test_get_project_by_id(53-60)
backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (2)
backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py (2)
upgrade(20-24)downgrade(27-31)backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
upgrade(20-36)
backend/app/core/doctransform/transformer.py (2)
backend/app/core/doctransform/test_transformer.py (1)
transform(9-15)backend/app/core/doctransform/zerox_transformer.py (1)
transform(15-44)
backend/app/core/doctransform/registry.py (3)
backend/app/core/doctransform/transformer.py (2)
Transformer(4-13)transform(8-13)backend/app/core/doctransform/test_transformer.py (2)
TestTransformer(4-15)transform(9-15)backend/app/core/doctransform/zerox_transformer.py (2)
ZeroxTransformer(7-44)transform(15-44)
backend/app/models/__init__.py (2)
backend/app/models/document.py (4)
Document(20-40)DocumentPublic(43-61)DocumentUploadResponse(83-87)TransformationJobInfo(64-80)backend/app/models/doc_transformation_job.py (3)
DocTransformationJob(16-25)DocTransformationJobs(28-30)TransformationStatus(9-13)
backend/app/tests/api/routes/test_doc_transformation_job.py (7)
backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(13-84)create(18-27)update_status(64-84)backend/app/models/doc_transformation_job.py (1)
TransformationStatus(9-13)backend/app/tests/utils/document.py (5)
DocumentStore(55-82)project(67-68)put(70-75)get(120-124)fill(81-82)backend/app/tests/utils/utils.py (1)
get_project(70-89)backend/app/models/api_key.py (1)
APIKeyPublic(23-25)backend/app/crud/project.py (1)
get_project_by_id(37-38)backend/app/tests/conftest.py (3)
db(24-41)user_api_key(89-91)superuser_api_key(83-85)
backend/app/tests/core/doctransformer/test_service/base.py (4)
backend/app/core/cloud/storage.py (1)
AmazonCloudStorageClient(28-84)backend/app/models/document.py (1)
Document(20-40)backend/app/models/project.py (1)
Project(29-60)backend/app/core/config.py (1)
AWS_S3_BUCKET(76-77)
backend/app/api/routes/documents.py (5)
backend/app/models/document.py (4)
Document(20-40)DocumentPublic(43-61)DocumentUploadResponse(83-87)TransformationJobInfo(64-80)backend/app/utils.py (2)
APIResponse(27-48)success_response(34-37)backend/app/core/cloud/storage.py (2)
get_signed_url(135-137)get_signed_url(217-242)backend/app/core/doctransform/registry.py (4)
get_file_format(51-57)is_transformation_supported(66-68)get_available_transformers(70-72)resolve_transformer(74-96)backend/app/core/doctransform/service.py (1)
start_job(24-39)
backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py (2)
backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py (1)
upgrade(20-34)backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
upgrade(20-36)
backend/app/core/doctransform/zerox_transformer.py (2)
backend/app/core/doctransform/transformer.py (2)
Transformer(4-13)transform(8-13)backend/app/core/doctransform/test_transformer.py (1)
transform(9-15)
backend/app/models/document.py (3)
backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)
upgrade(20-36)backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py (1)
upgrade(19-66)backend/app/models/document_collection.py (1)
DocumentCollection(9-23)
backend/app/core/doctransform/service.py (7)
backend/app/crud/doc_transformation_job.py (4)
DocTransformationJobCrud(13-84)create(18-27)update_status(64-84)read_one(29-46)backend/app/crud/document.py (1)
DocumentCrud(14-135)backend/app/models/document.py (1)
Document(20-40)backend/app/models/doc_transformation_job.py (1)
TransformationStatus(9-13)backend/app/core/cloud/storage.py (1)
get_cloud_storage(262-272)backend/app/core/doctransform/registry.py (1)
convert_document(98-115)backend/app/tests/core/doctransformer/test_service/conftest.py (2)
current_user(49-57)background_tasks(61-63)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (5)
backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(13-84)create(18-27)read_one(29-46)backend/app/core/doctransform/service.py (1)
execute_job(42-128)backend/app/tests/core/doctransformer/test_service/base.py (8)
DocTransformTestBase(21-68)MockHelpers(104-128)setup_aws_s3(24-28)create_s3_document_content(30-46)failing_convert_document(111-120)create_failing_convert_document(108-121)persistent_failing_convert_document(126-127)create_persistent_failing_convert_document(124-128)backend/app/tests/conftest.py (1)
db(24-41)backend/app/tests/core/doctransformer/test_service/conftest.py (1)
fast_execute_job(33-45)
backend/app/api/routes/doc_transformation_job.py (3)
backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-25)DocTransformationJobs(28-30)backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(13-84)read_one(29-46)read_each(48-62)backend/app/utils.py (2)
APIResponse(27-48)success_response(34-37)
🪛 GitHub Actions: AI Platform CI
backend/app/core/doctransform/test_transformer.py
[error] 1-1: Black formatting applied; file reformatted.
backend/app/crud/doc_transformation_job.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/tests/utils/document.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/tests/crud/test_doc_transformation_job.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/api/docs/documents/upload.md
[error] 1-1: End-of-file fixer updated file to ensure trailing newline.
[error] 1-1: Trailing whitespace removed by pre-commit hook.
backend/app/tests/core/doctransformer/test_service/test_start_job.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/tests/core/doctransformer/test_service/test_execute_job.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/tests/api/routes/documents/test_route_document_upload.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/models/doc_transformation_job.py
[error] 1-1: End-of-file fixer updated file to ensure trailing newline.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/tests/core/doctransformer/test_service/conftest.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/tests/core/doctransformer/test_service/test_integration.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py
[error] 1-1: Black formatting applied; file reformatted.
backend/app/core/doctransform/transformer.py
[error] 1-1: Black formatting applied; file reformatted.
backend/app/core/doctransform/registry.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/models/__init__.py
[error] 1-1: Black formatting applied; file reformatted.
backend/app/tests/api/routes/test_doc_transformation_job.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/tests/core/doctransformer/test_service/base.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/api/routes/documents.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py
[error] 1-1: Black formatting applied; file reformatted.
backend/app/core/doctransform/zerox_transformer.py
[error] 1-1: End-of-file fixer updated file to ensure trailing newline.
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/models/document.py
[error] 1-1: Black formatting applied; file reformatted.
backend/app/core/doctransform/service.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py
[error] 1-1: Trailing whitespace removed by pre-commit hook.
[error] 1-1: Black formatting applied; file reformatted.
backend/app/api/routes/doc_transformation_job.py
[error] 1-1: End-of-file fixer updated file to ensure trailing newline.
[error] 1-1: Black formatting applied; file reformatted.
🪛 Ruff (0.12.2)
backend/app/crud/doc_transformation_job.py
3-3: typing.List is deprecated, use list instead
(UP035)
backend/app/tests/core/doctransformer/test_service/test_start_job.py
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py
4-4: Import from collections.abc instead: Callable
Import from collections.abc
(UP035)
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
126-126: Do not assert blind exception: Exception
(B017)
227-227: Loop control variable expected_content_type not used within loop body
Rename unused expected_content_type to _expected_content_type
(B007)
backend/app/tests/api/routes/documents/test_route_document_upload.py
277-277: Local variable response is assigned to but never used
Remove assignment to unused variable response
(F841)
backend/app/models/doc_transformation_job.py
30-30: No newline at end of file
Add trailing newline
(W292)
backend/app/tests/core/doctransformer/test_service/conftest.py
5-5: Import from collections.abc instead: Callable, Generator
Import from collections.abc
(UP035)
5-5: typing.Tuple is deprecated, use tuple instead
(UP035)
backend/app/tests/core/doctransformer/test_service/test_integration.py
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
93-93: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
154-154: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
backend/app/core/doctransform/registry.py
2-2: typing.Type is deprecated, use type instead
(UP035)
2-2: typing.Dict is deprecated, use dict instead
(UP035)
2-2: typing.Set is deprecated, use set instead
(UP035)
2-2: typing.Tuple is deprecated, use tuple instead
(UP035)
backend/app/tests/core/doctransformer/test_service/base.py
13-13: typing.List is deprecated, use list instead
(UP035)
126-126: Unused function argument: args
(ARG001)
126-126: Unused function argument: kwargs
(ARG001)
backend/app/api/routes/documents.py
3-3: typing.List is deprecated, use list instead
(UP035)
9-9: Redefinition of unused HTTPException from line 6
Remove definition: HTTPException
(F811)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py
5-5: Import from collections.abc instead: Callable
Import from collections.abc
(UP035)
5-5: typing.Tuple is deprecated, use tuple instead
(UP035)
50-50: Do not assert blind exception: Exception
(B017)
127-127: Do not assert blind exception: Exception
(B017)
167-167: Do not assert blind exception: Exception
(B017)
backend/app/api/routes/doc_transformation_job.py
58-58: No newline at end of file
Add trailing newline
(W292)
🪛 LanguageTool
backend/app/api/docs/documents/upload.md
[grammar] ~4-~4: There might be a mistake here.
Context: ...nsformation job will also be created to transform document into target format in the back...
(QB_NEW_EN)
[grammar] ~4-~4: There might be a mistake here.
Context: ...l also be created to transform document into target format in the background. The re...
(QB_NEW_EN)
[grammar] ~10-~10: Ensure spelling is correct
Context: ...mations are supported: - pdf → markdown - zerox ### Transformers Available transformer name...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py
Outdated
Show resolved
Hide resolved
backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py
Show resolved
Hide resolved
backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py
Show resolved
Hide resolved
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/api/docs/documents/upload.md (1)
1-18: Fix grammar, clarify parameters, and separate transformations vs. transformers.Current text has grammatical issues and mixes a transformer name under “Supported Transformations”.
-Upload a document to the AI platform. - -- If only a file is provided, the document will be uploaded and stored, and its ID will be returned. -- If a target format is specified, a transformation job will also be created to transform document into target format in the background. The response will include both the uploaded document details and information about the transformation job. - -### Supported Transformations - -The following (source_format → target_format) transformations are supported: - -- pdf → markdown - - zerox - -### Transformers - -Available transformer names and their implementations, default transformer is zerox: - -- `zerox` +Upload a document to the AI platform. + +- If only a file is provided, the document is uploaded and stored; its ID is returned. +- If `target_format` is provided, a background transformation job is created to transform the document into the target format. The response includes the uploaded document details and the transformation job info. + +### Supported transformations + +The following (source_format → target_format) pairs are supported: + +- pdf → markdown (via the `zerox` transformer) + +### Transformers + +Available transformer names (default: `zerox`): + +- `zerox`
♻️ Duplicate comments (5)
backend/app/tests/utils/document.py (1)
30-32: Fail fast if project is missing to avoid AttributeError in next.get_project_by_id returns Optional[Project]; without a guard, Line 40/45 can crash on None. Assert early to keep the fixture deterministic.
self.project: Project = get_project_by_id( session=self.session, project_id=self.project_id ) + if self.project is None: + raise AssertionError( + f"Test setup error: Project id={self.project_id} not found" + )backend/app/core/doctransform/zerox_transformer.py (1)
1-1: Replace asyncio.Runner with thread-safe execution strategy.Runner (or asyncio.run) inside an active event loop can raise RuntimeError in FastAPI contexts. Offload the coroutine to a dedicated thread when a loop is running; otherwise run normally.
-from asyncio import Runner +import asyncio +import concurrent.futuresVerification (search for remaining Runner usages and call-sites):
#!/bin/bash rg -nP -C2 '\bRunner\(' backend rg -nP -C3 'ZeroxTransformer|transform\(' backend/app/core/doctransformbackend/app/crud/doc_transformation_job.py (1)
69-91: Make update_status robust to source-doc soft-deletion.Calling
self.read_oneappliesDocument.is_deleted == False, so status updates can 404 after the source doc is soft-deleted. Query by job_id + project scope without the deletion predicate to allow finalization (COMPLETED/FAILED). This was flagged earlier.def update_status( self, job_id: UUID, status: TransformationStatus, *, error_message: Optional[str] = None, transformed_document_id: Optional[UUID] = None, ) -> DocTransformationJob: - job = self.read_one(job_id) + statement = ( + select(DocTransformationJob) + .join( + Document, DocTransformationJob.source_document_id == Document.id + ) + .where( + and_( + DocTransformationJob.id == job_id, + Document.project_id == self.project_id, + ) + ) + ) + job = self.session.exec(statement).one_or_none() + if job is None: + logger.warning( + f"[DocTransformationJobCrud.update_status] Job not found | id: {job_id}, project_id: {self.project_id}" + ) + raise HTTPException(status_code=404, detail="Transformation job not found") job.status = status job.updated_at = now() if error_message is not None: job.error_message = error_message if transformed_document_id is not None: job.transformed_document_id = transformed_document_idbackend/app/core/doctransform/service.py (1)
60-67: Minor: PROCESSING update depends on soft-delete-sensitive read.If the source doc gets soft-deleted between create and run,
update_statusmay 404 (see CRUD comment). After applying the CRUD fix, this call will be safe.backend/app/api/routes/documents.py (1)
83-106: Predeclare actual_transformer for type-checkers/readability.Avoid potential "possibly-unbound" warnings.
@@ - # validate if transformation is possible or not + # validate if transformation is possible or not + actual_transformer: str | None = None if target_format: @@ - try: - actual_transformer = resolve_transformer( + try: + actual_transformer = resolve_transformer( source_format, target_format, transformer )
🧹 Nitpick comments (31)
backend/app/models/doc_transformation_job.py (2)
16-27: Index FKs and status; define deletion semantics.
- Add DB indexes on source_document_id, transformed_document_id, and status; these will be hot query predicates.
- Consider FK ondelete: SET NULL for transformed_document_id and RESTRICT (or CASCADE) for source_document_id to avoid orphaned rows or unexpected deletions. Requires a migration.
- source_document_id: UUID = Field(foreign_key="document.id") - transformed_document_id: Optional[UUID] = Field( - default=None, foreign_key="document.id" - ) - status: TransformationStatus = Field(default=TransformationStatus.PENDING) + source_document_id: UUID = Field(foreign_key="document.id", index=True) + transformed_document_id: Optional[UUID] = Field( + default=None, foreign_key="document.id", index=True + ) + status: TransformationStatus = Field(default=TransformationStatus.PENDING, index=True)If you align FK ondelete behavior, add an Alembic migration to alter constraints accordingly.
30-32: Provide defaults for response container lists.Avoid forcing callers to always supply both fields; safe defaults improve ergonomics.
-class DocTransformationJobs(SQLModel): - jobs: list[DocTransformationJob] - jobs_not_found: list[UUID] +class DocTransformationJobs(SQLModel): + jobs: list[DocTransformationJob] = Field(default_factory=list) + jobs_not_found: list[UUID] = Field(default_factory=list)backend/app/core/doctransform/test_transformer.py (1)
10-16: Ensure output directory exists before writing.Prevents failures when the parent directory is missing.
def transform(self, input_path: Path, output_path: Path) -> Path: + output_path.parent.mkdir(parents=True, exist_ok=True) content = ( "Lorem ipsum dolor sit amet, consectetur adipiscing elit, " "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua." ) output_path.write_text(content, encoding="utf-8") return output_pathbackend/app/tests/core/doctransformer/test_service/test_start_job.py (1)
4-4: Switch to builtin tuple and drop unused typing imports (py3.12).Use builtin tuple[...] and remove unused Any/Tuple to satisfy Ruff UP035 and keep types modern.
-from typing import Any, Tuple +# no typing imports needed @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],Also applies to: 33-33, 88-88, 116-116, 146-146
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (3)
4-4: Modernize typing: use collections.abc and builtin tuple.Address Ruff UP035 and keep annotations idiomatic.
-from typing import Any, Callable, Tuple +from typing import Any +from collections.abc import Callable @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - self, db: Session, test_document: Tuple[Document, Project] + self, db: Session, test_document: tuple[Document, Project] @@ - self, db: Session, test_document: Tuple[Document, Project] + self, db: Session, test_document: tuple[Document, Project]Also applies to: 36-36, 87-87, 114-114, 149-149, 189-189, 220-220
129-136: Avoid blind exception assertions.Be explicit to prevent false positives under different environments/retry paths.
- with pytest.raises(Exception): + with pytest.raises((RetryError, HTTPException)): fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown", )
228-233: Rename unused loop variable to underscore to satisfy linters.
expected_content_typeisn’t used; keep intent clear.- for ( - target_format, - expected_content_type, - expected_extension, - ) in format_extensions: + for ( + target_format, + _expected_content_type, + expected_extension, + ) in format_extensions:backend/app/tests/core/doctransformer/test_service/conftest.py (1)
5-5: Modernize typing imports and tuple annotation.Use collections.abc for Callable/Generator and builtin tuple.
-from typing import Any, Callable, Generator, Tuple +from typing import Any +from collections.abc import Callable, Generator @@ -) -> Tuple[Document, Project]: +) -> tuple[Document, Project]:Also applies to: 73-80
backend/app/tests/core/doctransformer/test_service/test_integration.py (3)
4-4: Use builtin tuple for annotations (py3.12).-from typing import Tuple +# no typing Tuple needed @@ - self, db: Session, test_document: Tuple[Document, Project] + self, db: Session, test_document: tuple[Document, Project] @@ - self, db: Session, test_document: Tuple[Document, Project] + self, db: Session, test_document: tuple[Document, Project] @@ - self, db: Session, test_document: Tuple[Document, Project] + self, db: Session, test_document: tuple[Document, Project]Also applies to: 30-30, 85-85, 122-122
95-98: Rename unused loop variable.
iisn’t used; use underscore to quiet linters.- for i in range(3): + for _ in range(3): job = job_crud.create(source_document_id=document.id) jobs.append(job)
154-161: Drop enumerate since index isn’t used.- for i, (job, target_format) in enumerate(jobs): + for job, target_format in jobs: db.refresh(job) assert job.status == TransformationStatus.COMPLETED assert job.transformed_document_id is not None transformed_doc = document_crud.read_one(job.transformed_document_id)backend/app/crud/doc_transformation_job.py (1)
3-4: Drop deprecated/unused imports.
- UP035: avoid
typing.List; you're already usinglist[...].joinis imported but unused.Apply:
-from typing import List, Optional -from sqlmodel import Session, select, and_, join +from typing import Optional +from sqlmodel import Session, select, and_backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (5)
5-6: Modernize typing and remove unused import.
- Use
collections.abc.Callableand built-intuple[...].- Drop unused
execute_jobimport.-from typing import Any, Callable, Tuple +from typing import Any +from collections.abc import Callable @@ -from app.core.doctransform.service import execute_job @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],Also applies to: 14-14, 30-31, 74-75, 116-117, 160-161
40-41: Remove redundant commits in tests.
create()already commits; extradb.commit()s are noise.- db.commit()Also applies to: 84-85, 126-127, 170-171
55-62: Assert specific error to avoid blind exception.Tighten the expectation with a message match.
- with pytest.raises(Exception): + with pytest.raises(Exception, match="S3 upload failed"): fast_execute_job(
142-149: Tighten raised error assertion (persistent failure).- with pytest.raises(Exception): + with pytest.raises(Exception, match="Persistent error"): fast_execute_job(
188-195: Tighten raised error assertion (DB error).- with pytest.raises(Exception): + with pytest.raises(Exception, match="Database error during document creation"): fast_execute_job(backend/app/tests/api/routes/documents/test_route_document_upload.py (3)
24-46: Safer default for content type when mimetype is unknown.
mimetypes.guess_typemay returnNone; provide a fallback.- (mtype, _) = mimetypes.guess_type(str(scratch)) + (mtype, _) = mimetypes.guess_type(str(scratch)) + if not mtype: + mtype = "application/octet-stream"
56-72: Write PDF fixture in binary mode to avoid platform-dependent newline/encoding issues.Text mode can corrupt the PDF header on some platforms.
-@pytest.fixture -def pdf_scratch(): - # Create a test PDF file for transformation tests - with NamedTemporaryFile(mode="w", suffix=".pdf", delete=False) as fp: - fp.write("%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj\n") - fp.write("2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>endobj\n") - fp.write("3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]>>endobj\n") - fp.write("xref\n0 4\n0000000000 65535 f \n0000000009 00000 n \n") - fp.write( - "0000000074 00000 n \n0000000120 00000 n \ntrailer<</Size 4/Root 1 0 R>>\n" - ) - fp.write("startxref\n202\n%%EOF") - fp.flush() - yield Path(fp.name) - # Clean up the temporary file - Path(fp.name).unlink() +@pytest.fixture +def pdf_scratch(): + # Create a test PDF file for transformation tests + with NamedTemporaryFile(mode="wb", suffix=".pdf", delete=False) as fp: + fp.write(b"%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj\n") + fp.write(b"2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>endobj\n") + fp.write(b"3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]>>endobj\n") + fp.write(b"xref\n0 4\n0000000000 65535 f \n0000000009 00000 n \n") + fp.write( + b"0000000074 00000 n \n0000000120 00000 n \ntrailer<</Size 4/Root 1 0 R>>\n" + ) + fp.write(b"startxref\n202\n%%EOF") + fp.flush() + yield Path(fp.name) + # Clean up the temporary file + Path(fp.name).unlink(missing_ok=True)
295-297: Fix unused variable.
responseisn’t used; avoid the assignment to satisfy Ruff (F841).- response = httpx_to_standard( - uploader.put(route, pdf_scratch, target_format="markdown") - ) + uploader.put(route, pdf_scratch, target_format="markdown")backend/app/core/doctransform/service.py (1)
16-16: Remove unused import.
Userisn’t used.-from app.models import Userbackend/app/api/routes/documents.py (5)
68-75: Fix target_format docs; add normalization for common aliases (md/txt).Description suggests "txt", but the registry uses "text" (and "markdown"). Normalize aliases to avoid 400s and sync the docs.
- description="Desired output format for the uploaded document (e.g., pdf, docx, txt). ", + description="Desired output format for the uploaded document (e.g., pdf, docx, markdown, text). ",Add normalization before validation:
@@ - if target_format: + if target_format: + alias_map = {"md": "markdown", "txt": "text", "htm": "html"} + target_format = alias_map.get(target_format.lower(), target_format.lower()) if not is_transformation_supported(source_format, target_format):
120-129: Guard on explicit None check.Be explicit to avoid truthiness confusion if transformer names ever become falsy.
- if target_format and actual_transformer: + if target_format and actual_transformer is not None:
130-137: Pass UUID directly; avoid string coercion.Model expects UUID; Pydantic will coerce, but keeping it UUID is cleaner.
- job_id=str(job_id), + job_id=job_id,
136-137: Avoid hard-coded URL; use router.url_path_for.Prevents drift if the route prefix changes.
Example:
status_check_url = router.url_path_for("get_transformation_job", job_id=str(job_id))(Adjust handler name as appropriate.)
44-55: Ensure response_model matches payload types for /list.You return ORM models; response_model expects DocumentPublic. Consider mapping for consistent schema.
def list_docs( @@ - data = crud.read_many(skip, limit) - return APIResponse.success_response(data) + docs = crud.read_many(skip, limit) + data = [DocumentPublic.model_validate(d, from_attributes=True) for d in docs] + return APIResponse.success_response(data)backend/app/core/doctransform/registry.py (3)
2-2: Use builtin generics; address Ruff UP035.Switch from typing.Dict/Set/Tuple/Type to builtin dict/set/tuple/type.
-from typing import Type, Dict, Set, Tuple, Optional +from typing import OptionalAnd update annotations:
-TRANSFORMERS: Dict[str, Type[Transformer]] = { +TRANSFORMERS: dict[str, type[Transformer]] = { @@ -SUPPORTED_TRANSFORMATIONS: Dict[Tuple[str, str], Dict[str, str]] = { +SUPPORTED_TRANSFORMATIONS: dict[tuple[str, str], dict[str, str]] = { @@ -EXTENSION_TO_FORMAT: Dict[str, str] = { +EXTENSION_TO_FORMAT: dict[str, str] = { @@ -FORMAT_TO_EXTENSION: Dict[str, str] = { +FORMAT_TO_EXTENSION: dict[str, str] = { @@ -def get_supported_transformations() -> Dict[Tuple[str, str], Set[str]]: +def get_supported_transformations() -> dict[tuple[str, str], set[str]]:
71-74: Normalize format names (aliases) within the registry.Handle md/txt/htm gracefully across helpers to reduce API churn.
@@ +def _normalize_format(name: str) -> str: + aliases = {"md": "markdown", "txt": "text", "htm": "html"} + return aliases.get(name.lower(), name.lower()) @@ -def is_transformation_supported(source_format: str, target_format: str) -> bool: +def is_transformation_supported(source_format: str, target_format: str) -> bool: """Check if a transformation from source_format to target_format is supported.""" - return (source_format, target_format) in SUPPORTED_TRANSFORMATIONS + sf, tf = _normalize_format(source_format), _normalize_format(target_format) + return (sf, tf) in SUPPORTED_TRANSFORMATIONS @@ -def get_available_transformers( - source_format: str, target_format: str -) -> Dict[str, str]: +def get_available_transformers( + source_format: str, target_format: str +) -> dict[str, str]: """Get available transformers for a specific transformation.""" - return SUPPORTED_TRANSFORMATIONS.get((source_format, target_format), {}) + sf, tf = _normalize_format(source_format), _normalize_format(target_format) + return SUPPORTED_TRANSFORMATIONS.get((sf, tf), {}) @@ -def resolve_transformer( - source_format: str, target_format: str, transformer_name: Optional[str] = None -) -> str: +def resolve_transformer( + source_format: str, target_format: str, transformer_name: Optional[str] = None +) -> str: @@ - available_transformers = get_available_transformers(source_format, target_format) + available_transformers = get_available_transformers(source_format, target_format)Also applies to: 76-81, 83-108
14-18: Gate test-only transformer in non-test environments.Avoid exposing "test" in production registries.
Option: register "test" only if settings.ENVIRONMENT in {"test","dev"}.
Also applies to: 21-29
backend/app/tests/core/doctransformer/test_service/base.py (2)
13-13: Adopt builtin generics; remove typing.List.Modernize annotations per Ruff UP035.
-from typing import List +# no need for typing.List on Python 3.10+ @@ - def get_format_test_cases() -> List[tuple]: + def get_format_test_cases() -> list[tuple]: @@ - def get_content_type_test_cases() -> List[tuple]: + def get_content_type_test_cases() -> list[tuple]: @@ - def get_test_transformer_names() -> List[str]: + def get_test_transformer_names() -> list[str]:Also applies to: 69-76, 78-86, 88-96
120-129: Silence unused-args warning in persistent failing mock.Name args with underscores.
- def persistent_failing_convert_document(*args, **kwargs): + def persistent_failing_convert_document(*_args, **_kwargs): raise Exception(error_message)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (24)
backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py(1 hunks)backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py(1 hunks)backend/app/api/docs/documents/upload.md(1 hunks)backend/app/api/routes/doc_transformation_job.py(1 hunks)backend/app/api/routes/documents.py(4 hunks)backend/app/core/doctransform/registry.py(1 hunks)backend/app/core/doctransform/service.py(1 hunks)backend/app/core/doctransform/test_transformer.py(1 hunks)backend/app/core/doctransform/transformer.py(1 hunks)backend/app/core/doctransform/zerox_transformer.py(1 hunks)backend/app/crud/doc_transformation_job.py(1 hunks)backend/app/models/__init__.py(1 hunks)backend/app/models/doc_transformation_job.py(1 hunks)backend/app/models/document.py(3 hunks)backend/app/tests/api/routes/documents/test_route_document_upload.py(4 hunks)backend/app/tests/api/routes/test_doc_transformation_job.py(1 hunks)backend/app/tests/core/doctransformer/test_service/base.py(1 hunks)backend/app/tests/core/doctransformer/test_service/conftest.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_execute_job.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_integration.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_start_job.py(1 hunks)backend/app/tests/crud/test_doc_transformation_job.py(1 hunks)backend/app/tests/utils/document.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- backend/app/alembic/versions/b5b9412d3d2a_add_source_document_id_to_document_table.py
- backend/app/tests/api/routes/test_doc_transformation_job.py
- backend/app/tests/crud/test_doc_transformation_job.py
- backend/app/core/doctransform/transformer.py
- backend/app/models/document.py
- backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py
- backend/app/api/routes/doc_transformation_job.py
🧰 Additional context used
🧬 Code graph analysis (16)
backend/app/tests/utils/document.py (3)
backend/app/models/project.py (1)
Project(29-60)backend/app/crud/project.py (1)
get_project_by_id(37-38)backend/app/tests/crud/documents/test_crud_document_update.py (1)
documents(12-15)
backend/app/models/__init__.py (2)
backend/app/models/document.py (4)
Document(20-40)DocumentPublic(43-60)DocumentUploadResponse(74-76)TransformationJobInfo(63-71)backend/app/models/doc_transformation_job.py (3)
DocTransformationJob(16-27)DocTransformationJobs(30-32)TransformationStatus(9-13)
backend/app/core/doctransform/test_transformer.py (2)
backend/app/core/doctransform/transformer.py (2)
Transformer(5-14)transform(9-14)backend/app/core/doctransform/zerox_transformer.py (1)
transform(18-56)
backend/app/api/routes/documents.py (4)
backend/app/core/cloud/storage.py (3)
get_cloud_storage(262-272)get_signed_url(135-137)get_signed_url(217-242)backend/app/core/doctransform/registry.py (4)
get_available_transformers(76-80)get_file_format(54-60)is_transformation_supported(71-73)resolve_transformer(83-107)backend/app/models/document.py (4)
Document(20-40)DocumentPublic(43-60)DocumentUploadResponse(74-76)TransformationJobInfo(63-71)backend/app/core/doctransform/service.py (1)
start_job(25-44)
backend/app/tests/api/routes/documents/test_route_document_upload.py (4)
backend/app/tests/utils/document.py (5)
Route(82-109)_(140-141)_(145-146)delete(123-127)httpx_to_standard(22-23)backend/app/core/cloud/storage.py (5)
client(30-42)delete(140-142)delete(244-259)AmazonCloudStorageClient(28-84)create(44-84)backend/app/tests/conftest.py (3)
client(52-55)user_api_key(89-91)db(24-41)backend/app/crud/doc_transformation_job.py (1)
create(19-30)
backend/app/models/doc_transformation_job.py (1)
backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py (1)
upgrade(19-66)
backend/app/core/doctransform/zerox_transformer.py (2)
backend/app/core/doctransform/transformer.py (2)
Transformer(5-14)transform(9-14)backend/app/core/doctransform/test_transformer.py (1)
transform(10-16)
backend/app/core/doctransform/service.py (5)
backend/app/crud/doc_transformation_job.py (4)
DocTransformationJobCrud(14-91)create(19-30)update_status(69-91)read_one(32-51)backend/app/crud/document.py (1)
DocumentCrud(14-135)backend/app/models/document.py (1)
Document(20-40)backend/app/core/cloud/storage.py (1)
get_cloud_storage(262-272)backend/app/core/doctransform/registry.py (1)
convert_document(110-131)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (5)
backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(14-91)create(19-30)read_one(32-51)backend/app/core/doctransform/service.py (1)
execute_job(48-150)backend/app/tests/core/doctransformer/test_service/base.py (8)
DocTransformTestBase(21-63)MockHelpers(99-129)setup_aws_s3(24-28)create_s3_document_content(30-42)failing_convert_document(107-116)create_failing_convert_document(103-118)persistent_failing_convert_document(126-127)create_persistent_failing_convert_document(121-129)backend/app/tests/conftest.py (1)
db(24-41)backend/app/tests/core/doctransformer/test_service/conftest.py (1)
fast_execute_job(33-51)
backend/app/core/doctransform/registry.py (3)
backend/app/core/doctransform/transformer.py (2)
Transformer(5-14)transform(9-14)backend/app/core/doctransform/test_transformer.py (2)
TestTransformer(5-16)transform(10-16)backend/app/core/doctransform/zerox_transformer.py (2)
ZeroxTransformer(10-56)transform(18-56)
backend/app/crud/doc_transformation_job.py (3)
backend/app/crud/document.py (2)
DocumentCrud(14-135)__init__(15-17)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-27)TransformationStatus(9-13)backend/app/models/document.py (1)
Document(20-40)
backend/app/tests/core/doctransformer/test_service/base.py (4)
backend/app/core/cloud/storage.py (1)
AmazonCloudStorageClient(28-84)backend/app/models/document.py (1)
Document(20-40)backend/app/models/project.py (1)
Project(29-60)backend/app/core/config.py (1)
AWS_S3_BUCKET(76-77)
backend/app/tests/core/doctransformer/test_service/conftest.py (6)
backend/app/crud/project.py (1)
get_project_by_id(37-38)backend/app/models/user.py (2)
User(48-57)UserProjectOrg(65-66)backend/app/tests/utils/document.py (4)
DocumentStore(52-79)get(117-121)project(64-65)put(67-72)backend/app/tests/utils/test_data.py (1)
create_test_api_key(53-67)backend/app/core/doctransform/service.py (1)
execute_job(48-150)backend/app/tests/conftest.py (1)
db(24-41)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (8)
backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(14-91)create(19-30)read_one(32-51)backend/app/crud/document.py (1)
DocumentCrud(14-135)backend/app/core/doctransform/registry.py (1)
TransformationError(9-10)backend/app/core/doctransform/service.py (1)
execute_job(48-150)backend/app/models/document.py (1)
Document(20-40)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-27)TransformationStatus(9-13)backend/app/tests/core/doctransformer/test_service/base.py (8)
DocTransformTestBase(21-63)TestDataProvider(66-96)get_format_test_cases(70-76)setup_aws_s3(24-28)get_sample_document_content(94-96)create_s3_document_content(30-42)verify_s3_content(44-63)get_content_type_test_cases(79-86)backend/app/tests/core/doctransformer/test_service/conftest.py (2)
test_document(73-79)fast_execute_job(33-51)
backend/app/tests/core/doctransformer/test_service/test_integration.py (8)
backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(14-91)read_one(32-51)create(19-30)backend/app/crud/document.py (1)
DocumentCrud(14-135)backend/app/core/doctransform/service.py (2)
execute_job(48-150)start_job(25-44)backend/app/models/document.py (1)
Document(20-40)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-27)TransformationStatus(9-13)backend/app/models/user.py (1)
UserProjectOrg(65-66)backend/app/tests/core/doctransformer/test_service/base.py (3)
DocTransformTestBase(21-63)setup_aws_s3(24-28)create_s3_document_content(30-42)backend/app/tests/core/doctransformer/test_service/conftest.py (3)
test_document(73-79)current_user(55-63)background_tasks(67-69)
backend/app/tests/core/doctransformer/test_service/test_start_job.py (7)
backend/app/core/doctransform/service.py (2)
execute_job(48-150)start_job(25-44)backend/app/models/document.py (1)
Document(20-40)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-27)TransformationStatus(9-13)backend/app/models/user.py (1)
UserProjectOrg(65-66)backend/app/tests/core/doctransformer/test_service/base.py (2)
TestDataProvider(66-96)get_test_transformer_names(89-91)backend/app/tests/conftest.py (1)
db(24-41)backend/app/tests/core/doctransformer/test_service/conftest.py (3)
current_user(55-63)test_document(73-79)background_tasks(67-69)
🪛 Ruff (0.12.2)
backend/app/tests/api/routes/documents/test_route_document_upload.py
295-295: Local variable response is assigned to but never used
Remove assignment to unused variable response
(F841)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py
5-5: Import from collections.abc instead: Callable
Import from collections.abc
(UP035)
5-5: typing.Tuple is deprecated, use tuple instead
(UP035)
55-55: Do not assert blind exception: Exception
(B017)
142-142: Do not assert blind exception: Exception
(B017)
188-188: Do not assert blind exception: Exception
(B017)
backend/app/core/doctransform/registry.py
2-2: typing.Type is deprecated, use type instead
(UP035)
2-2: typing.Dict is deprecated, use dict instead
(UP035)
2-2: typing.Set is deprecated, use set instead
(UP035)
2-2: typing.Tuple is deprecated, use tuple instead
(UP035)
backend/app/crud/doc_transformation_job.py
3-3: typing.List is deprecated, use list instead
(UP035)
backend/app/tests/core/doctransformer/test_service/base.py
13-13: typing.List is deprecated, use list instead
(UP035)
126-126: Unused function argument: args
(ARG001)
126-126: Unused function argument: kwargs
(ARG001)
backend/app/tests/core/doctransformer/test_service/conftest.py
5-5: Import from collections.abc instead: Callable, Generator
Import from collections.abc
(UP035)
5-5: typing.Tuple is deprecated, use tuple instead
(UP035)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py
4-4: Import from collections.abc instead: Callable
Import from collections.abc
(UP035)
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
129-129: Do not assert blind exception: Exception
(B017)
230-230: Loop control variable expected_content_type not used within loop body
Rename unused expected_content_type to _expected_content_type
(B007)
backend/app/tests/core/doctransformer/test_service/test_integration.py
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
95-95: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
154-154: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
backend/app/tests/core/doctransformer/test_service/test_start_job.py
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
🪛 LanguageTool
backend/app/api/docs/documents/upload.md
[grammar] ~4-~4: There might be a mistake here.
Context: ...nsformation job will also be created to transform document into target format in the back...
(QB_NEW_EN)
[grammar] ~4-~4: There might be a mistake here.
Context: ...l also be created to transform document into target format in the background. The re...
(QB_NEW_EN)
[grammar] ~10-~10: Ensure spelling is correct
Context: ...mations are supported: - pdf → markdown - zerox ### Transformers Available transformer name...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (4)
backend/app/tests/core/doctransformer/test_service/test_start_job.py (2)
54-61: Nice: asserts validate BackgroundTasks wiring.Verifying func and args for the scheduled task gives strong coupling between API and worker entrypoint; looks good.
11-13: No HTTPException inconsistencies detected. All raised exceptions use the same fastapi.HTTPException class (including via the alias in app.core.exception_handlers).backend/app/models/__init__.py (1)
5-15: Re-exports look good.Consolidating model exports here simplifies imports in routes and tests.
backend/app/api/routes/documents.py (1)
118-119: Remove suggestion to usecreate–DocumentCrudhas nocreatemethod andupdate()callssession.add(), so it correctly handles inserting new documents.Likely an incorrect or invalid review comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/tests/core/doctransformer/test_registry.py (1)
1-95: Organize imports and use the fixture return value to satisfy linting
CI is failing due to an unsorted import block and unused fixture arguments. Inbackend/app/tests/core/doctransformer/test_registry.py:
- Run
isort --apply(orruff --fix) to sort/format the import block.- In each test that accepts
patched_transformations, assign it to a dummy variable (e.g.mapping = patched_transformations) before calling the API to avoid ARG001.- Keep two blank lines between top-level defs for Black.
🧹 Nitpick comments (9)
backend/app/core/logger.py (3)
24-26: Harden LiteLLM log suppression: cover correct logger name(s) and make level configurable.Many installs use logger name "litellm" (lowercase). Apply to both and allow override via env to avoid hardcoding WARNING.
-# Suppress info logs from LiteLLM -logging.getLogger("LiteLLM").setLevel(logging.WARNING) +# Suppress LiteLLM logs (supports both logger name casings) +_litellm_level_name = os.getenv("LITELLM_LOG_LEVEL", "WARNING").upper() +_litellm_level = getattr(logging, _litellm_level_name, logging.WARNING) +for _name in ("litellm", "LiteLLM"): + logging.getLogger(_name).setLevel(_litellm_level)
41-46: Specify encoding and delay file creation on rotating file handler.Prevents Unicode issues on non-UTF-8 locales and avoids touching disk until first write.
-file_handler = RotatingFileHandler( - LOG_FILE_PATH, maxBytes=10 * 1024 * 1024, backupCount=5 -) +file_handler = RotatingFileHandler( + LOG_FILE_PATH, + maxBytes=10 * 1024 * 1024, + backupCount=5, + encoding="utf-8", + delay=True, +)
27-39: Avoid duplicate handlers under uvicorn --reload.Without guarding, reload can attach multiple handlers and duplicate log lines.
-# Stream handler (console) -stream_handler = logging.StreamHandler() -stream_handler.setFormatter(formatter) -stream_handler.addFilter(CorrelationIdFilter()) -logger.addHandler(stream_handler) +if not any(isinstance(h, logging.StreamHandler) for h in logger.handlers): + stream_handler = logging.StreamHandler() + stream_handler.setFormatter(formatter) + stream_handler.addFilter(CorrelationIdFilter()) + logger.addHandler(stream_handler) -# Rotating file handler -file_handler = RotatingFileHandler( - LOG_FILE_PATH, maxBytes=10 * 1024 * 1024, backupCount=5 -) -file_handler.setFormatter(formatter) -file_handler.addFilter(CorrelationIdFilter()) -logger.addHandler(file_handler) +if not any(isinstance(h, RotatingFileHandler) for h in logger.handlers): + file_handler = RotatingFileHandler( + LOG_FILE_PATH, maxBytes=10 * 1024 * 1024, backupCount=5, encoding="utf-8", delay=True + ) + file_handler.setFormatter(formatter) + file_handler.addFilter(CorrelationIdFilter()) + logger.addHandler(file_handler)Also applies to: 41-46
backend/app/tests/core/doctransformer/test_registry.py (6)
25-30: Parametrize to reduce repetition in format testsConsolidates assertions and makes it easy to extend.
-def test_get_file_format_valid(): - assert get_file_format("file.pdf") == "pdf" - assert get_file_format("file.docx") == "docx" - assert get_file_format("file.md") == "markdown" - assert get_file_format("file.html") == "html" +@pytest.mark.parametrize( + "filename,expected", + [ + ("file.pdf", "pdf"), + ("file.docx", "docx"), + ("file.md", "markdown"), + ("file.html", "html"), + ], +) +def test_get_file_format_valid(filename, expected): + assert get_file_format(filename) == expected
32-35: Broaden negative coverage and assert error messageCovers no-extension case and checks the error message substring.
-def test_get_file_format_invalid(): - with pytest.raises(ValueError): - get_file_format("file.unknown") +def test_get_file_format_invalid(): + for name in ("file.unknown", "file"): + with pytest.raises(ValueError, match="Unsupported file extension"): + get_file_format(name)
37-43: Silence Ruff ARG001: mark unused fixture param with underscoreThe fixture is used for setup; the argument itself isn’t used.
-def test_get_supported_transformations(patched_transformations): +def test_get_supported_transformations(_patched_transformations):
54-59: Silence Ruff ARG001 here as wellRename the unused fixture arg.
-def test_get_available_transformers(patched_transformations): +def test_get_available_transformers(_patched_transformations):
61-68: Tighten assertions and silence Ruff ARG001Match error substrings to ensure correct failure mode and rename unused arg.
-def test_resolve_transformer(patched_transformations): +def test_resolve_transformer(_patched_transformations): assert resolve_transformer("docx", "pdf") == "pandoc" assert resolve_transformer("docx", "pdf", "pandoc") == "pandoc" - with pytest.raises(ValueError): + with pytest.raises(ValueError, match="not available"): resolve_transformer("docx", "pdf", "notfound") - with pytest.raises(ValueError): + with pytest.raises(ValueError, match="is not supported"): resolve_transformer("pdf", "docx")
70-95: Minor: also assert the returned path equals output_fileKeeps the contract explicit in addition to content check.
result = convert_document(input_file, output_file, transformer_name="dummy") assert result.read_text() == "transformed" + assert result == output_file
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
backend/app/core/logger.py(1 hunks)backend/app/tests/core/doctransformer/test_registry.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/tests/core/doctransformer/test_registry.py (1)
backend/app/core/doctransform/registry.py (7)
get_file_format(54-60)get_supported_transformations(63-68)is_transformation_supported(71-73)get_available_transformers(76-80)resolve_transformer(83-107)convert_document(110-131)TransformationError(9-10)
🪛 Ruff (0.12.2)
backend/app/tests/core/doctransformer/test_registry.py
37-37: Unused function argument: patched_transformations
(ARG001)
54-54: Unused function argument: patched_transformations
(ARG001)
61-61: Unused function argument: patched_transformations
(ARG001)
🪛 GitHub Actions: AI Platform CI
backend/app/tests/core/doctransformer/test_registry.py
[error] 1-1: Black formatting hook reformatted the file during pre-commit run --all-files (hook: black); exit code 1.
🔇 Additional comments (2)
backend/app/tests/core/doctransformer/test_registry.py (2)
15-23: Fixture patching looks goodCorrect use of monkeypatch to stub SUPPORTED_TRANSFORMATIONS for tests.
45-52: LGTM: clear positive/negative path checkAccurately validates presence/absence in SUPPORTED_TRANSFORMATIONS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (19)
backend/app/tests/core/doctransformer/test_registry.py (4)
27-32: Parametrize positive cases to reduce repetition and add case-insensitivity coverage.Parametrization makes failures clearer and adds value by checking upper-case extensions.
Apply:
- def test_get_file_format_valid(): - assert get_file_format("file.pdf") == "pdf" - assert get_file_format("file.docx") == "docx" - assert get_file_format("file.md") == "markdown" - assert get_file_format("file.html") == "html" +import pytest + +@pytest.mark.parametrize( + "name,expected", + [ + ("file.pdf", "pdf"), + ("file.PDF", "pdf"), + ("file.docx", "docx"), + ("file.md", "markdown"), + ("file.html", "html"), + ], +) +def test_get_file_format_valid(name, expected): + assert get_file_format(name) == expected
34-37: Add edge cases for files without a usable extension.Catches common user inputs like “README” or “file.”.
Apply:
-def test_get_file_format_invalid(): - with pytest.raises(ValueError): - get_file_format("file.unknown") +@pytest.mark.parametrize("name", ["file.unknown", "file", "file."]) +def test_get_file_format_invalid(name): + with pytest.raises(ValueError): + get_file_format(name)
39-39: Silence Ruff ARG001 for fixture-only parameters.The fixture is intentionally unused in-body; annotate to pass lint without adding no-op assertions.
Apply:
-def test_get_supported_transformations(patched_transformations): +def test_get_supported_transformations(patched_transformations): # noqa: ARG001 @@ -def test_get_available_transformers(patched_transformations): +def test_get_available_transformers(patched_transformations): # noqa: ARG001 @@ -def test_resolve_transformer(patched_transformations): +def test_resolve_transformer(patched_transformations): # noqa: ARG001Also applies to: 56-56, 63-63
72-96: Strengthen convert_document assertions and cover default/missing-default flows.Verify the function returns the exact output path and add tests for default resolution and absence of default.
Apply:
@@ input_file.write_text("test") result = convert_document(input_file, output_file, transformer_name="dummy") - assert result.read_text() == "transformed" + assert result == output_file + assert result.read_text() == "transformed"Add the following tests after this block:
def test_convert_document_uses_default_transformer(tmp_path, monkeypatch): class DummyTransformer: def transform(self, input_path, output_path): output_path.write_text("transformed") return output_path # Wire default -> dummy for docx->pdf and register transformer monkeypatch.setitem(TRANSFORMERS, "dummy", DummyTransformer) monkeypatch.setattr( "app.core.doctransform.registry.SUPPORTED_TRANSFORMATIONS", {("docx", "pdf"): {"default": "dummy", "dummy": "dummy"}}, ) src = tmp_path / "in.docx" dst = tmp_path / "out.pdf" src.write_text("x") result = convert_document(src, dst) assert result == dst assert result.read_text() == "transformed" def test_convert_document_missing_default_raises(tmp_path, monkeypatch): # Pair exists but lacks a default entry monkeypatch.setattr( "app.core.doctransform.registry.SUPPORTED_TRANSFORMATIONS", {("docx", "pdf"): {}}, ) src = tmp_path / "in.docx" dst = tmp_path / "out.pdf" src.write_text("x") with pytest.raises(ValueError): convert_document(src, dst)backend/app/tests/core/doctransformer/test_service/test_integration.py (4)
4-4: Prefer built-intupleovertyping.Tuple.PEP 585 style avoids deprecation warnings and keeps typing modern.
-from typing import Tuple +from __future__ import annotations- def test_execute_job_end_to_end_workflow( - self, db: Session, test_document: Tuple[Document, Project] + def test_execute_job_end_to_end_workflow( + self, db: Session, test_document: tuple[Document, Project] ) -> None:- def test_execute_job_concurrent_jobs( - self, db: Session, test_document: Tuple[Document, Project] + def test_execute_job_concurrent_jobs( + self, db: Session, test_document: tuple[Document, Project] ) -> None:- def test_multiple_format_transformations( - self, db: Session, test_document: Tuple[Document, Project] + def test_multiple_format_transformations( + self, db: Session, test_document: tuple[Document, Project] ) -> None:Also applies to: 30-31, 85-87, 122-124
95-98: Silence unused loop index.Index isn’t used; switch to underscore.
- for i in range(3): + for _ in range(3): job = job_crud.create(source_document_id=document.id) jobs.append(job)
154-156: Remove unused enumerate index.- for i, (job, target_format) in enumerate(jobs): + for job, target_format in jobs: with patch("app.core.doctransform.service.Session") as mock_session_class:
84-87: Test name suggests concurrency but executes sequentially.Either rename or actually run in parallel to validate concurrency.
-def test_execute_job_concurrent_jobs( +def test_execute_job_multiple_jobs( self, db: Session, test_document: tuple[Document, Project] ) -> None:backend/app/tests/core/doctransformer/test_service/test_execute_job.py (3)
4-4: Modernize typing: usecollections.abc.Callableand built-intuple.-from typing import Any, Callable, Tuple +from typing import Any +from collections.abc import Callable- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],- self, db: Session, test_document: Tuple[Document, Project] + self, db: Session, test_document: tuple[Document, Project]- self, db: Session, test_document: Tuple[Document, Project] + self, db: Session, test_document: tuple[Document, Project]Also applies to: 37-40, 88-90, 114-116, 149-151, 189-191, 220-222
130-136: Avoid blindExceptionassertion; assert expected types.Aligns with Ruff B017 and makes failures clearer.
- with pytest.raises(Exception): + with pytest.raises((HTTPException, RetryError)): fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown", )
8-17: You compute expected content-types; assert them from S3.Currently
expected_content_typeis unused. Validate via S3 HeadObject.from moto import mock_aws from sqlmodel import Session from tenacity import RetryError from app.crud import DocTransformationJobCrud, DocumentCrud from app.core.doctransform.registry import TransformationError from app.core.doctransform.service import execute_job from app.core.exception_handlers import HTTPException from app.models import Document, Project, TransformationStatus from app.tests.core.doctransformer.test_service.utils import DocTransformTestBase +from urllib.parse import urlparse +from app.core.config import settingstransformed_doc = document_crud.read_one(job.transformed_document_id) assert transformed_doc is not None assert transformed_doc.fname.endswith(expected_extension) + # Assert S3 content-type + parsed = urlparse(transformed_doc.object_store_url) + key = parsed.path.lstrip("/") + head = aws.client.head_object(Bucket=settings.AWS_S3_BUCKET, Key=key) + assert head["ContentType"] == expected_content_typeAlso applies to: 254-262
backend/app/tests/core/doctransformer/test_service/test_start_job.py (1)
4-4: Prefer built-intupleovertyping.Tuple.-from typing import Tuple +from __future__ import annotations- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],Also applies to: 30-33, 85-88, 112-115, 141-145
backend/app/tests/core/doctransformer/test_service/utils.py (3)
33-34: Derive S3 bucket from the document URL instead of settings.Reduces coupling and makes tests resilient to non-default buckets.
- aws.client.put_object(Bucket=settings.AWS_S3_BUCKET, Key=s3_key, Body=content) + bucket = parsed_url.netloc or settings.AWS_S3_BUCKET + aws.client.put_object(Bucket=bucket, Key=s3_key, Body=content)- response = aws.client.get_object( - Bucket=settings.AWS_S3_BUCKET, Key=transformed_key - ) + bucket = parsed_url.netloc or settings.AWS_S3_BUCKET + response = aws.client.get_object(Bucket=bucket, Key=transformed_key)Also applies to: 49-53
79-81: Rename unusedargs/kwargsto underscore to satisfy linters.- def persistent_failing_convert_document(*args, **kwargs): + def persistent_failing_convert_document(*_args, **_kwargs): raise Exception(error_message)
60-67: Optional: raise a domain-specific error for clarity.If convenient, raise a
TransformationError(or a test-local subclass) instead of a bareExceptionto make assertions more precise.backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (4)
5-6: Modernize typing and tightenfast_execute_jobsignature.-from typing import Any, Callable, Tuple +from typing import Any +from collections.abc import Callable +from uuid import UUID- fast_execute_job: Callable[[int, Any, str, str], Any], + fast_execute_job: Callable[[int, UUID, str, str], Any],- fast_execute_job: Callable[[int, Any, str, str], Any], + fast_execute_job: Callable[[int, UUID, str, str], Any],- fast_execute_job: Callable[[int, Any, str, str], Any], + fast_execute_job: Callable[[int, UUID, str, str], Any],- fast_execute_job: Callable[[int, Any, str, str], Any], + fast_execute_job: Callable[[int, UUID, str, str], Any],Also applies to: 30-32, 74-76, 114-116, 158-160
8-11: Use specific exception types for retry exhaustion.Tenacity should raise
RetryErrorwhen retries are exhausted.-from moto import mock_aws -from sqlmodel import Session +from moto import mock_aws +from sqlmodel import Session +from tenacity import RetryError- with pytest.raises(Exception): + with pytest.raises(RetryError): fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown", )Also applies to: 139-146
54-61: If exact exception type varies, assert on message and silence B017.Keep the behavioral check but avoid linter noise.
- with pytest.raises(Exception): + with pytest.raises(Exception) as exc_info: # noqa: B017 fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown", ) + assert "S3 upload failed" in str(exc_info.value)
185-192: Same here: assert details or switch to a specific type.- with pytest.raises(Exception): + with pytest.raises(Exception) as exc_info: # noqa: B017 fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown", ) + assert "Database error during document creation" in str(exc_info.value)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
backend/app/tests/core/doctransformer/test_registry.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_execute_job.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_integration.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_start_job.py(1 hunks)backend/app/tests/core/doctransformer/test_service/utils.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
backend/app/tests/core/doctransformer/test_service/utils.py (2)
backend/app/tests/api/routes/documents/test_route_document_upload.py (2)
TestDocumentRouteUpload(64-103)test_adds_to_S3(82-103)backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (2)
TestDocumentRoutePermanentRemove(43-99)aws_credentials(33-38)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (2)
backend/app/tests/api/routes/documents/test_route_document_upload.py (2)
TestDocumentRouteUpload(64-103)test_adds_to_S3(82-103)backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1)
TestDocumentRoutePermanentRemove(43-99)
backend/app/tests/core/doctransformer/test_service/test_integration.py (2)
backend/app/tests/api/routes/documents/test_route_document_upload.py (2)
TestDocumentRouteUpload(64-103)test_adds_to_S3(82-103)backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1)
TestDocumentRoutePermanentRemove(43-99)
backend/app/tests/core/doctransformer/test_service/test_start_job.py (2)
backend/app/tests/api/routes/documents/test_route_document_upload.py (1)
TestDocumentRouteUpload(64-103)backend/app/tests/api/routes/documents/test_route_document_remove.py (1)
TestDocumentRouteRemove(23-88)
🪛 Ruff (0.12.2)
backend/app/tests/core/doctransformer/test_registry.py
39-39: Unused function argument: patched_transformations
(ARG001)
56-56: Unused function argument: patched_transformations
(ARG001)
63-63: Unused function argument: patched_transformations
(ARG001)
backend/app/tests/core/doctransformer/test_service/utils.py
79-79: Unused function argument: args
(ARG001)
79-79: Unused function argument: kwargs
(ARG001)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py
4-4: Import from collections.abc instead: Callable
Import from collections.abc
(UP035)
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
130-130: Do not assert blind exception: Exception
(B017)
236-236: Loop control variable expected_content_type not used within loop body
Rename unused expected_content_type to _expected_content_type
(B007)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py
5-5: Import from collections.abc instead: Callable
Import from collections.abc
(UP035)
5-5: typing.Tuple is deprecated, use tuple instead
(UP035)
54-54: Do not assert blind exception: Exception
(B017)
139-139: Do not assert blind exception: Exception
(B017)
185-185: Do not assert blind exception: Exception
(B017)
backend/app/tests/core/doctransformer/test_service/test_integration.py
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
95-95: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
154-154: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
backend/app/tests/core/doctransformer/test_service/test_start_job.py
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
🔇 Additional comments (1)
backend/app/tests/core/doctransformer/test_registry.py (1)
1-11: LGTM: import surface matches the public registry API.Imports are precise and scoped to the functions/types under test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approving this, but please take a loot at the comments first before merging
|
|
||
| try: | ||
| with Runner() as runner: | ||
| result = runner.run( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering if we should have a timeout or error handling here as this will be a long running process
@avirajsingh7 I am not entirely sure how to do it with runner, can we explore a little? coderabbit seems to have a solution in the above comment, maybe worth exploring
backend/app/alembic/versions/9f8a4af9d6fd_create_doc_transformation_job_table.py
Show resolved
Hide resolved
|
|
||
| def upgrade(): | ||
| # ### commands auto generated by Alembic - please adjust! ### | ||
| op.add_column("document", sa.Column("source_document_id", sa.Uuid(), nullable=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the "source document id" column in this table and doc transformation table are different, like in the context of doc transformation table, I understand what the source document id could mean, but what is the role of this column in this table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So first you upload doc to s3 and than you will use that doc for transformation, so this id of document you uploaded to s3 and from which the transformed document got created.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still a little confused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/core/doctransform/registry.py (1)
108-130: Plumb transformer parameters through the registry (aligns with PR’s “transformerparameter”).Currently, there’s no way to pass per-job transformer params (e.g., model). Accept kwargs and forward to the transformer constructor.
-def convert_document( - input_path: Path, output_path: Path, transformer_name: str = "default" -) -> Path: +def convert_document( + input_path: Path, + output_path: Path, + transformer_name: str = "default", + transformer_kwargs: Optional[dict] = None, +) -> Path: @@ - transformer = transformer_cls() + transformer = transformer_cls(**(transformer_kwargs or {}))Follow-up: Thread these
transformer_kwargsfrom the API/service layer (e.g., persist in job payload and supply here).
🧹 Nitpick comments (31)
backend/app/core/doctransform/registry.py (2)
2-2: Modernize type hints to built-in generics (ruff UP035).Use
type,dict,set, andtupleto silence UP035 and keep code forward-compatible.- from typing import Type, Dict, Set, Tuple, Optional + from typing import Optional @@ -TRANSFORMERS: Dict[str, Type[Transformer]] = { +TRANSFORMERS: dict[str, type[Transformer]] = { @@ -SUPPORTED_TRANSFORMATIONS: Dict[Tuple[str, str], Dict[str, str]] = { +SUPPORTED_TRANSFORMATIONS: dict[tuple[str, str], dict[str, str]] = { @@ -EXTENSION_TO_FORMAT: Dict[str, str] = { +EXTENSION_TO_FORMAT: dict[str, str] = { @@ -FORMAT_TO_EXTENSION: Dict[str, str] = { +FORMAT_TO_EXTENSION: dict[str, str] = { @@ -def get_supported_transformations() -> Dict[Tuple[str, str], Set[str]]: +def get_supported_transformations() -> dict[tuple[str, str], set[str]]: @@ -def get_available_transformers( +def get_available_transformers( source_format: str, target_format: str -) -> Dict[str, str]: +) -> dict[str, str]:Also applies to: 13-16, 18-27, 29-39, 41-49, 61-67, 74-79
18-27: Add pdf→text/html to SUPPORTED_TRANSFORMATIONS to match service/tests.Service allows "text" and "html" (extensions and content types handled), and tests exercise these formats. Add entries so API-level validation doesn’t reject valid requests.
SUPPORTED_TRANSFORMATIONS: dict[tuple[str, str], dict[str, str]] = { ("pdf", "markdown"): { "default": "zerox", "zerox": "zerox", }, + ("pdf", "text"): { + "default": "zerox", + "zerox": "zerox", + }, + ("pdf", "html"): { + "default": "zerox", + "zerox": "zerox", + }, # Future transformations can be added here # ("docx", "markdown"): {"default": "pandoc", "pandoc": "pandoc"}, # ("html", "markdown"): {"default": "pandoc", "pandoc": "pandoc"}, }backend/app/tests/core/doctransformer/test_service/test_integration.py (4)
4-4: Use built-in generics (ruff UP035).Replace
typing.Tuplewithtupleand drop the import.-from typing import Tuple @@ - def test_execute_job_end_to_end_workflow( - self, db: Session, test_document: Tuple[Document, Project] + def test_execute_job_end_to_end_workflow( + self, db: Session, test_document: tuple[Document, Project] @@ - def test_execute_job_concurrent_jobs( - self, db: Session, test_document: Tuple[Document, Project] + def test_execute_job_concurrent_jobs( + self, db: Session, test_document: tuple[Document, Project] @@ - def test_multiple_format_transformations( - self, db: Session, test_document: Tuple[Document, Project] + def test_multiple_format_transformations( + self, db: Session, test_document: tuple[Document, Project]Also applies to: 33-35, 92-94, 133-136
102-104: Rename unused loop variable (ruff B007).- for i in range(3): + for _ in range(3): job = job_crud.create(source_document_id=document.id) jobs.append(job)
83-88: Assert transformed content in S3 as part of end-to-end validation.You already have helpers; add the content check for stronger E2E coverage.
transformed_doc = document_crud.read_one(job.transformed_document_id) assert transformed_doc.source_document_id == document.id assert "<transformed>" in transformed_doc.fname + # Verify content uploaded to S3 + self.verify_s3_content(aws, transformed_doc)
99-124: Optional: run the “concurrent jobs” test in parallel to truly exercise concurrency.Use ThreadPoolExecutor to dispatch jobs simultaneously; today they run sequentially.
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (6)
4-4: Import Callable from collections.abc and drop deprecated Tuple (ruff UP035).-from typing import Any, Callable, Tuple +from typing import Any +from collections.abc import Callable
7-7: Add imports for content-type verification and specific S3 error type.from uuid import uuid4, UUID +from urllib.parse import urlparse @@ from app.core.doctransform.service import execute_job +from app.core.config import settingsAlso applies to: 13-13
8-12: Import botocore ClientError to narrow exception expectations.import pytest from moto import mock_aws from sqlmodel import Session from tenacity import RetryError +from botocore.exceptions import ClientError
39-43: Use built-in generics for tuple annotations (ruff UP035).- test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], @@ - self, db: Session, test_document: Tuple[Document, Project] + self, db: Session, test_document: tuple[Document, Project] @@ - self, db: Session, test_document: Tuple[Document, Project] + self, db: Session, test_document: tuple[Document, Project]Also applies to: 94-98, 124-128, 206-207, 241-242
145-151: Avoid catching bare Exception; assert expected S3 error types.Missing source objects typically raise
botocore.exceptions.ClientError. IncludeRetryErrordue to tenacity.- with pytest.raises(Exception): + with pytest.raises((ClientError, RetryError)): fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown", )
248-253: Actually assert expected content types in S3 (and remove B007 warning).You compute
expected_content_typebut never assert it; check S3 object metadata.format_extensions = [ - ("markdown", "text/markdown", ".md"), - ("text", "text/plain", ".txt"), - ("html", "text/html", ".html"), - ("unknown", "text/plain", ".unknown"), # Default fallback + ("markdown", "text/markdown", ".md"), + ("text", "text/plain", ".txt"), + ("html", "text/html", ".html"), + ("unknown", "text/plain", ".unknown"), # Default fallback ] @@ assert job.transformed_document_id is not None transformed_doc = document_crud.read_one(job.transformed_document_id) assert transformed_doc is not None assert transformed_doc.fname.endswith(expected_extension) + # Verify ContentType stored in S3 matches expectation + parsed = urlparse(transformed_doc.object_store_url) + key = parsed.path.lstrip("/") + head = aws.client.head_object(Bucket=settings.AWS_S3_BUCKET, Key=key) + assert head["ContentType"] == expected_content_typeAlso applies to: 280-288
backend/app/tests/core/doctransformer/test_service/test_start_job.py (8)
4-4: Prefer built-intuple[...]over deprecatedtyping.Tuple.Modern typing: use
tuple[Document, Project]and drop theTupleimport.-from typing import Tuple +# (no import needed for tuple) - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], ... - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], ... - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], ... - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],Also applies to: 34-34, 89-89, 116-116, 149-149
13-13: ImportHTTPExceptiondirectly from FastAPI.Reduces coupling to app internals and matches the raised exception type.
-from app.core.exception_handlers import HTTPException +from fastapi import HTTPException
82-84: Also assert no background task is scheduled on failure.Prevents accidental scheduling when the document is invalid.
assert exc_info.value.status_code == 404 assert "Document not found" in str(exc_info.value.detail) + assert len(background_tasks.tasks) == 0
109-111: Likewise, ensure no background task is scheduled for deleted documents.assert exc_info.value.status_code == 404 assert "Document not found" in str(exc_info.value.detail) + assert len(background_tasks.tasks) == 0
121-123: Drop transformer registry patch here;start_jobdoesn't validate transformers.Keeps the test scoped to scheduling concerns and removes an unnecessary dependency on the registry.
- # Add the test transformer to the registry for this test - monkeypatch.setitem(TRANSFORMERS, "test", MockTestTransformer)
118-118: Remove unusedmonkeypatchparameter if you drop the registry patch.- monkeypatch,
155-157: Same: no need to patch the registry forstart_job.- # Add the test transformer to the registry for this test - monkeypatch.setitem(TRANSFORMERS, "test", MockTestTransformer)
152-152: Removemonkeypatchparam here if patching is dropped.- monkeypatch,backend/app/tests/core/doctransformer/test_service/utils.py (5)
17-21: Clarify docstring to reflect file-writing behavior.- A mock transformer for testing that returns a hardcoded lorem ipsum string. + A mock transformer for testing that writes a hardcoded lorem ipsum string + to the provided output_path and returns that path.
49-49: Use the bucket from the document URL to avoid config drift.This makes tests resilient if
AWS_S3_BUCKETchanges or documents point to a different bucket.- aws.client.put_object(Bucket=settings.AWS_S3_BUCKET, Key=s3_key, Body=content) + aws.client.put_object(Bucket=parsed_url.netloc, Key=s3_key, Body=content)
66-66: Same: verify content against the URL’s bucket.- Bucket=settings.AWS_S3_BUCKET, Key=transformed_key + Bucket=parsed_url.netloc, Key=transformed_key
76-85: Silence unused-args warnings and keep intent clear.- def failing_convert_document(*args, **kwargs): + def failing_convert_document(*_args, **_kwargs): nonlocal call_count call_count += 1 if call_count <= fail_count: raise Exception("Transient error") - output_path = args[1] if len(args) > 1 else kwargs.get("output_path") + output_path = _args[1] if len(_args) > 1 else _kwargs.get("output_path") if output_path: output_path.write_text("Success after retries", encoding="utf-8") return output_path raise ValueError("output_path is required")
95-96: Same for persistent failing helper.- def persistent_failing_convert_document(*args, **kwargs): + def persistent_failing_convert_document(*_args, **_kwargs): raise Exception(error_message)backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (6)
5-5: Modernize typing imports and annotations.
- Import
Callablefromcollections.abc.- Use built-in
tuple[...]instead oftyping.Tuple.-from typing import Any, Callable, Tuple +from typing import Any +from collections.abc import Callable ... - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], ... - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], ... - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project], ... - test_document: Tuple[Document, Project], + test_document: tuple[Document, Project],Also applies to: 32-32, 78-78, 120-120, 166-166
59-60: Avoid blind exception assertions; match the message.Improves signal and prevents false positives.
- with pytest.raises(Exception): + with pytest.raises(Exception, match="S3 upload failed"): fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown", )
148-154: Same here: assert the error message for exhausted retries.- with pytest.raises(Exception): + with pytest.raises(Exception, match="Persistent error"): fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown", )
198-204: Same for DB error during completion.- with pytest.raises(Exception): + with pytest.raises(Exception, match="Database error during document creation"): fast_execute_job( project_id=project.id, job_id=job.id, transformer_name="test", target_format="markdown", )
50-51: Preferpatch.dictwhen overriding registry entries.Makes intent explicit and avoids replacing the whole mapping accidentally.
- ), patch( - "app.core.doctransform.registry.TRANSFORMERS", {"test": MockTestTransformer} + ), patch.dict( + "app.core.doctransform.registry.TRANSFORMERS", {"test": MockTestTransformer}, clear=True ):
99-100: Apply the samepatch.dict(..., clear=True)pattern in other tests.- ), patch( - "app.core.doctransform.registry.TRANSFORMERS", {"test": MockTestTransformer} + ), patch.dict( + "app.core.doctransform.registry.TRANSFORMERS", {"test": MockTestTransformer}, clear=True ):Also applies to: 143-144
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
backend/app/core/doctransform/registry.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_execute_job.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_integration.py(1 hunks)backend/app/tests/core/doctransformer/test_service/test_start_job.py(1 hunks)backend/app/tests/core/doctransformer/test_service/utils.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py (5)
backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(14-91)create(19-30)read_one(32-51)backend/app/models/document.py (1)
Document(20-40)backend/app/tests/core/doctransformer/test_service/utils.py (8)
DocTransformTestBase(30-69)MockTestTransformer(16-27)create_failing_convert_document(72-87)create_persistent_failing_convert_document(90-98)setup_aws_s3(33-37)create_s3_document_content(39-50)failing_convert_document(76-85)persistent_failing_convert_document(95-96)backend/app/tests/conftest.py (1)
db(24-41)backend/app/tests/core/doctransformer/test_service/conftest.py (2)
test_document(73-79)fast_execute_job(33-51)
backend/app/tests/core/doctransformer/test_service/utils.py (5)
backend/app/core/cloud/storage.py (1)
AmazonCloudStorageClient(28-84)backend/app/core/doctransform/transformer.py (1)
Transformer(5-14)backend/app/models/document.py (1)
Document(20-40)backend/app/core/config.py (1)
AWS_S3_BUCKET(76-77)backend/app/tests/api/routes/documents/test_route_document_upload.py (2)
TestDocumentRouteUpload(64-103)test_adds_to_S3(82-103)
backend/app/core/doctransform/registry.py (3)
backend/app/core/doctransform/transformer.py (2)
Transformer(5-14)transform(9-14)backend/app/core/doctransform/zerox_transformer.py (2)
ZeroxTransformer(10-56)transform(18-56)backend/app/tests/core/doctransformer/test_service/utils.py (1)
transform(21-27)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py (7)
backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(14-91)create(19-30)read_one(32-51)backend/app/crud/document.py (1)
DocumentCrud(14-135)backend/app/core/doctransform/registry.py (1)
TransformationError(8-9)backend/app/core/doctransform/service.py (1)
execute_job(48-150)backend/app/tests/core/doctransformer/test_service/utils.py (5)
DocTransformTestBase(30-69)MockTestTransformer(16-27)setup_aws_s3(33-37)create_s3_document_content(39-50)verify_s3_content(52-69)backend/app/tests/conftest.py (1)
db(24-41)backend/app/tests/core/doctransformer/test_service/conftest.py (1)
fast_execute_job(33-51)
backend/app/tests/core/doctransformer/test_service/test_integration.py (8)
backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(14-91)read_one(32-51)create(19-30)backend/app/crud/document.py (1)
DocumentCrud(14-135)backend/app/core/doctransform/service.py (2)
execute_job(48-150)start_job(25-44)backend/app/models/document.py (1)
Document(20-40)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-27)TransformationStatus(9-13)backend/app/tests/core/doctransformer/test_service/utils.py (4)
DocTransformTestBase(30-69)MockTestTransformer(16-27)setup_aws_s3(33-37)create_s3_document_content(39-50)backend/app/tests/conftest.py (1)
db(24-41)backend/app/tests/core/doctransformer/test_service/conftest.py (3)
test_document(73-79)current_user(55-63)background_tasks(67-69)
backend/app/tests/core/doctransformer/test_service/test_start_job.py (7)
backend/app/core/doctransform/service.py (2)
execute_job(48-150)start_job(25-44)backend/app/models/document.py (1)
Document(20-40)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-27)TransformationStatus(9-13)backend/app/models/user.py (1)
UserProjectOrg(65-66)backend/app/tests/core/doctransformer/test_service/utils.py (2)
DocTransformTestBase(30-69)MockTestTransformer(16-27)backend/app/tests/conftest.py (1)
db(24-41)backend/app/tests/core/doctransformer/test_service/conftest.py (3)
current_user(55-63)test_document(73-79)background_tasks(67-69)
🪛 Ruff (0.12.2)
backend/app/tests/core/doctransformer/test_service/test_execute_job_errors.py
5-5: Import from collections.abc instead: Callable
Import from collections.abc
(UP035)
5-5: typing.Tuple is deprecated, use tuple instead
(UP035)
59-59: Do not assert blind exception: Exception
(B017)
148-148: Do not assert blind exception: Exception
(B017)
198-198: Do not assert blind exception: Exception
(B017)
backend/app/tests/core/doctransformer/test_service/utils.py
95-95: Unused function argument: args
(ARG001)
95-95: Unused function argument: kwargs
(ARG001)
backend/app/core/doctransform/registry.py
2-2: typing.Type is deprecated, use type instead
(UP035)
2-2: typing.Dict is deprecated, use dict instead
(UP035)
2-2: typing.Set is deprecated, use set instead
(UP035)
2-2: typing.Tuple is deprecated, use tuple instead
(UP035)
backend/app/tests/core/doctransformer/test_service/test_execute_job.py
4-4: Import from collections.abc instead: Callable
Import from collections.abc
(UP035)
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
145-145: Do not assert blind exception: Exception
(B017)
257-257: Loop control variable expected_content_type not used within loop body
Rename unused expected_content_type to _expected_content_type
(B007)
backend/app/tests/core/doctransformer/test_service/test_integration.py
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
102-102: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
171-171: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
backend/app/tests/core/doctransformer/test_service/test_start_job.py
4-4: typing.Tuple is deprecated, use tuple instead
(UP035)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
backend/app/api/routes/doc_transformation_job.py (1)
41-43: Previous issue resolved: jobs_not_found is now a listConversion to list fixes the response model mismatch flagged earlier.
🧹 Nitpick comments (2)
backend/app/api/routes/doc_transformation_job.py (2)
33-43: Preserve input order for jobs_not_found and dedupe deterministicallySet subtraction returns an arbitrary order. If clients compare arrays, this can be flaky. Keep caller order while deduping.
Apply:
-def get_multiple_transformation_jobs( +def get_multiple_transformation_jobs( session: SessionDep, current_user: CurrentUserOrgProject, - job_ids: list[UUID] = Query(description="List of transformation job IDs"), + job_ids: list[UUID] = Query( + description="List of transformation job IDs", + min_items=1 + ), ): crud = DocTransformationJobCrud(session, project_id=current_user.project_id) - jobs = crud.read_each(set(job_ids)) - jobs_not_found = set(job_ids) - {job.id for job in jobs} + # stable order + dedupe + seen: set[UUID] = set() + ordered_unique_ids: list[UUID] = [] + for jid in job_ids: + if jid not in seen: + seen.add(jid) + ordered_unique_ids.append(jid) + jobs = crud.read_each(seen) + found_ids = {job.id for job in jobs} + jobs_not_found = [jid for jid in ordered_unique_ids if jid not in found_ids] return APIResponse.success_response( - DocTransformationJobs(jobs=jobs, jobs_not_found=list(jobs_not_found)) + DocTransformationJobs(jobs=jobs, jobs_not_found=jobs_not_found) )
33-37: Optional: add an upper bound to job_ids to protect the DBLarge IN clauses can hurt performance. Consider enforcing a sane max (e.g., 500–1000) and returning 422 if exceeded.
Example:
- job_ids: list[UUID] = Query( + job_ids: list[UUID] = Query( description="List of transformation job IDs", - min_items=1 + min_items=1, + max_items=1000 ),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
backend/app/api/routes/doc_transformation_job.py(1 hunks)backend/app/tests/api/routes/test_doc_transformation_job.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/tests/api/routes/test_doc_transformation_job.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/api/routes/doc_transformation_job.py (3)
backend/app/crud/doc_transformation_job.py (3)
DocTransformationJobCrud(14-91)read_one(32-51)read_each(53-67)backend/app/models/doc_transformation_job.py (2)
DocTransformationJob(16-27)DocTransformationJobs(30-32)backend/app/utils.py (2)
APIResponse(27-48)success_response(34-37)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: checks (3.11.7, 6)
🔇 Additional comments (1)
backend/app/api/routes/doc_transformation_job.py (1)
13-25: Single job endpoint looks goodCRUD enforces project scoping and 404s; response wrapping is correct.
Summary
Partners often deal with scanned PDFs and document formats that are not amenable to AI services like OpenAI's vector stores. This results in poor RAG (Retrieval-Augmented Generation) performance and limits the platform's utility for AI-powered document processing.
This PR implements a foundational document transformation pipeline that enhances the /documents/upload endpoint with on-demand, pluggable document conversion capabilities. Users can now prepare documents for optimal RAG performance directly within the platform by converting documents to LLM-friendly formats.
Key Features Changes
Checklist
Before submitting a pull request, please ensure that you mark these task.
fastapi run --reload app/main.pyordocker compose upin the repository root and test.Summary by CodeRabbit
New Features
Documentation
Chores
Tests