Refactor Document Module and Cloud Storage to use project-based organization and add signed URL support #346

avirajsingh7 · 2025-08-25T05:03:17Z

Summary

Target issuen: #85
This PR refactors the document module and cloud storage system to organize files by project rather than user, adds support for generating signed URLs for secure file access, and implements a project-based storage path structure.

Database Schema Updates

Added storage_path UUID column to the project table for unique project storage identification

Storage Architecture Refactor

Before: Files stored under user ID paths (/user_id/file)
After: Files stored under project storage paths (/project_storage_path/file)

API Improvements

Updated document endpoints to use project-based access control
Added soft delete functionality with is_deleted flag
Enhanced error handling and logging with project context
Updated response models (DocumentPublic vs internal Document)

Migration Impact

Existing documents will need to be migrated from user-based to project-based storage
All document access is now scoped to projects rather than individual users
Storage paths are now organized by unique project UUIDs for better isolation

Checklist

Before submitting a pull request, please ensure that you mark these task.

Ran fastapi run --reload app/main.py or docker compose up in the repository root and test.
If you've fixed a bug or added code that is tested and has test cases.

Notes

Please add here if any other information is required for the reviewer.

Summary by CodeRabbit

New Features
- Project-scoped document storage and access; optional signed download URLs; public document representation for API responses.
Refactor
- Endpoints now return public document views; delete endpoints use DELETE and return confirmation messages; documents are soft-deleted and omitted from listings by default; ownership scoped to projects.
Chores
- DB migration adds per-project storage identifiers and migrates document ownership to projects.
Tests
- Test suite updated for project-scoped fixtures and X-API-KEY API key usage.

coderabbitai · 2025-08-25T05:03:27Z

Walkthrough

Models, storage, CRUD, API routes, and tests were migrated from user-scoped to project-scoped ownership: documents now reference project_id and project.storage_path; public response models (DocumentPublic, Message) added; CloudStorage gained project-aware APIs (signed URL, size, delete); soft-delete (is_deleted) and an Alembic migration were added.

Changes

Cohort / File(s)	Summary
API Routes: Collections `backend/app/api/routes/collections.py`	Import `DocumentPublic`; collection documents endpoint now returns `APIResponse[List[DocumentPublic]]`; storage/CRUD initialized via `get_cloud_storage(..., project_id)` and collection operations use project scope.
API Routes: Documents `backend/app/api/routes/documents.py`	Endpoints switched to `CurrentUserOrgProject`; responses use `DocumentPublic`/`Message`; `doc_info` adds `include_url: bool`; delete endpoints use `DELETE`; per-project `get_cloud_storage` used and signed URLs attached when requested.
Cloud Storage `backend/app/core/cloud/storage.py`, `backend/app/core/cloud/__init__.py`	`CloudStorage`/`AmazonCloudStorage` take `project_id` and `storage_path: UUID`; `put` uses relative `file_path`; added `get_signed_url`, `get_file_size_kb`, `delete`; added `get_cloud_storage` and re-exported it.
CRUD: Document `backend/app/crud/document.py`	`DocumentCrud` ctor and queries now use `project_id`; reads filter out `is_deleted`; `delete` performs soft-delete (`is_deleted=True`); ownership checks reference `project_id`.
Models: Document / Public `backend/app/models/document.py`, `backend/app/models/__init__.py`	Added `DocumentBase` (project_id, fname); `Document` inherits it and adds id/timestamps/is_deleted; removed `owner_id` and owner relationship; added and exported `DocumentPublic`.
Models: Project / User `backend/app/models/project.py`, `backend/app/models/user.py`	`Project` gains `storage_path: UUID` (unique, default uuid4); removed `User.documents` relationship.
DB Migration `backend/app/alembic/versions/...40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py`	Migration adds `project.storage_path`, `document.project_id`, `document.is_deleted`; backfills storage_path and project_id, drops `owner_id` FK/column, creates FK to `project.id` (ON DELETE CASCADE); includes downgrade.
Tests: Routes & CRUD `backend/app/tests/api/routes/`, `backend/app/tests/crud/`	Tests updated to use `APIKeyPublic` and `X-API-KEY`, project-scoped `DocumentStore`/`DocumentMaker` requiring `project_id`, deletion assertions check `is_deleted`, upload tests read object_store_url from DB.
Test Utilities `backend/app/tests/utils/document.py`, `backend/app/tests/utils/collection.py`, `backend/app/tests/utils/*`	`DocumentMaker`/`DocumentStore` now require `project_id` and `session`; use `project.storage_path` for object keys; `WebCrawler` uses `APIKeyPublic` and sends `X-API-KEY`; `DocumentComparator` adds `to_public_dict()` and compares to `DocumentPublic`; `get_collection` accepts optional `owner_id`.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant API as Documents API
  participant Auth as CurrentUserOrgProject
  participant Proj as ProjectService
  participant Store as AmazonCloudStorage(project_id)
  participant CRUD as DocumentCrud(project_id)

  Client->>API: POST /docs/upload (file, X-API-KEY)
  API->>Auth: resolve user + project_id
  API->>Proj: get_project_by_id(project_id)
  Proj-->>API: Project(storage_path)
  API->>Store: put(file, path=storage_path/...)
  Store-->>API: object_store_url
  API->>CRUD: create(document with project_id, object_store_url)
  CRUD-->>API: Document
  API-->>Client: APIResponse(DocumentPublic)

sequenceDiagram
  autonumber
  actor Client
  participant API as Documents API
  participant Auth as CurrentUserOrgProject
  participant CRUD as DocumentCrud(project_id)
  participant Store as AmazonCloudStorage(project_id)

  Client->>API: GET /docs/{id}?include_url=true (X-API-KEY)
  API->>Auth: resolve user + project_id
  API->>CRUD: read_one(id) [filters is_deleted=False]
  CRUD-->>API: Document
  alt include_url == true
    API->>Store: get_signed_url(object_store_url, expires_in)
    Store-->>API: signed_url
  end
  API-->>Client: APIResponse(DocumentPublic with optional signed_url)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–90 minutes

Possibly related PRs

Credentials: Fetching ID from API key #354 — Overlaps migration to project-scoped current-user types and route ownership changes.
Collection: Standardized OpenAI Creds Fetching #258 — Similar updates to collection/document route handlers and signatures.
Testing : API Key Fixture #282 — Related test fixture and API-key header changes (X-API-KEY, APIKeyPublic).

Suggested reviewers

AkhileshNegi
nishika26
kartpop

Poem

I hopped from owner burrows to a project field so wide,
Soft deletes like carrot crumbs now safely hide.
Signed links twinkle down the S3 trail,
Public docs peek where project IDs prevail.
storage_path planted — projects stride. 🐇✨

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch update/document_module

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

backend/app/crud/document.py (1)

46-56: Fix log context keys: use project_id (not owner_id) in error logs

The structured log payload still uses 'owner_id'. Replace with 'project_id'.

Apply this diff:

-                    logger.error(
-                        f"[DocumentCrud.read_many] Invalid skip value | {{'owner_id': {self.project_id}, 'skip': {skip}, 'error': '{str(err)}'}}",
+                    logger.error(
+                        f"[DocumentCrud.read_many] Invalid skip value | {{'project_id': {self.project_id}, 'skip': {skip}, 'error': '{str(err)}'}}",
@@
-                    logger.error(
-                        f"[DocumentCrud.read_many] Invalid limit value | {{'owner_id': {self.project_id}, 'limit': {limit}, 'error': '{str(err)}'}}",
+                    logger.error(
+                        f"[DocumentCrud.read_many] Invalid limit value | {{'project_id': {self.project_id}, 'limit': {limit}, 'error': '{str(err)}'}}",

Also applies to: 58-68

🧹 Nitpick comments (21)

backend/app/models/project.py (1)
1-1: Nit: remove unused import UUID

Only uuid4 is used.

Apply this diff:
-from uuid import UUID, uuid4
+from uuid import uuid4
backend/app/alembic/versions/42c4c8b22a09_add_is_deleted_column_in_document.py (1)
8-11: Remove unused import

sqlmodel.sql.sqltypes is not used.

Apply this diff:
-from alembic import op
-import sqlalchemy as sa
-import sqlmodel.sql.sqltypes
+from alembic import op
+import sqlalchemy as sa
backend/app/crud/document.py (3)
42-45: Optional: add deterministic ordering to read_many

Consider ordering results (e.g., newest first) for stable pagination.

Apply this diff:
-        statement = select(Document).where(
+        statement = select(Document).where(
             and_(Document.project_id == self.project_id, Document.is_deleted.is_(False))
-        )
+        ).order_by(Document.inserted_at.desc())
Also applies to: 70-71

83-91: Tighten error message grammar and detail in read_each

Minor but user-facing in logs; fix “atleast” and pluralization, and include missing IDs for easier debugging.

Apply this diff:
-                raise ValueError(
-                    f"Requested atleast {requested_count} document retrieved {retrieved_count}"
-                )
+                missing = set(doc_ids) - {d.id for d in results}
+                raise ValueError(
+                    f"Requested at least {requested_count} documents, retrieved {retrieved_count}. Missing: {sorted(missing)}"
+                )
98-106: Clarify ownership error message

The placeholders are reversed and “attempter” is awkward. Make the message unambiguous.

Apply this diff:
-        elif document.project_id != self.project_id:
-            error = "Invalid document ownership: project={} attempter={}".format(
-                self.project_id,
-                document.project_id,
-            )
+        elif document.project_id != self.project_id:
+            error = "Invalid document ownership: document.project_id={} request.project_id={}".format(
+                document.project_id,
+                self.project_id,
+            )
backend/app/models/document.py (3)
10-16: Index project_id for query performance

Most queries filter by project_id and is_deleted. Adding an index on project_id (and optionally a composite (project_id, is_deleted)) will speed reads.

Apply this diff (minimal):
-    project_id: int = Field(
+    project_id: int = Field(
         description="The ID of the project to which the document belongs",
         foreign_key="project.id",
-        nullable=False,
+        nullable=False,
+        index=True,
         ondelete="CASCADE",
     )
If you prefer a composite index, we can add it via Alembic instead.

37-47: Enable from-attributes mapping for DocumentPublic (SQLModel/Pydantic v2)

When returning ORM instances as DocumentPublic, set model_config so model_validate(..., from_attributes=True) works cleanly.

Apply this diff:
 class DocumentPublic(DocumentBase):
     id: UUID = Field(description="The unique identifier of the document")
     signed_url: str | None = Field(
         default=None, description="A signed URL for accessing the document"
     )
     inserted_at: datetime = Field(
         description="The timestamp when the document was inserted"
     )
     updated_at: datetime = Field(
         description="The timestamp when the document was last updated"
     )
+    # Pydantic v2 / SQLModel v0.0.22+
+    model_config = {"from_attributes": True}
And add this import once at the top if you prefer a typed config:
# optional typed config
from pydantic import ConfigDict
# ...
model_config = ConfigDict(from_attributes=True)
37-47: Confirm public exposure of project_id is intentional

DocumentPublic inherits project_id from DocumentBase. If APIs are always project-scoped, returning project_id may be redundant and could leak internal IDs to clients. Keep if needed; otherwise consider a slimmer public DTO without project_id.
backend/app/models/__init__.py (1)
5-5: Silence Ruff F401 for package re-exports

In init.py, importing for re-export triggers F401. Easiest is to add a noqa on the line.

Apply this diff:
-from .document import Document, DocumentPublic
+from .document import Document, DocumentPublic  # noqa: F401
Alternative: define all and include these names explicitly.
backend/app/alembic/versions/b79fc198879a_add_storage_path_column_in_project_table.py (1)
10-10: Remove unused import.

sqlmodel.sql.sqltypes is not used.
-import sqlmodel.sql.sqltypes
backend/app/api/routes/collections.py (2)
27-27: Drop unused import Document.

Document isn’t referenced in this module. Keeps lints clean.
-from app.models import Collection, Document, DocumentPublic
+from app.models import Collection, DocumentPublic
426-439: Response model now uses DocumentPublic — filter out soft-deleted docs and prefer built-in list typing.

The route returns DocumentPublic, but DocumentCollectionCrud.read doesn’t currently exclude Document.is_deleted documents or enforce project scoping. You can prevent leaking soft-deleted or cross-project docs by filtering in the CRUD.

Prefer list over List per modern typing style.
-    response_model=APIResponse[List[DocumentPublic]],
+    response_model=APIResponse[list[DocumentPublic]],
Outside this file (CRUD suggestion for backend/app/crud/document_collection.py):
# inside DocumentCollectionCrud.read(...)
statement = (
    select(Document)
    .join(DocumentCollection, DocumentCollection.document_id == Document.id)
    .where(
        and_(
            DocumentCollection.collection_id == collection.id,
            Document.is_deleted.is_(False),
            Document.project_id == collection.project_id,  # ensure project-scoping
        )
    )
)
backend/app/api/routes/documents.py (5)
1-6: Adopt built-in list typing and remove typing.List import.

Modern Python favors list[...] over List[...]. Also drop the now-unused import.
-from typing import List
+from typing import List  # remove
@@
-    response_model=APIResponse[List[DocumentPublic]],
+    response_model=APIResponse[list[DocumentPublic]],
@@
-    response_model=APIResponse[DocumentPublic],
+    response_model=APIResponse[DocumentPublic],
@@
-    response_model=APIResponse[DocumentPublic],
+    response_model=APIResponse[DocumentPublic],
And actually remove the import:
-from typing import List
Also applies to: 20-24, 123-126

4-4: Build S3 keys with POSIX semantics to avoid backslashes on Windows.

pathlib.Path can emit backslashes on Windows, which becomes part of the S3 key. Use PurePosixPath to guarantee forward slashes in object keys.
-from pathlib import Path
+from pathlib import PurePosixPath
@@
-    key = Path(str(project.storage_path), str(document_id))
+    key = PurePosixPath(str(project.storage_path)) / str(document_id)
Also applies to: 52-54

46-54: Validate project existence before upload — good. Consider early content-type/size checks.

404 on missing project is correct. As a follow-up, consider validating src.content_type and limiting size before uploading to avoid wasted bandwidth.

84-85: Remove unused variable.

The result from c_crud.delete(document, a_crud) is not used.
-    data = c_crud.delete(document, a_crud)
+    c_crud.delete(document, a_crud)
106-116: “Permanent” delete still soft-deletes DB row. Consider adding a hard-delete path.

You delete the S3 object, but d_crud.delete(doc_id) only marks the document as deleted. If “permanent” should also remove the DB row, add a purge to DocumentCrud (hard delete) and call it here.

Outside this file (in backend/app/crud/document.py):
def purge(self, doc_id: UUID):
    document = self.read_one(doc_id)
    self.session.delete(document)
    self.session.commit()
    logger.info(f"[DocumentCrud.purge] Document hard-deleted | {{'doc_id': '{doc_id}', 'project_id': {self.project_id}}}")
Then update this route to:
c_crud.delete(document, a_crud)
storage.delete(document.object_store_url)
d_crud.purge(doc_id)
Also applies to: 117-119
backend/app/core/cloud/storage.py (4)
13-13: Remove unused import.

UserProjectOrg isn’t referenced.
-from app.models import UserProjectOrg
110-116: Align abstract put signature with subclass.

Base class uses basename: str, subclass uses key: Path. Aligning keeps contracts consistent and helps tooling.
-    def put(self, source: UploadFile, basename: str):
+    def put(self, source: UploadFile, key: Path):
         raise NotImplementedError()
126-141: Normalize S3 key formatting with as_posix().

Ensure forward slashes regardless of OS.
-    def put(self, source: UploadFile, key: Path) -> SimpleStorageName:
-        destination = SimpleStorageName(str(key))
+    def put(self, source: UploadFile, key: Path) -> SimpleStorageName:
+        destination = SimpleStorageName(key.as_posix())
190-216: Presigned URL implementation — solid. Consider validating scheme.

Implementation looks correct. As a minor hardening, validate that url has s3 scheme before presigning and raise a clear error otherwise.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8c681d3 and 527c83d.

📒 Files selected for processing (10)

backend/app/alembic/versions/42c4c8b22a09_add_is_deleted_column_in_document.py (1 hunks)
backend/app/alembic/versions/b79fc198879a_add_storage_path_column_in_project_table.py (1 hunks)
backend/app/api/routes/collections.py (3 hunks)
backend/app/api/routes/documents.py (6 hunks)
backend/app/core/cloud/storage.py (7 hunks)
backend/app/crud/document.py (6 hunks)
backend/app/models/__init__.py (1 hunks)
backend/app/models/document.py (1 hunks)
backend/app/models/project.py (2 hunks)
backend/app/models/user.py (0 hunks)

💤 Files with no reviewable changes (1)

backend/app/models/user.py

🧰 Additional context used

🧬 Code graph analysis (9)

backend/app/alembic/versions/b79fc198879a_add_storage_path_column_in_project_table.py (2)

backend/app/alembic/versions/99f4fc325617_add_organization_project_setup.py (1)

upgrade (20-72)

backend/app/alembic/versions/d98dd8ec85a3_edit_replace_id_integers_in_all_models_.py (1)

upgrade (21-73)

backend/app/models/project.py (3)

backend/app/alembic/versions/d98dd8ec85a3_edit_replace_id_integers_in_all_models_.py (1)

upgrade (21-73)

backend/app/alembic/versions/66abc97f3782_user_id_from_uuid_to_int.py (1)

upgrade (20-152)

backend/app/alembic/versions/0f205e3779ee_add_api_key_table.py (1)

upgrade (20-39)

backend/app/models/__init__.py (1)

backend/app/models/document.py (2)

Document (19-34)

DocumentPublic (37-47)

backend/app/alembic/versions/42c4c8b22a09_add_is_deleted_column_in_document.py (4)

backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)

upgrade (20-36)

backend/app/alembic/versions/8e7dc5eab0b0_add_fk_constraint_to_user_id_columns.py (1)

upgrade (20-42)

backend/app/alembic/versions/60b6c511a485_add_project_id_in_api_key_table.py (1)

upgrade (20-47)

backend/app/alembic/versions/99f4fc325617_add_organization_project_setup.py (1)

upgrade (20-72)

backend/app/api/routes/collections.py (5)

backend/app/models/document.py (2)

Document (19-34)

DocumentPublic (37-47)

backend/app/core/cloud/storage.py (1)

AmazonCloudStorage (121-232)

backend/app/crud/document.py (1)

DocumentCrud (14-135)

backend/app/utils.py (1)

APIResponse (27-48)

backend/app/crud/document_collection.py (2)

DocumentCollectionCrud (8-48)

create (12-23)

backend/app/crud/document.py (5)

backend/app/crud/collection.py (3)

read_one (84-93)

delete (107-115)

CollectionCrud (17-146)

backend/app/models/document.py (1)

Document (19-34)

backend/app/tests/crud/documents/test_crud_document_delete.py (2)

document (13-21)

TestDatabaseDelete (24-44)

backend/app/tests/crud/documents/test_crud_document_update.py (2)

TestDatabaseUpdate (15-67)

test_update_respects_owner (57-67)

backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)

upgrade (20-36)

backend/app/models/document.py (6)

backend/app/core/util.py (1)

now (13-14)

backend/app/models/document_collection.py (1)

DocumentCollection (9-23)

backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)

upgrade (20-36)

backend/app/models/user.py (1)

User (48-60)

backend/app/tests/utils/document.py (1)

DocumentMaker (30-48)

backend/app/tests/crud/documents/test_crud_document_update.py (1)

TestDatabaseUpdate (15-67)

backend/app/api/routes/documents.py (7)

backend/app/crud/document.py (4)

DocumentCrud (14-135)

read_many (37-71)

delete (125-135)

read_one (19-35)

backend/app/crud/collection.py (3)

CollectionCrud (17-146)

delete (107-115)

read_one (84-93)

backend/app/crud/project.py (1)

get_project_by_id (26-28)

backend/app/models/document.py (2)

Document (19-34)

DocumentPublic (37-47)

backend/app/models/message.py (1)

Message (5-6)

backend/app/utils.py (2)

APIResponse (27-48)

success_response (34-37)

backend/app/core/cloud/storage.py (5)

AmazonCloudStorage (121-232)

put (114-115)

put (126-150)

delete (217-232)

get_signed_url (190-215)

backend/app/core/cloud/storage.py (1)

backend/app/models/user.py (1)

UserProjectOrg (65-66)

🪛 Ruff (0.12.2)

backend/app/alembic/versions/b79fc198879a_add_storage_path_column_in_project_table.py

10-10: sqlmodel.sql.sqltypes imported but unused

Remove unused import: sqlmodel.sql.sqltypes

(F401)

backend/app/models/__init__.py

5-5: .document.Document imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

5-5: .document.DocumentPublic imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

backend/app/alembic/versions/42c4c8b22a09_add_is_deleted_column_in_document.py

10-10: sqlmodel.sql.sqltypes imported but unused

Remove unused import: sqlmodel.sql.sqltypes

(F401)

backend/app/api/routes/collections.py

27-27: app.models.Document imported but unused

Remove unused import: app.models.Document

(F401)

426-426: Use list instead of List for type annotation

Replace with list

(UP006)

backend/app/api/routes/documents.py

23-23: Use list instead of List for type annotation

Replace with list

(UP006)

84-84: Local variable data is assigned to but never used

Remove assignment to unused variable data

(F841)

backend/app/core/cloud/storage.py

13-13: app.models.UserProjectOrg imported but unused

Remove unused import: app.models.UserProjectOrg

(F401)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: checks (3.11.7, 6)

🔇 Additional comments (5)

backend/app/crud/document.py (2)

19-25: LGTM: read_one adds correct project-scoped + soft-delete filters

The query adds both project_id equality and is_deleted=False. This matches the new ownership model and soft delete semantics.

125-136: LGTM: delete implements soft-deletion via is_deleted/deleted_at

Soft delete flows through update() and logs context. Behavior aligns with list/read filters.

backend/app/api/routes/collections.py (1)

228-230: Project-scoped storage and CRUD initialization — good change.

Switching to AmazonCloudStorage(current_user.project_id) and DocumentCrud(session, current_user.project_id) aligns the collections flow with the new per-project storage model.

backend/app/api/routes/documents.py (2)

31-33: Project-scoped listing — good alignment.

DocumentCrud(session, current_user.project_id) ensures list operations respect project boundaries and is_deleted filtering via the CRUD. Looks good.

131-144: Signed URL support in doc_info — looks correct.

The DocumentPublic.model_validate(..., from_attributes=True) conversion and conditional presign are clean. Good defaults.

backend/app/alembic/versions/42c4c8b22a09_add_is_deleted_column_in_document.py

backend/app/alembic/versions/b79fc198879a_add_storage_path_column_in_project_table.py

backend/app/crud/document.py

backend/app/models/project.py

backend/app/alembic/versions/42c4c8b22a09_add_is_deleted_column_in_document.py

backend/app/api/routes/documents.py

codecov · 2025-08-28T06:19:36Z

Codecov Report

❌ Patch coverage is 91.90476% with 17 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
backend/app/core/cloud/storage.py	66.66%	15 Missing ⚠️
backend/app/api/routes/documents.py	89.47%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (1)
18-24: Remove imported crawler fixture to avoid shadowing parameter

Importing crawler and also using a parameter named crawler can trip linters (F811) and is unnecessary.
 from app.tests.utils.document import (
     DocumentStore,
     DocumentMaker,
     Route,
     WebCrawler,
-    crawler,
     crawler,
 )
Replace with:
 from app.tests.utils.document import (
     DocumentStore,
     DocumentMaker,
     Route,
     WebCrawler,
-    crawler,
+)
backend/app/tests/api/routes/documents/test_route_document_info.py (1)
4-12: Fix linter error: imported crawler shadows the fixture parameter

Ruff flags F811 (redefinition/shadowing). Drop the import; rely on the fixture parameter.
 from app.tests.utils.document import (
     DocumentComparator,
     DocumentMaker,
     DocumentStore,
     Route,
     WebCrawler,
-    crawler,
     httpx_to_standard,
 )
backend/app/tests/api/routes/documents/test_route_document_upload.py (1)
80-81: Don’t persist absolute local file paths as fname.
Storing full paths can leak local env/PII; store basename only. Update test expectation accordingly (and ensure route normalizes to basename).

Apply this diff:
-        assert result.fname == str(scratch)
+        assert result.fname == Path(scratch).name
If the route currently stores the full path, normalize it server-side using Path(file.filename).name before persisting. I can provide a patch if desired.
backend/app/tests/utils/document.py (1)
136-150: Fix incorrect @singledispatchmethod usage (don’t combine with @staticmethod).

Current implementation will raise due to wrong signature. Accept self and register typed variants accordingly.
-    @ft.singledispatchmethod
-    @staticmethod
-    def to_string(value):
-        return value
+    @ft.singledispatchmethod
+    def to_string(self, value):
+        return value
@@
-    @to_string.register
-    @staticmethod
-    def _(value: UUID):
-        return str(value)
+    @to_string.register
+    def _(self, value: UUID):
+        return str(value)
@@
-    @to_string.register
-    @staticmethod
-    def _(value: datetime):
-        return value.isoformat()
+    @to_string.register
+    def _(self, value: datetime):
+        return value.isoformat()

🧹 Nitpick comments (20)

backend/app/tests/crud/collections/test_crud_collection_read_one.py (1)
8-8: Remove unused import: settings

Keeps tests lint-clean.
-from app.core.config import settings
backend/app/tests/crud/collections/test_crud_collection_delete.py (2)
61-62: Prefer .is_(False) for SQLAlchemy boolean comparisons

Avoid E712 and generate correct SQL.
-stmt = select(APIKey).where(
-    APIKey.project_id == project.id, APIKey.is_deleted == False
-)
+stmt = select(APIKey).where(
+    APIKey.project_id == project.id, APIKey.is_deleted.is_(False)
+)
6-6: Remove unused import: settings

Keep lint green.
-from app.core.config import settings
backend/app/tests/crud/collections/test_crud_collection_read_all.py (2)
8-8: Add missing import for DocumentMaker

Required if adopting the proposed fix.
-from app.tests.utils.document import DocumentStore
+from app.tests.utils.document import DocumentStore, DocumentMaker
29-33: Fixture signature looks wrong and unused

This fixture takes self (invalid for pytest fixtures) and appears unused. Remove or fix to avoid confusion.
-@pytest.fixture(scope="class")
-def refresh(self, db: Session):
-    db.query(Collection).delete()
-    db.commit()
+@pytest.fixture(scope="class")
+def refresh(db: Session):
+    db.query(Collection).delete()
+    db.commit()
If unused, delete the fixture entirely.
backend/app/tests/crud/documents/test_crud_document_read_one.py (2)
40-43: Avoid dependency on a hard-coded project name ("Dalgo").
This can flake if such a project isn’t seeded. Use a guaranteed different project_id without relying on DB seed data.

Apply this diff:
-        other_project = get_project(db, name="Dalgo")  # Different project
-
-        crud = DocumentCrud(db, other_project.id)
+        other_project_id = store.project.id + 1  # Different project
+        crud = DocumentCrud(db, other_project_id)
38-48: Add coverage for soft-deleted read behavior.
Reading a soft-deleted document should 404; add a test.

Example:
def test_cannot_read_soft_deleted_document(self, db: Session, store: DocumentStore):
    doc = store.put()
    crud = DocumentCrud(db, store.project.id)
    crud.delete(doc.id)
    with pytest.raises(HTTPException) as exc_info:
        crud.read_one(doc.id)
    assert exc_info.value.status_code == 404
backend/app/tests/api/routes/documents/test_route_document_remove.py (2)
57-66: Assert deleted_at as well to fully validate soft delete.
Currently only checks is_deleted.

Apply this diff:
             result = db.exec(statement).one()

-            assert result.is_deleted is True
+            assert result.is_deleted is True
+            assert result.deleted_at is not None
83-87: Strengthen assertion on unknown-document removal.
Also verify 404 and error detail.

Apply this diff:
-            response = crawler.delete(route.append(next(maker)))
-
-            assert response.is_error
+            response = crawler.delete(route.append(next(maker)))
+            assert response.status_code == 404
+            body = response.json()
+            assert body.get("detail") == "Document not found"
backend/app/tests/api/routes/documents/test_route_document_upload.py (2)
93-104: Add S3-specific assertions for scheme and bucket.
Makes the storage contract explicit.

Apply this diff:
-        url = urlparse(result.object_store_url)
+        url = urlparse(result.object_store_url)
+        assert url.scheme == "s3"
+        assert url.netloc == settings.AWS_S3_BUCKET
         key = Path(url.path)
         key = key.relative_to(key.root)
Optionally, also assert the key prefix matches the project’s storage_path.

62-64: Missing tests for signed URL support.
Add an API test that requests a signed GET URL for an uploaded document and validates: (a) 200 response, (b) URL host is the configured S3 endpoint, (c) query contains X-Amz-Signature and X-Amz-Expires, (d) access is denied after expiry in a time-mocked context.

I can draft this test against your current route shape if you share the endpoint signature.
backend/app/tests/crud/documents/test_crud_document_update.py (3)
52-58: Rename tests to reflect “project” terminology.
The implementation moved away from “owner”; align test names.

Apply this diff:
-def test_update_sets_default_owner(
+def test_update_sets_default_project(
@@
-def test_update_respects_owner(
+def test_update_respects_project(
65-70: Valid negative ownership check.
For robustness, consider creating a real second project instead of +1 id if/when fixtures are available.

19-20: Leftover “owner_id” in DocumentCrud logs.
Log fields still reference owner_id; rename to project_id for consistency across the refactor.

Proposed patch (in app/crud/document.py):
-                        f"[DocumentCrud.read_many] Invalid skip value | {{'owner_id': {self.project_id}, 'skip': {skip}, 'error': '{str(err)}'}}",
+                        f"[DocumentCrud.read_many] Invalid skip value | {{'project_id': {self.project_id}, 'skip': {skip}, 'error': '{str(err)}'}}",
@@
-                        f"[DocumentCrud.read_many] Invalid limit value | {{'owner_id': {self.project_id}, 'limit': {limit}, 'error': '{str(err)}'}}",
+                        f"[DocumentCrud.read_many] Invalid limit value | {{'project_id': {self.project_id}, 'limit': {limit}, 'error': '{str(err)}'}}",
backend/app/tests/crud/documents/test_crud_document_delete.py (1)

37-47: Cross-project delete correctly yields 404.
Good negative test for access scoping.

Consider adding a test that a second delete on the same id returns 404 to confirm idempotent soft-delete behavior.
backend/app/tests/crud/documents/test_crud_document_read_many.py (2)
12-16: Fail fast if project lookup returns None in the fixture.

Add an assertion so the fixture errors early with a clear message when no active project is found (avoids later AttributeError on store.project.id).
     project = get_project(db)
     ds = DocumentStore(db, project.id)
     ds.fill(TestDatabaseReadMany._ndocs)
+    assert ds.project is not None, "DocumentStore.project not initialized"
27-27: DRY: provide a crud fixture instead of re-creating DocumentCrud in each test.

Reduces duplication and keeps test setup consistent.
 @pytest.fixture
 def store(db: Session):
     project = get_project(db)
     ds = DocumentStore(db, project.id)
     ds.fill(TestDatabaseReadMany._ndocs)
     return ds
+
+@pytest.fixture
+def crud(db: Session, store: DocumentStore):
+    return DocumentCrud(db, store.project.id)
Then update tests to accept crud and remove local instantiations:
-        crud = DocumentCrud(db, store.project.id)
         docs = crud.read_many()
(Apply similarly in other tests.)

Also applies to: 44-44, 55-55, 64-64, 73-73, 82-82, 93-93, 101-101, 110-110
backend/app/tests/utils/document.py (3)
27-31: Type the lazy-loaded project as optional.

Clarifies intent and avoids static analysis warnings.
-        self.project: Project = None
+        self.project: Project | None = None
43-43: Align file extension between key and fname for consistency.

Using different extensions is confusing in tests and storage expectations.
-        key = f"{self.project.storage_path}/{doc_id}.txt"
+        key = f"{self.project.storage_path}/{doc_id}.xyz"
(Alternatively, switch both to .txt if downstream expects that.)

Also applies to: 49-52

155-156: Minor: redundant dict() wrapper.

to_public_dict() already returns a dict.
-        this = dict(self.to_public_dict())
+        this = self.to_public_dict()

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 527c83d and a27d551.

📒 Files selected for processing (18)

backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py (1 hunks)
backend/app/models/project.py (2 hunks)
backend/app/tests/api/routes/collections/test_create_collections.py (3 hunks)
backend/app/tests/api/routes/documents/test_route_document_info.py (3 hunks)
backend/app/tests/api/routes/documents/test_route_document_list.py (2 hunks)
backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (2 hunks)
backend/app/tests/api/routes/documents/test_route_document_remove.py (3 hunks)
backend/app/tests/api/routes/documents/test_route_document_upload.py (5 hunks)
backend/app/tests/crud/collections/test_crud_collection_create.py (1 hunks)
backend/app/tests/crud/collections/test_crud_collection_delete.py (2 hunks)
backend/app/tests/crud/collections/test_crud_collection_read_all.py (1 hunks)
backend/app/tests/crud/collections/test_crud_collection_read_one.py (1 hunks)
backend/app/tests/crud/documents/test_crud_document_delete.py (2 hunks)
backend/app/tests/crud/documents/test_crud_document_read_many.py (9 hunks)
backend/app/tests/crud/documents/test_crud_document_read_one.py (2 hunks)
backend/app/tests/crud/documents/test_crud_document_update.py (3 hunks)
backend/app/tests/utils/collection.py (1 hunks)
backend/app/tests/utils/document.py (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

backend/app/models/project.py

🧰 Additional context used

🧬 Code graph analysis (16)

backend/app/tests/utils/collection.py (4)

backend/app/models/collection.py (1)

Collection (21-54)

backend/app/tests/utils/utils.py (1)

get_user_id_by_email (52-54)

backend/app/crud/collection.py (3)

__init__ (18-20)

read_one (84-93)

CollectionCrud (17-146)

backend/app/api/routes/collections.py (1)

collection_info (399-406)

backend/app/tests/crud/collections/test_crud_collection_read_all.py (3)

backend/app/tests/crud/documents/test_crud_document_read_many.py (1)

store (11-16)

backend/app/tests/crud/documents/test_crud_document_read_one.py (1)

store (13-15)

backend/app/tests/utils/document.py (2)

DocumentStore (55-82)

fill (81-82)

backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py (4)

backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)

upgrade (20-36)

backend/app/alembic/versions/60b6c511a485_add_project_id_in_api_key_table.py (1)

upgrade (20-47)

backend/app/alembic/versions/0f205e3779ee_add_api_key_table.py (1)

upgrade (20-39)

backend/app/alembic/versions/99f4fc325617_add_organization_project_setup.py (1)

upgrade (20-72)

backend/app/tests/api/routes/documents/test_route_document_list.py (4)

backend/app/tests/crud/documents/test_crud_document_read_many.py (1)

store (11-16)

backend/app/tests/crud/documents/test_crud_document_read_one.py (1)

store (13-15)

backend/app/tests/utils/document.py (2)

DocumentStore (55-82)

crawler (171-172)

backend/app/tests/conftest.py (2)

db (24-41)

user_api_key (89-91)

backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (2)

backend/app/tests/utils/document.py (3)

DocumentStore (55-82)

crawler (171-172)

DocumentMaker (26-52)

backend/app/tests/conftest.py (2)

db (24-41)

user_api_key (89-91)

backend/app/tests/crud/collections/test_crud_collection_create.py (4)

backend/app/tests/crud/documents/test_crud_document_read_many.py (1)

store (11-16)

backend/app/tests/crud/documents/test_crud_document_read_one.py (1)

store (13-15)

backend/app/tests/utils/document.py (2)

DocumentStore (55-82)

fill (81-82)

backend/app/tests/crud/documents/test_crud_document_update.py (1)

documents (11-14)

backend/app/tests/crud/collections/test_crud_collection_delete.py (5)

backend/app/crud/collection.py (3)

CollectionCrud (17-146)

_ (118-125)

_ (128-146)

backend/app/models/api_key.py (1)

APIKey (28-38)

backend/app/tests/utils/utils.py (1)

get_project (70-89)

backend/app/tests/utils/document.py (5)

project (67-68)

DocumentStore (55-82)

fill (81-82)

_ (143-144)

_ (148-149)

backend/app/tests/utils/collection.py (1)

get_collection (24-55)

backend/app/tests/api/routes/documents/test_route_document_upload.py (3)

backend/app/models/api_key.py (1)

APIKeyPublic (23-25)

backend/app/tests/conftest.py (3)

user_api_key (89-91)

client (52-55)

db (24-41)

backend/app/models/document.py (1)

Document (19-34)

backend/app/tests/crud/collections/test_crud_collection_read_one.py (2)

backend/app/tests/utils/document.py (2)

DocumentStore (55-82)

fill (81-82)

backend/app/tests/utils/collection.py (1)

get_collection (24-55)

backend/app/tests/utils/document.py (6)

backend/app/crud/project.py (1)

get_project_by_id (37-39)

backend/app/models/api_key.py (1)

APIKeyPublic (23-25)

backend/app/models/document.py (2)

Document (19-34)

DocumentPublic (37-47)

backend/app/tests/utils/utils.py (1)

SequentialUuidGenerator (140-153)

backend/app/core/config.py (1)

AWS_S3_BUCKET (76-77)

backend/app/tests/conftest.py (2)

db (24-41)

user_api_key (89-91)

backend/app/tests/api/routes/collections/test_create_collections.py (2)

backend/app/tests/api/routes/test_api_key.py (1)

test_create_api_key (14-29)

backend/app/tests/api/routes/collections/test_collection_info.py (1)

test_collection_info_successful (57-77)

backend/app/tests/crud/documents/test_crud_document_read_one.py (3)

backend/app/tests/utils/utils.py (1)

get_project (70-89)

backend/app/tests/utils/document.py (3)

project (67-68)

DocumentStore (55-82)

put (70-75)

backend/app/crud/document.py (2)

DocumentCrud (14-135)

read_one (19-35)

backend/app/tests/api/routes/documents/test_route_document_remove.py (4)

backend/app/tests/utils/document.py (5)

DocumentStore (55-82)

delete (126-130)

append (107-112)

put (70-75)

DocumentMaker (26-52)

backend/app/tests/conftest.py (2)

db (24-41)

user_api_key (89-91)

backend/app/tests/crud/documents/test_crud_document_delete.py (1)

document (14-23)

backend/app/models/document.py (1)

Document (19-34)

backend/app/tests/crud/documents/test_crud_document_update.py (3)

backend/app/tests/utils/utils.py (1)

get_project (70-89)

backend/app/tests/utils/document.py (3)

project (67-68)

DocumentStore (55-82)

DocumentMaker (26-52)

backend/app/crud/document.py (2)

DocumentCrud (14-135)

update (98-123)

backend/app/tests/crud/documents/test_crud_document_read_many.py (3)

backend/app/tests/utils/utils.py (1)

get_project (70-89)

backend/app/tests/utils/document.py (2)

project (67-68)

DocumentStore (55-82)

backend/app/crud/document.py (2)

DocumentCrud (14-135)

read_many (37-71)

backend/app/tests/crud/documents/test_crud_document_delete.py (4)

backend/app/tests/utils/utils.py (1)

get_project (70-89)

backend/app/tests/utils/document.py (3)

project (67-68)

DocumentStore (55-82)

put (70-75)

backend/app/crud/document.py (1)

DocumentCrud (14-135)

backend/app/models/document.py (1)

Document (19-34)

🪛 Ruff (0.12.2)

backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py

44-44: Trailing whitespace

Remove trailing whitespace

(W291)

45-45: Trailing whitespace

Remove trailing whitespace

(W291)

backend/app/tests/crud/collections/test_crud_collection_delete.py

6-6: app.core.config.settings imported but unused

Remove unused import: app.core.config.settings

(F401)

61-61: Avoid equality comparisons to False; use not APIKey.is_deleted: for false checks

Replace with not APIKey.is_deleted

(E712)

backend/app/tests/crud/collections/test_crud_collection_read_one.py

8-8: app.core.config.settings imported but unused

Remove unused import: app.core.config.settings

(F401)

10-10: app.tests.utils.collection.uuid_increment imported but unused

Remove unused import: app.tests.utils.collection.uuid_increment

(F401)

backend/app/tests/api/routes/documents/test_route_document_info.py

47-47: Redefinition of unused crawler from line 10

(F811)

🪛 GitHub Actions: AI Platform CI

backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py

[error] 1-1: Trailing whitespace detected by pre-commit (hook: trailing-whitespace). The file was updated.

[error] 1-1: Black formatting changed the file. Hook: black. Please re-run 'pre-commit run --all-files'.

🔇 Additional comments (32)

backend/app/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py (1)

34-52: Backfill document.project_id with earliest active API key and guard against NULLs
Use a CTE to deterministically pick each user’s earliest non-deleted API key and update only NULL project_id values, then raise an error if any NULLs remain. Since alembic/psql weren’t available here, manually run the migration and confirm SELECT COUNT(*) FROM document WHERE project_id IS NULL; returns 0.

backend/app/tests/api/routes/documents/test_route_document_list.py (1)

51-51: LGTM: DocumentStore now scoped by project_id

Per-project instantiation aligns with storage/CRUD changes.

Also applies to: 81-81

backend/app/tests/crud/collections/test_crud_collection_read_one.py (1)

18-19: LGTM: Create documents under the collection’s project

Ensures project-consistent fixtures.

backend/app/tests/api/routes/documents/test_route_document_permanent_remove.py (2)

62-62: LGTM: Store scoped to project

Creating DocumentStore with project_id derived from the crawler aligns with project-scoped storage.

96-97: LGTM: Using DocumentMaker with project_id for non-existent doc case

Correctly generates an in-memory doc tied to the right project.

backend/app/tests/crud/collections/test_crud_collection_create.py (1)

16-19: LGTM: Documents are created under the collection’s project

Initializing DocumentStore with collection.project_id matches the project-based refactor.

backend/app/tests/api/routes/collections/test_create_collections.py (3)

9-9: LGTM: Use APIKeyPublic for per-project context

Switching the fixture type clarifies inputs and aligns with the API surface.

52-55: LGTM: Project-scoped DocumentStore in API test

Documents seeded correctly for the API call using the key’s project_id.

66-66: LGTM: Header built from API key

Matches expected auth header consumption.

backend/app/tests/api/routes/documents/test_route_document_info.py (4)

27-27: LGTM: Project-scoped store for info route

Consistent with project-based ownership.

38-38: LGTM: Same here

Project-scoped seeding looks correct.

47-47: LGTM: Correct fixture typing

Using WebCrawler matches how the tests authenticate.

50-50: LGTM: Proper maker usage for unknown doc

Creates an id not present in DB to exercise error path.

backend/app/tests/crud/documents/test_crud_document_read_one.py (4)

8-8: Import aligns with project-scoped tests.
Matches the new project-based access pattern.

14-15: Project-scoped DocumentStore fixture is correct.
Ensures all generated docs are tied to a specific project.

22-24: Correct: DocumentCrud now receives project_id.
Scoping reads to the project is enforced.

30-36: Good negative case using a non-persisted document.
Validates 404 on unknown id in the same project.

backend/app/tests/api/routes/documents/test_route_document_remove.py (1)

38-40: Correct project scoping via API key.
Route call uses the API key’s project_id consistently.

backend/app/tests/api/routes/documents/test_route_document_upload.py (2)

29-31: Header standardization looks good.
Fixed header "X-API-KEY" matches the test client helpers.

49-51: Fixture update aligns uploader with APIKeyPublic.
Clean and explicit dependency on user_api_key.

backend/app/tests/crud/documents/test_crud_document_update.py (3)

19-26: Project-scoped CRUD usage is correct.
Update path respects per-project ownership.

32-36: Sequential updates ordering check is sound.
Covers monotonic insert timestamps.

42-46: Active document remains undeleted on update.
Good invariant check on is_deleted.

backend/app/tests/crud/documents/test_crud_document_delete.py (2)

15-21: Project-scoped deletion path is correct.
DocumentCrud constructed with the document’s project_id and soft-deletes the record.

31-31: Proper soft-delete assertion.
Pairing this with inserted_at ≤ deleted_at is a solid check.

backend/app/tests/crud/documents/test_crud_document_read_many.py (1)

44-48: LGTM on skip/limit semantics.

Assertions correctly reflect offset/limit behavior and boundary cases (0, negative, and oversized values).

Also applies to: 55-57, 64-66, 73-75, 82-86, 93-95, 101-103, 110-115

backend/app/tests/utils/document.py (6)

15-16: LGTM on imports and model surface update.

Imports reflect project-scoped ownership and public models.

Also applies to: 19-19

56-60: LGTM: deterministic DB state for tests.

Clearing the table on init makes test runs isolated.

118-130: LGTM: API key handling moved to model, header usage correct.

Header name is consistent and avoids dict plumbing across tests.

134-135: Helpful docstring.

Clarifies the comparator’s intent against DocumentPublic.

158-167: LGTM: public-shape projection matches DocumentPublic fields.

Covers inherited fields and converts UUID/datetime to strings.

171-173: LGTM: crawler fixture signature updated to APIKeyPublic.

Header wiring is correct.

...p/alembic/versions/40307ab77e9f_add_storage_path_to_project_and_project_to_document_table.py

backend/app/tests/crud/collections/test_crud_collection_delete.py

backend/app/tests/crud/collections/test_crud_collection_read_all.py

backend/app/tests/crud/documents/test_crud_document_read_many.py

backend/app/tests/utils/collection.py

backend/app/tests/utils/document.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

backend/app/api/routes/documents.py (1)
83-84: Use sentence case for user-facing message

Matches prior feedback and reads better.
-        Message(message="Document Deleted Successfully")
+        Message(message="Document deleted successfully")

🧹 Nitpick comments (12)

backend/app/core/cloud/__init__.py (1)
1-8: Export get_cloud_storage to satisfy Ruff F401

Expose the public surface via all.
 from .storage import (
     AmazonCloudStorage,
     AmazonCloudStorageClient,
     CloudStorage,
     CloudStorageError,
     get_cloud_storage,
 )
+
+__all__ = [
+    "AmazonCloudStorage",
+    "AmazonCloudStorageClient",
+    "CloudStorage",
+    "CloudStorageError",
+    "get_cloud_storage",
+]
backend/app/api/routes/collections.py (2)
27-28: Remove unused import Document

Keep imports minimal.
-from app.models import Collection, Document, DocumentPublic
+from app.models import Collection, DocumentPublic
426-427: Prefer built-in generics (list) over List

Align with PEP 585 and silence UP006.
-    response_model=APIResponse[List[DocumentPublic]],
+    response_model=APIResponse[list[DocumentPublic]],
backend/app/api/routes/documents.py (6)
6-6: Drop unused import HTTPException

Silence Ruff F401.
-from fastapi import APIRouter, File, UploadFile, Query, HTTPException
+from fastapi import APIRouter, File, UploadFile, Query
9-10: Trim imports: remove unused get_project_by_id

It’s not used after scoping to project via CurrentUserOrgProject.
-from app.crud import DocumentCrud, CollectionCrud, get_project_by_id
+from app.crud import DocumentCrud, CollectionCrud
12-13: Remove unused CurrentUser

Project-scoped deps are used everywhere.
-from app.api.deps import CurrentUser, SessionDep, CurrentUserOrgProject
+from app.api.deps import SessionDep, CurrentUserOrgProject
23-24: Prefer built-in generics (list) over List

Align with PEP 585 and silence UP006.
-    response_model=APIResponse[List[DocumentPublic]],
+    response_model=APIResponse[list[DocumentPublic]],
80-81: Remove unused variable data

Result isn’t used; call for side-effects only.
-    data = c_crud.delete(document, a_crud)
+    c_crud.delete(document, a_crud)
100-105: Remove unused project fetch; not needed here

Avoid dead code.
-    project = get_project_by_id(session=session, project_id=current_user.project_id)
     a_crud = OpenAIAssistantCrud(client)
     d_crud = DocumentCrud(session, current_user.project_id)
     c_crud = CollectionCrud(session, current_user.id)
     storage = get_cloud_storage(session=session, project_id=current_user.project_id)
backend/app/core/cloud/storage.py (3)
16-17: Remove unused import UserProjectOrg

Silence Ruff F401.
-from app.models import UserProjectOrg
139-141: Ensure POSIX S3 keys; avoid OS-specific separators

Path() may emit backslashes on Windows. Use PurePosixPath for S3 keys.
-    def put(self, source: UploadFile, file_path: Path) -> SimpleStorageName:
-        key = Path(self.storage_path, file_path)
+    def put(self, source: UploadFile, file_path: Path) -> SimpleStorageName:
+        from pathlib import PurePosixPath
+        key = PurePosixPath(self.storage_path) / PurePosixPath(str(file_path))
139-141: Align base/derived method signatures for put

Base accepts basename: str; derived expects file_path: Path. Unify to Path + return type.

Apply in base class (outside this hunk):
-    def put(self, source: UploadFile, basename: str):
+    def put(self, source: UploadFile, file_path: Path) -> "SimpleStorageName":
         raise NotImplementedError()

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6ac0c63 and efd4d03.

📒 Files selected for processing (5)

backend/app/api/routes/collections.py (4 hunks)
backend/app/api/routes/documents.py (6 hunks)
backend/app/core/cloud/__init__.py (1 hunks)
backend/app/core/cloud/storage.py (7 hunks)
backend/app/crud/project.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (5)

backend/app/crud/project.py (4)

backend/app/models/project.py (2)

Project (29-60)

ProjectPublic (62-66)

backend/app/tests/crud/test_project.py (2)

test_get_project_by_id (53-60)

test_get_non_existent_project (118-121)

backend/app/api/routes/project.py (3)

update_project (76-91)

read_project (59-67)

delete_project (100-111)

backend/app/crud/openai_conversation.py (1)

get_conversation_by_id (11-23)

backend/app/core/cloud/__init__.py (1)

backend/app/core/cloud/storage.py (1)

get_cloud_storage (249-259)

backend/app/core/cloud/storage.py (3)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/models/user.py (1)

UserProjectOrg (65-66)

backend/app/utils.py (1)

mask_string (158-167)

backend/app/api/routes/documents.py (5)

backend/app/crud/document.py (4)

DocumentCrud (14-135)

read_many (37-71)

delete (125-135)

read_one (19-35)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/models/document.py (2)

Document (19-34)

DocumentPublic (37-47)

backend/app/models/message.py (1)

Message (5-6)

backend/app/core/cloud/storage.py (8)

get_cloud_storage (249-259)

put (118-119)

put (139-164)

delete (130-131)

delete (231-246)

client (29-41)

get_signed_url (127-128)

get_signed_url (204-229)

backend/app/api/routes/collections.py (3)

backend/app/core/cloud/storage.py (1)

get_cloud_storage (249-259)

backend/app/models/document.py (2)

Document (19-34)

DocumentPublic (37-47)

backend/app/crud/document.py (1)

DocumentCrud (14-135)

🪛 Ruff (0.12.2)

backend/app/core/cloud/__init__.py

6-6: .storage.get_cloud_storage imported but unused; consider removing, adding to __all__, or using a redundant alias

(F401)

backend/app/core/cloud/storage.py

16-16: app.models.UserProjectOrg imported but unused

Remove unused import: app.models.UserProjectOrg

(F401)

backend/app/api/routes/documents.py

6-6: fastapi.HTTPException imported but unused

Remove unused import: fastapi.HTTPException

(F401)

12-12: app.api.deps.CurrentUser imported but unused

Remove unused import: app.api.deps.CurrentUser

(F401)

23-23: Use list instead of List for type annotation

Replace with list

(UP006)

80-80: Local variable data is assigned to but never used

Remove assignment to unused variable data

(F841)

100-100: Local variable project is assigned to but never used

Remove assignment to unused variable project

(F841)

backend/app/api/routes/collections.py

27-27: app.models.Document imported but unused

Remove unused import: app.models.Document

(F401)

426-426: Use list instead of List for type annotation

Replace with list

(UP006)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: checks (3.11.7, 6)

🔇 Additional comments (5)

backend/app/crud/project.py (1)

37-39: LGTM: switch to session.get() is correct and simpler

The change improves clarity and leverages the identity map semantics. No issues.

backend/app/api/routes/documents.py (3)

31-33: LGTM: per-project scoping for listing is correct

DocumentCrud(session, current_user.project_id) ensures project-level isolation.

46-52: LGTM: storage + CRUD correctly scoped per project

Key construction uses project storage path; CRUD sets project_id on new docs.

135-139: LGTM: signed-URL gating via query flag is clean

Public projection + on-demand presign is appropriate.

backend/app/core/cloud/storage.py (1)

248-260: LGTM: project-scoped storage factory

Validates project, propagates storage_path, returns configured instance.

backend/app/api/routes/documents.py

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

backend/app/api/routes/documents.py (1)
100-105: “Permanent delete” still soft-deletes in DB; implement hard-delete and remove unused project.

S3 object is removed, but d_crud.delete only sets is_deleted=True. Provide a purge that actually deletes the DB row; also drop the unused project variable.
-    project = get_project_by_id(session=session, project_id=current_user.project_id)
@@
-    d_crud.delete(doc_id)
+    d_crud.purge(doc_id)
Add to backend/app/crud/document.py:
def purge(self, doc_id: UUID) -> None:
    document = self.session.get(Document, doc_id)
    if document is None or document.project_id != self.project_id:
        raise HTTPException(status_code=404, detail="Document not found")
    self.session.delete(document)
    self.session.commit()
    logger.info(
        f"[DocumentCrud.purge] Document permanently deleted | "
        f"{{'doc_id': '{doc_id}', 'project_id': {self.project_id}}}"
    )
Also applies to: 111-111

🧹 Nitpick comments (4)

backend/app/api/routes/documents.py (4)

6-6: Prune unused imports (ruff F401).

Remove HTTPException and CurrentUser.

-from fastapi import APIRouter, File, UploadFile, Query, HTTPException
+from fastapi import APIRouter, File, UploadFile, Query

-from app.api.deps import CurrentUser, SessionDep, CurrentUserOrgProject
+from app.api.deps import SessionDep, CurrentUserOrgProject

Also applies to: 12-12

23-23: Modernize typing (UP006).

Use built-in list in the response model.

-    response_model=APIResponse[List[DocumentPublic]],
+    response_model=APIResponse[list[DocumentPublic]],

9-9: Remove now-unused import after dropping project lookup.

get_project_by_id becomes unused; prune it.

-from app.crud import DocumentCrud, CollectionCrud, get_project_by_id
+from app.crud import DocumentCrud, CollectionCrud

127-139: Signed URL integration looks correct; consider optional TTL parameter.

Current default (1h) is fine; optionally expose expires_in: int = Query(3600, ge=60, le=86400) and pass it to get_signed_url.

-    include_url: bool = Query(
-        False, description="Include a signed URL to access the document"
-    ),
+    include_url: bool = Query(False, description="Include a signed URL to access the document"),
+    expires_in: int = Query(3600, ge=60, le=86400, description="Signed URL TTL (seconds)"),
@@
-    if include_url:
+    if include_url:
         storage = get_cloud_storage(session=session, project_id=current_user.project_id)
-        doc_schema.signed_url = storage.get_signed_url(document.object_store_url)
+        doc_schema.signed_url = storage.get_signed_url(document.object_store_url, expires_in=expires_in)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 016bbde and 2f07600.

📒 Files selected for processing (1)

backend/app/api/routes/documents.py (6 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

backend/app/api/routes/documents.py (6)

backend/app/crud/document.py (4)

DocumentCrud (14-135)

read_many (37-71)

delete (125-135)

read_one (19-35)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/models/document.py (2)

Document (19-34)

DocumentPublic (37-47)

backend/app/models/message.py (1)

Message (5-6)

backend/app/utils.py (2)

APIResponse (27-48)

success_response (34-37)

backend/app/core/cloud/storage.py (8)

get_cloud_storage (249-259)

put (118-119)

put (139-164)

delete (130-131)

delete (231-246)

client (29-41)

get_signed_url (127-128)

get_signed_url (204-229)

🪛 Ruff (0.12.2)

backend/app/api/routes/documents.py

6-6: fastapi.HTTPException imported but unused

Remove unused import: fastapi.HTTPException

(F401)

12-12: app.api.deps.CurrentUser imported but unused

Remove unused import: app.api.deps.CurrentUser

(F401)

23-23: Use list instead of List for type annotation

Replace with list

(UP006)

80-80: Local variable data is assigned to but never used

Remove assignment to unused variable data

(F841)

100-100: Local variable project is assigned to but never used

Remove assignment to unused variable project

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: checks (3.11.7, 6)

🔇 Additional comments (3)

backend/app/api/routes/documents.py (3)

27-33: Project-scoped listing looks good.

CurrentUserOrgProject + DocumentCrud(session, current_user.project_id) is correct.

113-115: LGTM on message phrasing.

Sentence case “Document permanently deleted successfully” matches previous nitpick.

46-56: Persisting str(object_store_url) is correct. SimpleStorageName.str returns the canonical s3://bucket/key URL via urlunparse(self.to_url()), so no change is needed.

Likely an incorrect or invalid review comment.

backend/app/api/routes/documents.py

kartpop

approved with comments

backend/app/core/cloud/storage.py

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

backend/app/core/cloud/storage.py (1)
95-104: Fix S3 URI construction: missing leading “/” and None fields in ParseResult.

Without a leading slash, str(SimpleStorageName(...)) can render like s3://bucketkey. Also, defaulting params/query/fragment to None is unnecessary—use empty strings.

Apply:
 def to_url(self):
-    kwargs = {
-        "scheme": "s3",
-        "netloc": self.Bucket,
-        "path": self.Key,
-    }
-    for k in ParseResult._fields:
-        kwargs.setdefault(k)
+    kwargs = {
+        "scheme": "s3",
+        "netloc": self.Bucket,
+        "path": "/" + self.Key.lstrip("/"),
+    }
+    for k in ParseResult._fields:
+        kwargs.setdefault(k, "")
     return ParseResult(**kwargs)

♻️ Duplicate comments (1)

backend/app/core/cloud/storage.py (1)
16-18: Remove unused import.

UserProjectOrg is unused and flagged by Ruff F401.

Apply:
-from app.models import UserProjectOrg

🧹 Nitpick comments (1)

backend/app/core/cloud/storage.py (1)

155-162: Seek to start before uploading to avoid partial uploads.

Apply:

         try:
-            self.aws.client.upload_fileobj(
+            source.file.seek(0)
+            self.aws.client.upload_fileobj(
                 source.file,

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2f07600 and 8fbaf8f.

📒 Files selected for processing (1)

backend/app/core/cloud/storage.py (6 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

backend/app/core/cloud/storage.py (3)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/models/user.py (1)

UserProjectOrg (65-66)

backend/app/utils.py (1)

mask_string (158-167)

🪛 Ruff (0.12.2)

backend/app/core/cloud/storage.py

17-17: app.models.UserProjectOrg imported but unused

Remove unused import: app.models.UserProjectOrg

(F401)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: checks (3.11.7, 6)

backend/app/core/cloud/storage.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

backend/app/core/cloud/storage.py (3)

114-122: Abstract put signature mismatches implementation — align to Path and return type.

ABC takes (basename: str) -> SimpleStorageName while the impl uses (file_path: Path) -> SimpleStorageName. This breaks LSP and confuses callers.

 @abstractmethod
-    def put(self, source: UploadFile, basename: str) -> SimpleStorageName:
-        """Upload a file to storage"""
+    def put(self, source: UploadFile, file_path: Path) -> SimpleStorageName:
+        """Upload a file to storage at the given relative file_path and return its storage name"""
         pass

Run to verify no lingering call sites use the old API:

#!/bin/bash
set -euo pipefail
# Find any abstract/override still using basename or -> str
rg -nP 'def\s+put\s*\([^)]*\bbasename\b' -C2
rg -nP 'def\s+put\s*\([^)]*\)\s*->\s*str' -C2

# Calls passing basename= or a string literal as 2nd arg (likely old API)
rg -nP '\bput\s*\([^)]*basename\s*=' -C2 || true
rg -nP '\bput\s*\(\s*[^,]+,\s*["\']' -n -C2 || true

217-235: Validate expires_in (AWS limit is 7 days) to prevent boto ParamValidationError.

Reject values outside 1–604800 seconds with a clear error.

     def get_signed_url(self, url: str, expires_in: int = 3600) -> str:
@@
-        name = SimpleStorageName.from_url(url)
+        if not isinstance(expires_in, int) or not (1 <= expires_in <= 604800):
+            raise ValueError("expires_in must be an integer between 1 and 604800 seconds (7 days).")
+        name = SimpleStorageName.from_url(url)

Check for call sites passing excessive TTLs:

#!/bin/bash
set -euo pipefail
rg -nP '\bget_signed_url\s*\([^)]*expires_in\s*=\s*(\d+)' | \
python - << 'PY'
import re,sys
for line in sys.stdin:
    m = re.search(r'expires_in\s*=\s*(\d+)', line)
    if m and int(m.group(1))>604800:
        print("[exceeds 7 days] ", line.strip())
PY

262-272: Fail fast when storage_path is missing (unmigrated project).

Avoid constructing a client with an empty/None root.

     storage_path = project.storage_path
+    if not storage_path:
+        raise ValueError(f"Project {project_id} has no storage_path configured")
 
     return AmazonCloudStorage(project_id=project_id, storage_path=storage_path)

🧹 Nitpick comments (2)

backend/app/core/cloud/storage.py (2)
17-17: Remove unused import.

UserProjectOrg isn’t used; drop it to satisfy Ruff F401 and reduce noise.
-from app.models import UserProjectOrg
157-164: Default Content-Type to octet-stream when missing.

UploadFile.content_type can be None; boto will error or store without type. Set a sane default.
             self.aws.client.upload_fileobj(
                 source.file,
-                ExtraArgs={
-                    "ContentType": source.content_type,
-                },
+                ExtraArgs={"ContentType": (source.content_type or "application/octet-stream")},
                 **kwargs,
             )

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8fbaf8f and 198b22c.

📒 Files selected for processing (1)

backend/app/core/cloud/storage.py (6 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

backend/app/core/cloud/storage.py (3)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/models/user.py (1)

UserProjectOrg (65-66)

backend/app/utils.py (1)

mask_string (158-167)

🪛 Ruff (0.12.2)

backend/app/core/cloud/storage.py

17-17: app.models.UserProjectOrg imported but unused

Remove unused import: app.models.UserProjectOrg

(F401)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: checks (3.11.7, 6)

🔇 Additional comments (1)

backend/app/core/cloud/storage.py (1)

167-169: Good: project-scoped telemetry with masked bucket/key.

Logging now consistently includes project_id and masks sensitive values.

Also applies to: 186-189, 205-208, 250-252

backend/app/core/cloud/storage.py

backend/app/api/routes/documents.py

backend/app/crud/document.py

backend/app/models/document.py

backend/app/tests/api/routes/documents/test_route_document_upload.py

backend/app/tests/crud/documents/test_crud_document_read_one.py

backend/app/tests/utils/document.py

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

backend/app/core/cloud/storage.py (2)

92-104: Fix S3 URL formatting and POSIX key handling.

to_url should ensure a leading slash; from_url should normalize to POSIX. Current code risks malformed s3://bucketkey and backslashes on Windows.

Apply:

 class SimpleStorageName:
@@
-    def to_url(self):
-        kwargs = {
-            "scheme": "s3",
-            "netloc": self.Bucket,
-            "path": self.Key,
-        }
-        for k in ParseResult._fields:
-            kwargs.setdefault(k)
-        return ParseResult(**kwargs)
+    def to_url(self):
+        path = self.Key if str(self.Key).startswith("/") else f"/{self.Key}"
+        kwargs = {"scheme": "s3", "netloc": self.Bucket, "path": path}
+        for k in ParseResult._fields:
+            kwargs.setdefault(k, "")
+        return ParseResult(**kwargs)
@@
-    def from_url(cls, url: str):
-        url = urlparse(url)
-        path = Path(url.path)
-        if path.is_absolute():
-            path = path.relative_to(path.root)
-        return cls(Bucket=url.netloc, Key=str(path))
+    def from_url(cls, url: str):
+        url = urlparse(url)
+        key = PurePosixPath(url.path.lstrip("/")).as_posix()
+        return cls(Bucket=url.netloc, Key=key)

And update the import:

from pathlib import Path, PurePosixPath

Also applies to: 105-112

44-85: Robust bucket checks: avoid int() on error code; handle us-east-1; catch ParamValidationError.

Casting Error.Code to int is brittle (e.g., "NoSuchBucket"). Also, omit CreateBucketConfiguration in us-east-1.

Apply:

-        except ValueError as err:
+        except (ValueError, ParamValidationError) as err:
@@
-        except ClientError as err:
-            response = int(err.response["Error"]["Code"])
-            if response != 404:
+        except ClientError as err:
+            code = str(err.response.get("Error", {}).get("Code", ""))
+            if code not in ("404", "NoSuchBucket"):
@@
-                self.client.create_bucket(
-                    Bucket=settings.AWS_S3_BUCKET,
-                    CreateBucketConfiguration={
-                        "LocationConstraint": settings.AWS_DEFAULT_REGION,
-                    },
-                )
+                create_kwargs = {"Bucket": settings.AWS_S3_BUCKET}
+                if settings.AWS_DEFAULT_REGION and settings.AWS_DEFAULT_REGION != "us-east-1":
+                    create_kwargs["CreateBucketConfiguration"] = {
+                        "LocationConstraint": settings.AWS_DEFAULT_REGION
+                    }
+                self.client.create_bucket(**create_kwargs)

Add (if not already present):

from botocore.exceptions import ClientError, ParamValidationError

♻️ Duplicate comments (4)

backend/app/core/cloud/storage.py (4)

16-17: Remove unused import and keep imports lean.

UserProjectOrg is not used in this module.

Apply:

-from app.models import UserProjectOrg

262-273: Fail fast if project has no storage_path.

Older/missing rows will otherwise produce confusing errors downstream.

Apply:

     storage_path = project.storage_path
+    if not storage_path:
+        raise ValueError(f"Project {project_id} has no storage_path configured")
 
     return AmazonCloudStorage(project_id=project_id, storage_path=storage_path)

150-155: Harden key construction: block traversal segments and keep keys relative.

Reject ./.. path segments to avoid odd S3 keys and potential traversal issues. You already reject absolute paths and normalize to POSIX—good.

Apply:

 def put(self, source: UploadFile, file_path: Path) -> SimpleStorageName:
     if file_path.is_absolute():
         raise ValueError("file_path must be relative to the project's storage root")
+    if any(part in (".", "..") for part in file_path.parts):
+        raise ValueError("file_path must not contain '.' or '..' segments'")
     key = Path(self.storage_path) / file_path
     destination = SimpleStorageName(key.as_posix())

217-236: Validate expires_in and surface clear errors for presigned URLs.

Constrain to AWS limits (1–604800s) and catch ParamValidationError.

Apply:

 def get_signed_url(self, url: str, expires_in: int = 3600) -> str:
@@
-        name = SimpleStorageName.from_url(url)
+        if not isinstance(expires_in, int) or not (1 <= expires_in <= 604800):
+            raise ValueError("expires_in must be an integer between 1 and 604800 seconds (7 days).")
+        name = SimpleStorageName.from_url(url)
         try:
             signed_url = self.aws.client.generate_presigned_url(
                 "get_object",
                 Params={"Bucket": name.Bucket, "Key": name.Key},
                 ExpiresIn=expires_in,
             )
@@
-        except ClientError as err:
+        except (ClientError, ParamValidationError) as err:
             logger.error(
                 f"[AmazonCloudStorage.get_signed_url] AWS presign error | "
                 f"{{'project_id': '{self.project_id}', 'bucket': '{mask_string(name.Bucket)}', 'key': '{mask_string(name.Key)}', 'error': '{str(err)}'}}",
                 exc_info=True,
             )
             raise CloudStorageError(f'AWS Error: "{err}" ({url})') from err

Also add the missing import:

from botocore.exceptions import ClientError, ParamValidationError

🧹 Nitpick comments (1)

backend/app/core/cloud/storage.py (1)

157-164: Set a safe default ContentType to avoid boto ParamValidationError.

UploadFile.content_type can be empty; boto3 expects a string.

Apply:

             self.aws.client.upload_fileobj(
                 source.file,
                 ExtraArgs={
-                    "ContentType": source.content_type,
+                    "ContentType": source.content_type or "application/octet-stream",
                 },
                 **kwargs,
             )

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 198b22c and 46099b5.

📒 Files selected for processing (1)

backend/app/core/cloud/storage.py (6 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

backend/app/core/cloud/storage.py (3)

backend/app/crud/project.py (1)

get_project_by_id (37-38)

backend/app/models/user.py (1)

UserProjectOrg (65-66)

backend/app/utils.py (1)

mask_string (158-167)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: checks (3.11.7, 6)

🔇 Additional comments (2)

backend/app/core/cloud/storage.py (2)

166-168: LGTM: Project-scoped logging.

Including project_id and masking bucket/key improves traceability without leaking secrets.

Also applies to: 171-173

179-196: LGTM: S3 interactions and project-scoped logging.

stream, get_file_size_kb, and delete look correct and consistent with the new project-based design.

Also applies to: 197-216, 244-259

backend/app/core/cloud/storage.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

backend/app/api/routes/documents.py (2)

100-114: Permanent delete currently soft-deletes; call purge + fix CollectionCrud arg

Ensure DB row is removed after S3 delete and collections cleanup. Also pass project_id to CollectionCrud.

-    d_crud = DocumentCrud(session, current_user.project_id)
-    c_crud = CollectionCrud(session, current_user.id)
+    d_crud = DocumentCrud(session, current_user.project_id)
+    c_crud = CollectionCrud(session, current_user.project_id)
@@
-    storage.delete(document.object_store_url)
-    d_crud.delete(doc_id)
+    storage.delete(document.object_store_url)
+    d_crud.purge(doc_id)

76-84: Fix scope: use project_id for collections; drop unused var; sentence case message

Pass project_id to CollectionCrud (project-scoped).
Remove unused variable (ruff F841).
Standardize message casing.

-    d_crud = DocumentCrud(session, current_user.project_id)
-    c_crud = CollectionCrud(session, current_user.id)
+    d_crud = DocumentCrud(session, current_user.project_id)
+    c_crud = CollectionCrud(session, current_user.project_id)
@@
-    document = d_crud.delete(doc_id)
-    data = c_crud.delete(document, a_crud)
+    document = d_crud.delete(doc_id)
+    c_crud.delete(document, a_crud)
@@
-    return APIResponse.success_response(
-        Message(message="Document Deleted Successfully")
-    )
+    return APIResponse.success_response(Message(message="Document deleted successfully"))

🧹 Nitpick comments (2)

backend/app/crud/document.py (1)

73-96: Edge case + message wording in read_each

Short-circuit empty doc_ids to avoid generating a no-op query.
Fix wording (“at least”, pluralize “documents”) for clearer errors.

-    def read_each(self, doc_ids: list[UUID]):
+    def read_each(self, doc_ids: list[UUID]):
+        if not doc_ids:
+            return []
@@
-                raise ValueError(
-                    f"Requested atleast {requested_count} document retrieved {retrieved_count}"
-                )
+                raise ValueError(
+                    f"Requested at least {requested_count} documents; retrieved {retrieved_count}"
+                )

backend/app/api/routes/documents.py (1)

23-33: Optional: explicitly coerce to DocumentPublic

FastAPI can coerce, but making it explicit keeps parity with info() and avoids surprises.
-    data = crud.read_many(skip, limit)
-    return APIResponse.success_response(data)
+    docs = crud.read_many(skip, limit)
+    public = [DocumentPublic.model_validate(d, from_attributes=True) for d in docs]
+    return APIResponse.success_response(public)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 46099b5 and 1cee5b2.

📒 Files selected for processing (2)

backend/app/api/routes/documents.py (6 hunks)
backend/app/crud/document.py (6 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

backend/app/crud/document.py (4)

backend/app/crud/collection.py (3)

read_one (84-93)

delete (107-115)

CollectionCrud (17-146)

backend/app/models/document.py (1)

Document (19-34)

backend/app/tests/crud/documents/test_crud_document_update.py (2)

TestDatabaseUpdate (15-67)

test_update_respects_owner (57-67)

backend/app/alembic/versions/c43313eca57d_add_document_tables.py (1)

upgrade (20-36)

backend/app/api/routes/documents.py (6)

backend/app/crud/document.py (4)

DocumentCrud (14-135)

read_many (37-71)

delete (125-135)

read_one (19-35)

backend/app/crud/collection.py (3)

CollectionCrud (17-146)

delete (107-115)

read_one (84-93)

backend/app/models/document.py (3)

Document (19-34)

DocumentPublic (37-47)

Document (10-30)

backend/app/models/message.py (1)

Message (5-6)

backend/app/utils.py (3)

APIResponse (27-48)

get_openai_client (170-200)

success_response (34-37)

backend/app/core/cloud/storage.py (7)

get_cloud_storage (262-272)

put (120-122)

put (150-177)

delete (140-142)

delete (244-259)

get_signed_url (135-137)

get_signed_url (217-242)

🪛 Ruff (0.12.2)

backend/app/api/routes/documents.py

80-80: Local variable data is assigned to but never used

Remove assignment to unused variable data

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: checks (3.11.7, 6)

🔇 Additional comments (5)

backend/app/crud/document.py (3)

15-25: Good: project-scoped access + soft-delete filter applied in lookups

Constructor and read_one correctly scope by project_id and exclude soft-deleted rows.

39-71: Input validation on pagination looks sound

Negative skip/limit checks with structured logging are fine.

98-114: Ownership enforcement on update is correct

Populate missing project_id and reject mismatches; timestamp update is in place.

backend/app/api/routes/documents.py (2)

46-58: Upload path looks good

Per-project storage via get_cloud_storage and deferring project_id set to CRUD is fine.

130-139: LGTM: signed URL opt-in on info()

The from_attributes conversion and conditional presign are correct.

backend/app/crud/document.py

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (7)

backend/app/tests/crud/documents/test_crud_document_read_one.py (3)
15-16: Avoid flaky fixture: create a dedicated test project instead of fetching any active one.

get_project(db) can raise when no active projects exist and may couple this test to seeded data. Prefer an isolated test project.

Apply:
-    project = get_project(db)
-    return DocumentStore(db, project.id)
+    project = create_test_project(db)
+    return DocumentStore(db, project.id)
Additionally, to avoid cross-test interference, scope DocumentStore.clear to the project (change in backend/app/tests/utils/document.py):
 class DocumentStore:
     def __init__(self, db: Session, project_id: int):
         self.db = db
         self.documents = DocumentMaker(project_id=project_id, session=db)
-        self.clear(self.db)
+        self.clear(self.db, project_id)

-    @staticmethod
-    def clear(db: Session):
-        db.exec(delete(Document))
+    @staticmethod
+    def clear(db: Session, project_id: int | None = None):
+        if project_id is None:
+            db.exec(delete(Document))
+        else:
+            db.exec(delete(Document).where(Document.project_id == project_id))
         db.commit()
31-31: OK; consider a simpler invalid-ID path.

Use a random UUID instead of generating an unpersisted Document to reduce coupling to DocumentMaker internals.

Apply:
+from uuid import uuid4
@@
-        document = next(store.documents)
+        random_id = uuid4()
@@
-            crud.read_one(document.id)
+            crud.read_one(random_id)
43-43: Correct: enforce project boundary.

Using other_project.id validates 404 on cross-project reads. Consider adding a companion test for is_deleted=True returning 404.
backend/app/tests/crud/documents/test_crud_document_delete.py (4)
9-10: Prefer deterministic project creation to avoid env-coupled flakes

Relying on get_project(db) assumes an active project already exists. Using create_test_project(db) makes tests self-contained and stable across environments.

Apply:
-from app.tests.utils.utils import get_project
 from app.tests.utils.test_data import create_test_project
-    project = get_project(db)
+    project = create_test_project(db)
-        project = get_project(db)
+        project = create_test_project(db)
16-21: Initialize CRUD with the known project id, not the generated document’s

Passing project.id is clearer and avoids coupling the test to DocumentMaker’s behavior of setting document.project_id.
-    crud = DocumentCrud(db, document.project_id)
+    crud = DocumentCrud(db, project.id)
38-43: Strengthen the negative test by asserting no side effects on the source doc

After the 404, re-fetch the original doc and assert it wasn’t mutated.
         with pytest.raises(HTTPException) as exc_info:
             crud.delete(document.id)

         assert exc_info.value.status_code == 404
         assert "Document not found" in str(exc_info.value.detail)
+        # Verify no side effects on the original document
+        fresh = db.exec(select(Document).where(Document.id == document.id)).one()
+        assert fresh.is_deleted is False
27-33: Add a targeted read_one test to lock in soft-delete semantics

Validate that CRUD reads exclude deleted documents by asserting 404 on read_one after deletion.
 class TestDatabaseDelete:
@@
     def test_delete_marks_deleted(self, document: Document):
         assert document.is_deleted is True
+
+    def test_read_one_404_after_delete(self, db: Session):
+        project = create_test_project(db)
+        store = DocumentStore(db, project.id)
+        doc = store.put()
+        crud = DocumentCrud(db, project.id)
+        crud.delete(doc.id)
+        with pytest.raises(HTTPException) as exc_info:
+            crud.read_one(doc.id)
+        assert exc_info.value.status_code == 404

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1cee5b2 and eddbacf.

📒 Files selected for processing (3)

backend/app/tests/crud/documents/test_crud_document_delete.py (2 hunks)
backend/app/tests/crud/documents/test_crud_document_read_one.py (2 hunks)
backend/app/tests/crud/documents/test_crud_document_update.py (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

backend/app/tests/crud/documents/test_crud_document_update.py

🧰 Additional context used

🧬 Code graph analysis (2)

backend/app/tests/crud/documents/test_crud_document_delete.py (5)

backend/app/tests/utils/utils.py (1)

get_project (70-89)

backend/app/tests/utils/test_data.py (1)

create_test_project (34-50)

backend/app/tests/utils/document.py (3)

project (67-68)

DocumentStore (55-82)

put (70-75)

backend/app/crud/document.py (1)

DocumentCrud (14-135)

backend/app/models/document.py (1)

Document (19-34)

backend/app/tests/crud/documents/test_crud_document_read_one.py (4)

backend/app/tests/utils/utils.py (1)

get_project (70-89)

backend/app/tests/utils/test_data.py (1)

create_test_project (34-50)

backend/app/tests/utils/document.py (3)

project (67-68)

DocumentStore (55-82)

put (70-75)

backend/app/crud/document.py (2)

DocumentCrud (14-135)

read_one (19-35)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: checks (3.11.7, 6)

🔇 Additional comments (5)

backend/app/tests/crud/documents/test_crud_document_read_one.py (3)

8-9: Imports align with project-scoped tests.

These imports are appropriate for the new project-based organization.

23-23: Correct project scoping in CRUD instantiation.

Passing store.project.id ensures reads are constrained to the project.

41-41: Good isolation for cross-project access.

Creating a separate project for this check is the right approach.

backend/app/tests/crud/documents/test_crud_document_delete.py (2)

31-33: Good assertion on soft-delete flag

Asserting is_deleted is True verifies the core behavior post-delete.

17-18: Be cautious: DocumentStore.clear() purges all documents

Because DocumentStore.init calls clear(), creating a store inside a test wipes all documents in the DB. Ensure tests run in isolated transactions or a per-test database to avoid cross-test interference, especially under parallel runs.

If isolation isn’t guaranteed, consider moving the clear() call into a dedicated fixture with explicit scope or using transaction rollbacks per test.

Also applies to: 39-39

backend/app/models/document.py

avirajsingh7 added 5 commits August 22, 2025 16:04

Add project id in model and use project id for crud

25816ae

add support for signed url

d556fea

Add storage path in project table

ae1b873

fix collection endpoint

98610a9

pre commit

527c83d

avirajsingh7 changed the base branch from main to dev/classification-transformer-aws August 25, 2025 05:03

coderabbitai bot reviewed Aug 25, 2025

View reviewed changes

avirajsingh7 added 3 commits August 25, 2025 12:00

Fix api routes test cases for document module

7af3a5c

fix crud unit test for document module

ed98c67

fix unit test for collections

c388423

avirajsingh7 self-assigned this Aug 25, 2025

use project id instead of project model in document store utils

2009ed1

avirajsingh7 requested review from AkhileshNegi, kartpop and nishika26 August 26, 2025 04:42

avirajsingh7 added ready-for-review enhancement New feature or request labels Aug 26, 2025

kartpop reviewed Aug 28, 2025

View reviewed changes

backend/app/alembic/versions/42c4c8b22a09_add_is_deleted_column_in_document.py Outdated Show resolved Hide resolved

Merge branch 'main' into update/document_module

7409b1c

avirajsingh7 changed the base branch from dev/classification-transformer-aws to main August 28, 2025 05:33

kartpop reviewed Aug 28, 2025

View reviewed changes

backend/app/api/routes/documents.py Outdated Show resolved Hide resolved

avirajsingh7 added 2 commits August 28, 2025 11:43

Fix migration

a27d551

precommit

6ac0c63

coderabbitai bot reviewed Aug 28, 2025

View reviewed changes

introduce get_cloud_storage method

efd4d03

coderabbitai bot reviewed Aug 28, 2025

View reviewed changes

backend/app/api/routes/documents.py Show resolved Hide resolved

avirajsingh7 added 2 commits August 28, 2025 15:05

fix testcase

016bbde

fix message

2f07600

coderabbitai bot reviewed Aug 28, 2025

View reviewed changes

backend/app/api/routes/documents.py Show resolved Hide resolved

backend/app/api/routes/documents.py Show resolved Hide resolved

kartpop self-requested a review August 28, 2025 09:54

kartpop approved these changes Aug 28, 2025

View reviewed changes

backend/app/core/cloud/storage.py Show resolved Hide resolved

implement abstract method in CloudStorage class

8fbaf8f

coderabbitai bot reviewed Aug 28, 2025

View reviewed changes

backend/app/core/cloud/storage.py Show resolved Hide resolved

backend/app/core/cloud/storage.py Show resolved Hide resolved

backend/app/core/cloud/storage.py Show resolved Hide resolved

backend/app/core/cloud/storage.py Show resolved Hide resolved

use POSIX separators and Reject absolute file_path

198b22c

coderabbitai bot reviewed Aug 28, 2025

View reviewed changes

backend/app/core/cloud/storage.py Show resolved Hide resolved

nishika26 requested changes Aug 31, 2025

View reviewed changes

fix argument of put method in CloudStorage

46099b5

coderabbitai bot reviewed Sep 1, 2025

View reviewed changes

backend/app/core/cloud/storage.py Show resolved Hide resolved

fix

1cee5b2

coderabbitai bot reviewed Sep 1, 2025

View reviewed changes

backend/app/crud/document.py Show resolved Hide resolved

fix unit test

eddbacf

coderabbitai bot reviewed Sep 1, 2025

View reviewed changes

nishika26 requested changes Sep 1, 2025

View reviewed changes

backend/app/models/document.py Show resolved Hide resolved

nishika26 approved these changes Sep 1, 2025

View reviewed changes

avirajsingh7 merged commit f24f4c5 into main Sep 1, 2025
3 checks passed

avirajsingh7 deleted the update/document_module branch September 1, 2025 05:57

This was referenced Sep 1, 2025

Classification Feature #320

Merged

Implement document transformation pipeline to improve RAG performance #363

Merged

nishika26 mentioned this pull request Sep 1, 2025

Classification immediate fixes and enhancements #364

Closed

coderabbitai bot mentioned this pull request Oct 1, 2025

Celery : moving collections module to celery #385

Merged

2 tasks

Refactor Document Module and Cloud Storage to use project-based organization and add signed URL support #346

Refactor Document Module and Cloud Storage to use project-based organization and add signed URL support #346

Uh oh!

Conversation

avirajsingh7 commented Aug 25, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Database Schema Updates

Storage Architecture Refactor

API Improvements

Migration Impact

Checklist

Notes

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Aug 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Aug 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

kartpop left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

avirajsingh7 commented Aug 25, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Aug 25, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)

codecov bot commented Aug 28, 2025 •

edited

Loading