
collections:add dynamic batching #718

Open
nishika26 wants to merge 9 commits into main from enhancement/dynamic_batching

Conversation

Collaborator

@nishika26 nishika26 commented Mar 25, 2026

Summary

Target issue is #716

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, make sure it is tested and has test cases.

Notes

  • New Features

    • Automatic batching now uses size (30 MB) and count (200 files) thresholds.
    • Uploads now record file size; collection jobs track document count and total size.
  • Documentation

    • Updated collection creation docs for automatic batching.
    • Marked batch_size as deprecated.
  • Bug Fixes

    • Simplified error reporting for vector store operations.
  • Tests

    • Updated and added tests to reflect new batching and size-tracking behavior.


coderabbitai bot commented Mar 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds document file-size tracking and collection-job size/count fields; replaces fixed batch_size with automatic batching by cumulative size (30 MB) and count (200); updates models, migration, routes, services, providers, helpers, docs, and tests to compute, persist, and use size/count metadata.
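Under the thresholds stated above, the core of the dynamic batching can be sketched as follows. This is an illustrative sketch only: the real helper operates on `Document` models via a CRUD layer, while here each document is a plain dict with a hypothetical `file_size_kb` key.

```python
MAX_BATCH_SIZE_KB = 30 * 1024  # 30 MB threshold, per the PR summary
MAX_BATCH_COUNT = 200          # max documents per batch, per the PR summary


def batch_documents(documents: list[dict]) -> list[list[dict]]:
    """Group documents into batches capped by cumulative size and count (sketch)."""
    batches: list[list[dict]] = []
    current: list[dict] = []
    current_size_kb = 0
    for doc in documents:
        size_kb = doc.get("file_size_kb") or 0
        would_exceed_size = current_size_kb + size_kb > MAX_BATCH_SIZE_KB
        would_exceed_count = len(current) + 1 > MAX_BATCH_COUNT
        # Close the current batch before adding a doc that would push it over either limit
        if current and (would_exceed_size or would_exceed_count):
            batches.append(current)
            current, current_size_kb = [], 0
        current.append(doc)
        current_size_kb += size_kb
    if current:
        batches.append(current)
    return batches
```

Note that a single oversized document still forms its own batch here, since the check only prevents adding a document to a non-empty batch.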

Changes

Cohort / File(s) Summary
Database Migration
backend/app/alembic/versions/050_add_columns_to_collection_job_and_documents.py
New Alembic migration adding nullable columns: collection_jobs.docs_num, collection_jobs.total_size_mb, and document.file_size_kb with corresponding downgrade logic.
Models
backend/app/models/document.py, backend/app/models/collection_job.py, backend/app/models/collection.py
Added Document.file_size_kb; added CollectionJob.docs_num and CollectionJob.total_size_mb; marked CollectionOptions.batch_size description deprecated; removed DocumentPublic.signed_url field declaration.
API Routes
backend/app/api/routes/documents.py, backend/app/api/routes/collections.py
Document upload computes and persists file_size_kb; collection creation includes docs_num=len(request.documents) when starting jobs; minor import adjustments.
Services / Batching / Providers
backend/app/services/documents/helpers.py, backend/app/services/collections/helpers.py, backend/app/services/collections/create_collection.py, backend/app/services/collections/providers/openai.py, backend/app/services/collections/providers/base.py
Added calculate_file_size(UploadFile); refactored batch_documents to accept List[Document] and perform dynamic batching using MAX_BATCH_SIZE_KB (30 MB) and MAX_BATCH_COUNT (200); pre-read documents to compute total_size_mb and pass documents to providers; provider signatures updated.
CRUD / Error Handling
backend/app/crud/rag/open_ai.py
Simplified vector-store completion mismatch handling: removed per-document inspection and payload construction; now logs completed/total and raises InterruptedError with concise message.
Docs & Tests
backend/app/api/docs/collections/create.md, backend/app/tests/...
Docs updated to describe automatic size/count batching; tests updated to remove batch_size, use DB-backed document fixtures, and validate new batching behavior (extensive test changes for batching).
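The summary above mentions a new `calculate_file_size(UploadFile)` helper but does not show it. A minimal sketch of how such a helper might measure a seekable upload stream without consuming it (the function name, signature, and KB unit are assumptions, not the actual implementation):

```python
import io


def calculate_file_size_kb(file_obj: io.IOBase) -> float:
    """Measure a seekable stream's length in KB without consuming it (sketch)."""
    pos = file_obj.tell()          # remember the caller's read position
    file_obj.seek(0, io.SEEK_END)  # jump to the end to find the total length
    size_bytes = file_obj.tell()
    file_obj.seek(pos)             # restore so later reads see the full content
    return size_bytes / 1024
```

Restoring the position matters because the same stream is typically read again afterwards to persist the upload.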

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as API Route
    participant Storage as Cloud Storage
    participant DB as Document DB
    participant CollService as Collection Service
    participant Batcher as Batching Helper
    participant Provider as OpenAI Provider

    Client->>API: Upload file
    API->>Storage: calculate_file_size(file)
    Storage-->>API: file_size_kb
    API->>DB: Persist Document (file_size_kb)
    API-->>Client: upload response

    Client->>API: Create collection (document IDs)
    API->>DB: Read documents by IDs
    DB-->>API: Documents (include file_size_kb)
    API->>CollService: Start job (docs_num, total_size_mb)
    CollService->>Batcher: batch_documents(documents)
    Batcher->>Batcher: accumulate until >=30MB or 200 docs
    Batcher-->>Provider: send batches
    Provider-->>CollService: update vector store / create assistant
    CollService-->>API: job complete
    API-->>Client: collection result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

Possibly related PRs

Suggested labels

ready-for-review

Suggested reviewers

  • vprashrex
  • kartpop
  • Prajna1999

Poem

🐰 I counted bytes both big and small,
I hopped through batches, split at thirty MB's call.
Two hundred pals or size that swells,
I roll and pack — no batch_size tells.
Little paws, swift hops — collections thrive and stand tall.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 65.63%, which is below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title Check ✅ Passed — The title 'collections:add dynamic batching' clearly and specifically describes the main feature being added: dynamic batching for collections, which is the central objective of this PR.
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (3)
backend/app/models/document.py (1)

2-5: Remove unused imports.

Any from typing and model_serializer from pydantic are imported but not used in this file.

🧹 Proposed fix
 from datetime import datetime
-from typing import Any
 from uuid import UUID, uuid4

-from pydantic import model_serializer
 from sqlmodel import Field, SQLModel
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/models/document.py` around lines 2 - 5, The file imports unused
symbols — remove the unused imports "Any" from typing and "model_serializer"
from pydantic; keep only the needed imports (UUID, uuid4 from uuid and any other
used names) so the top of models/document.py no longer includes Any or
model_serializer.
backend/app/services/collections/providers/openai.py (1)

33-34: Update or remove the stale comment.

The comment "Use user-provided batch_size, default to 10 if not set" no longer reflects the implementation, as batch_documents now uses dynamic batching based on cumulative file size rather than a fixed batch size.

🧹 Proposed fix
-            # Use user-provided batch_size, default to 10 if not set
+            # Dynamic batching based on file size (30 MB limit) and document count (200 per batch)
             docs_batches = batch_documents(document_crud, collection_request.documents)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/providers/openai.py` around lines 33 - 34,
The inline comment above the call to batch_documents(document_crud,
collection_request.documents) is stale — it claims a user-provided batch_size
with default 10, but batching is now dynamic based on cumulative file size;
update or remove that comment to accurately reflect current behavior (refer to
batch_documents and collection_request.documents) so the comment either explains
dynamic size-based batching or is deleted.
backend/app/services/collections/helpers.py (1)

81-97: Consider batch-fetching documents to avoid N+1 queries.

The current implementation issues one database query per document (read_one in a loop), which creates an N+1 query pattern. For large document lists, this could impact performance significantly.

Consider fetching all documents in a single query (or chunked queries) and then iterating over the results in memory.

Sketch of batch-fetch approach
# Fetch all documents in one query
from sqlmodel import select, col
from app.models import Document

statement = select(Document).where(
    Document.id.in_(documents),
    Document.project_id == document_crud.project_id,
    Document.is_deleted.is_(False)
)
all_docs = {doc.id: doc for doc in document_crud.session.exec(statement).all()}

for doc_id in documents:
    doc = all_docs.get(doc_id)
    if doc is None:
        # Handle missing document - log warning or raise
        continue
    doc_size = doc.file_size or 0
    # ... rest of batching logic
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/helpers.py` around lines 81 - 97, The loop
currently calls document_crud.read_one for each doc_id (creating N+1 queries);
replace that by batch-fetching all required Document rows once (e.g., using a
select on Document.id.in_(documents) with document_crud.session or your ORM
session), build a dict mapping id->Document, then iterate the original documents
list and use the mapped doc (handle missing docs by logging/skipping) while
keeping the existing batching logic that uses current_batch, current_batch_size,
docs_batches, MAX_BATCH_SIZE_BYTES and MAX_BATCH_COUNT; keep the rest of the
batch-completion and size-count checks unchanged so behavior is identical but
with far fewer DB roundtrips.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@backend/app/alembic/versions/050_add_columns_to_collection_job_and_documents.py`:
- Line 10: The migration file 050_add_columns_to_collection_job_and_documents.py
contains an unused import "import sqlmodel.sql.sqltypes"; remove this import
statement from the top of the file so only actually used modules remain (verify
no other references to sqlmodel.sql.sqltypes exist in functions or
upgrade/downgrade logic before deleting).

In `@backend/app/api/routes/collections.py`:
- Around line 116-117: The request payload uses num_docs but the DB model
expects docs_num so the value is dropped; rename the field in the pydantic/DTO
class CollectionJobCreate from num_docs to docs_num and update the call site in
the route (where you currently pass num_docs=len(request.documents)) to pass
docs_num=len(request.documents); also update the definition in
CollectionJobCreate in models/collection_job.py to use docs_num (and any related
serialization/validation) so CollectionJobCrud.create() which does
CollectionJob(**collection_job.model_dump()) preserves the value.

In `@backend/app/crud/rag/open_ai.py`:
- Around line 146-151: In OpenAIVectorStoreCrud.update where you build error_msg
and call logger.error, remove the exc_info=True flag (or set exc_info=False)
because this block isn't inside an exception handler and no exception context
exists; update the logger.error invocation that currently references
vector_store_id, req.file_counts.completed and req.file_counts.total so it logs
the message without exc_info to avoid "NoneType: None" entries.

In `@backend/app/models/collection_job.py`:
- Around line 56-67: The nullable fields docs_num and total_size in the
CollectionJob model lack explicit defaults; update their Field definitions
(docs_num: int | None and total_size: int | None) to include default=None so
they match other nullable fields like task_id/trace_id and avoid
instance-creation issues when omitted.
- Around line 121-123: The DTO/typed fields on CollectionJob are wrong: rename
the field currently declared as num_docs to docs_num to match the CollectionJob
model so model_dump() preserves the value, and remove the unused batch_size
declaration (or re-add a matching field on the CollectionJob model if it is
actually required). Update any references/usages of num_docs and batch_size in
the same module and any callers to use docs_num (or restore batch_size in the
domain model) so persistence via model_dump() no longer drops data.

In `@backend/app/services/collections/helpers.py`:
- Line 58: Add a return type annotation to the batch_documents function: update
the signature for batch_documents(document_crud: DocumentCrud, documents:
List[UUID]) to include the return type -> List[List[Document]] (or the
appropriate Document model type), and import Document from app.models if not
already imported; ensure the function signature references DocumentCrud,
documents and the new return type so static type checkers and readers know it
returns a list of document batches.

---

Nitpick comments:
In `@backend/app/models/document.py`:
- Around line 2-5: The file imports unused symbols — remove the unused imports
"Any" from typing and "model_serializer" from pydantic; keep only the needed
imports (UUID, uuid4 from uuid and any other used names) so the top of
models/document.py no longer includes Any or model_serializer.

In `@backend/app/services/collections/helpers.py`:
- Around line 81-97: The loop currently calls document_crud.read_one for each
doc_id (creating N+1 queries); replace that by batch-fetching all required
Document rows once (e.g., using a select on Document.id.in_(documents) with
document_crud.session or your ORM session), build a dict mapping id->Document,
then iterate the original documents list and use the mapped doc (handle missing
docs by logging/skipping) while keeping the existing batching logic that uses
current_batch, current_batch_size, docs_batches, MAX_BATCH_SIZE_BYTES and
MAX_BATCH_COUNT; keep the rest of the batch-completion and size-count checks
unchanged so behavior is identical but with far fewer DB roundtrips.

In `@backend/app/services/collections/providers/openai.py`:
- Around line 33-34: The inline comment above the call to
batch_documents(document_crud, collection_request.documents) is stale — it
claims a user-provided batch_size with default 10, but batching is now dynamic
based on cumulative file size; update or remove that comment to accurately
reflect current behavior (refer to batch_documents and
collection_request.documents) so the comment either explains dynamic size-based
batching or is deleted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1ca4c2c0-c402-4368-b123-09e52d751195

📥 Commits

Reviewing files that changed from the base of the PR and between 284eca2 and 614801c.

📒 Files selected for processing (10)
  • backend/app/alembic/versions/050_add_columns_to_collection_job_and_documents.py
  • backend/app/api/routes/collections.py
  • backend/app/api/routes/documents.py
  • backend/app/crud/rag/open_ai.py
  • backend/app/models/collection_job.py
  • backend/app/models/document.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/helpers.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/services/documents/helpers.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/api/docs/collections/create.md`:
- Around line 6-9: Fix the typo "btch" -> "batch" and update the batching
description in the create.md paragraph to explicitly state the actual limits:
create a new batch when the cumulative size of documents in the batch reaches 30
MB or when the document count reaches 200 documents, whichever limit is hit
first; adjust the sentence around "configured total size of documents given to
upload to a vector store" to the clearer phrasing "cumulative size reaches 30
MB" so users understand the exact batching behavior.

In `@backend/app/services/collections/helpers.py`:
- Around line 58-60: The return type annotation of batch_documents references
Document but Document is not imported, causing a NameError; update the imports
to include the Document class (or its actual symbol) used by the function
signature so the annotation resolves, e.g., add an import for Document alongside
DocumentCrud (or adjust to use a forward reference string 'Document' if
preferred) and ensure the function signature and any other type hints
(batch_documents, DocumentCrud) are consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ea73e5d0-26bb-454e-937a-318196c408f1

📥 Commits

Reviewing files that changed from the base of the PR and between 614801c and f8f9c2f.

📒 Files selected for processing (11)
  • backend/app/alembic/versions/050_add_columns_to_collection_job_and_documents.py
  • backend/app/api/docs/collections/create.md
  • backend/app/api/routes/collections.py
  • backend/app/models/collection.py
  • backend/app/models/collection_job.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/helpers.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/tests/api/routes/collections/test_create_collections.py
  • backend/app/tests/services/collections/providers/test_openai_provider.py
  • backend/app/tests/services/collections/test_create_collection.py
💤 Files with no reviewable changes (2)
  • backend/app/tests/services/collections/providers/test_openai_provider.py
  • backend/app/tests/api/routes/collections/test_create_collections.py
✅ Files skipped from review due to trivial changes (2)
  • backend/app/models/collection.py
  • backend/app/alembic/versions/050_add_columns_to_collection_job_and_documents.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/app/models/collection_job.py
  • backend/app/services/collections/create_collection.py

@nishika26 nishika26 self-assigned this Mar 25, 2026
@nishika26 nishika26 added the enhancement New feature or request label Mar 25, 2026
@nishika26 nishika26 linked an issue Mar 25, 2026 that may be closed by this pull request
@nishika26 nishika26 requested review from AkhileshNegi and kartpop and removed request for Prajna1999 March 25, 2026 07:50
Comment on lines +2 to +5
from typing import Any
from uuid import UUID, uuid4

from pydantic import model_serializer
Collaborator

do we need Any and model_serializer?
seems like we are not using them

document_crud = DocumentCrud(session, current_user.project_.id)
total_size = 0
for doc_id in request.documents:
    doc = document_crud.read_one(doc_id)
Collaborator

From the looks of it, we are reading the files multiple times:

  1. collections.py:102
  2. helpers.py:79
  3. create_collection.py:202

Collaborator Author

All three are calculated differently for different things: in collections.py we calculate the total size of the given documents and store it in the DB; in the helper we count and add up document sizes until the configured limit is reached and form batches; and in create_collection, after the collection is created, I just convert the total size I calculated and saved in the DB in routes/collections.py to a readable format (in KBs) and return it in the logs.

Collaborator

I think what he is suggesting is - can we not read the file once from database and use it everywhere instead of doing a relatively heavy operation (db read) 3 times for the same thing. Having read the documents the first time, try and pass them around as params instead of doing a db fetch every time.

Collaborator Author

@nishika26 nishika26 Mar 25, 2026

fixed this now so that instead of three separate db calls, we make one db call (in create_collection.py) and use whatever is returned for all three things - calculating the total size of docs being uploaded, using the docs in batch document helper function and using the info for logs
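The fetch-once pattern described in this thread can be sketched roughly as below. All names here are hypothetical stand-ins for the real models and CRUD layer; the batching is count-only to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    id: str
    file_size_kb: float


def create_collection(doc_ids: list[str], fetch_docs) -> dict:
    """Fetch documents once, then reuse them for size, batching, and logging."""
    documents = fetch_docs(doc_ids)                    # single DB round trip
    total_kb = sum(d.file_size_kb for d in documents)  # 1) total size for the job row
    # 2) batching reuses the already-fetched documents (count-only in this sketch)
    batches = [documents[i:i + 200] for i in range(0, len(documents), 200)]
    # 3) log-friendly summary derived from the same data, converted once to MB
    return {"total_size_mb": round(total_kb / 1024, 2), "num_batches": len(batches)}
```

The key point is that `fetch_docs` is called exactly once; everything downstream receives the in-memory documents as parameters instead of re-reading them from the database.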


if current_batch and (would_exceed_size or would_exceed_count):
    logger.info(
        f"[batch_documents] Batch completed | {{'batch_num': {len(docs_batches) + 1}, 'doc_count': {len(current_batch)}, 'batch_size_bytes': {current_batch_size}, 'batch_size_mb': {round(current_batch_size / (1024 * 1024), 2)}}}"
Collaborator

Can we calculate the size one time and then use it everywhere, instead of repeating the MB-conversion logic in the logs?

Collaborator Author

this is being used to calculate size per batch and not total size

total_size += doc.file_size or 0

logger.info(
    f"[create_collection] Calculated total size | {{'total_documents': {len(request.documents)}, 'total_size_bytes': {total_size}, 'total_size_mb': {round(total_size / (1024 * 1024), 2)}}}"
Collaborator

Can we calculate the size one time and then use it everywhere, instead of repeating the MB-conversion logic in the logs?

Collaborator Author

Saving the total size in MB only now, and directly fetching that for the logs.
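A one-line helper along these lines (illustrative; not taken from the PR) would keep the MB conversion in a single place, addressing the reviewer's point about repeated conversion logic:

```python
def bytes_to_mb(size_bytes: int) -> float:
    """Convert a byte count to megabytes, rounded to 2 decimals (illustrative)."""
    return round(size_bytes / (1024 * 1024), 2)
```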

@@ -105,6 +105,7 @@ class CollectionOptions(SQLModel):
    batch_size: int = Field(
        default=10,
Collaborator

Do we need this default? Since we're going to use our own batching logic anyway, accepting this value from the client isn't needed. In providers/openai.py we moved to:
docs_batches = batch_documents(document_crud, collection_request.documents)

Collaborator Author

@nishika26 nishika26 Mar 25, 2026

removed this now

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (2)
backend/app/tests/api/routes/collections/test_create_collections.py (1)

18-32: Consider using a factory pattern for test document creation.

The helper works correctly, but per project guidelines, test fixtures in backend/app/tests/ should use the factory pattern. Consider moving this to a shared factory (e.g., DocumentFactory) that can be reused across test files.

Example factory approach
# In backend/app/tests/factories/document.py
class DocumentFactory:
    @staticmethod
    def create(
        db: Session,
        project_id: int,
        file_size: int = 1024,
        fname: str = "test_document.txt",
        object_store_url: str = "s3://test-bucket/test_document.txt",
    ) -> Document:
        doc = Document(
            id=uuid4(),
            fname=fname,
            object_store_url=object_store_url,
            project_id=project_id,
            file_size=file_size,
        )
        db.add(doc)
        db.commit()
        db.refresh(doc)
        return doc

Based on learnings: "Use factory pattern for test fixtures in backend/app/tests/"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/api/routes/collections/test_create_collections.py` around
lines 18 - 32, Replace the inline helper _create_test_document with a reusable
factory: implement a DocumentFactory class (e.g., DocumentFactory.create) in
backend/app/tests/factories/document.py that accepts db, project_id, file_size
and optional fname/object_store_url, moves the Document creation and
db.add/db.commit/db.refresh logic into that static method, then update tests in
test_create_collections.py to call DocumentFactory.create instead of
_create_test_document and remove the local helper; ensure the factory returns
the created Document and retains the same behavior.
backend/app/tests/services/collections/test_helpers.py (1)

48-68: Add type hints to FakeDocumentCrud methods.

Per coding guidelines, all function parameters and return values should have type hints.

♻️ Proposed fix
+from typing import Optional
+from uuid import UUID
+
 class FakeDocumentCrud:
-    def __init__(self, file_size_per_doc=1024):
+    def __init__(self, file_size_per_doc: Optional[int] = 1024) -> None:
         """
         Args:
             file_size_per_doc: Size in bytes for each fake document (default 1 KB)
         """
         self.calls = []
         self.file_size_per_doc = file_size_per_doc
-        self.documents = {}
+        self.documents: dict = {}

-    def read_one(self, doc_id):
+    def read_one(self, doc_id: UUID) -> SimpleNamespace:
         """Simulate reading a single document by ID."""
         if doc_id not in self.documents:
             self.documents[doc_id] = SimpleNamespace(
                 id=doc_id,
                 fname=f"{doc_id}.txt",
                 object_store_url=f"s3://bucket/{doc_id}.txt",
                 file_size=self.file_size_per_doc,
             )
         self.calls.append(doc_id)
         return self.documents[doc_id]

As per coding guidelines: "Always add type hints to all function parameters and return values in Python code"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/services/collections/test_helpers.py` around lines 48 - 68,
Add explicit type hints for FakeDocumentCrud: annotate __init__ signature with
file_size_per_doc: int and set instance attributes types (calls: List[str],
file_size_per_doc: int, documents: Dict[str, SimpleNamespace]) and annotate
read_one(self, doc_id: str) -> SimpleNamespace (or Optional[SimpleNamespace] if
you prefer None handling). Also ensure necessary imports are added (from typing
import List, Dict and from types import SimpleNamespace) so the annotations
resolve; update any related variable assignments in __init__ to match the
declared types.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@backend/app/tests/api/routes/collections/test_create_collections.py`:
- Around line 18-32: Replace the inline helper _create_test_document with a
reusable factory: implement a DocumentFactory class (e.g.,
DocumentFactory.create) in backend/app/tests/factories/document.py that accepts
db, project_id, file_size and optional fname/object_store_url, moves the
Document creation and db.add/db.commit/db.refresh logic into that static method,
then update tests in test_create_collections.py to call DocumentFactory.create
instead of _create_test_document and remove the local helper; ensure the factory
returns the created Document and retains the same behavior.

In `@backend/app/tests/services/collections/test_helpers.py`:
- Around line 48-68: Add explicit type hints for FakeDocumentCrud: annotate
__init__ signature with file_size_per_doc: int and set instance attributes types
(calls: List[str], file_size_per_doc: int, documents: Dict[str,
SimpleNamespace]) and annotate read_one(self, doc_id: str) -> SimpleNamespace
(or Optional[SimpleNamespace] if you prefer None handling). Also ensure
necessary imports are added (from typing import List, Dict and from types import
SimpleNamespace) so the annotations resolve; update any related variable
assignments in __init__ to match the declared types.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec3c5dc0-df31-4ab6-a277-ada6ebce2918

📥 Commits

Reviewing files that changed from the base of the PR and between f8f9c2f and 1106abc.

📒 Files selected for processing (5)
  • backend/app/api/docs/collections/create.md
  • backend/app/crud/rag/open_ai.py
  • backend/app/services/collections/helpers.py
  • backend/app/tests/api/routes/collections/test_create_collections.py
  • backend/app/tests/services/collections/test_helpers.py
✅ Files skipped from review due to trivial changes (2)
  • backend/app/api/docs/collections/create.md
  • backend/app/crud/rag/open_ai.py

Collaborator

@kartpop kartpop left a comment

please resolve Akhilesh's comments before merging


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/app/services/collections/providers/openai.py (1)

2-2: Use built-in generic types in this provider signature/import.

Apply the same List → list modernization here to stay consistent with Python 3.11 typing conventions.

♻️ Proposed diff
-from typing import List
...
-        documents: List[Document],
+        documents: list[Document],

As per coding guidelines, "Use Python 3.11+ with type hints throughout the codebase".

Also applies to: 27-27

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/providers/openai.py` at line 2, Replace the
typing.List import and any uses of List in function/type signatures with the
built-in generic list to follow Python 3.11 conventions: remove or stop
importing List from typing in
backend/app/services/collections/providers/openai.py and update signatures
(e.g., methods on OpenAIProvider such as get_embeddings / embed / any function
that currently uses List in its parameters or return types) to use list[...]
instead of List[...].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/services/collections/providers/base.py`:
- Line 2: Replace legacy typing.List usage with built-in generic list[...] and
remove List from the import; update the import line (currently "from typing
import Any, List") to only import Any and change all annotations that reference
List to use the built-in list type (e.g., List[Foo] -> list[Foo]) in this module
(look for any occurrences of List in base.py to update). Ensure no runtime
semantics change and keep Any from typing if still used.

---

Nitpick comments:
In `@backend/app/services/collections/providers/openai.py`:
- Line 2: Replace the typing.List import and any uses of List in function/type
signatures with the built-in generic list to follow Python 3.11 conventions:
remove or stop importing List from typing in
backend/app/services/collections/providers/openai.py and update signatures
(e.g., methods on OpenAIProvider such as get_embeddings / embed / any function
that currently uses List in its parameters or return types) to use list[...]
instead of List[...].

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5af78839-dc60-4750-866e-1a3248061977

📥 Commits

Reviewing files that changed from the base of the PR and between 1106abc and ff0726f.

📒 Files selected for processing (12)
  • backend/app/alembic/versions/050_add_columns_to_collection_job_and_documents.py
  • backend/app/api/docs/collections/create.md
  • backend/app/api/routes/collections.py
  • backend/app/api/routes/documents.py
  • backend/app/models/collection_job.py
  • backend/app/models/document.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/helpers.py
  • backend/app/services/collections/providers/base.py
  • backend/app/services/collections/providers/openai.py
  • backend/app/services/documents/helpers.py
  • backend/app/tests/services/collections/test_helpers.py
✅ Files skipped from review due to trivial changes (1)
  • backend/app/api/docs/collections/create.md
🚧 Files skipped from review as they are similar to previous changes (9)
  • backend/app/api/routes/collections.py
  • backend/app/services/documents/helpers.py
  • backend/app/api/routes/documents.py
  • backend/app/models/collection_job.py
  • backend/app/services/collections/create_collection.py
  • backend/app/alembic/versions/050_add_columns_to_collection_job_and_documents.py
  • backend/app/tests/services/collections/test_helpers.py
  • backend/app/services/collections/helpers.py
  • backend/app/models/document.py

@nishika26 nishika26 removed the request for review from vprashrex March 25, 2026 13:51
@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/app/services/collections/providers/base.py (2)

48-58: ⚠️ Potential issue | 🟡 Minor

Docstring args don't match method signature.

The delete method's docstring mentions llm_service_id and llm_service_name as arguments, but the actual method signature only accepts collection: Collection. This appears to be stale documentation from a previous implementation.

📝 Proposed fix
     @abstractmethod
     def delete(self, collection: Collection) -> None:
         """Delete remote resources associated with a collection.

         Called when a collection is being deleted and remote resources need to be cleaned up.

         Args:
-            llm_service_id: ID of the resource to delete
-            llm_service_name: Name of the service (determines resource type)
+            collection: Collection object containing the remote resource identifiers
         """
         raise NotImplementedError("Providers must implement delete method")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/providers/base.py` around lines 48 - 58, The
docstring for the abstract delete method is stale (mentions llm_service_id and
llm_service_name) and should describe the actual parameter; update the
delete(self, collection: Collection) docstring to document the single collection
argument and its purpose (e.g., "collection: Collection — the collection whose
remote resources should be deleted") and remove the incorrect arg names,
ensuring the docstring matches the delete method signature in class Base
provider.

30-46: ⚠️ Potential issue | 🔴 Critical

Update OpenAI provider to match base class signature and fix tests.

The signature change from document_crud: DocumentCrud to documents: list[Document] in base.py is a breaking change. Tests at backend/app/tests/services/collections/providers/test_openai_provider.py:42, 86 still pass document_crud as the third parameter, which will fail at runtime.

Additionally, the OpenAI implementation at backend/app/services/collections/providers/openai.py:23 uses List[Document] instead of the modernized list[Document] syntax, violating the Python 3.11+ type hint guideline. Update openai.py to use list[Document] and remove the from typing import List import to align with the base class and coding guidelines.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/providers/base.py` around lines 30 - 46, The
base provider's create signature changed to accept documents: list[Document]
(not document_crud: DocumentCrud); update the OpenAI provider's create method
(in openai.py) to match that signature and accept documents: list[Document] so
tests that call create with a documents list succeed; also replace any
typing.List[Document] usage with built-in list[Document] and remove the
now-unneeded from typing import List import; ensure the method implementation
uses the passed-in documents parameter (not document_crud) and adjust any call
sites in that file accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@backend/app/services/collections/providers/base.py`:
- Around line 48-58: The docstring for the abstract delete method is stale
(mentions llm_service_id and llm_service_name) and should describe the actual
parameter; update the delete(self, collection: Collection) docstring to document
the single collection argument and its purpose (e.g., "collection: Collection —
the collection whose remote resources should be deleted") and remove the
incorrect arg names, ensuring the docstring matches the delete method signature
in class Base provider.
- Around line 30-46: The base provider's create signature changed to accept
documents: list[Document] (not document_crud: DocumentCrud); update the OpenAI
provider's create method (in openai.py) to match that signature and accept
documents: list[Document] so tests that call create with a documents list
succeed; also replace any typing.List[Document] usage with built-in
list[Document] and remove the now-unneeded from typing import List import;
ensure the method implementation uses the passed-in documents parameter (not
document_crud) and adjust any call sites in that file accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9a570d3a-350b-4597-bb56-8f46650d982b

📥 Commits

Reviewing files that changed from the base of the PR and between ff0726f and c7da9f0.

📒 Files selected for processing (1)
  • backend/app/services/collections/providers/base.py


Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Collection Module: Enhance smart batching