Collection: Split Celery tasks for uploads #798

@nishika26

Description

Is your feature request related to a problem?
Collection creation currently runs in a single Celery task, which risks timeouts and failed uploads when the document set is large or OpenAI responds slowly.

Describe the solution you'd like
Split the collection creation so each batch is processed by its own Celery task. This should ensure:

  • No single task exceeds the 25-minute soft time limit
  • Each task handles one batch only
  • Failed files within a batch are retried in the next batch
  • Job checkpointing clearly indicates progress and where to resume from if issues arise
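
To make the checkpointing requirement concrete, here is a minimal sketch of a progress record that a batch task could update after each run. Everything in it is an assumption for illustration: the `JobCheckpoint` class, its field names, and the in-memory lists are hypothetical and do not reflect the actual CollectionJob schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobCheckpoint:
    """Hypothetical progress record for one collection job (not the real schema)."""

    batch_size: int
    pending: List[str]                      # document IDs not yet uploaded
    next_batch: int = 0                     # index of the batch to run next
    uploaded: List[str] = field(default_factory=list)

    def take_batch(self) -> List[str]:
        """Return the documents the next batch task should handle."""
        return self.pending[: self.batch_size]

    def record(self, succeeded: List[str]) -> None:
        """Store the outcome of one batch; files that failed stay in `pending`."""
        done = set(succeeded)
        self.uploaded.extend(succeeded)
        self.pending = [doc for doc in self.pending if doc not in done]
        self.next_batch += 1

    @property
    def finished(self) -> bool:
        """True once nothing is left to upload."""
        return not self.pending


# Usage: "b" fails in the first batch, so it leads the second one.
cp = JobCheckpoint(batch_size=2, pending=["a", "b", "c"])
first = cp.take_batch()
cp.record(["a"])
second = cp.take_batch()
```

Because a failed file simply stays at the front of `pending`, resuming after a crash only requires reading the record and calling `take_batch()` again.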

Original issue

Describe the current behavior
Currently, the entire collection creation process, including all batches, runs within a single Celery task. The batches are processed sequentially inside that one task, so if the document set is large or OpenAI is slow to respond, the task can easily exceed the Celery soft time limit (currently set at 25 minutes). When this happens, the task gets killed midway and the collection creation fails, with no clear record of which batches succeeded and which did not.

Describe the enhancement you'd like
Split the collection creation process so that each batch is handled by its own dedicated Celery task. A batch task should queue the next batch only after the current one completes successfully, so processing stays strictly sequential but is spread across multiple tasks. This means:

  • No single task is responsible for the entire upload, so the 25-minute soft time limit applies to one batch at a time and no longer depends on collection size
  • Each batch task has a clear, bounded scope — one batch, one task
  • When a file fails within a batch, it is added to the next batch for a retry instead of failing the whole collection
  • The job checkpointing on the CollectionJob table tracks which batch we are on, how many documents have been uploaded so far, and what remains, so if anything does go wrong, we know exactly where to resume from
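
The flow described above could be sketched roughly as follows. This is a plain-Python simulation, not the project's code: `run_batch`, `make_job`, `drain`, the job dictionary layout, `BATCH_SIZE`, and the `enqueue` callable are all hypothetical. `MAX_ATTEMPTS` is an extra assumption that the issue does not mention; it is included only so that a file which always fails cannot be requeued forever. In a real Celery setup, `enqueue` would presumably be something like the batch task's own `.delay(...)` call.

```python
from collections import deque
from typing import Callable, Dict, List

BATCH_SIZE = 2      # hypothetical batch size
MAX_ATTEMPTS = 3    # assumption: give up on a file after this many tries


def make_job(docs: List[str]) -> Dict:
    """Build a hypothetical in-memory stand-in for a CollectionJob row."""
    return {"pending": list(docs), "uploaded": [], "failed": [],
            "attempts": {}, "batch_no": 0, "status": "running"}


def run_batch(job: Dict, upload: Callable[[str], bool],
              enqueue: Callable[[Dict], None]) -> None:
    """Process exactly one batch, checkpoint, then queue the next batch."""
    batch = job["pending"][:BATCH_SIZE]
    rest = job["pending"][BATCH_SIZE:]
    retry = []
    for doc in batch:
        if upload(doc):
            job["uploaded"].append(doc)
            continue
        job["attempts"][doc] = job["attempts"].get(doc, 0) + 1
        if job["attempts"][doc] < MAX_ATTEMPTS:
            retry.append(doc)           # carried over to the next batch
        else:
            job["failed"].append(doc)   # permanently failed, recorded as such
    # Checkpoint: failed files go first so the next batch retries them.
    job["pending"] = retry + rest
    job["batch_no"] += 1
    if job["pending"]:
        enqueue(job)                    # only now is the next batch queued
    else:
        job["status"] = "finished"


def drain(job: Dict, upload: Callable[[str], bool]) -> Dict:
    """Simulate a worker consuming queued batch tasks one at a time."""
    queue = deque([job])
    while queue:
        run_batch(queue.popleft(), upload, queue.append)
    return job


# Usage: "b" fails once, then succeeds when retried in the second batch.
already_failed = set()


def flaky_upload(doc: str) -> bool:
    if doc == "b" and doc not in already_failed:
        already_failed.add(doc)
        return False
    return True


result = drain(make_job(["a", "b", "c", "d", "e"]), flaky_upload)
```

Each call to `run_batch` touches at most `BATCH_SIZE` documents, so its running time is bounded by the batch rather than by the collection, and the job record after every call shows which batch comes next and what is still pending.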

Metadata

Labels: enhancement (New feature or request)
Status: In Progress
Milestone: No milestone