Fix collection delete timeout by batching license pool deletions (PP-3683) #3034
jonathangreen merged 2 commits into main
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #3034   +/- ##
=====================================
  Coverage   93.06%   93.06%
=====================================
  Files         479      479
  Lines       43651    43650     -1
  Branches     6026     6026
=====================================
  Hits        40625    40625
  Misses       1960     1960
+ Partials     1066     1065     -1
cdf9694 to 10621ce
Collection.delete() was processing all license pools in a single transaction, causing the collection_delete celery task to exceed its 30-minute hard time limit for large collections.

- Refactor Collection.delete() to process a configurable batch_size of pools per call and return a bool indicating completion
- Add task.replace() re-queueing to collection_delete so it processes pools across multiple short-lived transactions
- Delegate collection_reaper to queue collection_delete tasks instead of deleting inline (avoiding the same timeout risk)
- Remove ExternalSearchIndex dependency from Collection.delete(); orphaned works are cleaned up by the existing work_reaper task
- Use explicit SELECT query instead of relationship attribute for reliable cross-transaction batch selection
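The batch-and-requeue shape described above can be sketched in plain Python. The names here (delete_batch, run_collection_delete) are illustrative stand-ins, not the project's real API; the real task raises Celery's task.replace() to re-queue itself, which a simple loop simulates here:

```python
# Hedged sketch of the batching pattern: each call does at most batch_size
# units of work inside its own short-lived "transaction" and reports
# whether anything remains.

def delete_batch(pools: list, batch_size: int = 1000) -> bool:
    """Delete up to batch_size pools; return True when none remain."""
    batch = pools[:batch_size]
    del pools[: len(batch)]  # stand-in for the real per-pool deletion
    return len(pools) == 0


def run_collection_delete(pools: list, batch_size: int = 1000) -> int:
    """Drive delete_batch to completion, counting the "task runs".

    In the real Celery task an incomplete batch triggers task.replace(),
    so each individual run stays well under the 30-minute hard limit.
    """
    runs = 0
    while True:
        runs += 1
        if delete_batch(pools, batch_size):
            return runs  # the real task would simply finish here
        # real task: raise self.replace(collection_delete.s(...))
```

The key design point is that completion is signalled by the return value rather than by the task finishing all work in one invocation.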
10621ce to 3d5e67d
```python
# https://docs.sqlalchemy.org/en/14/orm/cascades.html#notes-on-delete-deleting-objects-referenced-from-collections-and-scalar-relationships
work.license_pools.remove(pool)
if not work.license_pools:
    work.delete(search_index=search_index)
```
Orphaned works are cleaned up by the existing work_reaper task. This avoids complicating this task, and seems like a better separation of concerns.
Minor - this is mentioned in a test comment, as well. Should it be mentioned in a "NOTE:" here, as well?
It's mentioned in the doc comment for delete above, so I think this is covered.
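The separation of concerns discussed in this thread can be sketched in plain Python: deleting pools may leave works with no pools ("orphans"), and a later work_reaper pass cleans them up. All names below are illustrative stand-ins, not the project's real models:

```python
# Hedged sketch: works map to the set of their license pool ids. Removing
# pools never deletes a work inline; a separate reaper pass does that.

def remove_pools(works: dict, pool_ids: set) -> None:
    """Detach the given pools from every work, leaving works in place."""
    for pools in works.values():
        pools -= pool_ids  # in-place set difference


def work_reaper(works: dict) -> list:
    """Delete works that no longer have any license pools."""
    orphans = [w for w, pools in works.items() if not pools]
    for w in orphans:
        del works[w]
    return orphans
```

This keeps the deletion task simple: it only detaches pools, and orphan cleanup happens on the reaper's own schedule.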
```python
# application where LicensePools are permanently deleted.
# We use an explicit query rather than the relationship attribute so
# that each call gets a fresh result from the database.
pool_query = (
```
Minor - Assuming that we're ordering by license pool id here to avoid deadlocks. Might be worth noting that here, if it's important.
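A minimal runnable sketch of that batch-selection query, using stdlib sqlite3 and raw SQL in place of the project's SQLAlchemy models (the table and column names are illustrative assumptions):

```python
import sqlite3

# Hypothetical stand-in schema for license pools belonging to a collection.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE licensepools (id INTEGER PRIMARY KEY, collection_id INTEGER)"
)
conn.executemany(
    "INSERT INTO licensepools VALUES (?, 1)", [(i,) for i in (5, 3, 1, 4, 2)]
)


def next_batch(conn, collection_id: int, batch_size: int) -> list:
    """Select the next batch of pool ids with an explicit query.

    Re-reading the table each call gives fresh results across transaction
    boundaries, and ordering by primary key makes selection deterministic,
    which (per the review comment above) also gives concurrent workers a
    consistent lock order and helps avoid deadlocks.
    """
    rows = conn.execute(
        "SELECT id FROM licensepools WHERE collection_id = ? "
        "ORDER BY id LIMIT ?",
        (collection_id, batch_size),
    ).fetchall()
    return [r[0] for r in rows]


batch = next_batch(conn, 1, 2)   # first two pools by id
conn.execute("DELETE FROM licensepools WHERE id IN (1, 2)")
batch2 = next_batch(conn, 1, 2)  # next batch reflects the deletions
```

The same shape in SQLAlchemy would be a select(...).where(...).order_by(...).limit(batch_size) evaluated fresh on each task run.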
Description
Refactor Collection.delete() and the collection_delete celery task to process license pools in batches using the task.replace() re-queueing pattern, preventing task timeouts when deleting large collections.

Motivation and Context
The collection_delete celery task was hitting the 1800s (30-minute) hard time limit when deleting large collections. The root cause was that Collection.delete() processed all license pools in a single transaction. For collections with hundreds of thousands of pools, this easily exceeded 30 minutes.

Changes
- Collection.delete(): Accepts a batch_size parameter (default 1000), processes only that many pools per call, and returns a bool indicating whether deletion is complete. Removed the @inject/ExternalSearchIndex dependency (orphaned works are cleaned up by the existing work_reaper task) and removed the inline commit() call so the caller manages the transaction.
- collection_delete task: Added a batch_size parameter and task.replace() re-queueing when more pools remain. Changed the queue from high to default since this is now a background batch job.
- collection_reaper: Now queues collection_delete tasks for each marked collection instead of calling Collection.delete() directly (avoiding the same timeout risk).
- Collection.delete() internals: Uses an explicit SELECT query instead of the self.licensepools relationship attribute, which ensures a fresh result from the database across transaction boundaries.

How Has This Been Tested?
- test_collection_delete_batched: verifies re-queueing with a small batch_size across multiple rounds
- test_collection_delete_with_child_records: verifies ORM cascade cleanup of loans, holds, and works
- TestCollectionReaper: new tests to verify delegation to collection_delete
- test_delete in test_collection.py: updated to match the new method signature

Checklist