[16.0][FIX] fs_attachment: paginate autovacuum GC loop to bound worker memory #597
Open
TecnologiaIG wants to merge 1 commit into OCA:16.0
Conversation
Contributor
Hi @lmignon,
lmignon reviewed Apr 22, 2026
| """, | ||
| (tuple(codes),), | ||
| ) | ||
| while True: |
Contributor
``FsFileGC._gc_files_unsafe`` loaded the entire backlog of orphan files into a single Python list (via ``array_agg(store_fname)`` grouped by storage) and iterated ``fs.rm`` over all of them in one shot. With the Azure Blob backend (``adlfs``; the same class of issue applies to ``s3fs`` and other fsspec-based clients) and tens of thousands of queued orphans, each HEAD+DELETE pair held onto response buffers and connection-pool state inside the SDK that was only released when the worker exited.

In production (Odoo.sh, 30k+ orphans, Azure Blob backend) every autovacuum run hit Odoo's ``limit_memory_hard`` and received ``SIGKILL`` at around minute 17: 60k+ blob requests per run, zero ``DELETE`` committed, the queue never drained, and the next worker re-ran the same failing loop. 14 kills were observed in a single 24 h window.

Fix: paginate both the ``SELECT`` and the ``fs.rm`` loop per storage, in batches of ``_GC_BATCH_SIZE = 500``, with an explicit ``gc.collect()`` between batches to reclaim the SDK's buffered state. The caller ``_gc_files`` still holds the ``SHARE`` lock on ``fs_file_gc`` / ``ir_attachment`` and performs the final commit, so consistency guarantees and transactional semantics are unchanged.

Signed-off-by: TecnologiaIG <tecnologia@intensegroupgt.com>
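The batched shape described above might look like the following minimal sketch. It is not the PR's code: `gc_storage_paginated`, the `prefix`-based `SELECT`, and the stub-friendly `(cr, fs)` signature are illustrative assumptions; `_GC_BATCH_SIZE`, the `fs_file_gc` table, and the `DELETE ... ANY(%s)` statement come from the PR itself.

```python
import gc
import logging

_logger = logging.getLogger(__name__)

_GC_BATCH_SIZE = 500  # batch size named in the PR

def gc_storage_paginated(cr, fs, prefix):
    """Drain fs_file_gc rows for one storage in fixed-size pages so
    worker memory stays bounded regardless of backlog size."""
    while True:
        # Page the SELECT instead of array_agg-ing the whole backlog.
        cr.execute(
            "SELECT store_fname FROM fs_file_gc"
            " WHERE store_fname LIKE %s LIMIT %s",
            (prefix + "://%", _GC_BATCH_SIZE),
        )
        rows = cr.fetchall()
        if not rows:
            break
        fnames = [row[0] for row in rows]
        for store_fname in fnames:
            try:
                # Strip the "<storage>://" prefix before hitting the backend.
                fs.rm(store_fname.partition("://")[2])
            except Exception:
                _logger.debug("Failed to remove file %s", store_fname)
        # Rows are purged per batch; the caller still does the final commit.
        cr.execute(
            "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
            (fnames,),
        )
        # Release buffered response/connection state the SDK kept alive.
        gc.collect()
```

Because each iteration deletes the rows it just selected, the `LIMIT`-only query converges without an offset, and peak memory is bounded by one batch rather than the full queue.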
1afbf7c to 5cce247
Author
Force-pushed. Diff vs. the earlier version:

```diff
-        deleted = []
-        for (store_fname,) in rows:
+        fnames = [row[0] for row in rows]
+        for store_fname in fnames:
             try:
                 fs.rm(store_fname.partition("://")[2])
-                deleted.append(store_fname)
             except Exception:
                 _logger.debug("Failed to remove file %s", store_fname)
-        if deleted:
-            self._cr.execute(
-                "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
-                (deleted,),
-            )
+        self._cr.execute(
+            "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
+            (fnames,),
+        )
```

Semantics now match the pre-batching upstream: an `fs.rm` failure is logged at debug level and the row is still removed from `fs_file_gc`.
Problem
`FsFileGC._gc_files_unsafe` loads the entire backlog of orphan files into a single Python list (via `array_agg(store_fname)` grouped by storage) and iterates `fs.rm` over all of them in one shot. With the Azure Blob backend (`adlfs` 2026.4.0; the same class of issue applies to `s3fs` and any fsspec-based client) and a large queue, each HEAD+DELETE pair retains response buffers and connection-pool state inside the SDK client; they are only released when the worker exits.

Observed in production
- Every `@api.autovacuum` run hit Odoo's `limit_memory_hard` and received `SIGKILL` at around minute 17 of the loop.
- Zero `DELETE` committed (the process died mid-loop), the queue never drained, and the respawned worker re-ran the same failing loop.
- Mitigated with `TRUNCATE fs_file_gc` (after a CSV backup), at the cost of a small set of orphan blobs leaking on Azure, but acceptable vs. nightly SIGKILL storms.

Fix
Paginate the `SELECT` + `fs.rm` loop per storage, in batches of 500, with an explicit `gc.collect()` between batches to reclaim SDK buffers. `_gc_files` still holds the `SHARE` lock on `fs_file_gc` / `ir_attachment` and commits at the end; transactional semantics are unchanged.

Note on stacking
This PR stacks on top of #596 ([FIX] fs_attachment: bound GC cursor with per-transaction timeouts). The version is bumped to 16.0.2.0.3 to leave room for #596 landing first at 16.0.2.0.2. Happy to rebase as needed.

Test plan
- After the `TRUNCATE`: no OOM kills since (~22:00 GT, Apr 21).
- Seed `fs_file_gc`, run `_gc_files()`, and observe worker RSS: it should stay flat around the baseline instead of climbing linearly.
- Concurrent `_mark_for_gc` during GC: the upload's `INSERT ... ON CONFLICT` takes `ROW EXCLUSIVE`, which conflicts with the caller's `SHARE` lock, so the upload waits only for the duration of a batch (~sub-second), not the full queue.

Forward-ports
Code is identical on 17.0 / 18.0 / 19.0 as of today's tip; happy to forward-port once 16.0 is reviewed.