
[16.0][FIX] fs_attachment: paginate autovacuum GC loop to bound worker memory #597

Open
TecnologiaIG wants to merge 1 commit into OCA:16.0 from TecnologiaIG:16.0-fs_attachment-gc-batching

Conversation

@TecnologiaIG

Problem

FsFileGC._gc_files_unsafe loads the entire backlog of orphan files into a single Python list (via array_agg(store_fname) grouped by storage) and iterates fs.rm over all of them in one shot. With the Azure Blob backend (adlfs 2026.4.0; same class of issue applies to s3fs and any fsspec-based client) and a large queue, each HEAD+DELETE pair retains response buffers and connection-pool state inside the SDK client — they are only released when the worker exits.
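For context, the pre-fix flow has roughly this shape (a sketch inferred from the description above, not the module's verbatim code; column names beyond store_fname and the elided filters are assumptions):

# Pre-fix shape (sketch): a single query aggregates the whole backlog per
# storage into Python lists, then every orphan is removed in one pass.
self._cr.execute(
    "SELECT code, array_agg(store_fname) FROM fs_file_gc ... GROUP BY code"
)
for code, fnames in self._cr.fetchall():
    fs = self.env["fs.storage"].get_fs_by_code(code)
    for store_fname in fnames:  # whole backlog resident in memory at once
        fs.rm(store_fname.partition("://")[2])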

Observed in production

  • Odoo.sh instance, Azure Blob backend, ~30 k orphans queued.
  • Every @api.autovacuum run hit Odoo's limit_memory_hard and received SIGKILL at around minute 17 of the loop.
  • 60 k+ blob requests per run, zero DELETE committed (the process died mid-loop), queue never drained, respawned worker re-ran the same failing loop.
  • 14 kills in a single 24 h window, worker lived 16-21 min each round.
  • Operationally mitigated with TRUNCATE fs_file_gc (after a CSV backup), at the cost of a small set of orphan blobs leaking on Azure; acceptable versus nightly SIGKILL storms.

Fix

Paginate the SELECT + fs.rm loop per storage, in batches of 500, with an explicit gc.collect() between batches to reclaim SDK buffers:

for code in codes:
    fs = self.env["fs.storage"].get_fs_by_code(code)
    while True:
        rows = ...  # SELECT store_fname FROM fs_file_gc ... LIMIT 500 (per storage)
        if not rows:
            break
        deleted = []
        for (store_fname,) in rows:
            fs.rm(store_fname.partition("://")[2])
            deleted.append(store_fname)
        self._cr.execute(
            "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
            (deleted,),
        )
        gc.collect()
  • No intermediate commits. The caller _gc_files still holds the SHARE lock on fs_file_gc / ir_attachment and commits at the end; transactional semantics are unchanged (sketched after this list).
  • Memory is bounded: at most 500 fnames + one SDK response cycle in flight at any time.
  • Retries on the next cron tick remain idempotent: any batch whose deletion fails is left in the table and picked up on the next run.
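For reference, the caller-side contract those bullets describe looks roughly like this (a sketch of the semantics only, not the module's actual _gc_files body):

def _gc_files(self):
    # One SHARE lock for the entire run; batching adds no intermediate
    # commits, so consistency guarantees are exactly as before.
    self._cr.execute("LOCK fs_file_gc, ir_attachment IN SHARE MODE")
    self._gc_files_unsafe()
    self._cr.commit()  # single commit at the very end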

Note on stacking

This PR stacks on top of #596 ([FIX] fs_attachment: bound GC cursor with per-transaction timeouts). Version is bumped to 16.0.2.0.3 to leave room for #596 landing first at 16.0.2.0.2. Happy to rebase as needed.

Test plan

  • Validated in production after the TRUNCATE: no OOM kills since ~22:00 GT, Apr 21.
  • Staging reproduction: queue 5 000 synthetic orphan rows in fs_file_gc, run _gc_files(), and observe worker RSS; it should stay flat around the baseline instead of climbing linearly (see the seeding sketch after this list).
  • Concurrent _mark_for_gc during GC: upload's INSERT ... ON CONFLICT takes ROW EXCLUSIVE, which conflicts with the caller's SHARE lock, so an upload waits only for the duration of a batch (sub-second), not the full queue.
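A minimal seeding sketch for the staging step, assuming fs_file_gc accepts bare store_fname rows and using a hypothetical azure_test storage code (both are illustration-only assumptions; adjust to the real schema):

def seed_orphans(cr, code="azure_test", n=5000):
    """Queue n synthetic orphan rows so _gc_files() has a backlog to drain."""
    for i in range(n):
        cr.execute(
            "INSERT INTO fs_file_gc (store_fname) VALUES (%s)",
            (f"{code}://synthetic/{i:05d}.bin",),
        )

Sampling the worker's RSS while _gc_files() runs (for example ps -o rss= -p <pid> in a shell loop) should then show a flat profile with the batched code versus linear growth without it.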

Forward-ports

Code is identical on 17.0 / 18.0 / 19.0 as of today's tip; happy to forward-port once 16.0 is reviewed.

@OCA-git-bot
Contributor

Hi @lmignon,
some modules you are maintaining are being modified, check this out!

""",
(tuple(codes),),
)
while True:
Copy link
Copy Markdown
Contributor


Thank you for your proposal. This makes the code more robust, since it becomes less sensitive to the amount of data the garbage collector has to process. I'm still not sure about the first part of the change (#596), but this one LGTM. 😏 So let's discuss #596 so we can move forward.

``FsFileGC._gc_files_unsafe`` loaded the entire backlog of orphan files
into a single Python list (via ``array_agg(store_fname)`` grouped by
storage) and iterated ``fs.rm`` over all of them in one shot. With the
Azure Blob backend (``adlfs``, same class of issue on ``s3fs`` and other
fsspec-based clients) and tens of thousands of queued orphans, each
HEAD+DELETE pair held onto response buffers and connection-pool state
inside the SDK that was only released when the worker exited.

In production (Odoo.sh, 30k+ orphans, Azure Blob backend) every
autovacuum run hit Odoo's ``limit_memory_hard`` and received
``SIGKILL`` at around minute 17 — 60k+ blob requests per run, zero
``DELETE`` committed, queue never drained, next worker re-ran the same
failing loop. 14 kills observed in a single 24 h window.

Fix: paginate both the ``SELECT`` and the ``fs.rm`` loop per storage,
in batches of ``_GC_BATCH_SIZE = 500``, with an explicit
``gc.collect()`` between batches to reclaim the SDK's buffered state.
The caller ``_gc_files`` still holds the ``SHARE`` lock on
``fs_file_gc`` / ``ir_attachment`` and performs the final commit, so
consistency guarantees and transactional semantics are unchanged.

Signed-off-by: TecnologiaIG <tecnologia@intensegroupgt.com>
@TecnologiaIG force-pushed the 16.0-fs_attachment-gc-batching branch from 1afbf7c to 5cce247 on April 22, 2026 10:51
@TecnologiaIG
Author

Force-pushed 5cce247a, fixing a correctness regression in the original patch: every row the loop attempts is now cleared from fs_file_gc regardless of the fs.rm outcome. The previous version only deleted on fs.rm success, which caused an infinite loop (and the 6 h CI timeout) whenever a batch failed end-to-end.

Diff vs. the earlier version:

-                deleted = []
-                for (store_fname,) in rows:
+                fnames = [row[0] for row in rows]
+                for store_fname in fnames:
                     try:
                         fs.rm(store_fname.partition("://")[2])
-                        deleted.append(store_fname)
                     except Exception:
                         _logger.debug("Failed to remove file %s", store_fname)
-                if deleted:
-                    self._cr.execute(
-                        "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
-                        (deleted,),
-                    )
+                self._cr.execute(
+                    "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
+                    (fnames,),
+                )

Semantics now match the pre-batching upstream: an fs.rm that fails leaks a blob (debug-logged), same as before, but the DB row is cleared so the loop can terminate.
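Putting the hunks together, the per-batch body after the force-push reads roughly as follows (assembled from the diff above; rows, fs, and _logger come from the surrounding code):

fnames = [row[0] for row in rows]
for store_fname in fnames:
    try:
        # Strip the "<code>://" prefix before asking the backend to delete.
        fs.rm(store_fname.partition("://")[2])
    except Exception:
        # A failed removal leaks the blob, exactly as pre-batching upstream did.
        _logger.debug("Failed to remove file %s", store_fname)
# Rows are cleared unconditionally so the loop always terminates.
self._cr.execute(
    "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
    (fnames,),
)
gc.collect()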
