
[16.0][FIX] fs_attachment: paginate autovacuum GC loop to bound worker memory #597

Open
TecnologiaIG wants to merge 1 commit into OCA:16.0 from TecnologiaIG:16.0-fs_attachment-gc-batching

Conversation

@TecnologiaIG

Problem

FsFileGC._gc_files_unsafe loads the entire backlog of orphan files into a single Python list (via array_agg(store_fname) grouped by storage) and iterates fs.rm over all of them in one shot. With the Azure Blob backend (adlfs 2026.4.0; same class of issue applies to s3fs and any fsspec-based client) and a large queue, each HEAD+DELETE pair retains response buffers and connection-pool state inside the SDK client — they are only released when the worker exits.
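For context, the pre-fix flow has roughly this shape (a sketch inferred from the description above, not the module's verbatim code; column names beyond store_fname and the elided filters are assumptions):

# Pre-fix shape (sketch): a single query aggregates the whole backlog per
# storage into Python lists, then every orphan is removed in one pass.
self._cr.execute(
    "SELECT code, array_agg(store_fname) FROM fs_file_gc ... GROUP BY code"
)
for code, fnames in self._cr.fetchall():
    fs = self.env["fs.storage"].get_fs_by_code(code)
    for store_fname in fnames:  # whole backlog resident in memory at once
        fs.rm(store_fname.partition("://")[2])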

Observed in production

  • Odoo.sh instance, Azure Blob backend, ~30 k orphans queued.
  • Every @api.autovacuum run hit Odoo's limit_memory_hard and received SIGKILL at around minute 17 of the loop.
  • 60 k+ blob requests per run, zero DELETE committed (the process died mid-loop), queue never drained, respawned worker re-ran the same failing loop.
  • 14 kills in a single 24 h window, worker lived 16-21 min each round.
  • Operationally mitigated with TRUNCATE fs_file_gc (after a CSV backup), at the cost of a small set of orphan blobs leaking on Azure; acceptable versus nightly SIGKILL storms.

Fix

Paginate the SELECT + fs.rm loop per storage, in batches of 500, with an explicit gc.collect() between batches to reclaim SDK buffers:

for code in codes:
    fs = self.env["fs.storage"].get_fs_by_code(code)
    while True:
        rows = ...  # SELECT store_fname FROM fs_file_gc ... LIMIT 500 (per storage)
        if not rows:
            break
        deleted = []
        for (store_fname,) in rows:
            fs.rm(store_fname.partition("://")[2])
            deleted.append(store_fname)
        self._cr.execute(
            "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
            (deleted,),
        )
        gc.collect()
  • No intermediate commits. The caller _gc_files still holds the SHARE lock on fs_file_gc / ir_attachment and commits at the end; transactional semantics are unchanged (sketched after this list).
  • Memory is bounded: at most 500 fnames + one SDK response cycle in flight at any time.
  • Retries on the next cron tick remain idempotent: any batch whose deletion fails is left in the table and picked up on the next run.
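For reference, the caller-side contract those bullets describe looks roughly like this (a sketch of the semantics only, not the module's actual _gc_files body):

def _gc_files(self):
    # One SHARE lock for the entire run; batching adds no intermediate
    # commits, so consistency guarantees are exactly as before.
    self._cr.execute("LOCK fs_file_gc, ir_attachment IN SHARE MODE")
    self._gc_files_unsafe()
    self._cr.commit()  # single commit at the very end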

Note on stacking

This PR stacks on top of #596 ([FIX] fs_attachment: bound GC cursor with per-transaction timeouts). Version is bumped to 16.0.2.0.3 to leave room for #596 landing first at 16.0.2.0.2. Happy to rebase as needed.

Test plan

  • Validated in production after the TRUNCATE: no OOM kills since ~22:00 GT, Apr 21.
  • Staging reproduction: queue 5 000 synthetic orphan rows in fs_file_gc, run _gc_files(), and observe worker RSS; it should stay flat around the baseline instead of climbing linearly (see the seeding sketch after this list).
  • Concurrent _mark_for_gc during GC: upload's INSERT ... ON CONFLICT takes ROW EXCLUSIVE, which conflicts with the caller's SHARE lock, so an upload waits only for the duration of a batch (sub-second), not the full queue.
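A minimal seeding sketch for the staging step, assuming fs_file_gc accepts bare store_fname rows and using a hypothetical azure_test storage code (both are illustration-only assumptions; adjust to the real schema):

def seed_orphans(cr, code="azure_test", n=5000):
    """Queue n synthetic orphan rows so _gc_files() has a backlog to drain."""
    for i in range(n):
        cr.execute(
            "INSERT INTO fs_file_gc (store_fname) VALUES (%s)",
            (f"{code}://synthetic/{i:05d}.bin",),
        )

Sampling the worker's RSS while _gc_files() runs (for example ps -o rss= -p <pid> in a shell loop) should then show a flat profile with the batched code versus linear growth without it.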

Forward-ports

Code is identical on 17.0 / 18.0 / 19.0 as of today's tip; happy to forward-port once 16.0 is reviewed.

@OCA-git-bot
Contributor

Hi @lmignon,
some modules you are maintaining are being modified, check this out!

""",
(tuple(codes),),
)
while True:
Copy link
Copy Markdown
Contributor


Thank you for your proposal. This makes the code more robust, since it becomes less sensitive to the amount of data the garbage collector has to process. I'm still not sure about the first part of the change (#596), but this one LGTM. 😏 So let's discuss #596 so we can move forward.

``FsFileGC._gc_files_unsafe`` loaded the entire backlog of orphan files
into a single Python list (via ``array_agg(store_fname)`` grouped by
storage) and iterated ``fs.rm`` over all of them in one shot. With the
Azure Blob backend (``adlfs``, same class of issue on ``s3fs`` and other
fsspec-based clients) and tens of thousands of queued orphans, each
HEAD+DELETE pair held onto response buffers and connection-pool state
inside the SDK that was only released when the worker exited.

In production (Odoo.sh, 30k+ orphans, Azure Blob backend) every
autovacuum run hit Odoo's ``limit_memory_hard`` and received
``SIGKILL`` at around minute 17 — 60k+ blob requests per run, zero
``DELETE`` committed, queue never drained, next worker re-ran the same
failing loop. 14 kills observed in a single 24 h window.

Fix: paginate both the ``SELECT`` and the ``fs.rm`` loop per storage,
in batches of ``_GC_BATCH_SIZE = 500``, with an explicit
``gc.collect()`` between batches to reclaim the SDK's buffered state.
The caller ``_gc_files`` still holds the ``SHARE`` lock on
``fs_file_gc`` / ``ir_attachment`` and performs the final commit, so
consistency guarantees and transactional semantics are unchanged.

Signed-off-by: TecnologiaIG <tecnologia@intensegroupgt.com>
@TecnologiaIG force-pushed the 16.0-fs_attachment-gc-batching branch from 1afbf7c to 5cce247 on April 22, 2026 10:51
@TecnologiaIG
Author

Force-pushed 5cce247a, fixing a correctness regression in the original patch: every row the loop attempts is now cleared from fs_file_gc regardless of the fs.rm outcome. The previous version only deleted on fs.rm success, which caused an infinite loop (and the 6 h CI timeout) whenever a batch failed end-to-end.

Diff vs. the earlier version:

-                deleted = []
-                for (store_fname,) in rows:
+                fnames = [row[0] for row in rows]
+                for store_fname in fnames:
                     try:
                         fs.rm(store_fname.partition("://")[2])
-                        deleted.append(store_fname)
                     except Exception:
                         _logger.debug("Failed to remove file %s", store_fname)
-                if deleted:
-                    self._cr.execute(
-                        "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
-                        (deleted,),
-                    )
+                self._cr.execute(
+                    "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
+                    (fnames,),
+                )

Semantics now match the pre-batching upstream: an fs.rm that fails leaks a blob (debug-logged), same as before, but the DB row is cleared so the loop can terminate.
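Putting the hunks together, the per-batch body after the force-push reads roughly as follows (assembled from the diff above; rows, fs, and _logger come from the surrounding code):

fnames = [row[0] for row in rows]
for store_fname in fnames:
    try:
        # Strip the "<code>://" prefix before asking the backend to delete.
        fs.rm(store_fname.partition("://")[2])
    except Exception:
        # A failed removal leaks the blob, exactly as pre-batching upstream did.
        _logger.debug("Failed to remove file %s", store_fname)
# Rows are cleared unconditionally so the loop always terminates.
self._cr.execute(
    "DELETE FROM fs_file_gc WHERE store_fname = ANY(%s)",
    (fnames,),
)
gc.collect()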
