Possible k8s OOM kill prevention pill 1 (--max-old-space-size) by taylordowns2000 · Pull Request #1368 · OpenFn/kit

taylordowns2000 · 2026-04-13T18:16:22Z

Adds --max-old-space-size to child processes forked by the worker pool.

Currently, memory limits only apply at the Worker Thread level via V8's resourceLimits. The child process itself runs with V8's default heap limit (~1.5-4GB). During rapid allocation spikes (e.g., infinite loop pushing to an array), this means the child process can grow far beyond the intended run memory limit before V8 kills the thread.

This change passes --max-old-space-size to the child process, reducing the ceiling from V8's default down to WORKER_MAX_RUN_MEMORY_MB (default 500MB). This makes V8 GC more aggressively at the process level and trigger OOM earlier.

This does not guarantee protection against k8s OOM kills — RSS can still transiently exceed the V8 heap limit during rapid allocation. But it significantly reduces the headroom from gigabytes to something much closer to the configured limit. A hard guarantee would require kernel-level enforcement via cgroups (future work).

AI Usage

Please disclose whether you've used AI anywhere in this PR (it's cool, we just
want to know!):

I have used Claude Code
I have used another model
I have not used AI

You can read more details in our
Responsible AI Policy

Release branch checklist

Delete this section if this is not a release PR.

If this IS a release branch:

Run pnpm changeset version from root to bump versions
Run pnpm install
Commit the new version numbers
Run pnpm changeset tag to generate tags
Push tags git push --tags

Tags may need updating if commits come in after the tags are first generated.

josephjclark · 2026-04-14T07:08:08Z

Interesting - let me look into this. Thanks!

josephjclark · 2026-04-14T11:53:58Z

What I need to look into here is: the engine relies on the inner worker thread blowing up so that the child process can catch the OOM error and feed it back to the parent. If we're now constraining the child process, who catches the OOM error that it throws?

We might just need a bit of extra logic to support that.

I also want to re-validate the resourceLimits flag we're using. If it's not robust (and the AI seems to think its not) should we be using it at all? Maybe I should move all the OOM handling into the child process. I don't really like that we have two different strategies to trap memory issues.

I've also got a bit of a worry that setting the old size memory might not actually be helping in the kind of synchronous recursive functions that break us - we maybe need to be looking more at the young memory heap

taylordowns2000 · 2026-04-14T12:09:40Z

My recommendation is to get a build up on staging and be data driven about it. (Relying on observation cause I don’t understand the intervals.) So right now, I can crash the worker reliably, at the click of a button. Let’s put this on staging, click that button, and see if it crashes. Taylor Downs Founder & CEO OpenFn <https://www.openfn.org> // the DPG for ai orchestration & public service automation

…

On Tue, Apr 14, 2026 at 07:54 Joe Clark ***@***.***> wrote: *josephjclark* left a comment (OpenFn/kit#1368) <#1368 (comment)> What I need to look into here is: the engine relies on the inner worker thread blowing up so that the child process can catch the OOM error and feed it back to the parent. If we're now constraining the child process, who catches the OOM error that it throws? We might just need a bit of extra logic to support that. I also want to re-validate the resourceLimits flag we're using. If it's not robust (and the AI seems to think its not) should we be using it at all? Maybe I should move all the OOM handling into the child process. I don't really like that we have two different strategies to trap memory issues. I've also got a bit of a worry that setting the old size memory might not actually be helping in the kind of synchronous recursive functions that break us - we maybe need to be looking more at the young memory heap — Reply to this email directly, view it on GitHub <#1368 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACCUBLIONWBQLZPLLGV4SLD4VYRGZAVCNFSM6AAAAACXXYKX7CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DENBTGY2TKNBRG4> . You are receiving this because you authored the thread.Message ID: ***@***.***>

josephjclark · 2026-04-14T15:44:47Z

Ok, having run some tests locally I'm happy to give this a go. We already correctly process an OOMKill which comes from the child_process - so lightning reports the correct thing.

I don't really buy this at all:

This makes V8 GC more aggressively at the process level and trigger OOM earlier.

I can find no information on whether that may or may not be true. I really don't see the difference it would make? I think the real point here is that setting the memory limit outside the process is always going to be more reliable (ie, the cgroup stuff, which makes me very nervous indeed)

I don't know if gating the child process or the worker thread is more reliable, or whether it's wise/useful to maintain a limit on both. I'll come back to that later.

Anyway let's see whether this makes any difference on staging, then maybe I can look more into it.

* engine: Set max-old-space-size on child process #1368 --------- Co-authored-by: Joe Clark <jclark@openfn.org> * fix(project): include trigger reply fields in version hash (#1362) Signed-off-by: Asish Kumar <officialasishkumar@gmail.com> * fix state growth (#1371) * versions --------- Signed-off-by: Asish Kumar <officialasishkumar@gmail.com> Co-authored-by: Taylor Downs <taylor@openfn.org> Co-authored-by: Asish Kumar <87874775+officialasishkumar@users.noreply.github.com>

concept 1

d4e8bb0

taylordowns2000 added this to Core Apr 13, 2026

github-project-automation bot moved this to New Issues in Core Apr 13, 2026

taylordowns2000 changed the title ~~concept 1~~ Possible k8s OOM kill prevention pill #1 Apr 13, 2026

taylordowns2000 requested a review from josephjclark April 13, 2026 18:17

taylordowns2000 changed the title ~~Possible k8s OOM kill prevention pill #1~~ Possible k8s OOM kill prevention pill 1 (--max-old-space-size) Apr 13, 2026

taylordowns2000 marked this pull request as ready for review April 13, 2026 18:55

josephjclark approved these changes Apr 14, 2026

View reviewed changes

josephjclark changed the base branch from main to release/next April 14, 2026 15:45

changeset

1483f63

josephjclark merged commit aa833d6 into release/next Apr 14, 2026
7 checks passed

josephjclark deleted the memory-concept-1 branch April 14, 2026 16:21

github-project-automation bot moved this from New Issues to Done in Core Apr 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible k8s OOM kill prevention pill 1 (--max-old-space-size)#1368

Possible k8s OOM kill prevention pill 1 (--max-old-space-size)#1368
josephjclark merged 2 commits intorelease/nextfrom
memory-concept-1

taylordowns2000 commented Apr 13, 2026

Uh oh!

josephjclark commented Apr 14, 2026

Uh oh!

josephjclark commented Apr 14, 2026

Uh oh!

taylordowns2000 commented Apr 14, 2026 via email

Uh oh!

josephjclark commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

taylordowns2000 commented Apr 13, 2026

AI Usage

Release branch checklist

Uh oh!

josephjclark commented Apr 14, 2026

Uh oh!

josephjclark commented Apr 14, 2026

Uh oh!

taylordowns2000 commented Apr 14, 2026 via email

Uh oh!

josephjclark commented Apr 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants