Skip to content

Possible k8s OOM kill prevention pill 1 (--max-old-space-size)#1368

Merged
josephjclark merged 2 commits intorelease/nextfrom
memory-concept-1
Apr 14, 2026
Merged

Possible k8s OOM kill prevention pill 1 (--max-old-space-size)#1368
josephjclark merged 2 commits intorelease/nextfrom
memory-concept-1

Conversation

@taylordowns2000
Copy link
Copy Markdown
Member

Adds --max-old-space-size to child processes forked by the worker pool.

Currently, memory limits only apply at the Worker Thread level via V8's resourceLimits. The child process itself runs with V8's default heap limit (~1.5-4GB). During rapid allocation spikes (e.g., infinite loop pushing to an array), this means the child process can grow far beyond the intended run memory limit before V8 kills the thread.

This change passes --max-old-space-size to the child process, reducing the ceiling from V8's default down to WORKER_MAX_RUN_MEMORY_MB (default 500MB). This makes V8 GC more aggressively at the process level and trigger OOM earlier.

This does not guarantee protection against k8s OOM kills — RSS can still transiently exceed the V8 heap limit during rapid allocation. But it significantly reduces the headroom from gigabytes to something much closer to the configured limit. A hard guarantee would require kernel-level enforcement via cgroups (future work).

AI Usage

Please disclose whether you've used AI anywhere in this PR (it's cool, we just
want to know!):

  • I have used Claude Code
  • I have used another model
  • I have not used AI

You can read more details in our
Responsible AI Policy

Release branch checklist

Delete this section if this is not a release PR.

If this IS a release branch:

  • Run pnpm changeset version from root to bump versions
  • Run pnpm install
  • Commit the new version numbers
  • Run pnpm changeset tag to generate tags
  • Push tags git push --tags

Tags may need updating if commits come in after the tags are first generated.

@github-project-automation github-project-automation bot moved this to New Issues in Core Apr 13, 2026
@taylordowns2000 taylordowns2000 changed the title concept 1 Possible k8s OOM kill prevention pill #1 Apr 13, 2026
@taylordowns2000 taylordowns2000 changed the title Possible k8s OOM kill prevention pill #1 Possible k8s OOM kill prevention pill 1 (--max-old-space-size) Apr 13, 2026
@taylordowns2000 taylordowns2000 marked this pull request as ready for review April 13, 2026 18:55
@josephjclark
Copy link
Copy Markdown
Collaborator

Interesting - let me look into this. Thanks!

@josephjclark
Copy link
Copy Markdown
Collaborator

What I need to look into here is: the engine relies on the inner worker thread blowing up so that the child process can catch the OOM error and feed it back to the parent. If we're now constraining the child process, who catches the OOM error that it throws?

We might just need a bit of extra logic to support that.

I also want to re-validate the resourceLimits flag we're using. If it's not robust (and the AI seems to think its not) should we be using it at all? Maybe I should move all the OOM handling into the child process. I don't really like that we have two different strategies to trap memory issues.

I've also got a bit of a worry that setting the old size memory might not actually be helping in the kind of synchronous recursive functions that break us - we maybe need to be looking more at the young memory heap

@taylordowns2000
Copy link
Copy Markdown
Member Author

taylordowns2000 commented Apr 14, 2026 via email

@josephjclark
Copy link
Copy Markdown
Collaborator

Ok, having run some tests locally I'm happy to give this a go. We already correctly process an OOMKill which comes from the child_process - so lightning reports the correct thing.

I don't really buy this at all:

This makes V8 GC more aggressively at the process level and trigger OOM earlier.

I can find no information on whether that may or may not be true. I really don't see the difference it would make? I think the real point here is that setting the memory limit outside the process is always going to be more reliable (ie, the cgroup stuff, which makes me very nervous indeed)

I don't know if gating the child process or the worker thread is more reliable, or whether it's wise/useful to maintain a limit on both. I'll come back to that later.

Anyway let's see whether this makes any difference on staging, then maybe I can look more into it.

@josephjclark josephjclark changed the base branch from main to release/next April 14, 2026 15:45
@josephjclark josephjclark merged commit aa833d6 into release/next Apr 14, 2026
7 checks passed
@josephjclark josephjclark deleted the memory-concept-1 branch April 14, 2026 16:21
@github-project-automation github-project-automation bot moved this from New Issues to Done in Core Apr 14, 2026
josephjclark added a commit that referenced this pull request Apr 14, 2026
* engine: Set max-old-space-size on child process

#1368

---------

Co-authored-by: Joe Clark <jclark@openfn.org>

* fix(project): include trigger reply fields in version hash (#1362)

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>

* fix state growth (#1371)

* versions

---------

Signed-off-by: Asish Kumar <officialasishkumar@gmail.com>
Co-authored-by: Taylor Downs <taylor@openfn.org>
Co-authored-by: Asish Kumar <87874775+officialasishkumar@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants