
Attempt to stabilise worker memory #1099

Merged
josephjclark merged 8 commits into release/next from memory-blues
Oct 27, 2025

Conversation

@josephjclark
Collaborator

@josephjclark josephjclark commented Oct 27, 2025

Short Description

This PR removes a possible memory leak which may be triggering Lost runs.

Fixes #1096 (I hope)

Implementation Details

This PR absolutely does something for me locally. Running stress tests on the Worker with a low memory limit, on this branch I seem to have total stability, where on main heap errors are frequent. So that's good.

But will this help us in production? I'm really not sure. This PR may take out the memory leak, but if our workers in production are blowing up at 40% memory capacity, then I don't think we can blame the memory leak.

Here's what I know about production:

  • We have some internal QA workflows which are sending large payloads through the app on regular cron jobs. This is straining the system during times of high load, squeezing the margins.
  • Those have been disabled for now, but those workflows should absolutely be allowed to run if the services team wants them to

My suspicion remains that when the worker is processing several large JSON payloads at once in the main thread, and its working memory happens to be quite high, it runs out of heap memory. The problem here may just be that parsing large JSON objects requires a lot of memory. In which case the solutions are:

  • Reduce the payload limit from 10MB to 4 or 5MB
  • Find a more optimal way to process those JSON objects
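As a rough illustration of the first option, a size guard can bound the worst-case heap spike before `JSON.parse` ever allocates the full object graph. This is a hypothetical sketch, not the worker's actual code: the 4MB constant and the `parsePayload` helper are both made up for the example.

```javascript
// Hypothetical sketch: reject oversized payloads before JSON.parse allocates
// a full object graph on the main-thread heap. The 4MB limit is illustrative
// of the "reduce the payload limit" option, not the worker's real config.
const MAX_PAYLOAD_BYTES = 4 * 1024 * 1024; // 4MB, down from 10MB

function parsePayload(raw) {
  // Measure byte length, not string length: multi-byte characters differ
  const size = Buffer.byteLength(raw, 'utf8');
  if (size > MAX_PAYLOAD_BYTES) {
    throw new Error(`Payload of ${size} bytes exceeds ${MAX_PAYLOAD_BYTES} byte limit`);
  }
  // JSON.parse still holds the source string AND the resulting object graph
  // in memory at once, so capping the input bounds the transient heap cost
  return JSON.parse(raw);
}

console.log(parsePayload(JSON.stringify({ ok: true })).ok); // true
```

Note this only bounds a single parse; several concurrent parses near the limit can still stack up, which is why lowering the cap (rather than just enforcing it) is on the table.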

AI Usage

Please disclose how you've used AI in this work (it's cool, we just want to know!):

  • Code generation (copilot but not intellisense)
  • Learning or fact checking
  • Strategy / design
  • Optimisation / refactoring
  • Translation / spellchecking / doc gen
  • Other
  • I have not used AI

You can read more details in our Responsible AI Policy

@github-project-automation github-project-automation bot moved this to New Issues in Core Oct 27, 2025
@josephjclark josephjclark changed the base branch from main to release/next October 27, 2025 17:54
@josephjclark josephjclark marked this pull request as ready for review October 27, 2025 18:00
@josephjclark josephjclark merged commit b61bf9b into release/next Oct 27, 2025
6 checks passed
@github-project-automation github-project-automation bot moved this from New Issues to Done in Core Oct 27, 2025
@josephjclark
Collaborator Author

48 hours of worker memory usage. The first ~24 hours are before this patch; the second 24 hours are after it was deployed.

[Image: worker memory usage over 48 hours]



Development

Successfully merging this pull request may close these issues:

  • Sudden explosion of lost runs
