Worker: find a better way to manage ephemeral storage #919

@josephjclark

Description

We are learning that when a worker uses too much ephemeral storage (disk space), the pod will be killed: instantly, without mercy, and at the cost of lost runs.

The problem here, of course, is autoinstall and node_modules.

What solutions do we have here?

  • We can increase the ephemeral storage to reduce the frequency of this occurring
  • We can kill workers every 24 hours to reduce the rate of this happening. A purge. This also helps with the memory leak btw
  • Workers could manage their node_modules installation, aiming to keep < 20 modules installed and periodically removing the least used adaptors. This will result in more installations but should ensure better storage management
  • Can we do something like: a worker only claims runs for certain adaptor versions? But this is hard to track. What do we do if, e.g., no worker wants to install kobotoolbox@0.2.0? So I don't think this is anything
  • We could use a shared volume in Kubernetes. This means workers start up faster (no need to autoinstall common). The downside is that the one shared volume might need to store every adaptor version ever released, all at once (I guess the shared volume would need some management or a regular purge). Then again, a 1TB volume would presumably last us a very long time. Does the risk of the npm registry getting corrupted (which can happen locally with the CLI) increase? Yes, and if the installation DOES get corrupted, then all workers will break.
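
To make the "keep < 20 modules installed" option more concrete, here is a rough sketch of how a worker might track adaptor usage and decide what to remove. It reads "least used" as "least recently used", which is an assumption, and every name in it (`AdaptorUsageTracker`, `touch`, `evict`) is hypothetical rather than existing worker code. It only picks candidates for removal; deleting them from disk is left to the caller.

```typescript
// Hypothetical sketch: none of these names exist in the worker today.
type Specifier = string; // e.g. "kobotoolbox@0.2.0"

class AdaptorUsageTracker {
  // A Map keeps insertion order, so the first key is the least recently used.
  private lastUsed = new Map<Specifier, number>();
  private maxInstalled: number;

  constructor(maxInstalled: number = 20) {
    this.maxInstalled = maxInstalled;
  }

  // Call whenever a run uses (or autoinstalls) an adaptor version.
  touch(specifier: Specifier, now: number = Date.now()): void {
    this.lastUsed.delete(specifier); // delete + set moves the key to the end
    this.lastUsed.set(specifier, now); // timestamp kept only for logging
  }

  // Returns the least recently used adaptors that exceed the limit,
  // and stops tracking them. The caller is responsible for removing
  // the corresponding folders from disk.
  evict(): Specifier[] {
    const excess = this.lastUsed.size - this.maxInstalled;
    if (excess <= 0) return [];
    const victims = Array.from(this.lastUsed.keys()).slice(0, excess);
    for (const victim of victims) this.lastUsed.delete(victim);
    return victims;
  }

  // Adaptors currently tracked as installed, oldest first.
  installed(): Specifier[] {
    return Array.from(this.lastUsed.keys());
  }
}
```

A periodic job could call `evict()` and delete whatever it returns. A real version would presumably also need to skip adaptors that an active run is still using, which this sketch does not handle.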
