Release 2369
Trello card
Context
Prometheus metrics work fine in the main application, but some of the DelayedJob metrics are not being reported.
Whilst we run the metrics server and the DelayedJob command in the same process, daemonize forks a new process under the hood. The default Prometheus store keeps metric values in each process's own memory, so values recorded in one process are not visible to the other.
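To illustrate the problem, here is a stdlib-only Ruby sketch. It is not Prometheus code and assumes a Unix-like OS where Process.fork is available. A plain Hash stands in for the default in-memory store and a temporary file stands in for a file-backed store; the run_in_child helper and the jobs_processed counter are hypothetical.

```ruby
require "tmpdir"

# Hypothetical helper: runs the given block in a forked child and waits for it.
def run_in_child
  pid = Process.fork do
    yield
    # exit! skips at_exit hooks inherited from the parent process.
    exit!(0)
  end
  Process.wait(pid)
end

# In-memory "store": the child mutates its own copy of the parent's memory,
# so the parent never sees the increment.
in_memory = Hash.new(0)
run_in_child { in_memory[:jobs_processed] += 1 }
in_memory_count = in_memory[:jobs_processed]

# File-backed "store": the child writes to disk, which the parent can read back.
file_count = Dir.mktmpdir do |dir|
  path = File.join(dir, "jobs_processed")
  File.write(path, "0")
  run_in_child { File.write(path, (File.read(path).to_i + 1).to_s) }
  File.read(path).to_i
end

puts "in-memory: #{in_memory_count}, file-backed: #{file_count}"
```

Running it prints "in-memory: 0, file-backed: 1": the child's increment to the Hash is lost because fork gives the child a copy of the parent's memory, whereas the write to disk survives.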
Currently the only alternative is the DirectFileStore, which works across forked processes because each process writes its metrics to files on disk. Sidekiq doesn't have this issue because we can run a metrics server inside each Sidekiq server, but I couldn't find a way of doing the equivalent with DelayedJob (it doesn't have a hook for configuring a server in the pool).
This also increases the disk quota, since we'll now be writing metrics to disk. We were sitting at ~75% disk usage, so this should give us plenty of headroom.
Changes proposed in this pull request
- Switch to DirectFileStore for Prometheus metrics
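For reviewers unfamiliar with the store, a switch of this kind looks roughly like the following Ruby sketch, based on the prometheus-client README. The metrics directory is hypothetical and the actual change in this PR may differ; it also assumes the prometheus-client gem is installed.

```ruby
require "fileutils"
require "prometheus/client"
require "prometheus/client/data_stores/direct_file_store"

# Hypothetical directory; every process writes its own metric files here.
metrics_dir = File.join(Dir.pwd, "tmp", "prometheus")

# Clear files left over from previous runs so stale values are not exported.
# This should run once at boot, before any process forks.
FileUtils.rm_rf(metrics_dir)
FileUtils.mkdir_p(metrics_dir)

# The store must be configured before any metric is registered.
Prometheus::Client.config.data_store =
  Prometheus::Client::DataStores::DirectFileStore.new(dir: metrics_dir)
```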
Guidance to review
I've manually checked this works by booting a DelayedJob worker for my review app, SSHing onto it and curling the metrics endpoint.
When we used the DirectFileStore in GiT, we saw CPU usage increase steadily until it maxed out; we'll need to monitor this to see if it's still a problem. We can monitor CPU usage here and disk usage here.
Relevant docs on Prometheus client built-in stores: https://github.com/prometheus/client_ruby#built-in-stores