Make types of Que jobs Prometheus metrics exclusive #345

Merged
merged 5 commits into master from prometheus-metrics/exclusive-types on Jun 29, 2020

@guicassolato guicassolato commented Jun 25, 2020

Redefines the types of the Prometheus metric for Que jobs (que_jobs_scheduled_total) as follows:

| Type | Meaning |
|------|---------|
| ready | Number of jobs enqueued and ready to be executed as soon as possible (never failed, not expired) |
| scheduled | Number of jobs enqueued to be executed at some time in the future, not now (never failed, not expired) |
| finished | Number of jobs that executed successfully |
| failed | Number of jobs that failed at least once, have not run out of retry attempts, and are therefore scheduled for retry |
| expired | Number of jobs that failed and ran out of retry attempts, and therefore will not be retried |

Given the current implementation in Zync, scheduled and finished are expected to always be zero. The former because Zync never makes a job "wait until" a later time, except when handling retries of failed jobs. The latter because Zync keeps in the database only the executed jobs that failed; jobs that succeed are deleted.

que_jobs_scheduled_total{type="scheduled"} does not correspond to Que's definition of "scheduled" (as shown in Que's web console). Que treats "scheduled" as virtually a synonym of "enqueued": every job scheduled to be executed, now or at some time in the future, is listed as "scheduled" by Que. In fact, even "expired" jobs may be listed as "scheduled" by Que. For the metric, by contrast, we make a clear distinction between "ready" (now), "scheduled" (at some time in the future) and "expired", with an unambiguous meaning for each of these terms.

Moreover, ready is also expected to be zero most of the time. If it is not zero, Zync is either not processing the jobs or not processing them fast enough.
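To make the "exclusive" part concrete, here is a minimal sketch of how a job could be mapped to exactly one of the five types. It is not the code from this PR: the `QueJob` class and the `job_type` function are hypothetical, only the columns touched in the test below (`run_at`, `error_count`, `expired_at`, `finished_at`) are modelled, and the precedence between the checks (in particular finished before expired) is an assumption chosen to agree with the definitions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class QueJob:
    # Hypothetical stand-in for a que_jobs row; other columns are omitted.
    run_at: datetime
    error_count: int = 0
    expired_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None


def job_type(job: QueJob, now: datetime) -> str:
    """Return exactly one type per job; the first matching rule wins (assumed order)."""
    if job.finished_at is not None:
        return "finished"   # executed successfully
    if job.expired_at is not None:
        return "expired"    # failed and ran out of retry attempts
    if job.error_count > 0:
        return "failed"     # failed at least once, still to be retried
    if job.run_at > now:
        return "scheduled"  # never failed, due at some time in the future
    return "ready"          # never failed, due now


now = datetime(2020, 6, 25, 13, 0, tzinfo=timezone.utc)
tomorrow = now + timedelta(days=1)
print(job_type(QueJob(run_at=tomorrow), now))                 # scheduled
print(job_type(QueJob(run_at=tomorrow, error_count=1), now))  # failed
```

Because each job falls through to a single return, a job whose run_at is in the future and whose error_count is 1 counts only as failed, never as both failed and scheduled.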


Closes THREESCALE-5460

@guicassolato guicassolato self-assigned this Jun 25, 2020

raelga commented Jun 25, 2020

Solves https://github.com/3scale/platform/issues/230

@guicassolato guicassolato requested a review from a team June 25, 2020 13:02

guicassolato commented Jun 25, 2020

Test in OpenShift

[scaled zync-que down to 0]

oc exec -it $(oc get pods | grep zync-que | grep Running | awk '{ print $1 }') -- bash -c 'curl http://localhost:9394/metrics | grep que_jobs_scheduled_total' | grep 'UpdateJob'
que_jobs_scheduled_total{job="UpdateJob"} 0
que_jobs_scheduled_total{job="UpdateJob",type="ready"} 0
que_jobs_scheduled_total{job="UpdateJob",type="scheduled"} 0
que_jobs_scheduled_total{job="UpdateJob",type="finished"} 0
que_jobs_scheduled_total{job="UpdateJob",type="failed"} 0
que_jobs_scheduled_total{job="UpdateJob",type="expired"} 0
oc exec -it $(oc get pods | grep zync-database | grep Running | awk '{ print $1 }') -- bash -c 'psql -U zync zync_production'
select id, run_at from que_jobs;
 id |           run_at
----+-----------------------------------
 76 | 2020-06-25 12:47:22.409502+00

type=scheduled

update que_jobs set run_at='2020-06-26 12:47:22.409502+00' where id=76;

[scaled zync-que up to 1]

oc exec -it $(oc get pods | grep zync-que | grep Running | awk '{ print $1 }') -- bash -c 'curl http://localhost:9394/metrics | grep que_jobs_scheduled_total' | grep 'UpdateJob'
que_jobs_scheduled_total{job="UpdateJob"} 1
que_jobs_scheduled_total{job="UpdateJob",type="ready"} 0
que_jobs_scheduled_total{job="UpdateJob",type="scheduled"} 1
que_jobs_scheduled_total{job="UpdateJob",type="finished"} 0
que_jobs_scheduled_total{job="UpdateJob",type="failed"} 0
que_jobs_scheduled_total{job="UpdateJob",type="expired"} 0

type=failed

update que_jobs set error_count=1 where id=76;
oc exec -it $(oc get pods | grep zync-que | grep Running | awk '{ print $1 }') -- bash -c 'curl http://localhost:9394/metrics | grep que_jobs_scheduled_total' | grep 'UpdateJob'
que_jobs_scheduled_total{job="UpdateJob"} 1
que_jobs_scheduled_total{job="UpdateJob",type="ready"} 0
que_jobs_scheduled_total{job="UpdateJob",type="scheduled"} 0
que_jobs_scheduled_total{job="UpdateJob",type="finished"} 0
que_jobs_scheduled_total{job="UpdateJob",type="failed"} 1
que_jobs_scheduled_total{job="UpdateJob",type="expired"} 0

type=expired

update que_jobs set expired_at=now() where id=76;
oc exec -it $(oc get pods | grep zync-que | grep Running | awk '{ print $1 }') -- bash -c 'curl http://localhost:9394/metrics | grep que_jobs_scheduled_total' | grep 'UpdateJob'
que_jobs_scheduled_total{job="UpdateJob"} 1
que_jobs_scheduled_total{job="UpdateJob",type="ready"} 0
que_jobs_scheduled_total{job="UpdateJob",type="scheduled"} 0
que_jobs_scheduled_total{job="UpdateJob",type="finished"} 0
que_jobs_scheduled_total{job="UpdateJob",type="failed"} 0
que_jobs_scheduled_total{job="UpdateJob",type="expired"} 1

type=finished

update que_jobs set expired_at=null, error_count=0, finished_at=now() where id=76;
oc exec -it $(oc get pods | grep zync-que | grep Running | awk '{ print $1 }') -- bash -c 'curl http://localhost:9394/metrics | grep que_jobs_scheduled_total' | grep 'UpdateJob'
que_jobs_scheduled_total{job="UpdateJob"} 1
que_jobs_scheduled_total{job="UpdateJob",type="ready"} 0
que_jobs_scheduled_total{job="UpdateJob",type="scheduled"} 0
que_jobs_scheduled_total{job="UpdateJob",type="finished"} 1
que_jobs_scheduled_total{job="UpdateJob",type="failed"} 0
que_jobs_scheduled_total{job="UpdateJob",type="expired"} 0
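The outputs above can also be checked mechanically. The following is a rough sketch, not part of the PR, that parses the que_jobs_scheduled_total lines and verifies that, for each job class, the per-type samples add up to the sample without a type label. The `check_exclusive` helper is hypothetical, and reading the unlabelled sample as the total for that job class is an assumption based on the outputs shown here.

```python
import re
from collections import defaultdict

# Matches lines such as: que_jobs_scheduled_total{job="UpdateJob",type="ready"} 0
SAMPLE = re.compile(r'^que_jobs_scheduled_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def check_exclusive(metrics_text: str) -> dict:
    """Map each job class to True when its per-type counts sum to its total."""
    totals = {}
    by_type = defaultdict(float)
    for line in metrics_text.splitlines():
        match = SAMPLE.match(line.strip())
        if match is None:
            continue  # skip comments, blank lines and other metrics
        labels = dict(LABEL.findall(match.group("labels")))
        value = float(match.group("value"))
        if "type" in labels:
            by_type[labels["job"]] += value
        else:
            # Assumption: the sample without a type label is the overall count.
            totals[labels["job"]] = value
    return {job: by_type[job] == total for job, total in totals.items()}


output = """\
que_jobs_scheduled_total{job="UpdateJob"} 1
que_jobs_scheduled_total{job="UpdateJob",type="ready"} 0
que_jobs_scheduled_total{job="UpdateJob",type="scheduled"} 0
que_jobs_scheduled_total{job="UpdateJob",type="finished"} 0
que_jobs_scheduled_total{job="UpdateJob",type="failed"} 1
que_jobs_scheduled_total{job="UpdateJob",type="expired"} 0
"""
print(check_exclusive(output))  # {'UpdateJob': True}
```

If a job were counted under two types at once, for example both scheduled and failed, the sum would exceed the total and the check would report False for that job class.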

@raelga raelga left a comment

Thanks for solving this that fast!

@guicassolato guicassolato merged commit 801da41 into master Jun 29, 2020
@guicassolato guicassolato deleted the prometheus-metrics/exclusive-types branch June 29, 2020 13:43