-
Notifications
You must be signed in to change notification settings - Fork 28
Add metrics collection to the framework #523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@shreyb can you post an example of the currently published metrics? Just a raw scrape of /metrics. |
Yep - here you go: |
|
Thanks. Some or all of these duration metrics may be more interesting as histograms to keep an eye on outliers; summary only gives you average duration which is generally a less interesting metric than quantiles (median, 95th, and 99th percentile are common), which can be estimated from a histogram, including across multiple instances.
What's the estimated cardinality of pid over the lifetime of a DE instance? Anything else you can use that's more descriptive (and bounded)? |
|
This pull request introduces 3 alerts when merging 32d802d into a4a7938 - view on LGTM.com new alerts:
|
|
This pull request introduces 3 alerts when merging 6651d73 into a4a7938 - view on LGTM.com new alerts:
|
|
This pull request introduces 3 alerts when merging c4d0185 into a4a7938 - view on LGTM.com new alerts:
|
|
This pull request introduces 1 alert when merging b7581fe into a4a7938 - view on LGTM.com new alerts:
|
8d09532 to
ae16e2b
Compare
|
This pull request introduces 1 alert when merging 79c25eb into 28919b1 - view on LGTM.com new alerts:
|
|
This pull request introduces 1 alert when merging d5fa6f4 into 28919b1 - view on LGTM.com new alerts:
|
13f0069 to
c35a45b
Compare
|
This pull request introduces 1 alert when merging c35a45b into e95071f - view on LGTM.com new alerts:
|
|
This pull request introduces 1 alert when merging 930bdd6 into e95071f - view on LGTM.com new alerts:
|
|
This pull request introduces 1 alert when merging 4ab8fcc into e95071f - view on LGTM.com new alerts:
|
Codecov Report
@@ Coverage Diff @@
## master #523 +/- ##
==========================================
- Coverage 95.19% 94.61% -0.59%
==========================================
Files 46 46
Lines 2852 2934 +82
Branches 464 476 +12
==========================================
+ Hits 2715 2776 +61
- Misses 103 121 +18
- Partials 34 37 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
|
This pull request introduces 1 alert when merging d1f8507 into e95071f - view on LGTM.com new alerts:
|
d1f8507 to
a747ef9
Compare
|
This pull request introduces 1 alert when merging a747ef9 into 30951a5 - view on LGTM.com new alerts:
|
|
This pull request introduces 1 alert when merging 465d380 into 30951a5 - view on LGTM.com new alerts:
|
465d380 to
ac26570
Compare
|
This pull request introduces 1 alert when merging c5cffa6 into 814669d - view on LGTM.com new alerts:
|
|
This pull request introduces 1 alert when merging 1f76ac1 into 814669d - view on LGTM.com new alerts:
|
|
This pull request introduces 1 alert when merging 0c69662 into 814669d - view on LGTM.com new alerts:
|
|
@shreyb I think the linter problem is actually whitespace in an empty line that trips the pre-commit scripts. If you enable pre-commit it should have been catching and fixing that. |
|
#548 fixed the Github actions issue for this PR. |
|
Can you squash this down to a smaller set of commits? |
4f6d9ca to
c1baf7a
Compare
@jcpunk Done. |
…ess mode, and added CherryPy webserver for prometheus metrics
Added a section in the global config file (decision_engine.jsonnet) to allow for configuration of webserver. Currently only the port is configurable. Added code to DecisionEngine class to read port config, or use 8000 as default. Also changed DecisionEngine.start_metrics_server to DecisionEngine.start_webserver, and added a --no-webserver option to opt out of starting the webserver (metrics are still collected though).
Added steps to allow for metrics setup and disabling in systemd unit file and environment file.
In CI, all the unit tests would pass, but then the job to run the tests would time out. This is because the CherryPy webserver was started up alongside, and never was shut down. This would cause the job to eventually time out. Passing this argument should ensure that this doesn't happen.
698f03a to
6991d17
Compare
|
This pull request introduces 1 alert when merging 6991d17 into 4f920dc - view on LGTM.com new alerts:
|
|
When I've resolved the rebase errors, I'll squash those down to one commit and re-push this branch. |
Errors fixed: transform.name --> worker.name, transform timestamp should use set_to_current_time like the others, accidentally brought along a couple of old methods that were deprecated
1ddf907 to
7eb622b
Compare
Done. |
|
Would someone be able to take a look at this by any chance? This PR has been open for a while, and I've had to rebase twice since it's been ready for review, which, on a larger PR like this, is risky. Thanks! |
|
I don't see any obvious show stoppers. |
vitodb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By looking at the code an test results all seems to be good.
I' approving this PR also based on feedback provided by Pat that had a close look at this PR since the beginning.
This PR adds lightly-wrapped prometheus metrics collection to the code, as well as adding a cherrypy webserver to serve those metrics (and other future functions). The metrics are served by default on port 8000, at /metrics, but this is all configurable. These metrics can then be collected in plaintext either by a prometheus server, or some other metrics collection service.
It is also possible to run the decisionengine without the webserver, by passing the
--no-webserverflag, or by changing the environment of the systemd unit to pass that flag into the start command.