@shreyb
Contributor

@shreyb shreyb commented Sep 23, 2021

This PR adds lightly wrapped Prometheus metrics collection to the code and adds a CherryPy webserver to serve those metrics (and, in future, other functions). By default the metrics are served on port 8000 at /metrics, though this is configurable. They can then be collected in plaintext by a Prometheus server or by some other metrics collection service.

It is also possible to run the decisionengine without the webserver, by passing the --no-webserver flag, or by changing the environment of the systemd unit to pass that flag into the start command.
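
For readers unfamiliar with the scrape model, here is a minimal, standard-library-only sketch of what "serving metrics in plaintext at /metrics" amounts to. It is not the code in this PR, which uses CherryPy and the Prometheus Python client; `MetricsHandler`, `start_metrics_server`, `scrape`, `stop_metrics_server`, and the sample payload are hypothetical stand-ins chosen for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical payload in the Prometheus text exposition format.
SAMPLE_METRICS = (
    "# HELP de_workers_total Multiprocess metric\n"
    "# TYPE de_workers_total gauge\n"
    'de_workers_total{pid="2102"} 1.0\n'
)


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the sample payload at /metrics and 404 everywhere else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = SAMPLE_METRICS.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_metrics_server(port=8000, host="127.0.0.1"):
    """Start the server in a background thread; port=0 picks a free port."""
    server = ThreadingHTTPServer((host, port), MetricsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server, thread


def scrape(port, host="127.0.0.1"):
    """Fetch /metrics the way a collector would, bypassing any proxy settings."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"http://{host}:{port}/metrics", timeout=5) as resp:
        return resp.read().decode("utf-8")


def stop_metrics_server(server, thread):
    """Shut the server down explicitly so it cannot outlive its caller."""
    server.shutdown()
    server.server_close()
    thread.join(timeout=5)
```

A collector only needs the URL; the explicit `stop_metrics_server` step matters mostly in tests, where a server left running can keep a job from finishing.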

@pep8speaks

pep8speaks commented Sep 23, 2021

Hello @shreyb! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-11-03 21:07:41 UTC

@retzkek

retzkek commented Sep 23, 2021

@shreyb can you post an example of the currently published metrics? Just a raw scrape of /metrics.

@shreyb
Contributor Author

shreyb commented Sep 23, 2021

> @shreyb can you post an example of the currently published metrics? Just a raw scrape of /metrics.

Yep - here you go:

$ de-client --metrics
# HELP de_client_status_duration_seconds Multiprocess metric
# TYPE de_client_status_duration_seconds summary
de_client_status_duration_seconds_count 0.0
de_client_status_duration_seconds_sum 0.0
# HELP de_client_print_product_duration_seconds Multiprocess metric
# TYPE de_client_print_product_duration_seconds summary
de_client_print_product_duration_seconds_count 0.0
de_client_print_product_duration_seconds_sum 0.0
# HELP de_client_metrics_duration_seconds Multiprocess metric
# TYPE de_client_metrics_duration_seconds summary
de_client_metrics_duration_seconds_count 81.0
de_client_metrics_duration_seconds_sum 0.11079557938501239
# HELP de_client_start_channel_duration_seconds Multiprocess metric
# TYPE de_client_start_channel_duration_seconds summary
de_client_start_channel_duration_seconds_count{channel_name="Nersc"} 1.0
de_client_start_channel_duration_seconds_sum{channel_name="Nersc"} 0.7267125048674643
de_client_start_channel_duration_seconds_count{channel_name="test_channel"} 1.0
de_client_start_channel_duration_seconds_sum{channel_name="test_channel"} 1.3527377960272133
# HELP de_workers_total Multiprocess metric
# TYPE de_workers_total gauge
de_workers_total{pid="2102"} 1.0
de_workers_total{pid="2126"} 0.0
# HELP de_source_run_seconds Multiprocess metric
# TYPE de_source_run_seconds summary
de_source_run_seconds_count{channel_name="test_channel",source_name="SourceNOP"} 4666.0
de_source_run_seconds_sum{channel_name="test_channel",source_name="SourceNOP"} 4.778663814533502
# HELP de_transform_run_seconds Multiprocess metric
# TYPE de_transform_run_seconds summary
de_transform_run_seconds_count{channel_name="test_channel",transform_name="TransformNOP"} 4665.0
de_transform_run_seconds_sum{channel_name="test_channel",transform_name="TransformNOP"} 79.57181591587141
# HELP de_logicengine_run_seconds Multiprocess metric
# TYPE de_logicengine_run_seconds summary
de_logicengine_run_seconds_count{channel_name="test_channel",logicengine_name="LogicEngine"} 4665.0
de_logicengine_run_seconds_sum{channel_name="test_channel",logicengine_name="LogicEngine"} 90.91386365192011
# HELP de_publisher_run_seconds Multiprocess metric
# TYPE de_publisher_run_seconds summary
de_publisher_run_seconds_count{channel_name="test_channel",publisher_name="PublisherNOP"} 4665.0
de_publisher_run_seconds_sum{channel_name="test_channel",publisher_name="PublisherNOP"} 9.050738890189677
# HELP de_source_last_acquire_timestamp_seconds Multiprocess metric
# TYPE de_source_last_acquire_timestamp_seconds gauge
de_source_last_acquire_timestamp_seconds{channel_name="test_channel",pid="2126",source_name="SourceNOP"} 1.6324228903925512e+09
# HELP de_transform_last_run_timestamp_seconds Multiprocess metric
# TYPE de_transform_last_run_timestamp_seconds gauge
de_transform_last_run_timestamp_seconds{channel_name="test_channel",pid="2126",transform_name="TransformNOP"} 1.6324228904440765e+09
# HELP de_logicengine_last_run_timestamp_seconds Multiprocess metric
# TYPE de_logicengine_last_run_timestamp_seconds gauge
de_logicengine_last_run_timestamp_seconds{channel_name="test_channel",logicengine_name="LogicEngine",pid="2126"} 1.6324228904877343e+09
# HELP de_publisher_last_run_timestamp_seconds Multiprocess metric
# TYPE de_publisher_last_run_timestamp_seconds gauge
de_publisher_last_run_timestamp_seconds{channel_name="test_channel",pid="2126",publisher_name="PublisherNOP"} 1.6324228905370376e+09
# HELP de_channel_state Multiprocess metric
# TYPE de_channel_state gauge
de_channel_state{channel_name="test_channel",pid="2126"} 3.0

@retzkek

retzkek commented Sep 23, 2021

Thanks. Some or all of these duration metrics may be more interesting as histograms to keep an eye on outliers; summary only gives you average duration which is generally a less interesting metric than quantiles (median, 95th, and 99th percentile are common), which can be estimated from a histogram, including across multiple instances. Summary should also be able to provide quantiles directly, but it seems that's not implemented in the Python client yet.
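
To make the distinction concrete, here is a small sketch under stated assumptions. The average uses the `_sum` and `_count` values for `de_transform_run_seconds` from the scrape above; the histogram bucket bounds and counts are invented purely for illustration, and `estimate_quantile` is a hypothetical helper that mimics the within-bucket linear interpolation that Prometheus's `histogram_quantile` performs, not the actual implementation.

```python
import math


def average_duration(total_seconds, count):
    """All a count/sum-only summary can tell you: the mean duration."""
    return total_seconds / count if count else math.nan


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with (math.inf, total_count). Observations are
    assumed to be spread evenly inside each bucket (linear interpolation).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            in_bucket = cumulative - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                # Cannot interpolate into +Inf (or an empty bucket);
                # fall back to the last finite bound.
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Mean transform run time from the scrape: _sum / _count.
mean_transform = average_duration(79.57181591587141, 4665.0)

# Invented buckets: most runs are fast, a few are slow outliers.
example_buckets = [
    (0.01, 4000),
    (0.05, 4500),
    (0.5, 4600),
    (5.0, 4665),
    (math.inf, 4665),
]
median = estimate_quantile(0.5, example_buckets)
p99 = estimate_quantile(0.99, example_buckets)
```

With buckets like these, the median sits well below the mean while the 99th percentile sits far above it, which is the kind of outlier a mean alone hides. Because bucket counts are plain counters, they can also be summed across instances before estimating the quantile.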

de_workers_total{pid="2102"} 1.0

What's the estimated cardinality of pid over the lifetime of a DE instance? Anything else you can use that's more descriptive (and bounded)?
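
One hedged way to answer the cardinality question empirically is to count the distinct values a label takes in a scrape. The sketch below is a deliberately naive parser: the regular expressions assume label values contain no escaped quotes or closing braces, and `label_values` is a hypothetical helper, not part of any Prometheus library.

```python
import re
from collections import defaultdict

# Metric name, optional {label="value",...} block, then the sample value.
SAMPLE_LINE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+\S+"
)
LABEL_PAIR = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<value>[^"]*)"')


def label_values(scrape_text, label):
    """Map metric name -> set of distinct values seen for `label`."""
    seen = defaultdict(set)
    for line in scrape_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if not match or not match.group("labels"):
            continue
        for pair in LABEL_PAIR.finditer(match.group("labels")):
            if pair.group("key") == label:
                seen[match.group("name")].add(pair.group("value"))
    return dict(seen)


# A few lines taken from the scrape posted earlier in this thread.
scrape_text = '''\
# TYPE de_workers_total gauge
de_workers_total{pid="2102"} 1.0
de_workers_total{pid="2126"} 0.0
# TYPE de_channel_state gauge
de_channel_state{channel_name="test_channel",pid="2126"} 3.0
'''

pids = label_values(scrape_text, "pid")
```

Running this against scrapes taken before and after worker restarts would show whether the set of `pid` values keeps growing, whereas a label such as `channel_name` stays bounded by the configured channels.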

@lgtm-com

lgtm-com bot commented Sep 23, 2021

This pull request introduces 3 alerts when merging 32d802d into a4a7938 - view on LGTM.com

new alerts:

  • 2 for 'import *' may pollute namespace
  • 1 for Unused import

@lgtm-com

lgtm-com bot commented Sep 23, 2021

This pull request introduces 3 alerts when merging 6651d73 into a4a7938 - view on LGTM.com

new alerts:

  • 2 for 'import *' may pollute namespace
  • 1 for Unused import

@lgtm-com

lgtm-com bot commented Sep 23, 2021

This pull request introduces 3 alerts when merging c4d0185 into a4a7938 - view on LGTM.com

new alerts:

  • 2 for 'import *' may pollute namespace
  • 1 for Unused import

@lgtm-com

lgtm-com bot commented Sep 24, 2021

This pull request introduces 1 alert when merging b7581fe into a4a7938 - view on LGTM.com

new alerts:

  • 1 for Unused import

@lgtm-com

lgtm-com bot commented Oct 6, 2021

This pull request introduces 1 alert when merging 79c25eb into 28919b1 - view on LGTM.com

new alerts:

  • 1 for Unused import

@lgtm-com

lgtm-com bot commented Oct 6, 2021

This pull request introduces 1 alert when merging d5fa6f4 into 28919b1 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@lgtm-com

lgtm-com bot commented Oct 13, 2021

This pull request introduces 1 alert when merging c35a45b into e95071f - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@lgtm-com

lgtm-com bot commented Oct 13, 2021

This pull request introduces 1 alert when merging 930bdd6 into e95071f - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@lgtm-com

lgtm-com bot commented Oct 13, 2021

This pull request introduces 1 alert when merging 4ab8fcc into e95071f - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@codecov

codecov bot commented Oct 13, 2021

Codecov Report

Merging #523 (1ddf907) into master (4f920dc) will decrease coverage by 0.58%.
The diff coverage is 84.21%.

❗ Current head 1ddf907 differs from pull request most recent head 7eb622b. Consider uploading reports for the commit 7eb622b to get more accurate results

@@            Coverage Diff             @@
##           master     #523      +/-   ##
==========================================
- Coverage   95.19%   94.61%   -0.59%     
==========================================
  Files          46       46              
  Lines        2852     2934      +82     
  Branches      464      476      +12     
==========================================
+ Hits         2715     2776      +61     
- Misses        103      121      +18     
- Partials       34       37       +3     
| Flag | Coverage Δ |
| --- | --- |
| python-3.10 | 94.51% <84.21%> (-0.58%) ⬇️ |
| python-3.6 | 94.35% <84.00%> (-0.48%) ⬇️ |


| Impacted Files | Coverage Δ |
| --- | --- |
| src/decisionengine/framework/util/metrics.py | 83.33% <42.85%> (-16.67%) ⬇️ |
| src/decisionengine/framework/engine/de_client.py | 93.22% <50.00%> (-1.52%) ⬇️ |
| .../decisionengine/framework/engine/DecisionEngine.py | 87.36% <82.69%> (-1.91%) ⬇️ |
| ...ecisionengine/framework/taskmanager/TaskManager.py | 96.78% <100.00%> (+0.27%) ⬆️ |
| src/decisionengine/framework/dataspace/maintain.py | 99.04% <0.00%> (-0.96%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4f920dc...7eb622b.

@lgtm-com

lgtm-com bot commented Oct 13, 2021

This pull request introduces 1 alert when merging d1f8507 into e95071f - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@lgtm-com

lgtm-com bot commented Oct 15, 2021

This pull request introduces 1 alert when merging a747ef9 into 30951a5 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@lgtm-com

lgtm-com bot commented Oct 15, 2021

This pull request introduces 1 alert when merging 465d380 into 30951a5 - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@lgtm-com

lgtm-com bot commented Oct 22, 2021

This pull request introduces 1 alert when merging c5cffa6 into 814669d - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@lgtm-com

lgtm-com bot commented Oct 25, 2021

This pull request introduces 1 alert when merging 1f76ac1 into 814669d - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@lgtm-com

lgtm-com bot commented Oct 25, 2021

This pull request introduces 1 alert when merging 0c69662 into 814669d - view on LGTM.com

new alerts:

  • 1 for Module is imported with 'import' and 'import from'

@mambelli
Contributor

mambelli commented Oct 25, 2021

@shreyb I think the linter problem is actually whitespace on an empty line that trips the pre-commit scripts. If you enable pre-commit, it should catch and fix that.
We will discuss the use of pypy on Thursday. If all other tests are working, I'd feel OK.

@shreyb
Contributor Author

shreyb commented Oct 28, 2021

#548 fixed the GitHub Actions issue for this PR.

@jcpunk
Contributor

jcpunk commented Oct 29, 2021

Can you squash this down to a smaller set of commits?

@shreyb
Contributor Author

shreyb commented Nov 1, 2021

> Can you squash this down to a smaller set of commits?

@jcpunk Done.

shreyb added 16 commits November 3, 2021 19:40
…ess mode, and added CherryPy webserver for prometheus metrics
Added a section in the global config file (decision_engine.jsonnet) to allow for configuration of webserver.  Currently only the port is configurable.  Added code to DecisionEngine class to read port config, or use 8000 as default.  Also changed DecisionEngine.start_metrics_server to DecisionEngine.start_webserver, and added a --no-webserver option to opt out of starting the webserver (metrics are still collected though).
Added steps to allow for metrics setup and disabling in systemd unit file and environment file.
In CI, all the unit tests would pass, but then the job to run the tests would time out.  This is because the CherryPy webserver was started up alongside, and never was shut down.  This would cause the job to eventually time out.  Passing this argument should ensure that this doesn't happen.
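
As a rough illustration of the behaviour these commit messages describe (port read from the global config with 8000 as the fallback, plus a `--no-webserver` opt-out), here is a sketch. The `webserver` and `port` key names, `build_parser`, and `webserver_settings` are assumptions made for the example; the actual config schema and argument parser in decisionengine may differ.

```python
import argparse

DEFAULT_PORT = 8000  # default stated in the PR description


def build_parser():
    """Hypothetical parser exposing only the opt-out flag."""
    parser = argparse.ArgumentParser(description="webserver option sketch")
    parser.add_argument(
        "--no-webserver",
        action="store_true",
        help="do not start the webserver (metrics are still collected)",
    )
    return parser


def webserver_settings(global_config, argv=None):
    """Return (enabled, port) from a config mapping and CLI arguments."""
    args = build_parser().parse_args(argv)
    # Assumed layout: {"webserver": {"port": <int>}}; both keys are optional.
    port = global_config.get("webserver", {}).get("port", DEFAULT_PORT)
    return (not args.no_webserver, int(port))
```

A test run would pass `["--no-webserver"]` so that no server is started alongside the tests, which is the situation the last commit message above addresses.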
@lgtm-com

lgtm-com bot commented Nov 3, 2021

This pull request introduces 1 alert when merging 6991d17 into 4f920dc - view on LGTM.com

new alerts:

  • 1 for Mismatch between signature and use of an overriding method

@shreyb
Contributor Author

shreyb commented Nov 3, 2021

When I've resolved the rebase errors, I'll squash those down to one commit and re-push this branch.

Errors fixed: transform.name --> worker.name; transform timestamp should use set_to_current_time like the others; accidentally brought along a couple of old methods that were deprecated
@shreyb
Contributor Author

shreyb commented Nov 3, 2021

> When I've resolved the rebase errors, I'll squash those down to one commit and re-push this branch.

Done.

@shreyb
Contributor Author

shreyb commented Nov 9, 2021

Would someone be able to take a look at this by any chance? This PR has been open for a while, and I've had to rebase twice since it's been ready for review, which, on a larger PR like this, is risky. Thanks!

@jcpunk
Contributor

jcpunk commented Nov 9, 2021

I don't see any obvious show stoppers.

Contributor

@vitodb vitodb left a comment

Looking at the code and test results, all seems to be good.
I'm approving this PR, also based on feedback from Pat, who has had a close look at this PR since the beginning.

@mambelli mambelli merged commit 91f7a76 into HEPCloud:master Nov 11, 2021

6 participants