Add optional prometheus metrics #109

rklaehn · 2021-07-14T07:21:49Z

I had a hard time figuring out a set of metrics that is not overwhelming, but this seems to work reasonably well. It is not very fine-grained, but it gives you a rough idea of what the thing is doing overall.

E.g. running the pond integration test with 8 local linux hosts yields the following metrics (dumping more frequently so I see them):

AX_CI_HOSTS=8linux.yaml npm test pond | grep -A 300 got_metrics | grep -A 1 time_sum

...

2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_block_get_time_sum 11.24307093500005
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_block_get_time_count 48033
--
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_block_put_time_sum 1.7594724890000006
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_block_put_time_count 1559
--
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_branch_load_time_sum 2.9276649789999785
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_branch_load_time_count 8291
--
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_branch_store_time_sum 1.3235776810000015
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_branch_store_time_count 1037
--
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_leaf_load_time_sum 5.6611070490000515
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_leaf_load_time_count 34957
--
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_leaf_store_time_sum 0.6494339429999996
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_leaf_store_time_count 522

So it spends most of the db time getting blocks, and within banyan it spends most of its time loading leaves (this metric is the total time including decryption, decompression and decoding).
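The totals above can also be turned into a mean time per call by dividing each `_sum` by its `_count`. The following is a minimal sketch, assuming the `_sum` values are in seconds (the usual Prometheus convention; the log output itself does not state a unit). The helper `mean_seconds` is a hypothetical name for illustration and is not part of banyan.

```rust
/// Mean duration in seconds per observation, computed from a histogram's
/// `_sum` and `_count` series. Returns `None` if nothing was observed.
/// Hypothetical helper, not part of banyan or the prometheus crate.
fn mean_seconds(sum: f64, count: u64) -> Option<f64> {
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}

fn main() {
    // (metric, `_sum`, `_count`) copied from the log output above.
    // Assumption: the sums are in seconds.
    let samples = [
        ("banyan_block_get_time", 11.24307093500005, 48033u64),
        ("banyan_block_put_time", 1.7594724890000006, 1559),
        ("banyan_branch_load_time", 2.9276649789999785, 8291),
        ("banyan_branch_store_time", 1.3235776810000015, 1037),
        ("banyan_leaf_load_time", 5.6611070490000515, 34957),
        ("banyan_leaf_store_time", 0.6494339429999996, 522),
    ];

    for &(name, sum, count) in samples.iter() {
        if let Some(mean) = mean_seconds(sum, count) {
            // Print the mean in microseconds for readability.
            println!("{}: {:.1} µs per call", name, mean * 1e6);
        }
    }
}
```

Looking at means as well as totals helps separate operations that are slow per call from operations that are merely called often.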

We are not going to optimize things just now, but if we wanted to, this would give us a good idea of where to start.

dvc94ch

LGTM

rklaehn · 2021-07-14T12:28:16Z

@rkuhn given the metrics discussion, I hope you agree that this is still in principle the right thing to do. I don't really care that much whether this ever ends up in a proper prometheus graph; I just use it as a tool to get a rough idea of what the thing is doing, see above.

Note: We have kinda standardized on prometheus as the tool for gathering metrics in the ipfs-embed ecosystem and below, see https://github.com/ipfs-rust/ipfs-embed/blob/master/src/telemetry.rs

wngr

You can also use weight-cache's metrics feature, see https://github.com/Actyx/weight-cache/blob/master/Cargo.toml#L20-L22

rkuhn · 2021-07-15T18:07:00Z

So the approach would be to pass the default registry to Banyan to get the metrics of ipfs-embed as well?

Overall looks reasonable, but I haven’t yet got my hands dirty with actually using the prometheus gathering functionality.

rklaehn · 2021-07-16T08:43:49Z

> So the approach would be to pass the default registry to Banyan to get the metrics of ipfs-embed as well?

We already have ipfs-embed metrics. You would just pass the default registry to this register as well. Which is what I did to get the numbers above, in addition to a bunch of one-off changes in cosmos to get the emission frequency up...

Not sure what the overhead of registering multiple registries is. If it is low, we could have 2 registries: one with just the essential stuff that goes via ephemeral events to troubleshoot in case of serious problems, and one with everything and the kitchen sink that can only be locally polled by prometheus.
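The two-registry idea can be sketched with std-only stand-ins. `Counter` and `Registry` below are hypothetical types written for this illustration, not the prometheus crate's API. The sketch assumes that a metric is a cheaply clonable handle to shared state, so registering it twice only duplicates the handle. It says nothing about the actual overhead in the prometheus crate, which would still need to be checked.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a metric: a clonable handle to one shared value.
#[derive(Clone, Default)]
struct Counter(Arc<AtomicU64>);

impl Counter {
    fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical stand-in for a registry: a named collection of handles.
#[derive(Default)]
struct Registry {
    counters: BTreeMap<String, Counter>,
}

impl Registry {
    fn register(&mut self, name: &str, counter: &Counter) {
        // Only the handle is cloned; the underlying value stays shared.
        self.counters.insert(name.to_string(), counter.clone());
    }

    fn gather(&self) -> Vec<(String, u64)> {
        self.counters
            .iter()
            .map(|(name, counter)| (name.clone(), counter.get()))
            .collect()
    }
}

fn main() {
    let block_gets = Counter::default();
    let leaf_loads = Counter::default();

    // One small registry for troubleshooting, one with everything.
    let mut essential = Registry::default();
    let mut everything = Registry::default();
    essential.register("block_get_total", &block_gets);
    everything.register("block_get_total", &block_gets);
    everything.register("leaf_load_total", &leaf_loads);

    // The hot path updates each metric once, regardless of registry count.
    block_gets.inc();
    leaf_loads.inc();
    leaf_loads.inc();

    println!("essential:  {:?}", essential.gather());
    println!("everything: {:?}", everything.gather());
}
```

In this model the cost of a second registry is paid at registration and at gather time, not on every metric update.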

rkuhn · 2021-07-16T08:52:20Z

It seems to me that the ecosystem has already made a different choice: since some libraries already use the default registry, we are locked into using that as well (which is what I meant above). A secondary registry wouldn’t make sense to me.

rklaehn added 4 commits July 13, 2021 12:00

Add simple prometheus metrics

6d6048d

Add optional metrics feature that uses prometheus

6e4203b

Refactor histogram metrics tracking

75e3968

...and remove a few manual timings

Add additional metrics for the size of written and read blocks

a551a1b

The block related metrics will probably be removed again once we have prometheus metrics in the ipfs-sqlite-block-store.

rklaehn requested review from wngr and dvc94ch July 14, 2021 07:33

rklaehn added 2 commits July 14, 2021 11:18

Remove unrelated change

d5dedc4

This was to port the take_until_condition bugfix to stream_trees_chunked_threaded, but has nothing to do with metrics

Make lazy_static optional as well

6415e89

dvc94ch approved these changes Jul 14, 2021


rklaehn requested a review from rkuhn July 14, 2021 12:26

rklaehn requested a review from aruediger July 14, 2021 13:35

aruediger approved these changes Jul 14, 2021


rklaehn merged commit 9a8da9c into master Jul 14, 2021

rklaehn deleted the better-instrumentation branch July 14, 2021 14:31

wngr reviewed Jul 15, 2021

