Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add optional prometheus metrics #109

Merged
merged 6 commits into from
Jul 14, 2021
Merged

Add optional prometheus metrics #109

merged 6 commits into from
Jul 14, 2021

Conversation

rklaehn
Copy link
Contributor

@rklaehn rklaehn commented Jul 14, 2021

I had a hard time figuring out a set of metrics that is not overwhelming, but this seems to work reasonably well. It is not very fine grained, but it gives you a rough idea what the thing is doing overall.

E.g. this is what you get when running the pond integration test with 8 local linux hosts yields the following metrics (dumping more frequently so I see them):

AX_CI_HOSTS=8linux.yaml npm test pond | grep -A 300 got_metrics | grep -A 1 time_sum

...

2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_block_get_time_sum 11.24307093500005
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_block_get_time_count 48033
--
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_block_put_time_sum 1.7594724890000006
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_block_put_time_count 1559
--
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_branch_load_time_sum 2.9276649789999785
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_branch_load_time_count 8291
--
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_branch_store_time_sum 1.3235776810000015
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_branch_store_time_count 1037
--
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_leaf_load_time_sum 5.6611070490000515
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_leaf_load_time_count 34957
--
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_leaf_store_time_sum 0.6494339429999996
2021-07-14T07:17:13.669Z node local-linux-8 Actyx stdout: banyan_leaf_store_time_count 522

so it spends most of the db time getting blocks, and focused on banyan it spends most of the time loading leafs (this metric is the total time including decryption, decompression and decoding).

We are not going to optimize things just now, but if we wanted to this would give us a good idea where to start.

...and remove a few manual timings
The block related metrics will probably be removed again once we have
prometheus metrics in the ipfs-sqlite-block-store.
@rklaehn rklaehn requested review from wngr and dvc94ch July 14, 2021 07:33
This was to port the take_until_condition bugfix to stream_trees_chunked_threaded, but has nothing to do with metrics
Copy link
Contributor

@dvc94ch dvc94ch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rklaehn rklaehn requested a review from rkuhn July 14, 2021 12:26
@rklaehn
Copy link
Contributor Author

rklaehn commented Jul 14, 2021

@rkuhn given the metrics discussion, I hope you agree that this is still in principle the right thing to do. I don't really care that much if this ever ends up in a proper prometheus graph, I just use it as a tool to get a rough idea what the thing is doing, see above.

Note: We have kinda standardized on prometheus as a tool for gathering metrics in ipfs-embed and below ecosystem, see https://github.com/ipfs-rust/ipfs-embed/blob/master/src/telemetry.rs .

@rklaehn rklaehn requested a review from aruediger July 14, 2021 13:35
@rklaehn rklaehn merged commit 9a8da9c into master Jul 14, 2021
@rklaehn rklaehn deleted the better-instrumentation branch July 14, 2021 14:31
Copy link
Contributor

@wngr wngr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also use the weight-caches metrics feature, see https://github.com/Actyx/weight-cache/blob/master/Cargo.toml#L20-L22

@rkuhn
Copy link
Member

rkuhn commented Jul 15, 2021

So the approach would be to pass the default registry to Banyan to get the metrics of ipfs-embed as well?

Overall looks reasonable, but I haven’t yet got my hands dirty actually using prometheus gathering functionality.

@rklaehn
Copy link
Contributor Author

rklaehn commented Jul 16, 2021

So the approach would be to pass the default registry to Banyan to get the metrics of ipfs-embed as well?

We already have ipfs-embed metrics. You would just pass the default registry to this register as well. Which is what I did to get the numbers above, in addition to a bunch of one-off changes in cosmos to get the emission frequency up...

Not sure what is the overhead of registering multiple registries. If it is low overhead, we could have 2 registries, one with just the essential stuff that goes via ephemeral events to troubleshoot in case of serious problems, and one with everything and the kitchen sink that can only be locally polled by prometheus.

@rkuhn
Copy link
Member

rkuhn commented Jul 16, 2021

It seems to me that the ecosystem has already made a different choice: since some libraries already use the default registry, we are locked into using that as well (which is what I meant above). A secondary registry wouldn’t make sense to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants