Add optional prometheus metrics #109
...and remove a few manual timings
The block related metrics will probably be removed again once we have prometheus metrics in the ipfs-sqlite-block-store.
This was to port the take_until_condition bugfix to stream_trees_chunked_threaded, but has nothing to do with metrics
LGTM
@rkuhn given the metrics discussion, I hope you agree that this is still in principle the right thing to do. I don't really care that much if this ever ends up in a proper prometheus graph, I just use it as a tool to get a rough idea what the thing is doing, see above. Note: We have kinda standardized on prometheus as a tool for gathering metrics in ipfs-embed and below ecosystem, see https://github.com/ipfs-rust/ipfs-embed/blob/master/src/telemetry.rs .
You can also use weight-cache's metrics feature, see https://github.com/Actyx/weight-cache/blob/master/Cargo.toml#L20-L22
So the approach would be to pass the default registry to Banyan to get the metrics of ipfs-embed as well? Overall looks reasonable, but I haven’t yet got my hands dirty actually using prometheus gathering functionality.
We already have ipfs-embed metrics. You would just pass the default registry to this register as well. Which is what I did to get the numbers above, in addition to a bunch of one-off changes in cosmos to get the emission frequency up... Not sure what the overhead of registering multiple registries is. If it is low, we could have 2 registries: one with just the essential stuff that goes via ephemeral events to troubleshoot in case of serious problems, and one with everything and the kitchen sink that can only be locally polled by prometheus.
It seems to me that the ecosystem has already made a different choice: since some libraries already use the default registry, we are locked into using that as well (which is what I meant above). A secondary registry wouldn’t make sense to me.
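To make the shared-registry idea above concrete, here is a minimal sketch in plain Rust. It deliberately does not use the prometheus crate: `ToyRegistry`, `register_store_metrics`, `register_tree_metrics` and the metric names are hypothetical stand-ins, and the only point is to show several libraries registering into one registry handle so that a single gather sees everything.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Toy stand-in for a metrics registry.
/// Hypothetical type for illustration; NOT the prometheus crate API.
#[derive(Clone, Default)]
pub struct ToyRegistry {
    counters: Arc<Mutex<BTreeMap<String, Arc<AtomicU64>>>>,
}

impl ToyRegistry {
    /// Register a named counter. Fails if the name is already taken,
    /// which is the kind of conflict a shared registry has to surface.
    pub fn register(&self, name: &str) -> Result<Arc<AtomicU64>, String> {
        let mut map = self.counters.lock().unwrap();
        if map.contains_key(name) {
            return Err(format!("duplicate metric: {}", name));
        }
        let counter = Arc::new(AtomicU64::new(0));
        map.insert(name.to_string(), counter.clone());
        Ok(counter)
    }

    /// Snapshot of all counters, sorted by name.
    pub fn gather(&self) -> Vec<(String, u64)> {
        let map = self.counters.lock().unwrap();
        map.iter()
            .map(|(name, c)| (name.clone(), c.load(Ordering::Relaxed)))
            .collect()
    }
}

/// Hypothetical "block store" library registering its metric.
pub fn register_store_metrics(r: &ToyRegistry) -> Result<Arc<AtomicU64>, String> {
    r.register("store_block_get_total")
}

/// Hypothetical "tree" library registering into the same registry.
pub fn register_tree_metrics(r: &ToyRegistry) -> Result<Arc<AtomicU64>, String> {
    r.register("tree_leaf_load_total")
}
```

With one registry passed to both functions, `gather()` returns the counters of both libraries in one snapshot; a second registry would need its own gather and its own export path.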
I had a hard time figuring out a set of metrics that is not overwhelming, but this seems to work reasonably well. It is not very fine grained, but it gives you a rough idea what the thing is doing overall.
E.g. running the pond integration test with 8 local Linux hosts yields the following metrics (dumping more frequently so I can see them):
So it spends most of the db time getting blocks, and, focusing on banyan, it spends most of the time loading leaves (this metric is the total time including decryption, decompression and decoding).
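A coarse "total time" metric of this kind can be sketched with the standard library alone. The helper below is an assumption for illustration, not the code in this PR: `TimeTotal`, `load_leaf` and the three placeholder stages are hypothetical names, and a real setup would feed the measured duration into a prometheus metric instead of atomics.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Accumulates total elapsed time and number of calls.
/// Hypothetical helper, standing in for a real metrics type.
#[derive(Default)]
pub struct TimeTotal {
    nanos: AtomicU64,
    count: AtomicU64,
}

impl TimeTotal {
    /// Run `f`, record how long it took, and return its result.
    pub fn time<T>(&self, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.record(start.elapsed());
        out
    }

    /// Add one observation to the running totals.
    pub fn record(&self, d: Duration) {
        self.nanos.fetch_add(d.as_nanos() as u64, Ordering::Relaxed);
        self.count.fetch_add(1, Ordering::Relaxed);
    }

    pub fn total(&self) -> Duration {
        Duration::from_nanos(self.nanos.load(Ordering::Relaxed))
    }

    pub fn count(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }
}

// Placeholder stages: each just copies its input.
fn decrypt(bytes: &[u8]) -> Vec<u8> { bytes.to_vec() }
fn decompress(bytes: &[u8]) -> Vec<u8> { bytes.to_vec() }
fn decode(bytes: &[u8]) -> Vec<u8> { bytes.to_vec() }

/// Hypothetical leaf load: one timer wraps all three stages, so the
/// metric is the total, not a per-stage breakdown.
pub fn load_leaf(metric: &TimeTotal, raw: &[u8]) -> Vec<u8> {
    metric.time(|| {
        let decrypted = decrypt(raw);
        let decompressed = decompress(&decrypted);
        decode(&decompressed)
    })
}
```

Wrapping all stages in a single timer keeps the number of metrics small, at the cost of not showing which stage dominates.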
We are not going to optimize things just now, but if we wanted to, this would give us a good idea where to start.