Add metrics and RT monitoring #266

alonkashtan · 2019-11-26T14:51:32Z

Hi,
I would like to add metrics to enable real-time (RT) monitoring, so that automated and manual tests on remote servers can verify that traffic is actually passing through toxiproxy, verify that it stops, see clients connect and disconnect, and so on.
I've been experimenting with the code and got something working, and would like to consult before I create a PR.

The functionality I added is as follows:

Added a /metrics endpoint that returns a map with the total number of messages per proxy.
Added an /events endpoint that returns a list of events (proxy, client, upstream, event type and timestamp) from up to 20 minutes ago. A token is returned with each response and can be passed in the next request to receive only unseen events.

I did this by creating a metrics module with a RegisterEvent method. In proxy.go I report when a client connects, and in link.go I report when a client disconnects.
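To make the proposal concrete, here is a minimal sketch of what such a module could look like. It is not the code from the experiment described above: only the name RegisterEvent, the event fields and the 20-minute window come from this comment, while `EventLog`, `Since`, `Counts`, `DefaultMaxAge` and the use of a sequence number as the token are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// DefaultMaxAge mirrors the 20-minute window mentioned above.
const DefaultMaxAge = 20 * time.Minute

// Event is a hypothetical record with the fields listed in the proposal.
type Event struct {
	Seq       uint64 // monotonically increasing; doubles as the "token"
	Proxy     string
	Client    string
	Upstream  string
	Type      string
	Timestamp time.Time
}

// EventLog keeps recent events in memory and a running count per proxy.
type EventLog struct {
	mu      sync.Mutex
	events  []Event
	nextSeq uint64
	maxAge  time.Duration
	counts  map[string]uint64
	now     func() time.Time // replaceable clock
}

func NewEventLog(maxAge time.Duration) *EventLog {
	return &EventLog{maxAge: maxAge, counts: make(map[string]uint64), now: time.Now}
}

// RegisterEvent records one event and drops events older than maxAge.
func (l *EventLog) RegisterEvent(proxy, client, upstream, typ string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := l.now()
	l.nextSeq++
	l.events = append(l.events, Event{
		Seq: l.nextSeq, Proxy: proxy, Client: client,
		Upstream: upstream, Type: typ, Timestamp: now,
	})
	l.counts[proxy]++
	cutoff := now.Add(-l.maxAge)
	i := 0
	for i < len(l.events) && l.events[i].Timestamp.Before(cutoff) {
		i++
	}
	l.events = l.events[i:]
}

// Since returns the retained events newer than token, plus the token to
// send with the next request. A token of 0 returns everything retained.
func (l *EventLog) Since(token uint64) ([]Event, uint64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	var out []Event
	for _, e := range l.events {
		if e.Seq > token {
			out = append(out, e)
		}
	}
	return out, l.nextSeq
}

// Counts returns a copy of the per-proxy totals (what /metrics could serve).
func (l *EventLog) Counts() map[string]uint64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	out := make(map[string]uint64, len(l.counts))
	for k, v := range l.counts {
		out[k] = v
	}
	return out
}

func main() {
	el := NewEventLog(DefaultMaxAge)
	el.RegisterEvent("db", "10.0.0.1:5000", "db:5432", "connect")
	el.RegisterEvent("db", "10.0.0.1:5000", "db:5432", "disconnect")
	events, token := el.Since(0)
	fmt.Println(len(events), token) // 2 2
	el.RegisterEvent("cache", "10.0.0.2:6000", "cache:6379", "connect")
	events, token = el.Since(token)
	fmt.Println(len(events), token) // 1 3
}
```

Using a sequence number rather than a timestamp as the token avoids missing events that share a timestamp; the HTTP handlers for /metrics and /events would only need to serialize `Counts()` and `Since(token)`.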

Things got trickier when trying to report messages. I used the fact that the noop toxic is always present and reported from there. The problem is that the noop toxic has no context about the link it is working on, so I had to add ProxyName and Upstream to NoopProxy, and pass this information through the places where it is created, namely Link and ToxicStub.

Does this make sense? In addition, I wasn't able to tell in NoopProxy which client sent a specific message. Any idea how I can do that?

Thanks!


xthexder · 2019-11-26T21:06:48Z

Metrics would definitely be a useful addition, and is something I've thought about in the past, but never got around to implementing anything.

The only issue I see with your proposal is, how are messages defined?
TCP operates as a continuous data stream, and both the network and certain toxics, such as slicer, will end up splitting that stream differently each time.
Bytes transmitted / received is something we could measure instead, though I'm not sure if that works for your use-case.
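As a rough illustration of why bytes are a more stable measure than messages, the sketch below wraps an `io.Reader` and counts bytes as they pass through. `CountingReader` and `drain` are hypothetical names, and this is plain standard-library Go rather than toxiproxy's own stream types.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"sync/atomic"
)

// CountingReader is a hypothetical wrapper that counts bytes read through it.
type CountingReader struct {
	R     io.Reader
	total int64
}

func (c *CountingReader) Read(p []byte) (int, error) {
	n, err := c.R.Read(p)
	atomic.AddInt64(&c.total, int64(n))
	return n, err
}

// Bytes returns the number of bytes read so far; safe for concurrent use.
func (c *CountingReader) Bytes() int64 { return atomic.LoadInt64(&c.total) }

// drain reads payload in chunks of at most bufSize bytes and reports the
// byte total and the number of non-empty reads.
func drain(payload string, bufSize int) (total int64, reads int) {
	cr := &CountingReader{R: strings.NewReader(payload)}
	buf := make([]byte, bufSize)
	for {
		n, err := cr.Read(buf)
		if n > 0 {
			reads++
		}
		if err != nil { // io.EOF once the payload is exhausted
			break
		}
	}
	return cr.Bytes(), reads
}

func main() {
	fmt.Println(drain("hello world", 4))  // 11 3
	fmt.Println(drain("hello world", 64)) // 11 1
}
```

The number of reads depends on how the stream happens to be chunked, while the byte total stays the same.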

For storing / reporting stats, the concept of stateful toxics was implemented for scenarios similar to this.
Toxics themselves are just a function definition, and are re-used for each connection, so to get around this, a separate state object is created per-link.
You can look at some of the code for this in link.go, and read the docs on stateful toxics.
Some modification of link.go will still be required to get the metrics out of the state object, though.
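The per-link state idea can be sketched as follows. The types here (`chunk`, `stub`, `byteCounter`, `byteCountState`, `runLink`) are simplified stand-ins written for this example, not toxiproxy's real types: they only mimic the shape described above, where the toxic itself holds no data and a fresh state object is created for each link. Interrupt handling and everything else a real toxic needs is left out.

```go
package main

import "fmt"

// chunk stands in for a piece of the TCP stream.
type chunk struct{ Data []byte }

// stub stands in for the per-link object a toxic is run against.
type stub struct {
	Input  <-chan *chunk
	Output chan<- *chunk
	State  interface{} // per-link state created by NewState
}

// byteCountState is the per-link state: one instance per connection.
type byteCountState struct{ Bytes int64 }

// byteCounter is stateless and can be shared by all links.
type byteCounter struct{}

// NewState creates a fresh state object for one link.
func (byteCounter) NewState() interface{} { return &byteCountState{} }

// Pipe forwards chunks unchanged and records their size in the link state.
func (byteCounter) Pipe(s *stub) {
	st := s.State.(*byteCountState)
	for c := range s.Input {
		st.Bytes += int64(len(c.Data))
		s.Output <- c
	}
	close(s.Output)
}

// runLink simulates one link: it feeds payloads through the toxic and then
// reads the metric back out of the state object, which is the step that
// would need support in link.go.
func runLink(payloads []string) int64 {
	in := make(chan *chunk)
	out := make(chan *chunk, len(payloads)) // buffered so Pipe never blocks
	t := byteCounter{}
	s := &stub{Input: in, Output: out, State: t.NewState()}
	done := make(chan struct{})
	go func() {
		t.Pipe(s)
		close(done)
	}()
	for _, p := range payloads {
		in <- &chunk{Data: []byte(p)}
	}
	close(in)
	<-done // Pipe has returned, so reading the state is race-free
	return s.State.(*byteCountState).Bytes
}

func main() {
	fmt.Println(runLink([]string{"abc", "de"})) // 5
	fmt.Println(runLink([]string{"x"}))         // 1
}
```

Each call to `runLink` gets its own state, so counts from different connections do not mix even though the same toxic value is reused.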

alonkashtan · 2019-11-27T07:42:49Z

Thank you.

I have been thinking about making a stateful toxic (I didn't know the concept already existed). My struggle with it is how to bring the context data to the toxic in the first place. It means either making link.go, toxic_collection.go and toxic.go aware of this special toxic implementation (as they are aware of noop), or finding another way to initialize it.

The other option is to continue down the path I took and monitor all the links through noop. I would prefer to do it through the link or the proxy instead of hacking noop, but since at creation they just pass the channel directly to the toxic chain, they are not aware of the data passing through.

About the definition of a message, you are right - I guess what I'm measuring is actually packets. I'm fine with that for the sake of keeping toxiproxy unaware of protocols above TCP. The idea of bytes received is good. I made a fork of toxiproxy-frontend and added a graph that shows the metrics in real time, grouping events that are close together if there are too many. Using bytes received as the group size instead of the number of packets will be more meaningful.
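For illustration, grouping close events by bytes received could look like the sketch below. It is not the code from the toxiproxy-frontend fork; `Sample`, `Bucket`, `GroupByWindow` and the fixed-window approach are assumptions, and the input is assumed to be sorted by time with non-negative timestamps and a positive window.

```go
package main

import "fmt"

// Sample is one reported event: when it happened and how many bytes it carried.
type Sample struct {
	UnixMilli int64
	Bytes     int64
}

// Bucket aggregates all samples that fall in the same time window.
type Bucket struct {
	StartMilli int64
	Bytes      int64 // group size measured in bytes
	Events     int   // number of samples merged into this bucket
}

// GroupByWindow merges time-sorted samples into fixed windows of windowMilli.
func GroupByWindow(samples []Sample, windowMilli int64) []Bucket {
	var out []Bucket
	for _, s := range samples {
		start := s.UnixMilli - s.UnixMilli%windowMilli
		if n := len(out); n > 0 && out[n-1].StartMilli == start {
			out[n-1].Bytes += s.Bytes
			out[n-1].Events++
			continue
		}
		out = append(out, Bucket{StartMilli: start, Bytes: s.Bytes, Events: 1})
	}
	return out
}

func main() {
	samples := []Sample{{1000, 10}, {1200, 20}, {2500, 5}}
	fmt.Println(GroupByWindow(samples, 1000)) // [{1000 30 2} {2000 5 1}]
}
```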

Any idea how I can know which client sent a specific packet? It could also be useful to show which toxics were active for a specific packet, especially for toxics that are applied only with some probability. Any idea how I can know that?

alonkashtan · 2020-01-14T12:16:54Z

Any more comments, anyone? Before I open a PR?

neufeldtech linked a pull request Feb 25, 2022 that will close this issue

Add metrics collection and endpoint #284

Open

neufeldtech mentioned this issue Feb 25, 2022

Feature Request - Prometheus/OpenMetrics endpoint #365

Closed
