Add metrics and RT monitoring #266

Open
alonkashtan opened this issue Nov 26, 2019 · 3 comments · May be fixed by #284


@alonkashtan

Hi,
I would like to add metrics to enable real-time (RT) monitoring, so that automatic and manual tests on remote servers can verify that traffic is actually passing through toxiproxy, verify that it stops, see clients connect and disconnect, and so on.
I've been experimenting with the code and got something working, and would like to consult before I create a PR.

The functionality I added is as follows:

  • Added a /metrics endpoint that returns a map of the total number of messages per proxy.
  • Added an /events endpoint that returns a list of events (proxy, client, upstream, event type and timestamp) from up to the last 20 minutes. A token is returned with each response and can be sent with the next request to receive only unseen events.
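
To make the token behaviour of /events concrete, here is a minimal sketch. It assumes the token is simply the highest sequence number the caller has seen and that it is passed back as a `since` query parameter; the `Event` fields, the `since` name and the JSON shape are all assumptions for illustration, not what the actual patch does.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// Event is a hypothetical record; field names are assumptions.
type Event struct {
	Seq       int64  `json:"seq"`
	Proxy     string `json:"proxy"`
	Client    string `json:"client"`
	Upstream  string `json:"upstream"`
	Type      string `json:"type"`
	Timestamp int64  `json:"timestamp"` // Unix seconds
}

// EventsResponse pairs the unseen events with the token for the next request.
type EventsResponse struct {
	Events []Event `json:"events"`
	Token  int64   `json:"token"`
}

// eventsSince returns every event whose Seq is greater than token,
// together with the highest Seq seen, which becomes the next token.
func eventsSince(all []Event, token int64) EventsResponse {
	resp := EventsResponse{Events: []Event{}, Token: token}
	for _, e := range all {
		if e.Seq > token {
			resp.Events = append(resp.Events, e)
			if e.Seq > resp.Token {
				resp.Token = e.Seq
			}
		}
	}
	return resp
}

// eventsHandler serves the event list; "since" is an assumed parameter name.
func eventsHandler(all []Event) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var token int64
		if s := r.URL.Query().Get("since"); s != "" {
			t, err := strconv.ParseInt(s, 10, 64)
			if err != nil {
				http.Error(w, "invalid since token", http.StatusBadRequest)
				return
			}
			token = t
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(eventsSince(all, token))
	}
}

func main() {
	events := []Event{
		{Seq: 1, Proxy: "redis", Type: "client_connected"},
		{Seq: 2, Proxy: "redis", Type: "message"},
		{Seq: 3, Proxy: "redis", Type: "client_disconnected"},
	}
	// Exercise the handler in memory, without opening a real socket.
	req := httptest.NewRequest(http.MethodGet, "/events?since=1", nil)
	rec := httptest.NewRecorder()
	eventsHandler(events)(rec, req)

	var body EventsResponse
	if err := json.NewDecoder(rec.Body).Decode(&body); err != nil {
		panic(err)
	}
	fmt.Println(len(body.Events), body.Token) // prints "2 3"
}
```

A client that stores the returned token and sends it back gets an empty list until new events arrive.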

I did this by creating a metrics module with a RegisterEvent method. In proxy.go I report when a client connects, and in link.go I report when a client disconnects.
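
Roughly, the metrics module could look like the sketch below. Only the name RegisterEvent comes from my code; the `Registry` type, the `Event` fields, the string event types and the pruning strategy are simplified assumptions for discussion.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// retentionSeconds mirrors the "up to 20 minutes" window of /events.
const retentionSeconds = 20 * 60

// Event is a hypothetical record; field names are assumptions.
type Event struct {
	Proxy    string
	Client   string
	Upstream string
	Type     string // e.g. "client_connected", "client_disconnected", "message"
	Unix     int64  // event time in Unix seconds
}

// Registry keeps per-proxy message totals and a time-bounded event log.
type Registry struct {
	mu       sync.Mutex
	messages map[string]uint64
	events   []Event
}

func NewRegistry() *Registry {
	return &Registry{messages: make(map[string]uint64)}
}

// RegisterEvent records e and drops events older than the retention
// window, measured from e's own timestamp. It assumes events arrive in
// non-decreasing time order.
func (r *Registry) RegisterEvent(e Event) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if e.Type == "message" {
		r.messages[e.Proxy]++
	}
	r.events = append(r.events, e)
	cutoff := e.Unix - retentionSeconds
	i := 0
	for i < len(r.events) && r.events[i].Unix < cutoff {
		i++
	}
	r.events = r.events[i:]
}

// MessageCounts returns a copy of the per-proxy totals (the /metrics map).
func (r *Registry) MessageCounts() map[string]uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make(map[string]uint64, len(r.messages))
	for k, v := range r.messages {
		out[k] = v
	}
	return out
}

// Events returns a copy of the retained events.
func (r *Registry) Events() []Event {
	r.mu.Lock()
	defer r.mu.Unlock()
	return append([]Event(nil), r.events...)
}

func main() {
	reg := NewRegistry()
	now := time.Now().Unix()
	// In the real code these calls would sit in proxy.go and link.go.
	reg.RegisterEvent(Event{Proxy: "redis", Client: "10.0.0.5:51000", Type: "client_connected", Unix: now})
	reg.RegisterEvent(Event{Proxy: "redis", Client: "10.0.0.5:51000", Type: "message", Unix: now})
	reg.RegisterEvent(Event{Proxy: "redis", Client: "10.0.0.5:51000", Type: "client_disconnected", Unix: now})
	fmt.Println(reg.MessageCounts()["redis"], len(reg.Events()))
}
```

The mutex is there because connects, disconnects and messages are reported from different goroutines.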

Things got trickier when trying to report messages. I used the fact that the noop toxic is always present and report from there. The problem is that the noop toxic has no context about its work, so I had to add ProxyName and Upstream to NoopProxy and pass this information along from where it is created, namely Link and ToxicStub.

Does this make sense? In addition, I wasn't able to tell in NoopProxy which client sent a specific message. Any idea how I can do that?

Thanks!

@xthexder
Contributor

Metrics would definitely be a useful addition. It's something I've thought about in the past, but I never got around to implementing anything.

The only issue I see with your proposal is: how are messages defined?
TCP operates as a continuous data stream, and both the network and certain toxics, such as slicer, will end up breaking that stream up differently each time.
Bytes transmitted / received is something we could measure instead, though I'm not sure if that works for your use case.
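
To illustrate the difference, here is a small sketch that feeds the same payload through two different fragmentations. The `Counter` type and the `split` helper are made up for this example and are not part of toxiproxy; the chunk count changes with the fragmentation while the byte total does not.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Counter tracks how many chunks and how many bytes were observed.
// Atomic operations keep it safe to update from several goroutines.
type Counter struct {
	chunks uint64
	bytes  uint64
}

// Observe records one chunk of stream data.
func (c *Counter) Observe(chunk []byte) {
	atomic.AddUint64(&c.chunks, 1)
	atomic.AddUint64(&c.bytes, uint64(len(chunk)))
}

func (c *Counter) Chunks() uint64 { return atomic.LoadUint64(&c.chunks) }
func (c *Counter) Bytes() uint64  { return atomic.LoadUint64(&c.bytes) }

// split cuts data into pieces of at most size bytes, imitating how the
// network or a slicing toxic may fragment a stream. size must be > 0.
func split(data []byte, size int) [][]byte {
	var out [][]byte
	for len(data) > 0 {
		n := size
		if n > len(data) {
			n = len(data)
		}
		out = append(out, data[:n])
		data = data[n:]
	}
	return out
}

func main() {
	payload := []byte("PING\r\nPING\r\n") // 12 bytes

	for _, size := range []int{12, 5} {
		c := &Counter{}
		for _, chunk := range split(payload, size) {
			c.Observe(chunk)
		}
		fmt.Printf("chunk size %d: chunks=%d bytes=%d\n", size, c.Chunks(), c.Bytes())
	}
	// prints:
	// chunk size 12: chunks=1 bytes=12
	// chunk size 5: chunks=3 bytes=12
}
```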

For storing / reporting stats, the concept of stateful toxics was implemented for scenarios similar to this.
Toxics themselves are just a function definition, and are re-used for each connection, so to get around this, a separate state object is created per-link.
You can look at some of the code for this in link.go, and read the docs on stateful toxics.
Some modification of link.go will still be required to get the metrics out of the state object, though.
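
As a rough illustration of the per-link state idea, here is a self-contained sketch. `Chunk` and `Stub` are simplified local stand-ins written for this example, not toxiproxy's real types, and `CounterToxic` and `runLink` are hypothetical; the real Pipe loop also has to handle interrupts, which is left out here.

```go
package main

import "fmt"

// Chunk and Stub are simplified stand-ins (assumptions), NOT the real
// toxiproxy types. They only model "data flows from Input to Output,
// with a State object attached per link".
type Chunk struct{ Data []byte }

type Stub struct {
	Input  <-chan *Chunk
	Output chan<- *Chunk
	State  interface{}
}

// CounterState is the state object created once per link.
type CounterState struct {
	Chunks uint64
	Bytes  uint64
}

// CounterToxic itself holds no per-connection data, so one value can be
// shared by every link.
type CounterToxic struct{}

// NewState returns fresh state for a new link.
func (t *CounterToxic) NewState() interface{} { return &CounterState{} }

// Pipe forwards chunks unchanged and updates the link's own state.
// The state is only written here; a concurrent reader would need
// atomics or a mutex.
func (t *CounterToxic) Pipe(stub *Stub) {
	state := stub.State.(*CounterState)
	for c := range stub.Input {
		state.Chunks++
		state.Bytes += uint64(len(c.Data))
		stub.Output <- c
	}
	close(stub.Output)
}

// runLink is a hypothetical helper that plays the role of link.go:
// it creates the state, runs the toxic, and reads the state afterwards.
func runLink(t *CounterToxic, payloads [][]byte) *CounterState {
	in := make(chan *Chunk)
	out := make(chan *Chunk, len(payloads)) // buffered so Pipe never blocks
	state := t.NewState()
	stub := &Stub{Input: in, Output: out, State: state}

	done := make(chan struct{})
	go func() {
		t.Pipe(stub)
		close(done)
	}()
	for _, p := range payloads {
		in <- &Chunk{Data: p}
	}
	close(in)
	<-done // Pipe has returned, so reading the state is safe
	return state.(*CounterState)
}

func main() {
	toxic := &CounterToxic{}
	a := runLink(toxic, [][]byte{[]byte("GET "), []byte("/\r\n")})
	b := runLink(toxic, [][]byte{[]byte("PING\r\n")})
	fmt.Printf("link A: chunks=%d bytes=%d\n", a.Chunks, a.Bytes)
	fmt.Printf("link B: chunks=%d bytes=%d\n", b.Chunks, b.Bytes)
}
```

The part that runLink fakes is exactly the missing piece: something in link.go has to hold on to the state object and expose it to whatever serves the metrics.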

@alonkashtan
Author

Thank you.

I have been thinking about making a stateful toxic (I didn't know they already exist). My struggle is how to bring the context data to the toxic in the first place. It means either making link.go, toxic_collection.go and toxic.go aware of this special toxic implementation (as they are aware of noop) or finding another way to initialize it.

The other option is to continue down the path I took and monitor all the links through noop. I would prefer to do it through the link or the proxy instead of hacking noop, but since at creation time they just pass the channel directly to the toxic chain, they are not aware of the data passing through.

About the definition of a message, you are right: I guess what I'm measuring is actually packets. I'm fine with that for the sake of keeping toxiproxy unaware of protocols above TCP. The idea of bytes received is good. I made a fork of toxiproxy-frontend and added a graph that shows the metrics in real time, grouping close events when there are too many. Using bytes received as the group size instead of the number of packets will be more meaningful.
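
For reference, the grouping could be sketched like this. It assumes fixed time windows and samples sorted by time; the `Sample` and `Group` types and the window size are hypothetical, and the frontend fork itself may group differently.

```go
package main

import "fmt"

// Sample is one reported event; field names are assumptions.
type Sample struct {
	UnixMilli int64
	Bytes     int
}

// Group summarises all samples that fall into one time window.
type Group struct {
	StartMilli int64
	Events     int
	Bytes      int // the "group size" shown on the graph
}

// groupByWindow merges consecutive samples that share a window.
// It assumes samples are sorted by time, timestamps are non-negative,
// and windowMilli is greater than zero.
func groupByWindow(samples []Sample, windowMilli int64) []Group {
	var groups []Group
	for _, s := range samples {
		start := s.UnixMilli - s.UnixMilli%windowMilli
		if n := len(groups); n > 0 && groups[n-1].StartMilli == start {
			groups[n-1].Events++
			groups[n-1].Bytes += s.Bytes
			continue
		}
		groups = append(groups, Group{StartMilli: start, Events: 1, Bytes: s.Bytes})
	}
	return groups
}

func main() {
	samples := []Sample{
		{UnixMilli: 0, Bytes: 100},
		{UnixMilli: 40, Bytes: 50},
		{UnixMilli: 990, Bytes: 10},
		{UnixMilli: 1000, Bytes: 200},
	}
	for _, g := range groupByWindow(samples, 1000) {
		fmt.Printf("window at %d ms: %d events, %d bytes\n", g.StartMilli, g.Events, g.Bytes)
	}
	// prints:
	// window at 0 ms: 3 events, 160 bytes
	// window at 1000 ms: 1 events, 200 bytes
}
```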

Any idea how I can know which client sent a specific packet? It could also be useful to show which toxics were active for a specific packet, especially when they are applied only with some probability. Any idea how I can know that?

@alonkashtan
Author

Any more comments, anyone? Before I open a PR?
