
Rfc/node status backend #9910

Merged: 22 commits from rfc/node-status-backend into compatible on Feb 28, 2022

Conversation

Member

@ghost-not-in-the-shell commented Dec 13, 2021

Explain your changes:
Add an RFC for the node status/error system: https://github.com/MinaProtocol/mina/blob/8df9bd805d25b2f5d864b62c1c4373cb89646527/rfcs/0044-node-status-and-node-error-backend.md

Explain how you tested your changes:
No test is needed.

Checklist:

  • Modified the current draft of release notes with details on what is completed or incomplete within this project
  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? List them

@ghost-not-in-the-shell changed the base branch from develop to compatible on December 13, 2021 at 20:04
@joseandro
Member

Just to be clear, this same backend would be applied to both node status and node error collection systems, right?

Member

@bkase left a comment

Awesome, I'm so glad we're finally figuring this system out!

I'd love to see a lot more detail here around why you are proposing we go with the AWS stack over the other two alternatives and more detail around what an implementation would look like.

If you want, you could break up the work into two parts -- first getting feedback on the rationale around the choice with only minor details on implementation to get the point across just in case there is disagreement there. Afterwards, we can follow up with the implementation details once we're sure there is alignment on choice.


[implementation]:#implementation

The implementation is mainly to setup a miniservice under a domain like https://minaprotocol.com/node-status-service that would validate the reports sent by the users and maybe setup some kind of DOS protection (help would be needed to design and implement this part) and then send those to the AWS Kinesis stream.
Member

I would recommend using a subdomain, like "node-status.minaprotocol.com"

Can you change the language to be more decisive here? I think the intent of the RFC is to understand what the plan is before implementation. So rather than saying "setup a miniservice under a domain like https://..." change to "a microservice will be deployed at https://node-status.minaprotocol.com".

Same with "maybe setup some kind of DOS protection": This RFC should make a definite proposal one way or another. "We do not need to implement DOS protection because x" or "We are implementing DOS protection and here are details on how it will work"

Member

Also, can you include more detail around the rationale here? Why are we using the AWS stack over the other choices here?

Reasoning should speak to all of the requirements in the node-status / node-error product reqs so we can be sure that the implementation will fulfill our needs

Member

I think there's supposed to be a "Rationale and Alternatives" section that would be a good place for the reasoning


## Implementation

[implementation]:#implementation
Member

What does the microservice look like? Is it in OCaml? What data does it consume and in what format, what services does it talk to?

Member

How do S3, OpenSearch, Kibana, and the microservice all communicate with each other?

Member Author

The mini-service would push to the Kinesis Firehose data stream, and AWS uses that stream to communicate between OpenSearch, Kibana, and S3.

Member

@joseandro left a comment

Let me know if you need more info from my side; I requested some changes to the architecture of this solution. Basically, I would suggest removing the microservice that acts as a man-in-the-middle, since it introduces a single point of failure to our solution and more maintenance cost down the road.


[implementation]:#implementation

A mini-service will be deployed at https://node-status.minaprotocol.com. The mini-service would do some simple checks against the format of the data that nodes send us. The mini-service would be implemented in Python or JavaScript. It would listen on the designated port, decode the JSON data that peers send, do a simple check to make sure that all the non-optional fields are present, and then call the AWS `PUT` operation to push the record to the AWS Kinesis Firehose data stream. This mini-service would be put in an AWS EC2 container to keep the configuration minimal.
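For illustration only, here is a minimal sketch of what such a validate-and-forward mini-service could look like in JavaScript (Express plus the AWS SDK for JavaScript v3). The route path, port, region, required field names, and delivery stream name are assumptions made for this sketch, not details from the RFC:

```js
// Sketch only: route, port, region, required fields, and stream name are hypothetical.
const express = require("express");
const { FirehoseClient, PutRecordCommand } = require("@aws-sdk/client-firehose");

const app = express();
app.use(express.json({ limit: "1mb" })); // rejects oversized payloads early
const firehose = new FirehoseClient({ region: "us-west-2" });

// Fields assumed to be non-optional in the node status report (illustrative only).
const REQUIRED_FIELDS = ["version", "peer_id", "sync_status"];

app.post("/node-status", async (req, res) => {
  const missing = REQUIRED_FIELDS.filter((f) => !(f in req.body));
  if (missing.length > 0) {
    return res.status(400).json({ error: `missing fields: ${missing.join(", ")}` });
  }
  try {
    // Forward the validated report to the Kinesis Firehose delivery stream.
    await firehose.send(new PutRecordCommand({
      DeliveryStreamName: "mina-node-status",
      Record: { Data: Buffer.from(JSON.stringify(req.body) + "\n") },
    }));
    res.json({ ok: true });
  } catch (err) {
    res.status(502).json({ error: "failed to forward report to Firehose" });
  }
});

app.listen(3000);
```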
Member

Instead of funneling all the requests through a microservice, and therefore introducing a central point of failure, wouldn't it be better to have the client nodes push everything through the Firehose endpoint directly instead? Firehose has the native capability to auto-scale and handle variable data streaming throughput on its own.
https://aws.amazon.com/blogs/aws/amazon-kinesis-firehose-simple-highly-scalable-data-ingestion/

The microservice would also need to be managed by us. Therefore, we would need to handle scaling and possible failures. I think that maintenance cost outweighs the benefits of the microservice's role as described in this proposal.

Member Author

You mean just call `aws firehose put-record <data>` to redirect the data to Firehose?

Member Author

Then we would still need a bash script to do that, right?

Member

Yep! We could use the client-side AWS Firehose SDK for that; here are the available options: https://github.com/aws/aws-sdk
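For context, a direct client-side push like the one being discussed would look roughly like this with the AWS SDK for JavaScript v3. The Mina client itself is OCaml, so this only illustrates the shape of the Firehose call; the stream name, region, and report fields are assumptions:

```js
// Illustration only: a direct put to Kinesis Firehose with no intermediate
// microservice. Stream name, region, and report fields are hypothetical.
const { FirehoseClient, PutRecordCommand } = require("@aws-sdk/client-firehose");

async function sendNodeStatusReport(report) {
  const client = new FirehoseClient({ region: "us-west-2" });
  await client.send(new PutRecordCommand({
    DeliveryStreamName: "mina-node-status",
    Record: { Data: Buffer.from(JSON.stringify(report) + "\n") },
  }));
}

// Hypothetical report payload for demonstration.
sendNodeStatusReport({ version: 1, peer_id: "12D3Koo...", sync_status: "Synced" })
  .catch((err) => console.error("failed to push report:", err));
```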


A mini-service will be deployed at https://node-status.minaprotocol.com. The mini-service would do some simple checks against the format of the data that nodes send us. The mini-service would be implemented in Python or JavaScript. It would listen on the designated port, decode the JSON data that peers send, do a simple check to make sure that all the non-optional fields are present, and then call the AWS `PUT` operation to push the record to the AWS Kinesis Firehose data stream. This mini-service would be put in an AWS EC2 container to keep the configuration minimal.

I only consider simple validity checks against the format of the JSON data here, because any more complicated check would require us to run a node to observe the network conditions.
Member

The validation done by such a microservice will likely be easily bypassed by bad actors trying to spam our logs, so I don't really see the value of it.

Member Author

Agreed.

1 Kinesis Firehose data stream to receive the logs, and
1 S3 storage bucket to store the logs, and
1 OpenSearch service that provides the search ability for the logs, and
1 Kibana service that provides the visualization for the data
Member

Could we make this RFC a bit broader to accommodate the backend requirements for the node error collection backend as well? They are very similar, and I want us to have one common flexible system for both contexts. In such a scenario, I would assume we would basically duplicate the described setup for the error collection backend system; is my understanding correct?

Member Author

Yes. We would just duplicate the setup for the node error system too.

Member

@joseandro left a comment

LGTM! @bkase could you take a look at it again when you have time? I think we changed important pieces of it since you last reviewed it.

@joseandro requested a review from bkase on December 23, 2021 at 17:12
The reason that we chose the AWS stack for our backend is that:
0. The requirement for the node status collection and node error collection systems is a backend that provides storage for the data and an easy way to search and visualize it. The AWS stack fulfills these requirements.

1. The AWS stack is robust and easy to use. It has built-in DOS protection. We have team members who have used the AWS stack before, which means some of the team is already familiar with it. And it seems to be the cheapest choice for us.
Member

Have we also considered the Google Cloud equivalent of this stack? I think a few years ago we moved most, if not all, of our infra from AWS to Google Cloud. That may simplify management rather than having our system split between AWS and Google Cloud. (Having said that, I'm not up to date; do we have services that rely on AWS at the moment?)

Member Author

I can do some research on that. The closest thing would be the Google Cloud Logging service, I guess.


As shown in the diagram above, we would set up a public Kinesis Firehose data stream that the Mina client can push to. This Kinesis data stream would be connected to the S3 bucket, OpenSearch, and Kibana. We just won't use Splunk and Redshift.

This means we would modify the Mina client to add the AWS SDK to do the push. There would be no server maintained by us at the backend.
Member

It's great that we don't need to maintain another service!

Member

Could you include more detail here? Where will we do the push in the source code? Is it part of the logger or in some other module or library? How will data flow from the client.exe into this module to control the opt-in/opt-out status?

Member Author

I just added this. I found there is an OCaml AWS library, but I've never tried it. In the worst case, we would have to add an AWS CLI dependency for the Mina node to push to the Firehose endpoint.

Member

I think the RFC needs to make a decision here, with reasoning, so we can discuss with folks and get alignment. If it's not too hard, we should probably try to do it in OCaml if you ask me. Is there an HTTP API for the service? In the past, adding calls that shell out to other processes from bash, especially when you have to pass data to them, has been regrettable. Though, if it's super difficult to do in OCaml, I suppose we'd need to do it that way.

The other thing to include is detail around where in the code this will live. Part of the Prometheus metrics, you said, right? Does it live in a particular library, or is it just in various places in the source code?

Member Author

I would consider using this library to send requests to the AWS backend: https://github.com/inhabitedtype/ocaml-aws. I haven't tested it, but by looking at the API, I think it can handle our simple requirement of sending an AWS PUT request. If this effort fails, I would fall back to using bash.

The frontend of the node status service is already merged. There is a library that collects the data (either from Prometheus or directly from the transition frontier): https://github.com/MinaProtocol/mina/blob/compatible/src/lib/node_status_service/node_status_service.ml


We need a backend for the node status/error systems. Candidates for the backend include the AWS stack, Grafana Loki, and LogDNA.

## Implementation
Member

I think this is still missing details on how we're going to meet some of the requirements in the PRD for node status and node error. Maybe you could argue that this RFC is scoped to the "backend", but actually I think in order to understand the backend it makes sense to holistically think of the end-to-end flow of the data from the "frontend" too. Can we include this information in this RFC?

Here are some that I think need to be referenced/explained (in addition to what I mentioned in the below comments):

  1. How will the Mina client handle the status/error opt-in for the data collection?
  2. The frequency of the data sending
  3. How can the user specify the destination of the logs? (if they want to run their own collection infrastructure)
  4. What are the mechanisms for surfacing all the data? Should we reuse the code we have for prometheus metrics or should it be different?
  5. Where are the places where we will add the exception collection hook to send errors? How will the errors be indexed and queried so that we're able to learn from the issues in our system?

Member Author

  1. In the current implementation, there would be a CLI parameter for the destination URL of the node status report. If the user provides that, then the node status system would send reports there. The same applies to the node error system. I think we need to make some changes there, since now our backend is just the Kinesis Firehose endpoint, so there would be no URL. I would imagine that for this design there would be 2 CLI arguments: one required parameter to trigger the node status/error system, and one optional parameter to specify the destination if the user wants to send it somewhere else.

  2. For the node status system, it sends the status report every 5 slots. For the node error system, it would send a report before it crashes.

  3. Already answered in 1.

  4. In the implementation, I actually reused a lot of prometheus metrics.

  5. For the node error system, I was adding the hook to the handle_crash function. The same function was used to generate the current crash reports for Mina nodes. For more about the node error collection system, you can take a look at this RFC: Rfc/node error collection #9526. I failed to follow the exact requirements; for example, I failed to extract the locations of exceptions, and the backtrace stack is considered part of the exception. Basically I am just reporting the exception from Mina. Do you have any suggestions on how I can do better than that?

Member

As mentioned in Slack, can you add these answers to the actual document itself so we have a record of what our decisions ended up being?

  1. I guess we can provide a bucket name or something for the Kinesis Firehose? I think we should do it with one CLI argument like --share-report-data <destination-bucket> or whatever name we use for Kinesis storage. Then the boolean on/off and the destination are in the same command. I think we shouldn't hardcode our destination by default -- but try to name the Kinesis endpoint "mina-report-data" or something.
  2. Sounds good
  3. Sounds good
  4. Can you mention here how the integration works with the Prometheus metrics?
  5. Oh, I missed the other RFC; I think we should combine them, actually. Sorry I missed the original one -- a few changes: (1) it should be off by default (see the requirements doc https://www.notion.so/minaprotocol/Node-Error-Collection-030f806ede5345beb0089ffc8666f238 ) and (2) we need to adjust it to point to Kinesis now and not a REST URL, right? I think it's important to evaluate the whole thing holistically.

Member Author

  1. I think providing a destination URL would be more flexible than using a destination bucket. Let's say we want to switch to another backend architecture in the future; then we can set up a mini-service under a subdomain of minaprotocol.com (like what I suggested in the RFC before; that part got deleted because @joseandro argued that we can push everything from the frontend directly, which means we don't need this mini-service at all). Another thing is that we are not directly pushing to some bucket; we are pushing to the Kinesis data stream. I am OK with not hardcoding any default value for the destination.

  2. I can find the details in this PR: Feature/node status collection #9413. In short, what I did is extract the measurements from the metrics and fill the relevant fields in the reports with those measurements. There's no magic there.

  3. Yes, by default it's off. Yes, if we decide to go with the AWS stack, then we have to change the frontend a little bit.

@jrwashburn

It may also be helpful to report on status transitions, not just at a fixed interval of 5 slots - e.g. report when the status goes from synced to catchup, etc. That would allow for understanding things like network-wide bootstrap times, etc.

Member

Great -- using a URL sounds good to me 👍 .

@jrwashburn that's a good idea -- @ghost-not-in-the-shell would it be hard to capture those status transitions?

Member Author

Sync status is part of the "node status report". If we are interested in the bootstrap time, then we could add that to the status report directly. Personally I think sending reports every 5 slots is enough for the status report system.

Member

@bkase left a comment

Thanks! Are you certain about going with AWS over Google Cloud or are you still thinking through that?

@ghost-not-in-the-shell
Member Author

Thanks! Are you certain about going with AWS over Google Cloud or are you still thinking through that?

I think we actually prefer Google Cloud over AWS right now. I tried to set up a simple Google Cloud Logging service last Friday. The setup is super easy, but I still don't know how to connect it to Grafana.

Member

@bkase left a comment

A few more small asks but then sounds good ✅


![](res/gcloud_logging.png)

For the frontend, each node would have 2 command-line options to allow it to send node status/error reports to any specific URL: `--node-status-url URL` and `--node-error-url URL`. By default users would send their reports to our backend. They could change the destination by providing their own destination `URL` to the command options.
Member

Could you reword this slightly:

By default users would send their reports to our backend.
By default users won't send any status or error reports, and can opt in by providing the --node-status-url URL and/or --node-error-url URL; our backend would be suggested.

Additionally can we make sure this is configurable from the daemon.json file?

Member Author

Yeah. They are already configurable from the daemon.json file and that's already recorded in the corresponding RFCs.

res.render('error', {error: "Version Mismatch"});
} else if (Buffer.byteLength(req.body, 'utf8') > 1000000) {
res.status(413);
res.render('error', {error: "Payload Too Large"})
Member

Can you add a note to the RFC saying we will set up an alert to detect if this line is being sent?

Member Author

Sure.

@ghost-not-in-the-shell added the ci-build-me label (Add this label to trigger a circle+buildkite build for this branch) on Jan 21, 2022
@ghost-not-in-the-shell merged commit 089cf2b into compatible on Feb 28, 2022
@ghost-not-in-the-shell deleted the rfc/node-status-backend branch on February 28, 2022 at 13:25