
Rfc/node status backend #9910

Merged: 22 commits from rfc/node-status-backend into compatible on Feb 28, 2022

Conversation

Member

@ghost-not-in-the-shell commented Dec 13, 2021

Explain your changes:
Add an RFC for the node status/error system: https://github.com/MinaProtocol/mina/blob/8df9bd805d25b2f5d864b62c1c4373cb89646527/rfcs/0044-node-status-and-node-error-backend.md

Explain how you tested your changes:
No test is needed.

Checklist:

  • Modified the current draft of release notes with details on what is completed or incomplete within this project
  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? List them

@ghost-not-in-the-shell changed the base branch from develop to compatible on December 13, 2021 at 20:04
@joseandro
Member

Just to be clear, this same backend would be applied to both node status and node error collection systems, right?

Member

@bkase left a comment

Awesome, I'm so glad we're finally figuring this system out!

I'd love to see a lot more detail here around why you are proposing we go with the AWS stack over the other two alternatives and more detail around what an implementation would look like.

If you want, you could break up the work into two parts -- first getting feedback on the rationale around the choice with only minor details on implementation to get the point across just in case there is disagreement there. Afterwards, we can follow up with the implementation details once we're sure there is alignment on choice.


[implementation]:#implementation

The implementation is mainly to setup a miniservice under a domain like https://minaprotocol.com/node-status-service that would validate the reports sent by the users and maybe setup some kind of DOS protection (help would be needed to design and implement this part) and then send those to the AWS Kinesis stream.
Member

I would recommend using a subdomain, like "node-status.minaprotocol.com"

Can you change the language to be more decisive here? I think the intent of the RFC is to understand what the plan is before implementation. So rather than saying "setup a miniservice under a domain like https://..." change to "a microservice will be deployed at https://node-status.minaprotocol.com".

Same with "maybe setup some kind of DOS protection": This RFC should make a definite proposal one way or another. "We do not need to implement DOS protection because x" or "We are implementing DOS protection and here are details on how it will work"

Member

Also, can you include more detail around the rationale here? Why are we using the AWS stack over the other choices here?

Reasoning should speak to all of the requirements in the node-status / node-error product reqs so we can be sure that the implementation will fulfill our needs

Member

I think there's supposed to be a "Rationale and Alternatives" section that would be a good place for the reasoning


## Implementation

[implementation]:#implementation
Member

What does the microservice look like? Is it in OCaml? What data does it consume and in what format, what services does it talk to?

Member

How do S3, OpenSearch, Kibana, and the microservice all communicate with each other?

Member Author

The mini-service would push to the Kinesis Firehose data stream, and AWS uses that stream to communicate between OpenSearch, Kibana, and S3.

Member

@joseandro left a comment

Let me know if you need more info from my side; I requested some changes to the architecture of this solution. Basically, I would suggest removing the microservice that acts as a man-in-the-middle, since it introduces a single point of failure to our solution and more maintenance cost down the road.


[implementation]:#implementation

A mini-service will be deployed at https://node-status.minaprotocol.com. The mini-service would do some simple checks against the format of the data that nodes send us. The mini-service would be implemented in Python or JavaScript. It would listen on the designated port, decode the JSON data that peers send, do a simple check to make sure that all the non-optional fields are present, and then call the AWS `PUT` operation to push the record to the AWS Kinesis Firehose data stream. This mini-service would be put in an AWS EC2 container to keep the configuration minimal.
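For illustration only, here is a minimal sketch of what such a validate-and-forward mini-service could look like in JavaScript (Express plus the AWS SDK for JavaScript v3). The route path, port, region, required field names, and delivery stream name are assumptions made for this sketch, not details from the RFC:

```js
// Sketch only: route, port, region, required fields, and stream name are hypothetical.
const express = require("express");
const { FirehoseClient, PutRecordCommand } = require("@aws-sdk/client-firehose");

const app = express();
app.use(express.json({ limit: "1mb" })); // rejects oversized payloads early
const firehose = new FirehoseClient({ region: "us-west-2" });

// Fields assumed to be non-optional in the node status report (illustrative only).
const REQUIRED_FIELDS = ["version", "peer_id", "sync_status"];

app.post("/node-status", async (req, res) => {
  const missing = REQUIRED_FIELDS.filter((f) => !(f in req.body));
  if (missing.length > 0) {
    return res.status(400).json({ error: `missing fields: ${missing.join(", ")}` });
  }
  try {
    // Forward the validated report to the Kinesis Firehose delivery stream.
    await firehose.send(new PutRecordCommand({
      DeliveryStreamName: "mina-node-status",
      Record: { Data: Buffer.from(JSON.stringify(req.body) + "\n") },
    }));
    res.json({ ok: true });
  } catch (err) {
    res.status(502).json({ error: "failed to forward report to Firehose" });
  }
});

app.listen(3000);
```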
Member

Instead of funneling all the requests through a microservice, and therefore introducing a central point of failure, wouldn't it be better to have the client nodes push everything through the Firehose endpoint directly instead? Firehose has the native capability to auto-scale and handle variable data streaming throughput on its own.
https://aws.amazon.com/blogs/aws/amazon-kinesis-firehose-simple-highly-scalable-data-ingestion/

The microservice would also need to be managed by us. Therefore, we would need to handle scaling and possible failures. I think that maintenance cost outweighs the benefits of the microservice's role as described in this proposal.

Member Author

You mean just call `aws firehose put-record <data>` to redirect the data to Firehose?

Member Author

Then we would still need a bash script to do that, right?

Member

Yep! We could use the client-side AWS Firehose SDK for that; here are the available options: https://github.com/aws/aws-sdk
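For context, a direct client-side push like the one being discussed would look roughly like this with the AWS SDK for JavaScript v3. The Mina client itself is OCaml, so this only illustrates the shape of the Firehose call; the stream name, region, and report fields are assumptions:

```js
// Illustration only: a direct put to Kinesis Firehose with no intermediate
// microservice. Stream name, region, and report fields are hypothetical.
const { FirehoseClient, PutRecordCommand } = require("@aws-sdk/client-firehose");

async function sendNodeStatusReport(report) {
  const client = new FirehoseClient({ region: "us-west-2" });
  await client.send(new PutRecordCommand({
    DeliveryStreamName: "mina-node-status",
    Record: { Data: Buffer.from(JSON.stringify(report) + "\n") },
  }));
}

// Hypothetical report payload for demonstration.
sendNodeStatusReport({ version: 1, peer_id: "12D3Koo...", sync_status: "Synced" })
  .catch((err) => console.error("failed to push report:", err));
```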


A mini-service will be deployed at https://node-status.minaprotocol.com. The mini-service would do some simple checks against the format of the data that nodes send us. The mini-service would be implemented in Python or JavaScript. It would listen on the designated port, decode the JSON data that peers send, do a simple check to make sure that all the non-optional fields are present, and then call the AWS `PUT` operation to push the record to the AWS Kinesis Firehose data stream. This mini-service would be put in an AWS EC2 container to keep the configuration minimal.

I only consider simple validity checks against the format of the JSON data here, because any more complicated check would require us to run a node to observe the network conditions.
Member

The validation done by such a microservice will likely be easily bypassed by bad actors trying to spam our logs, so I don't really see the value of it.

Member Author

Agreed.

1 Kinesis Firehose data stream to receive the logs, and
1 S3 storage bucket to store the logs, and
1 OpenSearch service that provides the search ability for the logs, and
1 Kibana service that provides the visualization for the data
Member

Could we make this RFC a bit broader to accommodate the backend requirements for the node error collection backend as well? They are very similar, and I want us to have one common flexible system for both contexts. In such a scenario, I would assume we would basically duplicate the described setup for the error collection backend system; is my understanding correct?

Member Author

Yes. We would just duplicate the setup for the node error system too.

Member

@joseandro left a comment

LGTM! @bkase could you take a look at it again when you have time? I think we changed important pieces of it since you last reviewed it.

@joseandro requested a review from bkase on December 23, 2021 at 17:12
The reason that we chose the AWS stack for our backend is that:
0. The requirement for the node status collection and node error collection systems is a backend that provides storage for the data and an easy way to search and visualize it. The AWS stack fulfills these requirements.

1. The AWS stack is robust and easy to use. It has built-in DOS protection. We have team members who have used the AWS stack before, which means some of the team is already familiar with it. And it seems to be the cheapest choice for us.
Member

Have we also considered the Google Cloud equivalent of this stack? I think a few years ago we moved most, if not all, of our infra from AWS to Google Cloud. That may simplify management rather than having our system split between AWS and Google Cloud. (Having said that, I'm not up to date; do we have services that rely on AWS at the moment?)

Member Author

I can do some research on that. The closest thing would be the Google Cloud Logging service, I guess.


As shown in the diagram above, we would set up a public Kinesis Firehose data stream that the Mina client can push to. This Kinesis data stream would be connected to the S3 bucket, OpenSearch, and Kibana. We just won't use Splunk and Redshift.

This means we would modify the Mina client to add the AWS SDK to do the push. There would be no server maintained by us at the backend.
Member

It's great that we don't need to maintain another service!

Member

Could you include more detail here? Where will we do the push in the source code? Is it part of the logger or in some other module or library? How will data flow from the client.exe into this module to control the opt-in/opt-out status?

Member Author

I just added this. I found there is an OCaml AWS library, but I've never tried it. In the worst case, we would have to add an AWS CLI dependency for the Mina node to push to the Firehose endpoint.

Member

I think the RFC needs to make a decision here, with reasoning, so we can discuss with folks and get alignment. If it's not too hard, we should probably try to do it in OCaml if you ask me. Is there an HTTP API for the service? In the past, adding calls that shell out to other processes from bash, especially when you have to pass data to them, has been regrettable. Though, if it's super difficult to do in OCaml, I suppose we'd need to do it that way.

The other thing to include is detail around where in the code this will live. Part of the Prometheus metrics, you said, right? Does it live in a particular library, or is it just in various places in the source code?

Member Author

I would consider using this library to send requests to the AWS backend: https://github.com/inhabitedtype/ocaml-aws. I haven't tested it, but by looking at the API, I think it can handle our simple requirement of sending an AWS PUT request. If this effort fails, I would fall back to using bash.

The frontend of the node status service is already merged. There is a library that collects the data (either from Prometheus or directly from the transition frontier): https://github.com/MinaProtocol/mina/blob/compatible/src/lib/node_status_service/node_status_service.ml


We need a backend for the node status/error systems. Candidates for the backend include the AWS stack, Grafana Loki, and LogDNA.

## Implementation
Member

I think this is still missing details on how we're going to meet some of the requirements in the PRD for node status and node error. Maybe you could argue that this RFC is scoped to the "backend", but actually I think in order to understand the backend it makes sense to holistically think of the end-to-end flow of the data from the "frontend" too. Can we include this information in this RFC?

Here are some that I think need to be referenced/explained (in addition to what I mentioned in the below comments):

  1. How will the Mina client handle the status/error opt-in for the data collection?
  2. The frequency of the data sending
  3. How can the user specify the destination of the logs? (if they want to run their own collection infrastructure)
  4. What are the mechanisms for surfacing all the data? Should we reuse the code we have for prometheus metrics or should it be different?
  5. Where are the places where we will add the exception collection hook to send errors? How will the errors be indexed and queried so that we're able to learn from the issues in our system?

Member Author

  1. In the current implementation, there would be a CLI parameter for the destination URL of the node status report. If the user provides that, then the node status system would send reports there. The same applies to the node error system. I think we need to make some changes there, since now our backend is just the Kinesis Firehose endpoint, so there would be no URL. I would imagine that for this design there would be 2 CLI arguments: one required parameter to trigger the node status/error system, and one optional parameter to specify the destination if the user wants to send it somewhere else.

  2. For the node status system, it sends the status report every 5 slots. For the node error system, it would send a report before it crashes.

  3. Already answered in 1.

  4. In the implementation, I actually reused a lot of prometheus metrics.

  5. For the node error system, I was adding the hook to the handle_crash function. The same function was used to generate the current crash reports for Mina nodes. For more about the node error collection system, you can take a look at this RFC: Rfc/node error collection #9526. I failed to follow the exact requirements; for example, I failed to extract the locations of exceptions, and the backtrace stack is considered part of the exception. Basically I am just reporting the exception from Mina. Do you have any suggestions on how I can do better than that?

Member

As mentioned in Slack, can you add these answers to the actual document itself so we have a record of what our decisions ended up being?

  1. I guess we can provide a bucket name or something for the Kinesis Firehose? I think we should do it with one CLI argument like --share-report-data <destination-bucket> or whatever name we use for Kinesis storage. Then the boolean on/off and the destination are in the same command. I think we shouldn't hardcode our destination by default -- but try to name the Kinesis endpoint "mina-report-data" or something.
  2. Sounds good
  3. Sounds good
  4. Can you mention here how the integration works with the Prometheus metrics?
  5. Oh, I missed the other RFC; I think we should combine them, actually. Sorry I missed the original one -- a few changes: (1) it should be off by default (see the requirements doc https://www.notion.so/minaprotocol/Node-Error-Collection-030f806ede5345beb0089ffc8666f238 ) and (2) we need to adjust it to point to Kinesis now and not a REST URL, right? I think it's important to evaluate the whole thing holistically.

Member Author

  1. I think providing a destination URL would be more flexible than using a destination bucket. Let's say we want to switch to another backend architecture in the future; then we can set up a mini-service under a subdomain of minaprotocol.com (like what I suggested in the RFC before; that part got deleted because @joseandro argued that we can push everything from the frontend directly, which means we don't need this mini-service at all). Another thing is that we are not directly pushing to some bucket; we are pushing to the Kinesis data stream. I am OK with not hardcoding any default value for the destination.

  2. I can find the details in this PR: Feature/node status collection #9413. In short, what I did is extract the measurements from the metrics and fill the relevant fields in the reports with those measurements. There's no magic there.

  3. Yes, by default it's off. Yes, if we decide to go with the AWS stack, then we have to change the frontend a little bit.

@jrwashburn

It may also be helpful to report on status transitions, not just at a fixed interval of 5 slots - e.g. report when the status goes from synced to catchup, etc. That would allow for understanding things like network-wide bootstrap times, etc.

Member

Great -- using a URL sounds good to me 👍 .

@jrwashburn that's a good idea -- @ghost-not-in-the-shell would it be hard to capture those status transitions?

Member Author

Sync status is part of the "node status report". If we are interested in the bootstrap time, then we could add that to the status report directly. Personally I think sending reports every 5 slots is enough for the status report system.

Member

@bkase left a comment

Thanks! Are you certain about going with AWS over Google Cloud or are you still thinking through that?

@ghost-not-in-the-shell
Member Author

Thanks! Are you certain about going with AWS over Google Cloud or are you still thinking through that?

I think we actually prefer Google Cloud over AWS right now. I tried to set up a simple Google Cloud Logging service last Friday. The setup is super easy, but I still don't know how to connect it to Grafana.

Member

@bkase left a comment

A few more small asks but then sounds good ✅


![](res/gcloud_logging.png)

For the frontend, each node would have 2 command-line options to allow it to send node status/error reports to any specific URL: `--node-status-url URL` and `--node-error-url URL`. By default users would send their reports to our backend. They could change the destination by providing their own destination `URL` to the command options.
Member

Could you reword this slightly:

By default users would send their reports to our backend.
By default users won't send any status or error reports, and can opt in by providing the --node-status-url URL and/or --node-error-url URL; our backend would be suggested.

Additionally can we make sure this is configurable from the daemon.json file?

Member Author

Yeah. They are already configurable from the daemon.json file and that's already recorded in the corresponding RFCs.

res.render('error', {error: "Version Mismatch"});
} else if (Buffer.byteLength(req.body, 'utf8') > 1000000) {
res.status(413);
res.render('error', {error: "Payload Too Large"})
Member

Can you add a note to the RFC saying we will set up an alert to detect if this line is being sent?

Member Author

Sure.

@ghost-not-in-the-shell added the ci-build-me label (Add this label to trigger a circle+buildkite build for this branch) on Jan 21, 2022
@ghost-not-in-the-shell merged commit 089cf2b into compatible on Feb 28, 2022
@ghost-not-in-the-shell deleted the rfc/node-status-backend branch on February 28, 2022 at 13:25