
Rfc/node status backend #9910

Merged: 22 commits, Feb 28, 2022

Commits
952fab7
adding rfc for node status backend
ghost-not-in-the-shell Dec 13, 2021
8e9ae3a
clean up and adding more to the loki part
ghost-not-in-the-shell Dec 13, 2021
632acd0
adding pricing for backends
ghost-not-in-the-shell Dec 14, 2021
da49491
address brandon's concern
ghost-not-in-the-shell Dec 22, 2021
b091c78
address comments from josendro
ghost-not-in-the-shell Dec 23, 2021
65273c7
remove the wording of micro-service all together
ghost-not-in-the-shell Dec 23, 2021
ac539c0
adding a little more to the rationale
ghost-not-in-the-shell Dec 23, 2021
b98273e
remove the micro-service at all
ghost-not-in-the-shell Dec 23, 2021
42d1a52
adding diagram to depict the workflow
ghost-not-in-the-shell Dec 23, 2021
b321c44
add google cloud logging
ghost-not-in-the-shell Jan 6, 2022
52ee54c
adding more details about the change in the frontend of node status
ghost-not-in-the-shell Jan 7, 2022
fd097c1
adding more details for google cloud logging implementation
ghost-not-in-the-shell Jan 7, 2022
2acf040
Google Cloud API is actually HTTP based, so modify the options
ghost-not-in-the-shell Jan 10, 2022
7325923
make google cloud logging to be the 1st option
ghost-not-in-the-shell Jan 12, 2022
5b76f2d
add a section for micro-service
ghost-not-in-the-shell Jan 13, 2022
bb4dae2
adding more details for the micro-service implementation
ghost-not-in-the-shell Jan 14, 2022
6c92f70
using Google Functions for the micro-service
ghost-not-in-the-shell Jan 14, 2022
8df9bd8
adding kibana and elastic cloud for gcp
ghost-not-in-the-shell Jan 14, 2022
525120c
adding version of the node status/error report and adding checks for
ghost-not-in-the-shell Jan 19, 2022
be474a5
address comments of brandon
ghost-not-in-the-shell Jan 21, 2022
a68454b
Merge branch 'compatible' of github.com:MinaProtocol/mina into rfc/no…
ghost-not-in-the-shell Feb 25, 2022
a300b37
adding reasons why we make kibana private
ghost-not-in-the-shell Feb 25, 2022
61 changes: 61 additions & 0 deletions rfcs/0044-node-status-and-node-error-backend.md
## Summary

[summary]: #summary

This RFC proposes that we use the AWS stack as the backend of the node status/error system. Specifically, by the AWS stack I mean S3 as the storage backend, OpenSearch as the search engine, and Kibana as the visualization and charting tool.

## Motivation

[motivation]:#motivation

We need a backend for the node status/error systems. Candidates for the backend include the AWS stack, Grafana Loki, and LogDNA.

## Implementation
Member

I think this is still missing details on how we're going to meet some of the requirements in the PRD for node status and node error. Maybe you could argue that this RFC is scoped to the "backend", but actually I think in order to understand the backend it makes sense to holistically think of the end-to-end flow of the data from the "frontend" too. Can we include this information in this RFC?

Here are some that I think need to be referenced/explained (in addition to what I mentioned in the below comments):

  1. How will the Mina client handle the status/error opt-in for the data collection?
  2. The frequency of the data sending
  3. How can the user specify the destination of the logs? (if they want to run their own collection infrastructure)
  4. What are the mechanisms for surfacing all the data? Should we reuse the code we have for prometheus metrics or should it be different?
  5. Where are the places where we will add the exception collection hook to send errors? How will the errors be indexed and queried so that we're able to learn from the issues in our system?

Contributor Author

  1. In the current implementation, there is a CLI parameter for the destination URL of the node status report. If the user provides it, the node status system sends reports there; the same applies to the node error system. I think we need to make some changes there, since our backend would now just be the Kinesis Firehose endpoint, so there would be no URL. I would imagine this design needs 2 CLI arguments: one required parameter to turn on the node status/error system, and one optional parameter to specify the destination if the user wants to send reports somewhere else (a rough sketch of the two flags follows this list).

  2. For the node status system, a status report is sent every 5 slots. For the node error system, a report is sent right before the node crashes.

  3. Already answered in 1.

  4. In the implementation, I actually reused a lot of prometheus metrics.

  5. For the node error system, I was adding the hook to the handle_crash function. The same function is used to generate the current crash reports for mina nodes. For more about the node error collection system, you can take a look at this RFC: Rfc/node error collection #9526. I failed to follow the exact requirements; for example, I failed to extract the locations of exceptions, and the backtrace stack is considered part of the exception. Basically I am just reporting the exception from mina. Do you have any suggestions on how I can do better than that?
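
For reference, here is a minimal sketch of what the two CLI flags from answer 1 could look like, written in the `Core.Command` style the mina CLI already uses. The flag names are placeholders, not decided names, and the wiring into the daemon is omitted:

```ocaml
(* Sketch only: flag names are hypothetical. Requires the core library. *)
open Core

let report_flags =
  let open Command.Param in
  both
    (flag "enable-node-status-reports" no_arg
       ~doc:"opt in to sending node status/error reports (off by default)")
    (flag "node-status-url" (optional string)
       ~doc:"URL override of the default report destination")

let command =
  Command.basic ~summary:"node status flag sketch"
    (Command.Param.map report_flags ~f:(fun (enabled, url) () ->
         if enabled then
           printf "reporting enabled; destination = %s\n"
             (Option.value url ~default:"<default backend>")))

(* Newer Core releases expose this as Command_unix.run instead. *)
let () = Command.run command
```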

Member

As mentioned in Slack, can you add these answers to the actual document itself so we have a record of what our decisions ended up being?

  1. I guess we can provide a bucket name or something for the kinesis firehose? I think we should do it with one CLI argument like --share-report-data <destination-bucket> or whatever name we use for kinesis storage. Then the boolean on/off and the destination are in the same command. I think we shouldn't hardcode our destination by default -- but try to name the kinesis endpoint "mina-report-data" or something
  2. Sounds good
  3. Sounds good
  4. Can you mention here how the integration works with the Prometheus metrics?
  5. Oh I missed the other RFC, I think we should combine them actually. Sorry I missed the original one -- a few changes: (1) it should be off by default (see the requirements doc https://www.notion.so/minaprotocol/Node-Error-Collection-030f806ede5345beb0089ffc8666f238 ) and (2) we need to adjust it to point to kinesis now and not a REST URL, right? I think it's important to evaluate the whole thing holistically.

Contributor Author

  1. I think providing a destination URL would be more flexible than using a destination bucket. Let's say we want to switch to another backend architecture in the future; then we can set up a mini-service under a sub-domain of minaprotocol.com (like what I suggested in the RFC before; that part got deleted because joseandro argued that we can push everything to the frontend, which means we don't need this mini-service at all). Another thing is that we are not directly pushing to some bucket; we are pushing to the Kinesis data stream. I am OK with not hardcoding any default value for the destination.

  2. You can find the details in this PR: Feature/node status collection #9413. In short, what I did was extract the measurements from the metrics and fill the relevant fields in the reports with those measurements. There's no magic there (a toy sketch of the report shape follows this list).

  3. Yes, by default it's off. And yes, if we decide to go with the AWS stack, then we have to change the frontend a little bit.
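
As a purely illustrative aside on answer 2 (measurements are simply copied into report fields), a trimmed-down, hypothetical report shape might look like the following; the field names here are made up and do not match the real node status report:

```ocaml
(* Hypothetical, trimmed-down report shape; illustrative only. *)
type node_status_report =
  { version : int                    (* report format version *)
  ; peer_id : string
  ; sync_status : string             (* e.g. "Bootstrap", "Catchup", "Synced" *)
  ; block_height_at_best_tip : int
  ; uptime_mins : int
  }

(* Filling the report is just copying measurements the metrics already track
   into the corresponding fields. *)
let make_report ~peer_id ~sync_status ~block_height ~uptime_mins =
  { version = 1
  ; peer_id
  ; sync_status
  ; block_height_at_best_tip = block_height
  ; uptime_mins
  }
```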


It may also be helpful to report on status transitions, not just a fixed interval of 5 slots - e.g. report when status goes from sync to catchup, etc. That would allow for understanding things like network-wide bootstrap times, etc.

Member

Great -- using a URL sounds good to me 👍 .

@jrwashburn that's a good idea -- @ghost-not-in-the-shell would it be hard to capture those status transitions?

Contributor Author

Sync status is part of the "node status report". If we are interested in the bootstrap time, then we could add that to the status report directly. Personally I think sending reports every 5 slots is enough for the status report system.
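
For reference, if transition-based reporting is ever added on top of the 5-slot interval, a rough sketch could look like the following; the sync_status type, the pipe of status changes, and the send callback are all stand-ins rather than actual daemon identifiers:

```ocaml
(* Sketch only: the pipe of sync-status values and the `send` callback are
   placeholders for whatever the daemon actually exposes. *)
open Core
open Async

type sync_status = Bootstrap | Catchup | Synced [@@deriving sexp, equal]

let report_transitions (statuses : sync_status Pipe.Reader.t)
    ~(send : string -> unit Deferred.t) : unit Deferred.t =
  let last = ref None in
  Pipe.iter statuses ~f:(fun status ->
      match !last with
      | Some prev when equal_sync_status prev status ->
          (* No transition: nothing to report. *)
          Deferred.unit
      | _ ->
          last := Some status ;
          send
            (sprintf "sync status transitioned to %s"
               (Sexp.to_string (sexp_of_sync_status status))))
```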


[implementation]:#implementation

![](res/aws_stack.png)

As the diagram above shows, we would set up a public Kinesis Firehose data stream that the mina client can push to. This Kinesis data stream would be connected to the S3 bucket, OpenSearch, and Kibana. We just won't use Splunk or Redshift.

This means we would modify the mina client to add the AWS SDK to do the push. There would be no server maintained by us at the backend.
Member

This is great that we don't need to maintain another service!

Member

Could you include more detail here? Where will we do the push in the source code? Is it part of the logger or in some other module or library? How will data flow from the client.exe into this module to control the opt-in/opt-out status?

Contributor Author

I just added this. I found there is an OCaml AWS library, but I've never tried it. In the worst case we would have to add an AWS CLI dependency for the mina node to push to the Firehose endpoint.

Member

I think the RFC needs to make a decision here with reasoning so we can discuss with folks and get alignment. If it's not too hard, we should probably try to do it in OCaml if you ask me. Is there an HTTP API for the service? In the past, adding calls that shell out to other processes from bash, especially when you have to pass data to them, has been regrettable. Though, if it's super difficult to do in OCaml, I suppose we need to do it.

The other thing to include is the details around where in the code this will live? Part of Prometheus metrics, you said right? Does it live in a particular library or is it just in various places in the source code?

Contributor Author

I would consider using this library to send requests to the AWS backend: https://github.com/inhabitedtype/ocaml-aws. I haven't tested it, but looking at the API, I think it can handle our simple requirement of sending an AWS PUT request. If this effort fails, I would fall back to using bash.

The frontend of the node status service is already merged. There is a library that collects the data (either from prometheus or directly from the transition frontier): https://github.com/MinaProtocol/mina/blob/compatible/src/lib/node_status_service/node_status_service.ml
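
If the ocaml-aws route doesn't work out, the fallback could stay in OCaml as a plain HTTP POST of the JSON report to the configured destination URL rather than shelling out to bash. Here is a rough cohttp-async sketch; the function name and error handling are illustrative, and this is not the existing node_status_service code:

```ocaml
(* Sketch of posting a JSON report to a configurable destination URL, assuming
   we keep a URL-based endpoint. Error handling is deliberately minimal. *)
open Core
open Async

let send_report ~(url : string) ~(report_json : string) : unit Deferred.t =
  let uri = Uri.of_string url in
  let headers =
    Cohttp.Header.of_list [ ("Content-Type", "application/json") ]
  in
  let body = Cohttp_async.Body.of_string report_json in
  let%bind response, response_body =
    Cohttp_async.Client.post ~headers ~body uri
  in
  let%bind response_str = Cohttp_async.Body.to_string response_body in
  let status = Cohttp.Response.status response in
  if Cohttp.Code.(is_success (code_of_status status)) then Deferred.unit
  else (
    eprintf "node status report rejected (%s): %s\n"
      (Cohttp.Code.string_of_status status)
      response_str ;
    Deferred.unit )
```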


In summary, for the AWS stack we need to set up:
- 1 Kinesis Firehose data stream to receive the logs,
- 1 S3 storage bucket to store the logs,
- 1 OpenSearch service that provides search over the logs, and
- 1 Kibana service that provides visualization for the data.

The same setup also applies to the node error system backend.

## Other choices

[other-choices]: #other-choices

### Grafana Loki

Grafana Loki is basically a log aggregation system. For log storage, we could choose between different cloud storage backends like S3 or GCS. Loki uses an agent to send logs to the Loki server, which means we would need to set up a micro-service listening on https://node-status.minaprotocol.com that redirects the data to Loki. Loki provides a query language called LogQL which is similar to the Prometheus query language. Another upside of this choice is that it has good integration with Grafana, which is already used and loved by us. One thing to note is that Loki is "label" based; if we want to get the most out of it, we need to find a good way to label our logs.
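
To make the label-based point concrete, the hypothetical forwarding micro-service would take each incoming report and push it to Loki with a small set of static labels. The sketch below assumes Loki's documented JSON push endpoint (`/loki/api/v1/push`) and payload shape; both are assumptions to verify against whichever Loki version we would deploy, and all names are illustrative:

```ocaml
(* Sketch: forward one report line to Loki with static labels. The endpoint path
   and JSON shape are assumptions based on Loki's push API documentation. *)
open Core
open Async

let push_to_loki ~(loki_base_url : string) ~(labels : (string * string) list)
    ~(line : string) : unit Deferred.t =
  let uri = Uri.of_string (loki_base_url ^ "/loki/api/v1/push") in
  (* Loki expects the timestamp as a string of nanoseconds since the epoch. *)
  let ts_ns =
    Time.now () |> Time.to_span_since_epoch |> Time.Span.to_ns
    |> sprintf "%.0f"
  in
  let payload =
    `Assoc
      [ ( "streams"
        , `List
            [ `Assoc
                [ ( "stream"
                  , `Assoc (List.map labels ~f:(fun (k, v) -> (k, `String v))) )
                ; ("values", `List [ `List [ `String ts_ns; `String line ] ])
                ] ] ) ]
  in
  let headers =
    Cohttp.Header.of_list [ ("Content-Type", "application/json") ]
  in
  let body = Cohttp_async.Body.of_string (Yojson.Safe.to_string payload) in
  let%bind _response, _body = Cohttp_async.Client.post ~headers ~body uri in
  Deferred.unit
```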

### LogDNA

LogDNA provides data storage, data visualization, and alerting functionality. Besides the usual log-collecting agent like Loki's, LogDNA also provides the option to send logs directly to their API, which could save us the work of implementing a micro-service ourselves (depending on whether we feel safe giving users the log-uploading keys). The alerting service they provide would also be handy for the node error system.

## Prices

1. S3: $0.023 per GB/month; a little cheaper if we use more than 50GB.
2. OpenSearch: free usage for the first 12 months, up to 750 hours per month.
3. Loki: depends on the storage we choose, and we need to run the Loki instance somewhere. We could use Grafana Cloud, but it seems to have 30 days of log retention. The price is $49/month for 100GB of logs. (I think we already use their service, so the log storage is already paid for.)
4. LogDNA: $3/GB; logs also have 30 days of retention.

## Rationale behind our choice

The reasons we chose the AWS stack for our backend are:

0. The requirement for the node status collection and node error collection systems is a backend that provides data storage and an easy way to search and visualize the data. The AWS stack fulfills this requirement.

1. The AWS stack is robust and easy to use. It has built-in DoS protection. We have team members who have used the AWS stack before, which means some of the team are already familiar with it. It also seems to be the cheapest choice for us.
Member

Have we also considered the Google Cloud equivalent of this stack? I think a few years ago we moved most if not all of our infra from AWS to Google Cloud. That may simplify management rather than having our system split between AWS and Google Cloud (having said that, I'm not up to date; do we have services that rely on AWS at the moment?)

Contributor Author

I can do a search around that. The closest thing would be Google Cloud Logging service I guess.


2. For LogDNA, it has a 30-day log retention limit, which clearly doesn't suit our needs. Plus, LogDNA is much more expensive than the other two.

3. For Grafana Loki, it features a "label"-indexed log compression system. This would shine if the logs it processes carried a certain number of static labels, which is not the case for our system. Besides that, none of us are familiar with Loki. And finally, Grafana's cloud offering also has a 30-day limit on log retention, which implies that if we want to use Loki we would have to set up our own Loki service, adding additional maintenance complexity.

To summarize, the AWS stack has the functionality of the other two choices, and it's the easiest one to maintain and set up. Plus, it's the cheapest.
3 changes: 3 additions & 0 deletions rfcs/res/aws_stack.png