# Rfc/node status backend #9910
## Summary

[summary]: #summary

This RFC proposes that we use the AWS stack as the backend of the node status/error system. Specifically, by "AWS stack" we mean using S3 as the storage backend, OpenSearch as the search engine, and Kibana as the visualization and charting tool.
## Motivation

[motivation]: #motivation

We need a backend for the node status/error systems. Candidates for the backend include the AWS stack, Grafana Loki, and LogDNA.
## Implementation

[implementation]: #implementation

![](res/aws_stack.png)

As the diagram above shows, we would set up a public Kinesis Firehose data stream that the Mina client can push to. This Kinesis data stream would be connected to the S3 bucket, OpenSearch, and Kibana. We just won't use Splunk or Redshift.

This means we would modify the Mina client to add the AWS SDK to do the push. There would be no server maintained by us at the backend.
> **Reviewer:** This is great that we don't need to maintain another service!
>
> **Reviewer:** Could you include more detail here? Where will we do the push in the source code? Is it part of the logger or in some other module or library? How will data flow from the client.exe into this module to control the opt-in/opt-out status?
>
> **Author:** I just added this. I found there is an OCaml AWS library, but I've never tried it. In the worst case we would have to add an AWS CLI dependency for the Mina node to push to the Firehose endpoint.
>
> **Reviewer:** I think the RFC needs to make a decision here with reasoning so we can discuss with folks and get alignment. If it's not too hard, we should probably try to do it in OCaml, if you ask me. Is there an HTTP API for the service? In the past, adding calls shelling out to other processes from bash, especially when you have to pass data to them, has been regretful. Though, if it's super difficult to do in OCaml, I suppose we need to do it. The other thing to include is the details around where in the code this will live. Part of Prometheus metrics, you said, right? Does it live in a particular library, or is it just in various places in the source code?
>
> **Author:** I would consider using this library to send requests to the AWS backend: https://github.com/inhabitedtype/ocaml-aws. I haven't tested it, but looking at the API, I think it can handle our simple requirement to send an AWS PUT request. If this effort fails, I would fall back to using bash. The frontend of the node status service is already merged. There is a library that collects the data (either from Prometheus or directly from the transition frontier): https://github.com/MinaProtocol/mina/blob/compatible/src/lib/node_status_service/node_status_service.ml
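Since Kinesis Firehose also exposes a plain HTTPS/JSON API, a minimal, untested OCaml sketch of the client-side push could look like the following. It assumes the `cohttp-lwt-unix`, `yojson`, and `base64` packages, and it omits the AWS SigV4 request signing that a real implementation would still need to handle:

```ocaml
(* Hypothetical sketch: push one node-status report to a Kinesis Firehose
   delivery stream over its JSON HTTP API. SigV4 signing is omitted here;
   a real implementation must sign the request or go through a proxy. *)
open Lwt.Syntax

let put_record ~region ~stream_name ~(report : Yojson.Safe.t) =
  let uri =
    Uri.of_string (Printf.sprintf "https://firehose.%s.amazonaws.com/" region)
  in
  let body =
    `Assoc
      [ ("DeliveryStreamName", `String stream_name)
      ; (* Firehose expects the payload base64-encoded under "Data". *)
        ( "Record"
        , `Assoc
            [ ( "Data"
              , `String (Base64.encode_string (Yojson.Safe.to_string report)) )
            ] )
      ]
    |> Yojson.Safe.to_string
  in
  let headers =
    Cohttp.Header.of_list
      [ ("Content-Type", "application/x-amz-json-1.1")
      ; ("X-Amz-Target", "Firehose_20150804.PutRecord")
      ]
  in
  let* resp, _body =
    Cohttp_lwt_unix.Client.post ~headers
      ~body:(Cohttp_lwt.Body.of_string body) uri
  in
  Lwt.return (Cohttp.Response.status resp)
```

Whether we go through ocaml-aws or sign the requests ourselves, the payload shape is the same: a base64-encoded record wrapped in a `PutRecord` call.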
In summary, for the AWS stack we need to set up:
1. one Kinesis Firehose data stream to receive the logs,
2. one S3 storage bucket to store the logs,
3. one OpenSearch service that provides search over the logs, and
4. one Kibana service that provides visualization for the data.

The same setup also applies to the node error system backend.
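For concreteness, here is a hypothetical example of a single status record as it might flow through this pipeline. The field names are illustrative only; the actual schema is defined by node_status_service.ml:

```ocaml
(* Hypothetical node-status record as it might be pushed to Firehose and
   land in S3 / OpenSearch. Field names are illustrative only. *)
let example_report : Yojson.Safe.t =
  `Assoc
    [ ("peer_id", `String "12D3KooW...")
    ; ("sync_status", `String "Synced")
    ; ("block_height", `Int 123456)
    ; ("uptime_secs", `Int 86400)
    ; ("timestamp", `String "2021-09-01T12:00:00Z")
    ]
```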
## Other choices

[other-choices]: #other-choices

### Grafana Loki

Grafana Loki is essentially a log aggregation system. For log storage, we could choose between different cloud storage backends like S3 or GCS. It uses an agent to send logs to the Loki server, which means we would need to set up a micro-service listening on https://node-status.minaprotocol.com that redirects the data to Loki. Loki provides a query language called LogQL, which is similar to the Prometheus query language. Another upside of this choice is its good integration with Grafana, which we already use and love. One thing to note is that Loki is "label"-based: to get the most out of it, we would need to find a good way to label our logs.
### LogDNA

LogDNA provides data storage, data visualization, and alerting functionality. Besides the usual log-collecting agent (as with Loki), LogDNA also offers the option to send logs directly to its API, which could save us the work of implementing a micro-service ourselves (depending on whether we feel safe giving users the log-uploading keys). The alerting service it provides would also be handy for the node error system.
## Prices

1. S3: $0.023 per GB per month; a little cheaper above 50 GB.
2. OpenSearch: free for the first 12 months, up to 750 hours per month.
3. Loki: depends on the storage we choose, and we need to run the Loki instance somewhere. We could use Grafana Cloud, but it appears to have 30 days of log retention; the price is $49/month for 100 GB of logs. (I think we already use their service, so the log storage is already paid for.)
4. LogDNA: $3/GB, also with 30 days of log retention.
## Rationale behind our choices

The reasons we choose the AWS stack for our backend are:

0. The node status and node error collection systems require a backend that stores the data and provides an easy way to search and visualize it. The AWS stack fulfills these requirements.

1. The AWS stack is robust and easy to use. It has built-in DoS protection. Some team members have used the AWS stack before and are already familiar with it. It also appears to be the cheapest choice for us.
> **Reviewer:** Have we also considered the Google Cloud equivalent of this stack? I think a few years ago we moved most, if not all, of our infra from AWS to Google Cloud. Staying there may simplify management rather than having our system split between AWS and Google Cloud. (Having said that, I'm not up to date; do we have services that rely on AWS at the moment?)
>
> **Author:** I can do a search around that. The closest thing would be the Google Cloud Logging service, I guess.
2. LogDNA has a 30-day log retention limit, which clearly doesn't suit our needs. Plus, LogDNA is much more expensive than the other two.

3. Grafana Loki features a "label"-indexed log compression system, which shines when the logs it processes have a certain number of static labels. That is not the case for our system. In addition, none of us are familiar with Loki, and Grafana Cloud also has a 30-day limit on log retention. This implies that if we want to use Loki, we would have to run our own Loki service, which adds additional maintenance complexity.

To summarize, the AWS stack has the functionality of the other two choices, and it is the easiest to set up and maintain. Plus, it's the cheapest.
## Review discussion

> **Reviewer:** I think this is still missing details on how we're going to meet some of the requirements in the PRD for node status and node error. Maybe you could argue that this RFC is scoped to the "backend", but I actually think that in order to understand the backend it makes sense to think holistically about the end-to-end flow of the data from the "frontend" too. Can we include this information in this RFC? Here are some points that I think need to be referenced/explained (in addition to what I mentioned in the comments below):
>
> **Author:** In the current implementation, there is a CLI parameter for the destination URL of the node status report. If the user provides it, the node status system sends reports there; the same applies to the node error system. I think we need to make some changes there, since our backend is now just the Kinesis Firehose endpoint, so there would be no URL. For this design I would imagine two CLI arguments: one required parameter to turn on the node status/error system, and one optional parameter to specify the destination if the user wants to send reports somewhere else.
>
> For the node status system, it sends a status report every 5 slots. For the node error system, it sends a report before the node crashes.
>
> Already answered in 1.
>
> In the implementation, I actually reused a lot of Prometheus metrics. For the node error system, I added the hook to the `handle_crash` function, the same function used to generate the current crash reports for Mina nodes. For more about the node error collection system, see this RFC: Rfc/node error collection #9526. I failed to follow the exact requirements; for example, I failed to extract the locations of exceptions, and the backtrace stack is considered part of the exception. Basically I am just reporting the exception from Mina. Do you have any suggestions on how I could do better than that?
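For illustration, here is a minimal sketch of those two proposed flags using Core's `Command` applicative. The flag names and wiring are hypothetical, not the actual Mina CLI:

```ocaml
(* Hypothetical sketch of the two proposed daemon flags; the names and
   wiring are illustrative, not the actual Mina CLI. *)
open Core

let report_flags : (bool * string option) Command.Param.t =
  let open Command.Param in
  both
    (flag "--enable-node-status" no_arg
       ~doc:" opt in to sending node status/error reports")
    (flag "--node-status-url" (optional string)
       ~doc:"URL override the default report destination")

let command =
  Command.basic ~summary:"demo of the proposed reporting flags"
    (Command.Param.map report_flags ~f:(fun (enabled, url) () ->
         if enabled then
           printf "reporting enabled, destination: %s\n"
             (Option.value url ~default:"(no destination configured)")))
```

Keeping the destination optional fits the conclusion reached below: no hardcoded default endpoint, with the user able to point reports elsewhere.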
> **Reviewer:** As mentioned in Slack, can you add these answers to the actual document itself so we have a record of what our decisions ended up being? I suggest something like `--share-report-data <destination-bucket>`, or whatever name we use for the Kinesis storage. Then the boolean on/off and the destination are in the same command. I think we shouldn't hardcode our destination by default, but should try to name the Kinesis endpoint "mina-report-data" or something.
>
> **Author:** I think providing a destination URL would be more flexible than using a `destination-bucket`. Say we want to switch to another backend architecture in the future; then we can set up a mini-service under a sub-domain of minaprotocol.com (like what I suggested in the RFC before; that part got deleted because joseandro argued that we can push everything to the frontend, which means we don't need this mini-service at all). Another thing is that we are not directly pushing to buckets; we are pushing to the Kinesis data stream. I am OK with not hardcoding any default value for the destination. The details are in this PR: Feature/node status collection #9413. In short, what I did is extract the measurements from the metrics and fill the relevant fields in the reports with those measurements. There's no magic there.
>
> Yes, by default it's off. And yes, if we decide to go with the AWS stack, we have to change the frontend a little bit.
> **@jrwashburn:** It may also be helpful to report on status transitions, not just on a fixed interval of 5 slots, e.g., report when the status goes from sync to catchup, etc. That would allow for understanding things like network-wide bootstrap times.
> **Reviewer:** Great, using a URL sounds good to me 👍. @jrwashburn that's a good idea. @ghost-not-in-the-shell, would it be hard to capture those status transitions?
> **Author:** Sync status is part of the "node status report". If we are interested in the bootstrap time, we could add that to the status report directly. Personally, I think sending reports every 5 slots is enough for the status report system.
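To make the cadence concrete, here is a rough sketch of a 5-slot reporting loop under Async, assuming the 3-minute mainnet slot duration and a hypothetical `send_report` function; the real implementation would hook into the consensus slot clock rather than wall time:

```ocaml
(* Hypothetical sketch of the 5-slot status reporting loop using Async.
   Mina's real implementation would hook into the consensus slot clock;
   wall time and the 3-minute mainnet slot are used here for illustration. *)
open Core
open Async

let slot_duration = Time.Span.of_min 3.

let start_status_reporting ~(send_report : unit -> unit Deferred.t) : unit =
  Clock.every (Time.Span.scale slot_duration 5.) (fun () ->
      don't_wait_for (send_report ()))
```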