
Testnet & Mainnet Status Page testnet.polykey.com mainnet.polykey.com #599


@CMCDragonkai
Member

CMCDragonkai commented Oct 23, 2023

Specification

The testnet 6 deployment (#551) now underway shows us the utility of having a single dashboard for tracking analytics and operational metrics of the testnet.

Right now AWS's dashboards and logging really suck. The CloudWatch dashboard is hard to configure and doesn't automatically update in relation to changes in our infrastructure (it's not configured through our infrastructure deployment code). The logging system is also hard to navigate: there are a lot of IDs in AWS that relate to different resources, and it's hard to correlate all of these resources with the actual nodes that we have deployed.

Of particular note are these pages:

What we would like instead is to aggregate information and place it on testnet.polykey.com.

Here are some examples.

Current cloudwatch:

image

There are some challenges though. Right now we use A records on Cloudflare to route testnet.polykey.com:

[nix-shell:~/Projects/Polykey-CLI]$ dig testnet.polykey.com

; <<>> DiG 9.18.16 <<>> testnet.polykey.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61838
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4095
;; QUESTION SECTION:
;testnet.polykey.com.		IN	A

;; ANSWER SECTION:
testnet.polykey.com.	300	IN	A	13.55.53.141
testnet.polykey.com.	300	IN	A	3.106.15.126

;; Query time: 15 msec
;; SERVER: 100.100.100.100#53(100.100.100.100) (UDP)
;; WHEN: Mon Oct 23 15:59:03 AEDT 2023
;; MSG SIZE  rcvd: 80

You can see here the 2 A records correspond to the Polykey testnet node container tasks.

If we navigate to testnet.polykey.com, the browser will try one of those IPs and attempt to access it via port 80 or 443. We should prefer 443 of course (HTTPS by default).

Browsers will do some sort of resolution over those records, so this is actually a bit problematic. We can't use those A records, as they point to Polykey nodes directly. We would instead want to route to a service, potentially a Cloudflare Worker, to show the testnet network status page visualisation.

One way to do this is through Cloudflare proxying. You can enable proxying and add rules in Cloudflare so that it can serve different DNS records. I suspect this may not work, and the simplest solution is to use a different record type.

So DNS record types that are relevant could be:

  • A and AAAA records for the web page that shows the testnet network status
  • TXT or SRV records instead for the Polykey nodes - the SRV records look like this though:
    image

If we do change to using SRV records, we also need to address the changes for bootstrapping into private networks.

Also, in terms of setting up the dashboard, we could use a Cloudflare Worker, which would not be long-running; I'm not sure how to set this up. Another way is to always route to a Cloudflare Worker and have the worker do all the routing between the HTTP status page and the actual nodes. Cloudflare Workers seem quite flexible: https://developers.cloudflare.com/workers/examples/websockets/
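
As a rough illustration of the second approach, here is a minimal sketch of such a Worker, assuming hypothetical routes and an upstream host (dashboard-backend.example.com is a placeholder, not our real backend):

    // Minimal sketch of a Cloudflare Worker fronting testnet.polykey.com.
    // The routes and the upstream host below are hypothetical placeholders.
    export default {
      async fetch(request: Request): Promise<Response> {
        const url = new URL(request.url);
        // Serve the network status page on the root path.
        if (url.pathname === '/' || url.pathname.startsWith('/status')) {
          return new Response('<h1>Polykey Testnet Status</h1>', {
            headers: { 'content-type': 'text/html; charset=utf-8' },
          });
        }
        // Proxy everything else to a hypothetical dashboard backend.
        const upstream = new URL(url.pathname + url.search, 'https://dashboard-backend.example.com');
        return fetch(new Request(upstream.toString(), request));
      },
    };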

Additional context

  • HTTP status page for Polykey Agent #412 - that issue is about having an HTTP status page for the Polykey agent directly, whereas this issue focuses on having a public testnet status page. It will be useful for traction metrics and for giving a sense of the community.

image

The above shows a sort of global network status page for Rocket Pool, but I think Grafana can show all of that too.

Tasks

  1. Point A records of testnet.polykey.com and mainnet.polykey.com to the dashboards.
  2. Point A records of ${nodeId}.testnet.polykey.com to the nodes.
  3. Point _polykey_agent._udp.testnet.polykey.com SRV records to the ${nodeId}.testnet.polykey.com A records.
  4. Change testnet.polykey.com and mainnet.polykey.com records to point towards the dashboard.
@CMCDragonkai CMCDragonkai added the development Standard development label Oct 23, 2023
@CMCDragonkai
Member Author

Should also consider whether this should just be part of the general backend for Polykey Enterprise, since we are setting up a Next.js application server there anyway.

It is important to note that the public web page does not have the ability to directly connect to the PK testnet nodes' client service, since that requires authentication; also, the websocket transport on the PK testnet requires the specialised js-ws library, which doesn't yet work in the browser anyway.

@CMCDragonkai
Member Author

There are also other inspirations, like https://status.hey.xyz/

@CMCDragonkai
Member Author

@tegefaulkes thoughts on using SRV record? @amydevs any notes on SRV records?

@tegefaulkes
Contributor

Are SRV records really what we want? AFAIK they're only used to specify a port for a service on an existing address.

@CMCDragonkai
Member Author

It's one of the alternatives to the A and AAAA records.

@okneigres

Why not expose ports 80 and 443 from the container, and serve static HTML/JS files with the dashboard, or proxy traffic somewhere?
Curious what you guys think. Could that be a problem?

Also, I'm probably missing some context here – do you want Polykey to be a self-sufficient service (i.e. we can run it fully functional with pure Docker), or do you want it to use AWS / Cloudflare infrastructure? Depending on that, we can think of a clear and easy solution.

@CMCDragonkai
Member Author

CMCDragonkai commented Oct 25, 2023

We have an issue regarding an HTTP status page for the PK agent itself: #412. That's separate from this issue, which is about a status page for the entire network. This would be unique to the testnet or mainnet, and not part of any PK node.

So exposing 80/443 wouldn't be sufficient to achieve this as that would just show the agent's own status page.

@okneigres

How do you think we should collect and store historical data?

I can imagine a couple of scenarios:

  1. Use something like Elasticsearch + Kibana. It works using a PUSH model. The ecosystem has various collector agents that can be integrated to gather all the metrics and logs and visualise them. To use it, we'll need to change the container image to integrate those collectors there. Kibana is very flexible in building any type of visualisation. It can also be integrated with Grafana.
  2. Use Prometheus + Grafana. Prometheus uses a PULL model and does not need significant changes to the agent's image, apart from exposing an HTTP page with the necessary metrics on the agent's side. Prometheus will do an HTTP call to that page every 15-30 seconds and store the result in its own data storage. Logs are not covered here, unfortunately.

There is also a commercial all-in-one solution called DataDog: https://www.datadoghq.com/pricing/. In my view it's quite costly, and I had issues with its flexibility and maintenance.

What's on your mind?

@okneigres

About the DNS record problem: the most straightforward solution I can see is to use another domain. Let's say subdomain: status.testnet.polykey.com or panel.testnet.polykey.com
It does not require changing the current mechanism of retrieving a list of Polykey nodes, and it is pretty straightforward to understand and manage.

@CMCDragonkai
Member Author

> How do you think we should collect and store historical data? […]

We want to keep the agent process minimal, so the pull model is probably better. Logs-wise, we can output them in different formats.

@CMCDragonkai
Member Author

> About the DNS record problem: the most straightforward solution I can see is to use another domain. Let's say subdomain: status.testnet.polykey.com or panel.testnet.polykey.com
> It does not require changing the current mechanism of retrieving a list of Polykey nodes, and it is pretty straightforward to understand and manage.

It's not just a technical issue, it's also about optics. It's just smoother to point everybody to testnet.polykey.com or mainnet.polykey.com. We should be able to switch to using SRV records since we control the entire DNS resolution process.

The main issue is hosting the dashboard. I'm not sure whether a Cloudflare Worker can do this with live updates, or whether we should extend our current PKE to handle it.

@CMCDragonkai
Member Author

CMCDragonkai commented Oct 29, 2023

I've assigned this to @okneigres, please spec out the task list in the OP.

I still need to set your email up and some account access which I'll do after our meeting.

@CMCDragonkai
Member Author

The name of the game right now is speed, so if we can get away with hosting our logs and metrics data elsewhere, that would be best. Later we can incorporate this into our PKE infrastructure.

@CMCDragonkai CMCDragonkai changed the title Testnet Status Page testnet.polykey.com Testnet & Mainnet Status Page testnet.polykey.com mainnet.polykey.com Oct 29, 2023
@CMCDragonkai
Member Author

Just tried this:

dig _polykey_agent._udp.testnet.polykey.com SRV

It gives you:


; <<>> DiG 9.18.16 <<>> _polykey_agent._udp.testnet.polykey.com SRV
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 3216
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4095
;; QUESTION SECTION:
;_polykey_agent._udp.testnet.polykey.com. IN SRV

;; ANSWER SECTION:
_polykey_agent._udp.testnet.polykey.com. 300 IN	SRV 0 0 1314 testnet.polykey.com.

;; Query time: 16 msec
;; SERVER: 100.100.100.100#53(100.100.100.100) (UDP)
;; WHEN: Mon Oct 30 09:51:17 AEDT 2023
;; MSG SIZE  rcvd: 107

So you still need a special hostname to point to the cluster IP addresses via A/AAAA records.

That does mean we can still reserve testnet.polykey.com for the webpage.

@addievo
Contributor

addievo commented Oct 29, 2023

I think a simple solution is to use https://no-ip.com/ to get a hostname to point to an ip, which can then be used with the corresponding SRV address.

@CMCDragonkai
Member Author

CMCDragonkai commented Oct 30, 2023

Some notes about SRV records.

Basically, when an SRV record is created, it always has this structure:

  • _polykey_agent._udp.testnet.polykey.com

Technically it could even be:

  • _testnet._udp.polykey.com - the testnet name isn't necessary, and you can just use the service type field

So what that means is that you then have to use dig on _testnet._udp.polykey.com. The underscore prefix is part of the SRV record standard.

So anyway I talked to chatgpt (https://chat.openai.com/share/48b02e63-414c-4a8f-838c-c441c3c2e1c4) about this and this is what it suggests:

  1. Set up multiple SRV records like _polykey_agent._udp.testnet.polykey.com
    _polykey_agent._udp.testnet.polykey.com. 3600 IN SRV 0 1 1314 node1.testnet.polykey.com.
    _polykey_agent._udp.testnet.polykey.com. 3600 IN SRV 1 1 1314 node2.testnet.polykey.com.
    
  2. Then one subdomain for each node too, where node1 and node2 would be separate Node IDs.
  3. Each node1.testnet.polykey.com can have A and AAAA records.
  4. Then testnet.polykey.com can point to the dashboard IP

This also enables us to support _polykey_client._tcp for mDNS discovery of the client service, which could be an interesting use case for the PK CLI or other clients to control an agent on the local network. Actually, I'm not sure if this is useful if the PK CLI or clients don't have a way of discovering the local agent. Right now it just relies on the node path, which uses a filesystem beacon to find the agent to talk to. But it could be interesting if, on a local network, the PK client also does a quick mDNS discovery. It may require a flag to switch off using the local default node path. It could also rely on the Node ID designation.

@CMCDragonkai
Member Author

CMCDragonkai commented Oct 30, 2023

The place to modify this resolution process would be the resolveHostname function in network/utils.ts, which currently uses A and AAAA records.
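
For illustration, here is a minimal sketch of how SRV-based resolution could be layered on top using Node's dns.promises API. This is not the actual resolveHostname signature; resolveSeedNodes is a hypothetical name:

    import { promises as dns } from 'node:dns';

    // Sketch only: resolve the `_polykey_agent._udp.<domain>` SRV records, then
    // the A/AAAA records of each SRV target (e.g. node1.testnet.polykey.com).
    async function resolveSeedNodes(
      domain: string,
    ): Promise<Array<{ host: string; port: number }>> {
      const srvRecords = await dns.resolveSrv(`_polykey_agent._udp.${domain}`);
      const results: Array<{ host: string; port: number }> = [];
      for (const srv of srvRecords) {
        const [v4, v6] = await Promise.allSettled([
          dns.resolve4(srv.name),
          dns.resolve6(srv.name),
        ]);
        const addresses = [
          ...(v4.status === 'fulfilled' ? v4.value : []),
          ...(v6.status === 'fulfilled' ? v6.value : []),
        ];
        for (const address of addresses) {
          results.push({ host: address, port: srv.port });
        }
      }
      return results;
    }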

@CMCDragonkai
Member Author

@okneigres can you explore this https://developers.cloudflare.com/workers/examples/websockets/ to see if it is a viable option for hosting the page? Or if not, let's just go straight to implementing it on the PKE.

@CMCDragonkai
Member Author

CMCDragonkai commented Oct 30, 2023

Changing to #599 (comment) would give us:

  1. Ability to specify priority and weight, providing some level of load-distribution mechanic (see the sketch after this list)
  2. Ability to not hard code the port being used by the seed nodes
  3. Ability to also have dynamic node IDs and IPs - that would be useful for the private network
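
For point 1, here is a hedged sketch of the standard RFC 2782 selection a client could apply to the SRV answers (lowest priority group first, weighted-random pick within that group); the SrvRecord shape mirrors what Node's dns.resolveSrv returns:

    interface SrvRecord {
      name: string;
      port: number;
      priority: number;
      weight: number;
    }

    // Sketch of RFC 2782-style selection: the lowest priority group wins, and
    // within that group the pick is weighted-random by the record weights.
    function pickSrvTarget(records: SrvRecord[]): SrvRecord {
      if (records.length === 0) throw new Error('No SRV records to pick from');
      const minPriority = Math.min(...records.map((r) => r.priority));
      const group = records.filter((r) => r.priority === minPriority);
      const totalWeight = group.reduce((sum, r) => sum + r.weight, 0);
      if (totalWeight === 0) {
        return group[Math.floor(Math.random() * group.length)];
      }
      let threshold = Math.random() * totalWeight;
      for (const record of group) {
        threshold -= record.weight;
        if (threshold <= 0) return record;
      }
      return group[group.length - 1];
    }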

@CMCDragonkai
Member Author

CMCDragonkai commented Oct 30, 2023

Testnet Architecture Plan

Based on today's discussion.

Some things that need to be figured out:

  • Audit events vs Operation Logs - audit events can be used by the dashboard, but not operational logs
  • Audit events are persisted - but how to make this efficient
  • Audit events can be pulled and pushed?
  • Structured information needs to be processed and then pushed to the dashboard
  • Metrics need to be stored as well, but these are just numbers. Prometheus is a possibility, but if we need PKE or something more sophisticated to process things, then maybe it should all be done through PKE to centralise the processing logic
  • Cloudflare Worker persistence - how does its websocket system work?
  • New nodes should connect to only one of the seed nodes; we're going to need some level of decentralised signalling rather than connecting to all seed nodes

@CMCDragonkai
Member Author

image

Add some ideas on what you want to see in the dashboard.

@tegefaulkes
Contributor

As I mentioned, it would be neat to see what the network looks like with a force-directed graph.

https://observablehq.com/@d3/disjoint-force-directed-graph/2?intent=fork

It would be a neat visualisation, but also useful for viewing the connectivity of the network and how it forms.

@CMCDragonkai
Member Author

While that is cool, I don't think it would be possible for us to efficiently or accurately represent such a map. Also, I imagine such information might be a privacy issue, even though it's a public network. The most we could do is show some representation of the node graph, but the geo visualisation would be the most impactful for now.

@okneigres this https://github.com/maxmind/GeoIP2-node is the most official library for this; it works only server-side atm, not client-side. You might need to investigate.

@CMCDragonkai
Member Author

CMCDragonkai commented Jan 9, 2024

  1. Metrics inside Polykey should be limited to Gauges, Counters, Histograms and Summaries - construct an appropriate JSON format for this in your audit metrics methods (see the sketch after this list).
  2. These are still going to be reported through the JSON RPC protocol, not the Prometheus HTTP text format.
  3. The PKNS acts like a sidecar in that it samples this data (pulling it) and then pushes it to Grafana Mimir. It's basically a Prometheus agent in this regard. I don't know what else the Prometheus agent itself would be useful for.
  4. The Mimir API is just a remote-write HTTP API endpoint that receives the pushed data. You could construct the protobuf yourself. Use protobuf-es.
  5. Deploying Mimir only requires a connection to an S3-compatible endpoint, in this case Cloudflare R2.
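
As an illustration of point 1, one possible JSON shape for the four metric kinds is sketched below. This is purely an assumption for discussion, not the actual audit metrics schema; the metric names in the sample are made up:

    // Hypothetical JSON shapes for the four Prometheus-style metric kinds.
    type Metric =
      | { kind: 'gauge'; name: string; value: number; labels?: Record<string, string> }
      | { kind: 'counter'; name: string; value: number; labels?: Record<string, string> }
      | {
          kind: 'histogram';
          name: string;
          buckets: Record<string, number>; // upper bound -> cumulative count
          sum: number;
          count: number;
        }
      | {
          kind: 'summary';
          name: string;
          quantiles: Record<string, number>; // quantile -> observed value
          sum: number;
          count: number;
        };

    // Example payload as it might be returned over the JSON RPC protocol.
    const sample: Metric[] = [
      { kind: 'gauge', name: 'pk_connections_active', value: 12 },
      { kind: 'counter', name: 'pk_rpc_requests_total', value: 3401, labels: { method: 'nodesPing' } },
    ];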

@CMCDragonkai
Member Author

Make sure you're not doing derived calculations on every RPC request. That should just be done later by the analysis system.

@CMCDragonkai
Member Author

The Prometheus-compatible metrics system should be a new issue. You want to complete this epic just for the deployment and show all of that in the dashboard.

@CMCDragonkai
Member Author

MatrixAI/Polykey-CLI#93 cannot be the thing that closes this epic. We still need the final design of the deployment version table on the dashboard to close this off.

@CMCDragonkai
Member Author

@amydevs new issues required for:

  1. Polykey metrics limited to prometheus types
  2. Mimir architecture for metrics gathering

@CMCDragonkai
Member Author

Going to try out Mimir locally, and submit some metrics.

@CMCDragonkai
Member Author

@amydevs I think your diagram is missing a critical piece of the puzzle here: the Supabase database storing all the relevant metadata and transactional data (especially if metric data is going to Mimir).

@CMCDragonkai
Member Author

CMCDragonkai commented Jan 11, 2024

This was necessary to get Mimir working:

      mimir = {
        enable = true;
        configuration = {
          ingester = {
            ring = {
              replication_factor = 1;
            };
          };
          multitenancy_enabled = false;
          no_auth_tenant = "anonymous";
          server = {
            http_listen_network = "tcp";
            http_listen_address = "";
            http_listen_port = 8080;
            http_listen_conn_limit = 0;
            grpc_listen_network = "tcp";
            grpc_listen_address = "";
            grpc_listen_port = 9095;
            grpc_listen_conn_limit = 0;
          };
          common = {
            storage = {
              backend = "s3";
              s3 = {
                # this is using R2 storage with special Mimir Prototype tokens
                # note that this is not the same as the tokens for Cloudflare
                # endpoint must not have protocol attached
                endpoint = "....r2.cloudflarestorage.com";
                region = "auto";
                secret_access_key = "";
                access_key_id = "";
              };
            };
          };
          blocks_storage = {
            s3 = {
              bucket_name = "mimir-blocks";
            };
          };
          alertmanager_storage = {
            s3 = {
              bucket_name = "mimir-alertmanager";
            };
          };
          ruler_storage = {
            s3 = {
              bucket_name = "mimir-ruler";
            };
          };
        };
      };

It wasn't well documented, but basically the 3 buckets need to be created ahead of time. Then an HTTP endpoint and a gRPC endpoint are needed, the replication factor has to be 1 otherwise it will refuse to start, and you have to disable multitenancy if there's only one org using this thing.

The URL to push prometheus records in is then: http://127.0.0.1:8080/api/v1/push.

I don't yet see any blocks uploaded to R2, but apparently:

> An ingester takes up to 30 minutes to upload a block to the storage

But the docs seem wrong in a few places, so buyer beware: grafana/mimir#4187

Debugging service config takes too long; it's best to run the commands themselves in the foreground in a nix-shell first, then port the working config to the system level later.

@amydevs
Member

amydevs commented Jan 12, 2024

Deployment Status Table:
[excalidraw diagram]

Supabase exposes a PostgREST API; the Lambda inserts rows via that API.
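
A rough sketch of what that insert could look like from the Lambda, using Supabase's PostgREST endpoint. The deployments table name and its columns are assumptions for illustration only:

    // Sketch of inserting a deployment row via Supabase's PostgREST API.
    // The table name `deployments` and its columns are hypothetical.
    async function recordDeployment(
      supabaseUrl: string,
      serviceKey: string,
      row: { network: string; version: string; commit_hash: string },
    ): Promise<void> {
      const response = await fetch(`${supabaseUrl}/rest/v1/deployments`, {
        method: 'POST',
        headers: {
          'apikey': serviceKey,
          'Authorization': `Bearer ${serviceKey}`,
          'Content-Type': 'application/json',
          'Prefer': 'return=minimal',
        },
        body: JSON.stringify(row),
      });
      if (!response.ok) {
        throw new Error(`Insert failed: ${response.status} ${await response.text()}`);
      }
    }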

@CMCDragonkai
Member Author

To get build time information into the PK CLI, we use build.json, which is produced by scripts/build.js. This is loaded by the PK CLI at run time to get information, if it exists, about what happened during the build. This is how we provide the commit hash information.

However, a nix-build ignores the .git directory, so we cannot get the commit hash information when we run scripts/build.js. This is because it was assumed that we wouldn't need the git DB for this. But now that the build involves information only stored in the git DB, it does seem like we need to keep the .git directory around. However, we don't actually need all of .git, just the parts that give us the relevant build information.

One way is to use Nix to get just the info we need: https://chat.openai.com/share/849f4c37-2cc1-48c1-9df7-00a2904f9819

Alternatively, just bring .git into the nix-build. It's probably not a big deal; it just increases the amount of data during building.
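
One hedged sketch of how scripts/build.js could handle both cases: fall back to an environment variable (a hypothetical COMMIT_HASH, which a Nix derivation could inject) when the .git directory is unavailable. This is an illustration, not the actual script:

    import { execFileSync } from 'node:child_process';
    import { writeFileSync } from 'node:fs';

    // Capture the commit hash at build time. The COMMIT_HASH override is a
    // hypothetical escape hatch for sandboxed builds without a .git directory.
    function getCommitHash(): string | null {
      if (process.env.COMMIT_HASH != null && process.env.COMMIT_HASH !== '') {
        return process.env.COMMIT_HASH;
      }
      try {
        return execFileSync('git', ['rev-parse', 'HEAD'], { encoding: 'utf8' }).trim();
      } catch {
        // No .git available (e.g. inside a nix-build sandbox).
        return null;
      }
    }

    writeFileSync(
      'build.json',
      JSON.stringify(
        { commitHash: getCommitHash(), builtAt: new Date().toISOString() },
        null,
        2,
      ),
    );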

@CMCDragonkai
Member Author

@tegefaulkes as mentioned earlier today, the version metadata should be working and deployed and fetchable from PKNS. So you should be able to complete the entire CI loop in #94 now. That will be the priority.

@amydevs For the remainder of this epic, it's about fixing up how we get the build information (see the comment above), storing the deployment information in the Supabase DB, graduating the Supabase configuration to Pulumi, and updating PKND to show that deployment information.

Metrics infrastructure can be separately done.

@CMCDragonkai
Member Author

We will begin dog-fooding Polykey as soon as this is done. @tegefaulkes @amydevs @brynblack

@amydevs
Member

amydevs commented Jan 17, 2024

> To get build time information into PK CLI, we use build.json, which is produced by scripts/build.js. […]

After that, that should be it, yeah. I have a separate issue on PKNS for tracking the state stuff @CMCDragonkai

@CMCDragonkai
Member Author

I think we should change our PK_NETWORK option to be the fully qualified domain name: instead of testnet, it should be testnet.polykey.com.

@CMCDragonkai
Member Author

> I think we should change our PK_NETWORK option to be the fully qualified domain name: instead of testnet, it should be testnet.polykey.com.

Did an issue get created for this @tegefaulkes?

Also I think to fully finish off this issue we should have the rest of the deployment information. But as it stands, we can leave this closed.

@tegefaulkes
Contributor

I have an issue MatrixAI/Polykey-CLI#97 for it.

@CMCDragonkai
Member Author

@amydevs mentioned that the metrics are now batched up by PKNS and stored in TimescaleDB, so PKND is not re-requesting data from the AWS backend. It's pretty good right now - fetched every 30 minutes atm with a whole batch of data.

@CMCDragonkai
Member Author

The rest of the deployment table information is being discussed in the Orchestrator project now.

@CMCDragonkai
Member Author

CMCDragonkai commented Jan 31, 2024

@amydevs since PKNS has to be contacted by PKND, is it exposed to the wider internet, or is there a CF worker API acting as a gateway under testnet.polykey.com/api and mainnet.polykey.com/api?

If it is exposed to the wider internet, we should have TLS support for these endpoints. If it's locked to the CF worker gateway (access-control wise), then we can be less stringent.

@amydevs
Member

amydevs commented Jan 31, 2024

The Cloudflare DNS records have the proxy enabled, which automatically enables TLS. The proxy forwards the TLS traffic to our insecure endpoints on the AWS IPv4 address. This is only done for /api; everything else is proxied to the CF Worker that hosts the Docusaurus frontend.

@CMCDragonkai
Member Author

Hmm, I'm not understanding this. Is it possible to publicly hit the PKNS endpoint via a non-TLS route?
