API-2: [feat. request]: Add health check endpoint(s) to support a public status page #125

Closed
26 tasks done
benefacto opened this issue Apr 29, 2021 — with Board Genius Sync · 11 comments

benefacto commented Apr 29, 2021

Description

We need a public status page to communicate downtime and issues to our users, similar to what TheGraph has. To create this status page, one or more endpoints will need to be added to the API (something like a /health route) to provide health checks for the APIs, TheGraph subgraphs, and ideally smart contracts (i.e. Badger API endpoints returning 200s, subgraph endpoints returning 200s, smart contract reads returning normal responses, etc.). The ultimate monitoring of these is likely to come from the Badger Prometheus instance, which is still in development and not yet available to consume. In the meantime, Prometheus could be run locally in a Docker container and configured to monitor the required sources.

Request Metrics

Metric | Rating
Priority | medium
Effort | high
Risk | medium

Options: low, medium, high

Documentation (optional)

Resources (optional)

  • Badger Prometheus instance

Scope

APIs

Contracts

Ethereum Mainnet

Note: these contracts were obtained from the latest v2-ui configuration on master; some contracts were excluded due to a lack of read methods, unverified code on Etherscan, or duplicate names

DIGG system
Logic
Tokens

Subgraphs

Binance Smart Chain

Tokens
Logic
Sett System
Controllers

Development

  • Create a controller: HealthController that includes a /health endpoint for a full health check, plus paths for narrower health checks such as /health/contracts, /health/subgraphs, and /health/apis
  • Create the following services: HealthService abstract class with extended classes -- ContractsHealthService, SubgraphsHealthService, ApisHealthService
  • Import all necessary contract ABIs to new folders: src/health/config/abi/digg, src/health/config/abi/logic, src/health/config/abi/tokens
  • Add a health.ts configuration (originally planned as health.yml) to the base directory that specifies which contracts, subgraphs, and APIs should be checked
  • Remove TypeChain and rework the added ABIs to match those present in the abi folder: a flat structure, an abi file name postfix, and the ABIs manually declared as TypeScript variables rather than generated code like TypeChain produces
  • Add health logic to ContractsHealthService to test every parameter-less view method of the configured contracts
  • Figure out an efficient way of making view method calls (e.g. JsonRpcBatchProvider)
  • Report results (success/failure) of contract methods
  • Add contracts health logic to HealthController at /health/contracts to return JSON data
  • Add contract names to abis: maybe can get this from the import key
  • Rework config to draw API/subgraphs from other configs
  • Add health logic to SubgraphsHealthService to run example queries on each configured subgraph and report results (success/failure)
  • Add subgraph health logic to HealthController at /health/subgraphs to return JSON data
  • Add health logic to ProviderHealthService to run test RPC calls (like get block number or something) on each configured provider and report results (success/failure)
  • Add provider logic to HealthController at /health/providers to return JSON data
  • Add health logic to ApisHealthService to perform parameter-less API GET calls and report results (success/failure)
  • Add API health logic to HealthController at /health/apis to return JSON data
  • Add all health check logic to HealthController at /health to return JSON data
    Rework services & API to only retrieve cached data
    Add scheduled task to push cache data (example of caching, example of scheduled task)
    Delete any cached data that is older than 90 days via a scheduled task (not doing a React app)
  • Remove storage/caching logic
  • Refactor and clean up code
  • Add name of subgraph, and isError property
  • Fix linting error
    Add storage and caching: this will be handled by the status page
  • Add tests: TBD
  • Add docs to describe how to maintain & add/remove new contracts/apis/subgraphs
  • Comment pull request so it can be effectively reviewed: see message here
  • Undo serverless.yml comments before merging
    Integrate into selected status page: depends on this issue; implementation still TBD, this may be broken out into a separate issue
  • Integrate health logic into scout: implementation still TBD, this may be broken out into a separate issue on the scout side
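
The service layout in the task list above could be sketched roughly as follows. This is a minimal illustration, not the code that was merged: the HealthResult shape is an assumption (only the isError property is named in the list), the AbiEntry type covers just the ABI fields needed here, and the contract call is injected as a plain function so the sketch does not depend on any web3 library. It also assumes ABIs that carry stateMutability; older ABIs that only set constant would need an extra check.

```typescript
// Hypothetical result shape; only `isError` is named in the task list.
interface HealthResult {
  name: string;
  isError: boolean;
  error?: string;
}

// Minimal subset of a JSON ABI entry needed to pick callable methods.
interface AbiEntry {
  type: string;
  name?: string;
  stateMutability?: string;
  inputs?: unknown[];
}

// Select every parameter-less view method from an ABI.
function parameterlessViewMethods(abi: AbiEntry[]): string[] {
  const names: string[] = [];
  for (const entry of abi) {
    const inputs = entry.inputs ?? [];
    if (
      entry.type === "function" &&
      entry.stateMutability === "view" &&
      inputs.length === 0 &&
      entry.name !== undefined
    ) {
      names.push(entry.name);
    }
  }
  return names;
}

abstract class HealthService {
  abstract check(): Promise<HealthResult[]>;

  // Run one probe and turn a thrown error into a failed result.
  protected async probe(
    name: string,
    fn: () => Promise<unknown>
  ): Promise<HealthResult> {
    try {
      await fn();
      return { name, isError: false };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { name, isError: true, error: message };
    }
  }
}

class ContractsHealthService extends HealthService {
  private readonly contractName: string;
  private readonly abi: AbiEntry[];
  // Injected caller (assumption): invokes the named view method.
  private readonly call: (method: string) => Promise<unknown>;

  constructor(
    contractName: string,
    abi: AbiEntry[],
    call: (method: string) => Promise<unknown>
  ) {
    super();
    this.contractName = contractName;
    this.abi = abi;
    this.call = call;
  }

  check(): Promise<HealthResult[]> {
    return Promise.all(
      parameterlessViewMethods(this.abi).map((method) =>
        this.probe(`${this.contractName}.${method}`, () => this.call(method))
      )
    );
  }
}
```

SubgraphsHealthService and ApisHealthService could extend the same base class and reuse probe, with an example query or a GET request in place of the contract call.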
@Tritium-VLK

It would also be nice if you could include information around:

Rewards Cycle Time
Harvest Time on setts (harvest is for all setts per chain, I think)

These are other operational data our community cares about. Maybe this doesn't fit well into changes in the API, but if you can easily build in metrics points/health checks that would be nice.

Cycles should occur every 2 hours but are sometimes late; more than 4-6 hours is bad/go red.
BSC harvests should occur every 2 hours
ETH harvests should occur every 24 hours
These numbers should be dynamic.

This is a suggestion not a requirement.
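
One way to picture the thresholds in this comment is a small freshness check with the intervals passed in as data, so the numbers stay configurable. This is only a sketch: the type and function names are hypothetical, and the 4-hour red line is one reading of the "4-6 hours" range above.

```typescript
type Freshness = "ok" | "late" | "red";

// Hypothetical schedule description; both values are configurable.
interface Schedule {
  expectedIntervalMs: number; // how often the job is expected to run
  redAfterMs: number; // age of the last run at which status turns red
}

const HOUR_MS = 60 * 60 * 1000;

// Classify how stale the last run is relative to its schedule.
function classifyFreshness(
  lastRunMs: number,
  nowMs: number,
  schedule: Schedule
): Freshness {
  const ageMs = nowMs - lastRunMs;
  if (ageMs > schedule.redAfterMs) return "red";
  if (ageMs > schedule.expectedIntervalMs) return "late";
  return "ok";
}

// Rewards cycle: expected every 2 hours; red after 4 hours (assumed).
const rewardsCycle: Schedule = {
  expectedIntervalMs: 2 * HOUR_MS,
  redAfterMs: 4 * HOUR_MS,
};
```

The BSC and ETH harvest intervals (2 hours and 24 hours) would be further Schedule values rather than new code.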

@benefacto
Contributor Author

Those are good suggestions, @BenLeibig, but they are probably out of scope for this particular issue, which is focused on health checks for infrastructure (e.g. uptime/downtime for subgraphs, APIs, smart contracts, etc.), so I created a separate issue for this: #129

@Tritium-VLK

Tritium-VLK commented May 3, 2021

I'd also recommend having your health check endpoint return JSON so that it can be called directly from sites, and then we can write a little job to collect data from the API and present it in Prometheus format. That code can even be part of daowatch/scout instead of the main API, so you can just deal with JSON there, which is cleaner.
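
As a rough sketch of the "little job" described here, the function below turns JSON health results into the Prometheus text exposition format. The metric name, the kind label, and the HealthResult shape are all assumptions for illustration; nothing in this thread fixes them.

```typescript
// Assumed shape of one entry in the JSON returned by a health endpoint.
interface HealthResult {
  name: string;
  isError: boolean;
}

// Escape backslash, double quote and newline inside a label value.
function escapeLabel(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Render results as one gauge: 1 when healthy, 0 when failing.
function toPrometheus(
  metric: string,
  kind: string,
  results: HealthResult[]
): string {
  const lines = [
    `# HELP ${metric} 1 if the last health check succeeded, 0 otherwise.`,
    `# TYPE ${metric} gauge`,
  ];
  for (const r of results) {
    const labels = `kind="${escapeLabel(kind)}",name="${escapeLabel(r.name)}"`;
    lines.push(`${metric}{${labels}} ${r.isError ? 0 : 1}`);
  }
  return lines.join("\n") + "\n";
}
```

For example, toPrometheus("badger_health_up", "subgraph", results) would give one sample per subgraph, where badger_health_up is a made-up metric name.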

@benefacto
Contributor Author

Support for YAML as configuration within TypeScript does not seem to be very good, so I will go with a TypeScript configuration file instead.
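
A typed configuration file along these lines might look like the sketch below. It is only an illustration of the idea: every interface, field name, URL, and address is a placeholder, and the real health.ts may be organized differently.

```typescript
// health.ts (sketch): every name, URL and address below is a placeholder.
interface ContractTarget {
  name: string;
  address: string;
  abiFile: string; // hypothetical file name under src/health/config/abi
}

interface EndpointTarget {
  name: string;
  url: string;
}

interface ChainHealthConfig {
  contracts: ContractTarget[];
  subgraphs: EndpointTarget[];
  providers: EndpointTarget[];
}

interface HealthConfig {
  apis: EndpointTarget[];
  chains: Record<string, ChainHealthConfig>;
}

const healthConfig: HealthConfig = {
  apis: [{ name: "example-api", url: "https://api.example.com/v2/setts" }],
  chains: {
    ethereum: {
      contracts: [
        {
          name: "ExampleToken",
          address: "0x" + "0".repeat(40), // placeholder, not a real contract
          abiFile: "tokens/example.abi",
        },
      ],
      subgraphs: [{ name: "example-subgraph", url: "https://graph.example.com" }],
      providers: [{ name: "example-rpc", url: "https://rpc.example.com" }],
    },
  },
};

// Cheap sanity check the compiler cannot do: address format.
function findConfigProblems(config: HealthConfig): string[] {
  const problems: string[] = [];
  for (const [chain, targets] of Object.entries(config.chains)) {
    for (const contract of targets.contracts) {
      if (!/^0x[0-9a-fA-F]{40}$/.test(contract.address)) {
        problems.push(`${chain}/${contract.name}: malformed address`);
      }
    }
  }
  return problems;
}
```

Compared with YAML, the compiler checks the shape of the file, so a missing url or a misspelled key fails at build time instead of at runtime.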

@benefacto
Contributor Author

Need to add Binance Smart Chain Contracts as well, referencing Binance Smart Chain Explorer

@benefacto
Contributor Author

Running into a lot of trouble getting dynamic imports to work due to this TypeScript issue

@benefacto
Contributor Author

Seems to be working now but need to resolve some build issues

@benefacto
Contributor Author

Decided not to use dynamic imports here, as they caused many more problems than they solved

@benefacto
Contributor Author

Will try using ethers.js's JsonRpcBatchProvider to batch these view method calls so they hopefully don't take forever
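
For a batching provider to help, the calls have to be in flight together rather than awaited one by one. The sketch below shows that calling pattern only; it does not import ethers, and the ViewCall and CallOutcome types are hypothetical. As far as I understand, ethers v5's JsonRpcBatchProvider groups requests that are issued close together, so each run function here would wrap a contract method bound to such a provider.

```typescript
// Hypothetical wrapper around one view method call.
interface ViewCall {
  label: string;
  run: () => Promise<unknown>;
}

interface CallOutcome {
  label: string;
  ok: boolean;
  error?: string;
}

// Start every call before awaiting any of them, then collect outcomes.
async function runConcurrently(calls: ViewCall[]): Promise<CallOutcome[]> {
  const pending = calls.map(async (call): Promise<CallOutcome> => {
    try {
      // `run` is invoked synchronously inside map, so all requests
      // are issued in the same tick.
      await call.run();
      return { label: call.label, ok: true };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { label: call.label, ok: false, error: message };
    }
  });
  return Promise.all(pending);
}
```

Because each call catches its own error, one reverting method does not hide the results of the others.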

@benefacto
Contributor Author

Getting this error when calling some view methods, may need to manually specify gas limit:

Error: cannot estimate gas; transaction may fail or may require manual gas limit (error={"code":-32000}, method="call", transaction={"to":"0x5435Fc74aeD67C81BB500490A4365AF5e6021bba","data":"0x722713f7","accessList":null}, code=UNPREDICTABLE_GAS_LIMIT, version=providers/5.2.0)

@benefacto
Contributor Author

Manually specifying a very high gas limit (9 million) mostly worked, but I'm still getting this error on some methods on some contracts.
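
The workaround could be wrapped in a small retry helper, sketched below. It assumes the view method accepts an overrides object with a gasLimit field as its last argument (as ethers v5 contract methods do) and that the error carries the UNPREDICTABLE_GAS_LIMIT code shown in the message above. The helper and type names are hypothetical.

```typescript
// Assumed subset of the overrides accepted by a contract call.
interface CallOverrides {
  gasLimit?: number;
}

// A parameter-less view method that optionally takes overrides.
type ViewMethod = (overrides?: CallOverrides) => Promise<unknown>;

// The "very high" limit mentioned in this comment.
const FALLBACK_GAS_LIMIT = 9_000_000;

function isGasEstimationError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "UNPREDICTABLE_GAS_LIMIT"
  );
}

// Try the plain call first; retry once with an explicit gas limit.
async function callWithGasFallback(method: ViewMethod): Promise<unknown> {
  try {
    return await method();
  } catch (err) {
    if (!isGasEstimationError(err)) throw err;
    // May still throw: some methods fail even with a high limit.
    return method({ gasLimit: FALLBACK_GAS_LIMIT });
  }
}
```

A method that still fails on the retry simply propagates its error, which a health check can report as a failed result.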

@Tritium-VLK Tritium-VLK changed the title [feat. request]: Add health check endpoint(s) to support a public status page API-2: [feat. request]: Add health check endpoint(s) to support a public status page Jun 4, 2021