Outage because of "JsonRpcProvider failed to detect network and cannot start up; retry in 1s" #312

@bajtos

While investigating CheckerNetwork/node#569, I noticed that spark-evaluate logs are full of the following error messages:

2024-08-09T07:19:31Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:21:42Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:23:53Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:26:04Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:28:15Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:30:26Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:32:37Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:34:48Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:36:59Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:39:10Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:41:21Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)
2024-08-09T07:43:32Z app[e2867541be3e68] cdg [info]JsonRpcProvider failed to detect network and cannot start up; retry in 1s (perhaps the URL is wrong or the node is not started)

I think that this brought down the spark-evaluate service.

How can we detect this problem and send an alert to Slack?

What higher-level metric is affected? A bunch of charts in the Internal Spark Dashboard don't show any data points after 2024-08-08 14:08.

[Screenshot 2024-08-09 at 09 53 33] [Screenshot 2024-08-09 at 09 53 45]

Can we create a new metric similar to "unpublished measurements max age" but for round evaluations, and trigger an alert when no round evaluation has been posted for more than 30 minutes?
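A minimal sketch of what such a "round evaluation max age" check could look like. All names here (`roundEvaluationAgeMs`, `shouldAlert`, `MAX_EVALUATION_AGE_MS`) are hypothetical and not taken from spark-evaluate; how the timestamp of the last evaluation is obtained and how the alert reaches Slack are left open.

```typescript
// Hypothetical names throughout; nothing here is taken from spark-evaluate.
const MAX_EVALUATION_AGE_MS = 30 * 60 * 1000; // the 30-minute threshold proposed above

// Age of the most recent round evaluation, in milliseconds.
// Returns Infinity when no evaluation has been recorded at all.
function roundEvaluationAgeMs(lastEvaluatedAt: Date | null, now: Date): number {
  if (lastEvaluatedAt === null) return Infinity;
  // Clamp to zero so that small clock skew never yields a negative age.
  return Math.max(0, now.getTime() - lastEvaluatedAt.getTime());
}

// True when the age exceeds the threshold and an alert should fire.
function shouldAlert(
  lastEvaluatedAt: Date | null,
  now: Date,
  maxAgeMs: number = MAX_EVALUATION_AGE_MS
): boolean {
  return roundEvaluationAgeMs(lastEvaluatedAt, now) > maxAgeMs;
}
```

For example, taking the last data point at 2024-08-08 14:08 (assumed to be UTC here) and the first log line above at 2024-08-09T07:19:31Z, `shouldAlert(new Date("2024-08-08T14:08:00Z"), new Date("2024-08-09T07:19:31Z"))` returns `true`. Treating "no evaluation recorded" as infinitely old means a service that never starts evaluating also triggers the alert.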

If that's not possible, then a last-resort option is to create a Papertrail filter to detect these error messages and trigger an alert. This can be too noisy, though.
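One way to reduce the noise would be to alert only when the message repeats several times within a short window, rather than on every match. The sketch below illustrates that idea on raw log lines in the format shown above; it is not a Papertrail configuration, and the function names, the match count, and the window size are all assumptions.

```typescript
// Hypothetical helper names; the pattern is the message text from the logs above.
const ERROR_PATTERN = /JsonRpcProvider failed to detect network and cannot start up/;

// Collect the leading ISO timestamp of every line that contains the error message.
function matchingTimestamps(lines: string[]): Date[] {
  const result: Date[] = [];
  for (const line of lines) {
    if (!ERROR_PATTERN.test(line)) continue;
    // The timestamp is assumed to be the first space-delimited token of the line.
    const ts = new Date(line.substring(0, line.indexOf(" ")));
    if (!Number.isNaN(ts.getTime())) result.push(ts);
  }
  return result;
}

// True when at least `minMatches` error lines fall within some window of `windowMs`.
function isSustained(timestamps: Date[], minMatches: number, windowMs: number): boolean {
  const sorted = timestamps.map((t) => t.getTime()).sort((a, b) => a - b);
  return sorted.some((start, i) => {
    const end = sorted[i + minMatches - 1];
    return end !== undefined && end - start <= windowMs;
  });
}
```

With the log excerpt above, where the message repeats roughly every two minutes, `isSustained(matchingTimestamps(lines), 3, 10 * 60 * 1000)` would fire, while a single isolated occurrence would not.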

Status

✅ done
