Service outage 20230717 #571

danrademacher · 2023-07-18T01:09:04Z

Bug description
Several of our OHM services are going down for unknown reasons:

Per https://stats.uptimerobot.com/0BBDoIkXKJ, currently the website, Overpass, and Taginfo are all down:

I confirmed that these are not loading for me.

The one red light on New Relic is this pod

danrademacher · 2023-07-18T01:18:46Z

Now Taginfo is back up:

I suspect Overpass and the website will come back online as well, but they have not yet done so.

danrademacher · 2023-07-18T03:22:38Z

The other services came back at 8:18 pm PT (03:18 UTC).

Rub21 · 2023-07-18T21:57:19Z

@danrademacher @geohacker @batpad
I have been investigating why the services were down for more than two hours, and I have reviewed each of the components involved.

Incidents

Service	Incident started at (UTC)	Resolved at (UTC)	Duration
OHM Website	2023-07-18 00:29:15	2023-07-18 00:33:08	3 minutes and 53 seconds
OHM Website	2023-07-18 00:43:15	2023-07-18 03:18:14	2 hours and 34 minutes
OHM TagInfo	2023-07-18 00:26:25	2023-07-18 01:14:32	48 minutes and 7 seconds
OHM Overpass API	2023-07-18 01:03:43	2023-07-18 03:18:42	2 hours and 14 minutes
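
As a cross-check on the table above, here is a minimal sketch that recomputes each duration and the overall outage window from the listed timestamps. The names `INCIDENTS`, `incident_durations`, and `outage_window` are made up for this illustration; only the timestamps come from the table.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (service, started, resolved) copied from the incident table; all times are UTC.
INCIDENTS = [
    ("OHM Website", "2023-07-18 00:29:15", "2023-07-18 00:33:08"),
    ("OHM Website", "2023-07-18 00:43:15", "2023-07-18 03:18:14"),
    ("OHM TagInfo", "2023-07-18 00:26:25", "2023-07-18 01:14:32"),
    ("OHM Overpass API", "2023-07-18 01:03:43", "2023-07-18 03:18:42"),
]


def incident_durations(incidents):
    """Return (service, duration) pairs, one per incident row."""
    return [
        (service, datetime.strptime(end, FMT) - datetime.strptime(start, FMT))
        for service, start, end in incidents
    ]


def outage_window(incidents):
    """Return (earliest start, latest resolution) across all incidents."""
    starts = [datetime.strptime(start, FMT) for _, start, _ in incidents]
    ends = [datetime.strptime(end, FMT) for _, _, end in incidents]
    return min(starts), max(ends)


if __name__ == "__main__":
    for service, duration in incident_durations(INCIDENTS):
        print(f"{service}: {duration}")
    first, last = outage_window(INCIDENTS)
    print(f"Overall window: {first} to {last} UTC ({last - first})")
```

Computed this way, the gap between the first alert (TagInfo) and the last recovery (Overpass API) is a little under three hours, which is longer than any single incident row.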

AWS Service health

AWS did not report any EC2 interruptions. The only reported issue was related to CloudFront, and it occurred at a different time than our service downtime.

https://health.aws.amazon.com/health/status

osmseed-production cluster

During the service downtime, the other services within the cluster were operating normally. Some of them generated errors, but these seem to be related to another container that is not in use, namely 'dashboard-metrics-scraper'.

https://onenr.io/0ERPMPYnvjW
https://onenr.io/0nQxP0YY5QV

DB Logs

For some reason, the database pod was halted during the service downtime. No logs were recorded for this time period.
https://onenr.io/0oQDKkGrDjy

Web container Logs

Similarly, the web service did not generate any unusual logs during the downtime.

https://onenr.io/01wZvD3mvw6

Taginfo logs

https://onenr.io/0KQXGgP5Eja

Overpass API logs

https://onenr.io/0yw4NqoxLj3

Metrics-server Issue

After investigating, it seems that the issue is related to "metrics-server". The metrics-server is a Kubernetes system component that collects and stores performance metrics from nodes and pods in the cluster.

kubectl get pods --all-namespaces -o wide
kubectl describe pod metrics-server-7668599459-dft95 -n kube-system
kubectl logs metrics-server-7668599459-dft95 -n kube-system

outputs:

Error: failed to get delegated authentication kubeconfig: failed to get delegated authentication kubeconfig: open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied
.....
panic: failed to get delegated authentication kubeconfig: failed to get delegated authentication kubeconfig: open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied

goroutine 1 [running]:
main.main()
	/go/src/github.com/kubernetes-incubator/metrics-server/cmd/metrics-server/metrics-server.go:39 +0x13b

I still don't understand why this is happening. I checked the staging cluster, and everything is working fine there. It may be related to a token expiration, but I can't pinpoint the cause yet.

Due to the "token: permission denied" issue, the web, taginfo, and overpass pods were unable to restart or collect metrics. However, it's strange that the services were restored after more than two hours. The first step would be to resolve this "open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied" issue. Here are some possible solutions according to ChatGPT.

@batpad, I would like your suggestions on this issue. From my perspective, the most viable option would be to delete the metrics-server pods and restore them using a metrics template, e.g. components.yml.

jeffreyameyer · 2023-07-19T16:46:06Z

Overpass API appears to be working (i.e., responding to read requests), but it is out of sync or possibly not syncing at all. Changes I made yesterday have yet to show up in its results.

danrademacher added the high_pri label Jul 18, 2023

danrademacher assigned batpad Jul 18, 2023

danrademacher added this to Backlog in Infrastructure via automation Jul 18, 2023

danrademacher assigned geohacker and Rub21 Jul 18, 2023

danrademacher moved this from Backlog to In progress in Infrastructure Jul 18, 2023

jeffreyameyer added infrastructure Deploy / Infrastructure issues overpass taginfo labels Jul 19, 2023

danrademacher mentioned this issue Jul 25, 2023

Enable EKS upgrade email notifications #572

Open

danrademacher closed this as completed Sep 1, 2023
