Service outage 20230717 #571

Closed
danrademacher opened this issue Jul 18, 2023 · 4 comments

@danrademacher
Member

Bug description
Several of our OHM services are down for unknown reasons:

Per https://stats.uptimerobot.com/0BBDoIkXKJ, currently the website, Overpass, and Taginfo are all down:
image

I confirmed that these are not loading for me.

The one red light on New Relic is this pod
image

@danrademacher danrademacher added this to Backlog in Infrastructure via automation Jul 18, 2023
@danrademacher danrademacher moved this from Backlog to In progress in Infrastructure Jul 18, 2023
@danrademacher
Member Author

Now Taginfo is back up:
image

I suspect Overpass and the website will come back online as well, but they have not yet done so.

@danrademacher
Member Author

The other services came back at 8:18 pm PT

@Rub21

Rub21 commented Jul 18, 2023

@danrademacher @geohacker @batpad
I have been investigating why the service was down for almost two hours, and I have evaluated each of the components involved.

Incidents

| Service | Incident started at (UTC) | Resolved at (UTC) | Duration |
| --- | --- | --- | --- |
| OHM Website | 2023-07-18 00:29:15 | 2023-07-18 00:33:08 | 3 minutes and 53 seconds |
| OHM Website | 2023-07-18 00:43:15 | 2023-07-18 03:18:14 | 2 hours and 34 minutes |
| OHM TagInfo | 2023-07-18 00:26:25 | 2023-07-18 01:14:32 | 48 minutes and 7 seconds |
| OHM Overpass API | 2023-07-18 01:03:43 | 2023-07-18 03:18:42 | 2 hours and 14 minutes |
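
The durations in the incident list can be re-derived from the start and end timestamps. A minimal sketch, assuming GNU `date` (its `-d` option is not available in BSD/macOS `date`); the function name `incident_duration` is made up for this example:

```shell
# Compute an incident's duration from its UTC start and end timestamps.
# Assumes GNU date; incident_duration is a hypothetical helper name.
incident_duration() {
  local start end secs
  start=$(date -u -d "$1" +%s)   # start time as seconds since the epoch
  end=$(date -u -d "$2" +%s)     # end time as seconds since the epoch
  secs=$((end - start))
  printf '%dh %dm %ds\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# Second OHM Website incident and the OHM Overpass API incident
incident_duration "2023-07-18 00:43:15" "2023-07-18 03:18:14"
incident_duration "2023-07-18 01:03:43" "2023-07-18 03:18:42"
```

Both long incidents resolve within 30 seconds of each other (03:18:14 and 03:18:42 UTC), which matches the 8:18 pm PT recovery noted earlier in the thread.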

AWS Service health

AWS did not report any interruptions on EC2. The only reported issue was related to CloudFront, and it occurred at a different time than our service downtime.

https://health.aws.amazon.com/health/status
image

osmseed-production cluster

During the service downtime, other services within the cluster were operating normally. Some generated errors, but these seem to be related to another container that is not in use, namely 'dashboard-metrics-scraper'.

https://onenr.io/0ERPMPYnvjW
https://onenr.io/0nQxP0YY5QV
image

DB Logs

For some reason, the database pod was halted during the service downtime. No logs were recorded for this time period.
https://onenr.io/0oQDKkGrDjy

image

Web container Logs

Similarly, the web service did not generate any unusual logs during the downtime.

https://onenr.io/01wZvD3mvw6
image

Taginfo logs

https://onenr.io/0KQXGgP5Eja

Overpass API logs

https://onenr.io/0yw4NqoxLj3

Metrics-server Issue

After investigating, it seems that the issue is related to "metrics-server". The metrics-server is a Kubernetes system component that collects and stores performance metrics from nodes and pods in the cluster.

Screenshot 2023-07-18 at 4 49 16 PM

kubectl get pods --all-namespaces -o wide
kubectl describe pod metrics-server-7668599459-dft95 -n kube-system
kubectl logs metrics-server-7668599459-dft95 -n kube-system

outputs:

Error: failed to get delegated authentication kubeconfig: failed to get delegated authentication kubeconfig: open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied
.....
panic: failed to get delegated authentication kubeconfig: failed to get delegated authentication kubeconfig: open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied

goroutine 1 [running]:
main.main()
	/go/src/github.com/kubernetes-incubator/metrics-server/cmd/metrics-server/metrics-server.go:39 +0x13b

I still don't understand why this issue is happening. I checked the staging cluster, and everything is working fine there. It might be related to a token expiration, but I can't pinpoint it yet.

Due to the "token: permission denied" issue, the web, taginfo, and overpass pods were unable to restart or collect metrics. However, it's strange that the services were restored after two hours. The first step would be to resolve this "open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied" issue. Here are some possible solutions according to ChatGPT

@batpad, I would like your suggestion on this issue. From my perspective, the most viable option would be to delete the metrics-server pods and restore them using a metrics template, e.g. components.yml
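
The delete-and-restore option could look roughly like the sketch below. It is untested against this cluster and rests on several assumptions: `components.yml` is a local copy of the metrics-server manifest, the Deployment is named `metrics-server` (inferred from the pod name), and `run`, `restore_metrics_server`, and `DRY_RUN` are names invented for this example. By default it only prints the commands, so it can be reviewed before anything is changed.

```shell
# Sketch of the proposed recovery: delete the failing pod, then reapply the manifest.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
POD="metrics-server-7668599459-dft95"   # pod name from the kubectl commands above
MANIFEST="components.yml"               # assumed local path to the manifest

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

restore_metrics_server() {
  run kubectl delete pod "$POD" -n kube-system
  run kubectl apply -f "$MANIFEST"
  # Assumes the Deployment is named metrics-server.
  run kubectl rollout status deployment/metrics-server -n kube-system
}

restore_metrics_server
```

Setting `DRY_RUN=0` would run the commands for real. If the "permission denied" error comes from the pod spec itself, a recreated pod may fail the same way, so the new pod's logs would still need checking afterwards.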

@jeffreyameyer
Member

Overpass API appears to be working (i.e., responding to read requests), but it is out of sync or possibly not syncing at all. Changes I made yesterday have yet to show up in its results.
