Service outage 20230717 #571
Comments
The other services came back at 8:18 pm PT.
@danrademacher @geohacker @batpad

Incidents

AWS Service health
The service has not experienced any interruptions on EC2. There was only one issue related to CloudFront, which occurred at a different time than the service downtime.
https://health.aws.amazon.com/health/status

osmseed-production cluster
During the service downtime, the other services within the cluster were operating normally. Some of them generated errors, but these seem to be related to a container that is not in use, namely 'dashboard-metrics-scraper'.
https://onenr.io/0ERPMPYnvjW

DB Logs
For some reason, the database pod was halted during the service downtime. No logs were recorded for this time period.

Web container Logs
Similarly, the web service did not generate any unusual logs during the downtime.

Taginfo logs

Overpass API logs

Metrics-server Issue
After investigating, the issue seems to be related to "metrics-server". The metrics-server is a Kubernetes system component that collects and stores performance metrics from nodes and pods in the cluster.

kubectl get pods --all-namespaces -o wide
kubectl describe pod metrics-server-7668599459-dft95 -n kube-system
kubectl logs metrics-server-7668599459-dft95 -n kube-system

outputs:

Error: failed to get delegated authentication kubeconfig: failed to get delegated authentication kubeconfig: open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied
.....
panic: failed to get delegated authentication kubeconfig: failed to get delegated authentication kubeconfig: open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied
goroutine 1 [running]:
main.main()
/go/src/github.com/kubernetes-incubator/metrics-server/cmd/metrics-server/metrics-server.go:39 +0x13b

I still don't understand why this issue is happening. I checked the staging cluster, and everything is working fine there. It might be related to a token expiration, but I can't pinpoint it yet.

Due to the "token: permission denied" issue, the web, taginfo, and overpass pods were unable to restart or collect metrics. However, it's strange that the services were restored after two hours.

The first step would be to resolve this "open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied" issue. Here are some possible solutions according to ChatGPT.

@batpad, I would like your suggestion about this issue. From my perspective, the most viable option would be to delete the metrics-server pods and restore them using a metrics template, e.g. components.yml
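For reference, here is a rough shell sketch of the check and the "delete and restore" option described above. It is not a confirmed fix for the root cause. The `KUBECTL` and `NAMESPACE` variables and all function names are hypothetical helpers for illustration, the manifest path is whatever file you choose to pass in, and the sketch assumes the metrics-server pod is managed by a Deployment that recreates it after deletion.

```shell
# Sketch only: KUBECTL, NAMESPACE, and the function names are illustrative,
# not part of the osm-seed tooling.
KUBECTL="${KUBECTL:-kubectl}"
NAMESPACE="${NAMESPACE:-kube-system}"

# Read log text on stdin; succeed if it contains the token permission error.
has_token_error() {
  grep -q 'serviceaccount/token: permission denied'
}

# Fetch the logs of the given pod and report whether the token error appears.
check_metrics_server() {
  local pod="$1"
  if $KUBECTL logs "$pod" -n "$NAMESPACE" 2>&1 | has_token_error; then
    echo "token error present in $pod"
  else
    echo "no token error in $pod"
  fi
}

# Delete the pod. Assumption: a Deployment recreates it, which remounts the
# service account token into the new pod.
recreate_metrics_server_pod() {
  local pod="$1"
  $KUBECTL delete pod "$pod" -n "$NAMESPACE"
}

# Re-apply a metrics-server manifest supplied by the caller.
reapply_manifest() {
  local manifest="$1"
  $KUBECTL apply -f "$manifest"
}
```

A possible sequence would be `check_metrics_server metrics-server-7668599459-dft95`, then `recreate_metrics_server_pod metrics-server-7668599459-dft95`, and only if the error persists, `reapply_manifest` with the chosen manifest file. Setting `KUBECTL=echo` prints the commands instead of running them.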
Overpass API appears to be working (i.e., it answers read requests), but is out of sync or possibly not syncing at all. Changes I made yesterday have yet to show up in its results.
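One way to put a number on "out of sync" is to compare the data timestamp that Overpass reports at its `/api/timestamp` endpoint with the current time. The sketch below assumes the instance exposes that standard endpoint and that `curl` is available. The base URL is an argument you must supply, and the function names are hypothetical.

```shell
# Sketch only: function names are illustrative; pass your own Overpass base URL.

# Print the data timestamp reported by an Overpass instance,
# for example 2023-07-17T00:00:00Z.
fetch_overpass_timestamp() {
  local base_url="$1"
  curl -fsS "$base_url/api/timestamp"
}

# Print how many seconds an ISO 8601 UTC timestamp lags behind "now".
# The second argument (epoch seconds) is optional and defaults to the current time.
overpass_lag_seconds() {
  local ts now ts_epoch
  # Turn 2023-07-17T00:00:00Z into "2023-07-17 00:00:00" so date can parse it.
  ts=$(printf '%s' "$1" | sed 's/T/ /; s/Z$//')
  now="${2:-$(date -u +%s)}"
  ts_epoch=$(date -u -d "$ts" +%s) || return 1
  echo $(( now - ts_epoch ))
}
```

For example, `overpass_lag_seconds "$(fetch_overpass_timestamp "$OVERPASS_URL")"` prints the lag in seconds, where `OVERPASS_URL` is a placeholder for the instance's base URL. A lag that keeps growing between checks would suggest replication has stopped rather than fallen behind. The `date -d` form used here works with GNU and BusyBox `date`, not with BSD/macOS `date`.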
Bug description
Several of our OHM services are falling down for unknown reasons:
Per https://stats.uptimerobot.com/0BBDoIkXKJ, currently the website, Overpass, and Taginfo are all down:
I confirmed that these are not loading for me.
The one red light on New Relic is this pod