Investigate downtime during deployment #176

Closed
3 tasks done
guidoiaquinti opened this issue Nov 2, 2021 · 2 comments · Fixed by #179
Labels: bug (Something isn't working), deployments, helm (Helm chart work), P1 (Urgent, non-breaking)

Comments

@guidoiaquinti
Contributor

Bug description

A customer running on AWS is reporting downtime during Helm operations. They are running 2 replicas for web/events.

Expected behavior

No downtime expected

How to reproduce

"It appears that helm creates two new nodes, and terminates the old ones. But it terminates them before the new pods become ready / healthy. For example I can see this shortly after an upgrade, causing NGINX to show a 503 for all requests."

NAME                                                READY   STATUS    RESTARTS   AGE
posthog-web-c4bc4b487-4msjb                         0/1     Running   0          54s
posthog-web-c4bc4b487-cxzhn                         0/1     Running   0          54s

Environment

  • Deployment platform (gcp/aws/...): AWS (possibly everywhere)
  • Chart version/commit: latest
  • PostHog version: latest

Additional context

@Legion2

Legion2 commented Dec 22, 2021

I think the downtime is related to the way the deployments are added via Helm: all deployments are annotated as Helm hooks, so Helm creates and deletes them as part of the hook lifecycle instead of managing them as regular release resources.

# This is what defines this resource as a hook. Without this line, the
# job is considered part of the release.
"helm.sh/hook": "post-install,post-upgrade"
"helm.sh/resource-policy": "keep"
"helm.sh/hook-weight": "1"

I think configuring the deployments as hooks is a mistake.
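
One way to check whether a running release is affected is to look for the hook annotation on the Deployment objects themselves. The following is only a minimal sketch, not part of the chart: the helper name `find_hook_deployments` and the sample data are hypothetical, the Deployment name `posthog-web` is inferred from the pod names in the report, and the input is assumed to be the parsed JSON you would get from something like `kubectl get deployments -o json`.

```python
import json

HOOK_ANNOTATION = "helm.sh/hook"


def find_hook_deployments(manifest_list):
    """Return (name, hook value) pairs for Deployments annotated as Helm hooks.

    `manifest_list` is a parsed Kubernetes List object, for example the
    result of json.loads() on `kubectl get deployments -o json` output.
    """
    flagged = []
    for item in manifest_list.get("items", []):
        if item.get("kind") != "Deployment":
            continue
        metadata = item.get("metadata", {})
        # `annotations` may be absent or null on objects without any.
        annotations = metadata.get("annotations") or {}
        if HOOK_ANNOTATION in annotations:
            flagged.append((metadata.get("name"), annotations[HOOK_ANNOTATION]))
    return flagged


# Hypothetical sample data; only the annotation values come from this thread.
sample = json.loads("""
{
  "kind": "List",
  "items": [
    {
      "kind": "Deployment",
      "metadata": {
        "name": "posthog-web",
        "annotations": {
          "helm.sh/hook": "post-install,post-upgrade",
          "helm.sh/resource-policy": "keep",
          "helm.sh/hook-weight": "1"
        }
      }
    },
    {
      "kind": "Deployment",
      "metadata": {"name": "example-without-hooks"}
    }
  ]
}
""")

print(find_hook_deployments(sample))
```

Any Deployment this reports is being handled through the hook lifecycle rather than as a regular release resource, which would be consistent with the behavior described in the report.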

@guidoiaquinti
Contributor Author

guidoiaquinti commented Dec 22, 2021

> I think the downtime is related to the way the deployments are added via Helm, all deployments are annotated as helm hooks, so helm will create and delete them as part of this lifecycle.
>
> # This is what defines this resource as a hook. Without this line, the
> # job is considered part of the release.
> "helm.sh/hook": "post-install,post-upgrade"
> "helm.sh/resource-policy": "keep"
> "helm.sh/hook-weight": "1"
>
> I think this is a mistake, that the deployments are configured as hooks.

On point! Here's an attempt to remove them, but it's currently blocked by some issues in posthog/posthog.
