Investigate downtime during deployment #176

guidoiaquinti · 2021-11-02T15:25:52Z

Bug description

A customer running on AWS is reporting some downtime during helm operations. They are running with 2 replicas for web/events.

Expected behavior

No downtime expected

How to reproduce

"It appears that helm creates two new nodes, and terminates the old ones. But it terminates them before the new pods become ready / healthy. For example I can see this shortly after an upgrade, causing NGINX to show a 503 for all requests."

NAME                                                READY   STATUS    RESTARTS   AGE
posthog-web-c4bc4b487-4msjb                         0/1     Running   0          54s
posthog-web-c4bc4b487-cxzhn                         0/1     Running   0          54s
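
As a rough illustration of why this output lines up with the 503s: both web pods report 0/1 ready containers, so no pod is eligible to receive traffic. The sketch below parses the text above and lists the pods that are fully ready. The helper names `ready_counts` and `serving_pods` are made up for this example and are not part of the chart or of kubectl.

```python
from typing import Dict, List, Tuple


def ready_counts(kubectl_output: str) -> Dict[str, Tuple[int, int]]:
    """Map pod name -> (ready containers, total containers) from `kubectl get pods` text."""
    counts = {}
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything that does not look like a pod line.
        if len(fields) < 2 or fields[0] == "NAME" or "/" not in fields[1]:
            continue
        ready, total = fields[1].split("/")
        counts[fields[0]] = (int(ready), int(total))
    return counts


def serving_pods(counts: Dict[str, Tuple[int, int]]) -> List[str]:
    """Pods whose containers are all ready, i.e. eligible to receive traffic."""
    return [name for name, (ready, total) in counts.items()
            if total > 0 and ready == total]


sample = """
NAME                                                READY   STATUS    RESTARTS   AGE
posthog-web-c4bc4b487-4msjb                         0/1     Running   0          54s
posthog-web-c4bc4b487-cxzhn                         0/1     Running   0          54s
"""

counts = ready_counts(sample)
print(serving_pods(counts))  # prints [] because neither pod is ready yet
```

With an empty list there is no ready backend for the web service, which is consistent with NGINX returning 503 until at least one pod passes its readiness check.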

Environment

Deployment platform (gcp/aws/...): AWS (possibly everywhere)
Chart version/commit: latest
Posthog version: latest

Additional context

See private Slack channel

Legion2 · 2021-12-22T11:06:30Z

I think the downtime is related to the way the deployments are added via Helm: all deployments are annotated as Helm hooks, so Helm will create and delete them as part of the hook lifecycle.

charts-clickhouse/charts/posthog/templates/web-deployment.yaml

Lines 11 to 15 in 4273cd1

    
    # This is what defines this resource as a hook. Without this line, the
    # job is considered part of the release.
    "helm.sh/hook": "post-install,post-upgrade"
    "helm.sh/resource-policy": "keep"
    "helm.sh/hook-weight": "1"

I think it is a mistake that the deployments are configured as hooks.
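
One way to check a chart for this pattern is to scan the rendered manifests for Deployments that carry the `helm.sh/hook` annotation. The following is a minimal sketch, assuming the manifests have already been parsed into Python dicts (for example from `helm template` output). The function name `hooked_workloads`, the unannotated `posthog-events` entry, and the `migrate` Job are hypothetical and only there to show the filter.

```python
HOOK_ANNOTATION = "helm.sh/hook"


def hooked_workloads(manifests, kinds=("Deployment",)):
    """Return (kind, name, hook value) for resources of the given kinds
    that carry a Helm hook annotation."""
    found = []
    for doc in manifests:
        # Skip empty documents and kinds we are not interested in.
        if not doc or doc.get("kind") not in kinds:
            continue
        metadata = doc.get("metadata") or {}
        annotations = metadata.get("annotations") or {}
        if HOOK_ANNOTATION in annotations:
            found.append((doc["kind"], metadata.get("name", ""),
                          annotations[HOOK_ANNOTATION]))
    return found


# Hypothetical manifests for illustration only; the annotation values on the
# first entry are the ones quoted from web-deployment.yaml above.
manifests = [
    {"kind": "Deployment", "metadata": {"name": "posthog-web", "annotations": {
        "helm.sh/hook": "post-install,post-upgrade",
        "helm.sh/resource-policy": "keep",
        "helm.sh/hook-weight": "1"}}},
    {"kind": "Deployment", "metadata": {"name": "posthog-events"}},
    {"kind": "Job", "metadata": {"name": "migrate", "annotations": {
        "helm.sh/hook": "post-install,post-upgrade"}}},
]

for kind, name, hook in hooked_workloads(manifests):
    print(f"{kind}/{name} is annotated as a hook: {hook}")
```

The Job is ignored by default because hooks are a normal fit for one-off jobs; only long-running workloads such as Deployments are flagged.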

guidoiaquinti · 2021-12-22T11:13:10Z

On point! Here's an attempt to remove them, but it's currently blocked by some issues in posthog/posthog.

guidoiaquinti added the bug Something isn't working label Nov 2, 2021

guidoiaquinti self-assigned this Nov 2, 2021

fuziontech added deployments P1 Urgent, non-breaking labels Nov 10, 2021

guidoiaquinti mentioned this issue Dec 10, 2021

Dependency: ingress-nginx update from '3.25.0' to '4.0.13' #221

Merged

6 tasks

guidoiaquinti assigned fuziontech Dec 20, 2021

guidoiaquinti mentioned this issue Dec 22, 2021

Set pod priorities to reduce probability of ingestion being down #158

Open

fuziontech added the helm Helm chart work label Jan 18, 2022

guidoiaquinti linked a pull request Jan 19, 2022 that will close this issue

Remove all the (possible) Helm annotations #179

Merged

6 tasks

fuziontech mentioned this issue Jan 19, 2022

Sprint 1.33.0 1/2 - Jan 24 to Feb 4 PostHog/posthog#8090

Closed

guidoiaquinti closed this as completed in #179 Jan 20, 2022
