
Add contents/docs/self-host/runbook #2202

Merged · 6 commits merged into master from kafka on Oct 11, 2021
Conversation

guidoiaquinti (Contributor)

Changes

Add a new runbook section in our docs.

[Screenshot: 2021-10-08 at 17:04:32]

I wasn't able to add a subsection (so that we could have runbooks/kafka/), but we can add it later.


#### Resize data disk

##### Requirements
You need to run a Kubernetes cluster with the _Volume Expansion_ feature enabled. This feature has been supported for the majority of volume types since Kubernetes 1.11 (see [docs](https://kubernetes.io/docs/concepts/storage/storage-classes/#allow-volume-expansion)).
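As a quick check (not part of the PR; the storage class name `gp2` is only an example), you can see whether your storage class already allows expansion and, if not, usually enable it in place:

```
# Look for the ALLOWVOLUMEEXPANSION column in the output
➜ kubectl get storageclass

# Enable expansion on an existing storage class ("gp2" is a placeholder name)
➜ kubectl patch storageclass gp2 -p '{ "allowVolumeExpansion": true }'
```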
Contributor:

can't we just do this for users by default?

Contributor:

or is this aimed at users who have already deployed and might run into this issue?

Contributor:

That's another follow-up: we need to check & update the docs for all of them.

Contributor Author:

+1, we need to follow up as well with another PR describing a suggested k8s setup

tiina303 (Contributor), Oct 8, 2021:

Can we break this up into two pieces: advice for when you don't have it enabled (for which we should say you have no option but to nuke it, plus how to create the new volume as expandable if possible, or just suggest oversizing) and advice for when you do (with the implications for both)?

Contributor Author:

I was thinking of tackling this from another angle (with a follow-up PR) by adding a new alert/ folder where you can find what-to-do procedures for when you are running low on free space:

  • vertical scaling (if you have volume expansion enabled)
  • horizontal scaling
  • nuke everything

What do you think about it?

Contributor:

Sure, happy for that to be there, maybe we can link to it. It's just odd to see a runbook have a requirement with something that hasn't always been the default for us.


1. List your pods
```
➜ kubectl get pods
```
Contributor:

Use `-n posthog` everywhere, as that's the default namespace we ask folks to install everything into. The ones who didn't will know to remove it, while the ones who did the default setup will be confused about not seeing anything here.
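For example (a sketch based on that suggestion, assuming the chart was installed into the `posthog` namespace as the docs recommend):

```
# List the PostHog pods in the "posthog" namespace
➜ kubectl get pods -n posthog
```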


1. Resize the underlying PVC (in this example we are resizing it to 20Gi)
```
➜ kubectl patch pvc data-posthog-posthog-kafka-0 -p '{ "spec": { "resources": { "requests": { "storage": "20Gi" }}}}'
```
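A follow-up check one could run (not in the PR; it assumes the `posthog` namespace suggested in the review comment above):

```
# Confirm the PVC now requests/reports the new 20Gi size
➜ kubectl get pvc data-posthog-posthog-kafka-0 -n posthog
```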
Contributor:

❤️ how you made it as fool-proof as possible by not asking someone to edit the file but using patch

Comment on lines 66 to 70
Note: while resizing the PVC you might get an error `disk resize is only supported on Unattached disk, current disk state: Attached`. In this specific case you need to temporarily scale the `StatefulSet` replica value down to zero. **This will briefly disrupt Kafka service availability, and all events after this point will be dropped as event ingestion will stop working.**

You can do that by running: `kubectl patch statefulset posthog-posthog-kafka -p '{ "spec": { "replicas": 0 }}'`

After you successfully resized the PVC, you can restore the initial replica definition with: `kubectl patch statefulset posthog-posthog-kafka -p '{ "spec": { "replicas": 1 }}'`
Contributor:

Can we put this into a box or something visually distinct, so I can more easily skip over it if I didn't run into that error?
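For reference, the full recovery sequence from that note might look roughly like this (a sketch rather than part of the PR; the `-n posthog` flag follows the earlier review suggestion):

```
# Temporarily scale Kafka down so the disk becomes unattached
➜ kubectl patch statefulset posthog-posthog-kafka -n posthog -p '{ "spec": { "replicas": 0 }}'

# Wait for the Kafka pod to terminate, then retry the PVC resize

# Once the resize has gone through, restore the original replica count
➜ kubectl patch statefulset posthog-posthog-kafka -n posthog -p '{ "spec": { "replicas": 1 }}'
```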



1. Delete the `StatefulSet` definition but leave its `pod`s online: `kubectl delete sts --cascade=orphan posthog-posthog-kafka`
tiina303 (Contributor), Oct 8, 2021:

Please explain why we want to leave the pod online (and what is happening while the StatefulSet is deleted: since the pod is still running, do we keep all the event data in memory and hence not lose any events?)
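A quick way to sanity-check that behaviour (an illustration, not from the PR; the namespace flag follows the earlier comment):

```
# After the orphan delete, the Kafka pod should still show as Running
# even though its StatefulSet object is gone
➜ kubectl get pods -n posthog | grep kafka
```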


1. In your Helm chart configuration, update the `kafka.persistence` value in `values.yaml` to the target size (20Gi in this example)

1. Run `helm upgrade` to recycle all the pods and re-deploy the `StatefulSet` definition
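A rough sketch of that step (the release and chart names are assumptions based on a standard PostHog chart install, not taken from this PR):

```
# values.yaml (excerpt)
#   kafka:
#     persistence:
#       size: 20Gi

# Re-apply the chart with the updated values
➜ helm upgrade posthog posthog/posthog -n posthog -f values.yaml
```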
Contributor:

Maybe link to the upgrade commands in the various docs or to https://posthog.com/docs/self-host/configure/upgrading-posthog#upgrade-instructions

Potential problem: they haven't updated in a long time and now run into major chart updates etc. which require some manual steps.
Proposal:
(1) if they are trying to upgrade and the chart is currently in a working state, add a step 0: do a `helm upgrade` first to make sure they're up to date, so the later `helm upgrade` will work fine
(2) if they are in a bad state, e.g. they are doing this because Kafka was full, can we give commands here that don't use `helm upgrade` instead, and suggest that they run `helm upgrade` later?

tiina303 (Contributor), Oct 8, 2021:

Actually, how do you feel about moving update & uninstall under runbooks and just linking to the runbook from the other pages? They are the same on all platforms.

Contributor Author:

If they run `helm upgrade` they are simply going to upgrade their release to a new version of the chart (in this case one with the `kafka.persistence` value updated). I don't see where the problem is if they haven't updated it in a long time.

Contributor:

The problem is that we update the chart constantly, so if they ran a `helm repo update` after the last `helm upgrade`, or if it's someone other than the person who last upgraded, then the chart will have a different local repo state. So during the `helm upgrade` all the differences in the repo state will be applied as well. Does this make sense?
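One way to see whether the locally cached chart has drifted from what's deployed (an illustration only, not commands from the PR):

```
# Show the chart version the release is currently running
➜ helm list -n posthog

# Refresh the local repo cache and check the latest chart version available
➜ helm repo update
➜ helm search repo posthog/posthog
```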

Contributor:

Minimally we should call out that this could be a problem; ideally we'd do something better and propose solutions instead.

tiina303 (Contributor) commented Oct 8, 2021:

This is awesome!

Can we mention this in product-internal/infrastructure/runbooks/kafka/ so we don't forget to update this here later? (Consider removing it from there, linking here instead, and keeping only cloud-specific stuff there.)

guidoiaquinti (Contributor Author):

I'm going to merge and iterate further on this.

yakkomajuri (Contributor) left a comment:

Thanks for documenting.

We can always iterate, so don't feel blocked on PRs like these.

@guidoiaquinti merged commit 90dfb9f into master on Oct 11, 2021
@guidoiaquinti deleted the kafka branch on October 11, 2021 10:25
"children": [
{
"name": "Overview",
"url": "/docs/self-host/runbook/overview"
Contributor:

A bit late here, sorry, I was sick on Friday. This should have been /docs/self-host/runbook. We've dropped using overview.md[x].

```
@@ -0,0 +1,7 @@
---
```
Contributor:

Should have been `index.md`.
