[WIP] Add k8s Job definition for journal format #15092

Open · wants to merge 3 commits into base: master-2.x
Conversation

@ZhuTopher (Contributor) commented Mar 3, 2022

What changes are proposed in this pull request?

(Re)adding a k8s Job to handle master journal formatting.

  • We previously migrated from k8s Jobs to an initContainer in this PR

Why are the changes needed?

Currently, in order to perform journal formatting through the Helm chart you must set the value journal.format.runFormat=true. This defines an initContainer inside the Alluxio master StatefulSet Pods which runs alluxio format, meaning the journal is reformatted every time the Pods restart.

The only way to stop this behaviour is to helm upgrade the chart release with the value journal.format.runFormat=false.
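For reference, the existing runFormat behaviour amounts to something like the following initContainer in the master StatefulSet. This is a minimal sketch, not the chart's exact template; the container name, image tag, and volume name are illustrative assumptions.

```yaml
initContainers:
  - name: journal-format           # illustrative name
    image: alluxio/alluxio:2.8.0   # illustrative tag
    command: ["alluxio", "format"] # re-runs on every Pod (re)start
    volumeMounts:
      - name: alluxio-journal      # assumed journal volume name
        mountPath: /journal
```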

This change defines a k8s Job which performs the alluxio format outside of the lifecycle scope of the Pod by using Helm hooks. The end-user can modify the lifecycle of the journal format Job in a limited way through the new value journal.format.frequency=("once"|"always"); a sketch of the corresponding hook annotations follows the list below.

  • journal.format.frequency="once" will run alluxio format exactly once per Helm release (i.e. during helm install)
  • journal.format.frequency="always" will run alluxio format on any change to the Helm release (i.e. helm install, helm upgrade, helm rollback)
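As a rough illustration of how the frequency value maps onto Helm hooks, the format Job's annotations could look like the sketch below. This is an assumption-laden sketch rather than the exact template in this PR; the resource name and delete policy are illustrative.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ $journalFormatJobName }}   # assumed per-master Job name
  annotations:
    {{- if eq .Values.journal.format.frequency "always" }}
    # "always": run on install, upgrade, and rollback
    "helm.sh/hook": pre-install,pre-upgrade,pre-rollback
    {{- else }}
    # "once": run only on the initial install
    "helm.sh/hook": pre-install
    {{- end }}
    # delete the old hook resource only when a new one is created, so
    # completed Jobs remain visible after the release settles
    "helm.sh/hook-delete-policy": before-hook-creation
```

With frequency="once" the Job is created only by helm install; with "always" it is recreated on every upgrade or rollback as well.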

Does this PR introduce any user facing changes?


  1. A new Helm chart value journal.format.frequency

Design choice explanation

The goal of this design was to decouple the Pod lifecycle from the journal format lifecycle. Unfortunately, in doing so there was no way to guarantee that the Pods would wait for the format Job to actually run to completion before starting, short of using an initContainer.

  • I sometimes observed the Master Pod start and identify a pre-existing journal before the Job had finished formatting it, despite the Job being defined and deployed in k8s before the Master StatefulSet thanks to the "helm.sh/hook": pre-install annotation.

There appears to be some demand in the K8s community for this type of feature, which resulted in the following workaround: kubernetes/kubernetes#106802 (comment)

  • This workaround involves passing the namespace's default ServiceAccount credentials to the initContainer, with additional Role permissions allowing the container to query the k8s API for the Job list & status (see the sketch below)
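Concretely, the RBAC side of that workaround looks roughly like the sketch below. The Role/RoleBinding names are illustrative assumptions, and plain kubectl is described in place of the PR's wait_for.sh helper.

```yaml
# Allow the namespace's default ServiceAccount to read Job status
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: job-reader            # illustrative name
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-jobs             # illustrative name
subjects:
  - kind: ServiceAccount
    name: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: job-reader
```

The master Pod then carries an initContainer that blocks startup until the Job reports Complete, e.g. `kubectl wait --for=condition=complete --timeout=300s job/<format-job-name>` (the actual Job name is templated per master).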

Note that because of this initContainer implementation, any time journal.format.runFormat is true the corresponding k8s Job(s) must be Complete and still present in the cluster. Consequently, we must explicitly avoid deleting completed/terminated Jobs. The k8s Job docs state:

If the field [.spec.ttlSecondsAfterFinished] is unset, this Job won't be cleaned up by the TTL controller after it finishes.
...
Even though the control plane eventually garbage collects the Pods from a deleted Job after they either fail or complete...

This means that even if we specify that completed Jobs should not be cleaned up, their corresponding Pods may still be garbage collected regardless. I don't believe this affects the Job resource itself, but I was unable to find a concrete answer; the only relevant documentation I found concerned the aforementioned TTL mechanism.
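In other words, the format Job deliberately leaves the TTL field unset, roughly along these lines (the name and image are illustrative, not the chart's exact template):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: alluxio-master-0-format-journal   # illustrative name
spec:
  # ttlSecondsAfterFinished is intentionally NOT set: the completed Job
  # must stay in the cluster so the master's initContainer can confirm
  # that formatting finished
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: format
          image: alluxio/alluxio:2.8.0    # illustrative tag
          command: ["alluxio", "format"]
```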

  • I considered using ownerReferences to define a parent-child relationship between the Jobs and the Master StatefulSet, but this creates a circular dependency
  • You could use the post-install Helm hook instead of pre-install, since the Master Pods are protected by the initContainer; however, there is no way to define an ownerReferences block without the owner's resource UID (which is unknown until object creation), so you would have to PATCH the Job after the fact

Test scenarios

  • DONE: test journal formatting with local PVs and PVCs
  • DONE: test journal formatting (or lack thereof) with emptyDir journal
  • TODO: test journal formatting with HDFS journal
  • DONE: test journal formatting with multiple masters (embedded)
  • TODO: test journal formatting with multiple masters (not embedded)
  • TODO (bugfix?): test loading blocks from workers after master restart (no journal format)
    • Successfully identified and loaded the Alluxio fs from the prior journal; however, the blocks which should have been in cache were no longer accessible. Reading PERSISTED files resulted in reads from the UFS, and any files that were NOT_PERSISTED prior to the restart could not be read.
  • TODO: test reusing a pre-existing ServiceAccount with the "job-reader" role attached to it, as opposed to creating a new ServiceAccount

Comment on lines +24 to +26
```
{{- $count := int (ternary .Values.master.count 1 (eq .Values.journal.type "EMBEDDED")) }}
{{ range $i, $e := until $count }}
{{- $journalFormatJobName := printf "%s-%d-%s" $statefulsetName $i $journalFormatJobNameSuffix }}
```
@ZhuTopher (Contributor, Author) commented:

Unfortunately, in order to prevent Helm from overriding k8s Jobs in the same release but for different Master Pods (i.e. the embedded multi-master scenario), the current workaround is to define unique resource names for the Jobs.

In the future, it is possible to replace this with Indexed Jobs, which would allow the use of a single k8s Job resource instead (sketched below).
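As a hypothetical sketch of that follow-up, an Indexed Job (available since k8s 1.21) would cover all masters with one resource, with the per-master index injected automatically as JOB_COMPLETION_INDEX:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: alluxio-journal-format          # illustrative single-resource name
spec:
  completionMode: Indexed
  completions: {{ .Values.master.count }}
  parallelism: {{ .Values.master.count }}
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: format
          image: alluxio/alluxio:2.8.0  # illustrative tag
          # each completion index corresponds to one master ordinal
          command: ["sh", "-c", "echo \"formatting journal for master ${JOB_COMPLETION_INDEX}\"; alluxio format"]
```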

```yaml
completions: 1
parallelism: 1
backoffLimit: 2
activeDeadlineSeconds: 30
```
@ZhuTopher (Contributor, Author) commented:

activeDeadlineSeconds is set to 30 to match the previously removed implementation from this PR.

Comment on lines +109 to +115
```yaml
# POD_NAME is the StatefulSet name with the indexed suffix
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
# manually override the entrypoint command in order to evaluate POD_NAME
command: ['sh', '-c', 'wait_for.sh "job" "${POD_NAME}-{{ $journalFormatJobNameSuffix }}"']
```
@ZhuTopher (Contributor, Author) commented:

Since there is currently no easy way to access a Pod's index number within a StatefulSet, we use the Pod name instead.
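If the ordinal itself were ever needed, it could be peeled off the Pod name with shell parameter expansion; a hypothetical variant of the wait container (the container name is illustrative):

```yaml
- name: wait-for-journal-format
  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
  command:
    - sh
    - -c
    - |
      ORDINAL="${POD_NAME##*-}"   # e.g. "alluxio-master-2" -> "2"
      echo "waiting for format job of master ${ORDINAL}"
      wait_for.sh "job" "${POD_NAME}-{{ $journalFormatJobNameSuffix }}"
```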

@jiacheliu3 (Contributor) commented:

@ZhuTopher I can see the state of this PR is still WIP and there are some unaddressed scenarios. Do you need my review at this stage (to discuss)? Or should I wait until this PR is past the WIP stage?

Just a side note: from our past few interactions with Helm hooks, I get the vague feeling that they don't work too well in some K8s cases simply because of limitations K8s is not able to resolve at the moment. And even if K8s eventually resolves those limitations, the hooks still won't work with some older K8s versions. If that's the case, feel free to park this attempt until the time and tools are ripe, in the (near) future.

Sometimes it's very hard to find the truth about undocumented or unwanted behaviors. My opinion at this stage is that we don't have much familiarity with the K8s code at the moment, so let's not spend too much time digging. Asking on Stack Overflow is sometimes worth a try; within a few weeks some questions get answered unexpectedly.

@ZhuTopher (Contributor, Author) commented:

@jiacheliu3 Thanks; yes, this PR is mostly here to park the work I started on this topic. There's no urgency in completing this feature. The implementation is "done", so from that standpoint feel free to review at your leisure.

As you mention, there are a lot of factors to consider (supported k8s/Helm versions, sufficient understanding of hooks, use-cases), so I wanted this PR to serve as a place for discussion as well.

@github-actions bot commented:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
