[WIP] Add k8s Job definition for journal format #15092

Open · wants to merge 3 commits into base: master-2.x
Conversation

@ZhuTopher (Contributor) commented Mar 3, 2022

What changes are proposed in this pull request?

(Re)adding a k8s Job to handle master journal formatting.

  • We previously migrated from k8s Jobs to an initContainer in this PR

Why are the changes needed?

Currently, in order to perform journal formatting through the Helm chart you must set the value journal.format.runFormat=true. This defines an initContainer inside the Alluxio master StatefulSet Pods which runs alluxio format, meaning the journal is reformatted every time the Pods restart.

The only way to stop this behaviour is to helm upgrade the chart release with the value journal.format.runFormat=false.
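For reference, the existing runFormat behaviour amounts to something like the following initContainer in the master StatefulSet. This is a minimal sketch, not the chart's exact template; the container name, image tag, and volume name are illustrative assumptions.

```yaml
initContainers:
  - name: journal-format           # illustrative name
    image: alluxio/alluxio:2.8.0   # illustrative tag
    command: ["alluxio", "format"] # re-runs on every Pod (re)start
    volumeMounts:
      - name: alluxio-journal      # assumed journal volume name
        mountPath: /journal
```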

This change defines a k8s Job which performs the alluxio format outside of the lifecycle scope of the Pod by using Helm hooks. The end-user can modify the lifecycle of the journal format Job in a limited way through the new value journal.format.frequency=("once"|"always"); a sketch of the corresponding hook annotations follows the list below.

  • journal.format.frequency="once" will run alluxio format exactly once per Helm release (i.e. during helm install)
  • journal.format.frequency="always" will run alluxio format on any change to the Helm release (i.e. helm install, helm upgrade, helm rollback)
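As a rough illustration of how the frequency value maps onto Helm hooks, the format Job's annotations could look like the sketch below. This is an assumption-laden sketch rather than the exact template in this PR; the resource name and delete policy are illustrative.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ $journalFormatJobName }}   # assumed per-master Job name
  annotations:
    {{- if eq .Values.journal.format.frequency "always" }}
    # "always": run on install, upgrade, and rollback
    "helm.sh/hook": pre-install,pre-upgrade,pre-rollback
    {{- else }}
    # "once": run only on the initial install
    "helm.sh/hook": pre-install
    {{- end }}
    # delete the old hook resource only when a new one is created, so
    # completed Jobs remain visible after the release settles
    "helm.sh/hook-delete-policy": before-hook-creation
```

With frequency="once" the Job is created only by helm install; with "always" it is recreated on every upgrade or rollback as well.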

Does this PR introduce any user facing changes?


  1. A new Helm chart value journal.format.frequency

Design choice explanation

The goal of this design was to decouple the Pod lifecycle from the journal format lifecycle. Unfortunately, in doing so there was no way to guarantee that the Pods would wait for the format Job to actually run to completion before starting, short of using an initContainer.

  • I sometimes observed the Master Pod start and identify a pre-existing journal before the Job had finished formatting it, despite the Job being defined and deployed in k8s before the Master StatefulSet thanks to the "helm.sh/hook": pre-install annotation.

There appears to be some demand in the K8s community for this type of feature, which resulted in the following workaround: kubernetes/kubernetes#106802 (comment)

  • This workaround involves passing the namespace's default ServiceAccount credentials to the initContainer, with additional Role permissions allowing the container to query the k8s API for the Job list & status (see the sketch below)
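Concretely, the RBAC side of that workaround looks roughly like the sketch below. The Role/RoleBinding names are illustrative assumptions, and plain kubectl is described in place of the PR's wait_for.sh helper.

```yaml
# Allow the namespace's default ServiceAccount to read Job status
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: job-reader            # illustrative name
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-jobs             # illustrative name
subjects:
  - kind: ServiceAccount
    name: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: job-reader
```

The master Pod then carries an initContainer that blocks startup until the Job reports Complete, e.g. `kubectl wait --for=condition=complete --timeout=300s job/<format-job-name>` (the actual Job name is templated per master).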

Note that because of this initContainer implementation, any time journal.format.runFormat is true the corresponding k8s Job(s) must be Complete and still present in the cluster. Consequently, we must explicitly avoid deleting completed/terminated Jobs. The k8s Job docs state:

If the field [.spec.ttlSecondsAfterFinished] is unset, this Job won't be cleaned up by the TTL controller after it finishes.
...
Even though the control plane eventually garbage collects the Pods from a deleted Job after they either fail or complete...

This means that even if we specify that completed Jobs should not be cleaned up, their corresponding Pods may still be garbage collected regardless. I don't believe this affects the Job resource itself, but I was unable to find a concrete answer; the only relevant documentation I found concerned the aforementioned TTL mechanism.
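In other words, the format Job deliberately leaves the TTL field unset, roughly along these lines (the name and image are illustrative, not the chart's exact template):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: alluxio-master-0-format-journal   # illustrative name
spec:
  # ttlSecondsAfterFinished is intentionally NOT set: the completed Job
  # must stay in the cluster so the master's initContainer can confirm
  # that formatting finished
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: format
          image: alluxio/alluxio:2.8.0    # illustrative tag
          command: ["alluxio", "format"]
```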

  • I considered using ownerReferences to define a parent-child relationship between the Jobs and the Master StatefulSet, but this creates a circular dependency
  • You could use the post-install Helm hook instead of pre-install, since the Master Pods are protected by the initContainer; however, there is no way to define an ownerReferences block without the owner's resource UID (which is unknown until object creation), so you would have to PATCH the Job after the fact

Test scenarios

  • DONE: test journal formatting with local PVs and PVCs
  • DONE: test journal formatting (or lack thereof) with emptyDir journal
  • TODO: test journal formatting with HDFS journal
  • DONE: test journal formatting with multiple masters (embedded)
  • TODO: test journal formatting with multiple masters (not embedded)
  • TODO (bugfix?): test loading blocks from workers after master restart (no journal format)
    • Successfully identified and loaded the Alluxio fs from the prior journal; however, the blocks which should have been in cache were no longer accessible. Reading PERSISTED files resulted in reads from the UFS, and any files that were NOT_PERSISTED prior to the restart could not be read.
  • TODO: test reusing a pre-existing ServiceAccount with the "job-reader" role attached to it, as opposed to creating a new ServiceAccount

Comment on lines +24 to +26
```
{{- $count := int (ternary .Values.master.count 1 (eq .Values.journal.type "EMBEDDED")) }}
{{ range $i, $e := until $count }}
{{- $journalFormatJobName := printf "%s-%d-%s" $statefulsetName $i $journalFormatJobNameSuffix }}
```
@ZhuTopher (Contributor, Author) commented:

Unfortunately, in order to prevent Helm from overriding k8s Jobs in the same release but for different Master Pods (i.e. the embedded multi-master scenario), the current workaround is to define unique resource names for the Jobs.

In the future, it is possible to replace this with Indexed Jobs, which would allow the use of a single k8s Job resource instead (sketched below).
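As a hypothetical sketch of that follow-up, an Indexed Job (available since k8s 1.21) would cover all masters with one resource, with the per-master index injected automatically as JOB_COMPLETION_INDEX:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: alluxio-journal-format          # illustrative single-resource name
spec:
  completionMode: Indexed
  completions: {{ .Values.master.count }}
  parallelism: {{ .Values.master.count }}
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: format
          image: alluxio/alluxio:2.8.0  # illustrative tag
          # each completion index corresponds to one master ordinal
          command: ["sh", "-c", "echo \"formatting journal for master ${JOB_COMPLETION_INDEX}\"; alluxio format"]
```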

```yaml
completions: 1
parallelism: 1
backoffLimit: 2
activeDeadlineSeconds: 30
```
@ZhuTopher (Contributor, Author) commented:

activeDeadlineSeconds is set to 30 to match the previously removed implementation from this PR.

Comment on lines +109 to +115
```yaml
# POD_NAME is the StatefulSet name with the indexed suffix
- name: POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
# manually override the entrypoint command in order to evaluate POD_NAME
command: ['sh', '-c', 'wait_for.sh "job" "${POD_NAME}-{{ $journalFormatJobNameSuffix }}"']
```
@ZhuTopher (Contributor, Author) commented:

Since there is currently no easy way to access a Pod's index number within a StatefulSet, we use the Pod name instead.
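If the ordinal itself were ever needed, it could be peeled off the Pod name with shell parameter expansion; a hypothetical variant of the wait container (the container name is illustrative):

```yaml
- name: wait-for-journal-format
  env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
  command:
    - sh
    - -c
    - |
      ORDINAL="${POD_NAME##*-}"   # e.g. "alluxio-master-2" -> "2"
      echo "waiting for format job of master ${ORDINAL}"
      wait_for.sh "job" "${POD_NAME}-{{ $journalFormatJobNameSuffix }}"
```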

@jiacheliu3 (Contributor) commented:

@ZhuTopher I can see the state of this PR is still WIP and there are some unaddressed scenarios. Do you need my review at this stage (to discuss)? Or should I wait until this PR is past the WIP stage?

Just a side note: from our past few interactions with Helm hooks, I get the vague feeling that they don't work too well in some K8s cases simply because of limitations K8s is not able to resolve at the moment. And even if K8s eventually resolves those limitations, the hooks still won't work with some older K8s versions. If that's the case, feel free to park this attempt until the time and tools are ripe, in the (near) future.

Sometimes it's very hard to find the truth about undocumented or unwanted behaviors. My opinion at this stage is that we don't have much familiarity with the K8s code at the moment, so let's not spend too much time digging. Asking on Stack Overflow is sometimes worth a try; within a few weeks some questions get answered unexpectedly.

@ZhuTopher (Contributor, Author) commented:

@jiacheliu3 Thanks; yes, this PR is mostly here to park the work I started on this topic. There's no urgency in completing this feature. The implementation is "done", so from that standpoint feel free to review at your leisure.

As you mention, there are a lot of factors to consider (supported k8s/Helm versions, sufficient understanding of hooks, use-cases), so I wanted this PR to serve as a place for discussion as well.

@github-actions bot commented:

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
